# Binary and Compressed Files

**Module 06 | Notebook 03**

---

## Objective
By the end of this notebook, you will master:
- Working with raw binary files
- np.fromfile and np.tofile
- Interfacing with other formats (HDF5, etc.)
- Buffer protocol and bytes
- Data exchange with other languages

In [None]:
import numpy as np
import os
np.set_printoptions(precision=2)

---
## 1. np.tofile() and np.fromfile() - Raw Binary

In [None]:
# Write array to raw binary file
arr = np.arange(12, dtype=np.float64).reshape(3, 4)
print(f"Array to save:\n{arr}")
print(f"Dtype: {arr.dtype}, Shape: {arr.shape}")

arr.tofile('raw_array.bin')
print(f"\nFile size: {os.path.getsize('raw_array.bin')} bytes")
print(f"Expected: {arr.nbytes} bytes")

In [None]:
# Read raw binary - MUST know dtype!
loaded = np.fromfile('raw_array.bin', dtype=np.float64)
print(f"Loaded (flat): {loaded}")

# Shape is lost! Must reshape manually
loaded = loaded.reshape(3, 4)
print(f"Reshaped:\n{loaded}")

In [None]:
# Wrong dtype gives garbage
wrong = np.fromfile('raw_array.bin', dtype=np.int32)
print(f"Wrong dtype (int32):\n{wrong[:8]}")
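
Because a raw file carries no metadata, one cheap safeguard is to compare the file size with what the expected dtype and shape imply before reading. The sketch below assumes a hypothetical helper named `read_raw_checked`; it is not part of NumPy, and it cannot catch a wrong dtype of the same width (for example `int64` instead of `float64`).

```python
import os
import tempfile

import numpy as np


def read_raw_checked(path, dtype, shape):
    """Hypothetical helper: verify file size against dtype/shape, then read."""
    dtype = np.dtype(dtype)
    expected = int(np.prod(shape)) * dtype.itemsize
    actual = os.path.getsize(path)
    if actual != expected:
        raise ValueError(f"{path}: expected {expected} bytes, found {actual}")
    return np.fromfile(path, dtype=dtype).reshape(shape)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checked.bin")
    original = np.arange(12, dtype=np.float64).reshape(3, 4)
    original.tofile(path)

    # Correct dtype and shape: 12 * 8 = 96 bytes, matches the file
    checked = read_raw_checked(path, np.float64, (3, 4))

    # Wrong dtype: 12 * 4 = 48 bytes expected, but 96 are on disk
    try:
        read_raw_checked(path, np.int32, (3, 4))
        size_mismatch_caught = False
    except ValueError:
        size_mismatch_caught = True

print(f"Round trip OK: {np.array_equal(checked, original)}")
print(f"Size mismatch caught: {size_mismatch_caught}")
```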

In [None]:
# Read subset with count and offset
# Read 4 elements starting from element 2
subset = np.fromfile('raw_array.bin', dtype=np.float64, 
                      count=4, offset=16)  # 16 = 2 * 8 bytes
print(f"Subset: {subset}")

In [None]:
# tofile with separator (text mode)
arr = np.arange(5)
arr.tofile('text_binary.txt', sep=',')

with open('text_binary.txt', 'r') as f:
    print(f"Text format: {f.read()}")
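
A file written with `sep` can be read back by passing the same `sep` to `np.fromfile`. This is a minimal sketch that writes to a temporary directory; the file name `values.txt` is only illustrative.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "values.txt")
    values = np.arange(5, dtype=np.int64)

    # A non-empty sep switches tofile/fromfile to text mode
    values.tofile(path, sep=",")
    text_loaded = np.fromfile(path, dtype=np.int64, sep=",")

print(f"Loaded from text: {text_loaded}")
```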

---
## 2. np.frombuffer() - From Bytes

In [None]:
# Create array from bytes object
arr = np.arange(5, dtype=np.int32)
raw_bytes = arr.tobytes()

print(f"Original array: {arr}")
print(f"As bytes: {raw_bytes}")
print(f"Bytes length: {len(raw_bytes)}")

In [None]:
# Reconstruct from bytes
recovered = np.frombuffer(raw_bytes, dtype=np.int32)
print(f"Recovered: {recovered}")

In [None]:
# frombuffer returns a view of the buffer, not a copy
# (read-only for immutable bytes, writable for a bytearray)
ba = bytearray(raw_bytes)
view = np.frombuffer(ba, dtype=np.int32)

# Modify bytearray
ba[0:4] = b'\xff\xff\xff\xff'
print(f"Modified via bytearray: {view}")

In [None]:
# Useful for network data, shared memory, etc.
# Example: parse binary protocol header
header_bytes = b'\x01\x00\x00\x00\x0a\x00\x00\x00'  # version=1, length=10 (little-endian)
header = np.frombuffer(header_bytes, dtype='<u4')  # explicit byte order, independent of the host
print(f"Version: {header[0]}, Length: {header[1]}")
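
Real protocol headers often mix field widths and use big-endian (network) byte order. A structured dtype with explicit byte-order prefixes handles both. The header layout below (`version`, `flags`, `length`) is a made-up example, not a real protocol.

```python
import numpy as np

# Hypothetical 8-byte header, all fields big-endian:
#   uint16 version, uint16 flags, uint32 payload length
header_dt = np.dtype([("version", ">u2"), ("flags", ">u2"), ("length", ">u4")])

packet = b"\x00\x01\x00\x00\x00\x00\x00\x0a" + b"payload..."

# count=1 reads a single header record and ignores the trailing payload bytes
parsed = np.frombuffer(packet, dtype=header_dt, count=1)[0]
version = int(parsed["version"])
length = int(parsed["length"])

# Slice the payload using the length announced in the header
payload = packet[header_dt.itemsize:header_dt.itemsize + length]

print(f"Version: {version}, Length: {length}, Payload: {payload}")
```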

---
## 3. Byte Order (Endianness)

In [None]:
# Check system byte order
import sys
print(f"System byte order: {sys.byteorder}")

# Little-endian: least significant byte first (x86, ARM)
# Big-endian: most significant byte first (network protocols)

In [None]:
# Specify byte order in dtype
arr_little = np.array([1, 256, 65536], dtype='<i4')  # Little-endian
arr_big = np.array([1, 256, 65536], dtype='>i4')     # Big-endian

print(f"Little-endian bytes: {arr_little.tobytes().hex()}")
print(f"Big-endian bytes: {arr_big.tobytes().hex()}")

In [None]:
# Swap byte order
arr = np.array([1, 256], dtype=np.int32)
print(f"Original: {arr}")

swapped = arr.byteswap()  # bytes are reversed but the dtype is unchanged, so the values change
print(f"Byte-swapped: {swapped}")

In [None]:
# newbyteorder: change interpretation without swapping
arr = np.array([1], dtype='<i4')
print(f"Original dtype: {arr.dtype}")

reinterpreted = arr.view(arr.dtype.newbyteorder('>'))
print(f"Reinterpreted dtype: {reinterpreted.dtype}")
print(f"Reinterpreted value: {reinterpreted}")
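
The two operations above each change the values when used alone. To convert data to a different byte order while keeping the values, either let `astype` do the conversion or combine `byteswap()` with a reinterpreting view. A minimal sketch, assuming big-endian input data:

```python
import numpy as np

big = np.array([1, 256, 65536], dtype=">i4")

# Option 1: astype converts to native byte order and preserves the values
converted = big.astype(big.dtype.newbyteorder("="))

# Option 2: swap the bytes AND flip the dtype's byte order.
# The two changes cancel out, so the values are preserved.
swapped = big.byteswap().view(big.dtype.newbyteorder())

print(f"astype:          {converted} (native: {converted.dtype.isnative})")
print(f"byteswap + view: {swapped}")
```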

---
## 4. Working with Other Formats

In [None]:
# HDF5 (requires h5py)
# pip install h5py

try:
    import h5py
    
    # Create HDF5 file
    with h5py.File('data.h5', 'w') as f:
        f.create_dataset('dataset1', data=np.random.rand(100, 100))
        f.create_dataset('dataset2', data=np.arange(1000))
        f['dataset1'].attrs['description'] = 'Random matrix'
    
    # Read HDF5 file
    with h5py.File('data.h5', 'r') as f:
        print(f"Datasets: {list(f.keys())}")
        data = f['dataset1'][:]
        print(f"dataset1 shape: {data.shape}")
        
    os.remove('data.h5')
    print("HDF5 example complete")
except ImportError:
    print("h5py not installed. Skip HDF5 example.")

In [None]:
# MATLAB .mat files (requires scipy)
try:
    from scipy import io as sio
    
    # Save as .mat
    data = {'arr1': np.random.rand(10, 10), 'arr2': np.arange(100)}
    sio.savemat('data.mat', data)
    
    # Load .mat
    loaded = sio.loadmat('data.mat')
    print(f"Keys: {[k for k in loaded.keys() if not k.startswith('__')]}")
    print(f"arr1 shape: {loaded['arr1'].shape}")
    
    os.remove('data.mat')
    print("MATLAB example complete")
except ImportError:
    print("scipy not installed. Skip MATLAB example.")

In [None]:
# Image files (PIL/Pillow)
try:
    from PIL import Image
    
    # Create image from array
    img_array = np.random.randint(0, 256, (100, 100, 3), dtype=np.uint8)  # high is exclusive
    img = Image.fromarray(img_array)
    img.save('test_image.png')
    
    # Load image to array
    loaded_img = Image.open('test_image.png')
    loaded_array = np.array(loaded_img)
    
    print(f"Image shape: {loaded_array.shape}")
    print(f"Dtype: {loaded_array.dtype}")
    
    os.remove('test_image.png')
    print("Image example complete")
except ImportError:
    print("Pillow not installed. Skip image example.")

---
## 5. Structured Binary Data

In [None]:
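# Read structured binary (e.g., from C program)
# struct { int32 id; float32 value; char name[10]; }
# Note: this dtype is packed (no padding). Most C compilers pad this struct
# to 20 bytes, so the C side must be packed or the dtype must add padding.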

dt = np.dtype([('id', 'i4'), ('value', 'f4'), ('name', 'S10')])
print(f"Record size: {dt.itemsize} bytes")
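
When the C struct is not declared packed, passing `align=True` to `np.dtype` asks NumPy to insert padding the way a C compiler typically would. The sketch below compares the two layouts for the same fields; whether the aligned layout matches a particular compiler and platform is an assumption that should be checked against `sizeof` on the C side.

```python
import numpy as np

fields = [("id", "i4"), ("value", "f4"), ("name", "S10")]

packed = np.dtype(fields)               # no padding: 4 + 4 + 10 bytes
aligned = np.dtype(fields, align=True)  # padded to a multiple of the largest alignment

print(f"Packed size:  {packed.itemsize} bytes")
print(f"Aligned size: {aligned.itemsize} bytes")
print(f"Offset of 'name': {aligned.fields['name'][1]}")
```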

In [None]:
# Create and save
records = np.array([
    (1, 3.14, b'Alice'),
    (2, 2.71, b'Bob'),
    (3, 1.41, b'Charlie')
], dtype=dt)

records.tofile('records.bin')
print(f"Saved {len(records)} records")

In [None]:
# Read back
loaded = np.fromfile('records.bin', dtype=dt)
print(f"Loaded records:\n{loaded}")
print(f"\nIDs: {loaded['id']}")
print(f"Values: {loaded['value']}")

In [None]:
# Pattern: Save metadata header + data
def save_with_header(filename, arr):
    """Save array with shape/dtype header."""
    with open(filename, 'wb') as f:
        # Write header: ndim, shape, dtype string (dtype.str keeps the byte order, e.g. '<f8')
        header = f"{arr.ndim},{','.join(map(str, arr.shape))},{arr.dtype.str}\n"
        f.write(header.encode())
        # Write data
        f.write(arr.tobytes())

def load_with_header(filename):
    """Load array with shape/dtype header."""
    with open(filename, 'rb') as f:
        # Read header
        header = f.readline().decode().strip()
        parts = header.split(',')
        ndim = int(parts[0])
        shape = tuple(map(int, parts[1:1+ndim]))
        dtype = parts[-1]
        # Read data
        data = np.frombuffer(f.read(), dtype=dtype)
        return data.reshape(shape)

# Test
arr = np.random.rand(3, 4)
save_with_header('custom.bin', arr)
loaded = load_with_header('custom.bin')
print(f"Original shape: {arr.shape}")
print(f"Loaded shape: {loaded.shape}")
print(f"Match: {np.allclose(arr, loaded)}")

---
## 6. Data Exchange with C/Fortran

In [None]:
# Ensure contiguous memory for C interop
arr = np.random.rand(3, 4)
transposed = arr.T  # Not C-contiguous!

print(f"Original C-contiguous: {arr.flags['C_CONTIGUOUS']}")
print(f"Transposed C-contiguous: {transposed.flags['C_CONTIGUOUS']}")

In [None]:
# Make contiguous versions (a copy is made only when the layout has to change)
c_contiguous = np.ascontiguousarray(transposed)
f_contiguous = np.asfortranarray(transposed)

print(f"C-contiguous: {c_contiguous.flags['C_CONTIGUOUS']}")
print(f"F-contiguous: {f_contiguous.flags['F_CONTIGUOUS']}")
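
Whether these functions copy depends on the input layout. The transpose of a C-contiguous array is already Fortran-contiguous, so `np.asfortranarray` can return it without copying, while `np.ascontiguousarray` has to rearrange the data. A small sketch using `np.shares_memory` and strides to make this visible:

```python
import numpy as np

base = np.zeros((3, 4), dtype=np.float64)
t = base.T  # shape (4, 3), a view with swapped strides

c_version = np.ascontiguousarray(t)
f_version = np.asfortranarray(t)

# True when a new buffer had to be allocated
c_copied = not np.shares_memory(c_version, base)
f_copied = not np.shares_memory(f_version, base)

print(f"Transposed strides: {t.strides}")
print(f"C version strides:  {c_version.strides}, copied: {c_copied}")
print(f"F version strides:  {f_version.strides}, copied: {f_copied}")
```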

In [None]:
# Get raw pointer for C extension
arr = np.arange(10, dtype=np.float64)

# ctypes interface
ptr = arr.ctypes.data
print(f"Data pointer: {ptr}")
print(f"Shape (ctypes array): {list(arr.ctypes.shape)}")
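
The integer address is rarely used directly. `arr.ctypes.data_as` returns a typed ctypes pointer that can be handed to a function loaded with `ctypes`. The sketch below only reads and writes through the pointer from Python, since no actual C library is assumed here.

```python
import ctypes

import numpy as np

values = np.arange(10, dtype=np.float64)

# Typed pointer to the first element (equivalent to double* in C)
ptr = values.ctypes.data_as(ctypes.POINTER(ctypes.c_double))

first, last = ptr[0], ptr[9]
print(f"First: {first}, Last: {last}")

# The pointer aliases the array's memory, so writes are visible in the array
ptr[0] = 42.0
print(f"After write: {values[:3]}")
```

The pointer is only valid while `values` is alive and contiguous, so keep a reference to the array for as long as the C side uses it.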

---
## Key Points Summary

**Raw Binary:**
- `arr.tofile()`: Write raw bytes
- `np.fromfile()`: Read raw bytes (need dtype!)
- Shape and dtype are NOT saved

**Bytes/Buffer:**
- `arr.tobytes()`: Array to bytes
- `np.frombuffer()`: Bytes to array
- Useful for network/shared memory

**Other Formats:**
- HDF5: h5py library
- MATLAB: scipy.io
- Images: PIL/Pillow

**C Interop:**
- Ensure contiguous memory
- Use ctypes for raw pointers

---
## Interview Tips

**Q1: Difference between tofile and save?**
> - `tofile()`: Raw binary, no metadata (dtype/shape lost)
> - `save()`: NumPy `.npy` format with full metadata
>
> Use `save` when both sides use NumPy; use `tofile` for C interop.
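
The contrast in Q1 can be checked with in-memory buffers. This is a minimal sketch: `np.save` writes to an `io.BytesIO` object, and `tobytes()` stands in for `tofile()` because it produces the same raw bytes.

```python
import io

import numpy as np

arr = np.arange(6, dtype=np.int16).reshape(2, 3)

# .npy format: dtype and shape travel with the data
buf = io.BytesIO()
np.save(buf, arr)
buf.seek(0)
restored = np.load(buf)

# Raw bytes: the reader must supply the dtype, and the shape is gone
raw = np.frombuffer(arr.tobytes(), dtype=np.int16)

print(f"np.save round trip: shape={restored.shape}, dtype={restored.dtype}")
print(f"raw round trip:     shape={raw.shape}, dtype={raw.dtype}")
```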

**Q2: What is byte order and why does it matter?**
> Byte order (endianness) is how multi-byte values are stored. Little-endian stores LSB first; big-endian stores MSB first. Matters when exchanging binary data between systems.

**Q3: How do you read binary data from a C struct?**
> Define matching structured dtype, then use `np.fromfile()` or `np.frombuffer()`. Ensure byte order and alignment match.

**Q4: How to ensure array is contiguous for C functions?**
> Use `np.ascontiguousarray()` for C or `np.asfortranarray()` for Fortran. Check with `arr.flags['C_CONTIGUOUS']`.

---
## Cleanup

In [None]:
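# Clean up only the files this notebook created
# (a '*.txt' glob would also delete unrelated files in the working directory)
for f in ['raw_array.bin', 'text_binary.txt', 'records.bin', 'custom.bin']:
    if os.path.exists(f):
        os.remove(f)
        print(f"Removed: {f}")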

---
## Module 06 Complete!

You have mastered File I/O:
- Saving and Loading Arrays
- Working with Text Files
- Binary and Compressed Files

**Next Module:** 07_performance_optimization - Memory layout, vectorization best practices, and profiling!