# Memory Layout and Views

**Module 07 | Notebook 01**

---

## Objective
By the end of this notebook, you will understand:
- Memory layout (C vs F order)
- Strides and contiguous arrays
- Views vs copies for performance
- Cache-friendly access patterns
- Memory alignment and efficiency

In [None]:
import numpy as np
import time
np.set_printoptions(precision=2)

---
## 1. C-Order vs Fortran-Order

In [None]:
# C-order (row-major): last index changes fastest
# Fortran-order (column-major): first index changes fastest

arr_c = np.arange(12).reshape(3, 4, order='C')
arr_f = np.arange(12).reshape(3, 4, order='F')

print("C-order (row-major):")
print(arr_c)
print(f"Memory layout: {arr_c.flatten('K')}")

print("\nFortran-order (column-major):")
print(arr_f)
print(f"Memory layout: {arr_f.flatten('K')}")

In [None]:
# Check order with flags
print(f"C-order: C={arr_c.flags['C_CONTIGUOUS']}, F={arr_c.flags['F_CONTIGUOUS']}")
print(f"F-order: C={arr_f.flags['C_CONTIGUOUS']}, F={arr_f.flags['F_CONTIGUOUS']}")

In [None]:
# Strides reveal memory layout
print(f"C-order strides: {arr_c.strides}")
print(f"F-order strides: {arr_f.strides}")

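# Strides are in bytes, so they depend on the item size:
#   int32 (4 bytes): C -> (16, 4),  F -> (4, 12)
#   int64 (8 bytes, the default on most platforms): C -> (32, 8),  F -> (8, 24)
# C-order: the next row is 4 elements away, the next column is 1 element away
# F-order: the next row is 1 element away, the next column is 3 elements away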
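
The stride values in the comments above follow directly from the shape and the item size. A minimal sketch, assuming a hypothetical helper `expected_strides` (not part of NumPy) and an explicit `int32` dtype so the numbers are the same on every platform:

```python
import numpy as np

def expected_strides(shape, itemsize, order="C"):
    """Compute byte strides for a contiguous array (illustrative helper)."""
    strides = []
    step = itemsize
    # C-order: the last axis is densest; F-order: the first axis is densest
    dims = reversed(shape) if order == "C" else shape
    for dim in dims:
        strides.append(step)
        step *= dim
    return tuple(reversed(strides)) if order == "C" else tuple(strides)

demo_c = np.arange(12, dtype=np.int32).reshape(3, 4, order="C")
demo_f = np.arange(12, dtype=np.int32).reshape(3, 4, order="F")

print(expected_strides(demo_c.shape, demo_c.itemsize, "C"), demo_c.strides)
print(expected_strides(demo_f.shape, demo_f.itemsize, "F"), demo_f.strides)
```

Each printed pair should match: the computed strides on the left, NumPy's own on the right.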

In [None]:
# Performance: iterate in memory order!
n = 2000
arr_c = np.random.rand(n, n)
arr_f = np.asfortranarray(arr_c)

# Row-wise sum (good for C-order)
start = time.perf_counter()
for _ in range(10):
    _ = arr_c.sum(axis=1)
c_row_time = time.perf_counter() - start

# Column-wise sum (good for F-order)  
start = time.perf_counter()
for _ in range(10):
    _ = arr_f.sum(axis=0)
f_col_time = time.perf_counter() - start

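# 10 calls were timed, so total seconds * 1000 / 10 gives milliseconds per call
print(f"C-order row sum: {c_row_time * 1000 / 10:.1f} ms per call")
print(f"F-order col sum: {f_col_time * 1000 / 10:.1f} ms per call")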
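
Single `perf_counter` measurements like the one above can be noisy. One possible alternative, sketched here with the standard-library `timeit` module and a hypothetical helper `best_time_ms` (the helper name, the repeat counts, and the smaller array size are assumptions for illustration):

```python
import timeit
import numpy as np

def best_time_ms(func, repeat=5, number=10):
    """Best per-call time in milliseconds over several repeats."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return min(totals) / number * 1000

bench_c = np.random.rand(500, 500)      # C-order
bench_f = np.asfortranarray(bench_c)    # same values, F-order

row_ms = best_time_ms(lambda: bench_c.sum(axis=1))
col_ms = best_time_ms(lambda: bench_f.sum(axis=0))

print(f"C-order row sum: {row_ms:.3f} ms per call")
print(f"F-order col sum: {col_ms:.3f} ms per call")
```

Taking the minimum over repeats reduces the influence of background activity on the machine.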

---
## 2. Understanding Strides

In [None]:
arr = np.arange(24).reshape(2, 3, 4)
print(f"Shape: {arr.shape}")
print(f"Strides: {arr.strides}")
print(f"Item size: {arr.itemsize} bytes")

# Strides: (48, 16, 4) for int32, or (96, 32, 8) for int64 (the default on most platforms)
# - Move to next [0] -> 12 elements * itemsize
# - Move to next [1] -> 4 elements * itemsize
# - Move to next [2] -> 1 element * itemsize
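
Strides also determine where any single element lives: its byte offset is the sum of each index multiplied by the stride of its axis. A small sketch, using an explicit `int32` dtype and illustrative names (`cube`, `index`):

```python
import numpy as np

cube = np.arange(24, dtype=np.int32).reshape(2, 3, 4)
index = (1, 2, 3)

# Byte offset = sum(index_k * stride_k) over all axes
byte_offset = sum(i * s for i, s in zip(index, cube.strides))
flat_position = byte_offset // cube.itemsize

print(f"Byte offset of {index}: {byte_offset}")
print(f"Via offset: {cube.ravel()[flat_position]}, via indexing: {cube[index]}")
```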

In [None]:
# Slicing changes strides but not data layout
arr = np.arange(20).reshape(4, 5)
print(f"Original strides: {arr.strides}")

# Every other row
sliced = arr[::2]
print(f"Every 2nd row strides: {sliced.strides}")

# Every other column
sliced2 = arr[:, ::2]
print(f"Every 2nd col strides: {sliced2.strides}")
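
A step slice reuses the same buffer and only rescales the stride and moves the start pointer. The sketch below checks both effects and shows that writes through the slice reach the original array (the names `grid` and `odd_rows` are illustrative):

```python
import numpy as np

grid = np.arange(20, dtype=np.int64).reshape(4, 5)
odd_rows = grid[1::2]   # rows 1 and 3

# Step of 2 doubles the row stride; the column stride is unchanged
print(grid.strides, odd_rows.strides)

# Starting at row 1 moves the data pointer forward by one row
pointer_shift = odd_rows.ctypes.data - grid.ctypes.data
print(f"Pointer shift: {pointer_shift} bytes")

# Writing through the view modifies the original
odd_rows[0, 0] = -1
print(f"grid[1, 0] is now {grid[1, 0]}")
```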

In [None]:
# Transpose: just swaps strides!
arr = np.arange(12).reshape(3, 4)
print(f"Original: shape={arr.shape}, strides={arr.strides}")

trans = arr.T
print(f"Transposed: shape={trans.shape}, strides={trans.strides}")
print(f"Same memory: {np.shares_memory(arr, trans)}")

In [None]:
# The transpose is no longer C-contiguous (it is F-contiguous instead)
print(f"Original C-contiguous: {arr.flags['C_CONTIGUOUS']}")
print(f"Transposed C-contiguous: {trans.flags['C_CONTIGUOUS']}")
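
Losing C-contiguity has a practical consequence: operations that need elements in C order can no longer return a view. A short sketch with `reshape(-1)`, using illustrative names `base_arr` and `flipped`:

```python
import numpy as np

base_arr = np.arange(12).reshape(3, 4)
flipped = base_arr.T

print(f"Transpose is F-contiguous: {flipped.flags['F_CONTIGUOUS']}")

flat_from_base = base_arr.reshape(-1)    # C-contiguous input: view
flat_from_flipped = flipped.reshape(-1)  # not C-contiguous: copy

print(f"Flattened original shares memory: {np.shares_memory(base_arr, flat_from_base)}")
print(f"Flattened transpose shares memory: {np.shares_memory(base_arr, flat_from_flipped)}")
```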

---
## 3. Views vs Copies Performance

In [None]:
# Views are essentially free (just metadata)
arr = np.random.rand(10000, 10000)

# View operations
start = time.perf_counter()
for _ in range(1000):
    view = arr[::2, ::2]  # View
view_time = time.perf_counter() - start

# Copy operations
start = time.perf_counter()
for _ in range(10):
    copy = arr[::2, ::2].copy()  # Copy
copy_time = (time.perf_counter() - start) * 100  # Scale for comparison

print(f"1000 views: {view_time*1000:.2f}ms")
print(f"1000 copies (estimated): {copy_time*1000:.2f}ms")

In [None]:
# When to make explicit copy:
# 1. Need independent data
# 2. Want contiguous memory for better performance
# 3. Avoid keeping a huge array alive through a small view

# Example: a small view retains the whole buffer
huge_arr = np.random.rand(10000, 10000)
small_view = huge_arr[0:10, 0:10]  # Still references huge_arr's 800 MB buffer!

# Copy the part you need, then drop every reference to the big buffer
small_copy = huge_arr[0:10, 0:10].copy()
del small_view, huge_arr  # The buffer can be freed only once no views remain
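
The `.base` attribute makes this retention visible. A minimal sketch on a smaller array (names `big`, `window_view`, `window_copy` are illustrative; `np.zeros` is used so that the array is known to own its data):

```python
import numpy as np

big = np.zeros((1000, 1000))
window_view = big[:10, :10]
window_copy = big[:10, :10].copy()

# A view points back at the array that owns the buffer
print(f"View base is big: {window_view.base is big}")
# A copy owns its own, much smaller buffer
print(f"Copy owns its data: {window_copy.base is None}")

print(f"View exposes {window_view.nbytes} bytes "
      f"but keeps {window_view.base.nbytes} bytes alive")
```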

---
## 4. Cache-Friendly Access

In [None]:
# Cache lines: CPU loads data in chunks (typically 64 bytes)
# Sequential access = good cache utilization
# Random access = cache misses

n = 5000
arr = np.random.rand(n, n)

# Row-wise iteration (cache-friendly for C-order)
start = time.perf_counter()
total = 0
for i in range(n):
    total += arr[i, :].sum()
row_time = time.perf_counter() - start

# Column-wise iteration (cache-unfriendly for C-order)
start = time.perf_counter()
total = 0
for j in range(n):
    total += arr[:, j].sum()
col_time = time.perf_counter() - start

print(f"Row-wise (cache-friendly): {row_time:.3f}s")
print(f"Column-wise (cache-unfriendly): {col_time:.3f}s")
print(f"Ratio: {col_time/row_time:.1f}x slower")

In [None]:
# Better: use vectorized operations (the loop runs in compiled code, without per-row Python overhead)
start = time.perf_counter()
total = arr.sum(axis=1).sum()  # Row sums
vec_time = time.perf_counter() - start

print(f"Vectorized: {vec_time:.5f}s")
print(f"Speedup vs loop: {row_time/vec_time:.0f}x")

---
## 5. Memory Alignment

In [None]:
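# NumPy aligns array data for its dtype; stricter 64-byte (cache-line)
# alignment depends on the allocator and is not guaranteed
arr = np.random.rand(1000)

# Check alignment (data pointer address)
print(f"Data pointer: {arr.ctypes.data}")
print(f"Aligned to 64 bytes: {arr.ctypes.data % 64 == 0}")  # May print False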

In [None]:
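# Slicing shifts the data pointer, so a slice can lose 64-byte alignment
arr = np.arange(100, dtype=np.float64)
print(f"Original 64-byte aligned: {arr.ctypes.data % 64 == 0}")  # Allocator-dependent

sliced = arr[1:]  # Data pointer moves forward by 8 bytes (one float64)
print(f"Sliced[1:] 64-byte aligned: {sliced.ctypes.data % 64 == 0}")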
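
The pointer shift caused by a slice can be measured directly. This sketch (illustrative names `values` and `tail`) shows that `[1:]` moves the start by exactly one item, which changes the 64-byte remainder while leaving the array aligned for its dtype:

```python
import numpy as np

values = np.arange(100, dtype=np.float64)
tail = values[1:]

shift = tail.ctypes.data - values.ctypes.data
print(f"Pointer shift: {shift} bytes (itemsize = {values.itemsize})")

# The remainders differ by 8, so at most one of the two can be 64-byte aligned
print(f"Address mod 64: original={values.ctypes.data % 64}, slice={tail.ctypes.data % 64}")

# Still aligned for float64, which is what the ALIGNED flag reports
print(f"Slice ALIGNED flag: {tail.flags['ALIGNED']}")
```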

In [None]:
# The ALIGNED flag reports alignment suitable for the dtype (not 64-byte alignment)
arr = np.random.rand(100)
print(f"ALIGNED: {arr.flags['ALIGNED']}")
print(f"WRITEABLE: {arr.flags['WRITEABLE']}")
print(f"C_CONTIGUOUS: {arr.flags['C_CONTIGUOUS']}")

---
## 6. Making Arrays Contiguous

In [None]:
arr = np.arange(12).reshape(3, 4)
trans = arr.T  # Not C-contiguous (F-contiguous view)

print(f"Transpose C-contiguous: {trans.flags['C_CONTIGUOUS']}")

In [None]:
# Make contiguous
contiguous = np.ascontiguousarray(trans)
print(f"After ascontiguousarray: {contiguous.flags['C_CONTIGUOUS']}")
print(f"Made copy: {not np.shares_memory(trans, contiguous)}")
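
`np.ascontiguousarray` copies only when it has to. The sketch below compares an input that is already C-contiguous with a transposed one (names `already_c`, `not_c`, `same`, `fresh` are illustrative):

```python
import numpy as np

already_c = np.arange(12).reshape(3, 4)
not_c = already_c.T

same = np.ascontiguousarray(already_c)  # already C-contiguous: no copy needed
fresh = np.ascontiguousarray(not_c)     # must copy to reorder the data

print(f"Contiguous input shares memory: {np.shares_memory(already_c, same)}")
print(f"Transposed input shares memory: {np.shares_memory(not_c, fresh)}")
print(f"Values preserved: {np.array_equal(not_c, fresh)}")
```

This makes it cheap to call defensively before handing an array to code that requires C order.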

In [None]:
# Performance impact
n = 3000
arr = np.random.rand(n, n)
trans = arr.T  # Not C-contiguous (F-contiguous view of arr)
contig = np.ascontiguousarray(trans)

# Operation on non-contiguous
start = time.perf_counter()
for _ in range(10):
    _ = trans.sum(axis=1)
non_contig_time = time.perf_counter() - start

# Operation on contiguous
start = time.perf_counter()
for _ in range(10):
    _ = contig.sum(axis=1)
contig_time = time.perf_counter() - start

print(f"Non-contiguous: {non_contig_time*1000:.1f}ms")
print(f"Contiguous: {contig_time*1000:.1f}ms")
print(f"Speedup: {non_contig_time/contig_time:.2f}x")

---
## 7. In-Place Operations

In [None]:
# In-place operations avoid allocation
arr = np.random.rand(10000000)

# Not in-place (each operation allocates a new array)
start = time.perf_counter()
for _ in range(100):
    result = arr * 2
    result = result / 2  # Same two operations as the in-place loop
alloc_time = time.perf_counter() - start

# In-place
arr = np.random.rand(10000000)
start = time.perf_counter()
for _ in range(100):
    arr *= 2
    arr /= 2  # Undo so values stay in range across iterations
inplace_time = time.perf_counter() - start

print(f"Allocating: {alloc_time:.3f}s")
print(f"In-place: {inplace_time:.3f}s")
print(f"Speedup: {alloc_time/inplace_time:.2f}x")
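
Whether an operation is in-place can be confirmed by checking that the data pointer is unchanged. A short sketch with illustrative names (`data`, `old`):

```python
import numpy as np

data = np.ones(5)
address_before = data.ctypes.data

data *= 2  # in-place: writes into the existing buffer
same_buffer = data.ctypes.data == address_before
print(f"In-place keeps the buffer: {same_buffer}")

old = data
data = data * 2  # rebinding: a new array is allocated for the result
new_buffer = not np.shares_memory(old, data)
print(f"Rebinding allocates a new buffer: {new_buffer}")
```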

In [None]:
# Using 'out' parameter
a = np.random.rand(1000000)
b = np.random.rand(1000000)
result = np.empty_like(a)

# Pre-allocated output
start = time.perf_counter()
for _ in range(100):
    np.add(a, b, out=result)
out_time = time.perf_counter() - start

# Normal (allocates each time)
start = time.perf_counter()
for _ in range(100):
    result = a + b
normal_time = time.perf_counter() - start

print(f"With out: {out_time:.4f}s")
print(f"Normal: {normal_time:.4f}s")
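
The `out` parameter also helps with compound expressions, where each intermediate result would otherwise get its own temporary array. A minimal sketch computing `first * second + third` with one preallocated buffer (all names are illustrative):

```python
import numpy as np

first = np.random.rand(1000)
second = np.random.rand(1000)
third = np.random.rand(1000)
scratch = np.empty_like(first)

# first * second + third, reusing one buffer for both steps
np.multiply(first, second, out=scratch)
np.add(scratch, third, out=scratch)

print(f"Matches the plain expression: {np.allclose(scratch, first * second + third)}")
```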

---
## Key Points Summary

**Memory Order:**
- C-order (row-major): NumPy default
- F-order (column-major): Use for Fortran interop
- Iterate along contiguous dimension for speed

**Strides:**
- Define how many bytes to jump per step along each dimension
- Views change only metadata (shape, strides, offset); the data is not copied
- Non-contiguous access often costs performance

**Best Practices:**
- Use views when possible
- Make contiguous before heavy computation
- Use in-place operations / `out` parameter
- Iterate in memory order

---
## Interview Tips

**Q1: What's the difference between C and Fortran order?**
> C-order stores rows contiguously (last index varies fastest). Fortran-order stores columns contiguously (first index varies fastest). Default is C.

**Q2: Why is iterating over columns slow in C-order?**
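> In C-order, neighboring elements of a column sit a full row apart in memory. Each access therefore tends to land on a different cache line, most of every loaded line goes unused, and the large stride makes CPU prefetching less effective.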

**Q3: When would you use np.ascontiguousarray?**
> After operations that break contiguity (transpose, complex slicing) when the array will be used heavily afterward, or for C extension interop.

**Q4: How do you avoid memory allocation in tight loops?**
> Use in-place operators (`*=`, `+=`), preallocate output arrays, and use `out` parameter in ufuncs.

---
## Practice Exercises

### Exercise 1: Identify view vs copy

In [None]:
arr = np.arange(100).reshape(10, 10)

# Which are views? Which are copies?
a = arr[5:]
b = arr.reshape(20, 5)
c = arr.flatten()
d = arr.T
e = arr[[1, 3, 5]]


In [None]:
# Solution
arr = np.arange(100).reshape(10, 10)

print(f"a (slice): view = {np.shares_memory(arr, arr[5:])}")
print(f"b (reshape): view = {np.shares_memory(arr, arr.reshape(20, 5))}")
print(f"c (flatten): view = {np.shares_memory(arr, arr.flatten())}")
print(f"d (T): view = {np.shares_memory(arr, arr.T)}")
print(f"e (fancy): view = {np.shares_memory(arr, arr[[1,3,5]])}")
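
A related case worth checking is `ravel`, which (unlike `flatten`) returns a view when the layout allows it. A short sketch using an illustrative array `square`:

```python
import numpy as np

square = np.arange(100).reshape(10, 10)

# ravel returns a view of a C-contiguous array, but must copy the transpose
print(f"ravel: view = {np.shares_memory(square, square.ravel())}")
print(f"T.ravel: view = {np.shares_memory(square, square.T.ravel())}")
# flatten always returns a copy
print(f"flatten: view = {np.shares_memory(square, square.flatten())}")
```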

### Exercise 2: Optimize matrix operation

In [None]:
# Optimize this operation for C-order arrays
def slow_column_mean(arr):
    result = np.zeros(arr.shape[1])
    for j in range(arr.shape[1]):
        result[j] = arr[:, j].mean()
    return result

arr = np.random.rand(1000, 100)


In [None]:
# Solution: Use vectorized operation
def fast_column_mean(arr):
    return arr.mean(axis=0)

arr = np.random.rand(1000, 100)

start = time.perf_counter()
for _ in range(100):
    slow_column_mean(arr)
slow_time = time.perf_counter() - start

start = time.perf_counter()
for _ in range(100):
    fast_column_mean(arr)
fast_time = time.perf_counter() - start

print(f"Slow: {slow_time:.4f}s")
print(f"Fast: {fast_time:.4f}s")
print(f"Speedup: {slow_time/fast_time:.0f}x")
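
Before trusting a speedup, it helps to confirm that both versions compute the same thing. A self-contained check, with the loop version restated under the hypothetical name `loop_column_mean`:

```python
import numpy as np

def loop_column_mean(arr):
    """Column means computed one column at a time (reference version)."""
    out = np.zeros(arr.shape[1])
    for j in range(arr.shape[1]):
        out[j] = arr[:, j].mean()
    return out

check = np.random.rand(1000, 100)
results_match = np.allclose(loop_column_mean(check), check.mean(axis=0))
print(f"Results match: {results_match}")
```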

---
## Next Notebook
**02_vectorization_best_practices.ipynb** - Advanced vectorization patterns for maximum performance.