# Module 1.1: Tensors Deep Dive

This notebook explores PyTorch tensors, the core data structure behind every PyTorch model. We'll go beyond basic operations to understand memory layout, performance implications, and common pitfalls.

## Learning Objectives
- Understand tensor creation patterns and when to use each
- Master dtypes and device management (CPU/GPU)
- Comprehend memory layout: strides, contiguity, and their performance implications
- Distinguish between views and copies
- Apply broadcasting rules correctly
- Use in-place operations safely

---

## Setup

In [1]:
import torch
import numpy as np

# Check PyTorch version and CUDA availability
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")

PyTorch version: 2.9.1+cu128
CUDA available: True
CUDA device: NVIDIA GeForce RTX 4060 Laptop GPU
CUDA version: 12.8


---
## 1. Tensor Creation

PyTorch provides multiple ways to create tensors. Understanding when to use each method is crucial for writing efficient code.

### 1.1 From Python Data

In [2]:
# From Python list - creates a copy
data = [[1, 2, 3], [4, 5, 6]]
t1 = torch.tensor(data)
print(f"From list: {t1}")
print(f"Shape: {t1.shape}, Dtype: {t1.dtype}")

# Note: torch.tensor() ALWAYS copies data
# This is different from torch.as_tensor() which may share memory

From list: tensor([[1, 2, 3],
        [4, 5, 6]])
Shape: torch.Size([2, 3]), Dtype: torch.int64


In [3]:
# From NumPy - torch.from_numpy() shares memory!
np_array = np.array([[1.0, 2.0], [3.0, 4.0]])
t2 = torch.from_numpy(np_array)

print(f"Original NumPy array: {np_array}")
print(f"Tensor from NumPy: {t2}")

# Modify the numpy array
np_array[0, 0] = 999
print(f"\nAfter modifying NumPy array:")
print(f"NumPy: {np_array}")
print(f"Tensor: {t2}")  # Also changed!

# Key insight: from_numpy() creates a VIEW, not a copy

Original NumPy array: [[1. 2.]
 [3. 4.]]
Tensor from NumPy: tensor([[1., 2.],
        [3., 4.]], dtype=torch.float64)

After modifying NumPy array:
NumPy: [[999.   2.]
 [  3.   4.]]
Tensor: tensor([[999.,   2.],
        [  3.,   4.]], dtype=torch.float64)


In [4]:
# torch.as_tensor() - smart conversion (shares memory when possible)
np_array2 = np.array([1.0, 2.0, 3.0])
t3 = torch.as_tensor(np_array2)

# vs torch.tensor() - always copies
t4 = torch.tensor(np_array2)

np_array2[0] = 100
print(f"as_tensor (shares memory): {t3}")  # Changed
print(f"tensor (copied): {t4}")            # Unchanged

as_tensor (shares memory): tensor([100.,   2.,   3.], dtype=torch.float64)
tensor (copied): tensor([1., 2., 3.], dtype=torch.float64)
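
Sharing is only possible when no conversion is needed. As a small sketch (the variable names here are illustrative, not from the cells above), requesting a different dtype forces `torch.as_tensor()` to copy:

```python
import numpy as np
import torch

np_src = np.array([1.0, 2.0, 3.0])  # float64 NumPy array

shared = torch.as_tensor(np_src)                          # same dtype: can share memory
converted = torch.as_tensor(np_src, dtype=torch.float32)  # dtype change: must copy

np_src[0] = 100.0
print(f"shared:    {shared[0].item()}")     # follows the NumPy array
print(f"converted: {converted[0].item()}")  # keeps the original value
```

If later code relies on the tensor tracking the array (or on it not doing so), it is safer to be explicit: `torch.from_numpy()` for sharing, `torch.tensor()` for a copy.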


### 1.2 Factory Functions

In [5]:
# Uninitialized tensor - contains garbage values (fast but dangerous)
t_empty = torch.empty(3, 4)
print(f"Empty (uninitialized):\n{t_empty}\n")

# Zeros and ones
t_zeros = torch.zeros(2, 3)
t_ones = torch.ones(2, 3)
print(f"Zeros:\n{t_zeros}\n")
print(f"Ones:\n{t_ones}")

Empty (uninitialized):
tensor([[3.4163e-20, 0.0000e+00, 0.0000e+00, 1.4013e-45],
        [0.0000e+00, 4.3028e-41, 9.1084e-44, 0.0000e+00],
        [3.4002e-20, 0.0000e+00, 2.6358e-36, 4.3031e-41]])

Zeros:
tensor([[0., 0., 0.],
        [0., 0., 0.]])

Ones:
tensor([[1., 1., 1.],
        [1., 1., 1.]])


In [6]:
# Random tensors - essential for weight initialization
torch.manual_seed(42)  # For reproducibility

# Uniform distribution [0, 1)
t_rand = torch.rand(2, 3)
print(f"Uniform [0,1):\n{t_rand}\n")

# Standard normal distribution (mean=0, std=1)
t_randn = torch.randn(2, 3)
print(f"Standard normal:\n{t_randn}\n")

# Random integers
t_randint = torch.randint(low=0, high=10, size=(2, 3))
print(f"Random integers [0, 10):\n{t_randint}")

Uniform [0,1):
tensor([[0.8823, 0.9150, 0.3829],
        [0.9593, 0.3904, 0.6009]])

Standard normal:
tensor([[ 1.1561,  0.3965, -2.4661],
        [ 0.3623,  0.3765, -0.1808]])

Random integers [0, 10):
tensor([[7, 6, 9],
        [6, 3, 1]])


In [7]:
# Ranges and linear spaces
t_arange = torch.arange(0, 10, 2)  # start, end (exclusive), step
print(f"arange(0, 10, 2): {t_arange}")

t_linspace = torch.linspace(0, 1, 5)  # start, end (inclusive), num_points
print(f"linspace(0, 1, 5): {t_linspace}")

arange(0, 10, 2): tensor([0, 2, 4, 6, 8])
linspace(0, 1, 5): tensor([0.0000, 0.2500, 0.5000, 0.7500, 1.0000])


In [8]:
# "_like" functions - create tensor with same shape/dtype/device
reference = torch.randn(3, 4, device='cpu', dtype=torch.float32)

t_zeros_like = torch.zeros_like(reference)
t_ones_like = torch.ones_like(reference)
t_rand_like = torch.rand_like(reference)

print(f"Reference shape: {reference.shape}, dtype: {reference.dtype}")
print(f"zeros_like matches: shape={t_zeros_like.shape}, dtype={t_zeros_like.dtype}")

Reference shape: torch.Size([3, 4]), dtype: torch.float32
zeros_like matches: shape=torch.Size([3, 4]), dtype=torch.float32


---
## 2. Data Types (dtypes)

Choosing the right dtype affects memory usage, computation speed, and numerical precision.

In [9]:
# Common dtypes
dtypes_info = [
    (torch.float32, "float32 (default for floats)"),
    (torch.float64, "float64 (double precision)"),
    (torch.float16, "float16 (half precision - GPU training)"),
    (torch.bfloat16, "bfloat16 (brain float - better range than fp16)"),
    (torch.int32, "int32"),
    (torch.int64, "int64 (default for integers)"),
    (torch.bool, "bool"),
]

for dtype, desc in dtypes_info:
    t = torch.tensor([1], dtype=dtype)  # the value is irrelevant; only the dtype matters
    print(f"{desc}: {t.element_size()} bytes per element")

float32 (default for floats): 4 bytes per element
float64 (double precision): 8 bytes per element
float16 (half precision - GPU training): 2 bytes per element
bfloat16 (brain float - better range than fp16): 2 bytes per element
int32: 4 bytes per element
int64 (default for integers): 8 bytes per element
bool: 1 bytes per element


In [10]:
# Type inference rules
t_int = torch.tensor([1, 2, 3])
t_float = torch.tensor([1.0, 2.0, 3.0])
t_mixed = torch.tensor([1, 2.0, 3])  # Promotes to float

print(f"Integers -> {t_int.dtype}")
print(f"Floats -> {t_float.dtype}")
print(f"Mixed -> {t_mixed.dtype}")

Integers -> torch.int64
Floats -> torch.float32
Mixed -> torch.float32
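
The same promotion logic applies when tensors of different dtypes meet in an operation. The sketch below assumes the default dtype is still `torch.float32`; the variable names are illustrative:

```python
import torch

ints = torch.tensor([1, 2, 3])                       # int64
floats32 = torch.tensor([0.5, 0.5, 0.5])             # float32
floats64 = torch.tensor([0.5], dtype=torch.float64)  # float64

sum_f32 = ints + floats32      # integer tensor + float32 tensor
sum_f64 = floats32 + floats64  # float32 tensor + float64 tensor
scaled = ints * 1.5            # integer tensor * Python float
ratio = ints / ints            # true division of integer tensors

print(f"int64 + float32   -> {sum_f32.dtype}")
print(f"float32 + float64 -> {sum_f64.dtype}")
print(f"int64 * 1.5       -> {scaled.dtype}")
print(f"int64 / int64     -> {ratio.dtype}")
```

A Python scalar never widens a float tensor, but an integer tensor combined with a float becomes a float tensor in the default dtype.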


In [11]:
# Type casting
t = torch.tensor([1.7, 2.3, 3.9])

# Method 1: .to(dtype)
t_int = t.to(torch.int32)
print(f"to(int32): {t_int}")  # Truncates, doesn't round!

# Method 2: Convenience methods
t_long = t.long()    # int64
t_float = t.float()  # float32
t_half = t.half()    # float16

print(f".long(): {t_long}")
print(f".half(): {t_half}")

to(int32): tensor([1, 2, 3], dtype=torch.int32)
.long(): tensor([1, 2, 3])
.half(): tensor([1.7002, 2.3008, 3.9004], dtype=torch.float16)
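
Because the cast truncates toward zero, negative values move up rather than down. If rounding or flooring is what you want, apply it before casting. A minimal sketch:

```python
import torch

values = torch.tensor([-1.7, -0.5, 1.7, 2.5])

truncated = values.to(torch.int32)        # drops the fractional part (toward zero)
rounded = values.round().to(torch.int32)  # round() sends exact halves to the nearest even integer
floored = values.floor().to(torch.int32)  # always rounds down

print(f"truncated: {truncated.tolist()}")  # [-1, 0, 1, 2]
print(f"rounded:   {rounded.tolist()}")    # [-2, 0, 2, 2]
print(f"floored:   {floored.tolist()}")    # [-2, -1, 1, 2]
```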


### Memory Impact of dtypes

In [12]:
# Memory comparison for a typical neural network layer
size = (1024, 1024)  # 1M parameters

for dtype in [torch.float32, torch.float16, torch.bfloat16]:
    t = torch.randn(size, dtype=dtype)
    memory_mb = t.element_size() * t.numel() / (1024 * 1024)
    print(f"{str(dtype):20} -> {memory_mb:.2f} MB")

torch.float32        -> 4.00 MB
torch.float16        -> 2.00 MB
torch.bfloat16       -> 2.00 MB
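
float16 and bfloat16 cost the same memory but spend their bits differently. As a rough illustration, `torch.finfo` reports the largest representable value and the spacing near 1.0, and a value just above float16's maximum shows the practical difference:

```python
import torch

for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16} max={info.max:.3e}  eps={info.eps:.3e}")

big = torch.tensor(70000.0)
as_fp16 = big.to(torch.float16)   # above float16's max (65504): overflows to inf
as_bf16 = big.to(torch.bfloat16)  # stays finite, but is rounded coarsely

print(f"float16:  {as_fp16.item()}")
print(f"bfloat16: {as_bf16.item()}")
```

bfloat16 keeps float32's exponent range at the cost of fewer mantissa bits, which is why its `eps` is larger than float16's.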


---
## 3. Devices: CPU vs GPU

PyTorch tensors can live on CPU or GPU. Operations require tensors to be on the same device.

In [13]:
# Check available devices
cpu_device = torch.device('cpu')
print(f"CPU device: {cpu_device}")

if torch.cuda.is_available():
    gpu_device = torch.device('cuda:0')  # First GPU
    print(f"GPU device: {gpu_device}")
    print(f"GPU count: {torch.cuda.device_count()}")

CPU device: cpu
GPU device: cuda:0
GPU count: 1


In [14]:
# Creating tensors on specific devices
t_cpu = torch.randn(3, 3, device='cpu')
print(f"CPU tensor device: {t_cpu.device}")

if torch.cuda.is_available():
    t_gpu = torch.randn(3, 3, device='cuda')
    print(f"GPU tensor device: {t_gpu.device}")

CPU tensor device: cpu
GPU tensor device: cuda:0


In [25]:
# Moving tensors between devices
t = torch.randn(1000, 1000)

if torch.cuda.is_available():
    # Method 1: .to(device)
    t_gpu = t.to('cuda')
    
    # Method 2: .cuda() / .cpu()
    t_gpu2 = t.cuda()
    t_back = t_gpu.cpu()
    
    print(f"Original: {t.device}")
    print(f"After .to('cuda'): {t_gpu.device}")
    print(f"After .cpu(): {t_back.device}")

Original: cpu
After .to('cuda'): cuda:0
After .cpu(): cpu


In [29]:
# Device-agnostic code pattern
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Create tensors on the appropriate device
x = torch.randn(100, 100, device=device)
y = torch.randn(100, 100, device=device)
z = x @ y  # Matrix multiplication on GPU if available

Using device: cuda


In [36]:
# Performance comparison: CPU vs GPU
import time

def benchmark_matmul(size, device, num_iters=100):
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    
    # Warmup
    for _ in range(10):
        c = a @ b
    
    if device.type == 'cuda':
        torch.cuda.synchronize()  # Wait for GPU operations to complete
    
    start = time.perf_counter()
    for _ in range(num_iters):
        c = a @ b
    if device.type == 'cuda':
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    
    return elapsed / num_iters * 1000  # ms per operation

size = 2048
cpu_time = benchmark_matmul(size, torch.device('cpu'))
print(f"CPU: {cpu_time:.2f} ms per {size}x{size} matmul")

if torch.cuda.is_available():
    gpu_time = benchmark_matmul(size, torch.device('cuda'))
    print(f"GPU: {gpu_time:.2f} ms per {size}x{size} matmul")
    print(f"Speedup: {cpu_time / gpu_time:.1f}x")

CPU: 32.94 ms per 2048x2048 matmul
GPU: 2.25 ms per 2048x2048 matmul
Speedup: 14.7x


---
## 4. Memory Layout: Strides and Contiguity

Understanding how tensors are stored in memory is crucial for performance optimization and debugging.

### 4.1 What are Strides?

In [37]:
# Strides tell you how many elements to skip in memory to move one position in each dimension
t = torch.arange(12).reshape(3, 4)
print(f"Tensor:\n{t}\n")
print(f"Shape: {t.shape}")
print(f"Strides: {t.stride()}")

# Stride (4, 1) means:
# - Move 4 elements in memory to go to the next row
# - Move 1 element in memory to go to the next column

Tensor:
tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])

Shape: torch.Size([3, 4])
Strides: (4, 1)


In [38]:
# Visualizing memory layout
print("Memory layout (row-major / C-contiguous):")
print(f"Flat memory: {t.flatten().tolist()}")
print("")
print("To access t[1, 2]:")
print(f"  Memory offset = 1 * stride[0] + 2 * stride[1] = 1 * 4 + 2 * 1 = 6")
print(f"  Value at offset 6: {t.flatten()[6].item()} = t[1,2] = {t[1,2].item()}")

Memory layout (row-major / C-contiguous):
Flat memory: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

To access t[1, 2]:
  Memory offset = 1 * stride[0] + 2 * stride[1] = 1 * 4 + 2 * 1 = 6
  Value at offset 6: 6 = t[1,2] = 6
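
The offset arithmetic above generalizes to any number of dimensions. Here is a pure-Python sketch of it; the helper names `contiguous_strides` and `flat_offset` are hypothetical, and the sketch ignores the storage offset that sliced tensors can carry:

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a tensor of the given shape."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)  # the last dimension moves 1 element at a time
        step *= size          # each earlier dimension skips a full inner block
    return tuple(reversed(strides))


def flat_offset(index, strides):
    """Position in flat memory of the element at a multi-dimensional index."""
    return sum(i * s for i, s in zip(index, strides))


strides_2d = contiguous_strides((3, 4))
print(strides_2d)                       # (4, 1)
print(flat_offset((1, 2), strides_2d))  # 6
print(contiguous_strides((2, 3, 4)))    # (12, 4, 1)
```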


### 4.2 Contiguous vs Non-Contiguous

In [39]:
# Original tensor is contiguous
t = torch.arange(12).reshape(3, 4)
print(f"Original: contiguous={t.is_contiguous()}, strides={t.stride()}")

# Transpose creates a non-contiguous view
t_T = t.T
print(f"Transposed: contiguous={t_T.is_contiguous()}, strides={t_T.stride()}")

# The transposed tensor has reversed strides - it's viewing the same memory differently

Original: contiguous=True, strides=(4, 1)
Transposed: contiguous=False, strides=(1, 4)
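
Contiguity can be read off the shape and strides alone. The following is a simplified pure-Python sketch of that check (the function name is hypothetical, and PyTorch's own `is_contiguous()` handles more cases, such as other memory formats):

```python
def is_row_major_contiguous(shape, strides):
    """True if the strides are the ones a freshly allocated row-major tensor would have."""
    expected = 1
    for size, stride in zip(reversed(shape), reversed(strides)):
        if size == 1:
            continue  # a size-1 dimension is never stepped along, so its stride is irrelevant
        if stride != expected:
            return False
        expected *= size
    return True


print(is_row_major_contiguous((3, 4), (4, 1)))         # True: original tensor
print(is_row_major_contiguous((4, 3), (1, 4)))         # False: transposed view
print(is_row_major_contiguous((2, 2, 4), (12, 8, 1)))  # False: strided slice
```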


In [40]:
# Why does contiguity matter?
t = torch.randn(1000, 1000)
t_T = t.T  # Non-contiguous

# Some operations require contiguous tensors
try:
    # view() needs strides compatible with the new shape;
    # a transposed tensor cannot be flattened without copying
    t_T.view(-1)
except RuntimeError as e:
    print(f"Error: {e}")
    
# Solution: use .contiguous() or .reshape()
t_T_contig = t_T.contiguous()  # Creates a copy with contiguous memory
flat = t_T_contig.view(-1)  # Now works
print(f"After .contiguous(): {t_T_contig.is_contiguous()}")

# Or use reshape(), which handles non-contiguous tensors automatically
flat2 = t_T.reshape(-1)  # Works without explicit .contiguous()

Error: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
After .contiguous(): True
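
The convenience of `reshape()` has a cost: whether you get a view or a copy depends on the input's layout. A small sketch that compares data pointers to tell the two cases apart:

```python
import torch

t = torch.arange(12).reshape(3, 4)

flat_view = t.reshape(-1)    # contiguous input: reshape can return a view
flat_copy = t.T.reshape(-1)  # transposed input: reshape has to copy

print(f"reshape of contiguous shares memory: {flat_view.data_ptr() == t.data_ptr()}")
print(f"reshape of transposed shares memory: {flat_copy.data_ptr() == t.data_ptr()}")
print(f"flattened transpose: {flat_copy.tolist()}")
```

If code depends on writes propagating back to the original tensor, `view()` is the safer choice because it fails loudly instead of copying silently.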


In [41]:
# Operations that create non-contiguous tensors
t = torch.arange(24).reshape(2, 3, 4)
print(f"Original: contiguous={t.is_contiguous()}, strides={t.stride()}")

# Transpose
t1 = t.transpose(0, 1)
print(f"transpose(0,1): contiguous={t1.is_contiguous()}, strides={t1.stride()}")

# Permute
t2 = t.permute(2, 0, 1)
print(f"permute(2,0,1): contiguous={t2.is_contiguous()}, strides={t2.stride()}")

# Slicing with step
t3 = t[:, ::2, :]
print(f"slice [::2]: contiguous={t3.is_contiguous()}, strides={t3.stride()}")

Original: contiguous=True, strides=(12, 4, 1)
transpose(0,1): contiguous=False, strides=(4, 12, 1)
permute(2,0,1): contiguous=False, strides=(1, 12, 4)
slice [::2]: contiguous=False, strides=(12, 8, 1)


### 4.3 Performance Implications

In [50]:
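# Non-contiguous memory access can be slower because of poorer cache locality,
# but the effect depends on the operation and hardware. For a full reduction
# such as sum(), the run below shows no penalty for the transposed view.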
import time

size = 4096
t = torch.randn(size, size)
t_T = t.T  # Non-contiguous view
t_T_contig = t_T.contiguous()  # Contiguous copy

def time_sum(tensor, name, num_iters=100):
    # Warmup
    for _ in range(10):
        tensor.sum()
    
    start = time.perf_counter()
    for _ in range(num_iters):
        tensor.sum()
    elapsed = (time.perf_counter() - start) / num_iters * 1000
    print(f"{name}: {elapsed:.3f} ms")

time_sum(t, "Original (contiguous)")
time_sum(t_T, "Transposed (non-contiguous)")
time_sum(t_T_contig, "Transposed + contiguous()")

Original (contiguous): 3.670 ms
Transposed (non-contiguous): 2.902 ms
Transposed + contiguous(): 3.013 ms


---
## 5. Views vs Copies

Understanding when PyTorch shares memory vs copies data is essential for:
- Avoiding bugs from unexpected data sharing
- Memory efficiency
- Performance optimization

In [51]:
# Views share memory with the original tensor
original = torch.arange(10)
view = original.view(2, 5)

print(f"Original: {original}")
print(f"View: {view}")

# Modify via the view
view[0, 0] = 999
print(f"\nAfter modifying view[0,0]:")
print(f"Original: {original}")  # Also modified!
print(f"View: {view}")

Original: tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
View: tensor([[0, 1, 2, 3, 4],
        [5, 6, 7, 8, 9]])

After modifying view[0,0]:
Original: tensor([999,   1,   2,   3,   4,   5,   6,   7,   8,   9])
View: tensor([[999,   1,   2,   3,   4],
        [  5,   6,   7,   8,   9]])


In [52]:
# Check if tensors share memory
def shares_memory(a, b):
    # untyped_storage() replaces the deprecated storage(), which emits a UserWarning
    return a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()

original = torch.arange(10)
view = original.view(2, 5)
copy = original.clone()

print(f"view shares memory: {shares_memory(original, view)}")
print(f"clone shares memory: {shares_memory(original, copy)}")

view shares memory: True
clone shares memory: False
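
Note that comparing storage pointers answers "do these tensors live in the same allocation?", not "do they overlap?". As a sketch (assuming PyTorch 2.x for `untyped_storage()`; the helper name `same_storage` is illustrative), two rows of one tensor share storage yet never touch the same elements:

```python
import torch

def same_storage(a, b):
    return a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()

t = torch.arange(12).reshape(3, 4)
row0, row1 = t[0], t[1]

print(f"same storage:     {same_storage(row0, row1)}")
print(f"same first elem:  {row0.data_ptr() == row1.data_ptr()}")
print(f"row1 starts at:   {row1.storage_offset()}")  # offset in elements

row0.fill_(-1)  # writes only to the first row
print(f"row1 after write: {row1.tolist()}")
```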



In [53]:
# Operations that return VIEWS (share memory)
t = torch.arange(12).reshape(3, 4)

views = [
    ("view/reshape", t.view(4, 3)),
    ("transpose", t.T),
    ("slicing", t[1:, 1:]),
    ("squeeze/unsqueeze", t.unsqueeze(0)),
    ("expand", t[0].expand(3, -1)),
]

for name, v in views:
    print(f"{name:20} shares memory: {shares_memory(t, v)}")

view/reshape         shares memory: True
transpose            shares memory: True
slicing              shares memory: True
squeeze/unsqueeze    shares memory: True
expand               shares memory: True


In [54]:
# Operations that return COPIES (new memory)
t = torch.arange(12).reshape(3, 4)

copies = [
    ("clone", t.clone()),
    ("contiguous (if needed)", t.T.contiguous()),
    ("to (different device/dtype)", t.to(torch.float32)),
    ("masked_select", t[t > 5]),
    ("nonzero", t.nonzero()),
]

for name, c in copies:
    print(f"{name:25} shares memory: {shares_memory(t, c)}")

clone                     shares memory: False
contiguous (if needed)    shares memory: False
to (different device/dtype) shares memory: False
masked_select             shares memory: False
nonzero                   shares memory: False


### Common View Gotcha: Arithmetic Operations

In [55]:
# Out-of-place arithmetic (a + 1, a * 2) ALWAYS creates new tensors
a = torch.tensor([1.0, 2.0, 3.0])
b = a + 1  # New tensor, not a view
c = a * 2  # New tensor

print(f"a + 1 shares memory: {shares_memory(a, b)}")
print(f"a * 2 shares memory: {shares_memory(a, c)}")

a + 1 shares memory: False
a * 2 shares memory: False
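
The augmented operators are the exception: `a += 1` is in-place, while `a = a + 1` rebinds the name to a new tensor. A short sketch of how the two differ when another name refers to the same tensor:

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
alias = a  # second name for the same tensor object

a += 1     # in-place: the change is visible through alias
print(f"after a += 1:    alias = {alias.tolist()}")

a = a + 1  # out-of-place: a now names a new tensor, alias is untouched
print(f"after a = a + 1: a = {a.tolist()}, alias = {alias.tolist()}")
```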


---
## 6. Broadcasting

Broadcasting automatically expands tensors to compatible shapes for element-wise operations, without copying data.

### 6.1 Broadcasting Rules

Two tensors are "broadcastable" if:
1. Each tensor has at least one dimension
2. Iterating from trailing (rightmost) dimensions, sizes either:
   - Are equal
   - One of them is 1
   - One of them doesn't exist (shorter tensor)
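
These rules can be written down as a few lines of pure Python. The function below is an illustrative sketch (the name `broadcast_shape` is hypothetical) that computes the result shape or raises if the shapes are incompatible:

```python
from itertools import zip_longest


def broadcast_shape(shape_a, shape_b):
    """Result shape of broadcasting two shapes, compared from the trailing dimension."""
    result = []
    # A missing dimension on the shorter shape behaves like a size of 1
    for size_a, size_b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if size_a == size_b or size_b == 1:
            result.append(size_a)
        elif size_a == 1:
            result.append(size_b)
        else:
            raise ValueError(f"sizes {size_a} and {size_b} are not broadcastable")
    return tuple(reversed(result))


print(broadcast_shape((2, 3), (3,)))    # (2, 3)
print(broadcast_shape((3, 1), (1, 2)))  # (3, 2)
print(broadcast_shape((2, 3), (2, 1)))  # (2, 3)
```

In real code, `torch.broadcast_shapes()` performs the same computation.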

In [56]:
# Simple broadcasting: scalar with tensor
t = torch.arange(6).reshape(2, 3)
result = t + 10
print(f"Tensor + scalar:\n{result}\n")

# 1D tensor with 2D tensor
row = torch.tensor([100, 200, 300])
result = t + row
print(f"(2,3) + (3,) -> broadcasts row to each row:\n{result}")

Tensor + scalar:
tensor([[10, 11, 12],
        [13, 14, 15]])

(2,3) + (3,) -> broadcasts row to each row:
tensor([[100, 201, 302],
        [103, 204, 305]])


In [57]:
# Column vector broadcasting
t = torch.arange(6).reshape(2, 3)
col = torch.tensor([[10], [20]])  # Shape (2, 1)

result = t + col
print(f"Original (2,3):\n{t}\n")
print(f"Column vector (2,1):\n{col}\n")
print(f"Result (2,3):\n{result}")

Original (2,3):
tensor([[0, 1, 2],
        [3, 4, 5]])

Column vector (2,1):
tensor([[10],
        [20]])

Result (2,3):
tensor([[10, 11, 12],
        [23, 24, 25]])


In [58]:
# Classic broadcasting example: outer product
a = torch.tensor([1, 2, 3])      # Shape (3,)
b = torch.tensor([10, 20])       # Shape (2,)

# Reshape for broadcasting
a = a.unsqueeze(1)  # Shape (3, 1)
b = b.unsqueeze(0)  # Shape (1, 2)

outer = a * b  # Broadcasts to (3, 2)
print(f"a (3,1):\n{a}\n")
print(f"b (1,2):\n{b}\n")
print(f"Outer product (3,2):\n{outer}")

a (3,1):
tensor([[1],
        [2],
        [3]])

b (1,2):
tensor([[10, 20]])

Outer product (3,2):
tensor([[10, 20],
        [20, 40],
        [30, 60]])


In [59]:
# Broadcasting errors
a = torch.randn(2, 3)
b = torch.randn(2, 4)  # Incompatible last dimension

try:
    c = a + b
except RuntimeError as e:
    print(f"Error: {e}")

Error: The size of tensor a (3) must match the size of tensor b (4) at non-singleton dimension 1


### 6.2 Broadcasting is Memory Efficient

In [60]:
# Broadcasting doesn't actually copy data
a = torch.randn(1000, 1)  # 1000 elements
b = a.expand(1000, 1000)  # Appears to be 1M elements

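# untyped_storage() avoids the deprecated storage(); nbytes / element_size gives the element count
print(f"a.shape: {a.shape}, storage size: {a.untyped_storage().nbytes() // a.element_size()}")
print(f"b.shape: {b.shape}, storage size: {b.untyped_storage().nbytes() // b.element_size()}")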
print(f"\nBoth use the same storage!")
print(f"b strides: {b.stride()}")  # stride of 0 means broadcasting

a.shape: torch.Size([1000, 1]), storage size: 1000
b.shape: torch.Size([1000, 1000]), storage size: 1000

Both use the same storage!
b strides: (1, 0)
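
The saving lasts only while the tensor stays a view. As a small sketch (assuming PyTorch 2.x for `untyped_storage()`), calling `.contiguous()` on an expanded tensor materializes every element:

```python
import torch

a = torch.randn(4, 1)
expanded = a.expand(4, 3)             # view: stride 0 along the broadcast dimension
materialized = expanded.contiguous()  # real copy holding all 12 elements

print(f"expanded strides:     {expanded.stride()}")      # (1, 0)
print(f"materialized strides: {materialized.stride()}")  # (3, 1)
print(f"bytes before: {a.untyped_storage().nbytes()}")
print(f"bytes after:  {materialized.untyped_storage().nbytes()}")
```

Operations that need a dense layout may trigger this copy implicitly, so a large expanded tensor is cheap only as long as it is consumed by broadcasting-aware operations.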


---
## 7. In-Place Operations

In-place operations modify tensors directly without allocating new memory. They're denoted with a trailing underscore (e.g., `add_()`).

In [61]:
# In-place vs out-of-place
t = torch.tensor([1.0, 2.0, 3.0])
print(f"Original: {t}, id: {id(t)}")

# Out-of-place: creates new tensor
t2 = t + 1
print(f"After t + 1: {t2}, id: {id(t2)} (different object)")

# In-place: modifies existing tensor
t.add_(1)
print(f"After t.add_(1): {t}, id: {id(t)} (same object)")

Original: tensor([1., 2., 3.]), id: 131883426926416
After t + 1: tensor([2., 3., 4.]), id: 131883428231552 (different object)
After t.add_(1): tensor([2., 3., 4.]), id: 131883426926416 (same object)


In [62]:
# Common in-place operations
t = torch.ones(3)

t.add_(2)      # t = t + 2
t.mul_(3)      # t = t * 3
t.zero_()      # t = 0
t.fill_(5)     # t = 5
t.uniform_()   # t ~ U(0, 1)

print(f"Final: {t}")

Final: tensor([0.6373, 0.6032, 0.9491])


### 7.1 In-Place Operations and Autograd

In [67]:
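# WARNING: In-place operations can break autograd!
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x.exp()  # exp() saves its output for the backward pass

# Modifying y in place invalidates that saved output, so backward() raises.
# (With y = x ** 2 no error occurs, because the backward of pow only needs x.)
try:
    y.add_(1)  # In-place modification of a tensor needed for gradient computation
    loss = y.sum()
    loss.backward()
except RuntimeError as e:
    print(f"Error: {e}")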

In [70]:
# Safe pattern: don't use in-place ops on tensors in the computation graph
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2
y = y + 1  # Out-of-place, creates new tensor
loss = y.sum()
loss.backward()
print(f"Gradient: {x.grad}")

Gradient: tensor([2., 4., 6.])
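
In-place updates are still the normal way to change parameters after the gradients have been computed, as long as autograd is not recording. The sketch below is a hand-written gradient step (the learning rate of 0.1 is an arbitrary choice for illustration): the update fails on a leaf that requires grad, and succeeds inside `torch.no_grad()`:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()  # w.grad is now 2 * w

try:
    w.sub_(0.1 * w.grad)  # in-place on a leaf that requires grad: rejected
    leaf_error = None
except RuntimeError as e:
    leaf_error = str(e)
print(f"Error: {leaf_error}")

with torch.no_grad():     # the update is not recorded in the graph
    w.sub_(0.1 * w.grad)
print(f"Updated w: {w}")
```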


### 7.2 In-Place Operations and Views

In [71]:
# In-place ops on views affect the base tensor
base = torch.zeros(6)
view = base.view(2, 3)

print(f"Before: base = {base}")

view[0].fill_(1)
view[1].fill_(2)

print(f"After: base = {base}")

Before: base = tensor([0., 0., 0., 0., 0., 0.])
After: base = tensor([1., 1., 1., 2., 2., 2.])


---
## Exercises

Test your understanding with these exercises.

### Exercise 1: Matrix Operations from Scratch

Implement the following using only basic tensor operations (no `@`, `mm()`, or `matmul()`):
1. Matrix multiplication
2. Batch matrix multiplication

In [75]:
def manual_matmul(A, B):
    """
    Compute matrix multiplication A @ B using only element-wise ops and sum.
    A: shape (m, k)
    B: shape (k, n)
    Returns: shape (m, n)
    
    Hint: Use broadcasting and einsum-like thinking
    A[i,k] * B[k,j] summed over k
    """
    # YOUR CODE HERE
    pass

# Test
A = torch.randn(3, 4)
B = torch.randn(4, 5)
expected = A @ B
result = manual_matmul(A, B)
#  print(f"Correct: {torch.allclose(expected, result)}")

### Exercise 2: Memory Detective

For each operation, predict whether the result shares memory with the input, then verify.

In [None]:
def shares_memory(a, b):
    # untyped_storage() replaces the deprecated storage(), which emits a UserWarning
    return a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()

t = torch.arange(24).reshape(2, 3, 4)

operations = [
    ("t.flatten()", lambda x: x.flatten()),
    ("t.reshape(-1)", lambda x: x.reshape(-1)),
    ("t[0]", lambda x: x[0]),
    ("t[0].clone()", lambda x: x[0].clone()),
    ("t.transpose(0,1).contiguous()", lambda x: x.transpose(0,1).contiguous()),
    ("t + 0", lambda x: x + 0),
    ("t.float()", lambda x: x.float()),
]

print("Predict before running!\n")
for name, op in operations:
    result = op(t)
    # print(f"{name:35} shares memory: {shares_memory(t, result)}")

### Exercise 3: Broadcasting Challenge

Implement the following using broadcasting (no loops):

In [None]:
def pairwise_distances(X, Y):
    """
    Compute pairwise Euclidean distances between all pairs of points.
    
    X: shape (n, d) - n points in d dimensions
    Y: shape (m, d) - m points in d dimensions
    Returns: shape (n, m) - distance[i,j] = ||X[i] - Y[j]||_2
    
    Hint: ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a.b
    Or use broadcasting: (X[:, None, :] - Y[None, :, :]) gives (n, m, d)
    """
    # YOUR CODE HERE
    pass

# Test
X = torch.randn(5, 3)
Y = torch.randn(7, 3)
dists = pairwise_distances(X, Y)
# print(f"Output shape: {dists.shape}")  # Should be (5, 7)

---
## Solutions

In [72]:
# Exercise 1 Solution
def manual_matmul_solution(A, B):
    """A @ B using broadcasting."""
    # A: (m, k) -> (m, k, 1)
    # B: (k, n) -> (1, k, n)
    # Product: (m, k, n), sum over k -> (m, n)
    return (A.unsqueeze(-1) * B.unsqueeze(0)).sum(dim=1)

A = torch.randn(3, 4)
B = torch.randn(4, 5)
expected = A @ B
result = manual_matmul_solution(A, B)
print(f"Exercise 1 - Correct: {torch.allclose(expected, result)}")

Exercise 1 - Correct: True
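
The exercise also asks for batch matrix multiplication. One possible sketch extends the same broadcasting idea with a leading batch dimension; the function name `manual_bmm_sketch` is hypothetical, and `torch.bmm` is used only to check the result:

```python
import torch

def manual_bmm_sketch(A, B):
    """Batched A @ B using broadcasting and sum only."""
    # A: (b, m, k) -> (b, m, k, 1)
    # B: (b, k, n) -> (b, 1, k, n)
    # Product: (b, m, k, n), sum over k -> (b, m, n)
    return (A.unsqueeze(-1) * B.unsqueeze(-3)).sum(dim=-2)

A_batch = torch.randn(2, 3, 4)
B_batch = torch.randn(2, 4, 5)
batched = manual_bmm_sketch(A_batch, B_batch)
print(f"Output shape: {batched.shape}")
print(f"Matches torch.bmm: {torch.allclose(batched, torch.bmm(A_batch, B_batch), atol=1e-5)}")
```

The intermediate `(b, m, k, n)` product can be large, so this form is for understanding rather than production use.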


In [76]:
# Exercise 2 Solution
t = torch.arange(24).reshape(2, 3, 4)

print("Exercise 2 - Memory sharing results:")
print(f"{'Operation':35} {'Shares Memory':15} {'Why'}")
print("-" * 80)
print(f"{'t.flatten()':35} {str(shares_memory(t, t.flatten())):15} {'View (if contiguous)'}")
print(f"{'t.reshape(-1)':35} {str(shares_memory(t, t.reshape(-1))):15} {'View (if possible)'}")
print(f"{'t[0]':35} {str(shares_memory(t, t[0])):15} {'Slicing is a view'}")
print(f"{'t[0].clone()':35} {str(shares_memory(t, t[0].clone())):15} {'Clone always copies'}")
print(f"{'t.transpose(0,1).contiguous()':35} {str(shares_memory(t, t.transpose(0,1).contiguous())):15} {'contiguous() copies if needed'}")
print(f"{'t + 0':35} {str(shares_memory(t, t + 0)):15} {'Arithmetic creates new tensor'}")
print(f"{'t.float()':35} {str(shares_memory(t, t.float())):15} {'Type conversion copies'}")

Exercise 2 - Memory sharing results:
Operation                           Shares Memory   Why
--------------------------------------------------------------------------------
t.flatten()                         True            View (if contiguous)
t.reshape(-1)                       True            View (if possible)
t[0]                                True            Slicing is a view
t[0].clone()                        False           Clone always copies
t.transpose(0,1).contiguous()       False           contiguous() copies if needed
t + 0                               False           Arithmetic creates new tensor
t.float()                           False           Type conversion copies


In [77]:
# Exercise 3 Solution
def pairwise_distances_solution(X, Y):
    """Compute pairwise Euclidean distances using broadcasting."""
    # Method 1: Direct broadcasting (clearer but uses more memory)
    # diff = X[:, None, :] - Y[None, :, :]  # (n, m, d)
    # return torch.sqrt((diff ** 2).sum(dim=-1))  # (n, m)
    
    # Method 2: Expanded formula (more memory efficient)
    # ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a.b
    X_sqnorm = (X ** 2).sum(dim=1, keepdim=True)  # (n, 1)
    Y_sqnorm = (Y ** 2).sum(dim=1, keepdim=True)  # (m, 1)
    XY = X @ Y.T  # (n, m)
    
    sq_dists = X_sqnorm + Y_sqnorm.T - 2 * XY
    sq_dists = torch.clamp(sq_dists, min=0)  # Numerical stability
    return torch.sqrt(sq_dists)

X = torch.randn(5, 3)
Y = torch.randn(7, 3)
dists = pairwise_distances_solution(X, Y)
print(f"Exercise 3 - Output shape: {dists.shape}")  # Should be (5, 7)

# Verify with explicit loop
expected = torch.zeros(5, 7)
for i in range(5):
    for j in range(7):
        expected[i, j] = torch.sqrt(((X[i] - Y[j]) ** 2).sum())
print(f"Correct: {torch.allclose(dists, expected)}")

Exercise 3 - Output shape: torch.Size([5, 7])
Correct: True
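
PyTorch also ships a built-in for this computation, `torch.cdist`, which makes a convenient cross-check. A brief sketch comparing it with the direct broadcasting approach (the tolerance is an arbitrary choice):

```python
import torch

torch.manual_seed(0)
X_pts = torch.randn(5, 3)
Y_pts = torch.randn(7, 3)

# Direct broadcasting: (5, 1, 3) - (1, 7, 3) -> (5, 7, 3)
diff = X_pts[:, None, :] - Y_pts[None, :, :]
direct = diff.pow(2).sum(dim=-1).sqrt()

builtin = torch.cdist(X_pts, Y_pts)  # Euclidean (p=2) distance by default
print(f"Shapes: {direct.shape} vs {builtin.shape}")
print(f"Agree: {torch.allclose(direct, builtin, atol=1e-4)}")
```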


---
## Summary

Key takeaways from this notebook:

1. **Tensor Creation**: Use `torch.tensor()` for copies, `torch.from_numpy()` for shared memory
2. **dtypes**: Choose based on precision/memory trade-off; float16/bfloat16 for GPU training
3. **Devices**: Keep tensors on the same device; use device-agnostic patterns
4. **Memory Layout**: Understand strides and contiguity for performance
5. **Views vs Copies**: Most reshaping ops are views; be aware of shared memory
6. **Broadcasting**: Powerful for vectorized operations; understand the rules
7. **In-place ops**: Use carefully, especially with autograd

---
*Next: Module 1.2 - Autograd Internals*