## Convolutional Neural Networks

Images can vary due to occlusions, lighting, rotations, and translations; models must be robust to intra-class variations but sensitive to inter-class differences.

Manual feature definition (e.g., eyes/nose for faces) is brittle; neural networks learn features hierarchically from data (pixels → edges → complex shapes like eyes).

**Limitations of fully connected networks:**
Flattening a 2D image into a 1D vector discards its spatial structure. The parameter count is also high: a 100x100 image gives 10,000 inputs, so a single hidden layer of 1,000 units already needs 10 million weights.
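
To make the parameter count concrete, here is a minimal sketch that compares a dense layer with a small convolutional layer. The hidden width of 1,000 units and the sixteen 3x3 filters are illustrative assumptions, not values fixed by these notes.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit in a fully connected layer."""
    return n_inputs * n_units + n_units


def conv_params(k, c_in, c_out):
    """Weights plus one bias per filter in a k x k convolutional layer."""
    return k * k * c_in * c_out + c_out


# A 100x100 grayscale image flattened to 10,000 inputs, feeding a
# hypothetical hidden layer of 1,000 units.
fc_count = dense_params(100 * 100, 1000)

# Sixteen hypothetical 3x3 filters applied to the same single-channel image.
conv_count = conv_params(3, 1, 16)

print(fc_count)    # 10001000
print(conv_count)  # 160
```

The convolutional count does not depend on the image size, because the same filter weights are reused at every position.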

Solution: Connect neurons to local patches of the input to preserve spatial relationships. This is the essence of a CNN. 

To these small patches, you apply filters, which extract local features such as edges or shapes.

**How Convolution Works:**
1. Input: Typically an image, represented as a 2D or 3D tensor. 
2. Filter: A small matrix (e.g., 3x3 or 5x5) with learnable weights.
3. Convolution Operation: The filter slides over the input image with a specified stride (step size), computing a dot product between the filter weights and the local patch at each position. This produces a feature map (or activation map), which highlights regions where the filter’s pattern (e.g., an edge) is present.
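
The three steps above can be sketched in NumPy as follows. The function name `slide_filter`, the toy image, and the hand-written vertical-edge filter are all assumptions made for illustration; in a real CNN the filter weights are learned.

```python
import numpy as np


def slide_filter(image, kernel, stride=1):
    """Slide `kernel` over `image` and return the feature map (no padding)."""
    h, w = image.shape
    k = kernel.shape[0]
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + k, left:left + k]
            # Dot product between the filter weights and the local patch.
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map


# Toy 5x5 image: dark (0) on the left, bright (1) on the right.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# Hand-written vertical-edge filter (illustrative, not learned).
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

feature_map = slide_filter(image, kernel, stride=1)
print(feature_map)
# [[3. 3. 0.]
#  [3. 3. 0.]
#  [3. 3. 0.]]
```

The feature map is large where the patch straddles the dark-to-bright edge and zero where the patch is uniformly bright. A larger stride visits fewer positions and therefore gives a smaller feature map.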



So a CNN is very similar to a normal NN, but instead of applying the weights and bias to the whole input at once, you first apply them to a small patch. You then repeat this step with the patch shifted over slightly, reusing the same weights at every position.

The next step is to take the feature map produced by the filter and apply a non-linearity. A very popular choice is the ReLU function, which goes through the feature map value by value and sets every negative value to zero.
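
As a small illustration of ReLU, the sketch below applies it to a made-up feature map. The helper name `relu` and the input values are assumptions chosen only for this example.

```python
import numpy as np


def relu(x):
    """Element-wise ReLU: negative values become 0, the rest pass through."""
    return np.maximum(x, 0)


feature_map = np.array([[3.0, -1.5, 0.0],
                        [-2.0, 4.0, -0.5]])
activated = relu(feature_map)
print(activated)
# [[3. 0. 0.]
#  [0. 4. 0.]]
```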

**Pooling:** Pooling reduces the spatial size of the feature maps as you go deeper into the network. Instead of computing a weighted sum over each patch, as convolution does, max pooling takes the maximum value of each patch and propagates only those maximums. Another option is mean (average) pooling, which takes the average of each patch.

Pooling layers reduce the spatial dimensions (height and width) of feature maps while preserving important information.
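
Here is a minimal NumPy sketch of max and mean pooling. It assumes non-overlapping 2x2 windows and a feature map whose height and width are divisible by the window size; the function name `pool2d` and the input values are made up for illustration.

```python
import numpy as np


def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling with a `size` x `size` window.

    Assumes the height and width are divisible by `size`.
    """
    h, w = feature_map.shape
    # Split the map into (h/size) x (w/size) blocks of size x size values.
    blocks = feature_map.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))


feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 0, 5, 6],
                        [1, 2, 7, 8]], dtype=float)

max_pooled = pool2d(feature_map, mode="max")
mean_pooled = pool2d(feature_map, mode="mean")

print(max_pooled)
# [[4. 2.]
#  [2. 8.]]
print(mean_pooled)
# [[2.5  1.  ]
#  [0.75 6.5 ]]
```

Both variants turn the 4x4 map into a 2x2 map; they differ only in how each 2x2 block is summarized.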

**Flattening and Global Pooling:**
After several rounds of convolution, ReLU, and pooling, the feature maps are typically flattened or globally pooled and fed into fully connected layers for tasks like classification.
- Flattening: A 7x7x512 feature map becomes a 7×7×512 = 25,088-element vector, which is fed into dense layers.
- Global Average Pooling: Takes the average of each feature map (e.g., 7x7 → 1 value per channel), producing a 512-element vector. This is more parameter-efficient and common in modern CNNs.
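
The two options can be compared on a random stand-in for a 7x7x512 feature map. The 10-class dense layer used for the weight count is a hypothetical choice for illustration.

```python
import numpy as np

# Random stand-in for a 7x7x512 feature map (height, width, channels).
rng = np.random.default_rng(0)
features = rng.standard_normal((7, 7, 512))

flattened = features.reshape(-1)          # every value is kept
global_avg = features.mean(axis=(0, 1))   # one average per channel

print(flattened.shape)   # (25088,)
print(global_avg.shape)  # (512,)

# Weights needed by a hypothetical 10-class dense layer (biases ignored).
n_classes = 10
flatten_weights = flattened.size * n_classes
gap_weights = global_avg.size * n_classes
print(flatten_weights, gap_weights)  # 250880 5120
```

Under these assumptions, the dense layer after global average pooling needs 49 times fewer weights than the one after flattening.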

**A typical CNN architecture looks like:**
$$ \text{Input} \rightarrow [\text{Conv} \rightarrow \text{ReLU} \rightarrow \text{Pooling}] \times N \rightarrow \text{Flatten/Global Pool} \rightarrow \text{FC} \rightarrow \text{Output} $$
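
To see how tensor shapes change along this pipeline, here is a rough shape tracer in plain Python. It assumes size-preserving convolutions and 2x2 pooling with stride 2; the 32x32x3 input, the channel counts 16, 32, 64 and the 10 output classes are illustrative assumptions.

```python
def trace_shapes(height, width, channels, block_channels, n_classes):
    """Return the tensor shape after each stage of a Conv-ReLU-Pool stack.

    Assumes size-preserving convolutions and 2x2 pooling with stride 2.
    """
    shapes = [("Input", (height, width, channels))]
    for i, out_channels in enumerate(block_channels, start=1):
        channels = out_channels                  # conv changes the channel count
        shapes.append((f"Conv{i}+ReLU", (height, width, channels)))
        height, width = height // 2, width // 2  # pooling halves height and width
        shapes.append((f"Pool{i}", (height, width, channels)))
    shapes.append(("Flatten", (height * width * channels,)))
    shapes.append(("FC/Output", (n_classes,)))
    return shapes


shapes = trace_shapes(32, 32, 3, block_channels=[16, 32, 64], n_classes=10)
for name, shape in shapes:
    print(f"{name:12s} {shape}")
```

With $N = 3$ blocks the spatial size shrinks from 32x32 to 4x4 while the channel count grows from 3 to 64, so the flattened vector has 4 × 4 × 64 = 1,024 elements.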



### Maths:

Note: True convolution involves flipping the kernel (filter) before sliding it over the input.
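In practice, CNNs use cross-correlation (no flip). Because the kernel weights are learned, the network can simply learn the flipped kernel if it needs to, so the two are interchangeable for training, and cross-correlation is simpler to implement.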
Cross-Correlation Operation: For a 2D input matrix $ I $ (e.g., an image) and a kernel $ K $ of size $ k \times k $, the output $ O $ at position $ (i, j) $ is:
$$O_{i,j} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I_{i+m, j+n} \cdot K_{m,n}$$
This is the dot product of the kernel and the overlapping input patch.
To arrive at this: Start with the definition of 2D cross-correlation. Align the kernel's top-left with the input's position $ (i,j) $, multiply element-wise, and sum. No flipping of $ K $.
The goal is to extract features like edges or patterns by applying multiple kernels.
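
The formula for $O_{i,j}$ translates almost directly into NumPy. The sketch below also builds true convolution by flipping the kernel first, to show how the two differ; the function names and the small test matrices are assumptions for illustration.

```python
import numpy as np


def cross_correlate2d(I, K):
    """Valid-mode 2D cross-correlation, following the formula for O[i, j]."""
    h, w = I.shape
    k = K.shape[0]
    O = np.zeros((h - k + 1, w - k + 1))
    for i in range(O.shape[0]):
        for j in range(O.shape[1]):
            # Align the kernel's top-left with (i, j), multiply, and sum.
            O[i, j] = np.sum(I[i:i + k, j:j + k] * K)
    return O


def convolve2d(I, K):
    """True convolution: flip the kernel along both axes, then cross-correlate."""
    return cross_correlate2d(I, K[::-1, ::-1])


I = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1.0, 0.0],
              [0.0, -1.0]])

corr = cross_correlate2d(I, K)
conv = convolve2d(I, K)

print(corr[0, 0])  # -5.0
print(conv[0, 0])  # 5.0
```

For this asymmetric kernel the two operations give opposite signs. For a kernel that is unchanged by the flip they would give identical results.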

1. Methods of applying the filter (kernel) in the convolution:
**Valid Mode:** No padding; the kernel only slides where it fully overlaps the input.
For an input of size $ h \times w $ and kernel $ k \times k $:
Output size: $ (h - k + 1) \times (w - k + 1) $.
**Full Mode:** Adds zero-padding around the input so the kernel can slide beyond the edges, producing a larger output.
Padding size: $ k - 1 $ on each side.
Output size: $ (h + k - 1) \times (w + k - 1) $.
Padding of $ \frac{k-1}{2} $ on each side (for odd $ k $) instead gives "same" mode, where the output stays $ h \times w $.
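
The output sizes can be checked with a short NumPy sketch that zero-pads the input and then cross-correlates. Full mode here pads $k - 1$ zeros on every side, which is what yields the $(h + k - 1) \times (w + k - 1)$ output; the helper name `correlate_padded` and the all-ones test data are assumptions for illustration.

```python
import numpy as np


def correlate_padded(I, K, pad=0):
    """Cross-correlate I with K after zero-padding `pad` cells on every side."""
    I = np.pad(I, pad, mode="constant")
    h, w = I.shape
    k = K.shape[0]
    O = np.zeros((h - k + 1, w - k + 1))
    for i in range(O.shape[0]):
        for j in range(O.shape[1]):
            O[i, j] = np.sum(I[i:i + k, j:j + k] * K)
    return O


h, w, k = 5, 6, 3
I = np.ones((h, w))
K = np.ones((k, k))

valid = correlate_padded(I, K, pad=0)            # no padding
full = correlate_padded(I, K, pad=k - 1)         # k - 1 zeros per side
same = correlate_padded(I, K, pad=(k - 1) // 2)  # (k - 1) / 2 zeros per side

print(valid.shape)  # (3, 4)
print(full.shape)   # (7, 8)
print(same.shape)   # (5, 6)
```

The third case is often called "same" mode, because the output keeps the input's height and width.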

2. Convolutional Layer: 
    - Forward pass: 
        - Input: A 3D tensor (image).
        - Kernels: Multiple 3D filters.
        - Computation: $$O[:,:,o] = \sum_{c=1}^{c_{in}} \text{cross\_correlate}(I[:,:,c], K[:,:,c,o]) + b_o$$
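
The forward pass above can be sketched in NumPy as follows. This is a plain-loop illustration in valid mode with stride 1, not how a library implements it; the function name, the shape layout `(h, w, c_in)` and the all-ones test data are assumptions.

```python
import numpy as np


def conv_layer_forward(I, K, b):
    """Forward pass of a convolutional layer (valid mode, stride 1).

    I: input of shape (h, w, c_in)
    K: kernels of shape (k, k, c_in, c_out)
    b: biases of shape (c_out,)
    """
    h, w, c_in = I.shape
    k, _, _, c_out = K.shape
    O = np.zeros((h - k + 1, w - k + 1, c_out))
    for o in range(c_out):
        for i in range(O.shape[0]):
            for j in range(O.shape[1]):
                patch = I[i:i + k, j:j + k, :]  # shape (k, k, c_in)
                # Summing over the whole patch covers both the spatial sum
                # and the sum over input channels c.
                O[i, j, o] = np.sum(patch * K[:, :, :, o]) + b[o]
    return O


I = np.ones((5, 5, 3))       # toy RGB-like input
K = np.ones((3, 3, 3, 2))    # two 3x3 filters spanning 3 channels
b = np.array([0.0, 1.0])

O = conv_layer_forward(I, K, b)
print(O.shape)   # (3, 3, 2)
print(O[0, 0])   # [27. 28.]
```

Each output channel sums 3 × 3 × 3 = 27 products and then adds its own bias, which is why the two channels differ by exactly 1 here.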

        
        
Here is an example in PyTorch.




In [None]:
import torch                    # Core PyTorch library
import torch.nn as nn            # For building neural network layers
import torch.optim as optim      # For optimization algorithms (like SGD or Adam)
import torchvision               # For datasets, models, and image transforms
import torchvision.transforms as transforms  # For image preprocessing
from torch.utils.data import DataLoader      # To load data in batches

# Pick the compute device: GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on device: {device}")


# Preprocessing applied to every image as it is loaded
transform = transforms.Compose([
    transforms.Resize((64, 64)),          # Resize all images to 64x64 pixels
    transforms.ToTensor(),                # Convert images to PyTorch tensors in [0, 1]
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))  # Per-channel scaling to [-1, 1]
])

#Load a sample dataset (CIFAR10 has 10 classes of small 32x32 color images)
train_dataset = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=transform
)
test_dataset = torchvision.datasets.CIFAR10(
    root="./data", train=False, download=True, transform=transform
)

# batch loaders
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# define a simple cnn to look at images 
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        # First convolution layer: 3 input channels (RGB), 16 output filters
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        # Second convolution layer
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        # Pooling layer to reduce image size
        self.pool = nn.MaxPool2d(2, 2)
        # Flattened image size = 32 filters * (64/4 * 64/4) pixels = 32 * 16 * 16
        self.fc1 = nn.Linear(32 * 16 * 16, 128)
        self.fc2 = nn.Linear(128, 10)  # 10 output classes for CIFAR10

    # Reminder: ReLU sets all negative values to 0.
    # Pooling keeps the maximum (or average) of each patch in order to shrink the feature maps.

    def forward(self, x):
        # Apply first conv layer + ReLU + pooling
        x = self.pool(torch.relu(self.conv1(x)))
        # Apply second conv layer + ReLU + pooling
        x = self.pool(torch.relu(self.conv2(x)))
        # Flatten the image for the fully connected layer
        x = x.view(-1, 32 * 16 * 16)
        # Pass through first fully connected layer + ReLU
        x = torch.relu(self.fc1(x))
        # Final layer (no activation, handled by loss)
        x = self.fc2(x)
        return x
    
# init the model 
model = SimpleCNN().to(device)

# Cross-entropy loss combines log-softmax with negative log-likelihood; it is the standard loss for multi-class classification
criterion = nn.CrossEntropyLoss()          

# Adam is an adaptive optimizer, a refinement of plain gradient descent
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(3):  # three full passes over the dataset
    running_loss = 0.0  # accumulates the loss over all batches in this epoch
    for images, labels in train_loader:  # each iteration yields one batch of images and labels
        images, labels = images.to(device), labels.to(device)  # move the batch to the same device as the model

        optimizer.zero_grad()  # reset all gradients to 0 before the new batch

        outputs = model(images) # feeds the images into the model 

        loss = criterion(outputs, labels) # checks how far off the models predictions were

        loss.backward()  # backpropagation: PyTorch computes the gradients

        optimizer.step()  # uses the gradients to adjust the model's weights (with Adam)

        running_loss += loss.item()  # add this batch's loss to the epoch total
    
    print(f"Epoch [{epoch+1}/3] - Loss: {running_loss/len(train_loader):.4f}")

print("Training complete!")

# 🧪 Quick test on one batch
dataiter = iter(test_loader)
images, labels = next(dataiter) # next batch of data 

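images, labels = images.to(device), labels.to(device)  # move the batch to the model's device

model.eval()  # switch to inference mode (no effect for this model, but a good habit)
with torch.no_grad():  # gradients are not needed for evaluation
    outputs = model(images)  # raw class scores (logits), shape (batch, 10)

# The predicted class is the index of the largest logit in each row
_, predicted = torch.max(outputs, 1)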

print("Predicted labels:", predicted[:10].cpu().numpy())
print("Actual labels:   ", labels[:10].cpu().numpy())

