<a href="https://colab.research.google.com/github/vs1242/Amusement_Park_Map/blob/main/Deep%20Learning%20for%20Image%20Recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**1)** PCA and NMF are both dimensionality-reduction techniques: they simplify the data while keeping its important patterns.
PCA transforms the data into a new coordinate system in which each principal component is the direction of maximum remaining variance, computed as a linear combination of the original features. The components are constrained to be orthogonal, so the projected features are uncorrelated, and PCA can be applied to any numerical data.
NMF, on the other hand, decomposes non-negative data into two non-negative matrices: a basis 𝑊 and coefficients H. The results are often more interpretable because each data point is represented as an additive combination of components.

Similarities: both reduce dimensionality, find patterns in the data, are unsupervised learning methods, and approximate the data with a linear model.

Differences: PCA handles any numerical data, produces orthogonal components, and maximizes the captured variance (equivalently, it minimizes the squared reconstruction error). NMF requires non-negative data, does not guarantee orthogonality, and yields an additive, parts-based representation by minimizing a divergence or reconstruction error under non-negativity constraints.

To summarize, NMF excels at producing interpretable results for non-negative data such as text or images, while PCA captures maximum variance and decorrelates the features.
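The contrast can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using made-up random data: PCA is computed from an SVD of the centered data, and NMF uses the classic multiplicative-update rule for the Frobenius objective. The data, the rank `k`, and the iteration count are illustrative assumptions, not values from this assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical non-negative data: 100 samples, 6 features.
X = rng.random((100, 6))

# --- PCA via SVD of the centered data ---
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
pca_components = Vt[:2]                      # top-2 principal directions
gram = pca_components @ pca_components.T     # identity if the rows are orthonormal

# --- NMF via multiplicative updates (Frobenius objective) ---
k, eps = 2, 1e-9                             # rank and a small constant against division by zero
W = rng.random((X.shape[0], k))
H = rng.random((k, X.shape[1]))
initial_error = np.linalg.norm(X - W @ H)
for _ in range(200):
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)
final_error = np.linalg.norm(X - W @ H)

print("PCA components orthonormal:", np.allclose(gram, np.eye(2)))
print("NMF factors non-negative:", bool((W >= 0).all() and (H >= 0).all()))
print("NMF reconstruction error before/after:", initial_error, final_error)
```

The PCA directions come out orthonormal by construction, while the NMF factors stay non-negative because every update multiplies a non-negative entry by a non-negative ratio.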

---

**2)** The weights and biases of a neural network must be initialized carefully because initialization affects learning efficiency and final performance. Proper weight initialization breaks symmetry, which prevents the neurons in a layer from all learning the same feature. It also controls gradient flow during backpropagation: keeping gradients from vanishing or exploding is essential to learning.

Methods such as Xavier/Glorot and He initialization scale the weights according to the number of input (and output) neurons of a layer, which stabilizes gradient flow and speeds up convergence. Biases are typically set to a small constant (e.g., zero); this is safe because the randomly initialized weights already break the symmetry.

Good initialization also reduces training time, avoids common pitfalls such as slow convergence or getting stuck in poor regions of the loss landscape, and provides the foundation for stable and efficient training.
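As a rough illustration of how these schemes scale the weights, the sketch below computes the standard deviations used by the normal variants of Xavier/Glorot and He initialization and draws a weight matrix with zero biases. The helper names and the layer sizes (512 inputs, 256 outputs) are hypothetical.

```python
import math
import random


def xavier_std(fan_in, fan_out):
    """Glorot/Xavier normal: Var(w) = 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in):
    """He/Kaiming normal (suited to ReLU layers): Var(w) = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)


def init_layer(fan_in, fan_out, std, seed=0):
    """Draw random weights (this breaks symmetry) and set biases to zero."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, std) for _ in range(fan_in)]
               for _ in range(fan_out)]
    biases = [0.0] * fan_out
    return weights, biases


weights, biases = init_layer(512, 256, he_std(512))
print("Xavier std:", xavier_std(512, 256))
print("He std:", he_std(512))
```

In PyTorch the same schemes are available through `torch.nn.init` (for example `xavier_normal_` and `kaiming_normal_`), so hand-written versions like this are only useful for understanding the scaling.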



---

**3)** Overfitting in ANNs, where the model fits the training data but loses accuracy on unseen data, can be prevented in several ways. Regularization techniques such as L1/L2 penalties penalize large weights, while dropout randomly deactivates neurons during training so that the network cannot rely too heavily on any single neuron. Designing smaller architectures with fewer layers or neurons controls model complexity and discourages memorization. Increasing the dataset size through collection or augmentation (e.g., flipping, cropping, rotating) lets the model learn from a more diverse set of examples. Early stopping monitors the validation loss and halts training when it begins to increase. Batch normalization stabilizes activations during training and has a mild regularizing effect. Moreover, cross-validation (e.g., k-fold) is used to tune hyperparameters and to estimate generalization. With these techniques, ANNs can perform robustly on both training and unseen data.
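Early stopping is the easiest of these techniques to show in a few lines. The sketch below is a hypothetical helper (the class name, `patience`, `min_delta`, and the validation losses are all made up for illustration) that stops once the validation loss has failed to improve for a given number of consecutive epochs.

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without validation improvement."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: improving for three epochs, then rising.
val_losses = [0.90, 0.70, 0.62, 0.64, 0.66, 0.71]
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print("Stopped at epoch", stopped_at, "with best loss", stopper.best_loss)
```

In practice the model weights from the best epoch would also be saved and restored after stopping.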


---
**4)** Convolutional Neural Networks (CNNs) and Multilayer Perceptrons (MLPs) are both feedforward neural networks, but they differ in structure and applications. CNNs are designed for grid-like input such as images and videos and exploit spatial hierarchies: convolutional layers extract local patterns, pooling layers reduce dimensionality, and fully connected layers perform the final classification. MLPs, on the other hand, consist only of fully connected layers, in which all input features are treated equally without regard to their spatial relationships.  

A key difference is **parameter sharing**: in a CNN, the weights of a convolutional filter are shared across all spatial positions, which drastically reduces the number of parameters and enables efficient detection of local features. MLPs do not share parameters, so they need far more weights for the same input size.  

CNNs are well suited to tasks such as image classification and object detection because they learn spatial features (edges, textures, etc.) automatically. MLPs suit structured data such as tabular data and perform well on small-scale problems, but they are less preferred for unstructured data. In general, CNNs do best on tasks with spatial dependencies, while MLPs are more general-purpose models.
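The effect of parameter sharing can be quantified with simple arithmetic. The sketch below compares a 3×3 convolution mapping 3 channels to 64 channels with a fully connected layer that would produce the same number of outputs for a 32×32 RGB image. The helper functions are illustrative, and the layer sizes are chosen to match the first convolution of the model used later in this notebook.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights of one shared filter bank plus one bias per output channel."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


def linear_params(in_features, out_features):
    """One weight per input-output pair plus one bias per output."""
    return in_features * out_features + out_features


# 3x3 convolution, 3 -> 64 channels, applied to a 32x32 image.
conv = conv2d_params(3, 64, 3)

# Fully connected layer from the flattened image (3*32*32 values) to the
# same number of outputs the convolution produces (64*32*32 values).
dense = linear_params(3 * 32 * 32, 64 * 32 * 32)

print("Convolutional layer parameters:", conv)     # 1792
print("Fully connected layer parameters:", dense)  # 201392128
```

The convolution needs under two thousand parameters because the same 3×3 filters are reused at every position, whereas the dense layer needs roughly two hundred million.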


---

**5)** The **vanishing gradient problem** arises in deep neural networks, especially those using **sigmoid activation functions**. During backpropagation, the gradients become very small as they are propagated backward through the network. The reason is that the derivative of the sigmoid function is at most 0.25 and is close to zero for large positive or negative inputs. As a result, the weights in the earlier layers are updated very slowly, so the network **learns slowly** or **training stalls**.  

The problem gets worse as networks get deeper, because the gradients are multiplied across many layers and decay exponentially. The layers closest to the input therefore receive almost no learning signal.  

To mitigate vanishing gradients, alternative activation functions such as **ReLU (Rectified Linear Unit)** are used, since they keep a gradient of 1 for positive inputs. Combining them with **batch normalization**, which normalizes layer outputs and stabilizes training, further reduces the risk. Together these techniques allow deeper networks to train better and converge faster.
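The exponential decay is easy to see numerically. This sketch considers only the activation derivatives (weights are ignored, which is a simplifying assumption) and multiplies the best-case sigmoid derivative across a hypothetical 10-layer chain, compared with ReLU on positive inputs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z)), at most 0.25 (at z = 0)."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def chained_gradient(local_grad, depth):
    """Product of the same local derivative across `depth` layers."""
    return local_grad ** depth


depth = 10
best_case_sigmoid = chained_gradient(sigmoid_grad(0.0), depth)   # 0.25 ** 10
saturated_sigmoid = chained_gradient(sigmoid_grad(5.0), depth)   # far smaller
relu_chain = chained_gradient(relu_grad(1.0), depth)             # stays at 1.0

print("Sigmoid, best case:", best_case_sigmoid)
print("Sigmoid, saturated:", saturated_sigmoid)
print("ReLU, positive inputs:", relu_chain)
```

Even in the best case the sigmoid chain shrinks the gradient by a factor of about a million over ten layers, while the ReLU chain leaves it unchanged.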


---


**6)** **ResNet (Residual Network)** uses **residual connections** that help resolve the **vanishing gradient problem** by letting gradients propagate more effectively through a deep network. In traditional deep networks, gradients can shrink as they propagate backward through many layers, which makes deep architectures hard to train. ResNet introduces **skip connections (identity mappings)**, which carry the input of a block past its intermediate layers and add it to the block's output.

This shortcut gives backpropagation a direct path around those layers, so the gradient does not shrink too much before it reaches the earliest layers. This matters most in very deep networks, where conventional gradient-based optimization suffers from vanishing gradients.

Residual connections improve **gradient flow** and make it possible to train networks with hundreds or even thousands of layers. As a result, they allow **deeper networks** to be trained, which improves performance on complex tasks such as image recognition.

In short, by improving the propagation of gradients, residual connections mitigate the vanishing gradient problem and let deeper neural networks learn effectively.
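Why the identity path helps can be shown with scalars. In this sketch each "layer" is the toy function f(x) = w·tanh(x), a made-up stand-in for a real block. A plain stack computes y = f(x), so the chain rule multiplies the factors f'(x); a residual stack computes y = x + f(x), so every factor becomes 1 + f'(x).

```python
import math


def layer(x, w):
    """Toy scalar layer standing in for a real block (an assumption)."""
    return w * math.tanh(x)


def layer_grad(x, w):
    """Derivative of the toy layer with respect to its input."""
    return w * (1.0 - math.tanh(x) ** 2)


def input_gradient(x, weights, residual):
    """d(output)/d(input) for a stack of scalar layers, via the chain rule."""
    grad = 1.0
    for w in weights:
        local = layer_grad(x, w)
        if residual:
            grad *= 1.0 + local      # y = x + f(x)  ->  dy/dx = 1 + f'(x)
            x = x + layer(x, w)
        else:
            grad *= local            # y = f(x)      ->  dy/dx = f'(x)
            x = layer(x, w)
    return grad


weights = [0.5] * 20                 # 20 layers with a hypothetical weight of 0.5
plain = input_gradient(0.3, weights, residual=False)
skip = input_gradient(0.3, weights, residual=True)

print("Plain stack gradient:", plain)
print("Residual stack gradient:", skip)
```

With these values every plain factor is at most 0.5, so the product collapses toward zero, while every residual factor is at least 1, so the gradient reaching the input does not vanish.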




# Coding Question Answer

**1)** To improve the performance of the LeNet-5 model on the CIFAR-10 dataset, the following modifications were made:

- **Data augmentation:** The `resize_transform` pipeline applies random cropping and horizontal flipping, which increases the diversity of the training set.
- **Model architecture:** The LeNet-5 architecture was changed to increase its capacity. The number of filters in each convolutional layer was increased, and batch normalization layers were added after each convolutional layer to make training more stable.
- **Optimizer:** The AdamW optimizer was used instead of standard Adam, with a weight decay of 1e-4 for better regularization.
- **Learning rate scheduler:** A StepLR scheduler was added to decrease the learning rate as training proceeds, which helps fine-tune the model in later epochs.
- **Dropout:** Dropout layers with a rate of 0.5 were added to the classifier part of the network to reduce overfitting.

Together, these modifications gave better feature extraction, less overfitting, and more stable training, which increased accuracy on the CIFAR-10 dataset.


---
**2)** Test accuracy: 81.01%

Experiment logging record for the best model:

- **Data augmentation:**
  - Random cropping (32x32 with padding = 4)
  - Random horizontal flipping
  - Normalization (mean = 0.5, standard deviation = 0.5 for every channel)
- **Model architecture:**
  - Increased the number of filters in the convolutional layers
  - Added a batch normalization layer right after every convolution
  - Extended the classifier with additional fully connected layers and dropout
- **Optimizer:**
  - AdamW with a learning rate of 0.001 and weight decay of 1e-4
- **Learning rate scheduler:**
  - StepLR with step size 30 and gamma 0.1
- **Training parameters:**
  - Batch size: 280
  - Number of epochs: 10

These changes improved feature extraction, controlled overfitting, and stabilized training, which contributed to the 81.01% accuracy on the CIFAR-10 dataset.
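To see what the scheduler settings in this log imply, the sketch below evaluates the closed-form StepLR rule, in which the learning rate is multiplied by gamma once every `step_size` epochs. The function name is hypothetical; the base rate, step size, and gamma are the values from the log.

```python
def step_lr(base_lr, epoch, step_size, gamma):
    """Learning rate in effect at a given 0-based epoch under a step schedule."""
    return base_lr * gamma ** (epoch // step_size)


base_lr, step_size, gamma = 0.001, 30, 0.1

for epoch in (0, 9, 19, 29, 30, 60):
    print("epoch", epoch, "-> lr", step_lr(base_lr, epoch, step_size, gamma))
```

With a step size of 30, the rate stays at 0.001 for every epoch below 30, so the first drop would only happen in a run of more than 30 epochs. A smaller step size would be needed for the schedule to take effect within a shorter run.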



In [None]:
import os
import time

import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader

from torchvision import datasets
from torchvision import transforms

import matplotlib.pyplot as plt
from PIL import Image


if torch.cuda.is_available():
    torch.backends.cudnn.deterministic = True

## Model Settings

In [None]:
##########################
### SETTINGS
##########################

# Hyperparameters
RANDOM_SEED = 1
LEARNING_RATE = 0.001
BATCH_SIZE = 280
NUM_EPOCHS = 10

# Architecture
NUM_FEATURES = 32*32
NUM_CLASSES = 10

# Other
if torch.cuda.is_available():
    DEVICE = "cuda:0"
else:
    DEVICE = "cpu"

GRAYSCALE = False

### CIFAR-10 Dataset

In [None]:
##########################
### CIFAR-10 Dataset
##########################

train_mean = (0.5, 0.5, 0.5)
train_std = (0.5, 0.5, 0.5)

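# Training pipeline: random augmentation followed by normalization
resize_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(train_mean, train_std)
])

# Evaluation pipeline: no random augmentation, so that test accuracy
# is measured on the unmodified images and is reproducible
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(train_mean, train_std)
])

# Note transforms.ToTensor() scales input images
# to 0-1 range
train_dataset = datasets.CIFAR10(root='data',
                                 train=True,
                                 transform=resize_transform,
                                 download=True)

test_dataset = datasets.CIFAR10(root='data',
                                train=False,
                                transform=test_transform)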


train_loader = DataLoader(dataset=train_dataset,
                          batch_size=BATCH_SIZE,
                          num_workers=8,
                          shuffle=True)

test_loader = DataLoader(dataset=test_dataset,
                         batch_size=BATCH_SIZE,
                         num_workers=8,
                         shuffle=False)

# Checking the dataset
for images, labels in train_loader:
    print('Image batch dimensions:', images.shape)
    print('Image label dimensions:', labels.shape)
    break

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to data/cifar-10-python.tar.gz
100%|██████████| 170M/170M [00:02<00:00, 85.2MB/s]
Extracting data/cifar-10-python.tar.gz to data
Image batch dimensions: torch.Size([280, 3, 32, 32])
Image label dimensions: torch.Size([280])


In [None]:
device = torch.device(DEVICE)
torch.manual_seed(0)

for epoch in range(2):

    for batch_idx, (x, y) in enumerate(train_loader):

        print('Epoch:', epoch+1, end='')
        print(' | Batch index:', batch_idx, end='')
        print(' | Batch size:', y.size()[0])

        x = x.to(device)
        y = y.to(device)
        break

Epoch: 1 | Batch index: 0 | Batch size: 280
Epoch: 2 | Batch index: 0 | Batch size: 280


In [None]:
class LeNet5(nn.Module):
    def __init__(self, num_classes, grayscale=False):
        super(LeNet5, self).__init__()

        self.grayscale = grayscale
        self.num_classes = num_classes

        if self.grayscale:
            in_channels = 1
        else:
            in_channels = 3

        # Feature extractor: two conv blocks, each halving the spatial size
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )

        # Classifier: fully connected layers with dropout
        self.classifier = nn.Sequential(
            nn.Linear(128 * 8 * 8, 1024),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        logits = self.classifier(x)
        probas = F.softmax(logits, dim=1)
        return logits, probas
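
The input size `128 * 8 * 8` of the first linear layer follows from the spatial sizes produced by the feature extractor. The sketch below traces those sizes for a 32×32 input using the standard output-size formulas for convolution and pooling. The helper names are hypothetical, and pooling is assumed to use a stride equal to its kernel size, which is the `nn.MaxPool2d` default.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel):
    """Spatial output size of max pooling with stride equal to the kernel size."""
    return size // kernel


size = 32                                    # CIFAR-10 images are 32x32
size = conv_out(size, kernel=3, padding=1)   # 32 (padding=1 keeps the size)
size = conv_out(size, kernel=3, padding=1)   # 32
size = pool_out(size, kernel=2)              # 16
size = conv_out(size, kernel=3, padding=1)   # 16
size = conv_out(size, kernel=3, padding=1)   # 16
size = pool_out(size, kernel=2)              # 8

flattened = 128 * size * size                # 128 channels after the last conv
print("Final feature map:", size, "x", size)
print("Flattened features:", flattened)
```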

In [None]:
torch.manual_seed(RANDOM_SEED)

model = LeNet5(NUM_CLASSES, GRAYSCALE)
model.to(DEVICE)

optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)


## Training

In [None]:
def compute_accuracy(model, data_loader, device):
    correct_pred, num_examples = 0, 0
    for i, (features, targets) in enumerate(data_loader):

        features = features.to(device)
        targets = targets.to(device)

        logits, probas = model(features)
        _, predicted_labels = torch.max(probas, 1)
        num_examples += targets.size(0)
        correct_pred += (predicted_labels == targets).sum()
    return correct_pred.float()/num_examples * 100

# Overrides the NUM_EPOCHS = 10 value set in the settings cell
NUM_EPOCHS = 20
start_time = time.time()
for epoch in range(NUM_EPOCHS):

    model.train()
    for batch_idx, (features, targets) in enumerate(train_loader):

        features = features.to(DEVICE)
        targets = targets.to(DEVICE)

        ### FORWARD AND BACK PROP
        logits, probas = model(features)
        cost = F.cross_entropy(logits, targets)
        optimizer.zero_grad()

        cost.backward()

        ### UPDATE MODEL PARAMETERS
        optimizer.step()


        ### LOGGING
        if not batch_idx % 50:
            print ('Epoch: %03d/%03d | Batch %04d/%04d | Cost: %.4f'
                   %(epoch+1, NUM_EPOCHS, batch_idx,
                     len(train_loader), cost))



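    # Advance the learning-rate schedule once per epoch; the scheduler was
    # created above but has no effect unless step() is called
    scheduler.step()

    model.eval()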
    with torch.set_grad_enabled(False): # save memory during inference
        print('Epoch: %03d/%03d | Train: %.3f%%' % (
              epoch+1, NUM_EPOCHS,
              compute_accuracy(model, train_loader, device=DEVICE)))

    print('Time elapsed: %.2f min' % ((time.time() - start_time)/60))

print('Total Training Time: %.2f min' % ((time.time() - start_time)/60))

Epoch: 001/020 | Batch 0000/0179 | Cost: 2.2918


## Evaluation

In [None]:
with torch.set_grad_enabled(False): # save memory during inference
    print('Test accuracy: %.2f%%' % (compute_accuracy(model, test_loader, device=DEVICE)))

In [None]:
class UnNormalize(object):
    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def __call__(self, tensor):
        """
        Parameters:
        ------------
        tensor (Tensor): Normalized tensor image of size (C, H, W).

        Returns:
        ------------
        Tensor: Un-normalized image (the input tensor is modified in place).

        """
        # Invert Normalize channel by channel: x = x_norm * std + mean
        for t, m, s in zip(tensor, self.mean, self.std):
            t.mul_(s).add_(m)
        return tensor

unorm = UnNormalize(mean=train_mean, std=train_std)
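
The un-normalization simply inverts the per-channel `Normalize` step. With the mean and standard deviation of 0.5 used in this notebook, a pixel value in the range [0, 1] is mapped to [-1, 1] and back again. The scalar helpers below are illustrative only and are not part of the notebook's code.

```python
def normalize(x, mean=0.5, std=0.5):
    """Per-value form of the normalization: (x - mean) / std."""
    return (x - mean) / std


def unnormalize(x_norm, mean=0.5, std=0.5):
    """Inverse of normalize: x_norm * std + mean."""
    return x_norm * std + mean


for pixel in (0.0, 0.25, 1.0):
    z = normalize(pixel)
    print(pixel, "->", z, "->", unnormalize(z))
```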

In [None]:
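# Draw a batch from the held-out test set for a visual check of the predictions
test_loader = DataLoader(dataset=test_dataset,
                         batch_size=BATCH_SIZE,
                         shuffle=True)

for features, targets in test_loader:
    break

# Inference only: evaluation mode and no gradient tracking
model.eval()
with torch.no_grad():
    _, predictions = model(features[:8].to(DEVICE))
predictions = torch.argmax(predictions, dim=1)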

d = {0: 'airplane',
     1: 'automobile',
     2: 'bird',
     3: 'cat',
     4: 'deer',
     5: 'dog',
     6: 'frog',
     7: 'horse',
     8: 'ship',
     9: 'truck'}

fig, ax = plt.subplots(1, 8, figsize=(20, 10))
for i in range(8):
    img = unorm(features[i])
    # Reorder (C, H, W) -> (H, W, C), the layout matplotlib expects
    ax[i].imshow(img.permute(1, 2, 0).numpy())
    ax[i].set_xlabel(d[predictions[i].item()])

plt.show()