## Before AlexNet: The Pre-Deep Neural Network Era

---

- Researchers used to train machine learning models on CPUs.

- CPUs had limited compute capacity, so training large models on them was not feasible.
- Training with large datasets was challenging.
- LeNet was one of the first models trained on medium-sized datasets, but not truly large ones.
- Hardware limitations were a major factor in using small datasets with fewer parameters:
  - NVIDIA's GeForce 256 from 1999 could process at most 480 million floating-point operations per second (480 MFLOPS).
  - There was no general-purpose programming framework, such as today's CUDA, for running non-graphics workloads on these accelerators.
  - In contrast, today's accelerators can perform over 1000 TFLOPS per device.
- The activation functions in common use (sigmoid and tanh) saturate, which slows the training of deep networks.
- Moreover, datasets were still relatively small: OCR on 60,000 low-resolution 28x28 pixel images was considered a highly challenging task.
---
## ImageNet 

- ImageNet was released in 2009. The dataset comprised 12 million images across 22,000 categories.

-------

![imagnet](../../images/imagent.webp)

-------


- The team that created the ImageNet dataset started organizing the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

- The team with the lowest error rate wins.
- The ImageNet Challenge became popular among researchers and became the standard benchmark for evaluating vision models.

---

# AlexNet: The Beginning of the Deep Neural Network Era

---

- AlexNet had 60 million parameters.

- Training on CPUs was impractical due to the large number of parameters.
- A major breakthrough occurred when Alex Krizhevsky and Ilya Sutskever implemented a deep CNN that could run on GPUs.
- They realized that the computational bottlenecks in CNNs, such as convolutions and matrix multiplications, could be parallelized in hardware.
- They implemented fast convolutions on two NVIDIA GTX 580 GPUs, each with 3 GB of memory and capable of about 1.5 TFLOPS. The full network did not fit in the memory of a single GPU, so it was split across the two.
- The two halves of the network communicated only at specific layers, which kept them part of one model rather than two separately trained ones.
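- AlexNet was released in 2012 and won the ImageNet competition (ILSVRC-2012) by a large margin: a top-5 test error of 15.3%, compared with 26.2% for the second-best entry.
- The authors combined several techniques to improve AlexNet's performance, including ReLU activations, local response normalization, overlapping pooling, data augmentation, and dropout.
- The paper marked a turning point: deep networks trained on GPUs became the dominant approach in computer vision and, soon after, across the AI field.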

---
- ![imgnet](../../images/imagnet-win.webp) 
----
## Architecture

**The network consists of 8 learned layers:**
- 5 convolutional layers
- 3 fully-connected layers
---
![arc](../../images/alexnet-arc.webp)
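
The figure of 60 million parameters quoted earlier can be checked by counting weights and biases layer by layer. The sketch below assumes the layer shapes reported in the original paper (they are not listed in these notes), including the two-GPU split that halves the input channels seen by the second, fourth, and fifth convolutional layers. The helper names `conv_params` and `fc_params` are illustrative.

```python
def conv_params(out_channels, kernel, in_channels):
    """Weights plus one bias per output channel."""
    return out_channels * kernel * kernel * in_channels + out_channels


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


# Layer shapes assumed from the original paper, not from these notes.
layers = {
    "conv1": conv_params(96, 11, 3),
    "conv2": conv_params(256, 5, 48),   # sees only the 48 maps on its own GPU
    "conv3": conv_params(384, 3, 256),  # sees all 256 maps from both GPUs
    "conv4": conv_params(384, 3, 192),
    "conv5": conv_params(256, 3, 192),
    "fc6": fc_params(6 * 6 * 256, 4096),
    "fc7": fc_params(4096, 4096),
    "fc8": fc_params(4096, 1000),
}
total = sum(layers.values())

for name, count in layers.items():
    print(f"{name}: {count:,}")
print(f"total: {total:,}")
```

Under these assumptions the total comes to roughly 61 million, and the three fully-connected layers account for most of it.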

---
### ReLU Nonlinearity

- Used f(x) = max(0, x) as the activation function.

- Trains several times faster than tanh units.
- Does not require input normalization to prevent saturation.
---
- ![relu](../../images/relu.webp)
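
A small sketch of why ReLU avoids saturation: its gradient stays at 1 for any positive input, while the gradient of tanh shrinks toward 0 as the input grows. The function names here are illustrative and not taken from any library.

```python
import math


def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


for x in (0.5, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  relu'={relu_grad(x):.1f}  tanh'={tanh_grad(x):.2e}")
```

For large inputs the tanh gradient becomes vanishingly small, so weight updates through a tanh unit stall, whereas the ReLU gradient is unchanged.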


### Local Response Normalization

- Applied after the ReLU activation in certain layers (specifically after the first and second convolutional layers in AlexNet).

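- Each activation is divided by a term that grows with the squared activations of adjacent kernel maps at the same spatial position, which keeps a single feature from dominating its neighbors.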
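
The normalization can be sketched for a single spatial position as follows. The formula and the constants `k=2`, `n=5`, `alpha=1e-4`, and `beta=0.75` are assumed from the original paper rather than from these notes, and the function name is illustrative. Library implementations may scale `alpha` differently.

```python
def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize the channel activations `a` at one spatial position.

    b[i] = a[i] / (k + alpha * sum_j a[j]**2) ** beta,
    where j runs over the n channels centered on i, clipped at the edges.
    """
    num_channels = len(a)
    out = []
    for i in range(num_channels):
        lo = max(0, i - n // 2)
        hi = min(num_channels - 1, i + n // 2)
        sq_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * sq_sum) ** beta)
    return out


acts = [1.0, 2.0, 300.0, 2.0, 1.0, 0.5, 0.5, 1.0]  # one very strong channel
normed = local_response_norm(acts)
for before, after in zip(acts, normed):
    print(f"{before:7.1f} -> {after:.4f}")
```

Channels next to the strong one are scaled down more than channels whose window does not include it, which is the competition effect described above.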


---

### Overlapping Pooling

- Pooling layers summarize outputs of neighboring groups of neurons

- Use overlapping pooling: z > s, where z is the filter size and s is the stride (AlexNet uses z = 3 and s = 2)
- Reduces top-1 and top-5 error rates by 0.4% and 0.3%, respectively, compared with non-overlapping pooling (z = 2, s = 2)
---
- ![overlapping](../../images/overlapping.webp)
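
A one-dimensional sketch of the difference between overlapping and non-overlapping max pooling, assuming z = 3 and s = 2 for the overlapping case (the values reported in the paper). The helpers `max_pool_1d` and `pooled_size` are illustrative names, and the 55-wide input in the last line is the first convolutional layer's output width as given in the paper.

```python
def max_pool_1d(xs, z, s):
    """Max over windows of size z taken every s positions (no padding)."""
    return [max(xs[i:i + z]) for i in range(0, len(xs) - z + 1, s)]


def pooled_size(n, z, s):
    """Number of windows of size z with stride s that fit in n positions."""
    return (n - z) // s + 1


row = [1, 5, 2, 8, 3, 7, 9]
overlapping = max_pool_1d(row, z=3, s=2)      # neighboring windows share one element
non_overlapping = max_pool_1d(row, z=2, s=2)  # windows are disjoint

print(overlapping)            # [5, 8, 9]
print(non_overlapping)        # [5, 8, 7]
print(pooled_size(55, 3, 2))  # 27
```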

---

## Reducing Overfitting

### Data Augmentation

Two forms of data augmentation were used:

1. Image translations and horizontal reflections
   - Extract random 224x224 patches (and their horizontal reflections) from 256x256 images
   - Increases training set by a factor of 2048

2. Altering RGB channel intensities
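   - Performs PCA on the set of RGB pixel values across the training set
   - Adds multiples of the principal components to each training image, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean 0 and standard deviation 0.1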
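
The first form of augmentation can be sketched in plain Python. This is a minimal illustration, assuming an image stored as a list of pixel rows; `random_crop_and_flip` is a hypothetical helper, not the authors' code.

```python
import random


def random_crop_and_flip(image, crop=224, rng=random):
    """Return a random crop x crop patch, mirrored horizontally half the time.

    `image` is a list of rows, each row a list of pixels.
    """
    height, width = len(image), len(image[0])
    top = rng.randrange(height - crop + 1)
    left = rng.randrange(width - crop + 1)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # horizontal reflection
    return patch


# The factor of 2048 corresponds to (256 - 224)^2 offsets times 2 reflections.
factor = (256 - 224) ** 2 * 2

image = [[(r, c) for c in range(256)] for r in range(256)]  # dummy "pixels"
patch = random_crop_and_flip(image)
print(factor, len(patch), len(patch[0]))
```

Because a new patch is drawn every time an image is visited, the network rarely sees exactly the same input twice, even though the patches are highly correlated.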

### Dropout

- Randomly drops out neurons during training (probability 0.5)

- Reduces complex co-adaptations of neurons
- Forces the network to learn more robust features
- Used in the first two fully-connected layers
---
- ![dorpout](../../images/dropout2.svg)
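
A minimal sketch of dropout as described in the paper: neurons are zeroed at random during training, and at test time all neurons are kept but their outputs are multiplied by 0.5. The function names are illustrative. Modern frameworks usually apply the equivalent "inverted" variant, which rescales during training instead.

```python
import random


def dropout_train(activations, p=0.5, rng=random):
    """Zero each activation independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]


def dropout_test(activations, p=0.5):
    """Keep every neuron but scale outputs by the keep probability 1 - p."""
    return [a * (1.0 - p) for a in activations]


hidden = [0.8, 1.5, 0.0, 2.3, 0.4, 1.1]
print(dropout_train(hidden))  # a different random subset is zeroed on each call
print(dropout_test(hidden))   # deterministic, every value halved
```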

--- 

## GPU Utilization 

- The network was trained using two NVIDIA GTX 580 GPUs

- Each GPU holds roughly half of the kernels (neurons): one GPU handles the top half and the other handles the bottom half.
- Splitting the network this way reduces training time significantly.

## Training 

- Stochastic gradient descent with batch size of 128

- Momentum of 0.9 and weight decay of 0.0005
- Learning rate initialized at 0.01, reduced by factor of 10 when validation error rate stopped improving
- Trained for approximately 90 epochs over the training set of 1.2 million images spanning 1000 classes
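
The hyperparameters above fit together in the update rule given in the paper, sketched here for a single scalar weight. The function names are illustrative, and `maybe_decay_lr` is a simplified stand-in for the schedule: the authors adjusted the learning rate manually.

```python
def sgd_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One update of a single weight, following the paper's rule:

    v <- momentum * v - weight_decay * lr * w - lr * grad
    w <- w + v
    """
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v


def maybe_decay_lr(lr, val_errors, factor=10.0):
    """Divide lr by `factor` if the latest validation error did not improve."""
    if len(val_errors) >= 2 and val_errors[-1] >= val_errors[-2]:
        return lr / factor
    return lr


w, v, lr = 1.0, 0.0, 0.01
for grad in (0.5, 0.4, 0.3):  # made-up gradients, for illustration only
    w, v = sgd_step(w, v, grad, lr)
    print(f"w={w:.6f}  v={v:.6f}")
```

In a real training loop the gradient would be averaged over a batch of 128 examples before each call to `sgd_step`.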

---

# AlexNet in Practice (Code)