### <div align="center">Getting Started</div>

- Deep learning is a machine learning technique that uses neural networks to learn from large amounts of data, mimicking the human brain's ability to recognize patterns and make decisions.
- Why Deep Learning Needs More Data:
  - Deep learning networks, especially deep neural nets and CNNs, have huge parameter counts and automatically learn high-level representations—this means more data is needed to avoid overfitting and leverage their modeling power.
  - For small datasets, statistical models often outperform deep learning due to their built-in assumptions and feature engineering, which help with generalization when sample size is low.
- Neural Network Architecture
  - Feed Forward Neural Network (Something like a juicer :-P, Ex: Weather prediction, demand forecasting).
  - Recurrent Neural Network (Something like a soup: taste the output and adjust salt, masala, etc. again, Ex: Machine translation, speech recognition (e.g., Google Assistant)).
  - Convolutional Neural Network (Used in image or video use cases; processes each feature and flattens at the end, Ex: Autonomous driving, photo classification, disease diagnosis).
  - Transformers (Ex: using tools like ChatGPT)

### <div align="center">Neural Network: Fundamentals</div>

- Neurons
  - Neurons are the basic building blocks of neural networks in deep learning.
  - Neurons apply a weight to the input and use an activation function to determine the output.
- The Purpose of Activation Function
  - Real-world problems are often non-linear in nature, and activation functions help introduce non-linearity in a neural network.
  - With activation functions, neurons either "fire" or "do not fire." This can be thought of as individual detectives who are given a specific task and, after investigation, they give their findings to a judge (which is the neuron in the next layer).
  - In the case of the insurance prediction example, the hidden layer has two detectives: one responsible for figuring out awareness and the other for affordability. Each gives a conclusion, e.g. that a person's awareness is 0 or 1 (if using a step activation function), or a number between 0 and 1, say 0.7 (when using a sigmoid activation function).
- Activation Functions: Sigmoid, ReLU, Tanh, SoftMax
  - Sigmoid, Softmax, tanh, ReLU are the most commonly used activation functions.
  - Sigmoid is primarily used in the output layer for binary classification problems (e.g., will a person buy insurance, is the transaction fraud).
  - Softmax is primarily used in the output layer for multi-class classification problems (e.g., handwritten digits classification, clothes classification).
  - tanh is similar to Sigmoid; the only difference is that the output range is -1 to 1 (for Sigmoid, the range is 0 to 1).
  - ReLU is a default choice for neurons in hidden layers as it is fast to calculate and also doesn’t suffer much from vanishing gradient problems.
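
The four activation functions above can be sketched in plain Python. This is a minimal illustration using only the standard library; the function names are chosen for this example and are not part of any framework.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squash any real number into the range (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Pass positive values through unchanged, clamp negatives to 0."""
    return max(0.0, x)

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    largest = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))                # 0.5
print(relu(-2.0), relu(3.0))       # 0.0 3.0
print(softmax([1.0, 2.0, 3.0]))    # three probabilities, largest for the last score
```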

### <div align="center">PyTorch</div>

##### Matrix Fundamentals
- In neural networks, weights can be efficiently multiplied with the output from the previous layer using matrix multiplication. If you are using a GPU, this becomes even faster as it will use multiple cores to compute dot products in parallel.
- Neural networks require a lot of matrix multiplications and that is the reason why GPUs got very popular for deep learning as it helps in parallel processing.
##### PyTorch Tensor Basics
- Tensors are the basic building blocks of deep learning.
- Tensor is a generic term that covers scalars, vectors, matrices, and higher-dimensional arrays:
  - 0-dimensional tensor: a single variable or number, called a scalar.
  - 1-dimensional tensor: called a vector.
  - 2-dimensional tensor: called a matrix.
  - 3-dimensional tensor: can be pictured as a cube.
- Tensor can have any number of dimensions.
- Using `torch.tensor()`, you can create a tensor object. Tensor objects look very much like NumPy arrays. NumPy arrays cannot be created on a GPU directly, whereas a tensor object can be created directly in GPU memory.
- Tensors have NumPy-like and DataFrame-like attributes such as `dtype`, `shape`, `device`, etc.
- The `view()` method lets you reshape a tensor.
- `zeros()`, `ones()`, and `rand()` can be used to create a new tensor with specific values.
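
The tensor basics above can be tried out with a short sketch. It assumes PyTorch is installed; the variable names are illustrative, and the GPU branch only runs if CUDA is available.

```python
import torch

scalar = torch.tensor(3.5)                # 0-dimensional tensor
vector = torch.tensor([1.0, 2.0, 3.0])    # 1-dimensional tensor
matrix = torch.zeros(2, 3)                # 2-dimensional tensor filled with zeros
cube = torch.rand(2, 3, 4)                # 3-dimensional tensor with random values

print(scalar.dim(), vector.dim(), matrix.dim(), cube.dim())  # 0 1 2 3
print(matrix.shape, matrix.dtype, matrix.device)

# view() reshapes without copying the underlying data
reshaped = torch.ones(6).view(2, 3)

# Create a tensor directly on the GPU when one is available, otherwise on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
on_device = torch.ones(2, 2, device=device)
```
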
##### Autograd in PyTorch
- The autograd feature calculates gradients (i.e., partial derivatives) automatically. While training a neural network, we need to calculate gradients during the backpropagation step, and automatic gradient calculation handles this.
- `torch.no_grad()` can be used when you want to temporarily stop tracking gradients (e.g., during evaluation).
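
A minimal autograd sketch, assuming PyTorch is installed. The toy loss `(w * x - 1)^2` is made up for illustration; its derivative with respect to `w` is `2 * (w * x - 1) * x`.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)   # a weight we want a gradient for
x = torch.tensor(3.0)                        # an input (no gradient needed)

loss = (w * x - 1.0) ** 2    # (2*3 - 1)^2 = 25
loss.backward()              # autograd computes d(loss)/dw
print(w.grad)                # tensor(30.)  since 2 * (2*3 - 1) * 3 = 30

# Inside no_grad, operations are not tracked for gradient computation
with torch.no_grad():
    y = w * x
print(y.requires_grad)       # False
```
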
##### Numpy Arrays Vs PyTorch Tensors
- PyTorch tensors and numpy arrays have similar functionality but tensor offers 3 key benefits over numpy arrays that are useful in deep learning.
  - Benefit 1: Tensors come with built-in support for GPU acceleration.
  - Benefit 2: Tensors have autograd features that compute gradients automatically. NumPy arrays do not have this feature.
  - Benefit 3: Tensors are tightly integrated with the PyTorch ecosystem, which makes them easier to use for deep learning tasks.

### <div align="center">Neural Networks: Training</div>

##### Training through Backpropagation
- Below are the 5 steps for Neural Network Training:
  1. Initialize neural network with random weights
  2. Feed forward training samples and calculate prediction error
  3. Back propagate error to adjust weights using gradient descent
  4. Repeat the process for a certain number of iterations (epochs) or until the error is reduced significantly
  5. Evaluate Neural Network Performance by using test or validation set
- For training a neural network, we use a supervised training dataset. Feed samples one by one, calculate error and then backpropagate it to adjust weights.
- The main objective of training is to find out the right weights for the neural network. It is like adjusting knobs on a sound board to get the expected audio output.
- Error backpropagation uses partial derivatives to measure how much a specific weight contributes to an error. Based on that, adjustments are made to reduce the error in the next iteration.
- One epoch is feeding all the records in your dataset through the network once during a training process.
- MSE (Mean Squared Error) is one of the many cost functions used to measure error.

##### Gradient Descent
- Gradient Descent is a technique used in neural networks and statistical ML algorithms to find the values of weights that result in minimal prediction error. Ideally that optimal point is the global minimum, although in practice gradient descent can also settle in a local minimum.
- It uses the gradient (or partial derivative) of error with respect to weights to perform weight adjustment.
- `Learning rate` is a hyperparameter that we need to supply in gradient descent. It controls how much the weights adjust with each update.
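
Gradient descent can be sketched on a one-weight model `y = w * x` with MSE as the cost function. The data, learning rate, and function names below are assumptions made for this example only.

```python
def mse_gradient(w, xs, ys):
    """Partial derivative of mean squared error with respect to w for y_hat = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def gradient_descent(xs, ys, learning_rate=0.05, epochs=200):
    w = 0.0                                   # start from an arbitrary weight
    for _ in range(epochs):
        # Step against the gradient; the learning rate scales the step size
        w -= learning_rate * mse_gradient(w, xs, ys)
    return w

# Data generated by y = 2x, so the error is minimal at w = 2
w = gradient_descent([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(round(w, 4))  # 2.0
```
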
##### Batch Gradient Descent Vs Mini Batch GD Vs Stochastic GD (SGD)
- `Batch Gradient Descent` is an optimization algorithm that updates model parameters by calculating gradients across the entire dataset in each iteration to minimize a loss function.
- `Mini Batch Gradient Descent` is a variant of batch gradient descent that updates model parameters by calculating gradients using a small subset of the dataset, balancing efficiency and stability.
- `Stochastic Gradient Descent` updates model parameters using the gradient calculated from a single, random data point per iteration. This adds noise to the updates but makes each update cheap, which often speeds up convergence.
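
The three variants differ only in how many samples feed each weight update. The sketch below uses a hypothetical helper, `make_batches`, to show how many updates one epoch produces in each case.

```python
import random

def make_batches(samples, batch_size, seed=0):
    """Shuffle the samples and split them into batches of at most batch_size."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

data = list(range(10))                                  # stand-in for 10 training samples

batch_gd = make_batches(data, batch_size=len(data))     # whole dataset per update
mini_batch = make_batches(data, batch_size=4)           # small subset per update
sgd = make_batches(data, batch_size=1)                  # one sample per update

print(len(batch_gd), len(mini_batch), len(sgd))         # 1 3 10 updates per epoch
```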

### <div align="center">Neural Networks in PyTorch</div>

- `nn.Module` is a base class for all neural network modules.
- A usual practice is to create a subclass out of nn.Module to define your own neural network architecture.
- Calling model(train_data) will internally call the forward method on your subclass.
- Datasets and Data Loaders
  - PyTorch provides a number of pre-loaded datasets that cover images, text, and audio.
  - `DataLoader` lets you create batches from a large dataset easily for training. It also allows reshuffling the data at every epoch to reduce model overfitting.
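
A minimal sketch of the points above, assuming PyTorch is installed. `TinyClassifier`, the layer sizes, and the random data are invented for illustration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class TinyClassifier(nn.Module):
    """A small binary classifier defined by subclassing nn.Module."""

    def __init__(self, in_features=2, hidden=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),            # output is a probability between 0 and 1
        )

    def forward(self, x):
        return self.net(x)

# Random stand-in data: 8 samples with 2 features and a 0/1 label each
features = torch.rand(8, 2)
labels = torch.randint(0, 2, (8, 1)).float()
loader = DataLoader(TensorDataset(features, labels), batch_size=4, shuffle=True)

model = TinyClassifier()
for batch_x, batch_y in loader:
    predictions = model(batch_x)     # calling the model runs forward() internally
    print(predictions.shape)         # torch.Size([4, 1])
```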

##### Cost Function - Binary Cross Entropy (a.k.a Log Loss)
- Reasons for using Binary Cross Entropy (BCE):
  - Aligns well with probabilistic outputs, providing a natural fit for binary outcomes.
  - Along with a sigmoid activation, it gives a smooth cost surface that is convex for a single-layer (logistic regression) model, which makes convergence to the global minimum easier. Deeper networks are not convex overall, but BCE still behaves better than MSE.
  - Provides strong gradient updates, especially for confident, incorrect predictions.
  - Incorrect predictions are penalized logarithmically, encouraging accuracy and discouraging overconfidence in errors.
  - Cost functions like MSE may not work well for binary classification problems because the cost surface will not be convex and training may get stuck in local minima.
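
The logarithmic penalty can be seen in a small standard-library sketch. The function name and the `eps` clipping value are assumptions of this example.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average log loss over a list of 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)          # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

confident_right = binary_cross_entropy([1], [0.99])   # small loss
confident_wrong = binary_cross_entropy([1], [0.01])   # large loss
print(confident_right, confident_wrong)
```

A confident but wrong prediction is penalized far more heavily than a confident correct one, which is the behavior described above.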

### <div align="center">Model Optimization: Training Algorithm</div>

##### Model Optimization Overview
- Model Optimization is a process of finding the best way to train a model such that we can train it faster by using less compute resources and the model performs well during prediction phase.
- Model Optimization:
  1. Training Algorithm
  2. Regularization Techniques
  3. Hyperparameter Tuning
- Training Algorithm
  1. Gradient Descent
  2. GD with Momentum
  3. RMSProp
  4. Adam
- Model optimization can be done in various ways, such as using different optimizers (GD, Momentum, RMSProp, etc.), regularization techniques (L1, L2, Dropout), hyperparameter tuning, and so on.
- `Exponentially Weighted Moving Average` (EWMA) gives more weight to recent data, smoothing out fluctuations over time.
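
EWMA can be sketched with the recurrence `v_t = beta * v_(t-1) + (1 - beta) * x_t`. This version starts from `v_0 = 0` and omits bias correction; both are simplifying assumptions of this example.

```python
def ewma(values, beta=0.9):
    """Exponentially weighted moving average; higher beta means smoother output."""
    averaged = []
    v = 0.0
    for x in values:
        v = beta * v + (1 - beta) * x     # old average decays, new value is blended in
        averaged.append(v)
    return averaged

print(ewma([2.0, 4.0], beta=0.5))  # [1.0, 2.5]
```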

##### Training Algorithm
- Gradient Descent with Momentum
  - GD with momentum accelerates convergence by building on past gradients, reducing the time to reach the minimum.
  - The momentum term smooths out oscillations, especially in regions with steep, narrow valleys, leading to a more stable optimization path.
  - Momentum helps GD escape small local minima and flat regions, making it more effective in complex loss landscapes.
- RMSProp
  - RMSProp uses an Exponentially Weighted Moving Average of squared gradients to reduce oscillations.
  - Helps models converge faster and works well with noisy gradients.
- Adam
  - Adam combines momentum and RMSProp for efficient updates.
  - Syntax: `optimizer = optim.Adam(model.parameters(), lr=0.001)`
  - Tracks both mean and squared gradients to stabilize weight updates.
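
To show how Adam combines the two ideas, here is a single-weight sketch in plain Python. It follows the commonly published Adam update with bias correction. The function name, the default values, and the toy objective `(w - 3)^2` are assumptions for illustration, not PyTorch's implementation.

```python
import math

def adam_minimize(grad_fn, w=0.0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = 0.0   # EWMA of gradients (the momentum part)
    v = 0.0   # EWMA of squared gradients (the RMSProp part)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)          # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Minimize (w - 3)^2, whose gradient is 2 * (w - 3); the minimum is at w = 3
w = adam_minimize(lambda w: 2 * (w - 3))
print(w)
```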

### <div align="center">Model Optimization: Regularization Technique</div>

#### Regularization
- Regularization is a set of techniques used to prevent overfitting — which is when a model performs well on training data but poorly on new, unseen data.
- Regularization Techniques
  1. Dropout Regularization: Dropout regularization randomly drops certain neurons in each hidden layer during the training process. This generalizes the model and stops the network from learning specific details of training samples.
  2. L1, L2 Regularization: Both L1 (the penalty used in Lasso Regression) and L2 (the penalty used in Ridge Regression) help prevent overfitting by adding a penalty to the cost function for large weights.
  3. Batch Normalization: Batch Normalization (BN) is a technique used in training deep neural networks to stabilize and accelerate the learning process. It also helps with regularization.
     - Key Benefit
       1. Stabilize learning
       2. Higher Learning rate
       3. Regularization effect
     - Batch normalization normalizes layer inputs to have zero mean and unit variance, enhancing model performance.
     - Allows for higher learning rates, reducing the training time for deep networks.
     - Adds robustness to model initialization, making it less sensitive to initial weights.
     - Reduces overfitting, particularly when used with dropout, by adding slight regularization effects.
     - Learnable parameters gamma (scale) and beta (shift) allow the network to learn the optimal scale and mean for each feature.
  4. Early Stopping
     - Early Stopping monitors model performance on validation data to stop training when there is no further improvement for some fixed number of iterations.
     - This parameter of fixed number of iterations with no improvement is called patience.
     - It prevents overfitting by halting training before the model starts to memorize noise.
     - Saves time and resources by avoiding unnecessary training epochs.
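
Early stopping with patience can be sketched as a small helper class. `EarlyStopping` and the validation losses below are hypothetical; in a real training loop the losses would come from evaluating the model after each epoch.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, validation_loss):
        if validation_loss < self.best_loss:
            self.best_loss = validation_loss
            self.epochs_without_improvement = 0      # improvement resets the counter
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience

stopper = EarlyStopping(patience=2)
losses = [0.9, 0.7, 0.6, 0.65, 0.61, 0.5]            # made-up validation losses
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_loss)  # 4 0.6
```

Note that training stops at epoch 4 even though epoch 5 would have improved; a larger patience trades extra compute for a lower chance of stopping too early.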

### <div align="center">Model Optimization: Hyperparameter Tuning</div>

#### Hyperparameter Tuning
- Hyperparameter tuning is the process of systematically finding the best values for the hyperparameters of a machine learning model to optimize its performance.
- Hyperparameter Tuning Benefits
  1. Improve Model Accuracy
  2. Reduce Overfitting, Underfitting
  3. Optimize Training Time, Compute Resources
- Fine-tuning hyperparameters helps optimize model performance by selecting the best values for parameters like learning rate and batch size.
- Unlike model parameters, hyperparameters are set before training and influence the learning process.
- Effective tuning can prevent overfitting or underfitting, leading to better generalization on unseen data.
- Hyperparameter tuning is crucial for enhancing the accuracy and efficiency of deep learning models.

#### Optuna Hyperparameter Tuning
- Optuna searches for the best hyperparameters using sampling techniques such as:
  1. Bayesian Optimization (e.g., its default TPE sampler)
  2. Evolutionary Algorithms (e.g., CMA-ES, NSGA-II)
  3. Random and Grid Search (as baselines)
- Optuna is a modern, automated hyperparameter optimization framework that uses an efficient, trial-based search.
- It leverages techniques such as Bayesian optimization to find optimal hyperparameters faster than grid or random search.
- Optuna allows for dynamic pruning, stopping unpromising trials early to save computation.
- Ideal for deep learning tasks with large search spaces, where traditional tuning methods may be inefficient.

### <div align="center">Convolutional Neural Network (CNN)</div>

- Disadvantages of FCN/FFN (Fully Connected/FeedForward Neural Networks) for image classification:
  - High computation: Too many parameters and dense connections increase resource usage and training time.
  - Loss of spatial data: Flattening destroys spatial relationships, so important patterns like edges/textures are ignored.
  - Overfitting risk: Excess weights mean the model memorizes noise and outliers, harming generalization.
- Filters or Kernels are nothing but the feature detectors.
- `Pooling` reduces the size of feature maps (and thus computational requirements), decreases the risk of overfitting, and makes the model more tolerant to variations and distortions in the input data.
  - Benefits of pooling
    1. Dimension & Computation Reduction: Pooling downsamples spatial dimensions, resulting in smaller feature maps. This lowers the number of parameters and computations required in subsequent layers, making models faster and more memory-efficient.
    2. Reduces overfitting: By summarizing features and discarding less important information, pooling provides regularization. This simplification helps prevent the network from memorizing the training data too closely.
    3. Tolerance to variations and distortions: The model retains salient features, so its predictions do not change drastically if the input shifts or gets slightly altered.
- A CNN by itself does not handle rotation and scale, so the training dataset needs rotated and scaled samples. If such samples are not available, use Data Augmentation to generate rotated and scaled images from the original dataset.
- Convolutional Neural Networks (CNNs) excel at processing grid-like data such as images, identifying patterns through convolutions.
- They use kernels (filters) to extract spatial features like edges and textures from input data.
- CNN architectures combine convolution, pooling, and fully connected layers to learn hierarchical feature representations for tasks like image classification and object detection.

##### Padding and Strides
- Padding adds extra border values (usually zeros) around the input so that edge information is not lost during convolution; with enough padding the output keeps the input's dimensions.
- Strides control the movement of the convolutional filter, affecting output size and computation speed.
- Padding modes: "same" pads so that the output size matches the input size (at stride 1), while "valid" uses no padding, so the output shrinks.
- Adjusting padding and strides can influence feature extraction granularity and network efficiency.
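
The effect of padding and stride on output size follows the standard formula `floor((n + 2p - k) / s) + 1` for input size `n`, kernel size `k`, padding `p`, and stride `s`. The helper name and the 28-pixel input are assumptions chosen for illustration.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output width (or height) of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(28, kernel_size=3))                        # 26 ("valid": output shrinks)
print(conv_output_size(28, kernel_size=3, padding=1))             # 28 ("same": size preserved)
print(conv_output_size(28, kernel_size=3, stride=2, padding=1))   # 14 (stride 2 halves the size)
```
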
##### Train a Neural Network
- ReLU (Rectified Linear Unit): Sets all negative values in a feature map to zero and keeps positive values unchanged, which introduces non-linearity. As a rough picture, it removes the black regions from the image and leaves only the white ones.
- Max pooling (pooling and downsampling are the same thing) keeps the strongest features while significantly reducing the feature map size and the computation in later layers, which also helps avoid overfitting.
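
A small sketch of ReLU followed by 2x2 max pooling on a hand-made 4x4 feature map. The function names and values are invented for this example, and the pooling assumes even dimensions.

```python
def relu_map(feature_map):
    """Replace every negative value with 0."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    pooled = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[0]) - 1, 2):
            window = [
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled

feature_map = [
    [1, -3, 2, 0],
    [-1, 5, -2, 4],
    [0, -6, 3, -1],
    [2, 1, -4, -5],
]
activated = relu_map(feature_map)
pooled = max_pool_2x2(activated)
print(pooled)  # [[5, 4], [2, 3]]
```
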
##### Data Augmentation
- Data augmentation increases the diversity of the training dataset by applying transformations like rotation, flipping, and scaling.
- Enhances model generalization by exposing it to varied scenarios, reducing overfitting.
- Common techniques include geometric transformations, color adjustments, cropping, and adding noise.
- Augmentation is often performed dynamically during training, so the model sees a new variation of each image in each epoch.
- Particularly effective in computer vision tasks where gathering more data can be costly or impractical.

##### Transfer Learning
- Transfer learning is a machine learning technique where a pre-trained model on one task is reused or fine-tuned for a different but related task, such as using a model trained on cars to classify trucks.
- Transfer learning leverages pre-trained models to solve new, related tasks with limited data.
- It significantly reduces training time by reusing learned features from large datasets.
- Commonly used in tasks like image classification and natural language processing to achieve high accuracy with minimal effort.
- Transfer learning involves fine-tuning a pre-trained model or using it as a fixed feature extractor.
- Ideal for scenarios with limited data, enabling effective learning without starting from scratch.

### <div align="center">RNN (Recurrent Neural Networks) - Sequence Models</div>

##### Sequence Model
- Sequential data refers to data where the order of elements matters, such as time series, text, audio, video, etc.
- Examples of sequence models include RNNs, LSTMs, GRUs, and Transformers, each designed for specific challenges.
- Supervised
  - Artificial Neural Network (ANN): Used for Regression and Classification
  - Convolutional Neural Network (CNN): Used for Computer Vision
  - Recurrent Neural Network (RNN): Used for Time Series Analysis
- Unsupervised
  - Self-Organizing Maps: Used for feature detection
  - Deep Boltzmann Machines: Used for Recommendation Systems
  - Autoencoder: Used for Recommendation Systems

##### Recurrent Neural Network (RNN)
- Issues with regular neural network:
  1. Regular neural networks require a fixed size input whereas sequences vary in length.
  2. Regular neural networks do not consider order of elements in a sequence.
  3. Regular neural networks do not share parameters across the positions of a sequence.
- Benefits of RNN
  - Designed to work with sequential data. Effective for tasks where order and context matter.
  - In-built memory mechanism
  - Parameter sharing
- Recurrent Neural Networks (RNNs) are specialized for sequential data, processing inputs step-by-step while maintaining a memory of past information.
- RNNs use hidden states to capture temporal dependencies, enabling predictions based on sequence history.
- Ideal for tasks like text generation, speech recognition, and time-series forecasting.
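
The hidden-state idea can be sketched with a single-unit RNN in plain Python. The weights `w_x`, `w_h`, and `bias` are arbitrary values chosen for illustration; the same weights are reused at every time step, which is the parameter sharing mentioned above.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    """Process a sequence one step at a time, carrying a hidden state forward."""
    hidden = 0.0
    history = []
    for x in inputs:
        # The new hidden state mixes the current input with the previous hidden state
        hidden = math.tanh(w_x * x + w_h * hidden + bias)
        history.append(hidden)
    return history

states = rnn_forward([1.0, 0.0, 0.0])
print(states)  # later states stay non-zero: the first input is still "remembered"
```
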
##### Types of RNN
- One-to-Many RNNs generate sequences from a single input, like caption generation from an image.
- Many-to-One RNNs summarize sequences into a single output, such as sentiment analysis of a sentence.
- Many-to-Many RNNs handle sequence input and output, such as machine translation or video frame labelling.

##### Vanishing Gradient Problem
- The vanishing gradient problem in fully connected neural networks occurs when gradients shrink during backpropagation, preventing earlier layers from learning effectively.
- Solutions for Vanishing Gradient in fully connected networks
  1. ReLU Activation
  2. Batch Normalization
  3. Residual Connections
- The exploding gradient problem in fully connected neural networks occurs when gradients grow uncontrollably during backpropagation, causing unstable training and large weight updates.
- Note: RNN learns via backpropagation through time.
- Solutions to the Vanishing Gradient Problem in RNNs
  1. LSTM
  2. GRU
  3. Residual Connections
- Vanishing gradients occur when gradients become too small during backpropagation, hindering effective weight updates.
- It primarily affects deep networks with activation functions like sigmoid or tanh, leading to slow or stalled learning.
- Layers closer to the input receive smaller gradients, causing them to learn much more slowly than layers closer to the output.
- Solutions include using activation functions like ReLU, batch normalization, or architectures like LSTMs with gating mechanisms.
- Addressing vanishing gradients is critical for training deep neural networks effectively and efficiently.
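
A rough numeric illustration of why gradients vanish with sigmoid: its derivative is at most 0.25, and backpropagation multiplies one such factor per layer. This sketch makes the simplifying assumption that every weight equals 1 and every pre-activation equals 0, so it shows only the trend.

```python
import math

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)              # peaks at 0.25 when x = 0

def gradient_scale(num_layers, pre_activation=0.0):
    """Factor by which a gradient shrinks after passing through num_layers sigmoids."""
    return sigmoid_derivative(pre_activation) ** num_layers

for depth in (1, 5, 10, 20):
    print(depth, gradient_scale(depth))   # shrinks by a factor of 4 per layer
```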

##### LSTM (Long Short Term Memory Network)
- The Long Short Term Memory (LSTM) network addresses the short-term memory problem in RNNs by introducing a long-term memory cell (a.k.a. cell state).
- It has both short term and long term memory.
- It has 3 gates: Forget, Input, and Output.

### <div align="center">Transformers</div>

##### Overview of Encoder and Decoder
- Transformer architecture has two parts:
  1. Encoder
  2. Decoder
- The purpose of the encoder is to produce contextual embeddings for each word (more precisely, a token) in a given input sentence.
- The purpose of the decoder is to produce an output sequence, which can be a word (for the next word prediction task) or a sequence (such as a translated sentence in case of language translation).
  - The decoder takes two inputs: the context (the encoder's contextual embeddings) and the output generated so far. At each step, take the highest-probability word and append it to the sentence being built.
- The above points relate to the inference stage, not training (normal flow: Training -> Inference).
- BERT and GPT are examples of specific models based on transformer architecture.
- BERT uses only the encoder part of the transformer architecture, whereas GPT uses only the decoder part.
- BERT (base) uses 768-dimensional embeddings, whereas GPT-3 (175B) uses 12288-dimensional embeddings.
- Word embedding is a way to represent a word in a numeric format such that it captures the semantic meaning of that word.

##### Tokenization, Positional Embeddings
- Inside the Encoder:
  - Step 1: Generate tokens and token IDs.
  - Step 2: Look up each token's embedding in the static word embeddings matrix (learned when the model was trained) and add its positional encoding.
- Tokens are similar to words, although a word may be split into several tokens (Ex: Sentence: i made sweet indian rice called kheer -> Token: [CLS] i made sweet indian rice call ed [SEP]). Each token has an index in the vocabulary, and that index is used to convert the token to a number.
- BERT vocab size: 30522 and GPT-3 vocab size: 50257.
- Because Transformers process all tokens in parallel, a positional embedding vector accompanies each token embedding to encode word order.
##### Attention Mechanism
- Attention mechanisms allow Transformers to focus on relevant parts of the input sequence for each output, improving context understanding.
- Self-attention computes the relationships between all input elements, capturing dependencies regardless of their position.
- Key components of attention include queries, keys, and values, which determine how much focus is given to different parts of the input.
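
The role of queries, keys, and values can be sketched as scaled dot-product attention in plain Python. The tiny 2-dimensional vectors are invented for illustration, and real models use learned projections and matrix operations instead.

```python
import math

def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """For each query, return a weighted average of values based on query-key similarity."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)            # how much focus each position gets
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
result = attention(queries, keys, values)
print(result)  # the first value dominates because the query matches the first key
```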

##### Multi Headed Attention
- Flow: I made sweet indian rice called Kheer -> Positional Embedding -> Attention Head (weight matrices Wq, Wk, Wv produce the Query, Key, and Value vectors) -> Context Aware Embedding.
- The purpose of multiple attention heads is to allow the model to focus on different aspects or types of relationships between tokens (e.g., semantic, positional, syntactic) simultaneously, enriching the contextual understanding of each token.
- The Feed-Forward Network (FFN) enriches each token's embedding by applying non-linear transformations independently, enabling the model to capture complex patterns and higher-order features beyond contextual relationships.
- The normalization layer ensures stable learning by improving gradient flow.
- Multi-headed attention enables Transformers to capture diverse relationships in the data by learning multiple attention patterns simultaneously.
- Each attention head computes self-attention independently, focusing on different parts of the input sequence.
- Outputs from all heads are concatenated and transformed to create a richer representation of the input.
- Multi-headed attention improves the model's ability to understand complex patterns and long-range dependencies.
- It is a key component in Transformer Architecture.

##### Decoder
- Output of encoder is contextual embedding.
- The decoder in Transformer architecture generates the output sequence step-by-step, one token at a time.
- It uses masked self-attention to ensure predictions depend only on previously generated tokens.
- The decoder integrates encoder outputs through cross-attention to incorporate contextual information from the input sequence.
- Fully connected layers in the decoder refine the processed information for final token prediction.
- The decoder is central to tasks like language translation and text generation, where sequential output is crucial.
- Below is a link to visually understand the Transformer architecture:
  - https://poloclub.github.io/transformer-explainer/

##### How Transformers are trained?
- Predicting the next word based on the previous words is called `Causal Language Modeling` (CLM). GPT is trained using CLM.
- `Masked Language Modeling` (MLM): Some tokens (words) in the input are masked (ex: 15%) and the model learns to predict them from the surrounding context. It is bidirectional in nature. BERT (Google) is trained using MLM.
- In self-supervised learning, labels are generated from the data itself without requiring manual annotations.
- Causal Language Modeling (CLM) and Masked Language Modeling (MLM) are self-supervised learning approaches used to train transformers.
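
How labels come from the data itself can be sketched for both objectives. The helper names are hypothetical, and the masking is simplified: every selected token becomes `[MASK]`, whereas real BERT pre-processing applies additional replacement rules.

```python
import random

def causal_lm_examples(tokens):
    """Each example pairs the tokens seen so far with the next token as the label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_prob=0.15, seed=0):
    """Mask random tokens; the labels are the original tokens at the masked positions."""
    rng = random.Random(seed)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels[i] = tok
        else:
            masked.append(tok)
    return masked, labels

tokens = ["i", "made", "sweet", "indian", "rice"]
for context, target in causal_lm_examples(tokens):
    print(context, "->", target)

masked, labels = masked_lm_example(tokens)
print(masked, labels)
```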

- Key feature of multi-headed attention in Transformers: It allows Transformers to capture multiple attention patterns simultaneously.
- What the attention mechanism in Transformers allows: Focusing on relevant parts of the input sequence for each output.
- Main purpose of BERT in NLP: The main purpose of BERT in NLP is to provide deep, bidirectional context-aware text representations, enabling models to understand the meaning of words based on their surrounding context.
- Main purpose of word embeddings in NLP: To represent a word in a numeric format capturing its semantic meaning.
- Main function of the decoder in Transformer models: To generate the output sequence step-by-step.
- Popular technique for creating static word embeddings: Word2vec
- Purpose of the encoder in Transformer architecture: To generate a contextual embedding for each word (token).