In [5]:
import torch

PyTorch tensors - 
1. Tensor - similar to an array or matrix; tensors are the building blocks of neural networks

In [6]:
my_list = [[1,2,3],[4,5,6]]

In [7]:
tensor = torch.tensor(my_list)

In [8]:
tensor

tensor([[1, 2, 3],
        [4, 5, 6]])

In [9]:
tensor.shape

torch.Size([2, 3])

In [10]:
tensor.dtype

torch.int64

compatible tensors - tensors whose shapes align, so operations such as element-wise addition or matrix multiplication can be applied to them
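
As a small illustration of compatible shapes (the tensors `a` and `b` below are made-up values, not from the notebook), two tensors of the same shape can be added element-wise, and a (2, 3) tensor can be matrix-multiplied with a (3, 2) tensor:

```python
import torch

# two hypothetical tensors with the same shape (2, 3)
a = torch.tensor([[1, 2, 3], [4, 5, 6]])
b = torch.tensor([[10, 20, 30], [40, 50, 60]])

# element-wise addition works because the shapes match
summed = a + b

# matrix multiplication needs the inner dimensions to match: (2, 3) @ (3, 2) gives (2, 2)
product = a @ b.T

print(summed)
print(product.shape)
```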

### Building a Neural Network using PyTorch

input layer - hidden layer - output layer

### 1. First Neural Network 
- this network does not have a hidden layer
- the output layer is a linear layer
- every output neuron connects to every input neuron - **fully connected network**
- this is equivalent to a linear model - helps us understand the basics without adding complexity

In [11]:
import torch.nn as nn 

when designing a neural network, the input and output layer dimensions are pre-defined. 
- input neurons = features
- output neurons = classes (we want to predict)

In [12]:
# create input_tensor with three features - our input layer
input_tensor = torch.tensor([[0.3471,0.4547,-0.2356]])

In [13]:
# define our linear layer
linear_layer = nn.Linear(
    in_features =3,
    out_features=2
)


In [14]:
# pass input through linear layer
output = linear_layer(input_tensor)
print(output)

tensor([[-0.5322,  0.1088]], grad_fn=<AddmmBackward0>)


when input_tensor is passed to linear_layer, the layer applies a linear function using its weights and biases. Each linear layer has its own set of weights and biases - these are the key quantities that define a neuron
- **weight** : reflects the importance of different features
- **bias** : provides the neuron with a baseline output
    - biases are independent of the weights
- at first, the linear layer assigns random weights and biases; these are tuned later during training

In [15]:
print(linear_layer.weight)

Parameter containing:
tensor([[-0.3334, -0.2678, -0.4806],
        [ 0.0719, -0.2007, -0.2185]], requires_grad=True)


In [16]:
print(linear_layer.bias)

Parameter containing:
tensor([-0.4080,  0.1236], requires_grad=True)
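
To see what the layer computes, the output can be reproduced by hand as `input @ weight.T + bias`. The sketch below hard-codes the rounded weights and biases printed above, so it only matches the earlier output to about three decimal places:

```python
import torch

# values copied from the printed outputs above (rounded to 4 decimals)
input_tensor = torch.tensor([[0.3471, 0.4547, -0.2356]])
weight = torch.tensor([[-0.3334, -0.2678, -0.4806],
                       [0.0719, -0.2007, -0.2185]])
bias = torch.tensor([-0.4080, 0.1236])

# a linear layer computes x @ W^T + b
manual_output = input_tensor @ weight.T + bias
print(manual_output)  # close to tensor([[-0.5322, 0.1088]])
```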


**example** - let's say we have a weather dataset with three features - temperature, humidity and wind - and we want to predict whether it's going to rain or be cloudy
1. the humidity feature will have a larger weight than the other features, as it is a strong predictor of whether it's going to rain or not
2. the data is for a tropical region with a high probability of rain, so a **bias** is added to account for this baseline information.
- with this information our model makes a prediction

### 2. Hidden Layers and Parameters

- here we will add more layers to help the network learn complex patterns
- stack three linear layers using nn.Sequential
- nn.Sequential is a pytorch container for stacking layers in sequence
- takes input - passes it to each linear layer in sequence - returns output
- the layers between the input and the final output layer are hidden layers

In [None]:
# create network with three linear layers
n_features = 10 # hypothetical number of input features
n_classes = 3 # hypothetical number of output classes
model = nn.Sequential(
    nn.Linear(n_features, 8), # n_features represents number of input features
    nn.Linear(8,4),
    nn.Linear(4,n_classes) # n_classes represents number of output classes
)

- we can keep adding as many layers as we want, as long as the input dimension of each layer matches the output dimension of the previous one

In [None]:
# adding more layers  - three linear layers
model = nn.Sequential(
    nn.Linear(10,18), # takes 10 features and outputs 18
    nn.Linear(18,20), # takes 18 features and outputs 20
    nn.Linear(20,5) # takes 20 features and outputs 5
)

1. **layers are made of neurons**
- a layer is fully connected when each neuron links to all neurons in the previous layer
- a neuron in a linear layer :
    - performs a linear operation using all neurons from the previous layer
    - has n+1 parameters - n from inputs and 1 from the bias
2. **Parameters and model capacity**
- more hidden layers = more parameters = higher model capacity (can handle complex datasets but may take longer to train)
- an effective way to assess a model's capacity is by calculating its total number of parameters

In [None]:
# in 2 layer network 
model = nn.Sequential(nn.Linear(8,4),
                      nn.Linear(4,2))

**manual parameter calculation:**
- first layer has 4 neurons, each neuron has 8+1 (8 weights and 1 bias) parameters. 9 times 4 = 36 parameters
- second layer has 2 neurons, each neuron has 4+1 parameters. 5 times 2 = 10 parameters
- in total this model has - 36 + 10 = 46 learnable parameters
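
The same arithmetic can be written as a small helper. `count_linear_params` is a hypothetical function name used only for this sketch; it applies the (inputs + 1) x neurons rule to each layer and compares the result with PyTorch's own count:

```python
import torch.nn as nn

def count_linear_params(layer_sizes):
    """Apply (n_inputs + 1) * n_neurons to each consecutive pair of layer sizes."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

manual_total = count_linear_params([8, 4, 2])  # 36 + 10

model = nn.Sequential(nn.Linear(8, 4), nn.Linear(4, 2))
torch_total = sum(p.numel() for p in model.parameters())

print(manual_total, torch_total)
```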

we can do this manual calculation in python using .numel() method
- .numel() : returns the number of elements in the tensor

In [None]:
total = 0
# add up the number of elements in every weight and bias tensor
for parameter in model.parameters():
    total += parameter.numel()
print(total)

understanding parameter count helps us understand model complexity and efficiency
- too many parameters can lead to long training times or overfitting
- too few parameters might limit learning capacity

### 3. Activation Functions

- **activation functions** add non-linearity to the network
    - sigmoid for binary classification
    - softmax for multi-class classification
- this non-linearity allows networks to learn more complex interactions between inputs and targets than purely linear relationships
- **pre-activation** output - the output of the last linear layer, which is passed to the activation function to obtain the transformed output

1. sigmoid activation -
    - for binary classification

Let's say we are trying to decide whether an animal is a mammal or not. We have three features (limbs, eggs, hair); the features pass through the linear layers and produce a single number, let's say 6
- we take the pre-activation output (6) and pass it to the sigmoid function
- it returns a value between 0 and 1
- if output is > 0.5, class label = 1 (mammal)
- if output is <= 0.5, class label = 0 (not mammal)
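
The sigmoid function itself is 1 / (1 + e^(-x)). A minimal plain-Python sketch of the thresholding rule above (the helper name `sigmoid` and the variable names are illustrative):

```python
import math

def sigmoid(x):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

pre_activation = 6.0
probability = sigmoid(pre_activation)
label = 1 if probability > 0.5 else 0  # 1 = mammal, 0 = not mammal

print(round(probability, 4), label)
```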

In [20]:
import torch
import torch.nn as nn

In [21]:
input_tensor = torch.tensor([[6.0]]) # float tensor holding the pre-activation value
sigmoid = nn.Sigmoid() # applied element-wise, so it accepts a tensor of any shape
output = sigmoid(input_tensor)
print(output)

tensor([[0.9975]])


the output is a single value bounded between 0 and 1

In [22]:
# adding activation as the last layer
model = nn.Sequential(
    nn.Linear(6,4), # first linear layer
    nn.Linear(4,1), # second linear layer
    nn.Sigmoid() # sigmoid activation function - automatically transforms the output of the final linear layer
)

- sigmoid as last step in network of linear layers is equivalent to traditional logistic regression

2. softmax - multi-class classification 
- takes a vector with one score per class as input and outputs a vector of the same shape
- Let's say we have three classes - bird (0), mammal (1), reptile (2)
    - in this case softmax takes a three-element input and outputs the same shape
    - output is a probability distribution
        - each element is a probability (it's bounded between 0 and 1)
        - the sum of the output vector is equal to 1
        - the predicted class is the one with the highest probability
- similar to sigmoid, softmax can be added as the last layer

In [24]:
input_tensor = torch.tensor([[4.3,6.1,2.3]])

# apply softmax along the last dimension 
probabilities = nn.Softmax(dim=-1) # applying softmax to last dimension 
output_tensor = probabilities(input_tensor)
print(output_tensor)

tensor([[0.1392, 0.8420, 0.0188]])
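
To turn the probabilities into a prediction, we can take the index of the largest value with `argmax`. This sketch reuses the scores from the cell above; the `class_names` list is only an illustration of the bird/mammal/reptile labelling:

```python
import torch
import torch.nn as nn

class_names = ["bird", "mammal", "reptile"]  # illustrative labels for classes 0, 1, 2

scores = torch.tensor([[4.3, 6.1, 2.3]])
probabilities = nn.Softmax(dim=-1)(scores)

# the probabilities sum to 1, and the predicted class is the most probable one
predicted_index = probabilities.argmax(dim=-1).item()
print(probabilities.sum().item(), class_names[predicted_index])
```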


### 4. Running a Forward Pass

- Input data flows through the layers
- The calculations at each layer transform the data into a new representation, which is passed to the next layer until the final output is produced
- The purpose of the forward pass is to pass input data through the network and produce predictions based on the model's learned parameters (weights and biases)
- this process is used both for training and for making new predictions
- Possible outputs -
    - binary classification - single probability between 0 and 1 (0.5 threshold) 
    - multi-class classification - one probability per class
    - regression - predicts continuous numerical values 
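
As a rough sketch of a forward pass for each kind of output (the layer sizes and the random batch of five samples with six features are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch = torch.randn(5, 6)  # 5 hypothetical samples, 6 features each

binary_model = nn.Sequential(nn.Linear(6, 4), nn.Linear(4, 1), nn.Sigmoid())
multiclass_model = nn.Sequential(nn.Linear(6, 4), nn.Linear(4, 3), nn.Softmax(dim=-1))
regression_model = nn.Sequential(nn.Linear(6, 4), nn.Linear(4, 1))

binary_out = binary_model(batch)          # one probability per sample
multiclass_out = multiclass_model(batch)  # one probability per class per sample
regression_out = regression_model(batch)  # one unbounded value per sample

print(binary_out.shape, multiclass_out.shape, regression_out.shape)
```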

### 5. Using loss functions to assess model predictions


- tells us how good our model is during training
- takes a model prediction y-hat and ground truth y
- input --> loss function --> output
- input = (model prediction, ground truth), output = (float)

for example -  if we take our previous example 
- class 0 : mammal, class 1 : bird, class 2 : reptile
- model prediction -
    - predicted class = 0, correct = low loss
    - predicted class = 1, wrong = high loss
    - predicted class = 2, wrong = high loss
- our goal is to minimize loss

- loss is calculated using a loss function F 
- **loss = F(y,y-hat)**
    - y is a single integer (class label)
        - eg., y = 0 when the animal is a mammal
    - y-hat is a tensor (prediction before softmax)
        - if N is the number of classes, eg., N=3
        - y-hat is a tensor with N elements
            eg., y-hat = [-5.2,4.6,0.8]

**one-hot encoding** concepts -
1. convert an integer y to a tensor of zeros and ones
2. example - if y = 0 with three classes the encoded form is 1,0,0

In [25]:
# transforming labels with one-hot encoding
import torch.nn.functional as F
print(F.one_hot(torch.tensor(0), num_classes = 3))

tensor([1, 0, 0])


In [26]:
print(F.one_hot(torch.tensor(1), num_classes = 3))

tensor([0, 1, 0])


In [27]:
print(F.one_hot(torch.tensor(2), num_classes = 3))

tensor([0, 0, 1])


**cross entropy loss in PyTorch** - the most commonly used loss function for classification 
- after encoding we pass our prediction y-hat and the encoded target to a loss function
- here y-hat is stored in the tensor `scores`

In [28]:
from torch.nn import CrossEntropyLoss
scores = torch.tensor([-5.2,4.6,0.8])
one_hot_target = torch.tensor([1,0,0])
criterion = CrossEntropyLoss()
print(criterion(scores.double(),one_hot_target.double())) # output is the computed loss value

tensor(9.8222, dtype=torch.float64)


summary - 
1. loss function takes :
    - **scores** - model predictions before the softmax function
    - **one_hot_target** - one hot encoded ground truth label
2. loss function outputs :
    - loss - a single float
    - our goal is to minimize this loss 
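
For a one-hot target, cross entropy reduces to the negative log of the softmax probability assigned to the correct class. A plain-Python sketch that reproduces the value printed above (variable names are illustrative):

```python
import math

scores = [-5.2, 4.6, 0.8]  # raw model outputs (before softmax)
true_class = 0             # index of the 1 in the one-hot target [1, 0, 0]

# softmax probability of the correct class
exp_scores = [math.exp(s) for s in scores]
prob_true = exp_scores[true_class] / sum(exp_scores)

# cross entropy = -log(probability of the correct class)
loss = -math.log(prob_true)
print(round(loss, 4))
```

The loss is large here because the model gave the correct class (index 0) the lowest score of the three.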

### 6. Minimizing Loss - using derivatives to update model parameters

- we can use derivatives, or gradients, to minimize this loss
- **derivatives** represent the slope of the curve
    - steep slopes - large steps, derivative is high
    - gentler slopes - small steps, derivative is low
    - valley floor - flat, derivative is zero (we aim to reach this)
- **convex and non-convex functions**
    - convex functions have one global minimum
    - non-convex functions have more than one local minimum - points whose values are lower than nearby points but not necessarily the lowest overall
    - when minimizing loss functions we aim to locate the global minimum (where x=1 in the example curve)
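
A minimal sketch of following the derivative downhill, using the made-up convex function f(x) = (x - 1)^2, whose minimum sits at x = 1. The starting point, step size and number of steps are arbitrary choices:

```python
def derivative(x):
    """Derivative of f(x) = (x - 1)**2."""
    return 2 * (x - 1)

x = 5.0             # arbitrary starting point
learning_rate = 0.1

for _ in range(100):
    # step against the slope: steep slope gives a large step, a flat floor gives no movement
    x = x - learning_rate * derivative(x)

print(x)
```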

connecting derivatives and model training
- during training we run a forward pass on the features and compute the loss by comparing predictions to the target values
- recall that weights and biases are randomly assigned when a model is created; we update them during training using a backward pass, or **backpropagation**
- In deep learning, derivatives are known as gradients. We compute the gradients of the loss function and use them to update the model's parameters (weights and biases), repeating until the layers are tuned. 

**Backpropagation** : 
- consider a network with three linear layers (l0, l1, l2); we can calculate the local loss gradients with respect to each layer's parameters.
  - begin with the loss gradients for l2
  - use the l2 gradients to compute the l1 gradients
  - repeat for all remaining layers (l1, l0)


In [None]:
# backpropagation in pytorch

# run a forward pass
model = nn.Sequential(nn.Linear(16,8),
                      nn.Linear(8,4),
                      nn.Linear(4,2))
sample = torch.randn(1,16) # hypothetical input: one sample with 16 features
target = torch.tensor([0]) # hypothetical class label for that sample
prediction = model(sample)

# calculate loss and gradients
criterion = nn.CrossEntropyLoss()
loss = criterion(prediction,target)
loss.backward() # computes the gradients of this loss, stored in the .grad attribute of each layer's weights and biases
# each layer in the model can be indexed starting from 0 to access its weights, biases and gradients

In [None]:
# updating model parameters manually
# learning rate is typically small 
lr = 0.001
# torch.no_grad() stops autograd from tracking the update itself
with torch.no_grad():
    # access the first layer's weight and its gradient
    weight = model[0].weight
    weight_grad = model[0].weight.grad
    # multiply the gradient by the learning rate and subtract this product from the weight (in place)
    weight -= lr * weight_grad
    # update the bias the same way
    bias = model[0].bias
    bias_grad = model[0].bias.grad
    bias -= lr * bias_grad

gradient descent - an iterative method that searches for the minimum of a loss function
1. for non-convex functions, we will use gradient descent
2. pytorch simplifies this with optimizers
    - the most common one is stochastic gradient descent (SGD) 

In [None]:
import torch.optim as optim 
# create the optimizer
optimizer = optim.SGD(model.parameters(),lr=0.001)
# perform parameter updates
optimizer.step()

In [None]:
# Create the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001)

loss = criterion(prediction, target)
loss.backward()

# Update the model's parameters using the optimizer
optimizer.step()
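
For plain SGD (no momentum or weight decay), `optimizer.step()` performs the same `weight - lr * grad` update as the manual version. A small check on a hypothetical single layer with random data:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
layer = nn.Linear(3, 1)
optimizer = optim.SGD(layer.parameters(), lr=0.1)

inputs = torch.randn(4, 3)   # hypothetical batch
targets = torch.randn(4, 1)  # hypothetical targets

loss = nn.MSELoss()(layer(inputs), targets)
loss.backward()

# what the update should be, computed by hand before stepping
expected_weight = layer.weight.detach() - 0.1 * layer.weight.grad

optimizer.step()
print(torch.allclose(layer.weight, expected_weight))
```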

In [34]:
import pandas as pd

data = {
    'animal_name': ['sparrow', 'eagle', 'cat', 'dog', 'lizard'],
    'hair': [0, 0, 1, 1, 0],
    'feathers': [1, 1, 0, 0, 0],
    'eggs': [1, 1, 0, 0, 1],
    'milk': [0, 0, 1, 1, 0],
    'predator': [0, 1, 1, 0, 1],
    'legs': [2, 2, 4, 4, 4],
    'tail': [1, 1, 1, 1, 1],
    'type': [0, 0, 1, 1, 2] # type categories - bird(0), mammal(1), reptile(2)
}

animals = pd.DataFrame(data)
animals.to_csv('animal_dataset.csv', index=False)


In [40]:
animals.head()

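|   | animal_name | hair | feathers | eggs | milk | predator | legs | tail | type |
| - | ----------- | ---- | -------- | ---- | ---- | -------- | ---- | ---- | ---- |
| 0 | sparrow     | 0    | 1        | 1    | 0    | 0        | 2    | 1    | 0    |
| 1 | eagle       | 0    | 1        | 1    | 0    | 1        | 2    | 1    | 0    |
| 2 | cat         | 1    | 0        | 0    | 1    | 1        | 4    | 1    | 1    |
| 3 | dog         | 1    | 0        | 0    | 1    | 0        | 4    | 1    | 1    |
| 4 | lizard      | 0    | 0        | 1    | 0    | 1        | 4    | 1    | 2    |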


In [38]:
import numpy as np
# define input features
features = animals.iloc[:,1:-1]
X = features.to_numpy()
print(X)

[[0 1 1 0 0 2 1]
 [0 1 1 0 1 2 1]
 [1 0 0 1 1 4 1]
 [1 0 0 1 0 4 1]
 [0 0 1 0 1 4 1]]


In [39]:
features

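|   | hair | feathers | eggs | milk | predator | legs | tail |
| - | ---- | -------- | ---- | ---- | -------- | ---- | ---- |
| 0 | 0    | 1        | 1    | 0    | 0        | 2    | 1    |
| 1 | 0    | 1        | 1    | 0    | 1        | 2    | 1    |
| 2 | 1    | 0        | 0    | 1    | 1        | 4    | 1    |
| 3 | 1    | 0        | 0    | 1    | 0        | 4    | 1    |
| 4 | 0    | 0        | 1    | 0    | 1        | 4    | 1    |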


In [45]:
# define target values (ground truth)
target = animals.iloc[:,-1]
y = target.to_numpy()
print(y)

[0 0 1 1 2]


In [47]:
import torch
from torch.utils.data import TensorDataset
# instantiate dataset class
# torch.tensor() - converts the feature matrix / labels into tensors
dataset = TensorDataset(torch.tensor(X), torch.tensor(y))
# access an individual sample - the first sample in the dataset
input_sample, label_sample = dataset[0]
print('input sample:', input_sample)
print('label sample:', label_sample)

input sample: tensor([0, 1, 1, 0, 0, 2, 1])
label sample: tensor(0)


In [50]:
# data loader - manage data loading
from torch.utils.data import DataLoader
batch_size = 2 # number of samples included in each batch (iteration)
shuffle = True # randomizes the data order at each epoch, which helps improve model generalization
# create a data loader
dataloader = DataLoader(dataset, batch_size = batch_size, shuffle=shuffle)

- epoch : one full pass through the training dataloader
- generalization : model performs well with unseen data

In [51]:
# iterating through the dataloader
# each element of the dataloader is a tuple, which we unpack as batch_inputs and batch_labels
for batch_inputs, batch_labels in dataloader : 
    print('batch_inputs:', batch_inputs)
    print('batch_labels:',batch_labels)

batch_inputs: tensor([[1, 0, 0, 1, 0, 4, 1],
        [0, 1, 1, 0, 0, 2, 1]])
batch_labels: tensor([1, 0])
batch_inputs: tensor([[0, 1, 1, 0, 1, 2, 1],
        [1, 0, 0, 1, 1, 4, 1]])
batch_labels: tensor([0, 1])
batch_inputs: tensor([[0, 0, 1, 0, 1, 4, 1]])
batch_labels: tensor([2])


since our dataset contains 5 animals and we set batch size = 2, each iteration randomly selects two animals and their corresponding labels; the last batch contains only the one remaining animal.

## Training a neural network

writing our first training loop
1. create a model
2. choose a loss function
3. define a dataset
4. set an optimizer
5. run a training loop
    1. calculate loss (forward pass)
    2. compute gradients (backpropagation)
    3. update model parameters 

In [52]:
# creating dataset - data scientist salary 
import pandas as pd

data = {
    'experience_level': [0, 1, 2, 1, 2],
    'employment_type': [0, 0, 0, 0, 0],
    'remote_ratio': [0.5, 1.0, 0.0, 1.0, 1.0],
    'company_size': [1, 2, 1, 0, 1],
    'salary_in_usd': [0.036, 0.133, 0.234, 0.076, 0.170]
}

df = pd.DataFrame(data)
df.to_csv('data_science_salaries.csv', index=False)


our dataset - 
- features : categorical, target : salary (USD)
- since the target is continuous - this is a regression problem
- for regression we use a linear layer as the final output instead of sigmoid or softmax
- we also apply a regression-specific loss function, as cross entropy is only used for classification tasks
- we will use mean squared error (MSE) loss as the loss function for regression

**Mean Squared Error Loss (MSE)**
- MSE loss is the mean of the squared difference between predictions and ground truth
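
A worked example with made-up numbers: the errors are 1, 0 and -2, so the squared errors are 1, 0 and 4 and their mean is 5/3, roughly 1.667. The sketch checks the hand calculation against `nn.MSELoss`:

```python
import torch
import torch.nn as nn

prediction = torch.tensor([2.0, 3.0, 5.0])  # hypothetical predictions
target = torch.tensor([1.0, 3.0, 7.0])      # hypothetical ground truth

manual_mse = ((prediction - target) ** 2).mean()
torch_mse = nn.MSELoss()(prediction, target)

print(manual_mse.item(), torch_mse.item())
```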

In [None]:
# mean squared error loss (mse) - python version 
import numpy as np

def mean_squared_loss(prediction, target):
    return np.mean((prediction - target)**2)

In [None]:
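# mean squared error loss (mse) - pytorch version
criterion = nn.MSELoss()
# prediction and target must be float tensors of the same shape
prediction = torch.tensor([[0.05]]) # hypothetical model output
target = torch.tensor([[0.036]]) # hypothetical ground truth
loss = criterion(prediction, target)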

In [None]:
# putting everything together 
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
# the salary DataFrame df holds our features and target
# we convert them to numpy arrays and pass them to TensorDataset to organise features and targets into the right data types
# model parameters are float32, so the tensors are cast with .float()
features = df.iloc[:,:-1].to_numpy() # all columns except last
target = df.iloc[:,-1].to_numpy() # last column
# reshape the target to (n_samples, 1) so it matches the shape of the model output
dataset = TensorDataset(torch.tensor(features).float(),
                        torch.tensor(target).float().unsqueeze(1))
# load the dataset 
dataloader = DataLoader(dataset,batch_size=4,shuffle=True)
# create the model
# dataset has 4 input features and 1 target output
# we won't need one hot encoding as this is a regression problem
model = nn.Sequential(nn.Linear(4,2),
                      nn.Linear(2,1))
# create the loss and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(),lr=0.001)

In scikit-learn, the training loop is wrapped in the .fit() method, while in PyTorch, it's set up manually. While this adds flexibility, it requires a custom implementation.

In [None]:
# the training loop - looping through the entire dataset once is called an 'epoch'
# we train over multiple epochs
num_epochs = 10 # hypothetical value; choose it for your problem
for epoch in range(num_epochs):
    for data in dataloader: # within each epoch, loop through the dataloader 
        # each iteration of the dataloader provides a batch of samples
        # set the gradients to zero 
        optimizer.zero_grad() # gradients accumulate across backward passes by default
        # get feature and target from the dataloader 
        feature, target = data # feature for forward pass and target for loss calculation
        # run a forward pass 
        pred = model(feature)
        # compute loss and gradients
        loss = criterion(pred,target)
        loss.backward()
        # update the parameters
        optimizer.step() 

## ReLU activation functions

- some activation functions can shrink gradients too much, making training inefficient
- so far we have worked with sigmoid and softmax.

1. **limitations of sigmoid function** - used for binary classification
    - outputs are bounded between 0 and 1, and it is usable anywhere in a network
    - but the gradients are very small for large and small values of x
    - this causes saturation. During backpropagation this becomes problematic because each gradient depends on the previous one: when gradients are very small they fail to update the weights effectively. This is known as the **vanishing gradients problem**, and it makes training deep networks very difficult

2. **limitations of softmax function** - used for multi-class classification
    - outputs bounded between 0 and 1
    - has the same saturation issue

therefore, neither of these activation functions is ideal for hidden layers; both are best used in the last layer only.

**Rectified Linear Unit (ReLU)**
- outputs the maximum of its input and 0 - *f(x) = max(x,0)*
- for positive inputs : output equals input
- for negative inputs : output is 0
- this function has no upper bound
- gradients do not approach 0 for large values of x, which helps overcome the vanishing gradients problem
- in pytorch - *relu = nn.ReLU()*
- it's a reliable activation function for many deep learning tasks
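
The saturation difference can be checked with autograd. This sketch compares the gradients of sigmoid and ReLU at an arbitrarily chosen large input, x = 10:

```python
import torch

x_sigmoid = torch.tensor(10.0, requires_grad=True)
torch.sigmoid(x_sigmoid).backward()

x_relu = torch.tensor(10.0, requires_grad=True)
torch.relu(x_relu).backward()

# sigmoid has saturated, so its gradient is tiny; ReLU's gradient stays at 1
print(x_sigmoid.grad.item(), x_relu.grad.item())
```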

**Leaky ReLU**
- a variation of the ReLU function.
- for positive inputs it behaves exactly like ReLU
- for negative inputs - the input is scaled by a small coefficient (default 0.01 in pytorch)
- this ensures the gradients for negative inputs are non-zero - preventing neurons from completely ceasing to learn, which can happen with standard ReLU.
- In pytorch :
    leaky_relu = nn.LeakyReLU(negative_slope = 0.05)
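
A quick check of both activations on a few hand-picked inputs (the values -2, 0 and 3 are arbitrary):

```python
import torch
import torch.nn as nn

inputs = torch.tensor([-2.0, 0.0, 3.0])

relu = nn.ReLU()
leaky_relu = nn.LeakyReLU(negative_slope=0.05)

relu_out = relu(inputs)         # negative input becomes 0
leaky_out = leaky_relu(inputs)  # negative input is scaled by 0.05

print(relu_out, leaky_out)
```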

## Learning rate and momentum

- training a neural network = solving an optimization problem: minimizing the loss function by adjusting the parameters
- to do this we use an algorithm called the **Stochastic Gradient Descent (SGD)** optimizer
- the optimizer is used to find the global minimum of the loss function
- the optimizer takes the model parameters along with two key arguments -
    - **learning rate** : controls the step size of updates
    - **momentum** : adds inertia to avoid getting stuck

for example, say we run our optimizer on a non-convex function - 
- lr = 0.01, momentum = 0: after 100 steps the minimum found is at x = -1.23 and y = -0.14
- it got stuck in the first dip of the function (the curve has several U-shaped dips), which is not its global minimum
- lr = 0.01 and momentum = 0.9: the optimizer can find the minimum - momentum keeps the step large if the previous steps were large
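
The effect of momentum can be seen in the update rule itself. The sketch below implements a simple momentum update by hand (velocity = momentum x velocity + gradient, then x = x - lr x velocity), which we assume mirrors the default behaviour of `optim.SGD`; the function f(x) = x^2 and the hyperparameters are arbitrary:

```python
def run_sgd(momentum, steps=2, lr=0.1, x=1.0):
    """Minimize f(x) = x**2 with a hand-written momentum update."""
    velocity = 0.0
    for _ in range(steps):
        grad = 2 * x                           # derivative of x**2
        velocity = momentum * velocity + grad  # remember part of the previous step
        x = x - lr * velocity
    return x

x_plain = run_sgd(momentum=0.0)     # 1.0 -> 0.8 -> 0.64
x_momentum = run_sgd(momentum=0.9)  # 1.0 -> 0.8 -> 0.46

print(x_plain, x_momentum)
```

After the same two steps, the run with momentum has travelled further because the second step carries over part of the first one.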

In [None]:
# stochastic gradient descent 
sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.95)

**summary**

| **Learning Rate**                               | **Momentum**                     |
| ----------------------------------------------- | -------------------------------- |
| Controls the step size                          | Controls the inertia             |
| Too high → poor performance                     | Helps escape local minimum       |
| Too low → slow training                         | Too small → optimizer gets stuck |
| Typical range: `0.01 (10⁻²)` to `0.0001 (10⁻⁴)` | Typical range: `0.85 to 0.99`    |


## Layer Initialization and transfer learning

- Data normalization scales input features for stability; similarly, the weights of linear layers are initialized to small values, which is known as **layer initialization**

In [62]:
import torch.nn as nn
layer = nn.Linear(64,128)
print(layer.weight.min(), layer.weight.max())

tensor(-0.1250, grad_fn=<MinBackward1>) tensor(0.1250, grad_fn=<MaxBackward1>)


- the weights are between -0.1250 and 0.1250.
- the output of a neuron in a linear layer is a weighted sum of the inputs from the previous layer
- keeping both the input data and layer weights small ensures stable outputs, preventing extreme values that can slow training
- layers can be initialized in different ways (an active area of research)
- pytorch provides an easy way to initialize layer weights using the nn.init module. 

In [64]:
import torch.nn as nn
layer = nn.Linear(64,128)
nn.init.uniform_(layer.weight)
print(layer.weight.min(), layer.weight.max())

tensor(0.0006, grad_fn=<MinBackward1>) tensor(0.9998, grad_fn=<MaxBackward1>)


here we have initialized the layer with a uniform distribution; the weight values range between 0 and 1.
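
The range of -0.1250 to 0.1250 seen earlier is consistent with the default initialization of `nn.Linear`, which, as far as we can tell, draws weights uniformly from plus or minus 1/sqrt(in_features), i.e. 1/sqrt(64) = 0.125 here. `nn.init.uniform_` also accepts explicit bounds `a` and `b`; the range of -0.05 to 0.05 below is an arbitrary choice:

```python
import math
import torch.nn as nn

layer = nn.Linear(64, 128)
default_bound = 1 / math.sqrt(layer.in_features)  # 0.125
default_max = layer.weight.abs().max().item()

# re-initialize with a custom, narrower uniform range
nn.init.uniform_(layer.weight, a=-0.05, b=0.05)
custom_max = layer.weight.abs().max().item()

print(default_bound, default_max, custom_max)
```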

**Transfer Learning** - 
- takes a model that was trained on a first task and re-uses it for a second, similar task
- for example - we trained a model on US data scientist salaries; now, instead of starting another model with randomly initialized weights, we can load the weights from the first model and use them as the starting point for training on the new dataset
- torch.save() - used for saving weights from the previous model
- torch.load() - used for loading the weights from the previous model
- the save and load functions work with any kind of pytorch object

In [None]:
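import torch
import torch.nn as nn
layer = nn.Linear(64,128)
torch.save(layer,'layer.pth')
# assumption: recent PyTorch versions default to weights_only=True, which refuses to unpickle a full layer object
new_layer = torch.load('layer.pth', weights_only=False)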
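
A commonly recommended alternative is to save only the layer's `state_dict` (its weights and biases) and load it into a freshly created layer of the same shape. A sketch, using an in-memory buffer instead of a file so nothing is written to disk:

```python
import io
import torch
import torch.nn as nn

layer = nn.Linear(64, 128)

# save only the parameters, not the whole Python object
buffer = io.BytesIO()  # stands in for a file path such as 'layer_weights.pth' (hypothetical name)
torch.save(layer.state_dict(), buffer)

# load the parameters into a new layer with the same shape
buffer.seek(0)
new_layer = nn.Linear(64, 128)
new_layer.load_state_dict(torch.load(buffer))

print(torch.equal(layer.weight, new_layer.weight))
```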

**Fine Tuning** 
- a type of transfer learning
- load weights from a previously trained model, but train the model with a smaller learning rate
- we can even train only part of the network (we freeze the other layers)
- rule of thumb : freeze the early layers of the network and fine-tune the layers closer to the output layer
- this can be done by setting each frozen parameter's requires_grad attribute to False

In [None]:
import torch.nn as nn
model = nn.Sequential(nn.Linear(64,128),
                      nn.Linear(128,256))
for name, param in model.named_parameters():
    if name == '0.weight':
        param.requires_grad = False

in this case we have used the model.named_parameters() method, which returns each parameter's name together with the parameter itself, and we set requires_grad to False for the first layer's weight (its bias, '0.bias', is left trainable)
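
To freeze the whole first layer (weights and bias), one option is to match every parameter name that starts with `'0.'`. The sketch then counts the parameters that remain trainable:

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128),
                      nn.Linear(128, 256))

for name, param in model.named_parameters():
    # names look like '0.weight', '0.bias', '1.weight', '1.bias'
    if name.startswith('0.'):
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(trainable, frozen)
```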

## Evaluating Model Performance

- A dataset is typically split into three subsets - 
    1. Training - 80-90% data - adjust model parameters (weights and biases) 
    2. Validation - 10-20% data - tunes hyperparameters (learning rate and momentum) 
    3. Test - 5-10% - evaluates final model performance
- Track loss and accuracy during training and validation 
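
One way to create such a split is `torch.utils.data.random_split`. The dataset of 100 random samples and the 80/10/10 proportions below are placeholders for illustration:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# hypothetical dataset: 100 samples with 4 features and one target each
dataset = TensorDataset(torch.randn(100, 4), torch.randn(100, 1))

train_set, val_set, test_set = random_split(
    dataset, [80, 10, 10], generator=torch.Generator().manual_seed(42)
)
print(len(train_set), len(val_set), len(test_set))
```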

**Calculating Training loss** 
for each epoch : 
- sum the loss across all batches in the training dataloader
- at the end of each epoch we compute the mean training loss by dividing the total loss by number of batches 

In [None]:
# we begin by setting training loss to 0 
training_loss = 0.0 
# iterate through the train loader
for inputs, labels in trainloader:
    # run the forward pass
    outputs = model(inputs)
    # compute the loss
    loss = criterion(outputs,labels)
    # backpropagation 
    loss.backward() # compute gradients
    optimizer.step() # update the weights
    optimizer.zero_grad() # reset gradients
    # calculate and sum the loss
    training_loss += loss.item()
# after the loop, compute the mean training loss for the epoch
epoch_loss = training_loss / len(trainloader)

**Calculating validation loss** 


In [None]:
validation_loss = 0.0
# put the model in evaluation mode
model.eval()
# disable gradients for efficiency (Since we don't update weights during validation)
with torch.no_grad():
    # iterate through validation data loader
    for inputs, labels in validationloader:
        # run the forward pass
        outputs = model(inputs)
        # calculate the loss
        loss = criterion(outputs, labels)
        validation_loss += loss.item()

epoch_loss = validation_loss / len(validationloader) # compute mean validation loss
# switch back to training mode - preparing it for next training epoch
model.train()

- keeping track of validation loss and training loss helps us detect overfitting
- when a model overfits, training loss keeps decreasing but validation loss starts to rise; this means the model is fitting the training data too closely and will not perform well on new data
- loss tells us whether our model is learning, but a low loss doesn't always mean accurate predictions, so we also track accuracy 

**Calculating accuracy with torchmetrics** 

In [None]:
import torchmetrics 
# create accuracy metric
metric = torchmetrics.Accuracy(task="multiclass",num_classes=3)
# as the model processes each batch we update this metric using its predictions and the actual labels
for features, labels in dataloader:
    outputs = model(features) # forward pass
    # update the metric for this batch (argmax is needed only because the labels are one-hot encoded)
    # the model outputs one score per class; the metric picks the class with the highest score
    # labels.argmax(dim=-1) converts the one-hot encoded labels into class indices before passing them to the metric
    metric.update(outputs, labels.argmax(dim=-1))

# at end of each epoch we calculate final accuracy 
accuracy = metric.compute()

# reset metric for the next epoch
metric.reset()
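
If torchmetrics is not installed, the same quantity can be computed by hand: take the predicted class with `argmax` and compare it with the true class index. The scores and labels below are made up:

```python
import torch

# hypothetical model outputs for 4 samples and 3 classes
outputs = torch.tensor([[2.0, 0.1, 0.3],
                        [0.2, 1.5, 0.1],
                        [0.3, 0.2, 2.2],
                        [1.9, 0.4, 0.6]])
labels = torch.tensor([0, 1, 2, 1])  # true class indices (the last sample is misclassified)

predicted = outputs.argmax(dim=-1)
accuracy = (predicted == labels).float().mean().item()
print(accuracy)
```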

### Fighting Overfitting

- **overfitting** - the model does not generalize to unseen data

- to avoid overfitting
    - reducing model size or adding dropout layer
    - using weight decay to force parameters to remain small
    - obtaining new data or augmenting data

a common way to avoid overfitting is to add a dropout layer to our neural network - a form of **regularization**
- randomly zeros out elements of the input tensor during training - preventing the model from becoming too dependent on specific features
- dropout layers are added after the activation functions
- behaves differently in training vs. evaluation - use model.train() to enable dropout for training and model.eval() to disable it during evaluation.

In [67]:
model = nn.Sequential(nn.Linear(8,4),
                      nn.ReLU(),
                      nn.Dropout(p=0.5)) # p is the probability that each element is set to zero

features = torch.randn((1,8))
print(model(features))

tensor([[0.0000, 1.0263, 0.0120, 0.0000]], grad_fn=<MulBackward0>)
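
The difference between the two modes can be checked directly. In training mode, `nn.Dropout` zeroes elements and scales the survivors by 1/(1 - p); in evaluation mode it passes the input through unchanged. A sketch on a tensor of ones:

```python
import torch
import torch.nn as nn

dropout = nn.Dropout(p=0.5)
inputs = torch.ones(1, 1000)

dropout.train()              # training mode: dropout is active
train_out = dropout(inputs)  # each element is either 0 or 1 / (1 - 0.5) = 2

dropout.eval()               # evaluation mode: dropout is disabled
eval_out = dropout(inputs)

print(train_out.unique(), torch.equal(eval_out, inputs))
```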


the next strategy to reduce overfitting is **weight decay**, another form of regularization.
- in pytorch - weight decay is added to the optimizer using the weight_decay parameter - typically set to a very small value.
- this parameter adds a penalty to the loss function, encouraging smaller weights and helping the model generalize better 
- during backpropagation the gradient of this penalty (proportional to the weight itself) is added to the gradient, so each update shrinks the weights and prevents excessive weight growth
- the higher we set the weight decay, the stronger the regularization, making overfitting less likely

In [None]:
# weight_decay using pytorch
optimizer = optim.SGD(model.parameters(), lr=0.001, weight_decay=0.0001)
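
To isolate the effect of weight decay, the sketch below sets the gradient to zero by hand, so the only thing changing the parameter is the decay term. With `optim.SGD` the update then becomes w = w - lr x weight_decay x w; the large `lr` and `weight_decay` values are exaggerated to make the shrinkage visible:

```python
import torch
import torch.optim as optim

weight = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
optimizer = optim.SGD([weight], lr=0.1, weight_decay=0.5)

# pretend the loss gradient is zero so only weight decay acts
weight.grad = torch.zeros_like(weight)
optimizer.step()

print(weight.data)  # each weight shrinks by a factor of 1 - 0.1 * 0.5 = 0.95
```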

**Data Augmentation**
- collecting more data can be expensive, but there is a way to expand datasets artificially using data augmentation
- Data augmentation is commonly applied to image data, which can be rotated and scaled so that different views of the same face become available as "new" data points.

### Improving Model Performance

Steps to maximize performance - 
1. **Overfit the training set**
   - create a model that can overfit the training set; this ensures that the problem is solvable.
   - this also sets a performance baseline to aim for on the validation set 
2. **Reduce Overfitting**
   -  we need to reduce overfitting to increase performance on the validation set
3. **Fine-Tune the Hyperparameters**
    - we can slightly adjust the different hyperparameters to ensure we achieve the best possible performance

**Step 1 - overfit the training set**

In [None]:
# modify the training loop to overfit a single data point
features, labels = next(iter(dataloader))
for i in range(1000):
    outputs = model(features)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

- when the model is set up properly, it should quickly reach near-zero loss and 100% accuracy on that data point.
- once this step is successful, we scale up to the entire training set. At this stage, we use an existing model architecture large enough to overfit while keeping hyperparameters like the learning rate at their defaults.

**Step 2 - Reduce Overfitting**
- goal : maximize the validation accuracy
- experiment with -
    - dropout
    - data augmentation
    - weight decay
    - reducing model capacity
- keep track of the different parameters and the corresponding validation accuracy for each set of experiments.
- reducing overfitting often comes at a cost, as applying regularization can significantly impact model performance
- the original model overfits the training set, achieving high training accuracy but failing to generalize well to new data
- In contrast, with too much regularization, the updated model shows a drop in both training and validation accuracy, limiting its ability to learn effectively.
- This highlights the importance of balancing overfitting reduction strategies while closely monitoring key metrics to find the best performing model
- once we are satisfied with performance, the next step is to fine-tune hyperparameters.

**Step 3 - fine-tune hyperparameters**
- this is often done on optimizer settings like learning rate or momentum
- grid search - tests parameters at fixed intervals
- random search - instead of testing set values, it randomly selects them within a given range
    - random search is often more efficient, as it avoids unnecessary tests and increases the chance of finding the optimal settings

In [None]:
# grid search
for factor in range(2,6):       
    lr = 10 ** -factor

In [None]:
# random search
import numpy as np
factor = np.random.uniform(2,6)
lr = 10 ** -factor
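
Putting both strategies side by side: the sketch collects candidate learning rates instead of training a model. In practice each candidate would be used to train and validate the model; the number of random draws (5) and the seed are arbitrary, and the standard-library `random` module is used here in place of numpy:

```python
import random

# grid search: fixed exponents 2, 3, 4, 5
grid_lrs = [10 ** -factor for factor in range(2, 6)]

# random search: exponents drawn uniformly from [2, 6]
random.seed(0)
random_lrs = [10 ** -random.uniform(2, 6) for _ in range(5)]

print(grid_lrs)
print(random_lrs)
```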