# PyTorch Basics

### Table of Contents

 1. Using AutoGrad for Stochastic Gradient Descent
 2. Converting Data Between PyTorch and NumPy
 3. The Input Pipeline
 4. Running a Pre-Trained Model
 5. Saving and Loading Models


## Using AutoGrad for Stochastic Gradient Descent

> <b> What is AutoGrad? </b>
>> AutoGrad (automatic gradient) is the automatic differentiation technique that powers neural network training: while computing a value, it automatically constructs a procedure for computing the derivatives of that value.

> <b>What is Gradient Descent?</b>
>> It is an optimization algorithm that trains machine learning models by minimizing the error between predicted and actual results.

> <b>Why is the gradient important?</b>
>> As written by Goodfellow et al. (ch. 4.3), deep learning involves the optimization of an objective function. The gradient gives the direction with the highest rate of increase at a specific point. By taking the opposite direction (i.e., gradient descent), we can minimize the objective function (a.k.a. the loss function), which results in a relatively good set of parameters.

><b> How to import AutoGrad?</b>
>> AutoGrad is part of the torch module, so we only need to import torch to use it.

In [1]:
#import the torch module to use AutoGrad

import torch

> <b>How to use AutoGrad?</b>
>> As the name implies, AutoGrad works automatically in the background; however, we must specify which tensors it should track so that it records the history of operations applied to them.

In [15]:
#Create tensors specifying the gradient will be needed for later.
# A tensor can be created with requires_grad=True 
# so that torch.autograd records operations on them 
# for automatic differentiation.
x = torch.tensor(1.0, requires_grad = True)
w = torch.tensor(2.0, requires_grad = True)
b = torch.tensor(3.0, requires_grad = True)

#Print the current tensors made
print(f"x = {x}")
print(f"w = {w}")
print(f"b = {b}")

x = 1.0
w = 2.0
b = 3.0


> <b>torch.tensor</b>
- a multi-dimensional matrix containing elements of a single data type.
- always copies data
- "requires_grad = True" : torch.autograd records operations on the tensor for automatic differentiation

> How to prevent it from tracking the gradient/history
>> x.requires_grad_(False) <br>
>> x.detach() <br>
>> with torch.no_grad():
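
The three options listed above behave slightly differently, which a small experiment can make concrete. The following is a minimal sketch; the tensor names `a`, `b`, `c`, and `d` are illustrative only. Note that the in-place method carries a trailing underscore (`requires_grad_`).

```python
import torch

# Option 1: requires_grad_(False) switches tracking off in place
a = torch.tensor(1.0, requires_grad=True)
a.requires_grad_(False)

# Option 2: detach() returns a new tensor that is cut off from the graph
b = torch.tensor(2.0, requires_grad=True)
b_detached = b.detach()

# Option 3: operations inside torch.no_grad() are not recorded
c = torch.tensor(3.0, requires_grad=True)
with torch.no_grad():
    d = c * 2

print(a.requires_grad)           # False
print(b.requires_grad)           # True (the original tensor is unchanged)
print(b_detached.requires_grad)  # False
print(d.requires_grad)           # False
```

`detach()` leaves the original tensor untouched, so it is the option to reach for when the tracked version is still needed elsewhere.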

> <b>In what way do we "change" the tensors? </b>
>> We can apply forward propagation by using mathematical functions, such as a linear function, to combine our tensors into a resulting tensor. This works out well because an artificial neural network is nothing more than a mathematical function that maps our input to the output.

> <b> What is Forward Propagation</b>
>> Forward propagation is the process of feeding all of our input data into a model to get the resulting output (i.e., the prediction made by the model).

In [16]:
# Apply forward propagation via a simple linear equation 
# and save the result

y = w*x + b
print(y)

tensor(5., grad_fn=<AddBackward0>)


grad_fn: the gradient function, i.e., a reference to the backward function of the operation that created the tensor <br>
e.g., AddBackward0, MulBackward0, MeanBackward0

> <b>We have some history of changes made; now what? </b>
>> After applying forward propagation, the output tensor has some history, so we can determine the gradient by using backwards propagation.

> <b>What is backwards propagation?</b>
>> Backwards propagation is the process of comparing the output data to the correct labels and propagating that error backwards through the model to obtain the gradient of the error with respect to each parameter. A subset of the training data (a batch) is used to calculate a vector proportional to the negative gradient, allowing us to take a single gradient descent step that corrects the parameter values (i.e., lessens the error of the model on the training data).

><b>What is Stochastic Gradient Descent?</b>
>> Stochastic Gradient Descent is the repetition of this process (backwards propagation on a batch, followed by one gradient descent step) until a certain number of passes through the entire dataset (i.e., epochs) has been made.

In [8]:
# Perform backwards propagation from the y tensor
y.backward() # dy/dx, dy/dw, dy/db

In [10]:
# Print out the gradients
# y = w*x + b
print(x.grad) # dy/dx = w = 2.0
print(w.grad) # dy/dw = x = 1.0
print(b.grad) # dy/db = 1.0

tensor(2.)
tensor(1.)
tensor(1.)
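
To connect the single `backward()` call to the definition of Stochastic Gradient Descent, here is a hedged sketch of the full loop written by hand. The toy data (points on the line y = 2x + 1), the batch size of 5, the learning rate, and the number of epochs are all assumptions chosen for illustration, not values from this notebook.

```python
import torch

torch.manual_seed(0)

# Toy data generated from y = 2x + 1 (illustrative only)
xs = torch.linspace(-1.0, 1.0, 20)
ys = 2.0 * xs + 1.0

# Parameters to learn, tracked by autograd
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
lr = 0.1

for epoch in range(200):
    # Visit the samples in a new random order each epoch, 5 at a time
    perm = torch.randperm(len(xs))
    for start in range(0, len(xs), 5):
        idx = perm[start:start + 5]
        # Forward pass on one mini-batch
        loss = ((w * xs[idx] + b - ys[idx]) ** 2).mean()
        # Backward pass fills w.grad and b.grad
        loss.backward()
        # Gradient descent step, done without recording history
        with torch.no_grad():
            w -= lr * w.grad
            b -= lr * b.grad
        # Gradients accumulate across backward() calls, so reset them
        w.grad.zero_()
        b.grad.zero_()

print(w.item(), b.item())
```

The learned values should end up close to the slope and intercept used to generate the data. Resetting the gradients matters because `backward()` adds to `.grad` rather than overwriting it.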


In [18]:
# Create tensors of shape (10,3) and (10,2)
x = torch.randn(10,3)
y = torch.randn(10,2)

In [12]:
import torchvision
import torch.nn as nn
import numpy as np
import torchvision.transforms as transforms

In [19]:
# Build a fully connected layer
linear = nn.Linear(3,2)
print('w: ', linear.weight)
print('b: ', linear.bias)


w:  Parameter containing:
tensor([[-0.3066, -0.3411,  0.2897],
        [-0.1023,  0.1660,  0.4839]], requires_grad=True)
b:  Parameter containing:
tensor([0.0641, 0.4660], requires_grad=True)


> nn.Linear
- applies a linear transformation to the incoming data:
    $ y = xA^T + b$

- Parameters
    - in_features (int) : size of each input sample
    - out_features (int) : size of each output sample
    - bias (bool): If set to False, the layer will not learn an additive bias. Default: True

- Variables
    - weight (torch.Tensor - multi-dimensional matrix): the learnable weights of the module, of shape (out_features, in_features)
    - bias: the learnable bias of the module, of shape (out_features)
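
As a quick check of the formula $y = xA^T + b$, the layer output can be compared with the same expression computed by hand. This is a minimal sketch; the names `layer`, `inputs`, and `manual` are illustrative, and the random seed is only there to make the run repeatable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(3, 2)      # in_features=3, out_features=2
inputs = torch.randn(10, 3)  # 10 samples with 3 features each

# The same transformation written out by hand: x A^T + b
manual = inputs @ layer.weight.T + layer.bias

print(layer.weight.shape)  # torch.Size([2, 3])
print(layer.bias.shape)    # torch.Size([2])
print(torch.allclose(layer(inputs), manual, atol=1e-6))
```

The weight is stored with shape (out_features, in_features), which is why the transpose appears in the formula.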

> Linear Regression Algorithm <br>
- finds the most appropriate parameter values for predicting the dependent variable<br>
>> $\hat{y} = wx + b$ (w: weight, b: bias)
 - independent variable: x, dependent variable: $\hat{y}$

When we search for the parameters, we want the ones that give the smallest error.
To measure this error, we use the least squares method.

> What is the Least Squares Method?
>> the method that chooses the parameters giving the smallest value of $\sum(\text{real solution} - \text{approximated solution})^2$<br>

> Why does it need to be squared?
>> When we subtract the approximated solution from the real solution, the value can be positive or negative. Squaring removes the sign (-/+), so errors in opposite directions do not cancel out.

> Why does it need to be summed?
>> Each data point produces its own error, so we add up the errors over the whole dataset.

> 2 methods to find the linear approximation using the least squares method
>> 1. Solve directly for the weight and bias that minimize RSS (Residual Sum of Squares)
>> 2. Search iteratively for the point where the gradient vector of RSS is 0 (Gradient Descent)

> What is a Residual?
>> The difference obtained by subtracting the approximated value ($\hat{y}$) from the real value ($y$): <br>
$ y - \hat{y}$

> Calculate Weight and Bias
>> $ w = \frac{\sum(x-\bar{x})(y-\bar{y})}{\sum(x-\bar{x})^2}$<br>
>> $b = \bar{y} - w\bar{x}$
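
The closed-form formulas for the weight and bias can be tried on a few numbers. The sketch below uses plain Python and a toy dataset that lies exactly on the line y = 2x + 1; the data values are an assumption made for illustration.

```python
# Toy data lying exactly on y = 2x + 1 (illustrative values)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# w = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
w = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)

# b = y_bar - w * x_bar
b = y_bar - w * x_bar

# Residual sum of squares (RSS) of the fitted line
rss = sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys))

print(w, b, rss)  # 2.0 1.0 0.0
```

Because the data has no noise, the fitted line passes through every point and the RSS is zero; with real data the RSS would be positive.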


In [20]:
# Build loss function and optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(linear.parameters(), lr = 0.01)

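The loss function is the quantity we want to minimize, so we need the gradient of this loss. <br>
Consider a computational graph in which the inputs $x$ and $y$ feed a function $f$ that produces $z$, and $z$ passes through further operations until it reaches the loss: <br>
$(x, y) \rightarrow f \rightarrow z \rightarrow \dots \rightarrow Loss$ <br>

The chain rule then gives the gradient with respect to $x$: <br>
$\frac{dLoss}{dx} = \frac{dLoss}{dz} \cdot \frac{dz}{dx}$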
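
The chain rule above can be checked against what autograd computes. In this sketch the choices $f(x, y) = x \cdot y$ and $Loss = z^2$ are assumptions made only to have something concrete to differentiate.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

z = x * y          # intermediate node: dz/dx = y
z.retain_grad()    # keep the gradient of this non-leaf tensor for inspection
loss = z ** 2      # dLoss/dz = 2z

loss.backward()

# Chain rule by hand: dLoss/dx = dLoss/dz * dz/dx = 2z * y
manual = 2 * z.item() * y.item()
print(z.grad.item(), x.grad.item(), manual)  # 12.0 36.0 36.0
```

`retain_grad()` is only needed here because gradients of intermediate tensors are not kept by default.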


> What is the Loss function?
>> the loss (error) for a single data point<br>
>> $(\hat{y}-y)^2 = (wx + b - y)^2$

> What is the Cost function?
>> the mean of the loss (error) over every data point<br>
Thus, the cost function is the average of the loss function over all data points.<br>

>> MSE (Mean Squared Error)<br>
>> $ MSE = J(w,b) = \frac{1}{N}\sum_{i=1}^{N}(y_i - (wx_i + b))^2$
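
The distinction between the per-sample loss and the cost (MSE) can be seen by computing both by hand and comparing with `nn.MSELoss`. The prediction and target values below are made up for illustration.

```python
import torch
import torch.nn as nn

pred = torch.tensor([2.5, 0.0, 2.0])     # hat_y for three samples
target = torch.tensor([3.0, -0.5, 2.0])  # real values y

# Loss: squared error of each individual sample
per_sample = (pred - target) ** 2

# Cost: mean of the per-sample losses (MSE)
manual_mse = per_sample.mean()
builtin_mse = nn.MSELoss()(pred, target)

print(per_sample)
print(manual_mse.item(), builtin_mse.item())
```

With its default settings `nn.MSELoss` averages over all elements, so the two values should agree.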

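To minimize $J(w,b)$, each iteration applies the following update rules, where $\alpha$ is the learning rate and $dw$, $db$ are the partial derivatives of $J$ with respect to $w$ and $b$: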
> $ w = w_{old} - \alpha* dw$

> $ b = b_{old} - \alpha*db$

> learning rate (lr)
>> large: each step is big, so training might be fast, but it can jump over the minimum and never settle on it <br>
>> small: training takes longer, but the steps approach the minimum steadily
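
The update rules and the effect of the learning rate can be sketched in plain Python. The toy data (points on y = 2x + 1), the two learning rates, and the step counts are assumptions picked to show one run that converges and one that diverges; `mse` and `gradient_descent` are hypothetical helper names.

```python
def mse(w, b, xs, ys):
    """Mean squared error J(w, b) of the line w*x + b."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, lr, steps):
    """Apply w -= lr * dw and b -= lr * db, starting from w = b = 0."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of J(w, b) with respect to w and b
        dw = sum(-2 * x * (y - (w * x + b)) for x, y in zip(xs, ys)) / n
        db = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys)) / n
        w, b = w - lr * dw, b - lr * db
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # y = 2x + 1

start_loss = mse(0.0, 0.0, xs, ys)
small = gradient_descent(xs, ys, lr=0.05, steps=2000)  # small steps: converges
large = gradient_descent(xs, ys, lr=0.2, steps=50)     # large steps: overshoots

print("start loss:", start_loss)
print("lr=0.05:", small, mse(*small, xs, ys))
print("lr=0.2: ", large, mse(*large, xs, ys))
```

With the larger rate the loss ends up far above its starting value, which is the "jump a lot" behaviour described above; which rates count as too large depends on the data.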

1. Forward Pass & Compute Loss
2. Compute local gradients
3. Backward pass: Compute dLoss/dWeights using the Chain Rule

In [21]:
# Forward Pass
pred = linear(x)

# Compute Loss
loss = criterion(pred, y) # MSE, i.e., the cost function
# pred = hat_y (prediction), y = real solution
print('loss: ', loss.item()) # MSE; cost function

#Backward pass
loss.backward() # compute dLoss/dw

# print out the gradients
print('dL/dw: ', linear.weight.grad)
print('dL/db: ', linear.bias.grad)

loss:  1.1485817432403564
dL/dw:  tensor([[-0.4082,  0.1735, -0.2471],
        [ 0.3535,  0.1480,  0.3524]])
dL/db:  tensor([ 0.3624, -0.0277])


In [23]:
# 1-step gradient descent
optimizer.step()

# You can also perform gradient descent at the low level
# linear.weight.data.sub_(0.01*linear.weight.grad.data)
# linear.bias.data.sub_(0.01*linear.bias.grad.data)

# Print out the loss after 1-step gradient descent
pred = linear(x)
loss = criterion(pred, y)
print('loss after 1 step optimization: ', loss.item())

loss after 1 step optimization:  1.135599136352539
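
A single optimizer step only nudges the parameters. In practice the forward pass, backward pass, and step are repeated in a loop, and the gradients are cleared each time because `backward()` accumulates into `.grad`. The following is a hedged sketch with the same shapes as above; the 100 iterations and the seed are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same shapes as above: 10 samples, 3 input features, 2 outputs
x = torch.randn(10, 3)
y = torch.randn(10, 2)

linear = nn.Linear(3, 2)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(linear.parameters(), lr=0.01)

losses = []
for step in range(100):
    optimizer.zero_grad()           # clear gradients from the previous step
    loss = criterion(linear(x), y)  # forward pass and loss
    loss.backward()                 # backward pass
    optimizer.step()                # parameter update
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

This loop uses all 10 samples at every step; with a data loader, each step would use one mini-batch instead.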


In [24]:
# Loading data from numpy

#create a numpy array
x = np.array([[1,2],[3,4]])

# Convert the numpy array to a torch tensor
y = torch.from_numpy(x)

# Convert the torch tensor to a numpy array
z = y.numpy()
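
One detail worth knowing about this conversion is that `torch.from_numpy` shares memory with the source array, while `torch.tensor` copies it. A minimal sketch follows; the names `arr`, `shared`, `copied`, and `back` are illustrative.

```python
import numpy as np
import torch

arr = np.array([[1, 2], [3, 4]])

shared = torch.from_numpy(arr)  # shares memory with arr
copied = torch.tensor(arr)      # makes an independent copy

arr[0, 0] = 100                 # modify the NumPy array in place

print(shared[0, 0].item())  # 100
print(copied[0, 0].item())  # 1

back = shared.numpy()       # numpy is a method, so it needs parentheses
print(type(back))
```

If an independent array is needed, copying explicitly (for example with `torch.tensor`) avoids surprises from in-place changes.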

In [26]:
# Input pipeline

# Download and construct CIFAR-10 dataset
train_dataset = torchvision.datasets.CIFAR10(root = '../../data/', train = True, transform = transforms.ToTensor(), download = True)

# Fetch one data pair (read data from disk)
image, label = train_dataset[0]
print(image.size())
print(label)



Files already downloaded and verified
torch.Size([3, 32, 32])
6


In [29]:
# Data loader (this provides queues and threads in a very simple way)
train_loader = torch.utils.data.DataLoader(dataset = train_dataset, batch_size = 64, shuffle = True)

# When iteration starts, queue and thread start to load data from files
data_iter = iter(train_loader)

# Mini-batch images and labels
images, labels = next(data_iter)

# Actual usage of the data loader is as below.
for images, labels in train_loader:
    # Training code should be written here
    print(images)
    print(labels)

tensor([[[[0.2863, 0.2706, 0.2627,  ..., 0.2863, 0.2902, 0.3098],
          [0.2784, 0.2627, 0.2549,  ..., 0.2627, 0.2706, 0.2902],
          [0.2706, 0.2588, 0.2510,  ..., 0.2510, 0.2627, 0.2784],
          ...,
          [0.2510, 0.2510, 0.2549,  ..., 0.1608, 0.1569, 0.1529],
          [0.2431, 0.2392, 0.2431,  ..., 0.1608, 0.1569, 0.1490],
          [0.2314, 0.2275, 0.2314,  ..., 0.1529, 0.1451, 0.1451]],

         [[0.3176, 0.3059, 0.3020,  ..., 0.3333, 0.3373, 0.3529],
          [0.3098, 0.2980, 0.2902,  ..., 0.3059, 0.3137, 0.3294],
          [0.3020, 0.2902, 0.2863,  ..., 0.2902, 0.3020, 0.3216],
          ...,
          [0.2902, 0.2902, 0.2941,  ..., 0.1882, 0.1843, 0.1804],
          [0.2824, 0.2784, 0.2824,  ..., 0.1882, 0.1843, 0.1765],
          [0.2706, 0.2627, 0.2667,  ..., 0.1882, 0.1804, 0.1725]],

         [[0.1137, 0.1020, 0.0941,  ..., 0.0863, 0.0902, 0.0941],
          [0.1098, 0.1059, 0.1020,  ..., 0.0745, 0.0824, 0.0824],
          [0.1020, 0.1020, 0.1020,  ..., 0

In [None]:
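# Input pipeline for a custom dataset

# Build a custom dataset by subclassing torch.utils.data.Dataset
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, dataset):
        # Assumption: wrap an existing dataset object (e.g., train_dataset above)
        self.dataset = dataset

    def __getitem__(self, index):
        # Return one (image, label) pair for the given index
        image, label = self.dataset[index]
        return image, label

    def __len__(self):
        # Total number of samples; DataLoader needs this
        return len(self.dataset)


# The custom dataset can then be used with a data loader as before
custom_dataset = CustomDataset(train_dataset)
custom_loader = torch.utils.data.DataLoader(dataset = custom_dataset, batch_size = 64, shuffle = True)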
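
For a version that runs without downloading anything, the sketch below builds a custom dataset from random tensors. `RandomImageDataset` is a hypothetical class invented for illustration; the 3x32x32 image shape simply mirrors CIFAR-10, and the labels are random integers from 0 to 9.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class RandomImageDataset(Dataset):
    """Hypothetical dataset of random 3x32x32 'images' with labels 0-9."""

    def __init__(self, num_samples=100):
        generator = torch.Generator().manual_seed(0)
        self.images = torch.rand(num_samples, 3, 32, 32, generator=generator)
        self.labels = torch.randint(0, 10, (num_samples,), generator=generator)

    def __getitem__(self, index):
        # Return one (image, label) pair
        return self.images[index], self.labels[index]

    def __len__(self):
        # Number of samples in the dataset
        return len(self.images)


dataset = RandomImageDataset()
loader = DataLoader(dataset, batch_size=64, shuffle=True)

images, labels = next(iter(loader))
print(images.size())  # torch.Size([64, 3, 32, 32])
print(labels.size())  # torch.Size([64])
```

A map-style dataset needs only `__getitem__` and `__len__`; the data loader takes care of batching and shuffling.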
