# PyTorch for Linear Regression

We build a small regression model step by step, so we can understand how PyTorch works.

In [1]:
import torch
import numpy as np
import sys

In [2]:
torch.__version__

'1.13.0+cu117'

In [3]:
torch.cuda.is_available()

True

In [4]:
#as you all know, things can speed up if you have an NVIDIA GPU
#CUDA is the platform that NVIDIA develops, which allows us to use the GPU for calculations

#on a machine without CUDA (e.g. a Mac), the line below falls back to the CPU

device = torch.device("cuda:0" if (torch.cuda.is_available()) else "cpu")
device

device(type='cuda', index=0)

Plan for today:

1. ETL 
   1. Specify some input
   2. PyTorch Dataset and DataLoader
2. EDA / Feature Engineering / Cleaning - we are going to skip these because we are lazy (and this data doesn't need them)....
3. Modeling 
   1. `nn.Linear` (luckily, you already understand this!  Yay!)
   2. Define the loss function (MSE for regression, cross entropy for classification)
   3. Define the optimizer (gradient descent; Adam)
   4. Train the model
4. Inference / Testing

## 1. ETL

### 1.1 Specify some input

Consider this data:

<img src = "../figures/japan.png" width="500">

In a linear regression model, each target variable is estimated as a weighted sum of the input variables, offset by a constant known as the bias:

$$\text{yield}_\text{apple}  = w_{11} * \text{temp} + w_{12} * \text{rainfall} + w_{13} * \text{humidity} + b_{1}$$

$$\text{yield}_\text{orange} = w_{21} * \text{temp} + w_{22} * \text{rainfall} + w_{23} * \text{humidity} + b_{2}$$

Visually, it means that the yield of apples is a linear or planar function of temperature, rainfall and humidity:

<img src = "../figures/japan2.png" width="400">

The learning part of linear regression is to figure out a set of weights <code>w11, w12, ..., w23</code> and biases <code>b1, b2</code> using gradient descent.
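The two equations above can be written as a single matrix product. Here is a minimal sketch, assuming some made-up weights and biases (the values in `W` and `b` are hypothetical, chosen only to make the arithmetic easy to follow; they are not learned values):

```python
import torch

# Hypothetical weights and biases, only for illustration
W = torch.tensor([[0.5, 0.2, 0.1],    # w11, w12, w13 (apples)
                  [0.3, 0.4, 0.2]])   # w21, w22, w23 (oranges)
b = torch.tensor([1.0, 2.0])          # b1, b2

# One row of input: temp=73, rainfall=67, humidity=43
x = torch.tensor([[73.0, 67.0, 43.0]])

# Both yield equations at once: x @ W^T + b, one column per fruit
yields = x @ W.T + b
print(yields)
```

Each row of `W` holds the weights of one equation, which is why the transpose appears in the product.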

In [5]:
#X (temp, rainfall, hum)
X_train = np.array([[73, 67, 43], [91, 88, 64], [87, 134, 58], 
                   [102, 43, 37], [69, 96, 70], [73, 67, 43], 
                   [91, 88, 64], [87, 134, 58], [102, 43, 37], 
                   [69, 96, 70], [73, 67, 43], [91, 88, 64], 
                   [87, 134, 58], [102, 43, 37], [69, 96, 70]], 
                  dtype='float32')

# Targets (apples, oranges)
Y_train = np.array([[56, 70], [81, 101], [119, 133], 
                    [22, 37], [103, 119], [56, 70], 
                    [81, 101], [119, 133], [22, 37], 
                    [103, 119], [56, 70], [81, 101], 
                    [119, 133], [22, 37], [103, 119]], 
                   dtype='float32')

In [6]:
#please create tensors from these numpy arrays
#torch.from_numpy shares memory with the array (not a copy!); torch.tensor makes a copy
inputs  = torch.tensor(X_train)
targets = torch.tensor(Y_train)

#please print the shape of these tensors
#use either .size() or .shape
inputs.shape, targets.shape


(torch.Size([15, 3]), torch.Size([15, 2]))
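The difference between the two constructors is easy to check. A small sketch, using a throwaway array `a` that is not part of the notebook's data:

```python
import numpy as np
import torch

a = np.array([1.0, 2.0, 3.0], dtype="float32")

shared = torch.from_numpy(a)  # shares memory with `a`
copied = torch.tensor(a)      # makes an independent copy

a[0] = 100.0                  # modify the NumPy array in place

print(shared)  # first element follows the change made to `a`
print(copied)  # still holds the original values
```

For training data that we never modify in place, either constructor works; the copy made by `torch.tensor` is simply the safer default.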

### 1.2 Dataset

We are going to create a `TensorDataset` on top of these tensors, so we can access each row from inputs and targets as a tuple.   

Note:  `DataLoader` expects a dataset object rather than two separate tensors, so we need this step if we want to use `DataLoader`.

In [7]:
from torch.utils.data import TensorDataset

In [8]:
#put this dataset on top of our inputs and targets
#format: TensorDataset(X, y) where X.shape is (m, n) and y.shape is (m, k)
ds = TensorDataset(inputs, targets)

In [9]:
ds[1] #this is a tuple of two tensors, the x and the corresponding y
#this IS THE FORMAT that PyTorch wants!!!

(tensor([91., 88., 64.]), tensor([ 81., 101.]))
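To see what "the format" really means, here is a rough sketch of a hand-written dataset that behaves like `TensorDataset` for two tensors. The class name `PairDataset` is hypothetical; the only real requirements are `__len__` and `__getitem__`:

```python
import torch
from torch.utils.data import Dataset, TensorDataset

class PairDataset(Dataset):
    """Hypothetical stand-in that mimics TensorDataset for two tensors."""
    def __init__(self, X, y):
        assert len(X) == len(y), "X and y need the same number of rows"
        self.X, self.y = X, y

    def __len__(self):
        return len(self.X)               # number of samples

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]  # one (x, y) tuple

X = torch.tensor([[73., 67., 43.], [91., 88., 64.]])
y = torch.tensor([[56., 70.], [81., 101.]])

custom = PairDataset(X, y)
builtin = TensorDataset(X, y)
print(len(custom), custom[1])
print(len(builtin), builtin[1])
```

For plain tensors `TensorDataset` is enough; a custom class like this becomes useful when each sample has to be loaded or transformed on the fly.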

### 1.3 DataLoader

By default, PyTorch training works in batches (remember mini-batch gradient descent!).

In simple words, each step takes one mini-batch and performs gradient descent on it.

Why does PyTorch assume mini-batches? Because it assumes you won't be able to fit ~1M samples into your GPU RAM at once....(3, 4, 6, 11, 12, 64).

In [10]:
#this dataloader is an iterable that yields one batch at a time
#meaning, you can simply run a for loop over this dataloader
#if you DON'T WANT TO use this DataLoader, it's fine!  But you have
#to manually select the mini-batch (just like we do in our LR mini-batch class)
from torch.utils.data import DataLoader  #batches are randomized if you set shuffle=True

batch_size = 3  #this is any number you like
#too small then your code runs slow
#too big then you may get "out of memory" error

dl = DataLoader(ds, batch_size, shuffle=True)


In [11]:
#now, this dl is basically an iterable that we can loop over....

# for x, y in dl:
#     print(f"X: {x}")  
#     print(f"Y: {y}")
#     break

#each for loop over dl starts a fresh pass through the dataset (reshuffled if shuffle=True)
#one full pass is one "epoch", so during training we loop over dl once per epoch
#epochs means how many times we "exhaust" the whole dataset
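A quick way to convince yourself how batches and epochs relate is to count rows. This sketch uses toy tensors with the same shapes as our data (15 rows, 3 features, 2 targets); the names `loader` and `rows_seen` are only for illustration:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy data with the same shapes as the notebook's inputs and targets
X = torch.arange(45, dtype=torch.float32).reshape(15, 3)
y = torch.arange(30, dtype=torch.float32).reshape(15, 2)
loader = DataLoader(TensorDataset(X, y), batch_size=3, shuffle=True)

print(len(loader))  # number of batches in one epoch

for epoch in range(2):
    rows_seen = 0
    for xb, yb in loader:          # each `for` starts a fresh pass
        rows_seen += xb.shape[0]   # xb is (batch, features)
    print(f"epoch {epoch}: saw {rows_seen} rows")
```

With 15 rows and a batch size of 3, one epoch is five batches, and every row is visited once per epoch.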

## 2. EDA - skip because we are lazy

## 3. Modeling

### 3.1 Define our neural network

- How many linear layers do we want?

In [12]:
import torch.nn as nn #stands for neural network; a module that contains many possible layers
#define our neural network
#just use one layer....
#we are going to come back here and add one more layer....
#format: nn.Linear(in_features, out_features)
#format: nn.Linear(temp;rainfall;hum  ,  apples;oranges)

# model = nn.Linear(3, 2)

#linear layers are basically simple matrix multiplication (plus a bias)....
#Many other names:  in Keras, it is called Dense; in TensorFlow, fully connected

#Keras is very high-level - in my opinion, better for education than for research / development
#TensorFlow is developed by Google - it's quite good

#for very huge, complex, high performance models - TensorFlow can be better optimized
    #it is more low-level than PyTorch
#for almost any model that we generally use (even in research) - PyTorch is more convenient
# due to its dynamic computational graph.....

#TensorFlow supports something called TensorFlow Lite, which is what
#you want to use for mobile phones....

In [13]:
#I wonder whether having one extra layer can reduce the loss!
# model = nn.Sequential(
#     nn.Linear(3, 10),
#     nn.Linear(10, 2)
# )  #this is fine, but this is not the best practice!!
   #because in the future, there will be many layers and more complex stuff in a neural network....
   #note: without an activation in between, two linear layers still collapse into one linear map

In [14]:
#class is the perfect and the best practice for creating a neural network of any type...

#format:
'''
class AnyNameCapitalized(nn.Module): #it must inherit nn.Module
    def __init__(self):
        super().__init__()  #super() runs nn.Module's own init first
        #we define all the layers here.....
        
    def forward(self, x):   #YOU CANNOT CHANGE THIS NAME, it MUST BE "forward"
        x = self.layer1(x)
        x = self.layer2(x)
        return x
'''

class NeuralNetwork(nn.Module):
    
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out


In [15]:
model = NeuralNetwork(  3  ,  10   ,  2 )

In [16]:
#model.weight  #only exists when model is a single nn.Linear
#model.fc1.weight
#model.fc2.weight
# model.fc1.weight #by default, these weights are drawn uniformly from a small range around 0

In [17]:
# model.fc1.weight.shape  #this one is basically in the shape (out_features, in_features)

#you can imagine X @ W^T
#for a single nn.Linear(3, 2), W is [2, 3]; after you transpose W, W^T becomes [3, 2]
#which now you can do X @ W^T which is (anything, 3) @ (3, 2)
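The `X @ W^T` picture can be checked directly against a real layer. A minimal sketch, assuming a standalone `nn.Linear(3, 2)` and random inputs (the seed is only there to make the run repeatable):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)           # only so the run is repeatable
layer = nn.Linear(3, 2)        # in_features=3, out_features=2
X = torch.randn(15, 3)

print(layer.weight.shape)      # (out_features, in_features)
print(layer.bias.shape)        # one bias per output feature

# Reproduce the layer by hand: X @ W^T + b
manual = X @ layer.weight.T + layer.bias
print(torch.allclose(layer(X), manual, atol=1e-5))
```

Storing the weight as `(out_features, in_features)` means each row of the matrix belongs to one output.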

In [18]:
# model.bias  #only for a single nn.Linear(3, 2); why two biases??? - one per output
# model.fc1.bias

In [19]:
# list(model.parameters())  #this will list all the parameters (model.parameters() itself is a generator, so we wrap it in list)

In [20]:
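#p.numel() returns the number of elements in each parameter tensor
sum(p.numel() for p in model.parameters() if p.requires_grad)

#why 62 here??? - fc1 has 3*10 weights + 10 biases = 40, fc2 has 10*2 weights + 2 biases = 22
#(a single nn.Linear(3, 2) would give 8: 6 weights and 2 biases)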

62
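If you want to see where the total comes from, `named_parameters()` gives the breakdown per tensor. This sketch uses an `nn.Sequential` with the same layer sizes as our model, purely for brevity (so the printed names differ from `fc1` / `fc2`):

```python
import torch.nn as nn

# Same layer sizes as NeuralNetwork(3, 10, 2); Sequential is used only to keep this short
net = nn.Sequential(nn.Linear(3, 10), nn.ReLU(), nn.Linear(10, 2))

total = 0
for name, p in net.named_parameters():
    print(f"{name:10s} {tuple(p.shape)} -> {p.numel()}")
    total += p.numel()
print(total)
```

`nn.ReLU` has no parameters, so only the two linear layers contribute.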

In [21]:
#so how do we use our model
model

NeuralNetwork(
  (fc1): Linear(in_features=3, out_features=10, bias=True)
  (fc2): Linear(in_features=10, out_features=2, bias=True)
  (relu): ReLU()
)

In [22]:
#so you can perform a forward pass, simply using 
#format: model(inputs)

# print("Inputs: ", inputs.shape)

# output = model(inputs)  #(15, 3) @ (3, 10) -> (15, 10), then (15, 10) @ (10, 2) -> (15, 2)
# print(output.shape)  #why output.shape is 15, 2??
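To answer the shape question, it helps to run the layers one at a time. A small sketch with freshly created layers of the same sizes (random weights, random input; the names `h` and `out` are just for illustration):

```python
import torch
import torch.nn as nn

fc1, relu, fc2 = nn.Linear(3, 10), nn.ReLU(), nn.Linear(10, 2)
x = torch.randn(15, 3)   # 15 samples, 3 features

h = relu(fc1(x))         # hidden representation
out = fc2(h)             # one column per target
print(x.shape, h.shape, out.shape)
```

The batch dimension (15) passes through untouched; only the last dimension changes, from 3 to 10 to 2.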

### 3.2 Define the loss function

- Should we use MSE or cross entropy? Our targets are continuous (regression), so MSE.

In [23]:
#under the nn module, there are many loss functions
J_fn = nn.MSELoss()

#later on, you will know how to use this.....
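As a preview, here is what `nn.MSELoss` computes, checked against the formula by hand. The predictions in `yhat` are made up for the example; `y` holds the first two target rows:

```python
import torch
import torch.nn as nn

yhat = torch.tensor([[50., 60.], [80., 100.]])   # made-up predictions
y    = torch.tensor([[56., 70.], [81., 101.]])   # first two target rows

loss_fn = nn.MSELoss()               # default reduction is the mean
by_hand = ((yhat - y) ** 2).mean()   # mean over all 4 elements

print(loss_fn(yhat, y), by_hand)
```

Note that the mean runs over every element (rows and columns), not just over rows.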

### 3.3 Define the optimizer
- Gradient Descent

In [24]:
#normally, in sklearn, we simply call fit, and it does the training for us
#in code from scratch, we need to specify how we want to update the parameters
#the optimizer handles how we update the parameters
#   if we use w = w - alpha * gradient ==> gradient descent
#optimizers are handled by the `torch.optim` module
#torch.optim.SGD just applies the update using whatever gradients it finds,
#so with batches from our DataLoader it is effectively mini-batch gradient descent
optim = torch.optim.SGD(model.parameters(), lr=0.0001)
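To make the update rule concrete, this sketch compares one `step()` of plain SGD (no momentum, no weight decay) with `w - lr * gradient` computed by hand. It uses a standalone `nn.Linear(3, 2)` and a single data row; `layer`, `opt`, `expected_w` and `expected_b` are illustrative names only:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(3, 2)
lr = 0.0001
opt = torch.optim.SGD(layer.parameters(), lr=lr)

x = torch.tensor([[73., 67., 43.]])
y = torch.tensor([[56., 70.]])

loss = nn.MSELoss()(layer(x), y)
opt.zero_grad()
loss.backward()                      # fills .grad on every parameter

# What plain SGD should produce: w - lr * grad
expected_w = layer.weight.detach() - lr * layer.weight.grad
expected_b = layer.bias.detach() - lr * layer.bias.grad

opt.step()                           # in-place update of the parameters
print(torch.allclose(layer.weight, expected_w),
      torch.allclose(layer.bias, expected_b))
```

Other optimizers such as Adam keep extra state per parameter, so this simple equality only holds for plain SGD.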

### 3.4 Actually train the model

1. Predict
2. Loss
3. Gradient
4. Update the weights

In [25]:
# num_epochs = 5  #it depends....trial and error....
# #for num_epochs
# for epoch in range(num_epochs):
#     #for dataloader
#     for x, y in dl:  
#         #x: (batch, features) = (3, 3) 
#         #y: (batch, target)   = (3, 2)
        
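#         #optional: you can put your x and y on the GPU (CUDA) for speed up
#         #.to() returns a new tensor, so the result must be assigned,
#         #and the model must be moved too (model.to(device)) before the loop
#         x = x.to(device)  #device is either cpu or cuda
#         y = y.to(device)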

#         #1. predict (forward pass)
#         yhat = model(x)
        
#         #2. calculate loss
#         #like sklearn.metrics.mean_squared_error(y, yhat)
#         #format: J_fn(predictions, targets)
#         loss = J_fn(yhat, y)
        
#         #3. calculate gradient
#         #3.1 clear out the previous gradients
#         #format: optimizer.zero_grad()
#         optim.zero_grad()
        
#         #3.2 call backward() on loss to compute all the gradients (backpropagation/backward pass)
#         loss.backward()  #why call backward on loss!?
#         #backward DOES NOT adjust the weights YET....just backpropagation
#         #we want to calculate the gradients of all parameters (62 for our two-layer model)
#         #WITH RESPECT TO THE LOSS....dJ/dw11,  dJ/dw12, dJ/dw13....., dJ/db1, dJ/db2
        
#         #4. update the parameters using the optim
#         #W = W - alpha * gradient  #we don't need to do this here.....
#         optim.step()  #optim already has learning rate, it also know all the parameters

In [25]:
num_epochs = 10 
for epoch in range(num_epochs):
    for x, y in dl:  
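        #note: .to() returns a new tensor, so a bare x.to(device) has no effect;
        #this run stays on the CPU. For GPU training, call model.to(device) once
        #before the loop and use: x, y = x.to(device), y.to(device)

        yhat = model(x)        #1. predict
        loss = J_fn(yhat, y)   #2. calculate loss
        optim.zero_grad()      #3.1 clear out the previous gradients
        loss.backward()        #3.2 call backward() to compute the gradients
        optim.step()           #4. update the parameters using the optim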
        
        print(f"Epoch: {epoch} - Loss: {loss}")
        
        #can you guys help figure out the best hidden size?
        

Epoch: 0 - Loss: 7598.77685546875
Epoch: 0 - Loss: 3534.873046875
Epoch: 0 - Loss: 5347.306640625
Epoch: 0 - Loss: 13583.6806640625
Epoch: 0 - Loss: 5481.267578125
Epoch: 1 - Loss: 9452.2353515625
Epoch: 1 - Loss: 10786.2001953125
Epoch: 1 - Loss: 8439.3837890625
Epoch: 1 - Loss: 4776.49755859375
Epoch: 1 - Loss: 8265.47265625
Epoch: 2 - Loss: 11960.0087890625
Epoch: 2 - Loss: 5780.46728515625
Epoch: 2 - Loss: 5927.58056640625
Epoch: 2 - Loss: 6924.73828125
Epoch: 2 - Loss: 11047.0966796875
Epoch: 3 - Loss: 4014.132080078125
Epoch: 3 - Loss: 13403.4404296875
Epoch: 3 - Loss: 4740.02685546875
Epoch: 3 - Loss: 7222.46728515625
Epoch: 3 - Loss: 12218.046875
Epoch: 4 - Loss: 9428.4921875
Epoch: 4 - Loss: 5765.31298828125
Epoch: 4 - Loss: 11032.6845703125
Epoch: 4 - Loss: 8394.74609375
Epoch: 4 - Loss: 6941.064453125
Epoch: 5 - Loss: 8392.1806640625
Epoch: 5 - Loss: 8239.9755859375
Epoch: 5 - Loss: 6937.49658203125
Epoch: 5 - Loss: 4424.21142578125
Epoch: 5 - Loss: 13532.9296875
Epoch: 6 - 
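Per-batch losses jump around a lot with shuffled mini-batches, so it is often easier to watch the average loss per epoch. Below is a rough sketch of a device-aware loop that does this. It is not the notebook's exact setup: the data is synthetic and scaled to [0, 1], the learning rate of 0.05 is an assumption that suits that scale, and `net`, `loader` and `history` are illustrative names. One detail worth knowing: `.to(device)` returns a new tensor, so the result has to be assigned, and the model has to live on the same device as the data.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

torch.manual_seed(0)
# Synthetic stand-in data: 15 rows, 3 features in [0, 1], 2 linear targets
X = torch.rand(15, 3)
Y = X @ torch.tensor([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])
loader = DataLoader(TensorDataset(X, Y), batch_size=3, shuffle=True)

net = nn.Sequential(nn.Linear(3, 10), nn.ReLU(), nn.Linear(10, 2)).to(device)
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(net.parameters(), lr=0.05)  # assumed lr for this toy data

history = []
for epoch in range(20):
    running = 0.0
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)  # reassign: .to() is not in-place
        loss = loss_fn(net(xb), yb)
        opt.zero_grad()
        loss.backward()
        opt.step()
        running += loss.item() * xb.shape[0]   # weight by batch size
    history.append(running / len(loader.dataset))

print(history[0], history[-1])  # average loss in the first and last epoch
```

With raw inputs in the range of 40 to 130, as in our data, scaling the features first is usually what makes a learning rate like this workable.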

## 4. Inference / Testing

Test the model on some new data.

In [26]:
ds[0:2]

(tensor([[73., 67., 43.],
         [91., 88., 64.]]),
 tensor([[ 56.,  70.],
         [ 81., 101.]]))

In [27]:
#please create a numpy array with two rows:
# [74, 68, 42], [92, 88, 65]
x = np.array([[74, 68, 42], [92, 88, 65]], dtype='float32')
#float  means 32 bits
#double means 64 bits

#please make it a tensor
x_tensor = torch.tensor(x)

#then use our model to predict the number of oranges and apples
yhat = model(x_tensor)
yhat

#print the loss compared with ds[0] and ds[1] - we only need the y part
ytest = ds[0:2][1]
loss = J_fn(yhat, ytest)
print(loss)


tensor(6149.1826, grad_fn=<MseLossBackward0>)
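The `grad_fn=<MseLossBackward0>` in the output shows that autograd was still tracking the computation. For pure inference that bookkeeping is not needed. A minimal sketch of the usual pattern, using an untrained stand-in network `net` with the same layer sizes (so the predicted numbers themselves are meaningless):

```python
import torch
import torch.nn as nn

# Untrained stand-in with the same layer sizes as our model
net = nn.Sequential(nn.Linear(3, 10), nn.ReLU(), nn.Linear(10, 2))
x_new = torch.tensor([[74., 68., 42.], [92., 88., 65.]])

net.eval()                    # switch layers such as dropout to inference behaviour
with torch.no_grad():         # do not record operations for autograd
    preds = net(x_new)

print(preds.shape, preds.requires_grad)
```

Our model has no dropout or batch norm, so `eval()` changes nothing here, but calling it before inference is a cheap habit to keep.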
