<a href="https://colab.research.google.com/github/sdgroeve/D012554_Machine_Learning_2023/blob/main/01_logisitc_regression_in_pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Logistic regression in PyTorch


In [None]:
!pip install tqdm

[PyTorch](https://pytorch.org/) is an open source deep learning framework. 

In this notebook we will learn about a PyTorch training and evaluation workflow for fitting a logistic regression model on a toy dataset.

First, we import the required PyTorch libraries and fix the random seed.

In [None]:
import torch
from torch import nn 

torch.manual_seed(46)

# Check PyTorch version
torch.__version__

A typical PyTorch workflow involves:

- Preparing the data
- Building the model
- Fitting the model to the data (training)
- Computing predictions and evaluating the model
- Saving the model

Let's discuss these steps in more detail by fitting a logistic regression model.

## Preparing the data

The dataset is in a flat file called `dataset_logistic_regression.csv`. 

We read this file into a Pandas DataFrame.

In [None]:
import pandas as pd

data_path = "https://raw.githubusercontent.com/sdgroeve/D012554_Machine_Learning_2023/main/datasets/dataset_logistic_regression.csv"

dataset = pd.read_csv(data_path)

dataset.head()

The dataset has two features `x_1` and `x_2`, and one label `y`.

Let's plot this data. 

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.lmplot(x="x_1",y="x_2",hue="y",data=dataset,fit_reg=False)
plt.show()

We put the feature columns in a DataFrame called `X` and the label column in a Series called `y`.

In [None]:
y = dataset.pop('y')
X = dataset

A typical deep learning workflow would involve a train, a validation and a test split of the dataset.

Each split serves a specific purpose:

| Split | Purpose | Amount of total data | How often is it used? |
| ----- | ----- | ----- | ----- |
| **train set** | The model learns from this data. | ~60-80% | Always |
| **validation set** | The model gets tuned on this data. | ~10-20% | Often but not always |
| **test set** | The model gets evaluated on this data to test what it has learned. | ~10-20% | Always |

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

In PyTorch we work with Tensor representations of the dataset. A PyTorch Tensor is conceptually similar to a NumPy array: a generic n-dimensional array for arbitrary numeric computation. Unlike a NumPy array, a Tensor can also keep track of the computational graph and the gradients needed for deep learning, and it can live on a GPU.

To create a Tensor we need to first extract the NumPy data from the Pandas DataFrames.

In [None]:
X_train, X_val, X_test = X_train.values, X_val.values, X_test.values
y_train, y_val, y_test = y_train.values, y_val.values, y_test.values

In [None]:
X_train

Now we can create the Tensors.

In [None]:
# torch.tensor with an explicit dtype makes the float32 conversion visible
X_train, X_val, X_test = (torch.tensor(a, dtype=torch.float32) for a in (X_train, X_val, X_test))
y_train, y_val, y_test = (torch.tensor(a, dtype=torch.float32) for a in (y_train, y_val, y_test))
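
To make the relation between NumPy arrays and Tensors concrete, here is a minimal sketch. The array values and the names `features_np`, `features`, `w` and `total` are made up for illustration and are not part of the dataset.

```python
import numpy as np
import torch

# A small float64 NumPy array standing in for a feature matrix (made-up values)
features_np = np.array([[0.5, -1.2], [1.5, 0.3]])

# torch.tensor copies the data; float32 matches the default dtype of nn.Linear
features = torch.tensor(features_np, dtype=torch.float32)
print(features.shape, features.dtype)

# Unlike a NumPy array, a tensor can record operations for automatic differentiation
w = torch.tensor([1.0, 2.0], requires_grad=True)
total = (features @ w).sum()
total.backward()

# Gradient of total with respect to w: the column sums of features
print(w.grad)
```

The last line prints the gradient that PyTorch computed automatically, which is the part a plain NumPy array cannot provide.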

## Building the model

To build a model in PyTorch we need to create a subclass of `torch.nn.Module` such that this subclass inherits all functionality required for fitting our model.

In [None]:
class LogisticRegression(torch.nn.Module):
  def __init__(self, input_dim, output_dim):
    super(LogisticRegression, self).__init__()
    #our model has just one linear layer (weights and a bias)
    self.linear = torch.nn.Linear(input_dim, output_dim)
    #the weights are re-initialized uniformly at random in [0, 1)
    torch.nn.init.uniform_(self.linear.weight)
  def forward(self, x):
    #the output is a linear function of the features followed by the sigmoid function
    outputs = torch.sigmoid(self.linear(x))
    return outputs

For our model class `LogisticRegression()` we need to define at least two methods: `__init__()` and `forward()`.

### `__init__()`

The method `__init__()` is called when an instance of our class `LogisticRegression` is created. This is done in the following code.

In [None]:
# Two inputs x_1 and x_2
input_dim = 2  
# Single binary output 
output_dim = 1 

# Create an instance of the model (this is a subclass of nn.Module that contains nn.Parameter(s))
model = LogisticRegression(input_dim, output_dim)

This code created a linear model with three model parameters, two weights (one per feature) and one bias, each with a random initial value.

Because we inherit all functionality of the `torch.nn.Module` class we can now, for instance, call the inherited `.state_dict()` method to get the state (what the model contains) of the model.

In [None]:
model.state_dict()

### `forward()`

When we pass data to our model, it'll go through the model's `forward()` method and produce a result using the computation we've defined. 
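
As a rough check of what the forward pass computes, the sketch below compares a linear layer followed by a sigmoid with the same formula written out by hand. The layer `linear` is a stand-in for the one inside our model and the inputs in `x` are made-up values.

```python
import torch

torch.manual_seed(0)

# A stand-in linear layer with 2 inputs and 1 output, like the one inside the model
linear = torch.nn.Linear(2, 1)
x = torch.tensor([[0.5, -1.0], [2.0, 1.0]])  # two made-up feature vectors

with torch.inference_mode():
    via_layer = torch.sigmoid(linear(x))
    # the same computation written out: sigmoid(x W^T + b)
    by_hand = 1.0 / (1.0 + torch.exp(-(x @ linear.weight.T + linear.bias)))

# Both routes should agree up to floating point error
print(torch.allclose(via_layer, by_hand, atol=1e-6))
```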

Let's make some predictions for the first 10 feature vectors in the test set.

In [None]:
with torch.inference_mode(): 
    predictions = model(X_test[:10])

predictions

Because the model has a single output unit, it returns a 2-dimensional Tensor with one row per feature vector and one column.

In [None]:
predictions.shape

We use the PyTorch method `squeeze()` to reshape this Tensor to a 1-dimensional array.

In [None]:
predictions = torch.squeeze(predictions)

predictions
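
The following sketch shows what `squeeze()` does to the shape, using zero-filled tensors as stand-ins for model outputs. It also shows one caveat that is easy to miss: with a batch of a single row, squeezing without a `dim` argument removes the batch dimension as well.

```python
import torch

column = torch.zeros(10, 1)       # shape produced by a model with one output
flat = torch.squeeze(column)      # drops every dimension of size 1
print(column.shape, flat.shape)

# Caveat: a batch of a single row collapses to a 0-dimensional tensor
single = torch.squeeze(torch.zeros(1, 1))
print(single.shape)

# Passing dim=1 only removes the output dimension and keeps the batch dimension
safe = torch.squeeze(torch.zeros(1, 1), dim=1)
print(safe.shape)
```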

As with the Pandas example above, we can extract the data as a NumPy array.

In [None]:
predictions = predictions.detach().numpy()

predictions

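The `Tensor.detach()` method returns a tensor that is detached from the current computational graph (more about this later). A tensor that requires gradients cannot be converted to a NumPy array directly, so we detach it first.

A tensor that lives on the GPU must additionally be moved to the CPU with `.cpu()` before it can be converted to a NumPy array.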
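
A small sketch of why `detach()` is needed before converting to NumPy, using a made-up tensor `w` that tracks gradients. The names `out`, `converted_directly` and `as_numpy` are illustrative only.

```python
import torch

w = torch.tensor([0.5, -0.5], requires_grad=True)
out = torch.sigmoid(w * 2.0)    # part of a graph, so out.requires_grad is True

try:
    out.numpy()                 # raises RuntimeError for tensors that require grad
    converted_directly = True
except RuntimeError:
    converted_directly = False

# Detaching first works; .cpu() is a no-op here but is needed for GPU tensors
as_numpy = out.detach().cpu().numpy()
print(converted_directly, as_numpy)
```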

Now we can compute evaluation metrics for the predictions, e.g. the AUC.

In [None]:
from sklearn.metrics import roc_auc_score

with torch.inference_mode(): 
    predictions = model(X_test)

predictions = torch.squeeze(predictions)
predictions = predictions.detach().numpy()

print("test set AUC: {}".format(roc_auc_score(y_test,predictions)))
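
The AUC measures how well the scores rank positives above negatives, so it is usually computed from predicted probabilities rather than from rounded class labels. The sketch below illustrates the difference on four made-up samples; the values are chosen for illustration and do not come from our dataset.

```python
from sklearn.metrics import roc_auc_score

labels = [0, 0, 1, 1]
# made-up scores: every positive is ranked above every negative
probabilities = [0.1, 0.2, 0.3, 0.4]
# thresholding at 0.5 turns all of them into class 0 and destroys the ranking
hard_labels = [round(p) for p in probabilities]

auc_probabilities = roc_auc_score(labels, probabilities)
auc_hard = roc_auc_score(labels, hard_labels)
print(auc_probabilities, auc_hard)
```

With the probabilities the ranking is perfect, while the rounded predictions are all tied and carry no ranking information.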

You probably noticed we used [`torch.inference_mode()`](https://pytorch.org/docs/stable/generated/torch.inference_mode.html) as a [context manager](https://realpython.com/python-with-statement/) (that's what the `with torch.inference_mode():` is) to make the predictions.

As the name suggests, `torch.inference_mode()` is used when using a model for inference (making predictions).

`torch.inference_mode()` turns off a bunch of things (like gradient tracking, which is necessary for training but not for inference) to make **forward-passes** (data going through the `forward()` method) faster.
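
To see the effect on gradient tracking, this sketch runs the same forward pass with and without the context manager. The layer `layer` and input `x` are stand-ins, not our trained model.

```python
import torch

layer = torch.nn.Linear(2, 1)
x = torch.ones(3, 2)

tracked = layer(x)                 # normal forward pass: the graph is recorded
with torch.inference_mode():
    untracked = layer(x)           # no graph is recorded

# The first output can be backpropagated through, the second cannot
print(tracked.requires_grad, untracked.requires_grad)
```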

## Training the model

Our model is making predictions using random model parameter values.


For our model to update its parameters on its own, we'll need to add a few more things to our recipe.

To train the model we need to add a **loss function** and an **optimizer**. The loss function measures how wrong the model predictions are compared to the true labels. The optimizer tells the model how to update its parameters to lower the loss.

Let's create a loss function and an optimizer we can use to help improve our model.

Which loss function and which optimizer you use depends on the kind of problem you're working on.

However, some common choices are known to work well, such as the SGD (stochastic gradient descent) or Adam optimizer, the MAE (mean absolute error) loss function for regression problems, and the cross entropy loss function for classification problems such as ours.

For the optimizer we will use SGD, `torch.optim.SGD(params, lr)` where:

* `params` are the model parameters we want to optimize
* `lr` is the **learning rate**, which sets the size of each update the optimizer makes to the model parameters

In [None]:
learning_rate = 0.001

#the loss function: binary cross entropy, which expects probabilities (sigmoid outputs) and float labels
loss_func = torch.nn.BCELoss()

#the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
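
For a model with a sigmoid output and binary labels, the cross entropy takes its binary form. The sketch below computes it by hand and compares the result with `torch.nn.BCELoss` on three made-up predictions; `probs` and `labels` are illustrative values only.

```python
import torch

probs = torch.tensor([0.9, 0.2, 0.6])    # made-up sigmoid outputs
labels = torch.tensor([1.0, 0.0, 1.0])   # made-up binary labels

# mean over samples of -(y*log(p) + (1-y)*log(1-p))
by_hand = -(labels * torch.log(probs) + (1 - labels) * torch.log(1 - probs)).mean()
builtin = torch.nn.BCELoss()(probs, labels)

# Both values should agree up to floating point error
print(by_hand.item(), builtin.item())
```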

Now that we have a loss function and an optimizer, it's time to create a **training loop** (and **validation loop**).

For the training loop, we have to code the following steps:

1. Forward pass: the model goes through all of the training data once, performing its `forward()` function calculations.
2. Calculate the loss: the model's outputs (predictions) are compared to the ground truth and evaluated to see how wrong they are.
3. Zero the gradients: the gradients of the optimized parameters are reset (they accumulate by default) so they can be recalculated for this training step.
4. Perform backpropagation on the loss: compute the gradient of the loss with respect to every model parameter.
5. Step the optimizer (**gradient descent**): update the model parameter values using the loss gradients.
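
To illustrate the last step, the sketch below checks that a single `optimizer.step()` of plain SGD matches the update rule "parameter minus learning rate times gradient". The names `toy_model` and `toy_optimizer`, the learning rate of 0.1 and the single training sample are assumptions made for this example.

```python
import torch

torch.manual_seed(0)
toy_model = torch.nn.Linear(2, 1)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

x = torch.tensor([[1.0, 2.0]])   # one made-up sample
y = torch.tensor([[1.0]])        # its made-up label

loss = torch.nn.functional.binary_cross_entropy(torch.sigmoid(toy_model(x)), y)
toy_optimizer.zero_grad()
loss.backward()

# what plain SGD is expected to do: parameter - lr * gradient
expected_weight = toy_model.weight.detach() - 0.1 * toy_model.weight.grad
toy_optimizer.step()

print(torch.allclose(toy_model.weight.detach(), expected_weight))
```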


In [None]:
#number of times we iterate through the train set
num_epochs = 30

for epoch in range(num_epochs):

    #step 1
    predictions_train = torch.squeeze(model(X_train)) 

    #step 2
    loss = loss_func(predictions_train, y_train) 
    print("training loss: {}".format(loss))    

    #step 3
    optimizer.zero_grad() 

    #step 4
    loss.backward() 

    #step 5
    optimizer.step()
        
    #compute AUC on validation set (from probabilities, without tracking gradients)
    with torch.inference_mode():
        predictions_val = torch.squeeze(model(X_val)).numpy()
    print("validation AUC: {}".format(roc_auc_score(y_val,predictions_val)))
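
Zeroing the gradients in each iteration matters because PyTorch adds new gradients to the ones already stored. A minimal sketch with a single made-up parameter `w`:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(w * 3.0).sum().backward()
first = w.grad.clone()          # gradient of 3*w is 3

(w * 3.0).sum().backward()      # no zeroing in between: gradients add up
accumulated = w.grad.clone()    # 3 + 3 = 6

w.grad.zero_()                  # manual reset, similar in effect to optimizer.zero_grad()
(w * 3.0).sum().backward()
after_reset = w.grad.clone()    # back to 3

print(first, accumulated, after_reset)
```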

## Computing predictions and evaluating the model


In [None]:
model.eval()

with torch.inference_mode(): 
    predictions_test = model(X_test)

predictions_test = torch.squeeze(predictions_test).detach().numpy()

print("test set AUC: {}".format(roc_auc_score(y_test,predictions_test)))

## Saving (and loading) the model

The [recommended way](https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-model-for-inference) for saving a model for inference (making predictions) is by saving the model parameter values in `state_dict()`.

We call `torch.save(obj, f)` where `obj` is the target model's `state_dict()` and `f` is the name of the file to save it to.

By convention, saved PyTorch models or objects use the file extension `.pt` or `.pth`, like `saved_model_01.pth`.


In [None]:
model_filename = "model_logistic_regression.pth"
torch.save(obj=model.state_dict(), f=model_filename) 

To load a model, we first load the `state_dict()` with `torch.load()` and then pass that `state_dict()` to a new instance of our model (which is a subclass of `nn.Module`).


In [None]:
# Instantiate a new instance of our model (this will be instantiated with random weights)
loaded_model = LogisticRegression(input_dim, output_dim)

# Load the state_dict of our saved model (this will update the new instance of our model with trained weights)
loaded_model.load_state_dict(torch.load(f=model_filename))

Excellent! The output `<All keys matched successfully>` tells us that every saved parameter was matched to a parameter of the new instance.

Now to test our loaded model, let's perform inference with it (make predictions) on the test data.

Remember the rules for performing inference with PyTorch models?

If not, here's a refresher:

<details>
    <summary>PyTorch inference rules</summary>
    <ol>
      <li> Set the model in evaluation mode (<code>model.eval()</code>). </li>
      <li> Make the predictions using the inference mode context manager (<code>with torch.inference_mode(): ...</code>). </li>
      <li> All predictions should be made with objects on the same device (e.g. data and model on GPU only or data and model on CPU only).</li>
    </ol> 
</details>



In [None]:
# 1. Put the loaded model into evaluation mode
loaded_model.eval()

# 2. Use the inference mode context manager to make predictions
with torch.inference_mode():
    loaded_model_preds = loaded_model(X_test) # perform a forward pass on the test data with the loaded model
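
One way to confirm that a save and load round trip preserves a model is to compare the predictions of both instances on the same input. The sketch below does this for a stand-in linear layer and a temporary file; the names `original`, `restored` and `toy_model.pth` are hypothetical and chosen so that nothing in the notebook is overwritten.

```python
import os
import tempfile
import torch

torch.manual_seed(0)
original = torch.nn.Linear(2, 1)     # stand-in for the trained model
restored = torch.nn.Linear(2, 1)     # fresh instance with different random weights

# Save the state_dict to a temporary file and load it into the fresh instance
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pth")
    torch.save(original.state_dict(), path)
    restored.load_state_dict(torch.load(path))

x = torch.ones(4, 2)
original.eval()
restored.eval()
with torch.inference_mode():
    same = torch.equal(original(x), restored(x))

# True when both instances produce identical predictions
print(same)
```

The same comparison could be applied to `model` and `loaded_model` on `X_test`.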