# Introduction to Neural Networks (PyTorch)

Version: 2024-8-6

The widespread adoption of artificial intelligence in recent years has been largely driven by advances in neural networks.

A neural network is fundamentally numeric computation, so any software with decent numeric computation capabilities can be used to construct and train one. That said, while in theory you can construct a neural network in Excel, in practice doing so is very troublesome because Excel is not designed with neural networks in mind. Modern neural network applications have consolidated around three platforms:
- [TensorFlow](https://www.tensorflow.org/) from Google.
- [Flax](https://github.com/google/flax), also from Google.
- [PyTorch](http://pytorch.org/), originally from Meta but now managed by an independent foundation.

At the lowest level, these platforms are essentially NumPy with the ability to run on GPUs. 
However, if we are just trying to learn how neural networks work,
we do not want to write their basic building blocks from scratch.
Therefore, in this course we will focus on two types of components that build on top of these platforms:

1. **High-level APIs for constructing neural networks**: 
    [`keras`](https://keras.io/) for TensorFlow and 
    [`nn.Module`](https://pytorch.org/docs/stable/nn.html) of PyTorch 
    provide ready-to-use building blocks for the construction of neural networks.
2. **Libraries that provide access to fully trained models**. 
    The most prominent examples are Hugging Face's [Transformers](https://huggingface.co/docs/transformers/index) 
    and [fastai](https://github.com/fastai/fastai).

In this notebook we will use PyTorch, which is currently the platform of choice for research. 
There is a separate notebook on the same topic that uses Keras on TensorFlow instead.

<img src="https://scrp.econ.cuhk.edu.hk/workshops/ai/images/nn_libraries_2024.png" width="80%">

## A. PyTorch vs Keras-TensorFlow

If you are familiar with how neural networks are constructed in Keras,
note that there is no exact equivalent in PyTorch. 
The main differences are as follows:
- The model structure is defined within a subclass of `torch.nn.Module`.
- You have to specify (i.e. code) what happens during the forward pass.
- Pure PyTorch also requires you to code the training loop, as well as what happens during
    validation, testing and inference. These can be replaced by trainers from libraries 
    such as [pytorch-accelerated](https://pytorch-accelerated.readthedocs.io/en/latest/),
  [PyTorch Lightning](https://pytorch-lightning.readthedocs.io/en/latest/),
  and most recently, [Keras 3](https://keras.io/keras_3/).
- Data needs to be manually placed in the right device. This can be automatically handled
    by Hugging Face's [Accelerate](https://github.com/huggingface/accelerate) library.

These differences make PyTorch a bit harder to use, but you gain more flexibility as a result.

Before we start, we will first disable the server's GPU so that everything runs on its CPU. Later we will turn it back on to see how much of a speedup we can get. This setting has no effect if you do not have an NVIDIA GPU.

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

## B. A Simple Example: Binary Neural Network Classifier

As a first example, we will train a neural network on the following classification task:

|$y$|$x_1$|$x_2$|
|-|-|-|
|0|1|2|
|1|0|5|

with $y$ being $1 - x_1$ and $x_2$ being just an irrelevant random number.

To be clear: there is absolutely no need to use a neural network for such a simple task. A simpler model such as logit will train a lot faster and potentially with better accuracy.

We first load the data:

In [None]:
import numpy as np 
import pandas as pd
from sklearn.model_selection import train_test_split

# Import data
data = pd.read_csv("../data/D1-data-1.csv")
y = data['y']
X = data[['x1','x2']]

#Shuffle and split data into train set and test set
X_train, X_test, y_train, y_test = train_test_split(X,y)

### PyTorch DataLoader

Unlike in Keras, NumPy arrays cannot be passed directly to PyTorch models. 
We will instead do the following:
1. Create PyTorch tensors from NumPy arrays.
2. Create a PyTorch Dataset from the tensors.
3. Create a PyTorch DataLoader from the Dataset.

Note that TensorFlow does have a similar data loading structure.
It is just that Keras provides a simpler interface while PyTorch does not.

Because we have to do this for every dataset we use, 
we will write a function that takes NumPy arrays and returns a PyTorch DataLoader:

In [None]:
import torch
from torch import Tensor
from torch.nn.functional import one_hot
from torch.utils.data import TensorDataset, DataLoader

def numpyToDataLoader(X,y,batch_size=32):
    # Transform numpy array to torch tensor

    # Create dataset and dataloader
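
The three steps listed above could be filled in as follows. This is only a minimal sketch, not necessarily the notebook's intended solution: it assumes that targets should be stored as a column of shape `(n_samples, 1)`, and it goes through `np.asarray` so that pandas objects are accepted as well. The toy arrays `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

def numpyToDataLoader(X, y, batch_size=32):
    # Transform arrays to float32 tensors; np.asarray also accepts pandas objects
    tensor_X = torch.as_tensor(np.asarray(X), dtype=torch.float32)
    # Assumption: targets are kept as a column, shape (n_samples, 1)
    tensor_y = torch.as_tensor(np.asarray(y), dtype=torch.float32).reshape(-1, 1)

    # Wrap the tensors in a Dataset, then in a DataLoader that yields mini-batches
    dataset = TensorDataset(tensor_X, tensor_y)
    return DataLoader(dataset, batch_size=batch_size)

# Toy usage with made-up data
X_demo = np.array([[1, 2], [0, 5], [1, 3], [0, 4]])
y_demo = np.array([0, 1, 0, 1])
dl_demo = numpyToDataLoader(X_demo, y_demo, batch_size=2)
Xb, yb = next(iter(dl_demo))
print(Xb.shape, yb.shape)
```

The sketch does not shuffle within the DataLoader, on the assumption that the data has already been shuffled by `train_test_split`.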


We then apply this function to our train and test data:

In [None]:
dl_train = numpyToDataLoader(X_train,y_train)
dl_test = numpyToDataLoader(X_test,y_test)

### Building the Neural Network

We will construct a neural network classifier for this task. 

A neural network model is made up of multiple layers. The simplest model would have three layers:
- An *input layer*. This layer specifies the nature of the input data. In this example, we only need to tell PyTorch that we have two input variables.
- A *hidden layer*. This layer contains neuron(s) that process the input data.
- An *output layer*. The neurons in this layer process the output from the hidden layer and generate predictions. This layer contains as many neurons as the number of target variables we try to predict. 

Below is the simplest neural network one can come up with, with only one hidden neuron. The neuron computes the following function:
$$
F \left( b + \sum\nolimits_{i}{w_{i}x_{i}} \right)
$$

where $x_i$ are the inputs, $b$ the intercept (called *bias* in machine learning), $w_i$ the coefficients (called *weights*) and $F$ an *activation function*. In this example we will use the logistic function (also called the *sigmoid function*) as the activation function:

$$
F(z) = \frac{e^z}{1+e^z}
$$

So the neuron is essentially a logit regression.
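
To make the formula concrete, the computation of a single neuron can be written out in plain Python. The weights, bias and inputs below are made-up numbers chosen so that the weighted sum is exactly zero; the function names are hypothetical.

```python
import math

def logistic(z):
    # F(z) = e^z / (1 + e^z)
    return math.exp(z) / (1 + math.exp(z))

def neuron(x, w, b):
    # F(b + sum_i w_i * x_i)
    z = b + sum(w_i * x_i for w_i, x_i in zip(w, x))
    return logistic(z)

# Made-up weights and bias, for illustration only
out = neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
print(out)  # 0.5, because the weighted sum is 0 and F(0) = 0.5
```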

In [None]:
from torch import nn, cuda, optim
import pytorch_lightning as pl

# Use GPU if available, otherwise use CPU
device = "cuda" if cuda.is_available() else "cpu"
print(f"Using {device} device")

# Your neural network needs to be implemented in a subclass
# of torch.nn.Module


# Create the model and transfer it to the chosen device
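
One possible way to fill in the cell above is sketched below. The class name `NeuralNetwork` is the one used later in this notebook; the exact layer layout (two inputs, one sigmoid hidden neuron, one sigmoid output neuron) is an assumption based on the description of the simplest model, and the sigmoid on the output is chosen so that the result can be fed to `nn.BCELoss`.

```python
import torch
from torch import nn

# Use GPU if available, otherwise use CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(2, 1),   # two inputs -> one hidden neuron
            nn.Sigmoid(),      # logistic activation in the hidden layer
            nn.Linear(1, 1),   # hidden neuron -> one output neuron
            nn.Sigmoid(),      # output between 0 and 1, as nn.BCELoss expects
        )

    def forward(self, x):
        # The forward pass simply runs the input through the layers in order
        return self.layers(x)

# Create the model and transfer it to the chosen device
model = NeuralNetwork().to(device)
print(model)
```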


### Training Loop

PyTorch has no trainer built in, so we have to write a training loop that does the following:
1. Loop through each epoch.
2. Within each epoch, loop through each mini-batch.
3. Within each mini-batch:
    1. Compute loss.
    2. Compute gradients.
    3. Update parameters.

The process requires us to specify the loss function and optimizer, which we will provide
later. Additionally, we also need to write code to keep track of progress.

Since we are going to train multiple models, we put the loop in a function so that we can 
reuse it later.

In [None]:
def train(dataloader, model, loss_fn, optimizer, epochs=5, quiet=False):
    # Set the model into training mode
    # Does not actually train the model


    # Double loop: epoch - mini-batch
        
        # Create an empty list to store mini-batch loss in this epoch
        
        # Mini-batch loop
        
            # Transfer mini-batch to chosen device
            

            # Compute prediction error
           

            # Backpropagation
              # Reset gradients to zero
              # Compute the gradients
              # Update parameters based on the chosen optimizer

            # Append loss to loss list
            

        if not quiet:
            # Display overall loss
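
The skeleton above could be completed along the following lines. This is a sketch that follows the comments step by step and mirrors the structure of the `test` function used later in this notebook; the toy data and the small `demo_model` are made up so that the block runs on its own.

```python
import numpy as np
import torch
from torch import nn, optim
from torch.utils.data import TensorDataset, DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"

def train(dataloader, model, loss_fn, optimizer, epochs=5, quiet=False):
    # Set the model into training mode (does not actually train the model)
    model.train()

    # Double loop: epoch - mini-batch
    for epoch in range(epochs):
        # Mini-batch losses in this epoch
        loss_list = []

        for X, y in dataloader:
            # Transfer mini-batch to chosen device
            Xb = X.to(device)
            yb = y.to(device)

            # Compute prediction error
            pred = model(Xb)
            loss = loss_fn(pred, yb)

            # Backpropagation
            optimizer.zero_grad()  # reset gradients to zero
            loss.backward()        # compute the gradients
            optimizer.step()       # update parameters

            loss_list.append(loss.item())

        if not quiet:
            print(f"Epoch {epoch + 1}: loss = {np.mean(loss_list):.4f}")

# Toy usage: made-up data where y = 1 - x1
torch.manual_seed(0)
X_demo = torch.tensor([[1., 2.], [0., 5.], [1., 3.], [0., 4.]])
y_demo = 1 - X_demo[:, :1]
dl_demo = DataLoader(TensorDataset(X_demo, y_demo), batch_size=2)
demo_model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid()).to(device)
train(dl_demo, demo_model, nn.BCELoss(), optim.Adam(demo_model.parameters()), epochs=3)
```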
            

We train the model by calling the `train` function:

In [None]:
# Set the loss function, optimizer and number of epochs
loss_fn = nn.BCELoss()
optimizer = optim.Adam(model.parameters())

# Start training
train(dl_train, model, loss_fn, optimizer, epochs=10)

### Evaluating the Model

To evaluate the model, we have a loop similar to the training loop,
but without going through multiple epochs and without updating parameters.

In [None]:
def test(dataloader, model, loss_fn):
    # Set the model into evaluation mode (disables layers such as dropout)
    model.eval()

    loss_list = []

    # Gradients are not needed when we are only evaluating
    with torch.no_grad():
        for X, y in dataloader:
            # Transfer mini-batch to chosen device
            Xb = X.to(device)
            yb = y.to(device)

            # Compute prediction error
            pred = model(Xb)
            loss = loss_fn(pred, yb)

            # Append loss to loss list
            loss_list.append(loss.item())

    # Overall loss
    loss_overall = np.mean(loss_list)
    return loss_overall

test(dl_test, model, loss_fn)

Unlike OLS, a neural network's performance can vary across runs. Run the code a few more times and see how the performance varies.

### Inference

We can make predictions (this is called *inference* in machine learning) with yet another loop,
this time without computing the loss:

In [None]:
# Data
x = np.array([[0,1]])

# Set up tensor, dataset and dataloader


# List to save prediction


# Inference loop

    # Transfer mini-batch to chosen device
    
    
    # Compute prediction
    

# The combined array of predictions
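
A possible version of the inference loop outlined above is sketched here. To keep the block runnable on its own, an untrained stand-in model is used in place of the trained one, so the printed value is arbitrary; the variable names `dl_x` and `pred_list` are hypothetical.

```python
import numpy as np
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in for the trained model (untrained here, so outputs are arbitrary)
model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid()).to(device)

# Data
x = np.array([[0, 1]])

# Set up tensor, dataset and dataloader
tensor_x = torch.as_tensor(x, dtype=torch.float32)
dl_x = DataLoader(TensorDataset(tensor_x), batch_size=32)

# List to save predictions
pred_list = []

model.eval()               # evaluation behavior for layers such as dropout
with torch.no_grad():      # no gradients are needed for inference
    for (X,) in dl_x:
        # Transfer mini-batch to chosen device
        Xb = X.to(device)
        # Compute prediction and move it back to the CPU as a NumPy array
        pred_list.append(model(Xb).cpu().numpy())

# The combined array of predictions
pred = np.concatenate(pred_list)
print(pred)
```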


### Saving and Loading Models

Training neural network models is time-consuming, so we usually want to save 
trained models for reuse.

In [None]:
# Save
torch.save(model.state_dict(), "model.pth")

# Load
model = NeuralNetwork()
model.load_state_dict(torch.load("model.pth"))

## C. Activations

Different activations can have a profound impact on model performance. Besides ```nn.Sigmoid```, which is just a different name for the logistic function, there are other activation functions such as ```nn.Tanh``` and ```nn.ReLU```. *ReLU*, which stands for **RE**ctified **L**inear **U**nit, is a particularly common choice due to its good performance.

In [None]:
# Use ReLU instead of sigmoid in the hidden layer


# Create the model and transfer it to the chosen device
model = NeuralNetwork().to(device)
print(model)

# Loss and optimizer
loss_fn = nn.BCELoss()
optimizer = optim.Adam(model.parameters())

# Train the model
train(dl_train, model, loss_fn, optimizer, epochs=10)
test(dl_test, model, loss_fn)

Why is ReLU preferred over the logistic function? Let us take a look at the shape of each function:

<img src="https://scrp.econ.cuhk.edu.hk/workshops/ai/images/logistic_v_relu.png">

The most prominent feature of the logistic function is that it is bounded between 0 and 1. This means it is virtually flat for very large positive or negative input values, and flat means a small gradient. As gradient descent relies on gradients to learn, small gradients imply slow learning. ReLU avoids this issue for positive inputs by being linear above zero.
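
The difference in gradients can be checked numerically with PyTorch's automatic differentiation. The three input values below are arbitrary choices meant to cover a very negative, a zero and a very positive input.

```python
import torch

z = torch.tensor([-10.0, 0.0, 10.0], requires_grad=True)

# Gradient of the logistic function at each input
torch.sigmoid(z).sum().backward()
grad_sigmoid = z.grad.clone()

# Reset the stored gradient, then do the same for ReLU
z.grad.zero_()
torch.relu(z).sum().backward()
grad_relu = z.grad.clone()

print(grad_sigmoid)  # close to 0 at -10 and 10, 0.25 at 0
print(grad_relu)     # 0 for the negative input, 1 for the positive input
```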

## E. Dropout

As neural networks are highly flexible, they can easily overfit. Dropout is a regularization technique that works by randomly setting the outputs of some neurons to zero during training, thereby forcing the network not to rely too much on any specific neuron or feature. The code below adds a 50% dropout to the hidden layer:
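
As a sketch of what this could look like, the classifier below places `nn.Dropout(p=0.5)` after the hidden layer's activation. The class name `NeuralNetworkDropout` and the `hidden_count` argument (with a default of 10) are assumptions made for illustration. The second half of the block shows that dropout is only active in training mode.

```python
import torch
from torch import nn

class NeuralNetworkDropout(nn.Module):      # hypothetical class name
    def __init__(self, hidden_count=10):    # hidden_count is an assumption
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(2, hidden_count),
            nn.ReLU(),
            nn.Dropout(p=0.5),   # zero each hidden output with probability 0.5
            nn.Linear(hidden_count, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.layers(x)

# Dropout behaves differently in training and evaluation mode
drop = nn.Dropout(p=0.5)
ones = torch.ones(1000)

drop.train()
out_train = drop(ones)   # each element is either 0 or 1 / (1 - 0.5) = 2

drop.eval()
out_eval = drop(ones)    # unchanged: dropout is disabled in evaluation mode

print(out_train.unique(), out_eval.unique())
```

The surviving outputs are scaled up during training so that the expected value of each output stays the same in both modes.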

## F. Neural Network Regression

Next we are going to use a neural network in a regression task. The true data generating process (DGP) is as follows:

$$
y = x^5 -2x^3 + 6x^2 + 10x - 5
$$

The model does not know the true DGP, so it needs to figure out the relationship between $y$ and $x$ from the data.

First we generate the data:

In [None]:
#Generate 1000 samples
X = np.random.rand(1000,1)
y = X**5 - 2*X**3 + 6*X**2 + 10*X - 5

#Shuffle and split data into train set and test set
X_train, X_test, y_train, y_test = train_test_split(X,y)

dl_train = numpyToDataLoader(X_train,y_train)
dl_test = numpyToDataLoader(X_test,y_test)

Then we construct the model:

In [None]:
# Single hidden layer with 100 neurons

        
# Create the model and transfer it to the chosen device


# Loss and optimizer
loss_fn = nn.MSELoss()
optimizer = optim.Adam(model.parameters())

# Train the model
train(dl_train, model, loss_fn, optimizer, epochs=20)
test(dl_test, model, loss_fn)

### Making Model Construction Configurable

We can add arguments to the `__init__` method of our subclass of `nn.Module`, 
allowing us to create models with different settings:

In [None]:
# Single hidden layer with variable hidden neurons and activation
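
A configurable regression model could be sketched as follows. The class name `NNReg` and the `hidden_count` argument are the ones used later in this notebook; the `activation` argument name, the default values and the linear output layer are assumptions.

```python
import torch
from torch import nn

class NNReg(nn.Module):
    def __init__(self, hidden_count=100, activation=nn.ReLU):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(1, hidden_count),    # one input feature
            activation(),                  # activation passed in as a class
            nn.Linear(hidden_count, 1),    # linear output for regression
        )

    def forward(self, x):
        return self.layers(x)

# Models with different settings
model = NNReg(hidden_count=10)
model_tanh = NNReg(hidden_count=10, activation=nn.Tanh)

# (1*10 + 10) parameters in the hidden layer, (10*1 + 1) in the output layer
print(sum(p.numel() for p in model.parameters()))  # 31
```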


There is still a lot of code repetition outside of the subclass.
We will enclose it in a function:

In [None]:
import time

def polyNN(data, 
           epochs=200, 
           batch_size=32,
           **kwargs):
    
    # Record the start time
    start = time.time()    
    
    # Unpack the data
    X_train, X_test, y_train, y_test = data    

    # Convert data to PyTorch tensor
    dl_train = numpyToDataLoader(X_train,y_train,batch_size=batch_size)
    dl_test = numpyToDataLoader(X_test,y_test,batch_size=batch_size)    
    
    # Create the model and transfer it to the chosen device
    model = NNReg(**kwargs).to(device)

    # Loss and optimizer
    loss_fn = nn.MSELoss()
    optimizer = optim.Adam(model.parameters())

    # Train the model
    train(dl_train, model, loss_fn, optimizer, epochs, quiet=True)
    
    # Collect and display info
    loss_tr = round(test(dl_train, model, loss_fn),4)
    loss_te = round(test(dl_test, model, loss_fn),4)
    param_count = sum(p.numel() for p in model.parameters())
    elapsed = round(time.time() - start,2)  
    
    print("Hidden count:",str(kwargs['hidden_count']).ljust(5),
          "Parameters:",str(param_count).ljust(6),
          "loss (train,test):",str(loss_tr).ljust(7),str(loss_te).ljust(7),
          "Time:",str(elapsed)+"s",
         )    

Now we can easily try out different settings:

In [None]:
data = train_test_split(X,y)

polyNN(data,hidden_count=1)
polyNN(data,hidden_count=10)
polyNN(data,hidden_count=50)
polyNN(data,hidden_count=100)
polyNN(data,hidden_count=500)

Here we see the universal approximation theorem at work: the more neurons we have, the better the fit.

One trick that can often improve performance: *standardizing* data.

In [None]:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X)
X_std = scaler.transform(X)

data_std = train_test_split(X_std,y)

polyNN(data_std,hidden_count=1)
polyNN(data_std,hidden_count=10)
polyNN(data_std,hidden_count=50)
polyNN(data_std,hidden_count=100)
polyNN(data_std,hidden_count=500)

While `StandardScaler` works quite well when there is only a single feature, its sensitivity to outliers makes it unsuitable for situations with multiple highly unbalanced features. Scikit-learn offers <a href="https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py">other scalers</a> such as `RobustScaler` that might work better in those cases. 
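
The effect of an outlier on the two scalers can be seen with a small made-up sample. `StandardScaler` subtracts the mean and divides by the standard deviation, while `RobustScaler` by default subtracts the median and divides by the interquartile range.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Four ordinary values and one outlier (made-up data)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

x_std = StandardScaler().fit_transform(x)
x_rob = RobustScaler().fit_transform(x)

print(x_std.ravel())  # the four ordinary values are squeezed close together
print(x_rob.ravel())  # (x - 3) / 2: the ordinary values keep their spread
```

With `StandardScaler` the outlier inflates the standard deviation, so the ordinary values end up nearly indistinguishable from one another.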


In [None]:
# Single hidden layer with variable hidden neurons and activation, plus dropout


In [None]:
polyNN(data_std,hidden_count=1)
polyNN(data_std,hidden_count=10)
polyNN(data_std,hidden_count=50)
polyNN(data_std,hidden_count=100)
polyNN(data_std,hidden_count=500)

Our model was not overfitting to begin with, which makes the use of dropout in this case a *bad* idea
&mdash;it increases in-sample error but does nothing to reduce out-of-sample error.

## G. Speed Things Up

Due to its complexity, a neural network trains a lot slower than the other techniques we have covered previously. To speed up training, we can ask PyTorch to go through more samples before updating the model's parameters by specifying a larger ```batch_size```. Doing so allows PyTorch to make better use of the CPU's parallel processing capabilities.

We previously set the default batch size to 32. We will try 128 instead:

In [None]:
# Set batch size
batch_size = 128

# Run training again
polyNN(data_std,hidden_count=1,batch_size=batch_size)
polyNN(data_std,hidden_count=10,batch_size=batch_size)
polyNN(data_std,hidden_count=50,batch_size=batch_size)
polyNN(data_std,hidden_count=100,batch_size=batch_size)
polyNN(data_std,hidden_count=500,batch_size=batch_size)

Holding the number of epochs constant, what you should see with a larger batch size is faster training but also larger error. The latter is because we are updating the parameters less often, resulting in slower learning. This can be countered by increasing the number of epochs.
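
The number of parameter updates per epoch equals the number of mini-batches, which can be read off the length of a DataLoader. The sketch below uses 750 placeholder samples, on the assumption that `train_test_split` keeps its default 75% of the 1000 generated samples for training.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# 750 placeholder samples standing in for the training set
dataset = TensorDataset(torch.zeros(750, 1), torch.zeros(750, 1))

# One parameter update per mini-batch
updates_per_epoch = {
    bs: len(DataLoader(dataset, batch_size=bs)) for bs in (32, 128)
}
print(updates_per_epoch)  # {32: 24, 128: 6}
```

Moving from a batch size of 32 to 128 therefore cuts the number of updates per epoch to a quarter.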

## H. Running Model on GPU

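If you have a GPU in your computer, you can now turn it on to see how much it speeds up training. Because this setting is read when CUDA is first initialized, you may need to restart the kernel and rerun the notebook with the setting below for it to take effect.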

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [None]:
polyNN(data,hidden_count=1)

With a GPU you can take advantage of its high core count by setting a much higher batch size, such as 1000:

In [None]:
batch_size = 1000
polyNN(data,hidden_count=1,batch_size=batch_size)
polyNN(data,hidden_count=10,batch_size=batch_size)
polyNN(data,hidden_count=50,batch_size=batch_size)
polyNN(data,hidden_count=100,batch_size=batch_size)
polyNN(data,hidden_count=500,batch_size=batch_size)

To compensate for the less frequent updates, we can increase the number of epochs:

In [None]:
batch_size = 1000
epochs = 600
polyNN(data,hidden_count=1,epochs=epochs,batch_size=batch_size)
polyNN(data,hidden_count=10,epochs=epochs,batch_size=batch_size)
polyNN(data,hidden_count=50,epochs=epochs,batch_size=batch_size)
polyNN(data,hidden_count=100,epochs=epochs,batch_size=batch_size)
polyNN(data,hidden_count=500,epochs=epochs,batch_size=batch_size)

## I. Reducing Boilerplate Code

Let us go back to the main differences between PyTorch and Keras/TensorFlow:
- The model structure is defined within a subclass of `torch.nn.Module`.
- You have to specify (i.e. code) what happens during the forward pass.
- Pure PyTorch also requires you to code the training loop, as well as what happens during
    validation, testing and inference. These can be replaced by trainers from libraries 
    such as [pytorch-accelerated](https://pytorch-accelerated.readthedocs.io/en/latest/)
    or [PyTorch Lightning](https://pytorch-lightning.readthedocs.io/en/latest/).
- Data needs to be manually placed in the right device. This can be automatically handled
    by Hugging Face's [Accelerate](https://github.com/huggingface/accelerate) library.
    
We will now go through the third-party libraries mentioned above.

### Hugging Face Accelerate

`accelerate` removes the need to manually move our model and data to the right device. 
The benefit of doing so is not huge when we are running our model on a single GPU,
so the primary intended usage is multi-GPU training.

To use `accelerate`, simply add the following lines:

```python
from accelerate import Accelerator
accelerator = Accelerator()

# Load data, build model and choose optimizer here

model, optimizer, data = accelerator.prepare(model, optimizer, data)

# Train your model here

```

In [None]:
# Hugging Face Accelerate

from accelerate import Accelerator

accelerator = Accelerator()

# Load data, build model and choose optimizer here

model, optimizer, dl_train = accelerator.prepare(model, optimizer, dl_train)

train(dl_train, model, loss_fn, optimizer, epochs=10)            

### PyTorch-Accelerated

`pytorch_accelerated` builds on top of `accelerate` and offers a `Trainer` class
to do the job of the training and evaluation loops. There are also callbacks for 
tasks such as logging and early stopping. This makes the usage of `pytorch_accelerated`
very similar to Keras on Tensorflow.

`pytorch_accelerated` requires PyTorch *datasets* instead of dataloaders:

In [None]:
def numpyToDataset(X,y,batch_size=32):
    # Transform numpy array to torch tensor
    tensor_X = Tensor(X)
    tensor_y = Tensor(y)

    # Create dataset
    return TensorDataset(tensor_X,tensor_y) 

ds_train = numpyToDataset(X_train,y_train)
ds_test = numpyToDataset(X_test,y_test)

We then pass the model, loss function, optimizer and callbacks to `Trainer`.
Training is started with `Trainer.train()` and evaluation with `Trainer.evaluate()`:

In [None]:
# pytorch-accelerated
from pytorch_accelerated import Trainer
from pytorch_accelerated.callbacks import *

# Create the model
model = NeuralNetwork()

# Set the loss function, optimizer and number of epochs
loss_fn = nn.BCELoss()
optimizer = optim.Adam(model.parameters())

# Callbacks. The first five are included by default.
callbacks = [MoveModulesToDeviceCallback, 
             TerminateOnNaNCallback, 
             PrintProgressCallback, 
             ProgressBarCallback, 
             LogMetricsCallback,
             EarlyStoppingCallback(early_stopping_patience=3)]

# Set up pytorch-accelerated trainer
trainer = Trainer(
            model,
            loss_func=loss_fn,
            optimizer=optimizer,
            callbacks=callbacks
            )

# Train the model
trainer.train(
        train_dataset=ds_train,
        eval_dataset=ds_test,
        num_epochs=10,
        per_device_batch_size=32,
        )

# Evaluate
trainer.evaluate(
    dataset=ds_test,
    per_device_batch_size=64,
)
