# Classification using Neural Networks

We will now train a neural network classifier using PyTorch.  We will use the Titanic dataset for this.


Let's load the data using pandas as we learnt in the previous notebooks.

In [None]:
import pandas as pd  # type: ignore
import numpy as np   # type: ignore
# we are loading data from github. 
dataurl = 'https://github.com/rrr-uom-projects/MPiCRT-AI/raw/main/Data/titanic.csv' 
pax = pd.read_csv(dataurl, sep = ',')

Let's remember the data to make sense of it. Here is a short description of the series:

- **PassengerId** Arbitrary number between 1 and 891
- **Survived** Whether the passenger survived or not: 0 = No, 1 = Yes
- **Pclass** Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd
- **Name** Name of the Passenger
- **Sex** Female/male
- **Age** Age in years
- **SibSp** No. of siblings / spouses aboard the Titanic
- **Parch** No. of parents / children aboard the Titanic
- **Ticket** Ticket number
- **Fare** Passenger fare
- **Cabin** Cabin number
- **Embarked** Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton


## Preprocessing and creating dummy variables

During the last tutorials we processed the data to get it in numerical coding.  Let's bring the relevant code here.

### Imputing Age


In [None]:
medianAges = pax.groupby(['Sex','Pclass','Embarked'], observed=True)[['Age']].median()
medianAges = medianAges.reset_index()

def getMedianAgeForCategory(row):
    # using the dataframe medianAges created above.
    condition = (
        (medianAges['Sex'] == row['Sex']) & 
        (medianAges['Pclass'] == row['Pclass']) & 
        (medianAges['Embarked'] == row['Embarked'])
    ) 
    return medianAges[condition]['Age'].values[0]

def imputeIfNeeded(row):
    return getMedianAgeForCategory(row) if np.isnan(row['Age']) else row['Age']

# impute the missing ages in place (the original Age column is overwritten)
pax['Age'] = pax.apply(imputeIfNeeded, axis=1)
pax.info()
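
The row-by-row `apply` above is easy to read but slow on large tables. As a hedged alternative, the same idea (fill each missing age with the median of its group) can be sketched with `groupby(...).transform`. The toy dataframe and the `AgeImputed` column name below are made up for illustration and are not part of the Titanic file.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not the Titanic file)
toy = pd.DataFrame({
    "Sex":    ["male", "male", "male", "female", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age":    [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Median age of each (Sex, Pclass) group, aligned row by row with `toy`.
# Missing ages are ignored when the median is computed.
group_median = toy.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Keep the known ages and fill the missing ones with their group's median
toy["AgeImputed"] = toy["Age"].fillna(group_median)
print(toy)
```

On the real data you would group by `['Sex', 'Pclass', 'Embarked']`, as in the cell above.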

In [None]:
# drop the rows with a missing Embarked value --> 889 rows remain (Cabin is left out later, when we select the columns)
pax.dropna(subset=['Embarked'],inplace=True)
pax.info()

### Titles and title types

In [None]:
# First we need to cast the type of the Name series to str. 
pax['Name'] = pax['Name'].astype('string')
surnamefirstnames = pax['Name'].str.split(',')  # this splits the string by the token given (,)
pax['Surname'] = surnamefirstnames.str.get(0)   # here we get the first bit of the divided sentence
afterComma = surnamefirstnames.str.get(1).str.split('.')# this splits the string by the token given (.)
pax['Title'] = afterComma.str.get(0).str.strip()        # here we get the first bit of the divided sentence and eliminate empty spaces

Title_Dictionary = {
    "Capt": "Officer",
    "Col": "Officer",
    "Major": "Officer",
    "Jonkheer": "Royalty",
    "Don": "Royalty",
    "Sir" : "Royalty",
    "Dr": "Officer",
    "Rev": "Officer",
    "the Countess":"Royalty",
    "Mme": "Mrs",
    "Mlle": "Miss",
    "Ms": "Mrs",
    "Mr" : "Mr",
    "Mrs" : "Mrs",
    "Miss" : "Miss",
    "Master" : "Master",
    "Lady" : "Royalty"
}
pax['TitleType'] = pax['Title'].map(Title_Dictionary)
pax['TitleType'] = pax['TitleType'].astype('category')
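
The two `split` calls above can also be written as a single regular expression. The sketch below is one possible alternative, not the method used in this notebook. The last name in the list carries a title ("Dona") that is not in the small dictionary used here, to show that unmapped titles silently become missing values.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "Oliva y Ocana, Dona. Fermina",
])

# Capture everything between the comma and the first full stop
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# A reduced dictionary for illustration only; it deliberately lacks "Dona"
small_dictionary = {"Mr": "Mr", "Miss": "Miss", "the Countess": "Royalty"}
title_types = titles.map(small_dictionary)

print(titles.tolist())
print(title_types.isna().sum(), "title(s) could not be mapped")
```

After mapping, it is worth checking `pax['TitleType'].isna().sum()` to confirm that every title in the data has an entry in `Title_Dictionary`.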


### Family sizes and types

In [None]:
pax['FamilySize'] = pax['SibSp']+pax['Parch']+1 
def getFamilyType(famsize):
    return 'single' if famsize == 1 else ('smallFamily' if famsize < 5 else 'largeFamily')

pax['FamilyType'] = pax['FamilySize'].apply(getFamilyType)
pax['FamilyType'] = pax['FamilyType'].astype('category')
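
As a side note, the same three bins can be produced without a custom function by using `pd.cut`. This is a sketch on made-up family sizes; the bin edges are chosen to reproduce `getFamilyType` (1 is single, 2 to 4 is a small family, 5 or more is a large family).

```python
import numpy as np
import pandas as pd

sizes = pd.Series([1, 2, 4, 5, 11])  # hypothetical family sizes

# Bins are right-inclusive by default: (0, 1], (1, 4], (4, inf)
family_type = pd.cut(
    sizes,
    bins=[0, 1, 4, np.inf],
    labels=["single", "smallFamily", "largeFamily"],
)
print(family_type.tolist())
```

`pd.cut` already returns a categorical series, so no extra `astype('category')` is needed.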

### Categorical variables to dummy variables

In [None]:
pax['Sex'] = pax['Sex'].map({'male': 0, 'female': 1}).astype(int)
# pax['Survived'] is already 0s and 1s. Not converted to category
# pax['Pclass'] is already 1s, 2s, or 3s.  Not converted to category
pax['Embarked'] = pax['Embarked'].astype("category")
pax = pd.get_dummies(pax, prefix='Embarked',columns=['Embarked'],dtype=int)
pax = pd.get_dummies(pax, prefix='TitleType',columns=['TitleType'],dtype=int)
pax = pd.get_dummies(pax, prefix='FamilyType',columns=['FamilyType'],dtype=int)
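
To see what `get_dummies` does, here is a minimal sketch on a toy column (the four values are made up). Each category becomes its own 0/1 column, and every row has exactly one 1.

```python
import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One new column per category, named <prefix>_<category>
dummies = pd.get_dummies(toy, prefix="Embarked", columns=["Embarked"], dtype=int)
print(dummies)
```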

### Eliminate variables 

Now we have extracted extra information from the data stored for each passenger. We can clean up our dataframe in preparation for model training.

In [None]:
pax.columns

In [None]:
pax.info()

In [None]:
cleanpax = pax.loc[:,['Survived', 'Pclass', 'Sex', 'Age', 'SibSp',
       'Parch', 'Fare', 'FamilySize',
       'Embarked_C', 'Embarked_Q', 'Embarked_S', 'TitleType_Master',
       'TitleType_Miss', 'TitleType_Mr', 'TitleType_Mrs', 'TitleType_Officer',
       'TitleType_Royalty', 'FamilyType_largeFamily', 'FamilyType_single',
       'FamilyType_smallFamily']]


cleanpax.info()

# Data splitting

Before we do any training, let's divide the dataset into *training* and *validation* sets.  Ideally, we would have another dataset, *test*, to test for generalisability.  Kaggle kept a good portion of the data as test.

In [None]:
Y = cleanpax.loc[:,'Survived'] # This is the target!
X = cleanpax.loc[:, cleanpax.columns != 'Survived'] # These are the features/variables we will use to predict

# to divide the data into train/validation, we can use train_test_split from sklearn
from sklearn import model_selection # type: ignore
X_train, x_val, Y_train, y_val = model_selection.train_test_split(X, Y, test_size=0.2, random_state=1234) # train 80%, validation 20%
print(X_train.shape,x_val.shape,Y_train.shape,y_val.shape)

We can use X_train and Y_train to create our models, and use x_val and y_val to check for overfitting!  We will learn more about this later.
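
When the classes are imbalanced, a random split can leave the validation set with a different proportion of survivors than the training set. `train_test_split` has a `stratify` argument for this. The notebook does not use it, so treat the sketch below (on made-up arrays) as an optional variation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 50 samples, 2 features, 60% class 0 and 40% class 1
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.array([0] * 30 + [1] * 20)

# stratify=y_toy keeps the 60/40 class ratio in both subsets
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1234, stratify=y_toy
)
print(X_tr.shape, X_va.shape, y_tr.mean(), y_va.mean())
```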

# PyTorch

PyTorch is an open-source machine learning framework developed by Meta and widely used for deep learning applications. It is very popular because it makes building and training neural networks 'easy'.

Some key features of PyTorch:
- Autograd (Automatic Differentiation) is built in to compute gradients automatically. This is what allows neural network optimisation (backpropagation).
- GPU acceleration through CUDA, which allows training much faster than on a CPU.
- Integration with NumPy, allowing easy data conversion.
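
As a minimal sketch of what autograd does, the snippet below asks PyTorch for the derivative of a simple function. The variable `w` and the function are made up for illustration.

```python
import torch

# A single trainable value; requires_grad=True asks PyTorch to track operations on it
w = torch.tensor(3.0, requires_grad=True)

# A toy "loss": (w - 1)^2, whose derivative is 2 * (w - 1)
loss = (w - 1.0) ** 2

# backward() computes d(loss)/dw and stores it in w.grad
loss.backward()
print(w.grad)  # 2 * (3 - 1) = 4
```

Training a network repeats exactly this step for every parameter, and the optimiser then uses the stored gradients to update them.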

For learning PyTorch in depth, it is worth following their tutorials: https://pytorch.org/tutorials/beginner/basics/intro.html  

In [None]:
import torch  # type: ignore

An important point in training is knowing whether we have GPU acceleration or not.  Let's find which device we are using:

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

## Tensors
*PyTorch* encodes the inputs and outputs of a model, as well as the model’s parameters, as *tensors*.  Tensors are similar to arrays/matrices, and very similar to *ndarray*s in *NumPy*. An important difference is that tensors can be stored and used on GPUs, and they are optimised for automatic differentiation.
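
The following sketch shows the round trip between NumPy and PyTorch on a small made-up array. Two details are worth noticing: `torch.from_numpy` shares memory with the original array, and NumPy's default `float64` usually needs converting to `float32`, which is the default type of network weights.

```python
import numpy as np
import torch

arr = np.array([[1.0, 2.0], [3.0, 4.0]])  # float64 by default

t = torch.from_numpy(arr)   # shares memory with arr, dtype float64
t32 = t.float()             # an independent copy as float32

arr[0, 0] = 10.0            # the change is visible in t, but not in t32
print(t[0, 0].item(), t32[0, 0].item())

back = t32.numpy()          # back to a NumPy array (CPU tensors only)
print(back.dtype)
```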

In [None]:
# torch.autograd.Variable is deprecated: plain tensors can track gradients themselves.
# The inputs are data, not trainable parameters, so they do not need requires_grad.
X_train_t = torch.tensor(X_train.values, dtype=torch.float32)
Y_train_t = torch.tensor(Y_train.values, dtype=torch.float32).unsqueeze(-1)  # shape (n, 1), to match the model output
x_val_t = torch.tensor(x_val.values, dtype=torch.float32)
y_val_t = torch.tensor(y_val.values, dtype=torch.float32).unsqueeze(-1)
print('Dataframes:', X_train.shape,x_val.shape,Y_train.shape,y_val.shape)
print('Tensors:', X_train_t.shape, Y_train_t.shape, x_val_t.shape, y_val_t.shape)

In [None]:
# Move tensor to GPU if available
X_train_t = X_train_t.to(device)
Y_train_t = Y_train_t.to(device)
x_val_t = x_val_t.to(device)
y_val_t = y_val_t.to(device)

## Defining the Neural Network
Let's define a neural network with 3 hidden layers of 7 neurons each, using a ReLU activation function. This configuration comes from the optimisation presented here: https://github.com/davidtvs/kaggle-titanic/blob/master/pytorch/surviving-the-titanic.ipynb

To create a network, the best practice is to create a class with your configuration. For this example, we will inherit from the class [Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) in PyTorch. We need to define the initialiser (what goes in the `__init__()` function) as well as the `forward` function, which is how the neural network is applied to a given 'x'.


In [None]:
import torch.nn as nn  # type: ignore

# Define Neural Network class
class NeuralNetwork(nn.Module): # here we are inheriting from the Module class
    def __init__(self):
        super().__init__()          # here we call the 'parent' initialiser. It should be called first!!!
        self.model = nn.Sequential(
            nn.Linear(19, 7),       # Hidden layer 1: from the 19 input features in our dataframe to 7 neurons
            nn.ReLU(),              # activation
            nn.Linear(7, 7),        # Hidden layer 2
            nn.ReLU(),              # activation
            nn.Linear(7, 7),        # Hidden layer 3
            nn.ReLU(),              # activation
            nn.Linear(7, 1),        # Output layer: a single value per passenger
            nn.Sigmoid()            # sigmoid activation squashes the output into (0, 1), read as a survival probability
        )

    def forward(self, x):
        return self.model(x)

We can now create the model. The model is an object of the class NeuralNetwork, and we need to assign the device we are working with (either cpu or cuda).

In [None]:
model = NeuralNetwork().to(device)
print(model)
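
A quick sanity check on any new network is to count its trainable parameters and to pass a dummy batch through it. The sketch below rebuilds the same stack of layers as a standalone `nn.Sequential` (named `net` here, a hypothetical name) so that it runs on its own.

```python
import torch
import torch.nn as nn

# Same layers as the NeuralNetwork class above
net = nn.Sequential(
    nn.Linear(19, 7), nn.ReLU(),
    nn.Linear(7, 7), nn.ReLU(),
    nn.Linear(7, 7), nn.ReLU(),
    nn.Linear(7, 1), nn.Sigmoid(),
)

# Each Linear(in, out) layer holds in*out weights plus out biases:
# (19*7 + 7) + (7*7 + 7) + (7*7 + 7) + (7*1 + 1) = 140 + 56 + 56 + 8 = 260
n_params = sum(p.numel() for p in net.parameters() if p.requires_grad)
print(n_params)

# A dummy batch of 4 passengers with 19 features each gives 4 probabilities
out = net(torch.zeros(4, 19))
print(out.shape)
```

With only a few hundred parameters this is a very small network, which suits a dataset of a few hundred passengers.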

## Training the neural network 

Remember that training is optimising the values of the trainable parameters, using some hyperparameters to define the model characteristics and guide the optimisation process.  Here we have already decided on the model characteristics (number of layers, number of hidden neurons, etc).  The next part is how to set up the training process.


### Loss function
We also need a loss function.  We learnt about MSE (in registration), and cross-entropy.  For binary classification, binary cross-entropy (BCE) is the most appropriate loss function.  Here you can find the documentation: https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html#torch.nn.BCELoss 

In [None]:
loss_fn = torch.nn.BCELoss()
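
To make the loss less of a black box, the sketch below computes binary cross-entropy by hand on three made-up predictions and compares it with `BCELoss`. For a prediction $p$ and a target $y$, the loss of one sample is $-[y \log p + (1-y) \log(1-p)]$, and `BCELoss` averages it over the batch by default.

```python
import torch

# Made-up predicted probabilities and true labels, shape (3, 1)
pred = torch.tensor([[0.9], [0.2], [0.6]])
target = torch.tensor([[1.0], [0.0], [1.0]])

# Binary cross-entropy written out term by term, then averaged
manual = -(target * torch.log(pred) + (1 - target) * torch.log(1 - pred)).mean()

# The built-in version (default reduction is the mean)
builtin = torch.nn.BCELoss()(pred, target)

print(manual.item(), builtin.item())
```

Note that `BCELoss` expects probabilities, which is why the network ends with a sigmoid.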

### Training 1 epoch

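We need to write what happens when the model is trained for one epoch (that is, when the optimiser sees the complete dataset and updates the parameters).  Here the whole training set is passed through the model as a single batch, so each epoch makes exactly one parameter update.  In each epoch, we compute the error (loss function) and do backpropagation.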

In [None]:
def train_1epoch(X, Y, model, loss_fn, optimiser ):
    model.train()   # we tell the model we will be training it now

    # Compute prediction error
    pred = model(X)
    loss = loss_fn(pred, Y)

    # let's compute accuracy to check the training accuracy and compare it to what we got in the previous tutorial
    accuracy = (pred.round() == Y).type(torch.float).sum().item() / Y.size(0)

    # Backpropagation
    loss.backward()
    optimiser.step()
    optimiser.zero_grad()
    return loss.item(), accuracy # return the training loss and accuracy
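
The function above feeds the whole training set to the model at once, which is fine for a dataset this small. For larger datasets, an epoch is usually split into mini-batches. The sketch below shows one way to do that with `DataLoader`; it is a variation, not what this notebook uses, and the toy data, the tiny model and the name `train_1epoch_minibatch` are all made up for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy data: 64 samples, 3 features; the label is 1 when the first feature is positive
X_toy = torch.randn(64, 3)
Y_toy = (X_toy[:, :1] > 0).float()

toy_model = torch.nn.Sequential(torch.nn.Linear(3, 1), torch.nn.Sigmoid())
toy_loss = torch.nn.BCELoss()
toy_opt = torch.optim.Adam(toy_model.parameters(), lr=0.01)

# 64 samples in batches of 16 gives 4 parameter updates per epoch
loader = DataLoader(TensorDataset(X_toy, Y_toy), batch_size=16, shuffle=True)

def train_1epoch_minibatch(loader, model, loss_fn, optimiser):
    model.train()
    total = 0.0
    for xb, yb in loader:
        optimiser.zero_grad()            # clear gradients from the previous batch
        loss = loss_fn(model(xb), yb)    # forward pass on this batch only
        loss.backward()                  # backpropagation
        optimiser.step()                 # one parameter update per batch
        total += loss.item() * xb.size(0)
    return total / len(loader.dataset)   # average loss per sample over the epoch

epoch_loss = train_1epoch_minibatch(loader, toy_model, toy_loss, toy_opt)
print(len(loader), epoch_loss)
```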

### Evaluating without changing parameters
We should also check the model's performance against the validation dataset to be sure it is learning, and to assess whether it is underfitting, overfitting, or doing just right.

In [None]:
def evaluate(x, y, model, loss_fn):
    model.eval()    # we tell the model we will use it to evaluate or predict now.
    with torch.no_grad():   # we do not need to keep track of the gradients here --> faster and lighter in memory
        pred = model(x)
        test_loss = loss_fn(pred, y).item()
        accuracy = (pred.round() == y).type(torch.float).sum().item() / y.size(0)
    return test_loss, accuracy
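
The effect of `torch.no_grad()` can be seen directly on a tiny made-up layer: outputs computed inside the block are not connected to the gradient graph.

```python
import torch

layer = torch.nn.Linear(2, 1)   # a tiny layer with trainable weights
x = torch.ones(1, 2)

out_train = layer(x)            # normal forward pass: the graph is recorded

with torch.no_grad():
    out_eval = layer(x)         # same computation, but nothing is recorded

print(out_train.requires_grad, out_eval.requires_grad)  # True False
```

The two outputs hold the same numbers; only the bookkeeping differs, which is what saves time and memory during evaluation.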

### Optimiser
To train a model, we need an optimiser. We learnt that *Adam* is one of the most popular optimisers.  Adam's main hyperparameter is the learning rate, which controls the size of each update.  It is always good to check published values to guide your selection.  A very small value makes training slow, although it is likely to converge. In contrast, a very large value makes quick progress at first, but it may overshoot the minimum and fail to converge.  We will use 0.01.

In [None]:
optimiser = torch.optim.Adam(model.parameters(), lr=0.01)
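
The effect of the learning rate is easiest to see on a toy problem. The sketch below minimises $f(w) = w^2$, starting from $w = 5$, with three made-up learning rates. It uses plain SGD rather than Adam, because the SGD update $w \leftarrow w - lr \cdot 2w$ can be followed by hand; the helper `final_w` is hypothetical.

```python
import torch

def final_w(lr, steps=20):
    """Minimise w**2 from w = 5 with plain SGD and return the final w."""
    w = torch.tensor(5.0, requires_grad=True)
    opt = torch.optim.SGD([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (w ** 2).backward()
        opt.step()
    return w.item()

results = {lr: final_w(lr) for lr in (0.01, 0.1, 1.1)}
for lr, w in results.items():
    print(f"lr={lr}: w after 20 steps = {w:.4f}")
```

The smallest rate is still far from the minimum at 0 after 20 steps, the middle one gets close, and the largest overshoots further at every step and diverges. Adam adapts the step size per parameter, so it is more forgiving, but the same trade-off applies.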

### Training loop

Finally, we need to write the training loop:

In [None]:
epochs = 200
trainloss = []
validationloss = []
trainaccuracies = []
valaccuracies = []

In [None]:
for t in range(epochs):
    tl, ta = train_1epoch(X_train_t,Y_train_t, model, loss_fn, optimiser)
    vl, va = evaluate(x_val_t,y_val_t, model, loss_fn)
    if (t+1)%10 == 0 :
        print(f"Epoch {t+1}\tLosses:  Training {tl}, Validation {vl}." )
    trainloss.append(tl)
    validationloss.append(vl)
    trainaccuracies.append(ta)
    valaccuracies.append(va)
print("Done!")


### Plotting learning curves
Let's plot the training and validation curves to check whether we trained the best possible model.

In [None]:
import matplotlib.pyplot as plt  # type: ignore

fig, axs = plt.subplots(1,2,figsize=(9, 4)) # plotting multiple panels: 1 row, 2 columns
axs[0].plot(trainloss,'b-',label="Training Loss")
axs[0].plot(validationloss,'r-', label="Validation Loss")
axs[0].set_ylim(0, 1)
axs[0].set_xlabel("Epochs")
axs[0].set_ylabel("Loss")
axs[0].legend()

axs[1].plot(trainaccuracies,'b-',label="Training Accuracy")
axs[1].plot(valaccuracies,'r-', label="Validation Accuracy")
axs[1].set_ylim(0, 1)
axs[1].set_xlabel("Epochs")
axs[1].set_ylabel("Accuracy")
axs[1].legend()

plt.show()
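
Besides reading the plot, the recorded lists can be queried directly. A common rule of thumb is to look for the epoch with the lowest validation loss: if the validation loss rises after that point while the training loss keeps falling, the model has started to overfit. The sketch below uses short made-up loss histories (`train_hist` and `val_hist` are hypothetical stand-ins for `trainloss` and `validationloss`).

```python
import numpy as np

# Made-up loss histories, for illustration only
train_hist = [0.69, 0.60, 0.52, 0.46, 0.41, 0.37, 0.33]
val_hist = [0.70, 0.62, 0.55, 0.51, 0.50, 0.52, 0.56]

# Epoch (counted from 1) with the lowest validation loss
best_epoch = int(np.argmin(val_hist)) + 1

# Gap between validation and training loss at the last epoch:
# a large positive gap suggests overfitting
final_gap = val_hist[-1] - train_hist[-1]

print(f"Best epoch: {best_epoch}, final gap: {final_gap:.2f}")
```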

And now you could test the model on the external test dataset to check generalisability.

Training models is more of an art. I encourage you to look at other sources to learn more about it. A good resource is PyTorch's website, for example https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html. You could also try implementing grid search with k-fold validation, as in https://github.com/davidtvs/kaggle-titanic/blob/master/pytorch/surviving-the-titanic.ipynb

