
In [None]:
!wget -nc https://gist.githubusercontent.com/pfernandom/38ff7aa53993efd755b84ea1c89bf72b/raw/80c4cde766b74c93f4f9b452e1b7136e20c91140/train.csv

File ‘train.csv’ already there; not retrieving.



In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## Read the data from the CSV file

We read our data from `train.csv`.


In [None]:
train = pd.read_csv('train.csv')

In [None]:
train.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Clean the data

A few transformations are needed:
- Drop columns that don't provide information for the classification (e.g. ticket number)
- Fill empty values (e.g. fill empty age values with the median of existing age records)
- Combine columns into more meaningful features (e.g. add `SibSp` and `Parch`, plus one for the passenger, into a single `FamilySize` count, then bin that count into categories: `Single`, `Couple`, `InterM`, `Large`)
- Drop redundant columns (e.g. `SibSp` and `Parch`, as they have been merged into `FamilySize`)

If you imported more than one dataset (such as an extra test dataset, or unlabeled data to predict on), this cleanup needs to be applied to all of them.

**Note**: Data cleanup is specific to the dataset. You need to understand what your dataset is trying to achieve, which columns have a direct relation with the prediction result, which columns are unnecessary and how to better fill empty values and group ca

In [None]:
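def clean_data(dataset):
  # Work on a copy so the caller's DataFrame is left untouched
  dataset = dataset.copy()

  # Extract the title from names such as "Braund, Mr. Owen Harris" -> "Mr"
  dataset['Title'] = dataset['Name'].apply(lambda name: name.split(',')[1].split('.')[0].strip())
  # Collapse infrequent titles into a single 'Rare' category
  dataset['Title'] = dataset['Title'].replace(['Lady', 'the Countess', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona', 'Ms', 'Mme', 'Mlle'], 'Rare')

  # Family size = siblings/spouses + parents/children + the passenger
  dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

  def count_family(x):
    if x < 2:
        return 'Single'
    elif x == 2:
        return 'Couple'
    elif x <= 4:
        return 'InterM'
    else:
        return 'Large'

  dataset['FamilySize'] = dataset['FamilySize'].apply(count_family)
  # Fill missing values: most frequent port for Embarked, median for Age.
  # Assigning the result back avoids chained in-place updates on a column.
  dataset['Embarked'] = dataset['Embarked'].fillna(dataset['Embarked'].mode()[0])
  dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())
  # Drop columns that are redundant or not used as features
  dataset = dataset.drop(['PassengerId', 'Cabin', 'Name', 'SibSp', 'Parch', 'Ticket'], axis=1)
  return dataset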


train_clean = clean_data(train)

print(train_clean)

X = train_clean.iloc[:, 1:]
y = train_clean.iloc[:, 0]

X

     Survived  Pclass     Sex   Age     Fare Embarked Title FamilySize
0           0       3    male  22.0   7.2500        S    Mr     Couple
1           1       1  female  38.0  71.2833        C   Mrs     Couple
2           1       3  female  26.0   7.9250        S  Miss     Single
3           1       1  female  35.0  53.1000        S   Mrs     Couple
4           0       3    male  35.0   8.0500        S    Mr     Single
..        ...     ...     ...   ...      ...      ...   ...        ...
886         0       2    male  27.0  13.0000        S  Rare     Single
887         1       1  female  19.0  30.0000        S  Miss     Single
888         0       3  female  28.0  23.4500        S  Miss     InterM
889         1       1    male  26.0  30.0000        C    Mr     Single
890         0       3    male  32.0   7.7500        Q    Mr     Single

[891 rows x 8 columns]


Unnamed: 0,Pclass,Sex,Age,Fare,Embarked,Title,FamilySize
0,3,male,22.0,7.2500,S,Mr,Couple
1,1,female,38.0,71.2833,C,Mrs,Couple
2,3,female,26.0,7.9250,S,Miss,Single
3,1,female,35.0,53.1000,S,Mrs,Couple
4,3,male,35.0,8.0500,S,Mr,Single
...,...,...,...,...,...,...,...
886,2,male,27.0,13.0000,S,Rare,Single
887,1,female,19.0,30.0000,S,Miss,Single
888,3,female,28.0,23.4500,S,Miss,InterM
889,1,male,26.0,30.0000,C,Mr,Single


## Encode data
PyTorch tensors hold only numeric data, so we convert all categorical/string columns into numbers.

`pd.get_dummies` takes the columns listed in `categorical_columns` and creates a new 0/1 column for each category in those columns, named `<column>_<category>`. Because we pass `drop_first=True`, the first category of each column is dropped, since it is implied when all the other columns are 0 (e.g. the `Sex` column {male, female} becomes a single column `Sex_male` {1, 0}).

Just as with data cleanup, this also needs to be done for any unlabeled data for which we want predictions from the trained model.
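When unlabeled data is encoded separately, it may not contain every category seen during training, so its dummy columns can differ from the training columns. The following is a minimal sketch of one way to align them, using toy frames (`train_toy` and `new_toy` are made-up names and values, not part of this notebook): the new data is encoded without `drop_first` and then reindexed to the training columns.

```python
import pandas as pd

# Toy "training" frame: three categories in Embarked, two in Sex (made-up values)
train_toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})
# Toy "unlabeled" frame that happens to contain no 'Q' passengers
new_toy = pd.DataFrame({
    "Age": [30.0, 41.0],
    "Sex": ["female", "male"],
    "Embarked": ["S", "C"],
})

cols = ["Sex", "Embarked"]
train_enc = pd.get_dummies(train_toy, columns=cols, drop_first=True, dtype=int)

# Encode every category of the new data, then keep exactly the training columns:
# extra columns are discarded and categories missing from new_toy are filled with 0
new_enc = pd.get_dummies(new_toy, columns=cols, dtype=int)
new_aligned = new_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(new_aligned)
```

Encoding the new data without `drop_first` matters here: with `drop_first=True`, a frame that lacks the first training category would silently drop a different one.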

In [None]:
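def encode_data(X):
  categorical_columns = ['Pclass','Sex', 'FamilySize', 'Embarked', 'Title']

  # dtype=int keeps the dummy columns as 0/1 integers; recent pandas versions
  # default to booleans, which do not convert cleanly to a float tensor later
  X_enc = pd.get_dummies(X, prefix=categorical_columns, columns=categorical_columns, drop_first=True, dtype=int)
  return X_enc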

X_enc = encode_data(X)
X_enc.head()

Unnamed: 0,Age,Fare,Pclass_2,Pclass_3,Sex_male,FamilySize_InterM,FamilySize_Large,FamilySize_Single,Embarked_Q,Embarked_S,Title_Miss,Title_Mr,Title_Mrs,Title_Rare
0,22.0,7.25,0,1,1,0,0,0,0,1,0,1,0,0
1,38.0,71.2833,0,0,0,0,0,0,0,0,0,0,1,0
2,26.0,7.925,0,1,0,0,0,1,0,1,1,0,0,0
3,35.0,53.1,0,0,0,0,0,0,0,1,0,0,1,0
4,35.0,8.05,0,1,1,0,0,1,0,1,0,1,0,0


## Split the training data into train and validation datasets

Since the test dataset has no labels, we cannot use it to evaluate our model's performance.

We split our training data into `train` and `val`. During training, the model learns from `train`; afterwards we use `val` to measure how often the model predicts correctly on rows it has not seen.

We use 90% of the data for training and 10% for validation.

Don't forget to also split the labels `y`, so they stay aligned with the rows of each sub-dataset.

In [None]:
# split the training data into training and validation sets
x_train, x_val, y_train, y_val = train_test_split(X_enc, y, test_size = 0.1)


We confirm that the data has been split in the expected percentages

In [None]:
print(X_enc.shape, x_train.shape, x_val.shape)

(891, 14) (801, 14) (90, 14)
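The split above is random, so the rows that land in each subset change on every run. As a hedged sketch on toy arrays (`X_toy` and `y_toy` are made-up data), `train_test_split` also accepts `random_state` to make the split repeatable and `stratify` to keep the class proportions the same in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 2 features, 60 negatives and 40 positives (made-up values)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffle; stratify keeps the 60/40 ratio in both subsets
x_tr, x_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=0, stratify=y_toy
)

print(x_tr.shape, x_va.shape)
print(y_tr.mean(), y_va.mean())
```

Whether to fix the seed is a design choice: it makes results comparable between runs, at the cost of always evaluating on the same validation rows.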


## The PyTorch model

This PyTorch model defines a neural network with the following layers:

- A linear transformation with an input of 14 columns and an output of size 270
- A dropout layer that zeroes each of those 270 values with a probability of 0.1 (10%)
- A ReLU activation layer
- Another linear transformation with an input of 270 (the output of the first linear transformation) and an output of 2 (one score per class: index 0 for not surviving, index 1 for surviving)
- A sigmoid that squashes each of the two scores into the range 0 to 1


https://pytorch.org/docs/stable/generated/torch.nn.Linear.html
https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/
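Dropout behaves differently depending on whether a module is in training or evaluation mode. The sketch below uses a standalone `nn.Dropout` layer on a vector of ones (an illustration only, not the model defined in this notebook) to show both modes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.1)
x = torch.ones(1000)

# Training mode: each element is zeroed with probability p,
# and the surviving elements are scaled by 1 / (1 - p)
drop.train()
out_train = drop(x)

# Evaluation mode: dropout is a no-op
drop.eval()
out_eval = drop(x)

print((out_train == 0).float().mean().item())  # fraction of zeroed elements, close to p
print(out_train.max().item())                  # surviving elements after rescaling
print(torch.equal(out_eval, x))
```

The functional form `F.dropout` has no module state of its own, so it only follows the mode when it is called with `training=self.training`.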


In [None]:
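# PyTorch model:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self):
        super().__init__()
        # 14 input features -> 270 hidden units -> 2 output scores
        self.fc1 = nn.Linear(14, 270)
        self.fc2 = nn.Linear(270, 2)

    def forward(self, x):
        x = self.fc1(x)
        # Apply dropout only in training mode; F.dropout defaults to
        # training=True, which would also drop values during evaluation
        x = F.dropout(x, p=0.1, training=self.training)
        x = F.relu(x)
        x = self.fc2(x)
        # Squash each score into (0, 1). nn.CrossEntropyLoss applies its own
        # log-softmax, so this step could be dropped when training with that loss
        x = torch.sigmoid(x)

        return x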

net = Net()

## Training parameters
We train the model in mini-batches of 50 samples, so the weights are updated several times per epoch instead of once per pass over the data.

We train the model for 50 epochs (full passes over the training data), with a learning rate of 0.01.

We use the cross-entropy loss function (commonly used for classification) to measure how wrong our model is at a given training step.

We use the Adam optimizer to update the model's weights from the gradients computed by backpropagation (more details on gradients and backpropagation in http://neuralnetworksanddeeplearning.com/chap2.html)

https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
https://analyticsindiamag.com/ultimate-guide-to-pytorch-optimizers/
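To make the loss function concrete, here is a small worked example on made-up scores. It shows that PyTorch's cross-entropy is the negative log-likelihood of a log-softmax over the raw scores, which is why the loss expects unnormalized scores (logits) as input:

```python
import torch
import torch.nn.functional as F

# Made-up raw scores for two samples and two classes
logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])
# Correct class index for each sample
targets = torch.tensor([0, 1])

# Built-in cross-entropy (the functional form of nn.CrossEntropyLoss)
loss_ce = F.cross_entropy(logits, targets)

# The same value computed in two explicit steps
log_probs = F.log_softmax(logits, dim=1)
loss_manual = F.nll_loss(log_probs, targets)

print(loss_ce.item(), loss_manual.item())
```

Both numbers are the mean over the two samples of `-log(softmax(logits)[target])`.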

In [None]:
# Model params:
batch_size = 50
num_epochs = 50
learning_rate = 0.01
batch_no = len(x_train) // batch_size

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)
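With 801 training rows and a batch size of 50, `batch_no` is 16, so one row is left out of each epoch (a different one each time if the data is reshuffled). As an alternative sketch, PyTorch's `DataLoader` can handle shuffling and batching, including the final partial batch. The tensors below are random placeholders with the same shape as the training data, not the real features:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random placeholder data with the same shape as the training set (801 rows, 14 features)
features = torch.randn(801, 14)
targets = torch.randint(0, 2, (801,))

# shuffle=True reshuffles the rows at the start of every epoch
loader = DataLoader(TensorDataset(features, targets), batch_size=50, shuffle=True)

# One pass over the loader corresponds to one epoch
sizes = [len(x_batch) for x_batch, _ in loader]

print(801 // 50)             # full batches when using integer division
print(len(sizes), sizes[-1])  # number of batches from the loader, size of the last one
```

Passing `drop_last=True` to `DataLoader` would reproduce the integer-division behavior and skip the partial batch.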

## Train the model

We train the `net` model. More details in the code comments.

In [None]:
from sklearn.utils import shuffle
from torch.autograd import Variable

# Iterate for the number of epochs
for epoch in range(num_epochs):
    # Print the epoch number every 5 epochs
    if epoch % 5 == 0:
        print('Epoch {}'.format(epoch+1))

    # Shuffle the datasets so that each epoch builds its batches
    # from different rows instead of the same 50 rows every time
    x_train, y_train = shuffle(x_train, y_train)
    # Mini batch learning
    for i in range(batch_no):
        start = i * batch_size
        end = start + batch_size

        # Convert the current batch of the Pandas data into PyTorch tensors:
        # float32 features and int64 class labels
        x_var = torch.tensor(x_train.values[start:end].astype(np.float32))
        y_var = torch.tensor(y_train.values[start:end], dtype=torch.long)

        # Reset the gradients accumulated in the previous step
        optimizer.zero_grad()

        # Run a training step: Pass the training data to
        # the neural network layers
        ypred_var = net(x_var)

        # Calculate the training loss
        loss = criterion(ypred_var, y_var)

        # Backpropagate to compute the gradients, then update the weights
        loss.backward()
        optimizer.step()

Epoch 1
Epoch 6
Epoch 11
Epoch 16
Epoch 21
Epoch 26
Epoch 31
Epoch 36
Epoch 41
Epoch 46


## Measure the model's performance

We now predict the labels for our validation set `x_val`.

The result of the prediction in `result = net(validation_data)` is a matrix where each row holds the model's output for one validation row, and the columns are:
- column 0: the score for the passenger not surviving
- column 1: the score for the passenger surviving

Each score lies between 0 and 1 because of the final sigmoid. The two scores in a row do not have to add up to exactly 1, since the sigmoid is applied to each column independently; the larger the gap between them, the more certain the model is about the result.

In [None]:
## convert the Pandas dataframe for the validation data to a PyTorch tensor
validation_data = torch.tensor(x_val.values.astype(np.float32))

## Switch the model to evaluation mode (affects layers that depend on the mode)
net.eval()

## Use "no_grad" to skip gradient tracking, as we are not training
## this model anymore
with torch.no_grad():
    ## get the predicted values
    result = net(validation_data)

## Sample 5 results
result[0:5, :]

tensor([[1.4149e-07, 9.9859e-01],
        [9.9999e-01, 4.8924e-12],
        [5.1183e-08, 9.9968e-01],
        [1.0000e+00, 7.7117e-22],
        [1.0000e+00, 4.1081e-16]])

Since we want a binary classification (survived/didn't survive), we take, for each row, the index of the column with the highest score (1 = survived, 0 = didn't survive).

https://pytorch.org/docs/stable/generated/torch.max.html

In [None]:
values, labels = torch.max(result, 1)
## sample the first 5 results
labels[0:5]

tensor([1, 0, 1, 0, 0])
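If proper probabilities are needed, so that each row sums to 1, a softmax can be applied to raw scores. The sketch below uses made-up raw scores (it assumes a model that returns logits, which is not what `Net.forward` returns here, since it ends with a sigmoid) and shows that the predicted label is the same before and after the softmax:

```python
import torch

# Made-up raw scores (logits) for three passengers and two classes
scores = torch.tensor([[-3.0, 4.0], [5.0, -2.0], [0.2, 0.1]])

# Softmax over the class dimension turns each row into probabilities
probs = torch.softmax(scores, dim=1)
print(probs.sum(dim=1))

# Softmax preserves the ordering within a row, so the argmax is unchanged
labels_from_scores = scores.argmax(dim=1)
labels_from_probs = probs.argmax(dim=1)
print(labels_from_scores, labels_from_probs)
```

`argmax(dim=1)` returns the same indices as the second value of `torch.max(..., 1)`.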

We calculate a simple percentage of accuracy:

`num_right` = the number of rows where the prediction matched the actual value in the validation set

`all_rows` = the total number of rows

accuracy = `num_right` / `all_rows`

In [None]:
num_right = np.sum(labels.numpy() == y_val.values)
all_rows = len(y_val)
print('Accuracy {:.2f}'.format(num_right / all_rows))

Accuracy 0.89


## Final notes
The accuracy is based only on the examples in this dataset. On data from outside this dataset, accuracy may be lower, as that data may contain combinations of attributes that don't appear here.

The accuracy will vary between training runs, as the data is split randomly between the training and validation datasets, and the initial weights, the shuffling and the dropout are random as well.

Also, accuracy can be a misleading metric: if the model predicts the majority class for every sample ("did not survive", roughly 62% of this training set), which is clearly indicative of an incorrect model, it still gets a relatively high accuracy. This is why we also use other metrics like the [F1 score](https://towardsdatascience.com/the-f1-score-bec2bbc38aa6), and do further analysis on the results with [confusion matrices](https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62).
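As a small illustration of that last point, the sketch below scores a "model" that always predicts the same class on made-up labels (`y_true` is toy data, not the validation set from this notebook), using scikit-learn's metrics:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy labels: 6 passengers who did not survive (0) and 4 who did (1)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
# A degenerate model that always predicts the majority class
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)
# zero_division=0 defines F1 as 0 when no positives are predicted
f1 = f1_score(y_true, y_majority, zero_division=0)
# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_majority)

print(acc)
print(f1)
print(cm)
```

The accuracy looks acceptable, while the F1 score for the "survived" class and the empty second column of the confusion matrix show that the model never predicts a survivor.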

## Extra material

- https://pandas.pydata.org/docs/
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html