# Classification

We will now train our first classifier, using the Titanic dataset.


Let's load the data using pandas as we learnt in the previous notebooks.

In [None]:
import pandas as pd
import numpy as np
# we are loading data from github. 
dataurl = 'https://github.com/rrr-uom-projects/MPiCRT-AI/raw/main/Data/titanic.csv' 
pax = pd.read_csv(dataurl, sep = ',')

We need to understand the data we have to start making sense of it. Here is a short description of the series:

- **PassengerId** Arbitrary number between 1 and 841
- **Survived** Whether the passenger survived: 0 = No, 1 = Yes
- **Pclass** Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd
- **Name** Name of the Passenger
- **Sex** Female/male
- **Age** Age in years
- **SibSp** No. of siblings / spouses aboard the Titanic
- **Parch** No. of parents / children aboard the Titanic
- **Ticket** Ticket number
- **Fare** Passenger fare
- **Cabin** Cabin number
- **Embarked** Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton


Let's set the correct type for the categorical variables here.

In [None]:
pax['Sex'] = pax['Sex'].astype('category')
pax['Survived'] = pax['Survived'].astype("category")
pax['Pclass'] = pax['Pclass'].astype("category")
pax['Embarked'] = pax['Embarked'].astype("category")

pax.info()
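
To see what the `category` type does under the hood, here is a minimal sketch on a toy column (the values are made up for illustration and only stand in for `pax['Embarked']`). Each distinct label becomes a category, and every row stores an integer code pointing at its category.

```python
import pandas as pd

# Toy column standing in for pax['Embarked'] (illustrative values only).
embarked = pd.Series(['S', 'C', 'S', 'Q', None]).astype('category')

# The distinct labels found in the column, in sorted order.
print(embarked.cat.categories.tolist())

# One integer code per row; pandas uses -1 for a missing value.
print(embarked.cat.codes.tolist())
```

Missing values do not get a category of their own, which is why they still show up as nulls in `info()`.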

## Preprocessing

During the last tutorial we explored the data and extracted some extra bits from some variables.  Let's bring the relevant code here.

### Imputing Age

We learnt how to impute age accounting for Sex, Pclass, Embarked, etc. Let's copy the relevant code here:

In [None]:
medianAges = pax.groupby(['Sex','Pclass','Embarked'], observed=True)[['Age']].median()
medianAges = medianAges.reset_index()

def getMedianAgeForCategory(row):
    # using the dataframe medianAges created above.
    condition = (
        (medianAges['Sex'] == row['Sex']) & 
        (medianAges['Pclass'] == row['Pclass']) & 
        (medianAges['Embarked'] == row['Embarked'])
    ) 
    return medianAges[condition]['Age'].values[0]

def imputeIfNeeded(row):
    return getMedianAgeForCategory(row) if np.isnan(row['Age']) else row['Age']

# replace each missing age with the median age of the passenger's Sex/Pclass/Embarked group
pax['Age'] = pax.apply(imputeIfNeeded, axis=1)
pax.info()
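
The same idea can also be written without a row-by-row `apply`. The sketch below is an alternative, not the code from the previous tutorial: it uses a toy dataframe (made-up values, grouped by two columns only) and lets `groupby(...).transform('median')` build a column of group medians that `fillna` then uses.

```python
import numpy as np
import pandas as pd

# Toy data with two missing ages (values are illustrative).
toy = pd.DataFrame({
    'Sex':    ['male', 'male', 'male', 'female', 'female'],
    'Pclass': [3, 3, 3, 1, 1],
    'Age':    [20.0, 30.0, np.nan, 40.0, np.nan],
})

# For every row, the median age of its (Sex, Pclass) group; NaNs are ignored.
group_median = toy.groupby(['Sex', 'Pclass'])['Age'].transform('median')

# Only the missing ages are replaced; known ages are left untouched.
toy['Age'] = toy['Age'].fillna(group_median)
print(toy)
```

On the real data you would group by the same three columns used above; with categorical columns, passing `observed=True` to `groupby` keeps the behaviour consistent with the code in this notebook.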

### Titles and title types
We also extracted titles from the passenger's name, and coded this title based on domain knowledge. 

Let's copy the relevant code here.

In [None]:
# First we need to cast the type of the Name series to str. 
pax['Name'] = pax['Name'].astype('string')
surnamefirstnames = pax['Name'].str.split(',')  # this splits the string by the token given (,)
pax['Surname'] = surnamefirstnames.str.get(0)   # here we get the first bit of the divided sentence
afterComma = surnamefirstnames.str.get(1).str.split('.')# this splits the string by the token given (.)
pax['Title'] = afterComma.str.get(0).str.strip()        # here we get the first bit of the divided sentence and eliminate empty spaces

Title_Dictionary = {
    "Capt": "Officer",
    "Col": "Officer",
    "Major": "Officer",
    "Jonkheer": "Royalty",
    "Don": "Royalty",
    "Sir" : "Royalty",
    "Dr": "Officer",
    "Rev": "Officer",
    "the Countess":"Royalty",
    "Mme": "Mrs",
    "Mlle": "Miss",
    "Ms": "Mrs",
    "Mr" : "Mr",
    "Mrs" : "Mrs",
    "Miss" : "Miss",
    "Master" : "Master",
    "Lady" : "Royalty"
}
pax['TitleType'] = pax['Title'].map(Title_Dictionary)
pax['TitleType'] = pax['TitleType'].astype('category')
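
As an aside, the title can also be pulled out in a single step with a regular expression. This is a sketch on a handful of example names (the last one is hypothetical), and the mapping is only an excerpt of the dictionary above. It also shows that `map` returns a missing value for any title that is not in the dictionary, which is worth checking before rows get dropped later.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
    'Doe, Prof. Jane',  # hypothetical passenger with a title missing from the map
])

# Capture everything between the comma and the first full stop.
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles.tolist())

# Excerpt of the title dictionary; unmapped titles become NaN.
title_map = {'Mr': 'Mr', 'Miss': 'Miss', 'the Countess': 'Royalty'}
title_type = titles.map(title_map)

# List the titles that the dictionary does not cover.
unmapped = titles[title_type.isna()].unique().tolist()
print('Unmapped titles:', unmapped)
```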


### Family sizes and types
We also created a new variable quantifying the size of each passenger's family, and classified it into three classes: single, small family, large family. Let's bring the relevant code here:

In [None]:
pax['FamilySize'] = pax['SibSp']+pax['Parch']+1 
def getFamilyType(famsize):
    return 'single' if famsize == 1 else ('smallFamily' if famsize < 5 else 'largeFamily')

pax['FamilyType'] = pax['FamilySize'].apply(getFamilyType)
pax['FamilyType'] = pax['FamilyType'].astype('category')
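
The same binning can be expressed with `pd.cut`, which returns a categorical column directly. This is an alternative sketch on toy family sizes, assuming sizes are whole numbers of at least 1; the bin edges are chosen to reproduce the thresholds used in `getFamilyType`.

```python
import numpy as np
import pandas as pd

family_size = pd.Series([1, 2, 4, 5, 11])  # illustrative sizes

# Bins are right-inclusive: (0, 1] single, (1, 4] small, (4, inf) large.
family_type = pd.cut(
    family_size,
    bins=[0, 1, 4, np.inf],
    labels=['single', 'smallFamily', 'largeFamily'],
)
print(family_type.tolist())
```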

## Eliminate variables 

Now we have extracted extra information from the data stored for each passenger. We can now clean up our dataframe in preparation for model training.

In [None]:
pax.columns

In [None]:
pax.info()

In [None]:
cleanpax = pax.loc[:,['Survived', 'Pclass', 'Sex', 'Age', 'SibSp',
       'Parch', 'Fare', 'Embarked', 'TitleType', 'FamilySize', 'FamilyType']]
cleanpax.dropna(inplace=True)
cleanpax.info()

# Data splitting

Before we do any training, let's divide the dataset into *training* and *validation* sets.  Ideally, we would have another dataset, *test*, to test for generalisability.  Kaggle kept a good portion of the data as test.  We won't use it in our tutorial.  But if you feel like seeing how generalisable your models are, join Kaggle and submit your solutions!

In [None]:
Y = cleanpax.loc[:,'Survived'] # This is the target!
X = cleanpax.loc[:, cleanpax.columns != 'Survived'] # These are the features/variables we will use to predict

# to divide the data in train/validation, we can use train_test_split from sklearn
from sklearn.model_selection import train_test_split
X_train, x_val, Y_train, y_val = train_test_split(X, Y, test_size=0.2, random_state=1234) # train 80%, validation 20%

print(f'Features for train/validation datasets: {X_train.shape} and  {x_val.shape}' )
print(f'  In percentages: {100*X_train.shape[0]/cleanpax.shape[0]:.2f}% and  {100*x_val.shape[0]/cleanpax.shape[0]:.2f}%' )
survcounts = [Y_train.value_counts(),y_val.value_counts()]
print(f'Percentage survived for train/validation datasets: {100*survcounts[0][1]/Y_train.shape[0]:.2f}, {100*survcounts[1][1]/y_val.shape[0]:.2f}')
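
If the survival percentages of the two sets differ more than you would like, `train_test_split` accepts a `stratify` argument that keeps the class proportions roughly equal in both parts. A minimal sketch on toy data (40% positives, made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 40 of them labelled 1 (illustrative only).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 40 + [0] * 60)

# stratify=y_toy asks for the same class balance in train and validation.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1234, stratify=y_toy
)

print(f'Positive rate in train: {y_tr.mean():.2f}, in validation: {y_va.mean():.2f}')
```

This tutorial keeps the plain random split; stratifying is simply an option to be aware of when classes are imbalanced.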

We can use X_train and Y_train to create our models, and use x_val and y_val to check for overfitting!  We will learn more about this later.

# Further Data pre-processing
We will use support vector machines (SVM) and random forest classifiers (RFC), as implemented in sklearn.  These implementations require all features to be *numerical*, so we need to convert the categorical values into numbers.

## Binary categories
First, let's do the binary categories: only two possible values are allowed.  In this case, one of the values can be mapped to 0 and the other one to 1.

In [None]:
X_train_num = X_train.copy()  # work on a copy so that X_train itself is left untouched
X_train_num['Sex'] = X_train_num['Sex'].map({'male': 0, 'female': 1}).astype(int)

Y_train_num = Y_train.astype(int)


## Categories with multiple values
Categorical variables with multiple values are a bit trickier. In this case, we can use the function get_dummies to convert them to a set of columns, one column per category value.  

In [None]:
X_train_num = pd.get_dummies(X_train_num, prefix='FamilyType',columns=['FamilyType'],dtype=int)
X_train_num.info()

In [None]:
X_train_num = pd.get_dummies(X_train_num, prefix='Embarked',columns=['Embarked'],dtype=int)
X_train_num = pd.get_dummies(X_train_num, prefix='TitleType',columns=['TitleType'],dtype=int)
X_train_num = pd.get_dummies(X_train_num, prefix='Pclass',columns=['Pclass'],dtype=int)

X_train_num.info()

We need to repeat the same operations on the validation dataset.

In [None]:
x_val_num = x_val.copy()  # work on a copy so that x_val itself is left untouched
x_val_num['Sex'] = x_val_num['Sex'].map({'male': 0, 'female': 1}).astype(int)
x_val_num = pd.get_dummies(x_val_num, prefix='FamilyType',columns=['FamilyType'],dtype=int)
x_val_num = pd.get_dummies(x_val_num, prefix='Embarked',columns=['Embarked'],dtype=int)
x_val_num = pd.get_dummies(x_val_num, prefix='TitleType',columns=['TitleType'],dtype=int)
x_val_num = pd.get_dummies(x_val_num, prefix='Pclass',columns=['Pclass'],dtype=int)
x_val_num.info()  # info() prints its summary itself and returns None

y_val_num = y_val.astype(int)
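
One thing to watch when encoding the two sets separately is that they must end up with the same columns in the same order. Here the columns are of `category` type, so get_dummies creates a column for every known category even if it does not occur in a given set. With plain string columns that is not guaranteed. The sketch below uses toy data to show the problem and one possible fix with `reindex`:

```python
import pandas as pd

# Plain string columns (not 'category'): validation never sees 'C' or 'Q'.
train = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
val = pd.DataFrame({'Embarked': ['S', 'S']})

train_num = pd.get_dummies(train, columns=['Embarked'], prefix='Embarked', dtype=int)
val_num = pd.get_dummies(val, columns=['Embarked'], prefix='Embarked', dtype=int)
print(list(val_num.columns))  # only the categories present in val

# Force validation to have the training columns, filling absent ones with 0.
val_num = val_num.reindex(columns=train_num.columns, fill_value=0)
print(val_num)
```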

# Support Vector Machines

Let's classify first with SVMs.  That means finding the best weights, bias and support vectors in the training dataset.  This is done easily using sklearn:

In [None]:
from sklearn import svm

classifier = svm.SVC(kernel="rbf", gamma=0.5, probability=True)
classifier.fit(X_train_num, Y_train_num)  # here is where the magic happens ;-)


## Evaluate the fit

Now let's see how the model works for the data we kept apart:

In [None]:
score = classifier.score(x_val_num, y_val_num)
print('Accuracy: ',score)

Not that good, eh?  
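
One thing this tutorial does not do is rescale the features. An RBF kernel works on distances between samples, so a feature with a large numeric range (Fare, for example) can dominate the others. The sketch below uses synthetic data, not the Titanic data, to show how a scaler and an SVM can be chained with `make_pipeline`; whether scaling helps on our dataset is something you would have to try.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data; one feature is blown up to mimic very different scales.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_toy[:, 0] *= 100

X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Same SVM settings as above, with and without standardising the features.
raw = SVC(kernel='rbf', gamma=0.5).fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(), SVC(kernel='rbf', gamma=0.5)).fit(X_tr, y_tr)

print('Accuracy without scaling:', raw.score(X_va, y_va))
print('Accuracy with scaling:   ', scaled.score(X_va, y_va))
```

The pipeline learns the scaling from the training data only and applies it to the validation data when scoring.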

## Sometimes less is more!
Let's try with fewer variables.  We learnt that Sex, Age and Pclass were very strongly correlated with Survived in the previous tutorial... Let's see if this works better:

In [None]:
X_train_small = X_train_num.loc[:,['Sex', 'Age', 'Pclass_1', 'Pclass_2', 'Pclass_3']]
x_val_small = x_val_num.loc[:,['Sex', 'Age', 'Pclass_1', 'Pclass_2', 'Pclass_3']]
classifier.fit(X_train_small, Y_train_num)  # refit the same SVM on the reduced feature set
score = classifier.score(x_val_small, y_val_num)

print('Accuracy: ',score)

This shows that more data/more features is not always better!!
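
A single train/validation split can be lucky or unlucky, so a comparison like this is more trustworthy when repeated over several splits. A minimal sketch with `cross_val_score` on synthetic data (the feature subset is arbitrary and only there to show the mechanics):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for our feature table (illustrative only).
X_toy, y_toy = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)

model = SVC(kernel='rbf', gamma=0.5)

# Five train/validation splits for each feature set; one accuracy per split.
scores_all = cross_val_score(model, X_toy, y_toy, cv=5)
scores_few = cross_val_score(model, X_toy[:, :3], y_toy, cv=5)

print(f'All features:   {scores_all.mean():.3f} +/- {scores_all.std():.3f}')
print(f'First 3 only:   {scores_few.mean():.3f} +/- {scores_few.std():.3f}')
```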

# Random Forest Classifiers

Let's explore now random forest classifiers.  
Let's use 100 trees in the forest and keep the out-of-bag score to get an idea of how well the training went.


In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, oob_score=True)
rf.fit(X_train_num, Y_train_num.astype(int))
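
Once a forest has been fitted with `oob_score=True`, the out-of-bag estimate is stored in the `oob_score_` attribute. Each tree is evaluated on the training samples it did not see during its bootstrap, which gives an accuracy estimate without touching the validation set. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the training set (illustrative only).
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X_toy, y_toy)

# Accuracy estimated from the samples each tree left out of its bootstrap.
print(f'Out-of-bag accuracy: {forest.oob_score_:.3f}')
```

For the forest trained above, the equivalent call would be `rf.oob_score_`.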

## Evaluate the fit

In [None]:
from sklearn import metrics
y_pred = rf.predict(x_val_num)

print( f'Accuracy: {metrics.accuracy_score(y_val_num, y_pred)}')

What about having fewer features?

In [None]:
rf_small = RandomForestClassifier(n_estimators=100, oob_score=True)
rf_small.fit(X_train_small, Y_train_num.astype(int))
y_pred = rf_small.predict(x_val_small)

print( f'Accuracy: {metrics.accuracy_score(y_val_num, y_pred)}')

In this case, the performance was not better with fewer features. Trees are able to 'squeeze' more information from the other dimensions!  At the same time, you need to be more aware of overfitting!

## Feature importances
Another great feature of RFC is that you can investigate which features contributed most to the splits in the trees (in sklearn, feature_importances_ is an impurity-based measure), which you could use to make the trees simpler.  Let's see this for the first forest we trained.

In [None]:
rf.feature_importances_

Let's visualise them:

In [None]:
feature_imp = pd.Series(rf.feature_importances_, index=X_train_num.columns).sort_values(ascending=True)
feature_imp.plot(kind='barh')
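
Impurity-based importances are computed on the training data, so it can be useful to cross-check them on held-out data. One option in sklearn is `permutation_importance`, which shuffles one feature at a time and measures how much the score drops. The sketch below uses synthetic data and made-up column names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with hypothetical column names f0..f4.
X_toy, y_toy = make_classification(n_samples=300, n_features=5, n_informative=2, random_state=0)
cols = [f'f{i}' for i in range(5)]

X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the validation data and record the accuracy drop.
result = permutation_importance(forest, X_va, y_va, n_repeats=10, random_state=0)
perm_imp = pd.Series(result.importances_mean, index=cols).sort_values()
print(perm_imp)
```

Features whose shuffling barely changes the score are candidates for removal.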

# Other classification metrics
We can use classification_report() to get a summary of the most common classification metrics.  Let's see these metrics for the first random forest we trained:

In [None]:
y_pred = rf.predict(x_val_num)
print(metrics.classification_report(y_val_num, y_pred))

You can find information on these metrics in the scikit-learn documentation: https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics 
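
To make the numbers in the report less mysterious, here is a small worked example with ten hand-made labels (not taken from our dataset). It builds the confusion matrix and computes precision, recall and F1 for class 1 by hand, then compares them with what sklearn returns.

```python
from sklearn import metrics

# Ten hand-made labels: 4 true positives in the data, 6 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat  = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() gives the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)  # 5 1 1 3

precision = tp / (tp + fp)  # of those predicted 1, how many really are 1
recall = tp / (tp + fn)     # of those that really are 1, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(precision, metrics.precision_score(y_true, y_hat))
print(recall, metrics.recall_score(y_true, y_hat))
print(f1, metrics.f1_score(y_true, y_hat))
print(accuracy, metrics.accuracy_score(y_true, y_hat))
```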

We are done for today!