# Preprocessing data before machine learning

In preparation for the machine learning tutorial/practical, we are going to preprocess the data to get the most out of it.  We are using some ideas proposed in https://www.ahmedbesbes.com/blog/kaggle-titanic-competition  

Let's load the data using pandas as we learnt in the previous notebooks.

In [1]:
import pandas as pd
# we are loading data from github. 
dataurl = 'https://github.com/rrr-uom-projects/MPiCRT-AI/raw/main/Data/titanic.csv' 
pax = pd.read_csv(dataurl, sep = ',')

We need to understand the data we have to start making sense of it. Here is a short description of the series:

- **PassengerId** Arbitrary number between 1 and 891
- **Survived** Whether the passenger survived or not: 0 = No, 1 = Yes
- **Pclass** Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd
- **Name** Name of the Passenger
- **Sex** Female/male
- **Age** Age in years
- **SibSp** No. of siblings / spouses aboard the Titanic
- **Parch** No. of parents / children aboard the Titanic
- **Ticket** Ticket number
- **Fare** Passenger fare
- **Cabin** Cabin number
- **Embarked** Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton


Let's set the categorical variables to the correct data type here.

In [None]:
pax['Sex'] = pax['Sex'].astype('category')
pax['Survived'] = pax['Survived'].astype("category")
pax['Pclass'] = pax['Pclass'].astype("category")
pax['Embarked'] = pax['Embarked'].astype("category")

pax.info()

Notice that the data we have is the same as the training dataset of the Kaggle competition: https://www.kaggle.com/c/titanic/  
We will learn about the training/validation/testing data split in the next lecture.

Let's import seaborn for visualisation and other useful libraries.

In [3]:
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns

In machine learning, the outcome we want to predict is called the **target variable**.  Anything that is used to predict the target is called a **feature**. Many of the algorithms that we will use do not support missing data in the features nor categorical variables, so we will need to do some data transformation to make it compatible.

## Important variables - Univariable testing with SciPy

Let's see if the patterns/differences we observed last time are 'statistically' solid. 

### Sex? Class? - Categorical Variables

In the previous notebook we were able to see some patterns and realised that females were more likely to survive, and that people in the lowest class were less likely to survive.  Let's plot what we mean:

In [None]:
fig, axs = plt.subplots(1,2,figsize=(10, 3)) # plotting multiple panels
sns.countplot( x='Sex',hue='Survived', data=pax, stat='percent', ax=axs[0] )
sns.countplot( x='Pclass',hue='Survived', data=pax, stat='percent', ax=axs[1] )
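
The bar charts above can be complemented by a table of survival fractions within each group. The sketch below uses a small made-up table (not the Titanic file) and the `normalize` argument of `pd.crosstab`; the column values are assumptions chosen only for illustration.

```python
import pandas as pd

# Toy data, made up for illustration (not the Titanic file)
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 0, 1, 0],
})

# normalize="index" divides each row by its row total,
# so every row sums to 1 and reads as "fraction within this group"
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
print(rates)
```

On the real data, replacing `toy` with `pax` should give the survival fraction for each Sex (or Pclass) directly.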

We can also test whether what we see is statistically significant. In this case, we want to test whether two categorical variables are related, e.g., Sex vs Survived or Pclass vs Survived.  The common test used in this situation is the Chi-squared test. 

Let's use the implementation in SciPy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html

In [5]:
from scipy.stats import chi2_contingency

To use this test, we need to build a contingency table.  In this table we simply summarise the number of times all combinations of the categorical values are observed.  Let's do this:


In [None]:
sex_suriv_contingency_table = pd.crosstab(pax["Sex"], pax["Survived"])
print(sex_suriv_contingency_table)

In [7]:
chi2, p, dof, expected = chi2_contingency(sex_suriv_contingency_table)

Given the row and column totals, the expected contingency table if there were no relationship between these variables is:

In [None]:
print("Expected frequencies:")
print(pd.DataFrame(expected, index=sex_suriv_contingency_table.index, columns=sex_suriv_contingency_table.columns))
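
To see where the expected frequencies come from, here is a minimal sketch on a made-up 2x2 table (the counts and the names `groupA`/`groupB` are hypothetical). Under independence, each expected count is the row total times the column total divided by the overall total, and this should match what `chi2_contingency` returns.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Toy contingency table (made-up counts, not the Titanic data)
observed = pd.DataFrame(
    [[20, 80], [60, 40]],
    index=["groupA", "groupB"],
    columns=["no", "yes"],
)

row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
n = observed.to_numpy().sum()

# Expected count under independence: row total * column total / n
expected_by_hand = np.outer(row_totals, col_totals) / n

chi2, p, dof, expected = chi2_contingency(observed)
print(expected_by_hand)
print(np.allclose(expected, expected_by_hand))
```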

Checking the p-value will tell us whether the differences we are seeing are significant or not. 

Given a level of significance (often 0.05), we can judge whether the differences we see could plausibly be due to chance:
- If p < 0.05, we reject the null hypothesis of no relationship. In this case, we will say that there is likely a relationship between Sex and Survival. 
- If p ≥ 0.05, we fail to reject the null hypothesis: the difference is not statistically significant and could be attributed to chance.

In [None]:
print(f"P-value: {p}")
print(f"Is it significant at 0.05? {p<0.05}")

In this case, the value is super small, so we can say that the difference in survival between males/females is very unlikely to be caused by chance.  



We can run a similar analysis for Pclass:

In [None]:
pclass_suriv_contingency_table = pd.crosstab(pax["Pclass"], pax["Survived"])
print(pclass_suriv_contingency_table)

chi2, p, dof, expected = chi2_contingency(pclass_suriv_contingency_table)

print("Expected frequencies:")
print(pd.DataFrame(expected, index=pclass_suriv_contingency_table.index, columns=pclass_suriv_contingency_table.columns))

print(f"P-value: {p}")
print(f"Is it significant at 0.05? {p<0.05}")


Looking at the p-value, what can you conclude?

You can also repeat this analysis with Embarked.

In [11]:
# Write some code here?


### Age? Fare? - Continuous variables

We also saw a different pattern related to Age and Fare.

In [None]:
fig, axs = plt.subplots(1,2,figsize=(8, 3)) # plotting multiple panels
sns.violinplot(y='Age', hue='Survived', data=pax,  split=True, ax=axs[0] )
sns.violinplot(y='Fare', hue='Survived', data=pax,  split=True, ax=axs[1] )

We can test this difference using, for example, the t-test: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

In [13]:
from scipy.stats import ttest_ind

In [None]:
agesurv0 = pax.loc[pax['Survived']==0,'Age']
agesurv1 = pax.loc[pax['Survived']==1,'Age']
print(np.mean(agesurv0), np.mean(agesurv1))

In [None]:
tstat,p=ttest_ind(agesurv0,agesurv1)
print(f"P-value: {p}")

Why did we get nan??? 

In [16]:
### count values that are nan here:

# what do they mean? 


A way around this is to tell SciPy to ignore NaN values when running the t-test:

In [None]:
tstat,p=ttest_ind(agesurv0,agesurv1,nan_policy='omit')
print(f"P-value: {p}")
print(f"Is it significant at a 0.05 level? {p<0.05}")
print(f"Is it significant at a 0.01 level? {p<0.01}")
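
As a sanity check on what `nan_policy='omit'` does, the sketch below compares it with removing the NaN values by hand. The two arrays are made-up numbers, not the Titanic ages; the expectation is that both routes give the same p-value, while the default call returns NaN.

```python
import numpy as np
from scipy.stats import ttest_ind

# Toy samples (made-up numbers) with missing values
a = np.array([22.0, 35.0, np.nan, 41.0, 28.0, 30.0])
b = np.array([19.0, np.nan, 24.0, 27.0, 21.0, np.nan, 25.0])

# By default any NaN in the input propagates into the result
_, p_default = ttest_ind(a, b)

# Option 1: let SciPy skip the NaNs
_, p_omit = ttest_ind(a, b, nan_policy="omit")

# Option 2: remove the NaNs ourselves before testing
_, p_manual = ttest_ind(a[~np.isnan(a)], b[~np.isnan(b)])

print(p_default, p_omit, p_manual)
```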

Inspecting the p-value... What is your conclusion? (assuming a level of significance of 0.05)

Let's repeat the analysis for Fare.  We already know that Fare is correlated with Pclass, so the question is whether we find a similar relationship with Survived.  

In [None]:
fig, axs = plt.subplots(1,1,figsize=(4, 3)) # a single panel this time
sns.boxplot(x='Pclass',y='Fare',data=pax, ax=axs )

In [None]:
faresurv0 = pax.loc[pax['Survived']==0,'Fare']
faresurv1 = pax.loc[pax['Survived']==1,'Fare']
print(np.mean(faresurv0), np.mean(faresurv1))
tstat,p=ttest_ind(faresurv0,faresurv1,nan_policy='omit')
print(f"P-value: {p}")

You could argue that Fare is not normally distributed. Alternatively, you could use the Mann-Whitney U test, which does not assume normality: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html

In [None]:
from scipy.stats import mannwhitneyu
print(np.median(faresurv0), np.median(faresurv1))
stat,p=mannwhitneyu(faresurv0,faresurv1,nan_policy='omit')
print(f"P-value: {p}")
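
One way to see the difference between the two tests: the Mann-Whitney U test only uses the ordering (ranks) of the values, so a monotonic transformation such as a logarithm should leave its result unchanged, whereas the t-test works on the actual values. A minimal sketch with made-up, right-skewed numbers (not the real fares):

```python
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind

# Toy right-skewed samples (made-up values, no ties)
x = np.array([7.0, 8.0, 8.5, 10.0, 13.0, 26.0, 80.0])
y = np.array([9.0, 15.0, 30.0, 52.0, 75.0, 150.0, 510.0])

u_raw, p_raw = mannwhitneyu(x, y)
u_log, p_log = mannwhitneyu(np.log(x), np.log(y))

t_raw, pt_raw = ttest_ind(x, y)
t_log, pt_log = ttest_ind(np.log(x), np.log(y))

print(p_raw, p_log)    # same: only the ordering of the values matters
print(pt_raw, pt_log)  # differ: the t-test uses the actual values
```

This is one reason a rank-based test can be a reasonable choice for a heavily skewed variable like Fare.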

What can you conclude? 

## Dealing with missing values

We already saw that having NaN values can make our life difficult.  This is why it is important to decide what to do about missing data.  

One approach is to 'drop' all the entries with missing values. That can be done with the function dropna: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html

Let's see what this would do to our dataset:

In [None]:
paxnonans = pax.dropna()
print('Before dropping NaNs:', pax.shape)
print('After dropping NaNs: ', paxnonans.shape)

In this case we are left with ~20% of the data (183 rather than 891 passengers).  This will heavily limit our ability to do any machine learning on this data!  Instead, let's first figure out which variables have the most missing values and whether it is worth working on them.

In [None]:
pax.info()

In order: 
- Cabin has >77% missing values. Is this an important variable??
- Age has almost 20% missing values. Is this an important variable? 
- Embarked has 2 missing values.
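
Percentages like these can be computed rather than read off `info()`. A minimal sketch, using a made-up frame (the values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with gaps in two columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 8.05, 13.0],
})

# isna() marks missing cells; the column mean of those booleans
# is the fraction of missing values in each column
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```

Running the same expression on `pax` should list the columns from most to least incomplete.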

### Imputing Age

We need to see whether there are differences for Age related to other features, e.g. Sex, Pclass, Embarked, etc.

In [None]:
fig, axs = plt.subplots(1,3,figsize=(10, 3)) # plotting multiple panels
sns.boxplot(x='Pclass',y='Age',data=pax, ax=axs[0] )
sns.violinplot(hue='Sex',y='Age',data=pax, split=True,ax=axs[1] )
sns.boxenplot(x='Embarked',y='Age',data=pax, ax=axs[2] )

In [None]:
medianAges = pax.groupby(['Sex','Pclass','Embarked'], observed=True)[['Age']].median()
medianAges = medianAges.reset_index()
print(medianAges)

We can create a function that, given a row with a missing Age, returns the median age of passengers with the same Sex, Pclass and port of embarkation:

In [25]:
def getMedianAgeForCategory(row):
    # using the dataframe medianAges created above.
    condition = (
        (medianAges['Sex'] == row['Sex']) & 
        (medianAges['Pclass'] == row['Pclass']) & 
        (medianAges['Embarked'] == row['Embarked'])
    ) 
    return medianAges[condition]['Age'].values[0]

Let's find one passenger with Age == NaN and see it applied:

In [None]:
print(pax.loc[np.isnan(pax['Age']),['Pclass','Sex','Embarked','Age']])

In [None]:
index  = 19
row =  pax.loc[index,['Pclass','Sex','Embarked','Age']]
print(row)
print('>> Imputed age: ', getMedianAgeForCategory(row))


Let's now impute all the values that are missing in the Series Age, using the function apply(): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html 

In [None]:
def imputeIfNeeded(row):
    return getMedianAgeForCategory(row) if np.isnan(row['Age']) else row['Age']

# Alternatively:
#def imputeIfNeeded(row):
#    theage = row['Age']
#    if np.isnan(theage):
#        theage = getMedianAgeForCategory(row)
#    return theage


# store the imputed ages in a new column so that the original Age column is kept
pax['AgeWithoutNaNs'] = pax.apply(imputeIfNeeded, axis=1)
pax.info()

In [None]:
#Let's check that we have not messed up the values that were there before.
sns.scatterplot(x='Age',y='AgeWithoutNaNs',data=pax)
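
As an alternative to the row-by-row `apply()` used above, group-wise imputation can also be sketched with `groupby(...).transform('median')` followed by `fillna()`. This is a different technique from the one in this notebook, shown on made-up passengers and grouping only by Sex and Pclass to keep the example small.

```python
import numpy as np
import pandas as pd

# Toy data (made-up passengers)
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# transform("median") returns one value per row:
# the median Age of the group that row belongs to
group_median = toy.groupby(["Sex", "Pclass"])["Age"].transform("median")

# fillna only touches the missing entries; observed ages stay as they are
toy["AgeWithoutNaNs"] = toy["Age"].fillna(group_median)
print(toy)
```

If a whole group had no observed ages, its median would itself be NaN, so it is worth checking for remaining missing values afterwards.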

## Coding extra information

Up to now we have not done much with the variable Name (besides counting Marys in the first notebook!) nor with SibSp/Parch. We could still extract some information from these.



### Titles?
If we observe this variable, each name has the title of the person:

In [None]:
pax['Name']

With some domain knowledge, we can see that these titles can give us extra information. For example, if there was any person from the Royal family, it is likely they would survive independently of Sex, Fare, port of embarkation, etc.  

Let's try to extract these data.  We see that the title always comes after a comma and ends with a dot. We have already extracted the surname in the first notebook, so let's copy the relevant code and extend it here:

In [None]:
# First we need to cast the type of the Name series to str. 
pax['Name'] = pax['Name'].astype('string')
surnamefirstnames = pax['Name'].str.split(',')  # this splits the string by the token given (,)
pax['Surname'] = surnamefirstnames.str.get(0)   # here we get the first bit of the divided sentence
afterComma = surnamefirstnames.str.get(1).str.split('.')# this splits the string by the token given (.)
pax['Title'] = afterComma.str.get(0).str.strip()        # here we get the first bit of the divided sentence and eliminate empty spaces
print(pax['Title'].value_counts())
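
The same "after the comma, up to the dot" rule can also be written as a single regular expression with `str.extract`. This is an alternative sketch, applied to a few names in the same "Surname, Title. First names" layout rather than to the full dataset:

```python
import pandas as pd

# A few names in the "Surname, Title. First names" layout
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the first comma and the next dot
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```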

With some knowledge of titles (and checking the blog), we see that these titles correspond to crew/officers of the ship, Royalty, etc.:

- Capt: Officer
- Col: Officer
- Major: Officer
- Jonkheer: Royalty
- Don: Royalty
- Sir: Royalty
- Dr: Officer
- Rev: Officer
- the Countess: Royalty
- Mme: Mrs
- Mlle: Miss
- Ms: Mrs
- Mr: Mr
- Mrs: Mrs
- Miss: Miss
- Master: Master
- Lady: Royalty

We can use a similar strategy as you used in a previous practical to create a new category: TitleType.  For this we can create a dictionary and use map().

In [None]:
Title_Dictionary = {
    "Capt": "Officer",
    "Col": "Officer",
    "Major": "Officer",
    "Jonkheer": "Royalty",
    "Don": "Royalty",
    "Sir" : "Royalty",
    "Dr": "Officer",
    "Rev": "Officer",
    "the Countess":"Royalty",
    "Mme": "Mrs",
    "Mlle": "Miss",
    "Ms": "Mrs",
    "Mr" : "Mr",
    "Mrs" : "Mrs",
    "Miss" : "Miss",
    "Master" : "Master",
    "Lady" : "Royalty"
}
pax['TitleType'] = pax['Title'].map(Title_Dictionary)
pax['TitleType'] = pax['TitleType'].astype('category')
print(pax['TitleType'].value_counts())
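
One thing to keep in mind with `map()` and a dictionary: any value that is not a key in the dictionary becomes NaN. The sketch below uses a hypothetical mini-dictionary and a made-up list of titles to show how such unmapped values could be spotted.

```python
import pandas as pd

# Hypothetical mini-dictionary and titles, for illustration only
mini_dictionary = {"Mr": "Mr", "Mlle": "Miss", "Col": "Officer"}
titles = pd.Series(["Mr", "Mlle", "Col", "Dona"])

title_type = titles.map(mini_dictionary)
print(title_type)

# Titles that were not found in the dictionary end up as NaN
unmapped = titles[title_type.isna()].unique()
print(unmapped)
```

Checking for NaN in the mapped column is a cheap way to confirm that the dictionary covers every title in the data.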

In [None]:
fig, axs = plt.subplots(1,1,figsize=(4, 3)) # a single panel this time
sns.countplot( x='TitleType',hue='Survived', data=pax, stat='percent', ax=axs )

Is it significant?  If so, is it confounded?

In [None]:
title_suriv_contingency_table = pd.crosstab(pax["TitleType"], pax["Survived"])
print(title_suriv_contingency_table)

chi2, p, dof, expected = chi2_contingency(title_suriv_contingency_table)

print("Expected frequencies:")
print(pd.DataFrame(expected, index=title_suriv_contingency_table.index, columns=title_suriv_contingency_table.columns))

print(f"P-value: {p}")
print(f"Is it significant at 0.05? {p<0.05}")


### Family sizes?

Up to now we have also ignored the variables SibSp (number of siblings/spouses) and Parch (number of parents/children) related to a given passenger.  We could assume that, as large families stay grouped together, they are more likely to get rescued than people travelling/floating alone.

We can then create a variable 'family size' and a categorical variable identifying whether the passenger was travelling on their own, or as part of a small or larger family.

In [34]:
pax['FamilySize'] = pax['SibSp']+pax['Parch']+1 # why +1?

In [35]:
def getFamilyType(famsize):
    return 'single' if famsize == 1 else ('smallFamily' if famsize < 5 else 'largeFamily')

pax['FamilyType'] = pax['FamilySize'].apply(getFamilyType)
pax['FamilyType'] = pax['FamilyType'].astype('category')
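
The same binning can also be sketched with `pd.cut`, which assigns each number to an interval. This is an alternative to the function above, not a replacement; the bin edges below are chosen to reproduce the same thresholds (1, then 2 to 4, then 5 or more) on a few made-up family sizes.

```python
import numpy as np
import pandas as pd

family_size = pd.Series([1, 2, 4, 5, 8])

# Intervals are (0, 1], (1, 4] and (4, inf]: the right edge is included by default
family_type = pd.cut(
    family_size,
    bins=[0, 1, 4, np.inf],
    labels=["single", "smallFamily", "largeFamily"],
)
print(family_type.tolist())
```

`pd.cut` returns a categorical Series directly, so no extra `astype('category')` is needed.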

In [None]:
fig, axs = plt.subplots(1,2,figsize=(10, 3)) # plotting multiple panels
sns.countplot( x='FamilySize',hue='Survived', data=pax, stat='percent', ax=axs[0] )
sns.countplot( x='FamilyType',hue='Survived', data=pax, stat='percent', ax=axs[1] )

Was the assumption that larger families are more likely to survive supported by the data?  You could also do some testing here!

In [None]:
famtype_suriv_contingency_table = pd.crosstab(pax["FamilyType"], pax["Survived"])
print(famtype_suriv_contingency_table)

chi2, p, dof, expected = chi2_contingency(famtype_suriv_contingency_table)

print("Expected frequencies:")
print(pd.DataFrame(expected, index=famtype_suriv_contingency_table.index, columns=famtype_suriv_contingency_table.columns))

print(f"P-value: {p}")
print(f"Is it significant at 0.05? {p<0.05}")


## Dummy Variables

Some of the approaches we will test later on do not work well with categories. So, we need to create 'dummy' variables, whose content is just True/False depending on the value of the category.  For example, the variable FamilyType would create 3 dummy variables: FamilyType_single, FamilyType_smallFamily and FamilyType_largeFamily.  

The function 'get_dummies' from pandas would do that for us:

In [None]:
family_dummies = pd.get_dummies(pax['FamilyType'], prefix='FamilyType')
print(family_dummies)
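
A small sketch of how the dummy columns are named, and of the optional `drop_first` argument of `get_dummies`, on a made-up Series. Whether to drop one column depends on the model used later, so treat this as an illustration of the option rather than a recommendation.

```python
import pandas as pd

family_type = pd.Series(
    ["single", "smallFamily", "largeFamily", "single"], dtype="category"
)

# One column per category, named prefix_category
full = pd.get_dummies(family_type, prefix="FamilyType")

# drop_first=True removes the first category's column; a row with no
# dummy set then stands for that dropped category
reduced = pd.get_dummies(family_type, prefix="FamilyType", drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```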

We will do this to all important categorical variables in the dataset for the classification practical next week.