In [1]:
import warnings
warnings.filterwarnings('ignore')

# Data Preparation 

# I. Data in Machine Learning

Good predictive power comes first from the data, not from the model.

What are the different types of data?

## Structured vs Unstructured data

- Structured data comes in the form of a table: e.g. a database, an Excel sheet, a CSV file...
- Unstructured data does not fit in a table: sound, language, video...

<center><img src="images/Structured-v-Unstructured-Data.png" width="600"></center>

## Qualitative vs Quantitative data

Quantitative data is **ordered**, e.g.:
- 100 € is greater than 10 €
- 1.8 m is taller than 1.6 m
- 18 y.o. is younger than 80 y.o.
- ...

<center><img src="images/quantidata.png" width="350"></center>

Qualitative data has **no intrinsic order**, e.g.:
- Blue is not better than red
- A dog is not greater than a cat
- A bathroom is not better than a kitchen
- ...

<center><img src="images/qualidata.png" width="350"></center>

## Continuous vs discrete data

Data can be continuous, e.g.:
- A weight
- A volume
- ...

Data can be discrete, e.g.:
- A sport score
- A color
- ...

> N.B.: discrete does not necessarily mean categorical!
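
To make the note above concrete, here is a small sketch on a made-up table (the column names and values are invented for illustration). It shows a discrete variable that is still quantitative, next to one that is qualitative:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "weight_kg": [61.5, 72.3, 80.1],    # continuous, quantitative
    "goals_scored": [0, 2, 1],          # discrete, quantitative (ordered counts)
    "color": ["blue", "red", "blue"],   # discrete, qualitative (no order)
})

# Numeric columns are the quantitative ones here; the string column is qualitative
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Quantitative:", numeric_cols)
print("Qualitative:", categorical_cols)
```

Note that the dtype is only a hint: a column of integer codes can still be qualitative, so the final call depends on what the values mean.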

# II. Data Preparation

Let's work again with a true classic: the Titanic dataset.

Suppose we want to fit a binary classifier to the Titanic data in order to predict who survives.

For the sake of the example, we define a function `fit_and_eval_model` that trains a Logistic Regression model and evaluates its accuracy on a held-out test set.

We will then work on the data step by step and try to iteratively improve our model's performance.

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

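def fit_and_eval_model(data, features_to_use):
    """Train a logistic regression on the given features and print its test accuracy."""
    # A small C means strong regularization
    lr = LogisticRegression(C=0.01)
    # Hold out 20% of the rows for evaluation, with a fixed seed for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(data[features_to_use],
                                                        data["Survived"],
                                                        test_size=0.2,
                                                        random_state=0)

    lr.fit(X_train, y_train)
    print('Accuracy:', lr.score(X_test, y_test))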

## II.1. Data Cleaning

## Missing values

The first thing to do is to have a look and check the data:
- What are the available features?
- Are there missing values?
- Are there duplicates?
- Are there outliers?
- If so, how should we handle them?

First load the dataset, and have a look at the features:

In [3]:
import pandas as pd
# We load the data
df = pd.read_csv('titanic.csv', index_col=0)
print("Dimensions:", df.shape)
df.head()

Dimensions: (891, 11)


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Here we have 891 samples, and 11 columns: 10 features + 1 label.

We can first check for missing values:

In [4]:
# Check for missing values
df.isna().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

We have missing values:
- 177 missing ages
- 687 missing cabins
- 2 missing embarked

How would you handle that?

We will apply a different policy to each case:
- 177 missing ages: replace them with the mean age
- 687 missing cabins: remove the feature completely
- 2 missing embarked: just remove those two rows

In [5]:
# Cleaning Cabin - removing column
df = df.drop(['Cabin'], axis=1)

# Cleaning Embarked - removing rows with missing values
df.dropna(subset=["Embarked"], inplace=True)

# Cleaning Age - replacing by mean value
df["Age"] = df["Age"].fillna(df["Age"].mean())
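
As a complement to the mean replacement above, here is a minimal sketch using scikit-learn's `SimpleImputer`. It is a variation, not the approach used in this notebook: the mean is learned on the training split only and then reused on the test split, so that no information from the test rows influences the fill value. The toy ages are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy ages with gaps, invented for illustration only
ages = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0, 2.0, 27.0]})

train, test = train_test_split(ages, test_size=0.25, random_state=0)

# Learn the mean on the training rows only, then apply it to both splits
imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train)
test_filled = imputer.transform(test)

print("Mean learned from the training split:", imputer.statistics_[0])
```

Passing `strategy="median"` instead is a common choice when the feature contains extreme values.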

## Duplicates

It may happen that one or several lines are duplicated. 

In this case, it may be necessary to remove them.

In [6]:
df.duplicated().sum()

0

Here, we have no duplicates.

Otherwise, we could have removed them easily with `df.drop_duplicates()`.
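
Since the Titanic data has no duplicates, here is a short sketch on a made-up table (names and ages are invented) showing what detection and removal look like when duplicates are present:

```python
import pandas as pd

# Toy data with one repeated row, invented for illustration only
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Cid"],
    "Age": [30, 41, 30, 25],
})

print("Duplicated rows:", toy.duplicated().sum())

# drop_duplicates returns a new DataFrame; the first occurrence is kept by default
deduplicated = toy.drop_duplicates()
print(deduplicated)
```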

## Outliers

Sometimes a dataset contains outliers: values that are very different from the rest, sometimes even impossible.

This can be due to a measurement error (e.g. a broken sensor) or a human typo, for example.

In some cases it is necessary to remove the outliers; in others you have to keep them, depending on their meaning.
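
One possible way to flag candidate outliers is the interquartile range (IQR) rule, sketched below on made-up fares. This technique is not part of the original notebook, and the 1.5 multiplier is only a common convention. Flagged values still need a human decision about whether they are errors or genuine extremes.

```python
import pandas as pd

# Toy fares with one suspicious value, invented for illustration only
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 31.0, 53.1, 71.28, 5000.0])

q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1

# The 1.5 multiplier is a common convention, not a rule from this notebook
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (fares < lower) | (fares > upper)

print(fares[is_outlier])
```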

## Baseline performance

Let's now train our very first model on the cleaned dataset.

What features would you use?

In [7]:
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


Let's use the following features:
- Pclass
- Age
- Fare
- SibSp
- Parch

In [8]:
fit_and_eval_model(df, ["Pclass", "Age", "Fare", "SibSp", "Parch"])

Accuracy: 0.6573033707865169


We have an accuracy of 65.7 %, which is better than random guessing.
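
A useful reference point for any accuracy score is a model that always predicts the most frequent class. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels; the 60/40 split is an assumption for illustration, not the Titanic class balance.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels: 6 negatives and 4 positives, invented for illustration only
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
X = np.zeros((len(y), 1))  # the dummy model ignores the features

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_accuracy = baseline.score(X, y)

print("Majority-class accuracy:", baseline_accuracy)
```

On imbalanced data this baseline can score well above 50 %, so it is a stricter comparison than a coin flip.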

## II.2. Categorical Data Transformation

## Categorical data: Binary variables

How can we incorporate the `Sex` (string `male` or `female`) into the features X of our model?

> Reminder: a model works only with numbers, not characters!

In [9]:
df['Sex'].head()

PassengerId
1      male
2    female
3    female
4    female
5      male
Name: Sex, dtype: object

We just need to convert it to a binary variable, e.g.:
- `0` for `male`
- `1` for `female`

> Or the opposite, the model does not care

In [10]:
# Using `.loc` on the DataFrame itself (row mask, column name) avoids chained assignment
df.loc[df['Sex'] == 'male', 'Sex'] = 0
df.loc[df['Sex'] == 'female', 'Sex'] = 1
df['Sex'].head()

PassengerId
1    0
2    1
3    1
4    1
5    0
Name: Sex, dtype: object

We have now replaced the strings `male` and `female` in the `Sex` column with `0` and `1`.
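
An alternative sketch for the same encoding uses `Series.map` with a dictionary. Unlike the output above, where the column keeps the `object` dtype, this produces an integer column directly. The toy values are invented for illustration.

```python
import pandas as pd

# Toy column, invented for illustration only
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# map looks up each value in the dictionary and yields an integer column
sex_encoded = sex.map({"male": 0, "female": 1})

print(sex_encoded.dtype)
print(sex_encoded.tolist())
```

Any value missing from the dictionary becomes `NaN`, which makes unexpected categories easy to spot.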

Let's see if this improves our model's performance:

In [11]:
# Adding the encoded column `Sex` in the features
fit_and_eval_model(df, ["Pclass", "Age", "Fare", "SibSp", "Parch", "Sex"])

Accuracy: 0.702247191011236


Accuracy improved by about 4.5 percentage points, from 65.7 % to 70.2 %.

## Categorical data: polytomous/multinomial variables

We now need to add the feature `Embarked` which corresponds to the harbour where the passenger embarked:
- `C` = Cherbourg
- `Q` = Queenstown
- `S` = Southampton

How can we transform such information into numbers?

In [12]:
df['Embarked'].tail()

PassengerId
887    S
888    S
889    S
890    C
891    Q
Name: Embarked, dtype: object

A naive approach would be:
- `C` = `0`
- `Q` = `1`
- `S` = `2`

Is this okay?

Such a mapping could "mean" to the model that embarking at S has twice the effect of embarking at Q.

Is this true?

Any other solution?

We will use what is called **dummy variables** or **one-hot encoding**: we simply add binary columns saying whether the passenger embarked at the corresponding harbour (`1`) or not (`0`).

Basically, we will perform the following transformation:
- `C` = `1, 0, 0`
- `Q` = `0, 1, 0`
- `S` = `0, 0, 1`

In [13]:
pd.get_dummies(df['Embarked']).tail()

PassengerId,C,Q,S
887,0,0,1
888,0,0,1
889,0,0,1
890,1,0,0
891,0,1,0


Is it really optimal? Isn't there any redundant information?

We can actually drop one column without losing any information: if it is neither `S` nor `Q`, it has to be `C`!

In [14]:
dummies = pd.get_dummies(df['Embarked'], drop_first=True)
dummies.tail()

PassengerId,Q,S
887,0,1
888,0,1
889,0,1
890,0,0
891,1,0
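
The same encoding can also be sketched with scikit-learn's `OneHotEncoder`, which remembers the categories it saw during fitting and can therefore be reused on new data. This is an alternative to `pd.get_dummies`, not what the notebook uses; the toy ports below are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy ports of embarkation, invented for illustration only
ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# drop="first" removes the first category (here C), like drop_first=True in pandas
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(ports).toarray()  # the result is sparse by default

print(encoder.categories_[0])  # categories in sorted order
print(encoded)                 # columns correspond to Q and S
```

Note that with pandas 2.0 or later, `pd.get_dummies` returns boolean columns by default; passing `dtype=int` gives the `0`/`1` output shown above.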


We have now encoded the categorical features `Sex` and `Embarked`.

Let's see if this improves our model's performance:

In [15]:
# First build a new dataset by concatenating the one-hot encoded columns
data = pd.concat([df, dummies], axis=1)

# Then train and evaluate our model
fit_and_eval_model(data, ["Pclass", "Age", "Fare", "SibSp", "Parch", "Sex", "Q", "S"])

Accuracy: 0.6853932584269663


The accuracy did not increase this time; it actually dropped from 70.2 % to 68.5 %. It seems that the `Embarked` feature did not bring much predictive power.

It happens, and this is part of the job: it is hard to guess in advance which features will improve performance.

## II.3. Quantitative data transformation

Quantitative data also has to be prepared.

Let's have a look at our quantitative data:

In [16]:
quantitative = ['Pclass', 'Age', 'Fare', 'SibSp', 'Parch']
df[quantitative].describe()

,Pclass,Age,Fare,SibSp,Parch
count,889.0,889.0,889.0,889.0,889.0
mean,2.311586,29.642093,32.096681,0.524184,0.382452
std,0.8347,12.968346,49.697504,1.103705,0.806761
min,1.0,0.42,0.0,0.0,0.0
25%,2.0,22.0,7.8958,0.0,0.0
50%,3.0,29.642093,14.4542,0.0,0.0
75%,3.0,35.0,31.0,1.0,0.0
max,3.0,80.0,512.3292,8.0,6.0


Is `Fare` more important than `Age`?

Is `Age` more important than `SibSp`?

Probably not, but this is what our model might 'think', because the features live on very different scales: `Fare` goes up to 512 while `SibSp` stops at 8.

Thus, we have to rescale those features to put them on the same level.

How could we do such a thing?

## Standard scaling

One way is standard scaling:
- Center all features on 0
- Set the standard deviation to 1

In [17]:
df[quantitative] = (df[quantitative] - df[quantitative].mean()) / df[quantitative].std()
df[quantitative].describe()

,Pclass,Age,Fare,SibSp,Parch
count,889.0,889.0,889.0,889.0,889.0
mean,-1.878263e-16,-2.877338e-16,1.358743e-16,-7.992607e-18,-4.795564e-17
std,1.0,1.0,1.0,1.0,1.0
min,-1.571327,-2.25334,-0.6458409,-0.4749317,-0.474059
25%,-0.3732912,-0.5892881,-0.4869637,-0.4749317,-0.474059
50%,0.8247444,-5.479054e-16,-0.3549973,-0.4749317,-0.474059
75%,0.8247444,0.4131527,-0.02206712,0.4311076,-0.474059
max,0.8247444,3.88314,9.663111,6.773383,6.96309
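
The same transformation can be sketched with scikit-learn's `StandardScaler`. This is a variation on the cell above, not the notebook's method: the mean and standard deviation are learned on the training rows only and then reused on the test rows. The toy `[Age, Fare]` values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy [Age, Fare] rows, invented for illustration only
X_train = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.93], [35.0, 53.10]])
X_test = np.array([[30.0, 8.05]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std on the training rows
X_test_scaled = scaler.transform(X_test)        # reuse them on the test rows

print(X_train_scaled.mean(axis=0))  # close to 0 for each column
print(X_train_scaled.std(axis=0))   # close to 1 for each column
```

`StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `.std()` uses the sample standard deviation (`ddof=1`) by default, so the two approaches give slightly different numbers.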


Now let's try again to train our model with those rescaled data.

In [18]:
# First build a new dataset by concatenating the one-hot encoded columns
data = pd.concat([df, dummies], axis=1)

# Then train and evaluate our model
fit_and_eval_model(data, ["Pclass", "Age", "Fare", "SibSp", "Parch", "Sex", "Q", "S"])

Accuracy: 0.702247191011236


OK, we climbed back to 70.2 % accuracy. Feature rescaling does not always improve performance, but it is good practice for models that are sensitive to feature scale, such as the regularized logistic regression used here.

## More rescaling

Other rescaling strategies exist:
- Standard scaler (the one we just used)
- Min Max Scaler
- Robust Scaler
- ...

Have a look [here](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html) for more information about scaling strategies.
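
As a rough sketch of how two of the strategies listed above behave, the example below applies scikit-learn's `MinMaxScaler` and `RobustScaler` to made-up fares containing one extreme value. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Toy fares with one extreme value, invented for illustration only
fares = np.array([[0.0], [8.0], [14.0], [31.0], [512.0]])

# Min-max scaling maps the smallest value to 0 and the largest to 1
min_max = MinMaxScaler().fit_transform(fares)

# Robust scaling subtracts the median and divides by the interquartile range
robust = RobustScaler().fit_transform(fares)

print(min_max.ravel())
print(robust.ravel())
```

With min-max scaling, the extreme fare squeezes all the other values close to 0, while robust scaling keeps the bulk of the data spread out. This is one reason to prefer the latter when outliers are present.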

## II.4. Feature Engineering

A powerful way of improving a model is through **feature engineering**.

Feature engineering means creating new features, ideally with strong predictive power.

What would be a nice feature to create in our dataset?

In [19]:
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
1,0,0.824744,"Braund, Mr. Owen Harris",0,-0.589288,0.431108,-0.474059,A/5 21171,-0.499958,S
2,1,-1.571327,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,0.644485,0.431108,-0.474059,PC 17599,0.788503,C
3,1,0.824744,"Heikkinen, Miss. Laina",1,-0.280845,-0.474932,-0.474059,STON/O2. 3101282,-0.486376,S
4,1,-1.571327,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,0.413153,0.431108,-0.474059,113803,0.422623,S
5,0,0.824744,"Allen, Mr. William Henry",0,0.413153,-0.474932,-0.474059,373450,-0.483861,S


I propose here to create a feature `Title` based on the `Name` column, keeping only 5 titles:
- Mr
- Mrs
- Miss
- Master
- Others

Then we will one hot encode it.

In [20]:
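# We keep only the title here: the second space-separated token of the name
# (this assumes a one-word surname; other cases fall into "Others" in the next step)
df['Title'] = df['Name'].str.split(' ').str[1]
df['Title'].tail()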

PassengerId
887     Rev.
888    Miss.
889    Miss.
890      Mr.
891      Mr.
Name: Title, dtype: object

In [21]:
# Then we replace all other titles by Others
df.loc[~df['Title'].isin(['Mr.', 'Mrs.', 'Miss.', 'Master.']), "Title"] = "Others"
df['Title'].tail()

PassengerId
887    Others
888     Miss.
889     Miss.
890       Mr.
891       Mr.
Name: Title, dtype: object
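
The split on spaces relies on the surname being a single word. A possibly more robust sketch uses a regular expression that captures whatever sits between the comma and the first period. The pattern and the third name below are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# The first two names come from the dataset preview; the third is invented
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "De Example, Mrs. Jane",
])

# Capture the text after the comma, up to and including the first period
titles = names.str.extract(r",\s*([^.]+\.)", expand=False)

# For comparison: the space-based split used above
split_based = names.str.split(" ").str[1]

print(titles.tolist())
print(split_based.tolist())
```

With a two-word surname, the space-based split returns part of the surname instead of the title, while the regular expression still finds `Mrs.`.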

In [22]:
dummies = pd.get_dummies(df[['Embarked', 'Title']], drop_first=True)
dummies.tail()

PassengerId,Embarked_Q,Embarked_S,Title_Miss.,Title_Mr.,Title_Mrs.,Title_Others
887,0,1,0,0,0,1
888,0,1,1,0,0,0
889,0,1,1,0,0,0
890,0,0,0,1,0,0
891,1,0,0,1,0,0


Let's try again to evaluate our model performances with this new feature:

In [23]:
# First build a new dataset by concatenating the one-hot encoded columns
data = pd.concat([df, dummies], axis=1)

# Then train and evaluate our model
fit_and_eval_model(data, ["Pclass", "Age", "Fare", "SibSp", "Parch", "Sex", "Embarked_Q", "Embarked_S", 
                          "Title_Miss.", "Title_Mr.", "Title_Mrs.", "Title_Others"])

Accuracy: 0.7247191011235955


The accuracy climbed a bit this time, from 70.2 % to 72.5 %.

Many more features could be created from this dataset, and might improve performance further.
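
As one hypothetical example of such a feature, the sketch below derives a family size and an "alone" flag from `SibSp` and `Parch`. Both `FamilySize` and `IsAlone` are names invented here, and whether they help on this dataset is untested. The toy values are made up.

```python
import pandas as pd

# Toy counts of siblings/spouses and parents/children, invented for illustration only
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Hypothetical features, not part of the original notebook
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1    # +1 counts the passenger
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)  # 1 if travelling alone

print(toy)
```

Such features would have to be computed from the raw counts, i.e. before `SibSp` and `Parch` are standardized.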

## II.5. Saving the data

We sometimes need or want to save the cleaned and prepared data.

There are many ways to do so in Python. One of the most popular is the `pickle` library, which saves Python objects in a binary format:

In [24]:
import pickle

features = ["Pclass", "Age", "Fare", "SibSp", "Parch", "Sex", "Embarked_Q", "Embarked_S", 
            "Title_Miss.", "Title_Mr.", "Title_Mrs.", "Title_Others"]

X = data[features]
y = data['Survived']

with open('titanic_cleaned.pkl', 'wb') as file:
    pickle.dump((X, y), file)
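
Reading the data back works the same way with `pickle.load`. The sketch below is self-contained: it writes a small made-up `(X, y)` pair to a temporary file (the file name and values are invented) and checks that the round trip preserves the data.

```python
import os
import pickle
import tempfile

import pandas as pd

# Small stand-ins for the prepared features and labels, invented for illustration only
X_demo = pd.DataFrame({"Age": [0.1, -0.5], "Sex": [0, 1]})
y_demo = pd.Series([0, 1], name="Survived")

# Write to a temporary directory so that nothing is left in the working folder
path = os.path.join(tempfile.mkdtemp(), "demo.pkl")
with open(path, "wb") as file:
    pickle.dump((X_demo, y_demo), file)

# Load the tuple back, in the same order it was saved
with open(path, "rb") as file:
    X_loaded, y_loaded = pickle.load(file)

print(X_loaded.equals(X_demo), y_loaded.equals(y_demo))
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.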

## Data Preparation conclusion

There are several points to remember for any Machine Learning project:
- Data has to be cleaned: missing values, duplicates, outliers
- Qualitative data has to be encoded: one-hot encoding is a widely used technique
- Quantitative data usually has to be rescaled (e.g. with standard scaling)
- Feature engineering is the creation of new features and can improve results
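
The steps listed above can also be chained into a single scikit-learn `Pipeline`, so that imputation, scaling and encoding are learned together with the model. The sketch below is one possible arrangement on made-up data, not the workflow used in this notebook; the step names and toy values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy passengers, invented for illustration only
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, 27.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 8.05],
    "Embarked": ["S", "C", "S", "S", "S", "Q", "S", "C"],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
})

# Quantitative columns: fill missing values with the mean, then standardize
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Apply the numeric steps to Age and Fare, and one-hot encode Embarked
preprocess = ColumnTransformer([
    ("num", numeric, ["Age", "Fare"]),
    ("cat", OneHotEncoder(drop="first"), ["Embarked"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])

features = ["Age", "Fare", "Embarked"]
model.fit(toy[features], toy["Survived"])
predictions = model.predict(toy[features])

print(predictions)
```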

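In [None]:
# Reload the raw data (pandas equivalent of R's read.csv)
df = pd.read_csv('titanic.csv')
df

In [None]:
# Missing values per column
df.isna().sum()

In [None]:
# Empty strings per column
(df == "").sum()

In [None]:
# Number of passengers per port of embarkation, most frequent first
df['Embarked'].value_counts()

In [None]:
# The data without the Embarked column
df.drop(columns=['Embarked'])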