- Importance of data in any model
- Types of data: quantitative and qualitative, continuous and discrete, structured and unstructured...
- Preparation of quantitative: MinMax, standard, centering...
- Preparation of qualitative: one hot encoding, hashing...

## Data Preparation 🧹

___

<img src="https://drive.google.com/uc?export=view&id=1LtcapL_DimAeUWGoxRu_1xaqNhY7bGTC">

___

Brace yourself! It's relatively easy to fit a Machine Learning model and retrieve predictions on a *clean* dataset. But the struggle is real when you deal with real-life datasets.

One of the most **important** and **time-consuming** jobs of data analysts/scientists is to **prepare and clean the data** beforehand.

> 📌 Remember: **the quality of a Machine Learning model depends on the quality *(and quantity)* of the data that you use to feed it**.

# 0. Intro

Let's work again with our now beloved Titanic dataset 🚢

In [6]:
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns

data_path = os.path.join("..", "..", "..", "..", "data", "titanic.csv")
data = pd.read_csv(data_path)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Suppose we want to fit a classifier to our Titanic data in order to predict who is going to survive. 

Let's define a function `fit_lr_with_selected_features` that fits a Logistic Regression on a list of selected features and prints the **score** of the model (its accuracy on a held-out test set).

We will try to iteratively improve our model's performance by **carefully choosing the features we use**.

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

def fit_lr_with_selected_features(data, features_to_use):
    lr = LogisticRegression()
    X_train, X_test, y_train, y_test = train_test_split(data[features_to_use],
                                                        data["Survived"],
                                                        test_size=0.2, 
                                                        random_state=0)

    lr.fit(X_train, y_train)
    lr_score = lr.score(X_test, y_test)
    print("Score={}".format(lr_score))
    return lr

___

# I. Data Cleaning

## I.1. Missing values

We have already seen that we must first:
- drop or replace missing/NaN values
- remove duplicated lines
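
Before choosing between dropping and replacing, it helps to count the missing values per column. The sketch below uses a small made-up frame (the `toy` name and its values are assumptions, not the Titanic file itself):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Number of missing values per column
missing_counts = toy.isna().sum()
# Share of missing values per column (between 0 and 1)
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```

A column that is mostly empty (like `Cabin` here) is a candidate for removal, while a column with only a few gaps can be filled or have its incomplete rows dropped.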

In [8]:
# Cleaning Cabin - removing column
data_cleaned = data.drop(['Cabin'], axis=1)

# Cleaning Embarked - removing rows with missing values
data_cleaned.dropna(subset=["Embarked"], inplace=True)

# Cleaning Age - replacing by mean value
data_cleaned["Age"] = data["Age"].fillna(data["Age"].mean())

## I.2. Outliers

Sometimes, it makes sense to remove some data points that are very far from your data distribution. 

This can be due to measurement error (broken sensor? bad manipulation? etc.) for example. 

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1_Ek5PqlAQIeX8Kk0GxNBS85PSaZXJ1SG" width="300">
</p>

> ⚠️ **Warning**: Of course, do not remove every data point that you suspect to be an outlier just because "it's more convenient" and it gives better results. This is cheating. 🙃
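
One common way to *flag* candidate outliers (not necessarily the one used in this course) is the 1.5 × IQR rule. The sketch below applies it to a few made-up fare values; the `fares` series and the 1.5 factor are assumptions chosen for illustration:

```python
import pandas as pd

# Made-up fare values, with one suspiciously large entry
fares = pd.Series([7.25, 8.05, 7.92, 13.0, 26.0, 30.0, 53.1, 512.33], name="Fare")

# Interquartile range: spread of the middle 50% of the values
q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Anything further than 1.5 * IQR from the quartiles is flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (fares < lower) | (fares > upper)

flagged = fares[is_outlier]
print(flagged)
```

Flagged points should then be inspected one by one: a very high value can be a genuine observation rather than a measurement error.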

In [9]:
# We can already fit a model based on some of the existing features
# Quiz: which features can you use?
# SibSp = # of siblings / spouses aboard the Titanic
# Parch = # of parents / children aboard the Titanic

data_cleaned.head(n=2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C


In [10]:
fit_lr_with_selected_features(data_cleaned, ["Pclass", "Age", "Fare", "SibSp", "Parch"])
print("We already have 67.4% accuracy using just a few features.\n\
It is better than random!")

Score=0.6741573033707865
We already have 67.4% accuracy using just a few features.
It is better than random!




## I.3. Duplicated data

Sometimes, a dataset contains duplicated rows. You usually want to remove them, as they can distort the performance of your model.

The method `.duplicated()` flags the duplicated rows, and `.drop_duplicates()` removes them.

In [5]:
print(data_cleaned.duplicated().sum())

0


Well, zero duplicates, so nothing to do! But always nice to check!
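
For a dataset that does contain duplicates, the two methods could be used as in this minimal sketch (the `toy` frame is made up for illustration):

```python
import pandas as pd

# Made-up frame where the third row repeats the first one
toy = pd.DataFrame({
    "Name": ["Allen", "Braund", "Allen"],
    "Age": [35.0, 22.0, 35.0],
})

# True for every row that repeats an earlier row
dup_mask = toy.duplicated()
print(dup_mask.sum())

# Keep the first occurrence of each row and rebuild a clean index
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```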

## I.4. Categorical data: Binary variables

How can we incorporate `Sex` (a string, `male` or `female`) into the features X of our model?

In [27]:
# Adding `Sex` in the features will not work as it is a categorical data
fit_lr_with_selected_features(data_cleaned, ["Pclass", "Age", "Fare", "SibSp", "Parch", "Sex"])

ValueError: could not convert string to float: 'male'

#### Dealing with binary variables

Our model needs numbers as inputs in order to do its computations, so we have to encode the strings as numbers. The approaches below are alternatives: apply only one of them.

##### Different ways to encode the `Sex` column

- **Using `.loc`**

In [13]:
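# Using `.loc` with a row mask and a column label (avoids chained assignment)
data_cleaned.loc[data_cleaned['Sex'] == 'male', 'Sex'] = 1
data_cleaned.loc[data_cleaned['Sex'] == 'female', 'Sex'] = 0
data_cleaned['Sex'] = data_cleaned['Sex'].astype(int)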

- **Using `.apply()`**

In [None]:
def encode_sex(x):
    if x == 'male':
        return 1
    else:
        return 0

In [None]:
data_cleaned['Sex'] = data_cleaned['Sex'].apply(encode_sex)

- **Using `.map()`**: we can map the binary variables to numbers: 0 for `female` and 1 for `male` for example (or the other way around).

In [None]:
genders = sorted(data_cleaned['Sex'].unique())
genders

In [None]:
genders_mapping = dict(zip(genders, range(0, len(genders))))
genders_mapping

In [None]:
data_cleaned["Sex"] = data_cleaned['Sex'].map(genders_mapping).astype(int)

We have now replaced the strings "female" and "male" in the `Sex` column with `0` and `1`.

In [31]:
data_cleaned.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,S


**Now we can use the newly encoded column `Sex` as a feature in our model !**

In [32]:
# Adding the encoded column `Sex` in the features
fit_lr_with_selected_features(data_cleaned, ["Pclass", "Age", "Fare", "SibSp", "Parch", "Sex"])
print("Yeah! 🎉 By adding the gender in the proper way, we went from 67% to 71% !")

Score=0.7134831460674157
Yeah! 🎉 By adding the gender in the proper way, we went from 67% to 71% !


## I.5. Categorical data: dealing with polytomous variables with `pd.get_dummies()`

Now suppose we want to incorporate the variable `Embarked` which corresponds to the harbour where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton).

This time, we cannot simply map C to 0, Q to 1 and S to 2: that would impose an arbitrary order on the harbours and imply, for a linear model, that embarking at S has twice the effect of embarking at Q.

Such a numeric scale is fine for quantitative variables (`Age`, `Fare`, etc.), where the distance between two values is meaningful.

Instead, we create what are called **dummy variables** (also known as one-hot encoding): we add one binary column per harbour, saying whether yes (1) or no (0) the passenger embarked there.

In [14]:
# Transform Embarked from a string to dummy variables
data_cleaned = pd.concat([data_cleaned,
                          pd.get_dummies(data_cleaned['Embarked'],
                                         prefix='Embarked_Val')], axis=1)
data_cleaned.head(n=2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Embarked_Val_C,Embarked_Val_Q,Embarked_Val_S
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,S,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C,1,0,0


In [36]:
# Adding `Embarked_Val_C`, `Embarked_Val_Q`, `Embarked_Val_S` in the features
fit_lr_with_selected_features(data_cleaned, ["Pclass", "Age", "Fare", "SibSp", "Parch", "Sex",
                                             "Embarked_Val_C", "Embarked_Val_Q", "Embarked_Val_S"])
print("Doh! 😓 This time the accuracy went down. We did nothing wrong but it can happen and \
we will see later why.")

Score=0.7078651685393258
Doh! 😓 This time the accuracy went down. We did nothing wrong but it can happen and we will see later why.


___

# II. Features

Let's now talk about features (the columns/variables that you are going to feed to the model so that it can learn patterns).

Most of the time, you are the one selecting the features, and you need to select them carefully (the more the better, but beware of traps...)!

## II.1. Features can be redundant

Remember the dummy variables created from `Embarked`?

Do you think we need the 3 columns (`Embarked_Val_C`, `Embarked_Val_Q`, `Embarked_Val_S`) in our classifier?

Nope! Knowing any two of them, we can deduce the value of the last one, so keeping all 3 is no more useful than keeping only 2.

In [37]:
# If we remove one of the `Embarked_Val`, accuracy should not go down
# (as a matter of fact, it even goes up !)
fit_lr_with_selected_features(data_cleaned, ["Pclass", "Age", "Fare", "SibSp", "Parch", "Sex",
                                             "Embarked_Val_C", "Embarked_Val_Q"])

Score=0.7134831460674157


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

> **Hint**: Even better, you can automatically remove one column when using `get_dummies()` with the option `drop_first=True`.
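
As a minimal sketch of that option, on a made-up `embarked` series (the variable names are assumptions), `drop_first=True` drops the first category and the dropped column can still be rebuilt from the remaining ones:

```python
import pandas as pd

# Made-up harbour values in the same format as the `Embarked` column
embarked = pd.Series(["S", "C", "Q", "S"], name="Embarked")

# One column per category (C, Q, S)
full = pd.get_dummies(embarked, prefix="Embarked_Val").astype(int)

# Same encoding without the first category (C)
reduced = pd.get_dummies(embarked, prefix="Embarked_Val", drop_first=True).astype(int)
print(reduced.columns.tolist())

# The dropped column carries no extra information:
# a passenger embarked at C exactly when both remaining columns are 0
rebuilt_c = 1 - reduced.sum(axis=1)
print((rebuilt_c == full["Embarked_Val_C"]).all())
```

The `.astype(int)` calls are only there to get 0/1 values regardless of the pandas version, since recent versions return booleans.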

## II.2. Beware of traps! ⚠️

What if you add `PassengerId` to the list of features to select?

In [38]:
fit_lr_with_selected_features(data_cleaned, ["PassengerId", "Pclass", "Age", "Fare", "SibSp", "Parch",
                                             "Sex", "Embarked_Val_C", "Embarked_Val_Q"])
print("Accuracy improved...\n⚠️ But we shouldn't use PassengerId, it is just an index of the table \
and should not be fed as info to the model")

Score=0.7247191011235955
Accuracy improved...
⚠️ But we shouldn't use PassengerId, it is just an index of the table and should not be fed as info to the model


## II.3. Feature Engineering

Now, suppose we incorporated all the features from the dataset that made sense.

How can we improve the performance of our model even further?

### II.3.A. Create new "smart" features

We can enrich our dataset with new "smart" features that we create.

In [39]:
# Quiz: what smart features can we create based on this data?
data_cleaned.tail(n=3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Embarked_Val_C,Embarked_Val_Q,Embarked_Val_S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",0,29.699118,1,2,W./C. 6607,23.45,S,0,0,1
889,890,1,1,"Behr, Mr. Karl Howell",1,26.0,0,0,111369,30.0,C,1,0,0
890,891,0,3,"Dooley, Mr. Patrick",1,32.0,0,0,370376,7.75,Q,0,1,0


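> **Reminder**:
- for element-wise operations on pandas and NumPy objects, the Python operator **`and`** is written with the **`&`** symbol.
- likewise, the Python operator **`or`** is written with the **`|`** symbol.
- wrap each comparison in parentheses, because `&` and `|` bind more tightly than `==`.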
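
A minimal sketch of both operators on a made-up frame (the `toy` name and values are assumptions); note the parentheses around each comparison:

```python
import pandas as pd

# Made-up family counts for three passengers
toy = pd.DataFrame({"SibSp": [1, 0, 0], "Parch": [0, 0, 2]})

# Element-wise "and": no sibling/spouse AND no parent/child aboard
alone = (toy["SibSp"] == 0) & (toy["Parch"] == 0)

# Element-wise "or": at least one sibling/spouse OR one parent/child aboard
with_family = (toy["SibSp"] > 0) | (toy["Parch"] > 0)

print(alone.tolist())
print(with_family.tolist())
```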

In [15]:
# Let's create a new binary feature, `isAlone`, equal to 1 if the passenger is alone on the ship
data_cleaned['isAlone'] = ((data_cleaned['SibSp'] == 0) & (data_cleaned['Parch'] == 0)).astype(int)
data_cleaned.head(n=3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Embarked_Val_C,Embarked_Val_Q,Embarked_Val_S,isAlone
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,S,0,0,1,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C,1,0,0,0
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,S,0,0,1,1


In [41]:
# What if we use this new feature? 
fit_lr_with_selected_features(data_cleaned, ["Pclass", "Age", "Fare", "SibSp", "Parch",
                                             "Sex", "Embarked_Val_C", "Embarked_Val_Q",
                                             "isAlone"])
print("Accuracy decreased again.. Do not worry too much yet.")

Score=0.7078651685393258
Accuracy decreased again.. Do not worry too much yet.


In [19]:
# What if we create a new feature, categorizing the `Title` of the passenger ("Mr.", "Mrs.", "Miss.", ...)?
data_cleaned['Title'] = data_cleaned['Name'].str.split(' ').str[1]

print(data_cleaned['Title'].value_counts())

Mr.             502
Miss.           178
Mrs.            120
Master.          40
Dr.               7
Rev.              6
y                 4
Impe,             3
Planke,           3
Gordon,           2
Col.              2
Major.            2
Mlle.             2
Shawah,           1
Messemaeker,      1
Mulder,           1
Jonkheer.         1
Cruyssen,         1
Capt.             1
Billiard,         1
Don.              1
Melkebeke,        1
the               1
Ms.               1
der               1
Carlo,            1
Velde,            1
Pelsmaeker,       1
Walle,            1
Steen,            1
Mme.              1
Name: Title, dtype: int64


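Let's group every title other than Mr., Mrs., Miss. or Master. under a single `Rare_Title` label and use that afterwards. Because the title was taken as the second space-separated token of the name, a few values above (`y`, `Impe,`, `the`...) are not titles at all; they end up in this bucket too.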

In [20]:
data_cleaned.loc[~data_cleaned['Title'].isin(['Mr.', 'Mrs.', 'Miss.', 'Master.']), "Title"] = "Rare_Title"

In [21]:
data_cleaned = pd.concat([data_cleaned, pd.get_dummies(data_cleaned['Title'],
                                                       prefix='Title_val')], axis=1)
data_cleaned.tail(n=3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,Embarked_Val_C,Embarked_Val_Q,Embarked_Val_S,isAlone,Title,Title_val_Master.,Title_val_Miss.,Title_val_Mr.,Title_val_Mrs.,Title_val_Rare_Title
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",0,29.699118,1,2,W./C. 6607,23.45,...,0,0,1,0,Miss.,0,1,0,0,0
889,890,1,1,"Behr, Mr. Karl Howell",1,26.0,0,0,111369,30.0,...,1,0,0,1,Mr.,0,0,1,0,0
890,891,0,3,"Dooley, Mr. Patrick",1,32.0,0,0,370376,7.75,...,0,1,0,1,Mr.,0,0,1,0,0


In [22]:
# What if we use this new feature? 
fit_lr_with_selected_features(data_cleaned, ["Pclass", "Age", "Fare", "SibSp", "Parch",
                                             "Sex", "Embarked_Val_C", "Embarked_Val_Q",
                                             "isAlone", "Title_val_Mr.", "Title_val_Mrs.",
                                             "Title_val_Rare_Title", "Title_val_Miss."])
print("Accuracy increased this time!")

Score=0.7471910112359551
Accuracy increased this time!
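
Splitting the name on spaces picks up part of the surname whenever the surname has several words. A possible alternative, sketched here on a few names written in the dataset's `Surname, Title. First names` format, is to extract the text between the comma and the first period with a regular expression. This is an illustrative variant, not the approach used in the cells above, and it returns titles without the trailing dot:

```python
import pandas as pd

# A few names in the "Surname, Title. First names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)",
])

# Naive approach: second space-separated token
naive = names.str.split(" ").str[1]
print(naive.tolist())

# Regex approach: text between the comma and the first period
titles = names.str.extract(r",\s*([^.,]+)\.", expand=False).str.strip()
print(titles.tolist())
```

With this variant, the list of common titles used for the `Rare_Title` grouping would need to be written without dots (`Mr`, `Mrs`, `Miss`, `Master`).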


### II.3.B. Enrich data with new features

Sometimes, it can be really clever to enrich your set of features with **external** data.

For example, with open datasets, API calls, etc.
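
As a minimal sketch of such an enrichment, an external lookup table can be joined onto the passengers with `merge`. The `ports` table and its `PortCountry` column are hypothetical and built by hand for illustration; in practice they would come from an open dataset or an API:

```python
import pandas as pd

# Made-up extract of the passenger table
passengers = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Embarked": ["S", "C", "Q"],
})

# Hypothetical external table: country of each harbour
ports = pd.DataFrame({
    "Embarked": ["C", "Q", "S"],
    "PortCountry": ["France", "Ireland", "England"],
})

# Left join: keep every passenger, add the matching external columns
enriched = passengers.merge(ports, on="Embarked", how="left")
print(enriched)
```

A left join keeps all the original rows, so passengers without a match in the external table simply get missing values that must then be handled like any other NaN.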

___

# III. Data scaling

What do you notice when you compare the variable `Sex` with the variable `Fare`?

In [23]:
features_to_use = ["Pclass", "Age", "Fare", "SibSp", "Parch",
                   "Sex", "Embarked_Val_C", "Embarked_Val_Q",
                   "isAlone", "Title_val_Mr.", "Title_val_Mrs.",
                   "Title_val_Rare_Title", "Title_val_Miss."] 
X = data_cleaned[features_to_use]
X.head(n=2)

Unnamed: 0,Pclass,Age,Fare,SibSp,Parch,Sex,Embarked_Val_C,Embarked_Val_Q,isAlone,Title_val_Mr.,Title_val_Mrs.,Title_val_Rare_Title,Title_val_Miss.
0,3,22.0,7.25,1,0,1,0,0,0,1,0,0,0
1,1,38.0,71.2833,1,0,0,1,0,0,0,1,0,0


`Age` can have a **much bigger impact** than `Sex` during the model's learning, because its values range from 0 to 80+ (while `Sex` is just 0 or 1). The same goes for `Fare`.

It is important to always **scale** your data, otherwise one feature might carry too much weight compared to another.
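
Standard scaling replaces each value $x$ of a feature by $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of that feature. The sketch below, on a made-up two-column array (`X_toy` is an assumption), checks that this formula matches what scikit-learn's `StandardScaler` computes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features: an age-like column and a 0/1 column
X_toy = np.array([[22.0, 1.0],
                  [38.0, 0.0],
                  [26.0, 0.0],
                  [35.0, 1.0]])

# Manual standardization, column by column
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Same operation with scikit-learn
scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so both features live on a comparable scale.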

In [24]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
print(X_scaled)

[[ 0.82520863 -0.59049493 -0.50023975 ... -0.39502761 -0.24152295
  -0.50035149]
 [-1.57221121  0.64397101  0.78894661 ...  2.53146861 -0.24152295
  -0.50035149]
 [ 0.82520863 -0.28187844 -0.48664993 ... -0.39502761 -0.24152295
   1.99859501]
 ...
 [ 0.82520863  0.00352373 -0.17408416 ... -0.39502761 -0.24152295
   1.99859501]
 [-1.57221121 -0.28187844 -0.0422126  ... -0.39502761 -0.24152295
  -0.50035149]
 [ 0.82520863  0.18104628 -0.49017322 ... -0.39502761 -0.24152295
  -0.50035149]]


In [64]:
lr = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X_scaled,
                                                    data_cleaned["Survived"],
                                                    test_size=0.2, 
                                                    random_state=0)
lr.fit(X_train, y_train)
lr_score = lr.score(X_test, y_test)
print(lr_score)
print("Accuracy slightly decreased... It can happen, you still need to scale your data.")

0.7415730337078652
Accuracy slightly decreased... It can happen, you still need to scale your data.


In [67]:
from sklearn.svm import SVC
clf = SVC()

# Fitting a SVM on unscaled data
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    data_cleaned["Survived"],
                                                    test_size=0.2, 
                                                    random_state=0)

clf.fit(X_train, y_train)
clf_score = clf.score(X_test, y_test)
print('SVC score with no scaling : ', clf_score)

# Fitting a SVM on scaled data
X_train, X_test, y_train, y_test = train_test_split(X_scaled,
                                                    data_cleaned["Survived"],
                                                    test_size=0.2, 
                                                    random_state=0)

clf.fit(X_train, y_train)
clf_score = clf.score(X_test, y_test)
print('SVC score with scaling : ', clf_score, '\nWith a SVM, the accuracy increase following scaling of the data is much more obvious !')

SVC score with no scaling :  0.6853932584269663
SVC score with scaling :  0.7584269662921348 
With a SVM, the accuracy increase following scaling of the data is much more obvious !
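
In the cells above, the scaler is fitted on the whole dataset before the train/test split, so the test rows influence the scaling statistics. A common alternative, sketched here on synthetic data (the arrays `X_toy` and `y_toy` are made up), is to chain the scaler and the model in a pipeline so that the scaler only learns from the training set:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: three features on very different scales
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 0.1])
y_toy = (X_toy[:, 0] + 10 * X_toy[:, 2] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)

# fit() scales with training statistics only, then fits the SVM;
# score() reuses those same statistics to transform the test set
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(score)
```

The pipeline behaves like a single estimator, which also makes it easier to reuse the exact same preprocessing on new data.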


___

# IV. Saving data

What if you work on some data in a notebook and you want to load it somewhere else (in your source code, in another notebook, send it somewhere, etc.)?

For now it only lives in Jupyter memory - and when you close it, you lose it.

**Pickle** to the rescue! 🥗

2 functions you must know:
- `pickle.dump(obj, file)`
- `obj = pickle.load(file)`

> ⚠️ **Warning**: `file` does not refer to the path but to the opened file object (for example: `open(path_to_file, "wb")`)

## IV.1. Saving pickle

In [58]:
!ls

05-Data-Preparation.ipynb images


In [25]:
import pickle

with open('data_cleaned.pkl', 'wb') as f:
    pickle.dump(data_cleaned, f)

# > 🔦 Hint: In "wb", w means that the file is opened for writing
# and b refers to binary mode: the data is written as bytes

In [60]:
!ls

05-Data-Preparation.ipynb data_cleaned.pkl          images


## IV.2. Loading pickle

In [26]:
with open("data_cleaned.pkl", "rb") as f:
    data_pickle = pickle.load(f)

data_pickle.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,Embarked_Val_C,Embarked_Val_Q,Embarked_Val_S,isAlone,Title,Title_val_Master.,Title_val_Miss.,Title_val_Mr.,Title_val_Mrs.,Title_val_Rare_Title
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,...,0,0,1,0,Mr.,0,0,1,0,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,...,1,0,0,0,Mrs.,0,0,0,1,0
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,...,0,0,1,1,Miss.,0,1,0,0,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,...,0,0,1,0,Mrs.,0,0,0,1,0
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,...,0,0,1,1,Mr.,0,0,1,0,0
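
For DataFrames specifically, pandas also offers the shortcuts `DataFrame.to_pickle` and `pd.read_pickle`, which take a path directly. A minimal sketch, using a made-up frame and a temporary directory so that nothing is left on disk (the file name `toy.pkl` is an assumption):

```python
import os
import tempfile

import pandas as pd

# Made-up frame to save and reload
toy = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    # Write the DataFrame to disk, then read it back
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.equals(toy))
```

Whichever function you use, only load pickle files that come from a source you trust, since unpickling can execute arbitrary code.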
