# Optional - Advanced Solution 🧙🧙

The previous solution is perfectly valid: we strictly applied the rules from the lectures. Now that we are more comfortable with preprocessing in Python, let's take a step back and see what we could have done differently by digging a little deeper into the interpretation of the variables.

1. Load the titanic dataset again

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [7]:
df = pd.read_csv("/Users/qxzjy/vscworkspace/dsfs-ft-34/ml_module/exercices/data/titanic.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Let's explore the features in more detail and try to extract more information than before:

**A. Preprocessing to be planned with pandas**

**Unnecessary columns for prediction, to be thrown away** :
- _PassengerId_ and _Name_ are passenger identifiers, we won't use them for prediction (these columns don't contain any information)

<Note type="tip" title="Actually, _Name_ contains useful information !">

While it is true that _Name_ cannot be used as such for prediction, it contains valuable information on the socio-economic background of the passenger in the form of their title. We will try to extract a _Title_ variable from the _Name_ variable.

</Note>

- _Ticket_ and _Cabin_ have too many different modalities. They might not be very useful, and if we one-hot encoded them, the number of columns would explode relative to the number of rows.

<Note type="tip" title="We can do something with the _Cabin_ variable !">

_Ticket_ and _Cabin_ do have far too many modalities to be useful for prediction as they are. However, the _Cabin_ variable can easily be used after a slight transformation: let's create a new variable _HasCabin_ which is equal to 1 when the passenger has a cabin number and 0 otherwise.

</Note>

**Columns with too many missing values, to be discarded** : Cabin


**Target variable/target (Y) that we will try to predict, to separate from the others** : Survived

**------------**

**B. Preprocessing to be planned with scikit-learn**

**Explanatory variables (X)**
We need to identify which columns contain categorical variables and which columns contain numerical variables, as they will be treated differently.

- Categorical variables : Sex, Embarked, HasCabin, Title
- Numerical variables : Pclass, Age, SibSp, Parch, Fare

In this dataset, we have both types of variables. We will therefore need a numeric_transformer (which will call the StandardScaler class) and a categorical_transformer (which will call the OneHotEncoder class). Moreover, as we observe missing values in the _Age_ and _Embarked_ columns, we will also call the SimpleImputer class to handle the missing values.
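To see which columns need an imputer, one option is to count the missing values per column. The snippet below is a minimal sketch on a small made-up DataFrame (`toy` and its values are assumptions for illustration, not the real dataset); the same two calls can be run on `df`.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like the Titanic data (values are illustrative only)
toy = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0, 35.0],
        "Embarked": ["S", "C", None, "S"],
        "Fare": [7.25, 71.2833, 7.925, 53.1],
    }
)

# Number of missing values in each column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Columns with at least one missing value: these need an imputation step
columns_to_impute = missing_counts[missing_counts > 0].index.tolist()
print(columns_to_impute)
```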

**Target variable Y**
Here, the target variable Y is categorical (survival vs. death) but we notice that it is already encoded in numbers (1 vs. 0). It will therefore not be necessary to go through a label encoding step.

## Preprocessing - pandas part ##
2. Create a column _HasCabin_ in the dataset that is constant equal to 1

In [8]:
df["HasCabin"] = 1
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,HasCabin
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1


3. Using a mask, change the value of the variable _HasCabin_ to 0 wherever Cabin is missing.

In [9]:
df.loc[df["Cabin"].isnull(), "HasCabin"] = 0
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,HasCabin
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0
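Steps 2 and 3 can also be written as a single expression. This is only an equivalent sketch on a made-up `toy` DataFrame (an assumption for illustration); the two-step version above with a mask gives the same result.

```python
import numpy as np
import pandas as pd

# Toy Cabin column: NaN means no cabin number was recorded
toy = pd.DataFrame({"Cabin": [np.nan, "C85", np.nan, "C123"]})

# notna() gives True/False, astype(int) turns it into 1/0
toy["HasCabin"] = toy["Cabin"].notna().astype(int)
print(toy)
```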


5. Create a column _Title_ that only contains the title extracted from the _Name_ variable. 

<Note type="tip" title="Remember pandas handles columns of strings efficiently">

Some methods of [the str type](https://docs.python.org/3.3/library/stdtypes.html?highlight=split) can be helpful 😉
You can write a function that extracts the title from one element of the column, and then use the `apply()` method to apply this function to the whole column.

</Note>

In [12]:
def extract_title(name):
    return name.split(", ")[1].split(".")[0]

In [13]:
df["Title"] = df["Name"].apply(extract_title)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,HasCabin,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1,Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0,Mr
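As an alternative to `apply()`, pandas also offers vectorized string methods. The sketch below uses `Series.str.extract` with a regular expression that captures everything between the first comma and the following period; the regular expression and the third name are assumptions made up for illustration.

```python
import pandas as pd

names = pd.Series(
    [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Doe, the Countess. of (Jane)",  # made-up name with a multi-word title
    ]
)

# Capture the text after ", " and before the next "."
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())
```

With `expand=False` and a single capture group, the result is a Series that can be assigned directly to a new column.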


6. Display all the possible values and number of instances of each of these values in your dataset for the new _Title_ variable.

In [15]:
df["Title"].value_counts()

Title
Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Major             2
Col               2
the Countess      1
Capt              1
Ms                1
Sir               1
Lady              1
Mme               1
Don               1
Jonkheer          1
Name: count, dtype: int64

7. Some of these values represent very few instances, and others seem to represent similar categories of people. Group the similar categories under one name, and create a new category called _Rare_ that gathers all the underrepresented modalities.

In [17]:
df['Title'] = df['Title'].replace(['Mlle', 'Ms'], 'Miss')
df['Title'] = df['Title'].replace('Mme', 'Mrs')
df['Title'] = df['Title'].replace(['Lady', 'the Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
df['Title'].value_counts()

Title
Mr        517
Miss      185
Mrs       126
Master     40
Rare       23
Name: count, dtype: int64
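Instead of listing every rare title by hand, the grouping can be driven by a frequency threshold. This is a sketch on a made-up Series; the threshold `min_count` is a hypothetical parameter that you would tune on the real data.

```python
import pandas as pd

# Toy titles (illustrative only)
titles = pd.Series(["Mr"] * 5 + ["Miss"] * 3 + ["Mlle", "Dr", "Rev"])

# First merge the spellings that mean the same thing
titles = titles.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})

# Then send every title seen fewer than min_count times to "Rare"
min_count = 2  # hypothetical threshold
counts = titles.value_counts()
rare_titles = counts[counts < min_count].index
grouped = titles.mask(titles.isin(rare_titles), "Rare")

print(grouped.value_counts())
```

The explicit list used above is easier to read, while the threshold version keeps working if a title that was not anticipated appears in the data.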

8. Now that we are done squeezing some extra information out of our variables, let's reproduce all the subsequent steps from the first solution and let's compare our models' performances.

In [18]:
column_to_drop = ["PassengerId", "Ticket", "Cabin", "Name"]
df.drop(columns=column_to_drop, inplace=True)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,HasCabin,Title
0,0,3,male,22.0,1,0,7.25,S,0,Mr
1,1,1,female,38.0,1,0,71.2833,C,1,Mrs
2,1,3,female,26.0,0,0,7.925,S,0,Miss
3,1,1,female,35.0,1,0,53.1,S,1,Mrs
4,0,3,male,35.0,0,0,8.05,S,0,Mr


In [19]:
target_variable = "Survived"

X = df.drop(target_variable, axis=1) 
y = df[target_variable]

display(X.head())
display(y.head())

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,HasCabin,Title
0,3,male,22.0,1,0,7.25,S,0,Mr
1,1,female,38.0,1,0,71.2833,C,1,Mrs
2,3,female,26.0,0,0,7.925,S,0,Miss
3,1,female,35.0,1,0,53.1,S,1,Mrs
4,3,male,35.0,0,0,8.05,S,0,Mr


0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

In [20]:
X_train_unproc, X_test_unproc, y_train_unproc, y_test_unproc = train_test_split(X, y, test_size=0.15, random_state=0)

In [21]:
numeric_features = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
numeric_transformer = Pipeline(
    steps=[
        (
            "imputer",
            SimpleImputer(strategy="median"),
        ),
        ("scaler", StandardScaler()),
    ]
)

In [22]:
categorical_features = ["Sex", "Embarked", "HasCabin", "Title"]
categorical_transformer = Pipeline(
    steps=[
        (
            "imputer",
            SimpleImputer(strategy="most_frequent"),
        ),
        (
            "encoder",
            OneHotEncoder(drop="first"),
        ),
    ]
)

In [23]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

In [24]:
X_train = preprocessor.fit_transform(X_train_unproc)
print(X_train[0:5])

print()

X_test = preprocessor.transform(X_test_unproc)
print(X_test[0:5,:])

[[-1.60067161  2.62354063 -0.46346837 -0.46599785 -0.10960455  1.
   0.          1.          0.          0.          1.          0.
   0.        ]
 [ 0.81068841 -0.66498389 -0.46346837 -0.46599785 -0.47113394  1.
   0.          1.          0.          0.          1.          0.
   0.        ]
 [ 0.81068841 -0.05316537  0.4315458  -0.46599785 -0.47717621  1.
   1.          0.          0.          0.          1.          0.
   0.        ]
 [ 0.81068841  0.78808508  0.4315458  -0.46599785 -0.44243314  0.
   0.          1.          0.          0.          0.          1.
   0.        ]
 [-0.3949916   1.09399434  0.4315458  -0.46599785 -0.10960455  1.
   0.          1.          0.          0.          1.          0.
   0.        ]]

[[ 0.81068841 -0.05316537 -0.46346837 -0.46599785 -0.34206493  1.
   0.          0.          0.          0.          1.          0.
   0.        ]
 [ 0.81068841 -0.05316537 -0.46346837 -0.46599785 -0.4812044   1.
   0.          1.          0.          0.         

In [25]:
# Optional here: Survived is already encoded as 0/1, so LabelEncoder leaves the values unchanged
labelencoder = LabelEncoder()

y_train = labelencoder.fit_transform(y_train_unproc)
print(y_train[0:5])

print()

y_test = labelencoder.transform(y_test_unproc)

[0 0 0 0 0]



### Training model

In [26]:
from sklearn.linear_model import LogisticRegression

In [27]:
# Train model
model = LogisticRegression()

print("Training model...")
model.fit(X_train, y_train) # Training is always done on train set !!
print("...Done.")

Training model...
...Done.


### Predictions

In [28]:
# Predictions on training set
print("Predictions on training set...")
y_train_pred = model.predict(X_train)
print("...Done.")
print(y_train_pred[0:5])
print()

Predictions on training set...
...Done.
[0 0 0 1 0]



In [29]:
# Predictions on test set
print("Predictions on test set...")
y_test_pred = model.predict(X_test)
print("...Done.")
print(y_test_pred[0:5])
print()

Predictions on test set...
...Done.
[0 0 0 1 1]



### Performances evaluation

In [30]:
from sklearn.metrics import accuracy_score

In [31]:
# Print scores
print("Accuracy on training set : ", accuracy_score(y_train, y_train_pred))
print("Accuracy on test set : ", accuracy_score(y_test, y_test_pred))

Accuracy on training set :  0.8282694848084544
Accuracy on test set :  0.8208955223880597
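Accuracy gives a single number; a confusion matrix shows which kind of errors the model makes. The sketch below uses made-up labels (`y_true_toy` and `y_pred_toy` are assumptions for illustration, not results from the model above) and scikit-learn's `confusion_matrix`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration (1 = survived, 0 = did not survive)
y_true_toy = [0, 0, 1, 1, 0, 1]
y_pred_toy = [0, 0, 1, 0, 0, 1]

# Rows are the true classes, columns are the predicted classes
matrix = confusion_matrix(y_true_toy, y_pred_toy)
accuracy = accuracy_score(y_true_toy, y_pred_toy)

print(matrix)
print("Accuracy : ", accuracy)
```

On the real data, the same calls can be applied to `y_test` and `y_test_pred`.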


Tada 🥳 If all went well, the score has improved a bit!
This example shows that adding a little extra information to a model can improve its predictive performance. Knowing and applying the preprocessing guidelines is great, but always remember to check two important things before you proceed :

* If a variable contains missing values, ask yourself why these values are missing and whether you could use this as information to feed the model. In the above example, we interpreted the absence of a cabin number as meaning that the passenger had no cabin. It is very common for missing values to carry hidden meaning. Completely random missing values (caused by a bug or other unpredictable causes) are very rare.

* When a non-numerical variable is not usable as is, always ask yourself whether you could still extract some information from it. Here the _Name_ variable cannot be used directly, but it mentions the passenger's title, which can be useful information.
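The first point, using missingness as information, can also be handled inside scikit-learn. As a sketch under the assumption that you want to keep a numeric column such as _Age_ while also flagging where it was missing, `SimpleImputer` accepts `add_indicator=True`; the `ages` array below is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy Age column with one missing value (illustrative only)
ages = np.array([[22.0], [np.nan], [26.0]])

# add_indicator=True appends one 0/1 column per feature that had missing values
imputer = SimpleImputer(strategy="median", add_indicator=True)
ages_imputed = imputer.fit_transform(ages)

# First column: imputed ages, second column: 1 where the age was missing
print(ages_imputed)
```

This plays the same role as _HasCabin_: the model receives both a filled-in value and a flag telling it the original value was absent.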