# Discovering the Titanic 🚢🚢

The Titanic is a well-known ship, but did you know that it is also one of the most popular datasets in data science? Here's the link to the dataset:

<a href="https://www.kaggle.com/c/titanic/"> Titanic </a>

Machine learning is largely about statistical prediction and understanding data. The objective of this exercise is to predict whether a passenger survived the sinking of the Titanic, based on the information available about that passenger. The code that trains the model, makes predictions and evaluates performance is already written. Your task is to complete the upstream part: the preprocessing that prepares the dataset before the model is trained.

1. Download the dataset _titanic.csv_.
2. Try to understand what's in this dataset.
    1. The columns are described on this page: <a href="https://www.kaggle.com/c/titanic/data"> Titanic Data </a>

3. Place the file _titanic.csv_ in the same folder as this notebook and read it.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [12]:
df = pd.read_csv("titanic.csv")  # file placed in the same folder as this notebook

4. Explore the dataset and determine which columns are useful for prediction and what preprocessing you will do.

In [13]:
print("Number of rows : {}".format(df.shape[0]))

display(df.head())

display(df.describe(include="all"))

display(100 * df.isnull().sum() / df.shape[0])

Number of rows : 891


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


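| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891 | 891 | 714.0 | 891.0 | 891.0 | 891 | 891.0 | 204 | 889 |
| unique | NaN | NaN | NaN | 891 | 2 | NaN | NaN | NaN | 681 | NaN | 147 | 3 |
| top | NaN | NaN | NaN | Braund, Mr. Owen Harris | male | NaN | NaN | NaN | 347082 | NaN | B96 B98 | S |
| freq | NaN | NaN | NaN | 1 | 577 | NaN | NaN | NaN | 7 | NaN | 4 | 644 |
| mean | 446.0 | 0.383838 | 2.308642 | NaN | NaN | 29.699118 | 0.523008 | 0.381594 | NaN | 32.204208 | NaN | NaN |
| std | 257.353842 | 0.486592 | 0.836071 | NaN | NaN | 14.526497 | 1.102743 | 0.806057 | NaN | 49.693429 | NaN | NaN |
| min | 1.0 | 0.0 | 1.0 | NaN | NaN | 0.42 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN |
| 25% | 223.5 | 0.0 | 2.0 | NaN | NaN | 20.125 | 0.0 | 0.0 | NaN | 7.9104 | NaN | NaN |
| 50% | 446.0 | 0.0 | 3.0 | NaN | NaN | 28.0 | 0.0 | 0.0 | NaN | 14.4542 | NaN | NaN |
| 75% | 668.5 | 1.0 | 3.0 | NaN | NaN | 38.0 | 1.0 | 0.0 | NaN | 31.0 | NaN | NaN |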


PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            19.865320
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Cabin          77.104377
Embarked        0.224467
dtype: float64

## Preprocessing - pandas part 🐼🐼 
5. Use the pandas library to discard columns you won't use for prediction.

In this dataset, some categorical variables have too many distinct values (high cardinality), so consider dropping them. As a rule of thumb, for a dataset with fewer than 1000 rows, categorical variables with more than 15-20 possible values are usually rejected. Check the number of unique values in each column to decide which ones to keep.
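
One way to check this is to count the distinct values per column with `nunique()`. The sketch below uses a toy frame and a hypothetical threshold `MAX_CATEGORIES` (scaled down to 3 because the frame has only five rows). On the real dataset, the `describe` output above already shows 891 unique names, 681 tickets and 147 cabins, against 2 values for `Sex` and 3 for `Embarked`.

```python
import pandas as pd

# Toy frame mimicking a few Titanic columns (illustrative rows only).
toy = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "female", "male"],
        "Embarked": ["S", "C", "S", "S", "Q"],
        "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"],
    }
)

MAX_CATEGORIES = 3  # hypothetical threshold, scaled down for this tiny frame

# Number of distinct values in each column.
cardinality = toy.nunique()

# Columns above the threshold are candidates for dropping.
too_many = cardinality[cardinality > MAX_CATEGORIES].index.tolist()

print(cardinality.to_dict())
print(too_many)
```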

In [14]:
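# Identifier and high-cardinality columns that will not be used as features
column_to_drop = ["PassengerId", "Ticket", "Cabin", "Name"]
df.drop(columns=column_to_drop, inplace=True)  # `columns=` already implies axis=1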

6. Separate the target variable (y) from the explanatory variables (X).

In [15]:
target_variable = "Survived"

X = df.drop(target_variable, axis=1) 
y = df[target_variable]

display(X.head())
display(y.head())

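| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |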


0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

## Preprocessing - scikit-learn part 🔬🔬
7. Split your data into a train set and a test set; the test set should represent 15% of the available data.

In [23]:
X_train_unproc, X_test_unproc, y_train_unproc, y_test_unproc = train_test_split(X, y, test_size=0.15, random_state=0)
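
To see what `test_size=0.15` means in row counts, here is a small sketch on placeholder arrays (`X_demo` and `y_demo` are made up and only share the dataset's 891 rows). scikit-learn rounds the test size up, so the split should give 134 test rows and 757 train rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset (891).
X_demo = np.arange(891).reshape(-1, 1)
y_demo = np.arange(891) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=0
)

# 15% of 891 is 133.65, which is rounded up for the test set.
print(len(X_tr), len(X_te))
```

If class balance matters, `train_test_split` also accepts a `stratify` argument (for example `stratify=y`). The notebook does not use it, so the reference scores in this exercise were obtained without stratification.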

8. Create the preprocessing pipeline for numeric columns

In [17]:
numeric_features = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
numeric_transformer = Pipeline(
    steps=[
        (
            "imputer",
            SimpleImputer(strategy="median"),
        ),
        ("scaler", StandardScaler()),
    ]
)
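
To make the two steps concrete, the sketch below runs them on a tiny made-up `ages` column with one missing value. The median of the observed values (22, 35, 38) is 35, so that is what the imputer fills in; the scaler then centers the column to mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up ages with one missing value.
ages = np.array([[22.0], [38.0], [np.nan], [35.0]])

# Step 1 alone: the NaN is replaced by the median of the observed values.
imputed = SimpleImputer(strategy="median").fit_transform(ages)
print(imputed.ravel())

# Steps 1 and 2 chained, as in the numeric pipeline.
pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)
scaled = pipe.fit_transform(ages)
print(scaled.mean(), scaled.std())
```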

9. Create the preprocessing pipeline for categorical columns

In [18]:
categorical_features = ["Sex", "Embarked"]
categorical_transformer = Pipeline(
    steps=[
        (
            "imputer",
            SimpleImputer(strategy="most_frequent"),
        ),
        (
            "encoder",
            OneHotEncoder(drop="first"),
        ),
    ]
)
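
As an illustration of what `OneHotEncoder(drop="first")` produces, here is a sketch on a made-up `ports` column. The categories are sorted (`C`, `Q`, `S`) and the first one is dropped, so a row of zeros stands for `C`. Dropping one level avoids a redundant column, which is usually preferable for linear models such as logistic regression.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up embarkation ports.
ports = np.array([["S"], ["C"], ["S"], ["Q"]], dtype=object)

encoder = OneHotEncoder(drop="first")
# The encoder returns a sparse matrix by default; convert it for display.
encoded = encoder.fit_transform(ports).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)                 # one column for Q, one for S
```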

In [19]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

10. Use the preprocessing pipelines of questions 8 and 9 to transform X_train and X_test

Reminder: call `fit_transform()` on X_train and only `transform()` on X_test, so that the test set is transformed with the statistics (medians, means, categories) learned from the training set and never influences them.
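
The difference is easy to see on a single made-up column: the scaler learns its mean and standard deviation from `train` only, then applies those same values to `test`. This is a minimal sketch with invented numbers, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[20.0], [50.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the train statistics

print(scaler.mean_)         # mean learned on train only
print(test_scaled.ravel())  # 20 maps to 0 because it equals the train mean
```

Calling `fit_transform()` on the test set instead would recompute the statistics from the test data, so the two sets would no longer be on the same scale.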

In [24]:
X_train = preprocessor.fit_transform(X_train_unproc)
print(X_train[0:5])

print()

X_test = preprocessor.transform(X_test_unproc)
print(X_test[0:5,:])

[[-1.60067161  2.62354063 -0.46346837 -0.46599785 -0.10960455  1.
   0.          1.        ]
 [ 0.81068841 -0.66498389 -0.46346837 -0.46599785 -0.47113394  1.
   0.          1.        ]
 [ 0.81068841 -0.05316537  0.4315458  -0.46599785 -0.47717621  1.
   1.          0.        ]
 [ 0.81068841  0.78808508  0.4315458  -0.46599785 -0.44243314  0.
   0.          1.        ]
 [-0.3949916   1.09399434  0.4315458  -0.46599785 -0.10960455  1.
   0.          1.        ]]

[[ 0.81068841 -0.05316537 -0.46346837 -0.46599785 -0.34206493  1.
   0.          0.        ]
 [ 0.81068841 -0.05316537 -0.46346837 -0.46599785 -0.4812044   1.
   0.          1.        ]
 [ 0.81068841 -1.73566628  3.11658831  0.78050523 -0.0466642   1.
   1.          0.        ]
 [-1.60067161 -0.05316537  0.4315458  -0.46599785  2.31779438  0.
   0.          0.        ]
 [ 0.81068841 -0.05316537 -0.46346837  2.02700832 -0.32620396  0.
   0.          0.        ]]


In [25]:
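# Survived is already stored as 0/1, so LabelEncoder leaves the values unchanged here;
# it would matter if the target were stored as strings.
labelencoder = LabelEncoder()

y_train = labelencoder.fit_transform(y_train_unproc)
print(y_train[0:5])

print()

y_test = labelencoder.transform(y_test_unproc)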

[0 0 0 0 0]



### Training the model

In [26]:
from sklearn.linear_model import LogisticRegression

In [28]:
# Train model
model = LogisticRegression()

print("Training model...")
model.fit(X_train, y_train)  # Training is always done on the train set!
print("...Done.")

Training model...
...Done.


### Predictions

In [29]:
# Predictions on training set
print("Predictions on training set...")
y_train_pred = model.predict(X_train)
print("...Done.")
print(y_train_pred[0:5])
print()

Predictions on training set...
...Done.
[0 0 0 0 0]



In [30]:
# Predictions on test set
print("Predictions on test set...")
y_test_pred = model.predict(X_test)
print("...Done.")
print(y_test_pred[0:5])
print()

Predictions on test set...
...Done.
[0 0 0 1 1]



### Performance evaluation

In [31]:
from sklearn.metrics import accuracy_score

In [32]:
# Print scores
print("Accuracy on training set : ", accuracy_score(y_train, y_train_pred))
print("Accuracy on test set : ", accuracy_score(y_test, y_test_pred))

Accuracy on training set :  0.8031704095112285
Accuracy on test set :  0.7910447761194029
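
Accuracy is simply the fraction of predictions that match the true labels. The sketch below checks this on made-up arrays (`y_true` and `y_pred` are invented for the illustration) by comparing a manual computation with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions: 4 of the 5 predictions are correct.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

manual = (y_true == y_pred).mean()       # share of matching entries
library = accuracy_score(y_true, y_pred)

print(manual, library)
```

Read this way, a test accuracy of about 0.791 on a 15% split of 891 rows (134 passengers) corresponds to 106 correct predictions.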


If you get a score close to 0.79 on the test set, your preprocessing was done correctly! :-)