## Importing

In [1]:
import pandas as pd

train = pd.read_csv("data_titanic/train.csv")

In [2]:
train.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Brief Exploration

In [3]:
#Categorical features
train.describe(include = object)

|   | Name | Sex | Ticket | Cabin | Embarked |
|---|---|---|---|---|---|
| count | 891 | 891 | 891 | 204 | 889 |
| unique | 891 | 2 | 681 | 147 | 3 |
| top | Braund, Mr. Owen Harris | male | 347082 | B96 B98 | S |
| freq | 1 | 577 | 7 | 4 | 644 |


In [4]:
#Numerical features
train.describe()

|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


For simplicity, let's work only with the following columns.

Categorical:
- Sex
- Embarked

Numerical:
- Survived: *target* 0 = No, 1 = Yes
- Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Age: Age in years
 
More detailed info: https://www.kaggle.com/c/titanic

In [5]:
#Let's keep only the desired columns
train = train[['Sex','Embarked','Pclass', 'Age','Survived']]

In [6]:
train.shape

(891, 5)

In [7]:
#Check for missing values
train.isna().sum()

Sex           0
Embarked      2
Pclass        0
Age         177
Survived      0
dtype: int64

For simplicity, we drop the rows with missing values. If you later experiment with composite transformers, comment out this cell so that you can try to include missing value imputation as well.
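To make the trade-off concrete: with 177 missing ages and 2 missing ports, dropping incomplete rows shrinks the data from 891 to 712 rows. The sketch below contrasts dropping with a very simple fill-in strategy on a made-up toy frame (the values and the choice of median/mode are assumptions for illustration, not something this notebook prescribes).

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as the Titanic columns (made-up values).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna(axis=0)

# Option 2: keep all rows and fill the gaps with a simple statistic.
imputed = toy.copy()
imputed["Age"] = toy["Age"].fillna(toy["Age"].median())                  # median of observed ages
imputed["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])  # most frequent port

print(dropped.shape)   # only the complete rows survive
print(imputed)
```

Note that in a real workflow the fill-in statistics should be learned from the training split only, which is one reason to move imputation into a composite estimator.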

In [8]:
train = train.dropna(axis=0)

In [9]:
train.head()

|   | Sex | Embarked | Pclass | Age | Survived |
|---|---|---|---|---|---|
| 0 | male | S | 3 | 22.0 | 0 |
| 1 | female | C | 1 | 38.0 | 1 |
| 2 | female | S | 3 | 26.0 | 1 |
| 3 | female | S | 1 | 35.0 | 1 |
| 4 | male | S | 3 | 35.0 | 0 |


## Feature Engineering
With our current knowledge, we can apply various scikit-learn transformers one at a time. Let's not forget to create a holdout set first!

In [10]:
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train[['Pclass', 'Age', 'Sex', 'Embarked']],
                                                    train['Survived'], 
                                                    test_size=0.2, 
                                                    random_state=42)

### Numerical Features
- Pclass
- Age

Let's simply scale these two features with MinMaxScaler.

In [11]:
scaler = preprocessing.MinMaxScaler()
scaler.fit(X_train[['Pclass', 'Age']])
X_train_transformed_numerical = scaler.transform(X_train[['Pclass', 'Age']])
X_test_transformed_numerical = scaler.transform(X_test[['Pclass', 'Age']])

print(X_train_transformed_numerical.shape)
print(X_test_transformed_numerical.shape)

(569, 2)
(143, 2)
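As a small illustration of what the scaler does, the sketch below reproduces min-max scaling by hand, (x - min) / (max - min), on made-up ages. It also shows why the scaler is fitted on the training split only: a holdout value above the training maximum is mapped outside the [0, 1] range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up ages; the "test" split contains a value above the training maximum.
age_train = np.array([[2.0], [22.0], [42.0]])
age_test = np.array([[12.0], [62.0]])

scaler = MinMaxScaler()
scaler.fit(age_train)                    # learns min=2 and max=42 from the training split only
scaled_test = scaler.transform(age_test)

# The same computation by hand: (x - min) / (max - min)
manual_test = (age_test - age_train.min()) / (age_train.max() - age_train.min())

print(scaled_test.ravel())               # [0.25 1.5]
print(np.allclose(scaled_test, manual_test))
```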


### Categorical Features
*   Sex
*   Embarked

We can simply one-hot encode these.

In [12]:
# Note: scikit-learn 1.2 renamed this argument to sparse_output (sparse was removed in 1.4).
encoder = preprocessing.OneHotEncoder(sparse=False)
encoder.fit(X_train[['Sex', 'Embarked']])
X_train_transformed_categorical = encoder.transform(X_train[['Sex', 'Embarked']])
X_test_transformed_categorical = encoder.transform(X_test[['Sex', 'Embarked']])

print(X_train_transformed_categorical.shape)
print(X_test_transformed_categorical.shape)

(569, 5)
(143, 5)


## HANDS-ON 1: Baseline Model & Model Evaluation
Time for the first exercise! First, let's put together the transformed numerical and categorical features.

In [13]:
X_train_transformed = np.concatenate((X_train_transformed_numerical,X_train_transformed_categorical), axis = 1)
X_test_transformed = np.concatenate((X_test_transformed_numerical,X_test_transformed_categorical), axis = 1)

print(X_train_transformed.shape)
print(X_test_transformed.shape)

(569, 7)
(143, 7)


In [35]:
# TASK 1: Fit sklearn.dummy.DummyClassifier. Then, let the model predict for the training set (X_train_transformed) and the holdout set (X_test_transformed).
# Store the predictions as y_pred_TRAIN_DUMMY (training set) and as y_pred_HOLDOUT_DUMMY (holdout set)

from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train_transformed, y_train)

y_pred_TRAIN_DUMMY = dummy_clf.predict(X_train_transformed)
y_pred_HOLDOUT_DUMMY = dummy_clf.predict(X_test_transformed)

In [36]:
# OPTIONAL TASK 1: Think about a simple heuristic that can be used as a baseline. 
# One possibility is to use gender and, for example, predict that every man or every woman has survived.
# You can store the result as y_pred_TRAIN_HEURISTIC and as y_pred_HOLDOUT_HEURISTIC.

# Column 3 of the transformed matrix is the one-hot "male" indicator
# (columns 0-1 are Pclass and Age, column 2 is the "female" indicator).
# Predict 1 (survived) for women, i.e. wherever the "male" indicator is 0.
y_pred_TRAIN_HEURISTIC = (X_train_transformed[:, 3] == 0).astype(int)
y_pred_HOLDOUT_HEURISTIC = (X_test_transformed[:, 3] == 0).astype(int)
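Indexing the transformed matrix by position is easy to get wrong if the column order ever changes. As a hedged alternative, the same heuristic can be written against the raw Sex column. The helper name `predict_women_survive` below is hypothetical and only serves as an illustration.

```python
import numpy as np
import pandas as pd

def predict_women_survive(features: pd.DataFrame) -> np.ndarray:
    """Heuristic baseline: predict 1 (survived) for women and 0 for men."""
    return (features["Sex"] == "female").astype(int).to_numpy()

# Made-up rows standing in for X_train / X_test.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})
print(predict_women_survive(toy))   # [0 1 1 0]
```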

Great! We have our first predictions! Now let's evaluate how good our (poor dummy) model is, using the *sklearn.metrics* module.   
https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics

In [40]:
from sklearn import metrics

#TASK 2A: Display ACCURACY on TRAIN set.
print(metrics.accuracy_score(y_train, y_pred_TRAIN_DUMMY))
print(metrics.accuracy_score(y_train,y_pred_TRAIN_HEURISTIC))  #Optional Task 1
print()
#TASK 2B: Display ACCURACY on HOLDOUT set.
print(metrics.accuracy_score(y_test, y_pred_HOLDOUT_DUMMY))
print(metrics.accuracy_score(y_test, y_pred_HOLDOUT_HEURISTIC))  #Optional Task 1

#OPTIONAL TASK 2C: Can you think of a better measure than accuracy for this domain problem? If yes, use it in the same way.

0.6045694200351494
0.7873462214411248

0.5594405594405595
0.7482517482517482
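Accuracy alone can flatter a majority-class predictor. The sketch below, on made-up labels, shows a few other functions from sklearn.metrics that react differently when the model never predicts the positive class. Which measure suits the Titanic problem best is left open here.

```python
from sklearn import metrics

# Made-up labels: three non-survivors, two survivors, and a model that always predicts 0.
y_true = [0, 0, 0, 1, 1]
y_pred_majority = [0, 0, 0, 0, 0]

accuracy = metrics.accuracy_score(y_true, y_pred_majority)
recall = metrics.recall_score(y_true, y_pred_majority)
# zero_division=0 avoids a warning when no positive prediction exists at all
precision = metrics.precision_score(y_true, y_pred_majority, zero_division=0)
f1 = metrics.f1_score(y_true, y_pred_majority, zero_division=0)
balanced = metrics.balanced_accuracy_score(y_true, y_pred_majority)

print(accuracy, recall, precision, f1, balanced)   # 0.6 0.0 0.0 0.0 0.5
```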


Great, now we would also like to see the confusion matrix, as it is always a good idea to look at how our predictions are distributed across the classes.

In [38]:
#TASK 3: Display a CONFUSION MATRIX on HOLDOUT set. Hint: do not use plot_confusion_matrix but confusion_matrix only.
metrics.confusion_matrix(y_test, y_pred_HOLDOUT_DUMMY)

array([[80,  0],
       [63,  0]])

In [41]:
metrics.confusion_matrix(y_test, y_pred_HOLDOUT_HEURISTIC)

array([[68, 12],
       [24, 39]])
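In scikit-learn's convention, rows are the true classes and columns are the predicted classes, so the matrix reads [[TN, FP], [FN, TP]]. The sketch below unpacks the heuristic's holdout matrix shown above and recomputes accuracy, precision and recall from the four counts.

```python
import numpy as np

# Holdout confusion matrix of the gender heuristic, copied from the output above.
cm = np.array([[68, 12],
               [24, 39]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all passengers classified correctly
precision = tp / (tp + fp)        # of those predicted to survive, how many did
recall = tp / (tp + fn)           # of the actual survivors, how many were found

print(accuracy, precision, recall)
```

The accuracy computed this way matches the 0.748 holdout accuracy printed earlier for the heuristic.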

## HANDS-ON 2: Composite Estimators
Let's wrap our feature engineering and model fitting into a composite estimator. We will keep things simple and use only two composites, a ColumnTransformer and a Pipeline.  
We will build them one after the other rather than nesting everything at once.

### Feature Engineering wrapped into ColumnTransformer
The two feature transformations can be easily wrapped up into a single ColumnTransformer. This makes our feature engineering a **bit more robust and nicely encapsulated**. Refer to section 6.1.4 of the following link, which showcases exactly the kind of application we intend to create:

https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data

In [42]:
#TASK 3: Wrap MinMaxScaler and OneHotEncoder into a single ColumnTransformer. Each transformer should be applied to its corresponding columns only.
#Store the resulting composite as feature_engineering
# Hint: use argument remainder='passthrough'

from sklearn.compose import ColumnTransformer

feature_engineering = ColumnTransformer([('numerical_scaler', preprocessing.MinMaxScaler(),['Pclass', 'Age']),
                                         ('ohe', preprocessing.OneHotEncoder(sparse=False), ['Sex', 'Embarked'])
                                        ],
                                        remainder='passthrough')

### Predictive Model Wrapped into Pipeline
Let's now wrap together feature engineering with the model into a single Pipeline Composite estimator. Here is a pseudocode:
- entire_pipeline = feature_engineering -> model  

Both components are already available. From the step above, we can directly reuse the object feature_engineering. As the model, we simply create a new DummyClassifier, just as we did before.

In [43]:
# TASK 4: Wrap Feature Engineering and Predictive Model (dummy) into a single Pipeline composite estimator. 
# Store the result as entire_pipeline
from sklearn.pipeline import Pipeline

entire_pipeline = Pipeline([('feature_engineering', feature_engineering), ('dummy', DummyClassifier(strategy="most_frequent"))])

In [44]:
# TASK: Uncomment the line and try to train the pipeline.
# It should not return an error. 
# Notice that we are using untransformed data again (X_train) as the pipeline contains the transformers.

entire_pipeline.fit(X = X_train, y = y_train)

Pipeline(steps=[('feature_engineering',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('numerical_scaler',
                                                  MinMaxScaler(),
                                                  ['Pclass', 'Age']),
                                                 ('ohe',
                                                  OneHotEncoder(sparse=False),
                                                  ['Sex', 'Embarked'])])),
                ('dummy', DummyClassifier(strategy='most_frequent'))])

In [45]:
#Predict for training data
y_pred_TRAIN_DUMMY = entire_pipeline.predict(X_train)

#Predict for holdout data
y_pred_HOLDOUT_DUMMY = entire_pipeline.predict(X_test)

#Results should be the same as before
print(metrics.accuracy_score(y_train, y_pred_TRAIN_DUMMY))

#Accuracy on the HOLDOUT set
print(metrics.accuracy_score(y_test, y_pred_HOLDOUT_DUMMY))

0.6045694200351494
0.5594405594405595
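The identical scores are no coincidence: a Pipeline runs the same fit/transform/predict steps that we previously chained by hand. A minimal sketch on synthetic data (a scaler plus a decision tree with a fixed seed, both chosen here only for illustration) checks that the two routes give the same predictions.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, unrelated to the Titanic set.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)
X_new = rng.normal(size=(10, 2))

# Manual route: fit the scaler, transform, fit the model, transform again, predict.
manual_scaler = MinMaxScaler().fit(X_toy)
manual_tree = DecisionTreeClassifier(random_state=0).fit(manual_scaler.transform(X_toy), y_toy)
manual_pred = manual_tree.predict(manual_scaler.transform(X_new))

# Pipeline route: the same steps behind a single fit/predict interface.
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("tree", DecisionTreeClassifier(random_state=0)),
])
pipe.fit(X_toy, y_toy)
pipe_pred = pipe.predict(X_new)

print(np.array_equal(manual_pred, pipe_pred))   # True
```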


OPTIONAL TASK:   
The notebook 'nice_pipeline' shows examples of more complex pipelines. Feel free to scroll through it to see what preparing a complex composite looks like. You can then come back here and try to implement various components. For example, if we had not dropped the rows with missing values at the beginning of this notebook, constructing the composite would get a bit trickier. 

## HANDS-ON 3: Tree-based Models & Hyperparameter Tuning
Hold on to your constructed Pipeline! The only thing we need to do now is to replace the DummyClassifier with a proper learning model. We can start with a decision tree.

### Fitting Learning Model - Decision Tree

In [46]:
# TASK 5: Reuse your composite, but instead of a dummy model, fit a decision tree with default parameters.
# Store the result as dt_pipeline
from sklearn.tree import DecisionTreeClassifier

dt_pipeline = Pipeline([('feature_engineering', feature_engineering), ('decision_tree', DecisionTreeClassifier())])

# Train the pipeline
dt_pipeline.fit(X = X_train, y = y_train)

Pipeline(steps=[('feature_engineering',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('numerical_scaler',
                                                  MinMaxScaler(),
                                                  ['Pclass', 'Age']),
                                                 ('ohe',
                                                  OneHotEncoder(sparse=False),
                                                  ['Sex', 'Embarked'])])),
                ('decision_tree', DecisionTreeClassifier())])

In [47]:
# TASK 5B: Let the pipeline predict for TRAINING set. Store the result as y_pred_TRAIN_DT
# Also, Display accuracy.

y_pred_TRAIN_DT = dt_pipeline.predict(X_train)
print(metrics.accuracy_score(y_train, y_pred_TRAIN_DT))

0.9209138840070299


In [49]:
# TASK 5C: Let the pipeline predict for HOLDOUT set. Store the result as y_pred_HOLDOUT_DT
# Also, Display accuracy.

y_pred_HOLDOUT_DT = dt_pipeline.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_HOLDOUT_DT))

0.7552447552447552


Looking at the accuracy on the training and holdout sets, what can you infer about our model? Will it generalize well?
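A large gap between training and holdout accuracy is the usual sign of overfitting. The sketch below illustrates the effect on synthetic, deliberately noisy data (not the Titanic set) by comparing a shallow tree with an unrestricted one. The dataset parameters are arbitrary choices for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise, so that memorising the training set does not pay off.
X_syn, y_syn = make_classification(n_samples=600, n_features=6, n_informative=3,
                                   flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

results = {}
for depth in (2, None):   # None lets the tree grow until all leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, tree.predict(X_tr))
    test_acc = accuracy_score(y_te, tree.predict(X_te))
    results[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}, holdout={test_acc:.3f}")
```

The unrestricted tree fits the training split perfectly, so its training accuracy says little about how it will behave on unseen data.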

In [50]:
# OPTIONAL TASK 6: Do the same steps with RandomForest with default parameters. 
# Does the RandomForest display similar results as decision tree? If not, why?
from sklearn.ensemble import RandomForestClassifier

rf_pipeline = Pipeline([('feature_engineering', feature_engineering), ('random_forest', RandomForestClassifier())])

# Train the pipeline
rf_pipeline.fit(X = X_train, y = y_train)

#Predict and show accuracy TRAIN
y_pred_TRAIN_RF = rf_pipeline.predict(X_train)
print(metrics.accuracy_score(y_train, y_pred_TRAIN_RF))

#Predict and show accuracy HOLDOUT
y_pred_HOLDOUT_RF = rf_pipeline.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_HOLDOUT_RF))

0.9209138840070299
0.7832167832167832


### Tuning Hyperparameters of our Decision Tree
Time to improve the performance of our learning model by finding its optimal set of hyperparameters.  
We start by examining **what hyperparameters are available** in our decision tree pipeline.

In [51]:
dt_pipeline.get_params()

{'memory': None,
 'steps': [('feature_engineering',
   ColumnTransformer(remainder='passthrough',
                     transformers=[('numerical_scaler', MinMaxScaler(),
                                    ['Pclass', 'Age']),
                                   ('ohe', OneHotEncoder(sparse=False),
                                    ['Sex', 'Embarked'])])),
  ('decision_tree', DecisionTreeClassifier())],
 'verbose': False,
 'feature_engineering': ColumnTransformer(remainder='passthrough',
                   transformers=[('numerical_scaler', MinMaxScaler(),
                                  ['Pclass', 'Age']),
                                 ('ohe', OneHotEncoder(sparse=False),
                                  ['Sex', 'Embarked'])]),
 'decision_tree': DecisionTreeClassifier(),
 'feature_engineering__n_jobs': None,
 'feature_engineering__remainder': 'passthrough',
 'feature_engineering__sparse_threshold': 0.3,
 'feature_engineering__transformer_weights': None,
 'feature_engineering__tr

We would like to tune max_depth and min_samples_split.  
Notice that to access them, we also need to navigate within the composite and call them as *decision_tree__max_depth*.  
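The `<step name>__<parameter>` convention can be tried out directly with get_params and set_params. The sketch below uses a small stand-in pipeline (named `toy_pipeline` here, with a scaler in place of the full feature engineering) purely for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

toy_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("decision_tree", DecisionTreeClassifier()),
])

# Every parameter of the tree is reachable as 'decision_tree__<parameter>'.
tree_params = [name for name in toy_pipeline.get_params() if name.startswith("decision_tree__")]
print(tree_params)

# The same names work for changing values in place.
toy_pipeline.set_params(decision_tree__max_depth=3, decision_tree__min_samples_split=5)
print(toy_pipeline.named_steps["decision_tree"])
```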

In [52]:
# TASK 7: Define a grid through which we should search. Tune parameters: max_depth and min_samples_split.
# The values which you pick for parameters are up to you. You can think about them intuitively.

param_grid = {'decision_tree__max_depth':[3, 4, 5, 6, 7, 8, 9], 
              'decision_tree__min_samples_split':[ 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25] }

In [53]:
from sklearn.model_selection import GridSearchCV

#Searching strategy: exhaustive grid search over param_grid, applied to dt_pipeline
#(cross-validation with the default settings)
tuning = GridSearchCV(dt_pipeline, param_grid)

#Train
tuning.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('feature_engineering',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('numerical_scaler',
                                                                         MinMaxScaler(),
                                                                         ['Pclass',
                                                                          'Age']),
                                                                        ('ohe',
                                                                         OneHotEncoder(sparse=False),
                                                                         ['Sex',
                                                                          'Embarked'])])),
                                       ('decision_tree',
                                        DecisionTreeClassifier())]),
             param_grid={

In [54]:
#Let's get the best parameters
tuning.best_params_

{'decision_tree__max_depth': 6, 'decision_tree__min_samples_split': 5}

In [55]:
# TASK 8: Use the best setting of the two hyperparameters and fit an optimized decision tree. Hint: Reuse the pipeline and specify the parameters when declaring it.
# Store it as dt_pipeline_tuned

dt_pipeline_tuned = Pipeline([('feature_engineering', feature_engineering), 
                              ('decision_tree', DecisionTreeClassifier(max_depth=6, min_samples_split=5))])

# Train
dt_pipeline_tuned.fit(X_train, y_train)

Pipeline(steps=[('feature_engineering',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('numerical_scaler',
                                                  MinMaxScaler(),
                                                  ['Pclass', 'Age']),
                                                 ('ohe',
                                                  OneHotEncoder(sparse=False),
                                                  ['Sex', 'Embarked'])])),
                ('decision_tree',
                 DecisionTreeClassifier(max_depth=6, min_samples_split=5))])
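As a side note, re-declaring the pipeline by hand is not the only option. With its default `refit=True`, GridSearchCV refits the best configuration on the whole training data and exposes it as `best_estimator_`. The sketch below shows this on synthetic data with a reduced grid. The data and the grid values are placeholders, not the ones used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    Pipeline([
        ("scaler", MinMaxScaler()),
        ("decision_tree", DecisionTreeClassifier(random_state=0)),
    ]),
    param_grid={"decision_tree__max_depth": [2, 4, 6]},
    cv=5,
)
search.fit(X_syn, y_syn)

# The winning pipeline, already refitted on all of X_syn.
best_pipeline = search.best_estimator_
best_depth = best_pipeline.named_steps["decision_tree"].max_depth

print(search.best_params_, search.best_score_)
print(best_pipeline.predict(X_syn[:5]))
```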

In [56]:
# TASK 8B: Display accuracy on TRAINING set of the optimized decision tree.

print(metrics.accuracy_score(y_train, dt_pipeline_tuned.predict(X_train)))

0.8506151142355008


In [57]:
# TASK 8C: Display accuracy on HOLDOUT set of the optimized decision tree.
print(metrics.accuracy_score(y_test, dt_pipeline_tuned.predict(X_test)))

0.7342657342657343


Does the optimized decision tree perform better than the one with default parameters?

### Optional Advanced TASK: Tuning Random Forest
When you are tuning a more complex model, it is good practice to search the available literature for which hyperparameters should be tuned. Below I have predefined some. You can play around with the grid, for example by expanding or narrowing it. Keep in mind that, as our feature set is extremely limited, it's hard for hyperparameter tuning to arrive at something meaningful.

In [58]:
# OPTIONAL TASK 9
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

#Define a pipeline
rf_pipeline = Pipeline([('feature_engineering', feature_engineering), ('random_forest', RandomForestClassifier())])

# Parameter grid to search
param_grid_rf = {
    'random_forest__bootstrap': [True, False],
    'random_forest__max_depth': [3, 5, 10, 15],
    'random_forest__max_features': [2, 3],
    'random_forest__min_samples_leaf': [3, 4, 5],
    'random_forest__min_samples_split': [5, 8, 10, 12],
    'random_forest__n_estimators': [5, 10, 15, 20, 25]
}

#Searching strategy, providing grid (cross-validation with the default settings)
tuning_rf = GridSearchCV(rf_pipeline, param_grid_rf)

#Train
tuning_rf.fit(X_train, y_train)

#Cross-validated score (more robust than holdout set most likely)
print(tuning_rf.best_score_)
print(tuning_rf.best_params_)

0.8137866790870982
{'random_forest__bootstrap': True, 'random_forest__max_depth': 15, 'random_forest__max_features': 2, 'random_forest__min_samples_leaf': 3, 'random_forest__min_samples_split': 10, 'random_forest__n_estimators': 20}
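The grid above contains 2 x 4 x 2 x 3 x 4 x 5 = 960 combinations, each of which is fitted once per cross-validation fold. When that becomes too slow, one common alternative is RandomizedSearchCV, which samples a fixed number of combinations instead of trying them all. The sketch below runs it on synthetic data with a bare RandomForestClassifier. The data, the reduced parameter set and `n_iter=10` are arbitrary choices for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

param_distributions = {
    "max_depth": [3, 5, 10, 15],
    "min_samples_leaf": [3, 4, 5],
    "min_samples_split": [5, 8, 10, 12],
    "n_estimators": [5, 10, 15, 20, 25],
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,          # only 10 sampled combinations instead of the full grid
    cv=3,
    random_state=0,     # makes the sampled combinations reproducible
)
random_search.fit(X_syn, y_syn)

print(random_search.best_score_)
print(random_search.best_params_)
```

Inside a pipeline, the parameter names would need the step prefix again, for example `random_forest__max_depth`.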


### Optional Advanced TASK: Check Kaggle competitions and join one of them!  
https://www.kaggle.com/