# Overview

The goal of this tutorial is to demonstrate the key components of an end-to-end data science/machine learning project. Note that the focus is on showing the overall workflow, not on building the best-performing model.

The following shows the key steps:

- Load and split train/test data
- Exploratory Data Analysis (EDA)
- Data pre-processing and pipeline
- Model building, evaluation, tuning, and selection
- Feature importance analysis and feature selection
- Model persistence

See other related code and examples (such as Machine Learning Web App via Streamlit, AutoML, MLOps with ClearML, etc.) at https://harrywang.me/mini-ml/

In [None]:
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()  # apply seaborn's default style (the 'seaborn' matplotlib style name is deprecated since matplotlib 3.6)

In [None]:
# read csv data into pandas dataframe
df = pd.read_csv('/kaggle/input/titanic/train.csv')

In [None]:
# basic shape, data type, null values
df.info()

In [None]:
# first 5 lines of data
df.head()

In [None]:
# Prepare the data by separating X and y
# drop unimportant features, such as passenger id, name, ticket number and cabin number
# note that interesting features might be engineered from the dropped features above

# axis = 1 below means dropping by columns, 0 means by rows
X = df.drop(['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
y = df['Survived']
X.info()

In [None]:
# Split the data into a training set and a test set. 
# Any number for the random_state is fine, see 42: https://en.wikipedia.org/wiki/42_(number) 
# We choose to use 20% (test_size=0.2) of the data set as the test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape)
print(X_test.shape)

# Basic EDA
You can show basic descriptive statistics using pandas easily. 

In [None]:
# basic stats
X_train.describe(include='all')

## Histogram
Use a histogram to check the following:

- distribution of the data
- center and spread of the data
- skewness of the data
- presence of outliers
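
As a rough numeric companion to the histogram, the sketch below computes the center, spread, and skewness of a small sample using only the standard library. The `fares` values are made up for illustration (they are not taken from the Titanic data); a positive skewness indicates a long right tail.

```python
import statistics

# Hypothetical values with a long right tail (not taken from the dataset)
fares = [7.0, 7.5, 8.0, 8.0, 9.0, 10.0, 13.0, 26.0, 80.0, 250.0]

mean = statistics.fmean(fares)     # center (sensitive to the tail)
median = statistics.median(fares)  # center (robust to the tail)
std = statistics.pstdev(fares)     # spread (population standard deviation)

# Skewness as the average cubed z-score: > 0 means a long right tail
skewness = sum(((x - mean) / std) ** 3 for x in fares) / len(fares)

print(f'mean={mean:.2f}, median={median:.2f}, std={std:.2f}, skewness={skewness:.2f}')
```

A mean that sits well above the median is another quick hint that the data is right-skewed.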

In [None]:
# histograms for all numerical features
X_train.hist(figsize=(15,15))

In [None]:
# key findings with potential processing
# long right tail (right-skewed): log transformation
# some outliers: outlier removal
X_train['Fare'].hist(bins=100)

## Box Plot

A boxplot displays the dataset based on a five-number summary:

- Median (Q2 / 50th Percentile) : the middle value of the dataset.

- First quartile (Q1 / 25th Percentile) : the middle value between the smallest number and the median of the dataset.

- Third quartile (Q3 / 75th Percentile) : the middle value between the largest number and the median of the dataset.

The interquartile range (IQR) is the distance between the upper and lower quartiles: IQR = Q3 - Q1.
The IQR is used to flag outliers, which are points that lie more than 1.5 × IQR below Q1 or above Q3.

- Minimum (NOT the smallest): the lowest data point excluding any outliers.

- Maximum (NOT the largest): the largest data point excluding any outliers.
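
The quantities above can be computed by hand, which may help when reading a box plot. The following is a minimal sketch using only the standard library; the `ages` list is made up for illustration, and the `inclusive` quantile method is an assumption (plotting libraries may interpolate quartiles slightly differently).

```python
import statistics

# Hypothetical ages (made up for illustration, not taken from the dataset)
ages = [2, 18, 21, 22, 24, 26, 28, 30, 33, 36, 40, 71]

# With n=4, statistics.quantiles returns the three quartile cut points
q1, q2, q3 = statistics.quantiles(ages, n=4, method='inclusive')
iqr = q3 - q1

# Points beyond these fences are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [a for a in ages if a < lower_fence or a > upper_fence]
inliers = [a for a in ages if lower_fence <= a <= upper_fence]

# The box plot "minimum" and "maximum" are the extreme non-outlier points
whisker_min, whisker_max = min(inliers), max(inliers)

print(f'Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}')
print(f'whiskers=({whisker_min}, {whisker_max}), outliers={outliers}')
```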




### A box plot identifies the middle 50% of the data (the box), the median (the line in the box), and the outliers (the dots outside the max and min)

In [None]:
X_train['Age'].plot.box()

## Scatter Plot

A scatter plot is often used for **correlation analysis** between two features. The correlation coefficient lies between -1 and 1: negative values indicate a negative correlation, positive values a positive correlation, and 0 means there is no linear correlation. A relationship is linear if the rate of change between the two variables is constant; otherwise it is non-linear.
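
To make the coefficient concrete, here is a minimal sketch of the Pearson correlation computed by hand with the standard library. The helper name `pearson_r` and all data points are made up for illustration. Note that the last example has a clear (quadratic) relationship but a coefficient of 0, because the relationship is not linear.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1, 2, 3, 4, 5]
r_positive = pearson_r(xs, [2, 4, 6, 8, 10])   # perfectly linear, increasing
r_negative = pearson_r(xs, [10, 8, 6, 4, 2])   # perfectly linear, decreasing
r_nonlinear = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x**2

print(r_positive, r_negative, r_nonlinear)
```

In the notebook itself, `X_train[['Age', 'Fare']].corr()` reports the same kind of coefficient for the real features.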

In [None]:
fig, ax = plt.subplots()
ax.scatter(x=X_train['Age'], y=X_train['Fare'], alpha=0.2) # alpha=0.2 specifies the opacity
ax.set_xlabel('Age')
ax.set_ylabel('Fare')

In [None]:
# pairplot example using seaborn
sns.pairplot(data=X_train)

## Data pre-processing
We will build a pipeline to do some of the following tasks:

- Missing data
- Feature scaling (important for certain models, such as those trained with gradient descent)
- Categorical feature encoding
- Outlier removal
- Transformation
- Custom processing
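
Two of these tasks, transformation and feature scaling, can be illustrated by hand. The sketch below applies a log transformation to a made-up right-skewed sample and then standardizes it to zero mean and unit variance; the `values` list is hypothetical. The population standard deviation is used here because that is what `StandardScaler` uses.

```python
import math
import statistics

# Hypothetical right-skewed values (not taken from the dataset)
values = [7.0, 8.0, 10.0, 13.0, 26.0, 80.0, 250.0]

# Transformation: log1p (log(1 + x)) compresses the long right tail
logged = [math.log1p(v) for v in values]

# Feature scaling: subtract the mean and divide by the standard deviation
mean = statistics.fmean(logged)
std = statistics.pstdev(logged)
standardized = [(v - mean) / std for v in logged]

print([round(v, 2) for v in standardized])
```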

In [None]:
# any missing values?
X_train.isnull().sum()

In [None]:
# We will train our decision tree classifier with the following features:
# Numerical Features: ['Age', 'SibSp', 'Parch', 'Fare']
# Categorical Features: ['Sex', 'Embarked', 'Pclass']

num_features = ['Age', 'SibSp', 'Parch', 'Fare']
cat_features = ['Sex', 'Embarked', 'Pclass']

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Create the preprocessing pipeline for numerical features
# There are two steps in this pipeline
# Pipeline(steps=[(name1, transform1), (name2, transform2), ...]) 
# NOTE the step names can be arbitrary

# Step 1 is what we discussed before - filling the missing values, if any, using the mean
# Step 2 is feature scaling via standardization - rescaling each feature to zero mean and unit variance
# (see standardization: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
num_pipeline = Pipeline(
    steps=[
        ('num_imputer', SimpleImputer()),  # we will tune different strategies later
        ('scaler', StandardScaler()),
        ]
)

# Create the preprocessing pipelines for the categorical features
# There are two steps in this pipeline:
# Step 1: filling the missing values if any using the most frequent value
# Step 2: one hot encoding

cat_pipeline = Pipeline(
    steps=[
        ('cat_imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder()),
    ]
)

# Assign features to the pipelines and combine the two pipelines to form the preprocessor
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num_pipeline', num_pipeline, num_features),
        ('cat_pipeline', cat_pipeline, cat_features),
    ]
)

## Baseline prediction

It's always helpful to have baseline predictions based on heuristics or rules so that you can benchmark your model's performance. The following shows that female passengers had a much higher survival rate than male passengers, so a simple rule-based baseline is to predict that all females survived and all males died. For a regression problem, an easy baseline is to use the training sample mean for all predictions.
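
The regression baseline mentioned above is not used in this notebook, so here is a minimal sketch of the idea with made-up numbers. The names `y_train_reg` and `y_test_reg` are hypothetical, and mean absolute error (MAE) is just one possible metric for the comparison.

```python
import statistics

# Hypothetical regression targets (made up for illustration)
y_train_reg = [10.0, 12.0, 14.0, 20.0]
y_test_reg = [11.0, 15.0, 19.0]

# Baseline: predict the training mean for every test sample
baseline_value = statistics.fmean(y_train_reg)
baseline_preds = [baseline_value] * len(y_test_reg)

# Mean absolute error of the baseline; a useful model should beat this
abs_errors = [abs(t - p) for t, p in zip(y_test_reg, baseline_preds)]
baseline_mae = statistics.fmean(abs_errors)

print(f'baseline prediction={baseline_value}, baseline MAE={baseline_mae}')
```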

In [None]:
# calculate the survival rates by gender
# female survival rate: 74.2%
# male survival rate: 18.9%
group_norm = df.groupby('Sex')['Survived'].value_counts(normalize=True)
group_norm

In [None]:
X_test.head()

In [None]:
# rule-based prediction
baseline_pred = X_test['Sex'].apply(lambda x: 0 if x == 'male' else 1)

In [None]:
from sklearn.metrics import accuracy_score
print(f'Baseline Accuracy Score : {accuracy_score(y_test, baseline_pred)}')

## Model training, tuning, evaluation and selection

Next, I attach three different models (Decision Tree, SVC, Random Forest) to the same pre-processing pipeline and tune some of their parameters using GridSearch with cross-validation. Then, we compare their performance and choose the best model to proceed.
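
To get a feel for what a grid search does, the sketch below enumerates every combination of a parameter grid with `itertools.product`. The grid mirrors the decision tree grid used in this section; the enumeration itself is only an illustration of the idea, not scikit-learn's internal implementation.

```python
from itertools import product

# Keys are the "full path" to a parameter: step names joined by double underscores
param_grid = {
    'preprocessor__num_pipeline__num_imputer__strategy': ['mean', 'median'],
    'clf_dt__criterion': ['gini', 'entropy'],
    'clf_dt__max_depth': [3, 4, 5, 6, 7],
}

# Every combination of one value per parameter
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 10
n_fits = len(combinations) * n_folds  # each combination is fitted once per fold

print(f'{len(combinations)} combinations, {n_fits} cross-validation fits')
print(combinations[0])
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best combination on the whole training set.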

In [None]:
# Specify the model to use, which is DecisionTreeClassifier
# Make a full pipeline by combining preprocessor and the model
from sklearn.tree import DecisionTreeClassifier

pipeline_dt = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('clf_dt', DecisionTreeClassifier()),
    ]
)

In [None]:
# we show how to use GridSearch with K-fold cross validation (K=10) to fine-tune the model
# we use accuracy as the scoring metric; pass return_train_score=True to GridSearchCV to also record training scores
from sklearn.model_selection import GridSearchCV

# set up the values of hyperparameters you want to evaluate
# parameter names must give the "full path" to the parameter: the step names joined by two underscores, followed by the parameter name

# we are trying 2 different imputer strategies
# 2x5 different decision tree models with different parameters
# in total we are trying 2x2x5 = 20 different combinations

param_grid_dt = [
    {
        'preprocessor__num_pipeline__num_imputer__strategy': ['mean', 'median'],
        'clf_dt__criterion': ['gini', 'entropy'], 
        'clf_dt__max_depth': [3, 4, 5, 6, 7],
    }
]

# set up the grid search 
grid_search_dt = GridSearchCV(pipeline_dt, param_grid_dt, cv=10, scoring='accuracy')

In [None]:
# train the model using the full pipeline
grid_search_dt.fit(X_train, y_train)

In [None]:
# check the best performing parameter combination
grid_search_dt.best_params_

In [None]:
# built-in CV results keys
sorted(grid_search_dt.cv_results_.keys())

In [None]:
# mean cross-validation score for each of the 20 decision tree models
grid_search_dt.cv_results_['mean_test_score']

In [None]:
# best decision tree model cross-validation score
grid_search_dt.best_score_

In [None]:
# try SVM classifier
from sklearn.svm import SVC

# SVC pipeline
pipeline_svc = Pipeline([
    ('preprocessor', preprocessor),
    ('clf_svc', SVC(probability=True)),  # we need the probability scores later
])

# here we are trying three different kernels and three degree values for the polynomial kernel
# the grid has 3x3 = 9 combinations, but only 5 are distinct because degree is ignored by the linear and rbf kernels
param_grid_svc = [
    {
        'clf_svc__kernel': ['linear', 'poly', 'rbf'], 
        'clf_svc__degree': [3, 4, 5],  # only for poly kernel
    }
]

# set up the grid search 
grid_search_svc = GridSearchCV(pipeline_svc, param_grid_svc, cv=10, scoring='accuracy')

In [None]:
# train the model using the full pipeline
grid_search_svc.fit(X_train, y_train)

In [None]:
# best cross-validation score
grid_search_svc.best_score_

In [None]:
# try random forest classifier
from sklearn.ensemble import RandomForestClassifier

# rf pipeline
pipeline_rf = Pipeline([
    ('preprocessor', preprocessor),
    ('clf_rf', RandomForestClassifier()),
])

# here we are trying 2x3 different rf models
param_grid_rf = [
    {
        'clf_rf__criterion': ['gini', 'entropy'], 
        'clf_rf__n_estimators': [50, 100, 150],  
    }
]

# set up the grid search 
grid_search_rf = GridSearchCV(pipeline_rf, param_grid_rf, cv=10, scoring='accuracy')

In [None]:
%%time
# train the model using the full pipeline
grid_search_rf.fit(X_train, y_train)

In [None]:
# best cross-validation scores
print('best dt score is: ', grid_search_dt.best_score_)
print('best svc score is: ', grid_search_svc.best_score_)
print('best rf score is: ', grid_search_rf.best_score_)

In [None]:
# select the best model
# the best parameters are shown; note that SimpleImputer() implies that the mean strategy is used
clf_best = grid_search_dt.best_estimator_
clf_best

In [None]:
# final test on the testing set
# To predict on new data, simply call the predict method
# the full pipeline steps will be applied to the testing set followed by the prediction
y_pred = clf_best.predict(X_test)

# calculate accuracy. Note: y_test is the ground truth for the testing set
# we have a similar score for the testing set as the cross-validation score - good

print(f'Accuracy Score : {accuracy_score(y_test, y_pred)}')

In [None]:
# plot the confusion matrix
plt.style.use('default')  # use the default style for confusion matrix plots

# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it
from sklearn.metrics import ConfusionMatrixDisplay

# the default confusion matrix uses the default label order (ascending order, 0, 1, etc.)
# by default the True Positive is in the bottom right quadrant
# matrix = ConfusionMatrixDisplay.from_estimator(clf_best, X_test, y_test)

# plot the confusion matrix with preferred label order and label name
# true positive in the upper left quadrant
class_names = ['1: Survived', '0: Died']
matrix = ConfusionMatrixDisplay.from_estimator(clf_best, X_test, y_test, labels=[1, 0], display_labels=class_names)

# matrix.confusion_matrix holds the confusion matrix as an array
print(f'Confusion Matrix: \n {matrix.confusion_matrix}')

In [None]:
# our model is better than the baseline - good
print(f'Baseline Accuracy Score : {accuracy_score(y_test, baseline_pred)}')
print(f'Our Best Accuracy Score : {accuracy_score(y_test, y_pred)}')

## Precision-Recall Trade-off and ROC/AUC
I chose the best model based on the accuracy score above. In classification, the final prediction for a data point is actually derived from probability scores. For example, for one person in this Titanic dataset, a prediction looks like `[0.35, 0.65]`, which means the predicted probability for this person to be 0 (died) is 0.35 and the predicted probability to be 1 (survived) is 0.65. The default decision threshold is 0.5 for the decision tree classifier, so this person is predicted to be 1 (survived). However, if we change the threshold to 0.7, the same person would be predicted to be 0 (died).
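
The threshold example can be written out in a few lines. This is a minimal sketch; the helper `predict_with_threshold` is a hypothetical name introduced here for illustration and is not part of scikit-learn.

```python
def predict_with_threshold(proba_survived, threshold=0.5):
    """Return 1 (survived) if the positive-class probability exceeds the threshold."""
    return 1 if proba_survived > threshold else 0

proba = [0.35, 0.65]  # [P(died), P(survived)] for one passenger

pred_default = predict_with_threshold(proba[1])                # threshold 0.5
pred_strict = predict_with_threshold(proba[1], threshold=0.7)  # threshold 0.7

print(pred_default, pred_strict)
```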

Changing the decision threshold often leads to changes in precision and recall. Increasing precision often decreases recall and vice versa, which is called precision-recall trade-off. Given a specific context, you may favor precision over recall or the other way around. 
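
The trade-off can be seen on a tiny example. In the sketch below, the labels and scores are made up for illustration, and `precision_recall` is a hypothetical helper; raising the threshold from 0.5 to 0.7 increases precision but lowers recall.

```python
# Hypothetical ground truth and predicted probabilities of the positive class
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.30, 0.10]

def precision_recall(y_true, scores, threshold):
    """Compute (precision, recall) when predicting 1 for scores above the threshold."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

pr_at_05 = precision_recall(y_true, scores, threshold=0.5)
pr_at_07 = precision_recall(y_true, scores, threshold=0.7)

print(f'threshold 0.5: precision={pr_at_05[0]}, recall={pr_at_05[1]}')
print(f'threshold 0.7: precision={pr_at_07[0]}, recall={pr_at_07[1]}')
```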

The Receiver Operating Characteristic (ROC) curve is another way to evaluate classifier output quality: it plots Recall (True Positive Rate, TPR) against the False Positive Rate (FPR) across decision thresholds, and the Area Under the Curve (AUC) summarizes the curve as a single number. For classification problems with very imbalanced data (such as COVID-19 testing data, where far more people test negative than positive), the default threshold can result in poor model performance. **ROC/AUC is often a better metric than accuracy for imbalanced data.**
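
For intuition, the following sketch computes one point of a ROC curve and the AUC by hand on made-up data. It relies on the fact that the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties count as half). The helper names `tpr_fpr` and `auc_by_ranking` are hypothetical.

```python
# Hypothetical ground truth and predicted probabilities of the positive class
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.30, 0.10]

def tpr_fpr(y_true, scores, threshold):
    """One point of the ROC curve: (true positive rate, false positive rate)."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, fp / n_neg

def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

roc_point = tpr_fpr(y_true, scores, threshold=0.5)
auc = auc_by_ranking(y_true, scores)

print(f'TPR={roc_point[0]}, FPR={roc_point[1]}, AUC={auc}')
```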

Next, I show the AUC scores for the three different models and plot the ROC curves.

In [None]:
# get the probability scores for the decision tree model
# for each prediction, we have two probabilities for two labels 0 means died, 1 means survived
y_pred_proba_dt = grid_search_dt.best_estimator_.predict_proba(X_test)
print(y_pred_proba_dt[0])  # 0.881 died, 0.118 survived
y_scores_dt = y_pred_proba_dt[:, 1]  # this is the score of positive class

In [None]:
# get the probability scores for svc and random forest
y_pred_proba_svc = grid_search_svc.best_estimator_.predict_proba(X_test)
y_scores_svc = y_pred_proba_svc[:, 1]  # this is the score of positive class

y_pred_proba_rf = grid_search_rf.best_estimator_.predict_proba(X_test)
y_scores_rf = y_pred_proba_rf[:, 1]  # this is the score of positive class

In [None]:
# fpr: false positive rate, tpr: true positive rate (recall)
# random forest is the best model according to AUC score and its ROC curve is closer to the top-left corner
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

fpr_dt, tpr_dt, thresholds_dt = roc_curve(y_test, y_scores_dt)
fpr_svc, tpr_svc, thresholds_svc = roc_curve(y_test, y_scores_svc)
fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_test, y_scores_rf)

print(f'AUC score for dt is {roc_auc_score(y_test, y_scores_dt)}')
print(f'AUC score for svc is {roc_auc_score(y_test, y_scores_svc)}')
print(f'AUC score for rf is {roc_auc_score(y_test, y_scores_rf)}')

# plot the ROC Curve
sns.set_theme()  # switch back to the seaborn style (the 'seaborn' matplotlib style name is deprecated since matplotlib 3.6)
fig, ax = plt.subplots()

ax.plot(fpr_dt, tpr_dt, label="Decision Tree")
ax.plot(fpr_svc, tpr_svc, label="SVC")
ax.plot(fpr_rf, tpr_rf, label="Random Forest")
ax.set_title('ROC Curve')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate (Recall)')
ax.legend()

## Feature Importance

Given that we are using a pipeline and one-hot encoding, the feature importance scores are not straightforward to get. The following code shows how to get the feature importance scores from the decision tree model and create a plot.
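
The core difficulty is that one-hot encoding replaces each categorical column with several new columns, so the importance array is longer than the original feature list. The sketch below shows the name-matching idea in plain Python. The categories and importance values are made up for illustration, and the `column_category` naming is assumed to match what the encoder produces.

```python
num_features = ['Age', 'SibSp', 'Parch', 'Fare']

# Hypothetical categories learned by the one-hot encoder for each column
categories = {
    'Sex': ['female', 'male'],
    'Embarked': ['C', 'Q', 'S'],
    'Pclass': [1, 2, 3],
}

# One new column per (column, category) pair, in encoder order
cat_new_feature_names = [f'{col}_{cat}' for col, cats in categories.items() for cat in cats]

# Numerical features come first because their pipeline is listed first
feature_names = num_features + cat_new_feature_names

# Hypothetical importance scores, one per transformed column
importances = [0.10, 0.05, 0.0, 0.08, 0.30, 0.25, 0.0, 0.0, 0.02, 0.05, 0.0, 0.15]

ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:3]:
    print(f'{name}: {score}')
```

The order of `feature_names` must match the order of the transformers in the `ColumnTransformer`, otherwise the scores are paired with the wrong names.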

In [None]:
clf_best.named_steps

In [None]:
clf_best.named_steps['preprocessor']

In [None]:
i = clf_best['clf_dt'].feature_importances_
i

In [None]:
clf_best['preprocessor'].transformers_

In [None]:
# get columnTransformer
clf_best[0] 

In [None]:
clf_best[0].transformers_

In [None]:
num_original_feature_names = clf_best[0].transformers_[0][2]
num_original_feature_names

In [None]:
cat_original_feature_names = clf_best[0].transformers_[1][2]
cat_original_feature_names

In [None]:
# get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out instead
cat_new_feature_names = list(clf_best[0].transformers_[1][1]['onehot'].get_feature_names_out(cat_original_feature_names))
cat_new_feature_names

In [None]:
feature_names = num_original_feature_names + cat_new_feature_names
feature_names

In [None]:
r = pd.DataFrame(i, index=feature_names, columns=['importance'])
r

In [None]:
r.sort_values('importance', ascending=False)

In [None]:
r.sort_values('importance', ascending=False).plot.bar()

In [None]:
# we remove the most important feature Sex and see how the model is affected
# result: accuracy drops from ~0.826 to ~0.716
num_features = ['Age', 'SibSp', 'Parch', 'Fare']
cat_features = ['Embarked', 'Pclass']

# you must update the preprocessor and pipeline after changing the feature list
preprocessor = ColumnTransformer(
    transformers=[
        ('num_pipeline', num_pipeline, num_features),
        ('cat_pipeline', cat_pipeline, cat_features),
    ]
)

pipeline_dt = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('clf_dt', DecisionTreeClassifier()),
    ]
)

# update the grid search 
grid_search_dt_updated = GridSearchCV(pipeline_dt, param_grid_dt, cv=10, scoring='accuracy')

# train the model using the updated full pipeline
grid_search_dt_updated.fit(X_train, y_train)  # note: X_train still has 7 features, but only 6 are used

print('best dt score is: ', grid_search_dt.best_score_)
print('best dt score after feature selection is: ', grid_search_dt_updated.best_score_)

In [None]:
# we remove unimportant features (Parch and Embarked) and see how the model is affected
# result: no difference with fewer features!!
num_features = ['Age', 'SibSp', 'Fare']
cat_features = ['Sex', 'Pclass']

# you must update the preprocessor and pipeline after changing the feature list
preprocessor = ColumnTransformer(
    transformers=[
        ('num_pipeline', num_pipeline, num_features),
        ('cat_pipeline', cat_pipeline, cat_features),
    ]
)

pipeline_dt = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('clf_dt', DecisionTreeClassifier()),
    ]
)

# update the grid search 
grid_search_dt_updated = GridSearchCV(pipeline_dt, param_grid_dt, cv=10, scoring='accuracy')

# train the model using the updated full pipeline
grid_search_dt_updated.fit(X_train, y_train)  # note: X_train still has 7 features, but only 5 are used

print('best dt score is: ', grid_search_dt.best_score_)
print('best dt score after feature selection is: ', grid_search_dt_updated.best_score_)

In [None]:
# we need to re-split the data so that X_train contains only the 5 selected features instead of 7

# drop 'Parch', 'Embarked'
X = df.drop(['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin', 'Parch', 'Embarked'], axis=1)
y = df['Survived']

# re-split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# updated feature lists
num_features = ['Age', 'SibSp', 'Fare']
cat_features = ['Sex', 'Pclass']

# updated preprocessor and pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num_pipeline', num_pipeline, num_features),
        ('cat_pipeline', cat_pipeline, cat_features),
    ]
)

pipeline_dt = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('clf_dt', DecisionTreeClassifier()),
    ]
)

# update the grid search
grid_search_dt_updated = GridSearchCV(pipeline_dt, param_grid_dt, cv=10, scoring='accuracy')

# train the model using the updated full pipeline
grid_search_dt_updated.fit(X_train, y_train)  # X_train now contains only the 5 selected features

print('best dt score is: ', grid_search_dt.best_score_)
print('best dt score after feature selection is: ', grid_search_dt_updated.best_score_)

In [None]:
# reassign the best model to have only 5 features
clf_best = grid_search_dt_updated.best_estimator_

## Persist the Model
The following code shows how to save the trained model as a pickle file, which can be loaded later to make predictions.
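
The save/load round trip can be tried without a trained model. The sketch below uses the standard library's `pickle` module, on which `joblib` builds, and a plain dictionary as a stand-in for the fitted pipeline; `model_stub` and the file name are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model (any picklable Python object works the same way)
model_stub = {'rule': 'predict_died_if_male', 'threshold': 0.5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model-stub.pickle')

    # Save: serialize the object to a binary file
    with open(path, 'wb') as f:
        pickle.dump(model_stub, f)

    # Load: rebuild an equal (but separate) object from the file
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == model_stub, restored is model_stub)
```

Only load pickle files from sources you trust, because unpickling can execute arbitrary code. For a real model, loading with the same scikit-learn version that was used for saving avoids compatibility surprises.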

In [None]:
# Save the model as a pickle file
import joblib
joblib.dump(clf_best, "clf-best.pickle")

In [None]:
# Load the model from a pickle file
saved_tree_clf = joblib.load("clf-best.pickle")
saved_tree_clf

In [None]:
passenger1 = pd.DataFrame(
    {
        'Pclass': [3],
        'Sex': ['male'], 
        'Age': [23],
        'SibSp': [0],
        'Fare': [5.5],
    }
)
passenger1

In [None]:
passenger2 = pd.DataFrame(
    {
        'Pclass': [1],
        'Sex': ['female'], 
        'Age': [21],
        'SibSp': [0],
        'Fare': [80],
    }
)
passenger2

In [None]:
# died
pred1 = saved_tree_clf.predict(passenger1)
pred1

In [None]:
# survived
pred2 = saved_tree_clf.predict(passenger2)
pred2