In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder, StandardScaler
from sklearn import set_config

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In this notebook, we'll use the Titanic dataset to train and tune a sklearn pipeline. We'll define boilerplate code that you can easily reuse in your other projects. 

This notebook is structured in 4 parts:
1. Exploratory Data Analysis (EDA)
2. Machine Learning Pipeline
3. Hyperparameter Tuning
4. Final Model and Submission

In [None]:
# Let's first load the data using pandas
train = pd.read_csv('/kaggle/input/titanic/train.csv')
test = pd.read_csv('/kaggle/input/titanic/test.csv')

## 1. Exploratory Data Analysis - EDA

Then, let's explore the data using pandas and seaborn.

In [None]:
#constants
features = ['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
target = 'Survived'

print(train.shape)
train.head()

In [None]:
sns.pairplot(train, hue="Survived")

In [None]:
pd.DataFrame(np.round(train.isna().sum()/len(train)*100,2), columns=['percentage_missing'])

Cabin is missing for most passengers, so we leave it out of the feature set used below.
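
If you would rather make this decision programmatically, one option is to drop every column whose share of missing values exceeds a cutoff. The sketch below runs on a tiny made-up frame, and both the helper name `drop_sparse_columns` and the 50% cutoff are assumptions chosen for illustration (neither comes from pandas nor from this notebook).

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing_share=0.5):
    """Return a copy of df without columns whose share of NaN exceeds the cutoff."""
    missing_share = df.isna().mean()  # fraction of missing values per column
    keep = missing_share[missing_share <= max_missing_share].index
    return df[keep].copy()


# Toy data for illustration only: 'Cabin' is 75% missing, 'Age' is 25% missing.
toy = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 35.0, 54.0],
        "Cabin": [np.nan, "C85", np.nan, np.nan],
        "Fare": [7.25, 71.28, 8.05, 51.86],
    }
)
reduced = drop_sparse_columns(toy)
print(list(reduced.columns))  # ['Age', 'Fare']
```

In this notebook the same effect is achieved more simply: Cabin is just not listed among the numerical or categorical features.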

## 2. Machine Learning Pipeline

Let's define a sklearn pipeline with an imputer and a scaler for the numerical features, and an imputer and an encoder for the categorical features.

In [None]:
def pipeline(numerical_imputer, numerical_scaler, numerical_features, categorical_imputer, categorical_encoder, categorical_features, estimator):
    numerical_transformer = Pipeline(
        steps=[
            ("numerical_imputer", numerical_imputer),
            ("numerical_scaler", numerical_scaler),
        ]
    )
    categorical_transformer = Pipeline(
        steps=[
            ("categorical_imputer", categorical_imputer),
            ("categorical_encoder", categorical_encoder),
        ]
    )
    preprocessor = ColumnTransformer(
        transformers=[
            ("num", numerical_transformer, numerical_features),
            ("cat", categorical_transformer, categorical_features),
        ]
    )
    clf = Pipeline(  # chain the preprocessor and the estimator into one model
        steps=[("preprocessor", preprocessor), ("classifier", estimator)]
    )
    return clf

In [None]:
#parameters of the pipeline
random_state = 0
numerical_imputer = SimpleImputer(strategy='mean')
numerical_scaler = StandardScaler()
numerical_features = ['Age', 'Fare']
categorical_imputer = SimpleImputer(strategy='most_frequent')
categorical_encoder = OneHotEncoder(handle_unknown='ignore')
categorical_features = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Embarked']
estimator = GradientBoostingClassifier(random_state = random_state)

X_train = train[numerical_features + categorical_features]
X_test = test[numerical_features + categorical_features]
y_train = train[target]

In [None]:
model = pipeline(numerical_imputer = numerical_imputer
                 , numerical_scaler = numerical_scaler
                 , numerical_features = numerical_features  
                 , categorical_imputer = categorical_imputer
                 , categorical_encoder = categorical_encoder
                 , categorical_features = categorical_features
                 , estimator = estimator)

This boilerplate code is very useful, so feel free to copy it for your other projects. You can also use sklearn's set_config to visualize your pipeline.

Our pipeline splits the features into numerical and categorical columns, preprocesses each group, and ends with an estimator (here a GradientBoostingClassifier).
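
To see what the preprocessing part does on its own, here is a minimal sketch on a four-row toy frame (made-up values, not Titanic rows). The step names `imputer`, `scaler` and `encoder` are shortened stand-ins for the ones used in the `pipeline` function above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data for illustration only (not taken from the Titanic files).
toy = pd.DataFrame(
    {
        "Age": [20.0, np.nan, 40.0, 30.0],
        "Sex": ["male", "female", np.nan, "female"],
    }
)

numerical = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())]
)
categorical = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]
)
preprocessor = ColumnTransformer(
    transformers=[("num", numerical, ["Age"]), ("cat", categorical, ["Sex"])]
)

transformed = preprocessor.fit_transform(toy)
# Depending on density the result can be a sparse matrix; convert it if so.
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()

# One scaled Age column plus one column per Sex category.
print(transformed.shape)  # (4, 3)
```

The missing Age is replaced by the column mean, so after scaling it ends up at 0, and the missing Sex is filled with the most frequent value before encoding.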

In [None]:
set_config(display="diagram")
model

In [None]:
scores = cross_val_score(model, X_train, y_train, cv=5, scoring = 'roc_auc')
print("Average CV score:", scores.mean())

That's good: our default pipeline gives us a cross-validated AUC of .86. Let's see if we can improve this through hyperparameter tuning!
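
The mean alone hides how much the score varies from fold to fold. The sketch below reports the spread as well. It runs on synthetic data from `make_classification`, so its numbers are unrelated to the Titanic result, and the sample size, seeds and `n_estimators=20` are arbitrary choices to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; sizes and seeds are arbitrary choices for the demo.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# An explicit splitter makes the shuffling and the seed visible.
demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
demo_scores = cross_val_score(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    X_demo,
    y_demo,
    cv=demo_cv,
    scoring="roc_auc",
)
print("AUC per fold:", demo_scores.round(3))
print("mean=%.3f std=%.3f" % (demo_scores.mean(), demo_scores.std()))
```

Passing `cv=5` with a classifier, as done above, also uses stratified folds, just without shuffling.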

## 3. Hyperparameter Tuning

Let's use sklearn's GridSearchCV to perform hyperparameter tuning on n_estimators. There are many other hyperparameters to optimize so feel free to fork and try to improve the performance! 
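
Grid keys for a pipeline follow the pattern `<step name>__<parameter name>`. A quick way to discover the valid keys is `get_params()`, sketched here on a small stand-in pipeline (the `scaler` step is an assumption for the demo; only the `classifier` step name matches this notebook).

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline(
    steps=[("scaler", StandardScaler()), ("classifier", GradientBoostingClassifier())]
)

# Every tunable parameter is addressed as "<step name>__<parameter name>".
classifier_params = sorted(
    name for name in demo_pipeline.get_params() if name.startswith("classifier__")
)
print(classifier_params)
```

The same convention nests through the preprocessor, so with the step names defined in the `pipeline` function above, the numerical imputation strategy should be reachable as `preprocessor__num__numerical_imputer__strategy`.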

In [None]:
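parameter_grid = {
    "classifier__n_estimators": [250, 500, 750, 1000],
}
# Score with ROC AUC (the default for classifiers is accuracy) so the results are comparable to the AUC above
search = GridSearchCV(model, parameter_grid, cv=5, scoring='roc_auc', n_jobs=2)
search = search.fit(X_train, y_train)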

In [None]:
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
pd.DataFrame(search.cv_results_)[['param_classifier__n_estimators', 'mean_test_score', 'std_test_score', 'rank_test_score']]


## 4. Final Model and Submission

Now that we have a good cross-validation score, let's retrain on the full training set, since models generally benefit from more training data.

Also, increasing the number of estimators while decreasing the learning rate proportionally often gives better results. For example, you can try doubling the optimal n_estimators and halving the learning rate.
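
A minimal sketch of that comparison, on synthetic data so it runs in a few seconds. The starting values in `base_params` are hypothetical (pretend a search picked them), and the scores say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical starting point: pretend a search picked these values.
base_params = {"n_estimators": 50, "learning_rate": 0.1}
# Twice the trees, half the step size.
slow_params = {
    "n_estimators": base_params["n_estimators"] * 2,
    "learning_rate": base_params["learning_rate"] / 2,
}

results = {}
for label, params in [("base", base_params), ("slow", slow_params)]:
    demo_estimator = GradientBoostingClassifier(random_state=0, **params)
    fold_scores = cross_val_score(
        demo_estimator, X_demo, y_demo, cv=5, scoring="roc_auc"
    )
    results[label] = fold_scores.mean()
    print(label, params, "mean AUC = %.3f" % results[label])
```

Whether the slower setting wins depends on the data, so it is worth checking with cross-validation rather than assuming it. Note that the final model below keeps the default learning rate, which is 0.1 for GradientBoostingClassifier.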

In [None]:
final_estimator = GradientBoostingClassifier(random_state = random_state, n_estimators=1000)
final_model = pipeline(numerical_imputer = numerical_imputer
                 , numerical_scaler = numerical_scaler
                 , numerical_features = numerical_features  
                 , categorical_imputer = categorical_imputer
                 , categorical_encoder = categorical_encoder
                 , categorical_features = categorical_features
                 , estimator = final_estimator)
final_scores = cross_val_score(final_model, X_train, y_train, cv=5, scoring = 'roc_auc')
print("Average CV score:", final_scores.mean())

We increased the AUC from .86 to .88, and there are plenty of other parameters to play with, notably:
- learning_rate
- loss
- subsample 
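
As a starting point, here is a sketch of a grid over two of these parameters. It uses synthetic data and a stand-in pipeline, and the candidate values are arbitrary demo choices, not recommendations. `loss` is left out because the accepted names differ between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", GradientBoostingClassifier(n_estimators=30, random_state=0)),
    ]
)

# Arbitrary candidate values chosen for the demo, not recommendations.
demo_grid = {
    "classifier__learning_rate": [0.05, 0.1],
    "classifier__subsample": [0.7, 1.0],
}
demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=3, scoring="roc_auc")
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print("best mean AUC = %.3f" % demo_search.best_score_)
```

To apply this to the notebook's pipeline, the same two keys could be added to `parameter_grid`. Keep in mind that the number of fits grows with the product of the list lengths.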

In [None]:
# get the final predictions
final_model.fit(X=X_train, y=train[target])
predictions = final_model.predict(X_test)

In [None]:
# Prepare the submission and write it to submission.csv, which Kaggle then uses.
# The passenger ids have to come from the test data for the submission to work.
submission = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': predictions})
submission.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

__Hope you liked this notebook! If so, please don't forget to upvote this notebook! Happy learning!__