# BOOSTING EXERCISE

Task: Use titanic.csv to try out boosting models.
* Splitting: 80-20, stratify: y, random state 2020

* Preprocessing: 
>* drop deck
>* Fill missing values (age, embark town) using a simple imputer
>* One-hot encoding: sex, alone, class, embark town

* Evaluation metric: F1 score
* Model selection: Decision Tree Classifier, AdaBoost Classifier, Gradient Boosting Classifier, XGBoost Classifier.
* Tune the hyperparameters of the selected model.
* Summarize the evaluation results and conclude which model is best for titanic.csv.

Pro tip: use pipelines and functions wherever possible.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import category_encoders as ce
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV, train_test_split, StratifiedKFold
from sklearn.metrics import classification_report, f1_score
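# plot_roc_curve and plot_precision_recall_curve were removed in scikit-learn 1.2;
# the Display classes below are their replacements
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay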

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree

import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

In [2]:
# load dataset
df = pd.read_csv('titanic.csv')
df.head(3)

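|   | sex | age | parch | fare | class | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|
| 0 | male | 22.0 | 0 | 7.25 | Third | NaN | Southampton | no | False |
| 1 | female | 38.0 | 0 | 71.2833 | First | C | Cherbourg | yes | False |
| 2 | female | 26.0 | 0 | 7.925 | Third | NaN | Southampton | yes | True |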


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   sex          891 non-null    object 
 1   age          714 non-null    float64
 2   parch        891 non-null    int64  
 3   fare         891 non-null    float64
 4   class        891 non-null    object 
 5   deck         203 non-null    object 
 6   embark_town  889 non-null    object 
 7   alive        891 non-null    object 
 8   alone        891 non-null    bool   
dtypes: bool(1), float64(2), int64(1), object(5)
memory usage: 56.7+ KB


In [4]:
df.describe()

|   | age | parch | fare |
|---|---|---|---|
| count | 714.0 | 891.0 | 891.0 |
| mean | 29.699118 | 0.381594 | 32.204208 |
| std | 14.526497 | 0.806057 | 49.693429 |
| min | 0.42 | 0.0 | 0.0 |
| 25% | 20.125 | 0.0 | 7.9104 |
| 50% | 28.0 | 0.0 | 14.4542 |
| 75% | 38.0 | 0.0 | 31.0 |
| max | 80.0 | 6.0 | 512.3292 |


In [5]:
df.isna().sum()

sex              0
age            177
parch            0
fare             0
class            0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

# 1. Preprocessing

## Drop

In [6]:
# drop the 'deck' column because it has too many missing values (688 of 891)
df = df.drop(columns='deck')

## Pipeline dan Transformer

In [7]:
# pipeline for 'embark_town': impute first, then one-hot encode
embark_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('one hot encoder', OneHotEncoder(drop='first'))
])

# transformer 
transformer = ColumnTransformer([
    ('imputer', SimpleImputer(strategy='median'), ['age']),
    ('embark_pipeline', embark_pipeline, ['embark_town']),
    ('one hot encoder', OneHotEncoder(drop='first'), ['sex','alone','class'])
], remainder='passthrough')
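
To make the effect of the transformer concrete, here is a minimal sketch that applies the same column layout to a few made-up rows. The `toy` frame and the name `toy_transformer` are illustrative assumptions, not data from titanic.csv; the step definitions mirror the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows that mimic the Titanic columns (not taken from titanic.csv)
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male'],
    'age': [22.0, np.nan, 26.0, 35.0],
    'parch': [0, 0, 1, 2],
    'fare': [7.25, 71.28, 7.92, 53.10],
    'class': ['Third', 'First', 'Third', 'Second'],
    'embark_town': ['Southampton', 'Cherbourg', np.nan, 'Queenstown'],
    'alone': [False, False, True, True],
})

embark_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('one hot encoder', OneHotEncoder(drop='first')),
])

toy_transformer = ColumnTransformer([
    ('imputer', SimpleImputer(strategy='median'), ['age']),
    ('embark_pipeline', embark_pipeline, ['embark_town']),
    ('one hot encoder', OneHotEncoder(drop='first'), ['sex', 'alone', 'class']),
], remainder='passthrough')

toy_out = toy_transformer.fit_transform(toy)
if hasattr(toy_out, 'toarray'):      # the result may be sparse, depending on density
    toy_out = toy_out.toarray()
toy_out = np.asarray(toy_out, dtype=float)

# Columns: age (1) + embark_town dummies (2) + sex/alone/class dummies (1 + 1 + 2)
# + passthrough parch and fare (2) = 9
print(toy_out.shape)
print(np.isnan(toy_out).any())       # no missing values remain after imputation
```

With `drop='first'`, each categorical column contributes one fewer dummy column than it has categories, which is why three embark towns become two columns.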

## Split Data

In [8]:
# convert the target ('alive') into a 0/1 column ('label')
df['label'] = np.where(df['alive']=='yes', 1, 0)

# drop the original 'alive' column
df = df.drop(columns='alive')

df.head(3)

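|   | sex | age | parch | fare | class | embark_town | alone | label |
|---|---|---|---|---|---|---|---|---|
| 0 | male | 22.0 | 0 | 7.25 | Third | Southampton | False | 0 |
| 1 | female | 38.0 | 0 | 71.2833 | First | Cherbourg | False | 1 |
| 2 | female | 26.0 | 0 | 7.925 | Third | Southampton | True | 1 |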


In [9]:
df['label'].value_counts()

0    549
1    342
Name: label, dtype: int64

In [10]:
# define X and y
# X is every column except 'label' ('alive' was already dropped above)
X = df.drop(columns='label')
y = df['label']

In [11]:
# split data
# X_train here serves as the combined train + validation set (X_train_val)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    stratify=y,
    test_size=0.2,
    random_state=2020
)
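
As a quick sanity check on what `stratify=y` does, the sketch below repeats the same split on synthetic labels with the class counts reported by `value_counts()` (549 zeros, 342 ones). `X_demo` and `y_demo` are placeholders introduced for illustration; only the split arguments come from the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as df['label'] (549 zeros, 342 ones)
y_demo = np.array([0] * 549 + [1] * 342)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature, values are irrelevant

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    stratify=y_demo,
    test_size=0.2,
    random_state=2020
)

# Sizes of the two partitions
print(len(y_tr), len(y_te))

# Share of the positive class overall, in train, and in test: all three should be close
print(round(y_demo.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Stratifying keeps the positive-class share nearly identical in both partitions, which matters here because F1 is computed on the minority class.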

# 2. Model Selection

This step also covers the data transformation.

* Evaluation metric: F1 score
* Model selection: Decision Tree Classifier, AdaBoost Classifier, Gradient Boosting Classifier, XGBoost Classifier.
* Tune the hyperparameters of the selected model.
* Summarize the evaluation results and conclude which model is best for titanic.csv.

## Define Model

In [12]:

# DecisionTree
tree = DecisionTreeClassifier(max_depth=3)

# Adaboost
ada = AdaBoostClassifier(
    tree,
    n_estimators=200,
    learning_rate=0.1,
    random_state=10 
)

# Gradientboost
gbc = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    random_state=10 
)

# ExtremeGradientBoost
xgbc = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    random_state=10 
)


## Data Transforming and Fitting
Without cross-validation here (it should be used; each model is scored on a single test split only).

In [13]:
# Pipelines that combine the data transformation and model fitting

tree_pipeline = Pipeline([
    ('transformer', transformer),
    ('clf', tree)
])

ada_pipeline = Pipeline([
    ('transformer', transformer),
    ('clf', ada)
])

gbc_pipeline = Pipeline([
    ('transformer', transformer),
    ('clf', gbc)
])

xgbc_pipeline = Pipeline([
    ('transformer', transformer),
    ('clf', xgbc)
])

In [14]:
# Fit on the training data, predict on the test data, and print the report

def classification(model):
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    print('Classification Report')
    print(classification_report(y_test, y_pred))

In [15]:
# DecisionTree
classification(tree_pipeline)

Classification Report
              precision    recall  f1-score   support

           0       0.80      0.85      0.82       110
           1       0.73      0.67      0.70        69

    accuracy                           0.78       179
   macro avg       0.77      0.76      0.76       179
weighted avg       0.77      0.78      0.77       179



In [16]:
# AdaBoost
classification(ada_pipeline)

Classification Report
              precision    recall  f1-score   support

           0       0.78      0.83      0.81       110
           1       0.70      0.64      0.67        69

    accuracy                           0.75       179
   macro avg       0.74      0.73      0.74       179
weighted avg       0.75      0.75      0.75       179



In [17]:
# GradientBoost
classification(gbc_pipeline)

Classification Report
              precision    recall  f1-score   support

           0       0.80      0.93      0.86       110
           1       0.85      0.64      0.73        69

    accuracy                           0.82       179
   macro avg       0.82      0.78      0.79       179
weighted avg       0.82      0.82      0.81       179



In [18]:
# ExtremeGradientBoost
classification(xgbc_pipeline)

Classification Report
              precision    recall  f1-score   support

           0       0.82      0.90      0.86       110
           1       0.81      0.68      0.74        69

    accuracy                           0.82       179
   macro avg       0.81      0.79      0.80       179
weighted avg       0.82      0.82      0.81       179



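## Conclusion Based on F1 Score

### The best model is ExtremeGradientBoost (xgbc)

On this single test split, xgbc has the highest F1 for class 1 (0.74), followed by gbc (0.73), tree (0.70), and ada (0.67).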
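
Because the ranking above rests on one test split, a cross-validated comparison is a more stable basis for choosing a model. The following is a minimal sketch of that idea using `cross_val_score` with `scoring='f1'`. It runs on synthetic data from `make_classification` (an assumption, standing in for the preprocessed training set) and uses fewer estimators than the notebook to stay fast; in the notebook itself one would pass `tree_pipeline`, `ada_pipeline`, `gbc_pipeline`, and `xgbc_pipeline` together with `X_train` and `y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training data (assumption, not titanic.csv)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.6, 0.4], random_state=2020
)

# Smaller versions of the notebook's models; XGBClassifier could be added the same way
candidates = {
    'tree': DecisionTreeClassifier(max_depth=3, random_state=10),
    'ada': AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=3),
        n_estimators=50,
        learning_rate=0.1,
        random_state=10,
    ),
    'gbc': GradientBoostingClassifier(
        n_estimators=50, learning_rate=0.1, max_depth=3, random_state=10
    ),
}

skf = StratifiedKFold(n_splits=5)

# Mean and spread of the F1 score across the 5 folds for each candidate
cv_f1 = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=skf, scoring='f1')
    cv_f1[name] = (scores.mean(), scores.std())
    print(f'{name}: mean F1 = {scores.mean():.3f} (std {scores.std():.3f})')
```

Reporting the standard deviation alongside the mean helps judge whether a gap of 0.01 between two models, like the one between xgbc and gbc above, is meaningful.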

## Model Selection with GridSearch

In [19]:
# pipeline for the transformer/preprocessing and the model
estimator = Pipeline([
    ('transformer', transformer),
    ('clf', tree)
])

# candidate models: the search swaps each one into the 'clf' step
hyperparam_space = {
    'clf':[tree, ada, gbc, xgbc]
}

In [20]:
# stratified k-fold (number of cross-validation folds)
skf = StratifiedKFold(n_splits=5)

# Grid Search
grid_search = GridSearchCV(
    estimator,
    param_grid = hyperparam_space,
    cv = skf,
    n_jobs = -1,
    scoring = 'f1'  
)

In [21]:
# fit data
grid_search.fit(X_train, y_train)

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=None, shuffle=False),
             estimator=Pipeline(steps=[('transformer',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('imputer',
                                                                         SimpleImputer(strategy='median'),
                                                                         ['age']),
                                                                        ('embark_pipeline',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(strategy='most_frequent')),
                                                                                         ('one '
                                                                                       

In [22]:
# show the best score and the best parameters
print('best_score_', grid_search.best_score_)
print('best_params_', grid_search.best_params_)

best_score_ 0.7578484352351057
best_params_ {'clf': GradientBoostingClassifier(n_estimators=200, random_state=10)}
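
`best_score_` only reports the winner. To see how every candidate fared, the fitted search object exposes `cv_results_`. The sketch below shows one way to turn it into a ranking table; it uses synthetic data, a placeholder `StandardScaler` step, and only two candidate models, all of which are assumptions made to keep the example small and self-contained.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption, not titanic.csv)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=2020)

demo_tree = DecisionTreeClassifier(max_depth=3, random_state=10)
demo_gbc = GradientBoostingClassifier(n_estimators=50, random_state=10)

demo_estimator = Pipeline([
    ('scaler', StandardScaler()),  # placeholder for the notebook's transformer
    ('clf', demo_tree),
])

# Each value in the 'clf' list replaces the whole 'clf' step for one candidate
demo_search = GridSearchCV(
    demo_estimator,
    param_grid={'clf': [demo_tree, demo_gbc]},
    cv=StratifiedKFold(n_splits=5),
    scoring='f1',
)
demo_search.fit(X_demo, y_demo)

# One row per candidate model, best rank first
cv = demo_search.cv_results_
summary = pd.DataFrame({
    'model': [type(p['clf']).__name__ for p in cv['params']],
    'mean_f1': cv['mean_test_score'],
    'std_f1': cv['std_test_score'],
    'rank': cv['rank_test_score'],
}).sort_values('rank').reset_index(drop=True)
print(summary)
```

Applied to `grid_search` in this notebook, the same table would list all four candidates rather than only the selected GradientBoostingClassifier.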


In [23]:
# refit the best model from grid_search on the training data
# (redundant: GridSearchCV already refits the best estimator by default)
grid_search.best_estimator_.fit(X_train, y_train)

Pipeline(steps=[('transformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('imputer',
                                                  SimpleImputer(strategy='median'),
                                                  ['age']),
                                                 ('embark_pipeline',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('one hot '
                                                                   'encoder',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['embark_town']),
                                                 ('one hot encoder',
                                                  OneHo

In [24]:
# predict on the test data
y_pred_gbc = grid_search.best_estimator_.predict(X_test)

# check the F1 score
print(classification_report(y_test, y_pred_gbc))

              precision    recall  f1-score   support

           0       0.80      0.93      0.86       110
           1       0.85      0.64      0.73        69

    accuracy                           0.82       179
   macro avg       0.82      0.78      0.79       179
weighted avg       0.82      0.82      0.81       179



## Conclusion: Best Model with Grid Search

The best model is:

GradientBoostingClassifier(n_estimators=200, random_state=10)

# 3. Hyperparameter Tuning on Gradient Boost

In [25]:
# pipeline for the gradient boost model
estimator_gbc = Pipeline([
    ('transformer', transformer),
    ('clf', gbc)
])

# hyperparam space
hyperparam_space=[
    {'clf__learning_rate':[0.1],'clf__n_estimators':[200]},
    {'clf__learning_rate':[0.05],'clf__n_estimators':[400]},
    {'clf__learning_rate':[0.01],'clf__n_estimators':[2000]},
    {'clf__learning_rate':[0.005],'clf__n_estimators':[4000]}
]
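
Writing the space as a list of dictionaries pairs each learning rate with one specific number of estimators instead of crossing all values. The sketch below enumerates the space with `ParameterGrid` to show that it contains exactly four candidates, each with `learning_rate * n_estimators` equal to 20. It also uses `ParameterSampler` to illustrate, as far as I understand scikit-learn's behavior, that asking for 10 samples (the `RandomizedSearchCV` default for `n_iter`) from an all-list space this small simply yields every candidate once.

```python
import warnings
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same space as in the cell above
space = [
    {'clf__learning_rate': [0.1], 'clf__n_estimators': [200]},
    {'clf__learning_rate': [0.05], 'clf__n_estimators': [400]},
    {'clf__learning_rate': [0.01], 'clf__n_estimators': [2000]},
    {'clf__learning_rate': [0.005], 'clf__n_estimators': [4000]},
]

# Enumerate every candidate and the learning_rate * n_estimators product
candidates = list(ParameterGrid(space))
products = []
for params in candidates:
    product = round(params['clf__learning_rate'] * params['clf__n_estimators'], 6)
    products.append(product)
    print(params, '-> learning_rate * n_estimators =', product)

# Request 10 samples, as RandomizedSearchCV does by default
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # scikit-learn warns that the space is smaller than n_iter
    sampled = list(ParameterSampler(space, n_iter=10, random_state=0))

print(len(candidates), len(sampled))
```

In other words, with this space a random search behaves like an exhaustive grid search over four paired settings.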


In [26]:
# stratified k-fold (number of cross-validation folds)
skf = StratifiedKFold(n_splits=5)

# Random Search
# (the space holds only 4 candidates, so the default n_iter=10 ends up evaluating all of them)
random_search = RandomizedSearchCV(
    estimator_gbc,
    param_distributions = hyperparam_space,
    cv = skf,
    n_jobs = -1,
    scoring = 'f1'  
)

In [27]:
# fit the data with random_search
random_search.fit(X_train, y_train)

RandomizedSearchCV(cv=StratifiedKFold(n_splits=5, random_state=None, shuffle=False),
                   estimator=Pipeline(steps=[('transformer',
                                              ColumnTransformer(remainder='passthrough',
                                                                transformers=[('imputer',
                                                                               SimpleImputer(strategy='median'),
                                                                               ['age']),
                                                                              ('embark_pipeline',
                                                                               Pipeline(steps=[('imputer',
                                                                                                SimpleImputer(strategy='most_frequent')),
                                                                                               ('one '
                           

In [28]:
# show the best score and the best parameters for GradientBoost
print('best_score_', random_search.best_score_)
print('best_params_', random_search.best_params_)

best_score_ 0.765613903455513
best_params_ {'clf__n_estimators': 400, 'clf__learning_rate': 0.05}


## Finally, Predict with: 
### The model and parameters after hyperparameter tuning (summarized by best_estimator_)

In [29]:
random_search.best_estimator_

Pipeline(steps=[('transformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('imputer',
                                                  SimpleImputer(strategy='median'),
                                                  ['age']),
                                                 ('embark_pipeline',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('one hot '
                                                                   'encoder',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['embark_town']),
                                                 ('one hot encoder',
                                                  OneHo

In [30]:
y_pred_gbc_tuning = random_search.best_estimator_.predict(X_test)

print(classification_report(y_test, y_pred_gbc_tuning))

              precision    recall  f1-score   support

           0       0.80      0.93      0.86       110
           1       0.84      0.62      0.72        69

    accuracy                           0.81       179
   macro avg       0.82      0.78      0.79       179
weighted avg       0.81      0.81      0.80       179



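# Conclusion

### Hyperparameter tuning did not improve model performance on the test set

All values are F1 scores for class 1 on the test set:

- XGBC without any search: 0.74 (chosen without cross-validation)
- GBC selected by grid search: 0.73
- GBC after tuning with random search: 0.72

The cross-validated F1 rose slightly with tuning (0.758 to 0.766), but that gain did not carry over to the test set.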
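
The task also asks for a summary of the evaluation results. One simple way to present it is a small table built from the class 1 F1 scores printed in the classification reports above. The column names and row labels below are illustrative choices; the numbers are copied from the reports in this notebook.

```python
import pandas as pd

# Test-set F1 for class 1, copied from the classification reports above
summary = pd.DataFrame(
    [
        ('DecisionTree', 'fixed parameters', 0.70),
        ('AdaBoost', 'fixed parameters', 0.67),
        ('GradientBoost', 'fixed parameters, picked by grid search', 0.73),
        ('XGBoost', 'fixed parameters', 0.74),
        ('GradientBoost', 'tuned with random search', 0.72),
    ],
    columns=['model', 'setting', 'test_f1_class_1'],
)

# Best score first
summary = summary.sort_values('test_f1_class_1', ascending=False).reset_index(drop=True)
print(summary)
```

The spread between the three gradient boosting rows is only 0.02, so the ranking among them should be read with caution on a test set of 179 rows.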
