Using the Titanic data:   
    * find the best model and hyperparameters for predicting whether a passenger survives   
    * context: predict the chance that a person survives if the ship they are **about to** board sinks   
    * models to try:   
        * logistic regression, decision tree classifier, KNN classifier   
    * Pick the single best model based on its cross-validation results, then tune that model     
     
Submit to brigita.gems@gmail.com with the subject: algorithm chain   

In [100]:
import numpy as np
import pandas as pd

from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_score, classification_report

# Model classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Dataset

In [43]:
df = pd.read_csv('titanic.csv')
df

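| | sex | age | parch | fare | class | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|
| 0 | male | 22.0 | 0 | 7.2500 | Third | NaN | Southampton | no | False |
| 1 | female | 38.0 | 0 | 71.2833 | First | C | Cherbourg | yes | False |
| 2 | female | 26.0 | 0 | 7.9250 | Third | NaN | Southampton | yes | True |
| 3 | female | 35.0 | 0 | 53.1000 | First | C | Southampton | yes | False |
| 4 | male | 35.0 | 0 | 8.0500 | Third | NaN | Southampton | no | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | male | 27.0 | 0 | 13.0000 | Second | NaN | Southampton | no | True |
| 887 | female | 19.0 | 0 | 30.0000 | First | B | Southampton | yes | True |
| 888 | female | NaN | 2 | 23.4500 | Third | NaN | Southampton | no | False |
| 889 | male | 26.0 | 0 | 30.0000 | First | C | Cherbourg | yes | True |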


In [44]:
df.isna().sum()/df.shape[0]*100

sex             0.000000
age            19.865320
parch           0.000000
fare            0.000000
class           0.000000
deck           77.216611
embark_town     0.224467
alive           0.000000
alone           0.000000
dtype: float64

`deck` **is dropped because most of its values are missing (77%)**
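The decision to drop `deck` can also be expressed as a rule rather than a manual choice. The sketch below is only an illustration on a made-up toy frame: the 50% cutoff (`threshold`) and the names `toy`, `cols_to_drop`, and `toy_clean` are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up for illustration)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

# Percentage of missing values per column
missing_pct = toy.isna().mean() * 100

# Hypothetical cutoff: drop any column with more than 50% missing values
threshold = 50
cols_to_drop = missing_pct[missing_pct > threshold].index.tolist()
toy_clean = toy.drop(columns=cols_to_drop)

print(missing_pct)
print("dropped:", cols_to_drop)
```

With a rule like this, `age` (about 20% missing in the real data) would be kept and imputed, while `deck` (77% missing) would be dropped, which matches the choice made here.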

In [45]:
for i in df.columns[df.dtypes==object]:
    print(f"{i} : {df[i].unique()}")

sex : ['male' 'female']
class : ['Third' 'First' 'Second']
deck : [nan 'C' 'E' 'G' 'D' 'A' 'B' 'F']
embark_town : ['Southampton' 'Cherbourg' 'Queenstown' nan]
alive : ['no' 'yes']


# Preprocessing   

x : all features except deck   
y : alive   

scheme:   
    1. simple imputer (mode) followed by one-hot encoder : sex, embark_town    
    2. KNNImputer : age   
    3. ordinal mapping : class  

In [46]:
# onehot pipeline
onehot_pipe = Pipeline([
    ('modus imputer', SimpleImputer(strategy='most_frequent')),
    ('one hot encoder', OneHotEncoder(drop='first'))
])

In [47]:
# Ordinal Encoding with Map
df['class'] = df['class'].map({'First': 1, 'Second': 2, 'Third':3})

In [48]:
transformer = ColumnTransformer([
    ('one hot', onehot_pipe, ['sex', 'embark_town']),
    ('knn imputer', KNNImputer(n_neighbors=5), ['age'])
], remainder='passthrough')

In [49]:
transformer.fit_transform(df.drop(columns=['deck','alive'])) # try out the transformer

array([[1.0, 0.0, 1.0, ..., 7.25, 3, False],
       [0.0, 0.0, 0.0, ..., 71.2833, 1, False],
       [0.0, 0.0, 1.0, ..., 7.925, 3, True],
       ...,
       [0.0, 0.0, 1.0, ..., 23.45, 3, False],
       [1.0, 0.0, 0.0, ..., 30.0, 1, True],
       [1.0, 1.0, 0.0, ..., 7.75, 3, True]], dtype=object)
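The notebook encodes `class` with a manual `map` on the DataFrame before the transformer is built. An alternative, shown here only as a sketch on toy data, is to keep that step inside the `ColumnTransformer` with `OrdinalEncoder` and an explicit category order. This is a different technique from the one used above, and the names `toy`, `ordinal`, and `toy_transformer` are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Toy data; only the column name `class` and its three levels come from the notebook
toy = pd.DataFrame({
    "class": ["Third", "First", "Second", "First"],
    "fare": [7.25, 71.28, 13.0, 53.1],
})

# Explicit category order: First -> 0, Second -> 1, Third -> 2
# (OrdinalEncoder starts at 0, whereas the manual map starts at 1)
ordinal = OrdinalEncoder(categories=[["First", "Second", "Third"]])

toy_transformer = ColumnTransformer(
    [("ordinal", ordinal, ["class"])],
    remainder="passthrough",
)

encoded = toy_transformer.fit_transform(toy)
print(encoded)
```

Keeping the encoding inside the transformer means the same fitted object can be applied to new raw data without remembering to repeat the `map` step first.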

# Splitting Data

In [50]:
x = df.drop(columns=['deck', 'alive'])
y = np.where(df['alive'] == 'yes', 1, 0)

In [51]:
x_train, x_test, y_train, y_test = train_test_split(
    x,
    y,
    stratify=y,
    random_state=2020
)

# Benchmark

## Confusion Matrix

| Actual (rows) / Predicted (columns) | NOT SURVIVE | SURVIVE |
|---|---|---|
| NOT SURVIVE | TRUE NEGATIVE<br>Passengers who did not survive<br>predicted as not surviving | FALSE POSITIVE<br>Passengers who did not survive<br>predicted as surviving |
| SURVIVE | FALSE NEGATIVE<br>Passengers who survived<br>predicted as not surviving | TRUE POSITIVE<br>Passengers who survived<br>predicted as surviving |

Impact of a **FALSE POSITIVE**: the passenger believes they will survive when in fact they will not, so they still board the ship   
Impact of a **FALSE NEGATIVE**: the passenger believes they will not survive, so they decide not to board

A false positive is the costlier error here. To make a "survive" prediction as trustworthy as possible, **FALSE POSITIVES** must be minimized, so **precision** is used as the scoring metric.
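Precision is the share of predicted survivors who actually survived, TP / (TP + FP), so it rises as false positives fall. The small sketch below uses made-up labels (`y_true` and `y_hat` are illustrative, not taken from the Titanic data) to show how the four confusion-matrix cells relate to `precision_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Made-up labels for illustration: 1 = survived, 0 = did not survive
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Precision computed by hand from the confusion matrix
manual_precision = tp / (tp + fp)

print(tn, fp, fn, tp)
print(manual_precision, precision_score(y_true, y_hat))
```

Note that false negatives do not enter the formula at all, which is why precision is the natural choice when only false positives are considered costly.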

## Cross Validation

In [68]:
models = [LogisticRegression(solver='liblinear', random_state=2020), KNeighborsClassifier(), DecisionTreeClassifier()]
cv_score = []
cv_mean = []
cv_std = []

for i in models:
    skfold = StratifiedKFold(n_splits=5)
    estimator = Pipeline([
        ('preprocess', transformer),
        ('model', i)
    ])
    
    model_cv = cross_val_score(estimator, x_train, y_train, cv=skfold, scoring='precision')
    cv_score.append(model_cv)
    cv_mean.append(model_cv.mean())
    cv_std.append(model_cv.std())

In [69]:
pd.DataFrame({
    'model': ['logreg', 'knn', 'tree'],
    'score': cv_score,
    'mean': cv_mean,
    'std': cv_std
})

| | model | score | mean | std |
|---|---|---|---|---|
| 0 | logreg | [0.7727272727272727, 0.6875, 0.775510204081632... | 0.761895 | 0.042534 |
| 1 | knn | [0.6857142857142857, 0.6170212765957447, 0.583... | 0.647908 | 0.040587 |
| 2 | tree | [0.7555555555555555, 0.7115384615384616, 0.681... | 0.712626 | 0.023905 |


Based on the cross-validation results, **Logistic Regression** has the best mean precision (0.762), although it is slightly less stable than the other two models (it has the highest standard deviation, 0.043).
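One rough way to weigh the higher mean against the lower stability is to look at the mean minus one standard deviation. This is only a heuristic added for illustration, not a formal statistical test; the numbers are copied from the cross-validation table above, and the column name `mean_minus_std` is made up.

```python
import pandas as pd

# Mean and standard deviation copied from the cross-validation table
cv_summary = pd.DataFrame({
    "model": ["logreg", "knn", "tree"],
    "mean": [0.761895, 0.647908, 0.712626],
    "std": [0.042534, 0.040587, 0.023905],
})

# Rough heuristic (an assumption, not a formal test): mean minus one std
cv_summary["mean_minus_std"] = cv_summary["mean"] - cv_summary["std"]

# Rank the models by mean precision, best first
ranked = cv_summary.sort_values("mean", ascending=False).reset_index(drop=True)
print(ranked)
```

Even one standard deviation below its mean, logistic regression (about 0.719) still sits above the decision tree's mean (about 0.713), which supports choosing it despite the larger spread.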

# Model Fitting & Evaluation

## Hyperparameter Tuning

In [70]:
logreg = LogisticRegression(solver='liblinear', random_state=2020)
estimator = Pipeline([
    ('preprocess', transformer),
    ('model', logreg)
])

In [74]:
#estimator.get_params()

In [73]:
hyperparam_space = {
    'preprocess__knn imputer__n_neighbors': [3,5,7,9,11,13], # benchmark 5
    'preprocess__knn imputer__weights': ['uniform', 'distance'], # benchmark uniform
    'model__C': [100, 10, 1, 0.1, 0.01], # benchmark 1
    'model__solver': ['liblinear', 'newton-cg'], # benchmark liblinear
    'model__max_iter': [50, 75, 100, 250, 500, 750], # benchmark 100
}
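Before launching a grid search it helps to know how many fits it implies. The sketch below copies the grid defined above and counts its combinations with `ParameterGrid`; the variable names `space`, `n_candidates`, and `total_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook cell above
space = {
    'preprocess__knn imputer__n_neighbors': [3, 5, 7, 9, 11, 13],
    'preprocess__knn imputer__weights': ['uniform', 'distance'],
    'model__C': [100, 10, 1, 0.1, 0.01],
    'model__solver': ['liblinear', 'newton-cg'],
    'model__max_iter': [50, 75, 100, 250, 500, 750],
}

# Number of distinct parameter combinations: 6 * 2 * 5 * 2 * 6
n_candidates = len(ParameterGrid(space))

# Each candidate is fitted once per fold (5 folds here)
n_splits = 5
total_fits = n_candidates * n_splits

print(n_candidates, total_fits)
```

That is 720 candidates and 3,600 pipeline fits, plus one final refit of the best candidate, which is why running the search with `n_jobs=-1` is worthwhile.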

In [76]:
skfold = StratifiedKFold(n_splits=5)
grid_search = GridSearchCV(
    estimator,
    param_grid = hyperparam_space,
    cv = skfold,
    scoring='precision',
    n_jobs=-1
)

In [77]:
grid_search.fit(x_train, y_train)

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=None, shuffle=False),
             estimator=Pipeline(steps=[('preprocess',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('one '
                                                                         'hot',
                                                                         Pipeline(steps=[('modus '
                                                                                          'imputer',
                                                                                          SimpleImputer(strategy='most_frequent')),
                                                                                         ('one '
                                                                                          'hot '
                                                                                          'encoder

In [78]:
print(grid_search.best_score_)
print(grid_search.best_params_)

0.7921497584541062
{'model__C': 0.01, 'model__max_iter': 50, 'model__solver': 'newton-cg', 'preprocess__knn imputer__n_neighbors': 3, 'preprocess__knn imputer__weights': 'uniform'}


## Before and After Tuning

In [79]:
# before tuning
logreg = LogisticRegression(solver='liblinear', random_state=2020)
estimator = Pipeline([
    ('preprocess', transformer),
    ('model', logreg)
])
estimator.fit(x_train, y_train)

Pipeline(steps=[('preprocess',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('one hot',
                                                  Pipeline(steps=[('modus '
                                                                   'imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('one hot '
                                                                   'encoder',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['sex', 'embark_town']),
                                                 ('knn imputer', KNNImputer(),
                                                  ['age'])])),
                ('model',
                 LogisticRegression(random_state=2020, solver='liblinear'))])

In [105]:
y_pred = estimator.predict(x_test)
print(precision_score(y_test, y_pred))

0.6931818181818182


In [107]:
#print(classification_report(y_test, y_pred))

In [108]:
# after tuning
best_model = grid_search.best_estimator_
best_model.fit(x_train, y_train)

Pipeline(steps=[('preprocess',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('one hot',
                                                  Pipeline(steps=[('modus '
                                                                   'imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('one hot '
                                                                   'encoder',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['sex', 'embark_town']),
                                                 ('knn imputer',
                                                  KNNImputer(n_neighbors=3),
                                                  ['age'])])),
                ('model',
                 Logisti

In [111]:
y_pred_final = best_model.predict(x_test)
print(precision_score(y_test, y_pred_final))

0.7555555555555555


**After hyperparameter tuning, the model's precision on the test set increased from 69.32% to 75.56%**
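A side note on the "after tuning" cell: with the default `refit=True`, `GridSearchCV` already refits the best candidate on the full training data, so the extra `best_model.fit(x_train, y_train)` call is redundant (though harmless). The sketch below illustrates this on synthetic data from `make_classification`. The data, the small grid, and all variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic binary classification data (not the Titanic data)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.1, 1, 10]},
    cv=StratifiedKFold(n_splits=5),
    scoring="precision",
)
search.fit(X_tr, y_tr)

# best_estimator_ is already fitted on X_tr, so no extra fit call is needed;
# search.predict simply delegates to it
pred_from_search = search.predict(X_te)
pred_from_best = search.best_estimator_.predict(X_te)
same = np.array_equal(pred_from_search, pred_from_best)

print(search.best_params_, same)
```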