<p align="center">
<img style="width:80%" src="https://c4.wallpaperflare.com/wallpaper/378/267/803/titanic-ship-cruise-ship-drawing-night-hd-wallpaper-preview.jpg">
</p>

[Image source](https://www.wallpaperflare.com/titanic-ship-cruise-ship-drawing-night-hd-digital-artwork-wallpaper-mzpsf/)

<h1 style="text-align: center; color:#01872A; font-size: 80px;
background:#daf2e1; border-radius: 20px;
">Titanic.<br> Part 3. Model selection.</h1>

## Please use nbviewer to read this notebook to use all its features:

https://nbviewer.org/github/sersonSerson/Projects/blob/master/Classification/Titanic/03%20Model%20selection%20and%20Ensembles.ipynb

# <span style="color:#01872A; display: block; padding:10px; background:#daf2e1;border-radius:20px; text-align: center; font-size: 40px; "> Contents </span>

## 5.	[Feature scaling.](#Step5)
## 6.	[Choose models.](#Step6)
## 7.   [Ensembles of models.](#Step7)
## 8.   [Create submission.](#Step8)
## 9.   [Conclusion.](#Step9)

In [373]:
import pandas as pd

from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

In [374]:
pd.options.display.max_columns = 80
pd.options.display.max_rows = 30
pd.options.display.max_colwidth = 60

In [375]:
import warnings
warnings.filterwarnings('ignore')

## Load data

In [379]:
filled_df = pd.read_csv('data/Preprocessed data.csv', index_col='PassengerId')

<div id="Step5">
</div>

# <span style="color:#01872A; display: block; padding:10px; background:#daf2e1;border-radius:20px; text-align: center; font-size: 40px; "> Step 5. Feature scaling.</span>

## Remove useless features

In [380]:
redundant_columns = ['Name', 'FirstName', 'LastName', 'Ticket', 'Cabin','Title']
filled_df.drop(redundant_columns, axis=1, inplace=True)
filled_df.shape

(1309, 67)

## Split into the train and test sets

In [381]:
# Rows with a known label form the training set.
train_df = filled_df[filled_df['Survived'].notna()]
X_train = train_df.drop('Survived', axis=1)
y_train = train_df['Survived']

# Rows with a missing label form the test set.
test_df = filled_df[filled_df['Survived'].isna()]
X_test = test_df.drop('Survived', axis=1)

## Scale data for non-tree-based models.
1. MinMaxScaler works better for the KNN model.
2. StandardScaler works better for the other models.

In [382]:
scaler = MinMaxScaler()
scaler.fit(X_train)
X_scaled_knn = \
    pd.DataFrame(scaler.transform(X_train), columns=X_train.columns,
                 index=X_train.index)
X_test_scaled_knn = \
    pd.DataFrame(scaler.transform(X_test), columns=X_test.columns,
                 index=X_test.index)

In [383]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = pd.DataFrame(scaler.transform(X_train),
                              columns=X_train.columns,
                              index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns,
                             index=X_test.index)
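
To make the difference between the two scalers concrete, here is a minimal sketch on a toy frame. The column values and the helper `scale` are made up for illustration and are not part of the notebook. As in the cells above, each scaler is fitted on the training rows only and then applied to the test rows.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data: hypothetical values chosen only for illustration.
toy_train = pd.DataFrame({'Age': [20.0, 30.0, 40.0],
                          'Fare': [10.0, 20.0, 90.0]})
toy_test = pd.DataFrame({'Age': [50.0], 'Fare': [5.0]})


def scale(scaler, train, test):
    """Fit the scaler on train only, then transform both frames."""
    scaler.fit(train)
    train_scaled = pd.DataFrame(scaler.transform(train),
                                columns=train.columns, index=train.index)
    test_scaled = pd.DataFrame(scaler.transform(test),
                               columns=test.columns, index=test.index)
    return train_scaled, test_scaled


minmax_train, minmax_test = scale(MinMaxScaler(), toy_train, toy_test)
std_train, std_test = scale(StandardScaler(), toy_train, toy_test)

# MinMaxScaler maps each training column onto [0, 1] ...
print(minmax_train)
# ... but test values outside the training range land outside [0, 1].
print(minmax_test)
# StandardScaler gives each training column zero mean and unit variance.
print(std_train.mean().round(6))
print(std_train.std(ddof=0).round(6))
```

Fitting on the training rows only keeps information about the test set out of the scaling parameters, at the cost of test values occasionally falling outside the training range.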


## <span style="color:#01872A;display: block; font-style: italic;padding:10px; background:#daf2e1;border-radius:20px; text-align: left; font-size: 30px; "> Step 5 results: </span>

1. Removed useless columns from the data.
2. Created train and test sets.
3. Scaled data with different scalers for different models.

<div id="Step6">
</div>

# <span style="color:#01872A; display: block; padding:10px; background:#daf2e1;border-radius:20px; text-align: center; font-size: 40px; "> Step 6. Choose models.</span>


The function fits the model, writes a Kaggle submission file ('PassengerId' and
'Survived' columns) and returns the test-set predictions.

In [384]:
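def generate_submission(model, X, y, X_test,
                        path='data/DecisionTreeSubmission.csv'):
    # Fit on the full training set and predict the test set.
    model.fit(X, y)
    preds = model.predict(X_test).astype(int)
    submission_df = pd.DataFrame({'PassengerId': X_test.index,
                                  'Survived': preds})
    # Every call writes to `path`; pass a different path per model
    # to avoid overwriting the previous model's file.
    submission_df.to_csv(path, index=False)
    print('Ready')
    return preds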

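The function runs a grid search for the model over the given parameter grid, using 2-fold cross-validation, and returns the best parameters and the best score.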

In [385]:
def search_grid(model, grid, X, y):
    grid_search = GridSearchCV(model, param_grid=grid, n_jobs=-1, cv=2)
    grid_search.fit(X, y)
    return grid_search.best_params_, grid_search.best_score_
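
As a rough illustration of what `search_grid` returns, the sketch below runs the same kind of search on synthetic data. The dataset from `make_classification`, the small grid and the `demo_*` names are assumptions for this example, not the notebook's data. For a classifier with default scoring, `best_score_` is the mean cross-validated accuracy of the best parameter combination.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (assumption for this example).
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     random_state=0)

demo_grid = {'max_depth': [2, 4], 'min_samples_leaf': [2, 10]}
demo_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                           param_grid=demo_grid, cv=2)
demo_search.fit(X_demo, y_demo)

# best_params_ holds one value per grid key; best_score_ is the mean
# accuracy over the CV folds for that combination.
print(demo_search.best_params_)
print(demo_search.best_score_)

# Scores for every combination show how close the candidates are.
for params, score in zip(demo_search.cv_results_['params'],
                         demo_search.cv_results_['mean_test_score']):
    print(params, round(score, 3))
```

With `cv=2` each candidate is scored on only two folds, so the estimate tends to be noisier than with more folds; small differences between candidates should be read with that in mind.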

## Decision tree classifier.

In [386]:
dt_grid = {'min_samples_split': [2, 3, 5],
        'max_leaf_nodes': [None, 3, 5, 10, 15],
        'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
        'min_samples_leaf': [2, 3, 4, 5, 10, 15, 20]
        }
dt = DecisionTreeClassifier(random_state=0)
dt_best_params, dt_best_score = search_grid(dt, dt_grid, X_train, y_train)
dt.set_params(**dt_best_params)
print(f'Decision tree best params: {dt_best_params}')
print(f'Decision tree best score: {dt_best_score}')
dt_preds = generate_submission(dt, X_train, y_train, X_test)

Decision tree best params: {'max_depth': 4, 'max_leaf_nodes': 5, 'min_samples_leaf': 2, 'min_samples_split': 2}
Decision tree best score: 0.8226684133622211
Ready


## KNN classifier.

In [387]:
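# Note: `p` is the power parameter of the Minkowski metric, so it is
# not used when the metric is fixed to 'euclidean'.
knn_grid = {'n_neighbors': list(range(1, 30)),
            'metric': ['euclidean'],
            'p': [0.5, 1, 2, 3, 4, 5]}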
knn = KNeighborsClassifier(n_jobs=-1)
knn_best_params, knn_best_score = search_grid(knn, knn_grid, X_scaled_knn,
                                              y_train)
knn.set_params(**knn_best_params)
print(f'KNN best params: {knn_best_params}')
print(f'KNN best score: {knn_best_score}')
knn_preds = generate_submission(knn, X_scaled_knn, y_train, X_test_scaled_knn)

KNN best params: {'metric': 'euclidean', 'n_neighbors': 7, 'p': 0.5}
KNN best score: 0.8069582304630423
Ready
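
The grid above combines `metric='euclidean'` with several values of `p`. In scikit-learn, `p` is the power parameter of the Minkowski metric, so with a fixed Euclidean metric it should have no effect and all values of `p` tie, which would explain why the first value in the list, 0.5, is reported as best. The sketch below checks this on synthetic data; the arrays and the helper `neighbor_distances` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X_demo = rng.rand(40, 3)                      # synthetic features (assumption)
y_demo = (X_demo[:, 0] > 0.5).astype(int)     # synthetic labels (assumption)
X_query = rng.rand(5, 3)


def neighbor_distances(metric, p):
    """Return distances to the 3 nearest neighbors of each query point."""
    model = KNeighborsClassifier(n_neighbors=3, metric=metric, p=p)
    model.fit(X_demo, y_demo)
    distances, _ = model.kneighbors(X_query)
    return distances


# With metric='euclidean', changing p leaves the distances unchanged.
same = np.allclose(neighbor_distances('euclidean', 1),
                   neighbor_distances('euclidean', 3))
# With metric='minkowski', p selects the norm, so the distances differ.
differ = not np.allclose(neighbor_distances('minkowski', 1),
                         neighbor_distances('minkowski', 3))
print(same, differ)
```

To actually tune the norm, one option would be to search over `p` with `metric='minkowski'` and values of `p` of at least 1.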


## Logistic regression with recursive feature elimination

In [388]:
logistic_regression = LogisticRegression(random_state=0, max_iter=100)
kbest = RFECV(logistic_regression, cv=5).fit(X_train_scaled, y_train)
used_features = kbest.get_support()
X_train_scaled_rfe = X_train_scaled.loc[:, used_features]
X_test_scaled_rfe = X_test_scaled.loc[:, used_features]
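
For readers unfamiliar with `RFECV`, here is a small sketch of the same selection step on synthetic data. The dataset, the column names `f0`..`f9` and the variable names are assumptions for illustration. `get_support()` returns a boolean mask with one entry per input column, which can be passed to `.loc` exactly as in the cell above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data: 3 informative features plus noise columns (assumption).
X_arr, y_demo = make_classification(n_samples=300, n_features=10,
                                    n_informative=3, n_redundant=0,
                                    random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(10)])

# Recursively drop the weakest feature and keep the subset with the
# best 5-fold cross-validated score.
selector = RFECV(LogisticRegression(max_iter=500), cv=5).fit(X_demo, y_demo)

mask = selector.get_support()        # boolean array, one entry per column
X_selected = X_demo.loc[:, mask]     # same indexing as in the notebook

print(f'Kept {selector.n_features_} of {X_demo.shape[1]} features:')
print(list(X_selected.columns))
```

The same mask has to be applied to the test frame so that both sets keep identical columns.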

In [389]:
lr_grid = {
        'penalty': ['l2'],
        'solver': ['lbfgs', 'liblinear', 'sag','saga'],
        'C': [0.01, 0.1, 0.3, 0.5, 0.7, 1, 10, 100],
        'max_iter': [500]
            }
lr = LogisticRegression(random_state=0, n_jobs=-1)
lr_best_params, lr_best_score = search_grid(lr, lr_grid, X_train_scaled_rfe, y_train)
lr.set_params(**lr_best_params)
print(f'Logistic regression best params: {lr_best_params}')
print(f'Logistic regression best score: {lr_best_score}')
lr_preds = generate_submission(lr, X_train_scaled_rfe, y_train, X_test_scaled_rfe)

Logistic regression best params: {'C': 100, 'max_iter': 500, 'penalty': 'l2', 'solver': 'sag'}
Logistic regression best score: 0.8181740313397491
Ready


## XGBoost

In [390]:
xgb_grid = {
    'n_estimators': [10, 20, 30],
    'learning_rate': [0.001, 0.1, 0.2, 0.3],
    'colsample_bytree': [0.4, 0.6, 0.8, 1],
    'colsample_bylevel': [0.4, 0.6, 0.8, 1],
    'max_depth': [1, 2, 3, 4, 5]
        }
xgb_cl = XGBClassifier(random_state=0, n_jobs=-1, eval_metric='logloss')
xgb_best_params, xgb_best_score = search_grid(xgb_cl, xgb_grid, X_train,
                                              y_train)
xgb_cl.set_params(**xgb_best_params)
print(xgb_best_params, xgb_best_score)
xgb_preds = generate_submission(xgb_cl, X_train, y_train, X_test)

{'colsample_bylevel': 0.6, 'colsample_bytree': 0.8, 'learning_rate': 0.3, 'max_depth': 5, 'n_estimators': 10} 0.8439965737894897
Ready


## <span style="color:#01872A;display: block; font-style: italic;padding:10px; background:#daf2e1;border-radius:20px; text-align: left; font-size: 30px; "> Step 6 results: </span>

1. Fitted four different models on the data.
2. Found the best parameters for all these models.
3. Generated predictions for each model.

<div id="Step7">
</div>

# <span style="color:#01872A; display: block; padding:10px; background:#daf2e1;border-radius:20px; text-align: center; font-size: 40px; "> Step 7. Ensembles of models.</span>

## Create a DataFrame with predictions

In [391]:
df_preds = pd.DataFrame({'DT': dt_preds,
                          'KNN': knn_preds,
                          'LogReg': lr_preds,
                          'XGB': xgb_preds},
                         index=X_test.index)
df_preds

| PassengerId | DT | KNN | LogReg | XGB |
|---|---|---|---|---|
| 892 | 0 | 0 | 0 | 0 |
| 893 | 0 | 0 | 0 | 1 |
| 894 | 0 | 0 | 0 | 0 |
| 895 | 0 | 0 | 0 | 0 |
| 896 | 1 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... |
| 1305 | 0 | 0 | 0 | 0 |
| 1306 | 0 | 1 | 1 | 1 |
| 1307 | 0 | 0 | 0 | 0 |
| 1308 | 0 | 0 | 0 | 0 |


## Use the simple mean of the predictions as the ensemble prediction.

In [392]:
df_preds['Pred'] = (df_preds['DT']
                    + df_preds['KNN']
                    + df_preds['LogReg']
                    + df_preds['XGB']
                    ) / 4

df_preds['PredRounded'] = df_preds['Pred'].round(0)
df_preds['PredRounded'].value_counts()

0.0    282
1.0    136
Name: PredRounded, dtype: int64
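
One detail of this averaging is worth spelling out: with four models a 2-2 split gives a mean of 0.5, and `round(0)` follows NumPy's round-half-to-even rule, so such ties become 0. The sketch below uses hypothetical votes (not the notebook's predictions) to show this, next to an explicit majority rule that produces the same labels.

```python
import pandas as pd

# Hypothetical votes from four models for five passengers.
votes = pd.DataFrame({'DT':     [0, 0, 0, 1, 1],
                      'KNN':    [0, 0, 0, 1, 1],
                      'LogReg': [0, 0, 1, 1, 1],
                      'XGB':    [0, 1, 1, 0, 1]})

mean_vote = votes.mean(axis=1)     # 0.0, 0.25, 0.5, 0.75, 1.0
rounded = mean_vote.round(0)       # the 2-2 tie (0.5) rounds to 0.0

# Explicit majority rule: predict survival only if more than half of
# the models vote for it, which makes the tie handling visible.
majority = (votes.sum(axis=1) > votes.shape[1] / 2).astype(int)

print(pd.DataFrame({'mean': mean_vote,
                    'rounded': rounded,
                    'majority': majority}))
```

In effect, a passenger is predicted to survive only when at least three of the four models agree.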

In [393]:
df_preds['PredRounded'].head()

PassengerId
892    0.0
893    0.0
894    0.0
895    0.0
896    1.0
Name: PredRounded, dtype: float64

## <span style="color:#01872A;display: block; font-style: italic;padding:10px; background:#daf2e1;border-radius:20px; text-align: left; font-size: 30px; "> Step 7 results: </span>

1. Generated an ensemble prediction for the test dataset.

<div id="Step8">
</div>

# <span style="color:#01872A; display: block; padding:10px; background:#daf2e1;border-radius:20px; text-align: center; font-size: 40px; "> Step 8. Create submission.</span>

In [395]:
submission_df = pd.DataFrame({'PassengerId': X_test.index,
                              'Survived': df_preds['PredRounded'].astype(int)})
submission_df.to_csv('data/FullSubmission.csv', index=False)
submission_df.head()

| PassengerId (index) | PassengerId | Survived |
|---|---|---|
| 892 | 892 | 0 |
| 893 | 893 | 0 |
| 894 | 894 | 0 |
| 895 | 895 | 0 |
| 896 | 896 | 1 |


## <span style="color:#01872A;display: block; font-style: italic;padding:10px; background:#daf2e1;border-radius:20px; text-align: left; font-size: 30px; "> Step 8 results: </span>

1. Created the Kaggle submission.

<div id="Step9">
</div>

# <span style="color:#01872A; display: block; padding:10px; background:#daf2e1;border-radius:20px; text-align: center; font-size: 40px; "> Step 9. Conclusion.</span>

1. A score of 81.3% was enough to finish in the top 2% of contenders.
2. Very extensive feature generation was required.
3. An ensemble of models was used to achieve good results.

## [Part 1. EDA.](https://nbviewer.org/github/sersonSerson/Projects/blob/master/Classification/Titanic/01%20EDA.ipynb)

## [Part 2. Feature engineering.](https://nbviewer.org/github/sersonSerson/Projects/blob/master/Classification/Titanic/02%20Feature%20Engineering.ipynb)