# Table of contents
+ [Load dataset](#Load-dataset)
+ [Cleaning data and Feature engineering](#Cleaning-data)
+ [Classifiers](#Classifiers)
  + [Logistic regression](#Model:-LogisticRegression)
  + [DecisionTree classifier](#Model:-DecisionTreeClassifier)
  + [RandomForest classifier](#Model:-RandomForestClassifier)
  + [SVM](#Model:-SVM)
  + [KNeighbors classifier](#Model:-KNeighborsClassifier)
  + [ExtraTrees classifier](#Model:-ExtraTreesClassifier)
  + [LightGBM classifier](#Model:-LGBMClassifier)

I've worked on several notebooks for this competition, and my final submission score has been around 0.78824.
In this notebook I try to improve that score using concepts learned from other notebooks in this competition.
I've linked a few of them for reference:
+ [Ensemble-learning meta-classifier for stacking](https://www.kaggle.com/remekkinas/ensemble-learning-meta-classifier-for-stacking)
+ [TPS04 - SVM with scikit-learn-intelex](https://www.kaggle.com/napetrov/tps04-svm-with-scikit-learn-intelex)
+ _will add more as and when I find them..._

In [None]:
# Import libraries
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

import warnings
warnings.simplefilter('ignore')

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Load dataset

In [None]:
# Load the datasets
train = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/train.csv', index_col='PassengerId')
test = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/test.csv', index_col='PassengerId')
submission = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/sample_submission.csv')

In [None]:
submission.head()

In [None]:
# Check the first 5 rows of the train set
train.head()

In [None]:
# Check the first 5 rows of the test set
test.head()

Now let's preprocess the data so that it is clean and usable by our classifier models.

## Cleaning data

In [None]:
# Check for missing values
train.isnull().sum()

In [None]:
# Do the same check for the test set
test.isnull().sum()

We'll handle missing values as follows:
+ Age - Impute with the mean value (each set uses its own mean)
+ Ticket - Keep only the first word if the ticket has more than one word; otherwise (including missing values) use 'X'
+ Fare - Impute with the mean fare of the passenger's Pclass group (computed on the train set), then apply a `log1p` transform
+ Cabin - Impute with 'X' and keep only the first letter
+ Embarked - Impute with 'X'

_The same imputations are applied to both the train and test sets._
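To make the Ticket and Cabin rules above concrete, here is a minimal sketch on a toy frame. The values in `toy` are made up for illustration and are not taken from the competition data; the lambdas mirror the rules listed above.

```python
import pandas as pd

# Toy values (hypothetical, not taken from the competition data)
toy = pd.DataFrame({
    "Ticket": ["PC 17599", "113803", None, "STON/O2. 3101282"],
    "Cabin": ["C85", None, "E46", None],
})

# Ticket: keep the first word when there is more than one word, else 'X'
ticket_prefix = toy["Ticket"].fillna("X").map(
    lambda x: str(x).split()[0] if len(str(x).split()) > 1 else "X"
)

# Cabin: missing values become 'X'; otherwise keep only the first letter
cabin_letter = toy["Cabin"].fillna("X").map(lambda x: x[0])

print(ticket_prefix.tolist())  # ['PC', 'X', 'X', 'STON/O2.']
print(cabin_letter.tolist())   # ['C', 'X', 'E', 'X']
```

Note that a single-word ticket such as `113803` ends up in the same 'X' bucket as a missing ticket.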

In [None]:
# Age column
train['Age'] = train['Age'].fillna(train['Age'].mean())
test['Age'] = test['Age'].fillna(test['Age'].mean())

# Ticket column
train['Ticket'] = train['Ticket'].fillna('X').map(lambda x: str(x).split()[0] if len(str(x).split()) > 1 else 'X')
test['Ticket'] = test['Ticket'].fillna('X').map(lambda x: str(x).split()[0] if len(str(x).split()) > 1 else 'X')


In [None]:
# Fare column
fare_map = train[['Fare', 'Pclass']].dropna().groupby('Pclass').mean().to_dict()
train['Fare'] = train['Fare'].fillna(train['Pclass'].map(fare_map['Fare']))
train['Fare'] = np.log1p(train['Fare'])
test['Fare'] = test['Fare'].fillna(test['Pclass'].map(fare_map['Fare']))
test['Fare'] = np.log1p(test['Fare'])

In [None]:
# Cabin column
train['Cabin'] = train['Cabin'].fillna('X').map(lambda x: x[0].strip())
test['Cabin'] = test['Cabin'].fillna('X').map(lambda x: x[0].strip())

# Embarked column
train['Embarked']  = train['Embarked'].fillna('X')
test['Embarked'] = test['Embarked'].fillna('X')

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

Great, we've cleared our first hurdle: handling missing values.

In [None]:
# Check train top 5 rows
train.head()

As we can see, some categorical features contain strings, so our next hurdle is converting them to numeric values.
We'll use LabelEncoder and one-hot encoding (via `pd.get_dummies`).

Also, we'll scale numerical features using StandardScaler.

In [None]:
# Drop the Name column since we won't use it
train.drop('Name', axis=1, inplace=True)
test.drop('Name', axis=1, inplace=True)

In [None]:
# Define type of columns
num_cols = ['Age', 'Fare']
label_cols = ['Pclass', 'SibSp', 'Parch', 'Ticket', 'Cabin']
ohe_cols = ['Sex', 'Embarked']

In [None]:
# Handle label columns
for col in label_cols:
    le = LabelEncoder()
    train[col] = le.fit_transform(train[col])
    test[col] = le.transform(test[col])
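One caveat with fitting `LabelEncoder` on the train set only: `transform` raises a `ValueError` if the test set contains a label that never appeared in train. Whether that happens depends on the data. A minimal sketch of one possible workaround, fitting on the union of both sets, is shown below on hypothetical toy frames (`toy_train` and `toy_test` are made up for illustration).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames (hypothetical values): label 'C' appears only in the test set
toy_train = pd.DataFrame({"Cabin": ["A", "B", "X", "B"]})
toy_test = pd.DataFrame({"Cabin": ["B", "C", "X"]})

le = LabelEncoder()
# Fit on the union of train and test values so every label gets a code
le.fit(pd.concat([toy_train["Cabin"], toy_test["Cabin"]]))

train_codes = le.transform(toy_train["Cabin"])
test_codes = le.transform(toy_test["Cabin"])

print(le.classes_.tolist())  # ['A', 'B', 'C', 'X']
print(train_codes.tolist())  # [0, 1, 3, 1]
print(test_codes.tolist())   # [1, 2, 3]
```

This only uses the test set's label vocabulary, not its target, but it is still a design choice worth making deliberately.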

In [None]:
# Handle ohe columns
ohe_encoded_train_df = pd.get_dummies(train[ohe_cols], drop_first=True)
ohe_encoded_test_df = pd.get_dummies(test[ohe_cols], drop_first=True)
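Calling `pd.get_dummies` separately on train and test can produce different columns when a category appears in only one of the two sets. As a hedged sketch (toy frames with made-up values, not the competition data), the test columns can be aligned to the train columns with `reindex`:

```python
import pandas as pd

# Toy frames (hypothetical values): 'X' appears only in the train set
toy_train = pd.DataFrame({"Sex": ["male", "female", "male"],
                          "Embarked": ["S", "C", "X"]})
toy_test = pd.DataFrame({"Sex": ["female", "male"],
                         "Embarked": ["S", "C"]})

train_ohe = pd.get_dummies(toy_train, drop_first=True)
test_ohe = pd.get_dummies(toy_test, drop_first=True)

# Give the test frame exactly the train columns, in the same order;
# columns that are missing from test are filled with 0
test_ohe = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)

print(train_ohe.columns.tolist())  # ['Sex_male', 'Embarked_S', 'Embarked_X']
print(test_ohe.columns.tolist())   # ['Sex_male', 'Embarked_S', 'Embarked_X']
```

If both sets already share the same categories, the `reindex` call leaves the test frame unchanged.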

In [None]:
ohe_encoded_train_df[:5]

In [None]:
train = pd.concat([train, ohe_encoded_train_df], axis=1)
train.drop(ohe_cols, axis=1, inplace=True)

In [None]:
test = pd.concat([test, ohe_encoded_test_df], axis=1)
test.drop(ohe_cols, axis=1, inplace=True)

In [None]:
# Scale numeric columns
scaler = StandardScaler()
train[num_cols] = scaler.fit_transform(train[num_cols])
test[num_cols] = scaler.transform(test[num_cols])

Let's have a look at our cleaned train and test datasets.

In [None]:
train.head()

In [None]:
test.head()

### Install scikit-learn-intelex

Cross-validation on RandomForest takes a while (almost a minute), so we'll try the `scikit-learn-intelex` library, which was recommended in the referenced notebook.

In [None]:
!pip install scikit-learn-intelex --progress-bar off >> /tmp/pip_sklearnex.log

In [None]:
# Enable Intel(R) Extension for scikit-learn
from sklearnex import patch_sklearn
patch_sklearn()

## Classifiers

In [None]:
# Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

from sklearn.metrics import accuracy_score

from sklearn.model_selection import train_test_split, cross_val_score, KFold

import optuna

In [None]:
features = train.drop('Survived', axis=1)
target = train.Survived

In [None]:
# Split train into train_, valid_ datasets
X_train, X_valid, y_train, y_valid = train_test_split(features, target, test_size=0.3, random_state=41)

print(X_train.shape, y_train.shape)
print(X_valid.shape, y_valid.shape)

In [None]:
RANDOM_STATE = 41
FOLDS = 10

### Model: LogisticRegression

Let's run cross-validation and check the score.

In [None]:
lr = LogisticRegression(random_state=RANDOM_STATE)
scores = cross_val_score(lr, X_train, y_train, cv=FOLDS, scoring='accuracy')
print(f'LogisticRegression(CV): {scores.mean()}')
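For reference, here is a small sketch of what `cross_val_score` does when `cv` is an integer and the estimator is a classifier: it splits with `StratifiedKFold` (no shuffling), fits a fresh copy of the model on each training part, and scores it on the held-out part. The data below is synthetic (`make_classification`), used only as a stand-in for the competition data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (not the competition data)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

model = LogisticRegression(random_state=0)
cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

# For a classifier, an integer `cv` means StratifiedKFold without shuffling
manual_scores = []
for train_idx, valid_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_model = LogisticRegression(random_state=0)
    fold_model.fit(X[train_idx], y[train_idx])
    fold_pred = fold_model.predict(X[valid_idx])
    manual_scores.append(accuracy_score(y[valid_idx], fold_pred))

print(np.allclose(cv_scores, manual_scores))  # True
```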

### Model: DecisionTreeClassifier

In [None]:
%%time
dt = DecisionTreeClassifier(random_state=RANDOM_STATE)
scores = cross_val_score(dt, X_train, y_train, cv=FOLDS, scoring='accuracy')
print(f'DecisionTree(CV): {scores.mean()}')

No improvement over the LogisticRegression model.

### Model: RandomForestClassifier

In [None]:
%%time
rf = RandomForestClassifier(random_state=RANDOM_STATE,
                           max_depth=15,
                           min_samples_leaf=8)
scores = cross_val_score(rf, X_train, y_train, cv=FOLDS, scoring='accuracy')
print(f'RandomForest(CV): {scores.mean()}')

This scores better than LogisticRegression. Let's try hyperparameter tuning with the `Optuna` library.

### Short introduction to the `Optuna` library

Optuna is a library for optimizing an algorithm's hyperparameters.
``` python
import optuna
```
Conventionally, functions to be optimized are named `objective`.
``` python
def objective(trial):
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2
```
This function returns the value of (x-2)^2. Our goal is to find the value of `x` that minimizes the output of the `objective` function. This is the **optimization**. During optimization, Optuna repeatedly calls and evaluates the objective function with different values of `x`.

A `Trial` object corresponds to a single execution of the objective function and is internally instantiated upon each invocation of the function.

`suggest` APIs (such as `suggest_float()`) are called inside the objective function to obtain parameters for a trial. `suggest_float()` selects a value uniformly within the given range, which is -10 to 10 in the example above.

To start the optimization, we create a study object and pass the objective function to method `optimize()` as follows:
``` python
study = optuna.create_study()
study.optimize(objective, n_trials=100)
```

You can get the best parameter as follows:
``` python
best_params = study.best_params
found_x = best_params["x"]
```
_When used to search for hyperparameters in machine learning, usually the objective function would return the loss or accuracy of the model._
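To illustrate the trial loop idea without depending on Optuna, here is a plain random search over the same toy objective. This is a swapped-in technique for illustration only: it is not how Optuna's samplers (for example TPE) choose values, and `random_search` is a hypothetical helper, not part of any library.

```python
import random

def objective(x):
    # Same toy function as above: minimized at x = 2
    return (x - 2) ** 2

def random_search(objective, low, high, n_trials, seed=0):
    """Plain random search: NOT Optuna's sampler, just the same trial loop idea."""
    rng = random.Random(seed)
    best_x, best_value = None, float("inf")
    for _ in range(n_trials):
        x = rng.uniform(low, high)  # analogous to trial.suggest_float("x", low, high)
        value = objective(x)        # one "trial" = one evaluation of the objective
        if value < best_value:      # keep the best trial seen so far
            best_x, best_value = x, value
    return best_x, best_value

best_x, best_value = random_search(objective, -10, 10, n_trials=100)
print(best_x, best_value)
```

Like `optuna.create_study()` with its default direction, this sketch minimizes the objective; maximizing accuracy would flip the comparison.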

In [None]:
# define objective function so that accuracy for RandomForest can be optimized using Optuna
def objective(trial):
    params = {
        'random_state': RANDOM_STATE,
        'max_depth': trial.suggest_int('max_depth', 10, 25),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 5, 12)
    }
    
    rf_ = RandomForestClassifier(**params)
    rf_.fit(X_train, y_train)
    return rf_.score(X_valid, y_valid)

In [None]:
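# Note: MedianPruner only takes effect when the objective reports intermediate values
# (trial.report / trial.should_prune); this objective does not, so no trial is pruned.
study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=RANDOM_STATE),
                           direction='maximize',
                           pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=100, show_progress_bar=True)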

In [None]:
# After optuna optimization, print best params
print(f'Best Accuracy: {study.best_trial.value}')
print(f'Best Params: {study.best_params}')

Let's try RandomForest with the optimized params

In [None]:
%%time
kf = KFold(n_splits=FOLDS, shuffle=True, random_state=RANDOM_STATE)
y_pred = np.zeros(test.shape[0])

for fold, (train_index, valid_index) in enumerate(kf.split(features, target)):
    print("Running Fold {}".format(fold + 1))
    # Use fold-local names so the X_train/X_valid split used by the Optuna objectives is not overwritten
    X_train_, X_valid_ = pd.DataFrame(features.iloc[train_index]), pd.DataFrame(features.iloc[valid_index])
    y_train_, y_valid_ = target.iloc[train_index], target.iloc[valid_index]
    # best_params holds only the tuned values, so pass random_state explicitly
    rf_ = RandomForestClassifier(random_state=RANDOM_STATE, **study.best_params)
    rf_.fit(X_train_, y_train_)
    print("  Accuracy: {}".format(accuracy_score(y_valid_, rf_.predict(X_valid_))))
    y_pred += rf_.predict(test)

y_pred /= FOLDS

print("")
print("Done!")

Write the RandomForest predictions to a submission file

In [None]:
submission['Survived'] = np.round(y_pred).astype(int)
submission.to_csv('rf_10_folds_optuna.csv', index=False) # Kaggle Score: 0.79805
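Averaging the hard 0/1 predictions across folds and rounding amounts to a majority vote. The sketch below uses made-up fold predictions (`fold_preds` is hypothetical) to show the mechanics, including how an exact tie is resolved by `np.round`.

```python
import numpy as np

# Hypothetical hard predictions (0/1) from 4 folds for 3 test rows
fold_preds = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
])

avg = fold_preds.mean(axis=0)      # fraction of folds voting 1 for each row
final = np.round(avg).astype(int)  # round to the nearest class

print(avg.tolist())    # [0.75, 0.25, 0.5]
print(final.tolist())  # [1, 0, 0]
```

`np.round` rounds halves to the nearest even number, so a 0.5 tie becomes 0. With an even number of folds such ties are possible; averaging `predict_proba` outputs instead of hard predictions is one alternative that makes them less likely.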

### Model: SVM

In [None]:
def objective(trial):
    params = {
        # suggest_loguniform is deprecated in recent Optuna releases; log=True samples on a log scale
        'C': trial.suggest_float('C', 0.1, 0.5, log=True),
        'gamma': trial.suggest_categorical('gamma', ['auto']),
        'kernel': trial.suggest_categorical('kernel', ['rbf']),
    }
    
    svc_ = SVC(**params)
    svc_.fit(X_train, y_train)
    return svc_.score(X_valid, y_valid)

In [None]:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=RANDOM_STATE),
                           direction='maximize',
                           pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=5, show_progress_bar=True)

In [None]:
%%time
kf = KFold(n_splits=FOLDS, shuffle=True, random_state=RANDOM_STATE)
y_pred = np.zeros(test.shape[0])

for fold, (train_index, valid_index) in enumerate(kf.split(features, target)):
    print("Running Fold {}".format(fold + 1))
    X_train_, X_valid_ = pd.DataFrame(features.iloc[train_index]), pd.DataFrame(features.iloc[valid_index])
    y_train_, y_valid_ = target.iloc[train_index], target.iloc[valid_index]
    svc_ = SVC(**study.best_params)
    svc_.fit(X_train_, y_train_)
    print("  Accuracy: {}".format(accuracy_score(y_valid_, svc_.predict(X_valid_))))
    y_pred += svc_.predict(test)

y_pred /= FOLDS

print("")
print("Done!")

This is not an improvement over the RandomForest classifier.

### Model: KNeighborsClassifier

In [None]:
knc = KNeighborsClassifier(n_neighbors=1)
scores = cross_val_score(knc, X_train, y_train, cv=FOLDS, scoring='accuracy')
print(f'KNeighbors: {scores.mean()}')

This performs worse than RandomForestClassifier.

### Model: ExtraTreesClassifier

In [None]:
def objective(trial):
    params = {
        'max_features':trial.suggest_float('max_features', 0.45, 0.6),
        'min_samples_leaf':trial.suggest_int('min_samples_leaf', 6, 10),
        'min_samples_split':trial.suggest_int('min_samples_split', 3, 5)
    }
    
    etc = ExtraTreesClassifier(random_state=RANDOM_STATE, n_estimators=100, **params)
    etc.fit(X_train, y_train)
    return etc.score(X_valid, y_valid)

In [None]:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=RANDOM_STATE),
                           direction='maximize',
                           pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=10, show_progress_bar=True)

In [None]:
%%time
kf = KFold(n_splits=FOLDS, shuffle=True, random_state=RANDOM_STATE)
y_pred = np.zeros(test.shape[0])

for fold, (train_index, valid_index) in enumerate(kf.split(features, target)):
    print("Running Fold {}".format(fold + 1))
    X_train_, X_valid_ = pd.DataFrame(features.iloc[train_index]), pd.DataFrame(features.iloc[valid_index])
    y_train_, y_valid_ = target.iloc[train_index], target.iloc[valid_index]
    etc_ = ExtraTreesClassifier(random_state=RANDOM_STATE, n_estimators=100, **study.best_params)
    etc_.fit(X_train_, y_train_)
    print("  Accuracy: {}".format(accuracy_score(y_valid_, etc_.predict(X_valid_))))
    y_pred += etc_.predict(test)

y_pred /= FOLDS

print("")
print("Done!")

This is close to RandomForestClassifier, but we'll submit these predictions anyway.

In [None]:
submission['Survived'] = np.round(y_pred).astype(int)
submission.to_csv('extratrees_10_folds_optuna.csv', index=False)

### Model: LGBMClassifier

In [None]:
def objective(trial):
    # Hyperparameters searched by Optuna (previously commented out, so every trial was identical)
    trial_params = {
        'min_child_samples': trial.suggest_int('min_child_samples', 145, 160),
        'num_leaves': trial.suggest_int('num_leaves', 15, 25),
        'max_depth': trial.suggest_int('max_depth', 14, 16)
    }
    # Hyperparameters kept fixed across trials
    lgb_params = {
        'metric': 'binary_logloss',
        'n_estimators': 100,
        'objective': 'binary',
        'random_state': RANDOM_STATE,
        'learning_rate': 0.01,
        'reg_alpha': 3e-5,
        'reg_lambda': 9e-2,
        'colsample_bytree': 0.8,
        'subsample': 0.8,
        'subsample_freq': 2,
        'max_bin': 240,
    }
    
    lgbm_ = LGBMClassifier(**lgb_params, **trial_params)
    lgbm_.fit(X_train, y_train)
    return lgbm_.score(X_valid, y_valid)

In [None]:
lgb_params = {
    'metric': 'binary_logloss',
    'n_estimators': 100,
    'objective': 'binary',
    'random_state': RANDOM_STATE,
    'learning_rate': 0.01,
    'min_child_samples': 150,
    'reg_alpha': 3e-5,
    'reg_lambda': 9e-2,
    'num_leaves': 20,
    'max_depth': 16,
    'colsample_bytree': 0.8,
    'subsample': 0.8,
    'subsample_freq': 2,
    'max_bin': 240,
}

lgbm_ = LGBMClassifier(**lgb_params)
scores = cross_val_score(lgbm_, 
                      X_train, 
                      y_train,
                      cv=5,
                      scoring='accuracy')
scores

In [None]:
# metric='binary_logloss',
#                            n_estimators=100,
#                            objective='binary',
#                            random_state=RANDOM_STATE,
#                            learning_rate=0.01,
#                            reg_alpha=3e-5,
#                            reg_lambda=9e-2,
#                            colsample_bytree=0.8,
#                            subsample=0.8,
#                            subsample_freq=2,
#                            max_bin=240, 

In [None]:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=RANDOM_STATE),
                           direction='maximize',
                           pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=10, show_progress_bar=True)