# **Voting classifier**

In this notebook, the following algorithms/methods are used:

* Support Vector Machines
* Logistic Regression
* Random Forest
* XGBoost
* LGBM
* CatBoost
* Neural Network

For each one of them, an output file was submitted for evaluation and the public leaderboard (LB) score was noted. Then, two voting classifiers were used to combine all results:

* one with **hard voting** using the binary outcomes of all algorithms and choosing the majority classification for each test instance
* one with **soft voting** using the probability outcomes of all algorithms, computing an average and rounding the result for each test instance
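
The two schemes can be sketched on toy data. The probabilities below are made up for illustration (three hypothetical models, four test instances) and are not taken from the competition. A 0.5 threshold is used in place of rounding, which gives the same result whenever the average is not exactly 0.5.

```python
import numpy as np

# Hypothetical probabilities of class 1: one row per model, one column per test instance.
probs = np.array([
    [0.90, 0.40, 0.45, 0.20],
    [0.60, 0.45, 0.40, 0.70],
    [0.30, 0.95, 0.55, 0.90],
])

# Binary outcome of each model, using a 0.5 threshold.
binary = (probs > 0.5).astype(int)

# Hard voting: the class chosen by the majority of the models.
hard_votes = (binary.sum(axis=0) > binary.shape[0] / 2).astype(int)

# Soft voting: average the probabilities first, then threshold the average.
soft_votes = (probs.mean(axis=0) > 0.5).astype(int)

print(hard_votes)  # [1 0 0 1]
print(soft_votes)  # [1 1 0 1]
```

The two schemes disagree on the second instance: two models lean mildly towards class 0, but one very confident model (0.95) pulls the average above 0.5.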

The results show that both voting classifiers outperform all individual results. Also, hard voting results in a better LB score than soft voting.

# Load libraries and data

In [None]:
import os
import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

In [None]:
train = pd.read_csv('../input/tabular-playground-series-apr-2021/train.csv', index_col='PassengerId')
test = pd.read_csv('../input/tabular-playground-series-apr-2021/test.csv', index_col='PassengerId')
submission = pd.read_csv('../input/tabular-playground-series-apr-2021/sample_submission.csv', index_col='PassengerId')

target = train.pop('Survived')

## Preprocessing

Based on [this notebook](https://www.kaggle.com/ekozyreff/tps-2021-04-support-vector-machines).

In [None]:
train.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
test.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

test['Age'].fillna((train['Age'].median()), inplace=True)
train['Age'].fillna((train['Age'].median()), inplace=True)

test['Fare'].fillna((train['Fare'].median()), inplace=True)
train['Fare'].fillna((train['Fare'].median()), inplace=True)

test['Fare'] = test['Fare'].map(lambda i: np.log(i) if i > 0 else 0)
train['Fare'] = train['Fare'].map(lambda i: np.log(i) if i > 0 else 0)

test['FamilySize'] = test['SibSp'] + test['Parch'] + 1
train['FamilySize'] = train['SibSp'] + train['Parch'] + 1

test['Embarked'].fillna('S', inplace=True)
train['Embarked'].fillna('S', inplace=True)

for col in ['Pclass', 'Sex', 'Embarked']:
    le = LabelEncoder()
    le.fit(train[col])
    test[col] = le.transform(test[col])
    train[col] = le.transform(train[col])    

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(train, target, test_size=0.1, random_state=0)

For SVM, it is recommended that the input be scaled.

In [None]:
X_train_scaled = X_train.copy()
X_valid_scaled = X_valid.copy()
test_scaled = test.copy()

scaler = StandardScaler()
scaler.fit(train)
X_train_scaled = scaler.transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
test_scaled = scaler.transform(test)

X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_valid_scaled = pd.DataFrame(X_valid_scaled, columns=X_valid.columns)
test_scaled = pd.DataFrame(test_scaled, columns=test.columns)
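
Note that the cell above fits the scaler on the full `train` frame, which includes the rows later used for validation. A possible alternative, sketched here on synthetic stand-in data from `make_classification` (an assumption, not the competition data), is to wrap the scaler and the model in a scikit-learn pipeline so that the scaling statistics come from the training split only. The names `X_tr`, `X_va` and `scaled_svc` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the competition data).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.1, random_state=0)

# The pipeline fits the scaler only on the data passed to fit() (the training split)
# and reuses those statistics when scoring or predicting on other data.
scaled_svc = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=0.01, random_state=0))
scaled_svc.fit(X_tr, y_tr)

valid_accuracy = scaled_svc.score(X_va, y_va)
print("Accuracy: {}".format(valid_accuracy))
```

With this layout, `scaled_svc.predict(...)` can be called on unscaled validation or test features, since the scaling step is applied internally.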

# SVM with RBF kernel

In order to obtain the probabilities associated with each prediction, we need to set `probability=True` here, and this makes the algorithm much slower. The following cell takes approximately 30 minutes to run.

For the other methods, this is not necessary and we can use `predict_proba` directly.
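
As a rough illustration of what `probability=True` changes, the sketch below fits two SVMs on small synthetic data (an assumption, not the competition data). Without the flag, `SVC` only exposes raw decision scores through `decision_function`. With the flag, scikit-learn runs an additional internal cross-validated calibration during `fit`, which is where the extra training time comes from.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Small synthetic stand-in data so the example runs in seconds.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Default SVC: only signed decision scores are available, not probabilities.
svc_plain = SVC(kernel='rbf', random_state=0).fit(X, y)
scores = svc_plain.decision_function(X)

# probability=True: fit() also calibrates the scores, enabling predict_proba.
svc_proba = SVC(kernel='rbf', random_state=0, probability=True).fit(X, y)
probs = svc_proba.predict_proba(X)[:, 1]  # probability of class 1

print(scores[:3])
print(probs[:3])
```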

In [None]:
%%time
svc_kernel_rbf = SVC(kernel='rbf', random_state=0, C=0.01, probability=True)
svc_kernel_rbf.fit(X_train_scaled, y_train)
y_pred = svc_kernel_rbf.predict(X_valid_scaled)
print("Accuracy: {}".format(accuracy_score(y_pred, y_valid)))

In [None]:
%%time
svc_kernel_rbf_final_pred_probs = svc_kernel_rbf.predict_proba(test_scaled)[:,1]
svc_kernel_rbf_final_pred_binary = svc_kernel_rbf.predict(test_scaled)
submission['Survived'] = svc_kernel_rbf_final_pred_binary
submission.to_csv('svm_kernel_rbf.csv')

Public LB score: **0.78642**.

# Logistic Regression

In [None]:
%%time
log_reg = LogisticRegression(random_state=0)
log_reg.fit(X_train_scaled, y_train)
y_pred = log_reg.predict(X_valid_scaled)
print("Accuracy: {}".format(accuracy_score(y_pred, y_valid)))

In [None]:
%%time
log_reg_final_pred_probs = log_reg.predict_proba(test_scaled)[:,1]
log_reg_final_pred_binary = log_reg.predict(test_scaled)
submission['Survived'] = log_reg_final_pred_binary
submission.to_csv('logistic_regression.csv')

Public LB score: **0.79341**.

# Random Forest

In [None]:
%%time
random_forest = RandomForestClassifier(random_state=0, n_estimators=1000, max_features=2, min_samples_split=0.1)
random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_valid)
print("Accuracy: {}".format(accuracy_score(y_pred, y_valid)))

In [None]:
%%time
random_forest_final_pred_probs = random_forest.predict_proba(test)[:,1]
random_forest_final_pred_binary = random_forest.predict(test)
submission['Survived'] = random_forest_final_pred_binary
submission.to_csv('random_forest.csv')

Public LB score: **0.79506**.

# XGBoost

In [None]:
%%time
xgboost = XGBClassifier(random_state=0, n_estimators=1000, use_label_encoder=False, eval_metric='logloss')
xgboost.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], early_stopping_rounds=10, verbose=False)
y_pred = xgboost.predict(X_valid)
print("Accuracy: {}".format(accuracy_score(y_pred, y_valid)))

In [None]:
%%time
xgboost_final_pred_probs = xgboost.predict_proba(test)[:,1]
xgboost_final_pred_binary = xgboost.predict(test)
submission['Survived'] = xgboost_final_pred_binary
submission.to_csv('xgboost.csv')

Public LB score: **0.78247**.

# LGBM

In [None]:
%%time
lgbm = LGBMClassifier(random_state=0, n_estimators=1000)
lgbm.fit(X_train, y_train, eval_set=(X_valid, y_valid), eval_metric='logloss', early_stopping_rounds=10, verbose=0)
y_pred = lgbm.predict(X_valid)
print("Accuracy: {}".format(accuracy_score(y_pred, y_valid)))

In [None]:
%%time
lgbm_final_pred_probs = lgbm.predict_proba(test)[:,1]
lgbm_final_pred_binary = lgbm.predict(test)
submission['Survived'] = lgbm_final_pred_binary
submission.to_csv('lgbm.csv')

Public LB score: **0.78711**.

# CatBoost

In [None]:
%%time
catboost = CatBoostClassifier(random_state=0, n_estimators=1000)
catboost.fit(X_train, y_train, eval_set=(X_valid, y_valid), verbose=False, early_stopping_rounds=10)
y_pred = catboost.predict(X_valid)
print("Accuracy: {}".format(accuracy_score(y_pred, y_valid)))

In [None]:
%%time
catboost_final_pred_probs = catboost.predict_proba(test)[:,1]
catboost_final_pred_binary = catboost.predict(test)
submission['Survived'] = catboost_final_pred_binary
submission.to_csv('catboost.csv')

Public LB score: **0.78550**.

# Neural Network

In [None]:
tf.random.set_seed(0)

early_stopping = keras.callbacks.EarlyStopping(
    patience = 10,
    min_delta = 0.001,
    restore_best_weights = True,
)

neural_net = keras.Sequential([
    layers.Dense(units=100, activation='relu', input_shape=[X_train_scaled.shape[1]]),
    layers.Dropout(rate=0.3),
    layers.BatchNormalization(),
    layers.Dense(units=100, activation='relu'),
    layers.Dropout(rate=0.3),
    layers.BatchNormalization(),
    layers.Dense(units=50, activation='relu'),
    layers.BatchNormalization(),
    layers.Dense(units=1, activation='sigmoid'),
])
neural_net.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics = ['binary_accuracy']
)

In [None]:
%%time
history = neural_net.fit(X_train, y_train,
                     validation_data = (X_valid, y_valid),
                     batch_size = 512,
                     epochs = 50,
                     callbacks = [early_stopping],
                    )

In [None]:
%%time
neural_net_final_pred_probs = neural_net.predict(test).reshape(-1)  # flatten (n, 1) to (n,) without hard-coding the test size
neural_net_final_pred_binary = np.round(neural_net_final_pred_probs).astype(int)
submission['Survived'] = neural_net_final_pred_binary
submission.to_csv('neural_net.csv')

Public LB score: **0.79284**.

# Hard voting classifier

In [None]:
binary_average = np.mean([svc_kernel_rbf_final_pred_binary,
                          log_reg_final_pred_binary,
                          random_forest_final_pred_binary,
                          xgboost_final_pred_binary,
                          lgbm_final_pred_binary,
                          catboost_final_pred_binary,
                          neural_net_final_pred_binary], axis=0)

hard_classifier_predictions = np.round(binary_average).astype(int)

In [None]:
submission['Survived'] = hard_classifier_predictions
submission.to_csv('hard_voting_classifier.csv')

Public LB score: **0.79692**.
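
Rounding the mean works as a majority vote here because there are seven voters, so ties are impossible. With an even number of models the mean can be exactly 0.5, and `np.round` rounds halves to the nearest even value, so a tie silently becomes class 0. A minimal sketch with made-up votes from four hypothetical models that makes the tie rule explicit:

```python
import numpy as np

# Hypothetical binary predictions: one row per model, one column per test instance.
votes = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
])

n_models = votes.shape[0]
positive_votes = votes.sum(axis=0)

# Strict majority: a tie (2 of 4 votes) is resolved as class 0.
majority = (positive_votes * 2 > n_models).astype(int)

# Rounding the mean gives the same result, since np.round(0.5) is 0.0.
rounded = np.round(votes.mean(axis=0)).astype(int)

print(majority)  # [0 0 1]
print(rounded)   # [0 0 1]
```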

# Soft voting classifier

In [None]:
probs_average = np.mean([svc_kernel_rbf_final_pred_probs,
                          log_reg_final_pred_probs,
                          random_forest_final_pred_probs,
                          xgboost_final_pred_probs,
                          lgbm_final_pred_probs,
                          catboost_final_pred_probs,
                          neural_net_final_pred_probs], axis=0)

soft_classifier_predictions = np.round(probs_average).astype(int)

In [None]:
submission['Survived'] = soft_classifier_predictions
submission.to_csv('soft_voting_classifier.csv')

Public LB score: **0.79603**.
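
For comparison, scikit-learn provides `VotingClassifier`, which implements both schemes through its `voting` parameter. The sketch below is not what this notebook does: it refits every estimator itself, uses only two of the models, and runs on synthetic stand-in data (an assumption) rather than the saved predictions above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the competition data).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.1, random_state=0)

estimators = [
    ('log_reg', LogisticRegression(random_state=0)),
    ('random_forest', RandomForestClassifier(n_estimators=50, random_state=0)),
]

# VotingClassifier clones and refits the estimators, so the same list can be reused.
results = {}
for voting in ('hard', 'soft'):
    ensemble = VotingClassifier(estimators=estimators, voting=voting)
    ensemble.fit(X_tr, y_tr)
    results[voting] = ensemble.score(X_va, y_va)
    print(voting, results[voting])
```

Averaging the saved prediction arrays by hand, as done above, has the advantage that models trained outside scikit-learn (such as the Keras network) or with early stopping on a validation set can be combined without refitting.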

## Summary of results

| Algorithm               | LB score |
| --- | --- |
| Support Vector Machines    | 0.78642 |
| Logistic Regression        | 0.79341 |
| Random Forest              | 0.79506 |
| XGBoost                    | 0.78247 |
| LGBM                       | 0.78711 |
| CatBoost                   | 0.78550 |
| Neural Network             | 0.79284 |
| **Hard voting classifier** | **0.79692** |
| **Soft voting classifier** | **0.79603** |

# Final remarks

1. The main purpose of this notebook was to test whether a simple ensemble method such as a voting classifier would improve on individual results and, as expected, it did.

2. The best individual result was 0.79506 (Random Forest) and both voting classifiers outperformed this value (0.79692 with hard voting and 0.79603 with soft voting). However, I was expecting soft voting to do better than hard voting, since in theory there is "more information" in the probabilities that each method produces. That did not happen here.

3. I realize that the individual results for XGBoost, LGBM and CatBoost are low compared to other kernels, so my next step is to work on them individually. Hopefully when I have them performing better I will get a much higher score with the ensemble.

If you have any thoughts to share, please do so in the comments. And thanks for stopping by! :)