# Kaggle Competition: Space Titanic

## Look at the Big Picture

### Frame the Problem

Let us first define the nature of the problem. Since we want to predict which passengers were transported to an alternate dimension, this is a binary classification problem. Since the dataset contains both features and labels, supervised machine learning techniques are used. Batch learning is used since the whole dataset is available up front as fixed files.

### Select a Performance Metric

Since this is a binary classification problem, we can use accuracy or ROC-AUC score to assess the performance of the models.

- ROC-AUC score is a more comprehensive performance metric, since it is calculated independently of any particular decision threshold.
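To make the difference between the two metrics concrete, here is a minimal sketch on made-up labels and scores (the arrays y_true and y_score are hypothetical and not taken from the competition data). It shows that accuracy depends on the threshold used to turn scores into predictions, while ROC-AUC is computed from the scores directly.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70])

# Accuracy needs hard predictions, so it changes with the chosen threshold.
acc_at_050 = accuracy_score(y_true, (y_score >= 0.50).astype(int))
acc_at_015 = accuracy_score(y_true, (y_score >= 0.15).astype(int))

# ROC-AUC is computed from the ranking of the scores; no threshold is chosen.
auc = roc_auc_score(y_true, y_score)

print(acc_at_050, acc_at_015, auc)
```

The two accuracy values differ even though the underlying scores are identical, which is why a threshold-free metric is convenient for comparing models.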

### Check the Assumptions

Now let's check the assumptions. By looking at the training data and the test data, we can see that the labels are True/False values, which confirms that this is a binary classification task.

## Get the Data

We obtain the data from the Kaggle competition website.

## Discover and Visualize the Data to Gain Insights

First, we load the data.

In [None]:
import numpy as np
import pandas as pd
train = pd.read_csv('../input/spaceship-titanic/train.csv')
test = pd.read_csv('../input/spaceship-titanic/test.csv')

In [None]:
train.head()

We use the info() method to check the details of each variable. There are 14 columns altogether: 13 features and 1 label. Of the 13 features, 7 have the 'object' data type and 6 have the 'float64' data type.

In [None]:
train.info()

In [None]:
train.describe()

In [None]:
train.shape

The histograms below show that 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', and 'VRDeck' are right-skewed.
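Skewness can also be checked numerically. The following is a small sketch on a hypothetical spending column (the values are invented, not taken from train.csv). It uses the pandas skew() method, where a positive value indicates a right-skewed distribution, and shows that a log transform reduces the skew.

```python
import numpy as np
import pandas as pd

# Hypothetical spending amounts: mostly zero, with a few large values.
spend = pd.Series([0.0, 0.0, 0.0, 0.0, 5.0, 10.0, 20.0, 1000.0])

raw_skew = spend.skew()            # positive for a right-skewed distribution
log_skew = np.log1p(spend).skew()  # log1p compresses the long right tail

print(raw_skew, log_skew)
```

This is also why the histograms use log=True: without a log scale, the few large values would make the rest of the distribution hard to see.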

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
train.hist(log=True)
plt.tight_layout()
plt.show()

In [None]:
train.isna().sum()

Construct a correlation matrix. If there are null values, corr() excludes them pairwise: each correlation is computed from the rows in which both variables are present.

- https://stackoverflow.com/questions/57155427/how-does-corr-remove-na-and-null-values
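The pairwise handling of missing values can be verified on a small example. This sketch uses a hypothetical toy DataFrame (the columns 'a', 'b', and 'c' are invented) and compares the result of corr() with a manual calculation on the rows where both columns are present.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'b' and 'c' each have one missing value.
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, np.nan, 6.0, 8.0, 10.0],
    'c': [5.0, 4.0, np.nan, 2.0, 1.0],
})

corr = df.corr()

# The a-b correlation uses only the rows where both 'a' and 'b' are present.
rows_ab = df[['a', 'b']].dropna()
manual_ab = np.corrcoef(rows_ab['a'], rows_ab['b'])[0, 1]

print(corr.loc['a', 'b'], manual_ab)
```

Note that each pair of columns may therefore be based on a different subset of rows.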

In [None]:
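# numeric_only=True (pandas 1.5 or later) restricts the matrix to numeric and
# boolean columns; in pandas 2.x the text columns would otherwise raise an error
train.corr(numeric_only=True)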

In [None]:
# numeric_only=True (pandas 1.5 or later) skips the text columns
corr_matrix = train.corr(numeric_only=True)
corr_matrix['Transported'].sort_values(ascending=False)

Construct a scatter matrix for further inspection.

In [None]:
from pandas.plotting import scatter_matrix
attributes = ['Age', 'VRDeck', 'Spa', 'RoomService']
scatter_matrix(train[attributes], figsize=(12,12))

## Prepare the Data for Machine Learning Algorithms

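Separate the features and labels.

Drop the 'Name' variable since it is irrelevant.

The 'PassengerId' and 'Cabin' variables are kept for now because new features are extracted from them later. They are dropped when the data passes through the column transformer.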

In [None]:
X_train = train.drop(columns=['Transported','Name'])
y_train = train['Transported']

### Data Cleaning

First, we should decide how to handle the missing data. From above we see that every variable except the passenger ID and the label contains missing values.

- For quantitative data, we fill the missing values during the data transformation process using SimpleImputer with the median.
- For qualitative data, we also fill the missing values during the data transformation process, using SimpleImputer with the most frequent category.
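The two imputation strategies can be sketched on a toy example. The DataFrame below is hypothetical and only mimics one numeric and one categorical column of the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data mimicking one numeric and one categorical column.
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0, 60.0],
    'HomePlanet': ['Earth', 'Mars', np.nan, 'Earth'],
})

num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')

# Each imputer learns its fill value in fit and applies it in transform.
age_filled = num_imputer.fit_transform(toy[['Age']])
planet_filled = cat_imputer.fit_transform(toy[['HomePlanet']])

print(age_filled.ravel())
print(planet_filled.ravel())
```

The missing age is replaced by the median of the observed ages, and the missing planet by the most common planet.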

In [None]:
train.isna().sum()

### Handling Text and Categorical Variables

- To handle categorical variables, we can use one-hot encoding during the data transformation process.
- The 'Deck' and 'Side' categorical variables are extracted from the 'Cabin' variable.
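As a small illustration of both points, the sketch below splits a few hypothetical cabin strings (in the deck/num/side format of the dataset) and one-hot encodes the resulting columns. The cabin values themselves are made up.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical cabin values in the deck/num/side format.
toy = pd.DataFrame({'Cabin': ['B/0/P', 'F/1/S', 'F/2/P']})

parts = toy['Cabin'].str.split('/', expand=True)
toy['Deck'] = parts[0]
toy['Side'] = parts[2]

# One binary column is created per category of each variable.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy[['Deck', 'Side']]).toarray()

print(encoder.categories_)
print(encoded)
```

Here the two decks and the two sides each become two indicator columns, so every row has exactly one 1 per original variable.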

In [None]:
X_train_copy = X_train.copy()

def df_transform(X):
    """Add the 'GroupSize', 'Deck', and 'Side' features."""
    X = X.copy()  # work on a copy so the caller's DataFrame is not modified
    # 'PassengerId' has the form gggg_pp, where gggg identifies the travel group
    group = X['PassengerId'].str.split('_', expand=True).iloc[:, 0]
    # number of passengers sharing the same group id
    X['GroupSize'] = group.map(group.value_counts())
    # 'Cabin' has the form deck/num/side
    cabin = X['Cabin'].str.split('/', expand=True)
    X['Deck'] = cabin.iloc[:, 0]
    X['Side'] = cabin.iloc[:, 2]
    return X

### Custom Transformer
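One possible way to turn a feature-engineering function into a scikit-learn transformer is FunctionTransformer. The sketch below is only an assumption about how this section could be filled in: add_cabin_features is a hypothetical, reduced stand-in for the notebook's own function, and the toy data is invented.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def add_cabin_features(X):
    # Hypothetical stand-in for the notebook's feature function,
    # reduced to the 'Cabin' split for brevity.
    X = X.copy()
    parts = X['Cabin'].str.split('/', expand=True)
    X['Deck'] = parts[0]
    X['Side'] = parts[2]
    return X

# Wrap the function so it exposes fit / transform like any other transformer.
feature_adder = FunctionTransformer(add_cabin_features)

toy = pd.DataFrame({'Cabin': ['B/0/P', 'F/1/S']})
toy_out = feature_adder.fit_transform(toy)

print(toy_out)
```

A transformer built this way could be placed as the first step of a Pipeline, so that feature extraction and preprocessing are applied together.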

### Feature Scaling

- All quantitative features will be standardized.
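Standardization subtracts the mean of each feature and divides by its standard deviation. The sketch below checks this on a hypothetical single-column array and compares StandardScaler with the manual calculation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature column with one very large value.
x = np.array([[0.0], [10.0], [20.0], [1000.0]])

scaled = StandardScaler().fit_transform(x)

# Manual standardization: (x - mean) / standard deviation, per column.
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(scaled.ravel())
print(manual.ravel())
```

After scaling, the column has zero mean and unit variance, which matters for distance-based and gradient-based models such as SVMs, K-nearest neighbours, and SGD.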

### Transformation Pipelines

- After listing the strategies to handle missing data and categorical variables, we can now build a data transformation pipeline for preprocessing. We first build separate pipelines for quantitative and qualitative data, then we use ColumnTransformer to combine the two pipelines into a full one.

In [None]:
num_attribs = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'GroupSize']
cat_attribs = ['HomePlanet','Destination', 'CryoSleep', 'VIP', 'Deck', 'Side']

In [None]:
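from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# quantitative features: fill missing values with the median, then standardize
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler())
])

# qualitative features: fill missing values with the most frequent category,
# then one-hot encode; categories unseen during fitting are encoded as all zeros
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('oh_encoder', OneHotEncoder(handle_unknown='ignore'))
])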

full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', cat_pipeline, cat_attribs)
])

# non-selected columns such as 'PassengerId' and 'Cabin' will be
# dropped after passing through the column transformer
X_train_copy_transformed = df_transform(X_train_copy)
print(X_train_copy_transformed.columns)
X_train_copy_prepared = full_pipeline.fit_transform(X_train_copy_transformed)
print(X_train_copy_prepared[0])

## Select and Train a Model

### Training and Evaluating on the Training Set

Now we choose which classification models to use. Here we use the following classifiers:

1. Random Forest Classifier
2. Support Vector Classifier with linear kernel
3. Support Vector Classifier with polynomial kernel
4. Support Vector Classifier with Gaussian RBF kernel
5. K Neighbours Classifier
6. Logistic Regression
7. SGD Classifier (Linear Support Vector Machine)
8. Soft Voting Classifier (Random Forest, Logistic Regression, and Polynomial SVM)
9. AdaBoost Classifier

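Note: soft voting averages the class probabilities predicted by its members, so the SVM used inside the voting classifier is created with probability=True. This lets the voting classifier produce the prediction probabilities needed to calculate the ROC-AUC score.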
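The effect of probability=True can be sketched on synthetic data. Everything below (X_toy, y_toy, svc_plain, svc_proba, soft_vote) is hypothetical and generated with make_classification; it is not the competition data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic binary classification data, for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)

# Without probability=True, SVC does not expose predict_proba.
svc_plain = SVC(kernel='poly', degree=3, coef0=1)
has_proba = hasattr(svc_plain, 'predict_proba')

# With probability=True, the SVC can take part in soft voting.
svc_proba = SVC(kernel='poly', degree=3, coef0=1, probability=True, random_state=42)
soft_vote = VotingClassifier([
    ('svm', svc_proba),
    ('log', LogisticRegression()),
], voting='soft')
soft_vote.fit(X_toy, y_toy)

# Averaged class probabilities, one row per sample.
proba = soft_vote.predict_proba(X_toy)
print(has_proba, proba[:3])
```

Enabling probability estimates makes SVC training noticeably slower, because the probabilities are calibrated with an internal cross-validation.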

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

forest_clf = RandomForestClassifier(n_estimators=200, random_state=42)
# kernel='linear' must be set explicitly; the default SVC kernel is 'rbf'
linear_svm_clf = SVC(kernel='linear', C=5, random_state=42)
poly_svm_clf = SVC(kernel='poly', degree=3, coef0=1, probability=True)
rbf_svm_clf = SVC(kernel='rbf', gamma='scale', C=5)
knn_clf = KNeighborsClassifier()
log_clf = LogisticRegression(solver='sag', random_state=42)
sgd_clf = SGDClassifier(loss='hinge', alpha=0.017, max_iter=1000, tol=1e-3, random_state=42)
voting_clf = VotingClassifier([
    ('svm', poly_svm_clf),
    ('log', log_clf),
    ('for', forest_clf)
], voting='soft')
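# the algorithm argument is left at its default: 'SAMME.R' was the default in
# older scikit-learn releases and is no longer accepted in recent ones
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200,
    learning_rate=0.5, random_state=42)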

algorithms = [forest_clf, linear_svm_clf, poly_svm_clf, rbf_svm_clf,
 knn_clf, log_clf, sgd_clf, voting_clf, ada_clf]

Calculate cross-validation scores for each of the classifiers.

Again, ROC-AUC is used because this is a binary classification task.

In [None]:
X_train_transformed = df_transform(X_train)
X_train_prepared = full_pipeline.fit_transform(X_train_transformed)

from sklearn.model_selection import cross_val_score

best_mean = 0
for alg in algorithms:
    alg_scores = cross_val_score(alg, X_train_prepared, y_train, scoring='roc_auc', cv=10)
    print(f'Classifier: {str(alg)}')
    print(f'Mean Score: {alg_scores.mean()}')
    print(f'Standard Deviation: {alg_scores.std()}')
    if alg_scores.mean() > best_mean:
        best_mean = alg_scores.mean()
        best_classifier = str(alg)

print(f'Best Model: {best_classifier}')
print(f'Best Model Mean Score: {best_mean}')

From the cross-validation results above, we found that the Voting Classifier performs the best. Therefore, we will fine-tune this model using grid search.

## Fine Tune Your Model

### Grid Search

Logistic Regression is left at its default settings here and is not tuned with grid search, although it does have hyperparameters such as the regularization strength C.
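A grid search over the voting classifier could look like the following minimal sketch. It runs on synthetic data from make_classification, and the parameter grid is only an assumed example, not the grid used for the reported results. Parameters of a member are addressed as name__parameter, where name is the label given to that member in the VotingClassifier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the prepared training set.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

toy_voting = VotingClassifier([
    ('log', LogisticRegression(max_iter=1000)),
    ('for', RandomForestClassifier(random_state=42)),
], voting='soft')

# Hypothetical grid: only the random forest member is tuned.
param_grid = {
    'for__n_estimators': [50, 100],
    'for__max_depth': [None, 5],
}

grid_search = GridSearchCV(toy_voting, param_grid, scoring='roc_auc', cv=3)
grid_search.fit(X_toy, y_toy)

print(grid_search.best_params_)
print(grid_search.best_score_)
```

After fitting, grid_search.best_estimator_ holds the voting classifier refitted on all of the data with the best parameter combination.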

### Evaluate Your System on the Test Set

In [None]:
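X_test = test.drop(columns=['Name'])
X_test_transformed = df_transform(X_test)
# use transform, not fit_transform: the test set must be preprocessed with the
# statistics and categories learned from the training set
X_test_prepared = full_pipeline.transform(X_test_transformed)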

In [None]:
voting_clf.fit(X_train_prepared, y_train)
y_test_pred = voting_clf.predict(X_test_prepared)
y_test_pred

In [None]:
final = pd.read_csv('../input/spaceship-titanic/sample_submission.csv')
final.iloc[:,1] = y_test_pred
final.to_csv('final_submission.csv', index=False)

## Comments

**Results without cabin information**

Logistic Regression: 0.79027

Voting Classifier (Linear SVM, Logistic Regression, Random Forest): 0.78980

**Results with cabin information**

Voting Classifier (Linear SVM, Logistic Regression, Random Forest): 0.80289

**Results with cabin information and groupsize**

Voting Classifier (Polynomial SVM, Logistic Regression, Random Forest (n_estimators=200)): 0.80430