## Introduction
In this competition your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Loading Important Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split,GridSearchCV,StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.preprocessing import StandardScaler,MinMaxScaler,OneHotEncoder, LabelEncoder,OrdinalEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline

from sklearn.compose import ColumnTransformer

from sklearn.linear_model import LogisticRegression

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

In [None]:
# Reading Data Files
train = pd.read_csv('../input/spaceship-titanic/train.csv')
test = pd.read_csv('../input/spaceship-titanic/test.csv')

print('Train set shape:', train.shape,'  Test set shape:', test.shape)

train.head()

## File and Data Field Descriptions
* PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
* HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
* CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
* Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
* Destination - The planet the passenger will be debarking to.
* Age - The age of the passenger.
* VIP - Whether the passenger has paid for special VIP service during the voyage.
* RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
* Name - The first and last names of the passenger.
* Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
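
The two composite fields above (`PassengerId` as gggg_pp and `Cabin` as deck/num/side) can be unpacked with plain string splits. The sketch below is only an illustration: the helper names and the sample values are hypothetical, not taken from the dataset.

```python
def split_passenger_id(passenger_id):
    """Split 'gggg_pp' into (group number, position within the group)."""
    group, position = passenger_id.split("_")
    return int(group), int(position)


def split_cabin(cabin):
    """Split 'deck/num/side' into (deck, cabin number, side)."""
    deck, num, side = cabin.split("/")
    return deck, int(num), side


# Made-up sample values in the documented formats
print(split_passenger_id("0003_02"))
print(split_cabin("F/12/S"))
```

The feature-engineering cells later in the notebook rely on the same two splits to derive the group and the cabin deck/side.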

In [None]:
train.describe(include='all')

In [None]:
train.info()

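**dtypes** (these counts appear to match `train` after the imputation and feature-engineering cells below have been run; the raw CSV has fewer numeric columns)
* bool(2)
* float64(7)
* int64(8)
* object(4)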

In [None]:
# Train set: number of missing values per column
print('Trainset Missing Values')
train.isna().sum()


In [None]:
# Test set: number of missing values per column
print('Testset Missing Values')
test.isna().sum()

Info - Almost all columns contain some null entries (less than 2.5% in the train set).
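
To turn the raw null counts into percentages, one option is `isna().mean()`, which gives the fraction of missing rows per column. This is a minimal sketch on a made-up toy frame (the name `toy` and its values are assumptions), not the notebook's actual output.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train; the values are made up
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
    "Spa": [0.0, 549.0, 6715.0, 3329.0],
})

# isna().mean() is the fraction of missing rows per column
missing_pct = (toy.isna().mean() * 100).round(1)
print(missing_pct.sort_values(ascending=False))
```

Running the same expression on `train` and `test` would show whether every column really stays under the threshold quoted above.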


In [None]:
# Checking cardinality, i.e. the number of unique values in every column
train.nunique()

Apart from the numerical features, PassengerId, Cabin and Name have very high cardinality. These columns can be dropped after feature extraction.

## EDA

In [None]:
# Checking for Imbalance in the dataset
plt.figure(figsize=(6,6))
train['Transported'].value_counts().plot(kind = 'bar',color=['orange','blue'])

Info - Dataset is balanced !

In [None]:
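# sns.distplot has been deprecated since seaborn 0.11; histplot with kde=True is the replacement
sns.histplot(train['Age'], kde=True)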

Info - Age is roughly bell-shaped, so null values can be imputed with the mean or the median.
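
As a small illustration of median imputation, the sketch below computes the median on a toy training column and reuses it for the test column, so no test-set statistic is needed. The series names and values are made up for the example.

```python
import numpy as np
import pandas as pd

train_age = pd.Series([20.0, 25.0, np.nan, 30.0, 70.0])
test_age = pd.Series([np.nan, 40.0])

# median() skips NaN; the single large value (70) pulls the mean up but not the median
age_median = train_age.median()
age_mean = train_age.mean()

train_filled = train_age.fillna(age_median)
test_filled = test_age.fillna(age_median)
print(age_median, age_mean)
```

The median is the safer default when a few large values are present, which is one reason to prefer it over the mean here.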

In [None]:
# Effect of Age on Transportation
plt.figure(figsize=(10,4))

sns.histplot(data=train, x='Age', hue='Transported', binwidth=1, kde=True)


plt.title('Age distribution')
plt.xlabel('Age (years)')

Info:
* Age < 18: high chance of transportation
* Age 18 to 30: low chance of transportation
* Age > 30: inconclusive
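
The age bands above can be checked numerically by binning Age and taking the mean of the target per bin. The sketch below uses a made-up toy frame and assumes whole-year ages, so the half-open bins [0, 18), [18, 31) and [31, inf) line up with the three bands.

```python
import pandas as pd

# Made-up ages and outcomes, for illustration only
toy = pd.DataFrame({
    "Age": [5, 16, 18, 24, 30, 31, 45, 60],
    "Transported": [True, True, False, False, True, True, False, True],
})

# right=False gives left-closed bins, so 18 and 30 both land in the middle band
toy["Age_group"] = pd.cut(
    toy["Age"],
    bins=[0, 18, 31, float("inf")],
    labels=["Under 18", "18-30", "Over 30"],
    right=False,
)

# Mean of a boolean column is the share of True values
rate = toy.groupby("Age_group", observed=True)["Transported"].mean()
print(rate)
```

On the real data, the per-band rate would put a number on what the histogram only suggests visually.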



In [None]:
# Effect of expenditure features on Transportation
exp_feats=['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']


fig=plt.figure(figsize=(10,20))
for i, var_name in enumerate(exp_feats):
    # Left plot (full range)
    ax=fig.add_subplot(5,2,2*i+1)
    sns.histplot(data=train, x=var_name, ax=ax, bins=30, kde=False, hue='Transported')
    ax.set_title(var_name)
    
    # Right plot (truncated view)
    ax=fig.add_subplot(5,2,2*i+2)
    sns.histplot(data=train, x=var_name, ax=ax, bins=30, kde=True, hue='Transported')
    ax.set_xlim([0,3000])
    ax.set_ylim([0,300])
    ax.set_title(var_name)
fig.tight_layout()
plt.show()

* Each expenditure feature has a long right tail, roughly like an exponential decay.
* Most passengers spend nothing on these amenities.
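
One way to quantify both observations is to compare the share of zeros with the median and the mean of each column. This is a sketch on made-up numbers; `toy` and its values are assumptions, not figures from the dataset.

```python
import pandas as pd

# Made-up spending values with many zeros and a few large bills
toy = pd.DataFrame({
    "RoomService": [0.0, 0.0, 0.0, 109.0, 1200.0],
    "Spa": [0.0, 0.0, 549.0, 0.0, 6715.0],
})

summary = pd.DataFrame({
    "share_zero": (toy == 0).mean(),  # fraction of passengers with no spend
    "median": toy.median(),
    "mean": toy.mean(),
})
print(summary)
```

A mean far above the median is the usual signature of a long right tail.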

In [None]:
# Categorical features effect on Transportation
cat_feats=['HomePlanet', 'CryoSleep', 'Destination', 'VIP']

# Plot categorical features
fig=plt.figure(figsize=(10,16))
for i, var_name in enumerate(cat_feats):
    ax=fig.add_subplot(4,1,i+1)
    sns.countplot(data=train, x=var_name, ax=ax, hue='Transported')
    ax.set_title(var_name)
fig.tight_layout()  # Improves appearance a bit
plt.show()

* CryoSleep looks like a strong predictor.
* Europa: high probability of transportation
* Earth: low probability of transportation
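
The same comparison can be read off as numbers by grouping on a categorical column and averaging the target. The sketch below uses a made-up toy frame; the rates it prints are illustrative and say nothing about the real data.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Europa", "Europa", "Earth", "Earth", "Earth"],
    "Transported": [True, True, False, False, False, True],
})

# mean = share transported, count = number of passengers in the category
rate = toy.groupby("HomePlanet")["Transported"].agg(["mean", "count"])
print(rate)
```

Keeping the count next to the rate helps avoid over-reading categories with very few passengers.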

In [None]:
# Impute median (for continuous data), computed on the training set only
age_median = train['Age'].median()
train['Age'] = train['Age'].fillna(age_median)
test['Age'] = test['Age'].fillna(age_median)


In [None]:
# Impute mode (for categorical data), computed on the training set only
cat=['HomePlanet','CryoSleep','Destination','VIP']
for i in cat:
    mode_value = train[i].mode()[0]
    print(i, '->', mode_value)
    train[i] = train[i].fillna(mode_value)
    test[i] = test[i].fillna(mode_value)

In [None]:
# For expenditure features, most of the passengers are not spending. Therefore, replacing NULL by 0.
for col in exp_feats:
    train.loc[train[col].isna(),col]=0
    test.loc[test[col].isna(),col]=0

* PassengerId takes the form xxxx_yy where xxxx indicates a group the passenger is travelling with and yy is their number within the group.
* Cabin takes the form deck/num/side, where side can be either P for Port or S for Starboard.

In [None]:
# For ID/qualitative variables, fill with an "unknown" placeholder that matches the data format
train['Cabin'] = train['Cabin'].fillna('U/9999/K')
test['Cabin'] = test['Cabin'].fillna('U/9999/K')

train['Name'] = train['Name'].fillna('UNK UNK')
test['Name'] = test['Name'].fillna('UNK UNK')

## Feature Engineering

In [None]:
# Age segmentation 

# Age features - training set
train['Under_18']=(train['Age']<18).astype(int)
train['18_to_30']=((train['Age']>=18) & (train['Age']<=30)).astype(int)
train['Over_30']=(train['Age']>30).astype(int)

# Age features - test set
test['Under_18']=(test['Age']<18).astype(int)
test['18_to_30']=((test['Age']>=18) & (test['Age']<=30)).astype(int)
test['Over_30']=(test['Age']>30).astype(int)

# Plot distribution of age features
train['Age_plot']=train['Under_18']+2*train['18_to_30']+3*train['Over_30']
plt.figure(figsize=(10,4))
g=sns.countplot(data=train, x='Age_plot', hue='Transported')
plt.title('Age status distribution')
g.set_xticklabels(['Under 18', '18-30', 'Over 30'])
train.drop('Age_plot', axis=1, inplace=True)

In [None]:
#Calculate total expenditure and identify passengers with no expenditure.


train['Expenditure']=train[exp_feats].sum(axis=1)
train['No_spending']=(train['Expenditure']==0).astype(int)

# New features - test set
test['Expenditure']=test[exp_feats].sum(axis=1)
test['No_spending']=(test['Expenditure']==0).astype(int)

# Plot distribution of new features
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
sns.histplot(data=train, x='Expenditure', hue='Transported', bins=200)
plt.title('Total expenditure (truncated)')
plt.ylim([0,200])
plt.xlim([0,20000])

plt.subplot(1,2,2)
sns.countplot(data=train, x='No_spending', hue='Transported')
plt.title('No spending indicator')

In [None]:
# Group size effect on Transportation
# New features - training set
train['Group'] = train['PassengerId'].apply(lambda x: x.split('_')[0]).astype(int)
# value_counts() is computed once and mapped, instead of once per row
train['Group_size']=train['Group'].map(train['Group'].value_counts())

# New features - test set
test['Group'] = test['PassengerId'].apply(lambda x: x.split('_')[0]).astype(int)
test['Group_size']=test['Group'].map(test['Group'].value_counts())

# Plot distribution of group size
plt.figure(figsize=(10,4))
sns.countplot(data=train, x='Group_size', hue='Transported')
plt.title('Group size')
plt.tight_layout()


Solo travellers are less likely to be transported, so we create a feature that flags them.
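
As a compact illustration of the group-size and solo logic, the sketch below derives both on a handful of made-up passenger IDs. It uses `groupby(...).transform("count")` as one possible way to broadcast the group size back to each row; the frame `toy` is hypothetical.

```python
import pandas as pd

# Made-up IDs in the gggg_pp format: group 3 has two members, the rest travel alone
toy = pd.DataFrame({"PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02", "0004_01"]})

toy["Group"] = toy["PassengerId"].str.split("_").str[0].astype(int)
# transform("count") returns one value per row: the size of that row's group
toy["Group_size"] = toy.groupby("Group")["PassengerId"].transform("count")
toy["Solo"] = (toy["Group_size"] == 1).astype(int)
print(toy)
```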

In [None]:
train['Solo']=(train['Group_size']==1).astype(int)
test['Solo']=(test['Group_size']==1).astype(int)

In [None]:
# Cabin features - train set
train['Cabin_deck'] = train['Cabin'].apply(lambda x: x.split('/')[0])
train['Cabin_side'] = train['Cabin'].apply(lambda x: x.split('/')[2])

# Cabin features - test set
test['Cabin_deck'] = test['Cabin'].apply(lambda x: x.split('/')[0])
test['Cabin_side'] = test['Cabin'].apply(lambda x: x.split('/')[2])


In [None]:
# Calculate family size from last name.

# New features - training set
train['Surname']=train['Name'].str.split().str[-1]
train['Family_size']=train['Surname'].map(train['Surname'].value_counts())

# New features - test set
test['Surname']=test['Name'].str.split().str[-1]
test['Family_size']=test['Surname'].map(test['Surname'].value_counts())

# Passengers with no name (placeholder surname 'UNK') are treated as having no family.
# Matching on the placeholder avoids hard-coding the number of missing names,
# which need not be the same in the train and test sets.
train.loc[train['Surname']=='UNK','Family_size']=0
test.loc[test['Surname']=='UNK','Family_size']=0

# New feature distribution
plt.figure(figsize=(10,4))
sns.countplot(data=train, x='Family_size', hue='Transported')
plt.title('Family size')

Family size follows a right-skewed distribution with a mean of around 5.

In [None]:
## DROP UNWANTED FEATURES
train.drop(['PassengerId', 'Cabin', 'Name', 'Surname', 'VIP'], axis=1, inplace=True)
test.drop(['PassengerId', 'Cabin', 'Name', 'Surname', 'VIP'], axis=1, inplace=True)

In [None]:
y=train['Transported'].copy().astype(int)
X=train.drop('Transported', axis=1).copy()
X_test=test.copy()

Important links:
* https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html
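
To make the linked pieces concrete, here is a minimal sketch of a `ColumnTransformer` that scales one numeric column, ordinal-encodes one categorical column, and passes the rest through. The toy frame and its three columns are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0],
    "HomePlanet": ["Earth", "Europa", "Mars"],
    "CryoSleep": [True, False, True],
})

ct = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Age"]),         # mean 0, variance 1
        ("cat", OrdinalEncoder(), ["HomePlanet"]),  # categories -> 0, 1, 2 (sorted order)
    ],
    remainder="passthrough",  # CryoSleep is appended unchanged as the last column
)
out = ct.fit_transform(toy)
print(out)
```

Note that the output columns follow the transformer order (numeric, categorical, remainder), not the original column order, which matters when features are later selected by position.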


In [None]:
# Identify numerical and categorical columns
numerical_cols = [cname for cname in X.columns if X[cname].dtype in ['int64', 'float64']]
categorical_cols = [cname for cname in X.columns if X[cname].dtype in ["object"]]

# Standardize numerical data to have mean=0 and variance=1
numerical_transformer = Pipeline(steps=[('scaler', StandardScaler())])

# Ordinal encode categorical data
categorical_transformer = Pipeline(steps=[('ordinal',OrdinalEncoder())])



# Combine preprocessing
CTransformer = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)],
        remainder='passthrough')

# Apply preprocessing
X = CTransformer.fit_transform(X)
X_test = CTransformer.transform(X_test)

# Print new shape
print('Training set shape:', X.shape)

## Feature selection using KBest
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
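
As a rough illustration of what `SelectKBest` with `f_classif` reports, the sketch below builds two synthetic features: one that shifts with the class and one that is pure noise. All names and data here are made up; the point is only that the informative feature gets a large F-score and a small p-value.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 100)
informative = y_demo + rng.normal(scale=0.5, size=200)  # mean shifts with the class
noise = rng.normal(size=200)                            # unrelated to the class
X_demo = np.column_stack([informative, noise])

# k='all' keeps every feature and just exposes scores_ and pvalues_
fs_demo = SelectKBest(score_func=f_classif, k="all").fit(X_demo, y_demo)
print("F-scores:", fs_demo.scores_)
print("p-values:", fs_demo.pvalues_)
```

A feature whose p-value stays above the chosen threshold shows no significant difference in means between the classes under this test and is a candidate for removal.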

In [None]:
from sklearn.feature_selection import f_classif
fs = SelectKBest(score_func=f_classif, k='all')
# learn relationship from training data
fs.fit(X, y)
for i in range(len(fs.pvalues_)):
    print('Feature %d p-value: %f' % (i, fs.pvalues_[i]))
# plot the p-values
plt.bar(range(len(fs.pvalues_)), fs.pvalues_)
plt.show()


In [None]:
## Dropping features with p-value > 0.05 (no significant relationship with the target)
cols_to_remove=[3]
final_cols=[i for i in range(X.shape[1]) if i not in cols_to_remove]

In [None]:
X=X[:,final_cols]

In [None]:
from imblearn.over_sampling import SMOTE
# The classes are already close to balanced (see EDA), so SMOTE changes little here
oversample = SMOTE(random_state=0)
X, y = oversample.fit_resample(X,y)
y.value_counts()

In [None]:
## Train- Validation Split 
X_train, X_valid, y_train, y_valid = train_test_split(X,y,stratify=y,train_size=0.8,test_size=0.2,random_state=0)

In [None]:
X_test=X_test[:,final_cols]

## Model Selection
* Logistic Regression: Unlike linear regression, which is fitted by least squares, logistic regression is fitted by maximum likelihood. It passes a linear combination of the features through the logistic (sigmoid) function to model the probability of the positive class, and is most commonly used when the target is binary.
 
* Random Forest (RF): RF is an ensemble of decision trees that can be used for regression or classification. Each tree is built on a bootstrap sample (a training set drawn with replacement), and each split considers only a random subset of the features; aggregating the trees' predictions is known as bagging. The resulting forest of decorrelated trees has lower variance than a single tree, so it is more robust to changes in the data and generalises better to new data. It works well with both continuous and categorical data.
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

* Light Gradient Boosting Machine (LGBM): Like XGBoost, LGBM builds gradient-boosted decision trees, but it grows each tree leaf-wise rather than level by level. It usually produces results similar to XGBoost and is often significantly faster.
https://lightgbm.readthedocs.io/en/latest/
 
* Categorical Boosting (CatBoost): CatBoost is an open source algorithm based on gradient boosted decision trees. It supports numerical, categorical and text features. It works well with heterogeneous data and even relatively small data. Informally, it tries to take the best of both worlds from XGBoost and LGBM.
https://catboost.ai/
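
To make the maximum-likelihood idea behind logistic regression concrete, the sketch below evaluates the log-likelihood of a one-feature model at two parameter settings. The function names and the four data points are assumptions made up for this illustration; it does not fit a model.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real number to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def log_likelihood(w, b, x, y):
    """Bernoulli log-likelihood of labels y under p = sigmoid(w*x + b)."""
    p = sigmoid(w * x + b)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))


# Made-up data: negative x goes with class 0, positive x with class 1
x_demo = np.array([-2.0, -1.0, 1.0, 2.0])
y_demo = np.array([0, 0, 1, 1])

ll_flat = log_likelihood(0.0, 0.0, x_demo, y_demo)    # every p is 0.5
ll_sloped = log_likelihood(2.0, 0.0, x_demo, y_demo)  # slope agrees with the labels
print(ll_flat, ll_sloped)
```

Fitting amounts to searching for the `w` and `b` that make this quantity as large as possible, which is what the solver inside `LogisticRegression` does (with a regularisation penalty added).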



In [None]:
## Grid definition for model selection

classifiers = {
    "LogisticRegression" : LogisticRegression(random_state=0),
    "RandomForest" : RandomForestClassifier(random_state=0),
    "LGBM" : LGBMClassifier(random_state=0),
    "CatBoost" : CatBoostClassifier(random_state=0, verbose=False)
}

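# Grids for grid search
# liblinear supports both the l1 and l2 penalties (the default lbfgs solver rejects l1)
LR_grid = {'penalty': ['l1','l2'],
           'C': [0.25, 0.5, 0.75, 1, 1.25, 1.5],
           'max_iter': [50, 100, 150],
           'solver': ['liblinear']}

# Not used below: no SVC is defined in `classifiers`
SVC_grid = {'C': [0.25, 0.5, 0.75, 1, 1.25, 1.5],
            'kernel': ['linear', 'rbf'],
            'gamma': ['scale', 'auto']}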

RF_grid = {'n_estimators': [50, 100, 150, 200, 250, 300],
        'max_depth': [4, 6, 8, 10, 12]}

boosted_grid = {'n_estimators': [50, 100, 150, 200],
        'max_depth': [4, 8, 12],
        'learning_rate': [0.05, 0.1, 0.15]}



# Dictionary of all grids
grid = {
    "LogisticRegression" : LR_grid,
    "RandomForest" : RF_grid,
    "LGBM" : boosted_grid,
    "CatBoost" : boosted_grid
}

Grid search CV : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
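
Before launching a grid search it helps to know how many model fits it implies. The sketch below counts the candidates in the random forest grid defined above with `itertools.product`; the helper `n_candidates` is a hypothetical name, and the fold count assumes GridSearchCV's default of 5 when `cv=None`.

```python
from itertools import product

RF_grid = {"n_estimators": [50, 100, 150, 200, 250, 300],
           "max_depth": [4, 6, 8, 10, 12]}


def n_candidates(param_grid):
    """Number of parameter combinations an exhaustive grid search will try."""
    return len(list(product(*param_grid.values())))


cv_folds = 5  # GridSearchCV's default when cv=None
n = n_candidates(RF_grid)
print(n, "candidates ->", n * cv_folds, "fits (plus one final refit on the best parameters)")
```

The cost grows multiplicatively with every added parameter, which is why the boosted-tree grids are kept small.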

In [None]:
# GridSearchCV for model selection in Action !
import time
i=0
clf_best_params=classifiers.copy()
valid_scores=pd.DataFrame({'Classifier':classifiers.keys(), 'Validation accuracy': np.zeros(len(classifiers)), 'Training time': np.zeros(len(classifiers))})
for key, classifier in classifiers.items():
    start = time.time()
    clf = GridSearchCV(estimator=classifier, param_grid=grid[key], n_jobs=-1, cv=None)

    # Train and score
    clf.fit(X_train, y_train)
    valid_scores.iloc[i,1]=clf.score(X_valid, y_valid)

    # Save the best hyperparameters found for this classifier
    clf_best_params[key]=clf.best_params_
    
    # Print iteration and training time
    stop = time.time()
    valid_scores.iloc[i,2]=np.round((stop - start)/60, 2)
    
    print('Model:', key)
    print('Training time (mins):', valid_scores.iloc[i,2])
    print('')
    i+=1

In [None]:
# Model Performances
valid_scores

In [None]:
# Best Model Parameters
clf_best_params

* Try different ensemble combinations to get the best score!
* **Final model selected - CatBoost**

In [None]:
# Best Classifiers selected
best_classifiers = {
    #"RandomForest" : RandomForestClassifier(**clf_best_params["RandomForest"], random_state=0),
    #"LGBM" : LGBMClassifier(**clf_best_params["LGBM"], random_state=0),
    "CatBoost" : CatBoostClassifier(**clf_best_params["CatBoost"], verbose=False, random_state=0)
}

## Stratified K Fold test for cross validation.
This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
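
A small check of the "preserving the percentage of samples for each class" property: the sketch below splits a made-up label vector with a 2:1 class ratio and prints the class counts in each validation fold. The arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_demo = np.array([0] * 8 + [1] * 4)  # 2:1 class ratio
X_demo = np.zeros((len(y_demo), 1))   # feature values play no role in the split

cv_demo = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
fold_counts = []
for fold, (train_idx, val_idx) in enumerate(cv_demo.split(X_demo, y_demo)):
    counts = np.bincount(y_demo[val_idx], minlength=2)
    fold_counts.append(counts.tolist())
    print("fold", fold, "validation class counts:", counts)
```

Every validation fold keeps the same 2:1 ratio as the full label vector, which a plain shuffled KFold does not guarantee.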

Predictions are ensembled using soft voting: the predicted probabilities are averaged across folds and models, and the average is then thresholded to give the final class.
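
To show how soft voting can differ from a majority (hard) vote, the sketch below averages made-up probabilities from three models for three passengers. The numbers are invented so that the second passenger is decided differently by the two schemes.

```python
import numpy as np

# Rows: models (or folds); columns: test passengers. Values are made up.
probas = np.array([
    [0.90, 0.45, 0.20],
    [0.70, 0.95, 0.10],
    [0.80, 0.45, 0.40],
])

# Soft voting: average the probabilities, then threshold once
soft_vote = probas.mean(axis=0) >= 0.5

# Hard voting: threshold each model first, then take the majority
hard_vote = (probas >= 0.5).mean(axis=0) >= 0.5

print("soft:", soft_vote)
print("hard:", hard_vote)
```

For the second passenger one very confident model outweighs two mildly negative ones under soft voting, while the majority vote goes the other way.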



In [None]:
FOLDS=10

preds=np.zeros(len(X_test))
for key, classifier in best_classifiers.items():
    start = time.time()
    
    # Stratified k-fold cross validation with FOLDS splits
    cv = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=0)
    
    score=0
    for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
        # Get training and validation sets
        X_train, X_valid = X[train_idx], X[val_idx]
        # y is a pandas Series, so select rows by position
        y_train, y_valid = y.iloc[train_idx], y.iloc[val_idx]

        # Train model
        clf = classifier
        clf.fit(X_train, y_train)

        # Make predictions and measure accuracy
        preds += clf.predict_proba(X_test)[:,1]
        score += clf.score(X_valid, y_valid)

    # Average accuracy    
    score=score/FOLDS
    
    # Stop timer
    stop = time.time()

    # Print accuracy and time
    print('Model:', key)
    print('Average validation accuracy:', np.round(100*score,2))
    print('Training time (mins):', np.round((stop - start)/60,2))
    print('')
    
# Ensemble predictions
preds=preds/(FOLDS*len(best_classifiers))


## Final Prediction and Submission

In [None]:
preds=np.round(preds).astype(int)
sub=pd.read_csv('../input/spaceship-titanic/sample_submission.csv')

# The Transported column holds booleans, so convert the 0/1 predictions directly
sub['Transported']=preds.astype(bool)

sub.to_csv('submission.csv', index=False)