In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt
import sklearn as sk
import warnings 
warnings.filterwarnings("ignore")

from numpy.random import seed
seed(1)

Today I'm going to explore and clean the Spaceship Titanic dataset. I will then prepare a machine learning pipeline, try a few different types of models, and compare their accuracy. Finally, I will submit the predictions of the most accurate model to the competition and see how I do.

In [None]:
trainPath = "../input/spaceship-titanic/train.csv"
testPath = "../input/spaceship-titanic/test.csv"
trainingData = pd.read_csv(trainPath)
testingData = pd.read_csv(testPath)
# head() has to be the last expression in the cell for the preview to be displayed
trainingData.head()

According to the competition, the data is formatted as follows:
* PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
* HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
* CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
* Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
* Destination - The planet the passenger will be debarking to.
* Age - The age of the passenger.
* VIP - Whether the passenger has paid for special VIP service during the voyage.
* RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
* Name - The first and last names of the passenger.
* Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

As these descriptions show, some columns pack several pieces of information into a single value, so let's add columns that pull out the additional data that can be gleaned from what we have.
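Before doing this on the real files, here is a minimal sketch of the two splits on a toy frame. The rows are made up for illustration and only follow the `gggg_pp` and `deck/num/side` formats described above. It also shows what happens to a passenger whose `Cabin` is missing.

```python
import pandas as pd

# Toy rows (hypothetical values) in the competition's documented formats.
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02"],
    "Cabin": ["B/0/P", None, "F/1/S"],
})

# gggg_pp -> group id and position within the group, as integers.
toy[["GroupId", "GroupPassengerId"]] = (
    toy["PassengerId"].str.split("_", expand=True).astype(int)
)

# deck/num/side -> three columns; a missing Cabin stays missing in all three.
toy[["Deck", "CabinNum", "Side"]] = toy["Cabin"].str.split("/", expand=True)

print(toy)
```

The second passenger ends up with missing `Deck`, `CabinNum`, and `Side`, so those values still need to be imputed later.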

In [None]:
trainingData[["GroupId", "GroupPassengerId"]] = trainingData["PassengerId"].str.split("_", expand = True).astype(int)
trainingData[["Deck", "CabinNum", "Side"]] = trainingData['Cabin'].str.split("/", expand = True)
trainingData["TotalSpending"] = trainingData[["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]].sum(axis = 1)
testingData[["GroupId", "GroupPassengerId"]] = testingData["PassengerId"].str.split("_", expand = True).astype(int)
testingData[["Deck", "CabinNum", "Side"]] = testingData['Cabin'].str.split("/", expand = True)
testingData["TotalSpending"] = testingData[["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]].sum(axis = 1)

1. **Univariate Data Analysis**

Let's start with the quantitative variables, and then analyze the categorical variables.


In [None]:
# binwidth alone determines the bins; seaborn lets binwidth override bins when both are given
sns.displot(trainingData, x = 'TotalSpending', binwidth = 1000)
sns.displot(trainingData, x = 'Age', binwidth = 5)
plt.show()

The spending distribution is heavily right-skewed with several large outliers, while the age distribution is far more evenly spread by comparison. Now that we've seen the distributions of the quantitative variables, let's move on to the categorical variables.
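Skew can also be checked numerically rather than by eye. The sketch below uses synthetic columns, not the competition data: an exponential sample stands in for a heavily right-skewed variable and a uniform sample for an evenly spread one. Both choices are assumptions made purely for illustration. `DataFrame.skew()` returns a positive value for a right tail and a value near zero for a symmetric column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins (NOT the competition data): an exponential column mimics
# heavy right skew, a uniform column mimics an evenly spread variable.
toy = pd.DataFrame({
    "spending_like": rng.exponential(scale=1000, size=5000),
    "age_like": rng.uniform(0, 80, size=5000),
})

skewness = toy.skew()
print(skewness)
```

On the real data the same call would be `trainingData[['TotalSpending', 'Age']].skew()`.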

In [None]:
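# select_dtypes is the public way to pick columns by type; booleans are kept with the numeric columns
num_cols = trainingData.select_dtypes(include = ['number', 'bool']).columns
cat_cols = list(set(trainingData.columns) - set(num_cols))
print(cat_cols)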

In [None]:
fig, ax = plt.subplots(3,2, figsize = (15,10))
# Pass the column as x explicitly; newer seaborn versions treat the first positional argument as data
sns.countplot(x = trainingData['Transported'], palette='Paired_r', ax = ax[0][0])
sns.countplot(x = trainingData['HomePlanet'], palette='Set2', ax = ax[0][1])
sns.countplot(x = trainingData['Destination'], palette='Paired_r', ax = ax[1][0])
sns.countplot(x = trainingData['Side'], palette='Set2', ax = ax[1][1])
sns.countplot(x = trainingData['Deck'], palette='Paired_r', ax = ax[2][0])
sns.countplot(x = trainingData['VIP'], palette='Set2', ax = ax[2][1])
plt.show()

2. **Multivariate Data Analysis**

There are a few relationships that I am interested in for this segment:
* VIP and Transported 
* Deck and Transported
* Side and Transported
* Age and Transported
* TotalSpending and Transported 

Now I will plot each of these relationships.

In [None]:
sns.countplot(x = trainingData['VIP'], hue = trainingData['Transported'])
plt.show()
sns.countplot(x = trainingData.Deck, hue = trainingData.Transported)
plt.show()
sns.displot(x = trainingData.TotalSpending, hue = trainingData.Transported)
plt.show()
sns.kdeplot(x=trainingData.Age , hue=trainingData.Transported, palette='Paired_r', multiple = 'stack')
plt.show()
sns.countplot(x = trainingData.Side, hue = trainingData.Transported)
plt.show()

Since we expect roughly equal numbers of transported and non-transported passengers, we can see which variables have a major impact on a passenger's chances of being transported. For example, being on the starboard side of the ship is not exactly great for your health, as a larger proportion of people get transported from that side of the ship than from the other. Let's quickly use a chi-squared test for independence to determine whether the differences in these distributions are statistically significant.
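To put a number on a proportion like this, one option is to group by the category and take the mean of the boolean target. This is a minimal sketch on made-up rows, not the competition data; the column names simply mirror the ones used here.

```python
import pandas as pd

# Hypothetical toy data, not the competition file.
toy = pd.DataFrame({
    "Side": ["P", "P", "P", "P", "S", "S", "S", "S"],
    "Transported": [True, False, False, False, True, True, True, False],
})

# The mean of a boolean column is the share of True values,
# so this gives the transported rate for each side.
rates = toy.groupby("Side")["Transported"].mean()
print(rates)
```

The same pattern should apply to the real frame, for example `trainingData.groupby('Side')['Transported'].mean()`.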

In [None]:
from sklearn.preprocessing import LabelEncoder  # used later when encoding the features

# Build the 2x2 contingency table directly; crosstab leaves out rows where Side is missing
count = pd.crosstab(trainingData['Transported'], trainingData['Side'])

# Order and label the rows and columns explicitly
count = count.loc[[True, False], ['P', 'S']]
count.index = ['Transported', 'Not Transported']
count.columns = ['Port', 'Starboard']

With the counts I have created, let's draw a heatmap and run the chi-squared test.
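As a small worked illustration of what the test computes, the sketch below builds the expected counts by hand for a hypothetical 2x2 table (the numbers are invented) and compares them with `scipy.stats.chi2_contingency`. Under independence, each expected count is the row total times the column total divided by the grand total. Note that SciPy applies a continuity correction to 2x2 tables by default, so `correction=False` is passed here only to make the hand calculation match exactly.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of counts (rows: outcome, columns: side of ship).
observed = np.array([[30, 50],
                     [40, 30]])

# Expected count for each cell: (row total * column total) / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_by_hand = row_totals * col_totals / observed.sum()

# correction=False disables the continuity correction so the statistic
# equals the plain sum of (observed - expected)^2 / expected.
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
chi2_by_hand = ((observed - expected_by_hand) ** 2 / expected_by_hand).sum()

print(expected_by_hand)
print(chi2, chi2_by_hand, p, dof)
```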

In [None]:
plt.figure(figsize=(12,8)) 
# fmt='g' prints the counts as plain numbers instead of scientific notation
sns.heatmap(count, annot=True, fmt='g', cmap="YlGnBu")
plt.show()

In [None]:
from scipy.stats import chi2_contingency 

c, p, dof, expected = chi2_contingency(count) 

# A small p-value is evidence against independence; a large one only means we cannot reject it
if p < .05:
    print('p = ' + str(p) + ' which is less than 5%, thus we reject the hypothesis that side of the ship and transported are independent')
else: 
    print('p = ' + str(p) + ' which is not less than 5%, thus we cannot reject the hypothesis that side of the ship and transported are independent')

Now that we have these correlations and have shown that the difference between sides of the ship is statistically significant, let's move on to data cleaning and model creation. That means imputing missing values and encoding the categorical data as numerical values so that it is usable in the modelling process.

**Data Cleaning**

In [None]:
print("Null Count for all Columns")
print(trainingData.isnull().sum(axis = 0))

Now let's build the preprocessor to handle all of these null values and to encode the data as numerical values.
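To make the imputation step concrete, here is a minimal sketch of `KNNImputer` on a tiny made-up array. Each missing entry is filled with the average of that feature over the nearest rows, where distance is measured on the features both rows have present. `n_neighbors=2` is chosen only to keep the example easy to follow; the imputers in this notebook use the default settings.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Tiny made-up array with two missing entries.
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Fill each gap with the mean of that column over the 2 nearest rows.
imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(X_toy)
print(filled)
```

The gap in the first row is filled from the second and third rows, and the gap in the third row from the second and fourth.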

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer


Categorical_Cleaner = Pipeline(steps=[('imputer', KNNImputer())])

Numerical_Cleaner = Pipeline(steps=[('imputer', KNNImputer())])

Now that we have defined the preprocessor, let's split the data into a training and a testing set, and begin to assemble a few separate model pipelines to train the models, select the most important features, and then retrain a new model.
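One way to keep preprocessing honest across the split is to bundle it with the model, so that the imputer and scaler are fitted on the training rows only and then reused on the held-out rows. The sketch below is an illustration on synthetic data, with `LogisticRegression` as a stand-in model. It is not the exact setup used in this notebook.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 200 rows, 3 numeric features, about 10% missing.
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

# fit() learns the imputer and scaler from the training rows only;
# score() reuses them unchanged on the held-out rows.
model = make_pipeline(KNNImputer(), StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)
holdout_accuracy = model.score(X_te, y_te)
print(holdout_accuracy)
```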

In [None]:
from sklearn.model_selection import train_test_split

# A 1-D integer Series (True -> 1, False -> 0) is the target shape scikit-learn expects
y = trainingData.Transported.astype(int)

features = ['Deck', 'VIP', 'GroupPassengerId', 
            'CryoSleep', 'GroupId', 'Destination', 
            'HomePlanet', 'Side', 'Age',
            'TotalSpending']

# Work on a copy so the encoded values are not written into a slice of the raw frame
X = trainingData[features].copy()
s = (X.dtypes == 'object')
object_cols = list(s[s].index)
num_cols = X._get_numeric_data().columns
for object_col in object_cols:
    X[object_col] = LabelEncoder().fit_transform(X[object_col])

X[object_cols] = Categorical_Cleaner.fit_transform(X[object_cols])
X[num_cols] = Numerical_Cleaner.fit_transform(X[num_cols])
spendingScaler = StandardScaler()
X['TotalSpending'] = spendingScaler.fit_transform(X[['TotalSpending']]).ravel()

trainX, testX, trainy, testy = train_test_split(X, y, test_size = .25, random_state=42) 

targetX = testingData[features].copy()
# Match the training dtypes so both frames share the same numeric/object column split
targetX[['GroupId', 'GroupPassengerId']] = targetX[['GroupId', 'GroupPassengerId']].astype(int)
for object_col in object_cols:
    targetX[object_col] = LabelEncoder().fit_transform(targetX[object_col])

# Reuse the imputers and scaler fitted on the training data instead of refitting them on the test set
targetX[object_cols] = Categorical_Cleaner.transform(targetX[object_cols])
targetX[num_cols] = Numerical_Cleaner.transform(targetX[num_cols])
targetX['TotalSpending'] = spendingScaler.transform(targetX[['TotalSpending']]).ravel()

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

Naive_Bayes_Model = GaussianNB()
LogisticModel = LogisticRegression(max_iter=10000, tol=0.1)

from xgboost import XGBClassifier

XGB_model = XGBClassifier()

from sklearn.ensemble import GradientBoostingClassifier

gradientBoost = GradientBoostingClassifier(n_estimators = 100, random_state = 42)

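from keras.models import Sequential
from keras.layers import Dense
# Note: keras.wrappers.scikit_learn was removed from recent Keras releases,
# so this import only works on older versions
from keras.wrappers.scikit_learn import KerasClassifier

def KerasBuilder():
    model = Sequential()
    model.add(Dense(10, input_dim=10, activation = 'relu'))
    model.add(Dense(5, activation = 'relu'))
    # binary_crossentropy expects a single sigmoid output rather than two relu units
    model.add(Dense(1, activation = 'sigmoid'))
    model.compile(loss = 'binary_crossentropy', 
                  optimizer = 'adam', 
                  metrics = ['accuracy'])
    return model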

KerasEstimator = KerasClassifier(build_fn = KerasBuilder, 
                                 epochs = 200, 
                                 batch_size = 5, 
                                 verbose = False)

Now that we have the models defined and the pipelines built, let's train the models and evaluate them on the test set.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

bayespreds = Naive_Bayes_Model.fit(trainX, trainy).predict(testX)

logpreds = LogisticModel.fit(trainX, trainy).predict(testX)

gradientboostpreds = gradientBoost.fit(trainX, trainy).predict(testX)

# eval_metric only applies to evaluation sets, and none is passed here, so it is left out
XGB_Model = XGB_model.fit(trainX, trainy)
#KerasEstimator.fit(trainX, trainy, epochs=150, batch_size=10, verbose=0)
#Keraspreds = pd.DataFrame(KerasEstimator.predict(testX))

Now for the moment we've all been waiting for: the scoring of the initial models. Let's see which of these survives the first round of eliminations.

In [None]:
# accuracy_score takes the true labels first, then the predictions
#print("Keras Scores: " + str(accuracy_score(testy, Keraspreds)))
print("Bayes score: " + str(accuracy_score(testy, bayespreds)))
print("Logistic score: " + str(accuracy_score(testy, logpreds)))
print('Ensemble learning score: ' + str(accuracy_score(testy, gradientboostpreds)))
print("XGB score: " + str(XGB_Model.score(testX, testy)))
print('XGB training score: ' + str(XGB_Model.score(trainX, trainy)))

In the end, the XGB Classifier is the winner, slipping past the ensemble learning algorithm by roughly 0.003 accuracy. The next step is to search for the best hyperparameters and complete feature selection. This is going to be interesting as I have never done this before, so here goes nothing.
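For the feature selection part, one simple approach is to rank features by a tree model's `feature_importances_` and keep those above a threshold. The sketch below is only an illustration on synthetic data. It uses scikit-learn's `GradientBoostingClassifier` to stay dependency-light (`XGBClassifier` exposes the same attribute), and the "at least the mean importance" cutoff is an arbitrary choice, not something taken from this notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): only the first two of five columns carry signal.
X_toy = rng.normal(size=(500, 5))
y_toy = (X_toy[:, 0] - X_toy[:, 1] > 0).astype(int)

booster = GradientBoostingClassifier(n_estimators=50, random_state=42)
booster.fit(X_toy, y_toy)

importances = booster.feature_importances_
print(importances)

# Keep the features whose importance is at least the mean importance.
kept = importances >= importances.mean()
print(kept)
```

With a DataFrame, `X.columns[kept]` would give the names of the retained features.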

In [None]:
#The parameter grid is defined below for XGBClassifier
params = {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [3, 4, 5]
        }

from datetime import datetime
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

#Because this is going to take a while, I'm stealing this timer function to time it. 
def timer(start_time = None):
    if not start_time:
        start_time = datetime.now()
        return start_time
    elif start_time: 
        thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
        tmin, tsec = divmod(temp_sec, 60)
        print('\n Time taken: %i hours %i minutes and %s seconds.' % (thour, 
                                                                      tmin, 
                                                                      round(tsec, 2)))
                                                                      

I've placed the random search below in its own cell so that it can be commented out, because it is going to take a while.
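The search is scored with ROC AUC, and the results cell reports a "normalized gini" computed as `2 * AUC - 1`. This rescales AUC so that random ranking scores 0 and perfect ranking scores 1. A tiny worked example with made-up labels and scores:

```python
from sklearn.metrics import roc_auc_score

# Tiny hypothetical example: true labels and predicted scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the share of (positive, negative) pairs ranked in the right order:
# here 3 of the 4 pairs are ordered correctly.
auc = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1
print(auc, gini)
```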

In [None]:
# verbosity and n_jobs replace the deprecated silent and nthread arguments
xgb = XGBClassifier(learning_rate=0.02, n_estimators=600, objective='binary:logistic',
                    verbosity=0, n_jobs=1)


folds = 5
param_comb = 50
skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)

random_search = RandomizedSearchCV(xgb, param_distributions=params, n_iter=param_comb,
                                   scoring='roc_auc', n_jobs=4, cv=skf.split(X,y),
                                   verbose=0, random_state=42)


start_time = timer(None) 
random_search.fit(X, y)
timer(start_time) 

print('\n All results:')
print(random_search.cv_results_)
print('\n Best estimator:')
print(random_search.best_estimator_)
print('\n Best normalized gini score for %d-fold search with %d parameter combinations:' % (folds, param_comb))
print(random_search.best_score_ * 2 - 1)
print('\n Best hyperparameters:')
print(random_search.best_params_)

One of the best ways to increase model accuracy is to increase the amount of training data available to the model, so in the next code block I will fit the model first to the training data to evaluate its accuracy, and then to the entire dataset. After some testing, I've found that the randomized search hurt performance considerably, so I will simply use the initial XGB model created above.

In [None]:
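optimizedXGB = random_search.best_estimator_
# best_estimator_ was refit on all of X by the search, so refit on the training split before scoring
optimizedXGB.fit(trainX, trainy)
print('Tuned XGB training score: ' + str(optimizedXGB.score(trainX, trainy)))
print('Tuned XGB test score: ' + str(optimizedXGB.score(testX, testy)))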

As we can see, the model's performance on the training data suffers considerably, while its performance on the testing data improves only moderately. Because of this, I will go back to the original Naive Bayes model, repeat the process of parameter optimization, and see if we get better results.

In [None]:
#Parameter matrix for a gaussian NB model 
NBParams = {'var_smoothing': np.logspace(0,-9, num=100)}

start_timeGNB = timer(None) 
tunedNB = GridSearchCV(estimator = Naive_Bayes_Model, param_grid = NBParams, cv = skf.split(X,y),verbose=1, scoring='accuracy')
tunedNB.fit(X,y)
timer(start_timeGNB)

In [None]:

TunedNB = tunedNB.best_estimator_
TunedNB.fit(trainX,trainy)
print('Tuned NB training score: ' + str(TunedNB.score(trainX, trainy)))
print('Tuned NB test score: ' + str(TunedNB.score(testX, testy)))

The following code saves the model to a pickle file so that I don't have to do this again.
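A quick way to gain confidence in a saved model is to check that it predicts the same as the original after a round trip. This is a minimal sketch on synthetic data with an in-memory round trip; the file-based `pickle.dump` and `pickle.load` calls behave the same way.

```python
import pickle

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))          # synthetic features (hypothetical)
y_toy = (X_toy[:, 0] > 0).astype(int)

original = GaussianNB().fit(X_toy, y_toy)

# Serialize to bytes and load back again.
restored = pickle.loads(pickle.dumps(original))

same_predictions = np.array_equal(original.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Pickled scikit-learn models are generally only safe to load with the same library version that saved them, and pickle files from untrusted sources should not be loaded at all.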

In [None]:
import pickle

with open('GNB_tuned.pkl', 'wb') as file:
    pickle.dump(TunedNB, file)
with open('GNB_tuned.pkl', 'rb') as model:
    finalmodel = pickle.load(model)

The code below makes predictions with the best model and writes them to the submission file.
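As a sketch of the target layout, assuming the sample submission has a `PassengerId` column followed by a boolean `Transported` column, the frame can also be built in one step. The ids and predictions below are made up for illustration.

```python
import pandas as pd

# Hypothetical ids and 0/1 predictions standing in for the real test set.
passenger_ids = pd.Series(["0013_01", "0018_01", "0019_01"])
raw_preds = [1, 0, 1]

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    # Convert the 0/1 predictions to booleans.
    "Transported": pd.Series(raw_preds).astype(bool),
})

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```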

In [None]:
preds = pd.DataFrame(TunedNB.predict(targetX))
print(preds)
EXSUB = pd.read_csv('../input/spaceship-titanic/sample_submission.csv')
# Convert the 0/1 predictions back to booleans
finalPreds = preds.astype(bool)
finalPreds.insert(0, "PassengerId", testingData.PassengerId)
finalPreds.columns = EXSUB.columns
print(finalPreds)
finalPreds.to_csv('submission.csv', index = False)