## Importing the libraries and dataset

In [None]:
import pandas as pd
import numpy as np
import copy
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

In [None]:
titanic_train_df = pd.read_csv('../input/spaceship-titanic/train.csv')

## Data Analysis

In [None]:
# dimensions of the original training dataset
titanic_train_df.shape

The dataset has 8693 records and 14 attributes.

In [None]:
# training data column information
titanic_train_df.info()

In [None]:
# examining the first few rows of the training data
titanic_train_df.head()

### Visualizing the null values in each column

In [None]:
plt.bar(titanic_train_df.columns, titanic_train_df.isna().sum())
plt.xticks(rotation = 90)
plt.show()

The bar chart above shows that every column in the dataset except PassengerId and Transported has missing values.
Since most of these columns have roughly 200 missing values each, they are examined further for any pattern in where the values are missing.
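The counts read off the bar chart can also be tabulated directly. The sketch below uses a small made-up frame (`toy_df` and its values are illustrative assumptions, not the competition data) to show the missing count and percentage per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy_df = pd.DataFrame({
    'PassengerId': ['0001_01', '0002_01', '0003_01', '0003_02'],
    'Age': [39.0, np.nan, 58.0, 33.0],
    'VIP': [False, False, np.nan, np.nan],
})

# Count and share of missing values per column, largest first.
missing_summary = pd.DataFrame({
    'missing': toy_df.isna().sum(),
    'percent': toy_df.isna().mean() * 100,
}).sort_values('missing', ascending=False)

print(missing_summary)
```

The same two expressions should work unchanged on `titanic_train_df`.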

### Using a seaborn library heatmap to visually identify any patterns in missing values

In [None]:
plt.figure(figsize = (15, 10))
sns.heatmap(titanic_train_df.loc[:, ~titanic_train_df.columns.isin(['PassengerId', 'Transported'])].isna(), cmap = 'YlGnBu')
plt.show()

The heatmap of missing values shows no sign of a pattern among the records. To further support this claim, a heatmap from the missingno package is used. This package contains various charts and dendrograms that can be used to analyse the missing data in a dataset.

### Analysing missing data using heatmap from missingno package

In [None]:
msno.heatmap(titanic_train_df, figsize = (15, 10))

It is evident from the above heatmap that there is no pattern among the missing values in the dataset. Therefore, further analysis is needed to choose suitable values with which to impute the missing information.
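The missingno heatmap visualises nullity correlation, i.e. how strongly the absence of a value in one column goes together with its absence in another. A minimal sketch of that quantity in plain pandas, on made-up data (`toy_df` is an illustrative assumption), might look like this:

```python
import numpy as np
import pandas as pd

# Toy data: Spa and VRDeck are missing in exactly the same rows, Age is not.
toy_df = pd.DataFrame({
    'Spa':    [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    'VRDeck': [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    'Age':    [np.nan, 20.0, 30.0, 40.0, np.nan, 60.0],
})

# 1 where a value is missing, 0 otherwise; then correlate the indicator columns.
nullity = toy_df.isna().astype(int)
nullity_corr = nullity.corr()

print(nullity_corr.round(2))
```

A coefficient near 1 means two columns tend to be missing together, near -1 that one is present when the other is missing, and near 0 that there is no linear relationship between their missingness.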

### Distribution of numerical values in the dataset

In [None]:
numerical_cols = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

plt.figure(figsize = (15, 15))
plt.subplots_adjust(hspace = 0.25)
for i in range(0, len(numerical_cols)):
    plt.subplot(3, 2, i + 1)
    plt.title(numerical_cols[i])
    plt.hist(titanic_train_df.loc[:, numerical_cols[i]])

The histograms show that the RoomService, FoodCourt, ShoppingMall, Spa and VRDeck columns are far more skewed than the Age column. The mean is a reasonable imputation value when the data is roughly symmetric (for example, normally distributed), whereas the median is more robust for skewed data. Age is therefore imputed with the mean and the spending columns with the median.
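To illustrate why the median is preferred for skewed columns, the sketch below compares the mean and median of two made-up series (`spend` and `age` are illustrative assumptions, not values from the dataset).

```python
import pandas as pd

# Made-up spending amounts: mostly zeros with one large outlier (right-skewed).
spend = pd.Series([0, 0, 0, 0, 10, 20, 50, 5000], dtype=float)
# Made-up ages: roughly symmetric around the centre.
age = pd.Series([18, 22, 25, 27, 29, 31, 35, 38], dtype=float)

for name, s in [('spend', spend), ('age', age)]:
    print(name,
          '| skew:', round(s.skew(), 2),
          '| mean:', s.mean(),
          '| median:', s.median())
```

For the skewed series the single outlier pulls the mean to 635.0 while the median stays at 5.0. For the symmetric series the two are almost identical (28.125 and 28.0).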

## Preprocessing

During the analysis stage, it was observed that the PassengerId column contains both the group number and the passenger's number within the group. It can be used to extract the number of passengers in each group. This feature could help, as the size of the group a passenger travels with might play a role in the outcome.

Similarly, the Cabin column holds three pieces of information: Deck, Deck Number and Deck Side. They are extracted into individual columns.

The last name is extracted from the Name column for two reasons:
1. The number of family members in the same group may influence the outcome.
2. Members of certain families may have received privileges.

In [None]:
def cabin_transform(cabin, index):
    # pd.isna also catches missing values that are not the np.nan singleton
    if pd.isna(cabin):
        return cabin
    else:
        return str(cabin).split('/')[index]
    
def modify_features(orig_df):
    df = copy.deepcopy(orig_df)
    
    df.insert(0, 'PassengerGroup', df['PassengerId'].transform(lambda passengerId: int(passengerId.split('_')[0])))
    df.insert(1, 'GroupCount', df.groupby('PassengerGroup')['PassengerId'].transform('count'))

    df['Deck'] = df['Cabin'].transform(lambda cabin: cabin_transform(cabin, 0))
    df['DeckNumber'] = df['Cabin'].transform(lambda cabin: cabin_transform(cabin, 1))
    df['DeckSide'] = df['Cabin'].transform(lambda cabin: cabin_transform(cabin, 2))
    df['FamilyName'] = df['Name'].transform(lambda name: name if pd.isna(name) else str(name).split(' ')[-1])
    
    return df
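The same features can also be derived with pandas' vectorised string methods, which pass missing values through without an explicit check. This is an alternative sketch on made-up rows (`toy_df` and the names in it are illustrative assumptions), not the function used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows in the same format as the competition columns.
toy_df = pd.DataFrame({
    'PassengerId': ['0001_01', '0003_01', '0003_02'],
    'Cabin': ['B/0/P', np.nan, 'A/0/S'],
    'Name': ['Jane Doe', np.nan, 'John Roe'],
})

# Group number is the part before the underscore; group size is a per-group count.
toy_df['PassengerGroup'] = toy_df['PassengerId'].str.split('_').str[0].astype(int)
toy_df['GroupCount'] = toy_df.groupby('PassengerGroup')['PassengerId'].transform('count')

# Split Cabin into three columns; rows with a missing Cabin stay missing.
cabin_parts = toy_df['Cabin'].str.split('/', expand=True)
cabin_parts.columns = ['Deck', 'DeckNumber', 'DeckSide']
toy_df = toy_df.join(cabin_parts)

# Family name is the last whitespace-separated token of Name.
toy_df['FamilyName'] = toy_df['Name'].str.split(' ').str[-1]

print(toy_df)
```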

Numerical columns such as Age and RoomService are converted to categories by binning their values. This also helps to mitigate the effect of imputing data in these columns.

In [None]:
def bin_numerical_values(orig_df):
    df = copy.deepcopy(orig_df)
    
    ageLabels = ['children', 'youth', 'adult', 'senior']
    amountLabels = ['< 1000', '< 2000', '> 2000']
    
    df['Age'] = pd.cut(df['Age'], bins = [0, 15, 24, 64, np.inf], labels = ageLabels, include_lowest=True)
    
    for col in ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']:
        df[col] = pd.cut(df[col], bins = [0, 1000, 2000, np.inf], labels = amountLabels, include_lowest = True)
    
    return df
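A small check of how `pd.cut` assigns the boundary values may help here. The sketch below applies the age bins from the function above to a few made-up ages (`ages` is an illustrative assumption).

```python
import numpy as np
import pandas as pd

# Made-up ages sitting on and around the bin edges, plus one missing value.
ages = pd.Series([0, 15, 16, 24, 25, 64, 65, np.nan])
age_labels = ['children', 'youth', 'adult', 'senior']

# Intervals are closed on the right; include_lowest=True also admits the value 0.
binned = pd.cut(ages, bins=[0, 15, 24, 64, np.inf], labels=age_labels, include_lowest=True)

print(binned.tolist())
# ['children', 'children', 'youth', 'youth', 'adult', 'adult', 'senior', nan]
```

Each upper edge belongs to the lower bin (15 is 'children', 24 is 'youth'), and a missing input stays missing, which is why imputation has to happen before binning.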
 

The missing values in the numerical columns are now imputed using the strategy described above. For all the other columns, the missing values are marked as 'unknown' and grouped into a separate category.

As part of preprocessing, all the categorical values are encoded into numerical values using OrdinalEncoder, while the output column is encoded using LabelEncoder.

The fitted encoders and imputers are stored in separate dictionaries so that they can be used to transform the test data with the information fitted on the training data.
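The fit-on-train, transform-on-test idea can be sketched on two tiny made-up frames (`train`, `test` and their values are illustrative assumptions). The median comes from the training rows only, and a category that never appeared in training is mapped to -1.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({'Spa': [0.0, 10.0, 100.0, np.nan],
                      'HomePlanet': ['Earth', 'Mars', 'Europa', 'Earth']})
test = pd.DataFrame({'Spa': [np.nan, 5.0],
                     'HomePlanet': ['Mars', 'Venus']})  # 'Venus' never appears in train

# Fit on the training data only, then reuse the fitted object on the test data.
median_imputer = SimpleImputer(strategy='median')
median_imputer.fit(train[['Spa']])
test_spa = median_imputer.transform(test[['Spa']])

encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(train[['HomePlanet']])
test_planet = encoder.transform(test[['HomePlanet']])

print(test_spa.ravel())     # missing Spa filled with the training median
print(test_planet.ravel())  # unseen category encoded as -1
```

Fitting the imputers and encoders on the test data instead would leak information from it and could assign different codes to the same category.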

In [None]:
def encode_output(y):
    encoder = LabelEncoder()
    return encoder.fit_transform(y)

def preprocess_training_data(orig_df):
    df = copy.deepcopy(orig_df)
    
    df = modify_features(df)
    y = encode_output(df['Transported'])
    df = df.drop(['PassengerGroup', 'PassengerId', 'Cabin', 'Name', 'Transported'], axis = 1)
    
    cols = df.columns
    
    medianImputer = SimpleImputer(strategy = 'median')
    constantImputer = SimpleImputer(strategy = 'constant', fill_value = 'unknown')
    meanImputer = SimpleImputer(strategy = 'mean')
    
    df[['Age']] = meanImputer.fit_transform(df[['Age']])
    df[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']] = medianImputer.fit_transform(df[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']])
    df = constantImputer.fit_transform(df)
    
    imputers = {
        'constant' : constantImputer, 
        'median' : medianImputer,
        'mean' : meanImputer
    }
    
    df = pd.DataFrame(df, columns = cols)
    df = df.convert_dtypes()
    
    df = bin_numerical_values(df)
        
    encoder_models = {}
    
    for col in df.columns:
        encoder = OrdinalEncoder(handle_unknown = 'use_encoded_value', unknown_value = -1)
        if df[col].dtypes != 'object':
            df[[col]] = encoder.fit_transform(df[[col]].astype('category'))
        else:
            df[[col]] = encoder.fit_transform(df[[col]].astype('string'))

        encoder_models[col] = encoder
        
    return df, y, imputers, encoder_models

In [None]:
train_df, y, imputers, encoders = preprocess_training_data(titanic_train_df)

In [None]:
train_df.head()

In [None]:
train_df.isna().sum()

The imputers and encoders obtained while preprocessing the training data are used to transform the test data.

In [None]:
def preprocess_test_data(orig_df, imputers, encoders):
    
    df = copy.deepcopy(orig_df)
    
    df = modify_features(df)
    df = df.drop(['PassengerGroup', 'PassengerId', 'Cabin', 'Name'], axis = 1)
    
    cols = df.columns
    
    df[['Age']] = imputers['mean'].transform(df[['Age']])
    df[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']] = imputers['median'].transform(df[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']])
    df = imputers['constant'].transform(df)
    
    df = pd.DataFrame(df, columns = cols)
    df = df.convert_dtypes()
    
    df = bin_numerical_values(df)
    
    for col in encoders:
        if df[col].dtypes != 'object':
            df[[col]] = encoders[col].transform(df[[col]].astype('category'))
        else:
            df[[col]] = encoders[col].transform(df[[col]].astype('string'))
        
    return df

In [None]:
titanic_test_df = pd.read_csv('../input/spaceship-titanic/test.csv')
test_df = preprocess_test_data(titanic_test_df, imputers, encoders)

In [None]:
test_df.head()

In [None]:
test_df.isna().sum()

## Prediction

Since the test set has no labels against which to evaluate the model, the training data is split into a training set and a hold-out validation set, and accuracy is used as the metric to compare different models.
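A minimal sketch of such a hold-out split on synthetic data follows. The `stratify=y` argument is an addition for illustration and is not used in this notebook. It keeps the class proportions the same in both parts, which can make a small hold-out set less noisy.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 3 features, perfectly balanced binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 50 + [1] * 50)

# Hold out 10% of the rows; stratify=y is an assumption, not part of the notebook.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.10, shuffle=True, random_state=0, stratify=y)

print(X_tr.shape, X_val.shape)
print('share of positives in hold-out:', y_val.mean())
```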

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

train_X, test_X, train_y, test_y = train_test_split(train_df, y, test_size = 0.10, shuffle = True, random_state = 0)

In [None]:
print('X train shape : ', train_X.shape)
print('y train shape : ', train_y.shape)
print('X test shape : ', test_X.shape)
print('y test shape : ', test_y.shape)

### Logistic Regression

Since the model was underfitting, the max_iter parameter (the solver's iteration limit) was increased to 2000. Other hyperparameters were also tuned, but those changes were removed because they did not improve the model's performance.

In [None]:
from sklearn.metrics import accuracy_score
clf = LogisticRegression(max_iter = 2000)
clf.fit(train_X, train_y)
y_train_pred = clf.predict(train_X)
print('Training accuracy : ', accuracy_score(train_y, y_train_pred))
y_pred = clf.predict(test_X)
print('Test accuracy : ', accuracy_score(test_y, y_pred))

### Decision Tree Classifier

When the Decision Tree Classifier was fitted with the default hyperparameters, the training accuracy was close to 99% while the test accuracy was 73%. This is a clear case of overfitting. Therefore, to reduce the variance, the max_depth value was adjusted until the model stopped improving.

In [None]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth = 14, random_state = 0)
clf.fit(train_X, train_y)
y_train_pred = clf.predict(train_X)
print('Training accuracy : ', accuracy_score(train_y, y_train_pred))
y_pred = clf.predict(test_X)
print('Test accuracy : ', accuracy_score(test_y, y_pred))
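The depth was tuned by hand here. As an alternative, the search could be automated with the GridSearchCV class that is already imported. The sketch below runs on synthetic data from `make_classification`, and the candidate depths are illustrative assumptions, so the selected depth will not match the one used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

# Try each candidate depth with 5-fold cross-validation and keep the most accurate one.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [2, 4, 6, 8, 10, 12, 14, None]},
    scoring='accuracy',
    cv=5,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Because every candidate is scored on folds the tree was not trained on, this selects the depth on validation accuracy rather than training accuracy.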

Comparing the two models above, the decision tree classifier is chosen for the final prediction on the test data.

In [None]:
test_pred = clf.predict(test_df)

In [None]:
submission = pd.DataFrame({
    'PassengerId' : titanic_test_df['PassengerId'],
    'Transported' : test_pred.astype(bool)
})
submission.to_csv('./submission.csv', index = False)