# Space Titanic

This notebook follows the **OSEMN** framework. <br>
Here is a quick recap:<br>
**O** - Obtaining the data (Collect the data and transform it into a suitable format) <br>
**S** - Scrubbing / cleaning the data (Try to understand errors and handle missing values) <br>
**E** - Exploring the data (Statistical analysis, visualisation, feature engineering) <br>
**M** - Modeling (Model training and tuning) <br>
**N** - iNterpreting (Evaluation of the model performance) <br>

# Obtaining the data
Kaggle provides the datasets ready to use, so this step is omitted.

# S is for scrubbing

In [None]:
import pandas as pd

space_titanic = pd.read_csv('../input/spaceship-titanic/train.csv')
space_titanic.head()

In [None]:
space_titanic.describe().T

In [None]:
# proportion of null values in every column
space_titanic.isnull().sum() / len(space_titanic)

We see that the proportion of null values is about 2% in all numerical columns. At the same time, the 75% quartile is low for every numerical column except ```Age```, which means it is reasonable to fill the missing values in those columns with zeros and to fill ```Age``` with its mean. For the categorical columns we will use a more sophisticated approach described below.
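
One way to back up the zero-fill choice is to check how often each spending column is exactly zero. This is only a sketch on a small made-up frame: the values are invented, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two spending columns; values are invented for illustration
toy = pd.DataFrame({
    'RoomService': [0.0, 0.0, 120.0, np.nan, 0.0],
    'Spa': [0.0, 35.0, 0.0, 0.0, np.nan],
})

# Share of rows that are exactly zero, and share that are missing
zero_share = (toy == 0).mean()
null_share = toy.isnull().mean()

# Filling the gaps with 0 follows the dominant value in each column
filled = toy.fillna(0)
print(zero_share, null_share, sep='\n')
```

When zero is by far the most common value, filling a few gaps with zero barely changes the distribution.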


In [None]:
space_titanic.dtypes

In [None]:
from sklearn.impute import SimpleImputer

df = space_titanic.copy()


df['Age'] = df['Age'].fillna(df['Age'].mean())

num_cols = df.select_dtypes(include=['float64']).columns.tolist()
num_cols.remove("Age")

df[num_cols] = df[num_cols].fillna(0)

Time to do some feature engineering. The first thing that comes to mind is to split ```PassengerId``` into 2 columns and ```Cabin``` into 3 columns. <br>
The feature meanings come from the dataset description: <br>
1. ```PassengerNum``` - the number of the group the passenger is travelling with (groups = families, etc.).
2. ```PassengerGroup``` - the number of the passenger within the group.
3. ```CabinDeck``` - the deck on which passenger is staying.
4. ```CabinNum``` - the cabin number where the passenger is staying.
5. ```CabinSide``` - can be either port or starboard.
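
To make the split concrete, here is a small sketch on two made-up rows that follow the `gggg_pp` and `deck/num/side` formats from the dataset description. The row values are invented for illustration.

```python
import pandas as pd

# Two made-up rows in the same format as the dataset columns
toy = pd.DataFrame({
    'PassengerId': ['0003_01', '0003_02'],
    'Cabin': ['A/0/S', 'A/0/S'],
})

# expand=True returns one column per piece, so the result can be assigned to new columns
toy[['PassengerNum', 'PassengerGroup']] = toy['PassengerId'].str.split('_', expand=True)
toy[['CabinDeck', 'CabinNum', 'CabinSide']] = toy['Cabin'].str.split('/', expand=True)

print(toy[['PassengerNum', 'PassengerGroup', 'CabinDeck', 'CabinNum', 'CabinSide']])
```

Note that every piece is still a string after the split, so columns such as `CabinNum` need a numeric conversion before any arithmetic.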

In [None]:
df[['PassengerNum', 'PassengerGroup']] = df.PassengerId.str.split('_', expand=True)
df[['CabinDeck', 'CabinNum', 'CabinSide']] = df.Cabin.str.split('/', expand=True)
df[['FName', 'SName']] = df.Name.str.split(' ', expand=True)

Get the total number of group members within each group and drop all the used columns. 

In [None]:
# size of each group, broadcast back onto every member of that group
df['GroupCount'] = df.groupby('PassengerNum')['PassengerNum'].transform('count')

df = df.drop(columns=['PassengerId', 'Cabin', 'Name'])
df.head()

In [None]:
print(df.SName.nunique() / len(df))
print(df.FName.nunique() / len(df))

The proportion of unique names is low so I impute missing values in the ```FName``` & ```SName``` columns using the most frequent strategy.

In [None]:
name_cols = ['FName', 'SName']
imputer = SimpleImputer(strategy='most_frequent').fit(df[name_cols])
df[name_cols] = imputer.transform(df[name_cols])

In [None]:
# add a column with the total expense per passenger
df['TotalExpence'] = df.RoomService + df.FoodCourt + df.ShoppingMall + df.Spa + df.VRDeck

In [None]:
# add total group expense (kept disabled: GroupExpence is not used further on)
# df['GroupExpence'] = df.groupby('PassengerNum')['TotalExpence'].transform('sum')

The data is now cleaner and more meaningful, so let's move on to the next step.

# E is for exploration and visualization

In [None]:
def zero_expences(row):
    if row['RoomService'] == 0 and row['FoodCourt'] == 0 \
    and row['ShoppingMall'] == 0 and row["Spa"] == 0 and row['VRDeck'] == 0:
        return True
    else:
        return False

df['ZeroExpences'] = df.apply(lambda row: zero_expences(row), axis = 1)
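
The same flag can be computed without a row-wise `apply`. A minimal sketch on two invented rows, assuming the five spending columns have already had their nulls filled:

```python
import pandas as pd

expense_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Toy rows with invented values: the first passenger spent nothing
toy = pd.DataFrame(
    [[0.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 15.0, 0.0, 0.0, 0.0]],
    columns=expense_cols,
)

# True only when every spending column in the row equals zero
toy['ZeroExpences'] = toy[expense_cols].eq(0).all(axis=1)
print(toy['ZeroExpences'].tolist())
```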

In [None]:
# add a feature that describes family size
def family_size_splitter(row):
    family_size = ''
    if row == 1:
        family_size = 'Solo'
    elif row <= 3:
        family_size = 'Small'
    elif row <= 5:
        family_size = 'Medium'
    else:
        family_size = 'Large'
    return family_size

df['FamilySize'] = df.apply(lambda row: family_size_splitter(row['GroupCount']), axis = 1)
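
The if/elif ladder above is a binning rule, so `pd.cut` could express it directly. A sketch with the same thresholds on a few example group sizes:

```python
import numpy as np
import pandas as pd

group_count = pd.Series([1, 2, 3, 4, 5, 8], name='GroupCount')

# Right-closed bins: (0, 1], (1, 3], (3, 5], (5, inf]
family_size = pd.cut(
    group_count,
    bins=[0, 1, 3, 5, np.inf],
    labels=['Solo', 'Small', 'Medium', 'Large'],
)
print(family_size.tolist())
```

The result is a categorical column rather than plain strings; `.astype(str)` would turn it back into strings if a later step expects an object column.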

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

cols = ['HomePlanet',
        'CryoSleep',
        'VIP',
        'Destination',
        'CabinDeck',
        'CabinSide',
        'PassengerGroup',
        'GroupCount',
        'ZeroExpences',
        'FamilySize']

fig, axs = plt.subplots(2, 5, figsize=(20, 15))
# draw each count plot on its own axes from the grid
for ax, col in zip(axs.flat, cols):
    sns.countplot(data = df, x = col, hue = 'Transported', ax = ax)
plt.show()

In [None]:
# displot creates its own figure, so the size is set through height and aspect
sns.displot(data = df,
            x = 'Age',
            hue = 'Transported',
            kde = True,
            height = 8,
            aspect = 1.5)

In [None]:
isVip = df[df.VIP == True]
notVip = df[df.VIP == False]

isVip.describe().T

In [None]:
notVip.describe().T

In [None]:
plt.figure(figsize=(10, 10))
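# correlation is only defined for numeric and boolean columns
sns.heatmap(df.select_dtypes(include=['number', 'bool']).corr())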

# Further preprocessing

The total-expense threshold used below is not random: 1500 is slightly bigger than the 75% quartile of ```TotalExpence``` for passengers who are not VIP.

In [None]:
def fill_vip(row):
    # fill only the missing VIP flags; known values are kept as they are
    if pd.isna(row['VIP']):
        return row['TotalExpence'] > 1500
    return row['VIP']

df['VIP'] = df.apply(lambda row: fill_vip(row), axis = 1)
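
If the intent is to fill only the missing VIP flags from the spending threshold, the same rule can be written without `apply`. A sketch on invented rows, reusing the 1500 threshold from `fill_vip`:

```python
import pandas as pd

# Toy rows with invented values; None marks a missing VIP flag
toy = pd.DataFrame({
    'VIP': [True, False, None, None],
    'TotalExpence': [300.0, 5000.0, 200.0, 4000.0],
})

# Guess from the spending threshold (1500, the value used in fill_vip)
guess = toy['TotalExpence'] > 1500

# Keep known flags, use the guess only where VIP is missing
toy['VIP'] = toy['VIP'].where(toy['VIP'].notna(), guess).astype(bool)
print(toy['VIP'].tolist())
```

The second row stays `False` despite the high spend, because a known flag is never overwritten.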

In [None]:
df.isnull().sum()

In [None]:
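from sklearn.compose import ColumnTransformer

cat_cols = ['HomePlanet',
            'CryoSleep',
            'Destination',
            'CabinDeck',
            'CabinSide']

num_cols = ['CabinNum']

# CabinNum is text after str.split; convert it so the mean can be computed
df['CabinNum'] = pd.to_numeric(df['CabinNum'])

pre_transformer = ColumnTransformer([
    ('cat', SimpleImputer(strategy='most_frequent'), cat_cols),
    ('num', SimpleImputer(strategy='mean'), num_cols)
], remainder = 'passthrough', verbose_feature_names_out = False)

# the output order is cat_cols, num_cols, then the remaining columns,
# so the column names are taken from the transformer, not from df.columns
imputed = pre_transformer.fit_transform(df)
df = pd.DataFrame(imputed,
                  columns = pre_transformer.get_feature_names_out(),
                  index = df.index)
# the transformer returns an object array; restore numeric and boolean dtypes
df = df.infer_objects()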

In [None]:
pre_transformer.get_feature_names_out()

In [None]:
df.head()

# M is for model building
1. Split data into train & test datasets so the model is evaluated on data it has not seen
2. Create preprocessing pipeline to work with numerical and categorical columns separately
3. Transform data and restore index & column names

In [None]:
from sklearn.preprocessing import (
    OrdinalEncoder, 
    MinMaxScaler,
    StandardScaler,
    OneHotEncoder
)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split


X = df.copy()
y = X.pop('Transported')
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, train_size = 0.8)

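num_cols = list(X.select_dtypes(include = ['float', 'int']))
# boolean flags are treated as categories so they are not dropped
cat_cols = list(X.select_dtypes(include = ['object', 'bool']))
cat_cols.remove('FName')
cat_cols.remove('SName')

numerical_pipe = Pipeline(steps=[
    ('transformer', StandardScaler())
])

categorical_pipe = Pipeline(steps=[
    # categories unseen during fit (e.g. new group ids) are encoded as -1
    ('transformer', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
])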

transformer_pipe = ColumnTransformer([
    ('num', numerical_pipe, num_cols),
    ('cat', categorical_pipe, cat_cols)
], remainder='drop')

In [None]:
print(cat_cols)
X_train.head()

In [None]:
# scale the numerical columns and encode the categorical ones

X_train_imputed = pd.DataFrame(transformer_pipe.fit_transform(X_train))
X_test_imputed = pd.DataFrame(transformer_pipe.transform(X_test))

X_train_imputed.index = X_train.index
X_train_imputed.columns = num_cols + cat_cols

X_test_imputed.index = X_test.index
X_test_imputed.columns = num_cols + cat_cols

In [None]:
X_train_imputed.head()

Further preprocessing ideas:
1. Introduce new features to the dataset (split several first columns)
2. Evaluate current features via permutation importance

# Model build
## Random forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

n_leaves = range(10, 100, 10)

for n in n_leaves:
    clf = RandomForestClassifier(max_leaf_nodes = n, random_state = 42)
    clf.fit(X_train_imputed, y_train)
    score = clf.score(X_test_imputed, y_test)
    print(f'For {n} leaves model score is \t {score}')    

The peak performance is around 60-70 leaves.
For comparison, let's try a k-nearest neighbors classifier.

## K-Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier

for i in range(1, 10):
    knn = KNeighborsClassifier(n_neighbors = i).fit(X_train_imputed, y_train)
    train_score = knn.score(X_train_imputed, y_train)
    test_score = knn.score(X_test_imputed, y_test)
    print(f'For {i} neighbors \t train score: {train_score} \t test score: {test_score}')
print('===============')

In [None]:
from sklearn.model_selection import GridSearchCV

params = {'n_neighbors': range(1, 10),
         'weights': ['uniform', 'distance']}

model = KNeighborsClassifier()
knc_grid = GridSearchCV(model,
                       params,
                       cv=5,
                       n_jobs=5,
                       verbose=True)

knc_grid.fit(X_train_imputed, y_train)
print(knc_grid.best_score_)
print(knc_grid.best_params_)


In [None]:
import eli5
from eli5.sklearn import PermutationImportance

clf = KNeighborsClassifier(n_neighbors = 5).fit(X_train_imputed, y_train)
perm = PermutationImportance(clf).fit(X_train_imputed, y_train)
eli5.show_weights(perm, feature_names = X_train_imputed.columns.tolist())
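
If `eli5` is not available, scikit-learn ships its own `permutation_importance` in `sklearn.inspection`. The sketch below runs it on synthetic data from `make_classification` (a stand-in, not the Space Titanic features) and scores on a held-out split rather than the training data.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the preprocessed features
X, y = make_classification(n_samples=300, n_features=5, n_informative=3, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=42)

knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# Shuffle each feature n_repeats times and record the drop in accuracy
result = permutation_importance(knn_demo, X_val, y_val, n_repeats=10, random_state=42)

for idx in result.importances_mean.argsort()[::-1]:
    print(f'feature {idx}: {result.importances_mean[idx]:.3f} +/- {result.importances_std[idx]:.3f}')
```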

The model gets a slightly bigger score with 2 neighbors, but that is probably due to overfitting to the data, so we choose 3 neighbors in the final version.
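
A way to check the overfitting suspicion is to compare train and validation scores per candidate. This sketch uses synthetic data from `make_classification` as a stand-in and asks `GridSearchCV` to keep the train scores as well.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the preprocessed training set
X, y = make_classification(n_samples=300, n_features=5, n_informative=3, random_state=42)

grid = GridSearchCV(
    KNeighborsClassifier(),
    {'n_neighbors': range(1, 10)},
    cv=5,
    return_train_score=True,
)
grid.fit(X, y)

# A large gap between train and validation scores hints at overfitting
results = pd.DataFrame(grid.cv_results_)[
    ['param_n_neighbors', 'mean_train_score', 'mean_test_score', 'std_test_score']
]
print(results.sort_values('mean_test_score', ascending=False))
```

If two candidates differ by less than `std_test_score`, preferring the larger number of neighbors is a defensible choice.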

# Final pipeline creation
In the future I want to avoid rewriting code, which is demoralizing and frustrating, by designing the pipeline with the final evaluation in mind from the start. <br>
In other words, the data can be split at the end of the program.
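
One way to get there is to bundle preprocessing and the classifier into a single scikit-learn `Pipeline`, so the same object is fitted on the training rows and reused for prediction. This is a sketch on a tiny invented frame with only two features, not the full feature set of this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Toy data with invented values; column names mirror the dataset
toy = pd.DataFrame({
    'Age': [20.0, 35.0, 50.0, 27.0, 41.0, 63.0],
    'HomePlanet': ['Earth', 'Mars', 'Europa', 'Earth', 'Mars', 'Europa'],
    'Transported': [True, False, True, True, False, False],
})
X_toy = toy.drop(columns='Transported')
y_toy = toy['Transported']

model = Pipeline(steps=[
    ('preprocess', ColumnTransformer([
        ('num', MinMaxScaler(), ['Age']),
        ('cat', OrdinalEncoder(), ['HomePlanet']),
    ])),
    ('knn', KNeighborsClassifier(n_neighbors=3)),
])

# fit() learns scaling, encoding and the classifier from the training rows only
model.fit(X_toy, y_toy)
preds = model.predict(X_toy)
print(preds)
```

With this layout the train/test split or a cross-validation call can be applied to `model` as a whole, without repeating the preprocessing code.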

In [None]:
first_num_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
first_cat_cols = ['HomePlanet', 'CryoSleep', 'VIP', 'Destination', 'Cabin']
second_num_cols = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 
                       'VRDeck', 'PassengerNum', 'CabinNum', 'RelativesNum', 'TotalExpence']
second_cat_cols = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'CabinFL', 'CabinSL']


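def split_drop_df(X):
    X[['PassengerNum', 'PassengerClass']] = X.PassengerId.str.split('_', expand=True)
    X[['CabinFL', 'CabinNum', 'CabinSL']] = X.Cabin.str.split('/', expand=True)
    X[['FName', 'SName']] = X.Name.str.split(' ', expand=True)

    # the split pieces are strings; convert the ones used as numerical features
    X[['PassengerNum', 'CabinNum']] = X[['PassengerNum', 'CabinNum']].apply(pd.to_numeric)

    # number of passengers sharing the same group id
    X['RelativesNum'] = X.groupby('PassengerNum')['PassengerNum'].transform('count')
    X = X.drop(columns=['PassengerId', 'PassengerClass', 'Cabin', 'Name'])
    # total spend across all amenities, stored as a new column
    X['TotalExpence'] = X.RoomService + X.FoodCourt + X.ShoppingMall + X.Spa + X.VRDeck
    return X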


def preprocess_df(X):
    X['Age'] = X['Age'].fillna(X['Age'].mean())
    X[first_num_cols] = X[first_num_cols].fillna(0)
    X[first_cat_cols] = SimpleImputer(strategy='most_frequent').fit_transform(X[first_cat_cols])
    X = split_drop_df(X)
    
    
    numerical_pipe = Pipeline(steps=[
        ('transformer', MinMaxScaler())
    ])

    categorical_pipe = Pipeline(steps=[
        ('transformer', OrdinalEncoder())
    ])

    transformer_pipe = ColumnTransformer([
        ('num', numerical_pipe, second_num_cols),
        ('cat', categorical_pipe, second_cat_cols)
    ])
    
    return X, transformer_pipe


# Submission

In [None]:
X_train = pd.read_csv('../input/spaceship-titanic/train.csv')
X_test = pd.read_csv('../input/spaceship-titanic/test.csv')

y_train = X_train.pop('Transported')
passenger_id = X_test['PassengerId']

X_train, _ = preprocess_df(X_train)
X_test, transformer_pipe = preprocess_df(X_test)

In [None]:
X_train_new = pd.DataFrame(transformer_pipe.fit_transform(X_train))
X_test_new = pd.DataFrame(transformer_pipe.transform(X_test))


X_train_new.columns = second_num_cols + second_cat_cols
X_test_new.columns = second_num_cols + second_cat_cols

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train_new, y_train)

# predict with the model fitted above
predictions = pd.Series(knn.predict(X_test_new), name='Transported')
final_result = pd.concat([passenger_id, predictions], axis=1)

final_result.to_csv('/kaggle/working/submission.csv', index = False)