# The Task

Welcome to the year 2912. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

Help save them and change history!

# Reading and inspecting data

In [None]:
# imports 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# reading data

data_types = {
    'HomePlanet': 'category',
#    'CryoSleep': 'category',
    'Destination': 'category',
#    'VIP': 'category',
#    'Transported': 'category'
}

train = pd.read_csv("../input/spaceship-titanic/train.csv", dtype=data_types)
test = pd.read_csv("../input/spaceship-titanic/test.csv", dtype = data_types)
train.head(10)

**Columns**

* **PassengerId** - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
* **HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.
* **CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
* **Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
* **Destination** - The planet the passenger will be debarking to.
* **Age** - The age of the passenger.
* **VIP** - Whether the passenger has paid for special VIP service during the voyage.
* **RoomService, FoodCourt, ShoppingMall, Spa, VRDeck** - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
* **Name** - The first and last names of the passenger.
* **Transported** - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

**Insights**:
* Boolean values: CryoSleep, VIP and Transported (the target) should be converted to 1 and 0.
* HomePlanet, Cabin and Destination appear to be categorical and should be encoded as numbers.
* Cabin values like "B/0/P" or "F/1/S" pack several pieces of information into one string, so the column is worth investigating in detail.
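
To make the Cabin observation concrete, here is a minimal sketch of splitting a deck/num/side string into its three parts with pandas. The toy values and the column names `CabinDeck`, `CabinNum` and `CabinSide` are assumptions for illustration and are not part of the dataset.

```python
import pandas as pd

# toy cabin values in the deck/num/side format, including a missing one
toy = pd.DataFrame({'Cabin': ['B/0/P', 'F/1/S', None, 'G/1502/S']})

# str.split with expand=True returns one column per part; missing cabins stay missing
parts = toy['Cabin'].str.split('/', expand=True)
parts.columns = ['CabinDeck', 'CabinNum', 'CabinSide']  # hypothetical names

# the cabin number is numeric, so convert it (missing values become NaN)
parts['CabinNum'] = pd.to_numeric(parts['CabinNum'])

toy = toy.join(parts)
print(toy)
```

Splitting on the separator is a bit more robust than slicing by position, because the cabin number can have a varying number of digits.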

In [None]:
# Changing bool values to 1 and 0 for easier visualisation: 

train = train.replace({True: 1, False: 0})
test = test.replace({True: 1, False: 0})
train.head(10)

In [None]:
# overview of the train data
train.info()


In [None]:
# summary statistics
train.describe()

**Insights:**
* At first sight, it seems that around 50% of the passengers were transported.
* The maximum values of RoomService, FoodCourt, ShoppingMall and Spa are strikingly large. I will need to check whether these are outliers.

In [None]:
# checking number of missing values 
display(train.isna().sum())
display(f'Number of total missing values in the training dataset {train.isna().sum().sum()}')

**Insights**:
* There are more than 2.3k missing values in the training dataset, which we will need to take care of. 
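
Absolute counts are easier to judge next to the share of rows they affect. The following is a minimal sketch on a made-up frame; on the real data the same two expressions could be applied to `train`.

```python
import numpy as np
import pandas as pd

# made-up data with a few gaps
toy = pd.DataFrame({
    'Age': [24.0, np.nan, 58.0, 33.0],
    'Spa': [0.0, 549.0, np.nan, np.nan],
    'VIP': [0, 0, 1, 0],
})

# isna().mean() gives the fraction of missing values per column
missing_share = toy.isna().mean().sort_values(ascending=False)

# share of rows that have at least one missing value
rows_affected = toy.isna().any(axis=1).mean()

print(missing_share)
print(f'Rows with at least one missing value: {rows_affected:.0%}')
```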

In [None]:
train.Transported.value_counts()

# Visual EDA

Inspecting the distribution and correlation of the various features. But first, let's take a look at the target!


In [None]:
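# checking distribution of our target ('Transported')

# sort_index() fixes the order to 0 (not transported), 1 (transported) so the labels always match
target_share = train.Transported.value_counts(normalize=True).sort_index()

plt.subplots(figsize=(6,5))
plt.pie(target_share, labels=['Not transported', 'Transported'], startangle=90, autopct='%.2f%%')
plt.title('Distribution of the target')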

Luckily, the dataset is balanced: the numbers of transported and not transported passengers are almost equal.

## Categorical features

### 1. Travellers per Home Planet

Let's take a look at how the Home Planet impacts the chances of being transported.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15,6))

b = sns.countplot(data=train, x='HomePlanet', hue='Transported', ax=axes[0]).set_title("Number of passengers per Home Planet")
c = sns.barplot(data=train, x='HomePlanet', y='Transported', ax=axes[1], ci=None)
c.set_xlabel('HomePlanet')
c.set_ylabel('Ratio of transported passengers')
c.set_title('Ratio of transported passengers per Home Planet')
sns.despine()




In [None]:
# d=sns.catplot(data=train, x='HomePlanet', hue='Transported', kind='count').set(title='Number of travellers').despine()

# Survival ratios
#e = sns.catplot(data=train, x='HomePlanet', y = 'Transported', kind='bar', ci=None).set(title="Ratio of transported travellers per Home Planet")
#sns.despine()

**Insights:**
* It seems that travellers from Europa had the highest chance of getting lost in the alternate dimension, while 50% of travellers from Mars got lost.
* Most travellers are from Earth, and earthlings had the best chances: only 40% of them got transported.

### 2. CryoSleep

Let's take a look at how being in CryoSleep impacts the chances of being transported.

In [None]:
f = sns.catplot(data=train, x= 'CryoSleep', hue='Transported', kind='count')
f.set(title='Number of passengers in CryoSleep').set_axis_labels('CryoSleep status', 'No of passengers').set_xticklabels(['Not in sleep', 'Sleep']).despine()


**Insights:**
* 5k+ passengers were awake, while roughly 3k passengers were in CryoSleep, and those in CryoSleep got transported with a much higher chance.

In [None]:
g = sns.catplot(data=train, x='CryoSleep', y ='Transported', kind='bar', ci=None)
g.despine().set(title='Ratio of transported (lost) persons per CryoSleep status').set_xticklabels(['Awake', 'Sleeping'])
g.set_axis_labels('CryoSleep status', 'Ratio')

**Insights:**
* 80% of travellers in CryoSleep got lost, so being in CryoSleep seems to carry a high risk of getting lost in the alternate dimension.
* However, being awake was not a total defence against getting lost: 30% of awake persons also got lost.

### 3. Destination

Let's investigate the distribution of the passengers' destinations.

In [None]:
# dropping missing values first so that NaN is not listed as a destination
dest = train.Destination.dropna().unique()

print(f"Destinations of the passengers: {', '.join(map(str, dest))}")

In [None]:
h = sns.catplot(data=train, x='Destination', y='Transported', kind='bar', ci=None)
h.despine().set(title="Ratio of passengers lost per Destination").set_axis_labels('Destination', 'Ratio of lost')

**Insights:**
* Around 60% of the passengers going to 55 Cancri e got lost. This ratio is somewhat lower for PSO J318.5-22 and TRAPPIST-1e, at around 50%.

### 4. VIP status

According to the summary statistics at the beginning of this notebook, around 2.3% of the passengers had VIP status.

In [None]:
i = sns.catplot(data=train, x='VIP', y='Transported', kind='bar', ci=None)
i.despine().set(title='VIP status and chances of getting lost').set_axis_labels('VIP status', 'Ratio of getting lost')
i.set_xticklabels(['Not VIP', 'VIP'])

**Insights:**
* Non-VIP passengers had a somewhat higher risk of getting transported into the alternate dimension (~50% vs ~40% for VIPs).

## Analysis of numerical features

### 1. Age of passengers

Let's investigate the distribution of the age of the passengers.

In [None]:
# distribution of Age of passengers:
fig, ax = plt.subplots(1,1, figsize=(10, 5))
g = sns.histplot(data=train, x='Age', hue='Transported', bins=79, ax=ax)
g.set(title='Distribution of age of lost and not lost passengers')
g.set_ylabel('Number of passengers')


plt.legend(labels=['Lost', 'Not lost'], title='Got lost')
sns.despine()

**Insights:**
* It seems that the age distributions of the lost and not lost passengers are mostly similar for ages above 25 years.
* The two biggest differences are that babies (< 5 years) were lost to a much greater extent, and that the ratio of lost passengers is lower among 19-25 year olds.

In [None]:
# string aggregation names give the same 'mean' and 'sum' columns without deprecation warnings
age = train.groupby('Age')['Transported'].agg(['mean', 'sum']).reset_index()
display(age)


In [None]:
fig, ax = plt.subplots(1,1, figsize=(10,8))
j=sns.scatterplot(data=age, x='Age', y='mean', size='sum', ax=ax)
j.set(title='Ratio of lost passengers per age')
j.set_ylabel('Ratio of lost passengers')
plt.legend(title='Amount of people lost')
sns.despine()

**Insights:**
* There is a trend that younger passengers tend to get lost in the alternate dimension: 80% of the babies got lost, while this ratio was around 50% for ages between 40 and 50 years.

### 2. Spendings

Let's take a look at how the various spendings (RoomService, FoodCourt, ShoppingMall, Spa, VRDeck) influenced whether a passenger was transported.

In [None]:
# distribution of the various spendings

# creating long format from the train data for the boxplot
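# naming the spending columns explicitly so that columns added later are not melted by accident
spending_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
id_cols = ['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age', 'VIP', 'Name', 'Transported']
long = train.melt(id_vars=id_cols, value_vars=spending_cols, var_name='Services', value_name='Spending')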
long

In [None]:
# Visualization and transported status
fig, ax = plt.subplots(1,1, figsize=(20,8))

k=sns.boxplot(data=long, y='Services', x='Spending', hue= 'Transported')
plt.legend(title='Transported')
plt.title('Spendings depending on transported status')
sns.despine()


**Insights:**
* There are a lot of outliers among the spendings.
* This raises the question of whether to invalidate those outliers, as they might be incorrect. In an earlier version of this notebook I removed the spending values that lay more than 1.5 * the interquartile range above the upper quartile. However, the model accuracy then dropped from 78% to 62%, so removing them weakens the model.
* Therefore I do not invalidate any of the spendings.

* Other than the outliers, it is hard to discover any pattern between the distribution of spendings and whether or not the passenger was lost (transported). It seems that not transported passengers spent more on RoomService, Spa and VRDeck than transported passengers, but this trend cannot be seen for ShoppingMall and FoodCourt spendings.
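
The outlier rule mentioned above (values more than 1.5 times the interquartile range above the upper quartile) can be sketched as follows. The numbers are made up, and replacing the flagged values with NaN is only an assumption about how the earlier version of the notebook "took out" those items.

```python
import pandas as pd

# made-up spending values: most passengers spend little, one spends a lot
spending = pd.Series([0.0, 0.0, 10.0, 25.0, 40.0, 60.0, 5000.0])

q1, q3 = spending.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# flag values above the upper fence, then blank them out (set to NaN)
is_outlier = spending > upper_fence
cleaned = spending.mask(is_outlier)

print(f'upper fence: {upper_fence}')
print(cleaned)
```

With heavily right-skewed spending data such a fence can flag many genuine big spenders, which may be one reason why removing them hurt the model.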

Let's create a total spending column and see whether it is useful.

In [None]:
train['Total_spending'] = train.RoomService + train.FoodCourt + train.ShoppingMall + train.Spa + train.VRDeck
test['Total_spending'] = test.RoomService + test.FoodCourt + test.ShoppingMall + test.Spa + test.VRDeck

train.head()

fig, ax = plt.subplots(1,1, figsize=(20,5))
o = sns.histplot(data=train, x='Total_spending', hue='Transported', ax=ax)
sns.despine()


In [None]:
p = sns.boxplot(data=train, x='Transported', y='Total_spending')
p.set_xticklabels(['Not transported', 'Transported'])
p.set_title('Distribution of spending')
sns.despine()

**Insights:**
* There is not much difference between the total spending of transported and not transported passengers.

## Analysis of text features


### PassengerId: Family / traveller group size
PassengerId contains information on whether a passenger travelled with family or alone. In order to analyse this, we first extract the "family group" of the passenger (the first part of the PassengerId) and then the size of the family as new features.

In [None]:
train['Family_group'] = train.PassengerId.str[:4]
test['Family_group'] = test.PassengerId.str[:4]

train.head(10)

In [None]:
# checking whether there are overlapping family groups between the train and test set

train_fam_groups = train.Family_group.unique()
test_fam_groups = test.Family_group.unique()

overlapping_fam_groups = [fam_group for fam_group in train_fam_groups if fam_group in test_fam_groups]


display(f'Number of Family_groups in both train and test set: {len(overlapping_fam_groups)}')



**Insight:**
* There is no overlap between the family groups of the train set and the test set. This means that we can count the family size in the train set and the test set separately (there are no families where some of the members are in the train set while the others are in the test set).
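
Why the missing overlap matters can be shown on made-up ids: when no group is split across the two sets, counting group sizes per set gives the same result as counting on the combined data. The helper `add_group_size` and the column names `Group` and `Group_size` are hypothetical and only used in this sketch.

```python
import pandas as pd

# made-up ids in the gggg_pp format
toy_train = pd.DataFrame({'PassengerId': ['0001_01', '0003_01', '0003_02', '0006_01']})
toy_test = pd.DataFrame({'PassengerId': ['0013_01', '0018_01', '0018_02']})

def add_group_size(df):
    """Return a copy with the group prefix and the number of rows sharing it."""
    out = df.copy()
    out['Group'] = out['PassengerId'].str.split('_').str[0]
    out['Group_size'] = out.groupby('Group')['Group'].transform('count')
    return out

# groups present in both sets (empty in this toy example)
overlap = set(add_group_size(toy_train)['Group']) & set(add_group_size(toy_test)['Group'])

# sizes counted per set vs. sizes counted on the combined data
separate = pd.concat([add_group_size(toy_train), add_group_size(toy_test)], ignore_index=True)
combined = add_group_size(pd.concat([toy_train, toy_test], ignore_index=True))

print(overlap)
print(separate['Group_size'].equals(combined['Group_size']))
```

If a group were split between the sets, the per-set counts would underestimate its size and the combined count would be the safer choice.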

In [None]:
# determining family size as the number of passengers sharing the same family group;
# the groups do not overlap between train and test, so each set is counted separately
train['Family_size'] = train.groupby('Family_group')['Family_group'].transform('count')
test['Family_size'] = test.groupby('Family_group')['Family_group'].transform('count')
# train['Fam_survival_rate']=train.groupby('Family_group')['Transported'].transform('mean') # I will add this maybe later
train.head(20)


**Insights:**
* Let's visualize how family size and family survival rate impact each other. 

### Visualization of Family size and Fam survival rate

In [None]:
fig, ax = plt.subplots(1,1, figsize=(8, 5))

l = sns.countplot(data=train, x='Family_size', hue='Transported', ax = ax).set(title='Distribution of family sizes according to transported(lost) and not transported passengers')
sns.despine()


In [None]:
m = sns.catplot(data=train, x='Family_size', y='Transported', kind='bar', ci=None)
m.set(title='Rate of getting lost per family sizes')

**Insights:**
* It seems that travellers who travel alone have the lowest chance of getting lost (around 45%), while groups of 4 and 6 persons had a higher (60%+) chance of getting lost.
* One option is to encode the family sizes in the order of their transport rate. The cell below sketches this, but it is wrapped in a string and therefore not executed.


In [None]:
# calculating the chances of each family sizes and ordering it ascending for encoding
"""
family_survival_order = train.groupby('Family_size')['Transported'].mean().sort_values()
display(family_survival_order)

# encoding family sizes
family_ordering_dict = {k:v for (k,v) in zip(family_survival_order.index, range(1,9))}
display(family_ordering_dict)
train['Family_size'].replace(family_ordering_dict, inplace=True)
train.head()
"""         

### Cabin number
The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard. Let's extract the deck and the side as separate features from the cabin number. 


In [None]:
# extracting Deck feature
train['Deck'] = train.Cabin.str[0]
test['Deck'] = test.Cabin.str[0]
display(f'Deck info is missing for {train.Deck.isna().sum()} passengers')

display(f'Decks on the spaceships: {train.Deck.unique()}')

# Visualizing decks chances of getting transported to alternate dimension
plot_order=train.groupby('Deck')['Transported'].mean().sort_values(ascending=True).index
plot_order


In [None]:

n = sns.catplot(data=train, x='Deck', y='Transported', kind='bar', ci=None, order=plot_order)
n.set(title='Ratio of passengers transported per deck')
plt.show()


**Insights:**
* It seems that there are 8 different decks on the spaceship: B, F, A, G, E, D, C, T.
* Deck data is also missing for around 200 passengers. Assuming that all passengers stayed in cabins, we will need to impute those values.
* More than 70% of the passengers on deck B got transported, and a slightly lower but still high ratio on deck C. Deck T seems to have been the safest: only ~20% of its passengers got lost.
* Deck may be an important feature, as the spread between decks (20% vs 70%) is large.

Let's encode the decks into three risk levels: T as 1 (lowest risk), the decks in between as 2, and C and B as 3 (highest risk).

In [None]:
# creating a dictionary for encoding decks in increasing risk order
deck_encoding_dict = {'T': 1, 
                      'E': 2,
                      'D': 2, 
                      'F': 2,
                      'A': 2,
                      'G': 2,
                      'C': 3,
                      'B': 3
                     }
deck_encoding_dict

In [None]:
train['Deck'] = train.Deck.replace(deck_encoding_dict)
test['Deck'] = test.Deck.replace(deck_encoding_dict)
train.head()
test.head()
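
The three risk levels in the dictionary above were written by hand after reading the bar chart. As a hedged alternative, the levels could be derived from the transport rate per deck. The sketch below uses made-up data, and the cut-offs (30% and 60%) as well as the column name `Deck_tier` are assumptions chosen for illustration.

```python
import pandas as pd

# made-up data: deck letter and whether the passenger was transported (1) or not (0)
toy = pd.DataFrame({
    'Deck':        ['T', 'T', 'T', 'T', 'T', 'F', 'F', 'F', 'F', 'B', 'B', 'B', 'B'],
    'Transported': [0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0],
})

# transport rate per deck
deck_rate = toy.groupby('Deck')['Transported'].mean()

# hypothetical cut-offs: up to 30% -> 1, 30-60% -> 2, above 60% -> 3
deck_tier = pd.cut(deck_rate, bins=[0, 0.3, 0.6, 1.0], labels=[1, 2, 3], include_lowest=True).astype(int)

# map each passenger's deck letter to its tier
toy['Deck_tier'] = toy['Deck'].map(deck_tier)
print(deck_tier.to_dict())
```

Because the tiers are computed from the target, in a real pipeline the rates would have to come from the training fold only, otherwise information leaks into the validation data.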

In [None]:
# extracting the information whether the cabin was at Port side or Starboard side
train['Side'] = train.Cabin.str[-1]
test['Side'] = test.Cabin.str[-1]
display(train.head())

# Visualizing Portside / Starboard side
o = sns.catplot(data=train, x='Side', y='Transported', ci=None, kind='bar')
o.set(title='Ratio of transported people per side')


**Insights:**
* There is not much difference in the ratio of getting lost between port side (P) passengers (approximately 45%) and starboard side (S) passengers (approximately 53%). This does not seem to be an important feature for prediction.

### Correlation of features

Now that we have all our features, I check whether there are highly correlated features that should be eliminated from the dataset.

In [None]:
# computing the correlation matrix once, on the numeric columns only
corr = train.corr(numeric_only=True)

# boolean mask hiding the upper triangle (including the diagonal)
mask = np.triu(np.ones_like(corr, dtype=bool), k=0)

# Checking correlation of features
plt.figure(figsize=(10,10))
r = sns.heatmap(corr, annot=True, mask=mask, cmap='coolwarm').set_title('Correlation of the features in the training set')

**Insights:**
* The only highly correlated pair is Total_spending and FoodCourt, so I will not include Total_spending among the features.
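
Reading highly correlated pairs off a heatmap works for a handful of features. As a complement, here is a minimal sketch that lists such pairs programmatically. The data are synthetic, and the threshold of 0.8 is an assumed cut-off for "highly correlated".

```python
import numpy as np
import pandas as pd

# synthetic spending data; Total_spending is dominated by FoodCourt on purpose
rng = np.random.default_rng(0)
food = rng.exponential(scale=500, size=200)
spa = rng.exponential(scale=300, size=200)
toy = pd.DataFrame({
    'FoodCourt': food,
    'Spa': spa,
    'Total_spending': food + 0.1 * spa,
})

corr = toy.corr().abs()

# keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.8  # assumed cut-off
high_pairs = upper.stack().loc[lambda s: s > threshold]
print(high_pairs)
```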

# Building a model

We will start by experimenting with logistic regression and k-nearest neighbours.

In [None]:
# Imports

# preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.svm import LinearSVC

# pipeline + grid search
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, train_test_split

# metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


In [None]:
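# Checking how many rows would remain if we dropped every row containing a NaN
# (the result is only displayed, train itself is not modified)
train.dropna()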

**Insights** 
* Of the 8092 rows we would lose more than 1400 observations, which is too big a loss, so instead of dropping rows we will impute the missing data.
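
The trade-off between dropping and imputing can be sketched on a made-up frame. The example uses scikit-learn's `SimpleImputer` with the median strategy; the toy values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# made-up data with two incomplete rows
toy = pd.DataFrame({
    'Age': [24.0, np.nan, 58.0, 33.0, 16.0],
    'Spa': [0.0, 549.0, np.nan, 0.0, 0.0],
})

# number of rows that dropna() would throw away
rows_lost = len(toy) - len(toy.dropna())

# median imputation keeps every row and fills each gap with the column median
imputer = SimpleImputer(strategy='median')
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(f'dropna would remove {rows_lost} of {len(toy)} rows')
print(imputed)
```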

In [None]:
# Selecting features for prediction

# Deck is already encoded as a number, so it goes into the numerical columns; Total_spending is left out
num_cols = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Deck', 'Family_size']
cat_cols = ['HomePlanet', 'Destination', 'Side']


In [None]:
# Splitting the data to train and validation set: 

X = train[num_cols + cat_cols].copy()
y = train['Transported']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=42)


### Building a pipeline 

In the earlier versions of this notebook, I used GridSearchCV to test the following models: 
* LogisticRegression, 
* KNeighborsClassifiers, 
* RandomForestClassifier,
* ExtraTreesClassifier
* GradientBoostingClassifier

While all the models (without further hyperparameter tuning) achieved an accuracy between 75% and 80%, GradientBoostingClassifier brought the best results (80.4%). Therefore I use GBC from here on and try to find the optimal hyperparameters for it.

Regarding the number of estimators, the default of 100 estimators brought the best results (with grid search I tested 100-1000 estimators and 1500 estimators). 200 estimators and a learning rate of 0.05 brought similar results.

After a couple of grid searches, the search settled on the parameters 'classifier__learning_rate': 0.08, 'classifier__max_depth': 4, 'classifier__n_estimators': 80. The last hyperparameter space was defined as:
 * 'classifier__n_estimators': [80, 100, 150, 200, 300]
 * 'classifier__max_depth': [3, 4, 5, 6]
 * 'classifier__learning_rate': [0.15, 0.1, 0.08, 0.05, 0.01]
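
How such a hyperparameter space is passed to `GridSearchCV` can be sketched on synthetic data. This is not the search that produced the numbers above: the data come from `make_classification`, the grid is deliberately smaller to keep the run short, and the names `toy_pipe`, `toy_grid` and `toy_search` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the passenger features
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

toy_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', GradientBoostingClassifier(random_state=42)),
])

# '<step name>__<parameter>' addresses a parameter of a pipeline step
toy_grid = {
    'classifier__n_estimators': [80, 100],
    'classifier__max_depth': [3, 4],
    'classifier__learning_rate': [0.08, 0.1],
}

# every combination (2 * 2 * 2 = 8) is evaluated with 3-fold cross-validation
toy_search = GridSearchCV(toy_pipe, toy_grid, scoring='accuracy', cv=3)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)
print(round(toy_search.best_score_, 3))
```

The number of fits grows multiplicatively with each added value, which is why narrowing the grid over a couple of rounds, as described above, is a common approach.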



In [None]:
# defining potential estimators for the pipeline

logreg = LogisticRegression(random_state=42)
knn = KNeighborsClassifier()
rf = RandomForestClassifier(random_state=42)
extra_trees = ExtraTreesClassifier(random_state=42)
gbc = GradientBoostingClassifier( learning_rate=0.08, n_estimators=80, max_depth=4, random_state=42)


In [None]:

# building separate transformers for the various features
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # `sparse_output` replaced the `sparse` argument in scikit-learn 1.2
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

numerical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer([
    ('cat', categorical_transformer, cat_cols),
    ('num', numerical_transformer, num_cols)
])

pipe = Pipeline([
    ('preproc', preprocessor),
    ('classifier', gbc) # GradientBoostingClassifier with the hyperparameters found in the earlier grid searches
])

In [None]:
# defining hyperparameter space and search object
param_grid = {
    'classifier__n_estimators': [80],
    'classifier__max_depth': [4], 
    'classifier__learning_rate': [0.08]
    
}

search = GridSearchCV(pipe, param_grid, refit=True, verbose=3, scoring='accuracy')


In [None]:
# fitting the model to the training data
search.fit(X_train, y_train)


In [None]:
display(f"Best parameters: {search.best_params_}")
display(f"Best estimator's accuracy score: {search.best_score_}")

In [None]:
# Selecting the best performing estimator from the grid search for prediction
best_model = search.best_estimator_


In [None]:
y_pred = best_model.predict(X_valid)
display(f'Accuracy on validation set: {accuracy_score(y_valid, y_pred)}')

**Insights:**
* The accuracy on the validation set is similar to the cross-validated accuracy from the grid search.

In [None]:
print(confusion_matrix(y_valid, y_pred))
print(classification_report(y_valid, y_pred))

### Generating prediction

In [None]:
# fitting model to all of the observations to generate more accurate predictions
best_model.fit(X, y)
target = best_model.predict(test[num_cols + cat_cols])
display(target)

# Translating predicted 0, 1 values to False / True
target = target.astype(dtype=bool) 
display(target)

In [None]:
# Creating submission file

submission = pd.DataFrame({
    'PassengerId': test.PassengerId,
    'Transported': target
})

submission.to_csv('submission.csv', index=False)