![](https://morbotron.com/img/S02E01/926381.jpg)

# The Problem
The year is 2912 and we've got a problem. The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, **almost half of the passengers were transported to an alternate dimension!**

Our task is to **predict which passengers were transported by the anomaly** using records recovered from the spaceship’s damaged computer system.

Submissions are evaluated based on their classification accuracy, the percentage of predicted labels that are correct.
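
As a quick illustration of the metric, here is a minimal sketch with made-up labels. The arrays `y_true` and `y_pred` are hypothetical and are not taken from the competition data.

```python
import numpy as np

# Hypothetical labels, invented for illustration only
y_true = np.array([True, False, True, True, False])
y_pred = np.array([True, False, False, True, True])

# Accuracy = number of correct predictions / total number of predictions
accuracy = (y_true == y_pred).mean()
print(f"Accuracy: {accuracy:.0%}")
```

Three of the five toy predictions match the true labels, so the accuracy is 60%.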

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import xgboost as xgb
from xgboost import XGBClassifier
import cufflinks as cf
import plotly.express as px
from plotly.subplots import make_subplots
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objects as go 
import missingno as msno

init_notebook_mode(connected=True)
cf.go_offline()

import matplotlib.ticker as mtick
plt.rcParams["figure.figsize"] = 10, 6
plt.rc("axes.spines", top=False, right=False)
palette = ['#636EFA', '#EF553B', '#00CC96', '#AB63FA', '#FFA15A', '#19D3F3', '#FF6692', '#B6E880', '#FF97FF', '#FECB52']
sns.set_palette(palette)

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

In [None]:
#Reading Data
train = pd.read_csv('../input/spaceship-titanic/train.csv')
test = pd.read_csv('../input/spaceship-titanic/test.csv')

# Preliminary Data Exploration

![](https://morbotron.com/img/S02E01/367663.jpg)

In [None]:
train.head()

* **PassengerId** - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
* **HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.
* **CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
* **Cabin** - The cabin number where the passenger is staying. Takes the form **deck/num/side**, where side can be either P for Port or S for Starboard.
* **Destination** - The planet the passenger will be debarking to.
* **Age** - The age of the passenger.
* **VIP** - Whether the passenger has paid for special VIP service during the voyage.
* **RoomService, FoodCourt, ShoppingMall, Spa, VRDeck** - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
* **Name** - The first and last names of the passenger.
* **Transported** - Whether the passenger was transported to another dimension. This is the target.

In [None]:
print('Shape of the train dataset is', train.shape)
print('Shape of the test dataset is', test.shape)

In [None]:
#Check for duplicate passenger IDs in the train set
train['PassengerId'].nunique()

We shouldn't have duplicates: the number of unique IDs should equal the number of rows in the train set.
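
One way to make this check explicit is to compare the number of unique IDs with the number of rows. The sketch below uses a small hypothetical frame (`toy`), with one ID repeated on purpose, rather than the real data.

```python
import pandas as pd

# Toy frame with hypothetical IDs; the last row repeats an earlier ID on purpose
toy = pd.DataFrame(
    {"PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02", "0002_01"]}
)

n_rows = len(toy)
n_unique = toy["PassengerId"].nunique()
# duplicated() flags every repeat after the first occurrence
n_duplicated = toy["PassengerId"].duplicated().sum()

print(f"rows: {n_rows}, unique IDs: {n_unique}, duplicated rows: {n_duplicated}")
```

On the real data the same three numbers could be computed for `train` and `test`; a non-zero duplicate count would mean some rows need a closer look.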

In [None]:
# Object and boolean columns are treated as categorical; everything else is numeric
cat_col = [col for col in train.columns if train[col].dtype == 'object' or train[col].dtype == 'bool']
num_col = [col for col in train.columns if col not in cat_col]

print('categorical columns:', cat_col)
print('numeric columns:', num_col)

Check for missing values in train and test set.

In [None]:
msno.matrix(train)
plt.title('Missing Value Distribution in train set', size=20);

In [None]:
train.isnull().sum().sort_values(ascending=False)

In [None]:
msno.matrix(test)
plt.title('Missing Value Distribution in test set', size=20);

In [None]:
test.isnull().sum().sort_values(ascending=False)

Both the train and test sets have missing values. We will handle them later; for now we are just exploring. In the test set only the PassengerId column has no missing values, while in the train set PassengerId and Transported have no missing values.

In [None]:
fig,ax = plt.subplots(1,1, figsize=(12,8))
(train.isnull().mean()*100).plot(kind='bar', ax=ax, align='center', width=.4)
(test.isnull().mean()*100).plot(kind='bar', ax=ax, align='edge',width=.4, color=palette[1])
plt.legend(labels=['Train Set','Test Set'])
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.tick_params(axis='x', labelrotation=80)
ax.set_ylabel('Missing Values (%)')
ax.set_title('Percentage of missing values in train and test set');

The percentage of missing values is consistent between the train and test sets.

In [None]:
sns.countplot(x=train.Transported)
plt.title('Target Variable');

The target variable is balanced.

# Feature Exploration

## Categorical features

In [None]:
cat_col

In [None]:
#Unique values in HomePlanet
train['HomePlanet'].value_counts().sort_values(ascending=False)

In [None]:
sns.countplot(x=train.HomePlanet)
plt.title('Number of passengers per Home Planet');

In [None]:
p_planet = train['HomePlanet'].value_counts(normalize=True).round(decimals=2)*100
p_planet

In [None]:
fig,ax=plt.subplots(1,1,)
sns.barplot(x=p_planet.index, y=p_planet.values, ax=ax)
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.tick_params(axis='x', labelrotation=0)
ax.set_ylabel('Percentage(%)')
ax.set_title('Passenger Percentage by Home Planet');

In [None]:
sns.countplot(x='HomePlanet', hue='Transported', data=train)
plt.title('Number of passengers transported to another dimension \n by Home Planet');

Most of the passengers come from Earth, so Earthlings make up the largest group among both the transported and the non-transported passengers. It is more informative to look at the same graph in percentages.

In [None]:
tp = train.groupby('HomePlanet')['Transported'].value_counts(normalize=True).round(decimals=2).unstack()*100
fig,ax = plt.subplots(1,1)
tp.plot(kind='bar', ax=ax)
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.tick_params(axis='x', labelrotation=0)
ax.set_ylabel('Percentage(%)')
ax.set_title('Passengers transported to another dimension \n by Home Planet (Percentage)');

* Most of the passengers from Earth were not transported to another dimension.
* Although passengers from Europa make up only 25% of the total, more than 60% of them were transported to another dimension.
* Slightly more than 50% of the passengers from Mars (21% of total passengers) were transported to another dimension.
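
The percentage tables used throughout this section can also be built with `pd.crosstab`. The helper below is only a sketch: `transported_share` is a hypothetical name, and the toy data is invented for illustration.

```python
import pandas as pd

# Invented toy data, not the competition data
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", "Earth", "Europa", "Europa",
                   "Mars", "Mars", "Mars", "Mars"],
    "Transported": [False, False, True, True, True,
                    True, False, True, False],
})

def transported_share(df, by, target="Transported"):
    """Percentage of each target value within every category of `by`."""
    # normalize="index" makes each row (category) sum to 1 before scaling to 100
    return pd.crosstab(df[by], df[target], normalize="index").mul(100).round(1)

share = transported_share(toy, "HomePlanet")
print(share)
```

Each row of the result sums to 100, so the table can be passed directly to `.plot(kind='bar')` in the same way as the grouped `value_counts` tables.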

We need to keep exploring to find some answers.

In [None]:
sns.countplot(x=train.CryoSleep)
plt.title('Number of Passengers in CryoSleep');

Most of the passengers were not in cryosleep.

In [None]:
sns.countplot(x='CryoSleep', hue='Transported', data=train)
plt.title('Number of Passengers Transported by CryoSleep');

In [None]:
cp = train.groupby('CryoSleep')['Transported'].value_counts(normalize=True).round(decimals=2).unstack()*100
fig,ax = plt.subplots(1,1)
cp.plot(kind='bar', ax=ax)
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.tick_params(axis='x', labelrotation=0)
ax.set_ylabel('Percentage(%)')
ax.set_title('Passengers transported to another dimension \n by CryoSleep (Percentage)');

80% of the passengers who were in CryoSleep have been transported to another dimension.

In [None]:
sns.countplot(hue='CryoSleep', x='HomePlanet', data=train)
plt.title('Number of Passengers in CryoSleep by HomePlanet');

In [None]:
train['Destination'].value_counts()

In [None]:
sns.countplot(x=train.Destination)
plt.title('Number of passengers by destination');

In [None]:
sns.countplot(hue='Transported', x='Destination', data=train)
plt.title('Number of passengers transported by destination');

In [None]:
dt = train.groupby('Destination')['Transported'].value_counts(normalize=True).round(decimals=2).unstack()*100
fig,ax = plt.subplots(1,1)
dt.plot(kind='bar', ax=ax)
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.tick_params(axis='x', labelrotation=0)
ax.set_ylabel('Percentage(%)')
ax.set_title('Passengers transported to another dimension \n by Destination (Percentage)');

Only 21% of the passengers were bound for 55 Cancri E, but 60% of them have been transported to another dimension.

In [None]:
sns.countplot(data=train, hue='CryoSleep', x='Destination')
plt.title('Number of passengers in cryo sleep by destination');

In [None]:
sns.countplot(x=train.VIP)
plt.title('Number of VIP passengers');

In [None]:
sns.countplot(hue='Transported', x='VIP', data=train)
plt.title('Number of transported Passenger by VIP status');

In [None]:
vip_t = train.groupby('VIP')['Transported'].value_counts(normalize=True).round(decimals=2).unstack()*100
fig,ax = plt.subplots(1,1)
vip_t.plot(kind='bar', ax=ax)
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.tick_params(axis='x', labelrotation=0)
ax.set_ylabel('Percentage(%)')
ax.set_title('Passengers transported to another dimension \n by VIP Status (Percentage)');

If you are a VIP, you have a slightly higher chance of not being transported.

Now we can use the **Cabin** column to create other informative features. We need to do this for both the train and test sets. The first letter in the cabin code stands for the deck, followed by the cabin number and finally the side.

In [None]:
# Pass n as a keyword: recent pandas versions no longer accept it positionally
train[['Deck', 'Num', 'Side']] = train['Cabin'].str.split('/', n=2, expand=True)
test[['Deck', 'Num', 'Side']] = test['Cabin'].str.split('/', n=2, expand=True)
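
To see what the split produces, here is a small sketch on invented cabin codes (the `toy` frame is hypothetical). Missing cabins stay missing in all three new columns, and `Num` is still a string at this point.

```python
import pandas as pd

# Hypothetical cabin codes in the deck/num/side format, including one missing value
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/12/S", None, "G/734/S"]})

# n=2 allows at most two splits, which yields exactly three parts
toy[["Deck", "Num", "Side"]] = toy["Cabin"].str.split("/", n=2, expand=True)
print(toy)
```

This is why the notebook imputes `Num` and casts it to an integer type before plotting its distribution.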

In [None]:
train.head()

In [None]:
test.head()

In [None]:
sns.countplot(x=train.Deck, order=['F','G','E','C','B','D','A','T'])
plt.title('Number of Passengers by Deck');

Most of the passengers are on decks F and G, and very few are on deck T.

In [None]:
fig = plt.figure()
ax1 = sns.countplot(x='Deck', hue='CryoSleep', data=train)
ax2 = ax1.twinx()
sns.pointplot(x='Deck',y='Transported', hue='CryoSleep', data=train, 
              palette= 'Set2',ax=ax2, linestyles='--')
plt.title('Number of Passengers Transported in Cryosleep by Deck')
ax1.legend(loc='upper center');

In [None]:
c_deck = train.groupby('Deck')['CryoSleep'].value_counts(normalize=True).round(decimals=2).unstack()*100
fig,ax = plt.subplots(1,1)
c_deck.plot(kind='bar', ax=ax)
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.tick_params(axis='x', labelrotation=0)
ax.set_ylabel('Percentage(%)')
ax.set_title('Passengers in CryoSleep by Deck (Percentage)');

More than half of the passengers in Deck G and B were in cryosleep.

In [None]:
sns.countplot(x='Deck', hue='Transported', data=train, 
            order=['F','G','E','C','B','D','A','T'])
plt.title('Number of Passengers transported by Deck');

In [None]:
deck_trans = train.groupby('Deck')['Transported'].value_counts(normalize=True).round(decimals=2).unstack()*100
fig,ax = plt.subplots(1,1)
deck_trans.plot(kind='bar', ax=ax)
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.tick_params(axis='x', labelrotation=0)
ax.set_ylabel('Percentage(%)')
ax.set_title('Passengers transported to another dimension \n by Deck (Percentage)');

Deck **B** and **C** have a high percentage of transported passengers. In deck **A** we have a perfect equilibrium.

In [None]:
sns.countplot(x='Deck', hue='HomePlanet', data=train)
plt.title('Home Planet by Deck');

In [None]:
deck_hm = train.groupby('Deck')['HomePlanet'].value_counts(normalize=True).round(decimals=2).unstack()*100
fig,ax = plt.subplots(1,1)
deck_hm.plot(kind='bar', ax=ax)
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.tick_params(axis='x', labelrotation=0)
ax.set_ylabel('Percentage(%)')
ax.set_title('Home Planet by Deck(Percentage)');

In deck **A**, **B**, **C**, and **T** we only have passengers from **Europa**, while in deck **G** we only have passengers from **Earth**.


In [None]:
sns.countplot(x=train.Side)
plt.title('Number of Passengers by side');

In [None]:
sns.countplot(x='Side', hue='Transported', data=train)
plt.title('Number of Passengers transported by side');

There are more transported passengers on the Starboard side. Let's see how the decks are distributed between the two sides.

In [None]:
sns.countplot(hue='Side', x='Deck', data=train)
plt.title('deck disposition by side');

In [None]:
sns.countplot(x='Side', hue='CryoSleep', data=train)
plt.title('Number of Passengers in CryoSleep by Side');

In [None]:
side_trans = train.groupby('Side')['Transported'].value_counts(normalize=True).round(decimals=2).unstack()*100
fig,ax = plt.subplots(1,1)
side_trans.plot(kind='bar', ax=ax)
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.tick_params(axis='x', labelrotation=0)
ax.set_ylabel('Percentage(%)')
ax.set_title('Passengers transported to another dimension \n by Side (Percentage)');

The decks are evenly distributed between the two sides. However, we can see that more than 50% of the passengers on side **S** have been transported to another dimension.

Now I need to impute the missing values in the **Num** column and convert it to **int** to visualize its distribution.

In [None]:
# Assign the result back instead of using inplace=True on a column selection
train['Num'] = train['Num'].fillna(train['Num'].mode()[0]).astype('int64')
test['Num'] = test['Num'].fillna(test['Num'].mode()[0]).astype('int64')

In [None]:
fig = px.histogram(data_frame=train,
            x='Num',
            color='Transported',
            marginal='box')
fig.update_layout(title = "Distribution of Cabin Num by Transported" , title_x = 0.5)
fig.show()

We have more transported passengers in cabins with lower Num values. Instead of using this feature as a continuous numeric variable, we can bin it into categories.

In [None]:
train['Num'].describe()

In [None]:
def num_group(s):
    
    if (s >= 0) & (s <= 300):
        return 1
    elif (s > 300) & (s <= 600):
        return 2
    elif (s > 600) & (s <= 900):
        return 3
    elif (s > 900) & (s <= 1200): 
        return 4
    elif (s > 1200) & (s <= 1500): 
        return 5
    elif (s > 1500): 
        return 6
    
train['Num_Group'] = train['Num'].apply(num_group)
test['Num_Group'] = test['Num'].apply(num_group)
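
An alternative way to express the same binning, sketched here on a few hypothetical cabin numbers, is `pd.cut` with explicit bin edges. The edges below are assumed to mirror the thresholds of `num_group` for non-negative values.

```python
import pandas as pd

# Hypothetical cabin numbers chosen to sit on and around the bin edges
num = pd.Series([0, 150, 300, 301, 600, 899, 1200, 1500, 1501, 1890])

# Right-closed intervals: (-1, 300], (300, 600], ..., (1500, inf)
bins = [-1, 300, 600, 900, 1200, 1500, float("inf")]
labels = [1, 2, 3, 4, 5, 6]

num_group_cut = pd.cut(num, bins=bins, labels=labels).astype(int)
print(num_group_cut.tolist())
```

Keeping the edges in one list makes it easier to try different bin widths than editing a chain of `elif` branches.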

In [None]:
sns.countplot(x=train.Num_Group)
plt.title('Number of Passengers by Num_Group');

In [None]:
num_trans = train.groupby('Num_Group')['Transported'].value_counts(normalize=True).round(decimals=2).unstack()*100
fig,ax = plt.subplots(1,1)
num_trans.plot(kind='bar', ax=ax)
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.tick_params(axis='x', labelrotation=0)
ax.set_ylabel('Percentage(%)')
ax.set_title('Passengers transported to another dimension \n by Num_Group (Percentage)');

We can use PassengerId to create a new column

In [None]:
# Pass n as a keyword: recent pandas versions no longer accept it positionally
train[['group_ID', 'Personal_ID']] = train['PassengerId'].str.split('_', n=1, expand=True)
test[['group_ID', 'Personal_ID']] = test['PassengerId'].str.split('_', n=1, expand=True)

In [None]:
# 1 if the passenger shares a group ID with at least one other passenger, 0 otherwise
train['Grouped'] = train['group_ID'].duplicated(keep=False).astype('int64')
test['Grouped'] = test['group_ID'].duplicated(keep=False).astype('int64')

In [None]:
train.head()

We know that the first part of the PassengerId is the same for passengers who travel together (they may or may not be a family), so I created a column **Grouped** that is equal to **1** when the passenger is travelling with someone else and **0** when the passenger is alone.
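
The key detail is `duplicated(keep=False)`, which flags every member of a repeated group rather than only the later occurrences. A minimal sketch on hypothetical IDs:

```python
import pandas as pd

# Hypothetical IDs: only group 0003 has more than one member
toy = pd.DataFrame(
    {"PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02", "0004_01"]}
)

# The part before the underscore identifies the travel group
group_id = toy["PassengerId"].str.split("_", n=1, expand=True)[0]

# keep=False marks all rows of a repeated group, including the first one
toy["Grouped"] = group_id.duplicated(keep=False).astype(int)
print(toy)
```

With the default `keep='first'`, passenger `0003_01` would be labelled as travelling alone, which is not what we want here.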

In [None]:
sns.countplot(x=train.Grouped)
plt.title('Number of passenger travelling alone or in a group');

Looks like most of the passengers were travelling alone.

In [None]:
sns.countplot(x='Grouped', hue='Transported', data=train)
plt.title('Number of Passengers transported by Grouped');

In [None]:
group_trans = train.groupby('Grouped')['Transported'].value_counts(normalize=True).round(decimals=2).unstack()*100
fig,ax = plt.subplots(1,1)
group_trans.plot(kind='bar', ax=ax)
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.tick_params(axis='x', labelrotation=0)
ax.set_ylabel('Percentage(%)')
ax.set_title('Passengers transported to another dimension \n by Grouped (Percentage)');

In [None]:
sns.catplot(kind='point', y='Transported', 
            col='HomePlanet', 
            hue='Grouped', 
            x='CryoSleep', 
            data=train, sharey=True)
plt.suptitle('Transported probability for the most relevant categorical features', y=1.05);

More than 50% of the passengers who were travelling with someone else have been transported to another dimension.
This concludes our analysis of the categorical variables. I will now drop some columns that we will not use in the classification task.

In [None]:
train.drop(['group_ID','Personal_ID', 'Num'], axis=1, inplace=True)
test.drop(['group_ID','Personal_ID','Num'], axis=1, inplace=True)

![](https://morbotron.com/img/S02E01/345758.jpg)

## Numeric Features

In [None]:
num_col

In [None]:
fig,ax = plt.subplots(1,2, figsize=(20,8))
sns.histplot(train.Age, kde=True, ax=ax[0])  # distplot is deprecated in recent seaborn versions
sns.boxplot(x=train.Age, ax=ax[1])
plt.suptitle('Age Distribution of the Passengers');

In [None]:
train[train['Age']==0].value_counts().sum()

From the distribution plot we can see that there are some older passengers, but that's OK. However, we also have 140 passengers with age equal to 0. We need to investigate these instances further because they could be missing values that we need to impute.

In [None]:
#Inspect the first 30 rows with age = 0
train[train['Age']==0].head(30)

In the rows inspected, all the passengers with age equal to 0 were travelling with someone else (Grouped = 1), so they could all actually be babies.

In [None]:
fig = px.histogram(data_frame = train, 
                   x="Age",
                   color= "Transported",
                   marginal="box",
                   template="plotly_white"
                )
fig.update_layout(title = "Distribution of Age by Transported" , title_x = 0.5)
fig.show()

Here we can see the difference in the age distribution between the passengers who were transported and those who were not.
The median age of the transported passengers is a little lower than that of the non-transported ones. We can convert the Age column into categories too, but first we need to handle the missing values.


In [None]:
train['Age'].isnull().sum()

In [None]:
train['Age'].describe()

Since we have seen that there are a lot of values equal to 0, I'll impute the missing values using the median, which is more robust than the mean.
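
To illustrate why the median is the safer choice here, consider a small made-up age column (the values are hypothetical). A cluster of zeros drags the mean down much more than the median.

```python
import pandas as pd

# Hypothetical ages: three zeros, five adults and one missing value
age = pd.Series([0, 0, 0, 22, 24, 26, 28, 30, None])

mean_age = age.mean()      # missing values are skipped by default
median_age = age.median()

# Fill the missing entry with the median, as done for the real data
filled = age.fillna(median_age)
print(f"mean: {mean_age}, median: {median_age}")
```

In this toy column the mean is 16.25 while the median is 23, so the median stays closer to a typical passenger age.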

In [None]:
# Assign the result back instead of using inplace=True on a column selection
train['Age'] = train['Age'].fillna(train['Age'].median())
test['Age'] = test['Age'].fillna(test['Age'].median())

In [None]:
#create age groups
def age_group(s):
    if s == 0:
        return 0 #special categories for values equal to 0
    elif (s > 0) & (s <= 15):
        return 1
    elif (s > 15) & (s <= 25):
        return 2
    elif (s > 25) & (s <= 35):
        return 3
    elif (s > 35) & (s <= 50):
        return 4
    elif (s > 50) & (s <= 65):
        return 5
    elif (s > 65) & (s <= 75):
        return 6
    elif (s > 75):
        return 7
    
    
train['Age_Group'] = train['Age'].apply(age_group)
test['Age_Group'] = test['Age'].apply(age_group)

In [None]:
sns.countplot(x=train.Age_Group)
plt.title('Number of passengers by age group');

In [None]:
age_trans = train.groupby('Age_Group')['Transported'].value_counts(normalize=True).round(decimals=2).unstack()*100
fig,ax = plt.subplots(1,1)
age_trans.plot(kind='bar', ax=ax)
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.tick_params(axis='x', labelrotation=0)
ax.set_ylabel('Percentage(%)')
ax.set_title('Passengers transported to another dimension \n by Age_Group (Percentage)');

**80%** of the passengers with age equal to **0** have been transported to another dimension.

The rest of our numeric variables relate to luxury expenses:

* RoomService
* FoodCourt
* ShoppingMall
* Spa
* VRDeck

In [None]:
fig,ax = plt.subplots(3,2, figsize=(12,12), sharey=True)
sns.kdeplot(train.RoomService, ax=ax[0,0])
sns.kdeplot(train.FoodCourt, ax=ax[0,1])
sns.kdeplot(train.ShoppingMall, ax=ax[1,0])
sns.kdeplot(train.Spa, ax=ax[1,1])
sns.kdeplot(train.VRDeck, ax=ax[2,0])
plt.suptitle('Luxury Expenses Distribution');

In [None]:
px.box(data_frame=train, 
       x='RoomService', 
       color='Transported', 
       title='Distribution of RoomService by transported')

In [None]:
px.box(data_frame=train, 
       x='FoodCourt', 
       color='Transported', 
       title='Distribution of FoodCourt by transported')

In [None]:
px.box(data_frame=train, 
       x='ShoppingMall', 
       color='Transported', 
       title='Distribution of ShoppingMall by transported')

In [None]:
px.box(data_frame=train, 
       x='Spa', 
       color='Transported', 
       title='Distribution of Spa by transported')

In [None]:
px.box(data_frame=train, 
       x='VRDeck', 
       color='Transported', 
       title='Distribution of VRDeck by transported')

It looks like most of the passengers didn't spend money on luxury items. But we can see a certain difference in the distribution of the money spent at the **Spa**. This looks like the only expense variable that can help us differentiate between the transported and the non-transported passengers. We may drop the other features and only keep **Spa**.

In [None]:
numc_corr = ['Age',
   'RoomService',
   'FoodCourt',
   'ShoppingMall',
   'Spa',
   'VRDeck',
   'Transported']

corr = train[numc_corr].corr()
plt.figure(figsize=(10,10))
sns.heatmap(corr, annot=True, square=True, fmt='.2f', vmin=-1, vmax=1, linewidths=0.5, cmap='coolwarm')
plt.title('Numeric Features Correlation');

The correlation with the target variable is quite **weak**. Only VRDeck, Spa and RoomService show a slightly stronger correlation.

In [None]:
train.drop(['Age','Name','PassengerId','Cabin'], axis=1, inplace=True)
test.drop(['Age','Name','PassengerId','Cabin'], axis=1, inplace=True)

In [None]:
train.head()

In [None]:
test.head()

Now we can move on with our analysis and handle missing values.

# Handling Missing Values

We have removed some columns and created new columns, let's see how the percentage of missing values looks right now.

In [None]:
fig,ax = plt.subplots(1,1, figsize=(12,8))
(train.isnull().mean()*100).plot(kind='bar', ax=ax, align='center', width=.4)
(test.isnull().mean()*100).plot(kind='bar', ax=ax, align='edge',width=.4, color=palette[1])
plt.legend(labels=['Train Set','Test Set'])
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.tick_params(axis='x', labelrotation=80)
ax.set_ylabel('Missing Values (%)')
ax.set_title('Percentage of missing values in train and test set');

In [None]:
#Use the median for the expense features because they have a skewed distribution
expense_cols = ['Spa', 'VRDeck', 'RoomService', 'FoodCourt', 'ShoppingMall']
for col in expense_cols:
    # Assign the result back instead of using inplace=True on a column selection
    train[col] = train[col].fillna(train[col].median())
    test[col] = test[col].fillna(test[col].median())

#Use the mode for the rest of the features
train = train.fillna(train.agg(lambda x: pd.Series.mode(x)[0], axis=0))
test = test.fillna(test.agg(lambda x: pd.Series.mode(x)[0], axis=0))

In [None]:
print('Train set missing values \n',train.isnull().sum())
print('-'*10)
print('Test set missing values\n',test.isnull().sum())

# Encoding Categorical Features

In [None]:
train.dtypes

In [None]:
cols = ['CryoSleep','Side']

def LE(train_df, test_df):
    for col in cols:
        # Fit on the train set only, then reuse the same mapping for the test set
        le = LabelEncoder()
        train_df[col] = le.fit_transform(train_df[col])
        test_df[col] = le.transform(test_df[col])
    return train_df, test_df

train, test = LE(train, test)

In [None]:
train = pd.get_dummies(train, columns=['HomePlanet', 'Destination', 'Deck'])
test = pd.get_dummies(test, columns=['HomePlanet', 'Destination', 'Deck'])
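
Because `get_dummies` is applied to each set separately, a category that appears in only one of them would produce mismatched columns. A possible safeguard, sketched on hypothetical toy frames, is to reindex the test columns against the train columns.

```python
import pandas as pd

# Hypothetical frames: deck "T" appears only in the toy train set
toy_train = pd.DataFrame({"Deck": ["A", "F", "T"], "Spa": [0.0, 10.0, 5.0]})
toy_test = pd.DataFrame({"Deck": ["A", "F", "F"], "Spa": [1.0, 2.0, 3.0]})

train_enc = pd.get_dummies(toy_train, columns=["Deck"])
test_enc = pd.get_dummies(toy_test, columns=["Deck"])

# Add any dummy column missing from the test set (filled with 0),
# drop any column the train set does not have, and match the column order
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc.columns.tolist())
```

Whether this step is needed depends on the actual category values in both files, so it is best treated as a defensive check.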

In [None]:
train.head()

In [None]:
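X_train = train.drop('Transported', axis = 1)
y_train = train['Transported']
# Align the test columns with the train columns in case a dummy category is missing from one set
X_test = test.reindex(columns=X_train.columns, fill_value=0)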

# Models

## Random Forest

In [None]:
rfc=RandomForestClassifier()
parameters = {
    "n_estimators": [200,300,400], 
    "max_features": [3, 5, 10],
    "min_samples_leaf" : [3, 5,10],
    
}

rfc_grid = GridSearchCV(rfc, param_grid = parameters, cv = 5, scoring = 'accuracy', n_jobs= -1)
rfc_grid.fit(X_train, y_train)
print('Best Parameters : ', rfc_grid.best_params_)
print('-'*50)
print('Best Accuracy : ', rfc_grid.best_score_)

## XGBoost

In [None]:
# Binary target: use the binary logistic objective with the matching log-loss metric
param_grid = {'n_estimators': [200,300,400],
              'learning_rate': [0.01, 0.05, 0.1, 0.5],
              'eval_metric': ['logloss'],
              'objective':['binary:logistic'],
              'max_depth': [5,10,15],}
grid = GridSearchCV(XGBClassifier(), param_grid=param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
best_params = grid.best_params_
print('Best score of cross validation: {:.2f}'.format(grid.best_score_))
print('-'*50)
print('Best parameters:', best_params)

With RandomForest we can inspect the **feature importance**.

In [None]:
RandomForest = (rfc_grid.best_estimator_)
importances = RandomForest.feature_importances_
feature_names = X_train.columns
fi = pd.Series(index=feature_names, data=importances).sort_values(ascending=True)
plt.figure(figsize=(8,15))
plt.barh(fi.index, fi.values)
plt.title('Random Forest Features Importance')
plt.show()

We can also plot XGBoost feature importance

In [None]:
xgb.plot_importance(grid.best_estimator_);

We can also plot a Tree from XGBoost.

In [None]:
# Set the figure size before plotting so that it applies to the tree figure
plt.rcParams['figure.figsize'] = [150, 80]
xgb.plot_tree(grid.best_estimator_,num_trees=0)
plt.show()

In [None]:
y_pred = grid.predict(X_test)

In [None]:
y_pred

In [None]:
subs = pd.read_csv('../input/spaceship-titanic/sample_submission.csv')
subs

In [None]:
# The submission expects boolean labels, so cast the predictions explicitly
subs['Transported'] = y_pred.astype(bool)
subs.to_csv('Transported_xgb.csv', index = False)