In [None]:
try:
    import pycaret
except ImportError:
    !pip install --upgrade pycaret

try:
    import missingno
except ImportError:
    !pip install missingno

try:
    import interpret
except ImportError:
    !pip install interpret

try:
    import tune_sklearn
except ImportError:
    !pip install tune-sklearn ray[tune]

<hr style="border: solid 3px blue;">

# Introduction

![](https://64.media.tumblr.com/d994c3b1c3adb94ac65692a599aae700/5866f6929b208337-cb/s540x810/bc492d78a4dc5a31fe7c5b42c1fff9bcd5938c9d.gif)

Picture Credit: https://techrecipe.co.kr

We make decisions all the time, big and small, and deciding is rarely easy. It is hard for machines, too.
In this problem, both we and the model face difficult decisions, and a decision that affects a person's life calls for particular care.

On what basis should the decision be made? In the end, we need to understand the dataset as well as we can and make the right decision through effective modeling. Let's start on this difficult task.

We hope that our decisions, and the model's, will save many people.

## Features
* **PassengerId** - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
* **HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.
* **CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
* **Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
* **Destination** - The planet the passenger will be debarking to.
* **Age** - The age of the passenger.
* **VIP** - Whether the passenger has paid for special VIP service during the voyage.
* **RoomService, FoodCourt, ShoppingMall, Spa, VRDeck** - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
* **Name** - The first and last names of the passenger.
* **Transported** - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

----------------------------
# Setting Up

In [None]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import rcParams

import warnings
warnings.filterwarnings(action='ignore')

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

from sklearn.preprocessing import PowerTransformer

sns.set(style="ticks", context="talk",font_scale = 1)
plt.style.use("dark_background")

In [None]:
train_data = pd.read_csv('../input/spaceship-titanic/train.csv')
test_data = pd.read_csv('../input/spaceship-titanic/test.csv')
submission_data = pd.read_csv('../input/spaceship-titanic/sample_submission.csv')
space_df = pd.concat([train_data, test_data], ignore_index = True, sort = False)
tr_idx = space_df['Transported'].notnull()

<hr style="border: solid 3px blue;">

# Anomaly Detection

![](https://www.oreilly.com/content/wp-content/uploads/sites/2/2019/06/8230004725_6338759eb3_o_crop-57552d1a7b9df6b9221d5c1fd342334f.jpg)

Picture Credit: https://www.oreilly.com/content

Before starting EDA in earnest, we run anomaly detection on the dataset. The goal is to see which cases are flagged as outliers and to pick up insights for the EDA.

In [None]:
from pycaret.anomaly import *

In [None]:
_ = pycaret.anomaly.setup(
    data=space_df[tr_idx],
    silent=True)

In [None]:
knn = pycaret.anomaly.create_model('knn')
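
To make the idea behind a k-nearest-neighbour detector concrete, here is a minimal NumPy sketch. It assumes the anomaly score of a point is its distance to its k-th nearest neighbour; this illustrates the general technique only and is not pycaret's implementation. The function name `knn_anomaly_scores` and the toy points are hypothetical.

```python
import numpy as np

def knn_anomaly_scores(X, k=2):
    """Score each row by the distance to its k-th nearest neighbour (illustrative only)."""
    # Pairwise Euclidean distances between all rows
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    # After sorting each row, column 0 is the distance to the point itself (0),
    # so column k is the distance to the k-th nearest other point
    return np.sort(dist, axis=1)[:, k]

# Four points in a tight square plus one far-away point (toy data)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]], dtype=float)
scores = knn_anomaly_scores(X, k=2)
print(scores)
print("most anomalous row:", scores.argmax())
```

Points in dense regions get small scores, while isolated points get large ones, which is why sorting by the score surfaces the most unusual rows first.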

------------------------------------------
## Extracting top 5 outliers 

In [None]:
knn_df = pycaret.anomaly.assign_model(knn)
abnormal_data = knn_df[knn_df.Anomaly == 1].sort_values(by='Anomaly_Score', ascending=False)
print("the size of anomaly = ",len(abnormal_data))
abnormal_data.head().style.set_properties(**{'background-color': 'black',
                           'color': 'white',
                           'border-color': 'white'})

<span style="color:Blue"> Observation:
* There are a total of 435 outliers.
* Many of the flagged anomalies have Europa as HomePlanet and 55 Cancri e as Destination.

In [None]:
plt.style.use("dark_background")
plot_model(knn,plot='umap')

In [None]:
plot_model(knn,plot='tsne')

<hr style="border: solid 3px blue;">

# EDA

In [None]:
space_df.head().T.style.set_properties(**{'background-color': 'black',
                           'color': 'white',
                           'border-color': 'white'})

-----------------------------------
## Checking Missing Values

![](https://miro.medium.com/max/640/0*10yDGnSUYVYTuHR-.jpg)

Picture Credit: https://miro.medium.com

In [None]:
import missingno as msno
msno.matrix(space_df,color=(0, 0, 0))

In [None]:
isnull_series = space_df.loc[:,:'Name'].isnull().sum()
isnull_series[isnull_series > 0].sort_values(ascending=False)

plt.figure(figsize = (20,10))

ax = isnull_series[isnull_series > 0].sort_values(ascending=False).plot(kind='bar',
                                                                        grid = False,
                                                                        fontsize=20)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+ p.get_width() / 2., height + 5, height, ha = 'center', size = 30)
sns.despine()

<span style="color:Blue"> Observation:
* Unfortunately, there are many missing values. How they are filled is likely to have a big influence on model performance.
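
The missing-value check can also be summarised as a small table with counts and percentages. The sketch below uses a toy frame with made-up values rather than the competition data; the names `demo` and `summary` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are invented for illustration)
demo = pd.DataFrame({
    "Age": [24.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Earth", None, None, "Mars"],
    "VIP": [False, False, True, False],
})

# Count and percentage of missing values per column
summary = pd.DataFrame({
    "missing": demo.isnull().sum(),
    "percent": demo.isnull().mean() * 100,
})

# Keep only columns that actually have gaps, worst first
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```

Looking at percentages alongside raw counts makes it easier to judge whether a column needs careful imputation or only a simple fill.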

------------------------------------------
## Checking Data Type

In [None]:
plt.figure(figsize = (10,8))
with plt.rc_context({'figure.facecolor':'black'}):
    sns.set(style="ticks", context="talk",font_scale = 1)
    plt.style.use("dark_background")
    ax = space_df.dtypes.value_counts().plot(kind='bar',fontsize=20)
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+ p.get_width() / 2., height + 0.1, height, ha = 'center', size = 25)
    sns.despine()

---------------------------------------------
## Checking Target Balance

In [None]:
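colors = ['gold', 'mediumturquoise']
# Count only the labeled (training) rows and derive the labels from the counted
# values, so each slice is paired with the right class whatever the sort order.
counts = space_df.loc[tr_idx, 'Transported'].astype(bool).value_counts()
labels = ['Transported' if v else 'Not-Transported' for v in counts.index]
values = counts.values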

fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.3)])
fig.update_traces(hoverinfo='label+percent', textinfo='percent', textfont_size=20,
                  marker=dict(colors=colors, line=dict(color='white', width=0.1)))
fig.update_layout(
    title_text="Target Balance",
    title_font_color="white",
    legend_title_font_color="yellow",
    paper_bgcolor="black",
    plot_bgcolor='black',
    font_color="white",
)
fig.show()

<span style="color:Blue"> Observation:

* Target is well balanced.
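
When checking balance numerically, it helps to remember that `value_counts` orders its result by frequency, not by label. The following sketch, on a made-up target column, derives display labels from the result's own index so that names and shares cannot get out of step.

```python
import pandas as pd

# Toy target column; None stands in for unlabeled (test) rows
target = pd.Series([True, False, True, True, False, None], name="Transported")

# Drop unlabeled rows, then compute class shares (sorted by frequency)
share = target.dropna().astype(bool).value_counts(normalize=True)

# Build labels from the index of the result instead of hard-coding an order
labels = ["Transported" if v else "Not-Transported" for v in share.index]

for label, value in zip(labels, share):
    print(f"{label}: {value:.1%}")
```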

<hr style="border: solid 2px black;">

# Categorical Features

![](http://cdn.shopify.com/s/files/1/1334/2321/articles/Picture1_1024x1024.png?v=1497575369)

Picture Credit: http://cdn.shopify.com

> In statistics, a categorical variable (also called qualitative variable) is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property.[1] In computer science and some branches of mathematics, categorical variables are referred to as enumerations or enumerated types. Commonly (though not in this article), each of the possible values of a categorical variable is referred to as a level. The probability distribution associated with a random categorical variable is called a categorical distribution.

Ref: https://en.wikipedia.org/wiki/Categorical_variable

-----------------------------------------------------
## Name

Let's treat Name as a unique identifier and drop it.

In [None]:
space_df.drop(['Name'],axis=1,inplace=True,errors='ignore')

------------------------------
## PassengerId 

A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. **People in a group are often family members, but not always.**

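Let's make a feature from the Id. Note that the code below keeps the part after the underscore (pp, the passenger's number within the group) and stores it in a column named `Group`.

In [None]:
def extract_group(s):
    # 'gggg_pp' -> 'pp': the passenger's number within their travel group
    return s.split('_')[1]

space_df['Group'] = space_df['PassengerId'].apply(extract_group).astype(int)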
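
Both halves of the Id can be pulled out at once with pandas string methods. This is a sketch on made-up Ids; the column names `GroupId`, `NumberInGroup`, and `GroupSize` are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Made-up Ids following the gggg_pp pattern
ids = pd.Series(["0001_01", "0002_01", "0003_01", "0003_02"], name="PassengerId")

# Split once on the underscore: column 0 is gggg, column 1 is pp
parts = ids.str.split("_", expand=True)

demo = pd.DataFrame({
    "PassengerId": ids,
    "GroupId": parts[0].astype(int),
    "NumberInGroup": parts[1].astype(int),
})

# Number of passengers sharing the same gggg prefix
demo["GroupSize"] = demo.groupby("GroupId")["PassengerId"].transform("count")
print(demo)
```

Keeping the group id separately would make it possible to derive group-level features such as group size, should that turn out to be useful.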

In [None]:
total_cnt = space_df[tr_idx]['Transported'].count()
plt.figure(figsize=(20,8))
sns.set(style="ticks", context="talk",font_scale = 1)
plt.style.use("dark_background")
ax = sns.countplot(x="Group",
                   hue="Transported", 
                   data=space_df[tr_idx])
ax.set_title('Group/Rate')
for p in ax.patches:
    x, height, width = p.get_x(), p.get_height(), p.get_width()
    ax.text(x + width / 2, height + 80, f'{height / total_cnt * 100:2.1f}%', va='center', ha='center', size=15)
sns.despine()

<span style="color:Blue"> Observation:
* Passengers whose `Group` value is 1 were transported relatively less often.

In [None]:
space_df.drop(['PassengerId'],axis=1,inplace=True,errors='ignore')

------------------------------
## HomePlanet

The planet the passenger departed from, typically their planet of permanent residence.

In [None]:
# Note: despite the "Has_" prefix, this flag is 1 when HomePlanet is missing and 0 otherwise.
space_df['Has_HomePlanet'] = space_df['HomePlanet'].isnull().astype(int)

In [None]:
total_cnt = space_df[tr_idx]['Transported'].count()
plt.figure(figsize=(20,8))
sns.set(style="ticks", context="talk",font_scale = 1)
plt.style.use("dark_background")
ax = sns.countplot(x="HomePlanet",
                   hue="Transported", 
                   data=space_df[tr_idx])
ax.set_title('HomePlanet/Rate')
for p in ax.patches:
    x, height, width = p.get_x(), p.get_height(), p.get_width()
    ax.text(x + width / 2, height + 80, f'{height} / {height / total_cnt * 100:2.1f}%', va='center', ha='center', size=20)
sns.despine()

<span style="color:Blue"> Observation:

* Passengers from Earth were transported relatively less often.

---------------------
## CryoSleep

![](https://qph.fs.quoracdn.net/main-qimg-30ac22fbc0cff552d0db1094338da8f2-pjlq)

Picture Credit: https://qph.fs.quoracdn.net

Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

In [None]:
space_df['Has_CryoSleep'] = space_df['CryoSleep'].isnull().astype(int)

In [None]:
plt.figure(figsize=(15,8))
sns.set(style="ticks", context="talk",font_scale = 1)
plt.style.use("dark_background")
ax = sns.countplot(x="CryoSleep",
                   hue="Transported", 
                   data=space_df[tr_idx])
ax.set_title('CryoSleep/Rate')
for p in ax.patches:
    x, height, width = p.get_x(), p.get_height(), p.get_width()
    ax.text(x + width / 2, height + 80, f'{height} / {height / total_cnt * 100:2.1f}%', va='center', ha='center', size=20)
sns.despine()

<span style="color:Blue"> Observation:
* Passengers in CryoSleep were transported more often.

---------------------
## Destination

The planet the passenger will be debarking to.

In [None]:
space_df['Has_Destination'] = space_df['Destination'].isnull().astype(int)

In [None]:
plt.figure(figsize=(20,8))
sns.set(style="ticks", context="talk",font_scale = 1)
plt.style.use("dark_background")
ax = sns.countplot(x="Destination",
                   hue="Transported", 
                   data=space_df[tr_idx])
ax.set_title('Destination/Rate')
for p in ax.patches:
    x, height, width = p.get_x(), p.get_height(), p.get_width()
    ax.text(x + width / 2, height + 80, f'{height} / {height / total_cnt * 100:2.1f}%', va='center', ha='center', size=20)
sns.despine()

------------------------------
## Cabin

The cabin number where the passenger is staying. Takes the form **deck/num/side**, where side can be either P for Port or S for Starboard.

In [None]:
plt.figure(figsize=(15,8))
ax = space_df['Cabin'].value_counts().sort_values(ascending=False)[:10].plot(kind='bar',
                                                                        grid = False,
                                                                        fontsize=20)
plt.legend(loc = 'upper right')
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+ p.get_width() / 2., height + 0.1, height, ha = 'center', size = 20)
sns.despine()

In [None]:
space_df['Has_Cabin'] = space_df['Cabin'].isnull().astype(int)

In [None]:
space_df['Cabin'].fillna('None/None/None', inplace=True)

In [None]:
def extract_first(s):
    return s.split('/')[0]
def extract_mid(s):
    return s.split('/')[1]
def extract_last(s):
    return s.split('/')[2]

space_df['Deck'] = space_df['Cabin'].apply(extract_first)
space_df['Num'] = space_df['Cabin'].apply(extract_mid)
space_df['Side'] = space_df['Cabin'].apply(extract_last)
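
The same deck/num/side split can be written with a single vectorised call. The sketch below uses made-up cabin strings and additionally shows one way to keep `Num` numeric; the conversion with `pd.to_numeric` is an assumption for illustration, not a step this notebook performs.

```python
import pandas as pd

# Made-up cabins; None stands in for a missing value
cabins = pd.Series(["B/0/P", "F/12/S", None], name="Cabin")

# Fill gaps with a placeholder that still has three parts, then split once
parts = cabins.fillna("None/None/None").str.split("/", expand=True)
parts.columns = ["Deck", "Num", "Side"]

# Optional: turn Num into a real number; the placeholder becomes NaN
num_numeric = pd.to_numeric(parts["Num"], errors="coerce")

print(parts)
print(num_numeric)
```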

### Deck

In [None]:
plt.figure(figsize=(20,8))
sns.set(style="ticks", context="talk",font_scale = 1)
plt.style.use("dark_background")
ax = sns.countplot(x="Deck",
                   hue="Transported", 
                   data=space_df[tr_idx])
ax.set_title('Deck/Rate')
for p in ax.patches:
    x, height, width = p.get_x(), p.get_height(), p.get_width()
    ax.text(x + width / 2, height + 40, f'{height / total_cnt * 100:2.1f}%', va='center', ha='center', size=15)
sns.despine()

<span style="color:Blue"> Observation:
* Passengers on decks B, G, and C were more often transported.
* Passengers on decks F, E, and D were more often not transported.

### Side

In [None]:
plt.figure(figsize=(15,6))
sns.set(style="ticks", context="talk",font_scale = 1)
plt.style.use("dark_background")
ax = sns.countplot(x="Side",
                   hue="Transported", 
                   data=space_df[tr_idx])
ax.set_title('Side/Rate')
for p in ax.patches:
    x, height, width = p.get_x(), p.get_height(), p.get_width()
    ax.text(x + width / 2, height + 80, f'{height} / {height / total_cnt * 100:2.1f}%', va='center', ha='center', size=15)
sns.despine()

<span style="color:Blue"> Observation:
* Passengers on the S side were more often transported.

In [None]:
cat_cols = ['Deck','Num','Side']
space_df[cat_cols].nunique()

<span style="color:Blue"> Observation:
* Num and Deck have many levels, so label encoding seems a better fit than one-hot encoding.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Fit a fresh encoder per column so each column gets its own integer codes
for c in cat_cols:
    le = LabelEncoder()
    space_df[c] = le.fit_transform(space_df[c])
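
One detail worth keeping in mind: `LabelEncoder` assigns codes in sorted order of the class values, and digit strings sort as text rather than as numbers. The toy values below are made up to show the effect.

```python
from sklearn.preprocessing import LabelEncoder

# Digit strings plus the placeholder used for missing cabins (toy values)
values = ["2", "10", "1", "None"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(values)

# Classes are sorted as strings, so "10" comes before "2"
print(list(le_demo.classes_))
print(codes.tolist())
```

For a column such as `Num`, this means the integer codes do not follow the numeric order of the cabin numbers. Tree-based models can often cope with that, but it is a design choice to be aware of.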

In [None]:
space_df.drop('Cabin',axis=1,inplace=True)

-----------------------------------------
## VIP

Whether the passenger has paid for special VIP service during the voyage.

In [None]:
space_df['Has_VIP'] = space_df['VIP'].isnull().astype(int)

In [None]:
plt.figure(figsize=(15,6))
sns.set(style="ticks", context="talk",font_scale = 1)
plt.style.use("dark_background")
ax = sns.countplot(x="VIP",
                   hue="Transported", 
                   data=space_df[tr_idx])
ax.set_title('VIP/Rate')
for p in ax.patches:
    x, height, width = p.get_x(), p.get_height(), p.get_width()
    ax.text(x + width / 2, height + 150, f'{height} / {height / total_cnt * 100:2.1f}%', va='center', ha='center', size=20)
sns.despine()

<span style="color:Blue"> Observation:
* Receiving VIP service does not seem to make much difference to whether a passenger is transported.

<hr style="border: solid 2px black;">

# Numerical Features

![](http://cdn.shopify.com/s/files/1/1334/2321/articles/Picture1_1024x1024.png?v=1497575369)

Picture Credit: http://cdn.shopify.com

> Numeric variables have values that describe a measurable quantity as a number, like 'how many' or 'how much'. Therefore numeric variables are quantitative variables.

> Numeric variables may be further described as either continuous or discrete:
> * A continuous variable is a numeric variable. Observations can take any value between a certain set of real numbers. The value given to an observation for a continuous variable can include values as small as the instrument of measurement allows. Examples of continuous variables include height, time, age, and temperature.
> * A discrete variable is a numeric variable. Observations can take a value based on a count from a set of distinct whole values. A discrete variable cannot take the value of a fraction between one value and the next closest value. Examples of discrete variables include the number of registered cars, number of business locations, and number of children in a family, all of which are measured as whole units (i.e. 1, 2, 3 cars).

Ref: https://www.abs.gov.au/

In [None]:
def display_stat(df,feature):
    mean = df[feature].mean()
    std = df[feature].std()
    skew = df[feature].skew()
    kurtosis = df[feature].kurtosis()
    print('mean: {0:.4f}, std: {1:.4f}, skew: {2:.4f}, kurtosis: {3:.4f} '.format(mean, std, skew, kurtosis))

In [None]:
def plot_histgram(df,feature):    
    fig = px.histogram(df, x=feature,
                       color="Transported", 
                       marginal="box",
                       barmode ="overlay",
                       histnorm ='density'
                      )  
    fig.update_layout(
        title={
            'text': feature+" histogram",
            'x':0.5,
            'xanchor': 'center',
            'yanchor': 'top'},    
        paper_bgcolor="black",
        plot_bgcolor='black',
        font_color="white"
    )
    fig.show()

-----------------------------------------
## Age

First, let's check the correlation between Age and other features.

In [None]:
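# Restrict to numeric columns explicitly; recent pandas versions raise an error
# when corr() meets object columns such as HomePlanet.
corr = space_df.select_dtypes(include='number').corr()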
sns.set(style="ticks", context="talk",font_scale = 1)
plt.style.use("dark_background")
plt.figure(figsize=(10, 8))
abs(corr['Age']).sort_values()[:-1].plot.barh()
plt.title('Correlation with Age',fontsize=20)
sns.despine()

<span style="color:Blue"> Observation:
* The correlation between Age and Deck is higher than other features. 

In [None]:
plt.figure(figsize=(20,6))
sns.set(style="ticks", context="talk",font_scale = 1)
plt.style.use("dark_background")
plt.subplots_adjust(wspace=0.3)
plt.subplot(1,2,1)
sns.boxenplot(data=space_df, x='Deck',y='Age')
plt.subplot(1,2,2)
sns.regplot(data=space_df, x='Deck',y='Age')

<span style="color:Blue"> Observation:
* The people on deck 6 (a label-encoded value; with alphabetical encoding this should correspond to deck G) appear to be younger than the people on the other decks.

In [None]:
plot_histgram(space_df[tr_idx],'Age')
display_stat(space_df[tr_idx],'Age')

In [None]:
space_df['Has_Age'] = space_df['Age'].isnull().astype(int)

Fill in the missing Age values with the median age of each deck.

In [None]:
space_df['Age'] = space_df['Age'].fillna(space_df.groupby('Deck')['Age'].transform('median'))
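
The group-wise fill above works because `transform('median')` returns a Series aligned row by row with the original frame. A small made-up example shows the mechanics.

```python
import numpy as np
import pandas as pd

# Toy data: two decks, each with one missing age
demo = pd.DataFrame({
    "Deck": [0, 0, 0, 1, 1],
    "Age": [20.0, 30.0, np.nan, 50.0, np.nan],
})

# Per-deck median, broadcast back to every row of that deck
deck_median = demo.groupby("Deck")["Age"].transform("median")

# Only the missing entries are replaced; observed ages are kept
demo["Age_filled"] = demo["Age"].fillna(deck_median)
print(demo)
```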

In [None]:
plot_histgram(space_df[tr_idx],'Age')
display_stat(space_df[tr_idx],'Age')

-----------------------------------------------------------------
## Money paid for Titanic's many luxury amenities

![](https://blog-cdn.touringplans.com/blog/wp-content/uploads/2018/02/Star-Wars-Hotel-3498234-624x321.png)

Picture Credit: https://blog-cdn.touringplans.com

-----------------------------------------
### RoomService

In [None]:
space_df['Has_RoomService'] = space_df['RoomService'].isnull().astype(int)

In [None]:
display_stat(space_df[tr_idx],'RoomService')

In [None]:
plot_histgram(space_df[tr_idx],'RoomService')
display_stat(space_df[tr_idx],'RoomService')

<span style="color:Blue"> Observation:
* This feature is skewed. It seems necessary to do a non-linear transformation.

In [None]:
pt = PowerTransformer(method='yeo-johnson')
space_df[['RoomService_pt']] = pt.fit_transform(space_df[['RoomService']])

In [None]:
plot_histgram(space_df[tr_idx],'RoomService_pt')
display_stat(space_df[tr_idx],'RoomService_pt')
space_df.drop('RoomService_pt',axis=1,inplace=True)
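
To see the effect of the transformation in isolation, the sketch below applies Yeo-Johnson to synthetic spending data. The data (many zeros plus a long right tail) is an assumption chosen to resemble a spend column; it is not drawn from the competition files.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)

# Synthetic spend: 60% zeros and a log-normal tail (illustrative assumption)
spend = pd.DataFrame({
    "Spend": np.concatenate([np.zeros(600), rng.lognormal(5, 1.5, 400)])
})

# Yeo-Johnson accepts zeros; the output is standardized by default
pt_demo = PowerTransformer(method="yeo-johnson")
spend["Spend_pt"] = pt_demo.fit_transform(spend[["Spend"]]).ravel()

skew_before = spend["Spend"].skew()
skew_after = spend["Spend_pt"].skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The transform reduces skew, but with this many exact zeros the result stays far from a bell curve, so the histogram is still worth inspecting after transforming.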

------------------------------
### FoodCourt

In [None]:
space_df['Has_FoodCourt'] = space_df['FoodCourt'].isnull().astype(int)

In [None]:
plot_histgram(space_df[tr_idx],'FoodCourt')
display_stat(space_df[tr_idx],'FoodCourt')

<span style="color:Blue"> Observation:
* This feature is skewed. It seems necessary to do a non-linear transformation.

In [None]:
space_df[['FoodCourt_pt']] = pt.fit_transform(space_df[['FoodCourt']])

In [None]:
plot_histgram(space_df[tr_idx],'FoodCourt_pt')
display_stat(space_df[tr_idx],'FoodCourt_pt')
space_df.drop('FoodCourt_pt',axis=1,inplace=True)

--------------------------------
### ShoppingMall

In [None]:
space_df['Has_ShoppingMall'] = space_df['ShoppingMall'].isnull().astype(int)

In [None]:
plot_histgram(space_df[tr_idx],'ShoppingMall')
display_stat(space_df[tr_idx],'ShoppingMall')

<span style="color:Blue"> Observation:
* This feature is skewed. It seems necessary to do a non-linear transformation.

In [None]:
space_df[['ShoppingMall_pt']] = pt.fit_transform(space_df[['ShoppingMall']])

In [None]:
plot_histgram(space_df[tr_idx],'ShoppingMall_pt')
display_stat(space_df[tr_idx],'ShoppingMall_pt')
space_df.drop('ShoppingMall_pt',axis=1,inplace=True)

---------------------------------
### Spa

In [None]:
space_df['Has_Spa'] = space_df['Spa'].isnull().astype(int)

In [None]:
display_stat(space_df[tr_idx],'Spa')

In [None]:
plot_histgram(space_df[tr_idx],'Spa')
display_stat(space_df[tr_idx],'Spa')

<span style="color:Blue"> Observation:
* This feature is skewed. It seems necessary to do a non-linear transformation.

In [None]:
space_df[['Spa_pt']] = pt.fit_transform(space_df[['Spa']])

In [None]:
plot_histgram(space_df[tr_idx],'Spa_pt')
display_stat(space_df[tr_idx],'Spa_pt')
space_df.drop('Spa_pt',axis=1,inplace=True)

------------------------------
### VRDeck

In [None]:
space_df['Has_VRDeck'] = space_df['VRDeck'].isnull().astype(int)

In [None]:
display_stat(space_df[tr_idx],'VRDeck')

In [None]:
plot_histgram(space_df[tr_idx],'VRDeck')
display_stat(space_df[tr_idx],'VRDeck')

<span style="color:Blue"> Observation:
* This feature is skewed. It seems necessary to do a non-linear transformation.

In [None]:
space_df[['VRDeck_pt']] = pt.fit_transform(space_df[['VRDeck']])

In [None]:
plot_histgram(space_df[tr_idx],'VRDeck_pt')
display_stat(space_df[tr_idx],'VRDeck_pt')
space_df.drop('VRDeck_pt',axis=1,inplace=True)

In [None]:
space_df['TotalSpend'] = space_df['VRDeck'] + space_df['Spa'] + space_df['ShoppingMall'] + space_df['FoodCourt'] + space_df['RoomService']
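
Adding columns with `+` propagates missing values: if any single amenity is missing, the total is missing too. The sketch below contrasts this with a row-wise `sum`, which is offered only as a possible alternative and is not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy spend columns with some gaps
demo = pd.DataFrame({
    "Spa": [10.0, np.nan, np.nan],
    "VRDeck": [5.0, 7.0, np.nan],
})

# Element-wise addition: any NaN makes the total NaN
plus_total = demo["Spa"] + demo["VRDeck"]

# Row-wise sum skipping NaN; min_count=1 keeps NaN when every value is missing
sum_total = demo[["Spa", "VRDeck"]].sum(axis=1, min_count=1)

print(pd.DataFrame({"plus": plus_total, "sum": sum_total}))
```

Which behaviour is preferable depends on whether a partially known total is considered meaningful; with `+`, the missing totals are left to the imputation step later on.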

<hr style="border: solid 2px black;">

# Non-Linear Transformation 

![](https://www.researchgate.net/profile/Chun-Fung-2/publication/224678404/figure/fig1/AS:646829247057921@1531227518360/llustration-of-Non-linear-Data-Transformation-3-Proposed-Framework-31-Previous-Work.png)

In [None]:
transform_features = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck','TotalSpend']
space_df[transform_features] = pt.fit_transform(space_df[transform_features])

<hr style="border: solid 3px blue;">

# Machine Learning

![](https://cdn.dribbble.com/users/1373613/screenshots/5436457/plexus___3.gif)

Picture Credit: https://cdn.dribbble.com

In [None]:
num_cols = space_df.select_dtypes(exclude = ['object', 'bool']).columns.tolist()

In [None]:
from pycaret.classification import *

---------------------------------------------------------
## Making Pipeline before Training

![](https://signal-to-noise.xyz/static/images/pipes.jpg)

Picture Credit: https://signal-to-noise.xyz

In [None]:
_ = setup(data = space_df[tr_idx], 
      target = 'Transported',
      numeric_features = num_cols,
      silent = True,
      remove_multicollinearity = True,
      ignore_low_variance = True,
      imputation_type = 'simple',
      categorical_imputation = 'mode',
      numeric_imputation = 'median' )

In [None]:
sns.set(style="ticks", context="talk",font_scale = 1)
plt.style.use("dark_background")

-----------------------------------------
## Comparing Models

In [None]:
top3 = compare_models(sort='Accuracy',n_select = 3
                      ,exclude = ['knn', 'svm','ridge','nb','dummy','qda','xgboost'] )

In [None]:
catboost = create_model('catboost')
lightgbm = create_model('lightgbm')

---------------------------------------
## Tuning Hyperparameters

![](https://miro.medium.com/max/1400/0*8c_vfbRh9YUSeIXJ)

Picture Credit: https://miro.medium.com

> In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned.
> 
> The same kind of machine learning model can require different constraints, weights or learning rates to generalize different data patterns. These measures are called hyperparameters, and have to be tuned so that the model can optimally solve the machine learning problem. Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined loss function on given independent data. The objective function takes a tuple of hyperparameters and returns the associated loss. Cross-validation is often used to estimate this generalization performance.

Ref: https://en.wikipedia.org/wiki/Hyperparameter_optimization
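
The definition above can be sketched with a plain random search. Everything here is hypothetical: `toy_loss` is a stand-in for a cross-validated loss, and the parameter ranges are arbitrary. It is not the search that `tune_model` runs.

```python
import random

def toy_loss(learning_rate, num_leaves):
    """Hypothetical objective; lowest at learning_rate=0.1, num_leaves=31."""
    return (learning_rate - 0.1) ** 2 + ((num_leaves - 31) / 100) ** 2

def random_search(n_iter, seed=0):
    """Sample hyperparameter tuples at random and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_iter):
        params = {
            "learning_rate": rng.uniform(0.01, 0.3),
            "num_leaves": rng.randint(8, 64),
        }
        loss = toy_loss(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

best_params, best_loss = random_search(n_iter=40)
print(best_params, best_loss)
```

In practice the objective would be a cross-validated metric, and smarter samplers spend the same budget of iterations on more promising regions of the search space.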

In [None]:
tuned_lightgbm = tune_model(lightgbm, 
                            optimize = 'Accuracy',
                            search_library="tune-sklearn",
                            search_algorithm="optuna",
                            early_stopping = True,
                            n_iter = 40)

In [None]:
tuned_catboost = tune_model(catboost,
                            optimize = 'Accuracy',
                            search_library="tune-sklearn",
                            search_algorithm="optuna",
                            early_stopping = True,
                            n_iter = 40)

----------------------------------------------------------
# Interpreting Models


In [None]:
with plt.rc_context({'figure.facecolor':'lightgrey'}):
    interpret_model(catboost)

<span style="color:Blue"> Observation:
* Spa, VRDeck, and RoomService features were judged to be important features.

In [None]:
interpret_model(catboost,plot='pdp',feature='Spa')

In [None]:
interpret_model(catboost,plot='pdp',feature='VRDeck')

In [None]:
with plt.rc_context({'figure.facecolor':'lightgrey'}):
    interpret_model(lightgbm)

<span style="color:Blue"> Observation:
* The lightgbm model judged the CryoSleep feature to be the most important feature.
* This diversity is an advantage of ensemble learning. 

-----------------------------------------------------------
# Ensemble (Soft Voting)

![](https://miro.medium.com/max/806/1*bliKQZGPccS7ho9Zo6uC7A.jpeg)

Picture Credit: https://miro.medium.com

> Ensemble learning is the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem. Ensemble learning is primarily used to improve the (classification, prediction, function approximation, etc.) performance of a model, or reduce the likelihood of an unfortunate selection of a poor one. Other applications of ensemble learning include assigning a confidence to the decision made by the model, selecting optimal (or near optimal) features, data fusion, incremental learning, nonstationary learning and error-correcting. This article focuses on classification related applications of ensemble learning, however, all principle ideas described below can be easily generalized to function approximation or prediction type problems as well.

Ref: http://www.scholarpedia.org/article/Ensemble_learning

In [None]:
blend_soft = blend_models(estimator_list = [catboost,lightgbm], optimize = 'Accuracy',method = 'soft')
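
Soft voting averages the class probabilities predicted by each model and then picks the class with the highest average. The NumPy sketch below uses made-up probabilities for two hypothetical models to show the mechanics; it does not call pycaret.

```python
import numpy as np

# Predicted probabilities [P(class 0), P(class 1)] for three samples (made up)
proba_a = np.array([[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]])
proba_b = np.array([[0.7, 0.3], [0.7, 0.3], [0.35, 0.65]])

# Soft voting: average the probabilities, then take the most likely class
avg = (proba_a + proba_b) / 2
pred = avg.argmax(axis=1)

print(avg)
print(pred)
```

On the second sample the two models disagree, and the more confident one decides the outcome. Hard voting, which counts class votes only, would have no way to break that tie.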

In [None]:
opt_model = optimize_threshold(blend_soft)

In [None]:
cali_model = calibrate_model(opt_model)

In [None]:
final_model = finalize_model(cali_model)

In [None]:
plt.figure(figsize=(8, 8))
plot_model(final_model, plot='boundary')

<span style="color:Blue"> Observation:
* Where the model places its decision boundary matters. The plot above shows how the model separates the two classes.
* There are also regions where the model cannot draw a clear boundary at all.

In [None]:
plt.figure(figsize=(8, 8))
plot_model(final_model, plot='confusion_matrix')

In [None]:
X_test_df = space_df[~tr_idx].drop('Transported',axis=1)
last_result_df = predict_model(final_model, data=X_test_df)
submission_data['Transported'] = list(last_result_df.Label)
submission_data.to_csv('submission.csv', index = False)

<hr style="border: solid 3px blue;">