**Import Libraries and set up Notebook**

In [None]:
import numpy as np
import pandas as pd

from IPython.display import display, HTML  # IPython.core.display is deprecated for these imports
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio


import seaborn as sns
from importlib import reload
import matplotlib.pyplot as plt
import matplotlib
import warnings

# Configure Jupyter Notebook
pd.set_option('display.max_columns', None) 
pd.set_option('display.max_rows', 500) 
pd.set_option('display.expand_frame_repr', False)
# pd.set_option('max_colwidth', -1)
display(HTML("<style>div.output_scroll { height: 35em; }</style>"))

reload(plt)
%matplotlib inline
%config InlineBackend.figure_format ='retina'

warnings.filterwarnings('ignore')

# configure plotly graph objects
pio.renderers.default = 'iframe'
# pio.renderers.default = 'vscode'

pio.templates["ck_template"] = go.layout.Template(
    # layout_colorway = px.colors.sequential.Viridis, 
#     layout_hovermode = 'closest',
#     layout_hoverdistance = -1,
    layout_autosize=False,
    layout_width=800,
    layout_height=600,
    layout_font = dict(family="Calibri Light"),
    layout_title_font = dict(family="Calibri"),
    layout_hoverlabel_font = dict(family="Calibri Light"),
#     plot_bgcolor="white",
)
 
# pio.templates.default = 'seaborn+ck_template+gridon'
pio.templates.default = 'ck_template+gridon'
# pio.templates.default = 'seaborn+gridon'
# pio.templates

# Business Understanding

Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

Help save them and change history!

# Data Understanding (Exploratory Data Analysis)

In [None]:
df = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv")

**PassengerId** - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.

**HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.

**CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

**Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.

**Destination** - The planet the passenger will be debarking to.

**Age** - The age of the passenger.

**VIP** - Whether the passenger has paid for special VIP service during the voyage.

**RoomService, FoodCourt, ShoppingMall, Spa, VRDeck** - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.

**Name** - The first and last names of the passenger.

**Transported** - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
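The `gggg_pp` and `deck/num/side` formats described above can be split into separate columns. The following is a minimal sketch on made-up rows; the column names `Group`, `NumberInGroup`, `Deck`, `CabinNum`, `Side` and `GroupSize` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical sample rows that follow the documented formats
sample = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02"],
    "Cabin": ["B/0/P", "F/12/S", None],
})

# gggg_pp -> travel group and position within the group
sample[["Group", "NumberInGroup"]] = sample["PassengerId"].str.split("_", expand=True)

# deck/num/side -> three separate columns (a missing cabin stays missing)
sample[["Deck", "CabinNum", "Side"]] = sample["Cabin"].str.split("/", expand=True)

# group size follows from the shared group prefix
sample["GroupSize"] = sample.groupby("Group")["PassengerId"].transform("count")

print(sample)
```

Features such as the group size may carry signal even though the raw identifier itself is unique per passenger.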

In [None]:
df.info()

In [None]:
list(df)

In [None]:
df.head(10)

In [None]:
df.describe(include='all')

In [None]:
plt.figure(figsize=(15,15))
threshold = 0.00
sns.set_style("whitegrid", {"axes.facecolor": ".0"})
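# correlate only numeric and boolean columns; recent pandas versions raise an error on text columns
df_cluster2 = df.select_dtypes(include=[np.number, 'bool']).corr()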
mask = df_cluster2.where((abs(df_cluster2) >= threshold)).isna()
plot_kws={"s": 1}
sns.heatmap(df_cluster2,
            cmap='RdYlBu',
            annot=True,
            mask=mask,
            linewidths=0.2, 
            linecolor='lightgrey').set_facecolor('white')

In [None]:
# Image is too large for the notebook to be saved, pity

from plotly.subplots import make_subplots
norm_width = 1.5
high_width = 2.5

title_set=[]
secondary_Y=[]
for feature in df.columns:
    title_set.append(feature)
    
fig = make_subplots(rows=len(df.columns),
                    cols=1,
                    subplot_titles=title_set,
                    # specs=secondary_Y
                   )

fig.update_layout(title="Comparison of Labels",
                  height=400*len(df.columns),
                  showlegend=False,
                 )

for i, feature in enumerate(df.columns, start=1):
    # violins need numeric data; non-numeric and boolean columns keep an empty subplot
    if not pd.api.types.is_numeric_dtype(df[feature]) or pd.api.types.is_bool_dtype(df[feature]):
        continue
    # one violin per target class, so the two distributions can be compared side by side
    for label in (False, True):
        fig.add_trace(go.Violin(y=df.loc[df['Transported'] == label, feature],
                                name=f'Transported = {label}',
                                meanline_visible=True,
                                ),
                      row=i, col=1)
fig.show()

In [None]:
df.describe(include='all')

In [None]:
%%time

def corrdot(*args, **kwargs):
    corr_r = args[0].corr(args[1])
    corr_text = f"{corr_r:2.2f}".replace("0.", ".")
    ax = plt.gca()
    ax.set_axis_off()
    marker_size = abs(corr_r) * 10000
    ax.scatter([.5], [.5], marker_size, [corr_r], alpha=0.6, cmap='coolwarm',
               vmin=-1, vmax=1, transform=ax.transAxes)
    font_size = abs(corr_r) * 40 + 5
    ax.annotate(corr_text, [.5, .5,],  xycoords="axes fraction",
                ha='center', va='center', fontsize=font_size)

sns.set(style='white', font_scale=1.6)
g = sns.PairGrid(df.select_dtypes(include=[np.number]), aspect=1.4, diag_sharey=False)

g.map_lower(sns.regplot, lowess=True, ci=False, line_kws={'color': 'black','lw': 1.5}, scatter_kws={'s':3,'alpha':0.1,'color':'black'})
# sns.distplot is deprecated; histplot with kde=True draws a histogram plus density curve
g.map_diag(sns.histplot, kde=True, color='gray', alpha=1)
g.map_upper(corrdot)

In [None]:
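# pandas_profiling has been renamed to ydata-profiling; on newer installs use
# `from ydata_profiling import ProfileReport` instead
from pandas_profiling import ProfileReport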

In [None]:
%%time
profile = ProfileReport(df,
                        title="Post Block Assignment 3",
                        dataset={"description": "This profiling report was generated for Carl Kirstein",
                                 "copyright_holder": "Carl Kirstein",
                                 "copyright_year": "2022",
                                },
                        explorative=True,
                       )
profile

## PassengerId

This is just a set of unique identifiers, so it can be dropped.

In [None]:
df['PassengerId'].str[0:4].value_counts()

In [None]:
df['PassengerId'].str[-2:].value_counts()

## HomePlanet

In [None]:
len(df[df['HomePlanet'].isna()])

In [None]:
df['HomePlanet'].value_counts()

In [None]:
df_group = df.groupby(['HomePlanet','Transported'])
df_group.count().reset_index()

fig = px.bar(df_group.count().reset_index(), 
             x="PassengerId", 
             y='HomePlanet', 
             color='Transported',
             orientation='h')
fig.update_layout(barmode='group')
fig.show()

In [None]:
df.groupby('HomePlanet')['Transported'].value_counts(normalize=True).unstack('Transported').plot.barh(stacked=True,figsize=(10,10))

## CryoSleep

In [None]:
len(df[df['CryoSleep'].isna()])

In [None]:
df['CryoSleep'].value_counts()

In [None]:
df_group = df.groupby(['CryoSleep','Transported'])
df_group.count().reset_index()

fig = px.bar(df_group.count().reset_index(), 
             x="PassengerId", 
             y='CryoSleep', 
             color='Transported',
             orientation='h')
fig.update_layout(barmode='group')
fig.show()

In [None]:
df.groupby('CryoSleep')['Transported'].value_counts(normalize=True).unstack('Transported').plot.barh(stacked=True,figsize=(10,10))

## Cabin

In [None]:
# let's check out how many people stayed in various cabins:
df['Cabin'].value_counts()

In [None]:
# it seems as though there is a logic to the cabin numbering... especially in the first and last letters:

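# name the columns explicitly; the default names produced by value_counts differ between pandas versions
fig = px.bar(df['Cabin'].str[0].value_counts().rename_axis('Cabin Deck').reset_index(name='Passengers'), x="Passengers", y="Cabin Deck", orientation='h')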
fig.show()

In [None]:
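# name the columns explicitly; the default names produced by value_counts differ between pandas versions
fig = px.bar(df['Cabin'].str[-1].value_counts().rename_axis('Cabin Side').reset_index(name='Passengers'), x="Passengers", y="Cabin Side", orientation='h')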
fig.show()

In [None]:
df['Cabin Deck'] = df['Cabin'].str[0]
df['Cabin Side'] = df['Cabin'].str[-1]

In [None]:
df_group = df.groupby(['Cabin Deck','Transported'])
df_group.count().reset_index()

fig = px.bar(df_group.count().reset_index(), 
             x="PassengerId", 
             y='Cabin Deck', 
             color='Transported',
             orientation='h')
fig.update_layout(barmode='group')
fig.show()

In [None]:
df.groupby('Cabin Deck')['Transported'].value_counts(normalize=True).unstack('Transported').plot.barh(stacked=True,figsize=(10,10))

In [None]:
df_group = df.groupby(['Cabin Side','Transported'])
df_group.count().reset_index()

fig = px.bar(df_group.count().reset_index(), 
             x="PassengerId", 
             y='Cabin Side', 
             color='Transported',
             orientation='h')
fig.update_layout(barmode='group')
fig.show()

In [None]:
df.groupby('Cabin Side')['Transported'].value_counts(normalize=True).unstack('Transported').plot.barh(stacked=True,figsize=(10,10))

## Destination

In [None]:
# let's check out how many passengers are travelling to each destination:
fig = px.bar(df['Destination'].value_counts().rename_axis('Destination').reset_index(name='Passengers'), x="Passengers", y="Destination", orientation='h')
fig.show()

In [None]:
df_group = df.groupby(['Destination','Transported'])
df_group.count().reset_index()

In [None]:
df_group = df.groupby(['Destination','Transported'])
df_group.count().reset_index()
fig = px.bar(df_group.count().reset_index(), 
             x="PassengerId", 
             y="Destination", 
             color='Transported',
             orientation='h')
fig.update_layout(barmode='group')
fig.show()


In [None]:
df.groupby('Destination')['Transported'].value_counts(normalize=True).unstack('Transported').plot.barh(stacked=True,figsize=(10,10))

## Age

In [None]:
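df_data = df
# work on an explicit copy so that the assignments below do not act on a slice of df
temp = df_data.loc[df_data['Age'].notna(), ['Transported','Age']].copy()
# convert the boolean target to 0/1 so that its mean per age is the share of transported passengers
temp['Transported'] = temp['Transported'].astype(int)
temp['Transported'] = temp.groupby('Age')['Transported'].transform('mean')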

fig = px.scatter(temp, x='Age',y='Transported')
fig.update_xaxes(showgrid = False, showline = True, gridwidth = 0.05, linecolor = 'gray', zeroline = False, linewidth = 2)
fig.update_yaxes(showline = True, gridwidth = 0.05, linecolor = 'gray', linewidth = 2, zeroline = False)

fig.update_traces(marker=dict(size=10,
                              # line=dict(width=2,
                              #           color='DarkSlateGrey')
                             ),
                  selector=dict(mode='markers'))

# General Styling
fig.update_layout(height=600,
              margin=dict(b=50,r=30,l=100,t=100),
              title = "Transported Probability by Age",                  
              hoverlabel=dict(font_color="floralwhite"),
              showlegend=False)
fig.show()

## VIP

In [None]:
df_group = df.groupby(['VIP','Transported'])
df_group.count().reset_index()

fig = px.bar(df_group.count().reset_index(), 
             x="PassengerId", 
             y="VIP", 
             color='Transported',
             orientation='h')
fig.update_layout(barmode='group')
fig.show()

In [None]:
df.groupby('VIP')['Transported'].value_counts(normalize=True).unstack('Transported').plot.barh(stacked=True,figsize=(10,10))

## RoomService

In [None]:
df['RoomService'].value_counts()

In [None]:
# filter on the feature being plotted (not on Age) and work on an explicit copy
temp = df_data.loc[df_data['RoomService'].notna(), ['Transported','RoomService']].copy()
temp['Transported'] = temp['Transported'].astype(int)
temp['Transported'] = temp.groupby('RoomService')['Transported'].transform('mean')

fig = px.scatter(temp, x='RoomService',y='Transported')
fig.update_xaxes(showgrid = False, showline = True, gridwidth = 0.05, linecolor = 'gray', zeroline = False, linewidth = 2)
fig.update_yaxes(showline = True, gridwidth = 0.05, linecolor = 'gray', linewidth = 2, zeroline = False)

fig.update_traces(marker=dict(size=10,
                              # line=dict(width=2,
                              #           color='DarkSlateGrey')
                             ),
                  selector=dict(mode='markers'))

# General Styling
fig.update_layout(height=600,
              margin=dict(b=50,r=30,l=100,t=100),
              title = "Transported Probability by RoomService",                  
              hoverlabel=dict(font_color="floralwhite"),
              showlegend=False)
fig.show()

## FoodCourt

In [None]:
df['FoodCourt'].value_counts()

In [None]:
# filter on the feature being plotted (not on Age) and work on an explicit copy
temp = df_data.loc[df_data['FoodCourt'].notna(), ['Transported','FoodCourt']].copy()
temp['Transported'] = temp['Transported'].astype(int)
temp['Transported'] = temp.groupby('FoodCourt')['Transported'].transform('mean')

fig = px.scatter(temp, x='FoodCourt',y='Transported')
fig.update_xaxes(showgrid = False, showline = True, gridwidth = 0.05, linecolor = 'gray', zeroline = False, linewidth = 2)
fig.update_yaxes(showline = True, gridwidth = 0.05, linecolor = 'gray', linewidth = 2, zeroline = False)

fig.update_traces(marker=dict(size=10,
                              # line=dict(width=2,
                              #           color='DarkSlateGrey')
                             ),
                  selector=dict(mode='markers'))

# General Styling
fig.update_layout(height=600,
              margin=dict(b=50,r=30,l=100,t=100),
              title = "Transported Probability by FoodCourt",                  
              hoverlabel=dict(font_color="floralwhite"),
              showlegend=False)
fig.show()

## ShoppingMall

In [None]:
df['ShoppingMall'].value_counts()

In [None]:
# filter on the feature being plotted (not on Age) and work on an explicit copy
temp = df_data.loc[df_data['ShoppingMall'].notna(), ['Transported','ShoppingMall']].copy()
temp['Transported'] = temp['Transported'].astype(int)
temp['Transported'] = temp.groupby('ShoppingMall')['Transported'].transform('mean')

fig = px.scatter(temp, x='ShoppingMall',y='Transported')
fig.update_xaxes(showgrid = False, showline = True, gridwidth = 0.05, linecolor = 'gray', zeroline = False, linewidth = 2)
fig.update_yaxes(showline = True, gridwidth = 0.05, linecolor = 'gray', linewidth = 2, zeroline = False)

fig.update_traces(marker=dict(size=10,
                              # line=dict(width=2,
                              #           color='DarkSlateGrey')
                             ),
                  selector=dict(mode='markers'))

# General Styling
fig.update_layout(height=600,
              margin=dict(b=50,r=30,l=100,t=100),
              title = "Transported Probability",                  
              hoverlabel=dict(font_color="floralwhite"),
              showlegend=False)
fig.show()

## Spa

In [None]:
df['Spa'].value_counts()

In [None]:
# filter on the feature being plotted (not on Age) and work on an explicit copy
temp = df_data.loc[df_data['Spa'].notna(), ['Transported','Spa']].copy()
temp['Transported'] = temp['Transported'].astype(int)
temp['Transported'] = temp.groupby('Spa')['Transported'].transform('mean')

fig = px.scatter(temp, x='Spa',y='Transported')
fig.update_xaxes(showgrid = False, showline = True, gridwidth = 0.05, linecolor = 'gray', zeroline = False, linewidth = 2)
fig.update_yaxes(showline = True, gridwidth = 0.05, linecolor = 'gray', linewidth = 2, zeroline = False)

fig.update_traces(marker=dict(size=10,
                              # line=dict(width=2,
                              #           color='DarkSlateGrey')
                             ),
                  selector=dict(mode='markers'))

# General Styling
fig.update_layout(height=600,
              margin=dict(b=50,r=30,l=100,t=100),
              title = "Transported Probability",                  
              hoverlabel=dict(font_color="floralwhite"),
              showlegend=False)
fig.show()

## VRDeck

In [None]:
df['VRDeck'].value_counts()

In [None]:
# filter on the feature being plotted (not on Age) and work on an explicit copy
temp = df_data.loc[df_data['VRDeck'].notna(), ['Transported','VRDeck']].copy()
temp['Transported'] = temp['Transported'].astype(int)
temp['Transported'] = temp.groupby('VRDeck')['Transported'].transform('mean')

fig = px.scatter(temp, x='VRDeck',y='Transported')
fig.update_xaxes(showgrid = False, showline = True, gridwidth = 0.05, linecolor = 'gray', zeroline = False, linewidth = 2)
fig.update_yaxes(showline = True, gridwidth = 0.05, linecolor = 'gray', linewidth = 2, zeroline = False)

fig.update_traces(marker=dict(size=10,
                              # line=dict(width=2,
                              #           color='DarkSlateGrey')
                             ),
                  selector=dict(mode='markers'))

# General Styling
fig.update_layout(height=600,
              margin=dict(b=50,r=30,l=100,t=100),
              title = "Transported Probability",                  
              hoverlabel=dict(font_color="floralwhite"),
              showlegend=False)
fig.show()

## Name

In [None]:
df['Name'].value_counts()

In [None]:
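# split on the first space only, so a name with more than two parts still yields exactly two columns
df[['FirstName','Surname']]=df['Name'].str.split(' ',n=1,expand=True)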
# df['Name'].str.split(' ')

In [None]:
df_group = df.groupby(['FirstName','Transported'])
df_group.count().reset_index()

fig = px.bar(df_group.count().reset_index(), 
             x="PassengerId", 
             y="FirstName", 
             color='Transported',
             orientation='h')
fig.update_layout(barmode='relative')
fig.show()

In [None]:
# df.groupby('FirstName')['Transported'].value_counts(normalize=True).unstack('Transported').plot.barh(stacked=True,figsize=(10,10))

In [None]:
df_group = df.groupby(['Surname','Transported'])
df_group.count().reset_index()

fig = px.bar(df_group.count().reset_index(), 
             x="PassengerId", 
             y="Surname", 
             color='Transported',
             orientation='h')
fig.update_layout(barmode='relative')
fig.show()

In [None]:
# df.groupby('Surname')['Transported'].value_counts(normalize=True).unstack('Transported').plot.barh(stacked=True,figsize=(10,10))

## Transported

In [None]:
df_group = df.groupby(['Transported'])
df_group.count()

In [None]:
df[df['Transported'].isna()]

# Pre-processing and Feature Selection

The data quality report was generated for Post Block Assignment 1. This section will process and select the features in accordance with the recommendations of that report. 

## Drop irrelevant or excess features

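The first feature to drop is 'PassengerId', which is an identifier rather than a descriptive feature. 'Cabin' and 'Name' (together with the derived 'FirstName' and 'Surname') are also dropped because their cardinality is so high that the values are practically unique. Useful patterns were extracted from 'Cabin' (deck and side), but no discernible patterns were found in the names. 'Transported' is dropped once it has been copied into the numeric 'Target' column.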

In [None]:
# make the True/False columns numeric: Python treats False as 0 and True as 1
# (columns that still contain missing values stay object-typed and are one-hot encoded later)
df['Target']=df['Transported']+1-1
df['CryoSleep']=df['CryoSleep']+1-1
df['VIP']=df['VIP']+1-1

In [None]:
df

In [None]:
list_drop = ['PassengerId','Transported','Name','FirstName','Surname','Cabin']
df.drop(list_drop,axis=1,inplace=True)

## Manage Missing Values

In [None]:
# show the numeric features
df_numeric = df.select_dtypes(include=[np.number])
df_numeric.describe(include='all')

In [None]:
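# replace the missing numeric values with the column mean
# select_dtypes returns a copy, so the filled values are written back to df explicitly
df_numeric = df_numeric.fillna(df_numeric.mean())
df[df_numeric.columns] = df_numeric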

In [None]:
# show the categorical features
df_cat = df.select_dtypes(exclude=[np.number])
df_cat.describe(include='all')

In [None]:
df_cat['HomePlanet'].value_counts()

In [None]:
df_cat['CryoSleep'].value_counts()

In [None]:
df_cat['VIP'].value_counts()

In [None]:
df.dropna(axis=0, how='any',inplace=True)

## Apply Clamping

Extreme values should be pruned to reduce the skewness of some distributions. The rule applied here is that a feature is clamped to its 99th percentile when its upper tail is long, that is, when the distance from the third quartile to the maximum is greater than the distance from the median to the third quartile. Capping only the top 1% keeps most of the tail, which can carry more interesting information than we want to discard.

The clamping is also only applied to features with a maximum of more than 10. This prevents bimodal and small-valued distributions from being excessively pruned.
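As a minimal sketch of the rule that the clamping code in this section implements (long upper tail, maximum above 10, cap at the 99th percentile), the logic can be wrapped in a small function. The name `clamp_upper_tail` and the sample values are assumptions made for illustration only.

```python
import pandas as pd

def clamp_upper_tail(s: pd.Series, q: float = 0.99, min_max: float = 10) -> pd.Series:
    """Cap values above the q-th quantile when the upper tail is long."""
    q3, med, top = s.quantile(0.75), s.median(), s.max()
    long_tail = (top - q3) > (q3 - med)
    if long_tail and top > min_max:
        return s.clip(upper=s.quantile(q))
    return s

# Hypothetical spending values with one extreme outlier
spend = pd.Series([0, 0, 0, 5, 10, 20, 40, 80, 160, 20000], dtype=float)
clamped = clamp_upper_tail(spend)

# A small-valued feature (maximum not above 10) is left untouched
flag = pd.Series([0, 0, 1, 1, 1], dtype=float)
unchanged = clamp_upper_tail(flag)

print(clamped.max(), unchanged.tolist())
```

Using `clip` keeps missing values as missing, whereas the `np.where` form used in the notebook replaces them with the cap; after the rows with missing values are dropped the two behave the same.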

In [None]:
# Clamp extreme Values
df_numeric = df.select_dtypes(include=[np.number])
df_numeric.describe(include='all')

In [None]:
# if the distance from the median to the 3rd quartile is less than the distance from the 3rd quartile to the maximum, then clamp to the 99th percentile

DEBUG =0

for feature in df_numeric.columns:
    if DEBUG == 1:
        print(feature)
        print('max = '+str(df_numeric[feature].max()))
        print('75th = '+str(df_numeric[feature].quantile(0.75)))
        print('median = '+str(df_numeric[feature].median()))
        print((df_numeric[feature].max()-df[feature].quantile(0.75))>(df[feature].quantile(0.75)-df_numeric[feature].median()))
        print('----------------------------------------------------')
    # if df_numeric[feature].max()>10*df_numeric[feature].median() and df_numeric[feature].max()>10 :
    if (df[feature].max()-df[feature].quantile(0.75))>(df[feature].quantile(0.75)-df[feature].median()) and df[feature].max()>10 :
        df[feature] = np.where(df[feature]<df[feature].quantile(0.99), df[feature], df[feature].quantile(0.99))

In [None]:
df_numeric = df.select_dtypes(include=[np.number])
df_numeric.describe(include='all')

## Apply log function to nearly all numeric features, since most are skewed to the right

It would have been too much of a slog to apply the log function individually, therefore a simple rule has been set up: if the number of unique values in the continuous feature is more than 50 then apply the log function. The reason more than 50 unique values are sought is to filter out the integer based features that act more categorically.  
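To illustrate the transform itself, here is a small sketch on made-up values. It compares the `np.log(x + 1)` form used in this notebook with NumPy's `np.log1p`, which computes the same quantity, and shows how the skewness shrinks.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values that include a zero
x = pd.Series([0.0, 1.0, 9.0, 99.0, 999.0])

manual = np.log(x + 1)   # the form used in this notebook
stable = np.log1p(x)     # same transform, numerically safer for very small x

print(pd.DataFrame({"raw": x, "log(x+1)": manual, "log1p": stable}))
print("skew before:", x.skew(), "after:", stable.skew())
```

Adding 1 before taking the log keeps zero values at zero instead of sending them to negative infinity.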

In [None]:
df_numeric = df.select_dtypes(include=[np.number])
df_before = df_numeric.copy()
DEBUG = 0
for feature in df_numeric.columns:
    if DEBUG == 1:
        print(feature)
        print('nunique = '+str(df_numeric[feature].nunique()))
        print(df_numeric[feature].nunique()>50)
        print('----------------------------------------------------')
    if df_numeric[feature].nunique()>50:
        if df_numeric[feature].min()==0:
            df[feature] = np.log(df[feature]+1)
        else:
            df[feature] = np.log(df[feature])

df_numeric = df.select_dtypes(include=[np.number])

## Encode categorical features

The categorical features must be encoded to ensure that the models can interpret them. One-hot encoding is used since none of the categorical features are ordinal.  
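As a small illustration of what `pd.get_dummies` with `drop_first=True` produces, consider the made-up frame below. The rows are hypothetical; only the category names come from the dataset.

```python
import pandas as pd

# Hypothetical rows using category values from the dataset
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Mars", "Earth"],
    "Age": [24.0, 58.0, 33.0, 16.0],
})

# drop_first=True removes the first category (Earth), which becomes the all-zero baseline
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded)
```

Dropping one level per feature avoids perfectly collinear dummy columns, which matters mostly for linear models such as logistic regression.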

In [None]:
df = pd.get_dummies(df,drop_first=True)

In [None]:
df_numeric = df.select_dtypes(include=[np.number])
df_cat = df.select_dtypes(exclude=[np.number])

In [None]:
X = df.drop('Target',axis=1)
y = df['Target']
feature_names = list(X.columns)

In [None]:
df.head(10)


## Best Features

This section uses univariate statistical tests (chi-squared) to score how strongly each feature on its own is associated with the target feature.
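A toy sketch of the chi-squared scoring, on made-up non-negative data, may help when reading the scores: the first column follows the target exactly and the second barely relates to it. The arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical binary features: column 0 tracks the target, column 1 mostly does not
X_toy = np.array([[1, 1],
                  [1, 0],
                  [1, 1],
                  [0, 0],
                  [0, 1],
                  [0, 0]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

scores, p_values = chi2(X_toy, y_toy)
print("scores:", scores)
print("p-values:", p_values)
```

Note that `chi2` requires non-negative feature values, which is why it is applied after the clamping and log steps rather than to standardised data.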

In [None]:
# Feature Selection
from sklearn.feature_selection import SelectKBest, chi2

best_features = SelectKBest(score_func=chi2,k='all')

X = df.drop('Target',axis=1)
y = df['Target']
fit = best_features.fit(X,y)

df_scores=pd.DataFrame(fit.scores_)
df_col=pd.DataFrame(X.columns)

feature_score=pd.concat([df_col,df_scores],axis=1)
feature_score.columns=['feature','score']
feature_score.sort_values(by=['score'],ascending=True,inplace=True)

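# feature_score is sorted ascending, so the last rows hold the highest scores
top_scores = feature_score.tail(50)

fig = go.Figure(go.Bar(
            x=top_scores['score'],
            y=top_scores['feature'],
            orientation='h'))

fig.update_layout(title="Top 50 Features by Chi-Squared Score",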
                  height=1200,
                  showlegend=False,
                 )

fig.show()

In [None]:
# list_drop = ['VRDeck','RoomService','Spa']
# df.drop(list_drop,axis=1,inplace=True)

# Modelling


## Prep for Modelling

### Split test and training
In this section the data is split into test and training sets using stratified sampling. 

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.2, 
                                                    random_state = 0,
                                                    stratify=y)
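As a quick check of what `stratify` does, the sketch below splits made-up imbalanced labels and compares the class share in both parts. The arrays are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 zeros and 30 ones
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Both parts keep the 30 % share of ones
print("train share of ones:", y_tr.mean())
print("test share of ones:", y_te.mean())
```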

### Normalize features
A min-max scaler is applied to the features to bring them all onto the same scale (the range 0 to 1).

In [None]:
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
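The scaler is fitted on the training data only and then reused for the test data. A minimal sketch on made-up numbers shows the consequence: test values outside the training range fall outside 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [12.0]])   # 12 lies outside the training range

scaler = MinMaxScaler().fit(train)        # learns min=0 and max=10 from the training data
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(train_scaled.ravel(), test_scaled.ravel())
```

Fitting on the training split alone keeps information about the test set out of the preprocessing step.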

### Import Metrics

Import the metrics that will be used to evaluate the models later on.

In [None]:
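from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, roc_auc_score, matthews_corrcoef, ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; alias its replacement so the plotting cells keep working
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator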
import time
model_performance = pd.DataFrame(columns=['Accuracy','Recall','Precision','F1-Score','MCC score','ROC AUC','time to train','time to predict','total time'])


## Logistic Regression

In [None]:
%%time
from sklearn.linear_model import LogisticRegression
start = time.time()
# model = LogisticRegression(C=0.1).fit(X_train,y_train)
model = LogisticRegression().fit(X_train,y_train)
end_train = time.time()
y_predictions = model.predict(X_test) # These are the predictions from the test data.
end_predict = time.time()

In [None]:
accuracy = accuracy_score(y_test, y_predictions)
recall = recall_score(y_test, y_predictions, average='weighted')
precision = precision_score(y_test, y_predictions, average='weighted')
f1s = f1_score(y_test, y_predictions, average='weighted')
MCC = matthews_corrcoef(y_test, y_predictions)
# ROC_AUC = roc_auc_score(y_test, y_predictions, average='weighted')
ROC_AUC = roc_auc_score(y_test, model.predict_proba(X_test)[:,1], average='weighted')

print("Accuracy: "+ "{:.2%}".format(accuracy))
print("Recall: "+ "{:.2%}".format(recall))
print("Precision: "+ "{:.2%}".format(precision))
print("F1-Score: "+ "{:.2%}".format(f1s))
print("MCC: "+ "{:.2%}".format(MCC))
print("ROC AUC score: "+ "{:.2%}".format(ROC_AUC))
print("time to train: "+ "{:.2f}".format(end_train-start)+" s")
print("time to predict: "+"{:.2f}".format(end_predict-end_train)+" s")
print("total: "+"{:.2f}".format(end_predict-start)+" s")
model_performance.loc['Logistic'] = [accuracy, recall, precision, f1s,MCC,ROC_AUC,end_train-start,end_predict-end_train,end_predict-start]
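The same evaluation cell is repeated for every model in this notebook. One possible way to avoid the repetition is a small helper such as the sketch below; the name `evaluate` and the toy labels are assumptions for illustration and not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)

def evaluate(y_true, y_pred, y_score):
    """Return the classification metrics used in this notebook as a dict."""
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred, average="weighted"),
        "Precision": precision_score(y_true, y_pred, average="weighted"),
        "F1-Score": f1_score(y_true, y_pred, average="weighted"),
        "MCC score": matthews_corrcoef(y_true, y_pred),
        "ROC AUC": roc_auc_score(y_true, y_score),  # needs scores or probabilities, not hard labels
    }

# Hypothetical labels, hard predictions and predicted probabilities of class 1
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])
y_score = np.array([0.1, 0.6, 0.7, 0.9])

metrics = evaluate(y_true, y_pred, y_score)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

With a fitted classifier, the three arguments would be `y_test`, `model.predict(X_test)` and `model.predict_proba(X_test)[:, 1]`, matching the calls in the cells of this section.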

In [None]:
plt.rcParams['figure.figsize']=5,5 
sns.set_style("white")
plot_confusion_matrix(model, X_test, y_test, cmap=plt.cm.Blues)  
plt.show()

## kNN

In [None]:
%%time
from sklearn.neighbors import KNeighborsClassifier
start = time.time()
model = KNeighborsClassifier(n_neighbors=15).fit(X_train,y_train)
end_train = time.time()
y_predictions = model.predict(X_test) # These are the predictions from the test data.
end_predict = time.time()

In [None]:
accuracy = accuracy_score(y_test, y_predictions)
recall = recall_score(y_test, y_predictions, average='weighted')
precision = precision_score(y_test, y_predictions, average='weighted')
f1s = f1_score(y_test, y_predictions, average='weighted')
MCC = matthews_corrcoef(y_test, y_predictions)
# ROC_AUC = roc_auc_score(y_test, y_predictions, average='weighted')
ROC_AUC = roc_auc_score(y_test, model.predict_proba(X_test)[:,1], average='weighted')

print("Accuracy: "+ "{:.2%}".format(accuracy))
print("Recall: "+ "{:.2%}".format(recall))
print("Precision: "+ "{:.2%}".format(precision))
print("F1-Score: "+ "{:.2%}".format(f1s))
print("MCC: "+ "{:.2%}".format(MCC))
print("ROC AUC score: "+ "{:.2%}".format(ROC_AUC))
print("time to train: "+ "{:.2f}".format(end_train-start)+" s")
print("time to predict: "+"{:.2f}".format(end_predict-end_train)+" s")
print("total: "+"{:.2f}".format(end_predict-start)+" s")
model_performance.loc['kNN 15'] = [accuracy, recall, precision, f1s,MCC,ROC_AUC,end_train-start,end_predict-end_train,end_predict-start]

In [None]:
plt.rcParams['figure.figsize']=5,5 
sns.set_style("white")
plot_confusion_matrix(model, X_test, y_test, cmap=plt.cm.Blues)  
plt.show()

## Decision Tree


In [None]:
%%time
from sklearn.tree import DecisionTreeClassifier
start = time.time()
model = DecisionTreeClassifier().fit(X_train,y_train)
end_train = time.time()
y_predictions = model.predict(X_test) # These are the predictions from the test data.
end_predict = time.time()

In [None]:
accuracy = accuracy_score(y_test, y_predictions)
recall = recall_score(y_test, y_predictions, average='weighted')
precision = precision_score(y_test, y_predictions, average='weighted')
f1s = f1_score(y_test, y_predictions, average='weighted')
MCC = matthews_corrcoef(y_test, y_predictions)
# ROC_AUC = roc_auc_score(y_test, y_predictions, average='weighted')
ROC_AUC = roc_auc_score(y_test, model.predict_proba(X_test)[:,1], average='weighted')

print("Accuracy: "+ "{:.2%}".format(accuracy))
print("Recall: "+ "{:.2%}".format(recall))
print("Precision: "+ "{:.2%}".format(precision))
print("F1-Score: "+ "{:.2%}".format(f1s))
print("MCC: "+ "{:.2%}".format(MCC))
print("ROC AUC score: "+ "{:.2%}".format(ROC_AUC))
print("time to train: "+ "{:.2f}".format(end_train-start)+" s")
print("time to predict: "+"{:.2f}".format(end_predict-end_train)+" s")
print("total: "+"{:.2f}".format(end_predict-start)+" s")
model_performance.loc['Decision Tree'] = [accuracy, recall, precision, f1s,MCC,ROC_AUC,end_train-start,end_predict-end_train,end_predict-start]

In [None]:
plt.rcParams['figure.figsize']=5,5 
sns.set_style("white")
plot_confusion_matrix(model, X_test, y_test, cmap=plt.cm.Blues)  
plt.show()

In [None]:
plt.rcParams['figure.figsize']=10,10
sns.set_style("white")
feat_importances = pd.Series(model.feature_importances_, index=feature_names)
# feat_importances = pd.Series(model.feature_importances_,)
feat_importances = feat_importances.groupby(level=0).mean()
feat_importances.nlargest(20).plot(kind='barh').invert_yaxis()
sns.despine()
plt.show()

## Extra Trees

In [None]:
%%time
from sklearn.ensemble import ExtraTreesClassifier
start = time.time()
model = ExtraTreesClassifier(n_estimators=500,random_state=0,n_jobs=-1).fit(X_train,y_train)
# model = ExtraTreesClassifier(max_depth=40,random_state=0,n_estimators=100,n_jobs=10).fit(X_train,y_train)
end_train = time.time()
y_predictions = model.predict(X_test) # These are the predictions from the test data.
end_predict = time.time()

In [None]:
accuracy = accuracy_score(y_test, y_predictions)
recall = recall_score(y_test, y_predictions, average='weighted')
precision = precision_score(y_test, y_predictions, average='weighted')
f1s = f1_score(y_test, y_predictions, average='weighted')
MCC = matthews_corrcoef(y_test, y_predictions)
# ROC_AUC = roc_auc_score(y_test, y_predictions, average='weighted')
ROC_AUC = roc_auc_score(y_test, model.predict_proba(X_test)[:,1], average='weighted')

print("Accuracy: "+ "{:.2%}".format(accuracy))
print("Recall: "+ "{:.2%}".format(recall))
print("Precision: "+ "{:.2%}".format(precision))
print("F1-Score: "+ "{:.2%}".format(f1s))
print("MCC: "+ "{:.2%}".format(MCC))
print("ROC AUC score: "+ "{:.2%}".format(ROC_AUC))
print("time to train: "+ "{:.2f}".format(end_train-start)+" s")
print("time to predict: "+"{:.2f}".format(end_predict-end_train)+" s")
print("total: "+"{:.2f}".format(end_predict-start)+" s")
model_performance.loc['Extra Trees'] = [accuracy, recall, precision, f1s,MCC,ROC_AUC,end_train-start,end_predict-end_train,end_predict-start]

In [None]:
plt.rcParams['figure.figsize']=5,5 
sns.set_style("white")
plot_confusion_matrix(model, X_test, y_test, cmap=plt.cm.Blues)  
plt.show()

In [None]:
plt.rcParams['figure.figsize']=10,10
sns.set_style("white")
feat_importances = pd.Series(model.feature_importances_, index=feature_names)
# feat_importances = pd.Series(model.feature_importances_,)
feat_importances = feat_importances.groupby(level=0).mean()
feat_importances.nlargest(20).plot(kind='barh').invert_yaxis()
sns.despine()
plt.show()

## Random Forest

In [None]:
%%time
from sklearn.ensemble import RandomForestClassifier
start = time.time()
model = RandomForestClassifier(n_estimators = 500,n_jobs=-1,random_state=0,bootstrap=True,).fit(X_train,y_train)
end_train = time.time()
y_predictions = model.predict(X_test) # These are the predictions from the test data.
end_predict = time.time()

In [None]:
accuracy = accuracy_score(y_test, y_predictions)
recall = recall_score(y_test, y_predictions, average='weighted')
precision = precision_score(y_test, y_predictions, average='weighted')
f1s = f1_score(y_test, y_predictions, average='weighted')
MCC = matthews_corrcoef(y_test, y_predictions)
# ROC_AUC = roc_auc_score(y_test, y_predictions, average='weighted')
ROC_AUC = roc_auc_score(y_test, model.predict_proba(X_test)[:,1], average='weighted')

print("Accuracy: "+ "{:.2%}".format(accuracy))
print("Recall: "+ "{:.2%}".format(recall))
print("Precision: "+ "{:.2%}".format(precision))
print("F1-Score: "+ "{:.2%}".format(f1s))
print("MCC: "+ "{:.2%}".format(MCC))
print("ROC AUC score: "+ "{:.2%}".format(ROC_AUC))
print("time to train: "+ "{:.2f}".format(end_train-start)+" s")
print("time to predict: "+"{:.2f}".format(end_predict-end_train)+" s")
print("total: "+"{:.2f}".format(end_predict-start)+" s")
model_performance.loc['Random Forest'] = [accuracy, recall, precision, f1s,MCC,ROC_AUC,end_train-start,end_predict-end_train,end_predict-start]

In [None]:
plt.rcParams['figure.figsize']=5,5 
sns.set_style("white")
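from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, cmap=plt.cm.Blues)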
plt.show()
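
The four counts in a binary confusion matrix like the one above are all that the Matthews correlation coefficient needs. As a minimal sketch with made-up counts (the function name `mcc_from_counts` and the numbers are assumptions for illustration, not results from this notebook):

```python
import math

def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is taken as 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Made-up counts, purely for illustration.
print(round(mcc_from_counts(tp=3, tn=4, fp=2, fn=1), 4))
```

Unlike accuracy, the value ranges from -1 to 1, with 0 corresponding to chance-level predictions.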

In [None]:
plt.rcParams['figure.figsize']=10,10
sns.set_style("white")
feat_importances = pd.Series(model.feature_importances_, index=feature_names)
# feat_importances = pd.Series(model.feature_importances_,)
feat_importances = feat_importances.groupby(level=0).mean()
feat_importances.nlargest(20).plot(kind='barh').invert_yaxis()
sns.despine()
plt.show()

## Gradient Boosting Classifier

In [None]:
%%time
from sklearn.ensemble import GradientBoostingClassifier
start = time.time()
model = GradientBoostingClassifier(n_estimators=100).fit(X_train,y_train)
end_train = time.time()
y_predictions = model.predict(X_test) # These are the predictions from the test data.
end_predict = time.time()

In [None]:
accuracy = accuracy_score(y_test, y_predictions)
recall = recall_score(y_test, y_predictions, average='weighted')
precision = precision_score(y_test, y_predictions, average='weighted')
f1s = f1_score(y_test, y_predictions, average='weighted')
MCC = matthews_corrcoef(y_test, y_predictions)
# ROC_AUC = roc_auc_score(y_test, y_predictions, average='weighted')
ROC_AUC = roc_auc_score(y_test, model.predict_proba(X_test)[:,1], average='weighted')

print("Accuracy: "+ "{:.2%}".format(accuracy))
print("Recall: "+ "{:.2%}".format(recall))
print("Precision: "+ "{:.2%}".format(precision))
print("F1-Score: "+ "{:.2%}".format(f1s))
print("MCC: "+ "{:.2%}".format(MCC))
print("ROC AUC score: "+ "{:.2%}".format(ROC_AUC))
print("time to train: "+ "{:.2f}".format(end_train-start)+" s")
print("time to predict: "+"{:.2f}".format(end_predict-end_train)+" s")
print("total: "+"{:.2f}".format(end_predict-start)+" s")
model_performance.loc['Gradient Boosting'] = [accuracy, recall, precision, f1s,MCC,ROC_AUC,end_train-start,end_predict-end_train,end_predict-start]

In [None]:
plt.rcParams['figure.figsize']=5,5 
sns.set_style("white")
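from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, cmap=plt.cm.Blues)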
plt.show()

In [None]:
plt.rcParams['figure.figsize']=10,10
sns.set_style("white")
feat_importances = pd.Series(model.feature_importances_, index=feature_names)
# feat_importances = pd.Series(model.feature_importances_,)
feat_importances = feat_importances.groupby(level=0).mean()
feat_importances.nlargest(20).plot(kind='barh').invert_yaxis()
sns.despine()
plt.show()

## NN MLP

In [None]:
%%time
from sklearn.neural_network import MLPClassifier
start = time.time()
model = MLPClassifier(hidden_layer_sizes = (20,20,), 
                      activation='relu', 
                      solver='adam',
                      max_iter=200,
                      verbose=0).fit(X_train,y_train)
end_train = time.time()
y_predictions = model.predict(X_test) # These are the predictions from the test data.
end_predict = time.time()

In [None]:
accuracy = accuracy_score(y_test, y_predictions)
recall = recall_score(y_test, y_predictions, average='weighted')
precision = precision_score(y_test, y_predictions, average='weighted')
f1s = f1_score(y_test, y_predictions, average='weighted')
MCC = matthews_corrcoef(y_test, y_predictions)
# ROC_AUC = roc_auc_score(y_test, y_predictions, average='weighted')
ROC_AUC = roc_auc_score(y_test, model.predict_proba(X_test)[:,1], average='weighted')

print("Accuracy: "+ "{:.2%}".format(accuracy))
print("Recall: "+ "{:.2%}".format(recall))
print("Precision: "+ "{:.2%}".format(precision))
print("F1-Score: "+ "{:.2%}".format(f1s))
print("MCC: "+ "{:.2%}".format(MCC))
print("ROC AUC score: "+ "{:.2%}".format(ROC_AUC))
print("time to train: "+ "{:.2f}".format(end_train-start)+" s")
print("time to predict: "+"{:.2f}".format(end_predict-end_train)+" s")
print("total: "+"{:.2f}".format(end_predict-start)+" s")
model_performance.loc['NN MLP'] = [accuracy, recall, precision, f1s,MCC,ROC_AUC,end_train-start,end_predict-end_train,end_predict-start]

In [None]:
plt.rcParams['figure.figsize']=5,5 
sns.set_style("white")
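from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, cmap=plt.cm.Blues)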
plt.show()

In [None]:
model_performance.style.background_gradient(cmap='coolwarm').format({'Accuracy': '{:.2%}',
                                                                     'Precision': '{:.2%}',
                                                                     'Recall': '{:.2%}',
                                                                     'F1-Score': '{:.2%}',
                                                                     'time to train':'{:.1f}',
                                                                     'time to predict':'{:.1f}',
                                                                     'total time':'{:.1f}',
                                                                     })

## NN MLP (Keras)

In [None]:
#Import libraries that will allow you to use keras
from keras.models import Sequential
from keras.layers import Dense, LSTM, GRU
from keras import metrics
!pip install keras-metrics
import keras_metrics as km #when compiling
import keras
import numpy as np
from numpy import array

In [None]:
from keras import backend as K

def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))
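
Keras evaluates function metrics such as `f1_m` on each batch and averages the results, so the reported value can differ from an F1 computed once over all test samples. A minimal pure-Python sketch of that effect, using made-up labels and a hypothetical helper `f1_from_labels` (neither comes from this notebook):

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class from 0/1 labels (0.0 when undefined)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up labels, split into two batches of four samples each.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]
batches = [(y_true[:4], y_pred[:4]), (y_true[4:], y_pred[4:])]

# Mean of the per-batch F1 values versus a single F1 over all samples.
per_batch = [f1_from_labels(t, p) for t, p in batches]
batch_mean_f1 = sum(per_batch) / len(per_batch)
global_f1 = f1_from_labels(y_true, y_pred)

print(batch_mean_f1, global_f1)
```

For that reason the Keras precision, recall and F1 values are best read as approximations when compared with the scikit-learn rows.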

In [None]:
#Build the feed forward neural network model
def build_model():
    model = Sequential()
    model.add(Dense(20, activation='relu'))
    model.add(Dense(20, activation='relu'))
    model.add(Dense(1, activation='sigmoid')) # single sigmoid unit for binary classification
    #Compile the model
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy',f1_m,precision_m, recall_m]
                 )
    return model

# instantiate the model
model = build_model()

#fit the model
start = time.time()
model.fit(X_train, y_train, epochs=200, batch_size=200,verbose=2)
end_train = time.time()

#Evaluate the neural network
loss, accuracy, f1s, precision, recall = model.evaluate(X_test, y_test)
end_predict = time.time()
print(" ")
# model_performance.loc['MLP (Keras)'] = [accuracy, accuracy, accuracy, accuracy,end_train-start,end_predict-end_train,end_predict-start]
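# MCC and ROC AUC are not tracked by Keras here, so compute them from the test predictions;
# otherwise the values left over from the previous model would be stored.
y_prob = model.predict(X_test).ravel()
MCC = matthews_corrcoef(y_test, (y_prob > 0.5).astype(int))
ROC_AUC = roc_auc_score(y_test, y_prob)
model_performance.loc['MLP (Keras)'] = [accuracy, recall, precision, f1s,MCC,ROC_AUC,end_train-start,end_predict-end_train,end_predict-start]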

In [None]:
# from sklearn.metrics import ConfusionMatrixDisplay
# from sklearn.metrics import confusion_matrix
# y_pred = model.predict(X_test)
# cm = confusion_matrix(y_test, y_pred)
# disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
# disp.plot(cmap=plt.cm.Blues)
# plt.show()

## GRU (Keras)

In [None]:
#Build the neural network model
def build_model():
    model = Sequential()
    model.add(GRU(20, return_sequences=True,input_shape=(1,len(X.columns))))
    model.add(GRU(20)) # the last recurrent layer returns only its final state: one vector per sample
    model.add(Dense(2, activation='softmax')) # two softmax units, since the target is binary
    #Compile the model
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam',
                  # metrics=['accuracy',f1_m,precision_m, recall_m]
                  metrics=['accuracy']
                 )
    return model

# The GRU input layer must be 3D.
# The three input dimensions are: samples, time steps, and features.
# Reshape the training inputs to (samples, 1, features).
X_train_array = array(X_train) # array was imported from numpy in an earlier cell
print(len(X_train_array))
X_train_reshaped = X_train_array.reshape(X_train_array.shape[0],1,len(X.columns))

# reshape the test inputs the same way
X_test_array=  array(X_test)
X_test_reshaped = X_test_array.reshape(X_test_array.shape[0],1,len(X.columns)) 


# instantiate the model
model = build_model()

start = time.time()
#fit the model
model.fit(X_train_reshaped, y_train, epochs=200, batch_size=200,verbose=2)
end_train = time.time()

loss, accuracy = model.evaluate(X_test_reshaped, y_test)
# loss, accuracy, f1s, precision, recall = model.evaluate(X_test_reshaped, y_test)
end_predict = time.time()
print(' ')
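# Derive the remaining metrics from the test predictions instead of reusing accuracy
# and the MCC / ROC AUC values left over from an earlier model.
y_prob = model.predict(X_test_reshaped)
y_predictions = np.argmax(y_prob, axis=-1).ravel()
recall = recall_score(y_test, y_predictions, average='weighted')
precision = precision_score(y_test, y_predictions, average='weighted')
f1s = f1_score(y_test, y_predictions, average='weighted')
MCC = matthews_corrcoef(y_test, y_predictions)
ROC_AUC = roc_auc_score(y_test, y_prob[..., 1].ravel())
model_performance.loc['GRU (Keras)'] = [accuracy, recall, precision, f1s, MCC,ROC_AUC,end_train-start,end_predict-end_train,end_predict-start]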

In [None]:
np.shape(X)
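
The reshape used for the recurrent layers turns a 2D feature table into a 3D array of shape (samples, time steps, features) with a single time step. Here is a minimal sketch on toy data; `X_toy` and its size are made up for illustration and stand in for the real feature matrix.

```python
import numpy as np

# Hypothetical toy data: 4 samples with 3 features each.
X_toy = np.arange(12, dtype=float).reshape(4, 3)

# Add a time axis of length 1: (samples, features) -> (samples, 1, features).
X_toy_3d = X_toy.reshape(X_toy.shape[0], 1, X_toy.shape[1])

print(X_toy.shape)
print(X_toy_3d.shape)

# The values are unchanged; each sample is now a "sequence" of one step.
print(np.array_equal(X_toy_3d[:, 0, :], X_toy))
```

With only one time step per sample, the GRU and LSTM layers have no sequence to carry state across, so they are not expected to gain much over a dense network on this tabular data.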

## LSTM (Keras)

In [None]:
def build_model():
    model = Sequential()
    model.add(LSTM(20, return_sequences=True,input_shape=(1,len(X.columns))))
    model.add(LSTM(20)) # the last recurrent layer returns only its final state: one vector per sample
    model.add(Dense(2, activation='softmax')) # two softmax units, since the target is binary
    #Compile the model
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam',
                  # metrics=['accuracy',f1_m,precision_m, recall_m]
                  metrics=['accuracy']
                 )
    return model

# The LSTM input layer must be 3D.
# The three input dimensions are: samples, time steps, and features.
# Reshape the training inputs to (samples, 1, features).
X_train_array = array(X_train) # array was imported from numpy in an earlier cell
print(len(X_train_array))
X_train_reshaped = X_train_array.reshape(X_train_array.shape[0],1,len(X.columns))

# reshape the test inputs the same way
X_test_array=  array(X_test)
X_test_reshaped = X_test_array.reshape(X_test_array.shape[0],1,len(X.columns)) 


# instantiate the model
model = build_model()


#fit the model
start = time.time()
model.fit(X_train_reshaped, y_train, epochs=200, batch_size=200,verbose=2)
end_train = time.time()

#Evaluate the neural network
loss, accuracy = model.evaluate(X_test_reshaped, y_test)
# loss, accuracy, f1s, precision, recall = model.evaluate(X_test_reshaped, y_test)
end_predict = time.time()
print(" ")
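# Derive the remaining metrics from the test predictions instead of reusing accuracy
# and the MCC / ROC AUC values left over from an earlier model.
y_prob = model.predict(X_test_reshaped)
y_predictions = np.argmax(y_prob, axis=-1).ravel()
recall = recall_score(y_test, y_predictions, average='weighted')
precision = precision_score(y_test, y_predictions, average='weighted')
f1s = f1_score(y_test, y_predictions, average='weighted')
MCC = matthews_corrcoef(y_test, y_predictions)
ROC_AUC = roc_auc_score(y_test, y_prob[..., 1].ravel())
model_performance.loc['LSTM (Keras)'] = [accuracy, recall, precision, f1s, MCC,ROC_AUC,end_train-start,end_predict-end_train,end_predict-start]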

In [None]:
# pred = model.predict(
#    X_test_reshaped, 
#    batch_size = None, 
#    verbose = 0, 
#    steps = None, 
#    callbacks = None, 
#    max_queue_size = 10, 
#    workers = 1, 
#    use_multiprocessing = False

    
# pred = model.predict(X_test_reshaped) 
# pred = np.argmax(pred, axis = 1)[:5] 
# # label = np.argmax(y_test,axis = 1)[0] 

# print(pred) 
# # print(label)

# Evaluate

This section compares the models to determine which gives the best performance. The winner appears to be the Extra Trees classifier, which performs well on both speed and prediction quality.

The MLP takes much longer to train in Keras than in scikit-learn. I don't think the verbosity of the output could have such a big impact, so it is unclear why Keras is underperforming.
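
One possible contributor, offered as an assumption rather than a measured finding: scikit-learn's `MLPClassifier` can stop before `max_iter` once the training loss stops improving (its `tol` and `n_iter_no_change` parameters default to `1e-4` and `10`), whereas the Keras `fit` call above always runs all 200 epochs. The sketch below uses a simplified plateau rule on a synthetic loss curve to show how large that gap can be. The helper `epochs_run` and the loss values are hypothetical and do not reproduce scikit-learn's exact implementation.

```python
def epochs_run(losses, tol=1e-4, patience=10):
    """Return how many epochs run before a simplified plateau rule stops training.

    Training stops once the loss has failed to beat the best loss so far
    by more than `tol` for `patience` consecutive epochs.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - tol:
            stale = 0          # meaningful improvement: reset the counter
        else:
            stale += 1         # no meaningful improvement this epoch
        best = min(best, loss)
        if stale >= patience:
            return epoch
    return len(losses)

# Synthetic loss curve: improves for 30 epochs, then stays flat for 170 more.
losses = [1.0 / (e + 1) for e in range(30)] + [1.0 / 30] * 170

print(epochs_run(losses))  # epochs used with the plateau rule
print(len(losses))         # epochs used with a fixed schedule
```

Checking `model.n_iter_` on the fitted scikit-learn MLP would show how many epochs it actually ran, which makes the two training times easier to compare.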

In [None]:
# model_performance

In [None]:
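# Caution: any metric that is still missing is filled with a 0.90 placeholder, not a measured value.
model_performance.fillna(.90,inplace=True)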
model_performance.style.background_gradient(cmap='coolwarm').format({'Accuracy': '{:.2%}',
                                                                     'Precision': '{:.2%}',
                                                                     'Recall': '{:.2%}',
                                                                     'F1-Score': '{:.2%}',
                                                                     'MCC score':'{:.2%}',
                                                                     'ROC AUC':'{:.2%}',
                                                                     'time to train':'{:.2f}',
                                                                     'time to predict':'{:.2f}',
                                                                     'total time':'{:.2f}',
                                                                     })