<h1><center> <span style="color:DarkBlue;">INTRODUCTION</span></center></h1>

<p>It is the year 2912. We've received a transmission from four lightyears away and things aren't looking good.</p>
    
    
<p>The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.</p>

<p>While rounding Alpha Centauri en route to its first destination, the torrid 55 Cancri e, the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a fate similar to that of its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!</p>

<p>To help rescue crews retrieve the lost passengers, we predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.</p>

<center><img src= "https://as2.ftcdn.net/v2/jpg/02/24/24/23/1000_F_224242336_f7ekk6EoCmz5061Si58wEWUqizXAEUJk.jpg" alt ="Titanic" style='width: 800px;'></center>

<h1><center> <span style="color:DarkBlue;">IMPORT PACKAGES</span></center></h1>

In [None]:
import numpy as np
import pandas as pd
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from sklearn.impute import SimpleImputer
import seaborn as sns
from sklearn import preprocessing
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<h1><center> <span style="color:DarkBlue;">LOAD DATASET</span></center></h1>

*We read the two data files into the Kaggle environment.*

In [None]:
test = pd.read_csv("../input/spaceship-titanic/test.csv")
train = pd.read_csv("../input/spaceship-titanic/train.csv")

<h1><center> <span style="color:DarkBlue;">EXPLORING AND PREPARING THE DATASET</span></center></h1>

<h3><span style="color:purple;">Explore the datatypes of all the features</span></h3>

In [None]:
train.dtypes

*The feature types in the dataset are not uniform and need to be converted to be usable. Categorical features need to be label encoded, and the Age variable needs to be binned into categories to make our analysis easier.*

<h3><span style="color:purple;">Basic statistics about the Test and Train Data</span></h3>

*We compute summary statistics for all the numeric features of the dataset. These statistics help identify skewness and outliers.*
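
*As a rough illustration of how summary statistics can flag skewness and outliers, the sketch below uses a made-up spend column (the values are hypothetical, not taken from the competition data). The 1.5 × IQR rule is a common convention assumed here for illustration; it is not a step this notebook applies.*

```python
import pandas as pd

# Hypothetical spend values: mostly small, with one very large amount.
spend = pd.Series([0, 0, 5, 10, 12, 15, 20, 25, 30, 5000], name="Spend")

# A mean far above the median hints at a right-skewed distribution.
mean_value = spend.mean()
median_value = spend.median()
skewness = spend.skew()

# 1.5 x IQR rule: flag values far outside the interquartile range.
q1, q3 = spend.quantile(0.25), spend.quantile(0.75)
iqr = q3 - q1
outliers = spend[(spend < q1 - 1.5 * iqr) | (spend > q3 + 1.5 * iqr)]

print(f"mean={mean_value:.1f}, median={median_value:.1f}, skew={skewness:.2f}")
print("Outliers:", outliers.tolist())
```

*The same comparison of mean against median (the `50%` row) can be read directly from the `describe()` output.*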

In [None]:
print("Test statistics: \n", test.describe())
print("\n")
print("Train statistics: \n", train.describe())

<h3><span style="color:purple;">Explore Missingness in Data</span></h3>

*Missing values in the data need to be dealt with before performing analysis. If this step is skipped, the predictions could be less reliable and biased. The imputation methods used for the missing values are discussed later in this notebook.*

In [None]:
missing_test = pd.DataFrame(test.isna().sum())
missing_test.sort_values(by=0, ascending=False)

missing_train = pd.DataFrame(train.isna().sum())
missing_train.sort_values(by=0, ascending=False)

In [None]:
fig = make_subplots(rows=1, cols=2)

# One colour index per bar; the test data has one column fewer than the train data.
fig.add_trace(go.Bar(y=missing_train[0], x=missing_train.index,
                    marker=dict(color=list(range(len(missing_train))),
                                coloraxis="coloraxis")),
              1, 1)

fig.add_trace(go.Bar(y=missing_test[0], x=missing_test.index,
                    marker=dict(color=list(range(len(missing_test))),
                                coloraxis="coloraxis")),
              1, 2)

fig.update_layout(coloraxis=dict(colorscale='Bluered_r'), showlegend=False, title_text="Features' Null Value Distribution in Train and Test Data", title_x=0.5)
fig.show()

*Plotting the missing values shows that "PassengerId" and "Transported" in the train data have no missing values, whereas "CryoSleep" has the largest number of missing values.*

*Similarly, in the test dataset "PassengerId" has no missing values, and there is no "Transported" feature. "FoodCourt" has the largest number of missing values in this dataset.*

<h3><span style="color:purple;">Handle Missing Values In The Dataset, And Prepare It For Modelling</span></h3>

*The first step in dealing with missing values is to give them all a standard notation. Here, we replace empty and whitespace-only entries with NaN.*

In [None]:
# np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0.
train = train.replace(['', ' '], np.nan)
test = test.replace(['', ' '], np.nan)

<h3><span style="color:purple;">Deal With Missing Values, Using Appropriate Imputations</span></h3>

MISSING DATA IMPUTATION FOR EACH FEATURE:

* 'HomePlanet' is set to the most common planet in the HomePlanet feature.
* 'CryoSleep' is set to False.
* 'Cabin' is set to the value from the previous record, because passengers tend to appear in groups and families are very likely to share the same or a nearby cabin.
* 'Destination' is set to the most frequently occurring destination in the Destination feature.
* 'Age' is set to the mean age.
* 'VIP' is set to False, because more than 90% of the records are not VIP.
* 'RoomService' spend is set to 0.
* 'FoodCourt' spend is set to 0.
* 'ShoppingMall' spend is set to 0.
* 'Spa' spend is set to 0.
* 'VRDeck' spend is set to 0.
* 'Name' is set to "Mr. XXXX" as the default value.



*Having decided how to treat the missing values in each feature, we create imputers. Three imputation strategies are used for this dataset.*
1. Mean Imputation
    The mean of the whole feature is calculated and substituted wherever values are missing. This can be performed only on numerical features.
2. Constant Imputation
    A constant value is imputed at every missing position. This can be the default value of the feature or a Boolean value.
3. Most Frequent Value Imputation
    This is one of the most common imputation methods: the most frequently occurring value is imputed at every missing position.
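
*The three strategies can be tried on a tiny frame before applying them to the real data. The sketch below is only an illustration: the `toy` frame and its values are hypothetical, and each imputer is fitted and applied to the same column.*

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one gap per column.
toy = pd.DataFrame({
    "Age": [20.0, 40.0, np.nan],
    "Spa": [0.0, np.nan, 30.0],
    "HomePlanet": pd.Series(["Earth", "Earth", np.nan], dtype=object),
})

mean_imp = SimpleImputer(strategy="mean")
const_imp = SimpleImputer(strategy="constant", fill_value=0)
freq_imp = SimpleImputer(strategy="most_frequent")

# fit_transform returns a 2-D array, so assign back to a 2-D column selection.
toy[["Age"]] = mean_imp.fit_transform(toy[["Age"]])
toy[["Spa"]] = const_imp.fit_transform(toy[["Spa"]])
toy[["HomePlanet"]] = freq_imp.fit_transform(toy[["HomePlanet"]])

print(toy)
```

*The missing Age becomes the mean of the observed ages, the missing Spa spend becomes 0, and the missing HomePlanet becomes the most frequent planet.*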

In [None]:
impmean = SimpleImputer(strategy='mean', missing_values=np.nan)
impcomm = SimpleImputer(strategy='most_frequent', missing_values=np.nan)
impconst0 = SimpleImputer(strategy='constant', missing_values=np.nan, fill_value=0)
impconstf = SimpleImputer(strategy='constant', missing_values=np.nan, fill_value= False)
impconstx = SimpleImputer(strategy='constant', missing_values=np.nan, fill_value="Mr. XXXX")

In [None]:
impmean = impmean.fit(train[['Age']])
train[['Age']] = impmean.transform(train[['Age']])
impmean = impmean.fit(test[['Age']])
test[['Age']] = impmean.transform(test[['Age']])

impcomm = impcomm.fit(train[['HomePlanet', 'Destination']])
train[['HomePlanet', 'Destination']] = impcomm.transform(train[['HomePlanet', 'Destination']])
impcomm = impcomm.fit(test[['HomePlanet', 'Destination']])
test[['HomePlanet', 'Destination']] = impcomm.transform(test[['HomePlanet', 'Destination']])


impconstf = impconstf.fit(train[['CryoSleep', 'VIP']])
train[['CryoSleep', 'VIP']] = impconstf.transform(train[['CryoSleep', 'VIP']])
impconstf = impconstf.fit(test[['CryoSleep', 'VIP']])
test[['CryoSleep', 'VIP']] = impconstf.transform(test[['CryoSleep', 'VIP']])

impconst0 = impconst0.fit(train[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']])
train[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']] = impconst0.transform(train[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']])
impconst0 = impconst0.fit(test[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']])
test[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']] = impconst0.transform(test[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']])


impconstx = impconstx.fit(train[['Name']])
train[['Name']] = impconstx.transform(train[['Name']])
impconstx = impconstx.fit(test[['Name']])
test[['Name']] = impconstx.transform(test[['Name']])

# Forward-fill Cabin from the previous record (fillna(method='ffill') is deprecated).
train[['Cabin']] = train[['Cabin']].ffill()
test[['Cabin']] = test[['Cabin']].ffill()

<h1><center> <span style="color:DarkBlue;">EXPLORATORY DATA ANALYTICS</span></center></h1>

<h3><span style="color:purple;">Check Feature Correlation</span></h3>

*Feature correlation is checked to identify how strongly the features in the dataset are related. Strongly correlated features carry redundant information, which can bias predictions and, in some cases, contribute to overfitting, so it is important to detect and handle them.*

In [None]:
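# numeric_only=True restricts the correlation to numeric and boolean columns;
# recent pandas versions raise an error on text columns otherwise.
corr1 = train.corr(method="pearson", numeric_only=True)

fig, ax = plt.subplots(1, 2, figsize=(15, 6))
c1 = sns.heatmap(corr1, annot=True, linewidths=.5, ax=ax[0])
c1.set_title('Train Features Correlation')

corr2 = test.corr(method="pearson", numeric_only=True)
c2 = sns.heatmap(corr2, annot=True, linewidths=.5, ax=ax[1])
c2.set_title('Test Features Correlation')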

*The correlations show that there is no strong correlation between any of the features of the dataset. There is, however, a small correlation between the different spends ('RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'). There is also a small correlation between the spends and being Transported.*

*The strongest negative correlation in the train dataset is between 'RoomService' and 'Transported', whereas the strongest positive correlations are between 'Spa'-'FoodCourt' and 'VRDeck'-'FoodCourt'.*

*The strongest positive correlation in the test data is between 'FoodCourt'-'VRDeck'. The strongest negative correlation is between 'VRDeck'-'RoomService'.*
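
*Rather than reading the strongest pairs off a heatmap by eye, they can be extracted from the correlation matrix. The sketch below assumes a small hypothetical frame (the spend values are invented) and keeps only the upper triangle so that each pair is listed once.*

```python
import numpy as np
import pandas as pd

# Hypothetical spend columns; the numbers are illustrative only.
toy = pd.DataFrame({
    "RoomService": [0, 10, 20, 30, 40],
    "FoodCourt":   [5, 4, 3, 2, 1],
    "Spa":         [1, 3, 2, 5, 4],
})

corr = toy.corr(method="pearson")

# Keep the upper triangle only: each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

strongest_negative = pairs.idxmin()
strongest_positive = pairs.idxmax()

print(pairs.sort_values())
print("Most negative:", strongest_negative)
print("Most positive:", strongest_positive)
```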

In [None]:
bins = [0, 12, 18, 32, 60, 120]

labels = ["Child", "Teen" , "Young Adult", "Adult", "Old"]
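# include_lowest=True keeps passengers aged 0 in the "Child" bin instead of NaN.
train['Age_Cat'] = pd.cut(train['Age'], bins=bins, labels=labels, include_lowest=True)
test['Age_Cat'] = pd.cut(test['Age'], bins=bins, labels=labels, include_lowest=True)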

train.drop(columns=['Age', 'Name', 'Cabin'], inplace=True)
test.drop(columns=['Age', 'Name', 'Cabin'], inplace=True)

*Since Age is a continuous numerical variable, it is hard to analyse directly. Hence we categorise the feature by binning. Here, we have categorised age into five categories (each range includes its upper bound):*
1. Child (0-12 years)
2. Teen (12-18 years)
3. Young Adult (18-32 years)
4. Adult (32-60 years)
5. Old (over 60 years)

*This lets us use Age more effectively in our analysis.*
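
*To see how values on the bin edges are treated, the sketch below runs `pd.cut` on a few hypothetical ages. By default each interval excludes its left edge, so an age of exactly 0 falls outside the first bin unless `include_lowest=True` is passed.*

```python
import pandas as pd

bins = [0, 12, 18, 32, 60, 120]
labels = ["Child", "Teen", "Young Adult", "Adult", "Old"]

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([0, 12, 13, 18, 19, 61])

# Default behaviour: intervals are (left, right], so age 0 matches no bin.
default_cut = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 becomes "Child".
inclusive_cut = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"Age": ages, "default": default_cut, "include_lowest": inclusive_cut}))
```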

<h3><span style="color:purple;">Check Age Distribution of Passengers</span></h3>

In [None]:
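# value_counts() sorts by frequency, so put the counts in the same order as `labels`.
values = train['Age_Cat'].astype(str).value_counts().reindex(labels, fill_value=0).values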

fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.4)])
fig.update_layout(margin=dict(t=50, b=0, l=0, r=0), title_text="Passengers' Age Distribution", title_x=0.3)
fig.show()

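*The donut chart shows each age group's share of the passengers. Reading it relies on the slice labels matching the order of the counts: `value_counts()` sorts by frequency, not by age group. The previously reported figures of 77.6% of passengers under 18 and 2.58% over 60 were obtained without that alignment, so they are best treated as unverified. If younger passengers do dominate, that could help explain the high spends on ('RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'), but this needs checking against the aligned shares.*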
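
*The ordering issue is easy to reproduce on a small example. The sketch below uses a hypothetical series of age categories and shows that `value_counts()` returns counts by frequency, while `reindex(labels)` restores the label order expected by the pie chart.*

```python
import pandas as pd

labels = ["Child", "Teen", "Young Adult", "Adult", "Old"]

# Hypothetical age categories; "Young Adult" is deliberately the most common.
age_cat = pd.Series(
    ["Young Adult", "Young Adult", "Young Adult", "Adult", "Adult", "Child"]
)

# value_counts() sorts by frequency, so its order need not match `labels`.
by_frequency = age_cat.value_counts()

# reindex(labels) puts the counts in label order; absent groups become 0.
by_label = by_frequency.reindex(labels, fill_value=0)

print(by_frequency.index.tolist())
print(by_label.tolist())
```

*Passing `by_frequency.values` together with `labels` would attach the "Child" label to the "Young Adult" count, which is why the aligned `by_label` values are the safer input for a chart.*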

<h3><span style="color:purple;">Check Age Distribution of VIP Passengers</span></h3>

In [None]:
vip_train = train[train['VIP'] == True]

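# value_counts() sorts by frequency, so put the counts in the same order as `labels`.
values = vip_train['Age_Cat'].astype(str).value_counts().reindex(labels, fill_value=0).values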

fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.4)])
fig.update_layout(margin=dict(t=40, b=0, l=0, r=0), title_text="VIP Passengers' Age Distribution", title_x=0.5)
fig.show()

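*The chart shows the age mix of the VIP passengers. As with the previous chart, the slice labels are only meaningful when the counts are in label order; the earlier reading that 95% of the VIPs are under 18 was made without that alignment and is best treated as unverified.*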

<h3><span style="color:purple;">Check Spending Levels of VIP vs Normal Passengers</span></h3>

In [None]:
rs_t_val = train['RoomService'].sum()/train.shape[0]
rs_v_val = vip_train['RoomService'].sum()/vip_train.shape[0]

fc_t_val = train['FoodCourt'].sum()/train.shape[0]
fc_v_val = vip_train['FoodCourt'].sum()/vip_train.shape[0]

sm_t_val = train['ShoppingMall'].sum()/train.shape[0]
sm_v_val = vip_train['ShoppingMall'].sum()/vip_train.shape[0]

s_t_val = train['Spa'].sum()/train.shape[0]
s_v_val = vip_train['Spa'].sum()/vip_train.shape[0]

vr_t_val = train['VRDeck'].sum()/train.shape[0]
vr_v_val = vip_train['VRDeck'].sum()/vip_train.shape[0]

In [None]:
fig = make_subplots(rows=2, cols=3, vertical_spacing = 0.1 ,subplot_titles=('Room Service','Food Court', 'Shopping Mall', 'Spa', 'VR Deck'), y_title="Average Amount Spent Per Person")


fig.add_trace(go.Bar(y=[rs_t_val, rs_v_val], x=["Normal", "VIP"],
                    marker=dict(color=[4, 7], coloraxis="coloraxis")),
              1, 1)

fig.add_trace(go.Bar(y=[fc_t_val, fc_v_val], x=["Normal", "VIP"],
                    marker=dict(color=[2, 3, 5], coloraxis="coloraxis")),
              1, 2)

fig.add_trace(go.Bar(y=[sm_t_val, sm_v_val], x=["Normal", "VIP"],
                    marker=dict(color=[4, 7], coloraxis="coloraxis")),
              1, 3)

fig.add_trace(go.Bar(y=[s_t_val, s_v_val], x=["Normal", "VIP"],
                    marker=dict(color=[2, 3, 5], coloraxis="coloraxis")),
              2, 1)

fig.add_trace(go.Bar(y=[vr_t_val, vr_v_val], x=["Normal", "VIP"],
                    marker=dict(color=[4, 7], coloraxis="coloraxis")),
              2, 2)

fig.update_layout(coloraxis=dict(colorscale='Bluered_r'), showlegend=False, height=800, width=900)
fig.show()

*The comparison of spends between VIP and Normal passengers is interesting (note that the "Normal" bars are averages over all passengers in the train data). Spending on shopping is very similar for both classes of passengers. Spending on the VR Deck differs sharply: VIP passengers spend about 1200 on average, whereas Normal passengers spend about 300.*

<h3><span style="color:purple;">Transportation of Normal vs VIP Percentage wise</span></h3>

In [None]:
normal_trans = train.loc[(train['VIP'] == False) & (train['Transported'] == True)].shape[0]
normal_ntrans = train.loc[(train['VIP'] == False) & (train['Transported'] == False)].shape[0]
vip_trans = train.loc[(train['VIP'] == True) & (train['Transported'] == True)].shape[0]
vip_ntrans = train.loc[(train['VIP'] == True) & (train['Transported'] == False)].shape[0]

In [None]:
fig = make_subplots(rows=1, cols=2, vertical_spacing = 0.1 ,subplot_titles=('NORMAL CLASS', 'VIP'), y_title="Number of Passengers")

fig.add_trace(go.Bar(y=[normal_trans, normal_ntrans], x=["Transported", "Not Transported"],
                    marker=dict(color=[4, 7], coloraxis="coloraxis")),
              1, 1)

fig.add_trace(go.Bar(y=[vip_trans, vip_ntrans], x=["Transported", "Not Transported"],
                    marker=dict(color=[2, 3, 5], coloraxis="coloraxis")),
              1, 2)

fig.update_layout(coloraxis=dict(colorscale='Bluered_r'),margin=dict(t=40, b=0, l=0, r=0), showlegend=False, height=400, width=500)
fig.show()

*The column chart shows a roughly 50%-50% split between transported and not-transported normal passengers, whereas among VIPs fewer passengers were transported than not. From this we can infer that VIP status appears to have conferred no priority or advantage.*

<h3><span style="color:purple;">HomePlanet Density of all Passengers</span></h3>

In [None]:
# Sum only the Transported column: the number of transported passengers per home planet.
dest_df = train.groupby('HomePlanet')['Transported'].sum()

fig = go.Figure(go.Bar(
            x=dest_df.values,
            y=dest_df.index,
            orientation='h', marker=dict(color=[3, 1, 5], coloraxis="coloraxis")))
fig.update_layout(coloraxis=dict(colorscale='Bluered_r'), title_text="Popular Source Planet", title_x=0.5)
fig.show()

*The bars count the transported passengers from each home planet (the sum of `Transported`), not all passengers. Most of them are from Earth, at a little over 2000, followed by Europa at around 1400 and Mars with the fewest at under 1000.*

<h3><span style="color:purple;">Destination Planet Density of all Passengers</span></h3>

In [None]:
# Sum only the Transported column: the number of transported passengers per destination.
dest_df = train.groupby('Destination')['Transported'].sum()

fig = go.Figure(go.Bar(
            x=dest_df.values,
            y=dest_df.index,
            orientation='h', marker=dict(color=[3, 1, 5], coloraxis="coloraxis")))
fig.update_layout(coloraxis=dict(colorscale='Bluered_r'), title_text="Popular Destination Planet", title_x=0.5)
fig.show()

*Among the transported passengers, the most common destination is 'TRAPPIST-1e', with almost 3000 passengers. The second is '55 Cancri e' at a little over 1000. The least common is 'PSO J318.5-22' with fewer than 500 passengers.*

<h3><span style="color:purple;">Count the Successful and Unsuccessful Destination transportation for all Passengers</span></h3>

In [None]:
train.groupby(['Destination', 'Transported']).count()['PassengerId']

*Passengers whose destination was "55 Cancri e" are more likely to have been transported than passengers travelling to the other planets. The rates of being transported for the three destinations are:*
1. "55 Cancri e"  : 61.0% 
2. "PSO J318.5-22": 50.3%
3. "TRAPPIST-1e"  : 47.2%
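
*Rates like those listed above can be computed directly instead of by hand from the counts. The sketch below uses a small hypothetical frame (not the competition data) and relies on the fact that the mean of a boolean column is the fraction of `True` values.*

```python
import pandas as pd

# Hypothetical records; the notebook itself would use the `train` frame.
toy = pd.DataFrame({
    "Destination": ["55 Cancri e", "55 Cancri e", "TRAPPIST-1e",
                    "TRAPPIST-1e", "TRAPPIST-1e", "TRAPPIST-1e"],
    "Transported": [True, False, True, False, False, False],
})

# Fraction of transported passengers per destination, as a percentage.
rate_pct = toy.groupby("Destination")["Transported"].mean().mul(100).round(1)

print(rate_pct)
```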

<h3><span style="color:purple;">Checking Which Home Planet's transportation was most successful</span></h3>

In [None]:
train.groupby(['HomePlanet', 'Transported']).count()['PassengerId']

*We can see that passengers who started their journey from "Europa" are more likely to have been transported than passengers who started from a different planet.*

<h3><span style="color:purple;">Label Encoding Categorical Features</span></h3>

In [None]:
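label_cols = ["HomePlanet", "CryoSleep", "Destination", "VIP", "Age_Cat"]

for col in label_cols:
    train[col] = train[col].astype(str)
    test[col] = test[col].astype(str)
    # Fit one encoder on the labels from both frames so that the same
    # category always maps to the same integer in train and test.
    le = preprocessing.LabelEncoder()
    le.fit(pd.concat([train[col], test[col]]))
    train[col] = le.transform(train[col])
    test[col] = le.transform(test[col])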
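
*A label encoder assigns integer codes in sorted order of the labels it has seen, so two encoders fitted on different data can give the same category different codes. The sketch below illustrates this with hypothetical columns in which the test split happens to lack "Earth".*

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical columns: the test split happens to lack "Earth".
train_col = pd.Series(["Earth", "Europa", "Mars"])
test_col = pd.Series(["Europa", "Mars"])

# Fitted separately, "Europa" is coded 1 in train but 0 in test.
separate_train = LabelEncoder().fit_transform(train_col)
separate_test = LabelEncoder().fit_transform(test_col)

# Fitting once and reusing the encoder keeps the codes aligned.
shared = LabelEncoder().fit(train_col)
shared_test = shared.transform(test_col)

print(separate_train.tolist(), separate_test.tolist(), shared_test.tolist())
```

*When both frames contain exactly the same set of categories the two approaches agree, but a shared encoder removes the dependence on that coincidence.*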

<h3><span style="color:purple;">Identify the Most successful Route in this Journey</span></h3>

In [None]:
ts_earth = train.groupby(['HomePlanet']).count()['PassengerId'][0]
ts_europa = train.groupby(['HomePlanet']).count()['PassengerId'][1]
ts_mars = train.groupby(['HomePlanet']).count()['PassengerId'][2]

planet_journeys = ['earth_canceri', 'earth_pso', 'earth_trappist', 'europa_canceri', 'europa_pso', 'europa_trappist', 'mars_canceri', 'mars_pso', 'mars_trappist']
success_rates = []

earth_canceri = train.loc[(train['HomePlanet'] == 0) & (train['Destination'] == 0) & (train['Transported'] == True)].shape[0]
earth_canceri_rate = earth_canceri/ts_earth*100
success_rates.append(earth_canceri_rate)

earth_pso = train.loc[(train['HomePlanet'] == 0) & (train['Destination'] == 1) & (train['Transported'] == True)].shape[0]
earth_pso_rate = earth_pso/ts_earth*100
success_rates.append(earth_pso_rate)

earth_trappist = train.loc[(train['HomePlanet'] == 0) & (train['Destination'] == 2) & (train['Transported'] == True)].shape[0]
earth_trappist_rate = earth_trappist/ts_earth*100
success_rates.append(earth_trappist_rate)

europa_canceri = train.loc[(train['HomePlanet'] == 1) & (train['Destination'] == 0) & (train['Transported'] == True)].shape[0]
europa_canceri_rate = europa_canceri/ts_europa*100
success_rates.append(europa_canceri_rate)

europa_pso = train.loc[(train['HomePlanet'] == 1) & (train['Destination'] == 1) & (train['Transported'] == True)].shape[0]
europa_pso_rate = europa_pso/ts_europa*100
success_rates.append(europa_pso_rate)

europa_trappist = train.loc[(train['HomePlanet'] == 1) & (train['Destination'] == 2) & (train['Transported'] == True)].shape[0]
europa_trappist_rate = europa_trappist/ts_europa*100
success_rates.append(europa_trappist_rate)

mars_canceri = train.loc[(train['HomePlanet'] == 2) & (train['Destination'] == 0) & (train['Transported'] == True)].shape[0]
mars_canceri_rate = mars_canceri/ts_mars*100
success_rates.append(mars_canceri_rate)

mars_pso = train.loc[(train['HomePlanet'] == 2) & (train['Destination'] == 1) & (train['Transported'] == True)].shape[0]
mars_pso_rate = mars_pso/ts_mars*100
success_rates.append(mars_pso_rate)

mars_trappist = train.loc[(train['HomePlanet'] == 2) & (train['Destination'] == 2) & (train['Transported'] == True)].shape[0]
mars_trappist_rate = mars_trappist/ts_mars*100
success_rates.append(mars_trappist_rate)

success_df = pd.DataFrame({"Passenger_Route" : planet_journeys, "Success_rate_in_Percentage" : success_rates})
success_df.columns
success_df.sort_values(by = 'Success_rate_in_Percentage', ascending=False, inplace=True)

In [None]:
fig = go.Figure(go.Bar(
            x=success_df['Passenger_Route'],
            y=success_df['Success_rate_in_Percentage'],
            marker=dict(color=[n for n in range(9)], coloraxis="coloraxis")))
fig.update_layout(coloraxis=dict(colorscale='Bluered_r'), title_text="Inter-Planet Transportation Success Rate(%)", title_x=0.5)
fig.show()

*Each rate here is the number of transported passengers on a route divided by all passengers from that route's home planet, so routes with few travellers score low regardless of outcome. On this measure, the journey from Mars to TRAPPIST-1e is the most successful route, at almost 45%. The second most successful route is from Europa to TRAPPIST-1e, at a little over 36%.*

*The least successful route is from Europa to PSO J318.5-22, at 1%, followed by Mars to PSO J318.5-22 at 2%.*
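
*The choice of denominator changes what a route's rate means. The sketch below, on a hypothetical six-passenger frame, computes both the share of a home planet's passengers who took a route and were transported, and the share of the route's own passengers who were transported. The column names `of_planet_pct` and `of_route_pct` are made up for this illustration.*

```python
import pandas as pd

# Hypothetical passengers; planets, routes and outcomes are illustrative only.
toy = pd.DataFrame({
    "HomePlanet":  ["Earth", "Earth", "Earth", "Earth", "Mars", "Mars"],
    "Destination": ["TRAPPIST-1e", "TRAPPIST-1e", "TRAPPIST-1e",
                    "55 Cancri e", "TRAPPIST-1e", "TRAPPIST-1e"],
    "Transported": [True, False, False, True, True, True],
})

# Per route: how many were transported and how many travelled.
counts = (
    toy.groupby(["HomePlanet", "Destination"])["Transported"]
    .agg(transported="sum", on_route="count")
    .reset_index()
)

# Total passengers leaving each home planet, repeated on every route row.
counts["from_planet"] = counts.groupby("HomePlanet")["on_route"].transform("sum")

# Denominator 1: everyone from the home planet.
counts["of_planet_pct"] = counts["transported"] / counts["from_planet"] * 100
# Denominator 2: only the passengers on that route.
counts["of_route_pct"] = counts["transported"] / counts["on_route"] * 100

print(counts)
```

*In this toy data the single Earth passenger bound for 55 Cancri e was transported, so the route scores 100% of its own passengers but only 25% of Earth's passengers.*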

<h1><center> <span style="color:DarkBlue;">MODELLING</span></center></h1>

<h3><span style="color:purple;">Import The Necessary Packages</span></h3>

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import svm, metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

<h3><span style="color:purple;">Split the Data into Test and Train</span></h3>

*For our analysis and interpretation, we split the train data into two parts, train and test. We train our models on the train part and evaluate them on the test part to measure accuracy and determine which model performs better.*

In [None]:
X = train.drop(columns=['Transported'], axis =1 )
y = train['Transported']
X_train , X_test , y_train , y_test = train_test_split(X , y, random_state = 12 ,test_size =0.33)

*We build six popular classification models to predict whether each passenger was transported. The six models are:*
1. Linear SVM (https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)
2. K Neighbours Classifier (https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
3. Naive Bayes (https://scikit-learn.org/stable/modules/naive_bayes.html)
4. Decision Tree (https://scikit-learn.org/stable/modules/tree.html)
5. Random Forest (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
6. Logistic Regression (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
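
*The six model families listed above can also be compared in a single loop. The sketch below is a minimal illustration on synthetic data from `make_classification`, not the Spaceship Titanic features; the `max_iter` and `random_state` settings are assumptions chosen so the example runs reproducibly.*

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook itself uses the prepared `train` frame.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=12)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=12
)

models = {
    "Linear SVM": LinearSVC(max_iter=10000),
    "K Neighbors Classifier": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=12),
    "Random Forest": RandomForestClassifier(random_state=12),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fit every model on the same split and record its hold-out accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.2%}")
```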

In [None]:
model_list = ['Linear SVM', 'K Neighbors Classifier', 'Naive Bayes', 'Decision Tree', 'Random Forest', 'Logistic Regression']
accuracy_list = []

<h3><span style="color:purple;">Linear SVM Classifier</span></h3>

*A linear SVM suits linearly separable data: if a dataset can be split into two classes by a single straight line (a hyperplane in higher dimensions), the data is termed linearly separable, and the classifier that finds that boundary is called a linear SVM classifier.*
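
*As a small illustration of linear separability, the sketch below fits an SVM with a linear kernel to two hypothetical, clearly separated clusters. It also checks which kernel `svm.SVC` uses when none is specified, since that default is not linear.*

```python
from sklearn import svm

# Two hypothetical, clearly separated clusters of points.
X_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_toy = [0, 0, 0, 1, 1, 1]

# kernel="linear" asks SVC for a straight-line (hyperplane) decision boundary.
linear_svc = svm.SVC(kernel="linear").fit(X_toy, y_toy)
train_accuracy = linear_svc.score(X_toy, y_toy)

# With no arguments, SVC uses an RBF kernel rather than a linear one.
default_kernel = svm.SVC().kernel

print(train_accuracy, default_kernel)
```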

In [None]:
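# Note: svm.SVC() defaults to an RBF kernel; pass kernel='linear' (or use
# svm.LinearSVC) for a strictly linear decision boundary.
svc = svm.SVC()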
svc.fit(X_train, y_train)

y_pred_svm = svc.predict(X_test)

accuracy_svm = metrics.accuracy_score(y_test, y_pred_svm)
accuracy_list.append(round(accuracy_svm, 2)*100)
svm_cm = metrics.confusion_matrix(y_test, y_pred_svm)

print("The Accuracy of this model is: ", round(accuracy_svm, 2)*100, "%")

<h3><span style="color:purple;">K Nearest Neighbour Classifier</span></h3>

*It is one of the simplest and most widely used classification algorithms: a new data point is classified according to the classes of its nearest neighbouring data points.*

In [None]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

y_pred_knn = knn.predict(X_test)

accuracy_knn = metrics.accuracy_score(y_test, y_pred_knn)
accuracy_list.append(round(accuracy_knn, 2)*100)
knn_cm = metrics.confusion_matrix(y_test, y_pred_knn)

print("The Accuracy of this model is: ", round(accuracy_knn, 2)*100, "%")

<h3><span style="color:purple;">Naive Bayes Classifier</span></h3>

*A naive Bayes classifier is an algorithm that uses Bayes' theorem to classify objects. Naive Bayes classifiers assume strong, or naive, independence between attributes of data points.*

In [None]:
gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred_nb = gnb.predict(X_test)

accuracy_nb = metrics.accuracy_score(y_test, y_pred_nb)
accuracy_list.append(round(accuracy_nb, 2)*100)
nb_cm = metrics.confusion_matrix(y_test, y_pred_nb)

print("The Accuracy of this model is: ", round(accuracy_nb, 2)*100, "%")

<h3><span style="color:purple;">Decision Tree</span></h3>

*Decision Trees are a supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.*

In [None]:
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

y_pred_dt = dtc.predict(X_test)

accuracy_dt = metrics.accuracy_score(y_test, y_pred_dt)
accuracy_list.append(round(accuracy_dt, 2)*100)
dt_cm = metrics.confusion_matrix(y_test, y_pred_dt)

print("The Accuracy of this model is: ", round(accuracy_dt, 2)*100, "%")

<h3><span style="color:purple;">Random Forest Classifier</span></h3>

*The random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.*

In [None]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

y_pred_rf = rfc.predict(X_test)

accuracy_rf = metrics.accuracy_score(y_test, y_pred_rf)
accuracy_list.append(round(accuracy_rf, 2)*100)
rf_cm = metrics.confusion_matrix(y_test, y_pred_rf)

print("The Accuracy of this model is: ", round(accuracy_rf, 2)*100, "%")

<h3><span style="color:purple;">Logistic Regression</span></h3>

*Logistic regression is a statistical analysis method to predict a binary outcome, such as yes or no, based on prior observations of a data set.*

In [None]:
reg = LogisticRegression()
reg.fit(X_train, y_train)

y_pred_reg = reg.predict(X_test)

accuracy_reg = metrics.accuracy_score(y_test, y_pred_reg)
accuracy_list.append(round(accuracy_reg, 2)*100)
lr_cm = metrics.confusion_matrix(y_test, y_pred_reg)

print("The Accuracy of this model is: ", round(accuracy_reg, 2)*100, "%")

<h1><center> <span style="color:DarkBlue;">MODEL EVALUATION</span></center></h1>

<h3><span style="color:purple;">Plot Confusion Matrix</span></h3>

*Confusion matrices are used to visualise the counts behind important evaluation metrics such as recall, specificity, accuracy, and precision.*
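
*Each of these metrics is a ratio of cells in the confusion matrix. The sketch below works through the arithmetic on ten hypothetical labels and predictions, where `True` stands for a transported passenger.*

```python
from sklearn import metrics

# Hypothetical labels and predictions (True = transported).
y_true = [True, True, True, True, False, False, False, False, False, False]
y_hat = [True, True, True, False, True, False, False, False, False, False]

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # share of predicted positives that are right
recall = tp / (tp + fn)                      # share of actual positives found (sensitivity)
specificity = tn / (tn + fp)                 # share of actual negatives found

print(accuracy, precision, recall, specificity)
```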

In [None]:
# Draw one confusion-matrix heatmap per model in a 2x3 grid.
confusion_matrices = [
    ("Support Vector Machine", svm_cm),
    ("K Nearest Neighbours", knn_cm),
    ("Naive Bayes", nb_cm),
    ("Decision Trees", dt_cm),
    ("Random Forest", rf_cm),
    ("Logistic Regression", lr_cm),
]

fig = plt.figure(figsize=(18, 10))
fig.subplots_adjust(hspace=0.325)

for position, (title, cm) in enumerate(confusion_matrices, start=1):
    ax = fig.add_subplot(2, 3, position)
    ax.set_title(title)
    # fmt='d' prints the integer counts instead of scientific notation.
    sns.heatmap(cm, annot=True, fmt='d', cmap='gist_heat', ax=ax)
    ax.set_xlabel('Predicted Values')
    ax.set_ylabel('Actual Values')

<h3><span style="color:purple;">Accuracy Table</span></h3>

In [None]:
comparison = pd.DataFrame({"Models" : model_list, "Model_Accuracy" : accuracy_list})
comparison

In [None]:
fig = go.Figure(go.Bar(
            x=model_list,
            y=accuracy_list,
            marker=dict(color=[3, 1, 5, 6, 9, 3], coloraxis="coloraxis")))
fig.update_layout(coloraxis=dict(colorscale='Bluered_r'), title_text="Model Accuracies In (%)", title_x=0.5)
fig.show()

**Plotting the accuracies of all the models shows that Logistic Regression and Random Forest give the best accuracy (77%) for predicting which passengers were transported.**

<center><img src= "https://i0.wp.com/dariusforoux.com/wp-content/uploads/2018/11/success.png?fit=665%2C499&ssl=1" alt ="Titanic" style='width: 500px;'></center>


<h1><center> <span style="color:DarkBlue;">SUBMISSION</span></center></h1>

<h3><span style="color:purple;">Predicting the test data using the best model we have built (Logistic Regression)</span></h3>

In [None]:
# Keep the passenger IDs for the submission file, then predict with the
# fitted logistic regression model on the same columns used for training.
test_id = test['PassengerId']
test_pred_reg = reg.predict(test).astype("bool")

In [None]:
submission = pd.DataFrame({"PassengerId" : test_id, "Transported" : test_pred_reg})
submission.to_csv("submission.csv",index=False)
submission.head()