# Titanic Project 

## Prelude:
    In this project we are given data regarding the passengers travelling in the iconic ship called "Titanic". Given features and label attributes tells us about the particulars of the passengers travelling in the ship.

Let us now get familiar with the feature variables:

1. PassengerId: It is an identifying unique numerical variable

2. Pclass: This is a catergorical numeric variable with 3 categories which refers to socio-economic class of the passenger

3. Name: This is an identifying character variable

4. Sex: This is a categorical variable with 2 categories describing the gender of the passenger

5. Age: This is a numeric variable

6. Sibsp: This is a numeric categorical variable with 7 categories describing about the number of sibling passenger has got

7. Parch: This is a numeric categorical variable with 7 categories describing the parents and number of siblings of a passenger

8. Ticket: This is an identifying character variable 

9. Fare: This is a numeric variable describing the price of ticket purchased in GBP

10. Cabin: This is a categorical variable describing about the cabin 

11. Embark: This is a character variable describing about the port from which port the passengers boarded the ship
     (C = Cherbourg; Q = Queenstown; S = Southampton) 
     
Moving forward let's throw some light on our label variable
Survived: This is a categorical variable with two categories. (0= Not Survived, 1= Survived)

## Problem Statement:
   ####  We have to build a classification machine learning model to predict the survival of the passenger boarding the ship.

In [1]:
# Importing necessary packages

# Importing fundamental packages
import warnings
warnings.filterwarnings("ignore")
from IPython.core.interactiveshell import InteractiveShell        ## To display multiple outputs
InteractiveShell.ast_node_interactivity = "all"
import pyforest            

## For visualization
import matplotlib.pyplot as plt                                         
%matplotlib inline

import seaborn as sns

from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px

## Data Pre-Processing Packages
from scipy.stats import chi2_contingency
from sklearn import metrics
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from scipy.stats import zscore
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import KNNImputer

## To create copy of data
import copy

## Pipeline Packages

from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline


## Ensemble Learning Algorithms Packages

from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform
from sklearn.linear_model import Ridge

## Evaluation Metrics Packages
import statsmodels.api as sm
from scipy.stats import f
from statsmodels.formula.api import ols
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.utils import resample
from sklearn import metrics
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import preprocessing
from sklearn.metrics import average_precision_score, confusion_matrix, accuracy_score, classification_report, plot_confusion_matrix
from sklearn.metrics import roc_curve, auc

# Bagging and Boosting
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier                           

# Saving the model
import pickle

In [2]:
def read(link):
    global data

    data=pd.read_csv(link)
    
    data=pd.DataFrame(data)
    print(data)

In [3]:
read(link="titanic_train.csv")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                             Allen, Mr. William Henry    male  35.0      0   
..                                                 ...     ...   ... 

In [4]:
def eda(mydata):                                    # Defining function
    
    pd.set_option("display.max_rows", None)         # to display all rows
    pd.set_option("display.max_columns", None)       #to display all columns
    
    print(mydata.head())                              # to display first 10 records
    print("\n")                               
    print(mydata.tail())                              # to display last 10 records
    print("\n")
    print("\n")
    
    print(mydata.info())                               # to understand attributes of the data
    
    print(data.describe())                          # to get descriptive statistics
    print("\n")
    print("\n")
    print("Skewness for the data","\n",data.skew())       # to get skewness of the data, skewness=0 for normal distribution
    print("\n")
    print("Kurosis for the data","\n",data.kurtosis() )            # to get kutosis, kurtosis <=3 for normal distribution
    print("\n")
    print("\n")
    
    sns.pairplot(mydata, kind='scatter', diag_kind='kde')                       # to represent data graphically
    print("\n")
    print("\n")    
    
    plt.figure(figsize=(10,10))                      # plotting heat map to check correlation
    sns.heatmap(mydata.corr(method = "pearson"), annot = True)
    print("\n")    

In [None]:
eda(mydata=data)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  




In [None]:
# Making a data copy to understand significant variables

In [None]:
data1= copy.deepcopy(data)

In [None]:
data1['Cabin']=data1['Cabin'].astype(str).str[0]

In [None]:
enc = OrdinalEncoder()
data1[["Cabin"]] = enc.fit_transform(data1[["Cabin"]])

Conducting a chi-square to test the significance of the variable "Cabin" to label variable "Survived"

In [None]:
crosstab = pd.crosstab(data1["Cabin"], data1["Survived"])


In [None]:
# H0= Cabin does not impact survival of a passenger
# HA= Cabin does impact survival of a passenger

# defining the table
stat, p, dof, expected = chi2_contingency(crosstab)

# interpret p-value
alpha = 0.05
print("p value is " , str(p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (H0 holds true)')


Conducting a chi-square to test the significance of the variable "Sex" to label variable "Survived"

In [None]:
crosstab = pd.crosstab(data1["Sex"], data1["Survived"])


In [None]:
# H0= Sex does not impact survival of a passenger
# HA= Sex does impact survival of a passenger

# defining the table
stat, p, dof, expected = chi2_contingency(crosstab)

# interpret p-value
alpha = 0.05
print("p value is " , str(p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (H0 holds true)')


Conducting a chi-square to test the significance of the variable "Age" to label variable "Survived"

In [None]:
crosstab = pd.crosstab(data1["Age"], data1["Survived"])


In [None]:
# H0= Age does not impact survival of a passenger
# HA= Age does impact survival of a passenger

# defining the table
stat, p, dof, expected = chi2_contingency(crosstab)

# interpret p-value
alpha = 0.05
print("p value is " , str(p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (H0 holds true)')


Conducting a chi-square to test the significance of the variable "Name" to label variable "Survived"

In [None]:
crosstab = pd.crosstab(data1["Name"], data1["Survived"])


In [None]:
# H0= Name does not impact survival of a passenger
# HA= Name does impact survival of a passenger

# defining the table
stat, p, dof, expected = chi2_contingency(crosstab)

# interpret p-value
alpha = 0.05
print("p value is " , str(p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (H0 holds true)')


Conducting a chi-square to test the significance of the variable "PassengerId" to label variable "Survived"

In [None]:
crosstab = pd.crosstab(data1["PassengerId"], data1["Survived"])


In [None]:
# H0= Passenger ID does not impact survival of a passenger
# HA= Passenger Id does impact survival of a passenger

# defining the table
stat, p, dof, expected = chi2_contingency(crosstab)

# interpret p-value
alpha = 0.05
print("p value is " , str(p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (H0 holds true)')


Conducting a chi-square to test the significance of the variable "Embarked" to label variable "Survived"

In [None]:
crosstab = pd.crosstab(data1["Embarked"], data1["Survived"])
# H0= Passenger Id does not impact survival of a passenger
# HA= Passenger Id does impact survival of a passenger

# defining the table
stat, p, dof, expected = chi2_contingency(crosstab)

# interpret p-value
alpha = 0.05
print("p value is " , str(p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (H0 holds true)')


Conducting a chi-square to test the significance of the variable "Ticket" to label variable "Survived"

In [None]:
crosstab = pd.crosstab(data1["Ticket"], data1["Survived"])
# H0= Ticket number does not impact survival of a passenger
# HA= Ticket number  does impact survival of a passenger

# defining the table
stat, p, dof, expected = chi2_contingency(crosstab)

# interpret p-value
alpha = 0.05
print("p value is " , str(p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (H0 holds true)')


In [None]:
crosstab = pd.crosstab(data1["Pclass"], data1["Survived"])
# H0= class does not impact survival of a passenger
# HA= class  does impact survival of a passenger

# defining the table
stat, p, dof, expected = chi2_contingency(crosstab)

# interpret p-value
alpha = 0.05
print("p value is " , str(p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (H0 holds true)')


In [None]:
def outlier(mydata):                        # Outlier Plotting
    for i in mydata.columns:
        fig = px.box(mydata, y= i, width=600, height=400, title=i, template="plotly_dark")
        fig.show()

In [None]:
outlier(mydata=data.drop(["Survived"],axis=1))

In [None]:
## Histogram for all the variables using plotly_express by 'quality'

for i in data.columns:
    fig = px.histogram(data, x= i, histfunc = "count", color = "Survived", 
                       width=1000, height=800, title = "Histogram for " + i, 
                       template="plotly_dark")

    fig.show() 

In [None]:
# Percntage of Survival found in Titanic
percentage=(data['Survived'].value_counts())*100 / (len(data['Survived']))
print(percentage.sort_values(ascending=False))

In [None]:
data.groupby(["Survived"]).describe().transpose() ### Five point summary grouped by "Survived"

In [None]:
data["Survived"].value_counts(ascending=True)     # Finding values in each class of Survival

## Findings:  

01. There are 891 rows and 12 columns

02. There are missing values in the data

03. There are 7 numeric variables and 5 object variables

04. There's no multicollinearity in the data

05. Variables like "Passenger Id", "Pclass", "Name" and "Age" are not significant for predicting survival of a passenger,           hence, it can be dropped from our data

06. Out of 891 passengers, around 342 (38.38%) passengers survived. In survivors maximum were women and children.

07. Average age of passengers who boarded is 29.699 years and passengers who survived is 28.343 years

08. Average price of Tickets purchased was GBP 32.20 and for passengers survived it was GBP 48.4

09. Considering the price of ticket we see survived passengerswere socio-economically upper class

10. Generally, passengers boarded the ship belonged to Middle-class socio-economically.

11. From our various chi-square tests we find the survival chances of the passengers is influenced by their age, sex,               embarkment and socio economic class as higher the price of ticket better is their survival chances.

12. In this dataset, we find variables like Parch, Fare, Sibsp and Age have outliers.

13. Variables like "SibSp","Parch" and "Fare" have outliers

14. Maximum Survivors had boarded the ship from Cherbourg

## Data Cleaning

In [None]:
# Making a copy of dataset for cleaning and model building purpose

data2= copy.deepcopy(data)

In [None]:
data2.columns

In [None]:
# Dropping insignificant columns
data2= data.drop(["PassengerId","Pclass","Name","Age"], axis=1)

In [None]:
# Converting Int variable to Float variable for column Age
data2['Age']=data['Age'].astype(float)

Encoding the values

In [None]:
# Encoding "Cabin" variable

data2['Cabin']=data2['Cabin'].astype(str).str[0]
enc = OrdinalEncoder()
data2[["Cabin"]] = enc.fit_transform(data2[["Cabin"]])


In [None]:
# Encoding "Sex" and "Embarked" variables
enc2= LabelEncoder()
data2[["Sex","Embarked"]]=data2[['Sex', 'Embarked']].apply(enc2.fit_transform)

# Embark: Cherbourg=0 Queenstown=1 Southampton=2
# Sex: Male= 0, Female= 1

In [None]:
# Keeping only numeric for variable "Ticket"
data2["Ticket"]=data2["Ticket"].str.extract('(\d+)').astype(float)

In [None]:
# Splitting data into train and test

def split (data,target):
    data_reset_index = data.reset_index(drop=True)
# Data split
    global x
    global y
    global x_train
    global y_train
    global x_test
    global y_test
# Segregate Feature & Target Variables
    x = data_reset_index.drop(target, axis=1)
    y = data_reset_index[target]
    x_train,x_test,y_train,y_test = train_test_split(x,y, test_size=0.3, random_state=3)
    
    print(x_train.info())
    ("\n")
    print(x_test.info())
    ("\n")
    print(y_train.shape)
    ("\n")
    print(y_test.shape)
    ("\n")

In [None]:
split(data=data2,
      target="Survived")

### Imputing Missing Values in the data set

In [None]:
impute_1= SimpleImputer(strategy= "mean")
impute_2= IterativeImputer(max_iter=10, random_state= 0)
impute_3= KNNImputer(n_neighbors=9)

In [None]:
model= LinearRegression()

In [None]:
pipe1 = Pipeline([
    ('impute',impute_1),
    ('model',model),
])    
pipe2 = Pipeline([
    ('impute',impute_2),
    ('model',model),
])
pipe3 = Pipeline([
    ('impute',impute_3),
    ('model',model),  
    
])

In [None]:
# Create Function pre_process to check the score for strategy
def pre_process(data, pipe):
    
    data_reset_index = data.reset_index(drop=True)
    

# Pipe.fit, pipe.predict and accuracy
    
    pipe.fit(x_train,y_train)
    
    y_pred = pipe.predict(x_test)
    
    score = np.sqrt(metrics.mean_squared_error(y_pred, y_test))
    
    return score

In [None]:
pre_process(data= data2,
           pipe= pipe1)

In [None]:
pre_process(data= data2,
            pipe= pipe2)

In [None]:
pre_process(data= data2,
            pipe= pipe2)

In [None]:
def impute(feature, ft_train, ft_test, impute, target, target_train, target_test):
    
    
    
    
    # Imputing the training dataset with final chosen imputation method
    
    # converting array into dataframe
    
    ft_train = pd.DataFrame(impute.fit_transform(ft_train))
    
    # Assigning column names to the training data
    
    ft_train.columns = feature.columns
    
    print("Final Imputed Training Feature Data Set: \n", ft_train.isnull().sum())
    
    print(" ")
    
    target_train = pd.DataFrame(np.array(target_train))
    target_train.columns = [target]
    
    
    
    
    
    
    # Creating final training dataset (concatenating feature training and target training)
    
    global final_train
    
    final_train = pd.concat([ft_train, target_train], axis = 1)
    
    print("Final Train Data Health: ")
    
    print(" ")
    
    print(final_train.info())
    
    print(" ")
    
        
    print("Final look: \n", final_train.isnull().sum())
    
    print(" ")
    
   
    
    
    
    
    # Imputing the test dataset with final chosen imputation method
    
    # converting array into dataframe
    
    ft_test = pd.DataFrame(impute.fit_transform(ft_test))
    
    # Assigning column names to the test data
    
    ft_test.columns = feature.columns
    
    print("Final Imputed Test Feature Data Set: \n", ft_test.isnull().sum())
    
    print(" ")
    
    target_test = pd.DataFrame(np.array(target_test))
    
    target_test.columns = [target]
    
    
    
    
    
    # Creating final test dataset (concatenating feature training and target training)
    
    
    
    global final_test
    
    final_test = pd.concat([ft_test, target_test], axis = 1)
    
    print("Final Test Data Health: ")
    
    print(" ")
    
    print(final_test.info())
    
    print(" ")
    
   
    
    print("Final look: \n", final_test.isnull().sum())
    
    print(" ")

In [None]:
impute(feature = x , 
       ft_train = x_train , 
       ft_test = x_test , 
       impute = impute_3 ,
       target = 'Survived',
       target_train = y_train, 
       target_test = y_test)

In [None]:
data3=pd.concat([final_train,final_test ], axis = 0)

In [None]:
data3.info()

In [None]:
eda(mydata=data3)

In [None]:
outlier(mydata=data3.drop(["Survived"],axis=1))

In [None]:
## Histogram for all the variables using plotly_express by 'quality'

for i in data3.columns:
    fig = px.histogram(data2, x= i, histfunc = "count", color = "Survived", 
                       width=1000, height=800, title = "Histogram for " + i, 
                       template="plotly_dark")

    fig.show() 

In [None]:
# Plotting qq plot

for i in data3.columns:
    fig = sm.qqplot(data2[i])

    fig.show() 
    

In [None]:
# Percntage of quality found in wine
percentage=(data3['Survived'].value_counts())*100 / (len(data2['Survived']))
print(percentage.sort_values(ascending=False))

In [None]:
for i in data3.drop(['Survived'],axis=1).columns:
    data3[i] = np.where(data3[i] > (data3[i].quantile(0.75) + (data3[i].quantile(0.75) - data3[i].quantile(0.25))*1.5),
                           (data3[i].quantile(0.75) + (data3[i].quantile(0.75) - data3[i].quantile(0.25))*1.5),
                          np.where(data3[i] < (data3[i].quantile(0.25) - (data3[i].quantile(0.75) - data3[i].quantile(0.25))*1.5),
                           (data3[i].quantile(0.25) - (data3[i].quantile(0.75) - data3[i].quantile(0.25))*1.5),data3[i]))
    

In [None]:
outlier(mydata=data2.drop(["Survived"],axis=1))

Now we have cleaned our dataset

In [None]:
data3["Survived"].value_counts(abs,ascending=True)

In [None]:
eda(mydata=data3)

In [None]:
data3.groupby(["Survived"]).describe().transpose() ### Five point summary grouped by "Survived"

## Findings:  

01. There are 891 rows and 9 columns

02. There are no missing values in the data

03. There are no object variables

04. There's no multicollinearity in the data

05. Variables like "Passenger Id", "Pclass", "Name" and "Age" are not significant for predicting survival of a passenger,           hence, are dropped from our data

06. Out of 891 passengers, around 342 (38.38%) passengers survived. In survivors maximum were women and children.

07. Average age of passengers who boarded is 29.699 years and passengers who survived is 28.558 years 

08. Average price of Tickets purchased was GBP 32.20 and for passengers survived it was GBP 48.4

09. Considering the price of ticket we see survived passengerswere socio-economically upper class

10. Generally, passengers boarded the ship belonged to Middle-class socio-economically.

11. In this dataset, we find variables like Parch, Fare, Sibsp and Age have outliers.

12. Maximum Survivors had boarded the ship from Cherbourg

From the above values we see a problem of class imbalance

In [None]:
split(data=data3,
      target="Survived")

In [None]:
# Trying Smote
sm = SMOTE(random_state = 5)

columns = x_train.columns

train_data = pd.concat([x_train,y_train], axis = 1)

train_data.head()



In [None]:
x_os_train, y_os_train  = sm.fit_resample(train_data.drop('Survived', axis = 1), train_data['Survived'])

In [None]:
y_os_train.value_counts()

In [None]:
## Number of records in Oversampled Train dataset

print('the number of records in x_os_train :', len(x_os_train))
print('the number of records in y_os_train :', len(y_os_train))
print('the number of records in x_test :', len(x_test))
print('the number of records in y_test :', len(y_test))

# Target Class Distribution for Train and test Dataset

print('the ratio of 0 and 1 in y_os_train:')

print(y_os_train.value_counts(normalize = True)*100)

print('the ratio of 0 and 1 in y_test:')

print(y_test.value_counts(normalize = True)*100)

In [None]:
print("Before UpSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before UpSampling, counts of label '0': {} \n".format(sum(y_train==0)))

In [None]:
print("After UpSampling, counts of label '1': {}".format(sum(y_os_train==1)))
print("After UpSampling, counts of label '0': {} \n".format(sum(y_os_train==0)))

### Scaling the data

In [None]:
scale_1= StandardScaler()
scale_2= MinMaxScaler()
scale_3= RobustScaler()


In [None]:
model= LinearRegression()

In [None]:
pipe1 = Pipeline([
    ('Scale',scale_1),
    ('model',model),
])    
pipe2 = Pipeline([
    ('Scale',scale_2),
    ('model',model),
])
pipe3 = Pipeline([
    ('Scale',scale_3),
    ('model',model),  
    
])

In [None]:
# Create Function Name 
def pre_process(data, pipe):

# Pipe.fit, pipe.predict and accuracy
    
    pipe.fit(x_os_train,y_os_train)
    
    y_pred = pipe.predict(x_test)
    
    score = np.sqrt(metrics.mean_squared_error(y_pred, y_test))
    
    return score

In [None]:
pre_process(data= data3,
           pipe= pipe1)

In [None]:
pre_process(data= data3,
           pipe= pipe2)

In [None]:
pre_process(data= data3,
           pipe= pipe3)

Here we see all techniques are providing same score. We can use standard scaler.

In [None]:
scalar= StandardScaler()
scalar.fit(x_train)
x_trainsc =  scalar.transform(x_os_train)
x_testsc  =  scalar.transform(x_test)


### PCA

In [None]:
pca = PCA(n_components = 0.5)
pca.fit(x_trainsc)
x_train_model = pca.transform(x_trainsc)
x_test_model = pca.transform(x_testsc)
ex_variance=np.var(x_train_model,axis=0)
ex_variance_ratio = ex_variance/np.sum(ex_variance)

print("shape of x_train_pca", x_train_model.shape)
print('')    
print("Explained Variance Ratio for Training Dataset: \n", ex_variance_ratio)

print(" ")

ex_variance_1 = np.var(x_test_model , axis=0)
ex_variance_ratio_1 = ex_variance_1 / np.sum(ex_variance_1)
    
print("Explained Variance Ratio for Test Dataset: \n", ex_variance_ratio_1) 
print(" ")

## Model Building

### Naive Baye's Model

In [None]:
nb_model          =      GaussianNB()

## Model.fit

nb_model.fit(x_trainsc, y_os_train)

## Model.predict


y_pred_nb_0 = nb_model.predict(x_testsc)

### Naive Bayes

conf_matrix_nb = confusion_matrix(y_test, y_pred_nb_0, labels=[0, 1])

df_cmatrix_nb = pd.DataFrame(conf_matrix_nb, index = [i for i in [0, 1]],
                  columns = [i for i in ["Predict_Rejected","Predict_Accepted"]])


fig_cmatrix_nb = px.imshow(df_cmatrix_nb , title = "Confusion Matrix for NB Classifier Model")

## Saving the Classification Reports : precision, recall, f1-score ##


pred_report_nb = classification_report(y_test, y_pred_nb_0 , digits=2)

### NB Classifier

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test,  y_pred_nb_0)
roc_nb = auc(false_positive_rate, true_positive_rate)


print("The Accuracy Score For The NB Classifier Model Is :  ", accuracy_score(y_test, y_pred_nb_0))
print("\n")
print("The roc_auc score for NB Classifier Model:  ", roc_nb)



##  Confusion Matrices 


fig_cmatrix_nb.show()

### Classification Reports 

print("Title : The Classification Report for NB Classifier Model: \n  ", pred_report_nb)


In [None]:
##### K Fold Cross Validation ####
    
cross_valid = True
cv_score = cross_val_score(nb_model , x_trainsc , y_os_train, scoring = 'accuracy', 
                                          cv = KFold(n_splits = 10))
accuracy = accuracy_score(y_test, y_pred_nb_0)

print("The Cross Validation Score For Baseline Support Vector Classifier Model After", "Fold Cross Validation Is :  \n", cv_score)
print(" ")
print("The Accuracy Score For Baseline Support Vector Classifier Model After Cross Validation Is: \n", accuracy)

### Linear Discriminant Analysis

In [None]:
# Training the model
lda_model         =      LinearDiscriminantAnalysis()

## Model.fit

lda_model.fit(x_trainsc, y_os_train)

## Model.predict

y_lda_0 = lda_model.predict(x_testsc)

## Linear Discriminant Analysis Model


conf_matrix_lda =  confusion_matrix(y_test, y_lda_0 , labels=[0, 1])

df_cmatrix_lda   = pd.DataFrame(conf_matrix_lda , index = [i for i in [0, 1]],
                  columns = [i for i in ["Predict_Rejected","Predict_Accepted"]])


fig_cmatrix_lda = px.imshow(df_cmatrix_lda , title = "Confusion Matrix for Baseline Linear Discriminant Analysis Model")

## Saving the Classification Reports : precision, recall, f1-score ##

pred_report_lda = classification_report(y_test, y_lda_0 , digits=2)

### Saving the ROC_AUC Scores for Algorithms

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_lda_0)
roc_lda = auc(false_positive_rate, true_positive_rate)

#### Printing the Accuracy Scores 

print("The Accuracy Score For The Linear Discriminant Analysis Model Is :  ",   
      accuracy_score(y_test, y_lda_0))

##  Confusion Matrices 

fig_cmatrix_lda.show()

### Classification Reports 

print("Title : The Classification Report for Linear Discriminant Model: \n  ", pred_report_lda)

###  roc_auc scores 

print("The roc_auc score for Linear Discriminat Analysis Model:  ", roc_lda)



In [None]:
##### K Fold Cross Validation ####
    
cross_valid = True
cv_score = cross_val_score(lda_model , x_trainsc , y_os_train, scoring = 'accuracy', 
                                          cv = KFold(n_splits = 10))
accuracy = accuracy_score(y_test, y_lda_0)

print("The Cross Validation Score For Baseline Linear Discriminant Model After", "Fold Cross Validation Is :  \n", cv_score)
print(" ")
print("The Accuracy Score For Baseline Linear Discriminant Model After Cross Validation Is: \n", accuracy)

### Quadratic Discriminant Model

In [None]:
# Training the Model

qda_model         =      QuadraticDiscriminantAnalysis()

## Model.fit

qda_model.fit(x_trainsc, y_os_train)

## Model.predict

y_qda_0 = qda_model.predict(x_testsc)

## Quadratic Discriminant Analysis Model

conf_matrix_qda = confusion_matrix(y_test, y_qda_0 , labels=[0, 1])

df_cmatrix_qda = pd.DataFrame(conf_matrix_qda, index = [i for i in [0, 1]],
                  columns = [i for i in ["Predict_Rejected","Predict_Accepted"]])


fig_cmatrix_qda = px.imshow(df_cmatrix_qda , title = "Confusion Matrix for Baseline Quadratic Discriminant Analysis Model")

## Saving the Classification Reports : precision, recall, f1-score ##

pred_report_qda = classification_report(y_test, y_qda_0 , digits=2)

### Saving the ROC_AUC Scores for Baseline Algorithms

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_qda_0)
roc_qda = auc(false_positive_rate, true_positive_rate)

#### Printing the Accuracy Scores 

print("The Accuracy Score For The Quadratic inear Discriminant Analysis Model Is :  ",   
      accuracy_score(y_test, y_qda_0))

##  Confusion Matrices 

fig_cmatrix_qda.show()

### Classification Reports 

print("Title : The Classification Report for Quadratic Discriminant Model: \n ", pred_report_qda)

###  roc_auc scores 

print("The roc_auc score for Baseline Quadratic Discriminat Analysis Model:  ", roc_qda)

In [None]:
##### K Fold Cross Validation ####
    
cross_valid = True
cv_score = cross_val_score(qda_model , x_trainsc , y_os_train, scoring = 'accuracy', 
                                          cv = KFold(n_splits = 10))
accuracy = accuracy_score(y_test, y_qda_0)

print("The Cross Validation Score For Baseline Quadratic Discriminant Model  After", "Fold Cross Validation Is :  \n", cv_score)
print(" ")
print("The Accuracy Score For Baseline Quadratic Discriminant Model  After Cross Validation Is: \n", accuracy)

### K-Nearest Neighbor Model

In [None]:
# Training the model
knn_model          =       KNeighborsClassifier()

## Model.fit

knn_model.fit(x_trainsc, y_os_train)

## Model.predict

y_knn_0 = knn_model.predict(x_testsc)

## K- Nearest Neighbor Model

conf_matrix_knn = confusion_matrix(y_test, y_knn_0 , labels=[0, 1])

df_cmatrix_knn = pd.DataFrame(conf_matrix_knn, index = [i for i in [0, 1]],
                  columns = [i for i in ["Predict_Rejected","Predict_Accepted"]])


fig_cmatrix_knn = px.imshow(df_cmatrix_knn , title = "Confusion Matrix for K-Nearest Neighbor Model")

## Saving the Classification Reports : precision, recall, f1-score ##

pred_report_knn = classification_report(y_test, y_knn_0 , digits=2)

### Saving the ROC_AUC Scores for Baseline Algorithms

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_knn_0)
roc_knn = auc(false_positive_rate, true_positive_rate)

#### Printing the Accuracy Scores 

print("The Accuracy Score For The K-Nearest Neighbor Analysis Model Is :  ",   
      accuracy_score(y_test, y_knn_0))

##  Confusion Matrices 

fig_cmatrix_knn.show()

### Classification Reports 

print("Title : The Classification Report for K-Nearest Neighbor Model: \n ", pred_report_knn)

###  roc_auc scores 

print("The roc_auc score for K-Nearest Neighbor Analysis Model:  ", roc_knn)

In [None]:
##### K Fold Cross Validation ####
    
cross_valid = True
cv_score = cross_val_score(knn_model , x_trainsc , y_os_train, scoring = 'accuracy', 
                                          cv = KFold(n_splits = 10))
accuracy = accuracy_score(y_test, y_knn_0)

print("The Cross Validation Score For Baseline K- Nearest Neighbour Classifier Model After", "Fold Cross Validation Is :  \n", cv_score)
print(" ")
print("The Accuracy Score For Baseline K- Nearest Neighbour Classifier Model After Cross Validation Is: \n", accuracy)

### Support Vector Model

In [None]:
# Training the model
svm_model =  svm.SVC()

## Model.fit

svm_model.fit(x_trainsc, y_os_train)

## Model.predict

y_svc_0 = svm_model.predict(x_testsc)

## Support Vector Model

conf_matrix_svc = confusion_matrix(y_test, y_svc_0 , labels=[0, 1])

df_cmatrix_svc = pd.DataFrame(conf_matrix_knn, index = [i for i in [0, 1]],
                  columns = [i for i in ["Predict_Rejected","Predict_Accepted"]])


fig_cmatrix_svc = px.imshow(df_cmatrix_svc , title = "Confusion Matrix for Support Vector Model")

## Saving the Classification Reports : precision, recall, f1-score ##

pred_report_svc = classification_report(y_test, y_svc_0 , digits=2)

### Saving the ROC_AUC Scores for Baseline Algorithms

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_svc_0)
roc_svc = auc(false_positive_rate, true_positive_rate)

#### Printing the Accuracy Scores 

print("The Accuracy Score For The Support Vector Analysis Model Is :  ",   
      accuracy_score(y_test, y_svc_0))

##  Confusion Matrices 

fig_cmatrix_svc.show()

### Classification Reports 

print("Title : The Classification Report for Support Vector Model: \n ", pred_report_svc)

###  roc_auc scores 

print("The roc_auc score for Support Vector Analysis Model:  ", roc_svc)

In [None]:
##### K Fold Cross Validation ####
    
cross_valid = True
cv_score = cross_val_score(svm_model , x_trainsc , y_os_train, scoring = 'accuracy', 
                                          cv = KFold(n_splits = 10))
accuracy = accuracy_score(y_test, y_svc_0)

print("The Cross Validation Score For Baseline Support Vector Classifier Model After", "Fold Cross Validation Is :  \n", cv_score)
print(" ")
print("The Accuracy Score For Baseline Support Vector Classifier Model After Cross Validation Is: \n", accuracy)

### Logistic Regression

In [None]:
# Training the model
logit_model = LogisticRegression()

## Model.fit

logit_model.fit(x_trainsc, y_os_train)

## Model.predict

y_logr_0 = logit_model.predict(x_testsc)

## Logistic Regression Model

conf_matrix_logr = confusion_matrix(y_test, y_logr_0 , labels=[0, 1])

df_cmatrix_logr = pd.DataFrame(conf_matrix_knn, index = [i for i in [0, 1]],
                  columns = [i for i in ["Predict_Rejected","Predict_Accepted"]])


fig_cmatrix_logr = px.imshow(df_cmatrix_logr , title = "Confusion Matrix for Logistic Regression Model")

## Saving the Classification Reports : precision, recall, f1-score ##

pred_report_logr = classification_report(y_test, y_logr_0 , digits=2)

### Saving the ROC_AUC Scores for Baseline Algorithms

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_logr_0)
roc_logr = auc(false_positive_rate, true_positive_rate)

#### Printing the Accuracy Scores 

print("The Accuracy Score For The Logistic Regression Analysis Model Is :  ",   
      accuracy_score(y_test, y_logr_0))

##  Confusion Matrices 

fig_cmatrix_logr.show()

### Classification Reports 

print("Title : The Classification Report for Logistic Regression Model: \n ", pred_report_logr)

###  roc_auc scores 

print("The roc_auc score for Logistic Regression Analysis Model:  ", roc_logr)

In [None]:
##### K Fold Cross Validation ####
    
cross_valid = True
cv_score = cross_val_score(logit_model , x_trainsc , y_os_train, scoring = 'accuracy', 
                                          cv = KFold(n_splits = 10))
accuracy = accuracy_score(y_test, y_logr_0)

print("The Cross Validation Score For Baseline Logistic Regression Classifier Model After", "Fold Cross Validation Is :  \n", cv_score)
print(" ")
print("The Accuracy Score For Baseline Logistic Regression Classifier Model After Cross Validation Is: \n", accuracy)

In [None]:
### Hyper-parameter Tuning of Logistic Regression

penalty = ['l1', 'l2'] 
C = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
class_weight = ['balanced', None]
solver = ['newton-cg','lbfgs','liblinear','sag','saga']

param_grid = dict(penalty=penalty,
                  C=C,
                  class_weight=class_weight,
                  solver=solver)

grid = GridSearchCV(estimator = logit_model,
                    param_grid = param_grid,
                    scoring ='roc_auc',
                    verbose = 1,
                    n_jobs =-1, cv = 10)
grid_result = grid.fit(x_trainsc, y_os_train)

print('Best Score: ', grid_result.best_score_)

print('Best Params: ', grid_result.best_params_)

In [None]:
# Training the model
logit_model = LogisticRegression(C=0.1, class_weight='balanced', penalty= 'l2', solver='liblinear')

## Model.fit

logit_model.fit(x_trainsc, y_os_train)

## Model.predict

y_logr_0 = logit_model.predict(x_testsc)

## Logistic Regression Model

conf_matrix_logr = confusion_matrix(y_test, y_logr_0 , labels=[0, 1])

df_cmatrix_logr = pd.DataFrame(conf_matrix_knn, index = [i for i in [0, 1]],
                  columns = [i for i in ["Predict_Rejected","Predict_Accepted"]])


fig_cmatrix_logr = px.imshow(df_cmatrix_logr , title = "Confusion Matrix for Logistic Regression Model")

## Saving the Classification Reports : precision, recall, f1-score ##

pred_report_logr = classification_report(y_test, y_logr_0 , digits=2)

### Saving the ROC_AUC Scores for Baseline Algorithms

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_logr_0)
roc_logr = auc(false_positive_rate, true_positive_rate)

#### Printing the Accuracy Scores 

print("The Accuracy Score For The Logistic Regression Analysis Model Is :  ",   
      accuracy_score(y_test, y_logr_0))

##  Confusion Matrices 

fig_cmatrix_logr.show()

### Classification Reports 

print("Title : The Classification Report for Logistic Regression Model: \n ", pred_report_logr)

###  roc_auc scores 

print("The roc_auc score for Logistic Regression Analysis Model:  ", roc_logr)

###  Decision Tree

Building Decision tree Model

In [None]:
dTree = DecisionTreeClassifier(criterion = 'gini', random_state=1)
dTree.fit(x_trainsc, y_os_train)

Scoring our Decision Tree

In [None]:
print(dTree.score(x_trainsc, y_os_train))
print(dTree.score(x_testsc, y_test))

Visualizing the Decision Tree

In [None]:
plt.figure(figsize=(15,10))
tree.plot_tree(dTree,filled=True)

Reducing Overfitting

In [None]:
dTreeR = DecisionTreeClassifier(criterion = 'gini', max_depth = 4, random_state=1)
dTreeR.fit(x_trainsc, y_os_train)
print(dTreeR.score(x_trainsc, y_os_train))
print(dTreeR.score(x_testsc, y_test))

In [None]:
dTreeR = DecisionTreeClassifier(criterion = 'entropy', max_depth = 4, random_state=1)
dTreeR.fit(x_trainsc, y_os_train)
print(dTreeR.score(x_trainsc, y_os_train))
print(dTreeR.score(x_testsc, y_test))

In [None]:
print(dTreeR.score(x_testsc , y_test))
y_predict = dTreeR.predict(x_testsc)

cm = confusion_matrix(y_test, y_predict, labels=[0, 1])

df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
                  columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')

## Saving the Classification Reports for Meta Learning Models:

pred_report_tree = classification_report(y_test, y_predict , digits=2)

### Saving the ROC_AUC Scores for Meta Learning Algorithms

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_predict)
roc_tree = auc(false_positive_rate, true_positive_rate)

## Saving the accuracy

accuracy = accuracy_score(y_test, y_predict)

print("Accuracy Score for Decision Tree: ",accuracy)
print("\n")
print("Decision Tree Classifier Report")
print("\n")
print(pred_report_tree)
print("\n")
print("ROC score: ",roc_tree)

In [None]:
pipe = Pipeline(steps=[('std_slc', scalar),
                           ('pca', pca),
                           ('dec_tree', dTreeR)])

n_components = list(range(1,x.shape[1]+1,1))

criterion = ['gini', 'entropy']
max_depth = [2,4,6,8,10,12]

parameters = dict(pca__n_components=n_components,
                      dec_tree__criterion=criterion,
                      dec_tree__max_depth=max_depth)

clf_GS = GridSearchCV(pipe, parameters)
clf_GS.fit(x_trainsc, y_os_train)
    
print('Best Criterion:', clf_GS.best_estimator_.get_params()['dec_tree__criterion'])
print('Best max_depth:', clf_GS.best_estimator_.get_params()['dec_tree__max_depth'])
print('Best Number Of Components:', clf_GS.best_estimator_.get_params()['pca__n_components'])
pca = PCA(n_components = 0.5)
pca.fit(x_trainsc)
x_train_model = pca.transform(x_trainsc)
x_test_model = pca.transform(x_testsc)
ex_variance=np.var(x_train_model,axis=0)
ex_variance_ratio = ex_variance/np.sum(ex_variance)

print("shape of x_train_pca", x_train_model.shape)
print('')    
print("Explained Variance Ratio for Training Dataset: \n", ex_variance_ratio)

print(" ")

ex_variance_1 = np.var(x_test_model , axis=0)
ex_variance_ratio_1 = ex_variance_1 / np.sum(ex_variance_1)
    
print("Explained Variance Ratio for Test Dataset: \n", ex_variance_ratio_1) 
print(" ")
print(clf_GS.best_estimator_.get_params()['dec_tree'])

In [None]:
pca = PCA(n_components = 6)
pca.fit(x_trainsc)
x_train_model = pca.transform(x_trainsc)
x_test_model = pca.transform(x_testsc)
ex_variance=np.var(x_train_model,axis=0)
ex_variance_ratio = ex_variance/np.sum(ex_variance)

print(" ")

ex_variance_1 = np.var(x_test_model , axis=0)
ex_variance_ratio_1 = ex_variance_1 / np.sum(ex_variance_1)


dTreeR = DecisionTreeClassifier(criterion = 'gini', max_depth = 6, random_state=3)
dTreeR.fit(x_trainsc, y_os_train)
print(dTreeR.score(x_trainsc, y_os_train))
print(dTreeR.score(x_testsc, y_test))

print(dTreeR.score(x_testsc , y_test))
y_predict = dTreeR.predict(x_testsc)

cm = confusion_matrix(y_test, y_predict, labels=[0, 1])

df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
                  columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')

## Saving the Classification Reports for Meta Learning Models:

pred_report_tree = classification_report(y_test, y_predict , digits=2)

### Saving the ROC_AUC Scores for Meta Learning Algorithms

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_predict)
roc_tree = auc(false_positive_rate, true_positive_rate)

## Saving the accuracy

accuracy = accuracy_score(y_test, y_predict)

print("Accuracy Score for Decision Tree: ",accuracy)
print("\n")
print("Decision Tree Classifier Report")
print("\n")
print(pred_report_tree)
print("\n")
print("ROC score: ",roc_tree)

Random Forest Classifier

In [None]:
rf_model        =      RandomForestClassifier(criterion = 'gini', 
                                              n_estimators = 100, random_state=1)

In [None]:
rf_model.fit(x_trainsc, y_os_train)

In [None]:
y_predict     =   rf_model.predict(x_testsc)

In [None]:
## Saving the Classification Reports for Meta Learning Models:

pred_report_tree = classification_report(y_test, y_predict , digits=2)

### Saving the ROC_AUC Scores for Meta Learning Algorithms

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_predict)
roc_tree = auc(false_positive_rate, true_positive_rate)

## Saving the accuracy

accuracy = accuracy_score(y_test, y_predict)

print("Accuracy Score for Random Forest: ",accuracy)
print("\n")
print("Random Forest Classifier Report")
print("\n")
print(pred_report_tree)
print("\n")
print("ROC score: ",roc_tree)

Random Forest using entropy

In [None]:
rf_model        =      RandomForestClassifier(criterion = 'entropy', bootstrap= True, class_weight='balanced', 
                                              n_estimators = 1000, random_state=5)

rf_model.fit(x_trainsc, y_os_train)

y_predict     =   rf_model.predict(x_testsc)

## Saving the Classification Reports for Meta Learning Models:

pred_report_tree = classification_report(y_test, y_predict , digits=2)

### Saving the ROC_AUC Scores for Meta Learning Algorithms

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_predict)
roc_tree = auc(false_positive_rate, true_positive_rate)

## Saving the accuracy

accuracy = accuracy_score(y_test, y_predict)

print("Accuracy Score for Random Forest: ",accuracy)
print("\n")
print("Random Forest Classifier Report")
print("\n")
print(pred_report_tree)
print("\n")
print("ROC score: ",roc_tree)

In [None]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)

# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(x_trainsc, y_os_train)

# Getting the best parameters
rf_random.best_params_

In [None]:
rf_model1       =      RandomForestClassifier(criterion = 'gini', bootstrap= True, class_weight='balanced',
                                              min_samples_split= 2, max_depth=20, min_samples_leaf=2, max_features= 'sqrt',
                                              n_estimators = 2000, random_state=10)

rf_model1.fit(x_trainsc, y_os_train)

y_predict     =   rf_model1.predict(x_testsc)

## Saving the Classification Reports for Meta Learning Models:

pred_report_tree = classification_report(y_test, y_predict , digits=2)

### Saving the ROC_AUC Scores for Meta Learning Algorithms

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_predict)
roc_tree = auc(false_positive_rate, true_positive_rate)

## Saving the accuracy

accuracy = accuracy_score(y_test, y_predict)

print("Accuracy Score for Random Forest: ",accuracy)
print("\n")
print("Random Forest Classifier Report")
print("\n")
print(pred_report_tree)
print("\n")
print("ROC score: ",roc_tree)

Boosting Algorithm

In [None]:
clf = AdaBoostClassifier(n_estimators=1000)
clf.fit(x_trainsc, y_os_train)
y_pred = clf.predict(x_testsc)

In [None]:
########## Printing the Accuracy Score

print("Accuracy Score: ",accuracy_score(y_test, y_pred))


########## Printing the Classification Report

print(classification_report(y_test, y_pred, digits=2))

### Saving the ROC_AUC Score

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
roc_auc_1 = auc(false_positive_rate, true_positive_rate)
print("roc_auc_1 score: ", roc_auc_1)

Gradient Boosting

In [None]:
clf1 = GradientBoostingClassifier(n_estimators=1000)
clf1.fit(x_trainsc, y_os_train)
y_pred = clf1.predict(x_testsc)

In [None]:
########## Printing the Accuracy Score

print("Accuracy Score: ",accuracy_score(y_test, y_pred))


########## Printing the Classification Report

print(classification_report(y_test, y_pred, digits=2))

### Saving the ROC_AUC Score

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
roc_auc_2 = auc(false_positive_rate, true_positive_rate)
print("roc_auc_2 score: ", roc_auc_2)

In [None]:
# Save the trained model as a pickle string.
saved_model = pickle.dumps(clf1)

## Conclusion

Gradient Boosting Classifier is the best model as it offers the best recall Value. Since it is a matter of life our recall is more crucial than precision. Determining survived passenger is more important. 

The accuracy for this model : 84.38%
Area covered: 83.05%
Recall for not survived: 89%
Recall for survived: 77%