# **Methodology**


*   Importing some basic libraries
*   Importing the data
*   Performing descriptive analysis to understand the dataset before pre-processing
*   Checking for null values
*   Pre-processing
*   Handling missing values
*   Label-encoding the categorical features
*   Checking the data description after pre-processing
*   Plotting histograms
*   Analysing the target variable with count plots against the independent features, and with strip and violin plots of Age vs Survived and Fare vs Survived
*   Plotting the correlation matrix and heat map
*   Splitting train_df 70/30 into training data and validation data
*   Implementing five models: Logistic Regression, GBM, SVM, Decision Tree, and Naive Bayes
*   Predicting on the validation data
*   Evaluating each model with its confusion matrix and classification report
*   Saving the predictions on the test data in .csv format













# **Importing Some Basic Libraries**

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import sys, os
from matplotlib import pyplot as plt
from sklearn.preprocessing import LabelEncoder
import random
from math import exp
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix

# **Importing Data**

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [1]:
input_data_dir = "../input/titanic/"
train_df = pd.read_csv(os.path.join(input_data_dir, "train.csv"))
test_df = pd.read_csv(os.path.join(input_data_dir, "test.csv"))

# **Descriptive Analysis of the dataset**

In [1]:
print("Size of training dataset       : {}".format(train_df.shape))
print("Size of test dataset           : {}".format(test_df.shape))

## **Data Description**

### **Training Data**

In [1]:
train_df.info()

In [1]:
train_df.describe().T

### **Testing Data**

In [1]:
test_df.info()

In [1]:
test_df.describe().T

In [1]:
test_df_PassengerId = test_df.PassengerId

## **NULL VALUES**

In [1]:
plt.figure(figsize=(15, 20))
sns.heatmap(train_df.isnull(), cbar=False)        #plotting heatmap using sns library to find missing values in train_df
plt.show()

In [1]:
train_df.isna().sum()                        # Count of missing values per feature in train_df

In [1]:
test_df.isna().sum()                        # Count of missing values per feature in test_df

# **Pre-Processing**

**As seen above, one independent feature (Cabin) is missing in more than 75% of the rows, which leaves too little information to fill in its missing values reliably.**


**Hence, we drop this feature from both the training data and the testing data.**
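
The share of missing values can also be computed directly instead of being read off the heatmap. The sketch below runs on a small toy frame rather than the Titanic data, and the `max_missing` threshold is a hypothetical value chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for train_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare":  [7.25, 71.28, 7.92, 53.10],
})

max_missing = 0.5  # hypothetical cut-off, not a value taken from the notebook

# isna().mean() gives the fraction of missing values in each column.
missing_fraction = toy_df.isna().mean()
cols_to_drop = missing_fraction[missing_fraction > max_missing].index.tolist()

print(missing_fraction)
print("Dropping:", cols_to_drop)

reduced_df = toy_df.drop(columns=cols_to_drop)
```

Running `train_df.isna().mean()` on the real data would give the exact fraction behind the 75% statement.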

In [1]:
train_df = train_df.drop(['Cabin'], axis=1)
test_df = test_df.drop(['Cabin'], axis=1)

In [1]:
train_df = train_df.drop(['PassengerId','Name','Ticket'], axis=1)   # Dropping features that are not used for prediction.
test_df = test_df.drop(['PassengerId','Name','Ticket'], axis=1)     # PassengerId of the test data was saved earlier in test_df_PassengerId.

## **Handling Missing Values:**

In [1]:
train_df['Embarked'] = train_df['Embarked'].fillna(train_df['Embarked'].mode()[0])   # Mode, since the feature is discrete.
train_df['Embarked'].isna().sum()                                   # Remaining missing values in the 'Embarked' feature.

In [1]:
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].mean())   # Mean, since the feature is continuous.
train_df['Age'].isna().sum()                                   # Remaining missing values in the 'Age' feature.

In [1]:
test_df['Age'] = test_df['Age'].fillna(test_df['Age'].mean())   # Mean, since the feature is continuous.
test_df['Age'].isna().sum()                                   # Remaining missing values in the 'Age' feature.

In [1]:
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].mean())   # Mean, since the feature is continuous.
test_df['Fare'].isna().sum()                                   # Remaining missing values in the 'Fare' feature.
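
The cells above fill the test data from its own mean. A common alternative, sketched here on toy frames (not the Titanic rows), is to compute the fill values on the training data only and reuse them for the test data, so that nothing from the test set influences pre-processing. The frame and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train_df and test_df.
toy_train = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 40.0],
    "Embarked": ["S", "S", None, "C"],
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 50.0],
    "Embarked": ["Q", None],
})

# Fill values are computed on the training data only.
age_mean = toy_train["Age"].mean()               # mean for a continuous feature
embarked_mode = toy_train["Embarked"].mode()[0]  # mode for a discrete feature

# The same fill values are applied to both frames.
for frame in (toy_train, toy_test):
    frame["Age"] = frame["Age"].fillna(age_mean)
    frame["Embarked"] = frame["Embarked"].fillna(embarked_mode)

print(toy_test)
```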

## **Label Encoding On Categorical Features:**

In [1]:
# There are two Features in our data which we are going to encode.
train_df_encode = train_df[['Sex','Embarked']].copy()   # .copy() so the encoding below works on an independent frame.
test_df_encode = test_df[['Sex','Embarked']].copy()
train_df_encode.head()

In [1]:
# Features which we are not going to encode.
train_df_not_encode = train_df.drop(['Sex','Embarked'], axis=1)
test_df_not_encode = test_df.drop(['Sex','Embarked'], axis=1)
train_df_not_encode.head()

In [1]:
encoders = {}                  # One fitted LabelEncoder per categorical feature.
for i in train_df_encode:      # Encoding the object-typed features in the training data.
    encoders[i] = LabelEncoder()
    train_df_encode[i] = encoders[i].fit_transform(train_df_encode[i])

In [1]:
for j in test_df_encode:       # Reusing the encoders fitted on the training data so both datasets share one mapping.
    test_df_encode[j] = encoders[j].transform(test_df_encode[j])

In [1]:
train_df_encode.head()

In [1]:
train_df = pd.concat([train_df_encode, train_df_not_encode], axis=1)
test_df = pd.concat([test_df_encode, test_df_not_encode], axis=1)
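
Label encoding replaces each category with an integer, and `LabelEncoder` assigns those integers in the sorted order of the categories seen during fitting. The sketch below, on a toy column rather than the real data, shows how to inspect the mapping through `classes_` and decode it with `inverse_transform`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for train_df['Embarked'].
toy_embarked = pd.Series(["S", "C", "Q", "S"])

enc = LabelEncoder()
codes = enc.fit_transform(toy_embarked)

# classes_ holds the categories in sorted order; the position is the code.
mapping = {label: int(code) for code, label in enumerate(enc.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes.
decoded = enc.inverse_transform(codes)
print(list(decoded))
```

Note that the codes carry an artificial ordering (C < Q < S here), which linear models may interpret as a real one.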

# **Data Description After Pre-Processing**

### **Training Data:**

In [1]:
train_df.head()

In [1]:
train_df.info()

In [1]:
train_df.describe().T

### **Testing Data:**

In [1]:
test_df.head()

In [1]:
test_df.info()

In [1]:
test_df.describe().T

In [1]:
train_df.hist(bins = 60, figsize = (20,17), color='magenta')

**The histograms above show the distribution of each feature in the training data.**

# **Analysis of Target Variable**

In [1]:
plt.figure(figsize=(8,5))
sns.countplot(x='Survived', data=train_df, order=[0, 1] )

In [1]:
a = sns.countplot(y='Survived',hue='Sex', data=train_df, order=[0,1])

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., title='Sex')
plt.ylabel('Survived')
plt.show()

In [1]:
b = sns.countplot(y='Survived',hue='Embarked', data=train_df, order=[0,1])

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., title='Embarked')
plt.ylabel('Survived')
plt.show()

In [1]:
c = sns.countplot(y='Survived',hue='Pclass', data=train_df, order=[0,1])

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., title='Pclass')
plt.ylabel('Survived')
plt.show()

In [1]:
d = sns.countplot(y='Survived',hue='SibSp', data=train_df, order=[0,1])

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., title='SibSp')
plt.ylabel('Survived')
plt.show()

In [1]:
e = sns.countplot(y='Survived',hue='Parch', data=train_df, order=[0,1])

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., title='Parch')
plt.ylabel('Survived')
plt.show()

In [1]:
fig, ax = plt.subplots(figsize=(10, 8))  
sns.violinplot(x='Survived', y='Age', data=train_df, ax=ax)  
ax.set_title('Age Vs Survived')  
plt.show()  

In [1]:
# Using Strip plot to visualize the Age feature impact on Survived.  
fig, ax= plt.subplots(figsize=(10, 8))  
sns.stripplot(x='Survived', y='Age', data=train_df, jitter=True, ax=ax)  # Keyword arguments; recent seaborn versions reject positional x and y.
ax.set_title('Age Vs Survived')  
plt.show() 

**From the strip plot and violin plot of Age vs Survived above, we can see that:**
1.   **Most passengers aged between 60 and 80 died.**
2.   **Most young adults aged between 20 and 40 also died.**
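
Observations like these can be checked numerically by binning Age and computing the survival rate per bin. This is a minimal sketch on made-up rows; the bin edges in `age_bins` are a hypothetical choice, and the printed rates say nothing about the real dataset.

```python
import pandas as pd

# Toy rows, not taken from the Titanic data.
toy_df = pd.DataFrame({
    "Age":      [5, 8, 22, 25, 31, 38, 45, 52, 63, 71],
    "Survived": [1, 1,  0,  1,  0,  0,  1,  0,  0,  0],
})

age_bins = [0, 20, 40, 60, 80]  # hypothetical bin edges
toy_df["AgeBand"] = pd.cut(toy_df["Age"], bins=age_bins)

# count = passengers per band, mean = share of survivors per band.
summary = toy_df.groupby("AgeBand", observed=False)["Survived"].agg(["count", "mean"])
print(summary)
```

Applying the same `groupby` to `train_df` would put numbers next to what the plots suggest.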


In [1]:
fig, ax = plt.subplots(figsize=(10, 8))  
sns.violinplot(x='Survived', y='Fare', data=train_df, ax=ax)  
ax.set_title('Fare Vs Survived')  
plt.show()

In [1]:
# Using Strip plot to visualize the Fare feature impact on Survived.
fig, ax= plt.subplots(figsize=(10, 8))  
sns.stripplot(x='Survived', y='Fare', data=train_df, jitter=True, ax=ax)  # Keyword arguments; recent seaborn versions reject positional x and y.
ax.set_title('Fare Vs Survived')  
plt.show() 

**From the strip plot and violin plot of Fare vs Survived above, we can see that:**


1.   **Passengers with the most expensive tickets (around 500) mostly survived.**
2.   **Passengers with the cheapest tickets (around 0) mostly died.**
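
A quick way to back up a Fare comparison is to summarise Fare separately for each outcome. The sketch below uses invented values, so only the technique carries over to `train_df`, not the numbers.

```python
import pandas as pd

# Toy rows, not taken from the Titanic data.
toy_df = pd.DataFrame({
    "Fare":     [0.0, 7.25, 7.9, 8.05, 13.0, 26.0, 53.1, 71.3, 263.0, 512.3],
    "Survived": [0,   0,    0,   1,    0,    1,    1,    1,    0,     1],
})

# One row per outcome: group size, median fare, and highest fare.
fare_by_outcome = toy_df.groupby("Survived")["Fare"].agg(["count", "median", "max"])
print(fare_by_outcome)
```

The median is used here because a few very high fares would pull the mean upwards.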



## **Correlation Matrix and Heat Map**



In [1]:
corr_data = train_df.corr()                       # calculating correlation data between features
plt.figure(figsize=(19, 17))                      # setting figure size
sns.set_style('ticks')                            # setting plot style
sns.heatmap(corr_data, cmap='viridis',annot=True)                # plotting heatmap using sns library
plt.show()

In [1]:
# Ranking the features by absolute correlation with the target variable "Survived";
# iloc[1:8] skips the first entry, which is Survived itself.
corr_data.Survived.abs().sort_values(ascending=False).iloc[1:8][::-1].plot(kind='barh', color='purple')
plt.title("Top Correlated Features", size=20, pad=26)
plt.xlabel("Correlation coefficient")
plt.ylabel("Features")

In [1]:
train_df_X = train_df[['Sex', 'Embarked', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']]
train_df_X.head(n=5)

In [1]:
train_df_y = train_df.Survived
train_df_y.head()

In [1]:
# Splitting train_df into 70% training data and 30% validation data.
trainX, valX, trainy, valy = train_test_split(train_df_X, train_df_y,test_size=0.3, random_state=12) 

In [1]:
trainX.shape

In [1]:
trainy.shape

In [1]:
valX.shape

In [1]:
valy.shape
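
Because the two classes of Survived are not equally frequent, it can help to keep their ratio the same in both parts of the split. The sketch below shows the `stratify` option of `train_test_split` on synthetic data; the notebook's own split does not use it, so this is an optional variation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and a 60/40 target, standing in for train_df_X and train_df_y.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"Age": rng.uniform(1, 80, size=100),
                      "Fare": rng.uniform(0, 100, size=100)})
toy_y = pd.Series([0] * 60 + [1] * 40, name="Survived")

# stratify=toy_y keeps the class ratio identical in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=12, stratify=toy_y)

print(X_tr.shape, X_val.shape)
print(y_tr.mean(), y_val.mean())
```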

In [1]:
#Creating a Logistic Regression Classifier
LogisticRegression_Model = LogisticRegression(penalty='l2',solver='newton-cg')
#Train the model using the training sets
LogisticRegression_Model.fit(trainX, trainy)

In [1]:
#Creating a svm Classifier
SVM_Model = svm.SVC(kernel='linear', probability=True) # Linear Kernel
#Train the model using the training sets
SVM_Model.fit(trainX, trainy)

In [1]:
lr_list = [0.5, 0.6, 0.7, 0.71, 0.75, 0.9, 1]

for learning_rate in lr_list:
    GBM_Model = GradientBoostingClassifier(n_estimators=20, learning_rate=learning_rate, max_features=2, max_depth=2, random_state=0)
    GBM_Model.fit(trainX, trainy)

    print("Learning rate: ", learning_rate)
    print("Accuracy score (Training): {0:.3f}".format(GBM_Model.score(trainX, trainy)))
    print("Accuracy score (Validation): {0:.3f}".format(GBM_Model.score(valX, valy)))   
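
The loop above picks the learning rate by looking at the validation accuracy. A related approach, sketched here on synthetic data from `make_classification`, is cross-validated search with `GridSearchCV`, which scores every learning rate on several folds instead of a single split. The data and fold count are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data with seven features, standing in for trainX and trainy.
toy_X, toy_y = make_classification(n_samples=200, n_features=7, random_state=0)

# Same candidate learning rates as in the loop above.
param_grid = {"learning_rate": [0.5, 0.6, 0.7, 0.71, 0.75, 0.9, 1.0]}
base_model = GradientBoostingClassifier(
    n_estimators=20, max_features=2, max_depth=2, random_state=0)

# 5-fold cross-validation; each learning rate is scored on every fold.
search = GridSearchCV(base_model, param_grid, cv=5, scoring="accuracy")
search.fit(toy_X, toy_y)

print(search.best_params_, round(search.best_score_, 3))
```

With this approach the validation data stays untouched until the final comparison of models.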

In [1]:
#Creating a GBM Classifier
GBM_Model = GradientBoostingClassifier(n_estimators=20, learning_rate=0.71, max_features=2, max_depth=2, random_state=0)
#Train the model using the training sets
GBM_Model.fit(trainX, trainy)

In [1]:
#Creating a Gaussian Classifier
NB_Model = GaussianNB()
# Train the model using the training sets
NB_Model.fit(trainX, trainy)

In [1]:
#Creating a Decision Tree Classifier
DT_Model = DecisionTreeClassifier(criterion = "entropy",splitter = "best", random_state = 100, max_depth=3, min_samples_leaf=5)  
# Train the model using the training sets
DT_Model.fit(trainX, trainy) 

**Perform Prediction on Validation Data:**

In [1]:
LogisticRegression_predictions = LogisticRegression_Model.predict(valX)
SVM_predictions = SVM_Model.predict(valX)
NB_predictions = NB_Model.predict(valX)
DT_predictions = DT_Model.predict(valX)
GBM_predictions = GBM_Model.predict(valX)

# **Evaluation**

In [1]:
print("Logistic Regression_Confusion Matrix:")
print(confusion_matrix(valy, LogisticRegression_predictions))

print("Logistic Regression_predictions_Classification Report")
print(classification_report(valy, LogisticRegression_predictions))

In [1]:
print("SVM_Confusion Matrix:")
print(confusion_matrix(valy, SVM_predictions))

print("SVM_Classification Report")
print(classification_report(valy, SVM_predictions))

In [1]:
print("Naive Bayes Confusion Matrix:")
print(confusion_matrix(valy, NB_predictions))
print("Naive Bayes Classification Report")
print(classification_report(valy, NB_predictions))

In [1]:
print("DT_Confusion Matrix:")
print(confusion_matrix(valy, DT_predictions))

print("DT_Classification Report")
print(classification_report(valy, DT_predictions))

In [1]:
print("GBM_Confusion Matrix:")
print(confusion_matrix(valy, GBM_predictions))

print("GBM_Classification Report")
print(classification_report(valy, GBM_predictions))
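
For a binary target, the numbers in the classification report follow directly from the four cells of the confusion matrix. The sketch below works through that arithmetic on a short hand-made label list, not on the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For labels {0, 1} the matrix is [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted survivors, how many survived
recall = tp / (tp + fn)      # of the actual survivors, how many were found

print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```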

**Of the five models, GBM_Model (Gradient Boosting Machine) gives the best validation accuracy (82%), so we use it to predict on the test data.**
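
A comparison like this can be made explicit by collecting the validation accuracy of every model in one place. The following is a minimal sketch on synthetic data; the `candidates` dictionary is hypothetical, SVM is left out only to keep the example fast, and the resulting scores do not reproduce the notebook's figures.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the Titanic features and target.
toy_X, toy_y = make_classification(n_samples=300, n_features=7, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=12)

# Hypothetical set of candidate models.
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "GBM": GradientBoostingClassifier(n_estimators=20, random_state=0),
    "NaiveBayes": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(max_depth=3, random_state=100),
}

# Fit each model and record its accuracy on the validation part.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_val, model.predict(X_val))

best_name = max(scores, key=scores.get)
print(scores)
print("Best model:", best_name)
```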

In [1]:
GBM_predictions_On_Test_Data = GBM_Model.predict(test_df[train_df_X.columns])   # Same features, in the same order, as used for training.



# **Predictions on Test Data:**

In [1]:
Output_DF = pd.DataFrame({'PassengerId':test_df_PassengerId,'Survived':GBM_predictions_On_Test_Data})

In [1]:
#Save to csv
Output_DF.to_csv('Titanic_pred.csv',index=False)
Output_DF.head()

Colab link for the same notebook:
https://colab.research.google.com/drive/16PzOiBXX5ay89N_nq9jT7ofbIbiOIJHd?usp=sharing

**Thank you**,<br>
Nikunj Bansal,<br>
R177218063,<br>
B2 Batch<br>