### 20MAI0077 - Vivek Dadhich
> [Github repo Link](https://github.com/vivek20dadhich/dwm-ELA-CSE5021)

In [1]:
#import required packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")

In [2]:
#Read the dataset onto a variable

train = pd.read_csv('C:/Users/Vivek/Desktop/Machine Learning Techniques/titanic_data.csv')
train.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
# Select the response variable and the predictor (X) variables.
# Selecting several columns requires a list of names, hence the double brackets;
# train['Survived','Pclass',...] raises a KeyError.

df = train[['Survived','Pclass','Sex','Age','Fare']].copy()  # copy so later edits do not touch train

### Feature engineering #1

In [5]:
# Encoding - change male -> 1 and female -> 0 using lambda inline function

df['Sex'] = df['Sex'].apply(lambda sex:1 if sex=='male' else 0)

### Feature engineering #2

In [6]:

# Handling missing values - Data Imputation

print(df.isnull().sum())
# only age has missing values

df['Age'] = df['Age'].fillna(df['Age'].median())
print(df['Age'].isnull().sum())

Survived      0
Pclass        0
Sex           0
Age         177
Fare          0
dtype: int64
0
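
The imputation step above can be illustrated without pandas. The following is a minimal sketch on a made-up list of ages (the values are hypothetical, not taken from the dataset): the median is computed from the known entries only and then used to fill the gaps.

```python
from statistics import median

ages = [22.0, None, 38.0, 26.0, None, 35.0]   # made-up sample with gaps

# The median is computed from the known values only
known = [a for a in ages if a is not None]
fill_value = median(known)

# Replace every missing entry with that median
imputed = [fill_value if a is None else a for a in ages]
print(fill_value, imputed)
```

The median is a common choice for a skewed column such as age because a few extreme values move it much less than they move the mean.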


In [7]:
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare
0,0,3,1,22.0,7.25
1,1,1,0,38.0,71.2833
2,1,3,0,26.0,7.925
3,1,1,0,35.0,53.1
4,0,3,1,35.0,8.05


### Set the predictor and response variable

In [8]:
X = df.drop('Survived', axis = 1) # all except survived
Y = df['Survived']

### Splitting into training and test sets with train_test_split

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.3, random_state = 42)
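
To make the split less of a black box, here is a rough sketch of what a shuffled hold-out split does. `simple_train_test_split` is a hypothetical helper written for illustration; it is not scikit-learn's implementation and will not reproduce the same partition as `train_test_split`. The row count of 891 is an assumption about the dataset.

```python
import math
import random

def simple_train_test_split(rows, test_size=0.3, seed=42):
    """Shuffle the row indices and hold out a test_size fraction of them."""
    n_test = math.ceil(test_size * len(rows))   # test set size is rounded up
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)        # seeded, so the split is repeatable
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(891))   # assumed row count; stand-in for the real passengers
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Under that assumption the split gives 623 training rows and 268 test rows, which agrees with the total support of 268 in the classification reports.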

## *Logistic Regression*

In [10]:
# Create the logistic regression model

from sklearn.linear_model import LogisticRegression
logit = LogisticRegression()  # default settings

logit.fit(X_train, Y_train)   # the model learns the relationship between the features (X_train) and the labels (Y_train)

LogisticRegression()

In [11]:
# Compute predictions or y hat/y_pred

Y_pred = logit.predict(X_test) 
print(type(Y_pred)) 

<class 'numpy.ndarray'>


### Confusion matrix

In [12]:
from sklearn.metrics import confusion_matrix

# Store the result under a new name so the imported function is not overwritten
cm_lr = confusion_matrix(Y_test, Y_pred)
cm_lr

array([[134,  23],
       [ 32,  79]], dtype=int64)

In [13]:
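# 134 -> true negatives (passengers who did not survive, predicted as not surviving)
# 23 -> false positives (did not survive, predicted as surviving)
# 32 -> false negatives (survived, predicted as not surviving)
# 79 -> true positives (survived, predicted as surviving)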
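
As a small sketch of how the four cells are read off, the counts from the logistic regression confusion matrix can be unpacked by position and turned into the accuracy by hand. The variable names are chosen here for illustration.

```python
# Counts copied from the logistic regression confusion matrix
cm = [[134, 23],
      [32, 79]]

# scikit-learn puts actual classes on rows and predicted classes on columns
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # share of correct predictions

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy = {round(accuracy * 100, 2)}%")
```

The hand-computed accuracy, 213 correct predictions out of 268, rounds to 79.48%.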

### Accuracy Score

In [14]:
from sklearn.metrics import accuracy_score
acc_lr = round(accuracy_score(Y_test, Y_pred)*100,2)
acc_lr

79.48

### Classification report 

In [15]:
from sklearn.metrics import classification_report
report = classification_report(Y_test, Y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.81      0.85      0.83       157
           1       0.77      0.71      0.74       111

    accuracy                           0.79       268
   macro avg       0.79      0.78      0.79       268
weighted avg       0.79      0.79      0.79       268
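
The per-class figures in the report can be reproduced from the confusion matrix counts. This is a minimal sketch; `prf` is a hypothetical helper, not part of scikit-learn.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one class from its raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

# Counts from the logistic regression confusion matrix
tn, fp, fn, tp = 134, 23, 32, 79

# Class 1 (survived): survivors are the positive class
class_1 = prf(tp=tp, fp=fp, fn=fn)
# Class 0 (did not survive): swap the roles of the two classes
class_0 = prf(tp=tn, fp=fn, fn=fp)

print("class 0:", class_0)
print("class 1:", class_1)
```

For class 0 the same formulas apply with the roles of the classes swapped, so the true negatives act as true positives and the two error types trade places.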



<br></br>

## *Naive Bayes classification*

> A Naive Bayes classifier is a probabilistic machine learning model that’s used for classification tasks. The crux of the classifier is Bayes' theorem.

> Bayes' theorem: $P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$

Using Bayes' theorem, we can find the probability of A happening given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made is that the predictors/features are independent of one another given the class: the value of one feature tells us nothing about the value of another. Because this rarely holds exactly, the classifier is called naive.
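
A short worked example may help here. The probabilities below are hypothetical, chosen only to show the arithmetic; they are not estimated from the Titanic data. A is "the passenger survived" and B is "the passenger is female".

```python
# Hypothetical probabilities, chosen only to illustrate the arithmetic
p_survived = 0.4                 # P(A): prior probability of surviving
p_female_given_survived = 0.7    # P(B|A)
p_female_given_died = 0.15       # P(B|not A)

# P(B) by the law of total probability
p_female = (p_female_given_survived * p_survived
            + p_female_given_died * (1 - p_survived))

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_survived_given_female = p_female_given_survived * p_survived / p_female
print(round(p_survived_given_female, 3))
```

With these made-up numbers the evidence raises the probability of survival from the prior of 0.4 to roughly 0.76.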

*Assumptions*

> - All predictors are assumed to be independent of one another, given the class
> - All predictors are assumed to have an equal effect on the outcome
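
To show where the independence assumption enters, here is a minimal from-scratch sketch of a Gaussian Naive Bayes classifier. It is an illustration under simplifying assumptions, not scikit-learn's `GaussianNB` implementation: `fit_gaussian_nb` and `predict_one` are hypothetical helpers, the toy data is made up, and the small constant added to the variance is an arbitrary safeguard against division by zero.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior plus a per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (len(rows) / len(X), stats)
    return model

def predict_one(model, row):
    """Pick the class with the highest log posterior (up to a constant)."""
    scores = {}
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        # Independence assumption: per-feature log-likelihoods simply add up
        for value, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy data (made up): columns are [age, fare]
X = [[22, 7], [35, 8], [40, 9], [30, 70], [38, 80], [26, 60]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
high_fare = predict_one(model, [33, 75])
low_fare = predict_one(model, [33, 8])
print(high_fare, low_fare)
```

Because each feature contributes its own term to the score, no interaction between age and fare is ever modelled; that is the "naive" part.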

In [17]:
# training the model on training set 
from sklearn.naive_bayes import GaussianNB 
gnb = GaussianNB() 
gnb.fit(X_train, Y_train) 
  
# making predictions on the testing set
predicted = gnb.predict(X_test)

# comparing actual response values (Y_test) with predicted response values (predicted)
acc_nb = round(accuracy_score(Y_test, predicted)*100,2)
acc_nb

77.61

### Confusion matrix for naive bayes classifier

In [19]:
from sklearn.metrics import confusion_matrix

# The Naive Bayes predictions are stored in `predicted`
# Rows are actual classes, columns are predicted classes
cfm = confusion_matrix(Y_test, predicted)
print(cfm)

[[126  31]
 [ 29  82]]


### Classification report for Naive Bayes classifier

In [21]:
report_nb = classification_report(Y_test, predicted)
print(report_nb)

              precision    recall  f1-score   support

           0       0.81      0.80      0.81       157
           1       0.73      0.74      0.73       111

    accuracy                           0.78       268
   macro avg       0.77      0.77      0.77       268
weighted avg       0.78      0.78      0.78       268



## *Comparison*

In [23]:
results = pd.DataFrame({
    'Model': ['Logistic Regression',
              'Naive Bayes'],
    'Score': [acc_lr, acc_nb]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
result_df.head()

Score,Model
79.48,Logistic Regression
77.61,Naive Bayes


For the given dataset, with test_size = 0.3 and random_state = 42, the logistic regression model performs better: 79.48% accuracy against 77.61% for Naive Bayes.

Naive Bayes assumes that the features are conditionally independent given the class. Real data sets are never perfectly independent, but they can be close. In short, Naive Bayes has higher bias but lower variance than logistic regression. If the data set approximately satisfies the independence assumption, Naive Bayes can be the better classifier.

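Naive Bayes and logistic regression can both produce linear decision boundaries (for Gaussian Naive Bayes, only when each feature has the same variance in every class), but they arrive at them differently. Logistic regression is a discriminative model: it predicts the probability P(y|x) using a direct functional form. Naive Bayes is a generative model: it models how the data was generated for each class, P(x|y) together with P(y), and then applies Bayes' theorem to make a prediction.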
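
The "direct functional form" used by logistic regression is the sigmoid of a linear score. The sketch below shows the computation for one passenger; the weights and bias are hypothetical values picked for illustration, not the coefficients fitted above, and `predict_proba` here is a standalone helper rather than the scikit-learn method of the same name.

```python
import math

def predict_proba(features, weights, bias):
    """P(y = 1 | x) as the sigmoid of a linear score."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical coefficients for [Pclass, Sex, Age, Fare]; not the fitted values
weights = [-1.0, -2.5, -0.03, 0.002]
bias = 4.0

passenger = [3, 1, 22.0, 7.25]   # third class, male, 22 years old
p = predict_proba(passenger, weights, bias)
print(round(p, 3), "-> predicted class", int(p >= 0.5))
```

The fitted coefficients of the actual model can be inspected through `logit.coef_` and `logit.intercept_`.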