# Welcome to Simple guide Kernel

I am fairly new to the Kaggle scene, and this is my first proper Kaggle notebook. The Titanic dataset is a prime candidate for introducing machine learning, and many newcomers to Kaggle, like me, start out here.
The objective of this notebook is to follow a step-by-step workflow, explaining each step.

I hope that anyone, regardless of their Python skills, can find something useful and helpful.
Please feel free to leave comments on how I can improve.

<h2 style="color:blue"><center>Don't forget to upvote if you like it! It's free!!</center></h2>

## Table of contents:

* About RMS Titanic
* All about Data
* Import Necessary Libraries
* Load the data
* Data analysis
* Handle Missing Values
* Data Exploration/ Visualizing
* Correlation & Matrix
* Feature Engineering
* Predictive Modeling
> 1. Logistic Regression
> 2. KNN Classifier
> 3. Gaussian Naive Bayes
> 4. Support Vector Machine(SVM)
> 5. Random Forest
> 6. Decision Tree
* Confusion Matrix

### About RMS Titanic

The Titanic is often referred to as 'RMS Titanic' because RMS stands for Royal Mail Ship.
The RMS Titanic, a luxury steamship, sank in the early hours of April 15, 1912, off the coast of Newfoundland in the North Atlantic after sideswiping an iceberg during its maiden voyage. Of the 2,240 passengers and crew on board, more than 1,500 lost their lives in the disaster.

We will go through the whole process of creating a machine learning model on the famous Titanic dataset, which is used by many people all over the world. It provides information on the fate of passengers on the Titanic, summarized according to economic status (class), sex, age and survival. In this challenge, we will predict whether a passenger on the Titanic survived or not.

<img src="https://i.ibb.co/WFKW312/titanic-conspiracy-rms-olympic-gettyimages-1055101284.jpg" alt="titanic" style="width:700px;height:400px;">

## All about Data 


<span style='font-family:Georgia'>
    <table>
        <tr>
            <th>Variable</th>
            <th>Key</th>
            <th>Definition</th>
        </tr>
        <tr>
            <td>survival</td>
            <td>0 = No, 1 = Yes</td>
            <td>Whether person survived or not</td>
        </tr>
        <tr>
            <td>pclass</td>
            <td>1 = 1st, 2 = 2nd, 3 = 3rd</td>
            <td>1st = Upper,2nd = Middle,3rd = Lower</td>
        </tr>
        <tr>
            <td>sex</td>
            <td>male,female</td>
            <td>sex of the passenger</td>
        </tr>
        <tr>
            <td>Age</td>
            <td>Continuous variable</td>
            <td>Age in years</td>
        </tr>
        <tr>
            <td>sibsp</td>
            <td>numeric values</td>
            <td># siblings / spouses aboard the Titanic<br>Sibling = brother, sister, stepbrother, stepsister<br>Spouse = husband, wife</td>
        </tr>
        <tr>
            <td>parch</td>
            <td>numeric values</td>
            <td># parents / children aboard the Titanic<br> Parent = mother, father <br> Child = daughter, son, stepdaughter, stepson <br>Some children travelled only with a nanny, therefore parch=0 for them</td>
        </tr>
        <tr>
            <td>ticket</td>
            <td>alphanumeric values</td>
            <td>Ticket number</td>
        </tr>
        <tr>
            <td>fare</td>
            <td>numeric values</td>
            <td>Passenger fare</td>
        </tr>
        <tr>
            <td>cabin</td>
            <td>alphanumeric values</td>
            <td>Cabin number</td>
        </tr>
        <tr>
            <td>embarked</td>
            <td>C = Cherbourg, Q = Queenstown, S = Southampton</td>
            <td>Port of embarkation</td>
        </tr>
    </table>  
</span>
    

## Import  Necessary Libraries

In [None]:
# Pandas library in python to read the csv file.
import pandas as pd

# NumPy for numerical computations
import numpy as np

# data visualization
import missingno as msno
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Algorithms
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB
 
from sklearn.metrics import accuracy_score  #for accuracy_score
from sklearn.model_selection import KFold #for K-fold cross validation
from sklearn.model_selection import cross_val_score #score evaluation
from sklearn.model_selection import cross_val_predict #prediction
from sklearn.metrics import confusion_matrix #for confusion matrix

##  Load the Data

In [None]:
# Create a pandas dataframe and assign it to variable.
titanic = pd.read_csv('../input/titanic/train.csv')
titanic_test = pd.read_csv('../input/titanic/test.csv')

## Data Analysis

In [None]:
# Print first 5 rows of the dataframe.
titanic.head()

In [None]:
# Print Last 5 rows of the dataframe.
titanic_test.tail() 

There is no Survived column here; it is the target variable we are trying to predict.



In [None]:
# gives the shape of the dataset as (rows, columns)
titanic.shape

In [None]:
# Describe gives us statistical information about numerical columns in the dataset
titanic.describe()

The count row shows whether a column has missing values: here 'Age' has fewer non-null entries than the other columns, so it has missing values.

We can also see that about 38% of the passengers in the training set survived.

Passenger ages range from about 0.4 to 80 years.

In [None]:
# unique values or range for feature set
print('Genders:', titanic['Sex'].unique())
print('Embarked:', titanic['Embarked'].unique())
print('Pclass:', titanic['Pclass'].unique())
print('Survived:', titanic['Survived'].unique())
print('SibSp Range:', titanic['SibSp'].min(),'-',titanic['SibSp'].max())
print('Parch Range:', titanic['Parch'].min(),'-',titanic['Parch'].max())
print('Family size range:', (titanic['Parch']+titanic['SibSp']).min(),'-',(titanic['Parch']+titanic['SibSp']).max())
print('Fare Range:', titanic['Fare'].min(),'-',titanic['Fare'].max())

In [None]:
# info method provides information about dataset like 
# total values in each column, null/not null, datatype, memory occupied etc
titanic.info()

Embarked and Cabin also have missing values.

## Missing Values

First we will visualize the missing values to see which columns contain them.

In [None]:
msno.matrix(titanic)

We can see that Age, Embarked and Cabin have missing values. Now let's check the missing values in the test data.

In [None]:
msno.matrix(titanic_test)

Cabin, Age and Fare have missing values in the test data.

In [None]:
# Let's write a function to print the count and percentage of missing values.
# (Writing a small function like this is a good exercise for beginners.)

# This function takes a DataFrame (df) as input and returns two columns:
# the total number of missing values and the percentage of missing values.
def missing_data(df):
    total = df.isnull().sum().sort_values(ascending = False)
    percent = round(df.isnull().sum().sort_values(ascending = False) * 100 /len(df),2)
    return pd.concat([total,percent], axis = 1 ,keys = ['total','percent'])

In [None]:
missing_data(titanic)

Now let's check the missing values in the test data.

In [None]:
# check missing values in test dataset
missing_data(titanic_test)

Next we will see how to deal with these missing values.

In [None]:
# COMPLETING: complete or delete missing values in train and test dataset
dataset = [titanic,titanic_test]

for data in dataset:
    # complete missing Age with the median
    # (assigning the result back avoids chained in-place fillna, which newer pandas versions warn about)
    data['Age'] = data['Age'].fillna(data['Age'].median())
    
    # complete Embarked with the mode
    data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
    
    # complete missing Fare with the median
    data['Fare'] = data['Fare'].fillna(data['Fare'].median())

In [None]:
missing_data(titanic)

### Note: Column "Cabin" has more than 75% missing values in both the train and test datasets.
#### Suggestion: Do not impute columns that have more than 40% missing data.
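
The 40% rule of thumb above can be turned into a small helper. This is only a sketch: the function name `columns_above_missing_threshold` and the toy DataFrame are made up for illustration, and the threshold is a judgment call rather than a fixed rule.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the Titanic data, values are invented.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

def columns_above_missing_threshold(df, threshold=40.0):
    """Return the columns whose percentage of missing values exceeds `threshold`."""
    percent_missing = df.isnull().mean() * 100
    return percent_missing[percent_missing > threshold].index.tolist()

too_sparse = columns_above_missing_threshold(toy)
print(too_sparse)
```

On the toy data only `Cabin` (75% missing) crosses the threshold, so it would be the candidate to drop.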

In [None]:
titanic.drop(['Cabin'], axis=1, inplace = True)
titanic_test.drop(['Cabin'],axis=1,inplace=True)

In [None]:
titanic.head()

In [None]:
titanic_test.head()

In [None]:
missing_data(titanic)

## Data Exploration / Visualization

### 1. Survival Analysis

Let's plot how many passengers in the training set survived.

In [None]:
# rename_axis + reset_index(name=...) gives 'Survived' and 'count' columns regardless of pandas version
net_Survived = titanic['Survived'].value_counts().rename_axis('Survived').reset_index(name='count')

In [None]:
fig = go.Figure([go.Pie(labels=net_Survived['Survived'], values=net_Survived['count'])])

fig.update_traces(hoverinfo='label+percent', textinfo='value+percent', textfont_size=15,insidetextorientation='radial')

fig.update_layout(title="Travellers survived on titanic",title_x=0.5)
fig.show()

Here, 549 travellers died in the tragedy and 342 survived, so about 38% of the travellers in the training set survived.

### 2. Gender Analysis

Let's check how many males and females survived on the Titanic.

In [None]:
# Count survivors by sex (the variable name is kept because the next cell uses it)
age_analysis = titanic[titanic['Survived']==1]['Sex'].value_counts().rename_axis('Sex').reset_index(name='count')

In [None]:
fig = go.Figure(go.Bar(x=age_analysis['Sex'],y=age_analysis['count']))
fig.update_layout(autosize=False,width=400,height=500,title_text='Analysis of Survived travellers by gender',xaxis_title="sex",yaxis_title="count",paper_bgcolor="lightsteelblue")
fig.show()

Out of 342 surviving travellers, 233 are female and 109 are male. Females appear to have a higher chance of survival than males, but to be sure we should check the total number of males and females.

In [None]:
def draw(graph):
    for p in graph.patches:
        height = p.get_height()
        graph.text(p.get_x()+p.get_width()/2., height + 5,height ,ha= "center")

In [None]:
sns.set(style="darkgrid")
plt.figure(figsize = (5, 6))
x = sns.countplot(x='Sex', data=titanic)
draw(x)

We can clearly see that females had a higher chance of survival than males: of 577 males only 109 survived, while of 314 females 233 survived.

### 3. Embarked and Fare Analysis

Embarked is the port at which a passenger boarded the Titanic.

There are three ports of embarkation:

* C = Cherbourg

* Q = Queenstown

* S = Southampton

In [None]:
plt.figure(figsize = (10, 6))
graph  = sns.countplot(y = "Embarked", hue ="Survived", data = titanic)
for p in graph.patches:
        Total = '{:,.0f}'.format(p.get_width())
        x = p.get_x() + p.get_width() + 0.02
        y = p.get_y() + p.get_height()/2
        graph.annotate(Total, (x, y))

Embarked seems to be correlated with survival, depending on gender.

But are all of these parameters really important for a passenger's survival?
Is there a correlation between port of embarkation and survival chances?

To answer these questions, we plot some other graphs involving Embarked, Fare, class and age.


In [None]:
FGrid = sns.FacetGrid(titanic, row='Pclass', aspect=2)
FGrid.map(sns.pointplot, 'Embarked', 'Survived', 'Sex', palette=None,  order=None, hue_order=None)
FGrid.add_legend()

Passengers who embarked at ports S and C had higher survival chances. We can also see that women had a high survival probability.

Based on this, Embarked looks to be correlated with survival, depending on Pclass.
Travellers in Pclass 1 had a higher chance of survival.

But does the port where a passenger started the journey really matter, or is it more important simply that the passenger was on the Titanic, wherever they boarded?

We know that **at 2:20 a.m. on April 15, 1912, the British ocean liner Titanic sank into the North Atlantic Ocean.**
It was night time and very cold **(the temperature of the water was -2.2 degrees Celsius when the Titanic was sinking)**, and where people were on the ship was determined by their work location or cabin allocation (passenger class), not by their port of embarkation.

We could use Embarked as a feature to get higher accuracy, but logically it should not matter, so we drop it.

In data science you have to look at a problem from every angle: a feature can look statistically important without being logically meaningful, which is why domain knowledge matters for feature selection.

In [None]:
titanic.drop(['Embarked'], axis=1, inplace = True)
titanic_test.drop(['Embarked'],axis=1,inplace=True)

### 4. Age Analysis

In [None]:
titanic=titanic.dropna()
titanic['age_category']=np.where((titanic['Age']<19),"below 19",
                                 np.where((titanic['Age']>18)&(titanic['Age']<=30),"19-30",
                                    np.where((titanic['Age']>30)&(titanic['Age']<=50),"31-50",
                                                np.where(titanic['Age']>50,"Above 50","NULL"))))
age = titanic['age_category'].value_counts().rename_axis('age_category').reset_index(name='Count')

In [None]:
titanic_age = titanic['age_category'].value_counts().rename_axis('age_category').reset_index(name='count')

In [None]:
colors=['pink','teal','orange','green']
fig = go.Figure([go.Pie(labels=titanic_age['age_category'], values=titanic_age['count'])])
fig.update_traces(hoverinfo='label+percent', textinfo='percent+label', textfont_size=15,
                 marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.update_layout(title="Titanic Age Categories",title_x=0.5)
fig.show()

**Let's make this clearer with another graph.**

In [None]:
titanic['survived_or_not']=np.where(titanic['Survived']==1,"Survived",np.where(titanic['Survived']==0,"Died","null"))

sun_df=titanic[['Sex','survived_or_not','age_category','Fare']].groupby(['Sex','survived_or_not','age_category']).agg('sum').reset_index()

In [None]:
fig = px.sunburst(sun_df, path=['Sex','survived_or_not','age_category'], values='Fare')
fig.update_layout(title="Titanic dataset distribution by Drilldown (Sex, Survived, Age Categories)",title_x=0.5)
fig.show()

#### How is Age distributed among travellers?

In [None]:
sur_age=titanic[titanic['Survived']==1]['Age']
un_age=titanic[titanic['Survived']==0]['Age']

In [None]:
fig = go.Figure(go.Box(y=sur_age,name="Age")) 
fig.update_layout(title="Distribution of Age by Survived travellers", autosize=False, width=600, height=700)
fig.show()

We plot this graph to check for outliers in the Age column.

Here we can see that the median age of the survivors is close to 30,

and that most survivors are between roughly 22 and 35 years old.

#### What is an outlier?
An outlier is an observation that lies far away from, and diverges from, the overall pattern in a sample.

<img src="https://i.ibb.co/HNjCZ4s/images-mod1-spread11.gif" alt="outlier" width="500" height="350">

In our plot, the blue points above the upper fence are all outliers.
This means the typical age range is 0.42 to 56, and ages above 56 are rare: 58, 60, 62, 63 and 80.

The plot also shows the min, max, median and quartile values.
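
The fences of a box plot are commonly placed 1.5 times the interquartile range (IQR) beyond the first and third quartiles. The sketch below illustrates that rule with made-up ages; the helper name `iqr_fences` is hypothetical, and plotting libraries may compute quartiles slightly differently, so the exact fence values can differ from those in the plot.

```python
import statistics

def iqr_fences(values):
    """Return (lower_fence, upper_fence) using the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Made-up ages for illustration only, not taken from the Titanic data.
ages = [2, 19, 22, 24, 27, 28, 30, 33, 35, 80]

lower, upper = iqr_fences(ages)
outliers = [a for a in ages if a < lower or a > upper]
print(lower, upper, outliers)
```

Any value outside the two fences is flagged as an outlier, which is what the isolated points in the box plot represent.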

##### Let's check the same for passengers who did not survive.

In [None]:
fig = go.Figure(go.Box(y=un_age,name="Age")) 
fig.update_layout(title="Distribution of Age by travellers who did not survive", autosize=False, width=600, height=700)
fig.show()

### 4. Passenger Class (Pclass) Analysis

We will check whether passenger class (upper or lower) affects the survival rate.

In [None]:
ax = sns.countplot(y="Pclass", hue="Survived", data=titanic, palette="Set1")
for p in ax.patches:
        Total = '{:,.0f}'.format(p.get_width())
        x = p.get_x() + p.get_width() + 0.02
        y = p.get_y() + p.get_height()/2
        ax.annotate(Total, (x, y))

Here we see that Pclass contributes to a person's chance of survival, especially if the person is in class 1.

### 5. SibSp and Parch Analysis

SibSp and Parch make more sense when considered together: parents would not let their children die, and blood relatives tend to help each other first, looking after their own family members before helping others.

We create a new feature, Family (family size), as the combination of SibSp and Parch.

In [None]:
# collect train and test in one list so the same steps can be applied to both; we will use it again in data preprocessing
all_data=[titanic,titanic_test]

for dataset in all_data:
    dataset['Family'] = dataset['SibSp'] + dataset['Parch'] + 1

In [None]:
sns.set(style="darkgrid")
plt.figure(figsize = (7, 6))
x = sns.countplot(x='Family', data=titanic)
draw(x)

In [None]:
surfamily_size = titanic[titanic['Survived'] == 1]

In [None]:
fig = go.Figure(data=go.Violin(y=surfamily_size['Family'],
                               marker_color="blue",
                               x0='Family size'))

fig.update_layout(title="Survived travellers family size")
fig.show()

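We can see that most survivors travelled alone or in a small family.

Note that this plot shows how survivors are distributed over family sizes rather than survival rates, so it should be read together with the same plot for travellers who did not survive.

**In the same way we can check the family size of travellers who did not survive.**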

In [None]:
unfamily_size = titanic[titanic['Survived'] == 0]

In [None]:
fig = go.Figure(data=go.Violin(y=unfamily_size['Family'],
                               marker_color="blue",
                               x0='Family size'))

fig.update_layout(title="Unsurvived travellers family size")
fig.show()

**Now we will see whether Age, together with family size, affects the probability of survival.**

In [None]:
# factorplot was renamed to catplot in newer seaborn versions
axes = sns.catplot(x='Family', y='Age', hue='Survived',
                   data=titanic, aspect=2, kind='bar', orient='v', palette="Set2")

According to the plot above, if you are travelling alone or in a family of two and your age is around 30, your survival chance is almost 50%. If you have a family of five and all are around 30, your survival chance is very high.

We can also see that the survival chance is close to zero if your family has more than 7 members.

So the family and age features are very important.

In [None]:
# create bin for age features. 
for dataset in all_data:
    dataset['Age_bin'] = pd.cut(dataset['Age'], bins=[0,12,20,40,120], labels=['Children','Teenage','Adult','Elder'])

In [None]:
plt.figure(figsize = (8, 5))
# named age_plot to avoid shadowing Python's built-in bin()
age_plot = sns.countplot(x='Age_bin', hue='Survived', data=titanic,palette="Set1")
draw(age_plot)

Here, the survived count is highest for adults.
Children and teenagers have an almost equal chance of survival.

### 6. Fare Analysis

We are going to create bins for the different fare levels.

In [None]:
for dataset in all_data:
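    # include_lowest=True keeps fares of exactly 0 in the first bin instead of leaving them unbinned
    dataset['Fare_bin'] = pd.cut(dataset['Fare'], bins=[0,10,50,100,550], labels=['Low_fare','medium_fare','Average_fare','high_fare'], include_lowest=True)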
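
A quick illustration of how `pd.cut` assigns values to bins may help here. By default the intervals are open on the left and closed on the right, so a value equal to the lowest edge falls outside every bin unless `include_lowest=True` is passed. The fares below are made up for the example.

```python
import pandas as pd

# Made-up fares chosen to sit on or near the bin edges.
fares = pd.Series([0.0, 7.25, 10.0, 50.0, 512.33])
bins = [0, 10, 50, 100, 550]
labels = ["Low_fare", "medium_fare", "Average_fare", "high_fare"]

# Default: intervals are (0, 10], (10, 50], (50, 100], (100, 550]
default_bins = pd.cut(fares, bins=bins, labels=labels)

# include_lowest=True makes the first interval include 0 as well
inclusive_bins = pd.cut(fares, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"fare": fares, "default": default_bins, "inclusive": inclusive_bins}))
```

With the default settings the fare of 0 is left as a missing value, while with `include_lowest=True` it lands in `Low_fare`.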

Some insights from the Fare information:

* A larger family size doesn't necessarily indicate a higher fare.

* Passengers in classes 2 and 3 paid a fare of under 100.

* Most passengers paid a fare of under 50.

It seems that a passenger's fare depends on their travel class. Let's confirm this with a plot.

In [None]:
plt.figure(figsize = (8, 5))
sns.countplot(x='Pclass', hue='Fare_bin', data=titanic)

We can say that low fares appear only in 3rd class, medium-fare travellers are in all classes, and high-fare travellers are only in 1st class.

Pclass and Fare are correlated with each other, so we can drop one of them.

Thinking about it logically, the price paid for a ticket should not by itself determine the chance of survival.

So we should drop Fare. To be sure, we will create a correlation matrix first.

## Correlation

### What is correlation?

Correlation is a measure of how strongly two variables are related to each other.

**1. Positive correlation:**

A positive correlation is a relationship between two variables in which both variables move in the same direction:
one variable increases as the other increases, or one decreases as the other decreases. See the example of age vs. salary.

<img src="https://i.ibb.co/pKnm57p/age-vs-salary.png" alt="positive correlation" width="450" height="300" data-load="full" style="">

**2. Negative correlation:**

A negative correlation is a relationship between two variables in which the variables move in opposite directions:
one variable increases as the other decreases.

<img src="https://i.ibb.co/Jv85tfs/scatter-plot-negative-correlation.png" alt="negative correlation" width="400" height="300" data-load="full" style="">

**Note:** only the numeric features are compared, since correlation cannot be computed between strings.
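
To make the two cases concrete, here is a small sketch that computes the Pearson correlation coefficient by hand. The function `pearson_r` and all of the numbers are invented for illustration; in the notebook itself pandas computes the correlations.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: salary rises with age, hours of play fall with age.
age = [25, 30, 35, 40, 45]
salary = [30, 38, 45, 52, 61]
hours_of_play = [9, 7, 6, 4, 2]

r_positive = pearson_r(age, salary)
r_negative = pearson_r(age, hours_of_play)
print(round(r_positive, 3), round(r_negative, 3))
```

A value close to +1 indicates a strong positive correlation, a value close to -1 a strong negative one, and a value near 0 little linear relationship.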

In [None]:
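# restrict to numeric columns first; newer pandas raises an error if text columns are passed to corr()
pd.DataFrame(abs(titanic.select_dtypes(include='number').corr()['Survived']).sort_values(ascending = False))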

In [None]:
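# Generate a mask for the upper triangle (taken from seaborn example gallery)
corr = titanic.select_dtypes(include='number').corr()

# np.bool was removed from newer NumPy versions; the built-in bool works everywhere
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True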
plt.subplots(figsize = (14,8))
sns.heatmap(corr, 
            annot=True,
            mask = mask,
            cmap = 'Blues',
            linewidths=.9, 
            linecolor='white',
            vmax = 0.3,
            fmt='.2f',
            center = 0,
            square=True)
plt.yticks(rotation = 0)
plt.title("Correlation Matrix", y = 1,fontsize = 25, pad = 20);

## Feature Engineering

Feature engineering is the art of converting raw data into useful features. To get better performance, we can create new features based on the original features of our dataset.

First we will see which columns are not numeric, and then convert them into numeric data.

In [None]:
titanic.info()

We already have the Survived column, so we drop the "survived_or_not" column, along with "age_category".

In [None]:
drop_col= ["survived_or_not","age_category"]
titanic.drop(drop_col,axis=1,inplace=True)

In [None]:
# Convert the 'Sex' feature into numeric values.
genders = {"male": 0, "female": 1}

for dataset in all_data:
    dataset['Sex'] = dataset['Sex'].map(genders)
titanic['Sex'].value_counts()

In [None]:
for dataset in all_data:
    dataset['Age'] = dataset['Age'].astype(int)
    dataset.loc[ dataset['Age'] <= 15, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 15) & (dataset['Age'] <= 20), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 20) & (dataset['Age'] <= 26), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 26) & (dataset['Age'] <= 28), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 28) & (dataset['Age'] <= 35), 'Age'] = 4
    dataset.loc[(dataset['Age'] > 35) & (dataset['Age'] <= 45), 'Age'] = 5
    dataset.loc[ dataset['Age'] > 45, 'Age'] = 6
titanic['Age'].value_counts()

Since we created new features from existing ones, we remove the originals.

We drop SibSp and Parch because we now have Family, and likewise Age_bin because Age is now encoded as ordinal groups.
We also remove some other features: PassengerId, Ticket and Name, as well as Fare and Fare_bin.

In [None]:
for dataset in all_data:
    drop_column = ['Age_bin','Fare','Name','Ticket', 'PassengerId','SibSp','Parch','Fare_bin']
    dataset.drop(drop_column, axis=1, inplace = True)

## Predictive Modeling

After all the preprocessing, we are now ready for building and evaluating different Machine Learning models.

We have seen some insights from the data analysis. But with that, we cannot accurately predict whether a passenger will survive or die. So now we will predict whether a Passenger will survive or not using some great Classification Algorithms.

In [None]:
all_features = titanic.drop("Survived",axis=1)
Targete = titanic["Survived"]
X_train,X_test,y_train,y_test = train_test_split(all_features,Targete,test_size=0.3,random_state=0)
X_train.shape,X_test.shape,y_train.shape,y_test.shape

In [None]:
titanic.head()

### 1. Logistic Regression:

In [None]:
model = LogisticRegression()
model.fit(X_train,y_train)
prediction_lr=model.predict(X_test)
Log_acc = round(accuracy_score(y_test,prediction_lr)*100,2)
# shuffle=True is required when random_state is set; k=10 splits the data into 10 parts
kfold = KFold(n_splits=10, shuffle=True, random_state=22)
Log_cv_acc=cross_val_score(model,all_features,Targete,cv=kfold,scoring='accuracy')

print('The accuracy of the Logistic Regression is',Log_acc)
print('The cross validated score for Logistic Regression is:',round(Log_cv_acc.mean()*100,2))
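
K-fold cross validation, as used in the cell above, splits the data into k parts and lets each part act as the test set once while the remaining parts are used for training. The sketch below shows only the index bookkeeping in plain Python; `k_fold_indices` is a hypothetical helper written for illustration, not scikit-learn's implementation, and it does not shuffle.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs; every sample is in exactly one test fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so that fold sizes differ by at most 1.
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, n_splits=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

The cross-validated score reported for each model is the mean of the accuracies obtained on the k test folds.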

### 2. K Nearest Neighbor:

In [None]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)
Y_pred = knn.predict(X_test)
# note: acc_knn is the accuracy on the training data
acc_knn = round(knn.score(X_train, y_train) * 100, 2)
kfold = KFold(n_splits=10, shuffle=True, random_state=22)
# cross-validate the KNN classifier itself (not the earlier `model`)
result_knn=cross_val_score(knn,all_features,Targete,cv=kfold,scoring='accuracy')

print('The accuracy of the K Nearest Neighbors Classifier is',round(accuracy_score(y_test,Y_pred)*100,2))
print('The cross validated score for K Nearest Neighbors Classifier is:',round(result_knn.mean()*100,2))

### 3. Gaussian Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
model= GaussianNB()
model.fit(X_train,y_train)
prediction_gnb=model.predict(X_test)
nb_acc = round(accuracy_score(y_test,prediction_gnb)*100,2)
# shuffle=True is required when random_state is set
kfold = KFold(n_splits=12, shuffle=True, random_state=22)
result_gnb=cross_val_score(model,all_features,Targete,cv=kfold,scoring='accuracy')

print('The accuracy of the Gaussian Naive Bayes Classifier is',nb_acc)
print('The cross validated score for Gaussian Naive Bayes classifier is:',round(result_gnb.mean()*100,2))

### 4. Linear Support Vector Machine:

In [None]:
linear_svc = LinearSVC()
linear_svc.fit(X_train, y_train)

Y_pred = linear_svc.predict(X_test)

# note: acc_linear_svc is the accuracy on the training data
acc_linear_svc = round(linear_svc.score(X_train, y_train) * 100, 2)
kfold = KFold(n_splits=10, shuffle=True, random_state=22)
# cross-validate the SVM itself (not the earlier `model`)
result_svm=cross_val_score(linear_svc,all_features,Targete,cv=kfold,scoring='accuracy')

print('The training accuracy of the Support Vector Machines Classifier is',acc_linear_svc)
print('The cross validated score for Support Vector Machines Classifier is:',round(result_svm.mean()*100,2))

### 5. Random Forest:

In [None]:
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, y_train)

Y_prediction = random_forest.predict(X_test)

# note: acc_random_forest is the accuracy on the training data
acc_random_forest = round(random_forest.score(X_train, y_train) * 100, 2)

kfold = KFold(n_splits=10, shuffle=True, random_state=22) # k=10, split the data into 10 parts
# cross-validate the random forest itself (not the earlier `model`)
result_rm=cross_val_score(random_forest,all_features,Targete,cv=kfold,scoring='accuracy')

print('The training accuracy of the Random Forest Classifier is',acc_random_forest)
print('The cross validated score for Random Forest Classifier is:',round(result_rm.mean()*100,2))

### 6. Decision tree

In [None]:
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
Y_pred = decision_tree.predict(X_test)
# note: acc_decision_tree is the accuracy on the training data
acc_decision_tree = round(decision_tree.score(X_train, y_train) * 100, 2)

kfold = KFold(n_splits=10, shuffle=True, random_state=22) # k=10, split the data into 10 parts
# cross-validate the decision tree itself (not the earlier `model`)
result_dt=cross_val_score(decision_tree,all_features,Targete,cv=kfold,scoring='accuracy')

print('The training accuracy of the Decision Tree Classifier is',acc_decision_tree)
print('The cross validated score for Decision Tree Classifier is:',round(result_dt.mean()*100,2))

## Which is the best model?

In [None]:
results = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes',  
              'Decision Tree'],
    'Score': [acc_linear_svc, acc_knn, Log_acc, 
              acc_random_forest, nb_acc, acc_decision_tree]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Model')
result_df.head(9)

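Random Forest and Decision Tree have the highest scores in this table. Keep in mind that the scores for SVM, KNN, Random Forest and Decision Tree are measured on the training data, while those for Logistic Regression and Naive Bayes are measured on the held-out test split, so the cross-validated scores above give a fairer comparison.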

### Confusion Matrix

A confusion matrix, also called an error matrix, is a table that is often used to describe the performance of a classification model.
We will check it for the Random Forest model.


<img src="https://i.ibb.co/XJr46rH/Confusion-Matrix.png" alt="Confusion-Matrix" width="390" height="350">

In [None]:
predictions = cross_val_predict(random_forest, X_train, y_train, cv=3)
c_mat = confusion_matrix(y_train, predictions)
print(c_mat)

In [None]:
# we will see our confusion matrix in percentage.
sns.heatmap(c_mat/np.sum(c_mat), annot=True, 
            fmt='.2%', cmap='Blues')

Here, 333 travellers were correctly predicted as not survived (**true negatives**), and 50 travellers who did not survive were wrongly classified as survived (**false positives**).

167 travellers were correctly predicted as survived (**true positives**), and 75 travellers who survived were wrongly classified as not survived (**false negatives**).
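
The four cells of a binary confusion matrix can be counted directly from the true and predicted labels. The sketch below uses a hypothetical helper, `confusion_counts`, and made-up labels where 1 means survived and 0 means not survived.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # survived, predicted survived
        elif actual == 0 and predicted == 0:
            tn += 1  # did not survive, predicted not survived
        elif actual == 0 and predicted == 1:
            fp += 1  # did not survive, but predicted survived
        else:
            fn += 1  # survived, but predicted not survived
    return tn, fp, fn, tp

# Made-up labels for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(tn, fp, fn, tp)
```

In the matrix printed by scikit-learn for binary labels, rows are the actual classes and columns are the predicted classes, so the layout is `[[tn, fp], [fn, tp]]`.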

### Precision and Recall:

Precision is the percentage of predicted positives that are actually positive (here, predicted survivors who really survived); recall is the percentage of actual positives that the model correctly identifies.

In [None]:
from sklearn.metrics import precision_score, recall_score
print("Precision:", precision_score(y_train, predictions))
print("Recall:",recall_score(y_train, predictions))

### F-Score

The F-score is the harmonic mean of precision and recall. Note that the harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F-score if both recall and precision are high.
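
As a worked example, precision, recall and the F-score can be computed by hand from the counts quoted in the confusion matrix discussion above (167 true positives, 50 false positives, 75 false negatives). Treat the numbers as illustrative: a fresh run of the notebook will give somewhat different counts.

```python
# Counts quoted in the confusion matrix discussion above.
tp, fp, fn = 167, 50, 75

precision = tp / (tp + fp)  # share of predicted survivors who really survived
recall = tp / (tp + fn)     # share of actual survivors the model found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 score:  {f1:.3f}")
```

Because the harmonic mean is pulled towards the smaller of the two values, the F-score here sits closer to the recall than a simple average would.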

In [None]:
from sklearn.metrics import f1_score
f1_score(y_train, predictions)

<div class="alert alert-block alert-info">
    <span style='font-family:Georgia'>
       
## Conclusion

This was my very first Kaggle competition and climbing up the leaderboard one step at a time was definitely a really nice journey.

Thank you for taking the time to read through my first exploration of a Kaggle dataset.

Please feel free to leave a comment if you found this notebook useful or simply liked it. I would really appreciate it!

<h2 style="color:green"><center>Please upvote!</center></h2>
    </span>
</div>