# Titanic Survival Prediction Tutorial for Beginners
Titanic is Kaggle's beginner competition, where the goal is to predict whether a passenger will survive or not.

![](https://media.giphy.com/media/jXJYVWquFTXTG/giphy.gif)

<center>Gif from [Giphy](https://giphy.com/gifs/titanic-jXJYVWquFTXTG)</center>

## Hello Everyone,
#### Welcome to this kernel
I started this kernel to help beginners understand the Titanic Kaggle challenge.

I hope that anyone, regardless of their machine learning and Python skills, can find something useful and helpful.

# <font color='red'> Don't forget to upvote if you like it! </font>

## Thanks and be safe!

## Contents

- [Import required libraries](#import-required-libraries)
- [Load Data](#load-data)
- [Looking into Training and Testing Data](#looking-into-training-and-testing-data)
- [EDA (Exploratory Data Analysis)](#EDA-exploratory-data-analysis)
- [Data Visualization](#data-visualization)
- [Model Prediction](#model-prediction)
- [Support Vector Machine](#support-vector-machine)
- [K-Nearest Neighbour](#k-nearest-neighbour)
- [Gaussian Naive Bayes](#gaussian-naive-bayes)
- [Linear SVC](#linear-svc)
- [Stochastic Gradient Descent](#stochastic-gradient-descent)
- [Decision Tree](#decision-tree)
- [Random Forest](#random-forest)

I am continuously updating this kernel, so I really appreciate your feedback.

If you have any questions, let me know in the comments. I am more than happy to answer.

## Import Required Libraries

First things first: we need to import all the necessary Python libraries.
I am going to import NumPy and Pandas for data analysis. For visualization I am going to use Matplotlib and Seaborn.

In [None]:
# data analysis
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# data visualization
import matplotlib.pyplot as plt # for data visualization
import seaborn as sns # for data visualization

sns.set_style('dark')

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Load Data

Once you are done with the libraries, the second step is to import the dataset. As you can see in the output of the cell above, there are 3 files in our input folder.
1. train.csv -- our training file.
2. test.csv -- using our machine learning model, we have to predict whether the passengers listed in this file will survive or not.
3. gender_submission.csv -- sample submission file.


So I am going to load train.csv and test.csv in different data frames.

In [None]:
# load train data
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
train_data.head()

In [None]:
# load test data
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()

# Looking into Training and Testing Data

In [None]:
print('='*50)
print("Columns in training data")
print('='*50)
print("\n")
print(train_data.columns.values)
print("\n")
print('='*50)
print("Columns in test data")
print('='*50)
print("\n")
print(test_data.columns.values)

From the column names above we can see that the test data doesn't have a Survived column. Filling it in is our task: for the test data we have to predict whether a given passenger will survive or not.

Let's have a look at what each column contains.

* PassengerId: A unique index for each passenger. It starts from 1 and increments by 1 for every new passenger.
* Survived: Shows if the passenger survived or not. 1 stands for survived and 0 stands for not survived.

* Pclass: Ticket class. 1 stands for First class ticket. 2 stands for Second class ticket. 3 stands for Third class ticket.

* Name: Passenger's name. The name also contains a title, for example "Mr" for a man, "Mrs" for a married woman, "Miss" for a girl or unmarried woman, and "Master" for a boy.

* Sex: Passenger's sex. It's either male or female.

* Age: Passenger's age. "NaN" values in this column indicate that the age of that particular passenger has not been recorded.

* SibSp: Number of siblings or spouses travelling with each passenger.

* Parch: Number of parents or children travelling with each passenger.

* Ticket: Ticket number.

* Fare: How much money the passenger has paid for the travel journey.

* Cabin: Cabin number of the passenger. "NaN" values in this column indicate that the cabin number of that particular passenger has not been recorded.

* Embarked: Port where the passenger boarded the ship (C = Cherbourg, Q = Queenstown, S = Southampton).
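
To make the note about titles in the Name column concrete, here is a minimal sketch of how a title could be pulled out of a name with a regular expression. The `names` series and the pattern are my own illustration, not something used later in this kernel, and the pattern assumes every name follows the "Surname, Title. Given names" layout.

```python
import pandas as pd

# A few sample names in the "Surname, Title. Given names" layout of the Name column
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
])

# Capture the text between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())
```

This kernel drops the Name column before modelling, so treat this only as a way to see what the column holds.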

Let's have a look at the shape of the data.

In [None]:
print('='*10)
print("Train data shape")
print('='*10)
print("\n")
print(train_data.shape)
print("\n")
print('='*10)
print("Test data shape")
print('='*10)
print("\n")
print(test_data.shape)

Describing the training dataset

The describe() method shows summary statistics such as count, mean, and standard deviation for the numeric columns.

In [None]:
print('='*50)
print("\nDescribe training data\n")
print('='*50) 
print("\n")
print(train_data.describe())

In [None]:
print("Describe test data")
print('='*50)
print(test_data.describe())

Info of training and test data

In [None]:
print('='*50)
print("\nTraining data info\n")
print('='*50)
train_data.info()  # info() prints its summary itself and returns None
print("\n")
print('='*50)
print("\n Test data info \n")
print('='*50)
print("\n")
test_data.info()  # info() prints its summary itself and returns None

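We can see that Age, Cabin and Embarked have missing values in the training data, and the test data is also missing one Fare value.

Embarked has only two missing values. Age has more, and Cabin is missing for most passengers. The next cell counts the null values in every column.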

In [None]:
print('='*50)
print('\nNumber of null values in train data\n')
print('='*50)
print('\n')
print(train_data.isnull().sum())
print('\n')
print('='*50)
print('\n Number of null values in test data\n')
print('='*50)
print("\n")
print(test_data.isnull().sum())

### Age Feature
One solution is to fill in the null values with the median age.

In [None]:
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].median())
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].median())
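
To show what median filling does on a small scale, here is a sketch on toy data. The frames `train_toy` and `test_toy` are made up for illustration. Note one difference from the cell above: this sketch reuses the training median for the test set, which is a common variation, whereas the kernel fills each set with its own median.

```python
import numpy as np
import pandas as pd

# Toy frames with missing ages (values are made up for illustration)
train_toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0, np.nan]})
test_toy = pd.DataFrame({"Age": [np.nan, 30.0, 40.0]})

# median() skips NaN values by default
train_median = train_toy["Age"].median()

# Fill both sets with the median learned from the training set
train_toy["Age"] = train_toy["Age"].fillna(train_median)
test_toy["Age"] = test_toy["Age"].fillna(train_median)

print(train_median)
print(train_toy["Age"].isnull().sum(), test_toy["Age"].isnull().sum())
```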

### Cabin Feature
I'll start off by dropping the Cabin feature since not a lot more useful information can be extracted from it.

In [None]:
train_data = train_data.drop(['Cabin'], axis = 1)
test_data = test_data.drop(['Cabin'], axis = 1)

### Ticket Feature
I will also drop the Ticket feature since it's unlikely to yield any useful information.

In [None]:
train_data = train_data.drop(['Ticket'], axis = 1)
test_data = test_data.drop(['Ticket'], axis = 1)

### Embarked Feature
There are two missing Embarked values in the train data. I fill them with 'S' (Southampton), the most common port in the data.

In [None]:
train_data['Embarked'] = train_data['Embarked'].fillna('S')
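
Rather than typing the fill value by hand, the most frequent value can be looked up with `mode()`. The sketch below uses a made-up series, `embarked_toy`, so it runs on its own; on the real data you would call the same methods on the Embarked column.

```python
import numpy as np
import pandas as pd

# Toy port codes with two missing entries (values are made up for illustration)
embarked_toy = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() ignores missing values and returns the most frequent value(s)
most_common = embarked_toy.mode()[0]
filled = embarked_toy.fillna(most_common)

print(most_common)
print(filled.tolist())
```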

### Fare Feature
Only the test data has a missing Fare value (just one), so I am going to fill it with the median.

In [None]:
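# assign the result back; calling fillna(inplace=True) on a selected column may not update the frame in newer pandas
test_data["Fare"] = test_data["Fare"].fillna(test_data["Fare"].median())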

In [None]:
# let's check the missing values again
print('='*50)
print('\nNumber of null values in train data\n')
print('='*50)
print('\n')
print(train_data.isnull().sum())
print('\n')
print('='*50)
print('\n Number of null values in test data\n')
print('='*50)
print("\n")
print(test_data.isnull().sum())

Now we don't have any missing values in the train and test data. Let's explore the data and then do some visualization.

# EDA (Exploratory Data Analysis)

To make some observations and assumptions, we need to quickly analyze some feature correlations by pivoting features against each other. As we cleaned our data, we are able to make this correlation for every feature.

#### Observation: 

- Out of 891 passengers, only 342 managed to survive, which indicates that the majority of passengers died.
- **Sex** Female passengers have a much higher chance of survival than male passengers.
- **Pclass** First class passengers have a higher chance of survival, which is >50%.
- **Embarked** Passengers who boarded the ship at Cherbourg (C) have the highest survival rate among the three ports.

In [None]:
# number of survived passengers
train_data.groupby(['Survived'])['Survived'].count()

In [None]:
# percentage of male and female who survived
train_data[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
# percentage of people who survived according to their ticket class
train_data[["Pclass", "Survived"]].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
# Percentage of survived people based on their embarked. 
train_data[["Embarked", "Survived"]].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
train_data[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
train_data[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)
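
The single-feature tables above can also be combined so that two features are pivoted against each other at once. Here is a minimal sketch using `pivot_table` on a made-up frame called `toy`; the numbers are illustrative and are not results from the Titanic data.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the training data
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 1, 3],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Mean of Survived for every Sex / Pclass combination (rows: Sex, columns: Pclass)
rates = toy.pivot_table(index="Sex", columns="Pclass", values="Survived", aggfunc="mean")
print(rates)
```

On the real data the same call would be made on `train_data` instead of `toy`.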

# Data Visualization

In [None]:
sns.countplot(x = 'Survived', data = train_data)

In [None]:
#draw a bar plot of survival by sex
sns.barplot(x="Sex", y="Survived", data=train_data)

In [None]:
#draw a bar plot of survival by ticket class
sns.barplot(x="Pclass", y="Survived", data=train_data)

In [None]:
#draw a bar plot of survival by port of embarkation
sns.barplot(x = "Embarked", y = "Survived", data = train_data)

In [None]:
#draw a bar plot of survival by number of parents/children aboard
sns.barplot(x="Parch", y="Survived", data=train_data)

In [None]:
# peaks for survived/not survived passengers by their age
facet = sns.FacetGrid(train_data, hue="Survived",aspect=4)
facet.map(sns.kdeplot, 'Age', fill=True)  # fill replaces the deprecated shade argument (seaborn >= 0.11)
facet.set(xlim=(0, train_data['Age'].max()))
facet.add_legend()

# average survived passengers by age
fig, axis1 = plt.subplots(1,1,figsize=(18,4))
average_age = train_data[["Age", "Survived"]].groupby(['Age'],as_index=False).mean()
sns.barplot(x='Age', y='Survived', data=average_age)

In [None]:
grid = sns.FacetGrid(train_data, col='Survived', row='Pclass')
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();

In [None]:
grid = sns.FacetGrid(train_data, col='Survived', row='Pclass')
grid.map(plt.hist, 'SibSp', alpha=.5, bins=20)
grid.add_legend();

In [None]:
grid = sns.FacetGrid(train_data, col='Survived', row='Pclass')
grid.map(plt.hist, 'Embarked', alpha=.5, bins=20)
grid.add_legend();

In [None]:
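# numeric_only=True skips text columns such as Name, which newer pandas would otherwise reject (needs pandas >= 1.5)
sns.heatmap(train_data.corr(numeric_only=True),annot=True,cmap='RdYlGn',linewidths=0.2)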
fig=plt.gcf()
fig.set_size_inches(20,12)
plt.show()

## Converting Categorical data to Numeric

Some of the features in our data hold categorical values, like Sex and Embarked. The models below need numbers, so we have to convert these values to numeric ones.

In [None]:
train_data['Sex'] = train_data['Sex'].map({'male':1, 'female':0})
test_data['Sex'] = test_data['Sex'].map({'male':1, 'female':0})

In [None]:
train_data['Embarked'] = train_data['Embarked'].map({'Q':2, 'S':1, 'C':0})
test_data['Embarked'] = test_data['Embarked'].map({'Q':2, 'S':1, 'C':0})
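
The mapping above gives the ports an order (C < S < Q) that does not really exist. One alternative, which this kernel does not use, is one-hot encoding with `pd.get_dummies`. Below is a small sketch on a made-up frame named `toy`.

```python
import pandas as pd

# Made-up sample of port codes
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One 0/1 indicator column per port instead of a single ordered number
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1, so no port is treated as larger or smaller than another.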

# Model Prediction
Now our data is ready for building a model to predict the outcome. There are plenty of predictive algorithms out there to try. However, ours is a classification problem, so I will try classification models.

### First import all required machine learning libraries

In [None]:
# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

### Prepare data for training and testing the models

In [None]:
X_train = train_data.drop(["Name", "Survived", "PassengerId"], axis=1)
Y_train = train_data["Survived"]
X_test  = test_data.drop(['Name',"PassengerId"], axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape

# Support Vector Machine

In [None]:
# Support Vector Machine
svc = SVC()
svc.fit(X_train, Y_train)
svm_Y_pred = svc.predict(X_test)
svc_accuracy = svc.score(X_train, Y_train)
svc_accuracy
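
SVMs and k-nearest neighbours work with distances, so a feature with a large range (such as Fare) can dominate features with a small one (such as Sex). A common remedy, not applied in this kernel, is to standardize the features first. The sketch below uses synthetic data from `make_classification` as a stand-in for the Titanic features, so the score it prints says nothing about this competition.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data, not the Titanic features
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X[:, 0] *= 100  # exaggerate the scale of one feature

# Scale every feature to zero mean and unit variance, then fit the SVM
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X, y)
train_acc = scaled_svc.score(X, y)
print(train_acc)
```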

# K-Nearest Neighbour

In [None]:
# k-nearest neighbor
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
knn_Y_pred = knn.predict(X_test)
knn_accuracy = knn.score(X_train, Y_train)
knn_accuracy

# Gaussian Naive Bayes

In [None]:
# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
guassian_Y_pred = gaussian.predict(X_test)
gaussian_accuracy = gaussian.score(X_train, Y_train)
gaussian_accuracy

# Linear SVC

In [None]:
# Linear SVC

linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
linear_svc_Y_pred = linear_svc.predict(X_test)
linear_svc_accuracy = linear_svc.score(X_train, Y_train)
linear_svc_accuracy

# Stochastic Gradient Descent

In [None]:
# Stochastic Gradient Descent

sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
sgd_Y_pred = sgd.predict(X_test)
sgd_accuracy = sgd.score(X_train, Y_train)
sgd_accuracy

# Decision Tree


In [None]:
# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
decision_tree_Y_pred = decision_tree.predict(X_test)
decision_tree_accuracy = decision_tree.score(X_train, Y_train)
decision_tree_accuracy

# Random Forest

In [None]:
# Random Forest

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
random_forest_Y_pred = random_forest.predict(X_test)
random_forest_accuracy = random_forest.score(X_train, Y_train)
random_forest_accuracy

In [None]:
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Gaussian Naive Bayes', 'Linear SVC',
              'Stochastic Gradient Descent', 'Decision Tree','Random Forest'],
    'Score': [svc_accuracy, knn_accuracy, gaussian_accuracy, linear_svc_accuracy, 
              sgd_accuracy, decision_tree_accuracy, random_forest_accuracy]})
models.sort_values(by='Score', ascending=False)
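
Keep in mind that the scores in this table are measured on the same data the models were trained on, so flexible models such as decision trees and random forests can look better than they will on unseen passengers. A rough way to check this is cross-validation. The sketch below uses synthetic data from `make_classification` instead of the Titanic features, and the choice of 5 folds is just an assumption for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Titanic features
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

tree = DecisionTreeClassifier(random_state=0)

# Accuracy on the data the tree was fitted on
train_accuracy = tree.fit(X, y).score(X, y)

# Accuracy on held-out folds: each fold is scored by a tree that never saw it
cv_scores = cross_val_score(tree, X, y, cv=5)

print(train_accuracy)
print(cv_scores.mean())
```

With the real data, `X_train` and `Y_train` would take the place of `X` and `y`.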

In [None]:
# submission file from each model
svm_submission = pd.DataFrame({"PassengerId": test_data["PassengerId"], "Survived": svm_Y_pred})
svm_submission.to_csv('svm_submission.csv', index=False)

knn_submission = pd.DataFrame({"PassengerId": test_data["PassengerId"], "Survived": knn_Y_pred})
knn_submission.to_csv('knn_submission.csv', index=False)

guassian_submission = pd.DataFrame({"PassengerId": test_data["PassengerId"], "Survived": guassian_Y_pred})
guassian_submission.to_csv('guassian_submission.csv', index=False)

linear_svc_submission = pd.DataFrame({"PassengerId": test_data["PassengerId"], "Survived": linear_svc_Y_pred})
linear_svc_submission.to_csv('linear_svc_submission.csv', index=False)

sgd_submission = pd.DataFrame({"PassengerId": test_data["PassengerId"], "Survived": sgd_Y_pred})
sgd_submission.to_csv('sgd_submission.csv', index=False)

decision_tree_submission = pd.DataFrame({"PassengerId": test_data["PassengerId"], "Survived": decision_tree_Y_pred})
decision_tree_submission.to_csv('decision_tree_submission.csv', index=False)

random_forest_submission = pd.DataFrame({"PassengerId": test_data["PassengerId"], "Survived": random_forest_Y_pred})
random_forest_submission.to_csv('random_forest_submission.csv', index=False)
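
The seven blocks above all follow the same pattern, so they could also be written as a loop over a dictionary of predictions. Here is a sketch with made-up ids and predictions. The names `passenger_ids`, `predictions` and `out_dir` are placeholders of mine, and the files go to a temporary folder so the example does not touch your working directory.

```python
import os
import tempfile

import pandas as pd

# Made-up ids and predictions standing in for test_data["PassengerId"] and the *_Y_pred arrays
passenger_ids = pd.Series([892, 893, 894], name="PassengerId")
predictions = {"svm": [0, 1, 0], "knn": [0, 0, 1]}

out_dir = tempfile.mkdtemp()
for model_name, y_pred in predictions.items():
    submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": y_pred})
    submission.to_csv(os.path.join(out_dir, f"{model_name}_submission.csv"), index=False)

# Read one file back to confirm it has exactly the two expected columns
check = pd.read_csv(os.path.join(out_dir, "knn_submission.csv"))
print(check)
```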

## <font color='blue'> I hope you enjoyed this kernel , Please don't forget to appreciate me with an Upvote. </font>