# Introduction

The sinking of the Titanic is one of the most infamous shipwrecks in history. The ship was so large that it was widely called "unsinkable". It hit an iceberg, and 1502 of the 2224 passengers and crew on board died.

In this notebook we will investigate the features of the dataset and try to find which passengers were more likely to survive. We will use machine learning algorithms to predict which passengers survived. In the end we will compare the performance of the various algorithms.

<font color = 'blue'>
 Content:
   
   1. [Load and check data](#1)
   2. [Information about variables](#2)
       * 2.1 [Univariate variable analysis](#3)
           * 2.1.1 [Categorical variable analysis](#4)
           * 2.1.2 [Numerical variable analysis](#5)   
   3. [Basic data analysis](#6)
   4. [Detection of outliers](#7)
   5. [Missing values](#8)
       * 5.1 [Detecting missing values](#9)
       * 5.2 [Filling missing values](#10)
           * 5.2.1 [Filling missing values in "Embarked" variable](#11)
           * 5.2.2 [Filling missing values in "Fare" variable](#12)
           
   6. [Visualization](#13)
       * 6.1 [Correlation matrix](#14)
       * 6.2 [SibSp and Survived](#15)
       * 6.3 [Parch and Survived](#16)
       * 6.4 [Pclass and Survived](#17)
       * 6.5 [Age and Survived](#18)
       * 6.6 [Pclass, Age and Survived](#19)
       * 6.7 [Embarked, Sex, Pclass and Survived](#20)
       * 6.8 [Embarked, Sex, Fare and Survived](#21)
       * 6.9 [Pclass, Sex and Survived](#22)
       
   7. [Filling the missing values in Age variable](#23)
   8. [Feature engineering](#24)
        * [8.1 Name - Title](#25)
        * [8.2 Family size](#26)
        * [8.3 Embarked](#27)
        * [8.4 Pclass](#28) 
        * [8.5 Sex](#29) 
        * [8.6 Dropping PassengerId and Cabin](#30)
   9. [Modelling](#31)
        * [9.1 Train and test split](#32)
        * [9.2 Simple Logistic Regression Model](#33)
        * [9.3 Hyperparameter Tuning - Grid Search - Cross Validation](#34)
        * [9.4 Ensembling](#35)
        * [9.5 Prediction](#36)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Using Matplotlib and Seaborn for visualisation.

import matplotlib.pyplot as plt
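# "seaborn-whitegrid" was renamed "seaborn-v0_8-whitegrid" in Matplotlib 3.6, so try the new name first.
try:
    plt.style.use("seaborn-v0_8-whitegrid")
except OSError:
    plt.style.use("seaborn-whitegrid")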
import seaborn as sns

from collections import Counter

import warnings
warnings.filterwarnings("ignore")

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

<a id = '1'></a><br>
## 1. Load and check data

In [None]:
# Import train and test data, and set them as df_train and df_test, accordingly.
df_train = pd.read_csv("/kaggle/input/titanic/train.csv")
df_test = pd.read_csv("/kaggle/input/titanic/test.csv")
test_Passenger_id = df_test["PassengerId"]

In [None]:
# Let's take a quick look at the columns of the train dataframe.
df_train.columns

In [None]:
# It shows the train columns and their types
df_train.info()

In [None]:
# Show the first 5 rows of df_train
df_train.head()

In [None]:
# Show the last 5 rows of df_train
df_train.tail()

In [None]:
# shows the statistical information of the train columns
df_train.describe().T

In [None]:
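# Correlations between the numerical columns of the train set
# (non-numeric columns are excluded explicitly, which recent pandas versions require)
df_train.select_dtypes(include="number").corr()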

In [None]:
# Let's visualize the correlations between features of the train set.
fig, ax = plt.subplots(figsize=(8,5)) 
sns.heatmap(df_train.select_dtypes(include="number").corr(), annot = True, fmt = ".2f", linewidths=0.5, ax=ax)
plt.show()

**It can be concluded** from the heatmap that the "Survived" variable has its strongest positive correlation with the "Fare" variable and a negative correlation with "Pclass". **We can say:**
* the more money passengers paid, the higher their probability of survival
* the lower the class passengers belonged to, the lower their probability of survival
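
The same ranking can be read off without a heatmap by sorting the correlations with `Survived` directly. The following is only a minimal sketch on a made-up six-row frame (the values are invented for illustration and are not taken from the Titanic data):

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Survived": [1, 1, 1, 0, 0, 0],
    "Pclass":   [1, 1, 2, 3, 3, 2],
    "Fare":     [80.0, 60.0, 30.0, 8.0, 7.5, 13.0],
    "Sex":      ["female", "female", "male", "male", "male", "female"],
})

# Keep numeric columns only, then take the correlations with the target.
corr_with_target = (
    toy.select_dtypes(include="number")
       .corr()["Survived"]
       .drop("Survived")
)

# Rank the features by the absolute size of their correlation.
ranking = corr_with_target.abs().sort_values(ascending=False)
print(corr_with_target)
print(ranking.index.tolist())
```

Sorting by absolute value matters here: a strongly negative correlation (as with `Pclass`) is just as informative as a strongly positive one.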


<a id = '2'></a><br>
## 2. Information about variables


Description of the variables in dataset:

* PassengerId : unique id of the passenger
* Survived : 1 if the passenger survived, otherwise 0
* Pclass : ticket class. 1 = 1st, 2 = 2nd, 3 = 3rd (1st is the highest class)
* Name : name of the passenger
* Sex : gender of the passenger
* Age : age of the passenger
* SibSp : number of siblings / spouses aboard the Titanic (mistresses and fiancés were ignored)
* Parch : number of parents / children aboard the Titanic (mother, father, daughter, son, stepdaughter, stepson). Some children travelled only with a nanny, therefore Parch = 0 for them.
* Ticket : ticket number
* Fare : fare written on the passenger's ticket
* Cabin : cabin number
* Embarked : port at which the passenger boarded the Titanic (C = Cherbourg, Q = Queenstown, S = Southampton)

<a id = '3'></a><br>
##  2.1 Univariate variable analysis
In this section we analyse each variable individually, ignoring the relationships between variables. Univariate variable analysis can be divided into 2 groups as follows:
 
   * Categorical variable analysis
   * Numerical variable analysis

<a id = '4'></a><br>
## 2.1.1 Categorical variable analysis


In [None]:
# Define a function that counts the values of a categorical variable and plots them as a bar chart.

def bar_plot(variable):
    var = df_train[variable]

    # Count how often each category occurs
    varValue = var.value_counts()

    plt.figure(figsize = (9,3))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{}: \n {}".format(variable, varValue))



In [None]:
# Plot the categorical variables
category1 = ["Survived", "Sex", "Pclass", "Embarked", "SibSp", "Parch"]
for c in category1:
    bar_plot(c)
plt.show()

In [None]:
# These variables have too many distinct values to plot, so we only print their counts.
category2 = ["Cabin", "Name", "Ticket"]
for c in category2:
    print("{} \n".format(df_train[c].value_counts()))

<a id = '5'></a><br>

## 2.1.2 Numerical variable analysis

In [None]:
# Use histograms to show the distribution of the numerical variables.

def hist_plot(variable):
    plt.figure(figsize=(9,3))
    # Drop missing values so that columns such as Age can be binned
    plt.hist(df_train[variable].dropna(), bins = 50)

    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} Distribution with histogram".format(variable))
    plt.show()

In [None]:
numericalVariables = ["Fare", "Age", "PassengerId"]
for v in numericalVariables:
    hist_plot(v)
plt.show()

<a id = '6'></a><br>
## 3. Basic data analysis

In this section we analyse the relationships between pairs of variables and look for patterns.
The following variable pairs will be analysed:

* Pclass - Survived
* Sex - Survived
* SibSp - Survived
* Parch - Survived

In [None]:
# Check whether there is a relation between Pclass and Survived
    
df_train[["Pclass","Survived"]].groupby("Pclass", as_index=False).mean().sort_values(by="Survived", ascending = False)



In [None]:
# Check whether there is a relation between Sex and Survived
    
df_train[["Sex","Survived"]].groupby("Sex", as_index=False).mean().sort_values(by="Survived", ascending = False)


In [None]:
# Check whether there is a relation between SibSp and Survived
    
df_train[["SibSp","Survived"]].groupby("SibSp", as_index=False).mean().sort_values(by="Survived", ascending = False)



In [None]:
# Check whether there is a relation between Parch and Survived
    
df_train[["Parch","Survived"]].groupby("Parch", as_index=False).mean().sort_values(by="Survived", ascending = False)


 **From the relationships between the variables, the following can be understood:**
* First class passengers have the highest survival rate with 62.9%, while 3rd class passengers have the lowest with 24.2%.
* 74% of women survived, while 18% of men survived.
* The survival rate of passengers with one or two siblings/spouses aboard is significantly higher than that of the others.
* Passengers with 3 parents/children aboard (Parch = 3) have the highest survival rate with 60%, but the rates for Parch = 1 and Parch = 2 are also quite high, at 55% and 50% respectively.



From here, we can conclude that if a passenger is **a woman** and has **a first-class ticket**, her probability of survival is **very high**.
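
The cells above look at one variable at a time, so the combined effect of sex and class is inferred rather than computed. It can be checked directly by grouping on both columns at once. A minimal sketch on invented data (in the notebook, `df_train` would take the place of `toy`):

```python
import pandas as pd

# Toy data, invented for illustration; the real notebook would use df_train.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male", "male", "female"],
    "Pclass":   [1, 1, 3, 1, 3, 3, 2, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the survival rate of each (Sex, Pclass) group.
rates = toy.groupby(["Sex", "Pclass"])["Survived"].mean().unstack("Pclass")
print(rates)
```

Each cell of the resulting table is the survival rate for one sex and class combination; combinations that do not occur in the data show up as NaN.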

<a id = '7'></a><br>
## 4. Detection of outliers

In [None]:
def outlier_detection(df, features):
    outlier_indices = []

    for i in features:
        # 1st quartile
        # Note: np.percentile returns NaN for a column that contains NaN (e.g. Age),
        # in which case no row is flagged for that column.
        q1 = np.percentile(df[i],25)

        # 3rd quartile
        q3 = np.percentile(df[i],75)

        # Interquartile range
        IQR = q3 - q1

        # Outlier step
        outlier_step = IQR * 1.5

        # Detect outliers and their indices
        outliers_list = df[(df[i] < q1-outlier_step) | (df[i]> q3 + outlier_step)].index

        # Store the indices
        outlier_indices.extend(outliers_list)

    # Keep only the rows that are outliers in more than two of the features.
    # Rows flagged for just one or two features are left in place.
    outlier_indices1 = Counter(outlier_indices)
    multiple_outliers = [j for j, v in outlier_indices1.items() if v > 2]

    return multiple_outliers

In [None]:
# Here we are running the outlier_detection function.
df_train.loc[outlier_detection(df_train, ["Age", "SibSp", "Parch", "Fare"])]

In [None]:
# Now we drop the outliers and reset the index
df_train = df_train.drop(outlier_detection(df_train, ["Age", "SibSp", "Parch", "Fare"]), axis = 0).reset_index(drop = True)
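
The function above flags a value when it lies more than 1.5 times the interquartile range below the first quartile or above the third quartile. As a small worked example of that rule on its own, with invented numbers and a single planted outlier:

```python
import numpy as np

# Illustrative values only; 100 is the planted outlier.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(q1, q3, lower, upper, outliers)
```

Here the quartiles are 3 and 7, so the fences sit at -3 and 13 and only the value 100 falls outside them.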

<a id = '8'></a><br>
## 5. Missing values
       
Missing values, i.e. unusable entries such as #, ?, - and NaN (not a number), have to be handled in order to build a machine learning model with high accuracy.
In this section we will detect the missing values and replace them with meaningful values.
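
Placeholders such as `#`, `?` and `-` are not recognised as missing by default, so they have to be declared when the file is read. The sketch below assumes a hypothetical CSV fragment that uses those three markers; the Titanic files themselves already use empty fields, which pandas reads as NaN.

```python
import io
import pandas as pd

# Hypothetical CSV fragment in which "?", "-" and "#" stand for missing entries.
raw = io.StringIO(
    "PassengerId,Age,Fare\n"
    "1,22,7.25\n"
    "2,?,71.28\n"
    "3,26,-\n"
    "4,#,8.05\n"
)

# na_values tells read_csv which extra strings to treat as missing.
toy = pd.read_csv(raw, na_values=["?", "-", "#"])

# Count the missing entries per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Once the markers are converted to NaN, the columns are parsed as numbers and `isnull()` counts them like any other missing value.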

<a id = '9'></a><br>
### 5.1 Detecting missing values


In [None]:
# In order not to damage our original data, we assigned train and test sets to new variables by the copy method during the concatenating processes.
df_train_copy = df_train.copy()
df_test_copy = df_test.copy()

# concatenating train and test sets
df_new_train = pd.concat([df_train_copy,df_test_copy], axis = 0).reset_index(drop = True)
df_new_train 

In [None]:
# It shows all columns in the data set that have missing values.
df_new_train.columns[df_new_train.isnull().any()]

In [None]:
# Number of missing values in each variable.
df_new_train.isnull().sum()

**As we can see,** we do not have information about
- the ages of 256 passengers,
- the cabin numbers of 1007 passengers,
- the embarkation port of 2 passengers and
- the ticket price of 1 passenger.

There are also 418 missing values in the "Survived" column. They come from the test set, which does not have a "Survived" column.

<a id = '10'></a><br>
### 5.2 Filling missing values

* Column "Embarked" has 2 missing values
* Column "Fare" has just one missing value
 

<a id = '11'></a><br>
### 5.2.1 Filling missing values in "Embarked" variable


In [None]:
# Let's see which rows have a missing value in the "Embarked" variable.
df_new_train[df_new_train["Embarked"].isnull()]

In [None]:
# Boxplot of "Age" grouped by "Embarked" variable. 
df_new_train.boxplot(column = "Age", by = "Embarked")
plt.show()

In [None]:
# Boxplot of "Fare" grouped by "Embarked" variable. 
df_new_train.boxplot(column = "Fare", by = "Embarked")
plt.show()

It can be concluded from the boxplots that the median age of passengers who embarked at Cherbourg (C) is the closest to the ages in the two rows with a missing "Embarked" value.
Likewise, the median fare paid by passengers who embarked at Cherbourg is the closest to the fares in those two rows.

**It can be said that missing values of "Embarked" variable can be filled by Cherbourg (C)**

In [None]:
# Filling the missing values in "Embarked" variable
df_new_train["Embarked"] = df_new_train["Embarked"].fillna("C")

<a id = '12'></a><br>
### 5.2.2 Filling missing values in "Fare" variable

In [None]:
# Let's see which rows have a missing value in the "Fare" variable.
df_new_train[df_new_train["Fare"].isnull()]

In [None]:
# Boxplot of "Fare" grouped by "Pclass" variable.
df_new_train.boxplot(column = "Fare", by = "Pclass")
plt.show()

In [None]:
# Median values of the "Fare" column grouped by "Pclass"
df_new_train[["Pclass","Fare"]].groupby("Pclass", as_index=False).median()

It can be seen that the row above with the missing value belongs to a 3rd class passenger. Therefore we can use the fares of the 3rd class passengers to fill the missing value. The boxplot shows that the "Fare" values of 3rd class contain many outliers, so it is better to use the median "Fare" of 3rd class passengers than the mean.
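
To illustrate why the median is the safer choice when a group contains outliers, here is a small sketch with invented third-class fares (the numbers are made up; only the pattern of a few extreme values matters):

```python
import numpy as np
import pandas as pd

# Invented third-class fares: mostly low, with two extreme values and one gap.
fares = pd.Series([7.25, 7.75, 7.9, 8.05, 8.05, 8.66, 56.5, 69.55, np.nan])

mean_fare = fares.mean()      # pulled upwards by the two large fares
median_fare = fares.median()  # stays with the bulk of the data

# Fill the gap with the median.
filled = fares.fillna(median_fare)
print(round(mean_fare, 2), median_fare)
```

The mean lands above 21 even though six of the eight known fares are below 9, while the median stays at 8.05, a value a typical passenger in this toy group actually paid.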


In [None]:
# Fill the missing value in "Fare" with the median "Fare" of 3rd class passengers.
third_class_median_fare = df_new_train.loc[df_new_train["Pclass"] == 3, "Fare"].median()
df_new_train["Fare"] = df_new_train["Fare"].fillna(third_class_median_fare)

<a id = '13'></a><br>
# 6. Visualization

In this section we illustrate the relationships between variables with visualization tools. We will visualize:
*  the correlation matrix
*  SibSp and Survived
*  Parch and Survived
*  Pclass and Survived
*  Age and Survived
*  Pclass, Age and Survived
*  Embarked, Sex, Pclass and Survived
*  Embarked, Sex, Fare and Survived
*  Pclass, Sex and Survived

<a id = '14'></a><br>
## 6.1 Correlation matrix

In [None]:
# Correlations between the numerical columns of the combined data set
df_new_train.select_dtypes(include="number").corr()

In [None]:
# Let's visualize the correlations between features of the train set.
%config InlineBackend.figure_format ='retina'
fig, ax = plt.subplots(figsize=(8,5)) 
sns.heatmap(df_new_train.select_dtypes(include="number").corr(), annot = True, fmt = ".2f", linewidths=0.5, ax=ax)
plt.show()


**It can be concluded** from the heatmap that the "Survived" variable has its strongest positive correlation with the "Fare" variable and a negative correlation with "Pclass". **We can say:**
* the more money passengers paid, the higher their probability of survival
* the lower the class passengers belonged to, the lower their probability of survival


<a id = '15'></a><br>
## 6.2 SibSp and Survived
       

In [None]:
# factorplot was renamed catplot (and size became height) in seaborn 0.9
g = sns.catplot(x = "SibSp", y = "Survived", data = df_new_train, kind = "bar", height = 6)
g.set_ylabels("Survived Probability")
plt.show()

**We can see that** when the number of siblings and spouses (SibSp) is more than 2, the survival probability decreases sharply. We can use this plot to extract a new feature. **For example**, we can define a new variable SibSp2 with 2 as the threshold: if SibSp is less than or equal to 2, SibSp2 equals 1, otherwise 0.
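
The SibSp2 idea described above could be sketched as follows. The frame is a toy stand-in for `df_new_train`, and the notebook itself does not go on to create this feature:

```python
import pandas as pd

# Toy SibSp values; in the notebook this would be df_new_train["SibSp"].
toy = pd.DataFrame({"SibSp": [0, 1, 2, 3, 5, 8]})

# SibSp2 is 1 when SibSp is at most 2 (the threshold read off the plot), otherwise 0.
toy["SibSp2"] = (toy["SibSp"] <= 2).astype(int)
print(toy)
```

The comparison yields a boolean column, and `astype(int)` turns it into the 0/1 indicator described in the text.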

<a id = '16'></a><br>
## 6.3 Parch and Survived
       

In [None]:
g = sns.catplot(x = "Parch", y = "Survived", data = df_new_train, kind = "bar", height = 6)
g.set_ylabels("Survived Probability")
plt.show()

**At first glance**, the plot above suggests that passengers with few parents/children aboard are more likely to survive, and that Parch = 3 has the highest survival probability. However, the black vertical lines are error bars (by default seaborn draws a 95% confidence interval rather than the standard deviation), and the one for Parch = 3 is very wide. This relation alone is therefore not reliable enough to extract a new feature, **but SibSp and Parch can be used together to extract a new feature.**
  

<a id = '17'></a><br>
## 6.4 Pclass and Survived
       

In [None]:
g = sns.catplot(x = "Pclass", y = "Survived", data = df_new_train, kind = "bar", height = 6)
g.set_ylabels("Survived Probability")
plt.show()

**First class** passengers have the highest survival rate with 62.9%, while 3rd class passengers have the lowest with 24.2%.

<a id = '18'></a><br>
## 6.5 Age and Survived
       

In [None]:
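g= sns.FacetGrid(df_new_train, col = "Survived")
# distplot is deprecated in recent seaborn versions; histplot with kde=True draws a similar plot
g.map(sns.histplot, "Age", bins = 25, kde = True)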
plt.show()

The graph on the left shows the age distribution of the passengers who died, while the graph on the right shows the age distribution of the survivors.
* As seen in the chart on the right, the survival rate of children aged 0-8 is very high, which suggests that **children were given priority**.
* The right-hand chart also shows survivors among the oldest passengers (aged over 70), which may indicate that **old passengers were also given priority**, although this group is very small.
* Most of the passengers on the Titanic were between the ages of 15 and 35.
* Most of the **passengers who died** were between the ages of **15 and 30.**
* Most of the **passengers who survived** were between the ages of **20 and 35.**
* Both the left and right graphs are roughly bell-shaped (**approximately Gaussian**).
* **We can use the Age distributions for filling the missing values in the "Age" variable.**

<a id = '19'></a><br>
## 6.6 Pclass, Age and Survived
      

In [None]:
g = sns.FacetGrid(df_new_train, col="Survived", row = "Pclass", height = 2)
g.map(plt.hist, "Age", bins = 25)
g.set_ylabels("Number of passengers")
plt.show()

We can say that:
* The majority of passengers belong to 3rd class, and most 3rd class passengers did not survive, so the largest class also has the lowest survival rate.
* While the survival rate of first class passengers is significantly higher, there is no significant difference between the numbers of survivors and deaths among second class passengers.

<a id = '20'></a><br>
## 6.7 Embarked, Sex, Pclass and Survived
      

In [None]:
g = sns.FacetGrid(df_new_train, row = "Embarked", height = 2)
g.map(sns.pointplot, "Pclass", "Survived", "Sex")
g.add_legend()
plt.show()

It can be said:
* All women in first and second class who embarked at Queenstown survived, while all men in first and second class who embarked at Queenstown died.
* Almost all women in first and second class who embarked at Southampton survived, while just 30% of third class women who embarked at Southampton survived.
* Men who embarked at Cherbourg have a higher survival probability than men who embarked at Southampton or Queenstown.
 

<a id = '21'></a><br>
## 6.8 Embarked, Sex, Fare and Survived

In [None]:
g = sns.FacetGrid(df_new_train, row = "Embarked", col="Survived", height = 2.3)
g.map(sns.barplot, "Sex", "Fare")
g.add_legend()
plt.show()

As we can see from the plot above:
* Passengers who embarked at Cherbourg or Southampton and paid more had a better chance of survival than passengers who embarked at the same ports but paid less.
* Fare does not appear to affect the survival probability of passengers who embarked at Queenstown.



<a id = '22'></a><br>
## 6.9 Pclass, Sex and Survived

In [None]:
g = sns.FacetGrid(df_new_train, col="Pclass", height=4, aspect=.5)
g.map(sns.barplot, "Sex", "Survived")
plt.show()

<a id = '23'></a><br>
# 7. Filling the missing values in Age variable


We handle the missing values in the Age variable separately because of their complexity. As we saw in the missing values section, Age has 256 missing values. We need to look in depth at the Age variable and its relationship with the other variables.

In [None]:
# Show all rows whose Age value is NaN.
df_new_train[df_new_train["Age"].isnull()]

In [None]:
# Let's look at the relationship between the Age and Sex variables.
sns.catplot(x = "Sex", y = "Age", data = df_new_train, kind = "box");

**It is very clear** that the age distributions and median ages of men and women are very similar. Therefore we cannot use the Sex variable directly to fill the missing values.

In [None]:
# Let's look at the relationship between the Pclass, Age and Sex variables.
sns.catplot(x = "Sex", y = "Age", hue = "Pclass", data = df_new_train, kind = "box");

It can be seen that this relationship gives us a much more valuable perspective. For example:
* if a passenger is male and in first class, his age can be filled with 42, the median age of men in first class.
* if a passenger is female and in third class, her age can be filled with 22, the median age of women in third class.


In [None]:
# Let's look at the relationship between Age and Parch, and between Age and SibSp.

sns.catplot(x = "Parch", y = "Age", data = df_new_train, kind = "box");
sns.catplot(x = "SibSp", y = "Age", data = df_new_train, kind = "box");

**We can divide Parch into two groups as follows:**
* The age of passengers with Parch less than or equal to 2 can be filled with 22, the average of the median ages for Parch 0, 1 and 2.
* The age of passengers with Parch greater than or equal to 3 can be filled with 40, the average of the median ages for Parch 3 and above.

**We can divide SibSp into two groups as follows:**
* The age of passengers with SibSp less than or equal to 2 can be filled with 25, the average of the median ages for SibSp 0, 1 and 2.
* The age of passengers with SibSp greater than or equal to 3 can be filled with 10, the average of the median ages for SibSp 3 and above.

In [None]:
# Use a list comprehension to convert "Sex" from an object column to a numerical one (male = 1, female = 0),
# because I want to see its correlation with the other variables.
df_new_train["Sex"] = [1 if i == "male" else 0 for i in df_new_train["Sex"]]
df_new_train["Sex"]

In [None]:
# Correlation Matrix
var_list = ["Age", "Sex", "SibSp", "Parch", "Pclass"]
sns.set(font_scale=0.9)
fig, ax = plt.subplots(figsize=(10,5)) 
sns.heatmap(df_new_train[var_list].corr(), annot = True, fmt = ".2f", linewidths=0.5, ax=ax) 
plt.show()

**Pclass, Parch and SibSp are clearly correlated with Age, while Sex is not. It is therefore reasonable to use the Pclass, Parch and SibSp variables to fill the missing values in the Age variable.**
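
One way to express this idea without an explicit loop is a grouped median with a global fallback. The sketch below uses invented data, and it is an alternative formulation rather than the notebook's own method: it computes every group median before filling anything, so its results can differ slightly from a row-by-row fill.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3, 2],
    "SibSp":  [0, 0, 0, 1, 1, 1, 4],
    "Parch":  [0, 0, 0, 2, 2, 2, 1],
    "Age":    [40.0, 50.0, np.nan, 4.0, 10.0, np.nan, np.nan],
})

# Median age of the passengers sharing the same (SibSp, Parch, Pclass) values.
group_median = toy.groupby(["SibSp", "Parch", "Pclass"])["Age"].transform("median")

# Use the group median where it exists, otherwise the overall median.
overall_median = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(group_median).fillna(overall_median)
print(toy)
```

The last row has no passenger with a known age in its group, so it receives the overall median (25), while the other two gaps receive their group medians (45 and 7).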

In [None]:
# Finding the indices of the missing values in Age variable.
Age_index= df_new_train["Age"][df_new_train["Age"].isnull()].index
Age_index

In [None]:
# Fill the missing values of Age
for i in Age_index:
    # Median age of the passengers with the same SibSp, Pclass and Parch values
    same_group = ((df_new_train["SibSp"] == df_new_train.loc[i, "SibSp"])
                  & (df_new_train["Pclass"] == df_new_train.loc[i, "Pclass"])
                  & (df_new_train["Parch"] == df_new_train.loc[i, "Parch"]))
    predicted_Age = df_new_train.loc[same_group, "Age"].median()
    Age_med = df_new_train["Age"].median()

    # .loc avoids chained assignment; fall back to the overall median if the group has no known age
    if not np.isnan(predicted_Age):
        df_new_train.loc[i, "Age"] = predicted_Age
    else:
        df_new_train.loc[i, "Age"] = Age_med
    

<a id = '24'></a><br>
# 8. Feature engineering

<a id = '25'></a><br>
## 8.1 Name - Title

In [None]:
# Looking at Name variable
df_new_train["Name"].head(10)

We cannot make a prediction about survival condition by using passengers' names, but there might be relationship between survival rate and titles (such as Mr., Miss or Mrs). 
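
Names in this dataset follow the layout "Surname, Title. Given names", so the title is the text between the comma and the first period. As a sketch of one way to pull it out, here is a regular expression applied to a few sample names in that layout (the notebook itself uses string splitting, which gives the same result for these names):

```python
import pandas as pd

# Names in the "Surname, Title. Given names" layout used by the dataset.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Palsson, Master. Gosta Leonard",
])

# Capture the text between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

Because the pattern captures everything up to the first period, multi-word titles such as "the Countess" are kept whole.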

In [None]:
# Extract the title from each name and store it in a new variable, "Title".
name = df_new_train["Name"]
df_new_train["Title"] = [((i.split('.')[0]).split(',')[-1]).strip() for i in name]
df_new_train["Title"] 

In [None]:
# Show the frequencies of each title.
df_new_train["Title"].value_counts()

In [None]:
# Plot the frequencies of each title.
sns.countplot(x = "Title", data = df_new_train)
plt.xticks(rotation = 60)
plt.show()


As we can see from the frequencies, Mr, Mrs, Miss and Master account for the majority of the titles. We can therefore group the remaining titles under a single name, "other", and then convert "Title" to a categorical variable.

In [None]:
# Creating new title name as other.
other_list = ["Rev","Dr","Col","Mlle","Ms","Major","Dona","Capt","Jonkheer","Lady","Mme","Sir","the Countess", "Don"]
df_new_train["Title"] = df_new_train["Title"].replace(other_list, "other")

In [None]:
df_new_train["Title"].value_counts()

In [None]:
# Encode Title as integers, then convert it to a categorical variable
df_new_train["Title"] = [0 if i == "Master" else 1 if i == "Miss" else 2 if i == "Mrs" else 3 if i == "Mr" else 4  for i in df_new_train["Title"]]
# astype has no inplace argument; it returns the converted column
df_new_train["Title"] = df_new_train["Title"].astype("category")
df_new_train["Title"].value_counts()

In [None]:
# Plot the frequencies of each title.
sns.countplot(x = "Title", data = df_new_train)
plt.xticks(rotation = 60)
plt.show()

In [None]:
g = sns.catplot(x = "Title", y = "Survived", data = df_new_train, kind = "bar");
g.set_xticklabels(["Master", "Miss", "Mrs", "Mr", "other"])
g.set_ylabels("Survival probability");

It is clear that passengers titled Miss or Mrs have a much higher survival probability than passengers titled Mr.

**We have used the "Name" variable to create the new feature "Title". Therefore we do not need the "Name" variable anymore, and we will drop it from the data.**

In [None]:
# Dropping variable "Name" from the data.
df_new_train.drop("Name", axis = 1, inplace = True)
df_new_train.head()

In [None]:
# Convert "Title" categorical variable to dummy variables. 
df_new_train = pd.get_dummies(df_new_train, columns = ["Title"])
df_new_train

<a id = '26'></a><br>
## 8.2 Family size

Variables "SibSp" and "Parch" give the information about passengers family. Maybe we can use them to create new feature.

SibSp refers to sibling and spouse while Parch indicates parents or children.

In [None]:
# Create a new feature, FamilySize. The 1 accounts for the passenger themselves.
df_new_train["FamilySize"] = df_new_train["SibSp"] + df_new_train["Parch"] + 1
df_new_train

In [None]:
# Let's see what the relationship between survival rate and FamilySize is. 
g = sns.catplot(x = "FamilySize", y = "Survived", data = df_new_train, kind = "bar");
g.set_ylabels("Survival probability");

In [None]:
# The plot suggests categorizing family size as 1 if FamilySize is less than 5, and as 0 if it is 5 or greater.
df_new_train["FamilySize"] = [1 if i < 5 else 0 for i in df_new_train["FamilySize"] ]

In [None]:
df_new_train.head(10)

In [None]:
# The frequencies of the categories in FamilySize.
df_new_train["FamilySize"].value_counts()

In [None]:
# Plot the frequencies of each category in FamilySize.
sns.countplot(x = "FamilySize", data = df_new_train);

In [None]:
# Let's see what the relationship between survival rate and FamilySize is. 
g = sns.catplot(x = "FamilySize", y = "Survived", data = df_new_train, kind = "bar");
g.set_ylabels("Survival probability");

The survival probability of small families is almost three times higher than that of big families.

In [None]:
# Convert the "FamilySize" categorical variable to dummy variables.
df_new_train = pd.get_dummies(df_new_train, columns = ["FamilySize"])
df_new_train

In [None]:
df_new_train.head(10)

<a id = '27'></a><br>
## 8.3  Embarked
We do not engineer anything further from "Embarked". We only convert this categorical variable to dummy variables before training our machine learning model.
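
As a small illustration of what dummy encoding produces, the sketch below applies `pd.get_dummies` to an invented four-row frame. The `dtype=int` argument is an optional choice made here so that the indicators print as 0/1; the notebook cells do not pass it.

```python
import pandas as pd

# Toy frame, invented for illustration.
toy = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],
    "Fare": [7.25, 71.28, 7.75, 8.05],
})

# One indicator column per category; the original column is removed.
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one indicator set to 1, and the untouched `Fare` column is carried over unchanged.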



In [None]:
# Convert the "Embarked" categorical variable to dummy variables.
df_new_train = pd.get_dummies(df_new_train, columns = ["Embarked"])
df_new_train

<a id = '28'></a><br>
## 8.4  Pclass

We do not engineer anything further from "Pclass". We only convert this categorical variable to dummy variables before training our machine learning model.

In [None]:
# Convert "Pclass" categorical variable to dummy variables. 
df_new_train["Pclass"] = df_new_train["Pclass"].astype("category")
df_new_train = pd.get_dummies(df_new_train, columns = ["Pclass"])


In [None]:
df_new_train

<a id = '29'></a><br>
## 8.5  Sex

We do not engineer anything further from "Sex". We only convert this categorical variable to dummy variables before training our machine learning model.

In [None]:
# Convert "Sex" categorical variable to dummy variables. 
df_new_train["Sex"] = df_new_train["Sex"].astype("category")
df_new_train = pd.get_dummies(df_new_train, columns = ["Sex"])


In [None]:
df_new_train

<a id = '30'></a><br>
## 8.6  Dropping PassengerId and Cabin

We will not use the PassengerId, Cabin and Ticket variables to train the ML models, because they are not meaningful as they stand.

In [None]:
# Dropping PassengerId, Cabin and Ticket
df_new_train.drop(["PassengerId", "Cabin", "Ticket"], axis = 1, inplace = True)


In [None]:
df_new_train

<a id = '31'></a><br>
# 9.  Modelling


In [None]:
# Importing libraries
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

<a id = '32'></a><br>
## 9.1 Train and test split

In [None]:
# Remember that we concatenated the train and test sets. We now split them again.
len(df_test)  # length of the test data before concatenating
len(df_train) # length of the train data before concatenating

# We will use these lengths to split the final data back into train and test sets


In [None]:
# Split off the test set and drop "Survived", because the test set did not have that column before concatenating.
# Dropping on the slice directly (instead of inplace) avoids modifying a view of df_new_train.
test = df_new_train[len(df_train):].drop(labels = ["Survived"], axis = 1)
test

In [None]:
# we are splitting train set into x_train and y_train. Furthermore we use train_test_split method to create X_train, X_test, Y_train, Y_test
train = df_new_train[:len(df_train)]
x_train = train.drop(labels = ["Survived"], axis = 1)
y_train = train["Survived"]
X_train, X_test, Y_train, Y_test = train_test_split(x_train, y_train, test_size = 0.3, random_state = 42)


<a id = '33'></a><br>
## 9.2 Simple Logistic Regression Model

In [None]:
# Logistic regression model
log_Reg = LogisticRegression()
log_Reg.fit(X_train, Y_train)
log_Reg_train_Accuracy = round(log_Reg.score(X_train, Y_train)*100,3)
log_Reg_test_Accuracy = round(log_Reg.score(X_test, Y_test)*100,3)
print("Training accuracy: {}%".format(log_Reg_train_Accuracy))
print("Testing accuracy: {}%".format(log_Reg_test_Accuracy))

**The training accuracy (83.117%) and the testing accuracy (83.019%) are almost the same.**

<a id = '34'></a><br>
## 9.3 Hyperparameter Tuning - Grid Search - Cross Validation

In this section we compare the performance of 5 machine learning classifiers: Decision Tree, SVM, Random Forest, KNN and Logistic Regression. We use the grid search technique to explore the hyperparameters of each model, with cross-validation to compare the candidate values and find the best ones.
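
As a compact illustration of grid search with cross-validation, here is a sketch for a single classifier. It runs on synthetic data from `make_classification` (an assumption made only so that the example is self-contained) and uses a much smaller grid than the one in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (not the real data).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Every combination of these values is tried: 3 x 3 = 9 candidates.
param_grid = {"max_depth": [1, 3, 5], "min_samples_split": [2, 10, 30]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    cv=StratifiedKFold(n_splits=5),  # each candidate is scored on 5 stratified folds
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated accuracy of the best candidate, and `best_estimator_` is that candidate refitted on all of the data passed to `fit`.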

In [None]:
random_state = 42
classifier = [DecisionTreeClassifier(random_state = random_state),
              SVC(random_state = random_state),
              RandomForestClassifier(random_state = random_state),
              # liblinear supports both the "l1" and "l2" penalties used in the grid below
              LogisticRegression(random_state = random_state, solver = "liblinear"),
              KNeighborsClassifier()]

In [None]:
# Define the hyperparameter grids to search for each classifier
dt_param_grid = {"min_samples_split" : range(10,500,20),
                "max_depth": range(1,20,2)}

svc_param_grid = {"kernel" : ["rbf"],
                 "gamma": [0.001, 0.01, 0.1, 1],
                 "C": [1,10,50,100,200,300,1000]}

rf_param_grid = {"max_features": [1,3,10],
                "min_samples_split":[2,3,10],
                "min_samples_leaf":[1,3,10],
                "bootstrap":[False],
                "n_estimators":[500,1000],
                "criterion":["gini"]}

logreg_param_grid = {"C":np.logspace(-3,3,7),
                    "penalty": ["l1","l2"]}

knn_param_grid = {"n_neighbors": np.linspace(1,19,10, dtype = int).tolist(),
                 "weights": ["uniform","distance"],
                 "metric":["euclidean","manhattan"]}
classifier_param = [dt_param_grid,
                   svc_param_grid,
                   rf_param_grid,
                   logreg_param_grid,
                   knn_param_grid]

In [None]:
# Cross Validation
cv_result = []
best_estimators = []
for i in range(len(classifier)):
    clf = GridSearchCV(classifier[i], param_grid=classifier_param[i], cv = StratifiedKFold(n_splits = 10), scoring = "accuracy", n_jobs = -1,verbose = 1)
    clf.fit(X_train,Y_train)
    cv_result.append(clf.best_score_)
    best_estimators.append(clf.best_estimator_)
    print(cv_result[i])

In [None]:
# Visualizing the results
cv_results = pd.DataFrame({"Cross Validation Means":cv_result, "ML Models":["DecisionTreeClassifier", "SVM","RandomForestClassifier",
             "LogisticRegression",
             "KNeighborsClassifier"]})

g = sns.barplot(x = "Cross Validation Means", y = "ML Models", data = cv_results)
g.set_xlabel("Mean Accuracy")
g.set_title("Cross Validation Scores");

<a id = '35'></a><br>
## 9.4 Ensembling

As can be seen from the cross-validation scores, the Decision Tree, Logistic Regression and Random Forest classifiers score above 80%. We set 80% as the threshold and ensemble these 3 classifiers with VotingClassifier.
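
Soft voting averages the class probabilities predicted by the individual models, optionally with weights, and picks the class with the highest average. The hand-computed sketch below uses invented probabilities for two passengers and the weights 1, 1 and 9 to show how a heavily weighted model can change the outcome:

```python
import numpy as np

# Invented class probabilities [P(died), P(survived)] for two passengers,
# as three fitted classifiers might return them from predict_proba.
proba_dt = np.array([[0.40, 0.60], [0.70, 0.30]])
proba_rf = np.array([[0.45, 0.55], [0.60, 0.40]])
proba_lr = np.array([[0.80, 0.20], [0.30, 0.70]])

weights = np.array([1, 1, 9])

# Soft voting: weighted average of the probabilities, then argmax per passenger.
stacked = np.stack([proba_dt, proba_rf, proba_lr])   # shape (3 models, 2 passengers, 2 classes)
avg = np.average(stacked, axis=0, weights=weights)
prediction = avg.argmax(axis=1)

# The same vote with equal weights, for comparison.
unweighted = stacked.mean(axis=0).argmax(axis=1)
print(prediction, unweighted)
```

For the second passenger the equal-weight vote predicts class 0, but with a weight of 9 the third model outvotes the other two and the prediction becomes class 1.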

In [None]:
# Ensembling of Decision Tree, Random Forest classifiers and Logistic Regression
votingC = VotingClassifier(estimators = [("dt",best_estimators[0]),
                                        ("rfc",best_estimators[2]),
                                        ("lr",best_estimators[3])],
                                        voting = "soft",weights=[1,1,9], n_jobs = -1)
votingC = votingC.fit(X_train, Y_train)
# accuracy_score expects the true labels first, then the predictions
print(accuracy_score(Y_test, votingC.predict(X_test)))

<a id = '36'></a><br>
## 9.5 Prediction 

In [None]:
test_survived = pd.Series(votingC.predict(test), name = "Survived").astype(int)
results = pd.concat([test_Passenger_id, test_survived],axis = 1)
results.to_csv("titanic.csv", index = False)