**Introduction and Loading Python Support Libraries**

In this notebook we predict survival on the Titanic data set. The objective is to predict whether a passenger survived, given the input variables provided. At a high level, we run through the following steps:

* a) Exploratory Data Analysis
* b) Cleaning Up Data 
* c) Visualization of the data set
* d) Feature Engineering / Scaling data
* e) Model Generation (using KNN, Logistic Regression, SVM, Gaussian NB, and Ensemble Models (Random Forest, Gradient Boosting Classifier))

Let's get started by loading the Python libraries and the train/test data.

Steps c and d overlap: we create new features as we go along visualizing the data.



In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import os
from matplotlib import pyplot as plt
from sklearn.preprocessing import LabelEncoder
## ignore warnings
import warnings; warnings.simplefilter('ignore')
pd.set_option('display.width',1000000)
pd.set_option('display.max_columns', 500)

In [1]:
### some more libraries

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score,precision_score,f1_score,accuracy_score,confusion_matrix
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from scipy import stats
import statistics as s
from sklearn.linear_model import LogisticRegression

Load train and test data sets

In [1]:

#print(os.getcwd())
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [1]:
gender_submission = pd.read_csv("../input/titanic/gender_submission.csv")
test = pd.read_csv("../input/titanic/test.csv")
train = pd.read_csv("../input/titanic/train.csv")

#### set PassengerId as the index of both data sets,
#### since it acts as a unique identifier for each row
train.set_index('PassengerId',inplace=True)
test.set_index('PassengerId', inplace=True)

### Data frame to hold scores
### (columns are passed as a list so that their order is well defined)
score_df = pd.DataFrame(columns=['Model_Name','Score'])

1.a **Exploratory data analysis** - The data captured or fed into a data science pipeline is bound to have some inconsistencies: missing values, outliers, incorrect information, etc. Even a very good prediction model will not give good results unless the input data is cleaned.

Let's start by getting a feel for the data with some commonly used methods, and get an idea of the missing/null values in our train and test data sets.

In [1]:
# check the columns in the data set
train.dtypes

Columns in the data set, along with a brief description:

| Variable | Definition | Key |
|----------|------------|-----|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

In [1]:
train.isnull().any()

In [1]:
test.isnull().any()

In [1]:
## let's look at the first few rows to see how the data looks
train.head(5)

In [1]:
test.head(5)

In [1]:
## describe() gives some useful summary statistics for the numeric columns in the data set
train.describe()

1.b **Cleaning Data** - We will start by filling in the missing/null values.

In [1]:
######################## Column Embarked - Train data 

## Only 2 values are null in the data for this column
print(train[train['Embarked'].isnull()])


In [1]:
### check whether there is any relation between where a person embarked and the ticket fare
sns.catplot(x='Embarked',y='Fare',hue ='Pclass',data=train,kind='swarm')

From the plot above it is not clear whether Embarked can be derived from the Fare and Pclass fields, so for now let us fill it with 'S', which has the highest frequency count.
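Filling with the most frequent value can also be done without hard-coding the letter. The following is a minimal sketch on a toy column (the values are illustrative stand-ins for `train['Embarked']`, not the real data), assuming pandas' `Series.mode()`, which ignores missing values:

```python
import pandas as pd

# Toy column standing in for train['Embarked'] (illustrative values only)
embarked = pd.Series(['S', 'C', 'S', None, 'Q', 'S', None])

# mode() skips missing values and returns the most frequent value(s)
most_frequent = embarked.mode()[0]

# Fill the gaps with that value instead of hard-coding it
embarked_filled = embarked.fillna(most_frequent)
print(most_frequent, embarked_filled.isnull().sum())
```

If two values tie for the highest count, `mode()` returns both and `[0]` picks the first in sorted order, so it is worth checking `value_counts()` as well.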

In [1]:
## check how Embarked is distributed
print(train['Embarked'].value_counts())

In [1]:
### no clear relation found for Embarked, so fill the missing values with 'S', which is the most frequent value
train['Embarked'] = train['Embarked'].fillna('S')
#train[train['Embarked'].isnull()]

In [1]:
##### Column Cabin 
##### Cabin has a lot of distinct values, so in its raw form it is of little use
##### for building a prediction model; instead we can apply a bit of feature engineering
print(train['Cabin'].value_counts())

In [1]:
#### to make this column useful, let's keep only the first letter of the Cabin value
train.loc[:,'Cabin'] = train.loc[:,'Cabin'].str[0]
test.loc[:,'Cabin'] = test.loc[:,'Cabin'].str[0]


By general observation, we can assume that Cabin and Fare are related to each other.

In [1]:
train[['Cabin','Fare']].groupby(['Cabin'],as_index=False).mean()

In [1]:

sns.catplot(x='Cabin',y='Fare',data=train,kind='bar',ci=None)

In [1]:
sns.catplot(x='Cabin',y='Fare',data=test,kind='swarm')

To fill the null values for Cabin we use Fare as a predictor variable, based on the data we have. The function below fills the null values by assigning a Cabin according to Fare.
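The same fare-to-cabin rule can be expressed as a lookup over fare bands. This is only a sketch of an alternative, assuming the same cut-offs as the function in the next cell and using `pd.cut`; the fares are made-up examples:

```python
import numpy as np
import pandas as pd

# Band edges and cabin letters, mirroring the thresholds of calc_cabin_by_fare
bins = [-np.inf, 15, 19, 35, 40, 45, 57, 100, np.inf]
labels = ['G', 'F', 'T', 'A', 'E', 'D', 'C', 'B']

# Illustrative fares only
fares = pd.Series([7.25, 15.0, 15.01, 36.0, 99.0, 512.33])

# right=True (the default) makes each band include its upper edge, like the <= checks
cabin_by_fare = pd.cut(fares, bins=bins, labels=labels)
print(cabin_by_fare.tolist())
```

One difference worth knowing: `pd.cut` leaves a missing Fare as missing, whereas the if/elif chain sends it to the final `else` branch ('B').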

In [1]:
def calc_cabin_by_fare(df_train):
    # sns.catplot(x='Cabin',y='Fare',data = df_train,kind='bar')
    # plt.show()
    def calculate_cabin(row):
        if row['Fare'] <= 15:
            return 'G'
        elif row['Fare'] <= 19:
            return 'F'
        elif row['Fare'] <= 35:
            return 'T'
        elif row['Fare'] <= 40:
            return 'A'
        elif row['Fare'] <= 45:
            return 'E'
        elif row['Fare'] <= 57:
            return 'D'
        elif row['Fare'] <= 100:
            return 'C'
        else:
            return 'B'


    df_train.loc[df_train['Cabin'].isnull(), 'Cabin'] = df_train[df_train['Cabin'].isnull()].apply(calculate_cabin,
                                                                                                       axis=1)
    return df_train

train = calc_cabin_by_fare(train)
test = calc_cabin_by_fare(test)

In [1]:
### Column Age 
sns.distplot(train.loc[train['Survived']==1,'Age'].dropna(),color='blue',bins=40)
sns.distplot(train.loc[train['Survived']==0,'Age'].dropna(),color='yellow',bins=40)

For missing Age values we could either fill in the mean from the train data set, or drill down a bit more and instead take the mean of Age grouped by Sex and Cabin.
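Group-wise mean imputation can be sketched compactly with `groupby(...).transform('mean')`. The frame below is toy data with made-up ages, not the Titanic set, and this is one possible way to do it rather than the approach used in the next cell:

```python
import numpy as np
import pandas as pd

# Toy data: two (Sex, Cabin) groups, each with one missing Age
df = pd.DataFrame({
    'Sex':   ['male', 'male', 'male', 'female', 'female', 'female'],
    'Cabin': ['C', 'C', 'C', 'G', 'G', 'G'],
    'Age':   [40.0, 30.0, np.nan, 20.0, np.nan, 24.0],
})

# Mean Age of each (Sex, Cabin) group, broadcast back to every row of that group
group_mean = df.groupby(['Sex', 'Cabin'])['Age'].transform('mean')

# Keep the known ages and use the group mean only where Age is missing
df['Age'] = df['Age'].fillna(group_mean)
print(df['Age'].tolist())
```

For a separate test set, the group means should still come from the train data, which is why the notebook builds a lookup table from `train`.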

In [1]:

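## mean Age for every (Sex, Cabin) group, computed on the train data only
temp = train[['Sex','Cabin','Age']].groupby(['Sex','Cabin'],as_index=False).mean()

def find_mean_age(Sex,Cabin):
    return temp.loc[(temp['Sex']==Sex)&(temp['Cabin']==Cabin),'Age'].tolist()[0] 


## apply the lookup only to the rows where Age is missing
train_age_null = train['Age'].isnull()
train.loc[train_age_null,'Age'] = train[train_age_null].apply(lambda row:find_mean_age(row['Sex'],row['Cabin']),
                                                              axis=1)

## the test data is filled with the group means from train
test_age_null = test['Age'].isnull()
test.loc[test_age_null,'Age'] = test[test_age_null].apply(lambda row:find_mean_age(row['Sex'],row['Cabin']),
                                                           axis=1)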

In [1]:
## column - Fare 
## fill the missing Fare in test with the mean Fare from train

test.loc[test['Fare'].isnull(),]

test.loc[test['Fare'].isnull(),['Fare']] = np.mean(train['Fare'])


Along with handling the null values, categorical variables need to be converted to numerical values, as many models do not support categorical variables as input.
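There is more than one way to make a categorical column numeric. The sketch below contrasts integer codes (the approach used in this notebook) with one-hot columns via `pd.get_dummies`, on a toy frame. One-hot encoding is shown only as an alternative; it is not what the notebook does:

```python
import pandas as pd

# Toy frame with illustrative values
df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# Integer codes: compact, but they imply an ordering (S < C < Q) that does not really exist
codes = df['Embarked'].map({'S': 1, 'C': 2, 'Q': 3})

# One-hot columns: one 0/1 indicator per category, with no implied ordering
one_hot = pd.get_dummies(df['Embarked'], prefix='Embarked').astype(int)

print(codes.tolist())
print(one_hot.columns.tolist())
```

Distance-based models such as KNN are more sensitive to the implied ordering of integer codes than tree-based models are.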

In [1]:
# Embarked
emb = {'S':1,'C':2,'Q':3}
train['Embarked']=train['Embarked'].replace(emb)
test['Embarked']=test['Embarked'].replace(emb)


The Name attribute itself is unlikely to have any significance for model building, but the title within the name might. For example, a Captain might have a lower chance of surviving than a Mr or Ms. So below we extract the title and then encode it numerically.
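To see what the title extraction does, here is a small sketch on a few sample names in the "Surname, Title. Given names" format. The regular expression captures the run of letters that follows a space and ends with a period:

```python
import pandas as pd

# A few sample names in the format used by the Name column
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
])

# A raw string keeps the backslash in \. from being treated as a string escape
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.tolist())
```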

In [1]:
### Extract the title from Name and store this in a new column
## raw strings are used so that the \. in the pattern is not treated as a string escape
train['Title']= train['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
test['Title'] = test['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

train['Title'].value_counts()

In [1]:
## titles with very low frequency are renamed as Rare
train['Title'] = train['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', \
                                                 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

test['Title'] = test['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', \
                                             'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

train['Title'] = train['Title'].replace('Mlle', 'Miss')
train['Title'] = train['Title'].replace('Ms', 'Miss')
train['Title'] = train['Title'].replace('Mme', 'Mrs')

test['Title'] = test['Title'].replace('Mlle', 'Miss')
test['Title'] = test['Title'].replace('Ms', 'Miss')
test['Title'] = test['Title'].replace('Mme', 'Mrs')

In [1]:
# convert titles into numbers
titles = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
train['Title'] = train['Title'].map(titles)
# filling NaN with 0, to be safe
train['Title'] = train['Title'].fillna(0)


test['Title'] = test['Title'].map(titles)
# filling NaN with 0, to be safe
test['Title'] = test['Title'].fillna(0)

In [1]:
## convert the Sex variable to numeric

train['Sex_var'] = np.where(train['Sex'] == 'male', 1, 0)
test['Sex_var'] = np.where(test['Sex'] == 'male', 1, 0)

In [1]:
## convert Cabin to Numeric

encode = LabelEncoder()
train['Cabin'] = encode.fit_transform(train['Cabin'])
test['Cabin'] = encode.transform(test['Cabin'])

**c) Visualization of the data set** - Now that we have imputed the missing values, let's look at the relation between the dependent variable (Survived) and the independent variables. Only the attributes which have an impact on survival will be used to train the model.


**d) Feature Engineering** - As we move along we will also explore creating new features.

Plot the correlation between all variables, including Survived, as a heat map.


In [1]:
plt.figure(figsize=(14,12))
plt.title('Correlation of Features')
cor = train.corr()
sns.heatmap(cor, cmap="YlOrRd", annot=True)
plt.show()

From the heat map above, SibSp and Parch don't seem to have much bearing on whether a person survived. Let's try combining the two and see how it goes.

In [1]:
### create family count
train['fmly_count'] = train['SibSp']+train['Parch']
test['fmly_count'] = test['SibSp']+test['Parch']

##create is_alone

train['is_alone'] = np.where((train['SibSp'] + train['Parch']) == 0, 1, 0)
test['is_alone'] = np.where((test['SibSp'] + test['Parch']) == 0, 1, 0)




In [1]:

### let's draw the correlation matrix again to include the new features

cor = train.corr()
plt.figure(figsize=(14,12))
plt.title('Correlation of Features')
sns.heatmap(cor, cmap="YlOrRd", annot_kws={'fontsize':8},annot=True,linewidths=0.1,vmax=1.0)
plt.show()


In [1]:
#print(train.dtypes)
#col_list=['Pclass','Sex_var','Age','Fare','Cabin','fmly_count','is_alone']
## selecting the column names that we would use for prediction modelling 
col_list = ['Pclass', 'Sex_var', 'Age', 'Fare', 'Cabin', 'fmly_count','Title']
X = train[col_list].copy()
y = train['Survived'].copy()

test_orig = test.copy()
test = test[col_list]
#print(test.describe())

For prediction modelling you want to measure your model's prediction accuracy. To do this, it is common practice to divide your data set into 2 parts:

Train data -> data fed to the model during training

Test data -> held-out data used to measure the accuracy of the model trained on the train data

In some cases we might also keep a further unseen set of data, called the validation data set, which is used to test the final prediction model's accuracy.
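A three-way split can be sketched by calling `train_test_split` twice. The arrays below are synthetic stand-ins for the features and target, and the 20% / 25% proportions are assumptions chosen for illustration, not values used elsewhere in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target (100 rows)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

# First hold back 20% as the final unseen set
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100)

# Then split the remainder 75/25 into train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=100)

print(len(X_tr), len(X_te), len(X_holdout))
```

The held-back set is only touched once, after all model and parameter choices have been made.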

In [1]:
## by default the division is 75% train data and 25% test data
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=100)

**KNN (K nearest neighbours)** - a simple model to apply to this problem.

Important parameters - n_neighbors (this tells the algorithm how many nearest neighbours to consider). There are other parameters which can be used to tweak the model, but discussing those is beyond the scope here. Similarly, for the other models we tweak only the most significant parameters for our problem's context.


Check the example below: depending on the value of n_neighbors, the new point would be classified differently.

![image.png](attachment:image.png)
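The effect of n_neighbors can also be reproduced on a tiny toy data set. In the sketch below (one made-up feature, five made-up points) the nearest single neighbour belongs to class 0, but two of the three nearest neighbours belong to class 1, so the prediction flips between k=1 and k=3:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Five toy points on a line with their classes (illustrative values)
X_toy = np.array([[2.5], [4.0], [4.2], [0.0], [8.0]])
y_toy = np.array([0, 1, 1, 0, 1])

# The new point to classify
new_point = np.array([[3.0]])

# k=1: only the closest point (2.5, class 0) votes
pred_k1 = KNeighborsClassifier(n_neighbors=1).fit(X_toy, y_toy).predict(new_point)[0]

# k=3: the points 2.5 (class 0), 4.0 and 4.2 (class 1) vote
pred_k3 = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy).predict(new_point)[0]

print(pred_k1, pred_k3)
```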

In [1]:
## repeated import here 
## just to increase readability

from sklearn.neighbors import KNeighborsClassifier

scale = MinMaxScaler()
X_scaled=pd.DataFrame(scale.fit_transform(X),columns=X.columns)
## scale the test data with the scaler already fitted on the train data (transform only, no refit)
test_scaled = pd.DataFrame(scale.transform(test), columns=test.columns)
test_scaled.index=test.index

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y,random_state=100)



## fit the model 
knn_model = KNeighborsClassifier(n_neighbors=6).fit(X_train,y_train)
knn_score = knn_model.score(X_test,y_test)
print('Score of fitted model is: ')
print(knn_score)

### let's get the predictions for the test data now
predict_df = pd.DataFrame()
predict_df['PassengerId'] = test.index
predict_df['Survived'] = knn_model.predict(test_scaled)
#predict_df.to_csv('submission100_knn_nvalue_6.csv', index=False)
score_df = score_df.append({'Model_Name':'K Nearest Neighbour','Score':knn_score},ignore_index=True)


Based on our model, let's look at the predicted survival rate in the test set.

In [1]:
temp = test.copy()
temp['Survived'] = predict_df['Survived'].tolist() 
print(temp)
print('Percentage of People who Survived - %.3f'%(temp['Survived'].mean()*100))

The output of an ML model depends heavily on its parameters, hence parameter tuning plays an important role. With proper parameter values set, a model's performance can improve noticeably.

Above we used a single parameter for fitting the KNN model, n_neighbors, with a value of 6. It is worth showing how that value was reached, i.e. how we know that a value of 6 gives a relatively optimal model.

In [1]:
from sklearn.model_selection import validation_curve

### this is how you do a validation curve
train_score,test_score = validation_curve(KNeighborsClassifier(),X,y,param_name='n_neighbors',param_range=range(1,11),cv=5)
train_score_mean = np.mean(train_score,axis=1)
test_score_mean = np.mean(test_score, axis=1)
plt.plot(range(1,11),train_score_mean,'-o',label='train score')
plt.plot(range(1, 11), test_score_mean, '-o', label='validation score')
plt.xlabel('N_neighbours values')
plt.ylabel('Accuracy')
plt.title('Variation of Accuracy with input parameter n_neighbour')
plt.legend(loc='best')
plt.show()


**Logistic Regression** -

Despite its name, logistic regression is used for classification. It is similar to linear regression, the only difference being that before giving the output it passes the linear combination through a function (the sigmoid) which gives an output between 0 and 1.
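The function that maps the linear output into the range between 0 and 1 is the sigmoid. A minimal sketch with NumPy follows; the weights, bias and input are hypothetical numbers picked for illustration, not values learned from the Titanic data:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and one hypothetical input row
weights = np.array([0.8, -0.5])
bias = 0.1
x = np.array([2.0, 1.0])

# Same linear combination as in linear regression ...
linear_output = np.dot(weights, x) + bias

# ... passed through the sigmoid to get a probability for class 1
probability = sigmoid(linear_output)

# Classify as 1 when the probability reaches 0.5
predicted_class = int(probability >= 0.5)
print(linear_output, probability, predicted_class)
```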

Important parameters - penalty: the norm used for regularization (l1, l2, etc.)
C - inverse of the regularization strength; higher values mean a weaker penalty and can lead to overfitting

It is worth adding that we use the GridSearchCV utility for hyperparameter optimization. While training a model you want to tweak the input parameters to achieve the best accuracy. You could do this manually by passing different values to the input parameters and checking the result, but that is time consuming. To avoid this, we use GridSearchCV, which tests the accuracy over combinations of multiple input values and also uses cross validation to give more robust results.
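To make clear what the grid search automates, here is a hand-written version of the same idea for a single parameter: loop over candidate values, cross-validate each one, and keep the best. It is a sketch on synthetic data from `make_classification`, not the Titanic features, and the candidate list is an assumption for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic classification data, used only for this illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=100)

candidate_C = [0.01, 0.1, 1, 10, 100]
mean_scores = {}
for C in candidate_C:
    model = LogisticRegression(C=C, max_iter=1000)
    # mean accuracy over 3 cross-validation folds for this value of C
    mean_scores[C] = cross_val_score(model, X_demo, y_demo, cv=3).mean()

# The value of C with the highest cross-validated accuracy
best_C = max(mean_scores, key=mean_scores.get)
print(best_C, mean_scores[best_C])
```

With several parameters the loop becomes nested, which is the bookkeeping that GridSearchCV takes care of.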

In [1]:
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import GridSearchCV

## the liblinear solver supports both the l1 and l2 penalties searched below
lr_model = LogisticRegression(solver='liblinear',random_state=100)
grid_values = {'penalty':['l1', 'l2'],'C':[0.01, 0.1, 1, 10, 100]}
grid_search_model = GridSearchCV(lr_model,param_grid=grid_values,cv=3)
grid_search_model.fit(X_scaled,y)
print(grid_search_model.best_estimator_)
print('Model Accuracy')
print(grid_search_model.best_score_)
print(grid_search_model.best_params_)
predict_df = pd.DataFrame()
predict_df['PassengerId'] = test_scaled.index
predict_df['Survived'] = grid_search_model.predict(test_scaled)
predict_df.to_csv('submission101_lr_gsearch_opt.csv', index=False)
score_df = score_df.append({'Model_Name':'LR - with Grid Search','Score':grid_search_model.best_score_},ignore_index=True)


**Support Vector Machine**

This is similar to logistic regression but applies a different objective function: while logistic regression aims at modelling probabilities, this model is more inclined towards finding a hyperplane with a wide margin between the classes.
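The difference in objective can be illustrated by comparing the two per-sample losses. The sketch below assumes the standard formulation with labels in {-1, +1}: the hinge loss usually associated with soft-margin SVMs and the logistic loss. The scores are made-up values of the decision function:

```python
import numpy as np

def hinge_loss(y, score):
    # Zero once the point is on the correct side with a margin of at least 1
    return np.maximum(0.0, 1.0 - y * score)

def logistic_loss(y, score):
    # Never exactly zero: it keeps rewarding more confident correct scores
    return np.log1p(np.exp(-y * score))

# Three positive samples with decreasing (hypothetical) decision scores
y = np.array([1, 1, 1])
scores = np.array([2.0, 0.5, -0.5])

hinge = hinge_loss(y, scores)
logistic = logistic_loss(y, scores)
print(hinge.tolist())
print(logistic.round(3).tolist())
```

The hinge loss ignores points that are already beyond the margin, which is why only the points near the boundary (the support vectors) shape the SVM solution.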

C-> Regularization Parameter


In [1]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV


svc_model = SVC(C=100,random_state=100).fit(X_train,y_train)
scv_score = svc_model.score(X_test,y_test) 

print('SVC Model score is ')
print(scv_score)

predict_df = pd.DataFrame()
predict_df['PassengerId'] = test_scaled.index
predict_df['Survived'] = svc_model.predict(test_scaled)
#predict_df.to_csv('submission101_lr_gsearch_opt.csv', index=False)
score_df = score_df.append({'Model_Name':'SVC Model - C 100','Score':scv_score},ignore_index=True)


**Gaussian NB**


In [1]:
from sklearn.naive_bayes import GaussianNB

X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=100)
gaussnb_model = GaussianNB().fit(X_train,y_train)
gaussnb_score = gaussnb_model.score(X_test,y_test)
print('Naive Bayes Accuracy')
print(gaussnb_score)
gb_model_pred = gaussnb_model.predict(test)
## record the Naive Bayes score here (not the SVC score)
score_df = score_df.append({'Model_Name':'Gaussian Naive Bayes','Score':gaussnb_score},ignore_index=True)



**Decision Trees** - As the name suggests, this model builds a tree structure and reaches the classification result through a series of if/else questions.
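The if/else structure can be printed directly with scikit-learn's `export_text`. The sketch below fits a small tree on toy data; the two columns reuse the names Sex_var and Pclass, but the rows are made up for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy rows: [Sex_var, Pclass]; in this made-up data the target depends only on Sex_var
X_toy = [[1, 3], [1, 1], [0, 3], [0, 1], [1, 2], [0, 2]]
y_toy = [0, 0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=100).fit(X_toy, y_toy)

# Text rendering of the learned if/else questions
rules = export_text(tree, feature_names=['Sex_var', 'Pclass'])
print(rules)
```

Each indented line in the printed output is one question on a feature threshold, and each leaf line gives the predicted class.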

In [1]:
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier(max_depth=3,random_state=100).fit(X_train,y_train)
dt_model_score = dt_model.score(X_test,y_test)
print('Decision Tree Accuracy')
print(dt_model_score)
score_df = score_df.append({'Model_Name':'Decision Tree','Score':dt_model_score},ignore_index=True)


**Ensemble Models**

**Gradient Boosting Classifier**

Important Parameters - learning_rate, n_estimators


In [1]:
from sklearn.ensemble import GradientBoostingClassifier

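## score on the held-out split with a model that has not seen X_test
gbf_params = {'learning_rate':0.001,'n_estimators':3000}
gbf_score = GradientBoostingClassifier(**gbf_params).fit(X_train,y_train).score(X_test,y_test)
## refit on the complete data set for the final predictions
gbf_model = GradientBoostingClassifier(**gbf_params).fit(X,y)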
print('GBF score -')
print(gbf_score)
gbf_model_pred = gbf_model.predict(test)
score_df = score_df.append({'Model_Name':'GradientBoostingClassifier','Score':gbf_score},ignore_index=True)


predict_df = pd.DataFrame()
predict_df['PassengerId'] = test.index
predict_df['Survived'] = gbf_model.predict(test)
predict_df.to_csv('submission_temp.csv', index=False)

### some more file cleaning: strip the trailing newline from the submission file
with open('submission_temp.csv', 'rb') as src:
    file_data = src.read()
with open('submission_gbfc_model.csv', 'wb') as dst:
    dst.write(file_data.rstrip(b'\r\n'))

**Random Forest Classifier**


In [1]:
from sklearn.ensemble import RandomForestClassifier

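## score on the held-out split with a model that has not seen X_test
rf_score = RandomForestClassifier(n_estimators=1000,max_depth=7).fit(X_train,y_train).score(X_test,y_test)
## refit on the complete data set for the final predictions
rf_model= RandomForestClassifier(n_estimators=1000,max_depth=7)
rf_model.fit(X,y)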
print('RF Model Score')
print(rf_score)
score_df = score_df.append({'Model_Name':'RandomForestClassifier','Score':rf_score},ignore_index=True)
rf_model_pred = rf_model.predict(test)



predict_df = pd.DataFrame()
predict_df['PassengerId'] = test.index
predict_df['Survived'] = rf_model.predict(test)
predict_df.to_csv('submission_temp.csv', index=False)

### some more file cleaning: strip the trailing newline from the submission file
with open('submission_temp.csv', 'rb') as src:
    file_data = src.read()
with open('submission_rf_model.csv', 'wb') as dst:
    dst.write(file_data.rstrip(b'\r\n'))

Checking the final scores

In [1]:
print(score_df[['Model_Name','Score']].sort_values(by='Score'))

Voting approach - combine several models and pick the output with the majority of votes.

Use voting across:
1) Gaussian NB model
2) Random Forest model
3) Gradient Boosting model
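With three models and a 0/1 target, the majority vote is simply "at least two models say 1". A vectorized sketch with NumPy follows; the three prediction arrays are hypothetical, not the notebook's actual model outputs:

```python
import numpy as np

# Hypothetical 0/1 predictions from three models for five passengers
pred_a = np.array([1, 0, 1, 0, 1])
pred_b = np.array([1, 1, 0, 0, 1])
pred_c = np.array([0, 1, 1, 0, 1])

# Number of models voting "survived" for each passenger
votes = pred_a + pred_b + pred_c

# Majority of three means at least two votes
majority = (votes >= 2).astype(int)
print(majority.tolist())
```

Using an odd number of models avoids ties for a binary target.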

In [1]:
final_pred = []
for i in range(0,len(test)):
   final_pred.append(s.mode(np.array([rf_model_pred[i],gb_model_pred[i],gbf_model_pred[i]])))

predict_df = pd.DataFrame()
predict_df['PassengerId'] = test.index
predict_df['Survived'] = final_pred
predict_df.to_csv('submission_temp.csv', index=False)

### some more file cleaning: strip the trailing newline from the submission file
with open('submission_temp.csv', 'rb') as src:
    file_data = src.read()
with open('submission_final.csv', 'wb') as dst:
    dst.write(file_data.rstrip(b'\r\n'))

**Please upvote if you found this notebook useful**