<img src='https://www.maritimecyprus.com/wp-content/uploads/2015/10/titanic-infographic-696x431.jpg'>

# The Challenge: Titanic - Machine Learning from Disaster

> The sinking of the Titanic is one of the most infamous shipwrecks in history.
> 
> On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
> 
> While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
> 
> In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc). 

# Loading The Data

In [None]:
#for data processing
import numpy as np 
import pandas as pd

#for visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from collections import Counter
from statsmodels.stats.outliers_influence import variance_inflation_factor

import warnings
warnings.filterwarnings("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#Load the data
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")

In [None]:
train_data.head(2)

In [None]:
test_data.head(2)

In [None]:
#Concatenating train and test for easy EDA
train_data['train_or_test']='train'
test_data['train_or_test']='test'
all=pd.concat([train_data,test_data],sort=False)

#Resetting index, removing old index
all.reset_index(inplace=True)
all.drop('index',axis=1,inplace=True)

In [None]:
all.head(2)

# Exploratory Data Analysis

Target variable: Survived (1/0), Potential Predictors: All Others

## Univariate Analysis

#### Why Univariate Analysis?

Univariate analysis is the simplest form of analyzing data. “Uni” means “one”, so in other words each analysis looks at only one variable. It doesn't deal with causes or relationships (unlike regression) and its major purpose is to describe: it takes data, summarizes that data and finds patterns in it.

<a> https://www.statisticshowto.com/univariate/ </a>

In [None]:
all.info()

In [None]:
all.describe()

In [None]:
#Visualization to check for missing values
sns.heatmap(all.isnull())

Age and Cabin have many rows with missing values, while Fare and Embarked have only a few. Survived is missing only for the rows that come from the test data, which is expected.
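
The heatmap gives a visual impression; a per-column count makes the same point in numbers. Here is a minimal sketch on a small made-up frame (`toy` and its values are assumptions for illustration, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Made-up toy frame standing in for the combined train/test data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count and percentage of missing values per column
summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (100 * toy.isnull().mean()).round(1),
})

# Columns with the most missing values first
summary = summary.sort_values("missing", ascending=False)
print(summary)
```

The same two lines should work on the notebook's combined frame in place of `toy`.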

In [None]:
#Survived
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='train_or_test',data=all)

In [None]:
#Pclass
all['Pclass'].value_counts()

In [None]:
groupby_df = all[all['train_or_test']=='train'].groupby(['Pclass', 'Survived']).agg({'Survived': 'count'})
groupby_pcts = groupby_df.groupby(level=0).apply(lambda x:round(100 * x / x.sum(),2))
groupby_df,groupby_pcts

Clearly Pclass=1 has a higher chance of survival (~63%) than Pclass=2 (~47%) and Pclass=3 (~24%).
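
Because Survived is coded 0/1, the survival rate of a group is simply the mean of that column within the group. A short sketch of this shortcut, using made-up rows rather than the real training data (`toy` is a hypothetical name):

```python
import pandas as pd

# Made-up toy rows; the notebook would use the training part of the data
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate
survival_pct = (100 * toy.groupby("Pclass")["Survived"].mean()).round(2)
print(survival_pct)
```

This gives one percentage per class directly, without building the count table first.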

In [None]:
#Name
all['Name'].value_counts()

In [None]:
sum(all['Name'].value_counts()>1)

In [None]:
all[(all['Name']=='Kelly, Mr. James') | (all['Name']=='Connolly, Miss. Kate')]

In [None]:
#Sex
all['Sex'].value_counts()

In [None]:
groupby_df = all[all['train_or_test']=='train'].groupby(['Sex', 'Survived']).agg({'Survived': 'count'})
groupby_pcts = groupby_df.groupby(level=0).apply(lambda x:round(100 * x / x.sum(),2))
groupby_df,groupby_pcts

Female passengers have a higher chance of survival (~74%) than male passengers (~19%).

In [None]:
#Age
sns.boxplot(all['Age'])

In [None]:
sns.distplot(all[all['Survived']==0]['Age'],bins=30,color='blue')
sns.distplot(all[all['Survived']==1]['Age'],bins=30,color='red')

Younger passengers appear to have a higher chance of survival, and the ages are roughly normally distributed.

In [None]:
#SibSp
sns.countplot(x='SibSp',data=all)

In [None]:
groupby_df = all[all['train_or_test']=='train'].groupby(['SibSp', 'Survived']).agg({'Survived': 'count'})
groupby_pcts = groupby_df.groupby(level=0).apply(lambda x:round(100 * x / x.sum(),2))
groupby_df,groupby_pcts

Passengers with 1 or 2 siblings/spouses aboard have a higher chance of survival.

In [None]:
#Parch
sns.countplot(x='Parch',data=all)

In [None]:
groupby_df = all[all['train_or_test']=='train'].groupby(['Parch', 'Survived']).agg({'Survived': 'count'})
groupby_pcts = groupby_df.groupby(level=0).apply(lambda x:round(100 * x / x.sum(),2))
groupby_df,groupby_pcts

Passengers with 1-3 parents/children aboard have a higher chance of survival.

In [None]:
#Ticket
all['Ticket'].value_counts()

In [None]:
sum(all['Ticket'].value_counts()>1)

In [None]:
#Fare
sns.boxplot(all['Fare'])

In [None]:
sns.distplot(all[all['Survived']==0]['Fare'],bins=30,color='blue')
sns.distplot(all[all['Survived']==1]['Fare'],bins=30,color='red')

Passengers who paid a higher fare appear to have a higher chance of survival.

In [None]:
#Cabin
all['Cabin'].value_counts()

In [None]:
sum(all['Cabin'].value_counts()>1)

In [None]:
all['Embarked'].value_counts()

In [None]:
groupby_df = all[all['train_or_test']=='train'].groupby(['Embarked', 'Survived']).agg({'Survived': 'count'})
groupby_pcts = groupby_df.groupby(level=0).apply(lambda x:round(100 * x / x.sum(),2))
groupby_df,groupby_pcts

Passengers who embarked at C (Cherbourg) have a higher chance of survival.

## Bivariate Analysis

#### Why Bivariate Analysis?

Bivariate analysis is one of the simplest forms of quantitative (statistical) analysis. It involves the analysis of two variables (often denoted as X, Y), for the purpose of determining the empirical relationship between them. Bivariate analysis can be helpful in testing simple hypotheses of association.

<a> https://www.statisticshowto.com/bivariate-analysis/ </a>

#### Bivariate Analysis Grouped by Nature of Variable
1. Continuous & Continuous
2. Categorical & Categorical
3. Categorical & Continuous

In [None]:
all.dtypes

In [None]:
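#Correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(all.select_dtypes('number').corr(),annot=True)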

##### Continuous & Continuous

In [None]:
sns.jointplot(x='Age',y='Fare',data=all,kind='kde')

In [None]:
#Correlation
all['Age'].corr(all['Fare'])

##### Categorical & Categorical

In [None]:
from scipy.stats import chi2

In [None]:
def chi_test(df,col1,col2):
    
    #Contingency Table
    contingency_table=pd.crosstab(df[col1],df[col2])
    #print('contingency_table :-\n',contingency_table)

    #Observed Values
    Observed_Values = contingency_table.values 
    #print("\nObserved Values :-\n",Observed_Values)

    #Expected Values
    import scipy.stats
    b=scipy.stats.chi2_contingency(contingency_table)
    Expected_Values = b[3]
    #print("\nExpected Values :-\n",Expected_Values)

    #Degrees of Freedom = (rows-1)*(columns-1) over the whole table
    #(named dof so that it does not overwrite the dataframe argument df)
    no_of_rows,no_of_columns=contingency_table.shape
    dof=(no_of_rows-1)*(no_of_columns-1)
    #print("\nDegree of Freedom:-",dof)

    #Significance Level 5%
    alpha=0.05
    #print('\nSignificance level: ',alpha)

    #chi-square statistic - χ2, summed over every cell of the table
    chi_square_statistic=((Observed_Values-Expected_Values)**2/Expected_Values).sum()
    #print("\nchi-square statistic:-",chi_square_statistic)

    #critical_value
    critical_value=chi2.ppf(q=1-alpha,df=dof)
    #print('\ncritical_value:',critical_value)

    #p-value: probability under H0 of a chi-square value at least as large as chi_square_statistic
    p_value=chi2.sf(chi_square_statistic,dof)
    #print('\np-value:',p_value)

    #compare chi_square_statistic with critical_value, and p_value with alpha
    if chi_square_statistic>=critical_value:
        print("\nchi_square_statistic & critical_value - significant result, reject null hypothesis (H0), dependent.")
    else:
        print("\nchi_square_statistic & critical_value - not significant result, fail to reject null hypothesis (H0).")

    if p_value<=alpha:
        print("\np_value & alpha - significant result, reject null hypothesis (H0), dependent.")
    else:
        print("\np_value & alpha - not significant result, fail to reject null hypothesis (H0), independent.")

#### What is Chi Square Test?


Chi-Square Test: This test is used to assess the statistical significance of the relationship between two categorical variables. It tests whether the evidence in the sample is strong enough to generalize the relationship to a larger population. Chi-square is based on the difference between the expected and observed frequencies in one or more categories of the two-way table. It returns the probability (p-value) of the computed chi-square statistic under the chi-square distribution with the given degrees of freedom.

Probability close to 0: It indicates that the two categorical variables are dependent.

Probability close to 1: It indicates that the data are consistent with the two variables being independent.

Probability less than 0.05: It indicates that the relationship between the variables is significant at 95% confidence. The chi-square test statistic for a test of independence of two categorical variables is found by:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed frequency and E is the expected frequency under independence, summed over every cell of the table.
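
To make the test concrete, here is a small sketch that runs it on a made-up 2x2 table of counts (the numbers and labels are assumptions for illustration only). It uses `scipy.stats.chi2_contingency` and checks the result against the statistic computed by hand as the sum of (observed - expected)^2 / expected:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up counts for two categorical variables (rows: group, columns: outcome)
observed = pd.DataFrame([[30, 10], [20, 40]],
                        index=["A", "B"], columns=["yes", "no"])

# correction=False turns off Yates' continuity correction, which scipy applies
# by default to 2x2 tables, so the result matches the plain formula
stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

# Same statistic computed by hand from observed and expected counts
manual_stat = ((observed.values - expected) ** 2 / expected).sum()

print("statistic:", stat, "p-value:", p_value, "dof:", dof)
print("manual statistic:", manual_stat)
```

For a table with r rows and c columns the degrees of freedom are (r-1)*(c-1), so a 2x2 table has 1.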

In [None]:
#Sex & Pclass
chi_test(all,'Sex','Pclass')

In [None]:
#Sex & Parch
chi_test(all,'Sex','Parch')

In [None]:
#Sex & SibSp
chi_test(all,'Sex','SibSp')

In [None]:
#Sex & Embarked
chi_test(all,'Sex','Embarked')

In [None]:
#Pclass & SibSp
chi_test(all,'Pclass','SibSp')

In [None]:
#Pclass & Parch
chi_test(all,'Pclass','Parch')

In [None]:
#Pclass & Embarked
chi_test(all,'Pclass','Embarked')

In [None]:
#SibSp & Parch
chi_test(all,'SibSp','Parch')

In [None]:
#SibSp & Embarked
chi_test(all,'SibSp','Embarked')

In [None]:
#Parch & Embarked
chi_test(all,'Parch','Embarked')

All the categorical variables seem to be dependent on each other

##### Categorical & Continuous 

In [None]:
#Pclass & Age
sns.boxplot(x='Pclass',y='Age',data=all)

In [None]:
#Sex & Age
sns.boxplot(x='Sex',y='Age',data=all)

In [None]:
#Parch & Age
sns.boxplot(x='Parch',y='Age',data=all)

In [None]:
#SibSp & Age
sns.boxplot(x='SibSp',y='Age',data=all)

In [None]:
#Embarked & Age
sns.boxplot(x='Embarked',y='Age',data=all)

In [None]:
#Pclass & Fare
sns.boxplot(x='Pclass',y='Fare',data=all)

In [None]:
#Sex & Fare
sns.boxplot(x='Sex',y='Fare',data=all)

In [None]:
#Parch & Fare
sns.boxplot(x='Parch',y='Fare',data=all)

In [None]:
#SibSp & Fare
sns.boxplot(x='SibSp',y='Fare',data=all)

In [None]:
#Embarked & Fare
sns.boxplot(x='Embarked',y='Fare',data=all)

## Missing Value Treatment

In [None]:
all.isnull().sum()

In [None]:
#Filling Embarked with most common value
all['Embarked']=all['Embarked'].fillna('S')

#Filling Fare with mean(Fare)
all['Fare']=all['Fare'].fillna(all['Fare'].mean())

In [None]:
#Imputing Age
index_NaN_age = list(all[all["Age"].isnull()]["Age"].index)

for i in index_NaN_age :
    age_med = all["Age"].median()
    #median age of passengers with the same SibSp, Parch and Pclass as row i
    age_pred = all[((all['SibSp'] == all.loc[i,"SibSp"]) & (all['Parch'] == all.loc[i,"Parch"]) & (all['Pclass'] == all.loc[i,"Pclass"]))]["Age"].median()
    #assign with .loc to avoid chained assignment (the index was reset above, so labels match positions)
    if not np.isnan(age_pred) :
        all.loc[i,'Age'] = age_pred
    else :
        all.loc[i,'Age'] = age_med
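
The same idea (fill a missing age with the median age of similar passengers, falling back to the overall median) can also be written without a loop. This is only a sketch on made-up rows; `toy` and `Age_filled` are hypothetical names, and the results can differ slightly from the loop because all medians here are computed once from the observed ages:

```python
import numpy as np
import pandas as pd

# Made-up toy rows; column names mirror the notebook's
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 2],
    "SibSp":  [0, 0, 0, 1, 1, 0],
    "Parch":  [0, 0, 0, 2, 2, 0],
    "Age":    [40.0, 50.0, np.nan, 4.0, np.nan, np.nan],
})

# Median age within each (SibSp, Parch, Pclass) group, aligned to the rows
group_median = toy.groupby(["SibSp", "Parch", "Pclass"])["Age"].transform("median")

# Use the group median where it exists, otherwise the overall median
toy["Age_filled"] = toy["Age"].fillna(group_median).fillna(toy["Age"].median())
print(toy)
```

The last row has no similar passenger with a known age, so it falls back to the overall median.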

In [None]:
sns.heatmap(all.isnull())

## Outlier Detection

#### What are outliers, how to treat them?

##### What is an Outlier?

Outlier is a term commonly used by analysts and data scientists, because outliers need close attention; otherwise they can result in wildly wrong estimates. Simply put, an outlier is an observation that lies far away from, and diverges from, the overall pattern in a sample.

 
##### What are the types of Outliers?

Outliers can be of two types: univariate and multivariate. Univariate outliers can be found by looking at the distribution of a single variable. Multivariate outliers are outliers in an n-dimensional space; to find them, you have to look at distributions in multiple dimensions.

For example, suppose we are studying the relationship between height and weight. Box plots of height and of weight taken separately may show no outliers (no points more than 1.5*IQR beyond the quartiles, the most common rule). A scatter plot of height against weight can still reveal points that sit far from the overall pattern: each value is unremarkable on its own, but the combination of the two is unusual.

##### What causes Outliers?

Whenever we come across outliers, the ideal way to tackle them is to find out the reason of having these outliers. The method to deal with them would then depend on the reason of their occurrence. Causes of outliers can be classified in two broad categories:

   1. Artificial (Error) / Non-natural
   2. Natural.

Let’s understand various types of outliers in more detail:

   * Data Entry Errors: Human errors during data collection, recording, or entry can cause outliers in data.
    
   * Measurement Error: It is the most common source of outliers. This is caused when the measurement instrument used turns out to be faulty. For example: There are 10 weighing machines. 9 of them are correct, 1 is faulty. Weight measured by people on the faulty machine will be higher / lower than the rest of people in the group. The weights measured on faulty machine can lead to outliers.
    
   * Experimental Error: Another cause of outliers is experimental error. For example: In a 100m sprint of 7 runners, one runner missed out on concentrating on the ‘Go’ call which caused him to start late. Hence, this caused the runner’s run time to be more than other runners. His total run time can be an outlier.
    
   * Intentional Outlier: This is commonly found in self-reported measures that involve sensitive data. For example: Teens would typically under-report the amount of alcohol that they consume. Only a fraction of them would report the actual value. Here the actual values might look like outliers because the rest of the teens are under-reporting their consumption.
    
   * Data Processing Error: Whenever we perform data mining, we extract data from multiple sources. It is possible that some manipulation or extraction errors may lead to outliers in the dataset.
    
   * Sampling error: For instance, we have to measure the height of athletes. By mistake, we include a few basketball players in the sample. This inclusion is likely to cause outliers in the dataset.
    
   * Natural Outlier: When an outlier is not artificial (due to error), it is a natural outlier. For instance, at an insurance company the performance of the top 50 financial advisors may be far higher than that of the rest of the population, without this being due to any error. In such a case, this segment is better treated separately in any data mining activity involving advisors.

 
##### What is the impact of Outliers on a dataset?

Outliers can drastically change the results of the data analysis and statistical modeling. There are numerous unfavourable impacts of outliers in the data set:

* They increase the error variance and reduce the power of statistical tests
* If the outliers are non-randomly distributed, they can decrease normality
* They can bias or influence estimates that may be of substantive interest
* They can also violate the basic assumptions of regression, ANOVA and other statistical models.

##### How to detect Outliers?

The most commonly used method to detect outliers is visualization, using plots such as box plots, histograms and scatter plots (box plots were used for Age and Fare above). Some analysts also use various rules of thumb to detect outliers. Some of them are:

  * Any value below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR
  * Use capping methods. Any value outside the range of the 5th and 95th percentiles can be considered an outlier
  * Data points three or more standard deviations away from the mean are considered outliers
  * Outlier detection is merely a special case of the examination of data for influential data points, and it also depends on the business understanding
  * Bivariate and multivariate outliers are typically measured using either an index of influence or leverage, or distance. Popular indices such as Mahalanobis’ distance and Cook’s D are frequently used to detect outliers.
  * In SAS, we can use PROC Univariate, PROC SGPLOT. To identify outliers and influential observations, we also look at statistical measures like STUDENT, COOKD, RSTUDENT and others.
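
Two of the rules above, the 1.5 x IQR rule and the three standard deviations rule, can be sketched in a few lines of NumPy. The sample below is made up for illustration. Note that the standard deviation rule needs a reasonably large sample: with only a handful of points, a single extreme value inflates the standard deviation so much that it may never reach three.

```python
import numpy as np

# Made-up sample: 19 ordinary values and one obviously extreme value
values = np.array([10, 12, 11, 13, 12, 11, 14, 12, 13, 12,
                   11, 13, 12, 10, 14, 12, 11, 13, 12, 95], dtype=float)

# IQR rule: flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Standard deviation rule: flag values 3 or more standard deviations from the mean
z_scores = (values - values.mean()) / values.std()
z_mask = np.abs(z_scores) >= 3

print("IQR rule flags:", values[iqr_mask])
print("z-score rule flags:", values[z_mask])
```

Both rules are univariate; they look at one column at a time.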

##### How to remove Outliers?

Most of the ways to deal with outliers are similar to the methods for missing values: deleting observations, transforming them, binning them, treating them as a separate group, imputing values and other statistical methods. Here, we will discuss the common techniques used to deal with outliers:

* Deleting observations: We delete outlier values if they are due to a data entry error or data processing error, or if the outlier observations are very few in number. We can also use trimming at both ends to remove outliers.

* Transforming and binning values: Transforming variables can also eliminate outliers. The natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation. Decision tree algorithms deal with outliers well because they effectively bin the variables. We can also assign weights to different observations.
 
* Imputing: Like imputation of missing values, we can also impute outliers. We can use mean, median or mode imputation. Before imputing values, we should analyse whether the outlier is natural or artificial. If it is artificial, we can go ahead with imputing values. We can also use a statistical model to predict the values of outlier observations and then impute them with the predicted values.
 
* Treat separately: If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat the two groups as different populations, build an individual model for each group and then combine the output.
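
As a rough illustration of two of the treatments above, the sketch below applies a log transform and percentile capping to a made-up, right-skewed sample. The values and the 5th/95th percentile limits are assumptions for illustration, not choices made in this notebook:

```python
import numpy as np

# Made-up right-skewed values, e.g. fares with one very large entry
values = np.array([7.0, 8.0, 10.0, 13.0, 26.0, 30.0, 80.0, 512.0])

# Log transform: log1p(x) = log(1 + x) compresses large values and handles zeros
log_values = np.log1p(values)

# Capping: clip everything to the 5th-95th percentile range
low, high = np.percentile(values, [5, 95])
capped = np.clip(values, low, high)

print("ratio max/min before:", values.max() / values.min())
print("ratio max/min after log:", log_values.max() / log_values.min())
print("capped:", capped)
```

The log transform keeps every observation and can be reversed with `np.expm1`, while capping changes the extreme values themselves.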

In [None]:
# Outlier detection 

def detect_outliers(df,n,features):
    """
    Takes a dataframe df of features and returns a list of the indices
    corresponding to the observations containing more than n outliers according
    to the Tukey method.
    """
    outlier_indices = []
    
    # iterate over features(columns)
    for col in features:
        # 1st quartile (25%)
        Q1 = np.percentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col],75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        
        # outlier step
        outlier_step = 1.5 * IQR
        
        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step )].index
        
        # append the found outlier indices for col to the list of outlier indices 
        outlier_indices.extend(outlier_list_col)
        
    # select observations containing more than n outliers
    outlier_indices = Counter(outlier_indices)        
    multiple_outliers = list( k for k, v in outlier_indices.items() if v > n )
    
    return multiple_outliers   

# detect outliers from Age, SibSp , Parch and Fare
Outliers_to_drop = detect_outliers(all[all['train_or_test']=='train'],2,["Age","SibSp","Parch","Fare"])

In [None]:
all.loc[Outliers_to_drop] # Show the outliers rows

In [None]:
all = all.drop(Outliers_to_drop, axis = 0).reset_index(drop=True)

So far, we have covered data exploration, missing value treatment, and outlier detection and treatment. These 3 stages improve the raw data in terms of information availability and accuracy. The final stage of data preparation is feature engineering.

# Feature Engineering

##### Pclass

In [None]:
all['Pclass'].value_counts() #No Feature Engineering

##### Fare

In [None]:
all['Fare'].median() #Before Feature Engineering

In [None]:
all['FareBand'] = pd.qcut(all['Fare'], 4)

In [None]:
all['FareBand'].value_counts()

In [None]:
all.loc[ all['Fare'] <= 7.9, 'Fare'] = 0
all.loc[(all['Fare'] > 7.9) & (all['Fare'] <= 14.4), 'Fare'] = 1
all.loc[(all['Fare'] > 14.4) & (all['Fare'] <= 30.5), 'Fare']   = 2
all.loc[ all['Fare'] > 30.5, 'Fare'] = 3

In [None]:
all['Fare'].value_counts()

Converted Fare into bands
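
A possible shortcut, sketched here on made-up fares: `pd.qcut` can return the band number directly when called with `labels=False`, so the cut points do not have to be typed by hand. The values below are assumptions for illustration:

```python
import pandas as pd

# Made-up fares; the notebook would use the Fare column instead
fares = pd.Series([5.0, 7.5, 8.0, 10.0, 15.0, 20.0, 40.0, 100.0])

# labels=False returns the quartile number (0 to 3) instead of an interval
fare_band = pd.qcut(fares, 4, labels=False)
print(fare_band.tolist())
```

Each band holds roughly the same number of passengers, because the cut points are quantiles of the data.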

##### Age

In [None]:
all['Age'].median() #Before Feature Engineering

In [None]:
all['AgeBand'] = pd.cut(all['Age'], 5)

In [None]:
all['AgeBand'].value_counts()

In [None]:
all.loc[ all['Age'] <= 16, 'Age'] = 0
all.loc[(all['Age'] > 16) & (all['Age'] <= 32), 'Age'] = 1
all.loc[(all['Age'] > 32) & (all['Age'] <= 48), 'Age'] = 2
all.loc[(all['Age'] > 48) & (all['Age'] <= 64), 'Age'] = 3
all.loc[ all['Age'] > 64, 'Age'] = 4

In [None]:
all['Age'].value_counts()

Converted Age into bands
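
For fixed cut points like the ones used for Age, `pd.cut` with explicit bin edges and `labels=False` assigns consecutive codes in one step. This is a sketch on made-up ages, with edges at 16, 32, 48 and 64 taken from the bands above:

```python
import numpy as np
import pandas as pd

# Made-up ages; the notebook would use the Age column instead
ages = pd.Series([2.0, 16.0, 17.0, 32.0, 40.0, 60.0, 70.0])

# Right-closed bins: up to 16, (16, 32], (32, 48], (48, 64], above 64
bins = [-np.inf, 16, 32, 48, 64, np.inf]
age_band = pd.cut(ages, bins=bins, labels=False)
print(age_band.tolist())
```

Unlike `pd.qcut`, the bands here have equal width rather than equal counts.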

##### Name

In [None]:
split_one = all['Name'].str.split('.', n=1, expand = True)
all['First'] = split_one[0]
all['Last'] = split_one[1]
split_two = all['First'].str.split(',', n=1, expand = True)
all['Last Name'] = split_two[0]
all['Title'] = split_two[1]
split_three = all['Title'].str.split('', n=1, expand = True)

split_three


In [None]:
all['Title'].value_counts()
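
The title can also be pulled out with a single regular expression instead of two splits. This is a sketch on three example names in the dataset's "Last, Title. First" format; the pattern is my assumption about that format, and unlike the splits above it returns the title without the leading space:

```python
import pandas as pd

# Example names in the "Last, Title. First" format used by the dataset
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the text between the first comma and the next full stop
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())
```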

In [None]:
all.drop(['First','Last','Name','Last Name'],axis = 1,inplace = True)

In [None]:
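#Replace only within the Title column (the titles keep their leading space from the split)
all['Title'] = all['Title'].replace(to_replace = [ ' Don', ' Rev', ' Dr',
        ' Major', ' Sir', ' Col', ' Capt',' Jonkheer'], value = ' Honorary(M)')

#' Mme' (Madame) is a female title, so it is grouped here
all['Title'] = all['Title'].replace(to_replace = [ ' Mme', ' Ms', ' Lady', ' Mlle',' the Countess', ' Dona'], value = ' Honorary(F)')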

all['Title'].value_counts()

In [None]:
all = pd.get_dummies(all, columns = ['Title'],prefix='Title_',drop_first=True)
all.head()

##### SibSp + Parch = Family Size

In [None]:
all['Family'] = all['SibSp'] + all['Parch'] + 1

In [None]:
all['Single'] = all['Family'].map(lambda s: 1 if s == 1 else 0)
all['SmallF'] = all['Family'].map(lambda s: 1 if  s == 2  else 0)
all['MedF'] = all['Family'].map(lambda s: 1 if 3 <= s <= 4 else 0)
all['LargeF'] = all['Family'].map(lambda s: 1 if s >= 5 else 0)
all.head()

##### Embarked

In [None]:
all = pd.get_dummies(all, columns = ['Embarked'], prefix='Embarked_from_',drop_first=True)
all.head()

##### Cabin

In [None]:
all.drop('Cabin',axis=1,inplace=True)

##### Ticket

In [None]:
all['Ticket'].unique()

In [None]:
all['Ticket'].value_counts()

In [None]:
all['Ticket'] = all['Ticket'].astype(str)
#apply(len) already returns integers, so no further conversion is needed
all['Ticket_length'] = all['Ticket'].apply(len)
all['Ticket_length'].unique()

In [None]:
#Collapse the raw lengths into three codes: 4 (shorter than 6), 5 (exactly 6), 12 (longer than 6)
all['Ticket_length'] = np.where(all['Ticket_length'] < 6, 4,
                                np.where(all['Ticket_length'] == 6, 5, 12))



In [None]:
all['Ticket_length'].unique()

In [None]:
all['Ticket_length'] = all['Ticket_length'].astype(str)

all['Ticket_length'] = np.where(((all['Ticket_length'] == '4')),'Below 6',all['Ticket_length'])
all['Ticket_length'] = np.where(((all['Ticket_length'] == '5')),'At 6',all['Ticket_length'])
all['Ticket_length'] = np.where(((all['Ticket_length'] == '12')),'Above 6',all['Ticket_length'])

In [None]:
all['Ticket_length'].unique()

In [None]:
all = pd.get_dummies(all, columns=['Ticket_length'], prefix = 'Ticket_Length_',drop_first=True)
all.head()

In [None]:
all.drop(['Ticket'],axis = 1, inplace = True)

##### Sex

In [None]:
all = pd.get_dummies(all, columns = ['Sex'],prefix='Gender_',drop_first=True)
all.head()

In [None]:
all.info()

In [None]:
all.drop(['SibSp','Parch','Family','FareBand','AgeBand'],axis = 1,inplace = True)

In [None]:
all.info()

In [None]:
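#.copy() gives independent frames, so the in-place drops below do not act on slices of all
train_data,test_data=all[all['train_or_test']=='train'].copy(),all[all['train_or_test']=='test'].copy()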
train_data.drop('train_or_test',axis=1,inplace=True)
test_data.drop('train_or_test',axis=1,inplace=True)

In [None]:
train_data.info()

In [None]:
test_data.info()

In [None]:
train_data.describe()

# Building Models

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve

In [None]:
#train & test split
X_train, X_test, y_train, y_test = train_test_split(train_data.drop(['PassengerId','Survived'],axis=1), 
                                                    train_data['Survived'], test_size=0.30, 
                                                    random_state=101)

##### I compared 10 popular classifiers and evaluated the mean accuracy of each of them with a stratified k-fold cross-validation procedure.

    SVC
    Decision Tree
    AdaBoost
    Random Forest
    Extra Trees
    Gradient Boosting
    Multi-layer perceptron (neural network)
    KNN
    Logistic regression
    Linear Discriminant Analysis


In [None]:
#Cross validate model with Kfold stratified cross val
kfold = StratifiedKFold(n_splits=10)
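
Stratified k-fold keeps the share of each class roughly the same in every fold, which matters when the classes are imbalanced. Below is a small sketch on synthetic data from `make_classification` (not the Titanic features); the model, the sample size and the use of `shuffle=True` with a fixed seed are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary data (roughly 70% / 30%)
X, y = make_classification(n_samples=200, n_features=5, weights=[0.7, 0.3],
                           random_state=0)

# shuffle=True with a fixed seed makes the fold assignment reproducible
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# One accuracy score per fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="accuracy", cv=skf)

# Share of positives in each validation fold stays close to the overall share
fold_pos_rates = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]

print("mean accuracy:", scores.mean(), "std:", scores.std())
print("positive share per fold:", fold_pos_rates)
```

The mean and standard deviation of `scores` are the same two quantities the comparison of classifiers below is built on.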

In [None]:
#Modeling step: test different algorithms
random_state = 101
classifiers = []
classifiers.append(SVC(random_state=random_state))
classifiers.append(DecisionTreeClassifier(random_state=random_state))
classifiers.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state),random_state=random_state,learning_rate=0.1))
classifiers.append(RandomForestClassifier(random_state=random_state))
classifiers.append(ExtraTreesClassifier(random_state=random_state))
classifiers.append(GradientBoostingClassifier(random_state=random_state))
classifiers.append(MLPClassifier(random_state=random_state))
classifiers.append(KNeighborsClassifier())
classifiers.append(LogisticRegression(random_state = random_state))
classifiers.append(LinearDiscriminantAnalysis())

cv_results = []
for classifier in classifiers :
    cv_results.append(cross_val_score(classifier, X_train, y = y_train, scoring = "accuracy", cv = kfold, n_jobs=4))

cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())

cv_res = pd.DataFrame({"CrossValMeans":cv_means,"CrossValerrors": cv_std,"Algorithm":["SVC","DecisionTree","AdaBoost",
"RandomForest","ExtraTrees","GradientBoosting","MultipleLayerPerceptron","KNeighbors","LogisticRegression","LinearDiscriminantAnalysis"]})

g = sns.barplot(x="CrossValMeans",y="Algorithm",data = cv_res, palette="Set3",orient = "h",**{'xerr':cv_std})
g.set_xlabel("Mean Accuracy")
g = g.set_title("Cross validation scores")

In [None]:
cv_res.sort_values('CrossValMeans',ascending=False)

In [None]:
###META MODELING  WITH ADABOOST, RF, EXTRATREES and GRADIENTBOOSTING

#Adaboost
DTC = DecisionTreeClassifier()

adaDTC = AdaBoostClassifier(DTC, random_state=7)

ada_param_grid = {"base_estimator__criterion" : ["gini", "entropy"],
              "base_estimator__splitter" :   ["best", "random"],
              "algorithm" : ["SAMME","SAMME.R"],
              "n_estimators" :[1,2],
              "learning_rate":  [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3,1.5]}

gsadaDTC = GridSearchCV(adaDTC,param_grid = ada_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsadaDTC.fit(X_train,y_train)

ada_best = gsadaDTC.best_estimator_

#Best score
gsadaDTC.best_score_

In [None]:
#ExtraTrees 
ExtC = ExtraTreesClassifier()


##Search grid for optimal parameters
ex_param_grid = {"max_depth": [None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [False],
              "n_estimators" :[100,300],
              "criterion": ["gini"]}


gsExtC = GridSearchCV(ExtC,param_grid = ex_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsExtC.fit(X_train,y_train)

ExtC_best = gsExtC.best_estimator_

#Best score
gsExtC.best_score_

In [None]:
#RFC parameter tuning
RFC = RandomForestClassifier()


##Search grid for optimal parameters
rf_param_grid = {"max_depth": [None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [False],
              "n_estimators" :[100,300],
              "criterion": ["gini"]}


gsRFC = GridSearchCV(RFC,param_grid = rf_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsRFC.fit(X_train,y_train)

RFC_best = gsRFC.best_estimator_

#Best score
gsRFC.best_score_

In [None]:
#Gradient boosting tuning

GBC = GradientBoostingClassifier()
gb_param_grid = {'loss' : ["deviance"],
              'n_estimators' : [100,200,300],
              'learning_rate': [0.1, 0.05, 0.01],
              'max_depth': [4, 8],
              'min_samples_leaf': [100,150],
              'max_features': [0.3, 0.1] 
              }

gsGBC = GridSearchCV(GBC,param_grid = gb_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsGBC.fit(X_train,y_train)

GBC_best = gsGBC.best_estimator_

#Best score
gsGBC.best_score_

In [None]:
#SVC classifier
SVMC = SVC(probability=True)
svc_param_grid = {'kernel': ['rbf'], 
                  'gamma': [ 0.001, 0.01, 0.1, 1],
                  'C': [1, 10, 50, 100,200,300, 1000]}

gsSVMC = GridSearchCV(SVMC,param_grid = svc_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsSVMC.fit(X_train,y_train)

SVMC_best = gsSVMC.best_estimator_

#Best score
gsSVMC.best_score_

In [None]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """Generate a simple plot of the test and training learning curve"""
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

g = plot_learning_curve(gsRFC.best_estimator_,"RF learning curves",X_train,y_train,cv=kfold)
g = plot_learning_curve(gsExtC.best_estimator_,"ExtraTrees learning curves",X_train,y_train,cv=kfold)
g = plot_learning_curve(gsSVMC.best_estimator_,"SVC learning curves",X_train,y_train,cv=kfold)
g = plot_learning_curve(gsadaDTC.best_estimator_,"AdaBoost learning curves",X_train,y_train,cv=kfold)
g = plot_learning_curve(gsGBC.best_estimator_,"GradientBoosting learning curves",X_train,y_train,cv=kfold)

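The learning curves help check whether each tuned model is overfitting: a large gap between the training score and the cross-validation score suggests overfitting.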

In [None]:
nrows = ncols = 2
fig, axes = plt.subplots(nrows = nrows, ncols = ncols, sharex="all", figsize=(15,15))

names_classifiers = [("AdaBoosting", ada_best),("ExtraTrees",ExtC_best),("RandomForest",RFC_best),("GradientBoosting",GBC_best)]

nclassifier = 0
for row in range(nrows):
    for col in range(ncols):
        name = names_classifiers[nclassifier][0]
        classifier = names_classifiers[nclassifier][1]
        indices = np.argsort(classifier.feature_importances_)[::-1][:40]
        g = sns.barplot(y=X_train.columns[indices][:40],x = classifier.feature_importances_[indices][:40] , orient='h',ax=axes[row][col])
        g.set_xlabel("Relative importance",fontsize=12)
        g.set_ylabel("Features",fontsize=12)
        g.tick_params(labelsize=9)
        g.set_title(name + " feature importance")
        nclassifier += 1

In [None]:
test_Survived_RFC = pd.Series(RFC_best.predict(test_data.drop(['PassengerId','Survived'],axis=1)), name="RFC")
test_Survived_ExtC = pd.Series(ExtC_best.predict(test_data.drop(['PassengerId','Survived'],axis=1)), name="ExtC")
test_Survived_SVMC = pd.Series(SVMC_best.predict(test_data.drop(['PassengerId','Survived'],axis=1)), name="SVC")
test_Survived_AdaC = pd.Series(ada_best.predict(test_data.drop(['PassengerId','Survived'],axis=1)), name="Ada")
test_Survived_GBC = pd.Series(GBC_best.predict(test_data.drop(['PassengerId','Survived'],axis=1)), name="GBC")

# Concatenate all classifier results
ensemble_results = pd.concat([test_Survived_RFC,test_Survived_ExtC,test_Survived_AdaC,test_Survived_GBC, test_Survived_SVMC],axis=1)


g= sns.heatmap(ensemble_results.corr(),annot=True)

# Final Output

A collection of several models working together on a single dataset is called an ensemble, and the method is called ensemble learning. Combining different models is often more useful than relying on any single one.
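
The difference between hard and soft voting can be sketched with plain NumPy. The probabilities below are made up for three hypothetical models and four passengers; they are not outputs of the models trained in this notebook:

```python
import numpy as np

# Made-up predicted probabilities of survival (rows: models, columns: passengers)
proba = np.array([
    [0.90, 0.40, 0.45, 0.20],   # hypothetical model A
    [0.80, 0.45, 0.40, 0.30],   # hypothetical model B
    [0.60, 0.95, 0.90, 0.10],   # hypothetical model C
])

# Hard voting: each model votes 0 or 1, and the majority wins
votes = (proba >= 0.5).astype(int)
hard = (votes.sum(axis=0) >= 2).astype(int)

# Soft voting: average the probabilities first, then apply the threshold
soft = (proba.mean(axis=0) >= 0.5).astype(int)

print("hard voting:", hard)
print("soft voting:", soft)
```

For the second and third passengers the two schemes disagree: two models lean slightly towards 0, but one very confident model pulls the averaged probability above 0.5. Soft voting therefore needs models that can output probabilities.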

In [None]:
votingC = VotingClassifier(estimators=[('rfc', RFC_best),('gbc',GBC_best)], voting='soft', n_jobs=4)
votingC = votingC.fit(X_train, y_train)

In [None]:
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': votingC.predict(test_data.drop(['PassengerId','Survived'],axis=1)).astype('int')})

output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")

In [None]:
output

# References :
1. EDA - https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
2. Voting Classifier - https://medium.com/@sanchitamangale12/voting-classifier-1be10db6d7a5

## Feel free to share feedback, Upvote if you like/found the notebook useful!