### 1 Introduction
### 2 Load and check data
- Load data
- Outlier detection
- Joining train and test set
- Check for null and missing values
### 3 Feature analysis
- 3.1 Numerical values
- 3.2 Categorical values
### 4 Filling missing Values
- 4.1 Age
### 5 Feature engineering
- 5.1 Name/Title
- 5.2 Family Size
- 5.3 Cabin
- 5.4 Ticket
### 6 Modeling
- 6.1 Simple modeling
    - 6.1.1 Cross validate models
    - 6.1.2 Hyperparameter tuning for best models
    - 6.1.3 Plot learning curves
    - 6.1.4 Feature importance of the tree based classifiers
- 6.2 Ensemble modeling
    - 6.2.1 Combining models
- 6.3 Prediction
    - 6.3.1 Predict and Submit results

In [1]:
#import necessary libraries for the project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
%matplotlib inline

In [2]:
#2.1 load data
train = pd.read_csv("train.csv")
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
test = pd.read_csv('test.csv')
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [4]:
#2.2 Outlier detection
train.describe(include='all')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Chaffee, Mr. Herbert Fuller",male,,,,1601.0,,B96 B98,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


In [5]:
train['Age'][train['Age'] < 20 ].describe()

count    164.000000
mean      11.979695
std        6.656828
min        0.420000
25%        5.000000
50%       15.000000
75%       18.000000
max       19.000000
Name: Age, dtype: float64

In [6]:
train['Age'][train['Age'] > 65 ].describe()

count     8.000000
mean     71.562500
std       4.048258
min      66.000000
25%      70.000000
50%      70.750000
75%      71.750000
max      80.000000
Name: Age, dtype: float64

The two cells above look for outliers in Age.

The first one summarizes the youngest passengers (under 20 years of age). There are 164 of them in the training dataset, and the youngest is a baby about five months old (Age 0.42).

The second one summarizes the oldest passengers. There are 8 people over the age of 65, and the oldest person in the training set is 80 years old.
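The age thresholds above were chosen by eye. As a complementary check, here is a minimal sketch of Tukey's interquartile-range (IQR) rule, a common convention that this notebook does not itself use. The helper name `iqr_outliers` and the toy ages are illustrative assumptions, not values from train.csv.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries of `values` outside [Q1 - k*IQR, Q3 + k*IQR]."""
    clean = values.dropna()
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return clean[(clean < lower) | (clean > upper)]


# Toy ages for illustration only; not taken from train.csv.
toy_ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0, None, 54.0, 2.0, 27.0, 80.0])
flagged = iqr_outliers(toy_ages)
print(flagged)
```

With the Age quartiles reported by `train.describe()` (20.125 and 38.0), the same rule gives an upper fence of 38.0 + 1.5 × 17.875 = 64.8125, which is close to the cut-off of 65 used above.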

In [7]:
train.loc[train['Age']== 0.420000]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
803,804,1,3,"Thomas, Master. Assad Alexander",male,0.42,0,1,2625,8.5167,,C


In [8]:
train.loc[train['Age']== 80]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S


In [9]:
#Describe the fares above 265, far higher than the mean training fare of about $32
train['Fare'][train['Fare'] > 265 ].describe()
#I am not dropping the rows with the highest fare. They may not be outliers, since these passengers embarked at C, that is Cherbourg in France.
#That could be the reason for the higher fare.

count      3.0000
mean     512.3292
std        0.0000
min      512.3292
25%      512.3292
50%      512.3292
75%      512.3292
max      512.3292
Name: Fare, dtype: float64
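The comment above ties the highest fares to the port of embarkation, but `describe()` only summarizes the Fare column. One way to check such a claim is to list the matching rows with their other columns. The sketch below assumes a hypothetical helper `rows_above_fare` and reuses the five rows shown by `train.head()` as sample input, with an arbitrary threshold of 50 for the demonstration.

```python
import pandas as pd


def rows_above_fare(df, threshold, columns=("Name", "Pclass", "Fare", "Embarked")):
    """Return the rows whose Fare exceeds `threshold`, highest fare first."""
    cols = [c for c in columns if c in df.columns]  # tolerate missing columns
    return df.loc[df["Fare"] > threshold, cols].sort_values("Fare", ascending=False)


# The five rows printed by train.head(), used here as a small sample.
sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley",
             "Heikkinen, Miss. Laina", "Futrelle, Mrs. Jacques Heath",
             "Allen, Mr. William Henry"],
    "Pclass": [3, 1, 3, 1, 3],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": ["S", "C", "S", "S", "S"],
})
expensive = rows_above_fare(sample, threshold=50)
print(expensive)
```

In the notebook, `rows_above_fare(train, 265)` would show the three passengers counted above together with their class and port.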

In [10]:
test.describe(include='all')

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,418.0,418.0,418,418,332.0,418.0,418.0,418,417.0,91,418
unique,,,418,2,,,,363,,76,3
top,,,"Widener, Mr. George Dunton",male,,,,PC 17608,,B57 B59 B63 B66,S
freq,,,1,266,,,,5,,3,270
mean,1100.5,2.26555,,,30.27259,0.447368,0.392344,,35.627188,,
std,120.810458,0.841838,,,14.181209,0.89676,0.981429,,55.907576,,
min,892.0,1.0,,,0.17,0.0,0.0,,0.0,,
25%,996.25,1.0,,,21.0,0.0,0.0,,7.8958,,
50%,1100.5,3.0,,,27.0,0.0,0.0,,14.4542,,
75%,1204.75,3.0,,,39.0,1.0,0.0,,31.5,,


In [11]:
print("Train data: ")
print(train.isna().sum()[(train.isna().sum()!=0)].reset_index())
print("Test data: ")
print(test.isna().sum()[test.isna().sum()!=0].reset_index())

Train data: 
      index    0
0       Age  177
1     Cabin  687
2  Embarked    2
Test data: 
   index    0
0    Age   86
1   Fare    1
2  Cabin  327


The training and testing datasets each have missing values in 3 columns. Both are missing values in the Age and Cabin columns. 

In my opinion we can safely remove the Cabin column because so much of it is missing (about 77% of the training data and 78% of the testing data). 
For Age, about 20% of the values are missing in both the training and testing sets. Age can be an important feature, so I don't want to drop it from the dataset. I will impute the missing values later. 

The other 2 gaps are Embarked in the training data and Fare in the testing data. Less than 1% of the values are missing, so we can keep these columns and impute the values later. My initial guess is that neither Embarked nor Fare is a significant feature. Further analysis is needed to confirm that. 
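The percentages quoted above can be computed directly instead of by hand. This is a minimal sketch. The helper name `missing_report` and the toy frame are assumptions for illustration, and only the column names mirror the Titanic data.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns with any gap."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return report[report["missing"] > 0].sort_values("percent", ascending=False)


# Toy frame for illustration only; the values are not taken from train.csv.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
report = missing_report(toy)
print(report)
```

Applied to the counts printed above, the training set works out to 687/891 ≈ 77.1% for Cabin, 177/891 ≈ 19.9% for Age and 2/891 ≈ 0.2% for Embarked.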

In [12]:
train = train.drop(['Cabin'], axis=1)
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


In [13]:
test = test.drop(['Cabin'], axis=1)
test.shape

(418, 10)

Impute the missing values in the 'Age', 'Embarked' and 'Fare' columns of the train and testing sets. 

In [14]:
#Impute Age - Age is numeric, so we can fill the missing entries with the rounded mean age.
train['Age'] = train['Age'].fillna(round( train['Age'].mean()))
#Impute Embarked - Embarked is categorical, so we fill the missing entries with the most frequent value (the mode).
mode_em = train['Embarked'].mode()[0]
train['Embarked'] = train['Embarked'].fillna(mode_em)
print(train.isna().sum())

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64


In [17]:
#Impute the missing values of the testing set as well 
test['Age'] = test['Age'].fillna(round(test['Age'].mean()))
test['Fare'] = test['Fare'].fillna(round(test['Fare'].mean()))
print(test.isna().sum())

PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
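The cells above fill each dataset with its own statistics. A common alternative, sketched here as an assumption rather than as the notebook's method, is to compute the fill values on the training set only and reuse them on the test set, so no test-set statistics leak into preprocessing. The sketch also swaps the rounded mean for the median, which is less sensitive to extreme fares. The helper names and toy frames are hypothetical.

```python
import pandas as pd


def fit_fill_values(train_df: pd.DataFrame) -> dict:
    """Compute fill values from the training data only."""
    return {
        "Age": train_df["Age"].median(),
        "Fare": train_df["Fare"].median(),
        "Embarked": train_df["Embarked"].mode()[0],
    }


def apply_fill_values(df: pd.DataFrame, fill_values: dict) -> pd.DataFrame:
    """Return a copy of `df` with gaps filled column by column."""
    present = {col: val for col, val in fill_values.items() if col in df.columns}
    return df.fillna(value=present)


# Toy frames for illustration only; the values are not taken from the CSV files.
toy_train = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Embarked": ["S", "C", None, "S"],
})
toy_test = pd.DataFrame({
    "Age": [None, 47.0],
    "Fare": [7.83, None],
    "Embarked": ["Q", "S"],
})

fill_values = fit_fill_values(toy_train)
filled_train = apply_fill_values(toy_train, fill_values)
filled_test = apply_fill_values(toy_test, fill_values)
print(filled_test)
```

In the notebook this would read `fill_values = fit_fill_values(train)`, followed by `apply_fill_values` on both `train` and `test`.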


In [27]:
#Now let's analyze the data - survival rate and mean fare for each Pclass
train[['Pclass', 'Survived', 'Fare']].groupby(['Pclass'], as_index=False).mean()

Unnamed: 0,Pclass,Survived,Fare
0,1,0.62963,84.154687
1,2,0.472826,20.662183
2,3,0.242363,13.67555


In [36]:
train[['Sex', 'Survived']].groupby(['Sex'], as_index=False).count()

Unnamed: 0,Sex,Survived
0,female,314
1,male,577


In [38]:
train[['Age', 'Survived']].groupby(['Survived'], as_index=False).mean()

Unnamed: 0,Survived,Age
0,0,30.483607
1,1,28.595526


In [39]:
train[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).count()

Unnamed: 0,Embarked,Survived
0,C,168
1,Q,77
2,S,646
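The Sex and Embarked cells above use `count()`, which reports how many passengers fall into each group rather than how many survived. To compare survival across categories, the count can be paired with the mean of Survived, as the Pclass cell does. Below is a minimal sketch with a hypothetical helper `survival_by`. The sample reuses the Sex and Survived values of the five rows shown by `train.head()`.

```python
import pandas as pd


def survival_by(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Passenger count and survival rate for each category of `column`."""
    return (
        df.groupby(column)["Survived"]
          .agg(passengers="count", survival_rate="mean")
          .reset_index()
    )


# Sex and Survived for the five rows printed by train.head().
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Survived": [0, 1, 1, 1, 0],
})
rates = survival_by(sample, "Sex")
print(rates)
```

In the notebook, `survival_by(train, 'Sex')` and `survival_by(train, 'Embarked')` would give the group sizes shown above alongside the survival rate of each group.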
