## **Data Dictionary**

* Survived: 0 = No, 1 = Yes
* pclass  : Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
* sibsp   : # of siblings / spouses aboard the Titanic
* parch   : # of parents   / children aboard the Titanic
* ticket  : Ticket number
* cabin   : Cabin number
* embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

# *Contents of the Notebook* #  

### Part1: Exploratory Data Analysis(EDA):  ###
1) Analysis of the features

2) Finding any relations or trends considering multiple features

  
### Part2: Feature Engineering and Data Cleaning ###
1) Adding a few features

2) Removing redundant features

3) Converting features into suitable form for modeling  


### Part3: Predictive Modeling ###
1) Running Basic Algorithms.

2) Cross Validation

3) Ensembling 

4) Important Features Extraction 

### Part1: Exploratory Data Analysis(EDA): ###

In [291]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [292]:
data = pd.read_csv('../Titanic/train.csv')

In [293]:
data.isnull().sum() #checking for total null values

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
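
Raw counts are easier to judge next to the share of rows they represent. The sketch below uses a small made-up frame (the values are illustrative, not taken from `train.csv`) to show one way to rank columns by their fraction of missing values.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; the notebook itself works on train.csv.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() gives True/False per cell; the mean of booleans is the missing share.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to `data`, the same expression would put `Cabin` first, since it has by far the most nulls in the output above.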

### How many Survived?? ###

In [None]:
f,ax=plt.subplots(1,2,figsize=(18,8))
data['Survived'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Survived')
ax[0].set_ylabel('')
sns.countplot(x='Survived', data=data, ax=ax[1])  # pass the column as x=; recent seaborn rejects it positionally
ax[1].set_title('Survived')
plt.show()

### Types Of Features ###
#### Categorical Features: ####  
A categorical variable has two or more categories, and each value of the feature falls into one of them. For example, gender is a categorical variable with two categories (male and female). We cannot sort or impose any ordering on such variables. They are also known as nominal variables.

**Categorical Features in the dataset: Sex, Embarked.**  



#### Ordinal Features: ####  
An ordinal variable is similar to a categorical variable, but its values have a relative ordering. For example, a feature like Height with the values Tall, Medium, Short is an ordinal variable, because the values can be sorted relative to one another.

**Ordinal Features in the dataset: Pclass**  



#### Continuous Feature: ####  
A feature is continuous if it can take any value between two points, that is, anywhere between the minimum and maximum values in the feature's column.

**Continuous Features in the dataset: Age**
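
One way to make the nominal/ordinal distinction concrete is pandas' `Categorical` type, which carries an explicit `ordered` flag. This is only an illustrative sketch on made-up rows; the notebook itself keeps `Sex` as strings and `Pclass` as integers.

```python
import pandas as pd

# Toy rows with hypothetical values, not read from train.csv.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 2],
    "Age": [22.0, 38.0, 26.0],
})

# Nominal: categories exist, but there is no order between them.
sex = pd.Categorical(toy["Sex"], categories=["female", "male"], ordered=False)

# Ordinal: categories with an explicit order (assumed here: 3rd < 2nd < 1st).
pclass = pd.Categorical(toy["Pclass"], categories=[3, 2, 1], ordered=True)

print(sex.ordered, pclass.ordered)

# min/max are only meaningful for the ordered categorical.
print(pclass.min(), pclass.max())
```

`Age` needs no such treatment: as a float column it already supports comparisons and arithmetic.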

### Sex -> Categorical Feature ###

In [None]:
data.groupby(['Sex','Survived'])['Survived'].count()
# 1 = live, 0 = dead

In [None]:
# Set the number of plots and the figure size
# figure == the single frame, ax == the list of plots (axes) inside it
figure, ax = plt.subplots(1,2, figsize=(18,8))

# Draw the first plot
data[['Sex', 'Survived']].groupby(['Sex']).mean().plot.bar(ax = ax[0])
#ax[0].set(ylim=(0,1))

# Draw the second plot
# hue == the column whose values split the counts
sns.countplot(x='Sex', hue='Survived', data=data, ax = ax[1])

# Set the plot titles
ax[0].set_title('Survived vs Sex')
ax[1].set_title('Sex:Survived vs Dead')
plt.legend(['Dead', 'Live'])

plt.show()

### Pclass -> Ordinal Feature ###   

            == Label encoding (1, 2, 3) also works here.

In [None]:
pd.crosstab(data.Pclass, data.Survived, margins = True).style.background_gradient(cmap = 'summer_r')

In [None]:
f,ax=plt.subplots(1,2,figsize=(18,8))

data['Pclass'].value_counts().plot.bar(color=['#CD7F32','#FFDF00','#D3D3D3'],ax=ax[0])

ax[0].set_title('Number Of Passengers By Pclass')
ax[0].set_ylabel('Count')
ax[0].set_xlabel('Pclass')

sns.countplot(x='Pclass', hue='Survived', data=data, ax=ax[1])
ax[1].set_title('Pclass:Survived vs Dead')

plt.legend(['Dead', 'Live'])
plt.show()

In [None]:
pd.crosstab([data.Sex,data.Survived],data.Pclass,margins=True).style.background_gradient(cmap='summer_r')

In [None]:
sns.catplot(x='Pclass', y='Survived', hue='Sex', kind='point', data=data)  # factorplot was renamed to catplot
plt.show()

### Age -> Continuous Feature ###

In [None]:
print('Oldest Passenger was of:',data['Age'].max(),'Years')
print('Youngest Passenger was of:',data['Age'].min(),'Years')
print('Average Age on the ship:',data['Age'].mean(),'Years')

In [None]:
f, ax = plt.subplots(1,2,figsize = (18,8))

sns.violinplot(x="Pclass", y="Age", hue="Survived", data=data, split=True, ax=ax[0])

ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0,110,10))

sns.violinplot(x="Sex", y="Age", hue="Survived", data=data, split=True, ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0,110,10))
plt.show()

In [295]:
# Extract the salutation: the run of letters immediately before a period.
# One vectorized call covers every row, so no loop is needed.
data['Initial'] = data.Name.str.extract(r'([A-Za-z]+)\.', expand=False)

In [None]:
pd.crosstab(data.Initial,data.Sex).T.style.background_gradient(cmap='summer_r') #Checking the Initials with the Sex

In [297]:
# Assign the result back instead of relying on inplace=True on a column selection
data['Initial'] = data['Initial'].replace(
    ['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],
    ['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'])

In [None]:
data.groupby('Initial')['Age'].mean() #lets check the average age by Initials

### Filling NaN Ages ###

In [298]:
## Assigning the NaN Values with the Ceil values of the mean ages
data.loc[(data.Age.isnull())&(data.Initial=='Mr'),'Age']=33
data.loc[(data.Age.isnull())&(data.Initial=='Mrs'),'Age']=36
data.loc[(data.Age.isnull())&(data.Initial=='Master'),'Age']=5
data.loc[(data.Age.isnull())&(data.Initial=='Miss'),'Age']=22
data.loc[(data.Age.isnull())&(data.Initial=='Other'),'Age']=46
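
The hard-coded ages above come from reading the group means by eye. As an alternative sketch (not the notebook's own method), the same idea can be computed directly with `groupby(...).transform('mean')`, shown here on a toy frame with made-up ages so the result is easy to check.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; the real notebook uses train.csv.
toy = pd.DataFrame({
    "Initial": ["Mr", "Mr", "Mr", "Miss", "Miss", "Master"],
    "Age": [30.0, 36.0, np.nan, 20.0, np.nan, 4.0],
})

# Mean age per title, broadcast back onto every row of that title.
group_mean = toy.groupby("Initial")["Age"].transform("mean")

# Fill only the missing ages, rounding the group mean up as the notebook does.
toy["Age"] = toy["Age"].fillna(np.ceil(group_mean))
print(toy)
```

This keeps the imputed values in sync with the data if the title mapping changes, at the cost of being less explicit than the five assignments above.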

In [None]:
data.Age.isnull().any()

In [None]:
f,ax=plt.subplots(1,2,figsize=(20,10))

data[data['Survived'] == 0].Age.plot.hist(ax = ax[0], bins = 20,edgecolor = 'black',color = 'red')

ax[0].set_title('Survived= 0')
x1=list(range(0,85,5))
ax[0].set_xticks(x1)

data[data['Survived'] == 1].Age.plot.hist(ax = ax[1], color = 'green',bins = 20,edgecolor = 'black')

ax[1].set_title('Survived= 1')
x2=list(range(0,85,5))
ax[1].set_xticks(x2)
plt.show()

In [None]:
sns.catplot(x='Pclass', y='Survived', col='Initial', kind='point', data=data)
plt.show()

### Embarked -> Categorical Value ###

In [None]:
pd.crosstab([data.Embarked, data.Pclass],[data.Sex,data.Survived]
            , margins = True).style.background_gradient(cmap='summer_r')

In [None]:
sns.catplot(x='Embarked', y='Survived', kind='point', data=data)
fig = plt.gcf()
fig.set_size_inches(5,3)
plt.show()

In [None]:
f,ax=plt.subplots(2,2,figsize=(20,15))

sns.countplot(x='Embarked', data=data, ax=ax[0,0])
ax[0,0].set_title('No. Of Passengers Boarded')

sns.countplot(x='Embarked', hue='Sex', data=data, ax=ax[0,1])
ax[0,1].set_title('Male-Female Split for Embarked')

sns.countplot(x='Embarked', hue='Survived', data=data, ax=ax[1,0])
ax[1,0].set_title('Embarked vs Survived')
ax[1,0].legend(['Dead','Live'])

sns.countplot(x='Embarked', hue='Pclass', data=data, ax=ax[1,1])
ax[1,1].set_title('Embarked vs Pclass')

plt.subplots_adjust(wspace=0.2,hspace=0.5)
plt.show()

In [None]:
sns.catplot(x='Pclass', y='Survived', hue='Sex', col='Embarked', kind='point', data=data)
plt.show()

#### Filling Embarked NaN ####  
As we saw, most passengers boarded at port S, so we replace the NaN values with S.

In [301]:
data['Embarked'] = data['Embarked'].fillna('S')
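
Rather than typing the port by hand, the most frequent value can be looked up with `Series.mode()`. The sketch below assumes a toy column with made-up entries; on the real data it should pick `'S'`, matching the fill above.

```python
import numpy as np
import pandas as pd

# Toy column with hypothetical values, including two missing entries.
toy = pd.DataFrame({"Embarked": ["S", "C", "S", np.nan, "Q", "S", np.nan]})

# mode() ignores NaN by default and returns a Series, so take its first entry.
most_common = toy["Embarked"].mode()[0]

toy["Embarked"] = toy["Embarked"].fillna(most_common)
print(most_common, toy["Embarked"].isnull().sum())
```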

### SibSp -> Discrete Feature ###   
This feature indicates whether a passenger travelled alone or with family members.   

Sibling = brother, sister, stepbrother, stepsister   

Spouse = husband, wife   

In [None]:
pd.crosstab([data.SibSp], data.Survived).style.background_gradient(cmap= 'summer_r')

In [None]:
f,ax=plt.subplots(1,2,figsize=(20,8))

sns.barplot(x='SibSp', y='Survived', data=data, ax=ax[0])
ax[0].set_title('SibSp vs Survived')

# pointplot is the axes-level function, so it can draw into ax[1]
sns.pointplot(x='SibSp', y='Survived', data=data, ax=ax[1])
ax[1].set_title('SibSp vs Survived')

plt.show()

In [None]:
pd.crosstab(data.SibSp,data.Pclass).style.background_gradient(cmap='summer_r')

### Parch ###

In [None]:
pd.crosstab(data.Parch,data.Pclass).style.background_gradient(cmap='summer_r')

In [None]:
f,ax=plt.subplots(1,2,figsize=(20,8))
sns.barplot(x='Parch', y='Survived', data=data, ax=ax[0])
ax[0].set_title('Parch vs Survived')
# pointplot is the axes-level function, so it can draw into ax[1]
sns.pointplot(x='Parch', y='Survived', data=data, ax=ax[1])
ax[1].set_title('Parch vs Survived')
plt.show()

### Fare -> Continuous Feature ###

In [None]:
print('Highest Fare was:',data['Fare'].max())
print('Lowest Fare was:',data['Fare'].min())
print('Average Fare was:',data['Fare'].mean())

In [None]:
f,ax=plt.subplots(1,3,figsize=(20,8))

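# distplot is deprecated; histplot with a KDE overlay draws a comparable density plot
sns.histplot(data[data['Pclass']==1].Fare, kde=True, stat='density', ax=ax[0])
ax[0].set_title('Fares in Pclass 1')

sns.histplot(data[data['Pclass']==2].Fare, kde=True, stat='density', ax=ax[1])
ax[1].set_title('Fares in Pclass 2')

sns.histplot(data[data['Pclass']==3].Fare, kde=True, stat='density', ax=ax[2])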
ax[2].set_title('Fares in Pclass 3')

plt.show()

### Observations in a Nutshell for all features: ###   
**Sex**: Women have a much higher chance of survival than men.

**Pclass**: There is a visible trend that being a **1st class passenger** gives better chances of survival. The survival rate for **Pclass3 is very low**. For women, the chance of survival in Pclass1 is almost 1, and it is also high for those in **Pclass2. Money wins!**

**Age**: Children younger than 5-10 years have a high chance of survival. Many passengers in the 15 to 35 age group died.

**Embarked**: This is a very interesting feature. **The chances of survival at C look better, even though the majority of Pclass1 passengers boarded at S.** Passengers at Q were almost all from **Pclass3.**

**Parch+SibSp**: Having 1-2 siblings or a spouse on board, or 1-3 parents, shows a greater chance of survival than being alone or travelling with a large family.

### Correlation Between The Features ###

In [None]:
# Correlation matrix of the numeric columns only; text columns such as Name cannot be correlated
sns.heatmap(data.select_dtypes(include='number').corr(), annot = True, cmap = 'RdYlGn', linewidths = 0.2)

fig=plt.gcf()
fig.set_size_inches(10,8)

plt.show()

### Part2: Feature Engineering and Data Cleaning ###
Now what is Feature Engineering?

Whenever we are given a dataset with features, not all of the features are necessarily important. There may be many redundant features that should be eliminated. We can also add new features by observing or extracting information from other features.

An example is deriving the Initial feature from the Name feature. Let's see if we can derive any new features and eliminate a few. We will also transform the existing relevant features into a form suitable for predictive modeling.
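
To illustrate deriving one feature from another, here is a minimal sketch of the title extraction applied to a few names written in the dataset's `Surname, Title. Given names` format. The regular expression is the one used earlier in the notebook; the three-row Series is only for demonstration.

```python
import pandas as pd

# A few names in the same "Surname, Title. Given names" layout as the Name column.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
])

# Capture the first run of letters that is directly followed by a period.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())  # ['Mr', 'Miss', 'Master']
```

With `expand=False` the result is a Series rather than a one-column DataFrame, which makes it straightforward to assign to a new column.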

### Age_band ###

In [None]:
data['Age_band']=0
data.loc[data['Age']<=16,'Age_band']=0
data.loc[(data['Age']>16)&(data['Age']<=32),'Age_band']=1
data.loc[(data['Age']>32)&(data['Age']<=48),'Age_band']=2
data.loc[(data['Age']>48)&(data['Age']<=64),'Age_band']=3
data.loc[data['Age']>64,'Age_band']=4
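
The five `loc` assignments above can also be expressed as a single binning call. The sketch below assumes the same cut points (16, 32, 48, 64) and uses `pd.cut` on a toy Series of ages chosen to sit on and around the boundaries.

```python
import pandas as pd

# Toy ages, including boundary values, to show which band each one lands in.
ages = pd.Series([4.0, 16.0, 16.5, 32.0, 40.0, 64.0, 80.0])

# pd.cut uses right-closed intervals by default, i.e. (16, 32] maps to band 1,
# which matches the "> lower and <= upper" conditions used above.
bins = [-float("inf"), 16, 32, 48, 64, float("inf")]
age_band = pd.cut(ages, bins=bins, labels=[0, 1, 2, 3, 4]).astype(int)
print(age_band.tolist())  # [0, 0, 1, 1, 2, 3, 4]
```

The `astype(int)` step only works when no age is missing, which holds here because the NaN ages were filled earlier.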

In [None]:
data.head(2)

In [None]:
data['Age_band'].value_counts().to_frame().style.background_gradient(cmap='summer') # number of passengers in each band

In [None]:
sns.catplot(x='Age_band', y='Survived', col='Pclass', kind='point', data=data)
plt.show()

### Family_Size and Alone ###

In [308]:
data['Family_Size']=0
data['Family_Size']=data['Parch']+data['SibSp']#family size
data['Alone']=0
data.loc[data.Family_Size==0,'Alone']=1#Alone

In [None]:
f,ax=plt.subplots(1,2,figsize=(18,6))

# pointplot is the axes-level function, so each call can draw into its own subplot
sns.pointplot(x='Family_Size', y='Survived', data=data, ax=ax[0])
ax[0].set_title('Family_Size vs Survived')

sns.pointplot(x='Alone', y='Survived', data=data, ax=ax[1])
ax[1].set_title('Alone vs Survived')

plt.show()

In [None]:
sns.catplot(x='Alone', y='Survived', hue='Sex', col='Pclass', kind='point', data=data)
plt.show()

### Fare_Range ###

In [None]:
data['Fare_Range']=pd.qcut(data['Fare'],4)

data.groupby(['Fare_Range'])['Survived'].mean().to_frame().style.background_gradient(cmap='summer_r')

In [None]:
data.loc[:, ['Fare','Fare_Range']]

In [348]:
data['Fare_cat']=0
data.loc[data['Fare']<=7.91,'Fare_cat']=0
data.loc[(data['Fare']>7.91)&(data['Fare']<=14.454),'Fare_cat']=1
data.loc[(data['Fare']>14.454)&(data['Fare']<=31),'Fare_cat']=2
data.loc[(data['Fare']>31)&(data['Fare']<=513),'Fare_cat']=3
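
The thresholds above were copied by hand from the `qcut` intervals. As a hedged alternative, `pd.qcut` can return the integer codes directly with `labels=False`, and the bin edges with `retbins=True`. The fares below are made up for illustration.

```python
import pandas as pd

# Toy fares (hypothetical values), eight of them so each quartile gets two.
fares = pd.Series([5.0, 7.0, 8.0, 10.0, 15.0, 20.0, 40.0, 100.0])

# labels=False yields integer bin codes 0..3 instead of Interval objects;
# retbins=True additionally returns the quartile edges that were used.
fare_cat, edges = pd.qcut(fares, 4, labels=False, retbins=True)
print(fare_cat.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
print(edges)
```

Keeping the returned `edges` is useful if the same bins must later be applied to another dataset with `pd.cut`.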

In [None]:
sns.catplot(x='Fare_cat', y='Survived', hue='Sex', kind='point', data=data)
plt.show()