# EDA for Titanic dataset

In [1]:
%matplotlib notebook
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sn

In [2]:
df_train=pd.read_csv('F:/learning/kaggle/titanic/train.csv')
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Variable Identification

Here we identify the data type of every feature, which helps in choosing a suitable technique for univariate and bivariate analysis.

In [3]:
df_train.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [4]:
df_train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
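
As a first pass, the dtype listing above can also be split into numeric and non-numeric columns programmatically. The sketch below uses a small hand-made frame (`toy`, a stand-in for illustration, not the real `train.csv`) and `select_dtypes`. Integer-coded categories such as `Survived` and `Pclass` still land in the numeric group, so the dtype alone does not settle the variable type.

```python
import pandas as pd

# Hypothetical toy frame mirroring a few Titanic columns
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
})

# Columns stored as integers or floats
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else (strings here)
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(text_cols)
```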

#### Out of these,
#### Continuous Variables:
    1. Age
    2. Fare
#### Categorical Variables:
    Nominal Variables:
        1. Sex
        2. Name
        3. Cabin
        4. Embarked
        5. Survived
        6. SibSp
        7. Parch
        8. Ticket
    Ordinal Variables:
        1. Pclass
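
One way to make the nominal/ordinal distinction explicit in pandas is the categorical dtype. The sketch below is an assumption about how one might encode it, not something this notebook does: `sex` becomes an unordered category and `pclass` an ordered one, with the order 3 < 2 < 1 chosen here so that first class ranks highest.

```python
import pandas as pd

# Nominal: categories have no inherent order
sex = pd.Series(["male", "female", "female"], dtype="category")

# Ordinal: categories are ranked; 3rd < 2nd < 1st class is an illustrative choice
pclass = pd.Series(pd.Categorical([3, 1, 3, 2], categories=[3, 2, 1], ordered=True))

print(sex.cat.ordered)     # whether the categories carry an order
print(pclass.cat.ordered)
print(pclass.max())        # highest-ranked class present in the data
```

Comparisons and `min`/`max` are only defined for the ordered version, which is what separates ordinal from nominal data in practice.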

## Univariate Analysis

Here we analyse each feature individually.


For categorical variables, this is done with a bar plot of the category counts.
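
Each bar plot in this section is a picture of a frequency table. A minimal sketch of that table, using the `Survived` values of the first five training rows as stand-in data:

```python
import pandas as pd

# Stand-in data: Survived for the first five training rows
survived = pd.Series([0, 1, 1, 1, 0], name="Survived")

# Absolute counts per category: these are the bar heights
counts = survived.value_counts().sort_index()
# The same table expressed as proportions
shares = survived.value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```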

### Survived feature

In [12]:
plt.figure()
# Count passengers per outcome once (0 = died, 1 = survived), then plot the counts
survived_counts = df_train["Survived"].value_counts()
a1 = sn.barplot(x=survived_counts.index, y=survived_counts.values)
plt.xlabel('Survived or not')
plt.ylabel('Number of Passengers')
plt.title('Comparison of people who survived and died')
plt.show()



### Embarked feature

In [16]:
plt.figure()
# Count passengers per port of embarkation, then plot the counts
embarked_counts = df_train["Embarked"].value_counts()
a2 = sn.barplot(x=embarked_counts.index, y=embarked_counts.values)
plt.xlabel('Embark location')
plt.ylabel('Number of Passengers')
plt.title('Categories based on Embark location')
plt.show()



### Pclass feature

In [17]:
plt.figure()
# Count passengers per ticket class, then plot the counts
pclass_counts = df_train["Pclass"].value_counts()
a3 = sn.barplot(x=pclass_counts.index, y=pclass_counts.values)
plt.xlabel('Class')
plt.ylabel('Number of Passengers')
plt.title('Categories based on classes')
plt.show()



### SibSp feature

In [19]:
plt.figure()
# Count passengers per number of siblings/spouses aboard, then plot the counts
sibsp_counts = df_train["SibSp"].value_counts()
a4 = sn.barplot(x=sibsp_counts.index, y=sibsp_counts.values)
plt.xlabel('# of siblings / spouses')
plt.ylabel('Number of Passengers')
plt.show()



### Sex feature

In [24]:
plt.figure()
# Count passengers per sex, then plot the counts
sex_counts = df_train["Sex"].value_counts()
a5 = sn.barplot(x=sex_counts.index, y=sex_counts.values)
plt.xlabel('Sex')
plt.ylabel('Number of Passengers')
plt.show()
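
The categorical plots above all follow the same pattern, so they could also be drawn in one loop. The sketch below is one possible consolidation, not the notebook's own approach: it uses seaborn's `countplot`, which counts the categories itself, on a hypothetical `toy` frame. The `Agg` backend is only set so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sn

# Hypothetical toy frame standing in for df_train
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
})

categorical = ["Survived", "Pclass", "Sex"]
fig, axes = plt.subplots(1, len(categorical), figsize=(9, 3))

# One count plot per categorical column, each on its own axes
for ax, col in zip(axes, categorical):
    sn.countplot(x=col, data=toy, ax=ax)
    ax.set_ylabel("Number of Passengers")

fig.tight_layout()
```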



For continuous variables, we do univariate analysis through a histogram.
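
A histogram groups a continuous variable into intervals and counts the observations in each. A minimal sketch with NumPy, using the `Fare` values of the first five training rows; the range 0 to 80 and the four bins are arbitrary choices for illustration:

```python
import numpy as np

# Fare values of the first five training rows
fares = np.array([7.25, 71.2833, 7.925, 53.1, 8.05])

# Four equal-width bins over [0, 80]: [0, 20), [20, 40), [40, 60), [60, 80]
counts, edges = np.histogram(fares, bins=4, range=(0, 80))

print(counts)
print(edges)
```

Three of the five fares fall into the lowest bin.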

### Fare feature

In [25]:
plt.figure()
# histplot replaces the deprecated distplot; kde=True overlays a density estimate
sn.histplot(df_train['Fare'], kde=True)
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.show()


### Age feature

Before plotting, we fill the NaN values in the Age feature with the mean age.

In [27]:
df_train['Age']=df_train['Age'].fillna(df_train['Age'].mean())
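
Mean imputation keeps the column mean unchanged but shrinks its spread, because every filled value sits exactly at the mean. A small sketch on made-up ages (the `ages` series is illustrative only) shows both effects:

```python
import numpy as np
import pandas as pd

# Illustrative ages with two missing entries
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan])

# Same call pattern as above: replace NaN with the mean of the observed values
filled = ages.fillna(ages.mean())

print(ages.mean(), filled.mean())  # the mean is preserved
print(ages.std(), filled.std())    # the standard deviation gets smaller
```

In a histogram this shows up as an extra spike at the mean value.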

In [28]:
plt.figure()
# histplot replaces the deprecated distplot; kde=True overlays a density estimate
sn.histplot(df_train['Age'], kde=True)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()


## Missing Value Treatment

Let's find the number of NaN values in each feature. Age already shows 891 non-null entries because its NaNs were filled with the mean above.

In [32]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
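
`info()` reports non-null counts, so the number of missing values has to be read off as a difference from 891. A short sketch of a more direct count with `isna().sum()`, on a toy frame whose values are partly made up:

```python
import numpy as np
import pandas as pd

# Toy frame; the NaN in Embarked is made up for illustration
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", "S", np.nan],
})

# Number of NaN values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```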


Embarked has two NaNs. 'S' is the predominant category in Embarked, i.e. 'S' is the mode.
Through mode imputation, we fill the NaNs with 'S'.

In [33]:
df_train['Embarked']=df_train['Embarked'].fillna('S')
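
Instead of hard-coding 'S', the mode can be computed from the column itself. A minimal sketch on made-up values (the `embarked` series is illustrative only):

```python
import numpy as np
import pandas as pd

# Illustrative ports of embarkation with one missing entry
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S"])

# mode() ignores NaN and returns a Series (there can be ties), so take the first value
most_common = embarked.mode().iloc[0]
filled = embarked.fillna(most_common)

print(most_common)
print(filled.tolist())
```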

I drop the Cabin feature entirely: only 204 of its 891 values are present, so it is unlikely to contribute much to the classification.

In [36]:
df_train=df_train.drop('Cabin',axis=1)
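
In the training data, Cabin has 204 non-null values out of 891, so (891 - 204) / 891, or roughly 77%, is missing. The decision to drop such a column could be expressed as a rule on the missing share. The sketch below does this on a toy frame; the 0.5 cut-off is a hypothetical choice, not a value taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train
toy = pd.DataFrame({
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
})

# Fraction of NaN values per column
missing_share = toy.isna().mean()

threshold = 0.5  # hypothetical cut-off for "too sparse to keep"
to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(reduced.columns.tolist())
```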

## Feature Engineering

To use categorical variables for training, we have to convert them into numerical form; here this is done through one-hot encoding.

We won't be using the Name and Ticket features for training, so it is not necessary to encode them.

In [None]:
df_train=pd.get_dummies(df_train,columns=['Pclass','SibSp','Sex'])
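
To see what `get_dummies` does, here is a small sketch on a three-row toy frame. `dtype=int` is passed so the indicators print as 0/1 regardless of the pandas version, which is an addition for illustration and not part of the cell above.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

# Each listed column is replaced by one indicator column per observed value
encoded = pd.get_dummies(toy, columns=["Pclass", "Sex"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Only values that actually occur get a column, so this toy frame has no `Pclass_2`.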

In [41]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Name,Age,Parch,Ticket,Fare,Embarked,Pclass_1,Pclass_2,Pclass_3,SibSp_0,SibSp_1,SibSp_2,SibSp_3,SibSp_4,SibSp_5,SibSp_8,Sex_female,Sex_male
0,1,0,"Braund, Mr. Owen Harris",22.0,0,A/5 21171,7.25,S,0,0,1,0,1,0,0,0,0,0,0,1
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,0,PC 17599,71.2833,C,1,0,0,0,1,0,0,0,0,0,1,0
2,3,1,"Heikkinen, Miss. Laina",26.0,0,STON/O2. 3101282,7.925,S,0,0,1,1,0,0,0,0,0,0,1,0
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,0,113803,53.1,S,1,0,0,0,1,0,0,0,0,0,1,0
4,5,0,"Allen, Mr. William Henry",35.0,0,373450,8.05,S,0,0,1,1,0,0,0,0,0,0,0,1
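
Many modelling libraries expect purely numeric input, so it can help to check which feature columns are still non-numeric after setting aside Name and Ticket. The sketch below runs that check on a hypothetical two-row frame shaped like the output above; choosing `Survived` as the target is an assumption about the modelling step.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical frame shaped like the encoded output (only a few columns kept)
toy = pd.DataFrame({
    "Survived": [0, 1],
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Ticket": ["A/5 21171", "STON/O2. 3101282"],
    "Embarked": ["S", "S"],
    "Sex_female": [0, 1],
    "Sex_male": [1, 0],
})

# Separate the target from the features, leaving out the unused text columns
target = toy["Survived"]
features = toy.drop(columns=["Survived", "Name", "Ticket"])

# Columns that would still need encoding before training
non_numeric = [col for col in features.columns if not is_numeric_dtype(features[col])]
print(non_numeric)
```

Here the check flags `Embarked`, which is still stored as text.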


### Now we are all set to go for modelling