In [None]:
import pandas as pd
import numpy as np
from scipy.stats import pearsonr, chi2_contingency
import seaborn as sns
%matplotlib inline

In [None]:
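def plot_count(col_name):
    # Bar chart of passenger counts per category of col_name
    data.groupby(col_name).agg({'PassengerId':'count'}).plot(kind='bar');
    
def tab_count_surv(col_name):
    # Passenger count and survival rate per category of col_name
    return data.groupby(col_name).agg({'PassengerId':'count','Survived':'mean'})
    
def chi2_pvalue(col_name):
    # Chi-squared test of independence on the col_name x Survived contingency table
    # (one column of counts per outcome: died, survived)
    return 'P-value: ' + str(chi2_contingency(pd.crosstab(data[col_name],data['Survived']).values)[1])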

In [None]:
data = pd.read_csv('./data/titanic-data.csv')

In [None]:
data.head()

In [None]:
data.shape

The data set describes 891 passengers.
It contains information about survival, class, name, sex, age, number of siblings/spouses aboard, number of parents/children aboard, fare, cabin and port of embarkation.

I would like to derive some additional variables:
- total number of family members aboard
- indicator variables for having family, parents/children, siblings/spouse aboard
- indicator variable for whether Cabin is recorded

The main question I would like to answer is whether any variables help predict survival. I would also like to check some interactions between variables (e.g. age and Parch). Finally, I will check whether relations that shouldn't be present in the dataset (such as between port of embarkation and survival) are really absent.

As this is only an exploratory analysis, most of the results will not be statistically tested for significance.

In [None]:
data['family']=data['SibSp']+data['Parch']

In [None]:
data['has_family']=data['family'].apply(lambda x: 1 if x>0 else 0)

In [None]:
data['has_Parch']=data['Parch'].apply(lambda x: 1 if x>0 else 0)

In [None]:
data['has_SibSp']=data['SibSp'].apply(lambda x: 1 if x>0 else 0)

In [None]:
data['has_cabin']=data['Cabin'].notnull().astype(int)

In [None]:
data.describe()

On average 38% of passengers survived. The average age of passengers with a recorded age is 29.7.
About 40% have family on board. For almost 80% we have no cabin information. The average fare is 32, but the median is only 14.5, which suggests a right-skewed distribution. The data values seem to be OK. I assume that passengers with a fare equal to zero are crew.
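
The mean-versus-median argument can be checked numerically. The sketch below uses a handful of made-up fares (not values from the Titanic file) to show how a few expensive tickets pull the mean above the median and give a positive skewness.

```python
import pandas as pd

# Made-up fares for illustration only: many cheap tickets, a few expensive ones.
toy_fares = pd.Series([7.25, 7.9, 8.05, 10.5, 13.0, 14.5, 26.0, 31.0, 80.0, 263.0])

mean_fare = toy_fares.mean()
median_fare = toy_fares.median()
skewness = toy_fares.skew()  # positive values indicate a long right tail

print('mean:', round(mean_fare, 2))
print('median:', median_fare)
print('skewness:', round(skewness, 2))
```

On the real data the same three calls on `data['Fare']` should show whether the gap seen in `describe()` comes with a long right tail.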

In [None]:
data.select_dtypes(include='number').corr()

In [None]:
sns.heatmap(data.select_dtypes(include='number').corr(),square = True);

Survival seems to be correlated with Fare, Pclass and having a cabin, but also with having family on board. The correlation with having parents/children seems to be higher than with having siblings/spouse. Fare, Pclass and having a cabin are also rather highly correlated with each other. In the next step I am going to test the significance of these correlations.

In [None]:
# Pearson correlation of each numeric column with Survived, skipping missing values
numeric_cols = data.drop(['PassengerId','Survived'],axis=1).select_dtypes(include='number').columns
for x in numeric_cols:
    mask = data[x].notnull()
    cor = pearsonr(data.loc[mask,'Survived'],data.loc[mask,x])
    print(x, 'Corr:', cor[0], 'p-value:', cor[1])

In [None]:
plot_count('Pclass')

In [None]:
tab_count_surv('Pclass')

In [None]:
chi2_pvalue('Pclass')

The higher the class, the higher the survival rate. The differences between classes are statistically significant according to the chi-squared test. From the chart and table above we can also see that most passengers were in 3rd class.
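
A chi-squared test of independence compares the observed counts of each outcome per group with the counts expected if group and outcome were unrelated. The sketch below shows one way to build such a table with `pd.crosstab` and pass it to `chi2_contingency`. The data frame `toy` is invented for illustration and does not reproduce the Titanic numbers.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data, invented for illustration: 3 classes with 20 passengers each.
toy = pd.DataFrame({
    'Pclass':   [1] * 20 + [2] * 20 + [3] * 20,
    'Survived': [1] * 15 + [0] * 5 + [1] * 10 + [0] * 10 + [1] * 4 + [0] * 16,
})

# Rows are classes, columns are the two outcomes (0 = died, 1 = survived).
table = pd.crosstab(toy['Pclass'], toy['Survived'])

# chi2_contingency returns the statistic, the p-value, the degrees of freedom
# and the table of expected counts under independence.
chi2, p_value, dof, expected = chi2_contingency(table.values)

print(table)
print('chi2:', chi2, 'p-value:', p_value, 'degrees of freedom:', dof)
```

Each row of the table must split the group into mutually exclusive outcomes, so that the row totals equal the group sizes.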

In [None]:
data['Age'].plot(kind='hist');

Most people traveling on the Titanic were 20-40 years old. In addition, some of them had young children with them.

In [None]:
data['age_bin']=pd.cut(data['Age'],bins=[0,5,10,20,30,40,50,60,100])

In [None]:
tab_count_surv('age_bin')

As a difference is visible only for children below 5 years, I will recalculate the bins.

In [None]:
data['age_bin']=pd.cut(data['Age'],bins=[0,5,100])

In [None]:
chi2_pvalue('age_bin')

Children up to five years old have a significantly higher survival rate.

In [None]:
data['age_bin']=pd.cut(data['Age'],bins=[0,1,2,3,4,5,10,20,30,40,50,60,100])

I would like to see if there is any interaction between age and having family (I expect a lower survival rate for children without family).

In [None]:
data.pivot_table(index='age_bin',columns='has_family',values='Survived',aggfunc='mean')

Most young children (below 10 years old) traveled with family. For people older than ten, having family increased the survival rate in almost all age categories.
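
When reading a table of survival rates per cell, it helps to see how many passengers each rate is based on, since a rate computed from two or three people is not very informative. A minimal sketch on invented data (the frame `toy` is hypothetical) that reports the mean and the count side by side:

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    'Age':        [2, 4, 8, 15, 25, 27, 33, 35, 45, 52],
    'has_family': [1, 1, 1, 0, 0, 1, 0, 1, 0, 1],
    'Survived':   [1, 1, 0, 0, 0, 1, 0, 1, 0, 1],
})
toy['age_bin'] = pd.cut(toy['Age'], bins=[0, 10, 100])

# Survival rate and number of passengers for every observed
# (age_bin, has_family) combination.
summary = (toy.groupby(['age_bin', 'has_family'], observed=True)['Survived']
              .agg(['mean', 'count']))

print(summary)
```

Combinations that do not occur (here, young children without family) simply do not appear in the result, which makes the sparse cells easy to spot.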

In [None]:
tab_count_surv('has_family')

In [None]:
chi2_pvalue('has_family')

We can see that having family significantly increased the survival rate.

I would like to have equally populated bins for fare, so I calculate percentiles excluding zeros.

In [None]:
data.loc[data['Fare']>0,'Fare'].describe(percentiles=[0,.2,.4,.6,.8,1])
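
As an alternative to reading the bin edges off `describe()`, `pd.qcut` can compute quantile-based bins directly. This is only a sketch on made-up fares, and it is not the binning used in the rest of this notebook, which keeps a separate bin for zero fares.

```python
import pandas as pd

# Made-up fares for illustration only; two zero fares plus ten distinct paid fares.
toy_fares = pd.Series([0.0, 0.0, 4.0, 7.0, 8.0, 10.0, 13.0, 20.0, 26.0, 50.0, 80.0, 300.0])

# Exclude zero fares, then split the rest into 5 equally populated bins.
nonzero = toy_fares[toy_fares > 0]
fare_bin = pd.qcut(nonzero, q=5)

# Number of fares falling into each quantile bin, in bin order.
bin_sizes = fare_bin.value_counts().sort_index()
print(bin_sizes)
```

With heavily tied values (many identical fares) `pd.qcut` may raise an error about duplicate edges unless `duplicates='drop'` is passed, in which case the bins are no longer equally populated.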

In [None]:
data['Fare_bin']=pd.cut(data['Fare'],bins=[0,4.01,7.89,11.13,23,40.125,513],right=False)

In [None]:
tab_count_surv('Fare_bin')

In [None]:
chi2_pvalue('Fare_bin')

Survival depends significantly on the fare paid. There is only one exception: the first non-zero category has a higher survival rate than the following one. It is probably the effect of some kind of interaction.

I am recalculating the age bins again to compare them with the fare bins.

In [None]:
data['age_bin']=pd.cut(data['Age'],bins=[0,10,20,30,40,50,60,100])

The table below shows the fraction of each age bin per fare bin. Each row sums to 1.

In [None]:
# Passenger counts per (Fare_bin, age_bin), divided by the row totals
counts = data.pivot_table(index='Fare_bin',columns='age_bin',values='PassengerId',aggfunc='count')
counts.div(counts.sum(axis=1),axis=0)
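
Row fractions of this kind can also be obtained in one call with `pd.crosstab` and `normalize='index'`. A small sketch on invented labels (the frame `toy` and its bin names are hypothetical):

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    'Fare_bin': ['low', 'low', 'low', 'low', 'high', 'high', 'high', 'high', 'high'],
    'age_bin':  ['0-30', '0-30', '0-30', '30+', '0-30', '30+', '30+', '30+', '30+'],
})

# normalize='index' divides every row by its total, so each row sums to 1.
shares = pd.crosstab(toy['Fare_bin'], toy['age_bin'], normalize='index')

print(shares)
```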

In [None]:
data.pivot_table(index='Fare_bin',columns='age_bin',values='Survived',aggfunc='mean')

There is an overrepresentation of young and healthy people (20-30 years old) in the lowest non-zero fare group, which may be the cause of the higher survival rate there. I assume there may be some kind of "special" group there, as their survival rate is very high for this fare class.

In [None]:
plot_count('has_cabin')

In [None]:
tab_count_surv('has_cabin')

In [None]:
chi2_pvalue('has_cabin')

Having a cabin also significantly improves the survival rate, but we have to remember that it is also correlated with wealth (defined by fare paid and Pclass).

In [None]:
tab_count_surv('Embarked')

In [None]:
chi2_pvalue('Embarked')

In [None]:
data.pivot_table(index='Fare_bin',columns='Embarked',values='PassengerId',aggfunc='count')

Cherbourg has the highest survival rate, which shouldn't occur. It is probably caused by a higher share of wealthy people embarking there.
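
One way to check such an explanation is to compare ports within each class. The sketch below uses toy data deliberately constructed so that the port has no effect within a class. It only illustrates the mechanism and says nothing about the actual Titanic figures.

```python
import pandas as pd

# Toy data, constructed for illustration: within each class the survival
# rate is the same for both ports, but port C has a larger share of 1st class.
toy = pd.DataFrame({
    'Embarked': ['C'] * 8 + ['S'] * 16,
    'Pclass':   [1] * 4 + [3] * 4 + [1] * 4 + [3] * 12,
    'Survived': [1, 1, 1, 0] + [1, 0, 0, 0] + [1, 1, 1, 0] + [1, 1, 1] + [0] * 9,
})

# Overall survival rate per port.
overall = toy.groupby('Embarked')['Survived'].mean()

# Share of each class among passengers of a port (rows sum to 1).
class_mix = pd.crosstab(toy['Embarked'], toy['Pclass'], normalize='index')

# Survival rate per port within each class.
within_class = toy.pivot_table(index='Pclass', columns='Embarked',
                               values='Survived', aggfunc='mean')

print(overall)
print(class_mix)
print(within_class)
```

If the real data behaved like this toy example, the difference between ports would disappear once class is held fixed. A difference that remains within classes would point to something other than wealth.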

## Summary
Survival seems to depend mostly on the wealth of the passenger (Pclass, fare and having a cabin) and on having family.