# Titanic Exploratory Data Analysis

# Introduction

This is an exploration of the Titanic dataset. My goal is to get an in-depth understanding of the data and to shortlist a few promising transformations to experiment with during model creation, where my aim will be to predict passenger survival.

## Outline

1. [Get the data](#obtain)
1. [Explore the data](#explore)
1. [Promising transformations](#transformations)

<a id='obtain'></a>

# Get the data

Let's load the data and have a quick look at its structure.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
test = pd.read_csv('../input/titanic/test.csv')   # for basic checks
train = pd.read_csv('../input/titanic/train.csv')
train.head()

In [None]:
train.info()

In [None]:
test.info()

In [None]:
len(test)

In [None]:
total_len = len(train) + len(test)
print(len(train) / total_len * 100)
print(len(test) / total_len * 100)

To summarise the above:
* There are 891 instances in the training set, each one representing a unique passenger.
* There are 12 attributes: two of these are `float64`, five are `int64`, and the remaining five are `object`s.
* The target attribute is `Survived`, so it is not present in the test set. Additionally, it has no missing values.
* `Age`, `Cabin`, `Embarked` have missing values in the training set.
* In the test set, `Age`, `Cabin`, and `Fare` have missing values.
* `PassengerId` is a running index. Therefore, it will not provide any useful information during modelling and can be dropped.
* The dataset is split roughly 70/30 (more precisely, about 68/32) into training and test sets.

In [None]:
train = train.drop(columns='PassengerId')

<a id='explore'></a>

# Explore the data

I will first explore the numerical attributes, followed by the categorical attributes. I'll start off broad to gain a general understanding of the kind of data I'm manipulating, and then I'll follow this up with an in-depth investigation of the individual attributes.

## Numerical attributes

In [None]:
train.describe()

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(12,8))
train['Survived'].value_counts().plot.bar(ax=ax[0,0], title='Survived')
train['Pclass'].value_counts().plot.bar(ax=ax[1,0], title='Pclass')
train['SibSp'].value_counts().plot.bar(ax=ax[0,1], title='SibSp')
train['Parch'].value_counts().plot.bar(ax=ax[1,1], title='Parch')
plt.setp(ax[:, 0], ylabel='Counts')
plt.setp(ax[0,0].xaxis.get_majorticklabels(), rotation=360)
plt.setp(ax[1,0].xaxis.get_majorticklabels(), rotation=360)
plt.setp(ax[0,1].xaxis.get_majorticklabels(), rotation=360)
plt.setp(ax[1,1].xaxis.get_majorticklabels(), rotation=360);

In [None]:
age_fare = train[['Age', 'Fare']]
age_fare.hist(bins=20, figsize=(12,4));

In [None]:
print(train['Survived'].value_counts())
print('\n')
print(train['Survived'].value_counts() / len(train) * 100)
print('\n')
print('0 = No, 1 = Yes')

The target attribute `Survived` is a binary attribute where 0 = No and 1 = Yes. Most passengers (62%) did not survive. `Pclass`, which represents ticket class, is an ordinal integer feature where 1 = 1st, 2 = 2nd, and 3 = 3rd class ticket. The majority of passengers had a 3rd class ticket, followed by 1st class, and then 2nd class. I will keep `Survived` and `Pclass`, as well as `SibSp` and `Parch`, as numerical attributes because most machine learning algorithms expect numerical inputs.

For the remaining numerical attributes:
* `Age` approximates a normal distribution, but it is slightly skewed to the right.
* `Fare` is heavily skewed to the right.
* These attributes may need to be transformed later on to have a more bell-shaped distribution.
* The scales differ considerably (e.g., `Age` ranges from 0–80 while `Fare` reaches values of about 500), so feature scaling will be necessary.

A very simple classifier might predict 'No' for every instance, given that most passengers did not survive. Doing this would result in an accuracy of about 62% on the training set. I want to create a model that significantly improves on this.
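
The majority-class baseline described above can be computed directly from the label counts. Here is a minimal sketch; `majority_baseline_accuracy` is a hypothetical helper name and the labels are toy values, not the Titanic data.

```python
import pandas as pd

def majority_baseline_accuracy(labels: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    # value_counts(normalize=True) gives the share of each class;
    # the largest share is the accuracy of the constant predictor.
    return labels.value_counts(normalize=True).max()

# Toy labels for illustration only: five 0s and three 1s
toy_labels = pd.Series([0, 0, 0, 1, 1, 0, 1, 0])
baseline = majority_baseline_accuracy(toy_labels)
print(f'Majority-class accuracy: {baseline:.3f}')
```

On the real data this would be called as `majority_baseline_accuracy(train['Survived'])`.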

Next, I'll take a closer look at `Age` and `Fare`.

### `Age`

In [None]:
plt.boxplot([train['Age'].dropna(axis=0)]) # drop missing values otherwise it will not work
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Years');

In [None]:
q_1, q_3 = np.percentile(train['Age'].dropna(axis=0), [25, 75])
IQR = q_3 - q_1
upper_bound = q_3 + (1.5 * IQR)
age_outliers = train[train['Age'] > upper_bound]
age_outliers

In [None]:
len(age_outliers)

There are 11 data points above the upper bound (3rd quartile + 1.5 times the IQR).
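
Since the same upper-bound calculation is repeated for `Fare` further down, it could be wrapped in a small function. This is only a sketch: `iqr_upper_bound` is a hypothetical helper name and the values are toy numbers.

```python
import numpy as np
import pandas as pd

def iqr_upper_bound(values: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR, ignoring missing values."""
    q_1, q_3 = np.percentile(values.dropna(), [25, 75])
    return q_3 + k * (q_3 - q_1)

# Toy values for illustration only, including one missing entry
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.nan])
bound = iqr_upper_bound(toy)
outliers = toy[toy > bound]   # comparisons with NaN are False, so NaN is excluded
print(bound, outliers.tolist())
```

With the real data, `train[train['Age'] > iqr_upper_bound(train['Age'])]` should select the same rows as the cell above.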

In [None]:
print(np.min(age_outliers['Age']))
print(np.max(age_outliers['Age']))

These values seem reasonable.

Were younger passengers more likely to survive compared to older passengers?

In [None]:
surv = train[train['Survived'] == 1]
surv_no = train[train['Survived'] == 0]

surv_age = surv['Age'].dropna(axis=0)
surv_no_age = surv_no['Age'].dropna(axis=0)

plt.hist(surv_age, bins=50, alpha=0.4, label='Survived')
plt.hist(surv_no_age, bins=50, alpha=0.4, label='Did not survive')
plt.legend(loc='upper right');

Survival was higher for very young children (< 5 years of age). Additionally, there appear to be many young adults (up to about 30 years of age) who did not survive. Otherwise the distributions are similar.

### `Fare`

In [None]:
plt.boxplot(train['Fare'])
plt.title('Distribution of Fare')
plt.xlabel('Fare')
plt.ylabel('Pounds');  # Titanic fares were priced in pounds sterling, not dollars

In [None]:
q_1, q_3 = np.percentile(train['Fare'], [25, 75])
IQR = q_3 - q_1
upper_bound = q_3 + (1.5 * IQR)
fare_outliers = train[train['Fare'] > upper_bound]
fare_outliers

In [None]:
train[train['Fare'] > 500]

In [None]:
len(fare_outliers)

In [None]:
len(fare_outliers) / len(train) * 100

The `Fare` values of 512.33 are extreme. According to [this source](https://www.encyclopedia-titanica.org/titanic-survivor/annie-moore-ward.html), this fare is accurate. Additionally, there are 116 data points above the upper bound (3rd quartile + 1.5 times the IQR). This is 13% of all `Fare` values.

In [None]:
surv = train[train['Survived'] == 1]
surv_no = train[train['Survived'] == 0]

surv_fare = surv['Fare'].dropna(axis=0)
surv_no_fare = surv_no['Fare'].dropna(axis=0)

plt.hist(surv_fare, bins=20, alpha=0.4, label='Survived')
plt.hist(surv_no_fare, bins=20, alpha=0.4, label='Did not survive')
plt.legend(loc='upper right');

Additionally, as seen from the above plot, there appear to be many low/zero fare entries, particularly for passengers who did not survive.

In [None]:
len(surv[surv['Fare'] <= 10])

In [None]:
len(surv_no[surv_no['Fare'] <= 10])

In [None]:
len(surv_no[surv_no['Fare'] == 0])

Many machine learning algorithms do not handle heavily skewed features very well. One way of dealing with this is to transform the attribute logarithmically.

In [None]:
surv = train[train['Survived'] == 1]
surv_no = train[train['Survived'] == 0]

surv_fare = np.log10(surv['Fare'].dropna(axis=0).values+1)       # to adjust fare entries which are 0
surv_no_fare = np.log10(surv_no['Fare'].dropna(axis=0).values+1)

plt.hist(surv_fare, bins=20, alpha=0.4, label='Survived')
plt.hist(surv_no_fare, bins=20, alpha=0.4, label='Did not survive')
plt.legend(loc='upper right');

Passengers that did not survive typically paid less for their fare compared to those that did survive.
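
To put a number on how much a log transform reduces skew, the sample skewness can be compared before and after. This is a rough sketch on toy fares, not the full dataset. It uses `np.log1p` (natural log of 1 + x) rather than `log10(x + 1)`; the two differ only by a constant factor, so the skewness is the same.

```python
import numpy as np
import pandas as pd

# Toy right-skewed fares for illustration only (not the full Titanic data)
toy_fare = pd.Series([0.0, 7.25, 8.05, 13.0, 26.0, 71.28, 512.33])

# log1p(x) computes log(1 + x), so zero fares map to 0 instead of -inf
log_fare = np.log1p(toy_fare)

raw_skew = toy_fare.skew()
log_skew = log_fare.skew()
print(f'skew before: {raw_skew:.2f}, after: {log_skew:.2f}')
```

On the real data, `train['Fare'].skew()` and `np.log1p(train['Fare']).skew()` would give the equivalent comparison.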

### `Parch` and `SibSp`

How many passengers travelled alone (i.e., had a value of 0 for both `Parch` and `SibSp`)? Were they more likely to survive compared to those that travelled with family?

In [None]:
cond1 = train['Parch'] == 0
cond2 = train['SibSp'] == 0
len(train[cond1 & cond2])

In [None]:
cond1 = train['Parch'] == 0
cond2 = train['SibSp'] == 0
alone = train[cond1 & cond2]

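# A passenger is *not* alone if they have a parent/child OR a sibling/spouse aboard,
# so the conditions are combined with | (using & would keep only passengers with both)
cond1 = train['Parch'] != 0
cond2 = train['SibSp'] != 0
not_alone = train[cond1 | cond2]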

print('Alone:')
print((alone['Survived'].value_counts()) / len(alone) * 100)
print('\n')
print('Not alone:')
print((not_alone['Survived'].value_counts()) / len(not_alone) * 100)

In [None]:
train['TravellingAlone'] = 0
cond = (train['Parch'] == 0) & (train['SibSp'] == 0)
train.loc[cond, 'TravellingAlone'] = 1
train['TravellingAlone'].value_counts().sort_values(ascending=True)

In [None]:
train['TravellingAlone'].value_counts(ascending=True).plot.bar()
plt.title('TravellingAlone')
plt.xticks(rotation=360)
plt.ylabel('Counts')
plt.figtext(0.90, 0.01, '0 = No, 1 = Yes', horizontalalignment='right');

In [None]:
print(alone['Survived'].value_counts())
print('\n')
print(alone['Survived'].value_counts() / len(alone) * 100)

In [None]:
print(not_alone['Survived'].value_counts().sort_values(ascending=False))
print('\n')
print(not_alone['Survived'].value_counts().sort_values(ascending=False) / len(not_alone) * 100)

In [None]:
alone = train[train['TravellingAlone'] == 1]
not_alone = train[train['TravellingAlone'] == 0]

# sort_index() keeps the counts in label order (0, then 1) so the bars line up with the tick labels
survived_alone = alone['Survived'].value_counts().sort_index()
survived_not_alone = not_alone['Survived'].value_counts().sort_index()

n_groups = 2
index = np.arange(n_groups)

width = 0.3

plt.bar(np.arange(len(survived_alone)), survived_alone, width=width, label='Alone')
plt.bar(np.arange(len(survived_not_alone)) + 0.3, survived_not_alone, width=width, label='Not Alone', color='mediumseagreen')
plt.xticks(index + 0.15, ('0', '1'), rotation=360)
plt.title('Survived by TravellingAlone')
plt.xlabel('Survived')
plt.ylabel('Counts')
plt.legend()
plt.figtext(0.90, 0.01, '0 = No, 1 = Yes', horizontalalignment='right');

Most passengers (537) were travelling alone. Of these passengers, 30% survived. In comparison, about 51% of passengers who were *not* travelling alone survived. I wonder whether survival varies as a function of `SibSp` and/or `Parch`. In particular, perhaps the likelihood of survival increases only up to a certain point (large families might be a hindrance).
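
One way to check whether survival varies with family size is to group by the number of relatives aboard and look at the mean of `Survived`. The sketch below uses toy rows; `FamilySize` is a hypothetical derived feature, not something in the original data.

```python
import pandas as pd

# Toy rows for illustration only; on the real data, use `train` instead
toy = pd.DataFrame({
    'SibSp':    [0, 0, 1, 1, 0, 4],
    'Parch':    [0, 0, 0, 2, 0, 2],
    'Survived': [0, 1, 1, 1, 0, 0],
})

# FamilySize (hypothetical feature): number of relatives aboard
family_size = (toy['SibSp'] + toy['Parch']).rename('FamilySize')

# The mean of a 0/1 target per group is that group's survival rate;
# the count shows how many passengers each rate is based on
rate_by_size = toy.groupby(family_size)['Survived'].agg(['mean', 'count'])
print(rate_by_size)
```

Keeping the count next to the rate matters here, because the larger family sizes are likely to contain very few passengers.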

## Categorical attributes

### `Name`

In [None]:
train['Name'].head()

In [None]:
train['Name'].tail()

I wonder how helpful titles (e.g., Mr.) might be?
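
Titles could be pulled out of `Name` with a regular expression. This is a minimal sketch that assumes every name follows the `Surname, Title. Given names` layout seen above; names that break this layout would come back as missing.

```python
import pandas as pd

# A few names in the 'Surname, Title. Given names' layout
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
])

# Capture the text between the first comma and the next full stop
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles.value_counts())
```

On the real data, `train['Name']` would replace `names`; rare titles may need grouping before they are useful to a model.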

### `Sex`

In [None]:
train['Sex'].value_counts().plot.bar()
plt.title('Sex')
plt.xticks(rotation=360)
plt.ylabel('Counts');

I wonder if one sex was more likely to survive than the other? I expect that women and children were evacuated first.

In [None]:
males = train[train['Sex'] == 'male']
females = train[train['Sex'] == 'female']

# sort_index() keeps the counts in label order (0, then 1) so the bars line up with the tick labels
survived_males = males['Survived'].value_counts().sort_index()
survived_females = females['Survived'].value_counts().sort_index()

n_groups = 2
index = np.arange(n_groups)

width = 0.3

plt.bar(np.arange(len(survived_males)), survived_males, width=width, label='Male')
plt.bar(np.arange(len(survived_females)) + 0.3, survived_females, width=width, label='Female', color='mediumseagreen')
plt.xticks(index + 0.15, ('0', '1'), rotation=360)
plt.title('Survived by Sex')
plt.xlabel('Survived')
plt.ylabel('Counts')
plt.legend()
plt.figtext(0.90, 0.01, '0 = No, 1 = Yes', horizontalalignment='right');

The majority of female passengers survived, while most male passengers did not. A simple classifier might predict all female passengers survive while all male passengers do not. The accuracy of such a model on the test set is 76.5%. It will be interesting to see how much I can improve on this simple model using a classification algorithm and investing some time into feature engineering.

So, there are two potential baseline models to compare to:
1. Nobody survives. Accuracy is about 62% on the training set.
2. Females survive; males do not. Accuracy is 76.5%.
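
The sex-based baseline can be scored on any labelled data with a couple of lines. This sketch uses toy rows and assumes `Sex` still holds the strings `'male'` and `'female'` (i.e., it has not yet been recoded to numbers).

```python
import pandas as pd

# Toy labelled rows for illustration only; on the real data, use `train`
toy = pd.DataFrame({
    'Sex':      ['male', 'female', 'female', 'male', 'male'],
    'Survived': [0, 1, 0, 0, 1],
})

# Rule: predict 1 (survived) for females and 0 for males
predicted = (toy['Sex'] == 'female').astype(int)
accuracy = (predicted == toy['Survived']).mean()
print(f'Sex-rule accuracy: {accuracy:.2f}')
```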

### `Ticket`

In [None]:
train['Ticket'].head()

In [None]:
train['Ticket'].tail()

In [None]:
train['Ticket'].value_counts()

According to the data dictionary, `Ticket` represents the ticket number. I'm not sure how helpful this will be. Of those passengers who share a ticket number, what do they have in common?

In [None]:
train[train['Ticket'] == 'CA. 2343']

In [None]:
train[train['Ticket'] == '347082']

These passengers were in the same family.
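
If shared tickets do identify groups travelling together, the size of each group could be attached to every passenger. A small sketch follows; `TicketGroupSize` is a hypothetical feature name and the ticket strings are made up.

```python
import pandas as pd

# Toy rows for illustration only; these ticket strings are made up
toy = pd.DataFrame({'Ticket': ['A 1', 'A 1', 'B 2', 'A 1', 'C 3']})

# Count how often each ticket occurs, then map the count back to each row
ticket_counts = toy['Ticket'].value_counts()
toy['TicketGroupSize'] = toy['Ticket'].map(ticket_counts)
print(toy)
```

Note that counts taken from the training set alone would miss group members who ended up in the test set.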

### `Cabin`

In [None]:
train['Cabin'].head()

In [None]:
train['Cabin'].tail()

In [None]:
train['Cabin'].value_counts()

The letter prefix (e.g., `G`, `C`) might provide useful information for the model. Perhaps this represents sections of the ship.

In [None]:
train['CabinLetter'] = train['Cabin'].str[:1]
train['CabinLetter'].value_counts()

In [None]:
train['CabinLetter'].value_counts().plot.bar()
plt.title('Cabin Prefix')
plt.xticks(rotation=360)
plt.ylabel('Counts');

In [None]:
train[train['CabinLetter'] == 'C'].head()

In [None]:
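# For passengers without an assigned cabin, give them a value of 'unassigned'.
# Assigning the result back avoids the chained in-place fill, which newer pandas versions warn about.
train['CabinLetter'] = train['CabinLetter'].fillna('unassigned')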
train['CabinLetter'].value_counts()

In [None]:
train['Cabin'].isnull().sum()

Additionally, there are many missing values for `Cabin`. Of those passengers that have a value for `Cabin`, what do they have in common?

In [None]:
cabin_notnull = train[train['Cabin'].notnull()]
cabin_notnull.head()

In [None]:
# Passengers where Cabin is not missing
age_fare_notnull = cabin_notnull[['Age', 'Fare']]
age_fare_notnull.hist(bins=20, figsize=(12,4), label='Assigned cabin')

# Passengers where Cabin is missing
cabin_null = train[train['Cabin'].isnull()]
age_fare_null = cabin_null[['Age', 'Fare']]
age_fare_null.hist(bins=20, figsize=(12,4), label='No cabin assigned');

In [None]:
# sort_index() keeps the counts in label order (0, then 1) so the bars line up with the tick labels
counts_null = cabin_null['Survived'].value_counts().sort_index()
counts_notnull = cabin_notnull['Survived'].value_counts().sort_index()

n_groups = 2
index = np.arange(n_groups)

width = 0.3

plt.bar(np.arange(len(counts_null)), counts_null, width=width, label='No cabin assigned')
plt.bar(np.arange(len(counts_notnull)) + 0.3, counts_notnull, width=width, label='Cabin assigned', color='mediumseagreen')
plt.xticks(index + 0.15, ('0', '1'), rotation=360)
plt.title('Survived by Cabin (assigned or unassigned)')
plt.xlabel('Survived')
plt.ylabel('Counts')
plt.legend()
plt.figtext(0.90, 0.01, '0 = No, 1 = Yes', horizontalalignment='right');

The majority of passengers with an assigned `Cabin` survived. I will create `CabinAssigned` to indicate whether or not a passenger was assigned a cabin. `CabinLetter`, which provides additional information (deck letters and unassigned), is also promising. Later I will compute the correlation between these derived features and `Survived` to see their potential usefulness for modelling.

In [None]:
train['CabinAssigned'] = train['Cabin'].notnull().astype(int)  # 1 = cabin assigned, 0 = no cabin recorded

### `Embarked`

In [None]:
train['Embarked'].value_counts(ascending=False).plot.bar()
plt.title('Embarked')
plt.xticks(rotation=360)
plt.ylabel('Counts')
plt.figtext(0.90, 0.01, 'S = Southampton, C = Cherbourg, Q = Queenstown', horizontalalignment='right');

In [None]:
print('Counts:')
print(train['Embarked'].value_counts(ascending=False))
print('\n')
print('%:')
print(train['Embarked'].value_counts(ascending=False) / len(train) * 100)

Most passengers embarked from Southampton. Were these people more or less likely to survive? How does `Embarked` relate to other attributes such as `Pclass`?
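
Both questions can be approached with a group-by and a cross-tabulation. The sketch below runs on toy rows rather than the real data, so the numbers it prints say nothing about the Titanic itself.

```python
import pandas as pd

# Toy rows for illustration only; on the real data, use `train` instead
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'C', 'Q', 'C', 'S'],
    'Pclass':   [3, 3, 1, 3, 1, 2],
    'Survived': [0, 1, 1, 0, 1, 0],
})

# Survival rate per port of embarkation
survival_by_port = toy.groupby('Embarked')['Survived'].mean()

# Share of each ticket class within each port (each row sums to 1)
class_share_by_port = pd.crosstab(toy['Embarked'], toy['Pclass'], normalize='index')

print(survival_by_port)
print(class_share_by_port)
```

If one port turns out to have mostly 1st class passengers, any survival difference by port may simply reflect ticket class.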

### `Pclass`

In [None]:
train['Pclass'].value_counts(ascending=False).plot.bar()
plt.title('Pclass')
plt.xticks(rotation=360)
plt.ylabel('Counts')
plt.figtext(0.90, 0.01, '1 = 1st class, 2 = 2nd class, 3 = 3rd class', horizontalalignment='right');

`Pclass` is a categorical attribute (ticket class) that has been encoded as numerical (1 = 1st, 2 = 2nd, and 3 = 3rd class).

## Missing values

In [None]:
(train.isnull().sum().sort_values(ascending=True) / len(train) * 100).plot.barh()
plt.title('Missing values')
plt.xlabel('Percentage')
plt.ylabel('Attribute');

In [None]:
print('Missing counts:')
print(train.isnull().sum().sort_values(ascending=False))
print('\n')
print('Missing %:')
print(train.isnull().sum().sort_values(ascending=False) / len(train) * 100)

`Cabin`, `Age`, and `Embarked` have missing values, ranging from 0.2% (`Embarked`) to 77.1% (`Cabin`). I will impute missing `Embarked` values (most likely with the mode `S`) because there are so few. I may need to drop `Cabin` because there are so many missing values (687), and I'm not sure what to do about `Age` just yet (177 missing).

In [None]:
print('Missing counts:')
print(test.isnull().sum().sort_values(ascending=False))

On the test set, `Cabin`, `Age`, and `Fare` have missing values. Any imputations applied to the training set will also need to be applied to the test set. `Cabin` will most likely be dropped and replaced with a derived feature such as `CabinAssigned`.
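
To keep imputation consistent, the fill values can be learned from the training set once and then reused on the test set. The sketch below shows this on toy frames; using the median for `Fare` is my assumption, since that choice has not been made yet.

```python
import numpy as np
import pandas as pd

# Toy frames for illustration only
toy_train = pd.DataFrame({'Age': [20.0, np.nan, 40.0, 30.0],
                          'Fare': [7.0, 8.0, 9.0, 100.0],
                          'Embarked': ['S', 'S', None, 'C']})
toy_test = pd.DataFrame({'Age': [np.nan, 50.0],
                         'Fare': [np.nan, 10.0],
                         'Embarked': ['Q', 'S']})

# Learn the fill values from the training set only
fill_values = {
    'Age': toy_train['Age'].median(),
    'Fare': toy_train['Fare'].median(),   # assumption: median, as for Age
    'Embarked': toy_train['Embarked'].mode()[0],
}

# Apply the same values to both sets
train_filled = toy_train.fillna(fill_values)
test_filled = toy_test.fillna(fill_values)
print(test_filled)
```

Computing the statistics on the training set alone avoids leaking information from the test set into the model.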

## Correlations

In [None]:
make_num = {'Sex':         {'male': 0, 'female': 1},
            'Embarked':    {'S': 0, 'C': 1, 'Q': 2},
            'CabinLetter': {'unassigned': 0,
                            'C': 1,
                            'B': 2,
                            'D': 3,
                            'E': 4,
                            'A': 5,
                            'F': 6,
                            'G': 7,
                            'T': 8}
           }

train.replace(make_num, inplace=True)

In [None]:
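# numeric_only=True skips the remaining text columns (Name, Ticket, Cabin),
# which would otherwise raise an error in pandas 2.0 and later
corr_matrix = train.corr(numeric_only=True)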

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, cmap='RdBu');

In [None]:
corr_matrix['Survived'].sort_values(ascending=False)

To summarise the correlations above:
* There is a moderate positive relationship between `Survived` and `Sex`. This indicates that females were more likely to survive than males (0 = male, 1 = female).
* There is a moderate negative relationship between `Survived` and `Pclass`. So, passengers in a higher ticket class (e.g., 1st class) were more likely to survive than passengers in a lower ticket class (e.g., 3rd class).
* There is a moderate positive relationship between `Survived` and `CabinAssigned` (the attribute I derived earlier, which indicates whether or not a passenger has an assigned cabin).
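* `CabinLetter` has a weaker association with `Survived` compared to the other cabin attribute I derived, `CabinAssigned`. Perhaps the extra information is not so helpful, although the integer codes I assigned to `CabinLetter` (and `Embarked`) are arbitrary, so a linear correlation may understate their relationship with `Survived`.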
* There is a weak positive relationship between `Survived` and `Fare`. So, those that paid more for their ticket were more likely to survive than those that paid less.
* `Parch`, `SibSp`, and `Age` don't have much of a linear relationship with `Survived`. Perhaps there is some additional feature engineering work that can be done with these attributes.
* I derived `TravellingAlone` from `SibSp` and `Parch`. This attribute has a stronger, negative relationship with `Survived`. So, passengers that were not travelling alone were more likely to survive compared to passengers that were.
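
For nominal attributes such as `Embarked` or `CabinLetter`, the correlation depends on which integer each category happens to receive. One alternative is to one-hot encode the attribute and correlate each indicator column with `Survived`. This is only a sketch on toy rows, not a replacement for the heatmap above.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'C', 'C'],
    'Survived': [0, 0, 1, 1],
})

# One 0/1 indicator column per category, so no ordering is implied
dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked').astype(int)

# Pearson correlation of each indicator with the target
corr_with_target = dummies.corrwith(toy['Survived'])
print(corr_with_target)
```

This assumes the attribute still holds its original string categories rather than the integer codes used for the correlation matrix.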

<a id='transformations'></a>

# Promising transformations

Here is a list of promising transformations I can experiment with when creating models to predict passenger survival:
* Fill missing `Age` values with the median.
* Create `TravellingAlone`, which indicates whether or not a passenger was travelling alone.
* Drop `Ticket`, as it appears to offer little beyond identifying passengers who travelled together.
* Create `CabinAssigned`, which indicates whether or not the passenger had a cabin assigned.
* Create `CabinLetter`, which is similar to `CabinAssigned` but includes additional information for passengers that *were* assigned a cabin (e.g., deck `C`).
* Drop `Cabin` because it is mostly missing; the attributes derived from it above should capture the useful information.
* Fill missing values for `Embarked` with the mode.
* Transform `Age` and `Fare` to make them more normally distributed (`Fare` is especially skewed).
* Scale features so they have similar values.
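
As an example of the last item, standardisation rescales each feature to zero mean and unit standard deviation. This is a minimal pandas sketch on toy values; other scaling methods (e.g., min-max) would be equally valid to try.

```python
import pandas as pd

# Toy values for illustration only
toy = pd.DataFrame({'Age': [22.0, 38.0, 26.0, 35.0],
                    'Fare': [7.25, 71.28, 7.93, 53.10]})

# Column-wise statistics; ddof=0 gives the population standard deviation
means = toy.mean()
stds = toy.std(ddof=0)

# Subtract the mean and divide by the standard deviation, column by column
scaled = (toy - means) / stds
print(scaled.round(2))
```

When modelling, `means` and `stds` should be computed on the training set and reused unchanged on the test set.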

This concludes my first round of exploration of the Titanic dataset. Of course, not all of the above transformations will be useful (in fact, some may *decrease* accuracy). Next, I will train some models for predicting `Survived` using the insights I have obtained from this analysis. Once I inspect the outputs of these models, I might continue to do some more exploration to see if there are any other potentially useful transformations.