# Data Description

This notebook is an initial exploration of the data to determine what preprocessing is required before building and testing the models.

# Imports

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

# Basic EDA

Here is what each feature in the dataset represents:

![data_features](img/data_features.png)

In [2]:
# Import Data
train_df = pd.read_csv('../data/train.csv')
test_df = pd.read_csv('../data/test.csv')

train_df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [3]:
train_df.tail(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
881,882,0,3,"Markun, Mr. Johann",male,33.0,0,0,349257,7.8958,,S
882,883,0,3,"Dahlberg, Miss. Gerda Ulrika",female,22.0,0,0,7552,10.5167,,S
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5,,S
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.05,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.125,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [5]:
train_df.shape

(891, 12)

In [6]:
train_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [7]:
train_df.describe(include=["object", "bool"])

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Braund, Mr. Owen Harris",male,347082,B96 B98,S
freq,1,577,7,4,644


## Missing Values

In [17]:
train_df.isnull().sum(axis = 0)

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [29]:
p_age = 177 / train_df.shape[0]
p_cabin = 687 / train_df.shape[0]
p_embarked = 2 / train_df.shape[0]

print(f'Proportion of missing values in Age is {p_age:.3f}.')
print(f'Proportion of missing values in Cabin is {p_cabin:.3f}.')
print(f'Proportion of missing values in Embarked is {p_embarked:.3f}.')

Proportion of missing values in Age is 0.199.
Proportion of missing values in Cabin is 0.771.
Proportion of missing values in Embarked is 0.002.


From the output above, we see that the majority of the variables have no missing values. However, `Cabin` has a very large proportion of missing values (77%), which would be very difficult to impute correctly, so this feature will be dropped. `Age` has a much lower proportion of missing values (20%) than `Cabin`, but 20% is still a large part of the dataset, so we will need to decide how to fill these values without losing valuable information. Lastly, `Embarked` has a very low proportion (0.2%), with only two missing entries. Both `Age` and `Embarked` will need to be imputed.
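The proportions above are computed from hard-coded counts. As a minimal sketch, the same figures could be derived directly from the frame. The helper name `missing_summary` and the toy frame are assumptions for illustration; in the notebook the function would be called on `train_df`.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and proportion of missing values per column (hypothetical helper)."""
    summary = pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "proportion": df.isnull().mean(),
    })
    # Keep only the columns that actually have gaps, worst first.
    return summary[summary["n_missing"] > 0].sort_values("proportion", ascending=False)


# Toy frame standing in for train_df (values are illustrative, not from the dataset).
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

print(missing_summary(toy_df))
```

Because the counts are read from the data, the summary stays correct if rows are added or removed.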

In [None]:
train_df[train_df['Age'].isnull()].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0,,S
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.225,,C
26,27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.225,,C
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q


In [None]:
train_df[train_df['Age'].isnull()].tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
859,860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.55,,S
868,869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5,,S
878,879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S


One method of dealing with the missing values in `Age` is to impute them with the mean, median, or mode of a group. In this case, it might be useful to extract the title of each individual from their name, compute a summary statistic of the age within each title group, and impute the missing values from that. First, let's look at two slices of the dataset: males younger than 18, and females younger than 16.
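As a rough sketch of the title idea, the snippet below pulls the title out of a `Name`-style string and summarises `Age` per title. The regular expression assumes the title is the text between the comma and the first period, which matches the rows shown in this notebook but has not been checked against every name. The toy frame and the `Title` column are illustrative.

```python
import pandas as pd

# Toy rows in the same "Surname, Title. Given names" format as the Name column.
toy_df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
        "Rice, Master. Eugene",
        "Allen, Mr. William Henry",
    ],
    "Age": [22.0, 26.0, 2.0, 2.0, 35.0],
})

# Assumed pattern: the title sits between the comma and the first period.
toy_df["Title"] = (
    toy_df["Name"]
    .str.extract(r",\s*([^.]+)\.", expand=False)
    .str.strip()
    .fillna("no_title")  # names that do not match the pattern
)

# Summary statistics of Age within each title group.
age_by_title = toy_df.groupby("Title")["Age"].agg(["count", "mean", "median"])
print(age_by_title)
```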

In [43]:
train_df[(train_df['Age'] < 18) & (train_df['Sex'] == 'male')].head(20)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
16,17,0,3,"Rice, Master. Eugene",male,2.0,4,1,382652,29.125,,Q
50,51,0,3,"Panula, Master. Juha Niilo",male,7.0,4,1,3101295,39.6875,,S
59,60,0,3,"Goodwin, Master. William Frederick",male,11.0,5,2,CA 2144,46.9,,S
63,64,0,3,"Skoog, Master. Harald",male,4.0,3,2,347088,27.9,,S
78,79,1,2,"Caldwell, Master. Alden Gates",male,0.83,0,2,248738,29.0,,S
86,87,0,3,"Ford, Mr. William Neal",male,16.0,1,3,W./C. 6608,34.375,,S
125,126,1,3,"Nicola-Yarred, Master. Elias",male,12.0,1,0,2651,11.2417,,C
138,139,0,3,"Osen, Mr. Olaf Elon",male,16.0,0,0,7534,9.2167,,S
163,164,0,3,"Calic, Mr. Jovo",male,17.0,0,0,315093,8.6625,,S


In [42]:
train_df[(train_df['Age'] < 16) & (train_df['Sex'] == 'female')].head(20)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
14,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S
22,23,1,3,"McGowan, Miss. Anna ""Annie""",female,15.0,0,0,330923,8.0292,,Q
24,25,0,3,"Palsson, Miss. Torborg Danira",female,8.0,3,1,349909,21.075,,S
39,40,1,3,"Nicola-Yarred, Miss. Jamila",female,14.0,1,0,2651,11.2417,,C
43,44,1,2,"Laroche, Miss. Simonne Marie Anne Andree",female,3.0,1,2,SC/Paris 2123,41.5792,,C
58,59,1,2,"West, Miss. Constance Mirium",female,5.0,1,2,C.A. 34651,27.75,,S
111,112,0,3,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,,C
119,120,0,3,"Andersson, Miss. Ellis Anna Maria",female,2.0,4,2,347082,31.275,,S


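Looking at the `Age` variable in these rows, we can likely infer somebody's age from their title, or lack thereof. For example, every male aged 12 or under shown above is titled `Master`, while those aged 16 and 17 are titled `Mr`.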
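One way this inference could be turned into an imputation step is to fill each missing age with the median age of the passenger's title group. This is a sketch under assumptions: it uses the median (the mean or mode would work the same way), it presumes a `Title` column already exists, and the toy values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with a Title column already extracted (illustrative values only).
toy_df = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Master", "Master", "Miss", "Miss"],
    "Age": [30.0, 40.0, np.nan, 2.0, np.nan, 20.0, np.nan],
})

# Median age of each title group, aligned back to the original rows.
group_median = toy_df.groupby("Title")["Age"].transform("median")

# Fill only the missing entries; observed ages are left untouched.
toy_df["Age_imputed"] = toy_df["Age"].fillna(group_median)
print(toy_df)
```

A title group whose ages are all missing would stay missing under this approach, so a fallback such as the overall median may still be needed.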

In [10]:
train_df[train_df['Embarked'].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,
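
With only two missing entries, one simple option for `Embarked` is to fill them with the most frequent port (the `describe` output earlier shows `S` for 644 of the 889 known values). The snippet below is a sketch of that option on a toy column, not necessarily the approach used later.

```python
import numpy as np
import pandas as pd

# Toy column standing in for train_df['Embarked'] (illustrative values).
toy_df = pd.DataFrame({"Embarked": ["S", "C", "S", np.nan, "Q", "S", np.nan]})

# mode() returns a Series because ties are possible, so take the first entry.
most_common_port = toy_df["Embarked"].mode().iloc[0]

toy_df["Embarked"] = toy_df["Embarked"].fillna(most_common_port)
print(toy_df["Embarked"].value_counts())
```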


Above I've shown some very basic information about the data through some simple `pandas` EDA functions. Let's see what useful information we can extract to set us up for a successful solution going ahead.

1. The ***PassengerId*** variable is the unique numerical identifier for each passenger. Since each passenger has a unique value, there is no need to include this variable in the study, so it will be **dropped**.

2. The ***Survived*** variable describes whether or not the passenger survived and is our **target** class. The passenger either survived (1) or did not survive (0), therefore we have a binary classification problem.

3. The ***Pclass*** variable describes the socio-economic class of each passenger's ticket. It can take three values, `upper class` (1), `middle class` (2), and `lower class` (3). Since the values have a hierarchical order (lower -> middle -> upper), this will be treated as an **ordinal categorical** variable.

4. The ***Name*** variable contains the names of each passenger along with titles, clarifiers, and other monikers. Since most names and monikers will be unique to each individual, they will not be very valuable to the study. The titles, however, may contain other valuable socio-economic information such as level of education or profession. As such, a new column named **Title** will be created with the title extracted from each individual's name and treated as a categorical variable. If there is no title in the name, it will be set to `no_title`.

5. The ***Sex*** variable describes the gender of each passenger, either `male` or `female` (please note that this is historical data and, as such, is limited to two genders). Since there are only two possible values in this case, it will be treated as a **nominal** feature.

6. The ***Age*** variable contains the age of each passenger and can be seen as either a **continuous** or **discrete** feature depending on how it is used. Here, age is given as a single **discrete** variable. However, since there could be a wide spread of ages on board, it might be more valuable to bin the ages into groups and work with age ranges instead. This will be explored in a later section.

7. The ***SibSp*** and ***Parch*** variables describe the number of siblings/spouses and parents/children, respectively, for each passenger and can be described as **discrete** variables.

8. The ***Fare*** variable describes the fare paid by each passenger and is a **continuous** variable.

9. The ***Cabin*** variable describes the cabin that each passenger was assigned to. This in itself is a useful variable; however, as shown in the missing-value counts, 77% of its entries are missing, which will create issues within the study. Thus it is better to **drop** it.

10. The ***Embarked*** variable describes the port from which each passenger boarded the Titanic and is a **nominal** feature.
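
Several of the decisions in the list above (items 1, 3, 6, and 9) can be sketched in a few lines of `pandas`. This is an illustration under assumptions rather than the final preprocessing: the toy frame only mimics the column names of `train_df`, and the age bin edges, the labels, and the `AgeGroup` column name are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy rows with the same column names as train_df (values are illustrative).
toy_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Pclass": [3, 1, 2, 3],
    "Age": [22.0, 38.0, 4.0, 61.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Items 1 and 9: drop the identifier and the mostly-missing Cabin column.
prepared = toy_df.drop(columns=["PassengerId", "Cabin"])

# Item 3: ordered categorical with lower (3) < middle (2) < upper (1).
prepared["Pclass"] = pd.Categorical(
    prepared["Pclass"], categories=[3, 2, 1], ordered=True
)

# Item 6: bin ages into ranges; the edges and labels are assumptions.
age_bins = [0, 12, 18, 60, np.inf]
age_labels = ["child", "teen", "adult", "senior"]
prepared["AgeGroup"] = pd.cut(prepared["Age"], bins=age_bins, labels=age_labels)

print(prepared)
```

Listing the categories as `[3, 2, 1]` makes the ordering explicit, so comparisons and sorting follow class rank rather than the numeric code.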