# Data Description

This notebook is an initial exploration of the data to determine what preprocessing is required before building and testing the models.

# Imports

In [1]:
import pandas as pd

# Basic EDA

Here is what each feature in the dataset represents:

![data_features](img/data_features.png)

In [10]:
# Import Data
train_df = pd.read_csv('../data/train.csv')
test_df = pd.read_csv('../data/test.csv')

train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [11]:
train_df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [6]:
train_df.shape

(891, 12)

In [8]:
train_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [9]:
train_df.describe(include=["object", "bool"])

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Braund, Mr. Owen Harris",male,347082,B96 B98,S
freq,1,577,7,4,644


Above I've shown some very basic information about the data through some simple `pandas` EDA functions. Let's see what useful information we can extract to set us up for a successful solution going ahead.

1. The ***PassengerId*** variable is the unique numerical identifier for each passenger. Since each passenger has a unique value, it carries no predictive information and will be **dropped**.

2. The ***Survived*** variable describes whether or not the passenger survived and is our **target** class. The passenger either survived (1) or did not survive (0), therefore we have a binary classification problem.

3. The ***Pclass*** variable describes the socio-economic class of each passenger's ticket. It can take three values: `upper class` (1), `middle class` (2), and `lower class` (3). Since there is a hierarchical order (lower -> middle -> upper), this will be treated as an **ordinal categorical** variable.

4. The ***Name*** variable contains the name of each passenger along with titles, clarifiers, and other monikers. Since most names and monikers are unique to each individual (all 891 names are distinct), they will not be very valuable to the study. The titles, however, may carry other valuable socio-economic information such as level of education or profession. As such, a new column named **Title** will be created from the title extracted from each individual's name and treated as a categorical variable. If there is no title in the name, it will be valued as `no_title`.

5. The ***Sex*** variable describes the gender of each passenger, either `male` or `female` (please note that this is historical information, as such it is only limited to two genders). Since there are only two possible values in this case, it will be treated as a **nominal** feature.

6. The ***Age*** variable contains the age of each passenger and can be seen as either a **continuous** or **discrete** feature depending on how it is used. Here, age is given as a single **discrete** v

Since there could be a wide spread of ages on board, it might be more valuable to bin the age values into groups instead. This will be explored in a later section.

7. The ***SibSp*** and ***Parch*** variables describe the number of siblings/spouses and parents/children, respectively, for each passenger.
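The **Title** extraction described in item 4 could be sketched as follows. This is a minimal sketch, not the final implementation: it assumes every title sits between the comma after the surname and the first period, which holds for the rows shown above but has not been checked against the full dataset. The last name in the sample (`Doe, John`) is a hypothetical entry added only to show the `no_title` fallback.

```python
import pandas as pd

# Names copied from the head()/tail() output above, plus one hypothetical
# entry without a title ("Doe, John") to exercise the fallback.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Th...",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
    "Doe, John",  # hypothetical, not in the dataset
])

# Assumed pattern: "<surname>, <title>. <rest of name>".
# Capture the text between the first comma and the following period.
titles = (
    names.str.extract(r",\s*([^.,]+)\.", expand=False)
    .str.strip()
    .fillna("no_title")  # names with no match get the placeholder value
)

print(titles.tolist())
```

Rare titles would likely need to be grouped before encoding, since a category seen only once or twice gives a model very little to learn from.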

Let's look at the kind of preprocessing that would be required for each feature:

1. **survival**

Since this is our target column and it is already in a binary format, nothing will need to be done.

2. **pclass** 

This is a categorical feature that is already numerically encoded, so nothing will need to be done to this column.

3. **sex**

This is a categorical feature that will need to be encoded. Since there are only two possible categories, we can convert it into a binary feature (0=male, 1=female).
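A minimal sketch of this binary encoding, using `Series.map` on a small sample of the `Sex` values shown earlier. The frame name `sample_df` is only illustrative.

```python
import pandas as pd

# Sex values from the first five rows of train_df.head()
sample_df = pd.DataFrame({"Sex": ["male", "female", "female", "female", "male"]})

# 0 = male, 1 = female, as described above. Any value outside the mapping
# would become NaN, which makes unexpected categories easy to spot.
sample_df["Sex"] = sample_df["Sex"].map({"male": 0, "female": 1})

print(sample_df["Sex"].tolist())
```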

4. **age**

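This is a numerical column; however, using each individual age may cause the algorithm to become overly specific to certain ages. The better route would be to bin the ages into groups. Note that only 714 of the 891 rows have a non-null age, so the missing values will also need to be handled.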
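One possible way to bin ages is `pd.cut` with explicit edges. The bin edges and group labels below are assumptions chosen for illustration, not values decided by this study, and missing ages are simply left as missing here.

```python
import numpy as np
import pandas as pd

# Ages taken from the outputs above: four rows from head(), one missing
# value, and the min (0.42) and max (80.0) reported by describe().
ages = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0, np.nan, 0.42, 80.0]})

# Hypothetical bin edges and labels; intervals are right-inclusive,
# e.g. (18, 35] maps to "young_adult".
bins = [0, 12, 18, 35, 60, np.inf]
labels = ["child", "teen", "young_adult", "adult", "senior"]

ages["AgeGroup"] = pd.cut(ages["Age"], bins=bins, labels=labels)

print(ages)
```

Because `pd.cut` leaves missing ages as `NaN`, the imputation strategy can be decided separately from the binning.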

5. **sibsp**

This is already an integer column, so no processing is needed for this feature.

6. **parch**

This is already an integer column, so no processing is needed for this feature.

7. **ticket**

This column holds the ticket number of each passenger. It is close to a per-passenger identifier (681 distinct values across 891 rows, with the most common ticket shared by only 7 passengers), so it is better to drop this column.
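A quick way to check how identifier-like a column is before dropping it is to compare its number of distinct values to the number of rows. The sketch below uses only the five rows from `train_df.head()`, so the ratio it prints is not representative of the full dataset. The name `sample_df` is illustrative.

```python
import pandas as pd

# PassengerId and Ticket values from train_df.head()
sample_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Share of distinct values per column; a value near 1.0 means the column
# behaves almost like a row identifier.
unique_ratio = sample_df.nunique() / len(sample_df)
print(unique_ratio)

# Drop the identifier-like columns, keeping everything else.
reduced_df = sample_df.drop(columns=["PassengerId", "Ticket"])
print(reduced_df.columns.tolist())
```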

8. **fare**

This is a numerical column; however, a fare can be unique to a single passenger or to a small group of passengers. It makes more sense to bin the fares into groups instead.
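Because fares are heavily right-skewed (median 14.45 versus a maximum of 512.33 in the summary above), quantile-based bins may be a better fit than equal-width ones. The sketch below uses `pd.qcut` with four bins. The number of bins and the labels are assumptions for illustration only.

```python
import pandas as pd

# Fares taken from the head() and tail() outputs above
fares = pd.DataFrame(
    {"Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 13.0, 30.0, 23.45]}
)

# Hypothetical labels; q=4 splits the values into quartiles so that each
# group holds roughly the same number of passengers.
labels = ["low", "mid_low", "mid_high", "high"]
fares["FareGroup"] = pd.qcut(fares["Fare"], q=4, labels=labels)

print(fares)
```

On the full dataset, many passengers share the same fare, so `qcut` may need `duplicates="drop"` if two quantile edges coincide.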