# Exploratory Data Analysis

Common steps in EDAs:
- <u>Dataset shape:</u> number of tables, rows, columns, data types, size in bytes.
- <u>Definition of features:</u> meaning of the features and notes about them.
- <u>Data model:</u> relationships between tables (if the dataset contains more than one table).
- <u>Analysis of missing values:</u> number of features with missing values, number of rows with missing values, number of missing values per feature, percentage of missing values, average missing values per row, average missing values per column, etc.
- <u>Measures of central tendency:</u> mean, median, mode.
- <u>Measures of spread and shape:</u> variance, standard deviation, skewness, kurtosis.
- <u>Distributions:</u> distributions of the data, quantiles, histograms, probability distributions.
- <u>Analysis of outliers:</u> anomalies, incorrect data.
- <u>Relationship between variables:</u> linear relationships, non-linear relationships, correlation.
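
Several of the steps listed above map onto single pandas calls. The following is a minimal sketch on a small made-up numeric series; the values are illustrative and do not come from any dataset in this notebook, and `summarize` is a hypothetical helper name.

```python
import pandas as pd


def summarize(values: pd.Series) -> pd.Series:
    """Return central-tendency, spread and shape measures for a numeric series."""
    return pd.Series({
        "mean": values.mean(),
        "median": values.median(),
        "mode": values.mode().iloc[0],   # first mode if there are several
        "variance": values.var(),        # sample variance (ddof=1)
        "std": values.std(),             # sample standard deviation (ddof=1)
        "skewness": values.skew(),
        "kurtosis": values.kurt(),       # excess kurtosis (0 for a normal distribution)
        "q25": values.quantile(0.25),
        "q75": values.quantile(0.75),
    })


# Illustrative values only; 12.0 plays the role of an outlier
sample = pd.Series([1.0, 2.0, 2.0, 3.0, 12.0], name="x")
summary = summarize(sample)
print(summary)
```

The single large value pulls the mean well above the median, which is the kind of discrepancy the outlier and distribution steps are meant to catch.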

In [1]:
import os
import pandas as pd

In [2]:
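# Root folder of the datasets, read from the DATASETS_DIR environment variable
datasets_path = os.getenv('DATASETS_DIR')
if datasets_path is None:
    raise EnvironmentError('Set DATASETS_DIR to the folder that contains the datasets')
# os.listdir(datasets_path)
# Titanic training set: https://www.kaggle.com/c/titanic/data
dataset_path = os.path.join(datasets_path, 'titanic', 'train.csv')
df = pd.read_csv(dataset_path, index_col='PassengerId')
df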

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [3]:
nrows, ncols = df.shape
print("Number of rows: {}\nNumber of columns: {}".format(nrows, ncols))

Number of rows: 891
Number of columns: 11


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


- There are eleven features, of which four are integers (Survived, Pclass, SibSp and Parch), two are floats (Age and Fare), and five are strings stored as `object` (Name, Sex, Ticket, Cabin and Embarked).

- Three of the eleven features have missing values: Age, Cabin and Embarked.
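
The counts above were read from the `df.info()` output. They can also be derived programmatically, which scales better to wider tables. The following is a minimal sketch that uses a few columns from the first five rows shown earlier so that it runs on its own; `sample` and the list names are illustrative.

```python
import numpy as np
import pandas as pd

# A few columns from the first five rows displayed earlier
sample = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 1, 0],
        "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
        "Sex": ["male", "female", "female", "female", "male"],
        "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="PassengerId"),
)

# Split the columns by data type
integer_cols = sample.select_dtypes(include=["int64"]).columns.tolist()
float_cols = sample.select_dtypes(include=["float64"]).columns.tolist()
other_cols = sample.select_dtypes(exclude=["number"]).columns.tolist()

# Columns that contain at least one missing value
cols_with_missing = sample.columns[sample.isna().any()].tolist()

print("integers:", integer_cols)
print("floats:", float_cols)
print("non-numeric:", other_cols)
print("with missing values:", cols_with_missing)
```

On the full `df`, the same four expressions should reproduce the grouping described in the two bullets above.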

In [5]:
df.describe(include=["object"])

,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Lester, Mr. James",male,347082,C23 C25 C27,S
freq,1,577,7,4,644


## Feature definitions

- Survived: Survival. 0:False, 1:True
- Pclass: ticket class (a lower number means a more expensive class). 1:1st, 2:2nd, 3:3rd
- Name: name of the passenger
- Sex: sex of the passenger
- Age: age of the passenger in years
- SibSp: number of siblings and spouses aboard the Titanic
- Parch: number of parents and children aboard the Titanic
- Ticket: ticket number
- Fare: passenger fare
- Cabin: cabin number
- Embarked: port of embarkation. C:Cherbourg, Q:Queenstown, S:Southampton

### Feature notes

- **Pclass** is a proxy for socio-economic status (SES): 1st:Upper; 2nd:Middle; 3rd:Lower.
- **Age** is fractional if less than 1. If the age is estimated, it is in the form xx.5.
- **SibSp**: the dataset defines family relations in this way. Sibling means brother, sister, stepbrother or stepsister; spouse means husband or wife (mistresses and fiancés were ignored).
- **Parch**: the dataset defines family relations in this way. Parent means mother or father; child means daughter, son, stepdaughter or stepson. Some children travelled only with a nanny, therefore Parch equals 0 for them.
- **Name** includes name, surname and titles such as Mr., Mrs., Ms., and Miss. American English uses a dot after the title whereas British English does not. In the US at the time of the Titanic (1912), the titles meant:
    - **Mr.** (mister): an adult man, regardless of whether he was married or not.
    - **Miss**: a young unmarried woman.
    - **Mrs.**: usually a married woman.
    - **Ms.**: a woman in general.
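
Since the title is embedded in the Name feature, it can be pulled out with a regular expression. The following is a minimal sketch, assuming every name follows the pattern `<surname>, <title>. <given names>` seen in the rows displayed earlier; the `Title` column name is an illustrative choice, and names that deviate from the pattern would yield missing values.

```python
import pandas as pd

# Names copied from rows displayed earlier in this notebook
names = pd.Series(
    [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Montvila, Rev. Juozas",
    ],
    name="Name",
)

# Capture the text between the first comma and the following dot
titles = names.str.extract(r",\s*([^.,]+)\.", expand=False).str.strip()
titles.name = "Title"
print(titles.value_counts())
```

On the full dataset the equivalent call would be `df["Name"].str.extract(...)`, and the resulting value counts would show which titles beyond the four listed above actually occur.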

## Analysis of missing values

In [15]:
missing_values = df.isnull().sum()
missing_values

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [21]:
missing_values_percentage = 100*missing_values/nrows
missing_values_percentage

Survived     0.000000
Pclass       0.000000
Name         0.000000
Sex          0.000000
Age         19.865320
SibSp        0.000000
Parch        0.000000
Ticket       0.000000
Fare         0.000000
Cabin       77.104377
Embarked     0.224467
dtype: float64

There are missing values in three features out of the eleven:
<table>
  <tr>
    <th>Feature</th>
    <th>Missing values</th>
    <th>Percentage missing</th>
  </tr>
  <tr>
    <td>Age</td>
    <td>177</td>
    <td>19.87%</td>
  </tr>
  <tr>
    <td>Cabin</td>
    <td>687</td>
    <td>77.10%</td>
  </tr>
  <tr>
    <td>Embarked</td>
    <td>2</td>
    <td>0.22%</td>
  </tr>
</table>
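
The summary table above can be built from the DataFrame instead of being typed by hand. The following is a minimal sketch; `missing_summary` is a hypothetical helper, and the sample holds four of the rows displayed earlier (restricted to three columns) so that the block runs on its own.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame(
        {
            "missing": counts,
            "percentage": (100 * counts / len(frame)).round(2),
        }
    )
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Rows 1, 2, 3 and 889 from the table displayed earlier
sample = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0, np.nan],
        "Cabin": [np.nan, "C85", np.nan, np.nan],
        "Embarked": ["S", "C", "S", "S"],
    },
    index=pd.Index([1, 2, 3, 889], name="PassengerId"),
)

summary = missing_summary(sample)
# Row-level measures from the list of common EDA steps
rows_with_missing = int(sample.isna().any(axis=1).sum())
avg_missing_per_row = float(sample.isna().sum(axis=1).mean())

print(summary)
print("Rows with at least one missing value:", rows_with_missing)
print("Average missing values per row:", avg_missing_per_row)
```

Calling `missing_summary(df)` on the full dataset should give the three rows of the table above, sorted by the number of missing values.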

Why are there missing values in those features? There are only two missing values in the Embarked feature; maybe they correspond to stowaways?
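
One way to start answering this question is to look at the affected rows directly and to check whether missingness in one feature depends on another feature. The following is a minimal sketch on the nine rows displayed earlier, which is far too small a sample to support any conclusion; `rows_missing` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# The nine rows displayed earlier, restricted to four columns
sample = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 1, 3, 2, 1, 3, 1],
        "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 27.0, 19.0, np.nan, 26.0],
        "Cabin": [np.nan, "C85", np.nan, "C123", np.nan, np.nan, "B42", np.nan, "C148"],
        "Embarked": ["S", "C", "S", "S", "S", "S", "S", "S", "C"],
    },
    index=pd.Index([1, 2, 3, 4, 5, 887, 888, 889, 890], name="PassengerId"),
)


def rows_missing(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the rows in which `column` is missing."""
    return frame[frame[column].isna()]


# Inspect the rows with a missing Age; on the full data, passing
# "Embarked" would list the two passengers in question.
missing_age_rows = rows_missing(sample, "Age")
print(missing_age_rows)

# Share of missing Cabin values within each ticket class
cabin_missing_by_class = sample["Cabin"].isna().groupby(sample["Pclass"]).mean()
print(cabin_missing_by_class)
```

If the share of missing values differs strongly between groups on the full dataset, the values are unlikely to be missing at random, which matters when choosing how to impute or drop them.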