# Data Visualization for Titanic open dataset

This project takes an open-source Titanic dataset and attempts to manipulate it and extract information from it with Python and scikit-learn.

The end goal is to build a confusion matrix for the Titanic dataset.
A confusion matrix is a table used in machine learning to evaluate the classification performance of a model. It compares the predictions to the real values and allows you to see where the model is right or wrong.

![](../assets/confusion_matrix.png)
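
To make the idea concrete, here is a minimal sketch that counts the four cells of a binary confusion matrix by hand. The label lists are made-up toy values, not results from the Titanic data, and `count_confusion` is a hypothetical helper name.

```python
def count_confusion(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels (0 or 1)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        # Row index = real value, column index = predicted value
        matrix[actual][predicted] += 1
    return matrix


# Toy labels (assumed for illustration): 1 = survived, 0 = did not survive
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

matrix = count_confusion(y_true, y_pred)
print(matrix)  # [[2, 1], [1, 2]]
```

Rows hold the real values and columns the predictions, which is also the layout returned by scikit-learn's `confusion_matrix(y_true, y_pred)`.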

Be sure to select the Python environment before running the code.

In [2]:
%pip install pandas numpy scikit-learn

Note: you may need to restart the kernel to use updated packages.


# Setting up the project
Here we import the libraries and count the null values in the data. This gives a first glance at what we need to change in the dataset, for two reasons:
1. The data has to be in a numeric representation (not `object`).
2. Since we manipulate the data, null values will not serve us in visualizations.
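
As a small illustration of those two checks, the sketch below lists the non-numeric columns and the columns that contain nulls. The `toy` frame and its values are assumptions made up for the example; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (assumed values) mimicking a few Titanic columns
toy = pd.DataFrame(
    {
        "Sex": ["male", "female", "female"],
        "Age": [22.0, np.nan, 35.0],
        "Pclass": [3, 1, 2],
    }
)

# Reason 1: columns that are not numeric need an encoding first
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()

# Reason 2: columns that still contain at least one null value
null_counts = toy.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0].index.tolist()

print(non_numeric_columns)  # ['Sex']
print(columns_with_nulls)   # ['Age']
```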

In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
print("Working on Titanic dataset")
data = pd.read_csv("../assets/titanic/titanic.csv")
data.info()
print(data.isnull().sum())

Working on Titanic dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

The output shows:
- `327` missing values for `Cabin`
- `86` missing values for `Age`
- `1` missing value for `Fare` (the ticket price)

These are the gaps we need to deal with.
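
To put those counts in proportion, this short sketch turns them into percentages of the 418 rows. The numbers are copied from the output above; the variable names are only illustrative.

```python
total_rows = 418  # from the RangeIndex line of data.info()
missing = {"Cabin": 327, "Age": 86, "Fare": 1}  # from data.isnull().sum()

# Share of missing values per column, in percent, rounded to one decimal
missing_share = {
    column: round(100 * count / total_rows, 1)
    for column, count in missing.items()
}
print(missing_share)  # {'Cabin': 78.2, 'Age': 20.6, 'Fare': 0.2}
```

Roughly 78% of `Cabin` is empty, which is one reason dropping that column is a reasonable choice, while `Age` is complete enough to be filled in.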

# Data Manipulation
Since the data is not ready for visualization, we need to manipulate some columns and values to clean the data frame.

## Fill Missing Ages
First we create a function that fills the missing `Age` values.
It loops over the unique values of the `Pclass` column and, for each class not yet in our dictionary `age_fill_map`, stores the median age of that class. Each missing age is then replaced by the median of the passenger's class.

In [5]:
def fill_missing_ages(df: pd.DataFrame) -> pd.DataFrame:
    """
    filling missing ages in dataFrame (df)
    """
    age_fill_map = {}

    for pclass in df["Pclass"].unique():
        if pclass not in age_fill_map:
            age_fill_map[pclass] = df[df["Pclass"] == pclass]["Age"].median()

    # Apply the class median where row["Age"] is null, otherwise keep the original age
    df["Age"] = df.apply(
        lambda row: age_fill_map[row["Pclass"]]
        if pd.isnull(row["Age"])
        else row["Age"],
        axis=1,
    )
    # df["Age"].fillna(df["Pclass"].map(age_fill_map), inplace=True)
    print(f"Age fill map: {age_fill_map}")

    return df
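
The same per-class median fill can also be written without an explicit loop. The sketch below is one possible vectorized variant, not the notebook's own method; the `toy` frame and its ages are assumed values chosen so the medians are easy to check.

```python
import numpy as np
import pandas as pd

# Toy data (assumed values): two classes, one missing age in each
toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 1, 3, 3, 3],
        "Age": [40.0, np.nan, 44.0, 20.0, 24.0, np.nan],
    }
)

# Median age per class, aligned row by row with the original frame
class_median = toy.groupby("Pclass")["Age"].transform("median")

# Keep existing ages and use the class median only where Age is null
toy["Age"] = toy["Age"].fillna(class_median)
print(toy["Age"].tolist())  # [40.0, 42.0, 44.0, 20.0, 24.0, 22.0]
```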

## Preprocessing Data
How we preprocess the data with the DataFrame object:
- Drop the columns `PassengerId`, `Name`, `Ticket`, `Cabin` from the DataFrame because these values won't help us see who survived the Titanic catastrophe.
- Drop the `Embarked` column. (Filling its missing values with `S` is left commented out in the code; this dataset has no missing `Embarked` values.)
- Execute the `fill_missing_ages()` function created earlier.
- Convert the gender to a binary representation (I hate this but hey, the machine reads 1 and 0).
- Add a new column `FamilySize`, the sum of the `SibSp` and `Parch` columns (siblings/spouses and parents/children aboard).
- Add a new column `IsAlone`: if `FamilySize` is 0, the passenger is alone.
- Add a new column `FareBin` that groups the `Fare` values into 4 quantile-based groups.
- Add a new column `AgeBin` that groups the `Age` values using the bin edges 0, 12, 20, 40, 60 and `np.inf` (infinity, so the last bin has no upper limit).

Finally, write the result to `../assets/titanic/data_preprocessed.csv`.
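
The encoding and family-related steps from the list above can be tried on a few rows first. This is a minimal sketch on an assumed `toy` frame with made-up passengers; it uses the same column names and expressions as the steps described.

```python
import numpy as np
import pandas as pd

# Three made-up passengers (assumed values)
toy = pd.DataFrame(
    {
        "Sex": ["male", "female", "male"],
        "SibSp": [1, 0, 0],
        "Parch": [2, 0, 1],
    }
)

# Binary encoding of the gender column
toy["Sex"] = toy["Sex"].map({"male": 1, "female": 0})

# Relatives aboard: siblings/spouses plus parents/children
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]

# 1 when the passenger travels without relatives, else 0
toy["IsAlone"] = np.where(toy["FamilySize"] == 0, 1, 0)

print(toy[["Sex", "FamilySize", "IsAlone"]])
```

Only the second passenger has a `FamilySize` of 0, so only that row gets `IsAlone` set to 1.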

In [None]:
def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Drop unused columns, fill null values and convert in number type
    """
    df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"], inplace=True)

    # Alternative (disabled): fill missing values with "S" for Southampton,
    # the most common embarkation point in the data
    # df["Embarked"].fillna("S", inplace=True)
    # Instead, the Embarked column is dropped entirely
    df.drop(columns=["Embarked"], inplace=True)

    fill_missing_ages(df)

    # Convert Gender for model
    df["Sex"] = df["Sex"].map({"male": 1, "female": 0})

    # Feature engineering
    df["FamilySize"] = df["SibSp"] + df["Parch"]  # siblings/spouses + parents/children
    df["IsAlone"] = np.where(
        df["FamilySize"] == 0, 1, 0
    )  # 1 if the passenger has no family aboard, else 0
    df["FareBin"] = pd.qcut(
        df["Fare"], 4, labels=False
    )  # categorization for ticket prices
    df["AgeBin"] = pd.cut(
        df["Age"], bins=[0, 12, 20, 40, 60, np.inf], labels=False
    )  # bins for ranged age of passengers

    with open("../assets/titanic/data_preprocessed.csv", "w") as f:
        df.to_csv(f, index=False)

    return df
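
To see what the two binning calls in `preprocess_data` produce, the sketch below runs them on small hand-picked series. The fares and ages are assumed toy values; the `pd.qcut` and `pd.cut` arguments are the same as in the function above.

```python
import numpy as np
import pandas as pd

# Toy values (assumed); the last fare is missing on purpose
fares = pd.Series([7, 8, 10, 15, 30, 50, 80, 200, np.nan])
ages = pd.Series([5, 12, 18, 30, 50, 70])

# Quantile-based bins: each of the 4 groups gets about the same number of rows
fare_bin = pd.qcut(fares, 4, labels=False)

# Fixed edges: (0, 12], (12, 20], (20, 40], (40, 60], (60, inf)
age_bin = pd.cut(ages, bins=[0, 12, 20, 40, 60, np.inf], labels=False)

print(fare_bin.tolist())  # four bin codes 0..3, then nan for the missing fare
print(age_bin.tolist())   # [0, 0, 1, 2, 3, 4]
```

A missing fare stays missing after `pd.qcut`. As far as the code shown here goes, the single null `Fare` in the dataset is never filled, so its `FareBin` would also be null; filling it beforehand (for example with the median fare) is one option.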

# Run the preprocessing

In [None]:
# Preprocessing data
print("Preprocessing data...")
data = pd.read_csv("../assets/titanic/titanic.csv")
preprocessed_data = preprocess_data(data)

# Create Features / Target Variables (Make Flashcards)
X = preprocessed_data.drop(columns=["Survived"])
y = preprocessed_data["Survived"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

Preprocessing data...
Age fill map: {np.int64(3): np.float64(24.0), np.int64(2): np.float64(26.5), np.int64(1): np.float64(42.0)}
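
For a sense of how many rows end up in each part of the split, the sketch below applies the same `train_test_split` arguments to placeholder arrays with 418 rows. `X_demo` and `y_demo` are assumptions that stand in for the real features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays (assumed) with the same number of rows as the dataset
X_demo = np.arange(418).reshape(-1, 1)
y_demo = np.zeros(418, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
print(len(X_tr), len(X_te))  # 313 105
```

With `test_size=0.25`, a quarter of the rows (rounded up) is held out for testing and the rest is used for training; fixing `random_state` makes the shuffle reproducible.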
