# 🚢 Titanic Estimator

6-step ML framework <a href="https://www.mrdbourke.com/a-6-step-field-guide-for-building-machine-learning-projects/">here</a>:

1️⃣ **Problem Definition:** What problem are we trying to solve? 

2️⃣ **Data:** What data do we have?

3️⃣ **Evaluation:** What defines success? What metric will we use to evaluate our model? 

4️⃣ **Features:** What features should we use?

5️⃣ **Model:** What model will weuse to solve the problem?

6️⃣ **Experimentation:** What have we tried and what else can we try?


### 1. Problem Definition

Predicting the surviving passengers of the Titanic shipwreck based on the features of the passengers.

### 2. Data

Data comes from <a href="https://www.kaggle.com/competitions/titanic/overview">Kaggle</a>. See the data dictionary for detailed descriptions of the features.

### 3. Evaluation

Accuracy of the model, evaluated when submitting to the Kaggle competition.

### 4. Features

In [100]:
# Data exploration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# Machine learning
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [101]:
# Set random seed
seed = 99
np.random.seed(seed)

In [102]:
# Load data
train_df = pd.read_csv("data/train.csv")
test_df = pd.read_csv("data/test.csv")

In [103]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [104]:
# Split train_data into features (X) and labels (y)
X = train_df.drop("Survived", axis=1)
y = train_df["Survived"]

# Split features and labels into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=seed)

In [105]:
# Data shapes
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape

((712, 11), (179, 11), (712,), (179,))

In [106]:
# Check missing labels
y_train.isna().sum(), y_valid.isna().sum()

(np.int64(0), np.int64(0))

In [107]:
# Check missing features
print("X_train.isna().sum()", X_train.isna().sum())
print("X_valid.isna().sum()", X_valid.isna().sum())

X_train.isna().sum() PassengerId      0
Pclass           0
Name             0
Sex              0
Age            144
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          547
Embarked         1
dtype: int64
X_valid.isna().sum() PassengerId      0
Pclass           0
Name             0
Sex              0
Age             33
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          140
Embarked         1
dtype: int64


In [108]:
# Check for duplicate rows
print("X_train.duplicated().sum()", X_train.duplicated(subset=X_train.columns.difference(['PassengerId'])).sum())
print("X_valid.duplicated().sum()", X_valid.duplicated(subset=X_valid.columns.difference(['PassengerId'])).sum())

X_train.duplicated().sum() 0
X_valid.duplicated().sum() 0


In [109]:
# Impute missing values in categorical features with 'Unknown' and encode them
categorical_features = ["Sex", "Cabin", "Embarked"]

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])

In [110]:
# Impute missing values in 'Age' column with the median age
numeric_features = ["Age"]

numeric_transformer = Pipeline(steps=[
    ("missing_indicator", SimpleImputer(strategy="constant", add_indicator=True)),
    ("imputer", SimpleImputer(strategy="median"))])

In [111]:
# Create a preprocessor that drops Name and Ticket columns and then applies the transformers
preprocessor = ColumnTransformer(
    transformers=[
        ("drop_cols", "drop", ["Name", "Ticket"]),
        ("cat", categorical_transformer, categorical_features),
        ("num", numeric_transformer, numeric_features)],
    remainder="passthrough",
    force_int_remainder_cols=False)

### 5. Model

In [112]:
# Instantiate the model and fit it to the training data
model = Pipeline(steps=[("preprocessor", preprocessor),
                        ("model", RandomForestClassifier())])

model.fit(X_train, y_train)

In [113]:
# Evaluate model on training and validation data
print(f"Accuracy on training data: {model.score(X_train, y_train):.3f}")
print(f"Accuracy on validation data: {model.score(X_valid, y_valid):.3f}")

Accuracy on training data: 1.000
Accuracy on validation data: 0.827


### 6. Experimentation