### Structure
1. Understanding the problem
2. Exploratory Data Analysis (EDA) & visualization
3. Model training, tuning & evaluation
4. Upload

### 1. Understanding the problem

**Goal:** It is your job to predict if a passenger survived the sinking of the Titanic or not.
For each passenger in the test set, you must predict a 0 or 1 value for the Survived variable.

**Metric:**
Your score is the percentage of passengers you correctly predict. This is known as accuracy.
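
As a quick illustration of the metric, accuracy is the fraction of predictions that match the true labels. The arrays below are made-up values for illustration only, not competition data.

```python
import numpy as np

# Hypothetical labels and predictions, for illustration only
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Accuracy = number of correct predictions / total number of predictions
accuracy = (y_true == y_pred).mean()
print(f"Accuracy: {accuracy:.2f}")
```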

**Submission File Format:**
You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.

The file should have exactly 2 columns:
- PassengerId (sorted in any order)
- Survived (contains your binary predictions: 1 for survived, 0 for deceased)
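
Before uploading, it can help to check a submission against these rules. The helper below is a minimal sketch: `check_submission` is a hypothetical name, not part of any Kaggle tooling, and it only covers the rules listed above.

```python
import pandas as pd


def check_submission(df: pd.DataFrame, expected_rows: int = 418) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # Exactly the two required columns, no extras
    if len(df.columns) != 2 or set(df.columns) != {"PassengerId", "Survived"}:
        problems.append(f"unexpected columns: {list(df.columns)}")
    # Exactly the expected number of data rows (the header is not counted)
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    # Predictions must be binary
    if "Survived" in df.columns and not df["Survived"].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    return problems


# Tiny made-up example with 3 rows instead of 418
demo = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
print(check_submission(demo, expected_rows=3))
```

In the notebook, the same check could be run on the `output` DataFrame before it is written to disk.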

### 2. EDA

#### 2.1 Data loading & exploration

In [None]:
# Load libraries & datasets
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split


In [None]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [None]:
train.head()

In [None]:
train['Sex'].value_counts()

In [None]:
# survival rate of women vs men
female_survival_rate = train.loc[train['Sex'] == 'female', 'Survived'].mean().round(4)
male_survival_rate = train.loc[train['Sex'] == 'male', 'Survived'].mean().round(4)
print(f"Female survival rate: {female_survival_rate}")
print(f"Male survival rate: {male_survival_rate}")

In [None]:
# Plot a stacked age distribution histogram, split by survival status
plt.figure(figsize=(10, 6))
sns.histplot(data=train, x='Age', hue='Survived', multiple='stack', bins=30)
plt.title('Age Distribution by Survival Status')
plt.show()

In [None]:
# Plot a stacked fare distribution histogram, split by survival status, with a logarithmic y-axis
plt.figure(figsize=(10, 6))
plt.hist(
    [train.loc[train['Survived'] == 0, 'Fare'], train.loc[train['Survived'] == 1, 'Fare']],
    stacked=True,  # stack the two groups, as the comment and title describe
    color=['red', 'green'],
    bins=30,
    label=['Not Survived', 'Survived'],
)
plt.title('Stacked Fare Distribution by Survival')
plt.xlabel('Fare')
plt.ylabel('Number of Passengers')
plt.yscale('log')
plt.legend()
plt.show()

### 3. Model training, tuning & evaluation

In [None]:
# random forest classifier with more features
from sklearn.ensemble import RandomForestClassifier

y = train["Survived"]
features = ["Pclass", "Sex", "SibSp", "Parch", "Age", "Fare"]

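X = pd.get_dummies(train[features])
X_test = pd.get_dummies(test[features])

# Age (and one Fare value in the test set) has missing values; random forests in
# older scikit-learn versions raise an error on NaN, so fill with training-set medians
medians = train[["Age", "Fare"]].median()
X = X.fillna(medians)
X_test = X_test.fillna(medians)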

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': predictions})
output.to_csv('data/rf_submission.csv', index=False)
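
The cell above fits on the full training set, so it gives no estimate of how well the model generalizes before uploading. One common option is to hold out part of the training data with `train_test_split` (already imported above). The sketch below assumes a numeric feature matrix with no missing values; `holdout_accuracy`, the 80/20 split, and the synthetic demo data are illustrative assumptions, not results from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def holdout_accuracy(X, y, test_size=0.2, random_state=1):
    """Fit a random forest on a training split and score it on a held-out split."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=random_state)
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_val, clf.predict(X_val))


# Synthetic stand-in data so the sketch runs on its own;
# in the notebook you would pass the prepared X and y instead.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
y_demo = (X_demo["a"] + X_demo["b"] > 0).astype(int)
print(f"Hold-out accuracy: {holdout_accuracy(X_demo, y_demo):.3f}")
```

Stratifying on `y` keeps the survived/not-survived ratio similar in both splits, which matters when the classes are imbalanced.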

In [None]:
# Plot the first decision tree of the random forest (matplotlib is already imported as plt)
from sklearn.tree import plot_tree

plt.figure(figsize=(20,10))
plot_tree(model.estimators_[0], feature_names=list(X.columns), filled=True)
plt.show()