In [1]:
import pandas as pd
from metrics import evaluate_model

In [2]:
# Loading datasets
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')

# Reasoning

After abandoning this project for a long time, I wanted to finish it. Despite my notes, I was not sure what my thought process had been when I last worked on it, so I decided to restart the project, reusing my old code wherever I could but with a clearer understanding of what I wanted to do and how I would achieve it. I restarted on 5/15/24 and aim to finish by 5/27/24. The steps I intend to complete are:

- Baseline model using rule-based constraints
- Decision tree
- Random forest
- Cleaning and more thorough EDA
- Rerun decision tree and random forest models
- Feature engineering
- Rerun decision tree and random forest models
- Hyperparameter tuning
- Final decision tree and random forest models

I also intend to submit my models to Kaggle, aiming for at least 90% accuracy. I am unsure how many times I will submit, but hopefully at least once per data-improvement stage.

In [3]:
# Overall

overall_survival_rate = sum(train['Survived'])/len(train['Survived'])
print(f'The overall survival rate for the training set: {overall_survival_rate:.2%}')

The overall survival rate for the training set: 38.38%


In [4]:
# women vs men

women = train.loc[train['Sex'] == 'female']['Survived']
women_survival_rate = sum(women)/len(women)
print(f'The percentage of women that survived: {women_survival_rate:.2%}')

men = train.loc[train['Sex'] == 'male']['Survived']
men_survival_rate = sum(men)/len(men)
print(f'The percentage of men that survived: {men_survival_rate:.2%}')

The percentage of women that survived: 74.20%
The percentage of men that survived: 18.89%


In [13]:
# children (under 16) vs adults; rows with a missing Age fall into neither group

children = train.loc[train['Age'] < 16]['Survived']
children_survival_rate = sum(children)/len(children)
print(f'The percentage of child passengers that survived: {children_survival_rate:.2%}')

adults = train.loc[train['Age'] >= 16]['Survived']
adults_survival_rate = sum(adults)/len(adults)
print(f'The percentage of adult passengers that survived: {adults_survival_rate:.2%}')

The percentage of child passengers that survived: 59.04%
The percentage of adult passengers that survived: 38.19%


In [15]:
# Women and Children

women_and_children = train.loc[(train['Age'] < 16) | (train['Sex'] == 'female')]['Survived']
women_and_children_survival_rate = sum(women_and_children)/len(women_and_children)
print(f'The percentage of women and children that survived: {women_and_children_survival_rate:.2%}')


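# complement of the mask above; it also includes men whose Age is missing, because NaN < 16 is False
not_women_and_children = train.loc[~((train['Age'] < 16) | (train['Sex'] == 'female'))]['Survived']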
not_women_and_children_survival_rate = sum(not_women_and_children)/len(not_women_and_children)
print(f'The percentage of adult men that survived: {not_women_and_children_survival_rate:.2%}')

The percentage of women and children that survived: 71.75%
The percentage of adult men that survived: 16.39%
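
A note on missing ages: in pandas, any comparison against NaN evaluates to False, so a passenger without a recorded age is counted as neither a child nor an adult in the age split, yet still ends up in the "adult men" complement if male. The sketch below illustrates this on a small made-up frame (the `toy` data and the `age_known` filter are assumptions for illustration, not part of the analysis above).

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not the Kaggle file): one man has no recorded age.
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'female', 'male'],
    'Age': [40.0, np.nan, 30.0, 8.0],
})

is_child = toy['Age'] < 16    # NaN < 16 evaluates to False
is_adult = toy['Age'] >= 16   # NaN >= 16 also evaluates to False
age_known = toy['Age'].notna()

# The man with the unknown age is neither "child" nor "adult" ...
unclassified = int((~is_child & ~is_adult).sum())

# ... yet he lands in the complement of "women or children".
complement = ~(is_child | (toy['Sex'] == 'female'))

# Requiring a known age restricts the group to confirmed adult men.
adult_men_strict = complement & age_known

print(unclassified, int(complement.sum()), int(adult_men_strict.sum()))
```

Whether to treat unknown-age men as adults is a judgment call; the point is only that the default comparison makes that choice silently.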


In [16]:
# Predictions for Train Data
# vectorized masks; a missing Age compares as False, same as the row-by-row check
is_female = train['Sex'] == 'female'
is_child = train['Age'] < 16

# add columns into df
train['pred_women'] = is_female.astype(int)
train['pred_children'] = is_child.astype(int)
train['pred_women_and_children'] = (is_female | is_child).astype(int)

In [17]:
# Evaluate Model for Women Rule
evaluate_model(train['Survived'], train['pred_women'], 'Women')

Women Confusion Matrix:
 [[468  81]
 [109 233]]
Women Accuracy: 0.7868
Women Precision: 0.7420
Women Recall: 0.6813
Women F1 Score: 0.7104


In [19]:
# Evaluate Model for Children Rule
evaluate_model(train['Survived'], train['pred_children'], 'Children')

Children Confusion Matrix:
 [[515  34]
 [293  49]]
Children Accuracy: 0.6330
Children Precision: 0.5904
Children Recall: 0.1433
Children F1 Score: 0.2306


In [21]:
# Evaluate Model for Women and Children Rule
evaluate_model(train['Survived'], train['pred_women_and_children'], 'Women and Children')

Women and Children Confusion Matrix:
 [[449 100]
 [ 88 254]]
Women and Children Accuracy: 0.7890
Women and Children Precision: 0.7175
Women and Children Recall: 0.7427
Women and Children F1 Score: 0.7299


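Adding children to the rule is only a slight improvement over the women-only rule: accuracy rises from 0.7868 to 0.7890 and recall from 0.6813 to 0.7427, but precision drops from 0.7420 to 0.7175, which means more false positives (100 versus 81). Age on its own may not be the best indicator of survival.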
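
To make the comparison concrete, the metrics can be recomputed by hand from the confusion matrices printed above. This is a minimal sketch that assumes the matrices follow the scikit-learn layout `[[TN, FP], [FN, TP]]`; the helper name `metrics_from_confusion` is made up for illustration and is not the `evaluate_model` function imported from `metrics`.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 from the four confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # share of predicted survivors who actually survived
    recall = tp / (tp + fn)      # share of actual survivors the rule caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts copied from the outputs above; assumed layout is [[TN, FP], [FN, TP]].
women = metrics_from_confusion(468, 81, 109, 233)
women_and_children = metrics_from_confusion(449, 100, 88, 254)

for name, scores in [('Women', women), ('Women and Children', women_and_children)]:
    print(name, [round(s, 4) for s in scores])
```

Adding children moves 21 survivors from false negatives to true positives but also adds 19 false positives, which is why recall improves while precision falls.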

In [11]:
# creating predictions for test data
pred_test_list = ((test['Sex'] == 'female') | (test['Age'] < 16)).astype(int)

# putting the predictions in the correct format for Kaggle
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': pred_test_list})
submission.to_csv('./submissions/women_and_children_submission.csv', index=False)

The rule-based predictions for the adage "women and children first" scored 0.75837 on Kaggle. This score is not great, but it serves as a baseline that will hopefully improve as I continue my work.

In [49]:
# train test split

from sklearn.model_selection import train_test_split

# drop the target and the free-text columns, then one-hot encode the remaining categoricals
train_columns = train.drop(columns=['Survived', 'Name', 'Ticket', 'Cabin'])
train_data = pd.get_dummies(train_columns)
target = train['Survived']

X_train, X_test, y_train, y_test = train_test_split(train_data, target, test_size=0.2)
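
One thing to watch with this encoding: `pd.get_dummies` is applied to the training frame here and separately to the Kaggle test frame later, so the two can end up with different columns (a category missing from one file, or extra columns such as the `pred_*` rule columns if they are still in `train`). A possible safeguard, sketched on made-up frames, is to reindex the test features to the training columns; `toy_train` and `toy_test` are hypothetical.

```python
import pandas as pd

# Toy frames (hypothetical): the second one never sees Embarked == 'Q'.
toy_train = pd.DataFrame({'Sex': ['male', 'female', 'male'], 'Embarked': ['S', 'C', 'Q']})
toy_test = pd.DataFrame({'Sex': ['female', 'male'], 'Embarked': ['S', 'C']})

train_encoded = pd.get_dummies(toy_train)
test_encoded = pd.get_dummies(toy_test)

# Reindex to the training columns: missing dummies are added as 0,
# columns unknown to the training frame are dropped, and the order matches.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_encoded.columns))
print(list(test_aligned.columns))
```

With aligned columns, a model fitted on the training features receives the test features in the same order and with the same names.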

In [50]:
# decision tree

from sklearn.tree import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier()

dt_classifier.fit(X_train, y_train)

y_pred_dt = dt_classifier.predict(X_test)

evaluate_model(y_test, y_pred_dt, 'Decision Tree')

Decision Tree Confusion Matrix:
 [[90 28]
 [23 38]]
Decision Tree Accuracy: 0.7151
Decision Tree Precision: 0.5758
Decision Tree Recall: 0.6230
Decision Tree F1 Score: 0.5984


In [56]:
# random forest

from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier()

rf_classifier.fit(X_train, y_train)

y_pred_rf = rf_classifier.predict(X_test)

evaluate_model(y_test, y_pred_rf, 'Random Forest')

Random Forest Confusion Matrix:
 [[102  16]
 [ 19  42]]
Random Forest Accuracy: 0.8045
Random Forest Precision: 0.7241
Random Forest Recall: 0.6885
Random Forest F1 Score: 0.7059


In [52]:
# XGBoost

from xgboost import XGBClassifier

xgb_classifier = XGBClassifier()

xgb_classifier.fit(X_train, y_train)

y_pred_xgb = xgb_classifier.predict(X_test)

evaluate_model(y_test, y_pred_xgb, 'Gradient Boosting')

Gradient Boosting Confusion Matrix:
 [[95 23]
 [19 42]]
Gradient Boosting Accuracy: 0.7654
Gradient Boosting Precision: 0.6462
Gradient Boosting Recall: 0.6885
Gradient Boosting F1 Score: 0.6667
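
When comparing the three models, it may help to remember how small the hold-out split is. As a rough guide only, the sketch below computes the binomial standard error of each accuracy, assuming the hold-out rows are independent; the helper `accuracy_standard_error` is made up for illustration.

```python
import math

def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n independent examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

# Rows in the hold-out split, read off the confusion matrices above (118 + 61).
n_holdout = 118 + 61

results = [('Decision Tree', 0.7151), ('Random Forest', 0.8045), ('Gradient Boosting', 0.7654)]
for name, acc in results:
    se = accuracy_standard_error(acc, n_holdout)
    print(f'{name}: {acc:.4f} +/- {se:.4f}')
```

Each estimate carries an uncertainty of roughly three percentage points, so the gap between the random forest and gradient boosting on a single unseeded split could plausibly be noise.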


In [55]:
test_submission = test.drop(columns=['Name', 'Ticket', 'Cabin'])
test_submission = pd.get_dummies(test_submission)

simple_rf_predictions = rf_classifier.predict(test_submission)

simple_rf_submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': simple_rf_predictions})
simple_rf_submission.to_csv('./submissions/simple_rf_submission.csv', index=False)

Interestingly, the simple random forest model scored 0.75837, the same as my rule-based women-and-children model. I double-checked that the two submission files were different and that I had not uploaded the wrong one.
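
Two different files can still receive the same score if they get the same number of passengers right. One way to run the check described above is to join the two submissions on `PassengerId` and count the disagreements. The frames below are made-up stand-ins for the two CSV files, and `hypothetical_truth` is invented purely to show how equal scores can arise (Kaggle does not publish the test labels).

```python
import pandas as pd

# Toy submissions (hypothetical values) standing in for the two CSV files.
rule_based = pd.DataFrame({'PassengerId': [892, 893, 894, 895], 'Survived': [0, 1, 0, 1]})
forest = pd.DataFrame({'PassengerId': [892, 893, 894, 895], 'Survived': [1, 1, 0, 0]})

# Join on PassengerId so the comparison does not depend on row order.
merged = rule_based.merge(forest, on='PassengerId', suffixes=('_rule', '_rf'))
differs = merged['Survived_rule'] != merged['Survived_rf']
n_different = int(differs.sum())
different_ids = merged.loc[differs, 'PassengerId'].tolist()

# Invented labels: both submissions get three of four right, on different rows.
hypothetical_truth = pd.Series([0, 1, 0, 0])
rule_acc = float((merged['Survived_rule'] == hypothetical_truth).mean())
rf_acc = float((merged['Survived_rf'] == hypothetical_truth).mean())

print(f'{n_different} of {len(merged)} predictions differ: {different_ids}')
print(rule_acc, rf_acc)
```

On the real files, the same merge would show exactly which passengers the rule and the forest disagree on.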

Hopefully the score improves as I update the models.