Repo for the Kaggle "Titanic: machine learning from disaster"

📖

Problem definition
Overview
Workflow
Afterthoughts
Resources that helped me learn

Problem definition

Train a 🏷️ classification model to predict whether each passenger in the test set survived, based on the survival information in the training set. More details are described 👉 here.

Overview

Ah, the famous Titanic Kaggle competition. Honestly, I did not expect to have this much fun working through this project. In this classic classification problem, I got to try out a lot of algorithms, including naive Bayes, logistic regression, decision tree, KNN, random forest, SVM, XGBoost, and a soft voting classifier. For me, random forest produced the highest accuracy at predicting the test dataset (78%). I spent hours trying to increase the accuracy of my models: hyperparameter tuning, engineering new features, and dropping low-impact ones. Nothing I tried really helped 😅, although I am not good at fine-tuning or feature engineering yet.

Project duration: 2020/10/21 - 2020/10/23

Workflow

EDA
- some numerical facts (training set)
  - 891 samples, representing 40% (891 of 2224) of the passengers who boarded the Titanic
  - survival rate ~ 38% (actual survival rate on Titanic is 32%)
  - the majority of passengers traveled alone (~75%)
- data wrangling
  - missing data imputation
    - This is the part where I can be creative. This notebook inspired me to try a few more things instead of just filling missing values with the mean, median, or mode.
    - The Age feature had a lot more missing values. I never thought I could actually build a model to predict those values from other features, such as titles and how much passengers paid for their tickets. Is it guaranteed to be more accurate than just replacing with the mean? Probably not, but very cool!
  - feature engineering
  - drop not-so-useful features
- data visualization, feature relationship exploration
  - pivot tables for quick assumptions
  - histograms, a correlation heatmap, and kernel density estimation plots to confirm assumptions
  - which features might contribute significantly to the model
- Final features: Figure 1: Features used in model training.
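
The model-based Age imputation mentioned above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used in this repo: the toy rows are made up, the column names `Title`, `Fare`, and `Age` are assumed, and a random forest regressor is just one possible choice of model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the training data; the values are made up for illustration.
df = pd.DataFrame({
    "Title": ["Mr", "Mrs", "Miss", "Master", "Mr", "Miss", "Mrs", "Mr"],
    "Fare": [7.25, 71.28, 7.92, 21.08, 8.05, 26.55, 53.10, 13.00],
    "Age": [22.0, 38.0, 26.0, 2.0, np.nan, np.nan, 35.0, 54.0],
})

# One-hot encode the title so the regressor can use it alongside the fare.
X = pd.get_dummies(df[["Title", "Fare"]], columns=["Title"], dtype=float)
known = df["Age"].notna()

# Fit on rows with a known age, then predict the missing ones.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[known], df.loc[known, "Age"])
df.loc[~known, "Age"] = model.predict(X[~known])

print(df)
```

Whether this beats a simple mean or median fill is an empirical question; comparing cross-validated accuracy of the downstream classifier under both strategies is one way to check.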

Model summary

| Classifier | Baseline | 10-fold Stratified Cross-Validation | Hyperparameter Tuned |
|---|---|---|---|
| naive Bayes | 0.7388 | 0.7665 | NA |
| Perceptron | 0.7910 | 0.7227 | NA |
| linear SVM | 0.7873 | 0.7834 | NA |
| KNN | 0.7910 | 0.7890 | NA |
| rbf SVM | 0.8097 | 0.8216 | 0.8249 |
| logistic regression | 0.8134 | 0.7991 | 0.8036 |
| decision tree | 0.8396 | 0.8093 | 0.8093 |
| random forests | 0.8545 | 0.8126 | 0.8250 |

Table 1. Accuracy stats for base classifier benchmark, after k-fold cross validation, and after hyperparameter tuning for the top classifiers.
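
As a rough sketch of how the cross-validation and tuning columns in Table 1 could be produced with scikit-learn: the data here is synthetic (`make_classification`), and the parameter grid is hypothetical, since the grid actually searched in this project is not listed in this README.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the engineered Titanic features (not the real data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 10 stratified folds keep the class ratio roughly constant in every split.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(rf, X, y, cv=cv, scoring="accuracy")
print(f"10-fold accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")

# Small, hypothetical grid, searched with the same stratified folds.
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 3]}
search = GridSearchCV(rf, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.4f}")
```

Reusing the same `cv` object for both steps keeps the baseline and tuned scores comparable, because every candidate is evaluated on identical splits.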

Figure 2. Confusion matrices for the base classifiers. It seems that naive Bayes was very good at predicting survivors, whereas KNN was very good at predicting non-survivors.
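
To make "good at predicting survivors" concrete, per-class recall can be read off a confusion matrix. The labels below are made up for illustration (1 = survived, 0 = did not survive) and are not the predictions behind Figure 2.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for ten passengers.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 1, 1, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Recall per class: the share of each true class that was predicted correctly.
survivor_recall = tp / (tp + fn)
non_survivor_recall = tn / (tn + fp)
print(cm)
print(f"survivor recall: {survivor_recall:.2f}, non-survivor recall: {non_survivor_recall:.2f}")
```

In this toy case the classifier finds every survivor but mislabels two of the five non-survivors, which is the kind of asymmetry the figure caption describes.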

Figure 3. Feature importance plots for final classifiers.
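
A feature importance ranking like the one plotted in Figure 3 can be obtained from a fitted random forest roughly as follows. The data is synthetic and the feature names are hypothetical placeholders, not necessarily the final features shown in Figure 1.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names attached to synthetic data.
feature_names = ["Pclass", "Sex", "Age", "Fare", "FamilySize"]
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; they are normalized to sum to 1.
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Calling `importances.plot.barh()` on the result would give a bar chart, assuming matplotlib is installed.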

Submission scores

| Model | Score |
|---|---|
| AdaBoost | 0.78468 |
| Voting (soft, weighted) | 0.77511 |
| logistic regression | 0.76555 |
| random forests | 0.75837 |
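
For reference, a weighted soft voting classifier like the one submitted above can be sketched with scikit-learn's `VotingClassifier`. The member models, the weights, and the synthetic data below are all assumptions for illustration; the ensemble actually submitted may have been composed differently.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the Titanic features (not the real data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Soft voting averages predicted class probabilities, weighted per model.
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svc", SVC(kernel="rbf", probability=True, random_state=0)),
    ],
    voting="soft",
    weights=[1, 2, 1],  # hypothetical weights
)
voter.fit(X_train, y_train)
proba = voter.predict_proba(X_test)
print(f"hold-out accuracy: {voter.score(X_test, y_test):.4f}")
```

Soft voting requires every member to expose `predict_proba`, which is why `SVC` is created with `probability=True`.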

Afterthoughts

Resources

- Hyperparameter Tuning the Random Forest in Python by Will Koehrsen
- A Statistical Analysis & ML workflow of Titanic by Masum Rumi
- Titanic Project Example Walk Through by Ken Jee
- EDA To Prediction (DieTanic) by Ashwini Swain
- Predicting the Survival of Titanic Passengers by Niklas Donges

zhangyang2017/kaggle-Titanic

Folders and files

Latest commit

History

Repository files navigation

Repo for the Kaggle "Titanic: machine learning from disaster"

Problem definition

Overview

Workflow

Afterthoughts

Resources

Other interesting posts

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages