TitanicSurvivalPrediction

A classification model for predicting the survival of Titanic passengers

Introduction

This repo builds a classification model for the Titanic dataset available on Kaggle.

Dataset

The datasets can be downloaded from the Kaggle competition Titanic - Machine Learning from Disaster. This project uses the open-source tool DVC for data version control. The actual data is stored on Google Drive, and only the .dvc tracking file is committed to the GitHub repository.
Benefits:

- Avoids pushing huge datasets and artifacts to the GitHub repository
- Enables versioning and tracking of the data files without creating multiple copies of the same data under slightly different names
- Enables version control for saved models
- Enables easy sharing of up-to-date data files among teams
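
To illustrate how a DVC-tracked file might be read from Python, here is a minimal sketch. It assumes DVC is installed and the remote is already configured; the path `data/train.csv` and both helper names are hypothetical and not taken from this repo.

```python
import csv
import io


def parse_passengers(text_stream):
    """Parse CSV text into a list of dict rows (one dict per passenger)."""
    return list(csv.DictReader(text_stream))


def read_tracked_csv(path="data/train.csv", rev=None):
    """Read a DVC-tracked CSV at a given Git revision.

    `path` is a hypothetical location; `rev` may be a branch, tag, or commit.
    """
    # Imported here so the parsing helper also works without DVC installed.
    import dvc.api

    with dvc.api.open(path, rev=rev) as f:
        return parse_passengers(f)


if __name__ == "__main__":
    # Demonstrates the parsing step only, on an in-memory sample.
    sample = io.StringIO("PassengerId,Survived\n1,0\n2,1\n")
    print(parse_passengers(sample))
```

Passing a `rev` is what makes the versioning benefit concrete: the same path can be read as it existed at any tagged state of the repository.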

Another approach would be to use an RDBMS for data storage, but it has drawbacks:

- Reading and writing data to the DB by multiple teams can slow down the process
- Not suitable for huge image, text, audio, and video datasets
- Version control is not possible with this approach

Notebooks

The notebooks contain the basic exploratory data analysis (EDA). Please refer to the EDA notebook.

Data Preprocessing

Highlights from the preprocessing

Null value imputation
- Age - mean of the train Age feature
- Embarked - unknown
- Fare - mean of the train Fare feature
- Sex - unknown
New features
- Feature Title created from Name
- Feature IsAlone is created from the Family size
- Feature Pclass_Age created from Pclass and Age
- Feature Fare_Embarked from Fare and Embarked
Encoding
- Features Age and Fare are binned into categories and then label encoded
- All other features are label encoded
Drop columns
- PassengerId, Ticket, Cabin
Feature importance and feature selection
- Tested XGBoost feature importance for identifying the important features
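
A few of the steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in src: the regex assumes the Kaggle name layout "Surname, Title. Given names", family size is assumed to be SibSp + Parch + 1, and the age bin edges and all function names are hypothetical.

```python
import re
from bisect import bisect_right

# Captures the text between the comma and the first period in a name.
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")


def extract_title(name):
    """Return the title from a passenger name, or 'unknown' if none is found."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else "unknown"


def is_alone(sibsp, parch):
    """Return 1 if the passenger travels without family, else 0.

    Assumes family size = SibSp + Parch + 1 (the passenger included).
    """
    return int(sibsp + parch + 1 == 1)


# Hypothetical bin edges; the edges used in this repo are not stated.
AGE_EDGES = [12, 18, 40, 60]


def bin_age(age):
    """Map an age to an ordinal bin index in 0..len(AGE_EDGES)."""
    return bisect_right(AGE_EDGES, age)


def fit_label_encoder(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


if __name__ == "__main__":
    print(extract_title("Braund, Mr. Owen Harris"))
    print(is_alone(sibsp=0, parch=0))
    print(bin_age(30))
    print(fit_label_encoder(["S", "C", "Q", "S"]))
```

Fitting the label encoder returns a plain mapping, so the same codes can be reused on the test data instead of being re-derived from it.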

Note: The Cabin and Ticket features, currently dropped, could be engineered further for better accuracy. The training and test data are preprocessed separately, which makes it possible to run inference on a single data point or a batch of samples without reprocessing the training data.
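
One way to get this separation is to learn the fill values once from the training data and then apply them to any later row. The sketch below assumes rows are plain dicts with `None` for missing values; the function names are hypothetical and the repo's own implementation may differ.

```python
from statistics import mean


def fit_imputer(train_rows):
    """Learn fill values from the training rows only."""

    def train_mean(column):
        # Average over the rows where the column is present.
        return mean(r[column] for r in train_rows if r[column] is not None)

    return {
        "Age": train_mean("Age"),
        "Fare": train_mean("Fare"),
        "Embarked": "unknown",
        "Sex": "unknown",
    }


def impute(row, fill_values):
    """Return a copy of `row` with missing (None) values filled in."""
    return {
        key: fill_values[key] if value is None and key in fill_values else value
        for key, value in row.items()
    }


if __name__ == "__main__":
    train = [
        {"Age": 20.0, "Fare": 10.0, "Embarked": "S", "Sex": "male"},
        {"Age": 40.0, "Fare": 30.0, "Embarked": None, "Sex": "female"},
        {"Age": None, "Fare": 20.0, "Embarked": "C", "Sex": "male"},
    ]
    fill_values = fit_imputer(train)
    # A single new data point is filled without touching the training rows again.
    print(impute({"Age": None, "Fare": 7.25, "Embarked": None, "Sex": "male"}, fill_values))
```

Because `fill_values` is just a dictionary, it could be saved next to the model and loaded at inference time.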

Experimentation

I have used MLflow to track the experiment runs. Important information such as the model name, parameters, and evaluation metrics is logged to MLflow. The results can be viewed in the MLflow UI by running the command below:
mlflow ui
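
The kind of logging described above might look like the following sketch. It assumes MLflow is installed; the experiment name `titanic-survival` and both function names are hypothetical, and the repo's actual logging code may be organized differently.

```python
def as_float_metrics(metrics):
    """Convert metric values to floats, the type MLflow stores for metrics."""
    return {name: float(value) for name, value in metrics.items()}


def log_experiment(model_name, params, metrics, experiment="titanic-survival"):
    """Log one training run: model name, parameters, and evaluation metrics."""
    # Imported here so the helper above also works without MLflow installed.
    import mlflow

    mlflow.set_experiment(experiment)  # hypothetical experiment name
    with mlflow.start_run(run_name=model_name):
        mlflow.log_param("model_name", model_name)
        mlflow.log_params(params)
        mlflow.log_metrics(as_float_metrics(metrics))


if __name__ == "__main__":
    # Shows the metric conversion only; log_experiment needs MLflow installed.
    print(as_float_metrics({"accuracy": 1, "f1": "0.5"}))
```

Each call to `log_experiment` creates one run, so runs with different models or parameters can be compared side by side in the UI.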

Hyperparameter Tuning

Performed GridSearchCV for Random Forest and XGBoost. Also used Optuna (an alternative to grid search) for hyperparameter tuning.
TODO Need to add the optuna scripts here

TODO Need to add the EDA scripts here
TODO Model interpretation
TODO Explore and identify tools to get model interpretability
TODO Write a Flask app to take a data point and respond with the probability of survival
TODO Use GitHub Actions for creating a release
TODO Unit testing in data science and integration with GitHub Actions
TODO Explore options for deployment (AWS, Azure)

