Skip to content

redahamdouch/titanic_classifier

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 

Repository files navigation

CLASSIFICATION ON TITANIC DATA

titanic

Table of contents :

  1. Context
  2. Data Analysis
    1. Confusion matrix
    2. Gender Distribution
    3. Passenger class survive probabilty
  3. Data Transformation
    1. Fare before transformation
    2. Fare after transformation
  4. Modeling
    1. Model selection
    2. Final model
  5. What could be explored more ?

CONTEXT :

  • This is one of the first machine learning projects I did during my studies.

  • The Titanic competition is a well-known challenge in the field of data science, which consists of predicting the probability that a person will or will not survive the impact of the famous ship against an iceberg on 14 April 1912.

  • Here is the data dictionary of the dataset:

Variable Definition Key
survival Survival 0 = No, 1 = Yes
pclass Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
sex Sex
Age Age in years
sibsp # of siblings / spouses aboard the Titanic
parch # of parents / children aboard the Titanic
ticket Ticket number
fare Passenger fare
cabin Cabin number
embarked Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
  • I managed to get a final score of 81% accuracy using the support vector model
  • All the code I made is documented with comments to understand my approach throughout the project. I have added to this read the most interesting tracks of my work:

1) Data Analysis

a) Confusion matrix

correlation_matrix

Correlation matrix of the variables. "survived" variable is the dependent one

  • From the correlation matrix, we can notice that the most correlated variables to survival is the gender, the fare and class de la personnes, lets take a closer look in this variables to try to understand why.

b) Gender Distribution

gender_distrib

  • As we can see 80.9% of women survived against 27.3% of men. This is because women and children were given priority to board the lifeboats

c) Passenger class survive probabilty

pclass_survive_rat

Here we can see that the probability of survival is higher if the passenger is in a upper class, this is due to the fact that the upper class areas on the titanic are naturally less crowded, so there is less crowding and more safety boats available to 1st and 2nd class passengers.

2) Data Transformation

a) Fare before transformation

fare_distrib

Distribution of the fare variable

Concerning the fare variable, well as the graph above shows, we have a right skewed dsitribrution of prices paid by passengers.

b) Fare after transformation

To improve the correlation of this variable with our dependent variable, I used the lograthmic scale which gives us the following distribution:

log_fare_distrib

Distribution of the fare variable after performing a log scale transformation

This increased the correlation between fare (with log) and survied category. (0.24 -> 0.31)

3) Modeling

a) Model selection

models

Train accuracy, test accuracy, cross validation mean accuracy for each model tested

  • After putting to the test several ML algorithms we found some overfitting problem : Huge difference bewteen the accuracy of the training and testing phase for the following methods:

  • RandomForest

  • DecisionTree

  • KNeighboors

  • Catboost

  • After some trials to reduce de max depth and pruning on tree-bases models, we finally decided to choose the Support Vecotor Classifier, wichi has the best performance and remains stable.

b) Final model

Finally we performed a finetuing process on SV classifier and came up with a final accuracy of 81%.

confusion_matrix

Confusion matrix of the final model

4) What could be explored more ?

  • Try to do more feature engenieering, to come up with more correlated data
  • test other ML algorithms (XGBoost, MLP)
  • Fine tune all the tested algorithms
  • Try to use SMOTE to have a 50/50 repartition of the values (regarding the dependent variable)

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published