Titanic, in full Royal Mail Ship (RMS) Titanic, British luxury passenger liner that sank on April 14–15, 1912, during its maiden voyage, en route to New York City from Southampton, England, killing about 1,500 passengers and ship personnel. One of the most famous tragedies in modern history, it inspired numerous stories, several films, and a musical and has been the subject of much scholarship and scientific speculation.
As it prepared to embark on its maiden voyage, the Titanic was one of the largest and most opulent ships in the world. It had a gross registered tonnage (i.e., carrying capacity) of 46,328 tons, and when fully laden the ship displaced (weighed) more than 52,000 tons. The Titanic was approximately 882.5 feet (269 metres) long and about 92.5 feet (28.2 metres) wide at its widest point.
We aim to create a machine learning model that predicts the fate of passengers on the Titanic, summarized according to economic status (class), sex, age and survival.
We will build a predictive model that answers the question: "What sorts of people were more likely to survive?" using passenger data (i.e., name, age, gender, socio-economic class, etc.).
In this project, we will perform the following steps:
- Extract the datasets.
- Perform exploratory analysis on the data.
- Check for missing values and extract important features.
- Use seaborn and matplotlib to aid visualizations.
- Perform data preprocessing: impute missing values, convert features into numeric ones, group values into categories and create a few new features.
- Train different machine learning models and choose the best one.
- Apply cross-validation on the chosen model.
- Tune the performance of the model by optimizing its hyperparameter values.
- Compute the model's precision, recall and F-score.
- The dataset was taken from https://www.kaggle.com/c/titanic
- The dataset consists of a training set, a test set and a gender submission file.
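The datasets above can be loaded with pandas. The snippet below is a minimal sketch: it uses a tiny inline stand-in with the same column names as the Kaggle `train.csv` (the rows are made up), so the same calls apply to the real file via `pd.read_csv("train.csv")`.

```python
from io import StringIO

import pandas as pd

# Tiny hypothetical stand-in for the Kaggle train.csv (same columns, fake rows).
# In the project, this would simply be: train_df = pd.read_csv("train.csv")
csv = StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked\n"
    "1,0,3,Braund,male,22,1,0,7.25,S\n"
    "2,1,1,Cumings,female,38,1,0,71.28,C\n"
    "3,1,3,Heikkinen,female,,0,0,7.92,S\n"
)
train_df = pd.read_csv(csv)

print(train_df.shape)           # (3, 10): rows x columns
print(train_df.isnull().sum())  # missing values per column (Age has one here)
```

The same `isnull().sum()` call is what drives the missing-value check in the exploration step.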
All the dependencies and required libraries are included in the file requirements.txt
Each of the following data blocks is explained in depth in the Jupyter notebook.
- Import the libraries.
- Load the datasets.
- Perform Data Exploration.
- Perform Data Preprocessing
- Train several Machine Learning models and compare their results.
All the blockwise outputs are available in the Images folder for reference.
- Stochastic Gradient Descent (SGD): Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g., differentiable or subdifferentiable).
- Random Forest: A random forest is a machine learning technique that's used to solve regression and classification problems. It utilizes ensemble learning, which is a technique that combines many classifiers to provide solutions to complex problems.
- Logistic Regression: In statistics, the logistic model is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick.
- K Nearest Neighbor: The k-nearest neighbors (KNN) algorithm is a simple, supervised machine learning algorithm that can be used to solve both classification and regression problems.
- Gaussian Naive Bayes: Gaussian Naive Bayes is a variant of Naive Bayes that follows Gaussian normal distribution and supports continuous data.
- Perceptron: The perceptron is a supervised learning algorithm for binary classification. Modelled as a single neuron, it computes a weighted sum of its inputs and assigns each example to one of two classes depending on whether that sum crosses a threshold.
- Linear Support Vector Machine: Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used for Classification as well as Regression problems. However, primarily, it is used for Classification problems in Machine Learning.
- Decision Tree: A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.
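The eight classifiers above can be trained and compared with a short loop. This is a sketch, not the notebook's exact code: it uses a synthetic dataset from `make_classification` as a stand-in for the preprocessed, all-numeric Titanic features, and compares validation accuracy only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed Titanic features (binary target)
X, y = make_classification(n_samples=800, n_features=8, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

models = {
    "SGD": SGDClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Perceptron": Perceptron(random_state=42),
    "Linear SVC": LinearSVC(dual=False),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Fit each model and report its validation accuracy
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_val, y_val)
    print(f"{name}: {scores[name]:.3f}")
```

Ranking the `scores` dictionary is then enough to pick the best-performing model for further tuning.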
The Random Forest classifier takes first place.
Random Forest is a supervised learning algorithm. As the name suggests, it creates a forest and makes it somewhat random. The forest it builds is an ensemble of decision trees, most of the time trained with the "bagging" method. The general idea of bagging is that a combination of learning models increases the overall result. In simple words: random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
One big advantage of random forest is that it can be used for both classification and regression problems, which form the majority of current machine learning systems. With a few exceptions, a random forest classifier has all the hyperparameters of a decision tree classifier and also all the hyperparameters of a bagging classifier, to control the ensemble itself.

The random forest algorithm brings extra randomness into the model when it is growing the trees. Instead of searching for the best feature while splitting a node, it searches for the best feature among a random subset of features. This process creates wide diversity, which generally results in a better model. Therefore, when growing a tree in a random forest, only a random subset of the features is considered for splitting a node.
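In scikit-learn, this per-split feature subsampling is controlled by the `max_features` hyperparameter. A minimal sketch on synthetic data (the dataset is a stand-in, not the Titanic features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=16, random_state=0)

# max_features="sqrt" (the classifier's default): each split considers only
# sqrt(16) = 4 randomly chosen candidate features instead of all 16
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)

print(len(rf.estimators_))  # 100 individual decision trees in the ensemble
```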
Another great quality of random forest is that it makes it very easy to measure the relative importance of each feature. Scikit-learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). It computes this score automatically for each feature after training and scales the results so that the sum of all importances is equal to 1.
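These impurity-based scores are exposed through the `feature_importances_` attribute after fitting. A short sketch, again on synthetic data with made-up feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
feature_names = [f"feature_{i}" for i in range(6)]  # hypothetical names

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One importance score per feature, scaled so they sum to 1
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
print(importances.sum())  # 1.0
```

Sorting the series like this is a quick way to spot candidate features for removal.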
Below you can see the hyperparameter tuning for the parameters criterion, min_samples_leaf, min_samples_split and n_estimators:
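A tuning run over those four parameters can be sketched with `GridSearchCV`. The grid values below are illustrative placeholders, not the ones used in the notebook, and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical grid over the four parameters named above
param_grid = {
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [1, 5],
    "min_samples_split": [2, 10],
    "n_estimators": [50, 100],
}

# Exhaustive search with 3-fold cross-validation over every combination
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```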
- The first row is about the not-survived predictions: 493 passengers were correctly classified as not survived (true negatives) and 56 were wrongly classified as survived (false positives).
- The second row is about the survived predictions: 93 passengers were wrongly classified as not survived (false negatives) and 249 were correctly classified as survived (true positives).
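That row/column layout is exactly what scikit-learn's `confusion_matrix` produces. A tiny sketch with made-up labels, just to show the layout:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (0 = not survived, 1 = survived)
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0])

# Row 0: actual not-survived -> [true negatives, false positives]
# Row 1: actual survived     -> [false negatives, true positives]
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]
#  [1 2]]
```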
For each person the Random Forest algorithm has to classify, it computes a probability, and it classifies the person as survived (when the score is bigger than the threshold) or as not survived (when the score is smaller than the threshold). That's why the threshold plays an important part.
Above you can clearly see that the recall falls off rapidly at a precision of around 85%.
Another way is to plot the precision and recall against each other:
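A precision-versus-recall plot can be sketched as follows. This is an illustrative version on synthetic data: it uses `cross_val_predict` to get out-of-fold survival probabilities and `precision_recall_curve` to sweep the threshold discussed above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; the notebook renders inline instead
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)

# Out-of-fold predicted probabilities of the positive (survived) class
proba = cross_val_predict(rf, X, y, cv=3, method="predict_proba")[:, 1]

# One (precision, recall) pair per candidate threshold
precision, recall, thresholds = precision_recall_curve(y, proba)

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.savefig("precision_recall.png")
```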
Another way to evaluate and compare your binary classifier is the ROC curve. This curve plots the true positive rate (also called recall) against the false positive rate (the ratio of incorrectly classified negative instances), instead of plotting precision versus recall.
The ROC AUC score is the corresponding score to the ROC curve. It is computed by measuring the area under the curve (AUC). A classifier that is 100% correct would have a ROC AUC score of 1, and a completely random classifier would have a score of 0.5.
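Computing the score is a one-liner with `roc_auc_score`, again sketched here on synthetic data with out-of-fold probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, random_state=0)

# Out-of-fold probabilities of the positive class
proba = cross_val_predict(
    RandomForestClassifier(random_state=0), X, y, cv=3, method="predict_proba"
)[:, 1]

# Area under the ROC curve: 1.0 = perfect, 0.5 = random guessing
score = roc_auc_score(y, proba)
print(score)
```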
Below you can see a before and after picture of the `train_df` dataframe:
We started with the data exploration, where we got a feeling for the dataset, checked for missing data and learned which features are important. During this process we used seaborn and matplotlib for the visualizations. During the data preprocessing part, we imputed missing values, converted features into numeric ones, grouped values into categories and created a few new features.

Afterwards we trained 8 different machine learning models, picked one of them (random forest) and applied cross-validation on it. Then we discussed how random forest works, took a look at the importance it assigns to the different features and tuned its performance by optimizing its hyperparameter values. Lastly, we looked at its confusion matrix and computed the model's precision, recall and F-score.
There is still room for improvement, such as more extensive feature engineering: comparing and plotting features against each other, and identifying and removing noisy features.
- https://towardsdatascience.com/predicting-the-survival-of-titanic-passengers-30870ccc7e8
- https://www.kaggle.com/c/titanic
- https://www.britannica.com/topic/Titanic
CREATOR - https://github.com/theshredbox