[Titanic - Machine Learning from Disaster](https://www.kaggle.com/c/titanic/code?competitionId=3136&sortBy=voteCount)

# 1. Overview

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 

In [2]:
# DF Train
df_train = pd.read_csv("Titanic/train.csv")
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
# DF Test 
df_test = pd.read_csv("Titanic/test.csv")
df_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


* The files `train.csv` and `test.csv` share the same columns, except that `"Survived"` does not appear in `test.csv`. The task is to train a model on `train.csv` that predicts `"Survived"` from the remaining columns, and then to apply that model to the rows of `test.csv`.

* The column `"Cabin"` in the two data files has missing values.
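Both observations can be checked programmatically. The sketch below is only illustrative: it rebuilds a few columns of the five training rows shown above by hand (the `sample` frame is an assumption standing in for `df_train`) and copies the column lists from the two table headers.

```python
import pandas as pd

# Column names copied from the train.csv and test.csv headers shown above.
train_columns = ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
                 "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]
test_columns = ["PassengerId", "Pclass", "Name", "Sex", "Age",
                "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]

# Columns present in train.csv but absent from test.csv.
only_in_train = sorted(set(train_columns) - set(test_columns))

# Hand-built stand-in for the first five rows of df_train (subset of columns).
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [None, "C85", None, "C123", None],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Count missing entries per column and keep only the columns that have any.
missing_counts = sample.isna().sum()
columns_with_missing = missing_counts[missing_counts > 0]

print(only_in_train)
print(columns_with_missing)
```

On the real data, the same `isna().sum()` call applied to `df_train` and `df_test` lists every column with gaps, not only those visible in the first five rows.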

## Meaning of each information field

* `"Pclass"`: ticket class. 1 = _Upper_ rank, 2 = _Middle_ rank, 3 = _Lower_ rank. Because the classes are ordered, `"Pclass"` can be treated either as a category or as a number. This feature may affect survival, since passengers in the more luxurious classes may have had better (or, conversely, worse) access to safety measures.

* `"Sex"`: passenger gender.

* `"Age"`: age of the passenger in years. Ages below 1 are fractional (for example, 0.42), and estimated ages take the form xx.5. Age is a potentially useful feature because children and the elderly are in a higher-risk group.

* `"SibSp"`: number of siblings or spouses on board.

* `"Parch"`: number of parents/children on board.

* `"Ticket"`: ticket number.

* `"Fare"`: fare.

* `"Cabin"`: cabin number.

* `"Embarked"`: port of embarkation, `C` = Cherbourg, `Q` = Queenstown, `S` = Southampton.

From the descriptions above, some fields are numeric (`Age, Fare, Parch, SibSp`) and others are categorical (`Pclass, Sex, Ticket, Cabin, Embarked`). An initial assessment suggests that `Pclass, Age, Parch, SibSp` may help the model, while `Cabin, Embarked, Ticket, Fare` may be less useful.
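One way to make this numeric/categorical split explicit in pandas is to cast the categorical fields and then select columns by dtype. This is a minimal sketch, not the notebook's own preprocessing: the `sample` frame is a hand-built stand-in for `df_train`, and encoding `Pclass` as an ordered categorical is an assumption about how one might represent its ordinal nature.

```python
import pandas as pd

# Hand-built stand-in for a few columns of the first five training rows.
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Pclass is ordinal: list the categories from lowest (3) to highest (1) rank.
sample["Pclass"] = pd.Categorical(sample["Pclass"], categories=[3, 2, 1], ordered=True)
# Sex and Embarked are unordered categories.
sample["Sex"] = sample["Sex"].astype("category")
sample["Embarked"] = sample["Embarked"].astype("category")

# Split the columns by dtype.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(include="category").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

After the cast, `describe()` no longer reports a mean or standard deviation for `Pclass`, which avoids reading arithmetic meaning into class labels.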

# 2. Statistics

In [6]:
df_train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


* `"PassengerId"` and `"Pclass"` are categorical, but they still appear here: unless told otherwise, pandas treats any column whose values can all be converted to numbers as numeric.

* For each field, the statistics are computed over the _non-missing_ values only:
    * `count`: number of _non-missing_ elements,
    * `mean`: mean value,
    * `std`: standard deviation,
    * `min`: minimum value,
    * `max`: maximum value,
    * `50%`: median, the value that half of the elements in the column are less than or equal to,
    * `25%`: first quartile, the value that 25% of the elements in the column are less than or equal to,
    * `75%`: third quartile, the value that 75% of the elements in the column are less than or equal to.

* For the `Survived` column, the mean is `0.384`. This is the _label_ column that the model needs to predict. Because it only takes the values 0 and 1, the mean tells us that 38.4% of the rows are equal to 1. The two classes are therefore not severely imbalanced, although non-survivors (61.6%) outnumber survivors.

* For the `Age` column, `count = 714`, which is less than the number of rows in the other columns (891). This means that 891 - 714 = 177 data samples have `Age` missing. The youngest person on board is only 0.42 years old, while the oldest is 80 years old.

* For the `SibSp` column, the maximum number of siblings or spouses travelling with a single passenger is 8, but 75% of passengers have at most one sibling or spouse travelling with them. This shows that the distribution is strongly right-skewed.

* The `Parch` column is similarly skewed: one passenger has as many as 6 parents/children on board, while at least 75% of passengers have none travelling with them.

* The `Fare` column is also skewed: the mean is about 32 while the median is only about 14, and the maximum reaches 512. Passengers with a zero fare may have been crew members, although the data alone does not confirm this.
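The readings above can be reproduced on a small scale. The sketch below is illustrative only: it uses the five fares and labels from the training rows shown earlier, plus an `age` series in which one value has been blanked out on purpose to show how `count` behaves.

```python
import numpy as np
import pandas as pd

# Fares from the five training rows shown earlier.
fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05], name="Fare")

# `std` is the square root of the (sample) variance, not the variance itself.
std_matches_sqrt_var = bool(np.isclose(fare.std(), np.sqrt(fare.var())))

# The 25%, 50% and 75% rows of describe() are quantiles; 50% is the median.
q25, q50, q75 = fare.quantile([0.25, 0.5, 0.75])

# A mean well above the median, plus positive skewness, signals a long right tail.
mean_above_median = bool(fare.mean() > fare.median())
skewness = fare.skew()

# `count` ignores missing entries. One age is blanked out here purely for illustration.
age = pd.Series([22.0, 38.0, np.nan, 35.0, 35.0], name="Age")
n_missing = len(age) - age.count()

# For a 0/1 label, the mean equals the share of rows labelled 1.
survived = pd.Series([0, 1, 1, 1, 0], name="Survived")
positive_share = survived.mean()

print(std_matches_sqrt_var, q50, mean_above_median, n_missing, positive_share)
```

Applied to `df_train`, the same calls (`.skew()`, `.quantile()`, `len(...) - .count()`) give the figures discussed in the bullets without reading them off the `describe()` table by hand.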

In [8]:
df_test.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,3.0,39.0,1.0,0.0,31.5
max,1309.0,3.0,76.0,8.0,9.0,512.3292


* The test set has 418 rows (equal to `count` in the `PassengerId` column).

* The columns `Age` and `Fare` have missing values: `Age` is missing in 418 - 332 = 86 rows, and `Fare` is missing in one row, even though the training set has no missing `Fare` values.

* The statistics in the columns `Age, SibSp, Parch` and `Fare` are relatively consistent with the training set.
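The missing-value comparison between the two sets follows directly from the `count` rows of the two `describe()` tables. A minimal sketch, with those counts typed in by hand rather than read from the files:

```python
import pandas as pd

# `count` values copied from the two describe() outputs above.
counts = pd.DataFrame({
    "train": {"PassengerId": 891, "Age": 714, "Fare": 891},
    "test": {"PassengerId": 418, "Age": 332, "Fare": 417},
})

# PassengerId is never missing, so its count is the number of rows in each set.
n_rows = counts.loc["PassengerId"]

# Missing entries per column = rows - non-missing count.
missing = counts.rsub(n_rows, axis="columns").drop(index="PassengerId")

# Share of rows with a missing value, per column and per set.
missing_share = missing.div(n_rows, axis="columns")

print(missing)
print(missing_share.round(3))
```

Roughly one fifth of the `Age` values are missing in both sets, so any imputation strategy chosen for the training set has to be applied to the test set as well.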

Because pandas generally loads the entire file into RAM, it is a poor fit for datasets that do not fit in memory.
For big data, see [Dask](https://dask.org/) and [Modin](https://modin.readthedocs.io/en/latest/), both of which offer syntax similar to [Pandas](https://pandas.pydata.org/docs/), or [PySpark](https://spark.apache.org/docs/latest/api/python/) for data processing on distributed systems.
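Short of switching libraries, pandas itself can stream a file in pieces through the `chunksize` argument of `read_csv`. This is a different technique from the libraries listed above, shown here only as a sketch: the CSV is a tiny in-memory string built from the five training rows, and the chunk size of 2 is arbitrary.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for a file too large to load at once.
csv_text = (
    "PassengerId,Survived,Fare\n"
    "1,0,7.25\n"
    "2,1,71.2833\n"
    "3,1,7.925\n"
    "4,1,53.1\n"
    "5,0,8.05\n"
)

total_rows = 0
total_survived = 0

# With chunksize set, read_csv yields DataFrames of at most that many rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
    total_survived += int(chunk["Survived"].sum())

# Aggregate computed without ever holding the whole file in memory.
survival_rate = total_survived / total_rows
print(total_rows, total_survived, survival_rate)
```

This works for statistics that can be accumulated chunk by chunk, such as counts, sums, and means. Quantities like the median need the full column or an approximate algorithm.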