# Tabular Playground Series - Apr 2021
In this notebook, we analyse the synthetic `Titanic Dataset`, which was generated using CTGAN. The goal is to build a machine learning model that predicts the `Survived` field from the 11 other variables. Submissions are evaluated on the `accuracy` of the model.

# Data Dictionary
| Variable | Definition | Key |
| -------- | ---------- | --- |
| survival | Survival  |0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

# Variable Notes
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

# Load the Dataset 
In this section, we import all the useful libraries and load the dataset into the notebook.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import missingno

plt.style.use('dark_background')

from pandas.plotting import scatter_matrix

import warnings
warnings.filterwarnings('ignore')

In [None]:
train_df = pd.read_csv('../input/tabular-playground-series-apr-2021/train.csv')
train_df.head()

# Perform statistics operation
In this section, we compute basic summary statistics such as the mean, standard deviation, min, max, etc.

In [None]:
train_df.describe()

One observation that looks odd at first is the minimum of the Age column, 0.080. According to the variable notes, ages below 1 are recorded as fractions, so this value describes an infant rather than an impossible age. We should still review such values and decide how to treat them before modelling.
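To see how common these fractional ages are, a minimal sketch along these lines may help. The helper name `count_infants` and the `toy` frame are hypothetical and used only for illustration; in the notebook you would pass `train_df` instead.

```python
import pandas as pd

def count_infants(df, age_col="Age"):
    """Return the number of rows whose age is strictly below 1 year (NaN is ignored)."""
    return int((df[age_col] < 1).sum())

# Made-up rows for illustration only; replace with train_df in the notebook.
toy = pd.DataFrame({"Age": [0.08, 0.92, 22.0, None, 35.5]})
infant_count = count_infants(toy)
print(infant_count)
```

If the count is small relative to the dataset, leaving these rows untouched is probably a reasonable default.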

In [None]:
missingno.bar(train_df, color='orangered');

Since the `Cabin` column has a lot of missing values, filling it with arbitrary values is not a good idea. We are going to drop this column from the dataset and fill the remaining columns that have missing values based on what we learn from the EDA.
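The decision to drop `Cabin` can be backed by numbers rather than by the bar chart alone. The sketch below is an illustration under assumptions: `missing_share` is a hypothetical helper and `toy` is made-up data standing in for `train_df`.

```python
import pandas as pd

def missing_share(df):
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)

# Made-up rows for illustration only; replace with train_df in the notebook.
toy = pd.DataFrame({
    "Cabin": [None, None, None, "C85"],
    "Age": [22.0, None, 35.0, 54.0],
    "Fare": [7.25, 71.28, 8.05, 51.86],
})
share = missing_share(toy)
print(share)

# Drop the mostly-empty column; errors="ignore" keeps this safe if it is already gone.
reduced = toy.drop(columns=["Cabin"], errors="ignore")
```

Dropping returns a new frame here, so the original data stays available if you later want to derive a feature from `Cabin`.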

# Exploratory Data Analysis
In this section, we perform Exploratory Data Analysis (EDA) to understand the dataset and find useful patterns between the different variables.

## Univariate

In [None]:
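sex_counts = train_df.Sex.value_counts()
# Take the labels from the counts themselves so each slice gets the right name
plt.pie(sex_counts, labels=sex_counts.index.str.capitalize(), colors=['orangered', 'lightsalmon'], autopct="%1.2f%%")
plt.title('Sex Distribution Graph', fontweight='bold', fontsize=18);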

In [None]:
def univariate_graph(title, xlabel, x, y, ylabel='Frequency'):
    plt.bar(x, y, color='orangered')
    plt.title(title, fontweight='bold', fontsize=14)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.show();

In [None]:
# sort_index() orders the counts as 0, 1 so they line up with the labels
univariate_graph(x=['Not Survived', 'Survived'],
                y=train_df.Survived.value_counts().sort_index(),
                title='Survived Distribution',
                xlabel='Survived')

In [None]:
# sort_index() orders the counts as 1, 2, 3 so they line up with Upper, Middle, Lower
univariate_graph(x=['Upper', 'Middle', 'Lower'],
                y=train_df.Pclass.value_counts().sort_index(),
                title='Pclass Distribution',
                xlabel='Pclass')

In [None]:
plt.hist(train_df.Age, bins=10, color='orangered')
plt.title('Age Distribution', fontweight='bold', fontsize=14)
plt.xlabel('Age')
plt.ylabel('Frequency');

So most of the passengers are between 20 and 40 years old. The histogram also shows passengers younger than 5, including ages below 1, which looks suspicious at first. We need to decide how to handle such cases before fitting the model.

In [None]:
univariate_graph(x=train_df.SibSp.value_counts().index,
                y=train_df.SibSp.value_counts(),
                title='Sibling/Spouse Distribution',
                xlabel='SibSp')

So, most of the passengers on the Titanic travelled without any siblings or spouse aboard. Some travelled with one sibling or spouse, while a few travelled with several.

In [None]:
univariate_graph(x=train_df.Parch.value_counts().index,
                y=train_df.Parch.value_counts(),
                title='Parch Distribution',
                xlabel='Parch')

In [None]:
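# Map port codes to names so each bar gets the label that matches its count
port_names = {'S': 'Southampton', 'C': 'Cherbourg', 'Q': 'Queenstown'}
embarked_counts = train_df.Embarked.value_counts()
univariate_graph(x=embarked_counts.index.map(port_names),
                y=embarked_counts,
                title='Embarked Distribution',
                xlabel='Embarked')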

So most of the passengers embarked at `Southampton`.

In [None]:
plt.hist(train_df.Fare, bins=5, color='orangered')
plt.title('Fare Distribution', fontweight='bold', fontsize=14)
plt.xlabel('Fare')
plt.ylabel('Frequency');

## Bivariate

In [None]:
sample_col = [col for col in train_df.columns if pd.api.types.is_numeric_dtype(train_df[col])]
plt.style.use('dark_background')
# Drop rows with missing values in the plotted numeric columns only, so gaps in other columns do not discard rows
data = train_df[sample_col[1:]].dropna()
plt.boxplot(data.values, patch_artist=True, labels=sample_col[1:])
plt.title('Outlier Chart', fontsize=24, fontweight='bold');

So, we have outliers in Fare. At this scale SibSp and Parch do not appear to have any outliers, but we need to look at these columns more closely.

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))
ax1.boxplot(train_df['SibSp'], patch_artist=True, labels=['SibSp'])
ax1.set_title('SibSp Outlier Chart', fontsize=18, fontweight='bold')
ax2.boxplot(train_df['Parch'], patch_artist=True, labels=['Parch'])
ax2.set_title('Parch Outlier Chart', fontsize=18, fontweight='bold')
# Drop missing values from Age only, so rows that are missing other columns are kept
ax3.boxplot(train_df['Age'].dropna(), patch_artist=True, labels=['Age'])
ax3.set_title('Age Outlier Chart', fontsize=18, fontweight='bold');

On closer inspection, `SibSp`, `Parch` and `Age` contain outliers as well.
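The points drawn beyond the whiskers of a box plot follow the 1.5 × IQR rule by default, and the same rule can be applied numerically to count outliers per column. This is a minimal sketch, not the notebook's own method: `iqr_outlier_count` is a hypothetical helper and `toy` is made-up data.

```python
import pandas as pd

def iqr_outlier_count(series, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring NaN."""
    s = series.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((s < lower) | (s > upper)).sum())

# Made-up values for illustration only; replace with e.g. train_df['SibSp'].
toy = pd.Series([0, 0, 0, 0, 1, 1, 1, 2, 8], name="SibSp")
n_out = iqr_outlier_count(toy)
print(n_out)
```

For heavily skewed count columns such as `SibSp` and `Parch`, this rule tends to flag every large family, so the result is better read as "rare values" than as errors.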

In [None]:
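# Restrict to numeric columns; recent pandas versions raise an error when corr() meets string columns
sns.heatmap(train_df.select_dtypes(include='number').corr(), annot=True, cmap="YlOrBr");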

In [None]:
# Reindex both counts to the same category order so the stacked bars line up
sex_order = ['female', 'male']
survived = train_df.loc[train_df['Survived'] == 1, 'Sex'].value_counts().reindex(sex_order, fill_value=0)
not_survived = train_df.loc[train_df['Survived'] == 0, 'Sex'].value_counts().reindex(sex_order, fill_value=0)
plt.bar(sex_order, survived, width=0.3, color='orangered')
plt.bar(sex_order, not_survived, bottom=survived, width=0.3, color='lightsalmon')
plt.legend(['Survived', 'NotSurvived'])
plt.title('Sex Survived Relationship', fontsize=18, fontweight='bold')
plt.show();

In [None]:
# Reindex both counts to class order 1, 2, 3 so they match the Upper, Middle, Lower labels
class_order = [1, 2, 3]
class_labels = ['Upper', 'Middle', 'Lower']
survived = train_df.loc[train_df['Survived'] == 1, 'Pclass'].value_counts().reindex(class_order, fill_value=0)
not_survived = train_df.loc[train_df['Survived'] == 0, 'Pclass'].value_counts().reindex(class_order, fill_value=0)
plt.bar(class_labels, survived, color='orangered')
plt.bar(class_labels, not_survived, color='lightsalmon', bottom=survived)
plt.title('Pclass Survived Relationship', fontsize=18, fontweight='bold')
plt.legend(['Survived', 'NotSurvived'])
plt.show();

In [None]:
# Use the sorted SibSp values actually present so each bar sits at its own x position
sibsp_order = sorted(train_df['SibSp'].dropna().unique())
survived = train_df.loc[train_df['Survived'] == 1, 'SibSp'].value_counts().reindex(sibsp_order, fill_value=0)
not_survived = train_df.loc[train_df['Survived'] == 0, 'SibSp'].value_counts().reindex(sibsp_order, fill_value=0)
plt.bar(sibsp_order, survived, color='orangered')
plt.bar(sibsp_order, not_survived, color='lightsalmon', bottom=survived)
plt.title('SibSp Survived Relationship', fontsize=18, fontweight='bold')
plt.legend(['Survived', 'NotSurvived'])
plt.show();
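The stacked bar charts above pair each category with its survived and not-survived counts. One way to get those counts aligned in a single step is `pd.crosstab`. The sketch below is an illustration under assumptions: `survival_counts` is a hypothetical helper and `toy` is made-up data standing in for `train_df`.

```python
import pandas as pd

def survival_counts(df, by, target="Survived"):
    """Cross-tabulate a categorical column against the target; rows are sorted by category."""
    return pd.crosstab(df[by], df[target])

# Made-up rows for illustration only; replace with train_df in the notebook.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Survived": [0, 1, 1, 0, 1],
})
table = survival_counts(toy, "Sex")
print(table)
```

Because both columns of the table share one index, calling `table.plot(kind="bar", stacked=True)` should draw an equivalent stacked chart without any manual label ordering.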

If you got value from this, and/or if you think this can be improved, please let me know in the comments. Thanks again for reading. 🙏

Follow me on LinkedIn [@abhishek-vaish](https://www.linkedin.com/in/abhishek-vaish) and Twitter [@abhishekvaish](https://twitter.com/abhishek_vaish_)