# Titanic Dataset

This notebook walks through the steps for getting familiar with and exploring a dataset in Python, followed by predictive modelling.


## EDA & Cleaning: Exploring continuous features

This notebook uses the Titanic dataset from [this Kaggle competition](https://www.kaggle.com/c/titanic/overview).

This dataset contains information about 891 people who were on board the ship when it sank on April 15, 1912. As noted in the description on Kaggle's website, some people aboard were more likely to survive the wreck than others: there were not enough lifeboats for everybody, so women, children, and the upper class were prioritized. Using the information about these 891 passengers, the challenge is to build a model that predicts which people survived based on the following fields:

- **Name** (str) - Name of the passenger
- **Pclass** (int) - Ticket class
- **Sex** (str) - Sex of the passenger
- **Age** (float) - Age in years
- **SibSp** (int) - Number of siblings and spouses aboard
- **Parch** (int) - Number of parents and children aboard
- **Ticket** (str) - Ticket number
- **Fare** (float) - Passenger fare
- **Cabin** (str) - Cabin number
- **Embarked** (str) - Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

**This section focuses on exploring the `Pclass`, `Age`, `SibSp`, `Parch`, and `Fare` features.**

In [1]:
# Import data analysis libraries

import numpy as np
import pandas as pd

In [91]:
import piplite
await piplite.install('seaborn')

# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns


In [92]:
# Read in the dataset

titanic = pd.read_csv(?)

In [None]:
# If we want to check out the first 10 rows, we put in 10 as the argument.
titanic.head(10) # a method

In [None]:
titanic.shape # an attribute

In [None]:
type(titanic)

In [None]:
titanic.columns

In [None]:
titanic.info()

In [None]:
# Generate a summary of statistics for each numerical column of the data frame
titanic.describe()

In [None]:
titanic.isnull().sum()

In [None]:
titanic['Survived'].value_counts()

## Exploratory Data Analysis using Visualization

Data visualization helps us gain insight into the data and understand what happened in a given context. Let us look at various plots that we can draw using Python.

### Drawing Plots

Matplotlib is a library for creating 2D plots of arrays in Python. It provides an extensive set of plotting APIs for creating scatter, bar, box, and distribution plots with custom styling and annotation.

Seaborn is a library for making statistical charts in Python and is well integrated with the Pandas DataFrame. It provides a high-level interface for drawing informative statistical charts.

To create graphs and plots, we need to import the `matplotlib.pyplot` and `seaborn` modules. To display the plots inline in a Jupyter Notebook, we can use the `%matplotlib inline` directive.


In [None]:
# Plot the overall survival counts

sns.countplot(x='Survived', data = titanic)
plt.show()

`seaborn.countplot` shows the counts of observations in each categorical bin using bars.

`seaborn.catplot` provides access to several axes-level functions that show the relationship between a numerical and one or more categorical variables using one of several visual representations. With `kind='count'`, it draws the same kind of chart as `countplot`.
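The counts behind these bar charts can also be computed directly with pandas, which is a handy way to check what a plot is showing. The snippet below is a minimal sketch on a small made-up DataFrame (the `toy` rows are hypothetical, not taken from the Titanic data).

```python
import pandas as pd

# Toy data, made up for illustration (not the real Titanic rows)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Sex": ["male", "female", "female", "male", "male", "male"],
})

# The bar heights that a count plot with x='Survived' would draw
survived_counts = toy["Survived"].value_counts().sort_index()
print(survived_counts)

# The bar heights when hue='Sex' is added: one count per (Survived, Sex) pair
counts_by_sex = toy.groupby(["Survived", "Sex"]).size()
print(counts_by_sex)
```

Running the same two lines on `titanic` gives the numbers behind the count plots in this section.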

In [None]:
sns.catplot(data = titanic, x = 'Survived', kind ='count')
plt.show()

In [None]:
# Break down the survival counts by Pclass

sns.catplot(data = titanic,
            x = 'Survived',
            kind = 'count',
            hue = 'Pclass')

In [None]:
sns.catplot(data = titanic,
            x = 'Survived',
            kind = 'count',
            hue = 'Sex')

In [None]:
sns.catplot(data = titanic,
            x = 'Survived',
            kind = 'count',
            hue = 'Pclass',
            col = 'Sex')

In [None]:
sns.catplot(data = titanic,
            x = 'Embarked',
            col = 'Survived',
            kind = 'count')

In [None]:
sns.catplot(data = titanic,
            x = 'Pclass',
            col = 'Survived',
            hue = 'Embarked',
            kind = 'count')

In [None]:
# Survival rate by SibSp: the point plot shows the mean of Survived for each value
sns.catplot(data = titanic,
            x = 'SibSp',
            y= 'Survived',
            kind = 'point',
            aspect = 2,
            errorbar = None)

In [None]:
sns.catplot(data = titanic,
            x = 'Parch',
            y= 'Survived',
            kind = 'point',
            aspect = 2,
            errorbar = None)

In [None]:
titanic['Family_Cnt'] = titanic['SibSp'] + titanic['Parch']
sns.catplot(data = titanic,
            x = 'Family_Cnt',
            y= 'Survived',
            kind = 'point',
            aspect = 2,
            errorbar = None)

## Data Preprocessing

From the exploratory data analysis (EDA), we should have identified which variables have missing values. In the data preprocessing stage, we can start dealing with them.

In [None]:
# Percentage of missing values in each column
titanic.isnull().sum()*100 / len(titanic)

## Handling Missing Values

In the real world, datasets are rarely clean and may have missing values. We must know how to find and deal with them.

One strategy is to remove the rows or columns that contain missing values from the dataset.

However, whenever possible, we would like to impute the data, that is, fill in the missing values with estimates such as the mean or the mode.

Let us look at an example of how to handle missing values. Only 3 of the original 12 columns (`Age`, `Cabin`, and `Embarked`) contain missing data.
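To make the trade-off between dropping and imputing concrete, here is a minimal sketch on a small made-up DataFrame (the `toy` values are hypothetical, not real Titanic rows). Dropping loses every row with any gap, while imputing keeps all rows.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not the real Titanic rows)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Strategy 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Strategy 2: impute, using the mean for the numeric column
# and the mode (most frequent value) for the categorical column
imputed = toy.copy()
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].mean())
imputed["Embarked"] = imputed["Embarked"].fillna(imputed["Embarked"].mode()[0])

print(len(toy), len(dropped), len(imputed))
print(imputed)
```

In this toy case, half of the rows disappear under the first strategy, which is why imputation is preferred when the affected columns are worth keeping.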



### Drop NAs

In [114]:
titanic_new = titanic.dropna()

In [None]:
titanic_new.isnull().sum()

In [116]:
titanic_new.shape

(183, 13)

### Fill NAs / Imputation

In [None]:
titanic.isnull().sum()

### Mode Imputation

In [None]:
# titanic.pivot_table('PassengerId', index = 'Survived', columns = 'Embarked', aggfunc='count')
# mode() returns a pandas Series (there can be several modes when values tie), so [0] selects the first value
mode_embarked = titanic['Embarked'].mode()[0]
mode_embarked
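
The reason for the `[0]` is easier to see on a tiny example. The sketch below uses made-up values (hypothetical, not from the dataset) to show that `mode()` always returns a Series and that tied values all appear in it.

```python
import pandas as pd

# Toy values, made up for illustration
ports = pd.Series(["S", "C", "S", None, "Q"])
modes = ports.mode()      # a Series, even when there is a single mode
most_common = modes[0]    # the first (here, the only) mode

# When two values tie, both are returned
tied = pd.Series(["C", "S", "S", "C"]).mode()

print(type(modes).__name__, most_common, list(tied))
```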

In [119]:
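# Assign the result back: inplace=True on a selected column may not update the DataFrame in recent pandas versions
titanic['Embarked'] = titanic['Embarked'].fillna(mode_embarked)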

In [None]:
titanic.isnull().sum()

### Mean Imputation

In [None]:
#titanic.groupby(titanic['Cabin'].isnull()).mean()

titanic[titanic['Age'].isnull()]

In [None]:
mean_age = round(titanic['Age'].mean(),1)
mean_age

In [None]:
titanic.groupby('Survived')['Age'].mean()

In [125]:
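# Assign the result back: inplace=True on a selected column may not update the DataFrame in recent pandas versions
titanic['Age'] = titanic['Age'].fillna(mean_age)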
# titanic_train.iloc[[5,19,28,863,878],:]

In [None]:
titanic.info()

In [None]:
titanic.isnull().sum()

### Deletion

In [128]:
titanic.drop(columns='Cabin', inplace=True)

In [None]:
# Check new dataset
titanic.head()

In [None]:
titanic.drop(columns=['PassengerId','Name','Ticket','Embarked'], inplace=True)

In [131]:
titanic = pd.get_dummies(titanic, columns=['Sex'])

In [None]:
titanic.head()

In [133]:
# Drop Sex_female
titanic.drop('Sex_female', axis=1, inplace=True) # axis=1 specifies that a column is being dropped. If we want to drop rows, we specify axis=0

In [None]:
titanic.head()

We repeat the same process for the `Pclass` column. (The `Sex_male` and `Sex_female` columns are called dummy variables; they are derived from the original `Sex` column.) This time, we pass the additional argument `drop_first=True` to the `get_dummies()` function so that the redundant first dummy column is dropped automatically:

In [137]:
titanic = pd.get_dummies(titanic, columns=['Pclass'], drop_first=True)

In [138]:
titanic.head()

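|   | Survived | Age | SibSp | Parch | Fare | Family_Cnt | Sex_male | Pclass_2 | Pclass_3 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 1 | 0 | 7.25 | 1 | 1 | 0 | 1 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 0 |
| 2 | 1 | 26.0 | 0 | 0 | 7.925 | 0 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 0 | 0 | 0 |
| 4 | 0 | 35.0 | 0 | 0 | 8.05 | 0 | 1 | 0 | 1 |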


In the `Pclass` column, we had 3 categories: first, second, and third class passengers. Full one-hot encoding for 3 categories works like this: if the passenger is in 1st class, `Pclass_1` = 1 and `Pclass_2` = `Pclass_3` = 0. If the passenger is in 2nd class, `Pclass_2` = 1 and `Pclass_1` = `Pclass_3` = 0, and similarly for 3rd class passengers.

In this case, all the information is captured in two columns: a passenger with `Pclass_2` = `Pclass_3` = 0 must be in 1st class, so the redundant `Pclass_1` column was dropped by the `drop_first=True` argument in the previous line of code. Likewise, if we have 4 categories in a column, we create 4 dummies and drop one, keeping 3, and so on.
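
The effect of `drop_first=True` can be checked on a small made-up column. This is a minimal sketch: the `toy` values are hypothetical, and `dtype=int` is passed only so that the dummies print as 0/1.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({"Pclass": [1, 2, 3, 1]})

# Full one-hot encoding: one column per category
full = pd.get_dummies(toy, columns=["Pclass"], dtype=int)

# drop_first=True removes the first category's column (Pclass_1)
reduced = pd.get_dummies(toy, columns=["Pclass"], drop_first=True, dtype=int)

print(full)
print(reduced)

# 1st-class rows are exactly those where both remaining dummies are 0
first_class = (reduced["Pclass_2"] == 0) & (reduced["Pclass_3"] == 0)
print(first_class.tolist())
```

No information is lost by dropping `Pclass_1`, because it can always be recovered from the two remaining columns.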

## Predictive Analysis
In order to predict survival, we first need to separate the train dataset into independent variables (all the columns except `Survived`) and the dependent variable (the target column `Survived`). The test dataset does not contain the `Survived` column because we are supposed to predict it. Next, we choose a machine learning algorithm and train it using the train dataset. Finally, we ask it to make predictions for the target column using the test dataset.
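
The workflow described above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (generated here, not the real Titanic rows), and logistic regression from scikit-learn is an assumed choice, since the notebook does not name an algorithm at this point.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data, generated for illustration only (not the real Titanic rows)
rng = np.random.default_rng(0)
n = 200
train = pd.DataFrame({
    "Age": rng.uniform(1, 80, n).round(1),
    "Fare": rng.uniform(5, 100, n).round(2),
    "Sex_male": rng.integers(0, 2, n),
})
# Made-up rule so that the toy target depends on the features
train["Survived"] = ((train["Sex_male"] == 0) | (train["Fare"] > 80)).astype(int)

# The test set has the same feature columns but no Survived column
test = pd.DataFrame({
    "Age": [30.0, 45.0, 8.0],
    "Fare": [7.25, 90.0, 20.0],
    "Sex_male": [1, 0, 0],
})

# Separate the independent variables (X) from the dependent variable (y)
X_train = train.drop(columns="Survived")
y_train = train["Survived"]

# Train the model, then predict the target column for the test set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(test)
print(predictions)
```

With the real data, `train` would be the preprocessed `titanic` DataFrame, and the test set would need the same preprocessing steps so that its columns match `X_train`.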