# Titanic dataset - Binary Classification

The Titanic dataset presents a binary classification problem: the goal is to predict whether a passenger survived.

The dataset is split into two files, train.csv and test.csv. train.csv holds the labeled training data. test.csv holds the test data and omits the Survived column, which is the target variable. The task is to predict Survived for every passenger in test.csv.

Notes:
- The dataset is small, with only 891 samples in the training data.
- The classes are moderately imbalanced: 549 samples of class 0 (about 62%) and 342 samples of class 1 (about 38%).
- The dataset contains missing values that need to be handled.
- The dataset contains categorical variables that need to be encoded.
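
The notes above can be sketched on a small toy frame. This is only an illustration under stated assumptions: the rows are made up (not real passengers), and median/mode imputation plus one-hot encoding are one common choice, not necessarily the approach used later in this notebook.

```python
import pandas as pd

# Toy stand-in for train.csv; the values are illustrative, not real rows.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, None, 35.0, None],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# Class balance as proportions instead of raw counts.
class_share = toy["Survived"].value_counts(normalize=True)

# Assumed imputation strategy: median for numeric, most frequent value for categorical.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode().iloc[0])

# One-hot encode the categorical columns.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)
print(class_share)
print(encoded)
```

On the real data, the same `value_counts(normalize=True)` call would make the 549 / 342 split directly comparable across subsets.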

The dataset is available on [Kaggle](https://www.kaggle.com/c/titanic) and contains the following columns:

| Variable    | Definition                                      | Key                                            |
|-------------|-------------------------------------------------|------------------------------------------------|
| PassengerId | Passenger identifier                            |                                                |
| Survived    | Survival (target variable)                      | 0 = No, 1 = Yes                                |
| Pclass      | Ticket class                                    | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| Name        | Passenger name                                  |                                                |
| Sex         | Sex                                             | male, female                                   |
| Age         | Age in years                                    |                                                |
| SibSp       | Number of siblings / spouses aboard the Titanic |                                                |
| Parch       | Number of parents / children aboard the Titanic |                                                |
| Ticket      | Ticket number                                   |                                                |
| Fare        | Passenger fare                                  |                                                |
| Cabin       | Cabin number                                    |                                                |
| Embarked    | Port of embarkation                             | C = Cherbourg, Q = Queenstown, S = Southampton |



In [8]:
# load the watermark extension and print environment details
%reload_ext watermark
%watermark

Last updated: 2024-05-27T16:52:00.389685+02:00

Python implementation: CPython
Python version       : 3.11.6
IPython version      : 8.24.0

Compiler    : MSC v.1935 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
CPU cores   : 12
Architecture: 64bit



In [9]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# show floats with two decimals in DataFrame output
pd.set_option('display.float_format', '{:.2f}'.format)

%watermark -w
%watermark -iv

Watermark: 2.4.3

pandas    : 2.1.4
numpy     : 1.26.4
seaborn   : 0.13.2
matplotlib: 3.7.5



In [10]:
# load dataset
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')

# combine the train and test sets into a single DataFrame, tagging each row's origin
train_df['source'] = 'train'
test_df['source'] = 'test'
data = pd.concat([train_df, test_df], ignore_index=True)

# print the shape of the dataset
print('Train dataset shape (rows, columns):', train_df.shape)
print('Test dataset shape (rows, columns):', test_df.shape)

data.head()

Train dataset shape (rows, columns): (891, 13)
Test dataset shape (rows, columns): (418, 12)


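|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | source |
|---|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|--------|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | train |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.28 | C85 | C | train |
| 2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.92 | NaN | S | train |
| 3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | train |
| 4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | train |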
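
Because the two sets are concatenated with a `source` tag, they can be separated again once shared preprocessing is done. The sketch below uses made-up toy frames (`train_toy`, `test_toy` are hypothetical names, not part of this notebook) to show the round trip, and why `Survived` appears as a float in the combined frame: the test rows have no label, so pandas fills them with NaN and upcasts the column.

```python
import pandas as pd

# Hypothetical toy frames; the test frame has no Survived column.
train_toy = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Age": [22.0, 38.0]})
test_toy = pd.DataFrame({"PassengerId": [3], "Age": [34.5]})

train_toy["source"] = "train"
test_toy["source"] = "test"
combined = pd.concat([train_toy, test_toy], ignore_index=True)

# Survived is float64 here because the test row holds NaN.
print(combined["Survived"].dtype)

# Split back by the source tag and restore the integer label on the train part.
train_back = combined[combined["source"] == "train"].drop(columns="source").copy()
train_back["Survived"] = train_back["Survived"].astype(int)

test_back = (
    combined[combined["source"] == "test"]
    .drop(columns=["source", "Survived"])
    .reset_index(drop=True)
)
print(train_back.shape, test_back.shape)
```

One caveat with this pattern: statistics computed on the combined frame (for example a median used for imputation) also see the test rows, which may or may not be acceptable depending on how strictly the test set should be held out.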


In [11]:
# inspect dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
 12  source       1309 non-null   object 
dtypes: float64(3), int64(4), object(6)
memory usage: 133.1+ KB


In [12]:
data.describe()

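|       | PassengerId | Survived | Pclass | Age    | SibSp  | Parch  | Fare   |
|-------|-------------|----------|--------|--------|--------|--------|--------|
| count | 1309.0      | 891.0    | 1309.0 | 1046.0 | 1309.0 | 1309.0 | 1308.0 |
| mean  | 655.0       | 0.38     | 2.29   | 29.88  | 0.5    | 0.39   | 33.3   |
| std   | 378.02      | 0.49     | 0.84   | 14.41  | 1.04   | 0.87   | 51.76  |
| min   | 1.0         | 0.0      | 1.0    | 0.17   | 0.0    | 0.0    | 0.0    |
| 25%   | 328.0       | 0.0      | 2.0    | 21.0   | 0.0    | 0.0    | 7.9    |
| 50%   | 655.0       | 0.0      | 3.0    | 28.0   | 0.0    | 0.0    | 14.45  |
| 75%   | 982.0       | 1.0      | 3.0    | 39.0   | 1.0    | 0.0    | 31.27  |
| max   | 1309.0      | 1.0      | 3.0    | 80.0   | 8.0    | 9.0    | 512.33 |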


In [13]:
# check for duplicates
data.duplicated().sum()
# test_df.duplicated().sum()

0
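
A full-row check like the one above cannot flag anything if every row carries a unique identifier, which appears to be the case for `PassengerId` (an assumption here, not verified in this notebook). A minimal sketch on made-up rows, comparing the full-row check with one that ignores the identifier and the `source` tag:

```python
import pandas as pd

# Toy rows: the first two differ only in PassengerId and source.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["A", "A", "B"],
    "Ticket": ["100", "100", "200"],
    "source": ["train", "test", "train"],
})

# Full-row comparison: the unique id makes every row distinct.
full_row_dupes = int(toy.duplicated().sum())

# Compare only the content columns.
content_cols = [c for c in toy.columns if c not in ("PassengerId", "source")]
content_dupes = int(toy.duplicated(subset=content_cols).sum())

print(full_row_dupes, content_dupes)
```

Whether a content-level match is a true duplicate is a judgment call; two passengers can legitimately share many field values.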

In [14]:
# count missing values
data.isnull().sum()
# test_df.isnull().sum()

PassengerId       0
Survived        418
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2
source            0
dtype: int64

The combined dataset has 1309 rows (passengers). The 418 missing Survived values are simply the test rows, which carry no label. The genuinely missing values are:
- 263 in the Age column (about 20% of rows)
- 1 in the Fare column
- 1014 in the Cabin column (about 77% of rows)
- 2 in the Embarked column
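
Counts are easier to compare as percentages, and it can help to see how the gaps split between the train and test rows. The following is a minimal sketch on a made-up frame; only the `source` column name is taken from this notebook.

```python
import pandas as pd

# Toy frame with a source tag, mimicking the combined dataset's layout.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "source": ["train", "train", "test", "test"],
})

# Missing values per column as count and percentage, largest share first.
missing = pd.DataFrame({
    "count": toy.isnull().sum(),
    "percent": toy.isnull().mean() * 100,
})
missing = missing[missing["count"] > 0].sort_values("percent", ascending=False)

# Missing counts broken down by train/test origin.
by_source = toy.drop(columns="source").isnull().groupby(toy["source"]).sum()

print(missing)
print(by_source)
```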
