# Data Preprocessing for Titanic Life Prediction

**About the Titanic Life Prediction**

This project uses the built-in `titanic` dataset from the Seaborn library, which contains records for 891 passengers aboard the ill-fated voyage, roughly 40% of the approximately 2,200 people (passengers and crew) on board. Each row represents one passenger, with attributes describing their background, ticket information, and, most importantly, whether they survived the disaster.

Logistic regression suits the Titanic dataset because the task is binary classification: predicting whether a passenger survived or died from a set of predictor variables. The target variable (`survived`) takes only two values, 1 for passengers who survived and 0 for those who died, and logistic regression is designed for exactly this kind of outcome. It also does more than predict class labels: it provides probability estimates of survival.
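
To make the idea of a probability estimate concrete, here is a minimal sketch of the logistic (sigmoid) function turning a linear score into a survival probability. The intercept, the coefficients, and the two example passengers are made-up values for illustration only; nothing here is fitted to the dataset.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters, NOT fitted to the Titanic data.
intercept = 1.0
coef = np.array([-0.9, 2.5])  # weights for [pclass, is_female]

# Two illustrative passengers: a third-class man and a first-class woman.
passengers = np.array([[3, 0],
                       [1, 1]])

scores = intercept + passengers @ coef   # linear part of the model
probs = sigmoid(scores)                  # estimated probability of survival
labels = (probs >= 0.5).astype(int)      # class label at a 0.5 threshold

print(np.round(probs, 3), labels)
```

The 0.5 threshold is only the conventional default; the probabilities themselves are the model's primary output and the cut-off can be moved.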

## Libraries
- Numpy and pandas
- Matplotlib and Seaborn
- Warnings (to suppress non-critical warning messages)

# Tasks
- Import Libraries
- Load Dataset
- Understanding Dataset
- Sanity Check
- Exploratory Data Analysis
- Handling Missing Values
- Handle Categorical Values
- Handle Outliers
- Data Type Optimization
- Feature Engineering
- Save The File

# Import libraries

In [1]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# silence all warning messages to keep the notebook output clean
warnings.filterwarnings("ignore")

# Load Dataset

In [2]:
# loading dataset from seaborn library
df = sns.load_dataset('titanic')

In [3]:
df.head()

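|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|----------|--------|-----|-----|-------|-------|------|----------|-------|-----|------------|------|-------------|-------|-------|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |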


# Sanity Check
- Duplicate Handling
- Remove Unneeded Columns

In [4]:
# quick overview of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
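
The non-null counts above already show which columns have gaps. As a small worked example, the sketch below turns those counts into missing-value totals and percentages; the numbers are copied by hand from the `df.info()` output rather than read from the DataFrame.

```python
# Row count and non-null counts copied from the df.info() output above.
n_rows = 891
non_null = {"age": 714, "embarked": 889, "deck": 203, "embark_town": 889}

# Missing values per column, as a count and as a percentage of all rows.
missing = {col: n_rows - n for col, n in non_null.items()}
missing_pct = {col: round(100 * m / n_rows, 1) for col, m in missing.items()}

# Print the columns from most to least incomplete.
for col in sorted(missing, key=missing.get, reverse=True):
    print(f"{col:<12} {missing[col]:>4} missing ({missing_pct[col]}%)")
```

On the loaded DataFrame itself, `df.isna().sum()` and `df.isna().mean()` give the same counts and fractions directly.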


## Handling Duplicates

In [5]:
# check duplicates
df.duplicated().sum()

np.int64(107)

In [6]:
# Check what makes records "duplicate"
duplicates = df[df.duplicated()]
print("Sample duplicates:")
print(duplicates.head(10))

Sample duplicates:
     survived  pclass     sex   age  sibsp  parch     fare embarked   class  \
47          1       3  female   NaN      0      0   7.7500        Q   Third   
76          0       3    male   NaN      0      0   7.8958        S   Third   
77          0       3    male   NaN      0      0   8.0500        S   Third   
87          0       3    male   NaN      0      0   8.0500        S   Third   
95          0       3    male   NaN      0      0   8.0500        S   Third   
101         0       3    male   NaN      0      0   7.8958        S   Third   
121         0       3    male   NaN      0      0   8.0500        S   Third   
133         1       2  female  29.0      1      0  26.0000        S  Second   
173         0       3    male  21.0      0      0   7.9250        S   Third   
196         0       3    male   NaN      0      0   7.7500        Q   Third   

       who  adult_male deck  embark_town alive  alone  
47   woman       False  NaN   Queenstown   yes   True  

**Note:** Although `df.duplicated()` flags 107 of the 891 rows, we do not drop them.

**Why Keep Duplicates**
- *No Unique Identifier:* The Seaborn version of the dataset has no name, ticket number, or passenger ID column, so two different passengers with the same recorded attributes produce identical rows
- *Real-World Reality:* Multiple passengers genuinely shared these characteristics, such as the third-class men of unrecorded age travelling alone in the sample above
- *Small Dataset:* With only 891 records, removing 107 (12%) would significantly reduce the dataset
- *No True Duplicates:* Nothing suggests these are data entry errors; they are legitimate, similar cases

**What These Duplicates Likely Represent**
- Passengers of the same class, sex, and age (or with age missing) travelling alone on the same fare
- Family members or companions who share class, fare, and port of embarkation
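
To illustrate how identical rows arise without a unique identifier, here is a minimal sketch on a hand-built sample modeled on five of the rows printed above (only the columns from the first block of that output are used). It shows that `duplicated()` counts only the repeats after a first occurrence, while `keep=False` marks every row in a group of identical rows, and that pandas treats missing ages as equal when comparing rows.

```python
import numpy as np
import pandas as pd

# Hand-built sample modeled on rows 76, 77, 87, 95 and 101 of the output above.
sample = pd.DataFrame(
    {
        "survived": [0, 0, 0, 0, 0],
        "pclass": [3, 3, 3, 3, 3],
        "sex": ["male"] * 5,
        "age": [np.nan] * 5,          # NaN values compare as equal in duplicated()
        "sibsp": [0] * 5,
        "parch": [0] * 5,
        "fare": [7.8958, 8.05, 8.05, 8.05, 7.8958],
        "embarked": ["S"] * 5,
    },
    index=[76, 77, 87, 95, 101],
)

# Repeats of an earlier row (the first occurrence is not counted).
n_repeats = int(sample.duplicated().sum())

# Every row that belongs to a group of identical rows.
n_in_groups = int(sample.duplicated(keep=False).sum())

# Size of each group; within this sample the rows differ only by fare.
group_sizes = sample.groupby("fare").size()

print(n_repeats, n_in_groups)
print(group_sizes)
```

Within this sample, two distinct profiles account for all five rows, which is the pattern expected from passengers who are different people but share the same recorded attributes.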

## Remove Unneeded Columns

In [7]:
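# drop the sparse column (deck) and columns that repeat others: embarked (embark_town), alive (survived), class (pclass), adult_male (who)
df.drop(columns=['deck', 'embarked', 'alive', 'adult_male', 'class'], inplace=True)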

In [8]:
df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'who',
       'embark_town', 'alone'],
      dtype='object')
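
Four of the dropped columns repeat information kept elsewhere, and `alive` in particular is a text copy of the target. The sketch below checks those relationships on the five rows shown by `df.head()` earlier, rebuilt by hand so the snippet runs without downloading the data. The mapping dictionaries are assumptions about how the columns correspond, and five rows can only illustrate the relationship, not prove it for the whole dataset.

```python
import pandas as pd

# The first five rows of the dataset, copied from the df.head() output.
head = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0],
    "pclass": [3, 1, 3, 1, 3],
    "embarked": ["S", "C", "S", "S", "S"],
    "class": ["Third", "First", "Third", "First", "Third"],
    "who": ["man", "woman", "woman", "woman", "man"],
    "adult_male": [True, False, False, False, True],
    "embark_town": ["Southampton", "Cherbourg", "Southampton",
                    "Southampton", "Southampton"],
    "alive": ["no", "yes", "yes", "yes", "no"],
})

# Assumed correspondences between each dropped column and the one kept.
checks = {
    "alive vs survived": (head["alive"].map({"no": 0, "yes": 1}) == head["survived"]).all(),
    "class vs pclass": (head["class"].map({"First": 1, "Second": 2, "Third": 3}) == head["pclass"]).all(),
    "embarked vs embark_town": (head["embark_town"].str[0] == head["embarked"]).all(),
    "adult_male vs who": ((head["who"] == "man") == head["adult_male"]).all(),
}

for name, ok in checks.items():
    print(f"{name:<24} consistent: {bool(ok)}")
```

Running the same checks on the full DataFrame would need extra care: `class` is stored as a categorical, and `embarked` and `embark_town` each contain two missing values.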