# **The Titanic Dataset: A Journey Through Data Cleaning**  
**Author: Zahra Haider**  
`Created: [06/04/2025]`  

---

## **🌊 Introduction: Setting Sail with Data**  

Welcome aboard the Titanic dataset! Just like the passengers of the ill-fated ship, our dataset has its own share of mysteries and challenges. But fear not—today, we'll navigate through the rough seas of missing values, inconsistencies, and duplicates to uncover a clean and reliable dataset.  

## 🛠️ **Tools Installed** for this voyage:  
- **Python 3.12.4**  
- **Jupyter 1.1.1**  
- **VS Code (`code`) 1.99.0**  

Let's dive in!  

---

## **🔍 First Glance: The Raw Data**  

When we first load the dataset, it's like opening an old, weathered passenger manifest. Here's what we see:  

In [13]:
import pandas as pd  # Pandas is like Excel for Python

# Load the CSV file
df = pd.read_csv('Titanic.csv')  # Replace with your file path
df.head()  # Show first 5 rows (see the mess?)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


## Observations:

- Missing values in `Cabin` (most are `NaN`)
- Other columns like `Age` and `Fare` also have gaps
- The dataset feels incomplete—like a puzzle missing a few pieces

---

## Check missing values

---

In [14]:
print(df.isnull().sum())  # Counts NaN per column

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
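
Raw counts are easier to judge as a share of all rows. A minimal sketch, using a small made-up frame (`toy` and its values are hypothetical stand-ins, not the real Titanic data):

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "Age": [34.5, None, 62.0, None],
    "Fare": [7.8292, 7.0, None, 8.6625],
    "Cabin": [None, None, None, "B45"],
})

# isnull() yields True/False; mean() treats True as 1, so the result
# is the fraction of missing entries in each column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, the same call on `df` would show at a glance which columns are worth repairing and which are mostly gaps.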


# 🧹 Chapter 1: The Great Cleanup

---

## 1. The Case of the Missing Ages

Some passengers' ages are unknown (86 missing values). To fix this, we'll fill the gaps with the mean of the known ages.

---

In [15]:
# Assign the result back to the column; calling fillna(..., inplace=True)
# on df['Age'] triggers a pandas FutureWarning and will stop working in pandas 3.0
df['Age'] = df['Age'].fillna(df['Age'].mean())


**Why the average?** *It's a fair guess—better than leaving blanks!*
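
One trade-off is worth seeing in numbers: filling with the mean keeps the column's mean the same but shrinks its spread, because every filled value sits exactly at the centre. A small sketch with made-up ages (the `ages` series is hypothetical):

```python
import pandas as pd

# Hypothetical ages with two gaps (values are made up for illustration).
ages = pd.Series([20.0, None, 40.0, None, 60.0], name="Age")

mean_age = ages.mean()          # NaN entries are skipped: (20 + 40 + 60) / 3
filled = ages.fillna(mean_age)  # returns a new Series; `ages` is unchanged

print(mean_age)
print(filled.tolist())
print(ages.std(), filled.std())  # the spread shrinks after filling
```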

---

## 2. The Vanishing Cabin Column

The Cabin column is mostly empty (327 missing values!). Since more of its values are missing than present, we’ll drop it entirely.

---

In [16]:
df.drop('Cabin', axis=1, inplace=True)

*Farewell, mysterious cabins!*
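
The "more missing than present" rule can also be written as a threshold instead of naming the column by hand. A rough sketch, assuming a cutoff of 50% (`MAX_MISSING` and the `toy` frame are hypothetical, not part of the original notebook):

```python
import pandas as pd

# Hypothetical toy frame (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [34.5, None, 62.0, 27.0],
    "Cabin": [None, None, None, "B45"],
    "Embarked": ["Q", "S", "Q", "S"],
})

MAX_MISSING = 0.5  # assumed cutoff: drop columns with more than half missing

missing_share = toy.isna().mean()
to_drop = missing_share[missing_share > MAX_MISSING].index.tolist()
trimmed = toy.drop(columns=to_drop)

print(to_drop)
print(trimmed.columns.tolist())
```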

---

## 3. The Embarked Mystery

The missing-value check above found no gaps in Embarked (the port where each passenger boarded), so this step changes nothing in this file. We still **drop any rows** without a port as a safeguard, in case another copy of the data has gaps.

---

In [17]:
df.dropna(subset=['Embarked'], inplace=True)

*No ticket, no boarding!*
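
Before dropping rows, it helps to know how many will go overboard. A minimal sketch that compares row counts before and after `dropna` (the `toy` frame is made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with one missing port (values are made up).
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Embarked": ["Q", None, "S"],
})

rows_before = len(toy)
cleaned = toy.dropna(subset=["Embarked"])  # only Embarked is checked for NaN
rows_removed = rows_before - len(cleaned)

print(f"Removed {rows_removed} of {rows_before} rows")
```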

---

## 4. The Fare Gap

One passenger’s fare is missing. We’ll fill it with the **median fare** (less affected by outliers than the average).

---

In [18]:
# Assign the result back to the column; calling fillna(..., inplace=True)
# on df['Fare'] triggers a pandas FutureWarning and will stop working in pandas 3.0
df['Fare'] = df['Fare'].fillna(df['Fare'].median())


*Fair fare for all!*
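
To see why the median holds up better against outliers, here is a small sketch with made-up fares where one luxury ticket dwarfs the rest (the `fares` values are hypothetical):

```python
import pandas as pd

# Hypothetical fares: four modest tickets and one very expensive one.
fares = pd.Series([7.0, 8.0, 9.0, 10.0, 500.0], name="Fare")

mean_fare = fares.mean()      # pulled upward by the single 500.0 ticket
median_fare = fares.median()  # the middle value, unaffected by the outlier

print(mean_fare, median_fare)
```

Here the mean lands far above what four of the five passengers paid, while the median stays at a typical fare.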

---

## 5. The Duplicate Passengers

Duplicate rows can skew our analysis. Let’s ensure each passenger is unique:

---

In [19]:
df.drop_duplicates(inplace=True)

*No doppelgängers allowed!*
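
It can be useful to count duplicates before removing them. Note that `drop_duplicates()` with no arguments only removes rows that match in every column; checking a key such as `PassengerId` is a separate question. A sketch with made-up rows (the `toy` frame and its names are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame (names are made up): one exact duplicate row,
# plus one repeated PassengerId whose name is spelled differently.
toy = pd.DataFrame({
    "PassengerId": [892, 893, 893, 894, 894],
    "Name": ["A", "B", "B", "C", "C (variant)"],
})

n_full_dupes = toy.duplicated().sum()                       # identical rows
n_id_dupes = toy.duplicated(subset=["PassengerId"]).sum()   # repeated IDs
deduped = toy.drop_duplicates()

print(n_full_dupes, n_id_dupes, len(deduped))
```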

---

## ✨ The Cleaned Dataset

After all our efforts, here’s the polished dataset:

---

In [20]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,S


## What’s improved?

- No missing values in `Age`, `Fare`, or `Embarked`.
- The `Cabin` column is gone—no more clutter.
- Every row is unique.
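
These three claims can be checked rather than taken on faith. A minimal sketch, shown on a made-up frame (`cleaned` here is a hypothetical stand-in; on the real data the same checks would run against `df`):

```python
import pandas as pd

# Hypothetical stand-in for the cleaned data (values are made up).
cleaned = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Age": [34.5, 47.0, 62.0],
    "Fare": [7.8292, 7.0, 9.6875],
    "Embarked": ["Q", "S", "Q"],
})

remaining_missing = cleaned[["Age", "Fare", "Embarked"]].isnull().sum().sum()
has_cabin = "Cabin" in cleaned.columns
n_dupes = cleaned.duplicated().sum()

print(remaining_missing, has_cabin, n_dupes)
```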

---

## 📁 Saving Our Work

Finally, we’ll save this cleaned dataset for future voyages:

---

In [21]:
df.to_csv('Titanic_clean.csv', index=False)  # Save cleaned version

*Our dataset is now shipshape and ready for analysis!*
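
A note on `index=False`: without it, pandas also writes the row index, and reading the file back produces an extra `Unnamed: 0` column. A small round-trip sketch using a temporary folder (the `toy` frame and file names are hypothetical):

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame (values are made up for illustration).
toy = pd.DataFrame({"PassengerId": [892, 893], "Fare": [7.25, 7.0]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    toy.to_csv(with_index)                   # row index is written too
    toy.to_csv(without_index, index=False)   # only the data columns

    cols_with = pd.read_csv(with_index).columns.tolist()
    cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```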

---

# 🔮 Epilogue: What's Next?

With our data cleaned, the real adventure begins. We can now:

1. Explore survival rates by class or gender
2. Visualize passenger demographics
3. Train machine learning models to predict survival

*But that's a story for another day...*

**Until then, happy analyzing!** 🚢
