# Titanic Dataset – Data Cleaning & Preprocessing

This notebook cleans the Titanic dataset to prepare it for exploratory data
analysis (EDA). It inspects the structure of the data, checks for duplicate rows,
and fills missing values so that later analysis runs on a complete dataset.


# 1. Importing Libraries

In [28]:
# Importing essential libraries for data analysis and visualization
import pandas as pd              # For data loading, manipulation, and analysis
import numpy as np               # For numerical computations
import seaborn as sns            # For statistical visualizations
import matplotlib.pyplot as plt  # For creating plots and customizing charts

# Optional styling for cleaner visuals
plt.style.use('ggplot')


Why:
To import all necessary libraries for data analysis and visualization.

Observation:
Libraries loaded successfully. Ready to proceed with EDA.


# 2. Loading Dataset

In [27]:
# Loading dataset
df = pd.read_csv('train.csv')   # Load Titanic dataset
df.head()                       # Preview first 5 rows

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Why:
To view the first few rows and confirm all 12 Titanic columns are loaded correctly.

Observation:
The dataset contains passenger details such as Survived, Pclass, Sex, Age, Fare, Cabin, and Embarked.
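
The column check described above can also be done in code rather than by eye. The following is a minimal sketch: the `EXPECTED_COLUMNS` list is copied from the preview above, and the inline two-row sample is an assumption that stands in for `train.csv` so the snippet runs on its own.

```python
import io
import pandas as pd

# Assumed list of the 12 columns, copied from the preview above
EXPECTED_COLUMNS = [
    "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
    "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked",
]

# Tiny inline sample standing in for train.csv (illustrative only)
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S\n'
    '3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S\n'
)
sample_df = pd.read_csv(sample_csv)

# Columns that are expected but absent, and columns that are present but unexpected
missing_cols = [c for c in EXPECTED_COLUMNS if c not in sample_df.columns]
extra_cols = [c for c in sample_df.columns if c not in EXPECTED_COLUMNS]

print("Missing columns:", missing_cols)
print("Unexpected columns:", extra_cols)
```

With the real file, `sample_df` would simply be replaced by `df`; two empty lists indicate that the expected schema was loaded.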

# 3. Inspecting Dataset Structure

In [20]:
# Checking the number of rows and columns in the dataset
df.shape

(891, 12)

Why:
To understand the size of the dataset before proceeding with cleaning and analysis.

Observation:
The dataset contains 891 rows and 12 columns, which is small enough to inspect in full during EDA.

In [8]:
# Structure + data types + missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Why:
To check dataset shape, data types, and columns with missing values.

Observation:
891 rows and 12 columns. Age (714 non-null), Cabin (204 non-null), and Embarked (889 non-null) have missing values.

In [9]:
# Statistical summary of numerical and categorical columns
df.describe(include='all')

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Dooley, Mr. Patrick",male,,,,347082.0,,G6,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


Why:
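To review summary statistics before creating visualizations.

Observation:
Mean age is about 29.7 years (median 28). Fare varies widely: the mean is 32.20 but the median is only 14.45, so the distribution is right-skewed. The median Pclass is 3, so at least half of the passengers travelled in 3rd class.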

In [10]:
# Checking the total number of missing values in each column
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Why:
To know which columns require basic cleaning to avoid plot errors.

Observation:
Cabin is missing in 687 of 891 rows (about 77%), Age in 177 rows (about 20%), and Embarked in only 2 rows.
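
Raw counts are easier to compare when shown next to percentages. The sketch below builds such a summary on a small toy frame; the frame and its values are assumptions for illustration, and the same two expressions would apply to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# Count and percentage of missing values per column, largest first
missing_summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(missing_summary)
```

`isnull().mean()` works because booleans average to the fraction of `True` values in each column.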

In [11]:
# Counting how many passengers survived (1) and did not survive (0)
df['Survived'].value_counts()


Survived
0    549
1    342
Name: count, dtype: int64

Why:
To check survival distribution and class balance.

Observation:
549 passengers (about 61.6%) did not survive and 342 (about 38.4%) survived, so the two classes are moderately imbalanced.
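
Class balance is usually read as proportions rather than counts. As a small sketch, the series below is rebuilt from the two counts reported in the output above (an assumption made so the snippet runs without `train.csv`); on the real data the same call would be `df['Survived'].value_counts(normalize=True)`.

```python
import pandas as pd

# Rebuild the Survived column from the counts shown above: 549 zeros, 342 ones
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# normalize=True returns proportions instead of raw counts
survival_share = survived.value_counts(normalize=True).mul(100).round(1)

print(survival_share)
```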

In [12]:
# Checking duplicate rows
df.duplicated().sum()


np.int64(0)

Why:
To verify whether the dataset contains any repeated rows that could affect the accuracy of analysis.

Observation:
The duplicate count is 0, meaning there are no repeated rows in the dataset and no duplicate removal is required.
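
`df.duplicated()` only flags rows that match on every column. A stricter check, sketched here on a toy frame with made-up values, also looks for repeated identifiers through the `subset` parameter. Treating `PassengerId` as the unique key is an assumption of this example.

```python
import pandas as pd

# Toy frame (illustrative values): row 2 repeats row 1 exactly,
# row 4 reuses PassengerId 3 with a different Fare
toy = pd.DataFrame({
    "PassengerId": [1, 2, 2, 3, 3],
    "Name": ["A", "B", "B", "C", "C"],
    "Fare": [7.25, 71.28, 71.28, 8.05, 9.00],
})

# Rows identical to an earlier row in every column
full_dupes = int(toy.duplicated().sum())

# Rows whose PassengerId already appeared, regardless of other columns
id_dupes = int(toy.duplicated(subset=["PassengerId"]).sum())

# Dropping exact duplicates keeps the first occurrence of each row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(full_dupes, id_dupes, len(deduped))
```

If the identifier-based count were higher than the full-row count, the conflicting rows would need manual review rather than automatic removal.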

# 4. Handling Missing Values

In [17]:
# Filling missing age with median (numeric)
df['Age'] = df['Age'].fillna(df['Age'].median())

# Filling Embarked with mode (categorical)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Cabin has too many missing values, marking Unknown
df['Cabin'] = df['Cabin'].fillna('Unknown')


Why:
To ensure visualizations do not break due to missing values.

Observation:
Missing values handled using simple median/mode/Unknown filling suitable for EDA.
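
Filling with a single median or a placeholder string hides which values were originally missing. The sketch below shows two optional variations, not the approach used in this notebook: recording missingness in indicator columns before filling, and filling Age with a per-class median. The column names `AgeWasMissing` and `HasCabin` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, np.nan, 50.0, 20.0, 24.0, np.nan],
    "Cabin": ["C85", np.nan, "C123", np.nan, np.nan, np.nan],
})

# Record which values were missing before they are overwritten
toy["AgeWasMissing"] = toy["Age"].isnull()
toy["HasCabin"] = toy["Cabin"].notnull()

# Alternative to one global median: fill Age with the median of each Pclass
class_median_age = toy.groupby("Pclass")["Age"].transform("median")
toy["Age"] = toy["Age"].fillna(class_median_age)

# Same placeholder as in the notebook cell above
toy["Cabin"] = toy["Cabin"].fillna("Unknown")

print(toy)
```

Whether a group-wise median is preferable depends on how strongly Age differs between classes, which is something the EDA itself can check.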

In [18]:
# Checking missing values after cleaning
df.isnull().sum()


PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

Why:
To confirm that no null entries remain after filling, so that visualizations and statistical analysis do not break.

Observation:
Age values were filled with the median, Embarked with the mode, and Cabin was replaced with “Unknown” due to many missing entries.
All missing values have now been handled, and the dataset is clean for EDA.
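
The visual check above can be turned into an automatic guard that fails loudly if a null slips through. This is a minimal sketch; `assert_no_missing` is a hypothetical helper written for this example, not a pandas function, and the toy frames are illustrative.

```python
import numpy as np
import pandas as pd


def assert_no_missing(frame: pd.DataFrame) -> None:
    """Raise ValueError naming every column that still contains nulls."""
    remaining = frame.isnull().sum()
    remaining = remaining[remaining > 0]
    if not remaining.empty:
        raise ValueError(f"Columns with missing values: {remaining.to_dict()}")


# A fully filled toy frame passes silently
cleaned_toy = pd.DataFrame({
    "Age": [22.0, 28.0],
    "Cabin": ["Unknown", "C85"],
    "Embarked": ["S", "C"],
})
assert_no_missing(cleaned_toy)

# A frame with a leftover NaN triggers the error
dirty_toy = cleaned_toy.assign(Age=[22.0, np.nan])
error_message = ""
try:
    assert_no_missing(dirty_toy)
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

Calling such a helper on `df` just before saving would stop the notebook from exporting a partially cleaned file.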

# 5. Final Cleaned Dataset 

In [21]:
# Final preview after data cleaning
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,Unknown,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,Unknown,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,Unknown,S


Why:
To verify that all cleaning steps have been applied correctly and confirm that
the dataset is now free from missing values or inconsistencies.

Observation:
The first few rows show the cleaned dataset with filled Age, Embarked, and Cabin
values, no duplicate rows, and consistent formatting. This confirms that the
dataset is ready to be used for EDA in the next notebook.


# 6. Saving Cleaned File

In [1]:
# Save the cleaned dataset (requires df from the cells above)
df.to_csv("cleaned_titanic.csv", index=False)

NameError: name 'df' is not defined

Why:
To export the cleaned dataset into a separate CSV file so it can be used later
in the EDA notebook without repeating the cleaning steps.

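Observation:
The recorded run of this cell failed with `NameError: name 'df' is not defined`. Its
execution count is `In [1]`, so it ran in a fresh kernel before `df` was loaded.
After re-running the cells above in order, this cell writes the cleaned dataset to
**cleaned_titanic.csv**, which is the input for all visualizations in the EDA phase.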
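
A quick round trip can confirm that the exported file reloads as expected. The sketch below writes a toy frame to a temporary directory and reads it back; the frame and the temporary path are assumptions for illustration, while `index=False` matches the call used above.

```python
import os
import tempfile
import pandas as pd

# Toy frame standing in for the cleaned data (values are illustrative only)
cleaned = pd.DataFrame({
    "PassengerId": [1, 2],
    "Age": [22.0, 38.0],
    "Cabin": ["Unknown", "C85"],
    "Embarked": ["S", "C"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "cleaned_titanic.csv")
    # index=False keeps the row index out of the file, so no extra
    # "Unnamed: 0" column appears when the file is read again
    cleaned.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_columns = list(reloaded.columns) == list(cleaned.columns)
no_new_nulls = int(reloaded.isnull().sum().sum()) == 0

print("Columns preserved:", same_columns)
print("No nulls after reload:", no_new_nulls)
```

The placeholder `"Unknown"` survives the round trip because it is not one of the strings that `read_csv` treats as missing by default.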
