In [3]:
import pandas as pd
import numpy as np

# Load dataset
df = pd.read_csv("/content/Titanic-Dataset.csv")

# View first rows
df.head()


,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


pd.read_csv() reads the CSV file into a DataFrame.

.head() returns the first 5 rows by default, which gives a quick look at the columns and sample values.
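
As a small illustration of these two calls, the sketch below reads a CSV from an in-memory string rather than from the file above. The three rows are copied from the output shown (with fewer columns), and io.StringIO is only a stand-in for the real file path.

```python
import io
import pandas as pd

# Three rows copied from the head() output above, trimmed to six columns
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age
1,0,3,"Braund, Mr. Owen Harris",male,22.0
3,1,3,"Heikkinen, Miss. Laina",female,26.0
5,0,3,"Allen, Mr. William Henry",male,35.0
"""

# read_csv accepts any file-like object, not just a path
sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows; the default n is 5
first_two = sample.head(2)

print(sample.shape)
print(first_two)
```

Quoted fields such as the Name values keep their embedded commas, which is why the names stay in a single column.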



In [4]:
# Dataset info
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


.info() lists each column's name, dtype, and non-null count. Comparing those counts with the 891 total rows shows that Age (714), Cabin (204), and Embarked (889) have missing values.

In [5]:
df.isnull().sum()


PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,177
SibSp,0
Parch,0
Ticket,0
Fare,0
Cabin,687
Embarked,2


The .isnull().sum() chain counts missing values in each column.

Columns that are mostly missing (e.g., Cabin, with 687 of 891 values missing) may be dropped.

Numeric columns with a moderate share of missing values (e.g., Age, with 177 missing) can be filled using the mean or median.

Categorical columns with only a few missing values (e.g., Embarked, with 2 missing) can be filled using the mode.

This step helps decide the cleaning strategy before modeling or reporting.
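
Raw counts are easier to compare across columns when expressed as a share of rows. The sketch below shows one way to do that on a toy frame with illustrative values. The 50% cutoff (DROP_THRESHOLD) is an assumption made for the example, not a rule taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# isnull() gives booleans, so mean() is the fraction of missing values per column
missing_share = toy.isnull().mean()

# Hypothetical cutoff: treat a column as a drop candidate above 50% missing
DROP_THRESHOLD = 0.5
drop_candidates = missing_share[missing_share > DROP_THRESHOLD].index.tolist()

print(missing_share)
print(drop_candidates)
```

Columns below the cutoff remain candidates for filling, as described above.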

In [16]:
# Fill missing numeric values using median
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing categorical values using mode
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Drop column with excessive missing values (if exists)
df = df.drop(columns=['Cabin'], errors='ignore')


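Missing Age values were filled with the median, which is less sensitive to outliers than the mean.

Missing Embarked values were filled with the mode (the most frequent value).

Cabin was dropped because most of its values are missing; errors='ignore' lets the cell be re-run after the column is already gone.

Each result is assigned back to the column instead of calling fillna(inplace=True) on a column selection, a pattern that recent pandas versions warn about as chained assignment.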
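
To see why the median is the more robust choice, the sketch below compares it with the mean on a short series containing one extreme value. The ages are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative ages: three typical values, one outlier, one missing
ages = pd.Series([20.0, 22.0, 24.0, 80.0, np.nan])

# Both statistics skip NaN by default
mean_age = ages.mean()      # pulled upward by the outlier
median_age = ages.median()  # middle of the observed values

# Assign the result instead of using inplace=True
filled = ages.fillna(median_age)

print(mean_age, median_age)
print(filled.tolist())
```

Here the mean is 36.5 and the median is 23.0, so the median-filled value stays close to the typical ages.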

In [7]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB


In [17]:
# Check number of rows before removing duplicates
rows_before = df.shape[0]

# Remove duplicate rows
df = df.drop_duplicates()

# Check number of rows after removing duplicates
rows_after = df.shape[0]

print("Rows before removing duplicates:", rows_before)
print("Rows after removing duplicates:", rows_after)
print("Duplicates removed:", rows_before - rows_after)


Rows before removing duplicates: 891
Rows after removing duplicates: 891
Duplicates removed: 0


.drop_duplicates() removes rows that are identical in every column, keeping the first occurrence.

The row count is compared before and after removal to verify the operation.

Here both counts are 891, so the dataset contained no fully duplicated rows.

Removing duplicates keeps repeated records from being counted more than once in analysis or modeling.
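
Duplicates can also be counted before dropping anything, and the check can be limited to selected columns. The sketch below uses a toy frame with made-up values. Restricting the check to Name is only an example of the subset parameter, not a recommendation for this dataset.

```python
import pandas as pd

# Toy frame: the third row repeats the second exactly
toy = pd.DataFrame({
    "PassengerId": [1, 2, 2, 3],
    "Name": ["A", "B", "B", "A"],
})

# duplicated() marks every repeat after the first occurrence
n_full_dupes = int(toy.duplicated().sum())

# subset= limits the comparison to the listed columns
n_name_dupes = int(toy.duplicated(subset=["Name"]).sum())

# Drop full-row duplicates and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_full_dupes, n_name_dupes, len(deduped))
```

A subset check flags more rows than a full-row check, so it is worth confirming that the chosen columns really identify a record.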

In [18]:
# Convert Survived column to categorical datatype
df['Survived'] = df['Survived'].astype('category')

# Convert Pclass column to categorical datatype
df['Pclass'] = df['Pclass'].astype('category')


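Survived and Pclass are stored as integers but represent a small set of labels, so they are converted with .astype('category').

The category dtype makes that intent explicit and can reduce memory use for columns with few distinct values.

Date strings, if a dataset has any, can be converted with pd.to_datetime() to enable time-based calculations; this dataset has none.

Appropriate datatypes help prevent errors during modeling and reporting.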
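
The memory effect can be checked directly. The sketch below compares an int64 series with its category version on synthetic class values; the series length is arbitrary and chosen only to make the difference visible.

```python
import pandas as pd

# Synthetic passenger classes, repeated to get 5000 values
pclass_int = pd.Series([3, 1, 3, 1, 2] * 1000)
pclass_cat = pclass_int.astype("category")

# deep=True includes the memory held by the categories themselves
bytes_int = pclass_int.memory_usage(deep=True)
bytes_cat = pclass_cat.memory_usage(deep=True)

# A categorical stores the distinct labels once, plus one small integer code per row
categories = list(pclass_cat.cat.categories)
codes_dtype = str(pclass_cat.cat.codes.dtype)

# Converting back gives an integer series again when arithmetic is needed
restored = pclass_cat.astype("int64")

print(bytes_int, bytes_cat)
print(categories, codes_dtype)
```

The saving grows with the number of rows and shrinks as the number of distinct values increases.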

In [19]:
# Create Age Category column
df['Age_Category'] = pd.cut(
    df['Age'],
    bins=[0, 12, 20, 40, 60, 100],
    labels=['Child', 'Teen', 'Adult', 'Middle-aged', 'Senior']
)

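# Create Fare Band column
# include_lowest=True keeps fares of exactly 0 in the 'Low' band instead of NaN
df['Fare_Band'] = pd.cut(
    df['Fare'],
    bins=[0, 10, 30, 100, 600],
    labels=['Low', 'Medium', 'High', 'Very High'],
    include_lowest=True
)


New columns were created by binning numeric values with pd.cut().

Age_Category groups passengers into age segments: (0, 12], (12, 20], (20, 40], (40, 60], and (60, 100].

Fare_Band groups ticket prices into cost ranges; include_lowest=True is set so that a fare of exactly 0 falls into Low rather than becoming NaN.

Binned features can make the data easier to interpret and may help some models.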
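
By default pd.cut() uses intervals that are open on the left and closed on the right, which matters for values that sit exactly on a bin edge. The sketch below shows this with the Fare_Band edges and a few hand-picked fares, chosen only to hit the edges.

```python
import pandas as pd

# Hand-picked fares, including values exactly on bin edges
fares = pd.Series([0.0, 7.25, 10.0, 10.5, 512.3292])
edges = [0, 10, 30, 100, 600]
names = ["Low", "Medium", "High", "Very High"]

# Default: the first bin is (0, 10], so 0.0 falls outside every bin
default_bands = pd.cut(fares, bins=edges, labels=names)

# include_lowest=True extends the first bin to cover its left edge
inclusive_bands = pd.cut(fares, bins=edges, labels=names, include_lowest=True)

print(default_bands.tolist())
print(inclusive_bands.tolist())
```

With the default settings the fare of 0.0 comes back as NaN, while 10.0 lands in Low because the right edge is included.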

In [20]:
# Save the cleaned dataset to a CSV file
df.to_csv("cleaned_data.csv", index=False)


The cleaned dataset was saved using .to_csv().

index=False prevents the DataFrame index from being written to the file.

The file cleaned_data.csv can be downloaded and reused for analysis or modeling.

This completes the data cleaning pipeline.
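
CSV is a plain-text format, so dtypes such as category are not stored in the file. The sketch below round-trips a toy frame through an in-memory buffer (standing in for cleaned_data.csv) and restores the dtype after loading.

```python
import io
import pandas as pd

# Toy frame with one categorical column (values are illustrative)
toy = pd.DataFrame({
    "Pclass": pd.Series([3, 1, 3]).astype("category"),
    "Fare": [7.25, 71.2833, 7.925],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# After reloading, Pclass is inferred as a plain integer column
reloaded = pd.read_csv(buffer)
dtype_after_reload = str(reloaded["Pclass"].dtype)

# Re-apply the conversion when the categorical dtype is needed again
reloaded["Pclass"] = reloaded["Pclass"].astype("category")

print(dtype_after_reload, reloaded["Pclass"].dtype)
```

Any code that reuses the saved file would need to repeat the datatype conversions, or the data could be stored in a format that keeps dtypes.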

# Overall Summary
Loaded the dataset using Pandas to understand its structure and columns.

Checked missing values using .isnull().sum() to identify columns requiring cleaning.

Filled missing numerical values with the median to reduce the impact of outliers.

Filled missing categorical values with the mode as it represents the most frequent value.

Removed duplicate records to maintain data accuracy and integrity.

Converted appropriate columns to categorical datatype for efficient analysis.

Created new features such as Age Category and Fare Band for better insights.

Assigned each result back to the DataFrame instead of using inplace operations on column selections, which avoids chained-assignment warnings.

Saved the cleaned dataset for future analysis and modeling.
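
The steps summarized above could be bundled into a single function so the cleaning can be re-run on a fresh load. This is a minimal sketch: clean_passengers is a hypothetical helper name, the toy rows are made up, and the binning and saving steps are left out for brevity.

```python
import numpy as np
import pandas as pd

def clean_passengers(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that applies the cleaning steps to a copy of raw."""
    df = raw.copy()  # leave the caller's frame untouched
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
    df = df.drop(columns=["Cabin"], errors="ignore")
    df = df.drop_duplicates().reset_index(drop=True)
    for col in ("Survived", "Pclass"):
        df[col] = df[col].astype("category")
    return df

# Toy input: one missing Age, one missing Embarked, and rows that
# become identical once Cabin is dropped
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1],
    "Pclass": [3, 1, 3, 3],
    "Age": [22.0, np.nan, 26.0, 26.0],
    "Embarked": ["S", None, "S", "S"],
    "Cabin": [None, "C85", None, None],
})

cleaned = clean_passengers(toy)
print(cleaned)
```

Working on a copy means the function can be called repeatedly on the same input, unlike notebook cells that overwrite df in place.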