# 🧹 Lesson 3: Data Cleaning in Pandas

Real-world data is messy.  
In this lesson, we fix missing values, inconsistent formatting, incorrect dtypes, outliers, and duplicates.


Import + Sample Messy Dataset

In [4]:
import pandas as pd
import numpy as np

data = {
    'Name': ['Piyush', 'Amit', 'Sneha', 'Amit', None],
    'Age': [26, None, 27, 25, 30],
    'City': ['Bhopal', 'Delhi', 'pune', 'Delhi', 'Mumbai'],
    'Salary': [55000, 60000, None, 60000, 120000]
}

df = pd.DataFrame(data)
df


Unnamed: 0,Name,Age,City,Salary
0,Piyush,26.0,Bhopal,55000.0
1,Amit,,Delhi,60000.0
2,Sneha,27.0,pune,
3,Amit,25.0,Delhi,60000.0
4,,30.0,Mumbai,120000.0


Step 1: Detect Missing Values

In [5]:
df.isna().sum()


Name      1
Age       1
City      0
Salary    1
dtype: int64
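
Counts alone do not show which rows are affected or how large the gap is relative to the dataset. The following is a minimal sketch, rebuilding the lesson's sample data under the hypothetical name `df_demo`, that reports the share of missing values per column and lists the rows containing at least one missing value.

```python
import pandas as pd

# Same sample data as the lesson; df_demo is a name chosen for this sketch
df_demo = pd.DataFrame({
    'Name': ['Piyush', 'Amit', 'Sneha', 'Amit', None],
    'Age': [26, None, 27, 25, 30],
    'City': ['Bhopal', 'Delhi', 'pune', 'Delhi', 'Mumbai'],
    'Salary': [55000, 60000, None, 60000, 120000],
})

# Fraction of missing values per column (mean of a boolean mask)
missing_share = df_demo.isna().mean()

# Rows that contain at least one missing value
rows_with_missing = df_demo[df_demo.isna().any(axis=1)]

print(missing_share)
print(rows_with_missing)
```

With five rows, each column that has one missing value shows a share of 0.2, and the rows at index 1, 2, and 4 are listed.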

Step 2: Handling Missing Values

We fill each column with a strategy suited to its data:

In [6]:
# Fill missing numeric values
df['Age'] = df['Age'].fillna(df['Age'].median())
df

Unnamed: 0,Name,Age,City,Salary
0,Piyush,26.0,Bhopal,55000.0
1,Amit,26.5,Delhi,60000.0
2,Sneha,27.0,pune,
3,Amit,25.0,Delhi,60000.0
4,,30.0,Mumbai,120000.0


In [7]:
# Fill missing string
df['Name'] = df['Name'].fillna('Unknown')
df

Unnamed: 0,Name,Age,City,Salary
0,Piyush,26.0,Bhopal,55000.0
1,Amit,26.5,Delhi,60000.0
2,Sneha,27.0,pune,
3,Amit,25.0,Delhi,60000.0
4,Unknown,30.0,Mumbai,120000.0


In [8]:
# Fill Salary missing using mean
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
df

Unnamed: 0,Name,Age,City,Salary
0,Piyush,26.0,Bhopal,55000.0
1,Amit,26.5,Delhi,60000.0
2,Sneha,27.0,pune,73750.0
3,Amit,25.0,Delhi,60000.0
4,Unknown,30.0,Mumbai,120000.0


### Filling Missing Values
- Use median for Age (less sensitive to outliers)
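- Use mean for Salary (note that the mean, 73750, is pulled upward by the 120000 value; the median, 60000, would be more robust)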
- Replace missing Name with "Unknown"
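
To make the choice between mean and median concrete, here is a small sketch that compares both fill values on the lesson's Salary column. The variable names are chosen for this example only, and whether the median is the better choice depends on the data at hand.

```python
import pandas as pd

# Salary values from the sample dataset, with one missing entry
salary = pd.Series([55000, 60000, None, 60000, 120000], name='Salary')

# Both statistics skip NaN by default
mean_fill = salary.mean()
median_fill = salary.median()

# Fill the gap with the median instead of the mean
filled_with_median = salary.fillna(median_fill)

print(f"mean fill:   {mean_fill}")
print(f"median fill: {median_fill}")
print(filled_with_median)
```

The single high salary shifts the mean well above the typical value, while the median stays at the most common salary.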


Step 3: Fix Inconsistent Text Formatting

In [9]:
df['City'] = df['City'].str.title()
df

Unnamed: 0,Name,Age,City,Salary
0,Piyush,26.0,Bhopal,55000.0
1,Amit,26.5,Delhi,60000.0
2,Sneha,27.0,Pune,73750.0
3,Amit,25.0,Delhi,60000.0
4,Unknown,30.0,Mumbai,120000.0


(Title case ensures "pune" → "Pune")
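
Case is only one source of inconsistency; stray whitespace is another common one. The sketch below uses hypothetical city values (not from the lesson's dataset) to show how `str.strip()` and `str.title()` can be chained.

```python
import pandas as pd

# Hypothetical messy values: leading/trailing spaces and mixed case
cities = pd.Series(['  pune', 'DELHI ', 'bhopal', 'Mumbai'])

# Remove surrounding whitespace first, then normalize capitalization
cleaned = cities.str.strip().str.title()

print(cleaned.tolist())
```

Stripping before title-casing keeps values such as `'  pune'` and `'Pune'` from being treated as different cities in later grouping or deduplication steps.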

Step 4: Remove Duplicates

In [10]:
# No row is dropped here: the two 'Amit' rows differ in Age (26.5 vs 25.0)
df.drop_duplicates(inplace=True)


In [11]:
df

Unnamed: 0,Name,Age,City,Salary
0,Piyush,26.0,Bhopal,55000.0
1,Amit,26.5,Delhi,60000.0
2,Sneha,27.0,Pune,73750.0
3,Amit,25.0,Delhi,60000.0
4,Unknown,30.0,Mumbai,120000.0
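
All five rows remain because `drop_duplicates()` only removes rows that match in every column, and the two "Amit" rows differ in Age. If, as an assumption for illustration, Name, City, and Salary together identify a person, the `subset` parameter can restrict the comparison to those columns. The sketch rebuilds the DataFrame at this stage under the hypothetical name `df_demo`.

```python
import pandas as pd

# State of the data after Steps 2 and 3; df_demo is a name chosen for this sketch
df_demo = pd.DataFrame({
    'Name': ['Piyush', 'Amit', 'Sneha', 'Amit', 'Unknown'],
    'Age': [26.0, 26.5, 27.0, 25.0, 30.0],
    'City': ['Bhopal', 'Delhi', 'Pune', 'Delhi', 'Mumbai'],
    'Salary': [55000.0, 60000.0, 73750.0, 60000.0, 120000.0],
})

# Assumed key columns; choosing them is a judgment call about the data
key_cols = ['Name', 'City', 'Salary']

# Full-row comparison removes nothing
exact = df_demo.drop_duplicates()

# Flag every row that shares its key with another row, for inspection
dupe_mask = df_demo.duplicated(subset=key_cols, keep=False)

# Keep the first row for each key
by_key = df_demo.drop_duplicates(subset=key_cols, keep='first')

print(df_demo[dupe_mask])
print(by_key)
```

Inspecting the flagged rows before dropping them is a cautious habit, since the imputed Age of 26.5 in one of the two rows is a filled-in value rather than an observed one.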


Step 5: Convert Data Types

In [12]:
# astype('int') truncates the fractional part, so the imputed 26.5 becomes 26
df['Age'] = df['Age'].astype('int')
df

Unnamed: 0,Name,Age,City,Salary
0,Piyush,26,Bhopal,55000.0
1,Amit,26,Delhi,60000.0
2,Sneha,27,Pune,73750.0
3,Amit,25,Delhi,60000.0
4,Unknown,30,Mumbai,120000.0
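
The cast above works only because Age no longer contains missing values. As a sketch of what happens otherwise, the example below applies `astype('int')` to the original Age column, which fails on NaN, and then uses pandas' nullable `Int64` dtype, which keeps the gap as `<NA>`. The variable names are chosen for this example.

```python
import pandas as pd

# Age as it was before imputation: one missing value, stored as float64
age_raw = pd.Series([26, None, 27, 25, 30], name='Age')

# A plain integer cast cannot represent NaN and raises an error
try:
    age_raw.astype('int')
    cast_failed = False
except ValueError:
    cast_failed = True

# The nullable integer dtype keeps integers and marks the gap as <NA>
age_nullable = age_raw.astype('Int64')

# Truncation, not rounding: 26.5 becomes 26
truncated = pd.Series([26.5]).astype('int')

print(cast_failed)
print(age_nullable)
print(truncated[0])
```

`Int64` (capital I) is useful when the integer type matters but imputing a value is not appropriate yet.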


Step 6: Detect Outliers (Simple Example)

In [13]:
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1

df_outliers = df[(df['Salary'] < (Q1 - 1.5*IQR)) | (df['Salary'] > (Q3 + 1.5*IQR))]
df_outliers


Unnamed: 0,Name,Age,City,Salary
4,Unknown,30,Mumbai,120000.0


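If outliers exist, handle them. Here we cap any salary above the upper fence (Q3 + 1.5*IQR) at the fence value; values below the lower fence are left unchanged: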

In [14]:
df['Salary'] = np.where(df['Salary'] > (Q3 + 1.5*IQR), Q3 + 1.5*IQR, df['Salary'])
df

Unnamed: 0,Name,Age,City,Salary
0,Piyush,26,Bhopal,55000.0
1,Amit,26,Delhi,60000.0
2,Sneha,27,Pune,73750.0
3,Amit,25,Delhi,60000.0
4,Unknown,30,Mumbai,94375.0
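
The `np.where` call above caps only the upper side. A possible alternative, sketched here with names chosen for this example, is `Series.clip`, which applies both IQR fences in one call.

```python
import pandas as pd

# Salary values as they stand before capping
salary = pd.Series([55000.0, 60000.0, 73750.0, 60000.0, 120000.0], name='Salary')

# IQR fences, computed the same way as in the lesson
q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside [lower, upper] are replaced by the nearest fence
capped = salary.clip(lower=lower, upper=upper)

print(f"fences: {lower} to {upper}")
print(capped)
```

On this data the result matches the lesson's output, because no salary falls below the lower fence.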


### ✅ Summary

In this lesson, we learned essential real-world data cleaning techniques:
- Detecting and filling missing values
- Normalizing text formatting
- Removing duplicate rows
- Converting data types
- Identifying and correcting outliers

These are core skills that make data usable for analysis and modeling.
