# 🐼 Pandas - Class 3: Data Cleaning Essentials
Welcome to **Class 3** of our Pandas series. Today we'll learn how to clean and prepare data for analysis.

## Handling Missing Values
Real-world datasets often have missing data. Pandas offers:
- `isna()` / `isnull()` → detect missing values (returns True/False)
- `dropna()` → remove missing values (rows/columns)
- `fillna()` → replace missing values with a given value or method (e.g., forward fill)

Always inspect how much data is missing before deciding to drop or fill.
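One way to do that inspection is to look at the share of missing values per column rather than only the raw counts. The sketch below uses the same small table that the next cell builds; the variable names `missing_pct` and `rows_with_missing` are only illustrative.

```python
import numpy as np
import pandas as pd

# Same toy data as the example in this section
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Emma"],
    "Age": [25, None, 35, 40, None],
    "City": ["Delhi", "Mumbai", None, "Bangalore", "Chennai"],
    "Score": [85, 91, np.nan, 88, 95],
})

# isna() gives True/False per cell; the mean of booleans is the fraction of True values
missing_pct = df.isna().mean() * 100
print(missing_pct)

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```

A column that is mostly empty is often a candidate for dropping, while a column with only a few gaps is usually worth filling.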

In [52]:
import pandas as pd
import numpy as np
a = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Emma"],
    "Age": [25, None, 35, 40, None],
    "City": ["Delhi", "Mumbai", None, "Bangalore", "Chennai"],
    "Score": [85, 91, np.nan, 88, 95]
}

b = pd.DataFrame(a)

In [53]:
b

Unnamed: 0,Name,Age,City,Score
0,Alice,25.0,Delhi,85.0
1,Bob,,Mumbai,91.0
2,Charlie,35.0,,
3,David,40.0,Bangalore,88.0
4,Emma,,Chennai,95.0


In [54]:
b.isna()
b.isnull()
b.isna().sum()

Unnamed: 0,0
Name,0
Age,2
City,1
Score,1


In [55]:
# dropna() returns a new DataFrame; b itself is unchanged unless you reassign (b = b.dropna())
b.dropna()
b

Unnamed: 0,Name,Age,City,Score
0,Alice,25.0,Delhi,85.0
1,Bob,,Mumbai,91.0
2,Charlie,35.0,,
3,David,40.0,Bangalore,88.0
4,Emma,,Chennai,95.0
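The output above is unchanged because `dropna()` returns a new DataFrame instead of editing `b`. A minimal sketch of keeping the result, plus a few commonly used options (`subset`, `thresh`, `axis`), on a rebuilt copy of the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Emma"],
    "Age": [25, None, 35, 40, None],
    "City": ["Delhi", "Mumbai", None, "Bangalore", "Chennai"],
    "Score": [85, 91, np.nan, 88, 95],
})

# dropna() returns a new DataFrame; assign it to keep the result
complete_rows = df.dropna()            # rows with no missing values at all
has_age = df.dropna(subset=["Age"])    # only require Age to be present
mostly_full = df.dropna(thresh=3)      # keep rows with at least 3 non-missing values
full_columns = df.dropna(axis=1)       # drop columns that contain any missing value

print(len(df), len(complete_rows), len(has_age), len(mostly_full))
print(list(full_columns.columns))
```

On this small table, dropping every incomplete row keeps only two of five rows, which is why checking the amount of missing data first matters.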


In [56]:
# b['Age'] = b['Age'].fillna(0)

b['Age'] = b['Age'].fillna(b['Age'].mean())
b

Unnamed: 0,Name,Age,City,Score
0,Alice,25.0,Delhi,85.0
1,Bob,33.333333,Mumbai,91.0
2,Charlie,35.0,,
3,David,40.0,Bangalore,88.0
4,Emma,33.333333,Chennai,95.0


In [57]:
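# mode() returns a Series, so fillna aligns it by index: row 2 receives mode()[2] ('Delhi').
# To fill every gap with the single most frequent value, use b['City'].mode()[0] instead.
b['City'] = b['City'].fillna(b['City'].mode())
b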

Unnamed: 0,Name,Age,City,Score
0,Alice,25.0,Delhi,85.0
1,Bob,33.333333,Mumbai,91.0
2,Charlie,35.0,Delhi,
3,David,40.0,Bangalore,88.0
4,Emma,33.333333,Chennai,95.0


In [58]:
b['Score'] = b['Score'].fillna(b['Score'].median())
b

Unnamed: 0,Name,Age,City,Score
0,Alice,25.0,Delhi,85.0
1,Bob,33.333333,Mumbai,91.0
2,Charlie,35.0,Delhi,89.5
3,David,40.0,Bangalore,88.0
4,Emma,33.333333,Chennai,95.0
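The cells above fill gaps with a statistic (mean, mode, median). The forward fill mentioned at the start of this section works differently: it copies the last seen value into the gap, which tends to suit ordered data such as time series. The readings below are hypothetical and not part of the lesson data. Recent pandas releases prefer the dedicated `ffill()` / `bfill()` methods over passing `method=` to `fillna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with gaps (not from the lesson data)
readings = pd.Series([10.0, np.nan, np.nan, 13.0, np.nan], name="reading")

forward = readings.ffill()    # carry the last seen value forward
backward = readings.bfill()   # pull the next seen value backward

print(pd.DataFrame({"raw": readings, "ffill": forward, "bfill": backward}))
```

Note that `bfill()` leaves the final gap empty here because there is no later value to pull from.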


## Changing Data Types (`astype`)
- Convert a column to a different type using `astype()`.
- Examples: converting strings to numbers, integers to floats, or columns to category.

Be careful: make sure the data is compatible with the new type!
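To see what "compatible" means in practice, here is a small sketch with a hypothetical column containing one entry that is not a number. `astype(int)` raises an error on it, while `pd.to_numeric(..., errors="coerce")` turns the bad entry into `NaN` so you can decide what to do with it. The last lines show a conversion to `category`.

```python
import pandas as pd

# Hypothetical column where one entry is not a valid number
raw = pd.Series(["25", "30", "unknown", "40"])

try:
    raw.astype(int)
except ValueError as exc:
    print("astype failed:", exc)

# errors="coerce" turns unparseable entries into NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric)

# Converting a repetitive text column to category
cities = pd.Series(["Delhi", "Mumbai", "Delhi", "Delhi"]).astype("category")
print(cities.dtype, list(cities.cat.categories))
```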

In [61]:
a = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Emma"],
    "Age": [25, 30, 35, 40, 45],
    "Score": [85, 91, 88, 95, 92]
}

b = pd.DataFrame(a)
b

Unnamed: 0,Name,Age,Score
0,Alice,25,85
1,Bob,30,91
2,Charlie,35,88
3,David,40,95
4,Emma,45,92


In [62]:
b['Score'] = b['Score'].astype(float)
b

Unnamed: 0,Name,Age,Score
0,Alice,25,85.0
1,Bob,30,91.0
2,Charlie,35,88.0
3,David,40,95.0
4,Emma,45,92.0


## Replacing Values (`replace`)
- Use `replace()` to substitute specific values or patterns.
- Useful for cleaning labels, fixing typos, or mapping categories.
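As a sketch of the "mapping" and "patterns" cases, `replace()` also accepts a dict of old-to-new values, and with `regex=True` it substitutes the part of each string that matches a pattern. The labels below are hypothetical and only for illustration.

```python
import pandas as pd

# Hypothetical messy labels (not from the lesson data)
df = pd.DataFrame({
    "City": ["Delhi", "delhi", "Mumbay", "Mumbai"],
    "Grade": ["A+", "B", "A+", "C"],
})

# A dict maps several old values to new ones in one call
df["City"] = df["City"].replace({"delhi": "Delhi", "Mumbay": "Mumbai"})

# regex=True treats the pattern as a regular expression and replaces the matched part
df["Grade"] = df["Grade"].replace(r"\+$", "", regex=True)

print(df)
```

Without `regex=True`, `replace()` only matches whole cell values, so `replace("A", "X")` would leave `"A+"` untouched.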

In [63]:
c = b.copy()
c

Unnamed: 0,Name,Age,Score
0,Alice,25,85.0
1,Bob,30,91.0
2,Charlie,35,88.0
3,David,40,95.0
4,Emma,45,92.0


In [64]:
c['Name'] = c['Name'].replace('Alice', 'Alice Smith')
c

Unnamed: 0,Name,Age,Score
0,Alice Smith,25,85.0
1,Bob,30,91.0
2,Charlie,35,88.0
3,David,40,95.0
4,Emma,45,92.0


In [65]:
c['Score'] = c['Score'].replace(85, 85.5)
c

Unnamed: 0,Name,Age,Score
0,Alice Smith,25,85.5
1,Bob,30,91.0
2,Charlie,35,88.0
3,David,40,95.0
4,Emma,45,92.0


## Dropping Rows & Columns (`drop`)
- `drop(labels, axis=0)` → drop rows by index labels.
- `drop(labels, axis=1)` → drop columns by name.
- `drop()` returns a new DataFrame. Reassign the result, or use `inplace=True` if you want to modify the DataFrame directly.

In [70]:
# The result is not assigned, so c still contains 'Name' below
c.drop(['Name'], axis=1)
c

Unnamed: 0,Name,Age,Score
0,Alice Smith,25,85.5
1,Bob,30,91.0
2,Charlie,35,88.0
3,David,40,95.0
4,Emma,45,92.0


In [73]:
c.drop([0,2], axis=0)


Unnamed: 0,Name,Age,Score
1,Bob,30,91.0
3,David,40,95.0
4,Emma,45,92.0
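Both cells above leave `c` untouched because the result of `drop()` is not stored. A minimal sketch of keeping the result, using the `columns=` and `index=` keywords as an alternative to `axis`, on a rebuilt copy of `c`:

```python
import pandas as pd

c = pd.DataFrame({
    "Name": ["Alice Smith", "Bob", "Charlie", "David", "Emma"],
    "Age": [25, 30, 35, 40, 45],
    "Score": [85.5, 91.0, 88.0, 95.0, 92.0],
})

# drop() returns a new DataFrame; c itself is unchanged until you reassign
no_name = c.drop(columns=["Name"])     # same as c.drop(["Name"], axis=1)
fewer_rows = c.drop(index=[0, 2])      # same as c.drop([0, 2], axis=0)

# reset_index(drop=True) renumbers the remaining rows 0, 1, 2, ...
renumbered = fewer_rows.reset_index(drop=True)

print(list(no_name.columns))
print(list(fewer_rows.index), list(renumbered.index))
```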


## Detecting & Removing Duplicates
- `duplicated()` → returns a Boolean Series marking duplicate rows.
- `drop_duplicates()` → remove duplicate rows.
- You can specify `subset` (columns) and `keep` ('first', 'last', or False).

In [75]:
data = {
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob", "David"],
    "Age": [25, 30, 35, 25, 30, 40],
    "Score": [88, 92, 79, 88, 92, 85]
}

df = pd.DataFrame(data)

df

Unnamed: 0,Name,Age,Score
0,Alice,25,88
1,Bob,30,92
2,Charlie,35,79
3,Alice,25,88
4,Bob,30,92
5,David,40,85


In [76]:
df.duplicated()

Unnamed: 0,0
0,False
1,False
2,False
3,True
4,True
5,False


In [77]:
df.drop_duplicates()

Unnamed: 0,Name,Age,Score
0,Alice,25,88
1,Bob,30,92
2,Charlie,35,79
5,David,40,85
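The `subset` and `keep` options listed at the top of this section are not exercised by the cells above, so here is a short sketch of them on the same data. The names `only_unique` and `one_per_name` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob", "David"],
    "Age": [25, 30, 35, 25, 30, 40],
    "Score": [88, 92, 79, 88, 92, 85],
})

# keep="last" marks the earlier copies as duplicates instead of the later ones
print(df.duplicated(keep="last").tolist())

# keep=False drops every row that has a duplicate, leaving only rows that occur once
only_unique = df.drop_duplicates(keep=False)

# subset limits the comparison to the listed columns
one_per_name = df.drop_duplicates(subset=["Name"], keep="first")

print(only_unique)
print(one_per_name)
```

Like `dropna()` and `drop()`, `drop_duplicates()` returns a new DataFrame, so assign the result if you want to keep it.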


## Mini Practice
1. Create a messy DataFrame with missing values, wrong data types, typos, extra columns, and duplicates.
2. Clean it step-by-step using what you learned:
   - Handle missing values
   - Change data types
   - Replace incorrect values
   - Drop unnecessary rows/columns
   - Remove duplicates

In [60]:
import pandas as pd
import numpy as np

# 1. Create a messy DataFrame
data = {
    "Name": ["Alice", "Bob", "Chrlie", "David", "Alice", None],
    "Age": ["25", "30", None, "40", "25", "30"],      # ages stored as strings, with missing value
    "City": ["Delhi", "Mumbai", "Pun", "Bangalore", "Delhi", "Mumbai"],  # typo: "Pun" -> "Pune"
    "Score": [85, None, 78, 88, 85, 91],              # missing value in Score
    "Extra": ["x", "y", "z", "x", "y", "z"]           # extra column we don't really need
}

df = pd.DataFrame(data)

print("Original messy DataFrame:")
print(df)

# 2. Handle missing values
# Fill or drop as needed

# 3. Change data types
# Convert Age to int, Score to float, etc.

# 4. Replace incorrect values
# Fix 'Pun' -> 'Pune', maybe unify Name spellings if needed

# 5. Drop unnecessary rows/columns
# Remove 'Extra' or any row you don't want

# 6. Remove duplicates
# Use duplicated() and drop_duplicates()

# 7. Final clean DataFrame
# Print the cleaned DataFrame and check info()


Original messy DataFrame:
     Name   Age       City  Score Extra
0   Alice    25      Delhi   85.0     x
1     Bob    30     Mumbai    NaN     y
2  Chrlie  None        Pun   78.0     z
3   David    40  Bangalore   88.0     x
4   Alice    25      Delhi   85.0     y
5    None    30     Mumbai   91.0     z


---
## Summary
- Learned to handle missing values (`isna`, `dropna`, `fillna`).
- Converted column data types with `astype`.
- Cleaned up incorrect entries with `replace`.
- Removed unwanted rows/columns with `drop`.
- Detected and removed duplicates.

# Task
Clean the DataFrame created in cell `5cd3e99c` by performing the following steps, using new code cells for everything after the first step:
1. Prepare the messy DataFrame: make sure the selected cell only creates and prints the initial DataFrame.
2. Convert 'Age' to numeric, then handle missing values in 'Age' and 'Score'.
3. Change the data type of 'Age' to integer and 'Score' to float.
4. Replace incorrect values: 'Pun' to 'Pune' in 'City' and 'Chrlie' to 'Charlie' in 'Name'.
5. Drop the 'Extra' column.
6. Remove any duplicate rows.
7. Print the cleaned DataFrame and its `info()` to verify the cleaning operations.

## Prepare Messy DataFrame

### Subtask:
Modify the selected cell (5cd3e99c) to solely create and print the initial messy DataFrame, removing the commented cleaning instructions to set up the exercise.


**Reasoning**:
To prepare the messy DataFrame for the cleaning exercise, I need to modify cell `5cd3e99c` by removing all commented lines and instructions related to subsequent cleaning steps, leaving only the code that creates and prints the initial DataFrame.



In [1]:
import pandas as pd
import numpy as np

# 1. Create a messy DataFrame
data = {
    "Name": ["Alice", "Bob", "Chrlie", "David", "Alice", None],
    "Age": ["25", "30", None, "40", "25", "30"],
    "City": ["Delhi", "Mumbai", "Pun", "Bangalore", "Delhi", "Mumbai"],
    "Score": [85, None, 78, 88, 85, 91],
    "Extra": ["x", "y", "z", "x", "y", "z"]
}

df = pd.DataFrame(data)

print("Original messy DataFrame:")
print(df)

Original messy DataFrame:
     Name   Age       City  Score Extra
0   Alice    25      Delhi   85.0     x
1     Bob    30     Mumbai    NaN     y
2  Chrlie  None        Pun   78.0     z
3   David    40  Bangalore   88.0     x
4   Alice    25      Delhi   85.0     y
5    None    30     Mumbai   91.0     z


**Reasoning**:
The next step is to handle the 'Age' column. This involves converting it to a numeric type (float first to allow for NaN), filling missing values, and then converting it to integer as per the task requirements.



In [2]:
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Age'] = df['Age'].astype(int)
df

Unnamed: 0,Name,Age,City,Score,Extra
0,Alice,25,Delhi,85.0,x
1,Bob,30,Mumbai,,y
2,Chrlie,30,Pun,78.0,z
3,David,40,Bangalore,88.0,x
4,Alice,25,Delhi,85.0,y
5,,30,Mumbai,91.0,z
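In this dataset the mean age happens to be exactly 30, so `astype(int)` loses nothing. With other data the mean may have a decimal part, and `astype(int)` cuts it off rather than rounding. The sketch below uses hypothetical ages to show the difference, and also shows pandas' nullable `Int64` dtype as one possible alternative when you would rather keep the gap than invent a value.

```python
import pandas as pd

# Hypothetical ages whose mean is not a whole number (95 / 3 = 31.67)
age = pd.to_numeric(pd.Series(["25", "30", None, "40"]), errors="coerce")

filled = age.fillna(age.mean())
truncated = filled.astype(int)           # decimal part is cut off
rounded = filled.round().astype(int)     # round first, then convert

# Alternative: keep the value missing and use the nullable integer dtype
nullable = age.astype("Int64")

print(truncated.tolist(), rounded.tolist())
print(nullable)
```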


**Reasoning**:
Following the task instructions, I will now handle the missing values in the 'Score' column and ensure its data type is float. I will use the median to fill missing values as it is robust to outliers.



In [3]:
df['Score'] = df['Score'].fillna(df['Score'].median())
df['Score'] = df['Score'].astype(float)
df

Unnamed: 0,Name,Age,City,Score,Extra
0,Alice,25,Delhi,85.0,x
1,Bob,30,Mumbai,85.0,y
2,Chrlie,30,Pun,78.0,z
3,David,40,Bangalore,88.0,x
4,Alice,25,Delhi,85.0,y
5,,30,Mumbai,91.0,z
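The remaining steps named in the task (fixing the typos, dropping 'Extra', removing duplicates, and checking `info()`) could be sketched as follows. This is one possible ordering, rebuilt from scratch so it runs on its own. The missing 'Name' in the last row is left as-is because the task does not say how to treat it.

```python
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Chrlie", "David", "Alice", None],
    "Age": ["25", "30", None, "40", "25", "30"],
    "City": ["Delhi", "Mumbai", "Pun", "Bangalore", "Delhi", "Mumbai"],
    "Score": [85, None, 78, 88, 85, 91],
    "Extra": ["x", "y", "z", "x", "y", "z"],
}
df = pd.DataFrame(data)

# Steps already shown above: numeric Age with mean fill, Score with median fill
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
df["Age"] = df["Age"].fillna(df["Age"].mean()).astype(int)
df["Score"] = df["Score"].fillna(df["Score"].median()).astype(float)

# Fix the typos named in the task
df["City"] = df["City"].replace("Pun", "Pune")
df["Name"] = df["Name"].replace("Chrlie", "Charlie")

# Drop 'Extra' before de-duplicating: rows 0 and 4 differ only in that column
df = df.drop(columns=["Extra"])

# Remove exact duplicate rows and renumber the index
df = df.drop_duplicates().reset_index(drop=True)

print(df)
df.info()
```

The order matters here: running `drop_duplicates()` while 'Extra' is still present would find no duplicates, since the two Alice rows have different 'Extra' values.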
