# 🐼 Pandas - Class 3: Data Cleaning Essentials
Welcome to **Class 3** of our Pandas series. Today we'll learn how to clean and prepare data for analysis.

## Handling Missing Values
Real-world datasets often have missing data. Pandas offers:
- `isna()` / `isnull()` → detect missing values (return a Boolean mask, True where a value is missing)
- `dropna()` → remove rows or columns that contain missing values
- `fillna()` → replace missing values with a given value (to forward or backward fill, use `ffill()` / `bfill()`)

Always inspect how much data is missing before deciding to drop or fill.

In [7]:
import pandas as pd
import numpy as np
a = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Emma"],
    "Age": [25, None, 35, 40, None],
    "City": ["Delhi", "Mumbai", None, "Bangalore", "Chennai"],
    "Score": [85, 91, np.nan, 88, 95]
}

b = pd.DataFrame(a)

In [17]:
b

      Name   Age       City  Score
0    Alice  25.0      Delhi   85.0
1      Bob   NaN     Mumbai   91.0
2  Charlie  35.0       None    NaN
3    David  40.0  Bangalore   88.0
4     Emma   NaN    Chennai   95.0
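To make the three methods concrete, here is a minimal sketch that rebuilds the sample data from this section under the name `df`. The fill values (mean age, the label `"Unknown"`, median score) are illustrative assumptions, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Same sample data as the cell above, rebuilt so this block runs on its own
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Emma"],
    "Age": [25, None, 35, 40, None],
    "City": ["Delhi", "Mumbai", None, "Bangalore", "Chennai"],
    "Score": [85, 91, np.nan, 88, 95],
})

# Count missing values per column before deciding what to do
missing_counts = df.isna().sum()
print(missing_counts)

# Option 1: drop every row that has at least one missing value
dropped = df.dropna()

# Option 2: fill each column with a value that suits it (assumed choices)
filled = df.fillna({
    "Age": df["Age"].mean(),
    "City": "Unknown",
    "Score": df["Score"].median(),
})

# Option 3: forward fill copies the last valid value downward
forward_filled = df.ffill()
print(forward_filled)
```

Dropping is the simplest option but here it discards three of the five rows, which is why counting the gaps first is worthwhile.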


## Changing Data Types (`astype`)
- Convert a column to a different type using `astype()`.
- Examples: converting strings to numbers, integers to floats, or columns to category.

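Be careful: the data must be compatible with the new type. For example, converting a column that contains a non-numeric string or a missing value to `int` raises an error.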

In [10]:
a = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Emma"],
    "Age": [25, 30, 35, 40, 22],
    "Score": [88, 92, 79, 85, 95]
}

b = pd.DataFrame(a)
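The conversions listed above might look like the following sketch. It rebuilds the same data under the name `df`. The `raw_ages` Series and the use of `pd.to_numeric` are additions for illustration: `pd.to_numeric` is a separate pandas function that can be handy when `astype` would fail.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Emma"],
    "Age": [25, 30, 35, 40, 22],
    "Score": [88, 92, 79, 85, 95],
})

# astype returns a new object; assign it back to keep the change
df["Score"] = df["Score"].astype(float)        # integers -> floats
df["Name"] = df["Name"].astype("category")     # strings -> category

# Pass a dict to convert selected columns in one call
df = df.astype({"Age": "int32"})
print(df.dtypes)

# Hypothetical messy column: one entry cannot be parsed as a number
raw_ages = pd.Series(["25", "30", "unknown"])
try:
    raw_ages.astype(int)
except ValueError as err:
    print("Conversion failed:", err)

# errors="coerce" turns unparseable entries into NaN instead of raising
ages = pd.to_numeric(raw_ages, errors="coerce")
print(ages)
```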

## Replacing Values (`replace`)
- Use `replace()` to substitute specific values or patterns.
- Useful for cleaning labels, fixing typos, or mapping categories.
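
A small sketch of `replace()` in action. The `City`, `Grade`, and `codes` data are made up for illustration and do not come from the class dataset.

```python
import pandas as pd

# Hypothetical data with a typo and inconsistent labels
df = pd.DataFrame({
    "City": ["Delhi", "Pun", "Mumbai", "Pun"],
    "Grade": ["A", "b", "B", "a"],
})

# Replace one value with another (matches whole cell values only)
df["City"] = df["City"].replace("Pun", "Pune")

# Map several values at once with a dict
df["Grade"] = df["Grade"].replace({"a": "A", "b": "B"})

# regex=True treats the first argument as a pattern and rewrites the match
codes = pd.Series(["ID-001", "ID-002"])
codes = codes.replace(r"^ID-", "", regex=True)

print(df)
print(codes)
```

Without `regex=True`, `replace("Pun", "Pune")` leaves a value like "Pune" untouched, because only exact matches are replaced.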

## Dropping Rows & Columns (`drop`)
- `drop(labels, axis=0)` → drop rows by index label.
- `drop(labels, axis=1)` → drop columns by name.
- `drop()` returns a new DataFrame by default; pass `inplace=True` to modify the original DataFrame instead.
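
As a rough illustration, the sketch below drops one row and one column from a small made-up DataFrame. The `columns=` keyword is an equivalent way to write `axis=1`.

```python
import pandas as pd

# Small hypothetical DataFrame with a column we don't need
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Extra": ["x", "y", "z"],
})

# Drop the row whose index label is 1 (axis=0 is the default)
without_row = df.drop([1], axis=0)

# Drop a column by name; columns=["Extra"] means the same as
# drop(["Extra"], axis=1)
without_col = df.drop(columns=["Extra"])

# The original is unchanged because drop returned new DataFrames
print(df.shape, without_row.shape, without_col.shape)
```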

## Detecting & Removing Duplicates
- `duplicated()` → returns a Boolean Series marking duplicate rows.
- `drop_duplicates()` → remove duplicate rows.
- Use `subset` to choose which columns are compared and `keep` ('first', 'last', or False) to choose which occurrence survives; `keep=False` removes every copy.

In [11]:
data = {
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob", "David"],
    "Age": [25, 30, 35, 25, 30, 40],
    "Score": [88, 92, 79, 88, 92, 85]
}

df = pd.DataFrame(data)
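
Using the same data, the duplicate tools could be applied as in this sketch. The result names (`dup_mask`, `deduped`, and so on) are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob", "David"],
    "Age": [25, 30, 35, 25, 30, 40],
    "Score": [88, 92, 79, 88, 92, 85],
})

# True for every row that repeats an earlier row
dup_mask = df.duplicated()
print(dup_mask.sum(), "duplicate rows")

# Keep the first occurrence of each row (the default)
deduped = df.drop_duplicates()

# Compare only the Name column and keep the last occurrence
last_by_name = df.drop_duplicates(subset=["Name"], keep="last")

# keep=False drops every row that has a duplicate anywhere
unique_only = df.drop_duplicates(keep=False)

print(deduped)
```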

## Mini Practice
1. Create a messy DataFrame with missing values, wrong data types, typos, extra columns, and duplicates.
2. Clean it step-by-step using what you learned:
   - Handle missing values
   - Change data types
   - Replace incorrect values
   - Drop unnecessary rows/columns
   - Remove duplicates

In [12]:
import pandas as pd
import numpy as np

# 1. Create a messy DataFrame
data = {
    "Name": ["Alice", "Bob", "Chrlie", "David", "Alice", None],
    "Age": ["25", "30", None, "40", "25", "30"],      # ages stored as strings, with missing value
    "City": ["Delhi", "Mumbai", "Pun", "Bangalore", "Delhi", "Mumbai"],  # typo: "Pun" -> "Pune"
    "Score": [85, None, 78, 88, 85, 91],              # missing value in Score
    "Extra": ["x", "y", "z", "x", "y", "z"]           # extra column we don't really need
}

df = pd.DataFrame(data)

print("Original messy DataFrame:")
print(df)

# 2. Handle missing values
# Fill or drop as needed

# 3. Change data types
# Convert Age to int, Score to float, etc.

# 4. Replace incorrect values
# Fix 'Pun' -> 'Pune', maybe unify Name spellings if needed

# 5. Drop unnecessary rows/columns
# Remove 'Extra' or any row you don't want

# 6. Remove duplicates
# Use duplicated() and drop_duplicates()

# 7. Final clean DataFrame
# Print the cleaned DataFrame and check info()


Original messy DataFrame:
     Name   Age       City  Score Extra
0   Alice    25      Delhi   85.0     x
1     Bob    30     Mumbai    NaN     y
2  Chrlie  None        Pun   78.0     z
3   David    40  Bangalore   88.0     x
4   Alice    25      Delhi   85.0     y
5    None    30     Mumbai   91.0     z
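
One possible way to work through the practice steps is sketched below. It is not the only valid solution. The choices made here are assumptions: rows without a name are dropped, the missing age is filled with the median, and the missing score is filled with the mean. Duplicates are removed before filling so that repeated rows don't skew those statistics, and `pd.to_numeric` is used to turn the string ages into numbers.

```python
import pandas as pd

# The messy DataFrame from the practice cell
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Chrlie", "David", "Alice", None],
    "Age": ["25", "30", None, "40", "25", "30"],
    "City": ["Delhi", "Mumbai", "Pun", "Bangalore", "Delhi", "Mumbai"],
    "Score": [85, None, 78, 88, 85, 91],
    "Extra": ["x", "y", "z", "x", "y", "z"],
})

# Drop the column we don't need; it would also hide the duplicate row
clean = df.drop(columns=["Extra"])

# Fix the typos
clean["Name"] = clean["Name"].replace("Chrlie", "Charlie")
clean["City"] = clean["City"].replace("Pun", "Pune")

# Convert Age from strings to numbers (the missing value becomes NaN)
clean["Age"] = pd.to_numeric(clean["Age"])

# Assumed rule: a row without a Name can't be used, so drop it
clean = clean.dropna(subset=["Name"])

# Remove duplicate rows before computing fill values
clean = clean.drop_duplicates()

# Fill the remaining gaps (assumed choices: median age, mean score)
clean = clean.fillna({
    "Age": clean["Age"].median(),
    "Score": clean["Score"].mean(),
})

# With no gaps left, Age can safely become an integer column
clean = clean.astype({"Age": int}).reset_index(drop=True)

print(clean)
clean.info()
```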


---
## Summary
- Learned to handle missing values (`isna`, `dropna`, `fillna`).
- Converted column data types with `astype`.
- Cleaned up incorrect entries with `replace`.
- Removed unwanted rows/columns with `drop`.
- Detected and removed duplicates.