# Pandas - Part 3: Data Cleaning

This notebook covers handling missing data, removing duplicates, converting types, and basic string operations.

**Topics covered:**
- Detecting missing values (isnull, notnull)
- Handling missing data (fillna, dropna)
- Removing duplicates
- Type conversions (astype)
- String operations

**Problems:** 15 (Easy: 1-5, Medium: 6-10, Hard: 11-15)

In [None]:
# ============================================
# SETUP - Run this cell first!
# ============================================
import pandas as pd
import numpy as np
import sys
sys.path.insert(0, '..')
from utils.checker import check

# Sample data with missing values
df = pd.DataFrame({
    'name': ['Alice', 'Bob', None, 'David', 'Eve'],
    'age': [25, np.nan, 35, 28, np.nan],
    'salary': [50000, 60000, np.nan, 55000, 45000],
    'dept': ['IT', 'HR', 'IT', None, 'HR']
})
print(df)
print("\nSetup complete!")

---
## Problem 1: Check for Missing Values
**Difficulty:** Easy

### Concept
Missing values (NaN, None, NaT) are common in real datasets. The `isnull()` method creates a boolean DataFrame showing which values are missing.

### Syntax
```python
df.isnull()   # Returns boolean DataFrame
df.notnull()  # Opposite of isnull()
```

### Example
```python
>>> df.isnull()
    name    age  salary
0  False  False   False
1  False   True   False
```
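As a small illustration, the sketch below builds a hypothetical frame named `demo` (not the notebook's `df`) to show that `None`, `NaN`, and `NaT` are all reported as missing, and that `notnull()` is the element-wise opposite.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame; column names are made up for illustration.
demo = pd.DataFrame({
    "city": ["Oslo", None, "Lima"],                              # None in an object column
    "temp": [3.5, np.nan, 18.0],                                 # NaN in a float column
    "seen": pd.to_datetime(["2024-01-01", None, "2024-01-03"]),  # NaT in a datetime column
})

mask = demo.isnull()      # True where a value is missing
present = demo.notnull()  # True where a value is present

print(mask)
```

Row 1 is missing in every column, so the whole row of `mask` is `True` there.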

### Task
Create a boolean DataFrame showing which values are missing in `df`. Use `isnull()`. Store in `missing_mask`.

### Expected Properties
- Should be a pandas DataFrame
- Should have same shape as df
- Element at row 2, 'name' column should be True (missing)

In [None]:
# Your solution:
missing_mask = None

In [None]:
# Verification
check.is_type(missing_mask, pd.DataFrame, "P1: Type check")
check.has_shape(missing_mask, df.shape, "P1: Shape")
check.is_true(missing_mask.iloc[2]['name'] == True, "P1: Name missing at row 2", "Row 2 name should be missing")

---
## Problem 2: Count Missing Values per Column
**Difficulty:** Easy

### Concept
To get a summary of missing data, combine `isnull()` with `sum()`. Since True=1 and False=0, summing gives the count.

### Syntax
```python
df.isnull().sum()  # Count missing values in each column (axis=0, the default)
```

### Example
```python
>>> df.isnull().sum()
name      1
age       2
salary    1
dept      1
dtype: int64
```
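The same idea extends to per-row counts and to fractions. The sketch below uses a hypothetical frame `demo`; `mean()` works here because the boolean mask is averaged as 0s and 1s.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame for illustration only.
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", None, "z", "w"],
})

per_column = demo.isnull().sum()     # missing count per column
per_row = demo.isnull().sum(axis=1)  # missing count per row
share = demo.isnull().mean()         # fraction missing per column

print(per_column)
print(per_row.tolist())
print(share)
```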

### Task
Count the number of missing values in each column of `df`. Store in `missing_counts`.

### Expected Properties
- Should be a pandas Series
- 'age' column should have 2 missing values

In [None]:
# Your solution:
missing_counts = None

In [None]:
# Verification
check.is_type(missing_counts, pd.Series, "P2: Type check")
check.is_true(missing_counts['age'] == 2, "P2: Age missing count", "Age should have 2 missing values")

---
## Problem 3: Drop Rows with Any Missing Values
**Difficulty:** Easy

### Concept
`dropna()` removes rows (or columns) containing missing values. By default, it drops any row with at least one NaN.

### Syntax
```python
df.dropna()           # Drop rows with any NaN
df.dropna(axis=1)     # Drop columns with any NaN
df.dropna(how='all')  # Drop only if all values are NaN
```

### Example
```python
>>> clean_df = df.dropna()
>>> len(clean_df)  # Fewer rows than original
```
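To compare the options side by side, here is a minimal sketch on a hypothetical frame `demo`. It also uses the `subset` parameter of `dropna()`, which restricts the check to the listed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame for illustration only.
demo = pd.DataFrame({
    "x": [1.0, np.nan, np.nan, 4.0],
    "y": ["a", "b", None, None],
})

any_missing = demo.dropna()           # drop rows with at least one missing value
all_missing = demo.dropna(how="all")  # drop rows where every value is missing
subset_x = demo.dropna(subset=["x"])  # only look at column "x"

print(list(any_missing.index))
print(list(all_missing.index))
print(list(subset_x.index))
```

Note that the surviving rows keep their original index labels; call `reset_index(drop=True)` if you want a fresh 0..n-1 index.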

### Task
Drop all rows that have any missing values from `df`. Store in `df_clean`.

### Expected Properties
- Should be a pandas DataFrame
- Should have 1 row (only row 0 is complete)
- Should have no missing values

In [None]:
# Your solution:
df_clean = None

In [None]:
# Verification
check.is_type(df_clean, pd.DataFrame, "P3: Type check")
check.has_length(df_clean, 1, "P3: Length")
check.has_no_nulls(df_clean, "P3: No nulls")

---
## Problem 4: Fill Missing Values with a Constant
**Difficulty:** Easy

### Concept
`fillna()` replaces missing values with a specified value. This is useful when you want to use a default value instead of removing data.

### Syntax
```python
df['column'].fillna(value)
df.fillna(value)  # Fill all columns
```

### Example
```python
>>> df['age'].fillna(0)
0    25.0
1     0.0  # Was NaN
2    35.0
```
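One detail worth seeing in action: `fillna()` returns a new object and leaves the original untouched. A minimal sketch with a hypothetical Series `readings`:

```python
import numpy as np
import pandas as pd

# Hypothetical Series for illustration only.
readings = pd.Series([1.5, np.nan, 3.0])

filled = readings.fillna(-1)  # new Series; `readings` is not modified

print(filled.tolist())
print(readings.isnull().sum())  # the original still has its gap
```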

### Task
Fill all missing values in `df['age']` with 0. Store in `age_filled`.

### Expected Properties
- Should be a pandas Series
- Should have no null values
- Index 1 should be 0 (was NaN)

In [None]:
# Your solution:
age_filled = None

In [None]:
# Verification
check.is_type(age_filled, pd.Series, "P4: Type check")
check.has_no_nulls(age_filled, "P4: No nulls")
check.is_true(age_filled.iloc[1] == 0, "P4: Filled with 0", "Index 1 should be 0")

---
## Problem 5: Convert Data Type
**Difficulty:** Easy

### Concept
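The `astype()` method converts a Series or DataFrame column to a different data type. A numeric column that contains NaN is stored as float, so once the gaps are filled you can convert it back to an integer type. Calling `astype(int)` on a column that still contains NaN raises an error.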

### Syntax
```python
df['column'].astype(dtype)
# Common dtypes: int, float, str, 'int64', 'float64', 'object'
```

### Example
```python
>>> df['age'].fillna(0).astype(int)  # fill first: NaN cannot be cast to int
```
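The sketch below, on a hypothetical Series `ages`, shows why the order matters: casting while NaN is still present fails, while filling first succeeds. As an alternative that is not used elsewhere in this notebook, pandas also offers the nullable `"Int64"` dtype, which can hold integers and missing values together.

```python
import numpy as np
import pandas as pd

# Hypothetical Series for illustration only.
ages = pd.Series([25, np.nan, 35])
print(ages.dtype)  # float64, because NaN is a float

raised = False
try:
    ages.astype(int)  # NaN has no integer representation
except ValueError as exc:
    raised = True
    print("conversion failed:", exc)

as_int = ages.fillna(0).astype("int64")  # fill first, then convert
nullable = ages.astype("Int64")          # nullable integer dtype keeps the gap as <NA>

print(as_int.tolist())
print(nullable)
```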

### Task
Convert `age_filled` to integer type. Store in `age_int`.

### Expected Properties
- Should be a pandas Series
- dtype should be int64

In [None]:
# Your solution:
age_int = None

In [None]:
# Verification
check.is_type(age_int, pd.Series, "P5: Type check")
check.has_dtype(age_int, np.int64, "P5: Dtype")

---
## Problem 6: Fill Missing with Mean
**Difficulty:** Medium

### Concept
A common strategy for numeric columns is filling missing values with the mean (or median). Filling with the mean leaves the column's mean unchanged, whereas filling with 0 pulls it toward zero.

### Syntax
```python
df['column'].fillna(df['column'].mean())
```

### Example
```python
>>> mean_salary = df['salary'].mean()
>>> df['salary'].fillna(mean_salary)
```
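When a column has outliers, mean and median can give quite different fill values. The following sketch uses a hypothetical Series `pay` with one large value to illustrate the difference; which one is appropriate depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one outlier (250.0), for illustration only.
pay = pd.Series([40.0, 50.0, np.nan, 60.0, 250.0])

mean_filled = pay.fillna(pay.mean())      # mean() skips NaN by default
median_filled = pay.fillna(pay.median())  # median is less sensitive to the outlier

print(mean_filled.iloc[2])
print(median_filled.iloc[2])
print(mean_filled.mean() == pay.mean())  # mean fill keeps the mean unchanged
```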

### Task
Fill missing values in `df['salary']` with the column's mean. Store in `salary_filled`.

### Expected Properties
- Should be a pandas Series
- Should have no null values
- Value at index 2 should be approximately 52500

In [None]:
# Your solution:
salary_filled = None

In [None]:
# Verification
check.is_type(salary_filled, pd.Series, "P6: Type check")
check.has_no_nulls(salary_filled, "P6: No nulls")
check.is_true(abs(salary_filled.iloc[2] - 52500.0) < 1, "P6: Mean value", "Should be filled with mean")

---
## Problem 7: Drop Columns with Missing Values
**Difficulty:** Medium

### Concept
Sometimes it's better to remove columns with too many missing values rather than trying to fill them. Use `axis=1` to drop columns instead of rows.

### Syntax
```python
df.dropna(axis=1)         # Drop columns with any NaN
df.dropna(axis=1, how='all')  # Drop only if all values NaN
```

### Example
```python
>>> df.dropna(axis=1)  # Removes columns with missing data
```
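The difference between the default (`how='any'`) and `how='all'` is easier to see on a frame that has one completely empty column. A minimal sketch with a hypothetical frame `demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame for illustration only.
demo = pd.DataFrame({
    "id": [1, 2, 3],               # complete
    "note": [None, None, None],    # entirely missing
    "score": [0.5, np.nan, 0.9],   # partly missing
})

any_cols = demo.dropna(axis=1)             # drops "note" and "score"
all_cols = demo.dropna(axis=1, how="all")  # drops only "note"

print(list(any_cols.columns))
print(list(all_cols.columns))
```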

### Task
Drop all columns that have any missing values from `df`. Store in `df_cols_clean`.

### Expected Properties
- Should be a pandas DataFrame
- Should have 0 columns (all columns in df have missing values)

In [None]:
# Your solution:
df_cols_clean = None

In [None]:
# Verification
check.is_type(df_cols_clean, pd.DataFrame, "P7: Type check")
check.is_true(len(df_cols_clean.columns) == 0, "P7: No columns", "All columns have missing values")

---
## Problem 8: Detect Duplicates
**Difficulty:** Medium

### Concept
The `duplicated()` method returns a boolean Series indicating duplicate rows. By default, it marks all duplicates except the first occurrence.

### Syntax
```python
df.duplicated()              # Mark every repeat except the first occurrence
df.duplicated(keep='last')   # Mark every repeat except the last occurrence
df.duplicated(keep=False)    # Mark all rows that have a duplicate
```

### Example
```python
>>> df.duplicated()
0    False
1    False
2     True  # Duplicate of row 1
```
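To make the three `keep` options concrete, the sketch below applies each of them to a hypothetical frame `demo` in which rows 0, 2, and 3 are identical.

```python
import pandas as pd

# Hypothetical demo frame: rows 0, 2 and 3 are identical.
demo = pd.DataFrame({"k": ["a", "b", "a", "a"], "v": [1, 2, 1, 1]})

first = demo.duplicated()            # first occurrence is not flagged
last = demo.duplicated(keep="last")  # last occurrence is not flagged
every = demo.duplicated(keep=False)  # every member of a duplicate group is flagged

print(first.tolist())
print(last.tolist())
print(every.tolist())
```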

### Task
The setup code creates `df_dup` with duplicate rows. Use `duplicated()` to get a boolean Series showing duplicates. Store in `is_dup`.

### Expected Properties
- Should be a pandas Series
- Should have 1 True value (one duplicate)

In [None]:
# Setup for this problem
df_dup = pd.DataFrame({
    'A': [1, 2, 2, 3],
    'B': ['a', 'b', 'b', 'c']
})

# Your solution:
is_dup = None

In [None]:
# Verification
check.is_type(is_dup, pd.Series, "P8: Type check")
check.is_true(is_dup.sum() == 1, "P8: One duplicate", "Should have 1 duplicate row")

---
## Problem 9: Remove Duplicates
**Difficulty:** Medium

### Concept
`drop_duplicates()` removes duplicate rows from a DataFrame, keeping only the first (or last) occurrence.

### Syntax
```python
df.drop_duplicates()              # Keep first occurrence
df.drop_duplicates(keep='last')   # Keep last occurrence
df.drop_duplicates(subset=['col']) # Check duplicates in specific columns
```

### Example
```python
>>> df.drop_duplicates()
```
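Here is a short sketch of how `subset` and `keep` interact, using a hypothetical visit log `demo`. With `subset`, two rows count as duplicates when they agree on the listed columns only.

```python
import pandas as pd

# Hypothetical demo frame for illustration only.
demo = pd.DataFrame({
    "user": ["ann", "bob", "ann", "bob"],
    "visit": [1, 1, 2, 1],
})

exact = demo.drop_duplicates()                               # row 3 repeats row 1
by_user = demo.drop_duplicates(subset=["user"])              # first row per user
latest = demo.drop_duplicates(subset=["user"], keep="last")  # last row per user

print(list(exact.index))
print(list(by_user.index))
print(list(latest.index))
```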

### Task
Remove duplicate rows from `df_dup`. Store in `df_no_dup`.

### Expected Properties
- Should be a pandas DataFrame
- Should have 3 rows (one duplicate removed)

In [None]:
# Your solution:
df_no_dup = None

In [None]:
# Verification
check.is_type(df_no_dup, pd.DataFrame, "P9: Type check")
check.has_length(df_no_dup, 3, "P9: Length")

---
## Problem 10: String Operations - Upper Case
**Difficulty:** Medium

### Concept
Pandas provides string methods through the `.str` accessor. These methods skip missing values and leave them missing in the result, whereas calling a regular Python string method on `None` or NaN raises an error.

### Syntax
```python
df['column'].str.upper()    # Convert to uppercase
df['column'].str.lower()    # Convert to lowercase
df['column'].str.strip()    # Remove whitespace
df['column'].str.contains('pattern')  # Check for pattern
```

### Example
```python
>>> df['name'].str.upper()
0    ALICE
1      BOB
2     None
```
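`.str` methods can be chained, and `str.contains()` accepts an `na` argument that decides what a missing entry should produce. The sketch below uses a hypothetical Series `names`; it checks the missing entry with `pd.isna()` rather than relying on how it is displayed.

```python
import pandas as pd

# Hypothetical Series for illustration only.
names = pd.Series(["  alice ", "Bob", None])

cleaned = names.str.strip().str.title()                 # trim, then capitalize
has_b = names.str.contains("b", case=False, na=False)   # missing entries count as False

print(cleaned.iloc[0], cleaned.iloc[1])
print(pd.isna(cleaned.iloc[2]))  # the missing entry is still missing
print(has_b.tolist())
```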

### Task
Convert all values in `df['dept']` to uppercase (NaN values will remain NaN). Store in `dept_upper`.

### Expected Properties
- Should be a pandas Series
- First element should be 'IT' (already uppercase)

In [None]:
# Your solution:
dept_upper = None

In [None]:
# Verification
check.is_type(dept_upper, pd.Series, "P10: Type check")
check.is_true(dept_upper.iloc[0] == 'IT', "P10: First element", "First element should be 'IT'")

---
## Problem 11: Forward Fill Missing Values
**Difficulty:** Hard

### Concept
Forward fill (ffill) propagates the last valid observation forward to fill missing values. This is useful for time series or ordered data.

### Syntax
```python
df['column'].ffill()  # Forward fill
df['column'].bfill()  # Backward fill
```

### Example
```python
>>> s = pd.Series([1, np.nan, np.nan, 4])
>>> s.ffill()
0    1.0
1    1.0  # Filled from index 0
2    1.0  # Filled from index 0
3    4.0
```
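Two edge cases are worth a quick sketch: a leading gap has nothing before it to copy, so `ffill()` leaves it missing, and the `limit` parameter caps how many consecutive gaps are filled. The Series `gaps` below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a leading gap and a two-value gap.
gaps = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

forward = gaps.ffill()          # leading NaN stays NaN
backward = gaps.bfill()         # leading NaN is filled from the next valid value
limited = gaps.ffill(limit=1)   # fill at most one consecutive NaN

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```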

### Task
Use forward fill (ffill) to fill missing values in `df['age']`. Store in `age_ffill`.

### Expected Properties
- Should be a pandas Series
- Index 1 should be 25.0 (forward filled from index 0)
- Index 4 should be 28.0 (forward filled from index 3)

In [None]:
# Your solution:
age_ffill = None

In [None]:
# Verification
check.is_type(age_ffill, pd.Series, "P11: Type check")
check.is_true(age_ffill.iloc[1] == 25.0, "P11: Index 1", "Should be forward filled to 25.0")
check.is_true(age_ffill.iloc[4] == 28.0, "P11: Index 4", "Should be forward filled to 28.0")

---
## Problem 12: Replace Values
**Difficulty:** Hard

### Concept
The `replace()` method substitutes specific values with new ones. This is useful for correcting data or standardizing categories.

### Syntax
```python
df['column'].replace(old_value, new_value)
df['column'].replace([val1, val2], [new1, new2])  # Multiple replacements
df['column'].replace({'old1': 'new1', 'old2': 'new2'})  # Dict mapping
```

### Example
```python
>>> df['status'].replace('active', 'ACTIVE')
```
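A dict mapping is handy for standardizing several labels at once. In the sketch below (hypothetical Series `status`), note that `replace()` matches whole values by default, so `'inactive'` is left alone even though it contains `'active'`.

```python
import pandas as pd

# Hypothetical Series for illustration only.
status = pd.Series(["active", "ACTIVE", "inactive", "n/a"])

# Map each old label to its replacement; unmatched values are kept as-is.
mapped = status.replace({"ACTIVE": "active", "n/a": "unknown"})

print(mapped.tolist())
```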

### Task
In `df['dept']`, replace 'IT' with 'Technology'. Store in `dept_replaced`.

### Expected Properties
- Should be a pandas Series
- Index 0 should be 'Technology' (was 'IT')
- Index 2 should be 'Technology' (was 'IT')

In [None]:
# Your solution:
dept_replaced = None

In [None]:
# Verification
check.is_type(dept_replaced, pd.Series, "P12: Type check")
check.is_true(dept_replaced.iloc[0] == 'Technology', "P12: Index 0", "Should be 'Technology'")
check.is_true(dept_replaced.iloc[2] == 'Technology', "P12: Index 2", "Should be 'Technology'")

---
## Problem 13: Drop Rows with Threshold
**Difficulty:** Hard

### Concept
The `thresh` parameter in `dropna()` specifies the minimum number of non-null values required to keep a row. This is more flexible than dropping any row with missing data.

### Syntax
```python
df.dropna(thresh=n)  # Keep rows with at least n non-null values
```

### Example
```python
>>> # Keep rows with at least 3 non-null values
>>> df.dropna(thresh=3)
```
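It can help to count the non-null values per row first and then compare them with the threshold. A minimal sketch on a hypothetical frame `demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame: rows have 3, 2 and 1 non-null values.
demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 2.0, np.nan],
    "c": [3.0, 3.0, 3.0],
})

non_null = demo.notnull().sum(axis=1)  # non-null count per row
kept = demo.dropna(thresh=2)           # keep rows with at least 2 non-null values

print(non_null.tolist())
print(list(kept.index))
```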

### Task
Drop rows that have fewer than 3 non-null values from `df`. Use the `thresh` parameter. Store in `df_thresh`.

### Expected Properties
- Should be a pandas DataFrame
- Should have 4 rows (one row dropped)

In [None]:
# Your solution:
df_thresh = None

In [None]:
# Verification
check.is_type(df_thresh, pd.DataFrame, "P13: Type check")
check.has_length(df_thresh, 4, "P13: Length")

---
## Problem 14: Fill Different Values per Column
**Difficulty:** Hard

### Concept
You can fill different columns with different values by passing a dictionary to `fillna()`. This allows column-specific filling strategies.

### Syntax
```python
df.fillna({'col1': value1, 'col2': value2})
```

### Example
```python
>>> df.fillna({
...     'age': 0,
...     'name': 'Unknown',
...     'salary': df['salary'].median()
... })
```
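Columns that are not listed in the dictionary are left as they are. The sketch below illustrates this on a hypothetical frame `demo`, where `price` is deliberately omitted from the mapping.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame for illustration only.
demo = pd.DataFrame({
    "qty": [1.0, np.nan],
    "label": [None, "x"],
    "price": [np.nan, 9.5],
})

# Only "qty" and "label" are filled; "price" keeps its missing value.
filled = demo.fillna({"qty": 0, "label": "missing"})

print(filled)
```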

### Task
Fill missing values in df with different values per column:
- 'age' with 0
- 'salary' with median
- 'name' with 'Unknown'
- 'dept' with 'Other'

Store in `df_filled`.

### Expected Properties
- Should be a pandas DataFrame
- Should have no null values
- Index 2 name should be 'Unknown'
- Index 3 dept should be 'Other'

In [None]:
# Your solution:
df_filled = None

In [None]:
# Verification
check.is_type(df_filled, pd.DataFrame, "P14: Type check")
check.has_no_nulls(df_filled, "P14: No nulls")
check.is_true(df_filled['name'].iloc[2] == 'Unknown', "P14: Name filled", "Should be 'Unknown'")
check.is_true(df_filled['dept'].iloc[3] == 'Other', "P14: Dept filled", "Should be 'Other'")

---
## Problem 15: Interpolate Missing Values
**Difficulty:** Hard

### Concept
Interpolation fills missing values by estimating them based on surrounding values. Linear interpolation creates a straight line between known points.

### Syntax
```python
df['column'].interpolate()           # Linear by default
df['column'].interpolate(method='polynomial', order=2)  # Other methods
```

### Example
```python
>>> s = pd.Series([1, np.nan, np.nan, 4])
>>> s.interpolate()
0    1.0
1    2.0  # Interpolated
2    3.0  # Interpolated
3    4.0
```
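For contrast with forward fill, the sketch below applies both methods to a hypothetical Series `track`. Linear interpolation spaces the filled values evenly between the known neighbors, while `ffill()` repeats the last known value.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a three-value gap and a one-value gap.
track = pd.Series([10.0, np.nan, np.nan, np.nan, 50.0, np.nan, 70.0])

linear = track.interpolate()  # evenly spaced values between known points
stepped = track.ffill()       # repeats the previous known value

print(linear.tolist())
print(stepped.tolist())
```

By default the linear method treats the values as equally spaced and ignores the index labels.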

### Task
The setup creates a Series `s` with missing values. Use linear interpolation to fill them. Store in `s_interp`.

### Expected Properties
- Should be a pandas Series
- Index 1 should be approximately 2.0
- Index 2 should be approximately 3.0

In [None]:
# Setup
s = pd.Series([1, np.nan, np.nan, 4, 5])

# Your solution:
s_interp = None

In [None]:
# Verification
check.is_type(s_interp, pd.Series, "P15: Type check")
check.is_true(abs(s_interp.iloc[1] - 2.0) < 0.1, "P15: Index 1", "Should be approximately 2.0")
check.is_true(abs(s_interp.iloc[2] - 3.0) < 0.1, "P15: Index 2", "Should be approximately 3.0")

---
## Summary

Run this cell to see your overall progress on this notebook.

In [None]:
check.summary()