# Capstone Project 2: Real-World Data Cleaning Challenge

This project guides you through cleaning a messy dataset with real-world issues.

**Objectives:**
- Handle various types of missing data
- Fix inconsistent data formats
- Deal with outliers
- Standardize categories
- Create a clean, analysis-ready dataset

**Problems:** 15, plus a setup step (Problem 0), in progressive difficulty

In [None]:
# ============================================
# SETUP - Run this cell first!
# ============================================
import sys
sys.path.insert(0, '..')
from utils.checks import capstone_data_cleaning as check  # verification cells call check.*

# Dataset path (provided for convenience)
MESSY_DATA_PATH = '../datasets/synthetic/messy_data.csv'

print("Checker loaded!")
print(f"Dataset path: {MESSY_DATA_PATH}")
print("\nNow import the libraries you need and load the dataset.")

---
## Problem 0: Import Libraries and Load Data
**Difficulty:** Easy

### Concept
Data cleaning requires libraries for data manipulation and visualization. Load the messy dataset to begin.

### Syntax
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

%matplotlib inline
np.random.seed(42)

# Load the messy dataset
messy_df = pd.read_csv(MESSY_DATA_PATH)
```

### Task
1. Import the required libraries
2. Load the messy dataset into `messy_df`

### Expected Properties
- All libraries should be importable
- `messy_df` should be a DataFrame

In [None]:
# Your solution:


In [None]:
# Verification
check.is_true('np' in dir(), "P0a: NumPy imported", "Import numpy as np")
check.is_true('pd' in dir(), "P0b: Pandas imported", "Import pandas as pd")
check.is_true('plt' in dir(), "P0c: Matplotlib imported", "Import matplotlib.pyplot as plt")
check.is_true('os' in dir(), "P0d: OS module imported", "Import os module")
check.is_true('messy_df' in dir(), "P0e: Dataset loaded", "Load the messy dataset into messy_df")

---
## Problem 1: Initial Data Quality Assessment
**Difficulty:** Easy

### Concept
Before cleaning data, you need to assess its quality. The first step is identifying missing values: how many are in each column and what percentage of the rows they represent. This helps prioritize cleaning efforts.

### Syntax
```python
df.isnull().sum()  # Count missing values per column
(df.isnull().sum() / len(df)) * 100  # Percentage missing

# Create summary DataFrame
missing_summary = pd.DataFrame({
    'missing_count': df.isnull().sum(),
    'missing_pct': (df.isnull().sum() / len(df)) * 100
})
```

### Example
```python
>>> data = pd.DataFrame({'A': [1, None, 3], 'B': [4, 5, None]})
>>> data.isnull().sum()
A    1
B    1
>>> (data.isnull().sum() / len(data)) * 100
A    33.333333
B    33.333333
```
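As a complement to the example above, here is a minimal sketch on a made-up frame. It relies on the fact that `isnull().mean()` gives the fraction of missing entries per column, and it sorts the result so the worst columns come first. The names `toy` and `toy_summary` are illustrative only and are not part of the project.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "A": [1, None, 3, 4],
    "B": [None, None, 7, 8],
    "C": [1, 2, 3, 4],
})

# isnull().mean() is the fraction of missing entries per column,
# so multiplying by 100 gives the same percentage as sum() / len().
toy_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": toy.isnull().mean() * 100,
}).sort_values("missing_pct", ascending=False)

print(toy_summary)
```

Sorting is optional, but it makes it easier to see which columns need attention first.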

### Task
Create a DataFrame called `missing_summary` with two columns:
- `'missing_count'`: count of missing values per column
- `'missing_pct'`: percentage of missing values per column

Work with a copy of messy_df called `df`.

### Expected Properties
- `missing_summary` should be a DataFrame
- Should have columns 'missing_count' and 'missing_pct'
- Index should match the columns in df

In [None]:
# Your solution:
df = messy_df.copy()  # Work with a copy
missing_summary = None

In [None]:
# Verification
check.is_type(missing_summary, pd.DataFrame, "P1: Type check")
check.contains_column(missing_summary, 'missing_count', "P1: Has missing_count column")
check.contains_column(missing_summary, 'missing_pct', "P1: Has missing_pct column")

---
## Problem 2: Examine Data Types
**Difficulty:** Easy

### Concept
Understanding data types is crucial for cleaning. Columns might be stored as the wrong type (e.g., numbers as strings, dates as objects). The `dtypes` attribute shows the current data type of each column.

### Syntax
```python
df.dtypes          # Returns Series of data types
df.info()          # Prints dtypes, non-null counts, and memory usage
df['col'].dtype    # Get type of specific column
```

### Example
```python
>>> data = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z']})
>>> data.dtypes
A     int64
B    object
```

### Task
Get the data types of all columns in df and store in `current_dtypes`.

### Expected Properties
- `current_dtypes` should be a pandas Series
- Should have an entry for each column in df

In [None]:
# Your solution:
current_dtypes = None

In [None]:
# Verification
check.is_type(current_dtypes, pd.Series, "P2: Type check")
check.has_length(current_dtypes, len(df.columns), "P2: Has entry for each column")

---
## Problem 3: Remove Duplicate Rows
**Difficulty:** Easy

### Concept
Duplicate rows can skew analysis. They might result from data collection errors or merging issues. Pandas provides methods to detect and remove duplicates.

### Syntax
```python
df.duplicated()              # Returns boolean Series marking duplicates
df.duplicated().sum()        # Count duplicates
df.drop_duplicates()         # Return a new DataFrame without duplicates
df.drop_duplicates(inplace=True)  # Modify df in place (returns None)
```

### Example
```python
>>> data = pd.DataFrame({'A': [1, 2, 1], 'B': [3, 4, 3]})
>>> data.duplicated().sum()
1
>>> data.drop_duplicates()
   A  B
0  1  3
1  2  4
```
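The sketch below, on a made-up frame, shows two options that are often useful when inspecting duplicates before dropping them: `keep=False` flags every member of a duplicate group, and `subset=` restricts the comparison to chosen columns. The frame and variable names are illustrative assumptions.

```python
import pandas as pd

# Toy data: rows 0 and 2 are identical.
toy = pd.DataFrame({
    "id": [1, 2, 1, 3],
    "city": ["Oslo", "Rome", "Oslo", "Rome"],
})

# Count full-row duplicates BEFORE dropping them.
n_dupes = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)

# keep=False marks every copy in a duplicate group, which helps inspection.
all_copies = toy[toy.duplicated(keep=False)]

# subset= compares only the listed columns.
by_city = toy.drop_duplicates(subset=["city"])

print(n_dupes, len(deduped), len(all_copies), len(by_city))
```

Counting before dropping matters: once the duplicates are gone, `duplicated().sum()` returns 0.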

### Task
1. Count the number of duplicate rows and store in `duplicates_removed`
2. Remove duplicate rows from df

### Expected Properties
- After removal, df should have no duplicate rows
- `duplicates_removed` should be a non-negative integer

In [None]:
# Your solution:
duplicates_removed = None

In [None]:
# Verification
check.is_type(duplicates_removed, (int, np.integer), "P3: Type check")
check.is_true(df.duplicated().sum() == 0, "P3: No duplicates remain", "All duplicates should be removed")

---
## Problem 4: Clean String Columns
**Difficulty:** Easy

### Concept
String data often has inconsistent formatting, such as extra whitespace or mixed case. Pandas string methods help standardize text data for consistent analysis.

### Syntax
```python
df['col'].str.strip()      # Remove leading/trailing whitespace
df['col'].str.lower()      # Convert to lowercase
df['col'].str.upper()      # Convert to uppercase
df['col'].str.title()      # Title Case
```

### Example
```python
>>> names = pd.Series(['  JOHN  ', 'jane', '  BOB'])
>>> names.str.strip().str.title()
0    John
1    Jane
2     Bob
```
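To apply the same idea across several columns, one possible approach is to select the text columns by dtype and loop over them. This is a sketch on a made-up frame; selecting both `"object"` and `"string"` is an assumption intended to cover both ways pandas may store text.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["  JOHN  ", "jane", None],
    "city": [" oslo", "ROME ", "Paris"],
    "age": [30, 41, 25],
})

# Pick text columns by dtype; numeric columns such as "age" are skipped.
text_cols = toy.select_dtypes(include=["object", "string"]).columns

for col in text_cols:
    # .str methods leave missing values untouched, so None/NaN stay missing.
    toy[col] = toy[col].str.strip().str.title()

cleaned_count = len(text_cols)
print(cleaned_count)
print(toy)
```

Note that title case is not always appropriate (IDs or codes such as `"USA"` become `"Usa"`), so check which columns it makes sense for.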

### Task
For all string (object) columns in df:
1. Strip whitespace
2. Convert to title case

Store the count of string columns cleaned in `string_cols_cleaned`.

### Expected Properties
- `string_cols_cleaned` should be a positive integer
- String columns should have no leading/trailing whitespace

In [None]:
# Your solution:
string_cols_cleaned = None

In [None]:
# Verification
check.is_type(string_cols_cleaned, (int, np.integer), "P4: Type check")
check.is_true(string_cols_cleaned > 0, "P4: At least one string column", "Should have cleaned at least one string column")

---
## Problem 5: Convert Numeric Columns
**Difficulty:** Medium

### Concept
Sometimes numeric data is stored as strings. This prevents mathematical operations. `pd.to_numeric()` converts strings to numbers, with options for handling errors.

### Syntax
```python
pd.to_numeric(series, errors='raise')    # Raise error on invalid
pd.to_numeric(series, errors='coerce')   # Convert invalid to NaN
pd.to_numeric(series, errors='ignore')   # Leave invalid unchanged (deprecated since pandas 2.2)
```

### Example
```python
>>> values = pd.Series(['1', '2', 'invalid', '4'])
>>> pd.to_numeric(values, errors='coerce')
0    1.0
1    2.0
2    NaN
3    4.0
```
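Deciding which columns "should be numeric" is a judgment call. One possible heuristic, sketched here on made-up data, is to attempt the conversion and keep it only when most values parse. The 50% threshold is an arbitrary assumption for illustration, not a rule from this project.

```python
import pandas as pd

toy = pd.DataFrame({
    "price": ["10.5", "20", "n/a", "7"],
    "label": ["a", "b", "c", "d"],
})

converted = []
for col in ["price", "label"]:
    attempt = pd.to_numeric(toy[col], errors="coerce")
    # Hypothetical rule: accept the conversion if at least half the values parse.
    if attempt.notna().mean() >= 0.5:
        toy[col] = attempt
        converted.append(col)

print(converted)
print(toy.dtypes)
```

Values that fail to parse (here `"n/a"`) become `NaN`, so the missing-value count of a converted column can go up.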

### Task
For columns that should be numeric but aren't, convert them using `pd.to_numeric()` with `errors='coerce'`. Store the count of columns converted in `numeric_cols_converted`.

### Expected Properties
- `numeric_cols_converted` should be a non-negative integer
- Converted columns should have numeric dtype

In [None]:
# Your solution:
numeric_cols_converted = None

In [None]:
# Verification
check.is_type(numeric_cols_converted, (int, np.integer), "P5: Type check")
check.is_true(numeric_cols_converted >= 0, "P5: Non-negative count", "Should be a non-negative number")

---
## Problem 6: Fill Missing Numeric Values
**Difficulty:** Medium

### Concept
Missing numeric values need to be handled. Common strategies include filling with mean, median (robust to outliers), or mode. The median is often preferred for skewed distributions.

### Syntax
```python
df['col'].fillna(value)               # Fill with specific value
df['col'].fillna(df['col'].mean())    # Fill with mean
df['col'].fillna(df['col'].median())  # Fill with median
```

### Example
```python
>>> values = pd.Series([1, 2, None, 4, 5])
>>> values.fillna(values.median())
0    1.0
1    2.0
2    3.0  # median
3    4.0
4    5.0
```
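The same fill can be applied to several columns at once. The sketch below uses a made-up frame and counts the missing values before filling, since the count is no longer recoverable afterwards. Passing a Series of medians to `fillna` fills each column with its own median.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 100.0],
    "y": [np.nan, 2.0, np.nan, 4.0],
})

num_cols = toy.select_dtypes(include=[np.number]).columns

# Count missing values BEFORE filling.
filled = int(toy[num_cols].isnull().sum().sum())

# median() returns one value per column; fillna matches them by column name.
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

print(filled)
print(toy)
```

In column `x` the median of the observed values (1, 3, 100) is 3, whereas the mean would be pulled up to about 34.7 by the extreme value.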

### Task
Fill missing values in all numeric columns with the median of that column. Store the total count of values filled in `values_filled`.

### Expected Properties
- After filling, numeric columns should have no missing values
- `values_filled` should be a non-negative integer

In [None]:
# Your solution:
values_filled = None

In [None]:
# Verification
check.is_type(values_filled, (int, np.integer), "P6: Type check")
numeric_cols = df.select_dtypes(include=[np.number]).columns
numeric_missing = df[numeric_cols].isnull().sum().sum()
check.is_true(numeric_missing == 0, "P6: No missing numeric values", "All numeric missing values should be filled")

---
## Problem 7: Fill Missing Categorical Values
**Difficulty:** Medium

### Concept
For categorical data, filling with the mode (most frequent value) keeps every value within the existing categories, although it increases the share of the most common one. Alternatively, use a placeholder like 'Unknown' to preserve information about missingness.

### Syntax
```python
df['col'].mode()[0]           # Get most frequent value
df['col'].fillna('Unknown')   # Fill with placeholder
df['col'].fillna(df['col'].mode()[0])  # Fill with mode
```

### Example
```python
>>> categories = pd.Series(['A', 'B', 'A', None, 'A'])
>>> categories.mode()[0]
'A'
>>> categories.fillna('A')
0    A
1    B
2    A
3    A  # filled
4    A
```
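`mode()` returns an empty Series when every value is missing, so indexing it with `[0]` would fail in that case. Below is a minimal sketch of a guard for this; `fill_with_mode` is a hypothetical helper name, not something provided by the project.

```python
import pandas as pd

def fill_with_mode(series, fallback="Unknown"):
    """Fill missing values with the mode, or with `fallback` if there is none."""
    modes = series.mode()  # missing values are ignored by default
    value = modes.iloc[0] if not modes.empty else fallback
    return series.fillna(value)

some_missing = fill_with_mode(pd.Series(["A", "B", "A", None], dtype="object"))
all_missing = fill_with_mode(pd.Series([None, None], dtype="object"))

print(some_missing.tolist())
print(all_missing.tolist())
```

When several values tie for most frequent, `mode()` returns them sorted, so `iloc[0]` picks the first in sort order.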

### Task
Fill missing values in all categorical (object) columns with the mode of that column. If a column has no mode, use 'Unknown'. Store the count of values filled in `cat_filled`.

### Expected Properties
- After filling, df should have no missing values at all
- `cat_filled` should be a non-negative integer

In [None]:
# Your solution:
cat_filled = None

In [None]:
# Verification
check.is_type(cat_filled, (int, np.integer), "P7: Type check")
total_missing = df.isnull().sum().sum()
check.is_true(total_missing == 0, "P7: No missing values", "All missing values should be filled")

---
## Problem 8: Detect Outliers Using IQR
**Difficulty:** Medium

### Concept
Outliers are extreme values that can distort analysis. The Interquartile Range (IQR) method defines outliers as values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR, where Q1 and Q3 are the 25th and 75th percentiles.

### Syntax
```python
Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = (df['col'] < lower_bound) | (df['col'] > upper_bound)
```

### Example
```python
>>> data = pd.Series([1, 2, 3, 4, 5, 100])  # 100 is an outlier
>>> Q1, Q3 = data.quantile([0.25, 0.75])
>>> IQR = Q3 - Q1
>>> outliers = (data < Q1 - 1.5*IQR) | (data > Q3 + 1.5*IQR)
>>> outliers.sum()
1
```
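To repeat the calculation for several columns, it can help to wrap the bounds in a small function. This is a sketch on made-up data; `iqr_bounds` and the multiplier argument `k` are illustrative names.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) outlier bounds using the IQR rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 100],       # 100 is an outlier
    "b": [10, 11, 12, 13, 14, 15],   # no outliers
})

counts = {}
for col in toy.columns:
    low, high = iqr_bounds(toy[col])
    counts[col] = int(((toy[col] < low) | (toy[col] > high)).sum())

print(counts)
```

For column `a`, Q1 is 2.25 and Q3 is 4.75, so the bounds are -1.5 and 8.5 and only 100 falls outside them.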

### Task
For each numeric column, count outliers using the IQR method. Store results in a dictionary `outlier_counts` with column names as keys and counts as values.

### Expected Properties
- `outlier_counts` should be a dictionary
- Should have an entry for each numeric column
- All counts should be non-negative

In [None]:
# Your solution:
outlier_counts = None

In [None]:
# Verification
check.is_type(outlier_counts, dict, "P8: Type check")
check.is_true(len(outlier_counts) > 0, "P8: Has entries", "Should have outlier counts for numeric columns")

---
## Problem 9: Handle Outliers with Winsorization
**Difficulty:** Medium

### Concept
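Winsorization caps extreme values at lower and upper bounds rather than removing them. This preserves the row count while reducing the impact of outliers. Values below the lower bound are raised to that bound, and values above the upper bound are lowered to it. Classic winsorization uses percentile cutoffs; in this project the IQR bounds from Problem 8 serve as the limits.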

### Syntax
```python
# Method 1: pandas clip (returns a new Series; assign it back to keep the result)
df['col'].clip(lower=lower_bound, upper=upper_bound)

# Method 2: Using numpy (also returns a new object)
np.clip(df['col'], lower_bound, upper_bound)
```

### Example
```python
>>> values = pd.Series([1, 2, 3, 4, 100])
>>> values.clip(lower=1, upper=10)
0     1
1     2
2     3
3     4
4    10  # capped from 100
```
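Combining the IQR bounds with `clip` might look like the sketch below, on a made-up Series. The count is taken before clipping, because afterwards no value lies outside the bounds.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 100], dtype="float64")

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Count values outside the bounds BEFORE clipping.
n_capped = int(((values < low) | (values > high)).sum())

# clip returns a new Series; the original is left unchanged.
capped = values.clip(lower=low, upper=high)

print(n_capped)
print(capped.tolist())
```

The bounds here are computed once from the original data. Recomputing quartiles after capping can give slightly different bounds, so compare against the original ones when checking the result.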

### Task
For each numeric column, cap outliers at the IQR bounds (Q1 - 1.5×IQR and Q3 + 1.5×IQR). Store the total count of values capped in `values_capped`.

### Expected Properties
- `values_capped` should be a non-negative integer
- No values should remain outside the IQR bounds

In [None]:
# Your solution:
values_capped = None

In [None]:
# Verification
check.is_type(values_capped, (int, np.integer), "P9: Type check")
check.is_true(values_capped >= 0, "P9: Non-negative count", "Should be a non-negative number")

---
## Problem 10: Standardize Category Values
**Difficulty:** Medium

### Concept
Categorical data often has inconsistent formatting ('yes', 'Yes', 'YES'). Standardizing to a consistent format ensures proper grouping and analysis.

### Syntax
```python
df['col'].str.strip().str.title()   # Clean and title case
df['col'].str.lower()                # All lowercase
df['col'].nunique()                  # Count unique values
```

### Example
```python
>>> categories = pd.Series(['yes', 'YES', ' Yes ', 'no'])
>>> categories.nunique()
4  # Before standardization
>>> categories = categories.str.strip().str.title()
>>> categories.nunique()
2  # After: 'Yes' and 'No'
```
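Tracking the unique counts per column can be done with two dictionaries, as in this sketch on a made-up frame. The column names are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "answer": ["yes", "YES", " Yes ", "no"],
    "region": ["north", "North", "south", "south"],
})

before, after = {}, {}
for col in ["answer", "region"]:
    before[col] = int(toy[col].nunique())          # count before standardizing
    toy[col] = toy[col].str.strip().str.title()
    after[col] = int(toy[col].nunique())           # count after standardizing

print(before)
print(after)
```

If a column was already standardized earlier, its two counts will simply be equal.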

### Task
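For all categorical columns, standardize values to title case. This was already done in Problem 4, so the step here confirms that the result is consistent. Record the number of unique values in each column before and after standardizing, and store the counts in the dictionaries `unique_before` and `unique_after`, keyed by column name.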

### Expected Properties
- Both should be dictionaries
- Each count in `unique_after` should be less than or equal to the matching count in `unique_before`

In [None]:
# Your solution:
unique_before = None
unique_after = None

In [None]:
# Verification
check.is_type(unique_before, dict, "P10a: unique_before is dict")
check.is_type(unique_after, dict, "P10b: unique_after is dict")

---
## Problem 11: Create Data Cleaning Report
**Difficulty:** Medium

### Concept
Documentation is key in data cleaning. A cleaning report summarizes what was done, helping others understand the transformations and assess data quality.

### Syntax
```python
report = {
    'metric_name': value,
    'rows_before': len(original_df),
    'rows_after': len(cleaned_df),
    'missing_pct_before': (original_df.isnull().sum().sum() / original_df.size) * 100
}
```

### Example
```python
>>> report = {
...     'original_rows': 1000,
...     'final_rows': 950,
...     'duplicates_removed': 50
... }
>>> for key, value in report.items():
...     print(f"{key}: {value}")
```
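The percentage metrics in a report divide the number of missing cells by the total number of cells (`DataFrame.size`). The sketch below shows this on a made-up pair of frames; `missing_pct` is a hypothetical helper and the fill values are arbitrary.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 3.0],
    "b": ["x", None, "z", "z"],
})
# Arbitrary cleaning steps for illustration only.
clean = raw.drop_duplicates().fillna({"a": 2.0, "b": "Unknown"})

def missing_pct(frame):
    """Share of missing cells across the whole frame, as a percentage."""
    return float(frame.isnull().sum().sum() / frame.size * 100)

toy_report = {
    "original_rows": len(raw),
    "final_rows": len(clean),
    "original_missing_pct": missing_pct(raw),
    "final_missing_pct": missing_pct(clean),
}

for key, value in toy_report.items():
    print(f"{key}: {value}")
```

Wrapping the result in `float()` keeps the report values as plain Python numbers, which print and serialize more predictably than NumPy scalars.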

### Task
Create a dictionary `cleaning_report` with these keys:
- 'original_rows': row count before cleaning
- 'final_rows': row count after cleaning
- 'duplicates_removed': count from Problem 3
- 'missing_values_filled': sum of values_filled and cat_filled
- 'outliers_capped': count from Problem 9
- 'original_missing_pct': percentage of missing values in messy_df
- 'final_missing_pct': percentage of missing values in df (should be 0)

### Expected Properties
- Should be a dictionary
- Should have all required keys
- 'final_missing_pct' should be 0.0

In [None]:
# Your solution:
cleaning_report = None

In [None]:
# Verification
check.is_type(cleaning_report, dict, "P11: Type check")
check.contains(cleaning_report.keys(), 'original_rows', "P11a: Has original_rows")
check.contains(cleaning_report.keys(), 'final_missing_pct', "P11b: Has final_missing_pct")
check.is_true(cleaning_report.get('final_missing_pct', -1) == 0.0, "P11c: No missing values", "final_missing_pct should be 0.0")

---
## Problem 12: Validate Final Data Types
**Difficulty:** Easy

### Concept
After cleaning, verify that columns have appropriate data types. This ensures the data is ready for analysis and prevents type-related errors.

### Syntax
```python
df.dtypes                  # Check all types
df['col'].astype('int64')  # Convert type if needed
```

### Example
```python
>>> df.dtypes
age        int64
name      object
score    float64
```

### Task
Get the final data types of all columns and store in `final_dtypes`.

### Expected Properties
- Should be a pandas Series
- Should have an entry for each column

In [None]:
# Your solution:
final_dtypes = None

In [None]:
# Verification
check.is_type(final_dtypes, pd.Series, "P12: Type check")
check.has_length(final_dtypes, len(df.columns), "P12: Has entry for each column")

---
## Problem 13: Visualize Data Quality Improvement
**Difficulty:** Medium

### Concept
Visualizations communicate cleaning impact effectively. Comparing before/after metrics shows the improvement in data quality.

### Syntax
```python
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].bar(x, before_values, label='Before')
axes[0].bar(x, after_values, alpha=0.7, label='After')
axes[1].hist(cleaned_data)
```

### Example
```python
>>> fig, ax = plt.subplots()
>>> ax.bar(['Before', 'After'], [missing_before, missing_after])
>>> ax.set_ylabel('Missing Values')
```
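A fuller two-panel figure could be sketched as follows. All numbers are made up, the variable names are illustrative, and the `Agg` backend is selected only so the sketch also runs outside a notebook (inside a notebook, `%matplotlib inline` makes that line unnecessary). Offsetting the bars by half a width places "Before" and "After" side by side instead of overlapping.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for running outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up missing-value counts per column.
before = pd.Series({"age": 5, "income": 12, "city": 3})
after = pd.Series({"age": 0, "income": 0, "city": 0})

# Made-up "cleaned" numeric column.
rng = np.random.default_rng(42)
cleaned_values = rng.normal(50, 10, 200)

x = np.arange(len(before))
width = 0.4

demo_fig, demo_axes = plt.subplots(1, 2, figsize=(12, 5))

# Left panel: grouped bars, shifted so the two series sit side by side.
demo_axes[0].bar(x - width / 2, before.values, width, label="Before")
demo_axes[0].bar(x + width / 2, after.values, width, label="After")
demo_axes[0].set_xticks(x)
demo_axes[0].set_xticklabels(before.index)
demo_axes[0].set_ylabel("Missing values")
demo_axes[0].legend()

# Right panel: distribution of the cleaned column.
demo_axes[1].hist(cleaned_values, bins=20)
demo_axes[1].set_xlabel("Value")

demo_fig.tight_layout()
```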

### Task
Create a figure with 2 subplots:
- Left: Compare missing values before and after
- Right: Show distribution of a numeric column after cleaning

### Expected Properties
- `axes` should have 2 elements
- Both plots should be created

In [None]:
# Your solution:
fig = None
axes = None

In [None]:
# Verification
check.is_not_none(axes, "P13: Axes created")
check.has_length(axes, 2, "P13: Has 2 subplots")

---
## Problem 14: Save Cleaned Data
**Difficulty:** Easy

### Concept
After cleaning, save the data for future use. CSV is widely compatible but stores values as text, so data types are inferred again when the file is loaded. Pickle preserves the exact pandas data types.

### Syntax
```python
df.to_csv('filepath.csv', index=False)     # Save to CSV
df.to_pickle('filepath.pkl')                # Save to pickle
os.path.exists('filepath.csv')              # Check if file exists
```

### Example
```python
>>> df.to_csv('cleaned_data.csv', index=False)
>>> os.path.exists('cleaned_data.csv')
True
```
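One way to turn the save into a True/False status is to wrap it in `try`/`except` and confirm the file exists afterwards. The sketch below writes to a temporary directory so it does not touch the project folders; the file name is a placeholder.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "cleaned_demo.csv")  # hypothetical file name
    try:
        toy.to_csv(out_path, index=False)
        saved = os.path.exists(out_path)
    except OSError:
        # Raised for problems such as a missing directory or no write permission.
        saved = False

    # Reading the file back is a cheap round-trip check.
    reloaded = pd.read_csv(out_path) if saved else None

print(saved)
```

`os.path.exists` already returns a plain `bool`, which is what a type check on the status variable would expect.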

### Task
Save df to '../datasets/synthetic/cleaned_data.csv'. Set `save_success` to `True` if the file exists after saving, and to `False` otherwise.

### Expected Properties
- File should be created at the specified path
- `save_success` should be True

In [None]:
# Your solution:
save_success = None

In [None]:
# Verification
check.is_type(save_success, bool, "P14: Type check")
check.is_true(save_success == True, "P14: File saved", "File should be saved successfully")

---
## Problem 15: Create Final Summary
**Difficulty:** Medium

### Concept
A comprehensive summary provides complete documentation of the final dataset, including shape, columns, data types, and basic statistics.

### Syntax
```python
summary = {
    'shape': df.shape,
    'columns': list(df.columns),
    'dtypes': dict(df.dtypes),
    'memory_usage': df.memory_usage(deep=True).sum()
}
```

### Example
```python
>>> summary = {
...     'shape': (1000, 5),
...     'columns': ['A', 'B', 'C', 'D', 'E'],
...     'missing_values': 0
... }
```
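The sketch below builds a summary of this shape from a made-up frame. Converting dtypes to strings and counts to plain `int` is an optional choice made here so the dictionary prints cleanly; the names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [30, 41],
    "name": ["Ann", "Bo"],
    "score": [1.5, 2.5],
})

demo_summary = {
    "shape": toy.shape,
    "columns": list(toy.columns),
    # str(dtype) gives readable names such as 'float64'.
    "dtypes": {col: str(dtype) for col, dtype in toy.dtypes.items()},
    "missing_values": int(toy.isnull().sum().sum()),
    # describe() covers numeric columns only by default.
    "numeric_summary": toy.describe().to_dict(),
    # deep=True also measures the contents of text columns.
    "memory_usage": int(toy.memory_usage(deep=True).sum()),
}

for key, value in demo_summary.items():
    print(f"{key}: {value}")
```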

### Task
Create a dictionary `final_summary` with:
- 'shape': df.shape
- 'columns': list of column names
- 'dtypes': dictionary of column: dtype
- 'missing_values': total missing values (should be 0)
- 'numeric_summary': df.describe().to_dict()
- 'memory_usage': total memory usage in bytes

### Expected Properties
- Should be a dictionary
- 'missing_values' should be 0
- Should have all required keys

In [None]:
# Your solution:
final_summary = None

In [None]:
# Verification
check.is_type(final_summary, dict, "P15: Type check")
check.contains(final_summary.keys(), 'shape', "P15a: Has shape")
check.contains(final_summary.keys(), 'missing_values', "P15b: Has missing_values")
check.is_true(final_summary.get('missing_values', -1) == 0, "P15c: No missing values", "missing_values should be 0")

---
## Summary and Lessons Learned

Document the key lessons from this data cleaning exercise:

1. **Missing Data**: How did you handle different types of missing values?
2. **Duplicates**: What impact did removing duplicates have?
3. **Data Types**: Why is correct data typing important?
4. **Outliers**: What strategy worked best for outliers in your data?
5. **Standardization**: How did standardization improve data quality?

In [None]:
check.summary()