# Task 5 · Dataset Preparation for Consumer Complaint Classification

**Objective:** Prepare a balanced dataset for text classification from the Consumer Complaint dataset (Kaggle version).

**Pipeline Steps in this Notebook**
1. Load the raw Kaggle dataset  
2. Exploratory Data Analysis (EDA)  
3. Select relevant features  
4. Map `Product` values into 4 canonical categories  
5. Balance the dataset (equal samples per class)  
6. Final cleanup & save the balanced CSV  
7. Sanity check (peek a few samples)  

_Expected input: the Kaggle CSV, saved as `consumer_complaints.csv` in the working directory._
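
Before loading the full file, it can help to confirm that the header contains the two columns the later cells rely on. The following is a minimal sketch, assuming the column names used in this notebook (`Product` and `Consumer complaint narrative`); the helper name `missing_columns` is hypothetical and not part of the original pipeline.

```python
import csv

# Column names the later cells select; adjust if your export names them differently
REQUIRED_COLUMNS = ("Product", "Consumer complaint narrative")

def missing_columns(path, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    # utf-8-sig reads files with or without a byte-order mark
    with open(path, newline="", encoding="utf-8-sig") as f:
        header = next(csv.reader(f), [])  # only the first row is read
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]
```

Calling `missing_columns("consumer_complaints.csv")` returns an empty list when both columns are present. Only the header row is read, so the check stays cheap even for a large file.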

## 1️⃣ Load Raw Kaggle Dataset

In [None]:
import pandas as pd

INPUT_PATH = "consumer_complaints.csv"  # Update if your filename differs
print("📥 Loading dataset from:", INPUT_PATH)
df = pd.read_csv(INPUT_PATH, low_memory=False)
print(f"✅ Loaded: {df.shape[0]} rows × {df.shape[1]} columns")
df.head()

## 2️⃣ Exploratory Data Analysis (EDA)

In [None]:
# Show column names
print("Columns:", list(df.columns))

# Missing values (top 10)
missing_pct = df.isnull().mean().sort_values(ascending=False) * 100
print("\n🔍 Missing value percentages (top 10):")
print(missing_pct.head(10))

# Product distribution (top 15)
print("\n📊 Product value counts (top 15):")
print(df["Product"].value_counts().head(15))

## 3️⃣ Select Relevant Features

In [None]:
# Keep only columns we need for the downstream classifier
df = df[["Product", "Consumer complaint narrative"]]

# Drop rows with a missing narrative (copy so later column assignments are safe)
before = len(df)
df = df.dropna(subset=["Consumer complaint narrative"]).copy()
after = len(df)
print(f"✅ Dropped {before - after} rows with missing narratives; remaining: {after}")

## 4️⃣ Map `Product` to Canonical Categories

In [None]:
# Map long/variant product names into the 4 assignment categories

def map_category(p):
    """Return the canonical category for a raw Product value, or None if it is out of scope."""
    s = str(p).lower()
    if "credit reporting" in s or "credit repair" in s:
        return "Credit reporting, repair, or other"
    if "debt collection" in s:
        return "Debt collection"
    if "consumer loan" in s or s.strip() == "loan":
        return "Consumer Loan"
    if "mortgage" in s:
        return "Mortgage"
    return None

df["canon_product"] = df["Product"].apply(map_category)
df = df[df["canon_product"].notnull()].reset_index(drop=True)

print("📊 Counts after mapping to 4 classes:")
print(df["canon_product"].value_counts())

df.head()
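
Because `map_category` returns `None` for any label it does not recognise, those rows are removed by the filter without further notice. The sketch below lists which raw labels fall through. It repeats `map_category` so that it runs on its own; `unmapped_counts` is a hypothetical helper and the sample labels are illustrative, not taken from the real file.

```python
from collections import Counter

def map_category(p):
    # Same rules as the mapping cell above
    s = str(p).lower()
    if "credit reporting" in s or "credit repair" in s:
        return "Credit reporting, repair, or other"
    if "debt collection" in s:
        return "Debt collection"
    if "consumer loan" in s or s.strip() == "loan":
        return "Consumer Loan"
    if "mortgage" in s:
        return "Mortgage"
    return None

def unmapped_counts(products, mapper=map_category):
    """Count the raw labels that the mapper sends to None (hypothetical helper)."""
    return Counter(p for p in products if mapper(p) is None)

# Illustrative labels only
sample_products = [
    "Mortgage", "Debt collection", "Credit reporting",
    "Consumer Loan", "Student loan", "Student loan", "Credit card",
]
dropped = unmapped_counts(sample_products)
print(dropped.most_common())  # [('Student loan', 2), ('Credit card', 1)]
```

In the notebook, the same helper could be called as `unmapped_counts(df["Product"])`, provided it runs before the unmapped rows are filtered out.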

## 5️⃣ Balance the Dataset

In [None]:
# Rows to sample per class (e.g., 500 / 1000 / 2000 / 5000)
TARGET_PER_CLASS = 2000

# Cap at the smallest class so every class ends up with the same number of rows
n_per_class = min(TARGET_PER_CLASS, int(df["canon_product"].value_counts().min()))

df_balanced = (
    df.groupby("canon_product")
      .sample(n=n_per_class, random_state=42)
      .reset_index(drop=True)
)

print("✅ Class distribution after balancing:")
print(df_balanced["canon_product"].value_counts())
print(f"🎯 Total complaints after balancing: {len(df_balanced)}")
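
Sampling at most `TARGET_PER_CLASS` rows per class yields equal classes only when every class has at least that many rows. The sketch below shows the difference on a toy frame with made-up data; the variable names are illustrative.

```python
import pandas as pd

# Toy frame (made-up data): class "a" has 5 rows, "b" has 3, "c" has 1
toy = pd.DataFrame({
    "canon_product": ["a"] * 5 + ["b"] * 3 + ["c"] * 1,
    "text": [f"complaint {i}" for i in range(9)],
})
target = 3

counts = toy["canon_product"].value_counts()

# Rows each class keeps when it is only capped at the target
capped_only = counts.clip(upper=target).sort_index().to_dict()

# Capping at the smallest class as well guarantees equal sizes
n = min(target, int(counts.min()))
balanced = toy.groupby("canon_product").sample(n=n, random_state=42)
balanced_counts = balanced["canon_product"].value_counts().sort_index().to_dict()

print(capped_only)      # {'a': 3, 'b': 3, 'c': 1}
print(balanced_counts)  # {'a': 1, 'b': 1, 'c': 1}
```

The trade-off is that one rare class limits the size of every class, so it is worth checking the per-class counts before picking the target.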

## 6️⃣ Final Cleanup & Save Balanced CSV

In [None]:
import os

SAVE_DIR = "data"
os.makedirs(SAVE_DIR, exist_ok=True)
SAVE_PATH = os.path.join(SAVE_DIR, "ccdb_balanced.csv")

# Rename narrative column to 'text' and keep only what the model notebook expects
df_balanced = df_balanced.rename(columns={"Consumer complaint narrative": "text"})
df_balanced = df_balanced[["text", "canon_product"]]

df_balanced.to_csv(SAVE_PATH, index=False)
print(f"💾 Saved to: {SAVE_PATH}")
print("Final shape:", df_balanced.shape)
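
Complaint narratives often contain commas, quotation marks, and line breaks, so it is reasonable to confirm that the text survives a CSV round trip. The following is a minimal sketch that uses an in-memory buffer and made-up narratives instead of the real file.

```python
import io
import pandas as pd

# Made-up narratives with a comma, quotes, and a line break
sample = pd.DataFrame({
    "text": ['They said "pay now", twice.', "Line one\nline two", "plain"],
    "canon_product": ["Debt collection", "Mortgage", "Consumer Loan"],
})

buf = io.StringIO()
sample.to_csv(buf, index=False)  # same call as the notebook, written to memory
buf.seek(0)
reloaded = pd.read_csv(buf)

# True when every narrative comes back unchanged
round_trip_ok = reloaded["text"].tolist() == sample["text"].tolist()
print(round_trip_ok)
```

The same comparison can be run against the saved file by reading `SAVE_PATH` back with `pd.read_csv` and comparing it with `df_balanced`.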

## 7️⃣ Sanity Check (Peek a Few Samples)

In [None]:
# Show one random example per class
for label, group in df_balanced.groupby("canon_product"):
    print(f"\n--- {label} ---")
    print(group["text"].sample(1, random_state=42).values[0][:300], "...")
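
Printing a few samples will not reveal blank or repeated narratives, and `dropna` removes only missing values, not whitespace-only strings. The sketch below counts both on a toy frame with made-up rows; in the notebook the same two expressions could be applied to `df_balanced`.

```python
import pandas as pd

# Made-up rows: one duplicate narrative and one whitespace-only narrative
toy = pd.DataFrame({
    "text": ["late fee charged", "late fee charged", "   ", "wrong balance"],
    "canon_product": ["Mortgage", "Mortgage", "Debt collection", "Debt collection"],
})

n_blank = int(toy["text"].str.strip().eq("").sum())    # whitespace-only narratives
n_dupes = int(toy.duplicated(subset="text").sum())     # repeats after the first copy
print(n_blank, n_dupes)  # 1 1
```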