## Consumer Complaint Retrieval-Augmented Generation (RAG) - Exploratory Data Analysis

This notebook provides a comprehensive exploratory data analysis (EDA) of the CFPB Consumer Complaint dataset. The goal is to understand the structure, distribution, and content of the complaints to better inform our RAG pipeline strategy.

**Key objectives:**

- **Initial distribution analysis:** Understanding the raw volume and product categories.
- **Narrative analysis:** Deep dive into the text data (length, keywords).
- **Cleaning and filter validation:** Verifying the impact of our preprocessing steps.
- **Stratified sampling:** Ensuring a representative dataset for downstream tasks.

In [None]:
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# allow imports from project root
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# local helpers
from src.file_handling import load_raw_data, save_processed_data
import src.eda as eda_mod
from src.preprocess import preprocess_data, create_stratified_sample
from src import config

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 200)

print("✓ Imports complete!")
print(f"Project root: {project_root}")


## 1. Load Raw Data

We load the official CFPB complaints dataset. Note that this is a large CSV file with millions of records.

In [None]:

raw_data_path = project_root / "data" / "raw" / "complaints.csv"

if not raw_data_path.exists():
    # Stop here: every later cell depends on df_raw being defined.
    raise FileNotFoundError(f"File not found at: {raw_data_path}. Please download the dataset.")

df_raw = load_raw_data(raw_data_path)
print(f"✓ Loaded {len(df_raw):,} complaints")

## 2. Global Distribution Analysis

**Objective:** Analyze the distribution of complaints across different Products.

We look at the relative volume of complaints across all product categories to identify the major areas of consumer concern.

In [None]:
eda_mod.plot_product_distribution(df_raw, title="Raw Complaint Distribution by Product Category")


## 2.1 Initial Narrative Length Analysis

**Objective:** Calculate and visualize the length (word count) of the `Consumer complaint narrative` field in the raw data.

**Key Question:** Are there very short or very long narratives? Identifying these outliers early helps inform our filtering strategy.
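
As a rough illustration of what a word-count analysis involves, the sketch below computes per-record word counts on a toy frame. The helper name `narrative_word_counts` and the toy data are assumptions for illustration only; the project's `eda_mod.plot_narrative_length_distribution` may compute lengths differently.

```python
import pandas as pd

# Toy stand-in for the raw data (illustrative only).
toy = pd.DataFrame({
    "Consumer complaint narrative": [
        "I was charged twice for the same purchase.",
        None,  # missing narrative
        "Refund never arrived",
    ]
})

def narrative_word_counts(df, narrative_col="Consumer complaint narrative"):
    """Return whitespace-delimited word counts; missing narratives count as 0."""
    text = df[narrative_col].fillna("").astype(str)
    return text.str.split().str.len()

word_counts = narrative_word_counts(toy)
# Percentiles make very short and very long narratives easy to spot.
print(word_counts.describe(percentiles=[0.05, 0.5, 0.95]))
```

Looking at the low and high percentiles, rather than only the mean, is one way to choose minimum and maximum length thresholds for filtering.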

In [None]:
# Analyze length on raw data
# Note: some raw records have empty narratives; these are filled (fillna) before word counts are computed
eda_mod.plot_narrative_length_distribution(df_raw, narrative_col='Consumer complaint narrative')

## 2.2 Narrative Presence Analysis

**Objective:** Identify the number of complaints with and without narratives.

Since our RAG pipeline relies on semantic search over text, we must keep only the complaints that have a non-empty narrative.
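
The presence check can be sketched as follows. This is a minimal, hypothetical stand-in (`narrative_presence` is not the project's function) that returns the same dictionary keys the cell below prints; the actual `eda_mod.narrative_presence_analysis` is not shown here and may treat whitespace-only narratives differently.

```python
import pandas as pd

def narrative_presence(df, narrative_col="Consumer complaint narrative"):
    """Count records whose narrative is non-null and not just whitespace."""
    has_text = df[narrative_col].fillna("").astype(str).str.strip().ne("")
    total = len(df)
    with_narrative = int(has_text.sum())
    return {
        "total": total,
        "with_narrative": with_narrative,
        "without_narrative": total - with_narrative,
        "percentage_with": 100.0 * with_narrative / total if total else 0.0,
    }

# Toy data: two real narratives, one missing, one whitespace-only.
toy = pd.DataFrame(
    {"Consumer complaint narrative": ["Card was declined", None, "   ", "Late fee dispute"]}
)
presence_toy = narrative_presence(toy)
print(presence_toy)
```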

In [None]:
presence = eda_mod.narrative_presence_analysis(df_raw)
print(f"Total records: {presence['total']:,}")
print(f"Records with narrative: {presence['with_narrative']:,} ({presence['percentage_with']:.2f}%)")
print(f"Records without narrative: {presence['without_narrative']:,}")

## 3. Preprocessing & Filtering

**Objective:** Filter the dataset to meet the project's requirements and clean the text.

We apply the following steps:
1. **Filter Products**: Include only records for the five specified products:
    - Credit card
    - Personal loan
    - Buy Now, Pay Later (BNPL)
    - Savings account
    - Money transfers
2. **Remove Empty Narratives**: Remove any records whose `Consumer complaint narrative` field is empty.
3. **Clean Text**: Clean the text narratives to improve embedding quality by:
    - Lowercasing text.
    - Removing special characters or boilerplate text (e.g., "I am writing to file a complaint...").
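
The text-cleaning step can be sketched with the standard library alone. The function name `clean_narrative` and the boilerplate pattern list are assumptions for illustration; the real rules live in `src.preprocess` and are not reproduced here.

```python
import re

# Hypothetical boilerplate openers; the project's actual list is not shown here.
BOILERPLATE_PATTERNS = [
    r"i am writing to file a complaint( regarding| about)?",
]

def clean_narrative(text):
    """Lowercase, drop boilerplate phrases, strip special characters, collapse whitespace."""
    text = text.lower()
    for pattern in BOILERPLATE_PATTERNS:
        text = re.sub(pattern, " ", text)
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # keep letters, digits, whitespace
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_narrative(
    "I am writing to file a complaint regarding my CREDIT card!!  Fee: $35."
)
print(cleaned)
```

Lowercasing first keeps the boilerplate patterns simple, since they only need to match one casing.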

In [None]:

df_filtered = preprocess_data(df_raw)
print(f"\nFiltered dataset shape: {df_filtered.shape}")

# Explicitly save the cleaned and filtered dataset
save_path = project_root / "data" / "filtered_complaints.csv"
save_processed_data(df_filtered, save_path)

## 3.1 Visualizing the Filtered Dataset

This visualization shows the distribution of standardized products in our refined dataset.

In [None]:
eda_mod.plot_product_distribution(df_filtered, title="Filtered & Standardized Complaint Distribution")


## 4. Deep Dive: Narrative Text Analysis

Now we analyze the textual content of the narratives to gain insights for chunking and search strategies. We explicitly look at the distribution of word counts to identify outliers (very short or very long narratives).

In [None]:
eda_mod.plot_narrative_length_distribution(df_filtered)

## 4.1 Narrative Length by Product

We compare narrative length across product categories to see how much detail consumers provide for each.
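
A per-product summary of this kind can be sketched with a pandas `groupby`. The toy data and the `word_count` column are assumptions; the filtered frame produced by `preprocess_data` may use different column names.

```python
import pandas as pd

toy = pd.DataFrame({
    "Product": ["Credit card", "Credit card", "Personal loan"],
    "Consumer complaint narrative": [
        "late fee charged twice",
        "card declined",
        "loan payoff amount was wrong again",
    ],
})

# Word count per narrative, then summary statistics per product.
toy["word_count"] = toy["Consumer complaint narrative"].str.split().str.len()
length_by_product = toy.groupby("Product")["word_count"].agg(["count", "median", "max"])
print(length_by_product)
```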

In [None]:
eda_mod.plot_length_by_product(df_filtered)

## 4.2 Top Keywords Identification

Identifying high-frequency terms helps us understand common pain points and consumer language.
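
One simple way to find high-frequency terms is to tokenize, drop stopwords, and count. The sketch below uses only the standard library; `top_keywords` and the tiny stopword set are illustrative assumptions, and `eda_mod.plot_top_keywords` may use a different tokenizer or stopword list.

```python
from collections import Counter
import re

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "my", "i", "was", "to", "on", "and", "for", "of"}

def top_keywords(narratives, n=3):
    """Return the n most frequent non-stopword tokens across all narratives."""
    counts = Counter()
    for text in narratives:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)

sample = [
    "The bank charged a fee on my account",
    "I was charged a late fee",
    "My account was closed and the fee was not refunded",
]
keywords = top_keywords(sample)
print(keywords)
```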

In [None]:
eda_mod.plot_top_keywords(df_filtered)


## 5. Temporal and Company Analysis

We examine complaint volume over time and identify the most complained-about companies.
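
Both views reduce to counting. The sketch below assumes the frame retains the CFPB columns `Date received` and `Company`; whether `preprocess_data` keeps them under those names is an assumption, and the toy data is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date received": ["2023-01-05", "2023-01-20", "2023-02-11", "2023-02-12"],
    "Company": ["Bank A", "Bank B", "Bank A", "Bank A"],
})

# Monthly complaint volume: parse dates, bucket by month, count per bucket.
dates = pd.to_datetime(toy["Date received"])
monthly = dates.dt.to_period("M").value_counts().sort_index()

# Most complained-about companies: frequency of each company name.
top_companies = toy["Company"].value_counts().head(10)

print(monthly)
print(top_companies)
```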

In [None]:
eda_mod.plot_temporal_trends(df_filtered)
plt.show()
eda_mod.plot_company_distribution(df_filtered)

## 6. Stratified Sampling for RAG Prototyping

To build a responsive prototype, we create a representative sample that maintains product distribution proportions.
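
Proportional stratified sampling can be sketched as sampling every product at the same rate. The function `stratified_sample` below is a hypothetical stand-in; the project's `create_stratified_sample` is not shown and may handle rounding or small strata differently.

```python
import pandas as pd

def stratified_sample(df, strata_col="Product", target_size=100, random_state=42):
    """Sample each stratum at the same rate so class proportions are preserved."""
    if target_size >= len(df):
        return df.copy()
    frac = target_size / len(df)
    parts = [
        # Keep at least one row per stratum so rare products are not dropped.
        group.sample(n=max(1, round(len(group) * frac)), random_state=random_state)
        for _, group in df.groupby(strata_col)
    ]
    # Shuffle the combined sample so rows are not ordered by product.
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)

toy = pd.DataFrame(
    {"Product": ["Credit card"] * 600 + ["Personal loan"] * 300 + ["Money transfers"] * 100}
)
toy_sample = stratified_sample(toy, target_size=100)
print(toy_sample["Product"].value_counts(normalize=True))
```

Because each stratum size is rounded independently, the final sample can differ from `target_size` by a few rows, which is why the cell below prints the final count.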

In [None]:

target_size = 15000
df_sampled = create_stratified_sample(df_filtered, target_size=target_size)

# Final verification of counts
print(f"Final Sample Size: {len(df_sampled):,}")
print("Sample distribution:")
print(df_sampled['Product'].value_counts(normalize=True))