# RAG Pipeline

This notebook demonstrates the indexing pipeline for the consumer complaint dataset. It keeps the process efficient by reusing pre-computed embeddings instead of generating them again.

## Project Architecture & Flow

1.  **Data Ingestion**: Raw consumer complaints are ingested (handled in `eda.ipynb`).
2.  **Preprocessing**: Text is cleaned and sensitive information is removed.
3.  **Embedding Generation (Optimization)**:
    *   Instead of generating embeddings on the fly for 1.3M+ records (which is slow), we use a **Pre-built Embedding Store** (`data/complaint_embeddings.parquet`).
    *   This file contains `(text_chunk, embedding_vector, metadata)` triplets.
4.  **Vector Store Construction**:
    *   We load the parquet file and build a **FAISS Index**.
    *   The index allows for millisecond-latency similarity search.
5.  **RAG Inference**:
    *   The application (`app.py`) loads this FAISS index.
    *   User Query -> Embedding -> FAISS Search -> Top-K Context -> LLM -> Answer.
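
The inference flow in step 5 can be sketched end to end with toy components. This is a minimal illustration, not the code in `app.py`: `toy_embed` is a hypothetical stand-in for the real embedding model, the brute-force `search` stands in for the FAISS lookup, and the layout produced by `build_prompt` is an assumption.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Brute-force top-k search; FAISS plays this role in the real pipeline."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assumed prompt layout: retrieved context first, then the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"


chunks = [
    "i found fraudulent charges on my credit card",
    "my mortgage payment was applied to the wrong account",
    "the bank closed my savings account without notice",
]
query = "fraudulent charges on my credit card"
top = search(query, chunks, k=2)
prompt = build_prompt(query, top)  # this string would be passed to the LLM
```

Swapping `toy_embed` for a sentence-embedding model and `search` for an index lookup leaves the shape of the flow unchanged: only the top-k chunks ever reach the LLM.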

### In This Notebook:
This notebook covers one step of the flow: efficiently loading or building the FAISS index from the pre-built parquet file, then verifying that it works.

In [1]:
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# allow imports from project root
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

from src import config
config.setup_hf_cache()

from src import vectorstore
from src.chunking import get_chunk_stats

plt.style.use('seaborn-v0_8-whitegrid')
print("✓ Imports and setup complete!")

[OK] HuggingFace cache set to: c:\Users\Acer\Documents\KAIM_PROJECT\TEST\rag-complaint-chatbot\models\hf

✓ Imports and setup complete!


## 1. Efficient Vector Store Loading

In production, re-embedding and re-indexing all 1.3M+ records on every start is usually avoided. Instead, we load a persisted FAISS index or, when none exists, build one from the pre-computed embeddings in batches.
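
Building from raw embeddings in batches can look roughly like the sketch below. It is not the implementation in `src/vectorstore.py`: the helper names, the default batch size, and the choice of a flat L2 index are assumptions. `faiss` and `numpy` are imported lazily so that the batching helper can run without them installed.

```python
from typing import Iterator, Sequence


def iter_batches(n_rows: int, batch_size: int) -> Iterator[tuple[int, int]]:
    """Yield (start, stop) row ranges that cover n_rows in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


def build_flat_index(vectors: Sequence[Sequence[float]], dim: int, batch_size: int = 50_000):
    """Add vectors to a FAISS flat L2 index one batch at a time (assumed index type)."""
    import faiss  # lazy import: only needed when an index is actually built
    import numpy as np

    index = faiss.IndexFlatL2(dim)
    for start, stop in iter_batches(len(vectors), batch_size):
        # FAISS expects a contiguous float32 array of shape (n, dim)
        batch = np.ascontiguousarray(vectors[start:stop], dtype="float32")
        index.add(batch)
    return index


# The batching logic can be checked on its own, without FAISS:
ranges = list(iter_batches(n_rows=10, batch_size=4))
```

Converting and adding one slice at a time keeps the temporary float32 copy bounded by the batch size rather than by the full dataset.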

In [2]:
print(f"Vector store path: {config.VECTOR_STORE_DIR}")
print(f"Pre-built embeddings: {config.PREBUILT_EMBEDDINGS_PATH}")

# Load the persisted FAISS index if one exists; otherwise build it from the parquet file.
# Set force_rebuild=True to rebuild from the pre-built parquet file even when an index already exists.
vs = vectorstore.load_vector_store(force_rebuild=False)
print(f"\n✓ Vector store active with {vs.index.ntotal:,} vectors.")

Vector store path: c:\Users\Acer\Documents\KAIM_PROJECT\TEST\rag-complaint-chatbot\vector_store\faiss
Pre-built embeddings: c:\Users\Acer\Documents\KAIM_PROJECT\TEST\rag-complaint-chatbot\data\complaint_embeddings.parquet
Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
  (First run will download ~80MB to c:\Users\Acer\Documents\KAIM_PROJECT\TEST\rag-complaint-chatbot\models\hf)
[OK] Embedding model loaded
Loading existing FAISS index from c:\Users\Acer\Documents\KAIM_PROJECT\TEST\rag-complaint-chatbot\vector_store\faiss...
[OK] Vector store loaded (ntotal=1,375,327)

✓ Vector store active with 1,375,327 vectors.


## 2. Verification through Semantic Search

We test the index with a sample query to ensure retrieval is functional and metadata is correctly preserved.

In [3]:
query = "I lost my credit card and there are fraudulent charges"
results = vectorstore.search_similar(vs, query, k=3)

vectorstore.print_search_results(results, query)

SEARCH RESULTS for: 'I lost my credit card and there are fraudulent charges'

--- Result 1 ---
Complaint ID: 2695792
Product: Credit card or prepaid card
Category: Credit Card
Issue: Problem with a purchase shown on your statement
Company: SYNCHRONY FINANCIAL
Chunk: 0/1
Content preview:
i lost my credit card and and their are fraudulent charges and transactions of 2500.00 . the charges do n't belong to me. i have never used my credit card....

--- Result 2 ---
Complaint ID: 8236492
Product: Credit card
Category: Credit Card
Issue: Problem with a purchase shown on your statement
Company: Chime Financial Inc
Chunk: 0/1
Content preview:
i lost my credit debit card and unauthorized charges where submitted to my account....

--- Result 3 ---
Complaint ID: 1409416
Product: Credit card
Category: Credit Card
Issue: Credit card protection / Debt protection
Company: JPMORGAN CHASE & CO.
Chunk: 0/1
Content preview:
i found fraudulent charges on my credit card....
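
Reading the results by eye can be complemented with an automated smoke check. The sketch below assumes each result exposes `page_content` and `metadata`, as the results in this notebook do. `Doc`, `REQUIRED_KEYS`, and `check_results` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Assumed minimum set of metadata keys that every result should carry
REQUIRED_KEYS = {"complaint_id", "product_category", "chunk_index", "total_chunks"}


def check_results(results: list[Doc], k: int) -> list[str]:
    """Return human-readable problems; an empty list means the check passed."""
    problems = []
    if len(results) != k:
        problems.append(f"expected {k} results, got {len(results)}")
    for i, doc in enumerate(results, start=1):
        if not doc.page_content.strip():
            problems.append(f"result {i}: empty content")
        missing = REQUIRED_KEYS - set(doc.metadata)
        if missing:
            problems.append(f"result {i}: missing metadata {sorted(missing)}")
    return problems


good = Doc(
    "i found fraudulent charges on my credit card",
    {"complaint_id": 1409416, "product_category": "Credit Card",
     "chunk_index": 0, "total_chunks": 1},
)
bad = Doc("", {"complaint_id": 1})  # deliberately broken example
issues = check_results([good, bad], k=2)
```

Returning a list of problems rather than raising on the first one makes it easier to see every defect in a single run.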



### Inspecting Metadata and Chunks

Reliable retrieval depends on precise metadata tracking (e.g., `chunk_index`, `complaint_id`), so that every retrieved chunk can be traced back to its source complaint.

In [4]:
sample_doc = results[0]
print("Sample Metadata:")
for k, v in sample_doc.metadata.items():
    print(f"  {k}: {v}")

print(f"\nContent Snippet:\n{sample_doc.page_content[:200]}...")

Sample Metadata:
  chunk_index: 0
  company: SYNCHRONY FINANCIAL
  complaint_id: 2695792
  date_received: 2017-10-07
  issue: Problem with a purchase shown on your statement
  product: Credit card or prepaid card
  product_category: Credit Card
  state: AZ
  sub_issue: Card was charged for something you did not purchase with the card
  total_chunks: 1

Content Snippet:
i lost my credit card and and their are fraudulent charges and transactions of 2500.00 . the charges do n't belong to me. i have never used my credit card....
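
One reason `chunk_index` and `complaint_id` matter is that they allow retrieved chunks to be regrouped into their source complaints. The following is a minimal sketch, assuming chunk metadata shaped like the sample above. `group_chunks` is a hypothetical helper, and the rows are made-up examples rather than dataset records.

```python
from collections import defaultdict


def group_chunks(chunks: list[dict]) -> dict[int, str]:
    """Join chunk texts per complaint, ordered by chunk_index."""
    by_complaint = defaultdict(list)
    for chunk in chunks:
        by_complaint[chunk["complaint_id"]].append(chunk)

    merged = {}
    for complaint_id, parts in by_complaint.items():
        parts.sort(key=lambda c: c["chunk_index"])
        # total_chunks tells us whether every piece of the complaint is present
        if len(parts) != parts[0]["total_chunks"]:
            continue  # this sketch skips incomplete complaints
        merged[complaint_id] = " ".join(p["text"] for p in parts)
    return merged


# Made-up rows for illustration only
rows = [
    {"complaint_id": 1, "chunk_index": 1, "total_chunks": 2, "text": "second part"},
    {"complaint_id": 1, "chunk_index": 0, "total_chunks": 2, "text": "first part"},
    {"complaint_id": 2, "chunk_index": 0, "total_chunks": 3, "text": "only one of three"},
]
merged = group_chunks(rows)
```

Without `chunk_index` the parts could not be ordered, and without `total_chunks` there would be no way to tell that complaint 2 is incomplete.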
