# RAG Pipeline: Production-Level Indexing

This notebook walks through the production indexing pipeline for the CSV data and shows how to reuse pre-built embeddings for large-scale retrieval.

### Key features:
1. **Alignment with `src/vectorstore.py`**: The notebook uses the same optimized logic as the production scripts.
2. **Memory-Efficient Processing**: Large datasets are loaded in batches.
3. **Pre-built Integration**: Shows how to load the 1.3M+ record store from `data/complaint_embeddings.parquet`.

In [1]:
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# allow imports from project root
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

from src import config
config.setup_hf_cache()

from src import vectorstore
from src.chunking import get_chunk_stats

plt.style.use('seaborn-v0_8-whitegrid')
print("✓ Imports and setup complete!")

[OK] HuggingFace cache set to: c:\Users\Acer\Documents\KAIM_PROJECT\TEST\rag-complaint-chatbot\models\hf




✓ Imports and setup complete!


## 1. Efficient Vector Store Loading

Re-indexing 1.3M records on every run is expensive, so in production we either load a persisted FAISS index from disk or, when none exists, build one in batches from the pre-computed embeddings.

In [2]:
print(f"Vector store path: {config.VECTOR_STORE_DIR}")
print(f"Pre-built embeddings: {config.PREBUILT_EMBEDDINGS_PATH}")

# Load existing or build from parquet
vs = vectorstore.load_vector_store()
print(f"\n✓ Vector store active with {vs.index.ntotal:,} vectors.")

Vector store path: c:\Users\Acer\Documents\KAIM_PROJECT\TEST\rag-complaint-chatbot\vector_store\faiss
Pre-built embeddings: c:\Users\Acer\Documents\KAIM_PROJECT\TEST\rag-complaint-chatbot\data\complaint_embeddings.parquet
Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
  (First run will download ~80MB to c:\Users\Acer\Documents\KAIM_PROJECT\TEST\rag-complaint-chatbot\models\hf)


'(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /sentence-transformers/all-MiniLM-L6-v2/resolve/main/modules.json (Caused by NameResolutionError("HTTPSConnection(host=\'huggingface.co\', port=443): Failed to resolve \'huggingface.co\' ([Errno 11001] getaddrinfo failed)"))'), '(Request ID: 854cb151-19ba-40d7-a574-9e82ef81120a)')' thrown while requesting HEAD https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/./modules.json
Retrying in 1s [Retry 1/5].

[OK] Embedding model loaded
Loading existing FAISS index from c:\Users\Acer\Documents\KAIM_PROJECT\TEST\rag-complaint-chatbot\vector_store\faiss...
[OK] Vector store loaded (ntotal=1,375,327)

✓ Vector store active with 1,375,327 vectors.
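
The "load existing or build" decision can be pictured with a small sketch. This is not the code in `src/vectorstore.py`: `load_or_build`, `fake_load`, and `fake_build` are hypothetical names, and the `index.faiss` file name is an assumption based on how LangChain-style FAISS stores are usually saved.

```python
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(index_dir: Path,
                  load: Callable[[Path], T],
                  build: Callable[[Path], T]) -> T:
    """Reuse a persisted index when one exists; otherwise build it."""
    # Assumption: the persisted store contains a file named `index.faiss`.
    if (index_dir / "index.faiss").exists():
        return load(index_dir)
    index_dir.mkdir(parents=True, exist_ok=True)
    return build(index_dir)


# Toy stand-ins that only record which path was taken.
calls = []


def fake_build(index_dir: Path) -> str:
    (index_dir / "index.faiss").write_bytes(b"placeholder")
    calls.append("build")
    return "store"


def fake_load(index_dir: Path) -> str:
    calls.append("load")
    return "store"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "faiss"
    load_or_build(target, fake_load, fake_build)  # nothing on disk yet
    load_or_build(target, fake_load, fake_build)  # index now persisted

print(calls)  # → ['build', 'load']
```

The expensive build runs only once; every later call takes the cheap load path, which is the behavior seen in the output above.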


## 2. Verification through Semantic Search

We test the index with a sample query to ensure retrieval is functional and metadata is correctly preserved.

In [3]:
query = "I lost my credit card and there are fraudulent charges"
results = vectorstore.search_similar(vs, query, k=3)

vectorstore.print_search_results(results, query)

SEARCH RESULTS for: 'I lost my credit card and there are fraudulent charges'

--- Result 1 ---
Complaint ID: 2695792
Product: Credit card or prepaid card
Category: Credit Card
Issue: Problem with a purchase shown on your statement
Company: SYNCHRONY FINANCIAL
Chunk: 0/1
Content preview:
i lost my credit card and and their are fraudulent charges and transactions of 2500.00 . the charges do n't belong to me. i have never used my credit card....

--- Result 2 ---
Complaint ID: 8236492
Product: Credit card
Category: Credit Card
Issue: Problem with a purchase shown on your statement
Company: Chime Financial Inc
Chunk: 0/1
Content preview:
i lost my credit debit card and unauthorized charges where submitted to my account....

--- Result 3 ---
Complaint ID: 1409416
Product: Credit card
Category: Credit Card
Issue: Credit card protection / Debt protection
Company: JPMORGAN CHASE & CO.
Chunk: 0/1
Content preview:
i found fraudulent charges on my credit card....
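
Reading the printed results by eye works for a single query, but the same check can be automated. The sketch below is illustrative only: `Doc` is a hypothetical stand-in for the objects returned by `search_similar` (assumed to expose `page_content` and `metadata`), and the set of required keys is our own choice.

```python
from dataclasses import dataclass, field

# Keys we choose to treat as mandatory (an assumption, adjust as needed).
REQUIRED_KEYS = {"complaint_id", "product_category", "chunk_index", "total_chunks"}


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def find_problems(results, required=REQUIRED_KEYS):
    """Return readable problems; an empty list means the results look sane."""
    problems = []
    if not results:
        problems.append("no results returned")
    for i, doc in enumerate(results, start=1):
        missing = sorted(required - set(doc.metadata))
        if missing:
            problems.append(f"result {i}: missing metadata {missing}")
        if not doc.page_content.strip():
            problems.append(f"result {i}: empty content")
    return problems


good = Doc(
    "i lost my credit card ...",
    {"complaint_id": 2695792, "product_category": "Credit Card",
     "chunk_index": 0, "total_chunks": 1},
)
bad = Doc("   ", {"complaint_id": 1})  # toy record with gaps

problems = find_problems([good, bad])
for line in problems:
    print(line)
```

With real results the call would be `find_problems(results)`, and an empty list would indicate that every hit carries the expected fields.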



### Inspecting Metadata and Chunks

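Reliable RAG depends on accurate per-chunk metadata (e.g., `chunk_index`, `complaint_id`), which lets each retrieved chunk be traced back to its source complaint and its position within it.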

In [4]:
sample_doc = results[0]
print("Sample Metadata:")
for k, v in sample_doc.metadata.items():
    print(f"  {k}: {v}")

print(f"\nContent Snippet:\n{sample_doc.page_content[:200]}...")

Sample Metadata:
  chunk_index: 0
  company: SYNCHRONY FINANCIAL
  complaint_id: 2695792
  date_received: 2017-10-07
  issue: Problem with a purchase shown on your statement
  product: Credit card or prepaid card
  product_category: Credit Card
  state: AZ
  sub_issue: Card was charged for something you did not purchase with the card
  total_chunks: 1

Content Snippet:
i lost my credit card and and their are fraudulent charges and transactions of 2500.00 . the charges do n't belong to me. i have never used my credit card....
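
One practical use of `complaint_id`, `chunk_index`, and `total_chunks` is stitching retrieved chunks back into per-complaint passages. The following is a minimal sketch on toy data; `group_chunks` is a hypothetical helper, not part of `src`, and it assumes `chunk_index` is zero-based as in the output above.

```python
from collections import defaultdict


def group_chunks(chunks):
    """Group (metadata, text) pairs by complaint and order them by chunk_index."""
    by_complaint = defaultdict(list)
    for metadata, text in chunks:
        by_complaint[metadata["complaint_id"]].append(
            (metadata["chunk_index"], metadata["total_chunks"], text)
        )
    grouped = {}
    for complaint_id, parts in by_complaint.items():
        parts.sort(key=lambda part: part[0])  # restore original chunk order
        total_chunks = parts[0][1]
        grouped[complaint_id] = {
            "text": " ".join(text for _, _, text in parts),
            # True only when every chunk of the complaint was retrieved.
            "complete": len(parts) == total_chunks,
        }
    return grouped


# Toy input: IDs and texts are made up for illustration.
toy_chunks = [
    ({"complaint_id": 101, "chunk_index": 1, "total_chunks": 2}, "second part"),
    ({"complaint_id": 202, "chunk_index": 0, "total_chunks": 3}, "only retrieved part"),
    ({"complaint_id": 101, "chunk_index": 0, "total_chunks": 2}, "first part"),
]

grouped = group_chunks(toy_chunks)
print(grouped[101])  # → {'text': 'first part second part', 'complete': True}
print(grouped[202]["complete"])  # → False
```

The `complete` flag shows why `total_chunks` is worth storing: it tells the generator whether it is seeing a whole complaint or only a fragment.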


## 3. Building From Parquet (Internal Logic)

To handle 1.3M records without exhausting memory, we use `pyarrow` to read the Parquet file in batches. This logic is encapsulated in `vectorstore.build_vector_store_from_parquet`.

In [5]:
# Mental model of the batch loop (illustrative only, not executed here).
# Assumes `pf = pyarrow.parquet.ParquetFile(config.PREBUILT_EMBEDDINGS_PATH)`.
# for batch in pf.iter_batches(batch_size=50000):
#     df_batch = batch.to_pandas()
#     vectorstore.add_embeddings(texts=df_batch['document'], ...)

print("Batch processing is automatically handled by the src scripts to ensure scalability.")

Batch processing is automatically handled by the src scripts to ensure scalability.
