# RAG Pipeline: Production-Level Indexing

This notebook walks through the production-level indexing pipeline for the CSV data and shows how to reuse pre-built embeddings for large-scale retrieval.

### Key features:
1. **Alignment with `src/vectorstore.py`**: Uses the same optimized logic as the production scripts.
2. **Memory-Efficient Processing**: Loads large datasets in batches.
3. **Pre-built Integration**: Shows how to load the store of 1.3M+ records from `data/complaint_embeddings.parquet`.

In [None]:
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Allow imports from the project root (assumes the working directory is one level below it)
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

from src import config
config.setup_hf_cache()

from src import vectorstore
from src.chunking import get_chunk_stats

plt.style.use('seaborn-v0_8-whitegrid')
print("✓ Imports and setup complete!")

## 1. Efficient Vector Store Loading

Re-indexing 1.3M records on every run is expensive, so in production we either load a previously persisted FAISS index or build it in batches from the pre-computed embeddings.

In [None]:
print(f"Vector store path: {config.VECTOR_STORE_DIR}")
print(f"Pre-built embeddings: {config.PREBUILT_EMBEDDINGS_PATH}")

# Load the persisted index if one exists; otherwise build it from the parquet file
vs = vectorstore.load_vector_store()
print(f"\n✓ Vector store active with {vs.index.ntotal:,} vectors.")
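
The load-or-build behaviour described above can be sketched without FAISS. This is a minimal illustration under stated assumptions, not the code in `src/vectorstore.py`: `load_or_build`, `expensive_build`, and the pickle cache file are hypothetical stand-ins for the real index and its persistence format.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Callable, List

Vectors = List[List[float]]


def load_or_build(cache_path: Path, build_fn: Callable[[], Vectors]) -> Vectors:
    """Return cached vectors if the file exists; otherwise build and persist them."""
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    vectors = build_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as fh:
        pickle.dump(vectors, fh)
    return vectors


build_calls = []  # records how often the expensive path runs


def expensive_build() -> Vectors:
    # Hypothetical stand-in for embedding and indexing the full dataset.
    build_calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "index.pkl"
    first = load_or_build(cache, expensive_build)   # builds and writes the cache
    second = load_or_build(cache, expensive_build)  # reads the cache, no rebuild
```

The second call returns the same vectors without invoking `expensive_build` again, which is the saving the pre-persisted index is meant to provide.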

## 2. Verification through Semantic Search

We run a sample query against the index to confirm that retrieval works and that metadata is preserved.

In [None]:
query = "I lost my credit card and there are fraudulent charges"
results = vectorstore.search_similar(vs, query, k=3)

vectorstore.print_search_results(results, query)
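
To make the idea of a top-`k` similarity search concrete, here is a brute-force sketch on toy vectors. It is an assumption-laden illustration: whether the project's index ranks by cosine similarity, inner product, or L2 distance is not stated here, and a FAISS index would not scan every vector this way. The names `cosine` and `top_k` are hypothetical.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          vectors: Sequence[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return (position, score) pairs for the k vectors most similar to the query."""
    scored = ((i, cosine(query_vec, v)) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 2-D "embeddings"; real embeddings have hundreds of dimensions.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
hits = top_k([1.0, 0.0], toy_vectors, k=2)
```

Here `hits` contains positions 0 and 2, the two vectors pointing most nearly in the query's direction.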

### Inspecting Metadata and Chunks

Reliable RAG depends on precise metadata tracking (e.g., `chunk_index`, `complaint_id`), so that each retrieved chunk can be traced back to its source complaint.

In [None]:
sample_doc = results[0]
print("Sample Metadata:")
for k, v in sample_doc.metadata.items():
    print(f"  {k}: {v}")

print(f"\nContent Snippet:\n{sample_doc.page_content[:200]}...")
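
A quick sanity check on metadata can catch chunks that lost their identifiers during indexing. The sketch below assumes only the two keys named above; the real schema may contain more. `FakeDoc` is a hypothetical stand-in for whatever document type `search_similar` returns, and `missing_metadata` is an illustrative helper, not part of `src/vectorstore.py`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence

# Keys named in the text above; treat any further fields as unknown.
REQUIRED_KEYS = {"complaint_id", "chunk_index"}


@dataclass
class FakeDoc:
    """Hypothetical stand-in exposing the same two attributes used in this notebook."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def missing_metadata(docs: Sequence[FakeDoc]) -> Dict[int, List[str]]:
    """Map each result position to the required keys it lacks (empty dict if all is well)."""
    problems: Dict[int, List[str]] = {}
    for pos, doc in enumerate(docs):
        missing = sorted(REQUIRED_KEYS.difference(doc.metadata))
        if missing:
            problems[pos] = missing
    return problems


docs = [
    FakeDoc("card stolen", {"complaint_id": "A1", "chunk_index": 0}),
    FakeDoc("late fee", {"complaint_id": "B2"}),  # chunk_index deliberately omitted
]
report = missing_metadata(docs)
```

An empty `report` means every result carries both identifiers; in this toy example the second document is flagged for its missing `chunk_index`.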

## 3. Building From Parquet (Internal Logic)

To handle 1.3M records without exhausting memory, we use `pyarrow` to process the parquet file in batches. This logic is encapsulated in `vectorstore.build_vector_store_from_parquet`.

In [None]:
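# Mental model of how batches are processed (illustrative only, not executed):
# import pyarrow.parquet as pq
# pf = pq.ParquetFile(config.PREBUILT_EMBEDDINGS_PATH)
# for batch in pf.iter_batches(batch_size=50000):
#     df_batch = batch.to_pandas()
#     vectorstore.add_embeddings(texts=df_batch['document'], ...)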

print("Batch processing is automatically handled by the src scripts to ensure scalability.")
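
The batching idea itself does not depend on `pyarrow`. The sketch below is a library-free illustration of why memory stays bounded: only one batch of rows is materialised at a time. The helper `batched`, the generator `fake_rows`, and the `embedding` column name are assumptions for illustration; only `document` appears in the notebook.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def fake_rows(n: int) -> Iterator[Dict[str, object]]:
    """Lazily generate n toy rows, standing in for records read from disk."""
    for i in range(n):
        yield {"document": f"chunk {i}", "embedding": [float(i), 0.0]}


total = 0
largest_batch = 0
for batch in batched(fake_rows(10), batch_size=4):
    # A real pipeline would add this batch to the index here, then let it go.
    total += len(batch)
    largest_batch = max(largest_batch, len(batch))
```

All ten rows are processed, yet no more than four are held in memory at once. The same reasoning applies to a batch size of 50,000 over 1.3M records.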