# RAG Pipeline: Chunking, Embedding, and Indexing

This notebook converts the processed CFPB complaint data into a searchable vector store.

**Pipeline Steps:**
1. Setup environment and HuggingFace cache.
2. Load processed data from disk and draw a stratified sample.
3. Convert complaints to LangChain Documents.
4. Split documents into smaller semantic chunks.
5. Generate embeddings and build a FAISS vector index.
6. Verify the index with retrieval tests.

In [None]:
import sys
from pathlib import Path
import pandas as pd

# 1. Setup PROJECT_ROOT to allow importing from src/
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# 2. Import config and setup HuggingFace cache
from src import config
config.setup_hf_cache()

# 3. Import required custom modules
from src.file_handling import load_processed_data
from src.docs import dataframe_to_documents, print_document_sample
from src.chunking import chunk_documents, get_chunk_stats
from src.vectorstore import create_vector_store, load_vector_store, get_retriever, print_search_results

print("✓ Imports and setup complete!")
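
`Path.cwd().parent` assumes the notebook is launched from a directory exactly one level below the project root. A slightly more defensive variant, sketched here under the assumption that the root is the nearest ancestor containing a `src/` directory, walks upward until it finds one. The helper name `find_project_root` is hypothetical and not part of this project.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "src") -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper: assumes the project root is identified by a `src/` folder.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None  # no ancestor contains the marker directory


root = find_project_root(Path.cwd())
print(f"Project root: {root}")
```

If `root` is not `None`, it could be passed to `sys.path.insert(0, str(root))` in place of the hard-coded parent directory.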

## 1. Load Processed Data

We load the cleaned complaint data generated by the EDA notebook.

In [None]:
# Load cleaned data
processed_data_path = config.PROCESSED_DATA_PATH
df = load_processed_data(processed_data_path)

# Verify document text exists
text_col = 'clean_narrative'
if text_col in df.columns:
    print(f"✓ Found '{text_col}' for indexing")
    # Show a snippet of the first non-null narrative
    print(f"Snippet: {df[text_col].dropna().iloc[0][:100]}...")
else:
    print(f"❌ ERROR: {text_col} not found in the dataset!")

print(f"Total records loaded: {len(df):,}")

## 1.1 Stratified Sampling

We apply stratified sampling to select a representative subset of 10,000 complaints. This keeps the vector store manageable in size while preserving the proportional distribution of product categories in the filtered dataset.

In [None]:
from src.preprocess import create_stratified_sample

# Select a stratified sample of 10k records from the loaded data
target_sample_size = 10000
print(f"Applying stratified sampling (Target: {target_sample_size})...")
df = create_stratified_sample(df, target_size=target_sample_size)

# Verify final shape
print(f"Sample size for indexing: {len(df):,}")
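
The implementation of `create_stratified_sample` is not shown in this notebook. As a rough illustration of the idea behind section 1.1, the sketch below draws a proportional sample from plain Python dictionaries instead of a DataFrame. The `product` key, the toy counts, and the function name `stratified_sample` are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, target_size, seed=42):
    """Sample about `target_size` records, keeping each group's share of the total."""
    total = len(records)
    if total <= target_size:
        return list(records)  # nothing to cut down

    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    rng = random.Random(seed)
    sample = []
    for group_records in groups.values():
        # Allocate slots in proportion to the group's share; keep at least one.
        n = max(1, round(target_size * len(group_records) / total))
        sample.extend(rng.sample(group_records, min(n, len(group_records))))
    return sample


# Toy data with a 60/30/10 split across three hypothetical product categories.
complaints = (
    [{"product": "Credit card", "id": i} for i in range(600)]
    + [{"product": "Mortgage", "id": i} for i in range(300)]
    + [{"product": "Money transfer", "id": i} for i in range(100)]
)
subset = stratified_sample(complaints, key="product", target_size=100)

counts = defaultdict(int)
for row in subset:
    counts[row["product"]] += 1
print(dict(counts))  # {'Credit card': 60, 'Mortgage': 30, 'Money transfer': 10}
```

Because each group's allocation is rounded independently, the final size can differ from the target by a few records when the proportions do not divide evenly.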

## 2. Convert to LangChain Documents

We use `src.docs` to convert rows into structured objects that LangChain understands, preserving metadata for retrieval.

In [None]:
docs = dataframe_to_documents(df)

# Preview a document
if docs:
    print_document_sample(docs[0])
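
To make the conversion concrete, here is a minimal sketch of the row-to-document mapping. It uses a small dataclass as a stand-in for LangChain's `Document` (which carries `page_content` and `metadata`), so it runs without LangChain installed. The column names `complaint_id` and `product` are assumptions, and the actual `dataframe_to_documents` in `src.docs` may behave differently.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for LangChain's Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_docs(rows, text_col="clean_narrative"):
    """Turn each row with non-empty text into a Doc; other columns become metadata."""
    docs = []
    for row in rows:
        text = row.get(text_col)
        if not isinstance(text, str) or not text.strip():
            continue  # skip rows without a usable narrative
        metadata = {k: v for k, v in row.items() if k != text_col}
        docs.append(Doc(page_content=text.strip(), metadata=metadata))
    return docs


rows = [
    {"complaint_id": 1, "product": "Credit card", "clean_narrative": "i was charged twice"},
    {"complaint_id": 2, "product": "Mortgage", "clean_narrative": None},
]
example_docs = rows_to_docs(rows)
print(len(example_docs), example_docs[0].metadata)
```

Keeping the non-text columns in `metadata` is what later allows a retrieved chunk to be traced back to its complaint and product category.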

## 3. Chunk the Documents

We break long narratives into smaller chunks so that each embedding covers a focused span of text, which tends to improve retrieval accuracy.

In [None]:
# Splitting documents into chunks
chunks = chunk_documents(docs)

# Display chunk stats
stats = get_chunk_stats(chunks)
print(f"\nChunking Stats: {stats}")
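
The splitting strategy used by `chunk_documents` is not shown here. One common approach, sketched below as an assumption and not as the project's actual method, slides a fixed-size character window over the text with some overlap, so that a sentence cut at a boundary still appears whole in a neighboring chunk. The sizes are illustrative only.

```python
def split_text(text, chunk_size=200, chunk_overlap=40):
    """Split `text` into windows of at most `chunk_size` characters,
    where consecutive windows share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return pieces


narrative = "".join(str(i % 10) for i in range(500))  # 500-character dummy text
pieces = split_text(narrative, chunk_size=200, chunk_overlap=40)
print([len(p) for p in pieces])  # [200, 200, 180]
```

Smaller chunks give more focused embeddings but produce more vectors to store and search, so the chunk size is a trade-off worth checking against the reported chunk statistics.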

## 4. Create Embeddings and Vector Store (FAISS)

This step converts each chunk into a high-dimensional vector and stores the vectors in a local FAISS index.

In [None]:
# Create and persist vector store
# Note: on the first run, this downloads the embedding model (~80MB)
vectorstore = create_vector_store(chunks)
print("✓ Vector store built and saved to disk.")
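
Conceptually, a vector index stores one vector per chunk and answers a query by returning the chunks whose vectors are closest to the query vector. The toy sketch below illustrates this with a hashed bag-of-words "embedding" and a brute-force cosine search. It is only a stand-in: the real pipeline uses a learned embedding model and FAISS, and the functions `embed`, `build_index`, and `search` are hypothetical names for this example.

```python
import math
import zlib


def embed(text, dim=256):
    """Toy embedding: hashed bag-of-words, L2-normalised. Not a real language model."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(texts):
    """Embed every chunk once and keep (vector, text) pairs in memory."""
    return [(embed(t), t) for t in texts]


def search(index, query, k=3):
    """Rank chunks by cosine similarity (dot product of unit vectors)."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), text) for vec, text in index]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


toy_chunks = [
    "unauthorized charge appeared on my credit card statement",
    "the mortgage servicer lost my escrow payment",
    "debt collector keeps calling my workplace",
]
toy_index = build_index(toy_chunks)
hits = search(toy_index, "unauthorized charge on my credit card", k=2)
for score, text in hits:
    print(f"{score:.2f}  {text}")
```

A learned embedding model replaces the word-hashing step with vectors that capture meaning, so that paraphrases land close together even when they share few words.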

## 5. Test Loading and Retrieval

Verify that we can reload the index from disk and perform a search.

In [None]:
# Test loading from disk
vectorstore_v2 = load_vector_store()

# Test retrieval
query = "unauthorized charge on my credit card"
retriever = get_retriever(vectorstore_v2, k=3)
results = retriever.invoke(query)

# Display results
print_search_results(results, query)
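
Reading the printed results is a useful first check. If an automated check is wanted as well, one option is to count how many retrieved documents mention at least one query term. The sketch below assumes only that each result exposes a `page_content` string, as LangChain documents do. The helper `hits_mentioning` and the sample texts are hypothetical.

```python
from types import SimpleNamespace


def hits_mentioning(results, keywords):
    """Return the retrieved documents whose text contains at least one keyword."""
    lowered = [k.lower() for k in keywords]
    return [doc for doc in results
            if any(k in doc.page_content.lower() for k in lowered)]


# Stand-ins for retrieved documents; real results would come from the retriever.
sample_results = [
    SimpleNamespace(page_content="Unauthorized charge on my card last month"),
    SimpleNamespace(page_content="Escrow shortage notice from my servicer"),
]
relevant = hits_mentioning(sample_results, ["unauthorized", "charge"])
print(f"{len(relevant)}/{len(sample_results)} results mention the query terms")
```

Keyword overlap is only a crude proxy, since semantic search can legitimately return relevant chunks that share no words with the query.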

## 6. Explore Vector Store

A quick look at the index contents.

In [None]:
print(f"Total vectors in index: {vectorstore_v2.index.ntotal:,}")
print("Sample chunk metadata from retriever result:")
# Guard against an empty result list before indexing into it
print(results[0].metadata if results else "No results returned.")