# RAG Pipeline: Chunking, Embedding, and Indexing

This notebook converts processed CFPB complaint narratives into a searchable FAISS vector store.

**Pipeline Steps:**
1. Set up the environment and HuggingFace cache.
2. Load processed data from disk.
3. Convert complaints to LangChain Documents.
4. Split documents into smaller, overlapping chunks.
5. Generate embeddings and build a FAISS vector index.
6. Verify the index with retrieval tests.

In [1]:
import sys
from pathlib import Path
import pandas as pd

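# 1. Add the project root to sys.path so modules under src/ can be imported
#    (assumes the notebook runs from a subfolder one level below the root)
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# 2. Import config and set up the HuggingFace cache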
from src import config
config.setup_hf_cache()

# 3. Import required custom modules
from src.file_handling import load_processed_data
from src.docs import dataframe_to_documents, print_document_sample
from src.chunking import chunk_documents, get_chunk_stats
from src.vectorstore import (
    create_vector_store,
    load_vector_store,
    get_retriever,
    print_search_results,
)

print("✓ Imports and setup complete!")

✓ HuggingFace cache set to: c:\Users\Acer\Documents\KAIM_PROJECT\TEST\rag-complaint-chatbot\models\hf




✓ Imports and setup complete!
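
`config.setup_hf_cache()` lives in `src/config.py`, which is not shown here. The following is a minimal sketch of what such a helper might do, assuming it relies on the `HF_HOME` environment variable that the Hugging Face libraries read to locate their cache. The function body and the `cache_dir` parameter are hypothetical.

```python
import os
from pathlib import Path


def setup_hf_cache(cache_dir):
    """Point the Hugging Face cache at cache_dir.

    Hypothetical stand-in for config.setup_hf_cache; the real helper may differ.
    """
    cache_path = Path(cache_dir).resolve()
    cache_path.mkdir(parents=True, exist_ok=True)
    # Set HF_HOME before importing transformers / sentence-transformers:
    # huggingface_hub reads this variable when it is first imported.
    os.environ["HF_HOME"] = str(cache_path)
    print(f"HuggingFace cache set to: {cache_path}")
    return cache_path
```

This ordering is why the cell above calls the cache setup before importing the modules that load the embedding model.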


## 1. Load Processed Data

We load the data generated by the EDA notebook.

In [2]:
# Load cleaned data
processed_data_path = config.PROCESSED_DATA_PATH
df = load_processed_data(processed_data_path)

# Verify document text exists
text_col = 'clean_narrative'
if text_col in df.columns:
    print(f"✓ Found '{text_col}' for indexing")
    # Show a snippet of the first non-null narrative
    print(f"Snippet: {df[text_col].dropna().iloc[0][:100]}...")
else:
    print(f"❌ ERROR: {text_col} not found in the dataset!")

print(f"Total records loaded: {len(df):,}")

✓ Loaded 12,000 processed DATA from filtered_complaints.csv
✓ Found 'clean_narrative' for indexing
Snippet: during the whole time that i had wells fargo ive experienced on going issues that has never been res...
Total records loaded: 12,000


## 2. Convert to LangChain Documents

We use `src.docs` to convert rows into structured objects that LangChain understands, preserving metadata for retrieval.

In [3]:
docs = dataframe_to_documents(df)

# Preview a document
if docs:
    print_document_sample(docs[0])

✓ Converted 12,000 rows to LangChain Documents
  Sample metadata keys: ['complaint_id', 'product', 'sub_product', 'issue', 'sub_issue', 'company', 'state', 'date_received', 'timely_response', 'consumer_disputed']
DOCUMENT SAMPLE
Content:
during the whole time that i had wells fargo ive experienced on going issues that has never been res was forced to close my account with a balance that did not belong to me. theres been a lot of unaut...
------------------------------------------------------------
Metadata:
  complaint_id: 7075210
  product: Checking or savings account
  sub_product: Checking account
  issue: Problem with a lender or other company charging your account
  sub_issue: Transaction was not authorized
  company: WELLS FARGO & COMPANY
  state: SC
  date_received: 2023-06-05
  timely_response: Yes
  consumer_disputed: None
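
`src.docs.dataframe_to_documents` is not shown, so the sketch below only illustrates the idea of the conversion: the narrative becomes the document text and the remaining columns become metadata. It uses a small dataclass as a stand-in for LangChain's `Document`. `SimpleDocument` and `rows_to_documents` are hypothetical names, and the metadata keys are copied from the sample output above.

```python
from dataclasses import dataclass, field

# Metadata keys copied from the sample output above.
METADATA_KEYS = [
    "complaint_id", "product", "sub_product", "issue", "sub_issue",
    "company", "state", "date_received", "timely_response", "consumer_disputed",
]


@dataclass
class SimpleDocument:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(rows, text_col="clean_narrative"):
    """Convert row dicts (e.g. from df.to_dict("records")) into documents."""
    docs = []
    for row in rows:
        text = row.get(text_col)
        # Skip rows without a usable narrative; there is nothing to embed.
        if not isinstance(text, str) or not text.strip():
            continue
        metadata = {key: row.get(key) for key in METADATA_KEYS}
        docs.append(SimpleDocument(page_content=text.strip(), metadata=metadata))
    return docs


# Usage with two toy rows: the second has no narrative and is skipped.
rows = [
    {"clean_narrative": "my card was charged twice", "complaint_id": 1, "product": "Credit card"},
    {"clean_narrative": None, "complaint_id": 2},
]
toy_docs = rows_to_documents(rows)
print(len(toy_docs), toy_docs[0].metadata["product"])
```

Missing columns are stored as `None` rather than dropped, which mirrors the `consumer_disputed: None` entry in the sample.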


## 3. Chunk the Documents

Long narratives are split into smaller, overlapping chunks so that each embedding covers one focused passage, which generally improves retrieval accuracy.
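
To make the idea concrete, here is a simplified sketch of fixed-size chunking with overlap, using the settings the splitter logs below (`chunk_size=500`, `overlap=50`). It is an assumption-laden stand-in: `src.chunking` is not shown, and a LangChain text splitter would normally prefer paragraph and sentence boundaries over plain character windows. `split_with_overlap` and `chunk_record` are hypothetical names.

```python
def split_with_overlap(text, chunk_size=500, overlap=50):
    """Split text into character windows of at most chunk_size characters.

    Each window starts (chunk_size - overlap) characters after the previous
    one, so consecutive windows share `overlap` characters.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def chunk_record(text, metadata, chunk_size=500, overlap=50):
    """Copy the parent metadata onto every chunk and record its position."""
    return [
        {"page_content": piece, "metadata": {**metadata, "chunk_index": i}}
        for i, piece in enumerate(split_with_overlap(text, chunk_size, overlap))
    ]
```

Copying the parent metadata onto each chunk is what lets a retrieved chunk be traced back to its complaint.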

In [4]:
# Splitting documents into chunks
chunks = chunk_documents(docs)

# Display chunk stats
stats = get_chunk_stats(chunks)
print(f"\nChunking Stats: {stats}")

[OK] Created text splitter (chunk_size=500, overlap=50)
[OK] Chunking complete:
  Original documents: 12,000
  After chunking: 38,024
  Expansion ratio: 3.17x

Chunking Stats: {'total_chunks': 38024, 'min_length': 3, 'max_length': 500, 'mean_length': 372.5, 'median_length': 411}
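
A `min_length` of 3 means at least one chunk is only three characters long, which carries almost no signal for retrieval. The sketch below shows how statistics of this shape could be computed and how very short chunks could be filtered out before embedding. The helper names and the `min_chars=20` threshold are hypothetical and are not taken from `src.chunking`.

```python
import statistics


def chunk_length_stats(texts):
    """Summarize chunk lengths (in characters) for a list of chunk texts."""
    lengths = [len(t) for t in texts]
    if not lengths:
        return {"total_chunks": 0}
    return {
        "total_chunks": len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": round(sum(lengths) / len(lengths), 1),
        "median_length": statistics.median(lengths),
    }


def drop_short_chunks(texts, min_chars=20):
    """Keep only chunks with at least min_chars non-whitespace-padded characters."""
    return [t for t in texts if len(t.strip()) >= min_chars]


# Usage on three toy chunks of length 3, 10, and 500.
toy_chunks = ["abc", "abcdefghij", "x" * 500]
print(chunk_length_stats(toy_chunks))
print(len(drop_short_chunks(toy_chunks)))
```

Whether filtering is worthwhile depends on how many chunks fall below the threshold, so it is worth checking the length distribution first.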


## 4. Create Embeddings and Vector Store (FAISS)

This step embeds each chunk with `sentence-transformers/all-MiniLM-L6-v2` and stores the resulting vectors in a local FAISS index.

In [5]:
# Create and persist vector store
# Note: On the first run, this downloads the embedding model (~80MB)
vectorstore = create_vector_store(chunks)
print("✓ Vector store built and saved to disk.")

Creating vector store with 38,024 documents...
  Persist directory: c:\Users\Acer\Documents\KAIM_PROJECT\TEST\rag-complaint-chatbot\vector_store\faiss
Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
  (First run will download ~80MB to c:\Users\Acer\Documents\KAIM_PROJECT\TEST\rag-complaint-chatbot\models\hf)


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`

✓ Embedding model loaded
✓ FAISS index built (ntotal=38,024)
✓ Vector store persisted to c:\Users\Acer\Documents\KAIM_PROJECT\TEST\rag-complaint-chatbot\vector_store\faiss
✓ Vector store built and saved to disk.
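
Conceptually, an exact ("flat") vector index answers a query by comparing the query vector with every stored vector and returning the closest ones. The sketch below illustrates that idea in plain Python with toy 3-dimensional vectors, whereas `all-MiniLM-L6-v2` produces 384-dimensional embeddings. It assumes Euclidean (L2) distance; which index type and metric `create_vector_store` configures is not shown in this notebook, and `flat_search` is a hypothetical name.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(query_vec, index_vectors, k=3):
    """Return (position, distance) for the k stored vectors closest to query_vec."""
    scored = [(i, l2_distance(query_vec, vec)) for i, vec in enumerate(index_vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Usage: the query matches vector 1 exactly and is close to vector 2.
toy_index = [
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
]
hits = flat_search([1.0, 0.0, 0.0], toy_index, k=2)
print(hits)
```

FAISS performs the same comparison with optimized native code, which is what makes exhaustive search over 38,024 vectors practical on a laptop.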


## 5. Test Loading and Retrieval

Verify that we can reload the index from disk and perform a search.

In [6]:
# Test loading from disk
vectorstore_v2 = load_vector_store()

# Test retrieval
query = "unauthorized charge on my credit card"
retriever = get_retriever(vectorstore_v2, k=3)
results = retriever.invoke(query)

# Display results
print_search_results(results, query)

Loading vector store from c:\Users\Acer\Documents\KAIM_PROJECT\TEST\rag-complaint-chatbot\vector_store\faiss...
Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
  (First run will download ~80MB to c:\Users\Acer\Documents\KAIM_PROJECT\TEST\rag-complaint-chatbot\models\hf)
✓ Embedding model loaded
✓ Vector store loaded (ntotal=38,024)
✓ Created retriever (k=3)
SEARCH RESULTS for: 'unauthorized charge on my credit card'

--- Result 1 ---
Complaint ID: N/A
Product: N/A
Issue: N/A
Company: N/A
Chunk Index: 4
Content preview:
. furthermore, i did not have the credit card in my physical presence and eyesight the entire time i was in xxxx ( xx xx xxxx-xx xx xxxx ). i handed the card over to merchants on several occasions for them to run it through for other charges that i recognize ( like for a dinner or gas ). so, it is p...

--- Result 2 ---
Complaint ID: N/A
Product: N/A
Issue: N/A
Company: N/A
Chunk Index: 0
Content preview:
my card was charge unauthorize. but credit card co

## 6. Explore Vector Store

Inspect the index size and the metadata stored with a retrieved chunk.

In [7]:
print(f"Total vectors in index: {vectorstore_v2.index.ntotal:,}")
print("Sample chunk metadata from retriever result:")
print(results[0].metadata)

Total vectors in index: 38,024
Sample chunk metadata from retriever result:
{'complaint_id': 3374042, 'product': 'Credit card or prepaid card', 'sub_product': 'General-purpose credit card or charge card', 'issue': 'Problem with a purchase shown on your statement', 'sub_issue': 'Card was charged for something you did not purchase with the card', 'company': 'CAPITAL ONE FINANCIAL CORPORATION', 'state': 'IL', 'date_received': '2019-09-13', 'timely_response': 'Yes', 'consumer_disputed': None, 'chunk_index': 4}
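
The metadata above carries `complaint_id`, `product`, `issue`, and `company`, yet the search output in section 5 prints `N/A` for those fields. `print_search_results` is not shown, so the cause is unknown; a key-name mismatch in the display helper is one possibility. The sketch below is a formatter that reads the lowercase keys exactly as they appear in the metadata. `format_result` is a hypothetical name, not part of `src.vectorstore`.

```python
def format_result(rank, page_content, metadata, preview_chars=300):
    """Build a printable summary for one retrieved chunk."""
    preview = page_content[:preview_chars]
    if len(page_content) > preview_chars:
        preview += "..."
    lines = [
        f"--- Result {rank} ---",
        # Keys match the metadata dictionary printed above; missing keys show N/A.
        f"Complaint ID: {metadata.get('complaint_id', 'N/A')}",
        f"Product: {metadata.get('product', 'N/A')}",
        f"Issue: {metadata.get('issue', 'N/A')}",
        f"Company: {metadata.get('company', 'N/A')}",
        f"Chunk Index: {metadata.get('chunk_index', 'N/A')}",
        "Content preview:",
        preview,
    ]
    return "\n".join(lines)


# With the retriever results from section 5, this could be used as:
# for rank, doc in enumerate(results, start=1):
#     print(format_result(rank, doc.page_content, doc.metadata))
```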
