In [3]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
%reload_ext autoreload

In [1]:
#%pip install ipywidgets

In [4]:
import pandas as pd
import numpy as np
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
from faiss import IndexFlatL2
import pickle
from pathlib import Path
import uuid
import os
import sys

In [7]:
sys.path.append(os.path.abspath('../src/'))

In [8]:
from data_process import data_loader
from text_chunker import chunk_narratives

In [9]:
df = data_loader('../data/processed/filtered_complaints.csv')

2025-08-19 21:01:19,099 - INFO - Data loaded successfully from ../data/processed/filtered_complaints.csv


In [10]:
# 1. Text Chunking
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # ~75-100 English words, suitable for coherent complaint segments
    chunk_overlap=50,  # Small overlap to maintain context across chunks
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)

In [11]:
# Generate chunks
chunks, metadata = chunk_narratives(df, text_splitter)
print(f"Total chunks created: {len(chunks)}")

Total chunks created: 15055


In [12]:
# 2. Generate Embeddings
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = embedding_model.encode(chunks, batch_size=32, show_progress_bar=True)
print(f"Embedding shape: {embeddings.shape}")

Embedding shape: (15055, 384)


In [13]:
embedding_model.save('../models/all-MiniLM-L6-v2')

In [14]:
# 3. Create and Populate FAISS Index
dimension = embeddings.shape[1]
index = IndexFlatL2(dimension)
index.add(embeddings)

In [15]:
# 4. Save Vector Store and Metadata
vector_store_path = '../vector_store/faiss_index.bin'
metadata_path = '../vector_store/metadata.pkl'

In [16]:
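# Save the FAISS index in FAISS's native format rather than pickling the wrapper object;
# reload it later with faiss.read_index(vector_store_path)
from faiss import write_index
write_index(index, vector_store_path)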

In [17]:
# Save metadata
with open(metadata_path, 'wb') as f:
    pickle.dump({'chunks': chunks, 'metadata': metadata}, f)

In [18]:
print(f"Vector store saved to: {vector_store_path}")
print(f"Metadata saved to: {metadata_path}")

Vector store saved to: ../vector_store/faiss_index.bin
Metadata saved to: ../vector_store/metadata.pkl


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4541 entries, 0 to 4540
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Consumer complaint narrative  4541 non-null   object
 1   Product                       4541 non-null   object
 2   Issue                         4541 non-null   object
 3   Company                       4541 non-null   object
 4   Date received                 4541 non-null   object
 5   narrative_length              4541 non-null   int64 
 6   cleaned_narrative             4541 non-null   object
dtypes: int64(1), object(6)
memory usage: 248.5+ KB


In [20]:
# Save a sample of chunks for verification
sample_chunks = pd.DataFrame({
    'chunk_id': [m['chunk_id'] for m in metadata],
    'product': [m['product'] for m in metadata],
    'chunk_text': chunks
})
sample_path = '../vector_store/sample_chunks.csv'
sample_chunks.head(10).to_csv(sample_path, index=False)
print(f"Sample chunks saved to: {sample_path}")

Sample chunks saved to: ../vector_store/sample_chunks.csv


### Report Section: Chunking Strategy and Embedding Model Choice

For text chunking, I used LangChain's RecursiveCharacterTextSplitter with a chunk_size of 500 characters and a chunk_overlap of 50 characters. At roughly 75-100 English words, a 500-character chunk is long enough to hold a coherent segment of a complaint narrative yet short enough that its embedding stays focused on a single topic. Complaints often raise several distinct issues (e.g., billing disputes, customer service problems), and smaller chunks help isolate these for better retrieval precision.
The 50-character overlap preserves context across chunk boundaries, which matters most when a narrative is split mid-sentence. I also tried larger chunks (e.g., 1000 characters), which risked diluting specific issues in longer narratives, and smaller chunks (e.g., 200 characters), which fragmented context excessively. I validated the chosen parameters by inspecting sample chunks and confirming that they retained meaningful complaint details.
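
To make the size trade-off concrete, here is a minimal sketch that counts chunks and average words per chunk at the three sizes mentioned above. It uses a plain fixed-width sliding window, which is a simplification and not LangChain's recursive separator logic; the helper names `sliding_chunks` and `summarize` and the sample narrative are made up for illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width windows that share chunk_overlap characters.

    Simplified stand-in for a real splitter: it ignores sentence and
    paragraph separators and cuts purely by character position.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def summarize(text: str, sizes=(200, 500, 1000), overlap: int = 50):
    """Return (chunk_size, number_of_chunks, average_words_per_chunk) rows."""
    rows = []
    for size in sizes:
        chunks = sliding_chunks(text, size, overlap)
        avg_words = sum(len(c.split()) for c in chunks) / len(chunks)
        rows.append((size, len(chunks), round(avg_words, 1)))
    return rows


# Made-up narrative, repeated to simulate a long complaint
narrative = "I was charged a late fee even though my payment posted on time. " * 40

for size, n_chunks, avg_words in summarize(narrative):
    print(f"chunk_size={size}: {n_chunks} chunks, ~{avg_words} words per chunk")
```

Running the same comparison on real narratives with the actual splitter would be the more faithful check; this version only shows how chunk count and words per chunk move as chunk_size changes.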

I selected the sentence-transformers/all-MiniLM-L6-v2 model for embedding because of its efficiency and its strong performance on semantic similarity tasks. This lightweight model (about 22M parameters, 384-dimensional embeddings) is designed for sentences and short paragraphs, which suits 500-character complaint chunks. It offers a good balance between embedding quality and computational cost, which matters when indexing large datasets such as the CFPB complaints. Because the model was trained on diverse datasets, it should handle financial terminology and consumer language reasonably well, although it was not tuned specifically for this domain.
I chose FAISS for the vector store because of its speed and scalability for similarity search; the index type used here, IndexFlatL2, performs exact L2 search. Metadata (complaint ID, product, chunk ID) is stored in a companion pickle file in the same order as the embeddings, so each search result can be traced back to its original complaint. Both the vector store and the metadata are persisted in the vector_store/ directory for downstream retrieval tasks.
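
As a rough illustration of how search results map back to metadata, the sketch below runs an exact squared-L2 search in pure Python and looks up each hit by position. It is a toy stand-in for the FAISS index, not the notebook's code: the 3-dimensional vectors, the product labels, and the helper names `l2_search` and `retrieve` are all made up for illustration.

```python
import heapq


def l2_search(vectors, query, k):
    """Exact nearest-neighbour search: return the k smallest (squared L2 distance, position) pairs."""
    scored = (
        (sum((a - b) ** 2 for a, b in zip(vec, query)), pos)
        for pos, vec in enumerate(vectors)
    )
    return heapq.nsmallest(k, scored)


# Toy 3-dimensional "embeddings"; the real ones have 384 dimensions
vectors = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.9],
]

# Hypothetical metadata records, kept in the same order as the vectors
metadata = [
    {"chunk_id": "c0", "product": "Credit card"},
    {"chunk_id": "c1", "product": "Mortgage"},
    {"chunk_id": "c2", "product": "Student loan"},
]


def retrieve(query, k=2):
    """Attach the metadata record at each hit's position to its distance."""
    return [
        {**metadata[pos], "distance": dist}
        for dist, pos in l2_search(vectors, query, k)
    ]


results = retrieve([1.0, 0.0, 0.0])
for hit in results:
    print(hit["chunk_id"], hit["product"], round(hit["distance"], 3))
```

In the notebook itself, the equivalent lookup would be `index.search(query_embeddings, k)` on a 2-D float32 array of query vectors. It returns distances and positions, and the positions index into the chunks and metadata lists, assuming those lists keep the order in which the embeddings were added.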