# Task 2: Text Chunking, Embedding & Vector Store Indexing

This notebook demonstrates the vector store indexing pipeline:
1. Load preprocessed complaints
2. Chunk text using RecursiveCharacterTextSplitter
3. Generate embeddings with sentence-transformers
4. Build and persist FAISS index

**Note:** For full indexing of 471k complaints, use `python src/index_vector_store.py` instead (takes ~23 min).

In [1]:
import sys
sys.path.insert(0, '..')

import pandas as pd
import numpy as np
from pathlib import Path
from tqdm.notebook import tqdm
import faiss
import pickle
from sentence_transformers import SentenceTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter
import hashlib



## Configuration

In [2]:
# Paths
FILTERED_DATA_PATH = Path('../data/filtered_complaints.csv')
VECTOR_STORE_PATH = Path('../vector_store')
FAISS_INDEX_PATH = VECTOR_STORE_PATH / 'faiss_index.bin'
METADATA_PATH = VECTOR_STORE_PATH / 'metadata.pkl'

# Chunking parameters
CHUNK_SIZE = 500
CHUNK_OVERLAP = 100

# Embedding model
EMBEDDING_MODEL = 'sentence-transformers/paraphrase-MiniLM-L3-v2'
EMBEDDING_DIM = 384

# For demo, limit rows (set to None for full dataset)
DEMO_LIMIT = 1000

## 1. Load Preprocessed Data

In [3]:
df = pd.read_csv(FILTERED_DATA_PATH)
print(f"Total complaints: {len(df):,}")

if DEMO_LIMIT:
    df = df.head(DEMO_LIMIT)
    print(f"Using demo subset: {len(df):,} complaints")

df.head()

Total complaints: 471,668
Using demo subset: 1,000 complaints


| Unnamed: 0 | complaint_id | product | product_original | sub_product | issue | sub_issue | narrative | company | date_received |
|------|------|------|------|------|------|------|------|------|------|
| 0 | 14069121 | credit_card | Credit card | Store credit card | Getting a credit card | Card opened without my consent or knowledge | a [REDACTED] [REDACTED] card was opened under ... | CITIBANK, N.A. | 2025-06-13 |
| 1 | 14061897 | savings_account | Checking or savings account | Checking account | Managing an account | Deposits and withdrawals | i made the mistake of using my wellsfargo debi... | WELLS FARGO & COMPANY | 2025-06-13 |
| 2 | 14047085 | credit_card | Credit card | General-purpose credit card or charge card | Other features, terms, or problems | Other problem | dear cfpb, i have a secured credit card with c... | CITIBANK, N.A. | 2025-06-12 |
| 3 | 14040217 | credit_card | Credit card | General-purpose credit card or charge card | Incorrect information on your report | Account information incorrect | i have a citi rewards cards. the credit balanc... | CITIBANK, N.A. | 2025-06-12 |
| 4 | 13968411 | credit_card | Credit card | General-purpose credit card or charge card | Problem with a purchase shown on your statement | Credit card company isn't resolving a dispute ... | b'i am writing to dispute the following charge... | CITIBANK, N.A. | 2025-06-09 |


## 2. Text Chunking

We use `RecursiveCharacterTextSplitter` from LangChain:
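- **chunk_size=500**: Caps each chunk at 500 characters (measured with `len`), which keeps chunks focused while preserving local context
- **chunk_overlap=100**: Lets consecutive chunks share up to 100 characters, so text cut at a boundary still appears with some of its surrounding context
- **separators**: Tries paragraph breaks first, then line breaks, sentence ends (`". "`), spaces, and finally individual characters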
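
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width sliding window. It is a deliberate simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers separator boundaries, so its chunks are often shorter than the limit. `sliding_window_chunks` is a hypothetical helper introduced only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Cut text into fixed-width character windows that overlap by chunk_overlap.

    Illustration only: this ignores separators, unlike RecursiveCharacterTextSplitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Synthetic 1,200-character input (not complaint data)
demo_text = "".join(str(i % 10) for i in range(1200))
demo_chunks = sliding_window_chunks(demo_text)

print([len(c) for c in demo_chunks])
print(demo_chunks[0][-100:] == demo_chunks[1][:100])  # tail of one chunk equals head of the next
```

With a step of 400 characters, the last 100 characters of each window reappear at the start of the next one, which is the continuity that the overlap setting is meant to provide.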

In [4]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)

# Example: chunk a single narrative
sample_narrative = df['narrative'].iloc[0]
sample_chunks = splitter.split_text(sample_narrative)

print(f"Original length: {len(sample_narrative)} chars")
print(f"Number of chunks: {len(sample_chunks)}")
print(f"\nFirst chunk ({len(sample_chunks[0])} chars):")
print(sample_chunks[0][:300] + "...")

Original length: 541 chars
Number of chunks: 2

First chunk (332 chars):
a [REDACTED] [REDACTED] card was opened under my name by a fraudster. i received a notice from [REDACTED] that an account was just opened under my name. i reached out to [REDACTED] [REDACTED] to state that this activity was unauthorized and not me. [REDACTED] [REDACTED] confirmed this was fraudulent...


In [5]:
# Chunk all complaints
chunks = []

for _, row in tqdm(df.iterrows(), total=len(df), desc="Chunking"):
    narrative = row['narrative']
    if pd.isna(narrative) or not narrative.strip():
        continue
    
    text_chunks = splitter.split_text(narrative)
    
    for i, chunk_text in enumerate(text_chunks):
        chunk_id = hashlib.md5(f"{row['complaint_id']}_{i}".encode()).hexdigest()
        
        chunks.append({
            'id': chunk_id,
            'text': chunk_text,
            'metadata': {
                'complaint_id': str(row['complaint_id']),
                'product': row['product'],
                'product_original': row['product_original'] if pd.notna(row['product_original']) else '',
                'issue': row['issue'] if pd.notna(row['issue']) else '',
                'company': row['company'] if pd.notna(row['company']) else '',
                'chunk_index': i,
                'total_chunks': len(text_chunks)
            }
        })

print(f"\nCreated {len(chunks):,} chunks from {len(df):,} complaints")
print(f"Average chunks per complaint: {len(chunks)/len(df):.2f}")



Created 3,673 chunks from 1,000 complaints
Average chunks per complaint: 3.67
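
Each chunk ID above is the MD5 hex digest of the string `{complaint_id}_{chunk_index}`, so re-running the pipeline on the same data reproduces the same IDs. That can be convenient for deduplication or incremental re-indexing. A minimal sketch of the property, where `make_chunk_id` is a hypothetical wrapper around the notebook's expression:

```python
import hashlib


def make_chunk_id(complaint_id, chunk_index: int) -> str:
    """Derive a stable ID from the complaint ID and the chunk's position."""
    return hashlib.md5(f"{complaint_id}_{chunk_index}".encode()).hexdigest()


id_first = make_chunk_id(14069121, 0)
id_first_again = make_chunk_id(14069121, 0)
id_second = make_chunk_id(14069121, 1)

print(id_first == id_first_again)  # same inputs give the same ID
print(id_first != id_second)       # a different chunk index gives a different ID
print(len(id_first))               # MD5 hex digests are 32 characters long
```

MD5 is used here only as a compact fingerprint, not for security.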


## 3. Generate Embeddings

In [6]:
# Load embedding model
model = SentenceTransformer(EMBEDDING_MODEL)
print(f"Model: {EMBEDDING_MODEL}")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")

Model: sentence-transformers/paraphrase-MiniLM-L3-v2
Embedding dimension: 384


In [7]:
# Generate embeddings
texts = [c['text'] for c in chunks]
embeddings = model.encode(texts, show_progress_bar=True, convert_to_numpy=True)

print(f"\nEmbeddings shape: {embeddings.shape}")



Embeddings shape: (3673, 384)


## 4. Build FAISS Index

In [8]:
# Create an exact (brute-force) FAISS index; search returns squared L2 distances
index = faiss.IndexFlatL2(EMBEDDING_DIM)

# Add embeddings
embeddings_float32 = embeddings.astype('float32')
index.add(embeddings_float32)

print(f"FAISS index built with {index.ntotal:,} vectors")

FAISS index built with 3,673 vectors
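
`IndexFlatL2` performs exact brute-force search and reports squared Euclidean distances. For unit-length vectors, the squared distance and the cosine similarity are linked by d² = 2 − 2·cos, so a squared distance between normalized vectors can never exceed 4. Distances well above 4, such as those in the test query later in this notebook, therefore indicate that the embeddings are not normalized. The sketch below checks the identity in plain Python with made-up 3-dimensional vectors; `l2_normalize`, `squared_l2`, and `dot` are illustrative helpers, not part of FAISS.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


# Made-up vectors standing in for embeddings
a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([1.0, 2.0, 2.0])

cos_sim = dot(a, b)
d2 = squared_l2(a, b)

print(math.isclose(d2, 2 - 2 * cos_sim))  # identity holds for unit vectors
```

If cosine-style scores are preferred, one option is to pass `normalize_embeddings=True` to `model.encode` before indexing. That is a design choice this notebook does not make, and the nearest-neighbor ranking can differ between the two setups.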


In [9]:
# Prepare metadata for storage
metadata_list = []
for c in chunks:
    metadata_list.append({
        'id': c['id'],
        'text': c['text'],
        **c['metadata']
    })

print(f"Metadata entries: {len(metadata_list):,}")

Metadata entries: 3,673


## 5. Test Query

In [10]:
# Test semantic search
test_query = "billing dispute credit card"
query_embedding = model.encode([test_query], convert_to_numpy=True).astype('float32')

k = 5
distances, indices = index.search(query_embedding, k)

print(f"Query: '{test_query}'")
print(f"\nTop {k} results:")
for i, (dist, idx) in enumerate(zip(distances[0], indices[0]), 1):
    meta = metadata_list[idx]
    print(f"\n{i}. [Distance: {dist:.4f}] Product: {meta['product']}")
    print(f"   Issue: {meta['issue']}")
    print(f"   Text: {meta['text'][:150]}...")

Query: 'billing dispute credit card'

Top 5 results:

1. [Distance: 25.5237] Product: credit_card
   Issue: Problem with a purchase shown on your statement
   Text: . the credit card company is acting in fraudulent matter making me pay a charge for something i no longer have,...

2. [Distance: 26.3177] Product: credit_card
   Issue: Fees or interest
   Text: am writing to formally address serious concerns regarding my experience with the credit card account issued by your company, under account number [RED...

3. [Distance: 26.3630] Product: credit_card
   Issue: Getting a credit card
   Text: . i believe this constitutes a violation of the fair credit billing act and request immediate assistance in resolving this matter....

4. [Distance: 26.7926] Product: credit_card
   Issue: Problem with a purchase shown on your statement
   Text: . at this point, they were already hand my credit card and had the information. no charge was put on my card until i arrived back in the us where they...
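
The loop above relies on the row position in the FAISS index being the only link back to `metadata_list`. FAISS pads the result with index `-1` when it cannot return `k` neighbors (for example, when the index holds fewer than `k` vectors), and in Python `metadata_list[-1]` would silently return the last entry. A minimal guard, assuming the hypothetical helper name `collect_hits` and toy inputs in place of real search results:

```python
def collect_hits(distances, indices, metadata):
    """Pair each valid row index with its metadata entry, skipping padded slots."""
    hits = []
    for dist, idx in zip(distances, indices):
        idx = int(idx)
        if idx < 0 or idx >= len(metadata):
            continue  # -1 marks "no result"; out-of-range means index and metadata are out of sync
        hits.append({"distance": float(dist), **metadata[idx]})
    return hits


# Toy inputs (made up for illustration)
toy_metadata = [{"product": "credit_card"}, {"product": "savings_account"}]
toy_distances = [0.5, 1.25, 9.9]
toy_indices = [1, 0, -1]

hits = collect_hits(toy_distances, toy_indices, toy_metadata)
print(len(hits))
print(hits[0]["product"])
```

In the notebook this would be called as `collect_hits(distances[0], indices[0], metadata_list)`.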

## 6. Save Index (Optional)

**Note:** This demo uses a subset. For full indexing, run:
```bash
python src/index_vector_store.py
```

In [11]:
# Uncomment to save demo index
# VECTOR_STORE_PATH.mkdir(parents=True, exist_ok=True)
# faiss.write_index(index, str(FAISS_INDEX_PATH))
# with open(METADATA_PATH, 'wb') as f:
#     pickle.dump(metadata_list, f)
# print("Index saved!")
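
When the saved files are loaded again, the metadata list has to line up with the index row for row, because the position in the index is what maps a search hit back to its chunk. The sketch below shows the metadata half of that round trip with a count check. `save_metadata` and `load_metadata` are hypothetical helpers, and the demo writes to a temporary directory rather than to `VECTOR_STORE_PATH`.

```python
import pickle
import tempfile
from pathlib import Path


def save_metadata(metadata, path):
    """Pickle the metadata list, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(metadata, f)


def load_metadata(path, expected_count):
    """Load the metadata list and verify it has one entry per indexed vector."""
    with open(path, "rb") as f:
        metadata = pickle.load(f)
    if len(metadata) != expected_count:
        raise ValueError(
            f"metadata has {len(metadata)} entries, expected {expected_count}"
        )
    return metadata


# Round trip with made-up entries in a temporary directory
demo_metadata = [{"id": "a", "text": "first chunk"}, {"id": "b", "text": "second chunk"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "vector_store" / "metadata.pkl"
    save_metadata(demo_metadata, demo_path)
    restored = load_metadata(demo_path, expected_count=len(demo_metadata))

print(restored == demo_metadata)
```

In the notebook, the index itself would be restored with `faiss.read_index(str(FAISS_INDEX_PATH))`, and `index.ntotal` would serve as `expected_count`. Pickle files should only be loaded from sources you trust.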

## Summary

| Step | Setting |
|------|---------|
| Chunking | 500-character chunks, 100-character overlap |
| Embedding | paraphrase-MiniLM-L3-v2 (384-dim) |
| Index | FAISS IndexFlatL2 |

For full production indexing (~23 min for 471k complaints → 1.6M chunks), use the CLI script.