## Embedding and Indexing Script

### Objective
This notebook processes consumer complaint narratives by:
1. Chunking the text data using a consistent strategy.
2. Embedding each chunk using a pre-defined embedding model.
3. Storing the embeddings in a FAISS vector index along with associated metadata for future semantic search and retrieval.


In [1]:
import sys
import os

# Go two levels up from the notebook to the project root
project_root = os.path.abspath(os.path.join(os.getcwd(), "../.."))

# Join the path to 'src'
src_path = os.path.join(project_root, "src")

# Add 'src' to Python path
if src_path not in sys.path:
    sys.path.append(src_path)

# Confirm it's added
print("src path added:", src_path)


src path added: c:\Users\ABC\Desktop\10Acadamy\week_6\Intelligent-Complaint-Analysis-for-Financial-Services\src


### ⚙️ Configuration

- **Input File**:  
  `complaints_clean.csv`  
  Path:  
  `C:\Users\ABC\Desktop\10Acadamy\week_6\Intelligent-Complaint-Analysis-for-Financial-Services\data\clean\complaints_clean.csv`

- **Chunking Parameters**:
  - `CHUNK_SIZE = 300`
  - `CHUNK_OVERLAP = 50`
  - `CHUNKSIZE = 1000` (rows of the sample processed per loop iteration)
  - `SAMPLE_SIZE = 5000` (total rows read from the CSV)

- **Output Paths** (relative to `data\clean\` in the project root):
  - FAISS Index:  
    `vector_store/index_300_50.faiss`
  - Metadata CSV:  
    `vector_store/meta_300_50.csv`
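
To make the relationship between the 5,000-row sample limit and the 1,000-row loop size concrete, the sketch below lists the row ranges such a loop would visit. The helper name `batch_ranges` is hypothetical and not part of the project; it only mirrors a `range(0, total, step)` loop over row offsets.

```python
def batch_ranges(total_rows: int, batch_size: int) -> list[tuple[int, int]]:
    """Return (start, stop) row offsets, mirroring range(0, total_rows, batch_size)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        (start, min(start + batch_size, total_rows))
        for start in range(0, total_rows, batch_size)
    ]


ranges = batch_ranges(total_rows=5000, batch_size=1000)
print(len(ranges))            # 5
print(ranges[0], ranges[-1])  # (0, 1000) (4000, 5000)
```

If the CSV holds fewer rows than the sample limit, the last range is simply shorter and fewer batches are produced.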

---


In [2]:
# notebooks/embed_index_modular.ipynb

from embedding import EmbeddingModel
from chunking import batch_chunk_texts, chunk_with_langchain
from indexing import FaissIndexer
import pandas as pd
from tqdm import tqdm
import os

# --- Config ---
CSV_PATH = r"C:\Users\ABC\Desktop\10Acadamy\week_6\Intelligent-Complaint-Analysis-for-Financial-Services\data\clean\complaints_clean.csv"
INDEX_PATH = r"C:\Users\ABC\Desktop\10Acadamy\week_6\Intelligent-Complaint-Analysis-for-Financial-Services\data\clean\vector_store\index_300_50.faiss"
META_PATH = r"C:\Users\ABC\Desktop\10Acadamy\week_6\Intelligent-Complaint-Analysis-for-Financial-Services\data\clean\vector_store\meta_300_50.csv"
CHUNK_SIZE = 300
CHUNK_OVERLAP = 50
CHUNKSIZE = 1000
SAMPLE_SIZE = 5000





### 🛠️ Pipeline Steps

1. **Read Sample Data**:
   - Load up to 5,000 complaint records using pandas.

2. **Chunking**:
   - For each record, extract the `"Consumer complaint narrative"`.
   - Use the `chunk_with_langchain()` function to split the text into overlapping chunks.
   - Track metadata (`Complaint ID`, `Product`, chunk text).

3. **Embedding**:
   - Use the `EmbeddingModel` class to encode each chunk.
   - The embedding model used is assumed to be `sentence-transformers/all-MiniLM-L6-v2`.

4. **Indexing**:
   - Use `FaissIndexer` to add the embeddings to a FAISS index for fast similarity search.
   - Metadata is saved to a `.csv` file for traceability.
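
The effect of `chunk_size` and `chunk_overlap` in step 2 can be illustrated with a plain sliding window. This is a simplified stand-in, not the project's `chunk_with_langchain()`: LangChain's splitters also try to break on separators, so their chunk boundaries will generally differ. The function name `sliding_window_chunks` is hypothetical, and counting sizes in characters is an assumption.

```python
def sliding_window_chunks(text: str, chunk_size: int = 300, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Toy input: 700 characters cycling through the alphabet.
sample = "".join(chr(97 + i % 26) for i in range(700))
pieces = sliding_window_chunks(sample)
print([len(p) for p in pieces])          # [300, 300, 200]
print(pieces[0][-50:] == pieces[1][:50])  # True
```

With a size of 300 and an overlap of 50, each new window starts 250 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next.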

---

### 📂 Modular Design

This notebook leverages a modular design by importing from custom Python modules:
- `embedding.py` → Embedding logic.
- `chunking.py` → Text splitting logic using LangChain.
- `indexing.py` → FAISS indexing and metadata handling.

This keeps the notebook short and lets downstream scripts reuse the same chunking, embedding, and indexing code.
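
The notebook relies on only a small surface of these modules: `EmbeddingModel.get_dimension()`, `EmbeddingModel.encode()`, `FaissIndexer.add()`, and `FaissIndexer.save()`. The sketch below writes that surface down as `typing.Protocol` classes and pairs it with in-memory stand-ins, which is one possible way to dry-run the loop without loading a model or FAISS. The protocol names, the stand-in classes, the type hints, and the toy data are all assumptions; the real modules are not shown in this notebook.

```python
from typing import Protocol, Sequence


class Embedder(Protocol):
    """Assumed shape of the embedding interface used by the notebook."""
    def get_dimension(self) -> int: ...
    def encode(self, texts: Sequence[str]) -> list[list[float]]: ...


class Indexer(Protocol):
    """Assumed shape of the indexing interface used by the notebook."""
    def add(self, embeddings: Sequence[Sequence[float]], metadata: Sequence[dict]) -> None: ...
    def save(self, index_path: str, meta_path: str) -> None: ...


class FakeEmbedder:
    """Hypothetical stand-in: derives toy numbers from text length, not real embeddings."""
    def __init__(self, dim: int = 4) -> None:
        self._dim = dim

    def get_dimension(self) -> int:
        return self._dim

    def encode(self, texts: Sequence[str]) -> list[list[float]]:
        return [[float(len(t) % (i + 2)) for i in range(self._dim)] for t in texts]


class InMemoryIndexer:
    """Hypothetical stand-in: keeps vectors and metadata in Python lists."""
    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.vectors: list[list[float]] = []
        self.metadata: list[dict] = []

    def add(self, embeddings: Sequence[Sequence[float]], metadata: Sequence[dict]) -> None:
        if len(embeddings) != len(metadata):
            raise ValueError("one metadata row is required per embedding")
        for vec in embeddings:
            if len(vec) != self.dim:
                raise ValueError("embedding dimension mismatch")
        self.vectors.extend(list(v) for v in embeddings)
        self.metadata.extend(metadata)

    def save(self, index_path: str, meta_path: str) -> None:
        # Intentionally does nothing: the stand-in keeps everything in memory.
        pass


embedder: Embedder = FakeEmbedder(dim=4)
indexer = InMemoryIndexer(dim=embedder.get_dimension())

texts = ["late fee charged twice", "card declined abroad"]  # toy chunks
meta = [{"complaint_id": i, "product": "Credit card", "chunk_text": t}
        for i, t in enumerate(texts)]
indexer.add(embedder.encode(texts), meta)
print(len(indexer.vectors), len(indexer.metadata))  # 2 2
```

Keeping vectors and metadata rows in the same order is what lets a search hit at position `i` be traced back to its complaint.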

---

In [3]:
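# Create the directory that INDEX_PATH and META_PATH point to, not a relative "vector_store" folder
os.makedirs(os.path.dirname(INDEX_PATH), exist_ok=True)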
df_sample = pd.read_csv(CSV_PATH, nrows=SAMPLE_SIZE)
embedder = EmbeddingModel()
indexer = FaissIndexer(dim=embedder.get_dimension())

for i in range(0, len(df_sample), CHUNKSIZE):
    df = df_sample.iloc[i:i+CHUNKSIZE]
    chunk_texts = []
    chunk_meta = []

    for _, row in df.iterrows():
        complaint_id = row.get("Complaint ID")
        product = row.get("Product")
        text = row.get("Consumer complaint narrative", "")

        if not isinstance(text, str) or not text.strip():
            continue

        chunks = chunk_with_langchain(text, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
        for chunk in chunks:
            chunk_texts.append(chunk)
            chunk_meta.append({
                "complaint_id": complaint_id,
                "product": product,
                "chunk_text": chunk
            })

    if chunk_texts:
        embeddings = embedder.encode(chunk_texts)
        indexer.add(embeddings, chunk_meta)



### Notes

- Non-string or empty complaint texts are skipped.
- `tqdm` is imported but not applied to the loop above; wrapping the batch `range(...)` in `tqdm(...)` would display progress.
- The final FAISS index and metadata file are saved and can be used in the RAG pipeline's retriever module.
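
The first note matters because pandas represents a missing CSV cell as `NaN`, which is a `float` rather than a string. The check below mirrors the loop's condition on a few toy values; the helper name `is_usable_narrative` is hypothetical.

```python
def is_usable_narrative(value: object) -> bool:
    """Mirror the notebook's filter: keep only non-empty, non-whitespace strings."""
    return isinstance(value, str) and bool(value.strip())


# Toy values: a real narrative, empty and whitespace-only strings,
# a NaN as pandas would produce for a missing cell, and other non-strings.
candidates = ["I was charged twice.", "", "   ", float("nan"), None, 42]
kept = [c for c in candidates if is_usable_narrative(c)]
print(kept)  # ['I was charged twice.']
```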

---

### Use Case
The saved index and metadata will later be used to:
- Perform semantic similarity search based on user questions.
- Retrieve relevant complaint chunks to feed into a language model for answer generation.
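
As a rough illustration of the retrieval step, the sketch below ranks a handful of toy vectors by cosine similarity to a query vector and returns the matching metadata rows. It is a brute-force stand-in in pure Python, not FAISS, and the vectors, metadata, and function names are invented for the example. In the real pipeline the query would presumably be embedded with the same model used for the chunks.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: list[list[float]], metadata: list[dict], k: int = 2):
    """Return the k (score, metadata) pairs most similar to the query, best first."""
    scored = [(cosine(query, v), m) for v, m in zip(vectors, metadata)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" with their metadata rows (all values are made up).
vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
metadata = [
    {"complaint_id": "A", "chunk_text": "unauthorized card charge"},
    {"complaint_id": "B", "chunk_text": "duplicate card fee"},
    {"complaint_id": "C", "chunk_text": "mortgage escrow error"},
]

results = top_k([1.0, 0.05, 0.0], vectors, metadata, k=2)
print([row["complaint_id"] for _, row in results])  # ['A', 'B']
```

The retrieved `chunk_text` values are what would be passed to the language model as context.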

---

In [4]:
indexer.save(INDEX_PATH, META_PATH)
print("\n✅ Embedding and indexing completed.")



✅ Saved FAISS index to C:\Users\ABC\Desktop\10Acadamy\week_6\Intelligent-Complaint-Analysis-for-Financial-Services\data\clean\vector_store\index_300_50.faiss and metadata to C:\Users\ABC\Desktop\10Acadamy\week_6\Intelligent-Complaint-Analysis-for-Financial-Services\data\clean\vector_store\meta_300_50.csv

✅ Embedding and indexing completed.
