##  Embedding and Indexing Script

###  Objective
This notebook processes consumer complaint narratives by:
1. Chunking the text data using a consistent strategy.
2. Embedding each chunk using a pre-defined embedding model.
3. Storing the embeddings in a FAISS vector index along with associated metadata for future semantic search and retrieval.


In [18]:
import sys
import os

# Go two levels up from the notebook to the project root
project_root = os.path.abspath(os.path.join(os.getcwd(), "../.."))

# Join the path to 'src'
src_path = os.path.join(project_root, "src")

# Add 'src' to Python path
if src_path not in sys.path:
    sys.path.append(src_path)

# Confirm it's added
print("src path added:", src_path)


src path added: c:\Users\ABC\Desktop\10Acadamy\week_6\Intelligent-Complaint-Analysis-for-Financial-Services\src


### ⚙️ Configuration

- **Input File**:  
  `complaints_clean.csv`  
  Path:  
  `C:\Users\ABC\Desktop\10Acadamy\week_6\Intelligent-Complaint-Analysis-for-Financial-Services\data\clean\complaints_clean.csv`

- **Chunking Parameters**:
  - `chunk_size = 300`
  - `chunk_overlap = 50`
  - `chunks_per_loop = 1000`
  - `sample_limit = 5000` (samples total to process)

- **Output Paths**:
  - FAISS Index:  
    `vector_store/index_300_50.faiss`
  - Metadata CSV:  
    `vector_store/meta_300_50.csv`

---


In [19]:
# notebooks/embed_index_modular.ipynb

from embedding import EmbeddingModel
from chunking import  batch_chunk_texts,chunk_with_langchain
from indexing import FaissIndexer
import pandas as pd
from tqdm import tqdm
import os




### 🛠️ Pipeline Steps

1. **Read Sample Data**:
   - Load up to 5,000 complaint records using pandas.

2. **Chunking**:
   - For each record, extract the `"Consumer complaint narrative"`.
   - Use the `chunk_with_langchain()` function to split the text into overlapping chunks.
   - Track metadata (`Complaint ID`, `Product`, chunk text).

3. **Embedding**:
   - Use the `EmbeddingModel` class to encode each chunk.
   - The embedding model used is assumed to be `sentence-transformers/all-MiniLM-L6-v2`.

4. **Indexing**:
   - Use `FaissIndexer` to store embeddings.
   - Embeddings are indexed with FAISS for fast similarity search.
   - Metadata is saved to a `.csv` file for traceability.

---

### 📂 Modular Design

This notebook leverages a modular design by importing from custom Python modules:
- `embedding.py` → Embedding logic.
- `chunking.py` → Text splitting logic using LangChain.
- `indexing.py` → FAISS indexing and metadata handling.

This ensures easy reuse and scalability in downstream scripts.

---

In [20]:
import yaml
CONFIG_PATH = "../../config.yaml"

# Load the configuration from the YAML file
with open(CONFIG_PATH, 'r') as f:
    config = yaml.safe_load(f)


In [21]:
os.makedirs("vector_store", exist_ok=True)
# NEW CODE
# Load config from your YAML file first...

# --- Path Construction (Corrected) ---
# Get the relative path from the config file
relative_csv_path = config['data']['csv_path']

# Construct the correct path from the notebook's location
# Go up one directory ('..') and then join with the path from the config
full_csv_path = os.path.join("../..", relative_csv_path)

print(f"Correctly constructed full path: {full_csv_path}")

# --- Data Loading (Using the new correct path) ---
sample_size_from_config = config['processing']['sample_size']

if sample_size_from_config:
    print(f"Loading {sample_size_from_config} samples...")
    df_full = pd.read_csv(full_csv_path, nrows=sample_size_from_config)
else:
    print("Loading the full dataset...")
    df_full = pd.read_csv(full_csv_path)

print(f"Loaded {len(df_full)} records.")



Correctly constructed full path: ../..\data/clean/complaints_clean.csv
Loading the full dataset...
Loaded 445576 records.


In [24]:


# Access the nested keys for the embedding model
model_name = config['embedding']['model_name']

# Access the nested keys for the output paths
index_path = config['output']['index_path']
meta_path = config['output']['meta_path']

# --- Verification ---
# Print the loaded values to confirm they are correct
print(f"Embedding Model: {model_name}")
print(f"Index Path: {index_path}")
print(f"Metadata Path: {meta_path}")

Embedding Model: BAAI/bge-large-en-v1.5
Index Path: data/vector_store/index_bge_large_300_20.faiss
Metadata Path: data/vector_store/meta_bge_large_300_20.csv


 Notes

- Non-string or empty complaint texts are skipped.
- Progress is shown using `tqdm` if desired.
- The final FAISS index and metadata file are saved and can be used in the RAG pipeline's retriever module.

---

### Use Case
This indexing logic will later be used to:
- Perform semantic similarity search based on user questions.
- Retrieve relevant complaint chunks to feed into a language model for answer generation.

---

In [None]:
indexer.save(INDEX_PATH, META_PATH)
print("\n✅ Embedding and indexing completed.")



✅ Saved FAISS index to C:\Users\ABC\Desktop\10Acadamy\week_6\Intelligent-Complaint-Analysis-for-Financial-Services\data\clean\vector_store\index_300_50.faiss and metadata to C:\Users\ABC\Desktop\10Acadamy\week_6\Intelligent-Complaint-Analysis-for-Financial-Services\data\clean\vector_store\meta_300_50.csv

✅ Embedding and indexing completed.
