# 🔍 FAISS + Vector DB Integration for RAG Pipelines

This notebook demonstrates how to:
- Preprocess and chunk documents
- Generate embeddings using OpenAI
- Store and search with FAISS (Facebook AI Similarity Search)
- Retrieve relevant chunks for RAG pipelines

🔧 Libraries: `faiss`, `openai`, `tiktoken`, `numpy`
⚠️ Install missing packages using `!pip install` if needed.

In [None]:
# !pip install faiss-cpu openai tiktoken numpy

## 1️⃣ Preprocess and Chunk the Input Text

In [None]:
import tiktoken
import numpy as np

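def chunk_text(text, max_tokens=250):
    """Greedily pack sentences into chunks of at most max_tokens tokens."""
    enc = tiktoken.encoding_for_model("gpt-4")
    # Naive sentence split; drop trailing periods so they are not doubled below.
    sentences = [s.strip().rstrip(".") for s in text.split(". ")]
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = current + sentence + ". "
        # An empty chunk always takes the sentence, so no empty chunk is emitted.
        # A single sentence longer than max_tokens still becomes its own chunk.
        if not current or len(enc.encode(candidate)) <= max_tokens:
            current = candidate
        else:
            chunks.append(current.strip())
            current = sentence + ". "
    if current:
        chunks.append(current.strip())
    return chunks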
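
The greedy packing used for chunking can be tried without downloading a tokenizer. The sketch below swaps `tiktoken` for a whitespace word count, so the numbers only approximate token counts; `count_words` and `pack_sentences` are hypothetical names introduced for illustration, not part of the notebook.

```python
def count_words(text):
    # Stand-in for a tokenizer: counts whitespace-separated words (assumption).
    return len(text.split())

def pack_sentences(sentences, max_units, count=count_words):
    """Greedily pack sentences into chunks of at most max_units units."""
    chunks, current = [], []
    for sentence in sentences:
        candidate = current + [sentence]
        # Start a new chunk only when the current one is non-empty and would overflow.
        if current and count(" ".join(candidate)) > max_units:
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks

sentences = [
    "FAISS stores vectors.",
    "It searches them quickly.",
    "Chunks keep prompts short.",
]
demo_chunks = pack_sentences(sentences, max_units=7)
print(demo_chunks)
```

With a budget of 7 words, the first two sentences share a chunk and the third starts a new one.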

## 2️⃣ Generate Embeddings with OpenAI

In [None]:
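from openai import OpenAI

def embed_text(text_list, model="text-embedding-ada-002"):
    # Written for openai>=1.0, where the legacy openai.Embedding.create call
    # was removed. OpenAI() reads the key from the OPENAI_API_KEY variable.
    client = OpenAI()
    # One request embeds the whole list; very long lists may need batching.
    resp = client.embeddings.create(input=text_list, model=model)
    # Sort by index so the vectors line up with text_list.
    ordered = sorted(resp.data, key=lambda item: item.index)
    embeddings = [item.embedding for item in ordered]
    return np.array(embeddings, dtype="float32")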
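
For long documents it can help to send chunks to the embedding API in fixed-size groups. The following is a minimal sketch of that grouping; `iter_batches`, `embed_in_batches`, and `fake_embed` are hypothetical helpers, the batch size is an arbitrary assumption, and `fake_embed` is a placeholder so the example runs offline.

```python
def iter_batches(items, batch_size):
    # Yield consecutive slices of at most batch_size items.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call embed_fn once per batch and concatenate the results in order."""
    vectors = []
    for batch in iter_batches(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

def fake_embed(batch):
    # Placeholder embedder for the demo: one 2-d vector per text (not a real model).
    return [[float(len(text)), float(text.count(" "))] for text in batch]

texts = ["alpha", "beta gamma", "delta", "epsilon zeta eta", "theta"]
batch_sizes = [len(batch) for batch in iter_batches(texts, 2)]
vectors = embed_in_batches(texts, fake_embed, batch_size=2)
print(batch_sizes)
print(len(vectors))
```

Passing `embed_text` as `embed_fn` should work as well, since iterating over the array it returns yields one row per text; the combined list would then need converting back to a `float32` array before indexing.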

## 3️⃣ Create FAISS Index and Store Embeddings

In [None]:
import faiss

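def build_faiss_index(vectors):
    # FAISS expects a C-contiguous float32 array of shape (n_vectors, dim).
    vectors = np.ascontiguousarray(vectors, dtype="float32")
    dim = vectors.shape[1]
    # IndexFlatL2 is an exact (brute-force) index over squared L2 distance.
    index = faiss.IndexFlatL2(dim)
    index.add(vectors)
    return index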
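
To make clear what a flat L2 index computes, here is a minimal brute-force sketch in plain Python: it scores every stored vector against the query by squared Euclidean distance and keeps the `k` smallest. `squared_l2` and `brute_force_search` are hypothetical names for illustration; FAISS performs the same comparison with optimized native code.

```python
def squared_l2(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_search(query, vectors, k):
    """Return (distances, indices) of the k vectors nearest to query."""
    scored = sorted((squared_l2(query, v), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = brute_force_search([1.0, 1.0], stored, k=2)
print(distances, indices)  # [1.0, 2.0] [1, 0]
```

The cost grows linearly with the number of stored vectors, which is the trade-off an exact flat index makes in exchange for never missing a true nearest neighbor.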

## 4️⃣ Perform Semantic Search
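The query is embedded with the same model as the chunks, then searched against the FAISS index. The `k` chunks with the smallest L2 distance to the query are returned.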

In [None]:
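def search_faiss(query, chunks, index, all_embeddings=None, k=3):
    # all_embeddings is unused by the search; kept so existing calls still work.
    query_embed = embed_text([query])
    # Cap k at the index size: FAISS pads missing results with -1,
    # which would silently select chunks[-1].
    k = min(k, index.ntotal)
    distances, indices = index.search(query_embed, k)
    return [chunks[i] for i in indices[0]]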

## 5️⃣ End-to-End Demo Function
Chunks a document, embeds the chunks, builds the index, and returns the chunks closest to the query.

In [None]:
def rag_faiss_pipeline(full_text, user_query):
    chunks = chunk_text(full_text)
    vectors = embed_text(chunks)
    index = build_faiss_index(vectors)
    return search_faiss(user_query, chunks, index, vectors)
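
In a RAG pipeline the retrieved chunks are usually placed in the prompt sent to the language model. The sketch below shows one way to do that; `build_prompt` is a hypothetical helper and the prompt wording is an assumption, not something the notebook prescribes.

```python
def build_prompt(question, retrieved_chunks):
    """Format retrieved chunks as numbered context ahead of the question."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "What does FAISS store?",
    ["FAISS stores vectors.", "Chunks keep prompts short."],
)
print(prompt)
```

The list returned by `rag_faiss_pipeline(full_text, user_query)` could be passed as `retrieved_chunks`. Numbering the chunks makes it easier to ask the model to cite which passage supports its answer.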

## ✅ Summary: Benefits of FAISS Integration

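| Feature      | Benefit                                                                                   |
|--------------|-------------------------------------------------------------------------------------------|
| Fast search  | Exact in-memory search with `IndexFlatL2`; FAISS also offers approximate indexes for larger collections |
| Cost control | Only the top-k chunks are sent to the model, which reduces prompt tokens                  |
| Precision    | Ranks chunks by embedding distance rather than keyword overlap                            |
| Integration  | LangChain and LlamaIndex both provide FAISS vector store integrations                     |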