<a href="https://colab.research.google.com/github/sssangeetha/OutamationAI_OCR_RAG_Automation/blob/main/Copy_of_Comparing_Open_Source_Embedding_Models_for_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🔧 Section 1: Setup

Install the required packages: llama-index, llama-index-embeddings-huggingface, and pymupdf.

Optional: install nest_asyncio, which lets asyncio event loops nest inside the loop that Colab already runs.

In [1]:
!pip install llama-index llama-index-embeddings-huggingface pymupdf
!pip install nest_asyncio

In [1]:
import time

import fitz  # PyMuPDF
import nest_asyncio
from llama_index.core import VectorStoreIndex, Document, Settings, get_response_synthesizer
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Allow nested event loops (needed in Colab/Jupyter)
nest_asyncio.apply()

# Disable the LLM; LlamaIndex falls back to MockLLM, so a query returns the filled-in prompt
Settings.llm = None

LLM is explicitly disabled. Using MockLLM.


In [8]:

import os

# Set up the Google API key for Gemini
GOOGLE_API_KEY = "my-key"  # Replace with your actual API key
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
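
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key has already been exported as an environment variable; the helper name `load_api_key` is hypothetical and not part of any library:

```python
import os


def load_api_key(var_name: str = "GOOGLE_API_KEY") -> str:
    """Return the API key stored in the environment variable `var_name`.

    Hypothetical helper: it raises instead of falling back to a placeholder,
    so a missing key is noticed before any request is made.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {var_name} is not set.")
    return key
```

Calling `load_api_key()` at the top of the notebook fails fast with a clear message when the variable is missing.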

# 📄 Section 2: Load & Extract Text from a Sample PDF

In [9]:
# Replace this path with your own uploaded file
pdf_path = "/content/sample_data/Sangeetha_Jannapureddy_Visa_Software_Engineer.pdf"

# Join the text of every page with real newlines
with fitz.open(pdf_path) as doc:
    text = "\n".join(page.get_text() for page in doc)

print(f"✅ Extracted {len(text.split())} words from the PDF.")

✅ Extracted 710 words from the PDF.
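
Text extracted from a PDF often carries ragged spacing and empty lines that end up inside chunks. A minimal cleanup sketch, offered as an optional preprocessing step rather than something the notebook already does; `normalize_whitespace` is a hypothetical helper name:

```python
import re


def normalize_whitespace(raw: str) -> str:
    """Collapse runs of spaces/tabs and drop empty lines (hypothetical helper)."""
    lines = []
    for line in raw.splitlines():
        # Replace any run of spaces or tabs with one space, then trim the ends
        cleaned = re.sub(r"[ \t]+", " ", line).strip()
        if cleaned:
            lines.append(cleaned)
    return "\n".join(lines)


sample = "EXPERIENCE  \n\n\n  Software   Engineer\t\n"
print(normalize_whitespace(sample))
```

Applied to the extracted `text`, this keeps one line per non-empty source line, which tends to make the retrieved chunks easier to read.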


In [10]:
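# Define a sentence splitter (can also use TokenTextSplitter or CharacterTextSplitter).
# Sizes are measured in tokens; keep chunk_overlap well below chunk_size so
# consecutive chunks share only a little context.
text_splitter = SentenceSplitter(chunk_size=50, chunk_overlap=10)

# Turn raw text into a list of Document objects
documents = [Document(text=text)]

# Convert into nodes (smaller chunks)
nodes = text_splitter.get_nodes_from_documents(documents)

# Then create the index from these nodes.
# The nodes are embedded with whichever embedding model is active at this point.
index = VectorStoreIndex(nodes)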
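
To see how chunk size and overlap interact, the sketch below slides a fixed window over a list of words. This is a deliberate simplification: SentenceSplitter counts tokens and tries to respect sentence boundaries, whereas the hypothetical `chunk_words` helper here counts whole words only.

```python
def chunk_words(words, chunk_size, chunk_overlap):
    """Split a list of words into overlapping windows (simplified illustration)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


words = "a b c d e f g h i j".split()
for chunk in chunk_words(words, chunk_size=4, chunk_overlap=1):
    print(" ".join(chunk))
```

Each window repeats the last word of the previous one, so a smaller overlap means less duplicated text across chunks.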




# 🧠 Section 3: Initialize and Compare Embedding Models

In [13]:
embedding_models = {
    "MiniLM-L6-v2": "sentence-transformers/all-MiniLM-L6-v2",
    "BGE-small-en": "BAAI/bge-small-en-v1.5",
    "E5-small-v2": "intfloat/e5-small-v2"
}

query = "What is my total experience? do you think i can apply for a role at Google with this experience?"

results = {}

for model_name, model_path in embedding_models.items():
    print(f"\n🔍 Testing Embedding Model: {model_name}")

    # Configure the embedding model
    embed_model = HuggingFaceEmbedding(model_name=model_path)
    Settings.embed_model = embed_model

    # Build the index with this model so chunks and query share one embedding space
    index = VectorStoreIndex(nodes, embed_model=embed_model)

    # Time retrieval and querying only, not model loading or indexing
    start_time = time.perf_counter()

    retriever = index.as_retriever(similarity_top_k=2)
    query_engine = RetrieverQueryEngine.from_args(retriever=retriever)

    # Run the query
    response = query_engine.query(query)
    end_time = time.perf_counter()

    # Store results
    results[model_name] = {
        "response": str(response),
        "time": round(end_time - start_time, 2)
    }


🔍 Testing Embedding Model: MiniLM-L6-v2

🔍 Testing Embedding Model: BGE-small-en

🔍 Testing Embedding Model: E5-small-v2
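
The setting similarity_top_k=2 asks the retriever for the two chunks whose embeddings are closest to the query embedding. A minimal sketch of that ranking step, assuming cosine similarity as the metric and using toy 3-dimensional vectors in place of real embeddings; `cosine_similarity` and `top_k` are hypothetical helper names, not LlamaIndex functions:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return (index, score) pairs for the k chunks most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings (assumed values, for illustration)
chunk_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.1, 0.0]
print(top_k(query_vec, chunk_vecs, k=2))
```

The scores are only meaningful when the query vector and the chunk vectors come from the same embedding model, since each model defines its own vector space.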


# 📊 Section 4: Compare Outputs
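
Because the LLM is disabled, each response is just the prompt template filled with the retrieved chunks, so comparing models comes down to comparing which chunks each one retrieved. One rough way to quantify that is the Jaccard overlap between the chunk sets. The sketch below uses placeholder strings and model names rather than real output, and `jaccard` is a hypothetical helper:

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two collections of chunk texts."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty results are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical retrieved chunks per model; placeholders, not real results
retrieved = {
    "model_a": ["chunk 1", "chunk 2"],
    "model_b": ["chunk 1", "chunk 3"],
    "model_c": ["chunk 4", "chunk 5"],
}

for left, right in combinations(retrieved, 2):
    score = jaccard(retrieved[left], retrieved[right])
    print(f"{left} vs {right}: {score:.2f}")
```

An overlap of 1.0 for every pair would suggest the models are retrieving identical chunks, which is worth investigating before drawing conclusions about model quality.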

In [15]:
for model, result in results.items():
    print("\n==============================")
    print(f"🧠 Model: {model}")
    print(f"⏱️ Retrieval Time: {result['time']} seconds")
    print(f"📄 Top Response:\n{result['response']}")
    print("==============================\n")

🧠 Model: MiniLM-L6-v2
⏱️ Retrieval Time: 0.04 seconds
📄 Top Response:
Context information is below.
---------------------
Seeking opportunities to continue driving performance enhancements and
scalable solutions.
EXPERIENCE
Software Engineer
SystechCorp-Client (Blue Cross Blue Shield Association)
October 2025 - Present
NJ,

OpenShift
Soft Skills adaptability, communication, Leadership, problem solving, stakeholder management, technical writing, time-management
APIs and Messaging GRPC, GraphQL, API gateways, REST APIs, RabbitMQ
Observability ELK,
---------------------
Given the context information and not prior knowledge, answer the query.
Query: What is my total experience? do you think i can apply for a role at Google with this experience?
Answer: 
🧠 Model: BGE-small-en
⏱️ Retrieval Time: 0.03 seconds
📄 Top Response:
Context information is below.
---------------------
Seeking opportunities to continue driving performance enhancements and
scalable solutions.
EXPERIENCE
Software Engin