<a href="https://colab.research.google.com/github/silvia-j-escobar/ExternDataScience/blob/main/Comparing_Open_Source_Embedding_Models_for_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>




Compare Open Embedding Models
Silvia Escobar


In [None]:
# Silvia Escobar

In [1]:
!pip install llama-index llama-index-embeddings-huggingface pymupdf
!pip install nest_asyncio



# 🔧 Section 1: Setup

Install necessary packages: llama-index, llama-index-embeddings-huggingface, pymupdf

Optional: nest_asyncio if needed for Colab

In [3]:
import nest_asyncio
nest_asyncio.apply()

from llama_index.core import VectorStoreIndex, Document, Settings, get_response_synthesizer
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.query_engine import RetrieverQueryEngine
import fitz  # PyMuPDF
import time
Settings.llm = None

LLM is explicitly disabled. Using MockLLM.


LLM is explicitly disabled. Using MockLLM.


# 📄 Section 2: Load & Extract Text from a Sample PDF

In [4]:
!pip install pymupdf




In [8]:
from google.colab import files
uploaded = files.upload()


Saving sample_contract.pdf to sample_contract.pdf


In [9]:
# Replace this path with your own uploaded file
pdf_path = "/content/sample_contract.pdf"
doc = fitz.open(pdf_path)
text = "\\n".join([page.get_text() for page in doc])

print(f"✅ Extracted {len(text.split())} words from the contract.")

✅ Extracted 315 words from the contract.


In [10]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Default (example) embedding model
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [11]:
embed_model = HuggingFaceEmbedding(model_name="intfloat/e5-small-v2")


modules.json:   0%|          | 0.00/387 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/57.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/615 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/314 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/200 [00:00<?, ?B/s]

In [12]:
# Define a sentence splitter (can also use TokenTextSplitter or CharacterTextSplitter)
text_splitter = SentenceSplitter(chunk_size=50, chunk_overlap=50)

# Turn raw text into a list of Document objects
documents = [Document(text=text)]

# Convert into nodes (smaller chunks)
nodes = text_splitter.get_nodes_from_documents(documents)

# Set the embedding model for Settings before creating the index
Settings.embed_model = embed_model

# Then create the index from these nodes
index = VectorStoreIndex(nodes)

# 🧠 Section 3: Initialize and Compare Embedding Models

In [13]:
embedding_models = {
    "MiniLM-L6-v2": "sentence-transformers/all-MiniLM-L6-v2",
    "BGE-small-en": "BAAI/bge-small-en-v1.5",
    "E5-small-v2": "intfloat/e5-small-v2"
}

query = "What are the penalties for late payments?"

results = {}

for model_name, model_path in embedding_models.items():
    print(f"\n🔍 Testing Embedding Model: {model_name}")

    # Configure the embedding model
    embed_model = HuggingFaceEmbedding(model_name=model_path)
    Settings.embed_model = embed_model

    # Build the index
    start_time = time.time()

    retriever = index.as_retriever(similarity_top_k=2)
    query_engine = RetrieverQueryEngine.from_args(retriever=retriever)

    # Run the query
    response = query_engine.query(query)
    end_time = time.time()

    # Store results
    results[model_name] = {
        "response": str(response),
        "time": round(end_time - start_time, 2)
    }


🔍 Testing Embedding Model: MiniLM-L6-v2

🔍 Testing Embedding Model: BGE-small-en


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]


🔍 Testing Embedding Model: E5-small-v2


In [14]:
query = "What is the maximum loan amount a borrower can apply for?"
response = query_engine.query(query)

print(response)

Context information is below.
---------------------
Payment terms are
net 30 days from receipt of invoice.
2.3 Late payments shall bear interest at the rate of 1.5% per month from the due date until paid in full.
3.

4.2 Refunds are issued at the sole discretion of Service Provider and will be processed within 30 days
of approval.
\n4.3 No refunds will be issued for completed projects that meet the specifications outlined in Exhibit A.
5.
---------------------
Given the context information and not prior knowledge, answer the query.
Query: What is the maximum loan amount a borrower can apply for?
Answer: 


# 📊 Section 4: Compare Outputs

In [15]:
for model, result in results.items():
    print(f"\\n==============================")
    print(f"🧠 Model: {model}")
    print(f"⏱️ Retrieval Time: {result['time']} seconds")
    print(f"📄 Top Response:\\n{result['response']}")
    print(f"==============================\\n")

🧠 Model: MiniLM-L6-v2
⏱️ Retrieval Time: 0.04 seconds
📄 Top Response:\nContext information is below.
---------------------
Payment terms are
net 30 days from receipt of invoice.
2.3 Late payments shall bear interest at the rate of 1.5% per month from the due date until paid in full.
3.

4.2 Refunds are issued at the sole discretion of Service Provider and will be processed within 30 days
of approval.
\n4.3 No refunds will be issued for completed projects that meet the specifications outlined in Exhibit A.
5.
---------------------
Given the context information and not prior knowledge, answer the query.
Query: What are the penalties for late payments?
Answer: 
🧠 Model: BGE-small-en
⏱️ Retrieval Time: 0.05 seconds
📄 Top Response:\nContext information is below.
---------------------
Payment terms are
net 30 days from receipt of invoice.
2.3 Late payments shall bear interest at the rate of 1.5% per month from the due date until paid in full.
3.

4.2 Refunds are issued at the sole discretion

**Task: Test, Compare, and Choose the Best Embedding Model**


In [46]:
# Choose 3 Embedding Models # MiniLM-L6-v2
embedding_models = {
    "MiniLM-L6-v2": "sentence-transformers/all-MiniLM-L6-v2",
    #"BGE-small-en": "BAAI/bge-small-en-v1.5",
    #"E5-small-v2": "intfloat/e5-small-v2"
}

query = "What are the penalties for late payments?"

results = {}

for model_name, model_path in embedding_models.items():
    print(f"\n🔍 Testing Embedding Model: {model_name}")

    # Configure the embedding model
    embed_model = HuggingFaceEmbedding(model_name=model_path)
    Settings.embed_model = embed_model

    # Build the index
    start_time = time.time()

    retriever = index.as_retriever(similarity_top_k=2)
    query_engine = RetrieverQueryEngine.from_args(retriever=retriever)

    # Run the query
    response = query_engine.query(query)
    end_time = time.time()

    # Store results
    results[model_name] = {
        "response": str(response),
        "time": round(end_time - start_time, 2)
    }


🔍 Testing Embedding Model: MiniLM-L6-v2


In [51]:
query = "How long does this agreement last?"
response = query_engine.query(query)

print(response)

Context information is below.
---------------------
3.2 Either party may terminate this Agreement upon thirty (30) days written notice to the other party.
4.

3. TERM AND TERMINATION
3.1 This Agreement shall commence on the Effective Date and shall continue for a period of one (1)
year, unless earlier terminated as provided herein.
---------------------
Given the context information and not prior knowledge, answer the query.
Query: How long does this agreement last?
Answer: 


In [55]:
query = "What is the effective date of the agreement?"
response = query_engine.query(query)

print(response)

Context information is below.
---------------------
3. TERM AND TERMINATION
3.1 This Agreement shall commence on the Effective Date and shall continue for a period of one (1)
year, unless earlier terminated as provided herein.

SERVICE AGREEMENT CONTRACT
This Service Agreement (the "Agreement") is entered into as of January 15, 2025 (the "Effective Date")
by and between:
ABC Company Inc.,
---------------------
Given the context information and not prior knowledge, answer the query.
Query: What is the effective date of the agreement?
Answer: 


In [56]:
query = "How much time does the client have to pay after receiving an invoice?"
response = query_engine.query(query)

print(response)

Context information is below.
---------------------
2.2 Service Provider shall invoice Client on a monthly basis for Services performed. Payment terms are
net 30 days from receipt of invoice.

Payment terms are
net 30 days from receipt of invoice.
2.3 Late payments shall bear interest at the rate of 1.5% per month from the due date until paid in full.
3.
---------------------
Given the context information and not prior knowledge, answer the query.
Query: How much time does the client have to pay after receiving an invoice?
Answer: 


In [57]:
for node in retriever.retrieve(query):
    print(node.get_text())

2.2 Service Provider shall invoice Client on a monthly basis for Services performed. Payment terms are
net 30 days from receipt of invoice.
Payment terms are
net 30 days from receipt of invoice.
2.3 Late payments shall bear interest at the rate of 1.5% per month from the due date until paid in full.
3.


In [31]:
# Choose 3 Embedding Models # BGE-small-en
embedding_models = {
    #"MiniLM-L6-v2": "sentence-transformers/all-MiniLM-L6-v2",
    "BGE-small-en": "BAAI/bge-small-en-v1.5",
    #"E5-small-v2": "intfloat/e5-small-v2"
}

query = "What are the penalties for late payments?"

results = {}

for model_name, model_path in embedding_models.items():
    print(f"\n🔍 Testing Embedding Model: {model_name}")

    # Configure the embedding model
    embed_model = HuggingFaceEmbedding(model_name=model_path)
    Settings.embed_model = embed_model

    # Build the index
    start_time = time.time()

    retriever = index.as_retriever(similarity_top_k=2)
    query_engine = RetrieverQueryEngine.from_args(retriever=retriever)

    # Run the query
    response = query_engine.query(query)
    end_time = time.time()

    # Store results
    results[model_name] = {
        "response": str(response),
        "time": round(end_time - start_time, 2)
    }


🔍 Testing Embedding Model: BGE-small-en


In [59]:
query = "How long does this agreement last?"
response = query_engine.query(query)

print(response)

Context information is below.
---------------------
3.2 Either party may terminate this Agreement upon thirty (30) days written notice to the other party.
4.

3. TERM AND TERMINATION
3.1 This Agreement shall commence on the Effective Date and shall continue for a period of one (1)
year, unless earlier terminated as provided herein.
---------------------
Given the context information and not prior knowledge, answer the query.
Query: How long does this agreement last?
Answer: 


In [60]:
query = "What is the effective date of the agreement?"
response = query_engine.query(query)

print(response)

Context information is below.
---------------------
3. TERM AND TERMINATION
3.1 This Agreement shall commence on the Effective Date and shall continue for a period of one (1)
year, unless earlier terminated as provided herein.

SERVICE AGREEMENT CONTRACT
This Service Agreement (the "Agreement") is entered into as of January 15, 2025 (the "Effective Date")
by and between:
ABC Company Inc.,
---------------------
Given the context information and not prior knowledge, answer the query.
Query: What is the effective date of the agreement?
Answer: 


In [34]:
query = "How much time does the client have to pay after receiving an invoice?"
response = query_engine.query(query)

print(response)

Context information is below.
---------------------
2.2 Service Provider shall invoice Client on a monthly basis for Services performed. Payment terms are
net 30 days from receipt of invoice.

Payment terms are
net 30 days from receipt of invoice.
2.3 Late payments shall bear interest at the rate of 1.5% per month from the due date until paid in full.
3.
---------------------
Given the context information and not prior knowledge, answer the query.
Query: How much time does the client have to pay after receiving an invoice?
Answer: 


In [43]:
for node in retriever.retrieve(query):
    print(node.get_text())

2.2 Service Provider shall invoice Client on a monthly basis for Services performed. Payment terms are
net 30 days from receipt of invoice.
Payment terms are
net 30 days from receipt of invoice.
2.3 Late payments shall bear interest at the rate of 1.5% per month from the due date until paid in full.
3.


In [61]:
# Choose 3 Embedding Models # E5-small-v2
embedding_models = {
    #"MiniLM-L6-v2": "sentence-transformers/all-MiniLM-L6-v2",
    #"BGE-small-en": "BAAI/bge-small-en-v1.5",
    "E5-small-v2": "intfloat/e5-small-v2"
}

query = "What are the penalties for late payments?"

results = {}

for model_name, model_path in embedding_models.items():
    print(f"\n🔍 Testing Embedding Model: {model_name}")

    # Configure the embedding model
    embed_model = HuggingFaceEmbedding(model_name=model_path)
    Settings.embed_model = embed_model

    # Build the index
    start_time = time.time()

    retriever = index.as_retriever(similarity_top_k=2)
    query_engine = RetrieverQueryEngine.from_args(retriever=retriever)

    # Run the query
    response = query_engine.query(query)
    end_time = time.time()

    # Store results
    results[model_name] = {
        "response": str(response),
        "time": round(end_time - start_time, 2)
    }


🔍 Testing Embedding Model: E5-small-v2


In [62]:
query = "How long does this agreement last?"
response = query_engine.query(query)

print(response)

Context information is below.
---------------------
3.2 Either party may terminate this Agreement upon thirty (30) days written notice to the other party.
4.

3. TERM AND TERMINATION
3.1 This Agreement shall commence on the Effective Date and shall continue for a period of one (1)
year, unless earlier terminated as provided herein.
---------------------
Given the context information and not prior knowledge, answer the query.
Query: How long does this agreement last?
Answer: 


In [63]:
query = "What is the effective date of the agreement?"
response = query_engine.query(query)

print(response)

Context information is below.
---------------------
3. TERM AND TERMINATION
3.1 This Agreement shall commence on the Effective Date and shall continue for a period of one (1)
year, unless earlier terminated as provided herein.

SERVICE AGREEMENT CONTRACT
This Service Agreement (the "Agreement") is entered into as of January 15, 2025 (the "Effective Date")
by and between:
ABC Company Inc.,
---------------------
Given the context information and not prior knowledge, answer the query.
Query: What is the effective date of the agreement?
Answer: 


In [64]:
query = "How much time does the client have to pay after receiving an invoice?"
response = query_engine.query(query)

print(response)

Context information is below.
---------------------
2.2 Service Provider shall invoice Client on a monthly basis for Services performed. Payment terms are
net 30 days from receipt of invoice.

Payment terms are
net 30 days from receipt of invoice.
2.3 Late payments shall bear interest at the rate of 1.5% per month from the due date until paid in full.
3.
---------------------
Given the context information and not prior knowledge, answer the query.
Query: How much time does the client have to pay after receiving an invoice?
Answer: 


In [65]:
for node in retriever.retrieve(query):
    print(node.get_text())

2.2 Service Provider shall invoice Client on a monthly basis for Services performed. Payment terms are
net 30 days from receipt of invoice.
Payment terms are
net 30 days from receipt of invoice.
2.3 Late payments shall bear interest at the rate of 1.5% per month from the due date until paid in full.
3.
