Testing Vector Embeddings for Retrieval-Augmented Generation (RAG) Applications


The primary goal of this project is to evaluate and compare vector embedding models for Retrieval-Augmented Generation (RAG) applications, and to determine which models retrieve the most relevant passages for a given query. Each model embeds the same small set of documents and the same query. The models are then compared by the cosine distance between the query embedding and the embeddings of the documents each model retrieves.


Retrieval-Augmented Generation (RAG) is a hybrid approach that enhances the performance of generative models by incorporating relevant information retrieved from large datasets. This involves two main steps: retrieving relevant documents or passages using vector embeddings and then using these retrieved pieces of information to generate responses. The quality of the vector embeddings directly impacts the retrieval accuracy and, consequently, the overall performance of the RAG system. Evaluating different embedding models ensures that the most effective ones are utilized, leading to better retrieval and generation results.
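
The two steps described above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in this project: `embed` is a toy bag-of-words stand-in for a real embedding model, and the names `embed`, `retrieve`, and `build_prompt` are hypothetical. The generation step is represented only by assembling a prompt, since no generator model is part of this section.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=2):
    """Step 1: rank documents by similarity to the query and keep the top k."""
    query_vec = embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vec, embed(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, passages):
    """Step 2 (input side): place the retrieved passages in a prompt for a generator."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


documents = [
    "The cat sat on the mat.",
    "The dog barked at the mailman.",
    "The mailman delivered the package.",
]
query = "What did the dog do?"
passages = retrieve(query, documents, k=1)
print(build_prompt(query, passages))
```

If the retrieval step returns the wrong passage, the generator never sees the relevant text, which is why the embedding model is evaluated on its own here.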

In this section, a simple Python script runs the same query against the same ten sentences using eight different embedding models. For each model, the two nearest documents and their cosine distances to the query are recorded, so that the rankings can be compared across models.
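
For reference, the metric can be written out as follows. This assumes Chroma's documented definition of cosine distance for collections created with "hnsw:space": "cosine". The symbols q (query embedding) and d (document embedding) are notation introduced here for illustration.

```latex
\[
\operatorname{sim}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert},
\qquad
\operatorname{dist}(q, d) = 1 - \operatorname{sim}(q, d)
\]
```

A smaller distance therefore means a closer match: 0 for vectors that point in the same direction and 1 for orthogonal vectors.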


In [7]:
import os

import chromadb
import chromadb.utils.embedding_functions as embedding_functions
import numpy as np
from sentence_transformers import SentenceTransformer

# Create an in-memory Chroma client
chroma_client = chromadb.Client()

# Hugging Face Inference API embedding function for all-MiniLM-L6-v2.
# The API key is read from the environment instead of being hard-coded;
# the variable name HF_API_KEY is an assumption, use whichever name you export.
huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key=os.environ["HF_API_KEY"],
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Create a collection that uses this embedding function and cosine distance
collection = chroma_client.create_collection(
    name="test_collections-1",
    embedding_function=huggingface_ef,
    metadata={"hnsw:space": "cosine"},
)



# Add documents to the collection
collection.upsert(
    documents=[
        "The cat sat on the mat.",
        "The dog barked at the mailman.",
        "The quick brown fox jumps over the lazy dog.",
        "I love playing with my cat.",
        "The mailman delivered the package.",
        "She sells seashells by the seashore.",
        "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
        "A journey of a thousand miles begins with a single step.",
        "I am the greatest!",
        "All that glitters is not gold."
    ],
    ids=["id1", "id2","id3","id4","id5","id6","id7","id8","id9","id10"]
)


# Query collection
results = collection.query(
    query_texts=["What did the dog do?"],
    n_results=2, # how many results to return
)
results

{'ids': [['id2', 'id3']],
 'distances': [[0.5399942398071289, 0.5448101758956909]],
 'metadatas': [[None, None]],
 'embeddings': None,
 'documents': [['The dog barked at the mailman.',
   'The quick brown fox jumps over the lazy dog.']],
 'uris': None,
 'data': None}
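
The result dictionary holds one inner list per query, so the values for the single query here sit at index 0. As a small sketch, the helper below (the name `top_hits` is hypothetical) unpacks the output shown above and converts each cosine distance to a similarity, assuming Chroma's cosine distance is one minus the cosine similarity.

```python
# Result dictionary copied from the output above (fields that were None are omitted).
results = {
    "ids": [["id2", "id3"]],
    "distances": [[0.5399942398071289, 0.5448101758956909]],
    "documents": [["The dog barked at the mailman.",
                   "The quick brown fox jumps over the lazy dog."]],
}


def top_hits(results, query_index=0):
    """Return (id, document, cosine similarity) tuples for one query."""
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    dists = results["distances"][query_index]
    # Assumption: distance = 1 - cosine similarity for "hnsw:space": "cosine"
    return [(i, d, 1.0 - dist) for i, d, dist in zip(ids, docs, dists)]


for doc_id, doc, similarity in top_hits(results):
    print(f"{doc_id}  similarity={similarity:.3f}  {doc}")
```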

In [9]:
import os

import chromadb
import chromadb.utils.embedding_functions as embedding_functions
import numpy as np
from sentence_transformers import SentenceTransformer

# Create an in-memory Chroma client
chroma_client = chromadb.Client()

# Hugging Face Inference API embedding function for jina-embeddings-v2-base-en.
# The API key is read from the environment instead of being hard-coded;
# the variable name HF_API_KEY is an assumption, use whichever name you export.
huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key=os.environ["HF_API_KEY"],
    model_name="jinaai/jina-embeddings-v2-base-en"
)

# Create a collection that uses this embedding function and cosine distance
collection = chroma_client.create_collection(
    name="test_collections-2",
    embedding_function=huggingface_ef,
    metadata={"hnsw:space": "cosine"},
)



# Add documents to the collection
collection.upsert(
    documents=[
        "The cat sat on the mat.",
        "The dog barked at the mailman.",
        "The quick brown fox jumps over the lazy dog.",
        "I love playing with my cat.",
        "The mailman delivered the package.",
        "She sells seashells by the seashore.",
        "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
        "A journey of a thousand miles begins with a single step.",
        "I am the greatest!",
        "All that glitters is not gold."
    ],
    ids=["id1", "id2","id3","id4","id5","id6","id7","id8","id9","id10"]
)


# Query collection
results = collection.query(
    query_texts=["What did the dog do?"],
    n_results=2, # how many results to return
)
results

{'ids': [['id9', 'id6']],
 'distances': [[0.528051495552063, 0.5929942727088928]],
 'metadatas': [[None, None]],
 'embeddings': None,
 'documents': [['I am the greatest!', 'She sells seashells by the seashore.']],
 'uris': None,
 'data': None}

In [10]:
import os

import chromadb
import chromadb.utils.embedding_functions as embedding_functions
import numpy as np
from sentence_transformers import SentenceTransformer

# Create an in-memory Chroma client
chroma_client = chromadb.Client()

# Hugging Face Inference API embedding function for bce-embedding-base_v1.
# The API key is read from the environment instead of being hard-coded;
# the variable name HF_API_KEY is an assumption, use whichever name you export.
huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key=os.environ["HF_API_KEY"],
    model_name="maidalun1020/bce-embedding-base_v1"
)

# Create a collection that uses this embedding function and cosine distance
collection = chroma_client.create_collection(
    name="test_collections-3",
    embedding_function=huggingface_ef,
    metadata={"hnsw:space": "cosine"},
)



# Add documents to the collection
collection.upsert(
    documents=[
        "The cat sat on the mat.",
        "The dog barked at the mailman.",
        "The quick brown fox jumps over the lazy dog.",
        "I love playing with my cat.",
        "The mailman delivered the package.",
        "She sells seashells by the seashore.",
        "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
        "A journey of a thousand miles begins with a single step.",
        "I am the greatest!",
        "All that glitters is not gold."
    ],
    ids=["id1", "id2","id3","id4","id5","id6","id7","id8","id9","id10"]
)


# Query collection
results = collection.query(
    query_texts=["What did the dog do?"],
    n_results=2, # how many results to return
)
results

{'ids': [['id2', 'id7']],
 'distances': [[0.49716949462890625, 0.6792410612106323]],
 'metadatas': [[None, None]],
 'embeddings': None,
 'documents': [['The dog barked at the mailman.',
   'How much wood would a woodchuck chuck if a woodchuck could chuck wood?']],
 'uris': None,
 'data': None}

In [11]:
import os

import chromadb
import chromadb.utils.embedding_functions as embedding_functions
import numpy as np
from sentence_transformers import SentenceTransformer

# Create an in-memory Chroma client
chroma_client = chromadb.Client()

# Hugging Face Inference API embedding function for bge-small-en-v1.5.
# The API key is read from the environment instead of being hard-coded;
# the variable name HF_API_KEY is an assumption, use whichever name you export.
huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key=os.environ["HF_API_KEY"],
    model_name="BAAI/bge-small-en-v1.5"
)

# Create a collection that uses this embedding function and cosine distance
collection = chroma_client.create_collection(
    name="test_collections-4",
    embedding_function=huggingface_ef,
    metadata={"hnsw:space": "cosine"},
)



# Add documents to the collection
collection.upsert(
    documents=[
        "The cat sat on the mat.",
        "The dog barked at the mailman.",
        "The quick brown fox jumps over the lazy dog.",
        "I love playing with my cat.",
        "The mailman delivered the package.",
        "She sells seashells by the seashore.",
        "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
        "A journey of a thousand miles begins with a single step.",
        "I am the greatest!",
        "All that glitters is not gold."
    ],
    ids=["id1", "id2","id3","id4","id5","id6","id7","id8","id9","id10"]
)


# Query collection
results = collection.query(
    query_texts=["What did the dog do?"],
    n_results=2, # how many results to return
)
results

{'ids': [['id2', 'id3']],
 'distances': [[0.334303081035614, 0.45860934257507324]],
 'metadatas': [[None, None]],
 'embeddings': None,
 'documents': [['The dog barked at the mailman.',
   'The quick brown fox jumps over the lazy dog.']],
 'uris': None,
 'data': None}

In [14]:
import os

import chromadb
import chromadb.utils.embedding_functions as embedding_functions
import numpy as np
from sentence_transformers import SentenceTransformer

# Create an in-memory Chroma client
chroma_client = chromadb.Client()

# Hugging Face Inference API embedding function for stella-base-en-v2.
# The API key is read from the environment instead of being hard-coded;
# the variable name HF_API_KEY is an assumption, use whichever name you export.
huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key=os.environ["HF_API_KEY"],
    model_name="infgrad/stella-base-en-v2"
)

# Create a collection that uses this embedding function and cosine distance
collection = chroma_client.create_collection(
    name="test_collections-5",
    embedding_function=huggingface_ef,
    metadata={"hnsw:space": "cosine"},
)



# Add documents to the collection
collection.upsert(
    documents=[
        "The cat sat on the mat.",
        "The dog barked at the mailman.",
        "The quick brown fox jumps over the lazy dog.",
        "I love playing with my cat.",
        "The mailman delivered the package.",
        "She sells seashells by the seashore.",
        "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
        "A journey of a thousand miles begins with a single step.",
        "I am the greatest!",
        "All that glitters is not gold."
    ],
    ids=["id1", "id2","id3","id4","id5","id6","id7","id8","id9","id10"]
)


# Query collection
results = collection.query(
    query_texts=["What did the dog do?"],
    n_results=2, # how many results to return
)
results

{'ids': [['id2', 'id3']],
 'distances': [[0.34219467639923096, 0.38874876499176025]],
 'metadatas': [[None, None]],
 'embeddings': None,
 'documents': [['The dog barked at the mailman.',
   'The quick brown fox jumps over the lazy dog.']],
 'uris': None,
 'data': None}

In [16]:
import os

import chromadb
import chromadb.utils.embedding_functions as embedding_functions
import numpy as np
from sentence_transformers import SentenceTransformer

# Create an in-memory Chroma client
chroma_client = chromadb.Client()

# Hugging Face Inference API embedding function for gte-small.
# The API key is read from the environment instead of being hard-coded;
# the variable name HF_API_KEY is an assumption, use whichever name you export.
huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key=os.environ["HF_API_KEY"],
    model_name="thenlper/gte-small"
)

# Create a collection that uses this embedding function and cosine distance
collection = chroma_client.create_collection(
    name="test_collections-6",
    embedding_function=huggingface_ef,
    metadata={"hnsw:space": "cosine"},
)



# Add documents to the collection
collection.upsert(
    documents=[
        "The cat sat on the mat.",
        "The dog barked at the mailman.",
        "The quick brown fox jumps over the lazy dog.",
        "I love playing with my cat.",
        "The mailman delivered the package.",
        "She sells seashells by the seashore.",
        "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
        "A journey of a thousand miles begins with a single step.",
        "I am the greatest!",
        "All that glitters is not gold."
    ],
    ids=["id1", "id2","id3","id4","id5","id6","id7","id8","id9","id10"]
)


# Query collection
results = collection.query(
    query_texts=["What did the dog do?"],
    n_results=2, # how many results to return
)
results

{'ids': [['id2', 'id3']],
 'distances': [[0.12128406763076782, 0.14693737030029297]],
 'metadatas': [[None, None]],
 'embeddings': None,
 'documents': [['The dog barked at the mailman.',
   'The quick brown fox jumps over the lazy dog.']],
 'uris': None,
 'data': None}

In [17]:
import os

import chromadb
import chromadb.utils.embedding_functions as embedding_functions
import numpy as np
from sentence_transformers import SentenceTransformer

# Create an in-memory Chroma client
chroma_client = chromadb.Client()

# Hugging Face Inference API embedding function for snowflake-arctic-embed-m.
# The API key is read from the environment instead of being hard-coded;
# the variable name HF_API_KEY is an assumption, use whichever name you export.
huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key=os.environ["HF_API_KEY"],
    model_name="Snowflake/snowflake-arctic-embed-m"
)

# Create a collection that uses this embedding function and cosine distance
collection = chroma_client.create_collection(
    name="test_collections-7",
    embedding_function=huggingface_ef,
    metadata={"hnsw:space": "cosine"},
)



# Add documents to the collection
collection.upsert(
    documents=[
        "The cat sat on the mat.",
        "The dog barked at the mailman.",
        "The quick brown fox jumps over the lazy dog.",
        "I love playing with my cat.",
        "The mailman delivered the package.",
        "She sells seashells by the seashore.",
        "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
        "A journey of a thousand miles begins with a single step.",
        "I am the greatest!",
        "All that glitters is not gold."
    ],
    ids=["id1", "id2","id3","id4","id5","id6","id7","id8","id9","id10"]
)


# Query collection
results = collection.query(
    query_texts=["What did the dog do?"],
    n_results=2, # how many results to return
)
results

{'ids': [['id2', 'id3']],
 'distances': [[0.1772594451904297, 0.18935298919677734]],
 'metadatas': [[None, None]],
 'embeddings': None,
 'documents': [['The dog barked at the mailman.',
   'The quick brown fox jumps over the lazy dog.']],
 'uris': None,
 'data': None}

In [18]:
import os

import chromadb
import chromadb.utils.embedding_functions as embedding_functions
import numpy as np
from sentence_transformers import SentenceTransformer

# Create an in-memory Chroma client
chroma_client = chromadb.Client()

# Hugging Face Inference API embedding function for GIST-Embedding-v0.
# The API key is read from the environment instead of being hard-coded;
# the variable name HF_API_KEY is an assumption, use whichever name you export.
huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key=os.environ["HF_API_KEY"],
    model_name="avsolatorio/GIST-Embedding-v0"
)

# Create a collection that uses this embedding function and cosine distance
collection = chroma_client.create_collection(
    name="test_collections-8",
    embedding_function=huggingface_ef,
    metadata={"hnsw:space": "cosine"},
)



# Add documents to the collection
collection.upsert(
    documents=[
        "The cat sat on the mat.",
        "The dog barked at the mailman.",
        "The quick brown fox jumps over the lazy dog.",
        "I love playing with my cat.",
        "The mailman delivered the package.",
        "She sells seashells by the seashore.",
        "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
        "A journey of a thousand miles begins with a single step.",
        "I am the greatest!",
        "All that glitters is not gold."
    ],
    ids=["id1", "id2","id3","id4","id5","id6","id7","id8","id9","id10"]
)


# Query collection
results = collection.query(
    query_texts=["What did the dog do?"],
    n_results=2, # how many results to return
)
results

{'ids': [['id2', 'id3']],
 'distances': [[0.23246049880981445, 0.26549404859542847]],
 'metadatas': [[None, None]],
 'embeddings': None,
 'documents': [['The dog barked at the mailman.',
   'The quick brown fox jumps over the lazy dog.']],
 'uris': None,
 'data': None}
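
To compare the eight runs side by side, the recorded outputs can be collected into one table. The sketch below copies the top-2 ids and distances from the outputs above. It assumes that "id2" ("The dog barked at the mailman.") is the intended answer to the query, and the helper name `summarize` is hypothetical. Absolute distances come from different embedding spaces, so a lower value for one model does not by itself mean that model is more accurate. The sketch therefore also reports the margin between the first and second hit within each model. With a single query this is only a rough indication, not a benchmark.

```python
# Top-2 results recorded in the outputs above: model -> [(id, cosine distance), ...]
recorded = {
    "sentence-transformers/all-MiniLM-L6-v2": [("id2", 0.5399942398071289), ("id3", 0.5448101758956909)],
    "jinaai/jina-embeddings-v2-base-en": [("id9", 0.528051495552063), ("id6", 0.5929942727088928)],
    "maidalun1020/bce-embedding-base_v1": [("id2", 0.49716949462890625), ("id7", 0.6792410612106323)],
    "BAAI/bge-small-en-v1.5": [("id2", 0.334303081035614), ("id3", 0.45860934257507324)],
    "infgrad/stella-base-en-v2": [("id2", 0.34219467639923096), ("id3", 0.38874876499176025)],
    "thenlper/gte-small": [("id2", 0.12128406763076782), ("id3", 0.14693737030029297)],
    "Snowflake/snowflake-arctic-embed-m": [("id2", 0.1772594451904297), ("id3", 0.18935298919677734)],
    "avsolatorio/GIST-Embedding-v0": [("id2", 0.23246049880981445), ("id3", 0.26549404859542847)],
}

# Assumption: "The dog barked at the mailman." (id2) is the intended answer.
EXPECTED_ID = "id2"


def summarize(recorded, expected_id=EXPECTED_ID):
    """Build one summary row per model from its recorded top-2 hits."""
    rows = []
    for model, hits in recorded.items():
        (top_id, top_dist), (_, second_dist) = hits
        rows.append({
            "model": model,
            "top1_correct": top_id == expected_id,
            "top1_distance": top_dist,
            # Gap between the best and second-best hit within the same model
            "margin": second_dist - top_dist,
        })
    return rows


for row in summarize(recorded):
    print(f"{row['model']:<42} correct={row['top1_correct']!s:<5} "
          f"top1={row['top1_distance']:.3f} margin={row['margin']:.3f}")
```

In these recorded runs, seven of the eight models rank id2 first. The jinaai/jina-embeddings-v2-base-en run returns id9 and id6 instead, which suggests that this configuration should be checked before its numbers are compared with the others.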