# Compare embeddings performance

We create embeddings from the same texts in several ways (raw content, generated questions, generated answers, and summaries) and compare how well each variant retrieves the relevant document.

## Dependencies

We use Python 3.9 for this notebook.

To run the required vector database (Qdrant) locally, use this command:  
`docker run --name 04-compare-embeddings-demo-vectordb -p 6333:6333 -p 6334:6334 -d qdrant/qdrant`
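Before running the cells below, it can help to verify that the container is actually listening. The following is a minimal sketch using only the standard library; the helper name `is_port_open` is our own and not part of any Qdrant tooling.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Once the container is up, `is_port_open("localhost", 6333)` should return `True`; 6333 is the HTTP port published by the `docker run` command above.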

In [17]:
%pip install chromadb
%pip install langchain
%pip install langchain-community
%pip install langchain-chroma
%pip install langchain-huggingface
%pip install langchain-openai
%pip install pickleshare
%pip install qdrant-client
%pip install tabulate

Collecting protobuf (from onnxruntime>=1.14.1->chromadb)
  Using cached protobuf-4.25.4-cp37-abi3-macosx_10_9_universal2.whl.metadata (541 bytes)
Using cached protobuf-4.25.4-cp37-abi3-macosx_10_9_universal2.whl (394 kB)
Installing collected packages: protobuf
  Attempting uninstall: protobuf
    Found existing installation: protobuf 5.28.1
    Uninstalling protobuf-5.28.1:
      Successfully uninstalled protobuf-5.28.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
grpcio-tools 1.66.1 requires protobuf<6.0dev,>=5.26.1, but you have protobuf 4.25.4 which is incompatible.
Successfully installed protobuf-4.25.4
Note: you may need to restart the kernel to use updated packages.

## Configuration

Select the models, the input path, and the caching options to use for the transformations.

In [18]:
llm_source = "openai" # "openai" or "hf" (Hugging Face); only "openai" is wired up below
embedding_source = "openai" # "openai" or "hf" (Hugging Face); only "openai" is wired up below

llm_model = "gpt-4o"
temperature = 0

embeddings_model = "text-embedding-ada-002"

markdown_documents_path = "../../tt-readme"

use_cached_documents = True
use_cached_transforms = True
reindex_documents = True

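## Test different approaches of indexing

For each document, an LLM will
- formulate three questions that the document answers,
- explain in two to three sentences what the document answers, and
- summarize the document.

Each variant, along with the unmodified content, is indexed in its own collection.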

## Load and split markdown contents of the TT Readme


In [19]:
if use_cached_documents:
    print("Skipping loading documents from markdown files")
else:

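    # The loaders and splitters now live in dedicated packages;
    # the old langchain.* import paths are deprecated.
    from langchain_community.document_loaders import DirectoryLoader, TextLoader
    from langchain_text_splitters import MarkdownHeaderTextSplitter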

    readme_documents = DirectoryLoader(
        markdown_documents_path,
        glob="**/*.md",
        loader_cls=TextLoader
        ).load()

    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
    ]

    splitter = MarkdownHeaderTextSplitter(headers_to_split_on)

    split_documents = []
    for doc in readme_documents:
        result = splitter.split_text(doc.page_content)

        if isinstance(result, list):
            for res in result:
                res.metadata.update(doc.metadata)
            split_documents.extend(result)
        else:
            result.metadata.update(doc.metadata)
            split_documents.append(result)

    # For brevity, reduce amount of entries to a few only
    # split_documents = split_documents[50:60]

    # Number the chunks from 1 and keep the original text for later comparison
    for index, doc in enumerate(split_documents, start=1):
        doc.metadata["index"] = index
        doc.metadata["original_content"] = doc.page_content

Skipping loading documents from markdown files
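To make the splitting step easier to follow, here is a simplified, dependency-free sketch of header-based splitting. It is an illustration of the idea only, not the implementation of `MarkdownHeaderTextSplitter`; the function name `split_by_headers` and the dictionary layout are assumptions made for this example, and edge cases such as `#` inside code fences are ignored.

```python
def split_by_headers(markdown: str, levels=("#", "##")):
    """Split markdown into sections at the given header levels (simplified)."""
    sections = []
    headers = {}  # currently active header text per marker, e.g. {"#": "Office"}
    current = {"metadata": {}, "lines": []}

    def flush():
        # Store the section collected so far, skipping empty ones.
        text = "\n".join(current["lines"]).strip()
        if text:
            sections.append({"metadata": dict(current["metadata"]), "content": text})

    for line in markdown.splitlines():
        marker, _, title = line.partition(" ")
        if marker in levels and title:
            flush()
            # A new header invalidates headers of the same or a deeper level.
            depth = len(marker)
            headers = {k: v for k, v in headers.items() if len(k) < depth}
            headers[marker] = title.strip()
            metadata = {f"Header {len(k)}": v for k, v in headers.items()}
            current = {"metadata": metadata, "lines": []}
        else:
            current["lines"].append(line)
    flush()
    return sections


sample = "# Office\nIntro text\n## Hardware\nNotebook\nTablet\n# Travel\nTrains"
parts = split_by_headers(sample)
```

Each entry in `parts` carries the headers it sits under as metadata, which is the same kind of information the notebook merges with the file metadata of every chunk.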


### Persist the data to files or load cached files

In [20]:
import pickle

if use_cached_documents:
    print("Loading documents from file")
    with open("./cache/split_documents.pickle", "rb") as f:
        split_documents = pickle.load(f)
else:
    print("Writing documents to file")
    with open("./cache/split_documents.pickle", "wb") as f:
        pickle.dump(split_documents, f)

Loading documents from file
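The load-or-write pattern above recurs for every cached artifact, so it could be wrapped in a small helper. The sketch below is one possible shape, with the hypothetical name `load_or_compute`. Unlike the cell above, it falls back to computing the result when the cache file is missing, and it creates the cache directory if needed. As with any pickle file, only load caches you created yourself.

```python
import pickle
from pathlib import Path


def load_or_compute(path, compute, use_cache=True):
    """Return the pickled object at `path`, or compute it and store it there."""
    path = Path(path)
    if use_cache and path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    # Create the cache directory on first use.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

A call could look like `load_or_compute("./cache/split_documents.pickle", load_and_split, use_cached_documents)`, where `load_and_split` is a placeholder for a function that wraps the loading cell.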


## Transform content into new embedding documents

In [21]:
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

llm = ChatOpenAI(model=llm_model, temperature=temperature)

def build_chain(prompt):
    return LLMChain(llm=llm, prompt=PromptTemplate(input_variables=["input"], template=prompt))

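# The prompts are German, matching the German source documents. In English:
#   questions: "Formulate three different German questions that the following text answers"
#   answers:   "Explain in two to three German sentences what the following text answers"
#   summaries: "Create a short German summary of the following text"
question_chain = build_chain("Formuliere drei verschiedene deutsche Fragen, die der folgende Text beantwortet: {input}")
answer_chain = build_chain("Erkläre in zwei bis drei deutschen Sätzen, was der folgende Text beantwortet: {input}")
summarize_chain = build_chain("Erstelle eine kurze deutsche Zusammenfassung des folgenden Textes: {input}")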

In [22]:
import copy

def transform_documents(chain, file):
    if use_cached_transforms:
        print(f"Loading cached file {file}")
        with open(f"cache/{llm_model}_{file}_documents.pickle", "rb") as f:
            result = pickle.load(f)
        return result
    else:
        result = copy.deepcopy(split_documents)
        for doc in result:
            print(f"Transforming {file} document {doc.metadata['index']} with model {llm_model}")
            doc.metadata["original_content"] = copy.copy(doc.page_content)
            doc.page_content = chain.run(doc.page_content)
        print(f"Writing {file} documents from model {llm_model} to file")
        with open(f"cache/{llm_model}_{file}_documents.pickle", "wb") as f:
            pickle.dump(result, f)
        return result

question_documents = transform_documents(question_chain, "questions")
answer_documents = transform_documents(answer_chain, "answers")
summary_documents = transform_documents(summarize_chain, "summaries")

Loading cached file questions
Loading cached file answers
Loading cached file summaries
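The core of `transform_documents` can be tried without an API key by swapping the LLM call for a plain function. In this sketch, `Doc` is a hypothetical stand-in for a LangChain document and `str.upper` stands in for `chain.run`; both are assumptions made purely for illustration.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def transform(documents, rewrite):
    """Return transformed copies of the documents; the originals stay unchanged."""
    result = copy.deepcopy(documents)
    for doc in result:
        # Keep the source text so a search hit can be traced back to it.
        doc.metadata["original_content"] = doc.page_content
        doc.page_content = rewrite(doc.page_content)
    return result


docs = [Doc("Notebook nach Wahl", {"index": 1})]
# str.upper replaces the LLM call here.
shouted = transform(docs, str.upper)
```

Because the transformation works on a deep copy, the same `split_documents` list can feed all three chains, and the shared `index` metadata lets us match results across collections later.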


## Prepare Embeddings model

In [23]:
from langchain_openai import OpenAIEmbeddings

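if embedding_source == "openai":
    embeddings = OpenAIEmbeddings(model=embeddings_model)
else:
    # Only the OpenAI branch is implemented; fail early instead of
    # passing None to the vector store later on.
    raise ValueError(f"Unsupported embedding_source: {embedding_source!r}")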

## Prepare store

In [24]:
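# The vector store integrations live in langchain-community;
# the old langchain.vectorstores import path is deprecated.
from langchain_community.vectorstores import Qdrant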

def store(documents, collection_name):
    Qdrant.from_documents(
        documents,
        url="http://localhost:6333",
        embedding=embeddings,
        collection_name=collection_name,
        force_recreate=True,
    )

pure_collection = f"{embeddings_model}-{llm_model}-p"
question_collection = f"{embeddings_model}-{llm_model}-q"
answer_collection = f"{embeddings_model}-{llm_model}-a"
summary_collection = f"{embeddings_model}-{llm_model}-s"

collections = [pure_collection, question_collection, answer_collection, summary_collection]

## Create embeddings and store them in different collections

In [25]:
if reindex_documents:
    store(split_documents, pure_collection)
    store(question_documents, question_collection)
    store(answer_documents, answer_collection)
    store(summary_documents, summary_collection)

## Search with a query in the different indexes

In [26]:
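queries = [
    # "What do I do if I have missed my last train?"
    "Was mache ich, wenn ich meinen letzten Zug verpasst habe?",
    # "After how many years can I replace my notebook?"
    "Nach wie vielen Jahren kann ich mein Notebook erneuern?",
    # "What is MITOD?"
    "Was ist MITOD?",
]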

In [27]:
from qdrant_client import QdrantClient

client = QdrantClient("http://localhost:6333")

def search(collection, query):
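    # Use the public search method instead of the private, underscore-prefixed one;
    # it returns (document, score) pairs.
    return Qdrant(client, collection, embeddings).similarity_search_with_score(query)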

collections = [pure_collection, question_collection, answer_collection, summary_collection]

result_table = []
result_table.append(["Collection"] + queries)

for collection in collections:
    row = []
    for query in queries:
        print(f"Searching {collection} for {query}")
        search_results = search(collection, query)

        row.append("\n".join([f"{document.metadata['index']} - {score}" for document, score in search_results]))

    result_table.append([collection] + row)

Searching text-embedding-ada-002-gpt-4o-p for Was mache ich, wenn ich meinen letzten Zug verpasst habe?
Searching text-embedding-ada-002-gpt-4o-p for Nach wie vielen Jahren kann ich mein Notebook erneuern?
Searching text-embedding-ada-002-gpt-4o-p for Was ist MITOD?
Searching text-embedding-ada-002-gpt-4o-q for Was mache ich, wenn ich meinen letzten Zug verpasst habe?
Searching text-embedding-ada-002-gpt-4o-q for Nach wie vielen Jahren kann ich mein Notebook erneuern?
Searching text-embedding-ada-002-gpt-4o-q for Was ist MITOD?
Searching text-embedding-ada-002-gpt-4o-a for Was mache ich, wenn ich meinen letzten Zug verpasst habe?
Searching text-embedding-ada-002-gpt-4o-a for Nach wie vielen Jahren kann ich mein Notebook erneuern?
Searching text-embedding-ada-002-gpt-4o-a for Was ist MITOD?
Searching text-embedding-ada-002-gpt-4o-s for Was mache ich, wenn ich meinen letzten Zug verpasst habe?
Searching text-embedding-ada-002-gpt-4o-s for Nach wie vielen Jahren kann ich mein Notebook ern
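Conceptually, each search embeds the query and ranks the stored vectors by similarity. The sketch below shows that ranking on toy two-dimensional vectors, assuming cosine similarity as the distance measure (we have not verified the collection settings, so treat that as an assumption). The names `cosine_similarity`, `top_k` and `toy_vectors` are made up for this example, and `k=4` mirrors the four hits per query visible in the results.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vector, indexed_vectors, k=4):
    """Return the k (index, score) pairs with the highest cosine similarity."""
    scored = [
        (index, cosine_similarity(query_vector, vector))
        for index, vector in indexed_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy data: document index -> embedding vector.
toy_vectors = {1: [1.0, 0.0], 2: [0.7, 0.7], 3: [0.0, 1.0]}
ranking = top_k([1.0, 0.1], toy_vectors, k=2)
```

Here `ranking` lists document 1 first and document 2 second, since the query vector points almost entirely along the first axis.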

In [28]:
from tabulate import tabulate

print(tabulate(result_table, tablefmt="grid", headers="firstrow"))

+---------------------------------+-------------------------------------------------------------+-----------------------------------------------------------+------------------+
| Collection                      | Was mache ich, wenn ich meinen letzten Zug verpasst habe?   | Nach wie vielen Jahren kann ich mein Notebook erneuern?   | Was ist MITOD?   |
| text-embedding-ada-002-gpt-4o-p | 195 - 0.8112347                                             | 189 - 0.7783993                                           | 137 - 0.8327009  |
|                                 | 193 - 0.80416673                                            | 146 - 0.7780005                                           | 41 - 0.7708699   |
|                                 | 194 - 0.80018735                                            | 156 - 0.7773744                                           | 13 - 0.76783705  |
|                                 | 155 - 0.79777145                                            | 99 - 0.7752526   
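The table is convenient to read but awkward to process further. As a small sketch, a cell can be parsed back into `(index, score)` pairs to pick the best hit programmatically; `parse_hits` is a hypothetical helper, and the sample cell reuses the first four hits shown above for the plain-content collection and the first query.

```python
def parse_hits(cell):
    """Parse a newline-separated 'index - score' cell into (index, score) pairs."""
    hits = []
    for line in cell.splitlines():
        index, _, score = line.partition(" - ")
        hits.append((int(index), float(score)))
    return hits


# Values copied from the table above.
cell = "195 - 0.8112347\n193 - 0.80416673\n194 - 0.80018735\n155 - 0.79777145"
hits = parse_hits(cell)
best_index, best_score = max(hits, key=lambda hit: hit[1])
```

In practice it would be simpler to keep the raw `search_results` in a dictionary keyed by collection and query, and only format them as strings when printing.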

## Check a result

To check a result, set `found_index` in the following cell to an index from the table and run it.

In [29]:
found_index = 156

def find_document(documents, index):
    """Return the first document whose metadata index matches, or None."""
    return next((doc for doc in documents if doc.metadata["index"] == index), None)

# Look up each variant separately so a missing entry cannot silently
# reuse the document found in a previous list.
found_document = find_document(split_documents, found_index)
print(f'{found_document.page_content}\n\n')
print(f'{found_document.metadata}\n\n')

print(f"Questions: {find_document(question_documents, found_index).page_content}\n\n")
print(f"Answers: {find_document(answer_documents, found_index).page_content}\n\n")
print(f"Summary: {find_document(summary_documents, found_index).page_content}\n\n")


### Büroausstattung  
Zur Standard-Ausstattung gehört nach bestandener Probezeit ein höhenverstellbarer Tisch sowie ein "personalisierter" Bürostuhl (personalisiert heißt, dass dein Name drauf steht und er entsprechend deinen Bedürfnissen eingestellt wurde). Gleiches gilt im Übrigen auch fürs HomeOffice (hast du einen HomeOffice Vertrag, wird dir alles nach Hause geliefert). Bestellung und Lieferung vor allem der Tische ist aufwendig, deshalb machen wir das dann, wenn klar ist, dass du mit dieser Ausstattung länger arbeiten wirst.  
### Hardwareausstattung  
Jeder Mitarbeiter bekommt Folgendes an Hardware:  
* Notebook nach Wahl (z. B. Macbook, Surface Book, Dell XPS 15) - Wichtig ist, dass du dich mit deiner Hardware wohl fühlst und mit maximaler Leistung arbeiten kannst.
* Tablet nach Wahl
* Telefon nach Wahl
* 2x Monitore (alternativ kann auch ein etwas größerer Monitor bestellt werden)
* Noise-Cancelling Kopfhörer  
Außerdem alles, was man sonst noch an Zubehör benötigt:  
* Netzte