# Evaluation

We begin with the setup: a small RAG class that makes the rest of the notebook easier to work with.


In [30]:
import os
import numpy as np

from dotenv import load_dotenv
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings

if os.getcwd().endswith("notebooks"):
    os.chdir("..")

load_dotenv()

class RAG:
    def __init__(self, model="gpt-4.1-mini"):
        self.llm = AzureChatOpenAI(model=model)
        self.embeddings = AzureOpenAIEmbeddings(model="text-embedding-3-small")
        self.doc_embeddings = None
        self.docs = None

    def load_documents(self, documents):
        """Load documents and compute their embeddings."""
        self.docs = documents
        self.doc_embeddings = self.embeddings.embed_documents(documents)

    def get_most_relevant_docs(self, query):
        """Find the most relevant document for a given query."""
        if not self.docs or not self.doc_embeddings:
            raise ValueError("Documents and their embeddings are not loaded.")

        query_embedding = self.embeddings.embed_query(query)
        similarities = [
            np.dot(query_embedding, doc_emb)
            / (np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb))
            for doc_emb in self.doc_embeddings
        ]
        most_relevant_doc_index = np.argmax(similarities)
        return [self.docs[most_relevant_doc_index]]

    def generate_answer(self, query, relevant_doc):
        """Generate an answer for a given query based on the most relevant document."""
        prompt = f"question: {query}\n\nDocuments: {relevant_doc}"
        messages = [
            ("system", "You are a helpful assistant that answers questions based on given documents only."),
            ("human", prompt),
        ]
        ai_msg = self.llm.invoke(messages)
        return ai_msg.content
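
The retrieval step in `get_most_relevant_docs` ranks documents by cosine similarity between the query embedding and each document embedding. The following is a minimal sketch of that ranking on made-up three-dimensional vectors, written with the standard library only so it runs without API access. The names `cosine_similarity`, `toy_doc_embeddings`, `toy_query_embedding` and `toy_best_index` are hypothetical and not part of the class above; real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for three documents (illustration only).
toy_doc_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
# Made-up query embedding that points roughly between the first two axes.
toy_query_embedding = [2.0, 1.9, 0.0]

toy_similarities = [
    cosine_similarity(toy_query_embedding, doc_emb)
    for doc_emb in toy_doc_embeddings
]
# Index of the highest similarity, the equivalent of np.argmax.
toy_best_index = max(range(len(toy_similarities)), key=toy_similarities.__getitem__)

print(toy_similarities)
print(toy_best_index)
```

Because cosine similarity depends only on direction, the third toy document wins even though the query vector is much longer than any document vector.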

Here are some of the metrics we use for evaluation:

* Context recall: number of claims supported by the context / total number of claims
    * for this, an LLM breaks the reference answer into individual claims and judges for each one whether it follows from the retrieved context
* Faithfulness: measures the quality of the RAG pipeline's generator by evaluating whether the generated answer (`actual_output`) factually aligns with the contents of the retrieved context (`retrieval_context`).
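
The ratio behind these metrics can be illustrated without an LLM. The sketch below assumes the per-claim judgments are already available as a list of booleans; in a real evaluation an LLM produces them. The function name `supported_claim_ratio` and the example values are hypothetical and chosen only for illustration.

```python
def supported_claim_ratio(verdicts):
    """Return the fraction of claims that were judged as supported by the context.

    `verdicts` holds one boolean per claim: True if the claim is supported.
    """
    if not verdicts:
        raise ValueError("At least one claim is required.")
    return sum(1 for verdict in verdicts if verdict) / len(verdicts)


# Illustrative verdicts for four claims, three of them supported.
example_verdicts = [True, True, False, True]
example_score = supported_claim_ratio(example_verdicts)
print(example_score)  # 0.75
```

A score of 1.0 means every claim is backed by the context; each unsupported claim lowers the score proportionally.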

For this we will create a test case below. We will make use of DeepEval, a framework for evaluation that provides many convenient measures.

Before we can use it, some setup is required. Run the lines below in your terminal before continuing with the notebook.

We initialize an instance of the RAG class and load a document:

In [31]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize RAG instance
rag = RAG()

# Load documents
pdf_path = "data/Broschure-Sortiment-MicroFluidics.pdf"
loader = PyPDFLoader(pdf_path)
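# Split the PDF into chunks and pass the plain text of each chunk to the RAG instance
chunks = loader.load_and_split(text_splitter=RecursiveCharacterTextSplitter())
rag.load_documents(documents=[chunk.page_content for chunk in chunks])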
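
How the text is cut into chunks determines what the retriever can return. As a rough intuition for the `chunk_size` and `chunk_overlap` parameters of the text splitter, here is a deliberately simplified stand-in that cuts a string into fixed-size windows. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one also tries to split at separators such as paragraph breaks); the function `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    Simplified illustration, not the LangChain implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive.")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size).")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = chunk_text(toy_text, chunk_size=10, chunk_overlap=2)
print(toy_chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Smaller chunks give the retriever more, but shorter, candidates; larger chunks carry more context per candidate.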

Now we are ready for a test query:

In [None]:
# Query and retrieve the most relevant document
query = "What is the whisper technology?"
relevant_doc = rag.get_most_relevant_docs(query)

# Generate an answer
answer = rag.generate_answer(query, relevant_doc)

print(f"Query: {query}")
print(f"Relevant Document: {relevant_doc}")
print(f"Answer: {answer}")

## Exercise

1.1. Which metrics are available? See: https://docs.ragas.io/en/latest/concepts/metrics/available_metrics

1.2. Compare the models "gpt-4o" and "gpt-4.1-mini", or try other parameters for the text splitter. How do the metrics change?

In [39]:
sample_queries = [
    "What is the whisper technology?",
]

expected_responses = [
    "Whisper technology refers to an innovative valve drive technology used in Bürkert Whisper Valves.",
]

In [34]:
dataset = []

# Build one evaluation record per query: retrieve, generate, and store the reference
for query, reference in zip(sample_queries, expected_responses):
    relevant_docs = rag.get_most_relevant_docs(query)
    response = rag.generate_answer(query, relevant_docs)
    dataset.append(
        {
            "user_input": query,
            "retrieved_contexts": relevant_docs,
            "response": response,
            "reference": reference,
        }
    )

In [35]:
from ragas import EvaluationDataset
evaluation_dataset = EvaluationDataset.from_list(dataset)

In [None]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

llm = AzureChatOpenAI(model="gpt-4.1-mini")
embeddings = AzureOpenAIEmbeddings(model="text-embedding-3-small")
evaluator_llm = LangchainLLMWrapper(llm)

In [None]:
from ragas.metrics import Faithfulness, FactualCorrectness

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[Faithfulness(), FactualCorrectness(mode="precision")],
    llm=evaluator_llm
)
result
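
For the comparison asked for in exercise 1.2, it can help to run the same evaluation over a grid of settings and collect the scores in one place. The sketch below shows only the bookkeeping. `compare_configurations` and `placeholder_run_evaluation` are hypothetical helpers, the score keys are illustrative names, and the placeholder returns no real scores; in the notebook it would be replaced by a function that rebuilds the RAG instance, regenerates the dataset, and calls `evaluate`.

```python
from itertools import product


def compare_configurations(models, chunk_sizes, run_evaluation):
    """Call `run_evaluation` for every (model, chunk_size) pair and collect rows."""
    rows = []
    for model, chunk_size in product(models, chunk_sizes):
        scores = run_evaluation(model=model, chunk_size=chunk_size)
        rows.append({"model": model, "chunk_size": chunk_size, **scores})
    return rows


def placeholder_run_evaluation(model, chunk_size):
    """Stand-in for a real evaluation run; returns empty scores, not results."""
    return {"faithfulness": None, "factual_correctness": None}


comparison_rows = compare_configurations(
    models=["gpt-4o", "gpt-4.1-mini"],
    chunk_sizes=[500, 2000],  # arbitrary example values
    run_evaluation=placeholder_run_evaluation,
)
for row in comparison_rows:
    print(row)
```

Keeping the evaluation behind a single callable makes it easy to change only one setting at a time, so differences in the metrics can be attributed to that setting.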