# Using LLMs to Evaluate RAG Solutions

## Overview
This notebook demonstrates how to use Large Language Models (LLMs) to evaluate Retrieval-Augmented Generation (RAG) solutions. 

Key goals:
- Load and configure LLMs, starting with OpenAI models and extending to local models.
- Use a parser to handle outputs.
- Explore how LLMs can assist in evaluating RAG workflows.

### References
- [GitHub Repository by svpino](https://github.com/svpino/llm/blob/main/local.ipynb)
- [YouTube Video](https://www.youtube.com/watch?v=ZPX3W77h_1E&t=1161s)

# Environment Setup

1. **Create a virtual environment**:
   ```bash
   python3 -m venv venv
   ```

2. **Activate the virtual environment**:
   - Linux/Mac:
     ```bash
     source venv/bin/activate
     ```
   - Windows:
     ```bash
     venv\Scripts\activate
     ```

3. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

4. **Set up `.env` file**:
   - Add the following line:
     ```
     OPENAI_API_KEY=your_api_key_here
     ```

5. **Run the cells sequentially for a step-by-step demonstration.**
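
Before running the cells, it can help to confirm that the key from step 4 was actually picked up. The following is a minimal sketch, not part of the original notebook; the helper name `require_env` is hypothetical, and it assumes the variable is named `OPENAI_API_KEY` as above.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment")
    return value


# In the notebook this would run after load_dotenv(), for example:
# api_key = require_env("OPENAI_API_KEY")
```

Failing early with a named variable is usually easier to debug than an authentication error raised deep inside the first model call.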

### **Code Cell 1**: Load API Key and Specify Model

In [None]:
import os
from dotenv import load_dotenv
from langchain_ollama import OllamaLLM  # Replaces langchain_community.llms.Ollama
from langchain_huggingface import HuggingFaceEmbeddings  # New embedding option
from langchain_chroma import Chroma  # Adds Chroma vector storage
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
import torch 

load_dotenv()

device = "cuda" if torch.cuda.is_available() else "cpu"

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

AVAILABLE_LLMS = {
    "ChatGPT3.5-turbo": "gpt-3.5-turbo",
    "Llama3.2-3b": "llama3.2:3b",
}

# Prompt used to smoke-test each model in the cells below
EXAMPLE_PROMPT = "Tell me a joke."

Using model: gpt-3.5-turbo
Input:
  - Tell me a joke.
Output (wo. parser):
  - Why did the scarecrow win an award? Because he was outstanding in his field!

Using model: llama3.2:3b
Input:
  - Tell me a joke.
Output (wo. parser):
  - A man walked into a library and asked the librarian, "Do you have any books on Pavlov's dogs and Schrödinger's cat?" The librarian replied, "It rings a bell, but I'm not sure if it's here or not."


In [92]:
def set_up_model(model_name):
    """
    Sets up the LLM for the given model name.

    A matching embedding model is also initialized, but only the LLM is returned.

    Parameters:
        model_name (str): The key of the model in AVAILABLE_LLMS (e.g., "ChatGPT3.5-turbo", "Llama3.2-3b").

    Returns:
        model: The initialized LLM (a LangChain runnable).
    """
    # Check for valid input against the global AVAILABLE_LLMS defined in Code Cell 1
    if model_name not in AVAILABLE_LLMS:
        raise ValueError(f"Unsupported model: {model_name}. Please choose from: {list(AVAILABLE_LLMS)}")
    
    # Load the model and embeddings based on the input
    if model_name == "ChatGPT3.5-turbo":
        # Set up GPT model
        OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
        model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model=AVAILABLE_LLMS[model_name])
        
        # Set up embedding model
        embeddings = OpenAIEmbeddings()
    
    elif model_name == "Llama3.2-3b":
        # Set up Llama model
        model = OllamaLLM(model=AVAILABLE_LLMS[model_name])
        
        # Set up embedding model
        emb_model_name = "sentence-transformers/all-mpnet-base-v2"
        device = "cuda" if torch.cuda.is_available() else "cpu"
        model_kwargs = {'device': device}
        encode_kwargs = {'normalize_embeddings': False}
        
        embeddings = HuggingFaceEmbeddings(
            model_name=emb_model_name,
            model_kwargs=model_kwargs,
            encode_kwargs=encode_kwargs
        )
    else:
        # This should never be reached due to the initial validation
        raise ValueError(f"Unexpected error for model: {model_name}")    
    
    return model



# **Code Cell 2**: Run chain and display output with and without parser

In [81]:
def test_chain(chain, prompt, model_name, has_parser=False):
    """
    Tests the chain by invoking it with a given prompt and prints the results.
    
    Parameters:
        chain: The chain to be tested.
        prompt (str): The input prompt for the chain.
        model_name (str): The name of the model being tested.
        has_parser (bool): Whether the chain includes a parser.
    """
    # Invoke the chain and get the output
    output = chain.invoke(prompt)
    
    # Print results with improved structure
    print("\n" + "="*50)
    print(f"Testing Chain: {model_name} {'with Parser' if has_parser else 'without Parser'}")
    print("-" * 50)
    print("Input Prompt:")
    print(f"  {prompt}")
    print("-" * 50)
    print("Output:")
    print(f"  {output}")
    print("="*50 + "\n")


# Initialize parser
parser = StrOutputParser()

# Test chains for all available models
for model_name in AVAILABLE_LLMS.keys():
    # Set up chain without parser and test it
    model = set_up_model(model_name)
    chain = model
    test_chain(chain, EXAMPLE_PROMPT, model_name, has_parser=False)
    
    # Set up chain with parser and test it
    chain = model | parser
    test_chain(chain, EXAMPLE_PROMPT, model_name, has_parser=True)



Testing Chain: ChatGPT3.5-turbo without Parser
--------------------------------------------------
Input Prompt:
  Tell me a joke.
--------------------------------------------------
Output:
  content="Why don't scientists trust atoms? \n\nBecause they make up everything!" additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 15, 'prompt_tokens': 12, 'total_tokens': 27, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None} id='run-579a40ce-755f-4ce0-9232-99973ec85c12-0' usage_metadata={'input_tokens': 12, 'output_tokens': 15, 'total_tokens': 27, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}


Testing Chain: ChatGPT3.5-turbo with Parse
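
The unparsed output above is a message object that carries the text together with metadata, whereas a string output parser keeps only the text. The sketch below imitates that behaviour with a hypothetical `FakeMessage` class instead of LangChain's real message type, so it runs without an API key; it illustrates the idea and is not the library's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class FakeMessage:
    """Hypothetical stand-in for a chat model reply (text plus metadata)."""
    content: str
    response_metadata: dict = field(default_factory=dict)


def to_text(output) -> str:
    """Mimic what a string output parser does: keep only the text."""
    # Chat models return message-like objects; plain LLMs already return str.
    return output if isinstance(output, str) else output.content


reply = FakeMessage("Why don't scientists trust atoms?", {"total_tokens": 27})
print(to_text(reply))               # message object -> its text
print(to_text("already a string"))  # str passes through unchanged
```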

##  **Code Cell 4**: Data generation - Scrape some websites and split the content into pages

In [82]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import textwrap

# wikipages to scrape to use in our RAG
web_pages = [
    'https://sv.wikipedia.org/wiki/Hans_Alfredson', 
    'https://sv.wikipedia.org/wiki/Tage_Danielsson',
    'https://sv.wikipedia.org/wiki/Hasse_och_Tage'
    ]

# Initialize a web loader to scrape the specified pages
loader_multiple_pages = WebBaseLoader(web_pages)

# Load the content from the web pages and split it into individual documents
docs = loader_multiple_pages.load_and_split()

# Use RecursiveCharacterTextSplitter to divide the documents into smaller chunks
# - `chunk_size=1000`: Maximum size of each chunk in characters
# - `chunk_overlap=200`: Number of overlapping characters between consecutive chunks
pages = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
).split_documents(docs)

#clean up the pages slightly, by removing newlines etc.
for page in pages:
    
    # Dedent the text to remove excessive indentation
    cleaned_content = textwrap.dedent(page.page_content)
    # Replace newlines with spaces to remove line breaks
    page.page_content = cleaned_content.replace("\n", " ").replace("\r", " ")
    print(page.page_content)
    print('')


Hans Alfredson – Wikipedia                                      Hoppa till innehållet        Huvudmeny      Huvudmeny flytta till sidofältet dölj    		Navigering    HuvudsidaIntroduktionDeltagarportalenBybrunnenSenaste ändringarnaSlumpartikelLadda upp filerKontakta WikipediaHjälp                    Sök            Sök                       Utseende                 Stöd Wikipedia  Skapa konto  Logga in         Personliga verktyg      Stöd Wikipedia Skapa konto Logga in      		Sidor för utloggade redigerare läs mer    BidragDiskussion                             Innehåll flytta till sidofältet dölj     Inledning      1 Biografi     Växla underavsnittet Biografi      1.1 Bakgrund         1.2 Tidig karriär, Svenska Ord         1.3 Egna projekt         1.4 Senare år         1.5 Familj           2 Eftermäle         3 Priser och utmärkelser         4 Skulptur         5 Filmografi     Växla underavsnittet Filmografi      5.1 Som regissör och manusförfattare i urval

1.5 Familj           2 Efter
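
The effect of `chunk_size` and `chunk_overlap` in the cell above can be illustrated with a simplified fixed-window splitter. This is only a sketch: `split_fixed` is a hypothetical helper, and `RecursiveCharacterTextSplitter` itself splits on separators rather than at fixed offsets, so its chunks will differ.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = split_fixed("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the same mechanism that lets a sentence cut at a chunk boundary still appear whole in at least one chunk.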

# From pages to Q-A pairs for evaluation

In this section, we focus on building a **knowledge base** from the extracted data. A **knowledge base** is an organized representation of information designed for retrieval and reasoning. It differs from the raw pages or documents in that it transforms the text into structured, machine-readable formats (e.g., embeddings and vectors) for efficient querying and analysis.

#### Key Components and Their Roles:
1. **The Pages (Raw Documents):**  
   - These are the unprocessed, textual data loaded directly from the source (e.g., a website or other text files).  
   - In our case, they are split into smaller, manageable chunks for processing and stored in a `pandas` DataFrame with a single column labeled `text`. This `DataFrame` is the foundational representation of the raw content.

2. **The Knowledge Base:**  
   - The knowledge base is a higher-level abstraction created from the raw documents.  
   - It organizes the raw text into a structure suitable for reasoning and retrieval, associating metadata with text chunks and creating a framework for querying based on semantics.  
   - Unlike the raw pages, the knowledge base processes the text to make it machine-queryable (e.g., by creating embeddings or organizing it into topics). This is done using the `KnowledgeBase` class from `giskard.rag`.

3. **The Vector Store:**  
   - The vector store is a data structure used to store numerical representations of the text, called **embeddings**.  
   - These embeddings are created using language models and capture the semantic meaning of text chunks.  
   - The vector store allows for similarity searches, enabling the retrieval of relevant chunks of text based on user queries.

4. **The Test Set:**  
   - The test set is a collection of pre-defined questions, reference answers, and contexts generated from the knowledge base.  
   - It serves as a validation tool to evaluate the performance of a system (e.g., a chatbot) in retrieving and reasoning over the knowledge base.  
   - It is used to check whether the system built on the knowledge base (retriever plus LLM) returns answers that match the reference answers.
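
The similarity search described for the vector store can be sketched in plain Python. The three-dimensional vectors below are made up for illustration; a real embedding model produces vectors with hundreds of dimensions, and the helper names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings": each entry pairs a chunk of text with its vector.
store = [
    ("chunk about birth date", [0.9, 0.1, 0.0]),
    ("chunk about films", [0.1, 0.9, 0.2]),
    ("chunk about awards", [0.0, 0.3, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of a question about a birth date
for text, score in top_k(query_vec, store):
    print(f"{score:.2f}  {text}")
```

A retriever does essentially this: it embeds the question, scores it against the stored vectors, and returns the best-matching chunks.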

#### The Relationships Between These Components:
- The **raw documents** (pages) are the input for creating the **DataFrame**, where the content is structured into rows.  
- The **knowledge base** is built from the DataFrame and provides the logical interface for interacting with the content.  
- The **vector store** holds the embeddings of the same chunks and backs the retriever used by the RAG chain. In this notebook it is built separately from the giskard knowledge base (see Code Cell 5).  
- The **test set** is generated from the knowledge base and is used to assess its accuracy and relevance when answering questions.

---

#### Steps in the Code:
1. **Convert Documents into a DataFrame:**  
   - The content of each document is stored in a DataFrame as a single column labeled `text`. This is the initial step to structure the raw text data.

2. **Build the Knowledge Base:**  
   - Using the `KnowledgeBase` class from `giskard.rag`, the DataFrame is transformed into an organized knowledge base.

3. **(Optional) Generate the Test Set:**  
   - With the knowledge base in place, we generate a test set using the `generate_testset` function. This creates a suite of questions with reference answers to validate the knowledge base's performance.

#### Summary:
- The **DataFrame** is the raw structured data.
- The **Knowledge Base** is an abstraction for reasoning and querying the data.
- The **Vector Store** is the numerical representation (embeddings) supporting efficient retrieval.
- The **Test Set** validates the quality of the knowledge base and the system using it.

This flow takes the raw text through a fixed sequence (chunks, DataFrame, knowledge base, test set), so that the RAG system can later be evaluated against reference answers.

# **Code Cell 5**: Load the content of documents/pages in a vector store

In [83]:
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import DocArrayInMemorySearch

# Note: we embed the "pages" (chunks), not necessarily the full documents.
# This is a choice that affects the granularity and size of the retrieved context;
# in some cases it may be preferable to embed the full documents instead.
vectorstore = DocArrayInMemorySearch.from_documents(
    pages, embedding=OpenAIEmbeddings()
)

## **Code Cell 6**: Create dataframe

In [84]:
import pandas as pd

df = pd.DataFrame([p.page_content for p in pages], columns=["text"])
df.head(10)

Unnamed: 0,text
0,Hans Alfredson – Wikipedia ...
1,1.5 Familj 2 Eftermäle 3 Pri...
2,العربيةČeštinaDanskDeutschEnglishEspañolفارسیF...
3,Hans Alfredson Hans Alfredson med sin hedersg...
4,"Från vänster: Hans Alfredson, Lissi Alandh, Mi..."
5,Biografi[redigera | redigera wikitext] Bakgrun...
6,"Tidig karriär, Svenska Ord[redigera | redigera..."
7,Alfredson debuterade 1948 som skribent på Hels...
8,AB Svenska Ord. De producerade revyer och film...
9,Alfredson gjorde sina första revysketcher till...


## **Code Cell 7**: Create knowledge base

In [85]:
from giskard.rag import KnowledgeBase
knowledge_base = KnowledgeBase(df)

## **Code Cells 7-8**: Generate and save testset as jsonl
Generating the test set took about 3 min 20 s on Jarvis with the pages from the Hasse och Tage wikis. If a `test-set.jsonl` file already exists in the file browser to the left, you can skip this cell.

In [86]:
from giskard.rag import generate_testset

# testset = generate_testset(
#     knowledge_base,
#     num_questions=60,
#     agent_description="A chatbot answering questions about Hans Alfredson, Tage Danielsson and Hasse och Tage",
# )
#testset.save("test-set.jsonl")

## **Code Cell 8**: Sample the test set
By sampling the test set, we can see that giskard has generated questions and reference answers based on the documents/pages we encoded in the knowledge base.

In [87]:
import pandas as pd
import json

def load_jsonl_to_dataframe(jsonl_file_path):
    data = []
    with open(jsonl_file_path, 'r', encoding='utf-8') as file:
        for line in file:
            data.append(json.loads(line))
    return pd.DataFrame(data)

# File path to your JSONL file
jsonl_file_path = "test-set.jsonl"

# Load the test set
test_set_df = load_jsonl_to_dataframe(jsonl_file_path)

for number, (_, row) in enumerate(test_set_df.head(3).iterrows(), start=1):
    print(f"Question {number}: {row['question']}")
    print(f"Reference answer: {row['reference_answer']}")
    print("Reference context:")
    print(row['reference_context'])
    print("******************", end="\n\n")

Question 1: What are the main sections listed in the table of contents for the article about Hans Alfredson?
Reference answer: 1.5 Familj, 2 Eftermäle, 3 Priser och utmärkelser, 4 Skulptur, 5 Filmografi, 6 Teater, 7 Bibliografi, 8 Diskografi, 9 Referenser, 10 Vidare läsning (litteratur om Hans Alfredson), 11 Externa länkar.
Reference context:
Document 1: 1.5 Familj           2 Eftermäle         3 Priser och utmärkelser         4 Skulptur         5 Filmografi     Växla underavsnittet Filmografi      5.1 Som regissör och manusförfattare i urval         5.2 Manus i samarbete med Tage Danielsson         5.3 Roller         5.4 Övrigt           6 Teater     Växla underavsnittet Teater      6.1 Regi         6.2 Roller       6.2.1 Scenografi         6.2.2 Dramatik         6.2.3 Översättningar (i urval)             7 Bibliografi         8 Diskografi         9 Referenser     Växla underavsnittet Referenser      9.1 Noter         9.2 Webbkällor           10 Vidare läsning (litteratur om Hans Alfr

# Moving towards evaluation of the RAG using this test set

# **Code Cell 9**: Prepare the prompt template

In [88]:
from langchain_core.prompts import PromptTemplate

template = """
Answer the question based on the context below. If you can't 
answer the question, reply "I don't know" and nothing else.

Context: {context}

Question: {question}
"""

prompt = PromptTemplate.from_template(template)
test_prompt = prompt.format(context="I wear a yellow shirt and blue shoes", question="What color is my sock")
print(test_prompt)


for model_name in AVAILABLE_LLMS.keys():
    # set_up_model returns the bare model; add the parser so the output is plain text
    chain = set_up_model(model_name) | parser
    output = chain.invoke(test_prompt)
    print("")
    print(f"asking {model_name}")
    print(output)


Answer the question based on the context below. If you can't 
answer the question, reply "I don't know" and nothing else.

Context: I wear a yellow shirt and blue shoes

Question: What color is my sock


asking ChatGPT3.5-turbo
I don't know

asking Llama3.2-3b
I don't know


# **Code Cell 10:**  Illustrate the retriever
Here we can see how the retriever works, given the prompt

In [69]:
query = "När föddes Hasse Alfredson?"  # "When was Hasse Alfredson born?"
retriever = vectorstore.as_retriever()
# invoke() replaces the deprecated get_relevant_documents()
retriever.invoke(query)


[Document(metadata={'source': 'https://sv.wikipedia.org/wiki/Hans_Alfredson', 'title': 'Hans Alfredson – Wikipedia', 'language': 'sv'}, page_content='Från vänster: Hans Alfredson, Lissi Alandh, Mille Schmidt, Tage Danielsson och Gösta Ekman med simborgarmärket 1962. Hans Folke "Hasse" Alfredson, ursprungligen Alfredsson, född 28 juni 1931 i Sankt Pauli församling i Malmö,[1] död 10 september 2017[2] på Lidingö,[3][4] var en svensk komiker, filmskapare, skådespelare och författare. Han var främst känd för sitt mångåriga samarbete med Tage Danielsson under benämningen Hasse och Tage, men hade även en omfattande självständig produktion. Han var chef för Skansen i Stockholm åren 1992–1994.[5]'),
 Document(metadata={'source': 'https://sv.wikipedia.org/wiki/Hans_Alfredson', 'title': 'Hans Alfredson – Wikipedia', 'language': 'sv'}, page_content='Hans Alfredson  Hans Alfredson med sin hedersguldbagge på Guldbaggegalan 2013.FöddHans Folke Alfredsson[1]28 juni 1931[1]Sankt Pauli församling, Malm

# **Code Cell 11**: Construct the RAG chain

In [90]:
from langchain_openai.chat_models import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from operator import itemgetter

def build_rag_chain(retriever, llm):
    # Stolen from merima
    template = """
    Use the following context to answer the question. Only use  information from the context provided.
    Do not ask questions.

    Context: {context}

    Question: {question}

    Answer: """

    prompt = PromptTemplate.from_template(template)

    doc_chain = retriever | (lambda docs: docs if docs else None)

    rag_chain = (
        {"context": retriever | (lambda docs: " ".join(doc.page_content for doc in docs) if docs else ""),
        "question": RunnablePassthrough()}
        | prompt
        | llm
        # Parse to a plain string first: chat models return a message object, not a str
        | StrOutputParser()
        | (lambda output: output.replace("\n", " ").strip())
    )

    combined_chain = RunnableParallel(
        {
            "question": RunnablePassthrough(),
            "answer": rag_chain,
            "docs": doc_chain,
        }
    )

    return combined_chain

retriever = vectorstore.as_retriever()

model_name = "Llama3.2-3b"
model = set_up_model(model_name)
rag_chain = build_rag_chain(retriever, model)
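
The chain built above can be read as four steps: retrieve documents for the question, join their text into one context string, fill the prompt template, and pass the prompt to the model. The sketch below mirrors that data flow in plain Python. `fake_retrieve` and `fake_llm` are hypothetical stand-ins for the real retriever and model, so nothing here calls LangChain.

```python
TEMPLATE = "Context: {context}\n\nQuestion: {question}\n\nAnswer: "


def fake_retrieve(question: str) -> list[str]:
    """Hypothetical retriever: return stored chunks sharing a word with the question."""
    chunks = ["Hasse was born in 1931 in Malmö.", "Tage wrote many revues."]
    words = set(question.lower().replace("?", "").split())
    return [c for c in chunks if words & set(c.lower().rstrip(".").split())]


def fake_llm(prompt: str) -> str:
    """Hypothetical model: just reports how much prompt text it received."""
    return f"(model saw {len(prompt)} characters)\n"


def rag_answer(question: str) -> dict:
    docs = fake_retrieve(question)                                 # 1. retrieve
    context = " ".join(docs) if docs else ""                       # 2. join into one string
    prompt = TEMPLATE.format(context=context, question=question)   # 3. fill the template
    answer = fake_llm(prompt).replace("\n", " ").strip()           # 4. call the model, tidy the text
    return {"question": question, "answer": answer, "docs": docs or None}


result = rag_answer("When was Hasse born?")
print(result["docs"])
print(result["answer"])
```

Note that the question enters as a plain string and is handed both to the retriever and to the prompt, which is also how the input reaches the two branches of the real chain.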

# **Code Cell 12**: Let's test it

In [91]:
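# The chain passes its input straight to the retriever, so it expects a plain string.
# Passing a dict such as {"question": ...} raises:
#   TypeError: argument 'text': 'dict' object cannot be converted to 'PyString'
rag_chain.invoke("När föddes hasse?")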

# Evaluating the model on the test set
We define a function that invokes the chain with a specific question and returns the answer. Then we can use the `evaluate` function to evaluate the model on the test set. This function compares the answers from the chain with the reference answers in the test set.

In [24]:
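def answer_fn(question, history=None):
    # rag_chain takes the question as a plain string and returns a dict;
    # evaluate() needs only the answer text
    return rag_chain.invoke(question)["answer"]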

The cell below took about 2 min 20 s on Jarvis with the Hasse och Tage data, using gpt-3.5-turbo.

In [25]:
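from giskard.rag import evaluate, QATestset

# Load the saved test set, since the generation cell above is commented out
testset = QATestset.load("test-set.jsonl")

report = evaluate(answer_fn, testset=testset, knowledge_base=knowledge_base)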

Asking questions to the agent: 100%|██████████| 60/60 [01:06<00:00,  1.11s/it]
CorrectnessMetric evaluation: 100%|██████████| 60/60 [01:05<00:00,  1.09s/it]


In [31]:
display(report)

In [28]:
report.to_html("report.html")

In [29]:
report.correctness_by_question_type()

| question_type | correctness |
|---|---|
| complex | 0.6 |
| conversational | 0.2 |
| distracting element | 0.4 |
| double | 0.7 |
| simple | 0.6 |
| situational | 0.4 |
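
Each score above is the fraction of questions of that type whose answer was judged correct. A small sketch of that aggregation follows; the `results` list is made-up illustrative data, not the notebook's results, and `correctness_by_type` is a hypothetical helper rather than giskard's implementation.

```python
from collections import defaultdict


def correctness_by_type(results):
    """Map each question type to the share of its answers marked correct."""
    counts = defaultdict(lambda: [0, 0])  # type -> [correct, total]
    for question_type, is_correct in results:
        counts[question_type][0] += int(is_correct)
        counts[question_type][1] += 1
    return {qtype: correct / total for qtype, (correct, total) in counts.items()}


# Illustrative judgments: (question type, was the answer judged correct?)
results = [("simple", True), ("simple", False), ("complex", True), ("complex", True)]
print(correctness_by_type(results))  # {'simple': 0.5, 'complex': 1.0}
```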


In [1]:
# Requires `report` from the evaluate() cell above; running this cell in a fresh
# kernel raises NameError: name 'report' is not defined.
failures = report.get_failures()
failures