# RAG Evaluation

In this notebook, we evaluate different RAG (Retrieval-Augmented Generation) approaches to determine the most effective method for our task.

## Evaluation Process

To conduct a comprehensive evaluation, we will:

1. Use questions from the ground-truth dataset
2. Implement multiple RAG approaches
3. Generate responses using each approach
4. Measure the performance of each method using different techniques 
5. Compare the results and select the best-performing approach

## RAG Approaches to Evaluate

We will assess the following RAG approaches:

1. Retrieval: a prompt with 5 full articles in context versus 10 chunks
2. Prompting: a more sophisticated prompt versus a simpler one
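
The two axes above combine into six variants. A minimal sketch of enumerating them, assuming the `retrieval+prompt` naming that the `rag` function in this notebook dispatches on (the `RETRIEVERS` and `PROMPTS` names are illustrative, not part of the notebook):

```python
from itertools import product

# Illustrative names; the values match the version strings handled by rag().
RETRIEVERS = ["text", "vector"]            # 5 full articles vs. 10 chunks
PROMPTS = ["simple", "structured", "big"]  # from simplest to most elaborate

versions = [f"{r}+{p}" for r, p in product(RETRIEVERS, PROMPTS)]
print(versions)
```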

## Evaluation Metrics

To measure the effectiveness of each approach, we'll use the following metrics:
- LLM-as-a-judge (percentage of non-relevant answers)
- ROUGE score (mean, median)
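
As a rough illustration of how these two metrics can be aggregated, the sketch below summarizes a list of per-question results with the standard library. The `summarize` helper and the sample rows are made up for illustration; only the `Relevance` and `ROUGE_1` field names come from this notebook.

```python
from statistics import mean, median

def summarize(rows):
    """Summarize judge labels and ROUGE-1 scores for a list of result dicts."""
    non_relevant = sum(1 for r in rows if r["Relevance"] == "NON_RELEVANT")
    rouge_1 = [r["ROUGE_1"] for r in rows]
    return {
        "pct_non_relevant": 100 * non_relevant / len(rows),
        "rouge_1_mean": mean(rouge_1),
        "rouge_1_median": median(rouge_1),
    }

# Made-up rows, for illustration only
rows = [
    {"Relevance": "RELEVANT", "ROUGE_1": 0.2},
    {"Relevance": "NON_RELEVANT", "ROUGE_1": 0.1},
    {"Relevance": "RELEVANT", "ROUGE_1": 0.3},
    {"Relevance": "PARTLY_RELEVANT", "ROUGE_1": 0.4},
]
summary = summarize(rows)
print(summary)
```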

## Preparation

In [None]:
import json
import random
import re

from tqdm import tqdm
from anthropic import Anthropic
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer
from rouge import Rouge
import pandas as pd

In [None]:
with open('../data/ground-truth.json', 'r') as f_in:
    ground_truth = json.load(f_in)

with open('../data/site_content.json', 'r') as f_in:
    raw_doc = json.load(f_in)

First, start Elasticsearch locally if it is not already running:

```bash
docker run -it \
    --rm \
    --name elasticsearch \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3
```
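
Before creating indices it can help to confirm that the container is actually answering. The following is a small sketch using only the standard library; the `es_is_up` helper is hypothetical, and the URL assumes the port mapping from the `docker run` command above.

```python
import http.client
import urllib.error
import urllib.request

def es_is_up(url="http://localhost:9200", timeout=2.0):
    """Return True if an HTTP GET to `url` answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, or a non-HTTP reply: treat as "not up"
        return False

print("Elasticsearch reachable:", es_is_up())
```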

In [None]:
model_name = "all-MiniLM-L12-v2"
model = SentenceTransformer(model_name)


In [None]:
es_client = Elasticsearch('http://localhost:9200') 

In [None]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "url": {"type": "text"},
            "header": {"type": "text"},
            "main_content": {"type": "text"}
        }
    }
}

index_name = "esearchtext"

es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'esearchtext'})

In [None]:
data = [{'url': k, 'header':v['header'], 'main_content':v['main_content']} for k,v in raw_doc.items()]
for doc in tqdm(data):
    es_client.index(index=index_name, document=doc)

100%|███████████████████████████████████████████████████████████████████████████████| 1030/1030 [00:13<00:00, 77.41it/s]
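
Indexing one document per request is fine for about a thousand articles. For larger corpora, the `elasticsearch.helpers.bulk` function can send documents in batches. The sketch below only builds the bulk actions; `to_bulk_actions` and the sample documents are hypothetical, and the actual `bulk` call is left as a comment because it needs a running cluster.

```python
def to_bulk_actions(docs, index_name):
    """Yield one bulk action per document (hypothetical helper)."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}

# Toy documents with the same fields as the articles indexed above
sample_docs = [
    {"url": "https://example.com/a", "header": "A", "main_content": "first"},
    {"url": "https://example.com/b", "header": "B", "main_content": "second"},
]
actions = list(to_bulk_actions(sample_docs, "esearchtext"))
print(actions[0])

# With a running cluster, the actions could be sent in batches:
# from elasticsearch.helpers import bulk
# bulk(es_client, to_bulk_actions(data, index_name))
```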


In [None]:
def elastic_search_boost(query):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["url", "header", "main_content^3"],
                        "type": "best_fields"
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs

In [None]:

index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "url": {"type": "text"},
            "header": {"type": "text"},
            "main_content": {"type": "text"},
            "main_content_vector": {
                "type": "dense_vector",
                "dims": model.get_sentence_embedding_dimension(),
                "index": True,
                "similarity": "cosine"
            },
            
        }
    }
}

index_name_vector = "esearchvector_chunks_2"

es_client.indices.delete(index=index_name_vector, ignore_unavailable=True)
es_client.indices.create(index=index_name_vector, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'esearchvector_chunks_2'})

In [None]:
def chunk_data(raw_doc, chunk_size=1000, overlap=100):
    def chunk_content(content, chunk_size=1000, overlap=100):
        chunks = []
        start = 0
        while start < len(content):
            end = start + chunk_size
            chunk = content[start:end]
            chunks.append(chunk)
            start = end - overlap
        return chunks

    chunked_data = []
    for k, v in raw_doc.items():
        content_chunks = chunk_content(v['main_content'], chunk_size, overlap)
        for i, chunk in enumerate(content_chunks):
            chunked_data.append({
                'url': k,
                'header': v['header'],
                'main_content': chunk,
                'chunk_index': i
            })
    
    return chunked_data
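
To see what the sliding window produces, here is the same loop on a ten-character string. The `chunk_text` name is illustrative, and the `overlap >= chunk_size` guard is an addition that the original function does not have (without it, such arguments would loop forever).

```python
def chunk_text(content, chunk_size, overlap):
    """Split `content` into windows of `chunk_size` chars sharing `overlap` chars."""
    if overlap >= chunk_size:
        # Added guard (not in the original): the window would never advance
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(content):
        end = start + chunk_size
        chunks.append(content[start:end])
        start = end - overlap
    return chunks

chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij', 'j']
```

Note the final chunk `'j'`: it lies entirely inside the previous window, so the same loop on real articles can emit a short, redundant tail chunk.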

In [None]:
data_chunk = chunk_data(raw_doc)

In [None]:
for doc in tqdm(data_chunk):
    doc['header_vector'] = model.encode(doc['header'])
    doc['main_content_vector'] = model.encode(doc['main_content'])

100%|███████████████████████████████████████████████████████████████████████████████| 2834/2834 [06:32<00:00,  7.22it/s]


In [None]:
for doc in tqdm(data_chunk):
    es_client.index(index=index_name_vector, document=doc)

100%|██████████████████████████████████████████████████████████████████████████████| 2834/2834 [00:24<00:00, 116.01it/s]


In [None]:
def elastic_search_combined_10(query):
    vector = model.encode(query)
    search_query = {
        "_source": ["url", "header", "main_content", "header_vector", "main_content_vector"],
        "query": {
            "bool": {
                "should": [
                    {
                        "multi_match": {
                            "query": query,
                            "fields": ["header", "main_content"],
                            "type": "best_fields",
                            "tie_breaker": 0.3
                        }
                    },
                    {
                        "script_score": {
                            "query": {"match_all": {}},
                            "script": {
                                "source": "cosineSimilarity(params.query_vector, 'main_content_vector') + 1.0",
                                "params": {"query_vector": vector}
                            }
                        }
                    }
                ]
            }
        },
        "size": 10
    }
    
    es_results = es_client.search(
        index=index_name_vector,
        body=search_query
    )
    
    result_docs = [hit['_source'] for hit in es_results['hits']['hits']]
    return result_docs
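
The `+ 1.0` in the script is there because Elasticsearch does not allow negative scores from `script_score`, while cosine similarity ranges from -1 to 1. A plain-Python sketch of the shifted score (the function names are illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def shifted_score(a, b):
    """Same shift as the script above: cosine similarity plus 1.0."""
    return cosine_similarity(a, b) + 1.0

same = shifted_score([1.0, 0.0], [2.0, 0.0])        # identical direction
orthogonal = shifted_score([1.0, 0.0], [0.0, 3.0])  # unrelated direction
opposite = shifted_score([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```

Because both clauses sit in a `bool` `should`, the scores of matching clauses are summed, so the unbounded text score is added to a vector score that stays between 0 and 2.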

In [None]:
clientA = Anthropic()
def llm(prompt):
    response = clientA.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens = 500,
        messages=[
            {
              "role": "user",
              "content": [
                {
                  "type": "text",
                  "text": prompt
                }
              ]
            }
      ]
    )
    return response.content[0].text

def llm_haiku(prompt):
    response = clientA.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens = 500,
        messages=[
            {
              "role": "user",
              "content": [
                {
                  "type": "text",
                  "text": prompt
                }
              ]
            }
      ]
    )
    return response.content[0].text

In [None]:
def format_search_results(search_results: list[dict]) -> str:
    formatted_results = ""
    for result in search_results:
        formatted_results += f"- **{result['header']}**\n  {result['main_content']}\n  URL: {result['url']}\n\n"
    return formatted_results.strip()

## Building the prompt

In [None]:
def build_basic_prompt(query: str, search_results: list[dict]) -> str:
    prompt = f"Question: {query}\n\nContext:\n{format_search_results(search_results)}"
    return prompt

In [None]:
def build_structured_prompt(query: str, search_results: list[dict]) -> str:
    prompt = f"""
Question: {query}

Context:
{format_search_results(search_results)}

Instructions:
1. Analyze the question and identify key points related to New Zealand visas.
2. Review the provided context for relevant information.
3. Formulate a clear, concise answer based on official information.
4. If the question cannot be fully answered with the given context, state this clearly.
5. Use markdown syntax for formatting (**bold** for emphasis, *italics* for titles).
6. Include at least one relevant URL as a reference at the end of your answer.

Please provide your answer below:
"""
    return prompt.strip()



In [None]:
def build_big_prompt(query, search_results):
    prompt_template = f"""
You are an AI assistant specializing in answering questions about New Zealand visas. Your knowledge comes from official New Zealand immigration information. You will be provided with context from relevant articles and a specific question to answer.

First, review the following context:

<context>
{format_search_results(search_results)}
</context>

Process this context carefully. Each item in the context contains a URL, a header, and main content. Use this information to inform your answers, ensuring you provide accurate and up-to-date information about New Zealand visas.

Now, answer the following question:

<question>
{query}
</question>

To answer the question:
1. Analyze the question and identify the key points related to New Zealand visas.
2. Search through the provided context for relevant information.
3. Formulate a clear, concise answer based on the official information.
4. If the question cannot be fully answered with the given context, state this clearly and provide the most relevant information available.

Write your answer using short markdown syntax, as it will be displayed in a Telegram chat. Use **bold** for emphasis and *italics* for titles or important terms.

Always include at least one relevant URL from the context as a reference. Format the URL reference at the end of your answer like this:
[Source](URL)

If multiple sources are used, include them as separate reference links.

Provide your answer within <answer> tags.
""".strip()
    return prompt_template


In [None]:
def extract_basic_answer(llm_response: str) -> str:
    return llm_response.strip()

def extract_structured_answer(llm_response: str) -> str:
    answer_start = llm_response.find("Please provide your answer below:")
    if answer_start != -1:
        return llm_response[answer_start + len("Please provide your answer below:"):].strip()
    else:
        return llm_response.strip()

def extract_big_answer(llm_response: str) -> str:
    match = re.search(r'<answer>(.*?)</answer>', llm_response, re.DOTALL)
    if match:
        return match.group(1).strip()
    else:
        return llm_response.strip()
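
A quick check of the tag-based extraction on two made-up responses, one with `<answer>` tags and one without. The function below is a standalone copy of the same regular-expression logic under an illustrative name.

```python
import re

def extract_tagged_answer(llm_response: str) -> str:
    """Return the text inside <answer>...</answer>, or the whole response."""
    match = re.search(r"<answer>(.*?)</answer>", llm_response, re.DOTALL)
    return match.group(1).strip() if match else llm_response.strip()

# Made-up responses, for illustration only
tagged = "Let me check.\n<answer>\nYou need a **visitor visa**.\n</answer>"
untagged = "  You need a visitor visa.  "
print(extract_tagged_answer(tagged))
print(extract_tagged_answer(untagged))
```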

In [None]:
def rag(query, version):
    if version == "text+simple":
        search = elastic_search_boost
        build_prompt = build_basic_prompt
        extractor = extract_basic_answer
    elif version == "text+structured":
        search = elastic_search_boost
        build_prompt = build_structured_prompt
        extractor = extract_structured_answer
    elif version == "text+big":
        search = elastic_search_boost
        build_prompt = build_big_prompt
        extractor = extract_big_answer
    elif version == "vector+simple":
        search = elastic_search_combined_10
        build_prompt = build_basic_prompt
        extractor = extract_basic_answer
    elif version == "vector+structured":
        search = elastic_search_combined_10
        build_prompt = build_structured_prompt
        extractor = extract_structured_answer
    elif version == "vector+big":
        search = elastic_search_combined_10
        build_prompt = build_big_prompt
        extractor = extract_big_answer
    else:
        raise ValueError(f"Unknown RAG version: {version!r}")

    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return extractor(answer)
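
The version string encodes two independent choices, so the branch chain could also be driven by a lookup. A minimal sketch of the parsing step, assuming the `retriever+prompt` naming used above (`parse_version` and the two sets are hypothetical names):

```python
RETRIEVERS = {"text", "vector"}
PROMPT_STYLES = {"simple", "structured", "big"}

def parse_version(version: str) -> tuple[str, str]:
    """Split 'retriever+prompt' and validate both halves."""
    retriever, sep, style = version.partition("+")
    if not sep or retriever not in RETRIEVERS or style not in PROMPT_STYLES:
        raise ValueError(f"Unknown RAG version: {version!r}")
    return retriever, style

print(parse_version("vector+big"))
```

In `rag`, the two parts could then index two small dictionaries: one mapping the retriever name to a search function, the other mapping the prompt style to a prompt builder and extractor pair.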
    

In [None]:
rag("I am from russia", "text+simple")

'Based on the information provided, since you are from Russia, you would need to apply for a visa before traveling to New Zealand. Russia is not on the list of visa waiver countries, so Russian citizens cannot use an NZeTA for entry and must obtain an appropriate visa in advance. You would need to explore the visa options and apply for the relevant visa type depending on your purpose of travel (e.g. visitor visa, work visa, student visa etc). An NZeTA or visa-free entry is not available for Russian passport holders.'

## Evaluating relevance

In [None]:
def evaluate_relevance(question: str, answer_llm: str) -> dict:
    prompt2_template = """
You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system.
Your task is to analyze the relevance of the generated answer to the given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Question: {question}
Generated Answer: {answer_llm}

Please analyze the content and context of the generated answer in relation to the question
and provide your evaluation in parsable JSON without using code blocks:

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}}
""".strip()

    # Format the prompt with the question and answer
    evaluation_prompt = prompt2_template.format(question=question, answer_llm=answer_llm)

    # Get the evaluation from the LLM
    evaluation_response = llm_haiku(evaluation_prompt)

    # Parse the JSON response
    try:
        evaluation_result = json.loads(evaluation_response)
    except json.JSONDecodeError:
        # If JSON parsing fails, return an error result
        return {
            "Relevance": "ERROR",
            "Explanation": "Failed to parse LLM response as JSON"
        }

    # Validate the structure of the parsed result
    if "Relevance" not in evaluation_result or "Explanation" not in evaluation_result:
        return {
            "Relevance": "ERROR",
            "Explanation": "LLM response does not contain expected fields"
        }

    # Validate the Relevance value
    if evaluation_result["Relevance"] not in ["NON_RELEVANT", "PARTLY_RELEVANT", "RELEVANT"]:
        evaluation_result["Relevance"] = "ERROR"
        evaluation_result["Explanation"] += " (Invalid Relevance value)"

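    rouge = Rouge()
    # Rouge.get_scores(hypothesis, reference): here the question serves as the reference text
    scores = rouge.get_scores(answer_llm, question)
    # Add ROUGE F1 scores to the evaluation result as floats
    evaluation_result["ROUGE_1"] = scores[0]['rouge-1']['f']
    evaluation_result["ROUGE_2"] = scores[0]['rouge-2']['f']
    # Note: the "ROUGE_3" key holds the ROUGE-L F1 score, not a 3-gram score
    evaluation_result["ROUGE_3"] = scores[0]['rouge-l']['f']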
    
    
    
    return evaluation_result
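
To make the ROUGE numbers easier to interpret, here is a simplified ROUGE-1 F1 computed by hand. It is a sketch of the idea (unigram overlap turned into precision, recall, and F1), not the `rouge` package's exact implementation, whose tokenization and counting may differ. The function name and example strings are made up.

```python
from collections import Counter

def rouge_1_f1(hypothesis: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())  # shared words, clipped by count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge_1_f1("you need a visitor visa", "do i need a visa")
print(round(score, 3))
```

Since the question is used as the reference here, a long answer shares only a small fraction of its words with it, which lowers precision; this is one plausible reason the scores stay low.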

In [None]:
version = "text+simple"
relevance_list = []
for i in range(5,8):
    question = ground_truth[i]["question"]
    answer = rag(question, version)
    score = evaluate_relevance(question, answer)
    score["question"] = question
    score["answer"] = answer
    relevance_list.append(score)
    

In [None]:
relevance_df = pd.DataFrame(relevance_list)
relevance_df

Unnamed: 0,Relevance,Explanation,ROUGE_1,ROUGE_2,ROUGE_3,question,answer
0,PARTLY_RELEVANT,The generated answer does not directly suggest...,0.105263,0.011976,0.105263,What does the content suggest users to do first?,"Based on the content provided, there is no cle..."
1,RELEVANT,The generated answer directly addresses the qu...,0.285714,0.083333,0.238095,Where can users navigate back to if needed?,"Based on the context provided, users can navig..."
2,RELEVANT,The generated answer thoroughly addresses the ...,0.170732,0.104167,0.146341,How many ways of interacting with the page are...,"Based on the context provided, there are 5 way..."


In [None]:
def display_stats(relevance_df):
    print("RELEVANCE STATS")
    print(relevance_df.Relevance.value_counts())
    print()
    print("ROUGE")
    print(relevance_df[["ROUGE_1", "ROUGE_2", "ROUGE_3"]].mean())

In [None]:
display_stats(relevance_df)

RELEVANCE STATS
Relevance
RELEVANT           2
PARTLY_RELEVANT    1
Name: count, dtype: int64

ROUGE
ROUGE_1    0.187236
ROUGE_2    0.066492
ROUGE_3    0.163233
dtype: float64


## Evaluating

In [None]:
random.seed(123)
test_subset = random.sample(ground_truth, 50)

In [None]:
def evaluate_rag(examples: list, rag_version: str) -> pd.DataFrame:
    relevance_list = []
    for row in tqdm(examples):
        question = row["question"]
        answer = rag(question, rag_version)
        score = evaluate_relevance(question, answer)
        score["question"] = question
        score["answer"] = answer
        relevance_list.append(score)
    return pd.DataFrame(relevance_list)

In [None]:
scores1 = evaluate_rag(test_subset, "text+simple")

In [None]:
display_stats(scores1)

RELEVANCE STATS
Relevance
RELEVANT           46
PARTLY_RELEVANT     3
NON_RELEVANT        1
Name: count, dtype: int64

ROUGE
ROUGE_1    0.185495
ROUGE_2    0.106079
ROUGE_3    0.172999
dtype: float64


In [None]:
llm = llm_haiku # to reduce cost

In [None]:
scores2 = evaluate_rag(test_subset, "text+structured")
scores2

100%|███████████████████████████████████████████████████████████████████████████████████| 50/50 [03:18<00:00,  3.98s/it]


Unnamed: 0,Relevance,Explanation,ROUGE_1,ROUGE_2,ROUGE_3,question,answer
0,RELEVANT,The generated answer provides a comprehensive ...,0.161491,0.096491,0.136646,What types of evidence can be provided to prov...,To prove ownership of funds and assets for inv...
1,RELEVANT,The generated answer provides a comprehensive ...,0.143885,0.082474,0.143885,What are some examples of actions that would b...,"Based on the context provided, there are a few..."
2,PARTLY_RELEVANT,The generated answer provides some relevant in...,0.116883,0.04386,0.103896,What are the employer's obligations concerning...,Based on the question and the provided context...
3,RELEVANT,The generated answer provides a comprehensive ...,0.2,0.092166,0.185714,What should a visa holder do if their employer...,Here is the answer to your question:\n\nIf a v...
4,NON_RELEVANT,The generated answer correctly states that the...,0.142857,0.086957,0.142857,What organization is mentioned in the text?,The text does not mention any specific organiz...
5,RELEVANT,The generated answer provides a relevant and t...,0.178571,0.075472,0.178571,What types of information can unlicensed indiv...,"Based on the information provided, this questi..."
6,RELEVANT,The generated answer provides a comprehensive ...,0.11976,0.070485,0.095808,What types of evidence should be included when...,When applying to extend a student stay in New ...
7,RELEVANT,The generated answer provides relevant informa...,0.256881,0.152778,0.256881,Where can employers and employees find informa...,"Based on the question, the key points related ..."
8,PARTLY_RELEVANT,The generated answer provides general informat...,0.252252,0.180556,0.252252,Is there a way to return to the homepage from ...,"Based on the context provided, there does not ..."
9,RELEVANT,The generated answer provides a comprehensive ...,0.09396,0.019139,0.09396,What is an eVisa and how does it differ from t...,An **eVisa** is an electronic visa that allows...


In [None]:
scores3 = evaluate_rag(test_subset, "text+big")
scores3

100%|███████████████████████████████████████████████████████████████████████████████████| 50/50 [04:42<00:00,  5.66s/it]


Unnamed: 0,Relevance,Explanation,ROUGE_1,ROUGE_2,ROUGE_3,question,answer
0,RELEVANT,The generated answer provides a comprehensive ...,0.168224,0.0,0.130841,What types of evidence can be provided to prov...,"According to the information provided, when ap..."
1,RELEVANT,The generated answer clearly and directly addr...,0.25,0.169811,0.25,What are some examples of actions that would b...,"According to the official information, some ex..."
2,RELEVANT,The generated answer directly addresses the qu...,0.208333,0.091603,0.166667,What are the employer's obligations concerning...,"According to the information provided, when an..."
3,RELEVANT,The generated answer directly addresses the qu...,0.241758,0.082645,0.197802,What should a visa holder do if their employer...,"According to the information provided, if a vi..."
4,RELEVANT,The generated answer clearly and comprehensive...,0.074766,0.027397,0.074766,What organization is mentioned in the text?,The main organizations mentioned in the provid...
5,RELEVANT,The generated answer provides relevant informa...,0.123711,0.0,0.082474,What types of information can unlicensed indiv...,"According to the information provided, **unlic..."
6,RELEVANT,The generated answer provides a comprehensive ...,0.141593,0.052288,0.106195,What types of evidence should be included when...,"To extend a student stay in New Zealand, you s..."
7,RELEVANT,The generated answer provides comprehensive in...,0.282609,0.178862,0.26087,Where can employers and employees find informa...,"According to the provided context, employers a..."
8,RELEVANT,The generated answer directly addresses the qu...,0.509091,0.444444,0.509091,Is there a way to return to the homepage from ...,"Yes, there is a way to return to the homepage ..."
9,RELEVANT,The generated answer provides a comprehensive ...,0.145455,0.040541,0.145455,What is an eVisa and how does it differ from t...,An **eVisa** is a visa that is issued and reco...


In [None]:
scores4 = evaluate_rag(test_subset, "vector+simple")
scores4

100%|███████████████████████████████████████████████████████████████████████████████████| 50/50 [04:05<00:00,  4.91s/it]


Unnamed: 0,Relevance,Explanation,ROUGE_1,ROUGE_2,ROUGE_3,question,answer
0,RELEVANT,The generated answer comprehensively covers th...,0.183099,0.112676,0.183099,What types of evidence can be provided to prov...,"Based on the information provided, here are so..."
1,RELEVANT,The generated answer provides several specific...,0.192982,0.113924,0.192982,What are some examples of actions that would b...,"Based on the information provided, some exampl..."
2,RELEVANT,The generated answer comprehensively covers th...,0.165289,0.079096,0.14876,What are the employer's obligations concerning...,"Based on the information provided, the key emp..."
3,RELEVANT,The generated answer provides comprehensive an...,0.197183,0.093023,0.183099,What should a visa holder do if their employer...,"Based on the information provided, here are th..."
4,RELEVANT,The generated answer comprehensively lists the...,0.080645,0.0,0.064516,What organization is mentioned in the text?,"Based on the context provided, the main organi..."
5,RELEVANT,The generated answer provides a comprehensive ...,0.15873,0.045714,0.142857,What types of information can unlicensed indiv...,"Based on the information provided, unlicensed ..."
6,RELEVANT,The generated answer provides a comprehensive ...,0.176471,0.096154,0.176471,What types of evidence should be included when...,"Based on the information provided, here are th..."
7,RELEVANT,The generated answer provides detailed and com...,0.247619,0.136646,0.247619,Where can employers and employees find informa...,"Based on the information provided, employers a..."
8,RELEVANT,The generated answer provides a detailed and c...,0.169935,0.103004,0.143791,Is there a way to return to the homepage from ...,"Based on the information provided, it does not..."
9,RELEVANT,The generated answer comprehensively addresses...,0.126984,0.041885,0.126984,What is an eVisa and how does it differ from t...,"Based on the information provided, an eVisa di..."


In [None]:
scores5 = evaluate_rag(test_subset, "vector+structured")
scores5

100%|███████████████████████████████████████████████████████████████████████████████████| 50/50 [03:34<00:00,  4.29s/it]


Unnamed: 0,Relevance,Explanation,ROUGE_1,ROUGE_2,ROUGE_3,question,answer
0,RELEVANT,The generated answer provides a comprehensive ...,0.135593,0.078125,0.112994,What types of evidence can be provided to prov...,To prove ownership of funds and assets for inv...
1,RELEVANT,The generated answer provides several specific...,0.146667,0.081818,0.133333,What are some examples of actions that would b...,"Based on the information provided, some exampl..."
2,RELEVANT,The generated answer covers the key employer o...,0.160584,0.09375,0.160584,What are the employer's obligations concerning...,The key points related to the employer's oblig...
3,RELEVANT,The generated answer provides a comprehensive ...,0.205479,0.101382,0.191781,What should a visa holder do if their employer...,Here is my answer to the question:\n\nIf a vis...
4,NON_RELEVANT,The generated answer does not mention any spec...,0.131579,0.043011,0.131579,What organization is mentioned in the text?,The text does not mention any specific organiz...
5,PARTLY_RELEVANT,The generated answer provides some relevant in...,0.117647,0.037879,0.117647,What types of information can unlicensed indiv...,"Based on the question and context provided, th..."
6,RELEVANT,The generated answer provides a comprehensive ...,0.135802,0.067511,0.135802,What types of evidence should be included when...,When applying to extend a student stay in New ...
7,RELEVANT,The generated answer provides relevant informa...,0.26,0.134328,0.26,Where can employers and employees find informa...,"Based on the question, the key points are:\n\n..."
8,RELEVANT,The generated answer directly addresses the qu...,0.218487,0.15,0.218487,Is there a way to return to the homepage from ...,"Based on the question, the key points are:\n\n..."
9,RELEVANT,The generated answer provides a clear and deta...,0.160584,0.071066,0.160584,What is an eVisa and how does it differ from t...,An **eVisa** is an electronic visa that is sto...


In [None]:
scores6 = evaluate_rag(test_subset, "vector+big")
scores6

In [None]:
display_stats(scores1)

RELEVANCE STATS
Relevance
RELEVANT           46
PARTLY_RELEVANT     3
NON_RELEVANT        1
Name: count, dtype: int64

ROUGE
ROUGE_1    0.185495
ROUGE_2    0.106079
ROUGE_3    0.172999
dtype: float64


In [None]:
display_stats(scores2)

RELEVANCE STATS
Relevance
RELEVANT           26
PARTLY_RELEVANT    18
NON_RELEVANT        6
Name: count, dtype: int64

ROUGE
ROUGE_1    0.163282
ROUGE_2    0.085720
ROUGE_3    0.154784
dtype: float64


In [None]:
display_stats(scores3)

RELEVANCE STATS
Relevance
RELEVANT    50
Name: count, dtype: int64

ROUGE
ROUGE_1    0.217819
ROUGE_2    0.116029
ROUGE_3    0.199039
dtype: float64


In [None]:
display_stats(scores4)

RELEVANCE STATS
Relevance
RELEVANT           48
PARTLY_RELEVANT     2
Name: count, dtype: int64

ROUGE
ROUGE_1    0.172731
ROUGE_2    0.086049
ROUGE_3    0.164021
dtype: float64


In [None]:
display_stats(scores5)

RELEVANCE STATS
Relevance
RELEVANT           27
PARTLY_RELEVANT    19
NON_RELEVANT        4
Name: count, dtype: int64

ROUGE
ROUGE_1    0.155240
ROUGE_2    0.078514
ROUGE_3    0.147519
dtype: float64


In [None]:
display_stats(scores6)

RELEVANCE STATS
Relevance
RELEVANT           48
PARTLY_RELEVANT     2
Name: count, dtype: int64

ROUGE
ROUGE_1    0.195728
ROUGE_2    0.091203
ROUGE_3    0.175914
dtype: float64


| Version | RELEVANT | PARTLY_RELEVANT | NON_RELEVANT | ROUGE_1 | ROUGE_2 | ROUGE_3 |
|---------|----------|-----------------|--------------|---------|---------|---------|
| scores1 ("text+simple") | 46 | 3 | 1 | 0.185495 | 0.106079 | 0.172999 |
| scores2 ("text+structured") | 26 | 18 | 6 | 0.163282 | 0.085720 | 0.154784 |
| scores3 ("text+big") | 50 | 0 | 0 | 0.217819 | 0.116029 | 0.199039 |
| scores4 ("vector+simple") | 48 | 2 | 0 | 0.172731 | 0.086049 | 0.164021 |
| scores5 ("vector+structured") | 27 | 19 | 4 | 0.155240 | 0.078514 | 0.147519 |
| scores6 ("vector+big") | 48 | 2 | 0 | 0.195728 | 0.091203 | 0.175914 |
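
One way to turn the table into a ranking is to sort by the share of `RELEVANT` answers and break ties with the mean ROUGE-1. The tie-break rule is an assumption made for this sketch, not a rule stated in the notebook; the numbers are copied from the table above.

```python
# Counts and ROUGE-1 means copied from the summary table above
results = {
    "text+simple":       {"relevant": 46, "partly": 3,  "non": 1, "rouge_1": 0.185495},
    "text+structured":   {"relevant": 26, "partly": 18, "non": 6, "rouge_1": 0.163282},
    "text+big":          {"relevant": 50, "partly": 0,  "non": 0, "rouge_1": 0.217819},
    "vector+simple":     {"relevant": 48, "partly": 2,  "non": 0, "rouge_1": 0.172731},
    "vector+structured": {"relevant": 27, "partly": 19, "non": 4, "rouge_1": 0.155240},
    "vector+big":        {"relevant": 48, "partly": 2,  "non": 0, "rouge_1": 0.195728},
}

def pct_relevant(row):
    """Percentage of answers judged RELEVANT."""
    total = row["relevant"] + row["partly"] + row["non"]
    return 100 * row["relevant"] / total

# Sort by relevance share first, then by mean ROUGE-1 (assumed tie-break)
ranking = sorted(
    results,
    key=lambda v: (pct_relevant(results[v]), results[v]["rouge_1"]),
    reverse=True,
)
for version in ranking:
    row = results[version]
    print(f"{version:18s} {pct_relevant(row):5.1f}%  ROUGE-1={row['rouge_1']:.3f}")
```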

## Evaluate token usage

In [None]:
def llm_haiku_tokens(prompt):
    response = clientA.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=500,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": prompt
                    }
                ]
            }
        ]
    )
    return response.content[0].text, response.usage.input_tokens + response.usage.output_tokens

llm = llm_haiku_tokens

In [None]:
def evaluate_relevance(question: str, answer_llm: str) -> dict:
    prompt_template = """
Evaluate answer relevance to question. Classify as NON_RELEVANT, PARTLY_RELEVANT, or RELEVANT.
Q: {q}
A: {a}
Respond with JSON:
{{"Relevance": "NON_RELEVANT"|"PARTLY_RELEVANT"|"RELEVANT", "Explanation": "Brief reason"}}
""".strip()

    evaluation_prompt = prompt_template.format(q=question, a=answer_llm)
    # llm_haiku returns only text; the token-aware variant returns (text, token_count)
    evaluation_response, token_count = llm_haiku_tokens(evaluation_prompt)

    try:
        evaluation_result = json.loads(evaluation_response)
        if not all(key in evaluation_result for key in ["Relevance", "Explanation"]):
            raise ValueError("Missing expected fields")
        if evaluation_result["Relevance"] not in ["NON_RELEVANT", "PARTLY_RELEVANT", "RELEVANT"]:
            raise ValueError("Invalid Relevance value")
    except (json.JSONDecodeError, ValueError) as e:
        return {
            "Relevance": "ERROR",
            "Explanation": f"Error processing LLM response: {str(e)}",
            "TokensUsed": token_count
        }

    rouge = Rouge()
    scores = rouge.get_scores(answer_llm, question)[0]
    
    evaluation_result.update({
        "ROUGE_1": scores['rouge-1']['f'],
        "ROUGE_2": scores['rouge-2']['f'],
        "ROUGE_L": scores['rouge-l']['f'],
        "TokensUsed": token_count
    })

    return evaluation_result
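
`json.loads` fails whenever the judge wraps its JSON in extra text, and such responses end up as `ERROR` rows. A possible mitigation, sketched under the assumption that the response contains a single JSON object, is to cut out the span between the first `{` and the last `}` before parsing. The `parse_judge_json` helper and the sample response are hypothetical.

```python
import json
import re

def parse_judge_json(text: str) -> dict:
    """Parse the span from the first '{' to the last '}' in `text` as JSON."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found")
    # json.JSONDecodeError is a subclass of ValueError
    return json.loads(match.group(0))

# Made-up judge response with extra text around the JSON
raw = (
    "Here is my evaluation:\n"
    '{"Relevance": "RELEVANT", "Explanation": "On topic"}\n'
    "Hope this helps."
)
parsed = parse_judge_json(raw)
print(parsed["Relevance"])
```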

In [None]:
def rag(query, version):
    if version == "text+simple":
        search = elastic_search_boost
        build_prompt = build_basic_prompt
        extractor = extract_basic_answer
    elif version == "text+structured":
        search = elastic_search_boost
        build_prompt = build_structured_prompt
        extractor = extract_structured_answer
    elif version == "text+big":
        search = elastic_search_boost
        build_prompt = build_big_prompt
        extractor = extract_big_answer
    elif version == "vector+simple":
        search = elastic_search_combined_10
        build_prompt = build_basic_prompt
        extractor = extract_basic_answer
    elif version == "vector+structured":
        search = elastic_search_combined_10
        build_prompt = build_structured_prompt
        extractor = extract_structured_answer
    elif version == "vector+big":
        search = elastic_search_combined_10
        build_prompt = build_big_prompt
        extractor = extract_big_answer
    else:
        raise ValueError(f"Unknown RAG version: {version!r}")

    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer, tokens = llm(prompt)
    return extractor(answer), tokens

In [None]:
def evaluate_rag(examples: list, rag_version: str) -> pd.DataFrame:
    relevance_list = []
    for row in tqdm(examples):
        question = row["question"]
        answer, tokens = rag(question, rag_version)
        score = evaluate_relevance(question, answer)
        score["question"] = question
        score["answer"] = answer
        score["tokens"] = tokens
        relevance_list.append(score)
    return pd.DataFrame(relevance_list)

In [None]:
test_subset = random.sample(ground_truth, 10)
scores3_token = evaluate_rag(test_subset, "text+big")
scores3_token

100%|███████████████████████████████████████████████████████████████████████████████████| 10/10 [00:37<00:00,  3.76s/it]


Unnamed: 0,Relevance,Explanation,ROUGE_1,ROUGE_2,ROUGE_L,TokensUsed,question,answer,tokens
0,RELEVANT,The answer provides detailed information on wh...,0.222222,0.100629,0.205128,443,Where can applicants find information about th...,"According to the information provided, applica...",4742
1,RELEVANT,The answer provides two specific steps a user ...,0.322581,0.219178,0.322581,300,How can someone get back to the main page of I...,Based on the information provided in the conte...,4577
2,RELEVANT,The answer directly addresses the question by ...,0.325581,0.235294,0.325581,223,Who certifies all immigration instructions and...,"According to the information provided, the **M...",3508
3,RELEVANT,The answer directly and accurately identifies ...,0.177215,0.122449,0.177215,366,What are the two main components of the RSE ap...,The two main components of the RSE application...,4120
4,RELEVANT,The answer provides detailed and relevant step...,0.150943,0.062112,0.132075,499,How can you transfer your visa to a new passport?,To transfer your valid New Zealand visa to a n...,5431
5,RELEVANT,The answer directly addresses the question by ...,0.216216,0.175824,0.216216,315,When does the selection of expressions of inte...,"According to the information provided, the sel...",3929
6,PARTLY_RELEVANT,The answer provides information about what hap...,0.16,0.0,0.16,257,What will happen after you state your language?,If you call Immigration New Zealand and do not...,7803
7,RELEVANT,The answer provides the relevant information t...,0.363636,0.1875,0.327273,314,How can visitors return to the main page of Im...,Based on the information provided in the conte...,3860
8,RELEVANT,The answer provides detailed information about...,0.165414,0.078212,0.150376,437,How have the employment agreements for RSE wor...,"As of October 1, 2023, the employment agreemen...",7875
9,RELEVANT,The answer directly addresses the steps to req...,0.225,0.056604,0.2,337,What should you do to request a refund if you ...,"According to the information provided, if you ...",5251


In [None]:
scores6_token = evaluate_rag(test_subset, "vector+big")
scores6_token

100%|███████████████████████████████████████████████████████████████████████████████████| 10/10 [00:47<00:00,  4.74s/it]


Unnamed: 0,Relevance,Explanation,ROUGE_1,ROUGE_2,ROUGE_L,TokensUsed,question,answer,tokens
0,RELEVANT,The answer directly addresses the question by ...,0.303797,0.206186,0.303797,381,Where can applicants find information about th...,Applicants can find information about the **ac...,3504
1,RELEVANT,The answer provides two clear and direct metho...,0.315789,0.197183,0.315789,335,How can someone get back to the main page of I...,To get back to the main page of the Immigratio...,3161
2,RELEVANT,The answer directly addresses the question by ...,0.186667,0.116505,0.186667,340,Who certifies all immigration instructions and...,"According to the information provided, **the M...",2847
3,RELEVANT,The answer clearly identifies the two main com...,0.186667,0.118812,0.186667,380,What are the two main components of the RSE ap...,The two main components of the RSE application...,3887
4,RELEVANT,The answer provides detailed steps on how to t...,0.164948,0.065789,0.14433,481,How can you transfer your visa to a new passport?,To transfer your New Zealand visa to a new pas...,3970
5,RELEVANT,The answer directly addresses the question by ...,0.301887,0.266667,0.301887,294,When does the selection of expressions of inte...,"According to the information provided, the sel...",3624
6,PARTLY_RELEVANT,The answer provides information about the Engl...,0.068182,0.0,0.068182,342,What will happen after you state your language?,After you provide information about your langu...,3125
7,RELEVANT,The answer directly addresses how visitors can...,0.344828,0.181818,0.275862,253,How can visitors return to the main page of Im...,To return to the main page of the Immigration ...,3515
8,RELEVANT,The answer provides detailed information on th...,0.148649,0.069652,0.135135,544,How have the employment agreements for RSE wor...,"As of October 1, 2023, the employment agreemen...",4834
9,RELEVANT,The answer provides clear and detailed instruc...,0.211765,0.073394,0.164706,466,What should you do to request a refund if you ...,According to the official immigration informat...,3603


In [None]:
scores3_token.tokens.mean()

5109.6

In [None]:
scores6_token.tokens.mean()

3607.0
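
To put the two means side by side, the arithmetic can be reproduced from the `tokens` columns printed above. This is a small sketch that copies the per-question values by hand; the variable names are illustrative and not part of the notebook.

```python
from statistics import mean

# Per-question token counts copied from the two result tables above
tokens_text_big = [4742, 4577, 3508, 4120, 5431, 3929, 7803, 3860, 7875, 5251]
tokens_vector_big = [3504, 3161, 2847, 3887, 3970, 3624, 3125, 3515, 4834, 3603]

mean_text = mean(tokens_text_big)
mean_vector = mean(tokens_vector_big)

# Relative overhead of text+big compared with vector+big
overhead = mean_text / mean_vector - 1

# Both runs used the same test_subset, so rows can be compared pairwise
text_always_higher = all(t > v for t, v in zip(tokens_text_big, tokens_vector_big))

print(f"text+big: {mean_text:.1f}, vector+big: {mean_vector:.1f}, overhead: {overhead:.1%}")
print(f"text+big used more tokens on every question: {text_always_higher}")
```

On this 10-question subset, text+big costs roughly 42% more tokens per answer than vector+big.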

## Conclusion

1. "Big" prompt versions (text+big and vector+big) performed best overall.
2. Text+big achieved highest relevance (100%) and ROUGE scores.
3. Vector-based approaches showed strong performance.
4. Structured prompts unexpectedly underperformed.
5. Comprehensive prompts led to better relevance and content quality.
6. Token usage differs noticeably: on the 10-question subset, text+big averaged 5109.6 tokens per answer versus 3607.0 for vector+big (about 42% more).
   
The evaluation suggests that using more detailed prompts, especially with text-based retrieval, is most effective for this RAG system.