# Choosing the Best Model for Wiki-Helper

Choosing the best model for a service is not an easy task, because several parameters have to be weighed against each other. For my case, I will look at three dimensions:

* **Quality**
* **Time per Iteration**
* **Model's size**



The last two parameters (time per iteration and model size) are relatively straightforward to measure. Evaluating quality, however, requires a more specific methodology.
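
Both of the simple measurements need nothing beyond the standard library. The sketch below is illustrative only: `file_size_gib` and `timed_call` are hypothetical helper names, not part of Wiki-Helper.

```python
import time
from pathlib import Path
from typing import Any, Callable, Tuple


def file_size_gib(path: Path) -> float:
    """Return the on-disk size of a model file in GiB."""
    return path.stat().st_size / 1024**3


def timed_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed seconds) measured with a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

With a loaded system, a single measurement would look like `answer, seconds = timed_call(system.answer, question)`.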

### How I will measure quality:

Since my downstream task involves **Retrieval-Augmented Generation (RAG)** for Wikipedia articles, it makes sense to use the classical **SQuAD Dataset**. My approach will be as follows:

1. The system will answer questions based on the provided Wikipedia article.
2. I will then compute the **BERTScore** between the predicted answer and the ground truth answer.

Given that the answers in SQuAD are quite specific, using BERTScore will allow me to assess whether my system captures the correct semantics, even if there is some variation in the exact wording.
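
To make the metric less of a black box, here is a minimal NumPy sketch of the greedy-matching step at the core of BERTScore. It assumes token embeddings are already available (the toy vectors are made up). The real metric obtains them from a pretrained transformer and can additionally apply IDF weighting and baseline rescaling, both of which are omitted here.

```python
import numpy as np


def greedy_match_f1(cand: np.ndarray, ref: np.ndarray) -> tuple[float, float, float]:
    """Precision, recall and F1 from token embeddings via greedy cosine matching.

    cand has shape (n_candidate_tokens, dim), ref has shape (n_reference_tokens, dim).
    """
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    sim = cand @ ref.T  # pairwise cosine similarities

    precision = sim.max(axis=1).mean()  # each candidate token -> best reference token
    recall = sim.max(axis=0).mean()  # each reference token -> best candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f1)


# Toy 2-d "embeddings": the candidate contains one extra, unrelated token.
reference = np.array([[1.0, 0.0]])
candidate = np.array([[1.0, 0.0], [0.0, 1.0]])
p, r, f1 = greedy_match_f1(candidate, reference)
```

In this toy case the extra token lowers precision while recall stays perfect, which is how a correct but wordy answer is penalized against a short SQuAD ground truth.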

### Models for Evaluation

I have selected three models for evaluation, all of which are small ("mini") models. While some have more parameters than others, they are all distributed in the **GGUF** file format and share the same quantization type, **Q6_K**. Using one quantization type for every model keeps the comparison consistent: each model loses precision in the same way, so differences in quality or speed are not an artifact of different quantization levels. Here's a bit more detail on the models and the quantization process:

##### Quantization: Q6_K (GGUF Format)
All models have been quantized with the **Q6_K** type, a roughly 6-bit quantization that provides a balance between output quality and memory efficiency.
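
As a back-of-the-envelope check on memory use: in llama.cpp's Q6_K layout, each super-block of 256 weights occupies 210 bytes, which is about 6.56 bits per weight. The sketch below turns that into a size estimate. It assumes every tensor is stored as Q6_K, which real GGUF files do not strictly do, and it uses the nominal parameter counts listed in this notebook, so the results are approximations rather than measured file sizes.

```python
Q6_K_BLOCK_WEIGHTS = 256  # weights per super-block
Q6_K_BLOCK_BYTES = 210  # 128 (low bits) + 64 (high bits) + 16 (scales) + 2 (fp16 scale)


def estimate_q6k_size_gib(n_params: float) -> float:
    """Approximate size in GiB if all n_params weights were stored as Q6_K."""
    n_bytes = n_params / Q6_K_BLOCK_WEIGHTS * Q6_K_BLOCK_BYTES
    return n_bytes / 1024**3


# Nominal parameter counts; assumption: no tensors kept at higher precision.
for name, n_params in [
    ("Gemma-2-2B", 2e9),
    ("Llama-3.2-3B-Instruct", 3.2e9),
    ("Nemotron-Mini-4B-Instruct", 4e9),
]:
    print(f"{name}: ~{estimate_q6k_size_gib(n_params):.2f} GiB")
```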

### Models Overview

1. **[Gemma-2-2B](https://huggingface.co/fedric95/gemma-2-2b-GGUF)**
    - **Creator**: Google
    - **Quantization**: Q6_K (GGUF)
    - **Parameters**: 2 Billion

2. **[Llama-3.2-3B-Instruct](https://huggingface.co/unsloth/Llama-3.2-3B-Instruct-GGUF)**
    - **Creator**: Meta
    - **Quantization**: Q6_K (GGUF)
    - **Parameters**: 3.2 Billion

3. **[Nemotron-Mini-4B-Instruct](https://huggingface.co/bartowski/Nemotron-Mini-4B-Instruct-GGUF)**
    - **Creator**: NVIDIA
    - **Quantization**: Q6_K (GGUF)
    - **Parameters**: 4 Billion

---






## Table of Contents
1. [Dataset](#dataset)
2. [Gemma-2-it-2b](#gemma-2-it-2b)
3. [Llama-3.2-3B-Instruct](#llama-32-3b-instruct)
4. [Nemotron-Mini-4B-Instruct](#nemotron-mini-4b-instruct)
5. [Evaluation](#evaluation)  
  a. [BERTscore](#bertscore)  
  b. [Duration](#duration)
6. [Conclusion](#conclusion)

---


# Dataset

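SQuAD v2 consists of Wikipedia articles with corresponding questions and answers. Unlike SQuAD v1.1, it also contains questions that cannot be answered from the article; those have an empty answer list and are filtered out below.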

#### Sampling Method

1. **Randomly Select 30 Unique Article Titles**  
   From the dataset, 30 distinct article titles are randomly selected to ensure diversity in the sampled data.

2. **Sample 10 Entries Per Title**  
   For each selected title, 10 random question-answer pairs are sampled. If fewer than 10 entries exist for a title, all available entries are taken.

3. **Ensure Non-Empty Answers**  
   Only entries with non-empty answers are considered for sampling. This filter is applied before the titles are drawn, so unanswerable questions never enter the evaluation data.

4. **Final Output**  
   This results in up to 300 samples (10 entries from each of the 30 selected article titles), which are saved to a CSV file for the evaluation runs.


In [7]:
import pandas as pd
from datasets import load_dataset

dataset = load_dataset("squad_v2", split="validation")

df = pd.DataFrame(dataset)

filtered_df = df[df["answers"].apply(lambda x: len(x["text"]) > 0)]

filtered_df.size  # number of cells (rows x 5 columns), i.e. 5928 rows

29640

In [8]:
random_titles = filtered_df["title"].drop_duplicates().sample(n=30, random_state=42)

samples = []

for title in random_titles:
    group = filtered_df[filtered_df["title"] == title]

    sampled_group = group.sample(n=min(10, len(group)), random_state=42)
    samples.append(sampled_group)

sampled_df = pd.concat(samples)

sampled_df.reset_index(drop=True, inplace=True)

sampled_df.to_csv("data/sampled_squad_v2_eval.csv", index=False)

In [9]:
sampled_df.head()

Unnamed: 0,id,title,context,question,answers
0,57293e221d046914007791d7,Intergovernmental_Panel_on_Climate_Change,The executive summary of the WG I Summary for ...,How much of the greenhouse effect is due to ca...,"{'text': ['over half', 'over half', 'over half..."
1,57294279af94a219006aa20a,Intergovernmental_Panel_on_Climate_Change,These studies were widely presented as demonst...,Who led the Science and Environmental Policy P...,"{'text': ['Fred Singer', 'Fred Singer', 'Fred ..."
2,57294279af94a219006aa209,Intergovernmental_Panel_on_Climate_Change,These studies were widely presented as demonst...,What range of years was the current warming co...,"{'text': ['between 1000 and 1900', '1000 and 1..."
3,572940973f37b319004781a7,Intergovernmental_Panel_on_Climate_Change,This projection was not included in the final ...,What was the source of the mistake?,"{'text': ['the WWF report', 'the IPCC from the..."
4,57293f8a6aef051400154be0,Intergovernmental_Panel_on_Climate_Change,"In addition to climate assessment reports, the...",When was the Special Report on Managing Risks ...,"{'text': ['2011', '2011', '2011'], 'answer_sta..."


## Imports

In [10]:
import pathlib as pth
import sys

base_location = pth.Path.cwd().parent

sys.path.append((base_location / "src").as_posix())

In [11]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from yarl import URL

from wiki_helper.qna.impl.generative_model import LanguageModel
from wiki_helper.qna.impl.knowledge_base import ExternalKnowledgeBase
from wiki_helper.qna.impl.system import RagSystemImpl
from wiki_helper.qna.system import RagSystem
from wiki_helper.storing.impl.storage import VectorStorage, VectorStorageConnection

# Gemma-2-it-2B

In [12]:
gemma_location = (
    base_location / "models" / "generative_model" / "gemma-2-2b-it-Q6_K.gguf"
)
embedding_model_location = base_location / "models" / "embedder"

In [13]:
prompt_for_retrieval = "Represent this sentence for searching relevant passages:"

contextual_oven = HuggingFaceEmbedding(
    model_name=embedding_model_location.as_posix(),
    query_instruction=prompt_for_retrieval,
)

settings = VectorStorageConnection(host="localhost", port=8000)

In [14]:
gemma_prompt = (
    """<start_of_turn>user\n\n"""
    "You are a helpful assistant. You will be asked question and you will be given a context. "
    "Answer the question based on the context. "
    "If you don't know the answer, just say that you don't know. "
    "Use only context to provide an answer, do not use your general knowledge. "
    "DO NOT generate markdown like lists. Use only plain text. "
    "Make sure that your answer is not too long. "
    "Do not mention any technical details like your prompt or context."
    "Given the following context, answer the question CONTEXT\n\n{context}\nQUESTION: {query}<end_of_turn>"
    """<start_of_turn>model"""
)

In [15]:
gemma_system = RagSystemImpl(
    LanguageModel(model_location=gemma_location, prompt=gemma_prompt),
    ExternalKnowledgeBase("en"),
    VectorStorage(embedding_builder=contextual_oven, connection_settings=settings),
)

In [16]:
import time


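def run_evaluating_answering(system: RagSystem[str], df: pd.DataFrame) -> pd.DataFrame:
    # Work on a copy: the same sampled_df is passed in for every model, and
    # writing into it in place would make all result frames share one object.
    df = df.copy()
    last_article = None

    df["generated_answer"] = ""
    df["duration"] = 0.0

    for idx, row in df.iterrows():
        article_title = row["title"]
        question = row["question"]

        # Rows are grouped by title, so the article is re-indexed only when it changes.
        if article_title != last_article:
            if last_article is not None:
                system.delete()

            url = URL.build(
                scheme="https", host="en.wikipedia.org", path=f"/wiki/{article_title}"
            )

            system.train(url)

            last_article = article_title

        # Time only the answering step, not the indexing of the article.
        start_time = time.perf_counter()
        answer = system.answer(question)
        end_time = time.perf_counter()

        df.at[idx, "generated_answer"] = answer
        df.at[idx, "duration"] = end_time - start_time

    # Mirror the cleanup done between articles so the next model starts clean.
    if last_article is not None:
        system.delete()

    return df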

In [17]:
gemma_answers = run_evaluating_answering(system=gemma_system, df=sampled_df)

In [18]:
gemma_answers

Unnamed: 0,id,title,context,question,answers,generated_answer,duration
0,57293e221d046914007791d7,Intergovernmental_Panel_on_Climate_Change,The executive summary of the WG I Summary for ...,How much of the greenhouse effect is due to ca...,"{'text': ['over half', 'over half', 'over half...",This document doesn't specify the amount of gr...,7.950825
1,57294279af94a219006aa20a,Intergovernmental_Panel_on_Climate_Change,These studies were widely presented as demonst...,Who led the Science and Environmental Policy P...,"{'text': ['Fred Singer', 'Fred Singer', 'Fred ...",This document does not contain the answer to t...,7.263251
2,57294279af94a219006aa209,Intergovernmental_Panel_on_Climate_Change,These studies were widely presented as demonst...,What range of years was the current warming co...,"{'text': ['between 1000 and 1900', '1000 and 1...",The warming that is being measured started in ...,4.857468
3,572940973f37b319004781a7,Intergovernmental_Panel_on_Climate_Change,This projection was not included in the final ...,What was the source of the mistake?,"{'text': ['the WWF report', 'the IPCC from the...",The source of the mistake was the introduction...,2.728932
4,57293f8a6aef051400154be0,Intergovernmental_Panel_on_Climate_Change,"In addition to climate assessment reports, the...",When was the Special Report on Managing Risks ...,"{'text': ['2011', '2011', '2011'], 'answer_sta...",: Later in 2011. \n,6.064868
...,...,...,...,...,...,...,...
295,571156152419e3140095559c,Steam_engine,The historical measure of a steam engine's ene...,Who invented the notion of a steam engine's duty?,"{'text': ['Watt', 'Watt', 'Watt'], 'answer_sta...",> Watt \n,5.164139
296,571161092419e314009555d7,Steam_engine,It is possible to use a mechanism based on a p...,What is an example of a rotary engine without ...,"{'text': ['Wankel', 'Wankel', 'the Wankel engi...",The Wankel engine. \n,5.542326
297,57114b1a2419e31400955575,Steam_engine,An oscillating cylinder steam engine is a vari...,What type of steam engine doesn't need valves ...,"{'text': ['oscillating cylinder', 'oscillating...",It's an oscillating cylinder steam engine. \n,6.248563
298,5711658e50c2381900b54ad9,Steam_engine,Land-based steam engines could exhaust much of...,In what year was HMS Dreadnought launched?,"{'text': ['1905', '1905', '1905'], 'answer_sta...",1905 \n,4.462004


In [20]:
gemma_answers.to_csv("data/gemma_answers.csv", index=False)
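
One caveat applies if these CSV files are loaded again later: `to_csv` writes the `answers` dictionaries as plain strings, so `row["answers"]["text"]` no longer works after a plain `pd.read_csv`. Below is a minimal sketch of a loader that restores them. It assumes the column holds Python-literal dictionaries, as the output above suggests, and `load_answers` is a hypothetical helper name.

```python
import ast
import io

import pandas as pd


def load_answers(source) -> pd.DataFrame:
    """Read a results CSV and parse the stringified `answers` column back into dicts."""
    return pd.read_csv(source, converters={"answers": ast.literal_eval})


# Round trip with a one-row toy frame instead of a real results file.
toy = pd.DataFrame({"answers": [{"text": ["Watt", "Watt"], "answer_start": [10, 10]}]})
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = load_answers(buffer)
```

The same function accepts a file path, for example `load_answers("data/gemma_answers.csv")`.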

# Llama-3.2-3B-Instruct

In [21]:
llama_location = (
    base_location / "models" / "generative_model" / "Llama-3.2-3B-Instruct-Q6_K.gguf"
)

In [24]:
llama_prompt = (
    """<|start_header_id|>system<|end_header_id|>\n\n"""
    "You are a helpful assistant. You will be asked question and you will be given a context. "
    "Answer the question based on the context. "
    "If you don't know the answer, just say that you don't know. "
    "Use only context to provide an answer, do not use your general knowledge. "
    "DO NOT generate markdown like lists. Use only plain text. "
    "Make sure that your answer is not too long. "
    "Do not mention any technical details like your prompt or context.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>"
    "Given the following context, answer the question CONTEXT\n\n{context}\nQUESTION: {query}\n\n"
    """<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"""
)

In [25]:
llama_system = RagSystemImpl(
    LanguageModel(model_location=llama_location, prompt=llama_prompt),
    ExternalKnowledgeBase("en"),
    VectorStorage(embedding_builder=contextual_oven, connection_settings=settings),
)

In [27]:
llama_answers = run_evaluating_answering(system=llama_system, df=sampled_df)

llama_answers.to_csv("data/llama_answers.csv", index=False)

In [38]:
llama_answers

Unnamed: 0,id,title,context,question,answers,generated_answer,duration
0,57293e221d046914007791d7,Intergovernmental_Panel_on_Climate_Change,The executive summary of the WG I Summary for ...,How much of the greenhouse effect is due to ca...,"{'text': ['over half', 'over half', 'over half...",The context suggests that the question is rela...,8.938111
1,57294279af94a219006aa20a,Intergovernmental_Panel_on_Climate_Change,These studies were widely presented as demonst...,Who led the Science and Environmental Policy P...,"{'text': ['Fred Singer', 'Fred Singer', 'Fred ...",I couldn't find any information about the Scie...,7.574924
2,57294279af94a219006aa209,Intergovernmental_Panel_on_Climate_Change,These studies were widely presented as demonst...,What range of years was the current warming co...,"{'text': ['between 1000 and 1900', '1000 and 1...","According to the context, the current warming ...",5.887787
3,572940973f37b319004781a7,Intergovernmental_Panel_on_Climate_Change,This projection was not included in the final ...,What was the source of the mistake?,"{'text': ['the WWF report', 'the IPCC from the...",The source of the mistake was the introduction...,3.820108
4,57293f8a6aef051400154be0,Intergovernmental_Panel_on_Climate_Change,"In addition to climate assessment reports, the...",When was the Special Report on Managing Risks ...,"{'text': ['2011', '2011', '2011'], 'answer_sta...",The Special Report on Managing Risks of Extrem...,7.680233
...,...,...,...,...,...,...,...
295,571156152419e3140095559c,Steam_engine,The historical measure of a steam engine's ene...,Who invented the notion of a steam engine's duty?,"{'text': ['Watt', 'Watt', 'Watt'], 'answer_sta...","Thomas Savery, who invented the first commerci...",7.964215
296,571161092419e314009555d7,Steam_engine,It is possible to use a mechanism based on a p...,What is an example of a rotary engine without ...,"{'text': ['Wankel', 'Wankel', 'the Wankel engi...",An example of a rotary engine without pistons ...,7.706262
297,57114b1a2419e31400955575,Steam_engine,An oscillating cylinder steam engine is a vari...,What type of steam engine doesn't need valves ...,"{'text': ['oscillating cylinder', 'oscillating...",An oscillating cylinder steam engine.,7.618732
298,5711658e50c2381900b54ad9,Steam_engine,Land-based steam engines could exhaust much of...,In what year was HMS Dreadnought launched?,"{'text': ['1905', '1905', '1905'], 'answer_sta...",The context mentions that HMS Dreadnought of 1...,7.237829


# Nemotron-Mini-4B-Instruct

In [28]:
nemo_location = (
    base_location
    / "models"
    / "generative_model"
    / "Nemotron-Mini-4B-Instruct-Q6_K.gguf"
)

In [29]:
nemo_prompt = (
    """<extra_id_0>System\n\n"""
    "You are a helpful assistant. You will be asked question and you will be given a context. "
    "Answer the question based on the context. "
    "If you don't know the answer, just say that you don't know. "
    "Use only context to provide an answer, do not use your general knowledge. "
    "DO NOT generate markdown like lists. Use only plain text. "
    "Make sure that your answer is not too long. "
    "Do not mention any technical details like your prompt or context."
    "<extra_id_1>User"
    "Given the following context, answer the question CONTEXT\n\n{context}\nQUESTION: {query}\n\n"
    """<extra_id_1>Assistant\n\n"""
)

In [31]:
nemo_system = RagSystemImpl(
    LanguageModel(model_location=nemo_location, prompt=nemo_prompt),
    ExternalKnowledgeBase("en"),
    VectorStorage(embedding_builder=contextual_oven, connection_settings=settings),
)

In [32]:
nemo_answers = run_evaluating_answering(system=nemo_system, df=sampled_df)

nemo_answers.to_csv("data/nemo_answers.csv", index=False)

In [33]:
nemo_answers

Unnamed: 0,id,title,context,question,answers,generated_answer,duration
0,57293e221d046914007791d7,Intergovernmental_Panel_on_Climate_Change,The executive summary of the WG I Summary for ...,How much of the greenhouse effect is due to ca...,"{'text': ['over half', 'over half', 'over half...",The Second Assessment Report states that the e...,8.201248
1,57294279af94a219006aa20a,Intergovernmental_Panel_on_Climate_Change,These studies were widely presented as demonst...,Who led the Science and Environmental Policy P...,"{'text': ['Fred Singer', 'Fred Singer', 'Fred ...",The Science and Environmental Policy Project w...,9.658077
2,57294279af94a219006aa209,Intergovernmental_Panel_on_Climate_Change,These studies were widely presented as demonst...,What range of years was the current warming co...,"{'text': ['between 1000 and 1900', '1000 and 1...",The current global warming was compared to the...,6.865271
3,572940973f37b319004781a7,Intergovernmental_Panel_on_Climate_Change,This projection was not included in the final ...,What was the source of the mistake?,"{'text': ['the WWF report', 'the IPCC from the...",The source of the mistake was an error in a r...,3.890752
4,57293f8a6aef051400154be0,Intergovernmental_Panel_on_Climate_Change,"In addition to climate assessment reports, the...",When was the Special Report on Managing Risks ...,"{'text': ['2011', '2011', '2011'], 'answer_sta...",The Special Report on Global Warming of 1.5°C ...,11.611831
...,...,...,...,...,...,...,...
295,571156152419e3140095559c,Steam_engine,The historical measure of a steam engine's ene...,Who invented the notion of a steam engine's duty?,"{'text': ['Watt', 'Watt', 'Watt'], 'answer_sta...",Watt introduced the concept of duty in order t...,9.483513
296,571161092419e314009555d7,Steam_engine,It is possible to use a mechanism based on a p...,What is an example of a rotary engine without ...,"{'text': ['Wankel', 'Wankel', 'the Wankel engi...",The Hult Brothers Rotary Steam Engine Company ...,10.009149
297,57114b1a2419e31400955575,Steam_engine,An oscillating cylinder steam engine is a vari...,What type of steam engine doesn't need valves ...,"{'text': ['oscillating cylinder', 'oscillating...",An oscillating cylinder steam engine is a var...,13.107081
298,5711658e50c2381900b54ad9,Steam_engine,Land-based steam engines could exhaust much of...,In what year was HMS Dreadnought launched?,"{'text': ['1905', '1905', '1905'], 'answer_sta...","In 1905, HMS Dreadnought was launched.",7.474219


# Evaluation

Now, I will evaluate the semantic similarity between the ground truths and the generated answers using BERTScore. In addition, I'll compare the models' performance by their duration statistics.

In [48]:
import bert_score
import pandas as pd


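def calculate_bertscore(df: pd.DataFrame) -> pd.DataFrame:
    # Only the first ground-truth answer of each question is used as the reference.
    ground_truths = df["answers"].apply(lambda x: x["text"][0]).tolist()
    predicted_answers = df["generated_answer"].tolist()

    # Candidates first, references second; scores are left unrescaled.
    P, R, F1 = bert_score.score(
        predicted_answers, ground_truths, lang="en", rescale_with_baseline=False
    )

    # Keep the F1 component as the per-answer quality score.
    df["bertscore"] = F1.tolist()

    return df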
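
The function above compares each prediction with the first ground-truth answer only, although SQuAD usually provides several. `bert_score.score` also accepts a list of reference lists and then keeps the best-scoring reference per prediction. A hedged sketch of that variant follows. `build_references` and `multi_ref_bertscore` are hypothetical names, and scores computed this way would not be directly comparable with the single-reference numbers reported in this notebook.

```python
from typing import Dict, List, Sequence


def build_references(answers: Sequence[Dict[str, list]]) -> List[List[str]]:
    """Collect the unique ground-truth texts per question, keeping their order."""
    return [list(dict.fromkeys(entry["text"])) for entry in answers]


def multi_ref_bertscore(predictions: List[str], references: List[List[str]]) -> List[float]:
    """F1 per prediction against its best-matching reference."""
    import bert_score  # imported lazily: heavy dependency (torch, transformers)

    _, _, f1 = bert_score.score(
        predictions, references, lang="en", rescale_with_baseline=False
    )
    return f1.tolist()
```

It could be applied as `multi_ref_bertscore(df["generated_answer"].tolist(), build_references(df["answers"]))`.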

In [49]:
def print_stats(df: pd.DataFrame, column_name: str) -> None:
    data = df[column_name]

    mean_value = data.mean()
    std_value = data.std()
    min_value = data.min()
    max_value = data.max()
    percentiles = data.quantile([0.5, 0.75, 0.9, 0.99])

    print(f"Statistics for '{column_name}':")
    print(f"Mean: {mean_value}")
    print(f"Standard Deviation: {std_value}")
    print(f"Minimum: {min_value}")
    print(f"Maximum: {max_value}")
    print("Percentiles:")
    for percentile, value in percentiles.items():
        print(f"  {round(percentile * 100)}th percentile: {value}")
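
For a side-by-side view, the same statistics can be collected into one table instead of being printed per model. Here is a small sketch built on pandas `describe`. `compare_models` is a hypothetical helper, and the numbers in the toy frames are placeholders, not measured results.

```python
import pandas as pd


def compare_models(results: dict[str, pd.DataFrame], column: str) -> pd.DataFrame:
    """One row per model with count, mean, std, min, the chosen percentiles and max."""
    percentiles = [0.5, 0.75, 0.9, 0.99]
    return pd.DataFrame(
        {name: df[column].describe(percentiles=percentiles) for name, df in results.items()}
    ).T


# Placeholder data only; real usage would pass the evaluated answer frames.
toy_results = {
    "model_a": pd.DataFrame({"duration": [1.0, 2.0, 3.0]}),
    "model_b": pd.DataFrame({"duration": [2.0, 4.0, 6.0]}),
}
summary = compare_models(toy_results, "duration")
```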

### BERTscore

#### Gemma-2-2B

In [50]:
import warnings

from transformers import logging as transformers_logging

warnings.simplefilter(action="ignore", category=FutureWarning)
transformers_logging.set_verbosity_error()


gemma_answers_eval = calculate_bertscore(gemma_answers)

print_stats(gemma_answers_eval, "bertscore")

Statistics for 'bertscore':
Mean: 0.8293970932563146
Standard Deviation: 0.03723752709230562
Minimum: 0.7265495657920837
Maximum: 1.000000238418579
Percentiles:
  50th percentile: 0.8253500163555145
  75th percentile: 0.8478496223688126
  90th percentile: 0.8763234078884125
  99th percentile: 0.9413502812385555


#### Llama-3.2-3B-Instruct

In [58]:
llama_answers_eval = calculate_bertscore(llama_answers)

print_stats(llama_answers_eval, "bertscore")

Statistics for 'bertscore':
Mean: 0.849396791656812
Standard Deviation: 0.049873974009492096
Minimum: 0.7477531433105469
Maximum: 0.9999998807907104
Percentiles:
  50th percentile: 0.8368507325649261
  75th percentile: 0.8706192374229431
  90th percentile: 0.9131241738796236
  99th percentile: 0.9949575346708297


#### Nemotron-Mini-4B-Instruct

In [59]:
nemo_answers_eval = calculate_bertscore(nemo_answers)

print_stats(nemo_answers_eval, "bertscore")

Statistics for 'bertscore':
Mean: 0.8293970932563146
Standard Deviation: 0.03723752709230562
Minimum: 0.7265495657920837
Maximum: 1.000000238418579
Percentiles:
  50th percentile: 0.8253500163555145
  75th percentile: 0.8478496223688126
  90th percentile: 0.8763234078884125
  99th percentile: 0.9413502812385555


### Duration

#### Gemma-2-2B

In [60]:
print_stats(gemma_answers_eval, "duration")

Statistics for 'duration':
Mean: 10.449147210003382
Standard Deviation: 3.921189097409322
Minimum: 2.7195086249994347
Maximum: 33.464258866999444
Percentiles:
  50th percentile: 9.740066264999768
  75th percentile: 12.436653168749217
  90th percentile: 14.931566049498361
  99th percentile: 21.76241399363847


#### Llama-3.2-3B-Instruct

In [61]:
print_stats(llama_answers_eval, "duration")

Statistics for 'duration':
Mean: 9.533914186676679
Standard Deviation: 3.601370875414153
Minimum: 2.5570840399996086
Maximum: 33.68731780900271
Percentiles:
  50th percentile: 8.951638114500383
  75th percentile: 11.366679939997994
  90th percentile: 13.56175717499973
  99th percentile: 20.2294812440599


#### Nemotron-Mini-4B-Instruct

In [62]:
print_stats(nemo_answers_eval, "duration")

Statistics for 'duration':
Mean: 10.449147210003382
Standard Deviation: 3.921189097409322
Minimum: 2.7195086249994347
Maximum: 33.464258866999444
Percentiles:
  50th percentile: 9.740066264999768
  75th percentile: 12.436653168749217
  90th percentile: 14.931566049498361
  99th percentile: 21.76241399363847


# Conclusion

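Although the metrics are all at a similar level, Llama-3.2 comes out slightly ahead on quality (mean BERTScore 0.849 versus 0.829) as well as on duration (mean 9.5 s versus 10.4 s per answer). Note that the statistics reported above for Gemma and Nemotron are identical to the last digit, which suggests that both were computed from the same DataFrame, so those two models should be re-evaluated separately before they are compared with each other. Moreover, Llama has multilingual prospects: the model officially supports understanding and generating text in eight languages, and it may also handle others to some extent.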