## Evaluate the Falcon models using LlamaIndex's rag_evaluator LlamaPack

This notebook demonstrates how to evaluate the `Falcon-RW-1B` base model and its 4-bit and 2-bit GPTQ-quantized variants using LlamaIndex's `rag_evaluator` pack.


In [1]:
!pip install llama_index==0.9.25 llama_hub torch transformers accelerate bitsandbytes auto_gptq optimum

Collecting llama_index==0.9.25
  Downloading llama_index-0.9.25-py3-none-any.whl.metadata (8.3 kB)
Collecting llama_hub
  Downloading llama_hub-0.0.79.post1-py3-none-any.whl.metadata (16 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl.metadata (9.9 kB)
Collecting auto_gptq
  Downloading auto_gptq-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting optimum
  Downloading optimum-1.17.0-py3-none-any.whl.metadata (18 kB)
Collecting httpx (from llama_index==0.9.25)
  Downloading httpx-0.26.0-py3-none-any.whl.metadata (7.6 kB)
Collecting nltk<4.0.0,>=3.8.1 (from llama_index==0.9.25)
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting openai>=1.1.0 (from llama_index==0.9.25)
  Downloading openai-1.12.0-py3-none-any.whl.metadata (18 kB)
Collecting tiktoken>=0.3.3 (from llama_ind

In [2]:
import logging, sys
import nest_asyncio

nest_asyncio.apply()

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

### Download Llama dataset and RagEvaluatorPack

We evaluate on the Paul Graham essay Llama dataset using `RagEvaluatorPack`. The cell below imports the download helpers and sets the OpenAI API key; the dataset and the pack themselves are downloaded in each evaluation section. The dataset's source files are loaded with `SimpleDirectoryReader` into `documents`, from which we then construct the `VectorStoreIndex`.

In [3]:
from llama_index.llama_dataset import download_llama_dataset
from llama_index.llama_pack import download_llama_pack
from llama_index import VectorStoreIndex
from llama_index.llms import OpenAI

import os

# Read the OpenAI API key from the Kaggle Secrets tab
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
os.environ["OPENAI_API_KEY"] = user_secrets.get_secret("OPENAI_API_KEY")


## Evaluate the 4-bit model

In [4]:
from llama_index.llms import HuggingFaceLLM
from llama_index import ServiceContext

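# Load the 4-bit GPTQ checkpoint of Falcon-RW-1B from the Hugging Face Hub.
# Note: tokenizer_name is not passed, so HuggingFaceLLM falls back to its default.
llm_4bit = HuggingFaceLLM(model_name="reachrkr/falcon-rw-1bt-gptq-4bit-ptb")

# The variable keeps the name service_context_base because the evaluation cell reuses it.
service_context_base = ServiceContext.from_defaults(
    llm=llm_4bit,
    embed_model="local:WhereIsAI/UAE-Large-V1",
)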

config.json:   0%|          | 0.00/1.83k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/836M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/264 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


config.json:   0%|          | 0.00/733 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.24k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

In [5]:
# download a LabelledRagDataset from llama-hub
rag_dataset, documents = download_llama_dataset(
    "PaulGrahamEssayDataset", "./paul_graham"
)

# download and install RagEvaluatorPack
RagEvaluatorPack = download_llama_pack(
    "RagEvaluatorPack", "./rag_evaluator_pack"
)

# build index from the source documents
# (no service_context is passed here, so the index uses the default embedding model)
index = VectorStoreIndex.from_documents(documents=documents)

# define query engine
query_engine_base = index.as_query_engine(service_context=service_context_base)

# construct RagEvaluatorPack
rag_evaluator_pack_base = RagEvaluatorPack(
    query_engine=query_engine_base,
    rag_dataset=rag_dataset,
    #judge_llm=OpenAI(temperature=0, model="gpt-3.5-turbo-1106")
)

# run eval
benchmark_df_base = rag_evaluator_pack_base.run()
#benchmark_df = await rag_evaluator_pack_base.arun(
#    batch_size=20,  # batches the number of openai api calls to make
#    sleep_time_in_seconds=1,  # seconds to sleep before making an api call
#)
print(benchmark_df_base)

  0%|          | 0/10 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (2048). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 10%|█         | 1/10 [01:17<11:35, 77.32s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 20%|██        | 2/10 [02:32<10:09, 76.20s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 30%|███       | 3/10 [03:48<08:51, 75.92s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 40%|████      | 4/10 [05:03<07:34, 75.74s/it]Setting `pad_toke

rag                            base_rag
metrics                                
mean_correctness_score         1.000000
mean_relevancy_score           0.022727
mean_faithfulness_score        0.000000
mean_context_similarity_score  0.931626
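
The remaining sections repeat the same load, index, and evaluate steps with a different checkpoint. As a minimal sketch, those steps could be wrapped in one function. `MODELS` and `evaluate_model` are hypothetical names introduced here, and the function only uses calls that already appear in this notebook. The `llama_index` imports sit inside the function so that defining it does not require the library.

```python
# Hypothetical helper: MODELS and evaluate_model are illustrative names,
# not part of LlamaIndex.
MODELS = {
    "4bit": "reachrkr/falcon-rw-1bt-gptq-4bit-ptb",
    "2bit": "reachrkr/falcon-rw-1bt-gptq-2bit-ptb",
    "base": "tiiuae/falcon-rw-1b",
}


def evaluate_model(model_name, rag_dataset, documents, pack_cls):
    """Run the RAG evaluation for one Hugging Face model and return its result table."""
    # Imported lazily so this function can be defined without llama_index installed.
    from llama_index import ServiceContext, VectorStoreIndex
    from llama_index.llms import HuggingFaceLLM

    # Same construction as in the notebook cells, parameterized by model name.
    llm = HuggingFaceLLM(model_name=model_name)
    service_context = ServiceContext.from_defaults(
        llm=llm,
        embed_model="local:WhereIsAI/UAE-Large-V1",
    )

    # Build the index and query engine exactly as the notebook does.
    index = VectorStoreIndex.from_documents(documents=documents)
    query_engine = index.as_query_engine(service_context=service_context)

    # pack_cls is the class returned by download_llama_pack("RagEvaluatorPack", ...).
    pack = pack_cls(query_engine=query_engine, rag_dataset=rag_dataset)
    return pack.run()
```

With `rag_dataset`, `documents`, and `RagEvaluatorPack` from the download step, something like `{label: evaluate_model(name, rag_dataset, documents, RagEvaluatorPack) for label, name in MODELS.items()}` would keep one result table per model instead of overwriting `benchmark_df_base` on each run.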


## Evaluate the 2-bit model

In [6]:
from llama_index.llms import HuggingFaceLLM
from llama_index import ServiceContext

# Load the 2-bit GPTQ checkpoint of Falcon-RW-1B from the Hugging Face Hub.
llm_2bit = HuggingFaceLLM(model_name="reachrkr/falcon-rw-1bt-gptq-2bit-ptb")

service_context_base = ServiceContext.from_defaults(
    llm=llm_2bit,
    embed_model="local:WhereIsAI/UAE-Large-V1"
)

config.json:   0%|          | 0.00/1.83k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/532M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [7]:
# download a LabelledRagDataset from llama-hub
rag_dataset, documents = download_llama_dataset(
    "PaulGrahamEssayDataset", "./paul_graham"
)

# download and install RagEvaluatorPack
RagEvaluatorPack = download_llama_pack(
    "RagEvaluatorPack", "./rag_evaluator_pack"
)

# build index from the source documents
index = VectorStoreIndex.from_documents(documents=documents)

# define query engine
query_engine_base = index.as_query_engine(service_context=service_context_base)

# construct RagEvaluatorPack
rag_evaluator_pack_base = RagEvaluatorPack(
    query_engine=query_engine_base,
    rag_dataset=rag_dataset,
    #judge_llm=OpenAI(temperature=0, model="gpt-3.5-turbo-0125")
)

# run eval
benchmark_df_base = rag_evaluator_pack_base.run()
print(benchmark_df_base)

  0%|          | 0/10 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 10%|█         | 1/10 [01:15<11:19, 75.46s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 20%|██        | 2/10 [02:30<10:03, 75.39s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 30%|███       | 3/10 [03:46<08:48, 75.49s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 40%|████      | 4/10 [05:01<07:32, 75.45s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 50%|█████     | 5/10 [06:16<06:16, 75.30s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-

rag                            base_rag
metrics                                
mean_correctness_score          1.00000
mean_relevancy_score            0.00000
mean_faithfulness_score         0.00000
mean_context_similarity_score   0.93283
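
The context-similarity score is almost identical to the 4-bit run (0.93283 vs. 0.931626), which is what we would expect if retrieval does not depend on the generating LLM. As a rough illustration, assuming the metric is a cosine similarity between embedding vectors of the reference and retrieved contexts (an assumption, not something this notebook confirms), the computation looks like the sketch below. The toy vectors are made up and `cosine_similarity` is our own helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings" standing in for real model output.
reference_context = [0.2, 0.7, 0.1]
retrieved_context = [0.25, 0.6, 0.2]

print(cosine_similarity(reference_context, retrieved_context))
```

Vectors pointing in nearly the same direction score close to 1, so a mean around 0.93 would suggest the retrieved passages are semantically close to the reference contexts regardless of which model generates the answer.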


## Evaluate the base model

In [8]:
from llama_index.llms import HuggingFaceLLM
from llama_index import ServiceContext

# Load the unquantized Falcon-RW-1B base checkpoint from the Hugging Face Hub.
llm_base = HuggingFaceLLM(model_name="tiiuae/falcon-rw-1b")

service_context_base = ServiceContext.from_defaults(
    llm=llm_base,
    embed_model="local:WhereIsAI/UAE-Large-V1"
)

config.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.62G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/115 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [9]:
# download a LabelledRagDataset from llama-hub
rag_dataset, documents = download_llama_dataset(
    "PaulGrahamEssayDataset", "./paul_graham"
)

# download and install RagEvaluatorPack
RagEvaluatorPack = download_llama_pack(
    "RagEvaluatorPack", "./rag_evaluator_pack"
)

# build index from the source documents
index = VectorStoreIndex.from_documents(documents=documents)

# define query engine
query_engine_base = index.as_query_engine(service_context=service_context_base)

# construct RagEvaluatorPack
rag_evaluator_pack_base = RagEvaluatorPack(
    query_engine=query_engine_base,
    rag_dataset=rag_dataset,
    #judge_llm=OpenAI(temperature=0, model="gpt-3.5-turbo-0125")
)

# run eval
benchmark_df_base = rag_evaluator_pack_base.run()
print(benchmark_df_base)

  0%|          | 0/10 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 10%|█         | 1/10 [00:17<02:37, 17.55s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 20%|██        | 2/10 [00:35<02:20, 17.57s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 30%|███       | 3/10 [00:52<02:03, 17.62s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 40%|████      | 4/10 [01:10<01:45, 17.61s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 50%|█████     | 5/10 [01:27<01:27, 17.54s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-

rag                            base_rag
metrics                                
mean_correctness_score         1.000000
mean_relevancy_score           0.000000
mean_faithfulness_score        0.000000
mean_context_similarity_score  0.932818
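
To compare the three runs side by side, the sketch below copies the scores from the printed tables and the last visible seconds-per-item figure from each (truncated) progress log. The dictionary layout and helper names are ours. If the correctness score is on a 1 to 5 scale, as we understand LlamaIndex's correctness evaluator to be, a mean of 1.0 is the lowest possible value rather than a perfect score.

```python
# Scores copied from the three result tables printed in this notebook.
scores = {
    "4bit": {"correctness": 1.0, "relevancy": 0.022727,
             "faithfulness": 0.0, "context_similarity": 0.931626},
    "2bit": {"correctness": 1.0, "relevancy": 0.0,
             "faithfulness": 0.0, "context_similarity": 0.93283},
    "base": {"correctness": 1.0, "relevancy": 0.0,
             "faithfulness": 0.0, "context_similarity": 0.932818},
}

# Last seconds-per-item value visible in each progress log.
seconds_per_item = {"4bit": 75.74, "2bit": 75.30, "base": 17.54}


def slowdown_vs_base(timings, baseline="base"):
    """Return each model's time per item relative to the baseline model."""
    base = timings[baseline]
    return {name: t / base for name, t in timings.items()}


def format_table(scores, timings):
    """Render metrics as rows and models as columns in a fixed-width text table."""
    metrics = list(next(iter(scores.values())))
    lines = [f"{'metric':<20}" + "".join(f"{name:>10}" for name in scores)]
    for metric in metrics:
        cells = "".join(f"{scores[name][metric]:>10.4f}" for name in scores)
        lines.append(f"{metric:<20}" + cells)
    lines.append(f"{'sec/item':<20}" + "".join(f"{timings[name]:>10.2f}" for name in scores))
    return "\n".join(lines)


slowdown = slowdown_vs_base(seconds_per_item)
print(format_table(scores, seconds_per_item))
print({name: round(ratio, 2) for name, ratio in slowdown.items()})
```

On these numbers both quantized checkpoints take roughly 4.3 times as long per item as the base model, while relevancy and faithfulness sit at or near zero for all three.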
