<a href="https://colab.research.google.com/github/wenqiglantz/llmops/blob/main/Eval_for_base_model_of_Mistral_7B_Instruct_v0_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluate the base model of Mistral-7B-Instruct-v0.2 using LlamaIndex's rag_evaluator LlamaPack

This notebook demonstrates how to evaluate the base model of `Mistral-7B-Instruct-v0.2` using LlamaIndex's `rag_evaluator` pack.


In [None]:
!pip install llama_index==0.9.25 llama_hub torch transformers accelerate bitsandbytes llama-cpp-python

Collecting llama_index==0.9.25
  Downloading llama_index-0.9.25-py3-none-any.whl (15.8 MB)
Collecting llama_hub
  Downloading llama_hub-0.0.70-py3-none-any.whl (42.6 MB)
Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.28.tar.gz (9.4 MB)

In [None]:
import logging, sys
import nest_asyncio

nest_asyncio.apply()

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

### Download Llama dataset and RagEvaluatorPack

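First, we download both the Llama dataset and `RagEvaluatorPack`. We use the Paul Graham essay dataset for the evaluation. The download returns a labelled RAG dataset together with the source files, which are loaded into `documents` with `SimpleDirectoryReader`; we then build a `VectorStoreIndex` from those `documents`. The cell below only imports what we need and sets the OpenAI API key; the downloads themselves run in the evaluation cell further down.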
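
To make the indexing step more concrete, here is a toy sketch of what a vector index does conceptually: embed each chunk, embed the query, and return the most similar chunks. This is not LlamaIndex's implementation. `ToyVectorIndex`, `embed`, and `cosine` are hypothetical names, the sample sentences are made up, and a bag-of-words counter stands in for a real embedding model.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Stores (chunk, vector) pairs and ranks them against a query vector."""

    def __init__(self, chunks):
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def retrieve(self, query: str, top_k: int = 1):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]


# Made-up chunks, purely for illustration.
toy_index = ToyVectorIndex([
    "cats sleep for most of the day",
    "python is a popular programming language",
    "the eiffel tower is in paris",
])
top_chunks = toy_index.retrieve("which programming language is popular")
print(top_chunks)
```

In the real pipeline the retrieved chunks are passed to the LLM as context, which is the part the evaluation below scores.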

In [None]:
from llama_index.llama_dataset import download_llama_dataset
from llama_index.llama_pack import download_llama_pack
from llama_index import VectorStoreIndex
from llama_index.llms import OpenAI
from google.colab import userdata
import os

# get the OpenAI API key from secrets tab in Colab
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
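
The cell above only works inside Colab, because `google.colab` does not exist elsewhere. A minimal sketch of a more portable lookup is shown below; `resolve_openai_api_key` is a hypothetical helper, not part of LlamaIndex or Colab, and it assumes that falling back to an already-set environment variable is acceptable.

```python
import os


def resolve_openai_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key from Colab secrets if available, else from the environment."""
    key = None
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata

        key = userdata.get(env_var)
    except Exception:
        # Not running in Colab, or the secret is missing or not accessible.
        key = None

    key = key or os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set in Colab secrets or the environment")
    return key
```

With this helper, the assignment becomes `os.environ["OPENAI_API_KEY"] = resolve_openai_api_key()`.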

### Evaluate the base model

In [None]:
from llama_index.llms import HuggingFaceLLM
from llama_index import ServiceContext

# load the base (not fine-tuned) model from the Hugging Face Hub
llm_base = HuggingFaceLLM(model_name="mistralai/Mistral-7B-Instruct-v0.2")

# bundle the base LLM with a local embedding model
service_context_base = ServiceContext.from_defaults(
    llm=llm_base,
    embed_model="local:WhereIsAI/UAE-Large-V1"
)

In [None]:
# download a LabelledRagDataset from llama-hub
rag_dataset, documents = download_llama_dataset(
    "PaulGrahamEssayDataset", "./paul_graham"
)

# download and install RagEvaluatorPack
RagEvaluatorPack = download_llama_pack(
    "RagEvaluatorPack", "./rag_evaluator_pack"
)

# build index from the source documents
index = VectorStoreIndex.from_documents(documents=documents)

# define query engine
query_engine_base = index.as_query_engine(service_context=service_context_base)

# construct RagEvaluatorPack
rag_evaluator_pack_base = RagEvaluatorPack(
    query_engine=query_engine_base,
    rag_dataset=rag_dataset,
    judge_llm=OpenAI(temperature=0, model="gpt-4-1106-preview")
)

# run eval
benchmark_df_base = rag_evaluator_pack_base.run()
print(benchmark_df_base)


rag                            base_rag
metrics                                
mean_correctness_score         3.465909
mean_relevancy_score           0.659091
mean_faithfulness_score        0.954545
mean_context_similarity_score  0.932186
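
The four metrics are not on the same scale, which makes the table easy to misread. The sketch below puts them side by side on a 0 to 1 range. It assumes that the correctness score uses the 1 to 5 scale of LlamaIndex's `CorrectnessEvaluator`, and that relevancy and faithfulness are means of pass/fail (0 or 1) judgments; `rescale_correctness` is a hypothetical helper, and the numbers are copied from the table above.

```python
# Scores copied from the benchmark table above.
base_scores = {
    "mean_correctness_score": 3.465909,
    "mean_relevancy_score": 0.659091,
    "mean_faithfulness_score": 0.954545,
    "mean_context_similarity_score": 0.932186,
}


def rescale_correctness(score: float, low: float = 1.0, high: float = 5.0) -> float:
    """Map a score on the assumed [low, high] scale onto [0, 1]."""
    return (score - low) / (high - low)


# Only correctness needs rescaling; the other metrics are already in [0, 1].
normalized = dict(base_scores)
normalized["mean_correctness_score"] = rescale_correctness(
    base_scores["mean_correctness_score"]
)

for name, value in normalized.items():
    print(f"{name:<32}{value:.3f}")
```

Read this way, the base model's answers are almost always grounded in the retrieved context (faithfulness), while relevancy and correctness leave noticeably more room for improvement.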
