<a href="https://colab.research.google.com/github/wenqiglantz/llmops/blob/main/Eval_for_quantized_models_for_Mistral_7B_Instruct_v0_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluate the quantized Mistral-7B-Instruct-v0.2 using LlamaIndex's rag_evaluator LlamaPack

This notebook evaluates two GGUF quantizations of `Mistral-7B-Instruct-v0.2`, the 5-bit `Q5_K_M` and the 4-bit `Q4_K_M`, as the LLM in a RAG pipeline, and compares the scores each one receives from LlamaIndex's `RagEvaluatorPack`.


In [None]:
!pip install llama_index==0.9.25 llama_hub torch transformers accelerate bitsandbytes llama-cpp-python

Collecting llama_index==0.9.25
  Downloading llama_index-0.9.25-py3-none-any.whl (15.8 MB)
Collecting llama_hub
  Downloading llama_hub-0.0.71-py3-none-any.whl (100.2 MB)
Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.28.tar.gz (9.4 MB)

In [None]:
import logging, sys
import nest_asyncio

nest_asyncio.apply()

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

In [None]:
from llama_index.llama_dataset import download_llama_dataset
from llama_index.llama_pack import download_llama_pack
from llama_index import VectorStoreIndex
from llama_index.llms import OpenAI
from google.colab import userdata
import os

# get the OpenAI API key from secrets tab in Colab
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

### Evaluate 5-bit quantized model

In [None]:
from llama_index.llms import LlamaCPP
from llama_index import ServiceContext

# define llm by calling LlamaCPP, pass in the gguf file from hugging face hub
llm_q5 = LlamaCPP(
    model_url="https://huggingface.co/wenqiglantz/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q5_K_M.gguf"
)

# define ServiceContext
service_context_q5 = ServiceContext.from_defaults(
    llm=llm_q5,
    embed_model="local:WhereIsAI/UAE-Large-V1"
)

Downloading url https://huggingface.co/wenqiglantz/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q5_K_M.gguf to path /tmp/llama_index/models/mistral-7b-instruct-v0.2.Q5_K_M.gguf
total size (MB): 5132.35


4895it [00:30, 160.30it/s]                          
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 


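We download both the Llama dataset and the `RagEvaluatorPack`, using Paul Graham's essay dataset for the evaluation. `download_llama_dataset` returns the labelled RAG dataset together with the source documents, which are loaded with `SimpleDirectoryReader`. We then build a `VectorStoreIndex` from those documents, create a query engine backed by the quantized LLM, and let the pack score its answers with `gpt-4-1106-preview` as the judge.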

In [None]:
# download and install RagEvaluatorPack
RagEvaluatorPack = download_llama_pack(
  "RagEvaluatorPack", "./rag_evaluator_pack"
)

# download a LabelledRagDataset from llama-hub
rag_dataset_q5, documents_q5 = download_llama_dataset(
    "PaulGrahamEssayDataset", "./paul_graham"
)

# build index from the source documents
index_q5 = VectorStoreIndex.from_documents(documents=documents_q5)

# define query engine
query_engine_q5 = index_q5.as_query_engine(service_context=service_context_q5)

# construct RagEvaluatorPack
rag_evaluator_pack_q5 = RagEvaluatorPack(
    query_engine=query_engine_q5,
    rag_dataset=rag_dataset_q5,
    judge_llm=OpenAI(temperature=0, model="gpt-4-1106-preview")
)

# run eval
benchmark_df_q5 = rag_evaluator_pack_q5.run()
print(benchmark_df_q5)

 10%|█         | 1/10 [07:48<1:10:16, 468.48s/it]Llama.generate: prefix-match hit
 20%|██        | 2/10 [15:18<1:00:59, 457.49s/it]Llama.generate: prefix-match hit
 30%|███       | 3/10 [16:16<32:06, 275.25s/it]  Llama.generate: prefix-match hit
 40%|████      | 4/10 [23:50<34:34, 345.70s/it]Llama.generate: prefix-match hit
 50%|█████     | 5/10 [31:21<31:59, 383.87s/it]Llama.generate: prefix-match hit
 60%|██████    | 6/10 [38:45<26:56, 404.20s/it]Llama.generate: prefix-match hit
 70%|███████   | 7/10 [45:25<20:08, 402.73s/it]Llama.generate: prefix-match hit
 80%|████████  | 8/10 [50:01<12:04, 362.46s/it]Llama.generate: prefix-match hit
 90%|█████████ | 9/10 [57:10<06:23, 383.16s/it]Llama.generate: prefix-match hit
100%|██████████| 10/10 [1:04:14<00:00, 385.42s/it]
  0%|          | 0/10 [00:00<?, ?it/s]Llama.generate: prefix-match hit
 10%|█         | 1/10 [04:10<37:31, 250.19s/it]Llama.generate: prefix-match hit
 20%|██        | 2/10 [11:03<46:11, 346.43s/it]Llama.generate: prefix-ma

rag                            base_rag
metrics                                
mean_correctness_score         3.625000
mean_relevancy_score           0.590909
mean_faithfulness_score        1.000000
mean_context_similarity_score  0.932185


### Evaluate 4-bit quantized model

In [None]:
from llama_index.llms import LlamaCPP
from llama_index import ServiceContext

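# define llm by calling LlamaCPP, this time with the 4-bit (Q4_K_M) gguf file
llm_q4 = LlamaCPP(
    model_url="https://huggingface.co/wenqiglantz/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf"
)

# define ServiceContext with the same embedding model as the 5-bit run
service_context_q4 = ServiceContext.from_defaults(
    llm=llm_q4,
    embed_model="local:WhereIsAI/UAE-Large-V1"
)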

Downloading url https://huggingface.co/wenqiglantz/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf to path /tmp/llama_index/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf
total size (MB): 4369.38


4167it [00:17, 231.70it/s]                          
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 


config.json:   0%|          | 0.00/733 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.24k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

In [None]:
# download and install RagEvaluatorPack
RagEvaluatorPack = download_llama_pack(
  "RagEvaluatorPack", "./rag_evaluator_pack"
)

# download a LabelledRagDataset from llama-hub
rag_dataset_q4, documents_q4 = download_llama_dataset(
    "PaulGrahamEssayDataset", "./paul_graham"
)

# build index from the source documents
index_q4 = VectorStoreIndex.from_documents(documents=documents_q4)

# define query engine
query_engine_q4 = index_q4.as_query_engine(service_context=service_context_q4)

# construct RagEvaluatorPack
rag_evaluator_pack_q4 = RagEvaluatorPack(
    query_engine=query_engine_q4,
    rag_dataset=rag_dataset_q4,
    judge_llm=OpenAI(temperature=0, model="gpt-4-1106-preview")
)

# run eval
benchmark_df_q4 = rag_evaluator_pack_q4.run()
print(benchmark_df_q4)

 10%|█         | 1/10 [06:32<58:54, 392.70s/it]Llama.generate: prefix-match hit
 20%|██        | 2/10 [12:58<51:48, 388.57s/it]Llama.generate: prefix-match hit
 30%|███       | 3/10 [13:57<27:46, 238.01s/it]Llama.generate: prefix-match hit
 40%|████      | 4/10 [20:26<29:47, 297.88s/it]Llama.generate: prefix-match hit
 50%|█████     | 5/10 [26:48<27:20, 328.05s/it]Llama.generate: prefix-match hit
 60%|██████    | 6/10 [32:56<22:47, 341.80s/it]Llama.generate: prefix-match hit
 70%|███████   | 7/10 [38:27<16:54, 338.19s/it]Llama.generate: prefix-match hit
 80%|████████  | 8/10 [42:25<10:12, 306.16s/it]Llama.generate: prefix-match hit
 90%|█████████ | 9/10 [48:17<05:20, 320.56s/it]Llama.generate: prefix-match hit
100%|██████████| 10/10 [54:34<00:00, 327.43s/it]
  0%|          | 0/10 [00:00<?, ?it/s]Llama.generate: prefix-match hit
 10%|█         | 1/10 [03:31<31:47, 211.95s/it]Llama.generate: prefix-match hit
 20%|██        | 2/10 [09:25<39:23, 295.39s/it]Llama.generate: prefix-match hit


rag                            base_rag
metrics                                
mean_correctness_score         3.670455
mean_relevancy_score           0.681818
mean_faithfulness_score        0.977273
mean_context_similarity_score  0.932186
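
### Compare the two runs

To make the comparison easier to read, the two printed tables can be placed side by side. The following is a minimal sketch that copies the scores from the outputs above into plain dictionaries, so it runs without re-executing the hour-long evaluations. The names `q5_scores`, `q4_scores`, and `compare_scores` are illustrative and not part of LlamaIndex.

```python
# Scores copied verbatim from the two benchmark tables printed above.
q5_scores = {
    "mean_correctness_score": 3.625000,
    "mean_relevancy_score": 0.590909,
    "mean_faithfulness_score": 1.000000,
    "mean_context_similarity_score": 0.932185,
}
q4_scores = {
    "mean_correctness_score": 3.670455,
    "mean_relevancy_score": 0.681818,
    "mean_faithfulness_score": 0.977273,
    "mean_context_similarity_score": 0.932186,
}


def compare_scores(a, b):
    """Return {metric: b - a} for every metric present in both dicts."""
    return {metric: round(b[metric] - a[metric], 6) for metric in a if metric in b}


# Positive delta: the 4-bit model scored higher; negative: the 5-bit model did.
deltas = compare_scores(q5_scores, q4_scores)

print(f"{'metric':<32}{'Q5_K_M':>10}{'Q4_K_M':>10}{'Q4 - Q5':>11}")
for metric, delta in deltas.items():
    print(
        f"{metric:<32}{q5_scores[metric]:>10.6f}"
        f"{q4_scores[metric]:>10.6f}{delta:>+11.6f}"
    )
```

In these runs the 4-bit model scores slightly higher on correctness and relevancy, the 5-bit model slightly higher on faithfulness, and context similarity is effectively identical. The gaps are small and come from a single run per model, so they are best read as "roughly on par" rather than as a ranking. If both DataFrames are still in memory, `pd.concat([benchmark_df_q5, benchmark_df_q4], axis=1, keys=["Q5_K_M", "Q4_K_M"])` should give a similar side-by-side view directly.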
