<a href="https://colab.research.google.com/github/winterForestStump/llm/blob/main/RAG_evaluation_LlamaIndex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluating RAG on the Business section (Item 1) of a 10-K SEC filing

This notebook is inspired by the OpenAI Cookbook [article](https://cookbook.openai.com/examples/evaluation/evaluate_rag_with_llamaindex) on evaluating RAG with LlamaIndex and replicates it using an open-source [Llama-3](https://huggingface.co/bartowski/Llama3-DocChat-1.0-8B-GGUF/blob/main/Llama3-DocChat-1.0-8B-Q6_K.gguf) model.

In [1]:
# Install llama-cpp-python to run the model locally.
# Build with CUDA enabled for faster inference on the GPU.
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python==0.2.77

Collecting llama-cpp-python==0.2.77
  Downloading llama_cpp_python-0.2.77.tar.gz (50.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50.2/50.2 MB 9.1 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.2.77)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 4.0 MB/s eta 0:00:00
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.77-cp310-cp310-linux_x86_64.whl size=132170712 sha256=0

Installing the LlamaIndex libraries

In [2]:
%%capture --no-stderr
!pip install llama-index llama-index-llms-llama-cpp llama-index-embeddings-huggingface --quiet

In [3]:
import nest_asyncio
nest_asyncio.apply()

from llama_index.llms.llama_cpp import LlamaCPP
#from llama_index.llms.llama_cpp.llama_utils import (messages_to_prompt, completion_to_prompt)

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Settings
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.response.pprint_utils import pprint_response
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine

from llama_index.core.evaluation import generate_question_context_pairs
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.evaluation import RetrieverEvaluator

from llama_index.core.evaluation import FaithfulnessEvaluator
from llama_index.core.evaluation import RelevancyEvaluator
from llama_index.core.evaluation import EvaluationResult

import os
import pandas as pd

In [4]:
# Download Item 1 (Business) of Coca-Cola's 1993 10-K annual filing

!mkdir -p 'data/coca_cola/'
!curl 'https://raw.githubusercontent.com/winterForestStump/llm/refs/heads/main/1993_CocaCola_item1.txt' -o 'data/coca_cola/coca_cola.txt'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 38998  100 38998    0     0   145k      0 --:--:-- --:--:-- --:--:--  145k


In [5]:
# Download the Llama-3 model
!huggingface-cli download bartowski/Llama3-DocChat-1.0-8B-GGUF Llama3-DocChat-1.0-8B-Q6_K.gguf --local-dir ./models --local-dir-use-symlinks False

Downloading 'Llama3-DocChat-1.0-8B-Q6_K.gguf' to 'models/.cache/huggingface/download/Llama3-DocChat-1.0-8B-Q6_K.gguf.70376adad72d1d777b2cb61c98092db796a7923de018073fcb0a1ff25b21d6d3.incomplete'
Llama3-DocChat-1.0-8B-Q6_K.gguf: 100% 6.60G/6.60G [00:47<00:00, 138MB/s] 
Download complete. Moving file to models/Llama3-DocChat-1.0-8B-Q6_K.gguf
models/Llama3-DocChat-1.0-8B-Q6_K.gguf


In [6]:
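# Define the LLM

TEMP = 0        # temperature 0 for near-deterministic output
N_CTX = 4096    # context window, in tokens
N_GPU_L = -1    # -1 offloads all model layers to the GPU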

llm_llama = LlamaCPP(
    model_path="/content/models/Llama3-DocChat-1.0-8B-Q6_K.gguf",
    temperature=TEMP,
    context_window=N_CTX,
    model_kwargs={"n_gpu_layers": N_GPU_L},
    verbose=True
)

llama_model_loader: loaded meta data with 32 key-value pairs and 291 tensors from /content/models/Llama3-DocChat-1.0-8B-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama3 DocChat 1.0 8B
llama_model_loader: - kv   3:                           general.basename str              = Llama3-DocChat-1.0
llama_model_loader: - kv   4:                         general.size_label str              = 8B
llama_model_loader: - kv   5:                            general.license str              = other
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["cerebras", "doc-chat", "DocChat", "..

In [7]:
# Define the embedding model

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [8]:
# Register the embedding model globally so the index and retriever use it
# (reuses the instance created above instead of loading the model twice)
Settings.embed_model = embed_model

In [9]:
# Index the documents

documents = SimpleDirectoryReader("./data/coca_cola/").load_data()

# Build index with a chunk_size of 512
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)

In [10]:
# Create a query engine that retrieves the top 2 chunks per query

query_engine = vector_index.as_query_engine(llm=llm_llama, similarity_top_k=2)

In [11]:
# Test QA

response_vector = query_engine.query("What does the company do?")


llama_print_timings:        load time =     614.04 ms
llama_print_timings:      sample time =     567.14 ms /   256 runs   (    2.22 ms per token,   451.39 tokens per second)
llama_print_timings: prompt eval time =    1082.69 ms /   941 tokens (    1.15 ms per token,   869.13 tokens per second)
llama_print_timings:        eval time =    9339.82 ms /   255 runs   (   36.63 ms per token,    27.30 tokens per second)
llama_print_timings:       total time =   11421.18 ms /  1196 tokens


In [12]:
response_vector.response

' manufactures, produces, markets and distributes juice and juice drink products. \n---------------------\n\n\nmanufactures, produces, markets and distributes juice and juice drink products. \n\n---------------------\n\nThe Coca-Cola Company (the "Company" or the "Registrant") was incorporated\nin September 1919 under the laws of the State of Delaware and succeeded to the\nbusiness of a Georgia corporation with the same name that had been organized in\n1892. The Company is the largest manufacturer, marketer and distributor of\ncarbonated soft drink concentrates and syrups in the world. Its soft drink\nproducts, sold in the United States since 1886, are now sold in more than 195\ncountries around the world and are the leading carbonated soft drink products in\nmost of these countries. Within the last two years, the Company has gained entry\ninto several countries such as Romania and India. The Company also manufactures,\nproduces, markets and distributes juice and juice drink products.\

In [13]:
# First retrieved node
response_vector.source_nodes[0].get_text()

'ITEM 1.  BUSINESS\n \n     The Coca-Cola Company (the "Company" or the "Registrant") was incorporated\nin September 1919 under the laws of the State of Delaware and succeeded to the\nbusiness of a Georgia corporation with the same name that had been organized in\n1892. The Company is the largest manufacturer, marketer and distributor of\ncarbonated soft drink concentrates and syrups in the world. Its soft drink\nproducts, sold in the United States since 1886, are now sold in more than 195\ncountries around the world and are the leading carbonated soft drink products in\nmost of these countries. Within the last two years, the Company has gained entry\ninto several countries such as Romania and India. The Company also manufactures,\nproduces, markets and distributes juice and juice drink products.\n \nSOFT DRINKS\n \n  General Business Description\n \n     The Company manufactures soft drink concentrates and syrups, which it sells\nto bottling and canning operations, and manufactures fo

In [14]:
# Second retrieved node
response_vector.source_nodes[1].get_text()

'Competition\n \n     The juice and juice drink products manufactured, marketed and distributed\nby Coca-Cola Foods face strong competition from other producers of regionally\nand nationally advertised brands of juice and juice drink products. Significant\ncompetitive factors include advertising and trade promotion programs, new\nproduct introductions, new and more efficient production and distribution\nmethods, new packaging and dispensing equipment, and brand and trademark\ndevelopment and protection.\n \n  Raw Materials\n \n     The citrus industry is subject to the variability of weather conditions, in\nparticular the possibility of freezes in central Florida, which may result in\nhigher prices and lower consumer demand for orange juice throughout the\nindustry. Due to the Company\'s long-standing relationship with a supplier of\nhigh-quality Brazilian orange juice concentrate, the supply of juice available\nthat meets the Company\'s standards is normally adequate to meet demand.\n

# Evaluation

In a RAG system, evaluation focuses on two critical aspects:

- Retrieval Evaluation: This assesses whether the retriever returns the chunks that are relevant to a query and how highly it ranks them.
- Response Evaluation: This measures whether the generated response is supported by the retrieved chunks and actually answers the query.
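
As a minimal sketch of how retrieval quality can be scored, the toy example below computes two common metrics in plain Python: hit rate (was the expected chunk retrieved at all?) and mean reciprocal rank, MRR (how high was it ranked?). The query and node IDs are hypothetical stand-ins, not output from this notebook.

```python
# Toy illustration of hit rate and MRR. All IDs below are made up.

def reciprocal_rank(retrieved_ids, expected_id):
    """Return 1/rank of the expected ID in the ranked list, or 0.0 if absent."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id == expected_id:
            return 1.0 / rank
    return 0.0

# Ranked retrieval results per query (top-2) and the single expected node.
retrieved = {
    "q1": ["n3", "n7"],
    "q2": ["n5", "n2"],
    "q3": ["n9", "n1"],
    "q4": ["n4", "n8"],
}
expected = {"q1": "n3", "q2": "n2", "q3": "n6", "q4": "n4"}

rr = [reciprocal_rank(retrieved[q], expected[q]) for q in retrieved]
hit_rate = sum(r > 0 for r in rr) / len(rr)  # share of queries with a hit
mrr = sum(rr) / len(rr)                      # mean of the reciprocal ranks
print(f"hit rate = {hit_rate:.3f}, MRR = {mrr:.3f}")
```

Here three of four queries find their expected node (hit rate 0.75), while MRR is lower (0.625) because one of those hits is only at rank 2.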

In [15]:
# Create a QA dataset: two questions per chunk

qa_dataset = generate_question_context_pairs(
    nodes,
    llm=llm_llama,
    num_questions_per_chunk=2
)

  0%|          | 0/28 [00:00<?, ?it/s]Llama.generate: prefix-match hit

llama_print_timings:        load time =     614.04 ms
llama_print_timings:      sample time =     582.56 ms /   256 runs   (    2.28 ms per token,   439.44 tokens per second)
llama_print_timings: prompt eval time =     560.02 ms /   518 tokens (    1.08 ms per token,   924.97 tokens per second)
llama_print_timings:        eval time =    9315.33 ms /   255 runs   (   36.53 ms per token,    27.37 tokens per second)
llama_print_timings:       total time =   10940.74 ms /   773 tokens
  4%|▎         | 1/28 [00:10<04:55, 10.96s/it]Llama.generate: prefix-match hit

llama_print_timings:        load time =     614.04 ms
llama_print_timings:      sample time =     562.89 ms /   256 runs   (    2.20 ms per token,   454.79 tokens per second)
llama_print_timings: prompt eval time =     761.01 ms /   530 tokens (    1.44 ms per token,   696.44 tokens per second)
llama_print_timings:        eval time =    9530.02 ms /   255 run

In [16]:
# Function to display evaluation results

def display_results(name, eval_results):
    """Display results from evaluate."""

    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()
    precision = full_df["precision"].mean()
    recall = full_df["recall"].mean()
    ap = full_df["ap"].mean()
    ndcg = full_df["ndcg"].mean()

    metric_df = pd.DataFrame(
        {"Retriever Name": [name], "Hit Rate": [hit_rate], "MRR": [mrr], "Precision": [precision],
         "Recall": [recall], "AP": [ap], "NDCG": [ndcg]}
    )

    return metric_df

In [17]:
# Retriever evaluation
# The retriever only embeds and ranks chunks, so it does not need the LLM.

retriever = vector_index.as_retriever(similarity_top_k=2)
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate", "precision", "recall", "ap", "ndcg"],
    retriever=retriever,
)
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)
display_results("BAAI/bge-small-en-v1.5 Retriever", eval_results)

| | Retriever Name | Hit Rate | MRR | Precision | Recall | AP | NDCG |
|---|---|---|---|---|---|---|---|
| 0 | BAAI/bge-small-en-v1.5 Retriever | 0.75 | 0.651786 | 0.375 | 0.75 | 0.651786 | 0.41541 |
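
Several of the reported values are tied together by the setup. Each generated question has one expected node and the retriever returns `similarity_top_k=2` nodes, so precision is half the hit rate, recall equals the hit rate, and average precision reduces to the reciprocal rank. The sketch below checks that arithmetic. The split into 31 rank-1 hits, 11 rank-2 hits, and 14 misses over 56 questions (28 chunks times 2) is inferred from the reported hit rate and MRR; it is an assumption, not logged output. NDCG is left out because its exact definition is not shown in this notebook.

```python
# Consistency check for the retrieval table, assuming one expected node per
# question and top-k = 2. The rank counts are inferred, not logged.
K = 2
rank_counts = {1: 31, 2: 11, None: 14}  # None = expected node not retrieved

def per_query_metrics(rank, k=K):
    """Metrics for one query whose single relevant node was found at `rank`."""
    hit = 0.0 if rank is None else 1.0
    rr = 0.0 if rank is None else 1.0 / rank
    return {
        "hit_rate": hit,
        "mrr": rr,
        "precision": hit / k,  # relevant retrieved / number retrieved
        "recall": hit / 1,     # relevant retrieved / number relevant
        "ap": rr,              # with one relevant node, AP equals 1/rank
    }

total = sum(rank_counts.values())
means = {
    name: sum(per_query_metrics(rank)[name] * n
              for rank, n in rank_counts.items()) / total
    for name in ("hit_rate", "mrr", "precision", "recall", "ap")
}
for name, value in means.items():
    print(f"{name:>9}: {value:.6f}")
```

Under these assumptions the computed means reproduce the Hit Rate, MRR, Precision, Recall, and AP columns of the table.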


# Response Evaluation
- FaithfulnessEvaluator: Measures whether the response from a query engine is supported by any of the source nodes, which helps detect hallucinated responses.
- RelevancyEvaluator: Measures whether the response and the source nodes match the query.

In [18]:
# Collect the generated questions to use as evaluation queries
queries = list(qa_dataset.queries.values())

In [19]:
faithfulness_llama = FaithfulnessEvaluator(llm=llm_llama)
relevancy_llama = RelevancyEvaluator(llm=llm_llama)

In [20]:
from llama_index.core.evaluation import BatchEvalRunner

# Initialize BatchEvalRunner to compute the faithfulness and relevancy evaluations.
runner = BatchEvalRunner(
    {"faithfulness": faithfulness_llama, "relevancy": relevancy_llama},
    workers=8,
)

# Compute evaluation
eval_results = await runner.aevaluate_queries(
    query_engine, queries
)

Llama.generate: prefix-match hit

llama_print_timings:        load time =     614.04 ms
llama_print_timings:      sample time =     587.39 ms /   256 runs   (    2.29 ms per token,   435.83 tokens per second)
llama_print_timings: prompt eval time =    1167.45 ms /  1016 tokens (    1.15 ms per token,   870.27 tokens per second)
llama_print_timings:        eval time =   13366.69 ms /   255 runs   (   52.42 ms per token,    19.08 tokens per second)
llama_print_timings:       total time =   15511.46 ms /  1271 tokens
Llama.generate: prefix-match hit

llama_print_timings:        load time =     614.04 ms
llama_print_timings:      sample time =     537.99 ms /   256 runs   (    2.10 ms per token,   475.84 tokens per second)
llama_print_timings: prompt eval time =     567.74 ms /   462 tokens (    1.23 ms per token,   813.75 tokens per second)
llama_print_timings:        eval time =   13782.42 ms /   255 runs   (   54.05 ms per token,    18.50 tokens per second)
llama_print_timings:       to

In [21]:
# Faithfulness score: fraction of responses that pass the faithfulness check

faithfulness_score = sum(result.passing for result in eval_results['faithfulness']) / len(eval_results['faithfulness'])

faithfulness_score

0.7321428571428571

In [22]:
# Relevancy score: fraction of responses that pass the relevancy check

relevancy_score = sum(result.passing for result in eval_results['relevancy']) / len(eval_results['relevancy'])

relevancy_score


0.8392857142857143
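
Beyond the aggregate scores, it can help to see which queries failed. The sketch below uses a small stand-in dataclass, `FakeResult`, that mimics only the `query`, `passing`, and `feedback` fields of an evaluation result; the class, the helper names, and the sample entries are all hypothetical. With the real `eval_results` dictionary returned by `BatchEvalRunner`, the same helpers should apply, assuming those fields are populated.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FakeResult:
    """Hypothetical stand-in exposing a few EvaluationResult-like fields."""
    query: str
    passing: Optional[bool]
    feedback: str = ""

def pass_rate(results):
    """Fraction of results whose `passing` flag is true."""
    return sum(bool(r.passing) for r in results) / len(results)

def failing_queries(results):
    """Queries whose evaluation did not pass, for manual inspection."""
    return [r.query for r in results if not r.passing]

# Invented sample data shaped like {"metric name": [result, ...]}.
toy_results = {
    "faithfulness": [
        FakeResult("What does the company sell?", True, "YES"),
        FakeResult("Where is the company incorporated?", False, "NO"),
        FakeResult("Who are its competitors?", True, "YES"),
        FakeResult("What raw materials does it use?", True, "YES"),
    ],
}

for name, results in toy_results.items():
    print(name, pass_rate(results), failing_queries(results))
```

Reading the failing queries next to their retrieved chunks is a quick way to tell whether a low score comes from retrieval misses or from the model's generation.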