# Evaluate using sample questions: Llama Stack vs. LlamaIndex

This notebook starts with sample questions and reference answers in the format generated by [make-sample-questions.ipynb](./make-sample-questions.ipynb). It then does the following:

1. It runs RAG with Llama Stack to generate answers.
2. It runs RAG with LlamaIndex to generate answers.
3. It uses Ragas to score the generated answers against the reference answers. This step depends on a very capable evaluator model; this notebook uses gpt-4o for that purpose.
4. It determines whether the results are statistically significant.
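
As a rough illustration of step 4, one way to check whether a difference between two systems is statistically significant is a paired permutation (sign-flip) test on per-question scores. This is only a sketch under stated assumptions: the function name `paired_permutation_p_value` and the example scores are hypothetical, and the test used later in this notebook may differ.

```python
from itertools import product


def paired_permutation_p_value(scores_a, scores_b):
    """Exact two-sided sign-flip test on paired per-question scores.

    Under the null hypothesis the two systems are interchangeable, so the
    sign of each paired difference is equally likely to be + or -.
    Enumerating all sign patterns is only practical for small samples.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("Both systems must be scored on the same questions")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    total = 0
    at_least_as_extreme = 0
    for signs in product((1, -1), repeat=len(diffs)):
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        total += 1
        # Small tolerance guards against floating-point noise in the sums
        if flipped >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Hypothetical per-question scores for two systems on the same five questions
system_a = [0.75, 0.75, 0.0, 0.5, 0.75]
system_b = [0.75, 0.5, 0.0, 0.25, 0.5]
print(paired_permutation_p_value(system_a, system_b))
```

With only a handful of questions, even a consistent difference tends to produce a large p-value, which is one reason to run the full dataset rather than a mini-test before drawing conclusions.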

## Import dependencies

In [1]:
import evaluation_utilities


import os
from IPython.display import clear_output

import copy
import importlib
from enum import Enum
from typing import NamedTuple

from ragas.metrics import (
    AnswerAccuracy,
)
from ragas.metrics._domain_specific_rubrics import RubricsScore

from llama_index.llms.openai import OpenAI as LlamaIndexOpenAI
from llama_index.llms.llama_api import LlamaAPI
from llama_index.llms.openai_like import OpenAILike

from ragas.llms import LlamaIndexLLMWrapper

from llama_index.core.llms import ChatMessage
from llama_index.llms.ibm import WatsonxLLM
from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex
from llama_index.core import get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

from llama_stack_client import LlamaStackClient
from llama_stack_client.types.shared_params import UserMessage
from llama_stack_client import RAGDocument
from llama_stack_client import Agent
from llama_stack_client.types.shared_params.sampling_params import SamplingParams

import uuid



In [2]:
# Rerun this cell whenever you change evaluation_utilities
importlib.reload(evaluation_utilities)

<module 'evaluation_utilities' from '/Users/bmurdock/lls-comparisons/evaluation_utilities.py'>

## Configure and initialize models

The main configuration options for this notebook are in the following cell, so you may want to edit some values there before running.

In [3]:
# reuse_client and max_retries are useful for preventing Ragas from failing due to rate limiting
EVALUATOR_MODEL={"model": "gpt-4o", "reuse_client": False, "max_retries": 10}

EMBED_MODEL_ID_FOR_LLAMAINDEX="ibm-granite/granite-embedding-125m-english"
EMBED_MODEL_ID_FOR_LLAMA_STACK="granite-embedding-125m"

CONTENT_URLS=["https://www.ibm.com/downloads/documents/us-en/1227c12d3a38b173"]
CONTENT_LOCATION="./docs/"

TEST_DATA_FILES = ["qna-ibm-2024-2250-2239.json"]

NUMBER_OF_SEARCH_RESULTS=5
LLAMA_STACK_RAG_MODELS_SAMPLING_PARAMS = SamplingParams(max_tokens=4096)

In [4]:

# These are only used if the selected provider is WATSONX_AI
WATSONX_PROJECT_ID=os.environ.get("WATSONX_PROJECT_ID")
LLAMA_INDEX_RAG_MODELS_INFO_WATSONX= [ {"model_id": "meta-llama/llama-3-3-70b-instruct", "project_id": WATSONX_PROJECT_ID, "max_new_tokens": 4096, "additional_params": {"time_limit": 10000}}]
LLAMA_STACK_RAG_MODELS_WATSONX = ["meta-llama/llama-3-3-70b-instruct"]

# These are only used if the selected provider is LLAMA_API
LLAMA_API_KEY=os.environ.get("LLAMA_API_KEY")
LLAMA_INDEX_RAG_MODELS_INFO_LLAMA_API= [ {"model": "Llama-3.3-70B-Instruct", "api_key":LLAMA_API_KEY, "api_base": "https://api.llama.com/compat/v1/"} ]
LLAMA_STACK_RAG_MODELS_LLAMA_API = ["Llama-3.3-70B-Instruct"]

# These are only used if the selected provider is OPENAI
OPENAIAPI_KEY=os.environ.get("OPENAI_API_KEY")
LLAMA_INDEX_RAG_MODELS_INFO_OPENAI= [ {"model": "gpt-3.5-turbo", "api_key":OPENAIAPI_KEY}]
LLAMA_STACK_RAG_MODELS_OPENAI = ["gpt-3.5-turbo"]

class InferenceProvider(Enum):
    WATSONX_AI = {
        "llamaindex_rag_models_info": LLAMA_INDEX_RAG_MODELS_INFO_WATSONX,
        "llama_stack_rag_models": LLAMA_STACK_RAG_MODELS_WATSONX
    }
    LLAMA_API = {
        "llamaindex_rag_models_info": LLAMA_INDEX_RAG_MODELS_INFO_LLAMA_API,
        "llama_stack_rag_models": LLAMA_STACK_RAG_MODELS_LLAMA_API
    }
    OPENAI = {
        "llamaindex_rag_models_info": LLAMA_INDEX_RAG_MODELS_INFO_OPENAI,
        "llama_stack_rag_models": LLAMA_STACK_RAG_MODELS_OPENAI
    }

SELECTED_PROVIDER = InferenceProvider.OPENAI

In [5]:

client = LlamaStackClient(base_url="http://localhost:8321", timeout=12000)

EMBED_MODEL_FOR_LLAMAINDEX = HuggingFaceEmbedding(model_name=EMBED_MODEL_ID_FOR_LLAMAINDEX)

evaluator_model = LlamaIndexOpenAI(**EVALUATOR_MODEL)

# The sample prompt here is one that is asking for a long, complex answer.
# This is useful for seeing if the model is able to output a substantial number of tokens.
# Some providers (e.g., watsonx.ai) default to very few tokens, so this is a good test
# of whether they are configured for a reasonable amount of text.  If the answer is cut off,
# it is likely that the model needs to be configured to allow for more tokens.
print(evaluator_model.complete("Explain why WW1 started"))

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: ibm-granite/granite-embedding-125m-english
INFO:sentence_transformers.SentenceTransformer:2 prompts are loaded, with the keys: ['query', 'text']
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


The causes of World War I are complex and multifaceted, involving a combination of political, economic, and social factors. Here are some of the key reasons:

1. **Alliance System**: Europe was divided into two main alliance systems: the Triple Entente (comprising France, Russia, and the United Kingdom) and the Triple Alliance (comprising Germany, Austria-Hungary, and Italy). These alliances were meant to provide mutual defense and maintain a balance of power, but they also meant that any conflict involving one country could quickly involve others.

2. **Nationalism**: Nationalistic fervor was on the rise in many parts of Europe, leading to competitive and antagonistic relationships between nations. Ethnic groups within multi-national empires, such as the Austro-Hungarian and Ottoman Empires, sought independence, further destabilizing the region.

3. **Imperialism**: The major European powers were competing for colonies and influence around the world. This competition often led to conf

In [6]:


LLAMA_INDEX_RAG_MODELS = {}
for info in SELECTED_PROVIDER.value["llamaindex_rag_models_info"]:
    if SELECTED_PROVIDER == InferenceProvider.WATSONX_AI:
        model_id = info["model_id"]
        llm = WatsonxLLM(**info)
    elif SELECTED_PROVIDER == InferenceProvider.LLAMA_API:
        model_id = info["model"]
        llm = LlamaAPI(**info)
    elif SELECTED_PROVIDER == InferenceProvider.OPENAI:
        model_id = info["model"]
        llm = LlamaIndexOpenAI(**info)
    LLAMA_INDEX_RAG_MODELS[model_id] = llm

In [7]:
for label, model in LLAMA_INDEX_RAG_MODELS.items():
    print(f"{label}: {model.complete('Explain why WW1 started')}")

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


gpt-3.5-turbo: World War I started in 1914 due to a complex web of political, economic, and social factors that had been building up for years. Some of the main reasons for the outbreak of the war include:

1. Nationalism: Nationalistic fervor was high in Europe at the time, with many countries seeking to assert their power and dominance over others. This led to increased tensions between nations and a desire to prove their superiority through military means.

2. Imperialism: European powers were engaged in a race to colonize and control territories around the world. This competition for resources and power created rivalries and conflicts between nations.

3. Militarism: Many countries had been building up their military strength in the years leading up to the war, creating an arms race that heightened tensions and made conflict more likely.

4. Alliances: A system of alliances had been formed between European powers, with countries pledging to support each other in case of war. When o

In [8]:
LLAMA_STACK_RAG_MODELS = SELECTED_PROVIDER.value["llama_stack_rag_models"]
LLAMA_STACK_RAG_MODELS

['gpt-3.5-turbo']

In [9]:
for model_id in LLAMA_STACK_RAG_MODELS:
    response = client.inference.chat_completion(
        model_id=model_id,
        messages=[{"role": "user", "content": "Explain why WW1 started"}],
        sampling_params=LLAMA_STACK_RAG_MODELS_SAMPLING_PARAMS,
        stream=False
    )
    print(response.completion_message.content)

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/inference/chat-completion "HTTP/1.1 200 OK"


World War 1 started due to a complex web of political, economic, and social factors that had been building up in Europe for decades. Some of the main reasons for the outbreak of the war include:

1. Nationalism: Nationalistic fervor was high in Europe during the late 19th and early 20th centuries, with many countries seeking to assert their dominance and expand their territories. This led to increased tensions between nations and competition for power.

2. Imperialism: European powers were engaged in a race to colonize and control territories around the world. This competition for resources and influence often led to conflicts between nations.

3. Militarism: Many European countries had been building up their military strength in the years leading up to WW1, creating an arms race and increasing the likelihood of conflict.

4. Alliances: A system of alliances had been formed between European powers, with countries pledging to support each other in the event of war. When one country was 

## Load the evaluation questions with reference answers

This loads the outputs of [make-sample-questions.ipynb](./make-sample-questions.ipynb).

In [10]:
loaded_data = []

for f in TEST_DATA_FILES:
    data = evaluation_utilities.read_json(f)
    loaded_data = loaded_data + data
print(len(loaded_data))
loaded_data[0:2]

2239


[{'user_input': "Based on the 2024 Annual Report, what are the key factors that IBM anticipates will drive its growth in 2025, and how might these factors influence the company's financial performance?",
  'user_input_type': 'reasoning',
  'user_input_topic': 'Future outlook and guidance for IBM in 2025',
  'user_profile': 'Professional stock market analyst',
  'reference': "Based on the 2024 Annual Report, IBM anticipates that its growth in 2025 will be driven by several key factors:\n\n1. **Continued Investment in AI and Hybrid Cloud**: IBM's strategy focuses on leveraging AI and hybrid cloud technologies to unlock the full value of data for clients. The company has made significant investments in these areas, which are expected to continue driving growth. This focus is likely to enhance IBM's software offerings and consulting services, leading to increased demand and revenue.\n\n2. **Expansion of Generative AI Offerings**: IBM's generative AI business has grown significantly, with a

In [11]:
# Temporary code to cut scope for quick testing.  Leave this line active to run a mini-test of the notebook, and comment it out to run on the full dataset.
loaded_data = loaded_data[0:20]

In [12]:
rows_with_complete_reference_answers = [i for i, element in enumerate(loaded_data) if element['has_reference_answer']]
count = len(loaded_data)
print(f"{len(rows_with_complete_reference_answers)} of {count}")

15 of 20


## Run Agentic RAG with Llama Stack

This attempts to recreate the flow in https://llama-stack.readthedocs.io/en/latest/building_applications/rag.html#using-the-rag-tool , i.e., the RAG pipeline that a new user following the getting-started documentation would build, except that it is configured with the following elements:

- Content is from the URLs configured in CONTENT_URLS at the top of this notebook
- Milvus-lite inline vector IO provider
- granite-embedding-125m embedding model
- gpt-3.5-turbo generative model
- max_tokens for output is 4096

In [13]:
# Register a vector db
vector_db_id = f"rag-eval-{uuid.uuid4().hex}"
response = client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model=EMBED_MODEL_ID_FOR_LLAMA_STACK,
    embedding_dimension=768,
    provider_id="milvus",
)

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/vector-dbs "HTTP/1.1 200 OK"


In [14]:
# Wrap the URLs as Llama Stack RAGDocument objects

documents = [
    RAGDocument(
        document_id=f"num-{i}",
        content=url,
        mime_type="application/pdf",
        metadata={"url": url},
    )
    for i, url in enumerate(CONTENT_URLS)
]

documents

[{'document_id': 'num-0',
  'content': 'https://www.ibm.com/downloads/documents/us-en/1227c12d3a38b173',
  'mime_type': 'application/pdf',
  'metadata': {'url': 'https://www.ibm.com/downloads/documents/us-en/1227c12d3a38b173'}}]

In [15]:
# Insert those Llama Stack RAGDocument objects into the vector database

client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=512,
)

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/tool-runtime/rag-tool/insert "HTTP/1.1 200 OK"


In [16]:
# Query documents

results_with_reference_answers = client.tool_runtime.rag_tool.query(
    vector_db_ids=[vector_db_id],
    content="Based on IBM's 2024 Annual Report, how might the company's financial performance and strategic initiatives influence its ability to sustain or increase dividend payouts to shareholders?",
)

results_with_reference_answers

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/tool-runtime/rag-tool/query "HTTP/1.1 200 OK"


QueryResult(metadata={'document_ids': ['num-0', 'num-0', 'num-0', 'num-0', 'num-0']}, content=[TextContentItem(text='knowledge_search tool found 5 chunks:\nBEGIN of knowledge_search tool results.\n', type='text'), TextContentItem(text="Result 1\nContent: -year amounts have been reclassified to conform to the change in 2024 presentation.\nFrom the perspective of how management views cash flow, in 2024, after investing $1.1 billion in net capital investments, we \ngenerated free cash flow of $12.7 billion, an increase of $1.5 billion versus the prior year. The year-to-year increase in free cash \nflow primarily reflects current year performance-related improvements within net income and sustainable lower cash requirements\nthrough changes in our retirement plans. In 2024, net capital expenditures and net cash from operating activities include $0.4 \nbillion and $0.1 billion, respectively, of cash proceeds from the sale of certain QRadar SaaS assets. This benefit to net capital \nexpendit

In [17]:
# Register a RAG Agent in Llama Stack using the vector database

# Create agent with memory
agent = Agent(
    client,
    model=LLAMA_STACK_RAG_MODELS[0],
    instructions="You are a helpful assistant",
    sampling_params=LLAMA_STACK_RAG_MODELS_SAMPLING_PARAMS,
    tools=[
        {
            "name": "builtin::rag/knowledge_search",
            "args": {
                "vector_db_ids": [vector_db_id],
                "query_config": {
                    "chunk_size_in_tokens": 512,
                    "chunk_overlap_in_tokens": 0,
                    "max_chunks": NUMBER_OF_SEARCH_RESULTS,
                    "chunk_template": "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n",
                },
            },
        }
    ],
)

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/agents "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET http://localhost:8321/v1/tools?toolgroup_id=builtin%3A%3Arag%2Fknowledge_search "HTTP/1.1 200 OK"


In [18]:
session_id = agent.create_session(f"rag_session-{uuid.uuid4().hex}")


# Ask questions about documents in the vector db, and the agent will query the db to answer the question.
response = agent.create_turn(
    messages=[{"role": "user", "content": "Based on IBM's 2024 Annual Report, how might the company's financial performance and strategic initiatives influence its ability to sustain or increase dividend payouts to shareholders?"}],
    session_id=session_id,
    stream=False
)

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/agents/6856873a-bc35-4a2f-8336-f78025e7a2f9/session "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/agents/6856873a-bc35-4a2f-8336-f78025e7a2f9/session/76c13136-c14d-4844-bb36-82de3ecc16a7/turn "HTTP/1.1 200 OK"


In [19]:
response.output_message.content

"IBM's 2024 Annual Report highlights the company's significant progress in becoming a higher growth, higher margin business by combining technology innovation and consulting expertise. The strategic initiatives focus on driving growth, improving productivity, and enhancing operational efficiency for clients and the company itself. IBM's strategy is built upon the technological foundations of AI and hybrid cloud to unlock the full value of data for clients.\n\nIn terms of financial performance, IBM generated $62.8 billion in revenue in 2024, representing a 3% increase at constant currency. The company also reported $12.7 billion in free cash flow, an increase of $1.5 billion year-over-year. IBM's generative AI book of business exceeded $5 billion since inception. These positive financial results enabled IBM to make significant investments in the business and deliver value to shareholders.\n\nThe company's strong cash flow from operations allows for investments in areas with attractive l

In [20]:
# Wait 30 seconds after a failure in the hopes that any temporary server glitches will be sorted out by then.
DELAY=30

# If it fails more than 15 times, give up.
MAX_RETRIES=15

def run_lls_rag(data, generator_model_ids, vector_db_id, instructions="You are a helpful assistant", label="Llama Stack"):
    datasets = {}

    for generator_model_id in generator_model_ids:
        dataset_for_test_model = copy.deepcopy(data)
        # Restart the progress counter for each model so it stays in step with `count` below
        i = 1
        # Create agent with memory
        agent = Agent(
            client,
            model=generator_model_id,
            instructions=instructions,
            sampling_params=LLAMA_STACK_RAG_MODELS_SAMPLING_PARAMS,
            tools=[
                {
                    "name": "builtin::rag/knowledge_search",
                    "args": {
                        "vector_db_ids": [vector_db_id],
                        # Defaults
                        "query_config": {
                            "chunk_size_in_tokens": 512,
                            "chunk_overlap_in_tokens": 0,
                            "max_chunks": NUMBER_OF_SEARCH_RESULTS,
                            "chunk_template": "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n",
                        },
                    },
                }
            ],
        )
    
        count = len(dataset_for_test_model)
        for entry in dataset_for_test_model:
            clear_output(wait=True)
            print(generator_model_id)
            print(f"{i} / {count}")
            i += 1
            question = entry["user_input"]
            
            session_id = agent.create_session(f"rag_session-{uuid.uuid4().hex}")
            try:
                response = evaluation_utilities.run_with_retries(
                    lambda: agent.create_turn(
                        messages=[{"role": "user", "content": question}],
                        session_id=session_id,
                        stream=False
                    ),
                    MAX_RETRIES,
                    DELAY)
                text = response.output_message.content
                entry["response"] = text
            except RuntimeError as e:
                # Sometimes we get:
                # > RuntimeError: Turn did not complete. Error: 400: litellm.ContextWindowExceededError: litellm.BadRequestError: ContextWindowExceededError: OpenAIException - This model's maximum context length is 16385 tokens. However, you requested 19600 tokens (15451 in the messages, 53 in the functions, and 4096 in the completion). Please reduce the length of the messages, functions, or completion.
                # Ideally we'd fall back to fewer search results and try again, but it's not obvious how to do that in Llama Stack without major code changes.  For now, we will just skip these cases.
                entry["response"] = None
                print(f"Skipping due to runtime error: {e}")

        datasets[label + ":" + generator_model_id] = dataset_for_test_model
    return datasets
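
The helper `evaluation_utilities.run_with_retries` is not shown in this notebook. A minimal sketch of what such a helper might look like follows, assuming it takes a zero-argument callable, a maximum number of attempts, and a delay in seconds. The name `run_with_retries_sketch`, the injectable `sleep` parameter, and the choice to raise `RuntimeError` after the last attempt are all hypothetical and not taken from the actual module.

```python
import time


def run_with_retries_sketch(fn, max_retries, delay_seconds, sleep=time.sleep):
    """Call fn() until it succeeds or max_retries attempts have failed.

    Hypothetical stand-in for evaluation_utilities.run_with_retries;
    the real helper may behave differently.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
            print(f"Attempt {attempt} of {max_retries} failed: {error}")
            if attempt < max_retries:
                sleep(delay_seconds)
    raise RuntimeError(f"Gave up after {max_retries} attempts") from last_error


# Usage: a flaky callable that fails twice before succeeding
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary glitch")
    return "answer"


# A no-op sleep keeps the example fast; the default really waits
print(run_with_retries_sketch(flaky_call, 5, 30, sleep=lambda seconds: None))
```

One consequence of retrying on every exception is that non-transient errors, such as a context window overflow, are retried the full number of times before the caller sees them; a real helper might want to distinguish those cases.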

In [21]:
lls_datasets = run_lls_rag(loaded_data, LLAMA_STACK_RAG_MODELS, vector_db_id)

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/agents/78b967bf-70e5-4b1b-a47f-cc83cbb51ae5/session "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/agents/78b967bf-70e5-4b1b-a47f-cc83cbb51ae5/session/6b758ad9-e948-4c37-941d-51ba6c265d08/turn "HTTP/1.1 200 OK"


gpt-3.5-turbo
20 / 20


Here we print and count the questions that were skipped due to a runtime error, replacing each missing response with an empty string.  If the count is large, the experiment has been substantially compromised.

In [22]:
count_skipped = 0
for key, results_with_reference_answers in lls_datasets.items():
    for entry in results_with_reference_answers:
        if entry["response"] is None:
            count_skipped += 1
            print(f"{key} {entry['user_input']}")
            entry["response"] = ""

print(f"Skipped {count_skipped} entries")

Skipped 0 entries


In [23]:
for _, value in lls_datasets.items():
    results_with_reference_answers = value

results_with_reference_answers[0:2]

[{'user_input': "Based on the 2024 Annual Report, what are the key factors that IBM anticipates will drive its growth in 2025, and how might these factors influence the company's financial performance?",
  'user_input_type': 'reasoning',
  'user_input_topic': 'Future outlook and guidance for IBM in 2025',
  'user_profile': 'Professional stock market analyst',
  'reference': "Based on the 2024 Annual Report, IBM anticipates that its growth in 2025 will be driven by several key factors:\n\n1. **Continued Investment in AI and Hybrid Cloud**: IBM's strategy focuses on leveraging AI and hybrid cloud technologies to unlock the full value of data for clients. The company has made significant investments in these areas, which are expected to continue driving growth. This focus is likely to enhance IBM's software offerings and consulting services, leading to increased demand and revenue.\n\n2. **Expansion of Generative AI Offerings**: IBM's generative AI business has grown significantly, with a

In [24]:
# Save checkpoint
evaluation_utilities.write_json(lls_datasets, f"./questions_and_reference_answers_and_system_answers-lls-{count}.json")

## Run RAG with LlamaIndex

This attempts to recreate the flow in https://docs.llamaindex.ai/en/stable/understanding/rag/ , i.e., the RAG pipeline that a new user following the getting-started documentation would build, except that it is configured with the following elements:

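- Content is loaded from the local directory configured in CONTENT_LOCATION (assumed to hold the same documents as CONTENT_URLS at the top of this notebook)
- LlamaIndex's default in-memory vector store (the code below does not configure an external vector database such as Milvus)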
- granite-embedding-125m embedding model
- gpt-3.5-turbo generative model
- max_tokens for output is 4096
- number of search results to return is 5
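
To make the last bullet concrete: returning the top 5 search results means ranking the indexed chunks by how similar their embeddings are to the question's embedding and keeping the best 5. The sketch below shows the idea with cosine similarity on toy vectors. The function names and data are hypothetical, and this is not LlamaIndex's actual implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_chunks(query_vector, chunk_vectors, k):
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vector, vector), chunk_id)
        for chunk_id, vector in chunk_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy two-dimensional "embeddings"; real embedding vectors have hundreds of dimensions
toy_chunks = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.0, 1.0],
    "chunk-c": [1.0, 1.0],
}
print(top_k_chunks([1.0, 0.0], toy_chunks, k=2))
```

A larger k gives the generator more context to draw on but also consumes more of the model's context window.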

In [25]:
documents = SimpleDirectoryReader(CONTENT_LOCATION).load_data()
vector_index = VectorStoreIndex.from_documents(documents=documents, embed_model=EMBED_MODEL_FOR_LLAMAINDEX)
vector_index.as_query_engine()



<llama_index.core.query_engine.retriever_query_engine.RetrieverQueryEngine at 0x3bc807c50>

In [26]:
# Get an arbitrary model for testing
for _, value in LLAMA_INDEX_RAG_MODELS.items():
    model = value

query_engine = vector_index.as_query_engine(llm=model, similarity_top_k=NUMBER_OF_SEARCH_RESULTS)

In [27]:
question = "Why does IBM exist?"
result = query_engine.query(question)
result.response.strip()

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


"IBM exists to provide integrated solutions and products that leverage technology and business expertise, with a focus on addressing the hybrid cloud and AI opportunity. They aim to support clients' digital transformations, help them engage with customers and employees in new ways, and deliver innovation and productivity through their technology and consulting capabilities. IBM's mission is to drive technology-led business growth for their clients by offering a differentiated portfolio that includes software, consulting services, and a deep incumbency in mission-critical systems."

In [28]:
def run_li_rag(qna, generator_models, idx, number_of_search_results):
    outputs = {}
    for generator_model_id, generator_model in generator_models.items():
        dataset = copy.deepcopy(qna)

        retriever = VectorIndexRetriever(
            index=idx,
            similarity_top_k=number_of_search_results,
        )

        query_engine = RetrieverQueryEngine(
            retriever=retriever,
            response_synthesizer=get_response_synthesizer(llm = generator_model)
        )

        i = 1
        count = len(dataset)
        for entry in dataset:
            clear_output(wait=True)
            print(generator_model_id)
            print(f"{i} / {count}")
            i += 1
            question = entry["user_input"]
            result = evaluation_utilities.run_with_retries(
                    lambda: query_engine.query(question),
                    MAX_RETRIES,
                    DELAY
                )
            entry["response"] = result.response.strip()
            entry["retrieved_contexts"] = [n.text for n in result.source_nodes]
        outputs["LlamaIndex" + ":" + generator_model_id] = dataset
    return outputs

In [29]:
li_datasets = run_li_rag(loaded_data, LLAMA_INDEX_RAG_MODELS, vector_index, number_of_search_results=NUMBER_OF_SEARCH_RESULTS)

gpt-3.5-turbo
20 / 20


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [30]:
for _, value in li_datasets.items():
    results_with_reference_answers = value

results_with_reference_answers[0:2]

[{'user_input': "Based on the 2024 Annual Report, what are the key factors that IBM anticipates will drive its growth in 2025, and how might these factors influence the company's financial performance?",
  'user_input_type': 'reasoning',
  'user_input_topic': 'Future outlook and guidance for IBM in 2025',
  'user_profile': 'Professional stock market analyst',
  'reference': "Based on the 2024 Annual Report, IBM anticipates that its growth in 2025 will be driven by several key factors:\n\n1. **Continued Investment in AI and Hybrid Cloud**: IBM's strategy focuses on leveraging AI and hybrid cloud technologies to unlock the full value of data for clients. The company has made significant investments in these areas, which are expected to continue driving growth. This focus is likely to enhance IBM's software offerings and consulting services, leading to increased demand and revenue.\n\n2. **Expansion of Generative AI Offerings**: IBM's generative AI business has grown significantly, with a

## Combine the two sets of outputs

Here we combine the outputs from Llama Stack and LlamaIndex into one data structure and save it to disk.  This data structure will then be used in later sections to do the evaluation.
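
The shape of the combined structure can be sketched as follows: each system's results are a dictionary keyed by a "system:model" label, and the two dictionaries are merged and written to JSON. The function `combine_and_checkpoint`, the toy entries, and the file name are hypothetical; the notebook itself relies on `evaluation_utilities.write_json` and `read_json`, whose implementations are not shown here.

```python
import json
import os
import tempfile


def combine_and_checkpoint(first, second, path):
    """Merge two result dictionaries, save them as JSON, and reload them."""
    overlapping = first.keys() & second.keys()
    if overlapping:
        # The | operator would silently keep only the right-hand value
        raise ValueError(f"Duplicate dataset labels: {sorted(overlapping)}")
    combined = first | second  # dict union requires Python 3.9 or later
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(combined, handle, indent=2)
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Toy data with the same shape as the notebook's datasets
toy_lls = {"Llama Stack:model-x": [{"user_input": "Q1", "response": "A1"}]}
toy_li = {"LlamaIndex:model-x": [{"user_input": "Q1", "response": "A2"}]}

with tempfile.TemporaryDirectory() as folder:
    reloaded = combine_and_checkpoint(toy_lls, toy_li, os.path.join(folder, "checkpoint.json"))
print(list(reloaded.keys()))
```

Checking for overlapping labels first matters because two datasets with the same label would otherwise collapse into one without any warning.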

In [31]:
datasets = lls_datasets | li_datasets

datasets.keys()

dict_keys(['Llama Stack:gpt-3.5-turbo', 'LlamaIndex:gpt-3.5-turbo'])

In [32]:
count

20

In [33]:
# Save checkpoint
evaluation_utilities.write_json(datasets, f"./questions_and_reference_answers_and_system_answers-{count}.json")

In [34]:
# Re-load the checkpoint from disk.  This is just here to make it easier to restart the notebook from this point.

datasets = evaluation_utilities.read_json(f"./questions_and_reference_answers_and_system_answers-{count}.json")

In [35]:
data_with_reference_answers = {}
data_without_reference_answers = {}

for key, dataset in datasets.items():
    data_with_reference_answers[key] = []
    data_without_reference_answers[key] = []

    for entry in dataset:
        if entry['has_reference_answer']:
            data_with_reference_answers[key].append(entry)
        else:
            data_without_reference_answers[key].append(entry)

In [36]:
print(f"{len(list(data_with_reference_answers.values())[0])} of {len(list(datasets.values())[0])} have reference answers")
print(f"{len(list(data_without_reference_answers.values())[0])} of {len(list(datasets.values())[0])} do not have reference answers")

15 of 20 have reference answers
5 of 20 do not have reference answers


## Evaluate results with reference answers using Ragas and the evaluator model

In [37]:
# Reference for LlamaIndexLLMWrapper: https://docs.ragas.io/en/stable/howtos/integrations/_llamaindex/#building-the-queryengine
evaluator_llm_for_ragas = LlamaIndexLLMWrapper(evaluator_model)

# Rubrics from https://github.com/instructlab/eval/blob/main/src/instructlab/eval/ragas.py which got them from ragas v0.2.11
# and has them "hardcoded in case ragas makes any changes to their DEFAULT_WITH_REFERENCE_RUBRICS in the future".
SCORING_RUBRICS = {
    "score1_description": "The response is entirely incorrect, irrelevant, or does not align with the reference in any meaningful way.",
    "score2_description": "The response partially matches the reference but contains major errors, significant omissions, or irrelevant information.",
    "score3_description": "The response aligns with the reference overall but lacks sufficient detail, clarity, or contains minor inaccuracies.",
    "score4_description": "The response is mostly accurate, aligns closely with the reference, and contains only minor issues or omissions.",
    "score5_description": "The response is fully accurate, completely aligns with the reference, and is clear, thorough, and detailed.",
}

metrics = [
    AnswerAccuracy(llm=evaluator_llm_for_ragas),
    RubricsScore(llm=evaluator_llm_for_ragas, rubrics=SCORING_RUBRICS)
]

In [38]:
results_with_reference_answers = evaluation_utilities.run_ragas(data_with_reference_answers, evaluator_llm_for_ragas, metrics)

Batch 1/8:   0%|          | 0/4 [00:00<?, ?it/s]INFO:ragas.llms.base:temperature kwarg passed to LlamaIndex LLM
INFO:ragas.llms.base:callbacks not supported for LlamaIndex LLMs, ignoring callbacks
INFO:ragas.llms.base:temperature kwarg passed to LlamaIndex LLM
INFO:ragas.llms.base:callbacks not supported for LlamaIndex LLMs, ignoring callbacks
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:ragas.llms.base:temperature kwarg passed to LlamaIndex LLM
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:ragas.llms.base:temperature kwarg passed to LlamaIndex LLM
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 1/8:  25%|██▌       | 1/4 [00:01<00:03,  1.08s/it]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 1/8:  75%

In [39]:
results_with_reference_answers

{'Llama Stack:gpt-3.5-turbo': {'nv_accuracy': 0.5500, 'domain_specific_rubrics': 4.4667},
 'LlamaIndex:gpt-3.5-turbo': {'nv_accuracy': 0.5167, 'domain_specific_rubrics': 4.2667}}

In [40]:
keys = list(results_with_reference_answers.keys())
print(keys[0])
results_with_reference_answers[keys[0]].to_pandas().head()

Llama Stack:gpt-3.5-turbo


| | user_input | reference_contexts | response | reference | nv_accuracy | domain_specific_rubrics |
|---|---|---|---|---|---|---|
| 0 | Based on the 2024 Annual Report, what are the ... | [## 2024 Performance\n\nFor the year, IBM gene... | Based on the information from IBM's 2024 Annua... | Based on the 2024 Annual Report, IBM anticipat... | 0.75 | 5 |
| 1 | Can you tell me how IBM has grown and what kin... | [## IBM Strategy\n\nOver the past 5 years, IBM... | IBM has experienced significant growth and mad... | Over the years, IBM has strategically shifted ... | 0.75 | 5 |
| 2 | Can you tell me some fun and surprising things... | [## DESCRIPTION OF BUSINESS\n\nPlease refer to... | Here are some fun and surprising facts about I... | Sure! Here are some fun and surprising things ... | 0.0 | 2 |
| 3 | Can you tell me some stories about how IBM's t... | [## IBM's differentiated portfolio value\n\nIB... | Here are some stories about how IBM's technolo... | IBM's technology has played a significant role... | 0.5 | 4 |
| 4 | Given IBM's current debt levels and credit rat... | [## Debt\n\nOur funding requirements are conti... | Based on the information retrieved, IBM curren... | Given IBM's current debt levels and credit rat... | 0.75 | 4 |


In [41]:
print(keys[1])
results_with_reference_answers[keys[1]].to_pandas().head()

LlamaIndex:gpt-3.5-turbo


Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,nv_accuracy,domain_specific_rubrics
0,"Based on the 2024 Annual Report, what are the ...",[IBM 2024 Annual Report 1\nArvind Krishna\nCha...,"[## 2024 Performance\n\nFor the year, IBM gene...",IBM anticipates that the key factors driving i...,"Based on the 2024 Annual Report, IBM anticipat...",0.75,5
1,Can you tell me how IBM has grown and what kin...,[We recognize \nthat no single company can pro...,"[## IBM Strategy\n\nOver the past 5 years, IBM...",IBM has shifted its focus to higher growth are...,"Over the years, IBM has strategically shifted ...",0.5,5
2,Can you tell me some fun and surprising things...,"[4\n(Ehningen, Germany). We also announced a p...",[## DESCRIPTION OF BUSINESS\n\nPlease refer to...,IBM has a strong focus on cutting-edge technol...,Sure! Here are some fun and surprising things ...,0.0,2
3,Can you tell me some stories about how IBM's t...,[We recognize \nthat no single company can pro...,[## IBM's differentiated portfolio value\n\nIB...,IBM's technology has been instrumental in tran...,IBM's technology has played a significant role...,0.25,4
4,Given IBM's current debt levels and credit rat...,[Our debt covenants are well \nwithin the requ...,[## Debt\n\nOur funding requirements are conti...,IBM might consider strategies such as managing...,Given IBM's current debt levels and credit rat...,0.5,4


## Evaluate results without reference answers

For questions with no reference answer, the data set indicates that the correct behavior is to decline to answer.  So we measure how often each system correctly refuses to answer in those cases.

In [42]:
list(data_without_reference_answers.values())[0][0:3]

[{'user_input': "Can you give an example of a problem that IBM's technology helped solve for a school or a community?",
  'user_input_type': 'reasoning',
  'user_input_topic': "How IBM Helps People and Businesses: Examples of how IBM's technology is used to solve problems and improve lives.",
  'user_profile': 'Fifth grader who wants to learn about IBM',
  'reference': "The provided context does not include a specific example of a problem that IBM's technology helped solve for a school or a community. It focuses on IBM's strategy of collaborating with clients and partners, using AI and hybrid cloud technologies, and mentions the use of IBM technology in resolving HR inquiries internally. However, it does not provide details about applications in educational or community settings.",
  'reference_contexts': ["## Collaborating to create value with clients and ecosystem partners\n\nBuilding our ecosystem is core to our overall strategy, focusing on helping clients transform their core oper

In [43]:
results_without_reference_answers = evaluation_utilities.run_evaluation_of_questions_without_reference_answers(data_without_reference_answers, evaluator_model)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [44]:
list(results_without_reference_answers.values())[0].head()

Unnamed: 0,Percent Unanswered
0,1
1,0
2,1
3,0
4,0
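
The summary figure reported later for this system is the mean of this 0/1 column. A minimal sketch using the five values shown above, assuming 1 means the evaluator judged that the system declined to answer (the variable names are illustrative and not taken from `evaluation_utilities`):

```python
from statistics import mean

# 1 = judged as a refusal, 0 = the system answered (assumed encoding).
# Values copied from the "Percent Unanswered" column shown above.
unanswered_flags = [1, 0, 1, 0, 0]

# The fraction of questions that the system declined to answer.
refusal_rate = mean(unanswered_flags)
print(f"{refusal_rate:.4f}")
```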


## Assess statistical significance

Here we use Fisher's randomization test to judge whether the differences between the systems are statistically significant, both for the full set and for the subset marked as having complete reference answers.  We set `permutation_type='samples'` because every system answers the same questions in the same order, which makes this a paired-sample test.
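
To make the paired design concrete, here is a minimal pure-Python sketch of an exact paired randomization (sign-flip) test on the difference of means. It is an illustration only, not the code in `evaluation_utilities`, which is not shown here; `scipy.stats.permutation_test` with `permutation_type='samples'` performs the analogous computation. The function name is hypothetical, and the scores are just the first five `nv_accuracy` values displayed above for each system, so the p-value differs from the one reported for all 15 questions.

```python
import itertools
from statistics import mean

def paired_randomization_test(x, y):
    """Exact two-sided paired randomization test on mean(x) - mean(y)."""
    diffs = [a - b for a, b in zip(x, y)]
    observed = mean(diffs)
    extreme = 0
    total = 0
    # Under the null hypothesis the two labels are exchangeable within each
    # question, so every per-question difference may have its sign flipped.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        stat = mean(s * d for s, d in zip(signs, diffs))
        if abs(stat) >= abs(observed) - 1e-12:  # tolerance for float round-off
            extreme += 1
        total += 1
    return observed, extreme / total

# First five nv_accuracy scores shown above (same questions, same order).
llama_stack = [0.75, 0.75, 0.0, 0.5, 0.75]
llama_index = [0.75, 0.5, 0.0, 0.25, 0.5]

observed, p_value = paired_randomization_test(llama_stack, llama_index)
print(f"mean difference = {observed:.4f}, p_value = {p_value:.4f}")
```

Enumerating every sign pattern is only practical for a handful of questions; with larger sets a random sample of sign patterns is used instead.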

In [45]:
import itertools

result_pairs = list(itertools.combinations(results_with_reference_answers.keys(), 2))
result_summary = []
result_pairs

[('Llama Stack:gpt-3.5-turbo', 'LlamaIndex:gpt-3.5-turbo')]

In [46]:
subset_of_rows = []

In [47]:
results0 = results_with_reference_answers[result_pairs[0][0]].to_pandas()
results0


Unnamed: 0,user_input,reference_contexts,response,reference,nv_accuracy,domain_specific_rubrics
0,"Based on the 2024 Annual Report, what are the ...","[## 2024 Performance\n\nFor the year, IBM gene...",Based on the information from IBM's 2024 Annua...,"Based on the 2024 Annual Report, IBM anticipat...",0.75,5
1,Can you tell me how IBM has grown and what kin...,"[## IBM Strategy\n\nOver the past 5 years, IBM...",IBM has experienced significant growth and mad...,"Over the years, IBM has strategically shifted ...",0.75,5
2,Can you tell me some fun and surprising things...,[## DESCRIPTION OF BUSINESS\n\nPlease refer to...,Here are some fun and surprising facts about I...,Sure! Here are some fun and surprising things ...,0.0,2
3,Can you tell me some stories about how IBM's t...,[## IBM's differentiated portfolio value\n\nIB...,Here are some stories about how IBM's technolo...,IBM's technology has played a significant role...,0.5,4
4,Given IBM's current debt levels and credit rat...,[## Debt\n\nOur funding requirements are conti...,"Based on the information retrieved, IBM curren...",Given IBM's current debt levels and credit rat...,0.75,4
5,How are the industry trends discussed in IBM's...,"[## 2024 Performance\n\nFor the year, IBM gene...",The IBM 2024 Annual Report highlights the comp...,IBM's 2024 Annual Report highlights several in...,0.5,5
6,How are the key challenges and opportunities i...,"[## Dear IBM Investor:\n\nIn 2024, IBM made si...",The key challenges and opportunities identifie...,IBM's 2024 Annual Report highlights several ke...,0.5,5
7,How are the new technologies IBM is working on...,[## Looking Forward\n\nTechnology has proven t...,IBM is working on various new technologies suc...,IBM is working on several new technologies tha...,0.5,4
8,How can understanding IBM's work in technology...,[## A new era of innovation\n\nThe mission of ...,"IBM's work in technology, particularly in area...",Understanding IBM's work in technology can ins...,0.75,5
9,How could IBM's approach to addressing geopoli...,[## DESCRIPTION OF BUSINESS\n\nPlease refer to...,IBM's approach to addressing geopolitical tens...,IBM's approach to addressing geopolitical tens...,0.5,4


Here we report the results for the questions whose reference answers actually answer the question.  The metrics measure the accuracy of the generated answers, assuming that the reference answers are correct.

Note that for large sample sizes, you might see a warning like this:

```
python3.11/site-packages/scipy/stats/_resampling.py:1492: RuntimeWarning: overflow encountered in scalar power
n_max = factorial(n_obs_sample)**n_samples
```

You can ignore that warning.  SciPy is computing the threshold for doing an exact test (enumerating all possible permutations); when that number is very large, it falls back to a random sampling test.  The overflow warning means that the number is too large to represent, so it is treated as effectively infinite and SciPy falls back to the random sampling test, just as it would if it could represent the exact threshold.
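
The sketch below illustrates why that count grows so quickly. It assumes a paired design with two systems, so each question contributes 2! = 2 possible arrangements, and it uses 9999 as the resampling budget because that is the documented default for `n_resamples` in `scipy.stats.permutation_test`; the value that `evaluation_utilities` actually passes is not shown here. The helper name is hypothetical.

```python
import math
import sys

def count_paired_rearrangements(n_questions, n_systems=2):
    # Each question's scores can be reassigned among the systems independently.
    return math.factorial(n_systems) ** n_questions

N_RESAMPLES = 9999  # SciPy's default; an assumption about this notebook

for n in (5, 15, 2000):
    total = count_paired_rearrangements(n)
    mode = "exact" if total <= N_RESAMPLES else "random sampling"
    # Python integers never overflow, but a 64-bit float would.
    exceeds_float = total > sys.float_info.max
    print(f"{n} questions: {mode}, exceeds float range: {exceeds_float}")
```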

In [48]:
print(f"Questions with reference answers ({len(results_with_reference_answers[next(iter(results_with_reference_answers.keys()))].to_pandas())})")
result_summary_for_rows_with_reference_answers = evaluation_utilities.report_results_with_significance(results_with_reference_answers, metrics)

Questions with reference answers (15)
Llama Stack:gpt-3.5-turbo LlamaIndex:gpt-3.5-turbo nv_accuracy
 Llama Stack:gpt-3.5-turbo                         :     0.5500
 LlamaIndex:gpt-3.5-turbo                          :     0.5167
 p_value                                           :     0.6949
  p_value>=0.05 so this result is NOT statistically significant.
  You can conclude that there is not enough data to tell which is better.
  Note that this data includes 15 questions which typically produces a margin of error of around +/-25.8%.
  So the two are probably roughly within that margin of error or so.
Llama Stack:gpt-3.5-turbo LlamaIndex:gpt-3.5-turbo domain_specific_rubrics
 Llama Stack:gpt-3.5-turbo                         :     4.4667
 LlamaIndex:gpt-3.5-turbo                          :     4.2667
 p_value                                           :     0.3728
  p_value>=0.05 so this result is NOT statistically significant.
  You can conclude that there is not enough data to tell whi

Here we report the results for the questions that have no reference answers, i.e., where the reference data asserts that the correct behavior is not to answer.  We measure how often each system correctly refused to answer those questions.  Higher is better, because refusing to answer is the correct behavior according to the reference.

In [49]:
class Metric(NamedTuple):
    name: str

print(f"Questions without reference answers ({len(results_without_reference_answers[next(iter(results_without_reference_answers.keys()))])})")
result_summary_for_rows_without_reference_answers = evaluation_utilities.report_results_with_significance(results_without_reference_answers, [Metric(name="Percent Unanswered")])

Questions without reference answers (5)
Llama Stack:gpt-3.5-turbo LlamaIndex:gpt-3.5-turbo Percent Unanswered
 Llama Stack:gpt-3.5-turbo                         :     0.4000
 LlamaIndex:gpt-3.5-turbo                          :     0.2000
 p_value                                           :     1.0000
  p_value>=0.05 so this result is NOT statistically significant.
  You can conclude that there is not enough data to tell which is better.
  Note that this data includes 5 questions which typically produces a margin of error of around +/-44.7%.
  So the two are probably roughly within that margin of error or so.
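
The margin-of-error figures printed in the two reports above are consistent with the common rule of thumb 1/sqrt(n). This is an inference from the printed numbers, not something confirmed from the `evaluation_utilities` source, and the function name below is hypothetical:

```python
import math

def approx_margin_of_error(n_questions):
    # Rule of thumb inferred from the printed figures: 1 / sqrt(n).
    return 1.0 / math.sqrt(n_questions)

for n in (15, 5):
    print(f"{n} questions: +/-{approx_margin_of_error(n):.1%}")
```

Because the margin shrinks only with the square root of the sample size, halving it requires roughly four times as many questions.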


## Write reports

In [50]:
# Note that the reports only include the results for the questions that have reference answers.  In the future, it would be good to include the results for the questions that have no reference answers.

evaluation_utilities.write_excel(results_with_reference_answers, result_summary_for_rows_with_reference_answers, None, f"report-lls-vs-li-{count}.xlsx")