# Evaluating different models from the Azure AI catalog with LlamaIndex

This notebook demonstrates how to use multiple models from Azure AI Studio and pick the right model for each job. We will use LlamaIndex to build a RAG system and select models from different providers, taking advantage of the strengths each one has for a given task.

## Preparing

In this example, we will use multiple models deployed in this project, including Phi-3, Cohere Command R+, Cohere Embed V3, Mistral Large, and OpenAI GPT-4o. Endpoint URLs and keys are stored in the `.env` file. Please update it accordingly:

In [57]:
import os
from dotenv import load_dotenv

load_dotenv(".env")

True
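
Every later cell reads its endpoint and key from `os.environ`, so a missing entry would only surface as a `KeyError` further down the notebook. The following is a minimal sketch of an early check, using the variable names this notebook reads. The helper name `missing_settings` is hypothetical and not part of any library:

```python
import os

# Environment variables read by later cells in this notebook.
REQUIRED_SETTINGS = [
    "AZURE_AI_COHERE_CMDR_ENDPOINT_URL",
    "AZURE_AI_COHERE_CMDR_ENDPOINT_KEY",
    "AZURE_AI_COHERE_EMBED_ENDPOINT_URL",
    "AZURE_AI_COHERE_EMBED_ENDPOINT_KEY",
    "AZURE_AI_PHI3_MINI_ENDPOINT_URL",
    "AZURE_AI_PHI3_MINI_ENDPOINT_KEY",
    "AZURE_AI_MISTRAL_ENDPOINT_URL",
    "AZURE_AI_MISTRAL_ENDPOINT_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
]


def missing_settings(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_settings(REQUIRED_SETTINGS)
if missing:
    print("Missing settings in .env:", ", ".join(missing))
```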

Let's allow nested event loops so that asynchronous operations work inside the notebook.

In [42]:
import nest_asyncio

nest_asyncio.apply()

## Configure instrumentation

We will use LlamaIndex to build a RAG system that answers questions about the Paul Graham dataset. To identify opportunities for improvement, we use PromptFlow Tracing for tracing and monitoring. The following section configures automatic instrumentation of LlamaIndex and connects it to a PromptFlow server instance running locally:

In [3]:
from promptflow.tracing import start_trace
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

In [4]:
start_trace()



Let's configure instrumentation:

In [6]:
instrumentor = LlamaIndexInstrumentor()
instrumentor.instrument()

## Building a RAG system with models from the catalog

Let's use the Cohere model ecosystem to implement our RAG solution. Cohere models are optimized for RAG patterns and work across a wide range of languages, especially when using Cohere Embed V3 Multilingual:

In [8]:
from llama_index.core import Settings
from llama_index.core import SimpleDirectoryReader
from llama_index.core import StorageContext
from llama_index.core import SummaryIndex
from llama_index.core import VectorStoreIndex
from llama_index.core.selectors import LLMSingleSelector

from llama_index.llms.azure_inference import AzureAICompletionsModel
from llama_index.embeddings.azure_inference import AzureAIEmbeddingsModel

Cohere Command R+:

In [9]:
llm = AzureAICompletionsModel(
    endpoint=os.environ["AZURE_AI_COHERE_CMDR_ENDPOINT_URL"],
    credential=os.environ['AZURE_AI_COHERE_CMDR_ENDPOINT_KEY']
)

Cohere Embed V3 - Multilingual:

In [10]:
embed_model = AzureAIEmbeddingsModel(
    endpoint=os.environ["AZURE_AI_COHERE_EMBED_ENDPOINT_URL"],
    credential=os.environ['AZURE_AI_COHERE_EMBED_ENDPOINT_KEY'],
)

### Building the index

To demonstrate how to use different models, let's first create an index using Cohere models.

In [11]:
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 1024

In this example, we will use the Paul Graham dataset.

In [12]:
documents = SimpleDirectoryReader("data/paul_graham").load_data()

Once we have the documents, we create nodes by chunking them according to the configured settings:

In [13]:
nodes = Settings.node_parser.get_nodes_from_documents(documents)
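
To build intuition for what `chunk_size` controls, here is a toy word-based splitter. It is only a sketch: the node parser used above works on tokens and sentence boundaries rather than whitespace-separated words, and the function name `chunk_words` and its `overlap` parameter are assumptions made for illustration.

```python
def chunk_words(text, chunk_size, overlap=0):
    """Split `text` into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words. This is a toy stand-in,
    not the splitting logic of the node parser used in the notebook.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


print(chunk_words("a b c d e", chunk_size=3, overlap=1))
```

A larger chunk size gives each node more context but makes retrieval less precise; a smaller one does the opposite.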

Let's initialize the storage context. By default it is in-memory, so we don't have to worry about persisting the nodes:

In [14]:
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

### Search tools

Our RAG system will be able to answer questions that require summarizing multiple sources of information, as well as questions that need only a simpler retrieval strategy.

#### Tree summarize

The summary index is a simple data structure where nodes are stored in a sequence. During index construction, the document texts are chunked up, converted to nodes, and stored in a list. During query time, the summary index iterates through the nodes with some optional filter parameters, and synthesizes an answer from all the nodes.

In [15]:
summary_index = SummaryIndex(nodes, storage_context=storage_context)

In [16]:
summarize_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=True,
)
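
The idea behind a tree-style summary can be sketched without any model: summarize small groups of node texts, then summarize the summaries, until a single text remains. The sketch below uses a fixed group size (`fan_in`) and a fake summarizer that only joins strings, so the shape of the tree is visible. Both are assumptions made for illustration; the library decides how many chunks go into each model call.

```python
def tree_summarize(texts, summarize, fan_in=2):
    """Repeatedly summarize groups of `fan_in` texts until one text remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not texts:
        return ""
    level = list(texts)
    while len(level) > 1:
        # Each pass shrinks the list by roughly a factor of `fan_in`.
        level = [
            summarize(level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return level[0]


def fake_summarize(group):
    """Stand-in for an LLM call: joins the group instead of summarizing it."""
    return "(" + " + ".join(group) + ")"


print(tree_summarize(["n1", "n2", "n3", "n4"], fake_summarize))
# prints: ((n1 + n2) + (n3 + n4))
```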

#### Vector index

`VectorStoreIndex` only stores nodes in the document store when the vector store does not store the text itself.

In [17]:
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)

In [18]:
vector_query_engine = vector_index.as_query_engine()
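
Conceptually, a vector query embeds the question and returns the nodes whose embeddings are closest to it. The following sketch shows that ranking step with cosine similarity on tiny hand-written vectors. The node ids and 3-dimensional embeddings are hypothetical; real embeddings come from the embedding model configured earlier and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, node_vectors, k=2):
    """Return the ids of the `k` nodes whose vectors are most similar to the query."""
    ranked = sorted(
        node_vectors,
        key=lambda node_id: cosine_similarity(query_vector, node_vectors[node_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embeddings, for illustration only.
node_vectors = {
    "node-a": [1.0, 0.0, 0.0],
    "node-b": [0.7, 0.7, 0.0],
    "node-c": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.1, 0.0], node_vectors, k=2))
# prints: ['node-a', 'node-b']
```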

#### Constructing the query engine with the search tools

In [19]:
from llama_index.core.tools import QueryEngineTool

summary_tool = QueryEngineTool.from_defaults(
    query_engine=summarize_query_engine,
    description=(
        "Useful for summarization questions related to Paul Graham essay on"
        " What I Worked On."
    ),
)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description=(
        "Useful for retrieving specific context from Paul Graham essay on What"
        " I Worked On."
    ),
)

In [20]:
from llama_index.core.query_engine import RouterQueryEngine

query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[
        summary_tool,
        vector_tool,
    ],
)
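
The router's job is to pick exactly one tool per query based on the tool descriptions, then forward the query to it. The sketch below mimics that flow with a deliberately naive selector that counts shared words. This is only a stand-in: the selector used above prompts an LLM to make the choice, and the names `select_tool` and `route` are hypothetical.

```python
def select_tool(query, tool_descriptions):
    """Pick the tool whose description shares the most words with the query.

    Naive stand-in for an LLM-based selector, used only to show the flow.
    """
    query_words = set(query.lower().split())

    def overlap(name):
        return len(query_words & set(tool_descriptions[name].lower().split()))

    return max(tool_descriptions, key=overlap)


def route(query, tool_descriptions, engines):
    """Send the query to the single engine chosen by `select_tool`."""
    chosen = select_tool(query, tool_descriptions)
    return chosen, engines[chosen](query)


tool_descriptions = {
    "summary": "Useful for summarization questions related to Paul Graham essay on What I Worked On.",
    "vector": "Useful for retrieving specific context from Paul Graham essay on What I Worked On.",
}
# Fake engines that only report which path handled the query.
engines = {
    "summary": lambda q: "answered by the summary engine",
    "vector": lambda q: "answered by the vector engine",
}

print(route("Give me a summarization of the essay", tool_descriptions, engines))
```

Because the choice depends entirely on the descriptions, wording them clearly matters as much as the selector itself.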

Let's see how this works:

In [21]:
response = query_engine.query("What is the summary of the document?")
print(str(response))

The document is a collection of essays by Paul Graham, detailing his journey as a writer, programmer, entrepreneur, and investor. Graham reflects on his early experiences with writing and programming, his transition into the field of artificial intelligence (AI) during his college years, and his realization of the limitations of AI as it was practiced at the time. He discusses his interest in the Lisp programming language and his desire to build something that would last, which leads him to pursue art and attend art school. Graham also shares his experiences working at Interleaf, a software company, and his subsequent return to art school. He then narrates the founding of his first company, Viaweb, an e-commerce software startup, and the challenges and successes he faced as an entrepreneur. After selling Viaweb to Yahoo, Graham focuses on open-source projects, including a new dialect of Lisp called Arc, and continues to publish essays online. He also meets his future partner, Jessica L

In [22]:
response = query_engine.query("What did Paul Graham do after RICS?")
print(str(response))

After leaving RICS, Paul Graham returned to New York and resumed his previous life, but with the added financial freedom from his work at RICS. He continued to paint and experiment with new techniques, and also began searching for a new apartment to buy. During this time, he had an idea for a new startup, which would eventually lead to the creation of Y Combinator.


## Using a smaller model for simpler tasks

Using exactly the same class, `AzureAICompletionsModel`, we can instantiate another model, in this case a Phi-3-mini-4K model.

In [23]:
slm = AzureAICompletionsModel(
    endpoint=os.environ["AZURE_AI_PHI3_MINI_ENDPOINT_URL"],
    credential=os.environ['AZURE_AI_PHI3_MINI_ENDPOINT_KEY']
)

Now, let's configure the `RouterQueryEngine` to use Phi-3 for the routing task instead of the larger model:

In [24]:
query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(llm=slm),
    query_engine_tools=[
        summary_tool,
        vector_tool,
    ],
)

In [25]:
response = query_engine.query("What did Paul Graham do after RICS?")
print(str(response))

After leaving RICS, Paul Graham returned to New York and resumed his previous life, but with the added financial freedom from his work at RICS. He continued to paint and experiment with new techniques, combining traditional painting with photography and printmaking. He also began looking for a new apartment to buy, contemplating which neighborhood would be the best fit for him. During this time, he had an idea for a new startup, which would eventually lead to the creation of a new company and the development of several open-source software projects.


## Build an evaluation dataset

Let's build an evaluation dataset to measure the effect of changing the model. We will use another LLM to generate the examples, in this case Mistral Large, which is a good model for RAG:

In [27]:
generator_llm = AzureAICompletionsModel(
    endpoint=os.environ["AZURE_AI_MISTRAL_ENDPOINT_URL"],
    credential=os.environ["AZURE_AI_MISTRAL_ENDPOINT_KEY"],
    temperature=0,
)

In [140]:
from llama_index.core.llama_dataset import LabelledRagDataset
from llama_index.core.llama_dataset.generator import RagDatasetGenerator

Let's create the generator:

In [29]:
dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    llm=generator_llm,
    num_questions_per_chunk=2,
)

In [30]:
rag_dataset = dataset_generator.generate_questions_from_nodes()

100%|██████████| 132/132 [00:00<00:00, 1214140.63it/s]


Let's see an example:

In [150]:
print("Query:", rag_dataset[1].query)
print("Context:", rag_dataset[1].reference_contexts[0][:50], "...")

Query: In the context of the author's early programming experiences, what were the limitations of the IBM 1401 mainframe computer that hindered his ability to write meaningful programs?
Context: What I Worked On

February 2021

Before college th ...


Let's save the examples:

In [134]:
rag_dataset.save_json("evals/pg_rag_dataset.json")

We can reload them as follows:

In [141]:
rag_dataset = LabelledRagDataset.from_json("evals/pg_rag_dataset.json")

### Use evaluations for retrieval

__FaithfulnessEvaluator__

`FaithfulnessEvaluator` measures whether the response from a query engine is supported by any of the source nodes. This is useful for detecting hallucinated responses.

__RelevancyEvaluator__

`RelevancyEvaluator` measures whether the response and the source nodes match the query. This is useful for checking whether the response actually answers the query.

In [89]:
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.core.evaluation import BatchEvalRunner
from llama_index.core.evaluation import RelevancyEvaluator, FaithfulnessEvaluator

In this case, let's use a more powerful model, GPT-4, as the judge:

In [119]:
gpt4judge = AzureOpenAI(
    deployment="gpt-4",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2023-07-01-preview"
)

Configure the evaluators with this LLM:

In [121]:
relevancy_evaluator = RelevancyEvaluator(llm=gpt4judge)
faithfulness_evaluator = FaithfulnessEvaluator(llm=gpt4judge)

Let's build a list containing only the `query` property of each example.

In [153]:
batch_eval_queries = [sample.query for sample in rag_dataset[1::2]]

> `rag_dataset[1::2]` retrieves the odd indexes only, since the dataset contains "question 1:" as part of the generation. Fixing this probably requires changing the generation template.
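
As a quick reminder of the slice semantics, `[1::2]` starts at index 1 and takes every second item. The list below is a hypothetical stand-in for the generated queries:

```python
# Hypothetical stand-in for the generated queries.
queries = ["q0", "q1", "q2", "q3", "q4"]

odd_positions = queries[1::2]   # start at index 1, take every second item
even_positions = queries[0::2]  # the items that the slice above skips

print(odd_positions)   # ['q1', 'q3']
print(even_positions)  # ['q0', 'q2', 'q4']
```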

A `BatchEvalRunner` will allow us to run evaluations over the whole dataset:

In [155]:
runner = BatchEvalRunner(
    {
        "faithfulness": faithfulness_evaluator,
        "relevancy": relevancy_evaluator
    },
    workers=8,
)

Compute the evaluations:

In [156]:
eval_results = await runner.aevaluate_queries(
    query_engine, queries=batch_eval_queries
)

Let's write the evaluation results:

In [166]:
import json

eval_results_dict = {}
eval_results_dict["faithfulness"] = [
    dict(result) for result in eval_results["faithfulness"]]
eval_results_dict["relevancy"] = [
    dict(result) for result in eval_results["relevancy"]]

with open("evals/pg_rag_eval_results_phi3.json", "w") as f:
    json.dump(eval_results_dict, f)

Compute the scores:

In [157]:
faithfulness_score = sum(
    result.passing for result in eval_results['faithfulness']) / len(eval_results['faithfulness'])
relevancy_score = sum(
    result.passing for result in eval_results['relevancy']) / len(eval_results['relevancy'])

Let's see the results:

In [158]:
print(f"Faithfulness Score: {faithfulness_score}")
print(f"Relevancy Score: {relevancy_score}")

Faithfulness Score: 0.9696969696969697
Relevancy Score: 0.9090909090909091
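
Each score is simply the fraction of evaluated queries marked as passing. The sketch below computes the same ratio with two extra guards: it skips results where `passing` is missing and avoids dividing by zero on an empty list. `FakeResult` is a hypothetical stand-in exposing only the `passing` field, and `pass_rate` is an illustrative helper, not a library function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResult:
    """Hypothetical stand-in exposing only the `passing` field used above."""
    passing: Optional[bool]


def pass_rate(results):
    """Fraction of scored results with passing=True; None when nothing was scored."""
    scored = [r for r in results if r.passing is not None]
    if not scored:
        return None
    return sum(1 for r in scored if r.passing) / len(scored)


results = [FakeResult(True), FakeResult(True), FakeResult(False), FakeResult(None)]
print(pass_rate(results))  # two of the three scored results pass
```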
