# Meta-Evaluation on LlamaIndex built-in evaluation

`LlamaIndex` has good documentation and [built-in support](https://gpt-index.readthedocs.io/en/latest/core_modules/supporting_modules/evaluation/usage_pattern.html) for evaluation.

## Installation

In [1]:
%pip install -qq llama_index=="0.8.22" pydantic nltk

Note: you may need to restart the kernel to use updated packages.


In [2]:
from llama_index import (
    VectorStoreIndex,
    SimpleWebPageReader,
    ServiceContext,
    LLMPredictor,
)
from llama_index.llms import OpenAI

## Loading Documents

[How to do great work](http://paulgraham.com/greatwork.html) is a wonderful blog post by Paul Graham. With `SimpleWebPageReader`, we can easily load it as documents, build an index, and get a query engine.

In [3]:
urls = ["http://paulgraham.com/greatwork.html"]
documents = SimpleWebPageReader(html_to_text=True).load_data(urls)

llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model="gpt-3.5-turbo"))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

index = VectorStoreIndex.from_documents(
    documents=documents, service_context=service_context
)
query_engine = index.as_query_engine()

In [5]:
query = "To do great work, should I follow my heart or my head?"
response = query_engine.query(query)
print(response)

Follow your heart.


Yes, for sure. Let's look into the sources too.

In [6]:
print(response.get_formatted_sources(length=250))

> Source (Doc id: 5f29d164-ca93-4480-bb22-dd38745dca3c): This is how practically everyone who's done great work
has done it, from painters to physicists.Steps two and four will require hard work.It may not be possible to prove
that you have to work hard to do great things, but the empirical evidence is
...

> Source (Doc id: 50bd6bd7-f8c3-443c-a6f3-c1198f62bfa7): Since it matters so much for this cycle to be
running in the right direction, it can be a good idea to switch to easier work
when you're stuck, just so you start to get something done.One of the biggest mistakes ambitious people make is to allow s...


## Preparing Questions

To run an evaluation on a QA system, we need questions. Conveniently, `LlamaIndex` ships a `DatasetGenerator` that generates them from the documents.

In [10]:
from llama_index.evaluation import DatasetGenerator

data_generator = DatasetGenerator.from_documents(documents)
questions = data_generator.generate_questions_from_nodes(num=2)

questions

['What are the three qualities that the work you choose needs to have according to the author?', 'How does the author suggest figuring out what to work on?']


In [None]:
from datasets import Dataset
from llama_index.evaluation import QueryResponseEvaluator

evaluator = QueryResponseEvaluator(service_context=service_context)

# Answer each generated question, then let the built-in evaluator judge the answer.
results = []

for query in questions:
    response = query_engine.query(query)
    result = evaluator.evaluate(query, response)
    results.append(result)

# Collect questions and verdicts side by side for inspection.
result2 = Dataset.from_dict({"input": questions, "prediction": results})
result2.to_pandas()

[QueryResponseEvaluator](https://gpt-index.readthedocs.io/en/latest/core_modules/supporting_modules/evaluation/usage_pattern.html#evaluting-query-response-for-answer-quality) checks whether the synthesized response matches the query plus any source context. Its implementation is in [llama_index/evaluation/base.py](https://github.com/jerryjliu/llama_index/blob/9acd9297860824ebc2c9c47358c05f387c62cff5/llama_index/evaluation/base.py#L226).
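
Conceptually, a check like this is an LLM-as-judge call: pack the query, the answer, and the retrieved chunks into one prompt and ask for a YES/NO verdict. The sketch below illustrates that idea only; it is not LlamaIndex's implementation. `JUDGE_TEMPLATE`, `build_judge_prompt`, `judge`, and the `llm` callable are all hypothetical names introduced here.

```python
from typing import Callable, List

# Hypothetical prompt wording (an assumption, not the template LlamaIndex ships).
JUDGE_TEMPLATE = (
    "Decide whether the response to the query is supported by the context.\n"
    "Answer YES or NO.\n\n"
    "Query: {query}\nResponse: {response}\nContext:\n{context}\n"
)


def build_judge_prompt(query: str, response: str, contexts: List[str]) -> str:
    """Fill the template with the query, the answer, and the retrieved chunks."""
    return JUDGE_TEMPLATE.format(
        query=query, response=response, context="\n---\n".join(contexts)
    )


def judge(
    query: str, response: str, contexts: List[str], llm: Callable[[str], str]
) -> str:
    """Send the prompt to `llm` (any prompt -> text callable) and normalize the verdict."""
    raw = llm(build_judge_prompt(query, response, contexts))
    # Anything that does not clearly start with YES is treated as NO.
    return "YES" if raw.strip().upper().startswith("YES") else "NO"
```

Passing a stub such as `lambda prompt: "YES"` for `llm` is enough to exercise the parsing logic without calling a model.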

In [25]:
from typing import List

from datasets import Dataset
from llama_index import Response

import fastrepl


def get_context(response: Response) -> List[str]:
    return [context_info.node.get_content() for context_info in response.source_nodes]


def get_input(query: str, r: Response) -> str:
    response = r.response
    context = get_context(r)
    return f"Query: {query}, Response: {response}, Context: {context}"


_ds = Dataset.from_dict({"question": questions})


def transform(row):
    query = row["question"]
    response = query_engine.query(query)
    row["input"] = get_input(query, response)
    return row


ds = _ds.map(transform, remove_columns=["question"])
ds

Dataset({
    features: ['input'],
    num_rows: 5
})
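
To see what each `input` row looks like without calling the query engine, the same formatting can be run against stand-in objects. This is a minimal sketch: `FakeNode`, `FakeNodeWithScore`, and `FakeResponse` are hypothetical classes that only mimic the attributes the helpers above touch (`response`, `source_nodes`, `node.get_content()`), and the chunk texts are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeNode:
    """Hypothetical stand-in for a retrieved text node."""
    text: str

    def get_content(self) -> str:
        return self.text


@dataclass
class FakeNodeWithScore:
    """Hypothetical stand-in for one entry of `source_nodes`."""
    node: FakeNode


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a query response."""
    response: str
    source_nodes: List[FakeNodeWithScore] = field(default_factory=list)


def get_context(response) -> List[str]:
    # Pull the raw text out of every retrieved source node.
    return [info.node.get_content() for info in response.source_nodes]


def get_input(query: str, r) -> str:
    # Flatten query, answer, and contexts into the single string the evaluator reads.
    return f"Query: {query}, Response: {r.response}, Context: {get_context(r)}"


fake = FakeResponse(
    response="Follow your heart.",
    source_nodes=[
        FakeNodeWithScore(FakeNode("placeholder chunk one")),
        FakeNodeWithScore(FakeNode("placeholder chunk two")),
    ],
)
print(get_input("To do great work, should I follow my heart or my head?", fake))
```

Note that the context is interpolated as a Python list, so it appears in the string with brackets and quotes around each chunk.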

In [31]:
evaluator = fastrepl.SimpleEvaluator(
    pipeline=[
        fastrepl.LLMClassificationHead(
            model="gpt-4",
            context="You will receive text containing query, response, and context information. You should evaluate the response based on the query and context.",
            labels={
                "YES": "response for the query is in line with the context.",
                "NO": "response for the query is NOT in line with the context.",
            },
        )
    ]
)

result = fastrepl.local_runner(evaluator, ds).run()
result.to_pandas()

| | input | prediction |
|---|---|---|
| 0 | Query: What are the three qualities that the w... | NO |
| 1 | Query: How does the author suggest figuring ou... | YES |
| 2 | Query: What are the four steps the author outl... | NO |
| 3 | Query: Why does the author emphasize the impor... | NO |
| 4 | Query: How does the author suggest making your... | YES |
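
A quick way to read a result like this is to tally the labels, and, once a second evaluator's verdicts are available for the same rows, measure how often the two agree. The sketch below is one possible way to do that; `label_share` and `agreement` are hypothetical helper names, and only the `predictions` list is taken from the table above.

```python
from collections import Counter
from typing import Dict, Sequence

# Labels copied from the prediction column above.
predictions = ["NO", "YES", "NO", "NO", "YES"]


def label_share(labels: Sequence[str]) -> Dict[str, float]:
    """Return the fraction of rows assigned to each label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


def agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Return the fraction of rows on which two evaluators give the same label."""
    if len(a) != len(b):
        raise ValueError("both evaluators must label the same rows")
    if not a:
        raise ValueError("labels must not be empty")
    return sum(x == y for x, y in zip(a, b)) / len(a)


print(label_share(predictions))
```

`agreement` expects two label lists in the same row order, for example the `QueryResponseEvaluator` verdicts and the `fastrepl` predictions for the same questions.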
