In [2]:
from llama_index.evaluation import DatasetGenerator, QueryResponseEvaluator
from llama_index import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    ServiceContext,
    LLMPredictor,
    Response,
    Document,
)
from llama_index.llms import OpenAI

In [3]:
reader = SimpleDirectoryReader("../data/paul_graham/")
documents = reader.load_data()
trunc_doc = Document(text="\n".join(documents[0].get_content().split("\n")[::16]))

In [4]:
len(documents[0].get_content())

75011

In [5]:
len(trunc_doc.get_content())

9013

In [6]:
data_generator = DatasetGenerator.from_documents(documents)

We generate question and answer pairs for our document.

In [7]:
qna = data_generator.generate_qna_from_nodes()

In [8]:
qna[:10]

[('What were the two main things the author worked on before college?',
  'Writing and programming.'),
 ('What kind of stories did the author write before college?',
  'Short stories.'),
 ('What programming language did the author use on the IBM 1401?', 'Fortran.'),
 ("What was the author's experience with the 1401?",
  "The author couldn't figure out what to do with it and didn't have any data stored on punched cards."),
 ("What microcomputer did the author's friend build?",
  'A microcomputer kit sold by Heathkit.'),
 ("What kind of computer did the author's father buy?", 'A TRS-80.'),
 ('What did the author write using the TRS-80?',
  'Simple games, a program to predict rocket flight, and a word processor.'),
 ('What did the author plan to study in college initially?', 'Philosophy.'),
 ('What made the author switch to studying AI?',
  'The novel "The Moon is a Harsh Mistress" and seeing Terry Winograd using SHRDLU.'),
 ('What did the author realize about AI during the first year of 

In [9]:
from llama_index.evaluation import SemanticRelationMatch, SemanticAnswerSimilarity
from llama_index.indices.postprocessor import SentenceEmbeddingOptimizer

Let's test whether pruning the retrieved nodes to 20% of their original size, keeping the parts most semantically similar to the query, adversely affects performance.

In [10]:
ctx = ServiceContext.from_defaults(embed_model="local")
index = VectorStoreIndex.from_documents(documents=documents, service_context=ctx)
query_engine_A = index.as_query_engine()
query_engine_B = index.as_query_engine(
    node_postprocessors=[
        SentenceEmbeddingOptimizer(embed_model=ctx.embed_model, percentile_cutoff=0.2)
    ]
)
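To make the pruning step concrete, here is a minimal sketch of percentile-based sentence pruning. It is not the `SentenceEmbeddingOptimizer` implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and `prune_sentences`, `cosine`, and `keep_fraction` are hypothetical names introduced only for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def prune_sentences(query, text, keep_fraction=0.2):
    """Keep the top `keep_fraction` of sentences, ranked by similarity to the
    query, and return them in their original order (hypothetical helper)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return ""
    q_vec = embed(query)
    scores = [cosine(q_vec, embed(s)) for s in sentences]
    # Always keep at least one sentence so the LLM has some context.
    n_keep = max(1, math.ceil(len(sentences) * keep_fraction))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_keep]
    return " ".join(sentences[i] for i in sorted(top))


node_text = (
    "My father bought a TRS-80 computer. "
    "I wrote short stories before college. "
    "The weather was cold that year. "
    "We had a dog. "
    "I liked painting."
)
pruned = prune_sentences("What computer did the father buy?", node_text)
print(pruned)
```

With five sentences and a 20% cutoff, only the single best-matching sentence survives. This shows why aggressive pruning can drop the sentence that actually holds the answer when the similarity ranking is off.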


Next, we set up two evaluators. The relation evaluator classifies the relation between the reference answer and the generated answer as "contradiction" (-1), "neutral" (0), or "entailment" (1) and uses that value as its score, while the similarity evaluator outputs a similarity score normalized to a maximum of 1.

In [11]:
evaluator_rel = SemanticRelationMatch()
evaluator_sim = SemanticAnswerSimilarity()
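The scoring convention described above can be sketched as follows. The label-to-score mapping mirrors the text. `relation_score` and `unit_similarity` are hypothetical helpers, not the `SemanticRelationMatch` or `SemanticAnswerSimilarity` API, and reading "normalized" as cosine similarity between length-normalized vectors is an assumption.

```python
import math

# Mapping from NLI-style labels to scores, as described in the text.
RELATION_SCORES = {"contradiction": -1, "neutral": 0, "entailment": 1}


def relation_score(label):
    """Convert a relation label to its numeric score (hypothetical helper)."""
    return RELATION_SCORES[label.lower()]


def unit_similarity(u, v):
    """Cosine similarity: the dot product of the two vectors after each is
    scaled to length 1 (assumed interpretation of 'normalized')."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


print(relation_score("entailment"))
print(unit_similarity([3.0, 4.0], [3.0, 4.0]))
print(unit_similarity([1.0, 0.0], [0.0, 1.0]))
```

Under this reading, identical vectors score 1 and unrelated (orthogonal) vectors score 0, so the relation score gives a coarse three-way verdict while the similarity score gives a graded one.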

Now we compare query engine A, which uses the entire source node, with query engine B, which keeps only the most relevant 20% of each node.

In [None]:
import pandas as pd

df = pd.DataFrame(
    columns=[
        "Question",
        "LLM Gen Answer",
        "Answer A",
        "Answer B",
        "rel A",
        "rel B",
        "rel A - rel B",
        "similarity A",
        "similarity B",
        "A - B similarity",
    ]
)
for q, a in qna[:10]:
    a_A = query_engine_A.query(q).response
    a_B = query_engine_B.query(q).response

    rel_A = evaluator_rel.evaluate(q + a, q + " A: " + a_A)
    rel_B = evaluator_rel.evaluate(q + a, q + " A: " + a_B)
    sim_A = evaluator_sim.evaluate(q + a, a_A)
    sim_B = evaluator_sim.evaluate(q + a, a_B)
    # print("res", (q, a, a_A, a_B, rel_A, rel_B, rel_A - rel_B, sim_A, sim_B, sim_A - sim_B))
    df.loc[len(df)] = (
        q,
        a,
        a_A,
        a_B,
        rel_A,
        rel_B,
        rel_A - rel_B,
        sim_A,
        sim_B,
        sim_A - sim_B,
    )

In [24]:
def highlight(row):
    red = "background-color: red;"
    green = "background-color: green;"

    return [
        green if row["rel A"] >= 0.0 else red,
        green if row["rel B"] >= 0.0 else red,
        green if row["similarity A"] > 0.5 else red,
        green if row["similarity B"] > 0.5 else red,
    ]


df.style.apply(
    highlight, subset=["rel A", "rel B", "similarity A", "similarity B"], axis=1
)

index,Question,LLM Gen Answer,Answer A,Answer B,rel A,rel B,rel A - rel B,similarity A,similarity B,A - B similarity
4,What microcomputer did the author's friend build?,A microcomputer kit sold by Heathkit.,The microcomputer that the author's friend built was a Heathkit.,It is not possible to answer this question with the given context information.,0.0,-1.0,1.0,0.875564,0.263309,0.612254
3,What was the author's experience with the 1401?,The author couldn't figure out what to do with it and didn't have any data stored on punched cards.,"The author's experience with the 1401 was limited. He and his friend Rich Draves had permission to use it, but they didn't have any data stored on punched cards to use as input for their programs. The author remembers the moment he learned it was possible for programs not to terminate, when one of his programs didn't. He also remembers the alien-looking machines and the spectacularly loud printer.","The author's experience with the 1401 was that they wrote simple games, a program to predict how high their model rockets would fly, and a word processor that their father used to write at least one book. They also wrote short stories, but the memory of the 1401 was limited so they could only write two pages at a time.",1.0,1.0,0.0,0.517464,0.312007,0.205457
8,What made the author switch to studying AI?,"The novel ""The Moon is a Harsh Mistress"" and seeing Terry Winograd using SHRDLU.","The author switched to studying AI because he was drawn to the idea of creating a program like SHRDLU that could understand natural language. He was inspired by Terry Winograd's use of SHRDLU and believed that it was already climbing the lower slopes of intelligence. He was excited by the challenge of teaching himself Lisp, which was regarded as the language of AI at the time, and wanted to reverse-engineer SHRDLU for his undergraduate thesis. He was disappointed to find that the AI programs of the time could only understand a very limited subset of natural language, and realized that the approach of using explicit data structures to represent concepts was not going to work. He decided to focus on Lisp instead, and wrote a book about Lisp hacking.","The author switched to studying AI because they realized that the way AI was being practiced at the time was a hoax. They wanted to learn more about the field and chose Lisp as the language to do so, since it was regarded as the language of AI. They also applied to three graduate schools renowned for AI at the time, and wanted to explore the potential of microcomputers.",-1.0,1.0,-2.0,0.51977,0.409743,0.110027
7,What did the author plan to study in college initially?,Philosophy.,The author initially planned to study graduate-level mathematics in college.,The author initially planned to study painting and drawing at the RISD foundation program.,-1.0,-1.0,0.0,0.415815,0.332563,0.083252
0,What were the two main things the author worked on before college?,Writing and programming.,The two main things the author worked on before college were painting and freelance Lisp hacking work.,The two main things the author worked on before college were painting and writing essays.,-1.0,-1.0,0.0,0.536872,0.506068,0.030803
6,What did the author write using the TRS-80?,"Simple games, a program to predict rocket flight, and a word processor.","The author wrote a paper using the TRS-80 about how to choose what to work on in the past. The paper discussed topics such as numbers, errors, I/O, McCarthy's Lisp, spec expressed as code, abstract concepts, everyday words, and the evolution of computers.",The author wrote a more detailed version of McCarthy's original Lisp spec for others to read using the TRS-80.,-1.0,1.0,-2.0,0.375516,0.365896,0.00962
1,What kind of stories did the author write before college?,Short stories.,The author wrote short stories before college.,The author wrote short stories before college.,0.0,0.0,0.0,0.903834,0.903834,0.0
2,What programming language did the author use on the IBM 1401?,Fortran.,The author used an early version of Fortran on the IBM 1401.,The author used an early version of Fortran on the IBM 1401.,1.0,1.0,0.0,0.816106,0.816106,0.0
5,What kind of computer did the author's father buy?,A TRS-80.,The author's father bought a TRS-80 microcomputer.,The author's father bought a TRS-80 computer.,0.0,0.0,0.0,0.898926,0.90503,-0.006104
9,What did the author realize about AI during the first year of grad school?,That the way AI was practiced at the time was a hoax.,"The author realized that AI, as practiced at the time, was a hoax. It was clear that there was an unbridgeable gap between what AI programs could do and actually understanding natural language. It was not, in fact, simply a matter of teaching SHRDLU more words. That whole way of doing AI, with explicit data structures representing concepts, was not going to work.","The author realized that AI, as practiced at the time, was a hoax and that there weren't any classes in AI at Cornell. The author also realized that Lisp was regarded as the language of AI and that AI involved programs that could translate natural language into a formal representation.",1.0,1.0,0.0,0.607929,0.614198,-0.006268


Inspecting the results, we see that pruning the nodes has reduced correctness on some of the questions. However, the pipeline still answers many questions correctly. The evaluation models are not perfect either, but they give a useful signal that eases the evaluation process.
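To turn the table into a few summary numbers, the sketch below copies the `rel` and `similarity` columns from the output above (in index order 0 to 9) and counts how often pruning changed the relation score and how much similarity dropped on average. The 0.1 threshold for a "large" drop is an arbitrary assumption, and the variable names are illustrative.

```python
# (rel A, rel B, similarity A, similarity B), copied from the table, index 0 to 9.
rows = [
    (-1.0, -1.0, 0.536872, 0.506068),
    (0.0, 0.0, 0.903834, 0.903834),
    (1.0, 1.0, 0.816106, 0.816106),
    (1.0, 1.0, 0.517464, 0.312007),
    (0.0, -1.0, 0.875564, 0.263309),
    (0.0, 0.0, 0.898926, 0.90503),
    (-1.0, 1.0, 0.375516, 0.365896),
    (-1.0, -1.0, 0.415815, 0.332563),
    (-1.0, 1.0, 0.51977, 0.409743),
    (1.0, 1.0, 0.607929, 0.614198),
]

# Questions whose relation score got worse or better after pruning.
rel_worse = [i for i, (ra, rb, _, _) in enumerate(rows) if rb < ra]
rel_better = [i for i, (ra, rb, _, _) in enumerate(rows) if rb > ra]

# Questions whose similarity to the reference answer dropped noticeably.
LARGE_DROP = 0.1  # arbitrary threshold, chosen only for illustration
large_drops = [i for i, (_, _, sa, sb) in enumerate(rows) if sa - sb > LARGE_DROP]

# Average change in similarity caused by pruning (positive means A scored higher).
mean_drop = sum(sa - sb for _, _, sa, sb in rows) / len(rows)

print("relation worse after pruning:", rel_worse)
print("relation better after pruning:", rel_better)
print("large similarity drops:", large_drops)
print("mean similarity drop:", round(mean_drop, 4))
```

A higher relation score for engine B does not by itself mean its answer is correct. For the TRS-80 question (index 6), for example, both answers miss the reference, yet B is scored as entailment, which illustrates the imperfection of the evaluators.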