In [2]:
from llama_index import LLMPredictor, ServiceContext, VectorStoreIndex
from llama_index.evaluation.benchmarks import HotpotQAEvaluator
from llama_index.llms import OpenAI
from llama_index.schema import Document

# Use gpt-3.5-turbo to generate the answers.
llm_predictor = LLMPredictor(OpenAI(model="gpt-3.5-turbo"))

# Embed with a local sentence-transformers model.
service_context = ServiceContext.from_defaults(
    embed_model="local:sentence-transformers/all-MiniLM-L6-v2",
    llm_predictor=llm_predictor,
)
# A single example document is enough here: the benchmark supplies its own context.
index = VectorStoreIndex.from_documents(
    [Document.example()], service_context=service_context, show_progress=True
)

Parsing documents into nodes: 100%|████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 515.59it/s]
Generating embeddings: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 12.64it/s]


First we try a very simple engine. In this particular benchmark, the retriever, and hence the index, is actually ignored, because the documents retrieved for each query are provided in the dataset. This is known as the "distractor" setting in HotpotQA.
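To make the distractor setting concrete, here is a minimal sketch of how the pre-selected paragraphs in a record could be turned into context passages. It assumes the published HotpotQA JSON layout, where `context` is a list of `[title, [sentence, ...]]` pairs. The record contents and the helper `context_passages` are hypothetical placeholders, and this is not necessarily how `HotpotQAEvaluator` builds its nodes internally.

```python
from typing import Dict, List

# Hypothetical record in the distractor-set layout: "context" is a list of
# [title, [sentence, ...]] pairs. Placeholder text, not real dataset content.
record = {
    "question": "Placeholder question?",
    "answer": "placeholder",
    "context": [
        ["Title A", ["First sentence of A.", " Second sentence of A."]],
        ["Title B", ["Only sentence of B."]],
    ],
}


def context_passages(rec: Dict) -> List[str]:
    """Join the sentences of each pre-selected paragraph into one passage."""
    return [f"{title}: {''.join(sentences)}" for title, sentences in rec["context"]]


# These passages stand in for whatever a retriever would have returned.
passages = context_passages(record)
for passage in passages:
    print(passage)
```

Because the passages come straight from the record, the contents of the index have no influence on the context the LLM sees.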

In [9]:
engine = index.as_query_engine(service_context=service_context)

HotpotQAEvaluator().run(engine, queries=5, show_result=True)

Dataset: hotpot_dev_distractor downloaded at: /home/jonch/.cache/llama_index/datasets/HotpotQA
Evaluating on dataset: hotpot_dev_distractor
-------------------------------------
Loading 5 queries out of 7405 (fraction: 0.001)
Question:  Were Scott Derrickson and Ed Wood of the same nationality?
Response: No, Scott Derrickson and Ed Wood were not of the same nationality. Scott Derrickson is an American director, screenwriter, and producer, while Ed Wood was also an American filmmaker, actor, writer, producer, and director.
Correct answer:  yes
EM: 0 F1: 0
-------------------------------------
Question:  What government position was held by the woman who portrayed Corliss Archer in the film Kiss and Tell?
Response: The context information does not provide any information about the government position held by the woman who portrayed Corliss Archer in the film Kiss and Tell.
Correct answer:  Chief of Protocol
EM: 0 F1: 0
-------------------------------------
Question:  What science fantasy

Now we try a sentence transformer reranker, which selects 3 of the 10 nodes proposed by the retriever.
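The selection step itself is simple: score every candidate against the query, sort, and keep the best `top_n`. The sketch below illustrates only that step. The helper `rerank_top_n` and the word-overlap scorer `overlap_score` are hypothetical stand-ins; as far as we understand, the real reranker scores each query-node pair with a sentence-transformers cross-encoder model rather than word overlap.

```python
from typing import Callable, List, Tuple


def rerank_top_n(
    query: str,
    candidates: List[str],
    score: Callable[[str, str], float],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Score each candidate against the query and keep the top_n best."""
    scored = [(text, score(query, text)) for text in candidates]
    # Python's sort is stable, so ties keep their original retrieval order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def overlap_score(query: str, text: str) -> float:
    """Toy scorer (assumption): fraction of query words present in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


# Ten made-up candidates standing in for the nodes proposed by the retriever.
candidates = ["red car", "apple pie", "red apple pie recipe"] + [
    f"unrelated text {i}" for i in range(7)
]
top = rerank_top_n("red apple pie", candidates, overlap_score, top_n=3)
for text, value in top:
    print(f"{value:.2f}  {text}")
```

Only the three highest-scoring candidates reach the LLM, so the answer quality now depends on how well the scorer ranks the truly relevant paragraphs.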

In [10]:
from llama_index.indices.postprocessor import SentenceTransformerRerank

rerank = SentenceTransformerRerank(top_n=3)

engine = index.as_query_engine(
    service_context=service_context,
    node_postprocessors=[rerank],
)

HotpotQAEvaluator().run(engine, queries=5, show_result=True)

Dataset: hotpot_dev_distractor downloaded at: /home/jonch/.cache/llama_index/datasets/HotpotQA
Evaluating on dataset: hotpot_dev_distractor
-------------------------------------
Loading 5 queries out of 7405 (fraction: 0.001)
Question:  Were Scott Derrickson and Ed Wood of the same nationality?
Response: No, Scott Derrickson and Ed Wood were not of the same nationality. Scott Derrickson is an American director, while Ed Wood was also an American filmmaker.
Correct answer:  yes
EM: 0 F1: 0
-------------------------------------
Question:  What government position was held by the woman who portrayed Corliss Archer in the film Kiss and Tell?
Response: Based on the given context information, there is no mention of the government position held by the woman who portrayed Corliss Archer in the film Kiss and Tell.
Correct answer:  Chief of Protocol
EM: 0 F1: 0.07407407407407407
-------------------------------------
Question:  What science fantasy young adult series, told in first person, has a 

As we can see, the F1 score degrades when the reranker is added.

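We also note that the F1 score is not an accurate measure of whether a question was answered correctly. F1 measures token overlap with the reference answer, and because the LLM tends to parrot back the question, a response can share tokens with the reference without answering anything. In the second query of the reranker run, for example, the response states that the context does not mention the position, yet it still scores an F1 of about 0.07.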
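The effect is easy to reproduce with a SQuAD-style token-overlap metric. The sketch below assumes the evaluator normalizes text in the usual way (lowercase, strip punctuation, drop the articles "a", "an", "the") before comparing tokens. The function names are ours, not the library's, and the actual implementation may differ, although this version does give the same value as the reranker run for the Corliss Archer question.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Response and reference copied from the reranker run above.
response = (
    "Based on the given context information, there is no mention of the "
    "government position held by the woman who portrayed Corliss Archer in "
    "the film Kiss and Tell."
)
truth = "Chief of Protocol"

em = exact_match(response, truth)
f1 = f1_score(response, truth)
print(f"EM: {em} F1: {f1}")
```

The only shared token is "of", which gives a precision of 1/24 and a recall of 1/3, hence a non-zero F1 for a response that does not answer the question at all.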