<a href="https://colab.research.google.com/github/jerryjliu/llama_index/blob/main/docs/examples/metadata_extraction/MetadataExtraction_LLMSurvey.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automated Metadata Extraction for Better Retrieval + Synthesis

In this tutorial, we show how to extract metadata automatically to get better retrieval results.
We use two extractors: a `QuestionsAnsweredExtractor`, which generates questions that a piece of text can answer, and a `SummaryExtractor`, which summarizes not only the current text but also the adjacent texts.

We show that this allows for "chunk dreaming": each individual chunk can carry more "holistic" details about its surroundings, which leads to higher answer quality given the retrieved results.
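To make the idea concrete, here is a minimal sketch of attaching neighbouring summaries to each chunk. It does not use LlamaIndex or an LLM: `first_words` is a hypothetical stand-in for the summarizer and `attach_adjacent_summaries` is an illustrative helper. Only the metadata key names match what the `SummaryExtractor` produces later in this notebook.

```python
from typing import Callable, Dict, List


def first_words(text: str, n: int = 8) -> str:
    """Hypothetical stand-in for an LLM summarizer: keep the first n words."""
    return " ".join(text.split()[:n])


def attach_adjacent_summaries(
    chunks: List[str], summarize: Callable[[str], str] = first_words
) -> List[Dict[str, str]]:
    """Give each chunk a summary of itself and of its immediate neighbours."""
    summaries = [summarize(chunk) for chunk in chunks]
    records = []
    for i, chunk in enumerate(chunks):
        record = {"text": chunk, "section_summary": summaries[i]}
        if i > 0:  # the first chunk has no previous neighbour
            record["prev_section_summary"] = summaries[i - 1]
        if i < len(chunks) - 1:  # the last chunk has no next neighbour
            record["next_section_summary"] = summaries[i + 1]
        records.append(record)
    return records


chunks = [
    "BERTScore uses contextual embeddings.",
    "MoverScore allows soft alignment.",
    "BLEU and ROUGE rely on n-gram overlap.",
]
records = attach_adjacent_summaries(chunks)
print(records[1])
```

The middle record carries context from both neighbours, which is the effect the tutorial relies on when only a single chunk is retrieved.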

Our data source is taken from Eugene Yan's popular article on LLM Patterns: https://eugeneyan.com/writing/llm-patterns/

## Setup

If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.

In [None]:
%pip install llama-index-llms-openai
%pip install llama-index-readers-web

In [None]:
!pip install llama-index

In [None]:
import nest_asyncio

nest_asyncio.apply()

import os
import openai

In [None]:
# OPTIONAL: setup W&B callback handling for tracing
from llama_index.core import set_global_handler

set_global_handler("wandb", run_args={"project": "llamaindex"})

In [None]:
os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]
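Outside a throwaway notebook, you may prefer not to hardcode the key. The sketch below is one possible approach using only the standard library. `resolve_api_key` is a hypothetical helper, not part of LlamaIndex or the OpenAI client.

```python
import os
from getpass import getpass
from typing import Callable


def resolve_api_key(
    env_var: str = "OPENAI_API_KEY",
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Return the key from the environment, prompting only if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        key = prompt(f"Enter {env_var}: ")
        # store it so libraries that read the environment can find it
        os.environ[env_var] = key
    return key


# Usage (prompts interactively only when the variable is unset):
# openai.api_key = resolve_api_key()
```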

## Define Metadata Extractors

Here we define the metadata extractors. We define two variants:
- `extractors_1` contains only the `QuestionsAnsweredExtractor`
- `extractors_2` contains both the `SummaryExtractor` and the `QuestionsAnsweredExtractor`

In [None]:
from llama_index.llms.openai import OpenAI
from llama_index.core.schema import MetadataMode

In [None]:
llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo", max_tokens=512)

Next, we instantiate the `SummaryExtractor` and `QuestionsAnsweredExtractor`, along with the `TokenTextSplitter` that produces the chunks.

In [None]:
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
)

node_parser = TokenTextSplitter(
    separator=" ", chunk_size=256, chunk_overlap=128
)


extractors_1 = [
    QuestionsAnsweredExtractor(
        questions=3, llm=llm, metadata_mode=MetadataMode.EMBED
    ),
]

extractors_2 = [
    SummaryExtractor(summaries=["prev", "self", "next"], llm=llm),
    QuestionsAnsweredExtractor(
        questions=3, llm=llm, metadata_mode=MetadataMode.EMBED
    ),
]
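With `chunk_size=256` and `chunk_overlap=128`, consecutive chunks share half of their content. The sketch below illustrates that sliding window on a tiny scale. It is a simplification under stated assumptions: it counts whitespace-separated words rather than tokenizer tokens, and `split_with_overlap` is a hypothetical helper, not the `TokenTextSplitter` implementation.

```python
from typing import List


def split_with_overlap(
    words: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Slide a window of chunk_size words, advancing by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start : start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the input
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

A large overlap like this means a sentence cut at one chunk boundary usually appears whole in the neighbouring chunk, at the cost of roughly twice as many chunks to embed and process.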

## Load in Data, Run Extractors

We load in Eugene's essay (https://eugeneyan.com/writing/llm-patterns/) using our LlamaHub SimpleWebPageReader.

We then run our extractors.

In [None]:
from llama_index.core import SimpleDirectoryReader

In [None]:
# load in blog

from llama_index.readers.web import SimpleWebPageReader

reader = SimpleWebPageReader(html_to_text=True)
docs = reader.load_data(urls=["https://eugeneyan.com/writing/llm-patterns/"])

In [None]:
print(docs[0].get_content())

In [None]:
orig_nodes = node_parser.get_nodes_from_documents(docs)

In [None]:
# take just 8 nodes (indices 20-27) for testing
nodes = orig_nodes[20:28]

In [None]:
print(nodes[3].get_content(metadata_mode="all"))

is to measure the distance that words would
have to move to convert one sequence to another.

However, there are several pitfalls to using these conventional benchmarks and
metrics.

First, there’s **poor correlation between these metrics and human judgments.**
BLEU, ROUGE, and others have had [negative correlation with how humans
evaluate fluency](https://arxiv.org/abs/2008.12009). They also showed moderate
to less correlation with human adequacy scores. In particular, BLEU and ROUGE
have [low correlation with tasks that require creativity and
diversity](https://arxiv.org/abs/2303.16634).

Second, these metrics often have **poor adaptability to a wider variety of
tasks**. Adopting a metric proposed for one task to another is not always
prudent. For example, exact match metrics such as BLEU and ROUGE are a poor
fit for tasks like abstractive summarization or dialogue. Since they’re based
on n-gram overlap between output and reference, they don’t make sense for a
dialogue task where a w

### Run metadata extractors

In [None]:
from llama_index.core.ingestion import IngestionPipeline

# process nodes with metadata extractors
pipeline = IngestionPipeline(transformations=[node_parser, *extractors_1])

nodes_1 = pipeline.run(nodes=nodes, in_place=False, show_progress=True)

Parsing documents into nodes:   0%|          | 0/8 [00:00<?, ?it/s]

Extracting questions:   0%|          | 0/8 [00:00<?, ?it/s]

In [None]:
print(nodes_1[3].get_content(metadata_mode="all"))

[Excerpt from document]
questions_this_excerpt_can_answer: 1. What is the correlation between conventional metrics like BLEU and ROUGE and human judgments in evaluating fluency and adequacy in natural language processing tasks?
2. How do conventional metrics like BLEU and ROUGE perform in tasks that require creativity and diversity?
3. Why are exact match metrics like BLEU and ROUGE not suitable for tasks like abstractive summarization or dialogue in natural language processing?
Excerpt:
-----
is to measure the distance that words would
have to move to convert one sequence to another.

However, there are several pitfalls to using these conventional benchmarks and
metrics.

First, there’s **poor correlation between these metrics and human judgments.**
BLEU, ROUGE, and others have had [negative correlation with how humans
evaluate fluency](https://arxiv.org/abs/2008.12009). They also showed moderate
to less correlation with human adequacy scores. In particular, BLEU and ROUGE
have [low c
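The printed output above shows the extracted metadata placed in front of the chunk text. As a rough illustration, the sketch below reproduces that layout with plain string formatting. `render_with_metadata` is a hypothetical helper that only mimics the printed format; it is not the template code LlamaIndex uses internally.

```python
from typing import Dict


def render_with_metadata(text: str, metadata: Dict[str, str]) -> str:
    """Prefix the chunk text with one 'key: value' line per metadata entry."""
    metadata_str = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return f"[Excerpt from document]\n{metadata_str}\nExcerpt:\n-----\n{text}"


rendered = render_with_metadata(
    "However, there are several pitfalls to using these conventional benchmarks.",
    {"questions_this_excerpt_can_answer": "1. What are the pitfalls of BLEU?"},
)
print(rendered)
```

Because the metadata becomes part of the text that is embedded and shown to the LLM, the generated questions can influence both which chunk is retrieved and how the answer is synthesized.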

In [None]:
# 2nd pass: extract summaries, then questions

# process nodes with both metadata extractors
pipeline = IngestionPipeline(transformations=[node_parser, *extractors_2])

nodes_2 = pipeline.run(nodes=nodes, in_place=False, show_progress=True)

Parsing documents into nodes:   0%|          | 0/8 [00:00<?, ?it/s]

Extracting summaries:   0%|          | 0/8 [00:00<?, ?it/s]

Extracting questions:   0%|          | 0/8 [00:00<?, ?it/s]
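Conceptually, each pipeline run above applies its transformations in order, with every step taking a list of nodes and returning a list of nodes. The sketch below shows that pattern with plain dictionaries. `run_pipeline`, `add_word_count`, and `add_first_word` are hypothetical stand-ins for the pipeline and its extractors, and the copy step only approximates the intent of `in_place=False`.

```python
from typing import Any, Callable, Dict, List

Node = Dict[str, Any]
Transformation = Callable[[List[Node]], List[Node]]


def run_pipeline(nodes: List[Node], transformations: List[Transformation]) -> List[Node]:
    """Apply each transformation in order to a copy of the input nodes."""
    current = [{"text": n["text"], "metadata": dict(n["metadata"])} for n in nodes]
    for transform in transformations:
        current = transform(current)
    return current


def add_word_count(nodes: List[Node]) -> List[Node]:
    """Toy 'extractor' that records the number of words in each node."""
    for node in nodes:
        node["metadata"]["word_count"] = len(node["text"].split())
    return nodes


def add_first_word(nodes: List[Node]) -> List[Node]:
    """Toy 'extractor' that records the first word of each node."""
    for node in nodes:
        node["metadata"]["first_word"] = node["text"].split()[0]
    return nodes


source_nodes = [{"text": "BLEU measures n-gram overlap", "metadata": {}}]
enriched = run_pipeline(source_nodes, [add_word_count, add_first_word])
print(enriched[0]["metadata"])
```

Each step sees the metadata written by the steps before it, which is why the order of the extractors in the list can matter.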

### Visualize some sample data

In [None]:
print(nodes_2[3].get_content(metadata_mode="all"))

[Excerpt from document]
prev_section_summary: The section discusses the comparison between BERTScore and MoverScore, two metrics used to evaluate the quality of text generation models. MoverScore is described as a metric that measures the effort required to transform one text sequence into another by mapping semantically related words. The section also highlights the limitations of conventional benchmarks and metrics, such as poor correlation with human judgments and low correlation with tasks requiring creativity.
next_section_summary: The section discusses the limitations of current evaluation metrics in natural language processing tasks. It highlights three main issues: lack of creativity and diversity in metrics, poor adaptability to different tasks, and poor reproducibility. The section mentions specific metrics like BLEU and ROUGE, and also references studies that have reported high variance in metric scores.
section_summary: The section discusses the limitations of conventional 

In [None]:
print(nodes_2[1].get_content(metadata_mode="all"))

[Excerpt from document]
prev_section_summary: The section discusses the F_{BERT} formula used in BERTScore and highlights the advantages of BERTScore over simpler metrics like BLEU and ROUGE. It also introduces MoverScore, another metric that uses contextualized embeddings but allows for many-to-one matching. The key topics are BERTScore, MoverScore, and the differences between them.
next_section_summary: The section discusses the comparison between BERTScore and MoverScore, two metrics used to evaluate the quality of text generation models. MoverScore is described as a metric that measures the effort required to transform one text sequence into another by mapping semantically related words. The section also highlights the limitations of conventional benchmarks and metrics, such as poor correlation with human judgments and low correlation with tasks requiring creativity.
section_summary: The key topics of this section are BERTScore and MoverScore, which are methods used to compute the 

## Setup RAG Query Engines, Compare Results! 

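We set up three indexes/query engines, one on top of each of the three node variants: the original nodes, the nodes with extracted questions, and the nodes with both summaries and questions.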

In [None]:
from llama_index.core import VectorStoreIndex
from llama_index.core.response.notebook_utils import (
    display_source_node,
    display_response,
)

In [None]:
# build one index per node variant; only nodes 20-27 differ between them

index0 = VectorStoreIndex(orig_nodes)
index1 = VectorStoreIndex(orig_nodes[:20] + nodes_1 + orig_nodes[28:])
index2 = VectorStoreIndex(orig_nodes[:20] + nodes_2 + orig_nodes[28:])
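The slicing above swaps the enriched nodes back into their original positions, so the three indexes cover the same document and differ only in the 8 processed nodes. A small sketch with placeholder strings makes the splice visible. The total of 40 nodes is an assumption for illustration, not the real node count.

```python
# Placeholder node identifiers; the real notebook uses node objects.
orig_ids = [f"node-{i}" for i in range(40)]

# Stand-in for the 8 nodes that went through the metadata extractors.
enriched_ids = [f"{node_id}+metadata" for node_id in orig_ids[20:28]]

# Same splice as in the cell above: keep 0-19, swap in 20-27, keep 28 onward.
spliced = orig_ids[:20] + enriched_ids + orig_ids[28:]

print(len(spliced), spliced[19:21], spliced[27:29])
```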

In [None]:
query_engine0 = index0.as_query_engine(similarity_top_k=1)
query_engine1 = index1.as_query_engine(similarity_top_k=1)
query_engine2 = index2.as_query_engine(similarity_top_k=1)
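With `similarity_top_k=1`, each query engine passes only the single best-matching node to the LLM, so whatever context that node carries is all the model sees. The sketch below shows top-k selection by cosine similarity on toy 2-D vectors. It assumes a cosine-style score and uses hypothetical helpers (`cosine`, `top_k`); it is not the vector store's implementation.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int = 1
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


doc_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query_vec = [0.5, 0.9]
best = top_k(query_vec, doc_vecs, k=1)
print(best)
```

Since only one node is retrieved, metadata that summarizes neighbouring chunks is the only way for adjacent context to reach the LLM in this setup.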

### Try out some questions

For this question, the naive response `response0` mentions only BLEU and ROUGE and lacks context about other metrics.

`response2`, on the other hand, has all of the metrics (BERTScore, MoverScore, BLEU, and ROUGE) within its context.
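One simple way to check this kind of claim is to look for metric names in each answer. The sketch below does that with a case-insensitive substring match. `mentioned_metrics` is a hypothetical helper, and the two sample strings are sentences taken from the opening of the responses recorded in this notebook.

```python
from typing import Iterable, List


def mentioned_metrics(text: str, metric_names: Iterable[str]) -> List[str]:
    """Return the metric names that appear in the text (case-insensitive)."""
    lowered = text.lower()
    return [name for name in metric_names if name.lower() in lowered]


METRICS = ["BLEU", "ROUGE", "BERTScore", "MoverScore"]

naive_answer = (
    "However, some commonly used metrics include BLEU, ROUGE, and exact match metrics."
)
enriched_answer = (
    "Metrics for evaluating text generation quality include BERTScore and MoverScore."
)

print(mentioned_metrics(naive_answer, METRICS))
print(mentioned_metrics(enriched_answer, METRICS))
```

In the notebook itself, you could pass `str(response0)` and `str(response2)` instead of the sample strings.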

In [None]:
# query_str = "In the original RAG paper, can you describe the two main approaches for generation and compare them?"
query_str = (
    "Can you describe metrics for evaluating text generation quality, compare"
    " them, and tell me about their downsides"
)

response0 = query_engine0.query(query_str)
response1 = query_engine1.query(query_str)
response2 = query_engine2.query(query_str)

In [None]:
display_response(
    response0, source_length=1000, show_source=True, show_source_metadata=True
)

**`Final Response:`** Metrics for evaluating text generation quality can vary depending on the task. However, some commonly used metrics include BLEU, ROUGE, and exact match metrics. 

BLEU and ROUGE are often used for evaluating machine translation and summarization tasks. They measure the n-gram overlap between the generated text and a reference text. However, these metrics have limitations. For example, they may not be suitable for tasks like abstractive summarization or dialogue, where a wide variety of responses are possible. This is because they rely on exact match and n-gram overlap, which may not capture the semantic quality or creativity of the generated text.

Exact match metrics, such as BLEU and ROUGE, have the downside of not considering the possibility of a good response with zero n-gram overlap with the reference. This means that even if the generated text is a good response, it may receive a low score if it does not have any n-gram overlap with the reference.

In addition to their limitations in capturing text quality, these metrics also have poor adaptability to a wider variety of tasks. Using a metric proposed for one task to evaluate another task may not be appropriate. 

Furthermore, these metrics have poor reproducibility. There can be high variance in scores across different studies, possibly due to variations in human judgment collection or metric parameter settings. This lack of consistency makes it challenging to compare results across different studies.

Overall, while metrics like BLEU and ROUGE provide some quantitative measure of text generation quality, they have downsides in terms of adaptability, reproducibility, and their ability to capture the creativity and diversity of generated text.

---

**`Source Node 1/1`**

**Node ID:** 256dfc5a-cd91-4ff1-9f28-15693775e354<br>**Similarity:** 0.8402930255994321<br>**Text:** require creativity and
diversity](https://arxiv.org/abs/2303.16634).

Second, these metrics often have **poor adaptability to a wider variety of
tasks**. Adopting a metric proposed for one task to another is not always
prudent. For example, exact match metrics such as BLEU and ROUGE are a poor
fit for tasks like abstractive summarization or dialogue. Since they’re based
on n-gram overlap between output and reference, they don’t make sense for a
dialogue task where a wide variety of responses are possible. An output can
have zero n-gram overlap with the reference but yet be a good response.

Third, these metrics have **poor reproducibility**. Even for the same metric,
[high variance is reported across different
studies](https://arxiv.org/abs/2008.12009), possibly due to variations in
human judgment collection or metric parameter settings. Another study of
[ROUGE scores](https://aclanthology.org/2023.acl-long.107/) across 2,000
studies found that scores were hard<br>**Metadata:** {}<br>

In [None]:
print(response0.source_nodes[0].node.get_content())

require creativity and
diversity](https://arxiv.org/abs/2303.16634).

Second, these metrics often have **poor adaptability to a wider variety of
tasks**. Adopting a metric proposed for one task to another is not always
prudent. For example, exact match metrics such as BLEU and ROUGE are a poor
fit for tasks like abstractive summarization or dialogue. Since they’re based
on n-gram overlap between output and reference, they don’t make sense for a
dialogue task where a wide variety of responses are possible. An output can
have zero n-gram overlap with the reference but yet be a good response.

Third, these metrics have **poor reproducibility**. Even for the same metric,
[high variance is reported across different
studies](https://arxiv.org/abs/2008.12009), possibly due to variations in
human judgment collection or metric parameter settings. Another study of
[ROUGE scores](https://aclanthology.org/2023.acl-long.107/) across 2,000
studies found that scores were hard


In [None]:
display_response(
    response1, source_length=1000, show_source=True, show_source_metadata=True
)

**`Final Response:`** Metrics for evaluating text generation quality can vary depending on the task at hand. However, some commonly used metrics include BLEU (Bilingual Evaluation Understudy), ROUGE (Recall-Oriented Understudy for Gisting Evaluation), and exact match metrics.

BLEU and ROUGE are often used for evaluating machine translation and summarization tasks. They measure the overlap of n-grams (sequences of words) between the generated text and a reference text. Higher scores indicate better quality, as it suggests that the generated text is similar to the reference.

Exact match metrics, on the other hand, focus on the exact match between the generated text and the reference. These metrics are commonly used for tasks like question answering or dialogue systems. They determine if the generated text matches the reference exactly, without considering partial matches or variations in wording.

While these metrics can provide some insights into text generation quality, they have their downsides. First, they may lack adaptability to different tasks. For example, exact match metrics like BLEU and ROUGE may not be suitable for tasks like abstractive summarization or dialogue, where a wide variety of responses are possible.

Second, these metrics may have poor reproducibility. Different studies have reported high variance in scores, possibly due to variations in human judgment collection or metric parameter settings. This can make it challenging to compare results across different studies or reproduce the same scores.

In summary, metrics for evaluating text generation quality can be useful but should be chosen carefully based on the task. It is important to consider their adaptability, reproducibility, and limitations when interpreting the results.

---

**`Source Node 1/1`**

**Node ID:** 5161474e-bb2b-4642-9f07-692aa7db4375<br>**Similarity:** 0.8352202179927705<br>**Text:** require creativity and
diversity](https://arxiv.org/abs/2303.16634).

Second, these metrics often have **poor adaptability to a wider variety of
tasks**. Adopting a metric proposed for one task to another is not always
prudent. For example, exact match metrics such as BLEU and ROUGE are a poor
fit for tasks like abstractive summarization or dialogue. Since they’re based
on n-gram overlap between output and reference, they don’t make sense for a
dialogue task where a wide variety of responses are possible. An output can
have zero n-gram overlap with the reference but yet be a good response.

Third, these metrics have **poor reproducibility**. Even for the same metric,
[high variance is reported across different
studies](https://arxiv.org/abs/2008.12009), possibly due to variations in
human judgment collection or metric parameter settings. Another study of
[ROUGE scores](https://aclanthology.org/2023.acl-long.107/) across 2,000
studies found that scores were hard<br>**Metadata:** {'questions_this_excerpt_can_answer': '1. What are some limitations of using exact match metrics like BLEU and ROUGE for tasks such as abstractive summarization or dialogue?\n2. Why is it not always prudent to adopt a metric proposed for one task to another?\n3. What are some challenges in reproducing and comparing metric scores across different studies?'}<br>

In [None]:
display_response(
    response2, source_length=1000, show_source=True, show_source_metadata=True
)

**`Final Response:`** Metrics for evaluating text generation quality include BERTScore and MoverScore. BERTScore measures the similarity between generated text and reference text by considering contextual embeddings. On the other hand, MoverScore measures the effort required to transform one text sequence into another by mapping semantically related words. 

However, there are downsides to using conventional benchmarks and metrics for text generation evaluation. Firstly, these metrics have been found to have poor correlation with human judgments. For example, metrics like BLEU and ROUGE have shown negative correlation with human evaluations of fluency and moderate to less correlation with human adequacy scores. Additionally, they have low correlation with tasks that require creativity and diversity.

Secondly, these metrics often have poor adaptability to different tasks. Metrics like BLEU and ROUGE, which are based on n-gram overlap between output and reference, are not suitable for tasks like abstractive summarization or dialogue. Therefore, adopting a metric proposed for one task to another may not be appropriate.

---

**`Source Node 1/1`**

**Node ID:** 59031c3f-c150-4d84-82f6-a32bb9920a0a<br>**Similarity:** 0.8382547930335613<br>**Text:** is to measure the distance that words would
have to move to convert one sequence to another.

However, there are several pitfalls to using these conventional benchmarks and
metrics.

First, there’s **poor correlation between these metrics and human judgments.**
BLEU, ROUGE, and others have had [negative correlation with how humans
evaluate fluency](https://arxiv.org/abs/2008.12009). They also showed moderate
to less correlation with human adequacy scores. In particular, BLEU and ROUGE
have [low correlation with tasks that require creativity and
diversity](https://arxiv.org/abs/2303.16634).

Second, these metrics often have **poor adaptability to a wider variety of
tasks**. Adopting a metric proposed for one task to another is not always
prudent. For example, exact match metrics such as BLEU and ROUGE are a poor
fit for tasks like abstractive summarization or dialogue. Since they’re based
on n-gram overlap between output and reference, they don’t make sense for a
dialogue task where ...<br>**Metadata:** {'prev_section_summary': 'The section discusses the comparison between BERTScore and MoverScore, two metrics used to evaluate the quality of text generation models. MoverScore is described as a metric that measures the effort required to transform one text sequence into another by mapping semantically related words. The section also highlights the limitations of conventional benchmarks and metrics, such as poor correlation with human judgments and low correlation with tasks requiring creativity.', 'next_section_summary': 'The section discusses the limitations of current evaluation metrics in natural language processing tasks. It highlights three main issues: lack of creativity and diversity in metrics, poor adaptability to different tasks, and poor reproducibility. The section mentions specific metrics like BLEU and ROUGE, and also references studies that have reported high variance in metric scores.', 'section_summary': 'The section discusses the limitations of conventional benchmarks and metrics used to measure the distance between word sequences. It highlights two main issues: the poor correlation between these metrics and human judgments, and their limited adaptability to different tasks. The section mentions specific metrics like BLEU and ROUGE, which have been found to have low correlation with human evaluations of fluency, adequacy, creativity, and diversity. It also points out that metrics based on n-gram overlap, such as BLEU and ROUGE, are not suitable for tasks like abstractive summarization or dialogue.', 'questions_this_excerpt_can_answer': '1. What are the limitations of conventional benchmarks and metrics in measuring the distance between word sequences?\n2. How do metrics like BLEU and ROUGE correlate with human judgments in terms of fluency, adequacy, creativity, and diversity?\n3. Why are metrics based on n-gram overlap, such as BLEU and ROUGE, not suitable for tasks like abstractive summarization or dialogue?'}<br>

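In this next question, we ask about BERTScore/MoverScore.

The responses are similar, but `response2` gives slightly more detail about MoverScore than `response0`, since its metadata contains additional information about MoverScore. Note, however, that `response2` does not reproduce the BERTScore formula, even though the formula appears in its retrieved text.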
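For reference, the BERTScore formula quoted in the source article is the harmonic mean of a precision and a recall term, each computed by greedily matching every token to its most similar counterpart. The sketch below computes it on toy vectors. It is a simplified illustration: the 2-D vectors stand in for contextual embeddings, the optional idf weighting and baseline rescaling are omitted, and `bertscore_prf` is a hypothetical helper.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def bertscore_prf(
    candidate: List[Sequence[float]], reference: List[Sequence[float]]
) -> Tuple[float, float, float]:
    """Greedy-matching precision, recall, and F over token embeddings."""
    # precision: each candidate token is matched to its closest reference token
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    # recall: each reference token is matched to its closest candidate token
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy "embeddings": the candidate covers only one of the two reference tokens.
candidate = [[1.0, 0.0]]
reference = [[1.0, 0.0], [0.0, 1.0]]
p, r, f = bertscore_prf(candidate, reference)
print(p, r, f)
```

Here precision is perfect because the single candidate token has an exact match, while recall is halved because one reference token has no counterpart, and the F value falls between the two.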

In [None]:
# query_str = "What are some reproducibility issues with the ROUGE metric? Give some details related to benchmarks and also describe other ROUGE issues. "
query_str = (
    "Can you give a high-level overview of BERTScore/MoverScore + formulas if"
    " available?"
)

response0 = query_engine0.query(query_str)
response1 = query_engine1.query(query_str)
response2 = query_engine2.query(query_str)

In [None]:
display_response(
    response0, source_length=1000, show_source=True, show_source_metadata=True
)

**`Final Response:`** BERTScore is a metric that is useful because it can account for synonyms and paraphrasing, unlike simpler metrics like BLEU and ROUGE. It uses contextualized embeddings to compute the distance between tokens in the generated output and reference. The formula for BERTScore is:

\[ F_{\text{BERT}} = \frac{2 \cdot P_{\text{BERT}} \cdot R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}} \]

MoverScore, on the other hand, also uses contextualized embeddings to compute the distance between tokens in the generated output and reference. However, unlike BERTScore, which is based on one-to-one matching of tokens, MoverScore allows for many-to-one matching. This is also known as "soft alignment." Unfortunately, the formula for MoverScore is not provided in the given context.

---

**`Source Node 1/1`**

**Node ID:** b15a1c51-3c52-42c2-aba0-1f7aa805e6ce<br>**Similarity:** 0.8393536980456234<br>**Text:** = F_{\text{BERT}} =
\frac{2 \cdot P_{\text{BERT}} \cdot R_{\text{BERT}}}{P_{\text{BERT}} +
R_{\text{BERT}}}\\]

BERTScore is useful because it can account for synonyms and paraphrasing.
Simpler metrics like BLEU and ROUGE can’t do this due to their reliance on
exact matches. BERTScore has been shown to have better correlation for tasks
such as image captioning and machine translation.

**[MoverScore](https://arxiv.org/abs/1909.02622)** also uses contextualized
embeddings to compute the distance between tokens in the generated output and
reference. But unlike BERTScore, which is based on one-to-one matching (or
“hard alignment”) of tokens, MoverScore allows for many-to-one matching (or
“soft alignment”).

![BERTScore \(left\) vs. MoverScore<br>**Metadata:** {}<br>

In [None]:
display_response(
    response1, source_length=1000, show_source=True, show_source_metadata=True
)

**`Final Response:`** BERTScore is a metric that is useful because it can account for synonyms and paraphrasing, unlike simpler metrics like BLEU and ROUGE. It uses contextualized embeddings to compute the distance between tokens in the generated output and reference. The formula for BERTScore is F_{BERT} = \frac{2 \cdot P_{BERT} \cdot R_{BERT}}{P_{BERT} + R_{BERT}}.

MoverScore, on the other hand, also uses contextualized embeddings to compute the distance between tokens in the generated output and reference. However, unlike BERTScore, which is based on one-to-one matching of tokens, MoverScore allows for many-to-one matching. Unfortunately, the formula for MoverScore is not provided in the given context.

---

**`Source Node 1/1`**

**Node ID:** c7d7dcb1-48a8-4433-9a58-a0a16ce9b14e<br>**Similarity:** 0.8485439966069306<br>**Text:** = F_{\text{BERT}} =
\frac{2 \cdot P_{\text{BERT}} \cdot R_{\text{BERT}}}{P_{\text{BERT}} +
R_{\text{BERT}}}\\]

BERTScore is useful because it can account for synonyms and paraphrasing.
Simpler metrics like BLEU and ROUGE can’t do this due to their reliance on
exact matches. BERTScore has been shown to have better correlation for tasks
such as image captioning and machine translation.

**[MoverScore](https://arxiv.org/abs/1909.02622)** also uses contextualized
embeddings to compute the distance between tokens in the generated output and
reference. But unlike BERTScore, which is based on one-to-one matching (or
“hard alignment”) of tokens, MoverScore allows for many-to-one matching (or
“soft alignment”).

![BERTScore \(left\) vs. MoverScore<br>**Metadata:** {'questions_this_excerpt_can_answer': '1. What is the advantage of using BERTScore over simpler metrics like BLEU and ROUGE?\n2. How does MoverScore differ from BERTScore in terms of token matching?\n3. What tasks have shown better correlation with BERTScore, such as image captioning and machine translation?'}<br>

In [None]:
display_response(
    response2, source_length=1000, show_source=True, show_source_metadata=True
)

**`Final Response:`** BERTScore is a method used to compute the similarity between generated output and reference in tasks like image captioning and machine translation. It considers synonyms and paraphrasing, unlike simpler metrics such as BLEU and ROUGE. BERTScore uses one-to-one matching of tokens to calculate its score.

MoverScore, on the other hand, also utilizes contextualized embeddings to compute the distance between tokens in the generated output and reference. Unlike BERTScore, which uses one-to-one matching, MoverScore allows for many-to-one matching. It solves an optimization problem to find the minimum effort required to transform one text into another by measuring the distance words would have to move.

Unfortunately, the formulas for BERTScore and MoverScore are not provided in the given context.

---

**`Source Node 1/1`**

**Node ID:** 21664308-0009-435f-93fc-1ff7de4a9091<br>**Similarity:** 0.8474860535677818<br>**Text:** = F_{\text{BERT}} =
\frac{2 \cdot P_{\text{BERT}} \cdot R_{\text{BERT}}}{P_{\text{BERT}} +
R_{\text{BERT}}}\\]

BERTScore is useful because it can account for synonyms and paraphrasing.
Simpler metrics like BLEU and ROUGE can’t do this due to their reliance on
exact matches. BERTScore has been shown to have better correlation for tasks
such as image captioning and machine translation.

**[MoverScore](https://arxiv.org/abs/1909.02622)** also uses contextualized
embeddings to compute the distance between tokens in the generated output and
reference. But unlike BERTScore, which is based on one-to-one matching (or
“hard alignment”) of tokens, MoverScore allows for many-to-one matching (or
“soft alignment”).

![BERTScore \(left\) vs. MoverScore<br>**Metadata:** {'next_section_summary': 'The key topics of this section are BERTScore and MoverScore, which are methods used to compute the similarity between generated output and reference in tasks like image captioning and machine translation. BERTScore uses one-to-one matching of tokens, while MoverScore allows for many-to-one matching. MoverScore solves an optimization problem to find the minimum effort required to transform one text into another by measuring the distance words would have to move.', 'section_summary': "The section discusses the importance of BERTScore and MoverScore in evaluating the quality of generated text. BERTScore is advantageous as it considers synonyms and paraphrasing, unlike simpler metrics such as BLEU and ROUGE. It has shown better correlation in tasks like image captioning and machine translation. On the other hand, MoverScore also utilizes contextualized embeddings but allows for many-to-one matching, unlike BERTScore's one-to-one matching.", 'questions_this_excerpt_can_answer': "1. What are the key differences between BERTScore and MoverScore in terms of their matching approaches?\n2. How does BERTScore account for synonyms and paraphrasing, and why is this advantageous compared to metrics like BLEU and ROUGE?\n3. What is the advantage of MoverScore allowing for many-to-one matching compared to BERTScore's one-to-one matching?"}<br>

In [None]:
response1.source_nodes[0].node.metadata

{'questions_this_excerpt_can_answer': '1. What is the advantage of using BERTScore over simpler metrics like BLEU and ROUGE?\n2. How does MoverScore differ from BERTScore in terms of token matching?\n3. What tasks have shown better correlation with BERTScore, such as image captioning and machine translation?'}