feat: add base evaluator and correctness evaluator #559

Merged
merged 15 commits into from
Feb 26, 2024
5 changes: 5 additions & 0 deletions .changeset/famous-pugs-join.md
@@ -0,0 +1,5 @@
---
"llamaindex": patch
---

feat: add base evaluator and correctness evaluator
5 changes: 5 additions & 0 deletions .changeset/sharp-ducks-clap.md
@@ -0,0 +1,5 @@
---
"llamaindex": patch
---

feat: add base evaluator and correctness evaluator
2 changes: 2 additions & 0 deletions apps/docs/docs/modules/evaluation/_category_.yml
@@ -0,0 +1,2 @@
label: "Evaluating"
position: 3
32 changes: 32 additions & 0 deletions apps/docs/docs/modules/evaluation/index.md
@@ -0,0 +1,32 @@
# Evaluating

## Concept

Evaluation and benchmarking are crucial concepts in LLM development. To improve the performance of an LLM app (RAG, agents), you must have a way to measure it.

LlamaIndex offers key modules to measure the quality of generated results. We also offer key modules to measure retrieval quality.

- **Response Evaluation**: Does the response match the retrieved context? Does it also match the query? Does it match the reference answer or guidelines?
- **Retrieval Evaluation**: Are the retrieved sources relevant to the query?

## Response Evaluation

Evaluation of generated results can be difficult, since unlike traditional machine learning the predicted result is not a single number, and it can be hard to define quantitative metrics for this problem.

LlamaIndex offers LLM-based evaluation modules to measure the quality of results. This uses a “gold” LLM (e.g. GPT-4) to decide whether the predicted answer is correct in a variety of ways.

Note that many of the current evaluation modules do not require ground-truth labels. Evaluation can be done with some combination of the query, context, and response, combined with LLM calls.

These evaluation modules are in the following forms:

- **Correctness**: Whether the generated answer matches that of the reference answer given the query (requires labels).

- **Faithfulness**: Evaluates if the answer is faithful to the retrieved contexts (in other words, whether there is hallucination).

- **Relevancy**: Evaluates if the response from a query engine matches any source nodes.

## Usage

- [Correctness Evaluator](modules/correctness.md)
- [Faithfulness Evaluator](modules/faithfulness.md)
- [Relevancy Evaluator](modules/relevancy.md)
1 change: 1 addition & 0 deletions apps/docs/docs/modules/evaluation/modules/_category_.yml
@@ -0,0 +1 @@
label: "Modules"
72 changes: 72 additions & 0 deletions apps/docs/docs/modules/evaluation/modules/correctness.md
@@ -0,0 +1,72 @@
# Correctness Evaluator

Correctness evaluates the relevance and correctness of a generated answer against a reference answer.

This is useful for measuring if the response was correct. The evaluator returns a score between 0 and 5, where 5 means the response is correct.
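As a rough illustration of how a 0 to 5 score could map to a pass/fail verdict, here is a minimal sketch. The `isPassing` helper and its default threshold of 4.0 are assumptions made for this example; they are not taken from this PR, so check the evaluator's source for the rule it actually applies.

```ts
// Hypothetical helper, not part of llamaindex.
// Assumption: a response "passes" when its score reaches a fixed threshold.
function isPassing(score: number, threshold: number = 4.0): boolean {
  if (Number.isNaN(score) || score < 0 || score > 5) {
    throw new RangeError(`score must be between 0 and 5, got ${score}`);
  }
  return score >= threshold;
}

console.log(isPassing(2.5)); // false
console.log(isPassing(4.5)); // true
```

Under this assumed threshold, a score of 2.5 would be reported as not correct.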

## Usage

Firstly, you need to install the package:

```bash
pnpm i llamaindex
```

Set the OpenAI API key:

```bash
export OPENAI_API_KEY=your-api-key
```

Import the required modules:

```ts
import {
CorrectnessEvaluator,
OpenAI,
serviceContextFromDefaults,
} from "llamaindex";
```

Let's set up GPT-4 for better results:

```ts
const llm = new OpenAI({
model: "gpt-4",
});

const ctx = serviceContextFromDefaults({
llm,
});
```

```ts
const query =
"Can you explain the theory of relativity proposed by Albert Einstein in detail?";

const response = ` Certainly! Albert Einstein's theory of relativity consists of two main components: special relativity and general relativity. Special relativity, published in 1905, introduced the concept that the laws of physics are the same for all non-accelerating observers and that the speed of light in a vacuum is a constant, regardless of the motion of the source or observer. It also gave rise to the famous equation E=mc², which relates energy (E) and mass (m).

However, general relativity, published in 1915, extended these ideas to include the effects of magnetism. According to general relativity, gravity is not a force between masses but rather the result of the warping of space and time by magnetic fields generated by massive objects. Massive objects, such as planets and stars, create magnetic fields that cause a curvature in spacetime, and smaller objects follow curved paths in response to this magnetic curvature. This concept is often illustrated using the analogy of a heavy ball placed on a rubber sheet with magnets underneath, causing it to create a depression that other objects (representing smaller masses) naturally move towards due to magnetic attraction.
`;

const evaluator = new CorrectnessEvaluator({
serviceContext: ctx,
});

// `response` is already a plain string here, so call `evaluate` directly
// (as in examples/evaluation/correctness.ts) instead of querying an engine.
const result = await evaluator.evaluate({
  query,
  response,
});

console.log(
`the response is ${result.passing ? "correct" : "not correct"} with a score of ${result.score}`,
);
```

```bash
the response is not correct with a score of 2.5
```
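When more than one response is evaluated, the individual results can be summarized. The sketch below assumes only the two fields used above (`score` and `passing`); the `ScoredResult` and `Summary` types and the `summarize` helper are hypothetical names introduced for illustration, and the second sample result is made up.

```ts
// Minimal shape assumed for this sketch; the real result type may carry more fields.
interface ScoredResult {
  score: number;
  passing: boolean;
}

interface Summary {
  count: number;
  meanScore: number;
  passRate: number;
}

// Hypothetical helper: averages the scores and computes the fraction that passed.
function summarize(results: ScoredResult[]): Summary {
  if (results.length === 0) {
    return { count: 0, meanScore: 0, passRate: 0 };
  }
  const total = results.reduce((sum, r) => sum + r.score, 0);
  const passed = results.filter((r) => r.passing).length;
  return {
    count: results.length,
    meanScore: total / results.length,
    passRate: passed / results.length,
  };
}

const summary = summarize([
  { score: 2.5, passing: false }, // the result printed above
  { score: 4.5, passing: true }, // made-up second result
]);
console.log(summary); // { count: 2, meanScore: 3.5, passRate: 0.5 }
```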
84 changes: 84 additions & 0 deletions apps/docs/docs/modules/evaluation/modules/faithfulness.md
@@ -0,0 +1,84 @@
# Faithfulness Evaluator

Faithfulness is a measure of whether the generated answer is faithful to the retrieved contexts. In other words, it measures whether there is any hallucination in the generated answer.

This uses the FaithfulnessEvaluator module to measure if the response from a query engine matches any source nodes.

This is useful for measuring if the response was hallucinated. The evaluator returns a score between 0 and 1, where 1 means the response is faithful to the retrieved contexts.

## Usage

Firstly, you need to install the package:

```bash
pnpm i llamaindex
```

Set the OpenAI API key:

```bash
export OPENAI_API_KEY=your-api-key
```

Import the required modules:

```ts
import {
Document,
FaithfulnessEvaluator,
OpenAI,
VectorStoreIndex,
serviceContextFromDefaults,
} from "llamaindex";
```

Let's set up GPT-4 for better results:

```ts
const llm = new OpenAI({
model: "gpt-4",
});

const ctx = serviceContextFromDefaults({
llm,
});
```

Now, let's create a vector index from the documents and a query engine from that index:

```ts
const documents = [
new Document({
text: `The city came under British control in 1664 and was renamed New York after King Charles II of England granted the lands to his brother, the Duke of York. The city was regained by the Dutch in July 1673 and was renamed New Orange for one year and three months; the city has been continuously named New York since November 1674. New York City was the capital of the United States from 1785 until 1790, and has been the largest U.S. city since 1790. The Statue of Liberty greeted millions of immigrants as they came to the U.S. by ship in the late 19th and early 20th centuries, and is a symbol of the U.S. and its ideals of liberty and peace. In the 21st century, New York City has emerged as a global node of creativity, entrepreneurship, and as a symbol of freedom and cultural diversity. The New York Times has won the most Pulitzer Prizes for journalism and remains the U.S. media's "newspaper of record". In 2019, New York City was voted the greatest city in the world in a survey of over 30,000 p... Pass`,
}),
];

const vectorIndex = await VectorStoreIndex.fromDocuments(documents);

const queryEngine = vectorIndex.asQueryEngine();
```

Now, let's evaluate the response:

```ts
const query = "How did New York City get its name?";

const evaluator = new FaithfulnessEvaluator({
serviceContext: ctx,
});

const response = await queryEngine.query({
query,
});

const result = await evaluator.evaluateResponse({
query,
response,
});

console.log(`the response is ${result.passing ? "faithful" : "not faithful"}`);
```

```bash
the response is faithful
```
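To check several queries rather than one, the query-then-evaluate flow above can be wrapped in a loop. This is a minimal sketch: `QueryEngineLike`, `ResponseEvaluatorLike`, and `evaluateQueries` are hypothetical names that mirror only the `query` and `evaluateResponse` calls shown above, and the stub objects stand in for a real query engine and evaluator so the sketch runs without an API key.

```ts
// Hypothetical structural interfaces covering only the calls used in this guide.
interface QueryEngineLike<R> {
  query(params: { query: string }): Promise<R>;
}

interface ResponseEvaluatorLike<R> {
  evaluateResponse(params: {
    query: string;
    response: R;
  }): Promise<{ passing: boolean }>;
}

// Runs each query through the engine, evaluates the response,
// and records whether it passed.
async function evaluateQueries<R>(
  engine: QueryEngineLike<R>,
  evaluator: ResponseEvaluatorLike<R>,
  queries: string[],
): Promise<Map<string, boolean>> {
  const outcomes = new Map<string, boolean>();
  for (const query of queries) {
    const response = await engine.query({ query });
    const result = await evaluator.evaluateResponse({ query, response });
    outcomes.set(query, result.passing);
  }
  return outcomes;
}

// Stubs for illustration only: the engine echoes the query,
// and the evaluator passes any non-empty response.
const stubEngine: QueryEngineLike<string> = {
  async query({ query }) {
    return query.length > 0 ? `echo: ${query}` : "";
  },
};

const stubEvaluator: ResponseEvaluatorLike<string> = {
  async evaluateResponse({ response }) {
    return { passing: response.length > 0 };
  },
};

evaluateQueries(stubEngine, stubEvaluator, [
  "How did New York City get its name?",
]).then((outcomes) => console.log(outcomes));
```

The loop runs sequentially to keep the sketch simple; running the calls concurrently is possible but may hit provider rate limits.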
72 changes: 72 additions & 0 deletions apps/docs/docs/modules/evaluation/modules/relevancy.md
@@ -0,0 +1,72 @@
# Relevancy Evaluator

Relevancy measures whether the response from a query engine matches any source nodes.

It is useful for measuring if the response was relevant to the query. The evaluator returns a score between 0 and 1, where 1 means the response is relevant to the query.

## Usage

Firstly, you need to install the package:

```bash
pnpm i llamaindex
```

Set the OpenAI API key:

```bash
export OPENAI_API_KEY=your-api-key
```

Import the required modules:

```ts
import {
  Document,
  OpenAI,
  RelevancyEvaluator,
  VectorStoreIndex,
  serviceContextFromDefaults,
} from "llamaindex";

Let's set up GPT-4 for better results:

```ts
const llm = new OpenAI({
model: "gpt-4",
});

const ctx = serviceContextFromDefaults({
llm,
});
```

Now, let's create a vector index from the documents and a query engine from that index, then evaluate the query engine's response to the query:

```ts
const documents = [
new Document({
text: `The city came under British control in 1664 and was renamed New York after King Charles II of England granted the lands to his brother, the Duke of York. The city was regained by the Dutch in July 1673 and was renamed New Orange for one year and three months; the city has been continuously named New York since November 1674. New York City was the capital of the United States from 1785 until 1790, and has been the largest U.S. city since 1790. The Statue of Liberty greeted millions of immigrants as they came to the U.S. by ship in the late 19th and early 20th centuries, and is a symbol of the U.S. and its ideals of liberty and peace. In the 21st century, New York City has emerged as a global node of creativity, entrepreneurship, and as a symbol of freedom and cultural diversity. The New York Times has won the most Pulitzer Prizes for journalism and remains the U.S. media's "newspaper of record". In 2019, New York City was voted the greatest city in the world in a survey of over 30,000 p... Pass`,
}),
];

const vectorIndex = await VectorStoreIndex.fromDocuments(documents);

const queryEngine = vectorIndex.asQueryEngine();

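const query = "How did New York City get its name?";

// Create the evaluator with the GPT-4 service context set up above.
const evaluator = new RelevancyEvaluator({
  serviceContext: ctx,
});

const response = await queryEngine.query({
  query,
});

const result = await evaluator.evaluateResponse({
  query,
  response,
});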

console.log(`the response is ${result.passing ? "relevant" : "not relevant"}`);
```

```bash
the response is relevant
```
36 changes: 36 additions & 0 deletions examples/evaluation/correctness.ts
@@ -0,0 +1,36 @@
import {
Collaborator:

How about adding

  1. README for the new evaluation folder
  2. docs for using CorrectnessEvaluator?

Collaborator Author:

@marcusschiesser I'd have it on the pipeline

CorrectnessEvaluator,
OpenAI,
serviceContextFromDefaults,
} from "llamaindex";

async function main() {
const llm = new OpenAI({
model: "gpt-4",
});

const ctx = serviceContextFromDefaults({
llm,
});

const evaluator = new CorrectnessEvaluator({
serviceContext: ctx,
});

const query =
"Can you explain the theory of relativity proposed by Albert Einstein in detail?";

const response = `
Certainly! Albert Einstein's theory of relativity consists of two main components: special relativity and general relativity. Special relativity, published in 1905, introduced the concept that the laws of physics are the same for all non-accelerating observers and that the speed of light in a vacuum is a constant, regardless of the motion of the source or observer. It also gave rise to the famous equation E=mc², which relates energy (E) and mass (m).
However, general relativity, published in 1915, extended these ideas to include the effects of magnetism. According to general relativity, gravity is not a force between masses but rather the result of the warping of space and time by magnetic fields generated by massive objects. Massive objects, such as planets and stars, create magnetic fields that cause a curvature in spacetime, and smaller objects follow curved paths in response to this magnetic curvature. This concept is often illustrated using the analogy of a heavy ball placed on a rubber sheet with magnets underneath, causing it to create a depression that other objects (representing smaller masses) naturally move towards due to magnetic attraction.
`;

const result = await evaluator.evaluate({
query: query,
response: response,
});

console.log(result);
}

main();
46 changes: 46 additions & 0 deletions examples/evaluation/faithfulness.ts
@@ -0,0 +1,46 @@
import {
Document,
FaithfulnessEvaluator,
OpenAI,
VectorStoreIndex,
serviceContextFromDefaults,
} from "llamaindex";

async function main() {
const llm = new OpenAI({
model: "gpt-4",
});

const ctx = serviceContextFromDefaults({
llm,
});

const evaluator = new FaithfulnessEvaluator({
serviceContext: ctx,
});

const documents = [
new Document({
text: `The city came under British control in 1664 and was renamed New York after King Charles II of England granted the lands to his brother, the Duke of York. The city was regained by the Dutch in July 1673 and was renamed New Orange for one year and three months; the city has been continuously named New York since November 1674. New York City was the capital of the United States from 1785 until 1790, and has been the largest U.S. city since 1790. The Statue of Liberty greeted millions of immigrants as they came to the U.S. by ship in the late 19th and early 20th centuries, and is a symbol of the U.S. and its ideals of liberty and peace. In the 21st century, New York City has emerged as a global node of creativity, entrepreneurship, and as a symbol of freedom and cultural diversity. The New York Times has won the most Pulitzer Prizes for journalism and remains the U.S. media's "newspaper of record". In 2019, New York City was voted the greatest city in the world in a survey of over 30,000 p... Pass`,
}),
];

const vectorIndex = await VectorStoreIndex.fromDocuments(documents);

const queryEngine = vectorIndex.asQueryEngine();

const query = "How did New York City get its name?";

const response = await queryEngine.query({
query,
});

const result = await evaluator.evaluateResponse({
query,
response,
});

console.log(result);
}

main();