# Zeno Build Tutorial 3: Evaluating Text Generation

In this tutorial, we'll show how to use
[Zeno Build](https://github.com/zeno-ml/zeno-build/) to evaluate generated text.
We'll assume that you've already read the
[previous tutorial](02_inference.ipynb) and have a basic understanding of
how to use Zeno Build to visualize results and perform inference.

Specifically, we will use models from [Hugging Face](https://huggingface.com) and [OpenAI](https://openai.com) to translate French into English on data from [TED Talks](https://ted.org). To evaluate the generated text, we will use the [Critique API](https://docs.inspiredco.ai/critique/).

## Setup

Make sure that Zeno Build is installed (`pip install zeno-build`), then run the necessary imports.

```python
import pandas as pd
from datasets import load_dataset

from zeno_build.evaluation.text_features.length import input_length, output_length
from zeno_build.evaluation.text_metrics.critique import (
    avg_bert_score,
    avg_chrf,
    bert_score,
    chrf,
)
from zeno_build.experiments.experiment_run import ExperimentRun
from zeno_build.models.lm_config import LMConfig
from zeno_build.models.text_generate import generate_from_text_prompt
from zeno_build.reporting.visualize import visualize
```

This example uses OpenAI and Critique, so also make sure that you:
* Obtain an [OpenAI API key]() and set it in the environment variable `OPENAI_API_KEY`
* Obtain an [Inspired Cognition API key](https://dashboard.inspiredco.ai) and set it in the environment variable `INSPIREDCO_API_KEY`

The best way to set these variables is to create a file called `.env` in the same directory as this notebook, and put the following lines in it:

```
OPENAI_API_KEY=<your key here>
INSPIREDCO_API_KEY=<your key here>
```
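
If the variables are not picked up automatically in your environment, they can be loaded by hand. The following is a minimal sketch using only the standard library; the helper name `load_env_file` is hypothetical and not part of Zeno Build, and the `python-dotenv` package's `load_dotenv` is a common alternative.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Hypothetical helper: copy KEY=VALUE pairs from a file into os.environ."""
    env_path = Path(path)
    if not env_path.exists():
        return {}
    loaded = {}
    for line in env_path.read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip().strip("'\"")
        # Variables already set in the shell take precedence.
        os.environ.setdefault(key, value)
        loaded[key] = os.environ[key]
    return loaded


load_env_file()
for name in ("OPENAI_API_KEY", "INSPIREDCO_API_KEY"):
    print(name, "is set" if os.environ.get(name) else "is missing")
```

Running this before the inference cells makes a missing key visible early, rather than partway through a batch of API requests.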

## Obtaining Data

Next we'll process the necessary data. We'll use the [ted_multi](https://huggingface.co/datasets/ted_multi) dataset from Hugging Face. To keep the run fast, we'll take only the first 250 validation examples that have both a French and an English version.

```python
dataset = load_dataset("ted_multi", split="validation")
srcs, trgs = [], []
src_language, trg_language = "fr", "en"
for datum in dataset:
    # Skip examples that lack either the source or the target language.
    if (
        src_language not in datum["translations"]["language"]
        or trg_language not in datum["translations"]["language"]
    ):
        continue
    # "language" and "translation" are parallel lists, so look up by index.
    src_index = datum["translations"]["language"].index(src_language)
    trg_index = datum["translations"]["language"].index(trg_language)
    srcs.append(datum["translations"]["translation"][src_index])
    trgs.append(datum["translations"]["translation"][trg_index])
    # Stop once we have collected 250 sentence pairs.
    if len(srcs) >= 250:
        break
# Zeno Build expects the input in "text" and the reference in "label".
df = pd.DataFrame({"text": srcs, "label": trgs})
```
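
To make the record layout concrete, here is a small sketch of the same lookup on a single hand-written record. The record `toy` and the helper `extract_pair` are invented for illustration; they only mirror the `translations` structure that the loop above reads.

```python
def extract_pair(datum, src_language="fr", trg_language="en"):
    """Return (source, target) text, or None if either language is absent."""
    languages = datum["translations"]["language"]
    if src_language not in languages or trg_language not in languages:
        return None
    texts = datum["translations"]["translation"]
    # The two lists are parallel: position i in one matches position i in the other.
    return texts[languages.index(src_language)], texts[languages.index(trg_language)]


# Invented example record with the same shape as a ted_multi entry.
toy = {
    "translations": {
        "language": ["de", "en", "fr"],
        "translation": ["Danke.", "Thank you.", "Merci."],
    }
}
print(extract_pair(toy))
```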

## Running Inference

Next, we generate outputs. First, we define the prompt template, in which `{{text}}` marks where each source sentence is inserted:

```python
prompt_template = (
    "Translate this sentence into English:\n\n"
    "Sentence: {{text}}\n"
    "English: "
)
```
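
Zeno Build fills in the template internally, but it can help to see what a model actually receives. The sketch below is an illustration only: `fill_template` is a hypothetical helper that substitutes `{{name}}` placeholders by plain string replacement, which may differ from how the library does it.

```python
prompt_template = (
    "Translate this sentence into English:\n\n"
    "Sentence: {{text}}\n"
    "English: "
)


def fill_template(template, variables):
    """Hypothetical helper: replace each {{name}} placeholder with its value."""
    filled = template
    for name, value in variables.items():
        filled = filled.replace("{{" + name + "}}", value)
    return filled


print(fill_template(prompt_template, {"text": "Bonjour tout le monde."}))
```

Ending the prompt with `English: ` nudges the model to continue with the translation itself rather than with commentary.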

We then use this template to perform inference with several language models, as we did in the previous tutorial. The only difference is that we raise the maximum number of tokens and the temperature, settings better suited to generating longer text.

```python
all_results = []
# Run the same prompt through each model configuration.
for lm_config in [
    LMConfig(provider="openai_chat", model="gpt-3.5-turbo"),
    LMConfig(provider="huggingface", model="gpt2"),
    LMConfig(provider="huggingface", model="gpt2-xl"),
]:
    predictions = generate_from_text_prompt(
        [{"text": x} for x in srcs],
        prompt_template=prompt_template,
        model_config=lm_config,
        temperature=0.3,
        max_tokens=200,
        top_p=1.0,
        requests_per_minute=400,
    )
    result = ExperimentRun(
        name=lm_config.model,
        parameters={"provider": lm_config.provider, "model": lm_config.model},
        # Keep only the first line of each generation as the translation.
        predictions=[x.strip().split("\n")[0] for x in predictions],
    )
    all_results.append(result)
```
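
The expression `x.strip().split("\n")[0]` matters because models, especially small ones such as `gpt2`, often keep writing after the translation. A minimal sketch of its effect, using an invented raw output rather than a real model response:

```python
def first_line(generation):
    """Same expression as in the loop above: trim whitespace, keep the first line."""
    return generation.strip().split("\n")[0]


# Invented example of a model that continues past the requested translation.
raw = "\n Hello everyone.\nSentence: Merci beaucoup.\nEnglish: Thank you very much."
print(first_line(raw))
```

Without this step, the extra lines would be scored against the reference and would lower the metrics for reasons unrelated to translation quality.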

## Evaluating Outputs and Visualizing

Finally, we'll perform evaluation and visualization. Instead of using exact match (accuracy) to evaluate the outputs as we did in the previous tutorial, we'll use [chrF](https://aclanthology.org/W15-3049/) (a measure of character-level string similarity between the gold-standard output and the generated output) and [BERTScore](https://arxiv.org/abs/1904.09675) (a measure of semantic similarity between the two outputs).
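
To give some intuition for what a character n-gram F-score measures, here is a simplified sketch. It is not the Critique implementation: `simple_chrf` is a hypothetical function, and dropping spaces, using n-grams up to length 6, setting `beta=2.0`, and returning a value between 0 and 1 are all assumptions made for illustration. Scores returned by Critique may differ in scale and detail.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of length n, dropping spaces for simplicity."""
    chars = text.replace(" ", "")
    return Counter(chars[i : i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over averaged character n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        # Skip orders for which either string is too short to have n-grams.
        if not hyp or not ref:
            continue
        # Counter intersection keeps the minimum count of each shared n-gram.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta > 1 weights recall more heavily than precision.
    return (1 + beta**2) * p * r / (beta**2 * p + r)


print(simple_chrf("Thank you very much.", "Thank you so much."))
```

Because it works on characters rather than whole words, a score of this kind gives partial credit for near misses such as different inflections of the same word.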

```python
functions = [
    output_length,
    input_length,
    chrf,
    avg_chrf,
    bert_score,
    avg_bert_score,
]

visualize(
    df,
    trgs,
    all_results,
    "text-classification",
    "text",
    functions,
    zeno_config={"cache_path": "zeno_cache"},
)
```

## Next Steps

This is the end of the tutorial series for now!
Next you can click over to the [examples](../../examples/) directory to see some more end-to-end examples of how to use Zeno Build to run experiments.