# HintEval

HintEval is a framework for generating and evaluating hints. A hint is a subtle clue that guides users toward the correct answer without revealing it directly. As the first tool of its kind, HintEval lets users create hints and assess them from several perspectives.

In [None]:
from IPython.display import clear_output

!pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0
!pip install hinteval
clear_output()

⚠️ **Important:**
After installing the dependencies above, you need to **restart the session** in Google Colab.

Go to the menu:
`Runtime` → `Restart session`

After restarting, re-run the cells below to continue.

# Generate a Synthetic Hint Dataset

This tutorial provides step-by-step guidance on how to generate a synthetic hint dataset using large language models via the TogetherAI platform. To proceed, make sure you have an active API key for TogetherAI.


In [None]:
# @title Pipeline Argument

api_key = '' # @param {type:"string"}
base_url = 'https://api.together.xyz/v1' # @param {type:"string"}
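
If you would rather not paste the key into the notebook, one option is to read it from an environment variable. The sketch below is only an illustration: the helper `read_api_key` and the variable name `TOGETHER_API_KEY` are assumptions made for this example, not something HintEval requires.

```python
import os


def read_api_key(env_var: str = "TOGETHER_API_KEY") -> str:
    """Return the key stored in `env_var`, or an empty string if it is unset.

    Both the helper and the default variable name are hypothetical.
    """
    return os.environ.get(env_var, "")


api_key = read_api_key()
base_url = "https://api.together.xyz/v1"
```

An empty `api_key` here means the variable was not set, so it is worth checking before running the cells that call the API.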

## Question/Answer Pairs

First, we'll gather a collection of question/answer pairs to use as a foundation for generating `Question/Answer/Hint` triples. We'll load 10 questions from the [WebQuestions](https://aclanthology.org/D13-1160.pdf) dataset using the [🤗datasets](https://pypi.org/project/datasets/) library.


In [None]:
from datasets import load_dataset

webq = load_dataset("Stanford/web_questions", split='test')
question_answers = webq.select_columns(['question', 'answers'])[10:20]
qa_pairs = zip(question_answers['question'], question_answers['answers'])

At this point, we have a set of question/answer pairs ready for creating synthetic `Question/Answer/Hint` instances.
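
One detail worth knowing: `zip` returns a single-use iterator, so `qa_pairs` can be looped over only once. The sketch below shows the behaviour on toy data (the questions and answers are made-up stand-ins) and how wrapping the call in `list` makes the pairs reusable.

```python
# Toy stand-ins for the 'question' and 'answers' columns.
questions = ["q1", "q2"]
answers = [["a1"], ["a2", "a2b"]]

lazy_pairs = zip(questions, answers)
first_pass = list(lazy_pairs)   # consumes the iterator
second_pass = list(lazy_pairs)  # nothing is left to iterate over

# Materialising the pairs keeps them available for later cells.
qa_pairs = list(zip(questions, answers))

print(len(first_pass), len(second_pass), len(qa_pairs))
```

If you re-run the dataset creation cell without re-running the cell above it, an exhausted iterator would produce an empty subset, which is why materialising the pairs can be convenient.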


## Dataset Creation

Next, we'll use HintEval's `Dataset` class to create a new dataset called `synthetic_hint_dataset`, which will include 10 question/answer pairs within a subset named `entire`.


In [None]:
from hinteval import Dataset
from hinteval.cores import Subset, Instance

dataset = Dataset('synthetic_hint_dataset')
subset = Subset('entire')

for q_id, (question, answers) in enumerate(qa_pairs, 1):
    instance = Instance.from_strings(question, answers, [])
    subset.add_instance(instance, f'id_{q_id}')

dataset.add_subset(subset)
dataset.prepare_dataset(fill_question_types=True)

## Hint Generation

Now, we can generate 5 hints for each question using HintEval's `AnswerAware` model. For this example, we will use the [Meta LLaMA-3.1-70b-Instruct-Turbo](https://www.llama.com/) model from TogetherAI.


In [None]:
from hinteval.model import AnswerAware

generator = AnswerAware('meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo',
                        api_key, base_url, num_of_hints=5, enable_tqdm=True)
generator.generate(dataset['entire'].get_instances())
pass

> Depending on the LLM provider you are using, you may need to adjust the `model` argument and other parameters passed to the `AnswerAware` constructor. See the [Model](https://hinteval.readthedocs.io/en/latest/concepts/model.html) concept page and the [References](https://hinteval.readthedocs.io/en/latest/references/model.html) for more information.


## Exporting the Dataset

Once the hints are generated, we can export the synthetic hint dataset to a pickle file.

In [None]:
dataset.store('./synthetic_hint_dataset.pickle')

## Viewing the Hints

Finally, let's view the hints generated for the third question in the dataset.

In [None]:
dataset = Dataset.load('./synthetic_hint_dataset.pickle')

third_question = dataset['entire'].get_instance('id_3')
print(f'Question: {third_question.question.question}')
print(f'Answer: {third_question.answers[0].answer}')
print()
for idx, hint in enumerate(third_question.hints, 1):
    print(f'Hint {idx}: {hint.hint}')


Example output:

```
Question: who is governor of ohio 2011?
Answer: John Kasich

Hint 1: The answer is a Republican politician who served as the 69th governor of the state.
Hint 2: This person was a member of the U.S. House of Representatives for 18 years before becoming governor.
Hint 3: The governor was known for his conservative views and efforts to reduce government spending.
Hint 4: During their term, they implemented several reforms related to education, healthcare, and the economy.
Hint 5: This governor served two consecutive terms, from 2011 to 2019, and ran for the U.S. presidency in 2016.
```

Because of the generative nature of large language models, the output hints may vary.

# Evaluating Your Hint Dataset

Once your hint dataset is ready—whether you've created your own or used the synthetic dataset generation module—it's time to evaluate the hints. This guide will help you set up HintEval quickly so you can focus on improving your hint generation pipelines.

> By default, these metrics use TogetherAI's API to compute the score. If you prefer to run them locally, you can set `api_key` to `None`. You can also try other platforms by changing the `base_url` accordingly.

Let's start by loading the data.


## The Data

For this tutorial, we'll use the synthetic dataset generated in the synthetic dataset generation module. Alternatively, you can load a preprocessed dataset using the [Dataset.download_and_load_dataset()](https://hinteval.readthedocs.io/en/latest/references/dataset.html#hinteval.cores.dataset.dataset.Dataset.download_and_load_dataset) function. Each subset of the dataset includes a number of `Instance` objects, each of which contains:
- **question**: The question to evaluate along with its answers and hints.
- **answers**: A list of correct answers for the question.
- **hints**: A list of hints to help arrive at the correct answers.

In [None]:
from hinteval import Dataset

dataset = Dataset.load('./synthetic_hint_dataset.pickle')

## Metrics

HintEval provides several metrics to evaluate different aspects of the hints:

1. **Relevance**: Measures how relevant the hints are to the question.
2. **Readability**: Assesses the readability of the hints and questions.
3. **Convergence**: Evaluates how well the hints narrow down potential answers to the question.
4. **Familiarity**: Measures how common or well-known the information in the hints is.
5. **AnswerLeakage**: Determines how much the hints reveal the correct answers.

Each metric contains various methods, which you can explore in the [metrics guide](https://hinteval.readthedocs.io/en/latest/concepts/metrics/index.html).
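
To make the idea of answer leakage concrete, here is a minimal sketch of a purely lexical check: the fraction of answer tokens that also appear in the hint. It is a simplified stand-in written for this tutorial, not HintEval's implementation and not the contextual-embedding method used later; the helper names `tokenize` and `lexical_leakage` are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_leakage(hint: str, answer: str) -> float:
    """Fraction of answer tokens that also occur in the hint (0.0 to 1.0)."""
    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokenize(hint)) / len(answer_tokens)


safe_hint = "This person served in the U.S. House of Representatives for 18 years."
leaky_hint = "Governor Kasich signed the bill in 2011."

print(lexical_leakage(safe_hint, "John Kasich"))   # 0.0
print(lexical_leakage(leaky_hint, "John Kasich"))  # 0.5
```

A token-overlap score like this misses paraphrases, which is the kind of gap an embedding-based similarity is meant to close.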

Let's import the metrics:

In [None]:
from hinteval.evaluation.relevance import Rouge
from hinteval.evaluation.readability import MachineLearningBased
from hinteval.evaluation.convergence import LlmBased
from hinteval.evaluation.familiarity import WordFrequency
from hinteval.evaluation.answer_leakage import ContextualEmbeddings

Here we’re using five metrics, but what do they represent?

1. **Relevance (_Rouge_)**: Measures the relevance of the hints to the question using the ROUGE-L algorithm.
2. **Readability (_MachineLearningBased_)**: Uses a Random Forest algorithm to measure the readability of the hints and questions.
3. **Convergence (_LlmBased_)**: Assesses how well the hints help eliminate incorrect answers using the Meta LLaMA-3.1-70b-Instruct-Turbo model.
4. **Familiarity (_WordFrequency_)**: Evaluates the familiarity of the information in the hints, questions, and answers based on word frequency.
5. **AnswerLeakage (_ContextualEmbeddings_)**: Measures how much the hints reveal the answers by calculating the similarity between the hints and answers using contextual embeddings.

To explore other metrics, check the [metrics guide](https://hinteval.readthedocs.io/en/latest/concepts/metrics/index.html).
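
For intuition about the ROUGE-L score behind the relevance metric, the sketch below computes it from the longest common subsequence (LCS) of two whitespace-tokenised strings. It is an illustration under simplifying assumptions (lowercasing, whitespace tokens, balanced F1) and is not the code HintEval runs; `lcs_length` and `rouge_l_f1` are names chosen for this example.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, 1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Balanced F1 of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
print(round(score, 3))  # 0.833
```

The two sentences share the subsequence "the cat on the mat" (5 of 6 tokens on each side), so precision and recall are both 5/6.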

## Evaluation

First, extract the `question`, `hints`, and `answers` from each instance in the dataset and store them in separate lists.


In [None]:
instances = dataset['entire'].get_instances()
questions = [instance.question for instance in instances]
# Flatten the per-instance answers and hints into single lists.
answers = [answer for instance in instances for answer in instance.answers]
hints = [hint for instance in instances for hint in instance.hints]

To evaluate the dataset, call the `evaluate` method for each metric.


In [None]:
Rouge('rougeL', enable_tqdm=True).evaluate(instances)
MachineLearningBased('random_forest', enable_tqdm=True).evaluate(questions + hints)
LlmBased('llama-3-70b', together_ai_api_key=api_key, enable_tqdm=True).evaluate(instances)
WordFrequency(enable_tqdm=True).evaluate(questions + hints + answers)
ContextualEmbeddings(enable_tqdm=True).evaluate(instances)
pass

There you have it—all the evaluation scores you need.

> Depending on the LLM provider you're using, you may need to configure parameters such as `model`, `max_tokens`, and `batch_size` based on the provider’s rate limits.

## Viewing the Evaluation Metrics

Finally, let's view the metrics evaluated for the second hint of the third question in the dataset.

In [None]:
third_question = dataset['entire'].get_instance('id_3')
second_hint = third_question.hints[1]

print(f'Question: {third_question.question.question}')
print(f'Answer: {third_question.answers[0].answer}')
print(f'Second Hint: {second_hint.hint}')
print()

for metric in second_hint.metrics:
    print(f'{metric}: {second_hint.metrics[metric].value}')
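
To summarise a whole question rather than a single hint, you could average each metric over its hints. The sketch below assumes every metric value is numeric and relies only on the access pattern shown above (`hint.metrics[name].value`). The helper `average_metrics` is hypothetical, and the toy hints built with `SimpleNamespace` are stand-ins for real HintEval objects.

```python
from collections import defaultdict
from statistics import mean
from types import SimpleNamespace


def average_metrics(hints) -> dict[str, float]:
    """Average each metric's value across the given hints."""
    values = defaultdict(list)
    for hint in hints:
        for name in hint.metrics:
            values[name].append(hint.metrics[name].value)
    return {name: mean(vals) for name, vals in values.items()}


# Stand-in hints with made-up metric names and values.
toy_hints = [
    SimpleNamespace(metrics={"metric-a": SimpleNamespace(value=0.25),
                             "metric-b": SimpleNamespace(value=1)}),
    SimpleNamespace(metrics={"metric-a": SimpleNamespace(value=0.75),
                             "metric-b": SimpleNamespace(value=0)}),
]

averages = average_metrics(toy_hints)
print(averages)
```

With the evaluated dataset, the equivalent call would be `average_metrics(third_question.hints)`, provided all stored values are numbers.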

## Exporting the Results

If you want to further analyze the results, you can export the evaluated dataset to a JSON file, including the scores.

In [None]:
dataset.store_json('./evaluated_synthetic_hint_dataset.json')

> The evaluated scores and metrics are automatically stored in the dataset. Saving the dataset will also save the scores.
