# Uncertainty Estimation for Generation Tasks

In this notebook, we explore uncertainty estimation (UE) techniques for two tasks: out-of-distribution (OOD) detection in text summarization and selective generation in question answering (QA). For the text summarization task, we experiment with sequence-to-sequence models, evaluating the BART model on the AESLC dataset. We investigate state-of-the-art UE methods, including Maximum Sequence Probability (MSP) and Mahalanobis Distance (MD).

In the QA task, we focus on selective generation using LLMs. Here, we evaluate the Qwen-2.5 model on the CoQA dataset and examine several state-of-the-art UE approaches for selective generation, including MSP, Lexical Similarity, and Semantic Entropy. These methods aim to enhance the model's ability to abstain from generating low-quality responses.

In [None]:
!pip install transformers==4.49 datasets==2.15.0 "accelerate>=0.20.1"
!pip install omegaconf==2.3.0
!pip install rouge_score



In [None]:
import torch
import random, os
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader

from scipy.stats import rankdata
from sklearn.preprocessing import KBinsDiscretizer

def seed_everything(seed: int):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # benchmark mode picks kernels non-deterministically

# Sequence-to-Sequence

We use two widely used abstractive summarization datasets: AESLC serves as the in-distribution (ID) data and XSum as the out-of-distribution (OOD) data. For this task, we employ the standard encoder-decoder BART model fine-tuned on the AESLC dataset.



AESLC: R. Zhang and J. Tetreault, [This Email Could Save Your Life: Introducing the Task of Email Subject Line Generation](https://aclanthology.org/P19-1043/), in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 446–456.

XSUM: S. Narayan, S. B. Cohen, and M. Lapata, [Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization](https://aclanthology.org/D18-1206), in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 1797–1807.

## Out-Of-Distribution Detection

Artem Vazhentsev, Akim Tsvigun, Roman Vashurin, Sergey Petrakov, Daniil Vasilev, Maxim Panov, Alexander Panchenko, and Artem Shelmanov. 2023. [Efficient Out-of-Domain Detection for Sequence to Sequence Models](https://aclanthology.org/2023.findings-acl.93/). In Findings of the Association for Computational Linguistics: ACL 2023, pages 1430–1454, Toronto, Canada. Association for Computational Linguistics.

### Create Dataset with OOD

In [None]:
from omegaconf.dictconfig import DictConfig
from datasets import load_dataset, Dataset, concatenate_datasets

In [None]:
data = load_dataset('aeslc', verification_mode='no_checks')
train_instances, dev_instances, test_instances = data['train'], data['validation'], data['test']



In [None]:
test_instances[0]

{'email_body': "Phillip,   Could you please do me a favor?\nI would like  to read your current title policy to see what it says about easements.\nYou  should have received a copy during your closing.\nI don't know how many  pages it will be but let me know how you want to handle getting a copy  made.\nI'll be happy to make the copy, or whatever makes it easy for  you.\nThanks,\n",
 'subject_line': 'Huntley/question\n'}

In [None]:
data_ood = load_dataset('xsum', verification_mode='no_checks')
test_instances_ood = data_ood['test']

In [None]:
test_instances_ood[0]

{'document': 'Prison Link Cymru had 1,099 referrals in 2015-16 and said some ex-offenders were living rough for up to a year before finding suitable accommodation.\nWorkers at the charity claim investment in housing would be cheaper than jailing homeless repeat offenders.\nThe Welsh Government said more people than ever were getting help to address housing problems.\nChanges to the Housing Act in Wales, introduced in 2015, removed the right for prison leavers to be given priority for accommodation.\nPrison Link Cymru, which helps people find accommodation after their release, said things were generally good for women because issues such as children or domestic violence were now considered.\nHowever, the same could not be said for men, the charity said, because issues which often affect them, such as post traumatic stress disorder or drug dependency, were often viewed as less of a priority.\nAndrew Stevens, who works in Welsh prisons trying to secure housing for prison leavers, said the

In [None]:
train_instances = train_instances.select(range(200))
test_instances = test_instances.select(range(100))
test_instances_ood = test_instances_ood.select(range(len(test_instances)))

In [None]:
test_instances = test_instances.add_column('label_ood', [0] * len(test_instances))
test_instances_ood = test_instances_ood.add_column('label_ood', [1] * len(test_instances_ood))

test_instances_ood = test_instances_ood.rename_column("document", "email_body")
test_instances_ood = test_instances_ood.rename_column("summary", "subject_line")

test_with_ood = concatenate_datasets([test_instances, test_instances_ood])
test_with_ood = test_with_ood.remove_columns(
    [
        col
        for col in test_with_ood.column_names
        if col not in ["email_body", "subject_line", "label_ood"]
    ]
)

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoModel, AutoTokenizer

model_path = 'Aktsvigun/bart-base_aeslc_42'
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_path
)
model.cuda()
model.eval()
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base", truncation=True, padding=True, max_length=512)

In [None]:
model

BartForConditionalGeneration(
  (model): BartModel(
    (shared): BartScaledWordEmbedding(50265, 768, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): BartScaledWordEmbedding(50265, 768, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768)
      (layers): ModuleList(
        (0-5): 6 x BartEncoderLayer(
          (self_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_lay

In [None]:
def tokenize_data(
    data,
    tokenizer,
    document_name="document",
    label_name="",
    batched=True,
    padding=True,
):
    def tokenize_fn(instances):
        encoded = tokenizer(
            instances[document_name],
            truncation=True,
            padding=padding,
            max_length=512,
        )
        if label_name in instances:
            with tokenizer.as_target_tokenizer():
                labels = tokenizer(
                    instances[label_name],
                    truncation=True,
                    padding=padding,
                    max_length=512,
                )

            encoded["labels"] = labels["input_ids"]
        return encoded

    columns_to_remove = [x for x in data.features.keys() if x != "labels"]

    return data.map(
        tokenize_fn,
        batched=batched,
        remove_columns=columns_to_remove,
        load_from_cache_file=False,
    )

In [None]:
from transformers import DataCollatorWithPadding

device = 'cuda'

train_data = tokenize_data(
    data=train_instances, tokenizer=tokenizer, document_name="email_body"
)
train_loader = DataLoader(
        train_data,
        shuffle=False,
        batch_size=1,
        collate_fn=DataCollatorWithPadding(
            tokenizer=tokenizer
        ),
        pin_memory=0,
)

data_test_with_ood = tokenize_data(
    data=test_with_ood, tokenizer=tokenizer, document_name="email_body", label_name="subject_line"
)
test_loader = DataLoader(
        data_test_with_ood,
        shuffle=False,
        batch_size=1,
        collate_fn=DataCollatorWithPadding(
            tokenizer=tokenizer
        ),
        pin_memory=0,
)



`as_target_tokenizer` is deprecated and will be removed in v5 of Transformers. You can tokenize your labels by using the argument `text_target` of the regular `__call__` method (either in the same call as your input texts if you use the same keyword arguments, or in a separate call.



### Model Inference and Embeddings Extraction

We use two embedding extraction strategies: the last hidden state of the *encoder* averaged over non-padding tokens, and the last hidden state of the *decoder* averaged over all generated tokens.
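
As a minimal sketch of the encoder-side strategy, the snippet below averages hidden states over non-padding positions with NumPy. The toy arrays and the helper name `masked_mean` are illustrative assumptions; the notebook's own `get_embeddings` function does the equivalent on PyTorch tensors.

```python
import numpy as np

def masked_mean(hidden_states, attention_mask):
    """Average hidden states over positions where attention_mask == 1.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len).
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = mask.sum(axis=1)  # number of real tokens per example, shape (batch, 1)
    return summed / counts

# Toy batch (assumed values): 2 sequences, 3 positions, 2 hidden dimensions.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],      # no padding
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean(hidden, mask)
print(pooled)
```

The padded position in the first sequence carries large values on purpose: it does not affect the pooled vector because the mask zeroes it out before averaging.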

In [None]:
def get_embeddings(output, batch, num_return_sequences):
    encoder_embeddings = (output.encoder_hidden_states[-1] * batch["attention_mask"][:, :, None]).sum(1) / batch["attention_mask"].sum(-1)[:, None]
    encoder_embeddings = encoder_embeddings.cpu().detach()

    decoder_hidden_states = torch.stack([torch.stack(hidden) for hidden in output.decoder_hidden_states])
    last_decoder_hidden_states = decoder_hidden_states[1:, -1, :, 0]
    decoder_embeddings = last_decoder_hidden_states.mean(dim=(0))[::num_return_sequences].cpu().detach()
    return encoder_embeddings, decoder_embeddings


def inference(model, dataloader, max_length=20, num_return_sequences=1, ignore_pad=True):
    encoder_hiddens = []
    decoder_hiddens = []
    answers = []
    probs = []
    possible_input_keys = ["input_ids", "attention_mask"]

    for batch in tqdm(dataloader):
        torch.cuda.empty_cache()
        batch = {k: v.to(device) for k, v in batch.items() if k in possible_input_keys}
        output = model.generate(
                        **batch,
                        max_length=max_length,
                        min_length=3,
                        output_scores=True,
                        return_dict_in_generate=True,
                        num_beams=num_return_sequences,
                        output_hidden_states=True,
                        num_return_sequences=num_return_sequences,
                    )

        encoder_embeddings, decoder_embeddings = get_embeddings(output, batch, num_return_sequences)
        encoder_hiddens.append(encoder_embeddings)
        decoder_hiddens.append(decoder_embeddings)

        probs.append(torch.cat(output.scores).log_softmax(-1).max(-1).values.sum().item())
        answers.extend(output.sequences.cpu().detach())

    train_embeddings_decoder = torch.cat(decoder_hiddens)
    train_embeddings = torch.cat(encoder_hiddens)
    probs = np.array(probs)

    output = (train_embeddings, train_embeddings_decoder, probs, answers)

    return output

In [None]:
train_embeddings, train_embeddings_decoder, _, _ = inference(model, train_loader)
test_embeddings, test_embeddings_decoder, probs, answers = inference(model, test_loader)

100%|██████████| 200/200 [00:12<00:00, 15.66it/s]
100%|██████████| 200/200 [00:13<00:00, 15.36it/s]


### Model Performance

Here we will use the ROUGE score for model evaluation.

ROUGE-N measures the number of matching n-grams between the model-generated text and a human-produced reference.

ROUGE-L is based on the longest common subsequence (LCS) between our model output and reference, i.e. the longest sequence of words (not necessarily consecutive, but still in order) that is shared between both.

https://medium.com/nlplanet/two-minutes-nlp-learn-the-rouge-metric-by-examples-f179cc285499
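
To make the two definitions concrete, here is a simplified sketch of ROUGE-N and ROUGE-L F-measures on whitespace tokens. The function names and the example strings are assumptions for illustration; the `rouge_score` package used in this notebook additionally normalizes the text and can apply stemming, so its values may differ.

```python
from collections import Counter

def rouge_n_f1(prediction, reference, n=1):
    """F-measure over overlapping n-grams (whitespace tokens, no stemming)."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    pred, ref = ngrams(prediction.split()), ngrams(reference.split())
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l_f1(prediction, reference):
    """F-measure based on the longest common subsequence of tokens."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

prediction = "meeting moved to friday"
reference = "friday meeting moved"
print(rouge_n_f1(prediction, reference, n=1))  # all three reference words are matched
print(rouge_n_f1(prediction, reference, n=2))  # only "meeting moved" is a shared bigram
print(rouge_l_f1(prediction, reference))       # LCS is "meeting moved"
```

ROUGE-1 ignores word order, so it is highest here; ROUGE-L is lower because "friday" appears in a different position and cannot be part of the common subsequence.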

In [None]:
answers_text = tokenizer.batch_decode(answers, skip_special_tokens=True)
labels = test_with_ood['subject_line']

In [None]:
from datasets import load_metric

n_ood_samples = len(test_instances_ood)

is_zeroword = np.zeros(len(labels), dtype=bool)
is_uniword = np.zeros(len(labels), dtype=bool)

rouge = load_metric("rouge")
rouges = rouge.compute(
    predictions=answers_text,
    references=labels,
    use_stemmer=True,
    use_aggregator=False,
)
metrics = np.array([[x.fmeasure for x in value] for value in rouges.values()])[:3]
# Substitute invalid observations with nans
metrics[0][is_zeroword] = metrics[2][is_zeroword] = np.nan
metrics[1][is_zeroword | is_uniword] = np.nan
metrics_id = {
    "ROUGE-1": metrics[0][:n_ood_samples].mean(),
    "ROUGE-2": metrics[1][:n_ood_samples].mean(),
    "ROUGE-L": metrics[2][:n_ood_samples].mean(),
}

metrics_ood = {
    "ROUGE-1": metrics[0][n_ood_samples:].mean(),
    "ROUGE-2": metrics[1][n_ood_samples:].mean(),
    "ROUGE-L": metrics[2][n_ood_samples:].mean(),
}

In [None]:
for metric in metrics_id.keys():
    print(f'ID {metric}: ', metrics_id[metric])
print()

for metric in metrics_ood.keys():
    print(f'OOD {metric}: ', metrics_ood[metric])

ID ROUGE-1:  0.3666580935404465
ID ROUGE-2:  0.20200928603560184
ID ROUGE-L:  0.3520371785077668

OOD ROUGE-1:  0.09050766795212371
OOD ROUGE-2:  0.020749823829087233
OOD ROUGE-L:  0.08228105578387318


### Calculating UE

#### Maximum Sequence Probability (MSP)

\begin{equation}
  \mathrm{MSP}(x; \theta) = 1 - P(y \mid x, \theta).
\end{equation}
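
A minimal sketch of this formula, assuming hypothetical per-token probabilities for two generated sequences. The sequence probability is the product of the token probabilities, which is evaluated in log space for numerical stability.

```python
import math

# Hypothetical per-token probabilities P(y_t | y_<t, x) for two generated sequences.
confident_tokens = [0.9, 0.8, 0.95]
uncertain_tokens = [0.4, 0.3, 0.5]

def sequence_log_prob(token_probs):
    """Sum of token log-probabilities, i.e. log P(y | x)."""
    return sum(math.log(p) for p in token_probs)

def msp_uncertainty(token_probs):
    """MSP(x) = 1 - P(y | x), with the product evaluated in log space."""
    return 1.0 - math.exp(sequence_log_prob(token_probs))

print(msp_uncertainty(confident_tokens))
print(msp_uncertainty(uncertain_tokens))
```

Because the exponential is monotone, ranking examples by the negative log-probability gives the same order as ranking them by $1 - P(y \mid x)$, so either form can be used when only the ordering matters (as in ROC-AUC).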

In [None]:
# probs holds log P(y | x); 1 - log P ranks examples the same way as 1 - P
msp = 1 - probs

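#### Mahalanobis Distance (MD)

The MD method fits a Gaussian centered at the training data centroid $\mu$ with empirical covariance matrix $\Sigma$. The uncertainty score is the squared Mahalanobis distance between the embedding $h(x)$ and $\mu$:
\begin{equation*}
  U^{\text{MD}}(x) = (h(x) - \mu)^{T} \Sigma^{-1} (h(x) - \mu).
\end{equation*}
The `mahalanobis_distance_with_known_centroids_sigma_inv` function in this section returns the square root of this quantity, which leaves the ranking of examples unchanged.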
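
The effect of the covariance term can be seen in a small NumPy sketch. The synthetic 2-D "embeddings" and the helper name `mahalanobis_sq` are assumptions for illustration: two query points have the same Euclidean distance from the centroid, but very different Mahalanobis distances.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "training embeddings": 500 points from a correlated 2-D Gaussian.
train = rng.multivariate_normal(
    mean=[0.0, 0.0], cov=[[1.0, 0.8], [0.8, 1.0]], size=500
)

mu = train.mean(axis=0)
sigma = np.cov(train, rowvar=False)  # empirical covariance, shape (2, 2)
sigma_inv = np.linalg.inv(sigma)

def mahalanobis_sq(h):
    """Squared Mahalanobis distance (h - mu)^T Sigma^{-1} (h - mu)."""
    diff = h - mu
    return float(diff @ sigma_inv @ diff)

along = np.array([2.0, 2.0])    # lies along the direction in which the data varies most
across = np.array([2.0, -2.0])  # same Euclidean norm, but against the correlation

print(mahalanobis_sq(along), mahalanobis_sq(across))
```

The point that goes against the correlation structure of the training data receives a much larger score, which is the behaviour that makes MD useful for flagging inputs unlike the training distribution.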

In [None]:
import numpy as np
import torch
from tqdm import tqdm

DOUBLE_INFO = torch.finfo(torch.double)
JITTERS = [10**exp for exp in range(-15, 0, 1)]

def _compute_centroid(train_features, train_labels, label, zero_vector=None):
    label_features = train_features[train_labels == label]
    if len(label_features):
        return label_features.mean(dim=0), False
    return zero_vector, True


def compute_centroids(train_features, train_labels, num_labels=None):
    labels = (
        np.sort(np.unique(train_labels))
        if num_labels is None
        else np.arange(num_labels)
    )
    device = train_features.device
    centroids = torch.empty(
        len(labels), train_features.shape[1], dtype=torch.float32, device=device
    )
    centroids_mask = torch.empty(len(labels), dtype=torch.bool, device="cpu")
    zero_vector = torch.zeros(train_features.shape[1], device=device)

    for i, label in enumerate(labels):
        centroid, centroid_mask = _compute_centroid(
            train_features, train_labels, label, zero_vector
        )
        centroids[i].copy_(centroid, non_blocking=True)
        centroids_mask[i] = centroid_mask

    return centroids, centroids_mask


def compute_inv_covariance(centroids, train_features, train_labels, jitters=None):
    if jitters is None:
        jitters = JITTERS
    jitter = 0
    jitter_eps = None

    cov = torch.zeros(
        centroids.shape[1], centroids.shape[1], device=centroids.device
    ).float()
    for c, mu_c in tqdm(enumerate(centroids)):
        for x in train_features[train_labels == c]:
            d = (x - mu_c).unsqueeze(1)
            cov += d @ d.T
    cov_scaled = cov / (train_features.shape[0] - 1)

    for i, jitter_eps in enumerate(jitters):
        jitter = jitter_eps * torch.eye(
            cov_scaled.shape[1],
            device=cov_scaled.device,
        )
        cov_scaled_update = cov_scaled + jitter
        eigenvalues = torch.linalg.eigh(cov_scaled_update).eigenvalues
        if (eigenvalues >= 0).all():
            break
    cov_scaled = cov_scaled + jitter
    cov_inv = torch.inverse(cov_scaled.to(torch.float64)).float()
    return cov_inv, jitter_eps

def mahalanobis_distance_with_known_centroids_sigma_inv(
    centroids, centroids_mask, sigma_inv, eval_features
):
    diff = eval_features.unsqueeze(1) - centroids.unsqueeze(
        0
    )  # bs (b), num_labels (c / s), dim (d / a)
    dists = torch.sqrt(torch.einsum("bcd,da,bsa->bcs", diff, sigma_inv, diff))
    device = dists.device
    dists = torch.stack([torch.diag(dist).cpu() for dist in dists], dim=0)
    if centroids_mask is not None:
        dists = dists.masked_fill_(centroids_mask, float("inf")).to(device)
    return dists  # np.min(dists, axis=1)

In [None]:
train_labels = np.zeros(train_embeddings.shape[0])

centroid = train_embeddings.mean(dim=0)
sigma_inv, _ = compute_inv_covariance(
    centroid.unsqueeze(0), train_embeddings, train_labels
)
md_enc = mahalanobis_distance_with_known_centroids_sigma_inv(
    centroid.unsqueeze(0),
    None,
    sigma_inv,
    test_embeddings,
)[:, 0]

1it [00:00, 12.17it/s]


In [None]:
centroid = train_embeddings_decoder.mean(dim=0)
sigma_inv, _ = compute_inv_covariance(
    centroid.unsqueeze(0), train_embeddings_decoder, train_labels
)
md_dec = mahalanobis_distance_with_known_centroids_sigma_inv(
    centroid.unsqueeze(0),
    None,
    sigma_inv,
    test_embeddings_decoder,
)[:, 0]

1it [00:00, 12.03it/s]


### Results

In [None]:
import sklearn.metrics as metrics

def get_ood_score(label, ue):
    fpr, tpr, threshold = metrics.roc_curve(label, ue)
    roc_auc = metrics.auc(fpr, tpr)
    return fpr, tpr, roc_auc

In [None]:
label_ood = test_with_ood['label_ood']

fpr_msp, tpr_msp, roc_auc_msp = get_ood_score(label_ood, msp)
fpr_md_dec, tpr_md_dec, roc_auc_md_dec = get_ood_score(label_ood, md_dec)
fpr_md_enc, tpr_md_enc, roc_auc_md_enc = get_ood_score(label_ood, md_enc)

In [None]:
from plotly import graph_objects as go
import plotly.io as pio
pio.renderers.default = 'colab'

ats_metric_name = 'ROC-AUC'
metric = ''
x_axis = np.arange(len(msp) + 1) / (len(msp) + 1)

fig = go.Figure(
    layout=dict(
        height=400,
        width=700,
        title=ats_metric_name,
        margin=dict(l=0, r=0, t=30, b=10),
    )
)

fig.add_scatter(
    x=fpr_msp, y=tpr_msp, name=f'MSP {roc_auc_msp:.2f}',
)
fig.add_scatter(
    x=fpr_md_enc, y=tpr_md_enc, name=f'MD encoder {roc_auc_md_enc:.2f}',
)
fig.add_scatter(
    x=fpr_md_dec, y=tpr_md_dec, name=f'MD decoder {roc_auc_md_dec:.2f}',
)

fig.show()

The results suggest that density-based methods are superior to the probability-based baseline for OOD detection. They also need only a single forward pass per input, unlike *ensemble-based* methods (not run in this notebook), which makes them a good choice in practice.

# Large Language Models

## Selective Generation

In this task, we aim to identify instances where the LLM generates erroneous outputs (e.g. hallucinations) with low generation metrics. Ideally, uncertainty estimates should correlate with the errors of the model, e.g. "1 - quality of generation".
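
As a rough illustration of the idea, the sketch below abstains whenever the uncertainty exceeds a threshold. The answers, scores, threshold, and fallback string are all hypothetical values chosen for the example, not outputs of the model used in this notebook.

```python
def selective_generation(answers, uncertainties, threshold, fallback="I am not sure."):
    """Return the model answer when uncertainty is at most `threshold`, else abstain."""
    return [
        answer if u <= threshold else fallback
        for answer, u in zip(answers, uncertainties)
    ]

# Hypothetical model outputs, uncertainty scores, and per-answer quality in [0, 1].
answers = ["white", "in a barn", "five", "the farmer"]
uncertainties = [0.05, 0.10, 0.70, 0.90]
quality = [1.0, 1.0, 0.0, 0.5]

# Quality of the answers that survive the threshold of 0.5.
kept = [q for q, u in zip(quality, uncertainties) if u <= 0.5]

print(selective_generation(answers, uncertainties, threshold=0.5))
print("quality of all answers: ", sum(quality) / len(quality))
print("quality of kept answers:", sum(kept) / len(kept))
```

When the uncertainty scores track the errors well, as in this toy case, the retained answers have a higher average quality than the full set.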



### Load Dataset with prompt

In [None]:
from omegaconf.dictconfig import DictConfig
from datasets import load_dataset, Dataset, concatenate_datasets

In [None]:
dataset = load_dataset("LM-Polygraph/coqa", "continuation")
train_dataset = dataset["train"]
test_dataset = dataset["test"]

In [None]:
print(test_dataset[0]["input"])

The following are stories and questions about them. Each story is followed by a question and answer to a given question.

Story: Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white tiger stripes like Cotton's mommy. Being different made Cotton quite sad. She often wished she looked like the rest of her family. So one day, when Cotton found a can of the old farmer's orange paint, she used it to paint herself like them. When her mommy and sisters found her they started laughing. 

"What are you doing, Cotton?!" 

"I only wanted to be more like you". 

Cotton's mommy rubb

In [None]:
print(test_dataset[0]["output"])

white


In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
model.generation_config.pad_token_id = tokenizer.pad_token_id

In [None]:
eos_tokens = ["\n", "\n\n", ".\n\n", ".", ","]
eos_token_ids = tokenizer(eos_tokens)['input_ids']

In [None]:
device = 'cuda'

max_size = 100
test_dataset_sample = test_dataset.select(list(range(max_size)))

### Model Inference

In [None]:
from rouge_score import rouge_scorer

def inference(model, dataloader, max_length=5, num_return_sequences=1, num_beams=1, do_sample=False):

    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    answers = []
    probs = []
    rougeL = []

    possible_input_keys = ["input_ids", "attention_mask"]
    start = 0
    model = model.to(device)
    for batch in tqdm(dataloader):
        torch.cuda.empty_cache()
        inputs = tokenizer(batch["input"], return_tensors="pt").to("cuda")
        inputs = {k: v.to(device) for k, v in inputs.items() if k in possible_input_keys}
        output = model.generate(
                        **inputs,
                        max_new_tokens=max_length,
                        output_scores=True,
                        return_dict_in_generate=True,
                        num_beams=num_beams,
                        output_hidden_states=False,
                        num_return_sequences=num_return_sequences,
                        do_sample=do_sample,
                        eos_token_id=eos_token_ids
                    )
        prediction = tokenizer.decode(output.sequences[0, inputs["input_ids"].shape[1]:].cpu().detach(), skip_special_tokens=True)
        probs.append(torch.cat(output.scores).log_softmax(-1).max(-1).values.sum().item())
        answers.append(prediction)
        rougeL.append(scorer.score(batch["output"], prediction)["rougeL"].fmeasure)

    probs = np.array(probs)
    output = (probs, answers, rougeL)
    return output

In [None]:
probs, preds, rougeL = inference(model, test_dataset_sample)

100%|██████████| 100/100 [00:50<00:00,  1.98it/s]


### Calculating UE

#### Maximum Sequence Probability (MSP)

In [None]:
# probs holds log P(y | x); 1 - log P ranks examples the same way as 1 - P
msp = 1 - probs

#### Lexical Similarity

Marina Fomicheva, Shuo Sun, Lisa Yankovskaya, Frédéric Blain, Francisco Guzmán, Mark Fishel, Nikolaos Aletras, Vishrav Chaudhary, Lucia Specia; [Unsupervised Quality Estimation for Neural Machine Translation](https://doi.org/10.1162/tacl_a_00330). Transactions of the Association for Computational Linguistics 2020; 8 539–555.

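Lexical Similarity computes the mean pairwise similarity between sampled answers. We obtain the answers by sampling from the model's output distribution instead of using beam search.

$$
\text{LexSim}=\frac{1}{C} \sum_{i=1}^{|\mathbb{H}|} \sum_{j=i+1}^{|\mathbb{H}|} \operatorname{sim}\left(h_i, h_j\right)
$$

Here $\mathbb{H}$ is the set of sampled answers, $\operatorname{sim}$ is a text similarity measure (ROUGE-L in this notebook), and $C = |\mathbb{H}|(|\mathbb{H}| - 1)/2$ is the number of answer pairs.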
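
A small sketch of the score, averaging a similarity function over all unordered pairs of distinct samples. It uses a unigram-overlap F1 as a stand-in similarity (an assumption made to keep the example dependency-free; the notebook cell uses ROUGE-L from `rouge_score`), and the sampled answers are hypothetical.

```python
from collections import Counter
from itertools import combinations

def unigram_f1(a, b):
    """Stand-in similarity: F1 over shared whitespace tokens (not ROUGE-L)."""
    ta, tb = Counter(a.split()), Counter(b.split())
    overlap = sum((ta & tb).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(ta.values()), overlap / sum(tb.values())
    return 2 * precision * recall / (precision + recall)

def lexical_similarity_score(samples, sim=unigram_f1):
    """Mean similarity over all unordered pairs of distinct samples."""
    pairs = list(combinations(samples, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

# Hypothetical sampled answers for two different questions.
consistent = ["white", "white", "white", "white"]
diverse = ["white", "orange", "a kitten", "in the barn"]

print(lexical_similarity_score(consistent))
print(lexical_similarity_score(diverse))
```

Samples that agree with each other give a high similarity and therefore a low uncertainty once the score is converted, for example as `1 - similarity`.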

In [None]:
def lexical_similarity(model, test_loader, n_samples=5):
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

    sampling_answers = []
    for _ in range(n_samples):
        _, answers, _ = inference(model, test_loader, do_sample=True)
        sampling_answers.append(answers)
    sampling_answers = np.array(sampling_answers).T


    similarities = []

    for samples in sampling_answers:
        sim_matrix = []
        for i, text_i in enumerate(samples):
            for text_j in samples[i + 1:]:  # distinct pairs only, no self-similarity
                sim_matrix.append(scorer.score(text_i, text_j)["rougeL"].fmeasure)
        similarities.append(sim_matrix)

    return sampling_answers, np.mean(similarities, axis=-1)

In [None]:
sampling_answers, similarities = lexical_similarity(model, test_dataset_sample)
ue_lexsim = 1 - similarities

100%|██████████| 100/100 [00:50<00:00,  1.99it/s]
100%|██████████| 100/100 [00:50<00:00,  1.99it/s]
100%|██████████| 100/100 [00:49<00:00,  2.01it/s]
100%|██████████| 100/100 [00:50<00:00,  1.98it/s]
100%|██████████| 100/100 [00:49<00:00,  2.00it/s]


### Prediction Rejection (PR) curve

Consider a test dataset $D = \{(x_i, y_i)\}$. Let
$f(x_i)$ be the output generated by an LLM and
$U(x_i)$ be the uncertainty score of the prediction. The
prediction rejection (PR) curve shows how the average quality
$Q(f(x_i), y_i)$ of the retained instances depends on the
rejection rate $a$, where instances are rejected in descending
order of uncertainty. We use ROUGE-L as the text quality
metric $Q(f(x_i), y_i)$.

Andrey Malinin, Anton Ragni, Kate Knill, and Mark
Gales. 2017. [Incorporating uncertainty into deep
learning for spoken language assessment](https://doi.org/10.18653/v1/P17-2008). In Proceedings
of the 55th Annual Meeting of the Association
for Computational Linguistics (Volume 2: Short Papers),
pages 45–50, Vancouver, Canada. Association
for Computational Linguistics.
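
The construction of the curve can be sketched in plain Python on toy data. The quality values and the two uncertainty vectors are hypothetical, and the helper names `pr_curve` and `area` are introduced only for this illustration.

```python
def pr_curve(uncertainty, quality):
    """Mean quality of retained examples after rejecting the k most uncertain, k = 0..N-1."""
    order = sorted(range(len(quality)), key=lambda i: uncertainty[i])  # most confident first
    ranked = [quality[i] for i in order]
    n = len(ranked)
    return [sum(ranked[: n - k]) / (n - k) for k in range(n)]

def area(curve):
    """Average height of the PR curve, used as the unnormalized PRR score."""
    return sum(curve) / len(curve)

# Hypothetical quality scores and two hypothetical uncertainty estimates.
quality = [1.0, 0.8, 0.0, 0.5]
good_ue = [0.1, 0.2, 0.9, 0.6]  # high uncertainty where quality is low
bad_ue = [0.9, 0.6, 0.1, 0.2]   # high uncertainty where quality is high

oracle_ue = [-q for q in quality]  # the oracle rejects the worst answers first

print(pr_curve(good_ue, quality))
print(area(pr_curve(good_ue, quality)), area(pr_curve(bad_ue, quality)))
print(area(pr_curve(oracle_ue, quality)))
```

The first point of every curve is the average quality with nothing rejected. A useful uncertainty score makes the curve rise as more examples are rejected, and the oracle ordering gives the largest achievable area.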

In [None]:
def PRR(ue, target):
    ue = np.array(ue)
    num_obs = len(ue)
    # Sort in ascending order: the least uncertain come first
    ue_argsort = np.argsort(ue)
    # Quality scores reordered from the most to the least confident prediction
    sorted_metrics = np.array(target)[ue_argsort]
    # Mean quality of the k most confident predictions, k = 1..num_obs;
    # reversed so that scores[0] corresponds to rejecting nothing
    cumsum = np.cumsum(sorted_metrics)
    scores = (cumsum / np.arange(1, num_obs + 1))[::-1]
    # Average height of the PR curve
    prr_score = np.sum(scores) / num_obs
    return scores, prr_score

def get_random_scores(function, metrics, num_iter=100, seed=42):
    np.random.seed(seed)
    rand_scores = np.arange(len(metrics))

    value = []
    for i in range(num_iter):
        np.random.shuffle(rand_scores)
        rand_val = function(rand_scores, metrics)[1]
        value.append(rand_val)
    return np.mean(value)

In [None]:
rougeL = np.array(rougeL)

scores_msp, prr_score_msp = PRR(msp, rougeL)
scores_lexsim, prr_score_lexsim = PRR(ue_lexsim, rougeL)

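oracle = PRR(-rougeL, rougeL)[1]
# Named random_score to avoid shadowing the `random` module imported earlier
random_score = get_random_scores(PRR, rougeL)

final_score_msp = (prr_score_msp - random_score) / (oracle - random_score)
final_score_ls = (prr_score_lexsim - random_score) / (oracle - random_score)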

In [None]:
print("Average ROUGE-L: ", np.mean(rougeL))

Average ROUGE-L:  0.6676847041847043


### Results

In [None]:
from plotly import graph_objects as go
import plotly.io as pio
pio.renderers.default = 'colab'

metric_name = 'PR'
# Rejection rate k / N for k = 0..N-1, matching the length of the PR curves
x_axis = np.arange(len(rougeL)) / len(rougeL)

fig = go.Figure(
    layout=dict(
        height=400,
        width=700,
        title=metric_name,
        margin=dict(l=0, r=0, t=30, b=10),
    )
)

fig.add_scatter(
    x=x_axis, y=scores_msp, name=f'MSP {final_score_msp:.2f}',
)
fig.add_scatter(
    x=x_axis, y=scores_lexsim, name=f'LexSim {final_score_ls:.2f}',
)

fig.show()

The results indicate that lexical similarity outperforms probability-based methods in terms of PRR. However, this performance comes at the cost of significantly higher computational time. Notably, in the first half of the PR curve, MSP and lexical similarity show comparable results. Given this, MSP could be a practical and efficient starting point for real-world applications, particularly when dealing with short-form answers.




## LM-Polygraph

LM-Polygraph provides a battery of state-of-the-art uncertainty estimation (UE) methods for LMs in text generation tasks. High uncertainty can indicate the presence of hallucinations, and a score that estimates uncertainty can help make applications of LLMs safer.

The framework also introduces an extendable benchmark for consistent evaluation of UE techniques by researchers and a demo web application that enriches the standard chat dialog with confidence scores, empowering end-users to discern unreliable responses.

Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun, Artem Vazhentsev, Sergey Petrakov, Kirill Fedyanin, Daniil Vasilev, Elizaveta Goncharova, Alexander Panchenko, Maxim Panov, Timothy Baldwin, Artem Shelmanov. 2023. [LM-Polygraph: Uncertainty Estimation for Language Models](https://arxiv.org/abs/2311.07383). In EMNLP-2023.

Code: https://github.com/IINemo/lm-polygraph/tree/main

### High-level Usage

**To ensure the LM-Polygraph library is installed correctly in Google Colab, it is recommended to restart the environment before the installation.**



In [None]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [None]:
!git clone https://github.com/IINemo/lm-polygraph.git
%cd lm-polygraph
!pip install -r requirements.txt
!pip install -e . --no-deps


#### Model Initialization

In [None]:
from lm_polygraph.estimators import *
from lm_polygraph.utils.model import WhiteboxModel
from lm_polygraph import estimate_uncertainty

model_path = "Qwen/Qwen2.5-1.5B-Instruct"
model = WhiteboxModel.from_pretrained(
    model_path,
    device_map="auto"
)


#### UE Methods



In [None]:
input_text = "What is the capital of USA?"
ue_method = MeanPointwiseMutualInformation()
estimate_uncertainty(model, ue_method, input_text=input_text)



UncertaintyOutput(uncertainty=-16.621862097159497, input_text='What is the capital of USA?', generation_text='The capital of the United States of America is Washington, D.C.', generation_tokens=[785, 6722, 315, 279, 3639, 4180, 315, 5159, 374, 6515, 11, 422, 727, 13], model_path='Qwen/Qwen2.5-1.5B-Instruct', estimator='MeanPointwiseMutualInformation')

In [None]:
ue_method = LexicalSimilarity()
estimate_uncertainty(model, ue_method, input_text=input_text)

UncertaintyOutput(uncertainty=-0.7701795012139838, input_text='What is the capital of USA?', generation_text='The capital of the United States of America is Washington, D.C.', generation_tokens=[785, 6722, 315, 279, 3639, 4180, 315, 5159, 374, 6515, 11, 422, 727, 13], model_path='Qwen/Qwen2.5-1.5B-Instruct', estimator='LexicalSimilarity_rougeL')

Generalizing length-normalized log probability, **TokenSAR** computes the average of the negative log probabilities of the generated tokens, weighted by each token's relevance to the entire generated text. For a given sentence similarity function $g(\cdot, \cdot)$ and token relevance function $\mathrm{R}_T(y_k, y, x) = 1 - g(x \cup y, x \cup y \setminus y_k)$, the resulting estimate is given by the following formula:
  $$
    U_\mathrm{TokenSAR}(x) = -\sum\nolimits_{l = 1}^L \tilde{\mathrm{R}}_T(y_l, y, x) \log P(y_l \mid y_{<l}, x),
  $$
  where $\tilde{\mathrm{R}}_T(y_k, y, x) = \frac{\mathrm{R}_T(y_k, y, x)}{\sum\nolimits_{l = 1}^L \mathrm{R}_T(y_l, y, x)}$.
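As a rough illustration of the TokenSAR formula, the sketch below computes the relevance-weighted negative log-likelihood from per-token log probabilities and relevance scores that are assumed to be already available. The function name `token_sar` and the toy numbers are hypothetical; this is not the LM-Polygraph implementation, which additionally derives the relevance scores from a sentence similarity model.

```python
import numpy as np

def token_sar(token_log_probs, token_relevance):
    """Relevance-weighted negative log-likelihood of one generated sequence."""
    log_probs = np.asarray(token_log_probs, dtype=float)
    relevance = np.asarray(token_relevance, dtype=float)
    # Normalize relevance so the weights sum to 1 (the tilde-R_T term).
    weights = relevance / relevance.sum()
    return float(-np.sum(weights * log_probs))

# Toy values (made up for illustration): three tokens,
# where the second token carries most of the meaning.
log_probs = [-0.1, -2.0, -0.3]
relevance = [0.05, 0.80, 0.15]
print(token_sar(log_probs, relevance))
```

With uniform relevance scores the weights become $1/L$, so the estimate reduces to the length-normalized negative log probability that TokenSAR generalizes.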

  **SentenceSAR** enlarges the probability of those sentences that are more relevant and convincing than others. Given a sentence relevance measure $g\bigl(y^{(j)}, y^{(k)}\bigr)$ of $y^{(j)}$ with respect to $y^{(k)}$, SentenceSAR is computed as:
  $$
    \begin{aligned}
    \mathrm{R}_S (y^{(j)}, x) &= \sum_{k \neq j} g\bigl(y^{(j)}, y^{(k)}\bigr) P\bigl(y^{(k)} \mid x\bigr), \\
    U_\mathrm{SentSAR}(x) &= -\frac{1}{K} \sum_{k = 1}^K \log \Bigl(P(y^{(k)} \mid x) + \frac{1}{t} \mathrm{R}_S (y^{(k)}, x)\Bigr),
    \end{aligned}
  $$
  where $K$ is the number of sampled generations and $t$ is a temperature parameter that controls the scale of the shift toward relevance.

  Combining SentenceSAR and TokenSAR results in a new method **SAR**. In particular, the generative probability $P(y \mid x)$ is replaced with the token-shifted probability $P'(y \mid x) = \exp\{-\mathrm{TokenSAR}(y, x)\}$.
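The sentence-level formulas can be sketched in a few lines of NumPy. The code below assumes that the sequence probabilities $P(y^{(k)} \mid x)$ and a pairwise similarity matrix $g$ are given. The function names, the default `t=1.0`, and the inputs are hypothetical choices for illustration, not the LM-Polygraph API.

```python
import numpy as np

def sentence_sar(seq_probs, similarity, t=1.0):
    """SentenceSAR for K sampled generations.

    seq_probs:  shape (K,), P(y^(k) | x) for each sample.
    similarity: shape (K, K), similarity[j, k] = g(y^(j), y^(k)).
    t:          temperature (illustrative default, not a recommended value).
    """
    probs = np.asarray(seq_probs, dtype=float)
    sim = np.asarray(similarity, dtype=float)
    # Zero the diagonal so each sample is compared only with the others (k != j).
    off_diag = sim - np.diag(np.diag(sim))
    relevance = off_diag @ probs  # R_S(y^(j), x) for every j
    return float(-np.mean(np.log(probs + relevance / t)))

def sar(token_sar_scores, similarity, t=1.0):
    """SAR: SentenceSAR with P replaced by P' = exp(-TokenSAR)."""
    shifted_probs = np.exp(-np.asarray(token_sar_scores, dtype=float))
    return sentence_sar(shifted_probs, similarity, t=t)
```

When the samples are mutually dissimilar (zero off-diagonal similarity), the relevance term vanishes and the score falls back to the average negative log probability of the samples.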

Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. 2024. [Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models.](https://aclanthology.org/2024.acl-long.276/) In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5050–5063, Bangkok, Thailand. Association for Computational Linguistics.

In [None]:
ue_method = SAR()
estimate_uncertainty(model, ue_method, input_text=input_text)


UncertaintyOutput(uncertainty=-8.787354646975869, input_text='What is the capital of USA?', generation_text='The capital of the United States of America is Washington, D.C.', generation_tokens=[785, 6722, 315, 279, 3639, 4180, 315, 5159, 374, 6515, 11, 422, 727, 13], model_path='Qwen/Qwen2.5-1.5B-Instruct', estimator='SAR')

In [None]:
input_text = "What is the capital of Moscow?"
ue_method = SAR()
estimate_uncertainty(model, ue_method, input_text=input_text)

UncertaintyOutput(uncertainty=-7.783019947567017, input_text='What is the capital of Moscow?', generation_text='The capital of Moscow is Moscow itself.', generation_tokens=[785, 6722, 315, 22415, 374, 22415, 5086, 13], model_path='Qwen/Qwen2.5-1.5B-Instruct', estimator='SAR')

### Low-level Usage of LM-Polygraph

#### Load Dataset

In [None]:
from omegaconf.dictconfig import DictConfig
from datasets import load_dataset, Dataset, concatenate_datasets

In [None]:
dataset = load_dataset("LM-Polygraph/coqa", "continuation")
train_dataset = dataset["train"]
test_dataset = dataset["test"]

max_size = 50
test_dataset_sample = test_dataset.select(list(range(max_size)))

#### Load Model

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
tokenizer.chat_template = None  # only for the base (non-instruct) version of Qwen-2.5

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B").to("cuda")

model.generation_config.pad_token_id = tokenizer.pad_token_id

In [None]:
from lm_polygraph.utils.generation_parameters import GenerationParameters

eos_tokens = ["\n", "\n\n", ".\n\n", ".", ","]
generation_parameters = GenerationParameters(generate_until=eos_tokens)

In [None]:
from lm_polygraph.model_adapters import WhiteboxModel
model_adapter = WhiteboxModel(model, tokenizer, generation_parameters=generation_parameters)

#### Load UE Methods

Here, we initialize the UE methods that we will use in our experiments.




In [None]:
from lm_polygraph.estimators import *

estimators = [MaximumSequenceProbability(),
              ClaimConditionedProbability(),
              LexicalSimilarity(),
              DegMat(),
              SemanticEntropy()]

#### Load UE Calculators

Here, we initialize the statistical calculators required for general inference, sampling, computing NLI scores, and clustering.

In [None]:
from lm_polygraph.stat_calculators import *
from lm_polygraph.utils.deberta import Deberta

device = "cuda"
calc_infer_llm = GreedyProbsCalculator()

nli_model = Deberta(device=device)
nli_model.setup()

calc_nli = GreedyAlternativesNLICalculator(nli_model=nli_model)

calc_samples = SamplingGenerationCalculator(samples_n=3)
calc_semantic_matrix = SemanticMatrixCalculator(nli_model=nli_model)
calc_semantic_classes = SemanticClassesCalculator()


In [None]:
from lm_polygraph.generation_metrics import RougeMetric
metric = RougeMetric("rougeL")

#### General Inference with UE Methods

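The loop below iterates over the dataset in batches. For each batch, we first run greedy inference and then pass the results through the additional calculators (NLI, sampling, semantic matrix, and semantic classes) before computing every uncertainty estimate and the ROUGE-L score.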

In [None]:
from tqdm import tqdm
from transformers import set_seed
from torch.utils.data import DataLoader

set_seed(42)

max_new_tokens = 5
data_loader = DataLoader(test_dataset_sample, batch_size=1, shuffle=False, collate_fn=lambda x: x)

ues = {}
metrics = []

for batch in tqdm(data_loader):
    texts = [x["input"] for x in batch]
    ground_truth = [x["output"] for x in batch]

    deps = {"input_texts": texts}
    deps.update(calc_infer_llm(deps, texts=texts, model=model_adapter, max_new_tokens=max_new_tokens))
    deps.update(calc_nli(deps, texts=texts, model=model_adapter, max_new_tokens=max_new_tokens))
    deps.update(calc_samples(deps, texts=texts, model=model_adapter, max_new_tokens=max_new_tokens))
    deps.update(calc_semantic_matrix(deps, texts=texts, model=model_adapter, max_new_tokens=max_new_tokens))
    deps.update(calc_semantic_classes(deps, texts=texts, model=model_adapter, max_new_tokens=max_new_tokens))

    for estimator in estimators:
        uncertainty_score = estimator(deps)
        method = str(estimator)
        if method in ues:
            ues[method].append(uncertainty_score)
        else:
            ues[method] = [uncertainty_score]

    metrics.append(metric(deps, ground_truth))

100%|██████████| 50/50 [02:26<00:00,  2.93s/it]


#### Generation and UE Metrics

In [None]:
import numpy as np
from lm_polygraph.ue_metrics.ue_metric import get_random_scores
from lm_polygraph.ue_metrics.pred_rej_area import PredictionRejectionArea

PRR = PredictionRejectionArea()

metrics = np.array(metrics).flatten()
oracle = PRR(-metrics, metrics)
# Named `random_score` to avoid shadowing the `random` module imported earlier
random_score = get_random_scores(PRR, metrics)

print("Average ROUGE-L: ", np.mean(metrics))
print()

for method in ues.keys():
    ues_array = np.array(ues[method]).flatten()
    prr_score = PRR(ues_array, metrics)
    # Normalize so that 0 corresponds to random rejection and 1 to the oracle
    final_score = (prr_score - random_score) / (oracle - random_score)
    print(f"{method}: {final_score:.2f}")

Average ROUGE-L:  0.6135281385281385

MaximumSequenceProbability: 0.78
CCP: 0.36
LexicalSimilarity_rougeL: 0.58
DegMat_NLI_score_entail: 0.53
SemanticEntropy: 0.66


On this 50-example sample, MSP achieves the highest PRR (0.78), ahead of all sampling-based methods, while requiring only a single greedy generation pass instead of multiple samples and NLI inference. Among the sampling-based approaches, Semantic Entropy performs best (0.66), followed by Lexical Similarity (0.58) and DegMat (0.53).
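To make the normalized score in the evaluation cell above more concrete, here is a simplified, self-contained sketch of a prediction-rejection ratio. It assumes the rejection curve is the mean quality of the retained answers as the most uncertain ones are rejected one by one. The function names are hypothetical, and the details (curve definition, number of random permutations) may differ from LM-Polygraph's `PredictionRejectionArea`.

```python
import numpy as np

def rejection_curve_area(uncertainty, quality):
    """Average quality of the retained answers over all rejection thresholds."""
    uncertainty = np.asarray(uncertainty, dtype=float)
    quality = np.asarray(quality, dtype=float)
    # Keep the most confident answers first (lowest uncertainty).
    order = np.argsort(uncertainty, kind="stable")
    kept_quality = np.cumsum(quality[order]) / np.arange(1, len(quality) + 1)
    return float(kept_quality.mean())

def normalized_prr(uncertainty, quality, n_random=1000, seed=0):
    """Scale the area so that random rejection gives ~0 and the oracle gives 1."""
    quality = np.asarray(quality, dtype=float)
    rng = np.random.default_rng(seed)
    area = rejection_curve_area(uncertainty, quality)
    # The oracle rejects the lowest-quality answers first.
    oracle = rejection_curve_area(-quality, quality)
    # Baseline: average area over random uncertainty scores.
    random_area = float(np.mean([
        rejection_curve_area(rng.random(len(quality)), quality)
        for _ in range(n_random)
    ]))
    return (area - random_area) / (oracle - random_area)

# Toy quality scores (made up for illustration).
quality = np.array([0.1, 0.9, 0.5, 0.3])
print(normalized_prr(-quality, quality))  # oracle ranking
print(normalized_prr(quality, quality))   # worst possible ranking
```

A score close to 1 means the uncertainty ranking is nearly as good as knowing the true quality in advance, while a score near or below 0 means it is no better than rejecting at random.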

# Task

**Your task is to implement the [Verbalized UE](https://arxiv.org/pdf/2305.14975) method and apply it to selective generation. You are expected to use the Verbalized 1S top-1 prompt by default, but you may experiment with alternative prompts. Complete the missing sections in the code template provided below.
The description and algorithm are given before the corresponding empty sections. All other code, for initializing the model and dataset and for evaluating the performance of the UE methods, is provided.**

**Prompt example**: *Provide your best guess and the probability that it is correct (0.0 to 1.0) for the following question. Give ONLY the guess and probability, no other words or explanation. For example:\n\nGuess: <most likely guess, as short as possible; not a complete sentence, just the guess!>\n Probability: <the probability between 0.0 and 1.0 that your guess is correct, without any extra commentary whatsoever; just the probability!>\n\nThe question is: ${THE_QUESTION}*
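If you choose to apply the prompt manually, one possible way to fill the `${THE_QUESTION}` placeholder is Python's `string.Template`. This is only a sketch: the names `VERB_1S_TOP1_PROMPT` and `build_prompt` are hypothetical, and the prompt stored in the "verb_1s_top1" subset may be formatted differently (for example, it may also include the CoQA story and dialogue history).

```python
from string import Template

# Prompt text copied from the example above; ${THE_QUESTION} is the placeholder.
VERB_1S_TOP1_PROMPT = Template(
    "Provide your best guess and the probability that it is correct (0.0 to 1.0) "
    "for the following question. Give ONLY the guess and probability, no other "
    "words or explanation. For example:\n\n"
    "Guess: <most likely guess, as short as possible; not a complete sentence, "
    "just the guess!>\n "
    "Probability: <the probability between 0.0 and 1.0 that your guess is correct, "
    "without any extra commentary whatsoever; just the probability!>\n\n"
    "The question is: ${THE_QUESTION}"
)

def build_prompt(question: str) -> str:
    """Insert a question into the verbalized 1S top-1 prompt."""
    return VERB_1S_TOP1_PROMPT.substitute(THE_QUESTION=question)

print(build_prompt("What is the capital of USA?"))
```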

In [None]:
from omegaconf.dictconfig import DictConfig
from datasets import load_dataset, Dataset, concatenate_datasets

**Load the CoQA dataset using one of the following approaches:**

1. Use the "verb_1s_top1" subset of the "LM-Polygraph/coqa" dataset available on HuggingFace.
2. Alternatively, load the "continuation" subset of the "LM-Polygraph/coqa" dataset and manually apply the verbalized prompt to the data.


In [None]:
dataset = # <your code here>
test_dataset = # <your code here>

max_size = 50
test_dataset_sample = test_dataset.select(list(range(max_size)))


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
tokenizer.chat_template = None

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B").to("cuda")

model.generation_config.pad_token_id = tokenizer.pad_token_id

In [None]:
from lm_polygraph.utils.generation_parameters import GenerationParameters

eos_tokens = ["\n", "\n\n", ".\n\n", ".", ","]
generation_parameters = GenerationParameters(generate_until=eos_tokens)

In [None]:
from lm_polygraph.model_adapters import WhiteboxModel
model_adapter = WhiteboxModel(model, tokenizer, generation_parameters=generation_parameters)

**Initialize the MaximumSequenceProbability and Verbalized estimators from LM-Polygraph**

When using the Verbalized1S or Verbalized2S estimators, ensure you specify `confidence_regex` (a regular expression to extract the confidence score from the generated text) and `name_postfix` (a string to append to the result names for proper identification, e.g., `_top1` for Verbalized 1S top-1 results).

You can find the `confidence_regex` in the configuration file available at the following link:
https://github.com/IINemo/lm-polygraph/blob/main/examples/configs/instruct/polygraph_eval_coqa_verb_1s_top1.yaml
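To illustrate what a confidence regex does, the sketch below extracts the stated probability from a generated answer and converts it into an uncertainty score. The pattern shown here is a hypothetical example and is not necessarily the one used in the linked configuration file. The helper `verbalized_uncertainty` and its fallback for unparsable outputs are likewise assumptions, not the behavior of the library's Verbalized estimators.

```python
import re

# Hypothetical pattern: captures the number that follows "Probability:".
CONFIDENCE_REGEX = re.compile(r"Probability:\s*(\d*\.?\d+)")

def verbalized_uncertainty(generated_text: str, default_confidence: float = 0.0) -> float:
    """Return 1 - stated confidence; fall back to a default if nothing is parsed."""
    match = CONFIDENCE_REGEX.search(generated_text)
    if match is None:
        # Assumption: treat unparsable outputs as minimally confident.
        return 1.0 - default_confidence
    # Clip to [0, 1] in case the model states an out-of-range value.
    confidence = min(max(float(match.group(1)), 0.0), 1.0)
    return 1.0 - confidence

print(verbalized_uncertainty("Guess: Washington\nProbability: 0.9"))
print(verbalized_uncertainty("I am not sure."))
```

Whatever pattern you use, check that the generation settings (stop tokens and `max_new_tokens`) leave the model enough room to produce the probability, otherwise the regex will have nothing to match.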


In [None]:
from lm_polygraph.estimators import *

estimators = # <your code here>

In [None]:
from lm_polygraph.stat_calculators import *

device = "cuda"
calc_infer_llm = GreedyProbsCalculator()

In [None]:
from lm_polygraph.generation_metrics import RougeMetric
metric = RougeMetric("rougeL")

In [None]:
import re
import string

TOP1_OUTPUT_IGNORE_REGEX = re.compile(r"(?s)[Gg]uess:|[\n\.\(\,].*")

def normalize_em_coqa(s: str) -> str:
    def remove_articles(text):
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        return re.sub(regex, " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def process_output_top1_coqa(output: str) -> str:
    output = TOP1_OUTPUT_IGNORE_REGEX.sub("", output)
    output = normalize_em_coqa(output)
    return output

In [None]:
from tqdm import tqdm
from transformers import set_seed

set_seed(42)

max_new_tokens = 5
data_loader = DataLoader(test_dataset_sample, batch_size=1, shuffle=False, collate_fn=lambda x: x)

ues = {}
metrics = []

for batch in tqdm(data_loader):
    texts = [x["input"] for x in batch]
    ground_truth = [x["output"] for x in batch]

    deps = {"input_texts": texts}
    deps.update(calc_infer_llm(deps, texts=texts, model=model_adapter, max_new_tokens=max_new_tokens))
    deps["greedy_texts"] = [process_output_top1_coqa(text) for text in deps["greedy_texts"]]

    for estimator in estimators:
        uncertainty_score = estimator(deps)
        method = str(estimator)
        if method in ues:
            ues[method].append(uncertainty_score)
        else:
            ues[method] = [uncertainty_score]

    metrics.append(metric(deps, ground_truth))

100%|██████████| 50/50 [00:41<00:00,  1.21it/s]


In [None]:
import numpy as np
from lm_polygraph.ue_metrics.ue_metric import get_random_scores
from lm_polygraph.ue_metrics.pred_rej_area import PredictionRejectionArea

PRR = PredictionRejectionArea()

metrics = np.array(metrics).flatten()
oracle = PRR(-metrics, metrics)
# Named `random_score` to avoid shadowing the `random` module imported earlier
random_score = get_random_scores(PRR, metrics)

print("Average ROUGE-L: ", np.mean(metrics))
print()

for method in ues.keys():
    ues_array = np.array(ues[method]).flatten()
    prr_score = PRR(ues_array, metrics)
    # Normalize so that 0 corresponds to random rejection and 1 to the oracle
    final_score = (prr_score - random_score) / (oracle - random_score)
    print(f"{method}: {final_score:.2f}")

Provide **a short analysis of the achieved results** by addressing the following questions:
1. Does the model's performance remain consistent when using the verbalized prompt compared to the original prompt from the seminar?
2. Are the results obtained with the Verbalized method meaningful?
3. Is the performance of the Verbalized method better than that of MSP or Semantic Entropy?

In [None]:
# <your analysis here>