# Per-token inference

Now that we have a SAPLMA classifier that works well on the last token of an input prompt, we want to explore how it behaves in a real-time setting, so that we can eventually infer a hallucination probability for each single token **while the LLM is generating**.

That is, we compute:
$$ \mathrm{SAPLMA}\left(h_{t,l}[:, -1, :]\right) \quad \text{for } t \in [1, M] $$

where $h_{t,l}$, of shape $B \times S \times H$ (with $B$ = batch size, $S$ = sequence length, $H$ = hidden dimension), are the hidden states produced by the Llama LLM at layer $l$ (which depends on the SAPLMA version in use) at time step $t$, i.e. when the $t$-th token is generated, and $M$ is the number of generated tokens (generation ends when the `<EOS>` token is produced).
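
As a framework-free sketch of the formula above: at each generation step we keep only the hidden state of the last position and pass it to a classifier. Everything here is illustrative: `toy_probe` is a hypothetical stand-in for SAPLMA (a fixed logistic unit with made-up weights), the hidden states are invented numbers, and a batch size of 1 is assumed. Only the indexing pattern mirrors the formula.

```python
import math

def toy_probe(last_hidden: list[float]) -> float:
    """Hypothetical stand-in for SAPLMA: a fixed logistic unit over one hidden vector."""
    weights = [0.5, -1.0, 0.25]  # made-up weights, H = 3
    logit = sum(w * h for w, h in zip(weights, last_hidden))
    return 1.0 / (1.0 + math.exp(-logit))

def per_token_scores(hidden_states_per_step: list[list[list[float]]]) -> list[float]:
    """
    hidden_states_per_step[t] has shape S_t x H (batch size 1 is assumed):
    the layer-l hidden states available when the t-th token is generated.
    Only the last position, h[-1], is scored at each step.
    """
    return [toy_probe(h[-1]) for h in hidden_states_per_step]

# Three generation steps; the sequence grows by one position per step.
steps = [
    [[0.1, 0.2, 0.3]],
    [[0.1, 0.2, 0.3], [1.0, 0.0, 0.0]],
    [[0.1, 0.2, 0.3], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
]
scores = per_token_scores(steps)
print([round(s, 3) for s in scores])
```

The result is one score per generated token ($M$ scores in total), each computed from the last position only.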

# Imports, installations and declarations from previous notebooks

This section can be skipped and collapsed.

In [2]:
#@title Install missing dependencies
!pip install wandb lightning

Collecting lightning
  Downloading lightning-2.4.0-py3-none-any.whl.metadata (38 kB)
Collecting lightning-utilities<2.0,>=0.10.0 (from lightning)
  Downloading lightning_utilities-0.11.9-py3-none-any.whl.metadata (5.2 kB)
Collecting torchmetrics<3.0,>=0.7.0 (from lightning)
  Downloading torchmetrics-1.6.0-py3-none-any.whl.metadata (20 kB)
Collecting pytorch-lightning (from lightning)
  Downloading pytorch_lightning-2.4.0-py3-none-any.whl.metadata (21 kB)
Downloading lightning-2.4.0-py3-none-any.whl (810 kB)
Downloading lightning_utilities-0.11.9-py3-none-any.whl (28 kB)
Downloading torchmetrics-1.6.0-py3-none-any.whl (926 kB)
Downloading pytorch_lightning-2.4.0-py3-none-any.whl (815 kB)

In [3]:
import os
try:
    import google.colab
    IN_COLAB = True
except ModuleNotFoundError:
    IN_COLAB = False

In [4]:
# If not in Colab, do some compatibility changes
if not IN_COLAB:
    DRIVE_PATH='.'
    os.environ['HF_TOKEN'] = open('.hf_token').read().strip()

In [5]:
#@title Mount Drive, if needed, and check the HF_TOKEN is set and accessible
if IN_COLAB:
    from google.colab import drive, userdata

    drive.mount('/content/drive', readonly=True)
    DRIVE_PATH: str = '/content/drive/MyDrive/Final_Project/'
    assert os.path.exists(DRIVE_PATH), 'Did you forget to create a shortcut in MyDrive named Final_Project this time as well? :('
    !cp -R {DRIVE_PATH}/publicDataset .
    !pwd
    !ls
    print()

    assert userdata.get('HF_TOKEN'), 'Set up HuggingFace login secret properly in Colab!'
    print('HF_TOKEN found')

    os.environ['WANDB_API_KEY'] = userdata.get('WANDB_API_KEY')
    print('WANDB_API_KEY found and set as env var')

Mounted at /content/drive
/content
drive  publicDataset  sample_data

HF_TOKEN found
WANDB_API_KEY found and set as env var


In [6]:
#@title Clone the new updated Python files from GitHub, from master
if IN_COLAB:
  !mkdir -p /root/.ssh
  !touch /root/.ssh/id_ecdsa

  with open('/root/.ssh/id_ecdsa', 'w') as f:
    git_ssh_private_key = """
        -----BEGIN OPENSSH PRIVATE KEY-----
        b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAABAAAAMwAAAAtzc2gtZW
        QyNTUxOQAAACCB3clOafi6fZaBgQCN29TVyJKNW/eVRXT4/B4MB28VQAAAAJhAtW8YQLVv
        GAAAAAtzc2gtZWQyNTUxOQAAACCB3clOafi6fZaBgQCN29TVyJKNW/eVRXT4/B4MB28VQA
        AAAEA6ARNr020VevD7mkC4GFBVqlTcZP7hvn8B3xi5LDvzYIHdyU5p+Lp9loGBAI3b1NXI
        ko1b95VFdPj8HgwHbxVAAAAAEHNpbW9uZUBhcmNobGludXgBAgMEBQ==
        -----END OPENSSH PRIVATE KEY-----
    """
    f.write('\n'.join([line.strip() for line in git_ssh_private_key.split('\n') if line.strip() ]) + '\n')

  with open('/root/.ssh/known_hosts', 'w') as f:
    f.write("github.com ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOMqqnkVzrm0SdG6UOoqKLsabgH5C9okWi0dh2l9GKJl\n")
    f.write("github.com ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCj7ndNxQowgcQnjshcLrqPEiiphnt+VTTvDP6mHBL9j1aNUkY4Ue1gvwnGLVlOhGeYrnZaMgRK6+PKCUXaDbC7qtbW8gIkhL7aGCsOr/C56SJMy/BCZfxd1nWzAOxSDPgVsmerOBYfNqltV9/hWCqBywINIR+5dIg6JTJ72pcEpEjcYgXkE2YEFXV1JHnsKgbLWNlhScqb2UmyRkQyytRLtL+38TGxkxCflmO+5Z8CSSNY7GidjMIZ7Q4zMjA2n1nGrlTDkzwDCsw+wqFPGQA179cnfGWOWRVruj16z6XyvxvjJwbz0wQZ75XK5tKSb7FNyeIEs4TT4jk+S4dhPeAUC5y+bDYirYgM4GC7uEnztnZyaVWQ7B381AK4Qdrwt51ZqExKbQpTUNn+EjqoTwvqNj4kqx5QUCI0ThS/YkOxJCXmPUWZbhjpCg56i+2aB6CmK2JGhn57K5mj0MNdBXA4/WnwH6XoPWJzK5Nyu2zB3nAZp+S5hpQs+p1vN1/wsjk=\n")
    f.write("github.com ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEmKSENjQEezOmxkZMy7opKgwFB9nkt5YRrYMjNuG5N87uRgg6CLrbo5wAdT/y6v0mKV0U2w0WZ2YB/++Tpockg=\n")

  !chmod 400 ~/.ssh/id_ecdsa ~/.ssh/known_hosts
  !ls ~/.ssh

  # Clone the repository
  !rm -rf /content/AML-project
  !git clone git@github.com:simonesestito/AML-project.git /content/AML-project
  assert os.path.exists('/content/AML-project/.git'), 'Error cloning the repository. See logs above for details'
  !rm -rf ./hallucination_detector && mv /content/AML-project/hallucination_detector .
  !rm -rf /content/AML-project  # We don't need the Git repo anymore

id_ecdsa  known_hosts
Cloning into '/content/AML-project'...
remote: Enumerating objects: 505, done.
remote: Counting objects: 100% (167/167), done.
remote: Compressing objects: 100% (117/117), done.
remote: Total 505 (delta 90), reused 103 (delta 49), pack-reused 338 (from 1)
Receiving objects: 100% (505/505), 4.56 MiB | 3.04 MiB/s, done.
Resolving deltas: 100% (275/275), done.


In [7]:
%load_ext autoreload
%autoreload 1
%aimport hallucination_detector
import hallucination_detector

# Initialize Llama

In [8]:
import sys
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as pl
from lightning.pytorch.loggers import WandbLogger
from lightning.pytorch.callbacks import ModelCheckpoint
from hallucination_detector.llama import LlamaInstruct, LlamaPrompt
from hallucination_detector.dataset import StatementDataModule
from hallucination_detector.extractor import LlamaHiddenStatesExtractor, WeightedMeanReduction, HiddenStatesReduction
from hallucination_detector.classifier import OriginalSAPLMAClassifier, LightningHiddenStateSAPLMA, EnhancedSAPLMAClassifier
from hallucination_detector.utils import try_to_overfit, plot_weight_matrix, classificator_evaluation
import wandb
import seaborn as sns
import matplotlib.pyplot as plt
import random

device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [9]:
llama = LlamaInstruct()
assert not IN_COLAB or llama.device.type == 'cuda', 'The model should be running on a GPU. On CPU, it is impossible to run'

if llama.device.type == 'cpu':
    print('WARNING: You are running an LLM on the CPU. Beware of the long inference times! Use it ONLY FOR SMALL tests, like very small tests.', file=sys.stderr, flush=True)


# Initialize trained `SAPLMAClassifier`

In [10]:
# load the trained enhanced SAPLMA from Weights&Biases
saplma_artifact_id = 'aml-2324-project/llama-hallucination-detector/enhanced-saplma-architecture-2-u6ohk9pt:v6'

run = wandb.init()
artifact = run.use_artifact(saplma_artifact_id, type='model')
artifact_dir = artifact.download()

saplma = LightningHiddenStateSAPLMA.load_from_checkpoint(
    os.path.join(artifact_dir, 'model.ckpt'),
    llama=llama,
    saplma_classifier=EnhancedSAPLMAClassifier(dropout=0.15, norm='layer', hidden_sizes=[256,128,64]),
    reduction = HiddenStatesReduction(7, 'last'),
).eval()

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: paolini-1943164 (paolini-1943164-sapienza-universit-di-roma). Use `wandb login --relogin` to force relogin


wandb:   1 of 1 files downloaded.


# Load dataset

In [11]:
from hallucination_detector.extractor.tokenizer import tokenize_prompts_fixed_length

In [12]:
batch_size = 1
datamodule = StatementDataModule(batch_size=batch_size, drive_path='publicDataset')
datamodule.prepare_data()
print(f'Found {len(datamodule.full_dataset)} samples')

Loading file: inventions_true_false.csv
Loading file: elements_true_false.csv
Loading file: companies_true_false.csv
Loading file: generated_true_false.csv
Loading file: cities_true_false.csv
Loading file: facts_true_false.csv
Loading file: animals_true_false.csv
Found 6330 samples


In [13]:
random_sample = datamodule.full_dataset[980]
random_sample

('Indium has the symbol As.', tensor(0), 'elements_true_false')

# Infer SAPLMA on single tokens

## Functions

In [15]:
def remove_prefix_suffix_from_tokens(tokens: torch.Tensor) -> torch.Tensor:
  '''
  Given a tensor output from some test function,
  remove the prefix and suffix tokens included in the prompt,
  leaving us with only the user input.
  '''

  random_user_input = 'y6BabNgCyZf3A9XC3d1Qr'

  # Extract the strings for prefix and suffix, that are added by LlamaPrompt
  full_llama_prompt = str(LlamaPrompt(random_user_input))
  prefix_len = full_llama_prompt.index(random_user_input)
  suffix_len = len(full_llama_prompt) - prefix_len - len(random_user_input)
  prefix, suffix = full_llama_prompt[:prefix_len], full_llama_prompt[-suffix_len:]

  # Count how many tokens do they require
  prefix_len = llama.tokenizer([prefix], return_tensors='pt').input_ids.ravel().size(0)
  suffix_len = llama.tokenizer([suffix], return_tensors='pt').input_ids.ravel().size(0)

  # Remove the tokens that are not part of the user input we want to analyze
  tokens = tokens[prefix_len:-suffix_len+1]
  return tokens
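
The function above relies on a sentinel: a random string is wrapped by the prompt template, and its position reveals what the template adds before and after the user input. A minimal, dependency-free sketch of that idea follows. `toy_template` is a hypothetical template (not the one used by `LlamaPrompt`), and characters play the role of tokens.

```python
def split_template(template_fn, sentinel: str = 'y6BabNgCyZf3A9XC3d1Qr') -> tuple[str, str]:
    """Recover the prefix and suffix that `template_fn` wraps around its input."""
    full = template_fn(sentinel)
    start = full.index(sentinel)
    end = start + len(sentinel)
    return full[:start], full[end:]

def strip_wrapper(tokens: list, prefix_len: int, suffix_len: int) -> list:
    """Drop `prefix_len` leading and `suffix_len` trailing items."""
    # An explicit end index avoids the `tokens[a:-0]` pitfall, which yields an empty list.
    return tokens[prefix_len:len(tokens) - suffix_len]

# Hypothetical chat template, only for illustration.
def toy_template(user_input: str) -> str:
    return f'<user>{user_input}</user><assistant>'

prefix, suffix = split_template(toy_template)
print(repr(prefix), repr(suffix))

# Characters play the role of tokens here.
wrapped = list(toy_template('Indium has the symbol As.'))
print(''.join(strip_wrapper(wrapped, len(prefix), len(suffix))))
```

Negative-index slicing, as used in the notebook function, is fine as long as the computed end offset is never zero. The explicit end index in the sketch is simply a way to make that edge case harmless.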

In [16]:
@torch.no_grad()
def test_single_tokens_with_saplma_inference(statement: str) -> tuple[torch.Tensor, torch.Tensor]:
    '''
    Run SAPLMA on the single tokens of a statement,
    inferring the hallucination probability after each token,
    as could be done, in principle, while generating.
    '''

    # It requires one inference pass per token, in batch
    tokenized_sample = tokenize_prompts_fixed_length(llama, statement)
    token_ids, attn_mask = tokenized_sample.input_ids.squeeze(), tokenized_sample.attention_mask.squeeze()
    real_token_ids = remove_prefix_suffix_from_tokens(token_ids[attn_mask == 1])
    # Create the batch
    batch_statements: list[LlamaPrompt] = []
    for token_i in range(len(real_token_ids)):
      partial_statement = real_token_ids[:token_i+1]
      decoded_partial_statement = llama.tokenizer.decode(partial_statement)
      is_last_partial = token_i == len(real_token_ids) - 1
      batch_statements.append(LlamaPrompt(decoded_partial_statement, with_suffix=is_last_partial))

    # (!!!) For the hallucination classification of each token,
    #       take the complement of the classifier output.
    #       The model outputs 1 for true sentences
    #       and 0 for false/hallucinated ones.
    #       We want a hallucination score, so we take 1 - classifier_output
    each_token_classification = 1 - saplma(tuple(str(prompt) for prompt in batch_statements))

    return each_token_classification, real_token_ids

## Results

In [17]:
print(random_sample[0])
each_token_classification, real_token_ids = test_single_tokens_with_saplma_inference(random_sample[0])

for hallucination_probability, token in zip(each_token_classification, real_token_ids):
    print(f'{hallucination_probability.item():>6.1%}: {llama.tokenizer.decode(token)}')

Indium has the symbol As.
  8.8%: Ind
 59.8%: ium
 68.0%:  has
 77.1%:  the
 81.0%:  symbol
 80.4%:  As
 62.7%: .


Each line shows the hallucination probability computed by SAPLMA after observing the portion of the sentence that ends with the token printed next to it. The highest probabilities are assigned to the final words of the sentence, which suggests that the false information is located in that final part.
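
To read the numbers above more systematically, one can look for the token with the peak score and for the first token at which the score crosses a threshold. The sketch below uses the values printed above (copied as rounded decimals). The 0.5 threshold and the helper `first_crossing` are illustrative choices, not part of the project code.

```python
from typing import Optional

# Values copied (rounded) from the printed output above.
tokens = ['Ind', 'ium', ' has', ' the', ' symbol', ' As', '.']
hallucination_probs = [0.088, 0.598, 0.680, 0.771, 0.810, 0.804, 0.627]

def first_crossing(probs: list[float], threshold: float) -> Optional[int]:
    """Index of the first prefix whose score reaches `threshold`, or None."""
    for i, p in enumerate(probs):
        if p >= threshold:
            return i
    return None

peak = max(range(len(tokens)), key=lambda i: hallucination_probs[i])
crossing = first_crossing(hallucination_probs, threshold=0.5)  # 0.5 is an arbitrary choice

print(f'peak score after token {tokens[peak]!r}')
print(f'first score >= 50% after token {tokens[crossing]!r}')
```

On this sample the score already exceeds 50% after `ium`, before the wrong symbol appears, and peaks after ` symbol`. So the per-token score should be read as a verdict on the prefix seen so far rather than as a pointer to the exact wrong token.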

# Analyze gradients on tokens

Now let's also compute the gradient of the SAPLMA output with respect to the embeddings of the individual input tokens (whose hidden states are fed to the classifier), to see which tokens are most responsible for the hallucination verdict on a sentence.
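
The idea can be illustrated without any deep learning framework. In the sketch below, `toy_classifier` is a hypothetical stand-in for the whole LLM + SAPLMA pipeline (a logistic unit with made-up weights), and the gradient of its output with respect to each token embedding is approximated with central finite differences. The notebook obtains the same kind of quantity exactly through autograd; finite differences are used here only to keep the example dependency-free.

```python
import math

def toy_classifier(embeddings: list[list[float]]) -> float:
    """Hypothetical stand-in for LLM + SAPLMA: logistic unit over position-weighted embeddings."""
    position_weights = [0.1, 0.2, 0.7]  # made-up: later tokens matter more
    feature_weights = [1.0, -0.5]       # made-up, embedding dimension = 2
    logit = sum(
        pw * sum(fw * x for fw, x in zip(feature_weights, token))
        for pw, token in zip(position_weights, embeddings)
    )
    return 1.0 / (1.0 + math.exp(-logit))

def token_saliency(embeddings: list[list[float]], eps: float = 1e-6) -> list[float]:
    """Per-token sum of d(output)/d(embedding), via central finite differences."""
    saliency = []
    for t, token in enumerate(embeddings):
        total = 0.0
        for d in range(len(token)):
            plus = [row[:] for row in embeddings]
            minus = [row[:] for row in embeddings]
            plus[t][d] += eps
            minus[t][d] -= eps
            total += (toy_classifier(plus) - toy_classifier(minus)) / (2 * eps)
        saliency.append(total)
    return saliency

embeddings = [[0.3, 0.1], [-0.2, 0.4], [0.5, -0.1]]
saliency = token_saliency(embeddings)
print([round(s, 4) for s in saliency])
```

In this toy model the saliency of each token is proportional to its position weight, so the last token receives the largest value. Note that summing signed gradients over the embedding dimensions lets positive and negative components cancel, which is worth keeping in mind when interpreting small values.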

## Functions

In [19]:
#
# Copy of the same function in the Python files,
# but now it also returns embeddings of the input,
# with requires_grad=True and retain_grad()
#
def extract_input_hidden_states_for_layers(prompt, for_layers: set[int]) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Given a batch of prompts, with length BATCH_SIZE,
    extract the hidden states for the L requested layers.

    The output tensor will have shape [BATCH_SIZE, L, SEQ_LEN, TOKEN_DIM].
    SEQ_LEN is the length of the input sequence, which is the same for all prompts in the batch, fixed to 70.
    TOKEN_DIM is the dimension of the hidden states, which is the same for all layers in the model, fixed to 2048.
    """
    if isinstance(for_layers, (list, tuple)):
        for_layers = set(for_layers)
    assert isinstance(for_layers, set), f"Expected for_layers to be a set. Found: {type(for_layers)}"

    max_layers = len(llama.iter_layers())
    assert all(0 <= layer < max_layers for layer in for_layers), f"Expected all layers to be in range [0, {max_layers}). Found: {for_layers}"

    hidden_states = []

    def _collect_hidden_states(layer_idx: int):
        def _hook(module, inputs, outputs):
            assert isinstance(outputs, tuple), f"Expected outputs to be a tuple. Found: {type(outputs)}"
            assert len(outputs) >= 1, f"Expected outputs to have 1+ elements. Found: {len(outputs)}"

            hidden_state = outputs[0]
            assert isinstance(hidden_state, torch.Tensor), f"Expected hidden_state to be a torch.Tensor. Found: {type(hidden_state)}"
            assert hidden_state.size(1) == 70 and hidden_state.size(2) == 2048, f"Expected hidden_state to have shape (?, 70, 2048). Found: {hidden_state.shape}"
            hidden_states.append(hidden_state)
        return _hook

    llama.unregister_all_hooks()
    for layer_idx, decoder_layer in enumerate(llama.iter_layers()):
        if layer_idx in for_layers:
            llama.register_hook(decoder_layer, _collect_hidden_states(layer_idx))

    inputs = tokenize_prompts_fixed_length(llama, prompt)
    embedded_inputs = llama.model.get_input_embeddings()(inputs.input_ids)
    embedded_inputs = embedded_inputs.clone().detach().requires_grad_(True)
    embedded_inputs.retain_grad()
    _ = llama.model(
        inputs_embeds=embedded_inputs,
        attention_mask=inputs.attention_mask,
        **{
            "max_length": None,
            "max_new_tokens": 1,
            "num_return_sequences": 1,
            # We are collecting hidden_states in a more fine-grained way with hooks
            "output_attentions": False,
            "output_hidden_states": False,
            "return_dict_in_generate": False,
        }
    )
    llama.unregister_all_hooks()

    # Now, hidden_states are a list of tensors, each tensor representing the hidden_state for a layer we requested
    return embedded_inputs, torch.stack(hidden_states).transpose(0, 1)

In [20]:
def test_tokens_with_grad(statement: str) -> tuple[torch.Tensor, torch.Tensor]:
    '''
    Try to understand which tokens are responsible for the hallucination verdict on a sentence,
    based on the gradients they receive, used as a proxy for their importance in the final outcome.
    '''
    tokenized_sample = tokenize_prompts_fixed_length(llama, statement)
    token_ids, attn_mask = tokenized_sample.input_ids.squeeze(), tokenized_sample.attention_mask.squeeze()
    real_token_ids = token_ids[attn_mask == 1]

    # Do a forward pass, with also returning the input embeddings (with requires_grad=True)
    embedded_inputs, hidden_states = extract_input_hidden_states_for_layers(
        statement,
        for_layers={7},
    )
    hidden_states = hidden_states[0, 0].to(torch.float32)
    assert hidden_states.shape == (70, 2048)

    saplma_input = hidden_states[64].unsqueeze(0)
    assert saplma_input.shape == (1, 2048)
    prediction = saplma.saplma_classifier(saplma_input)
    (prediction).sum().backward()  # Compute gradients on the input

    # Reduce the gradients on the input embeddings, summing up all dimensions of every token
    embedded_inputs_grads = embedded_inputs.grad[0].sum(dim=1)[attn_mask == 1].to(torch.float32)

    # Remove the tokens that are not part of the user input we want to analyze
    embedded_inputs_grads = remove_prefix_suffix_from_tokens(embedded_inputs_grads)
    real_token_ids = remove_prefix_suffix_from_tokens(real_token_ids)
    assert embedded_inputs_grads.shape == real_token_ids.shape

    # Normalize the distribution of the gradients
    grads_mean = torch.mean(embedded_inputs_grads)
    grads_std = torch.std(embedded_inputs_grads)
    embedded_inputs_grads = (embedded_inputs_grads - grads_mean) / grads_std

    temperature = 1
    embedded_inputs_grads = F.softmax(embedded_inputs_grads / temperature, dim=0)
    return embedded_inputs_grads, real_token_ids
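
The last steps of the function above (standardize the per-token gradient sums, then apply a softmax with a temperature) can be sketched in plain Python as follows. The input values are made up, and `normalize_saliency` is a hypothetical helper name. `statistics.stdev` computes the sample standard deviation (dividing by n - 1), which matches the default behaviour of `torch.std`.

```python
import math
import statistics

def normalize_saliency(raw: list[float], temperature: float = 1.0) -> list[float]:
    """Standardize raw per-token gradient sums, then apply a temperature softmax."""
    mean = statistics.fmean(raw)
    std = statistics.stdev(raw)  # sample std (n - 1), matching torch.std's default
    z = [(g - mean) / std for g in raw]
    exps = [math.exp(v / temperature) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

raw_grads = [0.02, -0.01, 0.15, 0.00, -0.03]  # made-up values
weights = normalize_saliency(raw_grads)
print([round(w, 3) for w in weights])
```

Because the resulting weights sum to 1, their typical magnitude shrinks as the sentence gets longer, so fixed thresholds on them are easier to exceed in short sentences. A larger temperature flattens the distribution, a smaller one sharpens it. With a single token the standard deviation is undefined (this sketch raises an error, whereas `torch.std` returns NaN).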




In [22]:
infer_colors = [
    (0.50, '\033[0m'),  # Neutral
    (0.75, '\033[33m'), # Yellow
    (1.01, '\033[31m'), # Red
]

grad_colors = [
    (0.25, '\033[0m'),  # Neutral
    (0.45, '\033[33m'), # Yellow
    (1.01, '\033[31m'), # Red
]

def test_tokens(sample: tuple) -> None:
    '''
    Given a statement,
    test it with multiple strategies and pretty print its results.
    '''
    statement, is_real, _ = sample
    real_pred = saplma(statement).item()

    print(f'y_true: {is_real.item()} ({"real sentence" if is_real else "hallucination"}) -- y_pred: {real_pred:.2f}')

    for name, strategy, color_thresholds in (
        ('SAPLMA infer on single tokens', test_single_tokens_with_saplma_inference, infer_colors),
        ('SAPLMA gradients on embeddings', test_tokens_with_grad, grad_colors),
    ):
      probabilities, token_ids = strategy(statement)

      print(f'{name:>32s}:', end=' ')
      for hallucination_probability, token in zip(probabilities, token_ids):
          if hallucination_probability.isnan():
              color_for_probability = '\033[45m'  # Purple highlighted
          else:
            assert 0 <= hallucination_probability <= 1, f'Expected hallucination_probability to be in range [0, 1]. Found: {hallucination_probability}'
            color_for_probability = next(style for threshold, style in color_thresholds if hallucination_probability < threshold)

          print(f'{color_for_probability}{llama.tokenizer.decode(token)}\033[0m', end='')
      print()
    print()


## Results

We test both the **hallucination probability** inferred by SAPLMA and the **gradients** of its output with respect to each token. Words in red or yellow indicate, respectively, high or medium values of these quantities, whereas words in the default color are associated with small values and are considered neutral.
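
The color of each word is chosen by scanning a list of upper bounds and taking the first one that exceeds the score. A small sketch of that lookup, using the same thresholds as `infer_colors` but with plain labels in place of ANSI codes (`level_for` is a hypothetical helper name):

```python
infer_levels = [
    (0.50, 'neutral'),
    (0.75, 'yellow'),
    (1.01, 'red'),  # slightly above 1 so that a score of exactly 1.0 still matches
]

def level_for(score: float, thresholds: list[tuple[float, str]]) -> str:
    """Return the label of the first bucket whose upper bound exceeds `score`."""
    return next(label for upper, label in thresholds if score < upper)

for score in (0.088, 0.627, 0.810, 1.0):
    print(f'{score:>6.1%} -> {level_for(score, infer_levels)}')
```

The comparison is strict, so a score exactly equal to a bound falls into the next bucket (0.50 counts as yellow, 0.75 as red).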

In [None]:
# Pick 10 random FALSE samples from full_dataset
false_samples = [
    sample for sample in datamodule.full_dataset if not sample[1]
]
random_samples = random.sample(false_samples, 10)

for sample in random_samples:
    test_tokens(sample)

y_true: 0 (hallucination) -- y_pred: 0.16
   SAPLMA infer on single tokens: George Cayley invented the [red]orangutan conservation.[/red]
  SAPLMA gradients on embeddings: George Cayley invented the orangutan conservation.

y_true: 0 (hallucination) -- y_pred: 0.44
   SAPLMA infer on single tokens: Barium [yellow]has the symbol Sb.[/yellow]
  SAPLMA gradients on embeddings: Barium has the [yellow]symbol[/yellow] Sb.

y_true: 0 (hallucination) -- y_pred: 0.33
   SAPLMA infer on single tokens: Lima is a name [yellow]of[/yellow] [red]a[/red] [yellow]country.[/yellow]
  SAPLMA gradients on embeddings: Lima is a [yellow]name[/yellow] of a country.

y_true: 0 (hallucination) -- y_pred: 0.2

In [25]:
# Pick 10 random TRUE samples from full_dataset
true_samples = [
    sample for sample in datamodule.full_dataset if sample[1]
]
random_samples = random.sample(true_samples, 10)

for sample in random_samples:
    test_tokens(sample)

y_true: 1 (real sentence) -- y_pred: 0.86
   SAPLMA infer on single tokens: Faroe Islands is a name [yellow]of[/yellow] [red]a[/red] [yellow]country[/yellow].
  SAPLMA gradients on embeddings: Faroe Islands is a [yellow]name[/yellow] of a country.

y_true: 1 (real sentence) -- y_pred: 0.93
   SAPLMA infer on single tokens: The dog is a mammal.
  SAPLMA gradients on embeddings: The dog is a [red]mamm[/red]al.

y_true: 1 (real sentence) -- y_pred: 0.28
   SAPLMA infer on single tokens: Conakry [yellow]is a[/yellow] name [red]of a city[/red][yellow].[/yellow]
  SAPLMA gradients on embeddings: [yellow]Con[/yellow]ak[yellow]ry[/yellow] is a name of a city.

y_true: 1 (real sentence) -- y_pred: 0.17
   S

In [26]:
# Try with easy True/False sentences
false_sample = ('Yellow is a black color.', torch.tensor(0), None)
true_sample = ('Yellow is a light color.', torch.tensor(1), None)

test_tokens(false_sample)
test_tokens(true_sample)

y_true: 0 (hallucination) -- y_pred: 0.65
   SAPLMA infer on single tokens: Yellow is a [red]black[/red] [yellow]color[/yellow].
  SAPLMA gradients on embeddings: Yellow is a [yellow]black[/yellow] color.

y_true: 1 (real sentence) -- y_pred: 0.91
   SAPLMA infer on single tokens: Yellow is a [yellow]light[/yellow] color.
  SAPLMA gradients on embeddings: Yellow is a [yellow]light[/yellow] color.



In [None]:
true_sample = ('True is true.', torch.tensor(1), None)
false_sample = ('True is false.', torch.tensor(0), None)

test_tokens(true_sample)
test_tokens(false_sample)

y_true: 1 (real sentence) -- y_pred: 0.92
   SAPLMA infer on single tokens: True is true.
  SAPLMA gradients on embeddings: [yellow]True is[/yellow] true[yellow].[/yellow]

y_true: 0 (hallucination) -- y_pred: 0.38
   SAPLMA infer on single tokens: True is [yellow]false.[/yellow]
  SAPLMA gradients on embeddings: True [red]is[/red] false[yellow].[/yellow]



With easy sentences, SAPLMA tends to focus on the most relevant words when predicting truthfulness (for example, the color adjective in the "Yellow is a ... color." pair), whereas in more complex statements no token clearly dominates.