<h1>Semantic Entropy Probes: Robust and Cheap
Hallucination Detection in LLMs</h1>

<p style="font-size: 16px"><a href="https://arxiv.org/abs/2406.15927">Preprint of the paper</a>.</p>

In [None]:
!pip install accelerate bitsandbytes

In [None]:
!pip install openai python-dotenv

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import numpy as np

model_id = "unsloth/llama-3-8b-Instruct-bnb-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    load_in_4bit=True,
    device_map="auto",
)

Model structure: note the 32 decoder layers that each query passes through in this particular model.

In [5]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): Ll

In [6]:
# Define the input messages
messages = [
    {"role": "user", "content": "Answer the following question in a single brief but complete sentence. What is the capital of france"},
]

# Apply chat template and tokenize
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Define terminators
terminators = [
    tokenizer.eos_token_id
]

# Set generation config
model.generation_config.pad_token_id = tokenizer.pad_token_id

# Generate the response with hidden states output
outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators[0],  # Ensure eos_token_id is valid
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    output_hidden_states=True,
    return_dict_in_generate=True
)

# Extract hidden states from the output
hidden_states = outputs.hidden_states  # This is a tuple of hidden states

response = outputs.sequences[0][input_ids.shape[-1]:]
response_text = tokenizer.decode(response, skip_special_tokens=True)
print("Generated Response:", response_text)


Generated Response: The capital of France is Paris.


In [7]:
tbg_hiddenstates = hidden_states[0][-5:]  # last five layers at the first generation step; this step covers the whole prompt, so the token before generation (TBG) is its last position
slt_hiddenstates = hidden_states[-2][-5:]  # last five layers at the second-to-last generation step (SLT)

These are the five hidden states obtained for the second-to-last token.

Let's use the second-to-last token's hidden states across questions to train a logistic regression model.
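
For orientation, the nesting of `outputs.hidden_states` can be mimicked without loading a model. The sketch below is a minimal mock that stores only tensor shapes. `mock_hidden_states` and its arguments are hypothetical names, and it assumes generation with the key-value cache enabled (the default), where the first step processes the whole prompt and each later step processes a single token.

```python
def mock_hidden_states(prompt_len, new_tokens, num_layers=32, hidden_size=4096):
    """Mimic the nesting of generate(..., output_hidden_states=True) with shapes only."""
    steps = []
    for step in range(new_tokens):
        # Step 0 sees the full prompt; later steps see one new token each.
        seq_len = prompt_len if step == 0 else 1
        # One entry for the embedding output plus one per decoder layer.
        layers = tuple((1, seq_len, hidden_size) for _ in range(num_layers + 1))
        steps.append(layers)
    return tuple(steps)


hidden_states = mock_hidden_states(prompt_len=30, new_tokens=8)

tbg_shapes = hidden_states[0][-5:]   # last five layers, first generation step
slt_shapes = hidden_states[-2][-5:]  # last five layers, second-to-last step

print(len(hidden_states), len(hidden_states[0]))  # 8 steps, 33 entries per step
print(tbg_shapes[0])  # (1, 30, 4096): the last prompt token is position -1 on axis 1
print(slt_shapes[0])  # (1, 1, 4096)
```

Under these assumptions, reading the token-before-generation state from the real tensors would need an extra index on the sequence axis, whereas the later steps already hold a single position.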

In [8]:
slt_hiddenstates

(tensor([[[0.0574, 0.1030, 0.2539,  ..., 0.4863, 0.0156, 0.4434]]],
        device='cuda:0', dtype=torch.bfloat16),
 tensor([[[0.0105, 0.1050, 0.3867,  ..., 0.4844, 0.1416, 0.3574]]],
        device='cuda:0', dtype=torch.bfloat16),
 tensor([[[ 0.0312, -0.0723,  0.2275,  ...,  0.6211, -0.0098,  0.2617]]],
        device='cuda:0', dtype=torch.bfloat16),
 tensor([[[-0.0820,  0.0173, -0.0303,  ...,  0.6094,  0.0020, -0.3516]]],
        device='cuda:0', dtype=torch.bfloat16),
 tensor([[[-2.0156,  0.6758, -0.7852,  ...,  4.8750, -0.7227, -2.1875]]],
        device='cuda:0', dtype=torch.bfloat16))

In [9]:
for hidden_state in slt_hiddenstates:
    print(hidden_state.squeeze().shape)  # squeeze from shape [1, 1, 4096] to [4096], as in the paper

torch.Size([4096])
torch.Size([4096])
torch.Size([4096])
torch.Size([4096])
torch.Size([4096])


<h2>Define our Semantic Entropy functions.</h2>

In [10]:
from openai import AzureOpenAI
from dotenv import load_dotenv
import os

load_dotenv()

client = AzureOpenAI(
    azure_endpoint=os.getenv("BASE"),
    api_version=os.getenv("VERSION"),
    api_key=os.getenv("KEY"),
    timeout=15.0,
    max_retries=2,
)

def make_oai_call(prompt):
    response = client.chat.completions.create(
        model= "gpt-4o",
        temperature=0.0,
        max_tokens=2000,
        timeout=25.0,
        messages=[
            {"role": "user", "content": f"{prompt}"},
        ]
    )
    return response.choices[0].message.content

def check_implication_gpt_4o(text1, text2, question):
    prompt = f"""We are evaluating answers to the question "{question}"
    Here are two possible answers:
    Possible Answer 1: {text1}
    Possible Answer 2: {text2}
    Does Possible Answer 1 semantically entail Possible Answer 2? Respond with entailment, contradiction, or neutral."""

    response_text = make_oai_call(prompt).lower()
    if 'entailment' in response_text:
        return 2
    elif 'contradiction' in response_text:
        return 0
    elif 'neutral' in response_text:
        return 1
    else:
        return 1  # Default to neutral if the response is unclear

def bidirectional_entailment_clustering(sequences, question):
    # Initialize the set of meanings with the first sequence
    C = [{sequences[0]}]

    # Iterate over each sequence starting from the second one
    for m in range(1, len(sequences)):
        s_m = sequences[m]
        added_to_existing_class = False

        # Compare with existing classes
        for c in C:
            s_c = next(iter(c))  # Get the first sequence in the class

            # Check bi-directional entailment
            left = check_implication_gpt_4o(s_c, s_m, question) # change call here to other model.
            right = check_implication_gpt_4o(s_m, s_c, question)

            if left == 2 and right == 2:  # both directions entail
                c.add(s_m)
                added_to_existing_class = True
                break

        if not added_to_existing_class:
            # If not added to any existing class, create a new class
            C.append({s_m})

    return C


# Function to calculate discrete semantic entropy
def calculate_discrete_semantic_entropy(clusters):
    # Note: each cluster is a set, so identical response strings are counted once.
    total_responses = sum(len(cluster) for cluster in clusters)
    probabilities = [len(cluster) / total_responses for cluster in clusters]
    print(probabilities)
    p = np.array(probabilities)  # explicit array so the product below is elementwise
    entropy = -np.sum(p * np.log10(p))  # base-10 logarithm
    return entropy

def calculate_semantic_entropy(responses, question):
    clusters = bidirectional_entailment_clustering(responses, question)
    for clus in clusters:
        print(clus)

    # Calculate discrete semantic entropy
    entropy = calculate_discrete_semantic_entropy(clusters)
    return entropy
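
Because each meaning class above is a Python set, identical response strings collapse into one member before the probabilities are computed. The sketch below is one way to keep duplicates and to exercise the clustering offline: it stores classes as lists and takes the entailment check as a parameter. `cluster_by_meaning`, `entropy_from_clusters`, and the toy `same_city` check are hypothetical names introduced for illustration; they are not part of the paper or of any library.

```python
import math


def cluster_by_meaning(responses, entails):
    """Greedy bidirectional-entailment clustering that keeps duplicate responses."""
    clusters = []
    for response in responses:
        for cluster in clusters:
            representative = cluster[0]
            # Same meaning only if entailment holds in both directions.
            if entails(representative, response) and entails(response, representative):
                cluster.append(response)
                break
        else:
            # No existing class matched, so start a new one.
            clusters.append([response])
    return clusters


def entropy_from_clusters(clusters):
    """Discrete semantic entropy (base 10) computed from cluster sizes."""
    total = sum(len(cluster) for cluster in clusters)
    entropy = 0.0
    for cluster in clusters:
        p = len(cluster) / total
        entropy -= p * math.log10(p)
    return entropy


def same_city(a, b):
    """Toy stand-in for an entailment model: compare the last word of each answer."""
    def last_word(s):
        return s.lower().rstrip(".").split()[-1]
    return last_word(a) == last_word(b)


responses = ["It is in Paris."] * 9 + ["It is in London."]
clusters = cluster_by_meaning(responses, same_city)
print([len(c) for c in clusters])  # [9, 1]
print(round(entropy_from_clusters(clusters), 4))
```

With set-based classes the same ten responses would reduce to two one-member classes, giving probabilities of 0.5 each and a much higher entropy than the 9-to-1 split warrants.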


In [12]:
def query_llama_model(prompt):

    messages = [
        {"role": "user", "content": f"Answer the following question in a single brief but complete sentence. {prompt}"},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)
    terminators = [tokenizer.eos_token_id]
    model.generation_config.pad_token_id = tokenizer.pad_token_id
    outputs = model.generate(
        input_ids,
        max_new_tokens=256,
        eos_token_id=terminators[0],  # Ensure eos_token_id is valid
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
        output_hidden_states=True,
        return_dict_in_generate=True)
    hidden_states = outputs.hidden_states  # This is a tuple of hidden states
    response = outputs.sequences[0][input_ids.shape[-1]:]
    response_text = tokenizer.decode(response, skip_special_tokens=True)
    tbg_hiddenstates = hidden_states[0][-5:]  # token before generation (TBG): should be the end of the input X rather than part of the generated response; unused here
    slt_hiddenstates = hidden_states[-2][-5:]  # second-to-last token (SLT)
    return response_text, slt_hiddenstates


In [25]:
questions = [
    "Where is the Eiffel Tower?",
    "Until when can you continue to use your national ID card to enter the UK in certain cases?",
    "What skill level must the job offer be at for skilled workers under the points-based system in the uk?",
    "Can EU, EEA, and Swiss citizens use ePassport gates in the UK?",
]

features = []
targets = []

for question in questions:
    print(question)
    responses = []

    for _ in range(10):  # N samples, as in the paper
        response, slt_hidden_states = query_llama_model(question)
        responses.append(response)

    entropy = calculate_semantic_entropy(responses, question)

    # Collect the hidden states for this question.
    # Note: slt_hidden_states holds only the last of the sampled responses.
    for hidden_state in slt_hidden_states:
        feature_vector = hidden_state.squeeze().to(torch.float32).cpu().numpy()  # squeeze and convert to a numpy array
        features.append(feature_vector)
        targets.append(entropy)  # same entropy score for all five layers of this question



Where is the Eiffel Tower?
{'The Eiffel Tower is located in Paris, France.'}
[1.0]
Until when can you continue to use your national ID card to enter the UK in certain cases?
{"You can continue to use your national ID card to enter the UK in certain cases until December 31, 2025, when the UK's reciprocal agreement with the EU ends.", 'You can continue to use your national ID card to enter the UK in certain cases until December 31, 2025, as the UK has temporarily allowed the use of national ID cards for travel to the country until that date.', 'You can continue to use your national ID card to enter the UK in certain cases until December 31, 2025, as the UK has temporarily extended the acceptance of national ID cards for travel to the UK until that date.'}
{'You can continue to use your national ID card to enter the UK in certain cases until December 31, 2025, as the UK has not yet decided to accept national ID cards as a valid travel document for entry.'}
{"You can continue to use your n

In [26]:
targets  # one semantic entropy score per (question, layer) pair: 4 questions x 5 layers. The Eiffel Tower question has zero entropy, so the first 5 entries are 0.

[-0.0,
 -0.0,
 -0.0,
 -0.0,
 -0.0,
 0.796657624451305,
 0.796657624451305,
 0.796657624451305,
 0.796657624451305,
 0.796657624451305,
 0.38997287335391506,
 0.38997287335391506,
 0.38997287335391506,
 0.38997287335391506,
 0.38997287335391506,
 0.9397940008672038,
 0.9397940008672038,
 0.9397940008672038,
 0.9397940008672038,
 0.9397940008672038]

<h2>Train a Logistic Regression model on the Hidden States.</h2>

Ideally you would collect many more of these hidden states for training. With only 4 example queries (3 labelled 'high' and 1 'low'), this quick example overfits.
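
The 'high' and 'low' labels come from thresholding the semantic entropy scores. As a hedged illustration, the sketch below picks the threshold that minimizes the summed within-group squared error, in the style of a regression-tree split. Treating this as the paper's procedure is an assumption. `best_split_threshold` is a hypothetical helper, and the scores are the four per-question values from the run above.

```python
def best_split_threshold(scores):
    """Return the midpoint threshold that minimizes within-group squared error."""
    values = sorted(set(scores))
    if len(values) < 2:
        raise ValueError("need at least two distinct scores to split")

    def sse(group):
        # Sum of squared deviations from the group mean.
        mean = sum(group) / len(group)
        return sum((x - mean) ** 2 for x in group)

    best_threshold, best_cost = None, float("inf")
    # Candidate thresholds are midpoints between neighbouring distinct scores.
    for lower, upper in zip(values, values[1:]):
        candidate = (lower + upper) / 2
        low = [s for s in scores if s <= candidate]
        high = [s for s in scores if s > candidate]
        cost = sse(low) + sse(high)
        if cost < best_cost:
            best_threshold, best_cost = candidate, cost
    return best_threshold


scores = [0.0, 0.796657624451305, 0.38997287335391506, 0.9397940008672038]
threshold = best_split_threshold(scores)
labels = [int(s > threshold) for s in scores]  # 1 for high, 0 for low
print(threshold, labels)
```

On these four scores the split falls between the second and third smallest values, so two questions are labelled low and two high, which differs from the 3-to-1 split produced by a hand-set threshold of 0.2.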

In [27]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Convert lists to numpy arrays
features = np.array(features)
targets = np.array(targets)

threshold = 0.2  # Example threshold. The paper gives notation for computing one, but it still seems to rest on intuition (\upgamma does not appear to be defined).

categorized_scores = (targets > threshold).astype(int)  # 1 for high, 0 for low

# Check the shapes of features and targets
print(f'Features shape: {features.shape}')  # Should be (20, 4096)
print(f'Targets shape: {targets.shape}')  # Should be (20,)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, categorized_scores, test_size=0.2, random_state=42)

# Train the logistic regression model with L2 regularization
log_reg = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=10000)
log_reg.fit(X_train, y_train)

# Evaluate the model
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')



Features shape: (20, 4096)
Targets shape: (20,)
Accuracy: 1.0
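
The five rows that come from one question share a label, so a random row-level split can place the same question in both the training and the test set, which makes the reported accuracy optimistic. A minimal sketch of a question-level split follows. `group_train_test_split` is a hypothetical helper written for this example; scikit-learn's `GroupShuffleSplit` is a library alternative.

```python
import random


def group_train_test_split(groups, test_fraction=0.25, seed=42):
    """Split row indices so that every group lands entirely in train or in test."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    # Hold out at least one whole group.
    n_test = max(1, round(len(unique_groups) * test_fraction))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Four questions with five layer-wise rows each, mirroring the 20 rows above.
groups = [question_id for question_id in range(4) for _ in range(5)]
train_idx, test_idx = group_train_test_split(groups)

train_groups = {groups[i] for i in train_idx}
held_out_groups = {groups[i] for i in test_idx}
print(len(train_idx), len(test_idx))
print(train_groups & held_out_groups)  # empty set: no question appears on both sides
```

With only four questions this leaves a single held-out question, so the resulting score is still a rough indication at best.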
