## Why this notebook

This notebook explains how to run inference with an LLM and how to evaluate its outputs.

Most of the methods shown here come from the blog post below. Please read it if you are interested in how Google, HELM, or Hugging Face evaluate their LLMs (it is not done by regexing through the generated outputs).

https://github.com/huggingface/blog/blob/main/evaluating-mmlu-leaderboard.md

All the figures are from the blog as well. To map it to what I do here: this notebook follows the HELM way and the AI Harness way.

If there are any mistakes, please let me know!

In [1]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
# I chose an instruction-tuned Mistral here because it follows instructions much better than the Llamas
tokenizer = AutoTokenizer.from_pretrained("ehartford/dolphin-2.1-mistral-7b")
model = AutoModelForCausalLM.from_pretrained("ehartford/dolphin-2.1-mistral-7b") #teknium/CollectiveCognition-v1.1-Mistral-7B

  from .autonotebook import tqdm as notebook_tqdm
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████| 2/2 [01:38<00:00, 49.02s/it]


First, I am going to show you how to evaluate as in the figure below, where you basically measure how likely each of A, B, C, D is to be produced as the first token right after your prompt.
![Alt text](image.png)

In [24]:
import torch
# Encode the prompt
prompt = "i love this house A: positive B: negative \n"
text = ""
inputs = tokenizer(prompt+text, return_tensors="pt")

# Generate response with scores
generate_args = {
    "max_length": inputs.input_ids.size(1) + 2,  # Original length plus two new tokens
    "output_scores": True,
    "return_dict_in_generate": True
}

# Generate two tokens; only the scores of the first one are used below
generate_output = model.generate(**inputs, **generate_args)

# Retrieve the scores for the generated token
scores = generate_output.scores[0]  # scores for the first generated token

# Convert the scores to log-probabilities
log_probs = torch.nn.functional.log_softmax(scores, dim=-1)

# Get the generated token ID
generated_token_id = generate_output.sequences[0, inputs.input_ids.size(1)].item()  # First token after the input sequence

# Get the log-probability of the generated token
log_prob_of_generated_token = log_probs[0, generated_token_id].item()

# Decode the generated token ID to the token string
generated_token = tokenizer.decode(generated_token_id)

print(f"The first generated token is: '{generated_token}' with a log probability of: {log_prob_of_generated_token}")
print('=====')
# Retrieve the full sequence of generated token IDs (including the prompt)
generated_sequence_ids = generate_output.sequences[0]

# Decode the entire generated sequence to a string
print(tokenizer.decode(generated_sequence_ids, skip_special_tokens=True))

# Get token IDs for 'A' and 'B'
token_id_A = tokenizer.convert_tokens_to_ids('A')
token_id_B = tokenizer.convert_tokens_to_ids('B')

# Get the log-probabilities of 'A' and 'B'
log_prob_of_A = log_probs[0, token_id_A].item()
log_prob_of_B = log_probs[0, token_id_B].item()


print(f"The log probability of 'A' as the first generated token is: {log_prob_of_A}")
print(f"The log probability of 'B' as the first generated token is: {log_prob_of_B}")

Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


The first generated token is: '
' with a log probability of: -0.04539313539862633
=====
i love this house A: positive B: negative 

I
The log probability of 'A' as the first generated token is: -7.283852577209473
The log probability of 'B' as the first generated token is: -7.645366668701172
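
Since the newline token takes almost all of the probability mass here, the log-probabilities of 'A' and 'B' are both small, and only their relative size tells us anything. As a rough sketch (not part of the original run), one way to read them is to renormalize over the answer options only. The helper name `renormalize_over_options` is made up for illustration; the two numbers are the log-probabilities printed above.

```python
import math

def renormalize_over_options(option_log_probs):
    """Turn per-option log-probabilities into a distribution over the options only."""
    # Subtract the max before exponentiating for numerical stability
    max_lp = max(option_log_probs.values())
    unnormalized = {k: math.exp(lp - max_lp) for k, lp in option_log_probs.items()}
    total = sum(unnormalized.values())
    return {k: v / total for k, v in unnormalized.items()}

# Log-probabilities of 'A' and 'B' as printed in the cell output above
option_log_probs = {"A": -7.283852577209473, "B": -7.645366668701172}
relative_probs = renormalize_over_options(option_log_probs)

# The predicted option is the one with the largest share of the mass
prediction = max(relative_probs, key=relative_probs.get)
print(relative_probs, prediction)
```

With these numbers 'A' ends up with a bit under 60% of the mass shared by the two options, so the preference for 'A' is real but not strong.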


Now the same thing again; the only small tweak is that we also add in the lower-case a, b, c, d.

In [2]:
# Encode the prompt
prompt = "i love this house A: positive B: negative \n"
text = ""
inputs = tokenizer(prompt+text, return_tensors="pt")

# Generate response with scores
generate_args = {
    "max_length": inputs.input_ids.size(1) + 2,  # Original length plus two new tokens
    "output_scores": True,
    "return_dict_in_generate": True
}

# Generate two tokens; only the scores of the first one are used below
generate_output = model.generate(**inputs, **generate_args)

# Retrieve the scores for the generated token
scores = generate_output.scores[0]  # scores for the first generated token

# Convert the scores to log-probabilities
log_probs = torch.nn.functional.log_softmax(scores, dim=-1)

# Get the generated token ID
generated_token_id = generate_output.sequences[0, inputs.input_ids.size(1)].item()  # First token after the input sequence

# Get the log-probability of the generated token
log_prob_of_generated_token = log_probs[0, generated_token_id].item()

# Decode the generated token ID to the token string
generated_token = tokenizer.decode(generated_token_id)

print(f"The first generated token is: '{generated_token}' with a log probability of: {log_prob_of_generated_token}")
print('=====')
# Retrieve the full sequence of generated token IDs (including the prompt)
generated_sequence_ids = generate_output.sequences[0]

# Decode the entire generated sequence to a string
print(tokenizer.decode(generated_sequence_ids, skip_special_tokens=True))

# Get token IDs for 'A' and 'B'
token_id_A = tokenizer.convert_tokens_to_ids('A')
token_id_B = tokenizer.convert_tokens_to_ids('B')

token_id_a = tokenizer.convert_tokens_to_ids('a')
token_id_b = tokenizer.convert_tokens_to_ids('b')

# Get the log-probabilities of 'A' and 'B'
log_prob_of_A = log_probs[0, token_id_A].item()
log_prob_of_B = log_probs[0, token_id_B].item()

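# Get the log-probabilities of 'a' and 'b'
log_prob_of_a = log_probs[0, token_id_a].item()
log_prob_of_b = log_probs[0, token_id_b].item()

# NOTE: adding log-probabilities multiplies the probabilities (P('A') * P('a'));
# to add the probabilities themselves, use a log-sum-exp such as torch.logaddexp
log_prob_of_A = log_prob_of_A+log_prob_of_a
log_prob_of_B = log_prob_of_B+log_prob_of_b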

print(f"The log probability of 'A' as the first generated token is: {log_prob_of_A}")
print(f"The log probability of 'B' as the first generated token is: {log_prob_of_B}")

Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


The first generated token is: '
' with a log probability of: -0.04539313539862633
=====
i love this house A: positive B: negative 

I
The log probability of 'A' as the first generated token is: -14.71200180053711
The log probability of 'B' as the first generated token is: -16.32606792449951
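
One caveat about the tweak above: adding two log-probabilities multiplies the underlying probabilities, so the printed values are the log of P('A') × P('a') rather than of P('A') + P('a'). If the goal is to pool the probability mass of the upper- and lower-case letters, log-sum-exp is the usual tool (`torch.logaddexp` does this on tensors). A minimal sketch in plain Python, with made-up probabilities 0.6 and 0.2 chosen only so the result is easy to check by hand:

```python
import math

def log_add_exp(log_p, log_q):
    """Return log(exp(log_p) + exp(log_q)) in a numerically stable way."""
    hi, lo = max(log_p, log_q), min(log_p, log_q)
    return hi + math.log1p(math.exp(lo - hi))

# Hypothetical values for illustration only (not model outputs)
log_prob_upper = math.log(0.6)  # stands in for log P('A')
log_prob_lower = math.log(0.2)  # stands in for log P('a')

pooled = log_add_exp(log_prob_upper, log_prob_lower)  # log(0.6 + 0.2)
product = log_prob_upper + log_prob_lower              # log(0.6 * 0.2)

print(math.exp(pooled))   # close to 0.8
print(math.exp(product))  # close to 0.12
```

The ranking between options can differ between the two, so it is worth being explicit about which one is being reported.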


A small demo to see whether instruction helps:

In [3]:
# Encode the prompt
prompt = "i love this house A: positive B: negative \nOutput only A or B please:\n"
text = ""
inputs = tokenizer(prompt+text, return_tensors="pt")

# Generate response with scores
generate_args = {
    "max_length": inputs.input_ids.size(1) + 2,  # Original length plus two new tokens
    "output_scores": True,
    "return_dict_in_generate": True
}

# Generate two tokens; only the scores of the first one are used below
generate_output = model.generate(**inputs, **generate_args)

# Retrieve the scores for the generated token
scores = generate_output.scores[0]  # scores for the first generated token

# Convert the scores to log-probabilities
log_probs = torch.nn.functional.log_softmax(scores, dim=-1)

# Get the generated token ID
generated_token_id = generate_output.sequences[0, inputs.input_ids.size(1)].item()  # First token after the input sequence

# Get the log-probability of the generated token
log_prob_of_generated_token = log_probs[0, generated_token_id].item()

# Decode the generated token ID to the token string
generated_token = tokenizer.decode(generated_token_id)

print(f"The first generated token is: '{generated_token}' with a log probability of: {log_prob_of_generated_token}")
print('=====')
# Retrieve the full sequence of generated token IDs (including the prompt)
generated_sequence_ids = generate_output.sequences[0]

# Decode the entire generated sequence to a string
print(tokenizer.decode(generated_sequence_ids, skip_special_tokens=True))

# Get token IDs for 'A' and 'B'
token_id_A = tokenizer.convert_tokens_to_ids('A')
token_id_B = tokenizer.convert_tokens_to_ids('B')

token_id_a = tokenizer.convert_tokens_to_ids('a')
token_id_b = tokenizer.convert_tokens_to_ids('b')

# Get the log-probabilities of 'A' and 'B'
log_prob_of_A = log_probs[0, token_id_A].item()
log_prob_of_B = log_probs[0, token_id_B].item()

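# Get the log-probabilities of 'a' and 'b'
log_prob_of_a = log_probs[0, token_id_a].item()
log_prob_of_b = log_probs[0, token_id_b].item()

# NOTE: adding log-probabilities multiplies the probabilities (P('A') * P('a'));
# to add the probabilities themselves, use a log-sum-exp such as torch.logaddexp
log_prob_of_A = log_prob_of_A+log_prob_of_a
log_prob_of_B = log_prob_of_B+log_prob_of_b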

print(f"The log probability of 'A' as the first generated token is: {log_prob_of_A}")
print(f"The log probability of 'B' as the first generated token is: {log_prob_of_B}")

Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


The first generated token is: '
' with a log probability of: -0.41807690262794495
=====
i love this house A: positive B: negative 
Output only A or B please:

A
The log probability of 'A' as the first generated token is: -7.696923136711121
The log probability of 'B' as the first generated token is: -11.69415807723999


And the probabilities and the answer do change as we requested :)

Here is an example of how to extract the final-layer hidden states of the last and second-to-last tokens and use them as embeddings.

In [27]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Encode the prompt
prompt = "i love this house A: positive B: negative \n"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate response with scores
generate_args = {
    "max_length": inputs.input_ids.size(1) + 2,  # Original length plus two new tokens
    "output_scores": True,
    "return_dict_in_generate": True
}

# Generate two new tokens after the prompt
generate_output = model.generate(**inputs, **generate_args)

# Retrieve the full sequence of generated token IDs (including the prompt)
generated_sequence_ids = generate_output.sequences[0]

# Run the model again with the generated sequence to get hidden states
outputs = model(generated_sequence_ids.unsqueeze(0), output_hidden_states=True)

# Retrieve the last hidden state
hidden_states = outputs.hidden_states
last_hidden_state = hidden_states[-1]

# Get the embeddings for the last and second-to-last tokens
# -1 is the last token, -2 is the second-to-last token in the sequence
last_token_embedding = last_hidden_state[0, -1, :]
second_last_token_embedding = last_hidden_state[0, -2, :]

print("Embedding for the last token:", last_token_embedding)
print("Embedding for the second-to-last token:", second_last_token_embedding)


Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


Embedding for the last token: tensor([ 0.9806, -2.3296,  3.0921,  ...,  6.5378,  0.3018,  3.7267],
       grad_fn=<SliceBackward0>)
Embedding for the second-to-last token: tensor([-2.2555,  2.6335,  5.9585,  ...,  4.3784,  6.0477,  1.9409],
       grad_fn=<SliceBackward0>)


Note here:

We are basically feeding all the token IDs (prompt + generated) back into the model,

and we are essentially asking how likely the model is to generate these tokens.
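
To make "how likely the model is to generate these tokens" concrete: in a forward pass over the full sequence, the output at position t is a distribution over the token at position t+1, so the sequence log-likelihood is the sum of the log-probabilities assigned to each actual next token. The sketch below uses a tiny hand-written bigram table as a stand-in for the model. `toy_next_token_probs` and its numbers are invented for illustration, and unlike a real LLM it conditions only on the previous token instead of the whole prefix.

```python
import math

# Hypothetical next-token distributions, keyed by the previous token
toy_next_token_probs = {
    "<s>": {"good": 0.5, "bad": 0.5},
    "good": {"A": 0.8, "B": 0.2},
    "bad": {"A": 0.1, "B": 0.9},
}

def sequence_log_likelihood(tokens):
    """Sum log P(token[t+1] | token[t]) over the sequence (teacher forcing)."""
    total = 0.0
    # Position t predicts position t + 1, so the first token is never scored
    for prev_token, next_token in zip(tokens[:-1], tokens[1:]):
        total += math.log(toy_next_token_probs[prev_token][next_token])
    return total

ll_a = sequence_log_likelihood(["<s>", "good", "A"])  # log(0.5 * 0.8)
ll_b = sequence_log_likelihood(["<s>", "good", "B"])  # log(0.5 * 0.2)
print(ll_a, ll_b)
```

Comparing `ll_a` and `ll_b` is the toy version of comparing the two candidate sequences with the real model.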

In [28]:
generate_output

GreedySearchDecoderOnlyOutput(sequences=tensor([[    1,   613,  2016,   456,  2134,   330, 28747,  5278,   365, 28747,
          7087, 28705,    13,    13, 28737]]), scores=(tensor([[-12.0531, -11.6898,  -0.6606,  ...,  -4.7127,   4.2573, -12.7862]]), tensor([[-11.3678, -11.5735,  -0.9566,  ...,  -3.5125,   3.9774, -10.2945]])), attentions=None, hidden_states=None)

In [29]:
tokenizer.decode(generate_output.sequences[0], skip_special_tokens=True)

'i love this house A: positive B: negative \n\nI'

That brings us to the last method: append each candidate answer directly to the prompt and evaluate the likelihood of the whole sequence.

![Alt text](image-1.png)

In [39]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# # Initialize your model and tokenizer
# model = AutoModelForCausalLM.from_pretrained('gpt2')  # Replace with your model
# tokenizer = AutoTokenizer.from_pretrained('gpt2')  # Replace with your tokenizer

# Text sequences
text_seq1 = 'i love this house A: positive B: negative \nA'
text_seq2 = 'i love this house A: positive B: negative \nB'

sequences = [text_seq1, text_seq2]

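def calculate_log_likelihood(model, tokenizer, text):
    """Return the total log-likelihood of `text` (prompt included) under the model."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    # outputs.loss is the mean negative log-likelihood over the predicted tokens
    # (every token except the first), so rescale it to a total log-likelihood
    num_predicted_tokens = inputs["input_ids"].size(1) - 1
    return -outputs.loss.item() * num_predicted_tokens

# Calculate log likelihood for each sequence
log_likelihoods = [calculate_log_likelihood(model, tokenizer, seq) for seq in sequences]

# Determine which sequence has the highest likelihood
most_likely_sequence_index = log_likelihoods.index(max(log_likelihoods))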
most_likely_sequence = sequences[most_likely_sequence_index]

print(f"The most likely sequence is sequence number {most_likely_sequence_index + 1}")


The most likely sequence is sequence number 1
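
A note on this approach: the function above scores the whole sequence, prompt included, which is fine here because both candidates share the same prompt and each answer is a single letter. When the candidate answers have different lengths, the total log-likelihood tends to favour the shorter ones, so it can help to score only the answer tokens and optionally divide by their count. The sketch below assumes we already have per-token log-probabilities for each answer; the numbers and the helper name `score_continuation` are hypothetical.

```python
def score_continuation(token_log_probs, length_normalize=False):
    """Score an answer from the log-probabilities of its own tokens."""
    total = sum(token_log_probs)
    if length_normalize:
        # Average per-token log-probability, so long answers are not penalized
        return total / len(token_log_probs)
    return total

# Hypothetical per-token log-probabilities of two candidate answers
candidates = {
    "short answer": [-2.0, -2.0],
    "longer answer": [-1.0, -1.0, -1.0, -1.0, -1.0],
}

best_by_sum = max(candidates, key=lambda k: score_continuation(candidates[k]))
best_by_mean = max(
    candidates,
    key=lambda k: score_continuation(candidates[k], length_normalize=True),
)
print(best_by_sum, best_by_mean)
```

In this made-up case the two scoring rules pick different answers, which is one way the same model can get different scores under different evaluation setups.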


So overall, some notes for Shan (myself) and others:

1. Evaluating the generation probability of a sequence != that sequence being the most probable one the model will generate

    1.1 From my limited experience, I do see many models that do not follow instructions or do not generate the right thing even though they rank high on the LLM leaderboard
2. Notice the difference in performance here (source: https://github.com/huggingface/blog/blob/main/evaluating-mmlu-leaderboard.md)
    2.1 At the end of the day, do we need a powerful LM as a resolver to actually evaluate them?


<div>
<table><p>
  <tbody>
 <tr style="text-align: left;">
  <td>Original implementation</td>
  <td>HELM</td>
  <td>AI Harness (as of Jan 2023)</td>
 </tr>
  <tr style=" vertical-align: top;">
    <td> We compare the probabilities of the following letter answers:
</td>
    <td>The model is expected to generate as text the following letter answer:
</td>
    <td>We compare the probabilities of the following full answers:
</td>
  </tr>
  <tr style=" vertical-align: top;">
    <td>  A <br>
 B <br>
 C <br>
 D
</td>
    <td>A
</td>
    <td> A. It damaged support for the US model of political economy and capitalism <br>
 B. It created anger at the United States for exaggerating the crisis <br>
 C. It increased support for American global leadership under President Obama <br>
 D. It reduced global use of the US dollar
</td>
  </tr>
  </tbody>
</table><p>
</div>

|                                           | MMLU (HELM) | MMLU (Harness) | MMLU (Original) |
|:------------------------------------------|------------:|---------------:|----------------:|
| llama-65b                     |       **0.637** |          0.488 |           **0.636** |
| tiiuae/falcon-40b                         |       0.571 |          **0.527** |           0.558 |
| llama-30b                     |       0.583 |          0.457 |           0.584 |
| EleutherAI/gpt-neox-20b                   |       0.256 |          0.333 |           0.262 |
| llama-13b                     |       0.471 |          0.377 |           0.47  |
| llama-7b                      |       0.339 |          0.342 |           0.351 |
| tiiuae/falcon-7b                          |       0.278 |          0.35  |           0.254 |
| togethercomputer/RedPajama-INCITE-7B-Base |       0.275 |          0.34  |           0.269 |