# Initial tests

Testing the sample code from Perplexity:

The dependencies:

* Ended up with `uv pip install torch==2.1.0`, because `uv add torch` failed with: ``error: Distribution `torch==2.6.0 @ registry+https://pypi.org/simple` can't be installed because it doesn't have a source distribution or wheel for the current platform``
* Got a numpy vs numba warning, indicating I should downgrade numpy. I did not oblige.

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

Download the model and tokenizer. This will take a while (it took around 2.3 min on 1GB/s fiber)...

In [4]:
# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

Downloading shards: 100%|██████████| 2/2 [02:13<00:00, 66.78s/it] 
Loading checkpoint shards: 100%|██████████| 2/2 [00:19<00:00,  9.96s/it]


So, basically, we are going to tokenize our inputs, then plug that into the model, get tokens back as output, and convert that back into words (I think).

Step one, tokenizing the input:

In [13]:
# Reuse the end-of-sequence token as the padding token
tokenizer.pad_token = tokenizer.eos_token
# Input text
input_text = "The capital of France is"
input_ids = tokenizer(
    input_text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    return_attention_mask=True,
).input_ids
input_ids

tensor([[ 464, 3139,  286, 4881,  318]])

Such pretty numbers.
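To make those ids a little less opaque, here is a toy round trip in plain Python. The id-to-text mapping is an assumption: it is inferred from the five ids lining up with the five words of the prompt (with leading spaces attached to each word), not looked up in the real vocabulary. `toy_vocab`, `toy_encode` and `toy_decode` are made-up names, and the real tokenizer is considerably more involved.

```python
# Hypothetical mini-vocabulary: ids copied from the output above, text pieces
# assumed from the word order of the prompt (not read from the tokenizer).
toy_vocab = {464: "The", 3139: " capital", 286: " of", 4881: " France", 318: " is"}


def toy_decode(ids):
    """Concatenate the text piece of each id, which is the gist of decoding."""
    return "".join(toy_vocab[i] for i in ids)


def toy_encode(text):
    """Greedy longest-match lookup against the toy vocabulary."""
    pieces = sorted(toy_vocab.items(), key=lambda kv: -len(kv[1]))
    ids = []
    while text:
        for token_id, piece in pieces:
            if text.startswith(piece):
                ids.append(token_id)
                text = text[len(piece):]
                break
        else:
            raise ValueError(f"no toy token matches {text!r}")
    return ids


ids = toy_encode("The capital of France is")
print(ids)
print(toy_decode(ids))
```

The point is only that encoding and decoding are lookups in opposite directions, so the model itself never sees text, just the ids.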

Next we set the parameters that affect the stochastic nature of LLMs:

In [14]:
gen_kwargs = {
    "max_new_tokens": 20,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "repetition_penalty": 1.2,
    "no_repeat_ngram_size": 2,
    "return_dict_in_generate": True,
    "output_scores": True,
    "num_return_sequences": 1,
    "pad_token_id": tokenizer.eos_token_id,
}
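A rough sketch of what `temperature`, `top_k` and `top_p` do to the next-token distribution, in plain Python. This is a simplified stand-in and not the library's code: the function name `filtered_probs` and the four example logits are made up, and the real pipeline also applies `repetition_penalty` and `no_repeat_ngram_size`, which are left out here.

```python
import math


def filtered_probs(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Toy temperature scaling followed by top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]

    # top-k: keep only the k highest-scoring token indices.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    keep = order[:top_k]

    # Softmax over the survivors (shifted by the max for numerical stability).
    peak = max(scaled[i] for i in keep)
    exps = {i: math.exp(scaled[i] - peak) for i in keep}
    total = sum(exps.values())
    probs = {i: e / total for i, e in exps.items()}

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for i in keep:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise so the remaining probabilities sum to 1.
    total = sum(probs[i] for i in nucleus)
    return {i: probs[i] / total for i in nucleus}


result = filtered_probs([5.0, 4.0, 1.0, 0.0])  # made-up logits
greedy_like = filtered_probs([5.0, 4.0, 1.0, 0.0], top_k=1)
print(result)
print(greedy_like)
```

With these example logits only the first two tokens survive the filtering, and with `top_k=1` a single token is left holding all the probability mass.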

Seeding the random number generator should make the sampled results reproducible. Should...

In [15]:
torch.manual_seed(42)

<torch._C.Generator at 0x112319630>
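What the seed buys us, sketched with Python's built-in `random` module as a stand-in for torch's generator: the same seed yields the same stream of pseudo-random draws, so the same "random" choices come out every time. The token list, the weights and the name `sample_tokens` are invented for the illustration. With torch, results may still differ across hardware or library versions, so treat this as the idea rather than a guarantee.

```python
import random


def sample_tokens(seed, n=5):
    """Draw n tokens by weighted sampling from a private, seeded generator."""
    rng = random.Random(seed)  # private generator, global state is untouched
    tokens = ["a", "b", "c"]   # made-up vocabulary
    weights = [0.6, 0.3, 0.1]  # made-up probabilities
    return rng.choices(tokens, weights=weights, k=n)


first = sample_tokens(42)
second = sample_tokens(42)
print(first)
print(first == second)  # same seed, same draws
```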

We can now generate text:

In [16]:
with torch.no_grad():
    outputs = model.generate(input_ids, **gen_kwargs)

We can get back the tokens and their scores:

In [23]:
generated_tokens = outputs.sequences[0, input_ids.shape[-1]:]
scores = torch.stack(outputs.scores)
print(generated_tokens)
print(scores)

tensor([ 6342,    13,   198, 26410,    25,   383, 23227,    82,   287,   262,
         6827,   389,   366, 28572,  1600,   366, 27544,     1,   290,   366])
tensor([[[   -inf,    -inf,    -inf,  ...,    -inf,    -inf,    -inf]],

        [[   -inf,    -inf,    -inf,  ...,    -inf,    -inf,    -inf]],

        [[   -inf,    -inf,    -inf,  ...,    -inf,    -inf,    -inf]],

        ...,

        [[   -inf, 34.7760,    -inf,  ...,    -inf,    -inf,    -inf]],

        [[   -inf,    -inf,    -inf,  ...,    -inf,    -inf,    -inf]],

        [[   -inf,    -inf,    -inf,  ...,    -inf,    -inf,    -inf]]])


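Mostly `-inf`, with the occasional finite score. The `-inf` entries are tokens that were filtered out before sampling (by `top_k` and `top_p`, among others), so they end up with a probability of zero.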
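A small plain-Python check of how softmax treats `-inf`. The value `34.7760` is taken from the output above; the four-entry rows are shortened, made-up stand-ins for a full vocabulary-sized row.

```python
import math


def softmax(scores):
    """Softmax over a list of scores that contains at least one finite value."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


ninf = float("-inf")
single_survivor = softmax([ninf, 34.7760, ninf, ninf])
two_survivors = softmax([ninf, 2.0, 1.0, ninf])
print(single_survivor)
print(two_survivors)
```

A row with a single finite score puts all of the probability on that one token, whatever the score's actual value is.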

In [29]:
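# scores has shape [num_tokens, 1, vocab_size]
log_probs = torch.log_softmax(scores, dim=-1)
# One index per step, shaped [num_tokens, 1, 1] so it lines up with log_probs
token_indices = generated_tokens.reshape(-1, 1, 1)
# Pick the log-probability of the token that was actually generated at each step
token_log_probs = log_probs.gather(-1, token_indices).squeeze(-1)
token_probs = torch.exp(token_log_probs)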

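The probabilities of the final output. Note that they are computed from the filtered scores, so `prob=1.0000` means the token was the only candidate left after filtering, not that the raw model was 100% certain: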

In [40]:
# Print results
print(f"Generated text:\n\n ```\n{tokenizer.decode(generated_tokens)}\n```")
print("\nPer-token probabilities:")
for token, log_prob, prob in zip(generated_tokens, token_log_probs, token_probs):
    token_text = tokenizer.decode([token])
    print(f"`{token_text}`\t\t: prob={prob.item():.4f} \t\t (log_prob={log_prob.item():.4f})")

Generated text:

 ```
 Paris.
Output: The nouns in the sentence are "France", "capital" and "
```

Per-token probabilities:
` Paris`		: prob=1.0000 		 (log_prob=0.0000)
`.`		: prob=0.6372 		 (log_prob=-0.4507)
`
`		: prob=1.0000 		 (log_prob=0.0000)
`Output`		: prob=0.0987 		 (log_prob=-2.3158)
`:`		: prob=1.0000 		 (log_prob=0.0000)
` The`		: prob=0.6562 		 (log_prob=-0.4213)
` noun`		: prob=0.0128 		 (log_prob=-4.3564)
`s`		: prob=0.5722 		 (log_prob=-0.5583)
` in`		: prob=0.4682 		 (log_prob=-0.7589)
` the`		: prob=1.0000 		 (log_prob=0.0000)
` sentence`		: prob=1.0000 		 (log_prob=0.0000)
` are`		: prob=1.0000 		 (log_prob=0.0000)
` "`		: prob=1.0000 		 (log_prob=0.0000)
`France`		: prob=1.0000 		 (log_prob=0.0000)
`",`		: prob=0.4187 		 (log_prob=-0.8705)
` "`		: prob=0.7697 		 (log_prob=-0.2617)
`capital`		: prob=0.5945 		 (log_prob=-0.5201)
`"`		: prob=1.0000 		 (log_prob=0.0000)
` and`		: prob=1.0000 		 (log_prob=0.0000)
` "`		: prob=1.0000 		 (log_prob=0.0000)


The bit that the model tacks on after `Paris.`, namely `Output: The nouns in the sentence are "France", "capital" and "`, is strange.
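Going back to the per-token numbers for a moment: the two columns are related by `prob = exp(log_prob)`, and summing the log-probabilities gives the log-probability of the whole continuation. A quick sketch using the values copied from the printout above (so everything is relative to the filtered distribution, not the raw model):

```python
import math

# Per-token log-probabilities copied from the printout above, in order.
log_probs = [
    0.0, -0.4507, 0.0, -2.3158, 0.0, -0.4213, -4.3564, -0.5583, -0.7589, 0.0,
    0.0, 0.0, 0.0, 0.0, -0.8705, -0.2617, -0.5201, 0.0, 0.0, 0.0,
]

# exp(log_prob) recovers the probability column, here for the `.` token.
dot_prob = math.exp(-0.4507)
print(f"prob of '.': {dot_prob:.4f}")

# Multiplying probabilities is the same as summing log-probabilities.
sequence_log_prob = sum(log_probs)
sequence_prob = math.exp(sequence_log_prob)
print(f"log-prob of the whole continuation: {sequence_log_prob:.4f}")
print(f"probability of the whole continuation: {sequence_prob:.2e}")
```

Almost all of the improbability comes from just two steps, `Output` and ` noun`.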

Now to get the top 5 at each step, which is much more interesting:

In [45]:
# Get scores from output
scores = torch.stack(outputs.scores)  # Shape: [num_tokens, 1, vocab_size]

# Convert to probabilities
probs = torch.softmax(scores, dim=-1)

# Get top 5 probabilities and indices for each step
top_k = 5
top_probs, top_indices = torch.topk(probs, k=top_k, dim=-1)

# Print results for each generation step
print(f"Generated text:\n\n ```\n{tokenizer.decode(generated_tokens)}\n```")
print("\nTop 5 probable tokens at each step:")
for step, (token, step_probs, step_indices) in enumerate(zip(generated_tokens, top_probs, top_indices)):
    actual_token = tokenizer.decode([token])
    print(f"\nStep {step}: Actually chosen token: '{actual_token}'")
    print("Top 5 candidates:")
    for prob, idx in zip(step_probs[0], step_indices[0]):
        token_text = tokenizer.decode([idx])
        print(f"\t `{token_text}`: \t {prob.item():.4f}")

Generated text:

 ```
 Paris.
Output: The nouns in the sentence are "France", "capital" and "
```

Top 5 probable tokens at each step:

Step 0: Actually chosen token: ' Paris'
Top 5 candidates:
	 ` Paris`: 	 1.0000
	 `#`: 	 0.0000
	 `%`: 	 0.0000
	 `!`: 	 0.0000
	 `"`: 	 0.0000

Step 1: Actually chosen token: '.'
Top 5 candidates:
	 `.`: 	 0.6372
	 `."`: 	 0.3186
	 `,`: 	 0.0442
	 `#`: 	 0.0000
	 `%`: 	 0.0000

Step 2: Actually chosen token: '
'
Top 5 candidates:
	 `
`: 	 1.0000
	 `#`: 	 0.0000
	 `%`: 	 0.0000
	 `!`: 	 0.0000
	 `"`: 	 0.0000

Step 3: Actually chosen token: 'Output'
Top 5 candidates:
	 `<|endoftext|>`: 	 0.5940
	 `    `: 	 0.2170
	 `Output`: 	 0.0987
	 `Answer`: 	 0.0516
	 `##`: 	 0.0386

Step 4: Actually chosen token: ':'
Top 5 candidates:
	 `:`: 	 1.0000
	 `#`: 	 0.0000
	 `%`: 	 0.0000
	 `!`: 	 0.0000
	 `"`: 	 0.0000

Step 5: Actually chosen token: ' The'
Top 5 candidates:
	 ` The`: 	 0.6562
	 ` Fact`: 	 0.1660
	 ` This`: 	 0.0538
	 ` True`: 	 0.0362
	 ` What`: 	 0.03

Inspecting Step 3 is fun now:

```
Step 3: Actually chosen token: 'Output'
Top 5 candidates:
	 `<|endoftext|>`: 	 0.5940
	 `    `: 	 0.2170
	 `Output`: 	 0.0987
	 `Answer`: 	 0.0516
	 `##`: 	 0.0386
```

The highest-probability token ($p=0.5940$) was `<|endoftext|>`, which would have stopped the generation. The result would then have been just `Paris.`, which is correct. Because sampling is enabled, the model instead picked the lower-probability token `Output` ($p=0.0987$). "Stochastic" is just a fancy way of saying that the next token is drawn by weighted sampling from the calculated probabilities, which in turn come from the neural network (after the temperature, `top_k` and `top_p` adjustments). So the higher the probability of a token, the higher the chance that it gets selected. If we were to repeat Step 3 1000 times, we would expect it to stop (i.e. `<|endoftext|>`) about 600 times, emit the whitespace token `    ` about 220 times, produce `Output` about 100 times, and so forth. The actual result that we saw is the roughly 10% outcome materializing.
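The "repeat Step 3 1000 times" thought experiment can be simulated directly, without the model, by sampling from the five probabilities in the printout. The seed is arbitrary and only there to make the simulation repeatable; the exact counts will vary with it.

```python
import random
from collections import Counter

# Top-5 candidates at Step 3, copied from the printout above.
candidates = ["<|endoftext|>", "    ", "Output", "Answer", "##"]
probs = [0.5940, 0.2170, 0.0987, 0.0516, 0.0386]

rng = random.Random(0)  # arbitrary seed
draws = rng.choices(candidates, weights=probs, k=1000)
counts = Counter(draws)

for token, count in counts.most_common():
    print(f"{token!r}: {count}")
```

The counts should land near 594, 217, 99, 52 and 39, with some run-to-run noise.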

## Conclusion

This is just a test example to get going, and it's already quite interesting. For more formal testing, we can wrap the above into functions, repeat the experiment, and capture and analyze the results. Next, we can adjust the input parameters to make it deterministic. Then we can tweak the inputs slightly, to see how it influences the results.
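A minimal sketch of what such a harness could look like. Everything here is hypothetical: `repeat_experiment` and `fake_generate` are made-up names, and `fake_generate` is a stand-in that samples from two canned continuations with invented weights. In the real experiment it would be replaced by a function that calls `torch.manual_seed(seed)`, runs `model.generate(...)` and decodes the result.

```python
import random
from collections import Counter


def repeat_experiment(generate_fn, n_runs, base_seed=0):
    """Call generate_fn(seed) once per run and tally the distinct outputs."""
    return Counter(generate_fn(base_seed + i) for i in range(n_runs))


def fake_generate(seed):
    """Stand-in for a model call: picks one of two canned continuations."""
    rng = random.Random(seed)
    options = [" Paris.\n", " Paris.\nOutput:"]  # invented examples
    return rng.choices(options, weights=[0.6, 0.1], k=1)[0]


tally = repeat_experiment(fake_generate, n_runs=200)
for text, count in tally.most_common():
    print(f"{count:4d}  {text!r}")
```

Passing the seed in explicitly keeps every run reproducible on its own, which makes it easier to go back and inspect an odd result.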

We can also run some stretch experiments, such as trying different models, or using more difficult prompts, meaning prompts on topics for which there is less training data.

Ultimately, this should allow us to get a better sense of the non-deterministic nature of LLMs and how to control it. LLMs have sharp edges.