In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Create a pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False,
)

In [None]:
# Prompt
messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]

# Generate the output
output = pipe(messages)
print(output[0]["generated_text"])

## Temperature Sampling in Action
To see temperature sampling in practice, let’s try generating a joke about chickens with different temperature settings.

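### Example with Temperature = 0 (Conservative Output)
At temperature 0 the model always picks the single most likely next token (greedy decoding), so the output is deterministic and tends toward a traditional, straightforward joke. In Transformers this is done by turning sampling off rather than by passing a temperature of 0.

In [None]:
# Temperature 0 corresponds to greedy decoding: turn sampling off.
# A temperature value is ignored when do_sample=False and may trigger a warning, so it is omitted.
output = pipe(messages, do_sample=False)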
print(output[0]["generated_text"])

### Example with Temperature = 1 (Creative Output)
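With sampling enabled and the temperature set to 1, the model draws each token from its unmodified probability distribution. Less likely tokens can now be chosen, so the joke is more varied and changes from run to run.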

In [None]:
# Sampling with temperature 1.0 (the model's unscaled distribution)
output = pipe(messages, do_sample=True, temperature=1.0)
print(output[0]["generated_text"])
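
To make the effect of temperature concrete, here is a minimal sketch in plain Python of how dividing the logits by a temperature reshapes the token probabilities. The helper `softmax_with_temperature` and the four logit values are hypothetical, chosen only for illustration; this is not the library's internal implementation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        # T = 0 is the greedy limit; it cannot be computed by division
        raise ValueError("temperature must be positive; use greedy decoding for T = 0")
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate tokens (illustrative values only)
logits = [2.0, 1.0, 0.5, 0.1]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the top token, while higher temperatures flatten the distribution so that less likely tokens are sampled more often.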

## Top-P (Nucleus) Sampling in Action
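With Top-P sampling, the model samples only from the smallest set of tokens whose cumulative probability reaches a threshold, 𝑝. Because the size of that set changes with how confident the model is at each step, this gives a dynamic balance between predictability and creativity.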

### Example with Low Top-P (P=0.5)
With a low P, the model only picks from highly probable words, resulting in a focused, straightforward response.

In [None]:
# Using a low top-p value (focused response)
output = pipe(messages, do_sample=True, top_p=0.5)
print(output[0]["generated_text"])

### Example with Higher Top-P (P=0.95)
A higher P allows the model to consider a broader range of options, making the response more varied. Note that P is a probability mass, so it must lie between 0 and 1; a value such as 2.0 is rejected.

In [None]:
# Using a high top-p value (more varied response); top_p must be in (0, 1]
output = pipe(messages, do_sample=True, top_p=0.95)
print(output[0]["generated_text"])
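
The nucleus selection described in this section can be sketched in a few lines of plain Python. The helper `top_p_filter` and the example probabilities are hypothetical and simplified for illustration; the library's own filtering differs in its details.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in the interval (0, 1]")
    # Token indices ordered from most to least probable
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Hypothetical next-token probabilities (illustrative values only)
probs = [0.5, 0.3, 0.15, 0.05]

print(top_p_filter(probs, 0.5))  # only the most likely token survives
print(top_p_filter(probs, 0.9))  # the three most likely tokens survive
```

With a threshold of 0.5 the first token alone already covers the required mass, while a threshold of 0.9 needs three tokens, which is why a higher P produces more varied text.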

## Top-K Sampling in Action
In top-k sampling, we limit the model’s options at each step to the K most likely tokens. A small K keeps responses focused and predictable; a larger K makes them more varied.

### Example with Low Top-K (K=5)
With a small K, the model chooses only from the five most probable next words, making the response more straightforward.

In [None]:
# Using a low top-k value (focused response)
output = pipe(messages, do_sample=True, top_k=5)
print(output[0]["generated_text"])

### Example with Higher Top-K (K=40)
A larger K value allows the model to consider more options, leading to a more varied and creative response.

In [None]:
# Using a high top-k value (more creative response)
output = pipe(messages, do_sample=True, top_k=40)
print(output[0]["generated_text"])
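
As a rough illustration of what top-k filtering does at a single generation step, the sketch below keeps the K most probable tokens and renormalizes them. The helper `top_k_filter` and the example probabilities are hypothetical; this is a simplified picture, not the library's implementation.

```python
def top_k_filter(probs, top_k):
    """Keep only the top_k most probable tokens and renormalize their probabilities."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    # Token indices ordered from most to least probable
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked[:top_k]  # slicing handles top_k larger than the vocabulary
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Hypothetical next-token probabilities (illustrative values only)
probs = [0.4, 0.25, 0.2, 0.1, 0.05]

print(top_k_filter(probs, 2))  # only the two most likely tokens remain
print(top_k_filter(probs, 4))  # four tokens remain, so sampling is more varied
```

Unlike Top-P, the number of candidates here is fixed at K regardless of how peaked or flat the distribution is.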