# Advanced Methods in Text Analytics
# Exercise 8: LLMs - Part 2
### Daniel Ruffinelli
## FSS 2025

* This notebook is designed so we can test basic functions of LLMs on a CPU using a regular laptop. For that reason, we stick to small models. But if you have better resources, feel free to modify this to use any model that is [available in HuggingFace](https://huggingface.co/models).  
* To run this code, you will need to install HuggingFace's [transformers](https://huggingface.co/docs/transformers/en/installation) and [PyTorch](https://pytorch.org/).
* You will also need to do the following three things:
1. Create a user name in HuggingFace.
2. Request access to the following models: [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) and [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct).
3. Create an access token, see [here](https://huggingface.co/docs/hub/security-tokens) for instructions. Your access token will be shown to you only once, so make sure you copy it somewhere safe, because you will need it to log in to HuggingFace via this code.

In [None]:
# HF login
from huggingface_hub import notebook_login
access_token = "<YOUR_HF_ACCESS_TOKEN>"  # placeholder: never store a real token in a shared notebook

# then run this and enter your token (requires ipywidgets) 
# alternatively, do it via CLI with huggingface-cli login
notebook_login()
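
Hard-coding a token in a notebook risks leaking it whenever the notebook is shared. A minimal sketch of one alternative, assuming the token was exported beforehand in an environment variable called `HF_TOKEN`; the helper name `read_hf_token` is hypothetical and not part of any library:

```python
import os

def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in the given environment variable, or None."""
    token = os.environ.get(env_var, "").strip()
    return token or None

token = read_hf_token()
if token is None:
    print("No token found; fall back to notebook_login() or the CLI.")
else:
    # Only reveal a short prefix so the secret never ends up in saved cell output.
    print(f"Found a token starting with {token[:3]}...")
    # It could then be passed on, e.g. huggingface_hub.login(token=token).
```

This keeps the secret out of the notebook file itself, so the notebook can be committed or shared without editing it first.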

## Making Predictions with LLMs

### Question (a)

In [1]:
# for convenience, we'll store models and their corresponding tokenizers in a 
# dict of the form {model_name: [model, tokenizer]}
from collections import defaultdict as ddict

models_dict = ddict(list)

In [2]:
# we set up some constants for convenience
DEVICE="cpu"
MODEL = 0
TOKENIZER = 1

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# see all available models in HF here: https://huggingface.co/models
# first time you load a model, it will be downloaded, which will take several
# minutes, but after that, it will be read from a local cache, so it will be 
# only a few seconds

# we load Llama-3.2-1B 
llama_name = "meta-llama/Llama-3.2-1B"
models_dict["llama"].append(
    AutoModelForCausalLM.from_pretrained(
        llama_name, 
        device_map=DEVICE, 
        torch_dtype=torch.bfloat16, 
    )
)
models_dict["llama"].append(
    AutoTokenizer.from_pretrained(
        llama_name, padding_side="left"
        )
)




In [4]:
# we load a model similar to GPT-3 made by EleutherAI
gpt3_name = "EleutherAI/gpt-neo-1.3B"
models_dict["gpt3"].append(
    AutoModelForCausalLM.from_pretrained(
        gpt3_name, 
        device_map=DEVICE, 
        torch_dtype=torch.bfloat16, 
    )
)
models_dict["gpt3"].append(
    AutoTokenizer.from_pretrained(
        gpt3_name, padding_side="left"
        )
)


In [5]:
# see loaded models
for model_name, model in models_dict.items():
    print(f"Model: {model_name}")
    print(f"Model: {model[MODEL]}")
    # print(f"Tokenizer: {model[TOKENIZER]}")
    print()


Model: llama
Model: LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-0
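
The layer shapes printed above are enough to estimate the size of the Llama model by hand. The sketch below adds up the weights of the embedding, the 16 decoder layers, and the final norm. It assumes that the output head reuses the embedding matrix (weight tying), which the truncated printout does not show, so treat the total as an estimate rather than an official figure. The smaller key/value projections (512 instead of 2048 output features) are consistent with grouped-query attention.

```python
# Shapes copied from the printed architecture above.
vocab_size, hidden, kv_dim, mlp_dim, num_layers = 128256, 2048, 512, 8192, 16

# q_proj and o_proj are hidden x hidden, k_proj and v_proj are hidden x kv_dim.
attention = 2 * hidden * hidden + 2 * hidden * kv_dim
# gate_proj, up_proj and down_proj each hold hidden * mlp_dim weights.
mlp = 3 * hidden * mlp_dim
# Two RMSNorm weight vectors per decoder layer.
norms = 2 * hidden
per_layer = attention + mlp + norms

embedding = vocab_size * hidden
final_norm = hidden
# Assumption: the output head shares the embedding weights, so it adds nothing.
total = embedding + num_layers * per_layer + final_norm

print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
```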

### Question (b)

In [6]:
# set model and tokenizer to use
model_name = "gpt3"
# model_name = "llama"
# model_name = "llama_instruct"
model = models_dict[model_name][MODEL]
tokenizer = models_dict[model_name][TOKENIZER]

In [7]:
# set toy prompt
prompt = """Hello!"""

# tokenize it 
tokenized_prompt = tokenizer(
    prompt, 
    return_tensors="pt"
).to(DEVICE)
print(tokenized_prompt)

{'input_ids': tensor([[15496,     0]]), 'attention_mask': tensor([[1, 1]])}


In [8]:
### SOLUTION
print(type(tokenized_prompt))
for key, value in tokenized_prompt.items():
    print(f"{key}:\n{value}")
    print(f"{key} type: {type(value)}")
    print(f"{key} size: {value.size()}")
    print()

<class 'transformers.tokenization_utils_base.BatchEncoding'>
input_ids:
tensor([[15496,     0]])
input_ids type: <class 'torch.Tensor'>
input_ids size: torch.Size([1, 2])

attention_mask:
tensor([[1, 1]])
attention_mask type: <class 'torch.Tensor'>
attention_mask size: torch.Size([1, 2])
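
With a single prompt, `attention_mask` is all ones. It only becomes interesting once several prompts of different lengths are batched together and the shorter ones are padded. The tokenizers above were created with `padding_side="left"`, and the following plain-Python sketch mimics what that setting means. The helper `pad_left`, the pad id `50256`, and the ids of the second sequence are made up for illustration; only `[15496, 0]` comes from the output above.

```python
def pad_left(sequences, pad_id):
    """Left-pad lists of token ids to a common length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        num_pad = max_len - len(seq)
        # Padding goes in front, so the real tokens stay right-aligned.
        input_ids.append([pad_id] * num_pad + list(seq))
        # 0 marks padding positions, 1 marks real tokens.
        attention_mask.append([0] * num_pad + [1] * len(seq))
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_left([[15496, 0], [40, 588, 262, 3290]], pad_id=50256)
for key, value in batch.items():
    print(key, value)
```

Left padding keeps the last position of every row a real token, which matters here because next-token prediction reads the logits at the last position.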



### Question (c)

In [9]:
# get model predictions
model_predictions = model(**tokenized_prompt)
print(model_predictions)

CausalLMOutputWithPast(loss=None, logits=tensor([[[  1.6875,  -3.2969,  -7.0312,  ..., -13.8125, -12.2500,  -4.6875],
         [ -7.3750,  -9.6250,  -9.0625,  ..., -17.6250, -15.9375,  -5.6562]]],
       dtype=torch.bfloat16, grad_fn=<UnsafeViewBackward0>), past_key_values=((tensor([[[[-0.2578,  0.1104,  0.2695,  ..., -0.0645,  0.3184,  0.3848],
          [-0.3594,  0.5938, -0.1689,  ...,  0.2520,  0.8398,  0.3164]],

         [[-0.1777,  0.1562, -0.5586,  ...,  0.3379, -0.0262,  0.0486],
          [-0.0776,  0.0996, -0.2500,  ..., -0.0349, -0.3047, -0.2324]],

         [[ 0.5352, -0.5859,  0.1748,  ..., -0.3926, -0.0332,  0.2441],
          [ 0.6445, -0.3965,  0.2617,  ..., -0.3320,  0.1611,  0.0972]],

         ...,

         [[-0.4121,  0.2178, -0.0913,  ..., -1.9141, -0.7852,  0.0349],
          [-0.2598, -0.1196, -0.2285,  ..., -1.7266, -0.6719, -0.2656]],

         [[-0.9844,  0.3320, -0.0281,  ..., -0.3965,  0.2695,  0.5625],
          [-1.1328,  0.2617,  0.4785,  ..., -0.1748, 

In [10]:
### SOLUTION
print(type(model_predictions))
for key, value in model_predictions.items():
    print(f"{key}")
    print(f"{key} type: {type(value)}")
    if hasattr(value, "size"):
        # torch tensors have a size() method, so print their shape
        print(f"{key} size: {value.size()}")
    else:
        # otherwise (e.g. the tuple of past key values), print the length
        print(f"{key} size: {len(value)}")
    print()

<class 'transformers.modeling_outputs.CausalLMOutputWithPast'>
logits
logits type: <class 'torch.Tensor'>
logits size: torch.Size([1, 2, 50257])

past_key_values
past_key_values type: <class 'tuple'>
past_key_values size: 24



### Question (d)

In [None]:
import torch.nn.functional as F

def get_top_k_tokens(prompt, model_tokenizer, k=10):
    """
    Returns top k tokens predicted by the given tuple of model-tokenizer and 
    given prompt.
    """

    # unpacking
    model = model_tokenizer[MODEL]
    tokenizer = model_tokenizer[TOKENIZER]

    # tokenize prompt
    tokenized_prompt = tokenizer(
        prompt, 
        return_tensors="pt"
    ).to(DEVICE)

    # forward pass
    model_predictions = model(**tokenized_prompt)

    # get top k tokens
    top_k_tokens = None
    
    ### WRITE YOUR CODE HERE ###

    # rank vocabulary entries by next-token probability, highest first
    _, sorted_indices = torch.sort(
        F.softmax(model_predictions["logits"][:, -1, :], dim=-1), 
        descending=True
        )
    # keep the k best token ids of the first (and only) sequence in the batch
    sorted_indices = sorted_indices[0][:k]

    # decode each of the top k token ids into its own string
    top_k_tokens = tokenizer.batch_decode(
        sorted_indices,
        skip_special_tokens=True
    )

    return top_k_tokens


In [12]:
# test your function
prompt = "Hello?"
top_10_tokens = get_top_k_tokens(prompt, models_dict["gpt3"], k=10)
print(top_10_tokens)

['\n', ' ', ' I', ' This', ' It', ' Is', ' Hello', ' You', ' My', ' Are']
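
To make the selection step concrete without loading a model, here is a small plain-Python sketch of the same idea: pick the logits at the last position, turn them into probabilities, and rank them. The logit values and the helper names `softmax` and `top_k_indices` are invented for illustration. Because softmax preserves the order of its inputs, ranking the raw logits gives the same top-k indices as ranking the probabilities.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(scores, k):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy logits with shape (batch=1, seq_len=2, vocab=5); the values are made up.
logits = [[[1.5, -3.0, 0.2, 4.0, -1.0],
           [-7.0, 2.5, 0.0, 1.0, 3.5]]]

# Same selection as logits[:, -1, :] for the single sequence in the batch.
last_position = logits[0][-1]
probs = softmax(last_position)

print(top_k_indices(last_position, 3))
print(top_k_indices(probs, 3))
```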


## Prompting

### Question (a)

In [177]:
# settings
model_name = "gpt3"
# model_name = "llama"

# ask basic questions
prompt = "What is the capital of France?"
# prompt = "What is the largest continent?"
# prompt = "I have 30 apples, I eat 2, give away 6 and store 5 for the winter. How many apples do I have left?"

# get top tokens
top_10_tokens = get_top_k_tokens(prompt, models_dict[model_name])

print("MODEL:", model_name)
print(f"PROMPT:\n{prompt}")
print("TOP 10 TOKENS:", top_10_tokens)

MODEL: gpt3
PROMPT:
What is the capital of France?
TOP 10 TOKENS: ['\n', ' The', ' What', ' It', ' How', ' Paris', ' A', ' Is', ' Where', ' I']


In [178]:
### SOLUTION
model_name = "gpt3"
# model_name = "llama"

prompt = "Germany:Berlin. Italy:Rome. France:"
# prompt = "Germany:Berlin. France:Paris. Italy:"
# prompt = "France:Paris. Italy:Rome. Germany:"

# get top tokens
top_10_tokens = get_top_k_tokens(prompt, models_dict[model_name])

print("MODEL:", model_name)
print(f"PROMPT:\n{prompt}")
print("TOP 10 TOKENS:", top_10_tokens)

MODEL: gpt3
PROMPT:
Germany:Berlin. Italy:Rome. France:
TOP 10 TOKENS: ['Paris', 'B', 'L', 'R', ' Paris', 'T', 'Mar', 'Le', 'N', 'Ber']


### Question (b)

In [None]:
import random

def get_demonstrations_world_capitals():
    """
    Task: World capitals.
    """

    questions = [
        "Portugal",
        "Germany",
        "Italy",
        "Spain",
        "Poland",
        "France",
    ]

    answers = [
        "Lisbon",
        "Berlin",
        "Rome",
        "Madrid",
        "Warsaw",
        "Paris",
    ]

    return questions, answers


In [186]:
def get_demonstrations_verb_declination():
    """
    Task: Verb conjugation in English (third-person singular).
    """

    questions = [
        "I go, he ",
        "I play, he ",
        "I eat, he ",
        "You swim, she ",
        "You sleep, she ",
        "You sing, she ",
        "We say, she ",
        "We study, she ",
        "We pay, she ",
    ]

    answers = [
        "goes",
        "plays",
        "eats",
        "swims",
        "sleeps",
        "sings",
        "says",
        "studies",
        "pays",
    ]

    return questions, answers


In [187]:
def get_demonstrations_ioi():
    """
    Task: indirect object identification.
    """

    questions = [
        "When Mary and John went to the store, John gave a drink to ", 
        "Alice and Bob were playing chess. Alice won the game against ",
        "Harry and Hermione were studying in the library. Harry passed the book to ",
    ]

    answers = [
        "Mary",
        "Bob",
        "Hermione",
    ]
    
    return questions, answers

In [188]:
def get_demonstrations_translate_to_french():
    """
    Task: translate to French.
    """

    questions = [
        "Car",
        "House",
        "Dog",
        "Cat", 
    ]

    answers = [
        "Voiture",
        "Maison",
        "Chien",
        "Chat",
    ]
    
    return questions, answers

In [189]:
def get_demonstrations_translate_to_german():
    """
    Task: translate to German.
    """

    questions = [
        "Car",
        "House",
        "Dog",
        "Cat",
    ]

    answers = [
        "Auto",
        "Haus",
        "Hund",
        "Katze",
    ]
    
    return questions, answers

In [190]:
def get_demonstrations_translate_to_spanish():
    """
    Task: translate to Spanish.
    """

    questions = [
        "Car",
        "House",
        "Dog",
        "Cat",
    ]

    answers = [
        "Automovil",
        "Casa",
        "Perro",
        "Gato",
    ]
    
    return questions, answers

In [217]:
def construct_icl_prompt(
        questions, 
        answers, 
        qa_template,
        instruction=None,
    ):
    """
    Constructs an in-context learning (ICL) prompt.

    Args:
        questions (list): List of questions, all but the last will be used
            as demonstrations in the prompt, whereas the last one will be
            the question to be answered.
        answers (list): corresponding answers to given set of questions.
        instruction (str): Instruction to be used in the prompt.
        qa_template (bool): If True, demonstrations are of the form 
                            Q: <question>. A: <answer>. If False, demonstrations 
                            are of the form <question>:<answer>. 
    Returns:
        tuple: (ICL prompt, expected answer to the final question).
    """

    prompt = ""
    if instruction is not None:
        prompt = instruction + "\n\n"
    if qa_template:
        for i, question in enumerate(questions[:-1]):
            prompt += f"Q: {question}\nA: {answers[i]}\n\n"
        prompt += f"Q: {questions[-1]}\nA:"
    else:
        for i, question in enumerate(questions[:-1]):
            prompt += f"{question}:{answers[i]}\n"
        prompt += f"{questions[-1]}:"

    return prompt, answers[-1]


In [218]:
# we use this function to sample and shuffle demonstrations
def sample_and_shuffle_demonstrations(questions, answers, num_demos):
    """
    Samples num_demos question-answer pairs without replacement and in random
    order. The input lists are left unmodified. Note that construct_icl_prompt
    uses the last sampled pair as the query, so num_demos=1 yields a prompt
    without any demonstrations.
    """
    if num_demos > len(questions):
        raise ValueError(
            f"Number of demonstrations ({num_demos}) is greater than the number of questions ({len(questions)})"
        )
    # random.sample draws distinct indices, already in random order
    sampled_indices = random.sample(range(len(questions)), num_demos)
    sampled_questions = [questions[i] for i in sampled_indices]
    sampled_answers = [answers[i] for i in sampled_indices]

    return sampled_questions, sampled_answers

In [221]:
# test ICL prompts
num_demos = 2
questions, answers = get_demonstrations_verb_declination()
questions, answers = sample_and_shuffle_demonstrations(
    questions, answers, num_demos
)
icl_prompt, answer = construct_icl_prompt(
    questions, 
    answers,
    qa_template=True 
)
print(icl_prompt)
print("\nANSWER:", answer)

Q: We study, she 
A: studies

Q: I play, he 
A:

ANSWER: plays
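
Since the demonstrations are sampled at random, rerunning the cell above usually produces a different prompt. When comparing models or templates it can help to fix the sample. The sketch below does this with a private random generator; the function `sample_pairs` and its `seed` argument are hypothetical additions, not part of the exercise code.

```python
import random

def sample_pairs(questions, answers, num_pairs, seed):
    """Sample question-answer pairs without replacement using a private RNG."""
    rng = random.Random(seed)
    indices = rng.sample(range(len(questions)), num_pairs)
    return [questions[i] for i in indices], [answers[i] for i in indices]

questions = ["I go, he ", "I play, he ", "We study, she "]
answers = ["goes", "plays", "studies"]

first = sample_pairs(questions, answers, num_pairs=2, seed=0)
second = sample_pairs(questions, answers, num_pairs=2, seed=0)

# The same seed gives the same demonstrations, in the same order.
print(first == second)
```

Using `random.Random(seed)` instead of `random.seed(seed)` keeps the global random state untouched, so other cells are not affected.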


### Question (c)

In [228]:
# settings
model_name = "gpt3"
num_demos = 1
qa_template=False
questions, answers = get_demonstrations_world_capitals()

# construct ICL prompt
questions, answers = sample_and_shuffle_demonstrations(
    questions, answers, num_demos
)
icl_prompt, answer = construct_icl_prompt(
    questions, 
    answers, 
    qa_template=qa_template,
)

# get model predictions
top_10_tokens = get_top_k_tokens(icl_prompt, models_dict[model_name])

# inspect results
print("MODEL:", model_name)
print(f"PROMPT:\n{icl_prompt}")
print("ANSWER:", answer)
print("TOP 10 TOKENS:", top_10_tokens)

MODEL: gpt3
PROMPT:
Italy:
ANSWER: Rome
TOP 10 TOKENS: [' The', ' the', ' A', '\n', ' a', ' �', " '", ' Un', ' "', ' ']


### Question (d)

In [243]:
# settings
# model_name = "gpt3"
model_name = "llama"
num_demos = 1
qa=False
instruction="Translate the following word to french."
questions, answers = get_demonstrations_world_capitals()

# construct ICL prompt
questions, answers = sample_and_shuffle_demonstrations(
    questions, answers, num_demos
)
icl_prompt, answer = construct_icl_prompt(
    questions, 
    answers, 
    qa_template=qa,
    instruction=instruction
)

# get model predictions
top_10_tokens = get_top_k_tokens(icl_prompt, models_dict[model_name])

# inspect results
print("MODEL:", model_name)
print(f"PROMPT:\n{icl_prompt}")
print("ANSWER:", answer)
print("TOP 10 TOKENS:", top_10_tokens)

MODEL: llama
PROMPT:
Translate the following word to french.

Portugal:
ANSWER: Lisbon
TOP 10 TOKENS: [' ', '\xa0', ' (', ' Portugal', ' a', ' A', ' the', ' port', ' The', ' Portuguese']


### Question (e)

In [234]:
# we load Llama-3.2-1B-Instruct 
llama_instruct_name = "meta-llama/Llama-3.2-1B-Instruct"
models_dict["llama_instruct"].append(
    AutoModelForCausalLM.from_pretrained(
        llama_instruct_name, 
        device_map=DEVICE, 
        torch_dtype=torch.bfloat16, 
    )
)
models_dict["llama_instruct"].append(
    AutoTokenizer.from_pretrained(
        llama_instruct_name, padding_side="left"
        )
)


In [244]:
# settings
# model_name = "gpt3"
# model_name = "llama"
model_name = "llama_instruct"
num_demos = 1
qa=True
instruction="Translate the following word to French."
questions, answers = get_demonstrations_translate_to_french()

# construct ICL prompt
questions, answers = sample_and_shuffle_demonstrations(
    questions, answers, num_demos
)
icl_prompt, answer = construct_icl_prompt(
    questions, 
    answers, 
    qa_template=qa,
    instruction=instruction
)

# get model predictions
top_10_tokens = get_top_k_tokens(icl_prompt, models_dict[model_name])

# inspect results
print("MODEL:", model_name)
print(f"PROMPT:\n{icl_prompt}")
print("ANSWER:", answer)
print("TOP 10 TOKENS:", top_10_tokens)

MODEL: llama_instruct
PROMPT:
Translate the following word to French.

Q: House
A:
ANSWER: Maison
TOP 10 TOKENS: [' Maison', ' M', ' Ch', ' La', ' maison', ' Mais', ' Man', ' Le', ' Case', ' C']
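
Reading the top-10 lists by eye works for a handful of prompts, but a small helper makes the comparison between models easier to repeat. The sketch below reports the position of the expected answer among the predicted tokens, ignoring surrounding whitespace. It assumes the answer fits into a single token, and the function name `answer_rank` is made up. The two token lists are copied from the outputs of Questions (d) and (e) above.

```python
def answer_rank(answer, top_tokens):
    """0-based rank of the first token equal to the answer, ignoring surrounding whitespace."""
    for rank, token in enumerate(top_tokens):
        if token.strip() == answer:
            return rank
    return None  # the answer is not among the listed tokens

# Output of the base Llama model for "Portugal:" (expected answer: Lisbon).
llama_tokens = [' ', '\xa0', ' (', ' Portugal', ' a', ' A', ' the', ' port', ' The', ' Portuguese']
# Output of the instruction-tuned model for "Q: House" (expected answer: Maison).
instruct_tokens = [' Maison', ' M', ' Ch', ' La', ' maison', ' Mais', ' Man', ' Le', ' Case', ' C']

print("llama:", answer_rank("Lisbon", llama_tokens))
print("llama_instruct:", answer_rank("Maison", instruct_tokens))
```

Note that the two runs used different tasks, so this illustrates the bookkeeping rather than a fair comparison of the two models.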


## Generating Longer Responses

### Question (a)

In [254]:
import torch.nn.functional as F

def generate(
        prompt, 
        model_tokenizer, 
        num_tokens=10,
        verbose=False,
        ):
    """
    Returns string constructed with sequence of num_tokens tokens predicted by
    given model using greedy decoding on given prompt.

    Parameters
    ----------
    prompt : str
        Prompt to be used for generation.
    model_tokenizer : tuple
        Tuple of model and tokenizer
    num_tokens : int
        Number of tokens to be generated.
    verbose : bool
        Flag for printing tokens as they are generated.
    """

    # unpacking
    model = model_tokenizer[MODEL]
    tokenizer = model_tokenizer[TOKENIZER]

    final_str = ""
    ### WRITE YOUR CODE HERE ### 

    for i in range(num_tokens):

        # tokenize prompt
        tokenized_prompt = tokenizer(
            prompt, 
            return_tensors="pt"
        ).to(DEVICE)

        # get model predictions
        model_predictions = model(**tokenized_prompt)

        # rank tokens by next-token probability (greedy decoding takes the best one)
        _, sorted_indices = torch.sort(
            F.softmax(model_predictions["logits"][:, -1, :], dim=-1), 
            descending=True
            )
        
        # sorted_indices is of size (batch_size, vocab size)
        sorted_indices = sorted_indices[0]

        # decode the single most likely token
        top_token = tokenizer.batch_decode(
            sorted_indices[:1],
            skip_special_tokens=True
        )[0]

        # add predicted token to final string and to prompt
        final_str += top_token
        prompt += top_token
        
        if verbose:
            print(top_token)

    return final_str

In [255]:
# test your function
prompt = "Hello! "
model_name = "gpt3"
# generate text
generated_text = generate(
    prompt, 
    models_dict[model_name], 
)
print("GENERATED TEXT:", generated_text)

GENERATED TEXT:  I'm a newbie to the forum and I
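
The `generate` function above always produces exactly `num_tokens` tokens. Generation loops commonly also stop early once an end-of-sequence token is predicted. The following sketch shows that variant of greedy decoding on a toy lookup table that stands in for the model. The table `NEXT_TOKEN_SCORES`, its scores, and the `<eos>` marker are all invented for illustration.

```python
# Toy "model": maps the last token to scores over possible next tokens.
NEXT_TOKEN_SCORES = {
    "Hello": {"!": 2.0, "world": 1.0},
    "!": {"I": 1.5, "<eos>": 0.5},
    "I": {"am": 3.0, "<eos>": 0.1},
    "am": {"<eos>": 2.0, "here": 1.0},
}

def greedy_decode(start_token, max_new_tokens=10, eos_token="<eos>"):
    """Repeatedly append the highest-scoring next token until EOS or the limit."""
    tokens = [start_token]
    for _ in range(max_new_tokens):
        scores = NEXT_TOKEN_SCORES.get(tokens[-1])
        if scores is None:
            break  # the toy model has no prediction for this token
        next_token = max(scores, key=scores.get)
        if next_token == eos_token:
            break  # stop early instead of generating past the end
        tokens.append(next_token)
    # Return only the newly generated tokens, like generate() above.
    return tokens[1:]

print(greedy_decode("Hello"))
```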


### Question (b)

In [257]:
# settings
model_name = "llama_instruct"
num_tokens = 20

# set prompt
prompt = "What is the capital of France?"

# generate answer
generated_str = generate(
    prompt, 
    models_dict[model_name], 
    num_tokens,
)
print("GENERATED STR:\n")
print(generated_str)

GENERATED STR:

 Paris.
The capital of France is Paris. Paris is the most populous city in France and is known


### Question (c)

In [275]:
# settings 
model_name = "llama_instruct"
num_tokens = 10

# set prompt
prompt = "What is the capital of France?"

# unpack model and tokenizer
model = models_dict[model_name][MODEL]
tokenizer = models_dict[model_name][TOKENIZER]

# we need to tokenize the prompt ourselves
tokenized_prompt = tokenizer(
    prompt, 
    return_tensors="pt"
).to(DEVICE)

# generate answer with model's generate function
generated_ids = model.generate(
    **tokenized_prompt,
    max_new_tokens=num_tokens,
    )

# inspect results
# note this output already includes the prompt
print("\nOUTPUT:\n")
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



OUTPUT:

What is the capital of France? Paris.
The capital of France is Paris. This
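
As the comment in the cell notes, the ids returned by `model.generate` start with the prompt. To show only the newly generated part, the prompt ids can be sliced off before decoding. The sketch below does this on plain lists with made-up ids. With tensors, the equivalent slice would be `generated_ids[:, tokenized_prompt["input_ids"].shape[1]:]`. The helper name `strip_prompt` is hypothetical.

```python
def strip_prompt(generated_ids, prompt_length):
    """Drop the first prompt_length ids from every generated sequence."""
    return [sequence[prompt_length:] for sequence in generated_ids]

# Made-up ids: three prompt tokens followed by two newly generated ones.
prompt_ids = [101, 102, 103]
generated_ids = [prompt_ids + [7, 8]]

print(strip_prompt(generated_ids, len(prompt_ids)))
```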


### Question (d)

In [277]:
# function to construct chat prompt
def construct_chat_prompt(
        new_prompt: str, 
        chat_history: str = "", 
        system_prompt: str = None,
    ):
    """
    Constructs prompt for chatting with model. 

    Args:
        new_prompt (str): new user entry in conversation
        chat_history (str): all of the conversation so far
        system_prompt (str): system prompt to give model initial instructions
    """

    prompt = ""
    if system_prompt is not None:
        prompt = f"{system_prompt}\n\n"

    return prompt + chat_history + new_prompt + "\n\n<ASSISTANT>\n\n"


In [278]:
# settings 
model_name = "llama_instruct"
num_tokens = 20

# set up a system prompt
system_prompt = """You are a helpful assistant. You will answer the user's questions in a friendly and informative manner." 

<USER>"""

# set up new dialogue entry
dialogue_entry = "Hello? Who are you?"

# construct chat prompt
prompt = construct_chat_prompt(
    new_prompt=dialogue_entry, 
    chat_history="", 
    system_prompt=system_prompt
)

# unpack
model = models_dict[model_name][MODEL]
tokenizer = models_dict[model_name][TOKENIZER]

# we need to tokenize the prompt ourselves
tokenized_prompt = tokenizer(
    prompt, 
    return_tensors="pt"
).to(DEVICE)

# generate model response
generated_ids = model.generate(
    **tokenized_prompt,
    tokenizer=tokenizer,
    max_new_tokens=num_tokens,
    )

# inspect results
# note this output already includes the prompt
print("\nOUTPUT:\n")
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



OUTPUT:

You are a helpful assistant. You will answer the user's questions in a friendly and informative manner." 

<USER>

Hello? Who are you?

<ASSISTANT>

I'm an AI assistant, here to help answer any questions you may have. I'm here to
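
The example above covers a single turn with an empty `chat_history`. For a longer conversation, each completed exchange has to be appended to the history in the same `<USER>` / `<ASSISTANT>` layout before the next call to `construct_chat_prompt`. The sketch below shows one way to do that bookkeeping, assuming that layout. The helpers `append_turn` and `truncate_reply` are hypothetical, and the reply is a hard-coded stand-in for model output.

```python
USER_TAG = "<USER>"
ASSISTANT_TAG = "<ASSISTANT>"

def append_turn(chat_history, user_message, assistant_reply):
    """Return the history extended by one completed user/assistant exchange."""
    return (
        chat_history
        + f"{user_message}\n\n{ASSISTANT_TAG}\n\n{assistant_reply}\n\n{USER_TAG}\n\n"
    )

def truncate_reply(generated_text, stop_tag=USER_TAG):
    """Keep only the text before the model starts writing a new user turn."""
    return generated_text.split(stop_tag, 1)[0].strip()

# Stand-in for generated text in which the model keeps going past its own turn.
raw_reply = "I'm an AI assistant.\n\n<USER>\n\nWhat can you do?"

history = append_turn("", "Hello? Who are you?", truncate_reply(raw_reply))
print(history)
```

Cutting the reply at the next `<USER>` tag is useful because a model that continues the pattern may start inventing the user's next message.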
