Sheet 7.1: Using pretrained LLMs w/ the 'transformers' package
==============================================================

**Author:** Michael Franke



The 'transformers' package by [huggingface](https://huggingface.co/) provides direct access to a multitude of pretrained large language models (LLMs).
Models and easy-to-use pipelines exist for many common NLP tasks, ranging from (causal or masked) language modeling and machine translation to sentiment analysis and natural language inference.
This brief tutorial shows how to download a pretrained causal LLM, a version of OpenAI's GPT-2, how to use it for generation, and how to access its predictions (next-word probabilities, sequence embeddings).

The 'transformers' package provides models for use with several deep-learning frameworks, including PyTorch, TensorFlow and JAX.
Not all models or tools are available for every framework, but PyTorch and TensorFlow are covered best.



## Packages



We will make heavy use of the 'transformers' package, but also use huggingface's 'datasets' package to access a data set of text from Wikipedia articles.
In particular, we import two classes from the 'transformers' package which give us access to instances of OpenAI's GPT-2 model for causal language modeling.
We need 'torch' for tensor manipulations and 'textwrap' to prettify output.



In [3]:
!pip install datasets
!pip install fsspec==2024.10.0
!pip install --force-reinstall gcsfs




In [1]:
##################################################
## import packages
##################################################

from transformers import GPT2TokenizerFast, GPT2LMHeadModel
from datasets import load_dataset
import torch
import textwrap
import warnings
warnings.filterwarnings('ignore')

## Helpers



Here is a small helper function for prettier (?) printing of generated output text:



In [2]:
##################################################
## helper function (nicer printing)
##################################################

def pretty_print(s):
    print("Output:\n" + 80 * '-')
    print(textwrap.fill(tokenizer.decode(s, skip_special_tokens=True),80))

## Obtaining a pretrained LLM



The 'transformers' package provides access to many different (language) models (see [here](https://huggingface.co/models) for an overview).
One of them is GPT-2.
There are several types of GPT-2 instances we can instantiate through the 'transformers' package, be it for different frameworks (PyTorch, TensorFlow etc.) or for different purposes (sequence classification, language modeling etc.).
Here is an [overview of the GPT-2 model family](https://huggingface.co/docs/transformers/v4.25.1/en/model_doc/gpt2).

In this tutorial, we are interested in using GPT-2 for (left-to-right) language modeling.
We therefore use the class '[GPT2LMHeadModel](https://huggingface.co/docs/transformers/v4.25.1/en/model_doc/gpt2#transformers.GPT2LMHeadModel)'.
This class provides access to different variants of GPT-2 (smaller or larger, i.e., with fewer or more parameters).
Here we use the 'gpt2-large' instance, just because.

Since different (language) models also use different tokenization, we also use the corresponding tokenizer from the class 'GPT2TokenizerFast'.



In [3]:
##################################################
## instantiating LLM & its tokenizer
##################################################

# model_to_use = "gpt2"
model_to_use = "gpt2-large"

print("Using model: ", model_to_use)

# get the tokenizer for the pre-trained LM you would like to use
tokenizer = GPT2TokenizerFast.from_pretrained(model_to_use)

# instantiate a model (causal LM)
model = GPT2LMHeadModel.from_pretrained(model_to_use,
                                        output_scores=True,
                                        pad_token_id=tokenizer.eos_token_id)

# inspecting the (default) model configuration
# (it is possible to create models with different configurations)
print(model.config)

Using model:  gpt2-large



GPT2Config {
  "_attn_implementation_autoset": true,
  "_name_or_path": "gpt2-large",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1280,
  "n_head": 20,
  "n_inner": null,
  "n_layer": 36,
  "n_positions": 1024,
  "output_scores": true,
  "pad_token_id": 50256,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.46.3",
  "use_cache": true,
  "vocab_size": 50257
}



## Using the LLM for text generation



The instance of the pretrained LLM, now accessible through the variable 'model', comes with several functions for us to use, one of which is 'generate'.
We can use it to generate text after an initial prompt.
First, the input prompt must be translated into tokens and then fed into 'generate', which takes arguments to specify the decoding strategy (here top-k sampling).
The output is a tensor of tokens, which must be translated back into human-intelligible words for output.



In [5]:
from transformers import GPT2TokenizerFast, GPT2LMHeadModel
import torch

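# Load the tokenizer and model
# NB: this loads the small "gpt2" checkpoint and overwrites the "gpt2-large"
# instances created above, so all following cells use the small model
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")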

# Prepare input text
input_text = "Once upon a time"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
attention_mask = torch.ones_like(input_ids)  # Explicitly define attention_mask

# Generate output using top-k sampling
outputs = model.generate(
    input_ids,
    attention_mask=attention_mask,  # Pass attention mask
    max_length=50,
    do_sample=True,
    top_k=50,
    temperature=0.7
)



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


In [6]:
##################################################
## autoregressive generation
##################################################

# text to expand
prompt = "Once a vampire fell in love with a pixie so that they"

# translate the prompt into tokens
input_tokens = tokenizer(prompt, return_tensors="pt").input_ids
print(input_tokens)

outputs = model.generate(input_tokens,
                         max_new_tokens=100,
                         do_sample=True,
                         top_k=50)

print("\nTop-k sampling:\n")
pretty_print(outputs[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


tensor([[ 7454,   257, 23952,  3214,   287,  1842,   351,   257,   279, 39291,
           523,   326,   484]])

Top-k sampling:

Output:
--------------------------------------------------------------------------------
Once a vampire fell in love with a pixie so that they killed him, there was no
longer a vampire relationship to take us into.  The show will focus on the young
lady Mary Elizabeth (Claudia Saldanha), a vampire and vampire lover from the
town of Little White Mountain.  Crazy Credits  In this episode, Elizabeth is
brought in to be "licked," because there are vampires who believe she's cursed
to the point of insanity.  The Vampire's Curse  The witch of Lathrop's
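
The decoding strategy used above, top-k sampling, can be illustrated without a model. The following is a minimal sketch in plain Python, not the implementation inside 'generate': the function name `top_k_sample` and the toy logits are hypothetical, chosen only for illustration. It keeps the k highest-scoring entries, applies a temperature-scaled softmax to them, and draws one token id.

```python
import math
import random

def top_k_sample(logits, k=50, temperature=1.0, rng=random):
    """Sample one token id from the k highest-scoring logits."""
    # keep the indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # temperature-scaled softmax weights over the retained logits only
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    # draw one index in proportion to its (renormalized) weight
    return rng.choices(top, weights=weights, k=1)[0]

# hypothetical logits over a 6-token vocabulary
toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0, 1.5]
rng = random.Random(0)
samples = [top_k_sample(toy_logits, k=3, temperature=0.7, rng=rng)
           for _ in range(1000)]

# with k=3, only the three highest-scoring ids (3, 0 and 5) can be drawn
print(sorted(set(samples)))
```

Lower temperatures concentrate the weight on the best candidates; with `k=1` the procedure reduces to greedy decoding.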


We can also use beam search through 'generate' by setting the parameter 'num_beams'.



In [7]:
outputs = model.generate(input_tokens,
                         max_new_tokens=100,
                         num_beams=6,
                         no_repeat_ngram_size=4,
                         early_stopping=True
                         )

print("\nBeam search:\n")
pretty_print(outputs[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.



Beam search:

Output:
--------------------------------------------------------------------------------
Once a vampire fell in love with a pixie so that they could have a child, she
would give birth to a child, and the child would become a vampire. The vampire
would then become a vampire again, and the pixie would become a pixie again, and
so on and so forth.  The vampire would then be reborn as a vampire again. The
pixie would then become an adult vampire, and the vampire would be reborn as an
adult vampire again.  When a vampire dies, the pixie will be reborn as the
vampire again
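
To see what the number of beams controls, here is a simplified sketch of beam search over a hand-written toy language model. Everything in it is an assumption made for illustration: the table `TOY_LM`, the function `beam_search`, and the omission of refinements such as length normalization or n-gram blocking that a full implementation may apply.

```python
import math

# hypothetical next-token distributions, keyed by the previous token
TOY_LM = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a":   {"a": 0.1, "b": 0.4, "</s>": 0.5},
    "b":   {"a": 0.1, "b": 0.1, "</s>": 0.8},
}

def beam_search(num_beams=2, max_new_tokens=3):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [(0.0, ["<s>"])]            # (log probability, tokens)
    for _ in range(max_new_tokens):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":    # finished hypotheses are carried over
                candidates.append((score, tokens))
                continue
            for tok, p in TOY_LM[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [tok]))
        # prune to the best num_beams candidates
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams

best_score, best_tokens = beam_search(num_beams=2)[0]
print(best_tokens, round(math.exp(best_score), 3))   # ['<s>', 'b', '</s>'] 0.32
```

With `num_beams=1` the search is greedy: it commits to "a" first (probability 0.6) and ends with a sequence of probability 0.30. Two beams keep "b" alive and find the more probable sequence.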


## Accessing next-word probabilities



To access the model's (raw) predictions, from which we obtain (log) next-word probabilities, we can call 'model' itself, which runs the forward pass of the model.
We simply need to feed in a prompt sequence as input.
We can additionally feed in a sequence of tokens as 'labels', for which we then obtain the predicted next-word probabilities.
NB: The labels are shifted internally, so the $i$-th token in the sequence of labels is assigned the probability obtained after having processed all tokens up to and including the $(i-1)$-th token of the input-token sequence; the first label therefore never receives a probability.

The average negative log-likelihood of the provided labels is accessed through the 'loss' attribute of the object returned from a call to 'model'.
The returned object is of type '[CausalLMOutputWithCrossAttentions](https://huggingface.co/docs/transformers/main/en/main_classes/output#transformers.modeling_outputs.CausalLMOutputWithCrossAttentions)'.



In [8]:
##################################################
## retrieving next-word surprisals from GPT-2
##################################################

# NB: we can supply tensors of labels (token ids for next-words, no need to right-shift)
# using -100 in the labels means: "don't compute this one"
labels        = torch.clone(input_tokens)
labels[0,0]   = -100
output_word2  = model(input_tokens[:,0:2], labels= labels[:,0:2])
output_prompt = model(input_tokens, labels=input_tokens)

# negative log-likelihood of provided labels
# NB: 'loss' is the average over the predicted tokens; the first token has no
# left context, so a sequence of n tokens contributes n-1 predictions
nll_word2  = output_word2.loss
nll_output = output_prompt.loss * (input_tokens.size(1) - 1)
print(f"NLL of second word:  {nll_word2.item():.2f}")
print(f"NLL of whole output: {nll_output.item():.2f}")

NLL of second word:  3.42
NLL of whole output: 45.42
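
How the labels line up with the predictions is easy to get wrong, so here is a small sketch of the bookkeeping in plain Python. It assumes the usual convention for causal language models: the distribution computed after reading tokens 0..i is scored against the label at position i+1, and labels equal to -100 are skipped. The function `average_nll` and the toy probabilities are hypothetical.

```python
import math

IGNORE = -100  # label value that is skipped when computing the loss

def average_nll(probs_per_position, labels):
    """Average NLL of the labels, with the internal shift made explicit.

    probs_per_position[i] is the predicted next-token distribution after
    reading input tokens 0..i; it is scored against labels[i + 1].
    """
    nlls = []
    for i in range(len(labels) - 1):
        target = labels[i + 1]
        if target == IGNORE:
            continue
        nlls.append(-math.log(probs_per_position[i][target]))
    return sum(nlls) / len(nlls)

# hypothetical model over a 3-token vocabulary, input sequence [0, 2, 1]
tokens = [0, 2, 1]
probs = [
    [0.2, 0.3, 0.5],   # after token 0: P(next = 2) = 0.5
    [0.1, 0.8, 0.1],   # after tokens 0, 2: P(next = 1) = 0.8
    [0.3, 0.3, 0.4],   # after the full input: not scored, no label follows
]
loss = average_nll(probs, labels=tokens)
total = loss * (len(tokens) - 1)     # sum over the 2 predicted tokens
print(round(loss, 4), round(total, 4))   # 0.4581 0.9163
```

The total NLL is recovered by multiplying the average by the number of predicted tokens, which is one less than the sequence length.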


We can also retrieve the logits (= non-normalized weights prior to the final softmax operation) from the returned object, and so derive the next-word probabilities:



In [9]:
# logits for each position of the input
print(output_word2.logits)
# next-word log probabilities: normalize over the vocabulary (last dimension)
log_probs = torch.nn.functional.log_softmax(output_word2.logits, dim=-1)
print(log_probs.size())

tensor([[[ -34.5645,  -34.4082,  -38.3079,  ...,  -41.6997,  -39.7802,
           -35.0521],
         [ -96.1105,  -94.0417,  -97.9706,  ..., -100.6318,  -98.2564,
           -93.8704]]], grad_fn=<UnsafeViewBackward0>)
torch.Size([1, 2, 50257])
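
The step from logits to next-word probabilities is a softmax over the vocabulary, applied separately at each position. A minimal sketch in plain Python, with a hypothetical four-word vocabulary and made-up logits, shows the computation for a single position:

```python
import math

def log_softmax(logits):
    """Log-probabilities from one vector of logits (one score per vocabulary item)."""
    m = max(logits)                                   # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

# hypothetical logits for a 4-word vocabulary at a single position
toy_logits = [-34.6, -34.4, -38.3, -35.1]
log_probs = log_softmax(toy_logits)
probs = [math.exp(lp) for lp in log_probs]

print([round(p, 3) for p in probs])
print(round(sum(probs), 6))    # probabilities sum to 1 over the vocabulary
```

Only differences between logits matter: adding a constant to every entry leaves the resulting probabilities unchanged.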


## Accessing the embeddings (hidden states)



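If we want to repurpose the LLM, we would be interested in the embedding of an input sequence, i.e., the hidden states the model computes for it.
With the flag 'output_hidden_states', the model returns a tuple of hidden states with one tensor per layer: the first element holds the input embeddings of each token, the last element the state of the final hidden layer.
Here is how to access it: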



In [10]:
##################################################
## retrieving sequence embedding
##################################################

# set flag 'output_hidden_states' to true
output = model(input_tokens, output_hidden_states=True)

# 'hidden_states' is a tuple with one tensor per layer:
# index 0 holds the input embeddings of each token, index -1 the final hidden layer
hidden_states = output.hidden_states
# here we access the first element of the tuple (the input embeddings)
embeddings = hidden_states[0]
# and print its size and content
print(embeddings.size())
print("Embedding of last word in input:\n", embeddings[0, -1])

torch.Size([1, 13, 768])
Embedding of last word in input:
 tensor([ 1.9173e-02,  3.9362e-02,  1.4456e-01,  6.9483e-02, -5.7905e-02,
        -5.8178e-03, -2.7556e-01, -2.5199e-02,  6.1657e-02,  5.9887e-02,
         5.0426e-02,  6.7358e-02,  1.1095e-02,  7.1450e-02,  1.7385e-01,
         8.2816e-02, -6.0333e-02, -4.8444e-02,  4.1443e-02,  5.2130e-01,
         4.3383e-02, -1.0741e-02, -9.8101e-02, -1.3637e-02, -5.2027e-02,
         4.9931e-02, -1.3581e-03,  1.0930e-01, -6.5434e-02, -3.9222e-02,
         3.2260e-02,  7.8113e-02, -6.5054e-02,  1.2354e-02,  9.6690e-02,
         1.2364e-01, -5.5306e-01, -6.5979e-02, -5.0075e-02,  4.3763e-02,
         1.0316e-01,  3.3196e-02, -6.5039e-02,  5.6409e-02, -7.0250e-02,
         5.2222e-02,  7.3260e-02, -1.4981e-02, -4.7168e-02, -5.3863e-01,
         1.0385e-01,  5.0040e-02, -2.6686e-02,  3.8491e-02,  1.0197e-01,
        -2.9054e-01, -4.6894e-02,  1.0513e-01, -9.9775e-02,  9.8204e-02,
         5.8441e-02, -3.8922e-02,  7.7806e-02,  1.6908e-02, -2.17
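
To turn per-token hidden states into a single vector for a whole sequence, two common choices are to take the vector of the last token or to average over all tokens. The sketch below uses nested Python lists as a stand-in for a tensor of shape (tokens, dimensions); the helper names and the numbers are hypothetical.

```python
def last_token_embedding(hidden):
    """Use the final position's vector as the sequence embedding."""
    return hidden[-1]

def mean_pool(hidden):
    """Average the vectors of all positions, dimension by dimension."""
    n = len(hidden)
    return [sum(column) / n for column in zip(*hidden)]

# hypothetical hidden states: 3 tokens, 4 dimensions each
hidden = [
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 1.0, 0.0, -1.0],
    [2.0, 1.0, 1.0, 1.0],
]
print(last_token_embedding(hidden))   # [2.0, 1.0, 1.0, 1.0]
print(mean_pool(hidden))              # [1.0, 1.0, 1.0, 1.0]
```

In a left-to-right model, the last token is the only position that has seen the entire input, which is one reason to prefer it over the mean.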

## [Excursion:] Using data from 'datasets'



The 'transformers' package is accompanied by the 'datasets' package (also from huggingface), which includes a bunch of interesting data sets for further exploration or fine-tuning.

Here is a brief example of how to load a data set of text from Wikipedia, which we need to pre-process a bit (conjoin lines, tokenize) and then feed into the LLM to access the average negative log-likelihood of the sequence.



In [11]:
##################################################
## working with datasets
##################################################

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

input_tokens = encodings.input_ids[:,10:50]

pretty_print(input_tokens[0])

output = model(input_tokens, labels = input_tokens)
print("Average NLL for wikipedia chunk", output.loss.item())


Token indices sequence length is longer than the specified maximum sequence length for this model (287644 > 1024). Running this sequence through the model will result in indexing errors


Output:
--------------------------------------------------------------------------------
  Robert Boulter is an English film , television and theatre actor . He had a
guest @-@ starring role on the television series The Bill in 2000 . This was
followed by a starring role
Average NLL for wikipedia chunk 4.377964496612549
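
The tokenizer warning above appears because the joined text is far longer than the 1024 positions GPT-2 supports; the cell avoids the problem by slicing out 40 tokens. To score a longer text, one option is to split it into windows and combine the per-window losses. The following is a rough sketch under simplifying assumptions: the helpers `chunk` and `combine_average_nll` are hypothetical, the windows do not overlap (so tokens at the start of a window are predicted with little context), and the per-window losses are made-up numbers standing in for the 'loss' values of separate model calls.

```python
def chunk(token_ids, max_len):
    """Split a long token sequence into consecutive windows of at most max_len tokens."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

def combine_average_nll(chunk_losses, chunk_sizes):
    """Token-weighted mean of per-chunk average NLLs.

    A chunk of n tokens yields n - 1 predictions, so that is its weight.
    """
    weights = [n - 1 for n in chunk_sizes]
    total = sum(loss * w for loss, w in zip(chunk_losses, weights))
    return total / sum(weights)

ids = list(range(2500))                  # stand-in for a long tokenized text
windows = chunk(ids, max_len=1024)       # 1024 = 'n_positions' of GPT-2
print([len(w) for w in windows])         # [1024, 1024, 452]

# hypothetical per-chunk average NLLs for chunks of 3 and 2 tokens
print(combine_average_nll([4.0, 5.0], chunk_sizes=[3, 2]))
```

Weighting by the number of predictions keeps a short final window from counting as much as a full one.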
