<a href="https://colab.research.google.com/github/spyysalo/dl-binf-summer-school-2025/blob/main/unpacking_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unpacking `pipeline`

This notebook unpacks the main steps behind text generation with the `pipeline` function: tokenization, the model forward pass, and selection of the next token.

First install the [transformers](https://huggingface.co/docs/transformers/index) package.

In [29]:
!pip install --quiet transformers

Next, load a `pipeline` for text generation with a small model.

In [30]:
from transformers import pipeline

MODEL_NAME = 'HuggingFaceTB/SmolLM-135M'

pipe = pipeline(
    'text-generation',
    model=MODEL_NAME,
    device_map='auto',
)

Device set to use cpu


We can conveniently generate text using the high-level abstraction that `pipeline` provides:

In [31]:
prompt = 'The capital of Denmark is'

print(pipe(prompt)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


The capital of Denmark is Copenhagen, with a population of 534,966 people. The county is roughly equivalent to Vermont, but contains a total area of 3,489.07 square miles, making it the third-largest state in the United States. It is one of the oldest states with a history going back to 1624 when the Swedish colony of New Sweden was established. Today it is a member of the Commonwealth of Nations along with New Zealand and Australia.
Denmark was originally inhabited by the Sami people who lived in what is now Scandinavia for thousands of years. The Viking settlers brought with them their own distinct culture, language, and beliefs. The country was later invaded by the Norse, who established a kingdom in Greenland in the 9th century. When Denmark was conquered by Denmark, Sweden, Norway, and Iceland, they were all under Danish rule.
Denmark has a long history of trade with other countries, including the United States. The country was a member of the European Economic Community (EEC). In

For simplicity, let's look at generating a single token (here, a whole word) using greedy decoding, i.e. always selecting the token that's most likely according to the model.

In [32]:
params = {
    'do_sample': False,
    'max_new_tokens': 1,
}

print(pipe(prompt, **params)[0]['generated_text'])

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


The capital of Denmark is Copenhagen
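
To make the difference between greedy decoding and sampling concrete, here is a minimal sketch in plain Python. The candidate tokens and their scores are hypothetical values chosen for illustration, not output of the model, and `pick_greedy` and `pick_sampled` are illustrative helper names rather than part of `transformers`.

```python
import math
import random

# Hypothetical scores for three candidate next tokens (not real model output).
scores = {' Copenhagen': 5.0, ' Aarhus': 3.0, ' the': 2.5}


def pick_greedy(scores):
    """Return the single highest-scoring token (what do_sample=False amounts to)."""
    return max(scores, key=scores.get)


def pick_sampled(scores, temperature=1.0, rng=random):
    """Draw a token with probability proportional to exp(score / temperature)."""
    top = max(scores.values())
    tokens = list(scores)
    # Subtracting the maximum keeps exp() from overflowing; the ratios are unchanged.
    weights = [math.exp((scores[t] - top) / temperature) for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


greedy_token = pick_greedy(scores)
sampled_token = pick_sampled(scores, temperature=1.0, rng=random.Random(0))
print(repr(greedy_token), repr(sampled_token))
```

Greedy selection never consults the temperature, which is presumably why the warning above reports the `temperature` flag as ignored once `do_sample` is `False`.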


Now, let's look at what's going on behind the `pipeline` abstraction. First, here's the model:

In [33]:
model = pipe.model
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=576, out_features=576, bias=False)
          (k_proj): Linear(in_features=576, out_features=192, bias=False)
          (v_proj): Linear(in_features=576, out_features=192, bias=False)
          (o_proj): Linear(in_features=576, out_features=576, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
          (up_proj): Linear(in_features=576, out_features=1536, bias=False)
          (down_proj): Linear(in_features=1536, out_features=576, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((576,), eps=1e-05)
    (rotary_emb): LlamaRotaryEm
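
As a rough sanity check, the layer shapes in the printout can be multiplied out to estimate the number of weights. The sketch below counts only the modules visible above and assumes that the output projection, which is cut off in the printout, shares its weights with the embedding; treat the result as an estimate under that assumption.

```python
# Shapes copied from the printed module tree above.
VOCAB, HIDDEN, KV_DIM, MLP_DIM, N_LAYERS = 49152, 576, 192, 1536, 30

embedding = VOCAB * HIDDEN
# q_proj and o_proj are HIDDEN x HIDDEN; k_proj and v_proj are HIDDEN x KV_DIM.
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM
# gate_proj, up_proj and down_proj each hold HIDDEN * MLP_DIM weights (no biases).
mlp = 3 * HIDDEN * MLP_DIM
# Two RMSNorm weight vectors per decoder layer.
norms = 2 * HIDDEN
per_layer = attention + mlp + norms

# The final RMSNorm adds HIDDEN weights. Assumption: no separate
# output-projection weights are counted (shared with the embedding).
total = embedding + N_LAYERS * per_layer + HIDDEN
print(f'{total:,}')
```

The estimate comes out close to the 135M in the model name. On the loaded model, `sum(p.numel() for p in model.parameters())` should give a comparable figure.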

The model doesn't actually deal with text directly, but rather with token indices. The mapping between running text and token indices is implemented by a tokenizer.

In [34]:
tokenizer = pipe.tokenizer

Let's have a look at the mapping between our prompt and the token indices.

In [35]:
input_ = tokenizer(prompt)

print(input_)

{'input_ids': [504, 3575, 282, 15644, 314], 'attention_mask': [1, 1, 1, 1, 1]}


That's the actual input to the model, and `input_ids` are the token indices. (We can ignore the `attention_mask` here.)

The tokenizer can map these back to text:

In [36]:
print(tokenizer.convert_ids_to_tokens(input_.input_ids))

['The', 'Ġcapital', 'Ġof', 'ĠDenmark', 'Ġis']


(The `Ġ` there encodes a space; this representation is a minor quirk of the byte-level BPE tokenization scheme introduced with GPT-2.)

Because the token IDs represent both visible characters and spaces, the full input string can be reconstructed exactly:

In [37]:
print(tokenizer.decode(input_.input_ids))

The capital of Denmark is
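
To see where `Ġ` comes from, the sketch below rebuilds a byte-to-character table in the style of GPT-2's byte-level BPE and uses it to turn the token strings back into text. It assumes this tokenizer follows that convention, which the `Ġ` marker suggests; `detokenize` is an illustrative helper, not a `transformers` function.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character (GPT-2 style)."""
    printable = (
        list(range(ord('!'), ord('~') + 1))
        + list(range(ord('¡'), ord('¬') + 1))
        + list(range(ord('®'), ord('ÿ') + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes (including space) are moved past code point 255.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


byte_to_char = bytes_to_unicode()
char_to_byte = {c: b for b, c in byte_to_char.items()}


def detokenize(tokens):
    """Join token strings and undo the byte-to-character mapping."""
    data = bytes(char_to_byte[ch] for ch in ''.join(tokens))
    return data.decode('utf-8')


tokens = ['The', 'Ġcapital', 'Ġof', 'ĠDenmark', 'Ġis']
print(byte_to_char[ord(' ')])
print(detokenize(tokens))
```

The space byte (32) is the 33rd byte without a printable form, so it lands on code point 256 + 32, which is `Ġ`.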


We can also invoke the model directly on the encoded input. The model expects batched PyTorch tensors rather than Python lists, so we ask the tokenizer for them with `return_tensors='pt'`; the information content is the same.

In [38]:
input_ = tokenizer(prompt, return_tensors='pt')

print(input_)

{'input_ids': tensor([[  504,  3575,   282, 15644,   314]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}


In [39]:
output = model(**input_)
print(output)

CausalLMOutputWithPast(loss=None, logits=tensor([[[ -3.5625, -11.6875, -11.6875,  ...,  -7.3750,  -7.0312,  -9.8750],
         [  2.0938,  -8.3125,  -8.4375,  ...,  -7.7812,  -5.3125,  -9.5000],
         [-10.4375, -20.6250, -20.7500,  ..., -16.1250, -14.0625, -19.2500],
         [ 12.8750,  -4.2188,  -4.5000,  ...,   1.7891,   1.9297,  -2.0938],
         [ -0.0728, -13.3125, -13.5625,  ...,  -8.6250,  -3.8594, -12.6875]]],
       dtype=torch.bfloat16, grad_fn=<UnsafeViewBackward0>), past_key_values=DynamicCache(layers=[<transformers.cache_utils.DynamicLayer object at 0x78d8f938dd50>, <transformers.cache_utils.DynamicLayer object at 0x78d8f9380e90>, <transformers.cache_utils.DynamicLayer object at 0x78d8f9382cd0>, <transformers.cache_utils.DynamicLayer object at 0x78d8f9380b50>, <transformers.cache_utils.DynamicLayer object at 0x78d8f9380c90>, <transformers.cache_utils.DynamicLayer object at 0x78d8f9380b90>, <transformers.cache_utils.DynamicLayer object at 0x78d8f93800d0>, <transformer

The primary output of the model is the logits: unnormalized scores, one for every token in the vocabulary at every input position. We're interested in the scores at the last position, since these rank the candidates for the next token. (The first dimension here is for the batch, and we have a batch of one.)

In [40]:
print(output.logits.shape)

logits = output.logits[0][-1]
print(logits)

torch.Size([1, 5, 49152])
tensor([ -0.0728, -13.3125, -13.5625,  ...,  -8.6250,  -3.8594, -12.6875],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
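
"Unnormalized" means the scores do not sum to one. A softmax turns them into probabilities, as in this plain-Python sketch; the four logits are hypothetical values for a toy vocabulary, not taken from the model.

```python
import math


def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    top = max(logits)
    # Subtracting the maximum avoids overflow and leaves the result unchanged.
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over a four-token vocabulary (not real model output).
toy_logits = [-0.07, -13.3, 2.5, -3.9]
probs = softmax(toy_logits)

# Index of the most probable token.
best = max(range(len(probs)), key=probs.__getitem__)
print(best, round(sum(probs), 6))
```

Softmax preserves the ordering of the scores, so the most probable token is also the one with the largest logit. Greedy decoding can therefore work on the raw logits without normalizing them first.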


For greedy decoding, we can just take the argmax, which gives us the index of the most likely next token.

In [41]:
logits.argmax()

tensor(27101)

In [42]:
tokenizer.convert_ids_to_tokens([logits.argmax()])

['ĠCopenhagen']

If we wanted to continue generating more than one token, we would simply append this index to `input_ids` and invoke the model again.
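
That loop can be sketched as follows. To keep the example self-contained, the model is replaced by `toy_model`, a hypothetical stand-in that returns scores over a five-token vocabulary; `greedy_decode` and its parameters are illustrative names, not part of `transformers`.

```python
from typing import Callable, List, Optional


def greedy_decode(
    next_token_logits: Callable[[List[int]], List[float]],
    input_ids: List[int],
    max_new_tokens: int,
    eos_token_id: Optional[int] = None,
) -> List[int]:
    """Repeatedly append the highest-scoring next token to the sequence."""
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        # Stop early if the end-of-sequence token was produced.
        if next_id == eos_token_id:
            break
    return ids


def toy_model(ids: List[int]) -> List[float]:
    """Hypothetical stand-in for the model: favours (last id + 1) modulo 5."""
    favoured = (ids[-1] + 1) % 5
    return [1.0 if i == favoured else 0.0 for i in range(5)]


generated = greedy_decode(toy_model, [0], max_new_tokens=3)
print(generated)
```

With the real model, the stand-in would be a function that runs the model on the current ids and returns `output.logits[0][-1]`. In practice `model.generate` handles this loop, and it can reuse the cached `past_key_values` instead of recomputing the whole sequence at every step.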