# pipeline (in detail)

The `pipeline()` function encapsulates three steps: tokenization, model inference, and postprocessing.

## tokenizer

**preprocessing step:** input (`str`) --> character/subword/word/sentence tokens (`str`) --> token IDs (`tensor`) [**An embedding is the high-dimensional vector representation of a token**, not to be confused with `token ids`; the tokenizer produces IDs, and the model turns them into embeddings]

**postprocessing step:** model output (logits or token IDs) --> output (`str`, e.g. a label or decoded text)

In [2]:
from transformers import AutoTokenizer

In [3]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint) # tokenizer from a pretrained model's vocabulary

In [145]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

token_ids = tokenizer(text=raw_inputs, padding=True, truncation=True, return_tensors="pt")
token_ids

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}

# Model

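Download the model with `AutoModel`. `AutoModel` loads the base transformer without a task-specific head, so its output is hidden states rather than logits.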

In [5]:
from transformers import AutoModel

In [6]:
model = AutoModel.from_pretrained(checkpoint)


In [None]:
outputs = model(**token_ids)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])


In [8]:
outputs['last_hidden_state']

tensor([[[-0.1798,  0.2333,  0.6321,  ..., -0.3017,  0.5008,  0.1481],
         [ 0.2758,  0.6497,  0.3200,  ..., -0.0760,  0.5136,  0.1329],
         [ 0.9046,  0.0985,  0.2950,  ...,  0.3352, -0.1407, -0.6464],
         ...,
         [ 0.1466,  0.5661,  0.3235,  ..., -0.3376,  0.5100, -0.0561],
         [ 0.7500,  0.0487,  0.1738,  ...,  0.4684,  0.0030, -0.6084],
         [ 0.0519,  0.3729,  0.5223,  ...,  0.3584,  0.6500, -0.3883]],

        [[-0.2937,  0.7283, -0.1497,  ..., -0.1187, -1.0227, -0.0422],
         [-0.2206,  0.9384, -0.0951,  ..., -0.3643, -0.6605,  0.2407],
         [-0.1536,  0.8988, -0.0728,  ..., -0.2189, -0.8528,  0.0710],
         ...,
         [-0.3017,  0.9002, -0.0200,  ..., -0.1082, -0.8412, -0.0861],
         [-0.3338,  0.9674, -0.0729,  ..., -0.1952, -0.8181, -0.0634],
         [-0.3454,  0.8824, -0.0426,  ..., -0.0993, -0.8329, -0.1065]]],
       grad_fn=<NativeLayerNormBackward0>)

In [9]:
model.config

DistilBertConfig {
  "_attn_implementation_autoset": true,
  "_name_or_path": "distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.48.1",
  "vocab_size": 30522
}

In [10]:
from transformers import AutoModelForSequenceClassification

In [11]:
seq_model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [12]:
seq_outputs = seq_model(**token_ids)

In [13]:
seq_outputs.logits.shape

torch.Size([2, 2])

In [14]:
seq_model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

In [15]:
seq_outputs.logits

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)

In [16]:
from torch import softmax # torch.softmax; computes the same result as torch.nn.functional.softmax

In [17]:
probs = softmax(seq_outputs.logits, dim=-1)
probs

tensor([[4.0195e-02, 9.5981e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)
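To make the softmax step concrete, here is a minimal pure-Python sketch that recomputes the probabilities from the printed logits. The helper name `softmax_row` is illustrative and not part of any library, and because the logits are copied from the rounded output above, the results agree with the tensor only up to rounding.

```python
import math


def softmax_row(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the seq_outputs.logits output above.
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
probs = [softmax_row(row) for row in logits]

for row in probs:
    print([round(p, 4) for p in row])
```

Each row sums to 1, and the larger logit in a row always receives the larger probability.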

# instantiate transformers model

In [18]:
from transformers.models.bert import BertConfig, BertModel

In [19]:
BertConfig()

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.48.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [20]:
import importlib.util


importlib.util.find_spec("torch") is not None

True

In [21]:
model_bert = BertModel.from_pretrained('google-bert/bert-base-cased')

## keep code checkpoint-agnostic

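Prefer `AutoModel` over an architecture-specific class such as `BertModel`: `AutoModel` reads the architecture from the checkpoint's config, so the same code keeps working when the checkpoint changes.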
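The idea behind the `Auto*` classes can be pictured as a lookup from the `model_type` field in a checkpoint's config (visible in the config dumps above) to a concrete class. The following is a toy illustration only: `ToyBertModel`, `ToyDistilBertModel`, `MODEL_REGISTRY`, and `toy_auto_model` are hypothetical names, and this is not how `transformers` is actually implemented.

```python
class ToyBertModel:
    def __init__(self, config):
        self.config = config


class ToyDistilBertModel:
    def __init__(self, config):
        self.config = config


# Hypothetical registry: model_type string -> class to instantiate.
MODEL_REGISTRY = {"bert": ToyBertModel, "distilbert": ToyDistilBertModel}


def toy_auto_model(config):
    """Pick the model class from the config instead of hard-coding it."""
    model_type = config["model_type"]
    try:
        model_cls = MODEL_REGISTRY[model_type]
    except KeyError:
        raise ValueError(f"Unknown model_type: {model_type!r}") from None
    return model_cls(config)


bert = toy_auto_model({"model_type": "bert", "num_hidden_layers": 12})
distil = toy_auto_model({"model_type": "distilbert", "n_layers": 6})
print(type(bert).__name__, type(distil).__name__)
```

The calling code never names an architecture, which is what makes it checkpoint-agnostic.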

# tokenizers

types:
- word tokenization
- character tokenization
- sub-word tokenization

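BERT uses a WordPiece tokenizer (a sub-word method).

Calling the `tokenizer` object directly is sufficient to create model-ready inputs. Its important arguments are:

- `padding`: `True` or `'longest'` (pad to the longest sequence in the batch), `'max_length'` (pad to `max_length`, or to the model's maximum input length if `max_length` is not given)
- `truncation`: `True` (cut sequences that exceed the maximum length)
- `return_tensors`: `'pt'` (PyTorch), `'tf'` (TensorFlow), `'np'` (NumPy)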
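As a rough illustration of how `padding`, `truncation`, and `max_length` interact, here is a toy function over plain lists of token IDs. `pad_and_truncate` is a hypothetical helper, not the library's code, and it simplifies in two ways: it requires an explicit `max_length` for `'max_length'` padding, and it truncates by slicing, without preserving a trailing special token.

```python
def pad_and_truncate(batch, padding="longest", truncation=False,
                     max_length=None, pad_id=0):
    """Toy padding/truncation over lists of token IDs (simplified sketch)."""
    if truncation and max_length is not None:
        batch = [seq[:max_length] for seq in batch]

    if padding in (True, "longest"):
        target = max(len(seq) for seq in batch)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("this sketch needs max_length for 'max_length' padding")
        target = max_length
    else:
        return batch  # padding disabled: lengths may differ

    return [seq + [pad_id] * (target - len(seq)) for seq in batch]


batch = [[101, 1045, 5223, 102], [101, 102]]

longest = pad_and_truncate(batch)
fixed = pad_and_truncate(batch, padding="max_length", max_length=6)
cut = pad_and_truncate(batch, padding="max_length", truncation=True, max_length=3)

print(longest)
print(fixed)
print(cut)
```

With `'longest'` the batch is only as wide as its longest member, while `'max_length'` gives every batch the same width, at the cost of more padding.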

In [23]:
from transformers.models.bert import BertTokenizer, BertTokenizerFast

In [41]:
from huggingface_hub import HfApi
api = HfApi()
models_list = api.list_models(filter="bert", author="google-bert")
for i in models_list:
    print(i)

ModelInfo(id='google-bert/bert-base-uncased', author=None, sha=None, created_at=datetime.datetime(2022, 3, 2, 23, 29, 4, tzinfo=datetime.timezone.utc), last_modified=None, private=False, disabled=None, downloads=84840766, downloads_all_time=None, gated=None, gguf=None, inference=None, likes=2095, library_name='transformers', tags=['transformers', 'pytorch', 'tf', 'jax', 'rust', 'coreml', 'onnx', 'safetensors', 'bert', 'fill-mask', 'exbert', 'en', 'dataset:bookcorpus', 'dataset:wikipedia', 'arxiv:1810.04805', 'license:apache-2.0', 'autotrain_compatible', 'endpoints_compatible', 'region:us'], pipeline_tag='fill-mask', mask_token=None, card_data=None, widget_data=None, model_index=None, config=None, transformers_info=None, trending_score=16, siblings=None, spaces=None, safetensors=None, security_repo_status=None)
ModelInfo(id='google-bert/bert-base-chinese', author=None, sha=None, created_at=datetime.datetime(2022, 3, 2, 23, 29, 4, tzinfo=datetime.timezone.utc), last_modified=None, privat

In [43]:
bert_tokenizer = BertTokenizer.from_pretrained('google-bert/bert-base-cased')


It is better to use `AutoTokenizer`, which picks the right tokenizer class for the checkpoint.

In [42]:
from transformers.models.auto import AutoTokenizer

In [44]:
auto_tokenizer = AutoTokenizer.from_pretrained('google-bert/bert-base-cased')

loading configuration file config.json from cache at /home/ruchirich/.cache/huggingface/hub/models--google-bert--bert-base-cased/snapshots/cd5ef92a9fb2f889e972770a36d4ed042daf221e/config.json
Model config BertConfig {
  "_name_or_path": "google-bert/bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.48.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}


In [48]:
word = "Pneumonoultramicroscopicsilicovolcanoconiosis"

In [52]:
tokens = auto_tokenizer.tokenize(word)
tokens

['P',
 '##ne',
 '##um',
 '##ono',
 '##ult',
 '##ram',
 '##ic',
 '##ros',
 '##copic',
 '##si',
 '##lic',
 '##ovo',
 '##l',
 '##cano',
 '##con',
 '##ios',
 '##is']

In [53]:
ids = auto_tokenizer.convert_tokens_to_ids(tokens)
ids

[153,
 1673,
 1818,
 23038,
 7067,
 4515,
 1596,
 5864,
 22258,
 5053,
 8031,
 18105,
 1233,
 17519,
 7235,
 10714,
 1548]

In [56]:
decoded_str = auto_tokenizer.decode(ids)
decoded_str

'Pneumonoultramicroscopicsilicovolcanoconiosis'
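WordPiece tokenization at inference time is usually described as greedy longest-match-first: take the longest prefix found in the vocabulary, then continue on the remainder, marking continuation pieces with `##`. The sketch below follows that description with a toy vocabulary containing only the pieces printed above. The function names are hypothetical, and the real tokenizer also normalizes and pre-splits the text, which this sketch skips.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


def detokenize(pieces):
    """Join pieces back into a word by dropping the '##' markers."""
    return "".join(p[2:] if p.startswith("##") else p for p in pieces)


# Toy vocabulary: only the pieces shown in the output above.
toy_vocab = {"P", "##ne", "##um", "##ono", "##ult", "##ram", "##ic", "##ros",
             "##copic", "##si", "##lic", "##ovo", "##l", "##cano", "##con",
             "##ios", "##is"}

word = "Pneumonoultramicroscopicsilicovolcanoconiosis"
pieces = wordpiece_tokenize(word, toy_vocab)
print(pieces)
print(detokenize(pieces))
```

Note how `##lic` wins over `##l` at one position because the longer match is tried first, while `##l` is used later where nothing longer fits.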

In [57]:
from transformers.models.auto import AutoModel, AutoTokenizer

In [74]:
list_models = api.list_models(filter="distilbert", author="distilbert")

for i in list_models:
    print(i.id)

distilbert/distilbert-base-uncased
distilbert/distilbert-base-uncased-finetuned-sst-2-english
distilbert/distilbert-base-cased-distilled-squad
distilbert/distilbert-base-cased
distilbert/distilbert-base-german-cased
distilbert/distilbert-base-multilingual-cased
distilbert/distilbert-base-uncased-distilled-squad


In [75]:
import torch

In [76]:
checkpoint = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"

distilbert = AutoModel.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

input_txt = "I enjoy learning stuff"



The two code cells below are not equivalent.

`encode()` adds the `[CLS]` and `[SEP]` special tokens (IDs 101 and 102 here), while `tokenize()` followed by `convert_tokens_to_ids()` does not.

In [78]:
encode_txt = tokenizer.encode(text=input_txt)
encode_txt

[101, 1045, 5959, 4083, 4933, 102]

In [80]:
tokenize_text = tokenizer.tokenize(text=input_txt)
ids = tokenizer.convert_tokens_to_ids(tokenize_text)
ids

[1045, 5959, 4083, 4933]
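The difference between the two results can be reproduced by hand: wrapping the plain IDs with the `[CLS]` and `[SEP]` IDs gives the `encode()` output. This is a small sketch with a hypothetical helper `add_special_tokens`; the IDs 101 and 102 are read off the outputs above and apply to this vocabulary only.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] IDs as seen in the encode() output


def add_special_tokens(ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Wrap a single sequence of token IDs as [CLS] ... [SEP]."""
    return [cls_id] + list(ids) + [sep_id]


ids = [1045, 5959, 4083, 4933]      # from tokenize() + convert_tokens_to_ids()
encoded = add_special_tokens(ids)   # matches the encode() result
print(encoded)  # prints [101, 1045, 5959, 4083, 4933, 102]
```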

In [91]:
input_tensors = torch.tensor(ids)

print(input_tensors)
print(input_tensors.shape)
print(input_tensors.dim())

tensor([1045, 5959, 4083, 4933])
torch.Size([4])
1


In [None]:
# the next line raises an error: the model expects a batch, i.e. a 2D tensor of shape (batch_size, sequence_length), but we passed a single 1D sequence
distilbert(input_tensors)

IndexError: too many indices for tensor of dimension 1

In [94]:
# instead do this

input_tensors = torch.tensor([ids]) # wrapping ids in a list adds the batch dimension

print(input_tensors)
print(input_tensors.shape)
print(input_tensors.dim())

tensor([[1045, 5959, 4083, 4933]])
torch.Size([1, 4])
2


In [134]:
from transformers.models.auto import AutoModelForSequenceClassification

In [135]:
distilbert_sequence_classification = AutoModelForSequenceClassification.from_pretrained(checkpoint)




In [136]:
output = distilbert(input_tensors)

print(output['last_hidden_state'].shape) # not logits: hidden states of shape (batch_size, sequence_length, hidden_dim)
print(output['last_hidden_state'].dim())
print(output['last_hidden_state'])

torch.Size([1, 4, 768])
3
tensor([[[ 0.9505,  0.7359,  0.0823,  ..., -0.2943,  0.0210, -0.4399],
         [ 0.5870,  0.6352,  0.3458,  ...,  0.0411,  0.6748, -0.0659],
         [ 0.4834,  0.6182,  0.4448,  ...,  0.0075,  0.7932, -0.0931],
         [ 0.7439,  0.6408,  0.4732,  ...,  0.2047,  0.7322,  0.0153]]],
       grad_fn=<NativeLayerNormBackward0>)


In [111]:
for i in range(4):
    print(f"token number {i}:")
    print(output['last_hidden_state'][0][i][0])
    print(output['last_hidden_state'][0][i][5])    

token number 0:
tensor(0.9505, grad_fn=<SelectBackward0>)
tensor(-0.6090, grad_fn=<SelectBackward0>)
token number 1:
tensor(0.5870, grad_fn=<SelectBackward0>)
tensor(-0.9413, grad_fn=<SelectBackward0>)
token number 2:
tensor(0.4834, grad_fn=<SelectBackward0>)
tensor(-0.8315, grad_fn=<SelectBackward0>)
token number 3:
tensor(0.7439, grad_fn=<SelectBackward0>)
tensor(-0.9864, grad_fn=<SelectBackward0>)


In [144]:
seq_outputs = distilbert_sequence_classification(input_tensors)
# print(seq_outputs.keys())
print(seq_outputs['logits'].shape) # logits of shape (batch_size, num_labels)
print(seq_outputs['logits'].dim())
print(seq_outputs['logits'])

torch.Size([1, 2])
2
tensor([[-1.0563,  1.4415]], grad_fn=<AddmmBackward0>)


## padding

A batch of sentences must form a rectangular tensor, so every sequence in the batch needs the same length:

- pad shorter sentences with the pad token to match the length of the longest sentence in the batch

In [147]:
tokenizer.pad_token_id

0
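To show what padding produces, the sketch below pads the two ID sequences from the first tokenizer call to a common length and builds the matching attention mask (1 for real tokens, 0 for padding). `pad_to_longest` is a hypothetical helper working on plain lists; the pad ID 0 is the `pad_token_id` printed above.

```python
def pad_to_longest(batch, pad_id=0):
    """Pad every sequence to the longest one and build an attention mask."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded token IDs of the two sentences from the first tokenizer example.
batch = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
     2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]

padded = pad_to_longest(batch)
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```

The attention mask is what lets the model ignore the padded positions, so padding should not change the prediction for the shorter sentence.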

# Final wrap-up

In [148]:
sequences = ["Deepseek misinformation sent the stock market in a frenzy", "Not bad"]
model_inputs = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

outputs = distilbert_sequence_classification(**model_inputs)

print(outputs.logits)

tensor([[ 3.5242, -2.9109],
        [-3.7709,  4.0236]], grad_fn=<AddmmBackward0>)


In [157]:
probs = torch.softmax(outputs.logits, dim=-1)
print(probs)
print(distilbert_sequence_classification.config.id2label)

tensor([[9.9840e-01, 1.6017e-03],
        [4.1182e-04, 9.9959e-01]], grad_fn=<SoftmaxBackward0>)
{0: 'NEGATIVE', 1: 'POSITIVE'}
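The last postprocessing step, turning probabilities into labels, can be sketched as an argmax followed by an `id2label` lookup. The values below are copied from the output above; `to_predictions` is a hypothetical helper, and the `label`/`score` dictionary format is chosen to resemble what a sentiment-analysis `pipeline()` returns.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities copied from the softmax output above.
probs = [[9.9840e-01, 1.6017e-03], [4.1182e-04, 9.9959e-01]]


def to_predictions(prob_rows, id2label):
    """Pick the most likely class per row and map its index to a label."""
    predictions = []
    for row in prob_rows:
        best = max(range(len(row)), key=row.__getitem__)  # argmax
        predictions.append({"label": id2label[best], "score": row[best]})
    return predictions


sequences = ["Deepseek misinformation sent the stock market in a frenzy", "Not bad"]
predictions = to_predictions(probs, id2label)

for text, pred in zip(sequences, predictions):
    print(f"{pred['label']:8s} {pred['score']:.4f}  {text}")
```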


# Final Quiz notes

Model head: also known as an adaptation head. It is the task-specific part placed on top of the base transformer, and it comes in different forms, such as a language modeling head, a question answering head, or a sequence classification head.
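As a rough picture of what a sequence classification head does, the toy below maps hidden states of shape (sequence_length, hidden_dim) to logits of shape (num_labels,) by applying one linear layer to the first token's vector. All names and numbers are made up for illustration; real heads are typically somewhat deeper, and this is not the architecture of any specific model.

```python
def linear(vector, weights, bias):
    """Compute weights @ vector + bias using plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def toy_classification_head(hidden_states, weights, bias):
    """Map (seq_len, hidden_dim) hidden states to (num_labels,) logits."""
    first_token = hidden_states[0]  # first token's vector as sequence summary
    return linear(first_token, weights, bias)


# Made-up values: seq_len=2, hidden_dim=3, num_labels=2.
hidden_states = [[0.5, -1.0, 2.0], [0.1, 0.2, 0.3]]
weights = [[1.0, 0.0, -1.0], [0.0, 2.0, 1.0]]
bias = [0.0, 0.5]

logits = toy_classification_head(hidden_states, weights, bias)
print(logits)
```

This mirrors the shape change seen earlier in the notes: the base model returns one vector per token, while the model with a head returns one score per label.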

The tokenizer and the model should always be loaded from the same checkpoint, so that the token IDs match the vocabulary the model was trained with.