<a href="https://colab.research.google.com/github/tsilva/aiml-notebooks/blob/main/hugging-face-nlp-course/wip-001-hf-nlp-course-part-1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hugging Face NLP Course - Part 1 🤗📚

This notebook summarizes my learnings from the [Hugging Face NLP Course](https://huggingface.co/learn/nlp-course/), covering chapters 1 to 4. 📚

## Setup 🛠️

Install the necessary libraries for this notebook: 📚💻

In [1]:
%pip install python-dotenv
%pip install transformers[sentencepiece]
%pip install scikit-learn scipy

Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


To access all models, you need an access token 🗝️ with the **Make calls to the serverless Inference API** permission. Create one [here](https://huggingface.co/settings/tokens) 🌐 and set it as the `HF_TOKEN` environment variable. 🖥️

Load the environment variables 🌍 and check that `HF_TOKEN` is available 🔑:

In [2]:
import os

# Load environment variables from .env (if available)
from dotenv import load_dotenv
load_dotenv()

# If running in Google Colab, copy
# required secrets into environment variables
try:
    from google.colab import userdata
    os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")
except Exception:
    # Not running in Colab, or the secret is unavailable; rely on .env instead
    pass

assert os.getenv("HF_TOKEN"), "You need to set the HF_TOKEN environment variable to run this notebook"

Now, let's check whether PyTorch can access an NVIDIA GPU 🖥️. A GPU is not mandatory, but it significantly speeds up inference and training ⚡. The cell below uses CUDA when it is available and falls back to the CPU otherwise, so it should run as-is on machines without a GPU ✅:

In [3]:
import torch

if torch.cuda.is_available():
    DEVICE = torch.device("cuda")
    print(f"✅ Using GPU: {torch.cuda.get_device_name(0)}")
else:
    DEVICE = torch.device("cpu")
    print("⚠️ Warning: No GPU detected. Using CPU.")

print(f"🔥 Device Selected: {DEVICE}")

✅ Using GPU: Tesla T4
🔥 Device Selected: cuda


If you got no errors, you're ready to go! 🚀📚

## Tasks 📝✨

Let's explore the `transformers` library 🔍 and use it to load pre-trained models 🤖 for various tasks ✨. The `pipeline` abstraction gives easy access to these models 🎯: when no model is specified, it falls back to a default checkpoint for the task.

### Sentiment analysis 😊📊

Classify the sentiment of one or more sentences. Since no model is specified, the pipeline uses the default sentiment-analysis model, `distilbert/distilbert-base-uncased-finetuned-sst-2-english`: 😊📊

In [4]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", device=DEVICE)
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
        "It hurts so good!",
        "While the service was certainly unique, it left a lasting impression that I won’t forget anytime soon."
    ]
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda


[{'label': 'POSITIVE', 'score': 0.9598046541213989},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455},
 {'label': 'NEGATIVE', 'score': 0.8600758910179138},
 {'label': 'POSITIVE', 'score': 0.9994854927062988}]

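The model assigned a label and a confidence score to each sentence 😊😞. The two unambiguous sentences received the expected labels, while "It hurts so good!" was labeled `NEGATIVE` with lower confidence (0.86) and the ambiguous, possibly sarcastic fourth sentence was labeled `POSITIVE` with very high confidence, so a high score does not guarantee a correct reading.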

### Zero-shot classification 🤖✨

In a `zero-shot-classification` task, the model scores how well an input matches a set of candidate labels supplied at inference time, even though it was never trained on those specific labels: 📊🤖✨

In [5]:
classifier = pipeline("zero-shot-classification", device=DEVICE)
classifier(
    "The ball curved beautifully into the top corner, leaving the goalkeeper with no chance.",
    candidate_labels=["sports", "art", "technology", "cooking", "nature"]
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda


{'sequence': 'The ball curved beautifully into the top corner, leaving the goalkeeper with no chance.',
 'labels': ['sports', 'technology', 'art', 'nature', 'cooking'],
 'scores': [0.3564995527267456,
  0.29446229338645935,
  0.25602230429649353,
  0.06628327816724777,
  0.026732558384537697]}

The model ranked `sports` 🏅 highest (0.36), but `technology` ⚙️ (0.29, perhaps the physics of a moving ball) and `art` 🎨 (0.26, perhaps "curved beautifully") were close behind. `nature` 🌳 and `cooking` 🍳 received much lower scores.
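
The result is a plain dictionary in which `labels` and `scores` are parallel lists sorted by descending score, and in this single-label mode the scores sum to roughly 1. A minimal sketch of post-processing it, using rounded values from the output above; the `threshold` of 0.2 is an arbitrary assumption chosen only for illustration:

```python
# Values copied (rounded) from the zero-shot output above.
result = {
    "labels": ["sports", "technology", "art", "nature", "cooking"],
    "scores": [0.3565, 0.2945, 0.2560, 0.0663, 0.0267],
}

# Pair each label with its score.
ranked = list(zip(result["labels"], result["scores"]))

# The first entry is the top prediction because the lists are sorted.
top_label, top_score = ranked[0]

# Keep only the labels above an arbitrary, illustrative threshold.
threshold = 0.2
plausible = [label for label, score in ranked if score >= threshold]

print(top_label, plausible)
```

With a margin this small between the top three labels, treating only the first label as "the" answer would discard useful information.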

### Text Generation ✍️✨

In a `text-generation` task ✍️, the model adds tokens ➕ to the right ➡️ of the given sentence.

In [6]:
generator = pipeline("text-generation", device=DEVICE)
generator(
    "In this course, we will teach you how to",
    max_length=30, # Cap the total length (prompt plus generated tokens) at 30 tokens
    num_return_sequences=2, # Generate 2 candidate sentence completions with the provided input as the prefix
)

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use the power of the imagination, the power of the imagination to make real choices about life.\n'},
 {'generated_text': 'In this course, we will teach you how to use your knowledge to find things out about yourself. Learn how you will learn the best way to build'}]

Notice how both completions continue the prompt fluently 📝. The second one stops mid-sentence, most likely because it reached the `max_length` limit ✍️.
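
Generation is autoregressive: the model predicts one token, appends it to the sequence, and repeats. The toy sketch below illustrates only that loop. `NEXT_TOKEN` is a hypothetical lookup table standing in for a real language model, and it chooses the next word from the last word alone, which a real model does not do:

```python
# Hypothetical stand-in for a language model: maps the last token to the next one.
NEXT_TOKEN = {
    "how": "to",
    "to": "use",
    "use": "transformers",
    "transformers": "<eos>",
}

def generate(prompt_tokens, max_length):
    """Append tokens to the right of the prompt until <eos> or max_length."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        next_token = NEXT_TOKEN.get(tokens[-1], "<eos>")
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

print(generate(["teach", "you", "how"], max_length=10))
print(generate(["teach", "you", "how"], max_length=4))
```

As in the pipeline call, `max_length` here counts the prompt tokens too, so a short limit can cut a completion off before the end-of-sequence token is reached.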

### Mask Filling 🎭✨

The `fill-mask` task predicts the most likely tokens for the position of the mask token, which is `<mask>` for the default model used here: 🧩🔍

In [7]:
unmasker = pipeline("fill-mask", device=DEVICE)
unmasker("This course will teach you all about <mask> models.", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda


[{'score': 0.19198444485664368,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.042091820389032364,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

The model ranked `mathematical` 📊 as the most likely token (score 0.19), followed by `computational` (0.04). Both fit the sentence context, although neither score is high, which suggests the model sees many plausible fillers.

### Named Entity Recognition 🏷️🔍

The `ner` task identifies spans of text that refer to named entities and tags each one with a category (e.g., people 🧑‍🤝‍🧑, organizations 🏢, locations 📍).

In [8]:
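# aggregation_strategy="simple" merges sub-word tokens into whole entities
# (it replaces the deprecated grouped_entities=True)
ner = pipeline("ner", aggregation_strategy="simple", device=DEVICE)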
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda


[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

The model detected a person's name (Sylvain) 👤, an organization (Hugging Face) 🏢, and a location (Brooklyn) 📍, pinpointing their exact positions in the text and indicating its confidence in the classifications.
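
The `start` and `end` fields are character offsets into the input string, so they can be used directly for slicing. A small sketch using the values from the output above (scores omitted); the redaction step is only an illustrative use and is not part of the course material:

```python
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Entities copied from the NER output above (scores omitted for brevity).
entities = [
    {"entity_group": "PER", "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "word": "Brooklyn", "start": 49, "end": 57},
]

# `start` is inclusive and `end` is exclusive, matching Python slicing.
spans = [text[e["start"]:e["end"]] for e in entities]
print(spans)

# Replace each entity with its tag, working right to left so that
# earlier offsets stay valid while the string changes length.
redacted = text
for e in sorted(entities, key=lambda e: e["start"], reverse=True):
    redacted = redacted[:e["start"]] + f"[{e['entity_group']}]" + redacted[e["end"]:]
print(redacted)
```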

### Question Answering ❓🤔

With the `question-answering` pipeline, the user inputs a context 📚 and a question ❓, and the model answers based on the context: 💡

In [9]:
question_answerer = pipeline("question-answering", device=DEVICE)
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda


{'score': 0.6949763894081116, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

The model answered correctly ✅ by extracting a span of the provided context 📚: `start` and `end` are the character offsets of the answer within the context.

### Summarization 📚✨

The `summarization` pipeline generates a summary 📄 of the input text ✍️:

In [10]:
summarizer = pipeline("summarization", device=DEVICE)
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda


[{'summary_text': ' The number of engineering graduates in the United States has declined in recent years . China and India graduate six and eight times as many traditional engineers as the U.S. does . Rapidly developing economies such as China continue to encourage and advance the teaching of engineering . There are declining offerings in engineering subjects dealing with infrastructure, infrastructure, the environment, and related issues .'}]

### Translation

The `translation` pipeline translates text between languages 🌍✍️. Here a model is selected explicitly, because each checkpoint supports specific source and target languages (French to English in this case) 🔄:

In [11]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en", device=DEVICE)
translator("Ce cours est produit par Hugging Face.")

Device set to use cuda


[{'translation_text': 'This course is produced by Hugging Face.'}]

Observe how the model successfully translated the text from French 🇫🇷 to English 🇬🇧.

### Feature Extraction 🔍✨

The `feature-extraction` pipeline returns the model's hidden states, i.e. one embedding vector per token of each input: 📝➡️🔍✨

In [12]:
feature_extractor = pipeline("feature-extraction", device=DEVICE)
result = feature_extractor([
    "Woodpecker",
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
    "Soccer"
])
result

No model was supplied, defaulted to distilbert/distilbert-base-cased and revision 6ea8117 (https://huggingface.co/distilbert/distilbert-base-cased).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda


[[[[0.4632350206375122,
    -0.008048655465245247,
    -0.12127996981143951,
    -0.3442820906639099,
    -0.3879909813404083,
    0.04655570536851883,
    0.33751821517944336,
    -0.022074762731790543,
    -0.0007292242371477187,
    -1.1915024518966675,
    -0.19076372683048248,
    0.21029765903949738,
    -0.23856507241725922,
    -0.15761561691761017,
    -0.37006816267967224,
    0.1643211394548416,
    0.2532157599925995,
    0.009089706465601921,
    -0.1204853504896164,
    0.018180161714553833,
    -0.02640751376748085,
    -0.2988360822200775,
    0.614218533039093,
    -0.18887680768966675,
    0.34349286556243896,
    -0.09803786873817444,
    0.3379552364349365,
    0.07254429161548615,
    -0.016977066174149513,
    0.4146110713481903,
    0.06607840210199356,
    0.21736297011375427,
    -0.15209658443927765,
    0.1434631645679474,
    -0.0830865353345871,
    0.12500301003456116,
    -0.14274904131889343,
    -0.20017790794372559,
    0.052241336554288864,
    -0.023

The model converts each token into a `768`-dimensional vector 📏. Pooling the token vectors of a sentence (for example, by averaging them) yields a single sentence embedding that is useful for clustering 📊, classification 🗂️, or dimensionality reduction 📉. By comparing the cosine similarity 🔍 of two sentence embeddings, we can assess how similar the corresponding sentences are 📝.
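
The pipeline output holds one vector per token, so comparing sentences requires a pooling step first. The sketch below shows mean pooling followed by cosine similarity in plain Python. The 3-dimensional vectors are made-up toy values, and `mean_pool` and `cosine_similarity` are hypothetical helper names, not library functions:

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into one sentence vector."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n_tokens for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "token embeddings" (real ones here have 768 dimensions).
sentence_a = [[1.0, 0.0, 1.0], [3.0, 0.0, 1.0]]  # two tokens
sentence_b = [[2.0, 0.0, 1.0]]                   # one token
sentence_c = [[0.0, 5.0, 0.0]]                   # one token

vec_a, vec_b, vec_c = (mean_pool(s) for s in (sentence_a, sentence_b, sentence_c))
print(cosine_similarity(vec_a, vec_b))  # same direction
print(cosine_similarity(vec_a, vec_c))  # orthogonal
```

Judging by the nesting of the printed output, `result[i][0]` would be the list of token vectors for sentence `i`, which could be passed to `mean_pool`. Mean pooling is only one option; other pooling strategies exist.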

## Using Transformers ⚡🤖

Transformers are neural network architectures that efficiently process and generate text using *self-attention* to understand word relationships. They can be categorized into three main types:

**GPT-like Models (Auto-Regressive Transformers)**  
- Models like GPT (Generative Pre-trained Transformer) are designed for **causal language modeling**, predicting the next word based on previous ones. 📝  
- They are **decoder-only models**, ideal for text generation tasks like story writing and chatbots. 💬  

**BERT-like Models (Auto-Encoding Transformers)**  
- BERT (Bidirectional Encoder Representations from Transformers) uses **masked language modeling (MLM)** to predict missing words in a sentence. 🔍  
- As **encoder-only models**, they focus on understanding input rather than generating text. 📖  
- These models excel in tasks requiring deep understanding, such as sentence classification, named entity recognition (NER), and question answering. ✅  

**BART/T5-like Models (Sequence-to-Sequence Transformers)**  
- These models combine encoder and decoder architectures, making them **encoder-decoder models** (sequence-to-sequence). 🔄  
- They are designed for **generative tasks requiring input**, such as machine translation, text summarization, and text-based question answering. 🌐  

Each Transformer type is optimized for different tasks, enhancing versatility in various natural language processing (NLP) applications. 🌟
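
One way to picture the difference between decoder-only and encoder-only models is through which positions each token may attend to. The sketch below is an illustration in plain Python, not how any library builds its masks, and the helper names are hypothetical. A causal model lets a token see only itself and earlier tokens, while a bidirectional model lets every token see the whole sequence:

```python
def causal_mask(n):
    """Row i marks the positions token i may attend to: itself and earlier ones."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every token may attend to every position."""
    return [[1] * n for _ in range(n)]

print("causal (GPT-like):")
for row in causal_mask(4):
    print(row)

print("bidirectional (BERT-like):")
for row in bidirectional_mask(4):
    print(row)
```

Encoder-decoder models combine both patterns: the encoder reads the input bidirectionally, and the decoder generates the output causally.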

### Biases 🤐

All models carry biases from their training data, and these must be taken into account in real-world applications ⚖️. For example, `bert-base-uncased` makes gender-stereotyped predictions when asked to fill in a profession 👨‍💼👩‍💼:

In [13]:
unmasker = pipeline("fill-mask", model="bert-base-uncased", device=DEVICE)
[x["token_str"] for x in unmasker("This man works as a [MASK].")], [x["token_str"] for x in unmasker("This woman works as a [MASK].")]

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another archite

Device set to use cuda


(['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor'],
 ['nurse', 'maid', 'teacher', 'waitress', 'prostitute'])

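The suggested professions differ sharply by gender and follow common stereotypes 🤐, which illustrates why model outputs should be audited before they are used in an application.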

### Behind the Pipeline 🔍💧

Let's look at what happens behind the `pipeline` abstraction. As a baseline, here is a `sentiment-analysis` task that still uses `pipeline`, this time with the `distilbert/distilbert-base-uncased-finetuned-sst-2-english` model specified explicitly: 📊✨

In [14]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", device=DEVICE)
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

Device set to use cuda


[{'label': 'POSITIVE', 'score': 0.9598046541213989},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

To set up the task manually, load the **model** 🏗️ and **tokenizer**, preprocess the input text 📄, and pass it through the model. Each model has a specific tokenizer 🔑 that converts text into tokens. The `AutoTokenizer` class automatically selects the correct tokenizer for a model ✅:

In [15]:
from transformers import AutoTokenizer

model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased-finetuned-sst-2-english', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

Now that we have the tokenizer 🧩, we can preprocess the input text 📄 by tokenizing it and converting it to input IDs for the model. We will pad shorter sequences to the length of the longest one in the batch 📏, truncate sequences that exceed the model's maximum length ✂️, and return the token IDs as PyTorch tensors 🔢:

In [16]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(
    raw_inputs, # The texts to tokenize
    padding=True, # Pad the inputs to the length of the longest sequence in the batch
    truncation=True, # Truncate the text to the maximum length the model can accept
    return_tensors="pt" # Return PyTorch tensors
).to(DEVICE)
(
    inputs.keys(),
    inputs['input_ids'].shape,
    inputs["input_ids"],
    inputs['attention_mask'].shape,
    inputs["attention_mask"]
)

(dict_keys(['input_ids', 'attention_mask']),
 torch.Size([2, 16]),
 tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
           2607,  2026,  2878,  2166,  1012,   102],
         [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
              0,     0,     0,     0,     0,     0]], device='cuda:0'),
 torch.Size([2, 16]),
 tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0'))

The tokenizer returns a dictionary with two keys: `input_ids`, which holds the token IDs of each sentence 📝, and `attention_mask`, a binary mask in which `1` marks real tokens and `0` marks padding 🛡️. The mask tells the model to ignore the padded positions 🚫. Both tensors have the shape `(2, 16)`: the batch size (two sentences) 📄📄 and the length of the longest sequence in the batch (16 tokens) 🔢.
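
To make the padding step concrete, here is a plain-Python sketch that rebuilds the padded IDs and the attention mask from the unpadded token IDs shown above. `pad_batch` is a hypothetical helper written for illustration; it is not the tokenizer's actual implementation:

```python
PAD_ID = 0  # [PAD] has ID 0 in this tokenizer's vocabulary (see the tokenizer output above)

# Unpadded token IDs, copied from the tokenizer output above.
batch = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]

def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest one and build the attention mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask

input_ids, attention_mask = pad_batch(batch, PAD_ID)
print(len(input_ids), len(input_ids[0]))  # batch size, padded length
print(attention_mask[1])
```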

To retrieve the model, use the `AutoModel` class: 📦✨

In [17]:
from transformers import AutoModel

model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(model_checkpoint).to(DEVICE)
model

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): DistilBertSdpaAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): L

With the model loaded 📦 and sentences tokenized 📝, we can now process our inputs through the model: 🚀

In [18]:
model(**inputs)

BaseModelOutput(last_hidden_state=tensor([[[-0.1798,  0.2333,  0.6321,  ..., -0.3017,  0.5008,  0.1481],
         [ 0.2758,  0.6497,  0.3200,  ..., -0.0760,  0.5136,  0.1329],
         [ 0.9046,  0.0985,  0.2950,  ...,  0.3352, -0.1407, -0.6464],
         ...,
         [ 0.1466,  0.5661,  0.3235,  ..., -0.3376,  0.5100, -0.0561],
         [ 0.7500,  0.0487,  0.1738,  ...,  0.4684,  0.0030, -0.6084],
         [ 0.0519,  0.3729,  0.5223,  ...,  0.3584,  0.6500, -0.3883]],

        [[-0.2937,  0.7283, -0.1497,  ..., -0.1187, -1.0227, -0.0422],
         [-0.2206,  0.9384, -0.0951,  ..., -0.3643, -0.6605,  0.2407],
         [-0.1536,  0.8988, -0.0728,  ..., -0.2189, -0.8528,  0.0710],
         ...,
         [-0.3017,  0.9002, -0.0200,  ..., -0.1082, -0.8412, -0.0861],
         [-0.3338,  0.9674, -0.0729,  ..., -0.1952, -0.8181, -0.0634],
         [-0.3454,  0.8824, -0.0426,  ..., -0.0993, -0.8329, -0.1065]]],
       device='cuda:0', grad_fn=<NativeLayerNormBackward0>), hidden_states=None, a

The code above is equivalent to: 🔄

In [19]:
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
outputs

BaseModelOutput(last_hidden_state=tensor([[[-0.1798,  0.2333,  0.6321,  ..., -0.3017,  0.5008,  0.1481],
         [ 0.2758,  0.6497,  0.3200,  ..., -0.0760,  0.5136,  0.1329],
         [ 0.9046,  0.0985,  0.2950,  ...,  0.3352, -0.1407, -0.6464],
         ...,
         [ 0.1466,  0.5661,  0.3235,  ..., -0.3376,  0.5100, -0.0561],
         [ 0.7500,  0.0487,  0.1738,  ...,  0.4684,  0.0030, -0.6084],
         [ 0.0519,  0.3729,  0.5223,  ...,  0.3584,  0.6500, -0.3883]],

        [[-0.2937,  0.7283, -0.1497,  ..., -0.1187, -1.0227, -0.0422],
         [-0.2206,  0.9384, -0.0951,  ..., -0.3643, -0.6605,  0.2407],
         [-0.1536,  0.8988, -0.0728,  ..., -0.2189, -0.8528,  0.0710],
         ...,
         [-0.3017,  0.9002, -0.0200,  ..., -0.1082, -0.8412, -0.0861],
         [-0.3338,  0.9674, -0.0729,  ..., -0.1952, -0.8181, -0.0634],
         [-0.3454,  0.8824, -0.0426,  ..., -0.0993, -0.8329, -0.1065]]],
       device='cuda:0', grad_fn=<NativeLayerNormBackward0>), hidden_states=None, a

Notice, however, that the output contains hidden states but no logits. `AutoModel` loads only the base model, without a task-specific head. For our classification task, we need to load the model with a classification head using `AutoModelForSequenceClassification`: 📦✨

In [20]:
from transformers import AutoModelForSequenceClassification

model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint).to(DEVICE)
outputs = model(**inputs)
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], device='cuda:0', grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

The model outputs logits (raw, unnormalized scores) for two sentences across two classes (negative 😞 and positive 😊 sentiment). We can apply the `softmax` function to convert these logits into probabilities:

In [21]:
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], device='cuda:0', grad_fn=<SoftmaxBackward0>)
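
The softmax step can be reproduced by hand to see what it does: each logit is exponentiated and divided by the sum of the exponentials in its row. A minimal plain-Python sketch, using the logits printed above; the `softmax` helper here is written for illustration and is not the PyTorch implementation:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum avoids overflow without changing the result.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the model output above.
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print([round(p, 4) for p in row])
```

The printed values should agree with the tensor above up to rounding, since the logits used here are themselves rounded to four decimals.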


We can inspect predicted labels using the model's config 🛠️📊:

In [22]:
labels = model.config.id2label
labels

{0: 'NEGATIVE', 1: 'POSITIVE'}

And use this to print the probability of each sentence being positive 😊 or negative 😞:

In [23]:
for i, sentence in enumerate(raw_inputs):
    print("\n" + sentence)
    for index, prediction in enumerate(predictions[i]):
        label = labels[index]
        print(f"{index} ({label}): {prediction.item() * 100.0:.2f}%")


I've been waiting for a HuggingFace course my whole life.
0 (NEGATIVE): 4.02%
1 (POSITIVE): 95.98%

I hate this so much!
0 (NEGATIVE): 99.95%
1 (POSITIVE): 0.05%
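
To get back to the pipeline-style result from the start of this section, keep only the highest-probability class for each sentence. A short sketch using the values printed above; the result format mimics the pipeline output but is assembled by hand here:

```python
# Values copied from the outputs above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probabilities = [[0.040195, 0.95980], [0.99946, 0.00054418]]

results = []
for row in probabilities:
    # argmax: index of the largest probability in the row
    best_index = max(range(len(row)), key=lambda i: row[i])
    results.append({"label": id2label[best_index], "score": row[best_index]})

print(results)
```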


`AutoModel` automatically instantiates the correct architecture for the specified checkpoint 🗂️, but you can also instantiate a specific model class directly from a configuration 🛠️:

In [24]:
from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config).to(DEVICE)
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

Let's also inspect the model `config` 🔍:

In [25]:
config

BertConfig {
  "_attn_implementation_autoset": true,
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.48.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
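
A few of these config values can be tied back to the layer shapes printed earlier 🔢. The arithmetic below is a small sketch that uses only numbers shown in the config; the variable names are mine:

```python
# Values copied from the BertConfig printed above
hidden_size = 768
vocab_size = 30522
max_position_embeddings = 512
type_vocab_size = 2
num_attention_heads = 12

# Each attention head works on an equal slice of the hidden vector
head_size = hidden_size // num_attention_heads

# Parameters in the embeddings block: three lookup tables
# plus the LayerNorm weight and bias
embedding_params = (
    vocab_size * hidden_size
    + max_position_embeddings * hidden_size
    + type_vocab_size * hidden_size
    + 2 * hidden_size
)
print(head_size, embedding_params)
```

This lines up with the `Embedding(30522, 768)`, `Embedding(512, 768)` and `Embedding(2, 768)` layers in the model printout.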

`BertModel(config)` creates a model with randomly initialized weights 🏗️, so its outputs are meaningless until it is trained. We can also load a pre-trained model 📦:

In [26]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased").to(DEVICE)
model

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(28996, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

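Let's run tokenized sentences through the model 🏃‍♂️💬🧠. Note that the hard-coded IDs below match the `bert-base-uncased` vocabulary (where `!` is 999 and `.` is 1012, as the uncased outputs later in this notebook show), while the model loaded above is `bert-base-cased` (where they are 106 and 119). The call still runs, but the cased model reads most of these IDs as different tokens than the comments suggest: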

In [27]:
model(torch.tensor([
    [101, 7592, 999, 102], # "Hello!"
    [101, 4658, 1012, 102], # "Cool."
    [101, 3835, 999, 102], # "Nice!"
]).to(DEVICE))

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 4.4496e-01,  4.8276e-01,  2.7797e-01,  ..., -5.4033e-02,
           3.9393e-01, -9.4770e-02],
         [ 2.4943e-01, -4.4093e-01,  8.1772e-01,  ..., -3.1917e-01,
           2.2992e-01, -4.1172e-02],
         [ 1.3668e-01,  2.2518e-01,  1.4502e-01,  ..., -4.6915e-02,
           2.8224e-01,  7.5566e-02],
         [ 1.1789e+00,  1.6738e-01, -1.8187e-01,  ...,  2.4671e-01,
           1.0441e+00, -6.1966e-03]],

        [[ 3.6436e-01,  3.2464e-02,  2.0258e-01,  ...,  6.0111e-02,
           3.2451e-01, -2.0996e-02],
         [ 7.1866e-01, -4.8725e-01,  5.1740e-01,  ..., -4.4012e-01,
           1.4553e-01, -3.7545e-02],
         [ 3.3223e-01, -2.3271e-01,  9.4876e-02,  ..., -2.5268e-01,
           3.2172e-01,  8.1144e-04],
         [ 1.2523e+00,  3.5754e-01, -5.1321e-02,  ..., -3.7840e-01,
           1.0526e+00, -5.6255e-01]],

        [[ 2.4042e-01,  1.4718e-01,  1.2110e-01,  ...,  7.6061e-02,
           3.3564e-01,  2

### Tokenizers 🏷️✨

Tokenization divides text into smaller units, crucial for natural language processing (**NLP**). The three main types are **word-based, character-based, and subword-based**:

- **Word-Based Tokenization**  
  - Splits text by spaces or punctuation, assigning each word an ID. ✍️  
  - Requires a **large vocabulary** (~500K words in English). 📚  
  - Struggles with unknown words, represented as `[UNK]` or `<unk>`. ❓  

- **Character-Based Tokenization**  
  - Assigns an ID to each character, resulting in a **small vocabulary** and fewer unknown tokens. 🔤  
  - Produces much longer input sequences, and individual characters carry less meaning than words. 📏  
  - Less of a drawback for languages like **Chinese**, where each character carries more meaning. 🇨🇳  

- **Subword-Based Tokenization**  
  - Combines word and character tokenization by keeping common words whole and splitting rarer ones into meaningful parts. 🔗  
  - Example: **"modernization"** can be split into **"modern"** and **"ization"**, aiding models in understanding word structures while managing vocabulary size. 🏗️  

- **Common Subword Tokenization Algorithms**  
  - **Byte-level BPE (Byte Pair Encoding)**: used in **GPT-2**. 🌐  
  - **WordPiece**: used in **BERT**. 🧠  
  - **SentencePiece or Unigram**: used in several **multilingual models**. 📝  

Each method addresses different NLP needs, balancing vocabulary size, efficiency, and meaning preservation. ⚖️
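
To see how a subword split like "modern" + "ization" can come about, here is a minimal sketch of greedy longest-match-first splitting, the way WordPiece-style vocabularies are typically applied 🔗. The vocabulary is a hypothetical toy one and `wordpiece_tokenize` is my own helper, not a `transformers` function; the `##` prefix marks pieces that continue a word:

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary
vocab = {"modern", "##ization", "token", "##ize", "##r", "the"}

print(wordpiece_tokenize("modernization", vocab))
print(wordpiece_tokenize("tokenizer", vocab))
print(wordpiece_tokenize("xyz", vocab))
```

Common words stay whole when they are in the vocabulary, rarer words are broken into known pieces, and only words with no matching pieces fall back to `[UNK]`.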

To load a tokenizer in `transformers` 🛠️, you can use the specific model architecture class 🏗️, just like with models:

In [28]:
from transformers import BertTokenizer

BertTokenizer.from_pretrained("bert-base-cased")

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

BertTokenizer(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

Use the `AutoTokenizer` class 🛠️ to automatically select the correct tokenizer for a model 📚:

In [29]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer

BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

Tokenizing a sentence ✍️ means splitting it into tokens 🔤 and encoding them as token IDs 🔢. Calling the tokenizer on a sentence does both and returns a dictionary 📚 with the token IDs (`input_ids`), token type IDs, and an attention mask 🎭:

In [30]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

You can also perform each step manually: 🛠️✨

In [31]:
tokens = tokenizer.tokenize("Using a Transformer network is simple")
tokens

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']

With the sentence split into "subword" tokens 📝, we can encode them into token IDs 🔢 that the model understands 🤖:

In [32]:
ids = tokenizer.convert_tokens_to_ids(tokens)
ids

[7993, 170, 13809, 23763, 2443, 1110, 3014]

You can decode these token IDs back into text: 🔑📜

In [33]:
tokenizer.decode(ids)

'Using a Transformer network is simple'
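
The manual steps give the same IDs as calling the tokenizer directly, minus the first and last ID (101 and 102, the `[CLS]` and `[SEP]` special tokens listed in the tokenizer printout). The sketch below reproduces that relationship with a tiny dictionary built from the outputs above 🧩; `encode` is a hypothetical helper, not a `transformers` method:

```python
# Token-to-ID pairs copied from the bert-base-cased outputs above
vocab = {
    "[CLS]": 101, "[SEP]": 102,
    "Using": 7993, "a": 170, "Trans": 13809, "##former": 23763,
    "network": 2443, "is": 1110, "simple": 3014,
}

def convert_tokens_to_ids(tokens):
    """Look up each token in the vocabulary."""
    return [vocab[token] for token in tokens]

def encode(tokens):
    """Hypothetical helper: wrap the IDs with [CLS] and [SEP], as the direct call does."""
    input_ids = [vocab["[CLS]"]] + convert_tokens_to_ids(tokens) + [vocab["[SEP]"]]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]
manual_ids = convert_tokens_to_ids(tokens)
encoded = encode(tokens)
print(manual_ids)
print(encoded)
```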

You can pass the IDs to the model as follows: 📥✨

In [34]:
model(torch.tensor([ids]).to(DEVICE))

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.6599,  0.1682, -0.8506,  ...,  0.1580,  1.0142,  0.3816],
         [ 0.3342, -0.0611,  0.1166,  ...,  0.0587,  0.5397,  0.3626],
         [ 0.2631,  0.7993,  0.6173,  ...,  0.3468,  0.1786,  0.1109],
         ...,
         [ 0.0731,  0.0760, -0.2084,  ...,  0.4211, -0.1886,  0.4988],
         [-0.0720,  0.1424,  0.3000,  ...,  0.5437,  0.4905,  0.4409],
         [ 0.2023, -0.0035,  0.2196,  ...,  0.1811,  0.4516,  0.5694]]],
       device='cuda:0', grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.8262,  0.7780,  1.0000, -0.9997,  0.9582,  0.5192,  0.9984,  0.2792,
         -0.9938, -0.6819,  0.9975,  0.9999, -0.7594, -0.9999,  0.1124, -0.9975,
          0.9991, -0.9207, -1.0000, -0.5364,  0.6101, -1.0000,  0.5347,  0.7747,
          0.9986,  0.2676,  0.9990,  1.0000,  0.8480,  0.7375,  0.2738, -0.9994,
          0.9288, -0.9999,  0.6452, -0.7292, -0.0798, -0.7764,  0.7578, -0.9621,
         -0.871

Remember, this won't work ❌: all sequences must be the same length 📏.

In [35]:
try:
    model(torch.tensor([
        [200, 200, 200],
        [200, 200] # sequence length is different
    ]).to(DEVICE))
except Exception as e:
    print(e)

expected sequence of length 3 at dim 1 (got 2)


When sequences differ in length, pad them to match. 📏✨

In [36]:
model(torch.tensor([
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id]
]).to(DEVICE))

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.7171,  0.3900,  0.7794,  ...,  0.2437,  0.3144,  0.2805],
         [-0.9494,  0.5553,  0.8574,  ...,  0.2297,  0.4452,  0.2730],
         [-0.9682,  0.5910,  0.8699,  ...,  0.2472,  0.4314,  0.2384]],

        [[-0.5893,  0.3276,  0.7874,  ...,  0.0994,  0.4246,  0.7273],
         [-0.4967, -0.0064,  0.9259,  ..., -0.0375,  0.4893,  0.6086],
         [-0.4615,  0.0833,  0.8920,  ..., -0.0974,  0.6483,  0.5301]]],
       device='cuda:0', grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.5442,  0.3112,  0.9957,  ...,  0.9987, -0.2157,  0.9938],
        [-0.5608,  0.3997,  0.9990,  ...,  0.9997, -0.3815,  0.9989]],
       device='cuda:0', grad_fn=<TanhBackward0>), hidden_states=None, past_key_values=None, attentions=None, cross_attentions=None)

Compare the output for a two-token sequence with the output for the same sequence followed by a padding token: 📢✨

In [37]:
(
    model(torch.tensor([[200, 200]]).to(DEVICE)),
    model(torch.tensor([[200, 200, tokenizer.pad_token_id]]).to(DEVICE))
)

(BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.5830,  0.3394,  0.6170,  ...,  0.2534,  0.2890,  0.2582],
          [-0.8096,  0.4996,  0.6657,  ...,  0.2510,  0.4012,  0.2330]]],
        device='cuda:0', grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.5251,  0.2551,  0.9945, -0.9736,  0.8164,  0.4568,  0.9330, -0.5746,
          -0.9240, -0.5941,  0.9378,  0.9920, -0.9549, -0.9854,  0.1041, -0.9097,
           0.9650, -0.1588, -0.9986, -0.1916, -0.2939, -0.9908,  0.2761,  0.9039,
           0.8641,  0.2132,  0.9837,  0.9982,  0.6723,  0.7878,  0.1319, -0.9740,
           0.1322, -0.9905,  0.2183, -0.1145, -0.4426, -0.1284, -0.2218, -0.7207,
          -0.6820, -0.1394, -0.5355, -0.3160,  0.7192,  0.5585,  0.4866, -0.1998,
          -0.2854,  0.9901, -0.9256,  0.9979, -0.8718,  0.9911,  0.9952,  0.6159,
           0.9896,  0.1895, -0.8932,  0.6724,  0.9092, -0.0017,  0.9505, -0.1938,
          -0.7072, -0.7293, -0.1320,  0.0882, -0.7021,  0.60

The outputs change once the padding token is appended 🤔 because, without a mask, the model attends to it like any other token. 🔄 To prevent this, pass an `attention_mask` (the tokenizer builds one for you) that tells the model to ignore padding tokens: 🚫

In [38]:
(
    model(
        torch.tensor([[200, 200]]).to(DEVICE),
        attention_mask=torch.tensor([[1, 1]]).to(DEVICE)
    ),
    model(
        torch.tensor([[200, 200, tokenizer.pad_token_id]]).to(DEVICE),
        attention_mask=torch.tensor([[1, 1, 0]]).to(DEVICE)
    )
)

(BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.5830,  0.3394,  0.6170,  ...,  0.2534,  0.2890,  0.2582],
          [-0.8096,  0.4996,  0.6657,  ...,  0.2510,  0.4012,  0.2330]]],
        device='cuda:0', grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.5251,  0.2551,  0.9945, -0.9736,  0.8164,  0.4568,  0.9330, -0.5746,
          -0.9240, -0.5941,  0.9378,  0.9920, -0.9549, -0.9854,  0.1041, -0.9097,
           0.9650, -0.1588, -0.9986, -0.1916, -0.2939, -0.9908,  0.2761,  0.9039,
           0.8641,  0.2132,  0.9837,  0.9982,  0.6723,  0.7878,  0.1319, -0.9740,
           0.1322, -0.9905,  0.2183, -0.1145, -0.4426, -0.1284, -0.2218, -0.7207,
          -0.6820, -0.1394, -0.5355, -0.3160,  0.7192,  0.5585,  0.4866, -0.1998,
          -0.2854,  0.9901, -0.9256,  0.9979, -0.8718,  0.9911,  0.9952,  0.6159,
           0.9896,  0.1895, -0.8932,  0.6724,  0.9092, -0.0017,  0.9505, -0.1938,
          -0.7072, -0.7293, -0.1320,  0.0882, -0.7021,  0.60

With the mask in place, the outputs for the real tokens match 🔄 even though the input sequences have different lengths 📏.
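
Why does the mask help? Here is a toy single-head self-attention in pure Python 🎭, a rough sketch rather than BERT's actual implementation. The embeddings are made up, and `attention` is a hypothetical helper in which masked positions get a score of negative infinity, so their softmax weight becomes zero:

```python
import math

def attention(vectors, mask):
    """Toy self-attention: each position averages all visible positions,
    weighted by a softmax over scaled dot products. mask[j] == 0 hides position j."""
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale if visible else float("-inf")
            for key, visible in zip(vectors, mask)
        ]
        top = max(scores)
        weights = [math.exp(score - top) for score in scores]  # exp(-inf) == 0.0
        total = sum(weights)
        outputs.append([
            sum(w / total * value[dim] for w, value in zip(weights, vectors))
            for dim in range(len(query))
        ])
    return outputs

real_tokens = [[1.0, 0.0], [0.0, 1.0]]  # made-up embeddings
pad = [0.5, 0.5]                        # made-up padding embedding

unpadded = attention(real_tokens, mask=[1, 1])
padded_unmasked = attention(real_tokens + [pad], mask=[1, 1, 1])
padded_masked = attention(real_tokens + [pad], mask=[1, 1, 0])

print(unpadded[0], padded_unmasked[0], padded_masked[0])
```

The first position gets the same output with and without padding once the padding is masked, while the unmasked version drifts toward the padding vector.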

### Putting it all together 🧩✨

To recap, pass a single sentence to the tokenizer: ✍️📜

In [39]:
sequence = "I've been waiting for a HuggingFace course my whole life."
tokenizer(sequence)

{'input_ids': [101, 146, 112, 1396, 1151, 2613, 1111, 170, 20164, 10932, 2271, 7954, 1736, 1139, 2006, 1297, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

You can pass a list of sentences: 📝✨

In [40]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
tokenizer(sequences)

{'input_ids': [[101, 146, 112, 1396, 1151, 2613, 1111, 170, 20164, 10932, 2271, 7954, 1736, 1139, 2006, 1297, 119, 102], [101, 1573, 1138, 146, 106, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

Pad to the longest sequence in the batch: 📏✨

In [41]:
tokenizer(sequences, padding="longest")

{'input_ids': [[101, 146, 112, 1396, 1151, 2613, 1111, 170, 20164, 10932, 2271, 7954, 1736, 1139, 2006, 1297, 119, 102], [101, 1573, 1138, 146, 106, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}

Pad to the model's maximum sequence length: 📏✨

In [42]:
tokenizer(sequences, padding="max_length")

{'input_ids': [[101, 146, 112, 1396, 1151, 2613, 1111, 170, 20164, 10932, 2271, 7954, 1736, 1139, 2006, 1297, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

You can also pad to a custom length of `N` tokens 📏 by passing `max_length=N`. Sequences shorter than `N` tokens are padded to `N` ➕, but longer sequences are left as they are unless you also pass `truncation=True` ✂️, as the 18-token first sequence in the output below shows:

In [43]:
tokenizer(sequences, padding="max_length", max_length=8)

{'input_ids': [[101, 146, 112, 1396, 1151, 2613, 1111, 170, 20164, 10932, 2271, 7954, 1736, 1139, 2006, 1297, 119, 102], [101, 1573, 1138, 146, 106, 102, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0]]}
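
The padding options can be summed up in a few lines of plain Python 📏. This is a sketch under my own assumptions, not the tokenizer's code: `pad_batch` is a hypothetical helper, and its truncation simply keeps the closing `[SEP]`, which only approximates what a real tokenizer does:

```python
def pad_batch(batch, pad_id=0, max_length=None, truncation=False):
    """Hypothetical helper mimicking the padding options shown above.
    max_length=None pads to the longest sequence in the batch."""
    target = max_length if max_length is not None else max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        if truncation and len(seq) > target:
            seq = seq[:target - 1] + seq[-1:]  # cut, but keep the closing [SEP]
        padding = max(target - len(seq), 0)
        input_ids.append(seq + [pad_id] * padding)
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token IDs copied from the outputs above
long_seq = [101, 146, 112, 1396, 1151, 2613, 1111, 170, 20164, 10932,
            2271, 7954, 1736, 1139, 2006, 1297, 119, 102]
short_seq = [101, 1573, 1138, 146, 106, 102]

longest = pad_batch([long_seq, short_seq])
fixed = pad_batch([long_seq, short_seq], max_length=8)
fixed_truncated = pad_batch([long_seq, short_seq], max_length=8, truncation=True)

print([len(ids) for ids in longest["input_ids"]])
print([len(ids) for ids in fixed["input_ids"]])
print([len(ids) for ids in fixed_truncated["input_ids"]])
```

Only the last call yields a rectangular batch of length 8, which is what a tensor needs.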

When calling the tokenizer directly, it tokenizes the input ✍️ and adds special tokens like `[CLS]` 🔖 and `[SEP]` 🔖:

In [44]:
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence) # Plain Python lists, so there is nothing to move to DEVICE
tokenizer.decode(model_inputs["input_ids"])

"[CLS] I ' ve been waiting for a HuggingFace course my whole life. [SEP]"

This won't occur if you tokenize ✂️ and encode 🔒 the input separately:

In [45]:
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
tokenizer.decode(ids)

"I ' ve been waiting for a HuggingFace course my whole life."

### Fine-tuning a Pretrained Model 🔧🤖

Here's how to fine-tune an existing model. Let's first load a batch of sequences to be classified: 🔧✨

In [46]:
import torch
from torch.optim import AdamW # The AdamW shipped with transformers is deprecated in favor of PyTorch's
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Retrieve the model and tokenizer
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint).to(DEVICE)

# Tokenize two sentences
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(
    sequences,
    padding=True,
    truncation=True,
    return_tensors="pt"
).to(DEVICE)
batch

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  2023,  2607,  2003,  6429,   999,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0]], device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0')}

The tokenizer provided sequences for the model 📊. By adding expected labels 🏷️, we can use this batch structure to fine-tune the model 🔧:

In [47]:
# Add labels to the batch (both sentences are positive)
batch["labels"] = torch.tensor([1, 1]).to(DEVICE)

# Create an instance of AdamW optimizer
optimizer = AdamW(model.parameters())



You can now run the cell below to perform an optimization step on the batch. If you run it multiple times, the loss should go down: 🔧📉

In [48]:
# Forward pass the batch through the model and retrieve the calculated loss
loss = model(**batch).loss

# Backward pass to calculate the gradients
loss.backward()

# Perform a single optimization step to update the model's parameters based on the calculated gradients
optimizer.step()

# Reset the gradients so they don't accumulate when the cell is run again
optimizer.zero_grad()

# Print loss
loss

tensor(0.7847, device='cuda:0', grad_fn=<NllLossBackward0>)
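
To show the forward pass, backward pass, optimizer step and gradient reset in isolation, here is a stand-in written in plain Python 🔁. It is not what `transformers` or `AdamW` do internally: the "model" is a one-feature logistic regression, the data is made up, and the update is plain gradient descent:

```python
import math

def forward(weight, bias, x, label):
    """Logistic-regression forward pass returning the loss and the predicted probability."""
    probability = 1.0 / (1.0 + math.exp(-(weight * x + bias)))
    loss = -math.log(probability if label == 1 else 1.0 - probability)
    return loss, probability

# Made-up one-feature "batch" of (input, label) pairs, both positive
batch = [(0.5, 1), (1.5, 1)]
weight, bias, learning_rate = 0.0, 0.0, 0.5

losses = []
for step in range(5):
    grad_weight = grad_bias = 0.0          # reset gradients (the role of zero_grad)
    total_loss = 0.0
    for x, label in batch:                 # forward pass
        loss, probability = forward(weight, bias, x, label)
        total_loss += loss / len(batch)
        # backward pass: d(loss)/d(logit) = probability - label
        grad_weight += (probability - label) * x / len(batch)
        grad_bias += (probability - label) / len(batch)
    weight -= learning_rate * grad_weight  # optimizer step
    bias -= learning_rate * grad_bias
    losses.append(total_loss)

print([round(loss, 4) for loss in losses])
```

Each repetition lowers the loss on this batch, which is the behavior to expect when re-running the optimization cell above.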

We can't achieve much with a small fine-tuning dataset, so let's retrieve a larger one using Hugging Face's `datasets` library 📚. First, we need to install it: 🛠️

In [49]:
%pip install datasets

Collecting datasets
  Downloading datasets-3.3.1-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.3.1-py3-none-any.whl (484 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 484.9/484.9 kB 31.8 MB/s eta 0:00:00
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 11.4 MB/s eta 0:00:00
Downloading multiprocess-0.70.16-py311-none-any.whl (143 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 143.5/143.5 kB 13.8 MB/s eta 0:00:00
Downloading

Now, we retrieve the [`MRPC`](https://paperswithcode.com/dataset/mrpc) dataset 📊, which contains sentence pairs 🗣️ labeled as paraphrases 🔄 or not:

In [50]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

README.md:   0%|          | 0.00/35.3k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/649k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/75.7k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/308k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/3668 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/408 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1725 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

The dataset is divided into `train`, `validation`, and `test` splits 📊. Each example contains two sentences ✍️✍️ and a label 🏷️ indicating whether they are paraphrases. Let's examine the dataset features: 🔍

In [51]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

Let's retrieve an example from the training set: 📚✨

In [52]:
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

To train on this dataset 📊, each pair of sentences is tokenized into a single sequence ✍️, with the two sentences separated by the `[SEP]` token 🔗. The tokenizer does this automatically when given two sentences:

In [53]:
inputs = tokenizer("This is the first sentence.", "This is the second one.")
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '.',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 'one',
 '.',
 '[SEP]']
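
The layout of a sentence pair can be sketched by hand 🔗. `encode_pair` below is a hypothetical helper (not a `transformers` function) that builds `[CLS] A [SEP] B [SEP]` and the matching `token_type_ids`, which are 0 for the first segment and 1 for the second:

```python
def encode_pair(tokens_a, tokens_b):
    """Hypothetical helper: build [CLS] A [SEP] B [SEP] plus segment IDs."""
    first = ["[CLS]"] + tokens_a + ["[SEP]"]
    second = tokens_b + ["[SEP]"]
    return {
        "tokens": first + second,
        "token_type_ids": [0] * len(first) + [1] * len(second),
    }

# Tokens copied from the output above
pair = encode_pair(
    ["this", "is", "the", "first", "sentence", "."],
    ["this", "is", "the", "second", "one", "."],
)
for token, segment in zip(pair["tokens"], pair["token_type_ids"]):
    print(segment, token)
```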

Tokenize a batch of sentence pairs: 📝🔄

In [54]:
inputs = tokenizer([
    ["This is the first sentence.", "This is the second one."],
    ["The first sentence is this one.", "The second sentence is this one."]
], padding=True)
torch.tensor(inputs["input_ids"]).shape, inputs

(torch.Size([2, 17]),
 {'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102, 0, 0], [101, 1996, 2034, 6251, 2003, 2023, 2028, 1012, 102, 1996, 2117, 6251, 2003, 2023, 2028, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]})

The tokenizer read each inner list as one sentence pair and produced a batch of two sequences 📜✍️. Let's examine the first one:

In [55]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '.',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 'one',
 '.',
 '[SEP]',
 '[PAD]',
 '[PAD]']

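At first glance the batch looks correct ✅, but the pairing depends on how the input is laid out: a single list of lists treats each inner list as one pair, which is wrong ❌ if the two lists are meant as columns (all first sentences, then all second sentences), as in the dataset. Passing the two lists as separate arguments 🥪 instead pairs the corresponding items from each list into a single sequence 📜: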

In [56]:
inputs = tokenizer(
    ["This is the first sentence.", "This is the second one."],
    ["The first sentence is this one.", "The second sentence is this one."]
, padding=True)
torch.tensor(inputs["input_ids"]).shape, inputs, tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

(torch.Size([2, 16]),
 {'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 1996, 2034, 6251, 2003, 2023, 2028, 1012, 102], [101, 2023, 2003, 1996, 2117, 2028, 1012, 102, 1996, 2117, 6251, 2003, 2023, 2028, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]},
 ['[CLS]',
  'this',
  'is',
  'the',
  'first',
  'sentence',
  '.',
  '[SEP]',
  'the',
  'first',
  'sentence',
  'is',
  'this',
  'one',
  '.',
  '[SEP]'])
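
The difference between the two call styles comes down to which sentences end up paired. Here is a plain-Python sketch of the two pairings (no tokenizer involved; the variable names are mine):

```python
first_sentences = ["This is the first sentence.", "This is the second one."]
second_sentences = ["The first sentence is this one.", "The second sentence is this one."]

# A single list of lists: each inner list is read as one sentence pair
pairs_from_nested = [tuple(item) for item in [first_sentences, second_sentences]]

# Two separate lists: items are paired by position
pairs_from_two_lists = list(zip(first_sentences, second_sentences))

print(pairs_from_nested[0])
print(pairs_from_two_lists[0])
```

The second form is the one that matches a dataset with `sentence1` and `sentence2` columns.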

We now know how to tokenize the entire training dataset 📊✨:

In [57]:
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
    return_tensors="pt"
)
input_ids = tokenized_dataset["input_ids"]
input_ids.shape

torch.Size([3668, 103])

We have an input batch of `3668` sequences 📊, each of length `103` 📏 (padded to the longest sequence in the dataset).

The above tokenization works for small datasets, but for larger ones, the `map` function is preferable, as it allows processing batches in parallel without having to preload the dataset into memory first:

In [58]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True # Send multiple sentences to tokenize_function each call for better performance
)
tokenized_datasets

Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

Notice that the tokenized dataset now includes new features from the tokenizer, such as `input_ids` 🆔 and `attention_mask` 🎭.
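
As a rough mental model of `batched=True` 🗂️: the function receives a dictionary of column lists for a chunk of rows, and the columns it returns are added to those rows. The sketch below imitates that with plain Python; `map_batched` and `toy_tokenize` are hypothetical stand-ins (whitespace splitting replaces the real tokenizer), not the `datasets` implementation:

```python
def map_batched(rows, function, batch_size=2):
    """Hypothetical stand-in for a batched map: `function` receives a dict of
    column lists and returns new columns, which are merged into each row."""
    result = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        columns = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = function(columns)
        for offset, row in enumerate(chunk):
            merged = dict(row)
            merged.update({key: values[offset] for key, values in new_columns.items()})
            result.append(merged)
    return result

def toy_tokenize(examples):
    # Whitespace splitting stands in for the real tokenizer
    return {
        "tokens": [
            (a + " " + b).split()
            for a, b in zip(examples["sentence1"], examples["sentence2"])
        ]
    }

rows = [
    {"sentence1": "a b", "sentence2": "c", "label": 1},
    {"sentence1": "d", "sentence2": "e f g", "label": 0},
    {"sentence1": "h", "sentence2": "i", "label": 1},
]
tokenized = map_batched(rows, toy_tokenize)
print(tokenized[1])
```

The original columns are kept and the new one is added alongside them, just like `input_ids` and `attention_mask` above.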

Let's investigate the sequence lengths of the tokenized training set: 📊✨

In [59]:
len(set([len(x) for x in tokenized_datasets["train"]["input_ids"]])) # Count number of unique sequence lengths

77

If the tokenized dataset had been padded, all sequences would be the same length, which is not the case. ❌ This is expected: we didn't ask the tokenizer to pad, because padding is only needed when feeding a batch to the model, and it only has to reach the longest sequence in that batch, not the longest in the dataset. 📏 A `DataCollator` handles this: 🛠️

In [60]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
data_collator

DataCollatorWithPadding(tokenizer=BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
), padding=True, max_length=None, pad_to_multiple_of=None, return_tensors='pt')

Create a sample batch of tokenized sequences: 📝✨

In [61]:
samples = tokenized_datasets["train"][:8] # Pick first 8 samples
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]} # Remove unnecessary columns
samples.keys()

dict_keys(['label', 'input_ids', 'token_type_ids', 'attention_mask'])

Feed the batch into the data collator 📊, which will pad the sequences 📏 to the length of the longest one. 🏆

In [62]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}

Notice that all sequences are now the same length of `67` 📏, which is the length of the longest sequence in the batch 📊. Let's confirm ✅:

In [63]:
max([len(x) for x in tokenized_datasets["train"][:8]["input_ids"]])

67

Correct! ✅ Let's set up our data from scratch: load the dataset 📂, tokenize it ✂️, and create the data collator 📊:

In [64]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

# Load the tokenizer
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Load dataset
raw_datasets = load_dataset("glue", "mrpc")

# Tokenize dataset
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Create a data collator that pads each batch dynamically
# (to make sure all sequences in a batch have the same length)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map:   0%|          | 0/3668 [00:00<?, ? examples/s]
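
To illustrate what per-batch padding buys us, here is a hedged sketch of a collator in plain Python 📦. `collate` is a hypothetical helper (not `DataCollatorWithPadding` itself), and the sequences are made up, with arbitrary IDs:

```python
def collate(features, pad_id=0):
    """Hypothetical collator: pad every sequence to the longest one in this batch."""
    longest = max(len(feature["input_ids"]) for feature in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for feature in features:
        ids = feature["input_ids"]
        padding = longest - len(ids)
        batch["input_ids"].append(ids + [pad_id] * padding)
        batch["attention_mask"].append([1] * len(ids) + [0] * padding)
        batch["labels"].append(feature["label"])
    return batch

# Made-up sequences of different lengths (the IDs themselves are arbitrary)
lengths = [5, 9, 3, 12, 4, 6]
dataset = [{"input_ids": [7] * n, "label": i % 2} for i, n in enumerate(lengths)]

# Batches of two examples each
batches = [collate(dataset[i:i + 2]) for i in range(0, len(dataset), 2)]
batch_lengths = [len(batch["input_ids"][0]) for batch in batches]

# Total number of token slots with per-batch padding vs. padding everything to the longest
dynamic_total = sum(len(ids) for batch in batches for ids in batch["input_ids"])
global_total = max(lengths) * len(lengths)
print(batch_lengths, dynamic_total, global_total)
```

Each batch ends up only as long as its own longest sequence, so fewer padding tokens are processed than when the whole dataset is padded to one length.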

Load the model for fine-tuning 🔧 and specify the number of labels to predict 📊:

In [65]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2).to(DEVICE)
model

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

The warning above indicates that the `bert-base-uncased` checkpoint has no sequence-classification head ⚠️, so a new one (`classifier.weight` and `classifier.bias`) was added with randomly initialized weights ✨. The pre-trained encoder weights are reused 💪, but the classification head will be trained from scratch 🛠️.

Now, we initialize the trainer: 🎓✨

In [66]:
from transformers import Trainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    "test-trainer",
    bf16=True # Enable mixed precision training
)
trainer = Trainer(
    model, # the instantiated 🤗 Transformers model to be trained
    training_args, # training arguments, defined above
    train_dataset=tokenized_datasets["train"], # The dataset to train the model on
    eval_dataset=tokenized_datasets["validation"], # The dataset to evaluate the model on
    data_collator=data_collator, # defaults to DataCollatorWithPadding if not provided
    tokenizer=tokenizer # The tokenizer used to preprocess the data
)
training_args, trainer

(TrainingArguments(
 _n_gpu=1,
 accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
 adafactor=False,
 adam_beta1=0.9,
 adam_beta2=0.999,
 adam_epsilon=1e-08,
 auto_find_batch_size=False,
 average_tokens_across_devices=False,
 batch_eval_metrics=False,
 bf16=True,
 bf16_full_eval=False,
 data_seed=None,
 dataloader_drop_last=False,
 dataloader_num_workers=0,
 dataloader_persistent_workers=False,
 dataloader_pin_memory=True,
 dataloader_prefetch_factor=None,
 ddp_backend=None,
 ddp_broadcast_buffers=None,
 ddp_bucket_cap_mb=None,
 ddp_find_unused_parameters=None,
 ddp_timeout=1800,
 debug=[],
 deepspeed=None,
 disable_tqdm=False,
 dispatch_batches=None,
 do_eval=False,
 do_predict=False,
 do_train=False,
 eval_accumulation_steps=None,
 eval_delay=0,
 eval_do_concat_batches=True,
 eval_on_start=False,
 eval_steps=None,
 eval_st

We can now train the model: 🏋️‍♂️📊

In [None]:
trainer.train()




Step,Training Loss
500,0.5072


Training is complete ✅, and the training loss logged at step 500 is `0.5072` 📉. We can now evaluate the model on the validation set 🧪 using the trainer's `predict()` method, which takes the already-tokenized dataset and handles batching, padding (through the data collator), and device placement for us 🔄:

In [None]:
predictions = trainer.predict(tokenized_datasets["validation"])
predictions.predictions.shape, predictions.label_ids.shape

The resulting object contains a `predictions` array with `2` logits (one per label) for each of the `408` validation sequences 📊, and a `label_ids` array with the ground truth label for each sequence 📜.

Let's check the logits 📊, predictions 📈, and ground truth labels ✅ for some of the results:

In [None]:
import numpy as np

[(predictions.predictions[x], np.argmax(predictions.predictions[x], axis=-1), predictions.label_ids[x]) for x in range(3)]

Now, let's compute the predicted label for every sequence in the validation set 📊:

In [None]:
predicted_labels = np.argmax(predictions.predictions, axis=-1)
predicted_labels

We can now compare with the ground truth labels: 📊🔍

In [None]:
np.sum(predicted_labels == predictions.label_ids) / len(predicted_labels)

Our model accurately predicts labels for **>80%** 📊 of the validation set.

A better way to evaluate performance is to use the `evaluate` library 📊, which can load the metrics associated with a benchmark dataset 📈; for GLUE MRPC these are accuracy and F1 ✅. First let's install it:

In [None]:
%pip install evaluate

Let's now compute the metrics for our predictions. 📊📈

In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=predicted_labels, references=predictions.label_ids)
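
For intuition, the two scores that the GLUE MRPC metric reports (accuracy and F1) can also be computed by hand 🧮. The sketch below is a plain-Python illustration, not the `evaluate` implementation; the function name `accuracy_and_f1` and the toy labels are made up for this example, and label `1` is assumed to be the positive (paraphrase) class:

```python
def accuracy_and_f1(predictions, references, positive_label=1):
    """Compute accuracy and binary F1 for two equal-length lists of labels."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equally long")

    pairs = list(zip(predictions, references))

    # Accuracy: fraction of predictions that match the reference label
    correct = sum(p == r for p, r in pairs)

    # Confusion counts for the positive class
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)

    # F1 is the harmonic mean of precision and recall
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0

    return {"accuracy": correct / len(pairs), "f1": f1}


# Made-up labels, purely for illustration (not results from this notebook)
toy_predictions = [1, 0, 1, 1, 0, 1]
toy_references = [1, 0, 0, 1, 1, 1]
toy_scores = accuracy_and_f1(toy_predictions, toy_references)
print(toy_scores)
```

F1 is reported next to accuracy because it stays informative when one class is much more frequent than the other.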

If we create a function that takes a tuple of logits 📊 and ground truth labels 🏷️ and returns a dictionary of evaluation metrics 📈, we can provide it to the `Trainer`:

In [None]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
compute_metrics((predictions.predictions, predictions.label_ids))

Let's pass `compute_metrics` to `Trainer` and try again 📊:

In [None]:
training_args = TrainingArguments(
    "test-trainer",
    eval_strategy="epoch", # Evaluate at the end of each epoch (named evaluation_strategy in older transformers releases)
    fp16=True
)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2).to(DEVICE)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()

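The training process now reports validation loss 📉 and the accuracy and F1 scores at the end of each epoch: the per-epoch evaluation comes from setting the evaluation strategy to `"epoch"`, while the metric columns come from the `compute_metrics` function we provided. We can see the training loss decreasing while validation loss increases, which means the model is starting to overfit the training set.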
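
The overfitting pattern described above (training loss falling while validation loss rises) can be checked programmatically 🔍. This is a minimal sketch in plain Python; the helper names `best_epoch` and `is_overfitting` are hypothetical, and the loss values are invented for illustration rather than taken from this notebook's run:

```python
def best_epoch(validation_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    lowest = min(validation_losses)
    return validation_losses.index(lowest) + 1


def is_overfitting(training_losses, validation_losses):
    """Flag the case where, over the last epoch, training loss fell but validation loss rose."""
    if len(training_losses) < 2 or len(validation_losses) < 2:
        return False  # Not enough epochs to compare
    train_fell = training_losses[-1] < training_losses[-2]
    val_rose = validation_losses[-1] > validation_losses[-2]
    return train_fell and val_rose


# Hypothetical per-epoch losses, NOT taken from this notebook's run
hypothetical_train = [0.55, 0.32, 0.14]
hypothetical_val = [0.42, 0.45, 0.61]

print(best_epoch(hypothetical_val))
print(is_overfitting(hypothetical_train, hypothetical_val))
```

In practice one would usually keep the checkpoint from the epoch with the lowest validation loss instead of the last one.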

Now let's perform the same training run without the `Trainer` class 📚✨:

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets["train"].column_names

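Now that we've tokenized the sentences, we can remove `sentence1` and `sentence2` from the dataset. 📊 Additionally, `idx` is unnecessary, and `label` should be renamed to `labels` because that is the argument name the model's `forward()` method expects. Finally, we set the dataset format to `torch` so that it returns PyTorch tensors: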

In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

Now that the dataset is ready, we can create a `DataCollator` to pad sequences 📏 and create `DataLoader` objects to load batches:

In [None]:
from torch.utils.data import DataLoader

# Create data collator to pad batch sequences
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Create a data loader to load batches from the training set
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True, # Shuffle the training set
    batch_size=8, # Each batch will have 8 samples
    collate_fn=data_collator # Use the data collator to pad the samples in the batch
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"],
    batch_size=8, # Each batch will have 8 samples
    collate_fn=data_collator # Use the data collator to pad the samples in the batch
)

Let's load a batch from the training set and inspect its contents: 📊✨

In [None]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

Each batch has `8` sequences 📊, each padded to the maximum sequence length of the batch 📏 (may be different with each batch).
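
To make the per-batch (dynamic) padding concrete, here is a rough plain-Python sketch of the idea 📏. It is not the `DataCollatorWithPadding` implementation; the function `pad_batch` and the token ids are made up for illustration, and `pad_token_id=0` with right-side padding is assumed because that is BERT's usual configuration:

```python
def pad_batch(sequences, pad_token_id=0):
    """Pad lists of token ids to the longest sequence in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = max_len - len(seq)
        # Pad on the right with the pad token id
        input_ids.append(seq + [pad_token_id] * padding)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids: each batch is padded only to its own longest sequence
batch_a = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
batch_b = pad_batch([[101, 102], [101, 2023, 102]])

print(len(batch_a["input_ids"][0]), len(batch_b["input_ids"][0]))
```

Padding to the batch maximum instead of a global maximum avoids spending computation on padding tokens when a batch only contains short sequences.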

Let's create the model and run the batch through it:

In [None]:
from transformers import AutoModelForSequenceClassification

# Load the pretrained model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2).to(DEVICE)

# Move batch to the selected device (GPU if available)
batch = batch.to(DEVICE)

# Forward pass the batch through the model
outputs = model(**batch)

(outputs.loss, outputs.logits.shape)

Running the batch through the model produced a logits tensor of shape `(8, 2)`, holding one unnormalized score per class (paraphrase or not) for each of the `8` sequences 📊. Since the batch contains the ground-truth labels, the model also computed a loss, which measures how far the predictions implied by the logits are from the ground truth.
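
As a rough illustration of how a loss can be derived from logits and labels, the sketch below computes a mean cross-entropy in plain Python 🧮. It assumes the standard single-label setup (softmax followed by negative log-likelihood of the true class); the helper names and the toy logits are made up, and this is not the model's internal code:

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, labels):
    """Mean negative log-probability assigned to the true class."""
    losses = []
    for logits, label in zip(batch_logits, labels):
        probs = softmax(logits)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)


# Made-up logits for 3 sequences and 2 classes
toy_logits = [[2.0, -1.0], [0.0, 0.0], [-1.5, 1.5]]
toy_labels = [0, 1, 0]
toy_loss = cross_entropy(toy_logits, toy_labels)
print(toy_loss)
```

The first sequence is confidently correct and adds very little to the loss, while the third is confidently wrong and dominates it.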

Let's now run training without using `Trainer`:

In [None]:
from tqdm.auto import tqdm
from transformers import get_scheduler
from torch.optim import AdamW # transformers' own AdamW is deprecated; use the PyTorch implementation

# Create an instance of AdamW optimizer with starting learning rate of 5e-5
optimizer = AdamW(model.parameters(), lr=5e-5)

# Calculate the number of training steps
# (num_batches * n_epochs)
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)

# Create learning rate scheduler
lr_scheduler = get_scheduler(
    "linear", # Linearly decrease the learning rate
    optimizer=optimizer, # The optimizer to use
    num_warmup_steps=0, # No warmup steps
    num_training_steps=num_training_steps # Total number of training steps
)

# Set the model in training mode
# (enables dropout; gradients are tracked regardless of this call)
model.train()

# Create a progress bar where 100% = num_training_steps
progress_bar = tqdm(range(num_training_steps))

# Train for N epochs
for epoch in range(num_epochs):
    # Iterate over the batches of the training set
    for batch in train_dataloader:
        # Move batch to the selected device (GPU if available)
        batch = batch.to(DEVICE)

        # Forward pass batch through the model
        outputs = model(**batch)

        # Backpropagate the loss through
        # the model (gradient calculation)
        loss = outputs.loss
        loss.backward()

        # Perform a single optimization step
        optimizer.step()

        # Update the learning rate using the linear scheduler
        lr_scheduler.step()

        # Zero out the gradients for the next batch
        # (otherwise they would accumulate)
        optimizer.zero_grad()

        # Update the progress bar
        progress_bar.update(1)

# Run last batch through model and output logits
outputs = model(**batch)
outputs.logits
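
To see what the `"linear"` schedule does to the learning rate, here is a small stand-alone sketch 📉. It assumes the usual definition (linear warmup followed by linear decay to zero); the function `linear_schedule_lr` is a made-up name and not the library's code. The step count uses the 3,668 MRPC training examples with batch size `8` and `3` epochs, as in the loop above:

```python
import math


def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        # Ramp up from 0 to base_lr during warmup
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


# 3,668 training examples, batch size 8, 3 epochs
steps_per_epoch = math.ceil(3668 / 8)
num_training_steps = 3 * steps_per_epoch

schedule = [
    linear_schedule_lr(step, 5e-5, num_training_steps)
    for step in range(num_training_steps + 1)
]
print(steps_per_epoch, num_training_steps, schedule[0], schedule[-1])
```

With no warmup, the learning rate starts at its full value and shrinks by the same amount at every optimizer step until it reaches zero on the last one.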

Let's use the `evaluate` package to load the dataset metrics and calculate them for the model we just trained:

In [None]:
import evaluate

# Change the model to evaluation mode
# (disables dropout so that predictions are deterministic)
model.eval()

# Load metrics from dataset
metric = evaluate.load("glue", "mrpc")

# Run inference on the evaluation set and add predictions to the metric
for batch in eval_dataloader:
    # Move batch to the selected device (GPU if available)
    batch = batch.to(DEVICE)

    # Forward pass the batch through the model
    # (disable gradient tracking to save memory and computation)
    with torch.no_grad():
        outputs = model(**batch)

    # Calculate predictions from logits
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)

    # Add batch to metric
    metric.add_batch(predictions=predictions, references=batch["labels"])

# Compute final metrics
metric.compute()

**Now with `accelerate` 🚀:** Hugging Face’s `accelerate` library handles device placement and lets the same training loop run on multiple GPUs, on TPUs, or with mixed precision (FP16/BF16) with very few code changes. Pass the dataloaders, model, and optimizer through `accelerator.prepare()`, drop the manual `.to(DEVICE)` calls, and replace `loss.backward()` with `accelerator.backward(loss)` 🚀🔥

In [None]:
from accelerate import Accelerator
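from torch.optim import AdamW # transformers' own AdamW is deprecated; use the PyTorch implementation
from tqdm.auto import tqdm
from transformers import AutoModelForSequenceClassification, get_scheduler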

# Load the pre-trained model, add classification head
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Create an instance of AdamW optimizer
optimizer = AdamW(model.parameters(), lr=3e-5)

# Prepare for training using the Accelerator
accelerator = Accelerator()
train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

# Create the learning rate scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

# Set the model in training mode
# (eg: enable dropout layers, etc.)
model.train()

# Train the model for N epochs
progress_bar = tqdm(range(num_training_steps))
for epoch in range(num_epochs):
    for batch in train_dl:
        # Forward pass batch through the model
        outputs = model(**batch)

        # Calculate the loss and perform a backward pass
        loss = outputs.loss
        accelerator.backward(loss)

        # Perform a single optimization step
        # (updates weights using gradients calculated during backpropagation)
        optimizer.step()

        # Perform a learning rate step
        lr_scheduler.step()

        # Zero out the gradients for the next batch
        # (otherwise they would accumulate)
        optimizer.zero_grad()

        # Update the progress bar
        progress_bar.update(1)

# Run last batch through model and output logits
outputs = model(**batch)
outputs.logits