<a href="https://colab.research.google.com/github/tsilva/sandbox-transformers/blob/main/HF_NLP_Course.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hugging Face NLP Course - Part 1

- Better intro
- Change section structure
- Review content again
- Publish in aiml-notebooks
- Publish final model to hugging face

This is a notebook summarizing everything I've learned in the [Hugging Face NLP Course](https://huggingface.co/learn/nlp-course/). This part covers chapters 1 to 4.

## Setup

Let's install all the libraries we will need in this notebook:

In [88]:
%pip install python-dotenv
%pip install transformers[sentencepiece] datasets evaluate accelerate
%pip install scikit-learn scipy
%pip install openai

Note: you may need to restart the kernel to use updated packages.


To be able to access all models we need an access token with the **Make calls to the serverless Inference API** permission. 
You can create one [here](https://huggingface.co/settings/tokens) and make it available as the `HF_TOKEN` environment variable.

Let's load the environment variables and assert that `HF_TOKEN` is available:

In [89]:
from dotenv import load_dotenv
load_dotenv()

import os
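# .get() returns None when the variable is missing, so the assertion message is shown instead of a KeyError
assert os.environ.get("HF_TOKEN"), "You need to set the HF_TOKEN environment variable to run this notebook"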

Now let's assert that PyTorch has access to an NVIDIA GPU. This is not strictly mandatory but it will considerably speed up the training of our models. This assertion will guarantee that CUDA is installed, and that PyTorch can access it. If you don't want to use a GPU simply comment out the cell below:

In [90]:
import torch
assert torch.cuda.is_available(), "CUDA is not available."
assert torch.cuda.device_count() > 0, "No GPU device is available."

This is just a utility to make notebook authoring easier:

In [None]:
import os
import IPython

from IPython.core.magic import register_cell_magic
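# .get() returns None when the variable is missing, so the assertion message is shown instead of a KeyError
assert os.environ.get("OPENAI_API_KEY"), "You need to set the OPENAI_API_KEY environment variable to run this notebook"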

def review_text(text):
    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system", 
                "content": 
"""
Your task is to refine a Markdown excerpt from a Jupyter Notebook about Hugging Face Transformers, making it more concise and clear.
Optimize for brevity while preserving key information.
Eliminate redundancy, improve clarity, and enhance readability.
Output only the revised text—no explanations or additional content.
""".strip()
},
            {"role": "user", "content": text}
        ],
        temperature=1.0
    )
    return response.choices[0].message.content

@register_cell_magic
def r(line, cell):
    """Refines markdown text in a Jupyter cell and updates it in-place (Works in VS Code)."""

    # Get improved text from the model (gpt-4o-mini)
    improved_text = review_text(cell)

    # Overwrite the current cell with the new content
    shell = IPython.get_ipython()
    shell.set_next_input(improved_text, replace=True)

    # Change cell type to Markdown (works in Jupyter Notebook & VS Code)
    shell.run_cell_magic("script", "python", "get_ipython().set_next_input('', replace=True)")
    shell.run_cell_magic("script", "python", "get_ipython().set_next_input('%%markdown\\n' + '''" + improved_text.replace("'", "\\'") + "''', replace=True)")


Let's explore the `transformers` library to load pre-trained models for different tasks.

## Tasks

The `pipeline` abstraction in `transformers` provides easy access to transformer models for different tasks. If no model is specified, it falls back to a default model for the requested task.



### Sentiment Analysis

Classify the sentiment of one or more sentences. By default, this pipeline uses the `distilbert/distilbert-base-uncased-finetuned-sst-2-english` model for sentiment analysis:

In [4]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
        "It hurts so good!",
        "While the service was certainly unique, it left a lasting impression that I won’t forget anytime soon."
    ]
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9598047137260437},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455},
 {'label': 'NEGATIVE', 'score': 0.8600767254829407},
 {'label': 'POSITIVE', 'score': 0.9994854927062988}]

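The model assigned each sentence a label and a confidence score. The first two sentences are clear-cut, while the last two are ambiguous, so their labels are more debatable.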

### Zero-Shot Classification

In a `zero-shot-classification` task, the model scores how well an input matches each of a set of candidate labels supplied at inference time. The labels do not need to have been seen during training:

In [5]:
classifier = pipeline("zero-shot-classification")
classifier(
    "The ball curved beautifully into the top corner, leaving the goalkeeper with no chance.",
    candidate_labels=["sports", "art", "technology", "cooking", "nature"]
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


{'sequence': 'The ball curved beautifully into the top corner, leaving the goalkeeper with no chance.',
 'labels': ['sports', 'technology', 'art', 'nature', 'cooking'],
 'scores': [0.35649919509887695,
  0.29446250200271606,
  0.25602230429649353,
  0.0662834644317627,
  0.02673245407640934]}

The model correctly classified the sentence as primarily about `sports`, and possibly related to `technology` (physics of a moving ball) or `art` ("curved beautifully"), but ruled out `nature` and `cooking`.

### Text Generation

In a `text-generation` task, the model will add more tokens to the right of the provided sentence.

In [6]:
generator = pipeline("text-generation")
generator(
    "In this course, we will teach you how to",
    max_length=30, # Cap the total length (prompt plus generated tokens) at 30 tokens
    num_return_sequences=2, # Generate 2 candidate sentence completions with the provided input as the prefix
)

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to create and store a variety of SQL-related data objects. We will show you how to handle user'},
 {'generated_text': 'In this course, we will teach you how to create 3D images as well as the best way to show your images in 3D for the web'}]

Notice how the model was able to generate coherent sentences that start with our input.

### Mask Filling

The `fill-mask` task has the model predict the most likely tokens for the position marked by the `<mask>` token.

In [7]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


[{'score': 0.1919846087694168,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04209217429161072,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

Notice how the model proposed plausible tokens for the masked position, ranking ` mathematical` as the most likely, which makes sense in the context of the sentence.

### Named Entity Recognition

The `ner` task identifies entities in the text and tags each one with a category (e.g., person, organization, location).

In [8]:
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


[{'entity_group': 'PER',
  'score': np.float32(0.9981694),
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': np.float32(0.9796019),
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': np.float32(0.9932106),
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

Notice how the model was able to detect a person name (Sylvain), an organization (Hugging Face), and a location (Brooklyn). It pointed to the exact location of these entities in the text, and specified how confident it was that its classification was accurate.
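The `start` and `end` values are character offsets into the input string, so each entity can be recovered with a plain slice. A minimal sketch, with the offsets copied by hand from the output above rather than produced by running the pipeline:

```python
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# (entity_group, start, end) copied by hand from the pipeline output above
entities = [("PER", 11, 18), ("ORG", 33, 45), ("LOC", 49, 57)]

# `end` is exclusive, so it can be used directly as a slice bound
spans = [(group, text[start:end]) for group, start, end in entities]
for group, span in spans:
    print(f"{group}: {span}")
```

The same slicing applies to the `start` and `end` fields returned by the `question-answering` pipeline.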

### Question Answering

With the `question-answering` pipeline, the user provides a context and a question, and the model will answer the question using the provided context.

In [9]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


{'score': 0.6949758529663086, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

Notice how the model was able to answer the question correctly by using the provided context.

### Summarization

The `summarization` pipeline creates a summary of the provided text:

In [10]:
summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'summary_text': ' The number of engineering graduates in the United States has declined in recent years . China and India graduate six and eight times as many traditional engineers as the U.S. does . Rapidly developing economies such as China continue to encourage and advance the teaching of engineering . There are declining offerings in engineering subjects dealing with infrastructure, infrastructure, the environment, and related issues .'}]

### Translation

The `translation` pipeline translates text from one language to another. In this pipeline, a model needs to be explicitly selected in order to specify the source and target languages.

In [11]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

Device set to use cuda:0


[{'translation_text': 'This course is produced by Hugging Face.'}]

Notice how the model was able to translate the text from French to English.

### Feature Extraction

The `feature-extraction` pipeline converts sentences to embeddings.

In [59]:
feature_extractor = pipeline("feature-extraction")
result = feature_extractor([
    "Woodpecker",
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
    "Soccer"
])
result

No model was supplied, defaulted to distilbert/distilbert-base-cased and revision 6ea8117 (https://huggingface.co/distilbert/distilbert-base-cased).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[[[0.46323493123054504,
   -0.008048598654568195,
   -0.1212799996137619,
   -0.3442820608615875,
   -0.38799092173576355,
   0.046555694192647934,
   0.3375183045864105,
   -0.022074824199080467,
   -0.0007292297668755054,
   -1.191502332687378,
   -0.19076354801654816,
   0.21029764413833618,
   -0.23856502771377563,
   -0.1576155573129654,
   -0.37006819248199463,
   0.1643211394548416,
   0.2532159090042114,
   0.009089866653084755,
   -0.12048514932394028,
   0.01818009838461876,
   -0.026407526805996895,
   -0.2988360822200775,
   0.6142184734344482,
   -0.1888769418001175,
   0.3434928059577942,
   -0.09803776443004608,
   0.3379552364349365,
   0.07254431396722794,
   -0.01697712205350399,
   0.4146110415458679,
   0.06607840955257416,
   0.2173629254102707,
   -0.15209655463695526,
   0.14346309006214142,
   -0.0830865427851677,
   0.12500286102294922,
   -0.14274907112121582,
   -0.200177863240242,
   0.05224117636680603,
   -0.02394922636449337,
   -0.5409788489341736,
   0.

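The pipeline returns one 768-dimensional vector per token, so each sentence comes back as a list of token vectors. To get a single vector per sentence, the token vectors are typically pooled (for example, averaged). The resulting vectors can be used for tasks such as clustering, classification, or dimensionality reduction. For example, by comparing two vectors' cosine similarity, we can determine how similar two sentences are.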
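One way to compare sentences is to average the per-token vectors into a single vector per sentence (mean pooling) and then compute the cosine similarity. The sketch below uses tiny made-up 3-dimensional vectors in place of real 768-dimensional model outputs; `mean_pool` and `cosine_similarity` are illustrative helper names, not part of `transformers`:

```python
import math

def mean_pool(token_vectors):
    """Average a list of per-token vectors into one sentence vector."""
    n = len(token_vectors)
    return [sum(dim) / n for dim in zip(*token_vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up token vectors standing in for real model outputs
sentence_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two tokens
sentence_b = [[1.0, 1.0, 0.0]]                   # one token
sentence_c = [[0.0, 0.0, 1.0]]                   # one token

emb_a, emb_b, emb_c = (mean_pool(s) for s in (sentence_a, sentence_b, sentence_c))
sim_ab = cosine_similarity(emb_a, emb_b)  # vectors point in the same direction
sim_ac = cosine_similarity(emb_a, emb_c)  # vectors are orthogonal
print(sim_ab, sim_ac)
```

Mean pooling is only one option; which pooling strategy works best depends on the model.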

## How do Transformers work?

Transformers are a type of neural network architecture that process and generate text efficiently. They rely on a mechanism called *self-attention* to understand the relationships between words in a sequence. Depending on their design and training objectives, Transformers can be categorized into three main types:  

**GPT-like Models (Auto-Regressive Transformers)**  
- These models, such as GPT (Generative Pre-trained Transformer), are designed for **causal language modeling**, where they predict the next word in a sequence based on the previous words.  
- They are **decoder-only models**, making them well-suited for text generation tasks like story writing and chatbot interactions.  

**BERT-like Models (Auto-Encoding Transformers)**  
- BERT (Bidirectional Encoder Representations from Transformers) and similar models use **masked language modeling (MLM)** to predict missing (masked) words in a sentence.  
- They are **encoder-only models**, meaning they focus on understanding and processing input rather than generating new text.  
- These models excel in **tasks requiring deep understanding**, such as sentence classification, named entity recognition (NER), and question answering.  

**BART/T5-like Models (Sequence-to-Sequence Transformers)**  
- These models combine elements of both encoder and decoder architectures, making them **encoder-decoder models** (also called sequence-to-sequence models).  
- They are designed for **generative tasks that require an input**, such as machine translation, text summarization, and text-based question answering.  

Each Transformer model type is optimized for different tasks, making them highly versatile across various natural language processing (NLP) applications.
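To make the difference between bidirectional (BERT-like) and causal (GPT-like) attention concrete, here is a minimal pure-Python sketch of scaled dot-product self-attention. It is a simplification under stated assumptions: queries, keys, and values are all the raw input vectors (no learned projections, a single head), and the input is made up for illustration:

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, causal=False):
    """Scaled dot-product self-attention with Q = K = V = x (assumption: no learned weights)."""
    d = len(x[0])
    weights = []
    for i, query in enumerate(x):
        scores = []
        for j, key in enumerate(x):
            if causal and j > i:
                scores.append(float("-inf"))  # hide positions to the right
            else:
                scores.append(sum(q * k for q, k in zip(query, key)) / math.sqrt(d))
        weights.append(softmax(scores))
    # Each output vector is a weighted average of all (visible) input vectors
    outputs = [
        [sum(w * value[k] for w, value in zip(row, x)) for k in range(d)]
        for row in weights
    ]
    return outputs, weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three made-up token vectors
_, full_weights = self_attention(x)                  # encoder-style: every token sees every token
_, causal_weights = self_attention(x, causal=True)   # decoder-style: tokens only see the past
print(full_weights[0])
print(causal_weights[0])
```

With the causal mask, the first token can only attend to itself, which is what lets decoder-only models be trained to predict the next token.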

All models have their biases, and it's important to be aware of them when using them in real-world applications. Notice how `bert-base-uncased` suggests different professions depending on whether the sentence mentions a man or a woman:

In [61]:
unmasker = pipeline("fill-mask", model="bert-base-uncased")
[x["token_str"] for x in unmasker("This man works as a [MASK].")], [x["token_str"] for x in unmasker("This woman works as a [MASK].")]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


(['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor'],
 ['nurse', 'maid', 'teacher', 'waitress', 'prostitute'])

No comment 🤦.


---

## Using Transformers

### Behind the Pipeline

Let's go behind the abstraction provided by `pipeline` and set up the same tasks manually. The following is a `sentiment-analysis` task using the `distilbert/distilbert-base-uncased-finetuned-sst-2-english` model, still using the `pipeline` abstraction:

In [14]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9598047137260437},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

To set up the same task manually, we need to load the model and tokenizer, preprocess the input text, and pass it through the model. Each model has its own tokenizer, which is responsible for converting text into tokens that the model can understand. The `AutoTokenizer` class automatically selects the appropriate tokenizer for a given model:

In [62]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased-finetuned-sst-2-english', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

Now that we have the tokenizer, we can preprocess the input text by tokenizing it and converting it to input IDs that the model can understand. We will also ask it to pad shorter sequences to the length of the longest one in the batch (filling with the padding token ID), truncate sequences that exceed the model's maximum length, and return the token IDs as PyTorch tensors:

In [69]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(
    raw_inputs, # The texts to tokenize
    padding=True, # Pad shorter inputs to the length of the longest input in the batch
    truncation=True, # Truncate the text to the maximum length the model can accept
    return_tensors="pt" # Return PyTorch tensors
)
(
    inputs.keys(), 
    inputs['input_ids'].shape,
    inputs["input_ids"], 
    inputs['attention_mask'].shape,
    inputs["attention_mask"]
)

(dict_keys(['input_ids', 'attention_mask']),
 torch.Size([2, 16]),
 tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
           2607,  2026,  2878,  2166,  1012,   102],
         [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
              0,     0,     0,     0,     0,     0]]),
 torch.Size([2, 16]),
 tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]))

The tokenizer returns a dictionary with two keys: `input_ids` contains the token IDs, and `attention_mask` is a binary mask in which `1` marks a real token and `0` marks padding. The model ignores the padded positions when making predictions. Both tensors have shape `(2, 16)`, which corresponds to the batch size (two sentences) and the length of the longest sequence in the batch (16 tokens).
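What `padding=True` does can be sketched in a few lines of plain Python. This is an illustration rather than the tokenizer's actual implementation: `pad_batch` is a hypothetical helper, and the padding ID of `0` is taken from the `[PAD]` entry in the tokenizer output shown earlier:

```python
def pad_batch(sequences, pad_token_id=0):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Unpadded token IDs of the two example sentences, copied from the output above
batch = pad_batch([
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

Padding on the right matches the `padding_side='right'` setting reported by this tokenizer.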

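To retrieve the model itself, including the sequence classification head that produces the logits, we can use the `AutoModelForSequenceClassification` class (plain `AutoModel` would load only the base model and return hidden states instead of logits):

In [19]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)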

With the model loaded and our sentences tokenized we can now run our inputs through the model:

In [70]:
model(**inputs)

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

The code above is the same as doing the following:

In [74]:
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

The output of the model contains one row of logits per sentence, with one value for each of the two classes (negative and positive sentiment). We can use the `softmax` function to convert these logits into probabilities:

In [75]:
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5981e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)
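For reference, softmax exponentiates each logit and divides by the sum of the exponentials in that row. A minimal pure-Python sketch, using the logits printed above rounded to four decimals (so the results only match the tensor output approximately):

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]  # copied from the model output above
probs = [softmax(row) for row in logits]
for row in probs:
    print([round(p, 4) for p in row])
```

Subtracting the row maximum does not change the result; it only keeps the exponentials in a safe numeric range.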


We can inspect the predicted labels by using the model's config:

In [78]:
labels = model.config.id2label
labels

{0: 'NEGATIVE', 1: 'POSITIVE'}

And use that to print the probability of each sentence being positive or negative:

In [86]:
for i, sentence in enumerate(raw_inputs):
    print("\n" + sentence)
    for index, prediction in enumerate(predictions[i]):
        label = labels[index]
        print(f"{index} ({label}): {prediction.item() * 100.0:.2f}%")


I've been waiting for a HuggingFace course my whole life.
0 (NEGATIVE): 4.02%
1 (POSITIVE): 95.98%

I hate this so much!
0 (NEGATIVE): 99.95%
1 (POSITIVE): 0.05%


The `AutoModel` family of classes automatically instantiates the appropriate architecture for the specified checkpoint, but you can also instantiate a specific architecture directly:

In [88]:
from transformers import BertConfig, BertModel
config = BertConfig()
model = BertModel(config)
model

(BertConfig {
   "_attn_implementation_autoset": true,
   "attention_probs_dropout_prob": 0.1,
   "classifier_dropout": null,
   "hidden_act": "gelu",
   "hidden_dropout_prob": 0.1,
   "hidden_size": 768,
   "initializer_range": 0.02,
   "intermediate_size": 3072,
   "layer_norm_eps": 1e-12,
   "max_position_embeddings": 512,
   "model_type": "bert",
   "num_attention_heads": 12,
   "num_hidden_layers": 12,
   "pad_token_id": 0,
   "position_embedding_type": "absolute",
   "transformers_version": "4.48.2",
   "type_vocab_size": 2,
   "use_cache": true,
   "vocab_size": 30522
 },
 BertModel(
   (embeddings): BertEmbeddings(
     (word_embeddings): Embedding(30522, 768, padding_idx=0)
     (position_embeddings): Embedding(512, 768)
     (token_type_embeddings): Embedding(2, 768)
     (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
     (dropout): Dropout(p=0.1, inplace=False)
   )
   (encoder): BertEncoder(
     (layer): ModuleList(
       (0-11): 12 x BertLayer(
     

Let's inspect the model `config` as well:

In [89]:
config

BertConfig {
  "_attn_implementation_autoset": true,
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.48.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

The example above creates a model from a default configuration with randomly initialized weights, but we can also load a pre-trained model:

In [90]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

Let's try running some hard-coded token IDs through the model. Each row is one sequence wrapped in the `[CLS]` (101) and `[SEP]` (102) special tokens:

In [92]:
model(torch.tensor([
    [101, 7592, 999, 102], # "Hello!"
    [101, 4658, 1012, 102], # "Cool."
    [101, 3835, 999, 102], # "Nice!"
]))

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 4.4496e-01,  4.8276e-01,  2.7797e-01,  ..., -5.4032e-02,
           3.9393e-01, -9.4770e-02],
         [ 2.4943e-01, -4.4093e-01,  8.1772e-01,  ..., -3.1917e-01,
           2.2992e-01, -4.1172e-02],
         [ 1.3668e-01,  2.2518e-01,  1.4502e-01,  ..., -4.6914e-02,
           2.8224e-01,  7.5566e-02],
         [ 1.1789e+00,  1.6738e-01, -1.8187e-01,  ...,  2.4671e-01,
           1.0441e+00, -6.1965e-03]],

        [[ 3.6436e-01,  3.2464e-02,  2.0258e-01,  ...,  6.0111e-02,
           3.2451e-01, -2.0996e-02],
         [ 7.1866e-01, -4.8725e-01,  5.1740e-01,  ..., -4.4012e-01,
           1.4553e-01, -3.7545e-02],
         [ 3.3223e-01, -2.3271e-01,  9.4876e-02,  ..., -2.5268e-01,
           3.2172e-01,  8.1094e-04],
         [ 1.2523e+00,  3.5754e-01, -5.1321e-02,  ..., -3.7840e-01,
           1.0526e+00, -5.6255e-01]],

        [[ 2.4042e-01,  1.4718e-01,  1.2110e-01,  ...,  7.6062e-02,
           3.3564e-01,  2

### Tokenizers

Tokenization breaks text into smaller units, essential for natural language processing (**NLP**). There are three main types: **word-based, character-based, and subword-based**.  

- **Word-Based Tokenization**  
  - Splits text by spaces or punctuation, assigning each word an ID.  
  - Requires a **large vocabulary** (~500K words in English).  
  - Struggles with unknown words, represented as `[UNK]` or `<unk>`.  

- **Character-Based Tokenization**  
  - Assigns an ID to each character, creating a **small vocabulary** and reducing unknown tokens.  
  - Increases input size and loses meaning compared to words.  
  - Useful for languages like **Chinese**, where characters carry more meaning.  

- **Subword-Based Tokenization**  
  - Balances word and character tokenization by keeping common words whole and splitting rarer ones into meaningful parts.  
  - Example: **"modernization"** can be split into **"modern"** and **"ization"**, helping models understand word structures while keeping vocabularies manageable.  

- **Common Tokenization Strategies**  
  - **Byte Pair Encoding (BPE)**: used in **GPT-2**; builds the vocabulary by repeatedly merging the most frequent pair of adjacent symbols.  
  - **WordPiece**: used in **BERT**; similar to BPE, and marks word-internal pieces with a `##` prefix.  
  - **SentencePiece & Unigram**: common in **multilingual models**; SentencePiece works on raw text without assuming that words are separated by spaces.  

Each method serves different NLP needs, balancing vocabulary size, efficiency, and meaning preservation.
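The subword idea can be sketched with a greedy longest-match-first lookup, which is how WordPiece-style tokenizers split a word at inference time. The vocabulary below is made up for illustration (real vocabularies are learned from data and contain tens of thousands of entries), and `wordpiece_tokenize` is a hypothetical helper, not a `transformers` function:

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split a word into the longest matching vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"modern", "##ization", "token", "##s", "the"}  # made-up vocabulary
split_word = wordpiece_tokenize("modernization", toy_vocab)
plural = wordpiece_tokenize("tokens", toy_vocab)
unknown = wordpiece_tokenize("xyz", toy_vocab)
print(split_word, plural, unknown)
```

Common words stay whole when they appear in the vocabulary, while rarer words fall back to smaller pieces instead of becoming unknown tokens.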

As with models, you can load a tokenizer in `transformers` using the class for a specific architecture:

In [93]:
from transformers import BertTokenizer

BertTokenizer.from_pretrained("bert-base-cased")

BertTokenizer(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

Or use the `AutoTokenizer` class to automatically select the appropriate tokenizer class for a given model:

In [94]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer

BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

Tokenizing a sentence takes two steps: the text is split into tokens, and the tokens are then mapped to their token IDs. Calling the tokenizer on a sentence does both and returns a dictionary with the token IDs, token type IDs, and attention mask:

In [97]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

But you can perform each individual step manually as well:

In [98]:
tokens = tokenizer.tokenize("Using a Transformer network is simple")
tokens

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']

And now with the sentence split into tokens, in this case "subword" tokens, we can encode them into token IDs the model can understand:

In [99]:
ids = tokenizer.convert_tokens_to_ids(tokens)
ids

[7993, 170, 13809, 23763, 2443, 1110, 3014]

And you can always decode these token IDs back into text:

In [100]:
tokenizer.decode(ids)

'Using a Transformer network is simple'
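
To get a feel for what happens inside `tokenize()` and `convert_tokens_to_ids()`, here is a rough sketch of WordPiece-style greedy longest-match splitting. It assumes a tiny hand-written vocabulary (`toy_vocab`) whose IDs are copied from the outputs above; the real tokenizer also normalizes and pre-tokenizes the text, which this sketch skips.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical mini-vocabulary; the IDs are copied from the outputs above
toy_vocab = {
    "Using": 7993, "a": 170, "Trans": 13809, "##former": 23763,
    "network": 2443, "is": 1110, "simple": 3014, "[UNK]": 100,
}

sentence = "Using a Transformer network is simple"
toy_tokens = [t for w in sentence.split() for t in wordpiece_tokenize(w, toy_vocab)]
toy_ids = [toy_vocab[t] for t in toy_tokens]
print(toy_tokens)
print(toy_ids)
```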

And you can pass the ids to the model like this:

In [4]:
model(torch.tensor([ids]))

SequenceClassifierOutput(loss=None, logits=tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

And remember that this won't work (all sequences must have the same length):

In [11]:
try:
    model(torch.tensor([
        [200, 200, 200],
        [200, 200]  # one token short: the padding token is missing
    ]))
except Exception as e:
    print(e)

expected sequence of length 3 at dim 1 (got 2)


When sequences don't match in length, pad them to the same length:

In [12]:
model(torch.tensor([
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id]
]))

SequenceClassifierOutput(loss=None, logits=tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

Padding has a side effect, however. Compare the logits for the same sequence with and without a padding token:

In [17]:
model(torch.tensor([[200, 200]])), model(torch.tensor([[200, 200, tokenizer.pad_token_id]]))

(SequenceClassifierOutput(loss=None, logits=tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None),
 SequenceClassifierOutput(loss=None, logits=tensor([[ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None))

Notice how the predictions differ once we add the padding token? The model attends to the padding token like any other token, which changes its predictions. To avoid this, we pass an `attention_mask` (the tokenizer returns one for you) that marks padding positions with `0`, telling the model to ignore them:

In [19]:
model(torch.tensor([[200, 200]]), attention_mask=torch.tensor([[1, 1]])), model(torch.tensor([[200, 200, tokenizer.pad_token_id]]), attention_mask=torch.tensor([[1, 1, 0]]))

(SequenceClassifierOutput(loss=None, logits=tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None),
 SequenceClassifierOutput(loss=None, logits=tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None))

Now both predictions are the same even though the input sequences have different lengths.
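
Why does the mask help? Here is a simplified sketch of a masked softmax over attention scores. The scores are made up, and real models typically add a large negative number to masked positions instead of using `-inf` directly, but the effect is the same: padding positions end up with zero attention weight.

```python
import math

def masked_softmax(scores, attention_mask):
    """Softmax over `scores` where positions with mask 0 get zero weight."""
    # Masked positions are set to -inf so that exp(-inf) == 0
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, attention_mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention scores for one query over two real tokens
unpadded = masked_softmax([2.0, 1.0], [1, 1])
# Same sequence with one padding position appended and masked out
padded = masked_softmax([2.0, 1.0, 0.5], [1, 1, 0])
# Without the mask, the padding position absorbs part of the attention
unmasked = masked_softmax([2.0, 1.0, 0.5], [1, 1, 1])

print(unpadded)
print(padded)
print(unmasked)
```

With the mask, the weights over the real tokens are identical in the padded and unpadded cases, which is why the logits match.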

### Putting it all together

To recap, you can pass a single sentence to the tokenizer:

In [20]:
sequence = "I've been waiting for a HuggingFace course my whole life."
tokenizer(sequence)

{'input_ids': [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Or you can pass a list of sentences:

In [54]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
tokenizer(sequences)

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

You can pad to the longest sequence in the batch:

In [22]:
tokenizer(sequences, padding="longest")

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}

Or you can pad to the max sequence length supported by the model:

In [24]:
tokenizer(sequences, padding="max_length")

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, ...

You can also pad to a fixed length by combining `padding="max_length"` with `max_length=N`. Sequences shorter than N tokens are padded up to N. Longer sequences are only truncated if you also pass `truncation=True`; without it they are left as they are, which is why the first sequence below still has 16 tokens:

In [25]:
tokenizer(sequences, padding="max_length", max_length=8)

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0]]}
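
The padding options above can be imitated in a few lines of plain Python. This is only a sketch: `pad_batch` is a hypothetical helper, and its truncation simply cuts the list, whereas the real tokenizer keeps the closing `[SEP]` token when it truncates.

```python
def pad_batch(batch, pad_id=0, padding="longest", max_length=None, truncation=False):
    """Pure-Python imitation of the padding/truncation options shown above."""
    if truncation and max_length is not None:
        batch = [seq[:max_length] for seq in batch]
    if padding == "longest":
        target = max(len(seq) for seq in batch)
    elif padding == "max_length":
        target = max_length
    else:
        raise ValueError(f"unsupported padding strategy: {padding}")
    input_ids = [seq + [pad_id] * max(0, target - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * max(0, target - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token IDs copied from the outputs above
long_seq = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
            17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
short_seq = [101, 2061, 2031, 1045, 999, 102]

no_trunc = pad_batch([long_seq, short_seq], padding="max_length", max_length=8)
with_trunc = pad_batch([long_seq, short_seq], padding="max_length", max_length=8, truncation=True)

print([len(s) for s in no_trunc["input_ids"]])
print([len(s) for s in with_trunc["input_ids"]])
```

Without truncation the long sequence keeps all of its tokens, so the batch is ragged; with truncation both sequences end up with exactly `max_length` tokens.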

When you call the tokenizer directly, it will not only tokenize the input but also add special tokens like `[CLS]` and `[SEP]`:

In [26]:
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)
tokenizer.decode(model_inputs["input_ids"])

"[CLS] i've been waiting for a huggingface course my whole life. [SEP]"

This won't happen if you tokenize and encode the input separately:

In [27]:
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
tokenizer.decode(ids)

"i've been waiting for a huggingface course my whole life."

## Fine-tuning a Pretrained Model

Here is how you can fine-tune an existing model:

In [43]:
import torch
from torch.optim import AdamW  # `transformers.AdamW` is deprecated in favor of the PyTorch optimizer
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Retrieve the model and tokenizer
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Tokenize two sentences
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
batch

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  2023,  2607,  2003,  6429,   999,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}

The tokenizer returned the sequences ready to be fed into the model, but if we add the expected labels for each of these sequences, we can use this same batch structure to fine-tune the model:

In [44]:
batch["labels"] = torch.tensor([1, 1]) # Both sentences are positive

# Create an instance of AdamW optimizer
optimizer = AdamW(model.parameters())

# Forward pass the batch through the model and retrieve the calculated loss
loss = model(**batch).loss

# Backward pass to calculate the gradients
loss.backward()

# Perform a single optimization step to update the model's parameters based on the calculated gradients
optimizer.step()
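
Where does `.loss` come from? For single-label classification the model's loss should be the cross-entropy between the logits and the labels. Here is a minimal sketch of that computation, assuming made-up logits for a batch of two sentences; it is not the library's implementation.

```python
import math

def cross_entropy(logits, labels):
    """Mean cross-entropy between rows of logits and integer class labels."""
    losses = []
    for row, label in zip(logits, labels):
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        losses.append(log_sum_exp - row[label])  # equals -log softmax(row)[label]
    return sum(losses) / len(losses)

# Hypothetical logits for a batch of two sentences, both labeled 1
uninformed = cross_entropy([[0.0, 0.0], [0.0, 0.0]], [1, 1])
confident = cross_entropy([[-2.0, 2.0], [-1.5, 2.5]], [1, 1])

print(uninformed)
print(confident)
```

A model that has no preference between the two classes scores `ln(2)`, and the loss shrinks as the logit of the correct class grows, which is what the optimizer step pushes towards.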



We can't do anything interesting with such a small fine-tuning dataset, so let's retrieve one using Hugging Face's `datasets` library. First we install it:

In [45]:
!pip install datasets


And now we retrieve the `MRPC` dataset, which contains pairs of sentences labeled as paraphrases or not:

In [46]:
from datasets import load_dataset
raw_datasets = load_dataset("glue", "mrpc") 
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

The dataset is split into training, validation and test sets. Each example contains two sentences and a label indicating whether they are paraphrases. Let's inspect the dataset features:

In [54]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

Now let's retrieve an example from the training set:

In [47]:
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

To train on this dataset, each pair of sentences is presented to the model as a single tokenized sequence, with the two sentences separated by the special `[SEP]` token. The tokenizer does this automatically when you pass it two sentences:

In [50]:
inputs = tokenizer("This is the first sentence.", "This is the second one.")
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '.',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 'one',
 '.',
 '[SEP]']
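
The layout of a sentence pair can be sketched by hand. The function below is a hypothetical helper, not part of `transformers`; it assumes the special-token IDs `101` (`[CLS]`) and `102` (`[SEP]`) listed in the tokenizer output earlier, and sentence IDs as produced by the `bert-base-uncased` tokenizer in this notebook. It also builds the `token_type_ids` that tell the model which sentence each token belongs to.

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs listed in the tokenizer output earlier

def encode_pair(ids_a, ids_b):
    """Build [CLS] A [SEP] B [SEP] plus segment IDs from two already-encoded sentences."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}

# IDs for "this is the first sentence ." and "this is the second one ."
first_ids = [2023, 2003, 1996, 2034, 6251, 1012]
second_ids = [2023, 2003, 1996, 2117, 2028, 1012]

encoded = encode_pair(first_ids, second_ids)
print(encoded["input_ids"])
print(encoded["token_type_ids"])
```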

You can tokenize a batch of sentence pairs:

In [80]:
inputs = tokenizer([ # Each inner list is tokenized as one sentence pair
    ["This is the first sentence.", "This is the second one."], 
    ["The first sentence is this one.", "The second sentence is this one."]
], padding=True)
torch.tensor(inputs["input_ids"]).shape, inputs

(torch.Size([2, 17]),
 {'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102, 0, 0], [101, 1996, 2034, 6251, 2003, 2023, 2028, 1012, 102, 1996, 2117, 6251, 2003, 2023, 2028, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]})

The tokenizer generated a batch of two tokenized sequences. This is the shape we want, because we will feed sentence pairs to the model and get back the probability that they are paraphrases. Let's inspect one of these sequences, though:

In [81]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '.',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 'one',
 '.',
 '[SEP]',
 '[PAD]',
 '[PAD]']

At first glance the batch looks correct, but the sentences were paired incorrectly: wrapping both lists in a single outer list makes the tokenizer treat each inner list as one sentence pair. If you instead pass two lists as separate arguments, the tokenizer assumes a sentence-pair task and, for each index, takes the corresponding item from both lists and tokenizes them together as a single sequence:

In [83]:
inputs = tokenizer( # Two lists: items are paired index by index
    ["This is the first sentence.", "This is the second one."], 
    ["The first sentence is this one.", "The second sentence is this one."]
, padding=True)
torch.tensor(inputs["input_ids"]).shape, inputs, tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

(torch.Size([2, 16]),
 {'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 1996, 2034, 6251, 2003, 2023, 2028, 1012, 102], [101, 2023, 2003, 1996, 2117, 2028, 1012, 102, 1996, 2117, 6251, 2003, 2023, 2028, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]},
 ['[CLS]',
  'this',
  'is',
  'the',
  'first',
  'sentence',
  '.',
  '[SEP]',
  'the',
  'first',
  'sentence',
  'is',
  'this',
  'one',
  '.',
  '[SEP]'])
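
The two call styles differ only in how the sentences are grouped into pairs. Here is a plain-Python analogy with no tokenizer involved (the variable names are just for illustration):

```python
first_sentences = ["This is the first sentence.", "This is the second one."]
second_sentences = ["The first sentence is this one.", "The second sentence is this one."]

# Two separate arguments: items are paired index by index, like zip()
pairs_from_two_lists = list(zip(first_sentences, second_sentences))

# One list of lists: each inner list is already one pair
pairs_from_nested_list = [tuple(inner) for inner in [first_sentences, second_sentences]]

print(pairs_from_two_lists[0])
print(pairs_from_nested_list[0])
```

This is the pairing we need for a dataset that stores `sentence1` and `sentence2` as two separate columns.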

We now know how to tokenize the whole training dataset:

In [84]:
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
    return_tensors="pt"
)
input_ids = tokenized_dataset["input_ids"]
input_ids.shape

torch.Size([3668, 103])

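We now have an input batch with 3668 sequences (the size of the training set) of length 103 (all of them were padded to the length of the longest sequence in the dataset).

Padding everything up front like this wastes memory, and the result is a plain dictionary rather than a `Dataset`. Instead, we can tokenize with `Dataset.map()` and leave the padding for later: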

In [102]:
# No padding here: each batch is padded later, by the data collator

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(
    tokenize_function, 
    batched=True # Send multiple sentences to tokenize_function each call for better performance
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

Notice how the tokenized dataset now has the new features provided by the tokenizer (input IDs, attention mask, etc.).

Let's count how many distinct sequence lengths the tokenized training set contains:

In [99]:
len(set([len(x) for x in tokenized_datasets["train"]["input_ids"]]))

77

If the tokenized dataset had been padded to the maximum sequence length, all sequences would have the same length, which is not the case. This is expected: sequences only need to be padded when we feed them to the model, and only to the longest sequence in their batch, not in the whole dataset. To handle this we use a data collator:

In [72]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
data_collator

DataCollatorWithPadding(tokenizer=BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
), padding=True, max_length=None, pad_to_multiple_of=None, return_tensors='pt')
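
Conceptually, a padding collator does something like the sketch below. This is a simplified stand-in with a hypothetical `collate_with_padding` helper and made-up features; the real `DataCollatorWithPadding` delegates to the tokenizer's padding logic and returns tensors.

```python
def collate_with_padding(features, pad_id=0):
    """Pad every example in one batch to the longest sequence of that batch."""
    target = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = target - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["label"])  # the model expects the key `labels`
    return batch

# Hypothetical tokenized examples of different lengths
features = [
    {"input_ids": [101, 7592, 102], "label": 1},
    {"input_ids": [101, 7592, 2088, 999, 102], "label": 0},
]

padded_batch = collate_with_padding(features)
print(padded_batch)
```

Because the target length is computed per batch, short batches stay short instead of being padded to the longest sequence in the whole dataset.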

Let's create a sample batch of tokenized sequences:

In [105]:
samples = tokenized_datasets["train"][:8] # Pick first 8 samples
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]} # Remove unnecessary columns
samples.keys()

dict_keys(['label', 'input_ids', 'token_type_ids', 'attention_mask'])

We can now feed the batch into the data collator and it will pad the sequences to the length of the longest sequence in the batch:

In [74]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}

Notice how all sequences now have the same length, `67`. This should be the length of the longest sequence in the batch. Let's confirm:

In [107]:
max([len(x) for x in tokenized_datasets["train"][:8]["input_ids"]])

67

Correct! We're good to go. Let's set up our data from scratch: load the dataset, tokenize it, and create the data collator:

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

# Load the tokenizer
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Load dataset
raw_datasets = load_dataset("glue", "mrpc")

# Tokenize dataset
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Wrap the tokenized dataset with the DataCollatorWithPadding 
# (to make sure batches have same sequence length)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Now we load the model we want to fine-tune, and specify the number of labels we want to predict:

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Loading the model emits a warning that `classifier.weight` and `classifier.bias` were newly initialized. Since the pretrained checkpoint didn't have a classification head (or at least not one for two labels), a new head with random weights was added. We still leverage all the pretrained weights of the model body, but the classification head will be trained from scratch.

Now we initialize the trainer:

In [114]:
from transformers import Trainer
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator, # defaults to DataCollatorWithPadding if not provided
    tokenizer=tokenizer
)
training_args, trainer

(TrainingArguments(
 _n_gpu=1,
 accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
 adafactor=False,
 adam_beta1=0.9,
 adam_beta2=0.999,
 adam_epsilon=1e-08,
 auto_find_batch_size=False,
 average_tokens_across_devices=False,
 batch_eval_metrics=False,
 bf16=False,
 bf16_full_eval=False,
 data_seed=None,
 dataloader_drop_last=False,
 dataloader_num_workers=0,
 dataloader_persistent_workers=False,
 dataloader_pin_memory=True,
 dataloader_prefetch_factor=None,
 ddp_backend=None,
 ddp_broadcast_buffers=None,
 ddp_bucket_cap_mb=None,
 ddp_find_unused_parameters=None,
 ddp_timeout=1800,
 debug=[],
 deepspeed=None,
 disable_tqdm=False,
 dispatch_batches=None,
 do_eval=False,
 do_predict=False,
 do_train=False,
 eval_accumulation_steps=None,
 eval_delay=0,
 eval_do_concat_batches=True,
 eval_on_start=False,
 eval_steps=None,
 eval_s

And we can finally train the model:

In [115]:
trainer.train()

Step,Training Loss
500,0.6046
1000,0.4762


TrainOutput(global_step=1377, training_loss=0.49980796309771014, metrics={'train_runtime': 172.5748, 'train_samples_per_second': 63.764, 'train_steps_per_second': 7.979, 'total_flos': 405114969714960.0, 'train_loss': 0.49980796309771014, 'epoch': 3.0})

Training has finished and the training loss has decreased (from 0.60 at step 500 to 0.48 at step 1000). We can now evaluate the model on the validation set with the trainer's `predict()` method (we could do it manually, but this is easier because it handles batching and padding for us):

In [118]:
predictions = trainer.predict(tokenized_datasets["validation"])
predictions.predictions.shape, predictions.label_ids.shape

((408, 2), (408,))

The resulting object has a `predictions` array with the `2` logits for each of the `408` sequences in the validation set, as well as a `label_ids` array with the ground truth label for each sequence.

Let's check the logits, predictions, and ground truth labels for some results:

In [129]:
import numpy as np

(
    predictions.predictions[0], np.argmax(predictions.predictions[0], axis=-1), predictions.label_ids[0],
    predictions.predictions[1], np.argmax(predictions.predictions[1], axis=-1), predictions.label_ids[1],
    predictions.predictions[2], np.argmax(predictions.predictions[2], axis=-1) , predictions.label_ids[2]
)

(array([-1.8311533,  2.2306285], dtype=float32),
 np.int64(1),
 np.int64(1),
 array([ 1.3678718, -1.7218637], dtype=float32),
 np.int64(0),
 np.int64(0),
 array([-1.6693746,  2.0803952], dtype=float32),
 np.int64(1),
 np.int64(0))
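
Logits are not probabilities, but a softmax turns them into probabilities. Here is a small sketch in plain Python, using the logits of the first validation example copied from the output above:

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits of the first validation example, copied from the output above
first_logits = [-1.8311533, 2.2306285]
probs = softmax(first_logits)
predicted = probs.index(max(probs))

print(probs)
print(predicted)
```

Since softmax preserves the ordering of the logits, taking the `argmax` of the logits directly gives the same predicted label.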

In the results above, the first two predictions are correct and the last one isn't. Let's count how many predictions are correct overall. First we compute the predicted labels for the entire validation set:

In [130]:
predicted_labels = np.argmax(predictions.predictions, axis=-1)
predicted_labels

array([1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
       1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1,
       1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1,

We can now compare against the ground truth labels:

In [133]:
np.sum(predicted_labels == predictions.label_ids) / len(predicted_labels)

np.float64(0.821078431372549)

Our model predicts the correct label for about 82% of the validation set.

A better way to evaluate performance is to use the `evaluate` library. Loading the metric for `glue`/`mrpc` gives us the metrics associated with this benchmark, accuracy and F1:

In [141]:
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=predicted_labels, references=predictions.label_ids)

{'accuracy': 0.821078431372549, 'f1': 0.8773109243697479}
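
To see what these two numbers measure, here is a hand-rolled sketch of accuracy and F1. It assumes F1 is computed for the positive class (label `1`), and the predictions and labels below are made up rather than taken from the run above.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Accuracy plus F1 for the positive class, computed from two label lists."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)  # true positives
    fp = sum(p == positive and r != positive for p, r in pairs)  # false positives
    fn = sum(p != positive and r == positive for p, r in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}

# Hypothetical predictions and labels, not taken from the run above
toy_scores = accuracy_and_f1([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
print(toy_scores)
```

F1 combines precision and recall, so it is more informative than accuracy when the classes are imbalanced.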

If we write a function that accepts a tuple of logits and ground truth labels and returns the evaluation metrics, we can pass it to the `Trainer`:

In [143]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
compute_metrics((predictions.predictions, predictions.label_ids))

{'accuracy': 0.821078431372549, 'f1': 0.8773109243697479}

Let's train again, now with metrics:

In [144]:
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.389362,0.845588,0.890815
2,0.514800,0.533532,0.835784,0.886248
3,0.286600,0.718473,0.835784,0.884682


TrainOutput(global_step=1377, training_loss=0.3247851037251923, metrics={'train_runtime': 178.2742, 'train_samples_per_second': 61.725, 'train_steps_per_second': 7.724, 'total_flos': 405114969714960.0, 'train_loss': 0.3247851037251923, 'epoch': 3.0})

Notice how the training process now reports the validation loss at the end of each epoch (because of `evaluation_strategy="epoch"`), along with the accuracy and F1 score returned by `compute_metrics`.

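Now let's write a full training loop by hand instead of relying on `Trainer`. The next cells reload and tokenize the data, remove the columns the model doesn't accept, build the dataloaders, load the model, and set up the optimizer, learning rate scheduler and device: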

In [145]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


In [146]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

In [147]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

In [148]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 81]),
 'token_type_ids': torch.Size([8, 81]),
 'attention_mask': torch.Size([8, 81])}
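The sequence length of 81 above is not fixed: it changes from batch to batch, because `DataCollatorWithPadding` pads each batch only up to its longest sequence. Here is a rough pure-Python sketch of that idea. The helper `pad_batch` and the toy token ids are hypothetical, and I'm assuming a padding id of `0`, which is what the BERT tokenizer uses; the real collator also handles `token_type_ids` and converts everything to tensors.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy sequences of different lengths (ids made up for illustration)
toy_batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
padded = pad_batch(toy_batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch instead of to a global maximum length keeps the tensors as small as possible, which saves compute on short batches.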

In [149]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [150]:
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(0.9363, grad_fn=<NllLossBackward0>) torch.Size([8, 2])
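The `NllLossBackward0` in the output hints that the loss is a cross-entropy computed from the logits and the `labels` in the batch. As a sanity check on the magnitude, here is a small pure-Python sketch of mean cross-entropy. The function `cross_entropy` is a hypothetical helper written for this note, not the library's implementation.

```python
import math


def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch of logit rows and integer labels."""
    total = 0.0
    for row, label in zip(logits, labels):
        # Numerically stable log-sum-exp: subtract the row maximum first
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        # Negative log-probability assigned to the correct class
        total += log_sum_exp - row[label]
    return total / len(labels)


# A classifier that is undecided between two classes gives ln(2) per example
uninformed_loss = cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1])
print(uninformed_loss)
```

An undecided two-class model scores about 0.693, so a value like 0.9363 is plausible for a freshly initialized classification head that has not been trained yet.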


In [151]:
# transformers.AdamW is deprecated (and removed in recent releases); use PyTorch's implementation
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)



In [152]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

1377
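Where does 1377 come from? The MRPC train split has 3,668 examples, so with a batch size of 8 we get 459 batches per epoch (the last one is smaller), times 3 epochs. The sketch below reproduces that arithmetic and shows how a linear schedule without warmup decays the learning rate to zero. The function `linear_lr` is a hypothetical helper that mirrors my understanding of the `"linear"` schedule; it is not the library code.

```python
import math


def linear_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining


train_examples = 3668  # size of the GLUE MRPC train split
batch_size = 8
num_epochs = 3

# The last, smaller batch still counts as a step, hence the ceiling
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = num_epochs * steps_per_epoch
print(steps_per_epoch, total_steps)

# Learning rate at the start, the midpoint, and the end of training
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, total_steps, base_lr=5e-5))
```

Since `num_warmup_steps=0`, training starts at the full learning rate of `5e-5` and loses the same small amount at every step.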


In [153]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

In [154]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

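model.train()  # enable training mode (e.g. dropout is active)
for epoch in range(num_epochs):
    for batch in train_dataloader:
        # Move every tensor in the batch to the same device as the model
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()  # compute gradients

        optimizer.step()  # update the weights
        lr_scheduler.step()  # decay the learning rate
        optimizer.zero_grad()  # reset gradients for the next batch
        progress_bar.update(1)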


In [155]:
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.8627450980392157, 'f1': 0.9047619047619048}

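Now let's adapt the loop to the `accelerate` library, which takes care of device placement (and distributed setups) for us. The first cell below repeats the plain PyTorch loop for reference. The second cell is the same loop with Accelerate: there are no manual `.to(device)` calls, the dataloaders, model, and optimizer go through `accelerator.prepare`, and `accelerator.backward(loss)` replaces `loss.backward()`: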

In [156]:
from torch.optim import AdamW  # transformers.AdamW is deprecated; use PyTorch's implementation
from transformers import AutoModelForSequenceClassification, get_scheduler

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [157]:
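from accelerate import Accelerator
from torch.optim import AdamW  # transformers.AdamW is deprecated; use PyTorch's implementation
from transformers import AutoModelForSequenceClassification, get_scheduler

# The Accelerator detects the available device(s) and handles placement for us
accelerator = Accelerator()

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

# prepare() wraps these objects so that batches and parameters end up on the right device
train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)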

num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
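    for batch in train_dl:
        # No manual .to(device): the prepared dataloader already places each batch
        outputs = model(**batch)
        loss = outputs.loss
        # accelerator.backward replaces the usual loss.backward()
        accelerator.backward(loss)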

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

