Run locally or <a target="_blank" href="https://colab.research.google.com/github/aalgahmi/dl_handouts/blob/main/.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Using Hugging Face Transformers

This notebook provides a quick introduction to the Hugging Face Transformers Python library, which can be used to access and fine-tune pre-trained transformer models such as BERT, GPT, Llama, and many more. The library supports both PyTorch and TensorFlow, with PyTorch being its preferred deep learning framework. The Hugging Face Transformers package can return PyTorch modules that can be trained (for fine-tuning purposes) like any other PyTorch model we've used in this class.

The `transformers` package provides access to both the transformer models and their model-specific tokenizers. For common NLP tasks, there are pipelines that wrap both the model and its tokenizer behind a simple API. For more control, one can use the models and their specialized tokenizers directly. More information can be found on the [Hugging Face documentation page](https://huggingface.co/docs/transformers/index).

To get started, uncomment the following line to install the needed packages.

In [1]:
# !pip install transformers datasets sentencepiece -q

In [2]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from matplotlib import pyplot as plt

import warnings

warnings.filterwarnings("ignore", category=UserWarning)

## Using pipelines

Hugging Face transformer models are published in a common place called the [hub](https://huggingface.co/models) and organized by task. The easiest way to use these models is through the `pipeline` function. Let's see a few examples.

First, we import the `pipeline` function from the `transformers` package.

In [3]:
from transformers import pipeline


The `pipeline` function allows us to specify an NLP task and returns the default transformer model for that task. We could also request a specific model from the [hub](https://huggingface.co/models).

Many of the following examples use the text below, which is an excerpt from War and Peace by Leo Tolstoy.

In [4]:
text = """Prince Vassily always spoke languidly, like an actor repeating his part in an \
old play. Anna Pavlovna Scherer, in spite of her forty years, was on the contrary \
brimming over with excitement and impulsiveness. To be enthusiastic had become her pose \
in society, and at times even when she had, indeed, no inclination to be so, she was \
enthusiastic so as not to disappoint the expectations of those who knew her. The \
affected smile which played continually about Anna Pavlovna’s face, out of keeping as it \
was with her faded looks, expressed a spoilt child’s continual consciousness of a \
charming failing of which she had neither the wish nor the power to correct herself, \
which, indeed, she saw no need to correct.
"""

### Text classification: Sentiment analysis

In [5]:
classifier = pipeline("text-classification")

outputs = classifier(text)
outputs

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.888614296913147}]

We can use Pandas' DataFrame to display the output in a table.

In [6]:
pd.DataFrame(outputs) 

| | label | score |
|---|---|---|
| 0 | POSITIVE | 0.888614 |


It's important to acknowledge that many of the available models may exhibit biases or prejudices against certain peoples or groups, reflecting the inherent bias or prejudice in the data they were pre-trained on. This issue significantly impacts many of these models and warrants recognition and action. Here is an example highlighting such bias.

In [7]:
outputs = classifier([
    "I am from Iraq.",
    "I am from Spain."
])
pd.DataFrame(outputs) 

| | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.981107 |
| 1 | POSITIVE | 0.98889 |


As mentioned above, we can pass a specific model from the hub to the `pipeline` function instead of relying on the default model for the task.

In [8]:
model_name = "huggingface/distilbert-base-uncased-finetuned-mnli"
classifier_mnli = pipeline("text-classification", model=model_name)
classifier_mnli("She loves me. [SEP] She loves me not.")

[{'label': 'contradiction', 'score': 0.9790192246437073}]

Notice the use of the special token `[SEP]`, which BERT-style models use to separate sentences.

### Named entity recognition (NER)

Let's use an NER model to identify the named entities (here, people) in the above text:

In [9]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | PER | 0.972958 | Prince Vassily | 0 | 14 |
| 1 | PER | 0.966766 | Anna Pavlovna Scherer | 88 | 109 |
| 2 | PER | 0.978406 | Anna Pavlovna | 460 | 473 |


### Question answering

We can also ask a model a question about the above text:

In [10]:
reader = pipeline("question-answering")
question = "What does Prince Vassily usually do?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs]) 

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.948859 | 22 | 37 | spoke languidly |


### Text summarization

Let's also ask a model to summarize the above text for us:

In [11]:
summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=100, clean_up_tokenization_spaces=True)
outputs[0]['summary_text']

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


' Anna Pavlovna Scherer was brimming over with excitement and impulsiveness. To be enthusiastic had become her pose in society, and at times even when she had, indeed, no inclination to be so, she was enthusiastic. Prince Vassily always spoke languidly, like an actor repeating his part in an old play.'

### Machine translation
For translation, the University of Helsinki released many translation models. To access them, use the following pattern: 
* Task name: `"translation_{src}_to_{trg}"`
* Model name: `"Helsinki-NLP/opus-mt-{src}-{trg}"`

where `{src}` is the source language and `{trg}` is the target language. Here are two examples: English-to-German and English-to-Arabic.
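
The naming pattern above can be captured in a small helper. This is only a sketch: `opus_mt_names` is a hypothetical function name, and it merely formats the two strings. It does not check that a model for the given language pair actually exists on the hub.

```python
def opus_mt_names(src: str, trg: str) -> tuple[str, str]:
    """Build the pipeline task name and the Helsinki-NLP model name for a language pair.

    Hypothetical helper for illustration: it only formats strings and does not
    verify that the model exists on the hub.
    """
    task = f"translation_{src}_to_{trg}"
    model = f"Helsinki-NLP/opus-mt-{src}-{trg}"
    return task, model


opus_task, opus_model = opus_mt_names("en", "de")
print(opus_task)   # translation_en_to_de
print(opus_model)  # Helsinki-NLP/opus-mt-en-de
```

The two strings can then be passed as the task and `model` arguments of `pipeline`.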

In [12]:
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)

outputs[0]['translation_text']

"Fürst Vassily sprach immer schlampig, wie ein Schauspieler wiederholt seine Rolle in einem alten Stück. Anna Pavlovna Scherer, trotz ihrer vierzig Jahre, war im Gegenteil über vor Aufregung und Impulsivität. Enthusiastisch zu sein war ihre Pose in der Gesellschaft geworden, und zu Zeiten, als sie, in der Tat, keine Neigung, so zu sein, sie war begeistert, um nicht zu enttäuschen die Erwartungen derer, die sie kannten. Das betroffene Lächeln, das ständig spielte über Anna Pavlovna's Gesicht, aus der Haltung, wie es war mit ihrem verblassten Blick, drückte ein verwöhntes Kind kontinuierlich Bewusstsein von einem charmanten Versagen, von denen sie weder den Wunsch noch die Macht, sich selbst zu korrigieren, die, in der Tat, sie sah keine Notwendigkeit zu korrigieren."

In [13]:
translator = pipeline("translation_en_to_ar", 
                      model="Helsinki-NLP/opus-mt-en-ar")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)

outputs[0]['translation_text']

'كانت آنا بافلوفنا شيرر، على الرغم من مرور أربعين عاماً على إنشائها، على العكس من ذلك، تتأرجح بالإثارة والاستعجال. لقد أصبح الحماس وضعها في المجتمع، وفي بعض الأوقات حتى عندما لم يكن لديها، في الواقع، أي رغبة في أن تكون كذلك، كانت متحمسة حتى لا تخيب آمال من يعرفونها. وكانت الابتسامة المتأثرة التي لعبت باستمرار حول وجه آنا بافلوفنا، بعيداً عن الثبات كما كان مع مظهرها المتلاشى، تعبر عن إدراك طفل ثمل المستمر لفشل ساحر لم يكن لديها الرغبة ولا القدرة على تصحيح نفسها، وهو ما لم تر ضرورة لتصحيحه.'

### Text generation

GPT models are good for text generation. Here is an example that uses DistilGPT2, a distilled version of GPT-2: we give it a prompt and ask it to generate 3 responses.

In [14]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
outputs = generator(
    "This course will teach you how",
    max_length=60,
    num_return_sequences=3,
)
for i, response in enumerate(outputs):
    print(f"{i + 1}: {response['generated_text']}\n")

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


1: This course will teach you how to get started on making an app. After you’ve built it’t really needed to do this. Please take care’s sake, as it takes a lot of work, and it’ll take very long to finish it.



2: This course will teach you how to develop a very fast and easy and effective technique known as a 'dynamic' technique used to create the illusion of the self inside of your brain. You can apply various techniques over the course of seven days to make it work, it will take an additional 4-

3: This course will teach you how to use the Windows 10 Developer Preview to learn more about the feature and how you can get the app to run.


You can also find a link for all of the demos on this blog.
You can also enjoy the latest Windows 10 Developer Preview video in



### Filling in blanks with BERT
On the other hand, BERT models are good at filling in the blanks. Here is an example.

In [15]:
fill_masker = pipeline("fill-mask", model="bert-base-uncased")  # name the task explicitly
outputs = fill_masker("What is this an [MASK] of?")
pd.DataFrame(outputs)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


| | score | token | token_str | sequence |
|---|---|---|---|---|
| 0 | 0.289291 | 2742 | example | what is this an example of? |
| 1 | 0.120134 | 2552 | act | what is this an act of? |
| 2 | 0.057876 | 3277 | issue | what is this an issue of? |
| 3 | 0.056309 | 6013 | instance | what is this an instance of? |
| 4 | 0.045756 | 7526 | explanation | what is this an explanation of? |


## Using tokenizers and models directly

Pipelines are a convenient way to work with Hugging Face's models and tokenizers, as they handle many details about models and their tokenizers automatically. However, if you need more control over the process, you can work directly with models and tokenizers.

To get started, you need to find a model on the Hub that fits the task you want to perform. Each model comes with its own tokenizer, which performs the following tasks:

* Convert strings into lists of vocabulary IDs that the model requires.
* Convert the model's predictions into meaningful outputs.
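
To make these two roles concrete, here is a toy word-level tokenizer. It is only an illustration: the vocabulary and the names `toy_vocab`, `encode`, and `decode` are made up, and real tokenizers split text into subword units and add special tokens.

```python
# A made-up, word-level vocabulary (real tokenizers use subword units).
toy_vocab = {"<unk>": 0, "i": 1, "like": 2, "soccer": 3, "we": 4, "love": 5}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}


def encode(text: str) -> list[int]:
    """Map a string to a list of vocabulary ids; unknown words map to <unk>."""
    return [toy_vocab.get(word, toy_vocab["<unk>"]) for word in text.lower().split()]


def decode(ids: list[int]) -> str:
    """Map a list of vocabulary ids back to a string."""
    return " ".join(id_to_token[i] for i in ids)


toy_ids = encode("I like soccer")
print(toy_ids)          # [1, 2, 3]
print(decode(toy_ids))  # i like soccer
```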

Here is a simple example that shows how to use models and tokenizers directly:

In [16]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "siebert/sentiment-roberta-large-english"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

Next, we use the tokenizer to tokenize some text:

In [17]:
token_ids = tokenizer(["I like soccer. [SEP] We all love soccer!",
                       "Joe lived for a very long time. [SEP] Joe is old."],
                      padding=True, return_tensors="pt")
token_ids

{'input_ids': tensor([[    0,   100,   101,  4191,     4,   646,  3388,   510,   742,   166,
            70,   657,  4191,   328,     2,     1,     1,     1],
        [    0, 18393,  3033,    13,    10,   182,   251,    86,     4,   646,
          3388,   510,   742,  2101,    16,   793,     4,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

As you can see, the tokenizer returns both the token ids and an attention mask. The mask consists of 1's and 0's that indicate whether the corresponding position holds a real token (1) or padding (0).
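
The relationship between padding and the attention mask can be sketched in plain Python. The helper name `pad_batch` is hypothetical. The unpadded ids are copied from the output above, where 1 is the padding id.

```python
def pad_batch(sequences: list[list[int]], pad_id: int) -> tuple[list[list[int]], list[list[int]]]:
    """Right-pad every sequence to the longest one and build the matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Unpadded ids copied from the tokenizer output above; 1 is the padding id there.
batch = [
    [0, 100, 101, 4191, 4, 646, 3388, 510, 742, 166, 70, 657, 4191, 328, 2],
    [0, 18393, 3033, 13, 10, 182, 251, 86, 4, 646, 3388, 510, 742, 2101, 16, 793, 4, 2],
]
padded_ids, mask = pad_batch(batch, pad_id=1)
print(padded_ids[0][-3:], mask[0][-3:])  # [1, 1, 1] [0, 0, 0]
```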

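Notice that the tokenizer can return different kinds of tensors; for PyTorch tensors, use `return_tensors="pt"`. Notice also what happened to `[SEP]`: it is a BERT special token that this RoBERTa tokenizer does not recognize, so it was split into ordinary subword tokens (ids 646, 3388, 510, and 742 above). A more robust way to encode a sentence pair is to pass it as a tuple and let the tokenizer insert its own separator tokens (id 2 here):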

In [18]:
token_ids = tokenizer([("I like soccer.", "We all love soccer!"),
                       ("Joe lived for a very long time.", "Joe is old.")],
                      padding=True, return_tensors="pt")
token_ids

{'input_ids': tensor([[    0,   100,   101,  4191,     4,     2,     2,   170,    70,   657,
          4191,   328,     2,     1,     1,     1],
        [    0, 18393,  3033,    13,    10,   182,   251,    86,     4,     2,
             2, 18393,    16,   793,     4,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

If you pass `return_token_type_ids=True` when calling the tokenizer, you also get an extra tensor called `token_type_ids` that indicates which sentence each token belongs to. Some models need it, but not the one we are currently using, which is why it contains only zeros below.

In [19]:
token_ids = tokenizer(["I like soccer. [SEP] We all love soccer!",
                       "Joe lived for a very long time. [SEP] Joe is old."],
                      padding=True, return_tensors="pt", return_token_type_ids=True)
token_ids

{'input_ids': tensor([[    0,   100,   101,  4191,     4,   646,  3388,   510,   742,   166,
            70,   657,  4191,   328,     2,     1,     1,     1],
        [    0, 18393,  3033,    13,    10,   182,   251,    86,     4,   646,
          3388,   510,   742,  2101,    16,   793,     4,     2]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
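
For comparison, here is a sketch of what BERT-style token type ids look like for a sentence pair. It assumes the usual `[CLS] A [SEP] B [SEP]` layout, with 0 for the first segment and 1 for the second. The function name is hypothetical and the token counts are arbitrary.

```python
def bert_style_token_type_ids(n_tokens_a: int, n_tokens_b: int) -> list[int]:
    """Segment ids for `[CLS] A [SEP] B [SEP]`: 0 for the first segment, 1 for the second."""
    first = [0] * (1 + n_tokens_a + 1)  # [CLS] + tokens of A + [SEP]
    second = [1] * (n_tokens_b + 1)     # tokens of B + [SEP]
    return first + second


print(bert_style_token_type_ids(4, 5))
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```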

The token ids generated by the tokenizer can be passed to the model to make predictions:

In [20]:
outputs = model(**token_ids)
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-3.7695,  2.9359],
        [ 2.1718, -1.4990]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

Models like this usually return logits. You can think of logits as "un-normalized" probabilities that don't add up to one. To convert them to normalized probabilities, we apply the softmax function across the class dimension.

In [21]:
probs = torch.softmax(outputs.logits, dim=1)  # normalize across the classes of each input
probs

tensor([[0.0012, 0.9988],
        [0.9752, 0.0248]], grad_fn=<SoftmaxBackward0>)
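
As a cross-check on the normalization, the same computation can be done by hand. The sketch below is dependency-free Python (the `softmax` helper is written for illustration, not taken from any library) and uses the logits printed above. The probabilities for each input should sum to one across its classes.

```python
import math


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over a single row of logits."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output above: one row per input, one column per class.
logits = [[-3.7695, 2.9359], [2.1718, -1.4990]]
for row in logits:
    probs_row = softmax(row)
    print([round(p, 4) for p in probs_row], "sum =", round(sum(probs_row), 4))
```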

Now we can make predictions:

In [22]:
pred = torch.argmax(probs, dim=1)
pred  # 0 = NEGATIVE, 1 = POSITIVE (see model.config.id2label)

tensor([1, 0])

## Fine-tuning a model
 
Transfer learning in NLP involves utilizing a pre-trained model, such as the one we just used, and fine-tuning it using our own data to adapt it to a specific task. Since the models returned by Hugging Face are regular PyTorch models with a few additional methods and parameters, we can train these models just as we have done many times in this class.

In this example, we will fine-tune a DistilBERT model to perform sentiment analysis on the IMDB reviews dataset, which we have encountered before. Here is the model we will be fine-tuning along with its tokenizer. We will set `num_labels=2`, which makes this a classification problem with two classes: 0 for `NEGATIVE` and 1 for `POSITIVE`.

In [23]:
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We will begin by downloading the dataset using Hugging Face's `datasets` package. To keep training fast, we will use only a small portion of it and truncate each review to its first 60 whitespace-separated words.

In [24]:
from datasets import load_dataset, DatasetDict

full_imdb = load_dataset("imdb")

# Keep only the first 60 whitespace-separated words of each review (for speed on CPU)
def truncate(example):
    return {
        'text': " ".join(example['text'].split()[:60]),
        'label': example['label']
    }

# Shuffle once with a fixed seed, then take 256 examples for training and the next 64 for validation
imdb_ds = DatasetDict(
    train=full_imdb['train'].shuffle(seed=17).select(range(256)).map(truncate),
    validation=full_imdb['train'].shuffle(seed=17).select(range(256, 320)).map(truncate),
)

imdb_ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 256
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 64
    })
})
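
Both subsets are drawn from the IMDB `train` split with the same shuffle seed, so they come from the same permutation and the two index ranges cannot overlap. The sketch below illustrates that idea with Python's `random` module. This is an assumption made for illustration only: it does not reproduce the actual permutation used by the `datasets` library.

```python
import random

n_examples = 25_000                 # size of the IMDB train split
indices = list(range(n_examples))
random.Random(17).shuffle(indices)  # a fixed seed gives the same permutation every time

train_idx = indices[:256]           # analogous to .select(range(256))
val_idx = indices[256:320]          # analogous to .select(range(256, 320))

print(len(train_idx), len(val_idx))        # 256 64
print(set(train_idx).isdisjoint(val_idx))  # True
```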

As you can see, the dataset has two splits: training and validation. Here are the first 5 training examples:

In [25]:
imdb_ds['train'][:5]

{'text': ['This version of "Moby Dick" insults the audience by claiming it is based on Melville\'s novel-even going so far as to show a phony first chapter sentence rather than the famous "Call me Ishmael". In addition to having atrocious acting, even from John Barrymore,this is perhaps the greatest example of how far Hollywood (especially early Hollywood) would go to revise',
  "(spoilers??)<br /><br />I wasn't sure what to think of the movie. Not too much of a kids film. Definately should be watched with a parent because it includes death and dying. But I was surprised that I was a bit entertained by it.<br /><br />I was a bit disappointed by the 81 minutes of time we had. (even less without",
  'this film was almost a great imaginative film. A mixture of shakespeare, pop, jazz, and faerie tales. This movie was an imaginative twist on the Cinderella theme. Featuring a strong cast, headed by the perfectly cast Kathleen Turner, this movie had everything going for it. Everything but pro

Before tokenizing this dataset, let's look at what the tokenizer loaded above does. Here is an example:

In [26]:
tokens = tokenizer("This movie is underrated. I didn't expect it to be this good.", padding="max_length", truncation=True, 
                     return_tensors="pt", max_length=120)
tokens

{'input_ids': tensor([[ 101, 2023, 3185, 2003, 2104, 9250, 1012, 1045, 2134, 1005, 1056, 5987,
         2009, 2000, 2022, 2023, 2204, 1012,  102,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

As you can see, the tokenizer converts tokens (words or pieces of words) into indices, referred to as `input_ids` here. It also produces an `attention_mask` tensor of ones and zeros, depending on whether the id is a real token (1) or padding (0).

Here are the shapes of these `input_ids` and masks:

In [27]:
tokens['input_ids'].shape, tokens['attention_mask'].shape

(torch.Size([1, 120]), torch.Size([1, 120]))

Both tensors have an extra leading dimension of size 1 (the batch dimension); we'll need to squeeze it out.

Next, we tokenize this dataset using the above tokenizer. To accomplish this, let's create two functions: one called `tokenize()` for the actual tokenization, and another called `squeeze()` for squeezing the tokenized reviews and their attention masks into the correct shape before adding them to a data loader. We'll also set the maximum length of the tokenized reviews to 120 and enable padding.

In [28]:
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, 
                     return_tensors="pt", max_length=120)

def squeeze(batch):    
    return {
        "text": batch["text"], "input_ids": batch["input_ids"][0], 
        "attention_mask": batch["attention_mask"][0], "label": batch["label"]
    }

Next, we prepare the dataset by calling these two functions using the `map` method. Additionally, we:
* Remove the original `text` field.
* Rename the `label` field to `labels`, as expected by the model.
* Set the dataset format to `torch`.

In [29]:
imdb_encoded = imdb_ds.map(tokenize).map(squeeze)
imdb_encoded = imdb_encoded.remove_columns(["text"])
imdb_encoded = imdb_encoded.rename_column("label", "labels")
imdb_encoded.set_format(type='torch')
imdb_encoded

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 256
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 64
    })
})

Here are the shapes of all the fields in this prepared dataset.

In [30]:
[ f"{k}: {v.shape}" for k, v in imdb_encoded['train'][0].items()]

['labels: torch.Size([])',
 'input_ids: torch.Size([120])',
 'attention_mask: torch.Size([120])']

Next, we create two PyTorch data loaders: one for training and another for validation.

In [31]:
from torch.utils.data import DataLoader

dl_train = DataLoader(imdb_encoded['train'], batch_size=32, shuffle=True)
dl_val = DataLoader(imdb_encoded['validation'], batch_size=32, shuffle=False)
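
With these settings, the number of batches per epoch follows directly from the dataset and batch sizes. A quick sketch of the arithmetic, assuming the default behavior in which a final partial batch is kept rather than dropped:

```python
import math

n_train, n_val, batch_size = 256, 64, 32

train_batches = math.ceil(n_train / batch_size)  # what len(dl_train) should report
val_batches = math.ceil(n_val / batch_size)      # what len(dl_val) should report

print(train_batches, val_batches)  # 8 2
print(5 * train_batches)           # 40 optimizer steps over 5 epochs
```

The last number is the total step count for the 5 epochs used in the training loop later in this notebook.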

Here is what the first training batch looks like:

In [32]:
next(iter(dl_train))

{'labels': tensor([0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
         0, 1, 0, 0, 0, 0, 0, 0]),
 'input_ids': tensor([[  101,  4895, 12053,  ...,     0,     0,     0],
         [  101,  1045,  2031,  ...,     0,     0,     0],
         [  101,  2023,  2003,  ...,     0,     0,     0],
         ...,
         [  101,  2054,  1037,  ...,     0,     0,     0],
         [  101,  2984, 10223,  ...,     0,     0,     0],
         [  101,  2821,  2009,  ...,     0,     0,     0]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]])}

We are now ready to train (fine-tune) this model. We'll do that for 5 epochs using the AdamW optimizer with a weight decay of 0.01 and a small learning rate ($5 \times 10^{-5}$). AdamW is imported from `torch.optim` here because the implementation in the `transformers` package is deprecated. As for the loss, we'll use whatever is included as part of the model output.

In [33]:
from torch.optim import AdamW
from tqdm.notebook import tqdm

n_epochs = 5
n_training_steps = n_epochs * len(dl_train)
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

progress_bar = tqdm(range(n_training_steps))
for epoch in range(n_epochs):
    # Training
    model.train()
    for batch_i, batch in enumerate(dl_train):
        output = model(**batch)
        
        optimizer.zero_grad()
        output.loss.backward() # Using the loss that the model outputs
        optimizer.step()
        
        progress_bar.update(1)
    
    # Validation
    model.eval()
    val_loss = 0.0
    for batch_i, batch in enumerate(dl_val):
        with torch.no_grad():
            output = model(**batch)
        val_loss += output.loss.item()  # accumulate a plain float rather than a tensor
    
    avg_val_loss = val_loss / len(dl_val)
    print(f"Epoch {epoch}: val_loss={avg_val_loss: .4f}")




Epoch 0: val_loss= 0.6715
Epoch 1: val_loss= 0.6167
Epoch 2: val_loss= 0.6369
Epoch 3: val_loss= 0.6139
Epoch 4: val_loss= 0.7929
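
The validation loss is not monotone here: it is lowest at epoch 3 and rises at epoch 4, which on such a small training set may indicate the start of overfitting. A common remedy, not used in this notebook, is to keep the checkpoint with the lowest validation loss. The sketch below only picks that epoch from the logged numbers; saving and restoring weights is left out.

```python
# Validation losses copied from the training log above (list index = epoch).
val_losses = [0.6715, 0.6167, 0.6369, 0.6139, 0.7929]

# Index of the smallest validation loss.
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
print(best_epoch, val_losses[best_epoch])  # 3 0.6139
```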


Having fine-tuned this model, let's test it:

In [34]:
review = "The acting in this move was OK, but the CGI was great!"

tokens = tokenizer(review, return_tensors="pt")
out = torch.argmax(model(**tokens).logits)
print(out, "=>", ["NEGATIVE", "POSITIVE"][out])

tensor(1) => POSITIVE


## Where to go from here
Here are additional resources you can check out to learn more about Hugging Face:
* [Hugging Face docs](https://huggingface.co/docs/transformers/index)
* [Hugging Face Course](https://huggingface.co/course/chapter1/1)
* [Natural Language Processing with Transformers](https://learning.oreilly.com/library/view/natural-language-processing/9781098136789/): an O'Reilly Book you can access for free through the Library.

Happy learning!