# **Hugging Face NLP Course - Chapter 1**

Source: https://huggingface.co/learn/nlp-course/chapter1/1

---

Helpful links:

- Deep Learning book: [Deep Learning for Coders with Fastai and Pytorch: AI Applications Without a PhD](https://github.com/fastai/fastbook)
- Datasets: [Kaggle](https://www.kaggle.com/)

---

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

In [None]:
from transformers import pipeline

## **1. Transformer Models**

### **1.1. Natural Language Processing (NLP)**

Common NLP tasks, with some examples:
- __Classifying whole sentences:__
  getting the sentiment of a review, detecting if an email is spam, determining if a sentence is grammatically correct or whether two sentences are logically related or not
- __Classifying each word in a sentence:__
  identifying the grammatical components of a sentence (noun, verb, adjective), or the named entities (person, location, organization)
- __Generating text content:__
  completing a prompt with auto-generated text, filling in the blanks in a text with masked words
- __Extracting an answer from a text:__
  given a question and a context, extracting the answer to the question based on the information provided in the context
- __Generating a new sentence from an input text:__
  translating a text into another language, summarizing a text

NLP isn't limited to written text. Examples:
- Speech recognition:
  generating a transcript of an audio sample
- Computer vision:
  generating a description of an image

### **1.2. Transformers, what can they do?**

The `pipeline()` function connects a model with its necessary preprocessing and postprocessing steps.

Some of the currently [available pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.pipeline.task) are:

- `feature-extraction` (get the vector representation of a text)
- `fill-mask`
- `ner` (named entity recognition)
- `question-answering`
- `sentiment-analysis`
- `summarization`
- `text-generation`
- `translation`
- `zero-shot-classification`

By default, `pipeline()` selects a particular pretrained model that has been fine-tuned for the given task in English. The model is downloaded and cached when the object gets created. When rerunning the command, the cached model will be used instead and there is no need to download the model again.

You can pass a single text as well as several texts.
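
To make the three stages concrete, here is a minimal, library-free sketch of what a pipeline object does conceptually. Everything in it (`ToyPipeline`, the word-score table, the label names) is hypothetical and only illustrates the flow from preprocessing to model to postprocessing; it is not how `transformers` implements `pipeline()`.

```python
import math

# Hypothetical word scores standing in for a trained model's weights.
WORD_SCORES = {"love": 2.0, "great": 1.5, "hate": -2.0, "awful": -1.5}


class ToyPipeline:
    """Toy stand-in for a pipeline: preprocess, then model, then postprocess."""

    def preprocess(self, text):
        # Real pipelines use a tokenizer; here we just lowercase and split.
        return [w.strip(".,!?") for w in text.lower().split()]

    def model(self, tokens):
        # Produce two raw scores (logits): [negative, positive].
        score = sum(WORD_SCORES.get(t, 0.0) for t in tokens)
        return [-score, score]

    def postprocess(self, logits):
        # Softmax turns logits into probabilities; report the best label.
        exps = [math.exp(x) for x in logits]
        probs = [e / sum(exps) for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        return {"label": ["NEGATIVE", "POSITIVE"][best], "score": probs[best]}

    def __call__(self, inputs):
        # Accept a single text or a list of texts.
        if isinstance(inputs, str):
            return self.postprocess(self.model(self.preprocess(inputs)))
        return [self(text) for text in inputs]


toy = ToyPipeline()
single = toy("I love this course!")
batch = toy(["I love this course!", "I hate this so much!"])
print(single)
print(batch)
```

The caller only ever sees text going in and labeled scores coming out, which is the convenience the real `pipeline()` provides.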

#### **Sentiment analysis**

Classifies texts as positive or negative.

In [None]:
classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

In [None]:
# Pass several sentences
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

#### **Zero-shot classification**

Classifies texts that haven't been labelled. Allows you to specify which labels to use for the classification, so you don't have to rely on the labels of the pretrained model.

The pipeline is called _zero-shot_ because you don't need to fine-tune the model on your data to use it.

In [None]:
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

#### **Text generation**

Auto-completes provided prompts by generating the remaining text. Text generation involves randomness, so it's normal if you don't get the same results for the same input.

You can control how many different sequences are generated with the argument `num_return_sequences` and the total length of the output text with the argument `max_length`.

In [None]:
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

In [None]:
# Control number and length of sequences
generator(
    "In this course, we will teach you how to",
    num_return_sequences=2,
    max_length=15,
)
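
The randomness mentioned above comes from sampling: at each step the model produces a probability distribution over possible next tokens, and one of them is drawn at random. A minimal sketch with a made-up distribution (the candidate words and probabilities are invented for illustration, not real model output):

```python
import random

# Hypothetical next-token distribution; a real model computes this itself.
candidates = ["use", "build", "train", "understand"]
probabilities = [0.4, 0.3, 0.2, 0.1]


def sample_next_tokens(num_return_sequences, seed=None):
    """Draw one next token per requested sequence."""
    rng = random.Random(seed)
    return rng.choices(candidates, weights=probabilities, k=num_return_sequences)


# Without a seed, repeated calls can differ; with a fixed seed they match.
print(sample_next_tokens(2))
print(sample_next_tokens(2, seed=0))
print(sample_next_tokens(2, seed=0))
```

If reproducible output is needed with the real library, fixing the seed beforehand (for example with `transformers.set_seed`) is one option.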

#### **Using any model from the Hub in a pipeline**

You can choose a particular model from the [Model Hub](https://huggingface.co/models) to use in a pipeline for a specific task (use the tags to easily find supported models for your task). Provide the full model name (as shown on the page) to use it.

If you want to use a model that is not public but that you have access to, log in using the command `huggingface-cli login` (you will be asked to provide a token).

In [None]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
)

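In [None]:
# Gated model: uncomment after logging in and being granted access on the Hub.
# The "-hf" repository holds the weights in the format Transformers loads.
# !huggingface-cli login
#
# generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")
# generator(
#     "In this course, we will teach you how to",
#     max_length=30,
# )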

#### **Mask filling**

Fills in the blanks in a given text. The `top_k` argument controls how many possibilities you want to be displayed. Note that here the model fills in the special `<mask>` word, which is often referred to as a _mask token_. Other mask-filling models might have different mask tokens.

In [None]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

#### **Named entity recognition (NER)**

Finds which parts of the input text correspond to entities such as persons, locations, or organizations.

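The option `grouped_entities=True` tells the pipeline to group the parts of the sentence that correspond to the same entity (for example, "Hugging" and "Face" into a single organization). In recent versions of Transformers this option is deprecated in favor of `aggregation_strategy="simple"`.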

In [None]:
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

#### **Question answering**

Answers questions using information from a given context. Note that it works by extracting information from the provided context; it does not generate the answer.

In [None]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn.",
)

#### **Summarization**

Reduces a text to a shorter one while keeping all (or most) of the important aspects referenced in the text.

As with text generation, you can specify a `max_length` or a `min_length` for the result.

In [None]:
summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
    """
)

#### **Translation**

You can use a default model if you provide a language pair in the task name (such as `"translation_en_to_fr"`), but the easiest way is to pick the model you want to use on the [Model Hub](https://huggingface.co/models).

As with text generation and summarization, you can specify a `max_length` or a `min_length` for the result.

In [None]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

### **1.3. How do Transformers work?**

Broadly, Transformer models can be grouped into three categories:
- _auto-regressive_ Transformer models (GPT-like)
- _auto-encoding_ Transformer models (BERT-like)
- _sequence-to-sequence_ Transformer models (BART/T5-like)

Transformers are language models. They have been trained on large amounts of raw text in a self-supervised fashion. Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. That means that humans are not needed to label the data!

This type of model develops a statistical understanding of the language it has been trained on, but on its own it's not very useful for specific practical tasks. Because of this, the general pretrained model then goes through a process called _transfer learning_. During this process, the model is fine-tuned in a supervised way - that is, using human-annotated labels - on a given task. Two examples of language modeling tasks:
- _causal language modeling:_ predict the next word in a sentence having read the _n_ previous words
- _masked language modeling:_ predict a masked word in the sentence
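
Both tasks above derive their labels from the raw text itself, which is what makes this kind of training self-supervised. The sketch below builds training examples for each objective from a single sentence. The function names are hypothetical, the `[MASK]` token follows BERT's convention, and real pretraining picks masked positions at random rather than taking one as an argument.

```python
def causal_lm_examples(words):
    """Pair each prefix with the word that follows it."""
    return [(words[:i], words[i]) for i in range(1, len(words))]


def masked_lm_example(words, position, mask_token="[MASK]"):
    """Hide one word; the hidden word becomes the label."""
    corrupted = list(words)
    label = corrupted[position]
    corrupted[position] = mask_token
    return corrupted, label


sentence = "my name is Sylvain".split()

# Causal language modeling: predict the next word from the previous ones.
for context, target in causal_lm_examples(sentence):
    print(context, "->", target)

# Masked language modeling: predict the hidden word from both sides.
print(masked_lm_example(sentence, position=2))
```

No human annotation is involved at any point: the inputs and the targets are both cut from the same sentence.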

The general strategy (outliers exist) to achieve better performance is to increase the models' sizes as well as the amount of data they are pretrained on. This becomes very costly in terms of time and compute resources, and it also translates to environmental impact. This is why sharing language models is paramount: sharing the trained weights and building on top of already trained weights reduces the overall compute cost and carbon footprint of the community. You can estimate the carbon footprint of training your models with several tools, for example [ML CO2 Impact](https://mlco2.github.io/impact/) or [Code Carbon](https://codecarbon.io/), which is integrated in HF Transformers.

#### **Transfer Learning**

_Pretraining_ is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge. It therefore requires a very large corpus of data, and training can take up to several weeks.

_Fine-tuning_, on the other hand, is the training done __after__ a model has been pretrained. To perform fine-tuning, you first acquire a pretrained language model, then perform additional training with a dataset specific to your task. This will only require a limited amount of data: the knowledge the pretrained model has acquired is "transferred", hence the term _transfer learning_. Fine-tuning a model therefore has lower time, data, financial, and environmental costs. It is also quicker and easier to iterate over different fine-tuning schemes, as the training is less constraining than a full pretraining. This process will also achieve better results than training from scratch (unless you have lots of data), which is why you should always try to leverage a pretrained model - one as close as possible to the task you have at hand - and fine-tune it.

#### **General architecture**

The model is primarily composed of two blocks:
- __Encoder:__ The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.
- __Decoder:__ The decoder uses the encoder's representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.

Each of these parts can be used independently, depending on the task:
- __Encoder-only models:__ good for tasks that require understanding of the input, such as sentence classification and named entity recognition _(in own words: understands the input)_
- __Decoder-only models:__ good for generative tasks such as text generation _(in own words: generates some output)_
- __Encoder-decoder models__ or __sequence-to-sequence models:__ good for generative tasks that require an input, such as translation or summarization _(in own words: generates output depending on the given input)_

#### **Attention layers**

A key feature of Transformer models is that they are built with special layers called _attention layers_. Those layers will tell the model to pay specific attention to certain words in the sentence you passed it (and more or less ignore the others) when dealing with the representation of each word. A word by itself has a meaning, but that meaning is deeply affected by the context, which can be any other word (or words) before or after the word being studied.
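
As a rough illustration of how an attention layer weighs the other words, here is a small pure-Python sketch of scaled dot-product attention, the formulation commonly used in Transformer models. The notes above do not give a formula, so treat this as an assumption about the mechanism being described. The 2-dimensional word vectors are made up, and the learned query/key/value projections of a real model are omitted.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this word's query to every word's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three "words", each represented by a hypothetical 2-dimensional vector.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)  # self-attention: Q = K = V = x
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row sums to 1 and shows how much one word "pays attention" to every word in the sentence, itself included; the output for that word is the corresponding weighted mix of value vectors.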

#### **The original architecture**

The Transformer architecture was originally designed for translation. During training, the encoder receives inputs (sentences) in a certain language, while the decoder receives the same sentences in the desired target language. In the encoder, the attention layers can use all the words in a sentence (because a word may depend on any other word in the sentence). The decoder, however, works sequentially and can only pay attention to the words in the sentence that it has already translated. For example, when we have predicted the first three words of the translated target, we give them to the decoder, which then uses all the inputs of the encoder to try to predict the fourth word.

The first attention layer in a decoder block pays attention to all (past) inputs to the decoder, but the second attention layer uses the output of the encoder. It can thus access the whole input sentence to best predict the current word. This is very useful as different languages can have grammatical rules that put the words in different orders, or some context provided later in the sentence may be helpful to determine the best translation of a given word.

The _attention mask_ can also be used in the encoder/decoder to prevent the model from paying attention to some special words - for instance, the special padding word used to make all the inputs the same length when batching together sentences.
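
The padding case can be sketched in a few lines. The helper below is hypothetical (real tokenizers do this for you) and uses the common convention of `1` for real tokens and `0` for padding; the `[PAD]` token name is an assumption borrowed from BERT-style vocabularies.

```python
def pad_batch(sequences, pad_token="[PAD]"):
    """Pad every sequence to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_token] * n_pad)
        # 1 = real token (attend to it), 0 = padding (ignore it).
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks


batch = [
    "I hate this so much !".split(),
    "Nice course".split(),
]
padded, attention_mask = pad_batch(batch)
for tokens, mask in zip(padded, attention_mask):
    print(tokens, mask)
```

Both rows now have the same length, so they can be stacked into one batch, while the mask tells the attention layers which positions carry no content.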

#### **Architectures vs. checkpoints vs. models**

These terms all have slightly different meanings:
- __Architecture:__ This is the skeleton of the model - the definition of each layer and each operation that happens within the model.
- __Checkpoints:__ These are the weights that will be loaded in a given architecture.
- __Model:__ This is an umbrella term that isn't as precise as "architecture" or "checkpoint": it can mean both.

For example, BERT is an architecture while `bert-base-cased`, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say "the BERT model" and "the `bert-base-cased` model".
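
The distinction can be mimicked with a toy example: one class plays the role of the architecture, and plain dictionaries of numbers play the role of checkpoints. All names and values here are hypothetical and far simpler than a real Transformer.

```python
class TinyClassifier:
    """The 'architecture': defines which operations run, not the numbers."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    @classmethod
    def from_checkpoint(cls, checkpoint):
        # A checkpoint is just the set of values to load into the skeleton.
        return cls(checkpoint["weights"], checkpoint["bias"])

    def forward(self, features):
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias


# Two hypothetical checkpoints for the same architecture.
checkpoint_a = {"weights": [1.0, -1.0], "bias": 0.0}
checkpoint_b = {"weights": [0.5, 0.5], "bias": 1.0}

model_a = TinyClassifier.from_checkpoint(checkpoint_a)
model_b = TinyClassifier.from_checkpoint(checkpoint_b)

features = [2.0, 1.0]
print(model_a.forward(features))  # same code path, different weights
print(model_b.forward(features))
```

Both objects run exactly the same operations; only the loaded numbers differ, which is why the same input yields different outputs.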

### **1.4. Encoder models**

Encoder models use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence. These models are often characterized as having "bi-directional" attention (because they can access context from the left and the right), and are often called _auto-encoding models_.

For each input word, the encoder outputs a numerical representation, also called a "feature vector" or "feature tensor". The dimension of those vectors is defined by the architecture of the model. Each representation contains the value of a word in context: the words around it are also taken into account. This is possible because of the self-attention mechanism ("bi-directional" attention).

The pretraining of these models usually revolves around somehow corrupting a given sentence (for instance, by masking random words in it) and tasking the model with finding or reconstructing the initial sentence.

Encoder models perform well at Natural Language Understanding (NLU). They are best suited for tasks requiring an understanding of the full sentence, such as __sentence classification__, __named entity recognition__ (and more generally word classification), __extractive question answering__, and __masked language modeling (MLM)__.

Examples of encoder models:
- ALBERT
- BERT
- DistilBERT
- ELECTRA
- RoBERTa

### **1.5. Decoder models**

Decoder models use only the decoder of a Transformer model. At each stage, for a given word the attention layers can only access the words positioned before it in the sentence ("uni-directional" attention). These models are often called _auto-regressive models_ (those models re-use their past outputs as inputs in the following steps).

For each input word, the decoder outputs a numerical representation, also called a "feature vector" or "feature tensor". The dimension of those vectors is defined by the architecture of the model. Where the decoder differs from the encoder is principally in its self-attention mechanism: it uses what is called "masked self-attention" ("uni-directional" attention). This mechanism uses an additional mask to hide the context on one side of the word (typically the words to its right). The word's numerical representation is not affected by the words in the hidden context.
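
The difference between the two attention patterns can be shown as small 0/1 matrices, where row `i` lists which positions word `i` is allowed to see. This is a sketch with hypothetical helper names; note that in typical implementations a position may also attend to itself, which is the assumption made here.

```python
def full_mask(n):
    """Encoder-style: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style: position i may attend only to positions <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


words = ["Welcome", "to", "NYC"]
for word, row in zip(words, causal_mask(len(words))):
    # Keep only the words whose mask entry is 1 for this row.
    visible = [w for w, keep in zip(words, row) if keep]
    print(f"{word!r} can see {visible}")
```

With the causal mask, the representation of "to" depends on "Welcome" and "to" but never on "NYC"; with the full mask, every word sees the whole sentence.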

The pretraining of decoder models usually revolves around predicting the next word in the sentence.

Decoder models perform well at Natural Language Generation (NLG). They are best suited for tasks involving __text generation__.

Because decoders are extremely similar to encoders, they can be used for most of the same tasks as an encoder, but with a slight performance penalty in general.

Examples of decoder models:
- CTRL
- GPT
- GPT-2
- Transformer XL

### **1.6. Sequence-to-sequence models**

Encoder-decoder models (also called _sequence-to-sequence models_) use both parts of the Transformer architecture. At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can only access the words positioned before a given word in the input. Weights are not necessarily shared across the encoder and decoder.

The encoder takes a sequence of words as input and produces a numerical representation for each word. It has, in a sense, encoded the sequence: the numerical representation holds information about the meaning of the sequence. That encoder output is then sent to the decoder.

In addition to the encoder output, the decoder receives a (different) sequence as input. When prompting the decoder for an output with no initial sequence, we can give it the value that indicates the start of a sequence. The decoder decodes the sequence and outputs a word. For now we don't need to make sense of that word, but we can understand that the decoder is essentially decoding what the encoder has output.

Now that we have both the feature vector and an initial generated word, the encoder does not need to run again: its output is computed once and reused. The decoder can then act in an auto-regressive manner: the word it has just output is used as an input and, in combination with the numerical representation output by the encoder, used to generate a second word. The decoder continues like this until, for example, it outputs a value that we consider a "stopping value" (like a dot), meaning the end of the sequence.
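
The generation loop described above can be sketched as follows. The encoder and decoder here are stand-ins: `encode` just tags each word, and `decode_step` reads from a hard-coded lookup table instead of a trained network. All names, the start/stop tokens, and the table contents are hypothetical; only the control flow is the point.

```python
START, STOP = "<s>", "</s>"

# Hypothetical lookup table standing in for a trained decoder:
# it maps the previously generated word to the next word.
TOY_DECODER_TABLE = {
    START: "This", "This": "course", "course": "is", "is": "produced",
    "produced": "by", "by": "Hugging", "Hugging": "Face", "Face": STOP,
}


def encode(words):
    """Stand-in encoder: returns one 'representation' per input word."""
    return [(i, w.lower()) for i, w in enumerate(words)]


def decode_step(encoder_output, generated):
    """Stand-in decoder: predicts the next word from what was generated so far."""
    # A real decoder would also attend to encoder_output here.
    return TOY_DECODER_TABLE[generated[-1]]


def generate(source_words, max_steps=10):
    encoder_output = encode(source_words)  # computed once, then reused
    generated = [START]
    for _ in range(max_steps):
        next_word = decode_step(encoder_output, generated)
        if next_word == STOP:
            break
        generated.append(next_word)  # the output becomes the next input
    return generated[1:]


print(generate("Ce cours est produit par Hugging Face".split()))
```

The `max_steps` cap plays the same role as a maximum length in real generation: it guarantees the loop ends even if the stop value is never produced.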

The pretraining of these models can be done using the objectives of encoder or decoder models, but usually involves something a bit more complex. For instance, T5 is pretrained by replacing random spans of text (that can contain several words) with a single mask special word, and the objective is then to predict the text that this mask word replaces.
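
A simplified sketch of this span-corruption objective is shown below. The function name and the chosen spans are hypothetical; the `<extra_id_N>` sentinel names follow the convention used by T5 tokenizers. The actual T5 preprocessing differs in details (for example, spans are sampled randomly and the target ends with a closing sentinel), so this only illustrates the idea.

```python
def corrupt_spans(words, spans):
    """Replace each (start, end) span with a sentinel; build the target.

    `spans` must be sorted, non-overlapping, half-open index ranges.
    """
    inputs, targets = [], []
    cursor = 0
    for idx, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{idx}>"
        inputs.extend(words[cursor:start])
        inputs.append(sentinel)  # one sentinel replaces the whole span
        targets.append(sentinel)
        targets.extend(words[start:end])  # the model must predict these words
        cursor = end
    inputs.extend(words[cursor:])
    return inputs, targets


words = "Thank you for inviting me to your party last week".split()
corrupted, target = corrupt_spans(words, spans=[(2, 4), (8, 9)])
print(" ".join(corrupted))
print(" ".join(target))
```

The encoder would receive the corrupted sentence, and the decoder would be trained to produce the target, that is, each sentinel followed by the words it replaced.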

Sequence-to-sequence models are best suited for tasks revolving around generating new sentences depending on a given input (also called "sequence to sequence tasks"), such as __summarization__, __translation__, or __generative question answering__.

Examples of encoder-decoder models:
- BART
- Marian
- mBART
- T5

Additionally, you can load an encoder and a decoder inside an encoder-decoder model. For instance, you could combine the encoder model BERT with the decoder model GPT-2. Therefore, according to the specific task you are targeting, you may choose to use specific encoders and decoders, which have proven their worth on these specific tasks.

### **1.7. Bias and limitations**

If your intent is to use a pretrained model or a fine-tuned version, please be aware that they come with limitations. The biggest of these is that, to enable pretraining on large amounts of data, researchers often scrape all the content they can find, taking the best as well as the worst of what is available on the internet. When you use these tools, you therefore need to keep in the back of your mind that the original model you are using could very easily generate sexist, racist, or homophobic content. Fine-tuning the model on your data won't make this intrinsic bias disappear.

In [None]:
unmasker = pipeline("fill-mask", model="bert-base-uncased")
results = [
    unmasker("This man works as a [MASK]."),
    unmasker("This man has a job as a [MASK]."),
    unmasker("This woman works as a [MASK]."),
    unmasker("This woman has a job as a [MASK]."),
]
for result in results:
    print([r["token_str"] for r in result])