# 7.2 Hugging Face Transformers

In this lab session, we will look at [Hugging Face Transformers](https://huggingface.co/). Hugging Face is a community and data science platform that provides tools for building, training, and deploying ML models based on open source code and technologies. It started as a platform for natural language processing (NLP) and has since expanded to other domains such as computer vision, audio, tabular data, multimodal learning, and reinforcement learning. Most notably, Hugging Face offers a variety of large-scale pretrained (transformer) models that you can download and fine-tune on your own datasets, a concept called transfer learning that has taken the deep learning field by storm.

The Hugging Face ecosystem consists of:
- A hub for models and datasets
- Open source libraries that facilitate the development, training and deployment of ML models, especially those based on the state-of-the-art transformer architecture:
    - Transformers
    - Datasets
    - Tokenizers
    - Accelerate

To learn more about the Hugging Face ecosystem, check the [Hugging Face course](https://huggingface.co/course/chapter1/1). This notebook is heavily based on this course.
If you want to have a look at the models that are available through Hugging Face, check the [Hugging Face hub](https://huggingface.co/models).

**What is NLP?**

NLP is a field of linguistics and machine learning focused on understanding everything related to human language. The aim of NLP tasks is not only to understand single words individually, but to be able to understand the context of those words.

The following is a list of common NLP tasks, with some examples of each:

- Classifying whole sentences: Getting the sentiment of a review, detecting if an email is spam, determining if a sentence is grammatically correct or whether two sentences are logically related or not
- Classifying each word in a sentence: Identifying the grammatical components of a sentence (noun, verb, adjective), or the named entities (person, location, organization)
- Generating text content: Completing a prompt with auto-generated text, filling in the blanks in a text with masked words
- Extracting an answer from a text: Given a question and a context, extracting the answer to the question based on the information provided in the context
- Generating a new sentence from an input text: Translating a text into another language, summarizing a text

NLP isn’t limited to written text though. It also tackles complex challenges in speech recognition and computer vision, such as generating a transcript of an audio sample or a description of an image.

**Transformers**

Transformers are the state-of-the-art deep learning architecture for sequence (and image) tasks. If you are curious about the transformer architecture and how it works under the hood, make sure to check out the Hugging Face course linked at the top of the notebook.

Reading "Attention Is All You Need", the paper that originally proposed the transformer architecture, is a must for everyone who wants to dive into modern deep learning. It can be found [here](https://arxiv.org/abs/1706.03762). Below is an overview diagram of the model architecture:

![transformer_architecture](images/transformer_architecture.png)

### 7.2.1 Transformers Library

The goal of the 🤗 Transformers library is to provide a single API through which any Transformer model can be loaded, trained, and saved. The library’s main features are:

- Ease of use: Downloading, loading, and using a state-of-the-art NLP model for inference can be done in just two lines of code.
- Flexibility: At their core, all models are simple PyTorch `nn.Module` or TensorFlow `tf.keras.Model` classes and can be handled like any other models in their respective machine learning (ML) frameworks.
- Simplicity: Hardly any abstractions are made across the library. “All in one file” is a core concept: a model’s forward pass is entirely defined in a single file, so that the code itself is understandable and hackable.

The most basic object in the 🤗 Transformers library is the pipeline() function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer.
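To make the three stages concrete, here is a minimal toy sketch in plain Python. It does not use the Transformers library: the vocabulary, the hand-picked weights, and the names `preprocess`, `forward`, `postprocess`, and `toy_pipeline` are all hypothetical and serve only to illustrate how preprocessing, a model, and postprocessing chain together.

```python
import math

# Toy stand-ins for the three stages; none of this is the Transformers API.
VOCAB = {"<unk>": 0, "great": 1, "love": 2, "terrible": 3, "hate": 4}
# One hand-picked (negative, positive) weight pair per token id.
WEIGHTS = [(0.0, 0.0), (-1.0, 1.0), (-1.5, 1.5), (1.0, -1.0), (1.5, -1.5)]
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Preprocessing: lowercase, split on whitespace, map words to integer ids."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in words]


def forward(token_ids):
    """'Model': sum the per-token weights into one raw score (logit) per label."""
    logits = [0.0, 0.0]
    for token_id in token_ids:
        logits[0] += WEIGHTS[token_id][0]
        logits[1] += WEIGHTS[token_id][1]
    return logits


def postprocess(logits):
    """Postprocessing: softmax the logits and report the most probable label."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three stages, as a pipeline object does for a real model."""
    return postprocess(forward(preprocess(text)))


result = toy_pipeline("I love this course, it is great.")
print(result["label"])
```

A real pipeline follows the same flow, with a trained tokenizer and a neural network in place of the lookup tables.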

Some of the currently available pipelines are:

- feature-extraction (get the vector representation of a text)
- fill-mask
- ner (named entity recognition)
- question-answering
- sentiment-analysis
- summarization
- text-generation
- translation
- zero-shot-classification

In [1]:
from transformers import pipeline

In [2]:
classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")
# By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English.

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598049521446228}]
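The warning above recommends specifying the model name and revision explicitly. A minimal sketch of doing so follows, using the model identifier and revision hash copied from that log message. The helper name `build_classifier` is hypothetical, and the import sits inside the function only so that defining it does not load the library or trigger a download.

```python
# Values copied from the log message printed by the default pipeline.
MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "af0f99b"


def build_classifier():
    """Create a sentiment-analysis pipeline pinned to a fixed model and revision."""
    # Imported lazily: nothing is loaded or downloaded until this function is called.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_ID, revision=REVISION)
```

Calling `build_classifier()` downloads the weights on first use and should then behave like the default classifier above, without the warning.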

In [3]:
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445994853973389, 0.11197376996278763, 0.043426763266325]}

Let's search for a particular model on the [model hub](https://huggingface.co/models).

In [4]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to build an app for Android using the open source code, using the open source SDK. After these resources'},
 {'generated_text': 'In this course, we will teach you how to use Python 2.2 and 3 to configure all the virtual machines for the Virtual Machine (VM).'}]

In [5]:
### TO DO ###
# Find a text generation model for another language and try it out!

In [6]:
ner = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True
ner("My name is Sebastian and I work at Avanade in Munich.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'entity_group': 'PER',
  'score': 0.999059,
  'word': 'Sebastian',
  'start': 11,
  'end': 20},
 {'entity_group': 'ORG',
  'score': 0.9967142,
  'word': 'Avanade',
  'start': 35,
  'end': 42},
 {'entity_group': 'LOC',
  'score': 0.9960038,
  'word': 'Munich',
  'start': 46,
  'end': 52}]

In [7]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sebastian and I work at Avanade in Munich.",
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.43235573172569275,
 'start': 35,
 'end': 52,
 'answer': 'Avanade in Munich'}

In [8]:
summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

In [9]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce datascience bootcamp cours est produit par Avanade.")

[{'translation_text': 'This datascience bootcamp course is produced by Avanade.'}]

The figure below gives a short history of transformer models. Depending on their architecture and the tasks they were trained on, the models perform differently on a given task.

![transformers_history](images/transformers_history.png)

All the Transformer models mentioned above (GPT, BERT, BART, T5, etc.) have been trained as language models. This means they have been trained on large amounts of raw text in a self-supervised fashion. Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. That means that humans are not needed to label the data!

This type of model develops a statistical understanding of the language it has been trained on, but it’s not very useful for specific practical tasks. Because of this, the general pretrained model then goes through a process called transfer learning. During this process, the model is fine-tuned in a supervised way — that is, using human-annotated labels — on a given task.

An example of a pretraining task is predicting the next word in a sentence after reading the n previous words. This is called causal language modeling because the output depends on the past and present inputs, but not the future ones.

![causal_language_modeling](images/causal_language_modeling.png)
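The following sketch shows how a causal language modeling objective can be derived from raw text without human labels. It assumes whitespace splitting as a stand-in for a real tokenizer, and the helper names `next_token_examples` and `causal_mask` are hypothetical. The mask illustrates the "past and present, but not future" constraint that transformer decoders typically enforce in their attention layers.

```python
def next_token_examples(tokens):
    """Turn one token sequence into (context, target) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def causal_mask(n):
    """mask[i][j] is True when position i may look at position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


# Whitespace splitting stands in for a real tokenizer here.
tokens = "My name is Sebastian".split()

for context, target in next_token_examples(tokens):
    print(context, "->", target)

# Each row marks the positions visible to that token: itself and everything before it.
for row in causal_mask(len(tokens)):
    print("".join("x" if allowed else "." for allowed in row))
```

Every target comes from the text itself, which is what makes the training self-supervised.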

Another example is masked language modeling, in which the model predicts a masked word in the sentence.

![masked_language_modeling](images/masked_language_modeling.png)
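As a rough illustration of how masked language modeling examples can be built from raw text, the sketch below hides a random subset of tokens and keeps the originals as labels. The helper name `mask_tokens`, the `[MASK]` placeholder string, and the 15% default rate are assumptions for this toy. Real recipes such as BERT's add further details, for example sometimes substituting a random token instead of the mask.

```python
import random

MASK = "[MASK]"  # placeholder string; real tokenizers define their own mask token


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens; return (model inputs, labels).

    labels[i] holds the original token where inputs[i] was masked and None
    elsewhere, so a loss would only be computed on the masked positions.
    """
    rng = random.Random(seed)  # local generator keeps the example reproducible
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(token)
        else:
            inputs.append(token)
            labels.append(None)
    return inputs, labels


sentence = "My name is Sebastian and I work at Avanade in Munich".split()
masked_inputs, masked_labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked_inputs)
print(masked_labels)
```

Unlike the causal setup, the model here can use context on both sides of each masked position.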

Apart from a few outliers (like DistilBERT), the general strategy to achieve better performance is to increase the models’ sizes as well as the amount of data they are pretrained on.

![transformers_numbersofparams](images/transformers_numberofparams.png)

[Transfer learning](https://www.youtube.com/watch?v=BqqfQnyjmgg)

Pretraining is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge. It is usually done on a very large corpus of data, and training can take up to several weeks.

![pretraining](images/pretraining.png)

Fine-tuning, on the other hand, is the training done after a model has been pretrained. To perform fine-tuning, you first acquire a pretrained language model, then perform additional training with a dataset specific to your task. Why not simply train directly for the final task? There are several reasons:

- The pretrained model was already trained on a dataset that has some similarities with the fine-tuning dataset. The fine-tuning process is thus able to take advantage of knowledge acquired by the initial model during pretraining (for instance, with NLP problems, the pretrained model will have some kind of statistical understanding of the language you are using for your task).
- Since the pretrained model was already trained on lots of data, the fine-tuning requires far less data to get decent results.
- For the same reason, the amount of time and resources needed to get good results is much lower.

For example, one could leverage a pretrained model trained on the English language and then fine-tune it on an arXiv corpus, resulting in a science/research-based model. The fine-tuning will only require a limited amount of data: the knowledge the pretrained model has acquired is “transferred,” hence the term transfer learning.

Fine-tuning a model therefore has lower time, data, financial, and environmental costs than pretraining. It is also quicker and easier to iterate over different fine-tuning schemes, as the training is less constraining than a full pretraining.

This process will also achieve better results than training from scratch (unless you have lots of data), which is why you should always try to leverage a pretrained model, one as close as possible to the task you have at hand, and fine-tune it.

![finetuning](images/finetuning.png)
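The sketch below illustrates one common and inexpensive variant of fine-tuning in PyTorch: the pretrained part of the network is frozen and only a new task-specific layer is trained. Everything in it is a stand-in. The small `body` network plays the role of a pretrained model (in practice its weights would be loaded from a checkpoint), the dataset is synthetic, and the names `body` and `head` are hypothetical. Full fine-tuning would update the pretrained weights as well.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for a pretrained network; real weights would come from a checkpoint.
body = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
# New task-specific layer, randomly initialized.
head = nn.Linear(16, 2)

# Freeze the "pretrained" weights so that only the head is updated.
for param in body.parameters():
    param.requires_grad_(False)

# Keep copies to check afterwards which weights actually moved.
body_before = [param.clone() for param in body.parameters()]
head_before = head.weight.clone()

# Tiny synthetic labeled dataset: the label depends on the sign of the first feature.
x = torch.randn(64, 8)
y = (x[:, 0] > 0).long()

# The optimizer only receives the head's parameters.
optimizer = torch.optim.SGD(head.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(50):
    optimizer.zero_grad()
    loss = loss_fn(head(body(x)), y)
    loss.backward()
    optimizer.step()

body_unchanged = all(
    torch.equal(before, after)
    for before, after in zip(body_before, body.parameters())
)
head_changed = not torch.equal(head_before, head.weight)
print(body_unchanged, head_changed)
```

Because gradients are computed and applied for the small head only, each training step is cheaper than updating the whole network.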