<a href="https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/pipeline_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hugging Face 🤗 Transformers: A Gateway to Advanced ML Models

by [Sanchit Gandhi](https://huggingface.co/sanchit-gandhi) for the MIT MFin programme.

## Introduction

Hugging Face [🤗 Transformers](https://huggingface.co/docs/transformers/index) is an open-source library for state-of-the-art machine learning models.
It provides a unified interface for over 230 Transformer-based models, making it straightforward to apply the latest research to a range of different tasks.
The library offers several levels of abstraction, from editing the low-level source code through to high-level one-line commands that take
care of all the complexity under the hood, making it accessible to ML researchers, practitioners and hobbyists alike.

Models in the Transformers library can be applied to:

1. 📝 **Text:** for tasks like text classification, information extraction, summarisation, and text generation, in over 100 languages.
2. 🖼️ **Images:** for tasks like image classification, object detection, and segmentation.
3. 🗣️ **Audio:** for tasks like speech recognition, audio classification, and speech synthesis.

Transformer models can also perform tasks on several modalities combined, such as visual question answering, optical character recognition, and information extraction from scanned documents.

No matter the task or modality at hand, 🤗 Transformers has a model that you can use easily and quickly!

## Focus of This Tutorial: NLP Applications with Transformers

This tutorial concentrates on the initial phases of the machine learning pipeline in the context of NLP, specifically:
1. **Identifying the task:** given a specific input-output pair, what is the best choice of task to solve our problem?
2. **Applying a pre-trained model:** load a pre-trained model from the Hub and apply it to our problem.

For detailed information on other steps in the machine learning pipeline, including dataset selection, model fine-tuning, evaluation, and deployment,
refer to the Hugging Face [Documentation](https://huggingface.co/docs).

The tutorial will cover three popular NLP tasks, particularly relevant in the financial sector:
1. **Sentiment Analysis:** automatically tag text according to its sentiment (e.g. positive, negative, neutral).
2. **Named-Entity Recognition (NER):** identify and classify key financial entities in text, such as company names, stock tickers, and monetary values.
3. **Summarisation:** efficiently summarise financial reports or news articles.

## Sentiment Analysis

Sentiment analysis is the automated process of tagging data according to its sentiment, such as positive, negative and neutral.
Sentiment analysis allows companies to analyse data at scale, in order to detect insights and automate processes. For example,
one might wish to analyse financial news, market reports, or social media posts to gauge investor sentiment or market trends.
This information can then be used to assess the overall economic outlook or to update stock market predictions.

In the following example, we'll use the [BERT model](https://huggingface.co/docs/transformers/model_doc/bert) to classify an
input text. The overall workflow is as follows:
1. **Pre-processing:** the input text is converted to word-piece tokens by action of the tokenizer
2. **Modelling:** the tokens are fed through the BERT model to get a sequence of encoder representations, one for each word-piece token
3. **Prediction:** a linear ("dense") transformation is applied to the encoder representation of the special [CLS] token, followed by a softmax, to give the probabilities of each class
4. **Post-processing:** the final class label is declared as the one with the highest probability

These four steps are summarised below, working from bottom to top:

<img src="https://github.com/huggingface/workshops/blob/main/nlp-zurich/images/clf_arch.png?raw=1" width=600>

If you want to dive into any of these stages in more detail, we refer you to the blog post [The Illustrated BERT](http://jalammar.github.io/illustrated-bert/), as well as the
original [BERT paper](https://arxiv.org/abs/1810.04805).
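
As a rough illustration of the prediction and post-processing stages, the sketch below turns a pair of raw scores (logits) into class probabilities with a softmax and then picks the most probable label. The logit values and label names are hypothetical and chosen only for illustration; in practice they come from the model and its configuration.

```python
import math


def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps the exponentials numerically stable
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits produced by the linear layer for one input text
labels = ["NEGATIVE", "POSITIVE"]
logits = [-3.1, 3.9]

probs = softmax(logits)

# Post-processing: the predicted label is the one with the highest probability
best = max(range(len(probs)), key=probs.__getitem__)
predicted_label = labels[best]
print(predicted_label, round(probs[best], 4))
```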

Working at the highest level with Transformers, the `pipeline()` class handles all four stages for us. We simply feed it our
input text, and it returns the final class label prediction:

<img src="https://huggingface.co/datasets/sanchit-gandhi/notebook-figures/resolve/main/mit-tutorial/pipeline.png?download=true" width=600>

Let's see this in action below! First, we'll import the `pipeline()` from the Transformers library:

In [None]:
from transformers import pipeline



We can then load a model into the `pipeline()` class. For this example, we'll use the official [DistilBERT](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) checkpoint from the Hugging Face Hub, a distilled version of BERT that has been fine-tuned for sentiment classification. You can switch this for any [text classification model](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads) on the Hub by replacing the model id:

In [None]:
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

Let's define the text we want to classify. In this example, we'll use a summary of [Apple's Q3 Earnings Report](https://www.apple.com/uk/newsroom/2023/08/apple-reports-third-quarter-results/):

In [None]:
APPLE_EARNINGS = (
    "Apple today announced financial results for its fiscal 2023 third quarter ended July 1, 2023. The Company posted quarterly revenue of $81.8 billion, down 1 percent year over year, and quarterly earnings per diluted share of $1.26, up 5 percent year over year. "
    "'We are happy to report that we had an all-time revenue record in Services during the June quarter, driven by over 1 billion paid subscriptions, and we saw continued strength in emerging markets thanks to robust sales of iPhone,' said Tim Cook, Apple’s CEO. 'From education to the environment, we are continuing to advance our values, while championing innovation that enriches the lives of our customers and leaves the world better than we found it.'"
    "'Our June quarter year-over-year business performance improved from the March quarter, and our installed base of active devices reached an all-time high in every geographic segment,' said Luca Maestri, Apple’s CFO. 'During the quarter, we generated very strong operating cash flow of $26 billion, returned over $24 billion to our shareholders, and continued to invest in our long-term growth plans."
    "Apple’s board of directors has declared a cash dividend of $0.24 per share of the Company’s common stock. The dividend is payable on August 17, 2023 to shareholders of record as of the close of business on August 14, 2023."
)

Alright! We can now pass the text to the `pipeline()` and classify it accordingly:

In [None]:
sentiment_prediction = sentiment_pipeline(APPLE_EARNINGS)
sentiment_prediction

[{'label': 'POSITIVE', 'score': 0.9991182684898376}]

Great - we see that the sentiment is "positive" with 99.9% probability. If you took the time to read through the earnings summary, you'd agree that this makes sense given the strong growth in revenue from services and subscriptions. You can see how such a system could be integrated into a financial workflow to make informed decisions based on the information published in the news.
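
One possible way to feed such predictions into a downstream workflow is to convert each one into a signed score and average over several documents. The sketch below assumes the binary `POSITIVE`/`NEGATIVE` labels returned by this checkpoint; the three predictions and the `signed_score` helper are hypothetical and only mirror the output format shown above.

```python
def signed_score(prediction):
    """Map {'label': ..., 'score': ...} to a value in [-1, 1]."""
    sign = 1.0 if prediction["label"] == "POSITIVE" else -1.0
    return sign * prediction["score"]


# Hypothetical pipeline outputs for three documents (same format as above)
predictions = [
    {"label": "POSITIVE", "score": 0.9991},
    {"label": "NEGATIVE", "score": 0.8500},
    {"label": "POSITIVE", "score": 0.6000},
]

scores = [signed_score(p) for p in predictions]
average = sum(scores) / len(scores)
print(f"Average sentiment: {average:+.2f}")
```

A plain average treats every document equally; weighting by source or recency would be a natural refinement, but is outside the scope of this sketch.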

In this example, we classified the *overall* sentiment of the text. That is, we predicted a single class label for the entire text. Let's now go a level deeper and make predictions on the *token-level*.

## Named Entity Recognition

Named entity recognition (NER) is the process of extracting **entities** from a passage of text. It can be used to identify and categorise financial entities such as company names, stock symbols, monetary values, and economic indicators from financial reports, news articles, or social media content. Therefore, it can be used to give a structured understanding of vast amounts of textual data.

The workflow for NER is largely the same as sentiment analysis. However, instead of predicting the class probabilities for a single token, we
predict probabilities for every token using the linear layer:

<img src="https://github.com/huggingface/workshops/blob/main/nlp-zurich/images/ner_arch.png?raw=1" width=600>

We can instantiate our NER pipeline in much the same way as our sentiment analysis one. This time, we'll use the checkpoint [dslim/bert-large-NER](https://huggingface.co/dslim/bert-large-NER):

In [None]:
ner_pipeline = pipeline("ner", model="dslim/bert-large-NER")

Some weights of the model checkpoint at dslim/bert-large-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
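
A token classification model predicts one tag per token, so a multi-word name such as "Tim Cook" has to be stitched back together afterwards. The sketch below shows one simplified way to do this, assuming BIO-style tags (`B-` marks the beginning of an entity, `I-` a continuation, `O` no entity). The tokens, tags and the `group_entities` helper are hypothetical, and this is not the library's own aggregation code.

```python
def group_entities(tokens, tags):
    """Merge consecutive B-/I- tagged tokens into (text, type) spans."""
    entities = []
    current_words, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the entity that is currently open
            current_words.append(token)
            continue
        if current_words:
            # Close the entity that was open before this token
            entities.append((" ".join(current_words), current_type))
        if prefix in ("B", "I"):
            current_words, current_type = [token], entity_type
        else:
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Hypothetical per-token predictions for a short phrase
tokens = ["said", "Tim", "Cook", ",", "Apple", "'s", "CEO"]
tags = ["O", "B-PER", "I-PER", "O", "B-ORG", "O", "O"]

grouped = group_entities(tokens, tags)
print(grouped)
```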


Let's pass our earnings report to the NER pipeline to get our class predictions:

In [None]:
entities = ner_pipeline(APPLE_EARNINGS, aggregation_strategy="simple")
print(entities)

[{'entity_group': 'ORG', 'score': 0.99825484, 'word': 'Apple', 'start': 0, 'end': 5}, {'entity_group': 'MISC', 'score': 0.9887601, 'word': 'iPhone', 'start': 481, 'end': 487}, {'entity_group': 'PER', 'score': 0.99950564, 'word': 'Tim Cook', 'start': 495, 'end': 503}, {'entity_group': 'ORG', 'score': 0.9981773, 'word': 'Apple', 'start': 505, 'end': 510}, {'entity_group': 'PER', 'score': 0.98965263, 'word': 'Luca Maestri', 'start': 899, 'end': 911}, {'entity_group': 'ORG', 'score': 0.998454, 'word': 'Apple', 'start': 913, 'end': 918}, {'entity_group': 'ORG', 'score': 0.9986727, 'word': 'Apple', 'start': 1109, 'end': 1114}, {'entity_group': 'ORG', 'score': 0.56803095, 'word': 'Company', 'start': 1191, 'end': 1198}]


This isn't very easy to read, so let's clean up the outputs a bit by printing them on separate lines and rounding the class probabilities to 2 decimal places:

In [None]:
for entity in entities:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2f})")

Apple: ORG (1.00)
iPhone: MISC (0.99)
Tim Cook: PER (1.00)
Apple: ORG (1.00)
Luca Maestri: PER (0.99)
Apple: ORG (1.00)
Apple: ORG (1.00)
Company: ORG (0.57)


That's much better! It seems that the model found most of the named entities in the text: "Apple" is correctly identified as an organisation (ORG),
and "Tim Cook" as a person (PER). The model also assigned the "miscellaneous" (MISC) label to "iPhone", indicating that it recognised it as a
named entity that does not fit into any of the other categories. If we want to refine the model for a specific domain, such as tech company earnings reports, we'd likely want to [fine-tune](https://huggingface.co/learn/nlp-course/chapter7/2) it on additional data, in order to give more detailed class labels and improve the performance.
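
Before fine-tuning, a simple post-processing step can already tidy up the predictions: discard low-confidence entities and collect the unique names per category. The sketch below uses the entities printed above (scores rounded); the `MIN_SCORE` threshold of 0.9 is an arbitrary, hypothetical choice that you would tune for your own application.

```python
# Entities copied from the pipeline output above (scores rounded)
entities = [
    {"entity_group": "ORG", "score": 0.998, "word": "Apple"},
    {"entity_group": "MISC", "score": 0.989, "word": "iPhone"},
    {"entity_group": "PER", "score": 0.9995, "word": "Tim Cook"},
    {"entity_group": "ORG", "score": 0.998, "word": "Apple"},
    {"entity_group": "PER", "score": 0.990, "word": "Luca Maestri"},
    {"entity_group": "ORG", "score": 0.998, "word": "Apple"},
    {"entity_group": "ORG", "score": 0.999, "word": "Apple"},
    {"entity_group": "ORG", "score": 0.568, "word": "Company"},
]

MIN_SCORE = 0.9  # hypothetical threshold; tune for your application

# Collect the unique, confidently predicted names for each entity type
confident = {}
for entity in entities:
    if entity["score"] >= MIN_SCORE:
        confident.setdefault(entity["entity_group"], set()).add(entity["word"])

for group, words in sorted(confident.items()):
    print(group, sorted(words))
```

With this threshold, the uncertain "Company" prediction is dropped and the four mentions of "Apple" collapse into a single entry.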

## Why use the `pipeline()`?

When working on your own task, starting with a simple pipeline like the ones shown above offers several benefits:

* A pre-trained model may exist that already solves your task, saving you plenty of time
* `pipeline()` takes care of all the pre/post-processing for you, so you don’t have to worry about getting the data into the right format for a model
* If the result isn’t ideal, it gives you a baseline for future fine-tuning
* Should you fine-tune a model on your custom data and share it on the Hub, the whole community will be able to use it quickly and effortlessly via the `pipeline()` class, making AI more accessible

In the NER task, we predicted a single class label for each of our input tokens. For our final task, we'll provide a text input, but predict a text output
of variable length. We'll also introduce a lower-level way of interacting with Transformers that shows how the `pipeline()` works under the hood.

## Summarisation



Summarisation creates a shorter version of a document or an article that captures all the important information. It can be used to distill lengthy financial statements and news articles into concise summaries that provide a high-level overview of the information conveyed.

Summarisation is an example of a **sequence-to-sequence** task: a *sequence* of text inputs is mapped to a *sequence* of text outputs. The length of the output is not known beforehand, but depends on both the length and the content of the input. Suppose we're summarising a one-line headline: the summary will be at most one line, since it should not exceed the length of the input. A multi-page document, on the other hand, will likely require a full paragraph to summarise all the information. Content matters too: a very lengthy passage could be summarised in just one sentence if it contains little important information, whereas an information-dense document of the same length may require multiple sentences.

Summarisation can take one of two flavours:

1. Extractive: extract the most relevant information from a document.
2. Abstractive: generate new text that captures the most relevant information.
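
To make the extractive flavour concrete, here is a toy baseline that scores each sentence by the average frequency of its words in the document and returns the top-scoring sentences unchanged. It is a deliberately naive sketch (the `extractive_summary` helper, the scoring rule and the example document are all hypothetical) and is unrelated to the neural model used in the rest of this section.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequency = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average document frequency of the words in this sentence
        return sum(frequency[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)


document = (
    "Revenue fell slightly this quarter. "
    "Services revenue reached a record, and services revenue growth offset weaker hardware revenue. "
    "The weather was pleasant."
)

summary_sentence = extractive_summary(document, num_sentences=1)
print(summary_sentence)
```

Every sentence in the result appears verbatim in the input, which is the defining property of extractive summarisation. An abstractive model, by contrast, is free to generate wording that never appears in the source.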

In this example, we'll use an **abstractive** summarisation model to pull out the key information from the earnings summary.

You'll be very familiar with the pipeline workflow now! Let's go ahead and load one of the official [DistilBART](https://huggingface.co/sshleifer/distilbart-cnn-12-6) checkpoints:

In [None]:
summarization_pipeline = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

> **Tip:** a model with "distil" in the name likely means it is a compressed version of a larger model. These models tend to be smaller and faster to run, while largely maintaining the performance of the original model. Hugging Face has a big soft spot for distilling models, given they assist in making them more accessible to the community 🤗. If you're interested in finding out more about model distillation, refer to the [DistilBERT](https://arxiv.org/abs/1910.01108), [DistilBART](https://arxiv.org/abs/2010.13002) and [Distil-Whisper](https://arxiv.org/abs/2311.00430) papers.

As before, we'll pass the input text to the pipeline to get our outputs:

In [None]:
summarization_pipeline(APPLE_EARNINGS)

[{'summary_text': " Apple today announced financial results for its fiscal 2023 third quarter ended July 1, 2023 . The Company posted quarterly revenue of $81.8 billion, down 1 percent year over year, and quarterly earnings per diluted share of $1.26 . 'We are happy to report that we had an all-time revenue record in Services during the June quarter'"}]

Fantastic - the model summarised the multi-paragraph input text into a one-paragraph summary.

We can also string together all of our pipelines to:
1. Generate a summary
2. Predict the overall sentiment
3. Classify the named entities

In [None]:
summary = summarization_pipeline(APPLE_EARNINGS)[0]["summary_text"]
sentiment = sentiment_pipeline(summary)
entities = ner_pipeline(summary, aggregation_strategy="simple")

print(summary)
print("Sentiment: ", sentiment[0])

for entity in entities:
    print(f"Entity: {entity['word']} {entity['entity_group']} ({entity['score']:.2f})")

 Apple today announced financial results for its fiscal 2023 third quarter ended July 1, 2023 . The Company posted quarterly revenue of $81.8 billion, down 1 percent year over year, and quarterly earnings per diluted share of $1.26 . 'We are happy to report that we had an all-time revenue record in Services during the June quarter'
Sentiment:  {'label': 'POSITIVE', 'score': 0.9983645081520081}
Entity: Apple ORG (1.00)


## Let's Go Deeper!

So far, we've been using the `pipeline()` class, which takes care of the pre-processing, post-processing and modelling steps. In our final example, we'll use a lower-level API in Transformers, which splits these three steps up:

<img src="https://huggingface.co/datasets/sanchit-gandhi/notebook-figures/resolve/main/mit-tutorial/model_tokenizer.png?download=true" width=600>

To achieve this, we'll use two objects:
1. The `tokenizer`: responsible for pre-processing the input text to token ids, and post-processing the predicted ids to output text
2. The `model`: responsible for the auto-regressive generation
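
To build some intuition for the tokenizer's two roles, the toy sketch below encodes text to ids and decodes ids back to text using a tiny, hypothetical whitespace vocabulary. Real tokenizers operate on learned sub-word units and have far larger vocabularies, so treat the `vocab`, `encode` and `decode` names here purely as illustration.

```python
# Hypothetical toy vocabulary, for illustration only
vocab = {"<s>": 0, "</s>": 1, "<unk>": 2, "apple": 3, "posted": 4, "record": 5, "revenue": 6}
id_to_token = {token_id: token for token, token_id in vocab.items()}
special_ids = {vocab["<s>"], vocab["</s>"]}


def encode(text):
    """Pre-processing: map each word to its id, wrapped in start/end tokens."""
    ids = [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]
    return [vocab["<s>"]] + ids + [vocab["</s>"]]


def decode(ids, skip_special_tokens=True):
    """Post-processing: map ids back to text, optionally dropping special tokens."""
    kept = [i for i in ids if not (skip_special_tokens and i in special_ids)]
    return " ".join(id_to_token[i] for i in kept)


ids = encode("Apple posted record revenue")
print(ids)
print(decode(ids))
```

Words missing from the vocabulary fall back to the `<unk>` id in this toy version. Sub-word tokenizers largely avoid this by splitting rare words into smaller known pieces.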

We can go ahead and load the corresponding classes from the Transformers library. The first of these classes, `AutoTokenizer`, is the tokenizer class,
which converts the input string to a token id (discrete number) representation. The second class, `AutoModelForSeq2SeqLM`, is our
model class. This is the Python class that holds the model weights and graph definition:

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

> **Tip:** using auto-classes means that we can easily swap the checkpoint id for any other checkpoint on the Hugging Face Hub and re-use our code without any changes. The auto-classes will take care of loading the correct model and tokenizer classes for us!

Great! We can now load the model and tokenizer from the pre-trained checkpoint on the Hub:

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-12-6")
tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")

The model is now loaded into memory and ready to be used for predictions. Let's go through the steps for summarisation one-by-one.

**Step 1:** pre-process (encode) the text inputs to token ids using the tokenizer. We'll return the input ids as PyTorch (pt) tensors:

In [None]:
# truncation=True makes the max_length cut-off explicit; 1024 tokens is the maximum input length of BART-style checkpoints
inputs = tokenizer(APPLE_EARNINGS, truncation=True, max_length=1024, return_tensors="pt")


**Step 2:** auto-regressively generate using the model to get the predicted ids. For this, we'll use the model's [`.generate`](https://huggingface.co/docs/transformers/main_classes/text_generation) method:

In [None]:
pred_ids = model.generate(inputs["input_ids"])
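
"Auto-regressive" means that the output is produced one token at a time, with each new token depending on what has been generated so far. The toy sketch below shows the simplest version of this loop, greedy decoding, in which the highest-scoring token is chosen at every step. The score table and the `greedy_generate` helper are hypothetical: here the scores depend only on the previous token, whereas a real model conditions on the full input and on everything generated so far.

```python
# Hypothetical next-token scores, keyed by the most recently generated token
NEXT_TOKEN_SCORES = {
    "<s>": {"apple": 0.9, "revenue": 0.1},
    "apple": {"posted": 0.7, "revenue": 0.3},
    "posted": {"record": 0.6, "revenue": 0.4},
    "record": {"revenue": 0.8, "</s>": 0.2},
    "revenue": {"</s>": 1.0},
}


def greedy_generate(max_new_tokens=10):
    """Generate tokens one at a time, always taking the highest-scoring one."""
    generated = ["<s>"]
    for _ in range(max_new_tokens):
        scores = NEXT_TOKEN_SCORES[generated[-1]]
        next_token = max(scores, key=scores.get)
        generated.append(next_token)
        if next_token == "</s>":
            # The end-of-sequence token stops generation early
            break
    return generated


tokens = greedy_generate()
print(tokens)
```

Because generation stops at the end-of-sequence token or at `max_new_tokens`, whichever comes first, the output length is not fixed in advance.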

**Step 3:** post-process (decode) the predicted ids to the text outputs. We'll skip any "special" task token ids from the prediction:

In [None]:
tokenizer.decode(pred_ids[0], skip_special_tokens=True)

" Apple announced financial results for its fiscal 2023 third quarter ended July 1, 2023. Company posted quarterly revenue of $81.8 billion, down 1 percent year over year, and quarterly earnings per diluted share of $1.26. 'We are happy to report that we had an all-time revenue record in Services during the June quarter'"

Alright! We see that we get essentially the same summary as we had using the `pipeline()` (the wording differs slightly), but this time explicitly executing each step in the process. The following code snippet combines the three steps so that they can be run one after the other, as you would typically have in an application:

In [None]:
inputs = tokenizer(APPLE_EARNINGS, truncation=True, max_length=1024, return_tensors="pt")

pred_ids = model.generate(inputs["input_ids"])

pred_text = tokenizer.decode(pred_ids[0], skip_special_tokens=True)
print(pred_text)

 Apple announced financial results for its fiscal 2023 third quarter ended July 1, 2023. Company posted quarterly revenue of $81.8 billion, down 1 percent year over year, and quarterly earnings per diluted share of $1.26. 'We are happy to report that we had an all-time revenue record in Services during the June quarter'


## Why Go Deeper?

The advantage of using the lower-level `model` + `tokenizer` API is that you have more control over the specific generation parameters. For example, we can enable the [*beam search*](https://huggingface.co/blog/how-to-generate#beam-search) and [*sampling*](https://huggingface.co/blog/how-to-generate#sampling) generation strategies by passing `num_beams=5` and `do_sample=True` respectively:

In [None]:
pred_ids = model.generate(inputs["input_ids"], num_beams=5, do_sample=True)

tokenizer.decode(pred_ids[0], skip_special_tokens=True)

" Apple posted quarterly revenue of $81.8 billion, down 1 percent year over year, and quarterly earnings per diluted share of $1.26. 'We are happy to report that we had an all-time revenue record in Services during the June quarter, driven by over 1 billion paid subscriptions'"

You also have access to the intermediate outputs in the workflow, for example the token ids from the tokenizer, or the predicted ids from the model. You can re-use these to quickly experiment with using different strategies without having to run everything from scratch each time.
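
To see why the choice of generation strategy matters, the toy sketch below compares a beam width of 1 (equivalent to greedy decoding) with a beam width of 2 on a small, hypothetical probability table. The `beam_search` function is a simplified illustration, not the library's implementation; its `num_beams` argument simply mirrors the name of the parameter used above.

```python
import math

# Hypothetical next-token probabilities P(next | previous), for illustration only
PROBS = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"C": 0.5, "D": 0.5},
    "B": {"E": 0.9, "F": 0.1},
    "C": {"</s>": 1.0},
    "D": {"</s>": 1.0},
    "E": {"</s>": 1.0},
    "F": {"</s>": 1.0},
}


def beam_search(num_beams, max_new_tokens=5):
    """Keep the `num_beams` most probable partial sequences at every step."""
    beams = [(["<s>"], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, logp in beams:
            if tokens[-1] == "</s>":
                # Finished sequences are carried over unchanged
                candidates.append((tokens, logp))
                continue
            for token, p in PROBS[tokens[-1]].items():
                candidates.append((tokens + [token], logp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        if all(tokens[-1] == "</s>" for tokens, _ in beams):
            break
    return beams[0]


greedy_tokens, greedy_logp = beam_search(num_beams=1)
beam_tokens, beam_logp = beam_search(num_beams=2)
print(greedy_tokens, round(math.exp(greedy_logp), 2))
print(beam_tokens, round(math.exp(beam_logp), 2))
```

With a single beam, the search commits to "A" because it is the most probable first token, and ends up with a sequence of probability 0.30. With two beams, it keeps "B" alive long enough to discover the sequence through "E", which has a higher overall probability of 0.36.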

## Conclusion

In this tutorial, we covered how the Transformers library can be applied to three common NLP tasks: sentiment analysis, NER and summarisation. We demonstrated the flexibility of the `pipeline()` class for easily switching between different tasks, and the lower-level API for more fine-grained control over the model.