# Chapter 17: Pipelines for NLP Tasks

Installation Notes
To run this notebook on Google Colab, you will need to install the following libraries: transformers and datasets.

In Google Colab, you can run the following command to install them:

In [None]:
!pip install transformers datasets

## 17.2 Learning Objectives

By the end of this chapter, you should be able to:
- load pretrained Hugging Face pipelines for typical NLP tasks
- identify typical models for each particular task
- understand the output produced by Hugging Face tokenizers

## 17.3 Hugging Face Pipelines

#### Hugging Face Pipelines: Overview
In "Pretrained Models for Natural Language Processing", we used a Hugging Face pipeline to perform sentiment analysis out-of-the-box. We used a text-classification pipeline and its default model distilbert-base-uncased-finetuned-sst-2-english. There are several other tasks and models available, allowing us to easily tackle tasks such as summarization, question answering, and more. Here is the list of NLP tasks pipelines can handle:

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch15/hf_nlp_tasks.png)

Natural Language Processing Pipelines in Hugging Face

### 17.3.1 Models
Although there are literally thousands of available models, each pipeline has its own default model. It's not required that you specify the model while creating the pipeline, but it is strongly recommended since it's not guaranteed that the default model will stay the same over time. Let's say you have built your application using one or more pipelines: if you don't specify the model they should use, upgrading your environment may inadvertently lead to an unforeseen change in behavior, and you definitely don't want that.
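
A minimal sketch of how an application might pin its models explicitly. The `PINNED_MODELS` dictionary and the `build_pipeline()` helper are hypothetical names used only for illustration; the model identifiers are the defaults reported later in this section, and `transformers` is imported only when the helper is actually called.

```python
# Hypothetical registry: task name -> model identifier the application depends on.
PINNED_MODELS = {
    "text-classification": "distilbert-base-uncased-finetuned-sst-2-english",
    "zero-shot-classification": "facebook/bart-large-mnli",
}

def build_pipeline(task, **kwargs):
    """Create a pipeline for `task`, always passing the pinned model explicitly."""
    # Imported lazily so that defining the registry does not require the library
    from transformers import pipeline
    return pipeline(task, model=PINNED_MODELS[task], **kwargs)

# Usage (downloads the model the first time it runs):
# classifier = build_pipeline("text-classification")
# classifier("This chapter is great!")
```

With this arrangement, upgrading the environment should not silently swap the model behind a task, because the default is never consulted.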

You can easily inspect the default model for each task by checking the SUPPORTED_TASKS dictionary and narrowing its items using the type of task you're looking for (text, image, audio, video, or multimodal), and the task itself (as listed in the figure above).

In [None]:
from transformers.pipelines import SUPPORTED_TASKS

[(task, conf['default'].get('model', {}).get('pt', (None,))[0])
 for task, conf in SUPPORTED_TASKS.items()
 if conf['type'] == 'text']

[('text-classification', 'distilbert-base-uncased-finetuned-sst-2-english'),
 ('token-classification', 'dbmdz/bert-large-cased-finetuned-conll03-english'),
 ('question-answering', 'distilbert-base-cased-distilled-squad'),
 ('table-question-answering', 'google/tapas-base-finetuned-wtq'),
 ('fill-mask', 'distilroberta-base'),
 ('summarization', 'sshleifer/distilbart-cnn-12-6'),
 ('translation', None),
 ('text2text-generation', 't5-base'),
 ('text-generation', 'gpt2'),
 ('zero-shot-classification', 'facebook/bart-large-mnli'),
 ('conversational', 'microsoft/DialoGPT-medium')]

Notice that most models are variations of BERT, an encoder-based model:
- [BERT](https://huggingface.co/docs/transformers/model_doc/bert) for token classification
- DistilBERT for text classification and question answering
- DistilRoBERTa for mask filling
- TAPAS (also BERT-like) for table question answering

Some models are variations of GPT, a decoder-based model:
- [GPT2](https://huggingface.co/gpt2) for text generation
- DialoGPT for conversations

And some are fully-fledged Transformers, having both encoder and decoder parts:
- [T5](https://huggingface.co/docs/transformers/model_doc/t5) for text to text generation
- [BART](https://huggingface.co/docs/transformers/model_doc/bart) for zero-shot classification
- DistilBART for summarization

We have gone through the general idea behind encoder-based models, the classification token, and how they can be used to produce contextual embeddings. We also briefly discussed the Transformer architecture. Although we're not going into details of the models above, we'll be diving a little bit into decoder-based models later on.

For a short and interesting overview of the different models, check Sebastian Raschka's [Understanding Encoder And Decoder LLMs](https://magazine.sebastianraschka.com/p/understanding-encoder-and-decoder) blog post.

### 17.3.2 Tokenizers

At this point, the importance of tokenizers in the pipeline should be clear. They preprocess the input sentences and return token indices according to the vocabulary the model was trained on, and they also return additional information such as masks.

Just like in Torchtext, we'll also find BERT, GPT2BPE, and CLIP tokenizers in Hugging Face:

- [BertTokenizer](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer)
- [GPT2Tokenizer](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2Tokenizer)
- [CLIPTokenizer](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTokenizer)

Let's do a quick tour of a pretrained BERT tokenizer:

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

We can check its vocabulary size and the special tokens in its vocabulary:

In [None]:
tokenizer.vocab_size, tokenizer.all_special_tokens

(30522, ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'])

We can tokenize a sentence, much like we did with our basic English tokenizer, but you'll see that it splits some uncommon words (such as "unexplicably") into pieces (it is called WordPiece tokenization for a reason!):

In [None]:
sentences = ("The core of the planet is becoming unexplicably unstable.",
             "The shift in the company's core business markets had impacted their quartely results.")

tokens = tokenizer.tokenize(sentences[0])
tokens

['the',
 'core',
 'of',
 'the',
 'planet',
 'is',
 'becoming',
 'une',
 '##x',
 '##pl',
 '##ica',
 '##bly',
 'unstable',
 '.']
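
The split above is consistent with WordPiece's greedy longest-match-first strategy: starting from the left, take the longest piece found in the vocabulary, then continue with the remainder, marking every non-initial piece with `##`. The sketch below illustrates the idea under simplifying assumptions: `wordpiece()` is a hypothetical helper, and `toy_vocab` is a tiny made-up vocabulary chosen to mirror the output above (the real vocabulary has 30,522 entries).

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"un", "une", "##x", "##pl", "##ica", "##bly", "unstable", "##able"}

pieces = wordpiece("unexplicably", toy_vocab)
print(pieces)  # ['une', '##x', '##pl', '##ica', '##bly']
```

Note that "une" wins over "un" because the longest match is tried first, while a common word such as "unstable" is found whole and left intact.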

Alternatively, we can encode it, which means converting the tokens into their corresponding indices in the vocabulary:

In [None]:
token_ids = tokenizer.encode(sentences[0])
token_ids

[101,
 1996,
 4563,
 1997,
 1996,
 4774,
 2003,
 3352,
 16655,
 2595,
 24759,
 5555,
 6321,
 14480,
 1012,
 102]

Did you notice that there are more token ids (16) than tokens (14)?

Let's decode the token ids to see what we'll get back:

In [None]:
tokenizer.decode(token_ids)

'[CLS] the core of the planet is becoming unexplicably unstable. [SEP]'

Surprise (or maybe not), we got special tokens prepended and appended to the sentence.

There's also the encode_plus() method, which returns not only the token ids but also additional information such as the attention mask and a mysterious token type id:

In [None]:
token_dict = tokenizer.encode_plus(sentences[0])
token_dict

{'input_ids': [101, 1996, 4563, 1997, 1996, 4774, 2003, 3352, 16655, 2595, 24759, 5555, 6321, 14480, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
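
The attention mask above is all ones because a single sentence needs no padding. The mask becomes informative once sentences of different lengths are batched together: shorter ones are padded, and the mask flags padded positions with zeros so the model can ignore them. Here is a rough pure-Python sketch of that idea; `pad_batch()` is a hypothetical helper, and `pad_id=0` assumes the [PAD] token sits at index 0, as it does in the bert-base-uncased vocabulary.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

padded = pad_batch([[101, 1996, 4563, 102], [101, 1996, 102]])
print(padded["input_ids"])       # [[101, 1996, 4563, 102], [101, 1996, 102, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

In practice, calling the tokenizer on a list of sentences with padding=True should produce the padded ids and masks for you.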

For single sentences, the token type id is meaningless. But, if we pair two sentences together, it will work like a "sentence index", telling us which sentence the token belongs to:

In [None]:
token_dict_mult = tokenizer(*sentences)
token_dict_mult

{'input_ids': [101, 1996, 4563, 1997, 1996, 4774, 2003, 3352, 16655, 2595, 24759, 5555, 6321, 14480, 1012, 102, 1996, 5670, 1999, 1996, 2194, 1005, 1055, 4563, 2449, 6089, 2018, 19209, 2037, 24209, 24847, 2135, 3463, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
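
To make the layout of a sentence pair explicit, here is a small sketch that assembles the three lists by hand. It assumes BERT's convention of [CLS] (id 101) at the start and [SEP] (id 102) after each sentence, as seen in the output above; `encode_pair()` is a hypothetical helper, not part of the library.

```python
def encode_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Join two already-encoded sentences the way BERT expects a pair."""
    first = [cls_id] + ids_a + [sep_id]   # [CLS] sentence A [SEP]
    second = ids_b + [sep_id]             # sentence B [SEP]
    input_ids = first + second
    return {
        "input_ids": input_ids,
        # 0 for everything up to the first [SEP], 1 for the rest
        "token_type_ids": [0] * len(first) + [1] * len(second),
        "attention_mask": [1] * len(input_ids),
    }

pair = encode_pair([1996, 4563], [1996, 5670, 1999])
print(pair["input_ids"])       # [101, 1996, 4563, 102, 1996, 5670, 1999, 102]
print(pair["token_type_ids"])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Notice that the special tokens are assigned to a sentence too: [CLS] and the first [SEP] belong to the first sentence, and the final [SEP] belongs to the second.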

If we decode the token ids, we'll get the two sentences back as one. This is useful for some training tasks such as "next sentence prediction (NSP)":

In [None]:
tokenizer.decode(token_dict_mult['input_ids'])

"[CLS] the core of the planet is becoming unexplicably unstable. [SEP] the shift in the company's core business markets had impacted their quartely results. [SEP]"

For a more detailed overview of tokenizers in Hugging Face, check the "[Summary of the tokenizers](https://huggingface.co/docs/transformers/tokenizer_summary)" post.

While it's important to understand the role of tokenizers, they are abstracted away when you're using pipelines. You only need to use the raw sentences as inputs and everything else will be handled under the hood. This is very convenient for us since our data pipes and data loaders were built to return the raw sentences directly.

### 17.3.3 Zero-Shot Text Classification

"The third time's the charm," as the saying goes. We can load a pretrained pipeline and try our hand at [zero-shot text classification](https://huggingface.co/tasks/zero-shot-classification) once again. We'll be using the default model for the task, facebook/bart-large-mnli, a full Transformer model but, first, let's rebuild the datapipes for the AG News dataset:

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step1.png)

Let's quickly retrace our steps here to prepare the dataset one more time.

You can download the files from the following links:

- https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
- https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv
- https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/classes.txt

Alternatively, you can download all files as a single compressed file instead:

https://raw.githubusercontent.com/lftraining/LFD273-code/main/data/AGNews/agnews.zip

If you're running Google Colab, you can download the files using the commands below:

In [None]:
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/classes.txt

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step2.png)

Next, let's do some data cleaning, getting rid of a few HTML tags, replacing some special characters, etc. Here is a non-exhaustive list of characters and tags for replacement:

In [None]:
import numpy as np

chr_codes = np.array([
     36,   151,    38,  8220,   147,   148,   146,   225,   133,    39,  8221,  8212,   232,   149,   145,   233,
  64257,  8217,   163,   160,    91,    93,  8211,  8482,   234,    37,  8364,   153,   195,   169
])
chr_subst = {f' #{c};':chr(c) for c in chr_codes}
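# Tag patterns come first: replace_chars() applies substitutions in insertion order,
# so the bare '&lt;' and '&gt;' entries must not consume parts of a tag before it is matched
chr_subst.update({' amp;': '&', ' quot;': "'", ' hellip;': '...', ' nbsp;': ' ',
                  '&lt;em&gt;': '', '&lt;/em&gt;': '', '&lt;strong&gt;': '', '&lt;/strong&gt;': '',
                  '&lt;': '', '&gt;': ''})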

And here are a couple of helper functions we used to perform the cleanup:

In [None]:
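def replace_chars(sent):
    """Apply every substitution in chr_subst whose key appears in the sentence."""
    # Keys are visited in the dictionary's insertion order
    to_replace = [c for c in chr_subst if c in sent]
    for c in to_replace:
        sent = sent.replace(c, chr_subst[c])
    return sent

def preproc_description(desc):
    """Replace backslashes with spaces, trim whitespace, then clean up entities."""
    desc = desc.replace('\\', ' ').strip()
    return replace_chars(desc)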
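
Because the substitutions are applied one key at a time, their order matters whenever one key is a substring of another (for example, '&lt;' inside '&lt;strong&gt;'). As an alternative sketch, not the approach used in this chapter, a single regular expression with the longest keys listed first handles every substitution in one pass. The reduced `subst` table and the `replace_chars_single_pass()` name below are assumptions made for illustration.

```python
import re

# Reduced, hypothetical substitution table (the chapter's full table has more entries)
subst = {' amp;': '&', ' quot;': "'", '&lt;': '', '&gt;': '',
         '&lt;strong&gt;': '', '&lt;/strong&gt;': ''}

# Longest keys first, so '&lt;strong&gt;' is tried before the bare '&lt;'
pattern = re.compile('|'.join(re.escape(k) for k in sorted(subst, key=len, reverse=True)))

def replace_chars_single_pass(sent):
    """Replace every key of `subst` found in `sent` in a single regex pass."""
    return pattern.sub(lambda m: subst[m.group(0)], sent)

sample = "AT amp;T shares rose, &lt;strong&gt;analysts&lt;/strong&gt; said."
cleaned = replace_chars_single_pass(sample)
print(cleaned)  # AT&T shares rose, analysts said.
```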

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step4.png)

After loading the CSV files using load_dataset() and building a DatasetDict out of them, we used the functions above to transform our datasets, cleaning up the text and converting the label into a 0-based numeric value:

In [None]:
from datasets import load_dataset, Split, DatasetDict

colnames = ['topic', 'title', 'news']

train_ds = load_dataset("csv", data_files='train.csv', sep=',', split=Split.ALL, column_names=colnames)
test_ds = load_dataset("csv", data_files='test.csv', sep=',', split=Split.ALL, column_names=colnames)

datasets = DatasetDict({'train': train_ds, 'test': test_ds})
datasets = datasets.map(lambda row: {'topic': row['topic']-1, 'news': preproc_description(row['news'])})
datasets = datasets.select_columns(['topic', 'news'])

Finally, we created their corresponding data loaders:

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step5.png)

In [None]:
from torch.utils.data import DataLoader

dataloaders = {}
dataloaders['train'] = DataLoader(dataset=datasets['train'], batch_size=32, shuffle=True)
dataloaders['test'] = DataLoader(dataset=datasets['test'], batch_size=32)

Ok, now that we have the data ready once again, we can load the pipeline:

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step5.png)

In [None]:
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=device)

You know the drill:

- define a list of candidate labels
- load a mini-batch of sentences
- call the pipeline to get the results

In [None]:
import warnings
warnings.filterwarnings("ignore")

candidate_labels = ['world', 'sports', 'business', 'science and technology']

batch = next(iter(dataloaders['test']))
labels, sentences = batch['topic'], batch['news']

out = classifier(list(sentences), candidate_labels)

For each sentence, it will return the candidate labels ordered by their scores in decreasing order:

In [None]:
out[0]

{'sequence': "Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.",
 'labels': ['business', 'world', 'sports', 'science and technology'],
 'scores': [0.5680877566337585,
  0.32770952582359314,
  0.05973348021507263,
  0.044469203799963]}
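
Where do these scores come from? Roughly speaking, the zero-shot pipeline frames classification as natural language inference: each candidate label is inserted into a hypothesis (by default, "This example is {}."), and the model scores whether the sentence entails it. In the default single-label setting, the entailment logits are then normalized with a softmax across the candidate labels, which is why the scores add up to one. The sketch below mimics only that last step; `rank_labels()` is a hypothetical helper and the logits are made up for illustration, not actual model outputs.

```python
import math

def rank_labels(entailment_logits):
    """Softmax the per-label entailment logits and sort labels by score."""
    max_logit = max(entailment_logits.values())  # subtracted for numerical stability
    exps = {label: math.exp(logit - max_logit)
            for label, logit in entailment_logits.items()}
    total = sum(exps.values())
    ranked = sorted(exps, key=exps.get, reverse=True)
    return {"labels": ranked, "scores": [exps[label] / total for label in ranked]}

# Made-up entailment logits, one per candidate label
fake_logits = {"world": 1.0, "sports": -0.7, "business": 1.55,
               "science and technology": -1.0}

result = rank_labels(fake_logits)
print(result["labels"])
```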

We can easily get the most likely label and corresponding index:

In [None]:
pred_label = out[0]['labels'][0]
pred_class = candidate_labels.index(pred_label)
pred_label, pred_class

('business', 2)

How many did it get right?

In [None]:
pred_labels = torch.as_tensor([candidate_labels.index(s['labels'][0]) for s in out])
(pred_labels == labels).float().mean()

tensor(0.4062)

That's definitely not impressive: only 13 out of the 32 sentences in this mini-batch were classified correctly. As a suggested exercise, you can run the full evaluation to see how well it performs on the whole test set, and you can try different candidate labels.
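
One possible shape for the full evaluation is sketched below. The `zero_shot_accuracy()` function is a hypothetical helper; it only assumes that each batch is a dictionary with 'topic' and 'news' entries, as produced by our data loaders, and that the classifier returns one dictionary per sentence with its labels sorted by score. A stand-in classifier is used for a quick dry run that needs no model download.

```python
def zero_shot_accuracy(classifier, batches, candidate_labels):
    """Fraction of sentences whose top-scoring label matches the true class."""
    correct, total = 0, 0
    for batch in batches:
        outputs = classifier(list(batch['news']), candidate_labels)
        preds = [candidate_labels.index(o['labels'][0]) for o in outputs]
        # int() handles both plain integers and zero-dimensional tensors
        correct += sum(int(p == int(t)) for p, t in zip(preds, batch['topic']))
        total += len(preds)
    return correct / total

# Stand-in classifier for a dry run: it always ranks the first label on top
def fake_classifier(sentences, candidate_labels):
    return [{'sequence': s, 'labels': list(candidate_labels)} for s in sentences]

fake_batches = [{'topic': [0, 2], 'news': ['a', 'b']},
                {'topic': [0], 'news': ['c']}]
fake_labels = ['world', 'sports', 'business', 'science and technology']

acc = zero_shot_accuracy(fake_classifier, fake_batches, fake_labels)
print(acc)
```

To evaluate the real pipeline, the call would look like zero_shot_accuracy(classifier, dataloaders['test'], candidate_labels). Expect it to take a while without a GPU.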

TIP: Remember that our attention heads were looking for country names to classify the sentences as belonging to the "world" class? Choosing representative labels may play a significant role in zero-shot classification. Choose your candidate labels wisely!
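
When experimenting with different candidate labels, the label text no longer has to match the class names, so it helps to keep an explicit mapping from each candidate label to its class index. The sketch below shows one way to do it; the alternative phrasings are hypothetical examples rather than recommendations, and `predicted_classes()` is an illustrative helper.

```python
# Hypothetical alternative phrasings, each tied to a 0-based AG News class index
label_to_class = {
    "international politics": 0,
    "sports": 1,
    "business and finance": 2,
    "science and technology": 3,
}
alt_labels = list(label_to_class)  # what would be passed to the pipeline

def predicted_classes(outputs, label_to_class):
    """Map each sentence's top-scoring label back to its class index."""
    return [label_to_class[o["labels"][0]] for o in outputs]

# Made-up pipeline outputs, only to exercise the mapping
fake_out = [{"labels": ["business and finance", "sports"]},
            {"labels": ["international politics", "sports"]}]

classes = predicted_classes(fake_out, label_to_class)
print(classes)  # [2, 0]
```

Since the mapping is a dictionary, several phrasings can point to the same class index, which makes it easy to compare label choices.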