# Transformer
Version 2022-12-1

Transformer-based models are the current state of the art in the field of natural language processing. The Transformer architecture is the basis of some of the most advanced AI currently in existence, including the image generator [Stable Diffusion](https://stability.ai/blog/stable-diffusion-v2-release) and the text generator [GPT-3](https://openai.com/api/).

Training a Transformer-based model from scratch is very expensive, due to the large number of parameters and the huge volume of data involved. The cost of training GPT-3 was [estimated](https://bdtechtalks.com/2020/09/21/gpt-3-economy-business-model/) to be in the range of tens of millions of U.S. dollars. Fortunately, many pre-trained models are available. Pre-trained models can be fine-tuned to specific needs by training them further with domain-specific data.

In this notebook, we will use the `transformers` library developed by [Hugging Face](https://huggingface.co/), a startup "on a mission to democratize good machine learning." 

## A. Using Pre-Trained Models

The `transformers` library makes it very easy to download pre-trained models. Downloaded models are saved in a cache folder, which is by default under your home directory at `$HOME/.cache/huggingface`. Because Transformer models require a lot of disk space (larger ones can run into hundreds of GB), we will change the cache folder to a shared one, where I have already downloaded some models. You should change it to a folder that you control when you work on your own projects.


In [1]:
# Hugging Face's default cache directory is $HOME/.cache/huggingface
# To change it, set the environment variable HF_HOME
# BEFORE importing Hugging Face libraries
import os
os.environ["HF_HOME"] = "/data/huggingface/"

# Hugging Face Transformers
# Either PyTorch or TensorFlow must be installed
from transformers import pipeline
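
Since cached models can take up a lot of disk space, it can be useful to check how large the cache folder has grown. The following is a minimal sketch using only the standard library. The helper `folder_size_bytes` is a hypothetical name, not part of `transformers`, and the sketch assumes the cache location follows the `HF_HOME` convention described above.

```python
import os
from pathlib import Path

def folder_size_bytes(folder):
    """Total size of all regular files under `folder` (0 if it does not exist)."""
    root = Path(folder)
    if not root.is_dir():
        return 0
    # Skip symbolic links so that linked files are not counted twice
    return sum(p.stat().st_size
               for p in root.rglob("*")
               if p.is_file() and not p.is_symlink())

# Resolve the cache folder as described above:
# HF_HOME if set, otherwise $HOME/.cache/huggingface
cache_dir = os.environ.get("HF_HOME",
                           os.path.expanduser("~/.cache/huggingface"))
print(f"{cache_dir}: {folder_size_bytes(cache_dir) / 1e9:.2f} GB")
```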

Next we have to decide what model to download. Models are categorized by attributes, including:

#### Model architecture
- BERT, GPT-2, ALBERT, RoBERTa,...

#### Fine-tuned task
- The default is whatever task the model was pre-trained on, 
e.g. BERT is trained to fill in missing words, 
while GPT-2 is trained to predict the next word.
- [*text-generation*](https://huggingface.co/models?pipeline_tag=text-generation) models are fine-tuned for text generation.
- [*question-answering*](https://huggingface.co/models?pipeline_tag=question-answering) models are fine-tuned to answer questions based on a user-provided context.
- [*text-classification*](https://huggingface.co/models?pipeline_tag=text-classification) covers sentiment analysis and topic classification.

There are also models for [summarization](https://huggingface.co/models?pipeline_tag=summarization), [conversation](https://huggingface.co/models?pipeline_tag=conversational), [sentence comparison](https://huggingface.co/models?pipeline_tag=sentence-similarity) and [translation](https://huggingface.co/models?pipeline_tag=translation). You can search for available models on Hugging Face's [website](https://huggingface.co/). 

#### Language
- Models are usually trained on English data, but you can search for other languages, e.g. [Chinese](https://huggingface.co/models?search=chinese).

### A1. Question Answering

Let us start by loading the default Q&A model. `transformers` provides the `pipeline` function for this purpose. The syntax is:
```python
model = pipeline(task,[model])
```

In [2]:
# Question answering with default model.
# This will download the model if not already present
question_answerer = pipeline('question-answering')

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Once the model is loaded, we need to provide it with a `question` and a `context` in a dictionary:

In [3]:
inputs = {
'question': 'What is the ranking of CUHK in Asia?',
'context': 'The Chinese University of Hong Kong ranks 8th in Asia and 48th in the world in the field of Economics and Econometrics (QS World University Rankings by Subject 2021).'
}

question_answerer(inputs)

{'score': 0.9862250685691833, 'start': 42, 'end': 45, 'answer': '8th'}
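
The `start` and `end` values in the output are character offsets into the context, so the answer can be recovered by slicing the context string. The short sketch below reuses the output and context shown above; no model is needed to run it.

```python
# Output of the question-answering pipeline shown above
result = {'score': 0.9862250685691833, 'start': 42, 'end': 45, 'answer': '8th'}
context = ('The Chinese University of Hong Kong ranks 8th in Asia and 48th in the world '
           'in the field of Economics and Econometrics '
           '(QS World University Rankings by Subject 2021).')

# Slice the context with the reported character offsets
span = context[result['start']:result['end']]
print(span)   # 8th

# Show some surrounding text for a quick sanity check
window = 15
print(context[max(0, result['start'] - window):result['end'] + window])
```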

Try different questions and contexts and see what you get.

### A2. Text Generation

For text generation, we will specify that we want the GPT-2 model:

In [4]:
# Text generation with GPT-2
text_generator = pipeline('text-generation', model='gpt2')

We need to provide the model with a text prompt. 
The model will then predict what words should follow.
We can also cap the total length of each sequence, measured in tokens and including the prompt, with `max_length`
and specify how many sequences of text we want with `num_return_sequences`.

In [5]:
# Generate five sequences of up to 20 tokens each (prompt included).
text_generator("I major in economics,", 
               max_length=20, 
               num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I major in economics, and my career involves making money. In fact, it helped me realize why'},
 {'generated_text': 'I major in economics, is an adjunct professor of ecology and evolutionary biology from Columbia University. He and'},
 {'generated_text': 'I major in economics, and has previously done research for a small university in central London working with the'},
 {'generated_text': "I major in economics, and I'm an economist. I'm a big believer that markets and incentives"},
 {'generated_text': 'I major in economics, but my college was full of people with no experience or interest in government,"'}]

Try changing `max_length` and note how the quality of the generated text varies with it.

### A3. Sentiment Analysis

Finally, let us try a sentiment analysis model:

In [6]:
classifier = pipeline('text-classification')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


For sentiment analysis we only need to provide a string of text:

In [7]:
classifier("I am very sad today.")

[{'label': 'NEGATIVE', 'score': 0.9992952346801758}]
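
The `score` is a probability. To my understanding, for a two-label model like this one the pipeline obtains it by applying a softmax to the model's raw output scores (logits) and reporting the most likely label. The sketch below illustrates the idea in plain Python; the logit values are made up and are not actual model output.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to one."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]   # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

labels = ['NEGATIVE', 'POSITIVE']
logits = [4.0, -3.0]   # made-up values, not actual model output

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
print({'label': labels[best], 'score': probs[best]})
```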

### A4. Multiple Samples

`pipeline` allows you to provide multiple samples in a list, though if you want to go through a whole dataset, you might want to use the underlying model directly. How to do so will be covered in part C.

**Do not use `pipeline` on GPU, as that combination is even slower than using `pipeline` on CPU.**

In [8]:
# Time the execution speed of one sample
%time classifier("I am very sad today.")

# Time the execution speed of two samples
%time classifier(["I am very sad today.","I am very sad today."])

# Time the execution speed of three samples
%time classifier(["I am very sad today.","I am very sad today.","I am very sad today."])

CPU times: user 84.8 ms, sys: 0 ns, total: 84.8 ms
Wall time: 8.34 ms
CPU times: user 255 ms, sys: 0 ns, total: 255 ms
Wall time: 15.7 ms
CPU times: user 382 ms, sys: 0 ns, total: 382 ms
Wall time: 22.5 ms


[{'label': 'NEGATIVE', 'score': 0.9992952346801758},
 {'label': 'NEGATIVE', 'score': 0.9992952346801758},
 {'label': 'NEGATIVE', 'score': 0.9992952346801758}]

## B. Tokenizer

If you want to fine-tune a model, you will need to convert your text data
into a suitable format. This is the job of a model's *tokenizer*. 
Because different models have different designs, 
you need to use the tokenizer that comes with the model.

In [9]:
# Tokenizer for DistilBERT
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')

# Use the tokenizer. 
# Note that question and text can be lists of samples rather than single strings.
question, text = "What is the ranking of CUHK in Asia?","8th in Asia"
encodings = tokenizer(question,text)
encodings

{'input_ids': [101, 1327, 1110, 1103, 5662, 1104, 140, 2591, 3048, 2428, 1107, 3165, 136, 102, 5192, 1107, 3165, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

`input_ids` is the text we provide, with each token replaced by its numeric ID. We can use `tokenizer.decode()` to convert it back to text:

In [10]:
tokenizer.decode(encodings['input_ids'])

'[CLS] What is the ranking of CUHK in Asia? [SEP] 8th in Asia [SEP]'

Note the special tokens `[CLS]` and `[SEP]` added by the BERT tokenizer.
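
The `attention_mask` in the encoding above is all ones because there is only one sample and nothing needs padding. When several samples of different lengths are encoded together with padding enabled, shorter ones are filled up with a padding ID and the mask marks which positions are real. The sketch below mimics this in plain Python. The helper `pad_batch` is hypothetical, and using 0 as the padding ID is an assumption about the BERT vocabulary.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad lists of token IDs to a common length and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids))
                      for ids in batch_ids]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# First sample: the 18 token IDs from the output above.
# Second sample: a shorter sequence built from the same IDs
# ([CLS] What is the ranking ? [SEP]), purely for illustration.
batch = [
    [101, 1327, 1110, 1103, 5662, 1104, 140, 2591, 3048, 2428,
     1107, 3165, 136, 102, 5192, 1107, 3165, 102],
    [101, 1327, 1110, 1103, 5662, 136, 102],
]
padded = pad_batch(batch)
print(padded['attention_mask'][1])
```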

Models such as BERT split rare or unknown words into *sub-word tokens*, so that any word can be represented with a limited vocabulary. We usually do not need to look at the sub-word tokens ourselves, but they can be obtained with 
```
tokenizer.convert_ids_to_tokens(input_ids)
```

In [11]:
tokens = tokenizer.convert_ids_to_tokens(encodings['input_ids'])
tokens

['[CLS]',
 'What',
 'is',
 'the',
 'ranking',
 'of',
 'C',
 '##U',
 '##H',
 '##K',
 'in',
 'Asia',
 '?',
 '[SEP]',
 '8th',
 'in',
 'Asia',
 '[SEP]']

Note how BERT separates 'CUHK' into four separate tokens.
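
To give a feel for how such a split can come about, here is a toy sketch of greedy longest-match-first splitting, which is the general idea behind BERT's WordPiece tokenizer. The function `wordpiece` and the tiny vocabulary are made up for illustration; the real tokenizer has a much larger vocabulary and extra preprocessing steps.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first split of one word into sub-word tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate   # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]                        # nothing matches: treat the word as unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, made up for illustration
vocab = {'What', 'is', 'the', 'ranking', 'of', 'in', 'Asia',
         'C', '##U', '##H', '##K'}

print(wordpiece('CUHK', vocab))      # ['C', '##U', '##H', '##K']
print(wordpiece('ranking', vocab))   # ['ranking']
```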

To convert sub-word tokens back to a string, use `tokenizer.convert_tokens_to_string()`:

In [12]:
tokenizer.convert_tokens_to_string(tokens)

'[CLS] What is the ranking of CUHK in Asia? [SEP] 8th in Asia [SEP]'

Let us try another example. This time we will use GPT-2's tokenizer:

In [13]:
# Tokenizer for GPT-2
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

text = "I major in economics"
encodings = tokenizer(text)
print(encodings)
print(tokenizer.decode(encodings['input_ids']))
print(tokenizer.convert_ids_to_tokens(encodings['input_ids']))

{'input_ids': [40, 1688, 287, 12446], 'attention_mask': [1, 1, 1, 1]}
I major in economics
['I', 'Ġmajor', 'Ġin', 'Ġeconomics']


The `Ġ` character at the start of a token marks a space before that token.
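
For plain ASCII text like this example, the original string can be rebuilt by hand by joining the tokens and turning each `Ġ` back into a space. This is only a sketch to show what the marker means; in practice `tokenizer.decode()` does the job, and it also handles characters this shortcut does not.

```python
tokens = ['I', 'Ġmajor', 'Ġin', 'Ġeconomics']   # from the output above

# Join the pieces and turn each 'Ġ' marker back into a space.
# Only valid for plain ASCII text like this example.
text = ''.join(tokens).replace('Ġ', ' ')
print(text)   # I major in economics
```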

## C. Using the Underlying Model

If you want the model to process a lot of samples, you need to use the underlying model directly instead of using `pipeline`. 

First, load the appropriate model and tokenizer. We will use the DistilBERT question-answering model as an example:

In [14]:
from transformers import TFDistilBertForQuestionAnswering
from transformers import DistilBertTokenizerFast
import tensorflow as tf # Need to import either TensorFlow or PyTorch
import numpy as np

# Set up model. TensorFlow model classes start with 'TF'
model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad")

#Tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')

2022-12-01 10:15:55.874938: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-12-01 10:15:55.874964: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (scrp-login-2): /proc/driver/nvidia/version does not exist
2022-12-01 10:15:55.875602: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
All model checkpoint layers were used when initializing TFDistilBertForQuestionAnswering.

All the layers of TFDistilBertForQuestionAnswering were initialized from the model checkpoint at distilbert-base-cased-distilled-squad.
If your task is similar to the task the model of the checkpoint wa

Next, feed the data to the model and process the output:

In [15]:
# Data
question = ["Where is CUHK?", 
            "What is an apple?"]
text = ["CUHK is a university in Hong Kong.", 
        "Apple and orange are examples of fruits."]
inputs = tokenizer(question, text, 
                   return_tensors='tf', 
                   truncation=True, 
                   padding=True)

# Feed data through the model
outputs = model(inputs)

# The Q&A model outputs two logit scores for each token:
# one for its chance of being the start of the answer
# and one for its chance of being the end
start_logits = outputs.start_logits.numpy()
end_logits = outputs.end_logits.numpy()

# Find the tokens with the highest scores
start = np.argmax(start_logits, 1)
end = np.argmax(end_logits, 1)

# Return the answers
tokens = [tokenizer.convert_ids_to_tokens(x) for x in inputs["input_ids"].numpy()]
ans_tokens = [x[start[i]:end[i]+1] for i,x in enumerate(tokens)]
answers = [tokenizer.convert_tokens_to_string(x) for x in ans_tokens]
answers

['Hong Kong', 'orange']
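
The span-selection step can be tried out without a model. The sketch below uses a simplified token list and made-up logits. Unlike the cell above, which picks the start and end independently, the hypothetical helper `best_span` only considers end positions at or after the chosen start. This is an optional safeguard, not something the cell above does.

```python
def best_span(start_logits, end_logits):
    """Return (start, end) token positions with the highest scores, with end >= start."""
    start = max(range(len(start_logits)), key=lambda i: start_logits[i])
    # Only consider end positions at or after the chosen start
    end = max(range(start, len(end_logits)), key=lambda i: end_logits[i])
    return start, end

# Simplified tokens and made-up logits, for illustration only
tokens = ['[CLS]', 'Where', 'is', 'CUHK', '?', '[SEP]',
          'CUHK', 'is', 'in', 'Hong', 'Kong', '.', '[SEP]']
start_logits = [0.1, -1.0, -2.0, 0.3, -3.0, -1.5, 0.5, -0.8, 1.2, 6.4, 2.0, -2.2, -1.5]
end_logits   = [0.0, -1.2, -2.1, 0.2, -2.5, -1.0, 0.1, -1.1, -0.4, 2.5, 6.9, -0.3, -1.0]

start, end = best_span(start_logits, end_logits)
answer = ' '.join(tokens[start:end + 1])
print(answer)   # Hong Kong
```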

To time the script, let us wrap the code above in a function:

In [16]:
def batch_inference(question,text):
    inputs = tokenizer(question, text, 
                       return_tensors='tf', 
                       truncation=True, 
                       padding=True)

    # Feed data through the model
    outputs = model(inputs)

    # The Q&A model outputs two logit scores for each token:
    # one for its chance of being the start of the answer
    # and one for its chance of being the end
    start_logits = outputs.start_logits.numpy()
    end_logits = outputs.end_logits.numpy()

    # Find the tokens with the highest scores
    start = np.argmax(start_logits, 1)
    end = np.argmax(end_logits, 1)

    # Return the answers
    tokens = [tokenizer.convert_ids_to_tokens(x) for x in inputs["input_ids"].numpy()]
    return [tokenizer.convert_tokens_to_string(x[start[i]:end[i]+1]) for i,x in enumerate(tokens)]

Now we can use the magic command `%time` to time the function. This time, we feed the model with 1000 samples:

In [17]:
question = ["Where is CUHK?" for i in range(1000)]
text = ["CUHK is a university in Hong Kong." for i in range(1000)]

%time ans = batch_inference(question,text)

CPU times: user 52.4 s, sys: 27.9 s, total: 1min 20s
Wall time: 4.11 s
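
`%time` is an IPython magic command and is not available in a plain Python script. A rough equivalent using the standard library is sketched below. The helper `timed` is a hypothetical name, and the workload is a stand-in so that the example runs without a model.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, wall-clock seconds)."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - t0

# Stand-in workload; in this notebook one would pass batch_inference instead,
# e.g. timed(batch_inference, question, text)
result, seconds = timed(sum, range(1_000_000))
print(f"Wall time: {seconds:.3f} s")
```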


Compare this with using `pipeline`:

In [18]:
inputs={'question':question,'context':text}
question_answerer = pipeline('question-answering')

%time ans = question_answerer(inputs)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


CPU times: user 3min 7s, sys: 73.7 ms, total: 3min 7s
Wall time: 11.8 s


## D. Running on Cluster

We can speed up the process by using more CPU cores, but it will be even better if we use a GPU. To see how much of a speed-up we can get, let us put what we have above in a Python script. This is available as `hf-batch-inference.py` under the 'Examples' folder.

If you are using the Department of Economics' SCRP HPC Cluster, you can run it on four CPU cores by typing the following commands in a terminal:

```
conda activate tensorflow
compute python [path]/hf-batch-inference.py
```

This should take around six seconds to complete.


To run on a GPU:

```
gpu python [path]/hf-batch-inference.py
```

This runs the script on the slowest available GPU on the cluster, which usually means an RTX 3060. You can expect the inference to complete in 0.35 seconds, excluding the time it takes to load the model and the tokenizer.

The speed-up is much more impressive if we use the fastest GPU available:
```
gpu --gpus=rtx3090:1 python [path]/hf-batch-inference.py
```

Inference on 1000 samples should take less than 0.2 seconds, roughly a 60x speed-up over the 11.8 seconds that `pipeline` took on a login node above.

One thing to beware of is that GPU on-board memory is generally much smaller than main memory, so you could run out of memory if you try to feed a large dataset to the model all at once. In that case you will have to feed data in batches. Both TensorFlow and Hugging Face have a `Dataset` class for this purpose. 
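
Before reaching for a `Dataset` class, the idea of batching can be sketched in plain Python: split the inputs into fixed-size chunks, process each chunk, and collect the results. The helpers `iter_batches` and `run_in_batches` are hypothetical names, and `fake_inference` is a stand-in for `batch_inference` from part C so that the example runs without a model.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def run_in_batches(fn, question, text, batch_size=100):
    """Apply fn(question_batch, text_batch) batch by batch and concatenate the results."""
    answers = []
    for q, t in zip(iter_batches(question, batch_size),
                    iter_batches(text, batch_size)):
        answers.extend(fn(q, t))
    return answers

# Stand-in for batch_inference from part C: returns the last word of each text
def fake_inference(question, text):
    return [t.split()[-1] for t in text]

question = ["Where is CUHK?"] * 250
text = ["CUHK is a university in Hong Kong."] * 250
answers = run_in_batches(fake_inference, question, text, batch_size=100)
print(len(answers), answers[0])
```

A suitable batch size depends on the model and on how much GPU memory is available, so it is worth experimenting with.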