# Transformer

Transformer-based models are the current state of the art in natural language processing. They are the basis of some of the most advanced AI systems currently in existence, including the StarCraft-playing [AlphaStar](https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii) and the text-generating [GPT-3](https://en.wikipedia.org/wiki/GPT-3).

Training a Transformer-based model from scratch is very expensive, due to the large number of parameters and the huge volume of data involved. The cost of training GPT-3 was [estimated](https://bdtechtalks.com/2020/09/21/gpt-3-economy-business-model/) to be in the range of tens of millions of U.S. dollars. Fortunately, many pre-trained models are available. Pre-trained models can be fine-tuned to specific needs by training them further with domain-specific data.

In this notebook, we will use the `transformers` library developed by [Hugging Face](https://huggingface.co/), a startup "on a mission to democratize good machine learning." 

## A. Using Pre-Trained Models

The `transformers` library makes it very easy to download pre-trained models. Downloaded models are saved in a cache folder, which by default is under your home directory at `$HOME/.cache/huggingface`. Because Transformer models require a lot of disk space (larger ones can run into hundreds of GB), we will change the cache folder to a shared one, where I have already downloaded some models.


In [1]:
# Hugging Face's default cache directory is $HOME/.cache/huggingface
# To change it, set the environment variable HF_HOME
# BEFORE importing Hugging Face libraries
import os
os.environ["HF_HOME"] = "/data/huggingface/"

# Hugging Face Transformers
# Either PyTorch or TensorFlow must be installed
from transformers import pipeline
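
To double-check which cache folder will be used, the lookup described above can be sketched roughly as follows. `resolve_hf_home` is a hypothetical helper written for this notebook, not part of `transformers`, and it is a simplification: the library may consult other settings that this sketch ignores.

```python
import os
from pathlib import Path


def resolve_hf_home(environ=None):
    """Hypothetical helper: return HF_HOME if set, else the default cache folder."""
    environ = os.environ if environ is None else environ
    # Default location named in the text above: $HOME/.cache/huggingface
    default = Path.home() / ".cache" / "huggingface"
    return Path(environ.get("HF_HOME", default))


# With the variable set as in the cell above
print(resolve_hf_home({"HF_HOME": "/data/huggingface/"}))
# Without it, the default under the home directory is used
print(resolve_hf_home({}))
```

Calling `resolve_hf_home()` with no argument reads the real environment, which is a quick way to confirm that the variable was set before the import.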

Next we have to decide what model to download. Models are categorized by attributes, including:

#### Model architecture
- BERT, GPT-2, ALBERT, RoBERTa,...

#### Fine-tuned task
- The default is whatever the model was pre-trained on,
e.g. BERT is trained to fill in missing words,
while GPT-2 is trained to predict the next word.
- [*text-generation*](https://huggingface.co/models?pipeline_tag=text-generation) models are fine-tuned for text generation.
- [*question-answering*](https://huggingface.co/models?pipeline_tag=question-answering) models are fine-tuned to answer questions based on a user-provided context.
- [*text-classification*](https://huggingface.co/models?pipeline_tag=text-classification) covers sentiment analysis and topic classification.

There are also models for [summarization](https://huggingface.co/models?pipeline_tag=summarization), [conversation](https://huggingface.co/models?pipeline_tag=conversational), [sentence comparison](https://huggingface.co/models?pipeline_tag=sentence-similarity) and [translation](https://huggingface.co/models?pipeline_tag=translation). You can search for available models on Hugging Face's [website](https://huggingface.co/). 

#### Language
- Models are usually trained on English data, but you can search for other languages, e.g. [Chinese](https://huggingface.co/models?search=chinese).

### A1. Question Answering

Let us start by loading the default Q&A model. `transformers` provides the `pipeline` function for this purpose. The syntax is:
```python
model = pipeline(task, model=model_name)  # the model argument is optional
```

In [None]:
# Question answering with default model.
# This will download the model if not already present
question_answerer = pipeline('question-answering')

Once the model is loaded, we need to provide it with a `question` and a `context` in a dictionary:

In [3]:
inputs = {
'question': 'What is the ranking of CUHK in Asia?',
'context': 'The Chinese University of Hong Kong ranks 8th in Asia and 48th in the world in the field of Economics and Econometrics (QS World University Rankings by Subject 2021).'
}

question_answerer(inputs)

{'score': 0.9862246513366699, 'start': 42, 'end': 45, 'answer': '8th'}
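
The `start` and `end` fields in the output are character offsets into the context, so slicing the context with them recovers the answer. A minimal sketch using the values printed above (the 10-character window is an arbitrary choice for illustration):

```python
# Output and context copied from the example above
result = {'score': 0.9862246513366699, 'start': 42, 'end': 45, 'answer': '8th'}
context = ('The Chinese University of Hong Kong ranks 8th in Asia and 48th in '
           'the world in the field of Economics and Econometrics '
           '(QS World University Rankings by Subject 2021).')

# Slicing the context with the offsets recovers the answer text
span = context[result['start']:result['end']]
print(span)

# A little surrounding text helps when checking an answer by eye
window = context[max(0, result['start'] - 10):result['end'] + 10]
print(window)
```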

Try different questions and contexts and see what you get.

### A2. Text Generation

For text generation, we will specify that we want the GPT-2 model:

In [None]:
# Text generation with GPT-2
text_generator = pipeline('text-generation', model='gpt2')

We need to provide the model with a text prompt.
The model will then predict what words should follow.
We can also cap the total length of each sequence with `max_length`,
which is measured in tokens and includes the prompt,
and set how many sequences of text we want with `num_return_sequences`.

In [11]:
# Generate five sequences of at most 20 tokens each (prompt included).
text_generator("I major in economics,", 
               max_length=20, 
               num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I major in economics, the only role for me is as an economics teacher." And he added:'},
 {'generated_text': "I major in economics, including financial science and law, which I did in high school and I'm"},
 {'generated_text': 'I major in economics, and can be reached at (202) 694-4161.'},
 {'generated_text': "I major in economics, but as a graduate student or professor, I've always wondered if we can"},
 {'generated_text': 'I major in economics, philosophy, science, economics, etc.)\n\nGran Turismo Sport'}]

Try changing `max_length` and note how the quality of the generated text varies with it.
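
To see how `max_length` and `num_return_sequences` interact, here is a toy sketch of sampling one token at a time. It is not how GPT-2 is implemented: `NEXT_WORDS` is a made-up lookup table standing in for the model, `generate` is a hypothetical function, and whitespace-separated words stand in for the subword tokens a real model uses.

```python
import random

# Made-up table of possible next words, standing in for a language model
NEXT_WORDS = {
    "economics,": ["and", "but"],
    "and": ["I", "finance"],
    "but": ["I"],
    "I": ["study", "like"],
    "study": ["statistics"],
    "like": ["statistics"],
    "finance": ["and"],
    "statistics": ["and"],
}


def generate(prompt, max_length, num_return_sequences, seed=0):
    """Sample continuations of the prompt, one word at a time."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(num_return_sequences):
        tokens = prompt.split()
        # max_length counts the prompt tokens as well as the new ones
        while len(tokens) < max_length and tokens[-1] in NEXT_WORDS:
            tokens.append(rng.choice(NEXT_WORDS[tokens[-1]]))
        outputs.append({"generated_text": " ".join(tokens)})
    return outputs


samples = generate("I major in economics,", max_length=8, num_return_sequences=3)
for sample in samples:
    print(sample["generated_text"])
```

Because the prompt already uses four of the eight tokens, each sequence only gains four new words. The same accounting explains why a small `max_length` leaves the model little room after a long prompt.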

### A3. Sentiment Analysis

Finally, let us try a sentiment analysis model:

In [12]:
classifier = pipeline('text-classification')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


For sentiment analysis we only need to provide a string of text:

In [13]:
classifier("I am very sad today.")

[{'label': 'NEGATIVE', 'score': 0.9992952346801758}]
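
The classifier returns a list with one dictionary per input text. When working with many texts it can be convenient to collapse each result into a single number. The sketch below does this with a hypothetical helper, `signed_score`, which assumes the only labels are `POSITIVE` and `NEGATIVE`, as is the case for the default model above.

```python
def signed_score(result):
    """Hypothetical helper: map one result to a number between -1 and 1."""
    sign = 1.0 if result["label"] == "POSITIVE" else -1.0
    return sign * result["score"]


# Output copied from the example above
results = [{'label': 'NEGATIVE', 'score': 0.9992952346801758}]
scores = [signed_score(r) for r in results]
print(scores)
```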

## B. Tokenizer

If you want to fine-tune a model, you will need to convert your text data
into a suitable format. This is the job of a model's *tokenizer*. 
Because different models have different designs, 
you need to use the tokenizer that comes with the model.

In [4]:
# Tokenizer for DistilBERT
from transformers import DistilBertTokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased-distilled-squad')

In [5]:
# Use the tokenizer. The question and text can each be a list of strings rather than a single string.
question, text = "What is the ranking of CUHK in Asia?","8th in Asia"
encodings = tokenizer(question,text)
encodings

{'input_ids': [101, 1327, 1110, 1103, 5662, 1104, 140, 2591, 3048, 2428, 1107, 3165, 136, 102, 5192, 1107, 3165, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [6]:
tokenizer.decode(encodings['input_ids'])

'[CLS] What is the ranking of CUHK in Asia? [SEP] 8th in Asia [SEP]'
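
The decoded string shows how the two inputs are packed into one sequence: a `[CLS]` token at the start and a `[SEP]` token after each segment. Matching positions against the `input_ids` above, `[CLS]` is id 101 and `[SEP]` is id 102. A small sketch that splits the ids back into the two segments:

```python
# input_ids copied from the tokenizer output above
input_ids = [101, 1327, 1110, 1103, 5662, 1104, 140, 2591, 3048, 2428,
             1107, 3165, 136, 102, 5192, 1107, 3165, 102]

CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] in this vocabulary

# The sequence starts with [CLS] and ends with [SEP]
assert input_ids[0] == CLS_ID and input_ids[-1] == SEP_ID

# The first [SEP] marks the end of the question
first_sep = input_ids.index(SEP_ID)
question_ids = input_ids[1:first_sep]
text_ids = input_ids[first_sep + 1:-1]

print(len(question_ids), question_ids)
print(len(text_ids), text_ids)
```

The question contains nine words and punctuation marks but yields twelve ids, which suggests that a rare word such as "CUHK" is split into several subword pieces.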

In [12]:
print(type(encodings))
print(type(dict(encodings)))

<class 'transformers.tokenization_utils_base.BatchEncoding'>
<class 'dict'>


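To fine-tune a model, tokenize each split of the data and pair the encodings with their labels in a `tf.data.Dataset`. The cell below assumes that `train_texts`, `val_texts` and `test_texts` are lists of strings, and that `train_labels`, `val_labels` and `test_labels` are the matching lists of integer labels.

In [None]:
# Fast tokenizer for DistilBERT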
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))
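
The `padding=True` and `truncation=True` arguments used above make every sequence in a batch the same length, which is what lets the encodings be stacked into tensors. The toy function below sketches the idea. `pad_and_truncate` is a hypothetical helper, the token ids are illustrative, and a real tokenizer is more careful: for example, it keeps the special tokens in place when it truncates, whereas this sketch simply slices.

```python
def pad_and_truncate(batch, max_length=None, pad_id=0):
    """Toy version of padding and truncation for a batch of token id lists."""
    if max_length is not None:
        # Truncation: drop everything past max_length
        batch = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in batch)
    # Padding: extend shorter sequences with pad_id up to the longest one
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # The attention mask is 1 for real tokens and 0 for padding
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 1327, 1110, 102], [101, 5192, 1107, 3165, 136, 102]]
encoded = pad_and_truncate(batch)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The attention mask tells the model which positions are padding so that they can be ignored.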

In [None]:
from transformers import TFDistilBertForSequenceClassification

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss)  # any Keras loss function can also be used
# The dataset is already batched, so batch_size is not passed to fit()
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3)