<a href="https://colab.research.google.com/github/vanessa920/2021-ComputeFest/blob/main/Lab_1_Intro_to_Language_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 style="padding-top: 25px;padding-bottom: 25px;text-align: left; padding-left: 10px; background-color: #DDDDDD; 
    color: black;"> <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> <a href='https://www.computefest.seas.harvard.edu/' target='_blank'><strong>IACS: ComputeFest 2021</strong></a></h1>

# Language Models

#### **Authors/Instructors:**
Chris Tanner, Shivas Jayaram, Rohit Beri, Zhao Lyu, Xiaohan Yang

## <font color="darkred">Workshop Outline</font>



1. [**Setup Notebook**](#Setup-Notebook)
2. [**Text Generation**](#Text-Generation)
 - ***Transformers***
 - ***GPT-2 Pretrained Language Model***
    - Overview
    - Greedy Search
    - Beam Search
    - Top-K Sampling
    - Top-p Sampling
    - Sampling using Temperature
    - Interactive Examples
    - Parameters for text generation
 - ***Additional Notes***
3. [**References**](#References)

In [None]:
import requests
from IPython.display import HTML, display
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text

def style(*args):
    # Display the course stylesheet; IPython may pass an info object to the callback
    display(HTML(styles))

get_ipython().events.register('pre_run_cell', style)

---

## <font color="darkred">Setup Notebook

### <font color="green">Copy & Setup Colab with GPU

1) Select "File" menu and pick "Save a copy in Drive"  
2) This notebook is already set up to use a GPU. To change this, go to the "Runtime" menu and select "Change runtime type". Then, in the popup, select "GPU" under "Hardware accelerator" and click "Save"  
3) If you want high RAM there is an option for that

### <font color="green">Installs

Install the ```transformers``` library (state-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0) from https://huggingface.co/transformers/index.html

In [None]:
!pip install transformers

### <font color="orange">Nano Quiz</font>

**1.** What is HuggingFace?
* A.  a library that contains many language models
* B.  a specific type of Transformers
* C.  an emoji used frequently in NLP community
* D.  an NLP-focused startup with a large open-source community, in particular around the Transformers library.

#### <font color="purple">Answer

D.  an NLP-focused startup with a large open-source community, in particular around the Transformers library.

### <font color="green">Imports

In [None]:
import sys
import logging
from argparse import ArgumentParser
from subprocess import call

import tensorflow as tf

from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

### <font color="green">Setup Logger

In [None]:
# Setup Logger
if '__file__' not in globals():
  __file__ = "."
logger = logging.getLogger(__file__)

# Logger config
logging.basicConfig(level=logging.INFO)

### <font color="green">Verify Setup

In [None]:
# Enable/Disable Eager Execution
# Reference: https://www.tensorflow.org/guide/eager
# TensorFlow's eager execution is an imperative programming environment that evaluates operations immediately, 
# without building graphs

#tf.compat.v1.disable_eager_execution()
#tf.compat.v1.enable_eager_execution()

logger.info('__Python VERSION: %s', sys.version)
logger.info("tensorflow version: %s", tf.__version__)
logger.info("keras version: %s", tf.keras.__version__)
logger.info("Eager Execution Enabled: %s", tf.executing_eagerly())

# Get the number of replicas 
strategy = tf.distribute.MirroredStrategy()
logger.info("Number of replicas: %s", strategy.num_replicas_in_sync)

devices = tf.config.experimental.get_visible_devices()
logger.info("Devices: %s", devices)
logger.info(tf.config.experimental.list_logical_devices('GPU'))

logger.info("GPU Available: %s", tf.config.list_physical_devices('GPU'))
logger.info("All Physical Devices: %s", tf.config.list_physical_devices())

# nvidia-smi
call(["nvidia-smi", "--format=csv", "--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free"])

# Better performance with the tf.data API
# Reference: https://www.tensorflow.org/guide/data_performance
AUTOTUNE = tf.data.experimental.AUTOTUNE

---

## <font color="darkred">Text Generation using Language Models

> **Language Model:**
>> A machine learning model that is able to look at part of a sentence and predict the next word.
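
To make the definition concrete, here is a minimal sketch of a count-based bigram model that predicts the next word from the previous one. The corpus and the helper names `train_bigram` and `predict_next` are invented for illustration; GPT-2 is far more capable because it conditions on the whole preceding text rather than a single word.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return (next_word, probability) pairs, most likely first."""
    followers = counts[word.lower()]
    total = sum(followers.values())
    return [(w, c / total) for w, c in followers.most_common()]

# Toy corpus, invented for this example
corpus = [
    "today is a great day",
    "today is a good day",
    "today is a great opportunity",
]
bigram_counts = train_bigram(corpus)
print(predict_next(bigram_counts, "a"))
print(predict_next(bigram_counts, "great"))
```

A word that never appeared in the corpus yields an empty list, which hints at why models trained on large and diverse text generalize better.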

### <font color="green">Transformers</font>

* [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)

<br>

A **transformer** is an attention-based deep learning model that has proven to be effective in some common NLP tasks. It is essentially a stack of encoder and decoder layers, where the encoder encodes the input using the attention mechanism, and the decoder uses the information produced by the encoder to generate the output. Unlike RNNs/LSTMs, transformers lend themselves to parallelization.

![Transformer](https://miro.medium.com/max/4800/1*zashbcHkygPg0GGEzrmemg.png)
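
The attention mechanism mentioned above can be sketched in a few lines of plain Python. This is a simplified illustration of scaled dot-product attention only: it assumes the queries, keys, and values are given directly, and it leaves out the learned projection matrices, multiple heads, and masking that a real transformer uses. The vectors are invented values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional token vectors (invented values)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each query row is processed independently of the others, which is why this computation parallelizes well, in contrast to the step-by-step recurrence of an RNN.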

### <font color="green">GPT-2 Pretrained Language Model

#### <font color="orange">Overview</font>

* [Open AI: GPT-2](https://openai.com/blog/better-language-models/)
* [The Illustrated GPT-2](http://jalammar.github.io/illustrated-gpt2/)

<br>

GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains.

> GPT-2 (a successor to GPT), was trained simply to predict the next word in 40GB of Internet text. GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we prime the model with an input and have it generate a lengthy continuation. 

<br>

In addition, GPT-2 outperforms other language models trained on specific domains (like Wikipedia, news, or books) without needing to use these domain-specific training datasets. On language tasks like question answering, reading comprehension, summarization, and translation, GPT-2 begins to learn these tasks from the raw text, using no task-specific training data. While scores on these downstream tasks are far from state-of-the-art, they suggest that the tasks can benefit from unsupervised techniques, given sufficient (unlabeled) data and compute.

> The GPT-2 is built using transformer decoder blocks. BERT, on the other hand, uses transformer encoder blocks. One key difference between the two is that GPT2, like traditional language models, outputs one token at a time.

<br>

![GPT-2](http://jalammar.github.io/images/gpt2/gpt2-sizes-hyperparameters-3.png)


#### <font color="orange">Load Model & Tokenizer</font>

Let's first initialize the pretrained GPT-2 model and the tokenizer.

In [None]:
# Tokenizer - Converts text into numerical tokens. 
# Special tokens for Start/end of text and words unknown to the pretrained model.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Model - Load pretrained GPT Language Generation Model
# The language generation model continues a text given the beginning of a sentence.
model = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

#### <font color="orange">Input Text</font>
Here we use the GPT-2 tokenizer to encode the input text into a sequence of token IDs that the model can consume.

In [None]:
tf.random.set_seed(1234)

# Tokenize Input
input_ids = tokenizer.encode('Today is a great day', return_tensors='tf')
print("input_ids",input_ids)

#### <font color="orange">Text generation using Greedy Search</font>

Greedy search will simply select the word with the highest probability as its next word: $w_t = \arg\max_{w} P(w \mid w_{1:t-1})$ at each timestep $t$. 

![Greedy](https://www.techopedia.com/images/uploads/cdbfcae0113e42b0b02a0821db92d660.PNG)

> The example below generates the five tokens that are most likely to come next, printing the sequence after each step. (If the last generated token is a line break, the final line appears shifted.)

In [None]:
output_ids = tf.identity(input_ids)
for i in range(5):
  model_outputs = model(input_ids=output_ids)
  next_token_logits = model_outputs[0][:, -1, :]

  # Greedy decoding
  next_token = tf.math.argmax(next_token_logits, axis=-1, output_type=tf.int32)

  next_token = tf.reshape(next_token, [1,-1])
  output_ids = tf.concat([output_ids,next_token], axis=1)
  print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

In [None]:
# Using the prebuilt method in "transformers" model
# max_length is the maximum length of the whole text, including input words and generated ones.

outputs = model.generate(input_ids, max_length=10,num_return_sequences=1)
print("Generated text:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

#### <font color="orange">Text generation using Beam search</font>

One important drawback of greedy search is that it misses high-probability words that are hidden behind a low-probability word. To reduce this risk, we can keep track of several of the most likely candidate sequences at each step instead of only one. This is exactly what Beam search does; at the end, it chooses the candidate sequence with the highest overall probability.

Beam search is an improved version of greedy search. It has a hyperparameter named beam size, $k$. At time step 1, we select the $k$ tokens with the highest conditional probabilities. Each of them will be the first token of $k$ candidate output sequences, respectively. At each subsequent time step, based on the $k$ candidate output sequences at the previous time step, we continue to select $k$ candidate output sequences with the highest conditional probabilities.

![Beam Search](https://d2l.ai/_images/beam-search.svg)
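
The procedure described above can be sketched on a tiny hand-written model. The probability table `TOY_MODEL` and the function `beam_search` are hypothetical and exist only for this illustration; scores are summed log-probabilities, which is equivalent to multiplying the conditional probabilities along a sequence.

```python
import math

# Hypothetical next-word probabilities, keyed by the sequence so far
TOY_MODEL = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}

def beam_search(prompt, steps, beam_size):
    """Keep the beam_size highest-scoring sequences at every step."""
    beams = [(0.0, tuple(prompt))]  # (sum of log-probabilities, sequence)
    for _ in range(steps):
        candidates = []
        for log_prob, seq in beams:
            for word, p in TOY_MODEL[seq].items():
                candidates.append((log_prob + math.log(p), seq + (word,)))
        # Sort by score and keep only the best beam_size candidates
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    return [(math.exp(lp), " ".join(seq)) for lp, seq in beams]

print(beam_search(["The"], steps=2, beam_size=1))  # a beam of 1 behaves like greedy search
print(beam_search(["The"], steps=2, beam_size=2))
```

With a beam size of 1 the search commits to "nice" and ends at "The nice woman" (probability 0.2). With a beam size of 2 it also keeps "dog" alive and finds "The dog has" (probability 0.36), a sequence greedy search cannot reach.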

In [None]:
# Beam search
n = 2

# Predict from model
def predict_from_model(model,input_ids):
  model_outputs = model(input_ids=input_ids)
  next_token_logits = model_outputs[0][:, -1, :]
  top_n = tf.math.top_k(next_token_logits, k=n)
  top_n_logits = top_n[0]
  top_n_scores = tf.nn.softmax(top_n_logits, axis=-1)
  top_n_index = top_n[1]
  return top_n_index, top_n_scores

def generate_beam_search_options(level,score, model, current_output_ids):
  # Expand the current sequence with its top-n next tokens and recurse to a fixed depth.
  # Scores are softmax values over the top-n logits only, so they are renormalized.
  top_n_index_next, top_n_scores_next = predict_from_model(model, current_output_ids)
  for k in range(n):
    next_score = top_n_scores_next[0][k].numpy()
    # The score of a path is the product of its per-step scores
    path_score = next_score*score
    next_next_token = tf.reshape(top_n_index_next[0][k], [1,-1])
    next_current_output_ids = tf.concat([current_output_ids,next_next_token], axis=1)
    output_text = tokenizer.decode(next_current_output_ids[0], skip_special_tokens=True)
    print("\t"*level,k,output_text,":",next_score,",", path_score)

    if level < 3:
      generate_beam_search_options(level+1,path_score, model, next_current_output_ids)

print("Prompt:",tokenizer.decode(input_ids[0], skip_special_tokens=True))
output_ids = tf.identity(input_ids)
top_n_index, top_n_scores = predict_from_model(model, output_ids)

for j in range(n):
  score = top_n_scores[0][j].numpy()
  next_token = tf.reshape(top_n_index[0][j], [1,-1])
  current_output_ids = tf.concat([output_ids,next_token], axis=1)
  output_text = tokenizer.decode(current_output_ids[0], skip_special_tokens=True)
  print(j,output_text,":",score)

  # Next word
  generate_beam_search_options(1,score, model, current_output_ids)


In [None]:
num_beams = 2  # beam width of 2
max_length = input_ids.shape[1] + 5  # prompt length plus the next 5 tokens

# Using the prebuilt method in "transformers" model
outputs = model.generate(
    input_ids,  
    max_length=max_length, 
    num_beams=num_beams, 
    early_stopping=True
)

print("Generated text:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

**Challenges with Beam search:**

One obvious challenge is the length of generation. In our example above the length of the generation is given, but in open-ended generation the length varies and is hard to predict. Another problem is that the distribution of high-probability words does not necessarily capture human language. 

---

#### <font color="orange">Nano Quiz</font>

**2.** When Beam width is set to 1, would Beam search perform the same as greedy search?

* A. Yes
* B. No

##### <font color="purple">Answer

A. Yes - Greedy Search can be treated as a special type of beam search with a beam size of 1

#### <font color="orange">Food for thought

* **What is the limiting case of beam search?**



* Exhaustive Search

#### <font color="orange">Text generation using Top-K Sampling</font>

* Greedy search and Beam search are deterministic: given the same probabilities, they always produce the same output.
* We can use sampling to introduce randomness in text generation. 
* **In Top-K Sampling, the $k$ most likely next words are filtered and the probability mass is redistributed among only those $k$ words.**

![Top-K](https://miro.medium.com/max/1400/1*ixvVLan_Ll3MdLDEfJj5qA.png)

In [None]:
top_k = 50

output_ids = tf.identity(input_ids)
for i in range(5):
  model_outputs = model(input_ids=output_ids)
  next_token_logits = model_outputs[0][:, -1, :]

  # Remove all tokens with a probability less than the last token of the top-k
  indices_to_remove = next_token_logits < tf.math.top_k(next_token_logits, k=top_k)[0][..., -1, None]
  top_k_logits = tf.zeros_like(next_token_logits) + -float("Inf")
  top_k_logits = tf.where(indices_to_remove, top_k_logits, next_token_logits)
  
  # Next token based on top k probs
  next_token = tf.random.categorical(top_k_logits, dtype=tf.int32, num_samples=1)
  next_token = tf.reshape(next_token, [1,-1])
  output_ids = tf.concat([output_ids,next_token], axis=1)
  print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

In [None]:
# Using the prebuilt method in "transformers" model
outputs = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=10, 
    top_k=50
)

print("Generated text:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

**Top-K sampling is not perfect:** 
* One concern with Top-K sampling is that it does not dynamically adapt the number of words that are filtered from the next word probability distribution. 
* This can be problematic as some words might be sampled from a very sharp distribution, whereas others from a much more flat distribution.
* Limiting the sample pool to a fixed size K can lead the model to produce gibberish for sharp distributions and limit the model's creativity for flat distributions.
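
A small numeric sketch makes this concern tangible. The two distributions below are invented, and `top_k_mass` is a hypothetical helper that reports how much probability mass the K most likely tokens cover.

```python
def top_k_mass(probs, k):
    """Return the share of probability mass covered by the k most likely tokens."""
    return sum(sorted(probs, reverse=True)[:k])

# Two invented next-token distributions over a 10-token vocabulary
sharp = [0.90, 0.05, 0.02] + [0.03 / 7] * 7
flat = [0.10] * 10

k = 3
print("sharp:", round(top_k_mass(sharp, k), 3))
print("flat: ", round(top_k_mass(flat, k), 3))
```

With K = 3, the sharp distribution keeps two tokens that the model itself considers unlikely (5% and 2%), while the flat distribution discards seven tokens that are just as plausible as the three it keeps, throwing away 70% of the probability mass.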

#### <font color="orange">Nano Quiz</font>

**3.** In which case might we not want to use top-k sampling for text generation?

* A. when the probability distribution is flat 
* B. when the probability distribution is sharp 

##### <font color="purple">Answer

B. When the probability distribution is sharp

#### <font color="orange">Text generation using Top-p sampling</font>

In contrast to Top-K sampling, the **Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability $p$.** The probability mass is then redistributed among this set of words. The number of words in such a set can dynamically increase and decrease according to the next word's probability distribution. 

![Top-p](https://miro.medium.com/max/1400/1*9HEQLJLkPe1Tc1VwIYk5Iw.png)
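
As a rough sketch of the filtering step, the function below keeps the most likely tokens until their cumulative probability reaches $p$ and then renormalizes. The name `top_p_filter` and the two distributions are invented for illustration, and the stopping rule is an assumption: this sketch stops once the cumulative probability is at least $p$, whereas library implementations may handle the boundary differently.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p,
    then renormalize the kept probabilities so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Invented next-word distributions after the prompt
sharp = {"day": 0.85, "time": 0.08, "idea": 0.04, "car": 0.03}
flat = {"day": 0.30, "time": 0.25, "idea": 0.25, "car": 0.20}

print(top_p_filter(sharp, p=0.75))
print(top_p_filter(flat, p=0.75))
```

For the sharp distribution a single word already covers the threshold, so the candidate set shrinks to one word; for the flat distribution three words are needed. This is the dynamic set size that Top-K sampling lacks.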

In [None]:
# Using the prebuilt method in "transformers" model
outputs = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=10, 
    top_p=0.80, 
    top_k=0
)

print("Generated text:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

While in theory, Top-p seems more elegant than Top-K, both methods work well in practice. **Top-p can also be used in combination with Top-K, which can avoid very low ranked words while allowing for some dynamic selection.**

#### <font color="orange">Sampling using Temperature parameter</font>


**Temperature is a hyper-parameter used to control the randomness of predictions by scaling the logits before applying softmax:**

* when temperature is a small value (e.g. 0.2), the GPT-2 model is more confident but also more conservative
* when temperature is a large value (e.g. 1), the GPT-2 model produces more diversity and also more mistakes
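
The effect of this scaling can be sketched without the model. The logits below are invented, and `softmax_with_temperature` is a hypothetical helper that divides the logits by the temperature before the softmax.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens
for t in (1.0, 0.7, 0.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature decreases, more of the probability mass moves to the highest-scoring token, so sampling behaves more and more like always picking the most likely word.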

> <font color="blue">**Default Logits i.e. Temperature = 1**

![Temp1](https://huggingface.co/blog/assets/02_how-to-generate/sampling_search.png)

> <font color="blue"> **Temperature = 0.7**

![Temp7](https://huggingface.co/blog/assets/02_how-to-generate/sampling_search_with_temp.png)

In [None]:
# Using the prebuilt method in "transformers" model
outputs = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=20,
    temperature=0.7
)

print("Generated text:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

#### <font color="orange">Nano Quiz</font>

**4.** As temperature gets closer to zero, the output gets closer to?

* A. Greedy Search
* B. Beam Search with beam size equal to the size of vocabulary
* C. Top-p sampling with p=0.5

##### <font color="purple">Answer

A. Greedy Search

#### <font color="orange">Interactive Examples</font>

In [None]:
from __future__ import print_function
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

In [None]:
def f(input, max_length, top_p, top_k, temperature, num_beams):

    tf.random.set_seed(42)

    input_ids = tokenizer.encode(input, return_tensors='tf')

    outputs = model.generate(
        input_ids, 
        do_sample=True, 
        max_length=max_length, 
        top_p=top_p, 
        top_k=top_k,
        temperature=temperature,
        num_beams=num_beams,
    )

    print()
    print("Generated text:")
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    print()
    return None

In [None]:
interact(
    f, 
    input="Today is a great day",
    max_length=widgets.IntSlider(min=10, max=100, step=5, value=10),
    top_p=widgets.FloatSlider(min=0.25, max=1, step=0.05, value=0.90), 
    top_k=widgets.IntSlider(min=1, max=25, step=1, value=10),
    temperature=widgets.FloatSlider(min=0.1, max=2, step=0.1, value=1),
    num_beams=widgets.IntSlider(min=1, max=5, step=1, value=2),
    )

#### <font color="orange">Parameters for model.generate(...)



```
def generate(
        self,
        input_ids=None,
        max_length=None,
        min_length=None,
        do_sample=None,
        early_stopping=None,
        num_beams=None,
        temperature=None,
        top_k=None,
        top_p=None,
        repetition_penalty=None,
        bad_words_ids=None,
        bos_token_id=None,
        pad_token_id=None,
        eos_token_id=None,
        length_penalty=None,
        no_repeat_ngram_size=None,
        num_return_sequences=None,
        attention_mask=None,
        decoder_start_token_id=None,
        use_cache=None,
    ):
```


```
Parameters:
        input_ids:
            The sequence used as a prompt for the generation.
        max_length:
            The maximum length of the sequence to be generated.
        min_length:
            The minimum length of the sequence to be generated.
        do_sample:
            Whether or not to use sampling ; use greedy decoding otherwise.
        early_stopping:
            Whether to stop the beam search when at least ``num_beams`` sentences are finished per batch or not.
        num_beams:
            Number of beams for beam search. 1 means no beam search.
        temperature:
            The value used to modulate the next token probabilities.
        top_k:
            The number of highest probability vocabulary tokens to keep for top-k-filtering.
        top_p:
            If set to float < 1, only the most probable tokens with probabilities that add up to ``top_p`` or
            higher are kept for generation.
        repetition_penalty:
            The parameter for repetition penalty.
        pad_token_id:
            The id of the `padding` token.
        bos_token_id:
            The id of the `beginning-of-sequence` token.
        eos_token_id:
            The id of the `end-of-sequence` token.
        length_penalty:
            Exponential penalty to the length. 1.0 means no penalty.    
            Set to values < 1.0 in order to encourage the model to generate shorter sequences, to a value > 1.0 in
            order to encourage the model to produce longer sequences.
        no_repeat_ngram_size:
            If set to int > 0, all ngrams of that size can only occur once.
        bad_words_ids:
            List of token ids that are not allowed to be generated.
        num_return_sequences:
            The number of independently computed returned sequences for each element in the batch.
        attention_mask:
            Mask to avoid performing attention on padding token indices. Mask values are in ``[0, 1]``, 1 for
            tokens that are not masked, and 0 for masked tokens.
        decoder_start_token_id:
            If an encoder-decoder model starts decoding with a different token than `bos`, the id of that token.
        use_cache:
            Whether or not the model should use the past last key/values attentions (if applicable to the model) to
            speed up decoding.
        model_specific_kwargs:
            Additional model specific kwargs will be forwarded to the :obj:`forward` function of the model.
```



### <font color="green">Additional Notes


#### <font color="orange">Understand logits

* Logits here means the prediction scores of the language modeling head, namely scores for each vocabulary token before the SoftMax layer.
* The shape of logits is 
 ```(batch_size, sequence_length, config.vocab_size) ```

In [None]:
inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")
outputs = model(inputs)
logits = outputs.logits
probs = tf.nn.softmax(logits, axis=-1)
print(logits)
print("==============================")
print(probs)

---

## <font color="darkred">References

### <font color="green">Research Papers
* [Attention is all you need (2017)](https://arxiv.org/abs/1706.03762)
* [Summary of the models](https://huggingface.co/transformers/model_summary.html)
* [GPT-2 (2019)](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
* [Top-K Sampling](https://arxiv.org/pdf/1805.04833.pdf)
* [Top-p Sampling](https://arxiv.org/abs/1904.09751)

### <font color="green">Code

- [transformers.generation_tf_utils](https://huggingface.co/transformers/_modules/transformers/generation_tf_utils.html)

### <font color="green">Articles

- [How to generate text: using different decoding methods for language generation with Transformers](https://huggingface.co/blog/how-to-generate)