<a href="https://colab.research.google.com/github/royam0820/DL/blob/master/pytorch_transformers_GPT2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyTorch-Transformers
Ref.:  https://www.analyticsvidhya.com/blog/2019/07/pytorch-transformers-nlp-python/

PyTorch-Transformers is a library of state-of-the-art pre-trained models for Natural Language Processing (NLP).

This library currently contains PyTorch implementations, pre-trained model weights, usage scripts and conversion utilities for the following models:

- BERT (from Google) released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- GPT (from OpenAI) released with the paper Improving Language Understanding by Generative Pre-Training
- GPT-2 (from OpenAI) released with the paper Language Models are Unsupervised Multitask Learners
- Transformer-XL (from Google/CMU) released with the paper Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
- XLNet (from Google/CMU) released with the paper XLNet: Generalized Autoregressive Pretraining for Language Understanding
- XLM (from Facebook) released together with the paper Cross-lingual Language Model Pretraining

All of these models are among the best in class for various NLP tasks, and some were released as recently as last month!

Most State-of-the-Art models require huge amounts of training data and days of training on expensive GPU hardware, which only big technology companies and research labs can afford. With the launch of PyTorch-Transformers, anyone can now use the power of these State-of-the-Art models!

In [1]:
# Install the pytorch-transformers package
!pip install pytorch-transformers

Collecting pytorch-transformers
  Downloading https://files.pythonhosted.org/packages/40/b5/2d78e74001af0152ee61d5ad4e290aec9a1e43925b21df2dc74ec100f1ab/pytorch_transformers-1.0.0-py3-none-any.whl (137kB)

## Predicting the next word in GPT2
The code below is straightforward: we tokenize the text, map the tokens to a sequence of integer ids, and pass that sequence to `GPT2LMHeadModel`. This is the GPT-2 transformer with a language modeling head on top (a linear layer whose weights are tied to the input embeddings).
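To make the remark about tied weights more concrete, here is a minimal sketch in plain Python. The 4-token vocabulary, the 3-dimensional vectors and the helper name `lm_head_logits` are made up for illustration; the real model works on PyTorch tensors with a much larger embedding matrix, so treat this as an intuition aid rather than the library's implementation.

```python
# Toy illustration of a language-modeling head with tied weights.
# All numbers and names here are hypothetical, chosen only for readability.

# Input embedding matrix: one row (vector) per vocabulary entry.
embedding = [
    [1.0, 0.0, 0.0],   # token id 0
    [0.0, 1.0, 0.0],   # token id 1
    [0.0, 0.0, 1.0],   # token id 2
    [1.0, 1.0, 0.0],   # token id 3
]

def lm_head_logits(hidden, embedding_matrix):
    """Score every vocabulary entry by its dot product with the hidden state.

    The same matrix serves as the input lookup table and as the output
    projection, so the head adds no extra parameters ("tied" weights).
    """
    return [sum(h * w for h, w in zip(hidden, row)) for row in embedding_matrix]

# Pretend this is the transformer's final hidden state for the last position.
hidden_state = [0.2, 0.9, 0.1]
logits = lm_head_logits(hidden_state, embedding)

# The id with the highest score is the greedy next-token prediction.
next_token_id = max(range(len(logits)), key=logits.__getitem__)
print(logits, next_token_id)
```

The point of the design is that a token whose embedding is close to the final hidden state gets a high score, which is why one matrix can serve both roles.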

In [0]:
# Import required libraries
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

In [3]:
# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

100%|██████████| 1042301/1042301 [00:01<00:00, 946275.44B/s]
100%|██████████| 456318/456318 [00:00<00:00, 616929.18B/s]


In [0]:
# Encode the input text (English)
text = "What is the fastest car in the"
indexed_tokens = tokenizer.encode(text)


In [0]:
# Convert the indexed tokens into a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])

In [0]:
# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')


In [31]:
# Set the model to evaluation mode to deactivate the Dropout modules
model.eval()


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1)
    (h): ModuleList(
      (0): Block(
        (ln_1): BertLayerNorm()
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1)
          (resid_dropout): Dropout(p=0.1)
        )
        (ln_2): BertLayerNorm()
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1)
        )
      )
      (1): Block(
        (ln_1): BertLayerNorm()
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1)
          (resid_dropout): Dropout(p=0.1)
        )
        (ln_2): BertLayerNorm()
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1)
        )
      )
      (2): Block(
        (ln_1): BertLayerNorm()
        (att

In [32]:
# Use a GPU if one is available, otherwise stay on the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokens_tensor = tokens_tensor.to(device)
model.to(device)



In [0]:
# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]


In [0]:
# Get the predicted next sub-word
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])


In [35]:
# Print the predicted word
print(predicted_text)

What is the fastest car in the world


NOTE:  Awesome! The model predicts “world” as the next word, which is the same completion Google suggests for this query. I recommend trying the model with different input sentences to see how well it predicts the next word.
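The cell above keeps only the single best token via `argmax`. It is often more informative to look at the few most likely candidates and their probabilities. The sketch below shows the idea in plain Python on a made-up four-token vocabulary with made-up scores (`toy_vocab` and `toy_logits` are assumptions, not model output); with the real model, applying `torch.topk` to `predictions[0, -1, :]` would play the same role.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, vocab, k=3):
    """Return the k most probable (token, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical vocabulary and last-position scores, for illustration only.
toy_vocab = [" world", " city", " race", " universe"]
toy_logits = [4.0, 1.5, 2.5, 0.5]

candidates = top_k(toy_logits, toy_vocab, k=3)
for token, prob in candidates:
    print(f"{token!r}: {prob:.3f}")
```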

# Natural Language Generation using GPT-2, Transformer-XL and XLNet

Let’s take Text Generation to the next level now. Instead of predicting only the next word, we will generate a paragraph of text based on the given input. Let’s see what output our models give for the following input text:



```
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

```
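Conceptually, generating a paragraph just repeats next-token prediction in a loop: pick a token, append it to the context, and predict again. The following is a minimal sketch of that loop with temperature and top-k sampling. It is not the code of any particular script; the `toy_next_logits` function stands in for a real language model, and every name in it is hypothetical.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=0, rng=random):
    """Sample a token id from logits after temperature scaling and optional top-k filtering."""
    scaled = [x / temperature for x in logits]
    ids = list(range(len(scaled)))
    if top_k > 0:
        # Keep only the k highest-scoring token ids.
        ids = sorted(ids, key=lambda i: scaled[i], reverse=True)[:top_k]
    peak = max(scaled[i] for i in ids)
    weights = [math.exp(scaled[i] - peak) for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]

def generate(prompt_ids, next_logits, length, **sampling):
    """Append `length` sampled tokens, feeding each new token back in as context."""
    ids = list(prompt_ids)
    for _ in range(length):
        ids.append(sample_next(next_logits(ids), **sampling))
    return ids

# Hypothetical stand-in for a language model: it prefers (last token + 1) mod 5.
def toy_next_logits(ids, vocab_size=5):
    favourite = (ids[-1] + 1) % vocab_size
    return [3.0 if i == favourite else 0.0 for i in range(vocab_size)]

sampled = generate([0], toy_next_logits, length=8,
                   temperature=0.7, top_k=2, rng=random.Random(0))
greedy = generate([0], toy_next_logits, length=8, top_k=1)
print(sampled)
print(greedy)
```

With `top_k=1` the loop is fully greedy and deterministic. Larger `top_k` values and higher temperatures trade predictability for variety.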

We will be using the readymade script that PyTorch-Transformers provides for this task. Let’s clone their repository first:






In [2]:
!git clone https://github.com/huggingface/pytorch-transformers.git

Cloning into 'pytorch-transformers'...
remote: Enumerating objects: 48, done.

## GPT2
Now a single command is enough to run the model:

In [0]:
!python pytorch-transformers/examples/run_generation.py \
    --model_type=gpt2 \
    --length=100 \
    --model_name_or_path=gpt2

07/27/2019 02:47:19 - INFO - pytorch_transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json from cache at /root/.cache/torch/pytorch_transformers/f2808208f9bec2320371a9f5f891c184ae0b674ef866b79c58177067d15732dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
07/27/2019 02:47:19 - INFO - pytorch_transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt from cache at /root/.cache/torch/pytorch_transformers/d629f792e430b3c76a1291bb2766b0a047e36fae0588f9dbc1ae51decdff691b.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
07/27/2019 02:47:20 - INFO - pytorch_transformers.modeling_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json from cache at /root/.cache/torch/pytorch_transformers/4be02c5697d91738003fb1685c9872f284166aa32e061576bbe6aaeb95649fcf.085d5f6a8e7812ea05ff0e6ed0645ab2e75d8038

NOTE:  here is the result:


```
The unicorns had seemed to know each other almost as well as they did common humans. The study was published in Science Translational Medicine on May 6. What's more, researchers found that five percent of the unicorns recognized each other well. The study team thinks this might translate into a future where humans would be able to communicate more clearly with those known as super Unicorns. And if we're going to move ahead with that future, we've got to do it at least a
```



Awesome! The generated text is coherent and could be mistaken for a real news article.



## XLNet
XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin. XLNet achieves state-of-the-art results on 18 tasks including question answering, natural language inference, sentiment analysis, and document ranking.

You can use this text to test it:

```
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
```

Run the same script, this time with the XLNet model:

In [3]:
!python pytorch-transformers/examples/run_generation.py \
    --model_type=xlnet \
    --length=50 \
    --model_name_or_path=xlnet-base-cased

07/27/2019 03:14:58 - INFO - pytorch_transformers.file_utils -   https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-base-cased-spiece.model not found in cache, downloading to /tmp/tmptdmaw3k8
100% 798011/798011 [00:00<00:00, 1935899.61B/s]
07/27/2019 03:14:59 - INFO - pytorch_transformers.file_utils -   copying /tmp/tmptdmaw3k8 to cache at /root/.cache/torch/pytorch_transformers/dad589d582573df0293448af5109cb6981ca77239ed314e15ca63b7b8a318ddd.8b10bd978b5d01c21303cc761fc9ecd464419b3bf921864a355ba807cfbfafa8
07/27/2019 03:14:59 - INFO - pytorch_transformers.file_utils -   creating metadata file for /root/.cache/torch/pytorch_transformers/dad589d582573df0293448af5109cb6981ca77239ed314e15ca63b7b8a318ddd.8b10bd978b5d01c21303cc761fc9ecd464419b3bf921864a355ba807cfbfafa8
07/27/2019 03:14:59 - INFO - pytorch_transformers.file_utils -   removing temp file /tmp/tmptdmaw3k8
07/27/2019 03:14:59 - INFO - pytorch_transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.

NOTE:  here is the result:



```
Even more surprising to the scientists was the fact that "U" was "He" in an actual human language. The "U" as "He" was translated into a Japanese language, which was a very difficult process for the
```




Interesting. While the GPT-2 model focused directly on the scientific angle of the news about unicorns, XLNet built up the context and introduced the topic more gradually. Let’s see how Transformer-XL performs!



## Transformer-XL
Standard Transformer language models are limited to a fixed-length context, so they cannot learn dependencies that extend beyond it. Researchers from Google and CMU therefore proposed Transformer-XL (meaning extra long), a language-modeling architecture that lets a Transformer learn longer-term dependencies.

**During evaluation, Transformer-XL is up to 1,800 times faster than a vanilla Transformer.**
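One way to picture how a model can look past a fixed-length window is segment-level recurrence: the input is processed segment by segment, and a cache ("memory") from earlier segments is made available to the current one. The sketch below only illustrates that bookkeeping on a list of characters. It is a toy under stated assumptions: the real model caches hidden states rather than raw tokens, and the function name and parameters here are hypothetical.

```python
def segments_with_memory(tokens, segment_len, mem_len):
    """Yield (memory, segment) pairs the way a segment-recurrent model would see them.

    `memory` holds the most recent `mem_len` items cached from earlier segments.
    In a real model the cache would store hidden states, not raw tokens.
    """
    memory = []
    for start in range(0, len(tokens), segment_len):
        segment = tokens[start:start + segment_len]
        yield list(memory), segment
        # Cache this segment for the next step, keeping only the newest mem_len items.
        memory = (memory + segment)[-mem_len:] if mem_len > 0 else []

tokens = list("abcdefghij")
steps = list(segments_with_memory(tokens, segment_len=4, mem_len=3))
for memory, segment in steps:
    print("memory:", memory, "| segment:", segment)
```

Each segment can attend to its own tokens plus the cached memory, so the effective context is longer than `segment_len` without reprocessing earlier text.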

In [4]:
!python pytorch-transformers/examples/run_generation.py \
    --model_type=transfo-xl \
    --length=100 \
    --model_name_or_path=transfo-xl-wt103

07/27/2019 03:22:55 - INFO - pytorch_transformers.file_utils -   https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-vocab.bin not found in cache, downloading to /tmp/tmpoxc523na
100% 9143613/9143613 [00:00<00:00, 10863199.69B/s]
07/27/2019 03:22:56 - INFO - pytorch_transformers.file_utils -   copying /tmp/tmpoxc523na to cache at /root/.cache/torch/pytorch_transformers/b24cb708726fd43cbf1a382da9ed3908263e4fb8a156f9e0a4f45b7540c69caa.a6a9c41b856e5c31c9f125dd6a7ed4b833fbcefda148b627871d4171b25cffd1
07/27/2019 03:22:56 - INFO - pytorch_transformers.file_utils -   creating metadata file for /root/.cache/torch/pytorch_transformers/b24cb708726fd43cbf1a382da9ed3908263e4fb8a156f9e0a4f45b7540c69caa.a6a9c41b856e5c31c9f125dd6a7ed4b833fbcefda148b627871d4171b25cffd1
07/27/2019 03:22:56 - INFO - pytorch_transformers.file_utils -   removing temp file /tmp/tmpoxc523na
07/27/2019 03:22:56 - INFO - pytorch_transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.

NOTE:  here is the result:



```
language ; both never spoke in their native language ( a natural language ). If they are speaking in their native language they will have no communication with the original speakers. The encounter with a dingo brought between two and four unicorns to a head at once, thus crossing the border into Peru to avoid internecine warfare, as they did with the Aztecs. On September 11, 1930, three armed robbers killed a donkey for helping their fellow soldiers fight alongside a group of Argentines. During the same year, a pygmy @-@
```



Now, this is awesome. It is interesting to see how different models pick up on different aspects of the input text when continuing it. This variation has many causes, but it can mostly be attributed to differences in training data and model architecture.

But there’s a caveat. Neural text generation has faced some backlash recently, as people worry that it can make problems such as fake news worse. But think about the positive side of it! It has many useful applications, such as helping writers and other creatives come up with new ideas.

# Training a Masked Language Model for BERT

The BERT framework, a new language representation model from Google AI, uses pre-training and fine-tuning to create state-of-the-art NLP models for a wide range of tasks. These tasks include question answering systems, sentiment analysis, and language inference.

BERT is pre-trained using the following two unsupervised prediction tasks:

- Masked Language Modeling (MLM)
- Next Sentence Prediction

You can implement both of these using PyTorch-Transformers. In fact, you can build your own BERT model from scratch or fine-tune a pre-trained version. So, let’s see how we can implement the Masked Language Model for BERT.

## Problem Definition
Let’s formally define our problem statement:

```
Given an input sequence, we will randomly mask some words. The model then should predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence.
```
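The random masking step in this problem statement can be sketched in a few lines. The version below follows the recipe described in the BERT paper (select about 15% of tokens; replace 80% of those with `[MASK]`, 10% with a random token, and leave 10% unchanged). The function name, the tiny `toy_vocab` and the raised `mask_prob=0.5` in the usage lines are assumptions made so that the effect is visible on a short sentence.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=random):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token where a prediction is required, else None.
    Special tokens are never selected.
    """
    special = {"[CLS]", "[SEP]"}
    corrupted, labels = [], []
    for token in tokens:
        if token in special or rng.random() >= mask_prob:
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")           # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(token)              # 10%: keep the original token
    return corrupted, labels

tokens = ["[CLS]", "who", "was", "jim", "henson", "?", "[SEP]"]
toy_vocab = ["puppet", "car", "world"]  # hypothetical replacement vocabulary
corrupted, labels = mask_tokens(tokens, toy_vocab, mask_prob=0.5, rng=random.Random(1))
print(corrupted)
print(labels)
```

The model is then trained to recover `labels[i]` at every position where it is not `None`, using the surrounding tokens as context.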

First, let’s prepare a tokenized input from a text string using `BertTokenizer`:


In [5]:
import torch
from pytorch_transformers import BertTokenizer, BertModel, BertForMaskedLM

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')



100%|██████████| 231508/231508 [00:00<00:00, 935757.46B/s]


In [0]:
# Tokenize input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)

In [7]:
tokenized_text

['[CLS]',
 'who',
 'was',
 'jim',
 'henson',
 '?',
 '[SEP]',
 'jim',
 'henson',
 'was',
 'a',
 'puppet',
 '##eer',
 '[SEP]']
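Note how “puppeteer” was split into `puppet` and `##eer`. WordPiece-style tokenizers break words that are not in the vocabulary into known sub-word pieces, usually by greedy longest-match-first search. Here is a minimal sketch of that idea; the `toy_vocab` set and the `wordpiece` helper are hypothetical and far simpler than the library's tokenizer.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into sub-word pieces by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical miniature vocabulary, for illustration only.
toy_vocab = {"jim", "henson", "puppet", "##eer", "was", "a"}
print(wordpiece("puppeteer", toy_vocab))
print(wordpiece("henson", toy_vocab))
```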

## Next Step
The next step is to mask one token, convert the tokens into a sequence of integers, and wrap those in PyTorch tensors so that we can use them directly for computation:

In [0]:
# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

NOTE:  **Notice that we have set [MASK] at index 8 (counting from zero), which replaces the second occurrence of the word ‘henson’. This is what our model will try to predict.**
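The segment ids in the cell above were typed out by hand. For other inputs it may be easier to derive them from the positions of the `[SEP]` tokens. This is a small sketch of one way to do that; the helper name `segment_ids_from_tokens` is hypothetical and not part of the library.

```python
def segment_ids_from_tokens(tokens, sep_token="[SEP]"):
    """Number each token by how many [SEP] tokens precede it.

    For a sentence pair this gives 0 for sentence A (including its [SEP])
    and 1 for sentence B.
    """
    ids, current = [], 0
    for token in tokens:
        ids.append(current)
        if token == sep_token:
            current += 1
    return ids

tokens = ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]',
          'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
segment_ids = segment_ids_from_tokens(tokens)
print(segment_ids)
```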

Now that our data is correctly preprocessed for BERT, let’s use `BertForMaskedLM` to predict the masked token:

In [10]:
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# Use a GPU if one is available, otherwise stay on the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokens_tensor = tokens_tensor.to(device)
segments_tensors = segments_tensors.to(device)
model.to(device)

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    predictions = outputs[0]

# confirm we were able to predict 'henson'
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'henson'
print('Predicted token is:', predicted_token)

100%|██████████| 313/313 [00:00<00:00, 74224.98B/s]
100%|██████████| 440473133/440473133 [00:16<00:00, 26547031.57B/s]


Predicted token is: henson


NOTE:  That’s quite impressive.

This was a small demo of masked-token prediction on a single input sequence, using a pre-trained model rather than training one. Nevertheless, masked language modeling is a very important part of the training process for many Transformer-based architectures, because it lets a model use context from both the left and the right of each token, which a standard left-to-right language model cannot do.

Congratulations! You’ve just run your first Masked Language Model! Masked language modeling is one of the two tasks BERT is pre-trained on, so this example should give you a good idea of how to use PyTorch-Transformers to work with the BERT model.




# Summary

We have implemented and explored various State-of-the-Art NLP models like BERT, GPT-2, Transformer-XL, and XLNet using PyTorch-Transformers. This was more of a first-impressions experiment, meant to give you a good intuition for how to work with this amazing library.

Here are 6 compelling reasons why I think you would love this library:

- **Pre-trained models**: It provides pre-trained models for 6 State-of-the-Art NLP architectures and pre-trained weights for 27 variations of these models.
- **Preprocessing and Finetuning API**: PyTorch-Transformers doesn’t stop at pre-trained weights. It also provides a simple API for all the preprocessing and finetuning steps these models require. If you have read recent research papers, you’ll know that many State-of-the-Art models preprocess their data in their own way, and writing the code for an entire preprocessing pipeline is often a hassle.
- **Usage scripts**: It also comes with scripts to run these models against benchmark NLP datasets like SQuAD 2.0 (Stanford Question Answering Dataset) and GLUE (General Language Understanding Evaluation). With PyTorch-Transformers, you can run your model directly against these datasets and evaluate its performance.
- **Multilingual**: PyTorch-Transformers has multilingual support, because some of the models already work well for multiple languages.
- **TensorFlow Compatibility**: You can import TensorFlow checkpoints as models in PyTorch.
- **BERTology**: There is a growing field of study concerned with investigating the inner workings of large-scale transformers like BERT (which some call “BERTology”).

