# Title: XLNet Exposed

#### Member Names: Oscar Bobadilla, Igor Ilic

#### Member Emails: {oscar.bobadilla, iilic} @ ryerson.ca

# Introduction:

#### Problem Description:

Language modelling is a tough task, as we have seen throughout the course. Computers have
a difficult time picking up on many of the nuances that humans have learned. As well, each language has
a different structure, often without consistent logical rules. We need a computer to be able to bring logic to an illogical world.

Language modelling tasks include:
- Text Classification (example benchmark: SST-2)
- Question Answering (example benchmark: SQuAD v2.0)
- Document Ranking
- Reading Comprehension
- etc.

***
#### Context of the Problem:

##### BERT [4]
The biggest improvement in recent history is BERT (Bidirectional Encoder Representations from Transformers). BERT is able to capture bidirectional context using transformer networks [10] and masking. This is a large improvement over earlier methods such as ELMo [1], which combines separately trained left-to-right and right-to-left LSTMs [11] rather than conditioning on both directions jointly.

##### Transformer-XL [5]
Another important network is Transformer-XL. Transformers are limited by the fact that they have a fixed window when encoding sequences. This means that long sequences need to be broken down into multiple smaller sequences, losing the positional relation between them. Transformer-XL fixes this by adding some state (though not in the way RNNs do), which allows later sequences to draw on earlier ones. Like other left-to-right language models, it is autoregressive in nature.

***
#### Limitation About other Approaches:

##### BERT
Using a [MASK] token is problematic because it never appears during fine-tuning, so pretraining and fine-tuning see different inputs. Simply leaving the selected words visible is not an option either, since predicting an unmasked word is trivial. The authors of BERT tried to soften this mismatch by not always using [MASK]: of the words selected for prediction, most are replaced with [MASK], a portion are replaced with a random word, and a portion are left unchanged. This reduces the [MASK]ing problem but does not remove it.
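
The corruption step can be sketched in a few lines. This is a minimal illustration, not BERT's actual preprocessing code: it assumes whitespace tokens and a tiny made-up vocabulary, and it uses the rates described in the BERT paper [4] (15% of tokens selected; of those, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged). The function name `bert_style_corrupt` is our own.

```python
import random


def bert_style_corrupt(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted_tokens, target_positions) using the 80/10/10 rule."""
    rng = rng or random.Random()
    corrupted = list(tokens)
    targets = []
    for i in range(len(tokens)):
        if rng.random() >= select_prob:
            continue  # token is not selected for prediction
        targets.append(i)
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, targets


# Toy inputs (assumptions for illustration only)
tokens = "the movie was surprisingly good".split()
vocab = ["film", "bad", "great", "plot", "slow"]
corrupted, targets = bert_style_corrupt(tokens, vocab, rng=random.Random(0))
```

Only the positions in `targets` contribute to the training loss; every other position is passed through untouched.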

As well, BERT performs all of its predictions in parallel, independently of one another. In some cases, the predictions depend on each other. Take:  
> I'm [MASK], I should [MASK]. 

This could be filled with "energetic" + "dance" or with "tired" + "sleep", but BERT would also allow "energetic" + "sleep" to be predicted, due to its parallelized nature. The difference in prediction strategy can be seen below [7].

<img src="images/conceptual_difference.png" alt="Conceptual difference between BERT and XLNet prediction strategies" title="BERT vs. XLNet prediction strategies" />
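
To make the independence issue concrete, here is a toy calculation for the two-blank sentence above. The probabilities are made up for illustration: we assume only the two sensible fillings occur, each half of the time, and compare filling the blanks independently with filling the second blank conditioned on the first.

```python
from itertools import product

# Toy "true" distribution over the two blanks; the numbers are made up.
joint = {("energetic", "dance"): 0.5, ("tired", "sleep"): 0.5}

first_words = sorted({a for a, _ in joint})
second_words = sorted({b for _, b in joint})

# Marginal distribution of each blank on its own.
p_first = {a: sum(p for (x, _), p in joint.items() if x == a) for a in first_words}
p_second = {b: sum(p for (_, y), p in joint.items() if y == b) for b in second_words}

# Parallel prediction: both blanks are filled independently of each other.
independent = {
    (a, b): p_first[a] * p_second[b] for a, b in product(first_words, second_words)
}

# Sequential prediction: the second blank is conditioned on the first.
autoregressive = {
    (a, b): p_first[a] * (joint.get((a, b), 0.0) / p_first[a])
    for a, b in product(first_words, second_words)
}

for pair in product(first_words, second_words):
    print(pair, independent[pair], autoregressive[pair])
```

Under independent filling, the mismatched pair ("energetic", "sleep") receives probability 0.25, while the sequential factorization assigns it 0.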
  
***
#### Solution:

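XLNet [3] fixes the [MASK] problem by using permutation language modelling, which allows for bidirectional training without corrupting the input. During pretraining, XLNet samples a random factorization order and predicts the words in that order, so each word is predicted from whichever words come earlier in the sampled order, and those may lie to its left or to its right in the sentence. The sentence itself is not shuffled: the positional encodings keep the original word positions, and over many sampled orders every word learns to use context from both sides.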
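
One way to picture this is as an attention mask derived from a sampled order. The sketch below is our own simplification, not code from the XLNet release: it samples a factorization order with NumPy and builds a boolean matrix in which position `i` may attend to position `j` only if `j` is predicted earlier than `i`. A position cannot see itself here, which roughly matches what the paper calls the query stream.

```python
import numpy as np


def permutation_attention_mask(seq_len, rng):
    """Sample a factorization order and return (order, mask)."""
    order = rng.permutation(seq_len)     # order[k] = position predicted at step k
    rank = np.empty(seq_len, dtype=int)
    rank[order] = np.arange(seq_len)     # rank[pos] = step at which pos is predicted
    # mask[i, j] is True when position i may attend to position j,
    # i.e. when j comes earlier than i in the sampled order.
    mask = rank[None, :] < rank[:, None]
    return order, mask


rng = np.random.default_rng(0)
order, mask = permutation_attention_mask(5, rng)
print("factorization order:", order)
print(mask.astype(int))
```

The first position in the sampled order sees no context at all, and the last one sees every other position, regardless of where they sit in the sentence.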

Then, XLNet incorporates features from Transformer-XL to capture longer-term dependencies. The authors found that incorporating Transformer-XL into BERT alone yielded benefits, but they skipped that intermediate step and incorporated Transformer-XL directly into their own model.

By using Transformer-XL, the model retains state. The hidden states of the previous segment are frozen and cached, then fed in full to the current segment. They can be fed in full because every word of the previous segment has already been seen, so the permutation order used for that segment doesn't matter.
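
The caching idea can be sketched with a toy single-head attention layer. This is a heavily simplified illustration under our own assumptions: there are no learned projections and no relative positional encodings, and the "frozen" memory is simply a copied NumPy array. Queries come from the current segment only, while keys and values come from the cached segment concatenated with the current one.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attend_with_memory(segment, memory):
    """Attention where the current segment also looks at cached states."""
    if memory is None:
        context = segment
    else:
        context = np.concatenate([memory, segment], axis=0)
    scores = segment @ context.T / np.sqrt(segment.shape[-1])
    return softmax(scores) @ context


rng = np.random.default_rng(0)
segments = [rng.normal(size=(4, 8)) for _ in range(3)]  # 3 segments, 4 tokens, dim 8

memory = None
outputs = []
for seg in segments:
    outputs.append(attend_with_memory(seg, memory))
    memory = seg.copy()  # cached as a constant; no gradient would flow into it
```

In a real implementation the cache would hold the previous segment's hidden states for every layer, detached from the computation graph.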

# Background
Results have been pulled directly from the XLNet paper [3].

**Full Comparison With BERT**
Comparison with BERT on all major Language Modelling Benchmarks
<img src="images/bert_comparison.png" alt="BERT Comparison" title="BERT Comparison" />
***

**Reading Comprehension Comparison**
Comparison on reading comprehension tasks.
<img src="images/roberta_comparison.png" alt="Reading Comprehension Comparison" title="Reading Comprehension Comparison" />

**Question Answering [2]**
Comparison on question answering tasks, i.e., can a model answer a question about a document?
<img src="images/qa_comparison.png" alt="QA Comparison" title="QA Comparison" />

**Text Classification**
Ability to group/classify documents. This includes the IMDb dataset we have worked with.
<img src="images/text_classification.png" alt="Text Classification" title="Text Classification" />

**GLUE [6]**
Popular benchmark to measure Language Modelling results. 
<img src="images/glue.png" alt="GLUE Results" title="GLUE Results" />


XLNet is clearly very strong, and is currently the state-of-the-art method. Deep language modelling has come a long way since BERT came out and is changing rapidly, so we expect further improvements soon.

From personal use, we have found this to be a huge deep learning task, requiring powerful computers. It would be great to see a smaller, distilled version of XLNet come out (similar to how DistilBERT [9] came out after BERT). This would allow XLNet to be more widely used.

# Methodology

Though there are many ways to use bidirectional transformers, the one
we explored was classification on the SST-2 dataset. This dataset is commonly used as a benchmark,
so we considered it a good place to start.

SST-2 consists of many different strings, which are classified as positive (1) or negative (0).
Samples have been included below.

Training XLNet from scratch is a very complicated task. Because of this, we instead use a pretrained, lighter version of the model (xlnet-base-cased: 110M params) and extract its embeddings to feed into a classification layer. This differs from the setup behind the results in the XLNet paper.

Typically, the way to use XLNet would be to take the pretrained version (xlnet-large-cased) and then fine-tune it on the data set. This is what yields the high results in the paper. We briefly discuss this at the end.

# Implementation

*Similar to Alammar's (2019) [8] demonstration of using BERT.*

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# XLNet specifics
import torch
from transformers import XLNetModel, XLNetTokenizer

## Dataset info
Using the SST-2 dataset (*Note: the XLNet paper reports 94.4% accuracy using XLNet large*)

In [2]:
df_train = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv',
                 delimiter='\t',
                 names=['sentence','label'])

df_test = pd.read_csv('https://raw.githubusercontent.com/clairett/pytorch-sentiment-classification/master/data/SST2/test.tsv',
                 delimiter='\t',
                 names=['sentence','label'])

split_point = len(df_train)
df = pd.concat([df_train, df_test])

In [3]:
for s, l in df.head().values:
    print(f'Label {l}: {s}')

Label 1: a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films
Label 0: apparently reassembled from the cutting room floor of any given daytime soap
Label 0: they presume their audience wo n't sit still for a sociology lesson , however entertainingly presented , so they trot out the conventional science fiction elements of bug eyed monsters and futuristic women in skimpy clothes
Label 1: this is a visually stunning rumination on love , memory , history and the war between art and commerce
Label 1: jonathan parker 's bartleby should have been the be all end all of the modern office anomie films


In [4]:
df_train['label'].value_counts()

1    3610
0    3310
Name: label, dtype: int64

In [5]:
df_test['label'].value_counts()

0    912
1    909
Name: label, dtype: int64

## Training Model

In [6]:
# See here for all pretrained models: https://huggingface.co/transformers/pretrained_models.html?highlight=pretrained
pretrained_label = 'xlnet-base-cased'

tokenizer = XLNetTokenizer.from_pretrained(pretrained_label)
model = XLNetModel.from_pretrained(pretrained_label)

## Data preparation
### Tokenization

In [7]:
tokenized = df['sentence'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

These added special tokens include a classifier token, `<cls>`, which is the one we are mainly interested in.

In [8]:
print(f'Tokenizer string form: {tokenizer.cls_token}, id: {tokenizer.cls_token_id}')

Tokenizer string form: <cls>, id: 3


In [9]:
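# Record where <cls> sits in each sentence (it is the last token) and find the longest sentence
max_len = 0
cls_positions = np.zeros(len(tokenized), dtype=np.int64)
for posn, ids in enumerate(tokenized.values):
    cls_positions[posn] = len(ids) - 1
    max_len = max(max_len, len(ids))

# Right-pad every sentence with 0 up to the length of the longest one
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized.values])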

A sample padded sentence looks like the following:

In [10]:
padded[0]

array([   24, 16003,    17,    19,  5787,    21,  1381, 21469,    17,
          88,  7693, 15930,    56,    20,  4111,    21,    18, 11740,
          21,  4974,    23,  6941,  2701,     4,     3,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0])

We can see that the `<cls>` token (id 3) is at the end of the sentence, just before the padding.

### Masking
One slight nuance: we need to pass in an attention mask, which tells XLNet which positions hold the sentence and which are padding.

In [12]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(8741, 86)
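
Building the mask from `padded != 0` works as long as no real token has id 0; any real token that happened to share the padding value would be masked out as well. A slightly more defensive variant, sketched below with our own helper name `pad_and_mask`, derives the mask from the sentence lengths instead of from the padded values.

```python
import numpy as np


def pad_and_mask(token_ids, pad_id=0):
    """Right-pad a list of id lists and build the mask from their lengths."""
    lengths = np.array([len(ids) for ids in token_ids])
    max_len = lengths.max()
    padded = np.full((len(token_ids), max_len), pad_id, dtype=np.int64)
    for row, ids in enumerate(token_ids):
        padded[row, :len(ids)] = ids
    # 1 for real tokens, 0 for padding, regardless of the token values
    mask = (np.arange(max_len)[None, :] < lengths[:, None]).astype(np.int64)
    return padded, mask


# Toy ids (made up); the first sentence contains a real token with id 0
padded_demo, mask_demo = pad_and_mask([[24, 0, 17, 4, 3], [88, 4, 3]])
print(padded_demo)
print(mask_demo)
```

Here the real token with id 0 in the first sentence stays visible, while the two padded slots of the second sentence are masked.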

### Embedding
We need to pass the tokenized sentences into XLNet now, and get the embeddings to pass into
another classifier.

In [13]:
inputs = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(inputs, attention_mask=attention_mask)
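
Passing every sentence through the model in a single call needs a lot of memory. If that is a problem, the same embeddings can be computed in chunks. The helper below is a sketch under our own assumptions: `embed_in_batches` and `encode_fn` are hypothetical names, and the demo uses a stand-in encoder so that the snippet runs without downloading the model.

```python
import numpy as np


def embed_in_batches(encode_fn, inputs, attention_mask, batch_size=256):
    """Run encode_fn over slices of the data and stitch the results together."""
    chunks = []
    for start in range(0, len(inputs), batch_size):
        stop = start + batch_size
        chunks.append(encode_fn(inputs[start:stop], attention_mask[start:stop]))
    return np.concatenate(chunks, axis=0)


# Hypothetical wrapper around the XLNet model used in this notebook:
# def encode_fn(ids, mask):
#     with torch.no_grad():
#         out = model(torch.tensor(ids), attention_mask=torch.tensor(mask))
#     return out[0].numpy()

# Stand-in encoder for the demo: returns a (batch, tokens, 4) array
def fake_encode(ids, mask):
    return np.repeat((ids * mask)[:, :, None].astype(float), 4, axis=2)


ids_demo = np.arange(30).reshape(10, 3)
mask_demo = np.ones_like(ids_demo)
hidden_demo = embed_in_batches(fake_encode, ids_demo, mask_demo, batch_size=4)
print(hidden_demo.shape)
```

The batch size only trades memory for the number of forward passes; the resulting array is the same as for a single call.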

To get one vector per sentence, we take the final hidden state at the position of each sentence's classifier token.
Those positions were recorded in `cls_positions` during padding.

In [14]:
# Features
final_hidden_state = last_hidden_states[0]
X = final_hidden_state[np.arange(len(final_hidden_state)), cls_positions]

# Labels
y = df['label']
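
The indexing used for the features pairs each row index with that row's `<cls>` position. A small NumPy example with made-up numbers shows what it selects:

```python
import numpy as np

# Toy hidden states: 2 sentences, 4 token positions, hidden size 3
hidden = np.arange(2 * 4 * 3).reshape(2, 4, 3)
cls_pos = np.array([3, 1])  # position of the last real token in each sentence

# Pairs row 0 with position 3 and row 1 with position 1
cls_vectors = hidden[np.arange(len(hidden)), cls_pos]
print(cls_vectors)
```

The result has one row per sentence and one column per hidden dimension, which is the shape a scikit-learn classifier expects.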

## Classification

With all of the sentences encoded, we can now pass the features into any classifier we like. For simplicity,
we chose the sklearn MLPClassifier with its default values.

In [15]:
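# Hold out as many points as the original test file contains.
# Note: train_test_split shuffles, so this is a random split of the combined data.
num_test_pts = len(df) - split_point
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=num_test_pts)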
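
Because `train_test_split` shuffles the combined data, the held-out set above is a random sample rather than the original test file. If the original files should be kept apart, the data can be cut at `split_point` instead. The helper below is a sketch with a name of our own choosing (`split_at`), shown on toy arrays; the accuracy reported in this notebook comes from the random split, not from this variant.

```python
import numpy as np


def split_at(features, labels, split_point):
    """Cut features and labels at a fixed row, keeping the original order."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    return (features[:split_point], features[split_point:],
            labels[:split_point], labels[split_point:])


# Toy data: 5 rows, the first 3 play the role of the training file
X_demo = np.arange(10).reshape(5, 2)
y_demo = [0, 1, 0, 1, 1]
X_tr, X_te, y_tr, y_te = split_at(X_demo, y_demo, split_point=3)
print(X_tr.shape, X_te.shape)
```

This relies on the training rows coming first in the combined data, which is how the two files were concatenated earlier.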

In [16]:
clf = MLPClassifier()
clf.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

In [17]:
clf.score(X_test, y_test)

0.7874794069192751

This is a lot better than random guessing! It doesn't reach the 94.4% accuracy reported for XLNet in the paper, but this model has:  
- A simple, unoptimized classification layer
- Fewer XLNet weights (the base model rather than the large one), none of them fine-tuned

## Fine-tuning XLNet

Although the heuristic test above is fast and easy to implement, it doesn't match the results
found in the XLNet paper (94.4% accuracy).

By fine-tuning on the SST-2 dataset, we were able to raise the accuracy to 94.0%. This was accomplished
by setting up the transformers repo and running the [examples](https://github.com/huggingface/transformers/blob/master/examples/README.md):

```bash
export GLUE_DIR=/path/to/data
export TASK_NAME=SST-2

python run_glue.py \
  --model_type xlnet \
  --model_name_or_path xlnet-base-cased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 16 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir $GLUE_DIR/tmp/$TASK_NAME/
```

yields:
```bash
$ cat eval_results.txt
acc = 0.9403669724770642
```

This task took significantly longer to run, but the fine-tuned model fit this particular data set very well.

# Conclusion and Future Direction

These results show that even the raw pretrained XLNet weights can be very useful for making predictions on new tasks with limited data. However, as we have seen, fine-tuning XLNet on the data set brings a large further improvement.

We saw accuracy rise from 78.7% to 94.0% by fine-tuning XLNet instead of training a classification layer on frozen embeddings. This task required powerful computers (128GB RAM, 2x GeForce RTX 2070 GPUs, and a 16-core i7 processor), and still took an hour to compute one number.

Since we were able to reproduce the results, in the future it would be great to extend the use of XLNet beyond classic benchmarks (SST-2, SQuAD v2.0, etc.). By using different datasets, it should be possible to push other benchmarks and to bring XLNet into industry production as well.

Finally, it will be important to watch what comes after XLNet. XLNet is still very new, and we expect new models to come out during 2020 that outperform it.

# References:

[1]: Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

[2]: Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

[3]: Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., & Le, Q. V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems (pp. 5754-5764).

[4]: Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[5]: Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.

[6]: Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

[7]: Kurita, Keita (2019). Paper Dissected: “XLNet: Generalized Autoregressive Pretraining for Language Understanding” Explained [Blog post]. Retrieved from https://mlexplained.com/2019/06/30/paper-dissected-xlnet-generalized-autoregressive-pretraining-for-language-understanding-explained/

[8]: Alammar, Jay (2019). A Visual Guide to Using BERT for the First Time [Blog post]. Retrieved from http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

[9]: Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

[10]: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).

[11]: Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.