<a href="https://colab.research.google.com/github/sujitpal/nlp-deeplearning-ai-examples/blob/master/lf1_longformer_pretrained.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Longformer Tests (Pretrained)

This notebook covers two use cases for a pre-trained [Longformer](https://arxiv.org/abs/2004.05150) model. The first uses the pre-trained Longformer language model to generate document embeddings. The second uses a Longformer model [fine-tuned on the SQuAD v1 dataset](https://huggingface.co/valhalla/longformer-base-4096-finetuned-squadv1) for question answering.

In [1]:
!pip install transformers



### Imports

In [2]:
import numpy as np
import tensorflow as tf
import torch

from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
# Only the classes used in this notebook are imported.
from transformers import (
    LongformerModel, LongformerTokenizer, TFLongformerForQuestionAnswering)

### Data

The texts below are the first paragraphs of the Wikipedia pages for Albert Einstein, Richard Feynman, and William Shakespeare.

In [3]:
einstein_text = """Albert Einstein; 14 March 1879 – 18 April 1955) was a German-born theoretical physicist[5] who developed the theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics).[3][6]:274 His work is also known for its influence on the philosophy of science.[7][8] He is best known to the general public for his mass–energy equivalence formula E = mc2, which has been dubbed "the world's most famous equation".[9] He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect",[10] a pivotal step in the development of quantum theory."""
feynman_text = """Richard Phillips Feynman ForMemRS (/ˈfaɪnmən/; May 11, 1918 – February 15, 1988) was an American theoretical physicist, known for his work in the path integral formulation of quantum mechanics, the theory of quantum electrodynamics, the physics of the superfluidity of supercooled liquid helium, as well as his work in particle physics for which he proposed the parton model. For contributions to the development of quantum electrodynamics, Feynman received the Nobel Prize in Physics in 1965 jointly with Julian Schwinger and Shin'ichirō Tomonaga."""
shakespeare_text = """William Shakespeare (bapt. 26 April 1564 – 23 April 1616)[a] was an English playwright, poet, and actor, widely regarded as the greatest writer in the English language and the world's greatest dramatist.[2][3][4] He is often called England's national poet and the "Bard of Avon" (or simply "the Bard").[5][b] His extant works, including collaborations, consist of some 39 plays,[c] 154 sonnets, two long narrative poems, and a few other verses, some of uncertain authorship. His plays have been translated into every major living language and are performed more often than those of any other playwright.[7] They also continue to be studied and reinterpreted."""

len(einstein_text), len(feynman_text), len(shakespeare_text)

(655, 548, 658)

### Instantiate Pretrained Longformer model and tokenizer

In [4]:
model = LongformerModel.from_pretrained('allenai/longformer-base-4096', return_dict=True)
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

### Document Embeddings

We tried two embedding strategies:

* Using `pooler_output` as the document embedding. According to the [Longformer docs](https://huggingface.co/transformers/model_doc/longformer.html), this is the hidden state of the first (classification) token passed through a linear layer and a tanh activation.
* Taking `last_hidden_state` for the document and averaging it over all tokens in the sequence.
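
Both strategies reduce the per-token hidden states to a single vector per document. The following is a minimal NumPy sketch of the two reductions, not the model's actual code: a random array stands in for `last_hidden_state`, and the hypothetical `dense_w` and `dense_b` stand in for the pooler's learned linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for outputs.last_hidden_state: (batch, seq_len, hidden_size).
# The real base model uses hidden_size 768; 8 keeps this example small.
batch, seq_len, hidden_size = 1, 5, 8
last_hidden_state = rng.normal(size=(batch, seq_len, hidden_size))

# Hypothetical pooler weights (these are learned in the real model).
dense_w = rng.normal(size=(hidden_size, hidden_size))
dense_b = np.zeros(hidden_size)

# Strategy 1: first-token hidden state -> linear layer -> tanh.
first_token = last_hidden_state[:, 0, :]
pooled_embedding = np.tanh(first_token @ dense_w + dense_b)

# Strategy 2: average the hidden states over the sequence dimension.
mean_embedding = last_hidden_state.mean(axis=1)

print(pooled_embedding.shape, mean_embedding.shape)
```

Either way the result has shape `(batch, hidden_size)`, which is why stacking three single-document embeddings gives a `(3, 1, hidden_size)` array that needs `squeeze(axis=1)` before computing pairwise similarities.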

The idea is that the cosine similarity between the embeddings for Einstein and Feynman (both physicists) should be higher than the cosine similarity between the embeddings for Einstein and Shakespeare.
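
Cosine similarity compares the direction of two vectors and ignores their length. A minimal pure-Python sketch of the quantity that `cosine_similarity` computes for each pair of rows follows; the toy vectors are illustrative only and are not real embeddings.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, not real embeddings.
physicist_a = [1.0, 2.0, 0.0]
physicist_b = [2.0, 4.0, 0.0]   # same direction as physicist_a, twice as long
playwright = [0.0, 1.0, 3.0]

# The first pair points the same way, so its similarity is (close to) 1.
print(cosine(physicist_a, physicist_b))
print(cosine(physicist_a, playwright))
```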

In [5]:
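def get_cls_embedding(text, tokenizer, model, max_length=4096):
  # Tokenize, keep at most max_length token ids, and add a batch dimension.
  input_ids = torch.tensor(tokenizer.encode(text)[0:max_length]).unsqueeze(0)
  # Single unpadded sequence, so attend to every position.
  attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)

  # Inference only, so skip gradient tracking.
  with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
  # pooler_output has shape (1, hidden_size).
  text_embedding = outputs.pooler_output

  return text_embedding.numpy()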

einstein_embedding = get_cls_embedding(einstein_text, tokenizer, model)
feynman_embedding = get_cls_embedding(feynman_text, tokenizer, model)
shakespeare_embedding = get_cls_embedding(shakespeare_text, tokenizer, model)

X = np.array([einstein_embedding, feynman_embedding, shakespeare_embedding]).squeeze(axis=1)
print("cosine similarity:")
print(cosine_similarity(X))
print("euclidean distance:")
print(euclidean_distances(X))

cosine similarity:
[[1.         0.99713266 0.9979736 ]
 [0.99713266 1.0000001  0.9953785 ]
 [0.9979736  0.9953785  0.9999999 ]]
euclidean distance:
[[0.        0.5178683 0.4370549]
 [0.5178683 0.        0.6623964]
 [0.4370549 0.6623964 0.       ]]


In [6]:
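def get_mean_embedding(text, tokenizer, model, max_length=4096):
  # Tokenize, keep at most max_length token ids, and add a batch dimension.
  input_ids = torch.tensor(tokenizer.encode(text)[0:max_length]).unsqueeze(0)
  # Single unpadded sequence, so attend to every position.
  attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)

  # Inference only, so skip gradient tracking.
  with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
  # Average over the sequence dimension: (1, seq_len, hidden) -> (1, hidden).
  text_embedding = torch.mean(outputs.last_hidden_state, dim=1)

  return text_embedding.numpy()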

einstein_embedding = get_mean_embedding(einstein_text, tokenizer, model)
feynman_embedding = get_mean_embedding(feynman_text, tokenizer, model)
shakespeare_embedding = get_mean_embedding(shakespeare_text, tokenizer, model)

X = np.array([einstein_embedding, feynman_embedding, shakespeare_embedding]).squeeze(axis=1)
print("cosine similarity:")
print(cosine_similarity(X))
print("euclidean distance:")
print(euclidean_distances(X))

cosine similarity:
[[0.9999999 0.990968  0.987079 ]
 [0.990968  1.0000002 0.981339 ]
 [0.987079  0.981339  1.0000004]]
euclidean distance:
[[0.        1.9428359 2.2024715]
 [1.9428359 0.        2.659282 ]
 [2.2024715 2.659282  0.       ]]
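
To compare the two strategies against the hypothesis, the reported cosine similarities can be checked directly. The sketch below copies the matrices from the outputs above; the helper `hypothesis_holds` is a hypothetical name introduced for illustration.

```python
# Cosine-similarity matrices copied from the notebook outputs.
# Row/column order: Einstein, Feynman, Shakespeare.
pooled_sim = [
    [1.0, 0.99713266, 0.9979736],
    [0.99713266, 1.0000001, 0.9953785],
    [0.9979736, 0.9953785, 0.9999999],
]
mean_sim = [
    [0.9999999, 0.990968, 0.987079],
    [0.990968, 1.0000002, 0.981339],
    [0.987079, 0.981339, 1.0000004],
]

def hypothesis_holds(sim):
    """True if Einstein is more similar to Feynman than to Shakespeare."""
    einstein_feynman = sim[0][1]
    einstein_shakespeare = sim[0][2]
    return einstein_feynman > einstein_shakespeare

print("pooler_output:", hypothesis_holds(pooled_sim))  # False
print("mean pooling: ", hypothesis_holds(mean_sim))    # True
```

In this run only the mean-pooled embeddings rank the pairs as hypothesized, and the margins are small in both cases: every pairwise similarity exceeds 0.98.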


### Trivia Question Answering

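This section uses a Longformer model fine-tuned on the [SQuAD v1 dataset](https://rajpurkar.github.io/SQuAD-explorer/) for extractive question answering: given a question and a passage, it returns the span of the passage that answers the question.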
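
An extractive question-answering head produces one start score and one end score per token, and the answer is the token span between the chosen start and end positions. The sketch below, with made-up scores, picks the highest-scoring span subject to `start <= end`. The helper `best_span` and its `max_answer_len` parameter are hypothetical and not part of the transformers API.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end), end inclusive, maximizing start score + end score."""
    best, best_score = (0, 0), float("-inf")
    for start, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length cap.
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best

# Made-up scores for a 6-token input.
start_scores = [0.1, 0.2, 3.0, 0.3, 0.1, 0.0]
end_scores = [2.5, 0.1, 0.2, 0.4, 2.0, 0.1]

print(best_span(start_scores, end_scores))  # (2, 4)
```

Taking the two argmaxes independently would select start 2 and end 0 for these scores, which yields an empty slice; the constraint avoids that case.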

In [7]:
tokenizer = LongformerTokenizer.from_pretrained("valhalla/longformer-base-4096-finetuned-squadv1")
model = TFLongformerForQuestionAnswering.from_pretrained("valhalla/longformer-base-4096-finetuned-squadv1")

All model checkpoint weights were used when initializing TFLongformerForQuestionAnswering.

All the weights of TFLongformerForQuestionAnswering were initialized from the model checkpoint at valhalla/longformer-base-4096-finetuned-squadv1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFLongformerForQuestionAnswering for predictions without further training.


In [8]:
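def answer_question(question, passage, model, tokenizer):
  # Encode question and passage together as a single input pair.
  input_dict = tokenizer(question, passage, return_tensors='tf')
  # Request a ModelOutput so the logits can be read by name.
  outputs = model(input_dict, return_dict=True)
  start_scores, end_scores = outputs.start_logits, outputs.end_logits
  all_tokens = tokenizer.convert_ids_to_tokens(input_dict["input_ids"].numpy()[0])
  # Most likely start and end token positions (end is inclusive).
  start = int(tf.math.argmax(start_scores, 1)[0])
  end = int(tf.math.argmax(end_scores, 1)[0])
  answer = ' '.join(all_tokens[start : end + 1])
  return answer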

answer = answer_question("Who was Albert Einstein?", einstein_text, model, tokenizer)
print("answer:", answer)
answer = answer_question("When was William Shakespeare born?", shakespeare_text, model, tokenizer)
print("answer:", answer)


answer: Ġa ĠGerman - born Ġtheoretical Ġphysicist
answer: Ġ26 ĠApril Ġ15 64
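
The `Ġ` characters in the output are how the byte-level BPE vocabulary marks a token that begins with a space. Below is a minimal sketch of turning such tokens back into text, assuming ASCII-only tokens. The helper `join_bpe_tokens` is hypothetical; with a real tokenizer, `tokenizer.convert_tokens_to_string(tokens)` handles this, including non-ASCII characters.

```python
def join_bpe_tokens(tokens):
    """Join byte-level BPE tokens, mapping the 'Ġ' marker back to a space."""
    return "".join(tokens).replace("Ġ", " ").strip()

print(join_bpe_tokens(["Ġa", "ĠGerman", "-", "born", "Ġtheoretical", "Ġphysicist"]))
# a German-born theoretical physicist
print(join_bpe_tokens(["Ġ26", "ĠApril", "Ġ15", "64"]))
# 26 April 1564
```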
