BART
====

**BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension**

 * Paper: https://arxiv.org/pdf/1910.13461


**BERT vs GPT vs BART** 
![BERT vs GPT vs BART](../assets/bert_gpt_bart.png)

**BART Noising Transformations**
![BART Noising Transformations](../assets/bart_noising.png)
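The paper describes five noising transformations: token masking, token deletion, text infilling, sentence permutation, and document rotation. The sketch below illustrates each one on a plain list of tokens. It is a minimal illustration, not the authors' implementation: the function names, the `<mask>` string, and the explicit `start`/`length`/`pivot` arguments are assumptions made for readability (the paper samples infilling span lengths from a Poisson distribution with λ = 3 rather than taking them as arguments).

```python
import random

MASK = "<mask>"  # placeholder string, assumed for illustration


def token_masking(tokens, p, rng):
    """Replace each token with <mask> independently with probability p."""
    return [MASK if rng.random() < p else t for t in tokens]


def token_deletion(tokens, p, rng):
    """Drop each token with probability p; the model must infer where tokens are missing."""
    return [t for t in tokens if rng.random() >= p]


def text_infilling(tokens, start, length):
    """Replace the span tokens[start:start+length] with a single <mask>.

    A length of 0 inserts a <mask> without removing any token.
    """
    return tokens[:start] + [MASK] + tokens[start + length:]


def sentence_permutation(sentences, rng):
    """Return the sentences in a random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def document_rotation(tokens, pivot):
    """Rotate the document so that it starts at tokens[pivot]."""
    return tokens[pivot:] + tokens[:pivot]


rng = random.Random(0)
tokens = "A B C . D E .".split()

print(token_masking(tokens, 0.3, rng))
print(token_deletion(tokens, 0.3, rng))
print(text_infilling(tokens, 1, 2))   # ['A', '<mask>', '.', 'D', 'E', '.']
print(sentence_permutation(["A B C .", "D E ."], rng))
print(document_rotation(tokens, 4))   # ['D', 'E', '.', 'A', 'B', 'C', '.']
```

In each case the pre-training objective is the same: the decoder is trained to reconstruct the original, uncorrupted sequence.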

**BART Inference: classification, translation**
![BART Inference](../assets/bart_inference_overview.png)

 * Installation

```bash
pip install torch transformers
```

### Summarization

```python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
model.eval().to(device)

# Input article
ARTICLE_TO_SUMMARIZE = (
    "Mars is the fourth planet from the Sun and is often "
    'referred to as the "Red Planet" due to its reddish appearance. '
    "It has the tallest volcano in the solar system, "
    "Olympus Mons, and the deepest canyon, Valles Marineris. "
    "Mars has seasons like Earth, polar ice caps, and signs that "
    "liquid water once flowed on its surface. "
    "Scientists are especially interested in Mars because "
    "of its potential to have supported life in the past. "
    "NASA’s Perseverance rover is currently exploring "
    "the Martian surface, collecting soil samples and "
    "searching for signs of ancient microbes. "
    "Multiple missions by various space agencies have aimed "
    "to study Mars’ geology, climate, and suitability "
    "for human colonization in the future."
)

# Tokenize and move the input to the same device as the model.
# truncation=True makes the 1024-token limit explicit and avoids the tokenizer warning.
inputs = tokenizer(
    [ARTICLE_TO_SUMMARIZE], max_length=1024, truncation=True, return_tensors="pt"
).to(device)

# Generate the summary with a small beam; max_length caps the output at 20 tokens
summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=2,
    min_length=0,
    max_length=20
)

# Decode the generated token IDs back to text
summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(summary)
```

Because `max_length=20` limits the generated sequence to 20 tokens, the summary is cut off mid-sentence:

```text
Mars is the fourth planet from the Sun and is often referred to as the "Red
```


### Sentence classification

This example extracts a fixed-size embedding for each sentence, which a downstream classifier could consume. No classification head is trained here.

```python
import torch
from transformers import BartTokenizer, BartModel

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartModel.from_pretrained("facebook/bart-base")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

texts = [
    "The new iPhone was released last week and it's amazing.",
    "The movie was boring and too long.",
    "The vaccine rollout has helped control the pandemic."
]

# Pad to the longest sentence in the batch; an explicit max_length
# avoids the "no maximum length is provided" tokenizer warning
inputs = tokenizer(
    texts, return_tensors="pt",
    padding=True, truncation=True, max_length=1024
).to(device)
with torch.no_grad():
    outputs = model(**inputs)

# BartModel returns decoder hidden states; take position 0 of each
# sequence as a fixed-size sentence embedding
sentence_embeddings = outputs.last_hidden_state[:, 0, :]
print(sentence_embeddings.shape)

print(sentence_embeddings)
```

```text
torch.Size([3, 768])
tensor([[ 2.6215,  2.1697,  1.2741,  ...,  1.8379, -0.1814, -0.4693],
        [ 2.8047,  2.2025,  1.4536,  ...,  1.9282, -0.2587, -0.5650],
        [ 2.9222,  2.7109,  1.7373,  ...,  1.4853,  0.0573, -0.2068]],
       device='cuda:0')
```
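For classification, the BART paper feeds the final hidden state of the final decoder token into a linear classifier, rather than the first position. The sketch below shows one way to pick that position from a right-padded batch using the attention mask. The helper name `pool_final_token` and the toy tensors are assumptions for illustration; it also assumes right padding, which is the BART tokenizer's default.

```python
import torch


def pool_final_token(hidden_states, attention_mask):
    """Select the hidden state at the last non-padding position of each row.

    hidden_states:  (batch, seq_len, hidden)
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    Assumes right padding, so real tokens come first in every row.
    """
    last_index = attention_mask.sum(dim=1) - 1  # (batch,)
    batch_index = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_index, last_index]  # (batch, hidden)


# Toy batch: 2 sequences, 4 positions, hidden size 3.
# The second sequence has only 2 real tokens followed by 2 pad positions.
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])

pooled = pool_final_token(hidden, mask)
print(pooled.shape)  # torch.Size([2, 3])
print(pooled)        # rows [9., 10., 11.] and [15., 16., 17.]
```

With model outputs and tokenizer inputs like those in the embedding example, the call would be `pool_final_token(outputs.last_hidden_state, inputs["attention_mask"])`, and the pooled vectors could then be passed to a linear layer trained on labeled data.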