# BART Overview

BART was [published](https://arxiv.org/abs/1910.13461) in October 2019. The abstract describes BART as a denoising autoencoder
for pretraining sequence-to-sequence models. Specifically, the paper states:

> BART ... pre-trains a model combining Bidirectional and Auto-Regressive Transformers.

Recall that an [autoencoder](ExtraTopics.ipynb#Autoencoder) is a special case of the encoder-decoder architecture in which the input also serves as the target output. The authors note that BART is specifically a Transformer. In this case, the encoder creates an intermediate representation of the data that the decoder then learns to translate back into the original input space.

Generally speaking, the term denoising refers to the process of ignoring or removing irrelevant information (the noise) from a dataset so that it accurately reflects the underlying process.

In the context of an autoencoder, the term denoise refers to the model ignoring irrelevant tokens in the input sequence while still predicting the right output sequence.

The paper goes further, however, and abstracts the concept of noise so that it also applies to missing or "corrupt" data (e.g. invalid tokens or random order). Thus BART learns to decode corrupt input sequences and produce the desired output sequence.
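As a minimal sketch of this idea, the snippet below builds a denoising training pair: a corrupted copy of a sequence as the model input and the clean original as the target. The names `MASK`, `mask_tokens`, `shuffle_tokens` and `make_denoising_pair` are hypothetical helpers invented for this illustration. They are not part of BART or of any library, and the corruption here is much cruder than what the paper uses.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, chosen for this sketch


def mask_tokens(tokens, rate, rng):
    """Replace each token with MASK with probability `rate`."""
    return [MASK if rng.random() < rate else tok for tok in tokens]


def shuffle_tokens(tokens, rng):
    """Return a copy of the tokens in random order."""
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled


def make_denoising_pair(tokens, noise_fn):
    """Pair a corrupted copy (model input) with the clean original (target)."""
    return noise_fn(tokens), list(tokens)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
clean = "the quick brown fox jumps over the lazy dog".split()

noisy, target = make_denoising_pair(clean, lambda t: mask_tokens(t, 0.3, rng))
print(noisy)   # some tokens may be replaced by <mask>
print(target)  # always the original sequence
```

Whatever `noise_fn` does to the input, the target stays the same, which is what lets the noising scheme be swapped freely.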

It notes that: 
> Unlike existing denoising autoencoders, which are tailored to specific noising schemes, BART allows us to
apply any type of document corruption. In the extreme case, where all information about the source is lost, BART is equivalent to a language model.
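One way to make the "extreme case" concrete is to write out a cross-entropy reconstruction loss of the general kind the paper describes. The notation here is assumed for illustration and is not taken from the paper: $x$ is the original document, $\tilde{x} = g(x)$ is its corrupted version under some noising function $g$, and $\theta$ are the model parameters.

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{|x|} \log p_\theta\left(x_t \mid x_{<t},\ \tilde{x}\right)
$$

If the corruption removes all information about the source, conditioning on $\tilde{x}$ contributes nothing and the loss reduces to

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{|x|} \log p_\theta\left(x_t \mid x_{<t}\right),
$$

which is the usual left-to-right language modeling objective.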

Applied to a solution that generates answers to questions, the implication would be that training on corrupted questions allows the model to yield the correct response regardless of how the question is asked, which in turn should improve its accuracy scores.

> This approach generalizes the original word masking and next sentence prediction objectives in BERT by forcing the model to reason more about overall sentence length and
make longer range transformations to the input.

# Use Cases

The authors claim that the representations produced by BART can be used in
several ways for downstream applications, including:

- Sequence Classification Tasks
- Token Classification Tasks
- Sequence Generation Tasks
- Machine Translation

**Note**: The authors use terminology similar to that of the [BERT](./BERT.ipynb) authors, referring to BART as a model that produces representations.

## Fine Tuning

The paper notes that a major advancement of BART is that it changes the way we think about pretraining and fine tuning.

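Previously, with BERT, each masked token was predicted at the same position it occupied in the input, because BERT is an encoder-only model with no separate decoder. Its outputs were therefore "aligned" with its inputs. In an encoder-decoder model such as BART, the encoder translates an input sequence into an intermediate representation and the decoder translates that representation into an output sequence, so the two sequences no longer have to line up.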

**Note**: Recall that [alignment](Neural%20Sequence%20Transducers.ipynb#Alignment) deals with the relationship between input and output tokens.

The authors state:

> (with BART)... Inputs to the encoder need not be aligned with decoder outputs, allowing arbitrary noise transformations. Here, a document has been corrupted by replacing spans of text with mask symbols. The corrupted document (left) is encoded with a bidirectional model, and then the likelihood of the original document (right) is calculated with an autoregressive decoder. For fine-tuning, an uncorrupted document is input to both the encoder and decoder, and we use representations from the final hidden state of the decoder.

The authors go on to discuss the various noise transformations used by BART's encoder (red) to arbitrarily introduce noise that the decoder (blue) is trained to remove.

<center><img src="./images/BART_encoder_transformations.png" style="width: 50%" ></center>
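The sketch below illustrates three of the transformations described in the paper (text infilling, token deletion and document rotation) on a toy token list. It is a simplified illustration, not the authors' implementation: the paper chooses spans and positions at random, whereas the hypothetical helpers `infill_spans`, `delete_tokens` and `rotate_document` take them as explicit arguments so that the effect is easy to follow.

```python
MASK = "<mask>"  # hypothetical mask symbol, chosen for this sketch


def infill_spans(tokens, spans):
    """Replace each (start, length) span with a single MASK token.

    Spans are assumed not to overlap. They are applied from right to left
    so that earlier indices stay valid while the list changes length.
    """
    out = list(tokens)
    for start, length in sorted(spans, reverse=True):
        out[start:start + length] = [MASK]
    return out


def delete_tokens(tokens, positions):
    """Drop the tokens at the given positions without leaving a marker."""
    drop = set(positions)
    return [tok for i, tok in enumerate(tokens) if i not in drop]


def rotate_document(tokens, start):
    """Rotate the sequence so that it begins at index `start`."""
    return tokens[start:] + tokens[:start]


original = "A B C D E F".split()

infilled = infill_spans(original, [(1, 2), (4, 1)])
print(infilled, len(infilled), "vs", len(original))
print(delete_tokens(original, [1, 4]))
print(rotate_document(original, 2))
```

Text infilling and token deletion both change the length of the sequence, so positions in the corrupted input no longer correspond one-to-one to positions in the original. That is the sense in which encoder inputs need not be aligned with decoder outputs.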



The authors also compare BART to BERT and GPT:

> BART uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes.

<center><img src='images/BART_comparison_with_BERT_and_GPT.png'></center>
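The difference between a bidirectional encoder and a left-to-right decoder can be pictured as two self-attention masks. The sketch below is only an illustration of that idea. The function names `bidirectional_mask`, `causal_mask` and `show` are hypothetical, and a real Transformer applies such masks to attention scores rather than printing them.

```python
def bidirectional_mask(n):
    """Every position may attend to every position (BERT-style encoder)."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Position i may attend only to positions 0..i (GPT-style decoder)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def show(mask):
    """Format a mask as rows of 0/1 values for printing."""
    return "\n".join(" ".join(str(v) for v in row) for row in mask)


print("encoder self-attention (bidirectional):")
print(show(bidirectional_mask(4)))
print("decoder self-attention (left-to-right):")
print(show(causal_mask(4)))
```

BART combines the two: its encoder uses the first pattern over the corrupted input, while its decoder uses the second over the output and additionally attends to the encoder's final hidden states through cross-attention.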

With BART, the embeddings passed between the encoder and decoder are "denoised". In the case of fine tuning, this would mean that the input sequence is transformed into a representation that is as close as possible to the optimal input for the decoder to produce the desired output.