# BART
BART is a denoising autoencoder for pretraining sequence-to-sequence models. It is trained by (1) corrupting text with an arbitrary noising function and (2) learning a model to reconstruct the original text.

BART uses a standard Transformer encoder-decoder architecture. It can be seen as a combination of BERT, which is an encoder-only model, and GPT, which is a decoder-only model.
# Pre-Training BART
BART is pre-trained by minimizing the cross-entropy loss between the decoder output and the original sequence.
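
As a rough illustration of that loss, the sketch below computes the mean cross-entropy between a toy decoder output and the original token ids. The three-token vocabulary, the probabilities, and the function name `cross_entropy` are made up for the example; real training works on logits over the full vocabulary.

```python
import math


def cross_entropy(predicted_probs, target_ids):
    """Mean negative log-likelihood of the original tokens.

    predicted_probs: one probability distribution (list of floats) per position.
    target_ids: index of the original (uncorrupted) token at each position.
    """
    nll = [-math.log(dist[t]) for dist, t in zip(predicted_probs, target_ids)]
    return sum(nll) / len(nll)


# Toy setup: vocabulary of 3 tokens, decoder output at 2 positions.
decoder_probs = [
    [0.7, 0.2, 0.1],  # position 0: the original token has id 0
    [0.1, 0.1, 0.8],  # position 1: the original token has id 2
]
original_ids = [0, 2]

loss = cross_entropy(decoder_probs, original_ids)
print(round(loss, 4))  # 0.2899
```

The loss falls toward zero as the decoder puts more probability on the original token at every position.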
## Masked Language Modeling (MLM), using BERT
MLM models such as BERT are pre-trained to predict masked tokens. A random subset of the input tokens is replaced with a mask token, [MASK], which can be described as adding noise (corruption). The model then predicts the original token for each [MASK], which can be called *denoising*.
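
A minimal sketch of the corruption step, assuming whitespace tokens and a fixed masking probability. The helper `mask_tokens` is hypothetical and deliberately simplified: BERT itself masks about 15% of tokens and sometimes substitutes a random or unchanged token instead of [MASK], which is left out here.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns the corrupted sequence and a dict {position: original token},
    which is what the model is asked to predict.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets


tokens = "how do i learn python quickly ?".split()
corrupted, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(corrupted)
print(targets)
```

The corrupted sequence has the same length as the original, so every prediction lines up with one input position.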

Importantly, because BERT can “see” the tokens both before and after a masked token when predicting the original, it is a bidirectional model.

This is suitable for classification tasks, where information from the full sequence is needed to make the prediction. However, it is not well suited to text generation tasks, where each prediction depends only on the previous words.
## Autoregressive Models
Models that use the previous tokens to predict the next token are said to be autoregressive. GPT-2, for example, is pre-trained to predict the next token given the preceding sequence of tokens. Because such models cannot see the full sentence, they are less suitable for classification.
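
To make the left-to-right setup concrete, the sketch below turns one sentence into (context, next token) training pairs. The function name `next_token_examples` and the whitespace tokenization are assumptions for illustration only.

```python
def next_token_examples(tokens):
    """Build (context, target) pairs: each target sees only the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


examples = next_token_examples(["How", "do", "I", "learn", "Python", "?"])
for context, target in examples:
    print(context, "->", target)
```

Each context contains only earlier tokens, which is why such a model never conditions on the words to the right of the position it predicts.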
## BART Sequence-to-Sequence
BART has both an encoder (like BERT) and a decoder (like GPT). The encoder reads the corrupted input, as in BERT's denoising objective, while the decoder attempts to reproduce the original sequence (hence autoencoder), token by token, using the previous (uncorrupted) tokens and the output from the encoder.
Because the decoder regenerates the whole sequence, the corrupted input does not have to stay aligned with the output, which allows many ways of adding noise to the text.
The corruption schemes used in the paper are summarized below.

Name | Description | Example
-----|---------|---------
Token Masking | A random subset of the input tokens is replaced with [MASK] tokens, as in BERT. | **ABC.DE.** is changed to **A_C._E.** Both **B** and **D** are replaced with a single mask token each.
Token Deletion | Random tokens are deleted from the input. The model must decide which positions are missing. | **ABC.DE.** is changed to **A.C.E.** Both **B** and **D** are deleted and not replaced.
Text Infilling | A number of text spans (the length can vary) are each replaced with a single [MASK] token. | **ABC.DE.** is changed to **A_.D_E.** The span **BC** is replaced with a single mask token. A zero-length span is inserted between **D** and **E**.
Sentence Permutation | The input is split into sentences at periods (.), and the sentences are shuffled. | **ABC.DE.** is changed to **DE.ABC.**
Document Rotation | A token is chosen at random, and the sequence is rotated so that it starts with the chosen token. | **ABC.DE.** is changed to **C.DE.AB** The sequence is rotated to start at **C**.

The authors note that training BART with text infilling yields the most consistently strong performance across many tasks.
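
Four of the schemes in the table can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: positions and spans are passed in explicitly so that the table's examples can be reproduced, whereas in pre-training they are sampled at random (the paper draws text-infilling span lengths from a Poisson distribution with λ = 3). All function names are hypothetical.

```python
MASK = "_"


def token_masking(tokens, positions):
    """Replace the tokens at the given positions with a mask token."""
    return [MASK if i in positions else t for i, t in enumerate(tokens)]


def text_infilling(tokens, spans):
    """Replace each (start, length) span with ONE mask token.

    Spans must be sorted and non-overlapping; a length of 0 inserts a mask
    without removing anything.
    """
    out, i = [], 0
    for start, length in spans:
        out.extend(tokens[i:start])
        out.append(MASK)
        i = start + length
    out.extend(tokens[i:])
    return out


def sentence_permutation(tokens, order):
    """Split into sentences at '.' and emit them in the given order."""
    sentences, current = [], []
    for t in tokens:
        current.append(t)
        if t == ".":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return [t for idx in order for t in sentences[idx]]


def document_rotation(tokens, start):
    """Rotate the sequence so that it begins at position `start`."""
    return tokens[start:] + tokens[:start]


doc = list("ABC.DE.")
print("".join(token_masking(doc, {1, 4})))              # A_C._E.
print("".join(text_infilling(doc, [(1, 2), (5, 0)])))   # A_.D_E.
print("".join(sentence_permutation(doc, [1, 0])))       # DE.ABC.
print("".join(document_rotation(doc, 2)))               # C.DE.AB
```

Text infilling differs from token masking in that the model also has to work out how many tokens a single mask stands for.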

For the task we are interested in, **paraphrasing**, the pre-trained BART model can be fine-tuned directly as a sequence-to-sequence model, with the original phrase as the input sequence and the paraphrased sentence as the target sequence.

This also works for tasks like summarization and abstractive question answering.


# Set up the Environment

In [3]:
!python -V

Python 3.7.11


In [4]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0


In [1]:
import torch
print(torch.__version__)

1.9.0+cu102


In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [3]:
!pip install simpletransformers


In [4]:
# Helper functions for loading and cleaning the data
import warnings

import pandas as pd


# Load a tab-separated file and keep only the rows whose label marks a paraphrase pair.
def load_data(
    file_path, input_text_column, target_text_column, label_column, keep_label=1
):
    # Skip malformed rows. error_bad_lines was deprecated in pandas 1.3 and
    # removed in 2.0; on newer versions use on_bad_lines="skip" instead.
    df = pd.read_csv(file_path, sep="\t", error_bad_lines=False)
    df = df.loc[df[label_column] == keep_label]
    df = df.rename(
        columns={input_text_column: "input_text", target_text_column: "target_text"}
    )
    df = df[["input_text", "target_text"]]
    df["prefix"] = "paraphrase"

    return df


# Some of the data has spaces before punctuation marks that we need to remove.
def clean_unnecessary_spaces(out_string):
    if not isinstance(out_string, str):
        warnings.warn(f">>> {out_string} <<< is not a string.")
        out_string = str(out_string)
    out_string = (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )
    return out_string
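
A quick check of what the cleaning helper does to tokenized text. So that the snippet runs on its own, the replacement rules are restated in a compact, hypothetical function `clean_spaces`; the rules and their order are copied from `clean_unnecessary_spaces`, and the sample sentences are made up.

```python
# Same replacement rules, in the same order, as clean_unnecessary_spaces.
REPLACEMENTS = [
    (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","), (" ' ", "'"),
    (" n't", "n't"), (" 'm", "'m"), (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
]


def clean_spaces(text):
    """Apply each replacement in turn to remove tokenizer-style spacing."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text


cleaned = clean_spaces("What 's the best way to learn Python , and why ?")
print(cleaned)  # What's the best way to learn Python, and why?
print(clean_spaces("I do n't know ."))  # I don't know.
```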

In [2]:
!wget http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv -P dataset

--2021-08-11 06:30:22--  http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
Resolving qim.fs.quoracdn.net (qim.fs.quoracdn.net)... 151.101.1.2, 151.101.65.2, 151.101.129.2, ...
Connecting to qim.fs.quoracdn.net (qim.fs.quoracdn.net)|151.101.1.2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58176133 (55M) [text/tab-separated-values]
Saving to: ‘dataset/quora_duplicate_questions.tsv’


2021-08-11 06:30:25 (221 MB/s) - ‘dataset/quora_duplicate_questions.tsv’ saved [58176133/58176133]



# Paraphrasing with BART
Once the data is prepared, training the model is quite simple.

First, we import the necessary libraries and set up logging.

In [5]:
import os
from datetime import datetime
import logging

import pandas as pd
from sklearn.model_selection import train_test_split
from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.ERROR)

# Data Preparation
Since we can reduce the number of datasets to train on, I am going to use only the Quora question pairs dataset (`quora_duplicate_questions.tsv`).

In [13]:
# The Quora Dataset is not separated into train/test, so we do it manually the first time.
df = load_data(
    "/content/drive/MyDrive/dataset/quora_duplicate_questions.tsv", "question1", "question2", "is_duplicate"
)
q_train, q_test = train_test_split(df)

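# index=False keeps the shuffled row index out of the saved files,
# so reading them back does not add an "Unnamed: 0" column.
q_train.to_csv("/content/drive/MyDrive/dataset/quora_train.tsv", sep="\t", index=False)
q_test.to_csv("/content/drive/MyDrive/dataset/quora_test.tsv", sep="\t", index=False)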

INFO:numexpr.utils:NumExpr defaulting to 2 threads.
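
`train_test_split(df)` is called above without `test_size` or `random_state`. With scikit-learn's defaults this holds out 25% of the rows and shuffles differently on every run, so the saved train and test files change between runs. The sketch below mirrors that split with the standard library and a fixed seed to show the sizes involved. `split_rows` and the seed are hypothetical, and the total of 149263 rows is an assumption chosen to match the 111947 training rows reported in this notebook. In the notebook itself, passing `random_state` to `train_test_split` would be the direct way to get a repeatable split.

```python
import math
import random


def split_rows(rows, test_fraction=0.25, seed=42):
    """Shuffle with a fixed seed, then hold out ceil(n * test_fraction) rows."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Assumed total number of duplicate question pairs (see the note above).
rows = list(range(149263))
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))  # 111947 37316

# The same seed always gives the same split.
assert split_rows(rows) == (train_rows, test_rows)
```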


In [6]:
q_train = pd.read_csv("/content/drive/MyDrive/dataset/quora_train.tsv", sep="\t")
q_test = pd.read_csv("/content/drive/MyDrive/dataset/quora_test.tsv", sep="\t")

In [7]:
train_df = q_train
eval_df = q_test

In [8]:
len(train_df)

111947

In [9]:
train_df = train_df[["prefix", "input_text", "target_text"]]
eval_df = eval_df[["prefix", "input_text", "target_text"]]

In [10]:
train_df.head()

Unnamed: 0,prefix,input_text,target_text
0,paraphrase,What if people spoke only one language?,How would the world be different if everyone s...
1,paraphrase,How do I lose weight without doing any sport?,Can I lose weight without exercise?
2,paraphrase,How do you start a private equity firm?,Could you start a private equity firm in your ...
3,paraphrase,"Why do people say ""God bless you""?","Why do people say ""bless you"" whenever someone..."
4,paraphrase,Can someone tell how many seat in total for ge...,How many total seats are there in neet?


In [10]:
train_df = train_df.dropna()
eval_df = eval_df.dropna()

train_df["input_text"] = train_df["input_text"].apply(clean_unnecessary_spaces)
train_df["target_text"] = train_df["target_text"].apply(clean_unnecessary_spaces)

eval_df["input_text"] = eval_df["input_text"].apply(clean_unnecessary_spaces)
eval_df["target_text"] = eval_df["target_text"].apply(clean_unnecessary_spaces)

print(train_df)

INFO:numexpr.utils:NumExpr defaulting to 2 threads.


            prefix  ...                                        target_text
0       paraphrase  ...  How would the world be different if everyone s...
1       paraphrase  ...                Can I lose weight without exercise?
2       paraphrase  ...  Could you start a private equity firm in your ...
3       paraphrase  ...  Why do people say "bless you" whenever someone...
4       paraphrase  ...            How many total seats are there in neet?
...            ...  ...                                                ...
111942  paraphrase  ...       How can I see who viewed my Instagram video?
111943  paraphrase  ...  What are the major open problems in computer v...
111944  paraphrase  ...  What's your New Year resolutions for 2017 and ...
111945  paraphrase  ...  I have more than 6 tens and less than 5 ones W...
111946  paraphrase  ...              How do I change my Facebook password?

[111947 rows x 3 columns]
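
With 111947 training rows left after `dropna`, the length of a training run follows from the batch size. The arithmetic below uses the values this notebook sets for the model (`train_batch_size = 8`, `num_train_epochs = 2`, `evaluate_during_training_steps = 2500`) and assumes a single device with no gradient accumulation, so it is an estimate rather than something read from the library.

```python
import math

train_rows = 111947   # rows in train_df after dropna
train_batch_size = 8
num_train_epochs = 2
eval_every = 2500     # evaluate_during_training_steps

# The last, partial batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(train_rows / train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
step_based_evals = total_steps // eval_every

print(steps_per_epoch, total_steps, step_based_evals)  # 13994 27988 11
```

Each of those evaluations runs over the whole evaluation set, so the evaluation interval has a noticeable effect on total training time.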


In [11]:
train_df["input_text"]

0                   What if people spoke only one language?
1             How do I lose weight without doing any sport?
2                   How do you start a private equity firm?
3                        Why do people say "God bless you"?
4         Can someone tell how many seat in total for ge...
                                ...                        
111942     How do I see who is viewing my Instagram videos?
111943          What are major problems in computer vision?
111944             What is your 2017 New Year’s resolution?
111945    I have more than 6/10 and less than five ones ...
111946    How do you log in to Facebook if you forgot yo...
Name: input_text, Length: 111947, dtype: object

In [12]:
train_df["target_text"]

0         How would the world be different if everyone s...
1                       Can I lose weight without exercise?
2         Could you start a private equity firm in your ...
3         Why do people say "bless you" whenever someone...
4                   How many total seats are there in neet?
                                ...                        
111942         How can I see who viewed my Instagram video?
111943    What are the major open problems in computer v...
111944    What's your New Year resolutions for 2017 and ...
111945    I have more than 6 tens and less than 5 ones W...
111946                How do I change my Facebook password?
Name: target_text, Length: 111947, dtype: object

# Set up the model and hyperparameter values
Next, we set up the model and hyperparameter values. Note that we are using the pre-trained `facebook/bart-large` model and fine-tuning it on our own dataset.
Finally, we’ll generate paraphrases for each of the sentences in the test data.
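
The configuration cell in this section enables sampling (`do_sample = True`) with `top_k = 50` and `top_p = 0.95`, and asks for three outputs per input (`num_return_sequences = 3`). As a rough illustration of what the two filters usually mean, the sketch below applies them to a toy next-token distribution. It is a simplified stand-in, not the code that `transformers` runs; the function name, the toy probabilities, and the smaller `top_k`/`top_p` values are made up for the example.

```python
import random


def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose cumulative probability reaches top_p, and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    k_total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / k_total  # probability mass after the top-k cut
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


next_token_probs = {"How": 0.5, "What": 0.3, "Why": 0.15, "Banana": 0.04, "Zebra": 0.01}
filtered = top_k_top_p_filter(next_token_probs, top_k=4, top_p=0.9)
print(sorted(filtered))  # ['How', 'What', 'Why']

# Sampling from the filtered distribution gives varied but plausible tokens.
rng = random.Random(0)
samples = rng.choices(list(filtered), weights=list(filtered.values()), k=3)
print(samples)
```

Cutting off the unlikely tail is what lets sampling produce several different paraphrases without drifting into unrelated words.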

In [13]:
model_args = Seq2SeqArgs()
model_args.do_sample = True
model_args.eval_batch_size = 64
model_args.evaluate_during_training = True
model_args.evaluate_during_training_steps = 2500
model_args.evaluate_during_training_verbose = True
model_args.fp16 = False
model_args.learning_rate = 5e-5
model_args.max_length = 128
model_args.max_seq_length = 128
model_args.num_beams = None
model_args.num_return_sequences = 3
model_args.num_train_epochs = 2
model_args.overwrite_output_dir = True
model_args.reprocess_input_data = True
model_args.save_eval_checkpoints = False
model_args.save_steps = -1
model_args.top_k = 50
model_args.top_p = 0.95
model_args.train_batch_size = 8
model_args.use_multiprocessing = False
model_args.wandb_project = "Paraphrasing with BART"


model = Seq2SeqModel(
    encoder_decoder_type="bart",
    encoder_decoder_name="facebook/bart-large",
    args=model_args,
)




In [None]:
model.train_model(train_df, eval_data=eval_df)

to_predict = [
    prefix + ": " + str(input_text)
    for prefix, input_text in zip(eval_df["prefix"].tolist(), eval_df["input_text"].tolist())
]
truth = eval_df["target_text"].tolist()

preds = model.predict(to_predict)

# Saving the predictions if needed
os.makedirs("/content/drive/MyDrive/dataset/predictions", exist_ok=True)

with open(f"/content/drive/MyDrive/dataset/predictions/predictions_{datetime.now()}.txt", "w") as f:
    for i, text in enumerate(eval_df["input_text"].tolist()):
        f.write(str(text) + "\n\n")

        f.write("Truth:\n")
        f.write(truth[i] + "\n\n")

        f.write("Prediction:\n")
        for pred in preds[i]:
            f.write(str(pred) + "\n")
        f.write(
            "________________________________________________________________________________\n"
        )

INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/

INFO:simpletransformers.seq2seq.seq2seq_model: Training started

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc

INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/

Exception in thread Thread-31:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 470, in _handle_results
    task = get()
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/reductions.py", line 294, in rebuild_storage_fd
    storage = cls._new_shared_fd(fd, size)
RuntimeError: unable to mmap 1024 bytes from file <filename not specified>: Cannot allocate memory (12)
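
A separate note on the prediction file in the cell above: its name is built from `datetime.now()`, whose default string form contains a space and colons (for example `2021-08-11 06:30:22.123456`). Such names are awkward in shells and invalid on some filesystems (Windows does not allow colons in file names). A minimal sketch of one alternative, where `prediction_path` is a hypothetical helper and the timestamp format is just one reasonable choice:

```python
from datetime import datetime
from pathlib import Path


def prediction_path(directory, now=None):
    """Build a predictions file path with a filesystem-friendly timestamp."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(directory) / f"predictions_{stamp}.txt"


path = prediction_path("predictions", datetime(2021, 8, 11, 6, 30, 22))
print(path.name)  # predictions_20210811_063022.txt
```

Timestamps in this format also sort chronologically when the files are listed by name.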

