## 🤗 Finetune **Longformer Encoder-Decoder (LED)** on 8K Tokens 🤗

The *Longformer Encoder-Decoder (LED)* was recently added as an extension to [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.

In this notebook we will finetune *LED* for Summarization on [Pubmed](https://huggingface.co/datasets/viewer/?dataset=scientific_papers). *Pubmed* is a long-range summarization dataset, which makes it a good candidate for LED. LED will be finetuned up to an input length of 8K tokens on a single GPU.

We will leverage 🤗`Seq2SeqTrainer`, gradient checkpointing and as usual 🤗`datasets`.

First, let's try to get a GPU with at least 15GB of RAM.

To check that we have enough RAM, we can run `nvidia-smi`.
If the randomly allocated GPU is too small, the notebook can be crashed and restarted in the hope of being assigned a better GPU.

In [1]:
import pandas as pd
from datasets import Dataset
from datasets import load_from_disk
from sklearn.model_selection import train_test_split

In [2]:
from datasets import load_dataset, load_metric
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
)

In [3]:
# %%capture
# !pip install datasets==1.2.1
# !pip install transformers==4.2.0
# !pip install rouge_score

In [4]:
# huggingface datasets load

train_dataset = load_from_disk("/home/aiffelsummabot/LED/HF_train_df/")
val_dataset = load_from_disk("/home/aiffelsummabot/LED/HF_val_df/")
test_dataset = load_from_disk("/home/aiffelsummabot/LED/HF_test_df/")

In [5]:
train_dataset[0]

{'article': ' from outstanding performance in 2011 12, burberry began the year cautiously optimistic , our long-range objectives ensuring clarity of the luxury brand message , enabling sustainable growth and being a great company firmly in sight . angela ahrendts chief executive officer this combination of optimism and determination , fuelled by the brand’s wealth of opportunity , suggested continued pursuit of the investment-oriented strategic agenda in the year ahead . at the same time , this pre-disposition was tempered by uncertainties in the macro environment and the goal to deliver near-term financial performance . in the final analysis , the result was a balance of dynamic management , core execution and strategic investment . challenging context following standout growth in 2011 relative to the range of consumer sectors , luxury slowed dramatically in 2012. the ongoing economic crisis in the eurozone and a continued sluggish us weighed on all areas of consumer spending . althou

In [6]:
train_dataset

Dataset({
    features: ['__index_level_0__', 'abstract', 'article'],
    num_rows: 2445
})

Next, we install 🤗Transformers, 🤗Datasets, and `rouge_score`.



Let's start by loading and preprocessing the dataset.



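Next, we download the pubmed train and validation dataset ([click to see on 🤗Datasets Hub](https://huggingface.co/datasets/scientific_papers)). This can take a couple of minutes **☕**. In the run recorded here, the train, validation and test splits are instead loaded from disk with `load_from_disk`, as shown in the cell above.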

It's always a good idea to take a look at some data samples. Let's do that here.

In [7]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=4):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

We can see that the input data is the `article` - a scientific report and the target data is the `abstract` - a concise summary of the report.

Cool! Having downloaded the dataset, let's tokenize it.
We'll import the convenient `AutoTokenizer` class.

Note that for the sake of this notebook, we finetune the "smaller" LED checkpoint ["allenai/led-base-16384"](https://huggingface.co/allenai/led-base-16384). Better performance can, however, be attained by finetuning ["allenai/led-large-16384"](https://huggingface.co/allenai/led-large-16384) at the cost of higher GPU memory requirements.

Pubmed's input data has a median token length of 2715, with the 90th-percentile token length being 6101. The output data has a median token length of 171, with the 90th-percentile token length being 352.${}^1$

Thus, a maximum input length of 8192 and a maximum output length of 512 ensure that the model can attend to almost all input tokens and is able to generate a large enough number of output tokens. In the run recorded here, the maximum input length is reduced to 4096, as set in the cell below.

To prevent out-of-memory errors, the 8192-token setup can only train with `batch_size=2`; the cell below sets `batch_size=1`.
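
The percentile reasoning above can be reproduced on any list of token counts. The following is a minimal sketch using only the standard library (`statistics.quantiles` needs Python 3.8 or newer); the helper names `length_stats` and `truncated_fraction` and the sample lengths are hypothetical and are not taken from Pubmed.

```python
import statistics


def length_stats(token_lengths):
    """Return (median, 90th percentile) for a list of token counts."""
    median = statistics.median(token_lengths)
    # quantiles(n=10) yields the nine decile cut points; the last one is the 90th percentile.
    p90 = statistics.quantiles(token_lengths, n=10)[-1]
    return median, p90


def truncated_fraction(token_lengths, max_length):
    """Fraction of samples that would lose tokens when truncated to max_length."""
    return sum(n > max_length for n in token_lengths) / len(token_lengths)


# Hypothetical token counts, for illustration only.
sample_lengths = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
median, p90 = length_stats(sample_lengths)
print(median, p90, truncated_fraction(sample_lengths, 512))
```

Running `truncated_fraction` with a candidate maximum length gives a quick estimate of how many samples would be cut off before committing to a value such as 4096 or 8192.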

---
${}^1$ The data is taken from page 11 of [Big Bird: Transformers for Longer Sequences](https://arxiv.org/pdf/2007.14062.pdf).


In [8]:
max_input_length = 4096
max_output_length = 512
batch_size = 1

Now, let's write down the input data processing function that will be used to map each data sample to the correct model format.
As explained earlier, `article` represents our input data here and `abstract` is the target data. The data samples are thus tokenized up to the respective maximum lengths, `max_input_length` and `max_output_length`.

In addition to the usual `attention_mask`, LED can make use of an additional `global_attention_mask` defining which input tokens are attended globally and which are attended only locally, just as it's the case of [Longformer](https://huggingface.co/transformers/model_doc/longformer.html). For more information on Longformer's self-attention, please take a look at the corresponding [docs](https://huggingface.co/transformers/model_doc/longformer.html#longformer-self-attention). For summarization, we follow recommendations of the [paper](https://arxiv.org/abs/2004.05150) and use global attention only for the very first token. Finally, we make sure that no loss is computed on padded tokens by setting their index to `-100`.

In [9]:
def process_data_to_model_inputs(batch):
    # tokenize the inputs and labels
    inputs = tokenizer(
        batch["article"],
        padding="max_length",
        truncation=True,
        max_length=max_input_length,
    )
    outputs = tokenizer(
        batch["abstract"],
        padding="max_length",
        truncation=True,
        max_length=max_output_length,
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask

    # create 0 global_attention_mask lists
    batch["global_attention_mask"] = len(batch["input_ids"]) * [
        [0 for _ in range(len(batch["input_ids"][0]))]
    ]

    # since above lists are references, the following line changes the 0 index for all samples
    batch["global_attention_mask"][0][0] = 1
    batch["labels"] = outputs.input_ids

    # We have to make sure that the PAD token is ignored
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels]
        for labels in batch["labels"]
    ]

    return batch
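
The list-aliasing trick used for `global_attention_mask` and the `-100` label masking can be checked in isolation, without a tokenizer. This is a small sketch; the function names and the `pad_token_id=1` value are assumptions made for the example.

```python
def aliased_global_mask(num_samples, seq_len):
    """Build the mask the way the mapping function does: every row is the same list object."""
    mask = num_samples * [[0 for _ in range(seq_len)]]
    mask[0][0] = 1  # the rows alias one list, so this sets index 0 in every row
    return mask


def independent_global_mask(num_samples, seq_len):
    """Equivalent mask with one separate list per sample."""
    return [[1] + [0] * (seq_len - 1) for _ in range(num_samples)]


def mask_padding(label_ids, pad_token_id):
    """Replace padding ids with -100 so that the loss ignores them."""
    return [[-100 if t == pad_token_id else t for t in row] for row in label_ids]


aliased = aliased_global_mask(3, 5)
print(aliased == independent_global_mask(3, 5))  # compares the values of both masks
print(aliased[0] is aliased[1])                  # checks whether rows share one object
print(mask_padding([[0, 42, 2, 1, 1]], pad_token_id=1))
```

Both constructions produce the same values. The independent version is the safer choice if individual rows are ever modified afterwards, since a change to one aliased row shows up in all of them.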

For a quick test run, the training and validation data can be reduced to a small dummy dataset with `select`. In the training cell below those lines are commented out, so the full datasets are used.

Great, having defined the mapping function, let's preprocess the training data.

## Training

In [11]:
!nvidia-smi

Thu Dec  9 10:38:29 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  On   | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:00:05.0 Off |                    0 |
| N/A   40C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
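
To compare the reported memory against the 15GB target programmatically, the `Memory-Usage` column can be parsed from a row of the table. This is a rough sketch, assuming the `used MiB / total MiB` layout shown above and treating 15GB as 15 GiB; `parse_memory_mib` and `has_enough_memory` are hypothetical helper names.

```python
import re


def parse_memory_mib(row):
    """Extract (used, total) memory in MiB from one nvidia-smi table row."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", row)
    if match is None:
        raise ValueError("no memory column found in row")
    return int(match.group(1)), int(match.group(2))


def has_enough_memory(row, required_gib=15):
    """True if the total memory in the row is at least required_gib GiB."""
    _, total_mib = parse_memory_mib(row)
    return total_mib / 1024 >= required_gib


# Row copied from the nvidia-smi output above.
row = "| N/A   40C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |"
print(parse_memory_mib(row), has_enough_memory(row))
```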
                                                                            

In [12]:
import torch
torch.cuda

<module 'torch.cuda' from '/home/aiffelsummabot/anaconda3/envs/summabot/lib/python3.7/site-packages/torch/cuda/__init__.py'>

In [13]:
USE_CUDA = torch.cuda.is_available()
print(USE_CUDA)

True


In [14]:
device = torch.device('cuda:0' if USE_CUDA else 'cpu')

In [15]:
print('Training device:', device)

Training device: cuda:0


In [16]:
import torch
import math

print(torch.__version__)  # print the torch version

dtype = torch.float
# device = torch.device("cpu")
device = torch.device("cuda")

1.10.0+cu102


In [17]:
import os

os.environ["TOKENIZERS_PARALLELISM"] = "False"

In [102]:
#!/usr/bin/env python3
from datasets import load_dataset, load_metric
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
)

# load rouge
rouge = load_metric("rouge")

# load pubmed
# pubmed_train = load_dataset("scientific_papers", "pubmed", ignore_verifications=True, split="train")
# pubmed_val = load_dataset("scientific_papers", "pubmed", ignore_verifications=True, split="validation[:10%]")
pubmed_train = train_dataset
pubmed_val = val_dataset

# uncomment the following lines for a quick test run on a small subset
# pubmed_train = pubmed_train.select(range(32))
# pubmed_val = pubmed_val.select(range(32))

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")


# the reference setup uses a max encoder length of 8192 for PubMed; this run uses 4096
encoder_max_length = 4096
decoder_max_length = 512
batch_size = 1


def process_data_to_model_inputs(batch):
    # tokenize the inputs and labels
    inputs = tokenizer(
        batch["article"],
        padding="max_length",
        truncation=True,
        max_length=encoder_max_length,
    )
    outputs = tokenizer(
        batch["abstract"],
        padding="max_length",
        truncation=True,
        max_length=decoder_max_length,
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask

    # create 0 global_attention_mask lists
    batch["global_attention_mask"] = len(batch["input_ids"]) * [
        [0 for _ in range(len(batch["input_ids"][0]))]
    ]

    # since above lists are references, the following line changes the 0 index for all samples
    batch["global_attention_mask"][0][0] = 1
    batch["labels"] = outputs.input_ids

    # We have to make sure that the PAD token is ignored
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels]
        for labels in batch["labels"]
    ]

    return batch


# map train data
pubmed_train = pubmed_train.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["article", "abstract", "__index_level_0__"],
)

# map val data
pubmed_val = pubmed_val.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["article", "abstract", "__index_level_0__"],
)

# set Python list to PyTorch tensor
pubmed_train.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)

# set Python list to PyTorch tensor
pubmed_val.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)

# fp16 (apex) training is disabled here because it raised "name 'amp' is not defined"
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16=False,  # originally True
    fp16_backend="auto",  # originally "apex"
    output_dir="./",
    logging_steps=250,
    eval_steps=5000,  # exceeds the 915 total steps of this run, so no evaluation happens during training
    save_steps=500,
    warmup_steps=1500,
    save_total_limit=2,
    gradient_accumulation_steps=4,
)


# compute Rouge score during validation
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(
        predictions=pred_str, references=label_str, rouge_types=["rouge2"]
    )["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }


# load model + enable gradient checkpointing & disable cache for checkpointing
led = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-base-16384", gradient_checkpointing=True, use_cache=False)

# set generate hyperparameters
led.config.num_beams = 4
led.config.max_length = 512
led.config.min_length = 100
led.config.length_penalty = 2.0
led.config.early_stopping = True
led.config.no_repeat_ngram_size = 3


# instantiate trainer
trainer = Seq2SeqTrainer(
    model=led,
    tokenizer=tokenizer,
    args=training_args, ## optional
    compute_metrics=compute_metrics,
    train_dataset=pubmed_train,
    eval_dataset=pubmed_val,
)

# start training
trainer.train()

***** Running training *****
  Num examples = 2445
  Num Epochs = 3
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 4
  Total optimization steps = 915
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
[34m[1mwandb[0m: Currently logged in as: [33mstk346[0m (use `wandb login --relogin` to force relogin)




Step,Training Loss,Validation Loss


Saving model checkpoint to ./checkpoint-500
Configuration saved in ./checkpoint-500/config.json
Model weights saved in ./checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./checkpoint-500/tokenizer_config.json
Special tokens file saved in ./checkpoint-500/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=915, training_loss=0.5458936263954705, metrics={'train_runtime': 4036.8547, 'train_samples_per_second': 1.817, 'train_steps_per_second': 0.227, 'total_flos': 1.979250143920128e+16, 'train_loss': 0.5458936263954705, 'epoch': 3.0})
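
The batch size and step count in the log can be reproduced by hand. The sketch below assumes that both GPUs listed by `nvidia-smi` are used in a single process, so each batch holds one sample per GPU, and that the number of optimizer steps per epoch is rounded down; `training_schedule` is a hypothetical helper, not part of 🤗Transformers.

```python
import math


def training_schedule(num_examples, per_device_batch, num_gpus, grad_accum, epochs):
    """Return (effective batch size, total optimization steps)."""
    effective_batch = per_device_batch * num_gpus * grad_accum
    # Batches per epoch: the last, smaller batch still counts.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    # One optimizer step per grad_accum batches; leftover batches are not counted.
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return effective_batch, steps_per_epoch * epochs


# Values taken from the training log above.
print(training_schedule(2445, per_device_batch=1, num_gpus=2, grad_accum=4, epochs=3))  # (8, 915)
```

With 915 total steps, a `save_steps` of 500 yields only `checkpoint-500`, which is the single checkpoint save that appears in the log.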

## Evaluation

In [36]:
train_dataset

Dataset({
    features: ['__index_level_0__', 'abstract', 'article'],
    num_rows: 2167
})

In [38]:
test_dataset

Dataset({
    features: ['attention_mask', 'global_attention_mask', 'input_ids', 'labels'],
    num_rows: 280
})

In [18]:
test_df = test_dataset.select(range(8))

In [19]:
test_df

Dataset({
    features: ['abstract', 'article'],
    num_rows: 8
})

In [20]:
import torch

from datasets import load_dataset, load_metric
from transformers import LEDTokenizer, LEDForConditionalGeneration

# load pubmed
pubmed_test = test_df

# load tokenizer
model_path = "/home/aiffelsummabot/LED/checkpoint-500/"
tokenizer = LEDTokenizer.from_pretrained(model_path)
model = LEDForConditionalGeneration.from_pretrained(model_path).to("cuda").half()


def generate_answer(batch):
    inputs_dict = tokenizer(batch["article"], padding="max_length", max_length=4096, return_tensors="pt", truncation=True)
    input_ids = inputs_dict.input_ids.to("cuda")
    attention_mask = inputs_dict.attention_mask.to("cuda")
    global_attention_mask = torch.zeros_like(attention_mask)
    # put global attention on <s> token
    global_attention_mask[:, 0] = 1

    predicted_abstract_ids = model.generate(input_ids, attention_mask=attention_mask, global_attention_mask=global_attention_mask)
    batch["predicted_abstract"] = tokenizer.batch_decode(predicted_abstract_ids, skip_special_tokens=True)
    return batch


result = pubmed_test.map(generate_answer, batched=True, batch_size=4)

# load rouge
rouge = load_metric("rouge")

print("Result:", rouge.compute(predictions=result["predicted_abstract"], references=result["abstract"], rouge_types=["rouge2"])["rouge2"].mid)


Result: Score(precision=0.39738487689689733, recall=0.3347771925156876, fmeasure=0.3585959433230117)
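
For intuition about what this score measures, ROUGE-2 counts the bigrams that a prediction shares with its reference. The following is a simplified sketch with whitespace tokenization; it is not the `rouge_score` implementation used above, which applies its own tokenization and aggregates over examples, so its numbers will generally differ. The example sentences are made up.

```python
from collections import Counter


def bigrams(text):
    """Count the adjacent word pairs in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(prediction, reference):
    """Return (precision, recall, f-measure) of bigram overlap."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # clipped count of shared bigrams
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


print(rouge2("revenue grew strongly this year", "revenue grew strongly in 2017"))
```

Precision penalizes bigrams in the prediction that the reference lacks, while recall penalizes reference bigrams that the prediction misses.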


In [35]:
result_pd = pd.DataFrame({
    'abstract': result['abstract'],
    'article': result['article'],
    'predicted_abstract': result['predicted_abstract'],
})
result_pd

| | abstract | article | predicted_abstract |
|---|---|---|---|
| 0 | 25695 19 march 2018 3 29 pm proof 7 02 s .c .h... | 25695 19 march 2018 3 29 pm proof 7 02 s . c ... | 25695 19 march 2018 3 29 pm proof 7 02 s.c.har... |
| 1 | strategic report chief executive’s statement 1... | strategic report chief executive’s statement ... | strategic report chief executive’s statement 1... |
| 2 | summary our dedication to providing our client... | summary our dedication to providing our clien... | summary our dedication to providing our client... |
| 3 | q a with ceo , david miles 92 percent of tenan... | q a with ceo , david miles 92 percent of tena... | q a with ceo, david miles 92 percent of tenant... |
| 4 | in the spring , we launched our walk in wins’ ... | strategic report domino’s pizza group plc ann... | strategy across all of our markets remains sim... |
| 5 | strategic report q a with interim group chief ... | strategic report q a with interim group chief... | the b ga market is looking positive.it showed ... |
| 6 | having reduced investment in the national acci... | 18 nahl group plc annual report and accounts ... | 18 nahl group plc annual report and accounts 2... |
| 7 | importantly , we are pleased to see that our i... | 2017 has been another strong year for taylor ... | 2017 has been another strong year for taylor w... |


In [37]:
result_pd.to_csv("LED_base_4098_512_result.csv")

In [33]:
result

Dataset({
    features: ['abstract', 'article', 'predicted_abstract'],
    num_rows: 8
})

In [32]:
result['abstract']

8

In [29]:
result[1]

{'abstract': 'strategic report chief executive’s statement 14 operational review 2017 marked a strong year of growth for alpha , both in revenue and in our investment in sta and infrastructure .during the year , we increased our client numbers by 39 percent , bringing our total number of clients to 310. pleasingly , the fact that our revenue has grown at a higher percentage is a re ection of the larger trades that we are doing and the increasing size of our clients .we will continue to focus on growing our client base by penetrating our existing corporate marketplace in the uk , alongside continued expansion into the institutional marketplace and overseas sectors .europe in particular presents a very exciting area of expansion for us .during the nancial year we have recruited sta for our london o ce who are uent in foreign languages which has enabled us to steadily expand into select european territories .as a result , we successfully onboarded our rst european clients in the second ha

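Evaluation is complete.  
Comparing the predictions of the plain pretrained model with those of the fine-tuned model, the fine-tuned model's results were much better. Its sentences flow smoothly and it is more expressive, so we plan to carry out the evaluation with the saved checkpoint.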