## 🤗 Finetune **Longformer Encoder-Decoder (LED)**  🤗

Longformer Encoder-Decoder (LED) is a recently released model that extends the Transformer.  
This notebook fine-tunes from the checkpoint of a model pre-trained on the Hugging Face Pubmed dataset.  
  
**Requires at least 15 GB of RAM**

## Importing modules

In [1]:
import os
import torch
import pandas as pd
import rouge_score
from rouge_score import rouge_scorer
import numpy as np

In [2]:
from datasets import load_from_disk
from datasets import load_dataset, load_metric
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
)

In [3]:
# Environment setup

# %%capture
# !pip install datasets==1.2.1
# !pip install transformers==4.2.0
# !pip install rouge_score

## Loading the dataset  
The Hugging Face `datasets` module is used.

In [4]:
# huggingface datasets load

train_dataset = load_from_disk("/home/aiffelsummabot/LED/HF_train_df/")
val_dataset = load_from_disk("/home/aiffelsummabot/LED/HF_val_df/")

In [5]:
train_dataset[0]

{'article': ' from outstanding performance in 2011 12, burberry began the year cautiously optimistic , our long-range objectives ensuring clarity of the luxury brand message , enabling sustainable growth and being a great company firmly in sight . angela ahrendts chief executive officer this combination of optimism and determination , fuelled by the brand’s wealth of opportunity , suggested continued pursuit of the investment-oriented strategic agenda in the year ahead . at the same time , this pre-disposition was tempered by uncertainties in the macro environment and the goal to deliver near-term financial performance . in the final analysis , the result was a balance of dynamic management , core execution and strategic investment . challenging context following standout growth in 2011 relative to the range of consumer sectors , luxury slowed dramatically in 2012. the ongoing economic crisis in the eurozone and a continued sluggish us weighed on all areas of consumer spending . althou

In [6]:
train_dataset

Dataset({
    features: ['__index_level_0__', 'abstract', 'article'],
    num_rows: 2445
})

## Data Processing

Write the function that is mapped over the dataset to preprocess it.  
**article** is the input data.  
**abstract** is the target data.  
`max_input_length` is set to 4096, `max_output_length` to 512, and `batch_size` to 2.

LED's `global_attention_mask` defines, for each input token, whether attention is applied globally or only locally.  
The details are given in the paper; here global attention is applied to the first token only. The label ids of padding tokens are set to -100 so that no loss is computed on padded positions.
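The two conventions described above can be sketched in plain Python. This is only an illustration: the pad id of 1 and the toy token ids are assumptions, and the helper names are hypothetical. In the notebook the real pad id comes from `tokenizer.pad_token_id`.

```python
PAD_ID = 1           # assumption for illustration; use tokenizer.pad_token_id in practice
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_global_attention_mask(input_ids):
    """Return one 0/1 mask per sequence with global attention on the first token only."""
    return [[1] + [0] * (len(ids) - 1) for ids in input_ids]


def mask_padding_in_labels(label_ids, pad_id=PAD_ID):
    """Replace padding ids with IGNORE_INDEX so padded positions add no loss."""
    return [[IGNORE_INDEX if t == pad_id else t for t in ids] for ids in label_ids]


# Toy batch of two already-padded sequences (made-up ids).
input_ids = [[0, 713, 16, 2, 1, 1], [0, 100, 200, 300, 2, 1]]
labels = [[0, 45, 2, 1], [0, 67, 89, 2]]

global_attention_mask = build_global_attention_mask(input_ids)
masked_labels = mask_padding_in_labels(labels)
print(global_attention_mask)
print(masked_labels)
```

Every other token keeps a 0 in the mask, so it only attends within its local window.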

In [7]:
tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")

In [8]:
max_input_length = 4096
max_output_length = 512
batch_size = 2

In [9]:
def process_data_to_model_inputs(batch):
    # tokenize the inputs and labels
    inputs = tokenizer(
        batch["article"],
        padding="max_length",
        truncation=True,
        max_length=max_input_length,
    )
    outputs = tokenizer(
        batch["abstract"],
        padding="max_length",
        truncation=True,
        max_length=max_output_length,
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask

    # create 0 global_attention_mask lists
    batch["global_attention_mask"] = len(batch["input_ids"]) * [
        [0 for _ in range(len(batch["input_ids"][0]))]
    ]

    # since above lists are references, the following line changes the 0 index for all samples
    batch["global_attention_mask"][0][0] = 1
    batch["labels"] = outputs.input_ids

    # We have to make sure that the PAD token is ignored
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels]
        for labels in batch["labels"]
    ]

    return batch
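The comment "since above lists are references" in the function relies on a Python detail that is easy to miss: multiplying a list of lists repeats a reference to one inner list, so a single assignment shows up in every row. The small sketch below (toy sizes, not the notebook's data) contrasts this with a list comprehension, which builds independent rows.

```python
n_samples, seq_len = 3, 4

# List multiplication repeats a reference to ONE inner list.
shared = n_samples * [[0] * seq_len]
shared[0][0] = 1  # the change is visible through every row

# A comprehension builds a separate inner list per sample.
independent = [[0] * seq_len for _ in range(n_samples)]
independent[0][0] = 1  # only the first row changes

print(shared)
print(independent)
```

In the function above the shared-reference behaviour is what is wanted, because every sample should have global attention on its first token. It would be a bug if the rows ever needed different values.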

Map the function defined above over the datasets. Since tokenization is now complete, the original columns are removed.

In [10]:
train_dataset = train_dataset.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["article", "abstract", "__index_level_0__"],
)


In [11]:
val_dataset = val_dataset.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["article", "abstract", "__index_level_0__"],
)


The dataset has now been converted into the following form.

In [12]:
train_dataset

Dataset({
    features: ['attention_mask', 'global_attention_mask', 'input_ids', 'labels'],
    num_rows: 2445
})

Because the model is trained with PyTorch, the datasets are converted to PyTorch format.

In [13]:
train_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)
val_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)

Load the model with the `AutoModelForSeq2SeqLM` class.

In [14]:
led = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-base-16384", gradient_checkpointing=True, use_cache=False)

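During training, the model is evaluated with the ROUGE score, which shows whether training is progressing well. Beam search uses a small beam size of 2 to save memory.  
Since `min_length`=100 and `max_length`=512, the generated summaries will contain between 100 and 512 tokens.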

In [15]:
# set generate hyperparameters
led.config.num_beams = 2
led.config.max_length = 512
led.config.min_length = 100
led.config.length_penalty = 2.0
led.config.early_stopping = True
led.config.no_repeat_ngram_size = 3

In [16]:
rouge = load_metric("rouge")

The `compute_metrics` function takes the token ids of the labels and of the model outputs (predictions) and decodes them. It then computes the ROUGE-2 score between the two.

In [17]:
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(
        predictions=pred_str, references=label_str, rouge_types=["rouge2"]
    )["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }
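For intuition about the three numbers returned above, here is a minimal pure-Python sketch of ROUGE-2 as bigram overlap. It is a simplification under stated assumptions: whitespace tokenization, lowercasing, no stemming, and no bootstrap aggregation, so its values will not match the `rouge` metric exactly. The function names are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Count adjacent token pairs after lowercasing and whitespace splitting."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(prediction, reference):
    """Return (precision, recall, fmeasure) of bigram overlap."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # clipped count of shared bigrams
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


p, r, f = rouge2("the cat sat on the mat", "the cat lay on the mat")
print(round(p, 4), round(r, 4), round(f, 4))
```

Precision divides the shared bigrams by the number of bigrams in the prediction, recall divides by the number in the reference, and the F-measure is their harmonic mean.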

## Training

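With `predict_with_generate=True`, `Seq2SeqTrainer` runs `generate()` during evaluation.
Raising `gradient_accumulation_steps` increases the effective batch size without increasing GPU memory use, because gradients from several small batches are summed before each optimizer step.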

In [18]:
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16=False,
    output_dir="/home/aiffelsummabot/LED/",
    logging_steps=5,
    eval_steps=10,
    save_steps=10,
    save_total_limit=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
)
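As a rough back-of-the-envelope check of what these arguments imply, the arithmetic below derives the effective batch size and an approximate number of optimizer steps per epoch. It assumes a single GPU, and the exact step count reported by the trainer may differ slightly depending on the library version.

```python
import math

num_train_examples = 2445        # num_rows of train_dataset shown earlier
per_device_train_batch_size = 2  # batch_size
gradient_accumulation_steps = 4
num_devices = 1                  # assumption: training on a single GPU
eval_steps = 10

# Examples that contribute to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Forward/backward passes per epoch, then optimizer updates per epoch.
micro_batches_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
approx_optimizer_steps = micro_batches_per_epoch // gradient_accumulation_steps

# Evaluation (with generation) is triggered every `eval_steps` optimizer steps.
approx_evaluations = approx_optimizer_steps // eval_steps

print(effective_batch_size, micro_batches_per_epoch, approx_optimizer_steps, approx_evaluations)
```

Because every evaluation runs `generate()` over the whole validation set, a small `eval_steps` value can dominate the total training time.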

Pass the model, tokenizer, datasets, and `compute_metrics` function defined above to `Seq2SeqTrainer`.

In [19]:
os.environ["TOKENIZERS_PARALLELISM"] = "False"

In [20]:
trainer = Seq2SeqTrainer(
    model=led,
    tokenizer=tokenizer,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

In [21]:
trainer.train()

wandb: Currently logged in as: summabot (use `wandb login --relogin` to force relogin)




Step,Training Loss,Validation Loss


KeyboardInterrupt: 

## Evaluation

In [None]:
test_dataset = load_from_disk("/home/aiffelsummabot/LED/HF_test_df/")

In [None]:
test_dataset

In [None]:
from transformers import LEDTokenizer, LEDForConditionalGeneration

pubmed_test = test_dataset

# Load the checkpoint
model_path = "/home/aiffelsummabot/LED/checkpoint-500/"
tokenizer = LEDTokenizer.from_pretrained(model_path)
model = LEDForConditionalGeneration.from_pretrained(model_path).to("cuda").half()

In [None]:
def generate_answer(batch):
    inputs_dict = tokenizer(batch["article"], padding="max_length", max_length=4096, return_tensors="pt", truncation=True)
    input_ids = inputs_dict.input_ids.to("cuda")
    attention_mask = inputs_dict.attention_mask.to("cuda")
    global_attention_mask = torch.zeros_like(attention_mask)
    # put global attention on <s> token
    global_attention_mask[:, 0] = 1

    predicted_abstract_ids = model.generate(input_ids, attention_mask=attention_mask, global_attention_mask=global_attention_mask)
    batch["predicted_abstract"] = tokenizer.batch_decode(predicted_abstract_ids, skip_special_tokens=True)
    return batch

In [None]:
result = pubmed_test.map(generate_answer, batched=True, batch_size=4)

In [None]:
# load rouge
rouge = load_metric("rouge")

print("Result:", rouge.compute(predictions=result["predicted_abstract"], references=result["abstract"], rouge_types=["rouge2"])["rouge2"].mid)


In [None]:
result

In [None]:
predicted_result = pd.DataFrame({
    'abstract': result['abstract'],
    'article': result['article'],
    'predicted_abstract': result['predicted_abstract'],
})

predicted_result.to_csv('./predicted_document.csv')

#### Loading the prediction file
To compare with Pegasus, the first 100 rows are set aside as `prediction_df2`, and the same ROUGE scores used for the Pegasus model (ROUGE-1 and ROUGE-L F-measure) are computed.

In [2]:
import pandas as pd
prediction_df = pd.read_csv('./predicted_document.csv', index_col=0)
prediction_df.head()

| | abstract | article | predicted_abstract |
|---|---|---|---|
| 0 | 25695 19 march 2018 3 29 pm proof 7 02 s .c .h... | 25695 19 march 2018 3 29 pm proof 7 02 s . c ... | 25695 19 march 2018 3 29 pm proof 7 02 s.c.har... |
| 1 | strategic report chief executive’s statement 1... | strategic report chief executive’s statement ... | strategic report chief executive’s statement 1... |
| 2 | summary our dedication to providing our client... | summary our dedication to providing our clien... | summary our dedication to providing our client... |
| 3 | q a with ceo , david miles 92 percent of tenan... | q a with ceo , david miles 92 percent of tena... | q a with ceo, david miles 92 percent of tenant... |
| 4 | in the spring , we launched our walk in wins’ ... | strategic report domino’s pizza group plc ann... | strategy across all of our markets remains sim... |


In [7]:
prediction_df2 = prediction_df[:100].copy()
len(prediction_df2)

100

#### Computing the ROUGE score

In [8]:
def rouge_scores(gen_summary_list, actual_summary_list, metric='fmeasure'):
    """Print the average ROUGE-1 and ROUGE-L scores over paired summaries.

    `metric` selects the field to average: 'precision', 'recall', or 'fmeasure'.
    """
    rouge1_scores = []
    rougeL_scores = []
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    for i in range(len(gen_summary_list)):
        # score(target, prediction) returns a Score(precision, recall, fmeasure) per ROUGE type
        scores = scorer.score(actual_summary_list[i], gen_summary_list[i])
        rouge1_scores.append(getattr(scores['rouge1'], metric))
        rougeL_scores.append(getattr(scores['rougeL'], metric))
    print("Average Rouge-1", str(metric), ":", round(np.mean(rouge1_scores), 2))
    print("Average Rouge-L", str(metric), ":", round(np.mean(rougeL_scores), 2))
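To see why ROUGE-L can come out lower than ROUGE-1 for the same pair of texts, the sketch below computes both F-measures in plain Python. ROUGE-1 only counts shared words, while ROUGE-L uses the longest common subsequence, so word order matters. This is a simplified illustration with whitespace tokenization and no stemming, not the `rouge_score` implementation, and the function names are hypothetical.

```python
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def f_measure(matches, n_pred, n_ref):
    """Harmonic mean of precision (matches / n_pred) and recall (matches / n_ref)."""
    if matches == 0:
        return 0.0
    precision, recall = matches / n_pred, matches / n_ref
    return 2 * precision * recall / (precision + recall)


def rouge1_and_rougeL_f(prediction, reference):
    pred, ref = prediction.lower().split(), reference.lower().split()
    unigram_matches = sum((Counter(pred) & Counter(ref)).values())
    rouge1 = f_measure(unigram_matches, len(pred), len(ref))
    rougeL = f_measure(lcs_length(ref, pred), len(pred), len(ref))
    return rouge1, rougeL


# Same words in a different order: every unigram matches, but the longest
# common subsequence ("the sat on the") covers only 4 of the 6 tokens.
r1, rl = rouge1_and_rougeL_f("the cat sat on the mat", "the mat sat on the cat")
print(round(r1, 4), round(rl, 4))
```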

In [9]:
rouge_scores(prediction_df2['predicted_abstract'], prediction_df2['abstract'])

Average Rouge-1 fmeasure : 0.54
Average Rouge-L fmeasure : 0.39
