# Chapter 18. Document Summarization with Transformer Models
## 1. Understanding Document Summarization
## 2. Document Summarization with a Pipeline

In [1]:
from transformers import pipeline

# Create a pipeline for summarization
summarizer = pipeline("summarization")
# Source text to summarize: the definition of text mining (Wikipedia)
text = '''Text mining, also referred to as text data mining (abbr.: TDM), similar to text analytics, 
        is the process of deriving high-quality information from text. It involves 
        "the discovery by computer of new, previously unknown information, 
        by automatically extracting information from different written resources." 
        Written resources may include websites, books, emails, reviews, and articles. 
        High-quality information is typically obtained by devising patterns and trends 
        by means such as statistical pattern learning. According to Hotho et al. (2005)
        we can distinguish between three different perspectives of text mining: 
        information extraction, data mining, and a KDD (Knowledge Discovery in Databases) process.''' 
summary_text = summarizer(text)  # run summarization with the pipeline
print("Summary:\n", summary_text)
print("Original length:", len(text), "Summary length:", len(summary_text[0]["summary_text"]))

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Summary:
 [{'summary_text': ' Text mining involves deriving high-quality information from text . Written resources may include websites, books, emails, reviews, and articles . Text mining is similar to text analytics . It involves the discovery by computer of new, previously unknown information by automatically extracting information from different written resources .'}]
Original length: 778 Summary length: 341
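
The two lengths printed above are character counts (Python `len` on strings), not token counts. As a rough sketch, the share of the original that survives in the summary can be computed as follows; the helper name `compression_ratio` is hypothetical and not part of `transformers`.

```python
def compression_ratio(original_length, summary_length):
    """Return the summary length as a fraction of the original length."""
    if original_length <= 0:
        raise ValueError("original_length must be positive")
    return summary_length / original_length


# Character counts reported by the pipeline example: 778 and 341.
ratio = compression_ratio(778, 341)
print(f"The summary is {ratio:.1%} of the original length")
```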


## 3. Document Summarization with T5 and Auto Classes

In [2]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small", model_max_length=512)
print("tokenizer type:", type(tokenizer))
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
print("model type:", type(model))
# Use cuda as the device if GPU acceleration is available, otherwise fall back to cpu
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = model.to(device)

tokenizer type: <class 'transformers.models.t5.tokenization_t5_fast.T5TokenizerFast'>
model type: <class 'transformers.models.t5.modeling_t5.T5ForConditionalGeneration'>


In [3]:
# Preprocess the source text: apply strip() and remove \n (line breaks)
preprocess_text = text.strip().replace("\n","")
# Prepend "summarize: " to the preprocessed text to set the model's task to summarization
input_text = "summarize: " + preprocess_text

# Tokenize the input text
tokenized_text = tokenizer.encode(input_text, return_tensors="pt").to(device)
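
Note that `replace("\n", "")` removes the line breaks but keeps the indentation of the triple-quoted string, and it joins two lines without a space whenever a line has no trailing blank. One possible alternative, shown here as a sketch with a hypothetical helper name, collapses every run of whitespace into a single space.

```python
def normalize_whitespace(raw_text):
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    # str.split() without arguments splits on any run of whitespace
    # and discards leading and trailing whitespace.
    return " ".join(raw_text.split())


sample = """Text mining is the process
        of deriving high-quality information
        from text."""
print(normalize_whitespace(sample))
```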

In [4]:
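# Generate the summary
summary_ids = model.generate(tokenized_text,
                             num_beams=4,  # number of beams (beam width)
                             no_repeat_ngram_size=3,  # block repeated 3-grams to avoid repetition
                             min_length=30,  # minimum number of tokens in the summary
                             max_length=100,  # maximum number of tokens in the summary
                             early_stopping=True)  # stop once num_beams finished candidates exist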

output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print ("Summarized text: \n",output)
print("Original text length:", len(text), "Summarized text length:", len(output))

Summarized text: 
 text data mining is the process of deriving high-quality information from text. it involves the discovery by computer of new, previously unknown information. a KDD (Knowledge Discovery in Databases) process is similar to text analytics.
Original text length: 778 Summarized text length: 236
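
To make the effect of `no_repeat_ngram_size=3` more concrete, the sketch below lists which next tokens would complete a 3-gram that already occurs in the generated sequence. It is a simplified illustration on word tokens, not the `transformers` implementation, and the function name is hypothetical.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size - 1:
        return set()
    # The last (n - 1) tokens are the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for start in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


history = "text mining is text mining".split()
# "text mining is" was already generated, so "is" must not follow "text mining" again.
print(banned_next_tokens(history, 3))
```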


In [5]:
input_text = "translate english to german: That is good"

# Tokenize the input text
tokenized_text = tokenizer.encode(input_text, return_tensors="pt").to(device)
result = model.generate(tokenized_text,
                        num_beams=4,  # number of beams (beam width)
                        no_repeat_ngram_size=3,  # block repeated 3-grams to avoid repetition
                        max_length=100,  # maximum number of tokens in the output
                        early_stopping=True)  # stop once num_beams finished candidates exist
output = tokenizer.decode(result[0], skip_special_tokens=True)
print ("translated text: \n",output)

translated text: 
 Das ist gut.


## 4. Fine-Tuning T5 with the Trainer

In [6]:
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained('t5-small')
tokenizer = T5TokenizerFast.from_pretrained('t5-small', model_max_length=1024)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = model.to(device)

In [7]:
text = '''The Inflation Reduction Act lowers prescription drug costs, health care costs, 
and energy costs. It's the most aggressive action on tackling the climate crisis in American history, 
which will lift up American workers and create good-paying, union jobs across the country. 
It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. 
And no one making under $400,000 per year will pay a penny more in taxes.'''

preprocess_text = text.strip().replace("\n","")
# Prepend "summarize: " to the preprocessed text to set the model's task to summarization
input_text = "summarize: " + preprocess_text

# Tokenize the input text
tokenized_text = tokenizer.encode(input_text, return_tensors="pt").to(device)
summary_ids = model.generate(tokenized_text,
                             num_beams=4,  # number of beams (beam width)
                             no_repeat_ngram_size=3,  # block repeated 3-grams to avoid repetition
                             min_length=30,  # minimum number of tokens in the summary
                             max_length=100,  # maximum number of tokens in the summary
                             early_stopping=True)  # stop once num_beams finished candidates exist
output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print ("Summarized text: \n",output)
print("Original text length:", len(text), "Summarized text length:", len(output))

Summarized text: 
 the Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. it's the most aggressive action on tackling the climate crisis in history. no one making under $400,000 per year will pay a penny more in taxes.
Original text length: 441 Summarized text length: 241


In [8]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")
billsum = billsum.train_test_split(test_size=0.2)
example = billsum["train"][0]
print("BillSum example - first item")
print("\tText:", example['text'][:50])
print("\tSummary:", example['summary'][:50])
print("\tTitle:", example['title'][:50])

Found cached dataset billsum (C:/Users/user/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc)


BillSum example - first item
	Text: The people of the State of California do enact as 
	Summary: The California Prompt Payment Act dictates that a 
	Title: An act to amend Section
927
927.2
of the Governmen


In [9]:
def preprocess_text(data):
    # Prepend "summarize: " to each bill text
    inputs = ["summarize: " + doc for doc in data["text"]]
    # Tokenize the input texts
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    # Tokenize the summaries that serve as labels
    labels = tokenizer(data["summary"], max_length=128, truncation=True)
    # Attach the tokenized summaries to model_inputs under the "labels" key
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply the preprocessing function to the data and drop the original billsum columns
tokenized_billsum = billsum.map(preprocess_text, batched=True, remove_columns=billsum["train"].column_names)
tokenized_billsum

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 989
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 248
    })
})
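
The row counts above follow from `test_size=0.2`. The arithmetic below is a sketch under the assumption that the test share is rounded up and the remainder goes to the training split, which is consistent with the 989 and 248 rows shown.

```python
import math

n_total = 989 + 248      # rows in the train and test splits shown above
test_fraction = 0.2      # value passed as test_size

# Assumption: the test share is rounded up and the rest is used for training.
n_test = math.ceil(test_fraction * n_total)
n_train = n_total - n_test
print(n_train, n_test)
```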

In [10]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
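
`DataCollatorForSeq2Seq` pads each batch dynamically and, by default, pads the labels with -100 so that the loss ignores the padded positions. The plain-Python sketch below mimics that step and the reverse mapping that is needed before labels can be decoded. The token ids are made up, the helper names are hypothetical, and the pad id of 0 is an assumption that matches T5.

```python
LABEL_PAD_ID = -100   # default label padding value used by the collator
PAD_TOKEN_ID = 0      # assumption: pad token id of the T5 tokenizer


def pad_labels(batch, pad_value=LABEL_PAD_ID):
    """Right-pad every label sequence to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in batch]


def restore_pad_ids(batch, pad_token_id=PAD_TOKEN_ID):
    """Replace -100 with the real pad id so the sequences can be decoded."""
    return [[pad_token_id if tok == LABEL_PAD_ID else tok for tok in seq]
            for seq in batch]


labels = [[21, 7, 1], [5, 1]]   # made-up token ids
padded = pad_labels(labels)
print(padded)
print(restore_pad_ids(padded))
```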

In [11]:
#!pip install evaluate
#!pip install rouge_score

In [12]:
import numpy as np
import evaluate

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Decode the generated summary tokens into text
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels, which cannot be decoded, with the pad token id
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    # Decode the labels into text
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Compute ROUGE scores from the decoded summaries and labels
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    return {k: round(v, 4) for k, v in result.items()}
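
For intuition about what the ROUGE-1 value measures, the sketch below computes a unigram-overlap F1 score by hand. It is a deliberately simplified version (whitespace tokenization, lowercasing, no stemming) with a hypothetical function name, so its numbers will not match the `rouge` metric from `evaluate` exactly.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat", "the cat sat on the mat")
print(round(score, 4))
```

In this toy case every predicted word appears in the reference (precision 1.0), but only half of the reference words are covered (recall 0.5).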

In [13]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="./summary",         # folder for predictions and checkpoints (required)
    evaluation_strategy="epoch",    # evaluate at the end of every epoch
    learning_rate=2e-5,             # learning rate
    per_device_train_batch_size=16, # training batch size per device
    per_device_eval_batch_size=16,  # evaluation batch size per device
    weight_decay=0.01,              # weight decay
    save_total_limit=3,             # maximum number of checkpoints to keep
    num_train_epochs=4,             # number of epochs
    predict_with_generate=True,     # call generate() during evaluation so ROUGE can be computed
)

trainer = Seq2SeqTrainer(
    tokenizer=tokenizer,
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

***** Running training *****
  Num examples = 989
  Num Epochs = 4
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 248
  Number of trainable parameters = 60506624
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum |
|---|---|---|---|---|---|---|
| 1 | No log | 2.808137 | 0.125 | 0.0374 | 0.1058 | 0.1059 |
| 2 | No log | 2.597031 | 0.1369 | 0.0473 | 0.1114 | 0.1116 |
| 3 | No log | 2.534104 | 0.1382 | 0.0495 | 0.1132 | 0.1134 |
| 4 | No log | 2.516838 | 0.1392 | 0.0497 | 0.1141 | 0.1143 |


***** Running Evaluation *****
  Num examples = 248
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=248, training_loss=3.0233981224798385, metrics={'train_runtime': 72.1547, 'train_samples_per_second': 54.827, 'train_steps_per_second': 3.437, 'total_flos': 1070824333246464.0, 'train_loss': 3.0233981224798385, 'epoch': 4.0})
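
The step count and throughput figures in the training log can be reproduced with simple arithmetic, as sketched below with the values copied from the log. The same numbers explain the `No log` entries for the training loss: with the default `logging_steps` of 500, a run of 248 steps never reaches the first logging point.

```python
import math

num_examples = 989
batch_size = 16
num_epochs = 4
train_runtime = 72.1547  # seconds, from TrainOutput

# The last batch of each epoch is smaller than 16 but still counts as a step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

steps_per_second = total_steps / train_runtime
samples_per_second = num_examples * num_epochs / train_runtime
print(steps_per_epoch, total_steps)
print(round(steps_per_second, 3), round(samples_per_second, 3))
```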

In [14]:
summary_ids = model.generate(tokenized_text, 
                         num_beams=4,  # number of beams (beam width)
                         no_repeat_ngram_size=3,  # block repeated 3-grams to avoid repetition
                         min_length=30,  # minimum number of tokens in the summary
                         max_length=100,  # maximum number of tokens in the summary
                         early_stopping=True)  # stop once num_beams finished candidates exist
output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print ("Summarized text: \n",output)
print("Original text length:", len(text), "Summarized text length:", len(output))

Summarized text: 
 the Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. No one making under $400,000 per year will pay a penny more in taxes.
Original text length: 441 Summarized text length: 341


In [15]:
trainer.save_model("summary")  # save the model
# Load the saved model
tokenizer = T5TokenizerFast.from_pretrained('./summary')
model = T5ForConditionalGeneration.from_pretrained('./summary')

Saving model checkpoint to summary
Configuration saved in summary\config.json
Model weights saved in summary\pytorch_model.bin
tokenizer config file saved in summary\tokenizer_config.json
Special tokens file saved in summary\special_tokens_map.json
Copy vocab file to summary\spiece.model
loading file spiece.model
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading configuration file ./summary\config.json
Model config T5Config {
  "_name_or_path": "t5-small",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_

## 5. Korean Document Summarization

In [16]:
text = """디아블로는 액션 롤플레잉 핵 앤드 슬래시 비디오 게임이다. 
플레이어는 주변 환경을 마우스로 사용해 영웅을 움직이게 한다. 
주문을 외는 등의 다른 활동은 키보드 입력으로 이루어진다. 
플레이어는 이 게임에서 장비를 획득하고, 주문을 배우고, 적을 쓰러뜨리며, NPC와 대화를 나눌 수 있다.
지하 미궁은 주어진 형식이 있고 부분적으로 반복되는 형태가 존재하나 전체적으로 보면 무작위로 생성된다. 
예를 들어 지하 묘지의 경우에는 긴 복도와 닫힌 문들이 존재하고, 동굴은 좀 더 선형 형태를 띠고 있다. 
플레이어에게는 몇몇 단계에서 무작위의 퀘스트를 받는다. 
이 퀘스트는 선택적인 사항이나 플레이어의 영웅들을 성장시키거나 줄거리를 이해하는데 도움을 준다. 
그러나 맨 뒤에 두 퀘스트는 게임을 끝내기 위해 완료시켜야 한다."""

preprocess_text = text.strip().replace("\n","")

In [17]:
from transformers import PreTrainedTokenizerFast
from transformers import BartForConditionalGeneration

tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-summarization')
model = BartForConditionalGeneration.from_pretrained('gogamza/kobart-summarization')

tokenized_text = tokenizer.encode(preprocess_text, return_tensors="pt")
summary_ids = model.generate(tokenized_text,
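                             num_beams=4,  # number of beams (beam width)
                             no_repeat_ngram_size=3,  # block repeated 3-grams to avoid repetition
                             min_length=10,  # minimum number of tokens in the summary
                             max_length=150,  # maximum number of tokens in the summary
                             early_stopping=True)  # stop once num_beams finished candidates exist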
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)

loading file tokenizer.json from cache at C:\Users\user/.cache\huggingface\hub\models--gogamza--kobart-summarization\snapshots\8a63d6913edc0e16a902e3fa8b688a134f0dd776\tokenizer.json
loading file added_tokens.json from cache at C:\Users\user/.cache\huggingface\hub\models--gogamza--kobart-summarization\snapshots\8a63d6913edc0e16a902e3fa8b688a134f0dd776\added_tokens.json
loading file special_tokens_map.json from cache at C:\Users\user/.cache\huggingface\hub\models--gogamza--kobart-summarization\snapshots\8a63d6913edc0e16a902e3fa8b688a134f0dd776\special_tokens_map.json
loading file tokenizer_config.json from cache at None
loading configuration file config.json from cache at C:\Users\user/.cache\huggingface\hub\models--gogamza--kobart-summarization\snapshots\8a63d6913edc0e16a902e3fa8b688a134f0dd776\config.json
You passed along `num_labels=3` with an incompatible id to label map: {'0': 'NEGATIVE', '1': 'POSITIVE'}. The number of labels wil be overwritten to 2.
Model config BartConfig {
  "_

디아블로는 액션 롤플레잉 핵 앤드 슬래시 비디오 게임이다.


In [18]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/mT5_multilingual_XLSum")
model = AutoModelForSeq2SeqLM.from_pretrained("csebuetnlp/mT5_multilingual_XLSum")

tokenized_text = tokenizer.encode(preprocess_text, return_tensors="pt")
summary_ids = model.generate(tokenized_text,
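                             num_beams=4,  # number of beams (beam width)
                             no_repeat_ngram_size=2,  # block repeated 2-grams to avoid repetition
                             min_length=10,  # minimum number of tokens in the summary
                             max_length=150,  # maximum number of tokens in the summary
                             early_stopping=True)  # stop once num_beams finished candidates exist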
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)

loading configuration file config.json from cache at C:\Users\user/.cache\huggingface\hub\models--csebuetnlp--mT5_multilingual_XLSum\snapshots\2437a524effdbadc327ced84595508f1e32025b3\config.json
Model config MT5Config {
  "_name_or_path": "csebuetnlp/mT5_multilingual_XLSum",
  "architectures": [
    "MT5ForConditionalGeneration"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-06,
  "length_penalty": 0.6,
  "max_length": 84,
  "model_type": "mt5",
  "no_repeat_ngram_size": 2,
  "num_beams": 4,
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class"

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
loading weights file pytorch_model.bin from cache at C:\Users\user/.cache\huggingface\hub\models--csebuetnlp--mT5_multilingual_XLSum\snapshots\2437a524effdbadc327ced84595508f1e32025b3\pytorch_model.bin
All model checkpoint weights were used when initializing MT5ForConditionalGeneration.

All the weights of MT5ForConditionalGeneration were initialized from the model checkpoint at csebuetnlp/mT5_multilingual_XLSum.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MT5ForConditionalGeneration for predictions without further training.


디아블 게임의 이야기를 들어봤다.
