### Summarization with the CNN/DailyMail dataset

In [1]:
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", '3.0.0')

In [2]:
dataset["train"].features

{'article': Value(dtype='string', id=None),
 'highlights': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None)}

In [3]:
dataset["train"][10]["article"][:2000]
summarys = {}

### nltk, sent tokenize

In [4]:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")

[nltk_data] Downloading package punkt to /home/tommy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [5]:
import nltk
nltk.data.path.append('/usr/local/share/nltk_data')

# download and locate 'punkt'
nltk.download('popular')
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /home/tommy/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     /home/tommy/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to
[nltk_data]    |     /home/tommy/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     /home/tommy/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to
[nltk_data]    |     /home/tommy/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /home/tommy/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!

True

### Baseline for summarization models
- A common baseline for summarization is simply the **first three sentences** of the article

In [6]:
test = dataset["train"][10]["article"][:2000]

sent_tokenize(test)[:3] # sentence splitting

['WASHINGTON (CNN) -- As he awaits a crucial progress report on Iraq, President Bush will try to put a twist on comparisons of the war to Vietnam by invoking the historical lessons of that conflict to argue against pulling out.',
 'President Bush pauses Tuesday during a news conference at the  North American Leaders summit in Canada.',
 'On Wednesday in Kansas City, Missouri, Bush will tell members of the Veterans of Foreign Wars that "then, as now, people argued that the real problem was America\'s presence and that if we would just withdraw, the killing would end," according to speech excerpts released Tuesday by the White House.']
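
The lead-3 baseline shown above can be wrapped in a small helper. This is a minimal sketch: `lead_n_summary` is a hypothetical name, and the regex splitter is only a crude stand-in for `sent_tokenize` (it would break on abbreviations such as "U.S."), used here so the example runs without downloading NLTK data.

```python
import re


def lead_n_summary(text, n=3):
    """Return the first n sentences of text, one per line."""
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    # In the notebook itself, sent_tokenize(text) would be the better choice.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return "\n".join(sentences[:n])


sample = "First point. Second point! Third point? Fourth point."
print(lead_n_summary(sample))
```

With `sent_tokenize` swapped in for the regex, the result could be stored next to the model outputs, for example as `summarys["baseline"]`.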

### str + "TL;DR"
- "TL;DR" stands for "too long; didn't read"
- The Korean equivalent is roughly "3 줄 요약좀.." ("give me a three-line summary")

### gpt2 for summarization

In [7]:
from transformers import pipeline, set_seed


set_seed(42)
pipe = pipeline("text-generation", model="gpt2-xl")
query = test + "\nTL;DR\n"
pip_out = pipe(query, max_length = 512, clean_up_tokenization_spaces = True)

2024-12-08 19:12:45.975179: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1733652766.104591    1012 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1733652766.143635    1012 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-08 19:12:46.471085: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. M

In [8]:
summarys["gpt2"] = pip_out[0]["generated_text"].split("\nTL;DR\n")[-1]
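
The prompt construction and post-processing from the two cells above can be factored into helpers. A minimal sketch: `build_prompt` and `extract_summary` are hypothetical names, and the `generated` string is a made-up stand-in for model output, not something GPT-2 produced.

```python
PROMPT_SUFFIX = "\nTL;DR\n"


def build_prompt(article):
    """Append the TL;DR cue so a language model continues with a summary."""
    return article + PROMPT_SUFFIX


def extract_summary(generated_text):
    """Keep only the text that follows the last TL;DR cue."""
    # A text-generation pipeline returns the prompt followed by the continuation,
    # so everything before the cue is the original article.
    return generated_text.split(PROMPT_SUFFIX)[-1]


article = "Some long article text."
generated = build_prompt(article) + "A short summary."  # stand-in for pipe(...) output
print(extract_summary(generated))
```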

### T5 for summarization

In [9]:
pipe = pipeline("summarization", model="t5-large")
pip_out = pipe(test)
summarys["t5"]=pip_out[0]["summary_text"]

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [10]:
summarys

{'gpt2': "To be fair, you can't compare this war to prior conflicts in east Asia in the sense that there's a difference between military intervention and war. Also, the president's speech excerpts are not necessarily a full transcript. The White House says it will release the full remarks sometime in the next 2-3 days.",
 't5': 'president bush will try to put a twist on comparisons of the war to Vietnam . he\'ll tell veterans of foreign war that "the real problem was America\'s presence" "the price of america\'s withdrawal was paid by millions of innocent citizens" speech excerpts released by the white house .'}

### BART for summarization

In [11]:
pipe = pipeline("summarization", model="facebook/bart-large-cnn", device="cuda")
pip_out = pipe(test)
summarys["bart"]=pip_out[0]["summary_text"]

In [12]:
summarys

{'gpt2': "To be fair, you can't compare this war to prior conflicts in east Asia in the sense that there's a difference between military intervention and war. Also, the president's speech excerpts are not necessarily a full transcript. The White House says it will release the full remarks sometime in the next 2-3 days.",
 't5': 'president bush will try to put a twist on comparisons of the war to Vietnam . he\'ll tell veterans of foreign war that "the real problem was America\'s presence" "the price of america\'s withdrawal was paid by millions of innocent citizens" speech excerpts released by the white house .',
 'bart': "President Bush to tell Veterans of Foreign Wars about Vietnam War. He will argue withdrawal from Vietnam emboldened terrorists, he will say. Bush will cite Osama bin Laden's quote that U.S. would rise against Iraq war. Senate Majority Leader Harry Reid says comparison ignores difference between wars."}

### PEGASUS for summarization

In [13]:
pipe = pipeline("summarization", model="google/pegasus-cnn_dailymail", device="cuda")
pip_out = pipe(test)


Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
summarys["pegasus"]=pip_out[0]["summary_text"].replace('<n>','')

### Comparison across models

In [15]:
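# NOTE: dataset["test"][10] is a different article from dataset["train"][10],
# which the summaries above were generated from.
print("[ground truth]")
print(dataset["test"][10]["article"], '\n')

for k, v in summarys.items():
    print(f"[{k}]")
    print(v, end="\n\n")

[ground truth]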
London (CNN)A 19-year-old man was charged Wednesday with terror offenses after he was arrested as he returned to Britain from Turkey, London's Metropolitan Police said. Yahya Rashid, a UK national from northwest London, was detained at Luton airport on Tuesday after he arrived on a flight from Istanbul, police said. He's been charged with engaging in conduct in preparation of acts of terrorism, and with engaging in conduct with the intention of assisting others to commit acts of terrorism. Both charges relate to the period between November 1 and March 31. Rashid is due to appear in Westminster Magistrates' Court on Wednesday, police said. CNN's Lindsay Isaac contributed to this report. 

[gpt2]
To be fair, you can't compare this war to prior conflicts in east Asia in the sense that there's a difference between military intervention and war. Also, the president's speech excerpts are not necessarily a full transcript. The White House says it will release the full remarks so

### Model evaluation (BLEU: Bilingual Evaluation Understudy)
- Checks how many words or n-grams of the generated text line up with the reference text (a precision-based metric)
    - Modified precision ($p_n$)
        $$ p_n = \frac{\sum_{\text{n-gram} \in snt'}Count_{\text{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in snt}Count(\text{n-gram})}$$
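        where<br>
            $snt$ : the **generated** (candidate) sentence <br>
            $snt'$ : the reference sentence <br>
            $\text{n-gram}$ : a sequence of $n$ consecutive words <br>
            $Count_{\text{clip}}(\text{n-gram})$ : the clipped count of an $\text{n-gram}$, i.e. its count in the generated sentence capped at its count in the reference<br>
            $Count(\text{n-gram})$ : the number of occurrences of the $\text{n-gram}$ in the generated sentence <br>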
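    - Example (1-gram)
        - ref = "the cat is on the mat"
        - gen = "<u>the</u> the the the <u>the</u> the"

        $$  p_{1} = \frac{\text{clipped number of matching words}}{\text{total number of generated words}} =  \frac{2}{6}$$
        "the" occurs only twice in the reference, so only two of the six generated occurrences count.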

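    - Term that accounts for recall, the **brevity penalty (BP)**: it measures whether the generated sentence is shorter than the reference; when the generated sentence is at least as long, $BP = 1$
        $$BP = \min(1, e^{1-l_{\text{ref}}/l_{\text{gen}}})$$
        - It takes the output length into account.
        - Short outputs tend to inflate the precision $p_n$,
        - so outputs shorter than the reference are penalized.
        - When the generated text is longer than the reference, the term is 1.
    - BLEU-N score
        - Geometric mean of the modified precisions: $$\left(\prod_{n=1}^{N} p_n\right)^{1/N} = \exp{\left(\frac{1}{N}\sum_{n=1}^{N} \log p_n \right)}$$
        - BLEU-N: $$\text{BLEU-N} = \text{BP} \times \left(\prod_{n=1}^{N} p_n\right)^{1/N}$$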
- Drawbacks:
    - **Synonyms** are not taken into account.
    - It expects **already tokenized text**. <br>
        e.g.<br>
        reference : [I'm a doctor] -> ['I',"'",'m',' a','doctor']<br>
        candidate : [I am a doctor] -> ['I',' am',' a','doctor']<br>
        Here I'm and I am are not treated as equivalent. The token counts can also differ, which can trigger the length penalty.
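
The definitions above can be sketched in plain Python. This is an illustrative single-reference implementation on whitespace tokens, not the code used by `evaluate`; the helper names (`ngrams`, `modified_precision`, `brevity_penalty`, `bleu_score`) are hypothetical, and returning 0 when any precision is 0 is a simplification (no smoothing).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram matches divided by the number of candidate n-grams."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip each candidate count at the count seen in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


def brevity_penalty(candidate, reference):
    """min(1, exp(1 - l_ref / l_gen)): penalizes candidates shorter than the reference."""
    if not candidate:
        return 0.0
    return min(1.0, math.exp(1 - len(reference) / len(candidate)))


def bleu_score(candidate, reference, max_n=4):
    """Brevity penalty times the geometric mean of p_1 .. p_max_n."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # simplification: no smoothing
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty(candidate, reference) * geo_mean


ref = "the cat is on the mat".split()
print(modified_precision("the the the the the the".split(), ref, 1))
print(bleu_score("the cat the the the is on the mat mat mat mat".split(), ref, max_n=2))
```

The first value is the 2/6 from the 1-gram example; the second is the BLEU-2 of a longer candidate, for which the brevity penalty is 1.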

In [16]:
# !pip install evaluate
# !pip install rouge_score

$$\text{BLEU-N} = \text{BP} \times \left(\prod_{n=1}^{N} p_n\right)^{1/N}$$

In [17]:
# 1-gram test
import evaluate
bleu= evaluate.load("bleu")


ref = "the cat is on the mat"
can = "the the the the the the"

b_result =bleu.compute(predictions=[can], references=[ref])
print(b_result)

{'bleu': 0.0, 'precisions': [0.3333333333333333, 0.0, 0.0, 0.0], 'brevity_penalty': 1.0, 'length_ratio': 1.0, 'translation_length': 6, 'reference_length': 6}


In [27]:
# n-gram test
bleu = evaluate.load("bleu")

ref = "the cat is on the mat"
# can = "the the the the the the"
can_for_2_gram = "the cat the the the is on the mat mat mat mat"

# max_order sets the largest n-gram order used (here BLEU-2)
b_result = bleu.compute(predictions=[can_for_2_gram], references=[ref], max_order=2)
labels = ["BLEU-N", "p", "BP", "len_ratio", "len_gen", "len_ref"]
for f, (k, v) in zip(labels, b_result.items()):
    print(f"{f:20s}({k:20s}):{v}")

print("""
The generated sentence is longer than the reference,
so the brevity penalty is 1.
""")

BLEU-N              (bleu                ):0.4264014327112209
p                   (precisions          ):[0.5, 0.36363636363636365]
BP                  (brevity_penalty     ):1.0
len_ratio           (length_ratio        ):2.0
len_gen             (translation_length  ):12
len_ref             (reference_length    ):6

The generated sentence is longer than the reference,
so the brevity penalty is 1.



In [22]:
# !pip install sacrebleu

In [None]:
# SacreBLEU: builds the tokenization step into the metric

sacrebleu = evaluate.load("sacrebleu")

sb_result = sacrebleu.compute(predictions=[can_for_2_gram], references=[ref])
sb_result

{'score': 25.211936184349828,
 'counts': [6, 4, 2, 1],
 'totals': [12, 11, 10, 9],
 'precisions': [50.0, 36.36363636363637, 20.0, 11.11111111111111],
 'bp': 1.0,
 'sys_len': 12,
 'ref_len': 6}
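
The `score` field above can be rebuilt from the reported `counts`, `totals`, `sys_len` and `ref_len`. A minimal sketch, assuming the score is 100 × BP × the geometric mean of the four precisions; `bleu_from_counts` is a hypothetical helper, and it skips the smoothing that SacreBLEU applies when a count is zero.

```python
import math


def bleu_from_counts(counts, totals, sys_len, ref_len):
    """Recompute a 0-100 BLEU score from per-order match counts and totals."""
    precisions = [c / t for c, t in zip(counts, totals)]
    if min(precisions) == 0:
        return 0.0  # simplification: no smoothing
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    # Brevity penalty: 1 when the system output is at least as long as the reference.
    bp = 1.0 if sys_len >= ref_len else math.exp(1 - ref_len / sys_len)
    return 100 * bp * geo_mean


score = bleu_from_counts(counts=[6, 4, 2, 1], totals=[12, 11, 10, 9], sys_len=12, ref_len=6)
print(score)
```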

### Model evaluation (ROUGE: Recall-Oriented Understudy for Gisting Evaluation)
- ROUGE-N recall (normalized by the reference) $$ \text{ROUGE-N} = \frac{\sum_{\text{n-gram} \in \text{ref}} \min(\text{Count}_{\text{gen}}(\text{n-gram}), \text{Count}_{\text{ref}}(\text{n-gram}))}{\sum_{\text{n-gram} \in \text{ref}} \text{Count}_{\text{ref}}(\text{n-gram})}$$
    $$ \frac{\text{sum over reference n-grams of min(count in generated, count in reference)}}{\text{total number of n-grams in the reference}}$$

- ROUGE-N precision (normalized by the generated text) $$ \text{ROUGE-N} = \frac{\sum_{\text{n-gram} \in \text{gen}} \min(\text{Count}_{\text{gen}}(\text{n-gram}), \text{Count}_{\text{ref}}(\text{n-gram}))}{\sum_{\text{n-gram} \in \text{gen}} \text{Count}_{\text{gen}}(\text{n-gram})}$$

- ROUGE-N F1 score $$ \text{f1-score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

    > Compare with BLEU $$\text{BLEU-N} = \min(1, e^{1-l_{\text{ref}}/l_{\text{gen}}}) \times \prod_{n=1}^{N} \left( \frac{\sum_{\text{n-gram} \in snt'}Count_{\text{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in snt}Count(\text{n-gram})} \right)^{1/N}$$

- ROUGE-L
    - LCS score (Longest Common Subsequence)
        - It can be illustrated at the character level ("<u>appl</u>e" and "<u>appl</u>ication")
        - With $X = \text{apple}$ (length $m = 5$) and $Y = \text{application}$ (length $n = 11$):
          $$\text{LCS(X,Y)} = 4$$
          $$R_{\text{LCS}} = \frac{\text{LCS(X,Y)}}{m} = \frac{4}{5}$$
          $$P_{\text{LCS}} = \frac{\text{LCS(X,Y)}}{n} = \frac{4}{11}$$
          $$F_{\text{LCS}} = \frac{(1 + \beta^2) R_{\text{LCS}} P_{\text{LCS}}}{R_{\text{LCS}} + \beta^2 P_{\text{LCS}}}$$
            - With $\beta = 1$ this reduces to the usual **f1-score** formula <br>(the `evaluate` ROUGE metric used below reports this $\beta = 1$ value)

- METEOR and ROUGE

In [35]:
rouge = evaluate.load("rouge")

r_result = rouge.compute(predictions=[can], references=[ref])
print("ROUGE")
print(r_result)

ROUGE
{'rouge1': 0.3333333333333333, 'rouge2': 0.0, 'rougeL': 0.3333333333333333, 'rougeLsum': 0.3333333333333333}
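
The values above can be checked by hand. The sketch below is a plain-Python illustration of ROUGE-N and ROUGE-L on whitespace tokens with β = 1, not the `rouge_score` implementation; all helper names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Counter of contiguous n-grams."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def rouge_n(generated, reference, n):
    """Return (precision, recall, f1) from clipped n-gram overlap."""
    gen, ref = ngram_counts(generated, n), ngram_counts(reference, n)
    overlap = sum((gen & ref).values())  # '&' keeps the minimum count per n-gram
    precision = overlap / max(sum(gen.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall, f1(precision, recall)


def lcs_length(x, y):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(generated, reference):
    """Return (precision, recall, f1) based on the LCS length."""
    lcs = lcs_length(reference, generated)
    precision = lcs / len(generated) if generated else 0.0
    recall = lcs / len(reference) if reference else 0.0
    return precision, recall, f1(precision, recall)


ref_tokens = "the cat is on the mat".split()
can_tokens = "the the the the the the".split()
print(rouge_n(can_tokens, ref_tokens, 1))
print(rouge_n(can_tokens, ref_tokens, 2))
print(rouge_l(can_tokens, ref_tokens))
print(lcs_length("apple", "application"))
```

For this pair, the F1 values agree with the `rouge1`, `rouge2` and `rougeL` entries printed by `evaluate`, and the last line is the character-level LCS from the apple/application example.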


In [50]:
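import pandas as pd
all_scores = {}

# NOTE: the summaries were generated from dataset["train"][10]["article"], but this cell
# scores them against dataset["test"][10]["article"], a different article; the matching
# reference would be dataset["train"][10]["highlights"].
for k, v in summarys.items():
    scores = rouge.compute(references=[dataset["test"][10]["article"]], predictions=[v])
    all_scores[k] = scores

display(pd.DataFrame(all_scores).T)
"PEGASUS is the only model with a non-zero ROUGE-2; T5 leads on ROUGE-1 and ROUGE-L"

| model | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| gpt2 | 0.16092 | 0.0 | 0.08046 | 0.08046 |
| t5 | 0.179641 | 0.0 | 0.107784 | 0.107784 |
| bart | 0.096386 | 0.0 | 0.072289 | 0.072289 |
| pegasus | 0.134228 | 0.013605 | 0.080537 | 0.080537 |


'PEGASUS is the only model with a non-zero ROUGE-2; T5 leads on ROUGE-1 and ROUGE-L'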

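- For reference, the METEOR score $$\text{METEOR} = \frac{10 \times \text{Precision} \times \text{Recall}}{\text{Recall} + 9 \times \text{Precision}} \times \left(1 - \text{Penalty} \right)$$

    where, with $m$ the number of matched words,
    $$\text{Precision} = \frac{m}{l_{\text{gen}}}$$
    $$\text{Recall} = \frac{m}{l_{\text{ref}}}$$
    $$\text{Penalty} = 0.5 \times \left(\frac{n_{\text{chunk}}}{m}\right)^3$$
    - What counts as a chunk
        - A run of matched words that appear contiguously and in the same order in both the generated and the reference sentence
        - When the two sentences are identical there is a single chunk, so the penalty is at its smallest
        - The more chunks the matches are split into, the larger the penalty
        - In other words, sentences whose word order differs more are penalized more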

In [None]:
meteor = evaluate.load("meteor")
m_result = meteor.compute(predictions=[can], references=[ref])
print("*"*200)
print(m_result)

********************************************************************************************************************************************************************************************************
{'meteor': 0.16666666666666666}


[nltk_data] Downloading package wordnet to /home/tommy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/tommy/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/tommy/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
