### Summarization with the CNN/DailyMail Dataset

In [1]:
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", '3.0.0')

In [2]:
dataset["train"].features

{'article': Value(dtype='string', id=None),
 'highlights': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None)}

In [3]:
dataset["train"][10]["article"][:2000]
summarys = {}

### nltk, sent tokenize

In [4]:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")

[nltk_data] Downloading package punkt to /home/tommy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [5]:
import nltk
nltk.data.path.append('/usr/local/share/nltk_data')

# Download 'punkt' and make sure nltk can find it
nltk.download('popular')
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /home/tommy/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     /home/tommy/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to
[nltk_data]    |     /home/tommy/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     /home/tommy/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to
[nltk_data]    |     /home/tommy/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /home/tommy/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!

True

### Baseline for summarization models
- A common baseline for summarization is simply to take the **first three sentences** of the article

In [6]:
test = dataset["train"][10]["article"][:2000]

sent_tokenize(test)[:3] # sentence splitting

['WASHINGTON (CNN) -- As he awaits a crucial progress report on Iraq, President Bush will try to put a twist on comparisons of the war to Vietnam by invoking the historical lessons of that conflict to argue against pulling out.',
 'President Bush pauses Tuesday during a news conference at the  North American Leaders summit in Canada.',
 'On Wednesday in Kansas City, Missouri, Bush will tell members of the Veterans of Foreign Wars that "then, as now, people argued that the real problem was America\'s presence and that if we would just withdraw, the killing would end," according to speech excerpts released Tuesday by the White House.']
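
The first-three-sentences baseline above can be wrapped in a small helper. This is only a sketch: `lead_n_summary` and `naive_sentence_split` are hypothetical names, and the regex splitter is a simplified stand-in so the block runs without nltk data. In the notebook, `sent_tokenize` would be passed as the `splitter` instead.

```python
import re


def naive_sentence_split(text):
    # Simplified stand-in for nltk's sent_tokenize: split after ., ! or ?
    # followed by whitespace. It will mis-split abbreviations such as "U.S.".
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def lead_n_summary(text, n=3, splitter=naive_sentence_split):
    # Baseline summary: keep only the first n sentences of the text.
    return "\n".join(splitter(text)[:n])


sample = "First sentence. Second one! Third one? Fourth sentence."
print(lead_n_summary(sample))
```

With nltk set up as in the cells above, `lead_n_summary(test, splitter=sent_tokenize)` would return the same three sentences shown in the output, joined by newlines.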

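### str + "TL;DR"
- "TL;DR" stands for "too long; didn't read"
- The Korean equivalent is "3 줄 요약좀.." ("give me a three-line summary, please")
- Appending it to a text prompts a language model to continue with a summary of that text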
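
A minimal sketch of the prompting trick, independent of any particular model. The helper names `make_tldr_prompt` and `extract_tldr_summary` and the constant `TLDR_MARKER` are hypothetical, and the "generation" here is faked by string concatenation purely to show how the summary is separated from the prompt.

```python
TLDR_MARKER = "\nTL;DR\n"


def make_tldr_prompt(text):
    # Append the marker so a text-generation model continues with a summary.
    return text + TLDR_MARKER


def extract_tldr_summary(generated_text):
    # Text-generation output echoes the prompt, so keep only what follows
    # the last marker. Stripping whitespace is an extra step added here.
    return generated_text.split(TLDR_MARKER)[-1].strip()


prompt = make_tldr_prompt("A long article body ...")
fake_generation = prompt + "A short summary."  # stands in for model output
print(extract_tldr_summary(fake_generation))
```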

### gpt2 for summarization

In [None]:
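from transformers import pipeline, set_seed

set_seed(42)  # fix the random seed so that sampling is reproducible
pipe = pipeline("text-generation", model="gpt2-xl")
# Appending "TL;DR" prompts the model to continue with a summary
query = test + "\nTL;DR\n"
pip_out = pipe(query, max_length=512, clean_up_tokenization_spaces=True)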

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. M

In [31]:
summarys["gpt2"] = pip_out[0]["generated_text"].split("\nTL;DR\n")[-1]

### T5 for summarization

In [None]:
pipe = pipeline("summarization", model="t5-large")
pip_out = pipe(test)
summarys["t5"] = pip_out[0]["summary_text"]

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [45]:
summarys

{'gpt2': "To be fair, you can't compare this war to prior conflicts in east Asia in the sense that there's a difference between military intervention and war. Also, the president's speech excerpts are not necessarily a full transcript. The White House says it will release the full remarks sometime in the next 2-3 days.",
 't5': 'president bush will try to put a twist on comparisons of the war to Vietnam . he\'ll tell veterans of foreign war that "the real problem was America\'s presence" "the price of america\'s withdrawal was paid by millions of innocent citizens" speech excerpts released by the white house .'}

### BART for summarization

In [46]:
pipe = pipeline("summarization", model="facebook/bart-large-cnn", device="cuda")
pip_out = pipe(test)
summarys["bart"] = pip_out[0]["summary_text"]

In [47]:
summarys

{'gpt2': "To be fair, you can't compare this war to prior conflicts in east Asia in the sense that there's a difference between military intervention and war. Also, the president's speech excerpts are not necessarily a full transcript. The White House says it will release the full remarks sometime in the next 2-3 days.",
 't5': 'president bush will try to put a twist on comparisons of the war to Vietnam . he\'ll tell veterans of foreign war that "the real problem was America\'s presence" "the price of america\'s withdrawal was paid by millions of innocent citizens" speech excerpts released by the white house .',
 'bart': "President Bush to tell Veterans of Foreign Wars about Vietnam War. He will argue withdrawal from Vietnam emboldened terrorists, he will say. Bush will cite Osama bin Laden's quote that U.S. would rise against Iraq war. Senate Majority Leader Harry Reid says comparison ignores difference between wars."}

### PEGASUS for summarization

In [49]:
pipe = pipeline("summarization", model="google/pegasus-cnn_dailymail", device="cuda")
pip_out = pipe(test)


Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [59]:
# PEGASUS marks line breaks with the literal token "<n>"; strip it from the summary
summarys["pegasus"] = pip_out[0]["summary_text"].replace('<n>', '')

### Comparison across models

In [92]:
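# NOTE: this prints the article of test sample 10, while the summaries above were
# generated from train sample 10; the matching reference summary for them would be
# dataset["train"][10]["highlights"].
print("[ground truth]")
print(dataset["test"][10]["article"], '\n')

for k, v in summarys.items():
    print(f"[{k}]")
    print(v, end="\n\n")

[ground truth]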
London (CNN)A 19-year-old man was charged Wednesday with terror offenses after he was arrested as he returned to Britain from Turkey, London's Metropolitan Police said. Yahya Rashid, a UK national from northwest London, was detained at Luton airport on Tuesday after he arrived on a flight from Istanbul, police said. He's been charged with engaging in conduct in preparation of acts of terrorism, and with engaging in conduct with the intention of assisting others to commit acts of terrorism. Both charges relate to the period between November 1 and March 31. Rashid is due to appear in Westminster Magistrates' Court on Wednesday, police said. CNN's Lindsay Isaac contributed to this report. 

[gpt2]
To be fair, you can't compare this war to prior conflicts in east Asia in the sense that there's a difference between military intervention and war. Also, the president's speech excerpts are not necessarily a full transcript. The White House says it will release the full remarks so

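### Model evaluation (BLEU)
- BLEU checks how many words or n-grams of the generated text also appear in the reference text, i.e. it is a precision-based metric
    - 
        $$ P_n = \frac{\sum_{\text{n-gram} \in snt'}Count_{\text{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in snt}Count(\text{n-gram})}$$
            where<br>
            $snt$ : the generated (candidate) sentence <br>
            $snt'$ : the reference sentence <br>
            $\text{n-gram}$ : a sequence of $n$ consecutive words <br>
            $Count_{\text{clip}}(\text{n-gram})$ : the number of times the $\text{n-gram}$ occurs in the generated sentence, clipped at the number of times it occurs in the reference sentence <br>
            $Count(\text{n-gram})$ : the number of times the $\text{n-gram}$ occurs in the generated sentence <br>
    - Example (1-gram)
        - ref = "the cat is on the mat"
        - ca = "<u>the</u> the the the <u>the</u> the"
        - "the" occurs 6 times in ca but only twice in ref, so its count is clipped to 2

        $$  P_{1} = \frac{\text{clipped number of matching words}}{\text{total number of generated words}} =  \frac{2}{6}$$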
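
The clipped precision $P_n$ can be checked with a few lines of plain Python. This is a sketch of the precision term only, assuming whitespace tokenization and a single reference. Full BLEU additionally combines several n-gram orders and applies a brevity penalty, neither of which is implemented here. The function names `ngrams` and `clipped_precision` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    # All runs of n consecutive tokens, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    # Count n-grams in the candidate and in the reference.
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    # Clip each candidate count at the count observed in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


ref = "the cat is on the mat"
ca = "the the the the the the"
p1 = clipped_precision(ca, ref, n=1)
print(p1)  # 2 clipped matches out of 6 generated words
```

Without clipping, every "the" in the candidate would count as a match and the precision would be 6/6, which is why the counts are capped by the reference.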

In [None]:
# BLEU example
