### Text Generation

- **Autoregressive model**: predicts future data from past data
- **Causal language model**: predicts the next word from the preceding words

- How GPT-2 generates text
    $$ P(y_1, \dots, y_N | x) = \prod_{t=1}^{N} P(y_t | y_{<t}, x) $$

    $$ \therefore P(y_1, \dots, y_N | x) = P(y_1 | x) \times P(y_2 | y_1, x) \times P(y_3 | y_1, y_2, x) \times \dots \times P(y_N | y_1, \dots, y_{N-1}, x) $$

    - **Interpretation**:
        - $ P(y_1, \dots, y_N | x) $: the joint probability of the whole output sequence $ y_1, \dots, y_N $ given the input $ x $
        - $ \prod $: the conditional probability of each word is computed and the results are multiplied together
        - $ y_{<t} = y_1, \dots, y_{t-1} $: the preceding words
        - $ P(y_t | y_{<t}, x) $: a conditional probability, i.e. the probability of $ y_t $ given the input $ x $ and the preceding words $ y_1, \dots, y_{t-1} $
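The factorization above can be sketched in a few lines of Python. This is a minimal illustration, not GPT-2 itself: the probability values are made up, and `sequence_probability` is a hypothetical helper. The sum is taken in log space because a product of many small probabilities tends to underflow.

```python
import math

def sequence_probability(step_probs):
    """Joint probability of a sequence from its per-step conditionals.

    step_probs[k] is assumed to hold P(y_{k+1} | y_{<k+1}, x).
    """
    # Summing logs is equivalent to multiplying the probabilities,
    # but stays numerically stable for long sequences.
    log_prob = sum(math.log(p) for p in step_probs)
    return math.exp(log_prob)

# Made-up conditional probabilities for a 3-word sequence.
step_probs = [0.5, 0.4, 0.9]
joint = sequence_probability(step_probs)
print(f"{joint:.2f}")  # prints 0.18
```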

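- Conditional text generation
    - Text generation given $ x $
        - $ P(y_t = w_i | y_{<t}, x) = \text{softmax}(z_{t,i}) $, where $ z_{t,i} $ is the model's logit for word $ w_i $ at step $ t $ and the right-hand side is the $ i $-th component of the softmax over all logits at that step
        - $ y_t = w_i $: the event that the $ t $-th output word is $ w_i $, the $ i $-th word of the vocabulary
        - $ \hat{y} = \underset{y}{\operatorname{argmax}}\, P(y | x) $: select the sequence $ y $ with the highest conditional probability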
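To make the per-step softmax concrete, here is a minimal sketch that turns logits into probabilities and picks the most probable word. The four-word vocabulary and the logit values are invented for illustration; a real model such as GPT-2 produces one logit per entry of a vocabulary with tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    # Subtract the largest logit first for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and made-up logits z_t for a single step t.
vocab = ["the", "surf", "board", "wave"]
logits = [1.0, 3.0, 2.0, 0.5]

probs = softmax(logits)
# argmax over the vocabulary: index of the highest probability.
best = max(range(len(vocab)), key=lambda i: probs[i])
print(vocab[best])  # prints surf
```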

- Example: the prediction process
    - $ x = $ the tokens of "How is the weather today?"
        - $ P(y_1 | x) $ → $ y_1 = \text{argmax}(\text{softmax}(z_1)) = $ "Today's"
        - $ P(y_2 | y_1, x) $ → $ y_2 = \text{argmax}(\text{softmax}(z_2)) = $ "weather is"
        - $ P(y_3 | y_1, y_2, x) $ → $ y_3 = \text{argmax}(\text{softmax}(z_3)) = $ "clear."


### Greedy Search Decoding

- $ \hat{y}_t = \underset{y_t}{\operatorname{argmax}}\, P(y_t | y_{<t}, x) $
    - At each step, select only the single most probable word
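The greedy rule can be sketched with a toy stand-in for a language model. Everything here is hypothetical: `toy_next_token_scores` is a hand-written lookup table whose scores depend only on the last word, whereas a real model conditions on the whole prefix. The loop itself is the greedy procedure: take the argmax at each step and append it.

```python
def toy_next_token_scores(prefix):
    """Hypothetical stand-in for a language model (scores depend only on the last word)."""
    table = {
        "<s>":     {"today": 2.0, "weather": 0.5, "is": 0.1, "clear": 0.1, "</s>": 0.0},
        "today":   {"today": 0.0, "weather": 2.5, "is": 1.0, "clear": 0.2, "</s>": 0.1},
        "weather": {"today": 0.0, "weather": 0.0, "is": 3.0, "clear": 0.5, "</s>": 0.2},
        "is":      {"today": 0.3, "weather": 0.1, "is": 0.0, "clear": 2.0, "</s>": 0.4},
        "clear":   {"today": 0.1, "weather": 0.1, "is": 0.1, "clear": 0.0, "</s>": 3.0},
    }
    return table[prefix[-1]]

def greedy_decode(max_steps=10):
    """Repeatedly append the highest-scoring next word until </s> or max_steps."""
    tokens = ["<s>"]
    for _ in range(max_steps):
        scores = toy_next_token_scores(tokens)
        # Softmax is monotonic, so the argmax of the raw scores
        # equals the argmax of the resulting probabilities.
        next_token = max(scores, key=scores.get)
        if next_token == "</s>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker

print(greedy_decode())  # prints ['today', 'weather', 'is', 'clear']
```

Because only the top choice is kept at every step, a sequence that starts with a slightly less likely word but has a higher overall probability is never explored.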

### Greedy Search Decoding Example with GPT-2

In [3]:
import torch

from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

In [200]:
import pandas as pd

input_txt = "Surfing is a surface water sport in which an individual, a surfer (or two in tandem surfing),"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)

In [201]:
iterations = []

In [202]:
n_steps = 8
choice_per_step = 5

with torch.no_grad():
    for _ in range(n_steps):
        iteration = {}
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)
        # Logits at the last position of the first (and only) batch item
        next_token_logits = output.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)

        # Record the most probable tokens at this step
        for choice_idx in range(choice_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].to("cpu")
            token_choice = f"{tokenizer.decode(token_id)}:({100 * token_prob:.2f})%"
            iteration[f"choice {choice_idx+1}"] = token_choice
        # Greedy step: append the single most probable token to the input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=1)
        iterations.append(iteration)
pd.DataFrame(iterations)

| Step | Input | choice 1 | choice 2 | choice 3 | choice 4 | choice 5 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Surfing is a surface water sport in which an i... | and:(11.29)% | or:(10.31)% | rides:(6.55)% | a:(4.38)% | is:(3.69)% |
| 1 | Surfing is a surface water sport in which an i... | a:(60.53)% | an:(7.63)% | /:(5.86)% | their:(3.80)% | the:(3.64)% |
| 2 | Surfing is a surface water sport in which an i... | surf:(35.98)% | board:(21.00)% | boat:(8.92)% | wave:(3.64)% | small:(1.48)% |
| 3 | Surfing is a surface water sport in which an i... | board:(94.73)% | board:(4.49)% | boat:(0.27)% | -:(0.06)% | boat:(0.05)% |
| 4 | Surfing is a surface water sport in which an i... | (:(19.04)% | are:(16.42)% | or:(9.11)% | ride:(6.42)% | travel:(5.82)% |
| 5 | Surfing is a surface water sport in which an i... | or:(70.51)% | a:(2.77)% | usually:(2.56)% | called:(1.75)% | also:(1.55)% |
| 6 | Surfing is a surface water sport in which an i... | two:(63.94)% | a:(3.49)% | surf:(2.76)% | more:(2.65)% | three:(2.11)% |
| 7 | Surfing is a surface water sport in which an i... | in:(33.78)% | surf:(33.14)% | ):(8.48)% | boards:(3.87)% | or:(3.48)% |


### The generate() Function

In [None]:
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_new_tokens=n_steps, do_sample=False)  # limit the number of newly generated tokens

print(tokenizer.decode(output[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Surfing is a surface water sport in which an individual, a surfer (or two in tandem surfing), and a surfboard (or two in
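The warning above appears because only `input_ids` was passed to `generate()`. The following is a minimal sketch of one way to avoid it. It assumes the whole tokenizer output is forwarded so that `attention_mask` is included, and that reusing the EOS token as the pad token is acceptable for a single unpadded prompt. `greedy_generate` is a hypothetical helper name, not part of transformers.

```python
def greedy_generate(model, tokenizer, text, max_new_tokens=8):
    """Greedy generation with the attention mask and pad token set explicitly."""
    # The tokenizer returns both input_ids and attention_mask.
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,                             # forwards input_ids and attention_mask
        max_new_tokens=max_new_tokens,
        do_sample=False,                      # greedy search
        pad_token_id=tokenizer.eos_token_id,  # assumption: GPT-2 has no pad token, so reuse EOS
    )
    return tokenizer.decode(output[0])

# Usage with the objects defined earlier in this notebook:
# print(greedy_generate(model, tokenizer, input_txt, max_new_tokens=n_steps))
```

With the same model and prompt, this should yield the same greedy continuation as the call above, since greedy search is deterministic.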


In [None]:
output = model.generate(input_ids, max_length=128, do_sample=False)  # limit the total length (prompt plus generated tokens)

print(tokenizer.decode(output[0]))


Surfing is a surface water sport in which an individual, a surfer (or two in tandem surfing), and a surfboard (or two in tandem surfboards) are propelled by the force of the waves against the body of the surfer. Surfing is a sport that is popular in the United States, and is also popular in other countries.<|endoftext|>


### Beam Search Decoding