### Text generation

- **Autoregressive model**: predicts future data from past data
- **Causal language model**: predicts the next word from the preceding words

- How GPT-2 generates text
    $$ P(y_1, \dots, y_N | x) = \prod_{t=1}^{N} P(y_t | y_{<t}, x) $$

    $$ \therefore P(y_1, \dots, y_N | x) = P(y_1 | x) \times P(y_2 | y_1, x) \times P(y_3 | y_1, y_2, x) \times \dots \times P(y_N | y_1, \dots, y_{N-1}, x) $$

    - **Interpretation**:
        - $ P(y_1, \dots, y_N | x) $: the joint probability of the whole output sequence given the input $ x $, i.e. the probability that these $ N $ words appear in this order
        - $ \prod $: the per-word conditional probabilities are multiplied together
        - $ y_{<t} = y_1, \dots, y_{t-1} $: the preceding words
        - $ P(y_t | y_{<t}, x) $: the conditional probability of $ y_t $ given the input $ x $ and the preceding words $ y_1, \dots, y_{t-1} $
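
To make the chain rule above concrete, here is a minimal sketch that multiplies per-step conditional probabilities into a sequence probability. The three probability values are made up for illustration and do not come from GPT-2.

```python
import math

# Hypothetical per-step conditional probabilities P(y_t | y_<t, x)
# for a 3-token sequence; the values are assumptions for illustration.
step_probs = [0.5, 0.4, 0.9]

# Chain rule: the sequence probability is the product of the step probabilities.
sequence_prob = math.prod(step_probs)
print(f"P(y_1, y_2, y_3 | x) = {sequence_prob:.2f}")
```

Each factor is at most 1, so the product can only shrink as the sequence gets longer.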

- Conditional text generation
    - Text generation given an input $ x $
        - $ P(y_t = w_i | y_{<t}, x) = \text{softmax}(z_t)_i $, where $ z_{t,i} $ is the model's logit for word $ w_i $ at step $ t $
        - $ y_t = w_i $: the event that the $ i $-th vocabulary word $ w_i $ is chosen at step $ t $
        - $ \hat{y} = \argmax\limits_{y} P(y | x) $: choose the sequence $ y $ that maximizes the conditional probability
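
The softmax and argmax steps above can be sketched in plain Python. The three-word vocabulary and the logit values are assumptions chosen for illustration; a real model produces one logit per vocabulary entry.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and logits z_t for a single decoding step.
vocab = ["sunny", "rainy", "cloudy"]
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)
best = max(range(len(vocab)), key=lambda i: probs[i])  # argmax over the vocabulary
print(vocab[best], round(probs[best], 3))
```

Softmax preserves the ordering of the logits, so taking the argmax of the probabilities selects the same word as taking the argmax of the logits.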

- Example: the prediction process
    - $ x = $ tokens of "How is the weather today?"
        - $ y_1 = \text{argmax}_{w_i} P(y_1 = w_i | x) = \text{argmax}(\text{softmax}(z_1)) = $ "Today's"
        - $ y_2 = \text{argmax}_{w_i} P(y_2 = w_i | y_1, x) = \text{argmax}(\text{softmax}(z_2)) = $ "weather"
        - $ y_3 = \text{argmax}_{w_i} P(y_3 = w_i | y_1, y_2, x) = \text{argmax}(\text{softmax}(z_3)) = $ "is clear."


### Greedy search decoding

- $ \hat{y}_t = \argmax\limits_{y_t} P(y_t|y_{<t},x) $
    - At each step, select only the single most probable word
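
The greedy rule can be sketched without a neural network by using a small lookup table in place of the model. The table `NEXT_TOKEN_PROBS` and the function `greedy_decode` are hypothetical and only condition on the last token, which is a simplification of what GPT-2 does.

```python
# Hypothetical next-token distributions P(y_t | y_{t-1}); a real model
# conditions on the whole prefix and on x, not just on the last token.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"wave": 0.5, "board": 0.3, "</s>": 0.2},
    "a": {"wave": 0.9, "</s>": 0.1},
    "wave": {"</s>": 0.7, "the": 0.3},
    "board": {"</s>": 1.0},
}

def greedy_decode(start="<s>", eos="</s>", max_steps=10):
    """Pick the single most probable next token at every step."""
    tokens = [start]
    for _ in range(max_steps):
        dist = NEXT_TOKEN_PROBS[tokens[-1]]
        next_token = max(dist, key=dist.get)  # argmax over next tokens
        tokens.append(next_token)
        if next_token == eos:
            break
    return tokens

print(greedy_decode())
```

With this table the greedy path starts with "the" (0.6) and reaches a total probability of 0.6 × 0.5 × 0.7 = 0.21, although the path starting with "a" has a higher total of 0.4 × 0.9 × 0.7 = 0.252. Picking the locally best token does not guarantee the most probable sequence.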

### Greedy search decoding with GPT-2

In [31]:
import torch

from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)



In [32]:
import pandas as pd

input_txt = "Surfing is a surface water sport in which an individual, a surfer (or two in tandem surfing),"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)

In [33]:
iterations = []

In [34]:
n_steps = 8
choice_per_step = 5

with torch.no_grad():
    for _ in range(n_steps):
        iteration = {}
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)
        next_token_logits = output.logits[0,-1,:]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)

        # Store the top `choice_per_step` most probable tokens
        for choice_idx in range(choice_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].to("cpu")
            token_choice = f"{tokenizer.decode(token_id)}:({100 * token_prob:.2f})%"
            iteration[f"choice {choice_idx+1}"] = token_choice
        # Greedy step: append the most probable token to the input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=1)
        iterations.append(iteration)
pd.DataFrame(iterations)

| Step | Input | choice 1 | choice 2 | choice 3 | choice 4 | choice 5 |
|---|---|---|---|---|---|---|
| 0 | Surfing is a surface water sport in which an i... | and:(11.29)% | or:(10.31)% | rides:(6.55)% | a:(4.38)% | is:(3.69)% |
| 1 | Surfing is a surface water sport in which an i... | a:(60.53)% | an:(7.63)% | /:(5.86)% | their:(3.80)% | the:(3.64)% |
| 2 | Surfing is a surface water sport in which an i... | surf:(35.98)% | board:(21.00)% | boat:(8.92)% | wave:(3.64)% | small:(1.48)% |
| 3 | Surfing is a surface water sport in which an i... | board:(94.73)% | board:(4.49)% | boat:(0.27)% | -:(0.06)% | boat:(0.05)% |
| 4 | Surfing is a surface water sport in which an i... | (:(19.04)% | are:(16.42)% | or:(9.11)% | ride:(6.42)% | travel:(5.82)% |
| 5 | Surfing is a surface water sport in which an i... | or:(70.51)% | a:(2.77)% | usually:(2.56)% | called:(1.75)% | also:(1.55)% |
| 6 | Surfing is a surface water sport in which an i... | two:(63.94)% | a:(3.49)% | surf:(2.76)% | more:(2.65)% | three:(2.11)% |
| 7 | Surfing is a surface water sport in which an i... | in:(33.78)% | surf:(33.14)% | ):(8.48)% | boards:(3.87)% | or:(3.48)% |


### The `generate()` function

In [35]:
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_new_tokens=n_steps, do_sample=False)  # limit the number of newly generated tokens

print(tokenizer.decode(output[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Surfing is a surface water sport in which an individual, a surfer (or two in tandem surfing), and a surfboard (or two in


In [36]:
output = model.generate(input_ids, max_length=128, do_sample=False)  # cap the total length (prompt plus generated tokens)

print(tokenizer.decode(output[0]))



Surfing is a surface water sport in which an individual, a surfer (or two in tandem surfing), and a surfboard (or two in tandem surfboards) are propelled by the force of the waves against the body of the surfer. Surfing is a sport that is popular in the United States, and is also popular in other countries.<|endoftext|>


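### Beam search decoding (with $\log$ probability and `no_repeat_ngram_size`)
- Instead of keeping only the single best token, track the top-$b$ most probable next tokens at each step; $b$ is the number of **partial hypotheses**, also called **beams**
- Stop when the maximum length or an EOS token is reached
- Score sequences with **log probabilities** rather than raw probabilities
    - $ \prod_{t=1}^{N} P(y_t | y_{<t}, x) $ is a **product** of numbers below 1, so it becomes extremely small and **numerically unstable**.
        - If every word has probability $0.5$ and the sequence has $512$ tokens, the probability is
            $$0.5 ^ {512} \approx 7.46 \times 10^{-155}$$
    - Taking the log turns the product into a sum
        $$ \log \left( \prod_{t=1}^{N} P(y_t | y_{<t}, x) \right) = \sum_{t=1}^{N} \log P(y_t | y_{<t}, x)$$
        $$ \therefore \log (0.5 ^ {512}) = 512 \times \log {0.5} \approx -354.89 $$
    - These log values are used to **compare the relative probabilities** of sequences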
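
The numerical argument above can be checked directly with standard-library floats. The length 2000 is an arbitrary longer sequence added here for illustration; it is not a value used elsewhere in these notes.

```python
import math

p = 0.5
n_short, n_long = 512, 2000  # 2000 is an assumed, longer length for illustration

# Direct product: tiny for 512 tokens, and it underflows to 0.0 for 2000 tokens.
prod_short = math.prod([p] * n_short)
prod_long = math.prod([p] * n_long)

# Sum of logs: stays in a comfortable numeric range for both lengths.
log_short = sum(math.log(p) for _ in range(n_short))
log_long = sum(math.log(p) for _ in range(n_long))

print(prod_short, prod_long)
print(round(log_short, 2), round(log_long, 2))
```

Once the product underflows to zero, two different sequences can no longer be told apart, while their log probabilities remain comparable.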

In [37]:
import torch.nn.functional as F

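# Log probability of each label token
def token_log_probs_from_logits(logits, labels):
    logp = F.log_softmax(logits, dim=-1)
    # Pick the log probability assigned to each label token
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
    return logp_label

# Log probability of a whole sequence
# Note: `input_len` is accepted but unused, so the sum also covers the prompt tokens.
# Sequences generated from the same prompt share that term, so comparisons still hold.
def sequence_log_probs(model, labels, input_len=0):
    with torch.no_grad():
        output = model(labels)
        # Logits at position t predict token t + 1, so shift by one to align them
        log_probs = token_log_probs_from_logits(
            output.logits[:,:-1,:],
            labels[:,1:])
        seq_log_prob = torch.sum(log_probs)
    return seq_log_prob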
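
The one-position shift between logits and labels can be seen on a toy example in plain Python. The logits, the 4-word vocabulary size, and the token ids below are assumptions for illustration; the helper `log_softmax` is a hand-written stand-in for `F.log_softmax`.

```python
import math

def log_softmax(logits):
    """Log of the softmax, computed with the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

# Hypothetical logits for a 3-token sequence over a 4-word vocabulary:
# logits[t] is the prediction for the token at position t + 1.
logits = [
    [2.0, 0.5, 0.1, 0.1],
    [0.2, 0.1, 3.0, 0.3],
    [1.0, 1.0, 1.0, 1.0],
]
token_ids = [0, 2, 1]

# Align predictions with targets: drop the last logits row and the first token.
pairs = zip(logits[:-1], token_ids[1:])
token_log_probs = [log_softmax(row)[target] for row, target in pairs]
sequence_log_prob = sum(token_log_probs)
print(token_log_probs, sequence_log_prob)
```

A sequence of 3 tokens yields only 2 scored positions, because nothing precedes the first token and the last row of logits predicts a token that is not part of the sequence.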

In [51]:
# Log probability of the text generated earlier (greedy search)
logp = sequence_log_probs(model, output, input_len=len(input_ids[0]))
print(tokenizer.decode(output[0]))
print(f"log_prob:{logp:0.2f}")

Surfing is a surface water sport in which an individual, a surfer (or two in tandem surfing), and a surfboard (or two in tandem surfboards) are propelled by the force of the waves against the body of the surfer. Surfing is a sport that is popular in the United States, and is also popular in other countries.<|endoftext|>
log_prob:-119.17


In [60]:
# Compare with the log probability of text generated by beam search

output_beam = model.generate(input_ids, max_length=128, num_beams = 5, do_sample = False)
logp = sequence_log_probs(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"log_prob:{logp:0.2f}")

# The sequence produced by beam search has a higher log_prob than the greedy one.



Surfing is a surface water sport in which an individual, a surfer (or two in tandem surfing), and a surfboard (or two in tandem surfboards) are propelled through the water by the force of the waves. Surfing is a sport in which the surfer is propelled through the water by the force of the waves. Surfing is a sport in which the surfer is propelled through the water by the force of the waves. Surfing is a sport in which the surfer is propelled through the water by the force of the waves. Surfing is a sport in which the surfer is propelled through the water by
log_prob:-109.15
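
The idea behind `num_beams` can be sketched on a toy model. The lookup table `NEXT_TOKEN_PROBS` and the function `beam_search` are hypothetical; this is not the Hugging Face implementation, which also handles batching, length penalties, and other details omitted here.

```python
import math

# Hypothetical next-token distributions; a real model conditions on the
# whole prefix rather than only on the last token.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"wave": 0.5, "board": 0.3, "</s>": 0.2},
    "a": {"wave": 0.9, "</s>": 0.1},
    "wave": {"</s>": 0.7, "the": 0.3},
    "board": {"</s>": 1.0},
}

def beam_search(num_beams=2, start="<s>", eos="</s>", max_steps=10):
    """Keep the num_beams best partial hypotheses, scored by summed log probability."""
    beams = [([start], 0.0)]  # (tokens, log probability)
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:
                candidates.append((tokens, score))  # finished hypotheses are carried over
                continue
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Prune to the num_beams highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        if all(tokens[-1] == eos for tokens, _ in beams):
            break
    return beams

best_tokens, best_score = beam_search()[0]
print(best_tokens, round(math.exp(best_score), 3))
```

In this table a greedy choice would start with "the" (0.6) and finish with probability 0.21. Keeping two beams lets the search hold on to "a" long enough to find the sequence with probability 0.252.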


In [62]:
# Log probability of beam search text with an n-gram repetition limit (no n-gram of the given size may appear twice)

output_beam = model.generate(input_ids,
                             max_length=128,
                             num_beams = 5,
                             no_repeat_ngram_size = 2,
                             do_sample = False)
logp = sequence_log_probs(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"log_prob:{logp:0.2f}")

# The log probability is lower, but the text is more coherent and less repetitive.



Surfing is a surface water sport in which an individual, a surfer (or two in tandem surfing), and a surfboard are propelled by the force of the waves against the body. Surfing can be done in a variety of ways, but the most common is to stand on the board and propel it with one's feet.

Surfboards are made of wood, plastic, or fiberglass, and are usually made from a single piece of material. Surfboards come in all shapes and sizes, from small, lightweight boards for beginners, to large, heavy surfboards for experienced surfers. There are many different types of
log_prob:-184.41
