# Text Generation

This notebook walks through generating natural-language text with a pretrained causal language model (CLM).  
GPT2-XL is the 1.5B-parameter version of GPT-2, a Transformer-based CLM.  
We practice greedy search decoding, beam search decoding, random sampling, and Top-K/Top-P sampling.  

## 0. Setup

In [1]:
# Required in the MLP Suwon environment
import os

os.environ['REQUESTS_CA_BUNDLE'] = '/etc/ssl/certs/ca-certificates.crt'
os.environ['HTTP_PROXY'] ='http://75.17.107.42:8080'
os.environ['HTTPS_PROXY'] ='http://75.17.107.42:8080'

In [2]:
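# Required in the MLP Suwon environment
# Caution: this disables HTTPS certificate verification for the whole process.
import ssl

if hasattr(ssl, '_create_unverified_context'):
    ssl._create_default_https_context = ssl._create_unverified_context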

In [None]:
# !pip install --user transformers==4.38.2

### **GPT2-XL**: 48-layer, 1600-hidden, 25-heads, 1,558M parameters, English model

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
model_ckpt = "gpt2-xl"

# [Exercise] Complete the following code!
# Load the tokenizer used by the pretrained model 'gpt2-xl'.
# Load the causal LM 'gpt2-xl' and move it to `device`.
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForCausalLM.from_pretrained(model_ckpt).to(device)

## 1. Greedy Search Decoding

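The simplest decoding strategy picks the single most probable token at every timestep.  
The model provides a `generate()` method, but we first implement the loop by hand to see how an LLM produces text step by step.  
At each timestep we take the logits at the last position and apply softmax to turn them into a probability distribution over the next token.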

In [4]:
import pandas as pd

input_txt = "Transformers are the"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
iterations = []

n_steps = 8
choices_per_step = 5

with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict()
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)
        # Take the logits at the last position of the first (only) batch item and apply softmax.
        next_token_logits = output.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)
        # Record the tokens with the highest probabilities.
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100 * token_prob: .2f}%)"
            )
            iteration[f"Choice {choice_idx+1}"] = token_choice
        # Append the predicted next token to the input.
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
        iterations.append(iteration)

In [5]:
pd.DataFrame(iterations)

| | Input | Choice 1 | Choice 2 | Choice 3 | Choice 4 | Choice 5 |
|---|---|---|---|---|---|---|
| 0 | Transformers are the | most ( 8.53%) | only ( 4.96%) | best ( 4.65%) | Transformers ( 4.37%) | ultimate ( 2.16%) |
| 1 | Transformers are the most | popular ( 16.78%) | powerful ( 5.37%) | common ( 4.96%) | famous ( 3.72%) | successful ( 3.20%) |
| 2 | Transformers are the most popular | toy ( 10.63%) | toys ( 7.23%) | Transformers ( 6.60%) | of ( 5.46%) | and ( 3.76%) |
| 3 | Transformers are the most popular toy | line ( 34.38%) | in ( 18.20%) | of ( 11.71%) | brand ( 6.10%) | line ( 2.69%) |
| 4 | Transformers are the most popular toy line | in ( 46.28%) | of ( 15.09%) | , ( 4.94%) | on ( 4.40%) | ever ( 2.72%) |
| 5 | Transformers are the most popular toy line in | the ( 65.99%) | history ( 12.42%) | America ( 6.91%) | Japan ( 2.44%) | North ( 1.40%) |
| 6 | Transformers are the most popular toy line in the | world ( 69.26%) | United ( 4.55%) | history ( 4.29%) | US ( 4.23%) | U ( 2.30%) |
| 7 | Transformers are the most popular toy line in ... | , ( 39.73%) | . ( 30.64%) | and ( 9.87%) | with ( 2.32%) | today ( 1.74%) |


In [6]:
max_length = 12

encoded_input = tokenizer(input_txt, return_tensors="pt").to(device)

# [Exercise] Complete the following code!
# Greedy search over the input: set the max_length and do_sample parameters.
output_greedy = model.generate(**encoded_input, pad_token_id=tokenizer.eos_token_id, max_length=max_length, do_sample=False)
print(tokenizer.decode(output_greedy[0]))

Transformers are the most popular toy line in the world,


## 2. Beam Search Decoding

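- **log_probs_from_logits()**: returns the log probability of each label token, gathered from the logits.
- **sequence_logprob()**: sums those token log probabilities into a total log probability for the sequence, skipping the first `input_len` entries.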
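
Why these helpers sum log probabilities instead of multiplying raw probabilities can be seen in a small, model-free sketch. The per-token probabilities below are hypothetical values chosen for illustration, not model outputs.

```python
import math

# Hypothetical per-token probabilities for a long sequence (not model output).
token_probs = [0.5] * 2000

# Multiplying many probabilities underflows to 0.0 in double precision.
product = math.prod(token_probs)

# Summing log probabilities keeps the score finite and comparable.
log_sum = sum(math.log(p) for p in token_probs)

print(product)
print(log_sum)
```

Two sequences whose products both underflow to zero can still be ranked by their summed log probabilities.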

In [7]:
import torch.nn.functional as F

def log_probs_from_logits(logits, labels):
    logp = F.log_softmax(logits, dim=-1)
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
    return logp_label

In [8]:
def sequence_logprob(model, labels, input_len=0):
    with torch.no_grad():
        output = model(labels)
        log_probs = log_probs_from_logits(
            output.logits[:, :-1, :], labels[:, 1:])
        seq_log_prob = torch.sum(log_probs[:, input_len:])
    return seq_log_prob.cpu().numpy()

Beam search decoding keeps track of the `num_beams` most probable partial sequences at each step and extends each of them with candidate next tokens.
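
The idea can be sketched without a neural network. The example below is a minimal illustration, not how `generate()` is implemented: `TOY_MODEL` is a made-up table of next-token probabilities that depend only on the last token, and `beam_search` is a hypothetical helper name.

```python
import math

# Hypothetical toy "model": next-token probabilities given the last token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.3, "dog": 0.3, ".": 0.4},
    "a": {"cat": 0.9, "dog": 0.05, ".": 0.05},
    "cat": {".": 1.0},
    "dog": {".": 1.0},
    ".": {".": 1.0},
}

def beam_search(start, n_steps, num_beams):
    """Return the num_beams best (tokens, log_prob) pairs after n_steps."""
    beams = [([start], 0.0)]
    for _ in range(n_steps):
        candidates = []
        for tokens, score in beams:
            # Extend every surviving beam with every possible next token.
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the num_beams highest-scoring sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

greedy_tokens, greedy_score = beam_search("<s>", n_steps=2, num_beams=1)[0]
beam_tokens, beam_score = beam_search("<s>", n_steps=2, num_beams=2)[0]
print(greedy_tokens, math.exp(greedy_score))
print(beam_tokens, math.exp(beam_score))
```

With one beam the search is greedy: it commits to "the" (0.6) and ends with a sequence of probability 0.24. With two beams it also keeps "a" alive and finds "a cat", whose probability is 0.36.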

In [9]:
max_length = 128
input_txt = """In a shocking finding, scientist discovered \
a herd of unicorns living in a remote, previously unexplored \
valley, in the Andes Mountains. Even more surprising to the \
researchers was the fact that the unicorns spoke perfect English.\n\n
"""
encoded_input = tokenizer(input_txt, return_tensors="pt").to(device)

In [10]:
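# [Exercise] Complete the following code!
# Beam search over the input: set max_length, num_beams, do_sample, and so on.
output_beam = model.generate(**encoded_input, pad_token_id=tokenizer.eos_token_id, max_length=max_length, num_beams=5, do_sample=False)
# Note: input_ids is the tensor left over from the greedy loop in section 1, not this prompt.
# To exclude this prompt from the score, pass encoded_input["input_ids"].shape[1] instead.
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))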

print(tokenizer.decode(output_beam[0]))
print(f"\nLog Probability: {logp:.2f}")

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The discovery of the unicorns was made by a team of scientists from the University of California, Santa Cruz, and the National Geographic Society.


The scientists were conducting a study of the Andes Mountains when they discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English

Log Probability: -133.66


- **no_repeat_ngram_size**: an n-gram penalty that addresses repeated text. With `no_repeat_ngram_size=n`, generation is prevented from producing any n-gram a second time.
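
A rough sketch of the blocking rule, assuming plain string tokens: before each step, find every token that would complete an n-gram already present in the sequence and exclude it (in practice, by giving it a score of negative infinity). `banned_next_tokens` is a hypothetical helper written for this illustration, not the library's code.

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would repeat an n-gram already in `tokens`."""
    if n < 1 or len(tokens) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = set()
    for i in range(len(tokens) - n + 1):
        # If an earlier n-gram starts with the same prefix, ban its last token.
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned

history = "the unicorns spoke the".split()
print(banned_next_tokens(history, n=2))
print(banned_next_tokens(history, n=3))
```

With `n=2` the bigram "the unicorns" has already occurred, so "unicorns" is banned after the second "the". With `n=3` no trigram would be repeated yet, so nothing is banned.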

In [11]:
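# [Exercise] Complete the following code!
# Add the no_repeat_ngram_size setting to the previous beam search call.
output_beam = model.generate(**encoded_input, pad_token_id=tokenizer.eos_token_id, max_length=max_length, num_beams=5, do_sample=False, no_repeat_ngram_size=2)
# Note: input_ids still refers to the greedy-loop tensor from section 1, so input_len is not this prompt's length.
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))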

print(tokenizer.decode(output_beam[0]))
print(f"\nLog Probability: {logp:.2f}")

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The discovery was made by a team of scientists from the University of California, Santa Cruz, and the National Geographic Society.

According to a press release, the scientists were conducting a survey of the area when they came across the herd. They were surprised to find that they were able to converse with the animals in English, even though they had never seen a unicorn in person before. The researchers were

Log Probability: -171.56


## 3. Random Sampling

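Random sampling draws the next token at each timestep from the model's probability distribution over the entire vocabulary.  
- **temperature**: rescaling the logits (dividing them by a temperature T) before the softmax controls the diversity of the output.  
When T << 1, low-probability tokens are suppressed; when T >> 1, the distribution flattens and the token probabilities approach uniform.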
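
The effect of T can be checked on a toy example. The sketch below uses made-up logits for a three-token vocabulary; `softmax_with_temperature` is a helper name introduced here for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # hypothetical logits, not model output
sharp = softmax_with_temperature(toy_logits, temperature=0.5)
base = softmax_with_temperature(toy_logits, temperature=1.0)
flat = softmax_with_temperature(toy_logits, temperature=2.0)

for name, probs in (("T=0.5", sharp), ("T=1.0", base), ("T=2.0", flat)):
    print(name, [round(p, 3) for p in probs])
```

The most likely token gains probability as T drops below 1 and loses it as T rises above 1, while the ranking of the tokens stays the same.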

In [12]:
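# [Exercise] Complete the following code!
# Generate text by random sampling: set max_length, do_sample, and temperature.
# Note: generate() also applies its default top_k=50 when sampling; pass top_k=0 to sample from the full vocabulary.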
output_temp = model.generate(**encoded_input, pad_token_id=tokenizer.eos_token_id, max_length=max_length, do_sample=True, temperature=2.0)

print(tokenizer.decode(output_temp[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


Scientists discovered that Unicarni-hyrde unicorns are now very comfortable moving from place-to-"place," with no need for their mothers on weekends for their protection of breeding spots. To protect their calves which they milk at, many times the mother Unicorn leads them at an astonishing 24 mph across a valley up from one side at full moon to another at full moon so her baby unicorn could


In [13]:
# [Exercise] Complete the following code!
# Generate text with a different temperature: set max_length, do_sample, and temperature.
output_temp = model.generate(**encoded_input, pad_token_id=tokenizer.eos_token_id, max_length=max_length, do_sample=True, temperature=0.5)

print(tokenizer.decode(output_temp[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The researchers believe that the unicorns have been living in the valley for at least 150 years. They are estimated to number between 100 and 200.


The researchers believe that the unicorns have been living in the valley for at least 150 years. They are estimated to number between 100 and 200.

The researchers believe that the unicorns have been living in the valley for at least 150 years.


## 4. Top-K and Top-P Sampling

Both methods rest on the same idea: reduce the number of tokens that can be sampled at each step.  

- Top-K sampling: sample only from the K most probable tokens and exclude the rest,
which cuts off the long tail of the probability distribution.
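
As a minimal sketch of that truncation, assuming a plain list of probabilities: keep the K largest entries, zero out the rest, and renormalize before sampling. `top_k_filter` and the toy distribution are illustrative names and values, not part of the library.

```python
import random

def top_k_filter(probs, k):
    """Zero out all but the k most probable entries and renormalize."""
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and len(probs)")
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in top_ids)
    filtered = [0.0] * len(probs)
    for i in top_ids:
        filtered[i] = probs[i] / kept_mass
    return filtered

toy_probs = [0.5, 0.3, 0.1, 0.06, 0.04]  # hypothetical next-token distribution
filtered_k = top_k_filter(toy_probs, k=2)
print(filtered_k)

# Sample a token id from the truncated distribution.
sampled_id = random.choices(range(len(filtered_k)), weights=filtered_k, k=1)[0]
print(sampled_id)
```

Only the two most probable ids can ever be drawn here, no matter how the remaining probability is spread over the tail.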

In [14]:
# [Exercise] Complete the following code!
# Generate text with Top-K sampling: set max_length, do_sample, and top_k.
output_topk = model.generate(**encoded_input, pad_token_id=tokenizer.eos_token_id, max_length=max_length, do_sample=True, top_k=50)

print(tokenizer.decode(output_topk[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The discovery of a herd herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains proves that they do not necessarily speak a language that is closer to human tongues, research showed.The findings were made by a team of scientists from the US National Park Service conducted in 2012 in a study to further understand the Andes as a natural habitat and a biodiversity hotspot, while


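- Top-P sampling: instead of a fixed cutoff count, specify a probability mass that determines where to cut.  
Sort all tokens by probability in descending order and add them one at a time until their cumulative probability reaches the chosen value p.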
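
The procedure described above can be sketched on plain lists. `top_p_filter` is a hypothetical helper and the two distributions are made up; the point is that the number of surviving tokens adapts to the shape of the distribution.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Renormalize the surviving tokens so they sum to 1.
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / cumulative
    return filtered

peaked = [0.9, 0.05, 0.03, 0.02]  # hypothetical: the model is confident
flat = [0.25, 0.25, 0.25, 0.25]   # hypothetical: the model is uncertain

peaked_kept = sum(1 for x in top_p_filter(peaked, p=0.85) if x > 0)
flat_kept = sum(1 for x in top_p_filter(flat, p=0.85) if x > 0)
print(peaked_kept, flat_kept)
```

With the same `p`, the peaked distribution keeps a single token while the flat one keeps all four, which is the behavior a fixed K cannot provide.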

In [15]:
# [Exercise] Complete the following code!
# Generate text with Top-P sampling: set max_length, do_sample, and top_p.
output_topp = model.generate(**encoded_input, pad_token_id=tokenizer.eos_token_id, max_length=max_length, do_sample=True, top_p=0.90)

print(tokenizer.decode(output_topp[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


"In the early 1970s, while researching a number of ancient documents from Mexico and Peru, I discovered a few documents relating to the Andean region," said Dr. Walter A. Bock, a biologist and co-director of the Institute for Creation Research. "These documents were written in Mayan, Aztec and pre-Aztec Spanish, and they described ancient Mayan and Aztec cities


- Ref. Natural Language Processing with Transformers