#**5.2 Greedy Search Decoding**

##**Loading a GPT-2 Version**

In [1]:
!git clone https://github.com/rickiepark/nlp-with-transformers.git
%cd nlp-with-transformers
from install import *
install_requirements(chapter=5)

fatal: destination path 'nlp-with-transformers' already exists and is not an empty directory.
/content/nlp-with-transformers
⏳ Installing base requirements ...
✅ Base requirements installed!
Using transformers v4.35.2
Using datasets v2.16.1
Using accelerate v0.26.1
Using sentencepiece v0.1.99
No GPU was detected! This notebook can be *very* slow without a GPU 🐢
Go to Runtime > Change runtime type and select a GPU hardware accelerator.


In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# On Colab, loading gpt2-xl causes an out-of-memory error.
# Use "gpt2" or "gpt2-large" instead, or switch to Colab Pro.
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


##**Generating Text**

In [3]:
import pandas as pd

# Use "Transformers are the" as the input prompt
input_txt = "Transformers are the"
# Tokenize input_txt with the tokenizer object; the result is returned as PyTorch tensors.
# input_ids is the tokenized input sequence.
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
iterations = []

# Decode for eight timesteps
n_steps = 8

# Store the five most probable tokens at each timestep to visualize the alternatives
choices_per_step = 5

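# Disable gradient computation to save memory
with torch.no_grad():
  for _ in range(n_steps):
    # Create a dictionary to hold the information for this iteration
    iteration = dict()
    # Decode and store the text currently fed to the model
    iteration["Input"] = tokenizer.decode(input_ids[0])
    # Run the model on the input tensor to get its predictions
    output = model(input_ids=input_ids)

    # Select the logits at the last position of the first batch item and apply softmax.
    # [0, -1, :] picks the first sequence in the batch, its last position,
    # and the logits over the entire vocabulary.
    next_token_logits = output.logits[0, -1, :]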
    next_token_probs = torch.softmax(next_token_logits, dim=-1)
    sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)

    # Store the tokens with the highest probabilities
    for choice_idx in range(choices_per_step):
      token_id = sorted_ids[choice_idx]
      token_prob = next_token_probs[token_id].cpu().numpy()
      token_choice = (
          f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f})"
      )
      iteration[f"Choice {choice_idx+1}"] = token_choice

    # Append the predicted next token to the input
    input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
    iterations.append(iteration)

pd.DataFrame(iterations)

| | Input | Choice 1 | Choice 2 | Choice 3 | Choice 4 | Choice 5 |
|---|---|---|---|---|---|---|
| 0 | Transformers are the | most (9.76) | same (2.94) | only (2.87) | best (2.38) | first (1.77) |
| 1 | Transformers are the most | common (22.90) | powerful (6.88) | important (6.32) | popular (3.95) | commonly (2.14) |
| 2 | Transformers are the most common | type (15.06) | types (3.31) | form (1.91) | way (1.89) | and (1.49) |
| 3 | Transformers are the most common type | of (83.13) | in (3.16) | . (1.92) | , (1.63) | for (0.88) |
| 4 | Transformers are the most common type of | particle (1.55) | object (1.02) | light (0.71) | energy (0.67) | objects (0.66) |
| 5 | Transformers are the most common type of particle | . (14.26) | in (11.57) | that (10.19) | , (9.57) | accelerator (5.81) |
| 6 | Transformers are the most common type of parti... | They (17.48) | \n (15.19) | The (7.06) | These (3.09) | In (3.07) |
| 7 | Transformers are the most common type of parti... | are (38.78) | have (8.14) | can (7.98) | 're (5.04) | consist (1.57) |
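
The loop above always appends the single most probable token. A minimal, dependency-free sketch of that idea follows. `TOY_MODEL` and `greedy_decode` are hypothetical names, the probabilities are invented for illustration (they are not GPT-2 outputs), and the toy model conditions only on the previous token rather than on the full context.

```python
# Hypothetical next-token table: previous token -> {candidate: probability}.
# Values are made up for illustration; they are not GPT-2 outputs.
TOY_MODEL = {
    "the": {"most": 0.6, "best": 0.4},
    "most": {"common": 0.7, "powerful": 0.3},
    "common": {"type": 0.9, "form": 0.1},
}

def greedy_decode(start, n_steps):
    """Extend `start` by repeatedly picking the most probable next token."""
    tokens = [start]
    for _ in range(n_steps):
        dist = TOY_MODEL.get(tokens[-1])
        if dist is None:  # the toy table has no continuation for this token
            break
        tokens.append(max(dist, key=dist.get))
    return tokens

print(greedy_decode("the", 3))  # ['the', 'most', 'common', 'type']
```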


##**Using generate()**

In [4]:
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_new_tokens=n_steps, do_sample=False)
print(tokenizer.decode(output[0]))

Transformers are the most common type of particle. They are


##**Reproducing OpenAI's Unicorn Story**

In [5]:
# Set max_length to a larger value to generate a longer text sequence
max_length = 128

input_txt = """In a shocking finding, scientist discovered \
a herd of unicorns living in a remote, previously unexplored \
valley, in the Andes Mountains. Even more surprising to the \
researchers was the fact that the unicorns spoke perfect English.\n\n
"""

input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
# Pass the input tensor to generate() to have the model produce text
output_greedy = model.generate(input_ids, max_length=max_length,
                               do_sample=False)
# Decode the generated text and print it
print(tokenizer.decode(output_greedy[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a
remote, previously unexplored valley, in the Andes Mountains. Even more
surprising to the researchers was the fact that the unicorns spoke perfect
English.


"The unicorns were very intelligent, and they were very intelligent," said Dr.
David S. Siegel, a professor of anthropology at the University of California,
Berkeley. "They were very intelligent, and they were very intelligent, and they
were very intelligent, and they were very intelligent, and they were very
intelligent, and they were very intelligent, and they were very intelligent, and
they were very


#**5.3 Beam Search Decoding**

In [6]:
# Assume a sequence of t = 1024 tokens in which every token has probability 0.5
0.5 ** 1024

5.562684646268003e-309

##**Log Probabilities**

In [7]:
# Apply the same example as before, this time summing log probabilities
import numpy as np

sum([np.log(0.5)] * 1024)

-709.7827128933695
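
The two cells above rely on the identity log(p_1 · p_2 · … · p_t) = log p_1 + log p_2 + … + log p_t. The sketch below, which uses only the standard library, doubles the sequence length to 2048 tokens (an assumption made purely for illustration) to show the raw product underflowing to zero in double precision while the sum of logs stays finite. `product_and_log_sum` is a hypothetical helper name.

```python
import math

def product_and_log_sum(token_prob, n_tokens):
    """Return the raw product of probabilities and the sum of their logs."""
    product = 1.0
    for _ in range(n_tokens):
        product *= token_prob
    log_sum = math.fsum([math.log(token_prob)] * n_tokens)
    return product, log_sum

product, log_sum = product_and_log_sum(0.5, 2048)
print(product)  # 0.0: the product is too small for a 64-bit float
print(log_sum)  # roughly -1419.57, still easy to represent
```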

###**Comparing Log Probabilities**

####**A Function to Compute the Log Probability of a Single Token**

In [11]:
import torch.nn.functional as F

def log_probs_from_logits(logits, labels):
  # Compute the log-softmax of the input logits
  logp = F.log_softmax(logits, dim=-1)

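  # Extract the log probability of the correct label at each position.
  # labels.unsqueeze(2) adds a trailing dimension so the labels can index dim 2 of logp in torch.gather;
  # squeeze(-1) then removes that extra dimension from the result.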
  logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
  return logp_label
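
To see what the function computes without PyTorch, here is a dependency-free sketch of the same idea for a single sequence: normalize each position's logits with log-softmax, then pick out the entry belonging to the label. `log_softmax` and `label_log_probs` are hypothetical helper names and the logits are invented.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def label_log_probs(logits_per_position, labels):
    """For each position, return the log probability assigned to its label."""
    return [log_softmax(logits)[label]
            for logits, label in zip(logits_per_position, labels)]

# Two positions over a three-token vocabulary (made-up values).
logits = [[2.0, 1.0, 0.1], [0.5, 0.5, 0.5]]
labels = [0, 2]
result = label_log_probs(logits, labels)
print(result)  # the second entry equals log(1/3) because those logits are uniform
```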

####**Getting the Total Log Probability of a Sequence**

In [12]:
# input_len: number of leading positions (the prompt) to leave out of the sum; defaults to 0
def sequence_logprob(model, labels, input_len=0):
    with torch.no_grad():
        output = model(labels)
        log_probs = log_probs_from_logits(
            output.logits[:, :-1, :], labels[:, 1:])

        # Sum every log probability from index input_len onward to get the sequence log probability
        seq_log_prob = torch.sum(log_probs[:, input_len:])
    return seq_log_prob.cpu().numpy()
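
The pairing of `output.logits[:, :-1, :]` with `labels[:, 1:]` reflects that the logits at position i score the token at position i + 1. The sketch below spells out that alignment on a toy token list; `aligned_pairs` is a hypothetical helper and the tokens are illustrative. One consequence of the indexing is that a slice starting at `input_len` begins with the second generated token when `input_len` equals the prompt length, while a slice starting at `input_len - 1` would also cover the first generated token.

```python
def aligned_pairs(token_ids):
    """Pair each position with the token it is asked to predict (the next one)."""
    return list(zip(range(len(token_ids) - 1), token_ids[1:]))

# Four prompt tokens followed by two generated tokens (illustrative only).
tokens = ["In", "a", "shocking", "finding", "The", "unicorns"]
input_len = 4

pairs = aligned_pairs(tokens)
# Each pair is (position whose logits are used, token being scored).
print(pairs[input_len - 1:])  # [(3, 'The'), (4, 'unicorns')]
print(pairs[input_len:])      # [(4, 'unicorns')]
```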

####**Log Probability of the Sequence Generated with Greedy Search**

In [13]:
# Call the function to compute the log probability of the output_greedy sequence
# input_len is set to the length of the input sequence
logp = sequence_logprob(model, output_greedy, input_len=len(input_ids[0]))

# Decode the sequence and print it as text
print(tokenizer.decode(output_greedy[0]))

print(f"\nLog probability: {logp:.2f}")

In a shocking finding, scientist discovered a herd of unicorns living in a
remote, previously unexplored valley, in the Andes Mountains. Even more
surprising to the researchers was the fact that the unicorns spoke perfect
English.


"The unicorns were very intelligent, and they were very intelligent," said Dr.
David S. Siegel, a professor of anthropology at the University of California,
Berkeley. "They were very intelligent, and they were very intelligent, and they
were very intelligent, and they were very intelligent, and they were very
intelligent, and they were very intelligent, and they were very intelligent, and
they were very

Log probability: -83.32


####**Log Probability of the Sequence Generated with Beam Search**

In [14]:
# Enable beam search by passing the number of beams to generate() via num_beams
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5,
                             do_sample=False)

# Call the function to compute the log probability of the output_beam sequence
# input_len is set to the length of the input sequence
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))

# Decode the sequence and print it as text
print(tokenizer.decode(output_beam[0]))

print(f"\nLog probability: {logp:.2f}")

In a shocking finding, scientist discovered a herd of unicorns living in a
remote, previously unexplored valley, in the Andes Mountains. Even more
surprising to the researchers was the fact that the unicorns spoke perfect
English.


The researchers, from the University of California, San Diego, and the
University of California, Santa Cruz, found that the unicorns were able to
communicate with each other in a way that was similar to that of human speech.


"The unicorns were able to communicate with each other in a way that was similar
to that of human speech," said study co-lead author Dr. David J.

Log probability: -78.34
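
Beam search reaches a higher log probability than greedy search here because it keeps several partial sequences alive instead of committing to the best token at every step. The toy sketch below illustrates that effect under stated assumptions: `TOY_MODEL` and `beam_search` are hypothetical, the probabilities are invented, and real implementations add details such as length normalization and end-of-sequence handling that are omitted here.

```python
import math

# Hypothetical conditional probabilities keyed by the prefix generated so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(num_beams, n_steps):
    """Return the best (sequence, log probability) found with `num_beams` beams."""
    beams = [((), 0.0)]  # each beam is (prefix, cumulative log probability)
    for _ in range(n_steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in TOY_MODEL[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Keep only the highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]

greedy_seq, greedy_logp = beam_search(num_beams=1, n_steps=2)  # one beam is greedy search
beam_seq, beam_logp = beam_search(num_beams=2, n_steps=2)
print(greedy_seq, round(math.exp(greedy_logp), 2))  # ('A', 'x') 0.33
print(beam_seq, round(math.exp(beam_logp), 2))      # ('B', 'x') 0.36
```

With a single beam the search commits to "A" and can no longer reach the more probable sequence that starts with "B".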


###**n-gram Penalty**

In [15]:
# no_repeat_ngram_size=2 prevents any 2-gram from appearing more than once in the sequence
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5,
                             do_sample=False, no_repeat_ngram_size=2)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nLog probability: {logp:.2f}")

In a shocking finding, scientist discovered a herd of unicorns living in a
remote, previously unexplored valley, in the Andes Mountains. Even more
surprising to the researchers was the fact that the unicorns spoke perfect
English.


The researchers, from the University of California, San Diego, and the National
Science Foundation (NSF) in Boulder, Colorado, were able to translate the words
of the unicorn into English, which they then translated into Spanish.

"This is the first time that we have translated a language into an English
language," said study co-author and NSF professor of linguistics and
evolutionary biology Dr.

Log probability: -101.87
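
One way to picture the n-gram penalty: before each step, find every token that would complete an n-gram already present in the sequence and set its probability to zero. The sketch below is a simplified illustration of that idea, not the library's implementation, and `banned_next_tokens` is a hypothetical helper name.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would repeat an n-gram already in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size - 1:
        return set()
    prefix_len = ngram_size - 1
    # The last (n - 1) tokens form the prefix of the n-gram being completed.
    current_prefix = tuple(generated[len(generated) - prefix_len:])
    banned = set()
    for start in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[start:start + ngram_size])
        if ngram[:prefix_len] == current_prefix:
            banned.add(ngram[-1])
    return banned

tokens = ["they", "were", "very", "intelligent", "and", "they", "were", "very"]
print(banned_next_tokens(tokens, 2))  # {'intelligent'}
```

Because "very intelligent" already occurs, "intelligent" is blocked after the second "very", which is how the penalty breaks the loop seen in the greedy output.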


#**5.4 Sampling Methods**

##**How Temperature Affects the Generated Text**

###**High Temperature**

In [17]:
# top_k=N restricts sampling to the N most probable next tokens;
# top_k=0 turns that filter off, so only the temperature reshapes the distribution here
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True,
                             temperature=2.0, top_k=0)
print(tokenizer.decode(output_temp[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a
remote, previously unexplored valley, in the Andes Mountains. Even more
surprising to the researchers was the fact that the unicorns spoke perfect
English.


Leave significant community history: actors assault MH buried plight felt gamb
Light SalSab homeworld Background info Lou forcibly taught its upon leaving
Sahulates Wally Green Chevroletasks god Tal Defensive Mobrequently
handsetlaneober ease DodCh Prayer button during GreekingSolution Hindu
occupational Oman contracted throwing Barnett likes friendly Rabbitg texts on
trending3 Creedater Conversion twelve Bluebirds particlesrhcz Thor Dale Dayton
Dennis agency threat encounters understands Kuro licences


###**Low Temperature**

In [21]:
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True,
                             temperature=0.5, top_k=0)
print(tokenizer.decode(output_temp[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a
remote, previously unexplored valley, in the Andes Mountains. Even more
surprising to the researchers was the fact that the unicorns spoke perfect
English.


The scientist, who was not involved in the research, told the newspaper that the
unicorns spoke perfect English and that they were among the first to speak it.


"They were so clever, so clever that it was difficult to believe that they were
speaking English," he said.


The team found that the unicorns were not only able to communicate perfectly
with humans, but also with their human
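
The contrast between the two outputs comes from dividing the logits by the temperature T before the softmax: P(w_i) = exp(z_i / T) / Σ_j exp(z_j / T). The sketch below shows, on invented logits, how T < 1 concentrates probability on the top token and T > 1 flattens the distribution. `softmax_with_temperature` is a hypothetical helper name.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits that are first divided by `temperature`."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up logits for a three-token vocabulary
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```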


#**5.5 Top-k and Nucleus (Top-p) Sampling**

##**Top-k Sampling**

In [24]:
output_topk = model.generate(input_ids, max_length=max_length, do_sample=True,
                             top_k=50)
print(tokenizer.decode(output_topk[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a
remote, previously unexplored valley, in the Andes Mountains. Even more
surprising to the researchers was the fact that the unicorns spoke perfect
English.


The researchers found that the only known language used by the animals was
Bengali. Researchers are unsure whether that language can be interpreted as
Chinese, but it would explain their strange speech patterns.


"The unicorns have the 'perfect' English, and we are not sure that that's what
they all say. The way they speak, we don't know. They don't say 'hello', we
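
Top-k sampling keeps only the k most probable tokens, renormalizes their probabilities, and samples from that reduced set. A minimal sketch on an invented distribution follows; `top_k_filter` is a hypothetical helper name and the sampling step itself is left out.

```python
def top_k_filter(probs, k):
    """Keep the k most probable token indices and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]  # made-up next-token distribution
filtered = top_k_filter(probs, 2)
print(filtered)  # only token indices 0 and 1 remain, with probabilities summing to 1
```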


##**Top-p Sampling**

In [26]:
output_topp = model.generate(input_ids, max_length=max_length, do_sample=True,
                             top_p=0.90)
print(tokenizer.decode(output_topp[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a
remote, previously unexplored valley, in the Andes Mountains. Even more
surprising to the researchers was the fact that the unicorns spoke perfect
English.


"Our study is the first documented instance of a large-scale population-based
study of the diversity of languages that inhabit a region," the researchers
wrote in a release.

One of the most striking findings was the fact that the unicorns were able to
communicate with humans. The researchers discovered that the unicorns learned to
identify specific messages with their tongues.

"When the unicorns shared these
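
Top-p (nucleus) sampling chooses the cutoff dynamically: tokens are added in order of decreasing probability until their cumulative mass reaches `top_p`, and sampling is restricted to that set. A minimal sketch on an invented distribution follows; `top_p_filter` is a hypothetical helper name and the sampling step itself is left out.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]  # made-up next-token distribution
nucleus = top_p_filter(probs, 0.9)
print(sorted(nucleus))  # [0, 1, 2, 3]: four tokens are needed to reach 0.9
```

Unlike a fixed k, the size of the kept set shrinks when the model is confident and grows when the distribution is flat.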
