# Transformers?

Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).

![screensh](https://blog.kakaocdn.net/dn/blla7d/btqBPXAzWdA/1yMKSf4SYWRT9t0yDt2lM1/img.jpg)

# Loading models with the Hugging Face library

In [46]:
!pip install transformers[sentencepiece]
# installs transformers together with the extra sentencepiece library
!pip install datasets
# the datasets library (Hugging Face)



In [47]:
import torch
import torch.nn as nn
import torchtext

In [48]:
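# 26,663 models on the Hub -> switch between them by changing a single line

# BERT
from transformers import BertTokenizer, BertForTokenClassification
# BERT with a token-classification head
# TokenClassification: one classification per token (0 or 1)

# load the bert-base-uncased checkpoint exactly as listed on the Models page
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Why is the tokenizer pretrained too?
# 1. Its tokens are more general.
#    - The pretraining corpus is very large, so the vocabulary covers common tokens.
# 2. It comes as a set with the model.
#    - This matters for the word embeddings. Compare two vocabularies:
#      <pad> -> id 0, it -> id 32
#      <pad> -> id 0, it -> id 25
#      Embedding vector 25 and embedding vector 32 are different vectors.
#    -> What if you want to add new words?
#    - With an original vocab of 30,000, keep the existing entries and assign new tokens from the 30,001st onward.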
model = BertForTokenClassification.from_pretrained('bert-base-uncased')

inputs = tokenizer.encode("Hello, my dog is cute", return_tensors="pt")
# "pt": PyTorch tensor
# "tf": TensorFlow tensor
# omitted: plain Python list
outputs = model(inputs)[0]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u
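
The comments in the cell above say that new words are added after the existing vocabulary so that pretrained embedding rows keep their ids. Below is a minimal sketch of that idea in plain PyTorch. The sizes are toy values chosen for illustration, not BERT's real ones. In `transformers` the same effect is normally obtained with `tokenizer.add_tokens([...])` followed by `model.resize_token_embeddings(len(tokenizer))`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
old_vocab_size, hidden = 5, 4   # toy sizes (assumption); a real table is far larger
old_embedding = nn.Embedding(old_vocab_size, hidden)

num_new_tokens = 2
new_embedding = nn.Embedding(old_vocab_size + num_new_tokens, hidden)
with torch.no_grad():
    # keep every pretrained row at its original index
    new_embedding.weight[:old_vocab_size] = old_embedding.weight

# the new tokens take the ids right after the existing ones
new_token_ids = list(range(old_vocab_size, old_vocab_size + num_new_tokens))
print(new_token_ids)
```

Only the appended rows start from random values, so they are the only part of the table that has to be learned from scratch during fine-tuning.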

In [49]:
outputs

tensor([[[ 0.4450, -0.0710],
         [-0.1201, -0.2184],
         [ 0.0071, -0.1612],
         [-0.4120, -0.3709],
         [-0.2742,  0.4108],
         [-0.3344,  0.2184],
         [-0.2298,  0.3595],
         [-0.2713,  0.5162]]], grad_fn=<AddBackward0>)
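
The tensor above has shape (1, 8, 2): one sentence, eight tokens (`[CLS] hello , my dog is cute [SEP]`), and two labels per token. As a small sketch, the logits can be turned into probabilities and label ids as follows. The values are copied from the output above. Because the classification head was freshly initialized, the resulting labels carry no meaning until the model is fine-tuned.

```python
import torch

# Logits copied from the cell output above: (batch=1, tokens=8, labels=2)
logits = torch.tensor([[[ 0.4450, -0.0710],
                        [-0.1201, -0.2184],
                        [ 0.0071, -0.1612],
                        [-0.4120, -0.3709],
                        [-0.2742,  0.4108],
                        [-0.3344,  0.2184],
                        [-0.2298,  0.3595],
                        [-0.2713,  0.5162]]])

probs = torch.softmax(logits, dim=-1)   # per-token probabilities over the 2 labels
pred_labels = probs.argmax(dim=-1)      # most likely label id for each token

print(logits.shape)   # torch.Size([1, 8, 2])
print(pred_labels)    # tensor([[0, 0, 0, 1, 1, 1, 1, 1]])
```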

In [50]:
# ELECTRA
# GAN-like setup (this checkpoint is the discriminator)
from transformers import ElectraTokenizer, ElectraModel
# logits: values before softmax
# ...ForTokenClassification -> a classification model
# BertModel + Linear layer (the head decides what form the output takes)

# BertForSequenceClassification
# BertModel + Linear layer
# one prediction per sentence
# probability (likelihood): values after softmax

tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
model = ElectraModel.from_pretrained('google/electra-small-discriminator')

inputs = tokenizer.encode("The capital of France is [MASK].", return_tensors="pt")

outputs = model(inputs)

Some weights of the model checkpoint at google/electra-small-discriminator were not used when initializing ElectraModel: ['discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight']
- This IS expected if you are initializing ElectraModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [51]:
outputs

BaseModelOutputWithPastAndCrossAttentions([('last_hidden_state',
                                            tensor([[[ 0.6132,  1.0569, -1.1983,  ...,  1.1837, -1.5054, -0.1459],
                                                     [ 0.1217,  0.0154, -0.5511,  ...,  0.2008, -0.2318, -0.1800],
                                                     [-0.2863,  0.4928, -0.6949,  ...,  0.4721, -0.3949,  0.2883],
                                                     ...,
                                                     [ 1.4331,  1.0827, -0.6423,  ...,  0.3444, -0.2377, -0.3932],
                                                     [-0.4996, -0.8609, -0.5961,  ...,  0.1058, -0.2810, -0.4912],
                                                     [ 0.6112,  1.0582, -1.1976,  ...,  1.1838, -1.5038, -0.1461]]],
                                                   grad_fn=<NativeLayerNormBackward0>))])

In [52]:
outputs.last_hidden_state

tensor([[[ 0.6132,  1.0569, -1.1983,  ...,  1.1837, -1.5054, -0.1459],
         [ 0.1217,  0.0154, -0.5511,  ...,  0.2008, -0.2318, -0.1800],
         [-0.2863,  0.4928, -0.6949,  ...,  0.4721, -0.3949,  0.2883],
         ...,
         [ 1.4331,  1.0827, -0.6423,  ...,  0.3444, -0.2377, -0.3932],
         [-0.4996, -0.8609, -0.5961,  ...,  0.1058, -0.2810, -0.4912],
         [ 0.6112,  1.0582, -1.1976,  ...,  1.1838, -1.5038, -0.1461]]],
       grad_fn=<NativeLayerNormBackward0>)

In [53]:
outputs.last_hidden_state.shape

torch.Size([1, 9, 256])
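
The shape (1, 9, 256) reads as one sentence, nine tokens, and a 256-dimensional hidden vector per token. The notes above describe task models as "base model + Linear layer". The sketch below shows that idea on a random stand-in tensor of the same shape. The head names and `num_labels = 2` are assumptions for illustration, and the real library heads also include details such as dropout that are left out here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in for outputs.last_hidden_state: (batch=1, tokens=9, hidden=256)
last_hidden_state = torch.randn(1, 9, 256)

num_labels = 2                               # assumed number of classes
token_head = nn.Linear(256, num_labels)      # one prediction per token
sentence_head = nn.Linear(256, num_labels)   # one prediction per sentence

token_logits = token_head(last_hidden_state)               # shape (1, 9, 2)
sentence_logits = sentence_head(last_hidden_state[:, 0])   # first token only, shape (1, 2)
sentence_probs = torch.softmax(sentence_logits, dim=-1)    # logits -> probabilities

print(token_logits.shape, sentence_logits.shape)
```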

# Trying AutoModel

In [54]:
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForTokenClassification.from_pretrained('bert-base-uncased')

inputs = tokenizer.encode("Hello, my dog is cute", return_tensors="pt")
outputs = model(inputs)[0]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u
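
The warning above mentions `BertForTokenClassification` even though the code only asked for `AutoModelForTokenClassification`: the Auto class picks the concrete architecture from the configuration. The sketch below shows that dispatch with two tiny, randomly initialized configs, so nothing is downloaded. All sizes are assumptions chosen to keep the models small.

```python
import torch
from transformers import AutoModelForTokenClassification, BertConfig, ElectraConfig

# Tiny random configs (assumed sizes); no pretrained weights are involved.
bert_config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                         num_attention_heads=2, intermediate_size=64, num_labels=2)
electra_config = ElectraConfig(vocab_size=100, embedding_size=16, hidden_size=32,
                               num_hidden_layers=1, num_attention_heads=2,
                               intermediate_size=64, num_labels=2)

# Same Auto class, different concrete model depending on the config type
bert_model = AutoModelForTokenClassification.from_config(bert_config)
electra_model = AutoModelForTokenClassification.from_config(electra_config)

print(type(bert_model).__name__)
print(type(electra_model).__name__)

input_ids = torch.tensor([[1, 2, 3, 4]])
bert_logits = bert_model(input_ids).logits   # (batch, tokens, num_labels)
print(bert_logits.shape)
```

This is why switching architectures in the notebook only requires changing the checkpoint name passed to `from_pretrained`.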

# Trying pretrained models

In [55]:
# using a model that has already been fine-tuned
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-squadv2")
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-base-finetuned-squadv2")


The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.


In [56]:

def get_answer(question, context):
  input_text = "question: %s  context: %s" % (question, context)
  features = tokenizer([input_text], return_tensors='pt')

  output = model.generate(input_ids=features['input_ids'], 
               attention_mask=features['attention_mask'])
  
  return tokenizer.decode(output[0])



# Wikipedia text
context = 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy'
question = 'When did Beyonce start becoming popular?'

get_answer(question,context)

'<pad> late 1990s</s>'
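
The decoded answer still contains the special tokens `<pad>` and `</s>`. The idiomatic fix is `tokenizer.decode(output[0], skip_special_tokens=True)`, which this notebook uses later for summarization. To show the effect without loading a model, here is a string-level sketch. `strip_special_tokens` is a hypothetical helper written for this note, not a library function.

```python
def strip_special_tokens(decoded, special_tokens=("<pad>", "</s>")):
    """Remove special-token markers from an already decoded string (hypothetical helper)."""
    for token in special_tokens:
        decoded = decoded.replace(token, "")
    return decoded.strip()

cleaned = strip_special_tokens("<pad> late 1990s</s>")
print(cleaned)   # late 1990s
```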

In [57]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-question-generation-ap")
# AutoModelWithLMHead is deprecated; T5 is an encoder-decoder, so AutoModelForSeq2SeqLM fits
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-base-finetuned-question-generation-ap")

The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.


In [58]:
def get_question(answer, context, max_length=64):
  input_text = "answer: %s  context: %s </s>" % (answer, context)
  features = tokenizer([input_text], return_tensors='pt')

  output = model.generate(input_ids=features['input_ids'], 
               attention_mask=features['attention_mask'],
               max_length=max_length)

  return tokenizer.decode(output[0])

context = 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy'
answer = '1981'

get_question(answer, context)

'<pad> question: What year was Beyonce born?</s>'

# Processing data with the Hugging Face library

In [59]:
from datasets import load_dataset
datasets = load_dataset('imdb')

Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)



In [60]:
datasets.keys()

dict_keys(['train', 'test', 'unsupervised'])

In [61]:
datasets['train'][0]  # label: 0 = negative, 1 = positive

{'label': 0,
 'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are f

In [62]:
from torch.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        x = item['text']
        y = item['label']
        return x, torch.tensor(y).long()






In [63]:
train_dataset = CustomDataset(datasets['train'])
test_dataset = CustomDataset(datasets['test'])

train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=4, shuffle=False)
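
Each item of `CustomDataset` is a `(text, label)` pair, so the default collate function of `DataLoader` groups the raw strings of a batch together and stacks the labels into one tensor. The sketch below reproduces this with a small hand-written list in place of IMDB. `ToyDataset` and the three toy reviews are assumptions made up for this illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Same structure as CustomDataset above, over a hand-written list (toy data)."""
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        return item['text'], torch.tensor(item['label']).long()

toy = [{'text': 'great film', 'label': 1},
       {'text': 'boring', 'label': 0},
       {'text': 'loved it', 'label': 1}]

loader = DataLoader(ToyDataset(toy), batch_size=2, shuffle=False)
texts, labels = next(iter(loader))
print(list(texts))   # raw strings of the first two items
print(labels)        # labels stacked into a single tensor
```

The texts stay untokenized at this stage, which is why the notebook passes `list(batch[0])` to the tokenizer afterwards.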

In [64]:
batch = next(iter(train_dataloader))

In [65]:
features = tokenizer(list(batch[0]))

In [66]:
features.keys()

dict_keys(['input_ids', 'attention_mask'])
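
The tokenizer call above returns plain lists whose lengths differ from review to review. To build a tensor, every sequence in the batch needs the same length, which is what the `padding`, `truncation`, and `max_length` arguments provide, with `attention_mask` marking the real tokens. The sketch below mimics that behavior on toy id lists. `pad_and_truncate` is a hypothetical helper, and real tokenizers treat special tokens more carefully when truncating.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Pad or cut each id list to max_length and build the matching attention mask."""
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:max_length]                        # truncation: drop ids past max_length
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)      # padding: fill up with pad_id
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

toy_batch = pad_and_truncate(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]],
    max_length=5,
)
print(toy_batch['input_ids'])
print(toy_batch['attention_mask'])
```

Positions with mask value 0 are padding and should be ignored by the model, so the mask is usually passed to the model together with `input_ids`.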

# Let's train a model with just a short piece of code.

In [67]:
from datasets import load_dataset
datasets = load_dataset('imdb')

from torch.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        x = item['text']
        y = item['label']
        return x, torch.tensor(y).long()


train_dataset = CustomDataset(datasets['train'])
test_dataset = CustomDataset(datasets['test'])

train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=4, shuffle=False)

Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)



In [68]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")

# cuda
DEVICE = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = model.to(DEVICE)

# optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# criterion = nn.CrossEntropyLoss()  # not needed: the loss is built into the model and returned when labels are passed

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b
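
The commented-out `criterion` line above says that no separate loss function is needed. As a quick check of that claim, the sketch below compares the loss returned by a sequence-classification model with a manually computed cross-entropy on its logits. It uses a tiny, randomly initialized BERT so that nothing is downloaded. The config sizes and the toy ids are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, BertConfig

torch.manual_seed(0)
# Tiny random BERT (assumed sizes); only the loss computation is of interest here.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64, num_labels=2)
tiny_model = AutoModelForSequenceClassification.from_config(config)

input_ids = torch.tensor([[1, 5, 6, 2], [1, 7, 8, 2]])   # toy token ids
labels = torch.tensor([0, 1])                            # one class id per sentence

out = tiny_model(input_ids, labels=labels)               # loss is computed inside the model
manual_loss = nn.CrossEntropyLoss()(out.logits, labels)  # same loss computed by hand
print(out.loss.item(), manual_loss.item())
```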

In [69]:
def train_epoch(model, dataloader, tokenizer, optimizer):
    model.train()
    train_loss = 0
    for i, (x, y) in enumerate(dataloader):
        # tokenize the raw strings; pad or truncate every review to 512 tokens
        features = tokenizer(list(x), padding='max_length', max_length=512,
                             truncation=True, return_tensors='pt').to(DEVICE)
        y = y.to(DEVICE)
        # features also carries attention_mask, so padding tokens are ignored
        loss = model(**features, labels=y)['loss']
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
        if i % 50 == 0:
            print('Iter [{}/{}] Loss {:.6f}'.format(i+1, len(dataloader), train_loss / (i+1)))

    return train_loss / len(dataloader)

def test_epoch(model, dataloader, tokenizer):
    model.eval()
    preds = []
    labels = []
    with torch.no_grad():
        for x, y in dataloader:
            features = tokenizer(list(x), padding='max_length', max_length=512,
                                 truncation=True, return_tensors='pt').to(DEVICE)
            # feed the tokenized tensors, not the raw strings in x
            out = model(**features)['logits']
            pred = out.argmax(-1)
            preds.append(pred.cpu())
            labels.append(y)
    preds = torch.cat(preds)
    labels = torch.cat(labels)
    acc = (preds == labels).float().mean()
    print('ACC : {:.3f}'.format(acc))
    return preds, labels

def predict(model, tokenizer, sentence):
    model.eval()
    x = tokenizer.encode(sentence, return_tensors='pt',
                         truncation=True, max_length=512).to(DEVICE)
    with torch.no_grad():  # inference only, no gradients needed
        out = model(x)['logits']
    pred = out.argmax(-1)
    return pred.cpu()

In [70]:
EPOCHS=1

for i in range(EPOCHS):
    train_epoch(model, train_dataloader, tokenizer, optimizer)
    test_epoch(model, test_dataloader, tokenizer)

Iter [1/6250] Loss 0.827498


KeyboardInterrupt: ignored

In [91]:
# 1. load the data
dataset_name = "xsum"
datasets = load_dataset(dataset_name)

# 2. pick a model, load its tokenizer and model, and create the optimizer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = model.to(DEVICE)  # GPU if available; DEVICE was defined in an earlier cell

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# 3. customdataset class
from torch.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        x = item['document'] # text
        y = item['summary'] # text
        return x, y

train_dataset = CustomDataset(datasets['train'])
valid_dataset = CustomDataset(datasets['validation'])

train_dataloader = DataLoader(train_dataset, batch_size = 4, shuffle=True)
valid_dataloader = DataLoader(valid_dataset, batch_size = 4, shuffle=False) # validation

# shuffle -> randomly reorders the samples
# data 1, 2, 3, ...

# 4. training code
batch = next(iter(valid_dataloader))
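def tokenizing(batch):
    document = batch[0]
    summary = batch[1]
    document_features = tokenizer(list(document), return_tensors='pt', 
                                  padding='max_length', max_length=512, truncation=True)
    summary_features = tokenizer(list(summary), return_tensors='pt', 
                                padding='max_length', max_length=512, truncation=True) 
    # label positions set to -100 are ignored by the loss, so padding is not scored
    labels = summary_features['input_ids']
    labels[labels == tokenizer.pad_token_id] = -100
    return document_features, summary_features # still on CPU; moved to the GPU by the caller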


for epoch in range(5):
  model.train()
  train_loss = 0
  for batch in train_dataloader:
    #tokenizing + tensor + gpu upload
    document_features, summary_features = tokenizing(batch)
    loss = model(document_features['input_ids'].to(DEVICE),
            attention_mask = document_features['attention_mask'].to(DEVICE),
            labels = summary_features['input_ids'].to(DEVICE))['loss']
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    train_loss += loss.item() # tensor.item() [1] -> 1
  print('train loss:{:.5f}'.format(train_loss/len(train_dataloader)))

# truncation: inputs longer than 512 tokens are cut to 512
# padding: inputs shorter than 512 tokens are padded to 512
# 5. validation
  model.eval()
  valid_loss = 0
  with torch.no_grad():  # no gradients needed for evaluation
    for batch in valid_dataloader:
      #tokenizing + tensor + gpu upload
      document_features, summary_features = tokenizing(batch)
      loss = model(document_features['input_ids'].to(DEVICE),
              attention_mask = document_features['attention_mask'].to(DEVICE),
              labels = summary_features['input_ids'].to(DEVICE))['loss']

      valid_loss += loss.item() # tensor.item() [1] -> 1
  print('valid loss:{:.5f}'.format(valid_loss/len(valid_dataloader)))

Using custom data configuration default
Reusing dataset xsum (/root/.cache/huggingface/datasets/xsum/default/1.2.0/32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934)



KeyboardInterrupt: ignored
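
When summaries are padded to a fixed length and used as `labels`, the padded positions would normally count toward the loss. A common practice is to replace them with -100, the default `ignore_index` of PyTorch's cross-entropy, so that only real tokens are scored. The sketch below checks this on toy tensors. The pad id, vocabulary size, and label values are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
pad_token_id = 0   # assumed pad id for this toy example
vocab_size = 6
labels = torch.tensor([[4, 2, 5, pad_token_id, pad_token_id]])   # 3 real tokens + 2 pads
logits = torch.randn(1, 5, vocab_size)                           # stand-in for model output

masked_labels = labels.clone()
masked_labels[masked_labels == pad_token_id] = -100   # -100 is the default ignore_index

# loss over all 5 positions, with the 2 padded ones ignored
loss_masked = F.cross_entropy(logits.view(-1, vocab_size), masked_labels.view(-1))
# loss over the 3 real positions only
loss_real_only = F.cross_entropy(logits[0, :3], labels[0, :3])
print(loss_masked.item(), loss_real_only.item())
```

Both values agree, which shows that the masked positions do not contribute to the average.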

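# Assignment

- Load a model that has been fine-tuned for text summarization and summarize the passages below.
- You pass the assignment once the output reads like normal text.
- The summary does not have to be perfect. Anything that is not complete nonsense passes!
    - Example of nonsense: "this text is this this text is, . , , pad this thing" (the result of a model that was not trained properly)
    - Example of normal text: "This is a text about the assignment"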

In [92]:
from transformers import PreTrainedTokenizerFast
from transformers import BartForConditionalGeneration

tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-summarization')
model = BartForConditionalGeneration.from_pretrained('gogamza/kobart-summarization')

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BartTokenizer'. 
The class this function is called from is 'PreTrainedTokenizerFast'.



In [93]:
text = "과거를 떠올려보자. 방송을 보던 우리의 모습을. 독보적인 매체는 TV였다. 온 가족이 둘러앉아 TV를 봤다. 간혹 가족들끼리 뉴스와 드라마, 예능 프로그램을 둘러싸고 리모컨 쟁탈전이 벌어지기도  했다. 각자 선호하는 프로그램을 ‘본방’으로 보기 위한 싸움이었다. TV가 한 대인지 두 대인지 여부도 그래서 중요했다. 지금은 어떤가. ‘안방극장’이라는 말은 옛말이 됐다. TV가 없는 집도 많다. 미디어의 혜 택을 누릴 수 있는 방법은 늘어났다. 각자의 방에서 각자의 휴대폰으로, 노트북으로, 태블릿으로 콘텐츠 를 즐긴다."

# 1. Tokenize the text with the tokenizer (see the examples in the Hugging Face library).
raw_input_ids = tokenizer.encode(text)
input_ids = [tokenizer.bos_token_id] + raw_input_ids + [tokenizer.eos_token_id]

# 2. Generate a summary with model.generate.
summary_ids = model.generate(torch.tensor([input_ids]))

# 3. Decode with the tokenizer to turn the ids back into readable text.
tokenizer.decode(summary_ids.squeeze().tolist(), skip_special_tokens=True)

'TV가 없는 집도 많아지고 미디어의 혜 택을 누릴 수 있는 방법은 늘어났다.'

In [94]:
text = '수학에서 순환소수인 0.999…는 실수 1의 또 다른 십진법 소수 표현이다. 즉 "0.999…"와 "1"은 같은 수이다. 이러한 증명은 실수론의 전개, 배경이 있는 가정, 역사적 맥락, 대상이 되는 청자(듣는 사람) 등에 맞는 수준에 따른 것으로서 여러 단계의 수학적 엄밀함을 적절하게 고려한 다양한 정식화가 있다.'

# Reuse the code above and repeat the steps.
raw_input_ids = tokenizer.encode(text)
input_ids = [tokenizer.bos_token_id] + raw_input_ids + [tokenizer.eos_token_id]

# 2. Generate a summary with model.generate.
summary_ids = model.generate(torch.tensor([input_ids]))

# 3. Decode with the tokenizer to turn the ids back into readable text.
tokenizer.decode(summary_ids.squeeze().tolist(), skip_special_tokens=True)

'수학에서 순환소수인 0.999는 실수 1의 또 다른 십진법'

In [96]:
text = '암모니아(영어: ammonia)는 질소와 수소로 이루어진 화합물이다. 분자식은 NH3이다. 상온에서는 특유의 자극적인 냄새가 나는 무색의 기체 상태로 존재하고있다. 대기 중에도 소량의 양이 포함되어 있으며, 천연수에 미량 함유되어 있기도 하다. 토양 중에도 세균의 질소 유기물의 분해 과정에서 생겨난 암모니아가 존재할 수 있다. 대표적인 반자성체 중 하나이다.'

# Reuse the code above and repeat the steps.
raw_input_ids = tokenizer.encode(text)
input_ids = [tokenizer.bos_token_id] + raw_input_ids + [tokenizer.eos_token_id]

# 2. Generate a summary with model.generate.
summary_ids = model.generate(torch.tensor([input_ids]))

# 3. Decode with the tokenizer to turn the ids back into readable text.
tokenizer.decode(summary_ids.squeeze().tolist(), skip_special_tokens=True)

'토양 중에도 질소와 수소로 이루어진 화합물인 암모니아가 존재할'
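
The last two summaries stop in the middle of a sentence. One plausible cause is the generation length limit: `generate` stops once `max_length` tokens are reached, and the default limit is short unless the checkpoint's config overrides it (an assumption here, not verified for this checkpoint). The sketch below shows how `max_length` and `num_beams` are passed to `generate`. It uses a tiny, randomly initialized BART with assumed sizes, so the generated ids are meaningless and only the output length matters.

```python
import torch
from transformers import BartConfig, BartForConditionalGeneration

torch.manual_seed(0)
# Tiny random BART (assumed sizes); nothing is downloaded.
config = BartConfig(vocab_size=50, d_model=16, encoder_layers=1, decoder_layers=1,
                    encoder_attention_heads=2, decoder_attention_heads=2,
                    encoder_ffn_dim=32, decoder_ffn_dim=32, max_position_embeddings=64)
tiny_bart = BartForConditionalGeneration(config)
tiny_bart.eval()

toy_input_ids = torch.tensor([[0, 10, 11, 12, 13, 2]])   # toy ids wrapped in bos (0) and eos (2)

# a tight limit with greedy search, then a looser limit with beam search
short = tiny_bart.generate(toy_input_ids, max_length=8, num_beams=1)
longer = tiny_bart.generate(toy_input_ids, max_length=32, num_beams=4)
print(short.shape, longer.shape)
```

With the summarization model above, the analogous call would be `model.generate(torch.tensor([input_ids]), max_length=..., num_beams=...)`, with values chosen by experiment.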