# Text Generation with the Trained NarrativeKoGPT2

## 1. Google Drive Integration
- Connect Colab to the Google Drive directory where the model files and training data are stored.

### 1.1 Mounting Google Drive
Run the mount cell, open the URL it prints, and enter the authorization code you receive.
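
The mount cell itself is not part of this export. A minimal sketch of what such a cell typically looks like, assuming the standard `google.colab.drive.mount` call and the conventional `/content/drive` mount point. The helper name `mount_google_drive` is hypothetical, and outside Colab it simply reports that nothing was mounted:

```python
def mount_google_drive(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab.

    Returns True if the mount call was made, False if the google.colab
    package is unavailable (for example, on a local machine).
    """
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    # Prints an authorization URL and waits for the code to be entered.
    drive.mount(mount_point)
    return True


mounted = mount_google_drive()
print("Drive mounted:", mounted)
```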

**Check the NarrativeKoGPT2 path under the Colab directory**

**Install the required packages**

**Add the project directory to the system path**
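
The cell for this step is not included here. As a rough sketch, adding the project directory to `sys.path` lets the later `from model.torch_gpt2 import ...` and `from util.data import ...` imports resolve. The helper `add_project_to_path` and the example directory are assumptions; substitute the actual location of NarrativeKoGPT2 on your Drive:

```python
import os
import sys


def add_project_to_path(project_dir):
    """Put project_dir at the front of sys.path, without adding duplicates."""
    project_dir = os.path.abspath(project_dir)
    if project_dir not in sys.path:
        sys.path.insert(0, project_dir)
    return project_dir


# Hypothetical location; adjust to where the repository actually lives.
project_dir = add_project_to_path("/content/drive/My Drive/NarrativeKoGPT2")
print(project_dir)
```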

## 2. KoGPT2 Text Generation

### 2.1 Import Packages

In [1]:
import os
import sys
import random
import torch
import time
from torch.utils.data import DataLoader  # data loader
from gluonnlp.data import SentencepieceTokenizer
from kogpt2.utils import download, get_tokenizer, tokenizer
from kogpt2.pytorch_kogpt2 import get_pytorch_kogpt2_model
from model.torch_gpt2 import GPT2Config, GPT2LMHeadModel
from util.data import NovelDataset
import gluonnlp
import sampling
import kss

### 2.2 KoGPT-2 Config

In [2]:
ctx = 'cpu'  # Device to run on: 'cpu' or 'cuda' (use 'cuda' on a Colab GPU runtime)
cachedir = '~/kogpt2/'  # Download directory for the KoGPT-2 model
# epoch = 500  # Number of training epochs
save_path = 'checkpoints/'
load_path = 'checkpoints/narrativeKoGPT2_checkpoint_tokenized_ver3_bat1_epoch100.tar'
use_cuda = True  # Flag for using the GPU in Colab (the device itself is set by ctx above)

pytorch_kogpt2 = {
    'url': 'https://kobert.blob.core.windows.net/models/kogpt2/pytorch/pytorch_kogpt2_676e9bcfa7.params',
    'fname': 'pytorch_kogpt2_676e9bcfa7.params',
    'chksum': '676e9bcfa7'
}
kogpt2_config = {
    "initializer_range": 0.02,
    "layer_norm_epsilon": 1e-05,
    "n_ctx": 1024,
    "n_embd": 768,
    "n_head": 12,
    "n_layer": 12,
    "n_positions": 1024,
    "vocab_size": 50000,
    "activation_function": "gelu"
}
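
To get a feel for what these numbers imply, the sketch below derives the per-head dimension and an approximate parameter count from the config. It assumes the standard GPT-2 layout (fused QKV projection, a 4x MLP expansion, two layer norms per block, and output embeddings tied to the input embeddings). Whether `model.torch_gpt2` follows exactly this layout is an assumption, and the function names are illustrative:

```python
# Subset of the config above that determines the model size.
kogpt2_config = {
    "n_embd": 768,
    "n_head": 12,
    "n_layer": 12,
    "n_positions": 1024,
    "vocab_size": 50000,
}


def head_dim(cfg):
    """Width of each attention head; n_embd must divide evenly by n_head."""
    assert cfg["n_embd"] % cfg["n_head"] == 0
    return cfg["n_embd"] // cfg["n_head"]


def approx_param_count(cfg):
    """Parameter count under the assumed standard GPT-2 layout."""
    d = cfg["n_embd"]
    # Token embeddings plus learned position embeddings.
    embeddings = cfg["vocab_size"] * d + cfg["n_positions"] * d
    # Fused QKV projection and output projection, each with a bias.
    attention = (d * 3 * d + 3 * d) + (d * d + d)
    # Two-layer MLP with a 4x hidden expansion, each layer with a bias.
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)
    # Two layer norms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d)
    per_layer = attention + mlp + layer_norms
    final_layer_norm = 2 * d
    return embeddings + cfg["n_layer"] * per_layer + final_layer_norm


print("head dim:", head_dim(kogpt2_config))
print("approx. parameters:", approx_param_count(kogpt2_config))
```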

### 2.3 Model and Vocab Download

In [3]:
# download model
model_info = pytorch_kogpt2
model_path = download(model_info['url'],
                       model_info['fname'],
                       model_info['chksum'],
                       cachedir=cachedir)
# download vocab
vocab_info = tokenizer
vocab_path = download(vocab_info['url'],
                       vocab_info['fname'],
                       vocab_info['chksum'],
                       cachedir=cachedir)

using cached model
using cached model
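
The `using cached model` lines indicate that `download` found the files already present in `cachedir`. How it validates them is not shown in this notebook. One plausible scheme, sketched here purely as an assumption, treats `chksum` as a prefix of the file's MD5 hex digest. The helper `matches_checksum` is hypothetical and is not part of `kogpt2.utils`:

```python
import hashlib


def matches_checksum(path, expected_prefix, chunk_size=1 << 20):
    """Return True if the file's MD5 hex digest starts with expected_prefix.

    Assumption: the short 'chksum' strings are MD5 prefixes. The file is
    read in chunks so large model files do not need to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().startswith(expected_prefix)
```

A cached file that fails such a check would normally be downloaded again rather than loaded.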


### 2.4 KoGPT-2 Model and Vocab

**Loading a model for inference or to resume training**

**Saving**
```python
torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss,
            ...
            }, PATH)

```
  
**Loading**
```python
model = TheModelClass(*args, **kwargs)
optimizer = TheOptimizerClass(*args, **kwargs)

checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

model.eval()
# - or -
model.train()
```

In [4]:
# Select the device
device = torch.device(ctx)
# Load the saved checkpoint
checkpoint = torch.load(load_path, map_location=device)

# Instantiate GPT2LMHeadModel with the KoGPT-2 configuration
kogpt2model = GPT2LMHeadModel(config=GPT2Config.from_dict(kogpt2_config))
# Load the weights of our own fine-tuned model
kogpt2model.load_state_dict(checkpoint['model_state_dict'])
# Alternative: load SKT-AI's pretrained weights instead
#kogpt2model.load_state_dict(torch.load(model_path))
kogpt2model.to(device)
    
kogpt2model.eval()
vocab_b_obj = gluonnlp.vocab.BERTVocab.from_sentencepiece(vocab_path,
                                                     mask_token=None,
                                                     sep_token=None,
                                                     cls_token=None,
                                                     unknown_token='<unk>',
                                                     padding_token='<pad>',
                                                     bos_token='<s>',
                                                     eos_token='</s>')


### 2.5 Tokenizer

In [5]:
tok_path = get_tokenizer()
model, vocab = kogpt2model, vocab_b_obj
tok = SentencepieceTokenizer(tok_path, num_best = 0, alpha=0)

using cached model
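
SentencePiece marks the start of each word with the `▁` character (U+2581) instead of emitting spaces, which is why the generation loop in 2.6 replaces `▁` with a space when appending a sampled piece. A small sketch of that convention in plain Python, using the pieces this notebook prints for its example input; the `detokenize` helper is illustrative and not part of `gluonnlp`:

```python
SP_MARKER = "\u2581"  # the SentencePiece word-boundary marker "▁"


def detokenize(pieces):
    """Join SentencePiece pieces back into text, restoring spaces."""
    return "".join(pieces).replace(SP_MARKER, " ").strip()


pieces = [SP_MARKER + "서", "핑", SP_MARKER + "보", "드를",
          SP_MARKER + "타는", SP_MARKER + "한", SP_MARKER + "남자"]
print(detokenize(pieces))
```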


### 2.6 NarrativeKoGPT-2 Text Generation

In [7]:
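# tok_path, model, and vocab were already prepared in the cells above
#tok_path = get_tokenizer()
#model, vocab = get_pytorch_kogpt2_model()
tok = SentencepieceTokenizer(tok_path,  num_best=0, alpha=0)

sent = input('문장 입력: ')  # the prompt text means "Enter a sentence: "

toked = tok(sent)
print(toked)
count = 0
output_size = 200  # number of tokens to generate
start = time.time()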
'''
cycle = 100
while cycle :
  input_ids = torch.tensor([vocab[vocab.bos_token],]  + vocab[toked]).unsqueeze(0)
  pred = model(input_ids)[0]
  gen = vocab.to_tokens(torch.argmax(pred, axis=-1).squeeze().tolist())[-1]
  print(gen)
  if gen == '</s>':
      break
  sent += gen.replace('▁', ' ')
  toked = tok(sent)
  cycle -= 1

print(sent)
'''
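while True:
  # Prepend <s> to the token ids and add a batch dimension
  input_ids = torch.tensor([vocab[vocab.bos_token],]  + vocab[toked]).unsqueeze(0)
  predicts = model(input_ids)
  pred = predicts[0]

  # Prediction scores at the last position, i.e. for the next token
  last_pred = pred.squeeze()[-1]
  # Top-p (nucleus) sampling
  # sampling.py provides random, top-k, and top-p sampling.
  gen = sampling.top_p(last_pred, vocab, 0.85)
  #gen = sampling.top_k(last_pred, vocab, 5)

  # Append the sampled piece, turning the SentencePiece marker '▁' back into a space
  sent += gen.replace('▁', ' ')
  toked = tok(sent)
  # Stop once the counter has passed output_size
  if count > output_size:
    count = 0
    break
  count += 1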


for s in kss.split_sentences(sent):
    print(s)
    
print("time is ", time.time() -  start)
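
`sampling.top_p` comes from the project's `sampling.py`, which is not shown in this notebook. For orientation, here is a generic sketch of top-p (nucleus) sampling in plain Python: keep the smallest set of highest-probability tokens whose cumulative probability reaches `p`, renormalize, and sample from that set. This is not the project's implementation, and the names `softmax`, `top_p_filter`, and `sample_top_p` are illustrative:

```python
import math
import random


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, p):
    """Return (index, renormalized prob) pairs for the nucleus of mass >= p."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for idx, prob in ranked:
        kept.append((idx, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return [(idx, prob / cumulative) for idx, prob in kept]


def sample_top_p(logits, p=0.85, rng=random):
    """Sample one token index from the top-p nucleus of the distribution."""
    nucleus = top_p_filter(softmax(logits), p)
    indices = [idx for idx, _ in nucleus]
    weights = [w for _, w in nucleus]
    return rng.choices(indices, weights=weights, k=1)[0]
```

A lower `p` restricts sampling to fewer, more likely tokens. The `softmax value` figures in the log below range from near 0 to near 1, which fits the selected token sometimes being a low-probability member of the nucleus.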


문장 입력:  서핑 보드를 타는 한 남자


['▁서', '핑', '▁보', '드를', '▁타는', '▁한', '▁남자']
selected token: 만이 softmax value:tensor(0.0398, grad_fn=<SelectBackward>)
selected token: ▁조용히 softmax value:tensor(0.0013, grad_fn=<SelectBackward>)
selected token: ▁요 softmax value:tensor(0.0061, grad_fn=<SelectBackward>)
selected token: 람 softmax value:tensor(0.7639, grad_fn=<SelectBackward>)
selected token: 에 softmax value:tensor(0.0300, grad_fn=<SelectBackward>)
selected token: ▁폴 softmax value:tensor(0.0586, grad_fn=<SelectBackward>)
selected token: 짝 softmax value:tensor(0.9999, grad_fn=<SelectBackward>)
selected token: 폴 softmax value:tensor(0.9543, grad_fn=<SelectBackward>)
selected token: 짝 softmax value:tensor(0.9999, grad_fn=<SelectBackward>)
selected token: ▁모여 softmax value:tensor(0.0102, grad_fn=<SelectBackward>)
selected token: 들어 softmax value:tensor(0.0301, grad_fn=<SelectBackward>)
selected token: ▁자신의 softmax value:tensor(0.0089, grad_fn=<SelectBackward>)
selected token: ▁몸 softmax value:tensor(0.0038, grad_fn=<SelectBackw