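# Installation
- Open source: `keras_bert` (https://github.com/CyberZHG/keras-bert).
- BERT: download BERT tiny.
    - *Caution*: if the archive is extracted into a folder on Google Drive, the `.ckpt` files are not recognized.
    - Specify a path and extract the archive into that folder (here `pretrained_bert`).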
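
The extraction caveat above can be checked before loading the model. The following is a minimal sketch, assuming the four file names that the unzip step in this notebook prints; the helper `missing_checkpoint_files` is hypothetical and not part of `keras_bert`.

```python
import os

# File names as listed in the unzip output for uncased_L-2_H-128_A-2.zip.
EXPECTED_FILES = (
    "bert_config.json",
    "vocab.txt",
    "bert_model.ckpt.index",
    "bert_model.ckpt.data-00000-of-00001",
)


def missing_checkpoint_files(folder):
    """Return the expected checkpoint files that are absent from `folder`."""
    return [
        name
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(folder, name))
    ]
```

Calling `missing_checkpoint_files("pretrained_bert")` after extraction should return an empty list; any name it returns points to a file that was not extracted where expected.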

In [1]:
!pip install keras-bert

Collecting keras-bert
  Downloading https://files.pythonhosted.org/packages/e2/7f/95fabd29f4502924fa3f09ff6538c5a7d290dfef2c2fe076d3d1a16e08f0/keras-bert-0.86.0.tar.gz
Collecting keras-transformer>=0.38.0
  Downloading https://files.pythonhosted.org/packages/89/6c/d6f0c164f4cc16fbc0d0fea85f5526e87a7d2df7b077809e422a7e626150/keras-transformer-0.38.0.tar.gz
Collecting keras-pos-embd>=0.11.0
  Downloading https://files.pythonhosted.org/packages/09/70/b63ed8fc660da2bb6ae29b9895401c628da5740c048c190b5d7107cadd02/keras-pos-embd-0.11.0.tar.gz
Collecting keras-multi-head>=0.27.0
  Downloading https://files.pythonhosted.org/packages/e6/32/45adf2549450aca7867deccfa04af80a0ab1ca139af44b16bc669e0e09cd/keras-multi-head-0.27.0.tar.gz
Collecting keras-layer-normalization>=0.14.0
  Downloading https://files.pythonhosted.org/packages/a4/0e/d1078df0494bac9ce1a67954e5380b6e7569668f0f3b50a9531c62c1fc4a/keras-layer-normalization-0.14.0.tar.gz
Collecting keras-position-wise-feed-forward>=0.6.0
  Downloading

In [9]:
!apt install unzip
!wget -q https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-128_A-2.zip
!unzip -o uncased_L-2_H-128_A-2.zip -d pretrained_bert

Reading package lists... Done
Building dependency tree       
Reading state information... Done
unzip is already the newest version (6.0-21ubuntu1).
The following package was automatically installed and is no longer required:
  libnvidia-common-440
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
Archive:  uncased_L-2_H-128_A-2.zip
  inflating: pretrained_bert/bert_model.ckpt.data-00000-of-00001  
  inflating: pretrained_bert/bert_config.json  
  inflating: pretrained_bert/vocab.txt  
  inflating: pretrained_bert/bert_model.ckpt.index  


In [10]:
import os

# keras_bert reads TF_KERAS at import time, so set it before importing keras_bert.
os.environ['TF_KERAS'] = '1'

import codecs
import pickle

import numpy as np
import tensorflow as tf
from keras_bert import Tokenizer, load_trained_model_from_checkpoint
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model
from tqdm import tqdm

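# Basic settings
- Path settings
- Parameter settings
    - Change from the instructor's code: the `LOAD_DATA` boolean variable is removed.
    - The data is loaded later with a `try ... except` block instead.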
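
The `try ... except` approach mentioned above is a load-or-build cache. A minimal generic sketch follows; the helper name `load_or_build` and its `build_fn` parameter are hypothetical and only illustrate the pattern, they are not the notebook's own code.

```python
import pickle


def load_or_build(cache_path, build_fn):
    """Load a pickled object from `cache_path`, or build it and cache it."""
    try:
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        # No cache yet: build the data once and store it for later runs.
        data = build_fn()
        with open(cache_path, "wb") as f:
            pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
        return data
```

Catching only `FileNotFoundError` keeps other failures, such as a corrupted pickle, visible instead of silently rebuilding the data.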

In [14]:
# Path settings

pretrained_path = "/content/pretrained_bert"
config_path = os.path.join(pretrained_path, 'bert_config.json')
checkpoint_path = os.path.join(pretrained_path, 'bert_model.ckpt')
vocab_path = os.path.join(pretrained_path, 'vocab.txt')

data_path = "/content/drive/My Drive/멀티캠퍼스/[혁신성장] 인공지능 자연어처리 기반/[강의]/조성현 강사님/dataset"

In [6]:
# Model parameter settings
SEQ_LEN = int(input('Maximum sentence length: '))
BATCH_SIZE = int(input('Batch size: '))
EPOCHS = int(input('Number of epochs: '))
LR = 0.001
# LOAD_DATA = True  # removed; changed from the instructor's code

Maximum sentence length: 128
Batch size: 128
Number of epochs: 1


# Pre-trained BERT model

In [15]:
# Vocabulary
word2idx = {}
with codecs.open(vocab_path, 'r', 'utf8') as reader:
    for line in reader:
        token = line.strip()
        word2idx[token] = len(word2idx)

# Inverse mapping: index -> token.
idx2word = {idx: token for token, idx in word2idx.items()}

In [16]:
# Inspect the architecture of the pre-trained BERT model
model = load_trained_model_from_checkpoint(
        config_path,
        checkpoint_path,
        training=True,
        trainable=True,
        seq_len=SEQ_LEN,
)

print()
model.summary()


Model: "functional_5"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input-Token (InputLayer)        [(None, 128)]        0                                            
__________________________________________________________________________________________________
Input-Segment (InputLayer)      [(None, 128)]        0                                            
__________________________________________________________________________________________________
Embedding-Token (TokenEmbedding [(None, 128, 128), ( 3906816     Input-Token[0][0]                
__________________________________________________________________________________________________
Embedding-Segment (Embedding)   (None, 128, 128)     256         Input-Segment[0][0]              
______________________________________________________________________________________

In [17]:
# Tokenizer
tokenizer = Tokenizer(word2idx)
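
Later cells call `tokenizer.encode(text, max_len=SEQ_LEN)` and stack the results into one array, which implies fixed-length id sequences. The sketch below illustrates that general idea on an already-split token list. `encode_fixed_length` is a hypothetical helper, not the library's implementation: it skips WordPiece splitting and assumes the vocabulary contains `[PAD]`, `[UNK]`, `[CLS]`, and `[SEP]`.

```python
def encode_fixed_length(tokens, word2idx, max_len):
    """Map tokens to ids with [CLS]/[SEP], then pad or truncate to max_len."""
    # Reserve two positions for the [CLS] and [SEP] markers.
    tokens = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
    unk = word2idx["[UNK]"]
    ids = [word2idx.get(token, unk) for token in tokens]
    # Fill the remaining positions with the [PAD] id.
    ids += [word2idx["[PAD]"]] * (max_len - len(ids))
    # Single-sentence input: every position belongs to segment 0.
    segments = [0] * max_len
    return ids, segments
```

For example, with a toy vocabulary in which `[CLS]`, `a`, `drama`, `[SEP]`, and `[PAD]` map to 2, 4, 5, 3, and 0, encoding `["a", "drama"]` with `max_len=6` gives the ids `[2, 4, 5, 3, 0, 0]`.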

# Data preparation
- Load the IMDB data.
- Build the training and test data.
- Inspect the data.

In [20]:
# Download and extract the IMDB data.
dataset = tf.keras.utils.get_file(
    fname="aclImdb.tar.gz", 
    origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz", 
    extract=True,
)

Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [18]:
# Build the training and test data for BERT fine-tuning.
def load_data(path):
    indices, sentiments = [], []
    for folder, sentiment in (('neg', 0), ('pos', 1)):
        folder = os.path.join(path, folder)
        for name in tqdm(os.listdir(folder)):
            with open(os.path.join(folder, name), 'r', encoding='UTF8') as reader:
                text = reader.read()
            # encode() returns token ids and segment ids; only the token ids are kept.
            ids, _ = tokenizer.encode(text, max_len=SEQ_LEN)
            indices.append(ids)
            sentiments.append(sentiment)
    # Shuffle reviews and labels together.
    items = list(zip(indices, sentiments))
    np.random.shuffle(items)
    indices, sentiments = zip(*items)
    indices = np.array(indices)
    # Drop the remainder so the sample count is a multiple of BATCH_SIZE.
    mod = indices.shape[0] % BATCH_SIZE
    if mod > 0:
        indices, sentiments = indices[:-mod], sentiments[:-mod]
    # Model inputs are [token ids, segment ids]; the segment ids are all zeros.
    return [indices, np.zeros_like(indices)], np.array(sentiments)
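
`load_data` drops the trailing samples that do not fill a complete batch. The step can be isolated as below; `trim_to_full_batches` is a hypothetical helper name used only for illustration, written with plain sequences so it runs without NumPy.

```python
def trim_to_full_batches(samples, labels, batch_size):
    """Drop trailing items so the sample count is a multiple of batch_size."""
    mod = len(samples) % batch_size
    if mod > 0:
        # Negative slicing removes the last `mod` items from both sequences.
        samples, labels = samples[:-mod], labels[:-mod]
    return samples, labels
```

Each IMDB split holds 25,000 reviews, so with a batch size of 128 the remainder is 25000 % 128 = 40, leaving 24,960 samples (195 full batches).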

In [19]:
try:
    # Load the cached training and test data.
    with open(f'{data_path}/train_test.pickle', 'rb') as f:
        train_x, train_y, test_x, test_y = pickle.load(f)
except FileNotFoundError:
    # No cache yet: build the data from the raw IMDB files.
    train_path = os.path.join(os.path.dirname(dataset), 'aclImdb', 'train')
    test_path = os.path.join(os.path.dirname(dataset), 'aclImdb', 'test')

    train_x, train_y = load_data(train_path)
    test_x, test_y = load_data(test_path)

    # Save the result for later runs.
    with open(f'{data_path}/train_test.pickle', 'wb') as f:
        pickle.dump([train_x, train_y, test_x, test_y], f, pickle.HIGHEST_PROTOCOL)

## Inspecting the data
- Check the first sentence.
- Decode it to recover the original sentence.

In [20]:
# Decode the first sentence of the training data. The output appears below.
print([idx2word[k] for k in train_x[0][0]])

# The tokenizer can also decode it; the leading [CLS] and trailing [SEP] are removed.
decoded = tokenizer.decode(list(train_x[0][0]))
print(decoded)

['[CLS]', 'a', 'drama', 'at', 'its', 'very', 'core', ',', '"', 'anna', '"', 'displays', 'that', 'genuine', 'truth', 'that', 'all', 'actors', 'age', ',', 'and', 'sometimes', ',', 'fade', 'away', '.', 'anna', 'is', 'a', 'character', 'that', 'believes', 'america', 'is', 'her', 'safety', 'net', ',', 'her', 'home', ',', 'and', 'it', 'can', 'do', 'her', 'no', 'wrong', 'but', 'she', 'refuses', 'to', 'bel', '##itt', '##le', 'herself', 'to', 'do', 'work', 'she', 'doesn', "'", 't', 'believe', 'in', '.', 'she', 'is', 'hard', '-', 'nosed', ',', 'optimistic', ',', 'stubborn', ',', 'and', 'arrogant', 'when', 'it', 'comes', 'to', 'her', 'life', ',', 'yet', 'not', 'afraid', 'to', 'let', 'others', 'in', ',', 'yet', 'drop', 'them', 'at', 'a', 'moments', 'notice', '.', 'anna', 'flip', '-', 'flop', '##s', 'between', 'personalities', ',', 'which', 'makes', 'this', 'film', 'ideal', 'of', 'an', 'aging', 'star', ',', 'but', 'not', 'idea', 'of', 'the', 'viewing', 'audience', '.', '[SEP]']
['a', 'drama', 'at', 

In [21]:
# Remove the '##' delimiters and restore the original sentence.
text = []
for i, t in enumerate(decoded):
    if i != 0 and t[0] != '#':
        text.append('_' + t)
    else:
        text.append(t)
''.join([t.replace('##', '') for t in text]).replace('_', ' ')

'a drama at its very core , " anna " displays that genuine truth that all actors age , and sometimes , fade away . anna is a character that believes america is her safety net , her home , and it can do her no wrong but she refuses to belittle herself to do work she doesn \' t believe in . she is hard - nosed , optimistic , stubborn , and arrogant when it comes to her life , yet not afraid to let others in , yet drop them at a moments notice . anna flip - flops between personalities , which makes this film ideal of an aging star , but not idea of the viewing audience .'
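
The restoration step above can also be written as a small function. This is an alternative sketch, not the notebook's code: `merge_wordpieces` is a hypothetical name, and it treats only tokens that start with `##` as continuations, whereas the cell above checks the first character only.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens, attaching '##' pieces to the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: strip the marker and extend the last word.
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)
```

For example, `merge_wordpieces(['bel', '##itt', '##le', 'herself'])` returns `'belittle herself'`.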

# Training

- The open-source example we referenced uses `RAdam`, but using it here raises an error.

```
from keras_radam import RAdam
optimizer=RAdam(lr=1e-4)
# TypeError: __init__() missing 1 required positional argument: 'name' 
```

- Fine-tuning: run a single epoch as a trial.

In [22]:
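# Build the fine-tuning model
# Use the token and segment inputs, and put a sigmoid classifier on 'NSP-Dense'.
inputs = model.inputs[:2]
dense = model.get_layer('NSP-Dense').output
outputs = Dense(units=1, activation='sigmoid')(dense)
model = Model(inputs, outputs)

# Configure training. LR equals Adam's default learning rate (0.001).
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=LR),
    loss='binary_crossentropy',
#    metrics=['binary_crossentropy'],
)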

In [23]:
# Fine-tune the model.
model.fit(
    train_x,
    train_y,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
)



<tensorflow.python.keras.callbacks.History at 0x7fc0f1008940>

# Prediction accuracy

In [24]:
# Evaluate accuracy on the test data.
predicts = model.predict(test_x, verbose=True)
pred_y = np.where(predicts > 0.5, 1, 0).reshape(-1,)
print('Accuracy = %.4f' % np.mean(test_y == pred_y))

Accuracy = 0.8079
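
The accuracy computation above thresholds the sigmoid outputs at 0.5 and compares them with the labels. A dependency-free sketch of the same calculation follows; `binary_accuracy` is a hypothetical helper name, and the probabilities in the example are made up for illustration.

```python
def binary_accuracy(probabilities, labels, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == y) for pred, y in zip(predictions, labels))
    return correct / len(labels)
```

For example, `binary_accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0])` returns `0.75`, because only the third prediction is wrong.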


