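* So far the RNN has operated on word vectors, but the input and output unit can be changed to the character level.
* Such a model is typically implemented with either a many-to-one or a many-to-many structure.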
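As a rough illustration of the two structures, the sketch below builds training pairs from a toy string in both ways. The helper names `many_to_many_pair` and `many_to_one_pairs` are made up for this example and do not appear in the notebook.

```python
def many_to_many_pair(text):
    """Input is the text without its last character; the target is the text shifted by one."""
    return text[:-1], text[1:]


def many_to_one_pairs(text, window):
    """Each sample is `window` characters of context plus the single next character."""
    return [(text[i:i + window], text[i + window])
            for i in range(len(text) - window)]


# Many-to-many: one target character per time step.
print(many_to_many_pair("hello"))
# Many-to-one: one target character per sample.
print(many_to_one_pairs("hello", 3))
```

Section 1 below follows the first pattern and section 2 follows the second.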

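### 1. Character-level RNN language model (Char RNNLM)
* A character-level RNN language model that feeds the character predicted at the previous time step back in as the input at the next time step.
* Because both the input and the output are single characters, each character is one-hot encoded and no embedding layer is used.
* Download "Alice's Adventures in Wonderland": http://www.gutenberg.org/files/11/11-0.txt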
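To make the "no embedding layer" point concrete, here is a minimal pure-Python sketch of one-hot encoding a string over a tiny character vocabulary. The function name `one_hot_encode` and the toy vocabulary are assumptions for illustration; the notebook itself uses `to_categorical` for this step.

```python
def one_hot_encode(text, vocab):
    """Return one row per character, with a single 1 at that character's index."""
    index = {ch: i for i, ch in enumerate(vocab)}
    rows = []
    for ch in text:
        row = [0] * len(vocab)
        row[index[ch]] = 1
        rows.append(row)
    return rows


toy_vocab = sorted(set("apple"))   # four distinct characters
encoded = one_hot_encode("apple", toy_vocab)
for ch, row in zip("apple", encoded):
    print(ch, row)
```

Each row has the width of the vocabulary, which is why the input dimension of the model later equals the size of the character set.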

#### 1) Understanding and preprocessing the data

In [1]:
import urllib.request
import numpy as np  # needed later for np.floor, np.zeros, np.argmax and np.array
from tensorflow.keras.utils import to_categorical

In [2]:
### Load the data
urllib.request.urlretrieve('https://www.gutenberg.org/files/11/11-0.txt', filename="11-0.txt")

('11-0.txt', <http.client.HTTPMessage at 0x1ec79048df0>)

In [3]:
sentences = []
with open('11-0.txt', 'rb') as f:
    for sentence in f:
        sentence = sentence.strip()  # remove \r and \n
        sentence = sentence.lower()  # lowercase
        sentence = sentence.decode('ascii', 'ignore')  # drop non-ASCII bytes such as \xe2\x80\x99

        if len(sentence) > 0:
            sentences.append(sentence)

In [4]:
sentences[:5]

['the project gutenberg ebook of alices adventures in wonderland, by lewis carroll',
 'this ebook is for the use of anyone anywhere in the united states and',
 'most other parts of the world at no cost and with almost no restrictions',
 'whatsoever. you may copy it, give it away or re-use it under the terms',
 'of the project gutenberg license included with this ebook or online at']
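The first line of the output reads `alices` rather than `alice's` because the source file uses a curly apostrophe, which is not ASCII. A small sketch of that effect, using a made-up sample string rather than the actual file contents:

```python
# The curly apostrophe is encoded in UTF-8 as the three bytes \xe2\x80\x99.
raw = "Alice’s Adventures".encode("utf-8")

# Same three steps as the loading loop: strip, lowercase, decode while ignoring non-ASCII bytes.
cleaned = raw.strip().lower().decode("ascii", "ignore")

print(raw)
print(cleaned)
```

With `'ignore'`, undecodable bytes are dropped silently, so the apostrophe disappears instead of raising an error.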

In [6]:
total_data = ' '.join(sentences)
print('Length of the string, i.e. total number of characters: %d' % len(total_data))

Length of the string, i.e. total number of characters: 159484


In [7]:
print(total_data[:200])

the project gutenberg ebook of alices adventures in wonderland, by lewis carroll this ebook is for the use of anyone anywhere in the united states and most other parts of the world at no cost and with


In [8]:
## Build the character vocabulary
char_vocab = sorted(list(set(total_data)))
vocab_size = len(char_vocab)
print('Size of the character set : {}'.format(vocab_size))

Size of the character set : 56


In [10]:
print(char_vocab)

[' ', '!', '"', '#', '$', '%', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [11]:
char_to_index = dict((char, index) for index, char in enumerate(char_vocab))
print('Character set : ', char_to_index)

Character set :  {' ': 0, '!': 1, '"': 2, '#': 3, '$': 4, '%': 5, "'": 6, '(': 7, ')': 8, '*': 9, ',': 10, '-': 11, '.': 12, '/': 13, '0': 14, '1': 15, '2': 16, '3': 17, '4': 18, '5': 19, '6': 20, '7': 21, '8': 22, '9': 23, ':': 24, ';': 25, '?': 26, '[': 27, ']': 28, '_': 29, 'a': 30, 'b': 31, 'c': 32, 'd': 33, 'e': 34, 'f': 35, 'g': 36, 'h': 37, 'i': 38, 'j': 39, 'k': 40, 'l': 41, 'm': 42, 'n': 43, 'o': 44, 'p': 45, 'q': 46, 'r': 47, 's': 48, 't': 49, 'u': 50, 'v': 51, 'w': 52, 'x': 53, 'y': 54, 'z': 55}


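* Indices 0 through 29 hold the space, punctuation marks, digits and other special characters; the letters a to z occupy indices 30 through 55. Next, build `index_to_char`, the reverse mapping that returns a character from its integer.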

In [12]:
index_to_char = {}

for key, value in char_to_index.items():
    index_to_char[value] = key

In [13]:
print(index_to_char)

{0: ' ', 1: '!', 2: '"', 3: '#', 4: '$', 5: '%', 6: "'", 7: '(', 8: ')', 9: '*', 10: ',', 11: '-', 12: '.', 13: '/', 14: '0', 15: '1', 16: '2', 17: '3', 18: '4', 19: '5', 20: '6', 21: '7', 22: '8', 23: '9', 24: ':', 25: ';', 26: '?', 27: '[', 28: ']', 29: '_', 30: 'a', 31: 'b', 32: 'c', 33: 'd', 34: 'e', 35: 'f', 36: 'g', 37: 'h', 38: 'i', 39: 'j', 40: 'k', 41: 'l', 42: 'm', 43: 'n', 44: 'o', 45: 'p', 46: 'q', 47: 'r', 48: 's', 49: 't', 50: 'u', 51: 'v', 52: 'w', 53: 'x', 54: 'y', 55: 'z'}
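With both mappings available, encoding and decoding should be exact inverses of each other. The following sketch checks that round trip on a toy string; `build_vocab`, `encode` and `decode` are hypothetical helper names, not functions from the notebook.

```python
def build_vocab(text):
    """Build the forward and reverse character/index mappings from a string."""
    chars = sorted(set(text))
    char_to_index = {ch: i for i, ch in enumerate(chars)}
    index_to_char = {i: ch for ch, i in char_to_index.items()}
    return char_to_index, index_to_char


def encode(text, char_to_index):
    return [char_to_index[ch] for ch in text]


def decode(indices, index_to_char):
    return ''.join(index_to_char[i] for i in indices)


c2i, i2c = build_vocab("the mock turtle")
ids = encode("turtle", c2i)
print(ids)
print(decode(ids, i2c))
```

A character that is missing from the vocabulary raises a `KeyError` in `encode`, which is worth keeping in mind when seeding generation with arbitrary text.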


In [14]:
### Example: a model with input length 4 and output sequence length 4
# i.e. the model runs for four time steps
# appl (input sequence) -> pple (sequence to predict)

train_X = 'appl'
train_y = 'pple'
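Read per time step, the two strings line up as input/target pairs. A short sketch that prints this alignment (the variable `pairs` is introduced here only for illustration):

```python
train_X = 'appl'
train_y = 'pple'

# At every time step the target is the character that follows the input character.
pairs = list(zip(train_X, train_y))
for t, (x_char, y_char) in enumerate(pairs, start=1):
    print(f"time step {t}: input {x_char!r} -> target {y_char!r}")
```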


In [15]:
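### Build many samples from the data above
## 1) choose a sample length, 2) cut the whole string into chunks of that length
### Here the sample length is 60, so the number of samples is floor((159484 - 1) / 60) = 2658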

seq_length = 60
n_samples = int(np.floor((len(total_data) - 1) / seq_length))
n_samples

2658

In [26]:
### Run the preprocessing
train_X = []
train_y = []

for i in range(n_samples):
    # Loop over 0:60 -> 60:120 -> 120:180, picking one sample per iteration.
    X_sample = total_data[i * seq_length: (i+1) * seq_length]

    ## Integer encoding
    X_encoded = [char_to_index[c] for c in X_sample]
    train_X.append(X_encoded)

    ## Target: the same window shifted one character to the right
    y_sample = total_data[i*seq_length + 1: (i+1) * seq_length + 1]
    y_encoded = [char_to_index[c] for c in y_sample]
    train_y.append(y_encoded)
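The same chunk-and-shift logic can be tried on a short string to see exactly which characters land in each sample. This is a sketch with a hypothetical helper `make_shifted_samples`; it works on raw characters and skips the integer encoding.

```python
def make_shifted_samples(text, seq_length):
    """Cut `text` into non-overlapping inputs and targets shifted right by one character."""
    # The "- 1" leaves room for the one extra character the last target needs.
    n_samples = (len(text) - 1) // seq_length
    inputs, targets = [], []
    for i in range(n_samples):
        start = i * seq_length
        inputs.append(text[start:start + seq_length])
        targets.append(text[start + 1:start + seq_length + 1])
    return inputs, targets


xs, ys = make_shifted_samples("the mock turtle", 4)
for x, y in zip(xs, ys):
    print(repr(x), '->', repr(y))
```

With the numbers from this notebook, `(159484 - 1) // 60` gives the 2658 samples reported above.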
  

In [20]:
total_data[60:120]

'nd, by lewis carroll this ebook is for the use of anyone any'

In [29]:
print('First sample of X :',train_X[0])
print('First sample of y :',train_y[0])
print('-'*50)
print('Second sample of X, decoded :',[index_to_char[i] for i in train_X[1]])
print('Second sample of y, decoded :',[index_to_char[i] for i in train_y[1]])

First sample of X : [49, 37, 34, 0, 45, 47, 44, 39, 34, 32, 49, 0, 36, 50, 49, 34, 43, 31, 34, 47, 36, 0, 34, 31, 44, 44, 40, 0, 44, 35, 0, 30, 41, 38, 32, 34, 48, 0, 30, 33, 51, 34, 43, 49, 50, 47, 34, 48, 0, 38, 43, 0, 52, 44, 43, 33, 34, 47, 41, 30]
First sample of y : [37, 34, 0, 45, 47, 44, 39, 34, 32, 49, 0, 36, 50, 49, 34, 43, 31, 34, 47, 36, 0, 34, 31, 44, 44, 40, 0, 44, 35, 0, 30, 41, 38, 32, 34, 48, 0, 30, 33, 51, 34, 43, 49, 50, 47, 34, 48, 0, 38, 43, 0, 52, 44, 43, 33, 34, 47, 41, 30, 43]
--------------------------------------------------
Second sample of X, decoded : ['n', 'd', ',', ' ', 'b', 'y', ' ', 'l', 'e', 'w', 'i', 's', ' ', 'c', 'a', 'r', 'r', 'o', 'l', 'l', ' ', 't', 'h', 'i', 's', ' ', 'e', 'b', 'o', 'o', 'k', ' ', 'i', 's', ' ', 'f', 'o', 'r', ' ', 't', 'h', 'e', ' ', 'u', 's', 'e', ' ', 'o', 'f', ' ', 'a', 'n', 'y', 'o', 'n', 'e', ' ', 'a', 'n', 'y']
Second sample of y, decoded : ['d', ',', ' ', 'b', 'y', ' ', 'l', 'e', 'w', 'i', 's', ' ', 'c', 'a', 'r', 'r', 'o', 'l', 'l', ' ', 't', 'h',

In [30]:
## One-hot encode train_X and train_y
train_X = to_categorical(train_X, num_classes=vocab_size)
train_y = to_categorical(train_y, num_classes=vocab_size)

print('Shape of train_X : {}'.format(train_X.shape))  # (number of samples, input_length, input_dim = 56)
print('Shape of train_y : {}'.format(train_y.shape))

Shape of train_X : (2658, 60, 56)
Shape of train_y : (2658, 60, 56)


In [32]:
train_X[0].shape

(60, 56)

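#### 2) Model design
* The hidden state size is 256. The model is a many-to-many LSTM with two stacked LSTM hidden layers, followed by a fully connected (Dense) layer applied at every time step.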

In [33]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, TimeDistributed

hidden_units = 256

In [35]:
model = Sequential()
model.add(LSTM(hidden_units, input_shape=(None, train_X.shape[2]), return_sequences=True))
model.add(LSTM(hidden_units, return_sequences=True))
model.add(TimeDistributed(Dense(vocab_size, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
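For a sense of the model's size, the trainable parameters can be counted by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias) and a plain Dense layer; `lstm_params` and `dense_params` are helper names invented for this example, and the result should agree with `model.summary()` under those assumptions.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and one bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus bias vector."""
    return input_dim * units + units


vocab_size, hidden_units = 56, 256
first_lstm = lstm_params(vocab_size, hidden_units)     # input is a one-hot vector of size 56
second_lstm = lstm_params(hidden_units, hidden_units)  # input is the first layer's output
output_layer = dense_params(hidden_units, vocab_size)  # shared across time steps by TimeDistributed

print(first_lstm, second_lstm, output_layer)
print('total:', first_lstm + second_lstm + output_layer)
```

`TimeDistributed` reuses the same Dense weights at every time step, so it adds no parameters beyond those of a single Dense layer.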

In [36]:
model.fit(train_X, train_y, epochs=80, verbose=2)

Epoch 1/80
84/84 - 36s - loss: 3.0689 - accuracy: 0.1817 - 36s/epoch - 424ms/step
Epoch 2/80
84/84 - 28s - loss: 2.7873 - accuracy: 0.2377 - 28s/epoch - 338ms/step
Epoch 3/80
84/84 - 28s - loss: 2.4230 - accuracy: 0.3268 - 28s/epoch - 338ms/step
Epoch 4/80
84/84 - 28s - loss: 2.2626 - accuracy: 0.3608 - 28s/epoch - 334ms/step
Epoch 5/80
84/84 - 28s - loss: 2.1518 - accuracy: 0.3867 - 28s/epoch - 335ms/step
Epoch 6/80
84/84 - 28s - loss: 2.0634 - accuracy: 0.4077 - 28s/epoch - 334ms/step
Epoch 7/80
84/84 - 28s - loss: 1.9885 - accuracy: 0.4270 - 28s/epoch - 336ms/step
Epoch 8/80
84/84 - 28s - loss: 1.9298 - accuracy: 0.4420 - 28s/epoch - 335ms/step
Epoch 9/80
84/84 - 28s - loss: 1.8735 - accuracy: 0.4565 - 28s/epoch - 330ms/step
Epoch 10/80
84/84 - 28s - loss: 1.8234 - accuracy: 0.4715 - 28s/epoch - 332ms/step
Epoch 11/80
84/84 - 28s - loss: 1.7806 - accuracy: 0.4836 - 28s/epoch - 335ms/step
Epoch 12/80
84/84 - 28s - loss: 1.7374 - accuracy: 0.4949 - 28s/epoch - 338ms/step
Epoch 13/80
8

<keras.callbacks.History at 0x1ec00047520>

In [37]:
def sentence_generation(model, length):
    ## Pick a random integer, i.e. a random starting character
    ix = [np.random.randint(vocab_size)]

    ## Map the integer to its character
    y_char = [index_to_char[ix[-1]]]
    print('Starting prediction with character no.', ix[-1], ':', y_char[-1])

    ## Create X with shape (1, length, vocab_size = 56), i.e. the LSTM input sequence
    X = np.zeros((1, length, vocab_size))

    for i in range(length):
        X[0][i][ix[-1]] = 1
        print(index_to_char[ix[-1]], end="")
        ix = np.argmax(model.predict(X[:, :i+1, :])[0], 1)
        y_char.append(index_to_char[ix[-1]])
    return ('').join(y_char)

In [38]:
result = sentence_generation(model, 100)
print(result)

Starting prediction with character no. 37 : h
he mock turtle, suddenly dropping his voice; and the two creatures would mently remarked. the mock the mock turtle, suddenly dropping his voice; and the two creatures would mently remarked. the mock tu
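The generated text above repeats the same phrase. One plausible contributing factor is that `np.argmax` always picks the single most likely character, so decoding is fully deterministic. The toy sketch below shows how greedy decoding can fall into a loop; `greedy_generate`, `TOY_TABLE` and `toy_scores` are hypothetical stand-ins, and unlike the LSTM this toy scorer looks only at the last character.

```python
def greedy_generate(next_char_scores, seed, length):
    """Repeatedly append the highest-scoring next character (argmax decoding)."""
    out = seed
    for _ in range(length):
        scores = next_char_scores(out)
        out += max(scores, key=scores.get)
    return out


# Hypothetical stand-in for model.predict: scores depend only on the last character.
TOY_TABLE = {
    "a": {"b": 0.9, "c": 0.1},
    "b": {"c": 0.6, "a": 0.4},
    "c": {"a": 0.7, "b": 0.3},
}


def toy_scores(prefix):
    return TOY_TABLE[prefix[-1]]


text = greedy_generate(toy_scores, "a", 8)
print(text)
```

Because the lower-probability options are never chosen, the sequence cycles through the same characters indefinitely.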


### 2. Generating text with a character-level RNN (Char RNN)

* A many-to-one RNN

In [39]:
from tensorflow.keras.utils import to_categorical

In [40]:
raw_text = '''
I get on with life as a programmer,
I like to contemplate beer.
But when I start to daydream,
My mind turns straight to wine.

Do I love wine more than beer?

I like to use words about beer.
But when I stop my talking,
My mind turns straight to wine.

I hate bugs and errors.
But I just think back to wine,
And I'm happy once again.

I like to hang out with programming and deep learning.
But when left alone,
My mind turns straight to wine.
'''

In [42]:
## Remove the paragraph breaks and store the text as a single string
tokens = raw_text.split()
raw_text = ' '.join(tokens)
raw_text

"I get on with life as a programmer, I like to contemplate beer. But when I start to daydream, My mind turns straight to wine. Do I love wine more than beer? I like to use words about beer. But when I stop my talking, My mind turns straight to wine. I hate bugs and errors. But I just think back to wine, And I'm happy once again. I like to hang out with programming and deep learning. But when left alone, My mind turns straight to wine."

In [43]:
## Rebuild the character vocabulary
char_vocab = sorted(list(set(raw_text)))
vocab_size = len(char_vocab)
print('Character set:', char_vocab)
print('Size of the character set: {}'.format(vocab_size))

Character set: [' ', "'", ',', '.', '?', 'A', 'B', 'D', 'I', 'M', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y']
Size of the character set: 33


In [44]:
char_to_index = dict((char, index) for index, char in enumerate(char_vocab))
print(char_to_index)

{' ': 0, "'": 1, ',': 2, '.': 3, '?': 4, 'A': 5, 'B': 6, 'D': 7, 'I': 8, 'M': 9, 'a': 10, 'b': 11, 'c': 12, 'd': 13, 'e': 14, 'f': 15, 'g': 16, 'h': 17, 'i': 18, 'j': 19, 'k': 20, 'l': 21, 'm': 22, 'n': 23, 'o': 24, 'p': 25, 'r': 26, 's': 27, 't': 28, 'u': 29, 'v': 30, 'w': 31, 'y': 32}


In [45]:
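### This time, keep upper/lower case, punctuation and spaces
## Predict the next character from 10 input characters, so the RNN runs for 10 time steps.
## Each window of length 11 holds 10 input characters plus 1 target character.
## 'I get on w' -> 'i'
## ' get on wi' -> 't'

length = 11
sequences = []

for i in range(length, len(raw_text)):
    seq = raw_text[i-length: i]
    sequences.append(seq)

print('Total number of training samples: %d' % len(sequences))

Total number of training samples: 426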


In [46]:
sequences[:10]

['I get on wi',
 ' get on wit',
 'get on with',
 'et on with ',
 't on with l',
 ' on with li',
 'on with lif',
 'n with life',
 ' with life ',
 'with life a']
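The windowing step and the later input/target split can be checked on a short prefix of the text. This sketch uses a hypothetical helper `sliding_windows` that mirrors the loop above and then separates the last character of each window as the target.

```python
def sliding_windows(text, length):
    """All windows of `length` characters, moving one character at a time."""
    return [text[i - length:i] for i in range(length, len(text))]


windows = sliding_windows("I get on with life", 11)

# The first 10 characters are the input, the 11th is the character to predict.
samples = [(w[:-1], w[-1]) for w in windows]
for x, y in samples[:3]:
    print(repr(x), '->', repr(y))
print(len(windows), 'windows; last one:', repr(windows[-1]))
```

Note that with `range(length, len(text))` the final character of the text never appears as a target, since the last window stops one character short of the end.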

In [47]:
## Integer encoding
encoded_sequences = []
for sequence in sequences:
    encoded_sequence = [char_to_index[char] for char in sequence]
    encoded_sequences.append(encoded_sequence)

In [48]:
encoded_sequences[:5]

[[8, 0, 16, 14, 28, 0, 24, 23, 0, 31, 18],
 [0, 16, 14, 28, 0, 24, 23, 0, 31, 18, 28],
 [16, 14, 28, 0, 24, 23, 0, 31, 18, 28, 17],
 [14, 28, 0, 24, 23, 0, 31, 18, 28, 17, 0],
 [28, 0, 24, 23, 0, 31, 18, 28, 17, 0, 21]]

In [49]:
## Split off the last character of each sample: the remaining characters go into X_data, the last character into y_data
encoded_sequences = np.array(encoded_sequences)

X_data = encoded_sequences[:, :-1]
y_data = encoded_sequences[:, -1]

In [50]:
print(X_data[:5])
print(y_data[:5])

[[ 8  0 16 14 28  0 24 23  0 31]
 [ 0 16 14 28  0 24 23  0 31 18]
 [16 14 28  0 24 23  0 31 18 28]
 [14 28  0 24 23  0 31 18 28 17]
 [28  0 24 23  0 31 18 28 17  0]]
[18 28 17  0 21]


In [54]:
## One-hot encoding
X_data_one_hot = [to_categorical(encoded, num_classes=vocab_size) for encoded in X_data]
X_data_one_hot = np.array(X_data_one_hot)
y_data_one_hot = to_categorical(y_data, num_classes=vocab_size)
X_data_one_hot.shape, y_data_one_hot.shape

((426, 10, 33), (426, 33))

Shape of X_data_one_hot: (number of samples, input_length (sequence length), input_dim)

#### 2) Model design
* Set the hidden state size to 64 and use a many-to-one LSTM.

In [55]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [56]:
hidden_units = 64

model = Sequential()
model.add(LSTM(hidden_units, input_shape=(X_data_one_hot.shape[1], X_data_one_hot.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_data_one_hot, y_data_one_hot, epochs=100, verbose=2)

Epoch 1/100
14/14 - 3s - loss: 3.4770 - accuracy: 0.0540 - 3s/epoch - 237ms/step
Epoch 2/100
14/14 - 0s - loss: 3.3813 - accuracy: 0.1854 - 114ms/epoch - 8ms/step
Epoch 3/100
14/14 - 0s - loss: 3.1421 - accuracy: 0.1972 - 125ms/epoch - 9ms/step
Epoch 4/100
14/14 - 0s - loss: 3.0316 - accuracy: 0.1972 - 127ms/epoch - 9ms/step
Epoch 5/100
14/14 - 0s - loss: 2.9828 - accuracy: 0.1972 - 127ms/epoch - 9ms/step
Epoch 6/100
14/14 - 0s - loss: 2.9474 - accuracy: 0.1972 - 122ms/epoch - 9ms/step
Epoch 7/100
14/14 - 0s - loss: 2.9260 - accuracy: 0.1972 - 123ms/epoch - 9ms/step
Epoch 8/100
14/14 - 0s - loss: 2.9088 - accuracy: 0.1972 - 125ms/epoch - 9ms/step
Epoch 9/100
14/14 - 0s - loss: 2.8910 - accuracy: 0.1972 - 124ms/epoch - 9ms/step
Epoch 10/100
14/14 - 0s - loss: 2.8665 - accuracy: 0.1972 - 122ms/epoch - 9ms/step
Epoch 11/100
14/14 - 0s - loss: 2.8415 - accuracy: 0.1972 - 125ms/epoch - 9ms/step
Epoch 12/100
14/14 - 0s - loss: 2.8137 - accuracy: 0.1972 - 125ms/epoch - 9ms/step
Epoch 13/100
1

Epoch 100/100
14/14 - 0s - loss: 0.2755 - accuracy: 0.9765 - 129ms/epoch - 9ms/step


<keras.callbacks.History at 0x1ec0bfbae80>

In [63]:
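def sentence_generation(model, char_to_index, seq_length, seed_text, n):
    ## Initial sequence
    init_text = seed_text
    sentence = ''

    # Predict the next character exactly n times
    for _ in range(n):
        encoded = [char_to_index[char] for char in seed_text]
        # Keep only the last seq_length characters (left-pad if the seed is shorter)
        encoded = pad_sequences([encoded], maxlen=seq_length, padding='pre')
        encoded = to_categorical(encoded, num_classes=len(char_to_index))

        result = model.predict(encoded, verbose=0)
        result = np.argmax(result, axis=1)[0]  # scalar index of the most likely character

        # Look up the character that belongs to the predicted index
        for char, index in char_to_index.items():
            if index == result:
                break

        # Current sequence + predicted character becomes the new current sequence
        seed_text = seed_text + char
        # Append the predicted character to the generated sentence
        sentence = sentence + char

    # After n predictions, return the seed followed by the generated characters
    sentence = init_text + sentence
    return sentence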

In [64]:
print(sentence_generation(model, char_to_index, 10, 'I get on w', 80))

I get on with life as a programmer, I like to use words about beer. But when I stort mhttb
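Inside `sentence_generation`, `seed_text` grows by one character per iteration, yet the model always receives exactly `seq_length` characters. As far as I understand the Keras defaults, `pad_sequences` pads on the left with 0 and, when the sequence is too long, also truncates from the left. A pure-Python sketch of that behavior, with `pad_pre` as a hypothetical stand-in for the library call:

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic pad_sequences(..., padding='pre') with the default truncating='pre'."""
    kept = sequence[-maxlen:]                       # too long: drop the oldest items
    return [value] * (maxlen - len(kept)) + kept    # too short: pad on the left


short = pad_pre([5, 6, 7], 5)
long = pad_pre(list(range(12)), 5)
print(short)
print(long)
```

In this notebook's vocabulary index 0 is the space character, so a seed shorter than `seq_length` would effectively be prefixed with spaces rather than with a dedicated padding symbol.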


In [62]:
seq_length

60