# 1. Char RNNLM
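Inputs and outputs are handled at the character level. Unlike the word-level RNNLM, no embedding layer is used; one-hot character vectors are fed to the network directly.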
![Char RNN](https://wikidocs.net/images/page/48649/char_rnn1.PNG "Char RNN")
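
Because there is no embedding layer, each character is turned into a one-hot vector and passed to the network as is. The following is a minimal, dependency-free sketch of that encoding; the toy string and the `one_hot` helper are assumptions made for illustration and do not appear in the notebook.

```python
# Illustrative sketch: characters become one-hot vectors directly,
# so no trainable embedding table is involved.
text = "hello"  # toy input, chosen only for illustration

# Build a character vocabulary and assign an index to each character.
char_vocab = sorted(set(text))  # ['e', 'h', 'l', 'o']
char_to_index = {c: i for i, c in enumerate(char_vocab)}

def one_hot(index, size):
    """Return a list of length `size` with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

encoded = [one_hot(char_to_index[c], len(char_vocab)) for c in text]
for c, vec in zip(text, encoded):
    print(c, vec)
```

Each row has as many entries as there are distinct characters, which is why the input dimension of the model later equals the vocabulary size.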
> **1) Data preprocessing**

In [1]:
import numpy as np
import urllib.request
from tensorflow.keras.utils import to_categorical

In [2]:
urllib.request.urlretrieve("http://www.gutenberg.org/files/11/11-0.txt", filename="11-0.txt")
f = open('11-0.txt', 'rb')
lines = []

for line in f:
    line = line.strip() # remove \r and \n
    line = line.lower()
    line = line.decode('ascii', 'ignore') # drop non-ASCII byte sequences such as \xe2\x80\x99
    if len(line) > 0:
        lines.append(line)
f.close()

lines[:5]

['the project gutenberg ebook of alices adventures in wonderland, by lewis carroll',
 'this ebook is for the use of anyone anywhere in the united states and',
 'most other parts of the world at no cost and with almost no restrictions',
 'whatsoever. you may copy it, give it away or re-use it under the terms',
 'of the project gutenberg license included with this ebook or online at']

In [3]:
# Join all lines into a single string
text = ' '.join(lines)
print('String length (total number of characters): %d' % len(text))

print(text[:200])

String length (total number of characters): 159484
the project gutenberg ebook of alices adventures in wonderland, by lewis carroll this ebook is for the use of anyone anywhere in the united states and most other parts of the world at no cost and with


In [4]:
# Build the character vocabulary
char_vocab = sorted(list(set(text)))
vocab_size = len(char_vocab)
print('Size of the character vocabulary: {}'.format(vocab_size))

Size of the character vocabulary: 56


In [5]:
# Assign an integer index to each character
char_to_index = dict((c,i) for i, c in enumerate(char_vocab))
print(char_to_index)

{' ': 0, '!': 1, '"': 2, '#': 3, '$': 4, '%': 5, "'": 6, '(': 7, ')': 8, '*': 9, ',': 10, '-': 11, '.': 12, '/': 13, '0': 14, '1': 15, '2': 16, '3': 17, '4': 18, '5': 19, '6': 20, '7': 21, '8': 22, '9': 23, ':': 24, ';': 25, '?': 26, '[': 27, ']': 28, '_': 29, 'a': 30, 'b': 31, 'c': 32, 'd': 33, 'e': 34, 'f': 35, 'g': 36, 'h': 37, 'i': 38, 'j': 39, 'k': 40, 'l': 41, 'm': 42, 'n': 43, 'o': 44, 'p': 45, 'q': 46, 'r': 47, 's': 48, 't': 49, 'u': 50, 'v': 51, 'w': 52, 'x': 53, 'y': 54, 'z': 55}


In [6]:
# Reverse mapping: look up a character from its index
index_to_char = {}
for key, value in char_to_index.items():
    index_to_char[value] = key
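
A quick way to check the two mappings is an encode/decode round trip. The sketch below uses an assumed toy string rather than the Alice corpus, and the dict-comprehension form of `index_to_char` is simply an equivalent shorthand for the loop above.

```python
# Round-trip check on a toy string (illustrative, not the notebook's corpus).
text = "to be"

char_vocab = sorted(set(text))
char_to_index = {c: i for i, c in enumerate(char_vocab)}
index_to_char = {i: c for c, i in char_to_index.items()}  # same result as the loop form

encoded = [char_to_index[c] for c in text]
decoded = ''.join(index_to_char[i] for i in encoded)

print(encoded)
print(decoded)
```

If `decoded` equals the original string, the two dictionaries are consistent inverses of each other.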

In [7]:
# Split the text into fixed-length samples
seq_length = 60
n_samples = int(np.floor((len(text)-1) / seq_length))
print('Number of samples: {}'.format(n_samples))

train_X = []
train_y = []

for i in range(n_samples):
    X_sample = text[i * seq_length: (i + 1) * seq_length]
    X_encoded = [char_to_index[c] for c in X_sample]
    train_X.append(X_encoded)

    y_sample = text[i * seq_length + 1: (i + 1) * seq_length + 1]
    y_encoded = [char_to_index[c] for c in y_sample]
    train_y.append(y_encoded)

print(train_X[0])
print(train_y[0]) # same window as train_X[0], moved one character to the right

Number of samples: 2658
[49, 37, 34, 0, 45, 47, 44, 39, 34, 32, 49, 0, 36, 50, 49, 34, 43, 31, 34, 47, 36, 0, 34, 31, 44, 44, 40, 0, 44, 35, 0, 30, 41, 38, 32, 34, 48, 0, 30, 33, 51, 34, 43, 49, 50, 47, 34, 48, 0, 38, 43, 0, 52, 44, 43, 33, 34, 47, 41, 30]
[37, 34, 0, 45, 47, 44, 39, 34, 32, 49, 0, 36, 50, 49, 34, 43, 31, 34, 47, 36, 0, 34, 31, 44, 44, 40, 0, 44, 35, 0, 30, 41, 38, 32, 34, 48, 0, 30, 33, 51, 34, 43, 49, 50, 47, 34, 48, 0, 38, 43, 0, 52, 44, 43, 33, 34, 47, 41, 30, 43]
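
The chunking above can be easier to follow on a tiny string. This sketch repeats the same slicing with an assumed toy text and `seq_length = 3`; it is only meant to show that every target sequence is the input sequence moved one character ahead.

```python
# Toy version of the chunking step (illustrative string, not the corpus).
text = "abcdefghij"
seq_length = 3

# One character is held back so the last sample still has a full target.
n_samples = (len(text) - 1) // seq_length

pairs = []
for i in range(n_samples):
    x = text[i * seq_length:(i + 1) * seq_length]
    y = text[i * seq_length + 1:(i + 1) * seq_length + 1]
    pairs.append((x, y))

for x, y in pairs:
    print(x, '->', y)
```

The samples do not overlap: each input starts where the previous input ended, and only the targets reach one character into the next chunk.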


In [8]:
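# One-hot encode the training data
# (num_classes is passed explicitly so the last dimension always equals vocab_size)
train_X = to_categorical(train_X, num_classes=vocab_size)
train_y = to_categorical(train_y, num_classes=vocab_size)

print('Shape of train_X: {}'.format(train_X.shape))
print('Shape of train_y: {}'.format(train_y.shape))

Shape of train_X: (2658, 60, 56)
Shape of train_y: (2658, 60, 56)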


> **2) Model design**

In [9]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, TimeDistributed

In [10]:
model = Sequential()
model.add(LSTM(256, input_shape=(None, train_X.shape[2]), return_sequences=True))
model.add(LSTM(256, return_sequences=True))
model.add(TimeDistributed(Dense(vocab_size, activation='softmax')))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_X, train_y, epochs=80, verbose=2)

Epoch 1/80
84/84 - 36s - loss: 3.0727 - accuracy: 0.1816
Epoch 2/80
84/84 - 36s - loss: 2.7336 - accuracy: 0.2460
Epoch 3/80
84/84 - 32s - loss: 2.3995 - accuracy: 0.3267
Epoch 4/80
84/84 - 32s - loss: 2.2587 - accuracy: 0.3593
Epoch 5/80
84/84 - 33s - loss: 2.1589 - accuracy: 0.3827
Epoch 6/80
84/84 - 32s - loss: 2.0778 - accuracy: 0.4046
Epoch 7/80
84/84 - 33s - loss: 2.0117 - accuracy: 0.4206
Epoch 8/80
84/84 - 32s - loss: 1.9518 - accuracy: 0.4359
Epoch 9/80
84/84 - 32s - loss: 1.9004 - accuracy: 0.4485
Epoch 10/80
84/84 - 32s - loss: 1.8562 - accuracy: 0.4607
Epoch 11/80
84/84 - 32s - loss: 1.8131 - accuracy: 0.4742
Epoch 12/80
84/84 - 32s - loss: 1.7728 - accuracy: 0.4851
Epoch 13/80
84/84 - 32s - loss: 1.7390 - accuracy: 0.4939
Epoch 14/80
84/84 - 31s - loss: 1.6998 - accuracy: 0.5043
Epoch 15/80
84/84 - 31s - loss: 1.6667 - accuracy: 0.5117
Epoch 16/80
84/84 - 31s - loss: 1.6327 - accuracy: 0.5198
Epoch 17/80
84/84 - 32s - loss: 1.6044 - accuracy: 0.5284
Epoch 18/80
84/84 - 32s

<tensorflow.python.keras.callbacks.History at 0x163d2d44910>
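
The `84/84` in the log is the number of batches per epoch. As a sanity check, assuming the Keras default `batch_size=32` (the call to `model.fit` above does not set it), the count can be reproduced from the 2658 samples:

```python
import math

n_samples = 2658   # number of samples reported in the preprocessing step
batch_size = 32    # assumed: Keras default when model.fit gets no batch_size

# The last batch is smaller than 32, so the division is rounded up.
steps_per_epoch = math.ceil(n_samples / batch_size)
print(steps_per_epoch)  # 84
```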

In [11]:
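def sentence_generation(model, length):
    # Pick a random character index as the starting point
    ix = [np.random.randint(vocab_size)]
    y_char = [index_to_char[ix[-1]]]
    print('Starting prediction with character #{} ({!r})'.format(ix[-1], y_char[-1]))

    # One-hot buffer that is filled in one time step at a time
    X = np.zeros((1, length, vocab_size))

    for i in range(length):
        X[0][i][ix[-1]] = 1  # feed the latest character as the input at step i
        print(index_to_char[ix[-1]], end="")
        # Predict over the first i+1 steps and take the argmax at every step;
        # only the last entry, ix[-1], is used as the next character
        ix = np.argmax(model.predict(X[:, :i+1, :], verbose=0)[0], 1)
        y_char.append(index_to_char[ix[-1]])

    return ('').join(y_char)

In [12]:
sentence_generation(model, 100)

Starting prediction with character #12 ('.')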
. alice was that? there was a large please! as she could, beang to crump, said the mock turtle. and 

'. alice was that? there was a large please! as she could, beang to crump, said the mock turtle. and h'
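
The generation loop above is greedy: at every step it takes the most probable next character and appends it to the input. The sketch below shows the same loop shape without TensorFlow. `toy_predict` is a hypothetical stand-in for `model.predict` (it is not a trained network), and the three-letter vocabulary is an assumption for illustration.

```python
# Greedy (argmax) generation with a stub in place of the trained model.
vocab = ['a', 'b', 'c']

def toy_predict(indices):
    """Return one probability row per input position.
    The stub favours the character after the current one (a -> b -> c -> a)."""
    rows = []
    for ix in indices:
        row = [0.1] * len(vocab)
        row[(ix + 1) % len(vocab)] = 0.8
        rows.append(row)
    return rows

def generate(start_index, length):
    indices = [start_index]
    for _ in range(length):
        probs = toy_predict(indices)   # predictions for every position so far
        last = probs[-1]               # only the last time step is used
        next_ix = max(range(len(last)), key=last.__getitem__)  # greedy argmax
        indices.append(next_ix)
    return ''.join(vocab[i] for i in indices)

print(generate(0, 5))
```

Because the choice is always the argmax, the same starting character always produces the same text; the only randomness in the notebook's function is the initial character.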

# 2. Text Generation with a Char RNN
> **1) Data preprocessing**

In [13]:
import numpy as np
from tensorflow.keras.utils import to_categorical

In [14]:
text = '''
I get on with life as a programmer,
I like to comtemplate beer.
But when I start to daydream,
My mind turns straight to wine.

Do I love wine more than beer?

I like to use words about beer.
But when I stop my talking,
My mind turns straight to wine.

I hate bugs and errors.
But I just think back to wine,
And I'm happy once again.

I like to hang out with programming and deep learning.
But when left alone,
My mind turns straight to wine.
'''

In [15]:
# Collapse all whitespace and join everything into a single string
tokens = text.split()
text = ' '.join(tokens)
print(text)

I get on with life as a programmer, I like to comtemplate beer. But when I start to daydream, My mind turns straight to wine. Do I love wine more than beer? I like to use words about beer. But when I stop my talking, My mind turns straight to wine. I hate bugs and errors. But I just think back to wine, And I'm happy once again. I like to hang out with programming and deep learning. But when left alone, My mind turns straight to wine.


In [16]:
# Build the character vocabulary
char_vocab = sorted(list(set(text)))
print(char_vocab)

vocab_size = len(char_vocab)
print('Size of the character vocabulary: {}'.format(vocab_size))

[' ', "'", ',', '.', '?', 'A', 'B', 'D', 'I', 'M', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y']
Size of the character vocabulary: 33


In [17]:
# Assign an integer index to each character
char_to_index = dict((c,i) for i, c in enumerate(char_vocab))
print(char_to_index)

{' ': 0, "'": 1, ',': 2, '.': 3, '?': 4, 'A': 5, 'B': 6, 'D': 7, 'I': 8, 'M': 9, 'a': 10, 'b': 11, 'c': 12, 'd': 13, 'e': 14, 'f': 15, 'g': 16, 'h': 17, 'i': 18, 'j': 19, 'k': 20, 'l': 21, 'm': 22, 'n': 23, 'o': 24, 'p': 25, 'r': 26, 's': 27, 't': 28, 'u': 29, 'v': 30, 'w': 31, 'y': 32}


In [18]:
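# Build fixed-length samples: 10 input characters plus 1 target character
length = 11
sequences = []
for i in range(length, len(text)):
    seq = text[i-length:i]
    sequences.append(seq)
print('Total number of training samples: %d' % len(sequences))

sequences[:10]

Total number of training samples: 426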


['I get on wi',
 ' get on wit',
 'get on with',
 'et on with ',
 't on with l',
 ' on with li',
 'on with lif',
 'n with life',
 ' with life ',
 'with life a']
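
Unlike the non-overlapping chunks in section 1, this is a sliding window that advances one character at a time. A minimal sketch on an assumed toy string, with a window of 5 (4 input characters plus 1 target):

```python
# Sliding-window samples on a toy string (illustrative, not the poem).
text = "hello world"
length = 5  # 4 input characters + 1 target character

# Same loop shape as above: the window ending at position i (exclusive).
sequences = [text[i - length:i] for i in range(length, len(text))]

for seq in sequences:
    print(repr(seq[:-1]), '->', repr(seq[-1]))
```

With `range(length, len(text))` the slice never reaches the final character, so the last window ends one character before the end of the text. The notebook's loop has the same shape, which is why it yields `len(text) - length` samples.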

In [19]:
# Integer-encode every sample
X = []
for line in sequences:
    temp_X = [char_to_index[char] for char in line]
    X.append(temp_X)
    
for line in X[:5]:
    print(line)

[8, 0, 16, 14, 28, 0, 24, 23, 0, 31, 18]
[0, 16, 14, 28, 0, 24, 23, 0, 31, 18, 28]
[16, 14, 28, 0, 24, 23, 0, 31, 18, 28, 17]
[14, 28, 0, 24, 23, 0, 31, 18, 28, 17, 0]
[28, 0, 24, 23, 0, 31, 18, 28, 17, 0, 21]


In [20]:
# Split off the last character of each sample as the label
sequences = np.array(X)
X = sequences[:,:-1]
y = sequences[:,-1]

for line in X[:5]:
    print(line)
print(y[:5])

[ 8  0 16 14 28  0 24 23  0 31]
[ 0 16 14 28  0 24 23  0 31 18]
[16 14 28  0 24 23  0 31 18 28]
[14 28  0 24 23  0 31 18 28 17]
[28  0 24 23  0 31 18 28 17  0]
[18 28 17  0 21]


In [21]:
# One-hot encode X and y
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = np.array(sequences)
y = to_categorical(y, num_classes=vocab_size)

print(X.shape)

(426, 10, 33)


> **2) Model design**

In [22]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [23]:
model = Sequential()
model.add(LSTM(80, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=100, verbose=2)

Epoch 1/100
14/14 - 3s - loss: 3.4687 - accuracy: 0.0892
Epoch 2/100
14/14 - 0s - loss: 3.3225 - accuracy: 0.2042
Epoch 3/100
14/14 - 0s - loss: 3.0565 - accuracy: 0.1972
Epoch 4/100
14/14 - 0s - loss: 2.9859 - accuracy: 0.1972
Epoch 5/100
14/14 - 0s - loss: 2.9584 - accuracy: 0.1972
Epoch 6/100
14/14 - 0s - loss: 2.9285 - accuracy: 0.1972
Epoch 7/100
14/14 - 0s - loss: 2.9122 - accuracy: 0.1972
Epoch 8/100
14/14 - 0s - loss: 2.8936 - accuracy: 0.1972
Epoch 9/100
14/14 - 0s - loss: 2.8694 - accuracy: 0.1972
Epoch 10/100
14/14 - 0s - loss: 2.8424 - accuracy: 0.1972
Epoch 11/100
14/14 - 0s - loss: 2.8037 - accuracy: 0.1972
Epoch 12/100
14/14 - 0s - loss: 2.7628 - accuracy: 0.2019
Epoch 13/100
14/14 - 0s - loss: 2.7119 - accuracy: 0.2113
Epoch 14/100
14/14 - 0s - loss: 2.6620 - accuracy: 0.2488
Epoch 15/100
14/14 - 0s - loss: 2.5938 - accuracy: 0.2394
Epoch 16/100
14/14 - 0s - loss: 2.5392 - accuracy: 0.2653
Epoch 17/100
14/14 - 0s - loss: 2.4880 - accuracy: 0.2958
Epoch 18/100
14/14 - 0s

<tensorflow.python.keras.callbacks.History at 0x163deb56070>
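
A rough way to read the first few epochs: a model that guesses uniformly over the 33 characters has a cross-entropy of ln 33, which is close to the epoch-1 loss of 3.4687. The accuracy plateau of 0.1972 equals 84 correct labels out of 426, which would be consistent with the network predicting one frequent character for every sample (an interpretation, not something the log states). The arithmetic, as a small sketch:

```python
import math

vocab_size = 33   # from the vocabulary output above
n_samples = 426   # from the sample count above

# Cross-entropy of a uniform guess over the vocabulary.
uniform_loss = math.log(vocab_size)
print(round(uniform_loss, 4))

# Accuracy if exactly 84 of the 426 labels are predicted correctly.
plateau_accuracy = 84 / n_samples
print(round(plateau_accuracy, 4))
```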

In [24]:
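def sentence_generation(model, char_to_index, seq_length, seed_text, n):
    init_text = seed_text
    sentence = ''

    for _ in range(n):
        # Integer-encode the current text and keep only the last seq_length characters
        encoded = [char_to_index[char] for char in seed_text]
        encoded = pad_sequences([encoded], maxlen=seq_length, padding='pre')
        encoded = to_categorical(encoded, num_classes=len(char_to_index))
        # predict_classes is no longer available in recent TensorFlow releases;
        # taking the argmax of predict gives the same class index
        result = np.argmax(model.predict(encoded, verbose=0), axis=-1)[0]

        # Look up the character that corresponds to the predicted index
        for char, index in char_to_index.items():
            if index == result:
                break
        seed_text = seed_text + char
        sentence = sentence + char

    sentence = init_text + sentence
    return sentence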

In [25]:
print(sentence_generation(model, char_to_index, 10, 'I get on w', 80))



I get on with life as a programmer, I like to comtemplate beer. But when I start to daydre
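
Inside the loop, `seed_text` keeps growing, but the model only ever sees the last `seq_length` characters. The sketch below mimics that windowing in plain Python. `fit_window` is a hypothetical helper written for illustration; it is meant to mirror `pad_sequences(..., maxlen=seq_length, padding='pre')` with its default `truncating='pre'` and default pad value 0, not to replace it.

```python
def fit_window(encoded, seq_length, pad_value=0):
    """Keep the last `seq_length` items; left-pad with `pad_value` if too short."""
    window = encoded[-seq_length:]
    return [pad_value] * (seq_length - len(window)) + window

# Longer than the window: the oldest items are dropped from the front.
print(fit_window([1, 2, 3, 4, 5], 3))

# Shorter than the window: zeros are added at the front.
print(fit_window([7, 8], 4))
```

One detail worth keeping in mind: index 0 is the space character in this vocabulary, so a seed shorter than `seq_length` would effectively be left-padded with spaces after one-hot encoding. The call above avoids this by passing a seed of exactly 10 characters.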
