## 1. Loading the data
- The glob module makes it easy to collect file paths. Use it to read every text file, then store the contents in the raw_corpus list, one lyric line per entry.

In [1]:
import re
import numpy as np
import tensorflow as tf

In [2]:
import glob

txt_file_path = './lyricist/data/lyrics/*'

txt_list = glob.glob(txt_file_path)

raw_corpus = []

# Read every txt file and append its lines to raw_corpus.
for txt_file in txt_list:
    with open(txt_file, "r", encoding='UTF-8') as f:
        raw = f.read().splitlines()
        raw_corpus.extend(raw)
        
print("Data size: ", len(raw_corpus))
print("Examples: \n", raw_corpus[:3])

Data size:  187088
Examples: 
 ["Now I've heard there was a secret chord", 'That David played, and it pleased the Lord', "But you don't really care for music, do you?"]
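The loading pattern above can be tried without the lyrics folder. The sketch below is only an illustration: the temporary directory and the two file names are made up, and `sorted` is added because `glob.glob` does not guarantee any particular order.

```python
import glob
import os
import tempfile

# Hypothetical stand-in for ./lyricist/data/lyrics/: two small text files.
with tempfile.TemporaryDirectory() as tmp_dir:
    samples = [("a.txt", "first line\nsecond line\n"), ("b.txt", "third line\n")]
    for name, text in samples:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as f:
            f.write(text)

    demo_corpus = []
    # Sort the paths so the order of the corpus is reproducible.
    for path in sorted(glob.glob(os.path.join(tmp_dir, "*"))):
        with open(path, "r", encoding="utf-8") as f:
            # splitlines() drops the newline characters, one entry per line.
            demo_corpus.extend(f.read().splitlines())

print(demo_corpus)
```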


## 2. Cleaning the data

In [3]:
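# Normalization function
def preprocess_sentence(sentence):
    sentence = sentence.lower().strip()  # 1. lowercase and trim surrounding whitespace
    # 2. Presumably meant to pad punctuation with spaces. As written, r"\1" puts each
    #    match back unchanged, so this line is a no-op; r" \1 " would add the padding.
    #    Left as-is because the outputs below were produced with this code.
    sentence = re.sub(r"([?.!,¿])", r"\1", sentence)
    sentence = re.sub(r'[" "]+', " ", sentence)  # 3. collapse runs of spaces and double quotes into one space
    sentence = re.sub(r"[^a-zA-Z?.!,¿]+", " ", sentence)  # 4. replace everything except letters and ?.!,¿ with a space
    sentence = sentence.strip()  # 5. trim again
    sentence = '<start> ' + sentence + ' <end>'  # 6. add the start and end tokens
    return sentence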

In [4]:
# Collect the cleaned sentences
corpus = []

for sentence in raw_corpus:
    if len(sentence) == 0: continue  # skip empty lines
    if sentence[-1] == ":": continue  # skip lines that end with a colon
        
    preprocessed_sentence = preprocess_sentence(sentence)
    corpus.append(preprocessed_sentence)
    
corpus[:5]

['<start> now i ve heard there was a secret chord <end>',
 '<start> that david played, and it pleased the lord <end>',
 '<start> but you don t really care for music, do you? <end>',
 '<start> it goes like this <end>',
 '<start> the fourth, the fifth <end>']
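In the output above, punctuation stays attached to the preceding word ("played,", "you?"). Since the tokenizer in the next cell splits on spaces only, "music," and "music" would end up as different vocabulary entries. The sketch below compares the cleaning function as written with a variant that pads punctuation with spaces. The variant and both helper names are assumptions for illustration, not what produced the results in this notebook.

```python
import re

def preprocess_as_written(sentence):
    # Same steps as the notebook: the r"\1" replacement leaves punctuation untouched.
    sentence = sentence.lower().strip()
    sentence = re.sub(r"([?.!,¿])", r"\1", sentence)
    sentence = re.sub(r'[" "]+', " ", sentence)
    sentence = re.sub(r"[^a-zA-Z?.!,¿]+", " ", sentence)
    return "<start> " + sentence.strip() + " <end>"

def preprocess_padded(sentence):
    # Hypothetical variant: r" \1 " surrounds each punctuation mark with spaces.
    sentence = sentence.lower().strip()
    sentence = re.sub(r"([?.!,¿])", r" \1 ", sentence)
    sentence = re.sub(r'[" "]+', " ", sentence)
    sentence = re.sub(r"[^a-zA-Z?.!,¿]+", " ", sentence)
    return "<start> " + sentence.strip() + " <end>"

line = "But you don't really care for music, do you?"
as_written = preprocess_as_written(line)
padded = preprocess_padded(line)
print(as_written)  # <start> but you don t really care for music, do you? <end>
print(padded)      # <start> but you don t really care for music , do you ? <end>
```

Switching to the padded variant would change the vocabulary, so every number reported below would have to be regenerated.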

In [5]:
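# Build the tokenizer
def tokenize(corpus):
    tokenizer = tf.keras.preprocessing.text.Tokenizer(
        num_words=12000,    # limit the vocabulary used for encoding
        filters=' ',        # the sentences are already cleaned, so filter nothing else
        oov_token="<unk>"   # words outside the vocabulary map to <unk>
    )
    # Build the tokenizer's internal vocabulary from corpus
    tokenizer.fit_on_texts(corpus)

    # Convert corpus to integer sequences with the fitted tokenizer
    tensor = tokenizer.texts_to_sequences(corpus)

    # Pad (or truncate) every sequence to a fixed length of 15
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, maxlen=15, padding='post')

    print(tensor, tokenizer)
    return tensor, tokenizer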

tensor, tokenizer = tokenize(corpus)

[[   2   50    4 ...    0    0    0]
 [   2   16 3681 ...    0    0    0]
 [   2   31    6 ...    0    0    0]
 ...
 [ 148    4   20 ...    8 1070    3]
 [   4   32   14 ...  877  663    3]
 [   2    6  367 ...    0    0    0]] <keras_preprocessing.text.Tokenizer object at 0x7f73e98defd0>
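Some rows in the output above start with a value other than 2 and end with 3 rather than with padding. One likely explanation is that `pad_sequences` pads at the end here (`padding='post'`) but, by default, truncates long sequences from the front (`truncating='pre'`), which cuts off the leading token. The pure-Python sketch below imitates that behaviour on a made-up sequence. `pad_or_truncate` is a hypothetical helper, not the Keras function.

```python
def pad_or_truncate(seq, maxlen, truncating="pre", value=0):
    """Imitates post-padding with either pre- or post-truncation."""
    if truncating == "pre":
        seq = seq[-maxlen:]   # keep the LAST maxlen tokens
    else:
        seq = seq[:maxlen]    # keep the FIRST maxlen tokens
    return seq + [value] * (maxlen - len(seq))

# Made-up ids: 2 plays the role of <start>, 3 the role of <end>.
short_seq = [2, 50, 4, 3]
long_seq = [2] + list(range(10, 30)) + [3]   # 22 tokens, longer than maxlen

short_padded = pad_or_truncate(short_seq, 15)
long_pre = pad_or_truncate(long_seq, 15, truncating="pre")
long_post = pad_or_truncate(long_seq, 15, truncating="post")

print(short_padded)  # zeros appended at the end
print(long_pre)      # start token lost, end token kept
print(long_post)     # start token kept, end token lost
```

If keeping the start token matters, passing `truncating='post'` or dropping over-long sentences before padding are two options worth considering.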


## 3. Splitting off an evaluation dataset

In [6]:
from sklearn.model_selection import train_test_split

enc_train, enc_val, dec_train, dec_val = train_test_split(tensor[:, :-1],
                                                         tensor[:, 1:],
                                                         test_size=0.2,
                                                         random_state=7)

In [7]:
print('Source Train:', enc_train.shape)
print('Target Train:', dec_train.shape)

Source Train: (140599, 14)
Target Train: (140599, 14)
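The slices `tensor[:, :-1]` and `tensor[:, 1:]` turn each 15-token row into a 14-token source and a 14-token target shifted by one position, so the model learns to predict the next token at every step. A small sketch with a made-up, shorter row:

```python
# Made-up row of token ids: 2 = <start>, 3 = <end>, 0 = <pad>.
row = [2, 50, 4, 95, 3, 0, 0]

source = row[:-1]   # everything except the last token
target = row[1:]    # everything except the first token

# At each position the target is the token that follows the source token.
for src_token, tgt_token in zip(source, target):
    print(src_token, "->", tgt_token)
```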


In [8]:
# Build the dataset object
BUFFER_SIZE = len(enc_train)
BATCH_SIZE = 256
steps_per_epoch = len(enc_train) // BATCH_SIZE

# 12,000 from the tokenizer's num_words setting, plus one for 0:<pad>, which is not in the word index: 12,001 in total
VOCAB_SIZE = tokenizer.num_words + 1

# Create the dataset from the prepared source and target arrays
dataset = tf.data.Dataset.from_tensor_slices((enc_train, dec_train))
dataset = dataset.shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
dataset

<BatchDataset shapes: ((256, 14), (256, 14)), types: (tf.int32, tf.int32)>
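With `drop_remainder=True`, the last incomplete batch is discarded in each epoch. A quick check of the arithmetic, using the training-set size printed earlier:

```python
n_train = 140599     # rows in enc_train, from the shape printed above
batch_size = 256

steps_per_epoch = n_train // batch_size
dropped_per_epoch = n_train - steps_per_epoch * batch_size

print(steps_per_epoch)     # 549
print(dropped_per_epoch)   # 55
```

Because the dataset is reshuffled every epoch, a different set of 55 samples is left out each time.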

## 4. Building the model

In [9]:
class TextGenerator(tf.keras.Model):
    def __init__(self, vocab_size, embedding_size, hidden_size):
        super().__init__()
        
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_size)
        self.rnn_1 = tf.keras.layers.LSTM(hidden_size, return_sequences=True)
        self.rnn_2 = tf.keras.layers.LSTM(hidden_size, return_sequences=True)
        self.linear = tf.keras.layers.Dense(vocab_size)
        
    def call(self, x):
        out = self.embedding(x)
        out = self.rnn_1(out)
        out = self.rnn_2(out)
        out = self.linear(out)
        
        return out
    
embedding_size = 512
hidden_size = 1024
model = TextGenerator(tokenizer.num_words+1, embedding_size, hidden_size)

In [10]:
# Fetch a single batch from the dataset
for src_sample, tgt_sample in dataset.take(1): break

# Feed that one batch through the model
model(src_sample)

<tf.Tensor: shape=(256, 14, 12001), dtype=float32, numpy=
array([[[-9.81576886e-05,  1.28531057e-04, -3.21665226e-04, ...,
          3.24293389e-04, -8.50656215e-05,  4.02875639e-06],
        [-1.71133928e-04, -1.51142827e-04, -7.06765917e-04, ...,
          3.72104289e-04, -5.35100524e-04,  9.05680790e-05],
        [-5.36639360e-04, -4.40671109e-04, -7.57427537e-04, ...,
          5.66263683e-04, -5.72343764e-04,  4.31723951e-04],
        ...,
        [-1.22268114e-03, -3.15817131e-04, -6.00858002e-05, ...,
          2.37907050e-03,  1.46130205e-03,  3.33369622e-04],
        [-4.77311230e-04, -7.62315351e-04,  1.13095921e-04, ...,
          2.50789314e-03,  2.06094817e-03,  3.19480750e-04],
        [ 2.46615673e-04, -1.19260291e-03,  1.77659036e-04, ...,
          2.63716723e-03,  2.53272313e-03,  2.98270752e-04]],

       [[-9.81576886e-05,  1.28531057e-04, -3.21665226e-04, ...,
          3.24293389e-04, -8.50656215e-05,  4.02875639e-06],
        [-1.99923059e-04,  2.31558923e-04, -5

In [11]:
model.summary()

Model: "text_generator"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        multiple                  6144512   
_________________________________________________________________
lstm (LSTM)                  multiple                  6295552   
_________________________________________________________________
lstm_1 (LSTM)                multiple                  8392704   
_________________________________________________________________
dense (Dense)                multiple                  12301025  
Total params: 33,133,793
Trainable params: 33,133,793
Non-trainable params: 0
_________________________________________________________________
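The parameter counts in the summary can be reproduced by hand from the standard formulas for embedding, LSTM, and dense layers. The helper names below are made up for this check.

```python
def embedding_params(vocab_size, embedding_size):
    # One vector of length embedding_size per vocabulary entry.
    return vocab_size * embedding_size

def lstm_params(input_size, hidden_size):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_size + hidden_size) * hidden_size + hidden_size)

def dense_params(input_size, output_size):
    # Weight matrix plus one bias per output unit.
    return input_size * output_size + output_size

vocab_size, embedding_size, hidden_size = 12001, 512, 1024

counts = {
    "embedding": embedding_params(vocab_size, embedding_size),
    "lstm": lstm_params(embedding_size, hidden_size),
    "lstm_1": lstm_params(hidden_size, hidden_size),
    "dense": dense_params(hidden_size, vocab_size),
}
total = sum(counts.values())

for name, count in counts.items():
    print(name, count)
print("total", total)
```

The output layer alone accounts for over a third of the parameters, which is a direct consequence of the 12,001-entry vocabulary.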


In [12]:
optimizer = tf.keras.optimizers.Adam()
loss = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, 
    reduction='none')

model.compile(loss=loss, optimizer=optimizer)
model.fit(dataset, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f73bc77ff10>
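To put loss values into perspective, it helps to know what sparse categorical cross-entropy computes for a single token and what an untrained model would score. The sketch below is a pure-Python illustration, not the Keras implementation. It shows that uniform logits over 12,001 classes give a loss of ln(12001), roughly 9.39, and that a loss of 2.2 corresponds to a perplexity of about 9.

```python
import math

def sparse_ce_from_logits(logits, target):
    """Cross-entropy for one token: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

vocab_size = 12001

# A model that knows nothing assigns the same logit to every word.
uniform_loss = sparse_ce_from_logits([0.0] * vocab_size, target=5)
print(round(uniform_loss, 2))

# Perplexity is exp(loss): the effective number of equally likely choices.
perplexity_at_2_2 = math.exp(2.2)
print(round(perplexity_at_2_2, 1))
```

Note that the training code above applies no mask, so the easily predicted padding positions are included in the average and make the reported loss look lower than the loss on real words alone.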

## 5. Evaluating the model

In [15]:
def generate_text(model, tokenizer, init_sentence="<start>", max_len=20):
    # Convert the seed sentence to a (1, length) tensor of token ids
    test_input = tokenizer.texts_to_sequences([init_sentence])
    test_tensor = tf.convert_to_tensor(test_input, dtype=tf.int64)
    end_token = tokenizer.word_index["<end>"]

    while True:
        # Predict scores for every position, then take the best word at the last one
        predict = model(test_tensor)
        predict_word = tf.argmax(tf.nn.softmax(predict, axis=-1), axis=-1)[:, -1]
        # Append the predicted word to the input and repeat
        test_tensor = tf.concat([test_tensor, tf.expand_dims(predict_word, axis=0)], axis=-1)

        # Stop at <end> or once max_len tokens have been generated
        if predict_word.numpy()[0] == end_token: break
        if test_tensor.shape[1] >= max_len: break

    # Convert the token ids back to words
    generated = ""
    for word_index in test_tensor[0].numpy():
        generated += tokenizer.index_word[word_index] + " "

    return generated

In [19]:
generate_text(model, tokenizer, init_sentence="<start> what", max_len=20)

'<start> what you want be what you want <end> '
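`generate_text` performs greedy decoding: at every step it appends the single highest-scoring word and feeds the longer sequence back in. The loop can be illustrated without TensorFlow. In the sketch below, `toy_scores` is a made-up stand-in for the trained model and `greedy_decode` is a hypothetical helper. Because softmax preserves the ordering of scores, taking the argmax of raw scores picks the same word as taking it after softmax.

```python
def greedy_decode(next_token_scores, start_ids, end_id, max_len=20):
    """Repeatedly append the highest-scoring next token."""
    ids = list(start_ids)
    while True:
        scores = next_token_scores(ids)
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        ids.append(best)
        # Stop on the end token or when the length limit is reached.
        if best == end_id or len(ids) >= max_len:
            break
    return ids

def toy_scores(ids):
    # Hypothetical model over a 6-word vocabulary: it always prefers
    # the id that follows the last one, wrapping around at 6.
    scores = [0.0] * 6
    scores[(ids[-1] + 1) % 6] = 1.0
    return scores

stopped_by_end = greedy_decode(toy_scores, [2], end_id=5)
stopped_by_length = greedy_decode(toy_scores, [2], end_id=1, max_len=3)
print(stopped_by_end)      # [2, 3, 4, 5]
print(stopped_by_length)   # [2, 3, 4]
```

Greedy decoding is deterministic, so the same seed always yields the same lyric; sampling from the predicted distribution instead would give more varied output.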

With embedding_size set to 256, the loss would not drop below 2.2, so I increased it to 512. After the increase, the loss fell below 2.2.