# Building a Great Lyricist

## Keywords

Recurrent neural network (RNN), language model

## 1) Confirmed that the data is there!  

I decided to use alicia-keys.txt.

## 2) Reading the data

In [2]:
import glob
import os
import re
import numpy as np
import tensorflow as tf

In [3]:
txt_file_path = os.getenv('HOME')+'/aiffel/lyricist/data/lyrics/*'

txt_list = glob.glob(txt_file_path)

raw_corpus = []

for txt_file in txt_list:
    with open(txt_file, "r") as f:
        raw = f.read().splitlines()
        raw_corpus.extend(raw)

print("Data size:", len(raw_corpus))
print("Examples:\n", raw_corpus[:3])

Data size: 187088
Examples:
 ["Now I've heard there was a secret chord", 'That David played, and it pleased the Lord', "But you don't really care for music, do you?"]


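## Step 3. Data cleaning
Use the techniques learned earlier to clean the data into a shape suitable for sentence generation!

Remember the preprocess_sentence() function we wrote? We will use it to clean the data.

In addition, overly long sentences force excessive padding onto the rest of the data, so we remove them. 
Very long sentences may also be a poor fit for song lyrics.
So this time, it is recommended to exclude from the training data any sentence that has more than 15 tokens after tokenization.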
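
The 15-token rule above is not applied anywhere in the cells that follow, so here is a minimal sketch of one way it could look. It assumes tokens are counted by splitting the preprocessed sentence on whitespace and that the `<start>` and `<end>` markers count toward the limit; the text does not say either way. `MAX_TOKENS`, `within_token_limit`, and `sample_corpus` are hypothetical names used only for illustration.

```python
MAX_TOKENS = 15  # limit suggested in the text; counting <start>/<end> is an assumption


def within_token_limit(preprocessed_sentence, max_tokens=MAX_TOKENS):
    """Return True when the whitespace-split sentence has at most max_tokens tokens."""
    return len(preprocessed_sentence.split()) <= max_tokens


# Two already-preprocessed lines taken from the corpus output in this notebook.
sample_corpus = [
    "<start> baby , i m from new york <end>",
    "<start> noise is always loud , there are sirens all around and the streets are mean <end>",
]

# Keep only the sentences that fit within the limit.
filtered = [s for s in sample_corpus if within_token_limit(s)]
print(filtered)
```

The same check could be placed inside the loop that builds `corpus`, right before `corpus.append(...)`.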

### 1) Open the lyrics file in read mode and read the lyrics into a list, split line by line

In [8]:
file_path = os.getenv('HOME') + '/aiffel/lyricist/data/lyrics/alicia-keys.txt'
with open(file_path, "r") as f:
    raw_corpus = f.read().splitlines()

print(raw_corpus[:9])  

['Ooh....... New York x2 Grew up in a town that is famous as a place of movie scenes', 'Noise is always loud, there are sirens all around and the streets are mean', "If I can make it here, I can make it anywhere, that's what they say", "Seeing my face in lights or my name on marquees found down on Broadway Even if it ain't all it seems, I got a pocket full of dreams", "Baby, I'm from New York", 'Concrete jungle where dreams are made of', "There's nothing you can't do", "Now you're in New York", 'These streets will make you feel brand new']


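### 2) First-pass filtering: exclude sentences based on `:`; for blank sentences, check the length and exclude them if it is 0

We can see that the sentences inside the txt file are already split line by line.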

In [11]:
for idx, sentence in enumerate(raw_corpus):
    if len(sentence) == 0: continue   
#    if sentence[-1] == ":": continue    -- not needed: unlike Shakespeare's plays, these are song lyrics

    if idx > 20: break  
        
    print(sentence) 

Ooh....... New York x2 Grew up in a town that is famous as a place of movie scenes
Noise is always loud, there are sirens all around and the streets are mean
If I can make it here, I can make it anywhere, that's what they say
Seeing my face in lights or my name on marquees found down on Broadway Even if it ain't all it seems, I got a pocket full of dreams
Baby, I'm from New York
Concrete jungle where dreams are made of
There's nothing you can't do
Now you're in New York
These streets will make you feel brand new
Big lights will inspire you
Hear it from New York, New York, New York! On the avenue, there ain't never a curfew, ladies work so hard
Such a melting pot, on the corner selling rock, preachers pray to God
Hail a gypsy-cab, takes me down from Harlem to the Brooklyn Bridge
Some will sleep tonight with a hunger far more than an empty fridge I'm gonna make it by any means, I got a pocket full of dreams
Baby, I'm from New York
Concrete jungle where dreams are made of
There's nothing 

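### 3) Tokenize: splitting sentences by a consistent rule
    
    Add spaces on both sides of punctuation marks
    Convert everything to lowercase so that case does not matter
    Remove all special characters

#### ① Writing a function to prepare the loaded sentences for tokenization  
Let's remove unnecessary punctuation such as quotation marks, case differences, and special characters!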

In [27]:
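def preprocess_sentence(sentence):
    sentence = sentence.lower().strip()  # 1. lowercase and trim surrounding whitespace
    sentence = re.sub(r"([?.!,¿])", r" \1 ", sentence)  # 2. put spaces around punctuation marks
    sentence = re.sub(r'[" "]+', " ", sentence)  # 3. collapse runs of spaces (and double quotes) into one space
    sentence = re.sub(r"[^a-zA-Z?.!,¿]+", " ", sentence)  # 4. replace everything except letters and ?.!,¿ with a space
    sentence = sentence.strip()  # 5. trim surrounding whitespace again
    sentence = '<start> ' + sentence + ' <end>'  # 6. wrap the sentence with <start> and <end> markers
    return sentence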

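We can check that the function works as intended.  
Let's collect the sentences processed by the function.  
  --> The "<start>" and "<end>" markers were also added before and after each sentence.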

In [28]:
print(preprocess_sentence("This @_is ;;;sample        sentence."))

<start> this is sample sentence . <end>


#### ② Collecting the cleaned sentences

In [49]:
corpus = []

for sentence in raw_corpus:
    if len(sentence) == 0: continue       # skip empty lines
    if sentence[-1] == ":": continue      # skip lines that end with a colon

    preprocessed_sentence = preprocess_sentence(sentence)
    corpus.append(preprocessed_sentence)
        
corpus[:20]

['<start> ooh . . . . . . . new york x grew up in a town that is famous as a place of movie scenes <end>',
 '<start> noise is always loud , there are sirens all around and the streets are mean <end>',
 '<start> if i can make it here , i can make it anywhere , that s what they say <end>',
 '<start> seeing my face in lights or my name on marquees found down on broadway even if it ain t all it seems , i got a pocket full of dreams <end>',
 '<start> baby , i m from new york <end>',
 '<start> concrete jungle where dreams are made of <end>',
 '<start> there s nothing you can t do <end>',
 '<start> now you re in new york <end>',
 '<start> these streets will make you feel brand new <end>',
 '<start> big lights will inspire you <end>',
 '<start> hear it from new york , new york , new york ! on the avenue , there ain t never a curfew , ladies work so hard <end>',
 '<start> such a melting pot , on the corner selling rock , preachers pray to god <end>',
 '<start> hail a gypsy cab , takes me down f
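
In the cleaned corpus above, contractions come out split (`ain't` becomes `ain t`, `that's` becomes `that s`) because step 4 of `preprocess_sentence` replaces the apostrophe with a space. If keeping contractions as single tokens is preferred, one possible variant is sketched below. This is not what the notebook ran: `preprocess_keep_apostrophes` is a hypothetical name, and the only intended change is allowing `'` in the character whitelist.

```python
import re


def preprocess_keep_apostrophes(sentence):
    """Variant of preprocess_sentence that leaves apostrophes inside words."""
    sentence = sentence.lower().strip()
    # Put spaces around punctuation marks so they become separate tokens.
    sentence = re.sub(r"([?.!,¿])", r" \1 ", sentence)
    # Replace everything except lowercase letters, apostrophes and ?.!,¿ with a space.
    sentence = re.sub(r"[^a-z'?.!,¿]+", " ", sentence)
    # Collapse repeated spaces and trim the ends.
    sentence = re.sub(r" +", " ", sentence).strip()
    return "<start> " + sentence + " <end>"


print(preprocess_keep_apostrophes("There's nothing you can't do"))
print(preprocess_keep_apostrophes("This @_is ;;;sample        sentence."))
```

Since the tokenizer in this notebook is created with `filters=' '`, tokens such as `can't` would be kept as they are; whether that helps the model is something to check by experiment.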

### Tokenizing the loaded text with the function  

In [50]:
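def tokenize(corpus):
    tokenizer = tf.keras.preprocessing.text.Tokenizer(
        num_words=150000,   # about 80% of the whole data; note that num_words caps the vocabulary size
        filters=' ',        # the sentences are already cleaned, so no extra filtering
        oov_token="<unk>"   # words outside the vocabulary are mapped to <unk>
    )
    tokenizer.fit_on_texts(corpus)                 # build the word index from the corpus
    tensor = tokenizer.texts_to_sequences(corpus)  # convert each sentence to a list of word indices
    # pad the end of every sequence so that all rows share the length of the longest one
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')

    print(tensor, tokenizer)
    return tensor, tokenizer

tensor, tokenizer = tokenize(corpus)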

In [51]:
print(tensor[:3, :10])

[[  2 133  72  72  72  72  72  72  72  39]
 [  2 473  29 127 425   4  58  88 474  32]
 [  2  27   5  20  70  10  93   4   5  20]]


In [52]:
for idx in tokenizer.index_word:
    print(idx, ":", tokenizer.index_word[idx])

    if idx >= 10: break

1 : <unk>
2 : <start>
3 : <end>
4 : ,
5 : i
6 : you
7 : the
8 : t
9 : to
10 : it
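
To make the listing above easier to read, here is a rough pure-Python sketch of a frequency-ranked word index: the out-of-vocabulary token takes index 1 and the remaining words are numbered from the most frequent down. It only illustrates the idea and is not the Keras implementation; `build_word_index`, `encode`, and `toy_corpus` are hypothetical names.

```python
from collections import Counter


def build_word_index(corpus, oov_token="<unk>"):
    """Map each word to a rank, with the OOV token at 1 and frequent words first."""
    counts = Counter(word for sentence in corpus for word in sentence.split())
    word_index = {oov_token: 1}
    for rank, (word, _count) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def encode(sentence, word_index, oov_token="<unk>"):
    """Replace each word with its index; unseen words get the OOV index."""
    return [word_index.get(word, word_index[oov_token]) for word in sentence.split()]


toy_corpus = [
    "<start> baby , i m from new york <end>",
    "<start> now you re in new york <end>",
]
word_index = build_word_index(toy_corpus)
encoded = encode("<start> hello new york <end>", word_index)
print(word_index)
print(encoded)
```

The word `hello` never appears in `toy_corpus`, so it is encoded with the index of `<unk>`. Index 0 is left unused here, which matches the way 0 is reserved for padding later in the notebook.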


#### Source sentence: the sentence that is fed to the model as input  

#### Target sentence: the output sentence of the model that serves as the correct answer

In [56]:
src_input = tensor[:, :-1]  # every token except the last one in each row
tgt_input = tensor[:, 1:]   # every token except the leading <start>

print(src_input[0])
print(tgt_input[0])

[  2 133  72  72  72  72  72  72  72  39  50 315 469  63  16  12 424  19
  29 470 256  12 219  26 471 472   3   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0]
[133  72  72  72  72  72  72  72  39  50 315 469  63  16  12 424  19  29
 470 256  12 219  26 471 472   3   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0]


#### We can see that the sequence starts with 2 (start), ends with 3 (end), and is then filled with 0 (pad).
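
The source/target slicing can be tried out on a tiny example without TensorFlow. The sketch below uses plain Python lists; `pad_post` and `split_source_target` are hypothetical helpers that mimic post-padding and the `[:, :-1]` / `[:, 1:]` slices, and the token ids are made up apart from 2 (start), 3 (end) and 0 (pad).

```python
PAD = 0  # padding index, as in the output above


def pad_post(sequences, pad_value=PAD):
    """Pad the end of every sequence up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


def split_source_target(padded):
    """Source drops the last token of each row; target drops the first one."""
    src = [row[:-1] for row in padded]
    tgt = [row[1:] for row in padded]
    return src, tgt


toy = pad_post([[2, 133, 72, 3], [2, 27, 5, 20, 70, 3]])
src, tgt = split_source_target(toy)

for s, t in zip(src, tgt):
    print("src:", s)
    print("tgt:", t)
```

At every position the target holds the token that follows the source token, so the model learns to predict the next word. Only the longest row loses its end token in the source; shorter rows lose one pad instead.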

In [57]:
BUFFER_SIZE = len(src_input)
BATCH_SIZE = 256
steps_per_epoch = len(src_input) // BATCH_SIZE

VOCAB_SIZE = tokenizer.num_words + 1   

dataset = tf.data.Dataset.from_tensor_slices((src_input, tgt_input))
dataset = dataset.shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
dataset

<BatchDataset shapes: ((256, 44), (256, 44)), types: (tf.int32, tf.int32)>

● Build the corpus using regular expressions  
● Convert the corpus to a tensor using tf.keras.preprocessing.text.Tokenizer  
● Convert the corpus tensor to a tf.data.Dataset object using tf.data.Dataset.from_tensor_slices()  

# Step 4. Splitting off an evaluation dataset

### We will use the train_test_split() function from the sklearn module to separate the training data from the evaluation data.   

enc_train, enc_val, dec_train, dec_val = <write code here>

In [59]:
from sklearn.model_selection import train_test_split

In [None]:
# split the source/target tensors 80/20; the targets are token sequences, so no stratification is used
enc_train, enc_val, dec_train, dec_val = train_test_split(src_input, tgt_input, test_size=0.2, shuffle=True, random_state=34)

# Training the model

In [61]:
class TextGenerator(tf.keras.Model):
    def __init__(self, vocab_size, embedding_size, hidden_size):
        super().__init__()
        
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_size)
        self.rnn_1 = tf.keras.layers.LSTM(hidden_size, return_sequences=True)
        self.rnn_2 = tf.keras.layers.LSTM(hidden_size, return_sequences=True)
        self.linear = tf.keras.layers.Dense(vocab_size)
        
    def call(self, x):
        out = self.embedding(x)
        out = self.rnn_1(out)
        out = self.rnn_2(out)
        out = self.linear(out)
        
        return out
    
embedding_size = 256
hidden_size = 1024
model = TextGenerator(tokenizer.num_words + 1, embedding_size , hidden_size)

In [62]:
for src_sample, tgt_sample in dataset.take(1): break

model(src_sample)

ResourceExhaustedError: OOM when allocating tensor with shape[256,44,150001] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:BiasAdd]
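
The shape in the error message, `[256,44,150001]`, is the output of the final Dense layer: batch size × sequence length × vocabulary size. A quick back-of-the-envelope calculation shows why a single tensor of that shape is already very large. The sketch assumes 4 bytes per element (the error reports type `float`, i.e. float32); the vocabulary size 12001 in the second call is a hypothetical smaller value used only for comparison.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def logits_bytes(batch_size, seq_len, vocab_size, bytes_per_float=4):
    """Size in bytes of one float32 tensor of shape [batch_size, seq_len, vocab_size]."""
    return batch_size * seq_len * vocab_size * bytes_per_float


reported = logits_bytes(256, 44, 150001)  # shape from the error message
smaller = logits_bytes(256, 44, 12001)    # hypothetical smaller vocabulary

print(f"vocab 150001: {reported / GIB:.2f} GiB")
print(f"vocab  12001: {smaller / GIB:.2f} GiB")
```

This counts only one output tensor, not the gradients or the other layers. The Dense layer width comes from `tokenizer.num_words + 1`, so it is 150001 no matter how many distinct words the corpus actually contains; comparing it with `len(tokenizer.word_index)` and lowering `num_words`, or reducing `BATCH_SIZE`, are the obvious knobs to try.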

In [24]:
model.summary()

ValueError: This model has not yet been built. Build the model first by calling `build()` or calling `fit()` with some data, or specify an `input_shape` argument in the first layer(s) for automatic build.

In [45]:
optimizer = tf.keras.optimizers.Adam()
loss = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True,
    reduction='none'
)

model.compile(loss=loss, optimizer=optimizer)
model.fit(dataset, epochs=30)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x7fdced49d160>

In [46]:
def generate_text(model, tokenizer, init_sentence="<start>", max_len=20):
    test_input = tokenizer.texts_to_sequences([init_sentence])
    test_tensor = tf.convert_to_tensor(test_input, dtype=tf.int64)
    end_token = tokenizer.word_index["<end>"]

    while True:
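        # 1. run the model on the tokens generated so far
        predict = model(test_tensor) 
        # 2. take the most probable word index at the last time step
        predict_word = tf.argmax(tf.nn.softmax(predict, axis=-1), axis=-1)[:, -1] 
        # 3. append the predicted word to the end of the input
        test_tensor = tf.concat([test_tensor, tf.expand_dims(predict_word, axis=0)], axis=-1)
        # 4. stop at <end> or once max_len tokens have been produced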
        if predict_word.numpy()[0] == end_token: break
        if test_tensor.shape[1] >= max_len: break

    generated = ""
    
    for word_index in test_tensor[0].numpy():
        generated += tokenizer.index_word[word_index] + " "

    return generated

In [47]:
generate_text(model, tokenizer, init_sentence="<start> he")

'<start> he , nah nah nah nah nah nah nah nah nah nah nah nah nah nah nah nah nah '
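
The repeated `nah` is typical of greedy decoding: `generate_text` always takes the argmax, so once the model favors one word it can keep picking it. A common alternative, which this notebook does not use, is to sample the next word from the softmax distribution, optionally with a temperature. The sketch below shows the difference on made-up logits in pure Python; `softmax`, `greedy_pick`, `sample_pick`, and `toy_logits` are hypothetical names.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Always return the index with the largest logit (what argmax does)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, temperature=1.0, rng=random):
    """Draw one index at random, weighted by the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.5, 0.1]  # made-up scores for a three-word vocabulary
rng = random.Random(0)        # fixed seed so the run is repeatable
picks = [sample_pick(toy_logits, temperature=1.0, rng=rng) for _ in range(1000)]

print("greedy always picks:", greedy_pick(toy_logits))
print("sampling picked indices:", sorted(set(picks)))
```

Greedy decoding returns the same index every time, while sampling also visits the less likely words. In `generate_text` the equivalent change would replace the `tf.argmax` step; that is left as an experiment.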

# Step 5. Building the AI  

Adjust the model's Embedding Size and Hidden Size to design a model that can bring val_loss down to around 2.2 within 10 epochs! (Use the loss function given below as is!)

Then submit one line of lyrics generated by your great model.

In [58]:
loss = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

In [44]:
generate_text(lyricist, tokenizer, init_sentence="<start> i love", max_len=20)

NameError: name 'generate_text' is not defined

# Review

Unfinished.