<a href="https://colab.research.google.com/github/seunghyunmoon2/NLP/blob/master/NLP8_Text_Generation/DMN(QnA).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Generation using LSTM

We use an LSTM here because the order of characters in the sequence affects learning. Also, instead of always taking the most probable character (the argmax of the softmax output), **we sample the next character with `np.random.multinomial()`** so that characters with lower probability can also be selected during generation.
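
The sampling idea can be sketched in isolation. This is a minimal illustration, assuming `preds` is a softmax output vector; the names `reweight`, `sample_index` and `temperature` are hypothetical and only mirror what the notebook's `pred_indices` does with its `metric` argument. The small epsilon and the max-subtraction are additions for numerical safety, not part of the notebook.

```python
import numpy as np

def reweight(preds, temperature=1.0):
    """Rescale a probability vector; a lower temperature sharpens it."""
    preds = np.asarray(preds, dtype="float64")
    logits = np.log(preds + 1e-12) / temperature   # epsilon avoids log(0) (assumption)
    exp_logits = np.exp(logits - np.max(logits))   # subtract max for numerical stability
    return exp_logits / np.sum(exp_logits)

def sample_index(preds, temperature=1.0, rng=None):
    """Draw one index from the reweighted distribution instead of taking argmax."""
    if rng is None:
        rng = np.random.default_rng()
    probs = reweight(preds, temperature)
    # multinomial(1, probs) returns a one-hot draw; argmax recovers its index
    return int(np.argmax(rng.multinomial(1, probs)))

preds = [0.7, 0.2, 0.1]  # toy distribution over three characters
for t in (0.2, 1.0, 1.2):
    print(t, np.round(reweight(preds, t), 3))
```

With a low temperature the distribution concentrates on the most likely character, while a higher temperature leaves more room for unlikely characters to be drawn.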

In [None]:
# Automatic text generation from Shakespeare's works using an LSTM
# ----------------------------------------------
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Activation
from tensorflow.keras.optimizers import RMSprop
import numpy as np
import random
import sys

path = './dataset/shakespeare_final.txt'
text = open(path).read().lower()

print('corpus length:', len(text))
characters = sorted(list(set(text)))
print('total chars:', len(characters))

char2indices = dict((c, i) for i, c in enumerate(characters))
indices2char = dict((i, c) for i, c in enumerate(characters))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

# Converting indices into vectorized format
# Use the built-in bool: the np.bool alias is not available in every NumPy release
X = np.zeros((len(sentences), maxlen, len(characters)), dtype=bool)
y = np.zeros((len(sentences), len(characters)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char2indices[char]] = 1
    y[i, char2indices[next_chars[i]]] = 1

# Model building
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(characters))))
model.add(Dense(len(characters)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer=RMSprop(learning_rate=0.01))
model.summary()  # summary() prints the table itself and returns None

# Function to convert prediction into index
def pred_indices(preds, metric=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / metric
    exp_preds = np.exp(preds)
    preds = exp_preds/np.sum(exp_preds)
    probs = np.random.multinomial(1, preds, 1)
    return np.argmax(probs)

# Train & Evaluate the Model
for iteration in range(1, 30):
    print('-' * 40)
    print('Iteration', iteration)
    model.fit(X, y, batch_size=128, epochs=1)

    start_index = random.randint(0, len(text) - maxlen - 1)

    for diversity in [0.2, 0.7, 1.2]:
        print('\n----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x = np.zeros((1, maxlen, len(characters)))
            for t, char in enumerate(sentence):
                x[0, t, char2indices[char]] = 1.

            preds = model.predict(x, verbose=0)[0]
            next_index = pred_indices(preds, diversity)
            pred_char = indices2char[next_index]

            generated += pred_char
            sentence = sentence[1:] + pred_char

            sys.stdout.write(pred_char)
            sys.stdout.flush()
        print("\nOne combination completed \n")

Output from the final iteration:
```
----------------------------------------
Iteration 29
Train on 193798 samples
193798/193798 [==============================] - 105s 544us/sample - loss: 1.2409

----- diversity: 0.2
----- Generating with seed: "of my love,
    and lack not to lose sti"
of my love,
    and lack not to lose still the strong.
    the particess is a soldier's faster?
  cleopatra. who have the seem of the soldiers of him.
    the self to a man and some strong.
    which i will be thee a sun in thee as the strong
    of the rememund's through the care of the fortune,
    and some for my soldiers and the complete with partice
    the messenger. and the particess of the complete works of ephesus. i have fortu
One combination completed 


----- diversity: 0.7
----- Generating with seed: "of my love,
    and lack not to lose sti"
of my love,
    and lack not to lose still person, and i know
  thou my fulvil'd would make the faston and bare bear and emption,
    that can shall dear in this laid to his deed?
    were i be mainter to you, my times some mean's sense,
    all the name, to the world right, and take thee.
    scanth their other.
  cleopatra. not speak thee, amables, the better wrink even of imperier-
    to hear my lov'd to the concitites, and this his
One combination completed 


----- diversity: 1.2
----- Generating with seed: "of my love,
    and lack not to lose sti"
of my love,
    and lack not to lose stight. no, as'd death.
    your much very your involadyes.
  appoaarihier. the a mother gaun.
  cleopatra. well- snoth from my feed, away!
    ix-
    my loves for purse of ewel;  hy', with parry, trims thos
    and i two a shave young
 patiaus that for his worst, wenden thee.
  goiserace.
  cleopatra. my so?' he shall in trength.
  caesar. i am say; if i in heares, and our dis,in as shall
    by my
One combination completed 
```
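
The training data behind these samples is built by cutting the text into overlapping windows and one-hot encoding each character. A toy sketch of that step follows, assuming a short made-up string and a much smaller window than the notebook's `maxlen = 40`; the variable names are illustrative.

```python
import numpy as np

text = "to be or not to be"
maxlen, step = 5, 3  # toy values; the notebook uses maxlen=40, step=3

chars = sorted(set(text))
char2idx = {c: i for i, c in enumerate(chars)}

# Slide a window of maxlen characters over the text, moving step characters each time.
starts = range(0, len(text) - maxlen, step)
windows = [text[i:i + maxlen] for i in starts]
targets = [text[i + maxlen] for i in starts]  # the character that follows each window

# One-hot encode: X is (samples, maxlen, vocab), y is (samples, vocab).
X = np.zeros((len(windows), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(windows), len(chars)), dtype=bool)
for i, window in enumerate(windows):
    for t, ch in enumerate(window):
        X[i, t, char2idx[ch]] = True
    y[i, char2idx[targets[i]]] = True

print(windows[:2], targets[:2])
print(X.shape, y.shape)
```

Because `step` is smaller than `maxlen`, consecutive windows overlap, which is what the "semi-redundant sequences" comment in the code refers to.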

# DMN (Dynamic Memory Network)

Most tasks in natural language processing can be cast into question answering (QA) problems over language input. We introduce the dynamic memory network (DMN), a neural network architecture which processes input sequences and questions, forms episodic memories, and generates relevant answers. Questions trigger an iterative attention process which allows the model to condition its attention on the inputs and the result of previous iterations. These results are then reasoned over in a hierarchical recurrent sequence model to generate answers. The DMN can be trained end-to-end and obtains state-of-the-art results on several types of tasks and datasets: question answering (Facebook's bAbI dataset), text classification for sentiment analysis (Stanford Sentiment Treebank) and sequence modeling for part-of-speech tagging (WSJ-PTB). The training for these different tasks relies exclusively on trained word vector representations and input-question-answer triplets.
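
The iterative attention described above can be illustrated with a heavily simplified sketch. It assumes the facts and the question are already encoded as vectors, and it swaps in plain dot-product softmax attention with an additive memory update in place of the gating and recurrent updates of the actual DMN. The function name `episodic_memory` and all values are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def episodic_memory(facts, question, num_passes=2):
    """Simplified stand-in for iterative attention over encoded facts."""
    memory = question.copy()  # start the memory from the question vector
    attention_history = []
    for _ in range(num_passes):
        # Score each fact against the question and the memory of the previous pass.
        scores = facts @ (question + memory)
        weights = softmax(scores)
        episode = weights @ facts   # attention-weighted summary of the facts
        memory = memory + episode   # naive additive update (assumption, not a GRU)
        attention_history.append(weights)
    return memory, attention_history

rng = np.random.default_rng(0)
facts = rng.normal(size=(3, 4))   # 3 hypothetical fact vectors of dimension 4
question = rng.normal(size=4)     # hypothetical question vector
memory, history = episodic_memory(facts, question)
for p, w in enumerate(history, start=1):
    print("pass", p, np.round(w, 3))
```

Each pass conditions its attention on the memory left by the previous pass, so the weights over the facts can shift between iterations.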

![nlp8-1](https://drive.google.com/uc?export=view&id=13nWfHSHeMbwna8-z5zystgz7DtI5CuiZ)

![nlp8-2](https://drive.google.com/uc?export=view&id=11EQOdvLNK_H7mj74RRpgWEegi1xXVnZl)

## Question and Answer
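
The bAbI files used in this section interleave numbered story sentences with tab-separated question lines. The parsing idea can be tried on an in-memory sample first. This is a small sketch: the helper name `parse_babi` and the sample text are illustrative, and resetting the story after every question is a simplification that mirrors what this notebook does.

```python
import io

sample = io.StringIO(
    "1 Mary moved to the bathroom.\n"
    "2 Daniel went to the garden.\n"
    "3 Where is Mary?\tbathroom\t1\n"
)

def parse_babi(lines):
    """Group story sentences with the question/answer line that follows them."""
    stories, questions, answers = [], [], []
    story = []
    for line in lines:
        _, text = line.rstrip("\n").split(" ", 1)   # drop the leading line number
        if "\t" in text:
            # question line: question, answer, supporting-fact id
            question, answer, _ = text.split("\t")
            stories.append(story)
            questions.append(question.strip())
            answers.append(answer)
            story = []   # start a new story after each question
        else:
            story.append(text)
    return stories, questions, answers

stories, questions, answers = parse_babi(sample)
print(stories, questions, answers)
```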

In [None]:
# Training on Q&A data using a Dynamic Memory Network
# ----------------------------------------------
import collections
import itertools
import nltk
import numpy as np
import matplotlib.pyplot as plt
import random
from tensorflow.keras.layers import Input, Dense, Activation, Dropout
from tensorflow.keras.layers import LSTM, Permute
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Add, Concatenate, Dot
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Example of the file contents:
# 1 Mary moved to the bathroom.\n
# 2 Daniel went to the garden.\n
# 3 Where is Mary?\tbathroom\t1
#
# Return:
# stories = ['Mary moved to the bathroom.\n', 'Daniel went to the garden.\n']
# questions = 'Where is Mary? '
# answers = 'bathroom'
#----------------------------------------------------------------------------
def get_data(infile):
    stories, questions, answers = [], [], []
    story_text = []
    fin = open(infile, "r")  # read the file passed in, not the global Train_File
    for line in fin:
        lno, text = line.split(" ", 1)
        if "\t" in text:
            question, answer, _ = text.split("\t")
            stories.append(story_text)
            questions.append(question)
            answers.append(answer)
            story_text = []
        else:
            story_text.append(text)
    fin.close()
    return stories, questions, answers

Train_File = "./dataset/qa1_single-supporting-fact_train.txt"
Test_File = "./dataset/qa1_single-supporting-fact_test.txt"

# get the data
data_train = get_data(Train_File)
data_test = get_data(Test_File)
print("\n\nTrain observations:",len(data_train[0]),"Test observations:", len(data_test[0]),"\n\n")

# Building Vocab dictionary from Train & Test data
dictnry = collections.Counter()
for stories, questions, answers in [data_train, data_test]:
    for story in stories:
        for sent in story:
            for word in nltk.word_tokenize(sent):
                dictnry[word.lower()] +=1
    for question in questions:
        for word in nltk.word_tokenize(question):
            dictnry[word.lower()]+=1
    for answer in answers:
        for word in nltk.word_tokenize(answer):
            dictnry[word.lower()]+=1

word2indx = {w:(i+1) for i,(w,_) in enumerate(dictnry.most_common())}
word2indx["PAD"] = 0
indx2word = {v:k for k,v in word2indx.items()}

vocab_size = len(word2indx)
print("vocabulary size:",len(word2indx))
print(word2indx)

# compute max sequence length for each entity
story_maxlen = 0
question_maxlen = 0

for stories, questions, answers in [data_train, data_test]:
    for story in stories:
        story_len = 0
        for sent in story:
            swords = nltk.word_tokenize(sent)
            story_len += len(swords)
        if story_len > story_maxlen:
            story_maxlen = story_len
            
    for question in questions:
        question_len = len(nltk.word_tokenize(question))
        if question_len > question_maxlen:
            question_maxlen = question_len
            
print ("Story maximum length:", story_maxlen, "Question maximum length:", question_maxlen)

# Converting data into Vectorized form
def data_vectorization(data, word2indx, story_maxlen, question_maxlen):
    Xs, Xq, Y = [], [], []
    stories, questions, answers = data
    for story, question, answer in zip(stories, questions, answers):
        xs = [[word2indx[w.lower()] for w in nltk.word_tokenize(s)] for s in story]
        xs = list(itertools.chain.from_iterable(xs))
        xq = [word2indx[w.lower()] for w in nltk.word_tokenize(question)]
        Xs.append(xs)
        Xq.append(xq)
        Y.append(word2indx[answer.lower()])
    return pad_sequences(Xs, maxlen=story_maxlen), pad_sequences(Xq, maxlen=question_maxlen),\
           to_categorical(Y, num_classes=len(word2indx))

           
Xstrain, Xqtrain, Ytrain = data_vectorization(data_train, word2indx, story_maxlen, question_maxlen)
Xstest, Xqtest, Ytest = data_vectorization(data_test, word2indx, story_maxlen, question_maxlen)

print("Train story",Xstrain.shape,"Train question", Xqtrain.shape,"Train answer", Ytrain.shape)
print( "Test story",Xstest.shape, "Test question",Xqtest.shape, "Test answer",Ytest.shape)

# Model Parameters
EMBEDDING_SIZE = 128
LATENT_SIZE = 64
BATCH_SIZE = 64
NUM_EPOCHS = 40

# Inputs
story_input = Input(shape=(story_maxlen,))
question_input = Input(shape=(question_maxlen,))

# Story encoder embedding
story_encoder = Embedding(input_dim=vocab_size,
                          output_dim=EMBEDDING_SIZE, 
                          input_length=story_maxlen)(story_input)
story_encoder = Dropout(0.2)(story_encoder)

# Question encoder embedding
question_encoder = Embedding(input_dim=vocab_size,
                             output_dim=EMBEDDING_SIZE,
                             input_length=question_maxlen)(question_input)
question_encoder = Dropout(0.3)(question_encoder)

# Match between story and question
# story_encoder = [None, 14, 128], question_encoder = [None, 4, 128]
# match = [None, 14, 4]
match = Dot(axes=[2, 2])([story_encoder, question_encoder])

# Encode story into vector space of question
story_encoder_c = Embedding(input_dim=vocab_size,
                            output_dim=question_maxlen,
                            input_length=story_maxlen)(story_input)
story_encoder_c = Dropout(0.3)(story_encoder_c)

# Combine match and story vectors
response = Add()([match, story_encoder_c])
response = Permute((2, 1))(response)

# Combine response and question vectors to answers space
answer = Concatenate()([response, question_encoder])
answer = LSTM(LATENT_SIZE)(answer)
answer = Dropout(0.2)(answer)
answer = Dense(vocab_size)(answer)
output = Activation("softmax")(answer)

model = Model(inputs=[story_input, question_input], outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()  # summary() prints the table itself and returns None

# Model Training
history = model.fit([Xstrain, Xqtrain], [Ytrain],
                    batch_size = BATCH_SIZE, 
                    epochs = NUM_EPOCHS,
                    validation_data=([Xstest, Xqtest], [Ytest]))

# Loss plot
plt.title("Episodic Memory Q & A Loss")
plt.plot(history.history["loss"], color="g", label="train")
plt.plot(history.history["val_loss"], color="r", label="validation")
plt.legend(loc="best")
plt.show()

# get predictions of labels
ytest = np.argmax(Ytest, axis=1)
Ytest_ = model.predict([Xstest, Xqtest])
ytest_ = np.argmax(Ytest_, axis=1)

# Select Random questions and predict answers
NUM_DISPLAY = 10
   
for i in random.sample(range(Xstest.shape[0]),NUM_DISPLAY):
    story = " ".join([indx2word[x] for x in Xstest[i].tolist() if x != 0])
    question = " ".join([indx2word[x] for x in Xqtest[i].tolist() if x != 0])  # skip padding
    label = indx2word[ytest[i]]
    prediction = indx2word[ytest_[i]]
    print(story, question, label, prediction)

Output:
```
Epoch 40/40
10000/10000 [==============================] - 1s 148us/sample - loss: 0.4214 - val_loss: 0.3947
mary moved to the bedroom . sandra went back to the garden . where is sandra ? garden garden
mary went to the office . sandra journeyed to the hallway . where is john ? garden office
mary went back to the bathroom . mary went to the office . where is daniel ? kitchen garden
sandra journeyed to the kitchen . mary journeyed to the bathroom . where is john ? hallway office
mary travelled to the kitchen . john journeyed to the bathroom . where is john ? bathroom bathroom
john went back to the bathroom . mary went back to the bedroom . where is sandra ? bedroom bedroom
daniel went to the bathroom . mary journeyed to the bathroom . where is john ? office office
mary went to the garden . daniel went to the garden . where is mary ? garden garden
daniel travelled to the hallway . daniel journeyed to the office . where is sandra ? office office
john went to the bedroom . daniel went back to the bedroom . where is john ? bedroom bedroom
```