# Lab 6: Training Deep Recurrent Neural Network - Part 2

Name1, Student's ID1<br>
Name2, Student's ID2<br>
Name3, Student's ID3<br>

**Note: Please name your file**

## Lab Instruction - Language Modelling and Text Classification

In this lab, you will learn to train a deep recurrent neural network with LSTM layers, using the Keras library with the TensorFlow backend. Your task is to implement a language model and use it for text generation.

Select your favourite book from https://www.gutenberg.org/browse/scores/top and download it as a text file. Then, you will train your language model using RNN-LSTM. 

- Language model (in Thai): http://bit.ly/language_model_1
- Tutorial on how to create a language model (in English): https://medium.com/@shivambansal36/language-modelling-text-generation-using-lstms-deep-learning-for-nlp-ed36b224b275

To evaluate the model, the perplexity measurement is used: https://stats.stackexchange.com/questions/10302/what-is-perplexity
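
As a rough illustration of the measurement (a sketch, not part of the original lab; the function name `perplexity_from_probs` and the probabilities are made up for the example), perplexity is the exponential of the average negative log-probability that the model assigns to the correct next words:

```python
import math

def perplexity_from_probs(true_word_probs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    nll = [-math.log(p) for p in true_word_probs]
    return math.exp(sum(nll) / len(nll))

# A model that always gives the correct word probability 1/4 is as
# uncertain as a uniform choice among 4 words.
uniform_ppl = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])

# A more confident model (higher probability on the correct words)
# gets a lower perplexity.
confident_ppl = perplexity_from_probs([0.9, 0.8, 0.95])

print(uniform_ppl, confident_ppl)
```

Lower is better: a perplexity close to 1 means the model almost always puts its probability mass on the correct word.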

Lastly, fine-tune your model: try different hyperparameters or add more data, then discuss your results.



**The total lab score is 20 points, evaluated as follows:**<br>
1. Specification (follow the instructions; this includes the model tuning section, where you must do a reasonable amount of tuning) - 10 points
2. Design of logic (no weird steps in the process) - 5 points
3. Journaling (communicate your thought process, comment your code, and discuss and analyse the results **at every step**) - 5 points



In [27]:
import keras
import numpy as np
import keras.backend as K
from keras import models
from keras import layers
from keras.preprocessing import text
from keras.preprocessing import sequence

Using TensorFlow backend.


#### 1. Load your data 

In [22]:
# Load data
# Read the file with an explicit encoding and close it when done
with open('data/alice_in_wonderland.txt', encoding='utf-8') as f:
    raw_text = f.read()

In [23]:
# Inspect data
raw_text[:200]

'\ufeffCHAPTER I. Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was rea'

In [5]:
chars = sorted(list(set(raw_text)))

In [25]:
print("Unique characters: ", len(chars))
print("Total words: ", len(raw_text.split()))

Unique characters:  72
Total words:  27264


#### 2. Data Preprocessing 

*Note that only the story itself is used as the dataset; footnotes and credits are not included.*

The newline symbol `'\n'` marks the end of a line. A blank line (`'\n\n'`) is replaced with the end-of-sequence token ``<EOS>``, which tells the model where to end the sentence.

To create a corpus for your model, the following code can be used:<br>
*Note that other techniques can also be used.*

```python
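# Cut the text into semi-redundant sequences of maxlen characters.
maxlen, step = 40, 3  # example values only; tune them for your data
sentences, next_chars = [], []
for i in range(0, len(raw_text) - maxlen, step):
    sentences.append(raw_text[i: i + maxlen])  # input window
    next_chars.append(raw_text[i + maxlen])    # character to predict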
```

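The code slides a window over the text from the first character to the last. `maxlen` defines the length of each input window, `step` defines how far the window moves between samples, and the character that follows each window is the target the model must predict.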
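
To make the windowing concrete, here is a small sketch on a toy string. The helper name `make_windows` and the values `maxlen=4`, `step=2` are chosen only for illustration and are not part of the lab template:

```python
def make_windows(text, maxlen, step):
    """Return (windows, next_chars) using the same loop as the snippet above."""
    sentences, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        sentences.append(text[i: i + maxlen])   # input window
        next_chars.append(text[i + maxlen])     # character that follows it
    return sentences, next_chars

windows, targets = make_windows("alice was", maxlen=4, step=2)
for window, target in zip(windows, targets):
    print(repr(window), "->", repr(target))
# 'alic' -> 'e'
# 'ice ' -> 'w'
# 'e wa' -> 's'
```

Because `step` is smaller than `maxlen`, consecutive windows overlap, which is why the sequences are called semi-redundant.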


In [24]:
# Adding end of string symbol
raw_text = raw_text.replace('\n\n', " <EOS> ")
raw_text[:200]

'\ufeffCHAPTER I. Down the Rabbit-Hole <EOS> Alice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister wa'

In [36]:
# Preprocessing 
# Create corpus & Vectorization

tokenizer = text.Tokenizer()

# basic cleanup
corpus = raw_text.lower().split("\n")

# tokenization
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

# create input sequences using list of tokens
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# pad sequences 
max_sequence_len = max([len(x) for x in input_sequences])

# Pre padding 
input_sequences = np.array(sequence.pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# create predictors and label
predictors, label = input_sequences[:,:-1],input_sequences[:,-1]

# One-hot label
label = keras.utils.to_categorical(label, num_classes=total_words)

In [47]:
n_gram_sequence[0]

2908

In [38]:
print(input_sequences[10])

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0 1596   15   37    1  100  632    3   11   13  273
    5  107]


In [40]:
print('Max sequence len: %s' % max_sequence_len)

Max sequence len: 128
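
The preprocessing cell above builds one training sample per prefix of each tokenized line and then pads on the left. The following pure-Python sketch mimics those two steps on made-up token ids; `ngram_prefixes` and `pad_pre` are hypothetical helpers written for this illustration, not Keras functions:

```python
def ngram_prefixes(token_ids):
    """All prefixes of length >= 2, as in the input_sequences loop."""
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]

def pad_pre(seqs, maxlen, value=0):
    """Left-pad each sequence with `value` (truncating from the front if too long)."""
    return [[value] * (maxlen - len(s)) + s[-maxlen:] for s in seqs]

line = [5, 12, 7, 3]                      # made-up token ids for one line
prefixes = ngram_prefixes(line)           # [[5, 12], [5, 12, 7], [5, 12, 7, 3]]
padded = pad_pre(prefixes, maxlen=4)

# The last token of each row is the label; everything before it is the input.
predictors_demo = [row[:-1] for row in padded]
labels_demo = [row[-1] for row in padded]
print(predictors_demo)
print(labels_demo)
```

With pre-padding, the real tokens sit at the end of each row, next to the position where the LSTM emits its final state.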


#### 3. Language Model

In [41]:
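def perplexity(y_true, y_pred):
    # Keras computes cross-entropy with the natural logarithm,
    # so perplexity is exp(mean cross-entropy), not 2 ** cross-entropy.
    cross_entropy = keras.backend.categorical_crossentropy(y_true, y_pred)
    return keras.backend.exp(keras.backend.mean(cross_entropy))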

In [43]:
# Define your model
# Used Word Embedding 

model = models.Sequential()
model.add(layers.Embedding(total_words, 512,input_length=max_sequence_len-1,name='Embedding'))
model.add(layers.LSTM(512, kernel_initializer = 'he_normal',
                      dropout=0.3,
                      return_sequences=True,
                     name='LSTM1'))
model.add(layers.LSTM(256, kernel_initializer = 'he_normal',
                     dropout=0.3,
                     name='LSTM2'))
model.add(layers.Dense(total_words, activation='softmax',name='Output'))

model.compile(optimizer='rmsprop',loss='categorical_crossentropy', metrics=[perplexity])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Embedding (Embedding)        (None, 127, 512)          1489920   
_________________________________________________________________
LSTM1 (LSTM)                 (None, 127, 512)          2099200   
_________________________________________________________________
LSTM2 (LSTM)                 (None, 256)               787456    
_________________________________________________________________
Output (Dense)               (None, 2910)              747870    
Total params: 5,124,446
Trainable params: 5,124,446
Non-trainable params: 0
_________________________________________________________________
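
The parameter counts in the summary above can be reproduced by hand, which is a useful sanity check when changing layer sizes. This is a sketch using the standard formulas for these layer types; the helper function names are made up for the example:

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

total_words = 2910  # vocabulary size shown in the Output layer above
counts = [
    embedding_params(total_words, 512),  # Embedding
    lstm_params(512, 512),               # LSTM1
    lstm_params(512, 256),               # LSTM2
    dense_params(256, total_words),      # Output
]
print(counts, sum(counts))
```

Note that the embedding and output layers scale with the vocabulary size, so reducing the vocabulary shrinks the model considerably.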


In this lab, we will use perplexity as a metric:

```python
def perplexity(y_true, y_pred):
    # Keras computes cross-entropy with the natural logarithm,
    # so perplexity is exp(mean cross-entropy), not 2 ** cross-entropy.
    cross_entropy = keras.backend.categorical_crossentropy(y_true, y_pred)
    return keras.backend.exp(keras.backend.mean(cross_entropy))
```

For how to use a custom metric function, see https://keras.io/metrics/

In [None]:
# Training your model
history = model.fit(predictors, label,batch_size=32, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10

#### 4. Evaluate your model 

In [5]:
# Create a function to evaluate your model using the perplexity measurement (you can try adding other measurements as well)
def evaluate_result(features, label, model):
    # Returns [loss, perplexity], matching the metrics passed to model.compile
    return model.evaluate(features, label)
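
Evaluating on the same samples used for training mostly measures memorization, so one option is to hold out part of the data. The sketch below only produces index lists; `train_val_split`, the 10% fraction, and the seed are assumptions for illustration and are not prescribed by the lab:

```python
import random

def train_val_split(n_samples, val_fraction=0.1, seed=42):
    """Return shuffled (train_indices, val_indices) for n_samples rows."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # reproducible shuffle
    n_val = int(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = train_val_split(1000, val_fraction=0.1)
print(len(train_idx), len(val_idx))
```

With the NumPy arrays from the preprocessing step, `predictors[train_idx]` and `label[train_idx]` could be used for fitting, and `predictors[val_idx]` and `label[val_idx]` passed to `evaluate_result`.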

#### 5. Text generating

In [3]:
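def generate_text(seed_text, max_sequence_len, tokenizer):
    # Loop through the next n words
    for _ in range(200):
        # Preprocess your seed_text and predict the output
        # ======
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = sequence.pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # Greedy choice: argmax over the softmax output. This gives the same
        # result as predict_classes, which was removed from newer Keras versions.
        predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]
        # ======

        output_word = ""
        for word, index in tokenizer.word_index.items():
            # convert the predicted index back to a word string
            if index == predicted:
                output_word = word
                break
        seed_text += " " + output_word
        # Assumed stop condition (the original `if` was left unfinished):
        # the Tokenizer lowercases "<EOS>" and strips "<" and ">", leaving "eos".
        if output_word == "eos":
            break
    return seed_text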

In [None]:
# generate your sample text
seed_text = input('Enter your start sentence:')
gen_text = generate_text(seed_text, max_sequence_len, tokenizer)

#### 6. Model Tuning 

In [48]:
# Try out different hyperparameter & model architecture
tokenizer.word_index.items()
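
One simple way to organize the tuning experiments is to enumerate the configurations up front and log the perplexity of each run. This is only a sketch: the parameter names and candidate values in `search_space` are assumptions for illustration, and the training call itself is left out:

```python
import itertools

# Hypothetical search space; replace with the values you want to compare.
search_space = {
    "embedding_dim": [128, 512],
    "lstm_units": [256, 512],
    "dropout": [0.2, 0.3],
}

# Every combination of the candidate values, as one dict per experiment
configs = [
    dict(zip(search_space, values))
    for values in itertools.product(*search_space.values())
]

for config in configs:
    print(config)  # build, train, and evaluate one model per configuration here
```

Keeping the configurations in a list makes it easy to record each result next to the settings that produced it when discussing the outcome.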



#### 7. Help your model to generate a short story 

**Example** https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6

Write your result in a `markdown` cell

In [6]:
# Create your short-story from your model (Add your creativity here)
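
Always picking the most likely word tends to produce repetitive text. A common alternative is temperature sampling, sketched here in pure Python; `apply_temperature` and `sample_index` are hypothetical helpers, and `probs` stands in for one row of the model's softmax output:

```python
import math
import random

def apply_temperature(probs, temperature=1.0):
    """Rescale a probability distribution: <1 sharpens it, >1 flattens it."""
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)                              # subtract max for stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def sample_index(probs, temperature=1.0, rng=random):
    """Draw a word index at random according to the rescaled distribution."""
    scaled = apply_temperature(probs, temperature)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]

probs = [0.5, 0.3, 0.2]                  # made-up softmax output
sharper = apply_temperature(probs, 0.5)  # more mass on the top word
flatter = apply_temperature(probs, 2.0)  # closer to uniform
next_index = sample_index(probs, temperature=0.8, rng=random.Random(0))
print(sharper, flatter, next_index)
```

A temperature near 0 approaches the greedy choice, while higher values trade coherence for variety, which can be worth experimenting with for the short story.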

### More on Natural Language Processing and Language Models
1. https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e 
2. https://medium.com/phrasee/neural-text-generation-generating-text-using-conditional-language-models-a37b69c7cd4b
3. http://karpathy.github.io/2015/05/21/rnn-effectiveness/

**Music generated by an RNN**
https://soundcloud.com/optometrist-prime/recurrence-music-written-by-a-recurrent-neural-network


### References

[1] https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/ <br>
[2] **Pre padding** https://stackoverflow.com/questions/46298793/how-does-choosing-between-pre-and-post-zero-padding-of-sequences-impact-results