# Lab 6: Training Deep Recurrent Neural Network - Part 2


## Lab Instruction - Language Modelling and Text Classification

In this lab, you will learn to train a deep recurrent neural network with LSTM layers using the Keras library with the TensorFlow backend. Your task is to implement natural language modelling and text generation.

Select your favourite book from https://www.gutenberg.org/browse/scores/top and download it as a text file. You will then train your language model on it using an LSTM-based RNN.

- Language model (in Thai): http://bit.ly/language_model_1
- Tutorial on how to create a language model (in English): https://medium.com/@shivambansal36/language-modelling-text-generation-using-lstms-deep-learning-for-nlp-ed36b224b275

To evaluate the model, perplexity is used: https://stats.stackexchange.com/questions/10302/what-is-perplexity

Finally, fine-tune your model. Try different hyperparameters or add more data, then discuss your results.



**The total lab score is 20 points, evaluated as follows:**<br>
1. Specification (Follow the instructions. This includes the model tuning section, where you must do a proper amount of tuning) - 10 points
2. Design of logic (No weird things in the process) - 5 points
3. Journaling (Communicate your thought process, comment your code, and discuss and analyse the results **in every step**) - 5 points



In [29]:
#importing libraries
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense,Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku 
import numpy as np

#### 1. Load your data 

In [63]:
# Load data
# "utf-8-sig" drops the byte-order mark (BOM) at the start of the file, if any
with open("219-0.txt", "r", encoding="utf-8-sig") as file:
    data = file.read()
print(data)

HEART OF DARKNESS

By Joseph Conrad




I


The Nellie, a cruising yawl, swung to her anchor without a flutter of
the sails, and was at rest. The flood had made, the wind was nearly
calm, and being bound down the river, the only thing for it was to come
to and wait for the turn of the tide.

The sea-reach of the Thames stretched before us like the beginning of
an interminable waterway. In the offing the sea and the sky were welded
together without a joint, and in the luminous space the tanned sails
of the barges drifting up with the tide seemed to stand still in red
clusters of canvas sharply peaked, with gleams of varnished sprits. A
haze rested on the low shores that ran out to sea in vanishing flatness.
The air was dark above Gravesend, and farther back still seemed
condensed into a mournful gloom, brooding motionless over the biggest,
and the greatest, town on earth.

The Director of Companies was our captain and our host. We four
affectionately watched his back as he stood in the

#### 2. Data Preprocessing 

*Note that only the story itself is used as the dataset; footnotes and credits are not included.*

The symbol '\n' marks the end of a line (``<EOS>``), which tells our model to end the sentence there.

To create a corpus for your model, the following code can be used:<br>
*Note that other techniques can also be used.*

```python
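# Cut the text into semi-redundant sequences of maxlen characters.
# `text` is the corpus string; the maxlen and step values are only examples.
maxlen, step = 40, 3
sentences, next_chars = [], []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])  # input window
    next_chars.append(text[i + maxlen])    # character to predict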
```

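The code slides a window over the text from the start to the end. `maxlen` sets the length of each input sequence, `step` sets the stride between consecutive windows, and the element that immediately follows each window is the label the model has to predict.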
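As a minimal sketch of the sliding-window snippet above, the following runs it on a short toy string so that the windows and their labels can be inspected. The toy string and the values of `maxlen` and `step` are illustrative assumptions, not values prescribed by the lab.

```python
# Toy corpus; in the lab this would be the book text loaded earlier.
text = "the nellie, a cruising yawl"
maxlen = 10  # length of each input window (illustrative value)
step = 3     # stride between consecutive windows (illustrative value)

sentences = []   # input windows of maxlen characters
next_chars = []  # the character that follows each window (the label)
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])

# Show each window next to the character the model should predict.
for window, target in zip(sentences, next_chars):
    print(repr(window), "->", repr(target))
```

A smaller `step` produces more, heavily overlapping training examples; a larger `step` produces fewer examples but trains faster.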


In [None]:
# Preprocessing 
# Create corpus & Word vectorization
tokenizer = Tokenizer()

# Lowercase all text and split it into lines on the EOS marker (\n)
corpus = data.lower().split("\n")
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

input_sequences = []
for line in corpus:
    # Convert the line to a sequence of word indices, e.g. [1, 2, 3, 4, ...]
    token_list = tokenizer.texts_to_sequences([line])[0] 
    for i in range(1, len(token_list)):
        # Each prefix of the line (at least two tokens) becomes one training sequence
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
        
# Find the maximum sequence length
max_sequence_len = max([len(x) for x in input_sequences])

# Pre-pad every sequence with zeros up to max_sequence_len
input_sequences = np.array(pad_sequences(input_sequences,
                          maxlen=max_sequence_len, padding='pre'))

Sentence: "they are learning data science"
<table align="left" >
  <tr>
    <th style = "text-align: center">PREDICTORS</th>
    <th style = "text-align: center">LABEL</th> 
  </tr>
  <tr>
    <td style = "text-align: left">they</td>
    <td>are</td> 
  </tr>
  <tr>
    <td style = "text-align: left">they are</td> 
    <td>learning</td>
  </tr>
  <tr>
    <td style = "text-align: left">they are learning</td>
    <td>data</td> 
  </tr>
  <tr>
    <td style = "text-align: left">they are learning data</td>
    <td>science</td> 
  </tr>
</table>
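The table above can be reproduced with a few lines of plain Python. This is only a sketch of the idea: the word indices here are assigned in order of first appearance (an assumption made for readability), whereas the Keras `Tokenizer` assigns them by word frequency.

```python
sentence = "they are learning data science"
words = sentence.split()

# Every prefix of the sentence predicts the word that follows it.
pairs = [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]
for prefix, next_word in pairs:
    print(f"{prefix!r} -> {next_word!r}")

# The same idea on integer ids; index 0 is reserved for padding.
vocab = {w: i + 1 for i, w in enumerate(dict.fromkeys(words))}
ids = [vocab[w] for w in words]
seqs = [ids[:i + 1] for i in range(1, len(ids))]

# Pre-pad with zeros so all rows have the same length.
max_len = max(len(s) for s in seqs)
padded = [[0] * (max_len - len(s)) + s for s in seqs]

# The last column is the label, everything before it is the predictor.
X = [row[:-1] for row in padded]
y = [row[-1] for row in padded]
print(X)
print(y)
```

Pre-padding keeps the most recent words next to the label, which is why the label can always be read from the last column.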


In [88]:
predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
label = ku.to_categorical(label, num_classes=total_words)

[   0    0    0    0    0    0    0    0    0    0    0    0    0    1
 2672    3 2673]
2674
[[   0    0    0 ...    0    0 2669]
 [   0    0    0 ...    0 2669    2]
 [   0    0    0 ...    0    0   34]
 ...
 [   0    0    0 ...    1  212    2]
 [   0    0    0 ...  212    2   28]
 [   0    0    0 ...    2   28  367]]


#### 3. Language Model

Define an RNN model using LSTM and a word embedding representation.<br>
We will use perplexity as a metric.

```python
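import keras

def perplexity(y_true, y_pred):
    # Keras computes cross-entropy with the natural log (nats),
    # so perplexity is exp(mean cross-entropy), not 2 ** cross-entropy.
    cross_entropy = keras.backend.categorical_crossentropy(y_true, y_pred)
    return keras.backend.exp(keras.backend.mean(cross_entropy))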
```

To use a custom metric function, see https://keras.io/metrics/

The loss function is `categorical_crossentropy`; any optimization method can be applied.
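To make the link between cross-entropy and perplexity concrete, here is a small sketch in plain Python. It assumes we already have the probability the model assigned to each correct token (the list `probs` is made-up data). Because Keras reports cross-entropy in nats (natural log), the matching perplexity is the exponential of the mean cross-entropy; using base 2 gives the same number only if the cross-entropy is measured in bits.

```python
import math

def perplexity(true_token_probs):
    """Perplexity from the probabilities given to the correct tokens."""
    # Mean cross-entropy in nats (natural log).
    nats = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(nats)

# Made-up example: the model is equally unsure between 4 tokens each time.
probs = [0.25, 0.25, 0.25, 0.25]
print(perplexity(probs))

# The same quantity via base 2: 2 ** (mean cross-entropy in bits).
bits = -sum(math.log2(p) for p in probs) / len(probs)
print(2 ** bits)
```

A model that always spreads its probability evenly over k tokens has a perplexity of about k, so lower values mean the model is less "surprised" by the text.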

In [35]:
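import keras
def perplexity(y_true, y_pred):
    # Keras computes cross-entropy with the natural log (nats),
    # so perplexity is exp(mean cross-entropy), not 2 ** cross-entropy.
    cross_entropy = keras.backend.categorical_crossentropy(y_true, y_pred)
    return keras.backend.exp(keras.backend.mean(cross_entropy))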

In [36]:
# Define your model
input_len = max_sequence_len - 1

model = Sequential()

model.add(Embedding(total_words, 10, input_length=input_len))
model.add(LSTM(150))
model.add(Dropout(0.1))
model.add(Dense(total_words, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=[perplexity])
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 17, 10)            56840     
_________________________________________________________________
lstm_6 (LSTM)                (None, 150)               96600     
_________________________________________________________________
dropout_5 (Dropout)          (None, 150)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 5684)              858284    
Total params: 1,011,724
Trainable params: 1,011,724
Non-trainable params: 0
_________________________________________________________________
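The parameter counts in the summary can be checked by hand, which is a useful sanity test when changing layer sizes. The sketch below uses the sizes visible in the summary (vocabulary of 5684, embedding size 10, 150 LSTM units) and the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias.

```python
vocab_size = 5684   # output width of the Dense layer in the summary
embed_dim = 10      # embedding size passed to Embedding
lstm_units = 150    # units passed to LSTM

# One embedding vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# Four gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Fully connected layer: weight matrix plus one bias per output word.
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters sit in the final `Dense` layer, so the vocabulary size has a larger effect on model size than the number of LSTM units.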


In [38]:
# Training your model
history = model.fit(predictors, label, epochs=20, verbose=1)

Epoch 1/20
...
Epoch 20/20


<keras.callbacks.History at 0x215d405ce48>

#### 4. Evaluate your model 

In [None]:
# Evaluate your model using the perplexity measure (you can try adding other measures as well)
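Perplexity is most informative on data the model has not seen during training. The following is a minimal sketch of one way to hold out part of the samples; the helper names `train_val_split` and `perplexity_from_loss`, the 10% fraction and the seed are all assumptions for illustration, not part of the lab template.

```python
import math
import random

def train_val_split(n_samples, val_fraction=0.1, seed=0):
    """Return shuffled (train_idx, val_idx) index lists (hypothetical helper)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(n_samples * val_fraction))
    return indices[n_val:], indices[:n_val]

def perplexity_from_loss(loss):
    """Convert a mean cross-entropy loss in nats to perplexity."""
    return math.exp(loss)

train_idx, val_idx = train_val_split(1000, val_fraction=0.1)
print(len(train_idx), len(val_idx))

# With the notebook's arrays one could then write (not run here):
#   model.fit(predictors[train_idx], label[train_idx], epochs=20)
#   loss, _ = model.evaluate(predictors[val_idx], label[val_idx], verbose=0)
#   print("held-out perplexity:", perplexity_from_loss(loss))
```

Comparing training and held-out perplexity is a simple way to spot overfitting before moving on to model tuning.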

#### 5. Text generating

In [47]:
def generate_text(seed_text, max_sequence_len, model, n_words=8):
    # Generate n_words words, one at a time (uses the global tokenizer)
    for _ in range(n_words):
        # Preprocess seed_text the same way as the training data
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list],
                                   maxlen=max_sequence_len - 1, padding='pre')
        # Predict the index of the most likely next word
        # (argmax replaces predict_classes, which newer Keras versions removed)
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])

        output_word = ""
        for word, index in tokenizer.word_index.items():
            # Map the predicted index back to its word string
            if index == predicted:
                output_word = word
                break
        seed_text += " " + output_word
    return seed_text

In [103]:
# generate your sample text
text = generate_text("the dog", max_sequence_len, model)
print(text)

the dog of the land the reality of tents voice


#### 6. Model Tuning 

Write down why you designed this architecture or why you chose this set of parameters.<br>
Your team should try at least one different architecture or set of hyperparameters per team member.<br>
Finally, train your best-performing model **for 50 epochs** (or try 100 epochs, but this will take time).<br>
*Note: For the last step, please turn off verbose output during training.*
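One way to keep the tuning organised is to list the candidate configurations up front and train one model per entry. The sketch below only builds that list; the hyperparameter names and values are illustrative assumptions, and the training loop itself is left to you.

```python
from itertools import product

# Illustrative search space; pick values that suit your corpus and time budget.
search_space = {
    "embedding_dim": [10, 50],
    "lstm_units": [100, 150],
    "dropout": [0.1, 0.3],
}

# One dict per combination of values, in a fixed, reproducible order.
configs = [dict(zip(search_space, values))
           for values in product(*search_space.values())]

for config in configs:
    print(config)
```

Recording the perplexity next to each configuration makes it easier to justify the final choice in your write-up.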

In [None]:
# Try out different hyperparameter & model architecture

#### 7. Help your model to generate a short story 

**Example** https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6

Write your result in a `markdown` cell
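Always picking the most likely next word tends to produce repetitive text. A common alternative, sketched here under the assumption that you have the model's output probabilities as a plain list, is to sample the next word after rescaling the distribution with a temperature. The function name `sample_with_temperature` and the example probabilities are hypothetical.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from probs after rescaling by temperature."""
    # Divide log-probabilities by the temperature; the epsilon avoids log(0).
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    # random.choices normalises the weights internally.
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

probs = [0.1, 0.6, 0.3]   # made-up next-word probabilities
rng = random.Random(0)    # fixed seed so runs are repeatable
print([sample_with_temperature(probs, 0.5, rng) for _ in range(10)])
print([sample_with_temperature(probs, 1.5, rng) for _ in range(10)])
```

Temperatures below 1 make the output more conservative, while temperatures above 1 make it more varied but also more error-prone.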

In [None]:
# Create your short-story from your model (Add your creativity here)

### More on Natural Language Processing and Language Models
1. https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e 
2. https://medium.com/phrasee/neural-text-generation-generating-text-using-conditional-language-models-a37b69c7cd4b
3. http://karpathy.github.io/2015/05/21/rnn-effectiveness/

**Music generated by an RNN**
https://soundcloud.com/optometrist-prime/recurrence-music-written-by-a-recurrent-neural-network
