
In this note we investigate recurrent neural networks (a vanilla RNN and an LSTM) for generating new text one character at a time. We rely on two sources:
- https://gist.github.com/vinhkhuc/7ec5bf797308279dc587: pure TensorFlow code
- http://karpathy.github.io/2015/05/21/rnn-effectiveness/: Karpathy's blog post on character-level RNNs (the Keras code used in this note is linked in the Keras section)

Both are based on https://github.com/oxford-cs-ml-2015/practical6/blob/master/train.lua

Other references:
- https://github.com/AvoncourtPartners/poems
- https://cloud.google.com/blog/products/gcp/cloud-poetry-training-and-hyperparameter-tuning-custom-text-models-on-cloud-ml-engine 

# We first download original texts written by one author.

In [0]:
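from urllib.request import urlopen
from bs4 import BeautifulSoup

def crawling_oneweb(url):
  '''Download one article and return the text of its body paragraphs.'''
  html = urlopen(url).read()
  # name the parser explicitly; otherwise BeautifulSoup picks one and warns
  soup = BeautifulSoup(html, "html.parser")

  # the body paragraphs are the <p class="Normal"> elements
  main_body = soup.find_all("p", {"class": "Normal"})
  # drop the last matching paragraph, as in the original code;
  # slicing (unlike del) does not fail when nothing matched
  main_body = main_body[:-1]
  text = "".join([p.text for p in main_body])
  return text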

def crawlingwebs(urls):
  '''
  Input:
  urls: a list of URLs
  
  Output:
  the concatenated text extracted from every URL in urls
  '''
  text = ''
  for url in urls:
    text = text + crawling_oneweb(url)
  return text
  
urls = ["https://vnexpress.net/goc-nhin/nguoi-giau-va-thien-tai-3808335.html",
       'https://vnexpress.net/goc-nhin/thoat-khoi-co-don-3886302.html',
       'https://vnexpress.net/goc-nhin/lam-luat-3900478.html',
       'https://vnexpress.net/goc-nhin/thuat-san-dat-vang-3879380.html',
       'https://vnexpress.net/goc-nhin/nhung-mua-xuan-trong-doi-3877957.html',
       'https://vnexpress.net/goc-nhin/than-phan-ruong-dong-3862785.html',
       'https://vnexpress.net/goc-nhin/long-dan-3824537.html',
       'https://vnexpress.net/goc-nhin/ro-rang-voi-dat-3764906.html',
       'https://vnexpress.net/goc-nhin/thuong-nho-dong-bang-3778190.html']

text = crawlingwebs(urls)

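# Pure TensorFlow

 We now explain the structure of the NN in this code.
 - a vanilla RNN with a tanh hidden layer, unrolled over seq_length time steps (this code does not use LSTM cells)

 Variables/constants:
 - Wxh, Whh, Why: input-to-hidden, hidden-to-hidden and hidden-to-output weight matrices
 - bh, by: hidden and output biases
 - hidden_size = 100, seq_length = 25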
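
To make the structure concrete, here is a minimal NumPy sketch of a single time step of such a vanilla RNN. It reuses the variable names of the TensorFlow code below (Wxh, Whh, Why, bh, by); the function name `rnn_step`, the toy sizes and the zero biases are assumptions made only for illustration.

```python
import numpy as np

def rnn_step(x_t, h_prev, Wxh, Whh, Why, bh, by):
    """One time step of a vanilla (tanh) RNN."""
    # new hidden state, shape (1, hidden_size)
    h_t = np.tanh(x_t @ Wxh + h_prev @ Whh + bh)
    # unnormalized scores over the vocabulary, shape (1, vocab_size)
    y_t = h_t @ Why + by
    return h_t, y_t

# Toy sizes for illustration only (the notebook uses hidden_size = 100).
vocab_size, hidden_size = 5, 3
rng = np.random.default_rng(0)
Wxh = rng.normal(scale=0.1, size=(vocab_size, hidden_size))
Whh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
Why = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
bh = np.zeros(hidden_size)   # assumption: zero biases, for simplicity
by = np.zeros(vocab_size)

x_t = np.eye(vocab_size)[[2]]         # one-hot row for character index 2
h_prev = np.zeros((1, hidden_size))   # initial hidden state
h_t, y_t = rnn_step(x_t, h_prev, Wxh, Whh, Why, bh, by)
print(h_t.shape, y_t.shape)
```

Applying `rnn_step` repeatedly, feeding each `h_t` back in as `h_prev`, corresponds to the unrolled loop over `seq_length` in the TensorFlow graph.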

In [0]:
"""
Vanilla Char-RNN using TensorFlow by Vinh Khuc (@knvinh).
Adapted from Karpathy's min-char-rnn.py
https://gist.github.com/karpathy/d4dee566867f8291f086
Requires tensorflow>=1.0
BSD License
"""
import random
import numpy as np
import tensorflow as tf

seed_value = 42
tf.set_random_seed(seed_value)
random.seed(seed_value)

# np.eye(vocab_size) is the identity matrix of size vocab_size x vocab_size;
# indexing it with a list of indices returns the matching one-hot rows

def one_hot(v):
    return np.eye(vocab_size)[v]

data = text
chars = sorted(list(set(text))) # the distinct characters in the text
data_size, vocab_size = len(data), len(chars)
# vocab_size: number of distinct characters
# data_size: total number of characters in the original text

print('Data has %d characters, %d unique.' % (data_size, vocab_size))
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

# Hyper-parameters
hidden_size   = 100  # hidden layer's size
seq_length    = 25   # number of characters for each sample (input)
learning_rate = 1e-1 # defined but unused: AdamOptimizer() below runs with its default learning rate

inputs     = tf.placeholder(shape=[None, vocab_size], dtype=tf.float32, name="inputs")
targets    = tf.placeholder(shape=[None, vocab_size], dtype=tf.float32, name="targets")
init_state = tf.placeholder(shape=[1, hidden_size], dtype=tf.float32, name="state")

initializer = tf.random_normal_initializer(stddev=0.1)

with tf.variable_scope("RNN") as scope:
    hs_t = init_state
    ys = []
    for t, xs_t in enumerate(tf.split(inputs, seq_length, axis=0)):
        if t > 0: scope.reuse_variables()  # Reuse variables
        Wxh = tf.get_variable("Wxh", [vocab_size, hidden_size], initializer=initializer)
        Whh = tf.get_variable("Whh", [hidden_size, hidden_size], initializer=initializer)
        Why = tf.get_variable("Why", [hidden_size, vocab_size], initializer=initializer)
        bh  = tf.get_variable("bh", [hidden_size], initializer=initializer)
        by  = tf.get_variable("by", [vocab_size], initializer=initializer)

        hs_t = tf.tanh(tf.matmul(xs_t, Wxh) + tf.matmul(hs_t, Whh) + bh)
        ys_t = tf.matmul(hs_t, Why) + by
        ys.append(ys_t)

hprev = hs_t
output_softmax = tf.nn.softmax(ys[-1])  # Get softmax for sampling

outputs = tf.concat(ys, axis=0)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=targets, logits=outputs))

# Minimizer
minimizer = tf.train.AdamOptimizer()
grads_and_vars = minimizer.compute_gradients(loss)

# Gradient clipping
grad_clipping = tf.constant(5.0, name="grad_clipping")
clipped_grads_and_vars = []
for grad, var in grads_and_vars:
    clipped_grad = tf.clip_by_value(grad, -grad_clipping, grad_clipping)
    clipped_grads_and_vars.append((clipped_grad, var))

# Gradient updates
updates = minimizer.apply_gradients(clipped_grads_and_vars)

# Session
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

# Initial values
n, p = 0, 0
hprev_val = np.zeros([1, hidden_size])

while True:
    # Initialize
    if p + seq_length + 1 >= len(data) or n == 0:
        hprev_val = np.zeros([1, hidden_size])
        p = 0  # reset

    # Prepare inputs
    input_vals  = [char_to_ix[ch] for ch in data[p:p + seq_length]]
    target_vals = [char_to_ix[ch] for ch in data[p + 1:p + seq_length + 1]]
    
    input_vals  = one_hot(input_vals)
    target_vals = one_hot(target_vals)

    hprev_val, loss_val, _ = sess.run([hprev, loss, updates],
                                      feed_dict={inputs: input_vals,
                                                 targets: target_vals,
                                                 init_state: hprev_val})
    if n % 500 == 0:
        # Progress
        print('iter: %d, p: %d, loss: %f' % (n, p, loss_val))

        # Do sampling
        sample_length = 200
        start_ix      = random.randint(0, len(data) - seq_length)
        print('start_ix', start_ix)
        sample_seq_ix = [char_to_ix[ch] for ch in data[start_ix:start_ix + seq_length]]
        ixes          = []
        sample_prev_state_val = np.copy(hprev_val)

        for t in range(sample_length):
            sample_input_vals = one_hot(sample_seq_ix)
            sample_output_softmax_val, sample_prev_state_val = \
                sess.run([output_softmax, hprev],
                         feed_dict={inputs: sample_input_vals, init_state: sample_prev_state_val})
            # np.ndarray.ravel() returns a flattened (1-d) array
            ix = np.random.choice(range(vocab_size), p=sample_output_softmax_val.ravel())
            
            ixes.append(ix)
            sample_seq_ix = sample_seq_ix[1:] + [ix]

        txt = ''.join(ix_to_char[ix] for ix in ixes)
        print('----\n %s \n----\n' % (txt,))

    p += seq_length
    n += 1

In [0]:
# the above code is equivalent to the following keras structure


In [0]:
input_vals  = [char_to_ix[ch] for ch in data[p:p + seq_length]]
target_vals = [char_to_ix[ch] for ch in data[p + 1:p + seq_length + 1]]
print(input_vals)
print(target_vals)

[1, 49, 94, 59, 53, 1, 53, 55, 47, 59, 1, 65, 63, 66, 76, 59, 9, 1, 26, 60, 59, 1, 122, 58, 1]
[49, 94, 59, 53, 1, 53, 55, 47, 59, 1, 65, 63, 66, 76, 59, 9, 1, 26, 60, 59, 1, 122, 58, 1, 92]
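
The two printed lists differ only by a shift of one position: each target is the character that follows the corresponding input character. A minimal sketch of this on a stand-in string (the string "hello world" and the small `seq_length` are assumptions for illustration, not the notebook's data):

```python
data = "hello world"   # stand-in text; the notebook uses the crawled articles
chars = sorted(set(data))
char_to_ix = {ch: i for i, ch in enumerate(chars)}

seq_length, p = 5, 0
input_vals = [char_to_ix[ch] for ch in data[p:p + seq_length]]
target_vals = [char_to_ix[ch] for ch in data[p + 1:p + seq_length + 1]]

# The targets are the inputs shifted left by one character.
assert input_vals[1:] == target_vals[:-1]
print(input_vals)
print(target_vals)
```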


# Keras

- https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py

- https://keras.io/layers/recurrent/#lstm

In [34]:
from __future__ import print_function
import keras
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys
import io

'''
text = 'Tôi còn nhớ một lời thoại trong vở kịch "Hồn Trương Ba, da hàng thịt" của cố tác giả Lưu Quang Vũ. K.....'
'''
print('The number of character in the originial text:', len(text)) #==50088

chars = sorted(list(set(text))) 
print('The number of different characters in the original text:', len(chars)) #==147
char_indices = dict((c, i) for i, c in enumerate(chars))
'''
char_indices = {'\n':0, ' ':1, '!': 2, ...
              'A': 24, 'B': 25, ...}
'''
indices_char = dict((i, c) for i, c in enumerate(chars))
'''
indices_char = {0: '\n', 1: ' ', 2: '!', ...
            24: 'A', 25: 'B', ...}
'''

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences)) #==16683
'''
sentences = ['Tôi còn nhớ một lời thoại trong vở kịch ', 
          ' còn nhớ một lời thoại trong vở kịch "Hồ', 
          'n nhớ một lời thoại trong vở kịch "Hồn T',
          ...]
next_chars = ['"', 'n', 'r', 'n', 'B',...]
'''

print('One hot encoding...')
# use the builtin bool: the np.bool alias was removed in recent NumPy versions
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)

# x is a zero tensor of shape 16683 x 40 x 147
# y is a zero tensor of shape 16683 x 147
# 16683 is the number of training samples

# x[i, t, :] is the 147-d one-hot vector of the t-th character of sentences[i]
# y[i, :] is the 147-d one-hot vector of next_chars[i]
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1


# build the model: a single LSTM
print('Build model...')
model = Sequential()


model.add(LSTM(128, input_shape=(maxlen, len(chars))))
# so input_dim = len(chars) and input_length = maxlen
# The output of this layer is a single vector of dim 128: the LSTM output at
# the last time step. The outputs at the other (maxlen - 1) steps are
# discarded, because return_sequences=False by default in LSTM.

model.add(Dense(len(chars), activation='softmax'))
# Dense layer with softmax activation:
# the output is a vector of length len(chars) whose elements sum to one.

optimizer = RMSprop(lr=0.01)
# optimizer: the strategy used for gradient descent, i.e. for minimizing the loss
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
# We could also write
# model.compile(loss='categorical_crossentropy', optimizer='RMSprop')
# but the string form uses the default learning rate of RMSprop (lr=0.001)
# and gives no way to change it.
# Passing RMSprop(lr=0.01) directly to compile() is equivalent to the code above.
# We use categorical_crossentropy because guessing the next character
# is a classification problem over len(chars) classes.

def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # So preds = softmax(log(original preds) / temperature), that is,
    # preds[i] = old_preds[i]^(1/temperature) / sum_j old_preds[j]^(1/temperature)

    # If temperature is small, it is very cold and everything is frozen:
    # the largest old_preds[i] is pushed close to 1 and the others close to 0,
    # so we can predict what will appear.
    # As temperature -> 0, the sample is almost always argmax(old_preds).

    # If temperature is big, it is hot and everything is chaos:
    # we cannot predict what will appear.
    # All preds[i] tend to the same value, that is, preds converges to the
    # discrete uniform distribution, and each character is chosen with
    # probability 1/len(chars), which is 1/147 here.

    probas = np.random.multinomial(1, preds, 1)
    # np.random.multinomial(n, pvals, size): here n = 1 and size = k = 1.
    # One experiment throws a die n times (once in this case).
    # The die has len(preds) faces, and face i has probability preds[i].
    # The experiment is repeated k times.
    # Each experiment yields a count vector of length len(preds) that sums to n,
    # so the output has shape k x len(preds): here 1 x 147 with a single 1.

    return np.argmax(probas)
    # the position of that single 1, i.e. the sampled index

def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    # It is passed to keras.callbacks.LambdaCallback as on_epoch_end below.
    
    
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    #start_index = random.randint(0, len(text) - maxlen - 1)
    start_index=0 # choose a fixed seed sentence, or we can randomize start_index
    for diversity in [0.01, 0.2, 1.0, 10.0]:
        print('----- diversity:', diversity)

        generated = ''
        #choose a sentence in sentences
        # we can randomize start_index
        # sentence = text[start_index: start_index + maxlen]
        # or you can choose your own sentence with maxlen characters
        #sentence = 'Trăm năm trong cõi người ta, chữ tình ch'
        sentence = 'Tự nhiên bảo nói một đoạn thì khó nói đó'
        
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        # sys.stdout.write is used instead of print so that no newline is appended
        sys.stdout.write(generated)
        
                
        #produce 400  new characters
        for i in range(400):
            #one hot encoding of sentence
            x_pred = np.zeros((1, maxlen, len(chars)))
            
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.
            
            # then put x_pred into model.predict
            # without [0] the shape of preds is (1, 147), not (147,)
            preds = model.predict(x_pred, verbose=0)[0]
            
            # preds is a probability distribution over the indices 0..146,
            # so sum(preds) is 1 up to floating-point error
            next_index = sample(preds, diversity)
            # a new index is chosen; transform it to the corresponding character
            
            next_char = indices_char[next_index]

            generated += next_char # this code is currently redundant
            
            
            #update sentence to predict the next character
            # we remove the first character and append the new character
            sentence = sentence[1:] + next_char
            
            sys.stdout.write(next_char)
            sys.stdout.flush()
        print() #new line

print_callback = LambdaCallback(on_epoch_end = on_epoch_end)
# https://keras.io/callbacks/#lambdacallback

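# Checkpoint callback; note that it is defined but not passed to model.fit below,
# so no weights are saved during this run
checkpoint_path = "training_1/kerascp.ckpt"
cp_callback = keras.callbacks.ModelCheckpoint(checkpoint_path, 
                                                 save_weights_only=True,
                                                 verbose=1)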



model.fit(x, y,
          batch_size=128,
          epochs=60,
          callbacks=[print_callback])

The number of character in the originial text: 50088
The number of different characters in the original text: 147
nb sequences: 16683
One hot encoding...
Build model...
Epoch 1/60

----- Generating text after Epoch: 0
----- diversity: 0.01
----- Generating with seed: "Tự nhiên bảo nói một đoạn thì khó nói đó"
Tự nhiên bảo nói một đoạn thì khó nói đó n nh nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n nh n
----- diversity: 0.2
----- Generating with seed: "Tự nhiên bảo nói một đoạn thì khó nói đó"
Tự nhiên bảo nói một đoạn thì khó nói đó n n nh n nh n nh n nh n nh t nh n là  nh nh nh nh n nh t nh n nh n là ng nh n li nh n là là  nh n nh nh th n li là 



ầu họi nào đất đai thì hơn và thác thi nhà dan độ Tây lại lại thì thì thế cả thu hồi đất, cho thế lúng giao đất, cho thuy lạ ông thì thể nước tanh vi có tru hồi đất trang được thuy lạ ông thì thế nào đồng có tiển thu hồi đất, cho thế là đổi thuy đất cho thu hồi đất. Thoáng thì cả các chỉ cho mới say với thì cải trong hà
----- diversity: 1.0
----- Generating with seed: "Tự nhiên bảo nói một đoạn thì khó nói đó"
Tự nhiên bảo nói một đoạn thì khó nói đó đất nước tang theo tôi còn là cơn khoa được báng sông Hồ ồi báng phập triển đắng vụ tất cảnh nhằn cấp chí che dế áph tới và hại và quy hoạch phải cản giảo vắc sanh, thế này vM trường Uại hình rả đổi với cũng ch: cải hiện hành cho từ niới, ơn lở các cơi được pháp luật vấn đề sanh iấy mới giau cũng làn c hượng "pháp luật vẫn ngh vệt cho thu hai, người vự xa, nên sông thì nước tanh theo tôi theong t
----- diversity: 1.2
----- Generating with seed: "Tự nhiên bảo nói một đoạn thì khó nói đó"
Tự nhiên bảo nói một đoạn thì khó nói đó lắm lụng hò.

<keras.callbacks.History at 0x7fbda247ad68>
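
The effect of the diversity (temperature) values can be seen on a toy distribution. The sketch below recomputes the rescaling done inside `sample()`; the helper name `apply_temperature` and the three-element `preds` are assumptions for illustration, and the maximum is subtracted before exponentiating for numerical stability, which does not change the result.

```python
import numpy as np

def apply_temperature(preds, temperature):
    """Return softmax(log(preds) / temperature), as computed inside sample()."""
    logits = np.log(np.asarray(preds, dtype="float64")) / temperature
    # subtracting the max leaves the softmax unchanged but avoids overflow
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / np.sum(exp_logits)

preds = np.array([0.5, 0.3, 0.2])   # toy distribution, not taken from the model
for temperature in [0.01, 0.2, 1.0, 10.0]:
    print(temperature, np.round(apply_temperature(preds, temperature), 3))
```

A small temperature concentrates almost all mass on the most likely index, a temperature of 1.0 leaves the distribution unchanged, and a large temperature flattens it towards uniform. This matches the generated samples: the low-diversity text repeats itself, while the higher-diversity text is more varied and noisier.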

## Questions:
- What happens if we choose a seed sentence that makes no sense
  - as a sentence, e.g., 'I you me him hello'
  - as words, e.g., 'I dnoot uredsbadt waht you siad'
  - in a different language, e.g., 'je ne comprends pas'?
- What if we guess the next two characters at the same time, instead of one character?
- What if we do word-based guessing instead of character-based guessing as above?
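
For the word-based question, the main change would be in the preprocessing: the text is split into words and the sliding window runs over words instead of characters. A minimal sketch, assuming whitespace tokenization and toy values for `maxlen` and `step` (none of these choices come from the notebook):

```python
text = "the cat sat on the mat and the cat slept"   # toy text for illustration
words = text.split()                # assumption: split on whitespace only
vocab = sorted(set(words))
word_indices = {w: i for i, w in enumerate(vocab)}

maxlen, step = 3, 1                 # window of 3 words, hypothetical values
sentences, next_words = [], []
for i in range(0, len(words) - maxlen, step):
    sentences.append(words[i:i + maxlen])   # the input window of words
    next_words.append(words[i + maxlen])    # the word to predict

print('nb sequences:', len(sentences))
print(sentences[0], '->', next_words[0])
```

The one-hot encoding and the model would then use `len(vocab)` in place of `len(chars)`; for a real corpus the vocabulary is much larger than 147, so this is only a starting point.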

In [0]:
print('Build model...')
model = Sequential()
model.add(LSTM(2, input_shape=(maxlen, 3), return_sequences = True, name = "LSTM"))
model.add(Dense(len(chars), activation='softmax', name = 'Dense'))


Build model...


In [0]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
LSTM (LSTM)                  (None, 40, 2)             48        
_________________________________________________________________
dense_11 (Dense)             (None, 40, 147)           441       
Total params: 489
Trainable params: 489
Non-trainable params: 0
_________________________________________________________________
