# Simple RNN model for text generation using GloVe embeddings
In this notebook we will learn to generate text using an RNN model and GloVe embeddings. RNN-based text generation models can be built in two ways: as character-based language models or as word-based language models. Each has pros and cons, which the lists below summarize.

#### Character-based language RNNs
Pros
- They learn punctuation and rarely used words.
- No need for word embeddings; one-hot encodings are enough.
- Smaller vocabulary.

Cons
- They can produce nonsense words.
- They can generate syntactically and grammatically incorrect sentences.

#### Word-based language RNNs
Pros
- They cannot generate words outside the vocabulary.
- They can understand and predict complex words.

Cons
- They are complex and resource-demanding.
- They depend on word embeddings. If words in the vocabulary are missing from the pre-trained embeddings, we need to train our own embeddings for them.
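
The vocabulary trade-off described in the lists above can be illustrated with a small sketch. The sentence `sample` is made up for illustration and is not taken from the patent data; the point is only that a character-level model sees a small vocabulary but long sequences, while a word-level model sees the opposite.

```python
# Toy comparison of character-level and word-level tokenization.
# `sample` is a hypothetical sentence, not part of the notebook's data set.
sample = "the network learns the weights of the network"

char_tokens = list(sample)      # one token per character, spaces included
word_tokens = sample.split()    # one token per whitespace-separated word

char_vocab = sorted(set(char_tokens))
word_vocab = sorted(set(word_tokens))

print(f'characters: {len(char_vocab)} unique symbols, sequence length {len(char_tokens)}')
print(f'words:      {len(word_vocab)} unique symbols, sequence length {len(word_tokens)}')
```

On a real corpus the character vocabulary stays at a few dozen symbols, whereas the word vocabulary grows into the thousands, which is why the word-based approach leans on embeddings.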


References
- Embedding layer explained: https://medium.com/analytics-vidhya/understanding-embedding-layer-in-keras-bbe3ff1327ce
- Deep Dive into Recurrent Neural Networks (notebook): https://github.com/WillKoehrsen/recurrent-neural-networks/blob/master/notebooks/Deep%20Dive%20into%20Recurrent%20Neural%20Networks.ipynb

In [267]:
import pandas as pd
import numpy as np

# load data set. 
data = pd.read_csv('../data/neural_network_patent_query.csv')
data.head()


# loading only subset of data
abstracts = data['patent_abstract']
len(abstracts)

# get machine configuration
# from tensorflow.python.client import device_lib
# print(device_lib.list_local_devices())

3522

In [268]:
## Global parameters
import warnings

warnings.filterwarnings('ignore', category=RuntimeWarning)

RANDOM_STATE = 50
EPOCHS = 100
BATCH_SIZE = 256
MAX_WORDS = 10000
MAX_LEN = 100
VERBOSE = 1
SAVE_MODEL = True


In [269]:
import re
sampleText = 'This is a short sentence (1) with one reference to an image. This next sentence, while non-sensical, does not have an image and has two commas.'
def format_text(text):
    """Formats the text so that punctuation is treated as separate tokens"""
    # add a space before punctuation that directly follows a non-digit, non-space character
    text = re.sub(r'(?<=[^\s0-9])(?=[.,;?])', r' ', text)
    # remove references to figures, e.g. "(1)"
    text = re.sub(r'\((\d+)\)', r'', text)
    # collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text
f = format_text(sampleText)
f

'This is a short sentence with one reference to an image . This next sentence , while non-sensical , does not have an image and has two commas .'

In [270]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(filters='!"#$%&()*+/:;.<=>?@[\\]^_`{|}~\t\n', lower=True)
tokenizer.fit_on_texts([f])
s = tokenizer.texts_to_sequences([f])[0]
print(' '.join(tokenizer.index_word[i] for i in s))
print(tokenizer.word_index.keys())

this is a short sentence with one reference to an image this next sentence , while non-sensical , does not have an image and has two commas
dict_keys(['this', 'sentence', 'an', 'image', ',', 'is', 'a', 'short', 'with', 'one', 'reference', 'to', 'next', 'while', 'non-sensical', 'does', 'not', 'have', 'and', 'has', 'two', 'commas'])


In [271]:
formatted = [format_text(s) for s in abstracts]  
len(formatted)

3522

In [272]:
def make_sequences(texts, training_lengths=50, lower=True, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'):
    """Converts text to sequences of integers"""
    
    # create a tokenizer object and fit on texts
    tokenizer = Tokenizer(lower=lower, filters=filters)
    tokenizer.fit_on_texts(texts)
    
    # create lookup dictionaries
    word2idx = tokenizer.word_index
    idx2word = tokenizer.index_word
    num_words = len(word2idx) + 1
    word_counts = tokenizer.word_counts
    
    print(f'There are {num_words} unique words.')
    
    # convert text to sequences of integers
    sequences = tokenizer.texts_to_sequences(texts)
    
    # keep only sequences with more than training_lengths + 20 tokens
    seq_lengths = [len(x) for x in sequences]
    over_idx = [i for i, l in enumerate(seq_lengths) if l > (training_lengths+20)]
    
    new_texts = []
    new_sequences = []
    
    for i in over_idx:
        new_texts.append(texts[i])
        new_sequences.append(sequences[i])      
        
    training_sequences = []
    labels = []
    
    for seq in new_sequences:
        for i in range(training_lengths, len(seq)):
            extract = seq[i - training_lengths:i + 1]
            training_sequences.append(extract[:-1])
            labels.append(extract[-1])
    
    print(f'There are {len(training_sequences)} training sequences.')
    return training_sequences, labels, word2idx, idx2word, num_words, word_counts, new_texts, new_sequences

In [273]:
TRAINING_LENGTH = 50
filters = '!#$%&()*+/:;<=>?@[\\]^_`{|}~\t\n'
features, labels, word2idx, idx2word, num_words, word_counts, new_texts, new_sequences = make_sequences(formatted, TRAINING_LENGTH, lower=True, filters=filters)

There are 13751 unique words.
There are 319970 training sequences.
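
The way `make_sequences` turns each abstract into many (features, label) pairs is a sliding window: every run of `training_lengths` consecutive tokens is a feature vector, and the token that follows it is the label. A minimal sketch of just that step, using a hypothetical helper `sliding_windows` and a made-up toy sequence:

```python
def sliding_windows(seq, training_length):
    """Return (features, labels) built from one token sequence.

    Hypothetical helper that mirrors the inner loop of make_sequences.
    """
    features, labels = [], []
    for i in range(training_length, len(seq)):
        features.append(seq[i - training_length:i])  # the preceding tokens
        labels.append(seq[i])                        # the token to predict
    return features, labels

toy_seq = [11, 12, 13, 14, 15, 16]  # made-up token indices
toy_features, toy_labels = sliding_windows(toy_seq, training_length=3)

for window, label in zip(toy_features, toy_labels):
    print(window, '->', label)
```

A sequence of length `n` yields `n - training_length` pairs, which is why 3522 abstracts expand into several hundred thousand training sequences.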


In [274]:
n=2
def find_answers(index):
    print('Features=' + ' '.join(idx2word[i] for i in features[index]))
    print('Label=' + idx2word[labels[index]])
find_answers(n)
print('Original Text' + formatted[0][:400])

Features=""barometer"" neuron enhances stability in a neural network system that , when used as a track-while-scan system , assigns sensor plots to predicted track positions in a plot track association situation . the ""barometer"" neuron functions as a bench-mark or reference system node that equates a superimposed plot and track
Label=to
Original Text" A ""Barometer"" Neuron enhances stability in a Neural Network System that , when used as a track-while-scan system , assigns sensor plots to predicted track positions in a plot/track association situation . The ""Barometer"" Neuron functions as a bench-mark or reference system node that equates a superimposed plot and track to a zero distance as a ""perfect"" pairing of plot and track which has 


In [275]:
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

def create_training_data(features, labels, num_words, train_fraction=0.7):
    """Creates training and validation data"""
    
    features, labels = shuffle(features, labels, random_state=RANDOM_STATE)
    
    # find number of training samples
    num_train = int(len(features) * train_fraction)
    
    print('Number of training samples:', num_train)
    
    # split data
    train_x = features[:num_train]
    train_y = labels[:num_train]
    val_x = features[num_train:]
    val_y = labels[num_train:]
    
    # convert to arrays
    train_x = np.array(train_x)
    valid_x = np.array(val_x)

    y_train = np.zeros((len(train_y), num_words), dtype=np.int8)
    y_valid = np.zeros((len(val_y), num_words), dtype=np.int8)
    
    # one hot encode outputs
    for i, word in enumerate(train_y):
        y_train[i, word] = 1
        
    for i, word in enumerate(val_y):
        y_valid[i, word] = 1
        
    return train_x, y_train, valid_x, y_valid 

In [276]:
train_x, train_y, valid_x, valid_y =  create_training_data(features, labels, num_words, train_fraction=0.7)
len(train_x), len(train_y), len(valid_x), len(valid_y)  

Number of training samples: 223979


(223979, 223979, 95991, 95991)

In [277]:
print(train_x.shape)
print(valid_x.shape)

(223979, 50)
(95991, 50)


In [278]:
import os
from keras.utils import get_file
import numpy as np

# Download word embeddings if they are not present
# !wget --no-check-certificate http://nlp.stanford.edu/data/glove.6B.zip
# unzip glove.6B.zip

# Load in unzipped file
glove_vectors = '../../embeddings/glove.6B.100d.txt'
glove = np.loadtxt(glove_vectors, encoding='utf-8', dtype='str', comments=None)

In [279]:
vectors = glove[:, 1:].astype('float')
words = glove[:, 0]

In [280]:
print(vectors.shape)
print(words.shape)
print(num_words)

(400000, 100)
(400000,)
13751


In [281]:
# create embedding matrix for words that are part of our vocabulary, using GloVe embeddings
word_lookup = {word: vector for word, vector in zip(words, vectors)}
embedding_matrix = np.zeros((num_words, vectors.shape[1]))
not_found = 0
# use the index stored in word2idx directly instead of relying on dictionary order
for word, idx in word2idx.items():
    vector = word_lookup.get(word, None)
    if vector is not None:
        embedding_matrix[idx, :] = vector
    else:
        not_found += 1
print(f'{not_found} words not found out of {num_words} total words')

3026 words not found out of 13751 total words


In [282]:
import gc
gc.enable()
del vectors
# del glove
del features
del labels
del glove_vectors
gc.collect()

147

In [283]:
embedding_matrix.shape

(13751, 100)

In [284]:
embedding_matrix = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1).reshape((-1, 1))
embedding_matrix = np.nan_to_num(embedding_matrix)

In [285]:
def find_closest(query, embedding_matrix=embedding_matrix, word2idx=word2idx, idx2word=idx2word, n=10):
    """Finds the closest word to a given word using word embeddings"""
    idx = word2idx.get(query, None)
    if idx is None:
        print(f'{query} not found in vocab.')
        return None
    vector = embedding_matrix[idx]
    if(np.all(vector == 0)):
        print(f'{query} has no pre-trained embedding.')
        return None
    else:
        dist = np.dot(embedding_matrix, vector)
        idxs = np.argsort(dist)[::-1][:n]  
        sorted_dist = dist[idxs]
        closest = [idx2word[i] for i in idxs]

    print(f'Query: {query}\n')
    max_len = max([len(i) for i in closest])
    for word, dist in zip(closest, sorted_dist):
        print(f'{word:{max_len + 2}} Cosine similarity {dist:.4f}')
    
find_closest('the')  
print('-'*100)
find_closest(',') 

Query: the

the     Cosine similarity 1.0000
this    Cosine similarity 0.8573
part    Cosine similarity 0.8508
one     Cosine similarity 0.8503
of      Cosine similarity 0.8329
same    Cosine similarity 0.8325
first   Cosine similarity 0.8210
on      Cosine similarity 0.8200
its     Cosine similarity 0.8169
as      Cosine similarity 0.8128
----------------------------------------------------------------------------------------------------
Query: ,

,       Cosine similarity 1.0000
and     Cosine similarity 0.8782
.       Cosine similarity 0.8756
while   Cosine similarity 0.8525
but     Cosine similarity 0.8338
as      Cosine similarity 0.8284
also    Cosine similarity 0.8066
now     Cosine similarity 0.8026
well    Cosine similarity 0.8003
one     Cosine similarity 0.7802
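
The scores printed above are dot products of rows that were first scaled to unit length, which is what makes them cosine similarities. A minimal sketch on a made-up matrix (`toy_matrix` is hypothetical, not the GloVe data) shows this. It also uses `np.divide` with a `where` mask as one possible way to keep all-zero rows at zero without producing NaNs:

```python
import numpy as np

# Hypothetical embedding matrix; the last row stands for a word with no pre-trained vector
toy_matrix = np.array([[3.0, 4.0],
                       [4.0, 3.0],
                       [0.0, 0.0]])

norms = np.linalg.norm(toy_matrix, axis=1, keepdims=True)

# Divide only where the norm is non-zero, so zero rows stay zero
unit = np.divide(toy_matrix, norms, out=np.zeros_like(toy_matrix), where=norms > 0)

# For unit-length rows the dot product equals the cosine similarity
similarity = unit @ unit[0]
print(similarity)
```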


In [286]:
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense, Dropout, Embedding, Masking, Bidirectional
from keras.optimizers import Adam

print(num_words)

13751


In [287]:
# callbacks
from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau
model_dir = '../models/'
def create_callbacks(model_name, save=SAVE_MODEL):
    earlyStopping = EarlyStopping(monitor='val_loss', patience=5)
    callbacks = [earlyStopping]
    if save:
        callbacks.append(ModelCheckpoint(f'{model_dir}{model_name}.h5', save_best_only=True))          
    return callbacks
callbacks = create_callbacks('rnn-glove-embeddings')

### Keras embedding layer
To represent words as vectors of numbers we have two options:
- One-hot encoding, where every word is represented as an array whose length equals the number of words in the vocabulary. The position corresponding to the word is set to 1 and all other positions are 0. This does not scale well, since the vectors need a lot of storage and make the model less efficient.
- Word embeddings, where every word is represented by a fixed-length vector. These vectors are denser than one-hot encodings and help us identify semantic similarities between words.

Since we are working on a word-based language RNN, we use pre-trained GloVe word embeddings to convert the input tokens to word vectors.
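
The relationship between the two representations can be sketched with NumPy: looking up a row of an embedding matrix gives the same result as multiplying a one-hot vector by that matrix, without ever building the large sparse vector. The matrix `toy_embeddings` and the sizes below are made up for illustration and are much smaller than the GloVe matrix used in this notebook.

```python
import numpy as np

vocab_size, embed_dim = 5, 3            # toy sizes, chosen for illustration
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(vocab_size, embed_dim))  # hypothetical embedding matrix

word_index = 2

# One-hot representation: a vector as long as the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1

via_matmul = one_hot @ toy_embeddings    # one-hot vector times the matrix
via_lookup = toy_embeddings[word_index]  # direct row lookup, as an embedding layer does

print(np.allclose(via_matmul, via_lookup))
```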

In [288]:
def create_model(lstms=1, lstm_cells=64):
    model = Sequential()
    model.add(Embedding(input_dim=num_words, output_dim=embedding_matrix.shape[1], weights=[embedding_matrix], trainable=False, mask_zero=True))
    model.add(Masking(mask_value=0.0))
    if lstms > 1:
        for i in range(lstms-1):
            model.add(LSTM(lstm_cells, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))
    model.add(LSTM(lstm_cells, return_sequences=False, dropout=0.1, recurrent_dropout=0.1))
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_words, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
    return model

model = create_model(lstms=1, lstm_cells=64)



In [289]:
model.summary()

Model: "sequential_14"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_14 (Embedding)    (None, None, 100)         1375100   
                                                                 
 masking_14 (Masking)        (None, None, 100)         0         
                                                                 
 lstm_16 (LSTM)              (None, 64)                42240     
                                                                 
 dense_28 (Dense)            (None, 128)               8320      
                                                                 
 dropout_14 (Dropout)        (None, 128)               0         
                                                                 
 dense_29 (Dense)            (None, 13751)             1773879   
                                                                 
Total params: 3,199,539
Trainable params: 1,824,439
N

In [290]:
history = model.fit(
    train_x,
    train_y,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    verbose=VERBOSE,
    callbacks=callbacks,
    validation_data=(valid_x, valid_y))

Epoch 1/30
...
Epoch 28/30


In [295]:
def load_and_evaluate_model(model_name):
    model = load_model(f'{model_dir}{model_name}.h5')
    r = model.evaluate(valid_x, valid_y, batch_size=2048, verbose=1)
    print(f'Cross-entropy: {r[0]:.4f}')
    print(f'Accuracy: {r[1]:.4f}')
    return model
load_and_evaluate_model('rnn-glove-embeddings')

Cross-entropy: 5.0268
Accuracy: 0.2251


<keras.engine.sequential.Sequential at 0x7f340da3aa60>

### Model evaluation
In this step we assess whether our model performs better than simple baselines that ignore the input sequence. We consider two baselines:
- always predicting the most frequently used word, and
- guessing a word at random, with each word drawn in proportion to its frequency in the corpus.

We calculate the accuracy of both baselines on the validation set and compare it with the accuracy of the model. If the model's accuracy is higher, we can conclude that the model has learned something beyond word frequencies.
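
The two kinds of baseline accuracy can be worked out on a toy example. This is a sketch under the assumption that labels follow the same frequency distribution as the guesses; the list `toy_labels` is made up. Always predicting the most common word is correct with probability equal to that word's frequency, while a frequency-weighted random guess is correct with probability equal to the sum of squared frequencies, which is always the lower of the two unless one word makes up the whole corpus.

```python
from collections import Counter

# Made-up label sequence standing in for the validation labels
toy_labels = ['the', 'a', 'the', 'of', 'the', 'a', 'network', 'the']

counts = Counter(toy_labels)
total = sum(counts.values())
probs = {word: count / total for word, count in counts.items()}

# Baseline 1: always predict the most common word
most_common_acc = max(probs.values())

# Baseline 2: draw a word in proportion to its frequency.
# The guess matches label w with probability p(w), and label w occurs with
# probability p(w), so the expected accuracy is the sum of p(w)^2.
weighted_acc = sum(p * p for p in probs.values())

print(f'Most common word baseline:   {most_common_acc:.4f}')
print(f'Frequency-weighted baseline: {weighted_acc:.4f}')
```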

In [296]:
from collections import Counter

np.random.seed(RANDOM_STATE)
total_words = sum(word_counts.values())
frequencies = [word_counts[word]/total_words for word in word2idx.keys()]
frequencies.insert(0, 0)
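print('The most common word: ' + idx2word[frequencies.index(max(frequencies))])
# the Tokenizer assigns index 1 to the most frequent word, so comparing the labels with 1
# gives the accuracy of always predicting that word
print(f'Accuracy of the model if we replace all words with the most common word: {round(100 * np.mean(np.argmax(valid_y, axis = 1) == 1), 4)}%')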

# collect random guesses for every item in validation set
# np.random.multinomial(1, frequencies, size=1) returns a one-hot encoded vector of size 1 with a 1 at the index of the randomly chosen word
# frequencies is the probability distribution from which the words are chosen
random_guesses = [np.argmax(np.random.multinomial(1, frequencies, size=1)) for i in valid_y]

# create a counter with the counts of each word
c = Counter(random_guesses)
# for the 10 most frequently guessed words
for i in c.most_common(10):
    word = idx2word[i[0]]
    word_count = word_counts[word]
    print(f'{word:<10} Word Count: {word_count} \t Predicted {i[1]} \t Percentage {round(100*word_count/total_words, 2)}%')
# accuracy of the baseline that guesses words at random in proportion to their frequency
accuracy = np.mean(random_guesses == np.argmax(valid_y, axis=1))
print(f'Accuracy: {round(100*accuracy, 2)}%')

The most common word: the
Accuracy of the model if we replace all words with the most common word: 8.7602%
the        Word Count: 36597 	 Predicted 7129 	 Percentage 7.36%
a          Word Count: 24878 	 Predicted 4649 	 Percentage 5.0%
of         Word Count: 20193 	 Predicted 3856 	 Percentage 4.06%
.          Word Count: 16594 	 Predicted 3121 	 Percentage 3.34%
,          Word Count: 15410 	 Predicted 2866 	 Percentage 3.1%
and        Word Count: 12947 	 Predicted 2448 	 Percentage 2.6%
to         Word Count: 12073 	 Predicted 2295 	 Percentage 2.43%
network    Word Count: 7731 	 Predicted 1549 	 Percentage 1.55%
for        Word Count: 6907 	 Predicted 1429 	 Percentage 1.39%
is         Word Count: 7213 	 Predicted 1417 	 Percentage 1.45%
Accuracy: 1.53%


In [293]:
import random

def generate_output(model, sequences, training_length=50, new_words=50, diversity=1, return_output=False):
    """Generates new text given a trained model and a seed sequence"""
    
    # pick a random sequence    
    seq = random.choice(sequences)
    
    # pick a random starting index
    seed_idx = random.randint(0, len(seq)-training_length-10)
    
    # select end index based on training length and seed
    end_idx = seed_idx+training_length
    
    # seed sequence
    seed = seq[seed_idx:end_idx]
    
    # words of the seed sequence
    original_sequence_words = [idx2word[i] for i in seed]
    
    # initialize the generated sequence with the seed, followed by a '#' separator
    generated = seed[:] + ['#']
        
    # seed, separator, and the tokens that actually follow the seed in the text
    actual = generated + seq[end_idx: end_idx+new_words]
      
    for i in range(new_words):
        preds = model.predict(np.array(seed).reshape(1, -1), verbose=0)[0].astype('float64')
        preds = np.log(preds)/diversity
        exp_preds = np.exp(preds)
        
        # reweight distribution => softmax
        preds = exp_preds / np.sum(exp_preds)
        probas = np.random.multinomial(1, preds, 1)[0]
        
        # find the next word index
        next_idx = np.argmax(probas)
        
        # reseed the seed with the new word
        seed = seed[1:] + [next_idx]
        
        # update generated text
        generated.append(next_idx)
        
    # map indices back to words; the '#' separator has no entry and is shown as '<--->'
    gen_list = [idx2word.get(i, '<--->') for i in generated]
    a = [idx2word.get(i, '<--->') for i in actual]
    
    return original_sequence_words, gen_list, a

seed, gen_list, actual = generate_output(model, new_sequences)
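
The `diversity` argument of `generate_output` rescales the predicted distribution before sampling, much like a temperature. The sketch below isolates that step in a hypothetical helper, `reweight`, applied to a made-up distribution. The small constant added before the logarithm and the subtraction of the maximum are assumptions added here for numerical safety; the notebook's function does not include them.

```python
import numpy as np

def reweight(preds, diversity=1.0):
    """Rescale a probability distribution by a temperature-like `diversity` value."""
    preds = np.asarray(preds, dtype='float64')
    # Small constant avoids log(0); this guard is an assumption, not in the notebook
    logits = np.log(preds + 1e-12) / diversity
    # Subtracting the maximum keeps the exponentials numerically stable
    exp_preds = np.exp(logits - logits.max())
    return exp_preds / exp_preds.sum()

toy_preds = [0.6, 0.3, 0.1]  # made-up softmax output over three words

for d in (0.5, 1.0, 2.0):
    print(d, np.round(reweight(toy_preds, d), 3))
```

Values below 1 sharpen the distribution, so generation is more conservative. Values above 1 flatten it, so less likely words are sampled more often. A value of 1 leaves the model's predictions unchanged.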

In [294]:
print('SEED: ' + ' '.join(seed))
print('='*100)
print('ACTUAL:' +' '.join(actual))
print('='*100)
print('GENERATED:' +' '.join(gen_list))

SEED: are determined using occupant feedback provided by individual occupants over at least one of an internet or intranet communications network . according to a first aspect of the invention a setpoint is determined using fuzzy logic . according to a second aspect , historical setpoint data determined using occupant feedback
ACTUAL:are determined using occupant feedback provided by individual occupants over at least one of an internet or intranet communications network . according to a first aspect of the invention a setpoint is determined using fuzzy logic . according to a second aspect , historical setpoint data determined using occupant feedback <---> is used to develop a neural network for predicting setpoint values .
GENERATED:are determined using occupant feedback provided by individual occupants over at least one of an internet or intranet communications network . according to a first aspect of the invention a setpoint is determined using fuzzy logic . according to a second as