### Problem Description

In this notebook we will build an RNN which will learn from a list of patent abstracts. We will utilise this RNN to predict the next word in a sequence of words, and also to create its own patent abstracts. 

### The Approach 

This project consists of the following steps:

1. Reading in the data: thousands of neural network patents
2. Converting patents to integer sequences: `tokenization`
3. Creating training dataset using the next word following in a sequence as label
4. Building a recurrent neural network using word embeddings and LSTM cells
5. Training the network to predict the next word from the sequence
6. Generating new abstracts by feeding a seed sequence to a network
7. Loading pre-trained embeddings
8. Repeating steps 3 - 7 using the pre-trained embeddings
9. Trying out different model architectures to see if the performance improves
10. For fun, creating a simple game where we must guess if the output is human or computer generated

### Loding the data

First we load the date along with some necessary libraries.

In [2]:
import numpy as np
import pandas as pd
import random
import re
import gc
import os
from keras.preprocessing.text import Tokenizer
import warnings

warnings.filterwarnings('ignore', category = RuntimeWarning)

In [3]:
data = pd.read_csv('../neural_network_patent_query.csv',
                  parse_dates = ['patent_date'])

abstracts = list(data['patent_abstract'])

print(f'There are total of {len(abstracts)} patent abstracts.')

data.head()

There are total of 3522 patent abstracts.


Unnamed: 0,patent_abstract,patent_date,patent_number,patent_title
0,""" A """"Barometer"""" Neuron enhances stability in...",1996-07-09,5535303,"""""""Barometer"""" neuron for a neural network"""
1,""" This invention is a novel high-speed neural ...",1993-10-19,5255349,"""Electronic neural network for solving """"trave..."
2,An optical information processor for use as a ...,1995-01-17,5383042,3 layer liquid crystal neural network with out...
3,A method and system for intelligent control of...,2001-01-02,6169981,3-brain architecture for an intelligent decisi...
4,A method and system for intelligent control of...,2003-06-17,6581048,3-brain architecture for an intelligent decisi...


### Text Formatting

Let's print out one of the abstracts.

In [4]:
abstracts[100]

'The present invention provides an apparatus and a method for classifying and recognizing image patterns using a second-order neural network, thereby achieving high-rate parallel processing while lowering the complexity. The second-order neural network, which is made of adders and multipliers, corrects positional translations generated in a complex-log mapping unit to output the same result for the same object irrespective of the scale and/or rotation of the object. The present invention enables high-rate image pattern classification and recognition based on parallel processing, which is the advantage obtained in neural network models, because consistent neural networks and consistent network structure computation models are applied to all steps from the image input step to the pattern classifying and recognizing step.'

Later on we will use the Tokenizer to convert words of each abstract to integers, so we can train the RNN. However, there's a problem with punctuation - the Tokenizer will recognize `network` and `network,` as two different words. That's why we need to format abstracts by separating punctuation marks from the words. We'll also perform some additional formatting, such as removing numbers encircled in bracets `i.e. (3)` which appear quite often in patents' abstracts to reference to a figure.

In [5]:
def format_abstract(abstract):
    
    #separating a word from the punctuation sign
    f_abstract = re.sub(r'(?<=[^\s0-9])(?=[.,;?])', r' ', abstract)
    
    #removing reference numbers
    f_abstract = re.sub(r'\(\d+\)', r'', f_abstract)
    
    #removing double spaces
    f_abstract = re.sub(r'\s\s', ' ', f_abstract)
    
    return f_abstract

In [6]:
def remove_spaces(f_abstract):
    
    """Removes spaces around punctuation"""
    abstract = re.sub(r'\s+([.,?!;])', r'\1', f_abstract)
    
    
    return f_abstract

Now let's see how our formated patent abstract looks like.

In [7]:
format_abstract(abstracts[100])

'The present invention provides an apparatus and a method for classifying and recognizing image patterns using a second-order neural network , thereby achieving high-rate parallel processing while lowering the complexity . The second-order neural network , which is made of adders and multipliers , corrects positional translations generated in a complex-log mapping unit to output the same result for the same object irrespective of the scale and/or rotation of the object . The present invention enables high-rate image pattern classification and recognition based on parallel processing , which is the advantage obtained in neural network models , because consistent neural networks and consistent network structure computation models are applied to all steps from the image input step to the pattern classifying and recognizing step .'

In [8]:
#formatting all the abstracts
formatted_abstracts = [format_abstract(abst) for abst in abstracts]

print(f'There are total of {len(formatted_abstracts)} (formatted) patent abstracts.')

There are total of 3522 (formatted) patent abstracts.


### Converting Texts to Sequences

As said before, using the Tokenizer we need to convert our texts into sequences of integers to train the RNN. In addition, we need to create feature vectors and labels from those sequences. From each patent abstract we'll create a set of feature vectors and corresponding labels. Each feature vector will represent first $i + 50,\ (i = 0,1,...)$ words taken from the texts, while the following word will be used as a label. 

In [9]:
#the following function will convert texts to integer sequences
#and generate feature vectors and corresponding labels
def make_sequences(texts, filters,
                   training_length = 50, lower = True):
    
    
    """Converts texts into integer sequences"""
    
    #creating and fitting the Tokenizer
    tokenizer = Tokenizer(lower = lower, filters = filters)
    tokenizer.fit_on_texts(texts)
    
    
    #creating dictionaries and 'reverse dictionaries'
    #for all the words appearing in patent abstracts
    word_idx = tokenizer.word_index
    idx_word = tokenizer.index_word
    #we add +1 because of the index
    num_of_words = len(word_idx) + 1
    word_count = tokenizer.word_counts
    
    print(f'There are {num_of_words - 1} unique words in the dictionary.')
    
    
    #converting texts to integer sequences
    sequences = tokenizer.texts_to_sequences(texts)
    
    #we'll select only those abstracts that have a bit more words 
    #than the feature vector length
    seq_lengts = [len(seq) for seq in sequences]
    
    long_text_idx = [i for i, seq_l in enumerate(seq_lengts)
                    if seq_l > (training_length + 20)]
    
    new_texts = []
    new_sequences = []
    
    for i in long_text_idx:
        new_texts.append(texts[i])
        new_sequences.append(sequences[i])
        
    #from every abstract we extract sets of feature vectors and 
    #corresponding labels
    
    training_seq = []
    labels = []
    
    for seq in new_sequences:
        for i in range(training_length, len(seq)):
            extract = seq[i - training_length : i+1]
            training_seq.append(extract[:-1])
            labels.append(extract[-1])
            
    print(f'There are {len(training_seq)} training sequences.')
    
    return word_idx, idx_word, num_of_words, word_count, new_texts, new_sequences, training_seq, labels

Now let's see how our function generates data. We'll remove a fair amount of the punctuation and lowercase all letters but leave in periods and commas. Our model will not learn how to capitalize words, but it may learn how to end a sentence and insert commas.

In [10]:
filters = '!"#$%&()*+/:<=>@[\\]^_`{|}~\t\n'

word_idx, idx_word, num_of_words, word_count, f_abstracts, sequences, features, labels = make_sequences(
    formatted_abstracts, filters) 

There are 13676 unique words in the dictionary.
There are 320881 training sequences.


For every training sequence/label pair, we'll define a function which will give us the training sequence text and the word (label) that follows it. 

In [11]:
def the_label(index):
    
    '''Outputs the text of a training sequence and the 
    corresponding label for a given index of a train sequence'''
    
    #training set text
    feat_text = ' '.join(idx_word[i] for i in features[index])
    
    #label
    label = idx_word[labels[index]]
    
    print(f'Feature text: {feat_text}\n')
    print(f'Label: {label}')

In [12]:
the_label(100)

Feature text: it comprises a novel hybrid architecture employing a binary synaptic array whose embodiment incorporates the fixed rules of the problem , such as the number of cities to be visited . the array is prompted by analog voltages representing variables such as distances . the processor incorporates two interconnected feedback

Label: networks


We also want to check what are the most frequent words appearing in our data. Sometimes these words can create bias and weaken the training of our RNN.

In [13]:
sorted(word_count.items(), key = lambda x: x[1], reverse = True)[:15]

[('the', 36597),
 ('a', 24883),
 ('of', 20193),
 ('.', 16604),
 (',', 15417),
 ('and', 12947),
 ('to', 12073),
 ('network', 7733),
 ('neural', 7381),
 ('is', 7213),
 ('in', 6992),
 ('for', 6907),
 ('an', 6061),
 ('data', 4010),
 ('by', 3607)]

We will not remove any of the most common words since they make sense in the context of the patents we are using and the general English language.

### Train and Validation Sets

Now it's time to create train and validation sets. We use 70% of our data for the training set. In addition, we one-hot encode the labels since we are going to use categorical-crossentropy as a loss function.

We shuffle the features since they are made in a sequential order.

In [14]:
from sklearn.utils import shuffle

def make_train_val(features, labels, num_of_words,
                  training_fraction = 0.7):
    
    '''Splits the data into a training and validation set.'''
    
    
    #shuffling the data
    features, labels = shuffle(features, labels, random_state = 50)
    
    #splitting the data
    train_end = int(training_fraction*len(labels))
    
    train_features = features[:train_end]
    val_features = features[train_end:]
    
    train_labels = labels[:train_end]
    val_labels = labels[train_end:]
    
    #converting the data into a numpy arrays
    X_train = np.array(train_features)
    X_val = np.array(val_features)
    
    y_train = np.zeros((len(train_labels), num_of_words), dtype = np.int8)
    y_val = np.zeros((len(val_labels), num_of_words), dtype = np.int8)
    
    for example_index, word_index in enumerate(train_labels):
        y_train[example_index, word_index] = 1
    for example_index, word_index in enumerate(val_labels):
        y_val[example_index, word_index] = 1
        
    #managing the memory
    gc.enable()
    del features, labels, train_features, val_features, train_labels, val_labels
    gc.collect()
    
    return X_train, y_train, X_val, y_val

In [15]:
X_train, y_train, X_val, y_val = make_train_val(features, labels, num_of_words)

In [16]:
print(X_train.shape)
print(y_train.shape)

(224616, 50)
(224616, 13677)


### Pre-Trained Embeddings 

Instead of training our own word embeddings, we'll use word embeddings that were trained on a large corpus of words. The hope is that these embeddings will generalize from the training corpus to our needs.

The following code downloads 100-dimensional word embeddings and loads them into a numpy array. We'll download the pre-trained embeddings from [Stanford online library](https://nlp.stanford.edu/data/).

In [17]:
from keras.utils import get_file
from zipfile37 import ZipFile

keras_home = 'C:\\Users\\alegzander\\.keras\\datasets\\'
glove_vectors_zip = keras_home + 'glove.6B.zip'

#checking if we already have the file
if not os.path.exists(glove_vectors_zip):
    get_file('glove.6B.zip',
             'http://nlp.stanford.edu/data/glove.6B.zip')

#checking if we have already extracted the .zip file
glove_vectors_text = keras_home + 'glove.6B.100d.txt'
if not os.path.exists(glove_vectors_text):
    with ZipFile(glove_vectors_zip, 'r') as gv:
        glove_vectors_text = gv.extract('glove.6B.100d.txt', 
                                        path = keras_home)
        
#loading the embedding into a numpy array
glove = np.loadtxt(glove_vectors_text, dtype = 'str', 
                   encoding = 'utf-8')

glove.shape

(399965, 101)

In [18]:
glove[74][:15]

array(['over', '-0.29574', '0.35345', '0.63326', '0.19576', '-0.030256',
       '0.54244', '-0.21091', '0.32894', '-0.48888', '0.18379', '0.24242',
       '0.40346', '0.11973', '0.013143'], dtype='<U22')

### Now we separate created arrays into words and corresponding vector representations.

In [19]:
words = glove[:,0]
vectors = glove[:,1:].astype('float')

gc.enable()
del glove
gc.collect()

words[74], vectors[74][:15]

('over',
 array([-0.29574 ,  0.35345 ,  0.63326 ,  0.19576 , -0.030256,  0.54244 ,
        -0.21091 ,  0.32894 , -0.48888 ,  0.18379 ,  0.24242 ,  0.40346 ,
         0.11973 ,  0.013143,  0.24154 ]))

Now we want to keep only those words that appear in our vocabulary. For words that are in our vocabulary but don't have an embedding, they will be represented as zero vectors (a shortcoming that we can address by training our own embeddings.)

In the following block we will build the embedding matrix. Each row of this matrix corresponds to the word of our vocabulary and is actually a vector representation of that word.

In [20]:
word_lookup = {word : vec for word, vec in zip(words, vectors)}

In [21]:
def make_embedding_matrix(rows = num_of_words, columns = 100):
    
    "Creates the embedding matrix."
    
    embedding_matrix = np.zeros((rows, columns))
    missing_words = 0

    for i, word in enumerate(word_idx.keys()):

        vec = word_lookup.get(word)

        if vec is not None:
            '''because the key of the first word in our dictionary
            corresponds to 1 (and not to 0), we have i+1'''
            embedding_matrix[i+1] = vec
        else:
            missing_words += 1

    print(f'There are {missing_words} words without pretrained embeddings.')
    
    return embedding_matrix

In [22]:
embedding_matrix = make_embedding_matrix(num_of_words, 100)

There are 2941 words without pretrained embeddings.


In [23]:
gc.enable()
del vectors 
gc.collect()

0

Each word is represented by a 100-dimensional vector. We can find the closest words to a given word in embedding space using the cosine distance. This requires first normalizing the vectors, i.e. the rows of the embedding matrix.

In [24]:
def norm_the_embedding(embedding_matrix):
    
    """Normalizers the embedding matrix vectors"""
    
    norm_vectors = np.linalg.norm(embedding_matrix, axis = 1).reshape((-1,1))
    embedding_matrix = embedding_matrix/norm_vectors
    embedding_matrix = np.nan_to_num(embedding_matrix)
    
    return embedding_matrix

In [25]:
embedding_matrix = norm_the_embedding(embedding_matrix)

In [26]:
def find_closest_words(query, embedding = embedding_matrix,
                      idx_word = idx_word, 
                       word_idx = word_idx, n = 10):
    
    """Finds closest words to a query word in embeddings"""
    
    idx = word_idx.get(query)
    
    if idx is None:
        print(f'\'{query}\' is not present in the vocabulary.')
    else:
        vec = embedding_matrix[idx]
        if np.all(vec == 0):
            print(f'\'{query}\' does not have a pretrained embedding.')
        else:
            distances = np.dot(embedding_matrix,vec)
            dist_idxs = np.argsort(distances)[::-1][:n]
            closest_dists = distances[dist_idxs]
            closest_words = [idx_word[idx] for idx in dist_idxs]
            
            print(f'Closest words to \'{query}\': ')
            for word, dist in zip(closest_words, closest_dists):
                print(f'Word: {word:15} Cosine similarty: {round(dist,4)}')

Let's try out the function defined above. It will list out 10 most similar words to a given query. There might be a litle more entries here, but it was so fun to play around with this:)

In [27]:
find_closest_words('the')

Closest words to 'the': 
Word: the             Cosine similarty: 1.0
Word: this            Cosine similarty: 0.8573
Word: part            Cosine similarty: 0.8508
Word: one             Cosine similarty: 0.8503
Word: of              Cosine similarty: 0.8329
Word: same            Cosine similarty: 0.8325
Word: first           Cosine similarty: 0.821
Word: on              Cosine similarty: 0.82
Word: its             Cosine similarty: 0.8169
Word: as              Cosine similarty: 0.8128


In [28]:
find_closest_words('neural')

Closest words to 'neural': 
Word: neural          Cosine similarty: 1.0
Word: neuronal        Cosine similarty: 0.6841
Word: cortical        Cosine similarty: 0.676
Word: plasticity      Cosine similarty: 0.6625
Word: pathways        Cosine similarty: 0.6534
Word: neurons         Cosine similarty: 0.6485
Word: sensory         Cosine similarty: 0.6391
Word: cognitive       Cosine similarty: 0.6125
Word: brain           Cosine similarty: 0.6082
Word: physiological   Cosine similarty: 0.6022


In [29]:
find_closest_words('.')

Closest words to '.': 
Word: .               Cosine similarty: 1.0
Word: but             Cosine similarty: 0.9049
Word: although        Cosine similarty: 0.8812
Word: however         Cosine similarty: 0.8778
Word: ,               Cosine similarty: 0.8756
Word: when            Cosine similarty: 0.8729
Word: and             Cosine similarty: 0.8717
Word: though          Cosine similarty: 0.8691
Word: it              Cosine similarty: 0.8654
Word: this            Cosine similarty: 0.8653


In [30]:
find_closest_words('differential')

Closest words to 'differential': 
Word: differential    Cosine similarty: 1.0
Word: equation        Cosine similarty: 0.717
Word: geometry        Cosine similarty: 0.7062
Word: equations       Cosine similarty: 0.6953
Word: nonlinear       Cosine similarty: 0.6427
Word: linear          Cosine similarty: 0.6352
Word: non-linear      Cosine similarty: 0.6035
Word: voltage         Cosine similarty: 0.5948
Word: generalization  Cosine similarty: 0.587
Word: algebraic       Cosine similarty: 0.5805


In [31]:
find_closest_words('waves')

Closest words to 'waves': 
Word: waves           Cosine similarty: 1.0
Word: wave            Cosine similarty: 0.7943
Word: tidal           Cosine similarty: 0.7426
Word: wind            Cosine similarty: 0.726
Word: currents        Cosine similarty: 0.7176
Word: winds           Cosine similarty: 0.6818
Word: storm           Cosine similarty: 0.6814
Word: ocean           Cosine similarty: 0.6757
Word: tides           Cosine similarty: 0.6724
Word: tide            Cosine similarty: 0.6461


In [32]:
find_closest_words('rocket')

Closest words to 'rocket': 
Word: rocket          Cosine similarty: 1.0
Word: missile         Cosine similarty: 0.7177
Word: fired           Cosine similarty: 0.6951
Word: launch          Cosine similarty: 0.6864
Word: firing          Cosine similarty: 0.6759
Word: launching       Cosine similarty: 0.6662
Word: fire            Cosine similarty: 0.6538
Word: attack          Cosine similarty: 0.6496
Word: unmanned        Cosine similarty: 0.6344
Word: tank            Cosine similarty: 0.6284


In [33]:
find_closest_words('diffraction')

Closest words to 'diffraction': 
Word: diffraction     Cosine similarty: 1.0
Word: x-ray           Cosine similarty: 0.6791
Word: fluorescence    Cosine similarty: 0.6164
Word: spectroscopy    Cosine similarty: 0.6022
Word: gratings        Cosine similarty: 0.5913
Word: microscopy      Cosine similarty: 0.5835
Word: scattering      Cosine similarty: 0.5797
Word: electron        Cosine similarty: 0.552
Word: electromagnetic Cosine similarty: 0.5509
Word: excitation      Cosine similarty: 0.5422


In [34]:
find_closest_words('majestic')

'majestic' is not present in the vocabulary.


In [35]:
find_closest_words('dnn')

'dnn' does not have a pretrained embedding.


### Building a Model

With data encoded as integers and an embedding matrix of pre-trained word vectors, we're ready to build the recurrent neural network. This model is relatively simple and uses an LSTM cell as the heart of the network. After converting the words into embeddings, we pass them through a single LSTM layer, then into a fully connected layer with `relu` activation before the final output layer with a `softmax` activation. The final layer produces a probability for every word in the vocabulary.

When training, these predictions are compared to the actual label using the `categorical_crossentropy` to calculate a loss. The parameters (weights) in the network are then updated using the Nadam optimizer with gradients calculated through backpropagation.

In [36]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout, Masking

In [37]:
def make_rnn(num_of_words = num_of_words,
            embedding_matrix = embedding_matrix):
    
    "Generates an RNN with an LSTM cell and pretrained embeddings."
    
    model = Sequential()
    
    #implements the pretrained embeddings
    model.add(Embedding(input_dim = num_of_words,
                       output_dim = embedding_matrix.shape[1],
                       weights = [embedding_matrix],
                       trainable = False,
                       mask_zero = True))
    #masking makes sure that the words that do not have 
    #pretrained embeddings are skipped and do not effect
    #the training
    model.add(Masking())
    
    #adding an LSTM layer
    model.add(LSTM(64, dropout = 0.1, recurrent_dropout = 0.1,
                   return_sequences = False))
    
    #adding a dense layer
    model.add(Dense(128, activation = 'relu'))
    model.add(Dropout(0.5))
    
    #adding an output layer
    model.add(Dense(num_of_words, activation = 'softmax'))
    
    #compiling the RNN
    model.compile(optimizer = 'nadam',
                 loss = 'categorical_crossentropy',
                 metrics = ['accuracy'])
    
    return model

First we will just use the words having pretrained embeddings, without training on the words from our own vocabulary. 

In [39]:
model = make_rnn()
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 100)         1367700   
_________________________________________________________________
masking_2 (Masking)          (None, None, 100)         0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                42240     
_________________________________________________________________
dense_3 (Dense)              (None, 128)               8320      
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 13677)             1764333   
Total params: 3,182,593
Trainable params: 1,814,893
Non-trainable params: 1,367,700
____________________________________

Using pre-trained embeddings means we have about half the parameters to train. However, this also means that the embeddings might not be the best for our data, and there's not an insignificant number of words with no embeddings.

### Training the Model

We can now train the model on our training examples. We'll make sure to use early stopping with a validation set to stop the training when the loss on the validation set is no longer decreasing. Also, we'll save the best model every time the validation loss decreases so we can then load in the best model to generate predictions. Saving the model also very useful in this case, since the training can take several hours. So we can load it within seconds if we need to use it anew.

In [40]:
from keras.models import load_model
from keras.callbacks import EarlyStopping, ModelCheckpoint

In [41]:
def make_callbacks(model_name):
    
    """Makes list of callbacks for training"""
    
    #Early Stopping: Stop training when 
    #validation loss no longer decreases
    callbacks = [EarlyStopping(monitor = 'val_loss',
                          patience = 5)]
    
    #Model Checkpoint: Save the best model on disk
    callbacks.append(ModelCheckpoint(model_name + '.h5',
                                    save_best_only = True,
                                    save_weights_only = False))
    
    return callbacks

In [42]:
model_dir = '../Models/'
model_name = 'pretrained_rnn'
model_name = model_dir + model_name

callbacks = make_callbacks(model_name)

for _ in range(3):

    history = model.fit(X_train, y_train, epochs = 150, 
                        batch_size = 2048, callbacks = callbacks,
                        validation_data = (X_val, y_val))

In [43]:
def load_and_evaluate(model_name):
    
    """Load in a trained model and evaluate with 
    log-loss crossentropy and accuracy"""
    
    model = load_model(model_name + '.h5')
    evaluate = model.evaluate(X_val, y_val, batch_size = 2048)
    
    print(f'Crossentropy loss: {round(evaluate[0],4)}')
    print(f'Accuracy: {round(evaluate[1]*100,2)}%')
    
    
    return model

In [45]:
model = load_and_evaluate(model_name)

Crossentropy loss: 4.937
Accuracy: 22.72%


We'll compare our model with two simple models: one, always guessing the most frequent word (which is `the`), and the other always making a random guess from 10 most common words. We compare with the words from the validation set. 

In [46]:
#a model always guessing the word 'the'

#getting the word index for each entry in the validation set
val_word_idx = np.argmax(y_val, axis = 1)

#using the fact that 'the' has index 1, we compute the accuracy
the_accuracy = np.mean(val_word_idx == 1)

print("'The' accuracy: {}%".format(round(the_accuracy*100,2)))

'The' accuracy: 8.72%


In order to compare with the second model, we need to compute the frequencies of the most frequent words, since we'll base our random predictions using these frequencies. We'll also use the fact that tokenizer sorts words by their frequency.

In [47]:
total_words = sum(word_count.values())

freq_words = list(word_idx.keys())[0:10]

frequencies = [word_count[word]/total_words for word in freq_words]
frequencies = np.round(frequencies, 3)[0:10]

for word, freq in zip(freq_words, frequencies):
    print(f'Word: {word:10} Frequency: {freq}')

Word: the        Frequency: 0.073
Word: a          Frequency: 0.05
Word: of         Frequency: 0.041
Word: .          Frequency: 0.033
Word: ,          Frequency: 0.031
Word: and        Frequency: 0.026
Word: to         Frequency: 0.024
Word: network    Frequency: 0.016
Word: neural     Frequency: 0.015
Word: is         Frequency: 0.014


We will use multinomial distribution to randomly select the outcome word. But first, we need to infer probabilities based on frequencies. 

In [48]:
k = 1/np.sum(frequencies)
probs = np.round(frequencies*k, 2)
for word, prob in zip(freq_words, probs):
    print(f'Word: {word:10} Probability: {prob}')

Word: the        Probability: 0.23
Word: a          Probability: 0.15
Word: of         Probability: 0.13
Word: .          Probability: 0.1
Word: ,          Probability: 0.1
Word: and        Probability: 0.08
Word: to         Probability: 0.07
Word: network    Probability: 0.05
Word: neural     Probability: 0.05
Word: is         Probability: 0.04


In [49]:
#a model always guessing one of the 10 most common words

#getting the indices of a randomly choosen words
random_guess = np.random.choice(range(10), len(y_val), p = probs)

#getting the word index for each entry in the validation set
val_word_idx = np.argmax(y_val, axis = 1)

#comparing the outcomes
rand_accuracy = np.mean(random_guess == val_word_idx)
print(f'Accuracy of random guessing {round(rand_accuracy*100,2)}%')

Accuracy of random guessing 3.31%


We can see that our model outperforms both guessing models by far, with 22.27% accuracy. On the other side, "the" guessing has 8.72% accuracy, and random guessing based on relative word frequencies has 3.18% accuracy.

Let's now see our model in action trying to predict the next word from some abstract excerpt for the validation set. 

In [50]:
def predict_the_label(model):
    
    '''Guesses the output of the abstract excerpt from the
    validaton set'''
    
    num_features = len(X_val)
    index = np.random.randint(num_features)
    
    
    #validation set text
    feat_text = ' '.join(idx_word[i] for i in X_val[index])
    
    
    #predicted_label
    orig_text = X_val[index].reshape(1,-1)
    pred_idx = model.predict_classes(orig_text)[0]
    pred_label = idx_word[pred_idx]
    
    #true_label
    label_idx = np.argmax(y_val[index])
    label = idx_word[label_idx]
    
    
    print(f'Predticted: {feat_text} \033[1m{pred_label}\033[0m\n')
    print(f'True: \033[1m{label}\033[0m\n')
    
    if label == pred_label:
        print(f'\033[1m Good guess! \033[0m')

In [51]:
for _ in range(50):
    predict_the_label(model)
    print('.........................')

Predticted: both binary and continuously modulated energy emissions . in one embodiment , array of parallel processors exhibits behavior of cooperative-competitive neural networks . parallel bus interconnections and digital and analog processing of analog information contained in the exchanged energy emissions are employed with generally local synchronization of the processors . [1mthe[0m

True: [1menergy[0m

.........................
Predticted: the software work-around input as the output vector of the programmable logic circuit . the feedforward lam neural network checking circuit has a weight matrix whose elements are based on a set of known bad input vectors for said faulty hardware block . the feedforward lam neural network checking circuit [1mis[0m

True: [1mmay[0m

.........................
Predticted: sensor for changes in pressure and temperature to indicate a concentration of the sensed species . an apparatus is also provided , for compensating an electrochemical sen

.........................
Predticted: an estimation error criterion . the information process either consists of a measurement process , or if the signal and measurement processes are time-variant , consists of the measurement process as well as a time variance process , that describes the time-variant properties of the signal and measurement processes . [1mthe[0m

True: [1mthe[0m

[1m Good guess! [0m
.........................
Predticted: characteristic features a data matrix , the data matrix is processed with a self organizing network to produce a self organizing feature space mapping . the self organizing feature space mapping is processed to produce a density characterization of the feature space mapping . the self organizing network is preferably [1mused[0m

True: [1mcompletely[0m

.........................
Predticted: provided for distributing and storing sets of temporally ordered information in a systematic and sequential fashion . this method is based on a model of h

### Generating the Output

Now for the fun part: we get to use our model to generate new abstracts. To do this, we feed the network a seed sequence, have it make a prediction, add the predicted word to the sequence, and make another prediction for the next word. We continue this for the number of words that we want. We compare the generated output to the actual abstract to see if we can tell the difference!

In [52]:
#some HTML formatting
from IPython.display import HTML


def header(text, color='black'):
    raw_html = f'<h1 style="color: {color};"><center>' + \
        str(text) + '</center></h1>'
    return raw_html


def box(text):
    raw_html = '<div style="border:1px inset black;padding:1em;font-size: 20px;">' + \
        str(text)+'</div>'
    return raw_html


def addContent(old_html, raw_html):
    old_html += raw_html
    return old_html

In [55]:
import random

def generate_output(model, sequences, train_length = 50,
                   new_words = 50, diversity = 1):
    
    """Generates new apstract excerpt out 
    of output from a trained model and formats it into HTML."""
    
    #choosing a random sequence from all the abstracts
    #and choosing a random seed subsequence from it
    seq = random.choice(sequences)
    start_idx = random.randint(0, len(seq) - train_length - 20)
    end_idx = start_idx + train_length
    seed_seq = seq[start_idx : end_idx]
    seed_text = [idx_word[i] for i in seed_seq]
    
    #generating new sequence of words from the seed sequence
    gen_seq = []
    
    for _ in range(new_words):
        
        input_seq = np.array(seed_seq).reshape(1,-1)
        
        #predicting the probabilities for each word
        probs = model.predict(input_seq)[0]
        
        #implementing the diversity parameter
        probs = np.log(probs)/diversity
        probs = np.exp(probs)
        
        #implementing softmax function
        probs = probs/sum(probs)
        probs = np.round(probs, 4)
        
        #getting the next word
        if sum(probs[:-1]) > 1:
            #sometimes, due to computational errors we might get
            #sum of probabilities to be greater than 1
            next_word = np.argmax(probs)
        else:
            next_word = np.random.multinomial(1,probs)
            next_word = np.argmax(next_word)

        #updating the seed sequence
        seed_seq = seed_seq[1:] + [next_word]
        gen_seq.append(next_word)
        
    
    #getting the actual text
    actual_seq = seq[end_idx : end_idx+new_words]
    actual_text = [idx_word.get(i,'<-->') for i in actual_seq]
    
    #getting the generated text
    gen_seq = gen_seq[:len(actual_seq)]
    gen_text = [idx_word.get(i,'<-->') for i in gen_seq]
    
    
    #HTML formatting
    seed_html = ''
    seed_html = addContent(seed_html, header(
        'Seed Sequence', color='darkblue'))
    seed_html = addContent(seed_html,
                           box(' '.join(seed_text)))
    
    actual_html = ''
    actual_html = addContent(actual_html, header('Actual', color='darkgreen'))
    actual_html = addContent(actual_html, box(' '.join(actual_text)))


    gen_html = ''
    gen_html = addContent(gen_html, header('RNN Generated', color='darkred'))
    gen_html = addContent(gen_html, box(' '.join(gen_text)))

    
    return seed_html, actual_html, gen_html

And now let's see our generator in action.

In [56]:
for _ in range(3):
    seed_html, actual_html, gen_html = generate_output(model, sequences)
    display(HTML(seed_html))
    display(HTML(actual_html))
    display(HTML(gen_html))
    print('..........................')

..........................


..........................


..........................


The `diversity` parameter determines how much randomness is added to the predictions. If we just use the most likely word for each prediction, the output sometimes gets stuck in loops. The diversity means the predicted text has a little more variation. 

In [58]:
for _ in range(3):
    seed_html, actual_html, gen_html = generate_output(model, sequences,
                                                       diversity = 0.75)
    display(HTML(seed_html))
    display(HTML(actual_html))
    display(HTML(gen_html))
    print('..........................')

..........................


..........................


..........................


### Training Own Embeddings

Now we will try training an RNN with our own embedidings. We'll also include punctuated words (e.g. `word.`) and words with capitalized first letter as a part of our vocabulary this time.

In [59]:
#memory management
def clear_memory(variables):
    gc.enable()
    for var in ['X_train', 'X_val', 'y_train', 'y_val',
               'idx_word', 'word_idx', 'frequencies',
               'embedding_matrix', 'model']:
        if var in variables:
            del globals()[var]
    gc.collect()

In [60]:
clear_memory(dir())

In [61]:
filters = '!"%;[\\]^_`{|}~\t\n'
word_idx, idx_word, num_of_words, word_count, f_abstracts, sequences, features, labels = make_sequences(
    abstracts, filters, lower=False)

There are 21382 unique words in the dictionary.
There are 286322 training sequences.


In [62]:
embedding_matrix = make_embedding_matrix(num_of_words, 100)

There are 11980 words without pretrained embeddings.


In [63]:
embedding_matrix.shape

(21383, 100)

In [64]:
X_train, y_train, X_val, y_val = make_train_val(features, labels, num_of_words)

In [65]:
X_train.shape, y_train.shape

((200425, 50), (200425, 21383))

Appart from training our own embeddings, we will now also change the structure of our RNN a bit. We will use bidirectional LSTM cells for our RNN and use adam optimzer this time. 

In [66]:
from keras.layers import Bidirectional

def make_bidirectional_rnn(num_of_words = num_of_words,
            embedding_matrix = embedding_matrix):
    
    "Generates a bidirectional RNN with an LSTM cell."
    
    model = Sequential()
    
    #implements the pretrained embeddings and additionally
    #trains the embeddings for the words from our vocabulary
    model.add(Embedding(input_dim = num_of_words,
                       output_dim = embedding_matrix.shape[1],
                       weights = [embedding_matrix],
                       trainable = True))
    
    #adding a bidirectional LSTM layer
    model.add(Bidirectional(LSTM(64, dropout = 0.1, 
                                 recurrent_dropout = 0.1, 
                                 return_sequences = False)))
    
    #adding a dense layer
    model.add(Dense(128, activation = 'relu'))
    model.add(Dropout(0.5))
    
    #adding an output layer
    model.add(Dense(num_of_words, activation = 'softmax'))
    
    #compiling the RNN
    model.compile(optimizer = 'adam',
                 loss = 'categorical_crossentropy',
                 metrics = ['accuracy'])
    
    return model

In [67]:
model = make_bidirectional_rnn(num_of_words, embedding_matrix)
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, None, 100)         2138300   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 128)               84480     
_________________________________________________________________
dense_5 (Dense)              (None, 128)               16512     
_________________________________________________________________
dropout_3 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 21383)             2758407   
Total params: 4,997,699
Trainable params: 4,997,699
Non-trainable params: 0
_________________________________________________________________


And now it's time to train the new model. The training will take longer this time, since we're now training our embeddings as well, so we have more parameters to train.

In [68]:
model_dir = '../Models/'
model_name = 'bidir_rnn'
model_name = model_dir + model_name
callbacks = make_callbacks(model_name)

for _ in range(3):

    history = model.fit(X_train, y_train, batch_size = 2048,
                        epochs = 150, callbacks = callbacks,
                        validation_data = (X_val, y_val))

In [69]:
model = load_and_evaluate(model_name)

Crossentropy loss: 5.5816
Accuracy: 25.18%


### Inspecting the Embeddings

We can take a look at our trained embeddings to figure out the closest words in the new embedding space. These embeddings are trained for our task, which means they will differ from the pre-trained versions. But first, we need to normalize the embedding vectors.

In [70]:
def get_embeddings(model):
    
    embedding_layer = model.get_layer(index = 0)
    embedding_matrix = embedding_layer.get_weights()[0]
    
    vec_norms = np.linalg.norm(embedding_matrix, axis = 1).reshape((-1,1))
    embedding_matrix = embedding_matrix/vec_norms
    embedding_matrix = np.nan_to_num(embedding_matrix)
    
    return embedding_matrix

In [71]:
embedding_matrix = get_embeddings(model)

In [72]:
find_closest_words('the')

Closest words to 'the': 
Word: the             Cosine similarty: 1.0
Word: said            Cosine similarty: 0.8162000179290771
Word: processing      Cosine similarty: 0.8093000054359436
Word: if              Cosine similarty: 0.7989000082015991
Word: a               Cosine similarty: 0.7961000204086304
Word: force           Cosine similarty: 0.7741000056266785
Word: via             Cosine similarty: 0.7591000199317932
Word: down            Cosine similarty: 0.734000027179718
Word: for             Cosine similarty: 0.7333999872207642
Word: uses            Cosine similarty: 0.727400004863739


In [73]:
find_closest_words('neural')

Closest words to 'neural': 
Word: neural          Cosine similarty: 1.0
Word: with            Cosine similarty: 0.8216999769210815
Word: including       Cosine similarty: 0.7551000118255615
Word: structure       Cosine similarty: 0.7233999967575073
Word: polynomial      Cosine similarty: 0.7199000120162964
Word: time            Cosine similarty: 0.7196999788284302
Word: estimated       Cosine similarty: 0.7184000015258789
Word: of              Cosine similarty: 0.7120000123977661
Word: image           Cosine similarty: 0.7116000056266785
Word: that            Cosine similarty: 0.7088000178337097


Let's see the new model in action.

In [74]:
for _ in range(50):
    predict_the_label(model)
    print('.........................')

Predticted: Such preprocessing phase applies a genetic algorithm to populations of prediction algorithms which vary as to number and content of input variables, where the prediction algorithms representing the selections of input variables which have the best testing performances and the minimum input variables are promoted for the processing of the [1mneural[0m

True: [1mnew[0m

.........................
Predticted: hands and internal hand control system from the muscular level through the intrafusal fiber system of the neural network is considered in creating the robot and method of operation of the present invention. Therefore, the surgery is not slowed down as in the art, because the surgeon is in conscious [1mand[0m

True: [1mand[0m

[1m Good guess! [0m
.........................
Predticted: program product generates self-organizing layouts of process diagrams. Initial weight vectors are distributed uniformly within boundaries of regions in the process diagram. A spatial 


True: [1mpersons[0m

.........................
Predticted: a window for introduction of electromagnetic energy into the sample tank for analyzing a formation fluid sample down hole or at the surface without disturbing the sample. Near infrared, mid infrared and visible light analysis is performed on the sample to provide a downhole in situ or surface on site [1min[0m

True: [1manalysis[0m

.........................
Predticted: controller, a sigmoid activation ROM look-up-table, a plurality of neuron state registers, and a synaptic weight RAM. The neuroprocessor reduces the number of neurons required to perform the task by time multiplexing groups of neurons from a fixed pool of neurons to achieve the successive hidden layers of a recurrent [1mneural[0m

True: [1mnetwork[0m

.........................
Predticted: hidden layer output signals may be augmented by weight factors, and the augmented hidden layer output signals may be linearly combined to produce real and imaginary c

In [75]:
for _ in range(3):
    seed_html, actual_html, gen_html = generate_output(model, sequences)
    display(HTML(seed_html))
    display(HTML(actual_html))
    display(HTML(gen_html))
    print('..........................')

..........................


..........................


..........................


In [76]:
#changing the diversity
for _ in range(3):
    seed_html, actual_html, gen_html = generate_output(model, sequences,
                                                       diversity = 0.75)
    display(HTML(seed_html))
    display(HTML(actual_html))
    display(HTML(gen_html))
    print('..........................')

..........................


..........................


..........................


### Reducing the Training Sequence

We can try and see how our bidirectional model responds if we reduce the training sequence size. This actually means that we are going to have larger training sets for our model to learn from.

In [77]:
clear_memory(dir())

filters = '!"%;[\\]^_`{|}~\t\n'
word_idx, idx_word, num_of_words, word_count, f_abstracts, sequences, features, labels = make_sequences(
    abstracts, filters, training_length = 20, lower=False)

There are 21382 unique words in the dictionary.
There are 393004 training sequences.


In [78]:
embedding_matrix = make_embedding_matrix(num_of_words, 100)

There are 11980 words without pretrained embeddings.


In [79]:
X_train, y_train, X_val, y_val = make_train_val(features, labels, num_of_words)

X_train.shape, y_train.shape

((275102, 20), (275102, 21383))

In [80]:
model_dir = '../Models/'
model_name = 'short_bidir_rnn'
model_name = model_dir + model_name
callbacks = make_callbacks(model_name)

for _ in range(3):

    history = model.fit(X_train, y_train, batch_size = 2048,
                    epochs = 150, callbacks = callbacks,
                    validation_data = (X_val, y_val))

In [81]:
model = load_and_evaluate(model_name)

Crossentropy loss: 5.3962
Accuracy: 22.85%


Accuracy-wise, our previous model with logner training sequences performed better. But let's see how this model behaves when it come to generating new texts.

In [82]:
for _ in range(50):
    predict_the_label(model)
    print('.........................')

Predticted: to each of the object locations and training the neural network on the training image using the optimal set of [1mthe[0m

True: [1massignments.[0m

.........................
Predticted: least one of the vectors of the second collection with a corresponding vector of a fifth collection. Using a forecasting [1msystem[0m

True: [1mconfiguration,[0m

.........................
Predticted: of the optical emission trace. The back-propagation method is used to train the network. More generally, a neural network can [1mbe[0m

True: [1mbe[0m

[1m Good guess! [0m
.........................
Predticted: mixed together in changing interference environments with very minimal assumption on the original signals. The system of this invention has [1ma[0m

True: [1mpractical[0m

.........................
Predticted: speech post classifier such as a neural network. In one embodiment, a fuzzy Viterbi algorithm is used with the hidden [1mlayer[0m

True: [1mMarkov[0m

........

Predticted: apparatus of designing a set of wavelet basis trained to fit a particular problem. The method and apparatus include constructing [1ma[0m

True: [1ma[0m

[1m Good guess! [0m
.........................
Predticted: into a plurality of voice parameters including frequency components, a neural network for transforming at least some of the separated [1mof[0m

True: [1mfrequency[0m

.........................
Predticted: representative of many process periods of one or more decarburization operations for providing an oxygen count for a preselected gas [1msignal[0m

True: [1mratio[0m

.........................
Predticted: on recent post-synaptic activity of neighboring neurons. Apparatus and methods for simplifying training of the devices are also disclosed, including [1ma[0m

True: [1ma[0m

[1m Good guess! [0m
.........................


In [84]:
for _ in range(3):
    seed_html, actual_html, gen_html = generate_output(model, sequences)
    display(HTML(seed_html))
    display(HTML(actual_html))
    display(HTML(gen_html))
    print('..........................')

..........................


..........................


..........................


In [85]:
for _ in range(3):
    seed_html, actual_html, gen_html = generate_output(model, sequences,
                                                      diversity = 0.75)
    display(HTML(seed_html))
    display(HTML(actual_html))
    display(HTML(gen_html))
    print('..........................')

..........................


..........................


..........................


So, we really see that there's no significant difference over the previous model. It's then definitively better to use the model with longer training lengths, since our training set will be smaller, and take up less memory. 

### Human or Machine?

Now we're going to play a little game of guessing. Each time a user will be provided with three outputs: two of them will be machine generated, and the remaining one written by a human. The goal is to guess the non-artifical abstract. We'll use the second model to generate abstracts.

In [87]:
model_dir = '../Models/'
model_name = 'bidir_rnn'
model_name = model_dir + model_name
model = load_and_evaluate(model_name)

Crossentropy loss: 5.0167
Accuracy: 25.49%


In [136]:
def generate_game_output(model, sequences, train_length = 50,
                   new_words = 50, diversity = 1, output_words = 20):
    
    seq = random.choice(sequences)
    start_idx = random.randint(0, len(seq) - train_length - 20)
    end_idx = start_idx + train_length
    seed_seq = seq[start_idx : end_idx]
    seed_text = [idx_word[i] for i in seed_seq]
    
    
    #getting the actual text
    actual_seq = seq[end_idx : end_idx+new_words]
    actual_text = [idx_word.get(i,'<-->') for i in actual_seq]
    
    game_texts = []
    
    for _ in range(2):
    
        gen_seq = []

        for _ in range(new_words):

            input_seq = np.array(seed_seq).reshape(1,-1)

            #predicting the probabilities for each word
            probs = model.predict(input_seq)[0]

            #implementing the diversity parameter
            probs = np.log(probs)/diversity
            probs = np.exp(probs)

            #implementing softmax function
            probs = probs/sum(probs)
            probs = np.round(probs, 4)
            
            
            #getting the next word
            if sum(probs[:-1]) > 1:
                #sometimes, due to computational errors we might get
                #sum of probabilities to be greater than 1
                next_word = np.argmax(probs)
            else:
                next_word = np.random.multinomial(1,probs)
                next_word = np.argmax(next_word)

            #updating the seed sequence
            seed_seq = seed_seq[1:] + [next_word]
            gen_seq.append(next_word)
            
            #getting the generated text
            gen_seq = gen_seq[:len(actual_seq)]
            gen_text = [idx_word.get(i,'<-->') for i in gen_seq]
        

        gen_text = ' '.join(gen_text[:output_words])
        game_texts.append(gen_text)
        
    
    act_text = ' '.join(actual_text[:output_words])
    game_texts.append(act_text)
    
    return game_texts

In [137]:
def guessing_game(model, sequences, training_length = 50,
                 new_words = 50, output_words = 10):
    
    
    diversity = np.random.uniform(0.5, 1)
    
    #generating outputs using an RNN
    game_texts = generate_game_output(model, sequences, 
                             diversity = diversity,
                             train_length = training_length,
                             new_words = new_words,
                             output_words = output_words)
    
    
    outputs = {'wrong1' : game_texts[0], 'wrong2' : game_texts[1],
              'right' : game_texts[2]}
    
    
    choices = ['wrong1', 'wrong2', 'right']
    choices = shuffle(choices)
    
    for i in range(3):
        #printing out all the choices
        print(f'Output #{i+1}:')
        print(f'{outputs[choices[i]]}\n')
       
    #making the guess   
    guess = int(input("Enter the option you think it's human (1-3):"))
    print('\n')
    
    #checking the answer
    if choices[guess-1] == 'right':
        print(f'\033[1mGood guess!\033[0m\n')
    else:
        print(f'\033[1mWrong! The machine tricked you.\033[0m')
        print(f'The right answer is {choices.index("right") + 1}\n')

In [140]:
for _ in range(10):
    guessing_game(model, sequences, output_words = 10)

Output #1:
by the first derivative to prior a neural network model

Output #2:
reduce randomness associated with other methods of similar purposes, and

Output #3:
be easily configured to discriminate the trajectories of Lyapunov resolution

Enter the option you think it's human (1-3):3


[1mWrong! The machine tricked you.[0m
The right answer is 2

Output #1:
includes a neural network as a second neural network for

Output #2:
further includes modulating the mean neural group and the variance

Output #3:
network for generating a neural network and the internal neural

Enter the option you think it's human (1-3):2


[1mGood guess![0m

Output #1:
input layer within a plurality of substrings and data cells

Output #2:
A second neural network using the data set of data

Output #3:
video content can be determined based at least in part

Enter the option you think it's human (1-3):3


[1mGood guess![0m

Output #1:
for the feedback input pattern adjusted through the input vector

Output

### Conclusions

In this notebook, we built a recurrent neural network and used it to generate patent abstracts. Although the output is not always believable, in some instances it was very hard to distinguish human created from the machine generated parts of text (as in the guessing game). 

We'll list the results from the models we built:

| Model Name | Pretrained Embeddings | Training Sequence Size | Optimizer | Validation Accuracy |
|-------------|---------------|----------------|-------------|-------|
| One-directional | Yes | 50 words | Nadam | 22.72% |
| Bi-directional | No | 50 words  | Adam | 25.18% |
| Short bi-directional | No | 20 words | Adam | 22.85% |

The best performing model in terms of the validation data appears to be the bi-directional model with training sequence of 50 words. Overall, all of the models produced quite similar results suggesting their may be an upper limit in terms of the accuracy that can be achieved on this problem. 

Another possible ways to improve the preformace is trying longer sequences, different optimizers and using more LSTM layers. 