

### How to Build and Use a Recurrent Neural Network in Keras to Write Patent Abstracts

We input a sequence of words and train the model to predict the very next word.

When we go to write a new patent, we pass in a starting sequence of words, make a prediction for the next word, update the input sequence, make another prediction, add the word to the sequence and continue for however many words we want to generate.
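The generation loop described above can be sketched without any neural network at all. In the minimal example below, `toy_predict_next` is a hypothetical stand-in for the trained model (a fixed lookup on the last word), and `generate` is an illustrative helper, not a function from this notebook.

```python
def generate(predict_next, seed, n_words):
    """Repeatedly predict the next word and append it to the running sequence."""
    sequence = list(seed)
    for _ in range(n_words):
        # Stand-in for "call the model, then decode its prediction"
        next_word = predict_next(sequence)
        # The prediction becomes part of the input for the next step
        sequence.append(next_word)
    # Return only the newly generated words
    return sequence[len(seed):]


# Hypothetical stand-in "model": the next word depends only on the last word
toy_transitions = {'a': 'neural', 'neural': 'network', 'network': 'is', 'is': 'a'}


def toy_predict_next(sequence):
    return toy_transitions[sequence[-1]]


generated = generate(toy_predict_next, ['a'], 5)
print(generated)
```

The real model differs in that it looks at the whole input sequence and returns a probability for every word in the vocabulary, from which the next word is sampled.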

The steps of the approach are outlined below:
1. Convert abstracts from list of strings into list of lists of integers (sequences)
2. Create features and labels from sequences
3. Build LSTM model with Embedding, LSTM, and Dense layers
4. Load in pre-trained embeddings
5. Train model to predict next word in sequence
6. Make predictions by passing in starting sequence
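
As a rough illustration of step 1, the sketch below builds a word-to-integer mapping by hand on two toy sentences. It is only meant to mirror the general idea behind the Keras `Tokenizer` used later (more frequent words get smaller indexes, and index 0 is left unused); the toy `texts` and the splitting on whitespace are assumptions made for the example.

```python
from collections import Counter

texts = ['the network learns the weights', 'the weights update']

# Count every word, then give the most frequent word index 1 (0 is kept free)
counts = Counter(word for text in texts for word in text.split())
word_idx = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}
idx_word = {i: word for word, i in word_idx.items()}

# Each text becomes a list of integers
sequences = [[word_idx[w] for w in text.split()] for text in texts]
print(sequences)
```

Mapping the integers back through `idx_word` recovers the original words, which is how generated output is turned back into text.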


In [1]:
import keras as k

Using TensorFlow backend.


In [2]:
k.__version__

'2.3.0'

In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
from IPython.core.interactiveshell import InteractiveShell
from IPython.display import HTML

InteractiveShell.ast_node_interactivity = 'all'

import warnings
warnings.filterwarnings('ignore', category = RuntimeWarning)
warnings.filterwarnings('ignore', category = UserWarning)

import pandas as pd
import numpy as np
#from utils import get_data, generate_output, guess_human, seed_sequence, get_embeddings, find_closest

In [5]:
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense, Dropout, Embedding, Masking
from keras.optimizers import Adam
from keras.utils import Sequence, plot_model
from keras.preprocessing.text import Tokenizer

from sklearn.utils import shuffle

from itertools import chain
import random
import json
import re

RANDOM_STATE = 50
TRAIN_FRACTION = 0.7


#### Utility Functions

In [6]:
def get_model(model_name):
    """Retrieve a Keras model and embeddings"""
    model = load_model(f'{model_name}.h5')
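    # The first layer is the Embedding layer; its weights are the word vectors
    embeddings = model.get_layer(index = 0)
    embeddings = embeddings.get_weights()[0]
    # Normalize each row to unit length so dot products are cosine similarities
    embeddings = embeddings / np.linalg.norm(embeddings, axis = 1).reshape((-1, 1))
    # All-zero rows divide to NaN; set them back to zero
    embeddings = np.nan_to_num(embeddings)
    # The word-to-index mapping is the first JSON line of the file
    word_idx = []
    with open('training-rnn.json', 'rb') as f:
        for l in f:
            word_idx.append(json.loads(l))
        
    word_idx = word_idx[0]
    word_idx['UNK'] = 0
    idx_word = {index: word for word, index in word_idx.items()}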
    return model, embeddings, word_idx, idx_word


In [7]:
def get_embeddings(model):
    """Retrieve the embeddings in a model"""
    embeddings = model.get_layer(index = 0)
    embeddings = embeddings.get_weights()[0]
    embeddings = embeddings / np.linalg.norm(embeddings, axis = 1).reshape((-1, 1))
    embeddings = np.nan_to_num(embeddings)
    return embeddings


In [8]:
def find_closest(query, embedding_matrix, word_idx, idx_word, n = 10):
    """Find closest words to a query word in embeddings"""
    
    idx = word_idx.get(query, None)
    # Handle case where query is not in vocab
    if idx is None:
        print(f'{query} not found in vocab.')
        return
    else:
        vec = embedding_matrix[idx]
        # Handle case where word doesn't have an embedding
        if np.all(vec == 0):
            print(f'{query} has no pre-trained embedding.')
            return
        else:
            # Cosine similarity between the query and every word
            # (rows are unit length, so a dot product suffices)
            dists = np.dot(embedding_matrix, vec)
            
            # Sort indexes from most to least similar and keep the top n
            idxs = np.argsort(dists)[::-1][:n]
            sorted_dists = dists[idxs]
            closest = [idx_word[i] for i in idxs]
            
    print(f'Query: {query}\n')
    # Print out the words and their cosine similarities
    for word, dist in zip(closest, sorted_dists):
        print(f'Word: {word:15} Cosine Similarity: {round(dist, 4)}')
        

In [9]:
def format_sequence(s):
    """Add a space before punctuation and remove parenthesized reference numbers."""
    
    # Add a space before punctuation that directly follows a non-digit character
    s = re.sub(r'(?<=[^\s0-9])(?=[.,;?])', r' ', s)
    
    # Remove parenthesized reference numbers such as (12)
    s = re.sub(r'\((\d+)\)', r'', s)
    
    # Collapse each pair of whitespace characters into a single space
    s = re.sub(r'\s\s', ' ', s)
    return s


In [10]:
def remove_spaces(s):
    """Remove spaces before punctuation"""
    s = re.sub(r'\s+([.,;?])', r'\1', s)
    
    return s


In [11]:
def get_data(file, filters='!"%;[\\]^_`{|}~\t\n', training_len=50,
             lower=False):
    """Retrieve formatted training and validation data from a file"""
    
    data = pd.read_csv(file, parse_dates=['patent_date']).dropna(subset = ['patent_abstract'])
    abstracts = [format_sequence(a) for a in list(data['patent_abstract'])]
    word_idx, idx_word, num_words, word_counts, texts, sequences, features, labels = make_sequences(
        abstracts, training_len, lower, filters)
    X_train, X_valid, y_train, y_valid = create_train_valid(features, labels, num_words)
    training_dict = {'X_train': X_train, 'X_valid': X_valid, 
                     'y_train': y_train, 'y_valid': y_valid}
    return training_dict, word_idx, idx_word, sequences



In [12]:
def create_train_valid(features,
                       labels,
                       num_words,
                       train_fraction=TRAIN_FRACTION):
    """Create training and validation features and labels."""
    
    # Randomly shuffle features and labels
    features, labels = shuffle(features, labels, random_state=RANDOM_STATE)

    # Decide on number of samples for training
    train_end = int(train_fraction * len(labels))

    train_features = np.array(features[:train_end])
    valid_features = np.array(features[train_end:])

    train_labels = labels[:train_end]
    valid_labels = labels[train_end:]

    # Convert to arrays
    X_train, X_valid = np.array(train_features), np.array(valid_features)

    # Using int8 for memory savings
    y_train = np.zeros((len(train_labels), num_words), dtype=np.int8)
    y_valid = np.zeros((len(valid_labels), num_words), dtype=np.int8)

    # One hot encoding of labels
    for example_index, word_index in enumerate(train_labels):
        y_train[example_index, word_index] = 1

    for example_index, word_index in enumerate(valid_labels):
        y_valid[example_index, word_index] = 1

    # Memory management
    import gc
    gc.enable()
    del features, labels, train_features, valid_features, train_labels, valid_labels
    gc.collect()

    return X_train, X_valid, y_train, y_valid



In [13]:
def make_sequences(texts, training_length = 50,
                   lower = True, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'):
    """Turn a set of texts into sequences of integers"""
    
    # Create the tokenizer object and train on texts
    tokenizer = Tokenizer(lower=lower, filters=filters)
    tokenizer.fit_on_texts(texts)
    
    # Create look-up dictionaries and reverse look-ups
    word_idx = tokenizer.word_index
    idx_word = tokenizer.index_word
    num_words = len(word_idx) + 1
    word_counts = tokenizer.word_counts
    
    print(f'There are {num_words} unique words.')
    
    # Convert text to sequences of integers
    sequences = tokenizer.texts_to_sequences(texts)
    
    # Limit to sequences with more than (training length + 20) tokens
    seq_lengths = [len(x) for x in sequences]
    over_idx = [i for i, l in enumerate(seq_lengths) if l > (training_length + 20)]
    
    new_texts = []
    new_sequences = []
    
    # Only keep sequences with more than (training length + 20) tokens
    for i in over_idx:
        new_texts.append(texts[i])
        new_sequences.append(sequences[i])
        
    features = []
    labels = []
    
    # Iterate through the sequences of tokens
    for seq in new_sequences:
        
        # Create multiple training examples from each sequence
        for i in range(training_length, len(seq)):
            # Extract the features and label
            extract = seq[i - training_length: i + 1]
            
            # Set the features and label
            features.append(extract[:-1])
            labels.append(extract[-1])
    
    print(f'There are {len(features)} sequences.')
    
    # Return everything needed for setting up the model
    return word_idx, idx_word, num_words, word_counts, new_texts, new_sequences, features, labels



In [14]:
def generate_output(model,
                    sequences,
                    idx_word,
                    seed_length=50,
                    new_words=50,
                    diversity=1,
                    return_output=False,
                    n_gen=1):
    """Generate `new_words` words of output from a trained model and format into HTML."""

    # Choose a random sequence
    seq = random.choice(sequences)

    # Choose a random starting point
    seed_idx = random.randint(0, len(seq) - seed_length - 10)
    # Ending index for seed
    end_idx = seed_idx + seed_length

    gen_list = []

    for n in range(n_gen):
        # Extract the seed sequence
        seed = seq[seed_idx:end_idx]
        original_sequence = [idx_word[i] for i in seed]
        generated = seed[:] + ['#']

        # Find the actual entire sequence
        actual = generated[:] + seq[end_idx:end_idx + new_words]

        # Keep adding new words
        for i in range(new_words):

            # Make a prediction from the seed
            preds = model.predict(np.array(seed).reshape(1, -1))[0].astype(
                np.float64)

            # Diversify
            preds = np.log(preds) / diversity
            exp_preds = np.exp(preds)

            # Softmax
            preds = exp_preds / sum(exp_preds)

            # Choose the next word
            probas = np.random.multinomial(1, preds, 1)[0]

            next_idx = np.argmax(probas)

            # New seed adds on old word
            #             seed = seed[1:] + [next_idx]
            seed += [next_idx]
            generated.append(next_idx)

        # Showing generated and actual abstract
        n = []

        for i in generated:
            n.append(idx_word.get(i, '< --- >'))

        gen_list.append(n)

    a = []

    for i in actual:
        a.append(idx_word.get(i, '< --- >'))

    a = a[seed_length:]

    gen_list = [gen[seed_length:seed_length + len(a)] for gen in gen_list]

    if return_output:
        return original_sequence, gen_list, a

    # HTML formatting
    seed_html = ''
    seed_html = addContent(seed_html, header(
        'Seed Sequence', color='darkblue'))
    seed_html = addContent(seed_html,
                           box(remove_spaces(' '.join(original_sequence))))

    gen_html = ''
    gen_html = addContent(gen_html, header('RNN Generated', color='darkred'))
    gen_html = addContent(gen_html, box(remove_spaces(' '.join(gen_list[0]))))

    a_html = ''
    a_html = addContent(a_html, header('Actual', color='darkgreen'))
    a_html = addContent(a_html, box(remove_spaces(' '.join(a))))

    return seed_html, gen_html, a_html



In [15]:
def header(text, color = 'black', gen_text = None):
    if gen_text:
        raw_html = f'<h1 style="color: {color};"><p><center>' + str(
        text) + '<span style="color: red">' + str(gen_text) + '</span></center></p></h1>'
    else:
        raw_html = f'<h1 style="color: {color};"><center>' + str(
            text) + '</center></h1>'
    return raw_html


def box(text, gen_text=None):
    if gen_text:
        raw_html = '<div style="border:1px inset black;padding:1em;font-size: 20px;"> <p>' + str(
            text) +'<span style="color: red">' + str(gen_text) + '</span></p></div>'

    else:
        raw_html = '<div style="border:1px inset black;padding:1em;font-size: 20px;">' + str(
            text) + '</div>'
    return raw_html


def addContent(old_html, raw_html):
    old_html += raw_html
    return old_html

def seed_sequence(model, s, word_idx, idx_word, 
                  diversity = 0.75, num_words = 50):
    """Generate output starting from a seed sequence."""
    # Original formatted text
    start = format_sequence(s).split()
    gen = []
    s = start[:]
    # Generate output
    for _ in range(num_words):
        # Convert to array
        x = np.array([word_idx.get(word, 0) for word in s]).reshape((1, -1))

        # Make predictions
        preds = model.predict(x)[0].astype(float)

        # Diversify
        preds = np.log(preds) / diversity
        exp_preds = np.exp(preds)
        # Softmax
        preds = exp_preds / np.sum(exp_preds)
        # Pick next index
        next_idx = np.argmax(np.random.multinomial(1, preds, size = 1))
        s.append(idx_word[next_idx])
        gen.append(idx_word[next_idx])
    
    # Formatting in html
    start = remove_spaces(' '.join(start)) + ' '
    gen = remove_spaces(' '.join(gen)) 
    html = ''
    html = addContent(html, header('Input Seed ', color = 'black', gen_text = 'Network Output'))
    html = addContent(html, box(start, gen))
    return html

def guess_human(model, sequences, idx_word, seed_length=50):
    """Produce 2 RNN sequences and play game to compare to actual.
       Diversity is randomly set between 0.5 and 1.25"""
    
    new_words = np.random.randint(10, 50)
    diversity = np.random.uniform(0.5, 1.25)
    sequence, gen_list, actual = generate_output(model, sequences, idx_word, seed_length, new_words,
                                                 diversity=diversity, return_output=True, n_gen = 2)
    gen_0, gen_1 = gen_list
    
    output = {'sequence': remove_spaces(' '.join(sequence)),
              'computer0': remove_spaces(' '.join(gen_0)),
              'computer1': remove_spaces(' '.join(gen_1)),
              'human': remove_spaces(' '.join(actual))}
    
    print(f"Seed Sequence: {output['sequence']}\n")
    
    choices = ['human', 'computer0', 'computer1']
          
    selected = []
    i = 0
    while len(selected) < 3:
        choice = random.choice(choices)
        selected.append(choice)
        print(f'\nOption {i + 1} {output[choice]}')
        choices.remove(selected[-1])
        i += 1
    
    print('\n')
    guess = int(input('Enter option you think is human (1-3): ')) - 1
    print('\n')
    
    if guess == np.where(np.array(selected) == 'human')[0][0]:
        print('*' * 3 + 'Correct' + '*' * 3 + '\n')
        print('-' * 60)
        print('Ordering: ', selected)
    else:
        print('*' * 3 + 'Incorrect' + '*' * 3 + '\n')
        print('-' * 60)
        print('Correct Ordering: ', selected)
          
    print('Diversity', round(diversity, 2))
    
def make_sequences_new(texts,
                   training_length=50,
                   lower=True,
                   filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'):
    """Turn a set of texts into sequences of integers"""

    # Create the tokenizer object and train on texts
    tokenizer = Tokenizer(lower=lower, filters=filters)
    tokenizer.fit_on_texts(texts)

    # Convert text to sequences of integers
    sequences = tokenizer.texts_to_sequences(texts)

    # Limit to sequences with more than (training length + 20) tokens
    seq_lengths = [len(x) for x in sequences]
    over_idx = [
        i for i, l in enumerate(seq_lengths) if l > (training_length + 20)
    ]

    new_texts = []

    # Only keep texts with more than (training length + 20) tokens
    for i in over_idx:
        new_texts.append(texts[i])
    
    tokenizer = Tokenizer(lower=lower, filters=filters)
    # Refit on long texts
    tokenizer.fit_on_texts(new_texts)
    new_sequences = tokenizer.texts_to_sequences(new_texts)
    
    # Create look-up dictionaries and reverse look-ups
    word_idx = tokenizer.word_index
    idx_word = tokenizer.index_word
    num_words = len(word_idx) + 1
    word_counts = tokenizer.word_counts

    print(f'There are {num_words} unique words.')

    features = []
    labels = []

    # Iterate through the sequences of tokens
    for seq in new_sequences:

        # Create multiple training examples from each sequence
        for i in range(training_length, len(seq)):
            # Extract the features and label
            extract = seq[i - training_length:i + 1]

            # Set the features and label
            features.append(extract[:-1])
            labels.append(extract[-1])

    print(f'There are {len(features)} training sequences.')

    # Return everything needed for setting up the model
    return word_idx, idx_word, num_words, word_counts, new_texts, new_sequences, features, labels

# Fetch Training Data

The raw data for this project comes from USPTO PatentsView, where you can search for information on any patent applied for in the United States.

https://www.patentsview.org/querydev/

* Uses the patent abstracts returned by a patent search for "neural network"
* 3000+ patents total


In [19]:
data = pd.read_csv('neural_network_patent_query.txt')
data.head()

| Unnamed: 0 | patent_abstract | patent_date | patent_number | patent_title |
|---|---|---|---|---|
| 0 | " A ""Barometer"" Neuron enhances stability in... | 1996-07-09 | 5535303 | """Barometer"" neuron for a neural network" |
| 1 | " This invention is a novel high-speed neural ... | 1993-10-19 | 5255349 | "Electronic neural network for solving ""trave... |
| 2 | An optical information processor for use as a ... | 1995-01-17 | 5383042 | 3 layer liquid crystal neural network with out... |
| 3 | A method and system for intelligent control of... | 2001-01-02 | 6169981 | 3-brain architecture for an intelligent decisi... |
| 4 | A method and system for intelligent control of... | 2003-06-17 | 6581048 | 3-brain architecture for an intelligent decisi... |


In [20]:
training_dict, word_idx, idx_word, sequences = get_data('neural_network_patent_query.txt', training_len = 50)

There are 16192 unique words.
There are 318563 sequences.


In [28]:
len(sequences)

3255

* Sequences of text are represented as integers
    * `word_idx` maps words to integers
    * `idx_word` maps integers to words
* Features are integer sequences of length 50
* Label is next word in sequence
* Labels are one-hot encoded
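
To make the one-hot label format concrete, here is a small NumPy sketch. The vocabulary size and label indexes are toy values chosen for illustration; the `int8` dtype follows the memory-saving choice made in `create_train_valid`.

```python
import numpy as np

num_words = 6        # toy vocabulary size, including the unused index 0
labels = [2, 5, 1]   # next-word index for three toy examples

# One row per example, one column per word; a single 1 marks the label
y = np.zeros((len(labels), num_words), dtype=np.int8)
y[np.arange(len(labels)), labels] = 1
print(y)

# np.argmax recovers the integer label from each one-hot row
recovered = np.argmax(y, axis=1)
```

With a vocabulary in the tens of thousands, each row is almost entirely zeros, which is why the label arrays print as rows of `0, 0, 0, ...`.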

In [21]:
training_dict['X_train'][:2]
training_dict['y_train'][:2]

array([[  117,     7,   141,   277,     4,    18,    81,   110,    10,
          219,    29,     1,   952,  2453,    19,     5,     6,     1,
          117,    10,   182,  2166,    21,     1,    81,   178,     4,
           13,   117,   894,    14,  6163,     7,   302,     1,     9,
            8,    29,    33,    23,    74,   428,     7,   692,     1,
           81,   183,     4,    13,   117],
       [    6,    41,     2,    87,     3,  1340,    79,     7,     1,
          409,   543,    22,   484,     6,     2,  2113,   728,    24,
            1,   178,     3,     1,  1820,    55,    14, 13942,  7240,
          244,     5,    14, 13943,  7240,   244,     5,     2,  2113,
         7240,   244,     5,     2,    38,  9292,   244,     2,    49,
         9292,   244,    14,    22, 13944]])

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int8)

The previous step converts all the abstracts to sequences of integers. The next step is to create a supervised machine learning problem with which to train the network.

Give the network a sequence of words and train it to predict the next word.

The number of words is left as a parameter; we’ll use 50 for the examples shown here which means we give our network 50 words and train it to predict the 51st.

We use the first 50 words as features with the 51st as the label, then use words 2–51 as features to predict the 52nd, and so on. This gives us significantly more training data, which is beneficial because the network generally performs better the more data it sees during training.
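
The sliding window can be seen on a toy sequence. The sketch below uses the same slicing as `make_sequences`, but with a window of 3 instead of 50 so the result is easy to read; `make_windows` and `toy_seq` are names introduced only for this example.

```python
def make_windows(seq, training_length):
    """Slide a window over `seq`, yielding (features, label) pairs."""
    features, labels = [], []
    for i in range(training_length, len(seq)):
        # training_length tokens of context plus the token that follows them
        window = seq[i - training_length:i + 1]
        features.append(window[:-1])
        labels.append(window[-1])
    return features, labels


toy_seq = [10, 11, 12, 13, 14, 15]
features, labels = make_windows(toy_seq, training_length=3)

for f, l in zip(features, labels):
    print(f, '->', l)
```

A sequence of length `n` yields `n - training_length` examples, so a single long abstract contributes many training examples.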

In [29]:
for i, sequence in enumerate(training_dict['X_train'][:2]):
    text = []
    for idx in sequence:
        text.append(idx_word[idx])
        
    print('Features: ' + ' '.join(text) + '\n')
    print('Label: ' + idx_word[np.argmax(training_dict['y_train'][i])] + '\n')
    

Features: user to provide samples . A recognition operation is performed on the user's handwritten input , and the user is not satisfied with the recognition result . The user selects an option to train the neural network on one or more characters to improve the recognition results . The user

Label: is

Features: and includes a number of amplifiers corresponding to the N bit output sum and a carry generation from the result of the adding process an augend input-synapse group , an addend input-synapse group , a carry input-synapse group , a first bias-synapse group a second bias-synapse group an output feedback-synapse

Label: group



In [30]:
training_dict['X_train'].shape

(222994, 50)

The training features have shape (222994, 50), which means we have roughly 223,000 training sequences, each with 50 tokens (70% of the 318,563 sequences; the remainder is held out for validation). In the language of recurrent neural networks, each sequence has 50 timesteps, each with 1 feature.
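
Each timestep starts out as a single integer, and an embedding layer then replaces that integer with a vector. The NumPy sketch below imitates that lookup with a random matrix to show how the shapes change; the sizes are toy values, and plain array indexing stands in for the Keras `Embedding` layer.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 20, 4   # toy sizes, much smaller than the notebook's
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

# Two toy sequences of five word indexes: shape (samples, timesteps)
X = np.array([[3, 7, 7, 1, 0],
              [5, 2, 9, 9, 4]])

# Indexing with an integer array looks up one row per word index
embedded = embedding_matrix[X]
print(X.shape, '->', embedded.shape)
```

The same word index always maps to the same vector, so the two `7`s in the first sequence receive identical embeddings.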

# Make Recurrent Neural Network

* Embedding dimension = 100
* 64 LSTM cells in one layer
    * Dropout and recurrent dropout for regularization
* Fully connected layer with 64 units on top of LSTM
     * 'relu' activation
* Dropout for regularization
* Output layer produces prediction for each word
    * 'softmax' activation
* Adam optimizer with defaults
* Categorical cross entropy loss
* Monitor accuracy

We are using the Keras Sequential API which means we build the network up one layer at a time. The layers are as follows:
1. An Embedding layer, which maps each input word to a 100-dimensional vector. The embedding can use pre-trained weights, which we supply in the `weights` parameter; `trainable` can be set to `False` if we don’t want to update the embeddings during training.
2. A Masking layer to mask any words that do not have a pre-trained embedding, which are represented as all zeros. This layer should not be used when training the embeddings (the model built below trains its embeddings, so it omits this layer).
3. The heart of the network: a layer of LSTM cells with dropout to prevent overfitting. Since we are using only one LSTM layer, it does not return sequences; when stacking two or more LSTM layers, every layer except the last must return sequences.
4. A fully-connected Dense layer with relu activation. This adds additional representational capacity to the network.
5. A Dropout layer to prevent overfitting to the training data.
6. A Dense fully-connected output layer. This produces a probability for every word in the vocab using softmax activation.
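
A quick back-of-the-envelope count shows where the parameters in such a network sit. The arithmetic below assumes the vocabulary size of 16192 reported earlier in this notebook, a standard LSTM with bias terms, and no Masking layer; `model.summary()` remains the authoritative check.

```python
vocab_size = 16192   # vocabulary size reported earlier in this notebook
embed_dim = 100
lstm_units = 64
dense_units = 64

# One embed_dim-sized vector per word
embedding_params = vocab_size * embed_dim

# Four gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Weights plus biases for the hidden and output Dense layers
dense_params = lstm_units * dense_units + dense_units
output_params = dense_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params + output_params
print(f'Embedding: {embedding_params:,}')
print(f'LSTM:      {lstm_params:,}')
print(f'Dense:     {dense_params:,}')
print(f'Output:    {output_params:,}')
print(f'Total:     {total_params:,}')
```

Under these assumptions, the embedding and output layers dominate because both scale with the vocabulary size, while the LSTM itself is comparatively small.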

In [37]:
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense, Dropout, Embedding, Masking, Bidirectional
from keras.optimizers import Adam

from keras.utils import plot_model

In [None]:
model = Sequential()

# Embedding layer
model.add(
    Embedding(
        input_dim=len(word_idx) + 1,
        output_dim=100,
        weights=None,
        trainable=True))

# Recurrent layer
model.add(LSTM(64, return_sequences=False, dropout=0.1))

# Fully connected layer
model.add(Dense(64, activation='relu'))

# Dropout for regularization
model.add(Dropout(0.5))

# Output layer
model.add(Dense(len(word_idx) + 1, activation='softmax'))

# Compile the model
model.compile(
    optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()

## Load in Pre-Trained Model

Rather than waiting several hours to train the model, we can load in a model trained for 150 epochs. We'll demonstrate how to train this model for another 5 epochs which shouldn't take too long depending on your hardware.

In [10]:
from keras.models import load_model

# Load in model and demonstrate training
model = load_model('train-embeddings-rnn.h5')
h = model.fit(training_dict['X_train'], training_dict['y_train'], epochs = 5, batch_size = 2048, 
          validation_data = (training_dict['X_valid'], training_dict['y_valid']), 
          verbose = 1)

Train on 222994 samples, validate on 95569 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [11]:
# Reload the saved model from disk; this replaces the `model` variable fine-tuned above
model = load_model('../models/train-embeddings-rnn.h5')
print('Model Performance: Log Loss and Accuracy on training data')
model.evaluate(training_dict['X_train'], training_dict['y_train'], batch_size = 2048)

print('\nModel Performance: Log Loss and Accuracy on validation data')
model.evaluate(training_dict['X_valid'], training_dict['y_valid'], batch_size = 2048)

Model Performance: Log Loss and Accuracy on training data


[3.282083551313851, 0.33844408377189383]


Model Performance: Log Loss and Accuracy on validation data


[4.737925765920241, 0.2671891513580688]

The model overfits the training data somewhat: the log loss rises from about 3.28 on the training data to about 4.74 on the validation data, and accuracy falls from about 34% to about 27%. Using regularization both in the LSTM layer and after the fully connected layer helps to combat overfitting.
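
One way to read the log-loss numbers reported above is to convert them to perplexity, `exp(loss)`, which can be thought of as the effective number of words the model is choosing between at each step. The sketch below assumes the reported loss is the mean categorical cross-entropy measured in nats, and compares it with a uniform guess over the 16192-word vocabulary.

```python
import math

train_loss = 3.282083551313851   # training log loss from the evaluation above
valid_loss = 4.737925765920241   # validation log loss from the evaluation above
vocab_size = 16192

# Perplexity: the exponential of the mean cross-entropy
train_perplexity = math.exp(train_loss)
valid_perplexity = math.exp(valid_loss)

# Loss of a model that assigns equal probability to every word
uniform_loss = math.log(vocab_size)

print(f'Training perplexity:   {train_perplexity:.1f}')
print(f'Validation perplexity: {valid_perplexity:.1f}')
print(f'Uniform-guess loss:    {uniform_loss:.2f}')
```

Both losses are far below the uniform-guess loss, so the model has learned a good deal even on unseen data; the gap between the two perplexities is the overfitting.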

# Generate Output

We can use the fully trained model to generate output by starting it off with a seed sequence. 

In [13]:
for i in generate_output(model, sequences, idx_word, seed_length = 50, new_words = 30, diversity = 0.75):
    HTML(i)

In [15]:
for i in generate_output(model, sequences, idx_word, seed_length = 30, new_words = 30, diversity = 1.5):
    HTML(i)

With too high a diversity, the output is nearly random; with too low a diversity, the model can get stuck repeating loops of text.
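
The effect of diversity is easiest to see on a tiny distribution. The sketch below applies the same rescaling used in `generate_output` (divide the log probabilities by `diversity`, then renormalize) to three made-up probabilities; `apply_diversity` is a helper name introduced only for this example.

```python
import numpy as np


def apply_diversity(preds, diversity):
    """Rescale a probability distribution by a diversity (temperature) value."""
    preds = np.asarray(preds, dtype=np.float64)
    scaled = np.log(preds) / diversity
    exp_scaled = np.exp(scaled)
    # Renormalize so the probabilities sum to 1 again
    return exp_scaled / exp_scaled.sum()


preds = [0.6, 0.3, 0.1]   # made-up model output over three words
sharp = apply_diversity(preds, 0.5)
same = apply_diversity(preds, 1.0)
flat = apply_diversity(preds, 2.0)

for name, p in [('0.5', sharp), ('1.0', same), ('2.0', flat)]:
    print(f'diversity {name}:', np.round(p, 3))
```

A diversity below 1 concentrates probability on the most likely word, which is what makes repetitive loops more likely. A diversity above 1 flattens the distribution toward uniform, which is what makes the output look random.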

## Start the Network with Your Own Input

Here you can input your own starting sequence for the network. The network will produce `num_words` words of text.

In [16]:
s = 'This patent provides a basis for using a recurrent neural network to '
HTML(seed_sequence(model, s, word_idx, idx_word, diversity = 0.75, num_words = 20))

In [17]:
s = 'The cell state is passed along from one time step to another allowing the '
HTML(seed_sequence(model, s, word_idx, idx_word, diversity = 0.75, num_words = 20))

# Inspect Embeddings

As a final piece of model inspection, we can look at the embeddings and find the words closest to a query word in the embedding space. This gives us an idea of what the network has learned.

In [21]:
embeddings = get_embeddings(model)
embeddings.shape

(16192, 100)

Each word in the vocabulary is now represented as a 100-dimensional vector. This could be reduced to 2 or 3 dimensions for visualization. It can also be used to find the closest word to a query word.

In [22]:
find_closest('network', embeddings, word_idx, idx_word)

Query: network

Word: network         Cosine Similarity: 1.0
Word: channel         Cosine Similarity: 0.7754999995231628
Word: networks        Cosine Similarity: 0.7745000123977661
Word: system          Cosine Similarity: 0.7559999823570251
Word: program         Cosine Similarity: 0.7541999816894531
Word: cable           Cosine Similarity: 0.7419999837875366
Word: now             Cosine Similarity: 0.7297999858856201
Word: programming     Cosine Similarity: 0.7179999947547913
Word: web             Cosine Similarity: 0.7138000130653381
Word: line            Cosine Similarity: 0.6915000081062317


A word should have a cosine similarity of 1.0 with itself! The embeddings are learned for a task, so the nearest words may only make sense in the context of the patents on which we trained the network.
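
The reason a plain dot product gives cosine similarity here is that `get_embeddings` scales every row to unit length first. The sketch below checks this on three hand-picked 2-dimensional vectors, which are toy values rather than learned embeddings.

```python
import numpy as np

vectors = np.array([[3.0, 4.0],    # the query
                    [6.0, 8.0],    # same direction, twice the length
                    [-4.0, 3.0]])  # perpendicular to the query

# Scale every row to unit length, as get_embeddings does
unit = vectors / np.linalg.norm(vectors, axis=1).reshape((-1, 1))

# Dot product of unit vectors equals the cosine of the angle between them
similarities = unit @ unit[0]
print(similarities)
```

The query has similarity 1 with itself and with any vector pointing the same way regardless of length, and similarity 0 with a perpendicular vector.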

In [23]:
find_closest('data', embeddings, word_idx, idx_word)

Query: data

Word: data            Cosine Similarity: 1.0
Word: information     Cosine Similarity: 0.8185999989509583
Word: numbers         Cosine Similarity: 0.683899998664856
Word: database        Cosine Similarity: 0.6776000261306763
Word: account         Cosine Similarity: 0.6575999855995178
Word: report          Cosine Similarity: 0.6575999855995178
Word: signals         Cosine Similarity: 0.6399999856948853
Word: system          Cosine Similarity: 0.6377000212669373
Word: statistics      Cosine Similarity: 0.6371999979019165
Word: web             Cosine Similarity: 0.6359000205993652


It seems the network has learned some basic relationships between words! 