# nnlm-en-dim128 + Nearest Neighbour Search using SimpleNeighbors[annoy]

TensorFlow 2.0

### Data
The data used in this tutorial comes from the training set of the [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data) on Kaggle.

### Pretrained model
Download and unzip the model [nnlm-en-dim128](https://tfhub.dev/google/nnlm-en-dim128/2) to a convenient location.

In this notebook the model is saved at `/Users/abhay.shukla/nnlm/2`, so `module_url` is set to that directory. Change it to match your own location.

### Import libraries to use

In [1]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import os

In [2]:
import bokeh
import bokeh.io
import bokeh.models
import bokeh.palettes
import bokeh.plotting
from tensorflow_text import SentencepieceTokenizer
import sklearn.metrics.pairwise
from simpleneighbors import SimpleNeighbors
from tqdm import tqdm
from tqdm import trange

### Load pretrained model and define function which returns the embedding

In [3]:
module_url = '/Users/abhay.shukla/nnlm/2'
model = hub.load(module_url)

def embed_text(texts):
    """Return one 128-dimensional embedding per input string."""
    return model(texts)

### Test model and visualize embedding similarity on a few manually selected sentences

In [4]:
# Multilingual example
multilingual_example = ["Willkommen zu einfachen, aber", "verrassend krachtige", "multilingüe", "compréhension du langage naturel", "модели.", "大家是什么意思" , "보다 중요한", ".اللغة التي يتحدثونها"]
multilingual_example_in_en =  ["Welcome to simple yet", "surprisingly powerful", "multilingual", "natural language understanding", "models.", "What people mean", "matters more than", "the language they speak."]


In [5]:
multilingual_result = embed_text(multilingual_example)
multilingual_in_en_result = embed_text(multilingual_example_in_en)

In [6]:
def visualize_similarity(embeddings_1, embeddings_2, labels_1, labels_2, plot_title, plot_width=1200, plot_height=600,
                         xaxis_font_size='12pt', yaxis_font_size='12pt'):

    assert len(embeddings_1) == len(labels_1)
    assert len(embeddings_2) == len(labels_2)

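    # arccos based text similarity (Yang et al. 2019; Cer et al. 2019)
    cos = sklearn.metrics.pairwise.cosine_similarity(embeddings_1, embeddings_2)
    # Clip to [-1, 1]: rounding can push values just outside the arccos domain and produce NaN
    sim = 1 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi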

    embeddings_1_col, embeddings_2_col, sim_col = [], [], []
    for i in range(len(embeddings_1)):
        for j in range(len(embeddings_2)):
            embeddings_1_col.append(labels_1[i])
            embeddings_2_col.append(labels_2[j])
            sim_col.append(sim[i][j])
    df = pd.DataFrame(zip(embeddings_1_col, embeddings_2_col, sim_col), columns=['embeddings_1', 'embeddings_2', 'sim'])

    mapper = bokeh.models.LinearColorMapper( palette=[*reversed(bokeh.palettes.YlOrRd[9])], low=df.sim.min(),
      high=df.sim.max())

    p = bokeh.plotting.figure(title=plot_title, x_range=labels_1,
                            x_axis_location="above",
                            y_range=[*reversed(labels_2)],
                            plot_width=plot_width, plot_height=plot_height,
                            tools="save",toolbar_location='below', 
                              tooltips=[('pair', '@embeddings_1 ||| @embeddings_2'), ('sim', '@sim')])
    p.rect(x="embeddings_1", y="embeddings_2", width=1, height=1, source=df, 
           fill_color={'field': 'sim', 'transform': mapper}, line_color=None)

    p.title.text_font_size = '12pt'
    p.axis.axis_line_color = None
    p.axis.major_tick_line_color = None
    p.axis.major_label_standoff = 16
    p.xaxis.major_label_text_font_size = xaxis_font_size
    p.xaxis.major_label_orientation = 0.25 * np.pi
    p.yaxis.major_label_text_font_size = yaxis_font_size
    p.min_border_right = 300

    bokeh.io.output_notebook()
    bokeh.io.show(p)


In [7]:
# nnlm-en-dim128 is trained on English text, so treat the scores for non-English phrases with caution
visualize_similarity(multilingual_in_en_result, multilingual_result,
                     multilingual_example_in_en, multilingual_example,
                     "nnlm-en-dim128 similarity: English vs. multilingual phrases")
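
The heatmap colours come from the arccos-based similarity computed inside `visualize_similarity`. The sketch below reproduces that formula on toy vectors using only the standard library. The function name `angular_similarity` and the example vectors are illustrative and not part of the notebook. The cosine is clamped before `acos` because floating-point rounding can push it slightly outside [-1, 1].

```python
import math

def angular_similarity(u, v):
    """Return 1 - arccos(cosine(u, v)) / pi, a value in [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp to the valid acos domain to avoid a ValueError from rounding drift.
    cosine = max(-1.0, min(1.0, cosine))
    return 1.0 - math.acos(cosine) / math.pi

same = angular_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = angular_similarity([1.0, 0.0], [0.0, 1.0])      # 90 degrees apart
opposite = angular_similarity([1.0, 0.0], [-1.0, 0.0])       # 180 degrees apart
print(same, orthogonal, opposite)
```

Parallel vectors score close to 1, orthogonal vectors 0.5, and opposite vectors 0, so the measure is linear in the angle between the two embeddings.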


### Load sample of your data
When the data file is large, pandas lets you
- load the data in chunks (`chunksize`), and
- load only the columns you are going to use (`usecols`).

Both options are shown below.

In [8]:
train = pd.read_csv('../word2vec/data/train.csv.zip', header=0, chunksize=10, usecols=['comment_text'])

In [9]:
for batch in train:
    print(batch)
    b = batch
    break

                                        comment_text
0  Explanation\nWhy the edits made under my usern...
1  D'aww! He matches this background colour I'm s...
2  Hey man, I'm really not trying to edit war. It...
3  "\nMore\nI can't make any real suggestions on ...
4  You, sir, are my hero. Any chance you remember...
5  "\n\nCongratulations from me as well, use the ...
6       COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK
7  Your vandalism to the Matt Shirvington article...
8  Sorry if the word 'nonsense' was offensive to ...
9  alignment on this subject and which are contra...


In [10]:
[x[0] for x in b[['comment_text']].values]

["Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",
 "D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)",
 "Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",
 '"\nMore\nI can\'t make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of ""types of accidents""  -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on r

### Embed predetermined number of data points using pretrained NNLM EN Dim128
The objective here is to embed 1280 examples (an arbitrary number chosen for the sake of demonstration). <br>
They are read in 80 chunks of 16 rows each (80 * 16 = 1280) to avoid out-of-memory errors.

We use tqdm to show a progress bar. (niceties of the world!)

In [11]:
comment_to_embeddings = []
comment_text = []
chunks = 80
batch_size = 16
with tqdm(total=chunks*batch_size) as pbar:
    for idx, batch in enumerate(pd.read_csv('../word2vec/data/train.csv.zip',header=0, chunksize=batch_size, usecols=['comment_text'])):
        texts = batch['comment_text'].tolist()
        comment_text.extend(texts)
        # Embed exactly the rows of this batch; a [-batch_size:] slice would be wrong for a short final batch
        comment_to_embeddings.extend(embed_text(texts))
        pbar.update(len(batch))
        if idx + 1 == chunks:
            break

100%|██████████| 1280/1280 [00:01<00:00, 738.10it/s]
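
The loop above reads a fixed number of fixed-size batches and then stops. The same pattern is sketched below in isolation, with a plain list standing in for the CSV file. The helper names `batched` and `take_batches` are hypothetical and only illustrate the control flow, including the case where the final batch is shorter than `batch_size`.

```python
from itertools import islice

def batched(rows, batch_size):
    """Yield consecutive lists of at most batch_size rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def take_batches(rows, batch_size, max_batches):
    """Collect rows batch by batch, stopping after max_batches batches."""
    collected = []
    for idx, batch in enumerate(batched(rows, batch_size)):
        collected.extend(batch)  # process only the rows of this batch
        if idx + 1 == max_batches:
            break
    return collected

rows = [f"comment {i}" for i in range(50)]
print(len(take_batches(rows, batch_size=16, max_batches=2)))   # stops early
print(len(take_batches(rows, batch_size=16, max_batches=80)))  # input runs out first
```

Processing each batch directly, rather than slicing the tail of the accumulated list, keeps the texts and their embeddings aligned even when the last batch is short.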


In [12]:
assert len(comment_to_embeddings) == len(comment_text)

### Building Nearest Neighbour Search Model
- First we add (sentence text, sentence embedding) pairs to the index.
- Next we build the index with a predetermined number of trees (n).

In [13]:
embedding_dimensions = comment_to_embeddings[0].shape[0]
params = dict(n = 40, metric='dot', dims=embedding_dimensions)
comment_lookup = SimpleNeighbors(dims=params['dims'], metric=params['metric'])

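# Index every embedded comment. Looping over embedding_dimensions (128) would add only
# the first 128 comments, which is what the recorded progress bar below shows.
for i in trange(len(comment_text)):
    comment_lookup.add_one(comment_text[i], comment_to_embeddings[i])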

print('Building comment index with {} trees...'.format(params['n']))
comment_lookup.build(n=params['n'])

100%|██████████| 128/128 [00:13<00:00,  9.66it/s]

Building comment index with 40 trees...
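
Because the index is approximate, it can be useful to compare its results with an exact search on a small sample. The sketch below is a brute-force dot-product search in plain Python. The function `exact_nearest` and the toy data are illustrative assumptions and not part of SimpleNeighbors.

```python
import heapq

def exact_nearest(query, items, vectors, n=3):
    """Return the n items whose vectors have the largest dot product with query."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    # Score every item, then keep the n highest scores.
    scored = ((dot(query, vec), item) for item, vec in zip(items, vectors))
    return [item for _, item in heapq.nlargest(n, scored, key=lambda pair: pair[0])]

toy_items = ["a", "b", "c", "d"]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(exact_nearest([1.0, 0.0], toy_items, toy_vectors, n=2))
```

To try this on the notebook's data, each embedding would first need to be converted to a list of floats, for example with `.numpy().tolist()`. Note that a dot-product metric does not normalise vector length, so vectors with a large norm tend to rank highly and a query is not guaranteed to be its own top hit.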





### Looking at nearest neighbors

There are two options:
- Search by vector (`nearest`)
- Search by text (`neighbors`)

Both are demonstrated below. Note the different methods and arguments used for each option.

In [14]:
pick_comment = 18
print(comment_text[pick_comment])

The Mitsurugi point made no sense - why not argue to include Hindi on Ryo Sakazaki's page to include more information?


In [15]:
comment_lookup.nearest(vec=comment_to_embeddings[pick_comment], n=3)

['"\nFair use rationale for Image:Wonju.jpg\n\nThanks for uploading Image:Wonju.jpg. I notice the image page specifies that the image is being used under fair use but there is no explanation or rationale as to why its use in Wikipedia articles constitutes fair use. In addition to the boilerplate fair use template, you must also write out on the image description page a specific explanation or rationale for why using this image in each article is consistent with fair use.\n\nPlease go to the image description page and edit it to include a fair use rationale.\n\nIf you have uploaded other fair use media, consider checking that you have specified the fair use rationale on those pages too. You can find a list of \'image\' pages you have edited by clicking on the ""my contributions"" link (it is located at the very top of any Wikipedia page when you are logged in), and then selecting ""Image"" from the dropdown box. Note that any fair use images uploaded after 4 May, 2006, and lacking such 

In [16]:
picked_comment_text = comment_text[pick_comment]
print(picked_comment_text)
comment_lookup.neighbors(item=picked_comment_text, n=3)

The Mitsurugi point made no sense - why not argue to include Hindi on Ryo Sakazaki's page to include more information?


['"\nFair use rationale for Image:Wonju.jpg\n\nThanks for uploading Image:Wonju.jpg. I notice the image page specifies that the image is being used under fair use but there is no explanation or rationale as to why its use in Wikipedia articles constitutes fair use. In addition to the boilerplate fair use template, you must also write out on the image description page a specific explanation or rationale for why using this image in each article is consistent with fair use.\n\nPlease go to the image description page and edit it to include a fair use rationale.\n\nIf you have uploaded other fair use media, consider checking that you have specified the fair use rationale on those pages too. You can find a list of \'image\' pages you have edited by clicking on the ""my contributions"" link (it is located at the very top of any Wikipedia page when you are logged in), and then selecting ""Image"" from the dropdown box. Note that any fair use images uploaded after 4 May, 2006, and lacking such 

### Save the nearest neighbour search model

In [17]:
model_write_dir = 'lookup_annoy_model'
if not os.path.exists(model_write_dir):
    os.makedirs(model_write_dir)
    
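# Encode the index parameters in the file name so that different configurations do not overwrite each other.
# Note: SimpleNeighbors.save() treats the path as a prefix and appends its own suffixes to the files it writes.
model_name = 'lookup_annoy_toxic_word_model_{}.annoy'.format("_".join(sorted([p[0]+str(p[1]) for p in params.items()])))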
model_file_path = os.path.join(model_write_dir, model_name)
print(model_file_path)

lookup_annoy_model/lookup_annoy_toxic_word_model_dims128_metricdot_n40.annoy


In [18]:
comment_lookup.save(model_file_path)

### References: 
* [Cross-Lingual Similarity and Semantic Search Engine with TF-Hub Multilingual Universal Encoder](https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/cross_lingual_similarity_with_tf_hub_multilingual_universal_encoder.ipynb#scrollTo=yRoRT5qCEIYy)
* Pretrained model: [nnlm-en-dim128/2](https://tfhub.dev/google/nnlm-en-dim128/2)
* [Simple Neighbors API Reference](https://simpleneighbors.readthedocs.io/en/latest/simpleneighbors.html)