# Test

Windows adaptation of semantic search with approximate nearest neighbors (ANN) and text embeddings.

Source:
    https://www.tensorflow.org/hub/tutorials/tf2_semantic_approximate_nearest_neighbors

## Setup

Import the required libraries

In [1]:
import os
import sys
import pickle
from collections import namedtuple
from datetime import datetime
import numpy as np
import apache_beam as beam
from apache_beam.transforms import util
import tensorflow as tf
import tensorflow_hub as hub
import annoy
from sklearn.random_projection import gaussian_random_matrix

## 1. Download Sample Data

The [A Million News Headlines](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SYBGZL#) dataset contains news headlines published over a period of 15 years by the reputable Australian Broadcasting Corp. (ABC). It is a summarised historical record of noteworthy events around the globe from early 2003 to the end of 2017, with a more granular focus on Australia.

**Format**: Tab-separated two-column data: 1) publication date and 2) headline text. We are only interested in the headline text.
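
As a minimal sketch of the format described above, the snippet below reads the headline column from a small in-memory sample with the standard `csv` module. The sample rows, the `publish_date` column name, and the `read_headlines` helper are illustrative assumptions, not part of the dataset tooling.

```python
import csv
import io

# Hypothetical rows in the same two-column shape: date, then headline.
sample_tsv = (
    "publish_date\theadline_text\n"
    "20030219\taba decides against community broadcasting licence\n"
    "20030219\tact fire witnesses must be aware of defamation\n"
)

def read_headlines(tsv_file, skip_header=True):
    """Yield the headline column from a tab-separated file object."""
    reader = csv.reader(tsv_file, delimiter="\t")
    if skip_header:
        next(reader, None)  # discard the column-name row
    for row in reader:
        if len(row) >= 2:
            yield row[1].strip()

headlines = list(read_headlines(io.StringIO(sample_tsv)))
print(headlines)
```

Skipping the header row keeps the column name itself from being treated as a headline.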


In [2]:
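# Make sure the output folder exists; open() does not create it.
os.makedirs('corpus', exist_ok=True)
# Read and write UTF-8 explicitly; the default text encoding on Windows depends on the locale.
with open('corpus/text.txt', 'w', encoding='utf-8') as out_file:
  with open('raw.tsv', 'r', encoding='utf-8') as in_file:
    for line in in_file:
      # Keep only the second column (the headline) and drop surrounding quotes.
      headline = line.split('\t')[1].strip().strip('"')
      out_file.write(headline + "\n")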

## 2. Generate Embeddings for the Data.

In this tutorial, we use the [Neural Network Language Model (NNLM)](https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1) to generate embeddings for the headline data. The sentence embeddings can then be used to compute sentence-level semantic similarity. We run the embedding generation process using Apache Beam.

### Embedding extraction method

In [3]:
embed_fn = None

def generate_embeddings(text, module_url, random_projection_matrix=None):
  # Beam will run this function in different processes that need to
  # import hub and load embed_fn (if not previously loaded)
  global embed_fn
  if embed_fn is None:
    embed_fn = hub.load(module_url)
  embedding = embed_fn(text).numpy()
  if random_projection_matrix is not None:
    embedding = embedding.dot(random_projection_matrix)
  return text, embedding


### Convert to tf.Example method

In [4]:
def to_tf_example(entries):
  examples = []

  text_list, embedding_list = entries
  for i in range(len(text_list)):
    text = text_list[i]
    embedding = embedding_list[i]

    features = {
        'text': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[text.encode('utf-8')])),
        'embedding': tf.train.Feature(
            float_list=tf.train.FloatList(value=embedding.tolist()))
    }
  
    example = tf.train.Example(
        features=tf.train.Features(
            feature=features)).SerializeToString(deterministic=True)
  
    examples.append(example)
  
  return examples

### Beam pipeline

In [5]:
def run_hub2emb(args):
  '''Runs the embedding generation pipeline'''

  options = beam.options.pipeline_options.PipelineOptions(**args)
  args = namedtuple("options", args.keys())(*args.values())

  with beam.Pipeline(args.runner, options=options) as pipeline:
    (
        pipeline
        | 'Read sentences from files' >> beam.io.ReadFromText(
            file_pattern=args.data_dir)
        | 'Batch elements' >> util.BatchElements(
            min_batch_size=args.batch_size, max_batch_size=args.batch_size)
        | 'Generate embeddings' >> beam.Map(
            generate_embeddings, args.module_url, args.random_projection_matrix)
        | 'Encode to tf example' >> beam.FlatMap(to_tf_example)
        | 'Write to TFRecords files' >> beam.io.WriteToTFRecord(
            file_path_prefix='{}/emb'.format(args.output_dir),
            file_name_suffix='.tfrecords')
    )
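
The data flow of the pipeline above can be sketched in plain Python: elements are grouped into batches, each batch is mapped to one `(texts, embeddings)` tuple, and the per-text records are then flattened back into a single stream. This is only a stand-in for the Beam transforms; `batch_elements`, `fake_embed`, and `to_records` are hypothetical helpers, and the "embedding" is a placeholder value.

```python
from itertools import islice

def batch_elements(items, batch_size):
    """Group an iterable into lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def fake_embed(batch):
    """Stand-in for generate_embeddings: one placeholder vector per text."""
    return batch, [[float(len(text))] for text in batch]

def to_records(entries):
    """Stand-in for to_tf_example: one output record per input text."""
    texts, vectors = entries
    return list(zip(texts, vectors))

sentences = ["a b", "c", "d e f", "g h", "i"]
records = []
for batch in batch_elements(sentences, batch_size=2):
    embedded = fake_embed(batch)          # like beam.Map: one batch in, one tuple out
    records.extend(to_records(embedded))  # like beam.FlatMap: a list out, flattened

print(len(records))
```

The last step has to flatten its output because `to_tf_example` returns a list of examples per batch, while the TFRecord writer expects one record per element.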

### Generating Random Projection Weight Matrix

[Random projection](https://en.wikipedia.org/wiki/Random_projection) is a simple yet powerful technique for reducing the dimensionality of a set of points that lie in Euclidean space. For a theoretical background, see the [Johnson-Lindenstrauss lemma](https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma).

Reducing the dimensionality of the embeddings with random projection reduces the time needed to build and query the ANN index.

In this tutorial we use [Gaussian Random Projection](https://en.wikipedia.org/wiki/Random_projection#Gaussian_random_projection) from the [Scikit-learn](https://scikit-learn.org/stable/modules/random_projection.html#gaussian-random-projection) library.
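
To illustrate the idea, here is a minimal NumPy sketch of a Gaussian random projection. It assumes the matrix entries are drawn from N(0, 1/projected_dim), which is the scaling Scikit-learn documents for its Gaussian projection. The two input vectors are random stand-ins for sentence embeddings, not real model output.

```python
import numpy as np

rng = np.random.default_rng(0)
original_dim, projected_dim = 128, 64

# Gaussian projection matrix with entries drawn from N(0, 1/projected_dim),
# shaped (original_dim, projected_dim) so that `points.dot(projection)` works.
projection = rng.normal(
    loc=0.0, scale=1.0 / np.sqrt(projected_dim),
    size=(original_dim, projected_dim))

# Hypothetical stand-ins for two sentence embeddings.
points = rng.normal(size=(2, original_dim))
projected = points.dot(projection)

# Compare the pairwise distance before and after the projection.
original_distance = np.linalg.norm(points[0] - points[1])
projected_distance = np.linalg.norm(projected[0] - projected[1])
distance_ratio = projected_distance / original_distance

print(projected.shape)
print(round(float(distance_ratio), 3))
```

The ratio should stay reasonably close to 1, which is the sense in which the projection approximately preserves distances.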

In [6]:
def generate_random_projection_weights(original_dim, projected_dim):
  random_projection_matrix = None
  random_projection_matrix = gaussian_random_matrix(
      n_components=projected_dim, n_features=original_dim).T
  print("A Gaussian random weight matrix was creates with shape of {}".format(random_projection_matrix.shape))
  print('Storing random projection matrix to disk...')
  with open('random_projection_matrix', 'wb') as handle:
    pickle.dump(random_projection_matrix, 
                handle, protocol=pickle.HIGHEST_PROTOCOL)
        
  return random_projection_matrix

### Set parameters
If you want to build an index using the original embedding space without random projection, set the `projected_dim` parameter to `None`. Note that this will slow down the indexing step for high-dimensional embeddings.

In [7]:
module_url = 'https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1' #@param {type:"string"}
projected_dim = 64  #@param {type:"number"}

### Run pipeline

In [8]:
import tempfile

output_dir = tempfile.mkdtemp()
original_dim = hub.load(module_url)(['']).shape[1]
random_projection_matrix = None

if projected_dim:
  random_projection_matrix = generate_random_projection_weights(
      original_dim, projected_dim)

args = {
    'job_name': 'hub2emb-{}'.format(datetime.utcnow().strftime('%y%m%d-%H%M%S')),
    'runner': 'DirectRunner',
    'batch_size': 1024,
    'data_dir': os.path.join('corpus', '*.txt'),
    'output_dir': output_dir,
    'module_url': module_url,
    'random_projection_matrix': random_projection_matrix,
}

print("Pipeline args are set.")
args

A Gaussian random weight matrix was creates with shape of (128, 64)
Storing random projection matrix to disk...
Pipeline args are set.




{'job_name': 'hub2emb-200910-031802',
 'runner': 'DirectRunner',
 'batch_size': 1024,
 'data_dir': 'corpus\\*.txt',
 'output_dir': 'C:\\Users\\SANDER~1\\AppData\\Local\\Temp\\tmp2b7g77zp',
 'module_url': 'https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1',
 'random_projection_matrix': array([[ 0.14556211,  0.16797908,  0.19327695, ..., -0.10546049,
          0.04348957, -0.20288548],
        [ 0.08556691,  0.04585976, -0.06682875, ...,  0.08884578,
          0.07781133,  0.16367863],
        [-0.26843042, -0.20389443,  0.09223419, ...,  0.11595095,
         -0.06493047, -0.00851031],
        ...,
        [-0.03193794, -0.20488071, -0.11304449, ...,  0.45732803,
         -0.0279283 , -0.00914209],
        [ 0.07162892,  0.00403485, -0.07360446, ...,  0.04260638,
         -0.11348477,  0.11642548],
        [ 0.15406356,  0.18406368, -0.09435445, ..., -0.21130506,
         -0.02833571, -0.03901133]])}

In [9]:
print("Running pipeline...")
%time run_hub2emb(args)
print("Pipeline is done.")

Running pipeline...
Wall time: 17min 20s
Pipeline is done.


In [10]:
%ls {output_dir}

 Volume in drive C is Blade
 Volume Serial Number is BEC2-DD53

 Directory of C:\Users\SANDER~1\AppData\Local\Temp\tmp2b7g77zp

09/09/2020  08:35 PM    <DIR>          .
09/09/2020  08:35 PM    <DIR>          ..
09/09/2020  08:35 PM       388,699,429 emb-00000-of-00001.tfrecords
               1 File(s)    388,699,429 bytes
               2 Dir(s)  16,754,184,192 bytes free


The system cannot find the path specified.


Read some of the generated embeddings...

In [11]:
embed_file = os.path.join(output_dir, 'emb-00000-of-00001.tfrecords')
sample = 5

# Create a description of the features.
feature_description = {
    'text': tf.io.FixedLenFeature([], tf.string),
    'embedding': tf.io.FixedLenFeature([projected_dim], tf.float32)
}

def _parse_example(example):
  # Parse the input `tf.Example` proto using the dictionary above.
  return tf.io.parse_single_example(example, feature_description)

dataset = tf.data.TFRecordDataset(embed_file)
for record in dataset.take(sample).map(_parse_example):
  print("{}: {}".format(record['text'].numpy().decode('utf-8'), record['embedding'].numpy()[:10]))

headline_text: [-0.03489685 -0.09948387 -0.05628211 -0.02369605 -0.10834836  0.10760727
 -0.18177474  0.03484242 -0.20400605 -0.12302925]
aba decides against community broadcasting licence: [-0.03458044  0.13679439 -0.1517344   0.08366418  0.00353543 -0.01145421
  0.22175474  0.06245604 -0.0542197  -0.25999856]
act fire witnesses must be aware of defamation: [ 0.15315107  0.3754461  -0.18539304  0.2975436   0.01786217 -0.07028925
  0.21732916  0.2267158  -0.09780349 -0.16126879]
a g calls for infrastructure protection summit: [ 0.21369052  0.18222912 -0.07011453 -0.10433221  0.20290565  0.24954703
  0.05606724 -0.06699353 -0.1903006  -0.00890797]
air nz staff in aust strike for pay rise: [ 0.18976349  0.04245168 -0.1473818  -0.23294246  0.10474072  0.23743637
  0.29843155 -0.0130834  -0.36882558  0.08112953]


## 3. Build the ANN Index for the Embeddings

[ANNOY](https://github.com/spotify/annoy) (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory. It is built and used by [Spotify](https://www.spotify.com) for music recommendations.
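
For intuition about what the index approximates, the sketch below performs an exact, exhaustive nearest-neighbor search with NumPy. It uses the distance that Annoy documents for its `angular` metric, sqrt(2 * (1 - cos(u, v))). The helper names and the 2-D sample points are hypothetical. An ANN index trades a little accuracy to avoid this full scan.

```python
import numpy as np

def angular_distance(u, v):
    """sqrt(2 * (1 - cos(u, v))), the distance behind Annoy's 'angular' metric."""
    cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.sqrt(max(0.0, 2.0 * (1.0 - cosine))))

def exact_nearest(query, vectors, num_matches):
    """Return the ids of the num_matches closest vectors by exhaustive search."""
    distances = [angular_distance(query, vector) for vector in vectors]
    return [int(i) for i in np.argsort(distances)[:num_matches]]

# Hypothetical 2-D points; real embeddings would have projected_dim dimensions.
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
nearest_ids = exact_nearest(np.array([1.0, 0.1]), vectors, num_matches=2)
print(nearest_ids)
```

The exhaustive scan costs one distance computation per stored item for every query, which is what makes a tree-based index attractive at the scale of a million headlines.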

In [12]:
def build_index(embedding_files_pattern, index_filename, vector_length, 
    metric='angular', num_trees=100):
  '''Builds an ANNOY index'''

  annoy_index = annoy.AnnoyIndex(vector_length, metric=metric)
  # Mapping between the item and its identifier in the index
  mapping = {}

  embed_files = tf.io.gfile.glob(embedding_files_pattern)
  num_files = len(embed_files)
  print('Found {} embedding file(s).'.format(num_files))

  item_counter = 0
  for i, embed_file in enumerate(embed_files):
    print('Loading embeddings in file {} of {}...'.format(i+1, num_files))
    dataset = tf.data.TFRecordDataset(embed_file)
    for record in dataset.map(_parse_example):
      text = record['text'].numpy().decode("utf-8")
      embedding = record['embedding'].numpy()
      mapping[item_counter] = text
      annoy_index.add_item(item_counter, embedding)
      item_counter += 1
      if item_counter % 100000 == 0:
        print('{} items loaded to the index'.format(item_counter))

  print('A total of {} items added to the index'.format(item_counter))

  print('Building the index with {} trees...'.format(num_trees))
  annoy_index.build(n_trees=num_trees)
  print('Index is successfully built.')
  
  print('Saving index to disk...')
  annoy_index.save(index_filename)
  print('Index is saved to disk.')
  print("Index file size: {} GB".format(
    round(os.path.getsize(index_filename) / float(1024 ** 3), 2)))
  annoy_index.unload()

  print('Saving mapping to disk...')
  with open(index_filename + '.mapping', 'wb') as handle:
    pickle.dump(mapping, handle, protocol=pickle.HIGHEST_PROTOCOL)
  print('Mapping is saved to disk.')
  print("Mapping file size: {} MB".format(
    round(os.path.getsize(index_filename + '.mapping') / float(1024 ** 2), 2)))

In [13]:
embedding_files = "{}/emb-*.tfrecords".format(output_dir)
embedding_dimension = projected_dim
index_filename = "index"

# `rm` is not available in the Windows shell, so remove stale files from Python.
for stale_file in (index_filename, index_filename + '.mapping'):
  if os.path.exists(stale_file):
    os.remove(stale_file)

%time build_index(embedding_files, index_filename, embedding_dimension)


Found 1 embedding file(s).
Loading embeddings in file 1 of 1...
100000 items loaded to the index
200000 items loaded to the index
300000 items loaded to the index
400000 items loaded to the index
500000 items loaded to the index
600000 items loaded to the index
700000 items loaded to the index
800000 items loaded to the index
900000 items loaded to the index
1000000 items loaded to the index
1100000 items loaded to the index
A total of 1103664 items added to the index
Building the index with 100 trees...
Index is successfully built.
Saving index to disk...
Index is saved to disk.
Index file size: 1.6 GB
Saving mapping to disk...
Mapping is saved to disk.
Mapping file size: 50.61 MB
Wall time: 14min 17s


In [14]:
%ls

The system cannot find the path specified.

 Volume in drive C is Blade
 Volume Serial Number is BEC2-DD53

 Directory of C:\Users\SandersLi\Camino

09/09/2020  08:49 PM    <DIR>          .
09/09/2020  08:49 PM    <DIR>          ..
09/08/2020  02:32 AM               192 .gitignore
09/09/2020  08:12 PM    <DIR>          .ipynb_checkpoints
07/21/2020  08:12 PM    <DIR>          .vscode
09/09/2020  08:15 PM                 0 asdf
09/08/2020  06:51 PM    <DIR>          backend
09/08/2020  10:29 PM    <DIR>          corpus
07/27/2020  09:54 PM    <DIR>          env
09/02/2020  03:04 PM    <DIR>          frontend
09/09/2020  08:49 PM     1,722,824,064 index
09/09/2020  08:49 PM        53,063,748 index.mapping
07/27/2020  01:16 PM                24 Procfile
09/09/2020  08:18 PM            65,673 random_projection_matrix
09/08/2020  10:18 PM        57,600,231 raw.tsv
07/24/2020  04:16 PM    <DIR>          static
09/02/2020  03:04 PM    <DIR>          templates
09/07/2020  02:53 AM           537,205 test.jpg
09/09/2020  08:40 PM            33,999 tf2_semantic_appr

## 4. Use the Index for Similarity Matching
Now we can use the ANN index to find news headlines that are semantically close to an input query.

### Load the index and the mapping files

In [15]:
# Pass the metric explicitly; it must match the one used to build the index.
index = annoy.AnnoyIndex(embedding_dimension, metric='angular')
index.load(index_filename, prefault=True)
print('Annoy index is loaded.')
with open(index_filename + '.mapping', 'rb') as handle:
  mapping = pickle.load(handle)
print('Mapping file is loaded.')

Annoy index is loaded.
Mapping file is loaded.


### Similarity matching method

In [16]:
def find_similar_items(embedding, num_matches=5):
  '''Finds similar items to a given embedding in the ANN index'''
  ids = index.get_nns_by_vector(
      embedding, num_matches, search_k=-1, include_distances=False)
  items = [mapping[i] for i in ids]
  return items
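
If `include_distances=True` is passed instead, `get_nns_by_vector` returns the ids together with their distances. For the `angular` metric, Annoy documents the distance as sqrt(2 * (1 - cos)), so it can be converted back to a cosine similarity. The helper below is a hypothetical sketch of that conversion and is not part of the original notebook.

```python
import math

def angular_distance_to_cosine(distance):
    """Invert d = sqrt(2 * (1 - cos)) to recover the cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0

# Identical direction, orthogonal vectors, and opposite direction.
for distance in (0.0, math.sqrt(2.0), 2.0):
    print(angular_distance_to_cosine(distance))
```

A cosine similarity is often easier to interpret than the raw distance, for example when filtering out weak matches with a threshold.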

### Extract embedding from a given query

In [17]:
# Load the TF-Hub module
print("Loading the TF-Hub module...")
%time embed_fn = hub.load(module_url)
print("TF-Hub module is loaded.")

random_projection_matrix = None
if os.path.exists('random_projection_matrix'):
  print("Loading random projection matrix...")
  with open('random_projection_matrix', 'rb') as handle:
    random_projection_matrix = pickle.load(handle)
  print('random projection matrix is loaded.')

def extract_embeddings(query):
  '''Generates the embedding for the query'''
  query_embedding = embed_fn([query])[0].numpy()
  if random_projection_matrix is not None:
    query_embedding = query_embedding.dot(random_projection_matrix)
  return query_embedding


Loading the TF-Hub module...
Wall time: 2.95 s
TF-Hub module is loaded.
Loading random projection matrix...
random projection matrix is loaded.


In [18]:
extract_embeddings("Hello Machine Learning!")[:10]





array([-0.02737208, -0.09724229, -0.03260606,  0.0102555 , -0.20648217,
        0.15898153,  0.06123003,  0.0845266 ,  0.05975626,  0.04614805])

### Enter a query to find the most similar items

In [19]:
#@title { run: "auto" }
query = "confronting global challenges" #@param {type:"string"}

print("Generating embedding for the query...")
%time query_embedding = extract_embeddings(query)

print("")
print("Finding relevant items in the index...")
%time items = find_similar_items(query_embedding, 10)

print("")
print("Results:")
print("=========")
for item in items:
  print(item)

Generating embedding for the query...
Wall time: 238 ms

Finding relevant items in the index...
Wall time: 4.2 ms

Results:
=========
confronting global challenges
the domestic challenges facing duterte
chronology tsunamis of the past century
how urgent is the rational scrutiny of religion
lynch human rights movement needs better and bolder leaders
emerging nations to help struggling global economy
modern world encouraging rural and regional to seek help
an small scale farming urged as solution to global hunger
abbott says labor faces an existential crisis
what are the biggest challenges facing new wa labor govt


## Want to learn more?

You can learn more about TensorFlow at [tensorflow.org](https://www.tensorflow.org/) and see the TF-Hub API documentation at [tensorflow.org/hub](https://www.tensorflow.org/hub/). Find available TensorFlow Hub modules at [tfhub.dev](https://tfhub.dev/) including more text embedding modules and image feature vector modules.

Also check out the [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/) which is Google's fast-paced, practical introduction to machine learning.