**Copyright 2019 The TensorFlow Hub Authors.**

Licensed under the Apache License, Version 2.0 (the "License");

In [0]:
# Copyright 2019 The TensorFlow Hub Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Cross-Lingual Similarity and Semantic Search Engine with Multilingual Universal Sentence Encoder


<table align="left"><td>
  <a target="_blank"  href="https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/cross_lingual_similarity_with_tf_hub_multilingual_universal_encoder.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab
  </a>
</td><td>
  <a target="_blank"  href="https://github.com/tensorflow/hub/blob/master/examples/colab/cross_lingual_similarity_with_tf_hub_multilingual_universal_encoder.ipynb">
    <img width=32px src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
</td></table>


This notebook illustrates how to access the Multilingual Universal Sentence Encoder module and use it for sentence similarity across multiple languages. This module is an extension of the [original Universal Encoder module](https://tfhub.dev/google/universal-sentence-encoder/2).

The notebook is divided as follows:

*   The first section shows a visualization of sentence similarity between pairs of languages. This is a more academic exercise.
*   In the second section, we show how to build a semantic search engine from a sample of a news corpus in multiple languages.

# Getting Started

This section sets up the environment for access to the Multilingual Universal Sentence Encoder Module and also prepares a set of English sentences and their translations. In the following sections, the multilingual module will be used to compute similarity *across languages*.

**IMPORTANT** Note: Please select "**Python 3**" _and_ "**GPU**" in the ***Runtime->Change Runtime type*** dropdown menu above _before_ running this notebook for faster execution.

In [0]:
#@title Setup Environment
# The latest TensorFlow version that supports sentencepiece is 1.13.1.
!pip uninstall --quiet --yes tensorflow
!pip install --quiet tensorflow-gpu==1.13.1
!pip install --quiet tensorflow-hub
!pip install --quiet seaborn
!pip install --quiet tf-sentencepiece
!pip install --quiet simpleneighbors
!pip install --quiet tqdm

In [0]:
#@title Setup common imports and functions
import numpy as np
import os
import pandas as pd
import seaborn as sns
import tensorflow as tf
import tensorflow_hub as hub
import tf_sentencepiece  # Not used directly but needed to import TF ops.

from simpleneighbors import SimpleNeighbors
from tqdm import tqdm
from tqdm import trange

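def visualize_similarity(embeddings_1, embeddings_2, labels_1, labels_2, plot_title):
  """Plots a heatmap of dot products between two sets of embeddings."""
  # Rows follow embeddings_2 (y axis) and columns follow embeddings_1 (x axis),
  # so that each cell lines up with the tick labels set below.
  corr = np.inner(embeddings_2, embeddings_1)
  chart = sns.heatmap(corr,
                      xticklabels=labels_1,
                      yticklabels=labels_2,
                      vmin=0,
                      vmax=1,
                      cmap='YlOrRd')
  chart.set_yticklabels(chart.get_yticklabels(), rotation=0)
  chart.set_title(plot_title)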

This is additional boilerplate code where we import the pre-trained ML model that we will use to encode text throughout this notebook.

In [0]:
# The 16-language multilingual module is the default but feel free
# to pick others from the list and compare the results.
module_url = 'https://tfhub.dev/google/universal-sentence-encoder-multilingual/1'  #@param ['https://tfhub.dev/google/universal-sentence-encoder-multilingual/1', 'https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/1', 'https://tfhub.dev/google/universal-sentence-encoder-xling-many/1']

# Set up graph.
g = tf.Graph()
with g.as_default():
  text_input = tf.placeholder(dtype=tf.string, shape=[None])
  multiling_embed = hub.Module(module_url)
  embedded_text = multiling_embed(text_input)
  init_op = tf.group([tf.global_variables_initializer(), tf.tables_initializer()])
g.finalize()

# Initialize session.
session = tf.Session(graph=g)
session.run(init_op)

# Visualize Text Similarity Between Languages
With the encoder set up, we can now compute sentence embeddings and visualize semantic similarity across different languages.

## Computing Text Embeddings

We first define a set of sentences translated to various languages in parallel. Then, we precompute the embeddings for all of our sentences.

In [0]:
# Some texts of different lengths in different languages.
english_sentences = ['dog', 'Puppies are nice.', 'I enjoy taking long walks along the beach with my dog.']
spanish_sentences = ['perro', 'Los cachorros son agradables.', 'Disfruto de dar largos paseos por la playa con mi perro.']
german_sentences = ['Hund', 'Welpen sind nett.', 'Ich genieße lange Spaziergänge am Strand entlang mit meinem Hund.']
french_sentences = ['chien', 'Les chiots sont gentils.', 'J\'aime faire de longues promenades sur la plage avec mon chien.']
italian_sentences = ['cane', 'I cuccioli sono carini.', 'Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.']
chinese_sentences = ['狗', '小狗很好。', '我喜欢和我的狗一起沿着海滩散步。']
korean_sentences = ['개', '강아지가 좋다.', '나는 나의 산책을 해변을 따라 길게 산책하는 것을 즐긴다.']
japanese_sentences = ['犬', '子犬はいいです', '私は犬と一緒にビーチを散歩するのが好きです']

In [0]:
# Compute embeddings.
en_result = session.run(embedded_text, feed_dict={text_input: english_sentences})
es_result = session.run(embedded_text, feed_dict={text_input: spanish_sentences})
de_result = session.run(embedded_text, feed_dict={text_input: german_sentences})
fr_result = session.run(embedded_text, feed_dict={text_input: french_sentences})
it_result = session.run(embedded_text, feed_dict={text_input: italian_sentences})
zh_result = session.run(embedded_text, feed_dict={text_input: chinese_sentences})
ko_result = session.run(embedded_text, feed_dict={text_input: korean_sentences})
ja_result = session.run(embedded_text, feed_dict={text_input: japanese_sentences})

## Visualizing Similarity

With text embeddings in hand, we can take their dot product to visualize how similar sentences are between languages. A darker color indicates that the two embeddings are more semantically similar.
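
To make the dot-product comparison concrete, here is a small sketch on made-up 3-dimensional vectors. The values and the `toy_en` / `toy_es` names are illustrative assumptions, not real encoder outputs. The sketch normalizes the vectors to unit length first; under that assumption the dot product is the same as cosine similarity, which is why scores fall between 0 and 1 for vectors pointing in broadly similar directions.

```python
import numpy as np

def normalize(rows):
    """Scale each row vector to unit length."""
    rows = np.asarray(rows, dtype=float)
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)

# Made-up vectors standing in for sentence embeddings (not real model output).
toy_en = normalize([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
toy_es = normalize([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

# similarity[i, j] is the dot product of toy_en[i] with toy_es[j].
similarity = np.inner(toy_en, toy_es)
print(similarity.shape)
print(np.round(similarity, 3))
```

Each row of `similarity` corresponds to one vector of the first argument and each column to one vector of the second, so a heatmap of this matrix puts the first set on the vertical axis.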

### English-Spanish Similarity

In [0]:
visualize_similarity(en_result, es_result, english_sentences, spanish_sentences, 'English-Spanish Similarity')

### English-Italian Similarity

In [0]:
visualize_similarity(en_result, it_result, english_sentences, italian_sentences, 'English-Italian Similarity')

### Italian-Spanish Similarity

In [0]:
visualize_similarity(it_result, es_result, italian_sentences, spanish_sentences, 'Italian-Spanish Similarity')

### And more...

The above examples can be extended to any language pair from **English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Thai and Turkish**. Happy coding!

# Creating a Multilingual Semantic-Similarity Search Engine

Whereas in the previous example we visualized a handful of sentences, in this section we will build a semantic-search index over a far larger set of sentences from the News Commentary Corpus. The sentences come from several languages (Arabic, Chinese, English, Russian and Spanish in the code below) to demonstrate the multilingual capabilities of the Universal Sentence Encoder.

## Download Data to Index
First, we will download news sentences in multiple languages from the [News Commentary Corpus](http://opus.nlpl.eu/News-Commentary-v11.php) [1]. Without loss of generality, this approach should also work for indexing the rest of the supported languages.

In [0]:
corpus_metadata = [
    ('ar', 'ar-en.txt.zip', 'News-Commentary.ar-en.ar', 'Arabic'),
    ('zh', 'en-zh.txt.zip', 'News-Commentary.en-zh.zh', 'Chinese'),
    ('en', 'en-es.txt.zip', 'News-Commentary.en-es.en', 'English'),
    ('ru', 'en-ru.txt.zip', 'News-Commentary.en-ru.ru', 'Russian'),
    ('es', 'en-es.txt.zip', 'News-Commentary.en-es.es', 'Spanish'),
]

language_to_sentences = {}
language_to_news_path = {}
for language_code, zip_file, news_file, language_name in corpus_metadata:
  zip_path = tf.keras.utils.get_file(
      fname=zip_file,
      origin='http://opus.nlpl.eu/download.php?f=News-Commentary/v11/moses/' + zip_file,
      extract=True)
  news_path = os.path.join(os.path.dirname(zip_path), news_file)
  language_to_sentences[language_code] = pd.read_csv(news_path, sep='\t', header=None)[0]
  language_to_news_path[language_code] = news_path

  print('{:,} {} sentences'.format(len(language_to_sentences[language_code]), language_name))

## Using a pre-trained model to transform sentences into vectors

We compute embeddings in _batches_ so that they fit in the GPU's memory.
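
The batching idea is independent of pandas or TensorFlow. As a minimal sketch (the `iter_batches` helper and the placeholder sentences are illustrative and not part of this notebook's pipeline), a list can be split into fixed-size chunks so that only one chunk is handed to the encoder at a time:

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError('batch_size must be positive')
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder data standing in for a corpus of sentences.
sentences = ['sentence {}'.format(i) for i in range(5)]
batches = list(iter_batches(sentences, batch_size=2))
print([len(batch) for batch in batches])  # [2, 2, 1]
```

The last batch may be smaller than `batch_size`; the code cell below gets the same behavior from the `chunksize` argument of `pd.read_csv`.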

In [0]:
# Takes about 3 minutes

batch_size = 2048
language_to_embeddings = {}
for language_code, zip_file, news_file, language_name in corpus_metadata:
  print('\nComputing {} embeddings'.format(language_name))
  with tqdm(total=len(language_to_sentences[language_code])) as pbar:
    for batch in pd.read_csv(language_to_news_path[language_code], sep='\t',header=None, chunksize=batch_size):
      language_to_embeddings.setdefault(language_code, []).extend(session.run(embedded_text, feed_dict={text_input: batch[0]}))
      pbar.update(len(batch))

## Building an index of semantic vectors

We use the [SimpleNeighbors](https://pypi.org/project/simpleneighbors/) library, a wrapper for the [Annoy](https://github.com/spotify/annoy) library, to efficiently look up results from the corpus.
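
For intuition about what such an index returns, the sketch below performs an exact, brute-force dot-product search with NumPy. This is not how Annoy works internally: Annoy builds trees to approximate this result much faster on large corpora. The `brute_force_nearest` helper and the toy vectors are illustrative assumptions.

```python
import numpy as np

def brute_force_nearest(query, vectors, items, n=3):
    """Return the `n` items whose vectors have the largest dot product with `query`."""
    scores = np.asarray(vectors, dtype=float) @ np.asarray(query, dtype=float)
    best = np.argsort(-scores, kind='stable')[:n]  # indices by descending score
    return [items[i] for i in best]

# Toy 2-dimensional "embeddings" (made up for illustration).
toy_items = ['markets', 'weather', 'finance']
toy_vectors = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]

print(brute_force_nearest([1.0, 0.0], toy_vectors, toy_items, n=2))
# ['markets', 'finance']
```

A brute-force scan compares the query with every stored vector, so its cost grows linearly with the corpus size; an approximate index trades a little accuracy for much faster lookups.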

In [0]:
%%time

# Takes about 8 minutes

num_index_trees = 40
language_name_to_index = {}
embedding_dimensions = len(list(language_to_embeddings.values())[0][0])
for language_code, zip_file, news_file, language_name in corpus_metadata:
  print('\nAdding {} embeddings to index'.format(language_name))
  index = SimpleNeighbors(embedding_dimensions, metric='dot')

  for i in trange(len(language_to_sentences[language_code])):
    index.add_one(language_to_sentences[language_code][i], language_to_embeddings[language_code][i])

  print('Building {} index with {} trees...'.format(language_name, num_index_trees))
  index.build(n=num_index_trees)
  language_name_to_index[language_name] = index

In [0]:
%%time

# Takes about 13 minutes

num_index_trees = 60
print('Computing mixed-language index')
combined_index = SimpleNeighbors(embedding_dimensions, metric='dot')
for language_code, zip_file, news_file, language_name in corpus_metadata:
  print('Adding {} embeddings to mixed-language index'.format(language_name))
  for i in trange(len(language_to_sentences[language_code])):
    annotated_sentence = '({}) {}'.format(language_name, language_to_sentences[language_code][i])
    combined_index.add_one(annotated_sentence, language_to_embeddings[language_code][i])

print('Building mixed-language index with {} trees...'.format(num_index_trees))
combined_index.build(n=num_index_trees)

## Verify that the semantic-similarity search engine works

In this section we will demonstrate:

1.   Semantic-search capabilities: retrieving sentences from the corpus that are semantically similar to the given query.
2.   Multilingual capabilities: doing so in multiple languages when the query language and index language match.
3.   Cross-lingual capabilities: issuing queries in a different language from the indexed corpus.
4.   Mixed-language corpus: all of the above on a single index containing entries from all languages.



### Semantic-search cross-lingual capabilities

In this section we show how to retrieve sentences related to a set of sample English sentences. Things to try:

*   Try a few different sample sentences
*   Try changing the number of returned results (they are returned in order of similarity)
*   Try cross-lingual capabilities by returning results in different languages (you might want to use [Google Translate](http://translate.google.com) to translate some results into your native language as a sanity check)



In [0]:
sample_query = 'The stock market fell four points.'  #@param ["Global warming", "Researchers made a surprising new discovery last week.", "The stock market fell four points.", "Lawmakers will vote on the proposal tomorrow."] {allow-input: true}
# Only languages that were indexed above are valid choices here.
index_language = 'English'  #@param ["Arabic", "Chinese", "English", "Russian", "Spanish"]
num_results = 10  #@param {type:"slider", min:0, max:100, step:10}

query_embedding = session.run(embedded_text, feed_dict={text_input: [sample_query]})[0]

search_results = language_name_to_index[index_language].nearest(query_embedding, n=num_results)

print('{} sentences similar to: "{}"\n'.format(index_language, sample_query))
search_results

### Mixed-corpus capabilities

We will now issue a query in English, but the results will come from any of the indexed languages.

In [0]:
sample_query = 'The stock market fell four points.'  #@param ["Global warming", "Researchers made a surprising new discovery last week.", "The stock market fell four points.", "Lawmakers will vote on the proposal tomorrow."] {allow-input: true}
num_results = 40  #@param {type:"slider", min:0, max:100, step:10}
query_embedding = session.run(embedded_text, feed_dict={text_input: [sample_query]})[0]

search_results = combined_index.nearest(query_embedding, n=num_results)

print('Mixed-language sentences similar to: "{}"\n'.format(sample_query))
search_results
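
Because every entry in the mixed-language index was stored with a `(Language)` prefix, results can be grouped by language after the search. The following is a minimal sketch: the `split_annotation` helper and the `example_results` list are hypothetical stand-ins, not actual output from the index.

```python
from collections import Counter

def split_annotation(annotated_sentence):
    """Split '(Language) sentence' into a (language, sentence) pair."""
    language, _, sentence = annotated_sentence.partition(') ')
    return language[1:], sentence  # drop the leading '('

# Hypothetical results in the '(Language) sentence' format used for the combined index.
example_results = [
    '(English) Stock prices dropped sharply.',
    '(Spanish) Los precios de las acciones cayeron.',
    '(English) Markets closed lower (again).',
]

counts = Counter(split_annotation(result)[0] for result in example_results)
print(counts.most_common())  # [('English', 2), ('Spanish', 1)]
```

Splitting on the first `') '` only keeps parentheses inside the sentence itself intact.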

Try your own queries:

In [0]:
query = 'The stock market fell four points.'  #@param {type:"string"}
num_results = 30  #@param {type:"slider", min:0, max:100, step:10}
query_embedding = session.run(embedded_text, feed_dict={text_input: [query]})[0]

search_results = combined_index.nearest(query_embedding, n=num_results)

print('Mixed-language sentences similar to: "{}"\n'.format(query))
search_results

# Further topics

## Multilingual

Finally, we encourage you to try queries in any of the supported languages: **English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Thai and Turkish**.

Also, even though we indexed only a subset of the languages, you can index content in any of the supported languages.


## Model variations

We offer variations of the Universal Encoder models optimized for different trade-offs among memory, latency and quality. Please feel free to experiment with them to find a suitable one.

## Nearest neighbor libraries

We used Annoy to efficiently look up nearest neighbors. See the [tradeoffs section](https://github.com/spotify/annoy/blob/master/README.rst#tradeoffs) to read about the number of trees (memory-dependent) and the number of items to search (latency-dependent). SimpleNeighbors only exposes the number of trees, but refactoring the code to use Annoy directly should be simple; we wanted to keep this code as simple as possible for the general user.

If Annoy does not scale for your application, please also check out [FAISS](https://github.com/facebookresearch/faiss).

*All the best building your multilingual semantic applications!*

[1] J. Tiedemann, 2012, [Parallel Data, Tools and Interfaces in OPUS](http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf). In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012).