# A Gentle Introduction to Google's Universal Sentence Encoder

The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks.

The model is trained and optimized for greater-than-word-length text, such as sentences, phrases, or short paragraphs. It is trained on a variety of data sources and tasks, with the aim of dynamically accommodating a wide range of natural language understanding tasks. The input is variable-length English text and the output is a 512-dimensional vector. We apply this model to the STS benchmark for semantic similarity, and the results can be seen in the example notebook made available. The universal-sentence-encoder model is trained with a deep averaging network (DAN) encoder.

In [1]:
## Simple Explanation of DAN [PLACEHOLDER]
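
As a rough intuition, a deep averaging network averages the embeddings of the input tokens and passes that average through a few feed-forward layers. The sketch below illustrates only this idea in plain Python. The tiny vocabulary, the 3-dimensional vectors, and the single hand-picked layer are made-up assumptions, not the weights or architecture details of the actual Universal Sentence Encoder.

```python
import math

# Hypothetical toy word embeddings (3 dimensions instead of 512).
TOY_EMBEDDINGS = {
    "i": [0.1, 0.0, 0.2],
    "love": [0.9, 0.1, 0.0],
    "clustering": [0.0, 0.8, 0.3],
}

# Hypothetical weights of one feed-forward layer (3 inputs -> 2 outputs).
TOY_WEIGHTS = [[0.5, -0.2, 0.1],
               [0.3, 0.4, -0.6]]
TOY_BIAS = [0.0, 0.1]


def average_embeddings(tokens):
    """Average the embeddings of all known tokens, dimension by dimension."""
    vectors = [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]
    if not vectors:
        raise ValueError("no known tokens in input")
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def dense_tanh(vector, weights, bias):
    """One fully connected layer followed by a tanh non-linearity."""
    return [math.tanh(sum(w * x for w, x in zip(row, vector)) + b)
            for row, b in zip(weights, bias)]


def dan_encode(sentence):
    """Toy DAN: tokenize, average the token embeddings, apply one layer."""
    tokens = sentence.lower().replace("!", "").split()
    return dense_tanh(average_embeddings(tokens), TOY_WEIGHTS, TOY_BIAS)


print(dan_encode("i love clustering!"))
```

Because the tokens are averaged before the network sees them, word order does not affect the result in this sketch: `dan_encode("clustering love i")` returns the same vector, up to floating-point rounding.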

## Working around the proxy

In [2]:
import os
os.environ['http_proxy'] = "http://proxy.mms-dresden.de:8080"
os.environ['https_proxy'] = "http://proxy.mms-dresden.de:8080"

## Loading the model from TF-Hub

In [3]:
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity



In [4]:
model_url = "https://tfhub.dev/google/universal-sentence-encoder/2" 

In [5]:
import hashlib
# The path where TF-Hub caches the model (use an absolute path).
os.environ["TFHUB_CACHE_DIR"] = 'C:/Users/dnho/Desktop/universal_sentence_encoder/model'

# TF-Hub names the cached model directory after the SHA-1 hex digest of the model URL.
hashlib.sha1(model_url.encode("utf8")).hexdigest()

'1fb57c3ffe1a38479233ee9853ddd7a8ac8a8c47'
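
To locate the cached model on disk, the digest can be joined with the cache directory. This is a small sketch based on the naming scheme described above. The helper name `tfhub_cache_path` and the example cache directory are made up for illustration, and the exact directory layout may differ between TF-Hub versions.

```python
import hashlib
import os


def tfhub_cache_path(cache_dir, handle):
    """Return the directory where the model for `handle` is expected to be cached.

    Assumption: the directory is named after the SHA-1 hex digest of the URL.
    """
    digest = hashlib.sha1(handle.encode("utf8")).hexdigest()
    return os.path.join(cache_dir, digest)


# Hypothetical cache directory, used only for this example.
path = tfhub_cache_path("/tmp/tfhub_cache",
                        "https://tfhub.dev/google/universal-sentence-encoder/2")
print(path)
print(os.path.isdir(path))  # True only after the model has been downloaded there
```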

In [6]:
# Reduce logging output.
tf.logging.set_verbosity(tf.logging.ERROR)

In [22]:
%%time
# The first call downloads the model from TF-Hub (~1 GB), which takes a while.
model = hub.Module(model_url)

Wall time: 1.8 s


### Computing representations for different kinds of messages
1. The Universal Sentence Encoder supports single words.
2. It supports sentences as well.
3. The longer a paragraph is, the more diluted the resulting embedding becomes, so it captures the semantics less well.
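
One possible intuition for the third point, assuming the encoder averages over its input as a DAN does: each additional unrelated token reduces the share of any single token in the average. The toy sketch below uses orthogonal one-hot vectors as stand-ins for token embeddings (an assumption made purely for illustration, not how the real model represents words) and measures how similar the average stays to the first token.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def one_hot(index, size):
    """Stand-in embedding: a vector with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def mean_vector(vectors):
    """Dimension-wise average of a list of vectors."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


size = 64
topic = one_hot(0, size)
for n_tokens in (1, 4, 16, 64):
    # The "text" consists of the topic token plus n_tokens - 1 unrelated tokens.
    tokens = [one_hot(i, size) for i in range(n_tokens)]
    similarity = cosine(mean_vector(tokens), topic)
    print(n_tokens, round(similarity, 3))
```

In this toy setting the similarity to the topic token falls as 1/sqrt(n), giving 1.0, 0.5, 0.25, and 0.125 for the four lengths.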

In [8]:
word = "clustering"
sentence = "i love clustering!"
paragraph = "Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters)."


messages = [word, sentence, paragraph]

In [9]:
%%time
with tf.Session() as session: 
    # Initialize the graph's global variables and lookup tables.
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    
    message_embeddings = session.run(model(messages))
    print("Return result as np.ndarray: ")
    print(message_embeddings)
    
    print("\n Pretty print: ")
    
    for i, message_embedding in enumerate(np.array(message_embeddings).tolist()):
        print("Message: {}".format(messages[i]))
        print("Embedding size: {}".format(len(message_embedding)))
        message_embedding_snippet = ", ".join((str(x) for x in message_embedding[:3]))
        print("Embedding: [{}, ...]\n".format(message_embedding_snippet))

Return result as np.ndarray: 
[[-0.03222987 -0.05628269 -0.02492116 ... -0.02649049 -0.04364398
  -0.07042835]
 [ 0.01576823 -0.06964916 -0.02094725 ...  0.06391787 -0.05830861
  -0.020761  ]
 [-0.01098906 -0.04785546 -0.00761339 ... -0.06857845 -0.0258504
  -0.06867265]]

 Pretty print: 
Message: clustering
Embedding size: 512
Embedding: [-0.032229870557785034, -0.056282687932252884, -0.024921156466007233, ...]

Message: i love clustering!
Embedding size: 512
Embedding: [0.015768231824040413, -0.06964915990829468, -0.020947245880961418, ...]

Message: Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).
Embedding size: 512
Embedding: [-0.010989056900143623, -0.04785546287894249, -0.007613391149789095, ...]

Wall time: 27.3 s


### Playing around a little bit with similarity (quick & dirty)

In [10]:
paragraph_one = "I like to categorize different machine types"
paragraph_two = "I like to know the stock price of my shares"
paragraph_three = "Cats are just the cutest animals ever!"

definition_1 = "Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters)."
definition_2 = "Regression is a data mining technique used to predict a range of numeric values (also called continuous values), given a particular dataset."

paragraphes = [paragraph_one, paragraph_two, paragraph_three, definition_1, definition_2]

In [11]:
%%time
with tf.Session() as session: 
    # Initialize the graph's global variables and lookup tables.
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    
    paragraph_embeddings = session.run(model(paragraphes))

Wall time: 31.4 s


In [12]:
x = paragraph_embeddings

In [13]:
x.shape

(5, 512)

In [14]:
x_0 = x[0].reshape(1,-1)
x_1 = x[1].reshape(1,-1)
x_2 = x[2].reshape(1,-1)
x_3 = x[3].reshape(1,-1)
x_4 = x[4].reshape(1,-1)
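
The `reshape(1, -1)` calls turn each 512-dimensional row into a 1 x 512 matrix, because scikit-learn's `cosine_similarity` expects 2-D inputs and returns a matrix of pairwise scores. For reference, the score itself is the dot product of two vectors divided by the product of their lengths. The plain-Python sketch below computes it for made-up 3-dimensional vectors; the helper names are illustrative and not part of any library.

```python
import math


def cosine(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pairwise_cosine(rows):
    """Return the full similarity matrix for a list of vectors."""
    return [[cosine(u, v) for v in rows] for u in rows]


# Made-up vectors standing in for rows of the embedding matrix `x`.
toy_rows = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
for row in pairwise_cosine(toy_rows):
    print([round(score, 3) for score in row])
```

With scikit-learn, passing the whole matrix (`cosine_similarity(x)`) should give all pairwise scores in one call, which avoids the per-row reshaping.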

#### Clustering paragraph
Result: the clustering paragraph is more similar to the clustering definition than to the regression definition.

In [15]:
print("Cosine similarity between paragraph_one and definition_1: {}".format(cosine_similarity(x_0, x_3)))

Cosine similarity between paragraph_one and definition_1: [[0.5987135]]


In [16]:
print("Cosine similarity between paragraph_one and definition_2: {}".format(cosine_similarity(x_0, x_4)))

Cosine similarity between paragraph_one and definition_2: [[0.40741786]]


#### Regression paragraph
Result: the regression paragraph is more similar to the regression definition than to the clustering definition.

In [17]:
print("Cosine similarity between paragraph_two and definition_1: {}".format(cosine_similarity(x_1, x_3)))

Cosine similarity between paragraph_two and definition_1: [[0.1916528]]


In [18]:
print("Cosine similarity between paragraph_two and definition_2: {}".format(cosine_similarity(x_1, x_4)))

Cosine similarity between paragraph_two and definition_2: [[0.3606335]]


#### Random paragraph
Result: compared with the other results, the similarity to both definitions is quite small.

In [19]:
cosine_similarity(x_2, x_3)

array([[0.1499501]], dtype=float32)

In [20]:
cosine_similarity(x_2, x_4)

array([[0.00603354]], dtype=float32)
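
The comparisons above can be summarized by picking, for each paragraph, the definition with the highest cosine similarity. The sketch below hard-codes the scores reported in this notebook (rounded to four decimals). The `min_score` cutoff of 0.3 is an arbitrary assumption for flagging weak matches, not a recommended value, and `best_match` is a hypothetical helper.

```python
# Cosine similarities reported above, rounded to four decimals.
scores = {
    "paragraph_one": {"definition_1": 0.5987, "definition_2": 0.4074},
    "paragraph_two": {"definition_1": 0.1917, "definition_2": 0.3606},
    "paragraph_three": {"definition_1": 0.1500, "definition_2": 0.0060},
}


def best_match(candidates, min_score=0.3):
    """Return the best-scoring candidate, or None if its score is below min_score."""
    name, score = max(candidates.items(), key=lambda item: item[1])
    return name if score >= min_score else None


for paragraph, candidates in scores.items():
    print(paragraph, "->", best_match(candidates))
```

With these numbers, the first paragraph maps to the clustering definition, the second to the regression definition, and the cat sentence to no definition at all.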