
Universal Sentence Encoder Lite model giving same embeddings for 2 different (long) strings #572

Closed
anurag89sharma opened this issue Apr 29, 2020 · 1 comment


@anurag89sharma
Hello,

I am currently using the Universal Sentence Encoder Lite model to get text embeddings for our corpus in order to find similar items. I am facing an issue: if I have a very long paragraph and remove a sentence from its end, the two embeddings (before and after removing the sentence) are exactly the same. Here is what I have tried:

sent1 = """No one is ever ready for an emergency but you can be prepared When you know where to get information have the right supplies and have a plan for you your loved ones and your pets you can protect yourself and your family before a crisis and for at least 72hours afterwards The Homeland Security and Emergency Management Agency HSEMA iPhone iPad application contains important information you can use before during and after an emergency or disaster such as Emergency evacuation routes that lead you out of the District Alert DC emergency text alerts Current weather outlooks from the National Weather Service Disaster safety tips Help lines that provide telephone numbers to essential emergency resources and information A calendars informing the public about emergency preparedness training HSEMA Community Outreach events as well as special events such as marathons and street festivals A direct link to the local transit authority s METRO main website and twitter page List of shelters that are opened after a disaster occurs A direct link to FEMA s website Maps of where District Police and Fire stations are located Regional preparedness links Steps to take to make a family emergency plan a go kit and much more The tools in this app help ensure that no matter where you are or what you are doing you ll be prepared The app is free to download through your iPhone and iPad provider s app store"""

sent2 = """No one is ever ready for an emergency but you can be prepared When you know where to get information have the right supplies and have a plan for you your loved ones and your pets you can protect yourself and your family before a crisis and for at least 72hours afterwards The Homeland Security and Emergency Management Agency HSEMA iPhone iPad application contains important information you can use before during and after an emergency or disaster such as Emergency evacuation routes that lead you out of the District Alert DC emergency text alerts Current weather outlooks from the National Weather Service Disaster safety tips Help lines that provide telephone numbers to essential emergency resources and information A calendars informing the public about emergency preparedness training HSEMA Community Outreach events as well as special events such as marathons and street festivals"""

# sp, process_to_IDs_in_sparse_format, input_placeholder and encodings are
# set up as in the TF Hub Universal Sentence Encoder Lite example notebook.
messages = [sent1, sent2]

# Get the embeddings for sent1 & sent2
values, indices, dense_shape = process_to_IDs_in_sparse_format(sp, messages)
with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    message_embeddings = session.run(
        encodings,
        feed_dict={input_placeholder.values: values,
                   input_placeholder.indices: indices,
                   input_placeholder.dense_shape: dense_shape})

# Find cosine similarity between sent1 & sent2
from scipy import spatial
1 - spatial.distance.cosine(message_embeddings[0], message_embeddings[1])
>>> 1.0

Is there an input length limit for the Universal Sentence Encoder Lite model? I also tried Universal Sentence Encoder models 4 & 5, and I didn't see this issue with those models.

Also, when will a new version of the Lite model that works on TensorFlow 2.0 be released?

@gowthamkpr

@anurag89sharma This is expected behavior. Universal Sentence Encoder Lite clips its input to a maximum of 128 tokens.
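A rough way to see why the two embeddings come out identical (a sketch only: whitespace splitting is used here as a crude stand-in for the model's actual SentencePiece tokenization, and `truncate` is a hypothetical helper, not part of the model's API): sent2 is a prefix of sent1, and both exceed 128 tokens, so after clipping the encoder sees exactly the same input for both.

```python
MAX_TOKENS = 128  # USE Lite's clipping limit, per the comment above

def truncate(text, limit=MAX_TOKENS):
    """Keep only the first `limit` whitespace-separated tokens
    (a crude proxy for the model's SentencePiece clipping)."""
    return " ".join(text.split()[:limit])

# Toy stand-ins for sent1 and sent2: the shorter text is a prefix
# of the longer one, and both are longer than 128 tokens.
long_text = " ".join(f"tok{i}" for i in range(250))
prefix    = " ".join(f"tok{i}" for i in range(150))

print(truncate(long_text) == truncate(prefix))  # True: identical after clipping
```

Since the clipped inputs are byte-for-byte identical, the model necessarily produces identical embeddings, and their cosine similarity is 1.0. A common workaround is to split long documents into chunks under the token limit, embed each chunk, and average (or otherwise pool) the chunk embeddings.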
