
Universal Sentence Encoder Lite model giving same embeddings for 2 different (long) strings #572

Closed
anurag89sharma opened this issue Apr 29, 2020 · 1 comment


@anurag89sharma
Hello,

I am currently using the Universal Sentence Encoder Lite model to get text embeddings for our corpus in order to find similar items. I am facing an issue: if I have a very long paragraph and remove a sentence from its end, the two embeddings (before and after removing the sentence) are exactly the same. Here is what I have tried:

sent1 = """No one is ever ready for an emergency but you can be prepared When you know where to get information have the right supplies and have a plan for you your loved ones and your pets you can protect yourself and your family before a crisis and for at least 72hours afterwards The Homeland Security and Emergency Management Agency HSEMA iPhone iPad application contains important information you can use before during and after an emergency or disaster such as Emergency evacuation routes that lead you out of the District Alert DC emergency text alerts Current weather outlooks from the National Weather Service Disaster safety tips Help lines that provide telephone numbers to essential emergency resources and information A calendars informing the public about emergency preparedness training HSEMA Community Outreach events as well as special events such as marathons and street festivals A direct link to the local transit authority s METRO main website and twitter page List of shelters that are opened after a disaster occurs A direct link to FEMA s website Maps of where District Police and Fire stations are located Regional preparedness links Steps to take to make a family emergency plan a go kit and much more The tools in this app help ensure that no matter where you are or what you are doing you ll be prepared The app is free to download through your iPhone and iPad provider s app store"""

sent2 = """No one is ever ready for an emergency but you can be prepared When you know where to get information have the right supplies and have a plan for you your loved ones and your pets you can protect yourself and your family before a crisis and for at least 72hours afterwards The Homeland Security and Emergency Management Agency HSEMA iPhone iPad application contains important information you can use before during and after an emergency or disaster such as Emergency evacuation routes that lead you out of the District Alert DC emergency text alerts Current weather outlooks from the National Weather Service Disaster safety tips Help lines that provide telephone numbers to essential emergency resources and information A calendars informing the public about emergency preparedness training HSEMA Community Outreach events as well as special events such as marathons and street festivals"""

# sp, process_to_IDs_in_sparse_format, input_placeholder and encodings are
# set up as in the TF Hub Universal Sentence Encoder Lite example notebook.
messages = [sent1, sent2]

# Get the embeddings for sent1 & sent2
values, indices, dense_shape = process_to_IDs_in_sparse_format(sp, messages)
with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    message_embeddings = session.run(
        encodings,
        feed_dict={input_placeholder.values: values,
                   input_placeholder.indices: indices,
                   input_placeholder.dense_shape: dense_shape})

# Find cosine similarity between sent1 & sent2
from scipy import spatial
1 - spatial.distance.cosine(message_embeddings[0], message_embeddings[1])
>>> 1.0

Is there an input length limit for the Universal Sentence Encoder Lite model? I also tried Universal Sentence Encoder models 4 & 5, and I didn't see this issue with those models.

Also, when will a new version of the Lite model that works on TensorFlow 2.0 be released?

@gowthamkpr

@anurag89sharma This is expected behavior. Universal Sentence Encoder Lite clips its input to a maximum of 128 tokens.
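A rough way to see why the two embeddings come out identical (a sketch only: whitespace splitting is used here as a crude stand-in for the model's actual SentencePiece tokenization, and `truncate` is a hypothetical helper, not part of the model's API): sent2 is a prefix of sent1, and both exceed 128 tokens, so after clipping the encoder sees exactly the same input for both.

```python
MAX_TOKENS = 128  # USE Lite's clipping limit, per the comment above

def truncate(text, limit=MAX_TOKENS):
    """Keep only the first `limit` whitespace-separated tokens
    (a crude proxy for the model's SentencePiece clipping)."""
    return " ".join(text.split()[:limit])

# Toy stand-ins for sent1 and sent2: the shorter text is a prefix
# of the longer one, and both are longer than 128 tokens.
long_text = " ".join(f"tok{i}" for i in range(250))
prefix    = " ".join(f"tok{i}" for i in range(150))

print(truncate(long_text) == truncate(prefix))  # True: identical after clipping
```

Since the clipped inputs are byte-for-byte identical, the model necessarily produces identical embeddings, and their cosine similarity is 1.0. A common workaround is to split long documents into chunks under the token limit, embed each chunk, and average (or otherwise pool) the chunk embeddings.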
