Universal Sentence Encoder - extract token level embeddings #875

Closed
giulia95 opened this issue Feb 2, 2023 · 2 comments

giulia95 commented Feb 2, 2023

What happened?

Hi,
I'm using the Universal Sentence Encoder to compute embedding representations of text sentences stored in a dataframe, using the code below (use_preprocessing).
Now, I would also like to extract the internal representations of the words that make up the sentence being encoded.
I've seen issue #344 and, from what I understood, the model I'm using computes the final embedding as Σ_w Embed(w) / √(sentence length).
The result I would like to achieve is the following:
given the sentence "Example text", I would like to extract the final embedding for "Example text" and the internal embedding for each token (i.e., the embedding of "Example" and the embedding of "text").
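For concreteness, a minimal numeric sketch of that pooling (my own illustration; the two 3-dimensional token vectors are made up):

import numpy as np

token_embeddings = np.array([[0.1, 0.2, 0.3],   # hypothetical vector for "Example"
                             [0.4, 0.5, 0.6]])  # hypothetical vector for "text"
# Sum the per-token vectors and divide by the square root of the sentence length.
sentence_embedding = token_embeddings.sum(axis=0) / np.sqrt(len(token_embeddings))
print(sentence_embedding)  # approximately [0.354, 0.495, 0.636]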

I've also seen issue #403, but it's still not clear to me how to access the token-level embeddings.
Thank you,
Giulia

Relevant code

import tensorflow as tf
import tensorflow_hub as hub


def use_preprocessing(df, column):
    """Compute the embedding via the Universal Sentence Encoder
    for every sentence in the given column.

    Args:
        df: Dataframe
        column: column name to identify data to process
    """
    # The TF1-style hub.Module API requires graph mode, so disable eager execution.
    tf.compat.v1.disable_eager_execution()

    config = tf.compat.v1.ConfigProto()
    config.gpu_options.allow_growth = True

    module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3?tf-hub-format=compressed"
    embed = hub.Module(module_url)

    with tf.compat.v1.Session(config=config) as session:
        session.run([tf.compat.v1.global_variables_initializer(), tf.compat.v1.tables_initializer()])

        # Embed every sentence in the requested column (was `x[column]`, which is undefined).
        text_embedding = session.run(embed(list(df[column])))

    return text_embedding
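
A hypothetical usage sketch (not from the original issue; the dataframe contents and column name are illustrative):

import pandas as pd

df = pd.DataFrame({"text": ["Example text", "Another sentence to encode"]})
embeddings = use_preprocessing(df, "text")
print(embeddings.shape)  # (2, 512) -- this module returns 512-dimensional sentence embeddings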

Relevant log output

----

tensorflow_hub Version

0.13.0.dev (unstable development build)

TensorFlow Version

2.8 (latest stable release)

Other libraries

No response

Python Version

3.x

OS

Linux

@singhniraj08 singhniraj08 self-assigned this Feb 3, 2023
singhniraj08 commented Feb 3, 2023

@giulia95,

universal-sentence-encoder-large/3 has a transformer encoder and uses average pooling over all token embeddings at the last transformer layer as the output embedding.
Unfortunately, depending on the model's architecture and the ops used, it may be hard or impossible to get at the underlying token-level representations.

The Universal Sentence Encoder provides sentence encodings and cannot provide word-level encodings. For word-level encodings, you can use Word2Vec or GloVe as shown here, or follow this tutorial to create a custom word embedding layer.
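
As a rough illustration of the custom-embedding-layer route (my own sketch, not the linked tutorial; the vocabulary size, embedding dimension, and example sentences are assumptions):

import tensorflow as tf

sentences = tf.constant(["Example text", "Another example sentence"])

# Build a vocabulary from the corpus and map each token to an integer id.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=10000, output_mode="int")
vectorizer.adapt(sentences)

# Trainable lookup table: one 128-dimensional vector per token id.
embedding = tf.keras.layers.Embedding(input_dim=vectorizer.vocabulary_size(), output_dim=128)

token_ids = vectorizer(sentences)       # shape: (batch, tokens_per_sentence)
token_vectors = embedding(token_ids)    # shape: (batch, tokens_per_sentence, 128)

Pre-trained GloVe or Word2Vec vectors can also be loaded into such an Embedding layer's weights instead of training it from scratch.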

Hope that helps. Thank you!


giulia95 commented Feb 3, 2023

Thank you for the answer.
I'll look at the approaches you suggested.

@giulia95 giulia95 closed this as completed Feb 3, 2023