Universal Sentence Encoder - extract token level embeddings #875

Closed
giulia95 opened this issue Feb 2, 2023 · 2 comments

giulia95 commented Feb 2, 2023

What happened?

Hi,
I'm using the Universal Sentence Encoder to compute embedding representations of text sentences stored in a dataframe, using the code below (use_preprocessing).
Now, I would also like to extract the internal representations of the words that make up the sentence being encoded.
I've seen issue #344 and, from what I understood, the model I'm using computes the final embedding as Σ_w Embed(w) / √(sentence length).
The result I would like to achieve is the following:
given the sentence "Example text", I would like to extract the final embedding for "Example text" and the internal embedding for each token (i.e., the embedding of "Example" and the embedding of "text").
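For concreteness, a minimal numeric sketch of that pooling (my own illustration; the two 3-dimensional token vectors are made up):

import numpy as np

token_embeddings = np.array([[0.1, 0.2, 0.3],   # hypothetical vector for "Example"
                             [0.4, 0.5, 0.6]])  # hypothetical vector for "text"
# Sum the per-token vectors and divide by the square root of the sentence length.
sentence_embedding = token_embeddings.sum(axis=0) / np.sqrt(len(token_embeddings))
print(sentence_embedding)  # approximately [0.354, 0.495, 0.636]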

I've also seen issue #403, but it's still not clear to me how to access the token-level embeddings.
Thank you,
Giulia

Relevant code

import tensorflow as tf
import tensorflow_hub as hub


def use_preprocessing(df, column):
    """Compute the embedding via the Universal Sentence Encoder
    for every sentence in the given column.

    Args:
        df: Dataframe
        column: column name to identify data to process
    """
    # The TF1-style hub.Module API requires graph mode, so disable eager execution.
    tf.compat.v1.disable_eager_execution()

    config = tf.compat.v1.ConfigProto()
    config.gpu_options.allow_growth = True

    module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3?tf-hub-format=compressed"
    embed = hub.Module(module_url)

    with tf.compat.v1.Session(config=config) as session:
        session.run([tf.compat.v1.global_variables_initializer(), tf.compat.v1.tables_initializer()])

        # Embed every sentence in the requested column (was `x[column]`, which is undefined).
        text_embedding = session.run(embed(list(df[column])))

    return text_embedding
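
A hypothetical usage sketch (not from the original issue; the dataframe contents and column name are illustrative):

import pandas as pd

df = pd.DataFrame({"text": ["Example text", "Another sentence to encode"]})
embeddings = use_preprocessing(df, "text")
print(embeddings.shape)  # (2, 512) -- this module returns 512-dimensional sentence embeddings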

Relevant log output

----

tensorflow_hub Version

0.13.0.dev (unstable development build)

TensorFlow Version

2.8 (latest stable release)

Other libraries

No response

Python Version

3.x

OS

Linux

@singhniraj08 singhniraj08 self-assigned this Feb 3, 2023
singhniraj08 commented Feb 3, 2023

@giulia95,

universal-sentence-encoder-large/3 has a transformer encoder and uses average pooling over all token embeddings at the last transformer layer as the output embedding.
Unfortunately, depending on the model's architecture and the ops used, it may be hard or impossible to get at the underlying token-level representations.

The Universal Sentence Encoder provides sentence encodings and cannot provide word-level encodings. For word-level encodings, you can use Word2Vec or GloVe as shown here, or follow this tutorial to create a custom word embedding layer.
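
As a rough illustration of the custom-embedding-layer route (my own sketch, not the linked tutorial; the vocabulary size, embedding dimension, and example sentences are assumptions):

import tensorflow as tf

sentences = tf.constant(["Example text", "Another example sentence"])

# Build a vocabulary from the corpus and map each token to an integer id.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=10000, output_mode="int")
vectorizer.adapt(sentences)

# Trainable lookup table: one 128-dimensional vector per token id.
embedding = tf.keras.layers.Embedding(input_dim=vectorizer.vocabulary_size(), output_dim=128)

token_ids = vectorizer(sentences)       # shape: (batch, tokens_per_sentence)
token_vectors = embedding(token_ids)    # shape: (batch, tokens_per_sentence, 128)

Pre-trained GloVe or Word2Vec vectors can also be loaded into such an Embedding layer's weights instead of training it from scratch.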

Hope that helps. Thank you!


giulia95 commented Feb 3, 2023

Thank you for the answer.
I'll look at the approaches you suggested.

@giulia95 giulia95 closed this as completed Feb 3, 2023