# Getting started with Text Embeddings

In [1]:
!pip install google-cloud-aiplatform



In [2]:
import os
import google.auth

# Get the path to the JSON key file.
key_file_path = "/content/tensile-will-381916-c5a7cceb1b3c.json"

# Get the Google Cloud Platform credentials.
credentials, project_id = google.auth.load_credentials_from_file(key_file_path)

# Print the credentials and project ID.
print("Credentials:", credentials)
print("Project ID:", project_id)


Credentials: <google.oauth2.service_account.Credentials object at 0x79d7842d0a30>
Project ID: tensile-will-381916


In [3]:
REGION = 'us-central1'

In [4]:
# Import and initialize the Vertex AI Python SDK

import vertexai
vertexai.init(project = project_id,
              location = REGION,
              credentials = credentials)

In [5]:
from vertexai.language_models import TextEmbeddingModel

The pre-trained Vertex AI text embedding models relevant to this notebook are:

textembedding-gecko@001: A stable, version-pinned text embedding model. It returns 768-dimensional vectors and is the model used in the cells below.\
textembedding-gecko-multilingual@latest: A variant of the same model family intended for multilingual text.

In [6]:
embedding_model = TextEmbeddingModel.from_pretrained(
    "textembedding-gecko@001")

In [7]:
embedding = embedding_model.get_embeddings(
    ["life"])

In [8]:
vector = embedding[0].values
print(f"Length = {len(vector)}")
print(vector[:10])

Length = 768
[-0.006005102302879095, 0.015532972291111946, -0.030447669327259064, 0.05322219058871269, 0.014444807544350624, -0.0542873740196228, 0.045140113681554794, 0.02127358317375183, -0.06537645310163498, 0.019103270024061203]


In [9]:
embedding = embedding_model.get_embeddings(["what is a life?"])

In [10]:
vector = embedding[0].values
print("Length = ",len(vector))
print(vector[:10])

Length =  768
[0.02044595032930374, 0.04214783012866974, -0.013931035064160824, 0.011491508223116398, -0.017616981640458107, -0.0008076780359260738, 0.05849781632423401, 0.016730481758713722, -0.047019511461257935, 8.595037797931582e-05]



Each embedding has length 768 because textembedding-gecko@001 maps every input, whether a single word or a full sentence, to a vector in a 768-dimensional space.

Text embedding models use high-dimensional spaces mainly because more dimensions give the model more room to encode distinctions in meaning, so that texts differing in topic, tone, or context can be placed apart from one another.

768 is not the only possible embedding dimension. Some models use fewer dimensions, such as 128 or 256, and others use more, such as 1024 or 2048. The dimension is fixed by the model's architecture; larger vectors can carry more information but cost more to store and compare.


In [11]:
emb_1 = embedding_model.get_embeddings(["How is your day?"])
emb_2 = embedding_model.get_embeddings(["What is the temperature in room?"])
emb_3 = embedding_model.get_embeddings(["Have you ever watched jurassic world?"])

In [12]:
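# Wrap each vector in a list: cosine_similarity expects 2D inputs
# of shape (n_samples, n_features).
vec_1 = [emb_1[0].values]
vec_2 = [emb_2[0].values]
vec_3 = [emb_3[0].values]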
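
The three separate `get_embeddings` calls above could also be sent as one request, since the method takes a list of texts. The helper below is a minimal sketch of that idea, not part of the SDK: `embed_texts` is a hypothetical name, and `model` is assumed to be any object whose `get_embeddings` returns items with a `.values` attribute, such as `embedding_model`. The service may cap how many texts one request can carry, so long lists might need to be split into chunks.

```python
import numpy as np

def embed_texts(model, texts):
    """Embed several texts in one get_embeddings call; return a 2D array."""
    # `model` is assumed to expose get_embeddings(list_of_str), as the
    # embedding_model created earlier in this notebook does.
    embeddings = model.get_embeddings(list(texts))
    # One row per input text, one column per embedding dimension.
    return np.array([emb.values for emb in embeddings])
```

With the notebook's model this would be called as `embed_texts(embedding_model, ["How is your day?", "What is the temperature in room?"])`, and a slice such as `vectors[0:1]` keeps the 2D shape that `cosine_similarity` expects.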

In [13]:
print(vec_1)
print(vec_2)
print(vec_3)

[[0.017992012202739716, 0.03759821876883507, 0.03552056476473808, 0.011872878298163414, -0.01051289215683937, 0.001496658893302083, -0.0019833100959658623, 0.008424529805779457, -0.019770337268710136, 0.06383293122053146, -0.014201260171830654, -0.010678608901798725, 0.05269564315676689, -0.0447530671954155, -0.022143255919218063, -0.021399609744548798, -0.05021602660417557, -0.01311811525374651, 0.005040123127400875, 0.013418120332062244, -0.08280237764120102, 0.024037359282374382, -0.02780505083501339, 0.004317035432904959, -0.04585236683487892, -0.0898003950715065, 0.047838810831308365, -0.023063112050294876, -0.04059545323252678, -0.058300457894802094, 0.03932586684823036, -0.005100395530462265, -0.022614212706685066, -0.061345845460891724, 0.010922761633992195, 0.04418500140309334, 0.004976274445652962, -0.018539493903517723, -0.0035041384398937225, 0.025289662182331085, 0.03906964883208275, 0.005395546089857817, 0.008701755665242672, -0.02095331810414791, -0.024266386404633522, 0

In [14]:
from sklearn.metrics.pairwise import cosine_similarity

In [15]:
print(cosine_similarity(vec_1,vec_2))
print(cosine_similarity(vec_2,vec_3))
print(cosine_similarity(vec_1,vec_3))

[[0.66062075]]
[[0.54585515]]
[[0.53823373]]
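
Of the three pairs, "How is your day?" and "What is the temperature in room?" score highest. Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores their magnitudes. As a rough sketch of what `cosine_similarity` computes for a single pair, the hypothetical helper `cosine_sim` below does the same arithmetic with NumPy on toy 2D vectors rather than real embeddings.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_sim([1.0, 0.0], [1.0, 0.0]))   # same direction: 1.0
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))   # perpendicular: 0.0
print(cosine_sim([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1.0
```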


In [16]:
in_1 = "The kids play in the park."
in_2 = "The play was for kids in the park."

In [17]:
lt_1 = in_1.lower()
lt_2 = in_2.lower()
print(lt_1)
print(lt_2)

the kids play in the park.
the play was for kids in the park.


In [18]:
!pip install nltk



In [19]:
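import nltk
# 'all' downloads every NLTK package; the cells below only use the
# tokenizer data and the 'stopwords' corpus.
nltk.download('all')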

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   U

True

In [20]:
from nltk.tokenize import word_tokenize

In [21]:
t_1 = word_tokenize(lt_1)
t_2 = word_tokenize(lt_2)

In [22]:
import string

In [23]:
tok_no_punct_1 = [i for i in t_1 if i not in string.punctuation]
tok_no_punct_2 = [i for i in t_2 if i not in string.punctuation]
print(tok_no_punct_1)

['the', 'kids', 'play', 'in', 'the', 'park']


In [24]:
from nltk.corpus import stopwords
sw = stopwords.words('english')
print(sw)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [25]:
rem_sw_in_1 = [i for i in tok_no_punct_1 if i not in sw]
rem_sw_in_2 = [i for i in tok_no_punct_2 if i not in sw]
print(rem_sw_in_1)
print(rem_sw_in_2)

['kids', 'play', 'park']
['play', 'kids', 'park']
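
The lowercase, tokenize, strip-punctuation, and stop-word steps from cells 17 to 25 can be folded into one function. The version below is a simplified sketch and swaps in different techniques so that it runs without NLTK: it splits on whitespace instead of calling `word_tokenize`, and `STOP_WORDS` is a hypothetical, deliberately tiny stop-word set rather than NLTK's English list.

```python
import string

# Hypothetical, tiny stop-word set for illustration only; the notebook
# uses NLTK's much longer English list.
STOP_WORDS = {"the", "in", "was", "for"}

def preprocess(sentence, stop_words=STOP_WORDS):
    """Lowercase, drop punctuation, split on whitespace, remove stop words."""
    no_punct = sentence.lower().translate(
        str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in stop_words]

print(preprocess("The kids play in the park."))          # ['kids', 'play', 'park']
print(preprocess("The play was for kids in the park."))  # ['play', 'kids', 'park']
```

Both sentences reduce to the same three words, differing only in order.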


In [27]:
embeddings_1 = [emb.values for emb in embedding_model.get_embeddings(rem_sw_in_1)]
embeddings_2 = [emb.values for emb in embedding_model.get_embeddings(rem_sw_in_2)]

In [28]:
print(embeddings_1)
print(embeddings_2)

[[-0.03156903386116028, 0.008489725179970264, 0.017588036134839058, 0.032134201377630234, 0.03936800733208656, -0.09866096824407578, 0.021243518218398094, -0.01753906160593033, -0.02377880923449993, -0.009624628350138664, 0.04115459695458412, -0.032025296241045, 0.04308250546455383, 0.006345911417156458, -0.014998256228864193, -0.007242536637932062, -0.03459545597434044, -0.01790512725710869, 0.015554779209196568, 0.02111588604748249, -0.07564287632703781, 0.029853027313947678, -0.04521135985851288, 0.000996007234789431, -0.0464370995759964, -0.13004381954669952, 0.09672889113426208, 0.01813829503953457, -0.019238680601119995, -0.04046168550848961, 0.024772346019744873, 0.0021819681860506535, -0.013148401863873005, -0.014166107401251793, 0.016094300895929337, 0.06669357419013977, 0.01237595733255148, -0.020037582144141197, 0.043719708919525146, 0.03883647173643112, 0.006853623781353235, 0.024056946858763695, -0.04831813648343086, -0.035331811755895615, -0.020588163286447525, 0.04058500

Use NumPy's `np.stack` to convert each list of lists into a 2D array with 3 rows and 768 columns.

In [36]:
import numpy as np
emb_array_1 = np.stack(embeddings_1)
emb_array_2 = np.stack(embeddings_2)
print(emb_array_1.shape)
print(emb_array_2.shape)

(3, 768)
(3, 768)


Take the average across the 3 word embeddings.\
You'll get a single embedding of length 768 for each sentence.

In [38]:
emb_1_mean = emb_array_1.mean(axis=0)
emb_2_mean = emb_array_2.mean(axis=0)
print(emb_1_mean.shape)
print(emb_2_mean.shape)

(768,)
(768,)


Check that averaging the word embeddings yields identical vectors for the two sentences. After stop-word removal both sentences contain the same three words, and a mean ignores the order of its inputs.

In [39]:
print(emb_1_mean[:4])
print(emb_2_mean[:4])

[-0.00385805 -0.00522636  0.00574341  0.03331106]
[-0.00385805 -0.00522636  0.00574341  0.03331106]
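
The identical averages follow from the arithmetic alone, not from anything specific to the embedding model. The sketch below illustrates this with made-up random 4-dimensional vectors standing in for the word embeddings (`word_vecs` is a hypothetical lookup table, not model output): averaging the same three vectors in a different order gives the same result.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up stand-ins for word embeddings: one random vector per word.
word_vecs = {w: rng.normal(size=4) for w in ["kids", "play", "park"]}

# Same words, different order, as in the two preprocessed sentences.
mean_1 = np.stack([word_vecs[w] for w in ["kids", "play", "park"]]).mean(axis=0)
mean_2 = np.stack([word_vecs[w] for w in ["play", "kids", "park"]]).mean(axis=0)

print(np.allclose(mean_1, mean_2))  # True
```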


Get sentence embeddings from the model.\
These sentence embeddings account for word order and context.\
Verify that the sentence embeddings are not the same.

In [40]:
print(in_1)
print(in_2)

The kids play in the park.
The play was for kids in the park.


In [41]:
embedding_1 = embedding_model.get_embeddings([in_1])
embedding_2 = embedding_model.get_embeddings([in_2])

In [42]:
vector_1 = embedding_1[0].values
print(vector_1[:4])
vector_2 = embedding_2[0].values
print(vector_2[:4])

[0.0039385221898555756, -0.020830577239394188, -0.002994248876348138, -0.007580515928566456]
[-0.01565515622496605, -0.012884826399385929, 0.01229254249483347, -0.0005865463172085583]
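
The difference can be checked programmatically rather than by eye. The sketch below copies the first four components printed above into arrays and compares them with `np.allclose`; with the full vectors available, the same call on `vector_1` and `vector_2` would apply.

```python
import numpy as np

# First four components of each sentence embedding, as printed above.
prefix_1 = np.array([0.0039385221898555756, -0.020830577239394188,
                     -0.002994248876348138, -0.007580515928566456])
prefix_2 = np.array([-0.01565515622496605, -0.012884826399385929,
                     0.01229254249483347, -0.0005865463172085583])

print(np.allclose(prefix_1, prefix_2))  # False
```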
