# Lesson 4: Sentence Embeddings

- In the classroom, the libraries are already installed for you.
- If you would like to run this code on your own machine, you can install the following:
```
    !pip install sentence-transformers
```

- Here is some code that suppresses warning messages.

In [1]:
from transformers.utils import logging
logging.set_verbosity_error()

In [4]:
# !pip install sentence_transformers

### Build the `sentence embedding` pipeline using the 🤗 Sentence Transformers library

In [5]:
from sentence_transformers import SentenceTransformer

In [6]:
model = SentenceTransformer("all-MiniLM-L6-v2")

More info on [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).
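
To give a rough idea of how a single sentence vector can come out of a model that works on tokens, here is a minimal sketch of mean pooling, the pooling strategy described on the model card linked above. The function name `mean_pool` and the tiny 2-dimensional token vectors are made up for illustration; the real model works on 384-dimensional vectors and does this internally when you call `model.encode`.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                total[d] += vec[d]
    return [x / count for x in total]

# Hypothetical token vectors: three real tokens followed by one padding token.
tokens = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 2.0]
```

The padding token is excluded by the mask, so it does not distort the average.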

In [7]:
sentences1 = ['The cat sits outside',
              'A man is playing guitar',
              'The movies are awesome']

In [8]:
embeddings1 = model.encode(sentences1, convert_to_tensor=True)

In [9]:
embeddings1

tensor([[ 0.1392,  0.0030,  0.0470,  ...,  0.0641, -0.0163,  0.0636],
        [ 0.0227, -0.0014, -0.0056,  ..., -0.0225,  0.0846, -0.0283],
        [-0.1043, -0.0628,  0.0093,  ...,  0.0020,  0.0653, -0.0150]],
       device='cuda:0')

In [10]:
sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

In [11]:
embeddings2 = model.encode(sentences2,
                           convert_to_tensor=True)

In [12]:
print(embeddings2)

tensor([[ 0.0163, -0.0700,  0.0384,  ...,  0.0447,  0.0254, -0.0023],
        [ 0.0054, -0.0920,  0.0140,  ...,  0.0167, -0.0086, -0.0424],
        [-0.0842, -0.0592, -0.0010,  ..., -0.0157,  0.0764,  0.0389]],
       device='cuda:0')


* Calculate the cosine similarity between each pair of sentence embeddings as a measure of how similar the sentences are to each other.
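
As a reminder of what cosine similarity measures, here is a small from-scratch sketch: the dot product of two vectors divided by the product of their lengths. The helper `cosine_similarity` and the toy 3-dimensional vectors are illustrative assumptions, not real embeddings; the library utility used in the next cells does the same computation for every pair of rows at once.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (made up for illustration).
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, so similarity is 1
c = [-3.0, 0.0, 1.0]   # orthogonal to a, so similarity is 0

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(round(sim_ab, 4), round(sim_ac, 4))
```

A score near 1 means the vectors point the same way, a score near 0 means they are unrelated, and a negative score means they point in opposite directions.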

In [13]:
from sentence_transformers import util

In [14]:
cosine_scores = util.cos_sim(embeddings1,embeddings2)

In [15]:
print(cosine_scores)

tensor([[ 0.2838,  0.1310, -0.0029],
        [ 0.2277, -0.0327, -0.0136],
        [-0.0124, -0.0465,  0.6571]], device='cuda:0')


In [16]:
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i],
                                                 sentences2[i],
                                                 cosine_scores[i][i]))

The cat sits outside 		 The dog plays in the garden 		 Score: 0.2838
A man is playing guitar 		 A woman watches TV 		 Score: -0.0327
The movies are awesome 		 The new movie is so great 		 Score: 0.6571
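
The loop above only reads the diagonal of the score matrix, that is, sentence `i` of the first list against sentence `i` of the second. The full matrix also holds every cross pairing. The sketch below copies the printed scores into a plain list (so it runs without the model) and picks the highest-scoring partner for each sentence in `sentences1`; the variable names `scores` and `best_matches` are illustrative.

```python
sentences1 = ['The cat sits outside',
              'A man is playing guitar',
              'The movies are awesome']
sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

# Values copied from the printed cosine_scores tensor
# (rows: sentences1, columns: sentences2).
scores = [
    [0.2838, 0.1310, -0.0029],
    [0.2277, -0.0327, -0.0136],
    [-0.0124, -0.0465, 0.6571],
]

best_matches = []
for i, row in enumerate(scores):
    # Index of the largest score in this row.
    j = max(range(len(row)), key=lambda k: row[k])
    best_matches.append((sentences1[i], sentences2[j], row[j]))

for s1, s2, score in best_matches:
    print(f"{s1} -> {s2} (score {score:.4f})")
```

Note that the guitar sentence scores higher against the dog sentence (0.2277) than against its diagonal partner about watching TV (-0.0327), so the diagonal is not always the best match.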


### Try it yourself!
- Try this model with your own sentences!

In [18]:
# put them all together
from sentence_transformers import util
sentences1 = ['Naya is better than Chipotle',
              'I wish there were more Costco in NYC',
              'K-town is great']
sentences2 = ['Starbucks taests better than Pret',
              'H-mart is overpriced',
              'H-mart is overpriced']
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
print(embeddings1)
embeddings2 = model.encode(sentences2,
                           convert_to_tensor=True)
print(embeddings2)

# sentence embeddings
cosine_scores = util.cos_sim(embeddings1,embeddings2)
print("sentence embedding scores: ", cosine_scores)

for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i],
                                                 sentences2[i],
                                                 cosine_scores[i][i]))

tensor([[-0.0003, -0.0198, -0.0940,  ..., -0.0773,  0.0919, -0.0069],
        [ 0.0325, -0.0237,  0.0138,  ..., -0.1187, -0.0296,  0.0395],
        [ 0.0966, -0.0168, -0.0293,  ..., -0.0060, -0.0449,  0.0248]],
       device='cuda:0')
tensor([[-0.0186, -0.0360,  0.0594,  ..., -0.1232,  0.1186,  0.0500],
        [ 0.0102,  0.0543,  0.0160,  ..., -0.1601,  0.0190,  0.1033],
        [ 0.0102,  0.0543,  0.0160,  ..., -0.1601,  0.0190,  0.1033]],
       device='cuda:0')
sentence embedding scores:  tensor([[0.1965, 0.0670, 0.0670],
        [0.2911, 0.4091, 0.4091],
        [0.1988, 0.2046, 0.2046]], device='cuda:0')
Naya is better than Chipotle 		 Starbucks taests better than Pret 		 Score: 0.1965
I wish there were more Costco in NYC 		 H-mart is overpriced 		 Score: 0.4091
K-town is great 		 H-mart is overpriced 		 Score: 0.2046


In [21]:
# put them all together
from sentence_transformers import util
sentences1 = ['Naya is better than Chipotle',
              'I wish there were more Costco in NYC',
              'K-town is great',
              'Korea town is great']
sentences2 = ['Naya is healthier',
              'Costco trip without a car is impossible',
              'Korean sunscreen is great',
              'Korean sunscreen is great']

# Spelling out "K-town" as "Korea town" helps the model capture the meaning of the phrase.
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
print(embeddings1)
embeddings2 = model.encode(sentences2,
                           convert_to_tensor=True)
print(embeddings2)

# sentence embeddings
cosine_scores = util.cos_sim(embeddings1,embeddings2)
print("sentence embedding scores: ", cosine_scores)

for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i],
                                                 sentences2[i],
                                                 cosine_scores[i][i]))

tensor([[-0.0003, -0.0198, -0.0940,  ..., -0.0773,  0.0919, -0.0069],
        [ 0.0325, -0.0237,  0.0138,  ..., -0.1187, -0.0296,  0.0395],
        [ 0.0966, -0.0168, -0.0293,  ..., -0.0060, -0.0449,  0.0248],
        [ 0.0724,  0.0434,  0.0141,  ..., -0.0094, -0.0881, -0.0014]],
       device='cuda:0')
tensor([[ 0.0309,  0.0627, -0.0260,  ..., -0.1225,  0.0599, -0.0099],
        [ 0.0256,  0.0751, -0.0213,  ..., -0.0193, -0.0408, -0.0247],
        [-0.0183,  0.0822,  0.0673,  ..., -0.0484, -0.0403,  0.0854],
        [-0.0183,  0.0822,  0.0673,  ..., -0.0484, -0.0403,  0.0854]],
       device='cuda:0')
sentence embedding scores:  tensor([[0.6487, 0.0207, 0.0760, 0.0760],
        [0.1200, 0.4873, 0.0754, 0.0754],
        [0.0998, 0.1216, 0.2929, 0.2929],
        [0.0952, 0.1368, 0.4173, 0.4173]], device='cuda:0')
Naya is better than Chipotle 		 Naya is healthier 		 Score: 0.6487
I wish there were more Costco in NYC 		 Costco trip without a car is impossible 		 Score: 0.4873
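K-town is great 		 Korean sunscreen is great 		 Score: 0.2929
Korea town is great 		 Korean sunscreen is great 		 Score: 0.4173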

In [26]:
# put them all together
from sentence_transformers import util
sentences1 = ['Naya, a food chain restaurant, is better than Chipotle',
              'I wish there were more Costco in NYC',
              'K-town is great',
              'Korea town is great']
sentences2 = ['Naya is healthier',  # the added description in sentences1 lowers this pair's score (0.6487 -> 0.5724)
              'Costco trip without a car is impossible',
              'Korean sunscreen is great',
              'Korean sunscreen is great']

# Spelling out "K-town" as "Korea town" helps the model capture the meaning of the phrase.
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
print(embeddings1)
embeddings2 = model.encode(sentences2,
                           convert_to_tensor=True)
print(embeddings2)

# sentence embeddings
cosine_scores = util.cos_sim(embeddings1,embeddings2)
print("sentence embedding scores: ", cosine_scores)

for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i],
                                                 sentences2[i],
                                                 cosine_scores[i][i]))

tensor([[ 0.0113, -0.0253, -0.0405,  ..., -0.0835,  0.0565, -0.0231],
        [ 0.0325, -0.0237,  0.0138,  ..., -0.1187, -0.0296,  0.0395],
        [ 0.0966, -0.0168, -0.0293,  ..., -0.0060, -0.0449,  0.0248],
        [ 0.0724,  0.0434,  0.0141,  ..., -0.0094, -0.0881, -0.0014]],
       device='cuda:0')
tensor([[ 0.0309,  0.0627, -0.0260,  ..., -0.1225,  0.0599, -0.0099],
        [ 0.0256,  0.0751, -0.0213,  ..., -0.0193, -0.0408, -0.0247],
        [-0.0183,  0.0822,  0.0673,  ..., -0.0484, -0.0403,  0.0854],
        [-0.0183,  0.0822,  0.0673,  ..., -0.0484, -0.0403,  0.0854]],
       device='cuda:0')
sentence embedding scores:  tensor([[0.5724, 0.0998, 0.0829, 0.0829],
        [0.1200, 0.4873, 0.0754, 0.0754],
        [0.0998, 0.1216, 0.2929, 0.2929],
        [0.0952, 0.1368, 0.4173, 0.4173]], device='cuda:0')
Naya, a food chain restaurant, is better than Chipotle 		 Naya is healthier 		 Score: 0.5724
I wish there were more Costco in NYC 		 Costco trip without a car is impossible 		 