# Lesson 4: Sentence Embeddings

- In the classroom, the libraries are already installed for you.
- If you would like to run this code on your own machine, you can install the following:
``` 
    !pip install sentence-transformers
```

- Here is some code that suppresses warning messages.

In [1]:
from transformers.utils import logging
logging.set_verbosity_error()

### Build the `sentence embedding` pipeline using the 🤗 Sentence Transformers library

In [2]:
from sentence_transformers import SentenceTransformer

In [3]:
model = SentenceTransformer("all-MiniLM-L6-v2")

More info on [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

In [4]:
sentences1 = ['The cat sits outside',
              'A man is playing guitar',
              'The movies are awesome']

In [5]:
embeddings1 = model.encode(sentences1, convert_to_tensor=True)

In [6]:
embeddings1

tensor([[ 0.1392,  0.0030,  0.0470,  ...,  0.0641, -0.0163,  0.0636],
        [ 0.0227, -0.0014, -0.0056,  ..., -0.0225,  0.0846, -0.0283],
        [-0.1043, -0.0628,  0.0093,  ...,  0.0020,  0.0653, -0.0150]])

In [7]:
embeddings1.shape

torch.Size([3, 384])

In [8]:
sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

In [9]:
embeddings2 = model.encode(sentences2, 
                           convert_to_tensor=True)

In [10]:
print(embeddings2)

tensor([[ 0.0163, -0.0700,  0.0384,  ...,  0.0447,  0.0254, -0.0023],
        [ 0.0054, -0.0920,  0.0140,  ...,  0.0167, -0.0086, -0.0424],
        [-0.0842, -0.0592, -0.0010,  ..., -0.0157,  0.0764,  0.0389]])


* Calculate the cosine similarity between two sentences as a measure of how similar they are to each other.

In [11]:
from sentence_transformers import util

In [12]:
cosine_scores = util.cos_sim(embeddings1,embeddings2)

In [13]:
print(cosine_scores)

tensor([[ 0.2838,  0.1310, -0.0029],
        [ 0.2277, -0.0327, -0.0136],
        [-0.0124, -0.0465,  0.6571]])


In [14]:
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i],
                                                 sentences2[i],
                                                 cosine_scores[i][i]))

The cat sits outside 		 The dog plays in the garden 		 Score: 0.2838
A man is playing guitar 		 A woman watches TV 		 Score: -0.0327
The movies are awesome 		 The new movie is so great 		 Score: 0.6571
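
As a rough illustration of what `util.cos_sim` computes, the sketch below reimplements cosine similarity in plain NumPy: each row is scaled to unit length, and the dot products between rows give the scores. The helper name `cosine_similarity_matrix` and the toy vectors are made up for this example; they are not part of the lesson's code or of the Sentence Transformers library.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Return the matrix of cosine similarities between rows of a and rows of b.

    Hypothetical helper for illustration only.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Scale every row to unit length so the dot product equals the cosine.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Toy 2-dimensional "embeddings" (assumed values, not model output).
toy_a = np.array([[1.0, 0.0], [1.0, 1.0]])
toy_b = np.array([[2.0, 0.0], [0.0, 3.0]])

scores = cosine_similarity_matrix(toy_a, toy_b)
print(scores)
```

Entry `[i][j]` compares row `i` of the first input with row `j` of the second, which is why the loop above reads the diagonal `cosine_scores[i][i]` to score each sentence pair.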


### Try it yourself! 
- Try this model with your own sentences!

#### Sentences with similar words vs similar meanings

In [15]:
sent1 = "Hugging Face is awesome!"
sent2 = "She is hugging him. It's awesome!"

In [16]:
embeddings1 = model.encode(sent1, convert_to_tensor=True)
embeddings2 = model.encode(sent2, convert_to_tensor=True)
embeddings2.shape

torch.Size([384])

In [17]:
cosine_scores = util.cos_sim(embeddings1,embeddings2)
print(cosine_scores)

tensor([[0.6298]])


In [21]:
embeddings1 = model.encode("How are you?", convert_to_tensor=False)
embeddings2 = model.encode("How is it going?", convert_to_tensor=False)
embeddings1.shape

(384,)

In [22]:
type(embeddings1)

numpy.ndarray

In [23]:
cosine_scores = util.cos_sim(embeddings1,embeddings2)
print(cosine_scores)

tensor([[0.5947]])


#### Try a different model

In [24]:
model = SentenceTransformer("all-mpnet-base-v2")

In [25]:
embeddings1 = model.encode("How are you?", convert_to_tensor=False)
embeddings2 = model.encode("How is it going?", convert_to_tensor=False)
cosine_scores = util.cos_sim(embeddings1,embeddings2)
print(cosine_scores)

tensor([[0.5442]])


In [27]:
embeddings1 = model.encode("How are you?", convert_to_tensor=True)
embeddings2 = model.encode("How you doing?", convert_to_tensor=True)
cosine_scores = util.cos_sim(embeddings1,embeddings2)
print(cosine_scores)

tensor([[0.8115]])


## Other Tasks

### Text Classification

In [28]:
from transformers import pipeline

In [29]:
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english")


In [30]:
classifier("I like pandas")

[{'label': 'POSITIVE', 'score': 0.9990052580833435}]

In [31]:
classifier("I hate it")

[{'label': 'NEGATIVE', 'score': 0.9996398687362671}]

In [32]:
classifier("I don't know")

[{'label': 'NEGATIVE', 'score': 0.9974709749221802}]

In [39]:
classifier("Not sure")

[{'label': 'NEGATIVE', 'score': 0.9996809959411621}]

### Zero-shot classification

In [34]:
zero_shot_classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli")

In [35]:
sequence = "one day I will see the world"
candidate_labels = ["travel", "cooking", "dancing"]

In [36]:
classification = zero_shot_classifier(
    sequence,
    candidate_labels,
    multi_label=True)
print(classification)

{'sequence': 'one day I will see the world', 'labels': ['travel', 'dancing', 'cooking'], 'scores': [0.9945111274719238, 0.005706233438104391, 0.0018193295691162348]}


In [37]:
print(classification["labels"])
print(classification["scores"])

['travel', 'dancing', 'cooking']
[0.9945111274719238, 0.005706233438104391, 0.0018193295691162348]


In [38]:
sum(classification["scores"])

1.0020366904791445
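
The scores add up to slightly more than 1 because `multi_label=True` scores each label independently instead of making the labels compete with each other. The sketch below illustrates the difference with NumPy. The logits are invented for this example (they are not taken from `facebook/bart-large-mnli`), and the `softmax` helper is a hypothetical stand-in for what the pipeline does internally.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Assumed (contradiction, entailment) logits, one row per candidate label.
logits = np.array([[-3.0, 2.5],
                   [ 1.0, -2.0],
                   [ 2.0, -3.0]])

# Multi-label style: entailment vs. contradiction, separately for each label.
multi_label_scores = softmax(logits, axis=1)[:, 1]

# Single-label style: entailment logits compete across all labels.
single_label_scores = softmax(logits[:, 1])

print(multi_label_scores, multi_label_scores.sum())
print(single_label_scores, single_label_scores.sum())
```

Only the single-label scores are guaranteed to sum to 1; the multi-label scores are independent probabilities, so their sum can land above or below 1.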

### Question answering

In [40]:
question_answerer = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad")

In [43]:
context = """
Extractive QA is when a model identifies and extracts the exact answer span directly from the given text.
A good example is SQuAD (Stanford Question Answering Dataset), which contains over 100,000 questions based on Wikipedia.
"""
question = """
What is a good example of a question answering dataset?
"""

In [44]:
result = question_answerer(question=question,
                          context=context)
print(result)

{'score': 0.9811776876449585, 'start': 125, 'end': 130, 'answer': 'SQuAD'}
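
The `start` and `end` fields are character offsets into `context`, so slicing the context with them should give back the extracted answer. The sketch below checks this without rerunning the model: the `result` dictionary is copied from the output above, and the variable name `extracted` is introduced only for this example.

```python
# Same context string as in the cell above (note the leading newline).
context = """
Extractive QA is when a model identifies and extracts the exact answer span directly from the given text.
A good example is SQuAD (Stanford Question Answering Dataset), which contains over 100,000 questions based on Wikipedia.
"""

# Copied from the printed pipeline output above, not recomputed here.
result = {'score': 0.9811776876449585, 'start': 125, 'end': 130, 'answer': 'SQuAD'}

# Slice the context with the character offsets returned by the pipeline.
extracted = context[result['start']:result['end']]
print(extracted)
```

This is what "extractive" means in practice: the answer is a span of the original text, not newly generated wording.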
