## Sentence Transformers

Sentence Transformers requires Python 3.6 or higher and PyTorch 1.6.0 or higher.
Install PyTorch and TorchVision with conda (see https://pytorch.org/), then install transformers and sentence-transformers with pip from the Anaconda Command Prompt:

conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch

python -m pip install transformers

python -m pip install sentence-transformers

In [1]:
# Check for GPU (using PyTorch)
# If available, tell PyTorch to use the GPU; otherwise use the CPU

import torch

if torch.cuda.is_available():    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

No GPU available, using the CPU instead.


In [2]:
import numpy as np
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('stsb-roberta-large')

In [3]:
# Calculate semantic similarity between two sentences
sentence1 = "I like Python because I can build AI applications"
sentence2 = "I like Python because I can do data analytics"

# Encode sentences to get their embeddings
embedding1 = model.encode(sentence1, convert_to_tensor=True)
embedding2 = model.encode(sentence2, convert_to_tensor=True)

# Compute similarity scores of two embeddings
cosine_score = util.pytorch_cos_sim(embedding1, embedding2)
print("Sentence 1:", sentence1)
print("Sentence 2:", sentence2)
print("Similarity score:", cosine_score.item())

Sentence 1: I like Python because I can build AI applications
Sentence 2: I like Python because I can do data analytics
Similarity score: 0.8015284538269043
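To make the similarity score above more concrete, here is a minimal sketch of the cosine similarity formula in plain Python. The function name `cosine_similarity` and the toy vectors are assumptions for illustration only; real sentence embeddings have many more dimensions, and the library's own implementation may treat edge cases such as zero vectors differently.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Design choice of this sketch: refuse zero vectors instead of guessing
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors (hypothetical, not real sentence embeddings)
toy_a = [1.0, 2.0, 3.0]
toy_b = [2.0, 4.0, 6.0]    # points in the same direction as toy_a
toy_c = [3.0, 0.0, -1.0]   # orthogonal to toy_a

print("a vs b:", cosine_similarity(toy_a, toy_b))
print("a vs c:", cosine_similarity(toy_a, toy_c))
```

The score depends only on the angle between the vectors, not on their lengths, which is why `toy_a` and `toy_b` score close to 1 even though one is twice as long as the other.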


### Semantic Text Similarity (STS)

In [4]:
# Print sentence embeddings
sentences = ["I like Python because I can build AI applications",
             "I like Python because I can do data analytics", 
             "The cat sits on the ground"]

sentence_embeddings = model.encode(sentences)

for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: I like Python because I can build AI applications
Embedding: [-0.46271408  0.74068356 -0.26615497 ...  1.6758347  -2.6872828
 -0.21768881]

Sentence: I like Python because I can do data analytics
Embedding: [-0.3860071   0.6501613  -0.30140767 ...  1.5000768  -2.2584777
  0.7605829 ]

Sentence: The cat sits on the ground
Embedding: [-0.23815349  0.52042085 -0.2830657  ...  0.09840204 -0.5524504
  0.40428722]



In [5]:
# Calculate semantic similarity between two lists of sentences
sentences1 = ["I like Python because I can build AI applications", 
              "The cat sits on the ground"]   
sentences2 = ["I like Python because I can do data analytics", 
              "The cat walks on the sidewalk"]

embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)

for i in range(len(sentences1)):
    for j in range(len(sentences2)):
        print("Sentence 1:", sentences1[i])
        print("Sentence 2:", sentences2[j])
        print("Similarity Score:", cosine_scores[i][j].item())
        print()

Sentence 1: I like Python because I can build AI applications
Sentence 2: I like Python because I can do data analytics
Similarity Score: 0.8015283942222595

Sentence 1: I like Python because I can build AI applications
Sentence 2: The cat walks on the sidewalk
Similarity Score: -0.03110990673303604

Sentence 1: The cat sits on the ground
Sentence 2: I like Python because I can do data analytics
Similarity Score: 0.11328627914190292

Sentence 1: The cat sits on the ground
Sentence 2: The cat walks on the sidewalk
Similarity Score: 0.4038146734237671



In [6]:
# Retrieve Top K most similar sentences from a corpus given a sentence
corpus = ["I like Python because I can build AI applications",
          "I like Python because I can do data analytics",
          "The cat sits on the ground",
          "The cat walks on the sidewalk"]

sentence = "I like Javascript because I can build web applications"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
sentence_embedding = model.encode(sentence, convert_to_tensor=True)

# top_k results to return
top_k=2

# Compute similarity scores of the sentence with the corpus
cos_scores = util.pytorch_cos_sim(sentence_embedding, corpus_embeddings)[0]

# Use torch.topk to get the top_k highest scores, in decreasing order
# (this also works when the scores are stored on a GPU)
top_results = torch.topk(cos_scores, k=top_k)
print("Sentence:", sentence, "\n")
print("Top", top_k, "most similar sentences in corpus:")

for score, idx in zip(top_results.values, top_results.indices):
    print(corpus[idx], "(Score: %.4f)" % score)

Sentence: I like Javascript because I can build web applications 

Top 2 most similar sentences in corpus:
I like Python because I can build AI applications (Score: 0.6696)
I like Python because I can do data analytics (Score: 0.5455)
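The top-k selection step in the cell above can be tried without loading a model. The sketch below is a plain-Python stand-in: `top_k_indices`, `toy_corpus`, and `toy_scores` are hypothetical names, and the scores are made up rather than produced by an embedding model.

```python
import heapq

def top_k_indices(scores, k):
    """Return the indices of the k largest scores, highest score first."""
    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])

# Made-up similarity scores between one query and four corpus entries
toy_corpus = ["doc A", "doc B", "doc C", "doc D"]
toy_scores = [0.10, 0.80, -0.05, 0.45]

for idx in top_k_indices(toy_scores, k=2):
    print(toy_corpus[idx], "(Score: %.4f)" % toy_scores[idx])
```

`heapq.nlargest` avoids sorting the whole score list, which may matter when the corpus is large and `k` is small.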


In [7]:
# Retrieve the top 5 most similar sentence pairs within a corpus
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.']

# Encode the corpus defined in this cell
embeddings = model.encode(corpus)

# Compute cosine similarity between all pairs
cos_sim = util.pytorch_cos_sim(embeddings, embeddings)

# Add each unique pair (i < j) to a list with its cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim)-1):
    for j in range(i+1, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])

# Sort the list by cosine similarity score, highest first
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(corpus[i], corpus[j], score))
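The pair-ranking logic above does not depend on the embedding model, so it can be checked on a small hand-written similarity matrix. This is a sketch under that assumption: `rank_pairs` and `toy_sim` are hypothetical names, and the matrix values are invented.

```python
from itertools import combinations

def rank_pairs(sim_matrix, top_n):
    """Return the top_n (score, i, j) tuples over all unique pairs i < j."""
    n = len(sim_matrix)
    # combinations(range(n), 2) yields each unordered pair exactly once,
    # skipping the diagonal (a sentence compared with itself)
    pairs = [(sim_matrix[i][j], i, j) for i, j in combinations(range(n), 2)]
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs[:top_n]

# Invented symmetric similarity matrix for three sentences
toy_sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]

for score, i, j in rank_pairs(toy_sim, top_n=2):
    print("sentence %d \t sentence %d \t %.4f" % (i, j, score))
```

Note that the number of pairs grows quadratically with the corpus size, so this exhaustive approach is only practical for small corpora.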


##### Pretrained sentence embedding models

The following models are optimized for Semantic Textual Similarity (STS).

stsb-roberta-large - STSb performance: 86.39
stsb-roberta-base - STSb performance: 85.44
stsb-bert-large - STSb performance: 85.29
stsb-distilbert-base - STSb performance: 85.16

The following models are recommended for various applications, including various similarity and retrieval tasks, as they were trained on millions of paraphrase examples. They are currently under development, but they outperform NLI/STSb models for many tasks.

paraphrase-distilroberta-base-v1 - Trained on large scale paraphrase data.
paraphrase-xlm-r-multilingual-v1 - Multilingual version of paraphrase-distilroberta-base-v1, trained on parallel data for 50+ languages. 

##### Embedding models for search queries (information retrieval)

The following models are optimized for question-answer retrieval in search queries. They were trained on MSMARCO Passage Ranking, a dataset with 500k real queries from Bing search.

msmarco-distilbert-base-v3: MRR@10: 33.13 on MS MARCO dev set
msmarco-roberta-base-ance-fristp: MRR@10: 33.03 on MS MARCO dev set
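MRR@10 (mean reciprocal rank at 10) scores each query by the reciprocal of the rank of the first relevant passage among the top 10 results, or 0 if none appears, and then averages over all queries. The sketch below illustrates this with invented rankings; `mrr_at_k` and the passage ids are hypothetical. The figures quoted above appear to be reported on a 0 to 100 scale, whereas the function returns a value between 0 and 1.

```python
def mrr_at_k(ranked_ids_per_query, relevant_ids_per_query, k=10):
    """Mean reciprocal rank of the first relevant result within the top k."""
    if len(ranked_ids_per_query) != len(relevant_ids_per_query):
        raise ValueError("need one ranking and one relevant set per query")
    if not ranked_ids_per_query:
        raise ValueError("need at least one query")
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        for rank, passage_id in enumerate(ranked[:k], start=1):
            if passage_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_ids_per_query)

# Invented rankings for three queries
toy_rankings = [["p3", "p1", "p7"],   # first relevant passage at rank 2
                ["p2", "p9", "p4"],   # first relevant passage at rank 1
                ["p5", "p6", "p8"]]   # no relevant passage retrieved
toy_relevant = [{"p1"}, {"p2"}, {"p0"}]

print("MRR@10:", mrr_at_k(toy_rankings, toy_relevant))
```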

The following model is trained on Google’s Natural Questions dataset, a dataset with 100k real queries from Google search together with the relevant passages from Wikipedia.

nq-distilbert-base-v1: MRR@10: 72.36 on NQ dev set (small)

In Dense Passage Retrieval (DPR) for Open-Domain Question Answering, Karpukhin et al. trained models based on Google’s Natural Questions dataset:

facebook-dpr-ctx_encoder-single-nq-base
facebook-dpr-question_encoder-single-nq-base

Karpukhin et al. also trained models on the combination of Natural Questions, TriviaQA, WebQuestions, and CuratedTREC.

facebook-dpr-ctx_encoder-multiset-base
facebook-dpr-question_encoder-multiset-base

In [8]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('msmarco-distilbert-base-v3')

query_embedding = model.encode('How big is London')
passage_embedding = model.encode('London has 9,787,426 inhabitants at the 2011 census')

print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))

Similarity: tensor([[0.6082]])


In [9]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('nq-distilbert-base-v1')

query_embedding = model.encode('How many people live in London?')

# The passages are encoded as [ [title1, text1], [title2, text2], ...]
passage_embedding = model.encode([['London', 'London has 9,787,426 inhabitants at the 2011 census.']])

print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))

Similarity: tensor([[0.6503]])


##### Multi-lingual models

The following models generate aligned vector spaces, i.e., similar inputs in different languages are mapped close in vector space. You do not need to specify the input language. 

Models for Semantic Similarity: Find semantically similar sentences within one language or across languages:

distiluse-base-multilingual-cased-v1: Multilingual knowledge distilled version of multilingual Universal Sentence Encoder. Supports 15 languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish.

distiluse-base-multilingual-cased-v2: Multilingual knowledge distilled version of multilingual Universal Sentence Encoder. Supports 50+ languages. However, performance on the languages listed for v1 above is slightly lower.

paraphrase-xlm-r-multilingual-v1 - Multilingual version of paraphrase-distilroberta-base-v1, trained on parallel data for 50+ languages.

stsb-xlm-r-multilingual: Produces embeddings similar to those of the stsb-bert-base model. Trained on parallel data for 50+ languages.

quora-distilbert-multilingual - Multilingual version of quora-distilbert-base. Fine-tuned with parallel data for 50+ languages.

Bitext mining: Describes the process of finding translated sentence pairs in two languages. The best model for this use-case is:

LaBSE: Finds translation pairs across 109 languages. Works less well for assessing the similarity of sentence pairs that are not translations of each other.
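As a rough illustration of bitext mining, the sketch below pairs sentences by mutual nearest neighbour in a cross-language similarity matrix: a pair is kept only when each sentence is the other's best match. This is a simplified heuristic chosen for illustration, not necessarily what LaBSE-based pipelines do; real systems often use more elaborate scoring. The name `mutual_nearest_pairs` and the matrix values are hypothetical.

```python
def mutual_nearest_pairs(sim):
    """sim[i][j] is the similarity of source sentence i and target sentence j.

    Returns (i, j) pairs where j is the best target for i AND i is the
    best source for j.
    """
    n_src = len(sim)
    n_tgt = len(sim[0])
    best_tgt_for_src = [max(range(n_tgt), key=lambda j: sim[i][j])
                        for i in range(n_src)]
    best_src_for_tgt = [max(range(n_src), key=lambda i: sim[i][j])
                        for j in range(n_tgt)]
    return [(i, j) for i, j in enumerate(best_tgt_for_src)
            if best_src_for_tgt[j] == i]

# Invented similarities: 3 source sentences (rows) vs 3 target sentences (columns)
toy_sim = [
    [0.90, 0.20, 0.10],
    [0.30, 0.80, 0.40],
    [0.85, 0.10, 0.20],  # also prefers target 0, but target 0 prefers source 0
]

print(mutual_nearest_pairs(toy_sim))
```

Requiring agreement in both directions filters out sentences, such as the third source row here, that have no true translation in the other set.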

For details, see https://www.sbert.net/docs/pretrained_models.html

### Machine translation

To translate between two languages, say English to German, we need a language model trained for this task. T5 is a general text-to-text model pretrained on the large C4 (Colossal Clean Crawled Corpus) dataset, and English-to-German translation is one of the tasks it was trained on.

In [12]:
from transformers import pipeline
translation = pipeline("translation_en_to_de", model="t5-base", tokenizer="t5-base")

In [13]:
text = "I like to study Data Science and Machine Learning"
translated_text = translation(text, max_length=40)[0]['translation_text']
print(translated_text)

Ich studiere gerne Datenwissenschaft und maschinelles Lernen


The Hugging Face community (https://huggingface.co/models?filter=translation) has created a set of pretrained language models for machine translation between different language pairs.

For example, English-to-Chinese translation using Helsinki-NLP's pretrained model on Hugging Face (https://huggingface.co/Helsinki-NLP/opus-mt-en-zh):

In [14]:
# AutoModelWithLMHead is deprecated; translation models are sequence-to-sequence
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
translation = pipeline("translation_en_to_zh", model=model, tokenizer=tokenizer)

text = "I like to study Data Science and Machine Learning"
translated_text = translation(text, max_length=40)[0]['translation_text']
print(translated_text)



我喜欢学习数据科学和机器学习
