# Deriving Sentence Embeddings with Sentence-Transformers

This notebook collects references on sentence embedding techniques and then uses a pre-trained Sentence-Transformers model to map sentences to vectors.


## References for Sentence Embedding
- [Document Embedding Techniques - 2019](https://towardsdatascience.com/document-embedding-techniques-fed3e7a6a25d)
    - Classic techniques
        * Bag-of-words
        * Latent Dirichlet Allocation (LDA)
    - Unsupervised document embedding techniques
        * n-gram embeddings
        * Averaging word embeddings
        * Sent2Vec
        * Paragraph vectors (doc2vec)
        * Doc2VecC
        * Skip-thought vectors
        * FastSent
        * Quick-thought vectors
        * Word Mover’s Embedding (WME)
        * Sentence-BERT (SBERT)
    - Supervised document embedding techniques
        * Learning document embeddings from labeled data
        * Task-specific supervised document embeddings
            - GPT
            - Deep Semantic Similarity Model (DSSM)
        * Jointly learning sentence representations
            - Universal Sentence Encoder
            - GenSen
- [Top 4 Sentence Embedding Techniques using Python! - 2020](https://www.analyticsvidhya.com/blog/2020/08/top-4-sentence-embedding-techniques-using-python/)
    + Doc2Vec
    + SentenceBERT
    + InferSent
    + Universal Sentence Encoder

## Pre-trained Sentence Transformers
- [GitHub](https://github.com/UKPLab/sentence-transformers)
- [Pretrained Sentence-BERT models](https://www.sbert.net/docs/pretrained_models.html)
    - **distiluse-base-multilingual-cased-v2**: Multilingual knowledge-distilled version of the multilingual Universal Sentence Encoder. This version supports 50+ languages, but performs slightly worse than the v1 model.
    - Models based on *average word embeddings* compute much faster than the transformer-based models, but the quality of their embeddings is worse.
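
To make the speed and quality trade-off above concrete, here is a minimal sketch of the averaging idea in plain Python. The word vectors, the `WORD_VECTORS` table, and the `average_embedding` helper are hypothetical and chosen only for illustration; they are not taken from any of the pre-trained models listed above.

```python
# Toy word vectors: hypothetical values, for illustration only.
WORD_VECTORS = {
    "light": [1.0, 0.0, 0.5],
    "boat": [0.5, 0.5, 0.0],
    "mountain": [0.0, 1.0, 0.5],
}


def average_embedding(tokens, word_vectors, dim=3):
    """Return the element-wise mean of the vectors of all known tokens."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # No known token: fall back to a zero vector of the right size.
        return [0.0] * dim
    return [sum(values) / len(known) for values in zip(*known)]


# "passes" is not in the toy vocabulary, so it is skipped.
sentence_vector = average_embedding(["light", "boat", "passes"], WORD_VECTORS)
print(sentence_vector)
```

The whole computation is a table lookup followed by a mean, which is why this family of models is fast. The mean is also identical for any ordering of the same words, which is one plausible reason the resulting embeddings are weaker than those of transformer-based models.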
   

In [1]:
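from sentence_transformers import SentenceTransformer, util

# Load the pre-trained multilingual model (downloaded on first use)
model = SentenceTransformer('distiluse-base-multilingual-cased-v2')

# Four lines of a classical Chinese poem, one string per line
sentence = ['朝辭白帝彩雲間','千里江陵一日還','兩岸猿聲啼不住','輕舟已過萬重山']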



In [2]:
# Encode all sentences (one embedding vector per sentence)
embeddings = model.encode(sentence)



In [3]:
embeddings.shape

(4, 512)

In [4]:
# Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)

print(cos_sim)

tensor([[1.0000, 0.4568, 0.3396, 0.2982],
        [0.4568, 1.0000, 0.3701, 0.4409],
        [0.3396, 0.3701, 1.0000, 0.4095],
        [0.2982, 0.4409, 0.4095, 1.0000]])
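
To read the matrix more easily, the sketch below does two things in plain Python: it spells out the cosine similarity formula for a single pair of vectors, and it ranks the off-diagonal entries of the matrix printed above. The helper names `cosine_similarity` and `rank_pairs` are hypothetical, and the `SIMILARITY` values are copied by hand from the output rather than recomputed with the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Values copied from the printed tensor; row/column i is the i-th sentence.
SIMILARITY = [
    [1.0000, 0.4568, 0.3396, 0.2982],
    [0.4568, 1.0000, 0.3701, 0.4409],
    [0.3396, 0.3701, 1.0000, 0.4095],
    [0.2982, 0.4409, 0.4095, 1.0000],
]


def rank_pairs(matrix):
    """Return (score, i, j) for every pair i < j, highest score first."""
    size = len(matrix)
    pairs = [(matrix[i][j], i, j) for i in range(size) for j in range(i + 1, size)]
    return sorted(pairs, reverse=True)


for score, i, j in rank_pairs(SIMILARITY):
    print(f"sentences {i} and {j}: {score:.4f}")
```

Only the upper triangle is ranked because the matrix is symmetric and the diagonal is always 1 (each sentence compared with itself). With these values, the first two lines of the poem form the most similar pair (0.4568) and the first and last lines the least similar (0.2982).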
