# Derive Embedding with Sentence-Transformers

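This notebook lists several approaches to mapping a sentence to a vector, then uses a pre-trained Sentence-Transformers model to embed sentences and to score text produced by a generative language model.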


## References for Sentence Embedding
- [Document Embedding Techniques - 2019](https://towardsdatascience.com/document-embedding-techniques-fed3e7a6a25d)
    - Classic techniques
        * Bag-of-words
        * Latent Dirichlet Allocation (LDA)
    - Unsupervised document embedding techniques
        * n-gram embeddings
        * Averaging word embeddings
        * Sent2Vec
        * Paragraph vectors (doc2vec)
        * Doc2VecC
        * Skip-thought vectors
        * FastSent
        * Quick-thought vectors
        * Word Mover’s Embedding (WME)
        * Sentence-BERT (SBERT)
    - Supervised document embedding techniques
        * Learning document embeddings from labeled data
        * Task-specific supervised document embeddings
            - GPT
            - Deep Semantic Similarity Model (DSSM)
        * Jointly learning sentence representations
            - Universal Sentence Encoder
            - GenSen
- [Top 4 Sentence Embedding Techniques using Python! - 2020](https://www.analyticsvidhya.com/blog/2020/08/top-4-sentence-embedding-techniques-using-python/)
    + Doc2Vec
    + SentenceBERT
    + InferSent
    + Universal Sentence Encoder

## Pre-trained Sentence Transformers
- [Github](https://github.com/UKPLab/sentence-transformers)
- [Pretrained sentence-bert](https://www.sbert.net/docs/pretrained_models.html)
    - **distiluse-base-multilingual-cased-v2**: A multilingual, knowledge-distilled version of the multilingual Universal Sentence Encoder. This version supports 50+ languages, but performs slightly worse than the v1 model.
    - Models based on *average word embeddings* compute much faster than the transformer-based models, but their embeddings are of lower quality.
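
The speed/quality trade-off noted above is easier to see with a toy version of average word embedding: the sentence vector is just the mean of the per-token vectors, so no context is modeled. The sketch below is illustrative only; the vocabulary, the 3-dimensional vectors, and the helper name `average_embedding` are made up and do not come from any of the listed models.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, invented for this illustration.
TOY_VECTORS = {
    "light": np.array([1.0, 0.0, 0.0]),
    "boat": np.array([0.0, 1.0, 0.0]),
    "mountain": np.array([0.0, 0.0, 1.0]),
}

def average_embedding(tokens, vectors=TOY_VECTORS):
    """Return the mean of the known token vectors (zeros if none are known)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        dim = len(next(iter(vectors.values())))
        return np.zeros(dim)
    return np.mean(known, axis=0)

a = average_embedding(["light", "boat"])
b = average_embedding(["boat", "light"])
print(a, b)
```

Both calls return the same vector, which shows one reason the quality is lower: averaging discards word order.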
   

In [1]:
# One-time setup, kept commented out: download the model and save a local copy.
#from sentence_transformers import SentenceTransformer, util
#model = SentenceTransformer('distiluse-base-multilingual-cased-v2')
#model.save(r'D:\workspace\language_models\distiluse-base-multilingual-cased-v2')
#sentence = ['朝辭白帝彩雲間','千里江陵一日還','兩岸猿聲啼不住','輕舟已過萬重山']


In [1]:
from sentence_transformers import SentenceTransformer, util
# Load the locally saved model; the raw string keeps the Windows backslashes literal
model = SentenceTransformer(r'D:\workspace\language_models\distiluse-base-multilingual-cased-v2')

sentence = ['朝辭白帝彩雲間','千里江陵一日還','兩岸猿聲啼不住','輕舟已過萬重山']



In [2]:
# Encode all sentences; the result is a NumPy array with one row per sentence
embeddings = model.encode(sentence)

In [3]:
embeddings.shape

(4, 512)

In [4]:
# Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)

print(cos_sim)

tensor([[1.0000, 0.4568, 0.3396, 0.2982],
        [0.4568, 1.0000, 0.3701, 0.4409],
        [0.3396, 0.3701, 1.0000, 0.4095],
        [0.2982, 0.4409, 0.4095, 1.0000]])
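
For reference, the pairwise cosine similarity printed above is the dot product of the row vectors after each is scaled to unit length. A minimal NumPy sketch of that formula follows; the helper name `cosine_similarity_matrix` and the toy vectors are assumptions for illustration, and the sketch does not handle zero-length vectors.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of `a` and every row of `b`."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Scale each row to unit length, then take all pairwise dot products.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Toy 2-dimensional "embeddings", made up for this example.
toy = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
sim = cosine_similarity_matrix(toy, toy)
print(sim)
```

As in the notebook output, the diagonal is 1 (each vector compared with itself) and the matrix is symmetric.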


In [20]:
def decode_generated_ids(generated, tokenizer):
    ''' Decode the ids generated by the language model. '''
    output = []
    for ids in generated:                                   # One row of token ids per returned sequence
        text = tokenizer.decode(ids, skip_special_tokens=True)    # Decode the generated text
        text = text.replace(' ', '')                        # Remove spaces between tokens
        text = text.replace(',', '，')                       # Use the full-width comma
        output.append(text)
    return output

def generate_new_sentences(seed_text, tokenizer, model, params):
    ''' Generate new sentences with specified model and tokenizer. '''
    # Parse seeding string
    input_ids = tokenizer.encode(seed_text, return_tensors='pt')
    # Generate text by sampling
    generated = model.generate(input_ids,
                            max_length=params['max_length'],
                            num_return_sequences=params['num_return_sequences'],
                            no_repeat_ngram_size=params['no_repeat_ngram_size'],
                            repetition_penalty=params['repetition_penalty'],
                            length_penalty=params['length_penalty'],
                            top_p=params['top_p'],
                            temperature=params['temperature'],
                            top_k=params['top_k'],
                            do_sample=True,
                            early_stopping=True)
    # Decode the token ids back to text
    output = decode_generated_ids(generated, tokenizer)
    # Done
    return output

# Default configuration
TOKENIZER_PATH = r'D:\workspace\language_models\ckipft'
MODEL_PATH = r'D:\workspace\language_models\ckipft'
MODEL_TF = True     # The checkpoint is stored as TensorFlow weights
GEN_PARAMS = {
    "max_length": 30,  
    "num_return_sequences": 10,
    "no_repeat_ngram_size": 2,
    "repetition_penalty": 1.5,
    "length_penalty": 1.0,
    "top_p": 0.92,
    "temperature": 0.85,
    "top_k": 16
}

from transformers import BertTokenizerFast, AutoModelForCausalLM
import pandas as pd
import numpy as np

tokenizer = BertTokenizerFast.from_pretrained(TOKENIZER_PATH)
lm = AutoModelForCausalLM.from_pretrained(MODEL_PATH, from_tf=MODEL_TF)

All TF 2.0 model weights were used when initializing GPT2LMHeadModel.

All the weights of GPT2LMHeadModel were initialized from the TF 2.0 model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
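
The `top_k` and `top_p` entries in `GEN_PARAMS` restrict which tokens can be sampled at each step. The sketch below shows the idea on a toy probability vector. It is a simplified illustration, not the Hugging Face implementation (which works on logits and also applies temperature and the repetition penalties); the helper name `filter_top_k_top_p` and the toy probabilities are assumptions.

```python
import numpy as np

def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top-k tokens, then the smallest prefix reaching mass top_p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]             # Token ids, most probable first
    keep = order[:top_k]                        # Top-k cut
    kept = probs[keep] / probs[keep].sum()      # Renormalize over the top-k set
    cumulative = np.cumsum(kept)
    # Smallest number of tokens whose cumulative probability reaches top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = keep[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()            # Distribution to sample from

toy_probs = [0.5, 0.3, 0.1, 0.06, 0.04]
filtered = filter_top_k_top_p(toy_probs, top_k=4, top_p=0.75)
print(filtered)
```

With these toy numbers only the two most probable tokens survive, so lowering `top_p` or `top_k` makes the sampled text more conservative.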


In [23]:
generated = generate_new_sentences(sentence[0], tokenizer, lm, GEN_PARAMS)

Setting `pad_token_id` to `eos_token_id`:102 for open-end generation.


In [24]:
# Score each generated text against the seed sentence
seed_vec = model.encode(sentence[0])
vecs = []
scores = []
for s in generated:
    vec = model.encode(s)
    vecs.append(vec)
    scores.append(util.cos_sim(seed_vec, vec))    # 1x1 tensor

for i, s in enumerate(generated):
    print(i, scores[i], s)

0 tensor([[0.3848]]) 朝辭白帝彩雲間您的名字，是不可能的！您知道嗎？」
1 tensor([[1.0000]]) 朝辭白帝彩雲間
2 tensor([[1.0000]]) 朝辭白帝彩雲間
3 tensor([[0.2824]]) 朝辭白帝彩雲間「我們的女兒，就是她！」老夫人激烈而抗議
4 tensor([[1.0000]]) 朝辭白帝彩雲間
5 tensor([[0.4718]]) 朝辭白帝彩雲間「老天爺，你別這麼說！」樂梅低聲下氣的接
6 tensor([[1.0000]]) 朝辭白帝彩雲間
7 tensor([[0.3868]]) 朝辭白帝彩雲間「一起回去！」他一把抓住了她的手，聲音裡
8 tensor([[1.0000]]) 朝辭白帝彩雲間
9 tensor([[1.0000]]) 朝辭白帝彩雲間


In [43]:
def postprocess_generated_sentences(sentences, seed_sentence, sent_transformer):
    ''' Post-process the generated paragraphs: extract one segment from each and embed it. '''
    # Define sentence-break symbols
    bs = ['，','。','；','！','？','「','」']
    # Loop through all generated sentences
    svecs = []
    for s in sentences:
        temp = s.replace(seed_sentence, '')     # Remove the seed sentence
        # Look for sentence-break symbols
        idxs = [i for i, x in enumerate(temp) if x in bs]
        if len(idxs)>1:                         # Keep the tokens between the first and second breaks
            tokens = temp[idxs[0]+1:idxs[1]]
            print("Take the segment between the 1st and 2nd punctuation marks. "+str(len(idxs)))
        #elif len(idxs)>0:
        #    tokens = temp[:idxs[0]]
        else:                                   # Fewer than two breaks: skip the sentence
            print('The generated sentence is too short, skip it: '+s)
            continue
        svec = sent_transformer.encode(tokens)   # Calculate the sentence-embedding vector of the tokens
        svecs.append({'sentence':tokens, 'embedding':svec})
    return svecs

candidates = postprocess_generated_sentences(generated, sentence[0], model)

for c in candidates:
    print(c['sentence'])


Take the segment between the 1st and 2nd punctuation marks. 4
The generated sentence is too short, skip it: 朝辭白帝彩雲間
The generated sentence is too short, skip it: 朝辭白帝彩雲間
Take the segment between the 1st and 2nd punctuation marks. 4
The generated sentence is too short, skip it: 朝辭白帝彩雲間
Take the segment between the 1st and 2nd punctuation marks. 4
The generated sentence is too short, skip it: 朝辭白帝彩雲間
Take the segment between the 1st and 2nd punctuation marks. 4
The generated sentence is too short, skip it: 朝辭白帝彩雲間
The generated sentence is too short, skip it: 朝辭白帝彩雲間
是不可能的
我們的女兒
老天爺
一起回去
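
The post-processing above keeps only the text between the first and second break symbols, so anything before the first break is dropped. One possible alternative, sketched here as an assumption rather than as the notebook's method, is to split on the same break symbols and keep every non-empty segment as a candidate. The helper name `split_segments` is hypothetical.

```python
import re

# Same sentence-break symbols as in the notebook, written as a character class.
BREAK_PATTERN = re.compile("[，。；！？「」]")

def split_segments(text, seed=""):
    """Strip a leading seed sentence, then return all non-empty segments."""
    if seed and text.startswith(seed):
        text = text[len(seed):]
    return [seg for seg in BREAK_PATTERN.split(text) if seg]

segments = split_segments("朝辭白帝彩雲間您的名字，是不可能的！您知道嗎？」", "朝辭白帝彩雲間")
print(segments)  # ['您的名字', '是不可能的', '您知道嗎']
```

The second element is the segment the notebook's function selects for this input; the other two are the extra candidates this variant would also embed and score.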


In [45]:
def select_next_sentence(candidates, seed_vec):
    ''' Select the candidate whose embedding has the largest dot product with the seed vector. '''
    scores = []
    for c in candidates:
        print(c['sentence'])
        # Raw dot product: unlike util.cos_sim, this is not length-normalized
        score = np.dot(seed_vec, c['embedding'])
        print(score)
        scores.append(score)
    return candidates[scores.index(max(scores))]

selected = select_next_sentence(candidates, seed_vec)
print(selected['sentence'])
print(selected['embedding'].shape)

是不可能的
0.027658649
我們的女兒
0.038706854
老天爺
0.10766287
一起回去
0.05292652
老天爺
(512,)
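
`select_next_sentence` ranks candidates by the raw dot product, which depends on the length of each embedding as well as its direction. If a length-independent ranking is wanted, the scores can be normalized to cosine similarity first. The sketch below is one possible variant under that assumption, not the notebook's method; the helper name `select_by_cosine` and the toy candidates are made up.

```python
import numpy as np

def select_by_cosine(candidates, seed_vec):
    """Return the candidate whose embedding has the highest cosine similarity."""
    seed_unit = seed_vec / np.linalg.norm(seed_vec)

    def score(candidate):
        emb = candidate["embedding"]
        return float(np.dot(seed_unit, emb / np.linalg.norm(emb)))

    return max(candidates, key=score)

# Toy data: the first candidate is long but points away from the seed,
# the second is short but perfectly aligned with it.
toy_seed = np.array([1.0, 0.0])
toy_candidates = [
    {"sentence": "long but off-direction", "embedding": np.array([3.0, 4.0])},
    {"sentence": "short but aligned", "embedding": np.array([0.5, 0.0])},
]
best = select_by_cosine(toy_candidates, toy_seed)
print(best["sentence"])
```

On this toy data the raw dot product would prefer the first candidate (3.0 versus 0.5), while cosine similarity prefers the second (0.6 versus 1.0).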
