<a href="https://colab.research.google.com/github/ryuqae/2022-meta-learning/blob/main/Day_03_Semantic_Textual_Similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# References
## TextDistance
- Python library for comparing the distance between two or more sequences using many algorithms.
- https://github.com/life4/textdistance

## Universal Sentence Encoder
- https://tfhub.dev/google/universal-sentence-encoder/4
- https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder?hl=ko

## Sentence Transformers(SBert)
- https://www.sbert.net/docs/usage/semantic_textual_similarity.html
- https://www.sbert.net/examples/training/sts/README.html#

## Blog Posts & GitHub Repos
- https://towardsdatascience.com/semantic-textual-similarity-83b3ca4a840e

- https://medium.com/@adriensieg/text-similarities-da019229c894
- https://github.com/adsieg/text_similarity

- https://github.com/nlptown/nlp-notebooks


In [None]:
# Pip install the necessary libraries
!pip install -U datasets plotly scikit-learn tqdm ipywidgets 
!pip install -U numpy spacy textdistance fasttext gensim 
!pip install -U tensorflow tensorflow_hub sentence-transformers openai
!pip install -U pyemd

# Download the Spacy Model
!python -m spacy download en_core_web_sm

In [None]:
# Imports
from datasets import load_dataset
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas()

# Load the English STSB dataset
stsb_dataset = load_dataset('stsb_multi_mt', 'en')
stsb_train = pd.DataFrame(stsb_dataset['train'])
stsb_test = pd.DataFrame(stsb_dataset['test'])

# Check loaded data
print(stsb_train.shape, stsb_test.shape)
stsb_test.head()

(5749, 3) (1379, 3)


| | sentence1 | sentence2 | similarity_score |
|---|---|---|---|
| 0 | A girl is styling her hair. | A girl is brushing her hair. | 2.5 |
| 1 | A group of men play soccer on the beach. | A group of boys are playing soccer on the beach. | 3.6 |
| 2 | One woman is measuring another woman's ankle. | A woman measures another woman's ankle. | 5.0 |
| 3 | A man is cutting up a cucumber. | A man is slicing a cucumber. | 4.2 |
| 4 | A man is playing a harp. | A man is playing a keyboard. | 1.5 |


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import spacy
nlp = spacy.load("en_core_web_sm")

def text_processing(sentence):
    """
    Lemmatize, lowercase, remove numbers and stop words
    
    Args:
      sentence: The sentence we want to process.
    
    Returns:
      A list of processed words
    """
    sentence = [token.lemma_.lower()
                for token in nlp(sentence) 
                if token.is_alpha and not token.is_stop]
    
    return sentence

def cos_sim(sentence1_emb, sentence2_emb):
    """
    Cosine similarity between two columns of sentence embeddings
    
    Args:
      sentence1_emb: sentence1 embedding column
      sentence2_emb: sentence2 embedding column
    
    Returns:
      The row-wise cosine similarity between the two columns.
      For instance, if sentence1_emb=[a,b,c] and sentence2_emb=[x,y,z],
      then the result is [cosine_similarity(a,x), cosine_similarity(b,y), cosine_similarity(c,z)]
    """
    cos_sim = cosine_similarity(sentence1_emb, sentence2_emb)
    return np.diag(cos_sim)
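
The helper above builds the full n × n similarity matrix and keeps only its diagonal. As a dependency-free illustration of what "row-wise" means, the sketch below computes only the n paired scores. The name `rowwise_cosine` and the toy vectors are made up for this example, and scoring an all-zero vector as 0.0 is a convention chosen here.

```python
import math

def rowwise_cosine(rows_a, rows_b):
    """Cosine similarity between the i-th vector of rows_a and the i-th vector of rows_b."""
    scores = []
    for a, b in zip(rows_a, rows_b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        # Convention for this sketch: an all-zero vector gets a score of 0.0
        scores.append(dot / (norm_a * norm_b) if norm_a and norm_b else 0.0)
    return scores

# Toy embeddings (hypothetical values)
emb1 = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
emb2 = [[2.0, 0.0], [1.0, -1.0], [3.0, 0.0]]
pair_scores = rowwise_cosine(emb1, emb2)
print(pair_scores)
```

The first pair points in the same direction and the other two pairs are orthogonal, so only the first score is high.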

# Semantic Textual Similarity(STS)
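- The main goal is to measure the semantic distance (degree of similarity) between pairs of words, phrases, sentences, or documents
- Comparing the similarity of natural-language texts is essential: it is used in most NLP tasks
- For example, "car" should be semantically closer to "bus" than to "cat"
- Two approaches: knowledge-based & corpus-based

## Non-contextual algorithms - without CONTEXT (the context in which a word appears is ignored, so homonyms cannot be told apart)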
- Jaccard Similarity
- Bag of Words(BoW)
    - Count Vectorizer
    - Tf-Idf Vectorizer
- Word Movers Distance(WMD)

## Contextual algorithms - with CONTEXT
- Universal Sentence Encoder(USE)
- BERT Cross Encoder
- SBERT Bi-Encoder
- SimCSE
    - Supervised
    - Unsupervised

In [None]:
from sentence_transformers import SentenceTransformer, SentencesDataset, InputExample, losses
from datasets import load_dataset
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas()


model = SentenceTransformer('nli-distilroberta-base-v2')
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
    InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
train_dataset = SentencesDataset(train_examples, model)


# Non-contextual algorithms

## Jaccard Similarity
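- Sentence 1: AI is our friend and it has been friendly
- Sentence 2: AI and humans have always been friendly
- In this illustration the words are lemmatized before comparison (is/been → be, has/have → have, humans → human), which is why 5 unique words count as common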

![](https://miro.medium.com/max/695/1*NSK8ERXexyIZ_SRaxioFEg.png)

$$ \frac{\#\ \text{common unique words}}{\#\ \text{all unique words}} = \frac{5}{5+3+2} = 0.5 $$
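
The count above can be checked with plain Python sets. This is a minimal sketch: the token lists are lemmatized by hand (an assumption for the example, with stop words kept), and `jaccard_similarity` is a hypothetical helper, not the `textdistance` function.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Size of the intersection divided by size of the union of the unique tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

# Hand-lemmatized, lowercased tokens (assumption: is/been -> be, has/have -> have)
sentence1 = ["ai", "be", "our", "friend", "and", "it", "have", "be", "friendly"]
sentence2 = ["ai", "and", "human", "have", "always", "be", "friendly"]

score = jaccard_similarity(sentence1, sentence2)
print(score)  # 5 common words out of 10 unique words -> 0.5
```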

In [None]:
import textdistance

def jaccard_sim(row):
    # Text Processing
    sentence1 = text_processing(row['sentence1'])
    sentence2 = text_processing(row['sentence2'])
    
    # Jaccard similarity
    return textdistance.jaccard.normalized_similarity(sentence1, sentence2)


# Jaccard Similarity
stsb_test['Jaccard_score'] = stsb_test.progress_apply(jaccard_sim, axis=1)



## Bag of Words(BoW)

### Count Vectorizer 
- The simplest vectorization method
- Drawback: every word is treated as equally important

### Tf-Idf Vectorizer
- Addresses the drawback of the Count Vectorizer
- Inverse Document Frequency (IDF) penalizes the importance of common words
- "The intuition here is that frequent words in one document which are relatively rare across the entire corpus are the crucial words for that document and have a high TFIDF score." - [Source](https://towardsdatascience.com/semantic-textual-similarity-83b3ca4a840e#4031)

![](https://miro.medium.com/max/1050/1*SW80ThHVY7UndJRCF_DEFw.png)

- BoW does not perform very well on similarity tasks
- As the number of documents grows, so does the number of unique words, which makes the vectors sparse
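
To make the IDF penalty concrete, here is a small sketch in plain Python. It assumes the smoothed IDF form that scikit-learn uses by default, `ln((1 + n) / (1 + df)) + 1`, and skips the L2 normalization that `TfidfVectorizer` also applies by default. The three token lists are shortened by hand from STSB test sentences; `idf` and `tfidf` are hypothetical helper names.

```python
import math
from collections import Counter

# Toy corpus (hand-tokenized, shortened sentences)
docs = [
    ["man", "is", "playing", "harp"],
    ["man", "is", "slicing", "cucumber"],
    ["girl", "is", "brushing", "hair"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed IDF (assumed to match scikit-learn's smooth_idf=True default)
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    # Raw term count times IDF, without any normalization
    counts = Counter(doc)
    return {term: count * idf(term) for term, count in counts.items()}

weights = tfidf(docs[0])
for term, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{term}: {weight:.3f}")
```

"is" appears in every document and gets the lowest weight, "man" appears in two and sits in the middle, and the words unique to the first document get the highest weight.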

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
model = TfidfVectorizer(lowercase=True, stop_words='english')

# Train the model
X_train = pd.concat([stsb_train['sentence1'], stsb_train['sentence2']]).unique()
model.fit(X_train)

# Generate Embeddings on Test
sentence1_emb = model.transform(stsb_test['sentence1'])
sentence2_emb = model.transform(stsb_test['sentence2'])

# Cosine Similarity
stsb_test['TFIDF_cosine_score'] = cos_sim(sentence1_emb, sentence2_emb)

## Word Movers Distance(WMD)
- Jaccard Similarity and TF-IDF, introduced above, assume that similar texts share many words -> that is not always the case!

    - document1: “Obama speaks to the media in Illinois” 
    - document2: “The president greets the press in Chicago”


- Word embeddings solve this by turning each word into a vector that carries semantic information
    - Word2Vec
    - GloVe
    - FastText

- A sentence embedding can also be obtained by simply taking the element-wise average of the word vectors in the text

- Word Movers Distance: borrowing the idea of Earth Movers Distance, it is the distance the word vectors of one document have to travel to reach the word vectors of the other document

- Because it compares vectors rather than words, two sentences with no words in common can still be similar

![](https://miro.medium.com/max/473/1*t2D29DYW1TP6cNqQr84YcQ.png)
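
The idea of words "travelling" can be sketched without any library. The block below is not the full WMD, which solves an optimal transport problem (that is what gensim's `wmdistance` takes care of). It is a relaxed version in which every word simply moves to its nearest word in the other document, which gives a lower bound on the exact distance. The 2-D vectors are invented for illustration, and `relaxed_wmd` is a hypothetical helper name.

```python
import math

# Hypothetical 2-D word vectors, invented for illustration only;
# real embeddings (Word2Vec, GloVe, FastText) have hundreds of dimensions.
toy_vectors = {
    "obama": (1.0, 0.9), "president": (0.9, 1.0),
    "speaks": (0.1, 0.5), "greets": (0.2, 0.5),
    "media": (0.7, 0.1), "press": (0.8, 0.1),
    "illinois": (0.4, 0.9), "chicago": (0.5, 0.9),
    "banana": (-1.0, -1.0),
}

def relaxed_wmd(tokens_a, tokens_b):
    """Relaxed Word Mover's Distance with equal weight for every word."""
    def one_way(source, target):
        # Each source word travels to its nearest target word
        return sum(
            min(math.dist(toy_vectors[s], toy_vectors[t]) for t in target)
            for s in source
        ) / len(source)
    # Use the larger of the two directions
    return max(one_way(tokens_a, tokens_b), one_way(tokens_b, tokens_a))

doc1 = ["obama", "speaks", "media", "illinois"]
doc2 = ["president", "greets", "press", "chicago"]
doc3 = ["banana"]

distance_related = relaxed_wmd(doc1, doc2)
distance_unrelated = relaxed_wmd(doc1, doc3)
print(distance_related, distance_unrelated)
```

`doc1` and `doc2` share no word, yet their distance is small because each word has a close neighbour in the other document.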

In [None]:
import gensim.downloader as api

# Load the pre-trained model
model = api.load('fasttext-wiki-news-subwords-300')

def word_movers_distance(row):
    # Text Processing
    sentence1 = text_processing(row['sentence1'])
    sentence2 = text_processing(row['sentence2'])
    
    # Negative Word Movers Distance
    return -model.wmdistance(sentence1, sentence2)


# Negative Word Movers Distance
stsb_test['NegWMD_score'] = stsb_test.progress_apply(word_movers_distance, axis=1)





# Contextual algorithms
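- The problem with non-contextual algorithms is that a word with the same surface form always gets the same vector, regardless of context
- Approaches that take context into account perform better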

## Universal Sentence Encoder(USE)
- The paper presents two models
    - Transformer-based: slower but more accurate
    - Deep Averaging Network (DAN)-based: slightly less accurate but faster
- The Transformer-based model uses the encoding sub-graph of the transformer architecture
- Using attention, it outputs context-aware representations of the words in a sentence that account for both word order and word identity
- Fixed-length sentence encoding!
- Trained with multi-task learning to be as general-purpose as possible

In [None]:
import tensorflow as tf
import tensorflow_hub as hub

# Load the pre-trained model
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    # Control GPU memory usage
    tf.config.experimental.set_memory_growth(gpu, True)

module_url = 'https://tfhub.dev/google/universal-sentence-encoder/4'
model = hub.load(module_url)

# Generate Embeddings
sentence1_emb = model(stsb_test['sentence1']).numpy()
sentence2_emb = model(stsb_test['sentence2']).numpy()

# Cosine Similarity
stsb_test['USE_cosine_score'] = cos_sim(sentence1_emb, sentence2_emb)

## BERT Cross Encoder
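- Put a classifier on top of BERT and use it as a Cross Encoder
- Computes the "probability" that the two given sentences are similar
- Because it does not output embedding vectors, it cannot scale beyond a few thousand documents
- On the other hand, it performs well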

![](https://miro.medium.com/max/305/1*T6XILGOFvbIvLaNSa9VjSA.png)
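
The scaling limit is easy to see by counting model forward passes. A rough sketch, assuming every document has to be compared with every other one; the function names are hypothetical.

```python
def cross_encoder_passes(n_docs):
    # Every unordered pair of documents needs its own forward pass
    return n_docs * (n_docs - 1) // 2

def bi_encoder_passes(n_docs):
    # Each document is encoded once; pairs are then compared with cheap vector math
    return n_docs

for n in (100, 1_000, 10_000):
    print(f"{n:>6} docs: cross-encoder {cross_encoder_passes(n):>10,} passes, "
          f"bi-encoder {bi_encoder_passes(n):>6,} passes")
```

At 10,000 documents an all-pairs comparison with a cross encoder needs roughly 50 million forward passes, while an embedding model needs 10,000.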

In [None]:
from sentence_transformers import CrossEncoder

# Load the pre-trained model
model = CrossEncoder('cross-encoder/stsb-roberta-base')

sentence_pairs = []
for sentence1, sentence2 in zip(stsb_test['sentence1'], stsb_test['sentence2']):
    sentence_pairs.append([sentence1, sentence2])
    
stsb_test['BERT CrossEncoder_score'] = model.predict(sentence_pairs, show_progress_bar=True)


## Metric Learning


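- [Source](https://untitledtblog.tistory.com/164)
- A distance metric function such as Euclidean distance is used to measure the similarity between items
- Many machine learning and deep learning algorithms are built on a distance function, so choosing a suitable metric function is important
- However, no single metric function fits every dataset
- Metric learning trains the model so that similar (or same-class) items move closer together and dissimilar (or different-class) items move farther apart, which ultimately makes the classification model's job easier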
![](https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fdpuky0%2FbtqIjeVyxZo%2FSnmmbKkMGT6aD1JSWybngk%2Fimg.png)

- Applying this view to text similarity:

    - First, embed the text (sentence) with a language model such as BERT
    - Then pull semantically similar texts closer together and push dissimilar ones farther apart

![](https://miro.medium.com/max/700/1*ae5MjNxxpw0GX1eHTf95BA.png)
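
The "pull together, push apart" objective can be written as a pairwise contrastive loss. The sketch below uses the classic form with Euclidean distance and a margin. The margin value and the name `contrastive_loss` are assumptions for illustration, not the exact loss of any model used in this notebook.

```python
import math

def contrastive_loss(u, v, is_similar, margin=1.0):
    """Pairwise contrastive loss on the Euclidean distance between two embeddings."""
    d = math.dist(u, v)
    if is_similar:
        # Similar pairs are penalized for being far apart
        return 0.5 * d ** 2
    # Dissimilar pairs are penalized only while they are closer than the margin
    return 0.5 * max(0.0, margin - d) ** 2

similar_far = contrastive_loss((0.0, 0.0), (3.0, 4.0), is_similar=True)
dissimilar_far = contrastive_loss((0.0, 0.0), (3.0, 4.0), is_similar=False)
dissimilar_close = contrastive_loss((0.0, 0.0), (0.3, 0.4), is_similar=False)
print(similar_far, dissimilar_far, dissimilar_close)
```

A dissimilar pair that is already farther apart than the margin contributes no loss, so training effort goes to the pairs that are still placed wrongly.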


### SBERT Bi-Encoder

- Use a BERT-like model to obtain contextual word embeddings for each text
- Take the element-wise average of each text's word embeddings (mean pooling)
- Train the model with a Siamese network architecture + contrastive loss

![](https://miro.medium.com/max/311/1*2GzHnkb4DpQ1jAix1PAC3g.png)
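
Mean pooling itself is a few lines. This sketch assumes the token embeddings come with an attention mask, so that padding positions are left out of the average. The mask handling and the name `mean_pool` are illustrative assumptions, and the numbers are toy values.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1; padding positions (mask 0) are ignored."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        return [0.0] * dim
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Two real tokens followed by one padding position (toy values)
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

sentence_vec = mean_pool(tokens, mask)
print(sentence_vec)  # [2.0, 3.0]
```

Whatever the sentence length, the result has the same dimension as a single token vector, which is what makes cosine comparison between sentences possible.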


In [None]:
from sentence_transformers import SentenceTransformer

# Load the pre-trained model
model = SentenceTransformer('stsb-mpnet-base-v2')

# Generate Embeddings
sentence1_emb = model.encode(stsb_test['sentence1'], show_progress_bar=True)
sentence2_emb = model.encode(stsb_test['sentence2'], show_progress_bar=True)

# Cosine Similarity
stsb_test['SBERT BiEncoder_cosine_score'] = cos_sim(sentence1_emb, sentence2_emb)


https://www.sbert.net/examples/applications/cross-encoder/README.html

![](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/Bi_vs_Cross-Encoder.png)

### SimCSE
- Simple Contrastive Learning of Sentence Embeddings

![](https://miro.medium.com/max/700/1*9OWdjRkCXBBXglIZwoGyYQ.png)
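
The contrastive objective behind SimCSE can be sketched as an in-batch loss: each sentence should be most similar to its own positive and less similar to the positives of the other sentences in the batch. In the unsupervised setting the positive is the same sentence encoded a second time with a different dropout mask; in the supervised setting it comes from labeled pairs. The vectors below are toy values, the helper names are hypothetical, and the temperature of 0.05 is assumed from the paper's reported setting.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """Mean over i of -log( exp(sim(a_i, p_i)/t) / sum_j exp(sim(a_i, p_j)/t) )."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # log-sum-exp with the maximum subtracted for numerical stability
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

anchors = [(1.0, 0.0), (0.0, 1.0)]
aligned = [(0.9, 0.1), (0.1, 0.9)]   # each positive is close to its own anchor
swapped = [(0.1, 0.9), (0.9, 0.1)]   # each positive is close to the other anchor

loss_aligned = in_batch_contrastive_loss(anchors, aligned)
loss_swapped = in_batch_contrastive_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is near zero when every anchor matches its own positive and large when the positives are swapped.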

In [None]:
########## Supervised ##########
# Load the pre-trained model
model = SentenceTransformer('princeton-nlp/sup-simcse-roberta-large')

# Generate Embeddings
sentence1_emb = model.encode(stsb_test['sentence1'], show_progress_bar=True)
sentence2_emb = model.encode(stsb_test['sentence2'], show_progress_bar=True)

# Cosine Similarity
stsb_test['SimCSE Supervised_cosine_score'] = cos_sim(sentence1_emb, sentence2_emb)


########## Un-Supervised ##########
# Load the pre-trained model
model = SentenceTransformer('princeton-nlp/unsup-simcse-roberta-large')

# Generate Embeddings
sentence1_emb = model.encode(stsb_test['sentence1'], show_progress_bar=True)
sentence2_emb = model.encode(stsb_test['sentence2'], show_progress_bar=True)

# Cosine Similarity
stsb_test['SimCSE Unsupervised_cosine_score'] = cos_sim(sentence1_emb, sentence2_emb)


# Result

In [None]:
stsb_test.head(3)

| | sentence1 | sentence2 | similarity_score | Jaccard_score | TFIDF_cosine_score | NegWMD_score | SBERT CrossEncoder_score | USE_cosine_score | BERT CrossEncoder_score | SBERT BiEncoder_cosine_score | SimCSE Supervised_cosine_score | SimCSE Unsupervised_cosine_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A girl is styling her hair. | A girl is brushing her hair. | 2.5 | 0.5 | 0.49064 | -0.366527 | 0.477538 | 0.73754 | 0.477538 | 0.653581 | 0.881867 | 0.78727 |
| 1 | A group of men play soccer on the beach. | A group of boys are playing soccer on the beach. | 3.6 | 0.666667 | 0.61308 | -0.161301 | 0.83739 | 0.889257 | 0.83739 | 0.825622 | 0.833109 | 0.89243 |
| 2 | One woman is measuring another woman's ankle. | A woman measures another woman's ankle. | 5.0 | 1.0 | 0.705323 | -0.0 | 0.996481 | 0.856532 | 0.996481 | 0.970068 | 0.990725 | 0.95176 |


In [None]:
score_cols = [col for col in stsb_test.columns if '_score' in col]

# Spearman Rank Correlation
spearman_rank_corr = stsb_test[score_cols].corr(method='spearman').iloc[1:, 0:1]*100
spearman_rank_corr.head(10)

| | similarity_score |
|---|---|
| Jaccard_score | 66.026529 |
| TFIDF_cosine_score | 61.420989 |
| NegWMD_score | 67.032848 |
| SBERT CrossEncoder_score | 90.172534 |
| USE_cosine_score | 77.085989 |
| BERT CrossEncoder_score | 90.172534 |
| SBERT BiEncoder_cosine_score | 88.572419 |
| SimCSE Supervised_cosine_score | 87.082275 |
| SimCSE Unsupervised_cosine_score | 82.784251 |
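
Spearman rank correlation compares only the ordering of the scores, which is why measures on different scales (Jaccard, negative WMD, cosine) can be evaluated against the same gold labels. A minimal sketch in plain Python: rank both lists, averaging tied ranks, then take the Pearson correlation of the ranks. The helper names are hypothetical. The inputs are the three rows shown by `stsb_test.head(3)`, far too few to mean anything, whereas the table above uses all 1379 test pairs.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

# First three test rows, copied from the stsb_test.head(3) output
gold = [2.5, 3.6, 5.0]
jaccard = [0.5, 0.666667, 1.0]
use_cosine = [0.73754, 0.889257, 0.856532]

rho_jaccard = spearman(gold, jaccard)
rho_use = spearman(gold, use_cosine)
print(rho_jaccard, rho_use)
```

On these three rows Jaccard orders the pairs exactly as the gold labels do, while USE swaps the last two, so its rank correlation is lower.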


In [None]:
spearman_rank_corr.sort_values(by='similarity_score', ascending=False)

| | similarity_score |
|---|---|
| SBERT CrossEncoder_score | 90.172534 |
| BERT CrossEncoder_score | 90.172534 |
| SBERT BiEncoder_cosine_score | 88.572419 |
| SimCSE Supervised_cosine_score | 87.082275 |
| SimCSE Unsupervised_cosine_score | 82.784251 |
| USE_cosine_score | 77.085989 |
| NegWMD_score | 67.032848 |
| Jaccard_score | 66.026529 |
| TFIDF_cosine_score | 61.420989 |


In [None]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

nrows = 3
ncols = 3
plot_array = np.arange(0, nrows*ncols).reshape(nrows, ncols)

subplot_titles = [f'{row.Index.split("_")[0]}: {row.similarity_score:.2f}' for row in spearman_rank_corr.itertuples()]
fig = make_subplots(rows=nrows, cols=ncols, subplot_titles=subplot_titles)

for index, score in enumerate(spearman_rank_corr.index):
    row, col = np.argwhere(plot_array == index)[0]
    
    fig.add_trace(
        go.Scatter(
            x=stsb_test[score_cols[0]], 
            y=stsb_test[score],
            mode='markers',
        ),
        row=row+1, col=col+1
    )


fig.update_layout(height=700, width=1000, title_text='Spearman Rank Correlation (ρ × 100)', showlegend=False)
fig.show()

# Assignment

1. Compare the speed of each method as the dataset size changes
2. Experiment to find which algorithm is best suited for computing sentence similarity on 10-Q filings