# Similarity embeddings

In this notebook, I compare several embedding algorithms to decide which one is best suited for measuring content preservation, that is, whether the meaning of a sentence survives after its toxic words are replaced.
To pick the best one, I split the data into training and validation parts, train the models (except the transformer, which is used pretrained), and measure how far the cosine similarity between the toxic and non-toxic sentence embeddings is from the reference similarity score.
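
The evaluation idea can be sketched on toy data. This is a minimal illustration, not the notebook's pipeline: `toy_vectors`, `sentence_vector`, `cosine`, and the reference score of 0.8 are all hypothetical stand-ins for a trained model and the real validation data.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for a trained embedding model.
toy_vectors = {
    "you": np.array([1.0, 0.0, 0.0]),
    "are": np.array([0.0, 1.0, 0.0]),
    "stupid": np.array([0.0, 0.0, 1.0]),
    "wrong": np.array([0.0, 0.5, 0.5]),
}


def sentence_vector(sentence, vectors, dim=3):
    """Sum the vectors of known words; unknown words contribute nothing."""
    total = np.zeros(dim)
    for word in sentence.split():
        total += vectors.get(word, np.zeros(dim))
    return total


def cosine(a, b):
    """Cosine similarity, defined here as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


toxic, detoxified = "you are stupid", "you are wrong"
score = cosine(sentence_vector(toxic, toy_vectors),
               sentence_vector(detoxified, toy_vectors))

reference = 0.8  # made-up reference similarity for this pair
error = abs(score - reference)  # per-pair error that is later averaged
print(f"cosine = {score:.4f}, error = {error:.4f}")
```

The closer the embedding-based score is to the reference score across the validation set, the better the embedding captures content preservation.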

In [1]:
import pandas as pd

In [2]:
import random
import numpy as np
import torch
import transformers

random_state = 0

random.seed(random_state)
np.random.seed(random_state)
torch.manual_seed(random_state)
torch.backends.cudnn.deterministic = True
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(random_state)
transformers.set_seed(random_state)

In [23]:
df_train = pd.read_csv('../data/interim/toxicity_levels.csv')
df_val = pd.read_csv('../data/raw/filtered.tsv', sep='\t', index_col=0)

In [24]:
def preprocess_df_val(df):
    df['toxic'] = df.apply(lambda x: x['reference'] if x['ref_tox'] > x['trn_tox'] else x['translation'], axis=1)
    df['detoxified'] = df.apply(lambda x: x["translation"] if x['ref_tox'] > x['trn_tox'] else x['reference'], axis=1)

    return df[['toxic', 'detoxified', 'similarity']]
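
To make the effect of `preprocess_df_val` concrete, here is a small sketch on made-up rows. The column names come from the function above; the helper name `split_toxic_detoxified`, the vectorized `np.where` formulation, and the toy sentences and scores are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def split_toxic_detoxified(df):
    """Vectorized equivalent: the more toxic side of each pair becomes 'toxic'."""
    ref_is_toxic = df["ref_tox"] > df["trn_tox"]
    return pd.DataFrame(
        {
            "toxic": np.where(ref_is_toxic, df["reference"], df["translation"]),
            "detoxified": np.where(ref_is_toxic, df["translation"], df["reference"]),
            "similarity": df["similarity"].to_numpy(),
        },
        index=df.index,
    )


# Made-up rows: in the first the reference is more toxic, in the second the translation is.
toy = pd.DataFrame(
    {
        "reference": ["shut up, idiot", "please be quiet"],
        "translation": ["please stop talking", "shut your mouth"],
        "ref_tox": [0.95, 0.05],
        "trn_tox": [0.02, 0.90],
        "similarity": [0.70, 0.65],
    }
)

result = split_toxic_detoxified(toy)
print(result)
```

Unlike the row-wise `apply`, this version does not add columns to the input frame, which avoids modifying a sampled copy in place.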

In [25]:
df_val = preprocess_df_val(df_val.sample(20000, random_state=random_state))

## Word2Vec

In [26]:
from gensim import utils


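class MyCorpus:
    """Streams the training texts as lists of lowercase tokens."""
    def __iter__(self):
        for text in df_train['text']:
            # simple_preprocess lowercases, tokenizes, and drops very short or very long tokens
            yield utils.simple_preprocess(text)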

In [27]:
import gensim.models

sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences, vector_size=200, workers=8, seed=random_state)

In [28]:
model.save("../models/word2vec.model")

## Fasttext

In [29]:
df_train['text'].to_csv('data.txt', index=False)

In [30]:
import fasttext

model = fasttext.train_unsupervised('data.txt')

Read 8M words
Number of words:  55928
Number of labels: 0
Progress: 100.0% words/sec/thread:  100734 lr:  0.000000 avg.loss:  2.040466 ETA:   0h 0m 0s


In [31]:
model.save_model("../models/fasttext.bin")

## Comparison

In [32]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from numpy.linalg import norm
from transformers import BertTokenizer, BertModel

### Word2Vec

In [33]:
model_word2vec = gensim.models.Word2Vec.load('../models/word2vec.model')
se_score_word2vec = []

for row in df_val.iterrows():
    embedding1 = np.zeros(200)
    for word in row[1].toxic.split():
        embedding1 += model_word2vec.wv[word] if word in model_word2vec.wv else np.zeros(200)
    embedding2 = np.zeros(200)
    for word in row[1].detoxified.split():
        embedding2 += model_word2vec.wv[word] if word in model_word2vec.wv else np.zeros(200)
    score = cosine_similarity(embedding1.reshape(1, -1), embedding2.reshape(1, -1))
    se_score_word2vec.append(norm(score - row[1].similarity))

### Fasttext

In [34]:
model_fasttext = fasttext.load_model('../models/fasttext.bin')
se_score_fasttext = []

for row in df_val.iterrows():
    embedding1 = np.zeros(100)
    for word in row[1].toxic.split():
        embedding1 += model_fasttext[word]
    embedding2 = np.zeros(100)
    for word in row[1].detoxified.split():
        embedding2 += model_fasttext[word]
    score = cosine_similarity(embedding1.reshape(1, -1), embedding2.reshape(1, -1))
    se_score_fasttext.append(norm(score - row[1].similarity))



### Bert

In [17]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

In [18]:
def get_embeddings(sentences):
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state
    return embeddings[:, 0]

In [19]:
from tqdm import tqdm
se_score_bert = []

for row in tqdm(df_val.iterrows()):
    embedding1, embedding2 = get_embeddings([row[1].toxic, row[1].detoxified])
    score = cosine_similarity(embedding1.reshape(1, -1), embedding2.reshape(1, -1))
    se_score_bert.append(norm(score - row[1].similarity))

20000it [17:21, 19.21it/s]


In [35]:
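# Each stored value is norm(score - similarity), which for a single score is the
# absolute error, so the means below are mean absolute errors despite the "MSE" label.
print('Word2Vec MSE = ', np.array(se_score_word2vec).mean())  # 9.5s
print('Fasttext MSE = ', np.array(se_score_fasttext).mean())  # 12.9s
print('Bert MSE = ', np.array(se_score_bert).mean())  # 17m 21.1s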

Word2Vec MSE =  0.21221232225860703
Fasttext MSE =  0.13669340326418633
Bert MSE =  0.18090047


Based on the results, Fasttext has the lowest error (0.137, versus 0.181 for Bert and 0.212 for Word2Vec). It is slightly slower than Word2Vec (12.9 s versus 9.5 s) and uses a lot of disk space. Bert is far slower than both, taking more than 17 minutes on the same 20,000 pairs.
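
If a true mean squared error is wanted, the per-pair differences have to be squared before averaging. The sketch below contrasts the two aggregations; `error_summary` is a hypothetical helper and the scores are made up, so the printed numbers say nothing about the models above.

```python
import numpy as np


def error_summary(predicted, reference):
    """Return (mean absolute error, mean squared error) between two score lists."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(reference, dtype=float)
    mae = np.abs(diff).mean()
    mse = (diff ** 2).mean()
    return mae, mse


# Made-up cosine similarities and reference scores, for illustration only.
mae, mse = error_summary([0.9, 0.5, 0.7], [0.8, 0.7, 0.7])
print(f"MAE = {mae:.4f}, MSE = {mse:.4f}")
```

Because all errors here are below 1, squaring shrinks them, so the two metrics are not directly comparable in magnitude. The ranking of models can also differ between them, since MSE penalizes large errors more heavily.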

For cosine similarity, I will use the Fasttext embeddings.

## How to compute similarity

In [85]:
def _get_cosine_similarity(preds: list[str], labels: list[str], embedding_size=100):
    """
    Computes the mean cosine similarity between paired sentence embeddings
    :param preds: list of the predicted sequences
    :param labels: list of the true detoxified sequences
    :param embedding_size: size of the embedding, depends on sim_model
    :return: mean cosine similarity over all (pred, label) pairs
    """
    # Sentence embedding = sum of the fasttext vectors of its whitespace-separated words
    embeddings1 = np.zeros((len(preds), embedding_size))
    for i, pred in enumerate(preds):
        for word in pred.split():
            embeddings1[i] += model_fasttext[word]
    # Size by len(labels), not len(preds), so each list gets its own buffer
    embeddings2 = np.zeros((len(labels), embedding_size))
    for i, label in enumerate(labels):
        for word in label.split():
            embeddings2[i] += model_fasttext[word]
    # Compare the i-th prediction with the i-th label
    cosine_similarities = []
    for vec1, vec2 in zip(embeddings1, embeddings2):
        cosine_sim = cosine_similarity([vec1], [vec2])[0][0]
        cosine_similarities.append(cosine_sim)
    return np.array(cosine_similarities).mean()

In [86]:
_get_cosine_similarity(['hi, man, you are', 'my name is Ivan'], ['Moscow is the capital of', 'Beatles was born in Liverpool'])

0.5526267136478593
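
The per-pair loop over `cosine_similarity` can also be written as one vectorized step. This is a sketch under assumptions: `rowwise_cosine` is a hypothetical helper that works on already-built embedding matrices, and it returns 0.0 for all-zero rows (for example, an empty sentence) instead of dividing by zero.

```python
import numpy as np


def rowwise_cosine(a, b, eps=1e-12):
    """Cosine similarity between matching rows of two (n, d) arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dots = np.einsum("ij,ij->i", a, b)  # dot product of row i of a with row i of b
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Rows containing a zero vector get similarity 0.0 rather than NaN.
    return np.where(norms > eps, dots / np.maximum(norms, eps), 0.0)


# Toy 2-dimensional "embeddings": identical, 45 degrees apart, and a zero row.
sims = rowwise_cosine([[1, 0], [1, 1], [0, 0]],
                      [[1, 0], [1, 0], [1, 1]])
print(sims, sims.mean())
```

For large batches this avoids building a one-by-one similarity matrix for every pair.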
