# SimCSE vs Fast Sentence Embeddings

Running the STS benchmark for the [SimCSE](https://github.com/princeton-nlp/SimCSE) model and the [Fast Sentence Embeddings](https://github.com/oborchers/Fast_Sentence_Embeddings) (FSE) model.

The benchmark is the [STS benchmark](http://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark); the scripts that ship with it are used here to compute the Pearson correlation.
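
For readers without Perl, the same statistic can be sketched in plain Python. This is only an illustration of the Pearson formula: the function name `pearson` and the toy scores are made up here and are not part of the benchmark scripts.

```python
import math


def pearson(gold, pred):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    n = len(gold)
    mean_g = sum(gold) / n
    mean_p = sum(pred) / n
    # Covariance and variances, all without the 1/n factor (it cancels out).
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, pred))
    var_g = sum((g - mean_g) ** 2 for g in gold)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov / math.sqrt(var_g * var_p)


# Toy scores (made up for illustration): pred is a linear function of gold.
gold = [0.0, 1.0, 2.5, 4.0, 5.0]
pred = [1.0, 1.5, 2.25, 3.0, 3.5]
r = pearson(gold, pred)
print(r)
```

Because the statistic is invariant to linear rescaling of either list, multiplying cosine similarities by 5 before scoring should not change the result.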

Files have been downloaded from the STS benchmark site. We only take the sts-dev file to run the benchmark.

**SimCSE** scores around **65.98**, while **FSE** with the *Average* model scores around **45.5** using the *glove-wiki-gigaword-50* vectors and around **63.135** using the *paranmt-300* vectors.



In [1]:
!pip install simcse transformers datasets accelerate nvidia-ml-py3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting simcse
  Downloading simcse-0.4.tar.gz (18 kB)
Collecting transformers
  Downloading transformers-4.21.2-py3-none-any.whl (4.7 MB)
Collecting datasets
  Downloading datasets-2.4.0-py3-none-any.whl (365 kB)
Collecting accelerate
  Downloading accelerate-0.12.0-py3-none-any.whl (143 kB)
Collecting nvidia-ml-py3
  Downloading nvidia-ml-py3-7.352.0.tar.gz (19 kB)
Collecting scipy<1.6,>=1.5.4
  Downloading scipy-1.5.4-cp37-cp37m-manylinux1_x86_64.whl (25.9 MB)
Collecting numpy<1.20,>=1.19.5
  Downloading numpy-1.19.5-cp37-cp37m-manylinux2010_x86_64.whl (14.8 MB)
C

In [7]:
import numpy as np
import pandas as pd
import torch
import csv



df = pd.read_csv("sts-dev-new.csv")

In [28]:
from pynvml import *


def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")


def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()

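def cosine_similarity(n1):
    # n1 holds the two sentence embeddings as rows; unpack them and rescale
    # the cosine similarity by 5 to approximate the 0 to 5 STS score range.
    with torch.no_grad():
        cos = torch.nn.CosineSimilarity(dim=-1)
        return cos(*n1) * 5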
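
The helper above unpacks a two-row tensor into two vectors and multiplies their cosine similarity by 5. A dependency-free sketch of the same arithmetic is shown below; `scaled_cosine` and the vectors are made up for illustration, and the sketch omits the small epsilon that the PyTorch module uses to avoid division by zero.

```python
import math


def scaled_cosine(u, v, scale=5.0):
    """Cosine similarity of u and v, multiplied by `scale`."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return scale * dot / (norm_u * norm_v)


same = scaled_cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orth = scaled_cosine([1.0, 0.0], [0.0, 1.0])            # orthogonal vectors
opp = scaled_cosine([1.0, 0.0], [-1.0, 0.0])            # opposite vectors
print(same, orth, opp)
```

Since cosine similarity can be negative, the scaled value is not guaranteed to fall inside the 0 to 5 range of the gold scores.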


In [9]:
import torch
from transformers import AutoModel, AutoTokenizer


# Import our models. The package will take care of downloading the models automatically
tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
model = AutoModel.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")



In [10]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch.cuda.empty_cache()
print(device)
model = model.to(device)
print_gpu_utilization()



cuda:0
GPU memory occupied: 1080 MB.


In [80]:
# preprocess dev set

sent = df[["Sent1", "Sent2"]].values
score = df["Score"]
print(len(sent))
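nsent = []
scores = []
err = []  # indices of rows whose sentences could not be read as strings
for i in range(len(sent)):
  try:
    t = sent[i]
    # Non-string cells (e.g. NaN) have no .encode and raise here.
    s1 = t[0].encode("utf-8").decode()
    s2 = t[1].encode("utf-8").decode()
  except Exception:
    # Keep the row count unchanged by substituting a placeholder pair.
    err.append(i)
    nsent.append(["SS", "SS"])
    scores.append(5.0)
  else:
    nsent.append([s1, s2])
    scores.append(score[i])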

print(len(nsent))
print("err: ", err)

1470
1470
err:  [764, 1024]
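
The two indices in `err` point at rows that did not parse into two clean sentences. One plausible cause is that a stray double quote inside a sentence makes a quote-aware CSV parser swallow the lines that follow. The sketch below reads such a file with quoting disabled. It assumes the raw benchmark file is tab-separated, with the score in the fifth column and the sentences in the sixth and seventh. `parse_sts` and the second sample row are illustrative and not taken from the notebook.

```python
import csv
import io


def parse_sts(handle):
    """Return (score, sentence1, sentence2) tuples from a tab-separated file."""
    rows = []
    # QUOTE_NONE treats double quotes as ordinary characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) < 7:
            continue  # skip lines that do not have the assumed layout
        rows.append((float(fields[4]), fields[5], fields[6]))
    return rows


# Sample lines; the second contains unbalanced double quotes.
sample = (
    "main-captions\tMSRvid\t2012test\t0000\t5.000\t"
    "A man with a hard hat is dancing.\tA man wearing a hard hat is dancing.\n"
    "main-forums\tanswers-forums\t2015\t0782\t1.20\t"
    'It was called "Delhi.\tThe media said HASTHINA".\n'
)
rows = parse_sts(io.StringIO(sample))
print(len(rows), rows[1][1])
```

With pandas, the analogous option is passing `quoting=csv.QUOTE_NONE` to `read_csv`.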


In [81]:
nsent[:25]

[['A man with a hard hat is dancing.', 'A man wearing a hard hat is dancing.'],
 ['A young child is riding a horse.', 'A child is riding a horse.'],
 ['A man is feeding a mouse to a snake.',
  'The man is feeding a mouse to the snake.'],
 ['A woman is playing the guitar.', 'A man is playing guitar.'],
 ['A woman is playing the flute.', 'A man is playing a flute.'],
 ['A woman is cutting an onion.', 'A man is cutting onions.'],
 ['A man is erasing a chalk board.', 'The man is erasing the chalk board.'],
 ['A woman is carrying a boy.', 'A woman is carrying her baby.'],
 ['Three men are playing guitars.', 'Three men are on stage playing guitars.'],
 ['A woman peels a potato.', 'A woman is peeling a potato.'],
 ['People are playing cricket.', 'Men are playing cricket.'],
 ['A man is playing a guitar.', 'A man is playing a flute.'],
 ['The cougar is chasing the bear.', 'A cougar is chasing a bear.'],
 ['The man cut down a tree with an axe.',
  'A man chops down a tree with an axe.'],
 ['The

In [82]:
sc = []
for i in range(len(nsent)):
  test = nsent[i]
  inputs = tokenizer(test, padding=True, truncation=True, return_tensors="pt").to(device)
  with torch.no_grad():
      # pooler_output has one row per sentence, so two rows per pair
      embds = model(**inputs, return_dict=True).pooler_output
      del inputs
      torch.cuda.empty_cache()
  res = cosine_similarity(embds)
  sc.append(res.cpu())
  if (i%20 == 0):
    print(i)


0
20
40
60
80
100
120
140
160
180
200
220
240
260
280
300
320
340
360
380
400
420
440
460
480
500
520
540
560
580
600
620
640
660
680
700
720
740
760
780
800
820
840
860
880
900
920
940
960
980
1000
1020
1040
1060
1080
1100
1120
1140
1160
1180
1200
1220
1240
1260
1280
1300
1320
1340
1360
1380
1400
1420
1440
1460
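
The loop above runs one forward pass per sentence pair. A possible speed-up is to group pairs into batches before tokenizing. The sketch below shows only the chunking step; `batched`, the batch size, and the placeholder sentences are assumptions and do not come from the SimCSE code.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder pairs standing in for the preprocessed sentence pairs.
pairs = [[f"sentence {i} a", f"sentence {i} b"] for i in range(10)]
batches = list(batched(pairs, batch_size=4))
sizes = [len(b) for b in batches]
print(sizes)
```

Each batch could then be split into its first and second sentences, tokenized in one call, and compared row by row.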


In [56]:
import gc
inputs = None
model = None 
gc.collect()

with torch.no_grad():
    torch.cuda.empty_cache()

array(['The grass family is one of the most widely distributed and abundant groups of plants on Earth.\tAs noted on the Wiki page, grass seed was imported to the new world to improve pasturage for livestock.\nmain-forums\tanswers-forums\t2015\t0782\t1.20\thttp://en.wikipedia.org/wiki/History_of_Delhi#Early_history shows that the city was called Delhi (or Dilli) at least since the 12/13th century.\tI am not exactly sure but I remember the media (in India) addressing Delhi as HASTHINA".',
       nan], dtype=object)

In [48]:
sc= [i.cpu() for i in sc]


In [89]:
with open("res.txt", "w") as rfile:
  for i in sc:
    rfile.write(str(i.item())+"\n")

In [73]:
sum(score)

3445.731333333332

In [90]:
!perl correlation.pl sts-dev.csv res.txt

Use of uninitialized value in multiplication (*) at correlation.pl line 88.
Use of unini
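
Warnings of this kind can appear when the results file has fewer lines than the gold file, which would also misalign the scores being correlated. The preprocessing cell reported 1470 pairs, while the STS benchmark dev split is usually listed with 1,500 pairs, so it may be worth confirming that both files have the same number of lines before trusting the correlation. A minimal sketch of such a check follows; it uses temporary files with made-up contents so that it runs anywhere, and `count_lines` and `same_length` are hypothetical helper names.

```python
import os
import tempfile


def count_lines(path):
    """Number of lines in a text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def same_length(gold_path, pred_path):
    """True when both files contain the same number of lines."""
    return count_lines(gold_path) == count_lines(pred_path)


with tempfile.TemporaryDirectory() as tmp:
    gold_path = os.path.join(tmp, "gold.tsv")
    pred_path = os.path.join(tmp, "pred.txt")
    # Made-up contents: three gold rows but only two predictions.
    with open(gold_path, "w", encoding="utf-8") as f:
        f.write("g\tf\ty\t0\t5.0\ta\tb\n" * 3)
    with open(pred_path, "w", encoding="utf-8") as f:
        f.write("4.9\n4.8\n")
    aligned = same_length(gold_path, pred_path)

print(aligned)
```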

### FSE

In [100]:
# !pip install -U gensim fse
# !pip install nltk
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [146]:
from fse import Vectors, Average, uSIF, IndexedList, CSplitIndexedList

from nltk import word_tokenize

# vecs = Vectors.from_pretrained("glove-wiki-gigaword-50")
vecs = Vectors.from_pretrained("paranmt-300")
model = uSIF(vecs)



In [106]:
from fse.models.average import FAST_VERSION
FAST_VERSION #1 means available

1

In [147]:
# tokenizer from https://github.com/oborchers/Fast_Sentence_Embeddings/blob/master/notebooks/STS-Benchmarks.ipynb
import re

not_punc = re.compile('.*[A-Za-z0-9].*')

def prep_token(token):
    t = token.lower().strip("';.:()").strip('"')
    t = 'not' if t == "n't" else t
    return re.split(r'[-]', t)

def prep_sentence(sentence):
    tokens = []
    for token in word_tokenize(sentence):
        if not_punc.match(token):
            tokens = tokens + prep_token(token)
    return tokens

sent_a = [i[0] for i in nsent]
sent_b = [i[1] for i in nsent]

In [148]:
sentences = CSplitIndexedList(sent_a, sent_b, custom_split=prep_sentence)

In [149]:
# show the tokenized form of the first sentence pair
[prep_sentence(s) for s in nsent[0]]

[['a', 'man', 'with', 'a', 'hard', 'hat', 'is', 'dancing'],
 ['a', 'man', 'wearing', 'a', 'hard', 'hat', 'is', 'dancing']]

In [150]:
model.train(sentences)


(2940, 34468)

In [151]:
def compute_similarities(task_length, model):
    sims = []
    for i, j in zip(range(task_length), range(task_length, 2*task_length)):
        sims.append(model.sv.similarity(i,j) * 5)
    return sims
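
`compute_similarities` relies on the first sentences occupying indices 0 to n-1 and the second sentences occupying indices n to 2n-1. That is consistent with 2940 sentences being trained for 1470 pairs, assuming the two lists are concatenated in order. Under that assumption, the pairing can be sketched with plain lists; `pair_indices` and the toy sentences are illustrative only.

```python
def pair_indices(task_length):
    """Index pairs (i, i + n) linking each first sentence to its second."""
    return [(i, i + task_length) for i in range(task_length)]


# Toy stand-ins for the two sentence lists.
sent_a = ["a1", "a2", "a3"]
sent_b = ["b1", "b2", "b3"]
combined = sent_a + sent_b  # assumed layout of the indexed sentence list

pairs = [(combined[i], combined[j]) for i, j in pair_indices(len(sent_a))]
print(pairs)
```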

In [152]:
fts_sc = compute_similarities(len(nsent), model)

In [142]:
fts_sc

[4.720543026924133,
 4.820749461650848,
 4.654754996299744,
 4.795777797698975,
 4.762759506702423,
 4.236574172973633,
 4.066173434257507,
 4.4780027866363525,
 4.752212464809418,
 4.6947914361953735,
 4.6185302734375,
 4.734078049659729,
 4.12793755531311,
 4.489364326000214,
 4.398241937160492,
 4.079786539077759,
 4.249792993068695,
 4.649677574634552,
 4.585241377353668,
 3.5085314512252808,
 4.087885022163391,
 4.803565442562103,
 3.883202075958252,
 4.6032145619392395,
 3.8448941707611084,
 4.721610844135284,
 4.304442405700684,
 3.80185604095459,
 2.2768478095531464,
 3.928616940975189,
 4.421308636665344,
 4.022016525268555,
 4.401788711547852,
 4.765162169933319,
 4.7049203515052795,
 4.581669867038727,
 4.761216044425964,
 4.763786196708679,
 4.263761937618256,
 4.055345952510834,
 4.375719428062439,
 4.067807197570801,
 3.5285937786102295,
 4.1559600830078125,
 4.6062517166137695,
 2.255125790834427,
 3.096795976161957,
 3.005911707878113,
 4.218945205211639,
 4.44007605314

In [153]:
with open("res2.txt", "w") as rfile:
  for i in fts_sc:
    rfile.write(str(i)+"\n")

In [154]:
!perl correlation.pl sts-dev.csv res2.txt

Use of uninitialized value in multiplication (*) at correlation.pl line 88.
Use of unini