# HW07: Document Embeddings 

Remember that these homeworks are graded for completion. **In this homework, we present two tasks, and you can choose which one to solve. You only have to solve <span style="color:red">one task</span>.**
Task 1 is more guided: we evaluate document embeddings on a standard benchmark. Task 2 is very open-ended and might serve as a starting point for your course project.

**Task 1**
In this task, we evaluate different document embeddings on the English version of the [STS Benchmark](https://arxiv.org/pdf/1708.00055.pdf). The task is to determine how semantically similar two texts are, and the dataset is a popular benchmark for document embeddings: we want the embeddings of two semantically similar documents to be similar as well. We provide a word-count baseline for this task and ask you to compute and evaluate embeddings for a selected sample of document embedding techniques.

To evaluate, we follow [(Reimers and Gurevych, 2019)](https://arxiv.org/pdf/1908.10084.pdf) and compute Spearman’s rank correlation between the cosine similarity of the sentence embeddings and the gold labels. **It is OK to skip one of the document embedding methods.**
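
To make the evaluation protocol concrete, here is a minimal sketch on made-up data. The 2-d embeddings and the gold scores are hypothetical and only illustrate the mechanics: Spearman's correlation compares rankings, so it rewards a method whenever more similar pairs receive higher cosine scores, regardless of the absolute values.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical toy data: three sentence pairs with made-up 2-d embeddings.
emb_a = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
emb_b = np.array([[1.0, 0.1], [1.0, 0.0], [1.0, 0.0]])
gold = [4.8, 2.5, 0.2]  # made-up gold similarity scores (STS uses a 0-5 scale)

def cosine(u, v):
    """Cosine similarity between two 1-d vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

cos_scores = [cosine(u, v) for u, v in zip(emb_a, emb_b)]

# Rank correlation between predicted similarities and gold labels.
rho = spearmanr(cos_scores, gold)[0]

# A monotonic rescaling of the predictions leaves the ranking, and hence rho, unchanged.
rho_rescaled = spearmanr([s ** 3 for s in cos_scores], gold)[0]
```

In this toy setup the cosine scores are ordered exactly like the gold scores, so both correlations are perfect.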

In [1]:
# obtain the data
#!wget http://alt.qcri.org/semeval2017/task1/data/uploads/sts2017.eval.v1.1.zip
#!wget http://alt.qcri.org/semeval2017/task1/data/uploads/sts2017.gs.zip

#!unzip sts2017.eval.v1.1.zip 
#!unzip sts2017.gs.zip 

In [2]:
# load the data

def load_STS_data():
    with open("STS2017.gs/STS.gs.track5.en-en.txt") as f:
        labels = [float(line.strip()) for line in f]
    
    text_a, text_b = [], []
    with open("STS2017.eval.v1.1/STS.input.track5.en-en.txt") as f:
        for line in f:
            line = line.strip().split("\t")
            text_a.append(line[0])
            text_b.append(line[1])
    return text_a, text_b, labels

text_a, text_b, labels = load_STS_data()
text_a[0], text_b[0], labels[0]

('A person is on a baseball team.',
 'A person is playing basketball on a team.',
 2.4)

In [3]:
# some utils
from scipy.stats import spearmanr
def evaluate(predictions, labels):
    print ("spearman's rank correlation", spearmanr(predictions, labels)[0])

import numpy as np
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a,b):
    return dot(a, b)/(norm(a)*norm(b))
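
The helper above scores one pair at a time. A row-wise variant can be sketched as follows, assuming the embeddings arrive as two equally shaped arrays with one row per text. The name `cosine_similarity_rows` and the `eps` guard are illustrative additions, not part of the assignment.

```python
import numpy as np

def cosine_similarity_rows(A, B, eps=1e-12):
    """Row-wise cosine similarity between two (n, d) arrays."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    dots = np.sum(A * B, axis=1)
    norms = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    # eps avoids division by zero for all-zero rows; their similarity becomes 0
    return dots / np.maximum(norms, eps)

sims = cosine_similarity_rows([[1, 0], [1, 1], [0, 0]],
                              [[1, 0], [1, 0], [1, 2]])
```

The zero-norm guard matters for count-based encodings, where a text containing only out-of-vocabulary words is encoded as an all-zero vector.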


In [4]:
# Wordcounts baseline
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
vec.fit(text_a + text_b)

# encode documents
text_a_encoded = vec.transform(text_a).toarray()
text_b_encoded = vec.transform(text_b).toarray()

# predict cosine similarities
predictions = [cosine_similarity(a,b) for a,b in zip(text_a_encoded, text_b_encoded)]

# evaluate
evaluate(predictions, labels)

spearman's rank correlation 0.6998056665685976


In [5]:
##TODO train Doc2Vec on the texts in the dataset
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# Doc2Vec expects each document as a list of tokens; a raw string
# would be treated as a sequence of characters
doc_iterator = [TaggedDocument(simple_preprocess(doc), [i]) for i, doc in enumerate(text_a + text_b)]
d2v = Doc2Vec(doc_iterator,
                min_count=10, # minimum word count
                window=10,    # window size
                vector_size=100, # size of document vector
                sample=1e-4, 
                negative=5, 
                workers=4, # threads
                #dbow_words = 1 # uncomment to get word vectors too
                max_vocab_size=1000) # max vocab size


##TODO derive the document vectors for each text in the dataset
##TODO compute cosine similarity between the text pairs and evaluate spearman's rank correlation
## Don't worry if results are not satisfactory using Doc2Vec (the dataset is too small to train good embeddings)
predictions = []
for sent_a, sent_b in zip(text_a, text_b):
    # infer_vector also takes a list of tokens
    emb_a = d2v.infer_vector(simple_preprocess(sent_a))
    emb_b = d2v.infer_vector(simple_preprocess(sent_b))
    # compare the two texts of each pair
    predictions.append(cosine_similarity(emb_a, emb_b))

evaluate(predictions, labels)


In [6]:
##TODO do the same with embeddings provided by spaCy
import spacy
nlp = spacy.load('en_core_web_sm')

predictions = []
for sent_a, sent_b in zip(text_a, text_b):
    emb_a = nlp(sent_a).vector
    emb_b = nlp(sent_b).vector
    # compare the two texts of each pair
    predictions.append(cosine_similarity(emb_a, emb_b))

evaluate(predictions, labels)


In [7]:
##TODO do the same with universal sentence embeddings
#!pip install --upgrade tensorflow-hub

import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print ("module %s loaded" % module_url)
def embed(input):
    return model(input)    


module https://tfhub.dev/google/universal-sentence-encoder/4 loaded


In [8]:
embs_a = embed(text_a)
embs_b = embed(text_b)

with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    embs_a = session.run(embs_a)
    embs_b = session.run(embs_b)

predictions = []
for emb_a, emb_b in zip(embs_a, embs_b):
    predictions.append(cosine_similarity(emb_a,emb_b))

evaluate(predictions, labels)


spearman's rank correlation 0.8493103413219787


In [9]:
##TODO do the same with SBERT embeddings
#!pip install sentence-transformers
from sentence_transformers import SentenceTransformer
model_name = "bert-base-nli-mean-tokens"
embedder = SentenceTransformer(model_name)

embs_a = embedder.encode(text_a)
embs_b = embedder.encode(text_b)

predictions = []
for emb_a, emb_b in zip(embs_a, embs_b):
    # compare the two texts of each pair
    predictions.append(cosine_similarity(emb_a, emb_b))

evaluate(predictions, labels)


**Task 2**
Use your favorite document embedding method to compute embeddings for a dataset you are interested in. Think of a way to explore the embeddings and provide some visualizations or summary statistics (one option is the path we chose in the notebook: cluster the embeddings with k-means and visualize low-dimensional representations of the document embeddings obtained by PCA).

This task is very open and there is no right or wrong answer. If you want to use document embeddings in your course project, this is a chance to play around with them.
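
As one possible starting point for the k-means and PCA route, here is a minimal sketch with scikit-learn. The toy documents, the helper name `cluster_and_project`, and the use of word counts as stand-in embeddings are all assumptions for illustration; in practice you would replace `embeddings` with the vectors from whichever embedding method you chose.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; replace with your own dataset.
docs = [
    "the cat sat on the mat",
    "a cat and a dog play",
    "dogs chase cats in the yard",
    "stocks fell on the market today",
    "the market rallied as stocks rose",
    "investors buy stocks and bonds",
]

# Stand-in embeddings: word counts. Any (n_docs, dim) array works here.
embeddings = CountVectorizer().fit_transform(docs).toarray().astype(float)

def cluster_and_project(embeddings, n_clusters=2, seed=0):
    """Cluster embeddings with k-means and project them to 2-d with PCA."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    cluster_ids = kmeans.fit_predict(embeddings)
    coords = PCA(n_components=2, random_state=seed).fit_transform(embeddings)
    return cluster_ids, coords

cluster_ids, coords = cluster_and_project(embeddings)
```

The two columns of `coords` can then be passed to a scatter plot, colored by `cluster_ids`, to inspect whether the clusters correspond to recognizable topics.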

