# Embeddings

In this lab we will use both sparse vectors and dense word2vec embeddings to obtain vector representations of words and documents.

### Outcomes

- Be able to compute term-document matrices from a collection of text documents.
- Be able to implement cosine similarity.
- Know how to use Gensim to train, download and apply word embedding models.
- Understand the word analogy task for word embeddings.

### Overview

First, we will load another set of tweet data. Next, we will build a term-document matrix and use it to compute cosine similarities between words. We will then use the Gensim library to train a word2vec model and to download a pretrained model. Finally, we will use the Gensim embeddings to perform the analogy task.


# Preparing the Data

Instead of the sentiment classification dataset, we will work with the smaller emotion classification dataset. The dataset labels tweets as one of the following classes:

- 0: anger
- 1: joy
- 2: optimism
- 3: sadness


In [1]:
import os
import sys

path = os.path.abspath(os.path.join(".."))

if path not in sys.path:
    sys.path.append(path)

In [2]:
from dn.datasets import TweetEvalDataset

train = TweetEvalDataset("emotion", "train")
test = TweetEvalDataset("emotion", "test")

train_texts: list[str] = []
train_labels: list[int] = []

for item in train.iter():
    train_texts.append(item["text"])
    train_labels.append(item["label"])

Found cached dataset tweet_eval (/Users/qr23940/git/dialogue_and_narrative/src/notebooks/data_cache/tweet_eval/emotion/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)


# 1. Term-Document Matrix

**TO-DO 1.1:** Use the CountVectorizer, as in week 3, to obtain a term-document matrix for the training set.
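
To make the structure of the matrix concrete, here is a minimal, dependency-free sketch of the counts that CountVectorizer produces with its default settings (lowercasing, tokens of two or more word characters, alphabetically ordered vocabulary). The toy corpus and the names `docs`, `tokenize`, `vocabulary` and `matrix` are illustrative assumptions, not part of the lab code.

```python
import re
from collections import Counter

# Toy corpus standing in for train_texts (hypothetical data).
docs = [
    "Happy birthday to you",
    "So happy, so very happy today",
    "Rainy days make me sad",
]

# Approximate CountVectorizer's default tokenisation:
# lowercase, then keep tokens of two or more word characters.
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text: str) -> list[str]:
    return TOKEN.findall(text.lower())


# Map each term to a column index, in alphabetical order.
all_terms = sorted({term for doc in docs for term in tokenize(doc)})
vocabulary = {term: index for index, term in enumerate(all_terms)}

# One row per document, one column per term; entries are raw counts.
matrix = [[0] * len(vocabulary) for _ in docs]
for row, doc in enumerate(docs):
    for term, count in Counter(tokenize(doc)).items():
        matrix[row][vocabulary[term]] = count

# A term vector is a column of the matrix; a document vector is a row.
happy_vector = [row[vocabulary["happy"]] for row in matrix]
print(happy_vector)
```

In this layout each row is a document vector and each column is a term vector, so looking up a word means selecting the column given by its vocabulary index.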


In [3]:
from dn.embeddings.document_term_matrix import DocumentTermMatrix

document_term_matrix = DocumentTermMatrix(train_texts)

**TO-DO 1.2:** Print out the term vector for the word 'happy'. Use the vocabulary\_ attribute to look up the word's index.
_Hint:_ the CountVectorizer stores a term-document matrix in a sparse format to save memory. You can convert this to a standard numpy array using the method '.toarray()'.
_Hint:_ you can use the method '.flatten()' to convert a 1xN matrix to a vector.

The print-out probably won't be terribly readable, so you will need to convince yourself you have obtained the correct vector.


In [4]:
print(f"Term index = {document_term_matrix.term_index('happy')}")
print(f"Term vector = {document_term_matrix.term_vector('happy')}")
print("Term documents = ")
for document in document_term_matrix.term_documents("happy", 10):
    print(f"\t{document}")

Term index = 3295
Term vector = [0 0 0 ... 0 0 0]
Term documents = 
	Happy birthday to Stephen King, a man responsible for some of the best horror of the past 40 years... and a whole bunch of the worst.
	Happy Birthday, LOST! / #lost #dharmainitiative #12years #22september2004 #oceanic815
	#PeopleLikeMeBecause they see the happy exterior, not the hopelessness I sometimes feel inside. #depression  #anxietyprobz
	#PeopleLikeMeBecause they see the happy exterior, not the hopelessness I sometimes feel inside. #depression #anxiety #anxietyprobz
	Happy Birthday @user #cheer #cheerchick #jeep #jeepgirl #IDriveAJeep #jeepjeep #Cheer
	@user people have so much negativity filled inside them but im always happy that in such a gloomy world someone like u exists Namjoon
	Well stock finished &amp; listed, living room moved around, new editing done &amp; fitted in a visit to the in-laws. #productivityatitsfinest #happy
	@user Thank you, happy birthday to you as well!
	Happy Birthday @user  #cheerchic

**TO-DO 1.3:** Print out the document vector for the first tweet in the training set.


In [5]:
print(f"Document vector = {document_term_matrix.document_vector(0)}")
print("Document terms = ")

for term in document_term_matrix.document_terms(0):
    print(f"\t{term}")

Document vector = [0 0 0 ... 0 0 0]
Document terms = 
	down
	have
	is
	joyce
	leadership
	may
	meyer
	motivation
	never
	on
	payment
	problem
	worry
	you


# 2. Cosine Similarity

**TO-DO 2.1:** Write a function that computes cosine similarity between two vectors. _Hint:_ you might find numpy's linalg library useful. Refer to the textbook for a definition of cosine similarity.
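
One possible sketch of such a function, using numpy's linalg module as the hint suggests. Returning 0.0 when either vector is all zeros is an assumption made here to avoid dividing by zero; the textbook definition leaves that case undefined.

```python
import numpy as np


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Return the cosine of the angle between u and v.

    Assumption: if either vector has zero length, return 0.0.
    """
    norm_product = np.linalg.norm(u) * np.linalg.norm(v)
    if norm_product == 0:
        return 0.0
    return float(np.dot(u, v) / norm_product)


# Parallel vectors give a similarity of (approximately) 1.
print(cosine_similarity(np.array([1, 2, 0]), np.array([2, 4, 0])))
# Orthogonal vectors give a similarity of 0.
print(cosine_similarity(np.array([1, 0]), np.array([0, 1])))
```

Because count vectors are never negative, cosine similarities computed from the term-document matrix fall between 0 and 1.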


**TO-DO 2.2:** Use the function to find the five most similar words to 'happy' according to the document-term matrix. _Hint:_ the vocab_inverted dictionary that we compute below lets you look up a word given its index.
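
A rough sketch of how the ranking could be done in one pass with numpy, on a hypothetical three-word vocabulary. The names `counts`, `vocabulary` and `most_similar` are assumptions for illustration, and the query word itself is excluded from the results.

```python
import numpy as np

# Hypothetical toy data: rows are documents, columns are terms.
vocabulary = {"birthday": 0, "happy": 1, "sad": 2}
vocab_inverted = {index: term for term, index in vocabulary.items()}
counts = np.array([
    [1, 1, 0],
    [0, 2, 0],
    [1, 1, 0],
    [0, 0, 1],
])


def most_similar(term: str, topn: int) -> list[tuple[str, float]]:
    # Transpose so that each row is the vector of one term.
    term_vectors = counts.T.astype(float)
    query = term_vectors[vocabulary[term]]
    norms = np.linalg.norm(term_vectors, axis=1) * np.linalg.norm(query)
    # Cosine similarity of every term with the query; 0 where a norm is zero.
    similarities = np.divide(
        term_vectors @ query, norms, out=np.zeros_like(norms), where=norms > 0
    )
    # Sort from most to least similar and skip the query term itself.
    ranked = np.argsort(-similarities)
    return [
        (vocab_inverted[int(i)], float(similarities[i]))
        for i in ranked
        if i != vocabulary[term]
    ][:topn]


for term, similarity in most_similar("happy", 2):
    print(f"{term} = {similarity:.3f}")
```

Words score highly here only when they occur in the same documents as the query, which is worth keeping in mind when interpreting the results on short tweets.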


In [6]:
for term, similarity in document_term_matrix.most_similar("happy", 5):
    print(f"{term} = {similarity:.3f}")

birthday = 0.400
ampalaya = 0.280
paitpaitanangpeg = 0.280
exterior = 0.243
hopelessness = 0.243


# 3. Word2Vec

For this part, we will need the Gensim library. The code below tokenizes the training texts, then runs word2vec (using the skip-gram model) to learn a set of embeddings.


In [7]:
from dn.embeddings.gensim_word2vec import GensimWord2Vec

gensim_word2vec = GensimWord2Vec(train_texts)

We can look up the embedding for any given word like this:


In [8]:
print(f"Term vector = {gensim_word2vec.term_vector('happy')}")

Term vector = [ 0.093519   -0.08279113  0.3580625  -0.55562764  0.10134792 -0.26659063
  0.31592056  0.52575016 -0.60725075 -0.4336265  -0.13637583 -0.01111981
  0.17245735  0.2303386   0.1331783   0.82918245  0.4227791   0.5091341
 -0.46667728  0.3583106   0.46896124  0.51394933  0.41899118 -0.06921134
  0.7425835 ]


**TO-DO 3.1:** Now, use your cosine similarity method again to find the five most similar words to 'happy' according to your word2vec model.


In [9]:
for term, similarity in gensim_word2vec.most_similar("happy", 5):
    print(f"{term} = {similarity:.3f}")

shocking = 0.997
take = 0.997
taking = 0.997
horrid = 0.997
though = 0.997


**TO-DO 3.2:** Have either of these embeddings been effective at finding similar words? What might improve them?


# 4. Downloading Pretrained Models


Above, we trained our own model using the skip-gram method. We can also download a pretrained model that has already been trained on a large corpus. There is a list of models available [here](https://radimrehurek.com/gensim/models/word2vec.html#pretrained-models). Let's try out GloVe embeddings (GloVe is an alternative to skip-gram for learning embeddings) trained on a corpus of tweets:


In [10]:
from dn.embeddings.gensim_glove import GensimGlove

gensim_glove = GensimGlove(train_texts)

print(f"Term vector = {gensim_glove.term_vector('happy')}")

Term vector = [-1.2304   0.48312  0.14102 -0.0295  -0.65253 -0.18554  2.1033   1.7516
 -1.3001  -0.32113 -0.84774  0.41995 -3.8823   0.19638 -0.72865 -0.85273
  0.23174 -1.0763  -0.83023  0.10815 -0.51015  0.27691 -1.1895   0.98094
 -0.13955]


**TO-DO 4.1:** Repeat the exercise above to find the closest relations to 'happy' with the downloaded model. How do the results compare to the embeddings we trained ourselves?


In [11]:
for term, similarity in gensim_glove.most_similar("happy", 5):
    print(f"{term} = {similarity:.3f}")

birthday = 0.958
thank = 0.938
welcome = 0.934
love = 0.918
miss = 0.916


# 5. Analogy Task

An analogy can be formalised as:

A is to B as A* is to B*.

The analogy task is to find B* given A, B and A*.
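
A common way to approach this is vector arithmetic: take B* to be the word whose embedding is closest to A* + B - A. The sketch below illustrates the idea on hand-made two-dimensional vectors; the embeddings and the helper names `cosine` and `analogy` are hypothetical and chosen so that the arithmetic works out exactly.

```python
import numpy as np

# Hypothetical 2-d embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
embeddings = {
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.1]),
}


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def analogy(a: str, b: str, a_star: str, topn: int = 1) -> list[str]:
    # Target point for B*: A* + B - A.
    target = embeddings[a_star] + embeddings[b] - embeddings[a]
    # Exclude the three query words so they cannot be returned as B*.
    candidates = [word for word in embeddings if word not in {a, b, a_star}]
    ranked = sorted(
        candidates,
        key=lambda word: cosine(embeddings[word], target),
        reverse=True,
    )
    return ranked[:topn]


# "man is to king as woman is to ...?"
print(analogy("man", "king", "woman"))
```

Excluding the query words is a design choice: without it, B or A* is often the nearest neighbour of the target vector and crowds out the intended answer.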

**TO-DO 5.1:** Define a function that can find the top N closest words B* for any given A, B and A*, using the Gensim embeddings.


In [12]:
from gensim.models import KeyedVectors


def most_similar_analogies(
    term1: str,
    term2: str,
    term3: str,
    term_vectors: KeyedVectors,
    topn: int,
) -> list[tuple[str, float]]:
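    # Vector arithmetic for the analogy: B* is approximated by A* + B - A.
    term_vector = (
        term_vectors[term3] + term_vectors[term2] - term_vectors[term1]
    )
    # similar_by_vector does not filter out the query terms, so term2 or term3
    # may appear among the results.
    return term_vectors.similar_by_vector(term_vector, topn=topn)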


for analogy in most_similar_analogies(
    "man", "programmer", "woman", gensim_glove.vectors, 10
):
    print(analogy)

('sourcing', 0.8338578343391418)
('traveller', 0.8320027589797974)
('programmer', 0.8229867815971375)
('marketer', 0.8201889991760254)
('optimization', 0.8200806975364685)
('multilingual', 0.8142974972724915)
('caregiver', 0.8140866160392761)
('xstrology', 0.8124445080757141)
('telesales', 0.8106080889701843)
('columnists', 0.8105406165122986)
