# Embeddings

In this lab we will use both sparse vectors and dense word2vec embeddings to obtain vector representations of words and documents.

### Outcomes
* Be able to compute term-document matrices from a collection of text documents.
* Be able to implement cosine similarity.
* Know how to use Gensim to train, download and apply word embedding models.
* Understand the word analogy task for word embeddings.

### Overview

First, we load another set of tweet data. Next, we build a term-document matrix and compute cosine similarities between its vectors. We then use the Gensim library to train a word2vec model and to download a pretrained model. Finally, we use the Gensim embeddings to perform the analogy task.

# Preparing the Data

Instead of the sentiment classification dataset, we will work with the smaller emotion classification dataset. The dataset labels tweets as one of the following classes:
 * 0: anger
 * 1: joy
 * 2: optimism
 * 3: sadness

In [2]:
from datasets import load_dataset
from tqdm import tqdm

cache_dir = "./data_cache"

train_dataset = load_dataset(
    "tweet_eval",
    name="emotion",
    split="train",
    ignore_verifications=True,
    cache_dir=cache_dir,
)

print(f"Training dataset with {len(train_dataset)} instances loaded")

test_dataset = load_dataset(
    "tweet_eval",
    name="emotion",
    split="test",
    ignore_verifications=True,
    cache_dir=cache_dir,
)

print(f"Test dataset with {len(test_dataset)} instances loaded")

# Put the data into lists ready for the next steps...
train_texts = []
train_labels = []
for i in tqdm(range(len(train_dataset))):
    train_texts.append(train_dataset[i]['text'])
    train_labels.append(train_dataset[i]['label'])

    # if i % 1000 == 0:
    #     print(i)


Reusing dataset tweet_eval (./data_cache/tweet_eval/emotion/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)


Training dataset with 3257 instances loaded


Reusing dataset tweet_eval (./data_cache/tweet_eval/emotion/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)


Test dataset with 1421 instances loaded


100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3257/3257 [00:00<00:00, 11158.20it/s]


# 1. Term-Document Matrix

**TO-DO 1.1:** Use the CountVectorizer, as in week 3, to obtain a term-document matrix for the training set.

In [3]:
# WRITE YOUR ANSWER HERE

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

vectorizer.fit(train_texts)
X_train = vectorizer.transform(train_texts)
print(X_train)

  (0, 2184)	1
  (0, 3337)	1
  (0, 3837)	1
  (0, 3964)	1
  (0, 4166)	1
  (0, 4514)	1
  (0, 4600)	1
  (0, 4761)	1
  (0, 4926)	1
  (0, 5203)	1
  (0, 5423)	1
  (0, 5758)	1
  (0, 8191)	2
  (0, 8284)	1
  (1, 622)	1
  (1, 784)	1
  (1, 1177)	1
  (1, 2722)	1
  (1, 3337)	1
  (1, 3856)	1
  (1, 4819)	1
  (1, 5187)	1
  (1, 6275)	1
  (1, 6873)	1
  (1, 7324)	1
  :	:
  (3255, 7357)	1
  (3255, 8101)	1
  (3255, 8272)	1
  (3255, 8284)	2
  (3256, 513)	1
  (3256, 843)	1
  (3256, 1132)	1
  (3256, 2160)	1
  (3256, 2680)	1
  (3256, 3236)	1
  (3256, 3766)	1
  (3256, 4060)	1
  (3256, 4247)	1
  (3256, 4269)	1
  (3256, 5160)	1
  (3256, 5570)	1
  (3256, 6502)	1
  (3256, 6800)	1
  (3256, 7362)	1
  (3256, 7365)	2
  (3256, 7396)	1
  (3256, 7413)	1
  (3256, 7815)	2
  (3256, 7993)	1
  (3256, 8116)	1


**TO-DO 1.2:** Print out the term vector for the word 'happy'. Use the `vocabulary_` attribute to look up the word's index.

*Hint:* the CountVectorizer stores the term-document matrix in a sparse format to save memory. You can convert this to a standard numpy array using the method `.toarray()`.

*Hint:* you can use the method `.flatten()` to convert a 1xN or Nx1 matrix to a vector.

The print-out probably won't be terribly readable, so you will need to convince yourself you have obtained the correct vector.

In [4]:
# WRITE YOUR ANSWER HERE
index = vectorizer.vocabulary_['happy']
vec = X_train[:, index].toarray().flatten()
print(vec)
print(vec.shape)
print(X_train.shape)
print(len(vectorizer.vocabulary_))

[0 0 0 ... 0 0 0]
(3257,)
(3257, 8328)
8328


**TO-DO 1.3:** Print out the document vector for the first tweet in the training set. 

In [5]:
# WRITE YOUR ANSWER HERE
index = 0  # row of the first tweet in the training set
X_train[index, :].toarray().flatten()

array([0, 0, 0, ..., 0, 0, 0])
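
Because the real vectors are mostly zeros, it can help to check the row and column conventions on a corpus small enough to read. The sketch below builds a count matrix by hand with numpy for three made-up documents; it does not use CountVectorizer, and the names `docs`, `word_to_index` and `counts` are illustrative assumptions rather than part of the lab code.

```python
import numpy as np

# Toy corpus (hypothetical, not taken from the tweet dataset).
docs = ["happy happy birthday", "sad day", "happy day"]

# Sorted vocabulary and a word -> column index mapping, playing the
# role that vocabulary_ plays for a fitted CountVectorizer.
vocab = sorted({word for doc in docs for word in doc.split()})
word_to_index = {word: i for i, word in enumerate(vocab)}

# Rows are documents, columns are terms.
counts = np.zeros((len(docs), len(vocab)), dtype=int)
for row, doc in enumerate(docs):
    for word in doc.split():
        counts[row, word_to_index[word]] += 1

# A term vector is a column: one count per document.
term_vector = counts[:, word_to_index["happy"]]
# A document vector is a row: one count per vocabulary term.
doc_vector = counts[0, :]

print(vocab)        # ['birthday', 'day', 'happy', 'sad']
print(term_vector)  # [2 0 1]
print(doc_vector)   # [1 0 2 0]
```

The term vector has one entry per document and the document vector has one entry per vocabulary term, which matches the shapes `(3257,)` and `(3257, 8328)` printed for the real data.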

# 2. Cosine Similarity

**TO-DO 2.1:** Write a function that computes cosine similarity between two vectors. *Hint:* you might find numpy's linalg library useful. Refer to the textbook for a definition of cosine similarity.

In [6]:
import numpy as np

### WRITE YOUR OWN CODE HERE

def similarity(x, y):
    dot_prod = np.dot(x, y)
    return dot_prod / (np.linalg.norm(x)*np.linalg.norm(y))
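
A quick way to check a cosine similarity function is to try it on vectors whose angles are known. The sketch below redefines the function under the assumed name `cosine_similarity` so that it runs on its own, and the test vectors are made up for illustration.

```python
import numpy as np

def cosine_similarity(x, y):
    # Dot product divided by the product of the Euclidean norms.
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([1.0, 0.0])

# Same direction, different length: cosine similarity ignores magnitude.
same_direction = cosine_similarity(a, np.array([3.0, 0.0]))
# Perpendicular vectors share no direction at all.
orthogonal = cosine_similarity(a, np.array([0.0, 2.0]))
# Opposite direction gives the minimum possible value.
opposite = cosine_similarity(a, np.array([-1.0, 0.0]))

print(same_direction, orthogonal, opposite)
```

The three values should be close to 1, 0 and -1 respectively. Note that an all-zero vector has norm 0, so this function would divide by zero for it; that should not arise for term vectors here, since every vocabulary term occurs at least once in the texts the vectorizer was fitted on.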

**TO-DO 2.2:** Use the function to find the five most similar words to 'happy' according to the term-document matrix. *Hint:* the `vocab_inverted` dictionary that we compute below lets you look up a word given its index.

In [7]:
# invert the vocabulary dictionary so we can look up word types given an index
keys = vectorizer.vocabulary_.values()
values = vectorizer.vocabulary_.keys()
vocab_inverted = dict(zip(keys, values))

# WRITE YOUR OWN CODE HERE

index = vectorizer.vocabulary_['happy']
happy_vector = X_train[:, index].toarray().flatten()
vocab_size = X_train.shape[1]

similarities = np.zeros(vocab_size)
for i in tqdm(range(vocab_size)):
    if i == index:
        continue  # skip comparison with the input word

    similarities[i] = similarity(happy_vector, X_train[:, i].toarray().flatten())

# argsort is ascending, so the last five indices are the most similar (closest printed last)
most_similar_idxs = np.argsort(similarities)[-5:]
for idx in most_similar_idxs:
    print(vocab_inverted[idx])


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8328/8328 [00:02<00:00, 3637.61it/s]

hopelessness
anxietyprobz
ampalaya
paitpaitanangpeg
birthday
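
The loop above computes one cosine similarity at a time. As an alternative sketch, the same ranking can be computed for all terms at once with matrix operations. The toy matrix `X` and the name `query_index` below are assumptions for illustration; the query term is excluded by setting its score to negative infinity, and the result is reversed so the closest term comes first.

```python
import numpy as np

# Hypothetical toy matrix: rows are documents, columns are terms.
X = np.array([
    [2, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
], dtype=float)
query_index = 0  # column of the term we want neighbours for

query = X[:, query_index]
# Norm of every column (term vector) in one call.
column_norms = np.linalg.norm(X, axis=0)
# query @ X gives the dot product of the query with every column.
similarities = (query @ X) / (np.linalg.norm(query) * column_norms)

# Exclude the query term itself from the ranking.
similarities[query_index] = -np.inf

# argsort is ascending; reverse it so the most similar term is first.
ranked = np.argsort(similarities)[::-1]
top2 = ranked[:2]
print(top2, similarities[top2])
```

On a dense array this avoids the Python-level loop entirely. For the full sparse matrix you would either convert it with `.toarray()` first, which is feasible at this size, or use sparse-aware operations.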





# 3. Word2Vec

For this part, we will need the Gensim library. The code below tokenizes the training texts, then runs word2vec (the skip-gram model, selected with `sg=1`) to learn a set of embeddings.

In [21]:
from gensim.models import word2vec
from gensim.utils import tokenize

tokenized_texts = [list(tokenize(text, lowercase=True)) for text in train_texts]
emb_model = word2vec.Word2Vec(tokenized_texts, seed=123, sg=1, min_count=1, window=3, vector_size=25)

We can look up the embedding for any given word like this:

In [22]:
emb_model.wv['happy']

array([-0.20619911,  0.01198039, -0.4783855 , -0.2384516 , -0.33177167,
       -0.21107505, -1.1538926 , -0.4510568 ,  0.21269228,  0.02695287,
       -0.16724521,  0.53105634,  0.38963693,  0.7182249 , -0.235123  ,
       -0.01076802, -0.09377442, -0.23101725, -0.23545082,  0.19366637,
       -0.39208633, -0.3270961 ,  0.15147023,  0.44964668, -0.5845829 ],
      dtype=float32)

You may have noticed above that we used Gensim's own tokenizer. This means we have a slightly different vocabulary from the one produced by CountVectorizer. To access the vocabulary, we can use the following property:

In [23]:
vocab = emb_model.wv.index_to_key
print(vocab[:10])
vocab_size = len(vocab)

['user', 'i', 'the', 'to', 'a', 'and', 'you', 'is', 'of', 'it']


**TO-DO 3.1:** Now, use your cosine similarity method again to find the five most similar words to 'happy' according to your word2vec model.

In [24]:
# WRITE YOUR OWN CODE HERE

happy_vector = emb_model.wv['happy']

similarities = np.zeros(vocab_size)
for i, wordtype in tqdm(enumerate(vocab)):
    if wordtype == 'happy':
        continue  # skip comparison with the input word

    similarities[i] = similarity(happy_vector, emb_model.wv[wordtype])

most_similar_values = np.sort(similarities)[-5:]
most_similar_idxs = np.argsort(similarities)[-5:]
for i, idx in enumerate(most_similar_idxs):
    print(f'{vocab[idx]}, {most_similar_values[i]}')
    
# OR you can cheat and use the library function...
emb_model.wv.most_similar('happy', topn=5)

8174it [00:00, 88136.11it/s]

worry, 0.9966073036193848
keep, 0.9966935515403748
yet, 0.9972906708717346
mourn, 0.9973416924476624
sorry, 0.9975515007972717





[('sorry', 0.9975515007972717),
 ('mourn', 0.9973416924476624),
 ('yet', 0.9972906708717346),
 ('keep', 0.9966936111450195),
 ('worry', 0.9966073632240295)]

**TO-DO 3.2:** Have either of these embeddings been effective at finding similar words? What might improve them?

# 4. Downloading Pretrained Models

Above, we trained our own model using the skip-gram method. We can also download a pretrained model that has already been trained on a large corpus. There is a list of models available [here](https://radimrehurek.com/gensim/models/word2vec.html#pretrained-models). Let's try out GloVe embeddings (a different method of learning embeddings from the skip-gram model) trained on a corpus of tweets:

In [11]:
import gensim.downloader

glove_wv = gensim.downloader.load('glove-twitter-25')

# show the vector for 'happy':
print(glove_wv['happy'])

[-1.2304   0.48312  0.14102 -0.0295  -0.65253 -0.18554  2.1033   1.7516
 -1.3001  -0.32113 -0.84774  0.41995 -3.8823   0.19638 -0.72865 -0.85273
  0.23174 -1.0763  -0.83023  0.10815 -0.51015  0.27691 -1.1895   0.98094
 -0.13955]


**TO-DO 4.1:** Repeat the exercise above to find the closest relations to 'happy' with the downloaded model. How do the results compare to the embeddings we trained ourselves?

In [12]:
# WRITE YOUR CODE HERE
glove_wv.most_similar('happy', topn=5)

[('birthday', 0.9577818512916565),
 ('thank', 0.937666654586792),
 ('welcome', 0.93361496925354),
 ('love', 0.9176183342933655),
 ('miss', 0.9164500832557678)]

# 5. Analogy Task

An analogy can be formalised as:

A is to B as A* is to B*.

The analogy task is to find B* given A, B and A*.
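
One common way to approach this with embeddings is the vector-offset idea: B* is expected to lie close to A* + B - A. The sketch below illustrates it with hand-picked 2-D vectors, so the numbers are made up rather than learned, and the names `toy` and `toy_analogy` are hypothetical. Excluding the three input words from the candidates is a design choice of this sketch.

```python
import numpy as np

# Hypothetical 2-D embeddings chosen by hand: the first axis loosely
# encodes "royalty", the second encodes "gender".
toy = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king": np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([-1.0, 2.0]),
}

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def toy_analogy(a, b, a_star, embeddings):
    # Vector-offset target: B* should be near A* + (B - A).
    target = embeddings[a_star] + embeddings[b] - embeddings[a]
    # Leave out the input words so they cannot be returned as the answer.
    candidates = [w for w in embeddings if w not in (a, b, a_star)]
    return max(candidates, key=lambda w: cosine_similarity(target, embeddings[w]))

answer = toy_analogy("man", "king", "woman", toy)
print(answer)  # queen
```

With real embeddings the offset rarely lands exactly on a word, which is why the task is usually evaluated on the top N nearest neighbours rather than a single match.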

**TO-DO 5.1:** Define a function that can find the top N closest words B* for any given A, B and A*, using the Gensim embeddings.

In [13]:
vocab = glove_wv.index_to_key

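def analogy(A, B, Astar, embeddings, topn):
    # WRITE YOUR OWN CODE HERE
    # B* should lie close to A* + (B - A) in the embedding space
    target = embeddings[Astar] + embeddings[B] - embeddings[A]

    # take the vocabulary from the embeddings passed in, not from a global
    vocab = embeddings.index_to_key

    similarities = np.zeros(len(vocab))
    for i, key in tqdm(enumerate(vocab)):
        similarities[i] = similarity(target, embeddings[key])

    # argsort is ascending, so the closest word comes last in the returned list
    idxs = np.argsort(similarities)[-topn:]
    closest_words = [vocab[i] for i in idxs]

    return closest_words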

print(analogy('man', 'programmer', 'woman', glove_wv, 10))

1193514it [00:11, 102008.26it/s]


['columnists', 'telesales', 'xstrology', 'caregiver', 'multilingual', 'optimization', 'marketer', 'programmer', 'traveller', 'sourcing']
