# Embeddings

In this lab we will use both sparse vectors and dense word2vec embeddings to obtain vector representations of words and documents.

### Outcomes
* Be able to compute term-document matrices from a collection of text documents.
* Be able to implement cosine similarity.
* Know how to use Gensim to train, download and apply word embedding models.
* Understand the word analogy task for word embeddings.

### Overview

First, we load another set of tweet data. We then build a term-document matrix and compute cosine similarities between its vectors. Next, we use the Gensim library to train a word2vec model and to download a pretrained model. Finally, we use the Gensim embeddings to perform the analogy task.

# Preparing the Data

Instead of the sentiment classification dataset, we will work with the smaller emotion classification dataset. The dataset labels tweets as one of the following classes:
 * 0: anger
 * 1: joy
 * 2: optimism
 * 3: sadness

In [None]:
from datasets import load_dataset
from tqdm import tqdm

cache_dir = "./data_cache"

train_dataset = load_dataset(
    "tweet_eval",
    name="emotion",
    split="train",
    verification_mode="no_checks",  # newer name for the deprecated ignore_verifications=True
    cache_dir=cache_dir,
)

print(f"Training dataset with {len(train_dataset)} instances loaded")

test_dataset = load_dataset(
    "tweet_eval",
    name="emotion",
    split="test",
    verification_mode="no_checks",  # newer name for the deprecated ignore_verifications=True
    cache_dir=cache_dir,
)

print(f"Test dataset with {len(test_dataset)} instances loaded")

# Put the data into lists ready for the next steps...
train_texts = []
train_labels = []
for example in tqdm(train_dataset):
    # each example is a dict with 'text' and 'label' fields
    train_texts.append(example['text'])
    train_labels.append(example['label'])


# 1. Term-Document Matrix

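**TO-DO 1.1:** Use the CountVectorizer, as in week 3, to obtain a term-document matrix for the training set. Store the fitted CountVectorizer in a variable named `vectorizer`, since the code in Section 2 refers to it by that name.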

In [None]:
# WRITE YOUR ANSWER HERE


**TO-DO 1.2:** Print out the term vector for the word 'happy'. Use the vocabulary_ attribute to look up the word's index.

*Hint:* the CountVectorizer stores the term-document matrix in a sparse format to save memory. You can convert this to a standard numpy array using the method '.toarray()'.

*Hint:* you can use the method '.flatten()' to convert a 1xN matrix to a vector.

The print-out probably won't be terribly readable, so you will need to convince yourself you have obtained the correct vector.
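
To make the layout concrete, here is a minimal sketch on a hand-built count matrix, not the output of CountVectorizer. The three toy documents, the whitespace tokenisation and the names `toy_docs`, `toy_vocab` and `counts` are assumptions made for illustration. Rows correspond to documents and columns to terms, which is the same orientation as the matrix returned by CountVectorizer's `fit_transform`.

```python
import numpy as np

toy_docs = ["happy happy day", "sad day", "happy end"]

# Map each word type to a column index (sorted so the order is reproducible).
toy_vocab = {w: i for i, w in enumerate(sorted({w for d in toy_docs for w in d.split()}))}

# counts[d, t] = number of times term t occurs in document d.
counts = np.zeros((len(toy_docs), len(toy_vocab)), dtype=int)
for d, doc in enumerate(toy_docs):
    for word in doc.split():
        counts[d, toy_vocab[word]] += 1

term_vector = counts[:, toy_vocab["happy"]]  # a column: one entry per document
doc_vector = counts[0, :]                    # a row: one entry per vocabulary term

print(term_vector)
print(doc_vector)
```

On a matrix this small you can check each count by eye, which is one way to convince yourself that the same indexing is correct on the real data.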

In [None]:
# WRITE YOUR ANSWER HERE


**TO-DO 1.3:** Print out the document vector for the first tweet in the training set. 

In [None]:
# WRITE YOUR ANSWER HERE


# 2. Cosine Similarity

**TO-DO 2.1:** Write a function that computes cosine similarity between two vectors. *Hint:* you might find numpy's linalg library useful. Refer to the textbook for a definition of cosine similarity.
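
For reference, the standard definition of the cosine similarity between two vectors $\mathbf{v}$ and $\mathbf{w}$ of length $N$ is given below. The symbol names are a notational choice and may differ from the textbook's.

$$
\cos(\mathbf{v}, \mathbf{w}) = \frac{\mathbf{v} \cdot \mathbf{w}}{\lVert \mathbf{v} \rVert \, \lVert \mathbf{w} \rVert} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2} \, \sqrt{\sum_{i=1}^{N} w_i^2}}
$$

The value is undefined when either vector is all zeros, so such a vector needs special handling.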

In [None]:
import numpy as np

### WRITE YOUR OWN CODE HERE


**TO-DO 2.2:** Use the function to find the five words most similar to 'happy' according to the term-document matrix. *Hint:* the vocab_inverted dictionary that we compute below lets you look up a word given its index.
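
The ranking step can be sketched independently of how the similarities are computed. The scores and the four-word vocabulary below are made-up values for illustration, and `toy_vocab`, `toy_inverted` and `scores` are hypothetical names. In the exercise, the scores would come from your own cosine similarity function.

```python
import numpy as np

toy_vocab = {"glad": 0, "happy": 1, "rain": 2, "sad": 3}
toy_inverted = {index: word for word, index in toy_vocab.items()}

# Hypothetical similarity of every vocabulary word to the query word 'happy'.
scores = np.array([0.8, 1.0, 0.1, 0.3])

# argsort sorts in ascending order, so reverse it to put the highest scores first.
ranked = np.argsort(scores)[::-1]

# Skip the query word itself, then keep the top two.
query_index = toy_vocab["happy"]
top = [toy_inverted[int(i)] for i in ranked if i != query_index][:2]
print(top)
```

Excluding the query word matters because a word is always maximally similar to itself and would otherwise take the first place.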

In [None]:
# invert the vocabulary dictionary so we can look up word types given an index
keys = vectorizer.vocabulary_.values()
values = vectorizer.vocabulary_.keys()
vocab_inverted = dict(zip(keys, values))

# WRITE YOUR OWN CODE HERE



# 3. Word2Vec

For this part, we will need the Gensim library. The code below tokenizes the training texts, then runs word2vec (the skip-gram model) to learn a set of embeddings.
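
As a rough illustration of what the `window` parameter controls, the sketch below lists the (centre word, context word) pairs that a skip-gram model is trained to predict. The helper `skipgram_pairs` is hypothetical and is not part of Gensim. Gensim's actual training also applies steps that are omitted here, such as randomly shrinking the window and, by default, negative sampling.

```python
def skipgram_pairs(tokens, window):
    """Return (centre, context) pairs for every token within `window` positions."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs

pairs = skipgram_pairs(["so", "happy", "today"], window=1)
print(pairs)
```

A larger window produces more pairs per word, so each word is trained against a broader context.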

In [None]:
from gensim.models import word2vec
from gensim.utils import tokenize

tokenized_texts = [list(tokenize(text, lowercase=True)) for text in train_texts]
emb_model = word2vec.Word2Vec(tokenized_texts, sg=1, min_count=1, window=3, vector_size=25)

We can look up the embedding for any given word like this:

In [None]:
emb_model.wv['happy']

You may have noticed above that we used Gensim's own tokenizer. This means we have a slightly different vocabulary from the one produced by CountVectorizer. To access the vocabulary, we can use the following attribute:

In [None]:
vocab = emb_model.wv.index_to_key
print(vocab[:10])

**TO-DO 3.1:** Now, use your cosine similarity method again to find the five most similar words to 'happy' according to your word2vec model.

In [None]:
# WRITE YOUR OWN CODE HERE


**TO-DO 3.2:** Have either of these embeddings been effective at finding similar words? What might improve them?

# 4. Downloading Pretrained Models

Above, we trained our own model using the skip-gram method. We can also download a model that has already been trained on a large corpus. There is a list of available models [here](https://radimrehurek.com/gensim/models/word2vec.html#pretrained-models). Let's try out GloVe embeddings, which are learned with a different method from skip-gram, trained on a corpus of tweets:

In [None]:
import gensim.downloader

glove_wv = gensim.downloader.load('glove-twitter-25')

# show the vector for 'happy':
print(glove_wv['happy'])

**TO-DO 4.1:** Repeat the exercise above to find the five words most similar to 'happy' with the downloaded model. How do the results compare to the embeddings we trained ourselves?

In [None]:
# WRITE YOUR CODE HERE


# 5. Analogy Task

An analogy can be formalised as:

A is to B as A* is to B*.

The analogy task is to find B* given A, B and A*.
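
One common way to approach this is the vector-offset method: look for the word whose vector is closest to B - A + A*. The sketch below shows the idea on hand-made two-dimensional vectors rather than real embeddings. The values in `toy_emb` and the helpers `cos` and `toy_analogy` are assumptions made for illustration, and this is not necessarily the approach the exercise intends.

```python
import numpy as np

# Hand-made 2-d vectors, chosen so that the offset works exactly.
toy_emb = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([0.0, -1.0]),
}

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def toy_analogy(a, b, a_star, emb):
    # Vector-offset method: B* should lie near B - A + A*.
    target = emb[b] - emb[a] + emb[a_star]
    # The three input words are excluded from the candidates.
    candidates = [w for w in emb if w not in (a, b, a_star)]
    return max(candidates, key=lambda w: cos(emb[w], target))

result = toy_analogy("man", "king", "woman", toy_emb)
print(result)
```

Gensim's KeyedVectors also offers a `most_similar` method that accepts `positive` and `negative` word lists, which may be a useful reference point when checking your own function.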

**TO-DO 5.1:** Define a function that can find the top N closest words B* for any given A, B and A*, using the Gensim embeddings.

In [None]:
vocab = glove_wv.index_to_key

def analogy(A, B, Astar, embeddings, topn):
    # WRITE YOUR OWN CODE HERE
    pass  # placeholder so the cell parses; replace it with your implementation
    ###

print(analogy('man', 'programmer', 'woman', glove_wv, 10))