# Lab 4: Word Vector representations

In this week's lab, we'll explore dense word vector representations.

*   Use word vectors to explore word meaning (similar words, analogies, and what does not belong)
*   Build a word2vec model of Reddit data
*   Use dense vector word representations to model documents with doc2vec
*   Evaluate the quality of word embedding models


## GloVe embeddings

[GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/) developed at Stanford are another way of creating word embeddings.  These dense vectors are trained using sparse global word co-occurrence vectors. 

The website provides pre-trained word embeddings from a variety of sources. We'll be using the 6B collection, trained on Wikipedia and a large collection of news documents.  We'll be using a vector representation of size 200, but the pretrained embeddings are also available in other sizes.





In [0]:
local_file = "glove.6B.200d_gensim.txt.gz"
!gsutil cp  gs://textasdata/glove.6B.200d_gensim.txt.gz $local_file 

In [0]:
#!pip install 'gensim==3.2.0'
!pip install --upgrade gensim


Note: Loading the GloVe vectors takes approximately 1 minute in Colab.

In [0]:
from gensim.models.keyedvectors import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format("glove.6B.200d_gensim.txt.gz")

Warmup: Read the Gensim documentation on [KeyedVectors](https://radimrehurek.com/gensim/models/keyedvectors.html) for working with pre-trained word embeddings.  If you want to go deeper on how Gensim works, you can read the [source code](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py).

**Note:** The GloVe embeddings are unigrams that have been normalized (lowercased). Words must be normalized in the same way before lookup, or the lookup will fail. Not every word is in the vocabulary; such out-of-vocabulary (OOV) words should be skipped.
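
A small sketch of what this means in practice. It uses a hypothetical toy vocabulary (`toy_vocab`) and a hypothetical helper (`lookup_tokens`) instead of the real `glove_model`, so it runs without the download. With the real model, `word in glove_model` should play the role of the membership test.

```python
# Toy stand-in for the GloVe vocabulary (hypothetical values, for illustration only).
toy_vocab = {"glasgow": [0.1, 0.3], "university": [0.2, 0.1]}

def lookup_tokens(words, vocab):
    """Lowercase each word and split the input into in-vocabulary and OOV lists."""
    found, oov = [], []
    for word in words:
        key = word.lower()  # match the lowercased vocabulary
        (found if key in vocab else oov).append(key)
    return found, oov

found, oov = lookup_tokens(["Glasgow", "University", "Hogwarts"], toy_vocab)
print(found)  # ['glasgow', 'university']
print(oov)    # ['hogwarts']
```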

#### Exercise:
* Use the gensim API to find the 10 `most_similar` words by cosine similarity
* Try the positive words "university" and "glasgow".
* Beyond these words, experiment with other words as positive seeds.
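
To build intuition for what a nearest-neighbour query by cosine similarity computes, here is a minimal sketch on hypothetical 2-D toy vectors. The names `toy`, `cosine`, and `toy_most_similar` are made up for this illustration; the real exercise should use the gensim API on `glove_model`.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 2-D toy embeddings, for illustration only.
toy = {
    "university": np.array([1.0, 0.0]),
    "college":    np.array([0.9, 0.1]),
    "banana":     np.array([0.0, 1.0]),
}

def toy_most_similar(word, vectors, topn=2):
    """Rank every other word by its cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = toy_most_similar("university", toy)
print(ranked)
```

Words whose vectors point in nearly the same direction as the query rank first, regardless of vector length.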

In [0]:
# Try your own!

#### Exercise: 

* Create a function `word_analogy` that uses cosine similarity to solve the analogy problem: A is to B as BLANK is to C  (e.g. Paris is to France as ___ is to England)
* It should take a parameter, `k`, the number of possible answers to return.

**Hint**: Formulate this with addition and subtraction with the word vectors. 
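
One way to read the hint, sketched on hypothetical 2-D toy vectors rather than on GloVe: following the `word_a to word_b = word_c to ?` convention, form the target vector `word_b - word_a + word_c` and rank the remaining words by cosine similarity to it. The names `toy` and `toy_analogy` are assumptions for this illustration, not part of the lab code.

```python
import numpy as np

# Hypothetical 2-D toy embeddings: axis 0 ~ gender, axis 1 ~ royalty.
toy = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
    "apple": np.array([0.1, -1.0]),
}

def toy_analogy(word_a, word_b, word_c, vectors, k=1):
    """word_a is to word_b as word_c is to ? (ranked by cosine similarity)."""
    target = vectors[word_b] - vectors[word_a] + vectors[word_c]
    scores = []
    for word, vec in vectors.items():
        if word in (word_a, word_b, word_c):
            continue  # the query words themselves are not valid answers
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        scores.append((word, float(sim)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

answer = toy_analogy("man", "king", "woman", toy)
print(answer)
```

With the real model, gensim's `most_similar` accepts `positive` and `negative` word lists, which should let you express the same addition and subtraction.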

In [0]:
def word_analogy(word_a, word_b, word_c, k=5):
    """
    Function that solves problem analogy word_a to word_b = word_c to ?
    @param word_a, word_b, word_c: string
    @param k: top k candidates to return
    """
    # YOUR CODE HERE

In [0]:
word_analogy("london", "england", "scotland")

In [0]:
word_analogy("fish", "water", "soil")

Play with a few of your own word analogies. 

We will now play another game: "Which of these things does not belong". 

### Optional Exercise 
* Implement a function, `doesnt_match`.
* It takes a series of words as input and returns the word farthest from the mean (average) embedding by cosine similarity.

**Hints**:
 - `np.mean(vectors, axis=0)` computes element-wise averages; for example, the mean of `[20, 30]` and `[10, 10]` is `[15, 20]`.
 - See `cosine_similarities`, built in to the `glove_model` (KeyedVectors) object.
 
 You can click SHOW CODE to see the solution.


In [0]:
#@title
import numpy as np

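def doesnt_match(words):
  # Keep only in-vocabulary words; OOV words have no vector.
  filtered_words = [word for word in words if word in glove_model]
  # Unit-normalized vectors. In gensim 4.x the equivalent call is
  # glove_model.get_vector(word, norm=True).
  vectors = [glove_model.word_vec(word, use_norm=True) for word in filtered_words]
  mean = np.mean(vectors, axis=0)
  # These are similarities, not distances: the lowest one marks the outlier.
  similarities = glove_model.cosine_similarities(mean, vectors)
  return sorted(zip(similarities, filtered_words))[0][1]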

In [0]:
doesnt_match("apple pear orange car".split())


In [0]:
# Try another example of not matching.

Try this out with another series of words and see if it works!

## Train a Reddit Word2Vec Model

Load the reddit data and tokenize it.

In [0]:
local_file = "coarse_discourse_dump_reddit.json"

!gsutil cp gs://textasdata/coarse_discourse_dump_reddit.json $local_file

In [0]:
#!python -m spacy download en

import spacy

# Load the small english model. 
# Disable the advanced NLP features in the pipeline for efficiency.
nlp = spacy.load('en_core_web_sm', disable=['ner'])
nlp.remove_pipe('tagger')
nlp.remove_pipe('parser')

#@Tokenize
def spacy_tokenize(string):
  tokens = list()
  doc = nlp(string)
  for token in doc:
    tokens.append(token)
  return tokens

#@Normalize
def normalize(tokens):
  normalized_tokens = list()
  for token in tokens:
    if (token.is_alpha or token.is_digit):
      normalized = token.text.lower().strip()
      normalized_tokens.append(normalized)
  return normalized_tokens

#@Tokenize and normalize
def tokenize_normalize(string):
  return normalize(spacy_tokenize(string))  

In [0]:
import json
import pandas as pd

posts = list()

# If the dataset is too large, you can load a subset of the posts.
post_limit = 100000000

# Construct a dataframe, by opening the JSON file line-by-line
with open(local_file) as jsonfile:
  for i, line in enumerate(jsonfile):
    thread = json.loads(line)
    if (len(posts) > post_limit):
      break
      
    for post in thread['posts']:
      posts.append((thread['subreddit'], thread['title'], thread['url'],
                        post['id'], post.get('author', ""), post.get('body', "")))
print(len(posts))

# Column order must match the order of the fields in the tuples appended above.
labels = ['subreddit', 'title', 'url', 'id', 'author', 'body']
post_frame = pd.DataFrame(posts, columns=labels)


In [0]:
# Use the tokenizer to extract all tokens from the body of the posts.
# Flatten the tokens in the post into a single list of all the tokens.
import itertools
all_tokens = []
all_posts_tokenized = post_frame.body.apply(tokenize_normalize)
all_tokens = list(itertools.chain.from_iterable(all_posts_tokenized))
print("Num tokens: ", len(all_tokens))

### Gensim word2vec model ###

In this section, we'll train a word2vec model on the reddit data using [Gensim](https://radimrehurek.com/gensim/index.html). Gensim is a widely used 'topic modeling' library that is used for various word and document similarities in Python.

You will need to refer to the documentation of Gensim's [Word2Vec Model](https://radimrehurek.com/gensim/models/word2vec.html).

Some of the important parameters:
*   `size`: Number of dimensions of the word embeddings (renamed `vector_size` in gensim 4.x)
*   `window`: Maximum distance, in each direction, between the current word and its context words
*   `min_count`: Minimum frequency for words to be included in the model
*   `sg` (Skip-Gram): `0` indicates the CBOW model; `1` indicates Skip-Gram
*   `alpha`: Initial learning rate; a small value prevents the model from over-correcting and enables finer tuning
*   `iter`: Number of passes (epochs) through the dataset (renamed `epochs` in gensim 4.x)
*   `batch_words`: Target size, in words, of the batches of examples passed to the worker threads
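
Because the cell above installs whichever gensim release is current, the parameter names may differ from those in older tutorials: gensim 3.x uses `size` and `iter`, while gensim 4.x uses `vector_size` and `epochs`. The sketch below is a hypothetical helper (`word2vec_kwargs` is not part of gensim) that builds the keyword arguments for either version. The value of 5 passes is an assumption chosen to match gensim's default.

```python
def word2vec_kwargs(gensim_version, dimensions=50, passes=5):
    """Build Word2Vec keyword arguments, choosing names by gensim major version."""
    kwargs = {
        "window": 5,
        "min_count": 5,
        "sg": 0,              # 0 = CBOW, 1 = Skip-Gram
        "alpha": 0.025,
        "batch_words": 10000,
    }
    major = int(gensim_version.split(".")[0])
    if major >= 4:
        kwargs["vector_size"] = dimensions
        kwargs["epochs"] = passes
    else:
        kwargs["size"] = dimensions
        kwargs["iter"] = passes
    return kwargs

print(word2vec_kwargs("3.8.3"))
print(word2vec_kwargs("4.3.0"))
```

In the notebook this could be used as `gensim.models.Word2Vec(sentences, **word2vec_kwargs(gensim.__version__))`, where `sentences` is your tokenized corpus.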



Note: Training the model should take no more than 1-2 minutes in Colab.

**Exercise**: 
- Train a CBOW model on `all_posts_tokenized`
- Add the correct parameters to gensim and train a model
- Parameters: `window=5`, `size=50` (50 dimensions), `min_count=5`, `alpha=0.025`, and `batch_words=10000` for a batch size of 10k words.

We're training a CBOW model because on small collections, like the Reddit data we're using, CBOW is more effective than Skip-gram. Skip-gram models usually perform better on larger datasets. 

We also usually use more than 50 dimensions (typically 100-300 or more).


In [0]:
import gensim
import time

t0 = time.time()
model = gensim.models.Word2Vec(all_posts_tokenized ...
print ("done in %.02f s" % (time.time() - t0))

How well does the model work?  

Let's compute the cosine similarity between several combinations of vectors to see if they make sense. 
Use the gensim [word2vec API](https://radimrehurek.com/gensim/models/word2vec.html).

**Exercise:** Find the similarities between 'man' and 'woman'.
Hint: Look at the `wv` member.


Now, do the same for 'woman' and 'girl'.


Which one is more similar?  Is this what you expect?

Let's dig deeper and look at months.

Pick a month of the year.  What are the most similar 11 words? 20 words?

- What month is (usually) missing for some months? 
- Why do you think this could be given what we discussed in lecture?

## Word vector document representations

We've looked at how to represent words with dense vectors. Let's now apply this by using dense word vectors to represent documents. 


Recall from Lecture 1: We typically represent each document as a vector, with one dimension for each word in the dictionary (i.e. each document's vector has $|V|$ dimensions). How can we extend this to deal with word embeddings? We could represent each term occurrence by the $|D|$ dimensions of its word embedding vector, i.e. each word occurrence is represented by the vector for that word.


However, this would lead to a very large document representation (a vector of $|D| * |V|$ in our case)! Instead, one approach is to combine the dense vector representations.


In this part of the lab, we'll experiment with different ways of combining word vectors. Each document will be represented by a single $|D|$ dimensional vector that is the combination of all of its word vectors.

We could combine the vectors in different ways:
*   Take the average of each dimension
*   Take the min or max of each dimension
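
The combinations listed above can be sketched with NumPy on a hypothetical three-word document whose word vectors have three dimensions (the numbers are made up for illustration). Each row is one word vector, and reducing along `axis=0` yields one value per dimension.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors for a three-word document.
word_vectors = np.array([
    [1.0, 4.0, -2.0],
    [3.0, 0.0,  2.0],
    [2.0, 2.0,  0.0],
])

doc_mean = word_vectors.mean(axis=0)  # average of each dimension
doc_min = word_vectors.min(axis=0)    # smallest value in each dimension
doc_max = word_vectors.max(axis=0)    # largest value in each dimension

print(doc_mean)
print(doc_min)
print(doc_max)
```

In every case the document vector has the same number of dimensions as a single word vector, no matter how many words the document contains.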

### A note on SKLearn: BaseEstimator interface
The root of the API is an `Estimator`, broadly any object that can learn from data. The primary `Estimator` objects implement classifiers, regressors, or clustering algorithms. However, they can also cover a wide array of data manipulation, from dimensionality reduction to feature extraction from raw data. The `Estimator` essentially serves as an interface, and classes that implement `Estimator` functionality must have two methods: `fit` and `predict`.

A `Transformer` is a special type of `Estimator` that creates a new dataset from an old one based on rules that it has learned from the fitting process. It implements `fit` and `transform`.

As you've seen, we'll use these extensively.  In many cases it's easier to create your own Estimator/Transformer for custom applications.




In [0]:
from sklearn.base import BaseEstimator

class Estimator(BaseEstimator):

    def fit(self, X, y=None):
        """
        Accept input data, X, and optional target data, y. Returns self.
        """
        return self

    def predict(self, X):
        """
        Accept input data, X and return a vector of predictions for each row.
        """
        return yhat
      
from sklearn.base import TransformerMixin

class Transformer(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        """
        Learn how to transform data based on input data, X.
        """
        return self

    def transform(self, X):
        """
        Transform X into a new dataset, Xprime and return it.
        """
        return Xprime

#### Exercise:
Create a scikit-learn wrapper (Transformer) around Gensim's W2V features that averages word vectors to create a document vector.

* Complete the implementation of `transform`
* Input is `X`: a vector of documents, each of which is a list of tokens
* Ignore OOV words
* Take the average (mean) of the embeddings of the words in the document

**Reminder** Transform takes a document-feature matrix as input.  Assume that X is a vector of documents each containing a vector of tokens (tokenization performed).  

As in the `doesnt_match` above, use numpy for the vector arithmetic. 

**Hints** 
- If all words in the document are UNK, return a $|D|$ dimensional vector of 0s.
- This is particularly elegant with nested list comprehensions (documents, tokens)

If you are stuck, you can show code to see the solution below.

In [0]:
import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin

class AverageEmbeddingVectorizer(BaseEstimator, TransformerMixin):
  
    def __init__(self, embedding_model):
        self.embedding = embedding_model
        self.dimension = embedding_model.vector_size

    def fit(self, X, y=None):
      # Nothing is required here. No collection properties are needed.
      return self
      
    def transform(self, X):
      # Input: X: an iterable of documents that have been tokenized.
      # Example of input with two documents: 
      # [["the", "cat", "ran"], ["the", "cat", "jumped"]]
      # Output: a numpy array of vectors, one for each document.  Each vector
      # is the mean of the vectors of each word in the document. 
      # Hint: It may require a nested loop / for comprehension.
      # Be sure to skip OOV terms. Return 0 if no words are in the vocabulary.
      return <YOUR CODE HERE>

In [0]:
#@title
#Solution

import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin

class AverageEmbeddingVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, embedding_model):
        self.embedding = embedding_model
        self.dimension = embedding_model.vector_size

    def fit(self, X, y=None):
        return self
      
    def transform(self, X):  
      # Skip OOV terms. Return 0 if no words are in the vocabulary.
      #print (X)
      return np.array([ 
          np.mean([self.embedding[token] for token in doc if token in self.embedding]
                or [np.zeros(self.dimension)], axis=0)
          for doc in X
      ])

In [0]:
# Use a vectorizer with the Glove vectors and Reddit w2v vectors.  This will 
# allow us to easily compare them.
# Pass the model's KeyedVectors (`model.wv`), which supports `in` and `[]` lookups.
reddit_vectorizer = AverageEmbeddingVectorizer(model.wv)
glove_vectorizer = AverageEmbeddingVectorizer(glove_model)

Compare the reddit and glove vectorizers for the same vector of tokens. What's different (besides the fact that the numbers are different)?


In [0]:
doc = tokenize_normalize('watch the cat chase the dog')
X = [doc]
print("Glove:")
print(glove_vectorizer.transform(X))
print("Reddit:")
print(reddit_vectorizer.transform(X))

We can see that the two vectors have different numbers of dimensions (200 for GloVe, 50 for the Reddit model) and very different values.

Let's process our collection with our embedding vectorizer.

In [0]:
glove_post_vector_matrix = glove_vectorizer.transform(all_posts_tokenized)
reddit_post_vector_matrix = reddit_vectorizer.transform(all_posts_tokenized)

**Word2Vec Questions:** 
*   Do you expect this to work well for long documents?  Why or why not?
*   What about stopwords or non-informative words? See the optional exercises for ideas on how to improve the vectorizer. 
*   Averaging word vectors has some disadvantages. What happens to word order?

Doc2Vec addresses some of these issues.
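
To make the word-order question concrete, here is a minimal sketch on hypothetical toy embeddings (`toy` and `average_vector` are made up for this illustration). Two sentences with the same words in a different order receive identical averaged vectors.

```python
import numpy as np

# Hypothetical 2-D toy embeddings, for illustration only.
toy = {
    "dog":   np.array([1.0, 0.0]),
    "bites": np.array([0.0, 1.0]),
    "man":   np.array([0.5, 0.5]),
}

def average_vector(tokens, vectors):
    """Mean of the vectors of the in-vocabulary tokens."""
    return np.mean([vectors[t] for t in tokens if t in vectors], axis=0)

a = average_vector(["dog", "bites", "man"], toy)
b = average_vector(["man", "bites", "dog"], toy)
print(np.allclose(a, b))  # True: both sentences get the same vector
```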


### Paragraph embeddings
There is an extension to word2vec that learns fixed-length representations of variable-length documents, going beyond a simple average.  Doc2Vec was shown to be effective as a text classification feature as well as for paragraph similarity.

See Gensim's description of [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html). It was developed at Google Research by Quoc Le and Tomas Mikolov: "[Distributed Representations of Sentences and Documents](https://arxiv.org/pdf/1405.4053v2.pdf)". 

In [0]:
import gensim
import collections
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged_documents = [TaggedDocument(words=_d, tags=[i]) for i, _d in enumerate(all_posts_tokenized)]

# In the paper they use 400 as the dimensions, may take many more epochs to converge.
# The devil is in the details to make these work effectively.
d2v_model = gensim.models.doc2vec.Doc2Vec(tagged_documents, vector_size=300, alpha=0.025, min_alpha=0.001, min_count=5, window=8, epochs=10)
#vocab = collections.Counter(all_tokens)
#d2v_model.build_vocab_from_freq(vocab)


In [0]:
d2v_model.docvecs[1]

In [0]:
# This should really be called 'transform'
d2v_model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
#d2v_model.save("d2v.model")

This represents the paragraph just like another word in the collection.

## Evaluation

We'll look at two ways to evaluate our models.

1.   Finding similar documents
2.   Word analogies

These are both extrinsic evaluations.  Other extrinsic evaluations include using them as part of supervised learning, such as text classification.

### Find similar documents revisited ###

One application of this is finding related reddit posts for a given document or string. For example, to create a 'Reddit widget' to embed in a webpage that shows related content from Reddit. 

We'll use our representation of documents to find similar posts and to query the documents.


In [0]:
from sklearn.metrics.pairwise import cosine_similarity
 
# A function that given an input query item returns the top-k most similar items 
# by their cosine similarity.
def find_similar(query_vector, vd_matrix, top_k = 5):
    cosine_similarities = cosine_similarity(query_vector, vd_matrix).flatten()
    related_doc_indices = cosine_similarities.argsort()[::-1]
    return [(index, cosine_similarities[index]) for index in related_doc_indices][0:top_k]

Find the top 10 most similar posts based on cosine similarity in the word vector representation. Try this with both the glove and w2v embeddings.
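
For reference, the ranking that `find_similar` performs can be sketched with NumPy alone. This is an illustrative re-implementation under stated assumptions, not the lab's code: `top_k_cosine` is a hypothetical name, the query is a 1-D vector, and all-zero rows (documents with no in-vocabulary words) are given a similarity of 0.

```python
import numpy as np

def top_k_cosine(query, matrix, top_k=5):
    """Return (row_index, cosine) pairs for the top_k rows most similar to query."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # Avoid division by zero for all-zero rows; their dot product is 0 anyway.
    norms = np.where(norms == 0, 1.0, norms)
    sims = (matrix @ query) / norms
    order = np.argsort(sims)[::-1][:top_k]  # highest similarity first
    return [(int(i), float(sims[i])) for i in order]

# Hypothetical document matrix: four documents, two dimensions.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
result = top_k_cosine(np.array([1.0, 0.0]), docs, top_k=2)
print(result)
```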

In [0]:
doc = tokenize_normalize('end of the world as we know it')

# Glove vectorizer
vectorizer = glove_vectorizer
matrix = glove_post_vector_matrix

# Reddit w2v
#vectorizer = reddit_vectorizer
#matrix = reddit_post_vector_matrix

transformed = vectorizer.transform([doc])

query_vector = transformed[0:1]
print("\nSimilar posts:")
for index, score in find_similar(query_vector, matrix, 10):
  post_contents = post_frame.iloc[index]['body'].replace('\n', '')
  post_limited = (post_contents[:75] + '..') if len(post_contents) > 75 else post_contents
  print(score, index, post_limited)

In [0]:
import random as rand
post_index = 1000
# randrange excludes the upper bound, so the index is always valid.
post_index = rand.randrange(len(all_posts_tokenized))
tokens = all_posts_tokenized[post_index]
print(tokens)

# Transform our string using the vocabulary
transformed = vectorizer.transform([tokens])
doc = transformed[0:1]

print("\nSimilar posts:")
for index, score in find_similar(doc, matrix, 10):
  post_contents = post_frame.iloc[index]['body'].replace('\n', '')
  post_limited = (post_contents[:150] + '..') if len(post_contents) > 150 else post_contents
  print(score, index, post_limited)

The code below repeats the same step for the random post, this time using doc2vec.

In [0]:
# Compare with doc2vec

inferred_vector = d2v_model.infer_vector(tokens)
print(tokens)
similar = d2v_model.docvecs.most_similar([inferred_vector], topn=10)

print("\nSimilar posts:")
for (label, score) in similar:
  post_contents = post_frame.iloc[label]['body'].replace('\n', '')
  post_limited = (post_contents[:150] + '..') if len(post_contents) > 150 else post_contents
  print(score, label, post_limited)

Compare the similarity of GloVe vs. Reddit vectorizers on this task.  Are the results different? Which one do you think is more effective? 

In the current case, it looks like the doc2vec vectors are not (yet) optimal.  They may need additional training or different parameters to be very effective.

### Extrinsic evaluation with word analogies ###
We've seen how to combine word vectors and find the most similar words.  

Recall: `glove_model.doesnt_match("apple pear orange car".split())`

Revisit this problem and try implementing a solution yourself. After you're done, you could look at the source code in gensim to see how your approach compares.


In [0]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
!wget https://raw.githubusercontent.com/RaRe-Technologies/gensim/develop/gensim/test/test_data/questions-words.txt

# glove_model is already a KeyedVectors object, so no `.wv` is needed.
glove_model.evaluate_word_analogies('questions-words.txt')
model.wv.evaluate_word_analogies('questions-words.txt')


Note: this may be slow for the 300D model.
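
To see what the analogy evaluation measures, it helps to look at the file format: section headers start with `:` and every other line holds four words, `a b c expected`. The sketch below parses a tiny inline sample written in the style of `questions-words.txt` and scores a hypothetical predictor. The helpers `parse_analogies` and `accuracy` are assumptions for illustration; gensim's own implementation handles more details, such as skipping questions that contain OOV words.

```python
# Two sections in the style of questions-words.txt (tiny inline sample).
sample = """: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Berlin Germany
: family
boy girl king queen
"""

def parse_analogies(text):
    """Group 'a b c expected' lines under their ': section' header."""
    sections, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            sections[current] = []
        else:
            # Lowercased here to match a lowercased vocabulary such as GloVe's.
            a, b, c, expected = line.lower().split()
            sections[current].append((a, b, c, expected))
    return sections

def accuracy(questions, predict):
    """Fraction of questions where predict(a, b, c) equals the expected word."""
    correct = sum(predict(a, b, c) == expected for a, b, c, expected in questions)
    return correct / len(questions)

sections = parse_analogies(sample)
# Hypothetical predictor that always answers 'iraq', just to exercise the scoring.
score = accuracy(sections["capital-common-countries"], lambda a, b, c: "iraq")
print(score)  # 0.5
```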

The reddit model accuracy.

In [0]:
model.wv.evaluate_word_analogies('questions-words.txt')


In [0]:
glove_model.evaluate_word_analogies('questions-words.txt')


We have just scratched the surface of word embeddings. 

See this great post about the rise of [BERT, ELMo, and others](http://jalammar.github.io/illustrated-bert/).  These new embedding models that perform the task of language modeling are state-of-the-art and leading to a new revolution in NLP effectiveness. 


## Wrapup

Take the [Moodle quiz](https://moodle.gla.ac.uk/mod/feedback/view.php?id=1118007) for this lab and let us know what you think.

## (Optional) Extra exercises 

### Modify the document vector representation ###

Experiment with other ways of combining the word vectors.  

*   What about taking max?
*   Add IDF weighting to create a weighted average
*   Remove stopwords or other non-informative words
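
As a starting point for the IDF idea, here is a minimal sketch on hypothetical toy data. It assumes a smoothed IDF of the form `log((1 + N) / (1 + df)) + 1`, which is one common choice; `compute_idf` and `idf_weighted_average` are made-up helper names, and the fallback weight of 1.0 for tokens without an IDF value is also an assumption.

```python
import math
import numpy as np

def compute_idf(tokenized_docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    df = {}
    for doc in tokenized_docs:
        for token in set(doc):  # count each token once per document
            df[token] = df.get(token, 0) + 1
    return {t: math.log((1 + n_docs) / (1 + count)) + 1 for t, count in df.items()}

def idf_weighted_average(doc, vectors, idf, dimension):
    """IDF-weighted mean of in-vocabulary token vectors (zeros if none match)."""
    pairs = [(vectors[t], idf.get(t, 1.0)) for t in doc if t in vectors]
    if not pairs:
        return np.zeros(dimension)
    vecs, weights = zip(*pairs)
    return np.average(np.array(vecs), axis=0, weights=np.array(weights))

# Hypothetical corpus and embeddings: 'the' appears in every document, 'cat' in one.
docs = [["the", "cat"], ["the", "dog"], ["the", "fish"]]
toy = {"the": np.array([1.0, 0.0]), "cat": np.array([0.0, 1.0])}
idf = compute_idf(docs)
vec = idf_weighted_average(["the", "cat"], toy, idf, dimension=2)
print(vec)
```

The rarer word "cat" receives a larger weight than "the", so the document vector is pulled towards it, which is the effect the weighting is meant to achieve.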

### Modify embedding hyper-parameters ###
Retrain the reddit model with different parameters. For example try varying some of the following:

*   Dimension of embeddings
*   Window size 
*   Skipgram vs CBOW
*   Iterations or learning rates
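
One simple way to organize such experiments is to enumerate a small grid of settings and retrain once per combination. The sketch below only builds the list of configurations; the candidate values are hypothetical, and the keys are descriptive labels that you would map onto the gensim parameter names for your installed version.

```python
import itertools

# Hypothetical candidate values; choose ranges that fit your time budget.
grid = {
    "dimensions": [50, 100],
    "window": [2, 5, 10],
    "sg": [0, 1],  # 0 = CBOW, 1 = Skip-Gram
}

# One dict per combination of values.
configs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]

print(len(configs))  # 12
print(configs[0])    # {'dimensions': 50, 'window': 2, 'sg': 0}
```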


### Other embedding models




### Playing with visualizations ###

We can use a WordCloud to visualize the similar words. You could use the popular WordCloud python library. (And we'll use a nice font from Google Fonts.)

In [0]:
!wget https://storage.googleapis.com/tad2018/OpenSansCondensed-Bold.ttf
!pip install wordcloud

# Below is optional code to create a 'mask' to put the wordcloud into a cool shape.
!wget https://storage.googleapis.com/tad2018/reddit-mask.png

from PIL import Image
from os import path
import os 

reddit_mask = np.array(Image.open(path.join(os.getcwd(), "reddit-mask.png")))

# Optional second mask: uncomment both lines to download and load it.
# !wget https://storage.googleapis.com/tad2018/tad-mask3.png
# tad_mask = np.array(Image.open(path.join(os.getcwd(), "tad-mask3.png")))

In [0]:
terms = model.wv.most_similar(positive=['lit', 'thanks'], topn=1000)
print(terms)

In [0]:
from os import path
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Convert the similarity score into an integer count, N.
# Repeat the word in a space separated string N times.
def terms_to_wordcounts(terms, multiplier=1000):
    <Your Code here>

term_counts = terms_to_wordcounts(terms)

wc = WordCloud(font_path='/content/OpenSansCondensed-Bold.ttf',
                      width=2048,
                      height=2048,
                      max_words=1000,
                      mask=reddit_mask, # optional mask
                      background_color="white").generate(term_counts)

plt.imshow(wc)
plt.axis("off")
# Save before showing: plt.show() may clear the current figure.
plt.savefig("terms1")
plt.show()
plt.close()

In [0]:
#Solution 

from os import path
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def terms_to_wordcounts(terms, multiplier=1000):
  # Parenthesize so the word itself is repeated, not only the trailing space.
  counts = ((i[0] + " ") * int(multiplier * i[1]) for i in terms)
  return  " ".join(counts)


term_counts = terms_to_wordcounts(terms)

wc = WordCloud(font_path='/content/OpenSansCondensed-Bold.ttf',
                      width=2048,
                      height=2048,
                      max_words=1000,
                      mask=reddit_mask, # optional mask
                      background_color="white").generate(term_counts)

plt.imshow(wc)
plt.axis("off")
# Save before showing: plt.show() may clear the current figure.
plt.savefig("terms1")
plt.show()
plt.close()

In [0]:
wc.to_file("terms.png")

from google.colab import files
files.download('terms.png')