#Vector Representations of Words

This lab is devoted to vector representations of words. We will empirically use and evaluate word embeddings.

In NLP systems we need to deal with different kinds of discrete data:
* words
* characters
* part-of-speech tags
* named entities
* items in a product catalog, and so on. 

Although words comprise a finite set -- vocabulary -- representing such discrete types as dense vectors is challenging and has a great impact on the quality of the overall NLP system. 

**Representation learning** and **embedding** refer to learning a mapping from a discrete type to a point in a vector space. When the discrete types are words, the resulting dense vector representation is called a **word embedding**.

Today we will explore:
1. Count-based embeddings: one-hot vectors, Term-Frequency and Term-Frequency-Inverse-Document-Frequency (TF-IDF). 
2. Word embeddings: GloVe and Word2Vec.
3. How to visualize word embeddings.

##Prerequisites

###Data

First, let's download the pre-trained GloVe embeddings that we will use later in this lab. The archive is fairly large, so we start the download now to avoid waiting for it later.

In [None]:
!wget -P data/glove http://nlp.stanford.edu/data/glove.6B.zip
!unzip -d data/glove data/glove/glove.6B.zip

glove_path='data/glove/'

In [None]:
pip install seaborn

##1. Count-based (frequency-based) Embeddings

Traditional methods for creating vector representations of words are called count-based. When working with term frequencies in a corpus, the basis of the vector space is the vocabulary of the corpus. For example, within a sentence or a document, each word is characterised by the number of times it appears there. From these counts we can construct an **occurrence matrix**, in which a **document vector** is created for each sentence/document and has as many entries as there are unique words. Conversely, a **word vector** has as many entries as there are sentences/documents. Count-based representations are also called distributional representations because their significant content or meaning is represented by multiple dimensions in the vector.
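
As a minimal sketch of the occurrence matrix described above, the plain-Python snippet below builds one by hand. The toy corpus and the whitespace tokenizer are assumptions made only for illustration; each row is a document vector and each column is a word vector.

```python
# Toy corpus and naive whitespace tokenizer, both chosen only for illustration
toy_docs = ["the cat sat", "the cat saw the dog"]
tokenized = [doc.split() for doc in toy_docs]

# The vocabulary is the basis of the vector space
toy_vocab = sorted({token for tokens in tokenized for token in tokens})

# Occurrence matrix: one row per document, one column per vocabulary word
matrix = [[tokens.count(word) for word in toy_vocab] for tokens in tokenized]

# A document vector is a row; its size is the number of unique words
doc_vector = matrix[1]

# A word vector is a column; its size is the number of documents
word_vector = [row[toy_vocab.index("the")] for row in matrix]

print(toy_vocab)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(doc_vector)   # [1, 1, 0, 1, 2]
print(word_vector)  # [1, 2]
```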

###One-hot vectors and TF representations



In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns

corpus = ['This cell phone is very small and cheap.', 'This phone is great and works great for a small phone']
one_hot_vectorizer = CountVectorizer(analyzer='word', binary=True)
one_hot = one_hot_vectorizer.fit_transform(corpus)
vocab = one_hot_vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

sns.heatmap(one_hot.toarray(), annot=True, cbar=False, xticklabels=vocab, yticklabels=['Sentence 1','Sentence 2'])

The **TF representation** of a phrase, sentence, or document is simply the sum of the one-hot representations of its constituent words. Notice that each entry is a count of the number of times the corresponding word appears in the sentence (corpus).
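
To make the "sum of one-hot vectors" statement concrete, here is a small hand-rolled sketch. The three-word vocabulary and the sentence are hypothetical, and the helper `one_hot` is introduced only for this illustration.

```python
# Hypothetical toy vocabulary and tokenized sentence, chosen only for illustration
toy_vocab = ["great", "phone", "small"]
toy_sentence = ["great", "phone", "great"]

def one_hot(word, vocabulary):
    """Return a vector with a single 1 at the position of `word`."""
    return [1 if word == entry else 0 for entry in vocabulary]

# Sum the one-hot vectors of all tokens, position by position
one_hot_rows = [one_hot(word, toy_vocab) for word in toy_sentence]
tf_vector = [sum(column) for column in zip(*one_hot_rows)]

print(tf_vector)  # [2, 1, 0]
```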

In [None]:
corpus.append('My cell phone is in a small cell.')
### WRITE YOUR CODE HERE ###
# to-do:
# implement a TF vectorizer (named `vectorizer`) that counts word frequencies for each sentence.

### END OF YOUR CODE ###
vector = vectorizer.fit_transform(corpus)
vocab = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

sns.heatmap(vector.toarray(), annot=True, cbar=False, xticklabels=vocab, yticklabels=['Sentence 1','Sentence 2','Sentence 3'])

######SOLUTION

In [None]:
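# append the third sentence only once, even if the exercise cell above already added it
if len(corpus) < 3:
    corpus.append('My cell phone is in a small cell.')
### WRITE YOUR CODE HERE ###
# to-do:
# implement a TF vectorizer (named `vectorizer`) that counts word frequencies for each sentence.
vectorizer = CountVectorizer(analyzer='word', binary=False)
### END OF YOUR CODE ###
vector = vectorizer.fit_transform(corpus)
vocab = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2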

sns.heatmap(vector.toarray(), annot=True, cbar=False, xticklabels=vocab, yticklabels=['Sentence 1','Sentence 2','Sentence 3'])

###TF-IDF
The **TF-IDF representation** penalizes common tokens and rewards rare tokens in the vector representation. 
In deep learning, it is rare to see inputs encoded using heuristic representations like TF-IDF because the goal is to learn a representation. Often, we start with a one-hot encoding using integer indices and a special “embedding lookup” layer to construct inputs to the neural network.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer 

tfidf_vectorizer = TfidfVectorizer() 
tfidf = tfidf_vectorizer.fit_transform(corpus).toarray() 

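# take the column labels from the TF-IDF vectorizer itself rather than from the earlier count vectorizer
sns.heatmap(tfidf, annot=True, cbar=False, xticklabels=tfidf_vectorizer.get_feature_names_out(), yticklabels=['Sentence 1', 'Sentence 2', 'Sentence 3'])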
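
To see where such weights come from, the sketch below recomputes TF-IDF by hand in plain Python. It assumes the smoothed variant that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. The toy corpus and all names are made up for illustration.

```python
import math

# Toy tokenized corpus, an assumption made only for this illustration
toy_docs = [["small", "cheap", "phone"],
            ["great", "phone"],
            ["small", "cell"]]
n_docs = len(toy_docs)
toy_vocab = sorted({t for doc in toy_docs for t in doc})

# Document frequency: in how many documents does each term occur?
df = {t: sum(t in doc for doc in toy_docs) for t in toy_vocab}

# Smoothed IDF (assumed variant, see the lead-in above)
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in toy_vocab}

def tfidf(doc):
    """Raw term counts weighted by IDF, then scaled to unit L2 norm."""
    weights = [doc.count(t) * idf[t] for t in toy_vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

row = dict(zip(toy_vocab, tfidf(toy_docs[0])))
# "cheap" occurs in one document and "phone" in two, so "cheap" weighs more
print(row["cheap"] > row["phone"])  # True
```

The rarer token receives the larger weight, which is the "penalize common, reward rare" behaviour described above.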

##2. Word Embeddings

Distributed representations make it possible to represent words as much lower-dimensional dense vectors (say d=100, as opposed to the size of the entire vocabulary). Low-dimensional learned dense representations have several benefits over one-hot and count-based vectors:
* reducing the dimensionality is computationally efficient
* count-based representations result in high-dimensional vectors that redundantly encode similar information along many dimensions
* very high dimensions in the input can result in real problems in machine learning and optimization
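
One way to see how a dense vector can replace a one-hot input: multiplying a one-hot vector by a table of weights simply selects one row of that table, which is what an embedding lookup does without the multiplication. The sketch below checks this on a hypothetical four-word vocabulary with made-up three-dimensional vectors.

```python
# Hypothetical 4-word vocabulary with made-up 3-dimensional embeddings
toy_word_to_index = {"cat": 0, "dog": 1, "phone": 2, "small": 3}
embedding_table = [
    [0.1, 0.3, -0.2],
    [0.2, 0.1, -0.1],
    [-0.5, 0.4, 0.9],
    [0.0, -0.7, 0.3],
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec_lookup(word):
    """Multiply the one-hot row vector by the embedding table."""
    x = one_hot(toy_word_to_index[word], len(embedding_table))
    dims = len(embedding_table[0])
    return [sum(x[i] * embedding_table[i][d] for i in range(len(x)))
            for d in range(dims)]

def index_lookup(word):
    """Select the row directly, as a lookup layer would."""
    return embedding_table[toy_word_to_index[word]]

print(matvec_lookup("phone") == index_lookup("phone"))  # True
```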

###Pre-trained embeddings

It is common to use pre-trained embeddings, as they offer good-quality embeddings even though they were trained on general (but large!) corpora (e.g. Wikipedia).
* [GloVe](https://nlp.stanford.edu/projects/glove/)
* [fastText](https://fasttext.cc/docs/en/english-vectors.html)

You have already downloaded English GloVe word embeddings - *Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download): [**glove.6B.zip**](http://nlp.stanford.edu/data/glove.6B.zip)*. The downloaded file was unzipped and put into a subfolder named glove to result in the following file paths: `data/glove/glove.6B.100d.txt` and `data/glove/glove.6B.300d.txt`.

In [None]:
!head -n 1 data/glove/glove.6B.100d.txt #print out first line of the file

We will use the [Annoy](https://github.com/spotify/annoy) library for efficient nearest-neighbor search.

In [None]:
pip install annoy

In [None]:
import torch
import torch.nn as nn
from tqdm import tqdm
from annoy import AnnoyIndex
import numpy as np

To load and process embeddings efficiently, we define a utility class called `PreTrainedEmbeddings`. The class builds an in-memory index of all the word vectors to facilitate quick lookups and nearest-neighbor queries using an approximate nearest-neighbor package, Annoy.
In these examples, we use the GloVe word embeddings. Once they are downloaded, you can load them with the `PreTrainedEmbeddings` class.
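
For intuition about what the index computes, here is a brute-force sketch of the same query: an exact nearest-neighbor search by Euclidean distance over every vector. Annoy answers this kind of query approximately with a tree-based index, which avoids scanning the whole vocabulary. The two-dimensional vectors and the function name are made up for illustration.

```python
import math

# Made-up 2-dimensional vectors standing in for real embeddings
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.85, 0.9],
    "apple": [-0.7, 0.1],
    "pear": [-0.65, 0.2],
}

def nearest_neighbors(query, n=1):
    """Exact search: rank every word by Euclidean distance to `query`."""
    ranked = sorted(toy_vectors, key=lambda w: math.dist(toy_vectors[w], query))
    return ranked[:n]

print(nearest_neighbors([0.88, 0.82], n=2))  # ['king', 'queen']
```

The full scan costs time proportional to the vocabulary size per query, which is why an approximate index pays off for 400K words.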

####Loading embeddings

In [None]:
class PreTrainedEmbeddings(object):
    def __init__(self, word_to_index, word_vectors):
        self.word_to_index = word_to_index
        self.word_vectors = word_vectors
        self.index_to_word = {v: k for k, v in self.word_to_index.items()}

        self.index = AnnoyIndex(len(word_vectors[0]), metric='euclidean')
        print("Building Index!")
        for _, i in self.word_to_index.items():
            self.index.add_item(i, self.word_vectors[i])
        self.index.build(50)
        print("Finished!")
        
    @classmethod
    def from_embeddings_file(cls, embedding_file):
        word_to_index = {}
        word_vectors = []

        with open(embedding_file, encoding='utf-8') as fp:
            ### WRITE YOUR CODE HERE ###
            # to-do:
            # write a loop to read the file with vectors line by line
            # store each word with its corresponding vector (as a numpy array)
            # implement word_to_index which is a dict mapping words to integers
            # implement word_vectors which is a list of numpy arrays
            pass  # placeholder so the skeleton parses; replace it with your loop
            ### END OF YOUR CODE ###
        return cls(word_to_index, word_vectors)
    
    def get_embedding(self, word):
        ### WRITE YOUR CODE HERE ###
        # to-do:
        # return the vector stored for `word`
        raise NotImplementedError  # placeholder; replace it with your code
        ### END OF YOUR CODE HERE ###

######SOLUTION

In [None]:
class PreTrainedEmbeddings(object):
    def __init__(self, word_to_index, word_vectors):
        self.word_to_index = word_to_index
        self.word_vectors = word_vectors
        self.index_to_word = {v: k for k, v in self.word_to_index.items()}

        self.index = AnnoyIndex(len(word_vectors[0]), metric='euclidean')
        print("Building Index!")
        for _, i in self.word_to_index.items():
            self.index.add_item(i, self.word_vectors[i])
        self.index.build(50)
        print("Finished!")
        
    @classmethod
    def from_embeddings_file(cls, embedding_file):
        word_to_index = {}
        word_vectors = []

        with open(embedding_file, encoding='utf-8') as fp:
            ### WRITE YOUR CODE HERE ###
            # to-do:
            # write a loop to read the file with vectors line by line
            # store each word with its corresponding vector (as a numpy array)
            # implement word_to_index which is a dict mapping words to integers
            # implement word_vectors which is a list of numpy arrays
            for line in fp:
                # each line holds a word followed by its vector components, separated by spaces
                fields = line.rstrip().split(" ")
                word = fields[0]
                vec = np.array([float(x) for x in fields[1:]])
                
                word_to_index[word] = len(word_to_index)
                word_vectors.append(vec)
            ### END OF YOUR CODE ###
        return cls(word_to_index, word_vectors)
    
    def get_embedding(self, word):
        ### WRITE YOUR CODE HERE ###
        # to-do:
        # return the vector stored for `word`
        return self.word_vectors[self.word_to_index[word]]
        ### END OF YOUR CODE HERE ###

####Retrieving vectors of words
Load the pre-trained embeddings and look up the vector for a given word.

In [None]:
embeddings = PreTrainedEmbeddings.from_embeddings_file(glove_path + 'glove.6B.300d.txt')

word_exp = 'the'
print("word: {} \nvector: \n{}".format(word_exp, embeddings.get_embedding(word_exp)))

Building Index!
Finished!
word: the 
vector: 
[ 4.6560e-02  2.1318e-01 -7.4364e-03 -4.5854e-01 -3.5639e-02  2.3643e-01
 -2.8836e-01  2.1521e-01 -1.3486e-01 -1.6413e+00 -2.6091e-01  3.2434e-02
  5.6621e-02 -4.3296e-02 -2.1672e-02  2.2476e-01 -7.5129e-02 -6.7018e-02
 -1.4247e-01  3.8825e-02 -1.8951e-01  2.9977e-01  3.9305e-01  1.7887e-01
 -1.7343e-01 -2.1178e-01  2.3617e-01 -6.3681e-02 -4.2318e-01 -1.1661e-01
  9.3754e-02  1.7296e-01 -3.3073e-01  4.9112e-01 -6.8995e-01 -9.2462e-02
  2.4742e-01 -1.7991e-01  9.7908e-02  8.3118e-02  1.5299e-01 -2.7276e-01
 -3.8934e-02  5.4453e-01  5.3737e-01  2.9105e-01 -7.3514e-03  4.7880e-02
 -4.0760e-01 -2.6759e-02  1.7919e-01  1.0977e-02 -1.0963e-01 -2.6395e-01
  7.3990e-02  2.6236e-01 -1.5080e-01  3.4623e-01  2.5758e-01  1.1971e-01
 -3.7135e-02 -7.1593e-02  4.3898e-01 -4.0764e-02  1.6425e-02 -4.4640e-01
  1.7197e-01  4.6246e-02  5.8639e-02  4.1499e-02  5.3948e-01  5.2495e-01
  1.1361e-01 -4.8315e-02 -3.6385e-01  1.8704e-01  9.2761e-02 -1.1129e-01
 -4.2

####Word Analogy task

A notable property of word embeddings is that they encode syntactic and semantic relationships. We can explore these relationships with a word analogy task, one of the most popular evaluation methods. In this task, you are given the first three words and need to determine the fourth. Despite its simplicity, the task demonstrates that word embeddings capture a variety of semantic and syntactic relationships.

Now, let us implement the word analogy logic to test our pre-trained embeddings. We exploit geometric properties of word embeddings, for example:

**king - man + woman = queen**


In [None]:
def get_closest_to_vector(self, vector, n=1):
    nn_indices = self.index.get_nns_by_vector(vector, n)
    return [self.index_to_word[neighbor] for neighbor in nn_indices]

def word_analogy(self, word1, word2, word3):
    existing_words = set([word1, word2, word3])
    vec1 = self.get_embedding(word1)

    ### WRITE YOUR CODE HERE ###
    # to-do:
    # get the vectors vec2 and vec3, then compute vec4
    # get closest_words for vec4 (at most 4) and make sure they are not the same as word1, word2 or word3
    ### END OF YOUR CODE HERE ###

    print_analogy(self, word1, word2, word3, closest_words)

def print_analogy(self, word1, word2, word3, closest_words):
    if len(closest_words) == 0:
        print("Could not find nearest neighbors for the computed vector!")
        return        
    for word4 in closest_words:
        print("{} : {} \t {} : {}".format(word1, word2, word3, word4))

######SOLUTION

In [None]:
def get_closest_to_vector(self, vector, n=1):
    nn_indices = self.index.get_nns_by_vector(vector, n)
    return [self.index_to_word[neighbor] for neighbor in nn_indices]

def word_analogy(self, word1, word2, word3):
    existing_words = set([word1, word2, word3])
    vec1 = self.get_embedding(word1)

    ### WRITE YOUR CODE HERE ###
    # to-do:
    # get the vectors vec2 and vec3, then compute vec4
    # get closest_words for vec4 (at most 4) and make sure they are not the same as word1, word2 or word3
    vec2 = self.get_embedding(word2)
    vec3 = self.get_embedding(word3)
    vec4 = vec2 - vec1 + vec3

    closest_words = get_closest_to_vector(self, vec4, n=4)        
    closest_words = [word for word in closest_words if word not in existing_words]
    ### END OF YOUR CODE HERE ###

    print_analogy(self, word1, word2, word3, closest_words)

def print_analogy(self, word1, word2, word3, closest_words):
    if len(closest_words) == 0:
        print("Could not find nearest neighbors for the computed vector!")
        return        
    for word4 in closest_words:
        print("{} : {} \t {} : {}".format(word1, word2, word3, word4))

#####Examples
It is time to perform the word analogy task using our pre-trained embeddings.

In [None]:
word_analogy(embeddings, 'man', 'he', 'woman')
word_analogy(embeddings, 'cat', 'kitten', 'dog')
word_analogy(embeddings, 'car', 'cars', 'bicycle')

man : he 	 woman : she
man : he 	 woman : her
cat : kitten 	 dog : puppy
cat : kitten 	 dog : rottweiler
cat : kitten 	 dog : pooch
car : cars 	 bicycle : bicycles
car : cars 	 bicycle : bikes
car : cars 	 bicycle : bike
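
The arithmetic behind such results can be checked on toy vectors. In the sketch below the two-dimensional vectors are invented so that the first axis loosely means "royalty" and the second "gender"; the function `analogy` is a hypothetical stand-alone helper, not part of the class above.

```python
import math

# Made-up 2-d vectors: first axis ~ "royalty", second axis ~ "gender"
toy = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.0],
}

def analogy(word1, word2, word3):
    """Solve word1 : word2 :: word3 : ? with word2 - word1 + word3."""
    target = [b - a + c for a, b, c in zip(toy[word1], toy[word2], toy[word3])]
    # exclude the three input words, as in the notebook's word_analogy
    candidates = [w for w in toy if w not in {word1, word2, word3}]
    return min(candidates, key=lambda w: math.dist(toy[w], target))

print(analogy("man", "king", "woman"))  # queen
```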


###Training Word2Vec embeddings

However, if we are implementing an NLP system that is designed to operate in a domain where general embeddings won't perform well, then we can train word embeddings using our own corpus.

####Data
The most important thing is data. We'll use five volumes of Game of Thrones from https://github.com/nihitx/game-of-thrones

We will download `got.5books.clean.txt` that is already cleaned.

In [None]:
file_share_link = "https://drive.google.com/open?id=1rNp15OFYZiGWvw1wx_TCa3BZoZ6MjvkQ"
file_download_link = "https://docs.google.com/uc?export=download&id=" + file_share_link[file_share_link.find("=") + 1:]

got_path='data/got/'
# create the target folder first (-p: no error if it already exists), then download straight into it
!mkdir -p $got_path
!wget --no-check-certificate "$file_download_link" -O "data/got/got.5books.clean.txt"


In [None]:
dataFile = got_path + "got.5books.clean.txt"  # got_path already ends with a slash

with open(dataFile, 'rb') as f:
    for line in f:
        print(line)
        break

b'"We should start back," Gared urged as the woods began to grow dark around them. "The wildlings are dead."\n'


####Pre-processing data

We read the data into a list so that we can pass it to the Word2Vec model. We first pre-process each line with `gensim.utils.simple_preprocess(sentence)`, which lowercases the text, ignores tokens that are too short or too long, and returns a list of tokens (words). This method is described in the official [Gensim documentation](https://radimrehurek.com/gensim/utils.html).
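
To get a feel for what this pre-processing does, here is a rough standard-library approximation. It is not gensim's implementation: the function name and the regular expression are assumptions, and the length bounds of 2 and 15 mirror the defaults documented for `simple_preprocess`.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, keep alphabetic runs, drop very short or very long tokens."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

line = '"We should start back," Gared urged as the woods began to grow dark around them.'
print(rough_preprocess(line))
```

Punctuation and quotes disappear, and every remaining token is lowercased.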

In [None]:
import gensim

# Write a function `readInput(inputFile)` that reads a file and applies the `simple_preprocess`
def readInput(inputFile):
    """Method to read the input file"""
    ## CODE HERE
    

# read the tokenized file into a list (sentences) of lists (tokens) named `sentences`
sentences = list(readInput(dataFile))

# print some examples
print(sentences[0])

######SOLUTION

In [None]:
import gensim

# Write a function `readInput(inputFile)` that reads a file and applies the `simple_preprocess`
def readInput(inputFile):
    """Method to read the input file"""
    ## CODE HERE
    lines = []
    with open(inputFile, 'rb') as f:
        for line in f:
            preproc_line = gensim.utils.simple_preprocess(line)
            lines.append(preproc_line)
    return lines

# read the tokenized file into a list (sentences) of lists (tokens) named `sentences`
sentences = list(readInput(dataFile))

# print some examples
print(sentences[0])

['we', 'should', 'start', 'back', 'gared', 'urged', 'as', 'the', 'woods', 'began', 'to', 'grow', 'dark', 'around', 'them', 'the', 'wildlings', 'are', 'dead']


####Training embeddings

Training the model is straightforward: you create a `Word2Vec` instance and pass it the data, a list (sentences) of lists (tokens) for the complete corpus. Word2Vec uses all these tokens to create a vocabulary.

When the sentences are passed to the constructor, gensim builds the vocabulary and runs an initial round of training right away; calling `train(...)` afterwards continues training for additional epochs. Remember that you are training a simple neural network with a single hidden layer. However, we are not going to use the network itself after training. Instead, the goal is to learn the weights of the hidden layer, which are essentially the word vectors that we are trying to learn.
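
To illustrate what the network is trained on, the sketch below generates CBOW-style training examples, where the words inside a window around each position are used to predict the word at that position. This is a simplified picture with a hypothetical helper; the actual gensim implementation adds refinements that are not shown here.

```python
def cbow_examples(tokens, window=2):
    """Yield (context words, target word) pairs for one sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target

toy_sentence = ["the", "woods", "began", "to", "grow", "dark"]
for context, target in cbow_examples(toy_sentence, window=2):
    print(context, "->", target)
```

With `window=5`, as in the cell below, up to five words on each side of the target would form the context.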

In [None]:
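# Define a basic Word2Vec model (gensim.models.Word2Vec) with CBOW (sg=0, the default).
# Passing `sentences` to the constructor builds the vocabulary and runs an initial training pass.
w2v_model = gensim.models.Word2Vec(sentences, window=5, sg=0)
# Continue training (model.train) on `sentences` for 10 more epochs
w2v_model.train(sentences, total_examples=len(sentences), epochs=10)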

(7515102, 10107440)

Several functions let us explore the results. `most_similar` returns the words most similar to a given input word (10 by default). `similarity` returns the similarity between two words that are present in the vocabulary. `doesnt_match` returns the word that fits least well in a list of words. Let's play with these functions.

In [None]:
# Choose a word to see the 10 closest words with `most_similar`
w1 = "throne"
w2v_model.wv.most_similar(positive=w1)

What happens if the word is not in the vocabulary? We are using a tiny corpus in a specific domain...

In [None]:
# Choose a word that you think does not belong to the corpus to see the 10 closest words
## CODE HERE
w2 = "google"
w2v_model.wv.most_similar(positive=w2,topn=10)

You can also specify several positive examples to get things that are related in the provided context and provide negative examples to say what should not be considered as related with `most_similar(positive=w1,negative=w2,topn=n)`.

In [None]:
# get everything related to some word
w3 = ["pillow", "sheet", "blanket", "bed"]
w4 = ["dirt"]
w2v_model.wv.most_similar(positive=w3, negative=w4, topn=10)

In [None]:
# try what happens without the negative constraint
w2v_model.wv.most_similar(positive=w3, topn=10)

Now let's calculate some similarities.

In [None]:
# similarity between two different words
w2v_model.wv.similarity("cushion", "pillow")

In [None]:
# similarity between two identical words
## CODE HERE
w2v_model.wv.similarity("pillow", "pillow")

In [None]:
# similarity between two unrelated words
## CODE HERE
w2v_model.wv.similarity("pillow", "dragon")

Under the hood, the three snippets above compute the cosine similarity between the vectors of the two given words. We expect related words such as `cushion` and `pillow` to score higher than unrelated words such as `pillow` and `dragon`. The similarity of a word with itself is 1.0, which is the maximum: cosine similarity ranges over [-1.0, 1.0], where negative values indicate vectors pointing in opposite directions. You can read more about cosine similarity scoring [here](https://en.wikipedia.org/wiki/Cosine_similarity).
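
As a minimal sketch, cosine similarity can be computed by hand as the dot product of two vectors divided by the product of their lengths. The vectors below are arbitrary and chosen so that the results are exact.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))    # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # 0.0  (orthogonal)
print(cosine_similarity([3.0, 4.0], [-3.0, -4.0]))  # -1.0 (opposite direction)
```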

You can use Word2Vec to find odd items given a list of items with `doesnt_match`.

In [None]:
# Define a list of words and look for the strange word
# Which one is the odd one out in this list?
w5 = ["pillow", "sheet", "queen", "blanket", "bed"]
w2v_model.wv.doesnt_match(w5)
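
One plausible way to find the odd word is to compare every word with the mean direction of the whole list and return the least similar one. The sketch below follows that idea on made-up two-dimensional vectors; the function name is hypothetical and gensim's own implementation may differ in detail.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """Return the word whose vector is least similar to the mean vector."""
    normed = {w: unit(v) for w, v in vectors.items()}
    dims = len(next(iter(normed.values())))
    mean = unit([sum(v[d] for v in normed.values()) / len(normed)
                 for d in range(dims)])
    # for unit vectors the dot product equals the cosine similarity
    return min(normed, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))

# Made-up vectors: three bedding words point one way, "queen" another
toy = {
    "pillow": [0.9, 0.1],
    "sheet": [0.8, 0.2],
    "blanket": [0.85, 0.15],
    "queen": [0.1, 0.9],
}
print(odd_one_out(toy))  # queen
```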

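We can also use `gensim` to load pre-trained embeddings. The word2vec text format expects a first line holding the vocabulary size and the vector dimensionality, so we prepend such a header to the GloVe file before loading it.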

In [None]:
!wc -l data/glove/glove.6B.100d.txt
!(echo 400000$'\t'100 ; cat 'data/glove/glove.6B.100d.txt') > 'data/glove/glove-gensim.6B.100d.txt'
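
The same conversion can be sketched in Python, which avoids hard-coding the two header numbers. The function name and the tiny temporary file are assumptions made for illustration; the function counts the lines and the vector dimensions itself and writes them as the first line.

```python
import os
import tempfile

def add_word2vec_header(glove_file, output_file):
    """Copy a GloVe text file, prepending '<vocab size> <dimensions>'."""
    # first pass: count the words and read the dimensionality from line one
    with open(glove_file, encoding="utf-8") as src:
        first = src.readline()
        n_dims = len(first.split()) - 1  # the first field is the word itself
        n_words = 1 + sum(1 for _ in src)
    # second pass: write the header, then copy the vectors unchanged
    with open(glove_file, encoding="utf-8") as src, \
            open(output_file, "w", encoding="utf-8") as dst:
        dst.write(f"{n_words} {n_dims}\n")
        for line in src:
            dst.write(line)
    return n_words, n_dims

# Tiny made-up file in the GloVe layout, used only to exercise the function
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "toy.glove.txt")
dst_path = os.path.join(tmp_dir, "toy.word2vec.txt")
with open(src_path, "w", encoding="utf-8") as f:
    f.write("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")

header = add_word2vec_header(src_path, dst_path)
print(header)  # (2, 3)
```

On the real `glove.6B.100d.txt` file this should report 400000 words and 100 dimensions, the values hard-coded in the shell command.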

In [None]:
from gensim.models import KeyedVectors
gloveModel='data/glove/glove-gensim.6B.100d.txt'
model_glove = KeyedVectors.load_word2vec_format(gloveModel, binary=False)

In [None]:
# find the similarity between two words. 
# Use the same examples as before and also some examples with out-of-domain vocabulary. I'm sure the word "phone" was
# not in the vocabulary before!
## CODE HERE

# get everything related to some word
## CODE HERE

# The famous (king - man) + woman
## CODE HERE

######SOLUTION

In [None]:
# find the similarity between two words. 
# Use the same examples as before and also some examples with out-of-domain vocabulary. I'm sure the word "phone" was
# not in the vocabulary before!

## CODE HERE
# model_glove is a KeyedVectors object, so its methods are called directly (no `.wv`)
print(model_glove.similarity("cushion", "pillow"))

# get everything related to some word
## CODE HERE
w3 = ["pillow", "sheet", "blanket", "bed"]
w4 = ["dirt"]
print(model_glove.most_similar(positive=w3, negative=w4, topn=10))

# The famous (king - man) + woman
## CODE HERE
print(model_glove.most_similar(positive=['woman', 'king'], negative=['man'], topn=5))


##3. Visualization of Word Embeddings

Finally, we will visualise the n-dimensional word embeddings by projecting them down to 2-dimensional x,y coordinate pairs. Several techniques exist (PCA, t-SNE, etc). 

###PCA
Below we use the `PCA` class from `sklearn.decomposition`.

In [None]:
# Imports needed for the visualisation
from sklearn.decomposition import PCA
from matplotlib import pyplot
%matplotlib inline

# fit a 2d PCA model to the vectors
# the rows of wv.vectors follow the order of wv.index_to_key (gensim >= 4.0 API)
X = w2v_model.wv.vectors
pca = PCA(n_components=2)
result = pca.fit_transform(X)

# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])

# add the labels to the plot
words = list(w2v_model.wv.index_to_key)
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))

# look at the plot
pyplot.show()

Too much information. Let's select only a subset of words.

In [None]:
# Select what we want to see (the `most_similar` words to something, for instance)
setToPlot = w2v_model.wv.most_similar(positive='throne', topn=10)

# Look up the projected coordinates of the desired words only, and store them as vectorX and vectorY
vectorX =  []
vectorY =  []
words = []
for word, sim in setToPlot:
    i = w2v_model.wv.key_to_index[word]  # row of `word` in wv.vectors, and therefore in `result`
    words.append(word)
    vectorX.append(result[i,0])
    vectorY.append(result[i,1])

# create the scatter plot for these words
pyplot.scatter(vectorX, vectorY)

# add the labels
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(vectorX[i], vectorY[i]))

# look at the plot
pyplot.show()

#References
* This tutorial is partly inspired by the book: "Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning" by Delip Rao and Brian McMahan.