**Homework due date: 04/05/2019 23:59**

# Introduction

In this homework you will explore the properties of word embeddings using a pre-trained embedding. Then you will build and train your own embedding using the skip-gram method.

Here are some reading materials:

1. [Learning representations by back-propagating errors](https://www.nature.com/articles/323533a0)

Recent Turing Award winner Geoffrey Hinton and coworkers first introduced the concept of word embeddings in their 1986 Nature paper.

2. [word2vec](https://code.google.com/archive/p/word2vec/)

Google's word2vec project, which is built on skip-gram and Google News data.

3. [Efficient Estimation of Word Representations in
Vector Space](https://arxiv.org/pdf/1301.3781.pdf)

   [Distributed Representations of Words and Phrases
and their Compositionality](https://arxiv.org/pdf/1310.4546.pdf)

Tomas Mikolov and colleagues at Google published these two papers in 2013, proposing the skip-gram approach to word embedding, which has become one of the most popular word embedding methods.

4. [On word embeddings](http://ruder.io/word-embeddings-1/index.html)

An online blog by DeepMind engineer Sebastian Ruder explaining skip-gram. I found it easier to understand than the original papers.

# Required packages

In [None]:
%matplotlib inline
import numpy as np  # used by plot_2D in the word clustering section
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn, keras

# Play with pretrained embedding

Before we start training our own word embedding, let's play with a pretrained one, so you know what to expect from your own models. Here we use a very popular embedding called [GloVe](https://nlp.stanford.edu/projects/glove/), developed at Stanford University. The method used to produce this embedding is based on factorizing a word-word co-occurrence matrix. Note that this method is quite different from the skip-gram method we are going to implement later.

First let's load the embedding as a Pandas DataFrame.

In [None]:
glove = pd.read_csv("glove_6B_100d_top100k.csv"); glove.head()

## Find nearest words
One of the many reasons people are interested in word embeddings is that they reveal similarities between words. Let's first check how this works with GloVe.
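
The two distance functions used below can rank neighbours differently. Here is a minimal sketch on made-up 2-D vectors (the words `a`, `b`, `c`, their values, and the helper names `euclidean` and `cosine` are illustrative assumptions, not GloVe data). Euclidean distance is sensitive to vector length, while cosine distance only compares directions. sklearn's `euclidean_distances` and `cosine_distances` compute the same quantities for all pairs of rows at once.

```python
import numpy as np

# Hypothetical toy vectors; the values are made up for illustration.
toy = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([3.0, 0.0]),   # same direction as "a", three times longer
    "c": np.array([0.8, 0.6]),   # same length as "a", different direction
}

def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(u - v))

def cosine(u, v):
    """One minus the cosine of the angle between two vectors."""
    return float(1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query = toy["a"]
by_euclidean = sorted(toy, key=lambda w: euclidean(query, toy[w]))
by_cosine = sorted(toy, key=lambda w: cosine(query, toy[w]))

print("euclidean order:", by_euclidean)
print("cosine order:   ", by_cosine)
```

In this toy case `c` is closer to `a` than `b` under Euclidean distance, but `b` is closer under cosine distance because it points in exactly the same direction as `a`.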

In [None]:
from sklearn.metrics.pairwise import euclidean_distances, cosine_distances

def find_nearest(embedding, word=None, n=5, distance=euclidean_distances):
    """
    For given embedding matrix and a given word, find the n nearest words in the embedding space
    
    input:
        embedding: DataFrame, look at `glove` 
        word: string, must be in the index of embedding dataframe
        n: int, number of nearest words
        distance: function, it should at least support euclidean_distances and cosine_distances
        
    return:
        A Series with words as index and distances as values, sorted from low to high
    """
    """
    Write your code here
    """

In [None]:
print("Using euclidean_distances, the closest words to lion are:")
print(find_nearest(glove, 'lion'))
print("Using cosine_distances, the closest words to lion are:")
print(find_nearest(glove, 'lion', distance=cosine_distances))

What have you observed? Does the result make sense to you? Play with some other words, and see if you can find something interesting. Try countries and numbers :).

## Find nearest words with vector
Remember that at the beginning of the course we advertised the ability of word embeddings to capture relative relationships between words, such as king - male + female = queen. Let's test this with the embedding we have. Before that, we need a function similar to find_nearest that takes an embedding vector instead of a word as input.
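
To make the arithmetic concrete, here is a small sketch on a hand-built 2-D embedding. The vectors are assumptions chosen so that the analogy holds exactly (axis 0 loosely stands for "royalty", axis 1 for "gender"); they are not taken from GloVe.

```python
import numpy as np

# Hypothetical 2-D embedding; the values are made up so the analogy is exact.
toy = {
    "king":   np.array([1.0,  1.0]),
    "queen":  np.array([1.0, -1.0]),
    "male":   np.array([0.0,  1.0]),
    "female": np.array([0.0, -1.0]),
    "apple":  np.array([-1.0, 0.0]),
}

# The analogy is plain vector arithmetic on the embedding vectors.
target = toy["king"] - toy["male"] + toy["female"]

# Rank every word by its Euclidean distance to the resulting vector.
ranked = sorted(toy, key=lambda w: float(np.linalg.norm(toy[w] - target)))

print("target vector:", target)
print("nearest word: ", ranked[0])
```

In a real embedding the match is only approximate, and the query words themselves often show up among the nearest neighbours, so look at the first few results rather than only the top one.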

In [None]:
def find_nearest_with_vector(embedding, vector=None, n=5, distance=euclidean_distances):
    """
    For given embedding matrix and a given vector, find the n nearest words in the embedding space
    
    input:
        embedding: DataFrame, look at `glove` 
        vector: Series, looks like a column vector of the embedding dataframe
        n: int, number of nearest words
        distance: function, it should at least support euclidean_distances and cosine_distances
        
    return:
        A Series with words as index and distances as values, sorted from low to high
    """
    """
    Write your code here
    """

In [None]:
find_nearest_with_vector(glove, glove['king']-glove['male']+glove['female'])

In [None]:
find_nearest_with_vector(glove, glove['china']+glove['capital'])

What did you see? Can you explore some other interesting relations, such as countries vs. cities?

## Word clustering

Another feature of word embeddings is that they group similar words into the same cluster while keeping semantic relationships with other clusters. Try the following dimension reduction code:
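
As a rough illustration of what the dimension reduction does, the sketch below performs PCA by hand with NumPy on synthetic data (the two "word groups", their sizes, and all variable names are assumptions for illustration). It centres the data and projects it onto the two directions of largest variance, which should agree with sklearn's `PCA(n_components=2).fit_transform` up to the sign of each axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 6 "words" with 10-dimensional vectors forming two loose groups.
group_a = rng.normal(loc=0.0, scale=0.1, size=(3, 10))
group_b = rng.normal(loc=1.0, scale=0.1, size=(3, 10))
X = np.vstack([group_a, group_b])

# PCA by hand: centre the data, then project onto the top-2 right singular vectors.
X_centered = X - X.mean(axis=0)
_, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_2d = X_centered @ Vt[:2].T

print(X_2d.shape)
print(np.round(X_2d[:, 0], 2))  # first principal component of each "word"
```

The first component separates the two groups: the three vectors of one group get values of one sign and the other three get the opposite sign.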

In [None]:
from sklearn.decomposition import PCA

def plot_2D(X, labels):
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = 0.1 + 0.8 * (X - x_min) / (x_max - x_min)

    plt.figure(figsize=(10, 8))
    for x, lab in zip(X, labels):
        plt.text(x[0], x[1], str(lab), fontdict={'size': 14})
        
def plot_words_embedding(embedding, words):
    X = PCA(n_components=2).fit_transform(embedding[words].transpose())
    plot_2D(X, words)

In [None]:
words = ['china', 'beijing', 'russia', 'moscow', 'poland', 'warsaw', 'japan', 'tokyo',
        'france', 'paris', 'germany', 'berlin', 'italy', 'rome', 'spain', 'madrid']

plot_words_embedding(glove, words)

Have you spotted something interesting? Try some other sets of words and see what you can find.

# Skip-gram

## Load the training data

In [None]:
from tools import load_data, show_model

text = load_data()

print("Number of summaries: ", len(text))
print("Number of words:", len([w for s in text for w in s]))
print("Vocabulary size:", len({w for s in text for w in s}))

There are about 200K unique words in this corpus. To make training computationally feasible, we will reduce the size of the vocabulary in the next step.
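
To see how a frequency threshold shrinks a vocabulary, here is a minimal sketch using `collections.Counter` on a made-up corpus. The corpus `toy_text` and the threshold `toy_min_count` are illustrative assumptions; the sketch only counts and filters words and does not build the full encoder asked for below.

```python
from collections import Counter

# Hypothetical tokenized corpus, in the same list-of-token-lists shape as `text`.
toy_text = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]

# Count how often each word appears across all documents.
counts = Counter(w for s in toy_text for w in s)

toy_min_count = 2  # illustrative threshold, much smaller than a realistic one
kept = sorted(w for w, c in counts.items() if c >= toy_min_count)

print("full vocabulary size:   ", len(counts))
print("reduced vocabulary size:", len(kept))
print("kept words:", kept)
```

Words that fall below the threshold are not dropped from the text itself; they would all be mapped to a single shared "unknown" index when the text is encoded.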

## Encode the text

In [None]:
MIN_COUNT = 20
def create_encoder(text, min_count=20):
    """
    - Create an encoder, which is a dictionary like {word: index}
    - To reduce the vocabulary size, you can remove 
    the words that appear fewer than min_count times in the entire
    corpus
    - Enforce {'_unknown_': 0}
    
    input:
        text: list of token list, e.g. [['i', 'am', 'fine'], ['another', 'summary'], ...]
    returns:
        tokenmap:  encoder dictionary
        tokenmap_reverse: reversed tokenmap {index: word} to facilitate inverse lookup
    """
    
    """
    Write your code here
    """

    return tokenmap, tokenmap_reverse

In [None]:
tokenmap, tokenmap_reverse = create_encoder(text, MIN_COUNT)
VOCAB_SIZE = len(tokenmap)
print("the reduced vocabulary size is:", VOCAB_SIZE)

In [None]:
# Encode the text using the encoder you just created
def encode(text, tokenmap, default=0):
    return [[tokenmap.get(t, default) for t in s] for s in text]

text_encoded = encode(text, tokenmap)

## Construct training context pairs

To generate training data, we need to find word-context pairs in the encoded text. 
We also want to generate some negative samples, so the input and output may look like this:

for input corpus: [[2, 3, 1, 2]] 

returns: [[word, context, label]]

[[2, 3, 1], [2, 1, 1], [2, 2, 1], [3, 1, 1], ...., [4, 2, 0], [4, 3, 0], ...]

Note that in practice the sequence should be shuffled.
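
The idea can be sketched in plain Python for a single document. The helper `toy_skipgram_pairs` below is hypothetical and is not the Keras implementation: it emits one positive triple for every word inside the window and, as an assumption, one random negative triple per positive, drawing the random index from 1 to `vocab_size - 1` so that index 0 stays reserved for unknown words.

```python
import random

def toy_skipgram_pairs(seq, window_size, vocab_size, seed=0):
    """Hypothetical helper: build [word, context, label] triples for one document."""
    rng = random.Random(seed)
    triples = []
    for i, word in enumerate(seq):
        lo = max(0, i - window_size)
        hi = min(len(seq), i + window_size + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            # Positive sample: a word that really occurs inside the window.
            triples.append([word, seq[j], 1])
            # Negative sample: a randomly drawn index used as a fake context.
            triples.append([word, rng.randrange(1, vocab_size), 0])
    rng.shuffle(triples)  # mix positives and negatives
    return triples

pairs = toy_skipgram_pairs([2, 3, 1, 2], window_size=1, vocab_size=5)
positives = sorted(p[:2] for p in pairs if p[2] == 1)
print(positives)
print(len(pairs), "triples in total")
```

Because negatives are drawn at random, a negative triple can occasionally coincide with a true pair; with a realistic vocabulary size this is rare enough to ignore.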

In [None]:
from keras.preprocessing.sequence import skipgrams

def training_data_generator(text_encoded, window_size=4, negative_samples=1.0, batch_docs=50):
    """
    For given encoded text, return 3 np.array:
    words, contexts, labels
    Do not pair a word with a context from a different document.
    
    input: 
        text_encoded: list of list of int, each list of int is the numerical encoding of the doc
        window_size: int, define the context
        negative_samples: float, how much negative sampling you need, normally 1.0
        batch_docs: int, number of docs for which it generates one return
        
    yield:
        words: list of int, the numerical encoding of the central words
        contexts: list of int, the numerical encoding of the context words
        labels: list of int, 1 or 0
        
    hint: 
    1. You can use skipgrams method from keras
    2. For training purposes, words and contexts need to be 2D arrays, with shape (N, 1), 
       but labels is a 1D array, with shape (N, )
    3. The output can be very big, so you SHOULD use a generator
    """
    
    """
    Write your code here
    """

## Construct Learning Model

Now we need to create a network that looks like this:
<img src="skip-gram-NN.png" width="480">
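
The figure is not reproduced here, so the following NumPy sketch rests on an assumption: that the network looks up an embedding for the word and for the context in one shared table, takes their dot product, and passes it through a sigmoid, which fits the `binary_crossentropy` loss used for training below. All names and sizes (`E`, `toy_vocab_size`, `toy_dim`, `predict`) are illustrative, and this is only a forward pass, not the Keras model you are asked to build.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_dim = 5, 3  # illustrative sizes

# Assumption: a single embedding table shared by words and contexts.
E = rng.normal(scale=0.1, size=(toy_vocab_size, toy_dim))

def predict(words, contexts):
    """Probability that each (word, context) pair is a true pair."""
    dots = np.sum(E[words] * E[contexts], axis=1)  # one dot product per pair
    return 1.0 / (1.0 + np.exp(-dots))             # sigmoid

def binary_crossentropy(labels, probs):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    return float(-np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)))

words = np.array([2, 2, 4])
contexts = np.array([3, 1, 2])
labels = np.array([1, 1, 0])

probs = predict(words, contexts)
loss = binary_crossentropy(labels, probs)
print(probs.shape, round(loss, 4))
```

With small random embeddings every probability starts near 0.5, so the initial loss is close to log(2), roughly 0.69. Training should push the loss below that value.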

In [None]:
"""
Write your code here
"""

## Train your model 

The following is a simple training loop built on train_on_batch. You do not need 
more than 10 epochs, since the loss will soon start oscillating around the minimum. If you want 
to further improve your training, consider gradually increasing the batch size or reducing 
the learning rate; then you can try more than 10 epochs.

In [None]:
model.compile(loss="binary_crossentropy", optimizer="rmsprop")

epochs = 10
ntot = 0
for epoch in range(epochs):
    print("Epoch %d ======" % epoch)
    for words, contexts, labels in training_data_generator(text_encoded, batch_docs=50):
        loss = model.train_on_batch(x=[words, contexts], y=labels)
        ntot += len(words)
        print("Total trained pairs (M): %10.2f ; \t loss: %.4f" % (ntot/1e6, loss))

## Transform the embedding to a table

Ready to translate the model you trained into the embedding DataFrame?

In [None]:
def embedding2df(embedding_layer, tokenmap_reverse):
    return (pd.DataFrame(embedding_layer.get_weights()[0], 
                        tokenmap_reverse.values())
              .drop("_unknown_", errors='ignore')
              .transpose())

skip = embedding2df(model.layers[2], tokenmap_reverse)

## Test your trained embedding

Using the embedding you just trained, repeat the exploration you did for Section 3.

# Deliverables:

- pdf version of your final notebook
- Discuss the questions in Section 3
- If you have done any work to improve the model and model training, explain it.

# Final project (Not due this week)

**Work with your teammates and start working on your final project proposal, think about these questions:**
- The problem you are trying to solve and the value of solving it
- Some current solutions to this problem, with citations if needed
- Outline your approach and the goal you want to achieve
