# (Word Embedding Training, Visualization, Evaluation)
Word Vectors are often used as a fundamental component for downstream NLP tasks, e.g. question answering, text generation, translation, etc., so it is important to build some intuitions as to their strengths and weaknesses.  <br>
**Note on Terminology:**
The terms "word vectors" and "word embeddings" are often used interchangeably. The term "embedding" refers to the fact that we are encoding aspects of a word's meaning in a lower dimensional space. As Wikipedia states, "conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension".

# Collect Data
The dataset contains 15 sentences.

In [0]:
import numpy as np

In [0]:
corpus = ['king is a strong man', 
          'queen is a wise woman', 
          'boy is a young man',
          'girl is a young woman',
          'prince is a young king',
          'princess is a young queen',
          'man is strong', 
          'woman is pretty',
          'prince is a boy will be king',
          'princess is a girl will be queen',
          'prince is a strong boy',
          'boy is strong',
          'girl is pretty',
          'girl will be woman',
          'boy will be man']


# Tokenization
For many NLP tasks, the first step is to tokenize the raw text into lists of words. Suppose you have *text = "king is a strong man"*. You can use *text.split(" ")* to break the sentence into a list of words; the output is ['king', 'is', 'a', 'strong', 'man'].
Write and run your code in the next cell to tokenize all the sentences. <br>
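As a small illustration of the splitting step, the sketch below compares `split(" ")` with the argument-free `split()` on a sentence that has irregular spacing. The sample string is made up for this example; the corpus in this notebook uses single spaces only, so both calls behave the same on it.

```python
text = "king is  a strong man "  # made-up input: note the double space and the trailing space

# split(" ") cuts at every single space, so repeated or trailing spaces
# leave empty strings in the result.
tokens_single_space = text.split(" ")

# split() with no argument cuts on any run of whitespace and drops empty strings.
tokens_whitespace = text.split()

print(tokens_single_space)
print(tokens_whitespace)
```

If the raw text might contain tabs, newlines, or repeated spaces, the argument-free form is usually the safer choice.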


In [0]:
def Tokenization(corpus):
    '''
    Input:
      corpus: list of sentences (e.g., the list 'corpus' defined above, which contains 15 sentences)
    
    Output:
           y: list of lists, each sentence in corpus is broken into a list of words (obtained by tokenizing all the sentences)
    '''
    # YOUR CODE HERE
    # Split each sentence on single spaces to get its list of words
    y = [sentence.split(" ") for sentence in corpus]
    return y

In [4]:
'''test for Tokenization'''
def test_Tokenization():
  y = Tokenization(corpus)
  assert y[0]==['king', 'is', 'a', 'strong', 'man']
  assert y[9]==['princess', 'is', 'a', 'girl', 'will', 'be', 'queen']
  print('Test passed', '\U0001F44D')
test_Tokenization()

Test passed 👍


In [5]:
print(Tokenization(corpus))

[['king', 'is', 'a', 'strong', 'man'], ['queen', 'is', 'a', 'wise', 'woman'], ['boy', 'is', 'a', 'young', 'man'], ['girl', 'is', 'a', 'young', 'woman'], ['prince', 'is', 'a', 'young', 'king'], ['princess', 'is', 'a', 'young', 'queen'], ['man', 'is', 'strong'], ['woman', 'is', 'pretty'], ['prince', 'is', 'a', 'boy', 'will', 'be', 'king'], ['princess', 'is', 'a', 'girl', 'will', 'be', 'queen'], ['prince', 'is', 'a', 'strong', 'boy'], ['boy', 'is', 'strong'], ['girl', 'is', 'pretty'], ['girl', 'will', 'be', 'woman'], ['boy', 'will', 'be', 'man']]


# Remove stop words
## Stopwords: 
Words such as articles and some verbs are usually considered stop words because they don't help us find the context or the true meaning of a sentence. They can usually be removed without harming the final model that you are training.
To make creating word vectors more efficient, we will remove these commonly used words.<br>
For our case, let's take the following list as stop words. <br>
stop_words = ['is', 'a', 'will', 'be'] <br>
For example, the output corresponding to *"king is a strong man"* will be ['king', 'strong', 'man']; your function should return lists that contain no stop words.<br>

In [0]:
def remove_stop_words(corpus):
    '''
    Input:
      corpus: list of sentences (e.g., the list 'corpus' defined above, which contains 15 sentences)
    
    Output:
      corpus_wo_stopwords: list of lists, each sentence in corpus is broken into a list of words excluding stop words 
                           (obtained after tokenizing the corpus followed by removing stop words)
    '''
    
    # Get stop-word list
    stop_words = ['is', 'a', 'will', 'be']
    
    # YOUR CODE HERE
    # Tokenize, then keep only the words that are not stop words
    corpus_wo_stopwords = [[word for word in sentence if word not in stop_words]
                           for sentence in Tokenization(corpus)]
    return corpus_wo_stopwords
    
PP_corpus = remove_stop_words(corpus)

In [7]:
'''test for remove_stop_words'''
def test_remove_stop_words():
  assert set(PP_corpus[0])==set(['king', 'strong', 'man'])
  assert set(PP_corpus[9])==set(['princess', 'girl', 'queen'])
  print('Test passed', '\U0001F44D')
test_remove_stop_words()

Test passed 👍


# Create vocabulary
Building a vocabulary is nothing more than assigning a unique integer id to each word in the dataset. So, a vocabulary is basically a Python dictionary that maps each word to a number. <br>
 E.g. dictionary['love'] = 520 <br>
Your function should return a dictionary covering all unique words.<br>
For example, if you have three unique words, namely, *'princess', 'queen', 'girl',* then your output should be {'princess': 0, 'queen': 1, 'girl': 2}
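The id assignment can be sketched in a few lines. The names `words`, `word2int_demo` and `int2word_demo` are hypothetical and used only for this illustration; the inverse mapping is not required by the exercise, but it is handy for turning ids back into words.

```python
# Toy word list with repeats; the order of first appearance decides the ids.
words = ['princess', 'queen', 'girl', 'queen', 'princess']

word2int_demo = {}
for word in words:
    if word not in word2int_demo:
        word2int_demo[word] = len(word2int_demo)  # next free id

# Hypothetical inverse mapping from id back to word.
int2word_demo = {idx: word for word, idx in word2int_demo.items()}

print(word2int_demo)
print(int2word_demo[1])
```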

### 1. Find out set of unique words in PP_corpus 

In [8]:
def get_unique_words(PP_corpus):
    '''
    Input:
        PP_corpus: list of lists containing the list of words (obtained after tokenizing the corpus followed by removing stop words)
    
    Output:
        unique_words: list of unique words in PP_corpus, in order of first appearance
    '''
    # YOUR CODE HERE
    unique_words = []
    for sentence in PP_corpus:
        for word in sentence:
            # Keep only the first occurrence so the order is deterministic
            if word not in unique_words:
                unique_words.append(word)
    print(unique_words)
    return unique_words
  
unique_words = get_unique_words(PP_corpus)

['king', 'strong', 'man', 'queen', 'wise', 'woman', 'boy', 'young', 'girl', 'prince', 'princess', 'pretty']


In [0]:
'''test for get_unique_words'''
def test_unique_word():
  # unique_words is a list, so compare it as a set
  assert set(unique_words)=={'strong', 'pretty', 'wise', 'queen', 'man', 'prince', 'king', 'young', 'princess', 'woman', 'boy', 'girl'}
  print('Test passed', '\U0001F44D')
test_unique_word()

### 2. Map the unique words to integers starting from 0

In [0]:
def mapping(unique_words):
    '''
    Input:
        unique_words: list of unique words in PP_corpus  
    Output:
        word2int: dictionary which maps the words in unique_words to integers starting from 0 (of same length as unique_words)
    '''
    # YOUR CODE HERE
    # Number the words in order, starting from 0
    word2int = {word: idx for idx, word in enumerate(unique_words)}
    return word2int
    
word2int = mapping(unique_words)

In [11]:
'''test for mapping'''
def test_mapping():
  assert len(word2int)==12
  assert np.unique(list(word2int.values())).shape[0]==12
  print('Test passed', '\U0001F44D')
test_mapping()

Test passed 👍



## Prepare data for Skip-Gram Model 
In the skip-gram architecture of word2vec, the input is the center word and the predictions are the context words. Consider a sequence of words W: if W(i) is the input (center word), then W(i-2), W(i-1), W(i+1), and W(i+2) are the context words for a sliding window of size 2.

![Sliding Window](https://cdncontribute.geeksforgeeks.org/wp-content/uploads/word2vec_diagram-1.jpg)

# Data generation
The final structure of your data should be a list of tuples $(x, y)$.
$x$ is the target word (the center word in the current window) and $y$ is a context word, as illustrated in the figure above. The implementation below stores the words themselves; they are converted to integer ids through word2int when the one-hot vectors are built.
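To make the sliding window concrete, here is a minimal sketch that emits (center id, context id) pairs for a single toy sentence. The function `skipgram_id_pairs`, the toy sentence and the id table are all hypothetical and exist only for this example.

```python
def skipgram_id_pairs(tokens, word_to_id, window_size):
    """Return (center_id, context_id) pairs for one tokenized sentence."""
    pairs = []
    for center_pos, center in enumerate(tokens):
        # Clip the window at the sentence boundaries.
        lo = max(0, center_pos - window_size)
        hi = min(len(tokens), center_pos + window_size + 1)
        for context_pos in range(lo, hi):
            if context_pos != center_pos:  # a word is not its own context
                pairs.append((word_to_id[center], word_to_id[tokens[context_pos]]))
    return pairs

toy_tokens = ['prince', 'strong', 'boy', 'man']
toy_ids = {'prince': 0, 'strong': 1, 'boy': 2, 'man': 3}
toy_pairs = skipgram_id_pairs(toy_tokens, toy_ids, window_size=1)
print(toy_pairs)
```

With a window of 1, words at the sentence edges contribute one pair and interior words contribute two, which is why short sentences produce few training samples.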

In [12]:
import pandas as pd

def data_gen(PP_corpus, window_size):
    '''
    Input:
        PP_corpus: list of lists (obtained after tokenizing the corpus followed by removing stop words) 
        window_size: int, window size as described above
    Output:
        data: list of tuples (x, y). x is the target word (the center word in current window) and y is a context word
    '''
    # YOUR CODE HERE
    data = []
    for sentence in PP_corpus:
        for center_pos, center_word in enumerate(sentence):
            # Visit every position within window_size of the center word
            for offset in range(-window_size, window_size + 1):
                context_pos = center_pos + offset
                if offset != 0 and 0 <= context_pos < len(sentence):
                    data.append((center_word, sentence[context_pos]))
    print(data)
    return data
    
data = data_gen(PP_corpus, 2)
df = pd.DataFrame(data, columns = ['input', 'label'])

[('king', 'strong'), ('king', 'man'), ('strong', 'king'), ('strong', 'man'), ('man', 'king'), ('man', 'strong'), ('queen', 'wise'), ('queen', 'woman'), ('wise', 'queen'), ('wise', 'woman'), ('woman', 'queen'), ('woman', 'wise'), ('boy', 'young'), ('boy', 'man'), ('young', 'boy'), ('young', 'man'), ('man', 'boy'), ('man', 'young'), ('girl', 'young'), ('girl', 'woman'), ('young', 'girl'), ('young', 'woman'), ('woman', 'girl'), ('woman', 'young'), ('prince', 'young'), ('prince', 'king'), ('young', 'prince'), ('young', 'king'), ('king', 'prince'), ('king', 'young'), ('princess', 'young'), ('princess', 'queen'), ('young', 'princess'), ('young', 'queen'), ('queen', 'princess'), ('queen', 'young'), ('man', 'strong'), ('strong', 'man'), ('woman', 'pretty'), ('pretty', 'woman'), ('prince', 'boy'), ('prince', 'king'), ('boy', 'prince'), ('boy', 'king'), ('king', 'prince'), ('king', 'boy'), ('princess', 'girl'), ('princess', 'queen'), ('girl', 'princess'), ('girl', 'queen'), ('queen', 'princess')

In [13]:
'''test for data_gen'''
def test_data_gen():
  assert df.shape == (66, 2)
  print('Test passed', '\U0001F44D')
test_data_gen()

Test passed 👍


# One-hot Encoding

In [0]:
import numpy as np


def to_one_hot_encoding(data_point_index, one_hot_dim):
    '''
    Input:
        data_point_index: integer value from 0 to one_hot_dim - 1 (index at which the one_hot_encoding array value will be 1 and 0 otherwise)
        one_hot_dim : int, vocabulary size
    Output:
        one_hot_encoding: np array of size (vocabulary size, ) containing one hot encoding
    '''
    # YOUR CODE HERE
    return np.eye(one_hot_dim)[data_point_index]

In [15]:
'''test for to_one_hot_encoding'''
def test_to_one_hot_encoding():
  arr = to_one_hot_encoding(4, len(unique_words))
  assert arr.tolist() == [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
  print('Test passed', '\U0001F44D')
test_to_one_hot_encoding()

Test passed 👍


## Change  data into 1-hot encoding for Skip-gram Training

In [16]:
def one_hot_for_skip_gram(word2int, data):
    '''
    Input:
        word2int: dictionary, mapping from vocabulary words to ints
            data: list of tuples (x, y). x is the target word (the center word in current window) and y is the context word
    Output:
               X: numpy array of shape (samples, vocabulary_size) containing input words
               Y: numpy array of shape (samples, vocabulary_size) containing target words corresponding to input words in X
    '''
    # YOUR CODE HERE
    print(word2int)
    print(data)
    samples = len(data)
    vocabulary_size = len(word2int)
    print(samples, vocabulary_size)
    # Look up each word's id and turn it into a one-hot row
    X = np.array([to_one_hot_encoding(word2int[x], vocabulary_size) for x, _ in data])
    Y = np.array([to_one_hot_encoding(word2int[y], vocabulary_size) for _, y in data])
    print(X, Y)
    return X, Y
    
X_train,Y_train = one_hot_for_skip_gram(word2int, data)    

{'king': 0, 'strong': 1, 'man': 2, 'queen': 3, 'wise': 4, 'woman': 5, 'boy': 6, 'young': 7, 'girl': 8, 'prince': 9, 'princess': 10, 'pretty': 11}
[('king', 'strong'), ('king', 'man'), ('strong', 'king'), ('strong', 'man'), ('man', 'king'), ('man', 'strong'), ('queen', 'wise'), ('queen', 'woman'), ('wise', 'queen'), ('wise', 'woman'), ('woman', 'queen'), ('woman', 'wise'), ('boy', 'young'), ('boy', 'man'), ('young', 'boy'), ('young', 'man'), ('man', 'boy'), ('man', 'young'), ('girl', 'young'), ('girl', 'woman'), ('young', 'girl'), ('young', 'woman'), ('woman', 'girl'), ('woman', 'young'), ('prince', 'young'), ('prince', 'king'), ('young', 'prince'), ('young', 'king'), ('king', 'prince'), ('king', 'young'), ('princess', 'young'), ('princess', 'queen'), ('young', 'princess'), ('young', 'queen'), ('queen', 'princess'), ('queen', 'young'), ('man', 'strong'), ('strong', 'man'), ('woman', 'pretty'), ('pretty', 'woman'), ('prince', 'boy'), ('prince', 'king'), ('boy', 'prince'), ('boy', 'king')

In [17]:
'''test for one_hot_for_skip_gram'''
def test_one_hot_for_skip_gram():
  assert Y_train.shape == (66,12)
  print('Test passed', '\U0001F44D')
test_one_hot_for_skip_gram()

Test passed 👍
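The per-sample loop in one_hot_for_skip_gram can also be written as a single indexing step. A minimal sketch, assuming the word ids are already collected in an integer array; `vocab_size` and `center_ids` are made-up values for this illustration.

```python
import numpy as np

vocab_size = 4
center_ids = np.array([0, 2, 3])   # hypothetical word ids of three center words

# Indexing the identity matrix with an array of ids picks one row per id,
# giving a (samples, vocab_size) one-hot matrix in a single step.
one_hot_rows = np.eye(vocab_size)[center_ids]
print(one_hot_rows.shape)
```

Note that `np.eye` builds the full vocab_size x vocab_size identity matrix first, which is fine for a 12-word vocabulary but wasteful for a large one.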


# Define Model Architecture
You can now train your model using whatever optimizer you want.
To keep track of training, print out the loss every 3000 steps.
Write your code in the cell below and run your model for 20K epochs.

![Skip Gram Model Architecture](https://cdncontribute.geeksforgeeks.org/wp-content/uploads/Skip-gram-architecture-2.jpg)

## Building skip gram network in Keras
Description of the Network
- There is only one hidden layer with 2 neurons and no activation
- Input and output layers have same shape as one-hot encoded vectors
- Output layer has activation softmax
- Loss used is categorical_crossentropy


Build this network using keras and train for at least 1000 epochs.
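To see what this network computes, the sketch below reproduces the same structure in plain NumPy: a linear hidden layer, a softmax output and categorical cross-entropy. It is an illustration under stated assumptions, not the Keras solution: biases are omitted, plain full-batch gradient descent stands in for a Keras optimizer, and the 4-word vocabulary, the training pairs and the hyperparameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 4-word vocabulary and (center, context) id pairs.
vocab_size, embedding_dim = 4, 2
pairs = np.array([(0, 1), (1, 0), (2, 3), (3, 2), (0, 2), (2, 0)])
X = np.eye(vocab_size)[pairs[:, 0]]   # one-hot center words
Y = np.eye(vocab_size)[pairs[:, 1]]   # one-hot context words

# Two weight matrices, no biases (a simplification of the Dense layers).
W1 = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))
W2 = rng.normal(scale=0.1, size=(embedding_dim, vocab_size))

def forward(X, W1, W2):
    hidden = X @ W1                                        # linear hidden layer, no activation
    logits = hidden @ W2
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)              # softmax
    return hidden, probs

def cross_entropy(probs, Y):
    return -np.mean(np.sum(Y * np.log(probs + 1e-12), axis=1))

learning_rate, epochs, report_every = 0.1, 3000, 1000
losses = []
for epoch in range(1, epochs + 1):
    hidden, probs = forward(X, W1, W2)
    grad_logits = (probs - Y) / len(X)    # gradient of the mean cross-entropy w.r.t. the logits
    grad_W2 = hidden.T @ grad_logits
    grad_W1 = X.T @ (grad_logits @ W2.T)
    W1 -= learning_rate * grad_W1
    W2 -= learning_rate * grad_W2
    if epoch % report_every == 0:
        loss = cross_entropy(forward(X, W1, W2)[1], Y)
        losses.append(loss)
        print(f"epoch {epoch}: loss {loss:.4f}")
```

The loss starts near log(vocab_size), the value for a uniform prediction, and should fall as the two matrices learn which context words follow each center word.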

In [18]:
import keras
from keras.layers import Input, Dense
from keras.models import Model
from keras import optimizers

Using TensorFlow backend.


In [0]:
def create_model():
    """
    Inputs:
        None
    Outputs:
        model: compiled keras model for skipgram architecture
    """
    # YOUR CODE HERE
    input_layer=Input(shape=(12,))
    hidden_layer=Dense(2)(input_layer)
    output_layer=Dense(12, activation="softmax")(hidden_layer)
    model=Model(inputs=[input_layer],outputs=[output_layer])
    model.compile(optimizer="Adam",loss="categorical_crossentropy")
    model.summary()
    return model

In [0]:
def get_weights_and_bias(model, layer_index):
    """
    Inputs:
        model: compiled keras model
        layer_index: index of the layer
    Outputs:
        weights: weights (kernel) of the layer at layer_index
        biases: biases of the layer at layer_index
    """
    # YOUR CODE HERE
    # A Dense layer returns its parameters as [kernel, bias]
    weights, biases = model.layers[layer_index].get_weights()
    
    return weights, biases

In [21]:
"""Test for get_weights_and_bias"""
model = create_model()
w, b = get_weights_and_bias(model, 1)
assert w.shape[1] == 2
assert b.shape[0] == 2
print('Test passed', '\U0001F44D')


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 12)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 26        
_________________________________________________________________
dense_2 (Dense)              (None, 12)                36        
Total params: 62
Trainable params: 62
Non-trainable params: 0
_________________________________________________________________
Test passed 👍


### Remember the model before training

In [0]:
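# Snapshot the untrained model explicitly: clone the architecture, then copy the current weights
# (a shallow copy.copy may share weights with `model` in some Keras versions).
model_before = keras.models.clone_model(model)
model_before.set_weights(model.get_weights())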

### Training

In [0]:
model.fit(X_train, Y_train, epochs = 12000, batch_size = 20)

Epoch 1/12000
Epoch 2/12000
Epoch 3/12000
...

## (1) Word embedding extraction
Extract your word embedding matrix from the model and print out its shape.
(The size should be [vocabulary_size, embedding_dimension])
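Why the hidden-layer weight matrix can be read as the embedding matrix: multiplying a one-hot vector by a matrix selects exactly one row. The sketch below checks this with random stand-in weights (`hidden_weights` is a hypothetical name, not the trained model). The Keras hidden layer also adds a bias, which shifts every word by the same constant vector and so does not change their relative positions.

```python
import numpy as np

vocab_size, embedding_dim = 12, 2
rng = np.random.default_rng(0)
hidden_weights = rng.normal(size=(vocab_size, embedding_dim))  # stand-in for trained weights

word_id = 4
one_hot = np.eye(vocab_size)[word_id]

# Multiplying a one-hot vector by the weight matrix selects a single row,
# so row i of the matrix is the embedding of word i.
via_matmul = one_hot @ hidden_weights
via_lookup = hidden_weights[word_id]
print(np.allclose(via_matmul, via_lookup))
```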


In [0]:
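def get_embeddings(model, flag):
    """
    Inputs:
        model: compiled keras model
        flag:int, {0 or 1}, if 0 represents input vectors else represents output vectors
    Outputs:
        embeddings: numpy array of shape(vocabulary_size, embedding_dimension), word embeddings of all words
    """
    # YOUR CODE HERE
    weights = model.layers[flag+1].get_weights()[0]
    # Hidden-layer weights are already (vocabulary_size, embedding_dimension).
    # Output-layer weights are (embedding_dimension, vocabulary_size): transpose them,
    # since reshape would mix values from different words.
    embeddings = weights if flag == 0 else weights.T
    return embeddings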

In [0]:
get_embeddings(model,0)

In [0]:
embeddings_before = get_embeddings(model_before, 0)
embeddings_after_input = get_embeddings(model, 0)
embeddings_after_output = get_embeddings(model, 1)

## (2) Visualization
<p>In this step, you need to visualize your word vectors. The embeddings here are already 2-dimensional, so they can be plotted directly; higher-dimensional embeddings first need dimension reduction (e.g. PCA or t-SNE).</p>
<p>If you are not satisfied with the quality of your word vectors in the visualization (as is often the case), you can change some parameters of your model (e.g. vocabulary_size, embedding_dimension) and re-train your word embedding.</p>

Visualize the word vectors before and after training by setting vectors to either embeddings_before or embeddings_after_input.

Visualize the learned input vectors and the learned output vectors (embeddings_after_output) and see the difference.


In [0]:
import matplotlib.pyplot as plt

## Set vectors: one row per word, shape (vocabulary_size, 2)
vectors = embeddings_after_input


# Build dataframe for vectors corresponding to unique words where first column will be word.
w2v_df = pd.DataFrame(vectors, columns = ['x1', 'x2'])
w2v_df['word'] = unique_words
w2v_df = w2v_df[['word', 'x1', 'x2']]


# Plot the vectors (the figure size must be set when the figure is created)
fig, ax = plt.subplots(figsize=(10, 10))

for word, x1, x2 in zip(w2v_df['word'], w2v_df['x1'], w2v_df['x2']):
    ax.annotate(word, (x1,x2), bbox={'facecolor':'red', 'alpha':0.5, 'pad':5})

# Axis limits: min/max over all words (axis=0) for each coordinate, plus padding
PADDING = 1.0
x_axis_min = np.amin(vectors, axis=0)[0] - PADDING
y_axis_min = np.amin(vectors, axis=0)[1] - PADDING
x_axis_max = np.amax(vectors, axis=0)[0] + PADDING
y_axis_max = np.amax(vectors, axis=0)[1] + PADDING
 
plt.xlim(x_axis_min,x_axis_max)
plt.ylim(y_axis_min,y_axis_max)

plt.show()
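If you re-train with an embedding dimension larger than 2, the vectors must be reduced to two dimensions before plotting. One possible sketch of PCA using only NumPy's SVD follows; `pca_2d` and the random 5-dimensional vectors are hypothetical stand-ins, not part of the trained model.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
high_dim_vectors = rng.normal(size=(12, 5))   # hypothetical 5-dimensional embeddings
points_2d = pca_2d(high_dim_vectors)
print(points_2d.shape)
```

The resulting array has one row per word and two columns, so it can be assigned to `vectors` in the plotting cell.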

### Advanced Experiments
- Tune hyperparameters to see if you can get better representations
- Try to add more sentences using the same vocabulary (or expanding the vocabulary only slightly) to see if you can learn better representations
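
One way to judge whether a change gives better representations is to look at each word's nearest neighbours under cosine similarity. A minimal sketch, using made-up 2-D vectors rather than trained values; `most_similar`, `demo_words` and `demo_vectors` are hypothetical names.

```python
import numpy as np

def most_similar(word, words, vectors, top_n=3):
    """Rank the other words by cosine similarity to `word`."""
    # Normalizing each row to unit length turns dot products into cosines.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = unit[words.index(word)]
    scores = unit @ query
    order = np.argsort(-scores)            # highest similarity first
    return [(words[i], float(scores[i])) for i in order if words[i] != word][:top_n]

# Hypothetical 2-D vectors, made up for illustration (not trained values).
demo_words = ['king', 'queen', 'man', 'woman']
demo_vectors = np.array([[1.0, 0.1], [0.1, 1.0], [0.9, 0.3], [0.2, 0.8]])
print(most_similar('king', demo_words, demo_vectors, top_n=2))
```

With the notebook's own data, `unique_words` and an embedding matrix of shape (vocabulary_size, embedding_dimension) could be passed in the same way.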