### Word2Vec - Skip-Gram Model

This notebook implements the Skip-Gram model in TensorFlow.

For more information on this model, refer to the following papers:
1. <a href = "https://arxiv.org/pdf/1301.3781.pdf">Efficient Estimation of Word Representations in Vector Space</a>
2. <a href = "https://arxiv.org/pdf/1310.4546.pdf">Distributed Representations of Words and Phrases and their Compositionality</a>

#### 1. Input Data

In [1]:
sentences = ['king is a strong man',
             'queen is a wise woman',
             'boy is a young man',
             'girl is a young woman',
             'prince is a young king',
             'princess is a young queen',
             'man is strong',
             'woman is pretty',
             'prince is a boy will be king',
             'princess is a girl will be queen'
            ]

#### 2. Getting rid of stopwords

In [2]:
def remove_stopwords(sentences):
    """Return the sentences with the stopwords removed."""
    stopwords = {"is", "a", "will", "be"}

    transformed_sentences = []
    for sentence in sentences:
        # Keep only the words that are not stopwords
        result = [word for word in sentence.split(" ") if word not in stopwords]
        transformed_sentences.append(" ".join(result))
    return transformed_sentences

In [3]:
sentences = remove_stopwords(sentences)

#### 3. Creating Vocabulary

In [4]:
vocabulary = set()
for sentence in sentences:
    for word in sentence.split(" "):
        vocabulary.add(word)

print("Size of the vocabulary: {}".format(len(vocabulary)))
print("Vocabulary: {}".format(vocabulary))

Size of the vocabulary: 12
Vocabulary: {'king', 'boy', 'man', 'queen', 'young', 'girl', 'princess', 'pretty', 'strong', 'prince', 'woman', 'wise'}


#### 4. Data Preparation

In [5]:
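# Map each vocabulary word to an integer index.
# Note: set iteration order can differ between runs, so the indices may too.
wordPos = dict()
for idx, word in enumerate(vocabulary):
    wordPos[word] = idx

sentences = [sentence.split(" ") for sentence in sentences]

m = 2     # Window size: number of context words taken on each side
data = []
for sentence in sentences:
    for i, word in enumerate(sentence):
        # The slice covers positions i-m .. i+m, clipped to the sentence bounds
        for context in sentence[max(i-m, 0): i+m+1]:
            if context != word:   # Skip the center word itself
                data.append([word, context])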
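
The pair generation above compares words by value, so a word that repeats inside its own window would be dropped as context. A minimal sketch of a position-based variant follows; `skipgram_pairs` and `example_pairs` are hypothetical names used only for illustration and are not part of the notebook.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(i - window, 0)
        hi = min(i + window + 1, len(tokens))
        for j in range(lo, hi):
            if j != i:  # exclude the center position, not equal words
                pairs.append((center, tokens[j]))
    return pairs


example_pairs = skipgram_pairs(["king", "strong", "man"], window=2)
print(example_pairs)
```

For a three-word sentence and a window of 2, every word sees the other two as context, which gives six pairs.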

In [6]:
import pandas as pd

data = pd.DataFrame(data, columns = ["Word", "Context"])
print(data.head())

     Word Context
0    king  strong
1    king     man
2  strong    king
3  strong     man
4     man    king


In [7]:
print(data.shape)

(52, 2)


In [8]:
wordPos

{'king': 0,
 'boy': 1,
 'man': 2,
 'queen': 3,
 'young': 4,
 'girl': 5,
 'princess': 6,
 'pretty': 7,
 'strong': 8,
 'prince': 9,
 'woman': 10,
 'wise': 11}

#### 5. Defining model using tensorflow

In [11]:
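# The cells below use the TensorFlow 1.x graph API (tf.placeholder, tf.Session).
# On TensorFlow 2.x, use instead:
#   import tensorflow.compat.v1 as tf
#   tf.disable_v2_behavior()
import tensorflow as tf
import numpy as np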

In [12]:
one_hot_size = len(vocabulary)

def to_one_hot_vector(pos):
    one_hot_vector = np.zeros(one_hot_size)
    one_hot_vector[pos] = 1
    
    return one_hot_vector

In [23]:
X = []
y = []

for xi, yi in data.values:
    X.append(to_one_hot_vector(wordPos[xi]))
    y.append(to_one_hot_vector(wordPos[yi]))
    
X = np.array(X)
y = np.array(y)

print(X.shape)
print(y.shape)

(52, 12)
(52, 12)
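
One reason one-hot inputs are convenient here: multiplying a one-hot vector by a weight matrix simply selects one row of that matrix, so the first layer acts as an embedding lookup. A small NumPy sketch of this, where `W` is a random stand-in for the first-layer weights and `pos` is an arbitrary index (both are assumptions for illustration):

```python
import numpy as np

vocab_size, embedding_dim = 12, 2                  # same sizes as in this notebook
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embedding_dim))   # stand-in for the first-layer weights

pos = 4                                            # arbitrary word index
one_hot = np.zeros(vocab_size)
one_hot[pos] = 1

# The matrix product picks out exactly row `pos` of W
row_from_product = one_hot @ W
print(np.allclose(row_from_product, W[pos]))
```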


In [26]:
# Placeholders for the one-hot center words (X) and context words (y)
X_input = tf.placeholder(tf.float32, shape = (None, one_hot_size))
y_input = tf.placeholder(tf.float32, shape = (None, one_hot_size))

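word2vec_dimension = 2     # Size of each word embedding

# Hidden layer: multiplying a one-hot input by w1 selects that word's row of w1
w1 = tf.Variable(tf.random_normal([one_hot_size, word2vec_dimension]))
b1 = tf.Variable(tf.random_normal([1]))
hidden_layer = tf.add(tf.matmul(X_input, w1), b1)

# Output layer: softmax over the vocabulary for the context word
w2 = tf.Variable(tf.random_normal([word2vec_dimension, one_hot_size]))
b2 = tf.Variable(tf.random_normal([1]))
pred = tf.nn.softmax(tf.add(tf.matmul(hidden_layer, w2), b2))

# Cross-entropy between the one-hot targets and the predictions, averaged over the batch
loss = tf.reduce_mean(- tf.reduce_sum(y_input * tf.log(pred), axis = 1))

train = tf.train.GradientDescentOptimizer(0.05).minimize(loss)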
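
To make the graph above easier to follow, here is a rough NumPy sketch of the same forward pass and cross-entropy loss. It is not the notebook's implementation: `softmax`, `forward_loss`, `X_demo` and `y_demo` are hypothetical names, the targets are arbitrary, and the softmax subtracts the row maximum for numerical stability, which the TensorFlow cell does not do explicitly.

```python
import numpy as np


def softmax(logits):
    # Subtract the row-wise max so the exponentials cannot overflow
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def forward_loss(X, y, w1, b1, w2, b2):
    hidden = X @ w1 + b1                 # embedding lookup plus scalar bias
    pred = softmax(hidden @ w2 + b2)     # distribution over context words
    return np.mean(-np.sum(y * np.log(pred), axis=1))


V, d = 12, 2                             # vocabulary size and embedding size
X_demo = np.eye(V)                       # each word once as a one-hot input
y_demo = np.roll(np.eye(V), 1, axis=0)   # arbitrary one-hot targets

# With all-zero weights the prediction is uniform, so the loss equals ln(V)
uniform_loss = forward_loss(X_demo, y_demo, np.zeros((V, d)), 0.0, np.zeros((d, V)), 0.0)
print(uniform_loss, np.log(V))
```

The uniform-prediction loss, ln(12) ≈ 2.48, is a useful baseline: a trained model should end up below it.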

In [32]:
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

max_iter = 20000
for i in range(max_iter):
    sess.run(train, feed_dict = {X_input: X, y_input: y})
    
    if i%5000 == 0 or i == max_iter-1:
        print("Loss at {} iteration is: {}".format(i, sess.run(loss, feed_dict = {X_input: X, y_input: y})))

Loss at 0 iteration is: 3.6373674869537354
Loss at 5000 iteration is: 1.7648868560791016
Loss at 10000 iteration is: 1.700859785079956
Loss at 15000 iteration is: 1.678276538848877
Loss at 19999 iteration is: 1.666369915008545
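
Once training has finished, a natural next step is to compare the learned word vectors. Given how the hidden layer is defined, they could presumably be read out with something like `sess.run(w1 + b1)`. The sketch below only illustrates the comparison itself: `cosine_similarity` is a hypothetical helper, and `embeddings` holds random placeholder values, not trained results.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Placeholder values only; in the notebook these would be the trained vectors
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(12, 2))
word_index = {"king": 0, "queen": 3}     # indices as printed for wordPos above

sim = cosine_similarity(embeddings[word_index["king"]],
                        embeddings[word_index["queen"]])
print(sim)
```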
