## **1. The corpus**

First, we mount our Google Drive in Google Colab so that the notebook can read our folders.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Then, we load our training data (text8) and parse it into lines. text8 is a cleaned training corpus collected from Wikipedia.

In [None]:
text = []

# Open the text file; the with-block closes it once we are done reading
with open("/content/drive/MyDrive/word-embedding-creation/input/text8", "r") as f:
  # Extract each line of our text file
  for line in f:
    text.append(line)


Import libraries for later use.

In [None]:
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import text_to_word_sequence

Then, we split each line from text8 into separate words.

In [None]:
corpus = []
# Split each line into words. We loop over lines so that the same code
# can be reused on other corpora that contain multiple lines of text.
for i in range(len(text)):
  # text_to_word_sequence lowercases, strips punctuation, and splits on spaces
  corpus.append(text_to_word_sequence(text[i]))

In [None]:
print((corpus[0][:100]))

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been', 'taken', 'up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that', 'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished', 'although', 'there', 'are', 'differing']


Because the corpus is very **large**, taking every word that appears in it into account would grow our vocabulary to about 250,000 words. This would make our models unreasonably large and slow to train. Moreover, many words appear only a few times in the corpus (< 10), so we would not be able to learn good dense representations for them. Therefore, we restrict our vocabulary to roughly the 30,000 most frequent words.

In [None]:
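from collections import Counter
# Count how often each word occurs in the first (and, for text8, only used) line
c = Counter(corpus[0])
# Keep 29,999 words; together with the OOV token this gives 30,000 index entries
most_30000 = c.most_common(29999)
vocab_dictionary = [[i[0] for i in most_30000]]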

In [None]:
print(vocab_dictionary)
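
To get a feel for how little is lost by trimming the vocabulary, one can measure which fraction of all tokens the kept words cover. The following is a minimal sketch on a toy token list; the helper name `coverage` and the toy sentence are illustrative and not part of the notebook.

```python
from collections import Counter

def coverage(tokens, k):
    """Fraction of the tokens that belong to the k most frequent words."""
    counts = Counter(tokens)
    kept = sum(count for _, count in counts.most_common(k))
    return kept / len(tokens)

toy_tokens = "the cat sat on the mat and the dog sat on the rug".split()
# 'the' (4), 'sat' (2) and 'on' (2) account for 8 of the 13 tokens
print(coverage(toy_tokens, 3))
```

On the notebook's data, the equivalent call would be `coverage(corpus[0], 29999)`.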



Based on these words (29,999 frequent words plus one OOV token, 30,000 entries in total), we generate an index for each word. The first index (1) is reserved for the out-of-vocabulary (OOV) token.

In [None]:
tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(vocab_dictionary)
w2id = tokenizer.word_index
print(w2id)



In [None]:
print(len(w2id))

30000
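
The indexing scheme can be illustrated without Keras. The sketch below is a plain-Python stand-in, not the `Tokenizer` implementation: it assumes index 0 is left unused, index 1 belongs to the OOV token, and the remaining words are numbered from 2 in the order given. The names `build_index` and `encode` are hypothetical.

```python
def build_index(words, oov_token="<OOV>"):
    """Map the OOV token to 1 and each word to 2, 3, ... in the given order."""
    index = {oov_token: 1}
    for word in words:
        if word not in index:
            index[word] = len(index) + 1
    return index

def encode(tokens, index, oov_token="<OOV>"):
    """Replace each token by its index, falling back to the OOV index."""
    return [index.get(token, index[oov_token]) for token in tokens]

toy_index = build_index(["the", "of", "and"])
print(toy_index)
# 'anarchism' is not in the toy vocabulary, so it maps to the OOV index
print(encode(["the", "anarchism", "and"], toy_index))
```

With index 0 unused, a lookup table that covers every index needs `len(index) + 1` rows.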


## **2. Preprocess data for Skip-gram**

The next step is to preprocess the data into a training set. First, we define the vocabulary size and the window size around each word.

In [None]:
vocab_size = len(tokenizer.word_index) + 1
window_size = 2

Then, we define a function that generates (training word, label) pairs from the corpus for a given window_size.

Note:
For example, suppose we extract training samples with window_size = 2 from the sentence:
    ***I am a student at Denison.***
If the training word is "student", then the labels are "am", "a", "at", and "Denison".
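
The example in the note can be checked with a few lines of plain Python. This is a minimal sketch that is independent of the notebook's own pair generator; the helper name `context_words` is hypothetical, and it assumes a symmetric window that is clipped at the sentence boundaries.

```python
def context_words(tokens, target_index, window_size):
    """Return the words within `window_size` positions of the target word."""
    start = max(0, target_index - window_size)
    end = min(len(tokens), target_index + window_size + 1)
    # Words before the target plus words after it, excluding the target itself
    return tokens[start:target_index] + tokens[target_index + 1:end]

sentence = "I am a student at Denison".split()
labels = context_words(sentence, sentence.index("student"), window_size=2)
print(labels)  # ['am', 'a', 'at', 'Denison']
```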

In [None]:
import numpy as np

def generate_pairs(window_size, corpus):
    # Tensors holding the chunks of pairs that have already been converted
    X_cat = []
    y_cat = []

    # Buffers for the current chunk of (training word, label) pairs
    X = []
    y = []

    # Loop over sentences so the code can be reused on corpora with multiple lines
    for sent in corpus:
        tar_i = 0

        while tar_i < len(sent):
            if sent[tar_i] in w2id:
                # Clip the window at the sentence boundaries
                start = max(0, tar_i - window_size)
                end = min(len(sent) - 1, tar_i + window_size)

                labels = sent[start:tar_i] + sent[tar_i+1:end+1]
                for label in labels:
                    if label in w2id:
                        X.append(sent[tar_i])
                        y.append(label)
                    if len(X) >= 100:
                        # Storing every pair in Python lists would exhaust the RAM,
                        # so we convert the buffers to tensors in chunks of 100
                        X_cat.append(tf.convert_to_tensor(tokenizer.texts_to_sequences(X)))
                        y_cat.append(tf.convert_to_tensor(tokenizer.texts_to_sequences(y)))
                        X = []
                        y = []
            tar_i += 1
    # Note: pairs left in the buffers after the last full chunk are not returned
    return tf.concat(X_cat, axis=0), tf.concat(y_cat, axis=0)

In [None]:
# Generate train data
X_train, y_train = generate_pairs(window_size, corpus)

In [None]:
# import required libraries for later use
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Lambda
from tensorflow.keras.backend import mean 

In [None]:
print(X_train.shape) 
print(y_train.shape)

(62834600, 1)
(62834600, 1)


In [None]:
vocab_size

30001

## **3. Skip-gram**

Finally, we train our model.

In [None]:
from tensorflow.keras import Input
# k=4, so this metric is top-4 accuracy even though it is logged as 'top1_acc'
top4_accuracy_metric = tf.metrics.SparseTopKCategoricalAccuracy(k=4, name='top1_acc')

# Choose the embedding size
embedding_size = 50
skip_gram = Sequential()
# Embedding layer: maps each word index to a dense vector
skip_gram.add(Embedding(input_dim=vocab_size, output_dim=embedding_size, embeddings_initializer="LecunUniform"))
# Output layer: a softmax over the whole vocabulary predicts the context word
skip_gram.add(Dense(vocab_size, activation='softmax', use_bias=False, kernel_initializer="LecunUniform"))

skip_gram.compile(loss='sparse_categorical_crossentropy', optimizer='Adam', metrics=[top4_accuracy_metric])
skip_gram.fit(X_train, y_train, epochs=1, verbose=1)
skip_gram.save('/content/drive/MyDrive/word-embedding-creation/output/model_full')

  78390/1963582 [>.............................] - ETA: 1:55:40 - loss: 7.2863 - top1_acc: 0.2026
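
Two numbers in the progress log above can be sanity-checked by hand. Assuming Keras's default `batch_size` of 32 (the notebook does not set it), the number of steps per epoch follows from the number of training pairs, and a model that assigns equal probability to every word would have a cross-entropy of ln(vocab_size). A quick check:

```python
import math

n_pairs = 62_834_600   # rows in X_train, from the shape printed earlier
vocab_size = 30_001    # value printed earlier
batch_size = 32        # assumption: Keras default, not set in the notebook

steps_per_epoch = math.ceil(n_pairs / batch_size)
uniform_loss = math.log(vocab_size)

print(steps_per_epoch)  # 1963582, matching the total shown in the progress bar
print(round(uniform_loss, 2))
```

The uniform baseline comes out at about 10.3, so the logged loss of 7.29 is already well below it.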

In [None]:
top4_accuracy_metric = tf.metrics.SparseTopKCategoricalAccuracy(k=4, name='top4_acc')
mid_size1 = 512
embedding_size = 300
skip_gram = Sequential()
# The Embedding layer takes integer word indices, so no explicit Input is needed
skip_gram.add(Embedding(input_dim=vocab_size, output_dim=mid_size1))
skip_gram.add(Dense(embedding_size, use_bias=False))
skip_gram.add(Dense(mid_size1, use_bias=False))
skip_gram.add(Dense(vocab_size, use_bias=False, activation='softmax'))

skip_gram.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=[top4_accuracy_metric])
skip_gram.fit(X_train, y_train, epochs=3, verbose=1)
# Note: this path is the same as in the previous cell, so the earlier model is overwritten
skip_gram.save('/content/drive/MyDrive/word-embedding-creation/output/model_full')

After we train the model, we evaluate it by extracting the vector representation of each word.

In [None]:
# function to convert numbers to one hot vectors
import numpy as np
def to_one_hot(data_point_index, vocab_size):
    temp = np.zeros(vocab_size)
    temp[data_point_index] = 1
    return temp
x_train = [] # input word
for data_word in list(w2id.keys())[1:2]:
    x_train.append(to_one_hot(w2id[ data_word ], vocab_size))
# convert them to numpy arrays
x_train = np.asarray(x_train)

In [None]:
print(x_train.shape)

(1, 30001)
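
A one-hot vector multiplied by a weight matrix simply selects one row of that matrix, which is why an `Embedding` layer can take integer indices directly. The following is a small plain-Python sketch with a made-up 4x3 matrix; the helper names `one_hot` and `vec_mat` and all values are illustrative.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def vec_mat(vec, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]

# Hypothetical embedding table: 4 words, 3 dimensions
weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

selected = vec_mat(one_hot(2, 4), weights)
# The product equals the row of the table at index 2
print(selected == weights[2])
```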


In [None]:
# Load the model
model = tf.keras.models.load_model('/content/drive/MyDrive/word-embedding-creation/output/model_full')


In [None]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 300)         9000300   
                                                                 
 dense (Dense)               (None, None, 30001)       9000300   
                                                                 
Total params: 18,000,600
Trainable params: 18,000,600
Non-trainable params: 0
_________________________________________________________________


In [None]:
# Create a new model using part of the layers in our original model.
from tensorflow.keras import backend as K
from keras.models import Model
import numpy as np
keras_function = Model(model.layers[0].input, model.layers[0].output)

In [None]:
keras_function.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_input (InputLayer  [(None, None)]           0         
 )                                                               
                                                                 
 embedding (Embedding)       (None, None, 300)         9000300   
                                                                 
Total params: 9,000,300
Trainable params: 9,000,300
Non-trainable params: 0
_________________________________________________________________


In [None]:
# Generate the vector representation and write the result into result.txt
words = list(w2id.keys())
result = {}
# Start at 1 to skip the OOV token, which is the first key in w2id
for i in range(1, len(words)):
  result[words[i]] = keras_function(np.asarray([w2id[words[i]]])).numpy()[0]
# Each output line has the form: word v1 v2 ... vN
with open("/content/drive/MyDrive/word-embedding-creation/result.txt", "w", encoding='utf-8') as f:
  for word in result:
    line = ' '.join([str(word), *[str(j) for j in result[word]]])
    f.write(line + "\n")


In [None]:
print(keras_function(np.asarray([w2id[words[1]]])).numpy()[0].shape)

(300,)
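
Once the vectors are written out, a common sanity check is to compare words by cosine similarity. The sketch below parses the `word v1 v2 ...` line format written above and compares two vectors. The three-dimensional vectors and the helper names `parse_vectors` and `cosine` are made up for illustration; they are not results from the trained model.

```python
import io
import math

def parse_vectors(lines):
    """Parse lines of the form 'word v1 v2 ...' into a dict of float lists."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if parts:
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-in for result.txt (illustrative values only)
toy_file = io.StringIO("king 1.0 2.0 0.0\nqueen 2.0 4.0 0.0\napple 0.0 0.0 3.0\n")
vectors = parse_vectors(toy_file)

# Parallel vectors score close to 1, orthogonal vectors score 0
print(cosine(vectors["king"], vectors["queen"]))
print(cosine(vectors["king"], vectors["apple"]))
```

To apply this to the real output, the `io.StringIO` object would be replaced by an open handle on `result.txt`.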


## **5. Skip-gram with negative sampling**

### We will use this later. Currently, several of the functions used here are outdated

Negative sampling formula:

$$P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=0}^{n} f(w_j)^{3/4}}$$

$P(w_i)$: probability that $w_i$ is selected as a negative sample

$f(w_i)$: the number of times that $w_i$ appears in the corpus (text8)
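
The effect of the 3/4 exponent is easiest to see on a tiny example. The following sketch applies the formula to made-up counts; the function name `negative_sampling_probs` and the three-word vocabulary are hypothetical.

```python
def negative_sampling_probs(counts, power=0.75):
    """Raise each count to `power` and normalize so the values sum to 1."""
    weighted = {word: count ** power for word, count in counts.items()}
    total = sum(weighted.values())
    return {word: value / total for word, value in weighted.items()}

# Hypothetical counts for a three-word vocabulary
toy_counts = {"the": 81, "cat": 16, "zebra": 1}
probs = negative_sampling_probs(toy_counts)
for word, p in probs.items():
    print(word, round(p, 3))
```

Here the weighted counts are roughly 27, 8 and 1, so the probabilities are about 0.75, 0.22 and 0.03. Compared with the raw frequencies (about 0.83, 0.16 and 0.01), the exponent shifts probability mass from the most frequent word to the rarer ones.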

In [None]:
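from collections import Counter
int_corpus = []
oov_id = w2id[tokenizer.oov_token]  # index used for words outside the vocabulary
for i in range(len(text)):
  text_seq = text_to_word_sequence(text[i])
  for word in text_seq:
    # w2id only holds the kept words, so fall back to the OOV index
    int_corpus.append(w2id.get(word, oov_id))

appear_counts = Counter(int_corpus)
total_count = len(int_corpus)
freqs_dict = {word: count/total_count for word, count in appear_counts.items()}

# Note: sorting by frequency means position k in `sampling` is the k-th most
# frequent entry, which is not necessarily the word with index k in w2id.
freqs_arr = np.array(sorted(freqs_dict.values(), reverse=True))
sampling = tf.convert_to_tensor(freqs_arr**(0.75)/np.sum(freqs_arr**(0.75)))
sampling = tf.expand_dims(sampling, axis=0)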

In [None]:
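n_samples = 10
# multinomial expects unnormalized log-probabilities (logits), so we pass
# the log of the sampling distribution rather than the probabilities themselves
negative_samples = tf.compat.v1.multinomial(tf.math.log(sampling), y_train.shape[0] * n_samples)
negative_samples = tf.reshape(negative_samples, [y_train.shape[0], n_samples])
print(negative_samples[0])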

Instructions for updating:
Use `tf.random.categorical` instead.
tf.Tensor([88  2 82 91 74 91 21 26 87 52], shape=(10,), dtype=int64)
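
The sampling step can be mimicked with the standard library. This sketch uses `random.choices`, which takes weights directly, whereas `tf.random.categorical` takes unnormalized log-probabilities. The function name `draw_negative_samples`, the toy distribution, and the fixed seed are assumptions made for illustration.

```python
import random

def draw_negative_samples(probs, n_pairs, n_samples, seed=0):
    """Draw `n_samples` word indices per training pair according to `probs`."""
    rng = random.Random(seed)
    indices = list(range(len(probs)))
    return [rng.choices(indices, weights=probs, k=n_samples)
            for _ in range(n_pairs)]

toy_probs = [0.75, 0.2, 0.05]   # illustrative distribution over 3 word indices
samples = draw_negative_samples(toy_probs, n_pairs=4, n_samples=10)
# One row of 10 sampled indices per training pair
print(len(samples), len(samples[0]))
```

Like the cell above, this sketch does not exclude the true context word from the draw.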


In [None]:
class SkipGramNeg(tf.keras.Model):
  def __init__(self, vocab_size, embedding_size):
    super().__init__()
    
    self.vocab_size = vocab_size
    self.embedding_size = embedding_size
    
    # define embedding layers for input and output words
    self.in_embed = Embedding(vocab_size, embedding_size)
    self.out_embed = Embedding(vocab_size, embedding_size)
  def call(self, inputs, targets, negative_samples):
    input_vectors = self.in_embed(inputs)
    target_vectors = self.out_embed(targets)
    negative_vectors = self.out_embed(negative_samples)

    return input_vectors, target_vectors, negative_vectors

In [None]:
from tensorflow import math

def negativeSamplingLoss(input_vectors, output_vectors, negative_vectors):
    # Reshape the input vectors to (batch, embedding, 1) so they can be multiplied
    input_vectors =  tf.transpose(input_vectors, perm=(0,2,1))

    # Log-sigmoid loss for the correct (target) words
    out_loss = math.log(math.sigmoid(tf.matmul(output_vectors, input_vectors)))
    out_loss = tf.squeeze(out_loss)

    # Log-sigmoid loss for the incorrect (noise) words
    negative_loss = math.log(math.sigmoid(tf.matmul(math.negative(negative_vectors), input_vectors)))
    negative_loss = math.reduce_sum(tf.squeeze(negative_loss), axis=1)  # sum the losses over the sample of noise vectors

    # Negate and sum the correct and noise log-sigmoid losses,
    # then return the average loss over the batch
    return tf.math.reduce_mean(-(out_loss + negative_loss))
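
For a single training pair, the loss above reduces to $-\log\sigma(u_o \cdot v_c) - \sum_k \log\sigma(-u_k \cdot v_c)$, where $v_c$ is the input vector, $u_o$ the target vector, and $u_k$ the noise vectors (these symbols are introduced here for illustration). The following is a plain-Python sketch with made-up two-dimensional vectors; the function names are hypothetical.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(input_vec, target_vec, noise_vecs):
    """Loss for one (input, target) pair with a list of noise vectors."""
    positive = log_sigmoid(dot(target_vec, input_vec))
    negative = sum(log_sigmoid(-dot(noise, input_vec)) for noise in noise_vecs)
    return -(positive + negative)

# Illustrative vectors: in `good` the target is aligned with the input and the
# noise points away from it; in `bad` the roles are reversed
v_in = [1.0, 0.0]
good = negative_sampling_loss(v_in, [2.0, 0.0], [[-2.0, 0.0], [-1.0, 1.0]])
bad = negative_sampling_loss(v_in, [-2.0, 0.0], [[2.0, 0.0], [1.0, 1.0]])
print(good < bad)
```

The loss is small when the target vector points in the same direction as the input vector and the noise vectors point away from it, which is the arrangement that training pushes the embeddings toward.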

In [None]:
embedding_size = 128
neg_skip_gram = SkipGramNeg(vocab_size, embedding_size)
loss_fn = negativeSamplingLoss
optimizer = tf.keras.optimizers.Adam()

In [None]:
epochs = 300
for epoch in range(epochs):
  print("\nStart of epoch %d" % (epoch,))

  with tf.GradientTape() as tape:

      # Run the forward pass of the model.
      # The operations applied to the inputs are recorded
      # on the GradientTape.
      # Note: the whole training set is passed at once, so each "epoch"
      # here is a single full-batch gradient step.
      input_vectors, output_vectors, negative_vectors = neg_skip_gram(X_train, y_train, negative_samples)
      # Compute the loss value for this step.
      loss_value = loss_fn(input_vectors, output_vectors, negative_vectors)

  # Use the gradient tape to automatically retrieve
  # the gradients of the trainable variables with respect to the loss.
  grads = tape.gradient(loss_value, neg_skip_gram.trainable_weights)

  # Run one step of gradient descent by updating
  # the value of the variables to minimize the loss.
  optimizer.apply_gradients(zip(grads, neg_skip_gram.trainable_weights))

  print(
      "Training loss: %.4f"
      % (float(loss_value))
  )



Start of epoch 0
Training loss: 7.6248

Start of epoch 1
Training loss: 7.6192

Start of epoch 2
Training loss: 7.6137

Start of epoch 3
Training loss: 7.6080

Start of epoch 4
Training loss: 7.6023

Start of epoch 5
Training loss: 7.5964

Start of epoch 6
Training loss: 7.5903

Start of epoch 7
Training loss: 7.5839

Start of epoch 8
Training loss: 7.5773

Start of epoch 9
Training loss: 7.5703

Start of epoch 10
Training loss: 7.5630

Start of epoch 11
Training loss: 7.5552

Start of epoch 12
Training loss: 7.5469

Start of epoch 13
Training loss: 7.5381

Start of epoch 14
Training loss: 7.5287

Start of epoch 15
Training loss: 7.5187

Start of epoch 16
Training loss: 7.5080

Start of epoch 17
Training loss: 7.4965

Start of epoch 18
Training loss: 7.4842

Start of epoch 19
Training loss: 7.4711

Start of epoch 20
Training loss: 7.4571

Start of epoch 21
Training loss: 7.4421

Start of epoch 22
Training loss: 7.4261

Start of epoch 23
Training loss: 7.4090

Start of epoch 24
Trainin