Implementation Process

Data Preparation: define the corpus, then clean, normalise and tokenise the words

Hyperparameters: learning rate, epochs, window size, embedding size

Generate Training Data: build the vocabulary, one-hot encode the words, and build dictionaries that map id to word and vice versa

Model Training: pass the encoded words through the forward pass, compute the loss and the prediction error, then adjust the weights using backpropagation

Inference: get a word's vector and find similar words

Further Improvements: speed up training with Skip-gram Negative Sampling (SGNS) and Hierarchical Softmax
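The Model Training step listed above can be sketched as a single skip-gram update in NumPy. This is a minimal illustration under stated assumptions, not the notebook's own training loop: the names `softmax`, `train_step`, `w1` and `w2` are hypothetical, the vocabulary size and embedding size are toy values, and a full softmax is used (no negative sampling or hierarchical softmax).

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - np.max(x))
    return e / e.sum()

def train_step(w1, w2, target, contexts, lr):
    """One skip-gram update for a single center word (hypothetical helper).

    w1: (V, N) input embeddings, w2: (N, V) output weights,
    target: one-hot (V,), contexts: list of one-hot (V,) vectors.
    """
    h = w1.T @ target                 # hidden layer: the center word's embedding, shape (N,)
    u = w2.T @ h                      # one score per vocabulary word, shape (V,)
    y_pred = softmax(u)

    # Prediction error summed over all context words
    error = np.sum([y_pred - c for c in contexts], axis=0)

    # Negative log-likelihood of the context words given the center word
    loss = -sum(u[np.argmax(c)] for c in contexts) + len(contexts) * np.log(np.sum(np.exp(u)))

    # Backpropagation: compute both gradients before touching the weights
    dw2 = np.outer(h, error)              # shape (N, V)
    dw1 = np.outer(target, w2 @ error)    # shape (V, N)
    w1 -= lr * dw1
    w2 -= lr * dw2
    return loss

# Toy setup (assumed values): 4-word vocabulary, 3-dimensional embeddings
rng = np.random.default_rng(0)
V, N = 4, 3
w1 = rng.uniform(-1, 1, (V, N))
w2 = rng.uniform(-1, 1, (N, V))
eye = np.eye(V)
target, contexts = eye[0], [eye[1], eye[2]]

losses = [train_step(w1, w2, target, contexts, lr=0.1) for _ in range(50)]
print(losses[0], losses[-1])
```

Repeating the update on the same example should drive the loss down; in a real run the step would be applied to every (center, context) pair for each epoch.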

In [199]:
import collections
import numpy as np

# Toy corpus: two short sentences
text = ["natural language processing and machine learning is fun and exciting",
        "I am also a good in machine learning"]

In [200]:
# Lowercase each sentence and split it on whitespace into a list of tokens
corpus = [[word.lower() for word in sentence.split()] for sentence in text]

In [201]:
corpus

[['natural',
  'language',
  'processing',
  'and',
  'machine',
  'learning',
  'is',
  'fun',
  'and',
  'exciting'],
 ['i', 'am', 'also', 'a', 'good', 'in', 'machine', 'learning']]

In [202]:
settings = {
    'window_size': 2,       # context words taken on each side of the center word
    'n': 10,                # dimension of the word embeddings, i.e. the size of the hidden layer
    'epochs': 50,           # number of training epochs
    'learning_rate': 0.01,  # step size for the weight updates
}
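To make the effect of `window_size` concrete, the sketch below lists the (center, context) pairs that a window of 2 produces for a short sentence. The helper name `context_pairs` is hypothetical and not part of the notebook; it only illustrates how the window is clipped at the sentence boundaries.

```python
def context_pairs(sentence, window_size):
    """Return (center, context) word pairs within +/- window_size positions."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window_size)                    # clip at the start of the sentence
        hi = min(len(sentence), i + window_size + 1)    # clip at the end of the sentence
        for j in range(lo, hi):
            if j != i:                                  # skip the center word itself
                pairs.append((center, sentence[j]))
    return pairs

sentence = ["natural", "language", "processing", "and", "machine"]
pairs = context_pairs(sentence, window_size=2)
for center, context in pairs[:5]:
    print(center, "->", context)
```

Words near the edges of a sentence get fewer context words than words in the middle, so the number of pairs per center word varies.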

In [209]:
class word2vec():
    def __init__(self):
        # Hyperparameters are read from the module-level `settings` dict
        self.n = settings['n']
        self.lr = settings['learning_rate']
        self.epochs = settings['epochs']
        self.window = settings['window_size']
        print(self.n, self.lr, self.epochs, self.window)

    def generate_training_data(self, settings, corpus):
        # NOTE: `settings` is accepted but not used yet; so far this method
        # only builds the vocabulary and the lookup tables.
        # Count how often each word occurs across all sentences
        word_counts = collections.defaultdict(int)
        for row in corpus:
            for word in row:
                word_counts[word] += 1

        self.v_count = len(word_counts)       # number of unique words in the vocabulary
        self.words_list = list(word_counts)   # vocabulary in first-seen order

        # Lookup tables: word -> id and id -> word
        self.word_index = {word: i for i, word in enumerate(self.words_list)}
        self.index_word = {i: word for i, word in enumerate(self.words_list)}
        print(self.index_word)


a = word2vec()
a.generate_training_data(settings, corpus)

10 0.01 50 2
{0: 'natural', 1: 'language', 2: 'processing', 3: 'and', 4: 'machine', 5: 'learning', 6: 'is', 7: 'fun', 8: 'exciting', 9: 'i', 10: 'am', 11: 'also', 12: 'a', 13: 'good', 14: 'in'}
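With the word-to-id dictionary in place, the remaining part of the Generate Training Data step is to one-hot encode each center word together with its context words. The following is one possible sketch, assuming a `word_index` mapping like the one printed above; `one_hot`, `build_training_data` and the toy corpus are illustrative names and data, not the notebook's code.

```python
import numpy as np

def one_hot(word, word_index):
    # Vector of zeros with a single 1 at the word's vocabulary id
    vec = np.zeros(len(word_index), dtype=int)
    vec[word_index[word]] = 1
    return vec

def build_training_data(corpus, word_index, window_size):
    """Return a list of (target_one_hot, [context_one_hots]) per word position."""
    training_data = []
    for sentence in corpus:
        for i, word in enumerate(sentence):
            target = one_hot(word, word_index)
            lo = max(0, i - window_size)
            hi = min(len(sentence), i + window_size + 1)
            context = [one_hot(sentence[j], word_index)
                       for j in range(lo, hi) if j != i]
            training_data.append((target, context))
    return training_data

# Toy example (assumed data): one sentence, ids assigned in first-seen order
toy_corpus = [["machine", "learning", "is", "fun"]]
toy_index = {w: i for i, w in enumerate(dict.fromkeys(w for s in toy_corpus for w in s))}
data = build_training_data(toy_corpus, toy_index, window_size=2)

target, context = data[0]
print(target, [c.tolist() for c in context])
```

One-hot vectors are simple but wasteful for large vocabularies; storing the integer ids and indexing the weight matrix directly gives the same result with less memory.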


In [175]:
a={}

In [176]:
a['a']=1

In [178]:
a['a']+=1

In [179]:
a

{'a': 2}
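The cells above show that `+=` works on a dictionary key once it has been initialised. On a plain `dict`, incrementing a missing key raises `KeyError`, which is presumably why the vocabulary code uses `collections.defaultdict(int)`. A small comparison, using an assumed toy word list:

```python
import collections

words = ["and", "machine", "and", "learning"]

# Plain dict: the key must exist before it can be incremented
plain = {}
for w in words:
    if w not in plain:
        plain[w] = 0
    plain[w] += 1

# defaultdict(int): missing keys start at 0 automatically
counted = collections.defaultdict(int)
for w in words:
    counted[w] += 1

# Counter: the standard-library shortcut for the same tally
counter = collections.Counter(words)

print(plain, dict(counted), dict(counter))
```

All three approaches produce the same counts; `Counter` additionally offers `most_common()` for frequency-sorted output.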