Word2Vec captures semantic closeness between words by mapping each word to a vector that encodes its contextual usage.
Mikolov et al. (https://arxiv.org/abs/1301.3781) proposed two architectures for Word2Vec:
- Skip-gram
- CBOW (Continuous Bag of Words)
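To make the contrast concrete, here is a minimal sketch of how skip-gram forms its training pairs; the toy corpus and window size are illustrative assumptions, not taken from this repository:

```python
# Illustrative sketch: generating skip-gram (center, context) training pairs.
# Skip-gram predicts each context word from the center word;
# CBOW would instead predict the center word from its surrounding context.

corpus = "the quick brown fox jumps over the lazy dog".split()
window = 2  # context window size (assumption; the repo may use a different value)

pairs = []
for i, center in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((center, corpus[j]))  # (center, context)

print(pairs[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```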
The original code was written in C.
This repository contains a from-scratch Python implementation of the skip-gram neural network, without using any machine learning or text processing libraries.
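As a rough picture of what such a from-scratch network computes, below is a minimal sketch of one softmax skip-gram update in plain NumPy. The names `w1` and `w2`, the dimensions, and the learning rate are assumptions (with `w1` echoing the `skipgram_w1.npy` file this repo saves); the repository's actual code may differ.

```python
import numpy as np

# All sizes and the learning rate below are illustrative assumptions.
V, D = 5000, 100                         # vocabulary size, embedding dimension
lr = 0.01                                # learning rate
rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.1, size=(V, D))  # input embeddings (what skipgram_w1.npy would hold)
w2 = rng.normal(scale=0.1, size=(D, V))  # output-layer weights

def train_step(center_idx, context_idx):
    """One softmax skip-gram update for a single (center, context) pair."""
    global w2                            # "w2 -= ..." rebinds the name, so declare it global
    h = w1[center_idx].copy()            # hidden layer = center word's row of w1 (copied to avoid aliasing)
    logits = h @ w2                      # scores over the whole vocabulary
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax
    grad = probs.copy()
    grad[context_idx] -= 1.0             # d(cross-entropy)/d(logits)
    w1[center_idx] -= lr * (w2 @ grad)   # update the center word's embedding
    w2 -= lr * np.outer(h, grad)         # update the output weights
```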
To train the model, run the `train_minibatch.py` script from the command line:

```
python train_minibatch.py
```
To predict similar words, run the `predict.py` script from the command line:

```
python predict.py
```
- `train_minibatch.py` is the training file. It trains the neural network on any given dataset (`dataset.csv`) and generates `skipgram_w1.npy`, `initialPlot.png` (a plot of the untrained word embeddings), and `finalPlot.png` (a plot of the trained word embeddings).
- The resulting trained word vectors are saved as `skipgram_w1.npy`.
- `predict.py` uses the trained word vectors to:
  - output the cosine similarity between two input words;
  - output the 10 closest context words to any input word.
  It reads its inputs from the command line; a sketch of this lookup follows below.
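For reference, the kind of lookup `predict.py` performs can be sketched as follows. Only `skipgram_w1.npy` is taken from this repository; the vocabulary mapping (`vocab`) is a hypothetical placeholder, since the repo's word-to-index storage is not described here.

```python
import numpy as np

# `skipgram_w1.npy` is the file this repo saves; the vocabulary mapping below is
# a hypothetical placeholder assumed to cover every row of w1 -- the repo's
# actual word-to-index storage may differ.
w1 = np.load("skipgram_w1.npy")                  # one embedding row per vocabulary word
vocab = {"king": 0, "queen": 1, "apple": 2}      # hypothetical: word -> row index
inv_vocab = {i: w for w, i in vocab.items()}

def similarity(word_a, word_b):
    """Cosine similarity between the trained vectors of two words."""
    a, b = w1[vocab[word_a]], w1[vocab[word_b]]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def closest(word, k=10):
    """The k vocabulary words closest to `word` by cosine similarity."""
    v = w1[vocab[word]]
    scores = (w1 @ v) / (np.linalg.norm(w1, axis=1) * np.linalg.norm(v))
    order = np.argsort(-scores)                  # best match first
    return [inv_vocab[i] for i in order if i != vocab[word]][:k]
```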
After the from-scratch version, I implemented another version using TensorFlow.