
wang2vec

Extension of the original word2vec (https://code.google.com/p/word2vec/) using different architectures

To build the code, simply run:

make

The command to build word embeddings is exactly the same as in the original version, except that we removed the argument -cbow and replaced it with the argument -type:

./word2vec -train input_file -output embedding_file -type 0 -size 50 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 5 -cap 0

The -type argument is an integer that defines the architecture to use. These are the possible values:
0 - cbow
1 - skipngram
2 - cwindow (see below)
3 - structured skipngram (see below)
4 - collobert's senna context window model (still experimental)
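When training is scripted, the numeric -type values are easy to mix up. The following is a minimal sketch of a wrapper that maps architecture names to the values listed above and builds the argument list. The names `ARCHITECTURES` and `build_command` are hypothetical and not part of wang2vec; only flags shown in the example command are used.

```python
# Hypothetical helper, not part of wang2vec: maps readable names to the
# -type values documented in this README.
ARCHITECTURES = {
    "cbow": 0,
    "skipngram": 1,
    "cwindow": 2,
    "structured_skipngram": 3,
    "senna": 4,  # described as still experimental
}


def build_command(train, output, arch="cbow", size=50, window=5,
                  negative=10, executable="./word2vec"):
    """Return an argv list for the word2vec binary built by `make`."""
    if arch not in ARCHITECTURES:
        raise ValueError("unknown architecture: %s" % arch)
    return [
        executable,
        "-train", train,
        "-output", output,
        "-type", str(ARCHITECTURES[arch]),
        "-size", str(size),
        "-window", str(window),
        "-negative", str(negative),
    ]


if __name__ == "__main__":
    # Print the command for the structured skipngram architecture.
    print(" ".join(build_command("input_file", "embedding_file",
                                 arch="structured_skipngram")))
```

The returned list can be passed to `subprocess.run` as is; any flag left out falls back to whatever default the binary uses.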

If you use functionalities we added to the original code for research, please support us by citing our paper (thanks!):

@InProceedings{Ling:2015:naacl,
author = {Ling, Wang and Dyer, Chris and Black, Alan and Trancoso, Isabel},
title = {Two/Too Simple Adaptations of word2vec for Syntax Problems},
booktitle = {Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
year = {2015},
publisher = {Association for Computational Linguistics},
location = {Denver, Colorado}
}

The main changes we made to the code are:

****** Structured Skipngram and CWINDOW ******

We added two neural network architectures, cwindow and structured skipngram, both aimed at solving syntax problems.

These are described in our paper:

-Two/Too Simple Adaptations of word2vec for Syntax Problems

****** Noise Contrastive Estimation objective ******

Noise contrastive estimation (NCE) is another approximation of the word softmax objective function, in addition to hierarchical softmax and negative sampling, which are implemented in the default word2vec toolkit. It is turned on with the -nce argument: for example, set -nce 10 to use 10 noise samples. When using NCE, remember to set both -negative and -hs to 0.
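For reference, the textbook form of the NCE objective for one (word, context) pair is sketched below. The symbols are assumptions introduced here, not taken from the code: s(w, c) is the model score, q is the noise distribution, and k is the number of noise samples given to -nce. The exact expression used in word2vec.c may differ in detail.

```latex
\Delta s(w, c) = s(w, c) - \log\bigl(k \, q(w)\bigr)

J(w, c) = \log \sigma\bigl(\Delta s(w, c)\bigr)
        + \sum_{i=1}^{k} \log\Bigl(1 - \sigma\bigl(\Delta s(w_i, c)\bigr)\Bigr),
\qquad w_i \sim q
```

Negative sampling has the same shape but uses the raw score s(w, c) in place of \Delta s(w, c), dropping the \log(k \, q(w)) correction.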

****** Parameter Capping ******

By default, parameters are updated freely and are not checked for numerical overflow, in order to maximize efficiency. However, on some datasets the CWINDOW architecture overflows, which leads to segfaults. If this happens, even with other architectures, try setting the parameter -cap 1 to avoid the problem at the cost of a small degradation in computational speed.
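The idea behind capping can be illustrated as clamping each parameter to a fixed range after an update. This is only a sketch of the general technique: the function name `cap_parameters` and the bound of 50 are assumptions for illustration, and the actual behaviour of -cap 1 is defined in the C source.

```python
def cap_parameters(values, limit=50.0):
    """Clamp every value to the range [-limit, limit].

    `limit` is an illustrative assumption, not a value from this README.
    """
    return [max(-limit, min(limit, v)) for v in values]


if __name__ == "__main__":
    # A runaway weight is pulled back to the bound; small weights are unchanged.
    print(cap_parameters([1e9, -1e9, 0.5]))
```

The extra comparison per parameter is the source of the small slowdown mentioned above.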

****** Class-based Negative Sampling ******

A new argument -negative-classes can be added to specify groups of classes. It receives a file in the format:

N dog
N cat
N worm
V doing
V finding
V dodging
A charming
A satirical

where each line defines a class and a word belonging to that class. For a word that belongs to a class, negative sampling is only performed over the other words in that class. For instance, if the desired output is dog, we would only sample from cat and worm. For words not in the list, sampling is performed over all word types.
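The sampling rule can be sketched as follows. This is an illustration under stated assumptions, not the implementation in word2vec.c: the helper names are hypothetical, and candidates are drawn uniformly here for simplicity rather than from a frequency-based distribution.

```python
import random


def parse_classes(lines):
    """Parse 'CLASS word' lines into word->class and class->words maps."""
    word_to_class, class_to_words = {}, {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        cls, word = parts
        word_to_class[word] = cls
        class_to_words.setdefault(cls, []).append(word)
    return word_to_class, class_to_words


def sample_negatives(target, vocab, word_to_class, class_to_words, k, rng=random):
    """Draw k negatives: from the target's class if it has one, else from vocab."""
    cls = word_to_class.get(target)
    pool = class_to_words[cls] if cls is not None else vocab
    candidates = [w for w in pool if w != target]
    if not candidates:
        return []
    # Uniform sampling with replacement (an assumption made for brevity).
    return [rng.choice(candidates) for _ in range(k)]


if __name__ == "__main__":
    w2c, c2w = parse_classes(["N dog", "N cat", "N worm", "V doing"])
    vocab = ["dog", "cat", "worm", "doing", "the"]
    # Negatives for "dog" come only from the other words in class N.
    print(sample_negatives("dog", vocab, w2c, c2w, 5))
```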

Warning: the file must be ordered so that all words in the same class are grouped together, so the following would not work correctly:

N dog
A charming
N cat
N worm
V doing
V finding
V dodging
A satirical
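Because a misordered file is easy to produce by accident, it can help to check the grouping before training. The following is a minimal sketch of such a check; `find_ungrouped_classes` is a hypothetical helper and not part of wang2vec.

```python
def find_ungrouped_classes(lines):
    """Return the classes whose lines are not contiguous in the file."""
    seen = set()
    previous = None
    bad = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        cls = parts[0]
        if cls != previous:
            # A class that starts a new run after already appearing is split.
            if cls in seen and cls not in bad:
                bad.append(cls)
            seen.add(cls)
            previous = cls
    return bad


if __name__ == "__main__":
    misordered = ["N dog", "A charming", "N cat", "N worm",
                  "V doing", "V finding", "V dodging", "A satirical"]
    # Reports the classes that need to be regrouped.
    print(find_ungrouped_classes(misordered))
```

An empty result means every class occupies a single contiguous block, as the -negative-classes file requires.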

****** Minor Changes ******

distance_txt and kmeans_txt are adaptations of the original distance and kmeans code that take textual (-binary 0) embeddings as input.
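Textual embeddings are also easy to read from other tools. The sketch below assumes the usual word2vec text layout (a header line with the vocabulary size and dimension, then one word per line followed by its values); that wang2vec writes exactly this layout is an assumption, and `load_text_embeddings` is a hypothetical helper.

```python
def load_text_embeddings(path):
    """Load embeddings written with -binary 0 into a dict of word -> list of floats."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        header = f.readline().split()
        # Assumed header layout: "<vocab_size> <dimension>"
        dim = int(header[1])
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < dim + 1:
                continue  # skip blank or truncated lines
            embeddings[parts[0]] = [float(x) for x in parts[1:dim + 1]]
    return embeddings, dim
```

For example, `vectors, dim = load_text_embeddings("embedding_file")` returns the vectors keyed by word along with their dimension.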