
Could you train on Glove.42B.300d.txt #4

Closed
liaocs2008 opened this issue Oct 19, 2019 · 4 comments

liaocs2008 commented Oct 19, 2019

Hi,

I find your work interesting, and I was testing it on glove.6B.300d.txt (after converting it to the word2vec format). The program works fine and the output vectors look normal.

However, when I run it on glove.42B.300d.txt, the program outputs zeros for all words. I am guessing it might be a convergence issue. Could you test it and help me figure out the problem?

Thanks!

tca19 commented Oct 20, 2019

Hi @liaocs2008 ,

Thank you for your interest in my work. I was able to replicate the problem with the glove.42B.300d.txt file. There is a segmentation fault right after the binarization step, so the output file binary_vectors.vec is created but nothing is written to it. I'll investigate to find the source of the problem.

The code has been tested on dict2vec (2.3 million vectors) and the fastText Common Crawl vectors (2 million vectors) and works fine on those files, so it is strange to me that it fails only on the larger glove.42B file.


liaocs2008 commented Oct 20, 2019

The segmentation fault can be fixed by restricting the number of words read while loading the embeddings: the problem is that you end the while loop only when you see EOF. After you fix that, you can train the embeddings and check the binary vectors.
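To illustrate the suggested fix, here is a minimal sketch of a loading loop bounded by the word count declared in the file header rather than by EOF alone. The function name `load_vectors` and the exact file layout (a "nwords dim" header followed by one word and `dim` floats per line, as in the word2vec text format) are assumptions for this sketch, not the repository's actual code.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAXWORDLENGTH 1024

/* Read word2vec-style text embeddings. Returns the number of vectors
 * actually loaded, or -1 on error. The loop stops after `nwords`
 * entries, not only at EOF, so a trailing newline or stray bytes at
 * the end of the file cannot push the index past the allocation. */
long load_vectors(const char *path, float **out_vecs, int *out_dim)
{
    FILE *fp = fopen(path, "r");
    if (!fp)
        return -1;

    long nwords;
    int dim;
    if (fscanf(fp, "%ld %d", &nwords, &dim) != 2) {
        fclose(fp);
        return -1;
    }

    float *vecs = malloc(sizeof(float) * (size_t)nwords * (size_t)dim);
    char word[MAXWORDLENGTH];

    long i = 0;
    /* bound the loop by nwords, not just by EOF */
    while (i < nwords && fscanf(fp, "%1023s", word) == 1) {
        for (int j = 0; j < dim; ++j) {
            if (fscanf(fp, "%f", &vecs[i * dim + j]) != 1) {
                free(vecs);   /* truncated file */
                fclose(fp);
                return -1;
            }
        }
        ++i;
    }

    fclose(fp);
    *out_vecs = vecs;
    *out_dim = dim;
    return i;
}
```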

Your binary transformation function behaves like a sign function, so an output of all zeros means all the values are negative. That's why I am guessing convergence could be the problem here.
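The sign-function observation can be sketched as follows. This is an assumption about how the bits are packed (one bit per dimension, 1 if positive), not the repository's actual binarization code, but it shows why a model whose latent values all collapse below zero produces all-zero codes:

```c
#include <stdint.h>

/* Sign-based binarization sketch: map 64 real values to one 64-bit
 * code, setting bit i to 1 iff v[i] > 0. If training collapses so
 * that every value is <= 0, the resulting code is all zeros, which
 * matches the behavior reported in this issue. */
uint64_t binarize64(const float *v)
{
    uint64_t code = 0;
    for (int i = 0; i < 64; ++i)
        if (v[i] > 0.0f)
            code |= (uint64_t)1 << i;
    return code;
}
```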


liaocs2008 commented Oct 20, 2019

Actually, I found that the segmentation fault is caused by the word buffer being too short: glove.42B.300d.txt contains words as long as 1K characters. Increasing MAXWORDLENGTH to 1024 would work.
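For reference, a bounded `fscanf` conversion is what prevents an overlong token from overflowing a fixed buffer, independently of how large `MAXWORDLENGTH` is set. The helper below is a hypothetical sketch, not the repository's code; the `%1023s` width caps the write at 1023 characters plus the terminating NUL for a 1024-byte buffer:

```c
#include <stdio.h>

#define MAXWORDLENGTH 1024  /* large enough for the ~1K-char tokens in glove.42B */

/* Read one whitespace-delimited token safely. The explicit field
 * width in "%1023s" guarantees fscanf never writes more than
 * MAXWORDLENGTH bytes, so an overlong token is truncated instead of
 * overflowing the buffer (an unbounded "%s" would crash here). */
int read_word(FILE *fp, char word[MAXWORDLENGTH])
{
    return fscanf(fp, "%1023s", word) == 1;
}
```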

@tca19 tca19 reopened this Nov 3, 2019

tca19 commented Nov 3, 2019

Setting MAXWORDLEN to 1024 did not solve the problem in my case. The segmentation fault was caused by the stop condition used when reading the file to load the embeddings; more details can be found in #5 (comment).

I have now fixed the problem and was able to train on glove.42B.300d.txt without any issues. I had to tune some hyperparameters and set lr-rec to 0 to get good results. Here are the results after binarizing the vectors of glove.42B.300d.txt:

Filename     | Spearman | OOV
==============================
MEN.txt      |    0.684 |   0%
WS353.txt    |    0.591 |   0%
SimLex.txt   |    0.369 |   0%
RW.txt       |    0.372 |   1%
SimVerb.txt  |    0.207 |   0%

@tca19 tca19 closed this as completed Nov 3, 2019