
Could you train on Glove.42B.300d.txt #4

Closed
liaocs2008 opened this issue Oct 19, 2019 · 4 comments

liaocs2008 commented Oct 19, 2019

Hi,

I find your work interesting, and I was testing it on glove.6B.300d.txt (after converting it to the word2vec format). The program works fine and the output vectors look normal.

However, when I run it on glove.42B.300d.txt, the program outputs zeros for all words. I am guessing it might be a convergence issue. Could you test it and help me figure out the problem?

Thanks!

tca19 commented Oct 20, 2019

Hi @liaocs2008 ,

Thank you for your interest in my work. I was able to replicate the problem with the glove.42B.300d.txt file. There is a segmentation fault right after the binarization step, so the output file binary_vectors.vec is created but nothing is written to it. I'll investigate to find the source of the problem.

The code has been tested on dict2vec (2.3 million vectors) and the fastText Common Crawl vectors (2 million vectors) and works fine on those files, so it is strange to me that it fails only on the larger glove.42B file.


liaocs2008 commented Oct 20, 2019

The segmentation fault can be fixed by restricting the number of words read while loading the embeddings: the problem is that you end the while loop only when you see EOF. After you fix that, you can train the embeddings and check the binary vectors.
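To illustrate the suggested fix, here is a minimal sketch of a loading loop bounded by the word count declared in the file header rather than by EOF alone. The function name `load_vectors` and the exact file layout (a "nwords dim" header followed by one word and `dim` floats per line, as in the word2vec text format) are assumptions for this sketch, not the repository's actual code.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAXWORDLENGTH 1024

/* Read word2vec-style text embeddings. Returns the number of vectors
 * actually loaded, or -1 on error. The loop stops after `nwords`
 * entries, not only at EOF, so a trailing newline or stray bytes at
 * the end of the file cannot push the index past the allocation. */
long load_vectors(const char *path, float **out_vecs, int *out_dim)
{
    FILE *fp = fopen(path, "r");
    if (!fp)
        return -1;

    long nwords;
    int dim;
    if (fscanf(fp, "%ld %d", &nwords, &dim) != 2) {
        fclose(fp);
        return -1;
    }

    float *vecs = malloc(sizeof(float) * (size_t)nwords * (size_t)dim);
    char word[MAXWORDLENGTH];

    long i = 0;
    /* bound the loop by nwords, not just by EOF */
    while (i < nwords && fscanf(fp, "%1023s", word) == 1) {
        for (int j = 0; j < dim; ++j) {
            if (fscanf(fp, "%f", &vecs[i * dim + j]) != 1) {
                free(vecs);   /* truncated file */
                fclose(fp);
                return -1;
            }
        }
        ++i;
    }

    fclose(fp);
    *out_vecs = vecs;
    *out_dim = dim;
    return i;
}
```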

Your binary transformation function behaves like a sign function, so an output of all zeros means all the values are negative. That's why I am guessing convergence could be the problem here.
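The sign-function observation can be sketched as follows. This is an assumption about how the bits are packed (one bit per dimension, 1 if positive), not the repository's actual binarization code, but it shows why a model whose latent values all collapse below zero produces all-zero codes:

```c
#include <stdint.h>

/* Sign-based binarization sketch: map 64 real values to one 64-bit
 * code, setting bit i to 1 iff v[i] > 0. If training collapses so
 * that every value is <= 0, the resulting code is all zeros, which
 * matches the behavior reported in this issue. */
uint64_t binarize64(const float *v)
{
    uint64_t code = 0;
    for (int i = 0; i < 64; ++i)
        if (v[i] > 0.0f)
            code |= (uint64_t)1 << i;
    return code;
}
```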


liaocs2008 commented Oct 20, 2019

Actually, I found that the segmentation fault is caused by the word buffer being too short: glove.42B.300d.txt contains words as long as 1K characters. Increasing MAXWORDLENGTH to 1024 would work.
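For reference, a bounded `fscanf` conversion is what prevents an overlong token from overflowing a fixed buffer, independently of how large `MAXWORDLENGTH` is set. The helper below is a hypothetical sketch, not the repository's code; the `%1023s` width caps the write at 1023 characters plus the terminating NUL for a 1024-byte buffer:

```c
#include <stdio.h>

#define MAXWORDLENGTH 1024  /* large enough for the ~1K-char tokens in glove.42B */

/* Read one whitespace-delimited token safely. The explicit field
 * width in "%1023s" guarantees fscanf never writes more than
 * MAXWORDLENGTH bytes, so an overlong token is truncated instead of
 * overflowing the buffer (an unbounded "%s" would crash here). */
int read_word(FILE *fp, char word[MAXWORDLENGTH])
{
    return fscanf(fp, "%1023s", word) == 1;
}
```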

@tca19 tca19 reopened this Nov 3, 2019

tca19 commented Nov 3, 2019

Setting MAXWORDLEN to 1024 did not solve the problem in my case. The segmentation fault was caused by the stop condition used when reading the file to load the embeddings; more details can be found in #5 (comment).

I have now fixed the problem and was able to train on glove.42B.300d.txt without any issues. I had to tune some hyperparameters and set lr-rec to 0 to get good results. Here are the results after binarizing the vectors of glove.42B.300d.txt:

Filename     | Spearman | OOV
==============================
MEN.txt      |    0.684 |   0%
WS353.txt    |    0.591 |   0%
SimLex.txt   |    0.369 |   0%
RW.txt       |    0.372 |   1%
SimVerb.txt  |    0.207 |   0%

@tca19 tca19 closed this as completed Nov 3, 2019