Segmentation Fault #8

Closed
Sachin19 opened this issue Dec 10, 2019 · 5 comments
Comments

@Sachin19

Hi,

I'm trying to train new embeddings with your code on a corpus of approximately 4B tokens, but the code segfaults right after reading the corpus and printing the number of tokens. I'm using ~200G of RAM. Do I need more memory, or could it be another issue? For reference, word2vec and fastText trained just fine on this corpus.

Thanks in advance!

@yumeng5
Owner

yumeng5 commented Dec 11, 2019

Hi,

Thanks for reporting this. I haven't tried running the code on a corpus with more than 4B tokens, so I can't estimate how much memory it needs (I apologize for not being able to test it right now, as I'm attending a conference). However, if memory were the problem, you would most likely have received a memory allocation error rather than a segmentation fault.

My current best guess is that you have too many documents/paragraphs in the corpus.

const int corpus_max_size = 40000000; // Maximum 40M documents in the corpus

As the line above shows, the maximum number of documents is hard-coded. If your corpus has more documents than this number, the code will run into a segmentation fault. To fix it, change the constant to a number larger than the number of lines in your corpus file (each line is one document/paragraph). Please give it a try and see whether it solves your issue.
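To check up front whether a corpus exceeds the limit, something like the following rough sketch could help. The helper names `count_lines` and `fits_capacity` are hypothetical and not part of this repository; the sketch only assumes, as described above, that each line of the corpus file is one document and that the limit must be strictly larger than the line count.

```c
#include <stdio.h>

/* Hypothetical helper: count the lines (documents) in an open stream.
   A final line without a trailing newline is counted as well. */
long long count_lines(FILE *fp) {
    long long lines = 0;
    int c, last = '\n';
    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n') lines++;
        last = c;
    }
    if (last != '\n') lines++;  /* unterminated last line */
    return lines;
}

/* Hypothetical helper: returns 1 if the limit is strictly larger than
   the number of documents, 0 otherwise. */
int fits_capacity(long long num_docs, long long capacity) {
    return num_docs < capacity;
}
```

Calling `count_lines` on the corpus file and passing the result to `fits_capacity` together with the value of `corpus_max_size` would tell you whether the constant needs to be raised before training.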

Please let me know if you still encounter any errors or have other questions!

Best,
Yu

@daskol

daskol commented Dec 11, 2019

@Sachin19 See related issue #6.

@yumeng5
Owner

yumeng5 commented Dec 17, 2019

Hi @Sachin19,

Thanks again for posting this issue. Did you get a chance to try my suggestion, and could you share an update?

Thanks,
Yu

@Sachin19
Author

Hi Yu,

Thank you so much for your suggestion. Line 18 was exactly the problem, and increasing the maximum number of documents resolved it.

I was also wondering if you could point me to resources on how to implement Riemannian optimization in a package like PyTorch.

Thanks,
Sachin

@yumeng5
Owner

yumeng5 commented Dec 18, 2019

Hi Sachin,

Thanks for letting me know! I'm glad it solved the issue.

Regarding Riemannian optimization, I'm not aware of existing PyTorch projects for the spherical space, but there are some for the hyperbolic space. For example, the Poincare embedding codebase has a PyTorch implementation of Riemannian optimization in the Poincare space. You could look specifically at the Poincare manifold implementation, where the Riemannian gradient is implemented, as well as the RSGD implementation. Although the optimization formula will be different for the spherical space, I think that code could serve as a good reference and template.

Please let me know if you have any other questions!

Best,
Yu

@yumeng5 yumeng5 closed this as completed Dec 25, 2019