Segmentation Fault #8

Sachin19 · 2019-12-10T19:49:19Z

Hi,

I'm trying to train new embeddings with your code on a corpus of approximately 4B tokens, but the code hits a segmentation fault right after reading the corpus and printing the number of tokens. I'm using ~200 GB of RAM. Do I need more memory, or could it be another issue? For reference, word2vec and fastText trained just fine on this corpus.

Thanks in advance!

yumeng5 · 2019-12-11T07:42:43Z

Hi,

Thanks for letting me know about the issue. I haven't tried running the code on a corpus with more than 4B tokens, so I can't say how much memory it will need (I apologize for not being able to try it right now since I'm attending a conference). However, if this were caused by running out of memory, you should have received a memory allocation error instead of a segmentation fault.

My current best guess is that you have too many documents/paragraphs in the corpus.

Spherical-Text-Embedding/src/jose.c

Line 18 in b0f8820

const int corpus_max_size = 40000000; // Maximum 40M documents in the corpus

As shown in the line of code above, the maximum number of documents allowed is hard-coded. If your corpus has more documents than this number, the code will run into a segmentation fault. To fix this, change the constant to a number larger than the number of lines in your corpus file (each line is one document/paragraph) and recompile. Could you give it a try and see if this solves your issue?
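To pick a safe value, it can help to count the lines in the corpus file first. The following is a minimal sketch of such a check, not code from `jose.c`: the helper names `count_lines` and `corpus_fits` are hypothetical, and it assumes one document per line as described above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of jose.c): counts the lines in a file.
   A final line without a trailing newline is counted as well.
   Returns -1 if the file cannot be opened. */
long long count_lines(const char *path) {
    assert(path != NULL);
    FILE *fp = fopen(path, "rb");
    if (fp == NULL) return -1;

    static char buf[1 << 16];
    long long lines = 0;
    int last = '\n';  /* so that an empty file counts as 0 lines */
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (size_t i = 0; i < n; i++) {
            if (buf[i] == '\n') lines++;
        }
        last = buf[n - 1];
    }
    fclose(fp);
    if (last != '\n') lines++;  /* unterminated final line */
    return lines;
}

/* Hypothetical check: returns 1 if num_docs is strictly below the
   hard-coded limit (e.g. corpus_max_size), otherwise 0. */
int corpus_fits(long long num_docs, long long max_docs) {
    return num_docs >= 0 && num_docs < max_docs;
}
```

If `corpus_fits(count_lines("corpus.txt"), 40000000)` returns 0 for your file (the file name here is only a placeholder), the limit on line 18 needs to be raised before training.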

Please let me know if you still encounter any errors or have other questions!

Best,
Yu

daskol · 2019-12-11T19:06:28Z

@Sachin19 See related issue #6.

yumeng5 · 2019-12-17T12:20:35Z

Hi @Sachin19,

Thanks again for posting this issue. I was wondering if you got a chance to try my suggestions and could provide any update on this issue?

Thanks,
Yu

Sachin19 · 2019-12-17T15:42:47Z

Hi Yu,

Thank you so much for your suggestion. Line 18 was exactly the issue I was facing, and increasing the maximum number of documents resolved it.

I was also wondering if you could point me to resources on how to implement Riemannian optimization in a package like PyTorch.

Thanks,
Sachin

yumeng5 · 2019-12-18T01:57:47Z

Hi Sachin,

Thanks for letting me know! I'm glad it solved the issue.

Regarding Riemannian optimization, I'm not aware of existing PyTorch projects for the spherical space, but there are some for the hyperbolic space. For example, the Poincaré embedding codebase has a PyTorch implementation of Riemannian optimization in the Poincaré space. You could look specifically at the Poincaré manifold implementation, where the Riemannian gradient is implemented, as well as the RSGD implementation. Although the update formula will be different for the spherical space, I feel that code could serve as a good reference and template.

Please let me know if you have any other questions!

Best,
Yu

yumeng5 closed this as completed Dec 25, 2019
