Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faulty Memory Management in Implementation of Hash Table with Linear Probing #6

Closed
daskol opened this issue Dec 4, 2019 · 2 comments

Comments

@daskol
Copy link

daskol commented Dec 4, 2019

Adding a new word to vocabulary could cause in segmentation fault as vocabulary grows. The reason is that only vocab array is reallocated vocab_hash array remains untouched.

@daskol
Copy link
Author

daskol commented Dec 4, 2019

I think that the best solution for the issue would be removing the reallocation and introducing option to specify expetected vocabulary size.

@yumeng5
Copy link
Owner

yumeng5 commented Dec 4, 2019

Hi,

Thanks for the comment. You are right that the size of vocab_hash is hard-coded as a fixed constant at the beginning of the file.

const int vocab_hash_size = 30000000; // Maximum 30 * 0.7 = 21M words in the vocabulary

So if you have an extremely large vocabulary, a simple solution is to modify the constant here to have a large enough vocab_hash_size.

Similarly, as you proposed, we can remove the reallocation of the vocab array part and set its maximum size to be equal to that of vocab_hash. However, you will need to allocate that much memory to the vocab array regardless of how small the corpus might be. The reallocation procedure serves as a way to dynamically fit to the vocabulary of the real corpus, thus improving the memory efficiency of the code.

If you do not care about memory issues at all, then you can absolutely remove the reallocation part. However, to make the code generally more memory-efficient, I would like to keep the reallocation part for now.

Please let me know if you have any further suggestions!

Best,
Yu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants