Excessive memory consumption for large enough words #15
Thanks @jcbrockschmidt for the suggestion and the PR. This is a very good idea, having a failsafe for memory consumption. Memory usage has been an issue for a while, and I haven't really tackled it. Another idea I have in mind is a validator to check whether a word could be ignored, for example when the number of vowels in the word exceeds a certain threshold. I will experiment more with your PR and some other approaches and get back to you soon, hopefully by the end of next week.
I may have figured out a way to reduce the memory footprint. My method essentially splits censored words into slices, computes all variants for each slice, and matches strings by comparing their slices to each censored word's slices. So far I've been able to reduce the memory footprint to 15 MB with the default wordlist, and was able to load a previously unloadable word list with only 60 MB used. My code still needs to be polished. When done, I plan to run some profiling tests to compare speeds with version 0.6.
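A rough sketch of what this slicing approach could look like (the substitution map and helper names here are hypothetical, not the actual PR code):

```python
from itertools import product

# Hypothetical leetspeak-style substitution map.
CHARS_MAPPING = {"a": ("a", "*", "4"), "e": ("e", "*", "3")}

SLICE_LEN = 4  # censored words are split into slices of this length

def slice_variants(word):
    """Precompute the variant set for each slice of a censored word.

    Worst-case memory is ceil(len(word) / SLICE_LEN) * 3**SLICE_LEN
    variants, instead of 3**len(word) for the whole word at once.
    """
    slices = [word[i:i + SLICE_LEN] for i in range(0, len(word), SLICE_LEN)]
    return [
        {"".join(v) for v in product(*(CHARS_MAPPING.get(c, (c,)) for c in s))}
        for s in slices
    ]

def matches(candidate, word_slices):
    """Check a candidate string slice-by-slice against the variant sets."""
    pieces = [candidate[i:i + SLICE_LEN] for i in range(0, len(candidate), SLICE_LEN)]
    return len(pieces) == len(word_slices) and all(
        piece in variants for piece, variants in zip(pieces, word_slices)
    )
```

Matching can stay exact because each substitution swaps one character for one character, so variant slices always have the same length as the original slices.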
Thank you so much for the hard work. I'm looking forward to your result :)
My solution is currently about 5 times slower. I'm going to attempt some optimizations, but I may need to try something different.
Alright, guess I fixed it, haha! Only took a few lines (checking the lengths of strings before comparing their slices). I'm going to try to add multi-threading now.
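Assuming the total length of each censored word is stored alongside its slices, that guard could be as small as this (a hedged sketch, building on the hypothetical `matches` above):

```python
def matches_fast(candidate, word_len, word_slices):
    # Cheap early exit: substitutions are one-for-one, so the lengths
    # must match exactly. This skips the slicing and set lookups for
    # the vast majority of candidate/word pairs.
    if len(candidate) != word_len:
        return False
    return matches(candidate, word_slices)
```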
Did some profiling. Ran the unit tests for all the versions of Python we support. Tests were run 30 times each and their times averaged. Here are the results:
So my method is about the same on most versions of Python, but actually sees speed-ups on Python 3.8 and PyPy3. I think I can do a little tweaking on the variant/slicing threshold to speed things up a bit more.
Turns out if we just do character-by-character comparisons, we can speed things up roughly 10x. I ran the same profiling tests with this method, this time using only 5 trials (since the difference in speed is so drastic, we don't need a lot of accuracy):
On top of this, the memory consumption is lowered. Here's the process memory consumption for the old method vs the character-by-character method:
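A minimal sketch of what the character-by-character method could look like (hypothetical substitution map as above; the actual merged code may differ). No variants are generated at all, so memory stays proportional to the wordlist itself:

```python
def matches_char_by_char(candidate, censored_word):
    """True if candidate is a substitution variant of censored_word.

    A single pass over the string with one membership test per
    character; nothing is precomputed or materialized.
    """
    if len(candidate) != len(censored_word):
        return False
    return all(
        c in CHARS_MAPPING.get(w, (w,))
        for c, w in zip(candidate, censored_word)
    )
```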
Issue fixed with the merge of #17 in version 0.7.0. |
When a word or phrase is sufficiently long, the method `Profanity.load_censor_words_from_file` will consume all of a system's memory. The cause appears to be the use of `product` in `Profanity._generate_patterns_from_word`. As the number of substitutable characters in a word grows linearly, the number of possible patterns grows exponentially. In other words, it has a memory footprint of `O(n^x)`, where `n` is the approximate number of substitutes per character and `x` is the number of substitutable characters. At some point, `x` becomes so large that an entire system's memory will be consumed.

You can use these words in a text file to pinpoint where this threshold is:
The second-to-last line of 15 characters will take relatively long to process. Then the final word of 16 characters will consume all of a system's available memory (up to 16 GB). Here's a table of how much memory `python3.6` takes up as a process, given the length of our test words and `len(product(...))` (`len(product(...))` is calculated as `3**x`, since "e" can be substituted with "e", "*", or "3").

For context, `python3.6` takes up only about 60 MB when the default wordlist of 320 words is loaded.

Either the underlying data structure needs to be changed, or words for which the predicted output of `product` exceeds a certain amount need to be ignored.
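As a concrete illustration of the `3**x` growth (a hedged sketch with a made-up substitution map, using `itertools.product` the same way the pattern generation appears to):

```python
from itertools import product

SUBS = {"e": ("e", "*", "3")}  # hypothetical: three variants per "e"

def variant_count(word):
    # product() itself is lazy; it's materializing the variants (as the
    # pattern generation does) that costs 3**x memory for x
    # substitutable characters.
    return sum(1 for _ in product(*(SUBS.get(c, (c,)) for c in word)))

print(variant_count("e" * 5))   # 243   == 3**5
print(variant_count("e" * 10))  # 59049 == 3**10
# "e" * 16 would yield 43,046,721 variants; stored as 16-character
# strings, that is gigabytes of memory, matching the blow-up above.
```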