Excessive memory consumption for large enough words #15

Closed
jcbrockschmidt opened this issue Sep 29, 2020 · 8 comments

@jcbrockschmidt
Collaborator

jcbrockschmidt commented Sep 29, 2020

When a word or phrase is sufficiently long, the method Profanity.load_censor_words_from_file will consume all of a system's memory.

The cause appears to be the use of product in Profanity._generate_patterns_from_word. As the number of substitutable characters in a word grows linearly, the number of possible patterns grows exponentially. In other words, the memory footprint is O(n^x), where n is the approximate number of substitutes per character and x is the number of substitutable characters. At some point x becomes so large that the entire system's memory is consumed.
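
To illustrate the blow-up, here's a rough sketch (not the library's actual code; the substitution map is just a stand-in for better_profanity's character mapping):

```python
from itertools import product

# Stand-in substitution map; the real mapping in better_profanity is larger.
SUBS = {"e": ("e", "*", "3")}

def num_patterns(word):
    # Each substitutable character multiplies the pattern count, so a word with
    # x substitutable characters and ~n substitutes each yields ~n**x patterns.
    total = 1
    for ch in word:
        total *= len(SUBS.get(ch, (ch,)))
    return total

def all_patterns(word):
    # This is the kind of expansion that blows up: it materializes every variant.
    return ["".join(p) for p in product(*(SUBS.get(ch, (ch,)) for ch in word))]

print(num_patterns("e" * 13))      # 1594323
print(num_patterns("e" * 16))      # 43046721
print(len(all_patterns("e" * 8)))  # 6561 -- already 3**8 strings in memory
```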

You can use these words in a text file to pinpoint where this threshold is:

eeeeeeeeeeeee
eeeeeeeeeeeeee
eeeeeeeeeeeeeee
eeeeeeeeeeeeeeee

The second-to-last word of 15 characters will take a relatively long time to process. The final word of 16 characters will then consume all of a system's available memory (up to 16 GB). Here's a table of how much memory python3.6 takes up as a process given the length of our test words:

| e's | len(product(...)) | Memory (MB) |
| --- | --- | --- |
| 13 | 1594323 | 216 |
| 14 | 4782969 | 625 |
| 15 | 14348907 | 1411 |
| 16 | 43046721 | ... |

(len(product(...)) is calculated as 3**x since "e" can be substituted with "e", "*", or "3")

For context, python3.6 takes up only about 60 MB when the default 320-word wordlist is loaded.

Either the underlying data structure needs to be changed, or words for which the predicted output of product exceeds a certain threshold need to be ignored.
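
One possible shape for the second option, purely as a sketch (the threshold name and value below are made up, not taken from the PR):

```python
# Illustrative failsafe: estimate how many patterns product() would yield
# before expanding, and skip words over a cap. The cap value is arbitrary here.
MAX_PATTERNS = 1000000

def safe_to_expand(word, subs):
    total = 1
    for ch in word:
        total *= len(subs.get(ch, (ch,)))
        if total > MAX_PATTERNS:
            return False
    return True

print(safe_to_expand("e" * 12, {"e": ("e", "*", "3")}))  # True  (3**12 = 531441)
print(safe_to_expand("e" * 16, {"e": ("e", "*", "3")}))  # False (3**16 = 43046721)
```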

@snguyenthanh
Owner

Thanks @jcbrockschmidt for the suggestion and the PR. Having a failsafe for memory consumption is a very good idea.

Memory usage has been an issue for a while, and I haven't really tackled it. Another idea I have in mind is to have a validator to check whether the word could be ignored. For example, if the number of vowels in the word exceeds the MAX_NUM_VOWELS of the wordset, it can be ignored.
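
Roughly what I have in mind, as a sketch (the names below are illustrative, not existing constants in the library):

```python
# Sketch of the validator idea: skip words whose vowel count exceeds a limit
# derived from the wordset. Both names below are illustrative only.
VOWELS = set("aeiou")
MAX_NUM_VOWELS = 10

def can_be_loaded(word):
    return sum(ch in VOWELS for ch in word.lower()) <= MAX_NUM_VOWELS
```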

I will experiment more on your PR and some other approaches and get back to you soon, hopefully by end of next week.

@jcbrockschmidt
Collaborator Author

jcbrockschmidt commented Oct 1, 2020

I may have figured out a way to reduce the memory footprint. My method essentially splits censored words into slices, computes all variants for each slice, and matches strings by comparing their slices to each censored word's slices. So far I've been able to reduce the memory footprint to 15 MB with the default wordlist, and was able to load a previously unloadable word list with only 60 MB used.
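
Here's a rough sketch of the general idea (not the actual code; the slice length and substitution map are placeholders):

```python
from itertools import product

SUBS = {"e": ("e", "*", "3")}  # placeholder substitution map
SLICE_LEN = 4                  # placeholder slice length

def slices(word):
    # Split a word into fixed-length slices.
    return [word[i:i + SLICE_LEN] for i in range(0, len(word), SLICE_LEN)]

def slice_variants(piece):
    # Expanding within a single slice caps the blow-up at ~3**SLICE_LEN variants
    # per slice, instead of 3**len(word) for the whole word.
    return {"".join(p) for p in product(*(SUBS.get(c, (c,)) for c in piece))}

def matches(candidate, censored_word):
    cand, cens = slices(candidate.lower()), slices(censored_word)
    if len(cand) != len(cens):
        return False
    return all(c in slice_variants(s) for c, s in zip(cand, cens))

print(matches("3e*e", "eeee"))  # True
print(matches("3e*x", "eeee"))  # False
```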

My code still needs to be polished. When done, I plan to run some profiling tests to compare speeds with version 0.6.

@snguyenthanh
Owner

Thank you so much for the hard work. I'm looking forward to your result :)

@jcbrockschmidt
Collaborator Author

jcbrockschmidt commented Oct 2, 2020

My solution is currently about 5 times slower. I'm going to attempt some optimizations but I may need to try something different.

@jcbrockschmidt
Collaborator Author

Alright guess I fixed it, haha! Only took a few lines (checking the lengths of strings before comparing their slices). I'm going to try to add multi-threading now.

@jcbrockschmidt
Collaborator Author

jcbrockschmidt commented Oct 3, 2020

Did some profiling. Ran the unit tests for all the versions of Python we support. Tests were run 30 times each and their times averaged. Here are the results:

----==== OLD METHOD ====----
Average times:
  python3.4:	2.3438s
  python3.5:	2.5406s
  python3.6:	2.0395s
  python3.7:	2.1116s
  python3.8:	2.0867s
  pypy3:	2.9554s

----==== NEW METHOD ====----
Average times:
  python3.4:	2.3545s
  python3.5:	2.4687s
  python3.6:	2.0596s
  python3.7:	2.1166s
  python3.8:	1.9218s
  pypy3:	1.2474s

So my method is about the same on most versions of Python, but actually sees speed-ups on Python 3.8 and PyPy3. I think I can do a little tweaking on the variant/slicing threshold to speed things up a bit more.

@jcbrockschmidt
Collaborator Author

Turns out if we just do character-by-character comparisons, we can speed things up roughly 10x. I ran the same profiling tests with this method, this time using only 5 trials (since the difference in speed is so drastic, we don't need a lot of accuracy):

----==== OLD METHOD ====----
Average times:
  python3.4:	2.5894s
  python3.5:	2.8381s
  python3.6:	2.2764s
  python3.7:	2.3851s
  python3.8:	2.3346s
  pypy3:	3.3415s

----==== CHAR-BY-CHAR METHOD ====----
Average times:
  python3.4:	0.2646s
  python3.5:	0.2848s
  python3.6:	0.2064s
  python3.7:	0.2097s
  python3.8:	0.2122s
  pypy3:	0.3304s

On top of this, the memory consumption is lowered. Here's the process memory consumption for the old method vs the character-by-character method:

| Method | Default word list (MB) | YouTube demonetized words (MB) | Google 10000 (MB) |
| --- | --- | --- | --- |
| Old | 54.3 | ... | 1555.1 |
| Char-by-char | 14.5 | 15.0 | 19.7 |
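
For context, the character-by-character idea boils down to something like this (a simplified sketch with placeholder names, not the actual code in the PR):

```python
# Simplified sketch of character-by-character matching: no variant list is
# ever generated; each character is checked against its allowed substitutes.
SUBS = {"e": {"e", "*", "3"}}  # placeholder substitution map

def char_by_char_match(censored_word, candidate):
    if len(censored_word) != len(candidate):
        return False
    return all(
        cand in SUBS.get(orig, {orig})
        for orig, cand in zip(censored_word, candidate.lower())
    )

print(char_by_char_match("eeee", "3e*e"))  # True
print(char_by_char_match("eeee", "3e*x"))  # False
```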

@jcbrockschmidt
Collaborator Author

Issue fixed with the merge of #17 in version 0.7.0.
