Excessive memory consumption for large enough words #15

Closed
jcbrockschmidt opened this issue Sep 29, 2020 · 8 comments

@jcbrockschmidt
Collaborator

jcbrockschmidt commented Sep 29, 2020

When a word or phrase is sufficiently long, the method Profanity.load_censor_words_from_file will consume all of a system's memory.

The cause appears to be the use of product in Profanity._generate_patterns_from_word. As the number of substitutable characters in a word grows linearly, the number of possible patterns grows exponentially. In other words, the memory footprint is O(n^x), where n is the approximate number of substitutes per character and x is the number of substitutable characters. At some point x becomes so large that the entire system's memory is consumed.
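
To illustrate the blow-up, here's a rough sketch (not the library's actual code; the substitution map is just a stand-in for better_profanity's character mapping):

```python
from itertools import product

# Stand-in substitution map; the real mapping in better_profanity is larger.
SUBS = {"e": ("e", "*", "3")}

def num_patterns(word):
    # Each substitutable character multiplies the pattern count, so a word with
    # x substitutable characters and ~n substitutes each yields ~n**x patterns.
    total = 1
    for ch in word:
        total *= len(SUBS.get(ch, (ch,)))
    return total

def all_patterns(word):
    # This is the kind of expansion that blows up: it materializes every variant.
    return ["".join(p) for p in product(*(SUBS.get(ch, (ch,)) for ch in word))]

print(num_patterns("e" * 13))      # 1594323
print(num_patterns("e" * 16))      # 43046721
print(len(all_patterns("e" * 8)))  # 6561 -- already 3**8 strings in memory
```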

You can use these words in a text file to pinpoint where this threshold is:

eeeeeeeeeeeee
eeeeeeeeeeeeee
eeeeeeeeeeeeeee
eeeeeeeeeeeeeeee

The second-to-last word of 15 characters will take a relatively long time to process. The final word of 16 characters will then consume all of a system's available memory (up to 16 GB). Here's a table of how much memory python3.6 takes up as a process given the length of our test words:

| e's | len(product(...)) | Memory (MB) |
| --- | --- | --- |
| 13 | 1594323 | 216 |
| 14 | 4782969 | 625 |
| 15 | 14348907 | 1411 |
| 16 | 43046721 | ... |

(len(product(...)) is calculated as 3**x since "e" can be substituted with "e", "*", or "3")

For context, python3.6 takes up only about 60 MB when the default 320-word wordlist is loaded.

Either the underlying data structure needs to be changed, or words for which the predicted output of product exceeds a certain threshold need to be ignored.
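
One possible shape for the second option, purely as a sketch (the threshold name and value below are made up, not taken from the PR):

```python
# Illustrative failsafe: estimate how many patterns product() would yield
# before expanding, and skip words over a cap. The cap value is arbitrary here.
MAX_PATTERNS = 1000000

def safe_to_expand(word, subs):
    total = 1
    for ch in word:
        total *= len(subs.get(ch, (ch,)))
        if total > MAX_PATTERNS:
            return False
    return True

print(safe_to_expand("e" * 12, {"e": ("e", "*", "3")}))  # True  (3**12 = 531441)
print(safe_to_expand("e" * 16, {"e": ("e", "*", "3")}))  # False (3**16 = 43046721)
```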

@snguyenthanh
Owner

Thanks @jcbrockschmidt for the suggestion and the PR. Having a failsafe for memory consumption is a very good idea.

Memory usage has been an issue for a while, and I haven't really tackled it. Another idea I have in mind is to have a validator to check whether the word could be ignored. For example, if the number of vowels in the word exceeds the MAX_NUM_VOWELS of the wordset, it can be ignored.
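
Roughly what I have in mind, as a sketch (the names below are illustrative, not existing constants in the library):

```python
# Sketch of the validator idea: skip words whose vowel count exceeds a limit
# derived from the wordset. Both names below are illustrative only.
VOWELS = set("aeiou")
MAX_NUM_VOWELS = 10

def can_be_loaded(word):
    return sum(ch in VOWELS for ch in word.lower()) <= MAX_NUM_VOWELS
```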

I will experiment more on your PR and some other approaches and get back to you soon, hopefully by end of next week.

@jcbrockschmidt
Collaborator Author

jcbrockschmidt commented Oct 1, 2020

I may have figured out a way to reduce the memory footprint. My method essentially splits censored words into slices, computes all variants for each slice, and matches strings by comparing their slices to each censored word's slices. So far I've been able to reduce the memory footprint to 15 MB with the default wordlist, and was able to load a previously unloadable word list with only 60 MB used.
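
Here's a rough sketch of the general idea (not the actual code; the slice length and substitution map are placeholders):

```python
from itertools import product

SUBS = {"e": ("e", "*", "3")}  # placeholder substitution map
SLICE_LEN = 4                  # placeholder slice length

def slices(word):
    # Split a word into fixed-length slices.
    return [word[i:i + SLICE_LEN] for i in range(0, len(word), SLICE_LEN)]

def slice_variants(piece):
    # Expanding within a single slice caps the blow-up at ~3**SLICE_LEN variants
    # per slice, instead of 3**len(word) for the whole word.
    return {"".join(p) for p in product(*(SUBS.get(c, (c,)) for c in piece))}

def matches(candidate, censored_word):
    cand, cens = slices(candidate.lower()), slices(censored_word)
    if len(cand) != len(cens):
        return False
    return all(c in slice_variants(s) for c, s in zip(cand, cens))

print(matches("3e*e", "eeee"))  # True
print(matches("3e*x", "eeee"))  # False
```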

My code still needs to be polished. When done, I plan to run some profiling tests to compare speeds with version 0.6.

@snguyenthanh
Owner

Thank you so much for the hard work. I'm looking forward to your result :)

@jcbrockschmidt
Collaborator Author

jcbrockschmidt commented Oct 2, 2020

My solution is currently about 5 times slower. I'm going to attempt some optimizations but I may need to try something different.

@jcbrockschmidt
Collaborator Author

Alright guess I fixed it, haha! Only took a few lines (checking the lengths of strings before comparing their slices). I'm going to try to add multi-threading now.

@jcbrockschmidt
Collaborator Author

jcbrockschmidt commented Oct 3, 2020

Did some profiling. Ran the unit tests for all the versions of Python we support. Tests were run 30 times each and their times averaged. Here are the results:

----==== OLD METHOD ====----
Average times:
  python3.4:	2.3438s
  python3.5:	2.5406s
  python3.6:	2.0395s
  python3.7:	2.1116s
  python3.8:	2.0867s
  pypy3:	2.9554s

----==== NEW METHOD ====----
Average times:
  python3.4:	2.3545s
  python3.5:	2.4687s
  python3.6:	2.0596s
  python3.7:	2.1166s
  python3.8:	1.9218s
  pypy3:	1.2474s

So my method is about the same on most versions of Python, but actually sees speed-ups on Python 3.8 and PyPy3. I think I can do a little tweaking on the variant/slicing threshold to speed things up a bit more.

@jcbrockschmidt
Collaborator Author

Turns out if we just do character-by-character comparisons, we can speed things up roughly 10x. I ran the same profiling tests with this method, this time using only 5 trials (since the difference in speed is so drastic, we don't need a lot of accuracy):

----==== OLD METHOD ====----
Average times:
  python3.4:	2.5894s
  python3.5:	2.8381s
  python3.6:	2.2764s
  python3.7:	2.3851s
  python3.8:	2.3346s
  pypy3:	3.3415s

----==== CHAR-BY-CHAR METHOD ====----
Average times:
  python3.4:	0.2646s
  python3.5:	0.2848s
  python3.6:	0.2064s
  python3.7:	0.2097s
  python3.8:	0.2122s
  pypy3:	0.3304s

On top of this, the memory consumption is lowered. Here's the process memory consumption for the old method vs the character-by-character method:

| Method | Default word list (MB) | YouTube demonetized words (MB) | Google 10000 (MB) |
| --- | --- | --- | --- |
| Old | 54.3 | ... | 1555.1 |
| Char-by-char | 14.5 | 15.0 | 19.7 |
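
For context, the character-by-character idea boils down to something like this (a simplified sketch with placeholder names, not the actual code in the PR):

```python
# Simplified sketch of character-by-character matching: no variant list is
# ever generated; each character is checked against its allowed substitutes.
SUBS = {"e": {"e", "*", "3"}}  # placeholder substitution map

def char_by_char_match(censored_word, candidate):
    if len(censored_word) != len(candidate):
        return False
    return all(
        cand in SUBS.get(orig, {orig})
        for orig, cand in zip(censored_word, candidate.lower())
    )

print(char_by_char_match("eeee", "3e*e"))  # True
print(char_by_char_match("eeee", "3e*x"))  # False
```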

@jcbrockschmidt
Collaborator Author

Issue fixed with the merge of #17 in version 0.7.0.
