Use character-by-character string comparison #17

jcbrockschmidt · 2020-10-03T11:35:36Z

Comparing strings character by character, rather than as fully constructed strings, is more efficient in both computation and memory. Here is a run-time comparison for the unit tests (in seconds):

Method	Python 3.4	Python 3.5	Python 3.6	Python 3.7	Python 3.8	PyPy3
Original	2.5894	2.8381	2.2764	2.3851	2.3346	3.3415
Char-by-char	0.2646	0.2848	0.2064	0.2097	0.2122	0.3304

And here is the memory consumption comparison for several word sets (in MB):

Method	default word list	YouTube demonetized words	Google 10000
Original	54.3	...	1555.1
Char-by-char	14.5	15.0	19.7
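
To make the idea concrete, here is a minimal sketch of character-by-character matching. It is not the code in this PR: `matches_variant` and `CHAR_MAPPINGS` are hypothetical names, the mapping is a made-up subset, and the sketch assumes every substitute is a single character.

```python
# Hypothetical substitution table (illustrative subset, not the library's).
CHAR_MAPPINGS = {
    "a": ("a", "@", "4"),
    "o": ("o", "0"),
    "s": ("s", "$", "5"),
}


def matches_variant(candidate, word, char_mappings=CHAR_MAPPINGS):
    """Return True if `candidate` is `word` with zero or more substitutions.

    Each position is checked on its own, so no variant string is ever built.
    Assumes single-character substitutes, so lengths must match.
    """
    if len(candidate) != len(word):
        return False
    for got, base in zip(candidate, word):
        allowed = char_mappings.get(base, (base,))
        if got != base and got not in allowed:
            return False
    return True
```

The check does work proportional to the word length and its memory use does not grow with the number of possible variants.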

These changes also include an extra parameter for the Profanity constructor that allows for the optional specification of a word list, either as a file path or an iterable. This comes with respective unit tests.
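
As a rough illustration of accepting either a file path or an iterable, a loader could look like the sketch below. `read_wordlist` is a hypothetical helper, not necessarily the function added in this PR, and the one-word-per-line file format is an assumption.

```python
def read_wordlist(words):
    """Return a list of words from a file path or from any iterable of words.

    Hypothetical helper: a `str` argument is treated as a path to a text
    file with one word per line; anything else is iterated directly.
    Blank entries are skipped and surrounding whitespace is stripped.
    """
    if isinstance(words, str):
        with open(words, encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]
    return [str(w).strip() for w in words if str(w).strip()]
```

Treating a bare `str` as a path rather than as an iterable of characters is a design choice; a caller who wants a single custom word would pass `["word"]`.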

This merge request would close issue #15.

@snguyenthanh

snguyenthanh · 2020-10-04T04:14:53Z

better_profanity/varying_string.py

+        Args:
+            string (str): String to generate variants of.
+            char_mappings (dict): Maps characters to substitute characters.
+            variant_thres (int): Maximum number of variants to store in a batch.


It seems variant_thres is not used? And do we need it when we have the MAX_PATTERNS constant in #16?

The PR looks good to me. I will merge after variant_thres is added. What do you think about renaming it to max_variants?

jcbrockschmidt · 2020-10-04

Oh, my bad, variant_thres isn't used anymore. It was used for a less efficient method I experimented with, and I forgot to remove it.

MAX_PATTERNS won't be used anymore. The character-by-character string comparison method doesn't actually construct any of the variants as full strings, so it's no longer relevant.
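
For context on why a cap was needed when variants were built as full strings, here is a small sketch of how the variant count grows. The names `count_variants` and `expand_variants` and the mapping are hypothetical and only for illustration.

```python
from itertools import product

# Hypothetical substitution table (illustrative subset only).
CHAR_MAPPINGS = {
    "a": ("a", "@", "4"),
    "o": ("o", "0"),
    "s": ("s", "$", "5"),
}


def count_variants(word, char_mappings=CHAR_MAPPINGS):
    """Number of variants: the product of the choices at each position."""
    total = 1
    for ch in word:
        total *= len(char_mappings.get(ch, (ch,)))
    return total


def expand_variants(word, char_mappings=CHAR_MAPPINGS):
    """Build every variant as a full string (the memory-hungry approach)."""
    choices = [char_mappings.get(ch, (ch,)) for ch in word]
    return {"".join(combo) for combo in product(*choices)}
```

Because the count is a product over positions, each additional substitutable character multiplies the number of stored strings, while a per-character check never stores them.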

jcbrockschmidt · 2020-10-11

It's removed now. Should be ready to pull.

snguyenthanh

Looks good!

snguyenthanh · 2020-10-11T12:34:28Z

I will make the corresponding changes to the documentation and publish the 0.7.0 release in a few days.

snguyenthanh · 2020-11-02T10:56:03Z

@jcbrockschmidt Apologies for the delay. I have published version 0.7.0.

jcbrockschmidt · 2020-11-03T06:31:20Z

@snguyenthanh No worries. Thanks!

jcbrockschmidt added 10 commits September 29, 2020 04:27

Add limiter for exponential memory consumption runoff

88bd709

Add memory consumption limitation to README

6b9235b

Fix compatibility with Python 3.4 and 3.5

4290e15

Move MAX_PATTERNS to constants.py

de419d3

Add logging for reporting warnings

7d1010e

Suggestion by @snguyenthanh

Split words into slices for more memory efficient variant checking

cb2d772

Use character-by-character string comparison

a2df396

Speeds up profanity detection by roughly 10x.

Add constructor option for custom word list

55b48e5

Accepts file names and iterables.

Remove README limitation for memory consumption

9382fd7

Remove unused code for VaryingString

5feae98

jcbrockschmidt mentioned this pull request Oct 4, 2020

Memory runoff limiter #16

Closed

snguyenthanh reviewed Oct 4, 2020

Remove unused variant_thres parameter for VaryingString

93cf88f

snguyenthanh approved these changes Oct 11, 2020

snguyenthanh merged commit 890c391 into snguyenthanh:master Oct 11, 2020

jcbrockschmidt mentioned this pull request Nov 3, 2020

Excessive memory consumption for large enough words #15

Closed

jcbrockschmidt deleted the memory-efficient branch November 3, 2020 06:30

jcbrockschmidt mentioned this pull request Nov 20, 2020

Version 0.7.0 significantly slower than 0.6.1 #19

Open

snguyenthanh mentioned this pull request May 19, 2021

Get all possible leet words from my txt file and store it as list variable. #32

Open
