Use character-by-character string comparison #17

Merged: 11 commits, Oct 11, 2020

jcbrockschmidt
Collaborator

Comparing strings character-by-character instead of as full strings is more efficient in both computation and memory. Here is a run-time comparison for the unit tests (in seconds):

| Version | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | PyPy3 |
|---|---|---|---|---|---|---|
| Original | 2.5894 | 2.8381 | 2.2764 | 2.3851 | 2.3346 | 3.3415 |
| Char-by-char | 0.2646 | 0.2848 | 0.2064 | 0.2097 | 0.2122 | 0.3304 |

And here is the memory consumption comparison for several word sets (in MB):

| Method | default word list | YouTube demonetized words | Google 10000 |
|---|---|---|---|
| Original | 54.3 | ... | 1555.1 |
| Char-by-char | 14.5 | 15.0 | 19.7 |
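
As a rough illustration of why storing every variant as a full string gets expensive: the number of variants of a word is the product of the number of substitutes available for each of its characters. The `CHAR_MAPPINGS` table and the `count_variants` helper below are made-up examples for this sketch, not the library's actual mapping or code.

```python
# Hypothetical substitution table (an assumption for illustration only).
CHAR_MAPPINGS = {
    "a": ("a", "@", "4"),
    "e": ("e", "3"),
    "i": ("i", "1", "!"),
    "o": ("o", "0"),
    "s": ("s", "$", "5"),
}


def count_variants(word, char_mappings=CHAR_MAPPINGS):
    """Count the full strings needed to store every variant of `word`."""
    total = 1
    for ch in word:
        # Characters without an entry only map to themselves.
        total *= len(char_mappings.get(ch, (ch,)))
    return total


if __name__ == "__main__":
    for w in ("see", "assassin"):
        print(w, count_variants(w))
```

With this toy table, a word whose characters each have three substitutes needs three times as many stored strings for every extra character, whereas a per-character check only needs the word and the table.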

These changes also add an optional parameter to the Profanity constructor for specifying a word list, either as a file path or as an iterable, along with corresponding unit tests.
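
A minimal sketch of how a constructor could accept either form. The class name `ProfanitySketch`, the parameter name `words`, and the one-word-per-line file format are assumptions made for this example; they are not taken from the PR's code.

```python
import os


class ProfanitySketch:
    """Hypothetical stand-in for illustration; not the library's real class."""

    def __init__(self, words=None):
        if words is None:
            # Assumption: a real implementation would load its default list here.
            self.words = set()
        elif isinstance(words, (str, os.PathLike)):
            # A string or path-like object is treated as a file path.
            self.words = self._read_words(words)
        else:
            # Anything else is treated as an iterable of words.
            self.words = {w.strip().lower() for w in words if w.strip()}

    @staticmethod
    def _read_words(path):
        # Assumed file format: one word per line, blank lines ignored.
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}
```

One design consequence of this kind of dispatch: a bare string is always read as a path, so a single custom word has to be passed as a one-element list.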

This merge request would close issue #15.

Args:
    string (str): String to generate variants of.
    char_mappings (dict): Maps characters to substitute characters.
    variant_thres (int): Maximum number of variants to store in a batch.

Owner

snguyenthanh commented Oct 4, 2020

It seems variant_thres is not used? And do we need it when we have the MAX_PATTERNS constant in #16?

Owner

The PR looks good to me. I will merge after variant_thres is added. What do you think about renaming it to max_variants?

Collaborator Author

jcbrockschmidt commented Oct 4, 2020

Oh my bad, variant_thres isn't used anymore. It was used for a less-efficient method I experimented with and I forgot to remove it.

MAX_PATTERNS won't be used anymore. The character-by-character string comparison method doesn't actually construct any of the variants into full strings, so it's not relevant anymore.
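
A minimal sketch of the idea, assuming every substitute is a single character: each character of the candidate is checked against the set of allowed substitutes for the corresponding character of the word, so no variant is ever built as a full string. The `CHAR_MAPPINGS` table and the `matches_variant` function are hypothetical names for this example, not the code merged in this PR.

```python
# Hypothetical substitution table (an assumption for illustration only).
CHAR_MAPPINGS = {
    "a": ("a", "@", "4"),
    "e": ("e", "3"),
    "i": ("i", "1", "!"),
    "o": ("o", "0"),
    "s": ("s", "$", "5"),
}


def matches_variant(candidate, word, char_mappings=CHAR_MAPPINGS):
    """Return True if `candidate` is some variant of `word`.

    The check walks both strings one character at a time and stops at the
    first mismatch, without generating or storing any variant strings.
    """
    candidate = candidate.lower()
    if len(candidate) != len(word):
        return False
    for got, expected in zip(candidate, word):
        # Characters without an entry only match themselves.
        if got not in char_mappings.get(expected, (expected,)):
            return False
    return True


if __name__ == "__main__":
    print(matches_variant("h3ll0", "hello"))
    print(matches_variant("help!", "hello"))
```

Memory use here depends only on the word list and the mapping table, which is consistent with a constant like MAX_PATTERNS no longer being needed.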

Collaborator Author

It's removed now. Should be ready to pull.

Owner

snguyenthanh left a comment

Looks good!

snguyenthanh merged commit 890c391 into snguyenthanh:master on Oct 11, 2020

@snguyenthanh
Owner

I will make the corresponding changes to the documentation and publish the 0.7.0 release in a few days.

@snguyenthanh
Owner

@jcbrockschmidt Apologies for the delay. I have published version 0.7.0.

@jcbrockschmidt
Collaborator Author

jcbrockschmidt commented Nov 3, 2020

@snguyenthanh No worries. Thanks!
