
profanity filtering doesn't work for combined words like "f*ckme" or "suckmyd*ck" #18

Open
ghost opened this issue Oct 21, 2020 · 5 comments · May be fixed by #49
Labels
discuss Discussion on the project's features / bugs

Comments

ghost commented Oct 21, 2020

No description provided.

jcbrockschmidt (Collaborator) commented Nov 25, 2020

Unfortunately, this is a well-known problem that extends beyond just better_profanity: the Scunthorpe problem. In short, it's very hard to ascertain when a word contained within a larger word should and shouldn't be censored.

Now, there are some improvements we could make to minimize this issue. Currently, if you include the phrase "suck my d*ck" in a word list, the phrase "suckmyd*ck" won't be censored. It should be fairly straightforward for us to also register whitespace-stripped variations of censored phrases in our censor, as sketched below.
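For concreteness, a minimal sketch of that idea in Python (the helper name is made up; the real version would live in the censor's wordlist loading):

```python
def with_collapsed_variants(phrases):
    """Yield each phrase plus, for multi-word phrases, a variant
    with all whitespace removed, so "suck my d*ck" would also
    register "suckmyd*ck"."""
    for phrase in phrases:
        yield phrase
        collapsed = "".join(phrase.split())
        if collapsed != phrase:
            yield collapsed

# Example:
# list(with_collapsed_variants(["suck my d*ck"]))
# -> ['suck my d*ck', 'suckmyd*ck']
```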

jcbrockschmidt (Collaborator)

Running through some math and seeing some potential memory problems. Suppose we want to include every whitespace variation of the phrase "suck my d*ck", such that "suckmy d*ck" as well as "suck myd*ck" are censored too. This would essentially require adding a new word for each variation. With only two whitespaces, this amounts to a total of 4 words. But the more whitespace a phrase has, the more words we need to add. Specifically, since each whitespace is independently either kept or dropped, the number of words needed grows as 2^w, where w is the number of whitespaces. This would make our memory consumption per word O(2^w * n), or O(2^(n/2) * n) since a phrase of length n contains fewer than n/2 whitespaces. Compared to the bound of O(n), this is less than ideal.
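To make the blow-up concrete, here's a sketch that enumerates every keep-or-drop combination of a phrase's spaces (the function name is mine):

```python
from itertools import product

def whitespace_variants(phrase):
    """Return all 2**w variants of a phrase with w spaces,
    where each space is independently kept or dropped."""
    words = phrase.split(" ")
    variants = []
    for seps in product((" ", ""), repeat=len(words) - 1):
        variants.append("".join(w + s for w, s in zip(words, seps)) + words[-1])
    return variants

# whitespace_variants("suck my d*ck") returns 2**2 = 4 variants:
# ['suck my d*ck', 'suck myd*ck', 'suckmy d*ck', 'suckmyd*ck']
```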

Now realistically, we could just disallow phrases with more than, say, 8 whitespaces (2^8 = 256 word variations). Personally, I haven't come across an applicable phrase with more than 5 whitespaces yet. This would keep memory consumption per word at O(n). Yet even with memory issues out of the way, I'm not positive this would avoid the Scunthorpe problem.

If we instead include only two variations, the one with and the one without whitespace (i.e. "suck my d*ck" and "suckmyd*ck"), we'd have no memory concerns to worry about, and users could use phrases as long as they'd like. However, again, I see no reason to believe this wouldn't have the potential to manifest the Scunthorpe problem.

So in my mind, there are only two solutions that allow us to avoid false positives:

  1. Require users to manually add their own variations of phrases, such that "suck my d*ck", "suck myd*ck", "suckmy d*ck", and "suckmyd*ck" would all need to be added to a wordlist.
  2. Add a separate, stricter mode that allows for false positives, and have users decide based on their use case (a hypothetical API shape is sketched below).
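For option 2, one possible shape (the `strict` flag is invented purely for illustration and is not part of better_profanity's API; `add_censor_words` and `censor` are the library's real methods):

```python
from better_profanity import profanity

# Add the multi-word phrase to the censor list:
profanity.add_censor_words(["suck my d*ck"])

print(profanity.censor("suck my d*ck"))  # censored today
print(profanity.censor("suckmyd*ck"))    # passes through today (this issue)

# Hypothetical stricter mode that also matches wordlist phrases
# with their whitespace removed, accepting more false positives:
# profanity.censor("suckmyd*ck", strict=True)
```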

ghost (Author) commented Nov 27, 2020

I see. Can't we use regex to remove all spaces from the text, put each whitespace-separated word into a list as an element, and check whether any element in that list matches a word in profanity_wordlist.txt?
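If I'm reading the suggestion right, a regex-free sketch of it might look like this (the function name and the simplified wordlist loading are mine):

```python
def contains_profanity(text, wordlist_path="profanity_wordlist.txt"):
    """Check the text's whitespace-separated words, and the text with
    all whitespace removed, against the wordlist. No regex involved."""
    with open(wordlist_path) as f:
        bad_words = {line.strip().lower() for line in f if line.strip()}
    words = text.lower().split()
    collapsed = "".join(words)
    return any(w in bad_words for w in words) or collapsed in bad_words
```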

snguyenthanh (Owner)

As mentioned in #14, regex is extremely slow, and its runtime increases exponentially with the length of the text.
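For what it's worth, stripping whitespace itself doesn't need regex; a quick sketch of the plain-string equivalent:

```python
import re

text = "some long user input " * 1000

# Regex-based whitespace removal:
collapsed_re = re.sub(r"\s+", "", text)

# Plain string operations give the same result without regex:
collapsed_str = "".join(text.split())

assert collapsed_re == collapsed_str
```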

jcbrockschmidt (Collaborator)

@MissJuliaRobot, it's possible we could do what you're thinking without regex. Could you elaborate on the method you're proposing? I'm not sure if I understand completely.

jcbrockschmidt added the discuss label on Dec 7, 2020