
profanity filtering doesn't work for combined words like "f*ckme" or "suckmyd*ck" #18

Open
ghost opened this issue Oct 21, 2020 · 5 comments · May be fixed by #49
Labels
discuss Discussion on the project's features / bugs

Comments

ghost commented Oct 21, 2020

No description provided.

jcbrockschmidt (Collaborator) commented Nov 25, 2020

Unfortunately, this is a well-known problem that extends beyond just better_profanity: the Scunthorpe problem. In short, it's very hard to ascertain when a word contained within a larger word should and shouldn't be censored.

Now, there are some improvements we could make to minimize this issue. Currently, if you include the phrase "suck my d*ck" in a word list, the phrase "suckmyd*ck" won't be censored. It should be fairly straightforward for us to also register whitespace-stripped variations of censored phrases in our censor, as sketched below.
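For concreteness, a minimal sketch of that idea in Python (the helper name is made up; the real version would live in the censor's wordlist loading):

```python
def with_collapsed_variants(phrases):
    """Yield each phrase plus, for multi-word phrases, a variant
    with all whitespace removed, so "suck my d*ck" would also
    register "suckmyd*ck"."""
    for phrase in phrases:
        yield phrase
        collapsed = "".join(phrase.split())
        if collapsed != phrase:
            yield collapsed

# Example:
# list(with_collapsed_variants(["suck my d*ck"]))
# -> ['suck my d*ck', 'suckmyd*ck']
```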

jcbrockschmidt (Collaborator)

Running through some math and seeing some potential memory problems. Suppose we want to include every whitespace variation of the phrase "suck my d*ck", such that "suckmy d*ck" as well as "suck myd*ck" are censored too. This would essentially require adding a new word for each variation. With only two whitespaces, this amounts to a total of 4 words. But the more whitespace a phrase has, the more words we need to add. Specifically, since each whitespace is independently either kept or dropped, the number of words needed grows as 2^w, where w is the number of whitespaces. This would make our memory consumption per word O(2^w * n), or O(2^(n/2) * n) since a phrase of length n contains fewer than n/2 whitespaces. Compared to the bound of O(n), this is less than ideal.
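To make the blow-up concrete, here's a sketch that enumerates every keep-or-drop combination of a phrase's spaces (the function name is mine):

```python
from itertools import product

def whitespace_variants(phrase):
    """Return all 2**w variants of a phrase with w spaces,
    where each space is independently kept or dropped."""
    words = phrase.split(" ")
    variants = []
    for seps in product((" ", ""), repeat=len(words) - 1):
        variants.append("".join(w + s for w, s in zip(words, seps)) + words[-1])
    return variants

# whitespace_variants("suck my d*ck") returns 2**2 = 4 variants:
# ['suck my d*ck', 'suck myd*ck', 'suckmy d*ck', 'suckmyd*ck']
```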

Now realistically, we could just disallow phrases with more than, say, 8 whitespaces (2^8 = 256 word variations). Personally, I haven't come across an applicable phrase with more than 5 whitespaces yet. This would keep memory consumption per word at O(n). Yet even with memory issues out of the way, I'm not positive this would avoid the Scunthorpe problem.

If we instead include only two variations, the one with and the one without whitespace (i.e. "suck my d*ck" and "suckmyd*ck"), we'd have no memory concerns to worry about, and users could use phrases as long as they'd like. However, again, I see no reason to believe this wouldn't have the potential to manifest the Scunthorpe problem.

So in my mind, there are only two solutions that allow us to avoid false positives:

  1. Require users to manually add their own variations of phrases, such that "suck my d*ck", "suck myd*ck", "suckmy d*ck", and "suckmyd*ck" would all need to be added to a wordlist.
  2. Add a separate, stricter mode that allows for false positives, and have users decide based on their use case (a hypothetical API shape is sketched below).
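For option 2, one possible shape (the `strict` flag is invented purely for illustration and is not part of better_profanity's API; `add_censor_words` and `censor` are the library's real methods):

```python
from better_profanity import profanity

# Add the multi-word phrase to the censor list:
profanity.add_censor_words(["suck my d*ck"])

print(profanity.censor("suck my d*ck"))  # censored today
print(profanity.censor("suckmyd*ck"))    # passes through today (this issue)

# Hypothetical stricter mode that also matches wordlist phrases
# with their whitespace removed, accepting more false positives:
# profanity.censor("suckmyd*ck", strict=True)
```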

ghost (Author) commented Nov 27, 2020

I see. Can't we use regex to remove all spaces from the text, put each whitespace-separated word into a list as an element, and check whether any element in that list matches a word in profanity_wordlist.txt?
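If I'm reading the suggestion right, a regex-free sketch of it might look like this (the function name and the simplified wordlist loading are mine):

```python
def contains_profanity(text, wordlist_path="profanity_wordlist.txt"):
    """Check the text's whitespace-separated words, and the text with
    all whitespace removed, against the wordlist. No regex involved."""
    with open(wordlist_path) as f:
        bad_words = {line.strip().lower() for line in f if line.strip()}
    words = text.lower().split()
    collapsed = "".join(words)
    return any(w in bad_words for w in words) or collapsed in bad_words
```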

snguyenthanh (Owner)

As mentioned in #14, regex is extremely slow, and its runtime increases exponentially with the length of the text.
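For what it's worth, stripping whitespace itself doesn't need regex; a quick sketch of the plain-string equivalent:

```python
import re

text = "some long user input " * 1000

# Regex-based whitespace removal:
collapsed_re = re.sub(r"\s+", "", text)

# Plain string operations give the same result without regex:
collapsed_str = "".join(text.split())

assert collapsed_re == collapsed_str
```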

jcbrockschmidt (Collaborator)

@MissJuliaRobot, it's possible we could do what you're thinking without regex. Could you elaborate on the method you're proposing? I'm not sure if I understand completely.

jcbrockschmidt added the discuss label on Dec 7, 2020