Add possibility to check unicode symbols #2

Derfirm · 2018-11-12T18:07:38Z

Added a label with Unicode letters and a test for it.
The Unicode characters are collected like this:

import sys
import unicodedata
from collections import defaultdict

# Group every code point by its Unicode general category (e.g. 'Ll', 'Lu', 'Mn').
unicode_category = defaultdict(list)
for c in map(chr, range(sys.maxunicode + 1)):
    unicode_category[unicodedata.category(c)].append(c)

From there, use that map to look up the characters for a given category:

alphabetic = unicode_category['Ll']

snguyenthanh · 2018-11-12T19:44:39Z

Can you please explain why only category Ll is used to map the characters?
Moreover, I have the following sentence: Эффекти́вного противоя́дия от я́да фу́гу не существу́ет до сих пор.
After I added the word противоя́дия to the wordlist, it still fails to be censored. Is this expected behavior?

Derfirm · 2018-11-12T20:37:40Z

Ll is "Letter, Lowercase", so yes, I deliberately left out special characters.
Are they really needed?

Derfirm · 2018-11-12T21:02:47Z

In your case, for example а (U+0430) + U+0301, a special character is used: a combining acute accent from the "Combining Diacritical Marks" block, which is always placed after the letter it modifies. Is it worth expanding the list of characters for this case?
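
To make the combining-mark case concrete, here is a minimal sketch that uses only the standard-library unicodedata module. The helpers categories and strip_marks are hypothetical names for illustration and are not part of this PR; strip_marks simply drops every code point whose general category starts with M (Mn, Mc, Me).

```python
import unicodedata


def categories(text):
    """Return the Unicode general category of each code point in text."""
    return [unicodedata.category(ch) for ch in text]


def strip_marks(text):
    """Drop combining marks (categories Mn, Mc, Me). Hypothetical helper."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("M")
    )


# The stress mark U+0301 is a separate code point that follows the letter я.
stressed = "противоя\u0301дия"

print(categories("\u0430\u0301"))               # ['Ll', 'Mn']
print(strip_marks(stressed) == "противоядия")   # True
```

No normalization is applied here, so precomposed letters such as й stay untouched; only standalone combining marks are removed.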

snguyenthanh · 2018-11-13T00:43:30Z

Maybe we should include category Lu as well.
I think the problem is not the list of characters, but the way the words are merged together.

For example, blue tree is treated as one word with a space as the separator. In the same way, противоя́дия should be treated as one word with ́ as the separator.

Derfirm · 2018-11-13T07:00:06Z

' (U+0027) «APOSTROPHE» is not included in Mn (Mark, Nonspacing) or Mc (Mark, Spacing Combining).
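
The category of a single character can be checked directly with unicodedata. This is only a small sketch; describe is a hypothetical helper name, not something from this PR.

```python
import unicodedata


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


print(describe("'"))        # ('U+0027', 'APOSTROPHE', 'Po')
print(describe("\u0301"))   # ('U+0301', 'COMBINING ACUTE ACCENT', 'Mn')
```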

snguyenthanh · 2018-11-13T14:36:02Z

I just pushed a commit to a new branch development, changing the way the words are merged together.

Currently, in the master branch, the word hand_job is treated as 1 word, hand_job. The newly pushed commit, in development, treats it as 2 words, hand and job. This allows a better and more stable comparison between the words.
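
As a rough illustration of the difference (not the actual code in the commit), splitting on every run of characters that are neither letters nor digits could look like this. The function name split_words and the regular expression are assumptions made for this sketch.

```python
import re


def split_words(text):
    """Split text on runs of non-alphanumeric characters, underscore included.

    Hypothetical helper for illustration; not the library's implementation.
    """
    return [part for part in re.split(r"[\W_]+", text) if part]


print(split_words("hand_job"))    # ['hand', 'job']
print(split_words("blue tree"))   # ['blue', 'tree']
```

With this kind of splitting, each part can be compared against the wordlist on its own or rejoined with its neighbours, regardless of which separator character was used.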

I tested and it works well for Unicode characters. I will merge this into branch development and add a few more test cases before merging into master.

snguyenthanh · 2018-11-13T14:51:36Z

May I ask how you generated the alphabetic_unicode.json file?
As far as I can see, the characters come from categories Mn, Lu, Po, Ll, Lo, Cn, Mc.

Derfirm · 2018-11-13T22:03:39Z

import json
import sys
import unicodedata
from collections import defaultdict

# Group every code point by its Unicode general category.
unicode_category = defaultdict(list)

for c in map(chr, range(sys.maxunicode + 1)):
    unicode_category[unicodedata.category(c)].append(c)

# Keep lowercase letters, uppercase letters, and combining marks.
with open("alphabetic_unicode.json", "w") as js:
    js.write(json.dumps(unicode_category['Ll'] + unicode_category['Lu'] + unicode_category['Mc'] + unicode_category['Mn']))
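
To see which categories a list of characters actually contains, a small check like the following may help. It is only a sketch; category_counts is a hypothetical helper and the sample list is made up for the example.

```python
import unicodedata
from collections import Counter


def category_counts(chars):
    """Count how many characters fall into each Unicode general category."""
    return Counter(unicodedata.category(ch) for ch in chars)


sample = ["a", "B", "\u0301", "'"]
for category, count in sorted(category_counts(sample).items()):
    print(category, count)
```

The same function could be applied to the list obtained from json.load on alphabetic_unicode.json, assuming the file holds a flat list of single characters as written by the script above.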

snguyenthanh · 2018-11-14T09:19:16Z

Thanks.
I created a beta release for Unicode support, which can be installed by:

$ pip install better-profanity==0.3b0

Please let me know if it has any issues.

Add possibility to check unicode symbols

541bf78

Derfirm force-pushed the add-unicode-support branch from d20361d to 541bf78 November 12, 2018 21:21

snguyenthanh changed the base branch from master to development November 13, 2018 14:36

snguyenthanh added 2 commits November 13, 2018 22:45

Merge branch 'development' into add-unicode-support

e7b50c1

Temporarily disable 2 tests for unicode

dfb0784

snguyenthanh merged commit b2ca498 into snguyenthanh:development Nov 13, 2018

snguyenthanh mentioned this pull request Nov 14, 2018

Release 0.3-beta.0 #3

Merged

Derfirm deleted the add-unicode-support branch November 14, 2018 12:00
