
Add possibility to check unicode symbols #2

Merged
3 commits merged on Nov 13, 2018


@Derfirm Derfirm commented Nov 12, 2018

Added a label with Unicode letters and a test for it.
The Unicode characters are grouped by category like this:

import sys
import unicodedata
from collections import defaultdict

unicode_category = defaultdict(list)
for c in map(chr, range(sys.maxunicode + 1)):
    unicode_category[unicodedata.category(c)].append(c)

From there, use that map to look up the list of characters for a given category:

alphabetic = unicode_category['Ll']

@snguyenthanh
Owner

snguyenthanh commented Nov 12, 2018

  1. Can you please explain why only category Ll is used to map the characters?

  2. Moreover, I have the following sentence: Эффекти́вного противоя́дия от я́да фу́гу не существу́ет до сих пор.
    After I added the word противоя́дия to the wordlist, it still fails to be censored. Is this expected behavior?

@Derfirm
Author

Derfirm commented Nov 12, 2018

Ll is "Letter, Lowercase", so yes, I deliberately left out special characters.
Are they really needed?

@Derfirm
Author

Derfirm commented Nov 12, 2018

In your case, for example, the sequence (U+0430) + (U+0301) uses a special character from the "Combining Diacritical Marks" block, which always comes after the letter it modifies. Is it worth expanding the list of characters for this case?
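A small sketch of what this looks like in practice, using only the standard `unicodedata` module. The sample string (the stressed я́ from the sentence above) is chosen for illustration: it is two code points, a lowercase letter followed by a combining mark.

```python
import unicodedata

# "я" followed by U+0301 COMBINING ACUTE ACCENT, as in "противоя́дия".
stressed = "\u044f\u0301"

# Collect (code point, category, name) for each character in the string.
info = [
    (f"U+{ord(c):04X}", unicodedata.category(c), unicodedata.name(c))
    for c in stressed
]

for row in info:
    print(row)
# ('U+044F', 'Ll', 'CYRILLIC SMALL LETTER YA')
# ('U+0301', 'Mn', 'COMBINING ACUTE ACCENT')
```

Because the accent is in category Mn rather than Ll, a list built only from Ll does not contain it.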

@snguyenthanh
Owner

Maybe we should include category Lu as well.
I think the problem is not the list of characters, but the way the words merge together.

For example, blue tree is a word with an empty space as the separator. It should treat противоя́дия as 1 word with ́ as the separator as well.

@Derfirm
Author

Derfirm commented Nov 13, 2018

' (U+0027) «APOSTROPHE» is not included in Mn (Mark, Nonspacing) or Mc (Mark, Spacing Combining).
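This can be checked directly with the standard `unicodedata` module. The snippet below is only a quick verification sketch; the variable names are illustrative.

```python
import unicodedata

apostrophe = "\u0027"

# Look up the general category of the apostrophe.
category = unicodedata.category(apostrophe)
print(unicodedata.name(apostrophe), category)  # APOSTROPHE Po

# It is punctuation, so it falls outside both mark categories.
in_mark_categories = category in {"Mn", "Mc"}
print(in_mark_categories)  # False
```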

@snguyenthanh
Owner

snguyenthanh commented Nov 13, 2018

I just pushed a commit to a new branch development, changing the way the words are merged together.

Currently, in the master branch, the word hand_job is treated as 1 word, hand_job. The newly pushed commit, in development, treats it as 2 words, hand and job. This allows a better and more stable comparison between the words.

I tested and it works well for Unicode characters. I will merge this into branch development and add a few more test cases before merging into master.
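As a rough illustration of the splitting behavior described above (not the library's actual implementation), the sketch below breaks text into runs of allowed characters. The helper names `is_word_char` and `split_words`, and the choice of Ll and Lu as the only word categories, are assumptions made for this example.

```python
import unicodedata
from itertools import groupby

# Assumption: only lowercase and uppercase letters count as word characters.
WORD_CATEGORIES = {"Ll", "Lu"}


def is_word_char(c):
    """Return True if the character belongs to an allowed category."""
    return unicodedata.category(c) in WORD_CATEGORIES


def split_words(text):
    """Hypothetical helper: split text into runs of word characters,
    treating every other character as a separator."""
    return [
        "".join(group)
        for is_word, group in groupby(text, key=is_word_char)
        if is_word
    ]


print(split_words("hand_job"))           # ['hand', 'job']
print(split_words("противоя\u0301дия"))  # ['противоя', 'дия']
```

Under these assumptions the underscore and the combining accent both act as separators, so a wordlist entry and the input text are split in the same way before they are compared.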

@snguyenthanh snguyenthanh changed the base branch from master to development November 13, 2018 14:36
@snguyenthanh
Owner

snguyenthanh commented Nov 13, 2018

May I ask how you generated the alphabetic_unicode.json file?
As far as I can see, the characters are from categories Mn, Lu, Po, Ll, Lo, Cn, Mc.
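One way to inspect which categories a character list contains is to count them with `unicodedata.category`. This is a sketch only: the helper name `category_counts` is hypothetical, and the sample list stands in for the real file contents.

```python
import unicodedata
from collections import Counter


def category_counts(chars):
    """Hypothetical helper: count characters per Unicode general category."""
    return Counter(unicodedata.category(c) for c in chars)


# Illustrative sample, not the contents of alphabetic_unicode.json.
sample = ["a", "Я", "\u0301", "'"]
counts = category_counts(sample)
print(sorted(counts))  # ['Ll', 'Lu', 'Mn', 'Po']
```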

@snguyenthanh snguyenthanh merged commit b2ca498 into snguyenthanh:development Nov 13, 2018
@Derfirm
Author

Derfirm commented Nov 13, 2018

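import json
import sys
import unicodedata
from collections import defaultdict

# Group every Unicode code point by its general category.
unicode_category = defaultdict(list)

for c in map(chr, range(sys.maxunicode + 1)):
    unicode_category[unicodedata.category(c)].append(c)

# Write lowercase letters, uppercase letters, and combining marks to the file.
with open("alphabetic_unicode.json", "w") as js:
    js.write(json.dumps(unicode_category['Ll'] + unicode_category['Lu'] + unicode_category['Mc'] + unicode_category['Mn']))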
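A file in this format (a JSON list of single characters) could be read back into a set for fast membership checks. The sketch below is an assumption about how such a file might be consumed, not the library's code: `load_allowed_chars` is a hypothetical helper, and a small temporary file stands in for alphabetic_unicode.json so that the example runs on its own.

```python
import json
import tempfile
from pathlib import Path


def load_allowed_chars(path):
    """Hypothetical helper: read a JSON list of characters into a set."""
    with open(path, encoding="utf-8") as fh:
        return set(json.load(fh))


# Stand-in for alphabetic_unicode.json, so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample_chars.json"
    sample_path.write_text(json.dumps(["a", "я", "\u0301"]), encoding="utf-8")
    allowed = load_allowed_chars(sample_path)

print("я" in allowed, "_" in allowed)  # True False
```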

@snguyenthanh snguyenthanh mentioned this pull request Nov 14, 2018
@snguyenthanh
Owner

Thanks.
I created a beta release for Unicode support, which can be installed by:

$ pip install better-profanity==0.3b0

Please let me know if it has any issues.

@Derfirm Derfirm deleted the add-unicode-support branch November 14, 2018 12:00