Some Latin characters cause to_ascii to return an empty result. #4

DefiDebauchery opened this issue Sep 5, 2022 · 0 comments

It's my understanding that STRATEGY_IGNORE should "add characters to result", which to me sounds like it should retain the character in the output if it isn't matched.

However, I cannot seem to retain my complete original input:

import homoglyphs_fork as hgf

hg = hgf.Homoglyphs(strategy=hgf.STRATEGY_IGNORE)

# 'ß' is part of the LATIN alphabet
'ß' in hgf.Categories.get_alphabet(['LATIN'])
# True

# yet to_ascii drops it and returns an empty list
hg.to_ascii('ß')
# []

This is an issue because there are characters that, while not true homoglyphs, can still be used as such. Consider the German eszett, ß, which is a common stand-in for 'B' online.

Because of this, I'm unable to properly detect (as an example) the string 'Сaptchaß𝗈t' -- Cyrillic ES (homoglyph of Latin C), German eszett (leet-speak for Latin B), and mathematical o (normalized to Latin o). The best I've been able to achieve is 'Captchaot', with strategy=STRATEGY_LOAD and ascii_strategy=STRATEGY_REMOVE.
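To make the three odd characters in that string easier to see, here is a small standard-library sketch that prints each character's code point and Unicode name. It does not use homoglyphs_fork at all. The escaped string is my assumption about the exact code points involved (U+0421, U+00DF, U+1D5C8), and `describe` is just a helper name made up for this example.

```python
import unicodedata

# Assumed code points for the example string: Cyrillic ES, eszett, mathematical sans-serif o.
SAMPLE = "\u0421aptcha\u00df\U0001d5c8t"

def describe(text):
    """Return (char, code point, Unicode name) for every non-ASCII character."""
    rows = []
    for ch in text:
        if ord(ch) > 0x7F:
            rows.append((ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return rows

for ch, code_point, name in describe(SAMPLE):
    print(ch, code_point, name)
```

Only the first and last of the three are visually confusable with ASCII; the eszett is a substitution by convention, which is why a homoglyph table alone would not be expected to cover it.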

Is there a way to have homoglyphs simply pass through any character that isn't matched?
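For reference, this is roughly the pass-through behaviour I'm after, sketched with the standard library only. It is not how homoglyphs_fork works internally and it is not a replacement for it: `EXTRA_LOOKALIKES` and `fold_lookalikes` are hypothetical names, the mapping table is hand-written for this one example, and the escaped code points are my assumption about the characters in the string above.

```python
import unicodedata

# Hypothetical hand-maintained table; not part of homoglyphs_fork.
EXTRA_LOOKALIKES = {
    "\u0421": "C",  # Cyrillic capital ES, visually identical to Latin C
    "\u00df": "B",  # German eszett, used online as a stand-in for B
}

def fold_lookalikes(text, extra=EXTRA_LOOKALIKES):
    """Map known look-alikes, NFKC-normalize the rest, and keep anything unmatched."""
    out = []
    for ch in text:
        if ch in extra:
            out.append(extra[ch])
        else:
            # NFKC folds compatibility characters (e.g. mathematical letters)
            # to their base form and leaves other characters unchanged.
            out.append(unicodedata.normalize("NFKC", ch))
    return "".join(out)

print(fold_lookalikes("\u0421aptcha\u00df\U0001d5c8t"))
```

The key point is the `else` branch: a character with no mapping is still appended to the output instead of being dropped or causing an empty result.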
