Some Latin characters cause to_ascii to return an empty result. #4

DefiDebauchery · 2022-09-05T20:44:37Z

It's my understanding that STRATEGY_IGNORE should "add characters to result", which to me sounds like it should retain the character in the output if it isn't matched.

However, I cannot seem to retain my complete original input:

import homoglyphs_fork as hgf
hg = hgf.Homoglyphs(strategy=hgf.STRATEGY_IGNORE)

'ß' in hgf.Categories.get_alphabet(['LATIN'])
>>> True

hg.to_ascii('ß')
>>> []

This is an issue because there are characters that, while not true homoglyphs, can still be used as such. Consider the German eszett, ß, which is a common stand-in for 'B' online.

Because of this, I'm unable to properly detect (as an example) the string 'Сaptchaß𝗈t': Cyrillic ES (homoglyph of Latin C), German eszett (leet-speak for Latin B), and mathematical o (normalized to Latin o). The best I've been able to achieve is 'Captchaot', with strategy LOAD and ascii_strategy REMOVE.

Is there a way to have homoglyphs simply pass through any character that isn't matched?
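To illustrate the behavior I'm after, here is a minimal sketch that does not use homoglyphs at all. It relies only on the standard library's `unicodedata.normalize` plus a small hand-written table. The `LOOKALIKES` table and the `to_ascii_passthrough` name are hypothetical, made up for this example, and the table covers only the two characters from the string above. Anything that cannot be folded to ASCII is kept as-is rather than dropped.

```python
import unicodedata

# Hypothetical hand-written table (not part of homoglyphs): look-alikes that
# Unicode normalization alone does not fold to ASCII.
LOOKALIKES = {
    "\u0421": "C",  # CYRILLIC CAPITAL LETTER ES
    "\u00df": "B",  # LATIN SMALL LETTER SHARP S, used as a stand-in for B
}


def to_ascii_passthrough(text):
    """Fold known look-alikes to ASCII and keep every other character."""
    out = []
    for ch in text:
        if ch in LOOKALIKES:
            out.append(LOOKALIKES[ch])
            continue
        # NFKC folds compatibility forms such as mathematical letters.
        folded = unicodedata.normalize("NFKC", ch)
        # Keep the original character when folding does not yield ASCII.
        out.append(folded if folded.isascii() else ch)
    return "".join(out)


# The example string, written with escapes to make the code points explicit.
print(to_ascii_passthrough("\u0421aptcha\u00df\U0001d5c8t"))  # CaptchaBot
```

With this approach an unmatched character such as 'é' comes back unchanged instead of producing an empty result, which is what I expected STRATEGY_IGNORE to do.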
