
Specially crafted string on normalize function returns an abnormally long list #32

Closed
akari-dogman opened this issue Jan 10, 2022 · 1 comment

Comments

@akari-dogman

If you run normalize on this string:

abing🪀|C-01 |🍏inv100+ Lv16推赞1300

then it will return an EXTREMELY long list (3,981,312 entries).

It must be getting caught on something, because a 3+ million entry list from normalizing a single string is absurd.

To reproduce:

import confusables

foo = "abing🪀|C-01 |🍏inv100+ Lv16推赞1300"
x = confusables.normalize(foo)

print(len(x))
print(x)

ThioJoe commented Jan 10, 2022

I mean, when you print out the number of confusable characters for each character in the string, you get the following:

a : 115
b : 63
i : 155
n : 72
g : 57
🪀 : 1
| : 141
C : 65
- : 13
0 : 193
1 : 141
  : 18
🍏 : 1
v : 64
+ : 4
L : 147
6 : 17
推 : 1
赞 : 1
3 : 127
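
For reference, here's a quick sketch of how you could print those counts yourself with the library's confusable_characters function (the dedup over repeated characters is my own addition):

import confusables

foo = "abing🪀|C-01 |🍏inv100+ Lv16推赞1300"

seen = set()
for c in foo:
    if c not in seen:  # report each unique character once
        seen.add(c)
        print(c, ":", len(confusables.confusable_characters(c)))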

So if it is trying to come up with all the different combinations of confusables, yeah, it could easily get to 3 million depending on how it does it. Though the prioritize_alpha option doesn't seem to make a difference here, which it should.
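
Just to show the math: if every character could independently be swapped for any of its confusables, the number of combinations is the product of the per-character counts, which blows past 3 million almost immediately. This is hypothetical arithmetic on the counts above, not the library's actual algorithm:

from math import prod

# Per-character confusable counts from the table above
counts = [115, 63, 155, 72, 57, 1, 141, 65, 13, 193,
          141, 18, 1, 64, 4, 147, 17, 1, 1, 127]

# Even the first three characters alone allow over a million combinations:
print(prod(counts[:3]))  # 115 * 63 * 155 = 1,122,975

# The uncapped upper bound over the whole string is astronomically larger,
# so normalize() presumably caps the expansion somewhere to land at ~3.9M.
print(prod(counts))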

If you happen to be trying to find words in strings by normalizing them first and then searching, the much better way seems to be to just use the confusable_regex function instead.
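
Something like this, for example (a sketch, assuming confusable_regex(word) returns a regex pattern string you can compile and search with):

import re
import confusables

# Build a regex matching confusable renderings of the target word,
# then search the raw string directly instead of normalizing it.
pattern = re.compile(confusables.confusable_regex("inv"))
text = "abing🪀|C-01 |🍏inv100+ Lv16推赞1300"
match = pattern.search(text)
print(match.group(0) if match else "no match")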
