BUG when autocompleting unicode strings #8
Comments
Fixed as far as I can test (haven't run your test code though). Feel free to verify.
Fix confirmed! Thanks!
Going over this, it looks like working with 32-bit runes will not be too hard to do and will allow full fuzzy support in Unicode, but it will make memory consumption terrible. Is Unicode-aware fuzzy matching critical for you?
It's essential, yes. How much would the memory requirements increase?
The idea is to stop using a variable-length encoding like UTF-8 or UTF-16 and use a fixed-length encoding instead, probably 32 bits per letter. So pure Latin text, which UTF-8 represents with 1 byte per letter, would take 4 bytes per letter. Some of the memory consumption is of course pointers and metadata, so overall we are talking about x2-x3 the amount of RAM. For purely non-Latin text, which takes 2-3 bytes per letter in UTF-8, it would probably be x1.5-x2. If I manage to get away with 16 bits per letter, which won't cover all languages but will cover the most popular ones IIRC, it won't be so bad.
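To make the arithmetic above concrete, here is a rough Python sketch that compares the raw text payload under UTF-8 with hypothetical fixed 16-bit and 32-bit encodings. The function name `encoded_sizes` is made up for illustration and is not part of this project. It counts only the bytes of the text itself, not the pointers and metadata mentioned above, so the real RAM ratio would be smaller than these numbers suggest.

```python
def encoded_sizes(text):
    """Return the payload size in bytes of `text` under three encodings.

    Hypothetical helper for illustration only. The fixed-width figures
    assume one unit per code point; "fixed16" is only meaningful when
    every code point is <= 0xFFFF.
    """
    code_points = len(text)  # Python 3 strings are sequences of code points
    return {
        "utf8": len(text.encode("utf-8")),  # variable length, 1-4 bytes each
        "fixed16": 2 * code_points,         # 16 bits per letter
        "fixed32": 4 * code_points,         # 32 bits per letter
    }


print(encoded_sizes("hello"))   # {'utf8': 5, 'fixed16': 10, 'fixed32': 20}
print(encoded_sizes("привет"))  # {'utf8': 12, 'fixed16': 12, 'fixed32': 24}
```

For the Latin word the 32-bit payload is 4x the UTF-8 one, while for the Cyrillic word it is 2x and the 16-bit payload is the same size as UTF-8.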
Yeah, 16 bits should cover all the languages we are aiming at, so that's okay.
Yeah, the range 0x0000-0xFFFF (the Basic Multilingual Plane) covers everything a sane person would need. https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane Although Unicode and sanity don't go hand in hand, in this case it might be a good compromise.
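A quick way to see what the 0x0000-0xFFFF compromise would exclude is to check whether every code point of a string falls inside that range. The sketch below is in Python and uses a hypothetical helper name, `fits_in_bmp`. It only illustrates the range check and says nothing about how the project itself would store or reject such strings.

```python
def fits_in_bmp(text):
    """Return True if every code point in `text` is within 0x0000-0xFFFF.

    Hypothetical helper for illustration: such strings could be stored
    with exactly 16 bits per letter, with no surrogate pairs needed.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


print(fits_in_bmp("héllo"))   # True: accented Latin is in the BMP
print(fits_in_bmp("日本語"))   # True: common CJK ideographs are in the BMP
print(fits_in_bmp("😀"))      # False: U+1F600 lies outside the BMP
```

Emoji and other characters above U+FFFF are the main things a 16-bit representation would not be able to hold as a single unit.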
The following search returns no results:
Another example using random bytes: