BUG when autocompleting unicode strings #8

mannol · 2016-12-01T14:47:05Z

The following search returns no results:

redisCommand(ctx, "FT.SUGADD userslex %b 1", "\u010Caji\u0107", sizeof("\u010Caji\u0107") - 1);
redisCommand(ctx, "FT.SUGGET userslex %b MAX 2", "\u010Caj", sizeof("\u010Caj") - 1);

Another example using random bytes:

uint32_t a = 1234432413;
redisCommand(ctx, "FT.SUGADD userslex %b 1", &a, 4);
redisCommand(ctx, "FT.SUGGET userslex %b MAX 2", &a, 3); // nothing found


dvirsky · 2016-12-01T15:58:40Z

Fixed as far as I can test (haven't run your test code though). Feel free to verify.

mannol · 2016-12-01T16:36:30Z

Fix confirmed! Thanks!

dvirsky · 2016-12-01T17:42:24Z

Going over this, it looks like working with 32-bit runes will not be too hard to do, and will allow full fuzzy support in Unicode, but it will make memory consumption terrible. Is Unicode-aware fuzzy matching critical for you?

mannol · 2016-12-01T18:17:52Z

It's essential, yes. How much would the memory requirements increase?

dvirsky · 2016-12-01T19:20:27Z

The idea is to avoid variable-length encodings like UTF-8 and UTF-16 and use a fixed-length encoding instead, probably 32 bits per letter.
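To make the fixed-length idea concrete, here is a minimal sketch of decoding UTF-8 bytes into 32-bit runes. The function name `utf8_to_runes` is hypothetical and is not taken from the RediSearch code base. It is a simplified decoder under stated assumptions: it checks lead and continuation bytes but does not reject overlong encodings or surrogate code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (an assumption, not RediSearch's API): decode a UTF-8
 * byte string into fixed-width 32-bit runes.
 * Returns the number of runes written, or -1 if the input is malformed,
 * truncated mid-sequence, or does not fit in `out` (capacity `cap`). */
long utf8_to_runes(const char *s, size_t len, uint32_t *out, size_t cap) {
    assert(s != NULL && out != NULL);
    size_t n = 0;
    for (size_t i = 0; i < len;) {
        unsigned char b = (unsigned char)s[i];
        uint32_t cp;
        size_t extra; /* number of continuation bytes that must follow */
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return -1; /* stray continuation byte or invalid lead byte */

        if (extra > len - i - 1 || n == cap) return -1;
        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = (unsigned char)s[i + k];
            if ((c & 0xC0) != 0x80) return -1; /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        out[n++] = cp;
        i += extra + 1;
    }
    return (long)n;
}
```

With runes, a prefix is always a whole number of letters: the first three runes of the decoded "\u010Caji\u0107" are exactly the runes of "\u010Caj", so prefix and fuzzy comparisons never have to deal with a letter split across bytes.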

So for pure Latin text, which UTF-8 represents with 1 byte per letter, you would have to use 4 bytes. Some of the memory consumption is of course pointers and metadata, so we are talking about x2-x3 the amount of RAM.

For purely non-Latin text, which takes up 2-3 bytes per letter in UTF-8, it would probably be x1.5-x2.
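The raw text payload behind these estimates can be checked with a small sketch. The helper names below are hypothetical, and the code only counts bytes of letters. It ignores the pointers and metadata mentioned above, which is why the overall RAM factor is lower than the raw per-letter factor.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a UTF-8 buffer by counting every byte that is not a
 * continuation byte (10xxxxxx). Assumes the input is valid UTF-8. */
size_t utf8_rune_count(const char *s, size_t len) {
    assert(s != NULL);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) n++;
    }
    return n;
}

/* Raw payload in bytes if every letter is stored in `width` bytes
 * (4 for 32-bit runes, 2 for 16-bit runes). */
size_t fixed_width_bytes(const char *s, size_t len, size_t width) {
    return utf8_rune_count(s, len) * width;
}
```

For example, "hello" is 5 bytes in UTF-8 and 20 bytes as 32-bit runes, while "\u010Caji\u0107" is 7 bytes in UTF-8, 20 bytes as 32-bit runes, and 10 bytes as 16-bit runes.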

If I manage to get away with 16 bits per letter, which won't cover all languages but will cover the most popular ones IIRC, it won't be so bad.

mannol · 2016-12-01T20:09:08Z

Yeah, 16 bits should cover all the languages we are aiming at so that's okay.

dvirsky · 2016-12-01T21:29:54Z

Yeah, the range 0x0000-0xFFFF covers everything a sane person would need. https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane

Although Unicode and sanity don't go hand in hand, in this case it might be a good compromise.
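As a rough illustration of the 16-bit compromise, the sketch below narrows 32-bit runes to 16 bits and reports failure for anything outside the range 0x0000-0xFFFF. Both function names are hypothetical, and how RediSearch actually handles out-of-range letters is not stated in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the code point lies in the Basic Multilingual Plane
 * (U+0000..U+FFFF) and therefore fits in one 16-bit unit. */
bool rune_fits_16bit(uint32_t cp) {
    return cp <= 0xFFFF;
}

/* Narrow `n` 32-bit runes into `out`. Stops and returns false at the first
 * rune outside the BMP; in that case `out` is only partially filled. */
bool runes_narrow_to_16(const uint32_t *in, size_t n, uint16_t *out) {
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!rune_fits_16bit(in[i])) return false;
        out[i] = (uint16_t)in[i];
    }
    return true;
}
```

Letters such as U+010C or U+4E2D pass through unchanged, while a code point like U+1F600 (an emoji in a supplementary plane) is rejected, which is the coverage trade-off discussed above.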

mannol mentioned this issue Dec 1, 2016

FEATURE REQUEST: utf-8 support in autocomplete #7

Closed

mannol closed this as completed Dec 1, 2016

aflin mentioned this issue Sep 3, 2019

Crash while inserting #888

Closed

rafie added a commit that referenced this issue Oct 10, 2019

Added macOS build to CircleCI #8

397595a

freecw mentioned this issue Jul 20, 2023

crash when creating index after IndexSpec_CreateFromRdb hits existing index spec #3721

Closed

freecw mentioned this issue Nov 10, 2023

redisearch Crash with intensive vector indexing and query in multi-thread mode #4039

Closed

jason-smarty mentioned this issue Apr 17, 2024

[BUG] Redis freezes and stops responding with 100% CPU Utilization while using redissearch with HNSW vector indexes #4602

Closed
