Support for Thai language #297

kishorenc opened this issue Jun 21, 2021 · 13 comments

@kishorenc (Member)

Description

Typesense should be able to index and search on Thai text.

@artt Let's use this issue to continue the discussion on #228.

I'm still unable to compile ICU with the updated dictionary file that you attached to the other thread. Can you please double check that the file that you attached is indeed the one that you were able to build successfully earlier?

This is what I did:

git clone git@github.com:unicode-org/icu.git
cd icu/icu4c/source
make clean
cp ~/Downloads/thaidict_no_space.txt data/brkitr/dictionaries/thaidict.txt
./runConfigureICU MacOSX --disable-samples --disable-tests --enable-static --disable-shared --disable-renaming  && make -j8

This is the error I get:

DYLD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$DYLD_LIBRARY_PATH  ../bin/gendict -i ./out/build/icudt70l -c --bytes --transform offset-0x0e00 ./brkitr/dictionaries/thaidict.txt ./out/build/icudt70l/brkitr/thaidict.dict
Codepoint U+002e out of range for --transform offset-0e00!
make[1]: *** [out/build/icudt70l/brkitr/thaidict.dict] Error 1

I then directly ran the gendict command that the ICU build was invoking, and got the same error:

/usr/local/opt/icu4c/bin/gendict -c --bytes --transform offset-0x0e00 ~/Downloads/thaidict_no_space.txt ~/Downloads/thai.dict
Codepoint U+002e out of range for --transform offset-0e00!

Code point U+002e in the file is the full stop character. I removed it, but similar errors then occurred for hyphens, slashes, numbers, etc. I removed them one by one until I ended up with another error:

gendict: got failure of type U_ILLEGAL_ARGUMENT_ERROR while serializing, if U_ILLEGAL_ARGUMENT_ERROR possibly due to duplicate dictionary entries

I think I should have deleted the lines containing these bad characters rather than just deleting the characters themselves, but I will pause here -- hoping that perhaps you had attached the wrong file :)
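For reference, here is a rough sketch of what I mean by dropping whole lines instead (file names are placeholders, and I'm assuming the offset-0x0e00 transform can only encode code points near the Thai block, which is what the gendict error suggests):

# Sketch: keep only dictionary lines whose characters the offset-0x0e00
# transform can encode, and drop duplicates, instead of stripping characters
# in place. File names are placeholders.
OFFSET = 0x0E00
WIDTH = 0x100  # assumption: the offset transform encodes each code point as one byte

def encodable(word: str) -> bool:
    return all(OFFSET <= ord(ch) < OFFSET + WIDTH for ch in word)

seen = set()
with open("thaidict_no_space.txt", encoding="utf-8") as src, \
        open("thaidict_filtered.txt", "w", encoding="utf-8") as dst:
    for line in src:
        word = line.strip()
        if word and word not in seen and encodable(word):
            dst.write(word + "\n")
            seen.add(word)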

@artt (Contributor) commented Jun 21, 2021

Thanks. I only tried gendict but didn't try to build ICU in its entirety. Will see what I get :)

BTW, here's what I did for gendict:

> gendict --uchars ~/Downloads/thaidict_no_space.txt ~/Downloads/thai.dict
> gendict: done writing	~/Downloads/thai.dict (1s).

@kishorenc (Member Author)

You just need to make the gendict command I've posted above work with those options. If that works, the full build will work.

@artt (Contributor) commented Jun 21, 2021

This should work now with a clean build. I basically removed all the words containing offending characters (space, comma, hyphen, 2, and maybe a few others). Please let me know how it goes!
thaidict_no_space.txt

@kishorenc (Member Author)

@artt This time ICU built fine, but the two examples we picked (ความเหลื่อมล้ำ and เหลื่) are still not getting tokenized correctly. ความเหลื่อมล้ำ is not split at all, while เหลื่ is being split as before. It looks like there are fundamental issues with the ICU library that can't be fixed by updating the dictionary alone 😞
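(For anyone wanting to reproduce this quickly: a minimal sketch for inspecting what ICU's Thai break iterator does with a string, assuming the PyICU package is installed. Note that it uses the dictionary bundled with the installed ICU data, so it only approximates a custom build.)

# Inspect ICU's Thai word segmentation via PyICU (uses the stock ICU data,
# so it only approximates a custom-built dictionary).
import icu

def icu_tokens(text, locale="th_TH"):
    bi = icu.BreakIterator.createWordInstance(icu.Locale(locale))
    bi.setText(text)
    tokens, start = [], 0
    for end in bi:  # iterating yields successive boundary offsets
        tokens.append(text[start:end])
        start = end
    return tokens

print(icu_tokens("ความเหลื่อมล้ำ"))
print(icu_tokens("เหลื่"))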

@artt (Contributor) commented Jun 22, 2021

@kishorenc Thanks!

It's not very surprising that ความเหลื่อมล้ำ is not split (since it appears as a word in the dictionary), and that เหลื่ is split (since เห is a word, while เหลื่ is not).

I think this solves most of the issues I raised earlier. The only problem I can think of right now is that if the user searches for a substring of the word (like เหลื่อมล้ำ), Typesense wouldn't be able to find it.

One solution I can think of off the top of my head is to have a pre-made mapping that tells whether a word is a substring of other words (so เหลื่อมล้ำ is a substring of ความเหลื่อมล้ำ), and when a user queries a word, include all of its "superstrings" in the query as well. Not sure if this is practical, though.
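A rough sketch of what I mean, using the dictionary file itself as the word list (file name is illustrative, and the linear scan is obviously naive):

# Naive sketch of the "superstring" expansion idea: for a query word, also
# include every dictionary word that contains it. File name is illustrative.
def load_words(path):
    with open(path, encoding="utf-8") as f:
        return [w.strip() for w in f if w.strip()]

def expand_query(query, words):
    supers = [w for w in words if query in w and w != query]
    return " ".join([query] + supers)

words = load_words("thaidict_no_space.txt")
print(expand_query("เหลื่อมล้ำ", words))  # would e.g. also include ความเหลื่อมล้ำ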

In any case, is there a Typesense version I could try to see issues with this new dictionary? Thanks again!


Edit

I take that back. I looked back at #228 and I think the problem this time is that the dictionary is filled with too many compound words :( For example, ความเหลื่อมล้ำ shouldn't be an entry. I'll try to find a way to fix the dictionary and make this acceptable (any other Thais here?). In the meantime, I wonder how other search engines like Elasticsearch do it.

@kishorenc (Member Author) commented Jun 23, 2021

Yes, I was referring to your earlier comment here:

The few I've tried all thought เหลื่ is one word (as เห and ลื่ doesn't have a meaning, as far as I can tell), and segment ความเหลื่อมล้ำ as ความ|เหลื่อมล้ำ, whereas ICU split it into ความ|เหลื่อม|ล้ำ. This is why it fails when the query is เหลื่อมล (and pre_segmented_query=true).

Both those cases still aren't working as expected, even with the latest dictionary.

One crazy idea I had is to use one of the machine learning models within Typesense for the segmentation, if they give good results. However, Typesense would have to use Torch (which has a C++ interface). Also, the ML models are so large that they will end up taking a lot of memory relative to smaller datasets.

@artt (Contributor) commented Jul 29, 2021

Hi! Sorry for being MIA for so long. I think ultimately it could come down to letting the user supply his/her own dictionary and using ICU's algorithm to tokenize words. Then the user could play around with whichever dictionary suits his/her needs best. All the user needs to do is make sure the supplied dict file fits ICU's requirements.

Not sure if that's something you're interested in exploring. Otherwise, for my personal use case, would it be possible for me to come up with my own dictionary and ask you to generate a build for me? :P That doesn't sound like a very good practice though, haha.

@kishorenc (Member Author)

I'm on board with the idea of exposing a way to use a custom dictionary file. However, I don't think that will fix the issue we saw. For example, the earlier dict file you shared still faced the same issues we saw before (my observations here: #297 (comment)). Correct?

@artt (Contributor) commented Jul 29, 2021

That's correct. But I think this problem could be solved with a "proper" dictionary (in this case, one where เหลื่อมล้ำ is a word but ความเหลื่อมล้ำ isn't).

A custom dictionary would at least allow me to play around with the dictionary on my end without needing a new build for each one :)

@kishorenc (Member Author)

Okay, let me figure out how to configure the ICU library to use a custom dictionary. Then it should be easy to expose a flag for providing that for any LOCALE.

@artt (Contributor) commented Aug 3, 2021

I can add one more thing that might make this whole thing more reliable.

One of the problems I'm seeing now is that the word tokenizer has trouble with proper names that might not be in the dictionary. For example, if the name is (for simplicity's sake) Bobx and only Bob is in the dictionary, then the word would be cut into Bob|x, and searching for Bobx (with pre_segmented_query=true) would not yield the entry.

So what I did (in artt/thaisense) was to index both the tokenized and non-tokenized versions (in the same document), and, when searching, duplicate the query and tokenize one of the copies.

So to Typesense, the document would look like Bobx Bob x. Querying Bobx would be translated into searching for Bobx Bob x as well, and the appropriate result would be returned.

The downside is that the index space would be doubled. Maybe we could exclude words that are more than 50 characters long (which we can be sure are not proper nouns), and that would save some space. I would propose making this configurable as well.
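To make the index-side part concrete, here is a minimal sketch (not the actual artt/thaisense code) that uses ICU for the segmentation, assuming the PyICU package is installed; the field name is just an example:

# Sketch of the index-side idea: store the raw text followed by its segmented
# form in the same field, so both the unsegmented proper noun and its
# dictionary tokens are searchable.
import icu

def segment(text, locale="th_TH"):
    bi = icu.BreakIterator.createWordInstance(icu.Locale(locale))
    bi.setText(text)
    tokens, start = [], 0
    for end in bi:
        tok = text[start:end].strip()
        if tok:
            tokens.append(tok)
        start = end
    return tokens

def make_indexable(text):
    return " ".join([text] + segment(text))

document = {"title": make_indexable("ความเหลื่อมล้ำ")}
# with the stock dictionary this yields "ความเหลื่อมล้ำ ความ เหลื่อม ล้ำ"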

@kishorenc (Member Author) commented Nov 8, 2021

@artt I haven't forgotten about this; I just got caught up with some last-mile issues for the 0.22 release that we are going to publish soon.

One of the problems I'm seeing now is that the word tokenizer has trouble with proper names that might not be in the dictionary. For example, if the name is (for simplicity's sake) Bobx and only Bob is in the dictionary, then the word would be cut into Bob|x, and searching for Bobx (with pre_segmented_query=true) would not yield the entry.

Have you tried any of the neural-network-based word segmentation libraries? Of the projects I found, https://github.com/PyThaiNLP/attacut seems promising. I'm increasingly of the opinion that dictionary-based approaches are not good, and a neural network will provide better segmentation, though the downside is that segmentation happens outside Typesense, so it is not convenient (unless we can integrate PyTorch into Typesense).

In any case, I would love to hear how AttaCut performs on the examples you have tried with the dictionary-based approaches.
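For reference, AttaCut is a Python package, so a quick check against the examples from this thread should only take a couple of lines (assuming the attacut package installs cleanly):

# Quick check of AttaCut's segmentation on the examples discussed above;
# compare with ICU's ความ|เหลื่อม|ล้ำ split.
from attacut import tokenize

for sample in ["ความเหลื่อมล้ำ", "เหลื่อมล้ำ", "เหลื่"]:
    print(sample, "->", tokenize(sample))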

@artt (Contributor) commented Nov 9, 2021

Thanks @kishorenc. I think the issue is a bit more complex than the choice of tokenizer. Let me elaborate.

Let's say the document contains polarbear (no space but you get the idea). There are two possibilities:

  1. The word doesn't get split. In this case, when you search for polar (unfinished typing), the document will be found. But then when you type another character, b, the tokenizer will either:
    1. Not split the query (so it stays polarb), and the result will be gone. Only when you type the full word polarbear will the document come back. This is not very good UX.
    2. Split the query into polar|b. In this case, things should still work, but the results will include a lot of irrelevant documents, such as ones containing polar by itself or any word that starts with b.
  2. The word gets split into polar|bear. The same thing as in (1.ii) will happen.

These are the problems as I recall them. It's been months, so I might be missing something. Anyway, I ended up creating a middleware (called Thaisense) that does these things:

  1. When indexing, tokenize the entire document and append it to the original document. So polarbeariswhite will become polarbeariswhite polar bear is white.
  2. When searching, tokenize the query string and append it to the original query. So polarb becomes polarb polar b, which is then passed to Typesense (a minimal sketch of this query side follows below).
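A minimal sketch of the search side, assuming the official typesense Python client and a hypothetical collection/field; the expanded query is sent with pre_segmented_query enabled, since the tokens are already space-separated:

# Query-side sketch: send the expanded query to Typesense.
# Collection/field names and connection details are hypothetical.
import typesense

client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "xyz",
    "connection_timeout_seconds": 2,
})

expanded_query = "polarb polar b"  # original query followed by its tokens

results = client.collections["documents"].documents.search({
    "q": expanded_query,
    "query_by": "title",
    "pre_segmented_query": True,  # tokens are already space-separated
})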

The result is usable, but not great, in particular because ranking is determined by the number of tokens found (and so it counts every word that starts with b, for example). I hope this is useful to you as you go on improving it.

The ideal situation, of course, is if we can do infix search for these languages. That's what the user (me, for example) expects. Then we wouldn't have to worry about tokenizing at all. ;)
