Support for Thai language #297

kishorenc opened this issue Jun 21, 2021 · 13 comments

@kishorenc (Member)

Description

Typesense should be able to index and search on Thai text.

@artt Let's use this issue to continue the discussion on #228.

I'm still unable to compile ICU with the updated dictionary file that you attached to the other thread. Can you please double check that the file that you attached is indeed the one that you were able to build successfully earlier?

This is what I did:

git clone git@github.com:unicode-org/icu.git
cd icu/icu4c/source
make clean
cp ~/Downloads/thaidict_no_space.txt data/brkitr/dictionaries/thaidict.txt
./runConfigureICU MacOSX --disable-samples --disable-tests --enable-static --disable-shared --disable-renaming  && make -j8

This is the error I get:

DYLD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$DYLD_LIBRARY_PATH  ../bin/gendict -i ./out/build/icudt70l -c --bytes --transform offset-0x0e00 ./brkitr/dictionaries/thaidict.txt ./out/build/icudt70l/brkitr/thaidict.dict
Codepoint U+002e out of range for --transform offset-0e00!
make[1]: *** [out/build/icudt70l/brkitr/thaidict.dict] Error 1

I then directly ran the gendict command that the ICU build was invoking, and got the same error:

/usr/local/opt/icu4c/bin/gendict -c --bytes --transform offset-0x0e00 ~/Downloads/thaidict_no_space.txt ~/Downloads/thai.dict
Codepoint U+002e out of range for --transform offset-0e00!

Code point U+002e in the file is the full stop character. I removed it, but similar errors then occurred for hyphens, slashes, numbers, etc. I removed them one by one until I ended up with another error:

gendict: got failure of type U_ILLEGAL_ARGUMENT_ERROR while serializing, if U_ILLEGAL_ARGUMENT_ERROR possibly due to duplicate dictionary entries

I think I should have deleted the lines containing these bad characters rather than just deleting the characters themselves, but I will pause here -- hoping that perhaps you had attached the wrong file :)
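For reference, here is a rough sketch of what I mean by dropping whole lines instead (file names are placeholders, and I'm assuming the offset-0x0e00 transform can only encode code points near the Thai block, which is what the gendict error suggests):

# Sketch: keep only dictionary lines whose characters the offset-0x0e00
# transform can encode, and drop duplicates, instead of stripping characters
# in place. File names are placeholders.
OFFSET = 0x0E00
WIDTH = 0x100  # assumption: the offset transform encodes each code point as one byte

def encodable(word: str) -> bool:
    return all(OFFSET <= ord(ch) < OFFSET + WIDTH for ch in word)

seen = set()
with open("thaidict_no_space.txt", encoding="utf-8") as src, \
        open("thaidict_filtered.txt", "w", encoding="utf-8") as dst:
    for line in src:
        word = line.strip()
        if word and word not in seen and encodable(word):
            dst.write(word + "\n")
            seen.add(word)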

@artt (Contributor) commented Jun 21, 2021

Thanks. I only tried gendict but didn't try to build ICU in its entirety. Will see what I get :)

BTW, here's what I did for gendict:

> gendict --uchars ~/Downloads/thaidict_no_space.txt ~/Downloads/thai.dict
> gendict: done writing	~/Downloads/thai.dict (1s).

@kishorenc (Member Author)

You just need to make the gendict command I've posted above work with those options. If that works, the full build will work.

@artt (Contributor) commented Jun 21, 2021

This should work now with a clean build. I basically removed all the words containing offending characters (space, comma, hyphen, 2, and maybe a few others). Please let me know how it goes!
thaidict_no_space.txt

@kishorenc (Member Author)

@artt This time ICU built fine, but the two examples we picked (ความเหลื่อมล้ำ and เหลื่) are still not getting tokenized correctly. ความเหลื่อมล้ำ is not split at all, while เหลื่ is being split as before. It looks like there are fundamental issues with the ICU library that can't be fixed by updating the dictionary alone 😞
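(For anyone wanting to reproduce this quickly: a minimal sketch for inspecting what ICU's Thai break iterator does with a string, assuming the PyICU package is installed. Note that it uses the dictionary bundled with the installed ICU data, so it only approximates a custom build.)

# Inspect ICU's Thai word segmentation via PyICU (uses the stock ICU data,
# so it only approximates a custom-built dictionary).
import icu

def icu_tokens(text, locale="th_TH"):
    bi = icu.BreakIterator.createWordInstance(icu.Locale(locale))
    bi.setText(text)
    tokens, start = [], 0
    for end in bi:  # iterating yields successive boundary offsets
        tokens.append(text[start:end])
        start = end
    return tokens

print(icu_tokens("ความเหลื่อมล้ำ"))
print(icu_tokens("เหลื่"))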

@artt (Contributor) commented Jun 22, 2021

@kishorenc Thanks!

It's not very surprising that ความเหลื่อมล้ำ is not split (since it appears as a word in the dictionary), and that เหลื่ is split (since เห is a word, while เหลื่ is not).

I think this solves most of the issues I raised earlier. The only problem I can think of right now is that if the user searches for a substring of the word (like เหลื่อมล้ำ), Typesense wouldn't be able to find it.

One solution I can think of off the top of my head is to have a pre-made mapping that tells whether a word is a substring of other words (so เหลื่อมล้ำ is a substring of ความเหลื่อมล้ำ), and when a user queries a word, include all of its "superstrings" in the query as well. Not sure if this is practical, though.
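A rough sketch of what I mean, using the dictionary file itself as the word list (file name is illustrative, and the linear scan is obviously naive):

# Naive sketch of the "superstring" expansion idea: for a query word, also
# include every dictionary word that contains it. File name is illustrative.
def load_words(path):
    with open(path, encoding="utf-8") as f:
        return [w.strip() for w in f if w.strip()]

def expand_query(query, words):
    supers = [w for w in words if query in w and w != query]
    return " ".join([query] + supers)

words = load_words("thaidict_no_space.txt")
print(expand_query("เหลื่อมล้ำ", words))  # would e.g. also include ความเหลื่อมล้ำ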

In any case, is there a Typesense version I could try to see issues with this new dictionary? Thanks again!


Edit

I take that back. I looked back at #228 and I think the problem this time is that the dictionary is filled with too many compound words :( For example, ความเหลื่อมล้ำ shouldn't be an entry. I'll try to find a way to fix the dictionary and make this acceptable (any other Thais here?). In the meantime, I wonder how other search engines like Elasticsearch do it.

@kishorenc (Member Author) commented Jun 23, 2021

Yes, I was referring to your earlier comment here:

The few I've tried all thought เหลื่ is one word (as เห and ลื่ doesn't have a meaning, as far as I can tell), and segment ความเหลื่อมล้ำ as ความ|เหลื่อมล้ำ, whereas ICU split it into ความ|เหลื่อม|ล้ำ. This is why it fails when the query is เหลื่อมล (and pre_segmented_query=true).

Both those cases still aren't working as expected, even with the latest dictionary.

One crazy idea I had is to use one of the machine learning models within Typesense for the segmentation, if they give good results. However, Typesense would have to use Torch (which has a C++ interface). Also, the ML models are so large that they will end up taking a lot of memory relative to smaller datasets.

@artt (Contributor) commented Jul 29, 2021

Hi! Sorry for being MIA for so long. I think ultimately it could come down to letting the user supply his/her own dictionary and using ICU's algorithm to tokenize words. Then the user could play around with whichever dictionary suits his/her needs best. All the user needs to do is make sure the supplied dict file fits ICU's requirements.

Not sure if that's something you're interested in exploring. Otherwise, for my personal use case, would it be possible for me to come up with my own dictionary and ask you to generate a build for me? :P That doesn't sound like a very good practice though, haha.

@kishorenc (Member Author)

I'm on board with the idea of exposing a way to use a custom dictionary file. However, I don't think that will fix the issue we saw. For example, the earlier dict file you shared still faced the same issues we saw before (my observations here: #297 (comment)). Correct?

@artt (Contributor) commented Jul 29, 2021

That's correct. But I think this problem could be solved with a "proper" dictionary (in this case, one where เหลื่อมล้ำ is a word but ความเหลื่อมล้ำ isn't).

A custom dictionary would at least allow me to play around with the dictionary on my end without needing a new build for each one :)

@kishorenc (Member Author)

Okay, let me figure out how to configure the ICU library to use a custom dictionary. Then it should be easy to expose a flag for providing that for any LOCALE.

@artt (Contributor) commented Aug 3, 2021

I can add one more thing that might make this whole thing more reliable.

One of the problems I'm seeing now is that the word tokenizer has trouble with proper names that might not be in the dictionary. For example, if the name is (for simplicity's sake) Bobx and only Bob is in the dictionary, then the word would be cut into Bob|x, and searching for Bobx (with pre_segmented_query=true) would not yield the entry.

So what I did (in artt/thaisense) was to index both the tokenized and non-tokenized versions (in the same document), and, when searching, duplicate the query and tokenize one of the copies.

So to Typesense, the document would look like Bobx Bob x. Querying Bobx would be translated into searching for Bobx Bob x as well, and the appropriate result would be returned.

The downside is that the index space would be doubled. Maybe we could exclude words that are more than 50 characters long (which we can be sure are not proper nouns), and that would save some space. I would propose making this configurable as well.
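To make the index-side part concrete, here is a minimal sketch (not the actual artt/thaisense code) that uses ICU for the segmentation, assuming the PyICU package is installed; the field name is just an example:

# Sketch of the index-side idea: store the raw text followed by its segmented
# form in the same field, so both the unsegmented proper noun and its
# dictionary tokens are searchable.
import icu

def segment(text, locale="th_TH"):
    bi = icu.BreakIterator.createWordInstance(icu.Locale(locale))
    bi.setText(text)
    tokens, start = [], 0
    for end in bi:
        tok = text[start:end].strip()
        if tok:
            tokens.append(tok)
        start = end
    return tokens

def make_indexable(text):
    return " ".join([text] + segment(text))

document = {"title": make_indexable("ความเหลื่อมล้ำ")}
# with the stock dictionary this yields "ความเหลื่อมล้ำ ความ เหลื่อม ล้ำ"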

@kishorenc (Member Author) commented Nov 8, 2021

@artt I haven't forgotten about this; I just got caught up with some last-mile issues for the 0.22 release that we are going to publish soon.

One of the problems I'm seeing now is that the word tokenizer has trouble with proper names that might not be in the dictionary. For example, if the name is (for simplicity's sake) Bobx and only Bob is in the dictionary, then the word would be cut into Bob|x, and searching for Bobx (with pre_segmented_query=true) would not yield the entry.

Have you tried any of the neural-network-based word segmentation libraries? Of the projects I found, https://github.com/PyThaiNLP/attacut seems promising. I'm increasingly of the opinion that dictionary-based approaches are not good, and a neural network will provide better segmentation, though the downside is that segmentation happens outside Typesense, so it is not convenient (unless we can integrate PyTorch into Typesense).

In any case, I would love to hear how AttaCut performs on the examples you have tried with the dictionary-based approaches.
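For reference, AttaCut is a Python package, so a quick check against the examples from this thread should only take a couple of lines (assuming the attacut package installs cleanly):

# Quick check of AttaCut's segmentation on the examples discussed above;
# compare with ICU's ความ|เหลื่อม|ล้ำ split.
from attacut import tokenize

for sample in ["ความเหลื่อมล้ำ", "เหลื่อมล้ำ", "เหลื่"]:
    print(sample, "->", tokenize(sample))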

@artt (Contributor) commented Nov 9, 2021

Thanks @kishorenc. I think the issue is a bit more complex than the choice of tokenizer. Let me elaborate.

Let's say the document contains polarbear (no space but you get the idea). There are two possibilities:

  1. The word doesn't get split. In this case, when you search for polar (unfinished typing), the document will be found. But then when you type another character, b, the tokenizer will either:
    1. Not split the query (so it stays polarb), and the result will be gone. Only when you type the full word polarbear will the document come back. This is not very good UX.
    2. Split the query into polar|b. In this case, things should still work, but the results will include a lot of irrelevant documents, such as ones containing polar by itself or any word that starts with b.
  2. The word gets split into polar|bear. The same thing as in (1.ii) will happen.

These are the problems as I recall them. It's been months, so I might be missing something. Anyway, I ended up creating a middleware (called Thaisense) that does these things:

  1. When indexing, tokenize the entire document and append it to the original document. So polarbeariswhite will become polarbeariswhite polar bear is white.
  2. When searching, tokenize the query string and append it to the original query. So polarb becomes polarb polar b, which is then passed to Typesense (a minimal sketch of this query side follows below).
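A minimal sketch of the search side, assuming the official typesense Python client and a hypothetical collection/field; the expanded query is sent with pre_segmented_query enabled, since the tokens are already space-separated:

# Query-side sketch: send the expanded query to Typesense.
# Collection/field names and connection details are hypothetical.
import typesense

client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "xyz",
    "connection_timeout_seconds": 2,
})

expanded_query = "polarb polar b"  # original query followed by its tokens

results = client.collections["documents"].documents.search({
    "q": expanded_query,
    "query_by": "title",
    "pre_segmented_query": True,  # tokens are already space-separated
})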

The result is usable, but not great, in particular because ranking is determined by the number of tokens found (and so it counts every word that starts with b, for example). I hope this is useful to you as you go on improving it.

The ideal situation, of course, is if we can do infix search for these languages. That's what the user (me, for example) expects. Then we wouldn't have to worry about tokenizing at all. ;)
