Support for Thai language #297
Comments
Thanks. I only tried […]. BTW, here's what I did for […]
You just need to make the gendict command I've posted above build with those options. If that works, then the build will work.
This should work now with a clean build. I basically removed all the words with offending characters ([…]).
@artt This time ICU built fine, but the two examples we picked ([…]) […]
@kishorenc Thanks! It's not very surprising that […]. I think this solves most of the issues I raised earlier. The only problem I could think of right now is if the user searches for a substring of the word (like […]).

One solution I could think of off the top of my head is to have a pre-made mapping that tells if a word is a substring of other words (so […]).

In any case, is there a Typesense version I could try to see issues with this new dictionary? Thanks again!

Edit: I take that back. I looked back at #228 and I think the problem this time is that the dictionary is filled with too many compound words :( For example, […]
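(Illustration only, not from the thread.) A rough sketch of that pre-made substring mapping idea, using English stand-ins for Thai dictionary entries; a real dictionary would want something smarter than this quadratic scan (e.g. Aho-Corasick), but the shape is the same:

```python
from collections import defaultdict

def build_substring_map(words: list[str]) -> dict[str, list[str]]:
    """Map each dictionary word to every other dictionary word that contains it."""
    longer_words = defaultdict(list)
    for short in words:
        for long in words:
            if short != long and short in long:
                longer_words[short].append(long)
    return dict(longer_words)

# Toy example with English stand-ins for Thai dictionary entries.
mapping = build_substring_map(["rain", "rainbow", "bow", "brainstorm"])
# mapping["rain"] -> ["rainbow", "brainstorm"]; a query for "rain" could then
# also be expanded to match documents that were tokenized into the longer words.
```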
Yes, I was referring to your earlier comment here: […]
Both those cases aren't working as expected even with the latest dictionary. One crazy idea I had is to use one of the machine learning models within Typesense for the segmentation, if they give good results. However, Typesense would have to use […]
Hi! Sorry for being MIA for so long. I think ultimately it could come down to letting the user supply his/her own dictionary and using ICU's algorithm to tokenize words. Then the user could play around with whatever dictionary suits his/her needs best. All the user needs to do is make sure the dict file supplied fits ICU's requirements. Not sure if that's something you're interested in exploring. Otherwise, for my personal use case, would it be possible to come up with my own dictionary and ask you to generate a build for me? :P That doesn't sound like a very good practice though, haha.
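(Added for illustration, not part of the original comment.) A minimal sketch of the "use ICU's algorithm to tokenize words" part via the PyICU bindings, assuming PyICU is installed and the stock ICU Thai break dictionary is in use; it does not cover loading a custom dictionary, which is the open question here:

```python
# pip install PyICU
from icu import BreakIterator, Locale

def icu_word_tokenize(text: str, locale: str = "th") -> list[str]:
    """Split text into words using ICU's dictionary-based word break iterator."""
    bi = BreakIterator.createWordInstance(Locale(locale))
    bi.setText(text)
    tokens, start = [], bi.first()
    for end in bi:               # PyICU break iterators yield boundary offsets
        token = text[start:end]
        if token.strip():        # skip whitespace-only segments
            tokens.append(token)
        start = end
    return tokens

print(icu_word_tokenize("ตัวอย่างข้อความภาษาไทย"))  # placeholder Thai text
```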
I'm on board with the idea of exposing a way to use a custom dictionary file. However, I don't think that will fix the issue we saw. For example, the earlier dict file you shared still faced the same issues we saw earlier (my observations here: #297 (comment)). Correct?
That's correct. But I think this problem could be solved with a "proper" dictionary (in this case, one where […]). With a custom dictionary, at least I could play around with the dictionary on my end without needing a new build for each dictionary :)
Okay, let me figure out how to configure the ICU library to use a custom dictionary. Then it should be easy to expose a flag for providing that for any LOCALE. |
I can add one more thing that might make this whole approach more reliable. One of the problems I'm seeing now is that the word tokenizer has trouble with proper names that might not be in the dictionary. For example, if the name is (for simplicity's sake) […]

So what I did (in artt/thaisense) was to index both the tokenized version and the non-tokenized version (into the same document), and when searching, duplicate the query and tokenize one of the copies. So to Typesense, the document would look like […]

The downside is that the index space would be doubled. Maybe we could exclude words that are more than 50 characters long (which we can be sure are not proper nouns), and that would save a bit of space. I'd propose this as something that can be configured as well.
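(Not from the thread, just to make the dual-index idea above concrete.) A rough Python sketch of what that middleware-style indexing and query duplication could look like. `thai_tokenize`, the field names, and the `q`/`query_by` combination are placeholders and assumptions, not the actual artt/thaisense code:

```python
def thai_tokenize(text: str) -> list[str]:
    """Placeholder segmenter -- swap in ICU, AttaCut, etc.
    Splitting on spaces is only meaningful for already-segmented text."""
    return text.split()

def build_document(doc_id: str, raw_text: str) -> dict:
    """Index both the untokenized text (catches proper names the dictionary
    misses) and a space-joined tokenized copy, in the same document."""
    return {
        "id": doc_id,
        "text_raw": raw_text,
        "text_tokenized": " ".join(thai_tokenize(raw_text)),
    }

def build_search_params(query: str) -> dict:
    """Duplicate the query: keep the raw form and append a tokenized form,
    then search across both fields (Typesense q / query_by parameters)."""
    tokenized_query = " ".join(thai_tokenize(query))
    return {
        "q": f"{query} {tokenized_query}",
        "query_by": "text_raw,text_tokenized",
    }

doc = build_document("1", "ตัวอย่างข้อความภาษาไทย")  # placeholder Thai text
params = build_search_params("ตัวอย่าง")             # placeholder query
```

The doubled index space mentioned above comes from storing both variants; the >50-character cutoff would just be a filter applied before writing the raw copy.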
@artt I haven't forgotten about this, I just got caught up with some last-mile issues for the 0.22 release that we are going to publish soon.
Have you tried any of the neural-network-based word segmentation libraries? Of the projects I found, https://github.com/PyThaiNLP/attacut seems promising. I'm increasingly of the opinion that dictionary-based approaches are not good, and that a neural network will provide better segmentation, though the downside is that the segmentation happens outside Typesense, so it is not convenient (unless we can integrate PyTorch into Typesense). In any case, I would love to hear how AttaCut performs compared with the examples you have tried with dictionary-based approaches.
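For reference (not from the thread), a minimal way to try AttaCut, assuming the `tokenize` helper its README exposes; the sample string is a placeholder rather than one of the problem phrases discussed here:

```python
# pip install attacut
from attacut import tokenize

text = "ตัวอย่างข้อความภาษาไทย"  # placeholder Thai text
words = tokenize(text)            # returns a list of segmented words
print(words)
```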
Thanks @kishorenc. I think the issue is a bit more complex than the choice of tokenizer. Let me elaborate. Let's say the document contains […]
These are the problems that I recall. It's been months, so I might be missing something. Anyway, I ended up creating a middleware (called Thaisense) that does these things: […]
The result is usable, but not great, in particular because ranking is determined by the number of tokens found (and so it counts every word that starts with […]).

The ideal situation, of course, is if we can do infix search for these languages. That's what the user (me, for example) expects. Then we can just not worry about tokenizing at all. ;)
Description
Typesense should be able to index and search on Thai text.
@artt Let's use this issue to continue the discussion on #228.
I'm still unable to compile ICU with the updated dictionary file that you attached to the other thread. Can you please double check that the file that you attached is indeed the one that you were able to build successfully earlier?
This is what I did: […]

This is the error I get: […]

I then ran the `gendict` command that the ICU build was invoking directly, and I got the same error: […] Code point `U+002e` in the file is the full stop character. I removed it, then similar errors occurred for hyphens, slashes, numbers, etc., and I removed them one by one until I ended up with another error: […]

I think I should have deleted the lines containing these bad characters rather than just deleting the characters themselves, but I will pause here -- hoping that perhaps you had attached the wrong file :)
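(Illustration, not the actual steps used here.) One way to do that cleanup in bulk is to drop every dictionary line containing a character outside the Thai block (U+0E00–U+0E7F), instead of editing code points one by one; the file names and the `word<TAB>weight` line layout are assumptions:

```python
# Hypothetical cleanup pass over a gendict source file: drop whole lines whose
# word contains anything outside the Thai Unicode block, so full stops,
# hyphens, slashes, digits, etc. never reach gendict.
THAI_START, THAI_END = 0x0E00, 0x0E7F

def is_thai_only(word: str) -> bool:
    return all(THAI_START <= ord(ch) <= THAI_END for ch in word)

with open("thaidict.txt", encoding="utf-8") as src, \
     open("thaidict.clean.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if line.startswith("#"):                    # keep comment lines as-is
            dst.write(line)
            continue
        word = line.rstrip("\n").split("\t")[0]     # assumed "word<TAB>weight" layout
        if word and is_thai_only(word):
            dst.write(line)
```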