Support for writing systems without spaces between words #228
Comments
Or maybe Thai is included in the CJK languages, as in this post? I think at least supporting exact queries (checking whether the query is a substring of the text) would make it acceptable. Another way this could be supported is having the developer segment the words in the document (so that Typesense indexes a segmented version of the text). The problem I find with this approach is that you'd need to segment the query string as well. I could have another API do the segmentation before sending the query to Typesense, but I think it makes more sense for Typesense to do that. I think that's how Algolia implemented theirs. |
@artt I intend to start supporting languages without spaces between words soon. This will involve segmenting the text on word boundaries using ICU or another Unicode- and locale-aware library. I will be happy to tackle Thai as part of it. Would you be interested in beta-testing it when the preview build for that is ready? |
Sure thing! Do you have an approximate timeline for this feature? For other people who might be looking for solutions: for now, my workaround is to segment the sentences before indexing, and to run another API that intercepts InstantSearch requests, segments the query, passes it along, and joins the results back together. |
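A minimal sketch of the middleman workaround described above, assuming the PyThaiNLP library for segmentation and Typesense's plain HTTP search endpoint; the collection name "articles", field name "content", API key, and URL are all hypothetical:

```python
import requests
from pythainlp.tokenize import word_tokenize

TYPESENSE_URL = "http://localhost:8108"
API_KEY = "xyz"  # hypothetical search-only API key

def search_segmented(query: str) -> dict:
    # Segment the raw Thai query into space-separated words before
    # forwarding it, so Typesense sees explicit word boundaries.
    segmented = " ".join(word_tokenize(query, engine="newmm"))
    resp = requests.get(
        f"{TYPESENSE_URL}/collections/articles/documents/search",
        params={"q": segmented, "query_by": "content"},
        headers={"X-TYPESENSE-API-KEY": API_KEY},
    )
    resp.raise_for_status()
    return resp.json()
```

The same idea extends to documents: segment each document's text the same way before indexing, so the indexed tokens and the query tokens line up.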
Hope to get to it within the next few weeks. |
Will it also apply to Korean? I'm ready to beta-test it, though. |
@kishorenc any updates on this front? |
Yes, some nice progress has been made. I think I will be able to show an initial preview early next week! |
@iicdii @artt Early preview is ready. Example:
You can now query the text like this (example using the Thai language field):
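(A minimal sketch of what indexing and querying might look like with this preview, assuming the field-level locale setting and the typesense-python client; the collection name, field names, and sample text are hypothetical.)

```python
import typesense

client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "xyz",
    "connection_timeout_seconds": 2,
})

# Marking the field with a Thai locale lets the engine segment it on
# word boundaries at index time.
client.collections.create({
    "name": "articles",
    "fields": [{"name": "content", "type": "string", "locale": "th"}],
})

client.collections["articles"].documents.create({
    "content": "นี่เป็นประโยคตัวอย่าง",  # roughly: "This is a sample sentence."
})

# The query text itself needs no spaces between words.
results = client.collections["articles"].documents.search({
    "q": "ตัวอย่าง",  # "sample"
    "query_by": "content",
})
```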
Please give it a whirl and let me know how it works. I'm certainly expecting to iterate on it a bit more before we perfect it. |
@kishorenc Thanks for this! I played around with it for about an hour, and here are some of the things I noticed:

Mixing languages

A lot of times, content in, say, Thai, would have English words in it too. An example:

Right now, the match wouldn't appear unless you have the correct capitalization and type in the whole word. For example, querying "discre", "Discre", or "discrete" wouldn't match, but "Discrete" would.

I realize it's easy to say and there's probably a ton that must go into consideration, but hopefully it's a start.

Prioritizing exact match

Not an NLP expert here, but when words are combined, they take on a specific meaning. Right now, words are tokenized and the original word is disregarded. It would be nice to give priority to exact matches. Example in English: the query "goalpost" would be tokenized into "goal" and "post". Suppose we have two documents:

Both docs would give a hit and currently get the same priority, when Doc B should get higher priority. An example in Thai: the query "รายได้" (income) got tokenized into "ราย" and "ได้".
Thanks again! Please let me know if you have any questions! |
@kishorenc Hello! I tried both my local environment and the most-used online address finder in Korea (Daum Post Code Service), and here are the results.

Get started

We will try to find "서울특별시 관악구".

Case 1 - First consonant does not match

Typesense 0.20.0.rc28: When I type "서울특별시 ㄱ", Typesense gave me results starting with "서울특별시 성북구", which are not related to my keyword. 🤔

Daum Post Code Service: This is the expected result. They gave me results like "서울특별시 강남구", which match "서울특별시 ㄱ".

Case 2 - First vowel does not match

Typesense 0.20.0.rc28 vs. Daum Post Code Service: This is not so bad, but Typesense does not seem to care about the vowel. When we typed ㄱ + ㅗ = 고, Daum already found "서울특별시 관악구", which is our final keyword.

Case 3 - Unnecessary results

Typesense 0.20.0.rc28 vs. Daum Post Code Service: When I typed up to "서울특별시 관", they found it! I think it's the best-matched case so far. But "서울특별시 은평구 blah" entries are listed in the results, which are unnecessary.

Hope it helps with your work. Feedback and questions are welcome! :) |
A quick note of thank you to both of you for documenting these. I'm now going to go through them in detail and will get back to you. |
Hi @kishorenc, any update on this? We're moving closer to the launch date and I'm afraid my own patched-together solution won't make the cut :P |
I've a new build for you. This build addresses the issue you noticed about English text not being normalized when present amidst Thai text. But I'm not sure we can do language detection so easily; there are lots of edge cases around doing this reliably. For now, I'm hoping this fix is sufficient for your needs. I could not reproduce the other issue you mentioned with exact match involving "รายได้". I indexed those 2 example sentences as two separate documents, and when I queried, the correct text was returned. If you can reproduce the problem with a specific dataset, that would be useful. Check this example: https://gist.github.com/kishorenc/fe9fd5587f5e8e758da8b3d81091f447 |
@iicdii I've a new build which fixes the issues with prefix searches that you noticed. I have written tests with the examples you've provided. The unnecessary results you noticed are because of Typesense's

Please try it out and let me know if it is better with this build. |
I tried it. Two questions:
Edit: Not sure if it's just me, but do you guys feel like it's running a bit slow? Maybe I need to restart my computer. |
@artt I will be happy to upgrade your Typesense Cloud cluster to the RC build. Can you please email us with the cluster ID? Let's keep the issue open till 0.21 is out. Regarding speed, if you can provide some more details, I will be happy to look -- feel free to elaborate further over email. |
Ah, I see what the problem is now. When Thais (that I know, at least) search for things on the internet, they put spaces between words so that the search engine can do its job better. For example, if I want to search for

Not sure if this is the case for other languages. Nonetheless, I think it would be great to have an option (when sending a query to Typesense) to specify whether or not Typesense should break up the query. So instead of thinking the user is searching for

This would greatly improve the stability of the search results, since Typesense wouldn't have to break words each time a new character is typed in (which could vary greatly, as in the examples), and I think it would fix all the problems mentioned above. Again, thank you so much! |
One more thing that might be useful, if you want to skip the "option" part (or maybe provide it as the default): if the query is longer than X characters (could depend on the language... maybe the 90th percentile of word length) and there's no space in the query, then Typesense should break the query up. A rough sketch of this heuristic is below. |
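A tiny sketch of that heuristic, with a hypothetical length threshold:

```python
MAX_UNSEGMENTED_LEN = 10  # hypothetical; e.g. ~90th percentile of Thai word length

def should_segment(query: str) -> bool:
    # Break the query up only if it is long and contains no spaces;
    # otherwise trust the user's own word boundaries.
    return " " not in query and len(query) > MAX_UNSEGMENTED_LEN
```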
@artt That makes sense: I've added a pre_segmented_query parameter. You can try this out on this build: |
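A minimal sketch of the flag in use, assuming the typesense-python client and the hypothetical "articles"/"content" schema from earlier; with pre_segmented_query enabled, Typesense splits the query only on the spaces the caller provides instead of re-segmenting it:

```python
import typesense

client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "xyz",
})

results = client.collections["articles"].documents.search({
    "q": "ประโยค ตัวอย่าง",       # caller supplies the word boundaries
    "query_by": "content",
    "pre_segmented_query": True,  # split only on spaces
})
```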
Did you mean to say I should set

Still found some issues with this, though. May I ask what library/tool you're using to segment the words? Perhaps that would allow me to help you debug better. Thanks! |
Yes, but that flag is available only on this build: typesense/typesense:0.21.0.rc13 |
Sorry, what I meant to say was, "I tried setting it to

So I think what you meant to say was:

Which I think makes sense. Also, if you could point out which word segmentation library you're using, I can help debug some more :) Thanks a lot! |
Got it. I've updated my comment above to fix that. We use ICU for word segmentation: https://github.com/unicode-org/icu |
Thanks! After trying out ICU's word segmentation here, I think the issue is with the word segmentation tool. You can find a list of widely used Thai word segmentation libraries in this repo. Having talked to NLP experts I know, close to nobody uses ICU for Thai word segmentation. The few I've tried all thought

Having looked into NLP libraries, most let users specify which word segmentation "engine" to use (example). This might be a bit much to ask, but would it be possible to

Please let me know if any of these is something you might pursue. Otherwise, I'll have to go back to having a custom middleman to parse and tokenize queries. |
@artt I looked at the DeepCut library, but it is a Python library. I don't mind integrating a separate library that works better for Thai, but it must be a C/C++ library. Otherwise, it might just be better to do word and query segmentation outside Typesense and send space-separated text directly into Typesense. This way, you can rely on any library you want externally. If there is anything we can do to make this type of external segmentation easier, let me know. |
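A minimal sketch of that external-segmentation approach, assuming PyThaiNLP (one of the libraries from the list linked above); documents are converted to space-separated words before indexing, and queries must get the same treatment:

```python
from pythainlp.tokenize import word_tokenize

def presegment(text: str) -> str:
    # "newmm" is PyThaiNLP's default dictionary-based engine.
    return " ".join(word_tokenize(text, engine="newmm"))

doc = {"content": presegment("นี่เป็นประโยคตัวอย่าง")}
# Index `doc` into Typesense as usual; at query time, run the user's
# query through presegment() too, and search with pre_segmented_query=true
# so Typesense splits only on the spaces we inserted.
```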
Thank you so much for all your prompt replies. I think the ease of being able to deploy a single service should come first if we want Typesense to be user-friendly. I think the ICU library should work well given a good enough dictionary. Do you think it would be possible to locally replace ICU's dictionary with the one mentioned in the above PR? https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/words_th.txt |
Let me check whether there is a way to replace the dictionary used by ICU.
Interestingly, Google Translate also splits it as
Likewise Google Translate again splits this into 3 segments: |
Thanks @kishorenc! I think upgrading and letting the user customize (perhaps append additional words to) the dictionary would be nice in terms of flexibility and ease of use.

As for Google Translate, I'm no expert, but I believe that to see whether Google thinks it's a word (i.e. the word "appears" in Google's dictionary), you need to also look at the bottom-most (bottom-right for PC) panel. For "ลื่", I believe what happened was that Google was trying to make its best guess at what the user was trying to say. You can see that "ลื่น" (which actually means "slippery") is the top suggested word:

When you press Enter, this is what comes up:

Compare this to when you type in the whole word:

In any case, I checked both PyThaiNLP's and ICU's dictionaries and can confirm that

This is what Google has for

Again, thank you so much for all your time and effort making this happen! |
Thanks for the drill-down. To summarise, we have two examples that we can use as tests for ensuring that the tokenizer works on an updated word list: a)

I will try building ICU with that PR and see if the above two issues are fixed. Otherwise, we might have to write our own word-splitting code, which might not be a trivial effort (not sure if it will be as simple as matching the longest subsequence present in the dictionary file). |
Not sure how to use ICU, but apparently you can supply your own dictionary: https://manpages.debian.org/unstable/icu-devtools/gendict.1.en.html Hopefully this will help make things a bit easier for you :) Thanks! |
@artt I'm unable to compile ICU with the dictionary file from the PR. It fails with some Unicode parsing errors. That leaves us with either requiring external segmentation using one of the Python-based NLP libraries, or dictionary-based segmentation using a maximal-matching algorithm. I can take a stab at the second option, but I don't know what kind of effort and iteration that's going to take, and it might not be ready in time for your release. External segmentation might be a better approach for now, especially if there is anything we can do to make it easier. |
Yeah, I noticed those checks not passing in the PR too. Not sure if you saw my earlier comment about using gendict. |
@artt Gendict will also use the same underlying code, from what I can tell. If you can spare some cycles, we can quickly validate the approach with maximal matching. If you are interested, I can outline the Python code and we can verify whether it works. |
From what I've seen, bi-directional maximal matching has pretty good accuracy, and it seems simple enough to implement. I found a Python code snippet that works on Chinese text here: http://www.cxyzjd.com/article/allan2222/99549090

We need to make that work for Thai using the custom dictionary you linked to earlier. The code itself needs only a couple of changes to make that happen (a sketch of the idea follows below):

If the results look promising with this approach, I can convert that code to C++ and integrate it into Typesense. |
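A rough sketch of dictionary-based bi-directional maximal matching, in the spirit of the Chinese snippet linked above but loading the PyThaiNLP word list (words_th.txt) mentioned earlier; this illustrates the algorithm only and is not Typesense's implementation:

```python
def load_dict(path: str) -> set[str]:
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def forward_mm(text: str, words: set[str], max_len: int) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        # Greedily take the longest dictionary word starting at i,
        # falling back to a single character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in words or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

def backward_mm(text: str, words: set[str], max_len: int) -> list[str]:
    tokens, j = [], len(text)
    while j > 0:
        # Same idea, scanning from the end of the string.
        for i in range(max(0, j - max_len), j):
            if text[i:j] in words or i == j - 1:
                tokens.append(text[i:j])
                j = i
                break
    return tokens[::-1]

def bimm(text: str, words: set[str]) -> list[str]:
    max_len = max(map(len, words))
    fwd = forward_mm(text, words, max_len)
    bwd = backward_mm(text, words, max_len)
    # Common heuristic: prefer the split with fewer tokens, then with
    # fewer single-character leftovers.
    def score(t): return (len(t), sum(len(w) == 1 for w in t))
    return min(fwd, bwd, key=score)

words = load_dict("words_th.txt")  # the PyThaiNLP word list linked earlier
print("|".join(bimm("นี่เป็นประโยคตัวอย่าง", words)))
```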
Sorry for having gone MIA. Been busy with other projects. If I have some time I'll try to build ICU and see what the problem might be. Should be less work for you guys. |
@artt No worries. I hope you can figure this out, because Chrome itself uses the ICU library for on-screen text selection (on double-click). As you can see from the screenshot below, it suffers from the same bug: |
@kishorenc I built ICU on my machine and indeed

I've attached the file here so you can try building it. I'll give the guys in the ICU repo a heads-up on this as well. |
Quick test result with typesense/typesense:0.21.0.rc10

I want to find "수지구 죽전동".

Almost full text input - 수지구 죽전: It works!

Partial text input - 수지구 죽: The first result is quite far off from the keyword. |
I can't really search the Khmer word here, and below is an example. I think the splitting doesn't work really well with Unicode or with languages like the above. I was trying to search for ទិញថ្នាំ within the context below, but there's no result. If I try to search for មនុស្ស it works, but it highlights the whole phrase មនុស្សដែលបានឃើញគាត់អាចផ្តល់ការពិពណ៌នាបាន. If it supports a specific language, it may still not work for some others. I also noticed that I cannot search for 5859 or 2304 within 190923045859.

Is there any solution for this? Does ICU work in this case? I also tried to search using pre_segmented_query=true.

គណនីនិយាយ ឬសរសេររបស់បុគ្គល វត្ថុ ឬព្រឹត្តិការណ៍។
មនុស្សដែលបានឃើញគាត់អាចផ្តល់ការពិពណ៌នាបាន
វាងាយស្រួលសើចណាស់ក្នុងការទិញថ្នាំគ្រប់ការពិពណ៌នា
|
Late to the chat. Seems like there has been a lot of progress on multi-language support. I just pulled the latest version, 0.24.0.rc4, and it works great with Chinese! I've only tested on a small dataset so far, but I will be adding a database for a dictionary. I like how easy Typesense is to use. I just guessed that the locale would be zh. If there is any documentation or code I can review to learn about any other relevant factors, that would be helpful. Thank you for all of the great work! |
I'm maintaining a word tokenizer, and I want to support Khmer. Still, since I don't speak Khmer, I don't know if I made it right. Can you provide a few segmentation examples at veer66/chamkho#4? If you want, maybe you can use my tokenizer with this project. |
I'm not sure about the ICU, but you can turn on infix search for that specific field and use the infix fallback when querying. You can check the documentation for this infix option; a rough sketch follows below.
|
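A minimal sketch of the infix suggestion above, assuming the typesense-python client; collection and field names are hypothetical, and the field must be indexed with infix enabled:

```python
import typesense

client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "xyz",
})

client.collections.create({
    "name": "notes",
    "fields": [{"name": "content", "type": "string", "infix": True}],
})

client.collections["notes"].documents.create({"content": "190923045859"})

# "fallback" tries normal prefix matching first, then falls back to a
# substring (infix) match -- e.g. finding 5859 inside 190923045859.
results = client.collections["notes"].documents.search({
    "q": "5859",
    "query_by": "content",
    "infix": "fallback",
})
```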
I mean you can help me, and use https://github.com/veer66/chamkho instead of ICU. |
How? What is it about?
|
Chamkho is a word tokenizer. It works like below:

> echo 'មនុស្សដែលបានឃើញគាត់អាចផ្តល់ការពិពណ៌នាបាន' | target/release/wordcut -l khmer
មនុស្ស|ដែល|បានឃើញ|គាត់|អាច|ផ្តល់|ការ|ពិពណ៌នា|បាន
My problem is that I don't know if មនុស្ស|ដែល|បានឃើញ|គាត់|អាច|ផ្តល់|ការ|ពិពណ៌នា|បាន is correct. So you can help me by correcting its result. We can talk more at veer66/chamkho#4 because this may be off-topic for this issue. |
I replied there
|
Your cutting is correct.
|
Description
I'm trying to use Typesense with my content in Thai. What's special is that Thai (and a few other languages) doesn't use spaces to separate words. Typesense seems to care about that.
To avoid confusion, let's say my content is "This is a sample sentence." In Thai, it would look something like:
Querying "sample" wouldn't give me a hit. I'm wondering if there's an option to enable this :(
Steps to reproduce
You can reproduce this by searching for something from the middle of an English word. Go to the demo and search for "seradish".
Expected Behavior
It should return results with "horseradish" as well.
Actual Behavior
It doesn't.