
Support for writing systems without spaces between words #228

Open · artt opened this issue Feb 25, 2021 · 49 comments

@artt
Contributor

artt commented Feb 25, 2021

Description

I'm trying to use Typesense with content in Thai. What's special is that Thai (like a few other languages) doesn't use spaces to separate words, and Typesense seems to rely on spaces for tokenization.

To avoid confusion, let's say my content is, "This is a sample sentence." In Thai, it would look something like:

thisisasamplesentence

Querying "sample" wouldn't give me a hit. I'm wondering if there's an option to enable this :(

Steps to reproduce

You can reproduce this by searching for something in the middle of an English word. Go to the demo and search for "seradish".

Expected Behavior

It should return results with "horseradish" as well.

Actual Behavior

It doesn't.

@artt
Contributor Author

artt commented Feb 25, 2021

Or maybe Thai is included in the CJK languages as in this post? I think at least supporting exact queries (checking whether the query is a substring of the text) would make it acceptable.

Another way this could be supported is having the dev segment the words in the document (so that Typesense indexes a segmented version of the text).

The problem I find with this approach is that you'll need to segment the query string as well. I could have another API that would do the segmentation before sending the query to Typesense, but I think it makes more sense for Typesense to do that. I think that's how Algolia implemented theirs.

@kishorenc
Member

@artt I intend to start supporting languages without spaces between words soon. This will involve segmenting the text on word boundaries using ICU or another Unicode + locale-aware library. I will be happy to tackle Thai as part of it. Would you be interested in beta-testing it when the preview build is ready?

@artt
Contributor Author

artt commented Feb 26, 2021

Sure thing! Do you have an approximate timeline for this feature?

For other people who might be looking for solutions: for now, my workaround is to segment the sentences before indexing, and to create another API that intercepts InstantSearch requests, segments the query, passes it along, and joins the results back together.
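
A minimal sketch of that workaround, assuming PyThaiNLP for segmentation and the standard Typesense HTTP API (the collection name "posts" and field "content" are made up for illustration):

import requests
from pythainlp.tokenize import word_tokenize

TYPESENSE = "http://localhost:8108"
HEADERS = {"X-TYPESENSE-API-KEY": "xyz"}

def segment(text):
    # Insert spaces at word boundaries so Typesense can tokenize on them.
    return " ".join(word_tokenize(text, engine="newmm"))

def index_document(doc):
    # Index a segmented copy of the Thai text.
    doc["content"] = segment(doc["content"])
    return requests.post(f"{TYPESENSE}/collections/posts/documents",
                         headers=HEADERS, json=doc)

def search(query):
    # Segment the query the same way before passing it along.
    params = {"q": segment(query), "query_by": "content"}
    return requests.get(f"{TYPESENSE}/collections/posts/documents/search",
                        headers=HEADERS, params=params).json()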

@kishorenc
Member

Do you have an approximate timeline for this feature?

Hope to get to it within the next few weeks.

@iicdii

iicdii commented Mar 24, 2021

Will it also apply to Korean? I'm ready to beta-test it, though.

@artt
Contributor Author

artt commented Mar 29, 2021

@kishorenc any updates on this front? :octocat:

@kishorenc
Member

Yes, some nice progress has been made. I think I will be able to show an initial preview early next week!

@kishorenc
Member

kishorenc commented Apr 5, 2021

@iicdii @artt An early preview is ready in the typesense/typesense:0.20.0.rc28 Docker image. Can you please try it?

Example:

curl -k "http://localhost:8108/collections" -X POST -H "Content-Type: application/json" \
      -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -d '{
        "name": "titles", "num_memory_shards": 4,
        "fields": [
          {"name": "title_en", "type": "string" },
          {"name": "title_ko", "type": "string", "locale": "ko" },
          {"name": "title_th", "type": "string", "locale": "th" },
          {"name": "title_ja", "type": "string", "locale": "ja" },
          {"name": "points", "type": "int32" }
        ],
        "default_sorting_field": "points"
      }'

curl "http://localhost:8108/collections/titles/documents?dirty_values=coerce_or_reject" -X POST \
        -H "Content-Type: application/json" \
        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
        -d '{"points":"1","title_en": "Quick brown fox jumped over the lazy dog.", "title_ko": "빠른 갈색 여우가 게으른 개를 뛰어 넘었습니다.", "title_th": "สุนัขจิ้งจอกสีน้ำตาลอย่างรวดเร็วกระโดดข้ามสุนัขขี้เกียจ", "title_ja": "速い茶色のキツネは怠惰な犬を飛び越えました。"}'

You can now query the text like this (example using the Thai language field):

curl "http://localhost:8108/collections/titles/documents/search?q=สุนัขจิ้งจอก&query_by=title_th&x-typesense-api-key=${TYPESENSE_API_KEY}&prefix=false"

Please give it a whirl and let me know how it works. I'm certainly expecting to iterate on it a bit more before we perfect it.

@artt
Contributor Author

artt commented Apr 7, 2021

@kishorenc Thanks for this! I played around with it for about an hour and here are some of the things I noticed:

Mixing languages

A lot of the time, content in, say, Thai will have English words in it too. An example:

ผู้เขียนมีความสนใจเกี่ยวกับ Discrete Math และการคำนวณโดยทั่วไป

Right now, a match doesn't appear unless you use the correct capitalization and type the whole word. For example, querying "discre", "Discre", or "discrete" doesn't match, but "Discrete" does.

  • A suggestion: instead of specifying a locale for each field, the indexer could automatically identify the language of each chunk of text within the field (by its Unicode range, maybe)
  • The same could be done for the query
  • Then for non-CJKT languages, use default matching

I realize this is easy to say and there is probably a ton that must go into consideration, but hopefully it's a start.

Prioritizing exact match

I'm not an NLP expert, but when words are combined they take on a specific meaning. Right now, words are tokenized and the original compound is disregarded. It would be nice to give priority to exact matches.

Example in English: the query "goalpost" would be tokenized into "goal" and "post". Suppose we have two documents:

  • Doc A The post office is open, and the goal of this article is bla bla
  • Doc B The player reached the goalpost

Both docs give a hit and currently get the same priority, when Doc B should rank higher.

An example in Thai: the query "รายได้" (income) gets tokenized into "ราย" and "ได้"

  • Doc A ข้อมูลรายคนหรือรายบริษัทในการเชื่อมโยงส่วนได้ส่วนเสีย
  • Doc B ติดกับดักรายได้ปานกลาง

Thanks again! Please let me know if you have any questions!

@iicdii

iicdii commented Apr 7, 2021

@kishorenc Hello! I tried this both in a local environment and against the most widely used online address finder in Korea (the Daum Post Code Service); here are the results.
The Google Sheets address data I used for testing: https://docs.google.com/spreadsheets/d/1aXq4ISWG-ionVc8b0SPgASxJi7WdbQ8oz2pJ56zooCY

  • Note that the spreadsheet DB is not exactly the same as the Daum Post Code Service data, but it is very similar.

Get started

We will try to find "서울특별시 관악구".
Remember that when searching for "서울특별시 관악구" in Korean, a user types it incrementally, like this:
서울특별시
서울특별시 ㄱ
서울특별시 고
서울특별시 과
서울특별시 관

Case 1 - First consonant does not match

Typesense:0.20.0.rc28

[screenshot]

When I type "서울특별시 ㄱ", Typesense gives me results starting with "서울특별시 성북구", which are not related to my keyword. 🤔
The first consonant of the results should be "ㄱ", not "ㅅ".

Daum Post Code Service

[screenshot]

This is the expected result. Daum gave me results like "서울특별시 강남구", which match the typed prefix.

Case 2 - First vowel does not match

Typesense:0.20.0.rc28

[screenshot]

Daum Post Code Service

[screenshot]

This is not so bad, but Typesense does not seem to take the vowel into account. When we typed ㄱ + ㅗ = 고, Daum had already found "서울특별시 관악구", which is our final keyword.

Case 3 - Unnecessary result

Typesense:0.20.0.rc28

[screenshot]

Daum Post Code Service

[screenshot]

When I typed as far as "서울특별시 관", it found it! I think this is the best-matched case so far. But entries like "서울특별시 은평구 blah" are also listed in the results, which is unnecessary.

Hope this helps your work. Feedback and questions are welcome! :)

@kishorenc
Member

A quick note of thank you to both of you for documenting these. I'm now going to go through them in detail and will get back to you.

@artt
Contributor Author

artt commented May 27, 2021

Hi @kishorenc, any update on this? We're moving closer to the launch date and I'm afraid my own patched-together solution won't make the cut :P

@kishorenc
Member

@artt

I have a new build for you, typesense/typesense:0.21.0.rc9:

This build addresses the issue you noticed about English text not being normalized when present amidst Thai text. I'm not sure we can do language detection that easily, though; there are lots of edge cases around doing it reliably. For now, I'm hoping this fix is sufficient for your needs.

I could not reproduce the other issue you mentioned with exact matches involving "รายได้". I indexed those two example sentences as two separate documents, and when I queried, the correct text was returned. If you can reproduce the problem with a specific dataset, that will be useful. Check this example: https://gist.github.com/kishorenc/fe9fd5587f5e8e758da8b3d81091f447

@kishorenc
Member

@iicdii I have a new build which fixes the issues with prefix searches that you noticed: typesense/typesense:0.21.0.rc10

I have written tests with the examples you've provided. The unnecessary results you noticed are because of Typesense's typo_tokens_threshold parameter (default: 10), which relaxes the search query to find more results until that threshold is reached (see the sketch below).

Please try it out and let me know if it is better with this build.
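
For anyone tuning this, a request that tightens the threshold might look like the following sketch (the collection name "addresses" and field "address" are hypothetical):

import requests

resp = requests.get(
    "http://localhost:8108/collections/addresses/documents/search",
    headers={"X-TYPESENSE-API-KEY": "xyz"},
    params={
        "q": "서울특별시 관",
        "query_by": "address",
        # Lower values relax the query less aggressively, so fewer
        # loosely-matched results are pulled in (default: 10).
        "typo_tokens_threshold": 1,
    },
)
print(resp.json())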

@artt
Contributor Author

artt commented Jun 7, 2021

@kishorenc

I tried 0.21.0.rc10 and it seems to have addressed all my problems!

Two questions

  1. Does Typesense Cloud support these beta versions?
  2. Should I close this issue or wait until 0.21.0 is out?

Edit: Not sure if it's just me, but do you guys feel like it's running a bit slow? Maybe I need to restart my computer.

@kishorenc
Member

@artt I will be happy to upgrade your Typesense cloud cluster to the RC build. Can you please email us with the cluster ID?

Let's keep the issue open till 0.21 is out.

Regarding speed, if you can provide some more details, I will be happy to look -- feel free to elaborate further over email.

@artt
Contributor Author

artt commented Jun 7, 2021

@kishorenc

Thanks so much for this! I haven't set up Typesense Cloud yet. Will let you know when I have a chance to do that.

Seems to perform better after a restart. So no worries about the speed issue.

Found some weird results:

Continuing to type what's highlighted changes the results

"ความเหลื่อมล้ำ" (one compound word, formed by "ความ" and "เหลื่อมล้ำ" (which could be divided into "เหลื่อม" and "ล้ำ"

  • Queries "เห", "เหล", "เหลื" (all prefixes of "เหลื่อมล้ำ") found and highlighted "เหลื่อม"
  • Queries "เหลื่" and "เหลื่อ" found 0 results (expect to find เหลื่อม)
  • Query "เหลื่อม" found "เหลื่อมล้ำ"

"กระตุ้น" (one word)

  • All prefixes up to "กระต" found the whole word.
  • "กระตุ" highlighted "ตุ้" and "ตุ้น" depending on the document
    [screenshot]

"มั่งคั่ง" (one word)

  • Query "มั่ง" found the whole word "มั่งคั่ง"
  • Query "มั่งค" resets everything and instead found and highlighted words starting with "ค" (exact same results as when the query was "ค". This happens all the way ("มั่งคั" yields the same result as "คั", "มั่งคั่" yields the same result as "คั่" until you finish typing "มั่งคั่ง" which yields the correct results.

Basically, I think users will notice this: you type something, the top search result is what you expected, but when you continue typing exactly what was highlighted, the top result changes.

Exact matches have a lower score than imperfect matches

Doc A: การกระจายรายได้
Doc B: จารีย์
Query: จารีย์
Result: การกระจายรายได้ was listed first
("จารีย์" is one word. Note also how "จา" was not highlighted in the body, but was highlighted in the title.)
[screenshot]

Hope this helps improve the feature and thanks a lot for your work! Please let me know if I can help.

@kishorenc
Member

@artt

  1. The reason เหลื่ does not surface ความเหลื่อมล้ำ as a match is that the query string เหลื่ is split into เห and ลื่, while the same fine-grained splitting is not done on the original compound word. Even Google Translate seems to do the same (it breaks เหลื่ into H̄e lụ̄̀). I'm not sure how to deal with such context-sensitive splitting.

  2. Same issue with the query prefix กระตุ being split into กระ and ตุ, while the full word กระตุ้น is not split at all.

  3. The word จารีย์ is also being split into จา and รีย์ (again, Google Translate parses it as Cā, rīy̒). However, I could not reproduce the ranking issue. See here. If you can send me a snippet like that showcasing the problem, it will help me troubleshoot.

@artt
Contributor Author

artt commented Jun 7, 2021

@kishorenc

Ah, I see what the problem is now. When Thais (that I know, at least) search for things on the internet, they put spaces between words so that the search engine can do its job better. For example, if I wanted to search for เห and ลื่ (I'm not even sure what those mean), I'd search for เห ลื่ and not เหลื่.

Not sure if this is the case for other languages. Nonetheless, I think it would be great to have an option (when sending a query to Typesense) to specify whether or not Typesense should break up the query. Then, instead of thinking the user is searching for เห+ลื่, Typesense could know that the user is actually typing เหลื่.

This would greatly improve the stability of the search results, since Typesense wouldn't have to re-break words each time a new character is typed (which can vary greatly, as in the examples), and I think it would fix all the problems mentioned above.

Again, thank you so much!

@artt
Contributor Author

artt commented Jun 7, 2021

@kishorenc

One more thing that might be useful if you want to skip the "option" part (or maybe provide it as the default): if the query is longer than X characters (the threshold could depend on the language... maybe the 90th percentile of word length) and contains no spaces, then Typesense should break the query up. A rough sketch follows.
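
A rough client-side sketch of that heuristic (the threshold of 12 is arbitrary, and PyThaiNLP's word_tokenize stands in for whatever segmenter is available):

from pythainlp.tokenize import word_tokenize

def maybe_segment(query, threshold=12):
    # If the user already typed spaces, or the query is short enough to
    # plausibly be a single word, leave it alone.
    if " " in query or len(query) <= threshold:
        return query
    # Otherwise break the long unspaced query into words before searching.
    return " ".join(word_tokenize(query))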

@kishorenc
Member

kishorenc commented Jun 7, 2021

@artt That makes sense: I've added a pre_segmented_query=true|false parameter to the search endpoint to control this behavior. By default, we tokenize the query string the same way we tokenize the text. If you wish to override this behavior, set pre_segmented_query=true. When this parameter is set to true, we simply split the query on spaces and do not apply locale-specific tokenization.

You can try this out on this build: typesense/typesense:0.21.0.rc13. I have already verified that this works on the earlier example, but please confirm.
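
For reference, a pre-segmented request could look like the following sketch (it reuses the titles collection from the earlier example, with PyThaiNLP standing in for any external segmenter):

import requests
from pythainlp.tokenize import word_tokenize

query = " ".join(word_tokenize("ความเหลื่อมล้ำ"))  # e.g. "ความ เหลื่อมล้ำ"

resp = requests.get(
    "http://localhost:8108/collections/titles/documents/search",
    headers={"X-TYPESENSE-API-KEY": "xyz"},
    params={
        "q": query,
        "query_by": "title_th",
        # The query is already segmented: split on spaces only and skip
        # locale-specific tokenization.
        "pre_segmented_query": "true",
    },
)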

@artt
Contributor Author

artt commented Jun 8, 2021

@kishorenc

Did you mean to say I should set pre_segmented_query=true? I tried setting it to false and got the same behavior as in .rc10.

Still found some issues with this, though. May I ask what library/tool you're using to segment the words? Perhaps that would allow me to help you debug better. Thanks!

@kishorenc
Member

Yes, but that flag is available only on this build:

typesense/typesense:0.21.0.rc13

@artt
Contributor Author

artt commented Jun 8, 2021

Sorry, what I meant to say was: "I tried setting it to false and still get the same behavior [as in rc10, as well as in rc13 when I didn't include this parameter]. But when I set it to true, I get a different behavior (as you described)." ;)

So I think what you meant to say was,

If you wish to override this behavior, set pre_segmented_query=true.

Which I think makes sense (pre_segmented_query=true means that Typesense expects the query to be "pre-segmented", so it wouldn't segment the query further). Just wanna make sure other people don't get confused.

Also, if you could point out which word segmentation library you're using, I can help debug some more :) Thanks a lot!

@kishorenc
Member

Got it. I've updated my comment above to fix that.

We use ICU for word segmentation: https://github.com/unicode-org/icu
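
For anyone who wants to inspect ICU's word boundaries directly, here is a small sketch using the PyICU bindings (assuming PyICU is installed and behaves as documented; this just reproduces ICU's segmentation outside Typesense for debugging):

from icu import BreakIterator, Locale

def icu_segment(text, locale="th"):
    bi = BreakIterator.createWordInstance(Locale(locale))
    bi.setText(text)
    tokens, start = [], 0
    for end in bi:  # iterating yields successive boundary offsets
        tokens.append(text[start:end])
        start = end
    return tokens

print("|".join(icu_segment("ความเหลื่อมล้ำ")))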

@artt
Contributor Author

artt commented Jun 8, 2021

@kishorenc

Thanks! After trying out ICU's word segmentation here, I think the issue is with the word segmentation tool itself. You can find a list of widely used Thai word segmentation libraries in this repo. Having talked to the NLP experts I know, almost nobody uses ICU for Thai word segmentation.

The few I've tried all treat เหลื่ as one word (เห and ลื่ don't have meanings, as far as I can tell), and segment ความเหลื่อมล้ำ as ความ|เหลื่อมล้ำ, whereas ICU splits it into ความ|เหลื่อม|ล้ำ. This is why it fails when the query is เหลื่อมล (and pre_segmented_query=true).

Having looked into NLP libraries, most let users specify which word segmentation "engine" to use (example); a quick sketch of that is below.
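
For instance, PyThaiNLP exposes the engine as a per-call argument (assuming PyThaiNLP is installed; the expected output is the segmentation reported above):

from pythainlp.tokenize import word_tokenize

# newmm is the default dictionary-based engine; others such as "deepcut"
# can be selected per call if the backing library is installed.
print(word_tokenize("ความเหลื่อมล้ำ", engine="newmm"))
# expected, per the libraries tried above: ['ความ', 'เหลื่อมล้ำ']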

This might be a bit much to ask, but would it be possible to

  1. Switch to language-specific word segmentation libraries that might perform better than ICU's. The best-performing one seems to be DeepCut. This would be the easiest solution and the most user-friendly. Optionally, you could bundle several libraries and let the developer choose which "engine" to use.
  2. Let the developer supply their own executable and specify which word segmentation binary to use when spinning up Typesense... maybe as a parameter, e.g. typesense --segmentcommand="deepcut %" or something along those lines, with a predetermined output format, such as using | to separate words.
  3. Since ICU's word segmentation algorithm is dictionary-based and performs poorly compared even to other dictionary-based libraries, it seems ICU's dictionary is not very good. Perhaps let the developer supply their own dictionary and use a longest-matching algorithm. There seems to be a two-year-old PR for this that hasn't been merged yet for some reason.

Please let me know if any of these is something you might pursue. Otherwise, I'll have to go back to having a custom middleman parse and tokenize queries.

@kishorenc
Member

@artt I looked at the DeepCut library, but it is a Python library. I don't mind integrating a separate library that works better for Thai, but it must be a C/C++ library. Otherwise, it might just be better to do word and query segmentation outside Typesense and send space-separated text directly in; that way, you can rely on any external library you want. If there is anything we can do to make this type of external segmentation easier, let me know.

@artt
Contributor Author

artt commented Jun 8, 2021

Thank you so much for all your prompt replies.

I think the ease of deploying a single service should come first if we want Typesense to be user-friendly.

I think the ICU library should work well given a good enough dictionary. Do you think it would be possible to locally replace ICU's dictionary with the one mentioned in the PR above?

https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/words_th.txt

@kishorenc
Member

Let me check whether there is a way to replace the dictionary used by ICU.

The few I've tried all treat เหลื่ as one word (เห and ลื่ don't have meanings, as far as I can tell),

Interestingly, Google Translate also splits it as เห | ลื่ and translates "ลื่" as "slip" in English:

[screenshot]

and segment ความเหลื่อมล้ำ as ความ|เหลื่อมล้ำ, whereas ICU splits it into ความ|เหลื่อม|ล้ำ.

Likewise, Google Translate again splits this into 3 segments:

[screenshot]

@artt
Contributor Author

artt commented Jun 8, 2021

Thanks @kishorenc! I think upgrading the dictionary and letting the user customize it (perhaps by appending additional words) would be nice in terms of flexibility and ease of use.

As for Google Translate: I'm no expert, but I believe that to see whether Google thinks something is a word (i.e. the word "appears" in Google's dictionary), you also need to look at the bottom-most panel (bottom-right on PC). For "ลื่", I believe Google was making its best guess at what the user was trying to say. You can see that "ลื่น" (which actually means "slippery") is the top suggested word:

[screenshot]

When you press Enter, this is what comes up:

[screenshot]

Compare this to when you type in the whole word:

[screenshot]

In any case, I checked both PyThaiNLP's and ICU's dictionaries and can confirm that ลื่ is not in there.

This is what Google has for ความเหลื่อมล้ำ:

[screenshot]

Again, thank you so much for all your time and effort making this happen!

@kishorenc
Member

Thanks for the drill-down. To summarise, we have two examples that we can use as tests to ensure the tokenizer works with an updated word list:

a) ความเหลื่อมล้ำ being split into ความ | เหลื่อมล้ำ
b) เหลื่ NOT being split

I will try building ICU with that PR and see whether the above two issues are fixed. Otherwise, we might have to write our own word-splitting code, which might not be a trivial effort (I'm not sure it will be as simple as matching the longest substring present in the dictionary file).
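
Something like the following could pin those two expectations down as tests (a sketch; tokenize() here is a placeholder for whatever segmenter ends up behind the locale-aware indexing):

def test_compound_split_at_word_boundary():
    assert tokenize("ความเหลื่อมล้ำ") == ["ความ", "เหลื่อมล้ำ"]

def test_incomplete_prefix_not_split():
    assert tokenize("เหลื่") == ["เหลื่"]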

@artt
Contributor Author

artt commented Jun 9, 2021

@kishorenc

Not sure how to use ICU, but apparently you can supply your own dictionary: https://manpages.debian.org/unstable/icu-devtools/gendict.1.en.html

Hopefully this will help make things a bit easier for you :)

Thanks!

@kishorenc
Member

@artt I'm unable to compile ICU with the dictionary file from the PR. It fails with some Unicode parsing errors.

That leaves us with either requiring external segmentation using one of the Python-based NLP libraries, or dictionary-based segmentation using a maximal matching algorithm. I can take a stab at the second option, but I don't know what kind of effort and iteration that will take, and it might not be ready in time for your release. External segmentation might be a better approach for now, especially if there is anything we can do to make it easier.

@artt
Contributor Author

artt commented Jun 9, 2021

Yeah, I noticed the checks not passing in that PR too. Not sure if you saw my earlier comment about using gendict to supply a custom dictionary. Maybe that would work? Otherwise I'll try to see what the problem is and get back to you.

@kishorenc
Member

@artt gendict will also use the same underlying code, from what I can tell.

If you can spare some cycles, we can quickly validate the approach with maximal matching. If you are interested, I can outline the Python code and we can verify whether it works.

@kishorenc
Member

kishorenc commented Jun 9, 2021

From what I've seen, bi-directional maximal matching has pretty good accuracy and seems simple enough to implement. I found a Python code snippet that works on Chinese text here: http://www.cxyzjd.com/article/allan2222/99549090

We need to make that work for Thai using the custom dictionary you linked to earlier. The code itself needs only a couple of changes to make that happen (sketched below):

  1. Use an appropriate self.window_size, which is the maximum length of a word in the dictionary (in this example it is 3, because the longest Chinese word is of length 3)
  2. Use a set() data structure for the dictionary and populate it by reading the dictionary file

If the results look promising with this approach, I can convert the code to C++ and integrate it into Typesense.
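
A rough sketch of the forward pass in Python (the dictionary path is a placeholder; bi-directional matching would also run this from the end of the string backwards and keep whichever segmentation yields fewer tokens):

def load_words(path):
    with open(path, encoding="utf-8") as f:
        words = {line.strip() for line in f if line.strip()}
    return words, max(len(w) for w in words)  # window size = longest word

def forward_max_match(text, words, window_size):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest slice first and shrink until it is in the
        # dictionary; a lone unknown character becomes its own token.
        for size in range(min(window_size, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in words:
                tokens.append(piece)
                i += size
                break
    return tokens

words, window = load_words("words_th.txt")  # e.g. PyThaiNLP's word list
print("|".join(forward_max_match("ความเหลื่อมล้ำ", words, window)))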

@artt
Contributor Author

artt commented Jun 14, 2021

Sorry for going MIA; I've been busy with other projects. If I have some time, I'll try to build ICU and see what the problem might be. That should be less work for you guys.

@kishorenc
Member

@artt No worries. I hope you can figure this out, because Chrome itself uses the ICU library for on-screen text selection (on double-click). As you can see from the screenshot below, it suffers from the same bug:

[screenshot]

@artt
Contributor Author

artt commented Jun 20, 2021

@kishorenc I built ICU on my machine, and indeed gendict was giving an error. Upon further investigation, the errors come from lines with spaces in them. After removing those lines (611 words in all), gendict was able to process the dictionary.

I've attached the file here so you can try building with it. I'll give the folks in the ICU repo a heads-up on this as well.
thaidict_no_space.txt

@iicdii

iicdii commented Jun 20, 2021

@kishorenc

Quick test result with typesense/typesense:0.21.0.rc10

I want to find "수지구 죽전동"

Almost full text input - 수지구 죽전

[screenshot]

It works!

Partial text input - 수지구 죽

[screenshot]

The first result is quite far from the keyword.

@Vichet97

Vichet97 commented Feb 16, 2022


I can't really search for Khmer words here; below is an example. I think the splitting doesn't work very well with the Unicode of languages like the above.

I was trying to search for ទិញថ្នាំ within the context below, but there are no results. If I try to search for មនុស្ស, it works, but it highlights the whole phrase មនុស្សដែលបានឃើញគាត់អាចផ្តល់ការពិពណ៌នាបាន.
If it supports one specific language, it may still not work for some others. I also notice that I cannot search for 5859 or 2304 within 190923045859.

Is there any solution for this? Does ICU work in this case? I also tried searching with pre_segmented_query=true.

គណនីនិយាយ ឬសរសេររបស់បុគ្គល វត្ថុ ឬព្រឹត្តិការណ៍។ មនុស្សដែលបានឃើញគាត់អាចផ្តល់ការពិពណ៌នាបាន វាងាយស្រួលសើចណាស់ក្នុងការទិញថ្នាំគ្រប់ការពិពណ៌នា

@wennals

wennals commented Jun 20, 2022

Late to the chat. It seems like there has been a lot of progress on multi-language support. I just pulled the latest version, 0.24.0.rc4, and it works great with Chinese! I've only tested on a small dataset, but I will be adding a database for a dictionary. I like how easy Typesense is to use; I just guessed the locale would be zh. If there is any documentation or code I can review to learn about any other relevant factors, that would be helpful.

Thank you for all of the great work!

@veer66

veer66 commented Sep 27, 2022

I was trying to search for ទិញថ្នាំ within the context below, but there are no results. …

I'm maintaining a word tokenizer, and I want to support Khmer. Still, since I don't speak Khmer, I don't know whether I got it right. Can you provide a few segmentation examples at veer66/chamkho#4? If you want, maybe you can use my tokenizer with this project.

@Vichet97

Vichet97 commented Sep 27, 2022 via email

@veer66

veer66 commented Sep 27, 2022

I'm not sure about the ICU, but you can turn on infix search for that specific field and use the fallback infix option when querying. You can check the documentation for infix search.

I mean you can help me, and use https://github.com/veer66/chamkho instead of ICU.

@Vichet97

Vichet97 commented Sep 27, 2022 via email

@veer66

veer66 commented Sep 27, 2022

How? What is it about?

Chamkho is a word tokenizer. It works like this:

> echo 'មនុស្សដែលបានឃើញគាត់អាចផ្តល់ការពិពណ៌នាបាន' | target/release/wordcut -l khmer
មនុស្ស|ដែល|បានឃើញ|គាត់|អាច|ផ្តល់|ការ|ពិពណ៌នា|បាន

My problem is that I don't know whether មនុស្ស|ដែល|បានឃើញ|គាត់|អាច|ផ្តល់|ការ|ពិពណ៌នា|បាន is correct, so you can help me by correcting its output. We can talk more at veer66/chamkho#4, since this may be off-topic for this issue.

@Vichet97

Vichet97 commented Sep 27, 2022 via email

@Vichet97

Vichet97 commented Sep 28, 2022 via email
