Unicode character cannot be truncated correctly #9

Open
Lw-Cui opened this issue Aug 14, 2015 · 4 comments

Comments

Lw-Cui commented Aug 14, 2015

Unicode characters, such as Chinese characters, are not truncated correctly.
Please fix it.

Hrily commented Aug 15, 2017

I would like to fix this issue.

Owner

tcorral commented Aug 15, 2017 via email

Hrily commented Aug 15, 2017

Okay :)

Hrily commented Aug 16, 2017

Hi,

The library works perfectly fine for Unicode characters.

However, Chinese has no explicit word boundary like the one other languages have (the space).
One example I found is the sentence
这是一个句子
which can be split in the following ways:

  • 这 | 是 | 一 | 个 | 句子
  • 这是 | 一个 | 句子

I don't know the language, so I don't know what the two splits above mean.

So the conclusion is that, because words cannot be counted when there is no word boundary, the library fails on Chinese text.
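
To make the point concrete, here is a minimal Python sketch of whitespace-based word counting. It is not the library's code; `count_words` is a made-up helper, and the assumption is only that words are counted by splitting on whitespace.

```python
def count_words(text: str) -> int:
    """Count words by splitting on whitespace (hypothetical helper)."""
    return len(text.split())


english = "This is a sentence"
chinese = "这是一个句子"

print(count_words(english))  # 4
print(count_words(chinese))  # 1, the whole sentence is seen as a single "word"
```

With no spaces to split on, any word limit of 1 or more keeps the entire Chinese sentence, so nothing gets truncated.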

The workaround I can propose is to split by number of characters for Chinese text.
But I'm unsure what effect that would have on the readability of the last part.
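
A rough sketch of that workaround, in Python for illustration only. The function `truncate`, its parameters, and the CJK check are all assumptions made for this example, not the library's API. The check covers only the CJK Unified Ideographs block (U+4E00 to U+9FFF), so it is deliberately crude.

```python
import re

# Assumption: "Chinese text" is detected by the presence of any character
# in the CJK Unified Ideographs block (U+4E00 to U+9FFF).
CJK = re.compile(r"[\u4e00-\u9fff]")


def truncate(text: str, limit: int, ellipsis: str = "...") -> str:
    """Keep `limit` characters for CJK text, otherwise `limit` words."""
    if CJK.search(text):
        units = list(text)    # one unit per character
        joiner = ""
    else:
        units = text.split()  # one unit per whitespace-separated word
        joiner = " "
    if len(units) <= limit:
        return text           # nothing to cut
    return joiner.join(units[:limit]) + ellipsis


print(truncate("This is a sentence", 2))  # This is...
print(truncate("这是一个句子", 4))          # 这是一个...
print(truncate("这是一个句子", 5))          # 这是一个句...
```

The last call shows the readability concern: a limit of 5 cuts through 句子, which both splits listed above treat as a single word.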

3 participants