Unicode character cannot be truncated correctly #9

Lw-Cui · 2015-08-14T07:37:33Z

Unicode characters, such as Chinese characters, are not truncated correctly.
Please fix it.

Hrily · 2017-08-15T16:37:05Z

I would like to fix this issue.

tcorral · 2017-08-15T16:38:19Z

Go ahead with it.


Hrily · 2017-08-15T16:45:38Z

Okay :)

Hrily · 2017-08-16T16:34:22Z

Hi,

The library works perfectly fine for Unicode characters in general.

However, Chinese has no explicit word boundary like the space we have in other languages.
For example, the sentence
这是一个句子
can be split in the following ways:

这 | 是 | 一 | 个 | 句子
这是 | 一个 | 句子

I don't know the language, so I can't tell what the two segmentations above mean.

So the conclusion is that, because words can't be counted when there is no word boundary, the library fails on Chinese text.
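
To illustrate the point, here is a minimal Python sketch. It is not the library's code, and `count_words` is a hypothetical helper. It assumes that words are counted by splitting on whitespace, and shows that the Chinese sentence then counts as a single word:

```python
def count_words(text: str) -> int:
    # Hypothetical helper: counts whitespace-delimited tokens,
    # which is an assumption about how a word-based truncator works.
    return len(text.split())


english = "This is a sentence"
chinese = "这是一个句子"

print(count_words(english))  # 4
print(count_words(chinese))  # 1 (no whitespace, so the whole sentence is one token)
```

With a count of one, a limit expressed in words never triggers on such text, or it cuts nothing meaningful.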

The workaround I could propose is to split on the number of characters in the case of Chinese.
But I'm unsure of the effect it will have on the readability of the last part of the truncated text.
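
A rough sketch of that workaround in Python, for discussion only. The function name `truncate` and the parameters `max_words`, `max_chars`, and `ellipsis` are all hypothetical and not part of the library's API. The idea is to truncate by words when whitespace boundaries exist, and fall back to a character limit otherwise:

```python
def truncate(text: str, max_words: int, max_chars: int, ellipsis: str = "...") -> str:
    """Truncate by words when possible, otherwise by characters (hypothetical sketch)."""
    words = text.split()
    if len(words) > 1:
        # Whitespace boundaries exist: apply the word limit.
        if len(words) <= max_words:
            return text
        return " ".join(words[:max_words]) + ellipsis
    # No whitespace boundary (e.g. Chinese): apply the character limit.
    # Python slices strings by code point, so a Chinese character is never cut in half.
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + ellipsis


print(truncate("This is a sentence", max_words=2, max_chars=4))  # This is...
print(truncate("这是一个句子", max_words=2, max_chars=4))  # 这是一个...
```

The character fallback can still cut in the middle of a word (here between 一个 and 句子), which is the readability concern mentioned above. Avoiding that would need a real word segmenter for the language.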
