UTF-8 Zero-Width Space word break not inserted #1

Closed
pepa65 opened this Issue Jun 16, 2016 · 18 comments

Comments

pepa65 commented Jun 16, 2016

Trying to insert the UTF-8 byte sequence E2 80 8B (U+200B, zero-width space) into Thai text, but it comes up empty:
swath -b $'\xE2\x80\x8B' -u u,u <<<"ผมมีแฟนแล้วนะครับ"
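Since U+200B is invisible in terminal output, the easiest way to tell whether it was inserted is to look at the bytes. A small Python illustration (the `segmented` string below is illustrative, not actual swath output):

```python
# U+200B ZERO WIDTH SPACE is the three bytes E2 80 8B in UTF-8
zwsp = "\u200b"
assert zwsp.encode("utf-8") == b"\xe2\x80\x8b"

# The character is invisible when printed, but shows up in a hex dump.
# "segmented" is an illustrative string, not actual swath output:
segmented = "ผม\u200bมี"
print(segmented.encode("utf-8").hex(" "))  # "e2 80 8b" appears between the words
```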

Contributor

thep commented Jun 20, 2016

The -b option currently expects ASCII text, not UTF-8. We need to change it to accept UTF-8 instead to support this test case.

On the other hand, in TIS-620 input mode, the -b option should expect TIS-620, not UTF-8.

In short, the -b option should be treated the same way as the input text.
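The rule proposed above can be sketched outside of swath. This Python fragment is purely illustrative (swath itself is C++, and `decode_wbr` is a made-up name, not swath's API):

```python
def decode_wbr(raw: bytes, utf8_input: bool) -> str:
    """Decode the -b argument with the same codec as the input text."""
    # Hypothetical sketch of the proposed rule, not swath's actual code.
    return raw.decode("utf-8" if utf8_input else "tis-620")

# In UTF-8 input mode, $'\xE2\x80\x8B' becomes U+200B as intended:
assert decode_wbr(b"\xe2\x80\x8b", True) == "\u200b"
```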

pepa65 commented Jun 20, 2016

Thanks for the info. Maybe a -bt and a -bu option? (Although I don't know a use case for -bt; the -bu option would also accept ASCII.)

I was wondering about the -f option: when it is not given, is plain-text input assumed?
Anyway, thanks for this utility; it took me a while to find it!

Contributor

thep commented Jun 20, 2016

I think fixing the -b behavior should be fine; I'm working on it.

Regarding the -f option: yes, you understand it correctly.

thep added a commit that referenced this issue Jun 20, 2016

Fix issue #1 - Support non-ASCII word break string.
Word break string should be read & converted the same way
input text is.

* src/convutil.h, src/convutil.cpp (+ConvStrDup):
  - Add ConvStrDup() utility for converting string in buffer.

* src/filterx.h (wordBreakStr, GetWordBreak(), FilterX()):
  - Change type of word break string from const char*
    to const wchar_t*.
* src/filterhtml.cpp:
* src/filterlatex.h:
* src/filterlatex.cpp:
* src/filterlambda.h:
* src/filterrtf.cpp:
  - Change argument types accordingly.

* src/wordseg.cpp (WordSegmentation):
  - Change type of wbr from const char* to const wchar_t*.
  - Use wchar functions accordingly.

* src/wordseg.cpp (main):
  - Prepare wchar_t version of word break string as needed.
  - Call WordSegmentation() with wchar_t version
    of word break strings.
  - Print trailing word break string explicitly in mule mode.

Thanks GitHub @pepa65 for the report.
Contributor

thep commented Jun 20, 2016

Fix committed. Thanks for your report.

@thep thep closed this Jun 20, 2016

pepa65 commented Jun 20, 2016

Thank you Thep for the super-quick fix! Now compiling it on Ubuntu 16.04.

  1. The INSTALL file doesn't say to run ./autogen.sh as step 1 of the instructions; this might trip some people up.
  2. I had to install the package libdatrie1-bin. (I don't remember having to do this on Ubuntu 14.04; I already had libdatrie-dev installed.)

Works great!

Contributor

thep commented Jun 21, 2016

  1. The released tarball won't need an autogen.sh run. It's only needed for building from VCS. The auto-generated files should not be version-controlled.
  2. libdatrie1-bin is required for dictionary generation.
Contributor

thep commented Jun 21, 2016

On second thought, I think interpreting the -b option in terms of the output encoding makes more sense, so that the user can directly specify the word break string they wish to insert in case the input and output encodings differ. For example:

swath -u u,t -b $'\xFA' <<< "ผมมีแฟนแล้วนะครับ"

The change needed is:

diff --git a/src/wordseg.cpp b/src/wordseg.cpp
index 7fd39e5..a67f101 100644
--- a/src/wordseg.cpp
+++ b/src/wordseg.cpp
@@ -306,7 +306,7 @@ main (int argc, char* argv[])
       const wchar_t* wcwbr = L"|";
       if (wbr)
         {
-          wcwbr = wcwbr_buff = ConvStrDup (wbr, isUniIn);
+          wcwbr = wcwbr_buff = ConvStrDup (wbr, isUniOut);
         }
       while (!feof (stdin))
         {
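What the one-line change means can be illustrated in Python (a hedged sketch, not swath's code): under -u u,t the break string is now decoded with the output codec, TIS-620, so a single byte such as $'\xFA' directly names a TIS-620 character:

```python
# With the output codec (TIS-620 here), byte 0xFA decodes to U+0E5A
# ANGKHANKHU, which then round-trips into the TIS-620 output unchanged:
wbr = b"\xfa".decode("tis-620")
assert wbr == "\u0e5a"
assert wbr.encode("tis-620") == b"\xfa"
```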

thep added a commit that referenced this issue Jun 21, 2016

Interpret word break string using output encoding.
As an adjustment to issue #1 fix, taking the word break string
in output encoding should be more sensible to users than in
input encoding, so s/he can directly specify the string to insert.

* src/wordseg.cpp (main):
  - Convert wbr depending on isUniOut instead of isUniIn.
pepa65 commented Jun 21, 2016

I just cloned the GitHub repo, and it didn't have configure. A few years ago I would have been stuck there. ;-)

Absolutely correct: the separator will be in the output string!

(Off topic: do you know a utility that reports the character length of Thai strings? I tried gawk because it is supposed to be multibyte-character aware, but that doesn't work.)

pepa65 commented Jun 21, 2016

On second thought, maybe there should be no encoding conversion of the separator at all. That way you can accomplish anything you need with swath, regardless of the input or output encoding; converting the separator makes the tool less predictable and less flexible.
If I want $'\xE2\x80\x8B' as a separator (I mean those 3 consecutive bytes), but my output encoding is TIS-620, I would still expect to see those 3 bytes back in the output. (Of course, the resulting string as a whole is then not valid TIS-620, but that is a choice.)

(My use case is -u u,u with a UTF-8 separator, so it works for me as is now.)
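The trade-off behind this suggestion can be seen directly in Python (illustrative only): U+200B has no TIS-620 code point, so any conversion of the separator to the output encoding must either fail or drop it:

```python
# ZWSP cannot be expressed in TIS-620; the charmap conversion raises:
try:
    "\u200b".encode("tis-620")
    representable = True
except UnicodeEncodeError:
    representable = False
assert representable is False
```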

Contributor

thep commented Jun 21, 2016

I also had the idea of not converting the separator, but the change would have reached too deep into the class hierarchy, so I gave it up.

Contributor

thep commented Jun 21, 2016

Regarding the character counting utility, how about "wc -m"?

pepa65 commented Jun 21, 2016

wc counts bytes I think:
echo -n ต้น |wc
0 1 9
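The counts differ because the tools operate at different levels; in Python terms, ต้น is three code points (ต, ้, น) in nine UTF-8 bytes:

```python
s = "\u0e15\u0e49\u0e19"  # ต้น
assert len(s.encode("utf-8")) == 9  # bytes: what plain wc / wc -c reports
assert len(s) == 3                  # code points: what wc -m reports in a UTF-8 locale
```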

Contributor

thep commented Jun 21, 2016

I mean -m, not -n.

pepa65 commented Jun 21, 2016

Sorry, yes, wc -m counts the number of characters, so:
echo -n ต้น |wc -m
3
(I want to get 2...)

pepa65 commented Jun 21, 2016

Apparently this is a difficult problem, and many terminal editors stumble here...
I guess it would be almost trivial to adapt swath's code to do this correctly, right?

Or I could just use swath with a different "dictionary" containing all the combinations that form a single "character"! (Or would there be a quicker way based on swath's code?)

Sorry to clutter the thread with off-topic talk, but the dictionary method is impractical. For instance, this is 1 character: echo $'\xe0\xb8\x94\xe0\xb8\xb1\xe0\xb9\x8d\xe0\xb9\x8c\xe0\xb8\xb4\xe0\xb9\x8a'

Contributor

thep commented Jun 21, 2016

You mean to have it count Thai characters?

pepa65 commented Jun 21, 2016

Yes: every position that gets taken up when echoing a string to the terminal, or when counting positions in a monospace font. I do a lot of work in a terminal. Apparently the functionality is all built into libc, but taking zero-width characters into account is something you only encounter in certain scripts. Latin scripts encode accented characters as single code points: é is one UTF-8 character, and echo -n "é" |wc -m gives 1.

pepa65 commented Jun 21, 2016

For Thai input, I made a bash one-liner that can do it (GNU sed, using extended-regex alternation; a bracket expression would match single bytes instead of the multibyte sequences):
thaicharcount(){ echo -n "$@" |sed -E "s@(\xe0\xb8\xb1|\xe0\xb8\xb4|\xe0\xb8\xb5|\xe0\xb8\xb6|\xe0\xb8\xb7|\xe0\xb8\xb8|\xe0\xb8\xb9|\xe0\xb8\xba|\xe0\xb9\x87|\xe0\xb9\x88|\xe0\xb9\x89|\xe0\xb9\x8a|\xe0\xb9\x8b|\xe0\xb9\x8c|\xe0\xb9\x8d|\xe0\xb9\x8e)@@g" |wc -m; }
Then thaicharcount $'\xe0\xb8\x94\xe0\xb8\xb1\xe0\xb9\x8d\xe0\xb9\x8c\xe0\xb8\xb4\xe0\xb9\x8a' will give 1.
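A more general variant of the same idea, sketched in Python: instead of listing byte sequences by hand, drop every Unicode nonspacing mark (category Mn), which is what makes Thai vowel signs and tone marks zero-width. (`visible_chars` is a made-up name; this approximates terminal cells but ignores double-width characters.)

```python
import unicodedata

def visible_chars(s: str) -> int:
    # Count code points that are not nonspacing marks (category "Mn");
    # Thai above/below vowels and tone marks all fall in that category.
    return sum(1 for ch in s if unicodedata.category(ch) != "Mn")

assert visible_chars("\u0e15\u0e49\u0e19") == 2  # ต้น: MAI THO is zero-width
assert visible_chars("\u00e9") == 1              # precomposed é
assert visible_chars("\u0e14\u0e31\u0e4d\u0e4c\u0e34\u0e4a") == 1
```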

thep added a commit that referenced this issue Jul 6, 2016

Add test case for UTF-8 word break code.
* tests/Makefile.am, +tests/test-utf8-wbr.sh:
  - Add test case for non-ASCII word break string,
    for Issue #1.