UTF-8 Zero Space Word Break not inserted #1
Comments
The -b option currently expects ASCII text, not UTF-8. We need to change it to accept UTF-8 instead to support this test case. On the other hand, in TIS-620 input mode, the -b option should expect TIS-620, not UTF-8. In short, the -b option should be treated the same way input text is.
Thanks for the info. Maybe a -bt and a -bu option? (Although I don't know a use case for -bt; the -bu option will also accept ASCII.) I was wondering about the -f option: when it is not given, is plain text input assumed?
I think fixing the -b behavior should be fine. I'm working on it. Regarding the -f option, yes, you understand it right.
Word break string should be read & converted the same way input text is.

* src/convutil.h, src/convutil.cpp (+ConvStrDup):
  - Add ConvStrDup() utility for converting string in buffer.
* src/filterx.h (wordBreakStr, GetWordBreak(), FilterX()):
  - Change type of word break string from const char* to const wchar_t*.
* src/filterhtml.cpp:
* src/filterlatex.h:
* src/filterlatex.cpp:
* src/filterlambda.h:
* src/filterrtf.cpp:
  - Change argument types accordingly.
* src/wordseg.cpp (WordSegmentation):
  - Change type of wbr from const char* to const wchar_t*.
  - Use wchar functions accordingly.
* src/wordseg.cpp (main):
  - Prepare wchar_t version of word break string as needed.
  - Call WordSegmentation() with wchar_t version of word break strings.
  - Print trailing word break string explicitly in mule mode.

Thanks GitHub @pepa65 for the report.
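As a rough illustration of what decoding the word break string "the same way input text is" could involve, here is a minimal sketch. DecodeBreakString is a hypothetical name and is not swath's ConvStrDup(), whose implementation is not shown in this thread. The sketch assumes a wchar_t wide enough to hold the code points involved and does only light validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, NOT swath's ConvStrDup(): decode the bytes given to -b
// into wide characters, using the same encoding assumption as the text stream.
std::wstring DecodeBreakString (const std::string& bytes, bool isUtf8)
{
  std::wstring out;
  for (std::size_t i = 0; i < bytes.size ();)
    {
      unsigned char c = static_cast<unsigned char> (bytes[i++]);
      if (!isUtf8)
        {
          // TIS-620: ASCII maps to itself; Thai bytes from 0xA1 upward map
          // to U+0E01 upward by a fixed offset (unassigned bytes not checked).
          out += static_cast<wchar_t> (c >= 0xA1 ? c - 0xA1 + 0x0E01 : c);
          continue;
        }
      // UTF-8: the lead byte tells how many continuation bytes follow.
      unsigned int cp = c;
      int extra = 0;
      if ((c & 0xE0) == 0xC0)      { cp = c & 0x1F; extra = 1; }
      else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
      else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
      else if (c >= 0x80)          { cp = 0xFFFD; }  // stray byte
      for (; extra > 0 && i < bytes.size (); --extra, ++i)
        {
          unsigned char cc = static_cast<unsigned char> (bytes[i]);
          if ((cc & 0xC0) != 0x80)
            break;                                   // bad continuation byte
          cp = (cp << 6) | (cc & 0x3F);
        }
      if (extra > 0)
        cp = 0xFFFD;                                 // truncated sequence
      out += static_cast<wchar_t> (cp);
    }
  return out;
}
```

With this sketch, the three bytes E2 80 8B decode to the single character U+200B in UTF-8 mode, while in TIS-620 mode the same bytes are read as three separate characters, which is why the encoding assumption for -b matters.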
Fix committed. Thanks for your report.
Thank you Thep for the superquick fix! Now compiling it on Ubuntu 16.04.
Works great!
On second thought, I think interpreting the -b option in terms of the output encoding makes more sense, so that the user can directly specify the word break string they wish to insert in case the input and output encodings differ. For example:
The change needed is:

diff --git a/src/wordseg.cpp b/src/wordseg.cpp
index 7fd39e5..a67f101 100644
--- a/src/wordseg.cpp
+++ b/src/wordseg.cpp
@@ -306,7 +306,7 @@ main (int argc, char* argv[])
   const wchar_t* wcwbr = L"|";
   if (wbr)
     {
-      wcwbr = wcwbr_buff = ConvStrDup (wbr, isUniIn);
+      wcwbr = wcwbr_buff = ConvStrDup (wbr, isUniOut);
     }
   while (!feof (stdin))
     {
As an adjustment to issue #1 fix, taking the word break string in output encoding should be more sensible to users than in input encoding, so s/he can directly specify the string to insert.

* src/wordseg.cpp (main):
  - Convert wbr depending on isUniOut instead of isUniIn.
I just cloned the GitHub repo, and it didn't have configure. A few years ago I would have been stuck there. ;-) Absolutely correct, the separator will be in the output string! (OT: do you know a utility that reports the character length of Thai strings? I tried gawk because it is supposed to be multibyte-character aware, but that doesn't work.)
On second thought, maybe there should be no encoding/conversion of the separator at all. That way, you can accomplish anything you need with swath, regardless of the input encoding or the output encoding. It becomes less predictable and less flexible if the encoding of the separator gets changed. (My use case is -u u,u with a UTF-8 separator, so it works for me as it is now.)
I also had the idea not to convert the separator, but the change would be too deep in the class hierarchy, so I gave it up.
Regarding the character counting utility, how about "wc -m"?
wc counts bytes I think:
I mean -m, not -n.
Sorry, yes, wc -m counts the number of characters, so:
Apparently this is a difficult problem, and many terminal editors stumble here... Or I could just use swath with a different "dictionary" containing all the combinations that make up a single "character"! (Or would there be a quicker way based on swath's code?) Sorry to clutter with OT, but the dictionary method is impractical. For instance, this is 1 character:
You mean to have it count Thai characters?
Yes, every position that gets taken up when echoing a string to the terminal, or counting the positions in a monospace font. I do a lot of work in a terminal. Apparently the functionality is all built into libc, but taking zero-width characters into account is something you only encounter in certain languages. In Latin scripts, an accented character has its own code point: é is one UTF-8 character, so echo -n "é" | wc -m gives 1.
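To make the distinction between characters and terminal positions concrete, here is a minimal sketch that counts columns by skipping code points that take no position of their own. CountColumns and IsZeroWidth are hypothetical names, the zero-width table is a small hand-picked assumption (Thai above/below vowels and tone marks, Latin combining marks, a few zero-width format characters), and the input is assumed to be valid UTF-8 with no double-width characters. The POSIX wcwidth()/wcswidth() functions in libc cover the general case but depend on the active locale.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true for code points assumed to occupy no column.
static bool IsZeroWidth (unsigned int cp)
{
  return cp == 0x0E31                        // MAI HAN-AKAT
         || (cp >= 0x0E34 && cp <= 0x0E3A)   // SARA I .. PHINTHU
         || (cp >= 0x0E47 && cp <= 0x0E4E)   // MAITAIKHU .. YAMAKKAN
         || (cp >= 0x0300 && cp <= 0x036F)   // combining diacritical marks
         || (cp >= 0x200B && cp <= 0x200D);  // ZWSP, ZWNJ, ZWJ
}

// Count terminal columns in a UTF-8 string (assumes valid UTF-8).
std::size_t CountColumns (const std::string& utf8)
{
  std::size_t columns = 0;
  for (std::size_t i = 0; i < utf8.size ();)
    {
      unsigned char c = static_cast<unsigned char> (utf8[i++]);
      unsigned int cp = c;
      int extra = 0;
      if ((c & 0xE0) == 0xC0)      { cp = c & 0x1F; extra = 1; }
      else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
      else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
      // Fold in the continuation bytes of this code point.
      while (extra-- > 0 && i < utf8.size ())
        cp = (cp << 6) | (static_cast<unsigned char> (utf8[i++]) & 0x3F);
      if (!IsZeroWidth (cp))
        ++columns;
    }
  return columns;
}
```

Under these assumptions, the 17 code points of ผมมีแฟนแล้วนะครับ count as 14 columns, because the three marks ี, ้ and ั sit above their base consonants.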
For Thai input, I made a bash one-liner that can do it:
* tests/Makefile.am, +tests/test-utf8-wbr.sh:
  - Add test case for non-ASCII word break string, for Issue #1.
0.6.1 (2018-08-20)
=====
- Updated word break dictionary.
- Fix a defect in RTF parsing, so RTF gets more complete word break positions.
- Compiler warning fixes.
- Minor code cleanups.
- Useful installation instructions in INSTALL file. (Thanks @pepa65 for the pull request.)

0.6.0 (2017-11-28)
=====
- Updated word break dictionary.
- Drop undocumented option '-l'.
- Revamped internal word break engine.
- Updated manpage.

0.5.5 (2016-12-25)
=====
- Updated word break dictionary.

0.5.4 (2016-07-08)
=====
- Updated word break dictionary.
- Fix segfault on extremely long input lines.
- Support longer input lines. (Bug report by Santi Romeyen)
- Support non-ASCII word break string. tlwg/swath#1
- Some source code clean-ups.
- Add test suite.

0.5.3 (2014-09-01)
=====
- Updated word break dictionary.
- Fix premature output ending on long UTF-8 input line. (Bug report by Sorawee Porncharoenwase)
- Fix excessive break positions in plain text mode. (Bug report by Sorawee Porncharoenwase)
- Remove dead codes, resulting in a little smaller binary.

0.5.2 (2013-12-23)
=====
- Fix infinite loops in LaTeX filter. (Bug report and patch by Neutron Soutmun)
- Fix off-by-one character loss in long HTML tokens. (Bug report and analysis by Nicolas Brouard)

0.5.1 (2013-10-30)
=====
- Correct word break code for Lambda.
- Updated word break dictionary.
- Adjust file filters to prevent potential buffer overflow.

0.5.0 (2013-02-11)
=====
- Character encoding conversion is now spontaneous, no more buffering via temporary file.
- Rewritten RTF filter. It's now tested to work with real RTF document.
- Process characters as Unicode internally, so that characters not present in TIS-620 are not lost in output.
- Fix potential buffer overflow vulnerability in Mule mode.
- Updated word break dictionary.
- Significant source clean-ups.
- Switch to XZ tarball compression.

For pkgsrc, use gmake and add patch to compile wchar functions on NetBSD
Trying to insert UTF-8 E2808B (U+200B, ZERO WIDTH SPACE) into a Thai text, but it comes up empty.
swath -b $'\xE2\x80\x8B' -u u,u <<<"ผมมีแฟนแล้วนะครับ"
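Because U+200B is invisible in a terminal, looking at the output is not enough to tell whether the separator was inserted. One way to check is to count the separator's byte sequence in the captured output. This is a minimal sketch, assuming the output has already been read into a std::string; CountOccurrences is a hypothetical helper and not part of swath.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count non-overlapping occurrences of `needle`
// (for example the bytes E2 80 8B) inside `text`.
std::size_t CountOccurrences (const std::string& text, const std::string& needle)
{
  if (needle.empty ())
    return 0;  // an empty separator never counts as inserted
  std::size_t count = 0;
  for (std::size_t pos = text.find (needle); pos != std::string::npos;
       pos = text.find (needle, pos + needle.size ()))
    ++count;
  return count;
}
```

A result of 0 for the needle "\xE2\x80\x8B" on the command's output would correspond to the behavior reported here.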