Double byte characters are printed twice #41

Closed
nu774 opened this issue Oct 4, 2015 · 10 comments

Comments

@nu774

nu774 commented Oct 4, 2015

Like this:

$ console echo いろは123
いいろろはは123

I guess this was caused by a commit from earlier this year. It didn't happen previously.

@rprichard
Owner

I think you mean double-width characters rather than double-byte characters. The first three characters are three bytes in UTF-8. (Note to myself, mostly: The first three characters are: U+3044 U+308D U+306F.)
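
As a quick check of the byte counts mentioned above, here is a minimal sketch that computes the UTF-8 length of a code point. The function names `utf8Length` and `utf8TotalLength` are hypothetical and are not part of winpty.

```cpp
#include <cassert>
#include <cstdint>

// Number of bytes needed to encode one Unicode code point in UTF-8.
// (Hypothetical helper for illustration; not winpty code.)
constexpr int utf8Length(std::uint32_t codePoint) {
    return codePoint < 0x80    ? 1
         : codePoint < 0x800   ? 2
         : codePoint < 0x10000 ? 3
                               : 4;
}

// Total UTF-8 size of a sequence of code points.
inline int utf8TotalLength(const std::uint32_t *codePoints, int count) {
    assert(count >= 0);
    int total = 0;
    for (int i = 0; i < count; ++i) {
        total += utf8Length(codePoints[i]);
    }
    return total;
}
```

Each of U+3044, U+308D, and U+306F falls in the three-byte range, while the ASCII digits take one byte each.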

The problem originally didn't reproduce for me, but I managed to reproduce it by switching my system locale to "Japanese (Japan)." To do this, open the Control Panel, search for "Region" or "Region and Language" and open the Administrative tab. Then, in the "Language for non-Unicode programs" box, change the system locale to Japanese.

I noticed that after doing this, the available fonts for the console change. With Windows 7 English, the options were:

  • Consolas
  • Lucida Console
  • Raster Fonts (aka Terminal?)

With a Japanese locale, the options are:

  • MS ゴシック
  • Raster Fonts

However, winpty still successfully changes the font to Lucida Console, even though it isn't an available option. I tested using WINPTY_SHOW_CONSOLE=1 and manually changed the font to one of the other options, and winpty still duplicated the double-width characters.

With my original system locale of English, the Japanese characters were single-width in the Windows Console. After changing the system locale, the characters become double-width, even with Lucida Console, where they're rendered as a featureless box.

The obvious guess is that double-width characters take up two console cells, so Windows reports the same Unicode codepoint for each cell. That could create a problem (i.e. how does winpty detect two Japanese characters in the English locale versus a single Japanese character in the Japanese locale?). I'll need to do some more investigating.

@rprichard
Owner

The TrueType Japanese font is known as MS Gothic.

@rprichard
Owner

> I think you mean double-width characters rather than double-byte characters.

Never mind. According to https://msdn.microsoft.com/en-us/library/cc194788.aspx, these ideas are equivalent in a double-byte character set, such as Shift-JIS.

The commit that broke things was probably 72557cb.

The fix will probably involve examining the COMMON_LVB_LEADING_BYTE and COMMON_LVB_TRAILING_BYTE console attributes.

@nu774
Author

nu774 commented Oct 5, 2015

Thanks, it looks like your guess is correct. The following patch fixed it here (Win10 and the CP932 code page):

diff --git a/agent/Terminal.cc b/agent/Terminal.cc
index d1cdfd4..5b0622a 100644
--- a/agent/Terminal.cc
+++ b/agent/Terminal.cc
@@ -143,6 +143,8 @@ void Terminal::sendLine(int line, CHAR_INFO *lineData, int width)
             length = termLine.size();
             m_remoteColor = color;
         }
+        if (lineData[i].Attributes & COMMON_LVB_TRAILING_BYTE)
+            continue;
         // TODO: Is it inefficient to call WideCharToMultiByte once per
         // character?
         char mbstr[16];
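
The idea behind the patch can be sketched outside of winpty as follows. This is a simplified illustration, not the actual `Terminal::sendLine` code: `Cell` is a hypothetical stand-in for the Windows `CHAR_INFO` struct, and `collapseCells` is a made-up name. The two attribute values are the ones defined in the Windows SDK header `wincon.h`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Same values as COMMON_LVB_LEADING_BYTE / COMMON_LVB_TRAILING_BYTE in wincon.h.
constexpr std::uint16_t kLeadingByte  = 0x0100;
constexpr std::uint16_t kTrailingByte = 0x0200;

// Hypothetical stand-in for CHAR_INFO: one console cell.
struct Cell {
    char16_t ch;
    std::uint16_t attributes;
};

// Emit each character once by skipping the second (trailing) cell of a
// double-width character.
inline std::u16string collapseCells(const std::vector<Cell> &line) {
    std::u16string out;
    for (const Cell &cell : line) {
        if (cell.attributes & kTrailingByte) {
            continue;  // same code point as the preceding leading cell
        }
        out.push_back(cell.ch);
    }
    assert(out.size() <= line.size());  // cells are only ever dropped
    return out;
}
```

With the line from the original report, six cells for いろは (leading and trailing for each) plus three cells for 123 collapse to six characters.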

@nu774
Author

nu774 commented Oct 5, 2015

Hmm, it seems that this patch is not enough.
It does fix THIS issue (kanji characters aren't repeated anymore), but I can still observe the console screen getting subtly out of sync.

One apparent issue is the character "ー" (U+30FC).
This character is a Japanese punctuation mark represented in two bytes in Shift_JIS (CP932), and it occupies two character cells in an ordinary CP932 console environment (with the MS Gothic font).
However, it seems that this character takes up only one character cell when Lucida Console is used as the console font.
This completely breaks the layout estimation of the win32 console application being run, and also the mintty rendering, where this character is assumed to occupy two character cells.
(Naive non-Unicode programs just assume that double-byte characters in a DBCS code page occupy two cells. More sophisticated Unicode-based programs like mintty usually use some kind of wcwidth(), which basically returns the East_Asian_Width property of a Unicode character. For U+30FC, it is "Wide".)
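
To illustrate the wcwidth()-style estimate described above, here is a rough sketch. The range table is deliberately incomplete and hypothetical: it covers only a few blocks whose East_Asian_Width is Wide or Fullwidth, and it is not how mintty or any real wcwidth() is implemented. The names `isWide` and `columnWidth` are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified, incomplete approximation of "East_Asian_Width is W or F".
// A real implementation would use the full Unicode data tables.
constexpr bool isWide(char32_t c) {
    return (c >= 0x3041 && c <= 0x30FF)    // Hiragana and Katakana, incl. U+30FC
        || (c >= 0x4E00 && c <= 0x9FFF)    // CJK Unified Ideographs
        || (c >= 0xFF01 && c <= 0xFF60);   // Fullwidth ASCII variants
}

// Estimated number of terminal columns a string occupies.
inline int columnWidth(const std::u32string &text) {
    int columns = 0;
    for (char32_t c : text) {
        columns += isWide(c) ? 2 : 1;
    }
    // Every character contributes at least one column.
    assert(static_cast<std::size_t>(columns) >= text.size());
    return columns;
}
```

Under this estimate a program expects U+30FC to advance the cursor by two columns, so a console that stores it in a single cell drifts out of sync by one column per occurrence.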

I could temporarily fix this by disabling the call to m_console->setSmallFont() in the Agent constructor and letting it use MS Gothic in my environment, but that is obviously not a satisfactory solution...

@rprichard
Owner

I'll probably make the choice of font depend upon the code page. There's a table in the registry at HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Console\TrueTypeFont. It's described here.

The code page can change while the console is open (e.g. mode con: cp select=932), so perhaps winpty should regularly check for this and change the font.

I'm currently documenting everything I've learned from studying how the console works with code pages and fonts. It's full of complications. For example, ReadConsoleOutputW reads double-width characters as two CHAR_INFO values if you're using a TrueType font, but if you instead use a raster font, then the double-width character is only a single CHAR_INFO value, and the trailing CHAR_INFO values are zeroed out.

@rprichard
Owner

> One apparent issue is the character "ー" (U+30FC).
> This character is a Japanese punctuation mark represented in two bytes in Shift_JIS (CP932), and it occupies two character cells in an ordinary CP932 console environment (with the MS Gothic font).
> However, it seems that this character takes up only one character cell when Lucida Console is used as the console font.

Apparently the situation is more complicated than this.

In Raster Font and MS Gothic, U+30FC is always drawn/rendered as full-width. In Lucida Console and Consolas, the character is always drawn/rendered as half-width.

In terms of the data model (e.g. ReadConsoleOutputW, selection), the character is half-width or full-width depending upon the font size and OS version. I tested Lucida Console, Consolas, and MS Gothic, with all font sizes between 1px tall and 50px tall. Results:

  • Win7: full-width except for these cases, which are half-width:
    • Lucida Console: 1px, 2px, 3px, 5px, 6px, 7px, 8px, 10px, 11px, 13px, 15px, 16px, 18px, 21px, 23px, 26px
    • MS Gothic: 3px
  • Win8, Win8.1, and Win10 legacy mode: full-width except for these cases, which are half-width:
    • Lucida Console: 1px, 2px, 3px, 5px, 6px, 7px, 8px, 10px, 11px, 13px, 15px, 16px, 18px, 21px, 23px, 26px
    • MS Gothic: 1px, 2px, 4px, 6px, 8px, 10px, 12px, 14px, 16px, 18px
  • Win10 new mode: The character is half-width with Consolas and Lucida Console. It is full-width with MS Gothic. This is the only fully correct mode.
  • (I didn't test any other OS versions.)

winpty currently picks Lucida Console 6px as its first choice, which is in the broken list for all the OS versions I tested. If I'm careful to use 5px or 7px for MS Gothic, then winpty should handle U+30FC correctly on all OS versions. If I instead choose 6px MS Gothic, it will fail on Win8, Win8.1, and Win10 legacy mode.
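
The test results above can be encoded as a small lookup table, which makes it easy to check a candidate font and size. This is only a sketch of the data reported in this comment for U+30FC and the 1px to 50px range; the enum and function names are hypothetical, and Win10 new mode is left out because it does not depend on font size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

enum class Os { Win7, Win8ToWin10Legacy };
enum class Font { LucidaConsole, Consolas, MSGothic };

template <std::size_t N>
bool contains(const int (&sizes)[N], int px) {
    return std::find(sizes, sizes + N, px) != sizes + N;
}

// True if U+30FC was reported as half-width in the console data model
// for the given OS, font, and pixel height (per the list above).
inline bool isHalfWidthInDataModel(Os os, Font font, int px) {
    assert(px >= 1 && px <= 50);  // only this range was tested
    static const int kLucida[] = {1, 2, 3, 5, 6, 7, 8, 10, 11, 13,
                                  15, 16, 18, 21, 23, 26};
    static const int kGothicWin7[] = {3};
    static const int kGothicWin8Plus[] = {1, 2, 4, 6, 8, 10, 12, 14, 16, 18};
    switch (font) {
    case Font::LucidaConsole:
        return contains(kLucida, px);  // same list on both OS groups
    case Font::MSGothic:
        return os == Os::Win7 ? contains(kGothicWin7, px)
                              : contains(kGothicWin8Plus, px);
    case Font::Consolas:
        return false;  // no half-width sizes were listed for Consolas
    }
    return false;
}
```

Querying the table confirms the reasoning above: Lucida Console 6px is half-width on both OS groups, MS Gothic 5px and 7px are full-width on both, and MS Gothic 6px is half-width only on Win8 through Win10 legacy mode.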

In the Windows console properties dialog, when I choose MS Gothic, there is a box listing possible font sizes. It's interesting that the first font sizes in that box overlap with the sizes that don't work: 6, 8, 10, 12, 14, 16, 18, 20, 24.

This Windows behavior makes no sense to me. At least the new console in Win10 works. I wonder if there are other characters with this problem.

@rprichard
Owner

@nu774 What version of Windows are you using?

@nu774
Author

nu774 commented Oct 10, 2015

I'm using Windows 10.
And I wasn't even aware of THAT complicated and seemingly buggy behavior in the OS native console.
I noticed the issue with this character (U+30FC) because it appears quite frequently in Japanese text (katakana) when transliterating foreign words. Every long vowel is transliterated into this character (for example, "console" becomes コンソール).

@rprichard
Owner

cce293d fixes this issue, but c3999b5 rewrites the console font configuration to use appropriate TrueType fonts for the CJK code pages (932, 936, 949, 950).
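
For illustration only, a check for the four CJK code pages listed above might look like the sketch below. It is not taken from either commit, and the function name `isCjkCodePage` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>

// True for the CJK code pages named in this thread:
//   932 = Japanese (Shift-JIS), 936 = Simplified Chinese (GBK),
//   949 = Korean, 950 = Traditional Chinese (Big5).
inline bool isCjkCodePage(unsigned codePage) {
    static const unsigned kCjkCodePages[] = {932, 936, 949, 950};
    // std::binary_search requires a sorted range.
    assert(std::is_sorted(std::begin(kCjkCodePages), std::end(kCjkCodePages)));
    return std::binary_search(std::begin(kCjkCodePages),
                              std::end(kCjkCodePages), codePage);
}
```

A caller could use such a check to decide whether to keep a Western default font or to switch to a TrueType font appropriate for the active code page.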
