Behavior of 'ASCII as UTF-8' in 'File' -> 'Encoding'... ? #561

lhmouse · 2018-07-26T09:36:01Z

These are encoding settings that I have at the moment:

Then I create a new file in UTF-16, such as this one:

The encoding column in the status bar says UTF-16 LE and the file is indeed in this encoding. This is correct.

Then I navigate to this menu entry, to reload the file in the default encoding, UTF-8.

The contents are now shown interleaved with NUL characters and the status bar says UTF-8. This is still correct.
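A minimal Python sketch of why the NULs appear (this only illustrates the byte layout; it is not how Notepad3 decodes files, and the two-letter sample text is an assumption):

```python
import codecs

# Bytes of a tiny UTF-16 LE file that starts with a BOM (FF FE).
data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# Force-decode the same bytes as UTF-8. Bytes that are not valid UTF-8
# (the BOM here) are kept visible as \xNN escapes instead of raising.
as_utf8 = data.decode("utf-8", errors="backslashreplace")

# Every ASCII letter is followed by a NUL, because UTF-16 LE stores
# each of these characters as <ASCII byte> <00>.
print(repr(as_utf8))
```

Each ASCII character occupies two bytes in UTF-16 LE, and the second byte is 0x00, which is itself a valid single-byte UTF-8 sequence (NUL).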

Then I try the ASCII as UTF-8 one...

But it does not reload the file in either ASCII or UTF-8. It makes Notepad3 detect the correct encoding and reload the file in the detected encoding, UTF-16.

Question

Is this the intended behavior? If it is, the menu entry could be renamed to Detect encoding and recode.

BTW, when I was translating Notepad3 for zh_CN I used the phrase above, as it was the only precise description of what this menu entry did.

zufuliu · 2018-07-29T06:11:50Z

If I understand this menu entry correctly, it was added because of my PR XhmikosR/notepad2-mod#200, where it was originally called Reload As UTF-8.

When a file can't be detected as Unicode (UTF-16 LE/BE) or UTF-8 (with or without BOM), Notepad2 opens it using the ANSI encoding of the current system locale. Some characters may then be rendered as question marks (see XhmikosR/notepad2-mod#202).

Reload As UTF-8 (if it works, see zufuliu/notepad2@ec28f0b, zufuliu/notepad2-mod@1897062, and the latest code at https://github.com/zufuliu/notepad2/blob/master/src/Edit.c#L556) forces the ANSI file to be loaded as UTF-8. Invalid UTF-8 sequences are then rendered using their byte values (with a black background, like the control characters in your screenshot above).

For example, open Notepad2(-Mod)'s License.txt on a non-Latin system: the copyright sign is rendered as a question mark. After Reload As UTF-8 it is rendered as \xA9, so you know what the question mark really is (and the possible original encoding) without opening the file in a hex editor.

It's useful when viewing a file that contains invalid UTF-8 bytes.
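The effect can be sketched in Python. This is only an illustration under assumptions: the sample text is made up, GBK stands in for "a non-Latin system", and Python's `replace` handler emits U+FFFD where Windows would show a question mark.

```python
# A Windows-1252 (Western European) file: the copyright sign is byte 0xA9.
data = "Copyright \u00a9 2004".encode("cp1252")

# Decoded with a DBCS code page (GBK assumed here), 0xA9 is not the
# copyright sign; undecodable bytes are replaced and the byte value is lost.
as_gbk = data.decode("gbk", errors="replace")

# Forced UTF-8 decoding: a lone 0xA9 is not valid UTF-8, so it stays
# visible as an escape that still shows the original byte value.
as_utf8 = data.decode("utf-8", errors="backslashreplace")

print(repr(as_gbk))
print(as_utf8)  # Copyright \xa9 2004
```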

lhmouse · 2018-07-29T13:18:18Z

> Reload As UTF-8 (if it works, see zufuliu/notepad2@ec28f0b, zufuliu/notepad2-mod@1897062, and the latest code at https://github.com/zufuliu/notepad2/blob/master/src/Edit.c#L556) forces the ANSI file to be loaded as UTF-8. Invalid UTF-8 sequences are then rendered using their byte values (with a black background, like the control characters in your screenshot above).

Thanks for the detailed explanation.

> For example, open Notepad2(-Mod)'s License.txt on a non-Latin system: the copyright sign is rendered as a question mark. After Reload As UTF-8 it is rendered as \xA9, so you know what the question mark really is (and the possible original encoding) without opening the file in a hex editor.
>
> It's useful when viewing a file that contains invalid UTF-8 bytes.

It is not, especially for those who know nothing about the specification of UTF-8.

Opening something that is not UTF-8 as UTF-8 is nothing but a mistake. You are trying to mitigate one mistake with another.

Non-ASCII bytes (these include invalid UTF-8 code units) can either:

1. happen to form valid UTF-8 code points, which are then rendered as gibberish, so your description just does not apply, or
2. be invalid UTF-8 code units, which are rendered as their hexadecimal values but still provide no information about the original encoding.
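The two cases above can be sketched in Python. The helper name `show_as_utf8` and the sample strings are assumptions for illustration; Windows-1252 is used as the "real" encoding of the file.

```python
def show_as_utf8(data: bytes) -> str:
    """Decode as UTF-8, keeping invalid bytes visible as \\xNN escapes."""
    return data.decode("utf-8", errors="backslashreplace")

# First case: Windows-1252 text whose bytes happen to be well-formed UTF-8.
# The two characters U+00C3 U+00A9 are stored as C3 A9, which UTF-8 reads
# as the single character U+00E9, with no sign that anything went wrong.
silent = "\u00c3\u00a9".encode("cp1252")

# Second case: Windows-1252 text whose non-ASCII byte is not valid UTF-8.
# U+00E9 is stored as the single byte E9, which is shown as an escape.
visible = "\u00e9".encode("cp1252")

print(repr(show_as_utf8(silent)))
print(repr(show_as_utf8(visible)))
```

In the first case the forced UTF-8 view silently changes the text; in the second the byte value is visible, but the escape alone does not say which encoding produced it.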

zufuliu · 2018-07-29T15:04:29Z

There are situations where you need to force a file open as UTF-8, especially a file that contains a few invalid UTF-8 bytes and may render better that way than with the system-locale-dependent ANSI encoding, such as a file truncated for some reason or part of a large file.

If a user knows nothing about UTF-8 or other encodings, they may never use the encoding menu at all.
For example, most users of Windows Notepad know nothing about encodings (though they are shown in the Save and Save As dialogs). When some characters can't be encoded in the current ANSI encoding, Notepad warns about loss of data and offers to save the file as Unicode. The user only needs to click the OK button and doesn't need to know anything about Unicode (not even the word "Unicode" itself).

I think all Scintilla-based editors are primarily for programmers, who know something about encodings; this is why there is an encoding menu.

Compared to selecting an encoding for the current content (which may change the content and may cause loss of data), the five reload menu items (As UTF-8, As ANSI, As OEM, No FileVars, Default) do not cause loss of data.

Since Notepad2 can be used to open arbitrary files (including binary files), these menu items work like a small hex viewer.

For 1: since the original file encoding can't be detected, it only matters which option renders the file more meaningfully for the user.
For 2: take the License.txt above as an example. If the copyright sign is rendered as a question mark, it's unknown whether it's a real question mark or not. If it's rendered as \xA9, a quick online search is enough to be reasonably sure that the isolated byte is the copyright sign and that the file is most likely in some Latin or Western European encoding.

zufuliu · 2018-07-29T15:22:12Z

Some discussions on rendering invalid bytes:
https://sourceforge.net/p/scintilla/feature-requests/1211/
https://sourceforge.net/p/scintilla/feature-requests/1211/#8131

zufuliu · 2018-07-30T01:02:58Z

In DBCS (CJK) environments, invalid (DBCS/ANSI) bytes are rendered as hexadecimal values too.
See the release notes for Scintilla 4.1.0 and the relevant commits.

https://www.scintilla.org/ScintillaHistory.html
https://sourceforge.net/p/scintilla/code/ci/514fde42ccbf37bc9d1393c6bb1cd30728f7b397/
https://sourceforge.net/p/scintilla/code/ci/507e9b40a637b93646a5e3b690c8141160f16f86/

zufuliu · 2018-07-30T01:04:11Z

@lhmouse can you fix https://sourceforge.net/p/scintilla/bugs/2026/ in MinGW-w64 headers?

A ticket was also created at https://sourceforge.net/p/mingw-w64/bugs/753/

lhmouse · 2018-07-30T02:38:23Z

> I think all Scintilla-based editors are primarily for programmers, who know something about encodings; this is why there is an encoding menu.

People who program aren't necessarily 'programmers'.

> Since Notepad2 can be used to open arbitrary files (including binary files), these menu items work like a small hex viewer.

It is just wrong to open a non-text file with a text editor, as the contents of a non-text file are not guaranteed to be displayed in hex (they could coincidentally become gibberish).

> @lhmouse can you fix https://sourceforge.net/p/scintilla/bugs/2026/ in MinGW-w64 headers?
> A ticket was also created at https://sourceforge.net/p/mingw-w64/bugs/753/

Yes. But I suggest you make a patch and send it to our mailing list. The rule of thumb is that we don't review our own patches, so if you had a patch it would get into master sooner. 😂

zufuliu · 2018-07-30T13:59:28Z

@lhmouse first thank you for the bug fix.

Back to this issue, maybe I used some terms wrongly.

I think a text file whose encoding is unknown is no different from a binary file. The only difference is that when you open them with a plain-text tool, the former displays some human-readable content and the latter displays a lot of gibberish. Both depend on which encoding (let's call it Enc) is used to decode the actual binary/octets/bytes content loaded from disk, network, etc.

Rendering invalid sequences in encoding Enc as hexadecimal values with a black background, like the Unicode replacement character, is simply a way to display the content of the file instead of crashing the editor or refusing to open it.

I will look into the code to see what the difference is between Reload As UTF-8 and Recode UTF-8; both seem to do similar things.

lhmouse · 2018-07-30T14:52:00Z

The encoding of a text file can be guessed using the GNU file utility, which is implemented using libmagic. It does not guarantee 100% accuracy, but it never mistakes a text file for a non-text one.

RaiKoHoff · 2018-08-06T12:31:56Z

Regarding "ANSI as UTF-8":
If you have a pure ASCII (7-bit) text file (and you have configured your Windows ANSI code page as your default encoding), this ASCII text will be treated as encoded in your ANSI code page.
The text can equally be treated as UTF-8, since the (7-bit) encoding is the same.
Switching the encoding from ANSI to UTF-8 pops up a warning dialog hinting that switching the encoding might cause problems (if the dialog has not been disabled).
The "ANSI as UTF-8" reload skips this warning, because it assumes you know that the file is pure ASCII and you want UTF-8 encoding on a later save(-as) ...
If there are non-ASCII bytes in the file, this will lead to the rendering issues mentioned above.
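A small Python sketch of the 7-bit argument. The function name `same_in_ansi_and_utf8` is hypothetical, and Windows-1252 is assumed as the ANSI code page:

```python
def same_in_ansi_and_utf8(data: bytes, ansi: str = "cp1252") -> bool:
    """True if the bytes decode to the same text as ANSI and as UTF-8."""
    try:
        return data.decode(ansi) == data.decode("utf-8")
    except UnicodeDecodeError:
        # Not decodable in one of the two encodings, so not interchangeable.
        return False

pure_ascii = b"plain 7-bit text\r\n"
with_copyright = "Copyright \u00a9".encode("cp1252")  # contains byte 0xA9

print(pure_ascii.isascii(), same_in_ansi_and_utf8(pure_ascii))
print(with_copyright.isascii(), same_in_ansi_and_utf8(with_copyright))
```

For pure ASCII input both decodings agree, so relabeling the file as UTF-8 changes nothing on disk; as soon as a byte above 0x7F appears, the two interpretations diverge.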

lhmouse · 2018-08-06T12:44:13Z

In my very first example, the file (in UTF-16) contains a UTF-16 BOM (FF FE in little endian) and no non-ASCII character other than that. It is unexpected for this option to detect the actual encoding, as its description (ASCII as UTF-8) implies that the file should always be reloaded in UTF-8.

RaiKoHoff · 2018-08-06T13:03:23Z

I will debug ...
Ed.: Okay, that is a bug: On "ANSI as UTF-8", the (re-)load file method is called without disabling the "detect Unicode flag", so the Unicode Detection wins ... 👎
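The kind of bug described here can be sketched in Python. This is purely illustrative and not Notepad3's code: `pick_encoding`, its parameters, and the cp1252 fallback are all hypothetical.

```python
import codecs
from typing import Optional

def pick_encoding(data: bytes,
                  forced: Optional[str] = None,
                  detect_unicode: bool = True) -> str:
    """Choose the encoding used to (re)load a file (hypothetical helper)."""
    # Unicode detection runs first, so while it is enabled a BOM
    # overrides whatever encoding the caller asked for.
    if detect_unicode and data.startswith(
            (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Otherwise honor the forced encoding, else fall back to an assumed ANSI.
    return forced or "cp1252"

bom_file = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# Behavior as reported: the request for UTF-8 is ignored, detection wins.
print(pick_encoding(bom_file, forced="utf-8"))
# With detection disabled for the forced reload, the request is honored.
print(pick_encoding(bom_file, forced="utf-8", detect_unicode=False))
```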

zufuliu · 2018-08-06T13:29:48Z

With https://github.com/zufuliu/notepad2, after Reload As UTF-8 it still shows UTF-16 LE BOM, though the file can be reloaded as ANSI or OEM, which then displays some NULs.

lhmouse · 2018-08-06T15:23:49Z

Yes please provide a solution, and I will update my translation.

(FWIW, some feature names in Notepad3 are quite cryptic (e.g. Accelerated Word Navigation) and I have to check the UI to be sure.)

RaiKoHoff · 2018-08-06T18:12:30Z

Please test development beta _X_MUI_4.18.806.1042.
A further translation is needed for a new menu entry (no shortcut yet): Force Compact Encoding Detection,
which forces a reload of the file using the built-in copy of Google's "Compact Encoding Detection" (CED).
CED is used for ANSI code page detection unless the Encoding Detection Settings are configured to skip it (see the discussion in issue #387).

RaiKoHoff · 2018-08-06T18:18:59Z

Regarding "Accelerated Word Navigation":
I would really like to rename this, but I don't know what the best name for it would be.
(Reminder: you can specify word separation characters (besides white-space) in
[Settings2] ExtendedWhiteSpaceChars= to be used for "Accelerated Word Navigation",
so that words in Scintilla (selection by double-click or Ctrl+Left/Right) can include characters
that normally separate words.)
If someone has suggestions, please open another issue about that ...

lhmouse · 2018-08-07T02:51:03Z

Looks good to me.

lhmouse changed the title ~~Behavior of 'ASCII as UTF-8' in 'Encodings'... ?~~ Behavior of 'ASCII as UTF-8' in 'File' -> 'Encoding'... ? Jul 26, 2018

RaiKoHoff added the question label Aug 6, 2018

lhmouse closed this as completed Aug 7, 2018
