Behavior of 'ASCII as UTF-8' in 'File' -> 'Encoding'... ? #561

Closed
lhmouse opened this issue Jul 26, 2018 · 17 comments

@lhmouse
Contributor

lhmouse commented Jul 26, 2018

These are encoding settings that I have at the moment:

selection_013


Then I create a new file in UTF-16, such as this one:

selection_012

The encoding column in the status bar says UTF-16 LE and the file is indeed in this encoding. This is correct.


Then I navigate to this menu entry, to reload the file in the default encoding, UTF-8.

selection_016

The contents are now shown interleaved with NUL characters, and the status bar says UTF-8. This is still correct.

selection_014
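
The effect in the last two steps can be reproduced outside the editor. This is only a minimal Python sketch, assuming the file holds the ASCII text hello saved as UTF-16 LE with a BOM (the sample text is made up for illustration):

```python
import codecs

# Hypothetical file content: "hello" saved as UTF-16 LE with a BOM.
data = codecs.BOM_UTF16_LE + "hello".encode("utf-16-le")

# Decoded as UTF-16, the BOM is consumed and the text comes back intact.
as_utf16 = data.decode("utf-16")

# Decoded as UTF-8, the BOM bytes FF FE are invalid (shown here as
# escapes) and every ASCII letter is followed by a NUL character.
as_utf8 = data.decode("utf-8", errors="backslashreplace")

print(repr(as_utf16))
print(repr(as_utf8))
```

The backslashreplace error handler only stands in for the editor's hexadecimal rendering of invalid bytes; it is not how Notepad3 draws them.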


Then I try the ASCII as UTF-8 one...

selection_017

But it does not reload the file in either ASCII or UTF-8. It makes Notepad3 detect the correct encoding and reload the file in the detected encoding, UTF-16.

selection_015


Question

Is this the designed behavior? If it is, the menu entry should be renamed to Detect encoding and recode.

BTW, when I was translating Notepad3 for zh_CN I used the phrase above, as it was the only precise description of what this menu entry did.

@lhmouse lhmouse changed the title Behavior of 'ASCII as UTF-8' in 'Encodings'... ? Behavior of 'ASCII as UTF-8' in 'File' -> 'Encoding'... ? Jul 26, 2018
@zufuliu

zufuliu commented Jul 29, 2018

If I understand this menu correctly, it was added because of my PR XhmikosR/notepad2-mod#200, where it was originally called Reload As UTF-8.

When a file can't be detected as Unicode (UTF-16 LE/BE) or UTF-8 (with or without BOM), Notepad2 opens it using the current system's locale-dependent ANSI encoding. Some characters may then be rendered as question marks (see XhmikosR/notepad2-mod#202).

Reload As UTF-8 (if it works, see zufuliu/notepad2@ec28f0b, zufuliu/notepad2-mod@1897062, and the latest code at https://github.com/zufuliu/notepad2/blob/master/src/Edit.c#L556) forces the ANSI file to be loaded as UTF-8. Invalid UTF-8 sequences are then rendered using their byte values (with a black background, like the rendering of control characters in your screenshot above).

For example, if you open Notepad2(-Mod)'s License.txt on a non-Latin system, the copyright sign is rendered as a question mark. After Reload As UTF-8, it is rendered as \xA9, so you then know what the question mark is (and the possible original encoding) without opening the file in a hex editor.

It's useful when viewing files that contain invalid UTF-8 bytes.
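
The License.txt case can be imitated in a few lines. This is a rough Python sketch, not the code from Edit.c: the sample bytes and the helper name reload_as_utf8 are made up for illustration, and the file is assumed to be saved in Windows-1252, where byte 0xA9 is the copyright sign.

```python
# Hypothetical excerpt of a license file saved in Windows-1252,
# where the single byte 0xA9 encodes the copyright sign.
raw = b"Copyright \xa9 2018"

def reload_as_utf8(data: bytes) -> str:
    """Decode as UTF-8, showing each invalid byte as a \\xNN escape."""
    return data.decode("utf-8", errors="backslashreplace")

shown = reload_as_utf8(raw)        # the lone 0xA9 is not valid UTF-8
original = raw.decode("cp1252")    # what the author actually wrote

print(shown)
print(original)
```

Python prints the escape in lower case (\xa9); the point is only that the byte value stays visible instead of collapsing into a question mark.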

@lhmouse
Contributor Author

lhmouse commented Jul 29, 2018

> Reload As UTF-8 (if it works, see zufuliu/notepad2@ec28f0b, zufuliu/notepad2-mod@1897062, and the latest code at https://github.com/zufuliu/notepad2/blob/master/src/Edit.c#L556) forces the ANSI file to be loaded as UTF-8. Invalid UTF-8 sequences are then rendered using their byte values (with a black background, like the rendering of control characters in your screenshot above).

Thanks for the detailed explanation.

> For example, if you open Notepad2(-Mod)'s License.txt on a non-Latin system, the copyright sign is rendered as a question mark. After Reload As UTF-8, it is rendered as \xA9, so you then know what the question mark is (and the possible original encoding) without opening the file in a hex editor.

> It's useful when viewing files that contain invalid UTF-8 bytes.

It is not, especially for those who know nothing about the UTF-8 specification.

Opening something that is not UTF-8 as UTF-8 is nothing but a mistake. You are trying to mitigate one mistake with another.

Non-ASCII bytes (these include invalid UTF-8 code units) can either:

  1. happen to form valid UTF-8 sequences, which are then rendered as gibberish, in which case your description just does not apply, or
  2. be invalid UTF-8 code units, which are rendered as their hexadecimal values but still provide no information about the original encoding.
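
Both cases can be shown with two short byte strings. A minimal Python sketch, assuming the files were originally written in Windows-1252; the classify helper is hypothetical and only tells the two cases apart:

```python
def classify(data: bytes) -> str:
    """Report how the bytes behave when forced through a UTF-8 decoder."""
    if data.isascii():
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid"
    return "valid"

# Case 1: the Windows-1252 text "Ã©" is the byte pair C3 A9, which
# happens to be a valid UTF-8 sequence for a different character.
case1 = "Ã©".encode("cp1252")
misread = case1.decode("utf-8")

# Case 2: the Windows-1252 text "©" is the single byte A9, which is
# not valid UTF-8 on its own.
case2 = "©".encode("cp1252")

print(classify(case1), repr(misread))
print(classify(case2))
```

In case 1 the decoder silently produces a different character, so nothing signals that the encoding was guessed wrong; in case 2 the byte value alone does not say which code page produced it.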

@zufuliu

zufuliu commented Jul 29, 2018

There are situations where you need to force a file open as UTF-8, especially a file that contains only a few invalid UTF-8 bytes, which may be rendered better that way than with the system's locale-dependent ANSI encoding: a file that was truncated for some reason, part of a large file, etc.

If a user knows nothing about UTF-8 or other encodings, he may never use the encoding menu at all.
For example, most users of Windows Notepad know nothing about encodings (though they are displayed in the Save and Save As dialogs). When some characters can't be encoded in the current ANSI encoding, Notepad warns about the loss of data and lets the user save the file as Unicode. The user only needs to click the OK button and doesn't need to know anything about Unicode (not even the word "Unicode" itself).

I think all Scintilla-based editors are primarily for programmers, who know something about encodings; this is why there is an encoding menu.

Compared to selecting an encoding for the current content (which may change the content and may cause loss of data), the five reload menu items (As UTF-8, As ANSI, As OEM, No FileVars, Default) do not cause loss of data.

Since Notepad2 can be used to open arbitrary files (including binary files), these menus provide something like a little hex viewer.

For 1., since the original file encoding can't be detected, it only matters which option renders the file more meaningfully to the user.
For 2., take the License.txt above as an example: if the copyright sign is rendered as a question mark, it's unknown whether it's a real question mark or not. If it's rendered as \xA9, then after searching online, the isolated byte can be fairly confidently identified as the copyright sign, and the file is most likely encoded in some Latin or Western European encoding.

@zufuliu

zufuliu commented Jul 29, 2018

@zufuliu

zufuliu commented Jul 30, 2018

For DBCS (CJK) environments, invalid (DBCS/ANSI) bytes are rendered as hexadecimal values too.
See the release note for Scintilla 4.1.0 and relevant commits.

https://www.scintilla.org/ScintillaHistory.html
https://sourceforge.net/p/scintilla/code/ci/514fde42ccbf37bc9d1393c6bb1cd30728f7b397/
https://sourceforge.net/p/scintilla/code/ci/507e9b40a637b93646a5e3b690c8141160f16f86/

@zufuliu

zufuliu commented Jul 30, 2018

@lhmouse can you fix https://sourceforge.net/p/scintilla/bugs/2026/ in the MinGW-w64 headers?

A ticket was also created at https://sourceforge.net/p/mingw-w64/bugs/753/

@lhmouse
Contributor Author

lhmouse commented Jul 30, 2018

> I think all Scintilla-based editors are primarily for programmers, who know something about encodings; this is why there is an encoding menu.

People who program aren't necessarily 'programmers'.

> Since Notepad2 can be used to open arbitrary files (including binary files), these menus provide something like a little hex viewer.

It is just wrong to open a non-text file with a text editor, as the contents of a non-text file are not guaranteed to be displayed in hex (they could coincidentally become gibberish).

> @lhmouse can you fix https://sourceforge.net/p/scintilla/bugs/2026/ in the MinGW-w64 headers?
> A ticket was also created at https://sourceforge.net/p/mingw-w64/bugs/753/

Yes. But I suggest you make a patch and send it to our mailing list. The rule of thumb is that we don't review our own patches, so if you had a patch, it would get into master sooner. 😂

@zufuliu

zufuliu commented Jul 30, 2018

@lhmouse first, thank you for the bug fix.

Back to this issue: maybe I used some terms wrongly.

I think a text file whose encoding is unknown is no different from a binary file. The only difference shows when it is opened with a plain-text tool: the former displays some human-readable content, while the latter displays a lot of gibberish. Both depend on which encoding (let's call it Enc) is used to decode the actual binary content (octets/bytes) loaded from disk, network, etc.

Rendering invalid sequences in encoding Enc as hexadecimal values with a black background, much like the Unicode replacement character, is simply a way to display the content of the file instead of crashing the editor or refusing to open it.

I will look into the code to see what the difference is between Reload As UTF-8 and Recode UTF-8; both seem to do a similar thing.

@lhmouse
Contributor Author

lhmouse commented Jul 30, 2018

The encoding of a text file can be guessed using the file utility, which is implemented using libmagic. It does not guarantee 100% accuracy, but it never mistakes a text file for a non-text one.

@RaiKoHoff
Collaborator

RaiKoHoff commented Aug 6, 2018

Regarding "ANSI as UTF-8":
If you have a pure ASCII (7-bit) text file (and you have configured your Windows ANSI code page as your default encoding), this ASCII text will be treated as encoded in your ANSI code page.
The text can equally be treated as UTF-8, since the (7-bit) encoding is the same.
Switching the encoding from ANSI to UTF-8 pops up a warning dialog hinting that switching the encoding might cause problems (if the dialog has not been disabled).
Reloading with "ANSI as UTF-8" skips this warning, because it assumes you know that the file is pure ASCII and that you want UTF-8 encoding on a later save(-as) ...
If there are non-ASCII bytes in the file, this will lead to the rendering issues mentioned above.
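
The claim that pure ASCII reads the same either way can be checked directly. A small Python sketch, assuming Windows-1252 as the ANSI code page; the function name same_in_ansi_and_utf8 is made up for this example:

```python
def same_in_ansi_and_utf8(data: bytes, ansi_codepage: str = "cp1252") -> bool:
    """True when the ANSI code page and UTF-8 decode the bytes identically."""
    try:
        return data.decode(ansi_codepage) == data.decode("utf-8")
    except UnicodeDecodeError:
        # At least one of the two decoders rejects the bytes outright.
        return False

pure_ascii = b"plain 7-bit text"
with_latin1_byte = b"caf\xe9"      # 0xE9 is a valid cp1252 byte, invalid UTF-8
valid_but_different = b"\xc3\xa9"  # decodes under both, to different text

print(same_in_ansi_and_utf8(pure_ascii))
print(same_in_ansi_and_utf8(with_latin1_byte))
print(same_in_ansi_and_utf8(valid_but_different))
```

Only the first input is safe to relabel as UTF-8 without a warning; the other two are the rendering problems discussed earlier in the thread.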

@lhmouse
Contributor Author

lhmouse commented Aug 6, 2018

In my very first example, the file (in UTF-16) contains a UTF-16 BOM (FF FE for little endian) and no other non-ASCII characters. It seems unintended for this option to detect the actual encoding, as its description (ASCII as UTF-8) implies that the file should always be reloaded as UTF-8.

@RaiKoHoff
Collaborator

RaiKoHoff commented Aug 6, 2018

I will debug ...
Ed.: Okay, that is a bug: on "ANSI as UTF-8", the (re-)load file method is called without disabling the "detect Unicode" flag, so the Unicode detection wins ... 👎
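
The shape of that bug can be sketched in a few lines of Python. This is not Notepad3's code: load_text, forced_encoding and detect_unicode are hypothetical names, the detection step is reduced to a single UTF-16 LE BOM check, and cp1252 stands in for the default ANSI code page.

```python
import codecs

def load_text(data, forced_encoding=None, detect_unicode=True):
    """Return (encoding_used, text). All names here are hypothetical."""
    if detect_unicode and data.startswith(codecs.BOM_UTF16_LE):
        # Detection runs first, so a BOM overrides any forced encoding.
        return "utf-16-le", data[2:].decode("utf-16-le")
    encoding = forced_encoding or "cp1252"
    return encoding, data.decode(encoding, errors="backslashreplace")

data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# Buggy call: an encoding is forced, but detection is left enabled.
buggy = load_text(data, forced_encoding="utf-8")

# Fixed call: detection is switched off when the user forces an encoding.
fixed = load_text(data, forced_encoding="utf-8", detect_unicode=False)

print(buggy)
print(fixed)
```

With detection left on, the forced UTF-8 request is ignored and the file comes back as UTF-16 LE, which matches what the first comment observed.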

@zufuliu

zufuliu commented Aug 6, 2018

With https://github.com/zufuliu/notepad2, after Reload As UTF-8, it still shows UTF-16 LE BOM, though the file can be reloaded as ANSI or OEM, which then displays some NULs.

@lhmouse
Contributor Author

lhmouse commented Aug 6, 2018

Yes, please provide a solution, and I will update my translation.

(FWIW, some feature names in Notepad3 are quite cryptic (e.g. Accelerated Word Navigation), and I have to check the UI to be sure.)

@RaiKoHoff
Collaborator

Please test development beta _X_MUI_4.18.806.1042.
Further translation is needed for a new menu entry (no shortcut yet): Force Compact Encoding Detection,
which forces a reload of the file using the built-in copy of Google's "Compact Encoding Detection" (CED).
CED is used for ANSI code page detection if the Encoding Detection Settings are not configured to skip it (see the discussion at issue #387).

@RaiKoHoff
Collaborator

Regarding "Accelerated Word Navigation":
I would really like to rename this, but I don't know what the best name for it is.
(Reminder: you can specify word separation characters (besides white-space) in
[Settings2] ExtendedWhiteSpaceChars= to be used for "Accelerated Word Navigation",
so that words in Scintilla (selection by double-click or Ctrl+Left/Right) can include characters
which normally separate words.)
If someone has suggestions, please open another issue about that ...

@lhmouse
Contributor Author

lhmouse commented Aug 7, 2018

Looks good to me.

@lhmouse lhmouse closed this as completed Aug 7, 2018