Behavior of 'ASCII as UTF-8' in 'File' -> 'Encoding'... ? #561
Comments
If I understand this menu correctly, it was added because of my PR XhmikosR/notepad2-mod#200, originally called Reload As UTF-8. When a file can't be detected as Unicode (UTF-16 LE/BE) or UTF-8 (with or without BOM), Notepad2 opens it using the current system's locale-dependent ANSI encoding, and some characters may then be rendered as question marks (see XhmikosR/notepad2-mod#202). Reload As UTF-8 (if it works; see zufuliu/notepad2@ec28f0b, zufuliu/notepad2-mod@1897062, and the latest code at https://github.com/zufuliu/notepad2/blob/master/src/Edit.c#L556) forces the ANSI file to be loaded as UTF-8; invalid UTF-8 sequences are then rendered using their byte values (with a black background, like the rendering of control characters in your screenshot above). For example, if you open Notepad2(-Mod)'s License.txt on a non-Latin system, the copyright sign is rendered as a question mark; after Reload As UTF-8, it is rendered as its byte value. It's useful when viewing a file that contains invalid UTF-8 bytes.
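To make the rendering described above concrete, here is a minimal Python sketch (not Notepad2's actual code, which is C in Edit.c). It decodes an ANSI-encoded line as UTF-8 and keeps each invalid byte visible as its hex value. The sample bytes and the helper name `force_utf8` are made up for this illustration, and Windows-1252 is assumed as the original ANSI code page.

```python
def force_utf8(raw: bytes) -> str:
    """Decode raw bytes as UTF-8, showing each invalid byte as \\xNN."""
    return raw.decode("utf-8", errors="backslashreplace")


# Hypothetical sample: a copyright line saved in Windows-1252,
# where the single byte 0xA9 is the copyright sign.
sample = b"Copyright \xa9 2004"

as_ansi = sample.decode("cp1252")  # correct only if the code page matches
as_utf8 = force_utf8(sample)       # 0xA9 alone is not valid UTF-8

print(as_ansi)
print(as_utf8)
```

Valid UTF-8 input passes through `force_utf8` unchanged, so only the stray bytes become visible as hex.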
Thanks for the detailed explanation.
It is not, especially for those who know nothing about the specification of UTF-8. Opening something that is not UTF-8 as UTF-8 is nothing but a mistake. You were trying to mitigate one mistake with another. Non-ASCII bytes (these include invalid UTF-8 code units) can either:
There are situations where you need to force a file open as UTF-8, especially a file that contains a few invalid UTF-8 bytes, which may be rendered better that way than with the system's locale-dependent ANSI encoding: a file that was truncated for some reason, part of a large file, etc. If users know nothing about UTF-8 or other encodings, they may never use the encoding menu at all. I think all Scintilla-based editors are primarily for programmers, who know something about encodings; this is why there is an encoding menu. Compared to selecting an encoding for the current content (which may change the content and may cause loss of data), the five reload menu items (As UTF-8, As ANSI, As OEM, No FileVars, Default) do not cause loss of data. Since Notepad2 can be used to open arbitrary files (including binary files), these menus provide something like a little hex viewer. For 1, since the original file encoding can't be detected, it only matters which option renders the file more meaningfully to the user.
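The distinction between reloading and selecting an encoding for the current content can be sketched in Python as follows. This is an illustration under assumptions, not the editor's implementation: "reload" is modeled as decoding the same on-disk bytes again, "selecting an encoding" as re-encoding the current text, and the sample bytes and helper names `reload_as` and `recode` are hypothetical.

```python
# Bytes as stored on disk; hypothetical sample ("café" in Windows-1252).
on_disk = b"caf\xe9"


def reload_as(raw: bytes, encoding: str) -> str:
    """Reinterpret the same bytes; the bytes on disk are not modified."""
    return raw.decode(encoding, errors="replace")


def recode(text: str, encoding: str) -> bytes:
    """Convert the current text to another encoding.

    Characters the target encoding cannot represent are replaced by '?',
    so information can be lost.
    """
    return text.encode(encoding, errors="replace")


view_ansi = reload_as(on_disk, "cp1252")  # readable text
view_utf8 = reload_as(on_disk, "utf-8")   # invalid byte becomes U+FFFD in the view
lossy = recode(view_ansi, "ascii")        # the accented letter is gone for good

print(repr(view_ansi), repr(view_utf8), lossy, on_disk)
```

However the bytes are viewed, `on_disk` stays the same, so another reload can always recover a different interpretation; the recoded bytes cannot be turned back into the original.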
Some discussions on rendering invalid bytes:
In DBCS (CJK) environments, invalid (DBCS/ANSI) bytes are rendered using their hexadecimal values too. https://www.scintilla.org/ScintillaHistory.html
@lhmouse can you fix https://sourceforge.net/p/scintilla/bugs/2026/ in the MinGW-w64 headers? A ticket was also created at https://sourceforge.net/p/mingw-w64/bugs/753/
People who program aren't necessarily 'programmers'.
It is just wrong to open a non-text file with a text editor, as the contents of a non-text file are not guaranteed to be displayed in hex (they could coincidentally become gibberish).
Yes. But I suggest you make a patch and then send it to our mailing list. The rule of thumb is that we don't review our own patches. So if you had a patch, it would be quicker to get it into
@lhmouse first, thank you for the bug fix. Back to this issue: maybe I used some terms wrongly. I think a text file whose encoding is unknown is no different from a binary file; the only difference is that when you open it with a plain-text tool, the former displays some human-readable content and the latter displays a lot of gibberish, and both depend on what encoding (let's call it Invalid sequences in Encoding I will look into the code to see what the difference is between Reload As UTF-8 and Recode UTF-8; both seem to do a similar thing.
The encoding of a text file can be guessed using the GNU file utility, which is implemented using libmagic. It does not guarantee 100% accuracy, but it never mistakes a text file for a non-text one.
Regarding "ANSI as UTF-8":
In my very first example, the file (in UTF-16) contains a UTF-16 BOM (
I will debug ...
With https://github.com/zufuliu/notepad2, after Reload As UTF-8, it still shows UTF-16 LE BOM, though it can be reloaded as ANSI or OEM, in which case it displays some NULs.
Yes, please provide a solution, and I will update my translation. (FWIW, some feature names in Notepad3 are quite cryptic (e.g. Accelerated Word Navigation), and I have to check the UI to be sure.)
Please test development beta _X_MUI_4.18.806.1042.
Regarding "Accelerated Word Navigation":
Looks good to me. |
These are encoding settings that I have at the moment:
Then I create a new file in UTF-16, such as this one:
The encoding column in the status bar says UTF-16 LE and the file is indeed in this encoding. This is correct.
Then I navigate to this menu entry, to reload the file in the default encoding, UTF-8.
The contents are now shown interleaved with NUL characters and the status bar says UTF-8. This is still correct.
Then I try the ASCII as UTF-8 one...
But it does not reload the file in either ASCII or UTF-8. It makes Notepad3 detect the correct encoding and reload the file in the detected encoding, UTF-16.
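The observation above is consistent with a loader that checks for a byte order mark before applying the requested encoding. The Python sketch below only illustrates that idea; it is not Notepad3's code. The helper name `sniff_bom` and the two-character sample are assumptions, and UTF-32 BOMs are ignored for brevity.

```python
import codecs


def sniff_bom(raw: bytes):
    """Return (encoding_name, bom_length) for a leading BOM, or (None, 0)."""
    known = (
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    )
    for bom, name in known:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0


# Hypothetical file content: "Hi" saved as UTF-16 LE with a BOM.
raw = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")

# Forcing UTF-8: the BOM bytes are invalid UTF-8 and each letter is followed by NUL.
forced = raw.decode("utf-8", errors="backslashreplace")

# Honoring the BOM: the original text comes back, whatever encoding was requested.
name, skip = sniff_bom(raw)
detected = raw[skip:].decode(name)

print(repr(forced))
print(name, repr(detected))
```

If the BOM check runs first, a menu entry that asks for UTF-8 ends up showing UTF-16 LE, which matches what the status bar reports here.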
Question
Is this the designed behavior? If it is, the menu entry could be renamed to Detect encoding and recode.
BTW, when I was translating Notepad3 for zh_CN I used the phrase above, as it was the only precise description of what this menu entry did.