Encode/decode JS entities works on one byte at a time and is not reversible #1

Closed
rdipardo opened this issue May 10, 2023 · 4 comments
Labels
enhancement New feature or request major

Comments

@rdipardo
Owner

Original report by Anonymous.


  • Updated to Notepad++ 8.3.3 x64
  • HtmlTag was missing after update
  • Re-installed HtmlTag

Test:
Following characters require encoding: ä ö ü ß
After encoding:
Following characters require encoding: ä ö ü ß
After decoding encoded text:
Following characters require encoding: ä ö ü ß

@rdipardo rdipardo added enhancement New feature or request major labels May 10, 2023
@rdipardo
Owner Author

The decoding algorithm can only handle single-byte sequences. So, this works:

\u00E4 \u00F6 \u00FC \u00DF (decode =>) ä ö ü ß

But this is broken:

ä ö ü ß (encode =>) \u00C3\u00A4 \u00C3\u00B6 \u00C3\u00BC \u00C3\u0178

In a UTF-8 file, each of these characters occupies 2 bytes, and the algorithm encodes each byte separately instead of encoding the character as a whole.
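To make the difference concrete, here is a minimal Python sketch (not the plugin's actual code; both function names are made up for illustration). It assumes the stray bytes are being read as Windows-1252, which is what the `\u0178` in the output above suggests, since byte 0x9F maps to U+0178 in that code page.

```python
def encode_js_bytewise(text: str) -> str:
    # Hypothetical reproduction of the broken behaviour: every UTF-8 byte is
    # treated as its own character (assumed here to be read as Windows-1252).
    misread = text.encode("utf-8").decode("cp1252")
    return "".join(c if ord(c) < 128 else f"\\u{ord(c):04X}" for c in misread)


def encode_js_codepoints(text: str) -> str:
    # Escape whole characters instead: one escape per UTF-16 code unit,
    # so characters outside the BMP become a surrogate pair.
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            out.append(f"\\u{int.from_bytes(units[i:i + 2], 'big'):04X}")
    return "".join(out)


print(encode_js_bytewise("ä ö ü ß"))    # \u00C3\u00A4 \u00C3\u00B6 \u00C3\u00BC \u00C3\u0178
print(encode_js_codepoints("ä ö ü ß"))  # \u00E4 \u00F6 \u00FC \u00DF
```

The first function reproduces the broken output shown above; the second gives the escapes that decode back to the original text.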

That's a limitation of the original author's design (based on pre-Unicode Notepad++). It affects both the 32- and 64-bit versions.

As I said in 82f9b0e,

More work still needed before utf8mb4 can be encoded *correctly*

Fixing this will be part of that overall task.
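For context on why utf8mb4 is the harder case: characters that need four bytes in UTF-8 lie outside the Basic Multilingual Plane, so a JS `\uXXXX` escape can only express them as a surrogate pair. A rough Python sketch of a decoder that handles this (the function name is hypothetical, and this is not how the plugin is implemented):

```python
import re

_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_js_entities(text: str) -> str:
    # Turn each \uXXXX escape into its UTF-16 code unit...
    units = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # ...then round-trip through UTF-16 so surrogate pairs merge into one
    # code point. "surrogatepass" keeps unpaired surrogates instead of raising.
    return units.encode("utf-16-be", "surrogatepass").decode(
        "utf-16-be", "surrogatepass"
    )


print(decode_js_entities(r"\u00E4 \u00F6 \u00FC \u00DF"))  # ä ö ü ß
```

Decoding one escape at a time, without the merge step, would leave `\uD83D\uDE00` as two separate surrogates rather than a single character.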

@rdipardo
Owner Author

If you're running at least Windows 10, here's a way to resolve this issue for the time being:

  • Go to the Control Panel, then "Clock and Region", and select "Change date, time, or number formats"
  • Click the "Administrative" tab
  • Click "Change system locale..."
  • Check the box labelled "Beta: Use Unicode UTF-8 for worldwide language support" (a reboot is required)


Here is N++ 8.3.3 (64-bit) on Windows 10 21H2, with the updated system encoding:


The plugin is most likely calling a standard library function that uses the system's default encoding. What it should do is encode the document's text as Unicode every time, not rely on Windows.
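As a rough illustration of that point in Python (an analogy only, not the plugin's code; the function names are invented), compare decoding with whatever the system reports against decoding with the document's own encoding:

```python
import locale


def decode_with_system_default(raw: bytes) -> str:
    # Depends on the machine: the result changes with the system locale.
    return raw.decode(locale.getpreferredencoding(False), errors="replace")


def decode_with_document_encoding(raw: bytes, encoding: str = "utf-8") -> str:
    # Depends only on the document: same result on every machine.
    return raw.decode(encoding)


raw = "ä ö ü ß".encode("utf-8")
print(decode_with_document_encoding(raw))            # ä ö ü ß
print(decode_with_document_encoding(raw, "cp1252"))  # Ã¤ Ã¶ Ã¼ ÃŸ
```

The second call simulates a legacy single-byte system code page, and its output lines up with the broken escapes reported earlier. That would also explain why switching the system locale to UTF-8 makes the symptom go away.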

@rdipardo
Owner Author

Original comment by Björn Klug (Bitbucket: [Björn Klug](https://bitbucket.org/Björn Klug/workspace/repositories)).


I tried your workaround (the "Beta: Use Unicode UTF-8 for worldwide language support" checkbox), but it broke all my MS Access 2010 applications, so that is not a viable solution for me.

Since I use this plugin quite frequently, I'd be very interested in your estimate of when this bug will be fixed.

[EDIT] Just found the download of version 1.2.2 at https://bitbucket.org/rdipardo/htmltag/downloads/ which works fine again. Thanks!

@rdipardo
Owner Author

Fixed in d2189a1
