pyzet add - UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed; invalid continuation byte #3

Open
1 task done
tpwo opened this issue Oct 28, 2021 · 1 comment
Labels
bug Something isn't working

Comments

Owner

tpwo commented Oct 28, 2021

During migration from hackmd, one of my notes caused the following error:

> pz add
INFO:root:20211028212144 was created
Traceback (most recent call last):
  File "C:\Users\trivvz\repos\github.com\trivvz\pyzet\venv\Scripts\pz-script.py", line 33, in <module>
    sys.exit(load_entry_point('pyzet', 'console_scripts', 'pz')())
  File "c:\users\trivvz\repos\github.com\trivvz\pyzet\src\pyzet\main.py", line 112, in main
    return add_zettel(config)
  File "c:\users\trivvz\repos\github.com\trivvz\pyzet\src\pyzet\main.py", line 189, in add_zettel
    zettel = get_zettel(zettel_path.parent)
  File "c:\users\trivvz\repos\github.com\trivvz\pyzet\src\pyzet\zettel.py", line 39, in get_zettel
    contents = file.readlines()
  File "C:\Program Files\Python39\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7528: invalid continuation byte

The link to the problematic note:

https://hackmd.io/d4GbNi4sR7yf7YQ-HZoHng

I found the root cause: 😉

So it seems that the emoji ends up in the file as bytes that are not valid UTF-8. There is probably a simple workaround, so I hope this can be fixed.
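
As a quick check that UTF-8 itself can represent this emoji, here is a minimal sketch (not taken from pyzet) that encodes 😉 and decodes it back. None of the resulting bytes is 0xed, which suggests that the file held something other than a correctly encoded emoji.

```python
emoji = "😉"  # U+1F609 WINKING FACE

# Encode the emoji to UTF-8 and show the individual bytes.
encoded = emoji.encode("utf-8")
print(encoded.hex(" "))  # f0 9f 98 89

# Decoding the same bytes returns the emoji, so a correctly saved
# file would not raise UnicodeDecodeError for this character.
decoded = encoded.decode("utf-8")
print(decoded == emoji)  # True
```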

  • make pyzet work with emojis (done in d9b3980)
Owner Author

tpwo commented Nov 12, 2021

The root cause of the problem is that vim from git bash on Windows does not seem to handle pasted emojis. When I try to paste 😉, <de09> appears instead, and Python raises the exception with invalid continuation byte when pyzet reads the saved file.
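
The exception can be reproduced outside pyzet with a short sketch. It assumes that Vim wrote the code point U+DE09 as the three bytes ED B8 89, which is a guess based on the 0xed in the traceback and not something confirmed here; `try_decode` is a hypothetical helper.

```python
# Assumption: Vim stored the lone code point U+DE09 as the bytes ED B8 89.
# "surrogatepass" makes Python emit those bytes instead of refusing.
bad_bytes = "\ude09".encode("utf-8", errors="surrogatepass")


def try_decode(data: bytes) -> str:
    """Return the decoded text, or a description of the decoding error."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"{exc.reason} (byte {data[exc.start]:#x} at position {exc.start})"


print(bad_bytes.hex(" "))  # ed b8 89
print(try_decode(b"note: " + bad_bytes))
```

A strict UTF-8 decoder rejects 0xed followed by 0xb8, because that combination would encode a code point in the reserved range U+D800 to U+DFFF.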

<de09> seems to be what is called a surrogate, and a surrogate is something more complicated:

UCS uses surrogates to address characters outside the initial Basic Multilingual Plane without resorting to more-than-16-bit byte representations.
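
The arithmetic behind the quote can be sketched as follows. `surrogate_pair` is a hypothetical helper that splits a code point above U+FFFF into its two UTF-16 surrogates; for 😉 the second half comes out as de09, the value that appeared in vim.

```python
def surrogate_pair(char):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    offset = ord(char) - 0x10000
    if offset < 0:
        raise ValueError("character is inside the Basic Multilingual Plane")
    high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return high, low


high, low = surrogate_pair("😉")
print(f"{high:04x} {low:04x}")  # d83d de09
```

This suggests that only the low half of the pair survived the paste, which would explain why the result is not a valid character on its own.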

But the result seems to be wrong: when the file is saved directly in vim (i.e. outside pyzet), it does not print correctly in the terminal:

> cat .\surrogate-test
This is surrogate test:
���

What is more interesting is that git bash's vim can actually open a file with an emoji. The problem lies only in pasting from an external source: yanking and pasting a line with an emoji inside vim works without issues.

Vim from WSL2 works fine when pasting emojis, and it shares exactly the same configuration file. A quick search suggested that the issue might lie in the locale setting. I don't know how that is configured in git bash, but it is a hint for further searching.

One workaround is to actually use vim from WSL2 in pyzet. WSL2 adds a bash command that is visible from PowerShell, and I found a way to run Vim with its help:

# run the Linux Subsystem "vim" program like it's a normal Windows program
# something like an alias, but really a shortcut to the Windows Subsystem for Linux vim application
function vim ($File){
    # XXX: need to allow variable length argument list so "vim" can be passed options
    # convert Windows path separators to forward slashes for bash
    $File = $File -replace '\\', '/'
    bash -c "vim -- '$File'"
}

But WSL2 is not as standard as git bash, so I don't like this solution that much.
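
For comparison, the same idea could be sketched in Python, which is where pyzet would have to launch the editor. This is only an illustration and not pyzet's code: `wsl_vim_command` and `edit_in_wsl_vim` are hypothetical names, and the example path is made up.

```python
import shlex
import subprocess


def wsl_vim_command(path):
    """Build the argv that opens `path` in WSL's Vim through `bash -c`."""
    # Convert Windows separators, as the PowerShell function does.
    posix_path = path.replace("\\", "/")
    # shlex.quote also copes with spaces and quotes in the file name.
    return ["bash", "-c", f"vim -- {shlex.quote(posix_path)}"]


def edit_in_wsl_vim(path):
    """Run the editor and return its exit code (needs WSL's bash on PATH)."""
    return subprocess.run(wsl_vim_command(path)).returncode


# Hypothetical relative path, used only to show the generated command.
print(wsl_vim_command(r"zettels\20211028212144\README.md"))
```

Like the PowerShell version, this only swaps the separators, so an absolute path such as `C:\...` would probably still need to be translated to its `/mnt/c/...` form.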

I'm leaving this issue open for now. I don't use emojis that much, and they will appear practically only when pasting text authored by someone else. For now, it would need some input sanitization before trying to save the file.
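
One possible stopgap is to make the read tolerant of bad bytes. The sketch below is not what pyzet does; `read_lines_lossy` is a hypothetical helper, and the file contents assume the ED B8 89 byte sequence discussed above.

```python
import tempfile
from pathlib import Path


def read_lines_lossy(path):
    """Read a UTF-8 file, replacing undecodable bytes with U+FFFD."""
    with open(path, encoding="utf-8", errors="replace") as file:
        return file.readlines()


with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "surrogate-test"
    # Assumed contents: a text line followed by the broken bytes.
    note.write_bytes(b"This is surrogate test:\n\xed\xb8\x89\n")
    lines = read_lines_lossy(note)

print(lines)
```

The broken bytes become replacement characters instead of raising an exception, so the zettel stays readable, although the original emoji is still lost.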

tpwo added the bug label on Jan 24, 2022