UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: #154

Closed
waqasraz opened this issue Jul 4, 2018 · 23 comments

@waqasraz

waqasraz commented Jul 4, 2018

`future: <Task finished coro=<OutputFetcher.task() done, defined at /home/waqas/PycharmProjects/automation_manager/automation_manager/general_automation/commander/output_fetcher.py:49> exception=DisconnectError('Disconnect Error: Unicode decode error',)>
Traceback (most recent call last):
File "/home/waqas/.virtualenv/automation_manager/lib/python3.6/site-packages/asyncssh/channel.py", line 296, in _deliver_data
data = encdata.decode(self._encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/waqas/PycharmProjects/automation_manager/automation_manager/general_automation/commander/output_fetcher.py", line 65, in task
await out_file.write(await router.send_command(command))
File "/home/waqas/PycharmProjects/automation_manager/automation_manager/general_automation/netdev/vendors/base.py", line 208, in send_command
output = await self._read_until_prompt_or_pattern(pattern, re_flags)
File "/home/waqas/PycharmProjects/automation_manager/automation_manager/general_automation/netdev/vendors/base.py", line 264, in _read_until_prompt_or_pattern
output += await asyncio.wait_for(fut, self._timeout)
File "/usr/lib/python3.6/asyncio/tasks.py", line 339, in wait_for
return (yield from fut)
File "/usr/lib/python3.6/asyncio/coroutines.py", line 215, in coro
res = yield from res
File "/home/waqas/.virtualenv/automation_manager/lib/python3.6/site-packages/asyncssh/stream.py", line 444, in read
raise recv_buf.pop(0)
File "/home/waqas/.virtualenv/automation_manager/lib/python3.6/site-packages/asyncssh/connection.py", line 504, in data_received
while self._inpbuf and self._recv_handler():
File "/home/waqas/.virtualenv/automation_manager/lib/python3.6/site-packages/asyncssh/connection.py", line 724, in _recv_packet
processed = handler.process_packet(pkttype, seq, packet)
File "/home/waqas/.virtualenv/automation_manager/lib/python3.6/site-packages/asyncssh/packet.py", line 207, in process_packet
self._packet_handlers[pkttype](self, pkttype, pktid, packet)
File "/home/waqas/.virtualenv/automation_manager/lib/python3.6/site-packages/asyncssh/channel.py", line 521, in _process_data
self._accept_data(data)
File "/home/waqas/.virtualenv/automation_manager/lib/python3.6/site-packages/asyncssh/channel.py", line 351, in _accept_data
self._deliver_data(data, datatype)
File "/home/waqas/.virtualenv/automation_manager/lib/python3.6/site-packages/asyncssh/channel.py", line 308, in _deliver_data
'Unicode decode error')
asyncssh.misc.DisconnectError: Disconnect Error: Unicode decode error
`

@ronf
Owner

ronf commented Jul 4, 2018

This error means that the data you received on the connection was not a valid UTF-8 byte string. Are you sure the server is sending UTF-8 and not some other character encoding? You can disable this conversion and just operate on raw bytes by setting the encoding to None when you open the SSH session, or you can switch to some other encoding if you know what the server is sending.

I can probably make this a bit cleaner by not raising a nested exception here, but the problem appears to be in the data and not in AsyncSSH.

@waqasraz
Author

waqasraz commented Jul 4, 2018

It's a Juniper router. I'm pretty sure it's UTF-8, but let me double-check.

@waqasraz
Author

waqasraz commented Jul 4, 2018

It works most of the time; only some routers cause this issue.

@ronf
Owner

ronf commented Jul 5, 2018

Whether you get the error or not would depend on the specific text being output, so that might explain why you don't always see it. The byte 0xc2 would be a legal (and fairly common) UTF-8 start byte, but what'll matter is whether the byte that follows it is something in the 0x80-0xbf range or not.
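
To illustrate the point about continuation bytes, here is a minimal standard-library sketch. It is not taken from AsyncSSH; the helper names `is_continuation_byte` and `try_decode` are made up for this example.

```python
def is_continuation_byte(b: int) -> bool:
    """Return True if b lies in the UTF-8 continuation range 0x80-0xbf."""
    return 0x80 <= b <= 0xBF


def try_decode(data: bytes) -> str:
    """Decode data as UTF-8; on failure, return the decoder's reason instead of raising."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"error: {exc.reason}"


# 0xc2 followed by a continuation byte forms a valid two-byte sequence.
valid = try_decode(b"\xc2\xa9")    # U+00A9 COPYRIGHT SIGN

# 0xc2 followed by an ASCII byte (outside 0x80-0xbf) is rejected.
invalid = try_decode(b"\xc2\x41")
```

The second call fails with the same "invalid continuation byte" reason that appears in the traceback above.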

Can you enable logging and set the debug level to 3 and capture the specific bytes in the MSG_CHANNEL_DATA message coming from the router? To enable the logging, you'd need to add something like:

    import logging
    import asyncssh

    logging.basicConfig()
    asyncssh.set_log_level('DEBUG')
    asyncssh.set_debug_level(3)

The output would then look something like:

DEBUG:asyncssh:[conn=0, chan=0, pktid=12] Received MSG_CHANNEL_DATA (94), 13 bytes
  00000000: 5e 00 00 00 00 00 00 00 04 66 6f 6f 0a           ^........foo.

@waqasraz
Author

waqasraz commented Jul 5, 2018

00004400: fc 5e 65 37 4a ff e0 d2 52 1c 8f d0 da e5 34 12 .^e7J...R.....4.
00004410: 7e 6c ad 19 90 4c 3b f2 0e 00 00 00 10 5c c1 e3 ~l...L;......\..
00004420: c1 8e 61 b3 da 45 2f 70 07 ba 4b a5 31 77 5e 7b ..a..E/p..K.1w^{
00004430: 1e 2f 85 77 2c c3 5a 8d 60 b7 67 67 43 00 00 00 ./.w,.Z..ggC...
00004440: 40 e7 e9 cb 74 86 b9 94 5e d5 3a 62 51 4d 95 5b @...t...^.:bQM.[
00004450: db 85 41 d5 f2 a5 41 ba 35 41 1a 90 e3 5d c8 d9 ..A...A.5A...]..
00004460: ad fe c1 b3 a8 50 a0 41 96 67 a5 d1 08 64 8e d4 .....P.A.g...d..
00004470: f2 90 11 6d 9d 48 3d 9a 61 ea e7 1b 65 d2 9c fc ...m.H=.a...e...
00004480: 6e ac da 74 3c 41 78 1a ba ca 8b d6 06 0c 13 0a n..t<Ax.........
00004490: 7a z
[2018-07-05 13:22:52,392: DEBUG/Worker-2] [conn=0, chan=0, pktid=802] Received MSG_CHANNEL_CLOSE (97), 5 bytes
00000000: 61 00 00 00 00 a....
`

@ronf
Owner

ronf commented Jul 5, 2018

The full MSG_CHANNEL_DATA message is not shown here, but the portion of it you included appears to be raw binary data, not text. If you are fetching binary data over SSH, you definitely need to set the 'encoding' parameter to None and have your application code expect data of type bytes rather than str.

@waqasraz
Author

waqasraz commented Jul 5, 2018

The whole message is huge, but let me set the encoding to None. Thank you for your help.

@waqasraz
Author

waqasraz commented Jul 5, 2018

Sorry for the inconvenience, but I don't see anything special being requested. It's just opening the session and fetching the data. No binary mode, etc.

    self._stdin, self._stdout, self._stderr = await self._conn.open_session(term_type='Dumb', term_size=(200, 24), )

    fut = self._stdout.read(self._MAX_BUFFER)
    try:
        output += await asyncio.wait_for(fut, self._timeout)
    except asyncio.TimeoutError:

https://github.com/selfuryon/netdev/blob/e3dd7c9148a1f13d32dc3125414e882c7c41d1ee/netdev/vendors/base.py#L128

@ronf
Owner

ronf commented Jul 5, 2018

Yes, and that's the problem. If you want to receive binary data, you have to explicitly set encoding to None, as it defaults to UTF-8. This would be done in the call to open_session() above. For instance:

    self._stdin, self._stdout, self._stderr = \
        await self._conn.open_session(term_type='dumb', term_size=(200, 24), encoding=None)

@waqasraz
Author

waqasraz commented Jul 5, 2018

Yes, I tried that previously. I got

TypeError: must be str, not bytes

at the same location.

@ronf
Owner

ronf commented Jul 5, 2018

Right - as I mentioned above, you need to change the code which handles the data to be expecting a type of bytes rather than str, as you are not dealing with Unicode text. If you are really receiving binary data from the router, there's no way you can convert that data to a Unicode string.
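
As a rough sketch of what "expecting bytes rather than str" can look like in the calling code: the accumulator has to start as `b""` and any regular expression has to be a bytes pattern. Everything here is hypothetical (the `PROMPT` pattern, the `collect_until_prompt` helper, and the simulated chunks); it only stands in for a read loop over a stream opened with `encoding=None`.

```python
import re

# Hypothetical prompt pattern; the rb prefix makes it match bytes, not str.
PROMPT = re.compile(rb"[>#] ?$")


def collect_until_prompt(chunks):
    """Accumulate bytes chunks until the buffer ends with a prompt."""
    output = b""  # must be bytes: adding bytes to a str raises TypeError
    for chunk in chunks:
        output += chunk
        if PROMPT.search(output):
            break
    return output


# Simulated reads, standing in for what a bytes-mode stream would return.
chunks = [b"set interfaces xe-9/2/0 vlan-tagging\n", b"user@router> "]
raw = collect_until_prompt(chunks)
```

Decoding can then be deferred to the point where text is shown to a user, for example with `raw.decode('utf-8', errors='replace')`.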

@waqasraz
Author

waqasraz commented Jul 5, 2018

encoding='latin-1' worked for now; let's see what happens next.

Thank you for your help. I really appreciate it.

@ronf
Owner

ronf commented Jul 5, 2018

It looks like Latin-1 will allow all byte values from 0x00 to 0xff to be translated as-is to Unicode code points of U+0000 to U+00FF, so you might get away with this. It's not really the right solution, though, and I'm guessing it will be quite a bit less efficient than operating directly on bytes objects.
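
The one-to-one mapping described above can be checked with a few lines of standard-library Python (a quick illustration, not part of AsyncSSH):

```python
# Every possible byte value, 0x00 through 0xff.
all_bytes = bytes(range(256))

# Latin-1 never fails: each byte becomes the code point with the same number.
text = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in text]

# Encoding back to Latin-1 recovers the original bytes exactly.
round_tripped = text.encode("latin-1")
```

Because the round trip is lossless, the original bytes can still be recovered later with `text.encode('latin-1')`.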

@waqasraz
Author

waqasraz commented Jul 5, 2018

But if I use bytes to parse and save the output, I'll have to decode at some point for the user to view it, right? What would I do then?

@waqasraz
Author

waqasraz commented Jul 5, 2018

If you can guide me on how to solve this issue, since you have some idea about it, I can try. According to the Juniper docs:

Junos OS escapes and encodes these characters using the equivalent UTF-8 decimal character reference.

@ronf
Owner

ronf commented Jul 5, 2018

The message you posted here appears to contain binary data, at least at the end. I didn't see any text in that message that you'd be able to output to the user. However, I'm guessing that the output you get back will be a mixture of text and binary data, depending on what commands you run. You may need to parse the output to split apart the text from the binary data, and then you'd be able to do something like data.decode('utf-8') on the portions of the text output that you want to display.

Also, if the text portions of the output are encoded as UTF-8, setting the encoding to 'Latin-1' will prevent AsyncSSH from raising a UnicodeDecodeError, but it won't actually return the right Unicode data. So, any non-ASCII characters in the text output won't be displayed correctly to the user. That would be the other reason you'll want to figure out how to split up the text & binary data and then manually decode only the text portion of the output using UTF-8.
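
A small illustration of that mis-decoding, using an arbitrary sample string ("café" is just an example, not router output):

```python
original = "café"

# What a UTF-8 sender would put on the wire: the é becomes two bytes, c3 a9.
wire_bytes = original.encode("utf-8")

# Decoding those bytes as Latin-1 succeeds but yields two wrong characters.
misdecoded = wire_bytes.decode("latin-1")

# Since Latin-1 is lossless, the damage can be undone after the fact.
recovered = misdecoded.encode("latin-1").decode("utf-8")
```

Here `misdecoded` is `'cafÃ©'` rather than `'café'`, which is the kind of garbling a user would see for any non-ASCII text.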

I didn't see any sign of escaping or other forms of encoding of the binary data in the portion of the message you included. If you were running commands that were the kind of thing a user would run by hand using SSH, I would have expected some kind of conversion to ASCII on the binary data and you wouldn't run into this issue, but I didn't see that here.

@waqasraz
Author

waqasraz commented Jul 6, 2018

Actually, I was able to pinpoint the problem, but it doesn't make sense to me. Can you please take a look and see if you have any idea?

Here is the output that should be returned:

set interfaces xe-9/1/3 apply-groups Unused_Port
set interfaces xe-9/2/0 apply-groups BRAS-INTF-PARAMETERS
set interfaces xe-9/2/0 apply-groups MOI-BB15-VPLS-PORT-MIRRORING
set interfaces xe-9/2/0 description " TYP=INT;VID=0;NBR=JD-DSL-E4n2/0/0"
set interfaces xe-9/2/0 vlan-tagging

Here is the output that is returned using Latin-1 encoding:

set interfaces xe-9/1/3 apply-groups Unused_Port
set interfaces xe-9/2/0 apply-groups BRAS-INTF-PARAMETERS
set interfaces xe-9/2/0 apply-groups MOI-BB15-VPLS-PORT-MIRRORING
set interfaces xe-9/2/0 description " TYP=INT;VID=0;NBR=JD-DSL-E4Â-ten2/0/0"
set interfaces xe-9/2/0 vlan-tagging

This looks normal, so why is it causing a problem?

This is how I found the location:

`set interfaces xe-9/1/3 apply-groups Unused_Port

Task exception was never retrieved
future: <Task finished coro=<task() done, defined at /home/waqas/PycharmProjects/testing_network/main.py:16> exception=DisconnectError('Disconnect Error: Unicode decode error',)>
Traceback (most recent call last):
File "/home/waqas/PycharmProjects/testing_network/venv/lib/python3.6/site-packages/asyncssh/channel.py", line 296, in _deliver_data
data = encdata.decode(self._encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte

`

@ronf
Owner

ronf commented Jul 7, 2018

I don’t know what’s happening on the router to cause this, but the byte sequence 0xc2 0x2d is definitely not valid UTF-8, so the error being returned is correct. It’s possible that the hyphen in your message got mangled while passing through e-mail and it was actually a 0xad originally instead of 0x2d, in which case 0xc2 0xad would translate to Unicode U+00AD, which is legal UTF-8 for a “soft hyphen”. However, if the router had actually sent 0xc2 0xad, you wouldn’t have gotten this error. In your output, you also show an extra “te” there, and I don’t know why that would be added.
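
The byte-level claims above can be checked with the standard library. This only illustrates how the two byte sequences behave; it says nothing about what the router actually sent.

```python
import unicodedata

soft_hyphen_bytes = b"\xc2\xad"

# As UTF-8, the pair is a single character: U+00AD.
as_utf8 = soft_hyphen_bytes.decode("utf-8")

# As Latin-1, the same pair becomes two characters, starting with "Â".
as_latin1 = soft_hyphen_bytes.decode("latin-1")
names = [unicodedata.name(ch) for ch in as_latin1]

# 0xc2 followed by an ASCII hyphen (0x2d) is rejected by the UTF-8 decoder.
try:
    b"\xc2\x2d".decode("utf-8")
    hyphen_error = None
except UnicodeDecodeError as exc:
    hyphen_error = exc.reason
```

The Latin-1 result, "Â" followed by a soft hyphen, resembles the "Â-" seen in the Latin-1 output quoted earlier.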

Do you have the raw hex in the debug output of the MSG_CHANNEL_DATA which contains this response to confirm what bytes are actually being sent?

@waqasraz
Author

waqasraz commented Jul 8, 2018

Is it possible to add errors='ignore', or in general something like {encoding="UTF-8", errors="ignore"}? I need to ignore these encoding issues: I talked to a network engineer, who said the problem above is in a description, so someone probably just copy-pasted a description containing Unicode characters.

@ronf
Owner

ronf commented Jul 10, 2018

Adding the ability to specify the Unicode error handler to use seems like a good improvement, and it should be straightforward to add. I'll try to have something in the 'develop' branch shortly, and reply here when it's ready to test.

Note that setting errors='ignore' here will mean invalid Unicode output will be discarded. That should be fine in some cases, but if you were trying to do something like copy configuration from one router to another, you'd be better off with encoding=None and handling everything as bytes, so there's no loss of information.
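
The trade-off between error handlers can be seen with Python's built-in `bytes.decode()`. The sample bytes below are hypothetical, modeled on the description in this thread with a stray 0xc2 inserted; they are not the confirmed bytes from the router.

```python
# Hypothetical input: otherwise ASCII text containing one stray 0xc2 byte.
raw = b"NBR=JD-DSL-E4\xc2-ten2/0/0"

# 'ignore' silently drops the byte that cannot be decoded.
ignored = raw.decode("utf-8", errors="ignore")

# 'replace' substitutes U+FFFD so the damage stays visible.
replaced = raw.decode("utf-8", errors="replace")

# Either way, the original bytes can no longer be reconstructed.
lossy = ignored.encode("utf-8") != raw
```

Keeping `raw` as bytes (encoding=None) is the only one of these options that preserves the input exactly.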

@ronf
Owner

ronf commented Jul 10, 2018

Ok - support for an "errors" argument is now ready to test in the "develop" branch (see commit 39ab119). Methods such as SSHClientConnection's create_session(), create_connection(), create_unix_connection(), create_server(), and create_unix_server() now support this, along with SSHServerConnection's create_connection() and create_unix_connection(). The equivalent functions which return stream or process objects also now support this.

Callers to create_server_channel(), create_tcp_channel(), and create_unix_channel() can also pass in an "errors" argument when customizing other channel parameters.

Support for controlling Unicode error handling is also available via the "session_errors" argument in the top-level AsyncSSH create_server() call (to be used along with "session_encoding"), and whatever is set there will apply to newly created server sessions on that server.

When working with SSH process objects, whatever Unicode error handler is set is also automatically used as the error handler for any I/O redirection which is performed on that process.

Finally, the get_comment() and set_comment() functions that operate on private/public keys and certificates have been updated to accept an "errors" argument as well.

@waqasraz
Author

Awesome, thank you.

@ronf
Owner

ronf commented Jul 24, 2018

This feature is now released in AsyncSSH 1.13.3.

@ronf ronf closed this as completed Jul 24, 2018