Internal Error: invalid UTF-8 byte sequence found during decoding - on ü #1047

Open
IngoHohmann opened this issue Feb 5, 2020 · 8 comments


IngoHohmann commented Feb 5, 2020

>> b: read http://www.google.com                                          
== #{
3C21646F63747970652068746D6C3E3C68746D6C206974656D73636F70653D22
22206974656D747970653D22687474703A2F2F736368656D612E6F72672F5765
...

>> x: copy/part at b 7912 5 
== #{476CFC636B}

>> to text! copy/part x 1   
== "G"

>> to text! copy/part x 2 
== "Gl"

>> to text! copy/part x 3 
** Internal Error: invalid UTF-8 byte sequence found during decoding
** Where: to console
** Near: [... copy/part x 3 ~~]
** Line: 1

>> to text! at x 4         
== "ck"

>> copy/part at x 3 1            
== #{FC}

If opened in the Firefox view source window the text is: Glück
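
The five bytes above are enough to reproduce the failure outside the console. The following is a minimal sketch in Python (used here only for illustration; it is not what the Rebol console runs, and `try_utf8` is a hypothetical helper name). It shows that 0xFC cannot start a UTF-8 sequence, while the same bytes decode cleanly as ISO-8859-1 into the text Firefox displays.

```python
# The five bytes from the console session above: #{476CFC636B}
data = bytes.fromhex("476CFC636B")

def try_utf8(chunk: bytes):
    """Return the decoded text, or None if chunk is not valid UTF-8."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError:
        return None

first_two = try_utf8(data[:2])          # "Gl": plain ASCII is valid UTF-8
first_three = try_utf8(data[:3])        # None: 0xFC is not a valid UTF-8 lead byte
tail = try_utf8(data[3:])               # "ck"
as_latin1 = data.decode("iso-8859-1")   # "Glück": 0xFC is ü in ISO-8859-1
```

This matches the transcript: the first two bytes and the last two bytes convert, and only the slice containing #{FC} fails.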


gchiu commented Feb 5, 2020

>> to text! read https://www.google.de
** Internal Error: invalid UTF-8 byte sequence found during decoding
** Where: to console
** Near: [... text! read https://www.google.de ~~]
** Line: 1


orr721 commented Feb 7, 2020

>> bin-to-string: function [bin [binary!]][
    text: make text! length? bin
    for-each byte bin [append text to char! byte]
    text
]

>> bin-to-string read https://www.google.de
== {<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="sk"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title>

see stackoverflow
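
For readers unfamiliar with why the byte-by-byte workaround never errors: mapping each byte to the code point with the same number is, in effect, an ISO-8859-1 decode. A small Python sketch of the same idea (Python is an illustration language here, and `bin_to_string` simply mirrors the Rebol function above) shows both the benefit and the cost. It cannot fail, but it garbles input that really is UTF-8.

```python
def bin_to_string(data: bytes) -> str:
    """Map each byte to the code point with the same number."""
    return "".join(chr(byte) for byte in data)

latin1_bytes = bytes.fromhex("476CFC636B")   # "Glück" encoded as ISO-8859-1
utf8_bytes = "Glück".encode("utf-8")         # same word in UTF-8 (ü takes 2 bytes)

# Byte-per-character conversion gives the same result as an ISO-8859-1 decode.
same_as_latin1 = bin_to_string(latin1_bytes) == latin1_bytes.decode("iso-8859-1")

# Applied to genuine UTF-8, each byte of a multi-byte sequence becomes its own
# character, so the text comes out garbled instead of raising an error.
garbled = bin_to_string(utf8_bytes)
```

So the workaround is appropriate only when the data is known (or assumed) to be single-byte Latin-1 text.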

hostilefork (Member) commented

Can someone give me an executive summary of this, so I don't have to do too much research?

Is Google serving up invalid UTF-8 (hence Google's problem, IMO) or is it valid (and thus our problem)?

Note: A long time ago I had suggested to BrianH that PARSE seemed a good interface to TRANSCODE. With the "residual" return result, I imagine we could say that:

parse binary [set t text!]

This could be a way of decoding as much UTF-8 as you can and returning the position of any residual bytes. If you get NULL, the decoding succeeded all the way. Something to think about.
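
The "decode what you can, report the residual position" idea can be sketched independently of PARSE. The Python below is only an illustration of the concept, not a proposal for the Ren-C interface; `decode_utf8_prefix` is a hypothetical name, and `None` stands in for NULL.

```python
from typing import Optional, Tuple

def decode_utf8_prefix(data: bytes) -> Tuple[str, Optional[int]]:
    """Decode the longest valid UTF-8 prefix of data.

    Returns (text, None) if everything decoded, otherwise (text, offset)
    where offset is the index of the first byte that could not be decoded.
    """
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as err:
        # err.start is the index of the first undecodable byte, so
        # everything before it is known to be valid UTF-8.
        return data[:err.start].decode("utf-8"), err.start

text, residual = decode_utf8_prefix(bytes.fromhex("476CFC636B"))
```

With the bytes from this issue, `text` is "Gl" and `residual` is 2, the offset of the 0xFC byte.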


orr721 commented Feb 7, 2020

There is another SO page here: https://stackoverflow.com/questions/47108274/read-https-google-com-doesnt-work-anymore-in-red

I found both of them earlier by chance while looking for this error. Both claim it is a problem with Google's UTF-8 encoding. I don't know enough about UTF-8 to check myself. But if it were a problem on Google's side, why are there no complaints from people using Python, etc.? It seems only Rebol/R3-Renc/RED have this problem.

But the fix works, so I didn't investigate further. ¯\_(ツ)_/¯


orr721 commented Feb 7, 2020

Btw I do get null when executing the parse command.


orr721 commented Feb 7, 2020

Ok, I have found the problem:

$ curl -i https://www.google.de
HTTP/2 200 
date: Fri, 07 Feb 2020 22:33:03 GMT
expires: -1
cache-control: private, max-age=0
content-type: text/html; charset=ISO-8859-1

...snip...

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="sk"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
...snip...

There is this SO commentary regarding <meta charset="utf-8"> vs <meta http-equiv="Content-Type"> in HTML:

  • Noted should be that neither is been used for parsing when the page is served over web. Instead, the one in HTTP Content-Type response header will be used. The meta tag is only used when the page is loaded from local disk file system.

So Google is serving ISO-8859-1 according to the HTTP Content-Type header, even though the HTML meta tag says it is UTF-8.
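
A client that honors the HTTP header rather than the meta tag would decode the body roughly as follows. This is a hedged Python sketch using only the standard library; `charset_from_content_type` and `decode_body` are hypothetical helper names, and what to do when the header names no charset is left to the caller.

```python
from email.message import Message
from typing import Optional

def charset_from_content_type(header_value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lowercased name, or None if absent

def decode_body(body: bytes, content_type: str) -> str:
    """Decode body using the charset declared in the HTTP header."""
    charset = charset_from_content_type(content_type)
    if charset is None:
        raise ValueError("Content-Type header declares no charset")
    return body.decode(charset)

# Header value from the curl output above, bytes from the original report.
header = "text/html; charset=ISO-8859-1"
decoded = decode_body(bytes.fromhex("476CFC636B"), header)
```

With the header from the curl output, the bytes from the original report decode to "Glück" without error.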

hostilefork (Member) commented

Well, good to know. :-/ Thanks for digging into it.

I've said that there needs to be a clear organization of the meaning of things like READ vs. LOAD, and how it all works. This is yet-another-piece-of-evidence that READ needs to stay in the world of bytes. LOAD then needs to be able to automatically sense content types and give you what you want, or give you an error if you do not have a codec for it.
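
One way to picture that split, purely as a sketch and not as a description of how Ren-C's READ or LOAD work: the reading step hands back raw bytes plus a media type, and the loading step looks up a codec for that type or fails. The Python below illustrates the shape; the `CODECS` table and the `load` function are hypothetical.

```python
import json
from typing import Callable, Dict

# Hypothetical codec table: media type -> function turning raw bytes into a value.
CODECS: Dict[str, Callable[[bytes], object]] = {
    "text/plain": lambda raw: raw.decode("utf-8"),
    "application/json": lambda raw: json.loads(raw.decode("utf-8")),
}

def load(raw: bytes, media_type: str) -> object:
    """Convert raw bytes to a value, or fail if no codec is registered."""
    codec = CODECS.get(media_type)
    if codec is None:
        raise LookupError(f"no codec registered for {media_type!r}")
    return codec(raw)

value = load(b'{"answer": 42}', "application/json")
```

The byte-level read never has to guess an encoding; all interpretation, and all errors about interpretation, live in the loading step.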

Going to have to put some thought into this; one piece of good news is that by being in the browser, we can experiment through the lens of something where all the network basics are taken care of for us. Then that design could be reused on the desktop based on what we learn.

hostilefork (Member) commented

>> to text! copy/part x 1   
== "G"

As an aside @IngoHohmann - the nature of text and binary is now such that they can be aliased between each other with AS. This does not make a copy, while TO does.

So above, you are copying a chunk out of a binary, then making another copy in order to do the TO.

You could build a single disconnected copy from the binary with as text! copy/part x 1.

After AS is used to alias a BINARY! as a TEXT!, however, that binary is constrained so that every modification must keep it valid UTF-8. In this case that's obviously not a problem for you, since you didn't store the copy anywhere else and hence can't access it as a binary (unless you alias it back). But aliasing it back does not lift the constraint: the data has already been aliased as TEXT!, so the binary view carries the same constraint.
