Support non-UTF-8 character encodings for HTML #6414

Open
k-bx opened this issue Jun 18, 2015 · 15 comments

Comments

@k-bx

@k-bx k-bx commented Jun 18, 2015

A popular Ukrainian news website renders poorly on my Mac: all Cyrillic letters are shown as squares.

Steps to reproduce:

./mach run --release http://pravda.com.ua

Thanks for looking at it!

@k-bx
Author

@k-bx k-bx commented Jun 18, 2015

Sorry, I see it uses the windows-1251 encoding like it's the stone age. I'll write them a letter.
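
For illustration, here is a minimal Python sketch of why a windows-1251 page breaks when its bytes are treated as UTF-8. The helper name `decode_page` and the sample word are assumptions made for this example; nothing here is taken from Servo or from the site itself.

```python
def decode_page(raw: bytes, encoding: str) -> str:
    """Decode page bytes, substituting U+FFFD for anything undecodable."""
    return raw.decode(encoding, errors="replace")


# "Правда" as a windows-1251 server would send it: one byte per letter.
raw = "Правда".encode("windows-1251")

# Decoding with the page's real encoding recovers the text.
print(decode_page(raw, "windows-1251"))

# Assuming UTF-8 instead yields replacement characters, because these
# bytes do not form valid UTF-8 sequences.
print(decode_page(raw, "utf-8"))
```

The bytes on the wire are identical in both calls; only the assumed encoding differs.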

@k-bx k-bx closed this Jun 18, 2015
@SimonSapin SimonSapin changed the title from "pravda.com.ua rendering" to "Support non-UTF-8 character encodings for HTML" Jun 18, 2015
@SimonSapin
Member

@SimonSapin SimonSapin commented Jun 18, 2015

That’s probably the issue, but we still need to fix it :)

Relevant spec: https://html.spec.whatwg.org/multipage/#determining-the-character-encoding
Related html5ever issue: servo/html5ever#18

@SimonSapin SimonSapin reopened this Jun 18, 2015
@vectorijk
Contributor

@vectorijk vectorijk commented Oct 8, 2015

@SimonSapin It seems that Chinese and Japanese characters aren't rendered properly on Linux. Do you have any ideas or suggestions on how to get involved and work on this issue? (It looks like a lot still needs to be solved.) Is this an encoding problem or a rendering problem?

@Ms2ger
Contributor

@Ms2ger Ms2ger commented Oct 8, 2015

Looks like there's a rendering problem too; that'd be another issue.

@SimonSapin
Member

@SimonSapin SimonSapin commented Oct 8, 2015

Right, I think we also have bugs in font selection and font rendering, but I don’t know the details right now. To isolate those from this bug, make sure to use UTF-8 for your HTML when testing.

@nhirata

@nhirata nhirata commented Feb 17, 2016

Chinese and Japanese characters are double byte and use UTF-16.

@SimonSapin
Member

@SimonSapin SimonSapin commented Feb 17, 2016

@nhirata Many encodings other than UTF-16 support Chinese and Japanese characters, and “double byte character” is a term typically used in the context of the Shift-JIS encoding. This issue is about supporting both of them and more (https://encoding.spec.whatwg.org/#names-and-labels) through https://github.com/lifthrasiir/rust-encoding
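
As a rough illustration of that point, the Python sketch below encodes the same Japanese string with several encodings from the standard library's codec set and checks that each one round-trips. The helper name `byte_lengths` and the sample string are assumptions for this example; the labels used are Python codec names, which overlap with but are not identical to the WHATWG label list.

```python
def byte_lengths(text: str, labels: list[str]) -> dict[str, int]:
    """Encode `text` with each codec and report how many bytes it takes."""
    sizes = {}
    for label in labels:
        encoded = text.encode(label)
        # Every encoding listed must be able to represent the text losslessly.
        assert encoded.decode(label) == text
        sizes[label] = len(encoded)
    return sizes


labels = ["utf-8", "utf-16-le", "shift_jis", "euc-jp", "iso-2022-jp"]
for label, size in byte_lengths("日本語", labels).items():
    print(f"{label}: {size} bytes")
```

All five encodings represent the same three characters; they differ only in the byte sequences they produce.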

There is already some support in html5ever (servo/html5ever@2f4f64b / servo/html5ever#188) but we still need Servo to use it. I’m looking at this now.

@jdm
Member

@jdm jdm commented Feb 23, 2016

Incomplete attempt to solve this: #9677

@SimonSapin
Member

@SimonSapin SimonSapin commented Feb 23, 2016

With #9677 there would be two remaining pieces:

  • In Servo, get the transport’s encoding information (charset in Content-Type) and communicate it to html5ever. (See Kuchiki’s from_hyper for how to get this from a hyper::client::Response.)
  • In html5ever, implement the rest of the encoding sniffing algorithm, in particular scan for <meta> elements: servo/html5ever#18 (comment) (This may require refactoring how the parser is interrupted by blocking scripts, see #9677 (comment))
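
The two pieces above can be sketched in Python as follows. This is a heavily simplified illustration, not the WHATWG sniffing algorithm and not Servo's or html5ever's code: it skips BOM detection and label normalization, uses a regular expression instead of the spec's prescan, and the names `sniff_encoding` and `META_CHARSET`, the 1024-byte window, and the UTF-8 fallback are all assumptions made for this example.

```python
import re
from email.message import Message

# Hypothetical, simplified stand-in for the spec's <meta> prescan.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def sniff_encoding(content_type, body: bytes, default: str = "utf-8") -> str:
    """Pick an encoding label: transport charset first, then <meta>, then default."""
    if content_type:
        # Let the standard library parse the header's parameters.
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
        if charset:
            return charset
    # Fall back to scanning the start of the document for a declaration.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return default  # assumption: the real fallback is more involved


print(sniff_encoding("text/html; charset=windows-1251", b""))
print(sniff_encoding("text/html", b'<meta charset="Shift_JIS">'))
```

Checking the transport layer before the document content mirrors the order described above, where the Content-Type charset is handed to the parser and the `<meta>` scan only matters when no such information is available.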
@jdm
Member

@jdm jdm commented Mar 10, 2016

Latest attempt was #9730, but there were test failures that need further investigation.

@jdm
Member

@jdm jdm commented Jun 21, 2017

For future reference, servo/html5ever@79ed3e3 and #16989 are related work that has happened. It would be good to figure out what work still remains to hook this up.

@shinglyu
Member

@shinglyu shinglyu commented Aug 23, 2017

Looks like most of the previous work was on html5ever parsing. Are there related bugs in the font rendering module as well?

@SimonSapin
Member

@SimonSapin SimonSapin commented Aug 23, 2017

This issue is about character encodings: converting bytes from the network to Unicode code points. I believe we also have font rendering issues but they’re independent from this and I don’t know if they’re filed.

@shinglyu
Member

@shinglyu shinglyu commented Aug 23, 2017

I see. I did a quick search and found that there are several bugs here and there. Let me open a meta bug to collect them.

@shinglyu shinglyu mentioned this issue Aug 23, 2017
@nox
Member

@nox nox commented Dec 13, 2018

We now read encoding from the HTTP Content-Type header.

9 participants