Support non-UTF-8 character encodings for HTML #6414
Comments
|
Sorry, I see it uses windows-1251 encoding like it's the stone age. I'll write them a letter. |
|
That’s probably the issue, but we still need to fix it :) Relevant spec: https://html.spec.whatwg.org/multipage/#determining-the-character-encoding |
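For anyone following along, here is a rough sketch of the precedence that spec section describes: a byte order mark wins, then the charset from the HTTP `Content-Type` header, then a `<meta>` prescan of the first 1024 bytes, then a fallback. This is heavily simplified and is not how Servo or html5ever implement it; `guess_encoding` and the regexes are made up for illustration, and the real algorithm has more steps (label normalization, user overrides, frame inheritance).

```python
import re
from typing import Optional

# Byte order marks checked first; a BOM overrides every other signal.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
]

# Simplified patterns (assumption: the real prescan is a small state machine, not a regex).
HEADER_CHARSET = re.compile(r'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)
META_CHARSET = re.compile(rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def guess_encoding(body: bytes, content_type: Optional[str] = None,
                   fallback: str = "windows-1252") -> str:
    """Return a lowercase encoding label for an HTML byte stream."""
    # 1. BOM sniffing.
    for bom, name in BOMS:
        if body.startswith(bom):
            return name
    # 2. Transport layer: charset parameter of the Content-Type header.
    if content_type:
        match = HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    # 3. Prescan the first 1024 bytes for a <meta> charset declaration.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Nothing declared: fall back to a default.
    return fallback


print(guess_encoding(b"<html>", "text/html; charset=windows-1251"))
print(guess_encoding(b'<meta charset="Shift_JIS"><p>...</p>'))
```

The site in this report would hit step 2 or 3 and come out as `windows-1251`, which is the case that currently goes wrong.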
|
@SimonSapin It seems that Chinese and Japanese characters aren't rendered properly on Linux. Do you have any ideas or suggestions on how to get involved and work on this issue? (It seems like a lot of things need to be solved.) Is this an encoding problem or a rendering problem? |
|
Looks like there's a rendering problem too; that'd be another issue. |
|
Right, I think we also have bugs in font selection and font rendering, but I don’t know the details right now. Make sure to use UTF-8 for your HTML when testing to isolate from this bug. |
|
Chinese and Japanese characters are double byte and use UTF-16. |
|
@nhirata Many encodings other than UTF-16 support Chinese and Japanese characters, and “double byte character” is a term typically used in the context of the Shift-JIS encoding. This issue is about supporting both of them and more (https://encoding.spec.whatwg.org/#names-and-labels) through https://github.com/lifthrasiir/rust-encoding. There is already some support in html5ever (servo/html5ever@2f4f64b / servo/html5ever#188) but we still need Servo to use it. I’m looking at this now. |
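To make the point about multiple encodings concrete, here is a small sketch using Python's built-in codecs (not rust-encoding) that encodes the same Japanese string several ways and checks that each round-trips. The helper name `encoded_sizes` is made up for this example.

```python
def encoded_sizes(text, encodings):
    """Encode `text` with each codec and return {codec name: byte length}."""
    sizes = {}
    for name in encodings:
        data = text.encode(name)
        # Every one of these encodings can represent the text losslessly.
        assert data.decode(name) == text
        sizes[name] = len(data)
    return sizes


sizes = encoded_sizes("日本語", ["utf-8", "utf-16-le", "shift_jis", "euc_jp"])
for name, size in sizes.items():
    print(name, size, "bytes")
```

The three characters take 9 bytes in UTF-8 and 6 bytes in each of UTF-16, Shift_JIS and EUC-JP, so "two bytes per character" does not identify any one encoding.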
|
Incomplete attempt to solve this: #9677 |
|
With #9677 there would be two remaining pieces:
|
|
Latest attempt was #9730, but there were test failures that need further investigation. |
|
For future reference, servo/html5ever@79ed3e3 and #16989 are related work that has happened. It would be good to figure out what work still remains to hook this up. |
|
Looks like most previous work was on html5ever parsing. Are there related bugs in the font rendering module as well? |
|
This issue is about character encodings: converting bytes from the network to Unicode code points. I believe we also have font rendering issues but they’re independent from this and I don’t know if they’re filed. |
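As an illustration of that bytes-to-code-points step (again in Python rather than Servo's Rust, and with a made-up sample string and helper name), the same windows-1251 bytes decode cleanly with the right label and turn into replacement characters when treated as UTF-8:

```python
def to_code_points(raw, encoding):
    """Decode network bytes and list the resulting code points as U+XXXX strings."""
    text = raw.decode(encoding, errors="replace")
    return text, ["U+%04X" % ord(ch) for ch in text]


# Stand-in for bytes received from a server that uses windows-1251.
raw = "Привіт".encode("windows-1251")

right, right_points = to_code_points(raw, "windows-1251")
wrong, wrong_points = to_code_points(raw, "utf-8")

print(right, right_points)
print(wrong, wrong_points)  # contains U+FFFD replacement characters
```

Text that decodes correctly but still shows as squares points at font selection or rendering, which is the separate problem mentioned above.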
|
I see. I did a quick search and found that there are several bugs here and there. Let me open a meta bug to collect them. |
|
We now read encoding from the HTTP |
A popular Ukrainian news website is rendered poorly on my Mac. All Cyrillic letters are rendered as squares.
Steps to reproduce:
Thanks for looking at it!