Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make document's character encoding reflect byte-order marks #5359

Merged
merged 2 commits into from
Mar 24, 2020

Conversation

andreubotella
Copy link
Member

@andreubotella andreubotella commented Mar 16, 2020

This change fixes a bug where document's character encoding was set to the return value of the encoding sniffing algorithm rather than to the actual encoding used, which differed when the stream started with a byte-order mark. This change incorporates BOM sniffing into the encoding sniffing algorithm, ensuring both encodings are identical.

Closes #1077.

This PR must be merged after whatwg/encoding#203.

(See WHATWG Working Mode: Changes for more details.)


/infrastructure.html ( diff )
/parsing.html ( diff )

This change fixes a bug where document's character encoding was set to
the return value of the encoding sniffing algorithm rather than to the
actual encoding used, which differed when the stream started with a
byte-order mark. This change incorporates BOM sniffing into the encoding
sniffing algorithm, ensuring both encodings are identical.

Closes whatwg#1077.
@domenic domenic added the do not merge yet Pull request must not be merged per rationale in comment label Mar 16, 2020
source Show resolved Hide resolved
source Show resolved Hide resolved
Copy link
Member

@domenic domenic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for answering my questions!

annevk pushed a commit to web-platform-tests/wpt that referenced this pull request Mar 19, 2020
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this pull request Mar 21, 2020
…estonly

Automatic update from web-platform-tests
Encoding: BOM detection

For whatwg/html#5359.
--

wpt-commits: 7d9b5a5facaf0b9d986e63e0a031983899d30c79
wpt-pr: 22276
xeonchen pushed a commit to xeonchen/gecko that referenced this pull request Mar 21, 2020
…estonly

Automatic update from web-platform-tests
Encoding: BOM detection

For whatwg/html#5359.
--

wpt-commits: 7d9b5a5facaf0b9d986e63e0a031983899d30c79
wpt-pr: 22276
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this pull request Mar 23, 2020
…estonly

Automatic update from web-platform-tests
Encoding: BOM detection

For whatwg/html#5359.
--

wpt-commits: 7d9b5a5facaf0b9d986e63e0a031983899d30c79
wpt-pr: 22276

UltraBlame original commit: ed0a5ea70a1f04bdad4e54f7dfc89ed9e067fe7a
gecko-dev-updater pushed a commit to marco-c/gecko-dev-comments-removed that referenced this pull request Mar 23, 2020
…estonly

Automatic update from web-platform-tests
Encoding: BOM detection

For whatwg/html#5359.
--

wpt-commits: 7d9b5a5facaf0b9d986e63e0a031983899d30c79
wpt-pr: 22276

UltraBlame original commit: ed0a5ea70a1f04bdad4e54f7dfc89ed9e067fe7a
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified-and-comments-removed that referenced this pull request Mar 23, 2020
…estonly

Automatic update from web-platform-tests
Encoding: BOM detection

For whatwg/html#5359.
--

wpt-commits: 7d9b5a5facaf0b9d986e63e0a031983899d30c79
wpt-pr: 22276

UltraBlame original commit: ed0a5ea70a1f04bdad4e54f7dfc89ed9e067fe7a
source Show resolved Hide resolved
source Show resolved Hide resolved
@annevk annevk removed the do not merge yet Pull request must not be merged per rationale in comment label Mar 24, 2020
@annevk annevk merged commit 21b5234 into whatwg:master Mar 24, 2020
@andreubotella andreubotella deleted the bom-sniff branch March 24, 2020 09:05
andreubotella pushed a commit to andreubotella/servo that referenced this pull request Dec 30, 2020
The process of decoding the network byte stream to Unicode is backed by
an instance of `encoding_rs::Decoder`, which will switch the encoding it
uses if it finds a BOM in the byte stream. However, this change in
encoding is not communicated back to the caller and so
`document.characterSet` gives the wrong result. This change fixes that.

See whatwg/html#5359 and whatwg/encoding#203 for the spec-level backing
for this change.
andreubotella pushed a commit to andreubotella/servo that referenced this pull request Dec 30, 2020
The process of decoding the network byte stream to Unicode is backed by
an instance of `encoding_rs::Decoder`, which will switch the encoding it
uses if it finds a BOM in the byte stream. However, this change in
encoding is not communicated back to the caller and so
`document.characterSet` gives the wrong result. This change fixes that.

See whatwg/html#5359 and whatwg/encoding#203 for the spec-level backing
for this change.

Signed-off-by: Andreu Botella <abb@randomunok.com>
andreubotella pushed a commit to andreubotella/servo that referenced this pull request Dec 31, 2020
The process of decoding the network byte stream to Unicode is backed by
an instance of `encoding_rs::Decoder`, which will switch the encoding it
uses if it finds a BOM in the byte stream. However, this change in
encoding is not communicated back to the caller and so
`document.characterSet` gives the wrong result. This change fixes that.

See whatwg/html#5359 and whatwg/encoding#203 for the spec-level backing
for this change.

Signed-off-by: Andreu Botella <abb@randomunok.com>
bors-servo added a commit to servo/servo that referenced this pull request Jan 1, 2021
Fix `document.characterSet` not reflecting byte order marks.

The process of decoding the network byte stream to Unicode is backed by an instance of `encoding_rs::Decoder`, which will switch the encoding it uses if it finds a BOM in the byte stream. However, this change in encoding is not communicated back to the caller and so `document.characterSet` gives the wrong result. This change fixes that.

See whatwg/html#5359 and whatwg/encoding#203 for the spec-level backing for this change.

---
<!-- Thank you for contributing to Servo! Please replace each `[ ]` by `[X]` when the step is complete, and replace `___` with appropriate data: -->
- [X] `./mach build -d` does not report any errors
- [X] `./mach test-tidy` does not report any errors
- [X] These changes fix #28005 (GitHub issue number if applicable)

<!-- Either: -->
- [X] There are tests for these changes OR
- [ ] These changes do not require tests because ___

<!-- Also, please make sure that "Allow edits from maintainers" checkbox is checked, so that we can help you if you get stuck somewhere along the way.-->

<!-- Pull requests that do not address these steps are welcome, but they will require additional verification as part of the review process. -->
bors-servo added a commit to servo/servo that referenced this pull request Jan 1, 2021
Fix `document.characterSet` not reflecting byte order marks.

The process of decoding the network byte stream to Unicode is backed by an instance of `encoding_rs::Decoder`, which will switch the encoding it uses if it finds a BOM in the byte stream. However, this change in encoding is not communicated back to the caller and so `document.characterSet` gives the wrong result. This change fixes that.

See whatwg/html#5359 and whatwg/encoding#203 for the spec-level backing for this change.

---
<!-- Thank you for contributing to Servo! Please replace each `[ ]` by `[X]` when the step is complete, and replace `___` with appropriate data: -->
- [X] `./mach build -d` does not report any errors
- [X] `./mach test-tidy` does not report any errors
- [X] These changes fix #28005 (GitHub issue number if applicable)

<!-- Either: -->
- [X] There are tests for these changes OR
- [ ] These changes do not require tests because ___

<!-- Also, please make sure that "Allow edits from maintainers" checkbox is checked, so that we can help you if you get stuck somewhere along the way.-->

<!-- Pull requests that do not address these steps are welcome, but they will require additional verification as part of the review process. -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Cleanup around the HTML parser and how it sniffs encodings
3 participants