Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: wpt upstreamer fails when git process output contains invalid UTF-8 #29620

Closed
delan opened this issue Apr 12, 2023 · 0 comments · Fixed by #29622
Closed

CI: wpt upstreamer fails when git process output contains invalid UTF-8 #29620

delan opened this issue Apr 12, 2023 · 0 comments · Fixed by #29622
Labels
A-build Related to or part of the build process

Comments

@delan
Copy link
Sponsor Member

delan commented Apr 12, 2023

Out of curiosity, I tried running the new upstreamer (#29601) against #29594 with a merge from master, which is currently more challenging for the script because it applies the commits one at a time.

https://github.com/servo/servo/actions/runs/4675613330/jobs/8280932770?pr=29594

As expected, the script processed a lot of WPT changes made outside my PR, but it failed on them in an interesting way:

INFO:root:  → Execution (cwd='./servo'): git diff b47513006044f769c6c34952028dc163de77982c b47513006044f769c6c34952028dc163de77982c~6 -- tests/wpt/web-platform-tests/
ERROR:root:'utf-8' codec can't decode byte 0xa2 in position 687773: invalid start byte
Traceback (most recent call last):
  File "/home/runner/work/servo/servo/servo/etc/ci/upstream-wpt-changes/wptupstreamer/__init__.py", line 182, in run
    self.handle_new_pull_request_contents(run, pull_data)
  File "/home/runner/work/servo/servo/servo/etc/ci/upstream-wpt-changes/wptupstreamer/__init__.py", line 202, in handle_new_pull_request_contents
    self.local_servo_repo.run(
  File "/home/runner/work/servo/servo/servo/etc/ci/upstream-wpt-changes/wptupstreamer/__init__.py", line 66, in run
    ).decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 687773: invalid start byte
Error: Process completed with exit code 1.

This is because LocalGitRepo.run assumes that the output can be decoded as UTF-8, but git-diff(1) and git-show(1) output can contain arbitrary high bytes unless git gives up and decides the file is binary.

We should leave git-diff(1) and git-show(1) output as bytes where possible, and if we really need to process them as str (for example, to find --- and +++), we should consider using surrogateescape to treat it as ASCII with opaque high bytes.

We should also check that we handle weird paths correctly, even though Windows and macOS support means we won’t need to deal with any invalid UTF-8. Most git commands like git status and git diff will use a special double-quote-octal escaping for paths with any high bytes (or double quotes), whereas -z modes like git status -z and git diff --numstat -z will happily spit out arbitrary high bytes.

% touch foo bér $'\xFF' '"'
% git add foo bér $'\xFF' '"'
% git status -s    
A  "\""
A  "b\303\251r"
A  foo
A  "\377"

% git diff --cached | rg '^diff '
diff --git "a/\"" "b/\""
diff --git "a/b\303\251r" "b/b\303\251r"
diff --git a/foo b/foo
diff --git "a/\377" "b/\377"

% git status -z | tr \\0 \\n
A  "
A  bér
A  foo
A  �

% git diff --cached --numstat -z | tr \\0 \\n
0	0	"
0	0	bér
0	0	foo
0	0	�

more details

Inspecting the output with utf8check:

% git diff b47513006044f769c6c34952028dc163de77982c b47513006044f769c6c34952028dc163de77982c~6 -- tests/wpt/web-platform-tests/ | ~/code/utf8check/utf8check             
688567: unexpected continuation byte; ASCII or start byte expected
688779: unexpected continuation byte; ASCII or start byte expected
691366: unexpected continuation byte; ASCII or start byte expected
% git diff b47513006044f769c6c34952028dc163de77982c b47513006044f769c6c34952028dc163de77982c~6 -- tests/wpt/web-platform-tests/ | dd status=none bs=1 skip=688000 | less
% git diff b47513006044f769c6c34952028dc163de77982c b47513006044f769c6c34952028dc163de77982c~6 -- tests/wpt/web-platform-tests/ | dd status=none bs=1 skip=691000 | less
(finding the offending paths)
% git log -- tests/wpt/web-platform-tests/html/syntax/charset/inheritance-bogus-meta.html tests/wpt/web-platform-tests/html/syntax/charset/resources/bogus-charset.html | rg '^commit '
commit bb34f95b33cd3b919fda221408720e7e6dea84ab
% git show bb34f95b33cd3b919fda221408720e7e6dea84ab tests/wpt/web-platform-tests/html/syntax/charset/inheritance-bogus-meta.html tests/wpt/web-platform-tests/html/syntax/charset/resources/bogus-charset.html | ~/code/utf8check/utf8check -s
commit bb34f95b33cd3b919fda221408720e7e6dea84ab
Author: WPT Sync Bot <josh+wptsync@joshmatthews.net>
Date:   Fri Apr 7 01:27:34 2023 +0000

    Update web-platform-tests to revision b'1d9b01e2fad6af3a057d571b1e088e15fa9bc8e6'

diff --git a/tests/wpt/web-platform-tests/html/syntax/charset/inheritance-bogus-meta.html b/tests/wpt/web-platform-tests/html/syntax/charset/inheritance-bogus-meta.html
new file mode 100644
index 0000000000..616ba51853
--- /dev/null
+++ b/tests/wpt/web-platform-tests/html/syntax/charset/inheritance-bogus-meta.html
@@ -0,0 +1,46 @@
+<!doctype html>
+<meta charset=windows-1253>
+<title>Inheriting from windows-1253 parent</title>
+<script src=/resources/testharness.js></script>
+<script src=/resources/testharnessreport.js></script>
+<script src=/common/get-host-info.sub.js></script>
+<div id=log></div>
+<script>
+[
+  {
+    "title": "Child with bogus <meta charset>",
+    "url": "resources/bogus-charset.html",
+    "expected": "�\n" // 0x00A2 in windows-1252 is at the same position as 0x0386 in windows-1253
+  },
+  {
+    "title": "Child with bogus Content-Type charset",
+    "url": "resources/bogus-charset-http.py",
+    "expected": "�\n"
+  },
+  {
+    "title": "Child with bogus Content-Type charset, but valid <meta charset>",
+    "url": "resources/bogus-charset-http-valid-meta.py",
+    "expected": "\u045E\n"
+  }
+].forEach(({ title, url, expected }) => {
+  async_test(t => {
+    const frame = document.createElement("iframe");
+    t.add_cleanup(() => frame.remove());
+    frame.src = url;
+    frame.onload = t.step_func_done(() => {
+      assert_equals(frame.contentDocument.body.textContent, expected);
+    });
+    document.body.append(frame);
+  }, title);
+});
+
+async_test(t => {
+  self.onmessage = t.step_func_done(({ data }) => {
+    assert_equals(data, "\u00A2\n");
+  });
+  const frame = document.createElement("iframe");
+  t.add_cleanup(() => frame.remove());
+  frame.src = get_host_info().HTTP_REMOTE_ORIGIN + new URL("resources/bogus-charset.html", location).pathname + "?postMessage";
+  document.body.append(frame);
+}, "Cross-origin child with bogus <meta charset>");
+</script>
diff --git a/tests/wpt/web-platform-tests/html/syntax/charset/resources/bogus-charset.html b/tests/wpt/web-platform-tests/html/syntax/charset/resources/bogus-charset.html
new file mode 100644
index 0000000000..e77e874466
--- /dev/null
+++ b/tests/wpt/web-platform-tests/html/syntax/charset/resources/bogus-charset.html
@@ -0,0 +1,7 @@
+<!doctype html><meta charset=this-is-not-a-charset><script>
+onload = () => {
+  if (location.search === "?postMessage") {
+    parent.postMessage(document.body.textContent, "*");
+  }
+}
+</script>�

The three � characters are A2h bytes, which are valid in Windows-1252 but not UTF-8.

@delan delan added the A-build Related to or part of the build process label Apr 12, 2023
mrobinson added a commit to mrobinson/servo that referenced this issue Apr 12, 2023
The output of `git diff` and `git show` can include non-UTF8 text or
binary data, so the WPT upstream script should handle this properly.

Fixes servo#29620.
mrobinson added a commit to mrobinson/servo that referenced this issue Apr 12, 2023
The output of `git diff` and `git show` can include non-UTF8 text or
binary data, so the WPT upstream script should handle this properly.

Fixes servo#29620.
mrobinson added a commit to mrobinson/servo that referenced this issue Apr 12, 2023
The output of `git diff` and `git show` can include non-UTF8 text or
binary data, so the WPT upstream script should handle this properly.

Fixes servo#29620.
bors-servo added a commit that referenced this issue Apr 12, 2023
Handle non-UTF8 diffs in the WPT upstream script

The output of `git diff` and `git show` can include non-UTF8 text or binary data, so the WPT upstream script should handle this properly.

Fixes #29620.

<!-- Please describe your changes on the following line: -->

---
<!-- Thank you for contributing to Servo! Please replace each `[ ]` by `[X]` when the step is complete, and replace `___` with appropriate data: -->
- [x] `./mach build -d` does not report any errors
- [x] `./mach test-tidy` does not report any errors
- [x] These changes fix #29620
- [x] There are tests for these changes

<!-- Also, please make sure that "Allow edits from maintainers" checkbox is checked, so that we can help you if you get stuck somewhere along the way.-->

<!-- Pull requests that do not address these steps are welcome, but they will require additional verification as part of the review process. -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-build Related to or part of the build process
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant