textinput.rs replace_selection now slices wrt bytes rather than chars #20208
Conversation
highfive
commented
Mar 6, 2018
|
Thanks for the pull request, and welcome! The Servo team is excited to review your changes, and you should hear from @asajeffrey (or someone else) soon. |
highfive
commented
Mar 6, 2018
|
Heads up! This PR modifies the following files:
|
highfive
commented
Mar 6, 2018
…cters
|
It looks like
|
|
That's odd, I ran the unit tests using "./mach test-unit" and only one testcase failed - seemed to be something unrelated. I'll rerun and provide an update. |
|
"http_loader::test_redirected_request_to_devtools" is the testcase that failed for me. I must have neglected to run the tests on my most recent changes. Will update soon. |
|
That test is known to fail intermittently: #14774. |
|
I investigated the issue further, turns out I had misunderstood the problem. The code does handle multibyte characters properly, however, it's unable to slice those characters. I wasn't able to figure out a way of forcing Rust to ignore character boundaries when slicing (this is how Chrome handles the case). Any suggestions? |
|
@sarkhanbayramli I'm not sure what you're asking. The specification says that setRangeText operates on characters, so a multi-byte code point like in the testcase shouldn't be modified at all (since character 1 is the character directly following the value of the textarea). Running https://software.hixie.ch/utilities/js/live-dom-viewer/?saved=5815 in Chrome or Firefox agrees with me on this point. |
|
@SimonSapin Do you have any suggestions for the right way to resolve the previously-linked testcase, or is this simply another instance where we need to accept that our UTF-8 strings can break web content? |
|
@sarkhanbayramli "Multi-byte" is usually used in the context of UTF-8 encoding, where it means non-ASCII (above U+007F). This is different from "multi-code-unit" in UTF-16, which means non-BMP (above U+FFFF). The latter is what’s relevant to Servo’s DOM.
@jdm Yes, using UTF-8 for
I’ve argued before that we should silently replace would-be lone surrogates with U+FFFD, but this met opposition on the basis that we should keep the current panic that asks people to file a bug. Whether replacing would actually break real content (as opposed to, say, occasionally showing
If we decide that having black-box equivalent behavior (preserving lone surrogates) for things like this test case is important, we’ll need to switch to either WTF-16 (a.k.a. "UCS-2 but also kinda UTF-16 if you squint a little", which is what other browser engines use) or WTF-8. |
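To make the "multi-byte" versus "multi-code-unit" distinction concrete, here is a minimal sketch using only the Rust standard library. The helper names `encoded_lengths` and `decode_lossy` are hypothetical and not part of Servo; the second one only illustrates what replacing a lone surrogate with U+FFFD looks like, not how Servo would do it.

```rust
/// Hypothetical helper: returns (UTF-8 byte length, UTF-16 code unit length)
/// for a single `char`.
fn encoded_lengths(c: char) -> (usize, usize) {
    (c.len_utf8(), c.len_utf16())
}

/// Hypothetical helper: decodes UTF-16 code units into a `String`,
/// replacing every unpaired surrogate with U+FFFD.
fn decode_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

With these, `'a'` is one byte and one code unit, `'\u{e9}'` is multi-byte (two bytes) but still a single code unit, and `'\u{1F600}'` is four bytes and two code units (a surrogate pair). Passing a high surrogate without its low half to `decode_lossy` yields U+FFFD in its place.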
|
Ok, so I think the best thing to do here would be to check if splitting would yield an invalid UTF-8 string, then panic with a more helpful message like #6614. We could also use the existing flag that controls whether to replace surrogates or not to optionally perform the appropriate conversion. |
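A rough sketch of the "check before slicing" idea, assuming the offsets are UTF-8 byte offsets. Both function names and the wording of the panic message are made up for illustration; they are not taken from Servo or from #6614.

```rust
/// Hypothetical helper: returns `text[begin..end]` if both offsets are in
/// bounds and fall on UTF-8 character boundaries, otherwise `None`.
fn checked_slice(text: &str, begin: usize, end: usize) -> Option<&str> {
    // `str::get` performs the bounds and boundary checks without panicking.
    text.get(begin..end)
}

/// Hypothetical helper: like `checked_slice`, but panics with a message that
/// says what went wrong instead of the generic slicing panic.
fn slice_or_explain(text: &str, begin: usize, end: usize) -> &str {
    match checked_slice(text, begin, end) {
        Some(slice) => slice,
        None => panic!(
            "cannot slice bytes {}..{} of a {}-byte string: \
             an offset is out of bounds or splits a character",
            begin,
            end,
            text.len()
        ),
    }
}
```

For the string `"a\u{1F600}b"`, slicing `1..5` returns the emoji, while `1..3` lands in the middle of its four bytes, so `checked_slice` returns `None` and `slice_or_explain` panics with the descriptive message.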
|
@jdm @SimonSapin thanks for clarifying. I'll update my pr today. |
|
It looks like the
/// The length in bytes of the first n code units of a string when encoded in UTF-16.
///
/// If the string has fewer than n code units, returns the length of the whole string.
/// Returns `None` if `n` would split a non-BMP code point.
fn len_of_first_n_code_units(text: &str, n: usize) -> Option<usize> {
    if n == 0 {
        return Some(0);
    }
    let mut utf8_len = 0;
    let mut utf16_len = 0;
    for c in text.chars() {
        utf16_len += c.len_utf16();
        utf8_len += c.len_utf8();
        if utf16_len == n {
            break;
        }
        if utf16_len > n {
            // `n` falls between the two halves of a surrogate pair.
            return None;
        }
    }
    Some(utf8_len)
} |
|
@SimonSapin I'm a bit confused by your proposed changes. The panic originates at line 382 in "textinput.rs" - the indices used there are not calculated using "len_of_first_n_code_units" in the original code. Shouldn't we rather add error handling around these slicing lines?
|
|
Something like:
|
|
@jdm Regarding the panic message in the case when indices don't fall on character boundaries, would the following message be considered informative: |
|
@SimonSapin I wanted to check if I understand the issue correctly: |
|
Sorry, my last message was based on jdm’s message and seeing
There’s a distinction that is important to make: indices counted in bytes in a UTF-8 string, and indices counted in code units in a UTF-16 string. Slicing a
It’s also important to understand how code points above U+FFFF are encoded as a pair of surrogate code units in UTF-16: one in the range 0xD800 to 0xDBFF followed by one in the range 0xDC00 to 0xDFFF. The corresponding code points are reserved for this purpose and therefore forbidden in UTF-16. So the trouble for us starts when a DOM or JS API splits up a surrogate pair, or otherwise creates an unpaired surrogate which cannot be represented in UTF-8 like in a Rust
So … but I think the code being modified by this PR is already operating on UTF-8 offsets. Such offsets that are not at a code point boundary should not have been created in the first place, so maybe the code that needs to change is somewhere else.
And it would also help to have separate types for the two kinds of offsets (so we don’t accidentally mix them up), but that might be out of scope for this bug:
/// Counting UTF-8 bytes, suitable for slicing `&str`
struct ByteOffset(usize);
/// Counting UTF-16 (or WTF-16) code units. Suitable for parameters and return values in DOM APIs.
struct CodeUnitOffset(usize); |
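One possible way to build on the two offset types is to make every conversion between them explicit. The sketch below is only an assumption about how that could look: the derives and the functions `to_byte_offset` and `to_code_unit_offset` are hypothetical and do not come from Servo. Unlike `len_of_first_n_code_units`, this version returns `None` for offsets past the end of the string, which is a design choice made here for illustration.

```rust
/// Counting UTF-8 bytes, suitable for slicing `&str`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ByteOffset(usize);

/// Counting UTF-16 code units, as used by DOM APIs.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct CodeUnitOffset(usize);

/// Hypothetical conversion: the byte offset that corresponds to `offset`
/// UTF-16 code units into `text`. Returns `None` if the offset is past the
/// end of the string or falls inside a surrogate pair.
fn to_byte_offset(text: &str, offset: CodeUnitOffset) -> Option<ByteOffset> {
    let mut units = 0;
    for (bytes, c) in text.char_indices() {
        if units == offset.0 {
            return Some(ByteOffset(bytes));
        }
        if units > offset.0 {
            // The previous character was a surrogate pair that `offset` splits.
            return None;
        }
        units += c.len_utf16();
    }
    if units == offset.0 {
        Some(ByteOffset(text.len()))
    } else {
        None
    }
}

/// Hypothetical conversion in the other direction. Returns `None` if the
/// byte offset is out of bounds or not on a character boundary.
fn to_code_unit_offset(text: &str, offset: ByteOffset) -> Option<CodeUnitOffset> {
    text.get(..offset.0)
        .map(|prefix| CodeUnitOffset(prefix.encode_utf16().count()))
}
```

Because the two wrappers are distinct types, passing a `CodeUnitOffset` where a `ByteOffset` is expected becomes a compile error rather than a runtime panic in the slicing code.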
|
@SimonSapin Thanks for the write-up, things are much clearer now! Could you also please take a look at our discussion above on how other browsers handle the case? Specifically this example https://software.hixie.ch/utilities/js/live-dom-viewer/?saved=5816. It seems like other browsers also treat the indices provided to |
|
|
Updating fork
|
I modified
Also added a new testcase to
All the tests passed when I ran "./mach test-unit -p script", "./mach test-unit" and "./mach test-tidy". |
|
r? @SimonSapin |
|
@SimonSapin would you be able to take a look at my changes? |
|
Review ping, @SimonSapin. |
|
Sorry for the delay. Besides this specific diff, on a closer look much of the surrounding code seems to be a big mess about UTF-16 indices vs. UTF-8 indices. For example
I feel that properly fixing this issue requires a much larger refactor. |
|
I've talked with Simon, and I agree that the current attempt to fix this in this PR isn't really getting at the underlying problem. I've opened #20455 to build the foundations for solving this properly, and I'm going to close this PR because it's difficult to reason about this code and the right solution without those foundations being in place. |
sarkhanbayramli commented Mar 6, 2018 (edited)
Previously, replace_selection used character indices to slice. However, some characters take more than a single byte. Modified replace_selection to index using bytes with "len_of_first_n_code_units".
./mach build -d does not report any errors
./mach test-tidy does not report any errors
Tested manually using the instructions provided on the issue.
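As a small illustration of why a character index cannot be used directly as a slicing index, the following sketch maps a character count to the matching byte offset. The helper `byte_index_of_char` is hypothetical and is not the code from this PR; it counts Rust `char`s, not UTF-16 code units.

```rust
/// Hypothetical helper: the byte offset at which the `n`th character of
/// `text` starts. `n` equal to the number of characters maps to `text.len()`;
/// anything larger yields `None`.
fn byte_index_of_char(text: &str, n: usize) -> Option<usize> {
    text.char_indices()
        .map(|(byte_index, _)| byte_index)
        // Also allow the position just past the last character.
        .chain(std::iter::once(text.len()))
        .nth(n)
}
```

For `"h\u{e9}llo"`, the first two characters occupy three bytes, so `byte_index_of_char(text, 2)` gives `Some(3)`, and `&text[..3]` is a valid slice whereas `&text[..2]` would panic.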