Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

textinput.rs replace_selection now slices wrt bytes rather than chars #20208

Closed
wants to merge 6 commits into from

Conversation

@sarkhanbayramli
Copy link

sarkhanbayramli commented Mar 6, 2018

Priorly replace_selection used character indices to slice. However, some characters take more than a single byte. Modified replace_selection to index using bytes with "len_of_first_n_code_units".


  • ./mach build -d does not report any errors
  • ./mach test-tidy does not report any errors
  • These changes fix #20028 (github issue number if applicable).
  • There are tests for these changes OR
  • These changes do not require tests because _____

Tested manually using the instructions provided on the issue.


This change is Reviewable

@highfive
Copy link

highfive commented Mar 6, 2018

Thanks for the pull request, and welcome! The Servo team is excited to review your changes, and you should hear from @asajeffrey (or someone else) soon.

@highfive
Copy link

highfive commented Mar 6, 2018

Heads up! This PR modifies the following files:

@highfive
Copy link

highfive commented Mar 6, 2018

warning Warning warning

  • These commits modify script code, but no tests are modified. Please consider adding a test!
@sarkhanbayramli sarkhanbayramli force-pushed the sarkhanbayramli:master branch from ec4653c to 0a5e9cc Mar 6, 2018
@jdm
Copy link
Member

jdm commented Mar 6, 2018

It looks like ./mach test-unit -p script fails:

---- textinput::test_textinput_delete_char stdout ----
	thread 'textinput::test_textinput_delete_char' panicked at 'assertion failed: `(left == right)`
  left: `DOMString("a", PhantomData)`,
 right: `"ab"`', tests/unit/script/textinput.rs:219:5
---- textinput::test_textinput_insert_char stdout ----
	thread 'textinput::test_textinput_insert_char' panicked at 'assertion failed: `(left == right)`
  left: `DOMString("a🌠cb", PhantomData)`,
 right: `"a🌠bc"`', tests/unit/script/textinput.rs:244:5
@sarkhanbayramli
Copy link
Author

sarkhanbayramli commented Mar 6, 2018

That's odd, I ran the unit tests using "./mach test-unit" and only one testcase failed - seemed to be something unrelated.

I'll rerun and provide an update.

@jdm jdm added the S-needs-tests label Mar 6, 2018
@sarkhanbayramli
Copy link
Author

sarkhanbayramli commented Mar 6, 2018

"http_loader::test_redirected_request_to_devtools" is the testcase that failed for me. I must have neglected to run the tests on my most recent changes.

Will update soon.

@jdm
Copy link
Member

jdm commented Mar 6, 2018

That test is known to fail intermittently: #14774.

@sarkhanbayramli
Copy link
Author

sarkhanbayramli commented Mar 6, 2018

I investigated the issue further, turns out I had misunderstood the problem. The code does handle multibyte characters properly, however, it's unable to slice those characters.

I wasn't able to figure out a way of forcing Rust to ignore character boundaries when slicing (this is how Chrome handles the case). Any suggestions?

@jdm
Copy link
Member

jdm commented Mar 8, 2018

@sarkhanbayramli I'm not sure what you're asking. The specification says that setRangeText operates on characters, so a multi-byte code point like in the testcase shouldn't be modified at all (since character 1 is the character directly following the value of the textarea). Running https://software.hixie.ch/utilities/js/live-dom-viewer/?saved=5815 in Chrome or Firefox agrees with me on this point.

@jdm
Copy link
Member

jdm commented Mar 8, 2018

@SimonSapin Do you have any suggestions for the right way to resolve the previously-linked testcase, or is this simply another instance where we need to accept that our UTF-8 strings can break web content?

@SimonSapin
Copy link
Member

SimonSapin commented Mar 8, 2018

@sarkhanbayramli "Multi-byte" is usually used in the context of UTF-8 encoding, where it means non-ASCII (above U+007F). This is different from "multi-code-unit" in UTF-16, which means non-BMP (above U+FFFF). The latter is what’s relevant to Servo’s DOM.

@jdm Yes, using UTF-8 for DOMString means we can’t have the same behavior as other browsers for this test case.

I’ve argued before that we should silently replace would-be lone surrogates with U+FFFD, but this met opposition on the basis that we should keep the current panic that asks people to file bug. Whether replacing would actually break real content (as opposed to, say, occasionally showing instead of ) isn’t well-known. IIRC we mostly found lone surrogates in Acid3 which has a test case specifically for this.

If we decide that having black-box equivalent behavior (preserving lone surrogates) for things like this test case is important, we’ll need to switch to either WTF-16 (a.k.a. "UCS-2 but also kinda UTF-16 if you squint a little", which is what other browser engines use) or WTF-8.

@jdm
Copy link
Member

jdm commented Mar 8, 2018

Ok, so I think the best thing to do here would be to check if splitting would yield an invalid UTF-8 string, then panic with a more helpful message like #6614. We could also use the existing flag that controls whether to replace surrogates or not to optionall perform the appropriate conversion.

@jdm jdm removed the S-awaiting-answer label Mar 8, 2018
@sarkhanbayramli
Copy link
Author

sarkhanbayramli commented Mar 8, 2018

@jdm @SimonSapin thanks for clarifying. I'll update my pr today.

@SimonSapin
Copy link
Member

SimonSapin commented Mar 8, 2018

It looks like the len_of_first_n_code_units function needs to be changed to return an Option that is None when the given n falls "in the middle" of a code point. Something like (untested):

/// The length in bytes of the first n code units a string when encoded in UTF-16.
///
/// If the string is fewer than n code units, returns the length of the whole string.
/// Return `None` if `n` would split a non-BMP code point.
fn len_of_first_n_code_units(text: &str, n: usize) -> Option<usize> {
    if n == 0 {
        return 0
    }
    let mut utf8_len = 0;
    let mut utf16_len = 0;
    for c in text.chars() {
        utf16_len += c.len_utf16();
        utf8_len += c.len_utf8();
        if utf16_len == n {
            beak
        }
        if utf16_len > n {
            return None
        }
    }
    Some(utf8_len)
}
@sarkhanbayramli
Copy link
Author

sarkhanbayramli commented Mar 9, 2018

@SimonSapin I'm a bit confused by your proposed changes. The panic originates at line 382 in "textinput.rs" - the indices used there are not calculated using "len_of_first_n_code_units" in the original code.
Line 382:
let prefix = &self.lines[start.line][..start.index];

Shouldn't we rather add error handling around these slicing lines?
Lines 382, 383:

    let prefix = &self.lines[start.line][..start.index];
    let suffix = &self.lines[end.line][end.index..];
@sarkhanbayramli
Copy link
Author

sarkhanbayramli commented Mar 9, 2018

Something like:

if self.lines[start.line].is_char_boundary(start.index) && self.lines[end.line].is_char_boundary(end.index) {
    let prefix = &self.lines[start.line][..start.index];
    let suffix = &self.lines[end.line][end.index..];
} else {
    panic!("some message"); // is there a better way of doing this?
}
@sarkhanbayramli
Copy link
Author

sarkhanbayramli commented Mar 9, 2018

@jdm Regarding the panic message in the case when indices don't fall on character boundaries, would the following message be considered informative:
"start: 1, end: 1. start is not a char boundary. Rust guarantees Strings to be valid UTF-8 sequences."

@sarkhanbayramli
Copy link
Author

sarkhanbayramli commented Mar 9, 2018

@SimonSapin I wanted to check if I understand the issue correctly:
I believe this is happening because String.fromCodePoint(128) is a multi-byte character and index 1 does not fall on a character boundary - character boundaries are at 0 and 2. By design, Rust guarantees its Strings to be valid UTF-8 sequences. Then, when we try to slice String.fromCodePoint(128) with respect to index 1, Rust panics, because we'd be creating a non-valid UTF-8 sequence.

@SimonSapin
Copy link
Member

SimonSapin commented Mar 9, 2018

Sorry, my last message was based on jdm’s message and seeing len_of_first_n_code_units used in the PR, but I had not read the full context of the issue being fixed.

There’s a distinction that is important to make: indices counted in bytes in an UTF-8 string, and indices counted in code units in a UTF-16 string. Slicing a String or &str value in Rust takes the former, but DOM APIs take the latter. len_of_first_n_code_units does the conversion between them.

It’s also important to understand how code points above U+FFFF are encoded as a pair of surrogate code units in UTF-16: one in the range 0xD800 to 0xDBFF followed by one in the range 0xDC00 to 0xDFFF. The corresponding code points are reserved for this purpose and therefore forbidden in UTF-16. So the trouble for us starts when a DOM on JS API splits up a surrogate pair, or otherwise creates an unpaired surrogate which cannot be represented in UTF-8 like in a Rust &str or String value. For more background: https://simonsapin.github.io/wtf-8/

So String.fromCodePoint(128) in JS creates a U+0080 code point that takes two UTF-8 bytes, but only one UTF-16 code unit. Surrogates are not involved, so we should have no trouble dealing with this case. The offset 1 in setRangeText means one code unit, internally we need to look at the string and see that this one code unit is encoded as two bytes. This is again what len_of_first_n_code_units does.

… but I think the code being modified by this PR is already operating on UTF-8 offsets. Such offsets that are not at code point boundary should not have been created in the first place, so maybe the code that needs to change is somewhere else.

And it would also help to have separate types for the two kinds of offsets (so we don’t accidentally mix them up), but that might be out of scope for this bug:

/// Counting UTF-8 bytes, suitable for slicing `&str`
struct ByteOffset(usize);

/// Counting UTF-16 (or WTF-16) code units. Suitable for parameters and return values in DOM APIs.
struct CodeUnitOffset(usize);
@sarkhanbayramli
Copy link
Author

sarkhanbayramli commented Mar 10, 2018

@SimonSapin Thanks for the write-up, things are much clearer now! Could you also please take a look at our discussion above on how other browsers handle the case? Specifically this example https://software.hixie.ch/utilities/js/live-dom-viewer/?saved=5816. It seems like other browsers also treat the indices provided to setRangeText as UTF-8 byte indices rather than code unit indices.

@SimonSapin
Copy link
Member

SimonSapin commented Mar 10, 2018

🌠 U+1F320 is not just multi-byte in UTF-8, it’s also multi (two) code units in UTF-16: 0xD83C 0xDF20. Offset 1 in JS/DOM falls in the middle of this surrogate pair, so when inserting a U+0061 there we end up with 0xD83C 0x0061 0xDF20, including two unpaired surrogate code units.

Updating fork
@sarkhanbayramli
Copy link
Author

sarkhanbayramli commented Mar 12, 2018

I modified len_of_first_n_code_units as per @SimonSapin's recommendation, set_selection_range to convert its start and end arguments - numbers of code units, to bytes using the new len_of_first_n_code_units.

Also added a new testcase to tests/unit/script/textinput.rs:test_textinput_set_selection_with_direction to test the new behaviour.

All the test passed when ran "./mach test-unit -p script", "./mach test-unit" and "./mach test-tidy".

@jdm
Copy link
Member

jdm commented Mar 12, 2018

@highfive highfive assigned SimonSapin and unassigned asajeffrey Mar 12, 2018
@sarkhanbayramli
Copy link
Author

sarkhanbayramli commented Mar 17, 2018

@SimonSapin would you be able to take a look at my changes?

@jdm
Copy link
Member

jdm commented Mar 22, 2018

Review oing, @SimonSapin.

@SimonSapin
Copy link
Member

SimonSapin commented Mar 24, 2018

Sorry for the delay. Besides this specific diff, on closer looks much of the surrounding code seems to be a big mess about UTF-16 indices v.s. UTF-8 indices. For example set_dom_range_text takes an start: Option<u32> that comes from the DOM, then proceeds to start = start.unwrap_or_else(|| self.start()) with that start method calling selection_start_offset which is documented as returning an UTF-8 byte count.

I feel that properly fixing this issue requires a much larger refactor.

@jdm
Copy link
Member

jdm commented Mar 27, 2018

I've talked with Simon, and I agree that the current attempt to fix this in this PR isn't really getting at the underlying problem. I've opened #20455 to build the foundations for solving this properly, and I'm going to close this PR because it's difficult to reason about this code and the right solution without those foundations being in place.

@jdm jdm closed this Mar 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

5 participants
You can’t perform that action at this time.