take_while_m_n is invalid for multi-byte UTF-8 characters #1630

Closed
epage opened this issue Jan 28, 2023 · 0 comments · Fixed by #1651

epage commented Jan 28, 2023

take_while_m_n does:

    match input.position(|c| !cond(c)) {
      Some(idx) => {
        if idx >= m {
          if idx <= n {
            let res: IResult<_, _, Error> = if let Ok(index) = input.slice_index(idx) {
              Ok(input.take_split(index))
            } else {
              Err(Err::Error(Error::from_error_kind(
                input,
                ErrorKind::TakeWhileMN,
              )))
            };

while Input::position is defined as

  /// Returns the byte position of the first element satisfying the predicate
  fn position<P>(&self, predicate: P) -> Option<usize>
  where
    P: Fn(Self::Item) -> bool;

So it looks up the byte position of the first non-matching character but then treats it as an Item count: it compares it with m and n and then passes it to slice_index, which converts an Item count to a byte position. With multi-byte characters the byte position is larger than the Item count, so the m and n bounds are hit sooner than they should be and the input is split at the wrong position.
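The gap between the two quantities can be seen with the standard library alone. This is a minimal sketch: `byte_position` and `item_count` are hypothetical helpers written for illustration, not nom APIs. `byte_position` mirrors the documented behaviour of `Input::position` for `&str`, except that it takes the matching condition and negates it internally.

```rust
/// Byte offset of the first char that does NOT satisfy `cond`
/// (hypothetical helper standing in for `input.position(|c| !cond(c))`).
fn byte_position(input: &str, cond: impl Fn(char) -> bool) -> Option<usize> {
    input.char_indices().find(|&(_, c)| !cond(c)).map(|(i, _)| i)
}

/// Number of leading chars that satisfy `cond` (the Item count).
fn item_count(input: &str, cond: impl Fn(char) -> bool) -> usize {
    input.chars().take_while(|&c| cond(c)).count()
}

fn main() {
    let cond = |c: char| c == 'A' || c == '😃';

    // ASCII: byte position and Item count agree.
    assert_eq!(byte_position("A!", cond), Some(1));
    assert_eq!(item_count("A!", cond), 1);

    // '😃' is 4 bytes in UTF-8: one Item, but byte position 4.
    assert_eq!(byte_position("😃!", cond), Some(4));
    assert_eq!(item_count("😃!", cond), 1);
}
```

For ASCII-only input the two values coincide, which is why the mix-up only shows up with multi-byte characters.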

I see that #913 was previously reported, but 7483885 and #1097 only added the "Item count to byte position" conversion; they didn't address that the value being converted was already a byte position.

The test from #913 passes only because of the else-clause for when there are more valid elements than n: the byte position (4) is greater than n (1), so it is capped to n (1), which happens to be a valid Item count to pass to slice_index.
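The accidental pass can be reproduced with a std-only model of the control flow. This is a sketch under assumptions, not nom's code: `slice_index` and `flawed_take_while_m_n` are hypothetical stand-ins, the case where every character matches is not modelled, errors are collapsed into `None`, and m = 1 is assumed for the #913 test.

```rust
/// Model of `slice_index` for `&str`: byte offset reached after `count` chars.
/// Assumption: `None` when the string has fewer than `count` chars.
fn slice_index(input: &str, count: usize) -> Option<usize> {
    input
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(input.len()))
        .nth(count)
}

/// Simplified model of the flawed logic; returns (remaining, taken).
fn flawed_take_while_m_n(
    input: &str,
    m: usize,
    n: usize,
    cond: impl Fn(char) -> bool,
) -> Option<(&str, &str)> {
    // Byte position of the first non-matching char (the "all match" case is
    // not modelled and simply yields None here).
    let idx = input.char_indices().find(|&(_, c)| !cond(c)).map(|(i, _)| i)?;
    if idx < m {
        return None;
    }
    // Bug being modelled: `idx` is a byte position but is compared with `n`
    // and handed to `slice_index` as if it were an Item count.
    let count = if idx <= n { idx } else { n };
    let split = slice_index(input, count)?;
    let (taken, rest) = input.split_at(split);
    Some((rest, taken))
}

fn main() {
    let cond = |c: char| c == 'A' || c == '😃';
    // With n = 1 the byte position 4 is capped to 1, which is also the
    // correct Item count, so the result is right by coincidence.
    assert_eq!(flawed_take_while_m_n("A!", 1, 1, cond), Some(("!", "A")));
    assert_eq!(flawed_take_while_m_n("😃!", 1, 1, cond), Some(("!", "😃")));
}
```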

To show this failing, we need to change m, n, and the input slightly

  #[test]
  fn take_while_m_n_utf8() {
    named!(parser<&str, &str>, take_while_m_n!(1, 2, |c| c == 'A' || c == '😃'));
    assert_eq!(parser("A!"), Ok(("!", "A")));
    assert_eq!(parser("😃!"), Ok(("!", "😃")));
  }

This fails, with the left-hand side reporting Ok(("", "😃!")). The only reason it doesn't crash is that slice_index, when exhausted, returns self.len().
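For comparison, here is a minimal sketch of a variant that counts Items and tracks the byte offset separately, which yields the results the test expects. `take_while_m_n_chars` is a hypothetical std-only helper for illustration; it is not nom's implementation and is not taken from the fix in #1651.

```rust
/// Takes between `m` and `n` leading chars satisfying `cond`.
/// Returns (remaining, taken), or None if fewer than `m` chars match.
fn take_while_m_n_chars(
    input: &str,
    m: usize,
    n: usize,
    cond: impl Fn(char) -> bool,
) -> Option<(&str, &str)> {
    let mut count = 0; // Items accepted so far
    let mut split = 0; // byte offset just past the last accepted Item
    for (i, c) in input.char_indices() {
        if count == n || !cond(c) {
            break;
        }
        count += 1;
        split = i + c.len_utf8();
    }
    if count < m {
        return None;
    }
    let (taken, rest) = input.split_at(split);
    Some((rest, taken))
}

fn main() {
    let cond = |c: char| c == 'A' || c == '😃';
    assert_eq!(take_while_m_n_chars("A!", 1, 2, cond), Some(("!", "A")));
    assert_eq!(take_while_m_n_chars("😃!", 1, 2, cond), Some(("!", "😃")));
    // The upper bound applies to Items, not bytes.
    assert_eq!(take_while_m_n_chars("😃😃😃", 1, 2, cond), Some(("😃", "😃😃")));
}
```

Keeping the Item count and the byte offset in separate variables means m and n are only ever compared against a count, and `split_at` only ever receives a byte offset on a char boundary.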

I believe this problem predates #1612

ackxolotl added a commit to ackxolotl/nom that referenced this issue Feb 22, 2023
ackxolotl added a commit to ackxolotl/nom that referenced this issue Mar 6, 2023
ackxolotl added a commit to ackxolotl/nom that referenced this issue Mar 7, 2023
Geal added a commit that referenced this issue Mar 15, 2023
fixes #1630

* test slice_index for strings with multibyte chars

* fix take_while_m_n for multibyte UTF-8 chars

* reintroduce Input::iter_indices

Co-authored-by: Simon Ellmann <simon.ellmann@tum.de>