Provide methods for finding the normalized prefix of input #4256

hsivonen · 2023-11-07T11:48:52Z

ComposingNormalizer and DecomposingNormalizer currently provide methods is_normalized(), is_normalized_utf8(), and is_normalized_utf16(). If the return value is false and the application then decides to normalize, the normalization-related data structure lookups are done twice for the (potential) already-normalized prefix.

ComposingNormalizer and DecomposingNormalizer should provide methods is_normalized_up_to(), is_normalized_utf8_up_to(), and is_normalized_utf16_up_to() that return a usize: the largest value, no larger than the length of the input, for which the following assert passes:

fn test_is_normalized_up_to(input: &str) {
  // set up normalizer `norm`
  let up_to = norm.is_normalized_up_to(input);
  let (head, tail) = input.split_at(up_to);
  let mut normalized = String::from(head);
  let _ = norm.normalize_to(tail, &mut normalized);
  assert!(norm.is_normalized(&normalized));
}

Then this should become a valid alternative implementation of is_normalized():

pub fn is_normalized(&self, text: &str) -> bool {
    self.is_normalized_up_to(text) == text.len()
}
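
For illustration, here is a minimal sketch of the caller-side pattern this would enable: copy the already-normalized prefix verbatim and run the normalizer only over the tail. It does not use ICU4X. `ToyDecomposer`, its two-entry mapping table, and the `normalize()` helper are hypothetical stand-ins invented for this example, and the toy ignores canonical reordering, so it is not real NFD.

```rust
use std::borrow::Cow;

/// Toy stand-in for a decomposing normalizer (NOT ICU4X and NOT real NFD):
/// it knows only two precomposed characters and ignores canonical reordering.
struct ToyDecomposer;

impl ToyDecomposer {
    /// Hypothetical mapping table: precomposed char -> decomposed sequence.
    fn decomposition(c: char) -> Option<&'static str> {
        match c {
            '\u{00E9}' => Some("e\u{0301}"), // e with acute
            '\u{00F1}' => Some("n\u{0303}"), // n with tilde
            _ => None,
        }
    }

    /// Byte length of the longest prefix already in the toy's normal form.
    fn is_normalized_up_to(&self, text: &str) -> usize {
        text.char_indices()
            .find(|&(_, c)| Self::decomposition(c).is_some())
            .map_or(text.len(), |(i, _)| i)
    }

    /// The alternative `is_normalized()` from the issue text.
    fn is_normalized(&self, text: &str) -> bool {
        self.is_normalized_up_to(text) == text.len()
    }

    /// Appends the toy-normalized form of `text` to `sink`.
    fn normalize_to(&self, text: &str, sink: &mut String) {
        for c in text.chars() {
            match Self::decomposition(c) {
                Some(d) => sink.push_str(d),
                None => sink.push(c),
            }
        }
    }

    /// Caller-side pattern: borrow if the whole input is already normalized,
    /// otherwise copy the prefix as-is and normalize only the tail.
    fn normalize<'a>(&self, text: &'a str) -> Cow<'a, str> {
        let up_to = self.is_normalized_up_to(text);
        if up_to == text.len() {
            return Cow::Borrowed(text);
        }
        let (head, tail) = text.split_at(up_to);
        let mut out = String::with_capacity(text.len());
        out.push_str(head); // no lookups repeated for the prefix
        self.normalize_to(tail, &mut out);
        Cow::Owned(out)
    }
}
```

With this shape, the prefix is scanned once by `is_normalized_up_to()` and is never fed through the normalizer again, which is the double work the issue wants to avoid.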

Gecko use case: https://searchfox.org/mozilla-central/rev/e94bcd536a2a4caad0597d1b2d624342e6a389c4/intl/components/src/String.h#132

(Note that ICU4X deliberately doesn't implement quick check, which Gecko currently uses for the prefix computation.)

hsivonen · 2023-11-07T11:49:34Z

CC @CanadaHonk

hsivonen added the C-collator (Component: Collation, normalization) and U-gecko (User: Gecko) labels Nov 7, 2023

CanadaHonk pushed a commit to CanadaHonk/icu4x that referenced this issue Nov 20, 2023

Add is_normalized_up_tos

cdd4ffc

Closes unicode-org#4256. No UTF16 tests or fuzzing yet.

CanadaHonk mentioned this issue Nov 20, 2023

Add is_normalized_up_to to Normalizer #4334

Merged

CanadaHonk pushed a commit to CanadaHonk/icu4x that referenced this issue Nov 20, 2023

Add is_normalized_up_tos

553413e

Closes unicode-org#4256. No UTF16 tests or fuzzing yet.

CanadaHonk pushed a commit to CanadaHonk/icu4x that referenced this issue Nov 20, 2023

Add is_normalized_up_tos

a89c460

Closes unicode-org#4256. No UTF16 tests or fuzzing yet. Also added UTF8 variant to FFI as `is_normalized_up_to`.

CanadaHonk pushed a commit to CanadaHonk/icu4x that referenced this issue Nov 20, 2023

Add is_normalized_up_tos

16c2a4b

Closes unicode-org#4256. No UTF16 tests or fuzzing yet. Also added UTF8 variant to FFI as `is_normalized_up_to`.

CanadaHonk pushed a commit to CanadaHonk/icu4x that referenced this issue Nov 20, 2023

Add is_normalized_up_tos

8240b4c

Closes unicode-org#4256. No UTF16 tests or fuzzing yet. Also added UTF8 variant to FFI as `is_normalized_up_to`.

CanadaHonk pushed a commit to CanadaHonk/icu4x that referenced this issue Nov 20, 2023

Add is_normalized_up_to to Normalizer

1681ae9

Closes unicode-org#4256. Added UTF8 variant to FFI as `is_normalized_up_to`. No UTF16 tests or fuzzing yet.

CanadaHonk pushed a commit to CanadaHonk/icu4x that referenced this issue Nov 20, 2023

Add is_normalized_up_to to Normalizer

289a27a

Closes unicode-org#4256. Added UTF8 variant to FFI as `is_normalized_up_to`. No UTF16 tests or fuzzing yet.

CanadaHonk pushed a commit to CanadaHonk/icu4x that referenced this issue Apr 23, 2024

Add is_normalized_up_to to Normalizer

1665408

Closes unicode-org#4256. Added UTF8 variant to FFI as `is_normalized_up_to`. No UTF16 tests or fuzzing yet.

hsivonen closed this as completed in #4334 Jul 10, 2024

hsivonen closed this as completed in 34c0a2e Jul 10, 2024
