[stdlib] Implement native normalization for String #38922

Merged: 4 commits, Oct 1, 2021

Conversation

Contributor

@Azoy Azoy commented Aug 18, 2021

This is still a WIP, but I wanted to get some benchmarks and tests run on it.

Resolves: rdar://51635207, https://bugs.swift.org/browse/SR-9432

@Azoy
Contributor Author

Azoy commented Aug 18, 2021

@swift-ci please benchmark

@Azoy
Contributor Author

Azoy commented Aug 18, 2021

@swift-ci please test


@Azoy
Contributor Author

Azoy commented Aug 18, 2021

@swift-ci please benchmark


@Azoy
Contributor Author

Azoy commented Aug 24, 2021

@swift-ci please benchmark

@Azoy
Contributor Author

Azoy commented Aug 25, 2021

@swift-ci please benchmark


@Azoy
Contributor Author

Azoy commented Aug 26, 2021

@swift-ci please benchmark


@milseman
Member

@swift-ci please benchmark


@karwa
Contributor

karwa commented Aug 30, 2021

Awesome! I have a couple of comments though:

  1. It would be really valuable if we could make this available to non-String collections of Unicode text.
  2. Have you considered not decoding the UTF-8 code units into full scalars, and leaving them in their encoded form instead? I've found decoding to be surprisingly expensive. Yet if you handle (or don't care about) overlong encodings, there is a 1:1 mapping between an "encoded scalar" and its decoded value, and thanks to the way UTF-8 works, they sort the same way. That means they are usable in lookup tables, and it is possible to match ranges of scalars by bit-masking the UTF-8-encoded version.
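
To make the second point concrete, here is a minimal sketch, not code from this PR. `encodedKey` and `isInCyrillicBlock` are hypothetical helpers, and the Cyrillic block is just a convenient example range. It packs each scalar's UTF-8 bytes into an integer and shows that the packed values order the same way as the scalar values, and that a range can be matched with a mask on the encoded form.

```swift
// Hypothetical helper: pack a scalar's UTF-8 bytes into one integer,
// first byte most significant, without using the decoded scalar value.
func encodedKey(_ scalar: Unicode.Scalar) -> UInt32 {
    var key: UInt32 = 0
    for byte in String(scalar).utf8 {
        key = (key << 8) | UInt32(byte)
    }
    return key
}

// Hypothetical range check done purely on the encoded form. U+0400...U+04FF
// (the Cyrillic block) is exactly the set of two-byte encodings whose lead
// byte is 0xD0...0xD3, so masking off the low bits is enough.
func isInCyrillicBlock(encoded key: UInt32) -> Bool {
    return key & 0xFFFF_FC00 == 0x0000_D000
}

// Scalars listed in ascending scalar-value order.
let scalars: [Unicode.Scalar] = ["a", "\u{E9}", "\u{416}", "\u{4E2D}", "\u{1F600}"]
let keys = scalars.map(encodedKey)

// The encoded keys are already in ascending order too.
print(keys == keys.sorted())
print(scalars.map { isInCyrillicBlock(encoded: encodedKey($0)) })
```

This assumes well-formed UTF-8; overlong encodings would break the 1:1 mapping, as noted above.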

Finally, as excited as I am to see progress here, my excitement is dampened (just a liiiitle bit) by the use of C. Is it really not possible to write these parts in Swift? Even using withUnsafeBufferPointer to bypass bounds-checking?

@milseman
Member

> 1. It would be really valuable if we could make this available to non-String collections of Unicode text.

Yes, a huge benefit of native normalization is exposing better APIs in the future. Right now the critical part is nailing down the implementation and dropping ICU, but with that we can add normalization (and grapheme breaking) APIs that take collections of scalars or code units.

Another important API for performance will be some kind of makeNormal() call, normalizing initializers, etc., that will put the string into the stdlib's preferred comparison form (which happens to be NFC).
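
As a small illustration of why a preferred comparison form matters (a sketch using only existing `String` behavior, not the proposed `makeNormal()` API): two strings with different scalars compare equal when they are canonically equivalent, so comparison and hashing have to see through the difference.

```swift
let precomposed = "caf\u{E9}"   // "é" as the single scalar U+00E9
let decomposed = "cafe\u{301}"  // "e" followed by U+0301 COMBINING ACUTE ACCENT

// The underlying scalars and bytes differ...
print(precomposed.unicodeScalars.count, decomposed.unicodeScalars.count)
print(Array(precomposed.utf8) == Array(decomposed.utf8))

// ...but String equality and hashing treat them as the same value.
print(precomposed == decomposed)
let unique = Set([precomposed, decomposed])
print(unique.count)
```

A string already stored in the preferred form could skip normalization work in these comparisons, which is the performance motivation described above.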

> 2. Have you considered perhaps not decoding the UTF-8 code units into full scalars, and leaving them in their encoded form?

Fast-paths are vital for normalization performance and give us the biggest bang for our complexity budget. The vast majority of the time you don't have to actually do any work, and fast paths are about finding that quickly. We should try to detect the biggest ones directly in UTF-8. IIRC Alejandro has a fast path for a leading byte < 0xCC, which catches all scalars with values less than 0x300 (the first combining scalar). This is by far and away the most important fast path (other than the internal isNFC perf flag I suppose), as the most frequent strings for comparison are not linguistic text but rather quasi-human/machine readable strings. This also covers the vast majority of Western European and some African and South-East Asian language text too, which is a nice benefit.

The next fast-path would be leading byte checks that span the largest and most important ranges of NFC scalars, such as the Han ideograph code pages. (Finally, Arabic and Cyrillic checks would help round out normally-precomposed languages).

We do need to weigh against complexity. Beyond these fast-paths, I think it makes sense for the general purpose fall-back path to be based on scalars. We need that for lazy UTF-16 Cocoa strings and future API anyways. Any extra effort should probably be channeled into the fast-paths and APIs at that point.
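
A rough sketch of the leading-byte fast path described above. It is not the code in this PR, and `fastPathPrefixLength` is a hypothetical name. It scans UTF-8 directly and stops at the first byte that is 0xCC or greater, which is the first point where a scalar at or above U+0300 could start.

```swift
// Hypothetical helper: number of leading UTF-8 bytes that belong to scalars
// below U+0300, found without decoding any scalar.
func fastPathPrefixLength(_ utf8: String.UTF8View) -> Int {
    var count = 0
    for byte in utf8 {
        // ASCII bytes, continuation bytes (0x80...0xBF) and the lead bytes
        // 0xC2...0xCB are all below 0xCC. U+0300 itself encodes as 0xCC 0x80.
        if byte >= 0xCC { break }
        count += 1
    }
    return count
}

print(fastPathPrefixLength("hello".utf8))
print(fastPathPrefixLength("caf\u{E9}".utf8))
print(fastPathPrefixLength("cafe\u{301}!".utf8))
```

One caveat this sketch ignores: the last scalar of the fast prefix can still combine with whatever follows it (the "e" before U+0301 in the third call), so a real implementation has to treat that boundary carefully.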

@Azoy
Contributor Author

Azoy commented Aug 30, 2021

> Finally, as excited as I am to see progress here, my excitement is dampened (just a liiiitle bit) by the use of C. Is it really not possible to write these parts in Swift? Even using withUnsafeBufferPointer to bypass bounds-checking?

The use of C for storing Unicode data and accessing it is simply due to the fact that these data structures need to live within the binary. If I were to store these in Swift, either as a tuple or an array, each element would need to be initialized on program launch. The Swift compiler can't emit immutable globals as data into the binary yet, so writing at least the data structures in C is the only choice at the moment. As for accessing said data, these imported C fixed-size arrays are gross to work with in Swift, and it's easier to access the data from the source language it's defined in.

@karwa
Contributor

karwa commented Aug 30, 2021

> The use of C for storing Unicode data and accessing it is simply due to the fact that these data structures need to live within the binary. If I were to store these in Swift, either as a tuple or an array, each element would need to be initialized on program launch. The Swift compiler can't emit immutable globals as data into the binary yet, so writing at least the data structures in C is the only choice at the moment.

Ahhh that's right, I forgot about that; looking at things in godbolt, it does indeed allocate and populate immutable globals at launch :(. I couldn't find a bug report on bugs.swift.org about it; do you know if it's being tracked? Otherwise I'd be happy to file one.

@Azoy Azoy force-pushed the native-normalization branch 2 times, most recently from ab099c1 to 08488a2 on September 8, 2021 03:20
@Azoy
Contributor Author

Azoy commented Sep 8, 2021

@swift-ci please benchmark


@Azoy Azoy changed the title from "[WIP][stdlib] Implement native normalization for String" to "[stdlib] Implement native normalization for String" on Sep 10, 2021
@Azoy Azoy marked this pull request as ready for review on September 10, 2021 23:47
Contributor

@karwa karwa left a comment


Minor nits, but it looks good! Really looking forward to it!

Also, I wonder if there is a way to do Array.removeLast() while guaranteeing the existing capacity is kept...
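
For reference, a small sketch of the capacity-preserving call that `Array` does offer today. `removeLast()` has no `keepingCapacity` parameter, so this shows `removeAll(keepingCapacity:)` on an illustrative scalar buffer instead; the buffer and its size are assumptions for the example, not the PR's code.

```swift
// Illustrative buffer of scalars, similar in spirit to a normalization buffer.
var buffer: [Unicode.Scalar] = []
buffer.reserveCapacity(16)
buffer.append(contentsOf: "abc".unicodeScalars)

// Empties the array while asking it to keep its existing storage.
buffer.removeAll(keepingCapacity: true)

print(buffer.isEmpty, buffer.capacity >= 16)
```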

stdlib/public/core/NFC.swift (outdated, resolved)
stdlib/public/core/NFC.swift (outdated, resolved)
stdlib/public/core/NFD.swift (outdated, resolved)
stdlib/public/core/NFD.swift (outdated, resolved)
stdlib/public/core/String.swift (outdated, resolved)
@milseman
Member

@swift-ci please benchmark

@Azoy
Contributor Author

Azoy commented Sep 24, 2021

@swift-ci please benchmark

@Azoy
Contributor Author

Azoy commented Sep 24, 2021

@swift-ci please benchmark

@swift-ci
Contributor

Performance (x86_64): -O

Regression OLD NEW DELTA RATIO
RemoveWhereMoveInts 14 34 +142.8% 0.41x
RemoveWhereSwapInts 19 37 +94.7% 0.51x
StringComparison_abnormal 860 1440 +67.4% 0.60x
CSVParsingAltIndices2 748 1078 +44.1% 0.69x
RemoveWhereSwapStrings 404 564 +39.6% 0.72x
ObjectiveCBridgeStubFromNSDateRef 4210 5060 +20.2% 0.83x (?)
DropFirstSequence 58 67 +15.5% 0.87x
DropFirstSequenceLazy 58 67 +15.5% 0.87x
SuffixCountableRangeLazy 8 9 +12.5% 0.89x
ObjectiveCBridgeStubFromNSDate 6380 7140 +11.9% 0.89x (?)
StringComparison_zalgo 67375 74900 +11.2% 0.90x (?)
StringHashing_zalgo 3350 3700 +10.4% 0.91x
PrefixWhileAnySequence 343 376 +9.6% 0.91x
RomanNumbers2 556 604 +8.6% 0.92x (?)
String.replaceSubrange.String.Small 62 67 +8.1% 0.93x (?)
NormalizedIterator_zalgo 3475 3750 +7.9% 0.93x
DropLastAnySequence 545 588 +7.9% 0.93x (?)
UTF8Decode_InitFromCustom_noncontiguous_ascii_as_ascii 1061 1144 +7.8% 0.93x (?)
DropLastSequence 557 600 +7.7% 0.93x (?)
DropLastSequenceLazy 557 599 +7.5% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
StringHashing_nonBMPSlowestPrenormal 1590 370 -76.7% 4.30x
StringHashing_emoji 1272 304 -76.1% 4.18x
UnicodeStringFromCodable 717 203 -71.7% 3.53x
StringHashing_slowerPrenormal 940 290 -69.1% 3.24x
NormalizedIterator_nonBMPSlowestPrenormal 1660 530 -68.1% 3.13x
NormalizedIterator_emoji 1320 424 -67.9% 3.11x
NormalizedIterator_slowerPrenormal 950 370 -61.1% 2.57x
StringComparison_nonBMPSlowestPrenormal 1500 670 -55.3% 2.24x
StringHashing_fastPrenormal 940 420 -55.3% 2.24x
StringComparison_emoji 820 384 -53.2% 2.14x
StringHashing_latin1 316 150 -52.5% 2.11x
StringComparison_slowerPrenormal 1490 820 -45.0% 1.82x
SubstringEqualString 421 236 -43.9% 1.78x
SubstringEquatable 665 391 -41.2% 1.70x
NormalizedIterator_fastPrenormal 1070 640 -40.2% 1.67x
SuperChars2 509 306 -39.9% 1.66x
NormalizedIterator_latin1 356 220 -38.2% 1.62x
FlattenListLoop 2544 1632 -35.8% 1.56x (?)
FlattenListFlatMap 6158 4194 -31.9% 1.47x (?)
StringHashing_abnormal 1120 800 -28.6% 1.40x
NormalizedIterator_ascii 128 93 -27.3% 1.38x
NormalizedIterator_abnormal 1180 860 -27.1% 1.37x
CSVParsing.Scalar 192 168 -12.5% 1.14x (?)
SuffixCountableRange 9 8 -11.1% 1.12x
UTF8Decode_InitFromBytes_ascii 304 276 -9.2% 1.10x (?)
StringComparison_longSharedPrefix 488 444 -9.0% 1.10x
ObjectiveCBridgeToNSArray 13000 11850 -8.8% 1.10x (?)
SubstringTrimmingASCIIWhitespace 746 687 -7.9% 1.09x (?)
CharIteration_tweet_unicodeScalars_Backwards 10080 9320 -7.5% 1.08x
CharIteration_russian_unicodeScalars_Backwards 5400 5000 -7.4% 1.08x (?)
CharIndexing_tweet_unicodeScalars_Backwards 10040 9320 -7.2% 1.08x (?)
CharIndexing_punctuated_unicodeScalars_Backwards 1200 1120 -6.7% 1.07x (?)
CharIteration_punctuated_unicodeScalars_Backwards 1200 1120 -6.7% 1.07x (?)

Code size: -O

Regression OLD NEW DELTA RATIO
DictTest3.o 15149 16013 +5.7% 0.95x
ReversedCollections.o 8662 8774 +1.3% 0.99x
RangeAssignment.o 3401 3443 +1.2% 0.99x
SortIntPyramids.o 9040 9136 +1.1% 0.99x
Diffing.o 7899 7979 +1.0% 0.99x
 
Improvement OLD NEW DELTA RATIO
ReduceInto.o 13819 11259 -18.5% 1.23x
MirrorTest.o 11078 10374 -6.4% 1.07x
StringSplitting.o 36022 34862 -3.2% 1.03x
DictTest4Legacy.o 15792 15424 -2.3% 1.02x
RemoveWhere.o 16571 16235 -2.0% 1.02x
DictTest4.o 15298 15010 -1.9% 1.02x
Breadcrumbs.o 44283 43579 -1.6% 1.02x
BufferFill.o 9391 9259 -1.4% 1.01x

Performance (x86_64): -Osize

Regression OLD NEW DELTA RATIO
RemoveWhereMoveInts 14 37 +164.3% 0.38x
RemoveWhereSwapInts 19 40 +110.5% 0.48x
ArrayAppendLazyMap 1250 2500 +100.0% 0.50x
StringComparison_abnormal 860 1420 +65.1% 0.61x
CSVParsingAltIndices2 880 1243 +41.2% 0.71x
RemoveWhereSwapStrings 404 565 +39.9% 0.72x
PrefixAnySeqCRangeIter 40 53 +32.5% 0.75x
RandomShuffleLCG2 416 512 +23.1% 0.81x
DropFirstAnyCollection 162 198 +22.2% 0.82x
PrefixAnyCollection 162 197 +21.6% 0.82x (?)
DropLastAnyCollection 57 69 +21.1% 0.83x
PrefixAnySeqCntRange 44 53 +20.5% 0.83x
PrefixWhileAnyCollection 215 250 +16.3% 0.86x (?)
PrefixWhileAnySequence 345 397 +15.1% 0.87x (?)
RemoveWhereFilterInts 43 49 +14.0% 0.88x (?)
StringComparison_zalgo 67300 75150 +11.7% 0.90x
Breadcrumbs.MutatedIdxToUTF16.Mixed 285 317 +11.2% 0.90x (?)
DropWhileAnyCollection 180 198 +10.0% 0.91x (?)
StringHashing_zalgo 3350 3675 +9.7% 0.91x (?)
PrefixWhileAnyCollectionLazy 176 193 +9.7% 0.91x (?)
DropWhileAnyCollectionLazy 252 275 +9.1% 0.92x (?)
SuffixCountableRange 11 12 +9.1% 0.92x (?)
NormalizedIterator_zalgo 3475 3775 +8.6% 0.92x (?)
DropWhileAnySeqCntRange 200 217 +8.5% 0.92x
DropFirstAnySeqCRangeIterLazy 216 234 +8.3% 0.92x
DropFirstAnySeqCntRangeLazy 216 234 +8.3% 0.92x (?)
Array2D 6944 7520 +8.3% 0.92x (?)
String.replaceSubrange.String.Small 63 68 +7.9% 0.93x
NSStringConversion.Mutable 858 926 +7.9% 0.93x (?)
LessSubstringSubstring 39 42 +7.7% 0.93x (?)
EqualStringSubstring 39 42 +7.7% 0.93x (?)
EqualSubstringSubstringGenericEquatable 39 42 +7.7% 0.93x (?)
LessSubstringSubstringGenericComparable 39 42 +7.7% 0.93x
Set.subtracting.Int.Empty 39 42 +7.7% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
StringHashing_nonBMPSlowestPrenormal 1580 370 -76.6% 4.27x
StringHashing_emoji 1268 308 -75.7% 4.12x
UnicodeStringFromCodable 718 204 -71.6% 3.52x
StringHashing_slowerPrenormal 930 290 -68.8% 3.21x
NormalizedIterator_nonBMPSlowestPrenormal 1660 520 -68.7% 3.19x
NormalizedIterator_emoji 1320 424 -67.9% 3.11x
SuffixSequenceLazy 2412 795 -67.0% 3.03x
SuffixSequence 2393 795 -66.8% 3.01x
SuffixAnySequence 2359 794 -66.3% 2.97x
NormalizedIterator_slowerPrenormal 970 380 -60.8% 2.55x
StringComparison_nonBMPSlowestPrenormal 1510 680 -55.0% 2.22x
StringHashing_fastPrenormal 930 420 -54.8% 2.21x
StringHashing_latin1 314 148 -52.9% 2.12x
StringComparison_emoji 828 400 -51.7% 2.07x
StringComparison_slowerPrenormal 1500 830 -44.7% 1.81x
SubstringEqualString 427 237 -44.5% 1.80x
SubstringEquatable 697 405 -41.9% 1.72x
NormalizedIterator_fastPrenormal 1070 640 -40.2% 1.67x
SuperChars2 504 305 -39.5% 1.65x
NormalizedIterator_latin1 356 220 -38.2% 1.62x
StringHashing_abnormal 1120 800 -28.6% 1.40x
NormalizedIterator_ascii 128 93 -27.3% 1.38x
NormalizedIterator_abnormal 1180 860 -27.1% 1.37x
ArrayAppendSequence 1250 1030 -17.6% 1.21x (?)
SuffixAnySeqCRangeIter 935 797 -14.8% 1.17x
SuffixAnySeqCntRange 893 764 -14.4% 1.17x
ObjectiveCBridgeToNSArray 13450 11550 -14.1% 1.16x (?)
StringComparison_longSharedPrefix 493 447 -9.3% 1.10x (?)
UTF8Decode_InitFromData_ascii 258 235 -8.9% 1.10x (?)
DropLastCountableRange 12 11 -8.3% 1.09x (?)
SuffixCountableRangeLazy 12 11 -8.3% 1.09x
ArrayAppendLatin1 3468 3196 -7.8% 1.09x (?)
BucketSort 195 180 -7.7% 1.08x (?)
SuffixAnySequenceLazy 4738 4376 -7.6% 1.08x (?)
SubstringTrimmingASCIIWhitespace 743 687 -7.5% 1.08x (?)
CharIndexing_tweet_unicodeScalars_Backwards 20120 18680 -7.2% 1.08x (?)
ArrayAppendAscii 3366 3128 -7.1% 1.08x (?)
ArrayAppendUTF16 3366 3128 -7.1% 1.08x (?)

Code size: -Osize

Regression OLD NEW DELTA RATIO
RangeOverlaps.o 5633 6121 +8.7% 0.92x
MirrorTest.o 9880 10542 +6.7% 0.94x
RangeAssignment.o 3256 3335 +2.4% 0.98x
StrToInt.o 4553 4649 +2.1% 0.98x
Diffing.o 7256 7395 +1.9% 0.98x
RangeReplaceableCollectionPlusDefault.o 5542 5638 +1.7% 0.98x
SortArrayInClass.o 2809 2842 +1.2% 0.99x
 
Improvement OLD NEW DELTA RATIO
Suffix.o 24685 19675 -20.3% 1.25x
StringSplitting.o 35264 32388 -8.2% 1.09x
ChainedFilterMap.o 3059 2958 -3.3% 1.03x
PopFrontGeneric.o 2742 2681 -2.2% 1.02x
PopFront.o 3460 3399 -1.8% 1.02x
IndexPathTest.o 7367 7255 -1.5% 1.02x
RandomShuffle.o 3473 3426 -1.4% 1.01x
LazyFilter.o 7177 7080 -1.4% 1.01x
BufferFill.o 10174 10064 -1.1% 1.01x

Performance (x86_64): -Onone

Regression OLD NEW DELTA RATIO
StringComparison_abnormal 980 1560 +59.2% 0.63x
Breadcrumbs.MutatedUTF16ToIdx.Mixed 270 320 +18.5% 0.84x (?)
String.data.Small 75 85 +13.3% 0.88x (?)
DataReplaceMediumBuffer 5900 6600 +11.9% 0.89x (?)
CSVParsing.Char 806 901 +11.8% 0.89x (?)
ArrayAppendAscii 14144 15640 +10.6% 0.90x (?)
StringMatch 39900 44100 +10.5% 0.90x (?)
ArrayAppendUTF16 14076 15538 +10.4% 0.91x (?)
ArrayAppendLatin1 14076 15436 +9.7% 0.91x (?)
String.data.LargeUnicode 169 185 +9.5% 0.91x (?)
StringComparison_zalgo 67775 74025 +9.2% 0.92x
ObjectiveCBridgeFromNSArrayAnyObjectForced 8900 9700 +9.0% 0.92x (?)
StringHashing_zalgo 3375 3675 +8.9% 0.92x (?)
NormalizedIterator_zalgo 3475 3775 +8.6% 0.92x (?)
FindString.Loop1.Substring 815 877 +7.6% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
StringHashing_nonBMPSlowestPrenormal 1640 410 -75.0% 4.00x
StringHashing_emoji 1308 340 -74.0% 3.85x
UnicodeStringFromCodable 753 228 -69.7% 3.30x
NormalizedIterator_nonBMPSlowestPrenormal 1660 520 -68.7% 3.19x
NormalizedIterator_emoji 1320 420 -68.2% 3.14x
StringHashing_slowerPrenormal 980 340 -65.3% 2.88x
NormalizedIterator_slowerPrenormal 980 390 -60.2% 2.51x
StringHashing_fastPrenormal 990 480 -51.5% 2.06x
StringHashing_latin1 364 196 -46.2% 1.86x
StringComparison_nonBMPSlowestPrenormal 1940 1090 -43.8% 1.78x
StringComparison_emoji 1076 624 -42.0% 1.72x
NormalizedIterator_fastPrenormal 1060 630 -40.6% 1.68x
NormalizedIterator_latin1 386 250 -35.2% 1.54x
StringComparison_slowerPrenormal 1990 1290 -35.2% 1.54x
SuperChars2 632 426 -32.6% 1.48x
StringHashing_abnormal 1160 820 -29.3% 1.41x
NormalizedIterator_abnormal 1200 860 -28.3% 1.40x
NormalizedIterator_ascii 157 119 -24.2% 1.32x
ArrayAppendRepeatCol 430650 372410 -13.5% 1.16x (?)
Data.append.Sequence.64kB.Count.RE 34861 30495 -12.5% 1.14x (?)
Data.append.Sequence.809B.Count.RE.I 43215 37930 -12.2% 1.14x (?)
Data.init.Sequence.64kB.Count.RE 34575 30400 -12.1% 1.14x (?)
DataAppendSequence 4300800 3792500 -11.8% 1.13x (?)
Data.init.Sequence.809B.Count.RE 43002 38006 -11.6% 1.13x (?)
Data.append.Sequence.809B.Count.RE 43005 38089 -11.4% 1.13x (?)
Data.init.Sequence.64kB.Count.RE.I 34658 30817 -11.1% 1.12x (?)
Data.append.Sequence.64kB.Count.RE.I 34727 30890 -11.0% 1.12x (?)
Data.init.Sequence.809B.Count.RE.I 42977 38307 -10.9% 1.12x (?)
ErrorHandling 4540 4130 -9.0% 1.10x (?)
ObjectiveCBridgeStubDateMutation 1368 1257 -8.1% 1.09x (?)
ObjectiveCBridgeToNSArray 13150 12100 -8.0% 1.09x (?)
String.replaceSubrange.String 26 24 -7.7% 1.08x (?)
SubstringEquatable 6143 5709 -7.1% 1.08x (?)

Code size: -swiftlibs

Regression OLD NEW DELTA RATIO
libswiftCore.dylib 3751936 3817472 +1.7% 0.98x
How to read the data

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB

Member

@milseman milseman left a comment


This is coming along very well! Just some minor feedback

stdlib/public/core/NFC.swift (outdated, resolved)
stdlib/public/core/NFC.swift (resolved)
stdlib/public/core/String.swift (resolved)
stdlib/public/core/UnicodeData.swift (outdated, resolved)
stdlib/public/core/UnicodeData.swift (resolved)
@Azoy
Contributor Author

Azoy commented Sep 27, 2021

@swift-ci please test

@swift-ci
Contributor

Build failed
Swift Test Linux Platform
Git Sha - 4c9650221d94c264d2b3e3aefcb5d397e94dd129

@Azoy
Contributor Author

Azoy commented Sep 28, 2021

@swift-ci please clean test Linux

@Azoy
Contributor Author

Azoy commented Sep 28, 2021

@swift-ci please clean test Windows

@swift-ci
Contributor

Build failed
Swift Test Linux Platform
Git Sha - 4c9650221d94c264d2b3e3aefcb5d397e94dd129


while let current = iterator.next() {
guard let currentComposee = composee else {
// If we don't have a composee at this point, we're most likely looking
Member


"most likely"? When are we not?

Contributor Author


When the string begins with, say, 50 non-starters, we just emit those and do no normalization work. The 51st scalar is a starter, so technically we're not doing any work until that scalar.
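
A minimal sketch of that situation, using the public `Unicode.Scalar.Properties` API rather than the stdlib's internal normalization data. `leadingNonStarterCount` is a hypothetical helper that counts how many scalars precede the first starter (a scalar whose canonical combining class is 0).

```swift
// Hypothetical helper: number of scalars before the first starter (CCC == 0).
func leadingNonStarterCount(_ string: String) -> Int {
    var count = 0
    for scalar in string.unicodeScalars {
        if scalar.properties.canonicalCombiningClass.rawValue == 0 { break }
        count += 1
    }
    return count
}

// Fifty combining acute accents followed by a starter, as in the example above.
let oddString = String(repeating: "\u{301}", count: 50) + "a"
print(leadingNonStarterCount(oddString))
print(leadingNonStarterCount("abc"))
```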

@Azoy
Contributor Author

Azoy commented Sep 28, 2021

@swift-ci please test

@swift-ci
Contributor

Build failed
Swift Test Linux Platform
Git Sha - 095be7278c4cbe05c1e8ca30b5e6357ea51a9dc8

use >/< instead of !=

fix some bugs

fix
fix infinite recursion bug

NFC: Remove early ccc check

remember that false is turned on
@Azoy
Contributor Author

Azoy commented Sep 29, 2021

@swift-ci please test

// is "blocked". We get the last scalar because the scalars we receive are
// already NFD, so the last scalar in the buffer will have the highest
// CCC value in this normalization segment.
guard let lastBufferedNormData = buffer.last?.normData else {
Member


It took me a few readings to figure it out, but we're really checking the prior scalar's CCC, right? So maybe:

Suggested change
- guard let lastBufferedNormData = buffer.last?.normData else {
+ guard let priorCCC = buffer.last?.normData.ccc else {

And update the comment appropriately.

Contributor Author


It's not necessarily the prior scalar's CCC, no. I can say something like lastBufferedCCC however and add the .ccc here.

Member


Isn't the prior CCC what's required by Unicode? When is it a different scalar?

"last" sounds like it's the last scalar in the current normalization segment, and thus would probably appear after current.
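
For readers following the thread, a hedged sketch of the "blocked" check being discussed. It loosely follows the Unicode definition (a scalar is blocked from the starter if something between them is a starter or has an equal or higher CCC) and is not the code in NFC.swift. `ccc` and `isBlocked` are hypothetical helpers, and the sketch assumes the buffered scalars are already in NFD order, so the last buffered scalar carries the highest CCC seen so far.

```swift
// Hypothetical helper: canonical combining class via the public properties API.
func ccc(_ scalar: Unicode.Scalar) -> UInt8 {
    return scalar.properties.canonicalCombiningClass.rawValue
}

// Hypothetical check. `lastBufferedCCC` is nil when nothing has been buffered
// since the starter, in which case the current scalar is adjacent to it.
func isBlocked(lastBufferedCCC: UInt8?, currentCCC: UInt8) -> Bool {
    guard let last = lastBufferedCCC else { return false }
    // A buffered starter blocks everything after it; otherwise an equal or
    // higher CCC in between blocks the current scalar.
    return last == 0 || last >= currentCCC
}

let acute: Unicode.Scalar = "\u{301}"     // combining acute accent (above)
let dotBelow: Unicode.Scalar = "\u{323}"  // combining dot below

print(isBlocked(lastBufferedCCC: nil, currentCCC: ccc(acute)))
print(isBlocked(lastBufferedCCC: ccc(dotBelow), currentCCC: ccc(acute)))
print(isBlocked(lastBufferedCCC: ccc(acute), currentCCC: ccc(acute)))
```

Under the NFD-order assumption, the last buffered scalar is also the scalar immediately prior to the current one, which is the distinction the naming question here turns on.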

@milseman milseman merged commit f8e5f28 into swiftlang:main Oct 1, 2021
@Azoy Azoy deleted the native-normalization branch October 1, 2021 20:04
@MaxDesiatov
Contributor

Does the fact that this was merged mean that the ICU dependency is going to be removed at some point?

@milseman
Member

Hopefully soon


6 participants