-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[stdlib] Implement native normalization for String #38922
Conversation
|
@swift-ci please benchmark |
|
@swift-ci please test |
This comment has been minimized.
This comment has been minimized.
7531e38
to
8a56ccc
Compare
|
@swift-ci please benchmark |
This comment has been minimized.
This comment has been minimized.
|
@swift-ci please benchmark |
82668a6
to
8302089
Compare
|
@swift-ci please benchmark |
This comment has been minimized.
This comment has been minimized.
|
@swift-ci please benchmark |
This comment has been minimized.
This comment has been minimized.
|
@swift-ci please benchmark |
This comment has been minimized.
This comment has been minimized.
|
Awesome! I have a couple of comments though:
Finally, as excited as I am to see progress here, my excitement is dampened (just a liiiitle bit) by the use of C. Is it really not possible to write these parts in Swift? Even using |
Yes, a huge benefit of native normalization is exposing better APIs in the future. Right now the critical part is nailing down the implementation and dropping ICU, but with that we can add normalization (and grapheme breaking) APIs that take collections of scalars or code units. Another important API for performance will be some kind of
Fast-paths are vital for normalization performance and give us the biggest bang for our complexity budget. The vast majority of the time you don't have to actually do any work, and fast paths are about finding that quickly. We should try to detect the biggest ones directly in UTF-8. IIRC Alejandro has a fast path for a leading byte < The next fast-path would be leading byte checks that span the largest and most important ranges of NFC scalars, such as the Han ideograph code pages. (Finally, Arabic and Cyrillic checks would help round out normally-precomposed languages). We do need to weigh against complexity. Beyond these fast-paths, I think it makes sense for the general purpose fall-back path to be based on scalars. We need that for lazy UTF-16 Cocoa strings and future API anyways. Any extra effort should probably be channeled into the fast-paths and APIs at that point. |
The use of C for storing Unicode data and accessing it is simply due to the fact that these data structures need to live within the binary. If I were to store these in Swift, either as a tuple or an array, each element would need to be initialized on program launch. The Swift compiler can't emit immutable globals as data into the binary yet, so writing at least the data structures in C is the only choice at the moment. As for accessing said data, it's simply that importing these C fixed sized arrays are gross to work with in Swift and is easier to access the data from the source language its defined in. |
Ahhh that's right, I forgot about that; looking at things in godbolt, it does indeed allocate and populate immutable globals at launch :(. I couldn't find a bug report on bugs.swift.org about it; do you know if it's being tracked? Otherwise I'd be happy to file one. |
ab099c1
to
08488a2
Compare
|
@swift-ci please benchmark |
This comment has been minimized.
This comment has been minimized.
08488a2
to
8faabe1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nits, but it looks good! Really looking forward to it!
Also, I wonder if there is a way to do Array.removeLast() while guaranteeing the existing capacity is kept...
8faabe1
to
aea5314
Compare
|
@swift-ci please benchmark |
9f55cbd
to
b341552
Compare
|
@swift-ci please benchmark |
b341552
to
8955f00
Compare
|
@swift-ci please benchmark |
Performance (x86_64): -O
Code size: -O
Performance (x86_64): -Osize
Code size: -Osize
Performance (x86_64): -Onone
Code size: -swiftlibs
How to read the dataThe tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.If you see any unexpected regressions, you should consider fixing the Noise: Sometimes the performance results (not code size!) contain false Hardware Overview |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is coming along very well! Just some minor feedback
8955f00
to
4c96502
Compare
|
@swift-ci please test |
|
Build failed |
|
@swift-ci please clean test Linux |
|
@swift-ci please clean test Windows |
|
Build failed |
|
|
||
| while let current = iterator.next() { | ||
| guard let currentComposee = composee else { | ||
| // If we don't have a composee at this point, we're most likely looking |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"most likely"? When are we not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the start of the string begins with say 50 non starters, we just emit those and do no normalization work, but the 51st scalar is a starter so technically we're not doing any work until that scalar.
4c96502
to
095be72
Compare
|
@swift-ci please test |
|
Build failed |
use >/< instead of != fix some bugs fix
fix infinite recursion bug NFC: Remove early ccc check remember that false is turned on
095be72
to
014e822
Compare
|
@swift-ci please test |
| // is "blocked". We get the last scalar because the scalars we receive are | ||
| // already NFD, so the last scalar in the buffer will have the highest | ||
| // CCC value in this normalization segment. | ||
| guard let lastBufferedNormData = buffer.last?.normData else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a few readings to figure it out, but we're really checking the prior scalar's CCC, right? So maybe:
| guard let lastBufferedNormData = buffer.last?.normData else { | |
| guard let priorCCC = buffer.last?.normData.ccc else { |
And update the comment appropriately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not necessarily the prior scalar's CCC, no. I can say something like lastBufferedCCC however and add the .ccc here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the prior CCC what's required by Unicode? When is it a different scalar?
"last" sounds like it's the last scalar in the current normalization segment, and thus would probably appear after current.
|
Does the fact that this was merged mean that ICU dependency is going to be removed at some point? |
|
Hopefully soon |
This is still a wip, but wanted to get some benchmarks and tests ran on this.
Resolves: rdar://51635207, https://bugs.swift.org/browse/SR-9432