Attempt to transcode into smol strings #78913
@swift-ci please smoke test
@swift-ci please Apple Silicon benchmark
  ) -> (String, repairsMade: Bool)?
  where Input.Element == Encoding.CodeUnit {
    // TODO(String Performance): Attempt to form smol strings
    if input.count < _SmallString.capacity {
"14 or fewer" is operating under the assumption that "all ASCII except for one" is a relatively common case
I think early-exiting for small should be done in the single transcoding loop. We do not know how many UTF-8 code units input.count is.
I'm not sure I understand. This check is a heuristic to avoid even attempting a small string, not the actual check to confirm that we can. We know that 15 UTF-16 code units is very unlikely to succeed: if they were all ASCII then we'd probably have an ASCII buffer to start with, and if they're not all ASCII then we'll have more than 15 UTF-8 code units. So 14 is the largest number that could possibly work, unless I'm missing something.
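To make the arithmetic above concrete, here is a minimal sketch using only public API. It assumes the 64-bit small-string capacity of 15 UTF-8 code units; `smallCapacity` and `utf8Count(ofUTF16:)` are hypothetical names for this illustration, not part of the PR.

```swift
// Assumed capacity of a small string on 64-bit platforms, in UTF-8 code units.
let smallCapacity = 15

// Hypothetical helper: how many UTF-8 code units a UTF-16 buffer transcodes to.
func utf8Count(ofUTF16 units: [UInt16]) -> Int {
  String(decoding: units, as: UTF16.self).utf8.count
}

// 13 ASCII code units plus one non-ASCII scalar (U+00E9, two UTF-8 bytes):
// 14 UTF-16 code units in total.
let fourteenUnits = Array("abcdefghijklm\u{E9}".utf16)

// 14 ASCII code units plus the same scalar: 15 UTF-16 code units in total.
let fifteenUnits = Array("abcdefghijklmn\u{E9}".utf16)

let fitsWithFourteen = utf8Count(ofUTF16: fourteenUnits) <= smallCapacity
let fitsWithFifteen = utf8Count(ofUTF16: fifteenUnits) <= smallCapacity
```

With one two-byte scalar, 14 input code units still fit and 15 do not, which is the "all ASCII except for one" case the heuristic targets. A three-byte scalar would lower the limit further, so the heuristic can still overestimate.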
My understanding is this algorithm proceeds as follows:
- We check a might-be-small heuristic (which could be either an overestimate or underestimate)
- If heuristic passes, we run a transcode loop to try to form and return the small string if we can, otherwise:
- We run a second transcode loop into an Array<UInt8>
- We copy that array's contents into a String, either making a _SmallString or __StringStorage, depending on its length
I'm wondering if we can simplify the code and logic, as well as improve the general case, by changing the existing transcode loop to skip the Array if it's going to be returning a small string anyways.
Would it be possible to just decode the input's scalars and append them to a var result = "" (or its guts)? That would avoid there ever being an array and keep it in small form for every input that could be small.
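A rough sketch of that suggestion, written against public API only. The function name `transcodeByAppending` is hypothetical, and the real change would presumably work on the string's guts rather than on `unicodeScalars`; this only shows the shape of a single decode-and-append loop with no intermediate array.

```swift
// Hypothetical illustration: decode UTF-16 one scalar at a time and append
// each scalar directly to the result string.
func transcodeByAppending<Input: Sequence>(
  _ input: Input
) -> (String, repairsMade: Bool) where Input.Element == UInt16 {
  let replacement: Unicode.Scalar = "\u{FFFD}"
  var result = ""
  var repairsMade = false
  var decoder = UTF16()
  var iterator = input.makeIterator()

  decoding: while true {
    switch decoder.decode(&iterator) {
    case .scalarValue(let scalar):
      result.unicodeScalars.append(scalar)
    case .emptyInput:
      break decoding
    case .error:
      // Ill-formed input: substitute U+FFFD and remember that we repaired.
      result.unicodeScalars.append(replacement)
      repairsMade = true
    }
  }
  return (result, repairsMade)
}
```

Whether short results actually stay in small form throughout is up to the standard library's append implementation, which is the part later comments say needs improving.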
oh that's an interesting approach. I wonder if resizing (i.e. multiple mallocs) is actually faster than two transcode passes (one throwing away the data but writing down the size, then one actually copying in) in common cases
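For comparison, the two-pass alternative could look roughly like the sketch below. It uses the public `transcode(_:from:to:stoppingOnError:into:)` function; `transcodeInTwoPasses` is a hypothetical name, and going through UTF-32 in the second pass is only a convenience of this example, not what the standard library would do internally.

```swift
// Hypothetical illustration of "count first, then copy".
func transcodeInTwoPasses(_ input: [UInt16]) -> (String, repairsMade: Bool) {
  // Pass 1: transcode to UTF-8, throw the code units away, keep the count.
  var utf8Count = 0
  let repairsMade = transcode(
    input.makeIterator(), from: UTF16.self, to: UTF8.self,
    stoppingOnError: false
  ) { _ in utf8Count += 1 }

  // Pass 2: request storage once, then transcode again and append.
  var result = ""
  result.reserveCapacity(utf8Count)
  _ = transcode(
    input.makeIterator(), from: UTF16.self, to: UTF32.self,
    stoppingOnError: false
  ) { value in
    // transcode repairs ill-formed input with U+FFFD, so every UTF-32
    // code unit it emits is a valid scalar value.
    result.unicodeScalars.append(Unicode.Scalar(value)!)
  }
  return (result, repairsMade)
}
```

Which variant is faster in common cases is exactly the open question here, so it would need benchmarking rather than guessing.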
I like that idea. UnicodeScalarView has an append(_: UnicodeScalar), so that seems viable. The implementation is naïve, though; we would need to improve it substantially. (Or we can add something better.)
I was curious about this and tried to dig it a bit.
What is the agreement on ideal flow?
I tried to look at the suggestions above, but there are some hurdles.
- The basic issue is that we don't know if result can be smol until transcoding is done.
- Starting with "" and resizing it on the go seems inefficient. It will be re-creating _SmallString on each append.
- IIUC one of the big desires is to remove the intermediate array. I was thinking maybe __StringStorage could be a starting point. It is updated as part of transcoding. In the end it is converted to _SmallString if it fits the criteria.
EDIT: Looks like __StringStorage itself is not meant for growing. And when dealing with buffer pointers directly, they really want to know the size up front. The only alternative to doing 2 transcoding passes I can think of is allocating 2x memory (worst case scenario) and then shrinking it. It probably does not help with this PR's optimistic scenario, but it gets rid of the intermediate Array<UInt8>.
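The over-allocate-then-shrink idea can be approximated with the public `String(unsafeUninitializedCapacity:initializingUTF8With:)` initializer, which takes an upper bound and is told afterwards how many bytes were written. This is a sketch under assumptions: `transcodeOverAllocating` is a hypothetical name, the bound of three UTF-8 bytes per UTF-16 code unit is this example's choice rather than a figure from the thread, and on Apple platforms the initializer needs a deployment target of macOS 11 / iOS 14 or later.

```swift
// Hypothetical illustration: one transcoding pass into worst-case storage.
func transcodeOverAllocating(_ input: [UInt16]) -> (String, repairsMade: Bool) {
  // Assumed bound: a UTF-16 code unit never needs more than 3 UTF-8 bytes
  // (a surrogate pair is 2 code units and 4 bytes; U+FFFD is 3 bytes).
  let worstCase = input.count * 3
  var repairsMade = false

  let result = String(unsafeUninitializedCapacity: worstCase) { buffer in
    var written = 0
    repairsMade = transcode(
      input.makeIterator(), from: UTF16.self, to: UTF8.self,
      stoppingOnError: false
    ) { unit in
      buffer[written] = unit
      written += 1
    }
    // Report how much of the buffer was actually initialized.
    return written
  }
  return (result, repairsMade)
}
```

This drops the intermediate Array<UInt8> and the second pass, at the cost of requesting more memory than most inputs need.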
Re-creating a _SmallString for a few appends is fine, it's two (or three) words held in registers, with a few bytes changed every time. For the ultimate version of this, we need to improve reserveCapacity() and add an internal way to append a known-good UTF8-encoded code point.
The relevant entry points for appending to Strings are mostly in StringRangeReplaceableCollection.swift
  @_specialize(
    where Input == Array<UInt8>, Encoding == Unicode.ASCII)
  @_specialize(
    where Input == UnsafeBufferPointer<UInt16>, Encoding == UTF16)
Not sure about these either
As a separate point, we might want a separate fully concrete code path for UTF-16 anyways, rather than going through generic transcoding.
Closing this in favor of #83407, which adds a vectorized transcoder for this as well as making small strings.
Fixes rdar://144114867