[SR-7602] UTF8 should be (one of) the fastest String encoding(s) #50144
Comments
Comment by tanner0101 (JIRA) Huge +1 to this. To give some additional insights here, `String` has been a major burden for us in developing Vapor (server side Swift framework). At this point it's basically forbidden to use String in any internal code and only our public APIs will accept it. We resort to using things like `[UInt8]` and `UnsafeBufferPointer<UInt8>` internally instead. Even with our internal optimizations, there is still a lot of friction across modules and for our end users wherever `String`s are used. If `String` were more performant when dealing with UTF-8, that would greatly improve the speed of our framework and clean up a lot of our internal code. |
Comment by Alex Reilly (JIRA) +1 |
Server side swift needs this big time. Would love to see this added. |
Comment by Francisco Rivas (JIRA) +1 |
Comment by Mikhail Isaev (JIRA) +1 |
Comment by Anthony Castelli (JIRA) +1 |
Comment by Damir Stuhec (JIRA) +1 |
+2 |
Comment by Petro Rovenskyy (JIRA) +1 |
Comment by Helge Heß (JIRA) +1 |
Comment by Lucian Boboc (JIRA) +1 |
cc @milseman Aside: please stop with the +1 comments. There's a "Vote for this issue" button right over there. |
@belkadan +1 😛 . I think this is vital to Swift's long-term health on any platform or domain outside of its current niche (and honestly, even within it). Thank you for sharing your experience tannernelson (JIRA User). Do you have anything with more detail here, such as how much performance and code bloat is due to this? How does this friction manifest for your users and how could storing UTF-8 without transcoding remove it? Is anyone else from this thread able to share their experience? Reports like this really help the project to prioritize effectively and push for the right thing. (As for the fear concerning ABI stability, it's a little complicated and there are degrees to which we can reserve the ability to support this in the future.) |
Comment by tanner0101 (JIRA) @milseman Virtually all of it comes down to `String(data: myData, encoding: .utf8)` and `myString.data(using: .utf8)`. When parsing protocols such as HTTP, Redis, MySQL, PostgreSQL, etc., we will read data from the OS into an `UnsafeBufferPointer<UInt8>`. This is almost always via NIO's [`ByteBuffer`](https://apple.github.io/swift-nio/docs/current/NIO/Structs/ByteBuffer.html) type. We sometimes grab `String` from that directly or grab `Data` if we want to iterate over the bytes for additional parsing. [Here is an example of common byte buffer usage](https://github.com/vapor/mysql/blob/master/Sources/MySQL/Protocol/MySQLBinaryResultsetRow.swift#L39-L40). In other words, from `UnsafePointer<UInt8>` we commonly read `FixedWidthInteger`, `BinaryFloatingPoint`, `Data`, and `String`. All are very performant except String, which is the concern since the vast majority of bytes end up being `String`s. Considering the DB use case specifically, the data transfer is usually emails, names, bios, comments, etc. Very few bytes are actually dedicated to binary numbers or data blobs. Strings everywhere. To summarize, the faster we can get from `Swift.Unsafe...Pointer<UInt8>` or `Foundation.Data` to `String` the better. That will affect (for the better!) quite literally our entire framework. |
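To make the parsing pattern described above concrete, here is a minimal sketch of reading strings straight out of a raw byte buffer with the standard library's `String(decoding:as:)`. The wire format (a one-byte length followed by a UTF-8 payload) and the helper name `readLengthPrefixedString` are hypothetical and chosen only for illustration; they are not taken from Vapor, NIO, or any real protocol.

```swift
// Hypothetical wire format: [1-byte length][UTF-8 payload], repeated.
// Returns nil when the buffer is exhausted or the record is truncated.
func readLengthPrefixedString(from bytes: UnsafeRawBufferPointer,
                              at offset: inout Int) -> String? {
    guard offset < bytes.count else { return nil }
    let length = Int(bytes[offset])
    let start = offset + 1
    guard start + length <= bytes.count else { return nil }
    // Rebase the slice so the payload is a plain contiguous buffer.
    let payload = UnsafeRawBufferPointer(rebasing: bytes[start ..< start + length])
    offset = start + length
    // Copies the payload into the String's own storage;
    // invalid UTF-8 is repaired with U+FFFD rather than rejected.
    return String(decoding: payload, as: UTF8.self)
}

// Build a small packet holding two records.
var packet: [UInt8] = []
packet.append(5)
packet.append(contentsOf: Array("alice".utf8))
packet.append(3)
packet.append(contentsOf: Array("bob".utf8))

let names: [String] = packet.withUnsafeBytes { raw in
    var offset = 0
    var result: [String] = []
    while let name = readLengthPrefixedString(from: raw, at: &offset) {
        result.append(name)
    }
    return result
}
```

Going from the buffer to `String` directly avoids materializing an intermediate `Data` value just to call `String(data:encoding:)`.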
just to add to tannernelson (JIRA User)'s great comment (thanks!): If the String comes from a
|
How do you manage the lifetimes of the storage? I think that String should also express the ability to share storage, but that is yet to be designed and potentially separable. By default, String should allocate new storage and copy in the bytes. edit: For `String(data: myData, encoding: .utf8)`, where did you get `myData` from? For `myString.data(using: .utf8)`, where do you typically send the result? |
@milseman agreed, that'd be awesome! And also agreed that that's potentially a separate issue. FYI, in the |
Along the lines of potentially separable issues, what is your validation story? If the stream of bytes contains invalid UTF-8, do you want: 1) The initializer to fail, resulting in nil. For reference, I think [Rust's model](https://doc.rust-lang.org/std/string/struct.String.html) is pretty good: `from_utf8` produces an error explaining why the code units were invalid. I'm not entirely sure if accepting invalid bytes requires voiding memory safety (assuming bounds checking always happens), but it is totally a security hazard if used improperly. We may want to be very cautious about if/how we expose it. I think that trying to do read-time validation is dubious for UTF-16, and totally bananas for UTF-8. |
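As a rough illustration of two of the validation behaviors under discussion, the sketch below contrasts a strict, failing conversion with the repairing behavior of `String(decoding:as:)`. The helper `strictString(fromUTF8:)` is a hypothetical name written for this example; it is a slow scalar-by-scalar decode meant to show the semantics, not how the standard library validates.

```swift
// Strict variant: returns nil as soon as an invalid code unit sequence is seen.
// Hypothetical helper, not a stdlib API.
func strictString(fromUTF8 bytes: [UInt8]) -> String? {
    var decoder = UTF8()
    var iterator = bytes.makeIterator()
    var scalars = String.UnicodeScalarView()
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            scalars.append(scalar)
        case .emptyInput:
            return String(scalars)   // reached the end without errors
        case .error:
            return nil               // invalid UTF-8: fail instead of repairing
        }
    }
}

let invalid: [UInt8] = [0x61, 0xFF, 0x62]   // "a", an invalid byte, "b"

// Repairing variant: the invalid byte becomes U+FFFD (replacement character).
let repaired = String(decoding: invalid, as: UTF8.self)

// Strict variant: the whole conversion fails.
let strict = strictString(fromUTF8: invalid)
```

Neither variant accepts the invalid bytes as-is, so neither raises the memory-safety or security questions mentioned above.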
Comment by tanner0101 (JIRA) From a high-level user perspective, I would love a throwing variant of the String(..., encoding: ) initializer and friends. When I see people using Vapor, the nil-failable one is almost always getting force unwrapped. (in a context where throwing is handled much better, I should add) In terms of copying, I would expect that the String initializer from `Unsafe...Pointer<UInt8>` would copy the bytes into its own storage. And that in turn is how it would operate with NIO's ByteBuffer type. Which seems fine since the buffer is potentially going to get re-used and filled in with new bytes. I forget whether NIO actually does re-use the unsafe pointers, but it's a method I've used before. In terms of what to do initializing from `Data`, it would be great if they could do some intelligent COW sharing of the internal storage to minimize copies, but idk if that's possible. |
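A throwing initializer along the lines requested here could be sketched as follows. The initializer label `validatingUTF8Bytes` and the error type `InvalidUTF8Error` are hypothetical names invented for this example; the validity check works by decoding with repair and comparing the result's UTF-8 against the input, which is simple but does extra work compared with a dedicated validator.

```swift
// Hypothetical error type for this sketch.
struct InvalidUTF8Error: Error {}

extension String {
    // Hypothetical throwing initializer: copies the bytes into the String's
    // own storage and throws if they were not valid UTF-8.
    init(validatingUTF8Bytes bytes: UnsafeBufferPointer<UInt8>) throws {
        let repaired = String(decoding: bytes, as: UTF8.self)
        // Repair replaces invalid sequences with U+FFFD, so a round-trip
        // mismatch means the input was not valid UTF-8.
        guard repaired.utf8.elementsEqual(bytes) else { throw InvalidUTF8Error() }
        self = repaired
    }
}

var buffer = Array("hello".utf8)
let greeting = try? buffer.withUnsafeBufferPointer {
    try String(validatingUTF8Bytes: $0)
}

// Reusing the buffer afterwards does not affect the String: it owns a copy.
buffer[0] = UInt8(ascii: "j")

// A truncated two-byte sequence is rejected.
let truncated: [UInt8] = [0xC3]
let rejected = try? truncated.withUnsafeBufferPointer {
    try String(validatingUTF8Bytes: $0)
}
```

Because the bytes are copied, the source buffer can be refilled with new data as soon as the initializer returns.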
Comment by Helge Heß (JIRA) For the NIOFoundationCompat thing I filed SR-7378. |
tannernelson (JIRA User) when you or your users do `String(data: myData, encoding: .utf8)`, where did `myData` come from? Similarly, for `myString.data(using: .utf8)` where or what do you do with the resulting `Data`? edit: the reason I ask is that this work becomes much more compelling if we're able to not only skip transcoding overhead, but also eliminate an intermediary allocation. |
It's now the fastest encoding. https://forums.swift.org/t/string-s-abi-and-utf-8/17676/1 |
Additional Detail from JIRA
md5: f681e7f0741f98e436f811971add77c3
Sub-Tasks:
Issue Description:
I believe that there are really only one (and a half) encodings that matter today: UTF8 (and its subset ASCII).
Therefore it's important that Swift's fastest String encoding is UTF8.
From what I can tell today the fastest String encodings are UTF16 and ASCII. Everything else will have worse performance.
This also seems to be ABI-relevant, so AFAIK this needs to be fixed very soon.
Requirements:
- being able to copy UTF-8 encoded bytes from a `String` into a pre-allocated raw buffer must be allocation-free and as fast as `memcpy` can copy them
- creating a String from UTF-8 encoded bytes should just validate the encoding and store the bytes as they are
- slightly softer but still very strong requirement: currently (even with ASCII) only the stdlib seems to be able to get a pointer to the contiguous ASCII representation (if at all in that form). That works fine if you just want to copy the bytes (`UnsafeMutableBufferPointer(start: destinationStart, count: destinationLength).initialize(from: string.utf8)`, which will use `memcpy` if in ASCII representation) but doesn't allow you to implement your own algorithms that are only performant on a contiguously stored `[UInt8]`
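For readers on a toolchain where native strings are stored as UTF-8 (Swift 5 and later), the first and third requirements can be sketched with `withContiguousStorageIfAvailable` on the `utf8` view. The function name `copyUTF8(of:into:)` is hypothetical, and the slow path is an assumption about how one might handle strings without contiguous UTF-8 storage (for example, some bridged strings).

```swift
// Copies the UTF-8 bytes of `string` into `destination`.
// Returns the number of bytes written, or nil if the buffer is too small.
// Hypothetical helper for illustration.
func copyUTF8(of string: String,
              into destination: UnsafeMutableRawBufferPointer) -> Int? {
    let utf8 = string.utf8
    guard utf8.count <= destination.count else { return nil }

    // Fast path: the string exposes contiguous UTF-8 storage,
    // so the bytes can be copied in one bulk operation.
    let fast = utf8.withContiguousStorageIfAvailable { source -> Int in
        destination.copyMemory(from: UnsafeRawBufferPointer(source))
        return source.count
    }
    if let copied = fast { return copied }

    // Slow path: no contiguous storage available, copy byte by byte.
    var index = 0
    for byte in utf8 {
        destination[index] = byte
        index += 1
    }
    return index
}

let destination = UnsafeMutableRawBufferPointer.allocate(byteCount: 16, alignment: 1)
let written = copyUTF8(of: "héllo", into: destination)   // "é" takes two bytes
let copiedBytes = Array(destination[0 ..< (written ?? 0)])
destination.deallocate()
```

The same closure also gives custom algorithms direct access to the contiguous bytes, which is the capability the last requirement asks for.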