From 8000dc62156e54b3f497f590d9da2000ca97e657 Mon Sep 17 00:00:00 2001 From: YOCKOW Date: Tue, 8 Apr 2025 15:47:04 +0900 Subject: [PATCH 1/7] [Proposal] Add "String Encoding Names" proposal. This proposal allows `String.Encoding` to be converted to and from various names. For example: ```swift print(String.Encoding.utf8.ianaName!) // Prints "UTF-8" print(String.Encoding(ianaName: "ISO_646.irv:1991") == .ascii) // Prints "true" ``` --- Proposals/NNNN-String-Encoding-Names.md | 317 ++++++++++++++++++++++++ 1 file changed, 317 insertions(+) create mode 100644 Proposals/NNNN-String-Encoding-Names.md diff --git a/Proposals/NNNN-String-Encoding-Names.md b/Proposals/NNNN-String-Encoding-Names.md new file mode 100644 index 000000000..61a22aafa --- /dev/null +++ b/Proposals/NNNN-String-Encoding-Names.md @@ -0,0 +1,317 @@ +# String Encoding Names + +* Proposal: Not assigned yet +* Author(s): [YOCKOW](https://GitHub.com/YOCKOW) +* Review Manager: TBD +* Status: **Awaiting review** + +* Implementation: [StringEncodingNameImpl/StringEncodingName.swift](https://github.com/YOCKOW/SF-StringEncodingNameImpl/blob/main/Sources/StringEncodingNameImpl/StringEncodingName.swift) *(Awaiting implementation on [swiftlang/swift-foundation](https://github.com/swiftlang/swift-foundation))* + + +* Review: ([Pitch](https://forums.swift.org/t/pitch-foundation-string-encoding-names/74623)) + + +## Revision History + +### [Pitch#1](https://gist.github.com/YOCKOW/f5a385e3c9e2d0c97f3340a889f57a16/d76651bf4375164f6a46df792fccd74955a4733a) + +- Features + * Fully compatible with CoreFoundation. + + Planned to add static properties corresponding to `kCFStringEncoding*`. + * Spelling of getter/initializer was `ianaCharacterSetName`. +- Pros + * Easy to migrate from CoreFoundation. +- Cons + * Propagating undesirable legacy conversions into current Swift Foundation. + * Including string encodings which might not be supported by Swift Foundation. 
+ + +### [Pitch#2](https://gist.github.com/YOCKOW/f5a385e3c9e2d0c97f3340a889f57a16/215404d620b41119a8a03ec1a51e725eb09be4b6) + +- Features + * Consulting both [IANA Character Sets](https://www.iana.org/assignments/character-sets/character-sets.xhtml) and [WHATWG Encoding Standard](https://encoding.spec.whatwg.org/). + + Making a compromise between them. + * Spelling of getter/initializer was `name`. +- Pros + * Easy to communicate with API. +- Cons + * Hard for users to comprehend conversions. + * Difficult to maintain the API in a consistent way. + +### [Pitch#3](https://github.com/YOCKOW/SF-StringEncodingNameImpl/blob/0.1.0/proposal/NNNN-String-Encoding-Names.md), [Pitch#4](https://github.com/YOCKOW/SF-StringEncodingNameImpl/blob/0.2.1/proposal/NNNN-String-Encoding-Names.md) + +- Features + * Consulting both [IANA Character Sets](https://www.iana.org/assignments/character-sets/character-sets.xhtml) and [WHATWG Encoding Standard](https://encoding.spec.whatwg.org/). + * Following ["Charset Alias Matching"](https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching) rule defined in UTS#22 to parse IANA Charset Names. + * Separated getters/initializers for them. + + #3: `charsetName` and `standardName` respectively. + + #4: `name(.iana)` and `name(.whatwg)` for getters; `init(iana:)` and `init(whatwg:)` for initializers. +- Pros + * Users can recognize which kind of conversion is used. +- Cons + * Not reflecting the fact that WHATWG's Encoding Standard provides not only string encoding names but also implementations to encode/decode data. + +### [Pitch#5](https://github.com/YOCKOW/SF-StringEncodingNameImpl/blob/0.3.1/proposal/NNNN-String-Encoding-Names.md) + +- Features + * Withdrew support for [WHATWG Encoding Standard](https://encoding.spec.whatwg.org/). + * Following ["Charset Alias Matching"](https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching) rule defined in UTS#22 to parse IANA Charset Names. 
+ * Spelling of getter/initializer was `name`. + * "Fixed" some behaviour of parsing, which differs from CoreFoundation. +- Pros + * Simple API to use. +- Cons + * It was unclear that IANA names were used. + * The parsing behavior was complex and unpredictable. + + +### [Pitch#6](https://github.com/YOCKOW/SF-StringEncodingNameImpl/blob/0.4.0/proposal/NNNN-String-Encoding-Names.md), Proposal#1 + +This version. + + +## Introduction + +This proposal allows `String.Encoding` to be converted to and from various names. + +For example: + +```swift +print(String.Encoding.utf8.ianaName!) // Prints "UTF-8" +print(String.Encoding(ianaName: "ISO_646.irv:1991") == .ascii) // Prints "true" +``` + + +## Motivation + +String encoding names are widely used in computer networking and other areas. For instance, you often see them in HTTP headers such as `Content-Type: text/plain; charset=UTF-8` or in XML documents with declarations such as ``. + +Therefore, it is necessary to parse and generate such names. + + +### Current solution + +Swift lacks the necessary APIs, requiring the use of `CoreFoundation` (hereinafter called "CF") as described below. + +```swift +extension String.Encoding { + var nameInLegacyWay: String? { + // 1. Convert `String.Encoding` value to the `CFStringEncoding` value. + // NOTE: The raw value of `String.Encoding` is the same as the value of `NSStringEncoding`, + // while it is not equal to the value of `CFStringEncoding`. + let cfStrEncValue: CFStringEncoding = CFStringConvertNSStringEncodingToEncoding(self.rawValue) + + // 2. Convert it to the name where its type is `CFString?` + let cfStrEncName: CFString? = CFStringConvertEncodingToIANACharSetName(cfStrEncValue) + + // 3. Convert `CFString` to Swift's `String`. + // NOTE: Unfortunately they can not be implicitly casted on Linux. + let charsetName: String? 
= cfStrEncName.flatMap { + let bufferSize = CFStringGetMaximumSizeForEncoding( + CFStringGetLength($0), + kCFStringEncodingASCII + ) + 1 + let buffer = UnsafeMutablePointer<CChar>.allocate(capacity: bufferSize) + defer { + buffer.deallocate() + } + guard CFStringGetCString($0, buffer, bufferSize, kCFStringEncodingASCII) else { + return nil + } + return String(utf8String: buffer) + } + return charsetName + } + + init?(fromNameInLegacyWay charsetName: String) { + // 1. Convert `String` to `CFString` + let cfStrEncName: CFString = charsetName.withCString { (cString: UnsafePointer<CChar>) -> CFString in + return CFStringCreateWithCString(nil, cString, kCFStringEncodingASCII) + } + + // 2. Convert it to `CFStringEncoding` + let cfStrEncValue: CFStringEncoding = CFStringConvertIANACharSetNameToEncoding(cfStrEncName) + + // 3. Check whether or not it's valid + guard cfStrEncValue != kCFStringEncodingInvalidId else { + return nil + } + + // 4. Convert `CFStringEncoding` value to `String.Encoding` value + self.init(rawValue: CFStringConvertEncodingToNSStringEncoding(cfStrEncValue)) + } +} +``` + + +### What's the problem with the current solution? + +- It is complicated to use multiple CF functions to get a simple value. That's not *Swifty*. +- CF functions are legacy APIs that do not always meet modern requirements. +- CF APIs are not officially intended to be called directly from Swift on non-Darwin platforms. + + +## Proposed solution + +The solution is straightforward. +We introduce a computed property that returns the name, and the initializer that creates an instance from a name as shown below. + +```swift +extension String.Encoding { + /// The name of this encoding that is compatible with the one of the IANA registry "charset". + public var ianaName: String? + + /// Creates an instance from the name of the IANA registry "charset". 
+ public init?(ianaName: String) +} +``` + +## Detailed design + +This proposal refers to "[Character Sets](https://www.iana.org/assignments/character-sets/character-sets.xhtml)" published by IANA. + +One of the reasons for this is that The World Wide Web Consortium (W3C) recommends using IANA "charset" names in XML[^XML-IANA-charset-names] and they assert that any IANA "charset" names are available in HTTP header[^HTTP-IANA-charset-names]. + +[^XML-IANA-charset-names]: https://www.w3.org/TR/xml11/#charencoding +[^HTTP-IANA-charset-names]: https://www.w3.org/International/articles/http-charset/index#charset + +Another reason is that CF claims that IANA "charset" names are used, as implied by its function names[^CF-IANA-function-names]. + +[^CF-IANA-function-names]: [`CFStringConvertIANACharSetNameToEncoding`](https://developer.apple.com/documentation/corefoundation/cfstringconvertianacharsetnametoencoding(_:)) and [`CFStringConvertEncodingToIANACharSetName`](https://developer.apple.com/documentation/corefoundation/cfstringconvertencodingtoianacharsetname(_:)) + +However, as mentioned above, CF APIs are sometimes outdated. +Furthermore, CF parses "charset" names inconsistently[^CF-inconsistent-parse]. +Therefore, we shouldn't adopt CF-like behavior without modifications. Nevertheless, adjusting it to some extent can be unpredictable and complex. 
+ +[^CF-inconsistent-parse]: https://forums.swift.org/t/pitch-foundation-string-encoding-names/74623/53 + +Accordingly, this proposal suggests just simple correspondence between `String.Encoding` instances and IANA names: + + +| `String.Encoding` | IANA "charset" Name | +|----------------------|---------------------| +| `.ascii` | US-ASCII | +| `.iso2022JP` | ISO-2022-JP | +| `.isoLatin1` | ISO-8859-1 | +| `.isoLatin2` | ISO-8859-2 | +| `.japaneseEUC` | EUC-JP | +| `.macOSRoman` | macintosh | +| `.nextstep` | *n/a* | +| `.nonLossyASCII` | *n/a* | +| `.shiftJIS` | Shift_JIS | +| `.symbol` | *n/a* | +| `.unicode`/`.utf16` | UTF-16 | +| `.utf16BigEndian` | UTF-16BE | +| `.utf16LittleEndian` | UTF-16LE | +| `.utf32` | UTF-32 | +| `.utf32BigEndian` | UTF-32BE | +| `.utf32LittleEndian` | UTF-32LE | +| `.utf8` | UTF-8 | +| `.windowsCP1250` | windows-1250 | +| `.windowsCP1251` | windows-1251 | +| `.windowsCP1252` | windows-1252 | +| `.windowsCP1253` | windows-1253 | +| `.windowsCP1254` | windows-1254 | + + +### `String.Encoding` to Name + +- Upper-case letters may be used unlike CF. + * `var ianaName` returns *Preferred MIME Name* or *Name* of the encoding defined in "IANA Character Sets". + + +### Name to `String.Encoding` + +- `init(ianaName:)` adopts case-insensitive comparison with *Preferred MIME Name*, *Name*, and *Aliases*. + + +## Source compatibility + +These changes proposed here are only additive. However, care must be taken if migrating from CF APIs. + + +## Implications on adoption + +This feature can be freely adopted and un-adopted in source code with no deployment constraints and without affecting source compatibility. + + +## Future directions + +`String.init(data:encoding:)` and `String.data(using:)` will be implemented more appropriately[^string-data-regression]. + +[^string-data-regression]: https://github.com/swiftlang/swift-foundation/issues/1015 + + +Hopefully, happening some cascades like below might be expected in the longer term. 
+ +- General string decoders/encoders and their protocols (for example, as suggested in "[Unicode Processing APIs](https://forums.swift.org/t/pitch-unicode-processing-apis/69294)") could be implemented. + +- Some types which provide their names and decoders/encoders could be implemented for the purpose of tightness between names and implementations. + * There would be a type for WHATWG Encoding Standard which defines both names and implementations. + +
They would look like...
+ +```swift +public protocol StrawmanStringEncodingProtocol { + static func encoding(for name: String) -> Self? + var name: String? { get } + var encoder: (any StringToByteStreamEncoder)? { get } + var decoder: (any ByteStreamToUnicodeScalarsDecoder)? { get } +} + +public struct IANACharset: StrawmanStringEncodingProtocol { + public static let utf8: IANACharset = ... + public static let shiftJIS: IANACharset = ... + : + : +} + +public struct WHATWGEncoding: StrawmanStringEncodingProtocol { + public static let utf8: WHATWGEncoding = ... + public static let eucJP: WHATWGEncoding = ... + : + : +} +``` + +
+ +- `String.Encoding` might be deprecated as a natural course in the distant future?? + + +## Alternatives considered + +### Following "Charset Alias Matching" + +[UTS#22](https://www.unicode.org/reports/tr22/tr22-8.html) defines the "Charset Alias Matching" rule. +ICU adopts that rule and CF partially depends on ICU. +On the other hand, there doesn't seem to be any specification that requires "Charset Alias Matching". +Moreover, some risks may be inherent in such a tolerant rule. + +One possible solution may be letting users choose which rule should be used: +```swift +extension String.Encoding { + public enum NameParsingStrategy { + case uts22 + case caseInsensitiveComparison + } + + public init?(ianaName: String, strategy: NameParsingStrategy = .caseInsensitiveComparison) { + ... + } +} +``` + + +### Adopting the WHATWG Encoding Standard (as well) + +There is another standard for string encodings which is published by WHATWG: "[Encoding Standard](https://encoding.spec.whatwg.org/)". +While it may claim that IANA's Character Sets could be replaced with it, it focuses entirely on Web browsers and their JavaScript APIs. +Furthermore, it tightly binds names to implementations. +Since `String.Encoding` is just a `RawRepresentable` type where its `RawValue` is `UInt`, it is more universal but is more loosely bound to implementations. +As a result, WHATWG Encoding Standard doesn't easily align with `String.Encoding`. So it is just mentioned in "Future Directions". + + +## Acknowledgments + +Thanks to everyone who gave me advice on the pitch thread; especially to [@benrimmington](https://github.com/benrimmington) and [@xwu](https://github.com/xwu) who could channel their concerns into this proposal in the very early stage. From 19389fdcd89f10d7d936847c0f05917c82f9e531 Mon Sep 17 00:00:00 2001 From: YOCKOW Date: Fri, 9 May 2025 16:39:19 +0900 Subject: [PATCH 2/7] Change the link to implementation. 
--- Proposals/NNNN-String-Encoding-Names.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Proposals/NNNN-String-Encoding-Names.md b/Proposals/NNNN-String-Encoding-Names.md index 61a22aafa..6ccf61571 100644 --- a/Proposals/NNNN-String-Encoding-Names.md +++ b/Proposals/NNNN-String-Encoding-Names.md @@ -5,7 +5,7 @@ * Review Manager: TBD * Status: **Awaiting review** -* Implementation: [StringEncodingNameImpl/StringEncodingName.swift](https://github.com/YOCKOW/SF-StringEncodingNameImpl/blob/main/Sources/StringEncodingNameImpl/StringEncodingName.swift) *(Awaiting implementation on [swiftlang/swift-foundation](https://github.com/swiftlang/swift-foundation))* +* Implementation: [swiftlang/swift-foundation#1286](https://github.com/swiftlang/swift-foundation/pull/1286) * Review: ([Pitch](https://forums.swift.org/t/pitch-foundation-string-encoding-names/74623)) From eab21a62220bd0479940aa182213ccba8c1a9115 Mon Sep 17 00:00:00 2001 From: YOCKOW Date: Tue, 17 Jun 2025 10:09:54 +0900 Subject: [PATCH 3/7] Remove description about #1015 since it is resolved. Links: - Issue: https://github.com/swiftlang/swift-foundation/issues/1015 - Resolvers: * https://github.com/swiftlang/swift-foundation/pull/1217 * https://github.com/swiftlang/swift-corelibs-foundation/pull/5194 --- Proposals/NNNN-String-Encoding-Names.md | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/Proposals/NNNN-String-Encoding-Names.md b/Proposals/NNNN-String-Encoding-Names.md index 6ccf61571..91c46be75 100644 --- a/Proposals/NNNN-String-Encoding-Names.md +++ b/Proposals/NNNN-String-Encoding-Names.md @@ -237,10 +237,7 @@ This feature can be freely adopted and un-adopted in source code with no deploym ## Future directions -`String.init(data:encoding:)` and `String.data(using:)` will be implemented more appropriately[^string-data-regression]. 
+ - +[^string-data-regression]: https://github.com/swiftlang/swift-foundation/issues/1015 - +This feature will make it easier for programs to parse string encoding names, so that (e.g.) Web apps written in Swift won't need to implement such a parser on their own. Hopefully, happening some cascades like below might be expected in the longer term. @@ -276,8 +273,6 @@ public struct WHATWGEncoding: StrawmanStringEncodingProtocol { -- `String.Encoding` might be deprecated as a natural course in the distant future?? - ## Alternatives considered From 171d501ce0ffd3a66c9cf951542be5bd3f50baf2 Mon Sep 17 00:00:00 2001 From: YOCKOW Date: Mon, 18 Aug 2025 12:16:15 +0900 Subject: [PATCH 4/7] Add a description about already-available ICU string converter. --- Proposals/NNNN-String-Encoding-Names.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/Proposals/NNNN-String-Encoding-Names.md b/Proposals/NNNN-String-Encoding-Names.md index 91c46be75..9e99e2619 100644 --- a/Proposals/NNNN-String-Encoding-Names.md +++ b/Proposals/NNNN-String-Encoding-Names.md @@ -239,6 +239,14 @@ This feature can be freely adopted and un-adopted in source code with no deploym This feature will make it easier for programs to parse string encoding names, so that (e.g.) Web apps written in Swift won't need to implement such a parser on their own. +We already have the string converter in `FoundationInternationalization` that wraps ICU APIs, but that requires IANA Charset Names to create an instance of naive ICU converter[^icu-string-converter]. +Once this feature is adopted, it will become easier to implement other string encoding conversions that are not yet available. + +[^icu-string-converter]: https://github.com/swiftlang/swift-foundation/blob/a8bee5bfc71210168fa1b973fb1a1deb8bde2047/Sources/FoundationInternationalization/ICU/ICU%2BStringConverter.swift#L18-L37 + + +### Longer-term perspective + Hopefully, happening some cascades like below might be expected in the longer term. 
- General string decoders/encoders and their protocols (for example, as suggested in "[Unicode Processing APIs](https://forums.swift.org/t/pitch-unicode-processing-apis/69294)") could be implemented. From 3186ee7e45908a5c34a6d2f5f824783eec8dafe7 Mon Sep 17 00:00:00 2001 From: YOCKOW Date: Sun, 7 Sep 2025 14:41:14 +0900 Subject: [PATCH 5/7] SF-0033: Add `@available` attributes to sample code. In response to: https://forums.swift.org/t/review-sf-0033-string-encoding-names/81965/7 --- Proposals/0033-String-Encoding-Names.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/Proposals/0033-String-Encoding-Names.md b/Proposals/0033-String-Encoding-Names.md index 7a9a73dca..fcc4ac608 100644 --- a/Proposals/0033-String-Encoding-Names.md +++ b/Proposals/0033-String-Encoding-Names.md @@ -159,9 +159,11 @@ We introduce a computed property that returns the name, and the initializer that ```swift extension String.Encoding { /// The name of this encoding that is compatible with the one of the IANA registry "charset". + @available(FoundationPreview 6.2, *) public var ianaName: String? /// Creates an instance from the name of the IANA registry "charset". + @available(FoundationPreview 6.2, *) public init?(ianaName: String) } ``` @@ -232,7 +234,7 @@ These changes proposed here are only additive. However, care must be taken if mi ## Implications on adoption -This feature can be freely adopted and un-adopted in source code with no deployment constraints and without affecting source compatibility. +This feature can be freely adopted and un-adopted in source code without affecting source compatibility. ## Future directions From a9bcfaff27e68924c5900302431e974929d691fc Mon Sep 17 00:00:00 2001 From: YOCKOW Date: Sun, 7 Sep 2025 15:29:50 +0900 Subject: [PATCH 6/7] SF-0033: Clarify which "case-insensitivity" is used. 
In response to: https://forums.swift.org/t/review-sf-0033-string-encoding-names/81965/8 --- Proposals/0033-String-Encoding-Names.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Proposals/0033-String-Encoding-Names.md b/Proposals/0033-String-Encoding-Names.md index fcc4ac608..0014d6760 100644 --- a/Proposals/0033-String-Encoding-Names.md +++ b/Proposals/0033-String-Encoding-Names.md @@ -224,7 +224,7 @@ Accordingly, this proposal suggests just simple correspondence between `String.E ### Name to `String.Encoding` -- `init(ianaName:)` adopts case-insensitive comparison with *Preferred MIME Name*, *Name*, and *Aliases*. +- `init(ianaName:)` adopts ASCII case-insensitive comparison with *Preferred MIME Name*, *Name*, and *Aliases*. ## Source compatibility From db0555cb5a9191235170096ba079bab4dc81a605 Mon Sep 17 00:00:00 2001 From: YOCKOW Date: Fri, 12 Sep 2025 11:46:46 +0900 Subject: [PATCH 7/7] SF-0033: Change FoundationPreview version to 6.3. In response to: https://github.com/swiftlang/swift-foundation/pull/1502#discussion_r2341788835 --- Proposals/0033-String-Encoding-Names.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Proposals/0033-String-Encoding-Names.md b/Proposals/0033-String-Encoding-Names.md index 0014d6760..e9e2197b8 100644 --- a/Proposals/0033-String-Encoding-Names.md +++ b/Proposals/0033-String-Encoding-Names.md @@ -159,11 +159,11 @@ We introduce a computed property that returns the name, and the initializer that ```swift extension String.Encoding { /// The name of this encoding that is compatible with the one of the IANA registry "charset". - @available(FoundationPreview 6.2, *) + @available(FoundationPreview 6.3, *) public var ianaName: String? /// Creates an instance from the name of the IANA registry "charset". - @available(FoundationPreview 6.2, *) + @available(FoundationPreview 6.3, *) public init?(ianaName: String) } ```
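
Note on the lookup rule settled in patch 6/7: the "ASCII case-insensitive comparison with *Preferred MIME Name*, *Name*, and *Aliases*" can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation in swiftlang/swift-foundation: `SketchEncoding`, `asciiLowercased`, and the abbreviated alias lists are hypothetical stand-ins, and only a handful of aliases from the IANA registry are included.

```swift
/// Folds only ASCII letters A-Z to lower case; every other scalar is kept as-is.
/// (Hypothetical helper for this sketch.)
func asciiLowercased(_ string: String) -> [Unicode.Scalar] {
    string.unicodeScalars.map { scalar in
        if scalar.value >= 0x41 && scalar.value <= 0x5A {
            // The value is in 0x41...0x5A, so converting to UInt8 cannot overflow.
            return Unicode.Scalar(UInt8(scalar.value) + 0x20)
        }
        return scalar
    }
}

/// Hypothetical stand-in for `String.Encoding`, so the sketch needs no Foundation.
enum SketchEncoding: CaseIterable {
    case ascii, isoLatin1, shiftJIS, utf8

    /// Preferred MIME name first, followed by a few (not all) registered names and aliases.
    var ianaNames: [String] {
        switch self {
        case .ascii:     return ["US-ASCII", "ISO_646.irv:1991", "ANSI_X3.4-1968", "us"]
        case .isoLatin1: return ["ISO-8859-1", "ISO_8859-1:1987", "latin1", "l1"]
        case .shiftJIS:  return ["Shift_JIS", "MS_Kanji", "csShiftJIS"]
        case .utf8:      return ["UTF-8", "csUTF8"]
        }
    }

    /// The name reported for this encoding (the preferred MIME name in this sketch).
    var ianaName: String { ianaNames[0] }

    /// Succeeds only when `ianaName` equals a known name or alias, ignoring ASCII case.
    init?(ianaName: String) {
        let key = asciiLowercased(ianaName)
        let match = Self.allCases.first { candidate in
            candidate.ianaNames.contains { asciiLowercased($0) == key }
        }
        guard let match else { return nil }
        self = match
    }
}

print(SketchEncoding(ianaName: "utf-8") == .utf8)             // true
print(SketchEncoding(ianaName: "ISO_646.irv:1991") == .ascii) // true
print(SketchEncoding(ianaName: "UTF8") == nil)                // true
```

Comparing scalar arrays rather than `String` values keeps the match strictly ASCII-based: no Unicode case folding or canonical equivalence is applied, and spellings such as `UTF8` that only the UTS#22 "Charset Alias Matching" rule would accept are rejected.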