-
Notifications
You must be signed in to change notification settings - Fork 187
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enumerateAsDict crashing on ios15 when <U+FEFF> Zero Width No-Break Space missing (Byte Order Mark) #97
Comments
Thanks for reporting! That's weird, why is that character needed at all? 🤔 Could you write a minimal failing test in a PR as a base for a fix? |
same print(dict["name"]) --> dict["\Ufeffname"] |
The addition of this character sounds very much like a bug on e.g. Airtable's end if they add the U+FEFF character. Why Xcode would add that back in is a mystery to me @varun-s -- CSV should not require a leading non-breaking space. |
Not a bug in either Air Table or Xcode. The U+FEFF character is the "Byte Order Mark" and it is a long-established part of the Unicode standard. In version 0.6.x, if you load a CSV file encoded in UTF-8 with a BOM, such as one generated by modern versions of Excel, then the U+FEFF winds up embedded as the first character of the first column header, rather than getting discarded by the parser as it should.
I can't currently try version 0.8.0 due to the breaking change removing .namedRows, which my code relies on. But I'm assuming the behavior hasn't changed. I can open a separate, linked issue if you would like. |
@lardieri Thanks for the pointer, didn't know that! I've been too dismissive of the issue too quickly it seems. It sounds like it'd be most accurate to extract the BOM into Lines 100 to 102 in 048a1d3
I still don't know and would need to research whether the BOM is mandatory, recommended, or even widely supported. Still odd that it crashed for @varun-s IMHO |
Lots of bandwidth and storage have been spent on that topic, and your favorite search engine can show you a sample. People's opinions of BOMs generally correlate with their opinions of Microsoft. As you're reading, please remember that when Unicode was invented, big-endian machines were much more prevalent on the internet than they are today. The Unicode standard was written to embrace big-endian byte order as the assumed default, which forced Microsoft to put BOMs on everything, since Windows has always been little-endian. That seems quaint today, when Mac, Linux, iOS, and Android are also little-endian, but it absolutely was a big deal 30 years ago. The Unicode standard says that BOMs are allowed, but not required or encouraged, for UTF-8 files. The BOM of a UTF-8 file is always 0xEF, 0xBB, 0xBF regardless of the architecture of the machine. Microsoft chooses to put BOMs on their UTF-8 files, not because they provide any information about byte order, but in order to distinguish a UTF-8 file from any other "extended ASCII" file encoding, such as ISO Latin 1 / Windows 1252. With that background out of the way... I think we are looking at a bug, or at least a poorly-thought-out design, in Apple's standard library. A very long time ago, Apple had a String class method, https://developer.apple.com/documentation/foundation/nsstring/1497269-string If there was no BOM, this method would assume the "default C encoding" — just as Windows applications would assume the default Windows code page, as noted above. Apple quickly replaced this method with the This means after your convenience initializer calls var contents = try String(contentsOf: url, encoding: encoding)
if contents.hasPrefix("\u{FEFF}") {
contents.removeFirst()
} You don't need to remember the BOM in metadata or add the BOM back when serializing the file, because it's implied by the outgoing encoding. When you call To support callers who want to serialize 100% Microsoft-compatible UTF-8 CSV files with BOMs, you could augment
extension String.Encoding {
static let utf8WithBOM = String.Encoding(rawValue: UInt(12345))
}
The serialization aspect of this issue is really an API design question, so it's up to you. But for the deserialization aspect, I'd recommend what I wrote above: look for the BOM just at the point of loading the string from the URL, and strip it there. |
First column is always returning nil
enumerateAsDict crashing on ios15 when <U+FEFF> Zero Width No-Break Space missing in start of file.
Was working fine before iOS15.
Steps to reproduce:
Simple fix.
Open the CSV in xcode and just delete the first column header and add it back. On doing this the <U+FEFF> Zero Width No-Break Space gets added to the start and parsing works fine again.
The text was updated successfully, but these errors were encountered: