Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enumerateAsDict crashing on ios15 when <U+FEFF> Zero Width No-Break Space missing (Byte Order Mark) #97

Closed
varun-s opened this issue Sep 17, 2021 · 6 comments

Comments

@varun-s
Copy link

varun-s commented Sep 17, 2021

First column is always returning nil

enumerateAsDict crashing on ios15 when <U+FEFF> Zero Width No-Break Space missing in start of file.
Was working fine before iOS15.

Steps to reproduce:

  1. Download CSV from Airtable (We use Airtable.com for our CSV docs)
  2. Copy to xcode.
  3. Parse

Simple fix.
Open the CSV in xcode and just delete the first column header and add it back. On doing this the <U+FEFF> Zero Width No-Break Space gets added to the start and parsing works fine again.

@varun-s varun-s changed the title Column with name "id" crashing on ios15! enumerateAsDict crashing on ios15 when accessing first column! Sep 17, 2021
@varun-s varun-s closed this as completed Sep 17, 2021
@varun-s varun-s reopened this Sep 17, 2021
@varun-s varun-s changed the title enumerateAsDict crashing on ios15 when accessing first column! enumerateAsDict crashing on ios15 when <U+FEFF> Zero Width No-Break Space missing Sep 17, 2021
@DivineDominion
Copy link
Contributor

Thanks for reporting!

That's weird, why is that character needed at all? 🤔

Could you write a minimal failing test in a PR as a base for a fix?

@arno608rw
Copy link

arno608rw commented Sep 21, 2021

same print(dict["name"]) --> dict["\Ufeffname"]

@DivineDominion
Copy link
Contributor

The addition of this character sounds very much like a bug on e.g. Airtable's end if they add the U+FEFF character.

Why Xcode would add that back in is a mystery to me @varun-s -- CSV should not require a leading non-breaking space.

@lardieri
Copy link
Contributor

lardieri commented Jul 5, 2022

Not a bug in either Air Table or Xcode. The U+FEFF character is the "Byte Order Mark" and it is a long-established part of the Unicode standard.

In version 0.6.x, if you load a CSV file encoded in UTF-8 with a BOM, such as one generated by modern versions of Excel, then the U+FEFF winds up embedded as the first character of the first column header, rather than getting discarded by the parser as it should.

(lldb) po importedCSV.header[0].unicodeScalars
▿ StringUnicodeScalarView("Date")
  - 0 : "\u{FEFF}"
  - 1 : "D"
  - 2 : "a"
  - 3 : "t"
  - 4 : "e"

I can't currently try version 0.8.0 due to the breaking change removing .namedRows, which my code relies on. But I'm assuming the behavior hasn't changed. I can open a separate, linked issue if you would like.

@DivineDominion DivineDominion changed the title enumerateAsDict crashing on ios15 when <U+FEFF> Zero Width No-Break Space missing enumerateAsDict crashing on ios15 when <U+FEFF> Zero Width No-Break Space missing (Byte Order Mark) Jul 8, 2022
@DivineDominion
Copy link
Contributor

@lardieri Thanks for the pointer, didn't know that! I've been too dismissive of the issue too quickly it seems.

It sounds like it'd be most accurate to extract the BOM into CSV 'metadata' when parsing a string and/or file, then add it back it when re-serializing the CSV.

SwiftCSV/SwiftCSV/CSV.swift

Lines 100 to 102 in 048a1d3

open var serialized: String {
return self.content.serialize(header: self.header, delimiter: self.delimiter)
}


I still don't know and would need to research whether the BOM is mandatory, recommended, or even widely supported.

Still odd that it crashed for @varun-s IMHO

@lardieri
Copy link
Contributor

Lots of bandwidth and storage have been spent on that topic, and your favorite search engine can show you a sample. People's opinions of BOMs generally correlate with their opinions of Microsoft. As you're reading, please remember that when Unicode was invented, big-endian machines were much more prevalent on the internet than they are today. The Unicode standard was written to embrace big-endian byte order as the assumed default, which forced Microsoft to put BOMs on everything, since Windows has always been little-endian. That seems quaint today, when Mac, Linux, iOS, and Android are also little-endian, but it absolutely was a big deal 30 years ago.

The Unicode standard says that BOMs are allowed, but not required or encouraged, for UTF-8 files. The BOM of a UTF-8 file is always 0xEF, 0xBB, 0xBF regardless of the architecture of the machine. Microsoft chooses to put BOMs on their UTF-8 files, not because they provide any information about byte order, but in order to distinguish a UTF-8 file from any other "extended ASCII" file encoding, such as ISO Latin 1 / Windows 1252.

With that background out of the way...

I think we are looking at a bug, or at least a poorly-thought-out design, in Apple's standard library. A very long time ago, Apple had a String class method, string(contentsOfFile:), that would explicitly recognize the UTF-8 BOM and strip it. See:

https://developer.apple.com/documentation/foundation/nsstring/1497269-string

If there was no BOM, this method would assume the "default C encoding" — just as Windows applications would assume the default Windows code page, as noted above.

Apple quickly replaced this method with the string(contentsOf:encoding:) method that forces you to pass an explicit encoding argument. When you pass .utf16, .utf32, or .unicode encodings, Apple's string decoder recognizes the BOM and strips it out. Unfortunately, it doesn't do this for the .utf8 encoding. Instead, it decodes the BOM as its Unicode code point, \u{FEFF} and puts it as the first character of the string.

This means after your convenience initializer calls String(contents:encoding:), you have to look for the BOM and strip it manually before passing the string to your designated initializer. Something like this:

    var contents = try String(contentsOf: url, encoding: encoding)
    if contents.hasPrefix("\u{FEFF}") {
        contents.removeFirst()
    }

You don't need to remember the BOM in metadata or add the BOM back when serializing the file, because it's implied by the outgoing encoding. When you call dataUsingEncoding(.utf16) or dataUsingEncoding(.utf32) or dataUsingEncoding(.unicode), Apple puts the BOM on for you. No other encoding needs it, although UTF-8 might want it. Also, remember that a CSV might get deserialized from one encoding and then reserialized to a different encoding.

To support callers who want to serialize 100% Microsoft-compatible UTF-8 CSV files with BOMs, you could augment dataUsingEncoding(_:) somehow.

  • Perhaps you could add a parameter (addBOM: Bool = false) that gets ignored for encodings other than UTF-8.
  • Or perhaps you could create your own encoding definition as an extension to String.Encoding, which you then recognize and act on in dataUsingEncoding(_:).
extension String.Encoding {
    static let utf8WithBOM = String.Encoding(rawValue: UInt(12345))
}
  • The third alternative is to do nothing with BOMs when serializing to UTF-8. If your caller wants one, they can themselves add the 0xEF, 0xBB, 0xBF bytes to the data you return.

The serialization aspect of this issue is really an API design question, so it's up to you. But for the deserialization aspect, I'd recommend what I wrote above: look for the BOM just at the point of loading the string from the URL, and strip it there.

DivineDominion pushed a commit that referenced this issue Aug 4, 2022
CSV files saved by Excel and Airtable include the BOM when encoding to UTF-8.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants