enumerateAsDict crashing on ios15 when <U+FEFF> Zero Width No-Break Space missing (Byte Order Mark) #97

varun-s · 2021-09-17T19:35:54Z

First column is always returning nil

enumerateAsDict crashing on ios15 when <U+FEFF> Zero Width No-Break Space missing in start of file.
Was working fine before iOS15.

Steps to reproduce:

Download CSV from Airtable (We use Airtable.com for our CSV docs)
Copy to xcode.
Parse

Simple fix.
Open the CSV in xcode and just delete the first column header and add it back. On doing this the <U+FEFF> Zero Width No-Break Space gets added to the start and parsing works fine again.

DivineDominion · 2021-09-18T07:18:14Z

Thanks for reporting!

That's weird, why is that character needed at all? 🤔

Could you write a minimal failing test in a PR as a base for a fix?

arno608rw · 2021-09-21T15:39:18Z

same print(dict["name"]) --> dict["\Ufeffname"]

DivineDominion · 2021-11-22T12:26:28Z

The addition of this character sounds very much like a bug on e.g. Airtable's end if they add the U+FEFF character.

Why Xcode would add that back in is a mystery to me @varun-s -- CSV should not require a leading non-breaking space.

lardieri · 2022-07-05T08:32:16Z

Not a bug in either Air Table or Xcode. The U+FEFF character is the "Byte Order Mark" and it is a long-established part of the Unicode standard.

In version 0.6.x, if you load a CSV file encoded in UTF-8 with a BOM, such as one generated by modern versions of Excel, then the U+FEFF winds up embedded as the first character of the first column header, rather than getting discarded by the parser as it should.

(lldb) po importedCSV.header[0].unicodeScalars
▿ StringUnicodeScalarView("Date")
  - 0 : "\u{FEFF}"
  - 1 : "D"
  - 2 : "a"
  - 3 : "t"
  - 4 : "e"

I can't currently try version 0.8.0 due to the breaking change removing .namedRows, which my code relies on. But I'm assuming the behavior hasn't changed. I can open a separate, linked issue if you would like.

DivineDominion · 2022-07-08T13:59:58Z

@lardieri Thanks for the pointer, didn't know that! I've been too dismissive of the issue too quickly it seems.

It sounds like it'd be most accurate to extract the BOM into CSV 'metadata' when parsing a string and/or file, then add it back it when re-serializing the CSV.

SwiftCSV/SwiftCSV/CSV.swift

Lines 100 to 102 in 048a1d3

    
           open var serialized: String { 
        
               return self.content.serialize(header: self.header, delimiter: self.delimiter) 
        
           }

I still don't know and would need to research whether the BOM is mandatory, recommended, or even widely supported.

Still odd that it crashed for @varun-s IMHO

lardieri · 2022-07-12T10:37:07Z

Lots of bandwidth and storage have been spent on that topic, and your favorite search engine can show you a sample. People's opinions of BOMs generally correlate with their opinions of Microsoft. As you're reading, please remember that when Unicode was invented, big-endian machines were much more prevalent on the internet than they are today. The Unicode standard was written to embrace big-endian byte order as the assumed default, which forced Microsoft to put BOMs on everything, since Windows has always been little-endian. That seems quaint today, when Mac, Linux, iOS, and Android are also little-endian, but it absolutely was a big deal 30 years ago.

The Unicode standard says that BOMs are allowed, but not required or encouraged, for UTF-8 files. The BOM of a UTF-8 file is always 0xEF, 0xBB, 0xBF regardless of the architecture of the machine. Microsoft chooses to put BOMs on their UTF-8 files, not because they provide any information about byte order, but in order to distinguish a UTF-8 file from any other "extended ASCII" file encoding, such as ISO Latin 1 / Windows 1252.

With that background out of the way...

I think we are looking at a bug, or at least a poorly-thought-out design, in Apple's standard library. A very long time ago, Apple had a String class method, string(contentsOfFile:), that would explicitly recognize the UTF-8 BOM and strip it. See:

https://developer.apple.com/documentation/foundation/nsstring/1497269-string

If there was no BOM, this method would assume the "default C encoding" — just as Windows applications would assume the default Windows code page, as noted above.

Apple quickly replaced this method with the string(contentsOf:encoding:) method that forces you to pass an explicit encoding argument. When you pass .utf16, .utf32, or .unicode encodings, Apple's string decoder recognizes the BOM and strips it out. Unfortunately, it doesn't do this for the .utf8 encoding. Instead, it decodes the BOM as its Unicode code point, \u{FEFF} and puts it as the first character of the string.

This means after your convenience initializer calls String(contents:encoding:), you have to look for the BOM and strip it manually before passing the string to your designated initializer. Something like this:

    var contents = try String(contentsOf: url, encoding: encoding)
    if contents.hasPrefix("\u{FEFF}") {
        contents.removeFirst()
    }

You don't need to remember the BOM in metadata or add the BOM back when serializing the file, because it's implied by the outgoing encoding. When you call dataUsingEncoding(.utf16) or dataUsingEncoding(.utf32) or dataUsingEncoding(.unicode), Apple puts the BOM on for you. No other encoding needs it, although UTF-8 might want it. Also, remember that a CSV might get deserialized from one encoding and then reserialized to a different encoding.

To support callers who want to serialize 100% Microsoft-compatible UTF-8 CSV files with BOMs, you could augment dataUsingEncoding(_:) somehow.

Perhaps you could add a parameter (addBOM: Bool = false) that gets ignored for encodings other than UTF-8.
Or perhaps you could create your own encoding definition as an extension to String.Encoding, which you then recognize and act on in dataUsingEncoding(_:).

extension String.Encoding {
    static let utf8WithBOM = String.Encoding(rawValue: UInt(12345))
}

The third alternative is to do nothing with BOMs when serializing to UTF-8. If your caller wants one, they can themselves add the 0xEF, 0xBB, 0xBF bytes to the data you return.

The serialization aspect of this issue is really an API design question, so it's up to you. But for the deserialization aspect, I'd recommend what I wrote above: look for the BOM just at the point of loading the string from the URL, and strip it there.

CSV files saved by Excel and Airtable include the BOM when encoding to UTF-8.

varun-s changed the title ~~Column with name "id" crashing on ios15!~~ enumerateAsDict crashing on ios15 when accessing first column! Sep 17, 2021

varun-s closed this as completed Sep 17, 2021

varun-s reopened this Sep 17, 2021

varun-s changed the title ~~enumerateAsDict crashing on ios15 when accessing first column!~~ enumerateAsDict crashing on ios15 when <U+FEFF> Zero Width No-Break Space missing Sep 17, 2021

DivineDominion changed the title ~~enumerateAsDict crashing on ios15 when <U+FEFF> Zero Width No-Break Space missing~~ enumerateAsDict crashing on ios15 when <U+FEFF> Zero Width No-Break Space missing (Byte Order Mark) Jul 8, 2022

DivineDominion mentioned this issue Aug 3, 2022

Remove the Unicode BOM, if present. (#97) #103

Merged

DivineDominion pushed a commit that referenced this issue Aug 4, 2022

Remove the Unicode BOM, if present. (#97) (#103)

591d3e7

CSV files saved by Excel and Airtable include the BOM when encoding to UTF-8.

lardieri closed this as completed in 05bd030 Aug 9, 2022

lardieri mentioned this issue Aug 24, 2023

Adds a test to show error described in #126 (also fixes #126) #128

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

enumerateAsDict crashing on ios15 when <U+FEFF> Zero Width No-Break Space missing (Byte Order Mark) #97

enumerateAsDict crashing on ios15 when <U+FEFF> Zero Width No-Break Space missing (Byte Order Mark) #97

varun-s commented Sep 17, 2021 •

edited

Loading

DivineDominion commented Sep 18, 2021

arno608rw commented Sep 21, 2021 •

edited

Loading

DivineDominion commented Nov 22, 2021

lardieri commented Jul 5, 2022

DivineDominion commented Jul 8, 2022

lardieri commented Jul 12, 2022

enumerateAsDict crashing on ios15 when <U+FEFF> Zero Width No-Break Space missing (Byte Order Mark) #97

enumerateAsDict crashing on ios15 when <U+FEFF> Zero Width No-Break Space missing (Byte Order Mark) #97

Comments

varun-s commented Sep 17, 2021 • edited Loading

DivineDominion commented Sep 18, 2021

arno608rw commented Sep 21, 2021 • edited Loading

DivineDominion commented Nov 22, 2021

lardieri commented Jul 5, 2022

DivineDominion commented Jul 8, 2022

lardieri commented Jul 12, 2022

varun-s commented Sep 17, 2021 •

edited

Loading

arno608rw commented Sep 21, 2021 •

edited

Loading