Handle invisible characters are included in CSV column names #688

jeremyf · 2022-12-15T20:53:36Z

In one project we encountered a column name having a Byte Order Mark character; hidden and vexing.

While debugging the data, we had:

> ent = Bulkrax::CsvEntry.find(id)
> ent.record
=> {"file"=>"annie-spratt-ogYRV8cfsBo-unsplash.jpg; annie-spratt-example-of-second-scan.jpg", "identifier"=>"RML_MS009_001", "identifier.ark"=>"RML_MS009_001", "title"=>"Rena and Juanita Liu seated with their dolls", "description"=>"Rena Liu and Juanita Liu, seated, holding dolls, Eleanor and Evelyn.", "creator"=>"Frost, Samuel Lilley (1884-1981)", "contributor"=>"Frost, Ella Knokey (1887-1968)", "date"=>"1939", "date.other"=>"1939", "format.extent"=>"Photograph: black and white; 7 x 5 cm", "type"=>"Image", "subject"=>"Missions -- China", "language"=>"English", "source"=>"General Conference Office of Archives, Statistics, & Research", "relation.isPartOf"=>"Samuel, Ella, and Gladys Frost Photo Collection, 1910-1975", "rights"=>"http://rightsstatements.org/vocab/NoC-US/1.0/", "coverage.spatial"=>nil}

Notice the "file" key. When I tried ent.record['file'], I got nil. Looking deeper at the record's file key I tried the following:

> ent.raw_metadata.keys.first.chars.map(&:chr)
=> ["", "f", "i", "l", "e"]

There was a hidden prefix character. Upon further digging, this turned out to be a Byte Order Mark (https://en.wikipedia.org/wiki/Byte_order_mark), likely injected by the tool that generated the CSV.
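For anyone wanting to reproduce the symptom outside of Bulkrax, here is a minimal sketch in plain Ruby. The hash and the `strip_bom_from_keys` helper are hypothetical and only illustrate the failure mode and one possible cleanup; they are not taken from the Bulkrax code base.

```ruby
# U+FEFF is the code point used for the byte order mark.
bom = "\uFEFF"

# A key that prints as "file" but carries a leading BOM (illustrative data).
raw_metadata = { "#{bom}file" => "example.jpg", "title" => "Example" }

raw_metadata.key?("file")                 # => false, the BOM makes the key differ
raw_metadata.keys.first.codepoints.first  # the first code point is 0xFEFF

# Hypothetical helper: return a copy of the hash with a leading BOM
# removed from every String key. Non-String keys are left untouched.
def strip_bom_from_keys(hash, bom = "\uFEFF")
  hash.transform_keys { |key| key.is_a?(String) ? key.delete_prefix(bom) : key }
end

cleaned = strip_bom_from_keys(raw_metadata)
cleaned["file"]                           # the lookup now finds "example.jpg"
```

Inspecting `codepoints` rather than `chars` makes the hidden character visible in console output, since the BOM itself renders as nothing.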

Prior to this commit, an uploaded CSV might have a column name that included the tricksy, invisible [Byte Order Mark][1].

The manifestation was that you could look at the raw metadata and see a key called `"file"`; however, `raw_metadata.key?("file")` returned `false`. Running `entry.raw_metadata.keys.first.chars.map(&:chr)` returned `["", "f", "i", "l", "e"]`, where the first value of that array was a [Byte Order Mark][1].

With this commit, we're handling both how we persist and how we load persisted serialized data. This follows the documentation for the `ActiveRecord::Base.serialize` method. It's envisioned that we might have more characters to sanitize.

Closes: #688

Related to: scientist-softserv/adventist-dl#179

[1]: https://en.wikipedia.org/wiki/Byte_order_mark
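To illustrate the idea of normalizing on both persist and load, here is a rough sketch of a coder object. ActiveRecord's `serialize` accepts an object that responds to `dump` and `load`; everything else here (the `BomNormalizingCoder` name, the choice of JSON, and stripping the BOM from values as well as keys) is an assumption for illustration and not the code merged for this issue.

```ruby
require "json"

# Hypothetical coder, NOT the actual Bulkrax implementation.
# It only relies on the dump/load contract that `serialize` expects.
module BomNormalizingCoder
  BOM = "\uFEFF"

  # Called before persisting: normalize, then serialize to a JSON string.
  def self.dump(value)
    JSON.generate(normalize(value))
  end

  # Called when reading: parse, then normalize again so that rows
  # persisted before the fix are cleaned up as they are loaded.
  def self.load(string)
    return {} if string.nil? || string.empty?

    normalize(JSON.parse(string))
  end

  # Recursively remove BOM characters from hash keys and string values.
  # Stripping values too is an assumption; the issue only concerns keys.
  def self.normalize(value)
    case value
    when Hash   then value.each_with_object({}) { |(k, v), h| h[normalize(k)] = normalize(v) }
    when Array  then value.map { |v| normalize(v) }
    when String then value.delete(BOM)
    else value
    end
  end
end

# Newly written data is cleaned on the way in.
stored = BomNormalizingCoder.dump({ "\uFEFFfile" => "example.jpg" })

# Data persisted before the fix is cleaned on the way out.
legacy = JSON.generate({ "\uFEFFfile" => "example.jpg" })
BomNormalizingCoder.load(legacy).key?("file")
```

Normalizing in `load` as well as `dump` means existing records do not need a separate data migration, at the cost of a little work on every read. If more characters need sanitizing later, `normalize` is the single place to extend.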

* Normalizing serialized data for BOM characters
* Amending documentation
* Appeasing the cop

Prior to this commit, I was thinking that all extracted text was properly encoded. However, in reviewing the underlying data in the application I found cases where the extracted text's content was not properly encoded. The primary culprit was the Byte Order Mark (BOM) character.

I'm not entirely certain why the original encoding doesn't cover the problem, but when I was testing in the console, I was getting the encoding error on the `all_text_timv` SOLR field, and it was the BOM that was causing the problem.

My conjecture is that either Rails's `to_json` is not recognizing the BOM as correctly encoded UTF-8 (which I believe it is), or we're getting something garbled back from Fedora, or the encode method was not quite right.

Regardless, with this commit, I'm forcing encoding of that plain text content and removing the BOM character.

Testing this is also a particular challenge because all of our existing tools for copy/paste and typing tend to do some hidden encoding antics on our behalf. Below is a naive example of using `Hyku.utf_8_encode` for the BOM stripping.

```ruby
irb(main):001:0> "\xEF\xBB\xBFHello"
=> "Hello"
irb(main):002:0> "\xEF\xBB\xBFHello" == "Hello"
=> false
irb(main):003:0> Hyku.utf_8_encode("\xEF\xBB\xBFHello") == "Hello"
=> true
```

Closes:

- https://github.com/scientist-softserv/adventist-dl/issues/181

Related to:

- samvera/bulkrax#689
- samvera/bulkrax#688
- https://github.com/scientist-softserv/adventist-dl/issues/179
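As a rough illustration of "forcing encoding and removing the BOM", the sketch below uses only core `String` methods. The `to_clean_utf8` helper is hypothetical and is not the implementation of `Hyku.utf_8_encode`; dropping invalid bytes via `scrub` is an extra assumption on my part.

```ruby
# Hypothetical helper, NOT the actual Hyku.utf_8_encode implementation.
def to_clean_utf8(text)
  text.to_s
      .dup                              # avoid mutating the caller's (possibly frozen) string
      .force_encoding(Encoding::UTF_8)  # reinterpret the bytes as UTF-8
      .scrub("")                        # assumption: drop byte sequences that are not valid UTF-8
      .delete("\uFEFF")                 # remove every BOM character
end

# Writing the BOM as escaped bytes sidesteps the copy/paste problem:
# the invisible character never has to survive an editor or clipboard.
with_bom = "\xEF\xBB\xBFHello"

to_clean_utf8(with_bom) == "Hello"    # => true
to_clean_utf8(with_bom.b) == "Hello"  # => true, also works for binary-tagged input
```

Using `\x` or `\u` escapes in test fixtures keeps the BOM visible in source control, which helps with the testing difficulty described above.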

jeremyf mentioned this issue Dec 15, 2022

Normalizing serialized data for BOM characters #689

Merged

jeremyf closed this as completed in #689 Dec 16, 2022

This was referenced Dec 19, 2022

Ensuring that extracted text is encoded scientist-softserv/adventist-dl#182

Merged

Temporarily disable text extraction scientist-softserv/adventist_knapsack#632

Open
