Handle invisible characters included in CSV column names #688

Closed
jeremyf opened this issue Dec 15, 2022 · 0 comments · Fixed by #689

jeremyf commented Dec 15, 2022

In one project we encountered a column name having a Byte Order Mark character; hidden and vexing.

While debugging the data, we had:

> ent = Bulkrax::CsvEntry.find(id)
> ent.record
=> {"file"=>"annie-spratt-ogYRV8cfsBo-unsplash.jpg; annie-spratt-example-of-second-scan.jpg", "identifier"=>"RML_MS009_001", "identifier.ark"=>"RML_MS009_001", "title"=>"Rena and Juanita Liu seated with their dolls", "description"=>"Rena Liu and Juanita Liu, seated, holding dolls, Eleanor and Evelyn.", "creator"=>"Frost, Samuel Lilley (1884-1981)", "contributor"=>"Frost, Ella Knokey (1887-1968)", "date"=>"1939", "date.other"=>"1939", "format.extent"=>"Photograph: black and white; 7 x 5 cm", "type"=>"Image", "subject"=>"Missions -- China", "language"=>"English", "source"=>"General Conference Office of Archives, Statistics, & Research", "relation.isPartOf"=>"Samuel, Ella, and Gladys Frost Photo Collection, 1910-1975", "rights"=>"http://rightsstatements.org/vocab/NoC-US/1.0/", "coverage.spatial"=>nil}

Notice the "file" key. When I tried ent.record['file'], I got nil. Looking deeper at the record's file key I tried the following:

> ent.raw_metadata.keys.first.chars.map(&:chr)
=> ["", "f", "i", "l", "e"]

There was a hidden prefix character. Upon further digging, this turned out to be a Byte Order Mark (https://en.wikipedia.org/wiki/Byte_order_mark), likely injected by the tool that generated the CSV.
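
To make the failure mode concrete, here is a minimal sketch in plain Ruby. The `record` hash and its values are hypothetical stand-ins for the data above, not Bulkrax output. It shows why the lookup returns nil and how inspecting codepoints exposes the hidden character.

```ruby
BOM = "\uFEFF" # U+FEFF, encoded in UTF-8 as the bytes EF BB BF

# Hypothetical record: the first header picked up the BOM that sat at the
# very start of the CSV file.
record = { "#{BOM}file" => "scan.jpg", "title" => "Example" }

first_key = record.keys.first

lookup     = record["file"]                 # nil, the stored key is BOM + "file"
has_key    = record.key?("file")            # false
key_length = first_key.length               # 5 characters, not 4
first_hex  = first_key.codepoints.first.to_s(16) # "feff"
has_bom    = first_key.start_with?(BOM)     # true
```

Printing the key does not help because the BOM renders as nothing in most terminals; comparing `length` or `codepoints` against the expected value is more reliable.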

jeremyf added a commit that referenced this issue Dec 15, 2022
Prior to this commit, when someone would upload a CSV it might have a
column name that included the tricksy, invisible [Byte Order Mark][1].
The manifestation was that you could look at the raw metadata and see a
key called `"file"` however when checking `raw_metadata.key?("file")`
the result was `false`.

To find this, I ran `entry.raw_metadata.keys.first.chars.map(&:chr)` and
got `["", "f", "i", "l", "e"]`, where the first value of that array was a
[Byte Order Mark][1].

With this commit, we're handling both how we persist and how we load
persisted serialized data. This follows the documentation for the
`ActiveRecord::Base.serialize` method.

It's envisioned that we might have more characters to sanitize.

Closes: #688
Related to: scientist-softserv/adventist-dl#179

[1]: https://en.wikipedia.org/wiki/Byte_order_mark
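
The commit message describes normalizing on both persist and load. A minimal sketch of that idea follows, written as a plain Ruby object with `dump` and `load` methods, which is the shape a custom `serialize` coder takes. The module name, the JSON storage format, and the `strip_keys` helper are assumptions for illustration; this is not the code from the referenced commit.

```ruby
require "json"

# Hypothetical coder; the name and the JSON format are assumptions.
module BomStrippingCoderSketch
  BOM = "\uFEFF"

  # Called when writing the attribute: normalize keys, then serialize.
  def self.dump(value)
    JSON.generate(strip_keys(value || {}))
  end

  # Called when reading the attribute: parse, then normalize keys again so
  # rows persisted before the fix are cleaned up as well.
  def self.load(string)
    return {} if string.nil? || string.empty?

    strip_keys(JSON.parse(string))
  end

  # Remove a leading BOM from every key; values are left untouched.
  def self.strip_keys(hash)
    hash.transform_keys { |key| key.to_s.delete_prefix(BOM) }
  end
end
```

A model could register an object like this through `serialize`; the exact argument form depends on the Rails version. Cleaning in `load` as well as `dump` means existing rows do not need a data migration before lookups such as `raw_metadata["file"]` start working.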
jeremyf added a commit that referenced this issue Dec 16, 2022
* Normalizing serialized data for BOM characters

* Amending documentation

* Appeasing the cop
jeremyf added a commit to scientist-softserv/adventist-dl that referenced this issue Dec 19, 2022
Prior to this commit, I assumed that all extracted text was properly
encoded. However, in reviewing the underlying data in the application,
I found cases where the extracted text's content was not properly
encoded. The primary culprit was the Byte Order Mark (BOM) character.

I'm not entirely certain why the original encoding doesn't cover the
problem, but when I was testing in the console, I was getting the
encoding error on the `all_text_timv` SOLR field; and it was the BOM
that was causing the problem.

My conjecture is that either Rails's `to_json` is not recognizing the
BOM as valid UTF-8 (which I believe it is), or we're getting something
garbled back from Fedora, or the encode method was not quite right.

Regardless, with this commit, I'm forcing encoding of that plain text
content and removing the BOM character. Testing this is also a
particular challenge because all of our existing tools for copy/paste
and typing tend to do some hidden encoding antics on our behalf.

Below is a naive example of using `Hyku.utf_8_encode` for BOM
stripping.

```ruby
irb(main):001:0> "\xEF\xBB\xBFHello"
=> "Hello"
irb(main):002:0> "\xEF\xBB\xBFHello" == "Hello"
=> false
irb(main):003:0> Hyku.utf_8_encode("\xEF\xBB\xBFHello") == "Hello"
=> true
```
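
For readers without the application at hand, the following is a rough sketch of what a helper with this behavior could look like. The method name `strip_bom_sketch` is hypothetical and the body is an assumption, not the actual `Hyku.utf_8_encode` implementation: it tags the bytes as UTF-8, drops any invalid byte sequences, and removes every BOM character.

```ruby
BOM = "\uFEFF"

# Hypothetical helper, not the real Hyku.utf_8_encode.
def strip_bom_sketch(text)
  # dup so that frozen input is not mutated by force_encoding
  utf8 = text.to_s.dup.force_encoding(Encoding::UTF_8)
  # scrub("") drops byte sequences that are not valid UTF-8
  utf8.scrub("").gsub(BOM, "")
end

cleaned = strip_bom_sketch("\xEF\xBB\xBFHello")
same    = cleaned == "Hello" # true
```

Writing the BOM as the escape `"\xEF\xBB\xBF"` or `"\uFEFF"` in tests sidesteps the copy/paste problem mentioned above, since the character never has to survive an editor or clipboard.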

Closes:

- https://github.com/scientist-softserv/adventist-dl/issues/181

Related to:

- samvera/bulkrax#689
- samvera/bulkrax#688
- https://github.com/scientist-softserv/adventist-dl/issues/179