Handle invisible characters are included in CSV column names #688
jeremyf added a commit that referenced this issue on Dec 15, 2022:

Prior to this commit, when someone uploaded a CSV, it might have a column name that included the tricksy, invisible [Byte Order Mark][1]. The manifestation was that you could look at the raw metadata and see a key called `"file"`; however, when checking `raw_metadata.key?("file")` the result was `false`. To investigate, I ran `entry.raw_metadata.keys.first.chars.map(&:chr)` and got `["", "f", "i", "l", "e"]`, where the first value of that array was a [Byte Order Mark][1].

With this commit, we're handling both how we persist and how we load persisted serialized data. This follows the documentation for the `ActiveRecord::Base.serialize` method. It's envisioned that we might have more characters to sanitize.

Closes: #688
Related to: scientist-softserv/adventist-dl#179

[1]: https://en.wikipedia.org/wiki/Byte_order_mark
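The symptom described in this commit message can be reproduced in plain Ruby. The sketch below is illustrative only: `RAW_METADATA` and `codepoint_labels` are hypothetical names, and the hash stands in for a parsed CSV row rather than real Bulkrax data. Printing code points instead of characters makes the invisible prefix visible.

```ruby
BOM = "\uFEFF" # U+FEFF, the Byte Order Mark

# Render each character as a code point label so invisible ones show up.
def codepoint_labels(string)
  string.codepoints.map { |cp| format("U+%04X", cp) }
end

# Hypothetical stand-in for the parsed CSV row described above.
RAW_METADATA = { "#{BOM}file" => "image.png", "title" => "Example" }.freeze

RAW_METADATA.key?("file")                 # => false
codepoint_labels(RAW_METADATA.keys.first) # => ["U+FEFF", "U+0066", "U+0069", "U+006C", "U+0065"]
RAW_METADATA.keys.first.start_with?(BOM)  # => true
```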
jeremyf added a commit that referenced this issue on Dec 16, 2022:

* Normalizing serialized data for BOM characters

  Prior to this commit, when someone uploaded a CSV, it might have a column name that included the tricksy, invisible [Byte Order Mark][1]. The manifestation was that you could look at the raw metadata and see a key called `"file"`; however, when checking `raw_metadata.key?("file")` the result was `false`. To investigate, I ran `entry.raw_metadata.keys.first.chars.map(&:chr)` and got `["", "f", "i", "l", "e"]`, where the first value of that array was a [Byte Order Mark][1].

  With this commit, we're handling both how we persist and how we load persisted serialized data. This follows the documentation for the `ActiveRecord::Base.serialize` method. It's envisioned that we might have more characters to sanitize.

  Closes: #688
  Related to: scientist-softserv/adventist-dl#179

  [1]: https://en.wikipedia.org/wiki/Byte_order_mark

* Amending documentation
* Appeasing the cop
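Handling both persisting and loading maps onto the coder contract that `ActiveRecord::Base.serialize` documents: an object responding to `dump` and `load`. The following is a minimal sketch of that idea, not the code from this commit. The module name `BomNormalizingJsonCoder`, the choice of JSON as the storage format, and stripping only a leading BOM are all assumptions made for illustration.

```ruby
require "json"

# Hypothetical coder; the name and the JSON format are assumptions.
module BomNormalizingJsonCoder
  BOM = "\uFEFF"

  # Called when writing the attribute: clean the keys, then serialize.
  def self.dump(hash)
    JSON.generate(normalize_keys(hash || {}))
  end

  # Called when reading the attribute: parse, then clean the keys, so
  # rows persisted before the fix are repaired on load as well.
  def self.load(json)
    return {} if json.nil? || json.empty?

    normalize_keys(JSON.parse(json))
  end

  # Remove a leading BOM from every key; values are left untouched.
  def self.normalize_keys(hash)
    hash.each_with_object({}) do |(key, value), out|
      out[key.to_s.delete_prefix(BOM)] = value
    end
  end
end
```

An object like this could be passed as the coder for a serialized attribute; the exact `serialize` call form depends on the Rails version in use. Normalizing in `load` as well as `dump` means existing records do not need a data migration to become readable by their clean key.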
jeremyf added a commit to scientist-softserv/adventist-dl that referenced this issue on Dec 19, 2022:

Prior to this commit, I assumed that all extracted text was properly encoded. However, in reviewing the underlying data in the application, I found cases where the extracted text's content was not properly encoded. The primary culprit was the Byte Order Mark (BOM) character.

I'm not entirely certain why the original encoding doesn't cover the problem, but when I was testing in the console, I was getting the encoding error on the `all_text_timv` SOLR field, and it was the BOM that was causing the problem. My conjecture is that either Rails's `to_json` is not recognizing the BOM as correctly encoded UTF-8 (which I believe it is), or we're getting something garbled back from Fedora, or the encode method was not quite right.

Regardless, with this commit, I'm forcing the encoding of that plain text content and removing the BOM character.

Testing this is also a particular challenge because all of our existing tools for copy/paste and typing tend to do some hidden encoding antics on our behalf. Below is a naive example of using `Hyku.utf_8_encode` for the BOM stripping.

```ruby
irb(main):001:0> "\xEF\xBB\xBFHello"
=> "Hello"
irb(main):002:0> "\xEF\xBB\xBFHello" == "Hello"
=> false
irb(main):003:0> Hyku.utf_8_encode("\xEF\xBB\xBFHello") == "Hello"
=> true
```

Closes:

- https://github.com/scientist-softserv/adventist-dl/issues/181

Related to:

- samvera/bulkrax#689
- samvera/bulkrax#688
- https://github.com/scientist-softserv/adventist-dl/issues/179
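For readers without the application at hand, the "force the encoding, then remove the BOM" step might look roughly like the sketch below. This is not the actual `Hyku.utf_8_encode` implementation, which is not shown in this thread. The helper name `strip_bom_and_encode` is hypothetical, and scrubbing invalid bytes is an added assumption.

```ruby
BOM_CHAR = "\uFEFF" # encoded in UTF-8 as the bytes EF BB BF

# Hypothetical helper, not the real Hyku.utf_8_encode.
def strip_bom_and_encode(text)
  # Work on a copy so frozen or shared strings are not mutated.
  utf8 = text.dup.force_encoding(Encoding::UTF_8)
  # Assumption: drop invalid byte sequences instead of raising later.
  utf8 = utf8.scrub("")
  # Remove every BOM character, wherever it appears.
  utf8.delete(BOM_CHAR)
end

strip_bom_and_encode("\xEF\xBB\xBFHello") == "Hello" # => true
```

Writing the BOM as explicit byte escapes, as the console example does, sidesteps the copy/paste problem mentioned above, since the invisible character never has to pass through an editor or clipboard.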
In one project we encountered a column name containing a Byte Order Mark character: hidden and vexing.

While debugging the data, we had a record that appeared to have a `"file"` key. When I tried `ent.record['file']`, I got `nil`. Looking deeper at the record's `file` key, I found a hidden prefix character. Upon further digging, this was a Byte Order Mark (https://en.wikipedia.org/wiki/Byte_order_mark), likely injected by the tool that generated the CSV.
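If the BOM does come from the tool that wrote the CSV, one way to keep it out of the column names is to consume it when the file is read. The sketch below shows that alternative using Ruby's `BOM|UTF-8` read mode. It is not how the fix for this issue works, and `csv_headers` is a hypothetical helper name.

```ruby
require "csv"
require "tempfile"

# Hypothetical helper: return the header row of a CSV file,
# ignoring a leading UTF-8 BOM if the file has one.
def csv_headers(path)
  # "r:BOM|UTF-8" asks Ruby to skip a leading UTF-8 BOM when present.
  text = File.open(path, "r:BOM|UTF-8", &:read)
  CSV.parse(text, headers: true).headers
end

# Demonstration with a temporary file that starts with the BOM bytes.
Tempfile.create(["bom", ".csv"]) do |file|
  File.binwrite(file.path, "\xEF\xBB\xBFfile,title\nimage.png,Example\n")
  csv_headers(file.path) # => ["file", "title"]
end
```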