Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add text snippet to InvalidByteSequenceError error message for easier troubleshooting #59

Conversation

elohanlon
Copy link
Contributor

No description provided.

@jrochkind
Copy link
Member

jrochkind commented May 6, 2019

There are lots of whitespace changes/fixes here, which make it hard for me to find what you actually changed!

Can you divide the whitespace fix and the semantic change into two commits (even if they're both on the same PR)? And point out what line the real change is on?

Also, @adjam are you interested in reviewing this? I think me and @adjam may be it for active committers, and I'd love to spread it around and not have it be me making all the calls!

@elohanlon
Copy link
Contributor Author

Whoops, sorry about that! Whitespace was automatically added by my editor. I'll fix that now. My change only affects line 163.

@jrochkind
Copy link
Member

I found the actual change! It had scrolled off the right side of my screen!

https://github.com/ruby-marc/ruby-marc/pull/59/files#diff-923e31c42a60d51378c49025dc10c3bbR163

This is really neat, that we can give some context!

My concern is -- if it was invalid byte for UTF-8 (is that what this error means? I think so?), is putting it in an exception message... going to result in an exception message that is invalid UTF-8, and which causes it's own exception if you try to log the exception!?

@elohanlon
Copy link
Contributor Author

elohanlon commented May 6, 2019

Haha. Good question and good concern. Fortunately, no additional exception is thrown. I'm currently using my forked version of this gem in another app, and this is what the new output looks like:

Error message: MARC8, input byte offset 112, code set: 0x45, code point: 0xbf. Snippet: ... (Yul phyogs so so��i gsar ��gyu...

Despite certain characters not being printable, I found this to be really helpful for quickly finding out where in the MARC record I should be looking to determine which characters couldn't be parsed (rather than having to look up code points every time).

@elohanlon elohanlon force-pushed the easier-InvalidByteSequenceError-troubleshooting branch from 525704a to f6b57df Compare May 6, 2019 19:59
@elohanlon
Copy link
Contributor Author

@jrochkind Okay, just split my commits into (1) whitespace cleanup and (2) actual error message change.

@jrochkind
Copy link
Member

jrochkind commented May 6, 2019

Ah, it changes the bad characters to replacement chars. Hmm, can you figure out where that replacmeent is coming from? If it's still bad in the exception message, but whatever you are using to log it fixes it -- I still feel that's a bad thing for ruby-marc to do.

This is in the code that translates Marc8 to UTF8, which makes it all even more confusing to deal with.

When you do marc8_string[(pos-20)...[pos+20, marc8_string.length].min], you are actually taking a slice of the string in Marc8, not in UTF8. Marc8 as it is can't be written to any terminal anywhere and be displayed using the correct characters for what it represents. So in some cases, this is going to put "nonsense" on the screen that is actually correct. For instance if the surrounding positions are correct Marc8 for a diacritic -- printing the bytes for correct Marc8 for a diacritic to the screen will not look like the diacritic on the screen!

This is really confusing stuff, and I'm just cautious of leaving something that makes it more confusing for people, which is really easy to do with this stuff.

I wonder if you're only getting � in your output because you actually had the utf8 bytes for � in your actual original record!

@elohanlon
Copy link
Contributor Author

Good points all around. Although it's been useful for me personally because I know what change I made (and I thought of it as a "best attempt at displaying anything, for the sake of troubleshooting"), I can definitely see how that could be confusing for other people. For context, the actual problem I'm trying to solve on my app side is why when catalogers enter smart/curly quotes (single or double) in Voyager OPAC MARC21 records, I can't seem to parse those same records as MARC8, and I get an Encoding::InvalidByteSequenceError when I try to.

Since Voyager seems to accept and handle these characters, I'm not sure why they're not parsing correctly as MARC8. I would have initially guessed that they're just invalid MARC8 characters, but it would surprise me if the Voyager client allows users to enter invalid characters and still displays them properly. (Sorry if that's too off-topic, though I'd appreciate any thoughts you have!)

@jrochkind
Copy link
Member

Yeah, character encoding is the worst. Yeah, my guess too is that Voyager is happy to handle records that have a mix of Marc8 and UTF8 in them, somehow. I don't think they actually do us any favors for getting our data in order. Character encoding issues are confusing no matter what, software doing non-standard things makes it more confusing.

Let's think on it a bit, maybe experiment more to figure out what's going on, and I'm curious if @adjam or any other ruby-marc contributors have any thoughts.

If I was you debugging, I'd figure out exactly what bytes are in that position in the records, and what those bytes mean (if anything) in Marc8, and what they mean (if anything) in UTF8. Easier to figure out what they mean in UTF8, since no modern software knows nothing about Marc8.

I wonder if Voyager is actually outputting the whole record in UTF8 not Marc8, and you actually want to process it as UTF8? That'd probably be too good to be true though.

@elohanlon
Copy link
Contributor Author

elohanlon commented May 6, 2019

Thanks for thinking this through with me! Yeah, character encoding is indeed the worst.

I tried parsing as UTF-8 and although I don't get a ruby error, the result still gives me a bunch of � characters. And for other records with non-ASCII characters, UTF-8 interpretation mangles things while MARC8 gives me the expected result, so it at least seems like MARC8 is the primary encoding (though it could be a mix, like you said).

I'm not sure how significant this is, but this is the character I'm expecting to get (based on a copy and paste of the original character into the search box on https://unicodemap.org):

https://unicodemap.org/details/0x02BC/index.html (0x02BC)

And 0x02BC doesn't appear anywhere in map_to_unicode.rb.

That character looks quite similar to this one though: https://unicodemap.org/details/0x2019/index.html (0x2019)

And 0x2019 appears in a hash called "ODD_MAP" in map_to_unicode.rb.

Also, for the UTF-8-interpreted string that I mentioned earlier ( (Yul phyogs so so��i gsar ��gyu), these are the bytes:

[32, 40, 89, 117, 108, 32, 112, 104, 121, 111, 103, 115, 32, 115, 111, 32, 115, 111, 239, 191, 189, 239, 191, 189, 105, 32, 103, 115, 97, 114, 32, 239, 191, 189, 239, 191, 189, 103, 121, 117]

Since I just grabbed 10 characters before and after, I may have chopped off part of any starting and ending multibyte strings.

@elohanlon
Copy link
Contributor Author

Packing those strings in a couple of different ways gives me:

2.5.3 :001 > [32, 40, 89, 117, 108, 32, 112, 104, 121, 111, 103, 115, 32, 115, 111, 32, 115, 111, 239, 191, 189, 239, 191, 189, 105, 32, 103, 115, 97, 114, 32, 239, 191, 189, 239, 191, 189, 103, 121, 117].pack('c*')
 => " (Yul phyogs so so\xEF\xBF\xBD\xEF\xBF\xBDi gsar \xEF\xBF\xBD\xEF\xBF\xBDgyu"
2.5.3 :002 > [32, 40, 89, 117, 108, 32, 112, 104, 121, 111, 103, 115, 32, 115, 111, 32, 115, 111, 239, 191, 189, 239, 191, 189, 105, 32, 103, 115, 97, 114, 32, 239, 191, 189, 239, 191, 189, 103, 121, 117].pack('U*')
 => " (Yul phyogs so so��i gsar ��gyu"

@jrochkind
Copy link
Member

jrochkind commented May 6, 2019

And 0x02BC doesn't appear anywhere in map_to_unicode.rb.

That's a curly apostrophe in UTF8, says your link? It may be that there is no way to represent a curly apostrophe in MARC8? So it would not be in the map of marc8 bytes to utf8 bytes.

I did not write that map_to_unicode, I'm not sure where it came from -- I'd bet it was at some point ported from Java of ancient origin.

Are you sure you can't just get Voyager to output UTF8 instead? (But now we're diagnosing your problem, which is a different ticket I guess).

@jrochkind
Copy link
Member

jrochkind commented May 6, 2019

Note that there should be a way to tell the marc-to-unicode converter you are using here to replace invalid bytes with � itself, instead of raising. That is an option I see in the existing code (which I haven't looked at in a while, if ever previously). Would using that option work better for you?

Or, to bring it back to this PR, would using that option when debugging obviate the need for the extra (potentially dangerous) debugging-related info in the exception?

@elohanlon
Copy link
Contributor Author

Unfortunately, replacing invalid bytes with � isn't an option for me, since our library workflows have been built around catalogers creating certain records in Voyager with the specific expectation that they'll be synced to an external app. The data needs to be in two places, but we want to avoid redundant cataloging work, and the assumption is that whatever appears in one will appear exactly the same way in the other. These syncs are daily and automatic too, so if we silently replaced bad characters then it would be our public users who would discover the bad-looking string errors rather than us.

In the past, we've asked people not to enter curly quotes into Voyager because it's a known issue, and we've had to manually correct them on a case by case basis (usually when someone accidentally includes them via copy and paste). That gets around the problem because these are records that we control, but I was hoping to figure out why it was an issue. Maybe we actually should be able to handle curly quotes? Though if this character isn't part of MARC8, then maybe not? If we one day find ourselves processing records that we can't control (e.g. from an automated OCLC sync to Voyager), then we'll be in more trouble.

But yes, going back to the pull request, how would you feel about something like this as a safer alternative to what I originally proposed?
raise Encoding::InvalidByteSequenceError.new("MARC8, input byte offset #{pos}, code set: 0x#{code_set.to_s(16)}, code point: 0x#{code_point.to_s(16)}. Problematic string (with ? replacements): #{transcode(marc8_string, :invalid => :replace, :replace => "?")}")

Now it outputs the full field text that caused the problem, doesn't risk chopping off part of a multibyte string by grabbing +-20 characters, and uses the :replace option to safely display what it can. Here's some sample output for the use case we've been working with:

MARC8, input byte offset 112, code set: 0x45, code point: 0xbf. Problematic string (with ? replacements): aThis collection is arranged in six series and several subseries: Series I. The Tibet Mirror (Yul phyogs so so?i̐̐ gsar ?g̐̐yur me long); Series II. Archival Materials from the Tibet Mirror Press Editorial Offices (Kalimpong, India); Series III. Correspondence; Series IV. Audiovisual Materials; Series V. Printed Materials; Series VI. Miscell

Thanks again for all of your help with this!

@jrochkind
Copy link
Member

I guess I'm wondering if as a debugging tool, you could temporarily set the encoder to replace with ?, when you run into the exception. As an alternative to having the exception include more info, which I'm not sure if it could be dangerous or misleading. When you see the exception, run the thing again with replace: :invalid activated to look at where the bad bytes were?

But your updated idea to replace with ? replacement char does seem less dangerous, good thought.

@jrochkind
Copy link
Member

Hmm, less dangerous. The string you are starting with is still Marc8 though, which no terminal can display. Ah, is that the transcode in there, you are converting it from Marc8 to UTF8, and replacing bad bytes with "?", in order to include it in the error message?

I guess that maybe seems okay. I'm not sure if the Problematic string (with ? replacements) lengthy label is necessary, but that's a side issue.

Can you write some tests that verify this behavior? Including test data that includes valid non-ascii Marc8 near the invalid byte? (I'm not sure what Marc8's test infrastructure is like, it's probably kind of old and crufty, but there are tests! Haven't looked at them in a while).

@elohanlon elohanlon force-pushed the easier-InvalidByteSequenceError-troubleshooting branch from f6b57df to d699ded Compare May 7, 2019 00:22
@elohanlon
Copy link
Contributor Author

That's right about using transcode to replace bad bytes. I just updated my PR to include a test for the message, replaced Problematic string (with ? replacements) with Value: to make it more concise, and swapped "?" as the replacement character with "<?>" because "?" alone is likely to appear in other parts of the text.

It was interesting to try out test-unit! I've always been an rspec user. Instead of adding another test, I augmented an existing test (test_bad_byte). Is that okay? Or would you want a separate test instead? I was trying to match the style of the other tests, which seemed to group related assertions together.

Is there anything else I should be testing, or that I should change?

@elohanlon elohanlon force-pushed the easier-InvalidByteSequenceError-troubleshooting branch from d699ded to 02cb4d9 Compare May 7, 2019 00:37
@elohanlon
Copy link
Contributor Author

Actually, I just added a second test (test_multiple_bad_byte_error_message) to make sure it properly displays an error message when there are multiple bad bytes in a string, even though it only reports the location of the first bad byte.

@elohanlon elohanlon force-pushed the easier-InvalidByteSequenceError-troubleshooting branch from 02cb4d9 to 0b779e3 Compare May 7, 2019 00:41
@jrochkind
Copy link
Member

Yeah, I think re-use test or new test, whatever is most convenient. (Personally, I've become a fan of rspec, use of test-unit here is as much legacy as intentional at this point probably).

I like the shorter label, if you think that'll work.

I am feeling much more comfortable with where this is ending up, thanks for being willing to iterate on it!

What do you think about using the actual unicode replacement char "�" instead of "?" or "<?>"? On the one hand sticking to ascii does seem safer... but we're already potentially putting non-ascii UTF8 in there, if any of the surrounding characters have non-ascii MARC8 that is successfully translated to UTF8 in em, no? So... I'm not sure?

@elohanlon
Copy link
Contributor Author

Shorter label works for me, and glad to iterate! Your suggestions and help moved this in a much better direction.

I like the idea of using "�" instead. I went for "<?>" for exactly the reason you said, thinking it might be safer, but yeah, since we're working with unicode anyway I'd be comfortable with �. It's easier to spot in a block of text too. I'll update that now.

@elohanlon
Copy link
Contributor Author

Code and tests updated!

@adjam
Copy link
Collaborator

adjam commented May 29, 2019

@elohanlon @jrochkind sorry, day job has intervened, I've just now gotten a bit of time to look at this. Thanks for the contribution!

High level impressions:

Any additional info in the error message that helps point at the section of the input file that's causing problems is a good thing, even if it's forced to munge the input a bit in order to display it, because the point is the parser doesn't know what to do with the input.

I guess that the hex-encoded output of the original bytes passed in the to the method that's throwing the error would be about as helpful as possible, but that's a long error message. Maybe if we had a verbose mode, the rescue block could construct that or log it itself, but that seems more like part of a larger campaign.

The tests get at the right thing, but if I'm being pedantic, it's one thing to detect that bad bytes in will raise the error, and another to test that the error message in that case has a useful format, so I recommend breaking the latter out into a separate test method. While this seems to violate DRY, these are IMO two different contexts.

Also, testing the exact format of the error message seems a bit brittle, but I'm not sure how to avoid that, so that's a very minor nitpick. The brittleness is mitigated a bit by breaking that check into its own test.

@jrochkind
Copy link
Member

@adjam Thanks for review/comments! I would be overjoyed to turn this over to you. Can you merge this when satisfied; if not currently satisfied after whatever kind of additional collaboration you need with @elohanlon ?

And I guess we should maybe do a ruby-marc release once merged?

@elohanlon
Copy link
Contributor Author

Yes, thank you, @adjam! I'll update the tests in an additional commit.

By the way, @jrochkind, I wanted to tell you some good news about the resolution to my original app problem. My app needs to retrieve Voyager MARC records via two methods: (1) Z39.50 for query-based searches and (2) direct Voyager Oracle database queries for known record IDs, because our Z39.50 search doesn't support direct record id lookups. It turns out that I'm only able to get MARC-8-encoded records from Z39.50 and UTF-8-encoded ones from Oracle (though I need to connect to Oracle with an encoding of AMERICAN_AMERICA.US7ASCII in order to get well-formed UTF-8). Ah, the challenges of character encoding. But I'm pleased to say that things are working now. Thanks very much for your attention to this PR and your troubleshooting help.

@elohanlon
Copy link
Contributor Author

elohanlon commented Jun 3, 2019

@adjam I just updated the tests in a new commit. Does that look good? And is there anything else you'd like me to change?

@adjam
Copy link
Collaborator

adjam commented Jun 3, 2019

@elohanlon thanks! This is good to merge, but I have a front-burner project that takes precedence. I hope to have time to make a 1.0.4 release tomorrow, but feel free to get annoyed if it's not done by Thursday morning North American Eastern time.

@elohanlon
Copy link
Contributor Author

@adjam No rush! Thanks!

@adjam adjam merged commit b50b82a into ruby-marc:master Jun 28, 2019
adjam pushed a commit that referenced this pull request Jun 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants