Corrupted messages are lost #65

toffaletti · 2013-05-21T23:22:43Z

My firehose client I am trying to replace with hbc has a feature that saves messages which aren't valid UTF-8 to files. This has proven useful because it has found and helped fix bugs in the Firehose. Right now the utf8 decoding happens roughly with this stack:

com.twitter.hbc.common.DelimitedStreamReader.readLine
...
com.twitter.hbc.core.processor.StringDelimitedProcessor.processNextMessage
com.twitter.hbc.httpclient.ClientBase.processConnectionData

readLine() calls new String(...), constructor which does this with invalid utf8:

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string.

I believe this means the original data will be lost and worse the problem will likely be hidden. It would be nice if we could get at the invalid data to log it. One option would be returning byte[] instead of String. Another option would be to use a CharsetDecoder with CodingErrorAction.REPORT and perhaps add an event with the bad byte[]. The second option is a less invasive API change.

Thoughts?

kevinoliver · 2013-05-22T15:24:21Z

Bummer that we are delivering invalid UTF-8 data. I'd love to see the invalid data too so that we can fix things on our side.

Couple options here:

patch DelimitedStringReader so that you wrap the new String(...) in a try-catch block with logging, incrementing a stat and returning null in the case of an exception.
create a new HosebirdMessageProcessor (eg a ByteArrayMessageProcessor) that is parameterized on byte[] and use it in your ClientBuilder.
I'm not familiar with your 2nd option (CharsetDecoder changes), but if there is a good reason that is preferable to either of those 1st 2 options, please explain.

I think the simplest possible thing is patching DelimitedStringReader, but it also doesn't give much flexibility. I suspect its good enough for this though.

toffaletti · 2013-05-22T20:15:15Z

I was mistaken about the data in the firehose messages being invalid UTF-8. The data was valid UTF-8 just not valid JSON and the cause was traced back to multibyte UTF-8 mangling. This occurred back around March 12-14 2013. I communicated with Arne and it was fixed. To quote Arne:

this wound up being a bit related to the unencoded em-dash issue. Along the way, the street_address field was mangling multibyte UTF-8 characters. This basically manifested itself as \u2013 (em-dash) being delivered as \u0013 (byte 13, a control character). I imagine that the other characters you saw were mangled in a similar way.

In any case, I would like to cover both possibilities (invalid UTF-8 and invalid JSON).

I'm fairly new to Java so I might have overlooked something, but it is my understanding that the String constructors do not throw on invalid data. Instead they silently replace it. This is why I ended up at CharsetDecoder. I've written this JUnit test to demonstrate the problem:

    public void testUnicode() {
        final byte[] badBytes = {
                (byte)0xC3, (byte)0x2E
        };
        ByteBuffer uniBuf = ByteBuffer.wrap(badBytes);
        CharBuffer charBuf = null;
        Charset utf8 = Charset.forName("UTF-8");
        CharsetDecoder utf8Decoder = utf8.newDecoder();
        utf8Decoder.onMalformedInput(CodingErrorAction.REPORT);
        utf8Decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            charBuf = utf8Decoder.decode(uniBuf);
            String s = charBuf.toString();
            fail("should not decode");
        } catch (CharacterCodingException ignored) {}

        // this will silently replace the bad utf8 sequence with U+FFFD (replacement character)
        // no exception is thrown
        try {
            String z = new String(badBytes, utf8);
        } catch (Exception ignored) {
            fail("nothing is thrown");
        }
    }

I realize this is a bit pedantic and if you think it is out of the scope of hbc to check for encoding errors, I'd agree and just have my own HosebirdMessageProcessor.

kevinoliver · 2013-05-23T22:31:22Z

Ahhh didn't know that bit about the String constructor replacing bad data.

So, my take is that this was a bug on our end that we don't expect to regress again and I'd rather leave this bit simple in our implementation.

toffaletti · 2013-05-24T23:03:50Z

Sounds good to me.

toffaletti closed this as completed May 24, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Corrupted messages are lost #65

Corrupted messages are lost #65

toffaletti commented May 21, 2013

kevinoliver commented May 22, 2013

toffaletti commented May 22, 2013

kevinoliver commented May 23, 2013

toffaletti commented May 24, 2013

Corrupted messages are lost #65

Corrupted messages are lost #65

Comments

toffaletti commented May 21, 2013

kevinoliver commented May 22, 2013

toffaletti commented May 22, 2013

kevinoliver commented May 23, 2013

toffaletti commented May 24, 2013