Tweet with broken encoding causes issues with TextAsHtml and serializing as XML complains about #16

steviemcg · 2015-08-30T09:02:00Z

A bug in my application surfaced when pulling in this tweet: https://twitter.com/BZTHEEZE/status/637228279825608706 with Emoji and what looks like a complete gibberish character.

I'm not sure if there's anything that can be done about this, but I'll describe the symptoms and then say what I think the issue may be. My experience with Unicode encodings isn't good enough to be sure.

This issue also appears in Tweetsharp 2.3.1.2-unofficial

First, TextAsHtml ends with the link https://t.co/W2TeREt0Hdt0Hd - the extra t0Hd at the end shouldn't be there.

Finally, when I include this string in an object which gets serialized to XML (an RSS feed), it complains that a high surrogate character must always be paired with a low surrogate character.

I would imagine the slightly skewed link is because the offsets in the entities returned by Twitter don't match the reality. It appears that the source of the Tweet itself is damaged, as if it's sending UTF-16 bytes over the wire and declaring them as UTF-8 or something.

Serializing to JSON, including the string in a website, and returning it from my own API all work fine; the issue only surfaces when I try to serialize to XML. Therefore, before I serialize the object, I run the text through these functions:

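// UTF-8 encoding that silently drops anything it cannot encode
// (such as an unpaired surrogate) instead of substituting U+FFFD.
private static readonly Encoding Utf8Encoder = Encoding.GetEncoding(
    "UTF-8",
    new EncoderReplacementFallback(string.Empty),
    new DecoderExceptionFallback());

// Round-trips the string through UTF-8 so unencodable characters are removed.
private static string RepairStringEncoding(string str)
{
    return Utf8Encoder.GetString(Utf8Encoder.GetBytes(str));
}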

...which serializes without exceptions albeit with the extra characters in the link (t0Hd).
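To illustrate how the two symptoms could be related, here is a minimal sketch. It assumes (this is not confirmed anywhere above) that the text is being cut in the middle of a surrogate pair, which would leave an unpaired high surrogate behind. `SurrogateDemo`, `HasLoneSurrogate` and `Repair` are hypothetical names used only for this example; `Repair` uses the same encoder configuration as the snippet above.

```csharp
using System;
using System.Text;

static class SurrogateDemo
{
    // Same configuration as the snippet above: unencodable input is replaced with "".
    private static readonly Encoding Utf8Encoder = Encoding.GetEncoding(
        "UTF-8",
        new EncoderReplacementFallback(string.Empty),
        new DecoderExceptionFallback());

    // Hypothetical helper: true if the string contains a surrogate
    // that is not part of a well-formed high/low pair.
    public static bool HasLoneSurrogate(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
            {
                i++; // skip the low half of a valid pair
                continue;
            }
            if (char.IsSurrogate(s[i]))
            {
                return true;
            }
        }
        return false;
    }

    // Round-trips through UTF-8, dropping whatever cannot be encoded.
    public static string Repair(string s)
    {
        return Utf8Encoder.GetString(Utf8Encoder.GetBytes(s));
    }
}
```

For example, `"a" + "\U0001F600".Substring(0, 1)` keeps only the first half of an emoji. `HasLoneSurrogate` reports `true` for that string, and `Repair` returns just `"a"`, which is why the round trip lets the XML serializer succeed even though the link itself is still cut at the wrong place.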

Thanks for reading. I'm curious to know your thoughts, and whether you think such an issue should be addressed within TweetMoaSharp itself or by the applications that use it. Do you think the Tweet is "damaged", or is this perhaps a bug in their API?

Thanks as always,
Steve

Yortw · 2015-08-30T12:07:43Z

Hi,

Thanks for the problem report :)

I've spent some time looking and I think the problem is the TextAsHtml function. I believe the Twitter entity indices are true 'character' positions, but the string/StringBuilder methods used to parse and update the original string really work on UTF-16 code units rather than whole characters (regardless of what the docs might say). I've seen/suspected this before but never found a solution.

In your sample tweet the first URL entity has a start index of 106, but the string functions being used only return the correct text if position 110 is used. The additional 4 positions seem to be because each of the emoji/smiley characters actually occupies 2 UTF-16 code units (a surrogate pair) but is being counted as 1. I think this can be solved using System.Globalization.StringInfo to measure actual characters, but I've run out of time tonight to try and produce a solution.
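As a rough sketch of the offset mismatch described above, the following helper converts an index counted in Unicode code points into the matching position in a C# string. It assumes the entity indices are code point counts, as suspected above. `EntityOffsets` and `CodePointIndexToUtf16Index` are hypothetical names for illustration, and this is not necessarily how the library handles it.

```csharp
using System;

static class EntityOffsets
{
    // Hypothetical helper: maps an index counted in Unicode code points
    // to the corresponding index in UTF-16 code units (C# char positions).
    public static int CodePointIndexToUtf16Index(string text, int codePointIndex)
    {
        int utf16Index = 0;
        for (int seen = 0; seen < codePointIndex && utf16Index < text.Length; seen++)
        {
            // A well-formed surrogate pair is one code point but two chars.
            utf16Index += char.IsSurrogatePair(text, utf16Index) ? 2 : 1;
        }
        return utf16Index;
    }
}
```

With `"\U0001F600\U0001F600 link"`, the word `link` starts at code point index 3, but the helper returns 5 because each emoji takes two chars, so `Substring(5)` yields `"link"`. Note that `System.Globalization.StringInfo` counts text elements, which can differ from code points when combining sequences are present, so whichever unit the entity indices really use would need to be checked against real tweets.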

Yortw · 2015-08-31T09:19:39Z

Hi,

I think I may have fixed the problem. I've pushed a new build to NuGet. Could you please update your solution and try again to see if this helps?

Thanks.

Yortw · 2015-09-04T10:11:01Z

Hi,

Any chance you've tried the update? Can I close this issue?

Thanks.

steviemcg · 2015-09-04T10:12:52Z

Thanks for your speedy response! This indeed fixes the issue so it can be closed. Thanks again.

Yortw · 2015-09-04T10:33:06Z

👍 Thank you!

steviemcg closed this as completed Sep 4, 2015
