Tweet with broken encoding causes issues with TextAsHtml and serializing as XML complains about #16

Closed
steviemcg opened this issue Aug 30, 2015 · 5 comments


@steviemcg

A bug in my application surfaced when pulling in this tweet: https://twitter.com/BZTHEEZE/status/637228279825608706 with Emoji and what looks like a complete gibberish character.

I'm not sure whether anything can be done about this, but I'll describe the symptoms and then say what I think the issue may be. My experience with Unicode encodings isn't good enough to be sure.

This issue also appears in Tweetsharp 2.3.1.2-unofficial

First, TextAsHtml ends with the link https://t.co/W2TeREt0Hdt0Hd - the extra t0Hd at the end shouldn't be there.

Finally, when I include this string in an object which gets serialized to XML (an RSS feed), it complains that a high surrogate character must always be paired with a low surrogate character.

I would imagine the slightly skewed link is because the offsets in the entities returned by Twitter don't match the actual string. It appears that the source of the Tweet itself is damaged, as if it's sending UTF-16 bytes over the wire and declaring them as UTF-8 or something.

Serializing to JSON, including the string in a website, and returning it from my own API all work fine; the issue only surfaces for me when I try to serialize to XML. Therefore, before I serialize the object, I run the text through these functions:

private static readonly Encoding Utf8Encoder = Encoding.GetEncoding(
    "UTF-8",
    new EncoderReplacementFallback(string.Empty),
    new DecoderExceptionFallback());

private static string RepairStringEncoding(string str)
{
    return Utf8Encoder.GetString(Utf8Encoder.GetBytes(str));
}

...which serializes without exceptions albeit with the extra characters in the link (t0Hd).
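As an illustration of what the round trip above appears to achieve, here is a minimal sketch that drops unpaired surrogates directly, without going through a byte array. The class and method names (SurrogateCleaner, RemoveUnpairedSurrogates) are hypothetical and not part of TweetMoaSharp; this assumes the only problem characters are lone high or low surrogates.

```csharp
using System.Text;

public static class SurrogateCleaner
{
    // Hypothetical helper: copies the input, keeping well-formed surrogate
    // pairs and ordinary characters, and skipping any lone surrogate.
    public static string RemoveUnpairedSurrogates(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;

        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (char.IsHighSurrogate(c)
                && i + 1 < input.Length
                && char.IsLowSurrogate(input[i + 1]))
            {
                // Valid pair: keep both halves and step over the low half.
                sb.Append(c).Append(input[i + 1]);
                i++;
            }
            else if (!char.IsSurrogate(c))
            {
                // Ordinary BMP character.
                sb.Append(c);
            }
            // Otherwise: a lone surrogate, which is dropped.
        }
        return sb.ToString();
    }
}
```

Like the encoder-based version, this only makes the string safe to serialize; it does nothing about the skewed link offsets.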

Thanks for reading. I'm curious to know your thoughts and whether you think such an issue should be addressed within TweetMoaSharp itself or by the applications that use it. I'd also like to know whether you think the Tweet is "damaged" or whether this is perhaps a bug in their API.

Thanks as always,
Steve

@Yortw
Owner

Yortw commented Aug 30, 2015

Hi,

Thanks for the problem report :)

I've spent some time looking and I think the problem is the TextAsHtml function. I believe the Twitter entity indices are true 'character' (code point) positions, but the string/StringBuilder methods used to parse and update the original string index by UTF-16 code units (chars), so the two disagree whenever a character needs a surrogate pair. I've seen/suspected this before but never found a solution.

In your sample tweet the first url entity has a start index of 106, but the string functions being used only return the correct text if position 110 is used. The additional 4 positions seem to be because each of the emoji/smiley characters actually occupies 2 chars (a surrogate pair) but is being counted as 1. I think this can be solved using System.Globalization.StringInfo to measure actual characters, but I've run out of time tonight to try and produce a solution.
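To make the offset mismatch concrete, here is a rough sketch of translating an entity index into a .NET string index. It assumes the entity indices count Unicode code points, which is my reading of the symptoms rather than something confirmed here, and it is not necessarily how the eventual fix works. The names EntityIndexHelper and CodePointIndexToUtf16Index are made up for the example.

```csharp
using System;

public static class EntityIndexHelper
{
    // Hypothetical helper: converts an index counted in Unicode code points
    // into an index counted in UTF-16 code units, which is what
    // System.String and StringBuilder use.
    public static int CodePointIndexToUtf16Index(string text, int codePointIndex)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (codePointIndex < 0) throw new ArgumentOutOfRangeException(nameof(codePointIndex));

        int utf16Index = 0;
        for (int seen = 0; seen < codePointIndex && utf16Index < text.Length; seen++)
        {
            // A well-formed surrogate pair is one code point but two chars.
            utf16Index += char.IsSurrogatePair(text, utf16Index) ? 2 : 1;
        }
        return utf16Index;
    }
}
```

If four surrogate-pair characters precede the link, a start index of 106 maps to 110 under this scheme, which matches what I saw. Unpaired surrogates are counted as one position each, which is a guess at how malformed text should be handled.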

@Yortw
Owner

Yortw commented Aug 31, 2015

Hi,

I think I may have fixed the problem. I've pushed a new build to NuGet; could you please update your solution and try again to see if this helps?

Thanks.

@Yortw
Owner

Yortw commented Sep 4, 2015

Hi,

Any chance you've tried the update? Can I close this issue?

Thanks.

@steviemcg
Author

Thanks for your speedy response! This indeed fixes the issue so it can be closed. Thanks again.

@Yortw
Owner

Yortw commented Sep 4, 2015

👍 Thank you!
