Comply with Twitter ToS for storing Twitter Data #3329
This issue is twofold. First, I need to re-open #3073 and have those datasets exported again, only this time there must be an associated column for all Twitter data, and that column must include the Tweet ID (possibly also known as the Snowflake?). Second, we need to change the way Tweets are stored in Ushahidi because we are potentially in violation of Twitter's Developer Policy and, more specifically, our static archiving of Tweets can potentially put users at risk.
How Ushahidi Stores Twitter Data
When I set up a deployment to consume Tweets, they are displayed in the platform as static text. Moreover, the - bizarrely named - "conversation with author" pane in the data view mode displays a static list of every Tweet written by that author, including Tweets they have deleted from their account. The screenshot below can be viewed in the COMRADES deployment with this post (you must be logged in).
This is problematic because it puts us in possible violation of Twitter's Developer Policy, specifically section F.2: Be a Good Partner to Twitter, which sets forth limits on how Tweets can be shared. It specifically states that:
The, um... "bright side" is that:
So there is apparently some wiggle room here: if we create a non-API-enabled deployment with low traffic, we would likely be fine, but if we have a high-traffic, API-enabled deployment, we would likely violate these terms.
What privacy impacts does it have?
The privacy impacts here are twofold. In the most disastrous scenario, an activist (or anyone really) might report to a deployment via Twitter (or simply post a Tweet that the deployment collects unbeknownst to them), then realize that the Tweet puts them at some sort of risk and delete the Tweet from their account. The Tweet would still be archived in the deployment, thus putting the activist at risk!!
The knock-on effect this is having for COMRADES is that we are also sharing deployment data among consortium partners for research purposes. We are, of course, adhering to the strict security rules of the EU, and COMRADES specifically: we secure all data with PII and won't re-publish it or share it outside the consortium until it's been redacted. But it still raises some difficult questions if we want to do things like create a publicly available training dataset for future machine learning algorithms (something we do want to do). The normal process for this would be to eliminate the actual Tweet content from a data set and provide the Tweet ID instead, leaving the task of re-constituting the Tweet content to the next person to process the data set. There isn't a lot of information out there about this practice (called "re-hydrating") but this blog post does a pretty good job and is the best I've found. What this means is that any Tweets or accounts that have been deleted in the interim can't be "re-hydrated", thus preserving the Twitter user's right to privacy - in this case, the right to be forgotten.
Again, there is probably some room for interpretation here since I'm currently having this conversation in England (which adheres to the GDPR) at a university that has taken a very, very conservative interpretation of GDPR to heart. Rules for a deployer in Kenya or the U.S. might be very different. It's worth pointing out that Ushahidi also adheres to the GDPR.
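To make the "re-hydrating" idea concrete, here is a minimal sketch, not anything the platform does today. The field names (`tweet_id`, `author`, `text`) and the `lookup` callable are assumptions for illustration; `lookup` stands in for whatever Twitter API call the next researcher would use, and returns `None` for a Tweet that has been deleted.

```python
from typing import Callable, Dict, Iterable, List, Optional


def dehydrate(rows: Iterable[Dict[str, str]], id_key: str = "tweet_id") -> List[str]:
    """Keep only the Tweet ID from each row, dropping content and author fields."""
    return [str(row[id_key]) for row in rows if row.get(id_key)]


def rehydrate(
    tweet_ids: Iterable[str],
    lookup: Callable[[str], Optional[str]],
) -> List[Dict[str, str]]:
    """Rebuild rows from IDs; IDs whose lookup returns None are skipped."""
    hydrated = []
    for tweet_id in tweet_ids:
        text = lookup(tweet_id)
        if text is not None:
            hydrated.append({"tweet_id": tweet_id, "text": text})
    return hydrated


# Hypothetical archived rows, as a deployment might hold them internally.
archive = [
    {"tweet_id": "1001", "author": "alice", "text": "Road flooded near the bridge"},
    {"tweet_id": "1002", "author": "bob", "text": "Need water at the shelter"},
]

# The public data set contains IDs only.
public_ids = dehydrate(archive)

# Stand-in for Twitter: Tweet 1002 has since been deleted by its author.
still_live = {"1001": "Road flooded near the bridge"}
restored = rehydrate(public_ids, still_live.get)
```

In this sketch the deleted Tweet never reappears in `restored`, which is the privacy property the ID-only data set is meant to give us.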
I've spoken with @rowasc about this in the Platform Channel on Slack and she's confirmed that we do capture the Tweet IDs in the database (why they didn't make it into the downloads requested for #3073 is a mystery... looking at you @willdoran), so it's my assumption that fixing this issue would basically involve changing the display of Tweets from static text to Twitter cards (looks like @dalezak might have a related issue with #3012?). Tweet IDs also need to figure prominently in all deployment exports. My reading of all this is that a deployment administrator could download Tweet content as well as IDs for an internal-use dataset (since they aren't distributing it), but we would probably want to configure a permission-based setting that would remove the content for any "public" download/access of the data and only provide the Tweet IDs for posts that were culled from Twitter.
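A rough sketch of what that permission-based export could look like, assuming each post is a dictionary with a hypothetical `source` field marking where it came from and a `tweet_id` field for Twitter posts. None of these names come from the actual platform schema.

```python
from typing import Dict, List


def export_posts(
    posts: List[Dict[str, str]],
    include_tweet_content: bool,
) -> List[Dict[str, str]]:
    """Return export rows; Twitter posts are reduced to their ID unless permitted."""
    exported = []
    for post in posts:
        if post.get("source") == "twitter" and not include_tweet_content:
            # Public export: keep only the source marker and the Tweet ID.
            exported.append({"source": "twitter", "tweet_id": post["tweet_id"]})
        else:
            # Internal export, or a non-Twitter post: copy the row unchanged.
            exported.append(dict(post))
    return exported


# Hypothetical posts from a deployment.
posts = [
    {"source": "twitter", "tweet_id": "1001", "author": "alice", "content": "Bridge is out"},
    {"source": "sms", "author": "+254700000000", "content": "Clinic open today"},
]

admin_rows = export_posts(posts, include_tweet_content=True)
public_rows = export_posts(posts, include_tweet_content=False)
```

The `include_tweet_content` flag would be set from whatever role or permission check the deployment already uses; how that maps onto existing roles is an open question.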
Ok. That's all I have on this for now. Happy to add more as needed.
referenced this issue
Oct 12, 2018
@Shadrock To answer in reverse order: we used to have Tweet IDs in the CSV export, but they were apparently confusing or unwanted so they were dropped. This can be added back.
Changing from displaying/storing the message content to simply using the Tweet ID is possible; however, we'll have to change the way in which the data is used. At the moment, the message becomes an entry in a text field, which becomes the description portion of a post. We could change this to instead have a "Tweet field" which uses the Tweet ID and displays the Tweet as a Tweet.
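As a sketch of the "Tweet field" idea, assuming the field stores nothing but the ID and the text is fetched at display time: `TweetField`, `render_tweet_field`, and the `fetch_text` callable are all hypothetical names, and `fetch_text` stands in for a real Twitter lookup or embed call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class TweetField:
    """A post field that stores only the Tweet ID, never the Tweet text."""
    tweet_id: str


def render_tweet_field(
    field: TweetField,
    fetch_text: Callable[[str], Optional[str]],
    unavailable: str = "[Tweet unavailable]",
) -> str:
    """Resolve the Tweet at display time; fall back if it no longer exists."""
    text = fetch_text(field.tweet_id)
    return unavailable if text is None else text


# Stand-in for Twitter: only Tweet 42 still exists.
live_tweets = {"42": "Shelter on Main St has space"}

shown = render_tweet_field(TweetField("42"), live_tweets.get)
gone = render_tweet_field(TweetField("43"), live_tweets.get)
```

Because nothing is stored beyond the ID, a deleted Tweet simply stops rendering. The cost is a lookup every time a post is displayed.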
There are three other pieces that I'm really concerned by
This means we would look up the tweet in real time when they select a post with a tweet in it in the platform (or embed the tweet with their widgets), right?
referenced this issue
Oct 13, 2018
I'm not really clear if we can use Tweet locations or not.
It seems like if we store the location with the Tweet ID maybe we can use it? But we probably shouldn't allow it to be exported
referenced this issue
Oct 17, 2018
Are the changes we're proposing here explicitly required by Twitter? Or are they changes we're imposing on ourselves to be a good partner?
I'm just trying to better understand the problem we are trying to solve here.
Is the issue that deleted tweets are still being displayed on our platform? If so, then this can probably be handled server side with a background job that purges deleted posts? Or possibly by changing the post text to something like
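A minimal sketch of such a background job, under stated assumptions: posts are dictionaries with hypothetical `tweet_id` and `content` fields, `tweet_exists` is an injected callable standing in for a real Twitter lookup, and the placeholder string is purely illustrative, not proposed wording.

```python
from typing import Callable, Dict, List

# Illustrative placeholder only; the real wording is still to be decided.
PLACEHOLDER = "[Tweet deleted by author]"


def purge_deleted_tweets(
    posts: List[Dict[str, str]],
    tweet_exists: Callable[[str], bool],
    placeholder: str = PLACEHOLDER,
) -> int:
    """Blank the stored text of posts whose Tweet is gone; return how many changed."""
    purged = 0
    for post in posts:
        tweet_id = post.get("tweet_id")
        if not tweet_id or post.get("content") == placeholder:
            continue  # not a Tweet, or already purged on an earlier run
        if not tweet_exists(tweet_id):
            post["content"] = placeholder
            purged += 1
    return purged


# Hypothetical stored posts; Tweet 1002 has been deleted on Twitter.
stored = [
    {"tweet_id": "1001", "content": "Bridge is out"},
    {"tweet_id": "1002", "content": "I am at the protest"},
    {"content": "Report submitted by web form"},
]
live_ids = {"1001"}

first_run = purge_deleted_tweets(stored, lambda tweet_id: tweet_id in live_ids)
second_run = purge_deleted_tweets(stored, lambda tweet_id: tweet_id in live_ids)
```

Note that a job like this only narrows the window: the deleted Tweet stays visible until the next run, so how often it runs matters for the at-risk scenario described in the issue.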
Note, the issue #3012 is unrelated to this, that's more for improving how any content from Platform is shared on Twitter.
Has anyone contacted Twitter to get clarification on what needs to be done to comply with their ToS?
I'm worried we're heading down a technical rabbit hole when it might not be necessary to comply.