Internet Archive's crawl of tweet stream: compare against "official" crawl #6

lintool · 2016-06-07T14:18:29Z

The Internet Archive appears to have a crawl of tweets around the 2015 evaluation period:
https://archive.org/details/archiveteam-twitter-stream-2015-07

The list of tweetids from last year's crawl is here:
https://cs.uwaterloo.ca/~jimmylin/TREC2015-tweetids.txt.bz2

This is the "official" crawl in the sense that it was the one used for constructing the pools, etc.

Note that the file is 200 MB; it contains 43,956,390 tweetids, sorted. This is the union of two separate crawls using the tools in twittertools (on top of twitter4j).

Can someone compare the Internet Archive crawl with the official tweetids? If overlap is good, then we have a way for getting training data to people who didn't participate last year... :)
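For anyone attempting the comparison, here is a minimal sketch of one way to measure overlap between two sorted ID lists without loading both into memory. It assumes each file holds one numeric tweetid per line, sorted ascending with no duplicates; the helper names `read_ids` and `overlap_counts` and the second file name in the usage comment are hypothetical, not part of twittertools.

```python
import bz2


def read_ids(path):
    """Yield tweet IDs as ints from a text file (plain or .bz2), one ID per line."""
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rt") as f:
        for line in f:
            line = line.strip()
            if line:
                yield int(line)


def overlap_counts(a, b):
    """Merge two ascending, duplicate-free iterables of IDs.

    Returns (only_in_a, in_both, only_in_b).
    """
    only_a = both = only_b = 0
    it_a, it_b = iter(a), iter(b)
    x, y = next(it_a, None), next(it_b, None)
    while x is not None and y is not None:
        if x == y:
            both += 1
            x, y = next(it_a, None), next(it_b, None)
        elif x < y:
            only_a += 1
            x = next(it_a, None)
        else:
            only_b += 1
            y = next(it_b, None)
    # Drain whichever side still has IDs left.
    while x is not None:
        only_a += 1
        x = next(it_a, None)
    while y is not None:
        only_b += 1
        y = next(it_b, None)
    return only_a, both, only_b


# Example usage (the second file name is a placeholder):
# print(overlap_counts(read_ids("TREC2015-tweetids.txt.bz2"),
#                      read_ids("internet-archive-tweetids.txt")))
```

The merge only works if both inputs are sorted numerically; if an ID list extracted from the Internet Archive dump is unsorted, it would need sorting first.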

LuchenTan · 2016-06-08T15:36:11Z

Here is the list of tweets from my crawl last year. I used Python's tweepy. Because of GitHub's file size limit, I split the original list into 4 files. In total there are 39,623,506 unique tweetids, sorted.

https://github.com/LuchenTan/TREC2015-MB-Tweets.git

igorbrigadir · 2016-06-09T16:21:26Z

FWIW: Downloading 43,956,390 Tweets "officially" from the Twitter API will take just under 20 days (if you make use of the statuses/lookup endpoint with both App & User tokens). If you have an app with 3 or 4 authenticated users (let's say your co-authors), you can use those extra tokens to spread out the calls and do it in about a week.
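A rough sketch of the arithmetic behind that estimate. The numbers here are assumptions rather than anything stated above: 100 IDs per statuses/lookup request, and per-15-minute limits of 180 requests for a user token and 60 for an app token. Check the current API documentation before relying on them; `hydration_days` is a hypothetical helper.

```python
import math


def hydration_days(n_ids, user_tokens=1, app_tokens=1,
                   ids_per_request=100,   # assumed statuses/lookup batch size
                   user_limit=180,        # assumed requests per window, per user token
                   app_limit=60,          # assumed requests per window, per app token
                   window_minutes=15):
    """Estimate how many days it takes to look up n_ids tweets under rate limits."""
    requests_needed = math.ceil(n_ids / ids_per_request)
    requests_per_window = user_tokens * user_limit + app_tokens * app_limit
    windows = math.ceil(requests_needed / requests_per_window)
    return windows * window_minutes / (60 * 24)


print(hydration_days(43_956_390))                  # one user token plus the app token
print(hydration_days(43_956_390, user_tokens=4))   # four user tokens plus the app token
```

Under these assumed limits the first call comes out a little over 19 days and the second a little under 6, which is in the same range as the figures quoted above.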

lintool · 2016-06-09T17:44:08Z

twarc might be useful if you want to download the tweets using the official Twitter API:
https://github.com/edsu/twarc

Downloading is called "hydrating".

lukuang · 2016-06-15T17:49:59Z

I have compared Jimmy's id list with the tweets from the Internet Archive. Jimmy's list seems to contain tweets that are not within the evaluation period. For example, the first tweet id in the list, "622918845364219905", was published at "Sun Jul 19 00:00:01 +0000 2015". Therefore, I looked at the tweets from the Internet Archive and used the tweet ids of the first and last tweets posted within the evaluation period as boundaries to filter Jimmy's list. The number of tweets left in Jimmy's list is 40,260,362. The difference between this new list and the tweet archive is not significant (about 0.2%). It is worth noting that a small number of relevant tweets (132 out of 6187, about 2.1%) are not in the tweet archive.
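The boundary filtering described above could be sketched as follows. This is not the script that was actually used; it assumes the ID list is already loaded as an ascending list of ints, and it relies on tweet IDs increasing with posting time, so an ID range corresponds to a time range. `filter_by_id_range` is a hypothetical helper.

```python
from bisect import bisect_left, bisect_right


def filter_by_id_range(sorted_ids, first_id, last_id):
    """Return the IDs in sorted_ids that fall within [first_id, last_id], inclusive.

    sorted_ids must be in ascending order; first_id and last_id are the IDs of
    the first and last tweets inside the period of interest.
    """
    lo = bisect_left(sorted_ids, first_id)    # index of first ID >= first_id
    hi = bisect_right(sorted_ids, last_id)    # index just past last ID <= last_id
    return sorted_ids[lo:hi]


# Toy example with small integers standing in for tweet IDs.
print(filter_by_id_range([1, 3, 5, 7, 9], 3, 7))
```

Binary search keeps the filter cheap even on a list of tens of millions of IDs, since only the two boundary positions need to be located.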

As suggested by Jimmy, it seems that you can use the Internet Archive tweets as training data.

lintool · 2016-06-17T19:54:28Z

I'd like to draw everyone's attention to this: it means that people who did not participate in the TREC Microblog track last year can still get the tweet data (e.g., for training).
