Internet Archive's crawl of tweet stream: compare against "official" crawl #6

Open
lintool opened this issue Jun 7, 2016 · 5 comments

@lintool
Member

lintool commented Jun 7, 2016

The Internet Archive appears to have a crawl of tweets around the 2015 evaluation period:
https://archive.org/details/archiveteam-twitter-stream-2015-07

The list of tweetids from last year's crawl is here:
https://cs.uwaterloo.ca/~jimmylin/TREC2015-tweetids.txt.bz2

This is the "official" crawl in the sense that it was the one used for constructing the pools, etc.

Note that the file is 200 MB; it contains 43,956,390 tweetids, sorted. This is the union of two separate crawls using the tools in twittertools (on top of twitter4j).

Can someone compare the Internet Archive crawl with the official tweetids? If overlap is good, then we have a way for getting training data to people who didn't participate last year... :)
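A minimal sketch of how the comparison requested above could be done, assuming both ID lists are plain text (optionally bz2-compressed) with one tweetid per line, sorted ascending and free of duplicates. The helper names `read_ids` and `overlap_counts` are hypothetical, not part of twittertools. Because both inputs are sorted, a merge-style pass avoids holding tens of millions of IDs in memory:

```python
import bz2
from typing import Iterable, Iterator, Tuple


def read_ids(path: str) -> Iterator[int]:
    """Yield one integer tweet ID per non-empty line of a plain or .bz2 file."""
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield int(line)


def overlap_counts(a: Iterable[int], b: Iterable[int]) -> Tuple[int, int, int]:
    """Return (only_in_a, in_both, only_in_b).

    Assumes both streams are ascending and contain no duplicates.
    """
    it_a, it_b = iter(a), iter(b)
    x, y = next(it_a, None), next(it_b, None)
    only_a = both = only_b = 0
    while x is not None and y is not None:
        if x == y:
            both += 1
            x, y = next(it_a, None), next(it_b, None)
        elif x < y:
            only_a += 1
            x = next(it_a, None)
        else:
            only_b += 1
            y = next(it_b, None)
    # Drain whichever stream still has IDs left.
    while x is not None:
        only_a += 1
        x = next(it_a, None)
    while y is not None:
        only_b += 1
        y = next(it_b, None)
    return only_a, both, only_b
```

Usage would look like `overlap_counts(read_ids("TREC2015-tweetids.txt.bz2"), read_ids("archive-ids.txt"))`, where `archive-ids.txt` is a hypothetical file of IDs extracted from the Internet Archive dump.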

@LuchenTan

Here is the list of tweets from my crawl last year. I used Python tweepy. Because of GitHub's file size limit, I split the original list into 4 files. In total there are 39,623,506 unique tweetids, sorted.

https://github.com/LuchenTan/TREC2015-MB-Tweets.git

@igorbrigadir

FWIW: Downloading 43,956,390 Tweets "officially" from the Twitter API will take just under 20 days (if you make use of the statuses/lookup endpoint with both App & User tokens). If you have an app with 3 or 4 authenticated users (let's say your co-authors), you can use those extra tokens to spread out the calls and do it in about a week.
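The arithmetic behind that estimate can be sketched as follows. The rate limits here are assumptions, not figures quoted in this thread: 100 IDs per statuses/lookup request, 15-minute windows, 180 requests per window per user token, and 60 per window for the app token. They were chosen because they reproduce the numbers above; check the current API documentation before relying on them. `days_to_hydrate` is a hypothetical helper name.

```python
IDS_PER_REQUEST = 100            # assumed batch size for statuses/lookup
WINDOW_MINUTES = 15              # assumed rate-limit window length
APP_REQUESTS_PER_WINDOW = 60     # assumed limit for the app token
USER_REQUESTS_PER_WINDOW = 180   # assumed limit per user token


def days_to_hydrate(total_ids: int, user_tokens: int, app_tokens: int = 1) -> float:
    """Estimate days needed to look up total_ids tweets at the assumed limits."""
    requests_per_window = (app_tokens * APP_REQUESTS_PER_WINDOW
                           + user_tokens * USER_REQUESTS_PER_WINDOW)
    windows_per_day = (24 * 60) // WINDOW_MINUTES
    ids_per_day = requests_per_window * IDS_PER_REQUEST * windows_per_day
    return total_ids / ids_per_day


for users in (1, 3, 4):
    print(users, "user token(s):", round(days_to_hydrate(43_956_390, users), 1), "days")
```

Under these assumptions, one user token plus the app token gives roughly 19 days, and three or four user tokens bring it down to between about six and eight days.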

@lintool
Member Author

lintool commented Jun 9, 2016

twarc might be useful if you want to download the tweets using the official Twitter API:
https://github.com/edsu/twarc

Downloading is called "hydrating".

@lukuang

lukuang commented Jun 15, 2016

I have compared Jimmy's id list with the tweets from the Internet Archive. Jimmy's list seems to contain tweets that are not within the evaluation period. For example, the first tweet id in the list, "622918845364219905", was published at "Sun Jul 19 00:00:01 +0000 2015". Therefore, I looked at the tweets from the Internet Archive and used the tweet ids of the first and last tweets posted within the evaluation period as boundaries to filter Jimmy's list. The number of tweets left in Jimmy's list after filtering is 40,260,362. The difference between this new list and the tweet archive is not significant (about 0.2%). It is worth noting that a small number of relevant tweets (132 out of 6187) are not in the tweet archive.

As suggested by Jimmy, it seems that you can use the Internet Archive tweets as training data.
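The boundary filtering described in this comment can be sketched as below. This is an illustration, not the script that was actually used: `filter_by_boundaries` and `missing_fraction` are hypothetical helper names, and the approach assumes the ID list is sorted ascending and that tweet IDs increase with posting time, which is what using first and last IDs as boundaries relies on.

```python
from bisect import bisect_left, bisect_right
from typing import Iterable, List, Sequence


def filter_by_boundaries(sorted_ids: Sequence[int], first_id: int, last_id: int) -> List[int]:
    """Keep the IDs in [first_id, last_id] from an ascending list."""
    lo = bisect_left(sorted_ids, first_id)
    hi = bisect_right(sorted_ids, last_id)
    return list(sorted_ids[lo:hi])


def missing_fraction(wanted: Sequence[int], available: Iterable[int]) -> float:
    """Fraction of wanted IDs that do not appear in available."""
    available_set = set(available)
    missing = sum(1 for tweet_id in wanted if tweet_id not in available_set)
    return missing / len(wanted)


# The relevant-tweet figure quoted above, expressed as a percentage.
print(f"{132 / 6187:.1%} of relevant tweets are missing from the archive")
```

With the numbers reported in the comment, 132 missing out of 6187 relevant tweets works out to roughly 2.1%.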

@lintool
Member Author

lintool commented Jun 17, 2016

I'd like to draw everyone's attention to this: it means that people who did not participate in the TREC Microblog track last year can still get the tweet data (e.g., for training).
