Twitter Crawler for Neo4j
A little script to import Twitter data into Neo4j. It uses Twitter4j and can be configured with multiple Twitter authentications for a speedup.
For the Neo4j schema, I followed the one given in the OSCON Twitter Graph.
It allows crawling data from a given user or hashtag up to a given max depth. Due to exponential growth, setting the search depth to anything higher than 1 will probably take more than 2 days.
The access tokens for the Twitter API must be configured; see below for the format. The script tries to use the available requests/buckets as efficiently as possible. Loading the friends and followers of a user is especially expensive and is therefore limited: if loading them would take more than an hour (this depends on the number of access tokens), the loading is skipped.
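The skip heuristic above can be sketched as a small budget calculation. This is an illustration, not the crawler's actual code; it assumes the followers/ids endpoint returns up to 5,000 ids per call and allows 15 calls per 15-minute window per access token — the real limits may differ.

```java
// Sketch of the "skip if it takes more than an hour" heuristic.
// Assumed rate limits (not taken from the crawler itself):
// 5,000 follower ids per call, 15 calls per 15-minute window per token.
public class RateBudget {
    static final int IDS_PER_CALL = 5000;
    static final int CALLS_PER_WINDOW = 15;
    static final int WINDOW_MINUTES = 15;

    /** Estimated minutes to page through all follower ids. */
    static double estimatedMinutes(long followerCount, int tokenCount) {
        long calls = (followerCount + IDS_PER_CALL - 1) / IDS_PER_CALL;
        double callsPerMinute = (double) CALLS_PER_WINDOW * tokenCount / WINDOW_MINUTES;
        return calls / callsPerMinute;
    }

    /** Skip loading followers when the estimate exceeds one hour. */
    static boolean shouldSkip(long followerCount, int tokenCount) {
        return estimatedMinutes(followerCount, tokenCount) > 60;
    }

    public static void main(String[] args) {
        // 1M followers, 1 token: 200 calls at 1 call/min -> over an hour
        System.out.println(shouldSkip(1_000_000, 1)); // true
        // 1M followers, 4 tokens: 200 calls at 4 calls/min -> 50 minutes
        System.out.println(shouldSkip(1_000_000, 4)); // false
    }
}
```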
As the limiting factor is the Twitter API rate limit, I decided not to make the script multi-threaded; this keeps things simple.
Twitter access tokens
The crawler needs access tokens for the Twitter API. These can be obtained from the Twitter dev console. As the (free) Twitter API is very restricted in the number of calls per time bucket, you can register an arbitrary number of access tokens in a file twitter.properties that must reside on the classpath.
    twitter.access.name=identifier for logging
    twitter.access.consumerKey=**
    twitter.access.consumerSecret=**
    twitter.access.accessToken=**
    twitter.access.accessTokenSecret=**
    twitter.access.name=identifier for logging
    twitter.access.consumerKey=**
    twitter.access.consumerSecret=**
    twitter.access.accessToken=**
    twitter.access.accessTokenSecret=**
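One subtlety of this format: java.util.Properties keeps only the last value for a repeated key, so repeated token blocks have to be parsed line by line. The following is an illustrative sketch of such a reader (not the crawler's actual loader); it assumes a new block starts at each twitter.access.name line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical line-by-line reader for twitter.properties with
// repeated key blocks; each twitter.access.name starts a new block.
public class TokenConfigReader {
    static List<Map<String, String>> read(BufferedReader in) throws IOException {
        List<Map<String, String>> blocks = new ArrayList<>();
        Map<String, String> current = null;
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            int eq = line.indexOf('=');
            if (eq < 0) continue;
            String key = line.substring(0, eq).trim();
            String value = line.substring(eq + 1).trim();
            if (key.equals("twitter.access.name")) {
                current = new LinkedHashMap<>();
                blocks.add(current);
            }
            if (current != null) current.put(key, value);
        }
        return blocks;
    }
}
```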
Calling the script
The behaviour can be controlled via command line arguments. They are evaluated/executed in the order given below, which allows chaining actions and letting the script run in the background.
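Order-preserving argument handling can be sketched as a single pass over the argument array. This is an assumed implementation, not the crawler's own parser; the flag names match the options described in this README.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of evaluating options in the order given:
// --depth changes state, action flags are executed as they appear.
public class ArgChain {
    static List<String> plan(String[] args) {
        List<String> actions = new ArrayList<>();
        int depth = 3; // default max depth
        for (String arg : args) {
            if (arg.startsWith("--depth=")) {
                depth = Integer.parseInt(arg.substring("--depth=".length()));
            } else if (arg.startsWith("--hash=")) {
                actions.add("query hashtag " + arg.substring("--hash=".length())
                        + " (depth " + depth + ")");
            } else if (arg.startsWith("--follow-user=")) {
                actions.add("follow user " + arg.substring("--follow-user=".length())
                        + " (depth " + depth + ")");
            }
        }
        return actions;
    }
}
```

Because evaluation is sequential, a --depth given before an action applies to that action, which is what makes chaining predictable.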
setting max depth
The default max depth for following hashtags or users is 3. This can be changed via --depth=n, where n is any positive number.
querying for a single hashtag
To query for a single hashtag, give the hashtag via the option --hash=graphdatabase. This queries for the hashtag and stores all objects referenced in a tweet (author, hashtags, mentioned users, links, ...). Users will only contain the Twitter id and screenName.
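The crawler gets these referenced entities from Twitter4j's Status object. As a self-contained illustration of what "objects referenced in a tweet" means, a regex pass over the tweet text recovers hashtags and mentioned screen names (a simplification — the real API also delivers URL entities and user ids):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative entity extraction from raw tweet text.
// The crawler itself uses the entities Twitter4j provides.
public class Entities {
    static final Pattern HASHTAG = Pattern.compile("#(\\w+)");
    static final Pattern MENTION = Pattern.compile("@(\\w+)");

    static List<String> extract(Pattern p, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) out.add(m.group(1));
        return out;
    }
}
```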
following a user
With --follow-user=neo4j the script will request the user with the given screen name and load that user's tweets. It will then
descend along the friends and followers of that user and load their tweets and followers until it reaches the given max depth.
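The depth-limited descent can be sketched as a breadth-first traversal. The sketch below runs over an in-memory adjacency map instead of live Twitter calls, so it only illustrates the traversal shape, not the API handling:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Depth-limited BFS over friends/followers; "loading tweets" is
// stubbed out as recording the visited user.
public class Descend {
    static List<String> crawl(String start, int maxDepth, Map<String, List<String>> graph) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String[]> queue = new ArrayDeque<>();
        queue.add(new String[]{start, "0"});
        seen.add(start);
        while (!queue.isEmpty()) {
            String[] entry = queue.poll();
            String user = entry[0];
            int depth = Integer.parseInt(entry[1]);
            visited.add(user); // in the crawler: load this user's tweets
            if (depth >= maxDepth) continue;
            for (String next : graph.getOrDefault(user, List.of())) {
                if (seen.add(next)) queue.add(new String[]{next, String.valueOf(depth + 1)});
            }
        }
        return visited;
    }
}
```

The `seen` set is what prevents revisiting users reachable along multiple friend/follower paths, which is where the exponential growth mentioned above would otherwise bite hardest.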
parse the top hashtags of a user
With --follow-user-hashtags=neo4j the script will determine the 10 most used hashtags of that user and load all tweets with those
hashtags. This of course needs some data in the database to be meaningful. It can be combined with --follow-user to load the hashtags of that user.
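Determining the most-used hashtags is a frequency count over tweets already stored. A minimal sketch, with the "database" replaced by a plain list of per-tweet tag lists:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative top-N hashtag selection over a user's stored tweets.
public class TopHashtags {
    static List<String> top(List<List<String>> tweetsTags, int n) {
        Map<String, Long> counts = tweetsTags.stream()
                .flatMap(List::stream)
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```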
follow a list of hashtags
To request (and store) related hashtags (ones often used together), provide a starting hashtag via --follow-hashtag=aHashTag. This will
retrieve and store hashtags up to the provided max depth.
Referenced tweets are sometimes returned with only the ID and the author. When the corresponding option is given, the program tries to load the missing data from Twitter. In about 10% of tweets this fails, for unknown reasons.
The links in tweets often use URL shorteners. To get the final URL they point to, and the site that contains it, the
program will try to resolve the URL and create (:URL) and (:Site) nodes. This can be disabled by providing the corresponding option.
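The URL/Site split above can be sketched with the standard library: resolving the final target means following HTTP redirects (which needs network access), while the (:Site) name is just the host part of the resolved URL. This is an assumed implementation, not the crawler's own resolver:

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Illustrative resolution of shortened links into (:URL)/(:Site) data.
public class Links {
    /** Follows redirects to the final URL (simplified; no loop guard). */
    static String resolve(String url) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setInstanceFollowRedirects(false);
        con.setRequestMethod("HEAD");
        String location = con.getHeaderField("Location");
        return location != null ? resolve(location) : url;
    }

    /** Host part of the resolved URL, usable as the (:Site) node's key. */
    static String site(String url) throws URISyntaxException {
        return new URI(url).getHost();
    }
}
```

A production resolver would also cap the redirect chain and handle shorteners that reject HEAD requests by falling back to GET.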