Start each of the following components in a screen of their own. It might be worthwhile naming the screens as singlenode, shell, decahose
xd-singlenode --hadoopDistro phd20
xd-shell --hadoopDistro phd20
stream create --name gnipdecahose --definition "http --port=9009 | hdfs --directory=/user/vatsan/decahose/ --partitionPath=dateFormat('yyyy/MM/dd')" --deploy
If you want to close a stream, use the following: stream destroy --name gnipdecahose
Now that we have set-up the Spring-XD stream with an HTTP source listening on localhost:9009 You can start the GNIP Twitter Decahose as follows (on a new screen):
python decahose.py localhost:9009
This will start ingesting the contents of the decahose stream on to the http source, which in turn will be written out to HDFS.
The contents of the stream will be written out to HDFS at /user/vatsan/decahose. The folder hierarchy will be of the form YYYY/MM/DD
There are two sql files, one python file and a cron job that set-up this pipeline.
- First set-up your .pgpass file under $HOME/.pgpass. It should be of the form: hostname:port:database:username:password
- The very first time you set-up your environment, you should run the sql file $DECAHOSE_HOME/sql/1_parsetweetjson.sql
- Next, set-up a cron jon to run daily. Run crontab -e and add a line like so:
0 0 * * * python $DECAHOSE_HOME/python/decahose_load_cron.py $DECAHOSE_HOME/sql/2_fetch_from_ext_table.sql hdm2 decahose vatsan $HOME/logs/decahose_cron.log
This basically runs the sql file $DECAHOSE_HOME/sql/2_fetch_from_ext_table.sql through the python wrapper $DECAHOSE_HOME/python/decahose_load_cron.py at midnight, every day.
Srivatsan Ramanujam sramanujam@gopivotal.com Ronert Obst robst@gopivotal.com