Home
Travis Pinney edited this page Sep 15, 2013 · 5 revisions
FunnelCloud is a simple tool for ingesting bulk data off the web.
It uses subcommands for more specialized operations.
The beam subcommand "beams" data from a website directly to an HDFS partition.
For example, to download a subset of the "Million Song Dataset" directly into HDFS:
fcl beam http://static.echonest.com/millionsongsubset_full.tar.gz millionsongsubset_full.tar.gz
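How fcl beam works internally is not described on this page. As a rough sketch of the same idea using only standard tools, the download can be piped straight into HDFS so that nothing is staged on local disk. The helper name stream_to_hdfs and the DRY_RUN switch are hypothetical and are not part of FunnelCloud; the sketch assumes curl and the hadoop CLI are on the PATH.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of FunnelCloud: stream a URL into HDFS
# without writing the file to local disk first.
# Assumes curl and the hadoop CLI are installed and on PATH.
stream_to_hdfs() {
    url=$1
    dest=$2
    if [ -n "${DRY_RUN:-}" ]; then
        # Print the pipeline instead of running it.
        printf 'curl -fsSL %s | hadoop fs -put - %s\n' "$url" "$dest"
    else
        # "hadoop fs -put -" reads the data to store from standard input.
        curl -fsSL "$url" | hadoop fs -put - "$dest"
    fi
}
```

Calling stream_to_hdfs with the URL and file name from the beam example above would then be roughly comparable to the fcl beam invocation; setting DRY_RUN=1 beforehand only prints the pipeline.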
The ingest subcommand converts more complex formats on the fly into formats better suited to processing in Hadoop.
fcl ingest osm http://planet.osm.org/pbf/planet-130911.osm.pbf planet-130911.osm.pbf.seq
This converts the OSM planet file on the fly into a sequence file that uses OSMData protobufs as its values.
To ingest a Wikipedia dump, first download and decompress the multistream index, then pass it to fcl together with the dump URL:
curl -O http://dumps.wikimedia.org/enwiki/20130904/enwiki-20130904-pages-articles-multistream-index.txt.bz2
bunzip2 enwiki-20130904-pages-articles-multistream-index.txt.bz2
fcl ingest wikipedia enwiki-20130904-pages-articles-multistream-index.txt http://dumps.wikimedia.org/enwiki/20130904/enwiki-20130904-pages-articles-multistream.xml.bz2 enwiki-20130904-pages-articles-multistream.xml.bz2.seq
Note that the index is required in order to break up the bz2 file when ingesting.
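To see why the index helps, it can be inspected directly. This sketch assumes the usual Wikimedia multistream index layout, in which each line has the form offset:page_id:title and the offset is the byte position where a compressed stream begins in the .xml.bz2 file. The helper name stream_offsets is hypothetical, and how FunnelCloud itself consumes the index is not shown here.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of FunnelCloud: list the distinct byte
# offsets recorded in a multistream index file.
# Assumes each index line looks like "offset:page_id:title".
stream_offsets() {
    # Field 1 is the offset; many pages share one offset, so deduplicate.
    cut -d: -f1 "$1" | sort -n | uniq
}
```

Running stream_offsets on the decompressed index file lists the points at which the dump can be split into independently decompressible pieces.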