Travis Pinney edited this page Sep 15, 2013 · 5 revisions

FunnelCloud

FunnelCloud is a simple tool for ingesting bulk data off the web.

It uses subcommands for more specialized operations.

Beam

Beam is a way to "beam" data from a website directly to an HDFS partition.

Usage

Download a subset of the "Million Song Dataset" directly into HDFS:

fcl beam http://static.echonest.com/millionsongsubset_full.tar.gz millionsongsubset_full.tar.gz 
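Conceptually, beaming resembles piping a download straight into HDFS without staging it on local disk. The sketch below is an assumption about the idea, not FunnelCloud's implementation: `beam` and `BEAM_SINK` are hypothetical names, and the default sink relies on `hdfs dfs -put -` reading the file body from standard input.

```shell
# Hypothetical helper: stream a URL into a sink command with no local copy.
# BEAM_SINK is an illustrative override; by default the data goes to HDFS
# through `hdfs dfs -put -`, which takes the file contents from stdin.
beam() {
  # $1 = source URL, $2 = destination path
  curl -fsSL "$1" | ${BEAM_SINK:-hdfs dfs -put -} "$2"
}
```

With a Hadoop client on the PATH, `beam http://static.echonest.com/millionsongsubset_full.tar.gz millionsongsubset_full.tar.gz` would mirror the `fcl beam` call above under these assumptions.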

Ingest

The ingest subcommand converts more complex data types on the fly into formats better suited to processing in Hadoop.

OSM Usage

fcl ingest osm http://planet.osm.org/pbf/planet-130911.osm.pbf planet-130911.osm.pbf.seq

This converts the OSM planet file on the fly to a sequence file whose values are OSMData protobufs.

Wikipedia Usage

curl -O http://dumps.wikimedia.org/enwiki/20130904/enwiki-20130904-pages-articles-multistream-index.txt.bz2
bunzip2 enwiki-20130904-pages-articles-multistream-index.txt.bz2
fcl ingest wikipedia enwiki-20130904-pages-articles-multistream-index.txt http://dumps.wikimedia.org/enwiki/20130904/enwiki-20130904-pages-articles-multistream.xml.bz2 enwiki-20130904-pages-articles-multistream.xml.bz2.seq

Note that the index is required in order to break up the bz2 file when ingesting.
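To illustrate why the index matters: each line of a multistream index has the form offset:page_id:title, and every distinct offset marks the start of an independently compressed bz2 stream. The sketch below uses made-up sample lines (not real dump data) to list those stream boundaries. How FunnelCloud itself consumes the index is not shown here and may differ.

```shell
# Made-up sample lines in the multistream index layout: offset:page_id:title
index_file="$(mktemp)"
cat > "$index_file" <<'EOF'
600:10:AccessibleComputing
600:12:Anarchism
245000:25:Autism
245000:39:Albedo
512000:290:A
EOF

# The first field is the byte offset of the bz2 stream holding that page.
# Distinct offsets are therefore the points where the archive can be split.
offsets="$(cut -d: -f1 "$index_file" | sort -n | uniq)"
echo "$offsets"
rm -f "$index_file"
```

The bytes between two consecutive offsets form a complete bz2 stream that can be decompressed on its own, which is what makes chunked ingestion of the large dump possible.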

GDELT Usage
