robinkraft edited this page Oct 15, 2012 · 7 revisions

How to run FORMA

This stuff applies as of forma-clj 0.2.0 on the develop branch.

Booting the Cluster

The first step is to boot up the Hadoop cluster. Start up a repl, navigate over to forma.hadoop.cluster, compile the namespace, and call (create-cluster <node-count>). Tweak the cluster definition, if you want! We'll eventually want to rework this so that different clusters become the default for different jobs.

Once the cluster successfully loads, head to the AWS console, grab the DNS name for the jobtracker, start up a terminal window, and SSH in.

Uploading the Jar

Next step is actually getting the project jar up onto the cluster. In the root directory of forma-clj, run

$ lein clean && lein deps && lein uberjar

This will create a file called something like forma-0.2.0-SNAPSHOT-standalone.jar. Once this has finished, run (with the proper path and DNS address, of course):

$ scp ./the-uber.jar ec2.dns.address:/home/hadoop/forma-standalone.jar

Once that's done, we're ready to go! Head back to the terminal window with the SSH connection.

Hadoop User

On the cluster, we need to act as the Hadoop user to run this stuff. Do so by typing the following:

$ sudo su - hadoop


Preprocessing

This section describes how to process raw files into our datastore. Currently, this datastore is located in the redddata bucket on our S3 account. Replace redddata if that changes.

MODIS Preprocessing

Right now, a single command processes all MODIS HDF files at a given S3 path into our datastore. For January 2011, for example:

$ hadoop jar forma-standalone.jar forma.hadoop.jobs.modis "s3n://modisfiles/MOD13A2/" "s3n://pailbucket/rawstore/" "2011-01-01" :IDN :MYS

You can also provide a list of dates to match, like "{2000,2001,2002}", which processes three years of data; "{20}" will process data with dates starting with 20. This syntax is useful because the input file paths are serialized into the jobconf, and if there are too many paths you run out of space in the jobconf. Run a few years at a time and you'll be fine.
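That date syntax behaves like shell brace expansion, which makes it easy to try out locally. The file names below are made-up stand-ins for the S3 paths:

```shell
# Stand-in files named by date, like the MODIS HDF paths on S3.
mkdir -p glob-demo && cd glob-demo
touch 2000-01-01.hdf 2001-02-01.hdf 2005-03-01.hdf

# "{2000,2001}*" matches only the files from those two years.
ls {2000,2001}*
```

Hadoop path globs accept the same {a,b} alternation, so a pattern like "{2000,2001}" restricts the job to those years without serializing every single path into the jobconf.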

The final arguments, tacked onto the end, can be country codes or tiles, formatted as "[8 6]". The system takes the union of these and ONLY processes files inside the supplied directory template that match those tiles.

N.B. Don't forget to include the trailing slash on the input path! Also, if you change something in the code, you must regenerate the uberjar file again for the changes to take effect.

PRECL Preprocessing

This isn't totally up to speed yet, since we haven't handled the case of grabbing PRECL data in the middle of the year -- without a filter, we're in danger of processing a bunch of empty values into the datastore. Don't worry about this for now!

To process data into the datastore, do this:

$ hadoop jar forma-standalone.jar forma.hadoop.jobs.preprocess rain s3n://modisdata/PRECL/ s3n://pailbucket/rawstore/ :IDN

Same deal as with MODIS. We need to adjust this for better handling of file selection, so we don't have to grab all of these files (or type in the exact filename).


Timeseries Generation

To get the rain and NDVI series loaded, we ran

$ hadoop jar forma-standalone.jar forma.hadoop.jobs.timeseries.DynamicTimeseries "s3n://pailbucket/rawstore/" "s3n://pailbucket/timeseries/" "1000" "32" :IDN :MYS

This loads EVERYTHING from the NDVI and PRECL datastores. We depend on a function in run-forma right now to trim the ends of these, as they definitely don't match up.

Fires Preprocessing

To process ALL monthly and daily fires data into the redd bucket:

$ hadoop jar forma-standalone.jar forma.source.fire "monthly" "s3n://modisfiles/MonthlyFires/" "s3n://pailbucket/rawstore/"
$ hadoop jar forma-standalone.jar forma.source.fire "daily" "s3n://modisfiles/DailyFires/" "s3n://pailbucket/rawstore/"

The monthly runs only needed to happen the first time, really. The daily files are the ones that need to be continually updated.

Static Dataset Processing

Currently, static datasets need to have an entry in forma.static/static-datasets. If the supplied resolution of the ASCII grid is higher than the supplied MODIS resolution, the ASCII dataset is downsampled; the system pulls apart all rows and columns, converts them to MODIS pixel coordinates, and then runs some sort of aggregation on all duplicates. Currently, we support max and sum. It's trivial to support avg, min, etc given their support in cascalog.
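The downsampling aggregation can be sketched outside of Hadoop. In this hypothetical example, each input line pairs a MODIS pixel id with the value of one ASCII grid cell that maps onto it, and awk plays the role of the max aggregator:

```shell
# Hypothetical input: several ASCII cells collapse onto each MODIS pixel.
cat > cells.txt <<'EOF'
p1 3
p1 7
p2 5
p1 2
p2 4
EOF

# Keep the largest value seen for each pixel id, mirroring the max aggregator.
awk '!($1 in max) || $2 + 0 > max[$1] + 0 { max[$1] = $2 }
     END { for (p in max) print p, max[p] }' cells.txt | sort
```

Swapping the max comparison for a running total gives the sum aggregator; avg, min, etc. follow the same per-pixel grouping pattern, which is why they're trivial to add on top of cascalog.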

If the static dataset needs to be upsampled, we generate a temporary tap of MODIS pixels and do a join between these and the torn apart ASCII grid, forcing the sampling (and lots of duplication of values). This choice happens without user input.

GZIPPED files can't be chunked from S3 -- currently, we have to transfer these files over to the jobtracker, gunzip them, and put them in HDFS before loading them in.

$ wget asciipath.txt.gz
$ gunzip asciipath.txt.gz
$ hadoop dfs -copyFromLocal ./asciipath.txt /home/hadoop/asciipath.txt
$ rm asciipath.txt
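As a local sanity check of the decompression step (.gz files are handled by gunzip; unzip is for zip archives), the round trip looks like this:

```shell
# A tiny ASCII-grid stand-in, compressed and restored with gzip/gunzip.
printf '1 2 3\n4 5 6\n' > grid.txt
gzip grid.txt            # replaces grid.txt with grid.txt.gz
gunzip grid.txt.gz       # restores grid.txt and removes grid.txt.gz
cat grid.txt
```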

Now we're good to go for processing:

$ hadoop jar forma-standalone.jar forma.hadoop.jobs.preprocess.PreprocessAscii /home/hadoop/asciipath.txt "s3n://pailbucket/rawstore/" :IDN :MYS

Running FORMA

Now, we have all of the pieces necessary to run FORMA, the big job! Currently it takes a bunch of paths. Note that we don't have a country static dataset yet, so we can't really do this successfully. GADM is too finely segmented.

$ hadoop jar forma-standalone.jar forma.hadoop.jobs.scatter.RunForma "s3n://pailbucket/rawstore/" "s3n://pailbucket/timeseries/" "s3n://pailbucket/unbucketed/"
$ hadoop jar forma-standalone.jar forma.hadoop.jobs.scatter.BucketForma "s3n://pailbucket/unbucketed/" "s3n://pailbucket/bucketed/"