split data creation and analysis into separate Docker apps #30

Closed
andrewljohnson opened this issue May 10, 2016 · 7 comments

Comments

@andrewljohnson
Contributor

Currently, Dockerfile.devel-gpu inherits from the GDAL Dockerfile and then adds in everything needed for both 1) data creation and 2) analysis, including a nested stack of TensorFlow Dockerfiles that got copied in.

Two Docker Apps: Data Creation and Data Analysis

It would be cleaner if one Dockerfile created training data and saved it to disk, and the other read that training data from disk to run the analysis. The data-creation Dockerfile could be mostly like the existing non-GPU Dockerfile, and the analysis Dockerfile would inherit from the stock TensorFlow Dockerfile and be short, simple, and easy to maintain/deploy to AWS.

In production, I guess one Dockerfile saves data to S3 and the other mounts that data bucket for analysis. In development, the data-creation side would just write to a directory on disk, with no S3 involved.
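As a rough sketch of how the data-creation side could pick between a local directory in development and S3 in production (the TRAINING_DATA_DIR/TRAINING_DATA_BUCKET settings and file names here are hypothetical, not existing code):

import os

import numpy as np


def save_training_data(features, labels, name="experiment17"):
    """Write the arrays locally; also upload to S3 if a bucket is configured."""
    out_dir = os.path.join(os.environ.get("TRAINING_DATA_DIR", "/data"), name)
    os.makedirs(out_dir, exist_ok=True)
    feature_path = os.path.join(out_dir, "features.npy")
    label_path = os.path.join(out_dir, "labels.npy")
    np.save(feature_path, features)
    np.save(label_path, labels)

    bucket = os.environ.get("TRAINING_DATA_BUCKET")  # left unset in development
    if bucket:
        import boto3  # only needed in production
        s3 = boto3.client("s3")
        s3.upload_file(feature_path, bucket, name + "/features.npy")
        s3.upload_file(label_path, bucket, name + "/labels.npy")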

API

The data-creation Docker app could also serve an API, to give people training data for their own models, or to support an MNIST-like open research competition for maps.
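For example, a minimal API sketch, assuming Flask and a /data/<experiment> cache layout like the one discussed later in this thread (the route and file names are made up for illustration):

import os

from flask import Flask, abort, send_file

app = Flask(__name__)
DATA_ROOT = "/data"  # hypothetical mount point for generated training data


@app.route("/training-data/<experiment>/<array_name>.npy")
def get_array(experiment, array_name):
    # Serve one cached ndarray (e.g. features or labels) for an experiment.
    path = os.path.join(DATA_ROOT, experiment, array_name + ".npy")
    if not os.path.exists(path):
        abort(404)
    return send_file(path, mimetype="application/octet-stream")


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)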

@zain
Contributor

zain commented May 11, 2016

I like the API idea, but we'd have to figure out a standardized way to structure the training data (for input into the data-analysis Dockerfile). I think you're still mucking with pickle/JSON/etc. as the output of create_training_data.py, right?

Am I correct in my understanding that the structure of the training data is almost as important as the Tensorflow model itself?

@silberman
Contributor

silberman commented May 11, 2016

Working backwards: ultimately the TensorFlow models want 2 or 4 big ndarrays, train_input_features and train_labels, and in research mode you'll probably also want test_input_features and test_labels that ideally don't share too much with the training set.

Different models and experiments will want differently shaped versions of those. Say you want to try training on the red band + infrared band, with your labels being the current one-hot "does a road go through the middle 3x3 square of this tile". Then the input tensor that tf or keras or tflearn wants would be shaped something like 30000 x 64 x 64 x 2, and the output 30000 x 2. (Or maybe you want to try predicting whether each individual pixel is a road or not, so the output needs to be 30000 x 64 x 64 x 1. It would be nice if that was easy to try.) An experiment can then be abstracted as those 2-4 ndarrays plus some arbitrary TensorFlow model with appropriate placeholder shapes at the beginning and end.
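For concreteness, the placeholder shapes for those two label variants might look like this (a sketch against the TF 1.x placeholder API; none of these names are existing project code):

import tensorflow as tf

# Red + infrared inputs: batches of 64x64 tiles with 2 bands.
input_features = tf.placeholder(tf.float32, shape=[None, 64, 64, 2])

# One-hot "does a road go through the middle 3x3 square" labels.
onehot_labels = tf.placeholder(tf.float32, shape=[None, 2])

# Alternative: per-pixel road/no-road labels for the same tiles.
pixel_labels = tf.placeholder(tf.float32, shape=[None, 64, 64, 1])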

So a 2-step, 2-Dockerfile workflow could look something like:
/bin/create_training_data.py -numtiles 35000
Wrote 35000 tiles worth of features and labels to /data/experiment17

Then the first step of running an experiment on the Tensorflow end could look like:

data_sets = load_data("/data/experiment17",
                      input_features=[NAIP_RED, NAIP_INFRARED],
                      labels=[MIDDLE_CONTAINS_WAY_3by3])

model = make_some_model()
model.fit(data_sets.train_input, data_sets.train_labels, validation=0.1)
model.evaluate(data_sets.test_input, data_sets.test_labels)

A few different ways I can see that cache directory working:

  • 35,000 little 64x64x4 .npy files
  • 1 big 35000x64x64x4 .npy file
  • Each RGBI feature could be saved individually, so 35000 files like tile_1_NAIPRED.npy, tile_1_NAIPGREEN.npy, etc.
  • All of each feature saved together, so for now we'd have 4 big 35000x64x64x1 .npy files, one for each of RGBI.

Then we also have to cache the various labels. These could be files like tile_1_middle3by3_label.npy (each a tiny 2x1 .npy), or we could put them all together in all_middle3by3_label.npy.

Simplest, I think, would be keeping one giant 35000x64x64x4 .npy file for all the input layers we have now, plus a few 35000xWhatever files, one for each of the various label methods we have.
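A sketch of that simplest layout, with made-up file names (one features file plus one file per label method):

import os

import numpy as np

cache_dir = "/data/experiment17"  # hypothetical cache directory

# Writing: features and labels_3x3 are the ndarrays produced by create_training_data.py.
np.save(os.path.join(cache_dir, "features.npy"), features)             # 35000x64x64x4
np.save(os.path.join(cache_dir, "middle3by3_labels.npy"), labels_3x3)  # 35000x2

# Reading it back at the start of an experiment.
features = np.load(os.path.join(cache_dir, "features.npy"))
labels = np.load(os.path.join(cache_dir, "middle3by3_labels.npy"))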

The upside of having tons of little files is that it would scale better if we wanted hundreds of thousands of these things: you only ever really need ~128 at a time (whatever one batch is), so we could avoid putting a 500000x64x64x4 array in memory. But there's probably a better solution before that point, like a database or some of TensorFlow's built-in feeding methods.
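If the single big file ever outgrows memory, numpy's memory mapping can pull out one batch at a time without loading the whole array; a sketch:

import numpy as np

# mmap_mode="r" opens the file lazily; slicing reads only the rows needed.
features = np.load("/data/experiment17/features.npy", mmap_mode="r")
labels = np.load("/data/experiment17/middle3by3_labels.npy", mmap_mode="r")

batch_size = 128
for start in range(0, features.shape[0], batch_size):
    batch_x = np.asarray(features[start:start + batch_size])  # copy just this batch
    batch_y = np.asarray(labels[start:start + batch_size])
    # feed batch_x, batch_y into the model here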

@silberman
Contributor

Btw, Amazon's dsstne, which was open-sourced yesterday, uses NetCDF as its ndarray serialization format. I don't think we should use either of them right now, though at some point that library may be part of the best way to train huge models on AWS. (Right now it doesn't support convolutions, is optimized for sparse data, and emphasizes "speed and scale over experimental flexibility," which is not what we want, but maybe something built on top of it will be.)

https://github.com/amznlabs/amazon-dsstne

@zain
Contributor

zain commented May 12, 2016

@silberman: Why do you think we shouldn't use NetCDF4?

@andrewljohnson
Contributor Author

andrewljohnson commented May 13, 2016

@silberman how about moving your comments to this issue: #23? (Move/delete from here?)

I think this is a separate ticket.

@silberman
Contributor

@zain Oh, I've just never used NetCDF4 before, but now that I'm looking into it a bit, it actually looks great for this.

So instead of "/data/experiment17", we'd have a NetCDF group and a data-loading function that knows how to handle an experiment name and turn it into the four numpy arrays TensorFlow wants. (We may want to divide up into train/test groups as well before serializing, depending on the experiment.)
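A sketch of that with the netCDF4 Python bindings; the file, group, and variable names are invented to mirror the arrays above:

import numpy as np
from netCDF4 import Dataset

# Writing one experiment as a group in a shared cache file.
nc = Dataset("/data/training_cache.nc", "w")
grp = nc.createGroup("experiment17")
grp.createDimension("tile", None)  # unlimited, so tiles can be appended
grp.createDimension("y", 64)
grp.createDimension("x", 64)
grp.createDimension("band", 2)
grp.createDimension("label", 2)
feats = grp.createVariable("train_input_features", "f4", ("tile", "y", "x", "band"))
labs = grp.createVariable("train_labels", "f4", ("tile", "label"))
feats[:] = train_input_features  # ndarrays from the data-creation step
labs[:] = train_labels
nc.close()

# Loading: turn an experiment name back into the arrays TensorFlow wants.
grp = Dataset("/data/training_cache.nc").groups["experiment17"]
train_x = np.asarray(grp.variables["train_input_features"][:])
train_y = np.asarray(grp.variables["train_labels"][:])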

So far most experimentation has been on the labelling side, which happens before this cache. I'd like to speed up that whole pipeline by moving the raster and OSM data into a PostGIS database, and at that point the normal experiment workflow might skip this serialization step, though it would still be useful for testing, reproducibility, or eventually sending data through dsstne.
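As a rough illustration, once the OSM ways are in PostGIS, the "does a road cross this tile" check could become a query like this (table and column names follow osm2pgsql's default schema, but the whole setup is a guess at a future workflow, not existing code):

import psycopg2

conn = psycopg2.connect("dbname=gis")  # hypothetical database loaded with osm2pgsql
cur = conn.cursor()

# minx/miny/maxx/maxy are the tile's bounding box in the projection the OSM
# data was imported with (osm2pgsql defaults to EPSG:3857).
cur.execute(
    """
    SELECT EXISTS (
        SELECT 1 FROM planet_osm_line
        WHERE highway IS NOT NULL
          AND way && ST_MakeEnvelope(%s, %s, %s, %s, 3857)
    )
    """,
    (minx, miny, maxx, maxy),
)
tile_contains_road = cur.fetchone()[0]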

If it worked like that, nearly all the action could happen in the data analysis container. The first step would just be responsible for downloading and inserting a bunch of data into a database, and could be run once. Then the analysis environment should be able to create, save, and load datasets, as long as it can access the database.

Re: the API, I think one going in the opposite direction could be cool: a website interface for constructing net architectures (producing JSON like https://github.com/amznlabs/amazon-dsstne/blob/master/docs/getting_started/userguide.md#neural-network-layer-definition-language), and we tell you how accurate it was, plus other TensorBoard output. Writing the labeller function is a harder interface to come up with, so we might have to offer the feature sets already made, like they do at http://playground.tensorflow.org/.

@andrewljohnson
Contributor Author

Merging with other infrastructure issues.
