split data creation and analysis into separate Docker apps #30

Closed
andrewljohnson opened this issue May 10, 2016 · 7 comments

Comments

@andrewljohnson
Contributor

Currently, Dockerfile.devel-gpu inherits from the GDAL Dockerfile and then adds in everything needed for both 1) data creation and 2) analysis, including a nested stack of TensorFlow Dockerfiles that got copied in.

Two Docker Apps: Data Creation and Data Analysis

It would be cleaner if one Dockerfile created training data and saved it to disk, and the other read that training data from disk to run the analysis. The data-creation Dockerfile could be mostly like the existing non-GPU Dockerfile, and the analysis Dockerfile would inherit from the stock TensorFlow Dockerfile and be short, simple, and easy to maintain/deploy to AWS.

In production, I guess one Dockerfile saves data to S3 and the other mounts that data bucket for analysis. In development, the data-creation side would just write to a directory on disk, with no S3 involved.
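As a rough sketch of how the data-creation side could pick between a local directory in development and S3 in production (the TRAINING_DATA_DIR/TRAINING_DATA_BUCKET settings and file names here are hypothetical, not existing code):

import os

import numpy as np


def save_training_data(features, labels, name="experiment17"):
    """Write the arrays locally; also upload to S3 if a bucket is configured."""
    out_dir = os.path.join(os.environ.get("TRAINING_DATA_DIR", "/data"), name)
    os.makedirs(out_dir, exist_ok=True)
    feature_path = os.path.join(out_dir, "features.npy")
    label_path = os.path.join(out_dir, "labels.npy")
    np.save(feature_path, features)
    np.save(label_path, labels)

    bucket = os.environ.get("TRAINING_DATA_BUCKET")  # left unset in development
    if bucket:
        import boto3  # only needed in production
        s3 = boto3.client("s3")
        s3.upload_file(feature_path, bucket, name + "/features.npy")
        s3.upload_file(label_path, bucket, name + "/labels.npy")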

API

The data-creation Docker app could also serve an API, to give people training data for their own models, or to support an MNIST-like open research competition for maps.
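For example, a minimal API sketch, assuming Flask and a /data/<experiment> cache layout like the one discussed later in this thread (the route and file names are made up for illustration):

import os

from flask import Flask, abort, send_file

app = Flask(__name__)
DATA_ROOT = "/data"  # hypothetical mount point for generated training data


@app.route("/training-data/<experiment>/<array_name>.npy")
def get_array(experiment, array_name):
    # Serve one cached ndarray (e.g. features or labels) for an experiment.
    path = os.path.join(DATA_ROOT, experiment, array_name + ".npy")
    if not os.path.exists(path):
        abort(404)
    return send_file(path, mimetype="application/octet-stream")


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)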

@zain
Contributor

zain commented May 11, 2016

I like the API idea, but we'd have to figure out a standardized way to structure the training data (for input into the data-analysis Dockerfile). I think you're still mucking with pickle/JSON/etc. as the output of create_training_data.py, right?

Am I correct in my understanding that the structure of the training data is almost as important as the Tensorflow model itself?

@silberman
Contributor

silberman commented May 11, 2016

Working backwards: ultimately the TensorFlow models want 2 or 4 big ndarrays, train_input_features and train_labels, and in research mode you'll probably also want test_input_features and test_labels that ideally don't share too much with the training set.

Different models and experiments will want differently shaped versions of those. Say you want to try training on the red band + infrared band, with your labels being the current one-hot "does a road go through the middle 3x3 square of this tile". Then the input tensor that tf or keras or tflearn wants would be shaped something like 30000 x 64 x 64 x 2, and the output 30000 x 2. (Or maybe you want to try predicting whether each individual pixel is a road or not, so the output needs to be 30000 x 64 x 64 x 1. It would be nice if that was easy to try.) An experiment can then be abstracted as those 2-4 ndarrays plus some arbitrary TensorFlow model with appropriate placeholder shapes at the beginning and end.
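For concreteness, the placeholder shapes for those two label variants might look like this (a sketch against the TF 1.x placeholder API; none of these names are existing project code):

import tensorflow as tf

# Red + infrared inputs: batches of 64x64 tiles with 2 bands.
input_features = tf.placeholder(tf.float32, shape=[None, 64, 64, 2])

# One-hot "does a road go through the middle 3x3 square" labels.
onehot_labels = tf.placeholder(tf.float32, shape=[None, 2])

# Alternative: per-pixel road/no-road labels for the same tiles.
pixel_labels = tf.placeholder(tf.float32, shape=[None, 64, 64, 1])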

So a 2-step, 2-Dockerfile workflow could look something like:
/bin/create_training_data.py -numtiles 35000
Wrote 35000 tiles worth of features and labels to /data/experiment17

Then the first step of running an experiment on the Tensorflow end could look like:

data_sets = load_data("/data/experiment17",
                      input_features=[NAIP_RED, NAIP_INFRARED],
                      labels=[MIDDLE_CONTAINS_WAY_3by3])

model = make_some_model()
model.fit(data_sets.train_input, data_sets.train_labels, validation=0.1)
model.evaluate(data_sets.test_input, data_sets.test_labels)

A few different ways I can see that cache directory working:

  • 35,000 little 64x64x4 .npy files
  • 1 big 35000x64x64x4 .npy file
  • Each RGBI feature could be saved individually, so 35000 files like tile_1_NAIPRED.npy, tile_1_NAIPGREEN.npy, etc.
  • All of each feature saved together, so for now we'd have 4 big 35000x64x64x1 .npy files, one for each of RGBI.

Then we also have to cache the various labels. These could be files like tile_1_middle3by3_label.npy (each a tiny 2x1 .npy), or we could put them all together in all_middle3by3_label.npy.

Simplest, I think, would be keeping one giant 35000x64x64x4 .npy file for all the input layers we have now, plus a few 35000xWhatever files, one for each of the various label methods we have.
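A sketch of that simplest layout, with made-up file names (one features file plus one file per label method):

import os

import numpy as np

cache_dir = "/data/experiment17"  # hypothetical cache directory

# Writing: features and labels_3x3 are the ndarrays produced by create_training_data.py.
np.save(os.path.join(cache_dir, "features.npy"), features)             # 35000x64x64x4
np.save(os.path.join(cache_dir, "middle3by3_labels.npy"), labels_3x3)  # 35000x2

# Reading it back at the start of an experiment.
features = np.load(os.path.join(cache_dir, "features.npy"))
labels = np.load(os.path.join(cache_dir, "middle3by3_labels.npy"))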

The upside of having tons of little files is that it would scale better if we wanted hundreds of thousands of these things: you only ever really need ~128 at a time (whatever one batch is), so we could avoid putting a 500000x64x64x4 array in memory. But there's probably a better solution before that point, like a database or some of TensorFlow's built-in feeding methods.
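If the single big file ever outgrows memory, numpy's memory mapping can pull out one batch at a time without loading the whole array; a sketch:

import numpy as np

# mmap_mode="r" opens the file lazily; slicing reads only the rows needed.
features = np.load("/data/experiment17/features.npy", mmap_mode="r")
labels = np.load("/data/experiment17/middle3by3_labels.npy", mmap_mode="r")

batch_size = 128
for start in range(0, features.shape[0], batch_size):
    batch_x = np.asarray(features[start:start + batch_size])  # copy just this batch
    batch_y = np.asarray(labels[start:start + batch_size])
    # feed batch_x, batch_y into the model here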

@silberman
Contributor

Btw, Amazon's dsstne, which was open-sourced yesterday, uses NetCDF as its ndarray serialization format. I don't think we should use either of them right now, though at some point that library may be part of the best way to train huge models on AWS. (Right now it doesn't support convolutions, is optimized for sparse data, and emphasizes "speed and scale over experimental flexibility," which is not what we want, but maybe something built on top of it will be.)

https://github.com/amznlabs/amazon-dsstne

@zain
Contributor

zain commented May 12, 2016

@silberman: Why do you think we shouldn't use NetCDF4?

@andrewljohnson
Contributor Author

andrewljohnson commented May 13, 2016

@silberman how about moving your comments to this issue: #23? (Move/delete from here?)

I think this is a separate ticket.

@silberman
Contributor

@zain Oh, I've just never used NetCDF4 before, but now that I'm looking into it a bit, it actually looks great for this.

So instead of "/data/experiment17", we'd have a NetCDF group and a data-loading function that knows how to handle an experiment name and turn it into the four numpy arrays TensorFlow wants. (We may want to divide up into train/test groups as well before serializing, depending on the experiment.)
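A sketch of that with the netCDF4 Python bindings; the file, group, and variable names are invented to mirror the arrays above:

import numpy as np
from netCDF4 import Dataset

# Writing one experiment as a group in a shared cache file.
nc = Dataset("/data/training_cache.nc", "w")
grp = nc.createGroup("experiment17")
grp.createDimension("tile", None)  # unlimited, so tiles can be appended
grp.createDimension("y", 64)
grp.createDimension("x", 64)
grp.createDimension("band", 2)
grp.createDimension("label", 2)
feats = grp.createVariable("train_input_features", "f4", ("tile", "y", "x", "band"))
labs = grp.createVariable("train_labels", "f4", ("tile", "label"))
feats[:] = train_input_features  # ndarrays from the data-creation step
labs[:] = train_labels
nc.close()

# Loading: turn an experiment name back into the arrays TensorFlow wants.
grp = Dataset("/data/training_cache.nc").groups["experiment17"]
train_x = np.asarray(grp.variables["train_input_features"][:])
train_y = np.asarray(grp.variables["train_labels"][:])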

So far most experimentation has been on the labelling side, which happens before this cache. I'd like to speed up that whole pipeline by moving the raster and OSM data into a PostGIS database, and at that point the normal experiment workflow might skip this serialization step, though it would still be useful for testing, reproducibility, or eventually sending data through dsstne.
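As a rough illustration, once the OSM ways are in PostGIS, the "does a road cross this tile" check could become a query like this (table and column names follow osm2pgsql's default schema, but the whole setup is a guess at a future workflow, not existing code):

import psycopg2

conn = psycopg2.connect("dbname=gis")  # hypothetical database loaded with osm2pgsql
cur = conn.cursor()

# minx/miny/maxx/maxy are the tile's bounding box in the projection the OSM
# data was imported with (osm2pgsql defaults to EPSG:3857).
cur.execute(
    """
    SELECT EXISTS (
        SELECT 1 FROM planet_osm_line
        WHERE highway IS NOT NULL
          AND way && ST_MakeEnvelope(%s, %s, %s, %s, 3857)
    )
    """,
    (minx, miny, maxx, maxy),
)
tile_contains_road = cur.fetchone()[0]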

If it worked like that, nearly all the action could happen in the data analysis container. The first step would just be responsible for downloading and inserting a bunch of data into a database, and could be run once. Then the analysis environment should be able to create, save, and load datasets, as long as it can access the database.

Re: the API, I think one going in the opposite direction could be cool: a website interface for constructing net architectures (producing JSON like https://github.com/amznlabs/amazon-dsstne/blob/master/docs/getting_started/userguide.md#neural-network-layer-definition-language), and we tell you how accurate it was, plus other TensorBoard output. Writing the labeller function is a harder interface to come up with, so we might have to offer the feature sets already made, like they do at http://playground.tensorflow.org/.

@andrewljohnson
Contributor Author

Merging with other infrastructure issues.
