## Distributing the model

This notebook talks through packaging and distributing the model and training artifacts.

### Step 1: version control your code

Check this code project into a git repository on GitHub. Be sure to include a `.gitignore` which explicitly excludes the data parts of the project.

In [1]:
%ls ../

README.md             models/               scripts/
data/                 notebooks/            src/
environment.yml       quilt_summarize.json  summaries/
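
As a minimal sketch of the `.gitignore` step, the helper below appends ignore patterns that are not already present in the file. The function name `ensure_ignored` and the choice of `data/` and `models/` as the "data parts" are assumptions based on the directory listing above; adjust them to whatever your project should keep out of git.

```python
from pathlib import Path


def ensure_ignored(gitignore_path, patterns):
    """Append any patterns missing from a .gitignore file.

    Returns the list of patterns that were added (hypothetical helper).
    """
    path = Path(gitignore_path)
    text = path.read_text() if path.exists() else ""
    existing = {line.strip() for line in text.splitlines()}
    missing = [p for p in patterns if p not in existing]
    if missing:
        # Make sure new entries start on their own line.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        path.write_text(text + prefix + "\n".join(missing) + "\n")
    return missing


# Example call from the notebooks/ directory (paths are assumptions):
# ensure_ignored("../.gitignore", ["data/", "models/"])
```

Calling it twice with the same patterns is harmless: the second call finds nothing missing and leaves the file untouched.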


### Step 2: version control your environment

For this project I used `conda` to version control my environment, writing the environment definition to an `environment.yml` file.

In [2]:
%ls ../ | grep 'environment'

environment.yml


### Step 3: version control your data

We'll put the data under version control using [Quilt T4](https://github.com/quiltdata/t4), which adds a layer of abstraction on top of Amazon S3. T4 allows you to manage data dependencies by defining data packages. These work much like code packages on GitHub once you've pushed them up.

You should replace the `s3://quilt-example` path(s) in the code block that follows with the path of an S3 bucket of your choice that you have write access to.

Alternatively, you can just use regular blob storage, S3 or otherwise. In that case, make sure to track which *version* of each file you wrote; if the data is updated between your writing it and someone else reading it, the model they build from the updated data will likely diverge from yours.
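
If you go the plain blob storage route, one way to track versions is to record a content hash for every file you upload and store that manifest alongside your code. The sketch below is an illustration using only the Python standard library; it is not part of T4, and the function names (`sha256_of`, `build_manifest`, `changed_files`) are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its content hash."""
    root = Path(data_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(old, new):
    """List paths that were added, removed, or modified between manifests."""
    keys = set(old) | set(new)
    return sorted(k for k in keys if old.get(k) != new.get(k))
```

Anyone who later downloads the data can rebuild the manifest and compare it with the one you committed; an empty result from `changed_files` means they are training on the same bytes you did.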

In [5]:
import t4
(t4.Package()
     .set_dir('metadata/', '../data/metadata/')
     .set_dir('training_data/', '../data/training/')
     .set('models/latest.h5', '../models/clf.h5')
     .set_dir('images/', '../data/images/')
     .set_dir('images_cropped/', '../data/images_cropped/')
     .set('README.md', '../README.md')
     # Use the Quilt summarize feature to embed Vega visualizations of data attributes
     .set_dir('summary_data/', '../data/summaries/')
     .set_dir('summaries/', '../data/summaries/')
     .set('quilt_summarize.json', '../quilt_summarize.json')
     # Push it to the remote repository
     # This pushes to the Quilt demo bucket
     # You will need to update the destination path to some other S3 bucket you have access to
     .push('quilt/open_images', dest='s3://quilt-example', registry='s3://quilt-example')
)

README.md
images/
  .DS_Store
  10006714784_9337d5d0e1_o.jpg
  10014143174_1de79c8af8_o.jpg
  10022662923_ab0567fe1a_o.jpg
  10052146336_dc364e0a10_o.jpg
  1006312339_d306fc933d_o.jpg
  10065094283_0db2b64b2d_o.jpg
  10102600246_6385283711_o.jpg
  10123662565_4ab592b952_o.jpg
  101266618_99a28a70ff_o.jpg
  10148587244_f576b88c8f_o.jpg
images_cropped/
metadata/
models/
quilt_summarize.json
summaries/
summary_data/
training_data/

## Conclusion

With all of the above done, you can now pull your own trainable copy of the model to disk with the following commands:

```
git clone https://github.com/quiltdata/open-images.git
conda env create -f open-images/environment.yml
source activate quilt-open-images-dev
python -c "import t4; t4.Package.install('quilt/open_images', registry='s3://quilt-example', dest='open-images/')"
```

Alternatively, if you need only the model, you can fetch it on its own:

```
python -c "import t4; t4.Package.browse('quilt/open_images', 's3://quilt-example')['models/latest.h5'].fetch('latest.h5')"
```