# Working with IMAGESIM

The following directions are intended to allow an engineer to set up the necessary infrastructure, build the required data assets, train an imagesim model, and deploy a similarity schema for a dedicated set of ARD assets.

## Infrastructure

Before working with code, it will be necessary to set up some EC2 machines in the relevant places. This project has compute assets located in the _analytics-core-development (733530388139)_ account on Maxar AWS.

### Buckets

The main bucket used for all data assets, including ARD data, is located at _s3://imagesim-storage_. The following list describes some prefixes and assets that currently exist there and which can be used for model development.

* ard/: This is where all ARD order deliveries are stored; it holds the raw image data used in this project
* chips/: This is where all _chipped_ images are stored; these are the image inputs to machine learning models TODO filename explanation
* code/: Miscellaneous Python scripts. Not relevant to developers
* datasets/: This prefix stores references to "dataset versioning." TODO
* demo-ard/: Contains an ARD image delivery and vectors that can be utilized for testing change detection
* nodata-index.json: This file maps chips from the chips directory (v0.2) to valid imagery; i.e., imagery that contains real data at every pixel.

### Machines

There are two EC2 machines that need to be set up to run this project efficiently, minimizing cost while optimizing compute.

**datamaker**: The "datamaker" machine is used for acquiring, processing and analyzing data. It is intended to be used to create the data that will eventually be the inputs to the modeling part of the pipeline. This machine has the following configuration properties on AWS EC2:

* Hardware: _m5.xlarge_
* System: Ubuntu
* AMI Id: ami-085925f297f89fce1
* Storage: 500GB (EBS)
* Username: ubuntu
* Security group:

The cost to run this machine is approx. $0.1 per compute hour.

**trainer**: The "trainer" machine is a GPU machine used primarily for training models. It is also currently used to encode images with trained models. This machine has the following configuration properties on AWS EC2:

* Hardware: _p3.8xlarge_
* System: Ubuntu
* AMI Id: ami-019266bf7a55994a7
* Storage: 1000GB (EBS)
* Username: ubuntu
* Security group:

The cost to run this machine is approx. $12.00 per compute hour.
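
As a rough budgeting aid, the hourly figures quoted above can be turned into an estimate for a planned session. This is only a sketch: the rates are the approximate values stated in this document, not live AWS pricing, and `estimate_cost` is a hypothetical helper that is not part of imagesim.

```python
# Approximate hourly rates quoted in this document (USD); not live AWS pricing.
HOURLY_RATE_USD = {
    "datamaker": 0.10,  # m5.xlarge
    "trainer": 12.00,   # p3.8xlarge
}


def estimate_cost(machine, hours):
    """Return the approximate cost in USD of running `machine` for `hours`."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return round(HOURLY_RATE_USD[machine] * hours, 2)


if __name__ == "__main__":
    # Example: an 8-hour data-preparation session and a 24-hour training run.
    print("datamaker:", estimate_cost("datamaker", 8))
    print("trainer:", estimate_cost("trainer", 24))
```

The large gap between the two rates is the reason to keep data preparation on the datamaker and to stop the trainer whenever it is idle.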


#### Setting up cloud compute environments and filesystem 

Both machines specify AMIs that do not contain prescribed python compute environments, unlike other DL or data science AMIs. To work in the cloud with imagesim, it will be necessary to set up python environments via conda, which must be installed. The following directions show how to install conda on an Ubuntu system and how to install the relevant packages for imagesim.

**Setting up the Datamaker Python environment**
1. ssh into your datamaker instance: `ssh -i <path-to-your-private-sshkey> ubuntu@<datamaker's public IPv4 DNS address>`.
2. Install [miniconda](https://docs.conda.io/en/latest/miniconda.html). First download the installer to the system /tmp directory: `wget -P /tmp https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh`. Then run the installation script: `/bin/bash /tmp/Anaconda2-4.1.1-Linux-x86_64.sh`. Conda can be installed in the user's home directory, which is the default install location. To access the conda binary, either exit and open a new ssh session, or simply run `source ~/.bashrc`. You should now have conda installed, with the various conda commands for creating Python environments, etc. available on your path.
3. Create a new python 3.8 environment called "science" and activate that environment: `conda create -n science python=3.8 -y && conda activate science`.
4. Install the relevant packages, located in the imagesim repository. This can be done by installing the environment file located in the imagesim repo, or by manually installing the relevant packages: ...

**Setting up the Datamaker filesystem**
1. Create the _data_ directory: `mkdir /home/ubuntu/data`. This is the main subdirectory in which all actions and assets are located within the datamaker machine. The following instructions ought to be carried out with respect to this subdirectory.
2. Create the notebooks directory: `mkdir /home/ubuntu/data/notebooks`.
3. Create the chips directory: `mkdir /home/ubuntu/data/chips`.
4. Create the ard directory: `mkdir /home/ubuntu/data/ard/`
5. Download the relevant ARD data. The current project focuses on a subset of the ordered ARD data, namely those tiles in **UTM 33** that exist at _s3://imagesim-storage/ard/33_. Create the subdirectory for this data: `mkdir /home/ubuntu/data/ard/33`. Then copy the data by executing `aws s3 cp s3://imagesim-storage/ard/33 /home/ubuntu/data/ard/33 --recursive`.
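
The directory-creation steps above can also be scripted, which is convenient when rebuilding the instance. The following is a minimal sketch using only the Python standard library. `make_datamaker_dirs` is a hypothetical helper, not part of imagesim, and it does not perform the S3 download in step 5.

```python
from pathlib import Path


def make_datamaker_dirs(root, utm_zone="33"):
    """Create data/notebooks, data/chips and data/ard/<utm_zone> under `root`."""
    data = Path(root) / "data"
    targets = [
        data / "notebooks",
        data / "chips",
        data / "ard" / utm_zone,
    ]
    for path in targets:
        # parents=True also creates `data` and `data/ard`;
        # exist_ok=True makes the function safe to rerun.
        path.mkdir(parents=True, exist_ok=True)
    return targets
```

On the datamaker instance this would be called as `make_datamaker_dirs("/home/ubuntu")`.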


**Setting up the Trainer Python environment**
1. Follow steps 1-4 of _Setting up the Datamaker Python environment_ above, connecting to the trainer instance instead of the datamaker instance.



## Getting started with Datamaker

Datamaker is the EC2 instance that is configured for creating OSM labels from ARD data, filtering and processing these data, and computing relevant statistical analyses on these data. It can also be used to run a TMS server to serve mosaicked ARD tiles (see section _Running a TMS on Mosaiced ARD data_). The following sections describe how imagesim modules and functions can be used to acquire OSM data, explore and filter that data, and make statistical inferences relevant to subsequent model training.


### OSM data acquisition

After creating and setting up the datamaker instance as described in the previous sections, we can start the data engineering process by acquiring the relevant OSM data. The following code cell shows how the imagesim library can be used to generate OSM data for the ARD tiles in question.

```
from imagesim.scripts.local.osm import fetch_osm_by_quadkey
from imagesim.scripts.constants import DATA_PATH

node_tags, way_tags, relation_tags = fetch_osm_by_quadkey(33, DATA_PATH)
```

This function takes a UTM zone and a data path and writes out raw OSM query results (meaning no tag filtering). `DATA_PATH` specifies your local ARD data path (i.e., `/home/ubuntu/data/ard`). The function looks up the level-12 quadkeys that exist under the UTM zone path in the ARD directory structure and, for each quadkey, writes the results to that quadkey's subdirectory in a file called _osm_data.json_.

**osm_data.json**
This data file structure is unique to the imagesim data pipeline. The file contains one dictionary object with the following keys:
* quadkey: The level-12 quadkey which specifies the query geometry from which OSM results were returned
* nodes: The OSM nodes data parsed out from the raw Overpass results
* ways: The OSM ways data parsed out from the raw Overpass results
* relations: The OSM relations data parsed out from the raw Overpass results

This file structure reflects the importance of potentially treating different OSM elements individually in downstream applications.
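
To illustrate the layout described above, here is a minimal sketch that loads one _osm_data.json_ file and checks that the four documented keys are present. `load_osm_data` is a hypothetical helper, not part of imagesim, and the sketch makes no assumption about the internal structure of the `nodes`, `ways`, or `relations` values, which this document does not specify.

```python
import json

# The four top-level keys documented for osm_data.json.
EXPECTED_KEYS = {"quadkey", "nodes", "ways", "relations"}


def load_osm_data(path):
    """Load an osm_data.json file and verify the documented keys are present."""
    with open(path) as f:
        data = json.load(f)
    missing = EXPECTED_KEYS - set(data)
    if missing:
        raise KeyError(f"{path} is missing keys: {sorted(missing)}")
    return data
```

Because nodes, ways, and relations are stored under separate keys, downstream code can read just the element type it needs, for example `load_osm_data(path)["ways"]`.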

Once this function has completed, you ought to have an _osm_data.json_ in each ARD quadkey subdirectory; i.e., the filepath `/home/ubuntu/data/ard/33/<level-12-quadkey>/osm_data.json` should exist for each quadkey.
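
A quick way to confirm this is to list any quadkey subdirectories that lack the file. The sketch below assumes that every direct subdirectory of the zone directory is a level-12 quadkey directory. `quadkeys_missing_osm` is a hypothetical helper, not part of imagesim.

```python
from pathlib import Path


def quadkeys_missing_osm(zone_dir):
    """Return sorted names of subdirectories of `zone_dir` without an osm_data.json."""
    zone_dir = Path(zone_dir)
    return sorted(
        sub.name
        for sub in zone_dir.iterdir()
        if sub.is_dir() and not (sub / "osm_data.json").is_file()
    )
```

Calling `quadkeys_missing_osm("/home/ubuntu/data/ard/33")` should return an empty list when every quadkey has been fetched successfully.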


### ARD Image chipping and filtering (creating **nodata-index.json**)

ARD image creation can potentially deliver imagery that contains no-data values. Imagesim contains modules and functions for filtering and chipping ARD level-12 data tiles into smaller tiles that are used for model training. The following section describes how to create training-level chips and filter those results to include only true-data imagery.
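
The idea behind chipping and no-data filtering can be sketched in plain Python as follows. This is an illustration only, not imagesim's implementation: the single-band tile represented as a list of rows, the no-data value of 0, the non-overlapping window layout, and all function names are assumptions made for the example.

```python
def chip_windows(height, width, chip_size):
    """Yield (row, col) offsets of non-overlapping chip_size x chip_size windows.

    Partial windows at the right and bottom edges are skipped.
    """
    for row in range(0, height - chip_size + 1, chip_size):
        for col in range(0, width - chip_size + 1, chip_size):
            yield row, col


def has_full_data(chip, nodata=0):
    """Return True if no pixel in the 2D chip equals the no-data value."""
    return all(pixel != nodata for line in chip for pixel in line)


def valid_chips(tile, chip_size, nodata=0):
    """Return offsets of the chips in `tile` that contain no no-data pixels."""
    height, width = len(tile), len(tile[0])
    keep = []
    for row, col in chip_windows(height, width, chip_size):
        # Slice the chip out of the tile: chip_size rows, chip_size columns.
        chip = [line[col:col + chip_size] for line in tile[row:row + chip_size]]
        if has_full_data(chip, nodata):
            keep.append((row, col))
    return keep
```

A chip is kept only if every pixel holds real data, which mirrors the "real data at every pixel" criterion used for _nodata-index.json_.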

From the command line, invoke the imagesim chip.py cli as follows:

`imagesim chip --filename /home/ubuntu/data/nodata-index.json --chip-dir /home/ubuntu/data/chips/33`

Running this command will accomplish the following:
1. 
