# Working with IMAGESIM for ARD Image Retrieval

The following directions are intended to allow an engineer to set up the necessary infrastructure, build the required data assets, train an imagesim model, and deploy a similarity schema for a dedicated set of ARD assets.

## Infrastructure

Before working with code, it will be necessary to set up some EC2 machines in the relevant places. This project has compute assets located in the _analytics-core-development (733530388139)_ account on Maxar AWS.

### Buckets

The main bucket used for all data assets, including ARD data, is located at _s3://imagesim-storage_. The following list describes some prefixes and assets that currently exist there and can be used for model development.

* ard/: This is where all ARD order deliveries are stored. It represents the raw image data used in this project.
* chips/: This is where all _chipped_ images are stored. These are the image inputs to machine learning models. TODO filename explanation
* code/: Miscellaneous Python scripts. Not relevant to developers.
* datasets/: This prefix stores references to "dataset versioning." TODO
* demo-ard/: Contains an ARD image delivery and vectors that can be utilized for testing change detection.
* models/: Contains ONNX models that have been sufficiently trained to use as feature extractors for image encoding.
* nodata-index.json: This file maps chips from the chips directory (v0.2) to valid imagery, i.e., imagery that contains real data at every pixel.

### Machines

There are two EC2 machines that need to be set up to run this project efficiently, minimizing cost while optimizing compute.

**datamaker**: The "datamaker" machine is used for acquiring, processing and analyzing data. It is intended to be used to create the data that will eventually be the inputs to the modeling part of the pipeline. This machine has the following configuration properties on AWS EC2:

* Hardware: _m5.xlarge_
* System: Ubuntu
* AMI Id: ami-085925f297f89fce1
* Storage: 500GB (EBS)
* Username: ubuntu
* Security group:

The cost to run this machine is approx. $0.1 per compute hour.

**trainer**: The "trainer" machine is used primarily for training the model, and is currently also used to encode images with trained models. This is a GPU machine. It has the following configuration properties on AWS EC2:

* Hardware: _p3.8xlarge_
* System: Ubuntu
* AMI Id: ami-019266bf7a55994a7
* Storage: 1000GB (EBS)
* Username: ubuntu
* Security group:

The cost to run this machine is approx. $12.00 per compute hour.


#### Setting up cloud compute environments and filesystem 

Both machines specify AMIs that do not contain prescribed python compute environments, unlike other DL or data science AMIs. To work in the cloud with imagesim, it will be necessary to set up python environments via conda, which must be installed. The following directions show how to install conda on an Ubuntu system and how to install the relevant packages for imagesim.

**Setting up the Datamaker Python environment**
1. ssh into your datamaker instance: `ssh -i <path-to-your-private-sshkey> ubuntu@<datamaker's public IPv4 DNS address>`.
2. Install [miniconda](https://docs.conda.io/en/latest/miniconda.html). First download the installer to the system /tmp directory: `wget -P /tmp https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh`. Then run the installation script: `/bin/bash /tmp/Miniconda3-latest-Linux-x86_64.sh`. The default install location, the user's home directory, is fine. To access the conda binary, either exit and start a new ssh session, or simply run `source ~/.bashrc`. You should now have miniconda installed, with the conda commands for creating Python environments available on your path.
3. Create a new python 3.8 environment called "science" and activate that environment: `conda create -n science python=3.8 -y && conda activate science`.
4. Install the relevant packages, located in the imagesim repository. This can be done by installing the environment file located in the imagesim repo, or by manually installing the relevant packages: ...

**Setting up the Datamaker filesystem**
1. Create the _data_ directory: `mkdir /home/ubuntu/data`. This is the main subdirectory in which all actions and assets are located within the datamaker machine. The following instructions ought to be carried out with respect to this subdirectory.
2. Create the notebooks directory: `mkdir /home/ubuntu/data/notebooks`.
3. Create the chips directory: `mkdir /home/ubuntu/data/chips`.
4. Create the ard directory: `mkdir /home/ubuntu/data/ard/`.
5. Download the relevant ARD data. The current project focuses on a subset of the ordered ARD data, namely those tiles in **UTM 33** that exist at _s3://imagesim-storage/ard/33_. Create the subdirectory for this data: `mkdir /home/ubuntu/data/ard/33`. Then copy the data from S3 by executing `aws s3 cp s3://imagesim-storage/ard/33 /home/ubuntu/data/ard/33 --recursive`.
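
The directory steps above can also be scripted. The following is a minimal sketch, not part of imagesim: the helper name `make_datamaker_layout` is hypothetical, and it only creates the directories (the S3 copy still has to be run separately).

```python
from pathlib import Path


def make_datamaker_layout(root, utm_zone=33):
    """Create data/, data/notebooks, data/chips and data/ard/<zone> under root.

    Hypothetical helper: it mirrors the mkdir steps listed above and is
    safe to re-run because existing directories are left untouched.
    """
    data = Path(root) / "data"
    subdirs = [
        data / "notebooks",
        data / "chips",
        data / "ard" / str(utm_zone),
    ]
    for directory in subdirs:
        # parents=True also creates data/ and data/ard/ when missing.
        directory.mkdir(parents=True, exist_ok=True)
    return [data] + subdirs
```

On the datamaker instance this would be called as `make_datamaker_layout("/home/ubuntu")`.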


**Setting up the Trainer Python environment**
1. Follow steps 1-4 of the Datamaker Python environment setup above, this time on the trainer instance.



## Getting started with Datamaker

Datamaker is the EC2 instance that is configured for creating OSM labels from ARD data, filtering and processing these data, and computing relevant statistical analysis on these data. It can also be used to run a TMS server to serve mosaiced ARD tiles (see section _Running a TMS on Mosaiced ARD data_). The following sections describe how imagesim modules and functions can be used to acquire OSM data, explore and filter that data, and make statistical inferences relevant to subsequent model training.


### OSM data acquisition

After creating and setting up the datamaker instance as described in the previous sections, we can start the data engineering process by acquiring the relevant OSM data. The following code cell shows how the imagesim library can be used to generate OSM data for the ARD tiles in question.

```
from imagesim.scripts.local.osm import fetch_osm_by_quadkey
from imagesim.scripts.constants import DATA_PATH

node_tags, way_tags, relation_tags = fetch_osm_by_quadkey(33, DATA_PATH)
```

This function takes a UTM zone and a data path and writes out raw OSM query results (meaning no tag filtering) under `DATA_PATH`, which specifies your local ARD data path (i.e., `/home/ubuntu/data/ard`). It looks up the level-12 quadkeys that exist under the UTM zone path in the ARD directory structure and writes the results for each quadkey to a file called _osm_data.json_ in that quadkey's subdirectory.

**osm_data.json**
This data file structure is unique to the imagesim data pipeline. The file contains one dictionary object with the following keys:
* quadkey: The level-12 quadkey which specifies the query geometry from which osm results were returned
* nodes: The OSM nodes data parsed out from the raw Overpass results
* ways: The OSM ways data parsed out from the raw Overpass results
* relations: The OSM relations data parsed out from the raw Overpass results

This file structure reflects the importance of potentially treating different OSM elements individually in downstream applications.

Once this function has completed, you ought to have an _osm_data.json_ in each ARD quadkey subdirectory, i.e., the filepath `/home/ubuntu/data/ard/33/<level-12-quadkey>/osm_data.json` should exist for each quadkey.
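
A quick way to verify this is to walk the zone directory and summarize each file. The sketch below is not part of imagesim; the helper names are hypothetical, and it assumes that `nodes`, `ways` and `relations` each hold a collection (list or dict) whose length is the element count.

```python
import json
from pathlib import Path

OSM_KEYS = ("nodes", "ways", "relations")


def summarize_osm(ard_zone_dir):
    """Return {quadkey: {"nodes": n, "ways": n, "relations": n}} for one zone.

    Assumes the osm_data.json layout described above; the element values
    are assumed to be sized collections.
    """
    summary = {}
    for path in sorted(Path(ard_zone_dir).glob("*/osm_data.json")):
        with open(path) as f:
            record = json.load(f)
        # Fall back to the directory name if the "quadkey" key is absent.
        quadkey = str(record.get("quadkey", path.parent.name))
        summary[quadkey] = {key: len(record.get(key, [])) for key in OSM_KEYS}
    return summary


def missing_osm(ard_zone_dir):
    """List quadkey subdirectories that do not contain osm_data.json yet."""
    return sorted(
        d.name
        for d in Path(ard_zone_dir).iterdir()
        if d.is_dir() and not (d / "osm_data.json").exists()
    )
```

Calling `missing_osm("/home/ubuntu/data/ard/33")` should return an empty list once the acquisition step has finished for every quadkey.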


### ARD Image chipping and filtering 

ARD image creation can potentially deliver imagery that contains no-data values. Imagesim contains modules and functions for filtering and chipping ARD level-12 data tiles into smaller tiles that are used for model training. The following section describes how to create training-level chips and filter those results to include only true-data imagery.

From the command line, invoke ARD chipping as follows:

`python chip.py --ard-path /home/ubuntu/data/ard/33 --zoom 17 --proc 4 --dest /home/ubuntu/data/chips/33`

This will chip all of the available ARD imagery to the output chip path at the specified zoom level. The script supports multiprocessing (`--proc`), which is recommended.

**Image chips: .../chips/33/\<imgchip\>.jpg**
Image chips, intended for model consumption, have the filename structure `Z<utm-zone>-<quadkey>_<catalogid>.jpg`.
The chip filename is prefixed by the letter Z and the UTM zone of origin, followed by a hyphen delimiter, then the chip quadkey identifier at the _chip zoom level_, an underscore, and finally the Maxar catalog id of the strip the image was derived from.
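
To make the naming scheme concrete, here is a minimal parsing sketch. It is not the imagesim utility; `parse_chip_name` and `ChipName` are hypothetical names, and the example filename in the usage note is made up. It also assumes the usual quadkey convention that the key length equals the zoom level and that the first 12 digits identify the level-12 parent tile.

```python
import os
import re
from typing import NamedTuple

# Z<utm-zone>-<quadkey>_<catalogid>.jpg
CHIP_RE = re.compile(
    r"^Z(?P<zone>\d+)-(?P<quadkey>[0-3]+)_(?P<catalog_id>[^_./]+)\.jpg$"
)


class ChipName(NamedTuple):
    utm_zone: int
    quadkey: str
    catalog_id: str

    @property
    def zoom(self):
        # Assumption: one quadkey digit per zoom level.
        return len(self.quadkey)

    def parent_quadkey(self, level=12):
        # Assumption: a quadkey prefix identifies the parent tile.
        return self.quadkey[:level]


def parse_chip_name(path):
    """Split a chip filename into UTM zone, quadkey and catalog id."""
    match = CHIP_RE.match(os.path.basename(path))
    if match is None:
        raise ValueError(f"not a chip filename: {path!r}")
    return ChipName(int(match["zone"]), match["quadkey"], match["catalog_id"])
```

For a made-up chip such as `Z33-12002022303211030_10400100ABCDEF00.jpg`, `parse_chip_name(...)` yields zone 33, a 17-digit quadkey, and the catalog id after the underscore.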

It is worth noting that these chip images do not encode geospatial references and are simple jpgs. Their filename construct, however, contains the relevant information to map the chip to its geospatial coordinates. Extracting the catalog id from the filename can be used to map back to the origination ARD tile as well. Imagesim has a utility to do just this:

`from imagesim.scripts.chip import get_cog_path_from_chip ...`

This is very useful for downstream model QA tools, e.g., cross-referencing model outputs with visualization services like TMS, which consume the original ARD filesystem structure.


[**Optional: Create nodata-index**]

Sometimes the ARD data is delivered with no-data values, for example when the ordering API options were not set correctly. The following command will consume all of the generated chip imagery from the previous step, perform a "no data" test on the corpus, and write out a JSON-based result index mapping each chip filename to a classification of "no data," "partial data," or "all data." It does not create, modify or delete any of the existing image files.

`python chip.py --filename /home/ubuntu/data/nodata-index.json --chip-dir /home/ubuntu/data/chips/33`
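
Once the index exists, it can be used to select only fully valid chips. The sketch below is an illustration rather than imagesim code: `load_valid_chips` is a hypothetical helper, and it assumes the index is a flat JSON object of the form `{"<chip filename>": "<classification>"}` using the three classification strings named above.

```python
import json
from collections import Counter


def load_valid_chips(index_path, keep=("all data",)):
    """Return (sorted chip names whose class is in keep, per-class counts).

    Assumes a flat {filename: classification} JSON object; adjust the
    lookup if the real index nests its entries differently.
    """
    with open(index_path) as f:
        index = json.load(f)
    counts = Counter(index.values())
    valid = sorted(name for name, label in index.items() if label in keep)
    return valid, counts
```

Passing `keep=("all data", "partial data")` would relax the filter to include partially valid chips.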


At this point, all of the raw data acquisition requirements are satisfied and in their proper place on the filesystem. The following section is intended as a guide by which the imagesim library and associated tools can be used for the purpose of data exploration, summarization, filtering and processing. 


## Exploring and operating on massive datasets / Selecting and constructing training data

Probably the most important part of the Imagesim pipeline, in terms of obtaining useful model outputs, is the image/label construction. The raw data space is colossal in dimensionality, noisy, and, it can be assumed, unbalanced both in sample sizes and in spatial distribution.

To complicate matters further, it is not necessarily obvious how the feature characterization might positively or negatively affect the underlying target distribution, for either the feature similarity embeddings or the classification space.

Furthermore, multilabel classification problems are only just becoming an area of interest in the academic and research domain. Some relevant questions include: what subset of OSM labels characterizes the feature space well at a particular physical scale? I.e., how do things change if the images are chipped at zoom level 18 instead of 17? I will elaborate more on these questions later.

Whatever approach is taken in sample definition, it should take into account some of the basic considerations that are standard for the case of multiclass classification on imagery, and go from there.

Practically, the technical requirements to implement any such strategies are very much non-trivial considering the massive amount of data that lives on the filesystem. Attempting to work with all of the data in memory is **not viable** for many computing operations on common tabular data structures.
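
One way around the memory limit is to stream the data row by row and keep only running aggregates. The sketch below illustrates the idea with the standard library; it is not imagesim code. The file layout is hypothetical: a CSV with one identifier column (here called `chip`) and one 0/1 column per label.

```python
import csv
from collections import Counter


def stream_label_counts(csv_path, id_column="chip"):
    """Count positive labels per column without loading the file in memory.

    Hypothetical layout: a header row, one id column, and one 0/1 column
    per label. Only one row is held in memory at a time.
    """
    counts = Counter()
    n_rows = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            n_rows += 1
            for name, value in row.items():
                # Skip the id column and malformed (short or long) rows.
                if name is None or name == id_column or value is None:
                    continue
                if value.strip() in ("1", "1.0"):
                    counts[name] += 1
    return n_rows, counts
```

Dividing each count by the returned row total gives the label frequencies, which is a cheap first check of class imbalance before any sampling strategy is chosen.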


<TODO: Pre-defined filtering assets, PYGEOS-based SRTree spatial joins (geopandas latest)>

### Training data structure: Dense vs Sparse Encodings

The final 

## Modeling with IMAGESIM

### Model definition for multilabel classification

Our model definition is relatively straightforward. As a backbone, we use an ImageNet-pretrained ResNet50 with initially frozen weights. We replace the backbone's final layer with a fully connected head: two hidden linear layers, each followed by ReLU, batch normalization and dropout, and a final linear layer that outputs one logit per label. The relevant discriminator needs to be a composition of a sigmoid activation followed by a binary cross-entropy loss, which is achieved with the `torch.nn.BCEWithLogitsLoss` module. This is considered more numerically stable than composing the two operations separately, because it uses the log-sum-exp trick internally. Because the loss applies the sigmoid itself, we must apply the sigmoid activation manually when calculating predictions and the relevant metrics (other than losses). We evaluate performance in training and validation using accuracy and the F2 score, which weights recall more heavily than precision and emphasizes the model's ability to "generally" predict labels that are well represented in the input dataset. We define the model using `pytorch-lightning`, which is effectively a convenience library for reproducibility and readability in the modeling pipeline, and has many plugins. The model definition is shown below:

```python
import torch
from torch import nn
from torch.optim import lr_scheduler
from torchvision import models
import pytorch_lightning as pl
# Assumes a pytorch-lightning 1.x release that still ships these metrics;
# newer releases moved them to the separate torchmetrics package.
from pytorch_lightning.metrics import Accuracy, FBeta


class SkywayMultiLabelClassifier(pl.LightningModule):
    """
    A pytorch_lightning.LightningModule subclass. Used in
    conjunction with SkywayDataset dataloader class for training,
    validating and testing multi-label, multi-class skyway data.
    """

    def __init__(self,
                 n_labels: int = 13,
                 cnn_backbone: nn.Module = models.resnet50,
                 loss_fn: nn.Module = nn.BCEWithLogitsLoss(),
                 discriminator_optimizer = torch.optim.Adam,
                 learning_rate: float = 1e-3,
                 discriminator_scheduler = lr_scheduler.CosineAnnealingLR,
                 binary_threshold = 0.5,
                ):
        super().__init__()

        self.n_labels = n_labels
        self.loss = loss_fn
        self.optimizer = discriminator_optimizer
        self.learning_rate = learning_rate
        self.scheduler = discriminator_scheduler
        self.binary_threshold = binary_threshold
        self.cnn = cnn_backbone(pretrained=True)
        # Freeze backbone
        for param in self.cnn.parameters():
            param.requires_grad_(False)

        # Replace the final layer with a trainable fully connected head.
        # The new layers are created after the freeze, so they stay trainable.
        self.cnn.fc = nn.Sequential(
            nn.Linear(self.cnn.fc.in_features, self.cnn.fc.in_features // 2),
            nn.ReLU(),
            nn.BatchNorm1d(self.cnn.fc.in_features // 2),
            nn.Dropout(0.2),
            nn.Linear(self.cnn.fc.in_features // 2, self.cnn.fc.in_features // 4),
            nn.ReLU(),
            nn.BatchNorm1d(self.cnn.fc.in_features // 4),
            nn.Dropout(0.2),
            nn.Linear(self.cnn.fc.in_features // 4, self.n_labels)
        )

        # Set up metrics
        self.accuracy = Accuracy(threshold=self.binary_threshold,
                                 compute_on_step=True,
                                 dist_sync_on_step=True,
                                )
        self.f2 = FBeta(self.n_labels,
                        beta=2.0,
                        threshold=self.binary_threshold,
                        multilabel=True,
                        compute_on_step=True,
                        dist_sync_on_step=True,
                       )

    def forward(self, x):
        # Returns raw logits; the sigmoid is applied by the loss or by hand.
        return self.cnn(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)

        return {
            "y_hat": y_hat,
            "target": y
        }

    def training_step_end(self, outputs):
        y_hats = outputs['y_hat']
        targets = outputs['target']
        # The metrics expect probabilities, the loss expects raw logits.
        probs = torch.sigmoid(y_hats)

        loss = self.loss(y_hats, targets)
        accuracy = self.accuracy(probs, targets)
        f2 = self.f2(probs, targets)

        self.log('train_loss', loss, prog_bar=True)
        self.log('train_acc', accuracy, prog_bar=True)
        self.log('train_f2', f2, prog_bar=True)

        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)

        return {
            "y_hat": y_hat,
            "target": y
        }

    def validation_step_end(self, outputs):
        y_hats = outputs['y_hat']
        targets = outputs['target']
        probs = torch.sigmoid(y_hats)

        loss = self.loss(y_hats, targets)
        accuracy = self.accuracy(probs, targets)
        f2 = self.f2(probs, targets)

        self.log('val_loss', loss, prog_bar=True)
        self.log('val_acc', accuracy, prog_bar=True)
        self.log('val_f2', f2, prog_bar=True)

        return loss

    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)

        return {
            "y_hat": y_hat,
            "target": y
        }

    def test_step_end(self, outputs):
        y_hats = outputs['y_hat']
        targets = outputs['target']
        probs = torch.sigmoid(y_hats)

        loss = self.loss(y_hats, targets)
        accuracy = self.accuracy(probs, targets)
        f2 = self.f2(probs, targets)

        self.log('test_loss', loss, prog_bar=True)
        self.log('test_acc', accuracy, prog_bar=True)
        self.log('test_f2', f2, prog_bar=True)

        return loss

    def configure_optimizers(self):
        # Frozen backbone parameters receive no gradients, so only the head is updated.
        disc_optimizer = self.optimizer(self.cnn.parameters(), lr=self.learning_rate)
        # Note: eta_min is larger than the default learning_rate of 1e-3,
        # so with the defaults the schedule moves the rate up toward eta_min.
        disc_scheduler = self.scheduler(disc_optimizer, T_max=5, eta_min=0.005)
        return {
            "optimizer": disc_optimizer,
            "lr_scheduler": disc_scheduler
        }
```
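
The stability argument for the combined sigmoid and cross-entropy loss can be illustrated without PyTorch. The sketch below is a plain-Python illustration for a single logit; the formula `max(x, 0) - x*t + log(1 + exp(-|x|))` is one common stable formulation and is not claimed to be PyTorch's exact internal implementation.

```python
import math


def bce_naive(logit, target):
    """Sigmoid followed by binary cross-entropy, computed in two steps."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def bce_with_logits(logit, target):
    """Same loss computed directly from the logit.

    Algebraically equal to bce_naive, but the exponential is only ever
    taken of a non-positive number, so it cannot overflow, and the
    logarithm never sees an argument of exactly zero.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))
```

For moderate logits both functions agree. For a confident wrong prediction such as `logit=40.0, target=0.0`, the sigmoid rounds to exactly 1.0 in double precision, so `bce_naive` raises `ValueError` on `log(0)`, while `bce_with_logits` returns a finite loss of about 40.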


### Training a SkywayMultiLabelClassifier (SMLC)

After one has produced both satisfactory image chips and the sparse label encodings as described in the datamaker process above, model training can begin. Modeling should be executed on a multi-GPU instance, e.g., a `p3.8xlarge` or similar as described above. Modeling via `imagesim` is designed to automatically distribute both data loading and training across the available GPUs, thanks in part to the underlying machinery of the `pytorch-lightning` modules.

Although the `imagesim.models` module provides a convenient CLI for training an SMLC, it can also be used as an API in an interactive environment, like a Jupyter notebook. One benefit of using the API this way is that experimentation, inference, and investigation are more readily available to an ML engineer than with the CLI approach. For an in-depth example of how this can be done, see the `encoder.ipynb` notebook included in this repository.

#### Prerequisites
* An accessible GPU instance as described above
* A conda environment installed on the remote system that includes the packages as defined in `imagesim.build`
* An installation or clone of the imagesim library, specifically imagesim.models
* A local cache of input image chips, by default located at `/home/ubuntu/data/chips`, produced from the datamaker pipeline
* The corresponding sparse labels dataset, by default located at `/home/ubuntu/data/sparse_label_encodings.csv`, produced from the datamaker pipeline
* (Optional) a configured logging plugin, like tensorboard or [WandB](https://www.wandb.com/)


With those prerequisites, it's easy to train a model using `imagesim.models`. The **entrypoint** to training is the Python module `imagesim.models.train_model`. This can be run as a Python script, with a multitude of optional input parameters. To train a model with all default parameters, simply run

`python -m imagesim.models.train_model`

This command will load the images, labels, models and hyperparameters that have been built in as defaults. It is worth noting that the **default values** defined there have been chosen empirically in preliminary experimentation in training SMLC models. Running

`python -m imagesim.models.train_model --help`

ought to display information about the default hyperparameters used by `imagesim`.

It is, however, recommended to specify the number of GPUs you would like to distribute training on. By default, `imagesim` does not assume that training is done on a multi-GPU instance. To utilize all GPUs, you can run

`python -m imagesim.models.train_model --gpus=-1`

Besides the basic hyperparameters shown in the help menu, the user has access to all input arguments for initializing a [pytorch_lightning.Trainer](https://pytorch-lightning.readthedocs.io/en/stable/trainer.html) instance, which is the main object used for modeling.


#### Modeling outputs
Once the training process has completed, the engineer has access to model checkpoints in the current working directory, which can be loaded via `torch` or `pytorch_lightning`, as well as an ONNX version of the output model. Basic accuracy and F2 score logging outputs are also available for model evaluation. Either model output can be loaded and used for encoding an image dataset that can in turn be used for image indexing and image retrieval, as described in the following sections.

It is recommended to upload satisfactory models to the S3 bucket prefix described above.
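
When reviewing the logged F2 values, it can help to recompute the score offline on a handful of predictions. The following is a plain-Python sketch of a micro-averaged F-beta score for multilabel outputs. It is not the metric implementation used in training, and micro-averaging and the `>=` threshold comparison are assumptions made for this illustration.

```python
def binarize(probs, threshold=0.5):
    """Turn per-label probabilities into 0/1 predictions."""
    return [[int(p >= threshold) for p in row] for row in probs]


def fbeta_micro(y_true, y_pred, beta=2.0):
    """Micro-averaged F-beta over all (sample, label) pairs.

    F_beta = (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP)
    With beta=2, a missed label (FN) costs four times as much as a
    spurious one (FP), i.e. recall is weighted more than precision.
    """
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    b2 = beta * beta
    denom = (1.0 + b2) * tp + b2 * fn + fp
    # Convention chosen here: score 0.0 when there is nothing to predict.
    return (1.0 + b2) * tp / denom if denom else 0.0
```

With two correct labels and one missed label the F2 score is 10/14, whereas two correct labels and one spurious label give 10/11, which shows how the score favors recall.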

#### Benchmarks
Current models on S3 have been benchmarked at validation losses < 0.3, with validation accuracies > 0.85.

## Building vector indexes via image feature embeddings

### Understanding performance via Alternative Learning Strategies and SOTA Feature Embedding 

Coming soon