# Working with IMAGESIM

The following directions are intended to allow an engineer to set up the necessary infrastructure, build the required data assets, train an imagesim model, and deploy a similarity schema for a dedicated set of ARD assets.

## Infrastructure

Before working with the code, you need to set up two EC2 machines. This project's compute assets are located in the _analytics-core-development (733530388139)_ account on Maxar AWS.

### Buckets

The main bucket used for all data assets, including ARD data, is located at _s3://imagesim-storage_. The following list describes some prefixes and assets that currently exist there and can be used for model development.

* ard/: This is where all ARD order deliveries are stored; it represents the raw image data used in this project
* chips/: This is where all _chipped_ images are stored; these are the image inputs to machine learning models TODO filename explanation
* code/: Miscellaneous Python scripts. Not relevant to developers
* datasets/: This prefix stores references to "dataset versioning." TODO
* demo-ard/: Contains an ARD image delivery and vectors that can be utilized for testing change detection
* nodata-index.json: This file maps chips from the chips directory (v0.2) to valid imagery; i.e., imagery that contains real data at every pixel.

### Machines

There are two EC2 machines that need to be set up to run this project efficiently, minimizing cost while optimizing compute.

**datamaker**: The "datamaker" machine is used for acquiring, processing and analyzing data. It is intended to be used to create the data that will eventually be the inputs to the modeling part of the pipeline. This machine has the following configuration properties on AWS EC2:

* Hardware: _m5.xlarge_
* System: Ubuntu
* AMI Id: ami-085925f297f89fce1
* Storage: 500GB (EBS)
* Username: ubuntu
* Security group:

The cost to run this machine is approx. $0.1 per compute hour.

**trainer**: The "trainer" machine is used primarily for training the model, and is currently also used to encode images with trained models. This is a GPU machine. It has the following configuration properties on AWS EC2:

* Hardware: _p3.8xlarge_
* System: Ubuntu
* AMI Id: ami-019266bf7a55994a7
* Storage: 1000GB (EBS)
* Username: ubuntu
* Security group:

The cost to run this machine is approx. $12.00 per compute hour.
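
As a rough budgeting aid, the hourly figures quoted above can be turned into a cost estimate. The sketch below assumes the approximate rates stated in this section and simple per-hour billing; the `estimate_cost` helper and the example durations are illustrative only and not part of imagesim.

```python
# Approximate hourly rates (USD) quoted in this section; actual AWS pricing may differ.
HOURLY_RATE_USD = {"datamaker": 0.10, "trainer": 12.00}


def estimate_cost(hours_by_machine):
    """Return the estimated total cost in USD for the given machine-hours."""
    return sum(HOURLY_RATE_USD[name] * hours
               for name, hours in hours_by_machine.items())


# Example (made-up durations): 40 hours of data preparation, 6 hours of GPU training.
total = estimate_cost({"datamaker": 40, "trainer": 6})
print(f"Estimated cost: ${total:.2f}")
```

Because the trainer is roughly two orders of magnitude more expensive per hour, it is worth stopping that instance whenever it is idle.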


#### Setting up cloud compute environments and filesystem 

Both machines specify AMIs that do not contain prescribed python compute environments, unlike other DL or data science AMIs. To work in the cloud with imagesim, it will be necessary to set up python environments via conda, which must be installed. The following directions show how to install conda on an Ubuntu system and how to install the relevant packages for imagesim.

**Setting up the Datamaker Python environment**
1. ssh into your datamaker instance: `ssh -i <path-to-your-private-sshkey> ubuntu@<datamaker's public IPv4 DNS address>`.
2. Install [miniconda](https://docs.conda.io/en/latest/miniconda.html). First download the installer to the system /tmp directory: `wget -P /tmp https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh`. Then run the installation script: `/bin/bash /tmp/Miniconda3-latest-Linux-x86_64.sh`. Miniconda can be installed in the user's home directory, which is the default install location. To access the conda binary, either exit and open a new ssh session, or simply run `source ~/.bashrc`. You should now have miniconda installed, with the conda commands for creating python environments available on your path.
3. Create a new python 3.8 environment called "science" and activate that environment: `conda create -n science python=3.8 -y && conda activate science`.
4. Install the relevant packages, located in the imagesim repository. This can be done by installing the environment file located in the imagesim repo, or by manually installing the relevant packages: ...

**Setting up the Datamaker filesystem**
1. Create the _data_ directory: `mkdir /home/ubuntu/data`. This is the main subdirectory in which all actions and assets are located within the datamaker machine. The following instructions ought to be carried out with respect to this subdirectory.
2. Create the notebooks directory: `mkdir /home/ubuntu/data/notebooks`.
3. Create the chips directory: `mkdir /home/ubuntu/data/chips`.
4. Create the ard directory: `mkdir /home/ubuntu/data/ard/`
5. Download the relevant ARD data. The current project focuses on a subset of the ordered ARD data, namely those tiles in **UTM 33** that exist at _s3://imagesim-storage/ard/33_. Create the subdirectory for this data with `mkdir /home/ubuntu/data/ard/33`, then copy the data by executing `aws s3 cp s3://imagesim-storage/ard/33 /home/ubuntu/data/ard/33 --recursive`.
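
The directory layout from steps 1-4 can also be created in one go from Python. This is a minimal sketch using only the standard library; the `create_layout` helper is hypothetical (not part of imagesim), and its default root is the `/home/ubuntu/data` location used above. The demonstration at the bottom runs against a temporary directory so that it has no side effects.

```python
import tempfile
from pathlib import Path

# Subdirectories described in the filesystem setup steps.
SUBDIRS = ("notebooks", "chips", "ard/33")


def create_layout(root="/home/ubuntu/data"):
    """Create the datamaker directory tree under `root` and return the paths."""
    root = Path(root)
    created = []
    for sub in SUBDIRS:
        path = root / sub
        # parents=True also creates `root` and `ard`; exist_ok makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for path in create_layout(tmp):
        print(path.relative_to(tmp))
```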


**Setting up the Trainer Python environment**
1. Follow steps 1-4 of the Datamaker Python environment setup above, connecting to the trainer instance instead.



## Getting started with Datamaker

Datamaker is the EC2 instance that is configured for creating OSM labels from ARD data, filtering and processing these data, and computing relevant statistical analyses on them. It can also be used to run a TMS server to serve mosaiced ARD tiles (see section _Running a TMS on Mosaiced ARD data_). The following sections describe how imagesim modules and functions can be used to acquire OSM data, explore and filter that data, and make statistical inferences relevant to subsequent model training.


### OSM data acquisition

After creating and setting up the datamaker instance as described in the previous sections, we can start the data engineering process by acquiring the relevant OSM data. The following code cell shows how the imagesim library can be used to generate OSM data for the ARD tiles in question.

```
from imagesim.scripts.local.osm import fetch_osm_by_quadkey
from imagesim.scripts.constants import DATA_PATH

node_tags, way_tags, relation_tags = fetch_osm_by_quadkey(33, DATA_PATH)
```

This function takes a UTM zone and a data path and writes out raw OSM query results (meaning no tag filtering) under `DATA_PATH`, which specifies your local ARD data path (i.e., `/home/ubuntu/data/ard`). It looks up the level-12 quadkeys that exist under the UTM zone path in the ARD directory structure, and writes the results for each quadkey to a file called _osm_data.json_ in that quadkey's subdirectory.

**osm_data.json**
This data file structure is unique to the imagesim data pipeline. The file contains one dictionary object with the following keys:
* quadkey: The level-12 quadkey which specifies the query geometry from which OSM results were returned
* nodes: The OSM nodes data parsed out from the raw Overpass results
* ways: The OSM ways data parsed out from the raw Overpass results
* relations: The OSM relations data parsed out from the raw Overpass results

This file structure reflects the importance of potentially treating different OSM elements individually in downstream applications.

Once this function has completed, you ought to have an _osm_data.json_ at each ard quadkey subdirectory, ie the filepath `/home/ubuntu/data/ard/33/<level-12-quadkey>/osm_data.json` should exist for each quadkey.
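
To verify the result, one can walk the quadkey subdirectories and summarize each _osm_data.json_. The sketch below assumes that the `nodes`, `ways` and `relations` values are list-like containers (the exact element schema is not specified here); `summarize_osm` is a hypothetical helper and not part of imagesim.

```python
import json
from pathlib import Path

OSM_KEYS = ("nodes", "ways", "relations")


def summarize_osm(zone_dir):
    """Return (summaries, missing) for the quadkey subdirectories of `zone_dir`.

    summaries maps quadkey directory name -> element counts per OSM key;
    missing lists quadkey directories that have no osm_data.json.
    """
    summaries, missing = {}, []
    for quadkey_dir in sorted(p for p in Path(zone_dir).iterdir() if p.is_dir()):
        osm_file = quadkey_dir / "osm_data.json"
        if not osm_file.exists():
            missing.append(quadkey_dir.name)
            continue
        with osm_file.open() as f:
            data = json.load(f)
        # Assumption: each key holds a list-like collection of elements.
        summaries[quadkey_dir.name] = {key: len(data.get(key, [])) for key in OSM_KEYS}
    return summaries, missing
```

On the datamaker machine this would be called as `summarize_osm("/home/ubuntu/data/ard/33")`; a non-empty `missing` list indicates quadkeys for which the OSM fetch should be rerun.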


### ARD Image chipping and filtering 

ARD image creation can potentially deliver imagery that contains no-data values. Imagesim contains modules and functions for filtering and chipping ARD level-12 data tiles into smaller tiles that are used for model training. The following section describes how to create training-level chips and filter those results to include only true-data imagery.

From the command line, invoke ARD chipping as follows:

`python chip.py --ard-path /home/ubuntu/data/ard/33 --zoom 17 --proc 4 --dest /home/ubuntu/data/chips/33`

This chips all available ARD imagery to the output chip path at the specified zoom level. The script supports multiprocessing (`--proc`), which is recommended.

**Image chips: .../chips/33/\<imgchip\>.jpg**
Image chips, intended for model consumption, have the filename structure `Z<utm-zone>-<quadkey>_<catalogid>.jpg`.
The chip filename is prefixed by the letter Z and the UTM zone of origin, followed by a hyphen delimiter, the chip quadkey identifier at the _chip zoom level_, an underscore, and finally the Maxar catalog id of the strip the image was derived from.
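
The filename convention can be unpacked with a small regular expression. The sketch below is illustrative only: `parse_chip_name` is a hypothetical helper (not the imagesim utility), it assumes the quadkey consists of the digits 0-3 with one digit per zoom level, and the catalog id in the example is made up.

```python
import re
from pathlib import Path
from typing import NamedTuple

# Z<utm-zone>-<quadkey>_<catalogid>.jpg
CHIP_PATTERN = re.compile(r"^Z(?P<zone>\d+)-(?P<quadkey>[0-3]+)_(?P<catalog_id>.+)\.jpg$")


class ChipName(NamedTuple):
    utm_zone: int
    quadkey: str
    catalog_id: str

    @property
    def zoom(self):
        # Assumption: one quadkey digit per zoom level.
        return len(self.quadkey)


def parse_chip_name(filename):
    """Split a chip filename (with or without directories) into its components."""
    match = CHIP_PATTERN.match(Path(filename).name)
    if match is None:
        raise ValueError(f"not a chip filename: {filename!r}")
    return ChipName(int(match.group("zone")), match.group("quadkey"), match.group("catalog_id"))


# Made-up example filename following the convention above.
chip = parse_chip_name("Z33-12002133012301230_10300100ABCDEF00.jpg")
print(chip.utm_zone, chip.quadkey, chip.zoom, chip.catalog_id)
```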

It is worth noting that these chip images do not encode geospatial references and are simple jpgs. Their filename construct, however, contains the relevant information to map the chip to its geospatial coordinates. The catalog id extracted from the filename can also be used to map back to the originating ARD tile. Imagesim has a utility to do just this:

`from imagesim.scripts.chip import get_cog_path_from_chip ...`

This is very useful for downstream model QA tools, e.g., cross-referencing model outputs to visualization services like TMS which consume the original ARD filesystem structure.


[**Optional: Create nodata-index**]

Sometimes ARD data is delivered with no-data values, for example when the ordering API options were not set appropriately. The following command consumes all of the chip imagery generated in the previous step, performs a "no data" test on the corpus, and writes out a JSON index that maps each chip filename to a classification of "no data," "partial data," or "all data." It does not create, modify or delete any of the existing image files.

`python chip.py --filename /home/ubuntu/data/nodata-index.json --chip-dir /home/ubuntu/data/chips/33`
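
To make the three classes concrete, here is a minimal NumPy sketch of such a test. It is not the imagesim implementation: it assumes that a no-data pixel is one whose value equals `nodata_value` (zero by default) in every band, and `classify_nodata` is a hypothetical name.

```python
import numpy as np


def classify_nodata(pixels, nodata_value=0):
    """Classify an image array of shape (H, W) or (H, W, bands)."""
    pixels = np.asarray(pixels)
    if pixels.ndim == 2:
        # Treat a single-band image as having one band.
        pixels = pixels[..., np.newaxis]
    # A pixel counts as empty only if every band equals the no-data value.
    empty = np.all(pixels == nodata_value, axis=-1)
    if empty.all():
        return "no data"
    if empty.any():
        return "partial data"
    return "all data"


# Synthetic 4x4 RGB chip whose left half is empty.
chip = np.full((4, 4, 3), 128, dtype=np.uint8)
chip[:, :2, :] = 0
print(classify_nodata(chip))
```

Since the chips are JPEGs, compression can leave no-data regions slightly above zero, so a real test may need a small tolerance rather than an exact comparison.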


At this point, all of the raw data acquisition requirements are satisfied and in their proper place on the filesystem. The following section is intended as a guide by which the imagesim library and associated tools can be used for the purpose of data exploration, summarization, filtering and processing. 


## Exploring and operating on massive datasets / Selecting and Constructing Training data

Probably the most important part of the Imagesim pipeline, in terms of obtaining useful model outputs, is the construction of images and labels. The raw data space is high-dimensional, noisy, and, it can be assumed, unbalanced both in sample sizes and in spatial distribution.

To complicate matters further, it is not obvious how the choice of features might positively or negatively affect the underlying target distribution, either for the feature similarity embeddings or for the classification space.

Furthermore, multilabel classification problems are only just becoming an area of interest in the academic and research domain. Some relevant questions include: what subset of OSM labels characterizes the feature space well at a particular physical scale? I.e., how do things change if the images are chipped at zoom level 18 instead of 17? I will elaborate more on these questions later.

Whatever approach is taken in sample definition, it should take into account some of the basic considerations that are standard for the case of multiclass classification on imagery, and go from there.

Practically, the technical requirements to implement any such strategy are non-trivial considering the massive amount of data that lives on the filesystem. Attempting to work with all of the data in memory is **not viable** for many computing operations on common tabular data structures.
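
One way to stay within memory limits is to stream tabular data instead of loading it whole. The sketch below counts label frequencies one row at a time with the standard library. It assumes a CSV whose first column is the chip filename and whose remaining columns are 0/1 label indicators (the layout read by the dataset class later in this document); `count_labels` is a hypothetical helper. The same pattern applies with `pandas.read_csv(..., chunksize=...)`.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file):
    """Stream a label table and return (row_count, positives per label)."""
    reader = csv.reader(csv_file)
    header = next(reader)
    label_names = header[1:]  # column 0 is the chip filename
    counts = Counter({name: 0 for name in label_names})
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name, value in zip(label_names, row[1:]):
            counts[name] += int(float(value))
    return n_rows, counts


# Tiny in-memory example; on disk, pass an open file handle instead.
sample = io.StringIO("Feature,building,road\na.jpg,1,0\nb.jpg,1,1\n")
n_rows, counts = count_labels(sample)
print(n_rows, dict(counts))
```

Only one row is held in memory at a time, so the same function works on tables far larger than RAM.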


<TODO: Pre-defined filtering assets, PYGEOS-based SRTree spatial joins (geopandas latest)>

### Training data structure: Dense vs Sparse Encodings

The final 

## Model definition for multilabel classification

Our model definition is relatively straightforward: as a backbone, we use a ResNet50 pre-trained on ImageNet, with initially frozen weights. We attach a fully connected feature classifier at the head, and will select one of its last layers for encoding a low-dimensional representation of our feature space. We can expect these last layers to learn a spatial embedding as a function of the physical image features that are well represented by our labeling schema... if we set it up right!


```
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # used by the visualization helpers
import seaborn as sns
import skimage.io as skio
from PIL import Image

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, Dataset, random_split
from torchvision import models, transforms

import torch.optim as optim
from torch.optim import lr_scheduler

import pytorch_lightning as pl  # base class of MultilabelResnetFC

from tqdm import trange
from sklearn.metrics import precision_score, f1_score
from sklearn.preprocessing import MultiLabelBinarizer
```


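```
# Edge length (pixels) of the network input. NOTE: chip_size is not defined
# anywhere in the original notebook; 224 is an assumed value.
chip_size = 224

# Normalization uses the ImageNet channel means and standard deviations,
# matching the pre-trained backbone.
eval_transform = transforms.Compose([
    transforms.Resize((chip_size, chip_size)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

train_transform = transforms.Compose([
    transforms.Resize((chip_size, chip_size)),
#    transforms.RandomRotation(45),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
```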


```
# NOTE: SkywayDataset is defined further down; run that cell first.
batch_size = 32
dataset = SkywayDataset("/home/ubuntu/data/sample_sparse_encodings.csv",
                        "/home/ubuntu/data/chips",
                        transforms=train_transform)

# Hold out 20% of the samples for validation
valid_no = int(len(dataset) * 0.20)

training_set, validation_set = random_split(dataset, [len(dataset) - valid_no, valid_no])
#print(f'''training set length: {len(training_set)}, validation set length: {len(validation_set)}''')

# Shuffling is only needed for training
dataloader = {"train": DataLoader(training_set, shuffle=True, batch_size=batch_size),
              "val": DataLoader(validation_set, shuffle=False, batch_size=batch_size)}
```


```
def visualize_label_dist(df):
    # Pie chart of label frequencies; column 0 holds the chip filename
    fig1, ax1 = plt.subplots(figsize=(10, 10))
    df.iloc[:, 1:].sum(axis=0).plot.pie(autopct='%1.1f%%', shadow=True, startangle=90, ax=ax1)
    ax1.axis("equal")
    plt.show()


def visualize_label_corr(df):
    # Pairwise correlation between the label columns
    sns.heatmap(df.iloc[:, 1:].corr(), cmap="RdYlBu", vmin=-1, vmax=1)


def visualize_image(df, idx, classes, img_dir="/home/ubuntu/data/chips"):
    # df: encodings table (column "Feature" = chip filename, remaining columns = 0/1 labels)
    # classes: label names in the same order as the label columns
    row = df.iloc[idx]
    filename = row.Feature
    label = row.iloc[1:].tolist()
    print(filename)

    image = Image.open(os.path.join(img_dir, filename))
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.imshow(image)
    ax.grid(False)
    # Keep only the names of the labels that are set for this chip
    active = np.array(classes)[np.array(label, dtype=bool)]
    for i, s in enumerate(active):
        ax.text(0, i*20, s, verticalalignment='top', color='white', fontsize=16, weight='bold')
    plt.show()
    return image
```



```
class SkywayDataset(Dataset):
    # Reads a CSV whose "Feature" column holds the chip filename (relative to
    # img_dir) and whose remaining columns hold one 0/1 indicator per label.
    def __init__(self, csv_file, img_dir, transforms=None):
        self.df = pd.read_csv(csv_file)
        self.img_dir = img_dir
        self.transforms = transforms

    def __getitem__(self, idx):
        d = self.df.iloc[idx]
        image = Image.open(os.path.join(self.img_dir, d.Feature)).convert("RGB")
        # Everything after the first column is the multilabel target vector
        label = torch.tensor(d.iloc[1:].tolist(), dtype=torch.float32)

        if self.transforms is not None:
            image = self.transforms(image)

        return image, label

    def __len__(self):
        return len(self.df)
```
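
The table read by `SkywayDataset` has a `Feature` column followed by one 0/1 column per label. As an illustration of how such a table could be produced, the sketch below turns per-chip label lists into multi-hot rows using only the standard library. The `multi_hot_rows` helper, the filenames and the label names are made up for this example and do not reflect the actual imagesim encoding step.

```python
import csv
import io


def multi_hot_rows(chip_labels, classes):
    """Map {filename: [label, ...]} to rows of [filename, 0/1, 0/1, ...]."""
    index = {name: i for i, name in enumerate(classes)}
    rows = []
    for filename, labels in sorted(chip_labels.items()):
        encoding = [0] * len(classes)
        for label in labels:
            encoding[index[label]] = 1
        rows.append([filename] + encoding)
    return rows


classes = ["building", "road", "water"]  # illustrative label names
chip_labels = {"a.jpg": ["building"], "b.jpg": ["road", "water"]}

# Write the table with the header layout expected by the dataset class.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Feature"] + classes)
writer.writerows(multi_hot_rows(chip_labels, classes))
print(buffer.getvalue())
```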


```
def create_head(num_features, num_classes, dropout_prob=0.5, activation_func=nn.ReLU):
    # Two shrinking hidden blocks (Linear -> activation -> BatchNorm -> Dropout)
    # followed by a single linear layer that emits one logit per class.
    features_lst = [num_features, num_features//2, num_features//4]
    layers = list()
    for in_f, out_f in zip(features_lst[:-1], features_lst[1:]):
        layers.append(nn.Linear(in_f, out_f))
        layers.append(activation_func())
        layers.append(nn.BatchNorm1d(out_f))
        if dropout_prob != 0:
            layers.append(nn.Dropout(dropout_prob))
    # The classifier layer is appended once, after the hidden blocks
    layers.append(nn.Linear(features_lst[-1], num_classes))
    return nn.Sequential(*layers)
```
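
For a sense of scale, the size of such a head can be worked out by hand. The sketch below counts trainable parameters for the layer widths used by `create_head` with a single final classification layer, assuming the ResNet50 feature width of 2048 and the 13 labels used in this document; `head_param_count` is a hypothetical helper and uses no deep learning library.

```python
def head_param_count(num_features, num_classes):
    """Count trainable parameters of the fully connected head."""
    widths = [num_features, num_features // 2, num_features // 4]
    total = 0
    for in_f, out_f in zip(widths[:-1], widths[1:]):
        total += in_f * out_f + out_f  # Linear: weights + biases
        total += 2 * out_f             # BatchNorm1d: scale + shift
        # The activation and Dropout layers have no parameters.
    total += widths[-1] * num_classes + num_classes  # final classifier layer
    return total


print(head_param_count(2048, 13))
```

Most of the parameters sit in the first linear layer (2048 to 1024), which is worth keeping in mind when choosing which layer to use as the embedding.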




```
class MultilabelResnetFC(pl.LightningModule):
    '''
    Resnet50 backbone with a fully connected (FC) head for multilabel classification.

    Work in progress, written against the early pytorch-lightning API
    (dict-style step outputs, validation_epoch_end, pl.data_loader).
    '''

    transforms = {
        'train': train_transform,
        'val': eval_transform,
        'test': eval_transform,
    }

    def __init__(self, hparams=None):
        super().__init__()
        self.hparams = hparams
        self.resnet = models.resnet50(pretrained=True)
        # TODO: define the head and attach it to self.resnet

    @staticmethod
    def loss(*args):
        # Assumption: binary cross-entropy on raw logits, matching the
        # BCEWithLogitsLoss criterion used by the plain training loop.
        return F.binary_cross_entropy_with_logits(*args)

    def forward(self, x):
        # Not implemented yet: depends on the head definition above
        raise NotImplementedError

    def training_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        loss = self.loss(y_hat, y)
        return {
            'loss': loss,
            'log': {'training/loss': loss},
        }

    def validation_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        # y_hat holds logits, so threshold the sigmoid before comparing to labels
        preds = (torch.sigmoid(y_hat) > 0.35).float()
        accuracy = (y == preds).float().mean()
        return {'val_loss': self.loss(y_hat, y), 'accuracy': accuracy}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        accuracy = torch.stack([x['accuracy'] for x in outputs]).mean()
        tensorboard_logs = {
            'validation/accuracy': accuracy,
            'validation/loss': avg_loss,
        }
        return {'val_loss': avg_loss, 'log': tensorboard_logs}

    def test_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        return {'test_loss': self.loss(y_hat, y)}

    def test_epoch_end(self, outputs):
        avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()
        return {'avg_test_loss': avg_loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.hparams.learning_rate, momentum=self.hparams.momentum)

    def dataloader(self, phase):
        # NOTE: CocoClassificationDataset is not defined in this document
        # and must be provided elsewhere.
        coco_path = os.path.join(self.hparams.data_path, f'{phase}.json')
        dataset = CocoClassificationDataset(coco_path, transform=self.transforms[phase])
        shuffle = phase == 'train'
        return DataLoader(dataset, batch_size=self.hparams.batch_size, shuffle=shuffle, num_workers=4)

    # pl.data_loader is a decorator from early pytorch-lightning releases
    @pl.data_loader
    def train_dataloader(self):
        return self.dataloader('train')

    @pl.data_loader
    def val_dataloader(self):
        return self.dataloader('val')

    @pl.data_loader
    def test_dataloader(self):
        return self.dataloader('test')

    @staticmethod
    def add_model_specific_args(parser):
        parser.add_argument('--learning_rate', type=float, default=0.01)
        parser.add_argument('--momentum', type=float, default=0.9)
        parser.add_argument('--batch_size', type=int, default=16)
        parser.add_argument('--data_path', type=str, required=True)
        return parser
```
    
    
    
    
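```
def freeze_pretrained(model):
    # Disable gradients for every parameter of the pre-trained network
    for param in model.parameters():
        param.requires_grad_(False)
    return model


# Pre-trained backbone with initially frozen weights
model = freeze_pretrained(models.resnet50(pretrained=True))
num_features = model.fc.in_features

# Replace the ImageNet classifier with a trainable multilabel head (13 labels)
top_head = create_head(num_features, 13)
model.fc = top_head


criterion = nn.BCEWithLogitsLoss()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Optimize only the parameters that still require gradients (the new head)
optimizer = optim.Adam([p for p in model.parameters() if p.requires_grad], lr=0.001)
# NOTE: eta_min (0.005) is larger than the initial lr (0.001), so this schedule
# moves the learning rate upward; lower eta_min if a decaying schedule is intended.
sgdr_partial = lr_scheduler.CosineAnnealingLR(optimizer, T_max=5, eta_min=0.005)
```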


```
def train(model, data_loader, criterion, optimizer, scheduler, num_epochs=5):
    for epoch in trange(num_epochs, desc="Epochs"):
        result = []
        for phase in ["train", "val"]:
            if phase == "train":
                model.train()
            else:
                model.eval()

            # Keep track of the loss and the samples-averaged F1 for this phase
            running_loss = 0.0
            running_f1 = 0.0

            for data, target in data_loader[phase]:
                data, target = data.to(device), target.to(device)

                if phase == "train":
                    # Clear gradients left over from the previous batch
                    optimizer.zero_grad()

                with torch.set_grad_enabled(phase == "train"):
                    # Feed input
                    output = model(data)
                    # Calculate loss
                    loss = criterion(output, target)
                    # Threshold the sigmoid outputs to get 0/1 predictions
                    predictions = (torch.sigmoid(output) > 0.35).to(torch.int)

                    if phase == "train":
                        # Backwards pass: compute gradient of the loss w.r.t. model params
                        loss.backward()
                        # Update model params
                        optimizer.step()

                running_loss += loss.item() * data.size(0)
                running_f1 += f1_score(target.cpu().to(torch.int).numpy(),
                                       predictions.cpu().numpy(),
                                       average="samples") * data.size(0)

            if phase == "train":
                # Step the scheduler once per epoch, after the optimizer updates
                scheduler.step()

            # Per-phase averages over the whole dataset
            epoch_loss = running_loss / len(data_loader[phase].dataset)
            epoch_f1 = running_f1 / len(data_loader[phase].dataset)

            result.append('{} Loss: {:.4f}, F1: {:.4f}'.format(phase, epoch_loss, epoch_f1))
        print(result)
```
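
The training loop above scores each batch with a samples-averaged F1 after thresholding the sigmoid outputs at 0.35. The NumPy sketch below spells out what that metric computes: an F1 score per image over its label set, averaged over images. It is an illustration rather than a replacement for `sklearn.metrics.f1_score`; the `samples_f1` name is hypothetical, and samples with no true and no predicted labels are scored as 0 here.

```python
import numpy as np


def samples_f1(targets, probs, threshold=0.35):
    """Mean per-sample F1 for multilabel targets and predicted probabilities."""
    targets = np.asarray(targets, dtype=bool)
    preds = np.asarray(probs) > threshold
    # Per sample: labels that are both true and predicted
    true_pos = np.logical_and(targets, preds).sum(axis=1)
    # F1 = 2 * TP / (number of true labels + number of predicted labels)
    denom = targets.sum(axis=1) + preds.sum(axis=1)
    per_sample = np.where(denom > 0, 2.0 * true_pos / np.maximum(denom, 1), 0.0)
    return float(per_sample.mean())


targets = [[1, 0, 1], [0, 1, 0]]
probs = [[0.9, 0.4, 0.2], [0.1, 0.8, 0.3]]
print(samples_f1(targets, probs))
```

In the example, the first image has two true labels and two predicted labels with one in common (F1 of 0.5), and the second is predicted exactly (F1 of 1.0). Lowering the threshold trades precision for recall on each image.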

### Understanding performance via Alternative Learning Strategies and SOTA Feature Embedding 

There are a variety of 