Initial draft of data specifications for cell classification task.

## Summary

Files must be zip containers (filenames will end in '.zip') and include:
- [X.npy array with dimensions (1, y, x, c); raw data](#load-X-data)
- y.npy array with dimensions (1, y, x, 1); instance labels
- [channel_names.json](#channel-names)
- [classes/cell_type.json](#cell-type-class-specs)

This notebook helps create each component of the file and save it in the correct output format. It also provides an example of how to extract the file contents after annotation using the Python `zipfile` library.
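As a rough sketch of the container layout listed above, the four components could be written to and read from a single zip file as follows. The function names `write_container` and `read_container` are hypothetical, and the use of `ZIP_DEFLATED` compression is an assumption; only the member names come from the list above.

```python
import io
import json
import zipfile

import numpy as np


def write_container(path, X, y, channel_names, cell_types):
    """Write the four required components into one zip file (hypothetical helper)."""
    # compression choice is an assumption, not part of the spec above
    with zipfile.ZipFile(path, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
        for name, arr in (('X.npy', X), ('y.npy', y)):
            buf = io.BytesIO()
            np.save(buf, arr)  # serialize the array in .npy format, in memory
            zf.writestr(name, buf.getvalue())
        # json.dumps turns integer dictionary keys into strings
        zf.writestr('channel_names.json', json.dumps(channel_names))
        zf.writestr('classes/cell_type.json', json.dumps(cell_types))


def read_container(path):
    """Read the components back out of the zip file (hypothetical helper)."""
    with zipfile.ZipFile(path) as zf:
        X = np.load(io.BytesIO(zf.read('X.npy')))
        y = np.load(io.BytesIO(zf.read('y.npy')))
        channel_names = json.loads(zf.read('channel_names.json'))
        cell_types = json.loads(zf.read('classes/cell_type.json'))
    return X, y, channel_names, cell_types
```

Both helpers accept either a filename ending in `.zip` or an open binary file object, since `zipfile.ZipFile` supports both.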

In [None]:
import os
import re

import imageio
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from caliban_toolbox.utils.misc_utils import sorted_nicely

In [None]:
def sanitize(x):
    """Strip out non-alphanumeric characters from a string.
    
    https://stackoverflow.com/a/1276774
    
    returns lowercase version of string to help compare
    possible variations of channel or class names, eg:
        - 'B cell' vs 'Bcell' vs 'b_cell'
    
    Note that this will strip out '+' and '-' characters,
    so if that is the only difference between two class names,
    problems may arise! Use 'pos' or 'neg' when creating names
    for lineage classifications instead.
    """
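    # [\W_] rather than \W: underscores count as word characters,
    # so they must be listed explicitly to be stripped
    return re.sub(r'[\W_]+', '', x).lower()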
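To illustrate the '+' and '-' warning in the docstring, here is a small sketch that groups names which become identical once punctuation is removed. `normalize` is a stand-in written for this example (it keeps ASCII letters and digits only), `find_collisions` is a hypothetical helper, and the class names are made up.

```python
import re
from collections import defaultdict


def normalize(name):
    # stand-in normalizer for this sketch: keep ASCII letters and digits only
    return re.sub(r'[^A-Za-z0-9]+', '', name).lower()


def find_collisions(names):
    """Group names that become identical after normalization."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    # only keep groups where two or more distinct names collapsed together
    return {key: group for key, group in groups.items() if len(group) > 1}


# made-up class names: the '+' and '-' variants collide, the 'pos' variant does not
collisions = find_collisions(['CD4+ T cell', 'CD4- T cell', 'CD4pos T cell', 'B cell'])
print(collisions)  # {'cd4tcell': ['CD4+ T cell', 'CD4- T cell']}
```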

In [None]:
def check_tif_ext(filename):
    """Return True if filename has a .tif or .tiff extension and is not a dotfile."""
    # dotfile not a legitimate .tif file
    if filename.startswith('.'):
        return False
    return os.path.splitext(filename.lower())[1] in ('.tif', '.tiff')

In [None]:
# example data
DATA_DIR = os.path.abspath('../data/cell_classification_example')

### Starting from predicted classifications
Annotation files for DCL will likely need to be prepared from existing data in a different format.

In [None]:
# load the existing cell classification mapping
example_key = os.path.join(DATA_DIR, 'cell_key.csv')
example_key_df = pd.read_csv(example_key, header=None)

# this key describes the values in the pixel-level classification array;
# we need it to convert that array into the
# cell class assignment dictionary we require
example_key_df

In [None]:
# load and preview the pixel-level classification predictions
example_prediction_path = os.path.join(DATA_DIR, 'Point1', 'Point1_cell_overlay.tiff')
example_prediction_arr = imageio.imread(example_prediction_path)

classes_cmap = plt.get_cmap('Dark2')
classes_cmap.set_bad('black')
fig, ax = plt.subplots(figsize=(10, 10))

ax.imshow(np.ma.masked_equal(example_prediction_arr, 0),
          cmap=classes_cmap)
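One possible way to turn a pixel-level classification array like `example_prediction_arr` into per-label assignments is a majority vote over the pixels of each instance label. This is only a sketch under that assumption: the source does not specify the conversion, `assign_classes` is a hypothetical helper, the toy arrays are made up, and any translation of pixel values through `cell_key.csv` is left out.

```python
import numpy as np


def assign_classes(label_img, class_img):
    """Map each nonzero instance label to its most common pixel-level class."""
    assignments = {}
    for label in np.unique(label_img):
        if label == 0:  # 0 is background, not a cell
            continue
        values, counts = np.unique(class_img[label_img == label], return_counts=True)
        # on a tie, argmax picks the smallest class value
        assignments[int(label)] = int(values[np.argmax(counts)])
    return assignments


# toy 3x4 example with two cells (labels 1 and 2)
label_img = np.array([[1, 1, 0, 2],
                      [1, 1, 0, 2],
                      [0, 0, 0, 2]])
class_img = np.array([[2, 2, 0, 1],
                      [2, 3, 0, 1],
                      [0, 0, 0, 1]])

# assignments are grouped by frame; these files only have frame 0
assignments = {0: assign_classes(label_img, class_img)}
print(assignments)  # {0: {1: 2, 2: 1}}
```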

### Channel names <a name="channel-names"></a>
Information about channel names is stored at the top level of the zipfile in `channel_names.json`.

This file contains a dictionary mapping each channel index to its name (usually the name of the marker used). It is a dictionary rather than a list so that each channel's index is visible alongside its name.

This file is referenced for general display purposes, as well as to determine the channel indices used for designated channel "preset" combinations.

As a python dict:
```python
{
    0: 'Au',
    1: 'Background',
    2: 'C',
    3: 'Ca',
    4: 'CC3'
}
```

As JSON (object keys are always strings, so the integer indices become `"0"`, `"1"`, and so on):
```json
{
  "0": "Au",
  "1": "Background",
  "2": "C",
  "3": "Ca",
  "4": "CC3"
}
```
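Because JSON object keys are strings, the integer indices need to be restored after loading if we want to index by number. A minimal sketch of that round trip, using the example values above:

```python
import json

channel_names = {0: 'Au', 1: 'Background', 2: 'C', 3: 'Ca', 4: 'CC3'}

# json.dumps converts the integer keys to strings
encoded = json.dumps(channel_names, indent=2)

# convert the keys back to integers after loading
decoded = {int(k): v for k, v in json.loads(encoded).items()}
assert decoded == channel_names
```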

### Load channels to make the array and channel name info<a name="load-X-data"></a>
The provided example data has one .tif for each channel. This section loads images in this format into a numpy array (`X.npy`) and creates the mapping of channel names to indices.

In [None]:
images_folder = os.path.join(DATA_DIR, 'Point1', 'TIFs')

# get list of files
file_list = sorted_nicely(os.listdir(images_folder))

# get list of .tif images (no dotfiles either)
img_name_list = [f for f in file_list if check_tif_ext(f)]

In [None]:
# load images and store in a list
img_list = [imageio.imread(os.path.join(images_folder, i)) for i in img_name_list]

# stack the images along channel axis
X = np.stack(img_list, axis=-1)
# array also needs a trivial Z or T dimension
X = np.expand_dims(X, axis=0)

# array should now have shape of (1, height, width, num_channels)
print(X.shape, X.dtype)
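Once the instance labels `y` are available, a quick check against the shapes listed in the summary can catch layout mistakes early. `check_shapes` is a hypothetical helper, and the requirement that `y` has an integer dtype is our assumption rather than something the spec states.

```python
import numpy as np


def check_shapes(X, y):
    """Raise ValueError if X and y do not match the expected layout."""
    if X.ndim != 4 or X.shape[0] != 1:
        raise ValueError(f'X should have shape (1, y, x, c), got {X.shape}')
    expected = X.shape[:3] + (1,)  # same leading and spatial dims, one channel
    if y.shape != expected:
        raise ValueError(f'y should have shape {expected}, got {y.shape}')
    # assumption: instance labels are stored as integers
    if not np.issubdtype(y.dtype, np.integer):
        raise ValueError(f'y should hold integer labels, got dtype {y.dtype}')
```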

In [None]:
# create dictionary with channel names and indices
channel_names = {i: os.path.splitext(name)[0] for i, name in enumerate(img_name_list)}

### Cell type classifications<a name="cell-type-class-specs"></a>
Cell types are also known as cell developmental lineages. The file is named `cell_type.json` to avoid confusion with the `lineage.json` file, which contains annotations of cell divisions in files for our live cell tracking project. It is stored within the `classes` folder of the zip container.

This file defines each class (an integer id mapped to the class name and related information) and records the class assigned to each label in the file. The assignments for each instance label are grouped on a frame-by-frame basis, although these files only contain one frame's worth of assignments (frame `0`).

We've pre-populated values for anticipated classes, like the `'Tumor'` class below. Because the classes are defined and stored alongside the assigned values, these initial class definitions and groupings can be safely added to or modified.

For each class specified in this file, helpful channels can also be defined, which will be displayed in the Label app as "preset" combinations to help annotation.
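The lookup from a class's markers to channel indices could be sketched roughly as below. This is not the actual file-creation code: `channels_for_markers` is a hypothetical helper, the channel names are made up, and markers without a matching channel are simply skipped here. The `key` argument controls how names are compared; passing the notebook's `sanitize` function instead of `str.lower` would give looser matching.

```python
def channels_for_markers(markers, channel_names, key=str.lower):
    """Return the channel indices whose names match the given markers."""
    lookup = {key(name): idx for idx, name in channel_names.items()}
    # markers with no matching channel are skipped rather than raising
    return [lookup[key(m)] for m in markers if key(m) in lookup]


# made-up channel names for illustration
channel_names = {0: 'Au', 1: 'PanCK', 2: 'ECAD', 3: 'CK7'}

# 'CD45' has no matching channel in this example, so it is skipped
tumor_channels = channels_for_markers(['panck', 'ecad', 'CK7', 'CD45'], channel_names)
print(tumor_channels)  # [1, 2, 3]
```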

As a Python dict, the contents of `cell_type.json` look like:
```python
{
    'semantic_ids': {
        # class_id: class_information
        0: {
            'name': 'unassigned',
            'markers': None,
            'channels': None
        },
        1: {
            'name': 'Tumor',
            'markers': ['PanCK', 'ECAD', 'CK7'],
            # whichever channel indices apply to the file;
            # channel indices are automatically determined when creating the file
            # these don't need to (and shouldn't be) defined manually
            'channels': [1, 2, 3]
        },
        # several more entries
    },
    'assignments': {
        # frame number: label assignments in that frame
        0: {
            # label_id: class_id
            # no entry for "label" 0
            1: 5,  # class 5
            2: 1,  # class 1 = Tumor class in this example
            # okay if labels are not sequential, but we try to avoid this
            4: 0,  # unassigned class
        }
    }
}
```

As JSON, this looks like:
```json
{
  "semantic_ids": {
    "0": {
      "name": "unassigned",
      "markers": null,
      "channels": null
    },
    "1": {
      "name": "Tumor",
      "markers": [
        "PanCK",
        "ECAD",
        "CK7"
      ],
      "channels": [
        1,
        2,
        3
      ]
    }
  },
  "assignments": {
    "0": {
      "1": 5,
      "2": 1,
      "4": 0
    }
  }
}
```
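When reading this file back, the string keys can be converted to integers, and the assignments can be checked against the defined classes. The sketch below uses hypothetical helpers (`load_cell_types`, `undefined_classes`) and a shortened version of the JSON above; class `5` shows up as undefined only because the other class entries are elided in this example.

```python
import json


def load_cell_types(text):
    """Parse cell_type.json text and restore integer keys."""
    raw = json.loads(text)
    semantic_ids = {int(k): v for k, v in raw['semantic_ids'].items()}
    assignments = {
        int(frame): {int(label): cls for label, cls in labels.items()}
        for frame, labels in raw['assignments'].items()
    }
    return {'semantic_ids': semantic_ids, 'assignments': assignments}


def undefined_classes(cell_types):
    """Return class ids that are assigned to a label but never defined."""
    assigned = {cls for labels in cell_types['assignments'].values()
                for cls in labels.values()}
    return sorted(assigned - set(cell_types['semantic_ids']))


# shortened version of the JSON example above
text = '''{"semantic_ids": {"0": {"name": "unassigned"}, "1": {"name": "Tumor"}},
           "assignments": {"0": {"1": 5, "2": 1, "4": 0}}}'''
cell_types = load_cell_types(text)
print(undefined_classes(cell_types))  # [5]
```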