# Data Processing for Crowd Annotation Pipeline

1. Download job report from Figure Eight
2. Download annotated images from report
3. Clean up annotations (remove small objects, fill small holes, sequentially label the annotations)
4. Process annotations and raw images to match each other in size. Can either:
    - chop raw images into corresponding pieces
    - recombine annotation subimages into full-size images
5. Combine raw and annotated images into npz format
6. Images are now ready to be used as training data!

Files are named by these scripts such that the code blocks can run back-to-back with minimal input. For this reason, it is recommended that users run through the whole pipeline before processing another set of images. The user can specify a few directory names and the "identifier" used in pre-annotation and run all cells in the notebook; alternate folder names can be used but this is not recommended.

To function properly, your working folder should contain subfolders:
- json_logs  
    - log from overlapping_chopper ({identifier}\_overlapping\_chopper_log.json)
- raw images (can be named "raw" or something else)
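
The folder layout above can be checked before running the rest of the notebook. The following is a minimal sketch, not part of deepcell_toolbox: `check_working_folder` and its `raw_folder` argument are hypothetical names, and the log filename pattern is assumed to be the one listed above.

```python
import os

def check_working_folder(base_dir, identifier, raw_folder="raw"):
    """Return a list of problems found; an empty list means the layout looks fine."""
    problems = []

    # log written by overlapping_chopper during the pre-annotation pipeline
    log_path = os.path.join(base_dir, "json_logs",
                            identifier + "_overlapping_chopper_log.json")
    if not os.path.isfile(log_path):
        problems.append("missing chopper log: " + log_path)

    # folder holding the full-size raw images (name is an assumption)
    raw_path = os.path.join(base_dir, raw_folder)
    if not os.path.isdir(raw_path):
        problems.append("missing raw image folder: " + raw_path)

    return problems
```

Calling `check_working_folder(base_dir, identifier)` and printing the returned list gives a quick summary of what still needs to be put in place.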

The user will also need to supply:
- job ID for the data to download from Figure Eight
- API key for Figure Eight
- "identifier" to access correct json logs and name files correctly

If the default folder names are used, by the end of this pipeline, the working folder (base_dir) will contain the following new files and directories:

- CSV/job report downloaded from Figure Eight
- annotations  
    - downloaded annotations from Figure Eight, cleaned, greyscale
- (optional) chopped raw images
- (optional) stitched annotations
- training
    - "all"
        - contains all the raw and annotated images, separated into raw/annotated subfolders but not separated into subfolders beyond that
    - other subfolders depending on what "make_training_data" needs for subfolder structure
- npz (folder)
    - .npz file(s) containing training data


In [None]:
#import statements
import json
import numpy as np
import os
import shutil
import sys
import stat

from deepcell_toolbox.pre_annotation.overlapping_chopper import overlapping_crop_dir
from deepcell_toolbox.post_annotation.download_csv import download_and_unzip, save_annotations_from_csv
from deepcell_toolbox.post_annotation.clean_montages import clean_montages, relabel_montages, convert_grayscale_all
from deepcell_toolbox.post_annotation.montages_to_movies import raw_movie_maker
from deepcell_toolbox.post_annotation.overlapping_stitcher import overlapping_stitcher_folder
from deepcell_toolbox.post_annotation.post_annotation_training_data import post_annotation_make_training_data

#from deepcell_toolbox.utils.dctb_data_utils import make_training_data
from deepcell_toolbox.utils.io_utils import get_img_names

#used to change permissions on folders as they are created
#allows user to access folders from file explorer
#user can delete intermediate folders (eg, contrast-adjusted raw images) once pipeline is finished
#also convenient for moving and editing files (eg, manual correction of images with Fiji)
perm_mod = stat.S_IRWXO | stat.S_IRWXU | stat.S_IRWXG

In [None]:
#set working directory
#base_dir = "/base/directory/path/here"
#raw_dir = "/base/directory/path/here/folder_with_fullsize_raw_images" <- no trailing slash!

base_dir = "/gnv_home/data/testing/post_processing/set1"
raw_dir = "/gnv_home/data/testing/post_processing/set1/FITC_medium_overlay_phase_medium"

#identifier given during pre-annotation pipeline; if you're not sure, it's also in the job report csv
identifier = "HEK293_AM_cyto_medium_s0"

## 1. Download job report from Figure Eight
By default, this script will download, unzip, and rename the full report from Figure Eight as a .csv file. However, the user can change the report type if one of the other report options is more suitable for their use. (Support for other report types is not guaranteed in version 0 of this notebook.)

The user can specify where the zip file should be downloaded and the .csv extracted; by default, the .csv file will be put into a subfolder named CSV (likely the same folder that contained the input data; the CSV files are named to prevent confusion). The report CSV will be renamed "job_{job_number}\_{type of report}\_report.csv".

#### From Figure Eight website:
full - Returns the Full report containing every judgment

aggregated - Returns the Aggregated report containing the aggregated response for each row

json - Returns the JSON report containing the aggregated response, as well as the individual judgments

gold_report - Returns the Test Question report

workset - Returns the Contributor report

source - Returns a CSV of the source data uploaded to the job
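
A typo in the report type is easy to catch before anything is downloaded. The sketch below is illustrative only: `REPORT_TYPES`, `check_report_type`, and `report_csv_name` are hypothetical helpers, and the filename pattern is the "job_{job_number}_{type of report}_report.csv" convention described above. Report types other than "full" may not produce a CSV at all.

```python
# report types listed above, taken from the Figure Eight descriptions
REPORT_TYPES = ("full", "aggregated", "json", "gold_report", "workset", "source")

def check_report_type(job_type):
    """Return job_type unchanged if it is a known report type, else raise ValueError."""
    if job_type not in REPORT_TYPES:
        raise ValueError("unknown report type {!r}; expected one of {}".format(
            job_type, ", ".join(REPORT_TYPES)))
    return job_type

def report_csv_name(job_id, job_type):
    """Build the renamed report filename following the convention described above."""
    return "job_{}_{}_report.csv".format(job_id, check_report_type(job_type))
```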

In [None]:
job_id_to_download = 1388351
job_type = "full"

In [None]:
download_and_unzip(job_id_to_download, base_dir)

## 2. Use report to download annotations
This script uses the information in the report to download each annotation. Montage annotations will be saved in the "annotations" subfolder (it will be created for you by the script).

Raw images that could not be annotated (those with "broken_link = True") will not be downloaded in this step. If a job contains rows with broken links, the information will be put into two csv files. The first, "job_number_full_report_broken_links.csv", contains all of the metadata from the full job report, in case the user wants to inspect it for a pattern in the broken links. The second, "job_number_reupload.csv", has only the information used to upload the images originally (identifier and annotation_url). If the images are suitable for annotation, the user can add this csv to the Figure Eight job to obtain annotations for those images. (Alternatively, the user may need to go through part of the pre-annotation pipeline to fix and reupload the images in question.)

If there are no broken links in the job, or if every image with a broken link also has an annotation later in the job report (for example, because the user has reuploaded those rows to the job), the secondary csv files will not be created.

This function returns a list of the image names of any images with broken links; this list will be used later in the pipeline to automatically skip images when stitching images together or making training data.
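
The rule described above, where an image is only dropped when none of its report rows carries an annotation, can be sketched as follows. This is a simplified illustration and not the implementation of save_annotations_from_csv. It assumes each report row is a dict with a "broken_link" field stored as text and that "annotation_url" identifies an image uniquely; `find_images_to_drop` is a hypothetical name.

```python
def find_images_to_drop(rows, key="annotation_url"):
    """Return the keys of images whose every report row has a broken link.

    rows: iterable of dicts, e.g. from csv.DictReader over the job report.
    key:  column assumed to identify an image (an assumption, see above).
    """
    seen = []            # image keys in order of first appearance
    annotated = set()    # image keys with at least one usable row
    for row in rows:
        image = row[key]
        if image not in seen:
            seen.append(image)
        # treat anything other than the text "true" as a usable annotation
        if str(row["broken_link"]).strip().lower() != "true":
            annotated.add(image)
    return [image for image in seen if image not in annotated]
```

An image that was broken in an early row but reuploaded and annotated in a later row ends up in `annotated`, so it is not returned.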

In [None]:
csv_dir = os.path.join(base_dir, "CSV")
csv_path = os.path.join(csv_dir, "job_" + str(job_id_to_download) + "_" + job_type + "_report.csv")

#csv_path = "/example/path/CSV/job_number_full_report.csv"

annotation_save = os.path.join(base_dir, "annotations")

In [None]:
images_to_drop = save_annotations_from_csv(csv_path, annotation_save)

## 3. Clean up annotations
First, the RGB annotation is converted into grayscale, simplifying downstream use of the annotation.

Next, small changes to the morphology of the image are made. Sometimes during annotation, small holes or stray annotations will be submitted as artifacts of the annotation process. These holes or stray pixels don't correspond to what should be annotated, so in this step, we use scikit-image to fix these small mistakes.

Currently uses the old "clean_montages" function; this may change in future versions of the notebook.

After cleaning the annotation, the user can optionally run the "relabel_montages" block, which will relabel the annotations sequentially (eg, if the annotator used the labels 3, 5, and 7 to label cells, this code block would remake the image with labels 1, 2, and 3).

The cleaned and relabeled annotations will overwrite the downloaded annotations.
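
To make the sequential relabeling concrete, here is a minimal NumPy sketch of the idea. It is not the code used by relabel_montages; `relabel_sequential_sketch` is a hypothetical name, and label 0 is assumed to be background.

```python
import numpy as np

def relabel_sequential_sketch(annotation):
    """Map the nonzero labels of an integer image onto 1, 2, 3, ... in ascending order."""
    labels = np.unique(annotation)
    labels = labels[labels != 0]          # keep background (0) untouched
    relabeled = np.zeros_like(annotation)
    for new_label, old_label in enumerate(labels, start=1):
        relabeled[annotation == old_label] = new_label
    return relabeled

# labels 3, 5, and 7 become 1, 2, and 3
example = np.array([[0, 3, 3],
                    [5, 0, 7]])
print(relabel_sequential_sketch(example))
```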

In [None]:
convert_grayscale_all(annotation_save)

In [None]:
clean_montages(annotation_save)

In [None]:
#optional
relabel_montages(annotation_save)

## 4. Make raw and annotated images same size
Choose whether to chop up the raw images to match the annotations, or to stitch the annotations together to match the original raw images. We recommend inspecting the images for annotation quality after this step, before moving on to making training data.

In [None]:
#are the images named with 2D or 3D conventions?
is_2D = True
npz_mode = "fullsize"
num_images = len(get_img_names(raw_dir))

In [None]:
#read json parameters
json_chopper_log_path = os.path.join(base_dir, "json_logs", identifier + "_overlapping_chopper_log.json")
try:
    with open(json_chopper_log_path) as json_file:
        json_chopper_log = json.load(json_file)
except FileNotFoundError:
    #only a missing file is expected here; a malformed log should still raise
    print("No overlapping_chopper log file found. Is the path to the json file correct?",
          "\nIf the images were not chopped prior to annotation, you can skip this step.")

### Option 1: Chop raw images into pieces to match annotation size

In [None]:
num_x_segments = json_chopper_log['num_x_segments']
num_y_segments = json_chopper_log['num_y_segments']
overlap_perc = json_chopper_log['overlap_perc']
#frame_offset may be absent from the log; fall back to 0 in that case
frame_offset = json_chopper_log.get('frame_offset', 0)

In [None]:
overlapping_crop_dir(raw_dir, 
                     identifier + "_raw", 
                     num_x_segments, 
                     num_y_segments, 
                     overlap_perc, 
                     frame_offset, 
                     is_2D)
npz_mode = 'chopped'

### Option 2: Recombine annotations to match original raw image size
Use this option to combine overlapping annotations into a single image. If a subimage file does not exist, that portion of the larger image will be filled with zeros. An annotation is usually missing because there were no cells in that part of the image, in which case the stitched image is still suitable for training.
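
The zero-fill behaviour can be illustrated with a deliberately simplified sketch. It assumes pieces that tile the image without overlap, so it is not the algorithm used by overlapping_stitcher_folder (which handles overlapping pieces); `stitch_grid_sketch` and its arguments are hypothetical.

```python
import numpy as np

def stitch_grid_sketch(pieces, num_x, num_y, piece_h, piece_w):
    """Place pieces on a zero-filled canvas; positions without a piece stay zero.

    pieces: dict mapping (x_index, y_index) -> 2D array of shape (piece_h, piece_w).
    """
    full = np.zeros((num_y * piece_h, num_x * piece_w), dtype=np.uint16)
    for (i, j), piece in pieces.items():
        # x index selects the column block, y index selects the row block
        full[j * piece_h:(j + 1) * piece_h,
             i * piece_w:(i + 1) * piece_w] = piece
    return full
```

Passing a dict that lacks some grid positions leaves those regions as background, which is how a missing annotation file shows up in the stitched result.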

The overlapping stitcher relies on information in the overlapping_chopper json log to function properly. The overlapping stitcher also depends on finding the correct filenames; if the filenames of the image pieces are not what it expects, it will "stitch" together empty images. If this is the case, try:
 - renaming images to match the naming format the stitcher expects

OR
 - creating a new notebook and copying the overlapping stitcher wrapper into a code block (including its imports). Edit the sub_img_format portion of the code to match the format you are passing in, then run the code block to stitch the annotations together. Once the images are stitched, you should be able to proceed with this notebook.
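
Before renaming anything, it can help to see which grid positions have no matching file. The sketch below assumes piece filenames contain a position tag of the form "x_00_y_00", which is the pattern the subfolder code later in this notebook matches on; the format the stitcher itself expects (sub_img_format) may differ. `missing_positions` is a hypothetical helper.

```python
def missing_positions(filenames, num_x_segments, num_y_segments):
    """Return the position tags for which no filename contains a match."""
    missing = []
    for i in range(num_x_segments):
        for j in range(num_y_segments):
            tag = "x_{0:02d}_y_{1:02d}".format(i, j)
            if not any(tag in name for name in filenames):
                missing.append(tag)
    return missing
```

If every position is reported missing, the filenames most likely follow a different naming scheme; if only a few are missing, those pieces probably had broken links or no annotation.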

In [None]:
#where the annotation pieces are located
pieces_dir = annotation_save

#where stitched images should be saved, stitcher will create dir if needed
save_dir = os.path.join(base_dir, "stitched")

In [None]:
overlapping_stitcher_folder(pieces_dir, save_dir, identifier, num_images, json_chopper_log, is_2D)
npz_mode = 'stitched'

## 5. Combine raw and annotated images into npz format
Image files need to be moved into training directories, and possibly subfolders in a particular structure, depending on the training data format of interest. The user can select from several options to make different types of training data (whether changing the images used for training data, or selecting different modes of training data creation). The user is advised to double check that they have selected the appropriate options for their intended use case.

If intending to use both chopped and stitched images to make two different npz files, pick one option, go through the end of the notebook to make_training_data, then delete the intermediary "training" folder to avoid mixing stitched and chopped images. Then, the user can follow the other option through to the end.
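
The clean-up between the two runs can be scripted. This is a minimal sketch under the assumption that the intermediary folder is the "training" folder inside base_dir; `reset_training_dir` is a hypothetical helper. It rebuilds the path from base_dir rather than reusing the training_dir variable, because training_dir is reassigned to the "all" subfolder later in the notebook when is_2D is True.

```python
import os
import shutil

def reset_training_dir(base_dir, folder_name="training"):
    """Delete base_dir/<folder_name> and its contents; return True if it existed."""
    target = os.path.join(base_dir, folder_name)
    if os.path.isdir(target):
        shutil.rmtree(target)
        return True
    return False
```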

### Move raw images and annotations into training folder

In [None]:
#training_dir will hold all subfolders used in make_training_data

training_dir = os.path.join(base_dir, "training")

In [None]:
#choose whether to use the chopped raw + annotation directories, or the original raw + stitched annotations

if npz_mode == "stitched":
    raw_img_dir = raw_dir
    annotation_dir = save_dir
    training_raw_folder = "raw"
    training_annotation_folder = "stitched"
    
elif npz_mode == "chopped":
    raw_img_dir = raw_dir + "_offset_{0:03d}_chopped_{1:02d}_{2:02d}".format(frame_offset, num_x_segments, num_y_segments)
    annotation_dir = annotation_save
    training_raw_folder = "raw_chopped"
    training_annotation_folder = "annotations"
    
elif npz_mode == "fullsize":
    raw_img_dir = raw_dir
    annotation_dir = annotation_save
    training_raw_folder = "raw"
    training_annotation_folder = "annotations"
    
training_raw_dir = os.path.join(training_dir, "all", training_raw_folder)
training_annotation_dir = os.path.join(training_dir, "all", training_annotation_folder)

channel_names = [training_raw_folder]
annotation_folders = [training_annotation_folder]

In [None]:
#copy raw and annotated images into "all" subfolder of training_dir

if not os.path.isdir(training_dir):
    os.makedirs(training_dir)
    os.chmod(training_dir, perm_mod)

shutil.copytree(raw_img_dir, training_raw_dir)
shutil.copytree(annotation_dir, training_annotation_dir)

os.chmod(training_raw_dir, perm_mod)
os.chmod(training_annotation_dir, perm_mod)
os.chmod(os.path.join(training_dir, 'all'), perm_mod)

### Optional: copy over other channels or features into training folder
If you want to add in other channels to include in the npz, this is the place. (eg, an npz with 4 color raw channels and cytoplasmic and nuclear annotations.)

The next code block will take the channel and annotation folders in base_dir and copy them into training/all, keeping the same folder names. These images should be the same size as the other training images, so if you need to chop them into specific sizes with the overlapping_chopper, you should do that first.

In [None]:
additional_channel_names = ['FITC', "TRITC"]
additional_annotation_folders = ['feature2']

for channel in additional_channel_names:
    channel_dir = os.path.join(base_dir, channel)
    channel_training_dir = os.path.join(training_dir, 'all', channel)
    shutil.copytree(channel_dir, channel_training_dir)
    os.chmod(channel_training_dir, perm_mod)
    
for feature in additional_annotation_folders:
    feature_dir = os.path.join(base_dir, feature)
    feature_training_dir = os.path.join(training_dir, 'all', feature)
    shutil.copytree(feature_dir, feature_training_dir)
    os.chmod(feature_training_dir, perm_mod)
    
channel_names = [training_raw_folder] + additional_channel_names
annotation_folders = [training_annotation_folder] + additional_annotation_folders

### Optional: make subfolders in training folder
Use this code block if you intend to make small (manageable) npz movies for tracking/curation. This is not necessary if you are making segmentation training data, or if you want to track a full movie. Make sure to select the appropriate folders in make_training_data if you use this code block.

Note: you may also make smaller tracking movies from a full-size movie with the reshape_size argument of make_training_data. This code block is appropriate for movies of the sub-images created by the overlapping chopper.

Subfolders are named by the x and y location of the image pieces, and each subfolder contains folders for raw and annotated images. Each subfolder will contain all of the frames of the movie, but the code block could be rewritten to allow for parts of movies (eg, 20 sequential frames per folder).
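
Splitting a movie into parts comes down to grouping the frame filenames into fixed-size runs. A minimal sketch, assuming that sorting the filenames puts the frames in temporal order (true when frame numbers are zero-padded); `chunk_frames` is a hypothetical helper and is not used by the code block below.

```python
def chunk_frames(frame_names, frames_per_folder=20):
    """Group frame filenames into consecutive runs of at most frames_per_folder."""
    ordered = sorted(frame_names)
    return [ordered[start:start + frames_per_folder]
            for start in range(0, len(ordered), frames_per_folder)]
```

Each returned group could then be copied into its own subfolder in the same way the full list of frames is handled.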

There is currently no reason to use this code block in conjunction with the previous, optional code block (multiple channels). The following code block may change if we have reason to track multi-channel movies.

In [None]:
raw_img_list = get_img_names(training_raw_dir)
annotated_img_list = get_img_names(training_annotation_dir)

#### Subfolders for chopped movies

In [None]:
for i in range(num_x_segments):
    for j in range(num_y_segments):
        
        #make subfolders
        subfolder = "x_{0:02d}_y_{1:02d}".format(i,j)
        subdir = os.path.join(training_dir, subfolder)
        if not os.path.isdir(subdir):
            os.makedirs(subdir)
            os.chmod(subdir, perm_mod)
            
        annotation_subdir = os.path.join(subdir, "annotated")
        if not os.path.isdir(annotation_subdir):
            os.makedirs(annotation_subdir)
            os.chmod(annotation_subdir, perm_mod)
            
        raw_subdir = os.path.join(subdir, "raw")
        if not os.path.isdir(raw_subdir):
            os.makedirs(raw_subdir)
            os.chmod(raw_subdir, perm_mod)
            
        #copy raw images into subfolders
        for raw_img in raw_img_list:
            if subfolder in raw_img:
                shutil.copy(os.path.join(raw_img_dir, raw_img), raw_subdir)
            
        #copy annotations into subfolders
        for annotated_img in annotated_img_list:
            if subfolder in annotated_img:
                shutil.copy(os.path.join(annotation_dir, annotated_img), annotation_subdir)
                
channel_names = ["raw"]
annotation_folders = ["annotated"]

### Make training data

In [None]:
output_dir = os.path.join(base_dir, 'npz')

if not os.path.isdir(output_dir):
    os.makedirs(output_dir)
    os.chmod(output_dir, perm_mod)

#Training directories are organized according to location within an image
#if there are any movies that shouldn't be included in the npz
#(unsuitable for training, or don't need to be tracked), put them in "samples_to_drop"
#"samples_to_drop" does not yet automatically update 
#based on "images_to_drop" (from downloading annotations), but will in the future

if npz_mode == "stitched" and is_2D == False:
    training_folders = ["all"]
    
elif npz_mode == "chopped" and is_2D == False:
    #use the same zero-padded naming as the subfolders created above
    training_folders = ['x_{0:02d}_y_{1:02d}'.format(i,j) for i in range(num_x_segments) for j in range(num_y_segments)]
    samples_to_drop = ["all"]
    training_folders = [x for x in training_folders if x not in samples_to_drop]
    
elif npz_mode == "fullsize" and is_2D == False:
    training_folders = ["all"]
    
elif is_2D:
    training_dir = os.path.join(training_dir, 'all')
    
    
if is_2D:
    dimensionality = 2
    kwargs = {}
else:
    dimensionality = 3
    kwargs = {"num_frames": num_images,
              'training_folders' : training_folders}

    
npz_name = "{0}_{1}_{2}D.npz".format(identifier, npz_mode, dimensionality)
file_name_save = os.path.join(output_dir, npz_name)

In [None]:
# Create the training data
post_annotation_make_training_data(training_dir,
                                   file_name_save,
                                   channel_names,
                                   annotation_folders,
                                   reshape_size = None,
                                   dimensionality = dimensionality,
                                   **kwargs)

In [None]:
# Verify the result
data = np.load(file_name_save)

# list the arrays stored in the npz file
print(data.files)

# load each array once and report its shape
X, y = data['X'], data['y']
print('X Shape:', X.shape)
print('y Shape:', y.shape)