# Data Processing for Crowd Annotation Pipeline

1. Download job report from Figure Eight
2. Download annotated images from report
3. Clean up montages (remove small objects, fill small holes, sequentially label the annotations)
4. Process montages into movies
5. Process annotations and raw images to match each other in size. Can either:
    - chop raw images into corresponding pieces
    - recombine annotation subimages into full-size images
6. Combine raw and annotated movies into npz format
7. Images are now ready to be used as training data or to track!

These scripts name files so that the code blocks can run back-to-back with minimal input. For this reason, we recommend running one set of images through the whole pipeline before starting on another. After specifying a few directory names and the "identifier" used in pre-annotation, the user can run all cells in the notebook. Alternate folder names can be used, but this is not recommended.

To function properly, your working folder should already contain subfolders:
- json_logs  
    - log from overlapping_chopper ({identifier}\_overlapping_chopper\_log.json)
    - log from montage_maker ({identifier}\_montage\_maker.json)
- raw images (can be named "raw" or something else)
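
The folder requirements above can be checked before running the rest of the notebook. The following is a minimal sketch, not part of deepcell_toolbox: the helper name `missing_json_logs` is hypothetical, and it only assumes the two log file names listed above.

```python
import os


def missing_json_logs(base_dir, identifier):
    """Return the expected log file names that are not found in base_dir/json_logs.

    Hypothetical helper; the file names follow the patterns listed above.
    """
    log_dir = os.path.join(base_dir, "json_logs")
    expected = [
        identifier + "_overlapping_chopper_log.json",
        identifier + "_montage_maker.json",
    ]
    return [name for name in expected
            if not os.path.isfile(os.path.join(log_dir, name))]
```

If the working folder is set up as described, `missing_json_logs(base_dir, identifier)` should return an empty list.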

The user will also need to supply:
- job ID for the data to download from Figure Eight
- API key for Figure Eight
- "identifier" to access correct json logs and name files correctly

If the default folder names are used, by the end of this pipeline, the working folder (base_dir) will contain subfolders named:

- CSV  
    - data that was uploaded to Figure Eight in the pre-annotation notebook  
    - job report downloaded from Figure Eight

- {identifier}\_annotations  
    - downloaded montages from Figure Eight, cleaned

- movies  
    - subfolders for different parts  
        - subfolders for each subsection of image  
            - subfolders for holding raw and annotated data  
                - images
                
- npz
    - .npz files corresponding to each "part"

In [None]:
#import statements
import os
import pathlib

from deepcell_toolbox.post_annotation.download_csv import download_and_unzip, save_annotations_from_csv
from deepcell_toolbox.post_annotation.clean_montages import clean_montages, relabel_montages, convert_grayscale_all
from deepcell_toolbox.post_annotation.montages_to_movies import raw_movie_maker, all_montages_chopper, read_json_params_montage

from deepcell_toolbox.utils.data_utils import make_training_data

In [None]:
#set working directory
#base_dir = "/base/directory/path/here"
#raw_dir = "/base/directory/path/here/folder_with_fullsize_raw_images"

base_dir = "/gnv_home/data/example"
raw_dir = "/gnv_home/data/example/raw"

#identifier given during pre-annotation pipeline; if you're not sure, it's also in the job report csv
identifier = "3D_post_annotation_example"

## 1. Download job report from Figure Eight
By default, this script downloads and unzips the full report from Figure Eight and renames it as a .csv file. The user can change the report type if one of the other report options better suits their needs, although support for other report types is not guaranteed in version 0 of this notebook.

The user can specify where the zip file should be downloaded and the .csv extracted; by default, the .csv file will be put into a subfolder named CSV (likely the same folder that contained the input data; the CSV files are named to prevent confusion). The report CSV will be renamed "job\_{job_number}\_{type of report}\_report.csv".

#### From Figure Eight website:
full - Returns the Full report containing every judgment

aggregated - Returns the Aggregated report containing the aggregated response for each row

json - Returns the JSON report containing the aggregated response, as well as the individual judgments

gold_report - Returns the Test Question report

workset - Returns the Contributor report

source - Returns a CSV of the source data uploaded to the job
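
As a small illustration of the naming convention and report types described above, the sketch below builds the expected report file name and rejects unknown report types. The helper `report_csv_name` and the `REPORT_TYPES` table are hypothetical and are not part of deepcell_toolbox; only the type names and the file name pattern come from this notebook.

```python
# Report types listed on the Figure Eight website (see above)
REPORT_TYPES = {
    "full": "every judgment",
    "aggregated": "aggregated response for each row",
    "json": "aggregated response plus individual judgments",
    "gold_report": "test questions",
    "workset": "contributors",
    "source": "source data uploaded to the job",
}


def report_csv_name(job_id, job_type="full"):
    """Return the file name the renamed report is expected to have."""
    if job_type not in REPORT_TYPES:
        raise ValueError("unknown report type: {}".format(job_type))
    return "job_{}_{}_report.csv".format(job_id, job_type)
```

For example, `report_csv_name(1363594, "full")` gives the file name used for the example job in this notebook.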

In [None]:
job_id_to_download = 1363594
job_type = "full"

In [None]:
download_and_unzip(job_id_to_download, base_dir)

## 2. Use report to download annotations
This script uses the information in the report to download each annotation. With the settings below, montage annotations will be saved in the "{identifier}\_annotations" subfolder (it will be created for you by the script).

Raw images that could not be annotated (those with "broken_link = True") will not be downloaded in this step. If a job contains rows with broken links, that information is written to two CSV files:
- "job_number_full_report_broken_links.csv" contains all of the metadata from the full job report, in case the user wants to inspect it for a pattern in the broken links.
- "job_number_reupload.csv" contains only the information originally used to upload the images (identifier and annotation_url).

If the images are suitable for annotation, the user can add the reupload CSV to the Figure Eight job to obtain annotations for them. Alternatively, the user may need to go through part of the pre-annotation pipeline to fix and reupload the images in question.

These two CSV files are not created if the job has no broken links, or if every image with a broken link has an annotation later in the job report (for example, because the user reuploaded those rows to the job).

This function returns a list of the image names of any images with broken links; this list will be used later in the pipeline to automatically skip images when stitching images together or making training data.
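
The broken-link handling described above can be illustrated with the standard `csv` module. This is a simplified sketch, not the implementation of `save_annotations_from_csv`: the helper names are hypothetical, the column names (`broken_link`, `identifier`, `annotation_url`) are taken from the description above, and the sketch does not check whether a broken row was annotated later in the report.

```python
import csv
import io


def split_broken_links(report_file):
    """Split report rows into (broken, ok) lists based on the broken_link column."""
    broken, ok = [], []
    for row in csv.DictReader(report_file):
        is_broken = row.get("broken_link", "").strip().lower() == "true"
        (broken if is_broken else ok).append(row)
    return broken, ok


def reupload_rows(broken):
    """Keep only the columns needed to upload the images again."""
    return [{"identifier": row["identifier"],
             "annotation_url": row["annotation_url"]} for row in broken]


# Made-up report contents, for illustration only
sample_report = io.StringIO(
    "identifier,annotation_url,broken_link\n"
    "example,http://example.com/a.png,false\n"
    "example,http://example.com/b.png,true\n"
)
broken_rows, ok_rows = split_broken_links(sample_report)
rows_to_reupload = reupload_rows(broken_rows)
```

Here `broken_rows` holds the full metadata for the broken rows, and `rows_to_reupload` holds only the two columns needed for a reupload CSV.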

In [None]:
csv_dir = os.path.join(base_dir, "CSV")
csv_path = os.path.join(csv_dir, "job_" + str(job_id_to_download) + "_" + job_type + "_report.csv")

#csv_path = "/example/path/CSV/job_number_full_report.csv"

annotation_save = os.path.join(base_dir, identifier + "_annotations")

In [None]:
save_annotations_from_csv(csv_path, annotation_save)

## 3. Clean up the montages
First, the RGB montage annotation is converted into grayscale, simplifying downstream use of the annotation.

Next, small changes are made to the morphology of the image. Annotators sometimes submit small holes or stray pixels as artifacts of the annotation process. These do not correspond to anything that should be annotated, so in this step we use scikit-image to fix these small mistakes.

This step currently uses the old "clean_montage" function; this may change in future versions of the notebook.

After cleaning the montages, the user can optionally run the "relabel_montages" block, which relabels the annotations sequentially. For example, if the annotator used the labels 3, 5, and 7 to label cells, this code block would remake the image with labels 1, 2, and 3.

The cleaned and relabeled annotations will overwrite the downloaded annotations.
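
To make the sequential relabeling concrete, here is a minimal NumPy sketch. It is not the implementation of `relabel_montages`; the function name `relabel_sequentially` is hypothetical, and the sketch assumes that 0 is the background label. scikit-image also offers `skimage.segmentation.relabel_sequential` for the same purpose.

```python
import numpy as np


def relabel_sequentially(labels):
    """Map the nonzero labels of an integer image onto 1, 2, 3, ... in sorted order.

    Background pixels (label 0) are left as 0.
    """
    labels = np.asarray(labels)
    relabeled = np.zeros_like(labels)
    foreground = labels > 0
    old_values = np.unique(labels[foreground])  # sorted unique nonzero labels
    # The position of each old label in the sorted list, plus one, is its new label
    relabeled[foreground] = np.searchsorted(old_values, labels[foreground]) + 1
    return relabeled


# Labels 3, 5, and 7 become 1, 2, and 3
example = np.array([[0, 3, 3],
                    [5, 0, 7]])
relabeled_example = relabel_sequentially(example)
```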

In [None]:
annotations_folder = annotation_save
#annotations_folder = "/base/directory/path/here/wherever_you_moved_the_annotations"

In [None]:
convert_grayscale_all(annotations_folder)

In [None]:
clean_montages(annotations_folder)

In [None]:
#optional
relabel_montages(annotations_folder)

## 4. Process montages into movies

Each montage is composed of frames of a timelapse (or sometimes, a z-stack) that have been placed next to each other. This is useful for annotators, but we want to use these images frame by frame in movies. This section of the notebook uses each montage, together with the parameters used to make it (such as the spacing between frames), to chop the montage into its constituent frames. These sequential frames are then saved together in subfolders.

By default, these will be saved in a "movies" folder containing subfolders corresponding to the crop location of each montage (e.g., x_01_y_00). Each subfolder will then contain a folder for the annotations of that position, which holds the image files for each frame.
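
The chopping idea can be sketched as follows, assuming a single-row montage in which equally sized frames are laid out left to right with a fixed gap between them. The function `chop_montage_row` and its parameters are hypothetical and do not reflect the signature or layout handling of `all_montages_chopper`.

```python
import numpy as np


def chop_montage_row(montage, num_frames, frame_width, spacing):
    """Cut a single-row montage into a stack of frames.

    Assumes frame i starts at column i * (frame_width + spacing).
    Returns an array of shape (num_frames, height, frame_width).
    """
    frames = []
    for i in range(num_frames):
        start = i * (frame_width + spacing)
        frames.append(montage[:, start:start + frame_width])
    return np.stack(frames)


# Made-up montage: three 2x4 frames filled with 1, 2, 3, separated by 1-pixel gaps
demo_montage = np.zeros((2, 14), dtype=int)
for i in range(3):
    demo_montage[:, i * 5:i * 5 + 4] = i + 1

demo_frames = chop_montage_row(demo_montage, num_frames=3, frame_width=4, spacing=1)
```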

In [None]:
all_montages_chopper(base_dir, identifier + "_annotations", identifier)

In [None]:
raw_movie_maker(base_dir, raw_dir, identifier)

## 5. Combine raw and annotation images into .npz formatted training data
Currently, this section takes a part-level directory in the "movies" folder and combines the images it contains into training data. The training data is saved as an .npz file in the "npz" folder, which will be created inside the "base_dir" specified at the beginning of the notebook. (This default can be changed by specifying a different "output_directory" below.)

Feature to be added: include "full movie" mode that recombines all parts; this would allow movies longer than the montage length to be tracked. However, the resulting .npz file wouldn't be suitable for use as tracking training data (at least, without curation first) because track continuity is not maintained between parts.

Feature to be added: option to loop through .npz creation for multiple parts at a time. The user should be able to specify which parts they want to make .npz files from. The code block should then loop through each part and save .npz files with appropriate names. Although users would not be able to easily specify "samples_to_drop", this is generally not a problem when users are looking to track the resulting movies.
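
One possible shape for the multi-part loop described above is sketched below. It only builds the input and output paths for each part; the helper name `npz_jobs_for_parts` and the "{identifier}_{part}.npz" naming are assumptions for illustration, not existing behavior.

```python
import os


def npz_jobs_for_parts(base_dir, identifier, parts):
    """Return (direc_name, file_name_save) pairs, one per requested part.

    Hypothetical helper: assumes parts live in base_dir/movies and that
    output files go to base_dir/npz.
    """
    movies_dir = os.path.join(base_dir, "movies")
    npz_dir = os.path.join(base_dir, "npz")
    jobs = []
    for part in parts:
        direc_name = os.path.join(movies_dir, part)
        file_name_save = os.path.join(npz_dir, "{}_{}.npz".format(identifier, part))
        jobs.append((direc_name, file_name_save))
    return jobs
```

Each pair could then be passed as `direc_name` and `file_name_save` to `make_training_data`, with the remaining arguments kept as in the cells below.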

In [None]:
#get info from json logs
log_folder = os.path.join(base_dir, "json_logs")
montage_params = read_json_params_montage(log_folder, identifier)
#number of frames per montage
num_frames = montage_params[0]
#segments the original image was chopped into:
num_x = montage_params[1]
num_y = montage_params[2]

# Paths to the part-level movie folder to process and to the output folder
direc_name = os.path.join(base_dir, "movies", "part1")
output_directory = os.path.join(base_dir, "npz")
#output_directory = '/gnv_home/data/Ed/3T3/set0'
file_name_save = os.path.join(output_directory, 'example_movie_S0P1_same.npz')

#Training directories are organized according to location within an image
#if there are any movies that shouldn't be included in the npz
#(unsuitable for training, or don't need to be tracked), put them in "samples_to_drop"
samples_to_drop = []
training_direcs = ['x_0{}_y_0{}'.format(i,j) for i in range(num_x) for j in range(num_y)]
training_direcs = [x for x in training_direcs if x not in samples_to_drop]
channel_names = [""] # Commonality in raw filenames

# Create output directory, if necessary
pathlib.Path(output_directory).mkdir(parents=True, exist_ok=True)

In [None]:
# Create the training data
make_training_data(
    direc_name=direc_name,
    file_name_save=file_name_save,
    channel_names=channel_names,
    dimensionality=3,
    training_direcs=training_direcs,
    raw_image_direc="raw",
    annotation_direc="annotated",
    annotation_name="",
    border_mode="same",
    output_mode="conv",
    num_frames=num_frames,
    reshape_size=None,
    verbose=True)