# Data Processing for Crowd Annotation Pipeline

1. Download job report from Figure Eight
2. Download annotated images from report
3. Clean up montages (remove small objects, fill small holes, sequentially label the annotations)
4. Process montages into movies
5. Process annotations and raw images to match each other in size. Can either:
    - chop raw images into corresponding pieces
    - recombine annotation subimages into full-size images
6. Combine raw and annotated movies into npz format
7. Images are now ready to be used as training data or to track!

Files are named by these scripts such that the code blocks can run back-to-back with minimal input. For this reason, it is recommended that users run through the whole pipeline before processing another set of images. The user can specify a few directory names and the "identifier" used in pre-annotation and run all cells in the notebook; alternate folder names can be used but this is not recommended.

To function properly, your working folder should already contain subfolders:
- json_logs  
    - log from overlapping_chopper ({identifier}\_overlapping_chopper\_log.json)
    - log from montage_maker ({identifier}\_montage\_maker\_log.json)
- raw images (can be named "raw" or something else)
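
Before running the pipeline, it can help to confirm that these inputs are actually in place. The sketch below is one possible check, not part of deepcell_toolbox: `missing_inputs` and its `raw_folder` argument are hypothetical names, and the log file names are assumed to match the ones loaded later in this notebook.

```python
import os
import tempfile


def missing_inputs(base_dir, identifier, raw_folder="raw"):
    """Return the expected input paths that do not exist under base_dir."""
    expected = [
        os.path.join(base_dir, "json_logs",
                     identifier + "_overlapping_chopper_log.json"),
        os.path.join(base_dir, "json_logs",
                     identifier + "_montage_maker_log.json"),
        os.path.join(base_dir, raw_folder),
    ]
    return [path for path in expected if not os.path.exists(path)]


# Demo on an empty temporary folder, so every expected input is reported.
with tempfile.TemporaryDirectory() as demo_dir:
    missing = missing_inputs(demo_dir, "set0_cyto_overlay")
    for path in missing:
        print("Missing:", os.path.relpath(path, demo_dir))
```

In practice the function would be called with the notebook's own `base_dir` and `identifier`, and an empty list means the working folder is ready.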

The user will also need to supply:
- job ID for the data to download from Figure Eight
- API key for Figure Eight
- "identifier" to access correct json logs and name files correctly

If the default folder names are used, by the end of this pipeline, the working folder (base_dir) will contain subfolders named:

- CSV  
    - data that was uploaded to Figure Eight in pre-annotation notebook  
    - job report downloaded from Figure Eight

- annotations  
    - downloaded montages from Figure Eight, cleaned

- movies  
    - subfolders for different parts and subsections of full movie 
        - subfolders for holding raw and annotated data  
            - images
                
- npz
    - .npz where each montage has been turned into one batch

In [None]:
#import statements
import json
import numpy as np
import os
import pathlib
import shutil
import stat
import sys

from deepcell_toolbox.utils.io_utils import get_img_names
from deepcell_toolbox.pre_annotation.overlapping_chopper import overlapping_crop_dir

from deepcell_toolbox.post_annotation.download_csv import download_and_unzip, save_annotations_from_csv
from deepcell_toolbox.post_annotation.clean_montages import clean_montages, relabel_montages, convert_grayscale_all
from deepcell_toolbox.post_annotation.montages_to_movies import all_montages_chopper
from deepcell_toolbox.post_annotation.post_annotation_training_data import post_annotation_make_training_data

perm_mod = stat.S_IRWXO | stat.S_IRWXU | stat.S_IRWXG

In [None]:
#set working directory
#base_dir = "/base/directory/path/here"
#raw_dir = "/base/directory/path/here/folder_with_fullsize_raw_images"

base_dir = "/gnv_home/data/testing/post_processing/set3"
raw_dir = "/gnv_home/data/testing/post_processing/set3/FITC_overlay_phase"

#identifier given during pre-annotation pipeline; if you're not sure, it's also in the job report csv
identifier = "set0_cyto_overlay"

## 1. Download job report from Figure Eight
By default, this script will download, unzip, and rename the full report from Figure Eight as a .csv file. However, the user can change the report type if one of the other report options is more suitable for their use. (support for other report types not guaranteed with version 0 of this notebook)

The user can specify where the zip file should be downloaded and the .csv extracted; by default, the .csv file will be put into a subfolder named CSV (likely the same folder that contained the input data; the CSV files are named to prevent confusion). The report CSV will be renamed "job\_{job_number}\_{type of report}\_report.csv".

#### From Figure Eight website:
full - Returns the Full report containing every judgment

aggregated - Returns the Aggregated report containing the aggregated response for each row

json - Returns the JSON report containing the aggregated response, as well as the individual judgments

gold_report - Returns the Test Question report

workset - Returns the Contributor report

source - Returns a CSV of the source data uploaded to the job
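
As a small illustration of the naming convention described above, the sketch below builds the expected report file name and rejects report types that are not in the list. `report_csv_name` and `REPORT_TYPES` are hypothetical helpers written for this example; they are not part of deepcell_toolbox.

```python
# Report types listed on the Figure Eight website (see the list above).
REPORT_TYPES = ("full", "aggregated", "json", "gold_report", "workset", "source")


def report_csv_name(job_id, report_type="full"):
    """Build the file name 'job_{job_number}_{type of report}_report.csv'."""
    if report_type not in REPORT_TYPES:
        raise ValueError("Unknown report type: {0}".format(report_type))
    return "job_{0}_{1}_report.csv".format(job_id, report_type)


print(report_csv_name(1363594))
```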

In [None]:
job_id_to_download = 1363594
job_type = "full"

In [None]:
download_and_unzip(job_id_to_download, base_dir)

## 2. Use report to download annotations
This script uses the information in the report to download each annotation. Montage annotations will be saved in the folder given by `montage_dir` (named "{identifier}_montaged_annotations" in the cell below); the script will create it for you.

Raw images that could not be annotated (those with "broken_link = True") will not be downloaded in this step. If a job contains rows with broken links, the information will be put into two CSV files. The first, "job_number_full_report_broken_links.csv", contains all of the metadata from the full job report, in case the user wants to inspect it for a pattern in the broken links. The second, "job_number_reupload.csv", has only the information used to upload the images originally (identifier and annotation_url). If the images are suitable for annotation, the user can add this CSV to the Figure Eight job and obtain annotations for those images. (Alternatively, the user may need to go through part of the pre-annotation pipeline to adequately fix and reupload the images in question.)

If the job has no broken links, or if every image with a broken link has an annotation later in the job report (for example, because the user reuploaded those rows to the job), the secondary CSV files will not be created.

This function returns a list of the image names of any images with broken links; this list will be used later in the pipeline to automatically skip images when stitching images together or making training data.
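
The broken-link handling described above can be sketched roughly as follows. This is an illustration only, not the code inside `save_annotations_from_csv`: the column names come from the text above, but the way "broken_link" is encoded ("true"/"false" strings), the use of the URL's file name as the image name, and the helper `split_broken_links` are all assumptions.

```python
import csv
import io
import posixpath
from urllib.parse import urlparse


def split_broken_links(rows):
    """Return (rows to reupload, image names that never got an annotation)."""
    def is_broken(row):
        return row["broken_link"].strip().lower() == "true"

    # URLs that were successfully annotated somewhere in the report,
    # including rows that were reuploaded after an earlier broken link.
    annotated = {row["annotation_url"] for row in rows if not is_broken(row)}

    reupload, names, seen = [], [], set()
    for row in rows:
        url = row["annotation_url"]
        if url in annotated or url in seen:
            continue
        seen.add(url)
        reupload.append({"identifier": row["identifier"], "annotation_url": url})
        names.append(posixpath.basename(urlparse(url).path))
    return reupload, names


# Tiny made-up report: montage_c was broken once, then annotated later.
sample_report = io.StringIO(
    "identifier,annotation_url,broken_link\n"
    "set0_cyto_overlay,https://example.com/montage_a.png,false\n"
    "set0_cyto_overlay,https://example.com/montage_b.png,true\n"
    "set0_cyto_overlay,https://example.com/montage_c.png,true\n"
    "set0_cyto_overlay,https://example.com/montage_c.png,false\n"
)
rows = list(csv.DictReader(sample_report))
reupload_rows, broken_names = split_broken_links(rows)
print(broken_names)
```

Only `montage_b.png` remains in the list, because the later annotated row for `montage_c.png` cancels its earlier broken link.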

In [None]:
csv_dir = os.path.join(base_dir, "CSV")
csv_path = os.path.join(csv_dir, "job_" + str(job_id_to_download) + "_" + job_type + "_report.csv")

#csv_path = "/example/path/CSV/job_number_full_report.csv"

montage_dir = os.path.join(base_dir, identifier + "_montaged_annotations")

In [None]:
save_annotations_from_csv(csv_path, montage_dir)

## 3. Clean up the montages
First, the RGB montage annotation is converted into grayscale, simplifying downstream use of the annotation.

Next, small changes to the morphology of the image are made. Sometimes small holes or stray annotations are submitted as artifacts of the annotation process. These holes and stray pixels don't correspond to what should be annotated, so in this step we use scikit-image to fix these small mistakes.
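
To make the idea of removing stray annotations concrete, here is a NumPy-only sketch that zeroes out any label covering fewer than `min_size` pixels. It is not the implementation used by `clean_montages`; `drop_small_labels` and the `min_size` value are hypothetical. scikit-image offers `skimage.morphology.remove_small_objects` and `remove_small_holes` for this kind of cleanup, and the hole-filling half is not shown here.

```python
import numpy as np


def drop_small_labels(label_img, min_size):
    """Set labels covering fewer than min_size pixels to background (0)."""
    sizes = np.bincount(label_img.ravel())  # pixel count per label value
    too_small = sizes < min_size
    too_small[0] = False                    # never remove the background
    cleaned = label_img.copy()
    cleaned[too_small[label_img]] = 0
    return cleaned


annotation = np.zeros((6, 6), dtype=np.int32)
annotation[1:4, 1:4] = 1   # a 9-pixel cell
annotation[5, 5] = 2       # a stray single-pixel annotation
cleaned = drop_small_labels(annotation, min_size=4)
print(np.unique(cleaned))
```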

This step currently uses the older "clean_montages" function; this may change in future versions of the notebook.

After cleaning the montage, the user can optionally run the "relabel_montages" block, which will relabel the annotations sequentially (e.g., if the annotator used the labels 3, 5, and 7 to label cells, this code block would remake the image with labels 1, 2, and 3).

The cleaned and relabeled annotations will overwrite the downloaded annotations.
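
The sequential relabeling described above can be illustrated with a short NumPy sketch. This shows the idea only and is not necessarily how `relabel_montages` is implemented; `relabel_sequentially` is a hypothetical helper that keeps 0 as background.

```python
import numpy as np


def relabel_sequentially(label_img):
    """Map the nonzero labels onto 1..N, preserving their sorted order."""
    mask = label_img > 0
    values = np.unique(label_img[mask])  # sorted labels actually in use
    relabeled = np.zeros_like(label_img)
    relabeled[mask] = np.searchsorted(values, label_img[mask]) + 1
    return relabeled


annotation = np.array([[0, 3, 3],
                       [5, 0, 7],
                       [5, 7, 7]])
print(relabel_sequentially(annotation))
```

Here the labels 3, 5, and 7 become 1, 2, and 3, matching the example in the text.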

In [None]:
convert_grayscale_all(montage_dir)

In [None]:
clean_montages(montage_dir)

In [None]:
#optional
relabel_montages(montage_dir)

## 4. Process montages into movies

Each montage is composed of frames of a timelapse (or sometimes, a z-stack) that have been placed next to each other. This is useful for annotators, but we want to use these images frame by frame in movies. This section of the notebook takes montages, as well as the parameters used to make the montage (such as spacing between frames) to chop one montage into its constituent frames. These sequential frames will then be saved in subfolders together.

By default, these will be saved in a "movies" folder containing subfolders corresponding to the crop location of each montage (e.g., x_1_y_0). Each subfolder will then contain a folder for the annotations of that position. The annotations folder will contain the image files for each frame.
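
Chopping a montage back into frames comes down to array slicing once the layout is known. The sketch below assumes a grid of equally sized frames separated by a fixed `spacing` in pixels, with no outer border; `chop_montage` and its parameter names are hypothetical and may not match the keys stored in the montage maker log or the behavior of `all_montages_chopper`.

```python
import numpy as np


def chop_montage(montage, rows, cols, frame_h, frame_w, spacing):
    """Slice a montage into its frames, reading left to right, top to bottom."""
    frames = []
    for r in range(rows):
        for c in range(cols):
            top = r * (frame_h + spacing)
            left = c * (frame_w + spacing)
            frames.append(montage[top:top + frame_h, left:left + frame_w])
    return frames


# Build a toy montage: one row of three 4x5 frames separated by 2 pixels.
frame_h, frame_w, spacing = 4, 5, 2
montage = np.zeros((frame_h, 3 * frame_w + 2 * spacing), dtype=np.uint8)
for k in range(3):
    left = k * (frame_w + spacing)
    montage[:, left:left + frame_w] = k + 1  # frame k is filled with k + 1

frames = chop_montage(montage, rows=1, cols=3,
                      frame_h=frame_h, frame_w=frame_w, spacing=spacing)
print([int(frame[0, 0]) for frame in frames])
```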

In [None]:
#read json parameters
json_montage_log_path = os.path.join(base_dir, "json_logs", identifier + "_montage_maker_log.json")
try:
    with open(json_montage_log_path) as json_file:
        json_montage_log = json.load(json_file)
except FileNotFoundError:
    print("No montage maker log file found. Is the path to the json file correct?")

In [None]:
all_montages_chopper(base_dir,
                     montage_dir, 
                     identifier, 
                     json_montage_log)

chopped_annotations_dir = os.path.join(base_dir, 'chopped_annotations')

## 5. Match raw images to annotations
This section of the notebook will chop up raw images to match the size of the annotations extracted from montages. Then, files will be rearranged into a "movies" folder to match raw and annotated images in subfolders corresponding to the montage they came from.

### Chop the raw images into the same size pieces as the annotations
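
For intuition about what the chopping parameters mean, here is one possible way to cut an image into equally sized, overlapping pieces. It is a sketch under assumptions and may differ from what `overlapping_crop_dir` actually does: it assumes the overlap is a percentage of the crop size and that the image is zero-padded so that edge pieces have the same shape as interior ones. `overlapping_crops` is a hypothetical helper.

```python
import numpy as np


def overlapping_crops(img, num_x_segments, num_y_segments, overlap_perc):
    """Return {(i, j): crop} with every crop the same shape."""
    crop_h = img.shape[0] // num_y_segments
    crop_w = img.shape[1] // num_x_segments
    pad_h = int(crop_h * overlap_perc / 100)
    pad_w = int(crop_w * overlap_perc / 100)
    # Pad so that edge crops can extend past the original border.
    padded = np.pad(img, ((pad_h, pad_h), (pad_w, pad_w)), mode="constant")
    crops = {}
    for i in range(num_x_segments):
        for j in range(num_y_segments):
            top, left = j * crop_h, i * crop_w
            crops[(i, j)] = padded[top:top + crop_h + 2 * pad_h,
                                   left:left + crop_w + 2 * pad_w]
    return crops


image = np.arange(40 * 60).reshape(40, 60)
crops = overlapping_crops(image, num_x_segments=3, num_y_segments=2,
                          overlap_perc=10)
print(len(crops), crops[(0, 0)].shape)
```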

In [None]:
#read json parameters
json_chopper_log_path = os.path.join(base_dir, "json_logs", identifier + "_overlapping_chopper_log.json")
try:
    with open(json_chopper_log_path) as json_file:
        json_chopper_log = json.load(json_file)
except FileNotFoundError:
    print("No overlapping_chopper log file found. Is the path to the json file correct?")

In [None]:
num_x_segments = json_chopper_log['num_x_segments']
num_y_segments = json_chopper_log['num_y_segments']
overlap_perc = json_chopper_log['overlap_perc']
#older logs may not record a frame offset; default to 0 in that case
frame_offset = json_chopper_log.get('frame_offset', 0)

In [None]:
overlapping_crop_dir(raw_dir, 
                     identifier + "_raw", 
                     num_x_segments, 
                     num_y_segments, 
                     overlap_perc, 
                     frame_offset, 
                     is_2D = False)

raw_pieces_dir = raw_dir + "_offset_{0:03d}_chopped_{1:02d}_{2:02d}".format(frame_offset, num_x_segments, num_y_segments)

### Rearrange the image files into a "movies" folder

In [None]:
movies_dir = os.path.join(base_dir, "movies")
if not os.path.isdir(movies_dir):
    os.makedirs(movies_dir)
    os.chmod(movies_dir, perm_mod)

In [None]:
raw_img_list = get_img_names(raw_pieces_dir)
annotated_img_list = get_img_names(chopped_annotations_dir)

In [None]:
parts = json_montage_log['montages_in_pos']
montage_len = json_montage_log['montage_len']

for part in range(parts):
    start_frame = part * montage_len
    
    for i in range(num_x_segments):
        for j in range(num_y_segments):
            subfolder = 'x_{0:02d}_y_{1:02d}_part_{2}'.format(i,j,part)
            subdir = os.path.join(movies_dir, subfolder)
            if not os.path.isdir(subdir):
                os.makedirs(subdir)
                os.chmod(subdir, perm_mod)
            
            annotation_subdir = os.path.join(subdir, "annotated")
            if not os.path.isdir(annotation_subdir):
                os.makedirs(annotation_subdir)
                os.chmod(annotation_subdir, perm_mod)
            
            raw_subdir = os.path.join(subdir, "raw")
            if not os.path.isdir(raw_subdir):
                os.makedirs(raw_subdir)
                os.chmod(raw_subdir, perm_mod)
            
            #copy raw images into subfolders
            for frame in range(montage_len):
                raw_name = 'x_{0:02d}_y_{1:02d}_frame_{2:03d}.'.format(i,j, start_frame + frame)
                for raw_img in raw_img_list:
                    if raw_name in raw_img:
                        shutil.copy(os.path.join(raw_pieces_dir, raw_img), raw_subdir)
            
            #copy annotations into subfolders
            annotation_name = subfolder + "_frame"
            for annotated_img in annotated_img_list:
                if annotation_name in annotated_img:
                    shutil.copy(os.path.join(chopped_annotations_dir, annotated_img), annotation_subdir)

## 6. Combine raw and annotation images into .npz formatted training data
Currently, this section allows the user to go through part-level directories in a "movies" folder and combine the images contained therein to make training data. The training data is saved as an .npz file in the "npz" folder that will be created inside the "base_dir" (specified at beginning of notebook). (This default setting can be changed by specifying a different "output_dir" below.)
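
As a rough picture of what ends up in the .npz file, the sketch below saves and reloads a pair of arrays under the keys `X` and `y`, which are the keys read back at the end of this notebook. The array sizes, the axis order (movie, frame, row, column, channel), and the file name are assumptions made for the example; the real arrays are produced by `post_annotation_make_training_data`.

```python
import os
import tempfile

import numpy as np

# Hypothetical sizes: 2 movies of 3 frames each, 8x8 pixels, 1 channel.
num_movies, num_frames, height, width = 2, 3, 8, 8
X = np.zeros((num_movies, num_frames, height, width, 1), dtype=np.float32)
y = np.zeros((num_movies, num_frames, height, width, 1), dtype=np.int32)
y[0, :, 2:5, 2:5, 0] = 1  # one labeled cell in every frame of the first movie

with tempfile.TemporaryDirectory() as tmp_dir:
    npz_path = os.path.join(tmp_dir, "example_montaged.npz")
    np.savez(npz_path, X=X, y=y)
    # Indexing the loaded file reads each array fully into memory.
    with np.load(npz_path) as data:
        loaded_X, loaded_y = data["X"], data["y"]

print(loaded_X.shape, loaded_y.shape)
```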

Feature to be added: include "full movie" mode that recombines all parts; this would allow movies longer than the montage length to be tracked. However, the resulting .npz file wouldn't be suitable for use as tracking training data (at least, without curation first) because track continuity is not maintained between parts.

In [None]:
channel_names = ['raw']
annotation_folders = ['annotated']

In [None]:
output_dir = os.path.join(base_dir, "npz")

if not os.path.isdir(output_dir):
    os.makedirs(output_dir)
    os.chmod(output_dir, perm_mod)

#Training directories are organized according to location within an image
#if there are any movies that shouldn't be included in the npz
#(unsuitable for training, or don't need to be tracked), put them in "samples_to_drop"
samples_to_drop = []
training_folders = ['x_{0:02d}_y_{1:02d}_part_{2}'.format(i,j,part) for part in range(parts) for i in range(num_x_segments) for j in range(num_y_segments)]
training_folders = [x for x in training_folders if x not in samples_to_drop]

npz_name = "{0}_montaged.npz".format(identifier)
file_name_save = os.path.join(output_dir, npz_name)

In [None]:
post_annotation_make_training_data(movies_dir,
                                   file_name_save,
                                   channel_names,
                                   annotation_folders,
                                   reshape_size = None,
                                   dimensionality = 3,
                                   num_frames = montage_len,
                                   training_folders = training_folders)

In [None]:
# Verify the result: list the saved arrays and check their shapes
data = np.load(file_name_save)
print(data.files)

data_readable_X, data_readable_y = data['X'], data['y']
print('X Shape:', data_readable_X.shape)
print('y Shape:', data_readable_y.shape)