# Data and Project Prep for Aquaculture Mask R-CNN

This notebook contains code related to processing and organizing Planet images (PlanetScope 3B Analytic SR products) and their associated object annotations for use by a Mask R-CNN model. 

The project data is organized within the `data` directory as follows:
  + `aqua`: root directory for aquaculture data
      + `planet`: full PlanetScope 3B Analytic SR scenes downloaded from Planet. 
      + `gridded_planet`: image chips (256x256) created from the raw PlanetScope scenes in `planet` and a JSON file with corresponding object annotations
      + `train`: subset of image chips for use in model training
          + `image`: raw GeoTiff image chip from `gridded_planet`
          + `masks`: instance-specific GeoTiff masks for each object
      + `test`: subset of image chips for use in model testing
          + `image`: raw GeoTiff image chip from `gridded_planet`
          + `masks`: instance-specific GeoTiff masks for each object

## Setup

Load packages:

In [1]:
import os
import sys
import pathlib
import math
import random
import rasterio # requires gdal to be installed from conda forge w/ "conda install -c conda-forge gdal"
import numpy as np
import pandas as pd
import skimage.io as skio
import matplotlib
import matplotlib.pyplot as plt
import copy
import skimage.draw
from skimage import measure

import aqua_preprocess as pp # preprocess script with directory variables

%matplotlib inline

## Image Processing

This project relies on Planet images that have been annotated with the locations of aquaculture farms for the training and testing of a Mask R-CNN model. These annotations have been generated in two ways:

1. Fully annotated Planet scenes - annotations stored as binary bands of full Planet scenes stored in `planet` with filenames ending with `_labels.tif` 

2. JSON annotations - annotations created for previously created image chips (256x256 pixels) and recorded in `json` files stored within individual scene folders in `gridded_planet`

For both annotation methods, image chips and object instance masks are created and stored in the `train` directory. Each image chip is stored in its own directory that contains two subdirectories: `image_id/image`, containing the raw GeoTiff image, and `image_id/masks`, containing the image's object instance masks. 

At the time of model training, the `train` directory will then be sampled and a subset placed in `test`. 
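
The notebook does not show the sampling step itself, so the following is only a minimal sketch of one way to do it. The function name `split_train_test`, the `test_fraction` and `seed` parameters, and the decision to move (rather than copy) chip directories are all assumptions made for illustration.

```python
import random
import shutil
import tempfile
from pathlib import Path

def split_train_test(train_dir, test_dir, test_fraction=0.2, seed=42):
    """Move a random subset of chip directories from train_dir to test_dir."""
    train_dir, test_dir = Path(train_dir), Path(test_dir)
    test_dir.mkdir(parents=True, exist_ok=True)

    # Each chip lives in its own directory; ignore hidden files such as .DS_Store
    chips = sorted(p.name for p in train_dir.iterdir()
                   if p.is_dir() and not p.name.startswith("."))

    # A seeded generator keeps the split reproducible between runs
    rng = random.Random(seed)
    n_test = int(len(chips) * test_fraction)
    test_chips = rng.sample(chips, n_test)

    for chip in test_chips:
        shutil.move(str(train_dir / chip), str(test_dir / chip))
    return sorted(test_chips)

# Demonstration on a throwaway tree holding ten empty chip directories
with tempfile.TemporaryDirectory() as tmp:
    train, test = Path(tmp) / "train", Path(tmp) / "test"
    for i in range(10):
        (train / f"chip_{i}" / "image").mkdir(parents=True)
        (train / f"chip_{i}" / "masks").mkdir(parents=True)
    moved = split_train_test(train, test, test_fraction=0.2)
    n_train_left = len(list(train.iterdir()))
    n_test = len(list(test.iterdir()))
```

Moving whole chip directories keeps each image together with its masks, so a chip can never end up with its image in `train` and its masks in `test`.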

### Fully Annotated Planet Scenes

For each labeled Planet scene in `planet`, the following function will create a series of 256x256 image chips and instance masks and place them in the `train` directory.

In [5]:
# Check data directory
print(pp.DATASET)

/home/tclavelle/tana-crunch/CropMask_RCNN/data/aqua


In [7]:
# Define a function to create image chips with masks of every GeoTiff file in a directory
def prep_labeled_scenes(tiff_directory, prep_directory):
    
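    # Get all labeled GeoTiff filenames in specified directory
    files = np.array(os.listdir(tiff_directory))
    # Match the suffix literally (regex=False) so '.' is not treated as a wildcard
    tiffs = pd.Series(files).str.contains('_labels.tif', regex=False)
    files = files[tiffs] 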
    
    # Loop over files
    for filename in files:
        
        # Get image name to use for creating directory
        image_name = filename.split("_")[0:3]
        image_name = "%s_%s_%s" % (image_name[0], image_name[1], image_name[2])
        
        # Image directory and subdirectories
        image_dir = prep_directory + '/' + image_name + '/'        
        
        # Print filenames
        print('filename: ' + filename + '\n' + 'image name: ' + image_name)
        
        # Iterate over image blocks - which are 256x256 - and save new GeoTiffs
        with rasterio.open(os.path.join(tiff_directory, filename)) as src:
            
            # Get block dimensions of src
            for ji, window in src.block_windows(1):
                
                # read B,G,R,NIR band
                r = src.read((1,2,3,4), window=window)
                
                # Skip image if missing data
                if 0 in r:
                    continue
           
                else:
                    
                    # Create chip id
                    chip_name = image_name + '_' + str(ji[0]) + '_' + str(ji[1])                    
                    
                    # Create directory for image chip and subdirectories for image and labels
                    chip_dir = prep_directory + '/' + chip_name + '/'                    
                    img_dir = chip_dir + '/image/'
                    mask_dir = chip_dir + '/class_masks/'
                    
                    # list of directories to map over
                    dirs = [chip_dir, img_dir, mask_dir]
                    
                    # Make chip directory and subdirectories
                    for d in dirs:
                        pathlib.Path(d).mkdir(parents=True, exist_ok=True)
                    
                    # Open a new GeoTiff data file in which to save the image chip
                    with rasterio.open((img_dir + chip_name + '.tif'), 'w', driver='GTiff',
                               height=r.shape[1], width=r.shape[2], count=4,
                               dtype=rasterio.uint16, crs=src.crs, 
                               transform=src.window_transform(window)) as new_img:

                        # Write the image chip to the new GeoTiff, georeferenced to its own window
                        new_img.write(r)
                
                """Load and save mask as separate tif file(s), one for each class"""
                # Count number of mask bands (bands - 4)
                masks = src.count - 4
                
                if masks < 2:
                    # read mask
                    m = src.read(5, window=window)
                    
                    # Open a new Tiff data file in which to save the image mask (use class 1 for now)                    
                    with rasterio.open((mask_dir + chip_name + '_line_mask.tif'), 'w', driver='GTiff',
                                       height=m.shape[0], width=m.shape[1], count=1,
                                       dtype=rasterio.uint16, crs=src.crs, 
                                       transform=src.window_transform(window)) as new_img:
                        # Write the mask to the new GeoTiff            
                        new_img.write(m, 1)
                
                else:
                                        
                    # Loop over every mask band (bands 5 onward), not just the first and last
                    for a in range(1, masks + 1):
                        
                        # read mask
                        m = src.read(4 + a, window=window)
                        
                        # set type of aquaculture class
                        if a == 1: 
                            types = 'raft'
                        else: 
                            types = 'line'
                    
                        # Open a new Tiff data file in which to save the image mask. Label with class number                    
                        with rasterio.open((mask_dir + chip_name + '_' + types + '_mask.tif'), 'w', driver='GTiff',
                                           height=m.shape[0], width=m.shape[1], count=1,
                                           dtype=rasterio.uint16, crs=src.crs, 
                                           transform=src.window_transform(window)) as new_img:
                            # Write the mask to the new GeoTiff            
                            new_img.write(m, 1)
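
To make the chipping logic easier to follow, here is a small pure-Python sketch of how a scene can be divided into 256x256 tiles and how each tile's block index becomes part of the chip name. It is an illustration only: `tile_windows` and `chip_name` are hypothetical helpers, not rasterio functions, and the real `src.block_windows(1)` call follows the file's internal block layout, which the notebook assumes to be 256x256.

```python
def tile_windows(height, width, size=256):
    """Yield ((block_row, block_col), (row_off, col_off, n_rows, n_cols)) per tile."""
    for i, row_off in enumerate(range(0, height, size)):
        for j, col_off in enumerate(range(0, width, size)):
            # Tiles on the bottom and right edges may be smaller than `size`
            n_rows = min(size, height - row_off)
            n_cols = min(size, width - col_off)
            yield (i, j), (row_off, col_off, n_rows, n_cols)

def chip_name(image_name, block_index):
    """Build a chip id the same way the notebook does: scene name + block row + block col."""
    return f"{image_name}_{block_index[0]}_{block_index[1]}"

# Hypothetical 600 x 520 pixel scene, using a scene id from the notebook output
tiles = list(tile_windows(600, 520))
names = [chip_name("20180410_020422_0f31", idx) for idx, _ in tiles]
```

In this sketch the edge tiles come out smaller than 256 pixels on a side. If the model expects fixed-size chips, such tiles would need to be skipped or padded, which the function above does not do explicitly.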

Run the function to process the labeled Planet scenes:

In [9]:
# Run the chipping function for the labeled scenes in the 'planet' directory and save in 'prepped' directory
scene_dir = os.path.join(pp.DATASET, 'planet')

# Run function on complete labeled Planet scenes                            
prep_labeled_scenes(scene_dir, pp.PREPPED)

filename: 20180410_020422_0f31_3B_AnalyticMS_SR_labels.tif
image name: 20180410_020422_0f31
filename: 20180409_014042_1015_3B_AnalyticMS_SR_labels.tif
image name: 20180409_014042_1015


### Annotated Image Chips

For the Planet scenes that were annotated after being segmented into image chips, the same process of creating directories for each image chip and mask in `train` is performed. Annotations for all image chips created from a single Planet scene are recorded in a JSON file stored in that scene's folder within `gridded_planet`. 
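
As a rough illustration of the annotation structure the function below expects, this sketch parses a small VIA-style JSON document and groups polygons by class for each chip. The sample data, the chip filenames, and the helper name `polygons_by_class` are made up for the example; only the keys (`filename`, `regions`, `shape_attributes`, `region_attributes`, `class`) come from the notebook code. Unlike the notebook, which skips a whole chip when a class label is missing, this sketch skips only the unlabeled region.

```python
import json
from collections import defaultdict

# Hypothetical VIA export: one annotated chip and one chip without regions
via_json = """
{
  "chip_a.png1234": {
    "filename": "chip_a.png",
    "size": 1234,
    "regions": [
      {"shape_attributes": {"name": "polygon",
                            "all_points_x": [10, 50, 50, 10],
                            "all_points_y": [10, 10, 40, 40]},
       "region_attributes": {"class": "raft"}},
      {"shape_attributes": {"name": "polygon",
                            "all_points_x": [100, 140, 120],
                            "all_points_y": [100, 100, 130]},
       "region_attributes": {"class": "line"}},
      {"shape_attributes": {"name": "polygon",
                            "all_points_x": [200, 220, 220],
                            "all_points_y": [200, 200, 230]},
       "region_attributes": {"class": "raft"}}
    ]
  },
  "chip_b.png999": {"filename": "chip_b.png", "size": 999, "regions": []}
}
"""

def polygons_by_class(annotations):
    """Map chip name -> {class name -> list of (xs, ys) polygon outlines}."""
    grouped = {}
    for entry in annotations.values():
        if not entry["regions"]:
            continue  # VIA keeps images without annotations; skip them
        chip = entry["filename"].split(".png")[0]
        per_class = defaultdict(list)
        for region in entry["regions"]:
            cls = region["region_attributes"].get("class")
            if cls is None:
                continue  # region has no class label assigned
            shape = region["shape_attributes"]
            per_class[cls].append((shape["all_points_x"], shape["all_points_y"]))
        grouped[chip] = dict(per_class)
    return grouped

grouped = polygons_by_class(json.loads(via_json))
```

Each class then gets a single mask per chip, with all polygons of that class drawn into it.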

In [None]:
import json

def prep_labeled_chips(chip_dir):
    
    """Takes a directory of Planet image chips and a JSON file with object annotations and
    creates a 'masks' directory containing object instance masks"""
    
    # Find all gridded planet scenes
    scenes = np.array(os.listdir(chip_dir))
    
    # Find all files in scene directories
    scene_files = [os.listdir(os.path.join(chip_dir, scene)) for scene in scenes]
    
    # Flatten list of lists
    scene_files = sum(scene_files, [])
    
    # Pull out label files
    scene_labels = [file for file in scene_files if "_labels_" in file]
    
    # Loop over annotated scenes
    for label in scene_labels:
        
        # Pull out scene names of labels
        scene = label.split("_labels")[0]
    
        # Set directory for label
        scene_dir = os.path.join(chip_dir, scene)
        
        # Create "class_masks" directory to store chip masks
        masks_dir = os.path.join(scene_dir, 'class_masks')
        pathlib.Path(masks_dir).mkdir(parents=True, exist_ok=True)
               
        # We mostly care about the x and y coordinates of each region
        # Use a context manager so the JSON file is closed after loading
        with open(os.path.join(scene_dir, label)) as f:
            annotations = json.load(f)
        annotations = list(annotations.values())  # don't need the dict keys

        # The VIA tool saves images in the JSON even if they don't have any
        # annotations. Skip unannotated images.
        annotations = [a for a in annotations if a['regions']]
    
        for a in annotations:    
                        
            chip = a['filename'].split('.png')[0]
            print(chip)            
            
            # Read geotiff for chip
            gtiff = scene_dir +  '/chips/' + chip + '.tif'
            src = rasterio.open(gtiff)
            
            # Use try to only extract masks for chips with complete annotations and class labels
            try:
                                        
                """Code for processing VGG annotations from Matterport balloon color splash sample"""
                # Load annotations
                # VGG Image Annotator saves each image in the form:
                # { 'filename': '28503151_5b5b7ec140_b.jpg',
                #   'regions': {
                #       '0': {
                #           'region_attributes': {},
                #           'shape_attributes': {
                #               'all_points_x': [...],
                #               'all_points_y': [...],
                #               'name': 'polygon'}},
                #       ... more regions ...
                #   },
                #   'size': 100202
                # } 
        
                # Get the aquaculture class of each polygon    
                polygon_types = [r['region_attributes'] for r in a['regions']]        

                # Get unique aquaculture classes in annotations
                types = set(val for dic in polygon_types for val in dic.values())            

                for t in types:
                    # Get the x, y coordinates of points of the polygons that make up
                    # the outline of each object instance. These are stored in the
                    # shape_attributes (see json format above) 

                    # Pull out polygons of that type               
                    polygons = [r['shape_attributes'] for r in a['regions'] if r['region_attributes']['class'] == t]            

                    # Draw mask using height and width of Geotiff
                    mask = np.zeros([src.height, src.width], dtype=np.uint8)

                    for p in polygons:

                        # Get indexes of pixels inside the polygon and set them to 1
                        rr, cc = skimage.draw.polygon(p['all_points_y'], p['all_points_x'])                    
                        mask[rr, cc] = 1            

                    # Open a new GeoTiff data file in which to save the class mask
                    with rasterio.open((masks_dir + '/' + chip + '_' + str(t) + '_mask.tif'), 'w', driver='GTiff',
                               height=src.shape[0], width=src.shape[1], count=1,
                               dtype=rasterio.ubyte, crs=src.crs, 
                               transform=src.transform) as new_img:

                        # Write the mask to the new GeoTiff
                        new_img.write(mask.astype('uint8'),1)

            except KeyError:                
                print(chip + ' missing aquaculture class assignment')
                # write chip name to file for double checking
                continue

Run chip labeling function:

In [None]:
# Run function                    
prep_labeled_chips(PLANET_DIR)

### Move Gridded Images and Masks to Prepped Directory

After creating class-specific masks from the JSON annotations, we need to move each image and its mask(s) to the `prepped_planet` folder with the other images.
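
The move step relies on a filename convention: the first five underscore-separated fields of a mask filename form the chip id, and the first 20 characters of the chip id name the scene directory. The sketch below spells that convention out. The helper `parse_mask_filename` and the example filename are hypothetical, and the class label extraction assumes mask names end in `_<class>_mask.tif`, as written by the functions above.

```python
def parse_mask_filename(mask_filename):
    """Split a mask filename into (scene_id, chip_id, class_label)."""
    parts = mask_filename.split("_")

    # First five fields: date, time, satellite id, block row, block column
    chip_id = "_".join(parts[:5])

    # The scene id is the first 20 characters, e.g. 20180410_020422_0f31
    scene_id = chip_id[:20]

    # Everything between the chip id and the trailing 'mask.tif' is the class label
    class_label = "_".join(parts[5:-1])
    return scene_id, chip_id, class_label

scene_id, chip_id, class_label = parse_mask_filename(
    "20180410_020422_0f31_3_7_raft_mask.tif"
)
```

Because the scene id is taken by character position rather than by field, this only works while scene ids are exactly 20 characters long.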

In [None]:
from shutil import copyfile
from shutil import copy2

# Function to move images and masks to train folder
def move_gridded_images(chips_dir, prep_dir):
    
    """Takes a directory of Planet image chips and class masks and moves chips and their masks to 
    their own directories within the train directory"""
    
    # Find all gridded planet scenes
    scenes = np.array(next(os.walk(chips_dir))[1])   
    # Find scenes that have 'class_masks' directory and thus have had masks prepared
    scenes = [s for s in scenes if 'class_masks' in os.listdir(os.path.join(chips_dir, s))]
    
    # Get unique image chip ids for mask files
    masks = [os.listdir(os.path.join(chips_dir, scene, 'class_masks')) for scene in scenes]
    masks = sum(masks, []) # flatten nested lists
    masks = [mask for mask in masks if mask != '.DS_Store'] # remove macOS .DS_Store files
    
    # Get set of chip ids for masks
    chips = [mask.split('_')[0:5] for mask in masks]
    chips = set("%s_%s_%s_%s_%s" % (m[0],m[1],m[2],m[3],m[4]) for m in chips)
    
    # Loop over mask chip ids and copy image and masks to train folder
    for chip in chips:
                
        # Create directory for chip, chip image, and chip class masks in PREPPED_DIR
        chip_dir = os.path.join(prep_dir, chip)
        chip_image_dir = os.path.join(chip_dir, 'image')
        chip_masks_dir = os.path.join(chip_dir, 'class_masks')
        
        # list of directories to map over
        dirs = [chip_dir, chip_image_dir, chip_masks_dir]

        # Make chip directory and subdirectories
        for d in dirs:
            pathlib.Path(d).mkdir(parents=True, exist_ok=True)
        
        # Image chip location. Chips are stored in scene directories, so use first 20 chrs
        # to indicate scene directory of chip
        scene_dir = os.path.join(chips_dir, chip[0:20])
        chip_filename = scene_dir + '/chips/' + chip + '.tif'
        
        # get chip mask files
        chip_masks = [mask for mask in masks if chip + '_' in mask]
        mask_filenames = [scene_dir + '/class_masks/' + mask for mask in chip_masks]

        print(scene_dir)
        print(chip_filename)
        print(mask_filenames)
        
        # Copy image chip and masks from scene directory in 'gridded_planet' 
        # to chip directory in 'prepped_planet'
        # Copy chips
        copy2(chip_filename, chip_image_dir)
        # Copy masks
        for m in mask_filenames:
            copy2(m, chip_masks_dir)
                
move_gridded_images(PLANET_DIR, PREPPED_DIR)

## Remove Empty Images

Find prepped images that don't contain any objects:

In [7]:
def imgs_without_objects(directory):

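    # Get directories inside prepped folder, skipping hidden files such as .DS_Store
    # (list.remove would raise ValueError when no .DS_Store file is present)
    images = [i for i in os.listdir(directory) if not i.startswith('.')]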

    # List to store files with no instances
    no_objects = []

    # For each file, check if any class mask file exists
    for i in images:

        # get list of class masks
        masks = os.listdir(os.path.join(directory, i,'class_masks'))

        # Empty vector of instances
        instances = []

        # Loop over masks and calculate instances
        for m in masks:

            # Read from the directory passed to the function rather than a hard-coded path
            arr = skio.imread(os.path.join(directory, i, 'class_masks', m))
            blob_labels = measure.label(arr, background=0)
            blob_vals = np.sum(np.unique(blob_labels))
            instances.append(blob_vals)

        # Find total number of instances
        if np.sum(instances) == 0:
            no_objects.append(i)

    return no_objects
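
For intuition about what the connected-component labeling contributes here, the following pure-Python sketch counts groups of touching nonzero pixels in a small binary mask. It is a stand-in written for illustration, not the scikit-image implementation: `count_instances` is a hypothetical helper, it treats diagonal neighbours as connected, and it only considers whether a pixel is zero or nonzero.

```python
from collections import deque

def count_instances(mask):
    """Count 8-connected groups of nonzero pixels in a 2D list-of-lists mask."""
    n_rows = len(mask)
    n_cols = len(mask[0]) if n_rows else 0
    seen = [[False] * n_cols for _ in range(n_rows)]
    count = 0

    for r in range(n_rows):
        for c in range(n_cols):
            if mask[r][c] == 0 or seen[r][c]:
                continue
            # Found an unvisited object pixel: flood-fill the whole blob
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < n_rows and 0 <= nc < n_cols
                                and mask[nr][nc] != 0 and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
    return count

empty_mask = [[0, 0, 0],
              [0, 0, 0]]
two_blobs = [[1, 1, 0, 0, 0],
             [1, 1, 0, 0, 0],
             [0, 0, 0, 0, 1],
             [0, 0, 0, 1, 1]]

n_empty = count_instances(empty_mask)
n_two = count_instances(two_blobs)
```

A chip whose masks all yield zero instances contains no objects and is a candidate for removal.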

In [9]:
# Get list of images without objects
images_to_remove = imgs_without_objects(pp.TRAIN)
print(images_to_remove)

[]


In [18]:
import shutil
# Remove images from PREPPED DIR
for f in images_to_remove:
    if not f.startswith('.'):
        shutil.rmtree(os.path.join(pp.PREPPED,f))