## General Info


This is the dataset currently loaded:

1. [SPARCS Dataset ~2GB](https://www.usgs.gov/core-science-systems/nli/landsat/spatial-procedures-automated-removal-cloud-and-shadow-sparcs)


These are some other options we have:

1. [Landsat Validation Data ~100GB](https://www.usgs.gov/core-science-systems/nli/landsat/landsat-8-cloud-cover-assessment-validation-data?qt-science_support_page_related_con=1#qt-science_support_page_related_con)

2. [Kaggle Dataset ~20GB](https://www.kaggle.com/sorour/95cloud-cloud-segmentation-on-satellite-images)

## Download Data

Download the SPARCS dataset. For each image it contains:
  1. a satellite TIFF file (a format that carries multiple spectral bands besides RGB)
  2. a text file with metadata about the image
  3. a satellite image PNG
  4. a satellite mask PNG (with colors representing the mask classes)

In [1]:
import tensorflow as tf
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt
import numpy as np

In [None]:
# download the SPARCS dataset
dl_manager = tfds.download.DownloadManager(download_dir='junk', extract_dir='/content/clouds')
data_url = 'https://landsat.usgs.gov/cloud-validation/sparcs/l8cloudmasks.zip'
dataset_path = dl_manager.download_and_extract(data_url)
dataset_path += '/sending' # the archive unpacks into a 'sending' subfolder (a USGS quirk)

## Read Data into Dataset

In [12]:
# convenience kwargs
parallel_map_kwargs = dict(
  num_parallel_calls=tf.data.AUTOTUNE,
  deterministic=False)

In [13]:
# Given an image path, read in both the image and its mask,
# returned as one stacked tensor of shape (2, height, width, 3)
@tf.function
def read_img_and_mask(img_path):
    # read img at specified path
    img = tf.io.read_file(img_path)
    img = tf.image.decode_png(img, channels=3)
    # read corresponding mask (whose path replaces 'photo' w/ 'mask')
    mask_path = tf.strings.regex_replace(img_path, "photo", "mask")
    mask = tf.io.read_file(mask_path)
    mask = tf.image.decode_png(mask, channels=3)
    return tf.stack([img, mask])
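
The mask lookup above relies purely on a naming convention: each `*photo.png` is expected to have a sibling whose name swaps `photo` for `mask`. A minimal, framework-free sketch of a sanity check for that assumption follows. `find_unpaired` and the `scene_*` file names are hypothetical and only serve the illustration. Unlike the regex in `read_img_and_mask`, it swaps the word in the file name only, so a folder that happens to contain `photo` in its name is left alone.

```python
from pathlib import Path
import tempfile


def find_unpaired(folder):
    """Return names of '*photo.png' files that have no matching mask file."""
    missing = []
    for photo in sorted(Path(folder).glob("*photo.png")):
        # swap 'photo' for 'mask' in the file name only, not in the folder part
        mask = photo.with_name(photo.name.replace("photo", "mask"))
        if not mask.exists():
            missing.append(photo.name)
    return missing


# tiny demo on a throwaway folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ("scene_a_photo.png", "scene_a_mask.png", "scene_b_photo.png"):
        (Path(tmp) / name).touch()
    demo_missing = find_unpaired(tmp)

print(demo_missing)  # ['scene_b_photo.png']
```

Running `find_unpaired(dataset_path)` before building the dataset would be one way to catch a missing mask early, rather than hitting a file-not-found error inside the input pipeline.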

In [14]:
# creates a dataset consisting of image file paths
ds = tf.data.Dataset.list_files(dataset_path + "/*photo.png")
# read in each image and its mask using those file paths 
ds = ds.map(read_img_and_mask, **parallel_map_kwargs)

CARDINALITY = ds.cardinality()

In [15]:
# take n random crops of an image and its mask
@tf.function
def sample_crop(dp, w, h, n):
  crops = [tf.image.random_crop(dp, (2, w, h, 3)) for i in range(n)]
  crops = tf.stack(crops)
  crops = tf.data.Dataset.from_tensor_slices(crops)
  return crops
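
Cropping the stacked tensor, rather than the image and the mask separately, is what keeps the two aligned: a single random offset is drawn and applied to both. The sketch below shows that idea on plain nested lists, without TensorFlow. `paired_crop` is a hypothetical helper, not part of the notebook.

```python
import random


def paired_crop(img, mask, crop_h, crop_w, rng):
    """Cut the same randomly placed window out of img and mask."""
    # draw ONE offset and reuse it for both arrays
    top = rng.randrange(len(img) - crop_h + 1)
    left = rng.randrange(len(img[0]) - crop_w + 1)

    def window(rows):
        return [row[left:left + crop_w] for row in rows[top:top + crop_h]]

    return window(img), window(mask)


# toy 6x6 "image" whose pixel value encodes its position, and a "mask"
# that stores the same position code, so aligned crops must be identical
img = [[r * 10 + c for c in range(6)] for r in range(6)]
mask = [[r * 10 + c for c in range(6)] for r in range(6)]

img_crop, mask_crop = paired_crop(img, mask, 3, 3, random.Random(0))
print(img_crop == mask_crop)  # True
```

Drawing separate offsets for the image and the mask would break this equality, which is the failure that cropping the stacked `(2, ...)` tensor avoids.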

In [16]:
# randomly crop each img (and its mask) several times
n, w, h = 5, 128, 128
ds = ds.interleave(lambda dp: sample_crop(dp, w, h, n), **parallel_map_kwargs)

# tf.data can't infer the cardinality after interleave, so we set it explicitly
CARDINALITY *= n
ds = ds.apply(tf.data.experimental.assert_cardinality(CARDINALITY))
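
The bookkeeping here is simple arithmetic: every image yields `n` crops, so the element count is the number of images times `n`. The later train/test split then sets aside `CARDINALITY // 5` elements for testing. The sketch below works this through in plain Python. `split_sizes` is a hypothetical helper, and the image counts are made-up example values, not numbers taken from the dataset.

```python
def split_sizes(num_images, crops_per_image, test_divisor=5):
    """Return (total, test, train) element counts for a take/skip split."""
    total = num_images * crops_per_image
    test = total // test_divisor   # mirrors ds.take(CARDINALITY // 5)
    train = total - test           # ds.skip(CARDINALITY // 5) keeps the rest
    return total, test, train


# example: 80 images (made-up count) with n = 5 crops each
print(split_sizes(80, 5))  # (400, 80, 320)
# integer division rounds the test set down when the total is not a multiple of 5
print(split_sizes(7, 3))   # (21, 4, 17)
```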

In [17]:
# split the stacked (2, h, w, 3) tensor back into a dict of image and mask
@tf.function
def prepare(dp):
  img, mask = tf.unstack(dp)
  return {'img': img, 'mask': mask}

# scale the image to [0, 1] and turn the RGB mask into integer class labels
@tf.function
def normalize(dp):
  img = tf.cast(dp['img'], tf.float32) / 255.0
  # convert to single channel
  mask = tf.image.rgb_to_grayscale(dp['mask'])
  # assuming all grayscale pixels are 0, 127, or 255, convert to labels 0, 1, 2
  mask = tf.math.floordiv(mask, 127)
  return {'img': img, 'mask': mask}

In [18]:
# normalize and put in correct format
ds = ds.map(prepare, **parallel_map_kwargs)
ds = ds.map(normalize, **parallel_map_kwargs)
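
The floor division in `normalize` maps the three expected grayscale values to class indices: 0 // 127 = 0, 127 // 127 = 1 and 255 // 127 = 2. Any other value is mapped silently as well (126 becomes 0, 254 becomes 2), so it may be worth checking that a mask really contains only those three values. A small sketch in plain Python, where `gray_to_label` and `unexpected_values` are hypothetical helpers:

```python
EXPECTED_GRAY = {0, 127, 255}  # values the notebook assumes after grayscale conversion


def gray_to_label(value):
    """Same arithmetic as tf.math.floordiv(mask, 127), for one pixel."""
    return value // 127


def unexpected_values(pixels):
    """Return the set of pixel values outside the expected three."""
    return set(pixels) - EXPECTED_GRAY


labels = {v: gray_to_label(v) for v in sorted(EXPECTED_GRAY)}
print(labels)  # {0: 0, 127: 1, 255: 2}

# a made-up row of mask pixels containing one stray value
stray = unexpected_values([0, 127, 126, 255])
print(stray)  # {126}
```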

In [19]:
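# random shuffle; datasets are immutable, so keep the returned dataset.
# Don't reshuffle on each pass, otherwise take/skip would select
# different elements every epoch.
ds = ds.shuffle(buffer_size=CARDINALITY, reshuffle_each_iteration=False)
# split into train and test
test_ds = ds.take(CARDINALITY // 5)
train_ds = ds.skip(CARDINALITY // 5)
# prefetch for optimal performance
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
test_ds = test_ds.prefetch(tf.data.AUTOTUNE)
test_ds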

<PrefetchDataset shapes: {img: (128, 128, 3), mask: (128, 128, 1)}, types: {img: tf.float32, mask: tf.uint8}>

## Build Model

Refer to the [TensorFlow Image Segmentation tutorial](https://www.tensorflow.org/tutorials/images/segmentation) for next steps.

In [20]:
# we have two datasets, train_ds & test_ds
# each is a dataset of {'img': img, 'mask': mask} dicts
# where img is a tensor of shape (128, 128, 3)
# and mask is a tensor of shape (128, 128, 1)