# Overview

This Jupyter notebook downloads and preprocesses Sentinel 1 and 2 tiles for large areas (at least 40 sq km). The workflow entails generating the tile coordinates, downloading the raw data, and processing (cloud and shadow removal, gap interpolation, indices, and superresolution).

The notebook is broken down into the following sections:

   * **Parameter definition**
   * **Projection functions**
   * **Data download functions**
   * **Cloud and shadow removal functions**
   * **Superresolution functions**
   * **Tile and folder management functions**
   * **Function execution**

To download new Sentinel data, you need an API key for the data provider [Sentinel Hub](https://www.sentinel-hub.com). If you do not have an API key but have access to Sentinel imagery, the input data for this notebook is an entire year of:
  * Cloud masks
  * L1C bands 2, 8A, 11
  * 10 m and 20 m L2A bands
  * VV-VH Sentinel 1 bands
  * Digital elevation model
  
  
The data are tiled into 6300 m x 6300 m windows. An example of the raw data can be downloaded by running the following cell. These data can be preprocessed (cloud interpolation, superresolution, smoothing, etc.) by running the rest of the notebook. Predictions can then be generated by running `4b-predict-large-area`.

# 1.0 Package Imports

In [1]:
import pandas as pd
import numpy as np
from random import shuffle
from osgeo import ogr, osr
from sentinelhub import WmsRequest, WcsRequest, MimeType, CRS, BBox, constants
import logging
from collections import Counter
import datetime
import os
import yaml
from sentinelhub import DataSource
import scipy.sparse as sparse
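from scipy.sparse.linalg import splu
from scipy.ndimage import median_filter  # used by download_dem, unless a %run script below redefines it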
from skimage.transform import resize
from sentinelhub import CustomUrlParam
from time import sleep
import multiprocessing
import math
import reverse_geocoder as rg
import pycountry
import pycountry_convert as pc
import hickle as hkl
from shapely.geometry import Point, Polygon
import geopandas
from tqdm import tnrange, tqdm_notebook
import boto3
from pyproj import Proj, transform
from timeit import default_timer as timer
from typing import Tuple, List
import warnings

In [2]:
if os.path.exists("../config.yaml"):
    with open("../config.yaml", 'r') as stream:
        key = (yaml.safe_load(stream))
        API_KEY = key['key']
        AWSKEY = key['awskey']
        AWSSECRET = key['awssecret']
else:
    API_KEY = "none"

In [3]:
%run ../src/preprocessing/slope.py
%run ../src/preprocessing/indices.py
%run ../src/downloading/utils.py
%run ../src/preprocessing/cloud_removal.py
%run ../src/preprocessing/whittaker_smoother.py
%run ../src/io/upload.py

# 1.1 Constants and Parameters

Currently, Sentinel Hub only serves data for 2018 and 2019. Data for 2017 are expected to become available in summer 2020.

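Each landscape is identified by name and mapped to a `(lat, long)` coordinate pair. In the cells below, the name is looked up in `database.csv`, which stores the `latitude`, `longitude`, and `path` of each landscape.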

In [4]:
year = 2019
landscape = 'malawi-rumphi'

if year > 2017:
    dates = (f'{str(year - 1)}-12-01' , f'{str(year + 1)}-02-01')
else: 
    dates = (f'{str(year)}-01-01' , f'{str(year + 1)}-02-01')
    
dates_sentinel_1 = (f'{str(year)}-01-01' , f'{str(year)}-12-31')
SIZE = 9*5
IMSIZE = (7*2) + (SIZE * 14)+2 # 646 px, i.e. 6460 x 6460 m blocks at 10 m resolution

days_per_month = [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30]
starting_days = np.cumsum(days_per_month)

In [5]:
database = pd.read_csv("../project-monitoring/database.csv")
coords = database[database['landscape'] == landscape]
path = coords['path'].tolist()[0]
coords = (float(coords['longitude'].iloc[0]), float(coords['latitude'].iloc[0]))

IO_PARAMS = {'prefix': '../',
             'bucket': 'restoration-monitoring',
             'coords': coords,
             'bucket-prefix': '',
             'path': path}

OUTPUT_FOLDER = IO_PARAMS['prefix'] + IO_PARAMS['path'] + str(year) + '/'
print(coords, OUTPUT_FOLDER)

(33.241888, -11.134766) ../project-monitoring/zambia/eastern/chama/2019/


In [6]:
to_append = pd.DataFrame({'landscape': ['malawi-rumphi'],
                             'latitude': ['-11.134766'],
                             'longitude': ['33.241888'],
                             'path': [get_folder_prefix((-11.134766,  33.241888),
                                                        params = {'bucket-prefix': 'project-monitoring'})]})
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
database = pd.concat([database, to_append])
#database.to_csv("../project-monitoring/database.csv", index = False)


Loading formatted geocoded file...


# 2.1 Data download functions

If using Sentinel Hub, configure the following layers (layer name: return value):
  * CLOUD: return [CLP / 255]
  * SHADOW: return [B02, B8A, B11]
  * DEM: return [DEM]
  * SENT: return [VV, VH]
  * L2A10: return [B02,B03,B04, B08]
  * L2A20: return [B05,B06,B07, B8A,B11,B12]

In [7]:
def extract_dates(date_dict: dict, year: int) -> List:
    """ Transforms the acquisition dates returned by SentinelHub to a
         list of integer day offsets relative to 1 January of `year`
         (365-day years are assumed, so leap days are ignored)
    """
    dates = []
    days_per_month = [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30]
    starting_days = np.cumsum(days_per_month)
    for date in date_dict:
        if date.year == year - 1:
            dates.append(-365 + starting_days[(date.month-1)] + date.day)
        if date.year == year:
            dates.append(starting_days[(date.month-1)] + date.day)
        if date.year == year + 1:
            dates.append(365 + starting_days[(date.month-1)]+date.day)
    return dates
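
The offset arithmetic in `extract_dates` can be checked in isolation. The sketch below is a standalone illustration rather than the notebook's function: `day_offset` is a hypothetical helper that rebuilds the same month-offset table with the standard library and, like the original, assumes 365-day years. Unlike the original, it accepts any year instead of only the previous, current, and following one.

```python
import datetime
from itertools import accumulate

# Same table as the notebook: a leading 0 followed by the lengths of
# January..November, so the cumulative sum gives the number of days
# that precede each month in a non-leap year.
DAYS_PER_MONTH = [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30]
STARTING_DAYS = list(accumulate(DAYS_PER_MONTH))


def day_offset(date: datetime.date, year: int) -> int:
    """Hypothetical helper: day index of `date` relative to 1 January of `year`."""
    year_shift = 365 * (date.year - year)  # -365, 0 or +365 in the notebook
    return year_shift + STARTING_DAYS[date.month - 1] + date.day


examples = [datetime.date(2018, 12, 1),
            datetime.date(2019, 1, 1),
            datetime.date(2019, 3, 1),
            datetime.date(2020, 1, 15)]
offsets = [day_offset(d, 2019) for d in examples]
print(offsets)  # [-30, 1, 60, 380]
```

Dates from the preceding December map to negative offsets, which is why the download window can start on 1 December of the previous year without colliding with dates in the target year.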


### Cloud and cloud shadow

In [8]:
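def identify_clouds(bbox: List[Tuple[float, float]],
                epsg: 'CRS', dates: Tuple[str, str] = dates
                ) -> Tuple[np.ndarray, np.ndarray, list, np.ndarray]:
    """ Downloads and calculates cloud cover and shadow
        
        Parameters:
         bbox (list): output of calc_bbox
         epsg (CRS): coordinate reference system associated with bbox 
         dates (tuple): YYYY-MM-DD - YYYY-MM-DD bounds for downloading 
    
        Returns:
         cloud_img (np.array): (T, IMSIZE, IMSIZE) cloud probabilities in [0, 1]
                               for the retained time steps
         shadows (np.array): shadow mask returned by mcm_shadow_mask
         clean_steps (list): indices of the time steps that passed the cloud filter
         cloud_dates (np.array): day offsets of the retained time steps
    """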
    # Download 160 x 160 meter cloud masks, 0 - 255
    box = BBox(bbox, crs = epsg)
    cloud_request = WcsRequest(
        layer='CLOUD_NEW',
        bbox=box, time=dates,
        resx='160m',resy='160m',
        image_format = MimeType.TIFF_d8,
        maxcc=0.75, instance_id=API_KEY,
        custom_url_params = {constants.CustomUrlParam.UPSAMPLING: 'NEAREST'},
        time_difference=datetime.timedelta(hours=72),
    )

    # Download 20 x 20 meter bands for shadow masking, 0 - 65535
    shadow_request = WcsRequest(
        layer='SHADOW',
        bbox=box, time=dates,
        resx='20m', resy='20m',
        image_format = MimeType.TIFF_d16,
        maxcc=0.75, instance_id=API_KEY,
        custom_url_params = {constants.CustomUrlParam.UPSAMPLING: 'NEAREST'},
        time_difference=datetime.timedelta(hours=72))

    cloud_img = np.array(cloud_request.get_data())
    if not isinstance(cloud_img.flat[0], np.floating):
        assert np.max(cloud_img) > 1
        print(f"Original cloud image max is {np.max(cloud_img)}, {cloud_img.shape}")
        cloud_img = cloud_img / 255
    assert np.max(cloud_img) <= 1., f'The max cloud probability is {np.max(cloud_img)}'
    
    c_probs_pus = ((40*40)/(512*512)) *(1/3)*cloud_img.shape[0]
    print(f"Cloud_probs used {round(c_probs_pus, 1)} processing units")
    
    cloud_img = resize(cloud_img, (cloud_img.shape[0], IMSIZE, IMSIZE), order = 0)
    n_cloud_px = np.sum(cloud_img > 0.33, axis = (1, 2))
    cloud_steps = np.argwhere(n_cloud_px > (IMSIZE**2 * 0.15))
    clean_steps = [x for x in range(cloud_img.shape[0]) if x not in cloud_steps]
    
    # Extract cloud, shadow dates
    cloud_dates_dict = [x for x in cloud_request.get_dates()]
    cloud_dates = extract_dates(cloud_dates_dict, year)
    cloud_dates = [val for idx, val in enumerate(cloud_dates) if idx in clean_steps]
    shadow_dates_dict = [x for x in shadow_request.get_dates()]
    shadow_dates = extract_dates(shadow_dates_dict, year)
    shadow_steps = [idx for idx, val in enumerate(shadow_dates) if val in cloud_dates]
    
    shadow_img = np.array(shadow_request.get_data(data_filter = shadow_steps))
    shadow_pus = (shadow_img.shape[1]*shadow_img.shape[2])/(512*512) * shadow_img.shape[0]
    shadow_img = resize(shadow_img, (shadow_img.shape[0], IMSIZE, IMSIZE, shadow_img.shape[-1]), order = 0)
    print(f"There are {len(np.unique(shadow_img))} unique shadow L1C values")
    
    print(f"The max shadows is {np.max(shadow_img)}")
    if not isinstance(shadow_img.flat[0], np.floating):
        assert np.max(shadow_img) > 1
        shadow_img = shadow_img / 65535
    assert np.max(shadow_img) <= 1
 
    cloud_img = np.delete(cloud_img, cloud_steps, 0)
    assert shadow_img.shape[0] == cloud_img.shape[0], (shadow_img.shape, cloud_img.shape)
    shadows = mcm_shadow_mask(np.array(shadow_img), cloud_img) # TODO: verify that this makes sense
    print(f"Shadows ({shadows.shape}) used {round(shadow_pus, 1)} processing units")
    n_always_shadow = np.min(shadows, axis = (1, 2))
    print(f"There are {np.sum(n_always_shadow)} pixels that always have shadows")
    
    return cloud_img, shadows, clean_steps, np.array(cloud_dates)
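
The cloud filter inside `identify_clouds` can be tried on synthetic data. The following is a minimal sketch that reuses the notebook's two thresholds (a per-pixel probability of 0.33 and a 15% cloudy-pixel limit per image); the array sizes and the constant names `CLOUD_PROB_THRESHOLD` and `MAX_CLOUDY_FRACTION` are illustrative assumptions, and no Sentinel Hub request is made.

```python
import numpy as np

CLOUD_PROB_THRESHOLD = 0.33   # per-pixel probability, as in the notebook
MAX_CLOUDY_FRACTION = 0.15    # share of cloudy pixels allowed per image

# Synthetic stack of 4 cloud-probability images, 10 x 10 pixels each
cloud_probs = np.zeros((4, 10, 10))
cloud_probs[1, :5, :] = 0.9   # 50% of image 1 is cloudy, so it is rejected
cloud_probs[3, :1, :] = 0.5   # 10% of image 3 is cloudy, so it is kept

# Count cloudy pixels per image and flag images above the allowed fraction
n_cloud_px = np.sum(cloud_probs > CLOUD_PROB_THRESHOLD, axis=(1, 2))
n_pixels = cloud_probs.shape[1] * cloud_probs.shape[2]
cloud_steps = np.argwhere(n_cloud_px > n_pixels * MAX_CLOUDY_FRACTION).flatten()
clean_steps = [i for i in range(cloud_probs.shape[0]) if i not in cloud_steps]
print(clean_steps)  # [0, 2, 3]
```

Only the indices in `clean_steps` are then used to select which shadow and L2A images to download, which keeps processing-unit usage down.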



### Digital elevation model, slope

In [9]:
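def download_dem(bbox: List[Tuple[float, float]], epsg: 'CRS') -> np.ndarray:
    """ Downloads the DEM layer from Sentinel Hub and passes it through calcSlope
        
        Parameters:
         bbox (list): output of calc_bbox
         epsg (CRS): coordinate reference system associated with bbox 
    
        Returns:
         dem_image (arr): (648, 648, 1) output of calcSlope on the
                          median-filtered DEM, with a 1 px border removed
    """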

    box = BBox(bbox, crs = epsg)
    dem_size = 650
    dem_request = WmsRequest(data_source=DataSource.DEM,
                         layer='DEM', bbox=box,
                         width=dem_size, height=dem_size,
                         instance_id=API_KEY,
                         image_format=MimeType.TIFF_d32f,
                         custom_url_params={CustomUrlParam.SHOWLOGO: False})
    dem_image = dem_request.get_data()[0]
    dem_image = median_filter(dem_image, size = 5)
    dem_image = calcSlope(dem_image.reshape((1, dem_size, dem_size)),
                          np.full((dem_size, dem_size), 10), 
                          np.full((dem_size, dem_size), 10), zScale = 1, minSlope = 0.02)
    dem_image = dem_image.reshape((dem_size,dem_size, 1))
    dem_image = dem_image[1:dem_size-1, 1:dem_size-1, :]
    print(f"DEM used {round(((IMSIZE*IMSIZE)/(512*512))*2, 1)} processing units")
    print(f"There are {len(np.unique(dem_image))} unique DEM values")
    return dem_image

###  Sentinel 2 L2A, 10 and 20 meter bands

In [10]:
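def download_layer(bbox: List[Tuple[float, float]],
                   clean_steps: np.ndarray, epsg: 'CRS',
                   dates: Tuple[str, str] = dates, year: int = year
                   ) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """ Downloads the L2A sentinel layer with 10 and 20 meter bands
        
        Parameters:
         bbox (list): output of calc_bbox
         clean_steps (list): cloud-free dates used to filter the download request
         epsg (CRS): coordinate reference system associated with bbox 
         dates (tuple): YYYY-MM-DD - YYYY-MM-DD bounds for downloading 
         year (int): year used to convert acquisition dates to day offsets
    
        Returns:
         img_10 (arr): 10 meter bands, clipped to [0, 1]
         img_20 (arr): 20 meter bands, clipped to [0, 1]
         dates_to_download (arr): day offsets of the downloaded images
    """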
    box = BBox(bbox, crs = epsg)
    image_request = WcsRequest(
            layer='L2A20',
            bbox=box, time=dates,
            image_format = MimeType.TIFF_d16,
            maxcc=0.75, resx='20m', resy='20m',
            instance_id=API_KEY,
            custom_url_params = {constants.CustomUrlParam.DOWNSAMPLING: 'NEAREST',
                                constants.CustomUrlParam.UPSAMPLING: 'NEAREST'},
            time_difference=datetime.timedelta(hours=72),
        )
    # TODO: add a code block here that aligns these dates with the cloud probability dates
    image_dates_dict = [x for x in image_request.get_dates()]
    image_dates = extract_dates(image_dates_dict, year)
    steps_to_download = [i for i, val in enumerate(image_dates) if val in clean_steps]
    dates_to_download = [val for i, val in enumerate(image_dates) if val in clean_steps]
    print(f"The cloud-free image dates are {dates_to_download}")
    
    img_bands = image_request.get_data(data_filter = steps_to_download)
    img_20 = np.stack(img_bands)
    if not isinstance(img_20.flat[0], np.floating):
        assert np.max(img_20) > 2
        print(f"Converting S2, 20m to float32, with {np.max(img_20)} max and"
              f" {len(np.unique(img_20))} unique values")
        img_20 = img_20.astype(np.float32)
        img_20 = img_20 / 65535.
        assert np.max(img_20) <= 2.
    
    s2_20_usage = (img_20.shape[1]*img_20.shape[2])/(512*512) * (6/3) * img_20.shape[0]
    print(f"Original 20 meter bands size: {img_20.shape}, using {round(s2_20_usage, 1)} PU")
    if img_20.shape[1]*img_20.shape[2] != 323*323:
        print(f"Reshaping: {img_20.shape}")
        img_20 = resize(img_20, (img_20.shape[0], 323, 323, img_20.shape[-1]), order = 0)

    image_request = WcsRequest(
            layer='L2A10',
            bbox=box, time=dates,
            image_format = MimeType.TIFF_d16,
            maxcc=0.75, resx='10m', resy='10m',
            instance_id=API_KEY,
            custom_url_params = {constants.CustomUrlParam.DOWNSAMPLING: 'BICUBIC',
                                constants.CustomUrlParam.UPSAMPLING: 'BICUBIC'},
            time_difference=datetime.timedelta(hours=72),
    )
    
    img_bands = image_request.get_data(data_filter = steps_to_download)
    img_10 = np.stack(img_bands)#.astype(np.float32)
    if not isinstance(img_10.flat[0], np.floating):
        assert np.max(img_10) > 2
        print(f"Converting S2, 10m to float32, with {np.max(img_10)} max and"
                  f" {len(np.unique(img_10))} unique values")
        img_10 = img_10.astype(np.float32)
        img_10 = img_10 / 65535.
    assert np.max(img_10) <= 2.
    
    s2_10_usage = (img_10.shape[1]*img_10.shape[2])/(512*512) * (4/3) * img_10.shape[0]
    if img_10.shape[2]*img_10.shape[1] != IMSIZE*IMSIZE:
        print(f"Reshaping: {img_10.shape}")
        img_10 = resize(img_10, (img_10.shape[0], IMSIZE, IMSIZE, img_10.shape[-1]), order = 0)
    
    
    img_10 = np.clip(img_10, 0, 1)
    img_20 = np.clip(img_20, 0, 1)
    return img_10, img_20, np.array(dates_to_download)
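
The download functions print processing-unit (PU) estimates that all follow one pattern: image area relative to a 512 x 512 tile, times the number of bands divided by three, times the number of images. The sketch below collects that pattern into a hypothetical helper, `estimate_processing_units`. It mirrors the notebook's own bookkeeping and should not be read as Sentinel Hub's official billing formula; the count of 30 images is an assumed example value.

```python
def estimate_processing_units(height: int, width: int,
                              n_bands: int, n_images: int) -> float:
    """Hypothetical helper mirroring the notebook's usage estimates."""
    area_factor = (height * width) / (512 * 512)  # area relative to 512 x 512 px
    band_factor = n_bands / 3                     # three bands count as one unit
    return area_factor * band_factor * n_images


# 20 m bands: 6 bands on a 323 x 323 grid, assuming 30 cloud-free images
pu_20m = estimate_processing_units(323, 323, n_bands=6, n_images=30)
# 10 m bands: 4 bands on a 646 x 646 grid for the same 30 images
pu_10m = estimate_processing_units(646, 646, n_bands=4, n_images=30)
print(round(pu_20m, 1), round(pu_10m, 1))
```

Under these assumptions the 10 m request costs 8/3 times as much as the 20 m request, since it covers four times the pixels with two thirds of the bands.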

### Sentinel 1 IW bands

In [11]:
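def download_sentinel_1(bbox: List[Tuple[float, float]],
                        epsg: 'CRS', imsize: int = IMSIZE, 
                        dates: Tuple[str, str] = dates_sentinel_1, layer: str = "SENT",
                        year: int = year) -> Tuple[np.ndarray, np.ndarray]:
    """ Downloads the GRD Sentinel 1 VV-VH layer from Sentinel Hub
        
        Parameters:
         bbox (list): output of calc_bbox
         epsg (CRS): coordinate reference system associated with bbox 
         imsize (int): tile width and height in pixels, used for the out-of-bounds check
         dates (tuple): YYYY-MM-DD - YYYY-MM-DD bounds for downloading 
         layer (str): Sentinel Hub layer, SENT (ascending) or SENT_DESC (descending)
         year (int): year used to convert acquisition dates to day offsets
    
        Returns:
         s1 (arr): VV-VH images clipped to [0, 1], or an empty array if none were found
         image_dates (arr): day offsets of the images in s1
    """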
    source = DataSource.SENTINEL1_IW_DES if layer == "SENT_DESC" else DataSource.SENTINEL1_IW_ASC
    box = BBox(bbox, crs = epsg)
    image_request = WcsRequest(
            layer=layer, bbox=box,
            time=dates,
            image_format = MimeType.TIFF_d16,
            data_source=source, maxcc=1.0,
            resx='10m', resy='10m',
            instance_id=API_KEY,
            custom_url_params = {constants.CustomUrlParam.DOWNSAMPLING: 'NEAREST',
                                constants.CustomUrlParam.UPSAMPLING: 'NEAREST'},
            time_difference=datetime.timedelta(hours=72),
        )
    
    data_filter = [x for x in range(len(image_request.download_list))]
    if len(image_request.download_list) > 40:
        data_filter = [x for x in range(len(image_request.download_list)) if x % 2 == 0]
        
    if len(image_request.download_list) > 0:
        img_bands = image_request.get_data(data_filter = data_filter)
        s1 = np.stack(img_bands)#.astype(np.float32)
        if not isinstance(s1.flat[0], np.floating):
            assert np.max(s1) > 2
            print(f"Converting s1 to float32, with {np.max(s1)} max and"
                  f" {len(np.unique(s1))} unique values")
            s1 = s1.astype(np.float32)
            s1 = s1 / 65535.

        s1_usage = (2/3) * s1.shape[0] * ((s1.shape[1]*s1.shape[2]) / (512*512))
        print(f"Sentinel 1 used {round(s1_usage, 1)} PU for "
              f"{s1.shape[0]} out of {len(image_request.download_list)} images")

        image_dates_dict = [x for x in image_request.get_dates()]
        image_dates = extract_dates(image_dates_dict, year)
        image_dates = [val for idx, val in enumerate(image_dates) if idx in data_filter]
        image_dates = np.array(image_dates)

        s1c = np.copy(s1)
        s1c[np.where(s1c < 1.)] = 0
        s1c[np.where(s1c >= 1.)] = 1.
        n_pix_oob = np.sum(s1c, axis = (1, 2, 3))
        print(n_pix_oob / (imsize*2*imsize*2))
        to_remove = np.argwhere(n_pix_oob > (imsize*2*imsize*2)/50)
        s1 = np.delete(s1, to_remove, 0)
        image_dates = np.delete(image_dates, to_remove)
        s1 = np.clip(s1, 0, 1)
        return s1, image_dates
    else: 
        return np.empty((0,)), np.empty((0,))


def identify_s1_layer(coords: Tuple[float, float]) -> str:
    """ Identifies whether to download ascending or descending 
        sentinel 1 orbit based upon predetermined geographic coverage
        
        Reference: https://sentinel.esa.int/web/sentinel/missions/
                   sentinel-1/satellite-description/geographical-coverage
        
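        Parameters:
         coords (tuple): (latitude, longitude) of the tile
    
        Returns:
         layer (str): either of SENT, SENT_DESC 
    """
    results = rg.search(coords)
    country = results[-1]['cc']
    continent_name = pc.country_alpha2_to_continent_code(country)
    # Default to the ascending orbit so that continents without a rule below
    # (e.g. 'EU') do not leave `layer` undefined
    layer = "SENT"
    if continent_name in ['AF', 'OC']:
        layer = "SENT"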
    if continent_name in ['SA']:
        if coords[0] > -7.11:
            layer = "SENT"
        else:
            layer = "SENT_DESC"
    if continent_name in ['AS']:
        if coords[0] > 23.3:
            layer = "SENT"
        else:
            layer = "SENT_DESC"
    if continent_name in ['NA']:
        layer = "SENT_DESC"
    print(f"The continent is: {continent_name}, and the sentinel 1 orbit is {layer}")
    return layer
    
    
def process_sentinel_1_tile(sentinel1: np.ndarray, dates: np.ndarray) -> np.ndarray:
    """Converts a (?, X, Y, 2) Sentinel 1 array to (24, X, Y, 2)

        Parameters:
         sentinel1 (np.array): (?, X, Y, 2) array of VV-VH images
         dates (np.array): day offsets of the images in sentinel1

        Returns:
         s1 (np.array): (24, X, Y, 2) array with one image every 15 days
    """
    s1, _ = calculate_and_save_best_images(sentinel1, dates)
    # Five-day steps; only those on a 15-day boundary are kept below
    biweekly_dates = np.array([day for day in range(0, 360, 5)])
    to_remove = np.argwhere(biweekly_dates % 15 != 0)
    s1 = np.delete(s1, to_remove, 0)
    return s1
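
The reduction to 24 time steps in `process_sentinel_1_tile` can be sketched with plain lists. This assumes, as the date list in the function implies but the notebook does not show, that `calculate_and_save_best_images` returns one composite per five-day step (72 in total); the variable names below are illustrative.

```python
# 72 five-day steps covering a 360-day year: 0, 5, 10, ..., 355
five_day_steps = list(range(0, 360, 5))

# Keep only the steps that fall on a 15-day boundary (every third step)
kept_indices = [i for i, day in enumerate(five_day_steps) if day % 15 == 0]
kept_days = [five_day_steps[i] for i in kept_indices]

print(len(five_day_steps), len(kept_indices), kept_days[:4])  # 72 24 [0, 15, 30, 45]
```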
    

# 2.3 Superresolution

In [12]:
import tensorflow as tf
sess = tf.Session()
from keras import backend as K
K.set_session(sess)

MDL_PATH = "../models/supres/"

model = tf.train.import_meta_graph(MDL_PATH + 'model.meta')
model.restore(sess, tf.train.latest_checkpoint(MDL_PATH))

logits = tf.get_default_graph().get_tensor_by_name("Add_6:0")
inp = tf.get_default_graph().get_tensor_by_name("Placeholder:0")
inp_bilinear = tf.get_default_graph().get_tensor_by_name("Placeholder_1:0")

def superresolve(input_data, bilinear_upsample):
    """ Worker function to run predictions on input data
    """
    x = sess.run([logits], 
                 feed_dict={inp: input_data,
                            inp_bilinear: bilinear_upsample})
    return x[0]


def superresolve_tile(arr: np.ndarray) -> np.ndarray:
    """Superresolves each 60x60 subtile in a 646x646 input tile
       by padding the subtiles to 64x64 and removing the pad after prediction,
       eliminating boundary artifacts

        Parameters:
         arr (arr): (?, 646, 646, 10) array

        Returns:
         superresolved (arr): (?, 646, 646, 10) array
    """
    print(f"The input array to superresolve is {arr.shape}")
    tiles = tile_window(646, 646, 60, 60)
    for i in tnrange(len(tiles)):
        subtile = tiles[i]
        pad_l = 0 if subtile[0] >= 2 else 2
        pad_r = 0 if subtile[0] < (644 - 60) else 2
        pad_u = 0 if subtile[1] >= 2 else 2
        pad_d = 0 if subtile[1] < (644 - 60) else 2
        to_resolve = arr[:, np.max([subtile[0]-2, 0]):subtile[0]+62,
                            np.max([subtile[1]-2, 0]):subtile[1]+62, :]
        to_resolve = np.pad(to_resolve, ((0, 0), (pad_l, pad_r), (pad_u, pad_d), (0, 0)), 'reflect')
        
        bilinear = to_resolve[..., 4:]
        
        resolved = superresolve(
            to_resolve, bilinear)
        resolved = resolved[:, 2:-2, 2:-2, :]
        arr[:, subtile[0]:subtile[0]+60, subtile[1]:subtile[1]+60, 4:] = resolved
    return arr

Using TensorFlow backend.
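
The pad-and-crop idea behind `superresolve_tile` can be shown without TensorFlow. In this minimal sketch, `fake_model` is a hypothetical identity stand-in for the network and `process_in_tiles` is an illustrative helper. It pads the whole image once instead of padding each subtile, and it assumes the image divides evenly into 60-pixel tiles, which the notebook's 646-pixel tiles do not.

```python
import numpy as np

TILE, PAD = 60, 2  # 60 x 60 core with a 2-pixel halo on each side gives 64 x 64


def fake_model(patch: np.ndarray) -> np.ndarray:
    """Stand-in for the superresolution network (identity on the input)."""
    assert patch.shape[:2] == (TILE + 2 * PAD, TILE + 2 * PAD)
    return patch


def process_in_tiles(image: np.ndarray) -> np.ndarray:
    """Run `fake_model` on padded tiles and write back only the unpadded core."""
    size = image.shape[0]
    assert size % TILE == 0, "this sketch assumes the image divides evenly into tiles"
    # Reflect-pad once so every tile, including edge tiles, has a full halo
    padded = np.pad(image, ((PAD, PAD), (PAD, PAD)), mode="reflect")
    out = np.empty_like(image)
    for x in range(0, size, TILE):
        for y in range(0, size, TILE):
            patch = padded[x:x + TILE + 2 * PAD, y:y + TILE + 2 * PAD]
            resolved = fake_model(patch)
            # Discard the halo, where a real network's predictions are least reliable
            out[x:x + TILE, y:y + TILE] = resolved[PAD:-PAD, PAD:-PAD]
    return out


image = np.random.rand(120, 120)
result = process_in_tiles(image)
print(np.allclose(result, image))  # True: tiles are stitched back without gaps
```

With an identity model the stitched output equals the input, which confirms that the slicing neither drops nor duplicates pixels at tile boundaries.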


# 2.4 Tiling and folder management functions

In [13]:
# move to src/utils/pathing.py
def make_output_and_temp_folders(idx: str, output_folder: str = OUTPUT_FOLDER) -> None:
    """Makes necessary folder structures for IO of raw and processed data

        Parameters:
         idx (str)
         output_folder (path)

        Returns:
         None
    """
    def _find_and_make_dirs(dirs):
        if not os.path.exists(os.path.realpath(dirs)):
            os.makedirs(os.path.realpath(dirs))
            
    _find_and_make_dirs(output_folder + "raw/")
    _find_and_make_dirs(output_folder + "raw/clouds/")
    _find_and_make_dirs(output_folder + "raw/s1/")
    _find_and_make_dirs(output_folder + "raw/s2_10/")
    _find_and_make_dirs(output_folder + "raw/s2_20/")
    _find_and_make_dirs(output_folder + "raw/misc/")
    _find_and_make_dirs(output_folder + "processed/")
    _find_and_make_dirs(output_folder + "interim/")
    _find_and_make_dirs(output_folder + "raw/s2/")

# move to src/utils/typing.py
def to_int16(array: np.ndarray) -> np.ndarray:
    '''Converts a float32 array in [0, 1] to uint16, reducing storage costs by three-fold'''
    assert np.min(array) > -0.01, np.min(array)
    assert np.max(array) < 1.01, np.max(array)
    
    array = np.clip(array, 0, 1)
    array = np.trunc(array * 65535)
    assert np.min(array) >= 0
    assert np.max(array) <= 65535
    
    return array.astype(np.uint16)

# move to src/utils/typing.py
def to_float32(array: np.ndarray) -> np.ndarray:
    '''Converts a uint16 array to float32 in [0, 1]; float input is only cast'''
    divide = 1. if isinstance(array.flat[0], np.floating) else 65535
    return np.float32(array) / divide

def id_missing_px(sentinel2, thresh = 11):
    '''Identifies time steps where the number of missing (0) or saturated (>= 1)
       band values is at least (image width ** 2) / thresh'''
    missing_images = np.sum(sentinel2[..., :10] == 0.0, axis = (1, 2, 3))
    missing_images_p = np.sum(sentinel2[..., :10] >= 1., axis = (1, 2, 3))
    missing_images = missing_images + missing_images_p
    print(missing_images)
    missing_images = np.argwhere(missing_images >= (sentinel2.shape[1]**2) / thresh)
    missing_images = missing_images.flatten()
    return missing_images
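
The quantization performed by `to_int16` and undone by `to_float32` loses a small, bounded amount of precision. The sketch below repeats the same arithmetic on four sample values to show the size of the round-trip error; the sample values and variable names are illustrative.

```python
import numpy as np

original = np.array([0.0, 0.25, 0.5, 1.0], dtype=np.float32)

# Quantize to 16-bit unsigned integers, as `to_int16` does
quantized = np.trunc(np.clip(original, 0, 1) * 65535).astype(np.uint16)
# Recover floats, as `to_float32` does for integer input
restored = np.float32(quantized) / 65535

max_error = float(np.max(np.abs(restored - original)))
print(quantized.dtype, original.nbytes, quantized.nbytes, max_error < 1 / 65535)
```

Because the values are truncated rather than rounded, the restored value is never above the original and the error stays below one quantization step (1/65535).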
 

# Download worker function

In [14]:
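def download_large_tile(coord: tuple,
                        step_x: int,
                        step_y: int,
                        folder: str = OUTPUT_FOLDER, 
                        year: int = year,
                        s1_layer: str = "SENT") -> None:
    """Wrapper function to download cloud probs, Sentinel 2, Sentinel 1, and DEM

        Parameters:
         coord (tuple): (longitude, latitude) of the landscape
         step_x (int): tile offset in the x direction
         step_y (int): tile offset in the y direction
         folder (path): output folder for the raw data
         year (int): year to download
         s1_layer (str): Sentinel 1 layer, SENT or SENT_DESC

        Returns:
         None
    """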
    bbx, epsg = calculate_bbx_pyproj(coord, step_x, step_y, expansion = 80)
    dem_bbx, _ = calculate_bbx_pyproj(coord, step_x, step_y, expansion = 90)
    idx = str(step_y) + "_" + str(step_x)
    idx = str(idx)
    make_output_and_temp_folders(idx)
    
    output_path = f"{folder}output/{str(step_y*5)}/{str(step_x*5)}.npy"
    process_path = f"{folder}processed/{str(step_y*5)}/{str(step_x*5)}.npy"
    if not (os.path.exists(output_path) or os.path.exists(process_path)):
        clouds_file = f'{folder}raw/clouds/clouds_{idx}.hkl'
        shadows_file = f'{folder}raw/clouds/shadows_{idx}.hkl'
        s1_file = f'{folder}raw/s1/{idx}.hkl'
        s1_dates_file = f'{folder}raw/misc/s1_dates_{idx}.hkl'
        s2_10_file = f'{folder}raw/s2_10/{idx}.hkl'
        s2_20_file = f'{folder}raw/s2_20/{idx}.hkl'
        s2_dates_file = f'{folder}raw/misc/s2_dates_{idx}.hkl'
        s2_file = f'{folder}raw/s2/{idx}.hkl'
        clean_steps_file = f'{folder}raw/clouds/clean_steps_{idx}.hkl'

        if not os.path.exists(clouds_file):
            # All this needs to be int16, copied to cloud with io.save_file
            print(f"Downloading clouds because {clouds_file} does not exist")
            cloud_probs, shadows, _, image_dates = identify_clouds(bbx, epsg = epsg)
            to_remove, _ = calculate_cloud_steps(cloud_probs, image_dates)
            print(to_remove)
            clean_dates = np.delete(image_dates, to_remove)
            cloud_probs = np.delete(cloud_probs, to_remove, 0)
            shadows = np.delete(shadows, to_remove, 0)
            hkl.dump(cloud_probs, clouds_file, mode='w', compression='gzip')
            hkl.dump(shadows, shadows_file, mode='w', compression='gzip')
            hkl.dump(clean_dates, clean_steps_file, mode='w', compression='gzip')

        if not os.path.exists(s1_file):
            print(f"Downloading S1 because {s1_file} does not exist")
            s1_layer = identify_s1_layer((coord[1], coord[0]))
            s1, s1_dates = download_sentinel_1(bbx, layer = s1_layer, epsg = epsg)
            if s1.shape[0] == 0:
                s1_layer = "SENT_DESC" if s1_layer == "SENT" else "SENT"
                print(f'Switching to {s1_layer}')
                s1, s1_dates = download_sentinel_1(bbx, layer = s1_layer, epsg = epsg)
            s1 = process_sentinel_1_tile(s1, s1_dates)
            hkl.dump(to_int16(s1), s1_file, mode='w', compression='gzip')
            hkl.dump(s1_dates, s1_dates_file, mode='w', compression='gzip')

        if not os.path.exists(s2_10_file):
            # All this needs to be int16, copied to cloud with io.save_file
            print(f"Downloading S2 because {s2_10_file} does not exist")
            clean_steps = list(hkl.load(clean_steps_file))
            cloud_probs = hkl.load(clouds_file)
            shadows = hkl.load(shadows_file)    
            s2_10, s2_20, s2_dates = download_layer(bbx, clean_steps = clean_steps, epsg = epsg)

            # Steps to ensure that L2A, L1C derived products have exact matching dates
            print(f"Shadows {shadows.shape}, clouds {cloud_probs.shape}, S2, {s2_10.shape}, S2d, {s2_dates.shape}")
            to_remove_clouds = [i for i, val in enumerate(clean_steps) if val not in s2_dates]
            to_remove_dates = [val for i, val in enumerate(clean_steps) if val not in s2_dates]
            if len(to_remove_clouds) >= 1:
                print(f"Removing {to_remove_dates} from clouds because not in S2")
                cloud_probs = np.delete(cloud_probs, to_remove_clouds, 0)
                shadows = np.delete(shadows, to_remove_clouds, 0)
                print(f"Shadows {shadows.shape}, clouds {cloud_probs.shape}"
                      f" S2, {s2_10.shape}, S2d, {s2_dates.shape}")
                hkl.dump(cloud_probs, clouds_file, mode='w', compression='gzip')
                hkl.dump(shadows, shadows_file, mode='w', compression='gzip')

            assert cloud_probs.shape[0] == s2_10.shape[0], "There is a date mismatch"
            hkl.dump(to_int16(s2_10), s2_10_file, mode='w', compression='gzip')
            hkl.dump(to_int16(s2_20), s2_20_file, mode='w', compression='gzip')
            hkl.dump(s2_dates, s2_dates_file, mode='w', compression='gzip')

        if not os.path.exists(folder + "raw/misc/dem_{}.hkl".format(idx)):
            dem = download_dem(dem_bbx, epsg = epsg)
            hkl.dump(dem, folder + "raw/misc/dem_{}.hkl".format(idx), mode='w', compression='gzip')

In [15]:
# move to src/utils/pathing.py
def make_folder_names(step_x: int, step_y: int) -> (list, list):
    '''Given an input tile location (step_x, step_y), identify the folder and file
       names for each 5x5 subtile
       
       Parameters:
         step_x (int):
         step_y (int):

        Returns:
         x_vals (list)
         y_vals (list)
    '''
    x_vals = []
    y_vals = []
    for i in range(25):
        y_val = (24 - i) // 5
        x_val = 5 - ((25 - i) % 5)
        x_val = 0 if x_val == 5 else x_val
        x_vals.append(x_val)
        y_vals.append(y_val)
    y_vals = [i + (5*step_y) for i in y_vals]
    x_vals = [i + (5*step_x) for i in x_vals]
    return x_vals, y_vals


def process_large_tile(coord: tuple,
                       step_x: int,
                       step_y: int,
                       folder: str = OUTPUT_FOLDER,
                       model: 'model' = model) -> tuple:
    '''Wrapper function to interpolate clouds and temporal gaps, superresolve tiles,
       and append the DEM-derived band to the Sentinel 2 time series
       
       Parameters:
        coord (tuple)
        step_x (int):
        step_y (int):
        folder (str):

       Returns:
        x (arr): processed time series, or None if the tile was already processed
        image_dates (arr): dates of the time steps in x, or None
        interp (arr): interpolation mask from remove_cloud_and_shadows, or None
    '''
    idx = str(step_y) + "_" + str(step_x)
    x_vals, y_vals = make_folder_names(step_x, step_y)

    processed = True
    for x, y in zip(x_vals, y_vals):
        folder_path = f"{str(y)}/{str(x)}"
        processed_exists = os.path.exists(folder + "processed/" + folder_path + ".hkl")
        output_exists = os.path.exists(folder + "output/" + folder_path + ".npy")
        if not (processed_exists or output_exists):
            processed = False
    if not processed:
        print(f"Processing because folder {folder_path}.npy does not exist")

        clouds = hkl.load(f'{folder}raw/clouds/clouds_{idx}.hkl')
        sentinel1 = to_float32(hkl.load(f'{folder}raw/s1/{idx}.hkl'))
        sentinel2_10 = to_float32(hkl.load(f'{folder}raw/s2_10/{idx}.hkl'))
        sentinel2_20 = to_float32(hkl.load(f'{folder}raw/s2_20/{idx}.hkl'))
        dem = hkl.load(f'{folder}raw/misc/dem_{idx}.hkl')
        image_dates = hkl.load(f'{folder}raw/misc/s2_dates_{idx}.hkl')
        shadows = hkl.load(f'{folder}raw/clouds/shadows_{idx}.hkl')  
        
        sentinel2 = np.empty((sentinel2_10.shape[0], 646, 646, 10))
        sentinel2[..., :4] = sentinel2_10
        for band in range(6):
            for time in range(sentinel2.shape[0]):
                sentinel2[time, ..., band + 4] = resize(sentinel2_20[time,..., band], (646, 646), 2)
    
        to_remove, _ = calculate_cloud_steps(clouds, image_dates)
        print(sentinel2.shape, clouds.shape, shadows.shape, image_dates.shape)
        if len(to_remove) > 0:
            sentinel2 = np.delete(sentinel2, to_remove, axis = 0)
            clouds = np.delete(clouds, to_remove, axis = 0)
            shadows = np.delete(shadows, to_remove, axis = 0)
            image_dates = np.delete(image_dates, to_remove)
        print(f"{to_remove} cloudy and missing images removed")

        missing_px = id_missing_px(sentinel2, 3)
        if len(missing_px) > 0:
            print(f"Removing {missing_px} dates due to missing data")
            clouds = np.delete(clouds, missing_px, axis = 0)
            shadows = np.delete(shadows, missing_px, axis = 0)
            image_dates = np.delete(image_dates, missing_px)
            sentinel2 = np.delete(sentinel2, missing_px, axis = 0)
                    
        x, interp = remove_cloud_and_shadows(sentinel2, clouds, shadows, image_dates) 
        to_remove = np.argwhere(np.mean(interp, axis = (1, 2)) > 0.5)
        print(f"{len(to_remove)} steps removed because of >50% interpolation rate")
        x = np.delete(x, to_remove, axis = 0)
        clouds = np.delete(clouds, to_remove, axis = 0)
        shadows = np.delete(shadows, to_remove, axis = 0)
        image_dates = np.delete(image_dates, to_remove)
        interp = np.delete(interp, to_remove, 0)
                
        x = superresolve_tile(np.float32(x))
        dem_i = np.tile(dem[np.newaxis, 1:-1, 1:-1, :], (x.shape[0], 1, 1, 1))
        dem_i = dem_i / 90
        dem_i[dem_i > 0.25] = 0.25
        x = np.concatenate([x, dem_i], axis = -1)
        x = np.clip(x, 0, 1)
        return x, image_dates, interp
    else:
        return None, None, None
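
The last filtering step in `process_large_tile` drops any time step in which more than half of the pixels were interpolated. The sketch below reproduces that rule on toy data. The helper name `drop_heavily_interpolated` and the array sizes are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def drop_heavily_interpolated(x, interp, threshold=0.5):
    """Remove time steps whose interpolated-pixel fraction exceeds `threshold`.

    x:      (time, height, width, bands) image stack
    interp: (time, height, width) mask, 1 where a pixel was interpolated
    (Hypothetical helper for illustration only.)
    """
    frac = np.mean(interp, axis=(1, 2))                  # fraction per time step
    to_remove = np.argwhere(frac > threshold).flatten()  # 1-D indices to drop
    x = np.delete(x, to_remove, axis=0)
    interp = np.delete(interp, to_remove, axis=0)
    return x, interp, to_remove

# Toy data: 4 time steps of 8 x 8 pixels with 3 bands
x = np.zeros((4, 8, 8, 3), dtype=np.float32)
interp = np.zeros((4, 8, 8))
interp[1] = 1          # step 1 is fully interpolated (100%)
interp[3, :2] = 1      # step 3 is 25% interpolated

x_kept, interp_kept, removed = drop_heavily_interpolated(x, interp)
print(removed, x_kept.shape)
```

Only step 1 exceeds the 50% threshold here, so three of the four time steps survive.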
        

In [16]:
INPUT_FOLDER = "/".join(OUTPUT_FOLDER.split("/")[:-2]) + "/"

def interpolate_na_vals(s2):
    '''Interpolates NA values with closest time steps, to deal with
       the small potential for NA values in calculating indices'''
    for x_loc in range(s2.shape[1]):
        for y_loc in range(s2.shape[2]):
            n_na = np.sum(np.isnan(s2[:, x_loc, y_loc, :]), axis = 1)
            for date in range(s2.shape[0]):
                if n_na.flatten()[date] > 0:
                    before, after = calculate_proximal_steps(date, np.argwhere(n_na == 0))
                    s2[date, x_loc, y_loc, :] = ((s2[date + before, x_loc, y_loc] + 
                                                 s2[date + after, x_loc, y_loc]) / 2)
    numb_na = np.sum(np.isnan(s2), axis = (1, 2, 3))
    if np.sum(numb_na) > 0:
        print(f"NA values remaining per date: {numb_na}")
    return s2

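def process_subtiles(coord: tuple,
                     step_x: int,
                     step_y: int,
                     year: int = 2019,
                     path: str = INPUT_FOLDER,
                     s2: np.ndarray = None,
                     dates: np.ndarray = None,
                     interp: np.ndarray = None) -> None:
    '''Wrapper function to interpolate clouds and temporal gaps, superresolve tiles,
       calculate relevant indices, and save analysis-ready data to the output folder
       
       Parameters:
        coord (tuple): coordinates of the area of interest (not used in the function body)
        step_x (int): x index of the large tile
        step_y (int): y index of the large tile
        year (int): year of the imagery, used to build the input and output paths
        path (str): base folder containing the {year}/raw and {year}/processed subfolders
        s2 (np.ndarray): Sentinel-2 array returned by process_large_tile
        dates (np.ndarray): image dates matching the first axis of s2
        interp (np.ndarray): interpolation mask returned by process_large_tile

       Returns:
        None
    '''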
    idx = str(step_y) + "_" + str(step_x)
    x_vals, y_vals = make_folder_names(step_x, step_y)
    s1 = hkl.load(f"{path}/{year}/raw/s1/{idx}.hkl")

    s2 = evi(s2, verbose = True)
    s2 = bi(s2, verbose = True)
    s2 = msavi2(s2, verbose = True)
    s2 = si(s2, verbose = True)
    s2 = interpolate_na_vals(s2)

    index = 0
    tiles = tile_window(IMSIZE, IMSIZE, window_size = 142)
    for t in tiles:
        start_x, start_y = t[0], t[1]
        end_x = start_x + t[2]
        end_y = start_y + t[3]
        subset = s2[:, start_x:end_x, start_y:end_y, :]
        interp_tile = interp[:, start_x:end_x, start_y:end_y]
        interp_tile = np.sum(interp_tile, axis = (1, 2))
        print(f"Interpolated amounts in this tile: {interp_tile}")

        dates_tile = np.copy(dates)
        to_remove = np.argwhere(interp_tile > ((142*142) / 10)).flatten()
        if len(to_remove) > 0:
            print(f"Removing {to_remove} dates due to interpolation")
            dates_tile = np.delete(dates_tile, to_remove)
            subset = np.delete(subset, to_remove, 0)

        missing_px = id_missing_px(subset)
        if len(missing_px) > 0:
            print(f"Removing {missing_px} dates due to missing data")
            dates_tile = np.delete(dates_tile, missing_px)
            subset = np.delete(subset, missing_px, 0)

        to_remove = remove_missed_clouds(subset)
        if len(to_remove) > 0:
            subset = np.delete(subset, to_remove, axis = 0)
            dates_tile = np.delete(dates_tile, to_remove)
        print(f"{len(to_remove)} missed cloudy images were removed: {to_remove}")

        subtile, _ = calculate_and_save_best_images(subset, dates_tile)
        output = f"{path}/{year}/processed/{y_vals[index]}/{x_vals[index]}.hkl"

        index += 1
        sm = Smoother(lmbd = 800, size = subtile.shape[0], nbands = 14, dim = subtile.shape[1])
        subtile = sm.interpolate_array(subtile)
        subtile = np.concatenate([subtile, s1[:, start_x:end_x, start_y:end_y, :]], axis = -1)

        output_folder = "/".join(output.split("/")[:-1])
        if not os.path.exists(os.path.realpath(output_folder)):
            os.makedirs(os.path.realpath(output_folder))
        subtile = np.float32(subtile)
        subtile = np.reshape(subtile, (12, 2, 142, 142, subtile.shape[-1]))
        subtile = np.mean(subtile, axis = (1))
        print(f"{index}: Writing {output}, {subtile.shape} shape")
        assert subtile.shape[1] == 142, f"subtile shape is {subtile.shape}"

        hkl.dump(subtile, output, mode='w', compression='gzip')
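
The reshape to `(12, 2, 142, 142, bands)` followed by a mean over axis 1 averages consecutive pairs of time steps, so it only succeeds when the smoothed subtile has exactly 24 steps. A small sketch of the same operation on a toy array follows. The helper name `average_consecutive_pairs` is a hypothetical stand-in and the toy tile is far smaller than a real subtile.

```python
import numpy as np

def average_consecutive_pairs(arr):
    """Average time steps (0, 1), (2, 3), ... along the first axis."""
    n_steps = arr.shape[0]
    if n_steps % 2 != 0:
        raise ValueError(f"expected an even number of time steps, got {n_steps}")
    # Split the time axis into (pairs, 2) and average within each pair
    paired = arr.reshape((n_steps // 2, 2) + arr.shape[1:])
    return paired.mean(axis=1)

# 24 time steps of a 2 x 2 tile with one band; every pixel holds its step index
steps = np.arange(24, dtype=np.float32).reshape(24, 1, 1, 1)
toy = np.tile(steps, (1, 2, 2, 1))

composites = average_consecutive_pairs(toy)
print(composites.shape)          # (12, 2, 2, 1)
print(composites[:3, 0, 0, 0])   # [0.5 2.5 4.5]
```

Raising on an odd number of steps makes the shape requirement explicit, whereas a bare `np.reshape` would fail with a less specific size-mismatch error.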




# 2.5 Function execution
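
The execution cell walks a grid of large tiles and only downloads and processes those accepted by `check_contains`. The sketch below shows the same control flow with the progress total derived from the grid size. The generator `iterate_tiles` and the `should_process` predicate are hypothetical stand-ins, and no download or processing is performed.

```python
def iterate_tiles(n_y, n_x, should_process):
    """Yield (count, total, x_tile, y_tile) for each accepted grid cell.

    `should_process` is a hypothetical predicate taking (x_tile, y_tile).
    """
    total = n_y * n_x
    count = 0
    for y_tile in range(n_y):
        for x_tile in range(n_x):
            if should_process(x_tile, y_tile):
                count += 1
                yield count, total, x_tile, y_tile

# Illustrative predicate: pretend only the diagonal tiles intersect the area
for count, total, x_tile, y_tile in iterate_tiles(5, 6, lambda x, y: x == y):
    print(f"Tile {count}/{total}; X: {x_tile} Y: {y_tile}")
```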

In [18]:
from scipy.ndimage import median_filter
import time
downloaded = 0

if not os.path.exists(os.path.realpath(OUTPUT_FOLDER)):
    os.makedirs(os.path.realpath(OUTPUT_FOLDER))
        
print(f"Downloading {year} for {landscape}")

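# Used only as the total in the progress message below; the loops
# themselves cover a 5 x 6 grid of large tiles
max_x = 50
max_y = 50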

for y_tile in range(0, 5):
    for x_tile in range(0, 6):
        #contains = True
        contains = check_contains(coords, x_tile, y_tile, OUTPUT_FOLDER)
        print(y_tile, x_tile, contains, downloaded)
        if contains:
            print(f"Download {downloaded}/{max_x*max_y}; X: {x_tile} Y:{y_tile}")
            downloaded += 1
            download_large_tile(coord = coords, step_x = x_tile, step_y = y_tile)
            s2, image_dates, interp = process_large_tile(coords, x_tile, y_tile)
            if s2 is not None:
                process_subtiles(coords, x_tile, y_tile, year = 2019,
                                    s2 = s2, dates = image_dates, interp = interp)
                print("\n")

Downloading 2019 for malawi-rumphi
0 0 False 0
0 1 False 0
0 2 False 0
0 3 True 0
Download 0/2500; X: 3 Y:0
0 4 False 1
0 5 False 1
1 0 False 1
1 1 True 1
Download 1/2500; X: 1 Y:1
1 2 True 2
Download 2/2500; X: 2 Y:1
1 3 True 3
Download 3/2500; X: 3 Y:1
1 4 True 4
Download 4/2500; X: 4 Y:1
1 5 True 5
Download 5/2500; X: 5 Y:1
2 0 False 6
2 1 True 6
Download 6/2500; X: 1 Y:2
2 2 True 7
Download 7/2500; X: 2 Y:2
2 3 True 8
Download 8/2500; X: 3 Y:2
2 4 True 9
Download 9/2500; X: 4 Y:2
2 5 True 10
Download 10/2500; X: 5 Y:2
3 0 True 11
Download 11/2500; X: 0 Y:3
3 1 True 12
Download 12/2500; X: 1 Y:3
3 2 True 13
Download 13/2500; X: 2 Y:3
3 3 True 14
Download 14/2500; X: 3 Y:3
3 4 True 15
Download 15/2500; X: 4 Y:3
3 5 True 16
Download 16/2500; X: 5 Y:3
4 0 False 17
4 1 True 17
Download 17/2500; X: 1 Y:4
4 2 True 18
Download 18/2500; X: 2 Y:4
4 3 True 19
Download 19/2500; X: 3 Y:4
4 4 True 20
Download 20/2500; X: 4 Y:4
4 5 True 21
Download 21/2500; X: 5 Y:4
Downloading clouds because ../



Original cloud image max is 255, (69, 40, 40)
Cloud_probs used 0.1 processing units
There are unique 8997 shadow L1C values
The max shadows is 1.0
Calculating shadows the new way took 0.4168720245361328








Shadows ((36, 646, 646)) used 14.3 processing units
There are 0.0 pixels that always have shadows
0, Dates: [-23], Min dist: 70, thresh: 0.06
1, Dates: [47], Min dist: 45, thresh: 0.03
2, Dates: [87], Min dist: 40, thresh: 0.15
3, Dates: [ 92 107], Min dist: 30, thresh: 0.03
4, Dates: [137 142], Min dist: 30, thresh: 0.03
5, Dates: [162 172], Min dist: 20, thresh: 0.01
6, Dates: [182 187 192 202], Min dist: 20, thresh: 0.01
7, Dates: [212 217 232 237 242], Min dist: 30, thresh: 0.03
8, Dates: [247 257 262 272], Min dist: 20, thresh: 0.01
9, Dates: [277 287 302], Min dist: 25, thresh: 0.01
10, Dates: [307 312], Min dist: 5, thresh: 0.01
11, Dates: [347], Min dist: 35, thresh: 0.03
[ 1  9 11 18 23 26 29 31]
[0.06437807 0.09421158 0.01691045 0.07668769 0.0068629  0.
 0.02741568 0.02630381]
[ 1  9 11 18 23 26 29 31]
Downloading S1 because ../project-monitoring/zambia/eastern/chama/2019/raw/s1/4_5.hkl does not exist
The continent is: AF, and the sentinel 1 orbit is SENT
Converting s1 to flo



0, Dates: [-23], Min dist: 70, thresh: 0.06
1, Dates: [47], Min dist: 70, thresh: 0.06
2, Dates: [87], Min dist: 40, thresh: 0.15
3, Dates: [ 92 107], Min dist: 30, thresh: 0.03
4, Dates: [137 142], Min dist: 30, thresh: 0.03
5, Dates: [162 172], Min dist: 20, thresh: 0.01
6, Dates: [182 187 192 202], Min dist: 20, thresh: 0.01
7, Dates: [212 217 232 237 242], Min dist: 30, thresh: 0.03
8, Dates: [247 257 262 272], Min dist: 25, thresh: 0.01
9, Dates: [277 287 302], Min dist: 25, thresh: 0.01
10, Dates: [307 312], Min dist: 5, thresh: 0.01
11, Dates: [347], Min dist: 35, thresh: 0.03
[]
(28, 646, 646, 10) (28, 646, 646) (28, 646, 646) (28,)
[] cloudy and missing images removed
[    0     0   713     0     0   675     0     0     0     0     0     1
     0     0     0    33     0     0     0   647     0     3 16638     0
     0    17     0     0]
Interpolated 107914 px
0 steps removed because of >50% interpolation rate
The input array to superresolve is (28, 646, 646, 10)




Interpolated amounts in this tile: [    0.     0.  9757. 17406.     0. 10133.     0.     0.     0.     0.
     0.     0.     0.     0.     0.     0.     0.     0.     0.     0.
     0.     0.   926.     0.     0. 16342.     0.     0.]
Removing [ 2  3  5 25] dates due to interpolation
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
0 missed cloudy images were removed: []
Maximum time distance: 70
(72, 142, 142, 14)
1: Writing ../project-monitoring/zambia/eastern/chama//2019/processed/24/25.hkl, (12, 142, 142, 17) shape
Interpolated amounts in this tile: [    0.     0.  8412. 16471.     0. 15919.     0.     0.     0.     0.
     0.     0.     0.     0.     0.     0.     0.     0.     0.     0.
     0.     0.  1166.     0.     0. 10201.     0.     0.]
Removing [ 2  3  5 25] dates due to interpolation
[0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
0 missed cloudy images were removed: []
Maximum time distance: 70
(72, 142, 142, 14)
2: Writing ../project-monitoring/zambia/eastern/chama

(72, 142, 142, 14)
16: Writing ../project-monitoring/zambia/eastern/chama//2019/processed/21/25.hkl, (12, 142, 142, 17) shape
Interpolated amounts in this tile: [    0.     0.  5664.     0.     0. 12249.     0.     0.     0.     0.
     0.     0.     0.     0.     0.     0.     0.     0.     0. 10268.
     0.     0.   526.     0.     0.  2190.     0.     0.]
Removing [ 2  5 19] dates due to interpolation
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
0 missed cloudy images were removed: []
Maximum time distance: 70
(72, 142, 142, 14)
17: Writing ../project-monitoring/zambia/eastern/chama//2019/processed/21/26.hkl, (12, 142, 142, 17) shape
Interpolated amounts in this tile: [    0.     0. 12496.    40.     0. 13050.     0.     0.     0.     0.
     0.     0.     0.     0.     0.     0.     0.     0.     0.  2733.
     0.     0. 12162.     0.     0.     0.     0.     0.]
Removing [ 2  5 22] dates due to interpolation
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0]
0 missed cloud