# Overview

This Jupyter notebook downloads and preprocesses Sentinel 1 and Sentinel 2 tiles for large areas (at least 40 sq km). The workflow generates the tile coordinates, downloads the raw data, and processes it (cloud and shadow removal, gap interpolation, index calculation, and superresolution).

The notebook is broken down into the following sections:

   * **Parameter definition**
   * **Projection functions**
   * **Data download functions**
   * **Cloud and shadow removal functions**
   * **Superresolution functions**
   * **Tile and folder management functions**
   * **Function execution**

If you are planning to download new Sentinel data, you need an API key for the data provider [Sentinel Hub](https://www.sentinel-hub.com). If you do not have an API key but have access to Sentinel imagery, the input data for this notebook is an entire year of:
  * Cloud masks
  * L1C bands 2, 8A, 11
  * 10- and 20m L2A bands
  * VV-VH Sentinel 1 bands
  * Digital elevation model
  
  
The data are tiled into 6300 m x 6300 m windows. An example of the raw data can be downloaded by running the following cell. These data can be preprocessed (cloud interpolation, superresolution, smoothing, and so on) by running the rest of the notebook. Predictions can then be generated by running `4b-predict-large-area`.

In [1]:
# If using example raw data
import os
# Create the folder for the example data if it does not exist yet
os.makedirs("../data/example/raw/", exist_ok=True)

landscape = 'example'
OUTPUT_FOLDER = '../data/{}/'.format(landscape)
coords = (13.727334, -90.015579)  # (lat, long)
coords = (coords[1], coords[0])   # reorder to (long, lat)

In [2]:
# Download example raw data - only if you don't have an API key!
#!curl https://restoration-monitoring-external.s3.amazonaws.com/restoration-mapper/example/example.zip \
#    -o ../data/example/raw/data.zip
#!unzip ../data/example/raw/data.zip -d ../data/example/raw/
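
The 6300 m window size quoted above can be reproduced from the tiling constants `SIZE` and `IMSIZE` defined in section 1.1. The sketch below assumes that one pixel covers 10 m, which is the resolution requested for the 10 m bands. The names `PIXEL_M`, `SUBTILE_PX`, and `BORDER_PX` are illustrative and do not appear in the notebook.

```python
# Tiling arithmetic, assuming 10 m pixels.
PIXEL_M = 10               # assumed ground size of one pixel, in metres
SIZE = 9 * 5               # as in section 1.1
SUBTILE_PX = 14            # illustrative name for the factor 14 in IMSIZE
BORDER_PX = (7 * 2) + 2    # illustrative name for the 16 extra pixels in IMSIZE

core_px = SIZE * SUBTILE_PX       # pixels in the core window
IMSIZE = BORDER_PX + core_px      # pixels per side of the downloaded tile

print(core_px * PIXEL_M)  # 6300
print(IMSIZE)             # 646
```

The 646-pixel side length is the same value that `superresolve_tile` expects as input.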

# 1.0 Package Imports

In [1]:
import pandas as pd
import numpy as np
from random import shuffle
from osgeo import ogr, osr
from sentinelhub import WmsRequest, WcsRequest, MimeType, CRS, BBox, constants
import logging
from collections import Counter
import datetime
import os
import yaml
from sentinelhub import DataSource
import scipy.sparse as sparse
import scipy
from scipy.sparse.linalg import splu
from scipy.ndimage import median_filter  # used by download_dem
from skimage.transform import resize
from sentinelhub import CustomUrlParam
from time import sleep
import multiprocessing
import math
import reverse_geocoder as rg
import pycountry
import pycountry_convert as pc
import hickle as hkl
from shapely.geometry import Point, Polygon
import geopandas
from tqdm import tnrange, tqdm_notebook
import boto3
from pyproj import Proj, transform
from timeit import default_timer as timer
from typing import Tuple, List
import warnings

In [2]:
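if os.path.exists("../config.yaml"):
    with open("../config.yaml", 'r') as stream:
        key = yaml.safe_load(stream)
        API_KEY = key['key']
        AWSKEY = key['awskey']
        AWSSECRET = key['awssecret']
else:
    # No config file: define all three names so later cells do not raise NameError
    API_KEY = "none"
    AWSKEY = None
    AWSSECRET = None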

In [3]:
%run ../src/preprocessing/slope.py
%run ../src/preprocessing/indices.py
%run ../src/downloading/utils.py
%run ../src/preprocessing/cloud_removal.py
%run ../src/preprocessing/whittaker_smoother.py
%run ../src/dsen2/utils/DSen2Net.py
%run ../src/io/upload.py

Using TensorFlow backend.


# 1.1 Constants and Parameters

Currently, only 2018 and 2019 can be downloaded from Sentinel Hub. 2017 data are expected in summer 2020.

The `landscapes` dictionary maps each landscape name to a `(lat, long)` tuple.

In [4]:
year = 2019
landscape = 'ethiopia-aroresa'

if year > 2017:
    dates = (f'{year - 1}-12-01', f'{year + 1}-02-01')
else:
    dates = (f'{year}-01-01', f'{year + 1}-02-01')

dates_sentinel_1 = (f'{year}-01-01', f'{year}-12-31')
SIZE = 9*5
IMSIZE = (7*2) + (SIZE * 14)+2

days_per_month = [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30]
starting_days = np.cumsum(days_per_month)
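
`starting_days` is used further down to turn each acquisition date into a day index relative to 1 January of `year`, with dates from the previous and following year shifted by 365 days. A minimal sketch of that arithmetic as a standalone helper follows. `day_index` is a hypothetical name, it generalises the three-year case to any year offset, and like the notebook it ignores leap days.

```python
import datetime
from itertools import accumulate

days_per_month = [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30]
starting_days = list(accumulate(days_per_month))  # days elapsed before each month

def day_index(date: datetime.date, year: int) -> int:
    """Day offset of `date` relative to the start of `year` (no leap days)."""
    offset = 365 * (date.year - year)
    return offset + starting_days[date.month - 1] + date.day

print(day_index(datetime.date(2019, 1, 14), 2019))  # 14
print(day_index(datetime.date(2018, 12, 1), 2019))  # -30
print(day_index(datetime.date(2020, 1, 19), 2019))  # 384
```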

In [5]:
database = pd.read_csv("../project-monitoring/database.csv")
coords = database[database['landscape'] == landscape]
path = coords['path'].tolist()[0]
coords = (float(coords['longitude'].iloc[0]), float(coords['latitude'].iloc[0]))  # (long, lat)

IO_PARAMS = {'prefix': '../',
             'bucket': 'restoration-monitoring',
             'coords': coords,
             'bucket-prefix': '',
             'path': path}

OUTPUT_FOLDER = IO_PARAMS['prefix'] + IO_PARAMS['path'] + str(year) + '/'
print(coords, OUTPUT_FOLDER)

(38.832917, 6.136697) ../project-monitoring/ethiopia/oromiya/kibre-mengist/2019/


In [6]:
#uploader = FileUploader(awskey = AWSKEY, awssecret = AWSSECRET)
#file = '../tile_data/processed/data_x_l2a_processed.hkl'
#key = get_folder_prefix(coordinates) + '2018/raw/s2/0_0.hkl'
#key = 'restoration-mapper/model-data/global/data_x_l2a_processed.hkl'
#uploader.upload(bucket = 'restoration-monitoring', key = key, file = file)

In [7]:
to_append = pd.DataFrame({'landscape': ['ethiopia-aroresa'],
                          'latitude': ['6.136697'],
                          'longitude': ['38.832917'],
                          'path': [get_folder_prefix((6.136697, 38.832917),
                                                     params = {'bucket-prefix': 'project-monitoring'})]})
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
database = pd.concat([database, to_append])
database.to_csv("../project-monitoring/database.csv", index=False)

Loading formatted geocoded file...


# 2.1 Data download functions

If using Sentinel Hub, define the following layers in your configuration:
  * CLOUD: return [CLP / 255]
  * SHADOW: return [B02, B8A, B11]
  * DEM: return [DEM]
  * SENT: return [VV, VH]
  * L2A10: return [B02,B03,B04, B08]
  * L2A20: return [B05,B06,B07, B8A,B11,B12]
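
The CLOUD layer holds cloud probabilities that `identify_clouds` below rescales to [0, 1] when they arrive as 8-bit values. It then drops any time step in which more than 15% of pixels exceed a probability of 0.33. A small synthetic sketch of that filter follows. The array sizes and random values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three fake 8x8 time steps of 8-bit cloud probabilities: clear, overcast, clear.
clp = np.stack([
    rng.integers(0, 40, (8, 8)),
    rng.integers(200, 256, (8, 8)),
    rng.integers(0, 40, (8, 8)),
]).astype(np.float32)

cloud_probs = clp / 255                                  # rescale to [0, 1]
n_cloud_px = np.sum(cloud_probs > 0.33, axis=(1, 2))     # cloudy pixels per step
cloud_steps = np.flatnonzero(n_cloud_px > 0.15 * 8 * 8)  # steps that are too cloudy
clean_steps = [i for i in range(len(cloud_probs)) if i not in cloud_steps]

print(clean_steps)  # [0, 2]
```

Only the indices in `clean_steps` are then requested for the larger shadow and Sentinel 2 downloads, which keeps processing-unit usage down.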

In [8]:
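def identify_clouds(bbox: List[Tuple[float, float]],
                epsg: 'CRS', dates: tuple = dates) -> (np.ndarray, np.ndarray, list, np.ndarray):
    """ Downloads and calculates cloud cover and shadow
        
        Parameters:
         bbox (list): output of calc_bbox
         epsg (CRS): CRS associated with bbox 
         dates (tuple): YYYY-MM-DD - YYYY-MM-DD bounds for downloading 
    
        Returns:
         cloud_img (np.array): cloud probabilities for the retained time steps
         shadows (np.array): shadow mask for the retained time steps
         clean_steps (list): indices of the time steps that passed the cloud filter
         image_dates (np.array): day index of each retained time step
    """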
    box = BBox(bbox, crs = epsg)
    cloud_request = WcsRequest(
        layer='CLOUD_NEW',
        bbox=box,
        time=dates,
        resx='160m', 
        resy='160m',
        image_format = MimeType.TIFF_d8,
        maxcc=0.75,
        instance_id=API_KEY,
        custom_url_params = {constants.CustomUrlParam.UPSAMPLING: 'NEAREST'},
        time_difference=datetime.timedelta(hours=72),
    )

    shadow_request = WcsRequest(
        layer='SHADOW',
        bbox=box,
        time=dates,
        resx='20m',
        resy='20m',
        image_format = MimeType.TIFF_d16,
        maxcc=0.75,
        instance_id=API_KEY,
        custom_url_params = {constants.CustomUrlParam.UPSAMPLING: 'NEAREST'},
        time_difference=datetime.timedelta(hours=72))

    cloud_img = np.array(cloud_request.get_data())
    print(f"There are {len(np.unique(cloud_img))} unique cloud values")
    if np.max(cloud_img) > 10:
        cloud_img = cloud_img / 255
        
    assert np.max(cloud_img) <= 1., f'The max cloud probability is {np.max(cloud_img)}'
    c_probs_pus = ((40*40)/(512*512)) *(1/3)*cloud_img.shape[0]
    print(f"Cloud_probs used {round(c_probs_pus, 1)} processing units")
    
    cloud_img = resize(cloud_img, (cloud_img.shape[0], IMSIZE, IMSIZE), order = 0)
    n_always_cloud = np.min(cloud_img > 0.2, axis = (0))
    print(f"There are {np.sum(n_always_cloud)} pixels that always have clouds")
    n_cloud_px = np.sum(cloud_img > 0.33, axis = (1, 2))
    
    cloud_steps = np.argwhere(n_cloud_px > (IMSIZE**2 * 0.15))
    clean_steps = [x for x in range(cloud_img.shape[0]) if x not in cloud_steps]
    
    shadow_img = shadow_request.get_data(data_filter = clean_steps)
    shadow_img = np.array(shadow_img)
    shadow_pus = (shadow_img.shape[1]*shadow_img.shape[2])/(512*512) * shadow_img.shape[0]
    shadow_img = resize(shadow_img, (shadow_img.shape[0], IMSIZE, IMSIZE, shadow_img.shape[-1]), order = 0)
    print(f"There are {len(np.unique(shadow_img))} unique shadow L1C values")
    
    if np.max(shadow_img) > 10:
        print(f"The max shadows is {np.max(shadow_img)}")
        shadow_img = shadow_img / 65535
 
    cloud_img = np.delete(cloud_img, cloud_steps, 0)
    shadows = mcm_shadow_mask(np.array(shadow_img), cloud_img) # TODO: check that this makes sense
    print(f"Shadows ({shadows.shape}) used {round(shadow_pus, 1)} processing units")
    n_always_shadow = np.min(shadows, axis = 0) # per-pixel minimum over time, as for the clouds above
    print(f"There are {np.sum(n_always_shadow)} pixels that always have shadows")
    
    image_dates = []
    for date in cloud_request.get_dates():
        if date.year == year - 1:
            image_dates.append(-365 + starting_days[(date.month-1)] + date.day)
        if date.year == year:
            image_dates.append(starting_days[(date.month-1)] + date.day)
        if date.year == year + 1:
            image_dates.append(365 + starting_days[(date.month-1)]+date.day)
    image_dates = [val for idx, val in enumerate(image_dates) if idx in clean_steps]
    image_dates = np.array(image_dates, dtype = np.float32)   
    
    return cloud_img, shadows, clean_steps, image_dates
    
def download_dem(bbox: List[Tuple[float, float]], epsg: 'CRS') -> np.ndarray:
    """ Downloads the DEM layer from Sentinel hub
        
        Parameters:
         bbox (list): output of calc_bbox
         epsg (float): EPSG associated with bbox 
    
        Returns:
         dem_image (arr):
    """

    box = BBox(bbox, crs = epsg)
    dem_size = 650
    dem_request = WmsRequest(data_source=DataSource.DEM,
                         layer='DEM',
                         bbox=box,
                         width=dem_size,
                         height=dem_size,
                         instance_id=API_KEY,
                         image_format=MimeType.TIFF_d32f,
                         custom_url_params={CustomUrlParam.SHOWLOGO: False})
    dem_image = dem_request.get_data()[0]
    std_before = np.std(dem_image)
    dem_image = median_filter(dem_image, size = 5)
    std_after = np.std(dem_image)
    print(f"The std before was {std_before}, after median filter is {std_after}")
    dem_image = calcSlope(dem_image.reshape((1, dem_size, dem_size)),
                  np.full((dem_size, dem_size), 10), np.full((dem_size, dem_size), 10), zScale = 1, minSlope = 0.02)
    dem_image = dem_image.reshape((dem_size,dem_size, 1))
    dem_image = dem_image[1:dem_size-1, 1:dem_size-1, :]
    print(f"DEM used {round(((IMSIZE*IMSIZE)/(512*512))*2, 1)} processing units")
    print(f"There are {len(np.unique(dem_image))} unique DEM values")
    return dem_image
 

def download_layer(bbox: List[Tuple[float, float]],
                   clean_steps: np.ndarray, epsg: 'CRS',
                   dates: tuple = dates, year: int = year) -> (np.ndarray, np.ndarray):
    """ Downloads the L2A sentinel layer with 10 and 20 meter bands
        
        Parameters:
         bbox (list): output of calc_bbox
         clean_steps (list): list of steps to filter download request
         epsg (CRS): CRS associated with bbox 
         dates (tuple): YYYY-MM-DD - YYYY-MM-DD bounds for downloading 
         year (int): year being downloaded, used to compute day indices
    
        Returns:
         img (arr): 10 m and 20 m bands concatenated along the last axis
         image_dates (arr): day index of each downloaded image
    """
    box = BBox(bbox, crs = epsg)
    image_request = WcsRequest(
            layer='L2A20',
            bbox=box,
            time=dates,
            image_format = MimeType.TIFF_d16,
            maxcc=0.75,
            resx='20m', resy='20m',
            instance_id=API_KEY,
            custom_url_params = {constants.CustomUrlParam.DOWNSAMPLING: 'NEAREST',
                                constants.CustomUrlParam.UPSAMPLING: 'NEAREST'},
            time_difference=datetime.timedelta(hours=72),
        )
    img_bands = image_request.get_data(data_filter = clean_steps)
    img_20 = np.stack(img_bands).astype(np.float32)
    print(f"There are {len(np.unique(img_20))} unique S2-20 values")
    if np.max(img_20) >= 50:
        img_20 = img_20 / 65535
    assert np.max(img_20) <= 2.
    
    s2_20_usage = (img_20.shape[1]*img_20.shape[2])/(512*512) * (6/3) * img_20.shape[0]
    print(f"Original 20 meter bands size: {img_20.shape}, using {round(s2_20_usage, 1)} PU")
    img_20 = resize(img_20, (img_20.shape[0], IMSIZE, IMSIZE, img_20.shape[-1]), order = 0)

    image_request = WcsRequest(
            layer='L2A10',
            bbox=box,
            time=dates,
            image_format = MimeType.TIFF_d16,
            maxcc=0.75,
            resx='10m', resy='10m',
            instance_id=API_KEY,
            custom_url_params = {constants.CustomUrlParam.DOWNSAMPLING: 'BICUBIC',
                                constants.CustomUrlParam.UPSAMPLING: 'BICUBIC'},
            time_difference=datetime.timedelta(hours=72),
    )
    
    img_bands = image_request.get_data(data_filter = clean_steps)
    img_10 = np.stack(img_bands).astype(np.float32)
    print(f"There are {len(np.unique(img_10))} unique S2-10 values")
    if np.max(img_10) >= 50:
        img_10 = img_10 / 65535
    assert np.max(img_10) <= 2.
    
    s2_10_usage = (img_10.shape[1]*img_10.shape[2])/(512*512) * (4/3) * img_10.shape[0]
    img_10 = resize(img_10, (img_10.shape[0], IMSIZE, IMSIZE, img_10.shape[-1]), order = 0)
    img = np.concatenate([img_10, img_20], axis = -1)
    print(f"Sentinel 2 used {round(s2_20_usage + s2_10_usage, 1)} PU")

    image_dates = []
    for date in image_request.get_dates():
        if date.year == year - 1:
            image_dates.append(-365 + starting_days[(date.month-1)] + date.day)
        if date.year == year:
            image_dates.append(starting_days[(date.month-1)] + date.day)
        if date.year == year + 1:
            image_dates.append(365 + starting_days[(date.month-1)]+date.day)
    image_dates = [val for idx, val in enumerate(image_dates) if idx in clean_steps]
    image_dates = np.array(image_dates, dtype = np.float32)
    return img, image_dates

        
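def download_sentinel_1(bbox: List[Tuple[float, float]],
                        epsg: 'CRS', imsize: int = IMSIZE, 
                        dates: tuple = dates_sentinel_1, layer: str = "SENT",
                        year: int = year) -> (np.ndarray, np.ndarray):
    """ Downloads the GRD Sentinel 1 VV-VH layer from Sentinel Hub
        
        Parameters:
         bbox (list): output of calc_bbox
         epsg (CRS): CRS associated with bbox 
         imsize (int): tile size in pixels, used to count out-of-bounds pixels
         dates (tuple): YYYY-MM-DD - YYYY-MM-DD bounds for downloading 
         layer (str): Sentinel Hub layer name, either SENT or SENT_DESC
         year (int): year being downloaded, used to compute day indices
    
        Returns:
         s1 (arr): Sentinel 1 images, empty if nothing was available
         image_dates (arr): day index of each returned image
    """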
    source = DataSource.SENTINEL1_IW_DES if layer == "SENT_DESC" else DataSource.SENTINEL1_IW_ASC
    box = BBox(bbox, crs = epsg)
    image_request = WcsRequest(
            layer=layer,
            bbox=box,
            time=dates,
            image_format = MimeType.TIFF_d16,
            data_source=source,
            maxcc=1.0,
            resx='10m', resy='10m',
            instance_id=API_KEY,
            custom_url_params = {constants.CustomUrlParam.DOWNSAMPLING: 'NEAREST',
                                constants.CustomUrlParam.UPSAMPLING: 'NEAREST'},
            time_difference=datetime.timedelta(hours=72),
        )
    
    data_filter = [x for x in range(len(image_request.download_list))]
    if len(image_request.download_list) > 40:
        data_filter = [x for x in range(len(image_request.download_list)) if x % 2 == 0]
        
    if len(image_request.download_list) > 0:
        img_bands = image_request.get_data(data_filter = data_filter)
        s1 = np.stack(img_bands).astype(np.float32)
        np.save("s1.npy", s1)
        print(f"The original max value is {np.max(s1)}")
        print(f"There are {len(np.unique(s1))} unique S1 values")
        if np.max(s1) >= 100:
            s1 = s1 / 65535.

        print(s1.shape)
        s1_usage = (2/3) * s1.shape[0] * ((s1.shape[1]*s1.shape[2]) / (512*512))
        print(f"Sentinel 1 used {round(s1_usage, 1)} PU for "
              f"{s1.shape[0]} out of {len(image_request.download_list)} images")

        image_dates = []
        for date in image_request.get_dates():
            if date.year == year - 1:
                image_dates.append(-365 + starting_days[(date.month-1)] + date.day)
            if date.year == year:
                image_dates.append(starting_days[(date.month-1)] + date.day)
            if date.year == year + 1:
                image_dates.append(365 + starting_days[(date.month-1)]+date.day)

        image_dates = [val for idx, val in enumerate(image_dates) if idx in data_filter]
        image_dates = np.array(image_dates)

        s1c = np.copy(s1)
        s1c[np.where(s1c < 1.)] = 0
        s1c[np.where(s1c >= 1.)] = 1.
        n_pix_oob = np.sum(s1c, axis = (1, 2, 3))
        print(n_pix_oob / (imsize*2*imsize*2))
        to_remove = np.argwhere(n_pix_oob > (imsize*2*imsize*2)/50)
        s1 = np.delete(s1, to_remove, 0)
        image_dates = np.delete(image_dates, to_remove)
        return s1, image_dates
    else: 
        return np.empty((0,)), np.empty((0,))


def identify_s1_layer(coords: Tuple[float, float]) -> str:
    """ Identifies whether to download ascending or descending 
        sentinel 1 orbit based upon predetermined geographic coverage
        
        Reference: https://sentinel.esa.int/web/sentinel/missions/
                   sentinel-1/satellite-description/geographical-coverage
        
        Parameters:
         coords (tuple): (lat, long) of the tile
    
        Returns:
         layer (str): either of SENT, SENT_DESC 
    """
    results = rg.search(coords)
    country = results[-1]['cc']
    continent_name = pc.country_alpha2_to_continent_code(country)
    # Default to the ascending layer so that `layer` is always defined, including
    # for continents not listed below (e.g. Europe). This default is an assumption.
    layer = "SENT"
    if continent_name in ['AF', 'OC']:
        layer = "SENT"
    elif continent_name == 'SA':
        layer = "SENT" if coords[0] > -7.11 else "SENT_DESC"
    elif continent_name == 'AS':
        layer = "SENT" if coords[0] > 23.3 else "SENT_DESC"
    elif continent_name == 'NA':
        layer = "SENT_DESC"
    print(f"The continent is: {continent_name}, and the sentinel 1 orbit is {layer}")
    return layer
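
Each download function above prints an estimate of the Sentinel Hub processing units (PU) it consumed. They all follow the same rule of thumb: image area relative to a 512 x 512 reference, times the number of bands divided by 3, times the number of images. The sketch below gathers that estimate into one helper. `estimate_pu` is a hypothetical name, and the formula simply mirrors the print statements above rather than an official billing calculation.

```python
def estimate_pu(height: int, width: int, n_bands: int, n_images: int) -> float:
    """Approximate processing units, following the notebook's print statements."""
    area_factor = (height * width) / (512 * 512)
    band_factor = n_bands / 3
    return area_factor * band_factor * n_images

# One 512 x 512 image with three bands is the reference unit.
print(estimate_pu(512, 512, 3, 1))  # 1.0

# Twenty 646 x 646 acquisitions of the four 10 m Sentinel 2 bands.
print(round(estimate_pu(646, 646, 4, 20), 1))
```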

# 2.3 Superresolution

In [9]:
MDL_PATH = "../src/dsen2/models/"

input_shape = ((4, None, None), (6, None, None))
model = s2model(input_shape, num_layers=6, feature_size=128)
predict_file = MDL_PATH+'s2_032_lr_1e-04.hdf5'
print('Symbolic Model Created.')

model.load_weights(predict_file)

def superresolve_tile(arr: np.ndarray, model) -> np.ndarray:
    """Superresolves each 56x56 subtile in a 646x646 input tile
       by padding the subtiles to 64x64 and removing the pad after prediction,
       eliminating boundary artifacts

        Parameters:
         arr (arr): (?, 646, 646, 10) array

        Returns:
         superresolved (arr): (?, 646, 646, 10) array
    """
    print(f"The input array to superresolve is {arr.shape}")
    tiles = tile_window(646, 646, 56, 56)
    for i in tnrange(len(tiles)):
        subtile = tiles[i]
        to_resolve = arr[:, subtile[0]:subtile[0]+56, subtile[1]:subtile[1]+56, :]

        resolved = superresolve(
            np.pad(to_resolve, ((0, 0), (4, 4), (4, 4), (0, 0)), 'reflect'),
            model)
        resolved = resolved[:, 4:-4, 4:-4, :]
        arr[:, subtile[0]:subtile[0]+56, subtile[1]:subtile[1]+56] = resolved
    return arr


Symbolic Model Created.
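
The pad, predict, and crop pattern in `superresolve_tile` can be seen in isolation with a stand-in model. The sketch below assumes a `fake_superresolve` function that returns its input unchanged, because the real DSen2 `superresolve` is not reproduced here. It shows that a 56 x 56 subtile padded to 64 x 64 by reflection comes back at its original size after cropping.

```python
import numpy as np

def fake_superresolve(batch: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the DSen2 prediction: identity mapping."""
    return batch

subtile = np.random.default_rng(0).random((3, 56, 56, 10)).astype(np.float32)

# Reflect-pad 4 pixels on each spatial side: 56 -> 64.
padded = np.pad(subtile, ((0, 0), (4, 4), (4, 4), (0, 0)), mode="reflect")
resolved = fake_superresolve(padded)
# Discard the padded border, which is where edge artifacts would appear.
cropped = resolved[:, 4:-4, 4:-4, :]

print(padded.shape, cropped.shape)  # (3, 64, 64, 10) (3, 56, 56, 10)
```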


# 2.4 Tiling and folder management functions

In [10]:
def make_output_and_temp_folders(idx: str, output_folder: str = OUTPUT_FOLDER) -> None:
    """Makes necessary folder structures for IO of raw and processed data

        Parameters:
         idx (str)
         output_folder (path)

        Returns:
         None
    """
    def _find_and_make_dirs(dirs):
        if not os.path.exists(os.path.realpath(dirs)):
            os.makedirs(os.path.realpath(dirs))
            
    _find_and_make_dirs(output_folder + "raw/")
    _find_and_make_dirs(output_folder + "raw/clouds/")
    _find_and_make_dirs(output_folder + "raw/s1/")
    _find_and_make_dirs(output_folder + "raw/s2/")
    _find_and_make_dirs(output_folder + "raw/misc/")
    _find_and_make_dirs(output_folder + "processed/")
    _find_and_make_dirs(output_folder + "interim/")

    
def to_int32(array: np.array) -> np.array:
    '''Scales a [0, 1] float32 array by 65535 and stores it as int32'''
    return np.trunc(array * 65535).astype(np.int32)

def to_int16(array: np.array) -> np.array:
    '''Scales a [0, 1] float32 array by 65535 and stores it as uint16,
       halving the uncompressed storage size'''
    return np.trunc(array * 65535).astype(np.uint16)

def to_float32(array: np.array) -> np.array:
    '''Converts an integer array written by to_int16 back to [0, 1] float32;
       arrays that are already floating point are returned unscaled'''
    divide = 1. if isinstance(array.flat[0], np.floating) else 65535
    return np.float32(array) / divide
    

def download_large_tile(coord: tuple,
                        step_x: int,
                        step_y: int,
                        folder: str = OUTPUT_FOLDER, 
                        year: int = year,
                        s1_layer: str = "SENT") -> None:
    """Wrapper function to download cloud probs, Sentinel 2, Sentinel 1, and DEM

        Parameters:
         coord (tuple):
         step_x (int):
         step_y (int):
         folder (path):
         year (int):
         s1_layer (str):

        Returns:
         None
    """
    bbx, epsg = calculate_bbx_pyproj(coord, step_x, step_y, expansion = 80)
    dem_bbx, _ = calculate_bbx_pyproj(coord, step_x, step_y, expansion = 90)
    idx = str(step_y) + "_" + str(step_x)
    make_output_and_temp_folders(idx)

    if not os.path.exists(folder + "output/" + str(step_y*5) + "/" + str(step_x*5) + ".npy"):
        if not os.path.exists(folder + "processed/" + str(step_y*5) + "/" + str(step_x*5) + ".hkl"):
            clouds_file = f'{folder}raw/clouds/clouds_{idx}.hkl'
            shadows_file = f'{folder}raw/clouds/shadows_{idx}.hkl'
            s1_file = f'{folder}raw/s1/{idx}.hkl'
            s1_dates_file = f'{folder}raw/misc/s1_dates_{idx}.hkl'
            s2_file = f'{folder}raw/s2/{idx}.hkl'
            s2_dates_file = f'{folder}raw/misc/s2_dates_{idx}.hkl'
            clean_steps_file = f'{folder}raw/clouds/clean_steps_{idx}.hkl'
            
            if not os.path.exists(clouds_file):
                # All this needs to be int16, copied to cloud with io.save_file
                print(f"Downloading clouds because {clouds_file} does not exist")
                cloud_probs, shadows, clean_steps, image_dates = identify_clouds(bbx, epsg = epsg)
                to_remove, _ = calculate_cloud_steps(cloud_probs, image_dates)
                print(to_remove)
                clean_steps = np.delete(clean_steps, to_remove)
                cloud_probs = np.delete(cloud_probs, to_remove, 0)
                shadows = np.delete(shadows, to_remove, 0)
                image_dates = np.delete(image_dates, to_remove)
                hkl.dump(cloud_probs, clouds_file, mode='w', compression='gzip')
                hkl.dump(shadows, shadows_file, mode='w', compression='gzip')
                hkl.dump(clean_steps, clean_steps_file, mode='w', compression='gzip')
                    
            if not os.path.exists(s1_file):
                # All this needs to be int16, copied to cloud with io.save_file
                print(f"Downloading S1 because {s1_file} does not exist")
                s1_layer = identify_s1_layer((coord[1], coord[0]))
                s1, s1_dates = download_sentinel_1(bbx, layer = s1_layer, epsg = epsg)
                if s1.shape[0] == 0:
                    s1_layer = "SENT_DESC" if s1_layer == "SENT" else "SENT"
                    print(f'Switching to {s1_layer}')
                    s1, s1_dates = download_sentinel_1(bbx, layer = s1_layer, epsg = epsg)
                s1 = process_sentinel_1_tile(s1, s1_dates)
                print("The S1 has been processed")
                hkl.dump(to_int16(s1), s1_file, mode='w', compression='gzip')
                hkl.dump(s1_dates, s1_dates_file, mode='w', compression='gzip')

            if not os.path.exists(s2_file):
                # All this needs to be int16, copied to cloud with io.save_file
                print(f"Downloading S2 because {s2_file} does not exist")
                # Reload the step indices if the cloud download was skipped in this call
                if 'clean_steps' not in locals():
                    clean_steps = hkl.load(clean_steps_file)
                clean_steps = list(clean_steps)
                s2, s2_dates = download_layer(bbx, clean_steps = clean_steps, epsg = epsg)
                hkl.dump(to_int16(s2), s2_file, mode='w', compression='gzip')
                hkl.dump(s2_dates, s2_dates_file, mode='w', compression='gzip')

            if not os.path.exists(folder + "raw/misc/dem_{}.hkl".format(idx)):
                # All this needs to be int16, copied to cloud with io.save_file
                dem = download_dem(dem_bbx, epsg = epsg)
                hkl.dump(dem, folder + "raw/misc/dem_{}.hkl".format(idx), mode='w', compression='gzip')
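
Raw arrays are written as 16-bit integers and converted back to floats before processing. The sketch below shows that round trip on its own, assuming values already scaled to [0, 1]. The functions mirror `to_int16` and `to_float32` above but are renamed, hypothetically, to `pack_uint16` and `unpack_float32` to avoid shadowing them.

```python
import numpy as np

def pack_uint16(array: np.ndarray) -> np.ndarray:
    """Scale [0, 1] floats to the uint16 range, truncating the fraction."""
    return np.trunc(array * 65535).astype(np.uint16)

def unpack_float32(array: np.ndarray) -> np.ndarray:
    """Invert pack_uint16; floating-point inputs are passed through unscaled."""
    divide = 1.0 if np.issubdtype(array.dtype, np.floating) else 65535
    return np.float32(array) / divide

x = np.linspace(0, 1, 1000, dtype=np.float32)
packed = pack_uint16(x)
restored = unpack_float32(packed)

print(packed.nbytes / x.nbytes)            # 0.5
print(np.abs(restored - x).max() < 1e-4)   # True: quantization error stays small
```

Truncation loses at most about 1/65535 per value, which is small compared with the [0, 1] range of the stored data.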

In [11]:
def process_sentinel_1_tile(sentinel1: np.ndarray, dates: np.ndarray) -> np.ndarray:
    """Converts a (?, X, Y, 2) Sentinel 1 array to (24, X, Y, 2)

        Parameters:
         sentinel1 (np.array):
         dates (np.array):

        Returns:
         s1 (np.array)
    """
    s1, _ = calculate_and_save_best_images(sentinel1, dates)
    biweekly_dates = np.array([day for day in range(0, 360, 5)])
    to_remove = np.argwhere(biweekly_dates % 15 != 0)
    s1 = np.delete(s1, to_remove, 0)
    return s1


def make_folder_names(step_x: int, step_y: int) -> (list, list):
    '''Given an input tile location (step_x, step_y), identify the folder and file
       names for each 5x5 subtile
       
       Parameters:
         step_x (int):
         step_y (int):

        Returns:
         x_vals (list)
         y_vals (list)
    '''
    x_vals = []
    y_vals = []
    for i in range(25):
        y_val = (24 - i) // 5
        x_val = 5 - ((25 - i) % 5)
        x_val = 0 if x_val == 5 else x_val
        x_vals.append(x_val)
        y_vals.append(y_val)
    y_vals = [i + (5*step_y) for i in y_vals]
    x_vals = [i + (5*step_x) for i in x_vals]
    return x_vals, y_vals


def process_large_tile(coord: tuple,
                       step_x: int,
                       step_y: int,
                       folder: str = OUTPUT_FOLDER,
                       model: 'model' = model) -> None:
    '''Wrapper function to interpolate clouds and temporal gaps, superresolve tiles,
       calculate relevant indices, and save analysis-ready data to the output folder
       
       Parameters:
        coord (tuple)
        step_x (int):
        step_y (int):
        folder (str):
        model: superresolution model passed to superresolve_tile

       Returns:
        None
    '''
    idx = str(step_y) + "_" + str(step_x)
    x_vals, y_vals = make_folder_names(step_x, step_y)

    processed = True
    print(f"{folder}interim/{idx}.hkl")
    interim_exists = os.path.exists(f"{folder}interim/{idx}.hkl")
    for x, y in zip(x_vals, y_vals):
        folder_path = f"{str(y)}/{str(x)}"
        processed_exists = os.path.exists(folder + "processed/" + folder_path + ".hkl")
        output_exists = os.path.exists(folder + "output/" + folder_path + ".npy")
        if not (processed_exists or output_exists or interim_exists):
            processed = False
    if not processed:
        print(f"Downloading because folder {folder_path}.npy does not exist")
        # All this needs to be converted to float32
        clouds = hkl.load(f'{folder}raw/clouds/clouds_{idx}.hkl')
        sentinel1 = to_float32(hkl.load(f'{folder}raw/s1/{idx}.hkl'))
        radar_dates = hkl.load(f'{folder}raw/misc/s1_dates_{idx}.hkl')
        sentinel2 = to_float32(hkl.load(f'{folder}raw/s2/{idx}.hkl'))
        dem = hkl.load(f'{folder}raw/misc/dem_{idx}.hkl')
        image_dates = hkl.load(f'{folder}raw/misc/s2_dates_{idx}.hkl')
        print(image_dates)
        if os.path.exists(f'{folder}raw/clouds/shadows_{idx}.hkl'):
            shadows = hkl.load(f'{folder}raw/clouds/shadows_{idx}.hkl')
        else:
            print("No shadows file, so calculating shadows with L2A")
            shadows = mcm_shadow_mask(sentinel2, clouds)        
        
        to_remove, _ = calculate_cloud_steps(clouds, image_dates)
        print(sentinel2.shape, clouds.shape, shadows.shape, image_dates.shape)
        if len(to_remove) > 0:
            sentinel2 = np.delete(sentinel2, to_remove, axis = 0)
            clouds = np.delete(clouds, to_remove, axis = 0)
            shadows = np.delete(shadows, to_remove, axis = 0)
            image_dates = np.delete(image_dates, to_remove)
        print(f"{len(to_remove)} Cloudy and missing images removed, radar processed: {to_remove}")
        
        to_remove = remove_missed_clouds(sentinel2)
        if len(to_remove) > 0:
            sentinel2 = np.delete(sentinel2, to_remove, axis = 0)
            clouds = np.delete(clouds, to_remove, axis = 0)
            shadows = np.delete(shadows, to_remove, axis = 0)
            image_dates = np.delete(image_dates, to_remove)
        print(f"{len(to_remove)} missed cloudy images were removed: {to_remove}")
        
        x, interp = remove_cloud_and_shadows(sentinel2, clouds, shadows, image_dates)
        print("Clouds and shadows interpolated")    
                
        to_remove = np.argwhere(np.mean(interp, axis = (1, 2, 3)) > 0.5)
        print(f"{len(to_remove)} steps removed because of >50% interpolation rate")
        x = np.delete(x, to_remove, axis = 0)
        clouds = np.delete(clouds, to_remove, axis = 0)
        shadows = np.delete(shadows, to_remove, axis = 0)
        image_dates = np.delete(image_dates, to_remove)
                
        x = np.float32(x)
        x = superresolve_tile(x, model)
        
        dem_i = np.tile(dem[np.newaxis, 1:-1, 1:-1, :], (x.shape[0], 1, 1, 1))
        dem_i = dem_i / 90
        dem_i[dem_i > 0.25] = 0.25
        x = np.concatenate([x, dem_i], axis = -1)
        x = np.clip(x, 0, 1)
        
        interim_file = f"{folder}interim/{idx}.hkl"
        interim_dates = f"{folder}interim/dates_{idx}.hkl"
        hkl.dump(np.float16(x), interim_file, mode = 'w', compression = 'gzip')
        hkl.dump(image_dates, interim_dates, mode = 'w', compression = 'gzip')
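
`make_folder_names` enumerates the 25 subtiles of a large tile row by row, starting from the row with the largest y value. Working through its index arithmetic shows that it is equivalent to `x = i % 5` and `y = 4 - i // 5`, each offset by five times the tile step. The sketch below checks that equivalence. `simple_folder_names` is a hypothetical rewrite, not a function in the notebook.

```python
def make_folder_names(step_x: int, step_y: int):
    """Copy of the notebook's index arithmetic, for comparison."""
    x_vals, y_vals = [], []
    for i in range(25):
        y_val = (24 - i) // 5
        x_val = 5 - ((25 - i) % 5)
        x_val = 0 if x_val == 5 else x_val
        x_vals.append(x_val + 5 * step_x)
        y_vals.append(y_val + 5 * step_y)
    return x_vals, y_vals

def simple_folder_names(step_x: int, step_y: int):
    """Hypothetical equivalent: columns cycle 0-4, rows count down from 4."""
    x_vals = [i % 5 + 5 * step_x for i in range(25)]
    y_vals = [4 - i // 5 + 5 * step_y for i in range(25)]
    return x_vals, y_vals

assert make_folder_names(1, 0) == simple_folder_names(1, 0)
x_vals, y_vals = simple_folder_names(1, 0)
print(x_vals[:5], y_vals[:5])  # [5, 6, 7, 8, 9] [4, 4, 4, 4, 4]
print(y_vals[-1], x_vals[-1])  # 0 9
```

The last pair, `0` and `9`, matches the `0/9.npy` path printed for tile X: 1, Y: 0 in the execution log of section 2.5.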

# 2.5 Function execution

In [13]:
from scipy.ndimage import median_filter
downloaded = 0

if not os.path.exists(os.path.realpath(OUTPUT_FOLDER)):
    os.makedirs(os.path.realpath(OUTPUT_FOLDER))
        
print(f"Downloading {year} for {landscape}")

max_x = 50
max_y = 50

for y_tile in range(0, 4):
    for x_tile in range(0, 4):
        #contains = True
        contains = check_contains(coords, x_tile, y_tile, OUTPUT_FOLDER)
        print(y_tile, x_tile, contains, downloaded)
        if contains:
            print(f"Download {downloaded}/{max_x*max_y}; X: {x_tile} Y:{y_tile}")
            downloaded += 1
            download_large_tile(coord = coords, step_x = x_tile, step_y = y_tile)
            process_large_tile(coords, x_tile, y_tile)
            print("\n")

Downloading 2019 for ethiopia-aroresa
0 0 False 0
0 1 True 0
Download 0/2500; X: 1 Y:0
../project-monitoring/ethiopia/oromiya/kibre-mengist/2019/interim/0_1.hkl
Downloading because folder 0/9.npy does not exist
[ 14.  19.  24.  29.  34.  49.  54.  59.  69.  79. 109. 274. 354. 364.
 374. 384.]
0, Good steps: [14. 19. 24. 29.], min dist of 5.0, and 0.01 thresh
1, Good steps: [34. 49. 54.], min dist of 20.0, and 0.01 thresh
2, Good steps: [59. 69. 79.], min dist of 30.0, and 0.03 thresh
3, Good steps: [109.], min dist of 165.0, and 0.15 thresh
4, Good steps: [], min dist of 365, and 0.15 thresh
5, Good steps: [], min dist of 365, and 0.15 thresh
6, Good steps: [], min dist of 365, and 0.15 thresh
7, Good steps: [], min dist of 365, and 0.15 thresh
8, Good steps: [], min dist of 365, and 0.15 thresh
9, Good steps: [274.], min dist of 165.0, and 0.15 thresh
10, Good steps: [], min dist of 365, and 0.15 thresh
11, Good steps: [354. 364. 374. 384.], min dist of 80.0, and 0.06 thresh
[]
(16, 6

0 2 True 1
Download 1/2500; X: 2 Y:0
../project-monitoring/ethiopia/oromiya/kibre-mengist/2019/interim/0_2.hkl
Downloading because folder 0/14.npy does not exist
[ 14.  19.  24.  29.  34.  39.  49.  59.  69.  79.  84. 124. 139. 184.
 199. 214. 219. 244. 274. 339. 354. 364. 374.]
0, Good steps: [14. 19. 24. 29.], min dist of 5.0, and 0.01 thresh
1, Good steps: [34. 39. 49.], min dist of 20.0, and 0.01 thresh
2, Good steps: [59. 69. 79. 84.], min dist of 40.0, and 0.03 thresh
3, Good steps: [], min dist of 365, and 0.15 thresh
4, Good steps: [124. 139.], min dist of 55.0, and 0.03 thresh
5, Good steps: [], min dist of 365, and 0.15 thresh
6, Good steps: [184. 199.], min dist of 45.0, and 0.06 thresh
7, Good steps: [214. 219.], min dist of 25.0, and 0.06 thresh
8, Good steps: [244.], min dist of 30.0, and 0.06 thresh
9, Good steps: [274.], min dist of 65.0, and 0.06 thresh
10, Good steps: [], min dist of 365, and 0.15 thresh
11, Good steps: [339. 354. 364. 374.], min dist of 65.0, and 

0 3 True 2
Download 2/2500; X: 3 Y:0




The std before was 133.18931579589844, after median filter is 133.1517791748047
DEM used 3.2 processing units
There are 337 unique DEM values
../project-monitoring/ethiopia/oromiya/kibre-mengist/2019/interim/0_3.hkl
Downloading because folder 0/19.npy does not exist
[ 14.  19.  24.  29.  34.  49.  59.  69.  79.  84. 124. 214. 274. 339.
 354. 364. 374.]
0, Good steps: [14. 19. 24. 29.], min dist of 5.0, and 0.01 thresh
1, Good steps: [34. 49.], min dist of 20.0, and 0.01 thresh
2, Good steps: [59. 69. 79. 84.], min dist of 40.0, and 0.03 thresh
3, Good steps: [], min dist of 365, and 0.15 thresh
4, Good steps: [124.], min dist of 90.0, and 0.15 thresh
5, Good steps: [], min dist of 365, and 0.15 thresh
6, Good steps: [], min dist of 365, and 0.15 thresh
7, Good steps: [214.], min dist of 90.0, and 0.15 thresh
8, Good steps: [], min dist of 365, and 0.15 thresh
9, Good steps: [274.], min dist of 65.0, and 0.1 thresh
10, Good steps: [], min dist of 365, and 0.15 thresh
11, Good steps: [33

1 0 False 3
1 1 True 3
Download 3/2500; X: 1 Y:1
Downloading clouds because ../project-monitoring/ethiopia/oromiya/kibre-mengist/2019/raw/clouds/clouds_1_1.hkl does not exist
There are 256 unique cloud values
Cloud_probs used 0.1 processing units
There are 0 pixels that always have clouds
There are unique 9269 shadow L1C values


KeyboardInterrupt: 

In [14]:
INPUT_FOLDER = "/".join(OUTPUT_FOLDER.split("/")[:-2]) + "/"
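def process_single_year(coord: tuple,
                       step_x: int,
                       step_y: int,
                       year: int = 2019,
                       path: str = INPUT_FOLDER,
                       delete: bool = False) -> None:
    '''Wrapper function to load the interim tile for one year, calculate
       relevant indices, interpolate NaN values and temporal gaps, and save
       analysis-ready subtiles to the processed folder

       Parameters:
        coord (tuple): coordinates of the landscape (not used in the body)
        step_x (int): tile index in the x direction
        step_y (int): tile index in the y direction
        year (int): year subfolder to read from and write to
        path (str): root folder containing the {year}/interim, {year}/raw,
                    and {year}/processed subfolders
        delete (bool): if True, remove the interim file after processing

       Returns:
        None
    '''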
    idx = str(step_y) + "_" + str(step_x)
    if not os.path.exists(f"{path}/{year}/interim/{idx}.hkl"):
        print(f"Skipping because {path}/{year}/interim/{idx}.hkl does not exist")
    else:
        x_vals, y_vals = make_folder_names(step_x, step_y)
        dates = hkl.load(f"{path}/{year}/interim/dates_{idx}.hkl")
        s2 = hkl.load(f"{path}/{year}/interim/{idx}.hkl").astype(np.float32)
        s2 = np.clip(s2, 0, 1)
        s1 = hkl.load(f"{path}/{year}/raw/s1/{idx}.hkl")
        
        
        s2 = evi(s2, verbose = True)
        s2 = bi(s2, verbose = True)
        s2 = msavi2(s2, verbose = True)
        s2 = si(s2, verbose = True)

        # Interpolate NA values that msavi2 occasionally introduces: replace each
        # affected date with the mean of the nearest clean dates before and after
        for x_loc in range(s2.shape[1]):
            for y_loc in range(s2.shape[2]):
                n_na = np.sum(np.isnan(s2[:, x_loc, y_loc, :]), axis = 1)
                for date in range(s2.shape[0]):
                    if n_na.flatten()[date] > 0:
                        before, after = calculate_proximal_steps(date, np.argwhere(n_na == 0))
                        s2[date, x_loc, y_loc, :] = (s2[date + before, x_loc, y_loc] + s2[date + after, x_loc, y_loc]) / 2
        
        numb_na = np.sum(np.isnan(s2), axis = (1, 2, 3))
        print(numb_na)
        index = 0
        tiles = tile_window(IMSIZE, IMSIZE, window_size = 142)
        for t in tiles:
            start_x, start_y = t[0], t[1]
            end_x = start_x + t[2]
            end_y = start_y + t[3]
            subset = s2[:, start_x:end_x, start_y:end_y, :]
            subtile, _ = calculate_and_save_best_images(subset, dates)
            print(np.sum(np.isnan(subtile), axis = (1, 2, 3)))
            output = f"{path}/{year}/processed/{y_vals[index]}/{x_vals[index]}.hkl"

            index += 1
            print(f"{index}: The output file is {output}")
            sm = Smoother(lmbd = 800, size = subtile.shape[0], nbands = 14, dim = subtile.shape[1])
            subtile = sm.interpolate_array(subtile)
            subtile = np.concatenate([subtile, s1[:, start_x:end_x, start_y:end_y, :]], axis = -1)

            output_folder = "/".join(output.split("/")[:-1])
            if not os.path.exists(os.path.realpath(output_folder)):
                os.makedirs(os.path.realpath(output_folder))
            subtile = np.float32(subtile)
            subtile = np.reshape(subtile, (12, 2, 142, 142, subtile.shape[-1]))
            subtile = np.mean(subtile, axis = (1))
            print(subtile.shape)
            assert subtile.shape[1] == 142, f"subtile shape is {subtile.shape}"

            hkl.dump(subtile, output, mode='w', compression='gzip')
        if delete:
            os.remove(f"{path}/{year}/interim/{idx}.hkl")
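
The nested loop in `process_single_year` replaces any date that contains NaN values with the mean of the nearest clean dates on either side. The sketch below shows the same idea for a single pixel's time series. `fill_nan_steps` is a hypothetical helper written for illustration; it does not reproduce the notebook's `calculate_proximal_steps`, and its handling of series edges (reusing the only available neighbour) is an assumption.

```python
import numpy as np

def fill_nan_steps(series):
    """Hypothetical helper, not part of the notebook.

    series: (time, bands) array for one pixel. Any time step containing a NaN
    is replaced by the mean of the nearest clean steps before and after it.
    """
    series = series.copy()
    bad = np.isnan(series).any(axis=1)
    good_idx = np.flatnonzero(~bad)
    if good_idx.size == 0:
        return series  # nothing to interpolate from
    for t in np.flatnonzero(bad):
        before = good_idx[good_idx < t]
        after = good_idx[good_idx > t]
        # At the edges only one neighbour exists, so it is used twice (assumption)
        lo = before[-1] if before.size else after[0]
        hi = after[0] if after.size else before[-1]
        series[t] = (series[lo] + series[hi]) / 2
    return series

pixel = np.array([[1.0, 1.0], [np.nan, 2.0], [3.0, 3.0]])
filled = fill_nan_steps(pixel)
print(filled)
```

Only steps that were clean in the original series are used as neighbours, so an interpolated value is never reused to fill another gap.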

In [None]:
max_x = 14
max_y = 28

for y_tile in range(0, 4):
    for x_tile in range(0, 4):
        contains = True
        #contains = check_contains(coords, x_tile, y_tile, OUTPUT_FOLDER)
        if contains:
            process_single_year(coords, x_tile, y_tile, year = 2019, delete = True)
            print("\n")
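
Before saving, `process_single_year` reshapes each subtile to `(12, 2, 142, 142, bands)` and averages over the second axis, which turns 24 time steps into 12 by averaging consecutive pairs. A minimal sketch of that reduction on a small synthetic array follows; `pairwise_time_mean` and its `n_out` argument are hypothetical names, and the 3 x 3 pixel size is chosen only to keep the example small.

```python
import numpy as np

def pairwise_time_mean(arr, n_out=12):
    """Hypothetical helper, not part of the notebook.

    Average consecutive groups of time steps:
    (T, H, W, C) -> (n_out, H, W, C), where T must be a multiple of n_out.
    """
    t, h, w, c = arr.shape
    if t % n_out != 0:
        raise ValueError(f"{t} time steps cannot be split into {n_out} groups")
    # Group neighbouring steps on a new axis, then average that axis away
    grouped = arr.reshape(n_out, t // n_out, h, w, c)
    return grouped.mean(axis=1)

# Synthetic stack where every pixel holds its own time index (0..23)
steps = np.arange(24, dtype=np.float32).reshape(24, 1, 1, 1)
arr = np.broadcast_to(steps, (24, 3, 3, 2)).copy()
monthly = pairwise_time_mean(arr)
print(monthly.shape)
```

The reshape relies on the time axis being first and in chronological order, so that steps 0 and 1 fall in the first group, steps 2 and 3 in the second, and so on.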