# Select Sentinel-2 Scenes covering 10km Grid for defining chip centers

This notebook creates a list of Sentinel-2 scenes to use for deploying the model.

* Uses 10km Grid output from the infrastructure density model
* Finds Sentinel-2 scenes that cover the deployment region

## Import required libraries

In [1]:
from earthai.all import *
import earthai.chipping.strategy as chp
import pyspark.sql.functions as F
import geopandas as gpd
import pandas as pd
import os

import shapely.wkt
from shapely.geometry.multipolygon import MultiPolygon
from shapely.geometry.polygon import Polygon


Importing EarthAI libraries.
EarthAI version 1.6.0; RasterFrames version 0.9.0; PySpark version 2.4.7



## Define input and output files and parameters

### Input files

* `macro10km_cement_shp` is a shapefile specifying the 10km grid from the infrastructure density model for cement
* `macro10km_steel_shp` is a shapefile specifying the 10km grid from the infrastructure density model for steel

In [2]:
macro10km_cement_shp = '../../resources/nt-model/10km_CS_revised/macroloc_cement_CHN_10_correct1.shp'
macro10km_steel_shp = '../../resources/nt-model/10km_CS_revised/macroloc_steel_CHN_10_correct1.shp'

### Parameters

* `chip_size` is the side length, in pixels, of the chips to create

In [3]:
chip_size = 300 # 3 km for Sentinel-2
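
The comment on `chip_size` relies on simple arithmetic: at 10 m per pixel, which is the resolution of the Sentinel-2 visible and near-infrared bands, a 300-pixel chip spans 3 km on a side. A minimal sketch of that conversion follows; `PIXEL_SIZE_M` and `chip_extent_km` are illustrative names that are not used elsewhere in this notebook.

```python
# Assumption: the bands of interest are the 10 m Sentinel-2 bands.
PIXEL_SIZE_M = 10

def chip_extent_km(chip_size_px, pixel_size_m=PIXEL_SIZE_M):
    """Return the side length of a square chip in kilometers."""
    return chip_size_px * pixel_size_m / 1000

print(chip_extent_km(300))  # 3.0
```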

### Output files and paths

* `output_path` is the directory to write output data to
* `deployment_gjson` is the output GeoJSON file of the deployment region
* `grid_gjson` is the output GeoJSON file of the 10km grid
* `s2_scene_gjson` is the output GeoJSON file of the Sentinel-2 scene extents

In [4]:
output_path = '../../resources/macro-loc-model-deployment4/'
deployment_gjson = 'S2-deployment-region-CHN-10km-nowater.geojson'
grid_gjson = 'S2-deployment-grid-CHN-10km-nowater.geojson'
s2_scene_gjson = 'S2-deployment-scene-extents-CHN-10km-nowater.geojson'

In [5]:
# Create the output directory (and any missing parent directories) if needed
os.makedirs(output_path, exist_ok=True)

## Load 10km grid from infrastructure density model

* Add a buffer of about one chip size around the geometries so that chips are uniform and cover the full region
* Combine into a single multipolygon by taking the unary union
* Write out the deployment region to GeoJSON

### Load in 10km grids with no waterbodies

#### Cement

In [6]:
macro10km_cement_gdf = gpd.read_file(macro10km_cement_shp)
macro10km_cement_gdf = macro10km_cement_gdf[['index', 'preds', 'geometry']]
macro10km_cement_gdf = macro10km_cement_gdf.rename(columns={'index': 'inds_id', 
                                                    'preds': 'inds_cmt_pred'})
print("Cement grid CRS: ", macro10km_cement_gdf.crs)
print("Number of cement grid cells: ", len(macro10km_cement_gdf))

Cement grid CRS:  epsg:4326
Number of cement grid cells:  24258


#### Steel

In [7]:
macro10km_steel_gdf = gpd.read_file(macro10km_steel_shp)
macro10km_steel_gdf = macro10km_steel_gdf[['index', 'preds']]
macro10km_steel_gdf = macro10km_steel_gdf.rename(columns={'index': 'inds_id', 
                                                  'preds': 'inds_stl_pred'})
print("Number of steel grid cells: ", len(macro10km_steel_gdf))

Number of steel grid cells:  24258


#### Join cement and steel

In [8]:
macro_10km_gdf = pd.merge(macro10km_cement_gdf, macro10km_steel_gdf,
                         how='inner', on='inds_id')

#### Write out merged 10km grid

In [9]:
macro_10km_gdf.to_file(output_path+grid_gjson, driver='GeoJSON')

### Add small buffer to geometries in grid

*Note: 1 arcsec = 0.00028 deg ~ 30m at the equator.*
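
No code for this step appears in the notebook as exported. As a rough sketch of how the note translates into a buffer distance, the helper below converts meters to degrees using the approximation from the note (1 arcsec ~ 30 m at the equator). The name `meters_to_degrees` and the 3 km buffer (one chip width) are assumptions for illustration only. The approximation also overstates east-west ground distances away from the equator.

```python
ARCSEC_DEG = 1 / 3600        # 1 arcsec in degrees (~0.00028)
METERS_PER_ARCSEC = 30       # approximate, at the equator (from the note above)

def meters_to_degrees(meters):
    """Approximate a ground distance in meters as an angle in degrees."""
    return meters / METERS_PER_ARCSEC * ARCSEC_DEG

# Hypothetical buffer of one chip width: 300 px * 10 m = 3000 m
buffer_deg = meters_to_degrees(3000)
print(round(buffer_deg, 4))  # 0.0278

# With GeoPandas, such a distance could then be applied along the lines of:
#   macro_10km_gdf['geometry'] = macro_10km_gdf.geometry.buffer(buffer_deg)
```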

### Union to create simpler DataFrame of deployment region

In [10]:
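# Dissolve all grid cells into one geometry, then split it into its separate parts
macro_10km_union = macro_10km_gdf.unary_union
# `.geoms` and `.geom_type` work on both Shapely 1.x and 2.x
# (assumes the union is a MultiPolygon, i.e. more than one disjoint region)
macro_10km_union = [MultiPolygon([x]) if (x.geom_type == 'Polygon') else x for x in macro_10km_union.geoms]
reg_cnt = len(macro_10km_union)
# 1-based region IDs, zero-padded to a common width
reg_ind = [str(ind).zfill(len(str(reg_cnt))) for ind in range(1, reg_cnt+1)]
macro_deployment_gdf = gpd.GeoDataFrame({'reg_id': reg_ind,
                                         'geometry': gpd.GeoSeries(macro_10km_union)},
                                        geometry='geometry',
                                        crs='EPSG:4326')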
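
The region IDs built in the cell above are 1-based counters, left-padded with zeros to the width of the largest ID so that they sort correctly as strings. A small standalone illustration, in which `make_region_ids` is a hypothetical helper name:

```python
def make_region_ids(count):
    """Return 1-based IDs as strings, zero-padded to a common width."""
    width = len(str(count))
    return [str(i).zfill(width) for i in range(1, count + 1)]

ids = make_region_ids(12)
print(ids[0], ids[-1])  # 01 12
```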

### Write out deployment region

In [11]:
macro_deployment_gdf.to_file(output_path+deployment_gjson, driver='GeoJSON')

## Get catalog of Sentinel-2 scenes that intersect with grid cells

* Queries EarthAI Catalog to find S2 scenes that intersect with grid cells
* Returns all scenes for June 2020, which is enough to cover the full deployment region
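
The cells that follow query the catalog in batches of 2,000 grid cells rather than in a single request. The batching logic can be sketched on its own as below. `chunk_bounds` is a hypothetical helper, and the sketch returns half-open `(start, end)` pairs suitable for slicing.

```python
def chunk_bounds(n_rows, chunk_size):
    """Return (start, end) pairs covering range(n_rows); end is exclusive."""
    return [(start, min(start + chunk_size, n_rows))
            for start in range(0, n_rows, chunk_size)]

bounds = chunk_bounds(24258, 2000)
print(len(bounds))   # 13
print(bounds[-1])    # (24000, 24258)
```

Capping the end index at `n_rows` avoids an empty final batch when the row count is an exact multiple of the batch size.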

In [12]:
row_cnt = len(macro_10km_gdf)
# Query the catalog in batches of 2,000 grid cells; end indices are exclusive
start_index = list(range(0, row_cnt, 2000))
end_index = [min(si + 2000, row_cnt) for si in start_index]
site_cat_list = []

In [13]:
for si, ei in zip(start_index, end_index):
    cat = earth_ondemand.read_catalog(
        geo=macro_10km_gdf[si:ei],
        start_datetime='2020-06-01', 
        end_datetime='2020-06-30',
        max_cloud_cover=100,
        collections='sentinel2_l2a'
    )
    site_cat_list.append(cat)
    print('Done loading catalog for rows ', si, ' through ', ei-1)

Done loading catalog for rows  0  through  1999
Done loading catalog for rows  2000  through  3999
Done loading catalog for rows  4000  through  5999
Done loading catalog for rows  6000  through  7999
Done loading catalog for rows  8000  through  9999
Done loading catalog for rows  10000  through  11999
Done loading catalog for rows  12000  through  13999
Done loading catalog for rows  14000  through  15999
Done loading catalog for rows  16000  through  17999
Done loading catalog for rows  18000  through  19999
Done loading catalog for rows  20000  through  21999
Done loading catalog for rows  22000  through  23999
Done loading catalog for rows  24000  through  24257


In [14]:
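# Combine the per-batch catalogs and drop scenes returned by more than one batch
site_cat = pd.concat(site_cat_list, axis=0, join='outer', ignore_index=True) \
             .drop_duplicates(subset='id', ignore_index=True)
# Sort by acquisition time and keep the earliest entry per grid ID
# (`first()` takes the first non-null value in each column)
site_cat = site_cat.sort_values('datetime') \
                   .groupby('eod_grid_id') \
                   .first() \
                   .reset_index()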
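
The selection above keeps one catalog entry per Sentinel-2 grid ID, preferring the earliest acquisition. The same idea in plain Python is sketched below on made-up records; the IDs and timestamps are illustrative, not catalog data, and `earliest_per_grid` is a hypothetical helper. Unlike `groupby(...).first()`, which takes the first non-null value in each column, this sketch keeps whole records.

```python
def earliest_per_grid(records):
    """Keep the earliest record for each 'eod_grid_id'."""
    earliest = {}
    for rec in sorted(records, key=lambda r: r['datetime']):
        # setdefault stores only the first (earliest) record seen per grid ID
        earliest.setdefault(rec['eod_grid_id'], rec)
    return list(earliest.values())

# Made-up example records; ISO 8601 strings sort chronologically
records = [
    {'id': 'a', 'eod_grid_id': 'T1', 'datetime': '2020-06-12T03:00:00Z'},
    {'id': 'b', 'eod_grid_id': 'T1', 'datetime': '2020-06-02T03:00:00Z'},
    {'id': 'c', 'eod_grid_id': 'T2', 'datetime': '2020-06-05T03:00:00Z'},
]
print([r['id'] for r in earliest_per_grid(records)])  # ['b', 'c']
```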

### Print counts of interest

In [15]:
s2_scene_cnt = site_cat.eod_grid_id.nunique()
print('Number of Geometries in deployment region: ', reg_cnt)
print('Number of Sentinel-2 scenes in deployment regions: ', s2_scene_cnt)

Number of Geometries in deployment region:  2014
Number of Sentinel-2 scenes in deployment regions:  1099


## Write out scene extents to GeoJSON

In [16]:
scene_geom_pdf = site_cat[['eod_grid_id', 'eod_epsg4326_geometry_simplified']]

In [17]:
scene_geom_gdf = gpd.GeoDataFrame({'grid_id': scene_geom_pdf.eod_grid_id,
                                   'grid_extent': scene_geom_pdf.eod_epsg4326_geometry_simplified},
                                  geometry='grid_extent',
                                  crs='EPSG:4326')

In [18]:
scene_geom_gdf.to_file(output_path+s2_scene_gjson, driver='GeoJSON')