# Select Landsat 8 Scenes covering 10km Grid for defining chip centers

This notebook creates a GeoJSON file defining the deployment regions for the Landsat 8 TIR macro-localization model, and a list of Landsat 8 scenes to use for defining the deployment grid in the following step.

Landsat 8 scenes that share a grid ID but were acquired on different dates do not map to exactly the same projected extents, yet matching extents are required when combining these images into the 3-band dataset for deployment. The scene list produced here is the basis for a per-scene grid of tile centroids, which we can use to create chips of the desired size, centered at the same lat/long for every acquisition date.

* Uses the 10km grid output from the proximity to infrastructure model
* Finds Landsat 8 scenes that cover the deployment region
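
To illustrate what a grid of tile centroids looks like, here is a minimal sketch that tiles a projected extent with square chips and returns their centers. It is not the code used in the following step: the function name `chip_centers`, the 30 m pixel size, and the example extent are assumptions made for illustration only.

```python
def chip_centers(xmin, ymin, xmax, ymax, chip_size=35, pixel_size=30.0):
    """Return (x, y) centers of non-overlapping square chips tiling an extent.

    Coordinates are assumed to be in meters in a projected CRS.
    Partial chips at the right and top edges are dropped.
    """
    step = chip_size * pixel_size  # chip side length in meters
    n_cols = int((xmax - xmin) // step)
    n_rows = int((ymax - ymin) // step)
    return [
        (xmin + (col + 0.5) * step, ymin + (row + 0.5) * step)
        for row in range(n_rows)
        for col in range(n_cols)
    ]


# Hypothetical extent, 3150 m wide and 2100 m tall
centers = chip_centers(500000.0, 4000000.0, 503150.0, 4002100.0)
print(len(centers), centers[0])
```

Because the centers are computed once from a single reference extent, reusing them for every acquisition date keeps chips aligned even when the individual scene extents shift slightly.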

## Import required libraries

In [None]:
from earthai.all import *
import earthai.chipping.strategy as chp
import pyspark.sql.functions as F
import geopandas as gpd
import pandas as pd
import os

## Define input and output files and parameters

### Input files

* `macro_10km_shp` is a shapefile specifying the 10km grid from the proximity to infrastructure model

In [None]:
macro_10km_shp = "../../resources/nt-model/10km_CS_macro/macroloc_cement_steel_CHN_10.shp"

### Parameters

* `chip_size` is the side length, in pixels, of the square chips to create
* `pred_thresh` is the prediction threshold for selecting deployment grid cells

In [None]:
chip_size = 35 # 1.05 km for Landsat 8
pred_thresh = 0.002
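
The 1.05 km figure in the comment above follows from the chip size in pixels and the pixel size. A small sketch of that arithmetic, assuming 30 m pixels (the constant and function names are illustrative, not part of the notebook):

```python
L8_PIXEL_SIZE_M = 30  # assumed pixel size in meters


def chip_side_m(chip_size_px, pixel_size_m=L8_PIXEL_SIZE_M):
    """Side length of a square chip in meters."""
    return chip_size_px * pixel_size_m


side_m = chip_side_m(35)
print(side_m / 1000, "km per side")
```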

### Output files and paths

* `output_path` is the directory to write data to
* `deployment_gjson` is the output GeoJSON file of the deployment region
* `catalog_csv` is a CSV file of the catalog returned from EOD
* `l8_scene_gjson` is an output GeoJSON file with Landsat 8 scene extents

In [None]:
output_path = '../../resources/macro-loc-model-deployment/'
deployment_gjson = 'L8-deployment-region-CHN-10km-pthsh'+str(pred_thresh)+'.geojson'
catalog_csv = 'L8-deployment-catalog-CHN-10km-pthsh'+str(pred_thresh)+'.csv'
l8_scene_gjson = 'L8-deployment-scene-extents-CHN-10km-pthsh'+str(pred_thresh)+'.geojson'

## Load 10km grid from proximity to infrastructure model

* Filter by `pred_thresh`
* Add a buffer equivalent to about 1 chip size around the geometries to ensure chips are uniform and cover full region
* Combine into a single multipolygon by taking the unary union
* Write out deployment regions to GeoJSON

### Load in and filter 10km grid by `pred_thresh`

In [None]:
macro_10km_gdf = gpd.read_file(macro_10km_shp)
macro_10km_gdf = macro_10km_gdf[macro_10km_gdf.preds >= pred_thresh]
print("CRS: ", macro_10km_gdf.crs)
print("Number of grid cells in 10km CS Macro: ", len(macro_10km_gdf))
macro_10km_gdf.plot()

### Add small buffer to geometries in grid

*Note: 1 arcsec = 0.00028 deg ~ 30m at the equator.*

In [None]:
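# The buffer distance is in degrees, so this assumes the grid is in EPSG:4326 (see the CRS printed above)
macro_10km_gdf = gpd.GeoDataFrame({'index': macro_10km_gdf.index,
                                   'geometry': macro_10km_gdf.buffer(0.00028*chip_size)},
                                   geometry='geometry',
                                   crs='EPSG:4326')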
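
The buffer is specified in degrees, so its size on the ground depends on latitude. The sketch below converts the same buffer expression to approximate meters. The value of 111,320 m per degree of latitude is a common approximation, and the latitude of 35° is only an illustrative assumption, not a value taken from the data.

```python
import math

ARCSEC_DEG = 1 / 3600      # one arcsecond in degrees (about 0.00028)
M_PER_DEG_LAT = 111_320    # approximate meters per degree of latitude


def buffer_deg_to_m(buffer_deg, lat_deg):
    """Approximate north-south and east-west extent (m) of a buffer in degrees."""
    north_south = buffer_deg * M_PER_DEG_LAT
    # A degree of longitude shrinks with the cosine of the latitude
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west


buffer_deg = 0.00028 * 35  # same expression as the buffer above, with chip_size = 35
ns_m, ew_m = buffer_deg_to_m(buffer_deg, lat_deg=35.0)
print(round(ns_m), round(ew_m))
```

Away from the equator the east-west buffer is narrower than the north-south buffer, so the "about 1 chip size" margin is only approximate in the east-west direction.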

### Union to create simpler DataFrame of deployment region

In [None]:
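macro_10km_union = macro_10km_gdf.unary_union
# Multi-part geometries do not support len() in Shapely 2.0, and the union is a single
# Polygon when all cells merge, so normalize the result to a list of polygons
region_geoms = list(getattr(macro_10km_union, 'geoms', [macro_10km_union]))
reg_cnt = len(region_geoms)
reg_ind = [str(ind).zfill(len(str(reg_cnt))) for ind in range(1, reg_cnt+1)]
macro_deployment_gdf = gpd.GeoDataFrame({'index': reg_ind,
                                         'geometry': region_geoms},
                                        geometry='geometry',
                                        crs='EPSG:4326')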
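
The region index is zero-padded to the width of the largest index so that the string labels sort in numeric order. A standalone sketch of that labeling (the helper name `region_labels` is hypothetical):

```python
def region_labels(count):
    """Return 1-based labels, zero-padded to the width of `count`."""
    width = len(str(count))
    return [str(i).zfill(width) for i in range(1, count + 1)]


labels = region_labels(12)
print(labels[0], labels[-1])
```

Without the padding, a string sort would place `'10'` before `'2'`.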

### Write out deployment region vector file

In [None]:
macro_deployment_gdf.to_file(output_path+deployment_gjson, driver='GeoJSON')

## Get catalog of Landsat 8 scenes that intersect with grid cells

* Queries EarthAI Catalog to find L8 scenes that intersect with grid cells
* Requests all scenes from April through June 2020, a window that provides coverage of the full deployment region
* Join back to grid cells for chipping

*Note: the catalog is queried in batches of 2,000 grid cells as a workaround for the 500 server error that I get when reading in the full region at once.*

In [None]:
row_cnt = len(macro_10km_gdf)
batch_size = 2000  # grid cells per catalog request
start_index = list(range(0, row_cnt, batch_size))
end_index = [min(si + batch_size, row_cnt) for si in start_index]  # exclusive end of each batch
site_cat_list = []

In [None]:
for si, ei in zip(start_index, end_index):
    cat = earth_ondemand.read_catalog(
        geo=macro_10km_gdf[si:ei],
        start_datetime='2020-04-01', 
        end_datetime='2020-06-30',
        max_cloud_cover=100,
        collections='landsat8_l1tp'
    )
    site_cat_list.append(cat)
    print('Done loading catalog for rows ', si, ' through ', ei-1)
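
The batching used above can be expressed as a small generator. This is a sketch of the same idea, not code from the notebook; the name `chunk_bounds` is hypothetical.

```python
def chunk_bounds(n_rows, chunk_size=2000):
    """Yield (start, end) pairs covering range(n_rows); `end` is exclusive."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


bounds = list(chunk_bounds(4500))
print(bounds)
```

Each pair can be used directly as a positional slice, for example `macro_10km_gdf[start:end]`, and an empty input yields no batches instead of raising an error.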

In [None]:
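# Combine the batches and drop scenes that were returned by more than one batch
site_cat = pd.concat(site_cat_list, axis=0, join='outer', ignore_index=True) \
             .drop_duplicates(subset='id', ignore_index=True)
# Group on a copy of the grid ID so that `eod_grid_id` stays in the output columns
site_cat['grp_grid'] = site_cat['eod_grid_id']
# Keep the earliest scene per grid ID (first() takes the first non-null value per column)
site_cat = site_cat.sort_values('datetime') \
                   .groupby('grp_grid') \
                   .first() \
                   .reset_index(drop=True)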
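
To make the selection rule explicit, here is the same logic restated in plain Python on a few made-up records: drop duplicate scene IDs, sort by date, and keep the first scene seen for each grid ID. The record values are hypothetical, and the sketch assumes ISO-formatted date strings, which sort chronologically.

```python
# Hypothetical catalog records; only the three fields used by the rule are shown
records = [
    {"id": "a", "eod_grid_id": "123032", "datetime": "2020-05-02"},
    {"id": "b", "eod_grid_id": "123032", "datetime": "2020-04-16"},
    {"id": "b", "eod_grid_id": "123032", "datetime": "2020-04-16"},  # duplicate
    {"id": "c", "eod_grid_id": "124032", "datetime": "2020-06-10"},
]


def earliest_per_grid(rows):
    """Return one record per grid ID: the earliest by `datetime`."""
    unique = {row["id"]: row for row in rows}  # one record per scene id
    earliest = {}
    for row in sorted(unique.values(), key=lambda r: r["datetime"]):
        # setdefault keeps the first (earliest) record for each grid ID
        earliest.setdefault(row["eod_grid_id"], row)
    return list(earliest.values())


selected = earliest_per_grid(records)
print([row["id"] for row in selected])
```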

### Write out catalog as csv

In [None]:
site_cat.to_csv(output_path+catalog_csv, index=False)

### Print counts of interest

In [None]:
l8_scene_cnt = site_cat.eod_grid_id.nunique()
print('Number of Geometries in deployment region: ', reg_cnt)
print('Number of Landsat 8 scenes in deployment regions: ', l8_scene_cnt)

## Write out scene extents to GeoJSON

In [None]:
scene_geom_pdf = site_cat[['eod_grid_id', 'eod_epsg4326_geometry_simplified']]

In [None]:
scene_geom_gdf = gpd.GeoDataFrame({'scene_id': scene_geom_pdf.eod_grid_id,
                                   'scene_extent': scene_geom_pdf.eod_epsg4326_geometry_simplified},
                                  geometry='scene_extent',
                                  crs='EPSG:4326')

In [None]:
type(scene_geom_gdf)

In [None]:
scene_geom_gdf.to_file(output_path+l8_scene_gjson, driver='GeoJSON')