#### Before running anything, install the two extra libraries used for the visualisations in these notebooks

In [1]:
!pip install folium mapclassify



In [6]:
import geopandas as gpd
import pandas as pd
from shapely import wkt
import matplotlib.colors as colors

from huggingface_hub import hf_hub_download

### Download definition files

Two definition files are required:

- One file (.geojson) outlining the geographic extent of each tile
- One file (.csv) giving the train/test/validation split for each tile

Each file is indexed by the alphanumeric identifier of the tile. Every datachip that covers a given geographic tile (i.e. the same physical area on the surface of the Earth) has the same identifier across all component datasets (e.g. S1GRD, S2RGB).
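Because the identifier is the shared key, it can be worth checking that two collections of identifiers actually agree before relying on them together. The sketch below is illustrative only: `compare_identifiers` is a hypothetical helper (not part of the dataset tooling), and the identifiers are toy placeholders rather than real tile identifiers.

```python
def compare_identifiers(left, right):
    """Return the identifiers found only in `left` and only in `right`."""
    left_ids, right_ids = set(left), set(right)
    return sorted(left_ids - right_ids), sorted(right_ids - left_ids)


# Toy identifiers for illustration only; these are not real tile identifiers.
only_in_chips, only_in_splits = compare_identifiers(
    ["tile_a", "tile_b", "tile_c"],
    ["tile_b", "tile_c", "tile_d"],
)
print(only_in_chips, only_in_splits)
```

Once the definition files are loaded as dataframes, their `identifier` columns can be passed to the helper directly, since it accepts any iterable.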

In [7]:
# Name of the Hugging Face repo
HUGGINGFACE_REPO_ID = "M3LEO-miniset/conus"

# Relevant files to retrieve
chipdefs_remote = 'conus_partitions_aschips_293d95e3ee589_miniset.geojson'
splitdefs_remote = 'conus_partitions_aschips_293d95e3ee589_splits_60bands_angle09_60-20-20_miniset.csv'

We download the relevant files from the HuggingFace repository using hf_hub_download():

In [8]:
chipdefs_local = hf_hub_download(repo_id=HUGGINGFACE_REPO_ID, filename=chipdefs_remote, repo_type='dataset')
splitdefs_local = hf_hub_download(repo_id=HUGGINGFACE_REPO_ID, filename=splitdefs_remote, repo_type='dataset')

print(f"Chip definitions downloaded to {chipdefs_local}")
print(f"Split definitions downloaded to {splitdefs_local}")

Chip definitions downloaded to /home/matt/.cache/huggingface/hub/datasets--M3LEO-miniset--conus/snapshots/a5b2fc3f3e08e4e58f4038f3969b93beaab7d168/conus_partitions_aschips_293d95e3ee589_miniset.geojson
Split definitions downloaded to /home/matt/.cache/huggingface/hub/datasets--M3LEO-miniset--conus/snapshots/a5b2fc3f3e08e4e58f4038f3969b93beaab7d168/conus_partitions_aschips_293d95e3ee589_splits_60bands_angle09_60-20-20_miniset.csv


### Load tile definitions

In [9]:
tiles_df = gpd.read_file(chipdefs_local)
splits_df = pd.read_csv(splitdefs_local)

We can now see that the tiles definition file contains the geometry of each tile. Note that, for the miniset, we store the index of each tile as found in the full dataset under original_index. You can ignore it for the rest of these notebooks.

In [10]:
tiles_df.head()

Unnamed: 0,area_km2,identifier,original_index,geometry
0,21.965738,0141dce56d692,6,"POLYGON ((-124.42348 40.41738, -124.42348 40.4..."
1,21.929954,197e06c26aa3d,25,"POLYGON ((-124.26954 40.29386, -124.26954 40.3..."
2,21.938541,142643de7174a,52,"POLYGON ((-124.26562 40.41450, -124.26562 40.4..."
3,21.959804,2018fe7c56825,56,"POLYGON ((-124.32435 40.56045, -124.32435 40.6..."
4,21.692478,112f75269df78,98,"POLYGON ((-123.51304 38.71519, -123.51304 38.7..."


The splits definition file lists the split for each tile:

In [11]:
splits_df.head()

Unnamed: 0,identifier,split,original_index
0,020d259a89240,train,41505
1,1e92ebf944d4d,train,59541
2,1440370d0d286,train,69697
3,11627afc2a613,train,131544
4,057aa09d02a81,train,146310
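The first five rows only show `train`, so it can help to summarise how the tiles are distributed over the splits. The `60-20-20` in the file name suggests a 60/20/20 partition, although the miniset may not preserve those proportions exactly. The following is a minimal sketch, assuming the split labels are plain strings; `split_fractions` is a hypothetical helper, and the label names other than `train` are placeholders.

```python
from collections import Counter


def split_fractions(labels):
    """Return the fraction of tiles assigned to each split label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Toy labels for illustration only; names other than 'train' are assumed.
example_labels = ["train"] * 6 + ["validation"] * 2 + ["test"] * 2
fractions = split_fractions(example_labels)
print(fractions)
```

With the real data, `split_fractions(splits_df['split'])` gives the same summary for the downloaded file.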


### Visualize tiles

There are a lot of tiles, even for the miniset. We've provided a .wkt file describing Massachusetts, which we'll use to limit the area of interest (AOI) in these notebooks. We'll also colour the tiles by split, by merging the two dataframes.

In [14]:
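# Limit the tiles to Massachusetts
with open('../data/massachusetts.wkt') as f:
    bounds = wkt.load(f)

# Keep only the tiles whose geometry intersects the state boundary
tiles_df = tiles_df[tiles_df.intersects(bounds)]
# Attach the split label to each tile (inner join on the shared identifier)
chips_splits_df = tiles_df.merge(splits_df, on='identifier')
# Interactive map with one colour per split
chips_splits_df.explore(column='split', cmap=colors.ListedColormap(['red', 'blue', 'green']))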
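Which colour ends up on which split is implicit in the `explore()` call. To our understanding, geopandas treats a string column as categorical, sorts its unique values, and indexes the colormap in that order, so the colours follow the alphabetical order of the split labels rather than the order in which they appear. The sketch below makes that pairing explicit under this assumption. `colours_by_label` is a hypothetical helper, not a geopandas function, and the label names other than `train` are placeholders.

```python
def colours_by_label(labels, colours):
    """Pair each unique label, taken in sorted order, with a colour."""
    unique = sorted(set(labels))
    if len(unique) > len(colours):
        raise ValueError("not enough colours for the number of labels")
    return dict(zip(unique, colours))


# Toy labels for illustration only; names other than 'train' are assumed.
mapping = colours_by_label(
    ["train", "test", "validation", "train"],
    ["red", "blue", "green"],
)
print(mapping)
```

Calling `colours_by_label(chips_splits_df['split'], ['red', 'blue', 'green'])` and comparing the result with the map legend is a quick way to confirm the assumption holds for the installed geopandas version.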