# Maxar Image Availability Analysis

The Maxar image availability workflow takes a list of TerraFund project IDs as input and returns a CSV listing every project along with how much of that project’s area is covered by Maxar imagery.

#### Workflow:
1. Pull info on project characteristics for the entire portfolio using the TerraMatch API
    - Repo/notebook: terrafund-portfolio-analysis/tm-api.ipynb
    - Input: list of TerraFund project IDs
    - Output: csv of all project features
2. Using the TM API csv, pull Maxar metadata
    - Repo/notebook: maxar-tools/decision-tree-metadata.ipynb and maxar-tools/src/decision_tree.py (may need to change because of my additions to the acquire_metadata function)
    - Input: csv of project features
    - Output: csv of maxar metadata
3. Create imagery features (??)
    - Repo/notebook: terrafund-portfolio-analysis/maxar-img-avail.py
    - Input: csv of maxar metadata and csv of TM project features
    - Output: csv of project features and percent imagery coverage
4. Identify projects with 100% imagery coverage
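
Step 4 is not implemented in this notebook yet. A minimal sketch of how it could work, assuming the step 3 output is a per-polygon table with `project_id`, `poly_area_ha`, and `overlap_area_ha` columns. The helper name `summarize_project_coverage` and the 99.9% tolerance used to absorb floating-point noise are hypothetical, and the toy values are made up for illustration:

```python
import pandas as pd

def summarize_project_coverage(poly_results, full_cover_pct=99.9):
    """Aggregate per-polygon coverage to an area-weighted percent per project."""
    # Sum polygon area and covered area within each project
    totals = (
        poly_results
        .groupby('project_id', as_index=False)[['poly_area_ha', 'overlap_area_ha']]
        .sum()
    )
    # Area-weighted coverage: covered hectares / total hectares
    totals['percent_img_cover'] = totals['overlap_area_ha'] / totals['poly_area_ha'] * 100
    # Flag projects that are (effectively) fully covered
    totals['fully_covered'] = totals['percent_img_cover'] >= full_cover_pct
    return totals

# Toy per-polygon results (made-up values, for illustration only)
example = pd.DataFrame({
    'project_id': ['a', 'a', 'b'],
    'poly_area_ha': [10.0, 30.0, 5.0],
    'overlap_area_ha': [10.0, 15.0, 5.0],
})
project_summary = summarize_project_coverage(example)
print(project_summary)
```

Weighting by area keeps one small, fully covered polygon from masking a large uncovered one in the same project.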

In [1]:
import pandas as pd # used
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
from shapely.geometry import shape
from shapely.geometry import Polygon, Point
from shapely import union_all
import ast
from datetime import datetime, timedelta
import re
import os
import math
import requests
import yaml
import json
import pyproj
import sys
sys.path.append('../src/')
import image_availability as img
import process_api_results as clean
import decision_trees as tree
import tm_api_utils as api_request

%load_ext autoreload
%autoreload 2

### Parameters

In [2]:
# File paths
tm_auth_path = '../secrets.yaml'
tm_staging_url = "https://api-staging.terramatch.org/research/v3/sitePolygons?"                 # use for testing queries
tm_prod_url = "https://api.terramatch.org/research/v3/sitePolygons?"                            # use to pull data for analysis
approved_projects = '../terrafund-portfolio-analyses/projects_all_approved_202501091214.csv'    # List of projects with approved polygons
feats = '../data/tm_api_TEST.csv'                                                               # Polygon metadata & geometries from TM API
maxar_feats = '/home/darby/github_repos/maxar-tools/data/tm_api_TEST.csv'                       # Polygon metadata & geometries from TM API saved to maxar-tools repo
maxar_md = '../data/imagery_availability/comb_img_availability_2025-02-26.csv'                  # Metadata for Maxar images corresponding to polygons

# Define filtering thresholds (stored in a dictionary)
filters = {
    'cloud_cover': 50,          # Keep only images with <50% cloud cover
    'off_nadir': 30,            # Keep only images with off-nadir angle <=30°
    'sun_elevation': 30,        # Keep only images where sun elevation >=30°
    'date_range': (-366, 0),    # Days relative to plantstart: image taken up to 366 days before planting
    'img_count': 1,             # Threshold for identifying image availability (REASSESS)
    'ev_range': (730,1095)      # Early verification window (2-3 years after plantstart date) (REASSESS)
}
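
To make the direction of each threshold explicit, here is a minimal pure-Python sketch of the same checks applied to a single image record. The helper `passes_filters`, the flat record keys, and the example values are hypothetical; the comparisons mirror those used in `filter_images` further down.

```python
from datetime import datetime

def passes_filters(image, plantstart, filters):
    """Return True if one image record meets every threshold in `filters`."""
    # Days between image capture and planting (negative = before planting)
    date_diff = (image['img_date'] - plantstart).days
    earliest, latest = filters['date_range']
    return (
        earliest <= date_diff <= latest
        and image['cloud_cover'] < filters['cloud_cover']
        and image['off_nadir'] <= filters['off_nadir']
        and image['sun_elevation'] >= filters['sun_elevation']
    )

example_filters = {'cloud_cover': 50, 'off_nadir': 30, 'sun_elevation': 30, 'date_range': (-366, 0)}
plantstart = datetime(2022, 1, 9)

# Illustrative records: one inside the window, one captured after planting
good = {'img_date': datetime(2021, 6, 13), 'cloud_cover': 0.0, 'off_nadir': 25.1, 'sun_elevation': 52.1}
too_late = dict(good, img_date=datetime(2022, 3, 1))

print(passes_filters(good, plantstart, example_filters))      # True
print(passes_filters(too_late, plantstart, example_filters))  # False
```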

## Code Workflow Outline (DON'T RUN!!)

In [None]:
# Step 1: LOAD AND PREPROCESS DATA
# 1.1: Load polygon dataset
poly_csv = gpd.GeoDataFrame(polygon geometries & metadata)

# 1.2 Load image dataset
img_csv = gpd.GeoDataFrame(maxar image geometries & metadata)

# 1.3 Preprocess the data
poly_gdf = preprocess_polygons(poly_csv) # Clean data, convert geometries, enforce CRS
img_gdf = preprocess_images(img_csv) # Clean data, convert geometries, enforce CRS


# Step 2: MERGE POLYGON DATA WITH IMAGE DATA
merged_gdf = img_gdf.merge(poly_gdf, on=['project_id', 'poly_id'], how='left')

# Step 3: PRE-FILTER IMAGES
filtered_images = merged_gdf where:
    (date is within allowed date range) &
    (cloud cover < cloud_thresh) &
    (off-nadir angle <= off_nadir_thresh) &
    (sun elevation >= sun_elev_thresh)

# Step 4: ITERATE THROUGH PROJECTS AND POLYGONS TO CALCULATE IMAGERY COVERAGE
# 4.1 Create a dictionary for project-polygon mapping
project_polygons = {project_id: list of poly_ids associated with that project} # Create a dictionary

# 4.2 Initialize list to store low coverage cases
low_img_coverage_log = []

# 4.3 Iterate through each project
for each project_id in project_polygons:

    # 4.4 Get all polygons for this project
    project_polygons_list = list of poly_ids for this project_id

    # 4.5 Iterate through each polygon in the project
    for each poly_id in project_polygons_list:
    
        # 4.6 Get all images associated with this polygon
        poly_images = filtered_images[filtered_images['poly_id'] == poly_id]

        # Count the number of available images
        num_images = len(poly_images)

        # If no valid image exists, record 0% coverage
        if poly_images is empty:
            store result: (poly_id, project_id, None, num_images, None, 0, 0) # No images available, so polygon area is not computed
            continue

        # 4.7 Select the best image (lowest cloud cover)
        best_image = get_best_image(poly_images)

        # 4.8 Get polygon and image geometries
        poly_geom = poly_gdf[poly_gdf['poly_id'] == poly_id].geometry.iloc[0]
        best_img_geom = best_image['img_geom']

        # 4.9 Compute UTM Zone and reproject geometries
        poly_centroid = compute centroid of poly_geom
        utm_crs = get UTM CRS from centroid
        poly_geom_reprojected = reproject poly_geom to utm_crs
        best_img_geom_reprojected = reproject best_img_geom to utm_crs

        # 4.10 Calculate the polygon area dynamically (in hectares)
        poly_area_ha = poly_geom_reprojected.area / 10000

        # 4.11 Calculate area of overlap (intersection of polygon and image footprint)
        overlap_area = area of (poly_geom_reprojected intersection best_img_geom_reprojected)
        overlap_area_ha = overlap_area / 10000

        # 4.12 Compute percent of polygon area covered
        percent_img_cover = (overlap_area_ha / poly_area_ha) * 100

        # 4.13 Log cases where imagery coverage is unexpectedly low
        if percent_img_cover < 50:
            log_entry = {
                'poly_id': poly_id,
                'project_id': project_id,
                'best_image': best_image['title'],
                'num_images': num_images,
                'poly_area_ha': poly_area_ha,
                'overlap_area_ha': overlap_area_ha,
                'percent_img_cover': percent_img_cover
            }
            low_img_coverage_log.append(log_entry)

        # 4.14 Store results
        store result: (poly_id, project_id, best_image['title'], num_images, poly_area_ha, overlap_area_ha, percent_img_cover)

# STEP 5: EXPORT LOW COVERAGE LOG IF NEEDED
if low_img_coverage_log is not empty:
    export_to_csv(low_img_coverage_log, "low_coverage_polygons.csv")

# Function Implementation

### STEP 1: LOAD & PREPROCESS DATA
Goal: ensure input data is clean & structured

In [None]:
## 1.1 LOAD IN POLYGON AND IMAGE CSVS
poly_df = pd.read_csv(feats)
img_df = pd.read_csv(maxar_md)

In [3]:
## 1.2 PREPROCESS POLYGON DATA
def preprocess_polygons(poly_df, debug=False):
    """
    Cleans up a dataframe of polygon metadata & geometries from the TerraMatch API and 
    converts it into a GeoDataFrame

    Args:
        poly_df (DataFrame): Raw polygon dataset.
        debug (bool): If True, prints the number of unique polygons and projects.

    Returns:
        GeoDataFrame: Processed polygon dataset with a geometry column as a shapely object.
    """
    # Work on a copy so the caller's dataframe is not modified
    poly_df = poly_df.copy()

    # Enforce lowercase column names
    poly_df.columns = poly_df.columns.str.lower()

    # Rename 'name' and 'geometry' columns
    poly_df = poly_df.rename(columns={'name': 'poly_name', 'geometry': 'poly_geom'})

    # Convert 'plantstart' column to a datetime
    poly_df['plantstart'] = pd.to_datetime(poly_df['plantstart'], errors='coerce')

    # Parse stringified GeoJSON-like 'poly_geom' dictionaries and convert them to Shapely objects
    poly_df['poly_geom'] = poly_df['poly_geom'].apply(lambda x: shape(ast.literal_eval(x)) if isinstance(x, str) else shape(x))

    # Convert to GeoDataFrame
    poly_gdf = gpd.GeoDataFrame(poly_df, geometry='poly_geom', crs="EPSG:4326")

    # Add a field for each polygon's own centroid (computed in EPSG:4326, so only approximate)
    poly_gdf['poly_centroid'] = poly_gdf['poly_geom'].centroid

    if debug:
        print(f"There are {len(poly_gdf.poly_id.unique())} unique polygons for {len(poly_gdf.project_id.unique())} projects in this dataset.")

    return poly_gdf

In [4]:
## 1.3 PREPROCESS MAXAR IMAGERY DATA
def preprocess_images(img_df, debug=True):
    """
    Cleans up a dataframe of Maxar image metadata & geometries from the Maxar Discovery API and 
    converts it into a GeoDataFrame

    Args:
        img_df (DataFrame): Raw image metadata dataset.
        debug (bool): If True, prints the number of images, polygons, and projects.
    
    Returns: 
        GeoDataFrame: Processed image dataset with a geometry column as a shapely object.
    """
    # Work on a copy so the caller's dataframe is not modified
    img_df = img_df.copy()

    # Convert 'datetime' column to a datetime type and remove any time zone info
    img_df['datetime'] = pd.to_datetime(img_df['datetime'], format='%Y-%m-%dT%H:%M:%S.%fZ', errors='coerce')
    img_df['datetime'] = img_df['datetime'].apply(lambda x: x.replace(tzinfo=None) if pd.notna(x) else x)
    
    # Rename the 'datetime' column to 'img_date'
    img_df = img_df.rename(columns={'datetime': 'img_date'})

    # Select the relevant columns from img_df (copy to avoid SettingWithCopyWarning)
    img_df = img_df[['title', 'project_id', 'poly_id', 'img_date', 'area:cloud_cover_percentage', 'eo:cloud_cover', 'area:avg_off_nadir_angle', 'view:sun_elevation', 'img_geom']].copy()

    # Parse stringified GeoJSON-like 'img_geom' dictionaries (image footprints) and convert them to Shapely objects
    img_df['img_geom'] = img_df['img_geom'].apply(lambda x: shape(ast.literal_eval(x)) if isinstance(x, str) else shape(x))

    # Convert DataFrame to GeoDataFrame
    img_gdf = gpd.GeoDataFrame(img_df, geometry='img_geom', crs="EPSG:4326")

    # Add a field for each image footprint's own centroid (computed in EPSG:4326, so only approximate)
    img_gdf['img_centroid'] = img_gdf['img_geom'].centroid

    if debug:
        print(f"There are {len(img_gdf)} images for {len(img_gdf.poly_id.unique())} polygons in {len(img_gdf.project_id.unique())} projects in this dataset.")

    return img_gdf

### STEP 2: MERGE & FILTER DATA
Goal: link images to polygons and apply filters

In [10]:
## 2.1 MERGE THE POLYGON ATTRIBUTES TO THE IMAGES GEODATAFRAME
def merge_polygons_images(img_gdf, poly_gdf, debug=True):
    """ 
    Merges the polygon metadata into the Maxar image GeoDataFrame. All rows of the img_gdf are preserved.
    Also records polygons that are dropped because they don't have any associated images.

    Args:
        img_gdf (GeoDataFrame): Image metadata dataset (each row represents a Maxar image)
        poly_gdf (GeoDataFrame): Polygon dataset (each row represents a polygon from the TM API)
    
    Returns:
        tuple: (GeoDataFrame of merged dataset, list of missing polygons (poly_id, project_id))
    """
    # Merge the image data with the polygon data (preserving image data rows and adding associated polygon attributes)
    merged_gdf = img_gdf.merge(poly_gdf, on=['project_id', 'poly_id'], how='left')

    # Identify polygons without any corresponding Maxar images
    missing_polygons_df = poly_gdf[~poly_gdf['poly_id'].isin(merged_gdf['poly_id'])]

    # Save poly_id and project_id of missing polygons as a list of tuples
    missing_polygons_list = list(missing_polygons_df[['poly_id', 'project_id']].itertuples(index=False, name=None))

    if debug:
        print(f"Total images in img_gdf: {len(img_gdf)}")
        print(f"Total polygons in poly_gdf: {len(poly_gdf)}")
        print(f"Total rows in merged dataset: {len(merged_gdf)}")
        print(f"Unique polygons in merged dataset: {len(merged_gdf['poly_id'].unique())}")
    
        # Count polygons dropped due to no matching images
        missing_polygons = len(missing_polygons_list)
        print(f"There are {missing_polygons} polygons without images in the merged dataset")
        print(f"Polygons without images (dropped at this stage): {missing_polygons_list}")

    return merged_gdf, missing_polygons_list

In [None]:
## 2.2 FILTER IMAGES BASED ON HARD CRITERIA
def filter_images(merged_gdf, filters, debug=True):
    """
    Filters the merged dataset to retain only images that meet filters for image quality.
    The values for the filters can be changed in the parameters section.

    Args:
        merged_gdf (GeoDataFrame): Merged dataset of images and polygons.
        filters (dict): Dictionary containing filter thresholds (in Parameters section of notebook)
    
    Returns:
        GeoDataFrame: Filtered dataset containing only the images that meet the criteria
    """
    # Ensure date columns are in correct datetime format
    merged_gdf['img_date'] = pd.to_datetime(merged_gdf['img_date'], errors='coerce')
    merged_gdf['plantstart'] = pd.to_datetime(merged_gdf['plantstart'], errors='coerce')

    # Compute the date difference (image capture date - plant start date)
    merged_gdf['date_diff'] = (merged_gdf['img_date'] - merged_gdf['plantstart']).dt.days

    # Apply filtering criteria to retain only images within the desired time range, cloud cover, 
    # off nadir angle, and sun elevation parameters
    filtered_images = merged_gdf[
        (merged_gdf['date_diff'] >= filters['date_range'][0]) &
        (merged_gdf['date_diff'] <= filters['date_range'][1]) &
        (merged_gdf['area:cloud_cover_percentage'] < filters['cloud_cover']) &
        (merged_gdf['area:avg_off_nadir_angle'] <= filters['off_nadir']) &
        (merged_gdf['view:sun_elevation'] >= filters['sun_elevation'])
    ].copy()  # Copy to avoid SettingWithCopyWarning

    if debug:
        print(f"Total images before filtering: {len(merged_gdf)}")
        print(f"Total images after filtering: {len(filtered_images)}")
        print(f"Polygons with at least one valid image: {filtered_images['poly_id'].nunique()}")
    
    return filtered_images

### STEP 3: PROCESS EACH POLYGON
Goal: Prepare polygons & select best image

In [39]:
## 3.1 GET THE BEST IMAGE FOR A GIVEN POLYGON
def get_best_image(poly_images, debug=True):
    """
    Selects the best image for a given polygon based on the lowest cloud cover. If multiple images have
    the same cloud cover, selects the one closest to the plantstart date. 

    If we want to update this to include an "expected coverage" based on cloud cover and footprint overlap,
    this is where it would go.

    Args:
        poly_images (GeoDataFrame): Subset of img_gdf_filtered containing images for one polygon.
    
    Returns:
        Series: The best image row from poly_images (raises IndexError if poly_images is empty)
    """
    # Create an absolute value date_diff column to help sort images by proximity to plantstart date
    poly_images = poly_images.copy() # Avoid modifying the original dataframe
    poly_images['abs_date_diff'] = poly_images['date_diff'].abs()

    # Sort images by cloud cover (ascending) and then by date (closest to plantstart)
    sorted_images = poly_images.sort_values(by=['area:cloud_cover_percentage', 'abs_date_diff'])

    if debug:
        print("\n Debug: Sorted images for this polygon (using cloud cover, then proximity to plantstart date):")
        print(sorted_images[['title', 'area:cloud_cover_percentage', 'img_date', 'plantstart', 'abs_date_diff']])

    # Select the best image (first row after sorting)
    best_image = sorted_images.iloc[0]

    return best_image

### STEP 4: COMPUTE COVERAGE
Goal: calculate imagery coverage per polygon
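
This step has no implementation in the notebook yet. Following steps 4.8 to 4.12 of the outline, a minimal sketch could look like the code below. The helper names `utm_epsg_for` and `compute_coverage` are hypothetical, both geometries are assumed to be in EPSG:4326, and the UTM zone is derived from the polygon centroid with the standard 6° zone arithmetic (the special zones around Norway and Svalbard are ignored). The toy geometries are made up for illustration.

```python
from pyproj import Transformer
from shapely.geometry import box
from shapely.ops import transform

def utm_epsg_for(lon, lat):
    """EPSG code of the WGS 84 UTM zone containing a lon/lat point."""
    zone = min(int((lon + 180) // 6) + 1, 60)
    # 326xx = northern hemisphere, 327xx = southern hemisphere
    return (32600 if lat >= 0 else 32700) + zone

def compute_coverage(poly_geom, img_geom):
    """Return (poly_area_ha, overlap_area_ha, percent_img_cover) for EPSG:4326 inputs."""
    centroid = poly_geom.centroid
    utm_crs = f"EPSG:{utm_epsg_for(centroid.x, centroid.y)}"

    # Reproject both geometries to the same UTM zone so areas are in square meters
    to_utm = Transformer.from_crs("EPSG:4326", utm_crs, always_xy=True).transform
    poly_utm = transform(to_utm, poly_geom)
    img_utm = transform(to_utm, img_geom)

    # Convert square meters to hectares
    poly_area_ha = poly_utm.area / 10_000
    overlap_area_ha = poly_utm.intersection(img_utm).area / 10_000

    percent_img_cover = overlap_area_ha / poly_area_ha * 100 if poly_area_ha > 0 else 0.0
    return poly_area_ha, overlap_area_ha, percent_img_cover

# Toy geometries: the image footprint covers the western half of the polygon
polygon = box(36.00, -1.00, 36.02, -0.99)
footprint = box(35.90, -1.10, 36.01, -0.90)
area_ha, overlap_ha, pct = compute_coverage(polygon, footprint)
print(f"{area_ha:.1f} ha polygon, {overlap_ha:.1f} ha covered, {pct:.1f}% coverage")
```

The overlap is an intersection rather than a union, so the percentage can never exceed 100.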

### STEP 5: EXPORT RESULTS
Goal: save results for review
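
This step is also not implemented yet. A minimal sketch, assuming the per-polygon results and the low-coverage log are lists of dictionaries as in the outline. The helper `export_results` and the results file name `polygon_img_coverage.csv` are hypothetical; `low_coverage_polygons.csv` is the name used in the outline.

```python
import os
import pandas as pd

def export_results(results, low_img_coverage_log, out_dir):
    """Write per-polygon results, plus the low-coverage log if it has entries."""
    os.makedirs(out_dir, exist_ok=True)
    written = []

    # Always export the full per-polygon results table
    results_path = os.path.join(out_dir, 'polygon_img_coverage.csv')  # hypothetical file name
    pd.DataFrame(results).to_csv(results_path, index=False)
    written.append(results_path)

    # Only export the low-coverage log when there is something to review
    if low_img_coverage_log:
        log_path = os.path.join(out_dir, 'low_coverage_polygons.csv')
        pd.DataFrame(low_img_coverage_log).to_csv(log_path, index=False)
        written.append(log_path)

    return written
```

Returning the list of written paths makes it easy to confirm in the notebook which files were produced on a given run.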

# TESTING

In [5]:
## 1.1 LOAD IN POLYGON AND IMAGE CSVS
poly_df = pd.read_csv(feats)
img_df = pd.read_csv(maxar_md)

In [7]:
## 1.2 PREPROCESS POLYGON DATA
poly_gdf = preprocess_polygons(poly_df, debug=True)

There are 16 unique polygons for 3 projects in this dataset.


In [8]:
## 1.3 PREPROCESS MAXAR IMAGERY DATA
img_gdf = preprocess_images(img_df, debug=True)

There are 229 images for 16 polygons in 3 projects in this dataset.


In [11]:
## 2.1 MERGE THE POLYGON ATTRIBUTES TO THE IMAGES GEODATAFRAME
merged_gdf, missing_polygons = merge_polygons_images(img_gdf, poly_gdf, debug=True)

Total images in img_gdf: 229
Total polygons in poly_gdf: 16
Total rows in merged dataset: 229
Unique polygons in merged dataset: 16
There 0 polygons without images in the merged dataset
Polygons without images (dropped at this stage): []


In [13]:
## 2.2 FILTER IMAGES BASED ON HARD CRITERIA
img_gdf_filtered = filter_images(merged_gdf, filters, debug=True)

Total images before filtering: 229
Total images after filtering: 30
Polygons with at least one valid image: 15


In [None]:
# Manually get images for a polygon
example_poly_images = img_gdf_filtered[img_gdf_filtered['poly_id'] == '410696dc-9579-4412-9c7b-55194cb1867c']
best_image = get_best_image(example_poly_images, debug=True)

Index(['title', 'project_id', 'poly_id', 'img_date',
       'area:cloud_cover_percentage', 'eo:cloud_cover',
       'area:avg_off_nadir_angle', 'view:sun_elevation', 'img_geom',
       'img_centroid', 'poly_name', 'status', 'siteid', 'poly_geom',
       'plantstart', 'plantend', 'practice', 'targetsys', 'distr', 'numtrees',
       'calcarea', 'indicators', 'establishmenttreespecies',
       'reportingperiods', 'poly_centroid', 'date_diff'],
      dtype='object')

 Debug: Sorted images for this polygon (using cloud cover, then proximity to plantstart date):
                                title  area:cloud_cover_percentage  \
13  Maxar WV03 Image 1040010068D32900                          0.0   
16  Maxar WV02 Image 10300100B3A8FB00                          0.0   

                     img_date plantstart  abs_date_diff  
13 2021-06-13 07:40:21.255153 2022-01-09            210  
16 2021-01-17 07:35:47.871523 2022-01-09            357  


In [43]:
best_image[['title', 'project_id', 'poly_id', 'img_date', 'plantstart', 'area:cloud_cover_percentage',
            'area:avg_off_nadir_angle', 'view:sun_elevation']]

title                             Maxar WV03 Image 1040010068D32900
project_id                     3a860077-df4c-4e95-8fec-41520c551243
poly_id                        410696dc-9579-4412-9c7b-55194cb1867c
img_date                                 2021-06-13 07:40:21.255153
plantstart                                      2022-01-09 00:00:00
area:cloud_cover_percentage                                     0.0
area:avg_off_nadir_angle                                   25.09837
view:sun_elevation                                        52.087266
Name: 13, dtype: object