weecology/deepforest-pretrain

DeepLidar Pretrain

This repository contains code for generating pre-training data for aerial tree detection models. Specifically, it is used to create the training data for DeepForest's base models, which are then refined using hand-labelled imagery. The imagery and annotations are created from NEON data, with initial boxes derived from LIDAR canopy height maps. The routines for extracting crowns from LIDAR are not currently in this repo; you can find them in weecology/deeplidar. We plan to migrate them here and to add more up-to-date extraction algorithms.

The workflow takes lists of large orthomosaics and annotations and generates tiled training data for ML models. Due to the sheer quantity of data (hundreds of gigabytes), what seems like a relatively simple task involves quite a few optimization steps. For example, annotations are stored in SQLite databases instead of standard annotation formats for easier per-image querying, and we provide example SLURM files that can be used to distribute processing over NEON sites and years. The entire pipeline can run in a couple of hours on a cluster instead of days.

Note: we are updating the repo's documentation to make it easier to use, and the dataset will be made accessible in the near future.

Other note: as an experiment, some code in this repository was generated by LLMs/agentic models with human oversight. Standards are enforced by pre-commit hooks and other rules that aim to make the model outputs more usable. The AGENTS files document these rules.

Setup

This project uses uv for dependency management:

uv sync

We assume you have an annotation CSV (pretraining_annotations.csv); the repository also makes some assumptions about how the raw data are laid out on disk.

Environment Configuration: Create a .env file in the project root with:

TIF_ROOT_DIR=/path/to/your/tif/images

This is used by all scripts.
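For illustration, here is a minimal stdlib-only sketch of how a script could pick up `TIF_ROOT_DIR` from the `.env` file; the actual scripts may use a different loader, and `load_env` is a hypothetical helper:

```python
import os
from pathlib import Path

def load_env(env_path: str = ".env") -> None:
    """Minimal .env loader: reads KEY=VALUE lines into os.environ.

    Skips blank lines and comments; does not override variables
    that are already set in the environment.
    """
    for line in Path(env_path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Usage (after load_env()): tif_root = Path(os.environ["TIF_ROOT_DIR"])
```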

Complete Pipeline

Sequential Pipeline (Recommended)

Run the complete end-to-end pipeline:

./scripts/run_pipeline.sh pretraining_annotations.csv ./output

This executes all 8 steps automatically:

  1. Database Creation - Convert CSV to optimized SQLite databases
  2. Sites List Generation - Extract available NEON site codes
  3. Path Validation - Verify image files exist on filesystem
  4. Preview Generation - Create overview images for quality control
  5. Confidence Computation - Compute AI quality scores (optional but recommended)
  6. Image Tiling - Generate training tiles with minimum cover algorithm
  7. Site Aggregation - Combine annotations by site and year
  8. Final Aggregation - Create master all_annotations.csv

SLURM Cluster Pipeline

All-in-one:

./scripts/submit_pipeline.sh pretraining_annotations.csv ./output

Individual CLI Tools

The pipeline is split into individual tools for convenience and for easier debugging:

Database Creation

Convert CSV annotations to optimized SQLite databases:

uv run deeplidar-create-db pretraining_annotations.csv --output ./output

This tool creates site-specific SQLite databases from pretraining_annotations.csv with some appropriate indexes for faster lookup.

Database Schema

CREATE TABLE annotations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    image_path TEXT NOT NULL,
    xmin INTEGER NOT NULL,
    ymin INTEGER NOT NULL,
    xmax INTEGER NOT NULL,
    ymax INTEGER NOT NULL,
    label TEXT NOT NULL,
    FOREIGN KEY (image_path) REFERENCES image_urls(base_filename)
);

CREATE INDEX idx_image_path ON annotations(image_path);
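The `idx_image_path` index is what makes per-image lookups cheap compared with scanning a CSV. As a sketch, querying all boxes for one image with the stdlib `sqlite3` module (table and column names are from the schema above; the database path is an assumption):

```python
import sqlite3

def boxes_for_image(db_path: str, image_path: str):
    """Fetch all annotation boxes for a single image.

    The WHERE clause on image_path is served by idx_image_path,
    so the lookup avoids a full table scan.
    """
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            "SELECT xmin, ymin, xmax, ymax, label "
            "FROM annotations WHERE image_path = ?",
            (image_path,),
        ).fetchall()
    finally:
        con.close()
```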

Sites List Generation

Extract available NEON site codes from databases:

uv run deeplidar-generate-sites-list --output ./output

Creates sites_to_process.txt with discovered site codes.

Path Validation

Verify image files exist on filesystem:

uv run deeplidar-validate-paths --output ./output [--site SITE] [--continue-on-missing]

Options:

  • --site: Validate specific site only
  • --continue-on-missing: Proceed even if some images are missing

Preview Generation

Generate overview images for quality control. This is a very useful command; we've found several misaligned LIDAR/RGB pairs this way. You usually don't need to export full-size previews; 2048 × 2048 is enough to spot big problems.

uv run deeplidar-generate-previews --output ./output [--image IMAGE] [--site SITE] [--max-size 2048] [--quality 85] [--full-size]

Options:

  • --image: Process specific image only
  • --site: Process specific site only
  • --max-size: Maximum image dimension (default: 2048)
  • --quality: JPEG quality (default: 85)
  • --full-size: Generate full-size previews
  • --confidence-threshold: Filter annotations by confidence

Confidence Computation (Optional, but strongly recommended)

Compute canopy confidence using Restor's OAM-TCD model. This can then be used to automatically spot inconsistencies with labels. We also use it to remove buildings and other non-tree false-positive labels that come from the CHM. Each box in the database is assigned the median score of the pixels it covers.

uv run deeplidar-compute-confidence --output ./output [--site SITE] [--model-name MODEL]

Options:

  • --site: Process specific site only
  • --model-name: Model to use (default: "restor/tcd-segformer-mit-b5")
  • --tile-size: Tile size for processing (default: 1024)
  • --batch-size: Processing batch size (default: 4)
  • --empty-threshold: Empty tile threshold percentage (default: 10.0)
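The per-box score described above is a simple reduction; a sketch of it, assuming a per-pixel canopy score array is already in hand (the score map itself would come from the OAM-TCD segmentation model, which is not shown here):

```python
import numpy as np

def box_confidence(score_map: np.ndarray,
                   xmin: int, ymin: int, xmax: int, ymax: int) -> float:
    """Median per-pixel canopy score inside a bounding box.

    score_map: 2D array of per-pixel scores in [0, 1], indexed [row, col].
    Returns 0.0 for degenerate (empty) boxes.
    """
    patch = score_map[ymin:ymax, xmin:xmax]
    if patch.size == 0:
        return 0.0
    return float(np.median(patch))
```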

Image Processing Pipeline

1. Image Tiling

Process images with minimum tile cover algorithm using quality-filtered images from the database:

uv run deeplidar-tile-images --output <output_folder> [options]

Prerequisites: Quality assessment must be performed first using:

uv run deeplidar-compute-confidence --output <output_folder>
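The "minimum tile cover" idea can be illustrated in one dimension: place the fewest fixed-size tiles so that every annotated box is fully inside some tile. This greedy sketch is a simplification for intuition, not the project's actual implementation (which lives behind `deeplidar-tile-images` and is exercised by test_tile_cover.py):

```python
def greedy_interval_cover(intervals, tile_size=1024):
    """Place the fewest fixed-size tiles covering every (start, end) interval.

    Greedy: sort by start; anchor a new tile at the first uncovered
    interval's start, then skip every interval that fits inside it.
    Assumes no interval is longer than tile_size.
    """
    tiles = []
    for start, end in sorted(intervals):
        if tiles and end <= tiles[-1] + tile_size:
            continue  # already covered by the most recent tile
        tiles.append(start)
    return tiles
```

Because each box constrains the tile origin to a range, and the ranges are processed by their right endpoints, this greedy choice is optimal in 1D; a 2D tiler has to make the analogous trade-off along both axes.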

Processing Examples:

Basic Tiling (Default):

uv run deeplidar-tile-images --output ./output --only-image "2018_CUPE_1_711000_2002000_image.tif"

Site Processing with Quality Filtering:

uv run deeplidar-tile-images --output ./output --site CUPE --max-unlabeled-cover-ratio 0.1

Confidence Statistics Mode:

uv run deeplidar-tile-images --output ./output --confidence-stats --site CUPE

Key Options:

  • --only-image: Process specific image(s) by filename
  • --site: Process all images from a NEON site (e.g., CUPE, SRER)
  • --max-unlabeled-cover-ratio: Maximum unlabeled object coverage (e.g., 0.1 for 10%)
  • --max-bbox-mismatch-ratio: Maximum bbox mismatch ratio (e.g., 0.2 for 20%)
  • --min-confidence-threshold: Minimum mean bbox confidence required
  • --confidence-threshold: Filter individual annotations by confidence
  • --confidence-stats: Include confidence statistics in output CSV
  • --previews: Generate tile preview images
  • --tile-size: Square tile size (default: 1024)

Quality Filtering: Images are automatically filtered based on pre-computed quality metrics from the confidence processing step. Only images meeting quality requirements are processed.
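The quality gate can be pictured as a predicate over the pre-computed per-image metrics. Field names and the default confidence threshold below are illustrative assumptions, not the tool's actual schema; only the two ratio defaults mirror the example values given above:

```python
from dataclasses import dataclass

@dataclass
class ImageQuality:
    # Illustrative per-image metrics; field names are assumptions.
    unlabeled_cover_ratio: float   # fraction of canopy pixels with no box
    bbox_mismatch_ratio: float     # fraction of boxes disagreeing with the model
    mean_bbox_confidence: float    # mean of the per-box median scores

def passes_quality(q: ImageQuality,
                   max_unlabeled_cover_ratio: float = 0.1,
                   max_bbox_mismatch_ratio: float = 0.2,
                   min_confidence_threshold: float = 0.5) -> bool:
    """An image is tiled only if every threshold check passes."""
    return (q.unlabeled_cover_ratio <= max_unlabeled_cover_ratio
            and q.bbox_mismatch_ratio <= max_bbox_mismatch_ratio
            and q.mean_bbox_confidence >= min_confidence_threshold)
```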

Output Structure:

output/SITE/YEAR/image_basename/
├── tiles/                    # Cropped tile images
├── previews/                 # Tiles with bounding boxes overlaid (--previews)
└── annotations.csv           # DeepForest format with tile-relative coordinates

2. Individual Crop Generation (diagnostic, optional)

Generate individual bounding box crops from processed tiles. This produces a large number of files, but you might find it useful (for example, if you were interested in clustering labels).

uv run deeplidar-generate-crops <input_path> [--output-dir OUTPUT_DIR]

Examples:

# Process all sites recursively
uv run deeplidar-generate-crops ./output

# Process specific site
uv run deeplidar-generate-crops ./output/CUPE

# Process specific image directory
uv run deeplidar-generate-crops ./output/CUPE/2018/2018_CUPE_1_711000_2002000_image

  • Recursively finds directories with an annotations.csv + tiles/ folder
  • Generates individual crops for each bounding box annotation
  • Smart filename format: source_class_confidence_coords for optimal sorting

Crop Filename Format:

# With confidence statistics
2018_CUPE_1_711000_2002000_image_Tree_0.95_126_248_186_307.jpg

# Without confidence
2018_CUPE_1_711000_2002000_image_Tree_126_248_186_307.jpg
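Because the filenames sort by source, then class, then confidence, they can also be split back apart. A sketch of parsing the two layouts shown above (`parse_crop_name` is a hypothetical helper, not part of the CLI; it assumes a four-part coordinate suffix optionally preceded by a decimal confidence):

```python
def parse_crop_name(filename: str):
    """Split a crop filename into (source, label, confidence, coords).

    Works backwards from the extension: the last four underscore-separated
    fields are xmin_ymin_xmax_ymax, optionally preceded by a confidence
    score (the only field containing a decimal point).
    """
    stem = filename.rsplit(".", 1)[0]      # drop the .jpg extension
    parts = stem.split("_")
    coords = tuple(int(p) for p in parts[-4:])
    rest = parts[:-4]
    confidence = None
    if "." in rest[-1]:                    # confidence fields contain a dot
        confidence = float(rest.pop())
    label = rest.pop()
    source = "_".join(rest)
    return source, label, confidence, coords
```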

Testing

This project uses pytest for testing. To run tests:

# Run all tests
uv run pytest

# Run tests with verbose output
uv run pytest -v

# Run a specific test file
uv run pytest test_tile_cover.py

# Run tile image tests
uv run pytest test_tile_image.py

# Run a specific test class or function
uv run pytest test_tile_cover.py::TestMinimumTileCover::test_single_box

About

Tools for generating LIDAR-derived pretraining datasets for tree prediction models.
