weecology/deepforest-pretrain

DeepLidar Pretrain

This repository contains code for generating pre-training data for aerial tree detection models. Specifically, it is used to create the training data for DeepForest's base models, which are then refined using hand-labelled imagery. The imagery and annotations are created from NEON data, with initial boxes derived from LIDAR canopy height maps. The routines for extracting crowns from LIDAR are not currently in this repo; you can find them in weecology/deeplidar. We plan to migrate them here and to add more up-to-date extraction algorithms.

The workflow takes lists of large orthomosaics and annotations and generates tiled training data for ML models. Due to the sheer quantity of data (hundreds of gigabytes), what seems like a relatively simple task involves quite a few optimization steps. For example, annotations are stored in SQLite databases instead of standard annotation formats for easier per-image querying, and we provide example SLURM files that can be used to distribute processing over NEON sites and years. The entire pipeline can run in a couple of hours on a cluster instead of days.

Note: we are updating the repo's documentation to make it easier to use, and the dataset will be made accessible in the near future.

Other note: as an experiment, some code in this repository was generated by LLMs/agentic models with human oversight. Standards are enforced by pre-commit hooks and other rules that aim to make the model outputs more usable. The AGENTS files document these rules.

Setup

This project uses uv for dependency management:

uv sync

We assume you have an annotation CSV (pretraining_annotations.csv); the repository also makes some assumptions about how the raw data are laid out on disk.

Environment Configuration: Create a .env file in the project root with:

TIF_ROOT_DIR=/path/to/your/tif/images

This is used by all scripts.
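For illustration, here is a minimal stdlib-only sketch of how a script could pick up `TIF_ROOT_DIR` from the `.env` file; the actual scripts may use a different loader, and `load_env` is a hypothetical helper:

```python
import os
from pathlib import Path

def load_env(env_path: str = ".env") -> None:
    """Minimal .env loader: reads KEY=VALUE lines into os.environ.

    Skips blank lines and comments; does not override variables
    that are already set in the environment.
    """
    for line in Path(env_path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Usage (after load_env()): tif_root = Path(os.environ["TIF_ROOT_DIR"])
```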

Complete Pipeline

Sequential Pipeline (Recommended)

Run the complete end-to-end pipeline:

./scripts/run_pipeline.sh pretraining_annotations.csv ./output

This executes all 8 steps automatically:

  1. Database Creation - Convert CSV to optimized SQLite databases
  2. Sites List Generation - Extract available NEON site codes
  3. Path Validation - Verify image files exist on filesystem
  4. Preview Generation - Create overview images for quality control
  5. Confidence Computation - Compute AI quality scores (optional but recommended)
  6. Image Tiling - Generate training tiles with minimum cover algorithm
  7. Site Aggregation - Combine annotations by site and year
  8. Final Aggregation - Create master all_annotations.csv

SLURM Cluster Pipeline

All-in-one:

./scripts/submit_pipeline.sh pretraining_annotations.csv ./output

Individual CLI Tools

The pipeline is split into individual tools for convenience and for easier debugging:

Database Creation

Convert CSV annotations to optimized SQLite databases:

uv run deeplidar-create-db pretraining_annotations.csv --output ./output

This tool creates site-specific SQLite databases from pretraining_annotations.csv with some appropriate indexes for faster lookup.

Database Schema

CREATE TABLE annotations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    image_path TEXT NOT NULL,
    xmin INTEGER NOT NULL,
    ymin INTEGER NOT NULL,
    xmax INTEGER NOT NULL,
    ymax INTEGER NOT NULL,
    label TEXT NOT NULL,
    FOREIGN KEY (image_path) REFERENCES image_urls(base_filename)
);

CREATE INDEX idx_image_path ON annotations(image_path);
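The `idx_image_path` index is what makes per-image lookups cheap compared with scanning a CSV. As a sketch, querying all boxes for one image with the stdlib `sqlite3` module (table and column names are from the schema above; the database path is an assumption):

```python
import sqlite3

def boxes_for_image(db_path: str, image_path: str):
    """Fetch all annotation boxes for a single image.

    The WHERE clause on image_path is served by idx_image_path,
    so the lookup avoids a full table scan.
    """
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            "SELECT xmin, ymin, xmax, ymax, label "
            "FROM annotations WHERE image_path = ?",
            (image_path,),
        ).fetchall()
    finally:
        con.close()
```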

Sites List Generation

Extract available NEON site codes from databases:

uv run deeplidar-generate-sites-list --output ./output

Creates sites_to_process.txt with discovered site codes.

Path Validation

Verify image files exist on filesystem:

uv run deeplidar-validate-paths --output ./output [--site SITE] [--continue-on-missing]

Options:

  • --site: Validate specific site only
  • --continue-on-missing: Proceed even if some images are missing

Preview Generation

Generate overview images for quality control. This is a very useful command; we've found several misaligned LIDAR/RGB pairs this way. You usually don't need to export full-size previews; 2048 × 2048 is enough to spot big problems.

uv run deeplidar-generate-previews --output ./output [--image IMAGE] [--site SITE] [--max-size 2048] [--quality 85] [--full-size]

Options:

  • --image: Process specific image only
  • --site: Process specific site only
  • --max-size: Maximum image dimension (default: 2048)
  • --quality: JPEG quality (default: 85)
  • --full-size: Generate full-size previews
  • --confidence-threshold: Filter annotations by confidence

Confidence Computation (Optional, but strongly recommended)

Compute canopy confidence using Restor's OAM-TCD model. This can then be used to automatically spot inconsistencies with labels. We also use it to remove buildings and other non-tree false-positive labels that come from the CHM. Each box in the database is assigned the median score of the pixels it covers.

uv run deeplidar-compute-confidence --output ./output [--site SITE] [--model-name MODEL]

Options:

  • --site: Process specific site only
  • --model-name: Model to use (default: "restor/tcd-segformer-mit-b5")
  • --tile-size: Tile size for processing (default: 1024)
  • --batch-size: Processing batch size (default: 4)
  • --empty-threshold: Empty tile threshold percentage (default: 10.0)
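The per-box score described above is a simple reduction; a sketch of it, assuming a per-pixel canopy score array is already in hand (the score map itself would come from the OAM-TCD segmentation model, which is not shown here):

```python
import numpy as np

def box_confidence(score_map: np.ndarray,
                   xmin: int, ymin: int, xmax: int, ymax: int) -> float:
    """Median per-pixel canopy score inside a bounding box.

    score_map: 2D array of per-pixel scores in [0, 1], indexed [row, col].
    Returns 0.0 for degenerate (empty) boxes.
    """
    patch = score_map[ymin:ymax, xmin:xmax]
    if patch.size == 0:
        return 0.0
    return float(np.median(patch))
```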

Image Processing Pipeline

1. Image Tiling

Process images with minimum tile cover algorithm using quality-filtered images from the database:

uv run deeplidar-tile-images --output <output_folder> [options]

Prerequisites: Quality assessment must be performed first using:

uv run deeplidar-compute-confidence --output <output_folder>
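The "minimum tile cover" idea can be illustrated in one dimension: place the fewest fixed-size tiles so that every annotated box is fully inside some tile. This greedy sketch is a simplification for intuition, not the project's actual implementation (which lives behind `deeplidar-tile-images` and is exercised by test_tile_cover.py):

```python
def greedy_interval_cover(intervals, tile_size=1024):
    """Place the fewest fixed-size tiles covering every (start, end) interval.

    Greedy: sort by start; anchor a new tile at the first uncovered
    interval's start, then skip every interval that fits inside it.
    Assumes no interval is longer than tile_size.
    """
    tiles = []
    for start, end in sorted(intervals):
        if tiles and end <= tiles[-1] + tile_size:
            continue  # already covered by the most recent tile
        tiles.append(start)
    return tiles
```

Because each box constrains the tile origin to a range, and the ranges are processed by their right endpoints, this greedy choice is optimal in 1D; a 2D tiler has to make the analogous trade-off along both axes.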

Processing Examples:

Basic Tiling (Default):

uv run deeplidar-tile-images --output ./output --only-image "2018_CUPE_1_711000_2002000_image.tif"

Site Processing with Quality Filtering:

uv run deeplidar-tile-images --output ./output --site CUPE --max-unlabeled-cover-ratio 0.1

Confidence Statistics Mode:

uv run deeplidar-tile-images --output ./output --confidence-stats --site CUPE

Key Options:

  • --only-image: Process specific image(s) by filename
  • --site: Process all images from a NEON site (e.g., CUPE, SRER)
  • --max-unlabeled-cover-ratio: Maximum unlabeled object coverage (e.g., 0.1 for 10%)
  • --max-bbox-mismatch-ratio: Maximum bbox mismatch ratio (e.g., 0.2 for 20%)
  • --min-confidence-threshold: Minimum mean bbox confidence required
  • --confidence-threshold: Filter individual annotations by confidence
  • --confidence-stats: Include confidence statistics in output CSV
  • --previews: Generate tile preview images
  • --tile-size: Square tile size (default: 1024)

Quality Filtering: Images are automatically filtered based on pre-computed quality metrics from the confidence processing step. Only images meeting quality requirements are processed.
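The quality gate can be pictured as a predicate over the pre-computed per-image metrics. Field names and the default confidence threshold below are illustrative assumptions, not the tool's actual schema; only the two ratio defaults mirror the example values given above:

```python
from dataclasses import dataclass

@dataclass
class ImageQuality:
    # Illustrative per-image metrics; field names are assumptions.
    unlabeled_cover_ratio: float   # fraction of canopy pixels with no box
    bbox_mismatch_ratio: float     # fraction of boxes disagreeing with the model
    mean_bbox_confidence: float    # mean of the per-box median scores

def passes_quality(q: ImageQuality,
                   max_unlabeled_cover_ratio: float = 0.1,
                   max_bbox_mismatch_ratio: float = 0.2,
                   min_confidence_threshold: float = 0.5) -> bool:
    """An image is tiled only if every threshold check passes."""
    return (q.unlabeled_cover_ratio <= max_unlabeled_cover_ratio
            and q.bbox_mismatch_ratio <= max_bbox_mismatch_ratio
            and q.mean_bbox_confidence >= min_confidence_threshold)
```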

Output Structure:

output/SITE/YEAR/image_basename/
├── tiles/                    # Cropped tile images
├── previews/                 # Tiles with bounding boxes overlaid (--previews)
└── annotations.csv           # DeepForest format with tile-relative coordinates

2. Individual Crop Generation (diagnostic, optional)

Generate individual bounding box crops from processed tiles. This produces a large number of files, but you might find it useful (for example, if you were interested in clustering labels).

uv run deeplidar-generate-crops <input_path> [--output-dir OUTPUT_DIR]

Examples:

# Process all sites recursively
uv run deeplidar-generate-crops ./output

# Process specific site
uv run deeplidar-generate-crops ./output/CUPE

# Process specific image directory
uv run deeplidar-generate-crops ./output/CUPE/2018/2018_CUPE_1_711000_2002000_image

  • Recursively finds directories with an annotations.csv + tiles/ folder
  • Generates individual crops for each bounding box annotation
  • Smart filename format: source_class_confidence_coords for optimal sorting

Crop Filename Format:

# With confidence statistics
2018_CUPE_1_711000_2002000_image_Tree_0.95_126_248_186_307.jpg

# Without confidence
2018_CUPE_1_711000_2002000_image_Tree_126_248_186_307.jpg
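Because the filenames sort by source, then class, then confidence, they can also be split back apart. A sketch of parsing the two layouts shown above (`parse_crop_name` is a hypothetical helper, not part of the CLI; it assumes a four-part coordinate suffix optionally preceded by a decimal confidence):

```python
def parse_crop_name(filename: str):
    """Split a crop filename into (source, label, confidence, coords).

    Works backwards from the extension: the last four underscore-separated
    fields are xmin_ymin_xmax_ymax, optionally preceded by a confidence
    score (the only field containing a decimal point).
    """
    stem = filename.rsplit(".", 1)[0]      # drop the .jpg extension
    parts = stem.split("_")
    coords = tuple(int(p) for p in parts[-4:])
    rest = parts[:-4]
    confidence = None
    if "." in rest[-1]:                    # confidence fields contain a dot
        confidence = float(rest.pop())
    label = rest.pop()
    source = "_".join(rest)
    return source, label, confidence, coords
```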

Testing

This project uses pytest for testing. To run tests:

# Run all tests
uv run pytest

# Run tests with verbose output
uv run pytest -v

# Run a specific test file
uv run pytest test_tile_cover.py

# Run tile image tests
uv run pytest test_tile_image.py

# Run a specific test class or function
uv run pytest test_tile_cover.py::TestMinimumTileCover::test_single_box

About

Tools for generating LIDAR-derived pretraining datasets for tree prediction models.
