## Lidar Metadata Extraction and EDA Notebook

This notebook is designed for rapid, robust analysis of lidar .laz tiles, including:
- Extraction of per-tile and per-class summary statistics (e.g. point density, ground fraction, scan angle, etc.)
- Flexible input: process entire dataset, a subset (by class), or use provided summary CSVs
- Optional: geospatial visualizations of density and overlaps

USAGE:
- By default, the notebook skips heavy processing and only loads the precomputed summary CSVs for EDA.
- To process your own files (all or a subset), set RUN_PROCESSING = True and configure input locations.

CONFIG OPTIONS:
- RUN_PROCESSING: Set to True to run metadata extraction on raw .laz files; otherwise, just load outputs.
- PROCESS_CLASSES: List of class names (e.g. ['RIB_A01', 'BON_A01']) to process if processing subset. Set to None to process all.
- PATHS: Edit local/Kaggle paths as needed.

Requirements: laspy, pandas, numpy, geopandas, shapely, matplotlib, seaborn, tqdm. Optional for map visualizations: folium, branca.

Author: Seamus Barnes

Date: 2025-06-25
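
The default behavior described under USAGE can be sketched as a small decision helper. This is a minimal sketch under stated assumptions, not the notebook's actual control flow: `choose_mode` is a hypothetical name, and raising an error when no precomputed CSV is found is an assumed fallback.

```python
import os

def choose_mode(run_processing, precomputed_csv):
    """Decide whether to extract metadata or load a precomputed summary.

    Hypothetical helper: the name and the error-on-missing-file behavior
    are assumptions, not part of the notebook.
    """
    if run_processing:
        return "process"   # extract metadata from raw .laz files
    if os.path.isfile(precomputed_csv):
        return "load"      # fast path: read the existing summary CSV
    raise FileNotFoundError(
        f"No precomputed summary at {precomputed_csv!r}; "
        "set RUN_PROCESSING = True to build one."
    )
```

With the configuration cell's defaults, this would be called as `choose_mode(RUN_PROCESSING, PATH_METADATA_CSV_INPUT)`.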

In [3]:
# ---- CONFIG ----

# Main flag: skip .laz processing for fast EDA, or process files yourself?
RUN_PROCESSING = False      # Set to True to extract metadata from your .laz files
PROCESS_CLASSES = None      # List of class names to process, e.g. ['RIB_A01', 'BON_A01']. None = all classes.

# Data locations (edit if using locally, or add extra Kaggle `/kaggle/input/yourdataset`)
DATA_RAW_LAZ_DIRS = [
    "../input/kaggle-dataset1-laz",  # <- put the correct Kaggle dataset names / paths here!
    "../input/kaggle-dataset2-laz",
    "../input/kaggle-dataset3-laz",
    "../input/kaggle-dataset4-laz"
]

# WHERE TO EXPECT/WRITE OUTPUT CSVs:
# By default, will use pre-made outputs in ../input/precomputed-metadata/...
# If processing yourself, will also save to the same names in ../working/ for future download

# Precomputed paths (for fast EDA, use these if present)
import os
CWD = os.getcwd()
PRECOMPUTED_DIR = os.path.join(CWD, "input", "precomputed-metadata")
PATH_METADATA_CSV_INPUT = os.path.join(PRECOMPUTED_DIR, "lidar_metadata_full.csv")
# Class summary file name mirrors PATH_CLASS_SUMMARY_CSV_OUTPUT below
PATH_CLASS_SUMMARY_CSV_INPUT = os.path.join(PRECOMPUTED_DIR, "lidar_metadata_class_summary.csv")

# Where to write new files if running processing
PATH_METADATA_CSV_OUTPUT = "lidar_metadata_full.csv"
PATH_CLASS_SUMMARY_CSV_OUTPUT = "lidar_metadata_class_summary.csv"

# Save progress every N files if running processing
SAVE_INTERVAL = 100
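
One way `SAVE_INTERVAL` might be used is to checkpoint results periodically so that a long run can be inspected or recovered partway through. The sketch below is an assumption about how such a loop could look, not the notebook's processing code: `process_with_checkpoints` and `extract` are hypothetical names, and it uses the standard-library `csv` module instead of pandas to stay dependency-free.

```python
import csv

def process_with_checkpoints(items, extract, out_path, save_interval=100):
    """Run `extract` on each item, rewriting `out_path` every `save_interval` rows.

    All names here are hypothetical; `extract` must return a dict with the
    same keys for every item.
    """
    rows = []

    def flush():
        # Rewrite the whole CSV from the rows collected so far.
        if not rows:
            return
        with open(out_path, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)

    for i, item in enumerate(items, start=1):
        rows.append(extract(item))
        if i % save_interval == 0:
            flush()          # periodic checkpoint
    flush()                  # final write covers the remainder
    return rows
```

Rewriting the full file at each checkpoint keeps the sketch simple; for very large runs, appending only the new rows would avoid repeated full writes.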

In [4]:
import os
import sys
import glob
from collections import defaultdict, Counter

import json

import numpy as np
import pandas as pd

from tqdm import tqdm

import geopandas as gpd
from shapely.ops import unary_union
import folium
import branca.colormap as cm

# Lidar file reading
try:
    import laspy
except ImportError:
    laspy = None  # keep the name defined so the version check below does not raise NameError
    print("ERROR: laspy is required for .laz file processing. Please install with `pip install laspy`.")

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# --- Print environment info for debug ---
print(f"Python {sys.version}")
print(f"laspy version: {getattr(laspy, '__version__', 'N/A')}")
if gpd:
    print(f"geopandas version: {gpd.__version__}")

Python 3.10.18 | packaged by conda-forge | (main, Jun  4 2025, 14:46:00) [Clang 18.1.8 ]
laspy version: 2.3.0
geopandas version: 1.1.0
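
Since the requirements treat the mapping libraries as optional, their availability could be checked before any map cells run. The following is a minimal sketch, assuming hypothetical flag names `HAS_FOLIUM` and `HAS_GEOPANDAS`; it relies only on the standard-library `importlib.util.find_spec`.

```python
import importlib.util

def has_module(name):
    """Return True if the top-level module `name` can be imported here."""
    return importlib.util.find_spec(name) is not None

# Hypothetical flag names; only build interactive maps when folium is present.
HAS_FOLIUM = has_module("folium")
HAS_GEOPANDAS = has_module("geopandas")
print(f"folium available: {HAS_FOLIUM}, geopandas available: {HAS_GEOPANDAS}")
```

Later visualization cells could then be guarded with `if HAS_FOLIUM:` instead of failing at import time.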


In [None]:
# ------------------------------------------------------
# Block 3: File Discovery & Class Selection Helpers
# ------------------------------------------------------

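def find_laz_files(laz_dirs, exts=(".laz", ".las")):
    """
    Find all .laz/.las files in provided input directories (recursively).
    Args:
        laz_dirs: List of directory paths.
        exts: Allowed file extensions (matched case-insensitively).
    Returns:
        files: Sorted list of unique file paths.
    """
    exts = tuple(ext.lower() for ext in exts)
    files = set()  # a set avoids double-counting on case-insensitive filesystems
    for laz_dir in laz_dirs:  # avoid shadowing the built-in `dir`
        if not os.path.isdir(laz_dir):
            print(f"Warning: Directory does not exist: {laz_dir}")
            continue
        # Walk everything once and filter by lower-cased extension
        for path in glob.glob(os.path.join(laz_dir, "**", "*"), recursive=True):
            if os.path.isfile(path) and path.lower().endswith(exts):
                files.add(path)
    files = sorted(files)
    print(f"Found {len(files):,} .laz/.las files in input directories.")
    return files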

def get_class_from_filename(filename):
    """
    Parse 'class' (first two underscore-separated fields) from lidar filename, e.g. JAM_A02_2011_laz_4.laz -> 'JAM_A02'
    """
    return "_".join(os.path.basename(filename).split("_")[:2])

def filter_files_by_class(files, allowed_classes):
    """
    Keep only files whose class (see get_class_from_filename) is in allowed_classes.
    Args:
        files: List of full file paths.
        allowed_classes: List of allowed class names (e.g., ['RIB_A01', ...]), or None to keep all files.
    Returns:
        filtered_files: List of file paths with class in allowed_classes.
    """
    if allowed_classes is None:
        return files
    filtered_files = [f for f in files if get_class_from_filename(f) in allowed_classes]
    print(f"Filtering for classes={allowed_classes}: {len(filtered_files):,} files selected.")
    return filtered_files

# ---- Example usage (don't run yet, run in processing block):
# all_laz_files = find_laz_files(DATA_RAW_LAZ_DIRS)
# subset_laz_files = filter_files_by_class(all_laz_files, PROCESS_CLASSES)
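
The class-parsing and filtering logic above can be tried without any lidar data. This is a small illustrative sketch: `class_of` restates the rule from `get_class_from_filename` so the block runs on its own, and every path except the docstring's `JAM_A02_2011_laz_4.laz` is a made-up example.

```python
import os
from collections import Counter

def class_of(path):
    # Same rule as get_class_from_filename: first two underscore-separated fields.
    return "_".join(os.path.basename(path).split("_")[:2])

# Hypothetical paths for illustration only.
paths = [
    "data/JAM_A02_2011_laz_4.laz",
    "data/JAM_A02_2011_laz_5.laz",
    "data/RIB_A01_2014_laz_0.laz",
    "data/BON_A01_2013_laz_7.laz",
]

# Tiles per class, useful as a sanity check before heavy processing.
counts = Counter(class_of(p) for p in paths)

# Equivalent of filter_files_by_class with PROCESS_CLASSES = ['RIB_A01', 'BON_A01'].
selected = [p for p in paths if class_of(p) in {"RIB_A01", "BON_A01"}]

print(counts)
print(selected)
```

Note that the parsing rule assumes every filename has at least two underscore-separated fields; a file named differently would yield a class equal to whatever precedes its first underscores.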