# Pipeline for running MD and labelme 

This pipeline is built by testing on an arbitrarily-chosen subset of the NZ trail cam dataset: https://lila.science/datasets/nz-trailcams

This notebook involves two conda environments (for some reason, possibly to do with opencv-python, installing megadetector messes with the qt plugin). To run this notebook, set up the following environment:
```
conda create -n megadetector python=3.11 pip -y
conda activate megadetector
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip3 install megadetector
pip3 install pillow
```
The last section (5. Label!) in this notebook generates a command to be copy and pasted into a terminal. Before running this command in the terminal, set up a conda environment for labelme as follows: 
```
git clone https://github.com/agentmorris/labelme
cd labelme
conda create -n labelme python=3.11 pip -y
conda activate labelme
pip install -e .
```

This notebook follows the following workflow:
1. Downsize images to at most 1600px wide (assuming most camera trap images have a larger width than height) to improve the latency of labelme
2. Running MDv5A on the dataset (adapted from https://github.com/agentmorris/MegaDetector/blob/main/notebooks/manage_local_batch.ipynb)
3. Running RDE on the dataset (adapted from https://github.com/agentmorris/MegaDetector/blob/main/notebooks/manage_local_batch.ipynb)
4. Prepare data for labelme 
    - Generate image folders each containing symlinks to 5k images as input to labelme to improve GUI latency (refer to https://github.com/yijint/Sentinel_Summer24/blob/main/reference_code/cxl-snapshot-relabeling.py emailed by Dan Morris 
    - Convert MD output for labelme compatability - this is an opinionated transformation that requires a confidence threshold (refer to https://github.com/agentmorris/MegaDetector/blob/main/megadetector/postprocessing/md_to_labelme.py)
5. Label! (qt plugin required)
    - If run into problems with initializing qt platform, try running `pip install pyside6`
6. Labelme to MD: convert updated annotations into MD format 

All the package requirements are listed in `requirements.txt`.

# 0. Set up

In [None]:
# import packages
import json
import sys
import os
import stat
import time
import re
from glob import glob
import PIL
from datetime import datetime
import time
import pandas as pd

import humanfriendly

from tqdm import tqdm
from collections import defaultdict

from megadetector.visualization.visualization_utils import resize_image_folder 

from megadetector.utils import path_utils
from megadetector.utils.path_utils import find_image_strings
from megadetector.utils.path_utils import recursive_file_list
from megadetector.utils.path_utils import safe_create_link
from megadetector.utils.ct_utils import split_list_into_n_chunks
from megadetector.utils.ct_utils import image_file_to_camera_folder
from megadetector.utils.ct_utils import split_list_into_fixed_size_chunks

from megadetector.detection.run_detector_batch import load_and_run_detector_batch, write_results_to_file
from megadetector.detection.run_detector import DEFAULT_OUTPUT_CONFIDENCE_THRESHOLD
from megadetector.detection.run_detector import estimate_md_images_per_second
from megadetector.detection.run_detector import get_detector_version_from_filename

from megadetector.postprocessing.jin_md_to_labelme import md_to_labelme
from megadetector.postprocessing.postprocess_batch_results import PostProcessingOptions, process_batch_results
from megadetector.postprocessing.repeat_detection_elimination import repeat_detections_core
from megadetector.postprocessing.repeat_detection_elimination import remove_repeat_detections

In [None]:
# set paths and variables 
workdir = os.getcwd() # where this notebook and the original data lies, and where all the work will be done
labelme_path = '/home/garage/Documents/jin-summer24/labelme' # where labelme was installed 
og_datapath = f"{workdir}/data" # where the original data is (just a subset of the NZ trailcam dataset for testing)
metadata_path = f'{og_datapath}/trail_camera_images_of_new_zealand_animals_1.00.json' # metadata (for ALL data in the NZ trailcam dataset)
datapath = f"{workdir}/downsized_data"
first_run = False # some operations take a long time (MD detection, RDE, and most of all, generating multiple levels of labelme jsons). To avoid re-running these operations, set this to False after the first run. 
sort_all = True # whether to show images in datetime order in labelme

In [82]:
# load metadata
with open(metadata_path) as json_file:
    metadata = json.load(json_file)

In [25]:
# retrieve unique image identifier from file path
def get_file_id(fn):
    return os.path.basename(fn).split('.')[0]

# 1. Downsize images to at most 1600px wide (assuming most camera trap images have a larger width than height) to improve the latency of labelme

In [83]:
if not os.path.exists(datapath):
    os.mkdir(datapath)
    # resize a folder of images to a new folder on multiple threads/processes.
    %time _ = resize_image_folder(input_folder=og_datapath, output_folder=datapath, target_width=1600, target_height=-1, no_enlarge_width=True, verbose=False)

# 2. Running MDv5A on the dataset (adapted from https://github.com/agentmorris/MegaDetector/blob/main/notebooks/manage_local_batch.ipynb)

Set constants

In [84]:
## Inference options

# To specify a non-default confidence threshold for including detections in the .json file
json_threshold = None

# Turn warnings into errors if more than this many images are missing
max_tolerable_failed_images = 100

# Should we supply the --image_queue_option to run_detector_batch.py?  I only set this
# when I have a very slow drive and a comparably fast GPU.  When this is enabled, checkpointing
# is not supported within a job, so I set n_jobs to a large number (typically 100).
use_image_queue = False

# Only relevant when we're using a single GPU
default_gpu_number = 0

# Should we supply --quiet to run_detector_batch.py?
quiet_mode = True

# Specify a target image size when running MD... strongly recommended to leave this at "None"
# When using augmented inference, if you leave this at "None", run_inference_with_yolov5_val.py
# will use its default size, which is 1280 * 1.3, which is almost always what you want.
image_size = None

# Should we include image size, timestamp, and/or EXIF data in MD output?
include_image_size = False
include_image_timestamp = False
include_exif_data = False

# Only relevant when running on CPU
ncores = 1

# OS-specific script line continuation character (modified later if we're running on Windows)
slcc = '\\'

#  OS-specific script comment character (modified later if we're running on Windows)
scc = '#'

# # OS-specific script extension (modified later if we're running on Windows)
script_extension = '.sh'

# If False, we'll load chunk files with file lists if they exist
force_enumeration = False

# Prefer threads on Windows, processes on Linux
parallelization_defaults_to_threads = False

# This is for things like image rendering, not for MegaDetector
default_workers_for_parallel_tasks = 30

overwrite_handling = 'skip' # 'skip', 'error', or 'overwrite'

# The function used to get camera names from image paths; can also replace this
# with a custom function.
relative_path_to_location = image_file_to_camera_folder

# This will be the .json results file after RDE; if this is still None when
# we get to classification stuff, that will indicate that we didn't do RDE.
filtered_output_filename = None

if os.name == 'nt':

    slcc = '^'
    scc = 'REM'
    script_extension = '.bat'

    # My experience has been that Python multiprocessing is flaky on Windows, so
    # default to threads on Windows
    parallelization_defaults_to_threads = True
    default_workers_for_parallel_tasks = 10


## Constants related to using YOLOv5's val.py

# Should we use YOLOv5's val.py instead of run_detector_batch.py?
use_yolo_inference_scripts = False

# Directory in which to run val.py (relevant for YOLOv5, not for YOLOv8)
yolo_working_dir = os.path.expanduser('~/git/yolov5')

# Only used for loading the mapping from class indices to names
yolo_dataset_file = None

# 'yolov5' or 'yolov8'; assumes YOLOv5 if this is None
yolo_model_type = None

# inference batch size
yolo_batch_size = 1

# Should we remove intermediate files used for running YOLOv5's val.py?
# Only relevant if use_yolo_inference_scripts is True.
remove_yolo_intermediate_results = True
remove_yolo_symlink_folder = True
use_symlinks_for_yolo_inference = True
write_yolo_debug_output = False

# Should we apply YOLOv5's test-time augmentation?
augment = False


## Constants related to tiled inference

use_tiled_inference = False

# Should we delete tiles after each job?  Only set this to False for debugging;
# large jobs will take up a lot of space if you keep tiles around after each task.
remove_tiles = True
tile_size = (1280,1280)
tile_overlap = 0.2

Job-specific constants

In [85]:
input_path = datapath

assert not (input_path.endswith('/') or input_path.endswith('\\'))
assert os.path.isdir(input_path), 'Could not find input folder {}'.format(input_path)
input_path = input_path.replace('\\','/')

organization_name_short = 'nz-trailcams-aac-aiv'
job_date = '2024-jun-07'
assert job_date is not None and organization_name_short != 'organization'

# Optional descriptor
job_tag = None

if job_tag is None:
    job_description_string = ''
else:
    job_description_string = '-' + job_tag

model_file = 'MDV5A' # 'MDV5A', 'MDV5B', 'MDV4'

postprocessing_base = os.path.expanduser(f'{workdir}/postprocessing')

# Number of jobs to split data into, typically equal to the number of available GPUs, though
# when using augmentation or an image queue (and thus not using checkpoints), I typically
# use ~100 jobs per GPU; those serve as de facto checkpoints.
n_jobs = 1
n_gpus = 1

# Set to "None" when using augmentation or an image queue, which don't currently support
# checkpointing.  Don't worry, this will be assert()'d in the next cell.
checkpoint_frequency = 10000

# Estimate inference speed for the current GPU
approx_images_per_second = estimate_md_images_per_second(model_file)

# Rough estimate for the inference time cost of augmentation
if augment and (approx_images_per_second is not None):
    approx_images_per_second = approx_images_per_second * 0.7

base_task_name = organization_name_short + '-' + job_date + job_description_string + '-' + \
    get_detector_version_from_filename(model_file)
base_output_folder_name = os.path.join(postprocessing_base,organization_name_short)
os.makedirs(base_output_folder_name,exist_ok=True)

No speed estimate available for NVIDIA GeForce RTX 2070 SUPER


Derived variables, constant validation, path setup

In [86]:
if use_image_queue:
    assert checkpoint_frequency is None,\
        'Checkpointing is not supported when using an image queue'

if augment:
    assert checkpoint_frequency is None,\
        'Checkpointing is not supported when using augmentation'

    assert use_yolo_inference_scripts,\
        'Augmentation is only supported when running with the YOLO inference scripts'

if use_tiled_inference:
    assert not augment, \
        'Augmentation is not supported when using tiled inference'
    assert not use_yolo_inference_scripts, \
        'Using the YOLO inference script is not supported when using tiled inference'
    assert checkpoint_frequency is None, \
        'Checkpointing is not supported when using tiled inference'

filename_base = os.path.join(base_output_folder_name, base_task_name)
combined_api_output_folder = os.path.join(filename_base, 'combined_api_outputs')
postprocessing_output_folder = os.path.join(filename_base, 'preview')

combined_api_output_file = os.path.join(
    combined_api_output_folder,
    '{}_detections.json'.format(base_task_name))

os.makedirs(filename_base, exist_ok=True)
os.makedirs(combined_api_output_folder, exist_ok=True)
os.makedirs(postprocessing_output_folder, exist_ok=True)

if input_path.endswith('/'):
    input_path = input_path[0:-1]

print('Output folder:\n{}'.format(filename_base))

Output folder:
/home/garage/Documents/jin-summer24/Sentinel_Summer24/nz-trailcams-test/postprocessing/nz-trailcams-aac-aiv/nz-trailcams-aac-aiv-2024-jun-07-v5a.0.0


Enumerate files (generate a list of image paths for future use)

In [87]:
# Have we already listed files for this job?
chunk_files = os.listdir(filename_base)
pattern = re.compile('chunk\d+.json')
chunk_files = [fn for fn in chunk_files if pattern.match(fn)] # generated in cells below, if this does not exist

if (not force_enumeration) and (len(chunk_files) > 0):

    print('Found {} chunk files in folder {}, bypassing enumeration'.format(
        len(chunk_files),
        filename_base))

    all_images = []
    for fn in chunk_files:
        with open(os.path.join(filename_base,fn),'r') as f:
            chunk = json.load(f)
            assert isinstance(chunk,list)
            all_images.extend(chunk)
    all_images = sorted(all_images)

    print('Loaded {} image files from {} chunks in {}'.format(
        len(all_images),len(chunk_files),filename_base))

else:

    print('Enumerating image files in {}'.format(input_path))

    all_images = sorted(path_utils.find_images(input_path,recursive=True,convert_slashes=True))

    # It's common to run this notebook on an external drive with the main folders in the drive root
    all_images = [fn for fn in all_images if not \
                  (fn.startswith('$RECYCLE') or fn.startswith('System Volume Information'))]

    print('')

    print('Enumerated {} image files in {}'.format(len(all_images),input_path))

Found 1 chunk files in folder /home/garage/Documents/jin-summer24/Sentinel_Summer24/nz-trailcams-test/postprocessing/nz-trailcams-aac-aiv/nz-trailcams-aac-aiv-2024-jun-07-v5a.0.0, bypassing enumeration
Loaded 38 image files from 1 chunks in /home/garage/Documents/jin-summer24/Sentinel_Summer24/nz-trailcams-test/postprocessing/nz-trailcams-aac-aiv/nz-trailcams-aac-aiv-2024-jun-07-v5a.0.0


Divide images into chunks for multiple processes

In [88]:
folder_chunks = split_list_into_n_chunks(all_images,n_jobs)

Estimate total time 

In [89]:
if approx_images_per_second is None:

    print("Can't estimate inference time for the current environment")

else:

    n_images = len(all_images)
    execution_seconds = n_images / approx_images_per_second
    wallclock_seconds = execution_seconds / n_gpus
    print('Expected time: {}'.format(humanfriendly.format_timespan(wallclock_seconds)))

    seconds_per_chunk = len(folder_chunks[0]) / approx_images_per_second
    print('Expected time per chunk: {}'.format(humanfriendly.format_timespan(seconds_per_chunk)))

Can't estimate inference time for the current environment


Write file lists

In [90]:
task_info = []

for i_chunk, chunk_list in enumerate(folder_chunks):

    chunk_fn = os.path.join(filename_base,'chunk{}.json'.format(str(i_chunk).zfill(3)))
    task_info.append({'id':i_chunk,'input_file':chunk_fn})
    path_utils.write_list_to_file(chunk_fn, chunk_list)

Generate commands

In [91]:
# A list of the scripts tied to each GPU, as absolute paths.  We'll write this out at
# the end so each GPU's list of commands can be run at once
gpu_to_scripts = defaultdict(list)

# i_task = 0; task = task_info[i_task]
for i_task,task in enumerate(task_info):

    chunk_file = task['input_file']
    checkpoint_filename = chunk_file.replace('.json','_checkpoint.json')

    output_fn = chunk_file.replace('.json','_results.json')

    task['output_file'] = output_fn

    if n_gpus > 1:
        gpu_number = i_task % n_gpus
    else:
        gpu_number = default_gpu_number

    image_size_string = ''
    if image_size is not None:
        image_size_string = '--image_size {}'.format(image_size)

    # Generate the script to run MD

    if use_yolo_inference_scripts:

        augment_string = ''
        if augment:
            augment_string = '--augment_enabled 1'
        else:
            augment_string = '--augment_enabled 0'

        batch_string = '--batch_size {}'.format(yolo_batch_size)

        symlink_folder = os.path.join(filename_base,'symlinks','symlinks_{}'.format(
            str(i_task).zfill(3)))
        yolo_results_folder = os.path.join(filename_base,'yolo_results','yolo_results_{}'.format(
            str(i_task).zfill(3)))

        symlink_folder_string = '--symlink_folder "{}"'.format(symlink_folder)
        yolo_results_folder_string = '--yolo_results_folder "{}"'.format(yolo_results_folder)

        remove_symlink_folder_string = ''
        if not remove_yolo_symlink_folder:
            remove_symlink_folder_string = '--no_remove_symlink_folder'

        write_yolo_debug_output_string = ''
        if write_yolo_debug_output:
            write_yolo_debug_output = '--write_yolo_debug_output'

        remove_yolo_results_string = ''
        if not remove_yolo_intermediate_results:
            remove_yolo_results_string = '--no_remove_yolo_results_folder'

        confidence_threshold_string = ''
        if json_threshold is not None:
            confidence_threshold_string = '--conf_thres {}'.format(json_threshold)
        else:
            confidence_threshold_string = '--conf_thres {}'.format(DEFAULT_OUTPUT_CONFIDENCE_THRESHOLD)

        cmd = ''

        device_string = '--device {}'.format(gpu_number)

        overwrite_handling_string = '--overwrite_handling {}'.format(overwrite_handling)

        cmd += f'python run_inference_with_yolov5_val.py "{model_file}" "{chunk_file}" "{output_fn}" '
        cmd += f'{image_size_string} {augment_string} '
        cmd += f'{symlink_folder_string} {yolo_results_folder_string} {remove_yolo_results_string} '
        cmd += f'{remove_symlink_folder_string} {confidence_threshold_string} {device_string} '
        cmd += f'{overwrite_handling_string} {batch_string} {write_yolo_debug_output_string}'

        if yolo_working_dir is not None:
            cmd += f' --yolo_working_folder "{yolo_working_dir}"'
        if yolo_dataset_file is not None:
            cmd += ' --yolo_dataset_file "{}"'.format(yolo_dataset_file)
        if yolo_model_type is not None:
            cmd += ' --model_type {}'.format(yolo_model_type)

        if not use_symlinks_for_yolo_inference:
            cmd += ' --no_use_symlinks'

        cmd += '\n'

    elif use_tiled_inference:

        tiling_folder = os.path.join(filename_base,'tile_cache','tile_cache_{}'.format(
            str(i_task).zfill(3)))

        if os.name == 'nt':
            cuda_string = f'set CUDA_VISIBLE_DEVICES={gpu_number} & '
        else:
            cuda_string = f'CUDA_VISIBLE_DEVICES={gpu_number} '

        cmd = f'{cuda_string} python run_tiled_inference.py "{model_file}" "{input_path}" "{tiling_folder}" "{output_fn}"'

        cmd += f' --image_list "{chunk_file}"'
        cmd += f' --overwrite_handling {overwrite_handling}'

        if not remove_tiles:
            cmd += ' --no_remove_tiles'

        # If we're using non-default tile sizes
        if tile_size is not None and (tile_size[0] > 0 or tile_size[1] > 0):
            cmd += ' --tile_size_x {} --tile_size_y {}'.format(tile_size[0],tile_size[1])

        if tile_overlap is not None:
            cmd += f' --tile_overlap {tile_overlap}'

    else:

        if os.name == 'nt':
            cuda_string = f'set CUDA_VISIBLE_DEVICES={gpu_number} & '
        else:
            cuda_string = f'CUDA_VISIBLE_DEVICES={gpu_number} '

        checkpoint_frequency_string = ''
        checkpoint_path_string = ''

        if checkpoint_frequency is not None and checkpoint_frequency > 0:
            checkpoint_frequency_string = f'--checkpoint_frequency {checkpoint_frequency}'
            checkpoint_path_string = '--checkpoint_path "{}"'.format(checkpoint_filename)

        use_image_queue_string = ''
        if (use_image_queue):
            use_image_queue_string = '--use_image_queue'

        ncores_string = ''
        if (ncores > 1):
            ncores_string = '--ncores {}'.format(ncores)

        quiet_string = ''
        if quiet_mode:
            quiet_string = '--quiet'

        confidence_threshold_string = ''
        if json_threshold is not None:
            confidence_threshold_string = '--threshold {}'.format(json_threshold)

        overwrite_handling_string = '--overwrite_handling {}'.format(overwrite_handling)
        cmd = f'{cuda_string} python run_detector_batch.py "{model_file}" "{chunk_file}" "{output_fn}" {checkpoint_frequency_string} {checkpoint_path_string} {use_image_queue_string} {ncores_string} {quiet_string} {image_size_string} {confidence_threshold_string} {overwrite_handling_string}'

        if include_image_size:
            cmd += ' --include_image_size'
        if include_image_timestamp:
            cmd += ' --include_image_timestamp'
        if include_exif_data:
            cmd += ' --include_exif_data'

    cmd_file = os.path.join(filename_base,'run_chunk_{}_gpu_{}{}'.format(str(i_task).zfill(3),
                            str(gpu_number).zfill(2),script_extension))

    with open(cmd_file,'w') as f:
        f.write(cmd + '\n')

    st = os.stat(cmd_file)
    os.chmod(cmd_file, st.st_mode | stat.S_IEXEC)

    task['command'] = cmd
    task['command_file'] = cmd_file

    # Generate the script to resume from the checkpoint (only supported with MD inference code)

    gpu_to_scripts[gpu_number].append(cmd_file)

    if checkpoint_frequency is not None:

        resume_string = ' --resume_from_checkpoint "{}"'.format(checkpoint_filename)
        resume_cmd = cmd + resume_string

        resume_cmd_file = os.path.join(filename_base,
                                       'resume_chunk_{}_gpu_{}{}'.format(str(i_task).zfill(3),
                                       str(gpu_number).zfill(2),script_extension))

        with open(resume_cmd_file,'w') as f:
            f.write(resume_cmd + '\n')

        st = os.stat(resume_cmd_file)
        os.chmod(resume_cmd_file, st.st_mode | stat.S_IEXEC)

        task['resume_command'] = resume_cmd
        task['resume_command_file'] = resume_cmd_file

# ...for each task

# Write out a script for each GPU that runs all of the commands associated with
# that GPU.  Typically only used when running lots of little scripts in lieu
# of checkpointing.
for gpu_number in gpu_to_scripts:

    gpu_script_file = os.path.join(filename_base,'run_all_for_gpu_{}{}'.format(
        str(gpu_number).zfill(2),script_extension))
    with open(gpu_script_file,'w') as f:
        for script_name in gpu_to_scripts[gpu_number]:
            s = script_name
            # When calling a series of batch files on Windows from within a batch file, you need to
            # use "call", or only the first will be executed.  No, it doesn't make sense.
            if os.name == 'nt':
                s = 'call ' + s
            f.write(s + '\n')
        f.write('echo "Finished all commands for GPU {}"'.format(gpu_number))
    st = os.stat(gpu_script_file)
    os.chmod(gpu_script_file, st.st_mode | stat.S_IEXEC)

# ...for each GPU

Run the tasks

The cells we've run so far wrote out some shell scripts (.bat files on Windows,
.sh files on Linx/Mac) that will run MegaDetector.  I like to leave the interactive
environment at this point and run those scripts at the command line.  So, for example,
if you're on Windows, and you've basically used the default values above, there will be
batch files called, e.g.:

c:\users\[username]\postprocessing\[organization]\[job_name]\run_chunk_000_gpu_00.bat
c:\users\[username]\postprocessing\[organization]\[job_name]\run_chunk_001_gpu_01.bat

Those batch files expect to be run from the "detection" folder of the MegaDetector repo,
typically:

c:\git\MegaDetector\megadetector\detection

All of that said, you don't *have* to do this at the command line.  The following cell
runs these scripts programmatically, so if you set "run_tasks_in_notebook" to "True"
and run this cell, you can run MegaDetector without leaving this notebook.

One downside of the programmatic approach is that this cell doesn't yet parallelize over
multiple processes, so the tasks will run serially.  This only matters if you have
multiple GPUs.

In [92]:
run_tasks_in_notebook = True

if run_tasks_in_notebook and first_run:

    assert not use_yolo_inference_scripts, \
        'If you want to use the YOLOv5 inference scripts, you can\'t run the model interactively (yet)'

    # i_task = 0; task = task_info[i_task]
    for i_task,task in enumerate(task_info):

        chunk_file = task['input_file']
        output_fn = task['output_file']

        checkpoint_filename = chunk_file.replace('.json','_checkpoint.json')

        if json_threshold is not None:
            confidence_threshold = json_threshold
        else:
            confidence_threshold = DEFAULT_OUTPUT_CONFIDENCE_THRESHOLD

        if checkpoint_frequency is not None and checkpoint_frequency > 0:
            cp_freq_arg = checkpoint_frequency
        else:
            cp_freq_arg = -1

        start_time = time.time()
        results = load_and_run_detector_batch(model_file=model_file,
                                              image_file_names=chunk_file,
                                              checkpoint_path=checkpoint_filename,
                                              confidence_threshold=confidence_threshold,
                                              checkpoint_frequency=cp_freq_arg,
                                              results=None,
                                              n_cores=ncores,
                                              use_image_queue=use_image_queue,
                                              quiet=quiet_mode,
                                              image_size=image_size)
        elapsed = time.time() - start_time

        print('Task {}: finished inference for {} images in {}'.format(
            i_task, len(results),humanfriendly.format_timespan(elapsed)))

        # This will write absolute paths to the file, we'll fix this later
        write_results_to_file(results, output_fn, detector_file=model_file)

        if checkpoint_frequency is not None and checkpoint_frequency > 0:
            if os.path.isfile(checkpoint_filename):
                os.remove(checkpoint_filename)
                print('Deleted checkpoint file {}'.format(checkpoint_filename))

    # ...for each chunk

# ...if we're running tasks in this notebook

Load results, look for failed or missing images in each task

In [93]:
# Check that all task output files exist

if first_run:
    missing_output_files = []

    # i_task = 0; task = task_info[i_task]
    for i_task, task in tqdm(enumerate(task_info),total=len(task_info)):
        output_file = task['output_file']
        if not os.path.isfile(output_file):
            missing_output_files.append(output_file)

    if len(missing_output_files) > 0:
        print('Missing {} output files:'.format(len(missing_output_files)))
        for s in missing_output_files:
            print(s)
        raise Exception('Missing output files')


    n_total_failures = 0

    for i_task,task in tqdm(enumerate(task_info),total=len(task_info)):

        chunk_file = task['input_file']
        output_file = task['output_file']

        with open(chunk_file,'r') as f:
            task_images = json.load(f)
        with open(output_file,'r') as f:
            task_results = json.load(f)

        task_images_set = set(task_images)
        filename_to_results = {}

        n_task_failures = 0

        for im in task_results['images']:

            # Most of the time, inference result files use absolute paths, but it's
            # getting annoying to make sure that's *always* true, so handle both here.
            # E.g., when using tiled inference, paths will be relative.
            if not os.path.isabs(im['file']):
                fn = os.path.join(input_path,im['file']).replace('\\','/')
                im['file'] = fn
            assert im['file'].startswith(input_path)
            assert im['file'] in task_images_set
            filename_to_results[im['file']] = im
            if 'failure' in im:
                assert im['failure'] is not None
                n_task_failures += 1

        task['n_failures'] = n_task_failures
        task['results'] = task_results

        for fn in task_images:
            assert fn in filename_to_results, \
                'File {} not found in results for task {}'.format(fn,i_task)

        n_total_failures += n_task_failures

    # ...for each task

    assert n_total_failures < max_tolerable_failed_images,\
        '{} failures (max tolerable set to {})'.format(n_total_failures,
                                                    max_tolerable_failed_images)

    print('Processed all {} images with {} failures'.format(
        len(all_images),n_total_failures))


    ##%% Merge results files and make filenames relative

    combined_results = {}
    combined_results['images'] = []
    images_processed = set()

    for i_task,task in tqdm(enumerate(task_info),total=len(task_info)):

        task_results = task['results']

        if i_task == 0:
            combined_results['info'] = task_results['info']
            combined_results['detection_categories'] = task_results['detection_categories']
        else:
            assert task_results['info']['format_version'] == combined_results['info']['format_version']
            assert task_results['detection_categories'] == combined_results['detection_categories']

        # Make sure we didn't see this image in another chunk
        for im in task_results['images']:
            assert im['file'] not in images_processed
            images_processed.add(im['file'])

        combined_results['images'].extend(task_results['images'])

    # Check that we ended up with the right number of images
    assert len(combined_results['images']) == len(all_images), \
        'Expected {} images in combined results, found {}'.format(
            len(all_images),len(combined_results['images']))

    # Check uniqueness
    result_filenames = [im['file'] for im in combined_results['images']]
    assert len(combined_results['images']) == len(set(result_filenames))

    # Convert to relative paths, preserving '/' as the path separator, regardless of OS
    for im in combined_results['images']:
        assert '\\' not in im['file']
        assert im['file'].startswith(input_path)
        if input_path.endswith(':'):
            im['file'] = im['file'].replace(input_path,'',1)
        else:
            im['file'] = im['file'].replace(input_path + '/','',1)

    with open(combined_api_output_file,'w') as f:
        json.dump(combined_results,f,indent=1)

    print('Wrote results to {}'.format(combined_api_output_file))

Post-processing (pre-RDE)

In [94]:
render_animals_only = False

options = PostProcessingOptions()
options.image_base_dir = input_path
options.include_almost_detections = True
options.num_images_to_sample = 7500
options.confidence_threshold = 0.2
options.almost_detection_confidence_threshold = options.confidence_threshold - 0.05
options.ground_truth_json_file = None
options.separate_detections_by_category = True
options.sample_seed = 0
options.max_figures_per_html_file = 2500

options.parallelize_rendering = True
options.parallelize_rendering_n_cores = default_workers_for_parallel_tasks
options.parallelize_rendering_with_threads = parallelization_defaults_to_threads

if render_animals_only:
    # Omit some pages from the output, useful when animals are rare
    options.rendering_bypass_sets = ['detections_person','detections_vehicle',
                                     'detections_person_vehicle','non_detections']

output_base = os.path.join(postprocessing_output_folder,
    base_task_name + '_{:.3f}'.format(options.confidence_threshold))
if render_animals_only:
    output_base = output_base + '_animals_only'

os.makedirs(output_base, exist_ok=True)
print('Processing to {}'.format(output_base))

options.md_results_file = combined_api_output_file
options.output_dir = output_base

if first_run:
    ppresults = process_batch_results(options)
    html_output_file = ppresults.output_html_file
    path_utils.open_file(html_output_file,attempt_to_open_in_wsl_host=True,browser_name='chrome')

Processing to /home/garage/Documents/jin-summer24/Sentinel_Summer24/nz-trailcams-test/postprocessing/nz-trailcams-aac-aiv/nz-trailcams-aac-aiv-2024-jun-07-v5a.0.0/preview/nz-trailcams-aac-aiv-2024-jun-07-v5a.0.0_0.200


# 3. Running RDE on the dataset (adapted from https://github.com/agentmorris/MegaDetector/blob/main/notebooks/manage_local_batch.ipynb)

Repeat detection elimination (RDE)

In [95]:
task_index = 0

options = repeat_detections_core.RepeatDetectionOptions()

options.confidenceMin = 0.1
options.confidenceMax = 1.01
options.iouThreshold = 0.85
options.occurrenceThreshold = 15
options.maxSuspiciousDetectionSize = 0.2
# options.minSuspiciousDetectionSize = 0.05

options.parallelizationUsesThreads = parallelization_defaults_to_threads
options.nWorkers = default_workers_for_parallel_tasks

# This will cause a very light gray box to get drawn around all the detections
# we're *not* considering as suspicious.
options.bRenderOtherDetections = True
options.otherDetectionsThreshold = options.confidenceMin

options.bRenderDetectionTiles = True
options.maxOutputImageWidth = 2000
options.detectionTilesMaxCrops = 250

# options.lineThickness = 5
# options.boxExpansion = 8

# To invoke custom collapsing of folders for a particular manufacturer's naming scheme
options.customDirNameFunction = relative_path_to_location

options.bRenderHtml = False
options.imageBase = input_path
rde_string = 'rde_{:.3f}_{:.3f}_{}_{:.3f}'.format(
    options.confidenceMin, options.iouThreshold,
    options.occurrenceThreshold, options.maxSuspiciousDetectionSize)
options.outputBase = os.path.join(filename_base, rde_string + '_task_{}'.format(task_index))
options.filenameReplacements = None # {'':''}

# Exclude people and vehicles from RDE
# options.excludeClasses = [2,3]

# options.maxImagesPerFolder = 50000
# options.includeFolders = ['a/b/c']
# options.excludeFolder = ['a/b/c']

options.debugMaxDir = -1
options.debugMaxRenderDir = -1
options.debugMaxRenderDetection = -1
options.debugMaxRenderInstance = -1

# Can be None, 'xsort', or 'clustersort'
options.smartSort = 'xsort'

In [96]:
if first_run:
    %time suspicious_detection_results = repeat_detections_core.find_repeat_detections(combined_api_output_file, outputFilename=None, options=options)

Manual RDE step (deleting the valid detections from the suspicious detections)

In [97]:
## DELETE THE VALID DETECTIONS ##

# If you run this line, it will open the folder up in your file browser
if first_run:
    path_utils.open_file(os.path.dirname(suspicious_detection_results.filterFile),
                        attempt_to_open_in_wsl_host=True)

# If you ran the previous cell, but then you change your mind and you don't want to do
# the RDE step, that's fine, but don't just blast through this cell once you've run the
# previous cell.  If you do that, you're implicitly telling the notebook that you looked
# at everything in that folder, and confirmed there were no red boxes on animals.

# Instead, either change "filtered_output_filename" below to "combined_api_output_file",
# or delete *all* the images in the filtering folder.

Re-filtering

In [98]:
filtered_output_filename = path_utils.insert_before_extension(combined_api_output_file,
                                                              'filtered_{}'.format(rde_string))

In [99]:
if first_run:
    remove_repeat_detections.remove_repeat_detections(
        inputFile=combined_api_output_file,
        outputFile=filtered_output_filename,
        filteringDir=os.path.dirname(suspicious_detection_results.filterFile)
        )

Post-processing (post-RDE)

In [100]:
render_animals_only = False

options = PostProcessingOptions()
options.image_base_dir = input_path
options.include_almost_detections = True
options.num_images_to_sample = 7500
options.confidence_threshold = 0.2
options.almost_detection_confidence_threshold = options.confidence_threshold - 0.05
options.ground_truth_json_file = None
options.separate_detections_by_category = True
options.sample_seed = 0
options.max_figures_per_html_file = 5000

options.parallelize_rendering = True
options.parallelize_rendering_n_cores = default_workers_for_parallel_tasks
options.parallelize_rendering_with_threads = parallelization_defaults_to_threads

if render_animals_only:
    # Omit some pages from the output, useful when animals are rare
    options.rendering_bypass_sets = ['detections_person','detections_vehicle',
                                      'detections_person_vehicle','non_detections']

output_base = os.path.join(postprocessing_output_folder,
    base_task_name + '_{}_{:.3f}'.format(rde_string, options.confidence_threshold))

if render_animals_only:
    output_base = output_base + '_render_animals_only'
os.makedirs(output_base, exist_ok=True)

print('Processing post-RDE to {}'.format(output_base))

options.md_results_file = filtered_output_filename
options.output_dir = output_base

if first_run:
    ppresults = process_batch_results(options)
    html_output_file = ppresults.output_html_file
    path_utils.open_file(html_output_file,attempt_to_open_in_wsl_host=True,browser_name='chrome')

Processing post-RDE to /home/garage/Documents/jin-summer24/Sentinel_Summer24/nz-trailcams-test/postprocessing/nz-trailcams-aac-aiv/nz-trailcams-aac-aiv-2024-jun-07-v5a.0.0/preview/nz-trailcams-aac-aiv-2024-jun-07-v5a.0.0_rde_0.100_0.850_15_0.200_0.200


# 4. Prepare data for labelme

In [101]:
md_results_file = filtered_output_filename 
relabeling_folder_base = datapath
symlink_folder = f'{workdir}/relabeling-symlinks'

# Ensure that paths are normalized
md_results_file = md_results_file.replace('\\','/')
relabeling_folder_base = relabeling_folder_base.replace('\\','/')
symlink_folder = symlink_folder.replace('\\','/')
use_threads = True
n_workers = 10

batch_name = 'nz-trailcams-acc-aiv'
max_images_per_chunk = 5000

assert os.path.isfile(md_results_file)
assert os.path.isdir(relabeling_folder_base)
os.makedirs(symlink_folder,exist_ok=True)

default_confidence_threshold = 0.2

# This defines the set of backup label files we generate from MD results at lower confidence thresholds
index_to_threshold = {
    1:0.1,
    2:0.05,
    3:0.01
}

In [None]:
# Convert MD results to labelme format with a default threshold
if first_run:
    _ = md_to_labelme(results_file=md_results_file,
                    image_base=relabeling_folder_base,
                    confidence_threshold=default_confidence_threshold,
                    overwrite=True,
                    extension_prefix='',
                    n_workers=n_workers,
                    use_threads=use_threads,
                    bypass_image_size_read=False,
                    verbose=True)

    # Create alternative .json files based on MD results at lower thresholds

    for index in index_to_threshold.keys():
        
        print('Generating alternative labels for index {} (threshold {})'.format(
            index,index_to_threshold[index]))
        
        md_to_labelme(results_file=md_results_file,
                    image_base=relabeling_folder_base,
                    confidence_threshold=index_to_threshold[index],
                    overwrite=True,
                    use_threads=use_threads,
                    bypass_image_size_read=False,
                    extension_prefix='.alt-{}'.format(index),
                    n_workers=n_workers)

In [None]:
# Enumerate files

all_files_relative = recursive_file_list(relabeling_folder_base,
                                         return_relative_paths=True,
                                         convert_slashes=True,
                                         recursive=True)

print('Enumerated {} files'.format(len(all_files_relative)))


# Match .json files to images
image_files_relative = find_image_strings(all_files_relative)
image_files_relative = [fp for fp in image_files_relative if 'symlinks' not in fp]
json_files = [fn for fn in all_files_relative if fn.endswith('.json') and 'symlinks' not in fn]
json_files = sorted(json_files)

print('Enumerated {} image files and {} .json files'.format(
    len(image_files_relative),len(json_files)))

Enumerated 384 files
Enumerated 38 image files and 152 .json files


Sort images by metadata

In [174]:
# load image metadata
images_df = pd.DataFrame(metadata['images'])[['location','datetime','file_name','species']]

# retrieve unique image identifier from metadata 
%time images_df['uuid'] = images_df.apply(lambda x: get_file_id(x.file_name), 1)

CPU times: user 17.6 s, sys: 134 ms, total: 17.7 s
Wall time: 17.7 s


In [175]:
# gather all image file paths we care about and retrieve their unique image identifier 
image_extensions = [
    '**/*.png', '**/*.PNG', '**/*.jpg', '**/*.JPG', '**/*.jpeg', '**/*.JPEG',
    '**/*.gif', '**/*.GIF', '**/*.bmp', '**/*.BMP', '**/*.tiff', '**/*.TIFF',
    '**/*.webp', '**/*.WEBP'
]
image_files = []
for ext in image_extensions:
    image_files.extend(glob(f"{datapath}/{ext}", recursive=True))
image_files = list(map(get_file_id, image_files))

# filter metadata to only contain images we care about
include_image = [True if image_id in image_files else False for image_id in images_df.uuid]
images_df = images_df.loc[include_image].reset_index(drop=True)

In [177]:
# covert datetime string into datetime object, and extract the value from EXIF data where necessary 
if sort_all:
    EXIF_extracted = 0

    tic = time.time()
    for row_idx, dt in enumerate(images_df['datetime']):
        if type(dt) == type(None): # If datetime is not available, extract from EXIF data.
            EXIF_extracted += 1 
            DT_TAG = 306 # tag number for DateTime in exif object 
            NZ_EXIF_DT_FORMAT = "%Y:%m:%d %H:%M:%S"
            exif_dt = PIL.Image.open(f'{datapath}/{images_df["file_name"][row_idx]}')._getexif()[DT_TAG]
            images_df['datetime'].iloc[row_idx]= datetime.strptime(exif_dt, NZ_EXIF_DT_FORMAT)
        else: 
            MD_DT_FORMAT = "%Y-%m-%d %H:%M:%S"
            images_df['datetime'].iloc[row_idx] = datetime.strptime(dt, MD_DT_FORMAT)
    toc = time.time()
    print(f"Time taken to retrieve all locations, datetime, and species for {len(location.keys())} images from metadata and extract EXIF data for {EXIF_extracted} images: {(toc-tic)} seconds")

Time taken to retrieve all locations, datetime, and species for 38 images from metadata and extract EXIF data for 0 images: 0.005779266357421875 seconds


In [None]:
if sort_all:
    # find the rank of each image sorted by location, species, and datetime
    loc_index_map = {val : idx for idx, val in enumerate(sorted(images_df['location']))}
    spec_index_map = {val : idx for idx, val in enumerate(sorted(images_df['species']))}
    dt_index_map = {val : idx for idx, val in enumerate(sorted(images_df['datetime']))}

    images_df['location_rank'] = images_df.apply(lambda x: loc_index_map[x.location], 1)
    images_df['species_rank'] = images_df.apply(lambda x: spec_index_map[x.species], 1)
    images_df['datetime_rank'] = images_df.apply(lambda x: dt_index_map[x.datetime], 1)

In [28]:
# Group json files by the image they belong to

# We'll use this to create symlinks to every file that goes with each image in
# a chunk.

image_file_base_to_json_files = defaultdict(list)

for json_file in tqdm(json_files):

    file_id = get_file_id(json_file)
    image_file_base_to_json_files[file_id].append(json_file)


100%|██████████| 152/152 [00:00<00:00, 398210.00it/s]


In [29]:
# Make sure every image has the right number of .json files

unlabeled_image_files = []

for image_file in tqdm(image_files_relative):    
    basename = get_file_id(image_file)
    json_files_this_image = image_file_base_to_json_files[basename]
    assert len(json_files_this_image) == 4
    if len(json_files_this_image) == 0:
        unlabeled_image_files.append(image_file)

100%|██████████| 38/38 [00:00<00:00, 226075.96it/s]


In [30]:
# Divide into chunks, create symlinks

chunks = split_list_into_fixed_size_chunks(image_files_relative,max_images_per_chunk)

print('Split images into {} chunks of {} images'.format(len(chunks),max_images_per_chunk))

chunk_folder_base = os.path.join(relabeling_folder_base,'symlinks-{}'.format(batch_name))
chunk_folders = []
error_files = []

for i_chunk,chunk in enumerate(chunks):
    
    print('Creating symlinks for chunk {} of {}'.format(i_chunk,len(chunks)))

    chunk_folder_abs = os.path.join(chunk_folder_base,'chunk_{}'.format(
        str(i_chunk).zfill(3)))
    os.makedirs(chunk_folder_abs,exist_ok=True)
    chunk_folders.append(chunk_folder_abs)
    
    # Find matching files
    relative_files_this_chunk = []
    
    for i_image,image_file in enumerate(chunk):
        
        # image_file_abs = os.path.join(training_images_resized_folder,image_file); open_file(image_file_abs)
        basename = get_file_id(image_file)
        json_files_this_image = image_file_base_to_json_files[basename]
        
        # These are typically images that failed to load
        if len(json_files_this_image) == 0:
            print('Warning: no .json files for {}'.format(image_file))
            error_files.append(image_file)
            continue
        
        assert len(json_files_this_image) > 0
        relative_files_this_chunk.append(image_file)
        
        for json_file in json_files_this_image:
            relative_files_this_chunk.append(json_file)          

    # Create symlinks
    for relative_file in tqdm(relative_files_this_chunk):
        source_file_abs = os.path.join(relabeling_folder_base,relative_file)
        assert os.path.isfile(source_file_abs)
        target_file_abs = os.path.join(chunk_folder_abs,relative_file)

        # sorting location, species, and datetime based on metadata, not relying on folder hierarchy
        if sort_all:
            file_id = get_file_id(relative_file)
            loc_order = images_df[images_df.uuid == file_id].location_rank.item()
            spec_order = images_df[images_df.uuid == file_id].species_rank.item()
            dt_order = images_df[images_df.uuid == file_id].datetime_rank.item()
            relative_file_pieces = relative_file.split("/")
            relative_file_base = "/".join(relative_file_pieces[:-1])
            relative_file_name = relative_file_pieces[-1]
            target_file_abs = f"{chunk_folder_abs}/loc-{loc_order}/spec-{spec_order}/dt-{dt_order}/{relative_file_name}"

        os.makedirs(os.path.dirname(target_file_abs),exist_ok=True)
        safe_create_link(source_file_abs,target_file_abs)

# ...for each chunk

error_file_list_file = os.path.join(chunk_folder_base,'error_images.json')
print('\nSaving list of {} error images to {}'.format(len(error_files),error_file_list_file))
with open(error_file_list_file,'w') as f:
    json.dump(error_files,f,indent=1)

Split images into 1 chunks of 5000 images
Creating symlinks for chunk 0 of 1


100%|██████████| 190/190 [00:00<00:00, 39606.27it/s]


Saving list of 0 error images to /home/garage/Documents/jin-summer24/Sentinel_Summer24/nz-trailcams-test/downsized_data/symlinks-nz-trailcams-acc-aiv/error_images.json





Create `labels.txt` to contain all unique classes

In [31]:
# save all unique labels to labels.txt
labels_path = f"{datapath}/labels.txt"
with open(labels_path, 'w') as f:
    for label in set(species.values()):
        f.write(f"{label}\n")

# 5. Label!
This section generates a command `cmd` to be copy and pasted into a terminal. Before running this command in the terminal, set up and activate a conda environment for labelme as follows: 
```
git clone https://github.com/agentmorris/labelme
cd labelme
conda create -n labelme python=3.11 pip -y
conda activate labelme
pip install -e .
```

In [32]:
# Label one chunk

# Specifically, generate the command to start labelme, pointed at this chunk, and copy that
# command to the clipboard.

i_chunk = 0
resume = True

chunk_folder_abs = os.path.join(chunk_folder_base,'chunk_{}'.format(
    str(i_chunk).zfill(3)))
assert os.path.isdir(chunk_folder_abs)

flags = ['ignore','empty']

flag_file = os.path.join(chunk_folder_abs,'flags.txt')
with open(flag_file,'w') as f:
    for flag in flags:        
        f.write(flag + '\n')

last_updated_file = os.path.join(chunk_folder_abs,'labelme_last_updated.txt')
cmd = f'python {labelme_path}/labelme {chunk_folder_abs} --labels {labels_path} --linewidth 12 --last_updated_file {last_updated_file} --flags {flag_file}'
if resume:
    cmd += ' --resume_from_last_update'

In [33]:
# # the following code causes an error in loading the Qt platform plugin, since it is seemingly incompatible with some
# # packages installed in the megadetector environment used to run this notebook
# os.system(cmd)

# instead, manually copy the following printed command and run it in the terminal 
# (after activating conda environment labelme)
print(cmd)

python /home/garage/Documents/jin-summer24/labelme/labelme /home/garage/Documents/jin-summer24/Sentinel_Summer24/nz-trailcams-test/downsized_data/symlinks-nz-trailcams-acc-aiv/chunk_000 --labels /home/garage/Documents/jin-summer24/Sentinel_Summer24/nz-trailcams-test/downsized_data/labels.txt --linewidth 12 --last_updated_file /home/garage/Documents/jin-summer24/Sentinel_Summer24/nz-trailcams-test/downsized_data/symlinks-nz-trailcams-acc-aiv/chunk_000/labelme_last_updated.txt --flags /home/garage/Documents/jin-summer24/Sentinel_Summer24/nz-trailcams-test/downsized_data/symlinks-nz-trailcams-acc-aiv/chunk_000/flags.txt --resume_from_last_update


# 6. Labelme to MD: convert updated annotations into MD format 

In [34]:
# to check for updates from labelme
ppresults = process_batch_results(options)
html_output_file = ppresults.output_html_file
path_utils.open_file(html_output_file,attempt_to_open_in_wsl_host=True,browser_name='chrome')

Loading results from /home/garage/Documents/jin-summer24/Sentinel_Summer24/nz-trailcams-test/postprocessing/nz-trailcams-aac-aiv/nz-trailcams-aac-aiv-2024-jun-07-v5a.0.0/combined_api_outputs/nz-trailcams-aac-aiv-2024-jun-07-v5a.0.0_detections.filtered_rde_0.100_0.850_15_0.200.json
Converting results to dataframe
Finished loading MegaDetector results for 38 images from /home/garage/Documents/jin-summer24/Sentinel_Summer24/nz-trailcams-test/postprocessing/nz-trailcams-aac-aiv/nz-trailcams-aac-aiv-2024-jun-07-v5a.0.0/combined_api_outputs/nz-trailcams-aac-aiv-2024-jun-07-v5a.0.0_detections.filtered_rde_0.100_0.850_15_0.200.json


100%|██████████| 38/38 [00:00<00:00, 12762.94it/s]

Finished loading and preprocessing 38 rows from detector output, predicted 36 positives.
...and 0 almost-positives





Rendering images with 30 processes


100%|██████████| 38/38 [00:00<00:00, 253.52it/s]

Rendered 38 images (of 38) in 1.57 seconds (0.04 seconds per image)
Finished writing html to /home/garage/Documents/jin-summer24/Sentinel_Summer24/nz-trailcams-test/postprocessing/nz-trailcams-aac-aiv/nz-trailcams-aac-aiv-2024-jun-07-v5a.0.0/preview/nz-trailcams-aac-aiv-2024-jun-07-v5a.0.0_rde_0.100_0.850_15_0.200_0.200/index.html



