# Preprocessing LILA datasets

Use this notebook to pre-process LILA datasets. The pipeline consists of the following steps:

0. ***Manual inputs***: paths to the metadata (needed for step 2), the data (needed for steps 2 to 5), and the LILA species mapping (needed for step 8)
1. Set up
2. Extract metadata from COCO file
3. Downsize images
4. Run MegaDetector and repeat detection elimination (RDE)
5. ***Manual RDE***
6. Post-RDE processing
7. Converting filtered bounding boxes into database-compatible format
8. Preparing detections for data ingestion
9. Data upload

***Steps 0 and 5 are the only ones requiring manual actions.***

To run this notebook, set up the following environment:
```
conda create -n megadetector python=3.11 pip -y
conda activate megadetector
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip3 install megadetector
pip3 install pillow
# uuid ships with the Python standard library, so it needs no pip install
pip3 install sqlalchemy
```

## 0. Manual inputs
Enter the paths to the metadata file, the image data folder, and the LILA species mapping file.

In [11]:
metadata_path = '/home/garage/Documents/jin-summer24/Sentinel_Summer24/data/nz_data/trail_camera_images_of_new_zealand_animals_1.00.json'
data_path = '/home/garage/Documents/jin-summer24/Sentinel_Summer24/data/nz_data'
species_mapping_path = '/home/garage/Documents/jin-summer24/Sentinel_Summer24/data/nz_data/lila-taxonomy-mapping_release.csv'
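
Since every later step depends on these three paths, it can help to check them before running anything else. The following is a minimal sketch, not part of the original pipeline; `find_missing_inputs` is a hypothetical helper name.

```python
from pathlib import Path

def find_missing_inputs(paths):
    """Return the names of inputs whose path does not exist on disk."""
    return [name for name, p in paths.items() if not Path(p).exists()]

# Hypothetical usage with the three variables defined above:
# missing = find_missing_inputs({
#     "metadata_path": metadata_path,
#     "data_path": data_path,
#     "species_mapping_path": species_mapping_path,
# })
# if missing:
#     raise FileNotFoundError(f"Missing inputs: {missing}")
```

Failing here is cheaper than discovering a wrong path after the detector has already run.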

## 1. Set up
Import packages

In [12]:
from glob import glob 
import json
import os
import pandas as pd
import time
from datetime import datetime
import PIL.Image
from megadetector.visualization.visualization_utils import resize_image_folder 
from pipeline_utils import mdv5a_and_rde, post_rde, upload_db, read_db
from tqdm import tqdm
import uuid

## 2. Extract metadata from COCO file (requires `metadata_path` and `data_path`)

Extract the relevant metadata for each image: file name, species, location, and datetime. Images are matched to the files under `data_path` by their file name.

In [13]:
# load metadata (based on the format from NZ trail cams dataset)
with open(metadata_path) as json_file:
    metadata = json.load(json_file)
metadata_df = pd.DataFrame(metadata['images'])[['file_name','species','location','datetime']]
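
As a rough illustration of the structure the cell above expects, here is a sketch with a toy JSON document. The records and values are made up; only the four field names come from the cell above. Note that carrying `species` and `location` on each image entry is specific to this dataset's file, as the comment above says.

```python
import json

# Toy COCO-style document; the field values are made up for illustration.
coco_text = json.dumps({
    "images": [
        {"id": "a1", "file_name": "ACC/banded_rail/a1.JPG", "species": "banded_rail",
         "location": "ACC_T006", "datetime": "2023-06-01 09:16:50", "width": 4000},
        {"id": "b2", "file_name": "ACC/morepork/b2.JPG", "species": "morepork",
         "location": "ACC_T019", "datetime": None, "width": 4000},
    ]
})

WANTED = ("file_name", "species", "location", "datetime")

def extract_image_records(coco):
    """Keep only the wanted fields of each entry in coco['images']."""
    return [{key: image.get(key) for key in WANTED} for image in coco["images"]]

records = extract_image_records(json.loads(coco_text))
print(records[0]["species"])  # banded_rail
```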

In [14]:
# gather all image file paths under data_path and keep only their base names for matching
image_extensions = [
    '**/*.png', '**/*.PNG', '**/*.jpg', '**/*.JPG', '**/*.jpeg', '**/*.JPEG',
    '**/*.gif', '**/*.GIF', '**/*.bmp', '**/*.BMP', '**/*.tiff', '**/*.TIFF',
    '**/*.webp', '**/*.WEBP'
]
image_files = []
for ext in image_extensions:
    image_files.extend(glob(f"{data_path}/{ext}", recursive=True))
image_files = set(map(os.path.basename, image_files))

In [15]:
# filter metadata to only contain images we care about
%time filenames = metadata_df['file_name'].str.split('/').str[-1]
include_image = filenames.isin(image_files)
metadata_df = metadata_df.loc[include_image].reset_index(drop=True)

CPU times: user 5.14 s, sys: 215 ms, total: 5.36 s
Wall time: 5.37 s
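
The filter above matches on the bare file name, so two images that share a name in different folders cannot be told apart. A small sketch of a check for that case follows; `duplicated_basenames` is a hypothetical helper and the paths are toy values.

```python
import posixpath
from collections import Counter

def duplicated_basenames(relative_paths):
    """Return basenames that occur under more than one path."""
    counts = Counter(posixpath.basename(p) for p in relative_paths)
    return sorted(name for name, n in counts.items() if n > 1)

paths = ["ACC/banded_rail/img1.JPG", "ACC/morepork/img1.JPG", "ACC/morepork/img2.JPG"]
print(duplicated_basenames(paths))  # ['img1.JPG']
```

An empty result means matching on base names is unambiguous for the dataset at hand.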


In [16]:
# convert datetime strings into datetime objects, extracting the value from EXIF data where it is missing
EXIF_extracted = 0
tic = time.time()
for row_idx, dt in enumerate(metadata_df['datetime']):
    if pd.isna(dt): # If datetime is not available (None or NaN), extract it from EXIF data.
        EXIF_extracted += 1 
        DT_TAG = 306 # tag number for DateTime in exif object 
        NZ_EXIF_DT_FORMAT = "%Y:%m:%d %H:%M:%S"
        exif_dt = PIL.Image.open(f'{data_path}/{metadata_df["file_name"][row_idx]}')._getexif()[DT_TAG]
        metadata_df.loc[row_idx, 'datetime'] = datetime.strptime(exif_dt, NZ_EXIF_DT_FORMAT)
    else: 
        MD_DT_FORMAT = "%Y-%m-%d %H:%M:%S"
        metadata_df.loc[row_idx, 'datetime'] = datetime.strptime(dt, MD_DT_FORMAT)
toc = time.time()
print(f"Time taken to retrieve all locations, datetime, and species for {len(metadata_df)} images from metadata and extract EXIF data for {EXIF_extracted} images: {(toc-tic)} seconds")

Time taken to retrieve all locations, datetime, and species for 38 images from metadata and extract EXIF data for 0 images: 0.002527475357055664 seconds
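
The two timestamp formats handled by the loop above can be isolated in a small helper. This is only a sketch: `parse_capture_time` is a hypothetical name, and the EXIF string is passed in directly rather than read from an image file.

```python
from datetime import datetime

METADATA_FORMAT = "%Y-%m-%d %H:%M:%S"  # format used in the metadata JSON
EXIF_FORMAT = "%Y:%m:%d %H:%M:%S"      # format used by the EXIF DateTime tag

def parse_capture_time(metadata_value, exif_value=None):
    """Parse the metadata timestamp, falling back to the EXIF string if it is missing."""
    if metadata_value is not None:
        return datetime.strptime(metadata_value, METADATA_FORMAT)
    if exif_value is not None:
        return datetime.strptime(exif_value, EXIF_FORMAT)
    return None

print(parse_capture_time("2023-06-01 09:16:50"))        # 2023-06-01 09:16:50
print(parse_capture_time(None, "2023:05:15 13:12:07"))  # 2023-05-15 13:12:07
```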


In [17]:
# check metadata
metadata_df.head(2)

| | file_name | species | location | datetime |
|---|---|---|---|---|
| 0 | ACC/banded_rail/0067A52A-FB22-4CB4-B54A-1894E7... | banded_rail | ACC_T006 | 2023-06-01 09:16:50 |
| 1 | ACC/banded_rail/0331C7C7-21BA-4198-8811-248F84... | banded_rail | ACC_T006 | 2023-05-15 13:12:07 |


## 3. Downsize images (requires `data_path`)
Downsize images to at most 1600 px wide (on the assumption that most camera trap images are wider than they are tall) so that labelme responds faster further down the pipeline.

In [18]:
downsized_data_path = f"{os.getcwd()}/downsized_data" # relative from work directory 
if not os.path.exists(downsized_data_path):
    os.mkdir(downsized_data_path)
    # resize a folder of images to a new folder on multiple threads/processes.
    %time _ = resize_image_folder(input_folder=data_path, output_folder=downsized_data_path, target_width=1600, target_height=-1, no_enlarge_width=True, verbose=False)
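
To make the effect of `target_width=1600`, `target_height=-1`, and `no_enlarge_width=True` concrete, here is a sketch of the size arithmetic as I understand those settings: the height follows the aspect ratio, and images already narrower than the target are left alone. `downsized_size` is a hypothetical helper, and the library's exact rounding may differ.

```python
def downsized_size(width, height, target_width=1600):
    """Return the (width, height) after shrinking to target_width, keeping aspect ratio.

    Images already at or below target_width are left unchanged.
    """
    if width <= target_width:
        return width, height
    scale = target_width / width
    return target_width, round(height * scale)

print(downsized_size(4000, 3000))  # (1600, 1200)
print(downsized_size(1280, 720))   # (1280, 720)
```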

## 4. Run MegaDetector and RDE
The following cell runs MDv5a detection on the dataset at `downsized_data_path` and launches a browser window displaying the images and their detections. It then starts repeat detection elimination (RDE) by finding suspicious repeated detections.

***IMPORTANT: Finally, it opens a folder of images showing the suspicious repeated detections. Go through every image file in the folder and delete the ones that show valid detections. Then close the file explorer and continue with Section 6 (post-RDE processing). Any file left in the folder is treated as a suspicious detection, and all associated detections are filtered out before further data processing.***

In [19]:
outputs = mdv5a_and_rde(input_path = downsized_data_path, 
                        job_name = 'nz-test', 
                        job_date = '27jun2024',
                        postprocessing_base = 'nz_postprocessing')

No speed estimate available for NVIDIA GeForce RTX 2070 SUPER
Loaded 38 image filenames from .json list file nz_postprocessing/nz-test/nz-test-27jun2024-v5a.0.0/chunk000.json
Downloading model MDV5A
Bypassing download of already-downloaded file md_v5a.0.0.pt
PyTorch reports 1 available CUDA devices
GPU available: True


Fusing layers... 
Fusing layers... 
Model summary: 733 layers, 140054656 parameters, 0 gradients, 208.8 GFLOPs
Model summary: 733 layers, 140054656 parameters, 0 gradients, 208.8 GFLOPs


Sending model to GPU
Loaded model in 0.42 seconds
Loaded model in 0.42 seconds


100%|██████████| 38/38 [00:03<00:00, 10.65it/s]


Output file saved at nz_postprocessing/nz-test/nz-test-27jun2024-v5a.0.0/chunk000_results.json


100%|██████████| 1/1 [00:00<00:00, 15709.00it/s]
100%|██████████| 1/1 [00:00<00:00, 1983.12it/s]
100%|██████████| 1/1 [00:00<00:00, 17549.39it/s]


Loading results from nz_postprocessing/nz-test/nz-test-27jun2024-v5a.0.0/combined_api_outputs/nz-test-27jun2024-v5a.0.0_detections.json
Converting results to dataframe
Finished loading MegaDetector results for 38 images from nz_postprocessing/nz-test/nz-test-27jun2024-v5a.0.0/combined_api_outputs/nz-test-27jun2024-v5a.0.0_detections.json


100%|██████████| 38/38 [00:00<00:00, 16434.68it/s]

Finished loading and preprocessing 38 rows from detector output, predicted 36 positives.
...and 0 almost-positives





Rendering images with 30 processes


100%|██████████| 38/38 [00:00<00:00, 262.80it/s]

Rendered 38 images (of 38) in 2.15 seconds (0.06 seconds per image)
Finished writing html to nz_postprocessing/nz-test/nz-test-27jun2024-v5a.0.0/preview/nz-test-27jun2024-v5a.0.0_0.200/index.html





Loading results from nz_postprocessing/nz-test/nz-test-27jun2024-v5a.0.0/combined_api_outputs/nz-test-27jun2024-v5a.0.0_detections.json
Converting results to dataframe
Finished loading MegaDetector results for 38 images from nz_postprocessing/nz-test/nz-test-27jun2024-v5a.0.0/combined_api_outputs/nz-test-27jun2024-v5a.0.0_detections.json
Separating images into locations...


100%|██████████| 38/38 [00:00<00:00, 12907.64it/s]

Custom dir name function made 0 replacements (of 38 images)
Finished separating 38 files into 2 locations
Finding similar detections...
Pool of 30 requested, but only 2 folders available, reducing pool to 2





Starting comparison pool with 2 processes


2it [00:00, 374.47it/s]



Finished looking for similar detections
Marking repeat detections...
Found 0 suspicious detections in directory 0 (ACC/banded_rail)
Found 0 suspicious detections in directory 1 (ACC/morepork)
Finished marking repeat detections
Found 0 unique detections on 0 images that are suspicious
Updating output table
Finished updating detection table
Changed 0 detections that impacted 0 maxPs (0 to negative) (0 across confidence threshold)
Creating filtering folder...


100%|██████████| 2/2 [00:00<00:00, 47127.01it/s]
100%|██████████| 2/2 [00:00<00:00, 29127.11it/s]


Starting rendering pool with 30 processes


0it [00:00, ?it/s]

Done





## 5. Manual RDE
After running the cell above, complete the manual RDE step before running the code in Section 6 (post-RDE processing).

***IMPORTANT: The cell in Section 4 should have opened a folder of images showing the suspicious repeated detections. Go through every image file in the folder and delete the ones that show valid detections. Then close the file explorer and continue with Section 6 (post-RDE processing). Any file left in the folder is treated as a suspicious detection, and all associated detections are filtered out before further data processing.***

In [20]:
# generate an error to stop the code flow here just in case the user selects Run All

# if you are here and not sure what to do: 
# 1. If you have completed the manual RDE step of deleting valid detections from the folder launched in
#    section 4, go ahead and run section 6 and beyond. 
# 2. If you have not, read the instructions above for Manual RDE and delete the valid detections
#    accordingly. Then, go ahead and run section 6 and beyond.

break

SyntaxError: 'break' outside loop (2669011907.py, line 9)
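
The bare `break` does stop a Run All, but the resulting `SyntaxError` says nothing about what to do next. One possible alternative is an explicit guard that raises a descriptive error. This is a sketch only; `MANUAL_RDE_DONE` and `require_manual_rde` are hypothetical names, not part of the pipeline.

```python
MANUAL_RDE_DONE = False  # hypothetical flag; set to True after cleaning the filtering folder

def require_manual_rde(done):
    """Raise a descriptive error unless the manual RDE step has been confirmed."""
    if not done:
        raise RuntimeError(
            "Manual RDE not confirmed: delete the valid detections from the "
            "filtering folder, then set MANUAL_RDE_DONE = True and re-run."
        )

# In the notebook, calling require_manual_rde(MANUAL_RDE_DONE) in this cell
# would halt Run All until the flag is flipped.
```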

## 6. Post-RDE processing 
The following cell applies the outcome of the manual RDE step. It removes every detection whose image was left in the filtering folder, writes the filtered results to a new `.json` file, and renders an HTML preview of the remaining detections.

In [21]:
combined_api_output_file, rde_string, suspicious_detection_results, default_workers_for_parallel_tasks, parallelization_defaults_to_threads, postprocessing_output_folder, base_task_name = outputs
final_md = post_rde(downsized_data_path, combined_api_output_file, rde_string, suspicious_detection_results, 
                    default_workers_for_parallel_tasks, parallelization_defaults_to_threads, 
                    postprocessing_output_folder, base_task_name)

Bypassing detection-finding, loading from nz_postprocessing/nz-test/nz-test-27jun2024-v5a.0.0/rde_0.100_0.850_15_0.200_task_0/filtering_2024.07.01.15.26.23/detectionIndex.json
Loading results from nz_postprocessing/nz-test/nz-test-27jun2024-v5a.0.0/combined_api_outputs/nz-test-27jun2024-v5a.0.0_detections.json
Converting results to dataframe
Finished loading MegaDetector results for 38 images from nz_postprocessing/nz-test/nz-test-27jun2024-v5a.0.0/combined_api_outputs/nz-test-27jun2024-v5a.0.0_detections.json
Separating images into locations...


100%|██████████| 38/38 [00:00<00:00, 12369.70it/s]


Custom dir name function made 0 replacements (of 38 images)
Finished separating 38 files into 2 locations
Removed 0 of 0 total detections via manual filtering
Updating output table
Writing detection results to nz_postprocessing/nz-test/nz-test-27jun2024-v5a.0.0/combined_api_outputs/nz-test-27jun2024-v5a.0.0_detections.filtered_rde_0.100_0.850_15_0.200.json
Finished writing detection results to nz_postprocessing/nz-test/nz-test-27jun2024-v5a.0.0/combined_api_outputs/nz-test-27jun2024-v5a.0.0_detections.filtered_rde_0.100_0.850_15_0.200.json
Finished updating detection table
Changed 0 detections that impacted 0 maxPs (0 to negative) (0 across confidence threshold)
Loading results from nz_postprocessing/nz-test/nz-test-27jun2024-v5a.0.0/combined_api_outputs/nz-test-27jun2024-v5a.0.0_detections.filtered_rde_0.100_0.850_15_0.200.json
Converting results to dataframe
Finished loading MegaDetector results for 38 images from nz_postprocessing/nz-test/nz-test-27jun2024-v5a.0.0/combined_api_outpu

100%|██████████| 38/38 [00:00<00:00, 23735.45it/s]

Finished loading and preprocessing 38 rows from detector output, predicted 36 positives.
...and 0 almost-positives





Rendering images with 30 processes


100%|██████████| 38/38 [00:00<00:00, 318.29it/s]

Rendered 38 images (of 38) in 2.14 seconds (0.06 seconds per image)
Finished writing html to nz_postprocessing/nz-test/nz-test-27jun2024-v5a.0.0/preview/nz-test-27jun2024-v5a.0.0_rde_0.100_0.850_15_0.200_0.200/index.html





## 7. Converting filtered bounding boxes into database-compatible format

In [22]:
# load detections
with open(final_md, 'r') as f:
    md_data = json.load(f)
    print(f"md_data.keys(): {md_data.keys()}")

# extract the bbox detections for each image and convert them into our database format
# MegaDetector outputs relative coordinates, so the downsizing does not affect the boxes
md_df = pd.DataFrame(md_data['images'])
fn, xmin, ymin, xmax, ymax, confidence = [], [], [], [], [], []

# set up progress bar counter
pbar = tqdm(total=len(md_df))
for i in range(len(md_df)):
    detections = md_df['detections'][i]
    try:
        for detection in detections:
            fn.append(md_df['file'][i])
            if detection['bbox'] is None:
                xmin.append(None)
                ymin.append(None)
                xmax.append(None)
                ymax.append(None)
                confidence.append(None)
            else:
                xmin.append(detection['bbox'][0])
                ymin.append(detection['bbox'][1])
                xmax.append(detection['bbox'][0]+detection['bbox'][2])
                ymax.append(detection['bbox'][1]+detection['bbox'][3])
                confidence.append(detection['conf'])  
    except Exception as e:
        print(e)

    pbar.update(1)

pbar.close()

md_df = pd.DataFrame({'file_name': fn, 'voc_xmin': xmin, 'voc_ymin': ymin, 'voc_xmax': xmax, 'voc_ymax': ymax, 'confidence': confidence})
md_df['file_name'] = md_df['file_name'].str.replace('\\', '/', regex=False)  # literal backslash, not a regex
md_df['image_id'] = md_df['file_name'].str.split('/').str[-1].str.split('.').str[0]

# generate unique id for each bbox detection
%time uuids = [str(uuid.uuid4()) for _ in range(len(md_df))]
md_df['bb_id'] = uuids

md_data.keys(): dict_keys(['info', 'detection_categories', 'images'])


100%|██████████| 38/38 [00:00<00:00, 31711.81it/s]

CPU times: user 406 μs, sys: 193 μs, total: 599 μs
Wall time: 407 μs
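
The core of the conversion above is the change from MegaDetector's `[x_min, y_min, width, height]` boxes to corner coordinates. A minimal sketch with a toy box follows; `md_bbox_to_corners` is a hypothetical helper name.

```python
def md_bbox_to_corners(bbox):
    """Convert [x_min, y_min, width, height] to (x_min, y_min, x_max, y_max)."""
    x_min, y_min, width, height = bbox
    return x_min, y_min, x_min + width, y_min + height

# Toy box: starts at (0.25, 0.5) and spans a quarter of the image in each direction.
print(md_bbox_to_corners([0.25, 0.5, 0.25, 0.25]))  # (0.25, 0.5, 0.5, 0.75)
```

Because all values are fractions of the image size, the same corners apply to the original and the downsized image.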





In [23]:
# check converted data 
print(f"There are {len(set(md_df['image_id']))} unique images and {len(md_df['image_id'])} bounding boxes detected.")
md_df.head(2)

There are 38 unique images and 42 bounding boxes detected.


| | file_name | voc_xmin | voc_ymin | voc_xmax | voc_ymax | confidence | image_id | bb_id |
|---|---|---|---|---|---|---|---|---|
| 0 | ACC/banded_rail/0067A52A-FB22-4CB4-B54A-1894E7... | 0.0 | 0.084 | 0.052 | 0.248 | 0.015 | 0067A52A-FB22-4CB4-B54A-1894E7F2B1A5 | e270ad1b-dab7-4475-8cbe-6e5582348434 |
| 1 | ACC/banded_rail/0067A52A-FB22-4CB4-B54A-1894E7... | 0.742 | 0.318 | 0.966 | 0.569 | 0.851 | 0067A52A-FB22-4CB4-B54A-1894E7F2B1A5 | f63c8d8f-b1dd-4c1b-9e31-a0368bbce1fe |


## 8. Preparing detections for data ingestion
- adding metadata 
- species mapping 
- formatting column names 

In [24]:
# merge megadetector data with metadata 
df = pd.merge(md_df, metadata_df, on='file_name', how='outer')
df = df.rename(columns={'location': 'location_id','species': 'original_label', 'confidence': 'bb_confidence'})
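
An outer merge keeps rows that exist on only one side and fills the other side with missing values, so it is worth knowing which file names fail to match. The sketch below uses plain sets and toy paths; `unmatched_keys` is a hypothetical helper.

```python
def unmatched_keys(detection_files, metadata_files):
    """Return (only_in_detections, only_in_metadata) as sorted lists."""
    det, meta = set(detection_files), set(metadata_files)
    return sorted(det - meta), sorted(meta - det)

only_det, only_meta = unmatched_keys(
    ["ACC/banded_rail/a.JPG", "ACC/morepork/b.JPG"],
    ["ACC/morepork/b.JPG", "ACC/morepork/c.JPG"],
)
print(only_det)   # ['ACC/banded_rail/a.JPG']
print(only_meta)  # ['ACC/morepork/c.JPG']
```

In the notebook, the two inputs would be `md_df['file_name']` and `metadata_df['file_name']`. Two empty lists mean every detection has metadata and every image has at least one detection row.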

In [25]:
# species mapping 

# load data
taxa_df = pd.read_csv(species_mapping_path)
length_before_merge = len(df)

# extract relevant data 
# set(taxa_df.dataset_name) # run this to check for dataset name 
taxa_df = taxa_df[taxa_df['dataset_name'] ==  'Trail Camera Images of New Zealand Animals']
taxa_df = taxa_df[['query','kingdom','phylum','class','order','family','genus','species','subspecies']]
taxa_df = taxa_df.rename(columns={'query':'original_label'})  # reassign rather than renaming a filtered slice in place

# merge taxa info before saving
df = pd.merge(df, taxa_df, on='original_label', how='inner')
df.reset_index(drop=True, inplace=True)

# check merging validity
print(f"Number of rows before and after merging: {length_before_merge} -> {len(df)}")
df.head(2)

Number of rows before and after merging: 42 -> 42


| | file_name | voc_xmin | voc_ymin | voc_xmax | voc_ymax | ... | order | family | genus | species | subspecies |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACC/banded_rail/0067A52A-FB22-4CB4-B54A-1894E7... | 0.0 | 0.084 | 0.052 | 0.248 | ... | gruiformes | rallidae | gallirallus | gallirallus philippensis | |
| 1 | ACC/banded_rail/0067A52A-FB22-4CB4-B54A-1894E7... | 0.742 | 0.318 | 0.966 | 0.569 | ... | gruiformes | rallidae | gallirallus | gallirallus philippensis | |


In [26]:
# fill in other metadata 

# new columns in images_v4 for tracking data reliability 
# df['species_confirmed'] = False
# df['bb_confirmed'] = False
# df['wrong_species'] = None

# filter out low confidence predictions and keep track of threshold used 
df['detector_threshold'] = 0.4
df = df[df['bb_confidence'] > df['detector_threshold']].reset_index(drop=True)

# fill in values
df['cloud_path'] = df['file_name']
df['gcp_path'] = 'gs://public-datasets-lila/nz-trailcams/' + df['file_name']
df['last_updated'] = int(time.time())
df['country_code'] = 'NZ' # based on database query
df['host_location'] = 'GCP Public'
df['camera_trap'] = 1 # whether data is from camera traps 
df['data_type'] = 'RGB'
df['detector_algorithm'] = md_data['info']['detector'] # not yet standardized in images_v3
df['bb_confirmed'] = False

# missing variables based on dataset columns that exist in images_v3 
df['seq_id'] = None
df['frame_num'] = None
df['sex'] = None
df['lifeStage'] = None
df['behavior'] = None
df['feature'] = None
df['color'] = None
df['individual_id'] = None
df['error_status'] = None
df['image_signature'] = None
df['loc_index'] = None
df['rights_holder'] = None
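
The confidence filter above uses a strict `>` comparison, so a box at exactly the threshold is dropped, and so is any row without a confidence value, because a comparison with a missing value is false. A plain-Python sketch of that rule follows; `passes_threshold` is a hypothetical helper.

```python
import math

def passes_threshold(confidence, threshold=0.4):
    """True only for real confidences strictly above the threshold."""
    if confidence is None or math.isnan(confidence):
        return False
    return confidence > threshold

print([passes_threshold(c) for c in (0.851, 0.4, 0.015, float("nan"), None)])
# [True, False, False, False, False]
```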

In [27]:
# check data
print(df.keys())
df.head(2)

Index(['file_name', 'voc_xmin', 'voc_ymin', 'voc_xmax', 'voc_ymax',
       'bb_confidence', 'image_id', 'bb_id', 'original_label', 'location_id',
       'datetime', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus',
       'species', 'subspecies', 'detector_threshold', 'cloud_path', 'gcp_path',
       'last_updated', 'country_code', 'host_location', 'camera_trap',
       'data_type', 'detector_algorithm', 'bb_confirmed', 'seq_id',
       'frame_num', 'sex', 'lifeStage', 'behavior', 'feature', 'color',
       'individual_id', 'error_status', 'image_signature', 'loc_index',
       'rights_holder'],
      dtype='object')


| | file_name | voc_xmin | voc_ymin | voc_xmax | voc_ymax | ... | individual_id | error_status | image_signature | loc_index | rights_holder |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACC/banded_rail/0067A52A-FB22-4CB4-B54A-1894E7... | 0.742 | 0.318 | 0.966 | 0.569 | ... | | | | | |
| 1 | ACC/banded_rail/0331C7C7-21BA-4198-8811-248F84... | 0.359 | 0.404 | 0.51 | 0.837 | ... | | | | | |


## 9. Data upload

In [28]:
# upload data 
upload_db(df, 'nz-trailcams-test', 'images_v3')

100%|██████████| 1/1 [00:00<00:00,  2.43it/s]


In [29]:
# check uploaded data
uploaded_df = read_db('nz-trailcams-test', 'images_v3')
print(uploaded_df.keys())
uploaded_df.head(2)

Engine Created
Index(['index', 'bb_id', 'image_id', 'dataset', 'data_type', 'host_location',
       'file_name', 'gcp_path', 'seq_id', 'frame_num', 'camera_trap',
       'original_label', 'kingdom', 'class', 'phylum', 'order', 'family',
       'genus', 'species', 'subspecies', 'sex', 'lifeStage', 'behavior',
       'feature', 'datetime', 'country_code', 'voc_xmin', 'voc_ymin',
       'voc_xmax', 'voc_ymax', 'bb_confirmed', 'bb_confidence',
       'rights_holder', 'color', 'individual_id', 'last_updated',
       'error_status', 'detector_threshold', 'detector_algorithm',
       'image_signature', 'cloud_path', 'loc_index', 'location_id'],
      dtype='object')


| | index | bb_id | image_id | dataset | data_type | ... | detector_algorithm | image_signature | cloud_path | loc_index | location_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | 21a6ea01-abf4-4dda-bf68-5db776268872 | D1618D4F-E1B7-419C-80A6-8E85FB61B9F6 | nz-trailcams-test | RGB | ... | MDv5a.0.0 | | ACC/banded_rail/D1618D4F-E1B7-419C-80A6-8E85FB... | | ACC_T19 |
| 1 | | 9a0cd58f-64bc-4d4a-82ca-cc6d8a9696d4 | 0D68ED90-ADBD-4DD2-92F1-5E0E3084FF39 | nz-trailcams-test | RGB | ... | MDv5a.0.0 | | ACC/banded_rail/0D68ED90-ADBD-4DD2-92F1-5E0E30... | | ACC_T25 |
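
The internals of `upload_db` and `read_db` are not shown in this notebook, so the following is only a stand-in that illustrates the upload-then-verify pattern with an in-memory SQLite database. The table name `images_demo` and the row values are made up; only the column names come from the dataframe above.

```python
import sqlite3

rows = [  # toy rows; column names follow the dataframe above, values are made up
    {"bb_id": "id-1", "file_name": "ACC/morepork/b.JPG", "bb_confidence": 0.91},
    {"bb_id": "id-2", "file_name": "ACC/morepork/c.JPG", "bb_confidence": 0.55},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images_demo (bb_id TEXT PRIMARY KEY, file_name TEXT, bb_confidence REAL)"
)
conn.executemany(
    "INSERT INTO images_demo VALUES (:bb_id, :file_name, :bb_confidence)", rows
)
conn.commit()

# Read back and confirm that every uploaded bb_id arrived exactly once.
stored = [r[0] for r in conn.execute("SELECT bb_id FROM images_demo ORDER BY bb_id")]
print(stored == sorted(r["bb_id"] for r in rows))  # True
```

Comparing the stored `bb_id` values against the local dataframe is a cheap way to confirm that an upload was complete.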
