## ReIDHub

Name: Victor Ruto

[![View Documentation](https://img.shields.io/badge/View%20Project%20Documentation-📖-blue?style=for-the-badge)](https://vickruto.github.io/reidhub "reidhub docs (work in progress)")

This work is still in progress and actively being developed. There's probably a newer version of this notebook with new updates. Check it out on Colab  [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vickruto/reidhub/blob/main/notebooks/2025-09-22-reidhub-checkpoint.ipynb "Open this notebook in Colab")


## **Abstract**

Animal re-identification (ReID) is a critical task in wildlife monitoring and conservation research, yet workflows for accessing, processing, and analyzing open-source ReID datasets remain fragmented. This project introduces **`ReIDHub`**, a framework structured around the Access–Assess–Address paradigm to streamline dataset-driven research in animal ReID. The framework enables users to:  
 (i) **Access** open-source ReID datasets in standardized formats and visualize them through tools like FiftyOne;  
 (ii) **Assess** datasets by computing summary statistics, flagging potential image quality issues such as overexposure or over-illumination, and producing reusable artifacts such as foundation model embeddings and open-source model predictions, while caching results; and  
 (iii) **Address** scientific and practical needs by evaluating the datasets on re-identification benchmark models such as `MegaDescriptor` and traditional computer vision approaches such as `SIFT`.
  
In addition, ReIDHub centralizes dataset-related metadata, including original publications and subsequent research outputs, supported by easily accessible and clear documentation built with MkDocs. By making datasets and analyses more portable, reproducible, and extensible, ReIDHub lowers barriers to entry for animal ReID research and accelerates collaborative conservation science.


## Installs and Setup

In [5]:
!uv pip install fiftyone

Using Python 3.12.11 environment at: /usr
Audited 1 package in 103ms


In [6]:
import os
import sys
from pathlib import Path
import numpy as np
import pandas as pd

In [7]:
## Load reidhub package
checkpoint_commit_hash = "5deed430e8c9b6ab762c40245808e5da695db1d5"

try:
    import reidhub  # For local development, the package is already installed using `poetry install`
    print('Loaded pre-installed package')

except ImportError:
    !git clone https://github.com/vickruto/reidhub.git
    !cd reidhub && git checkout {checkpoint_commit_hash}
    full_path = os.path.abspath('reidhub')
    sys.path.insert(0, full_path)
    print(full_path)
    import reidhub
    print('\n\nSuccessfully cloned and loaded package from github')

Loaded pre-installed package


In [8]:
help(reidhub)

Help on package reidhub:

NAME
    reidhub

PACKAGE CONTENTS
    access (package)
    address (package)
    assess (package)
    config
    tests (package)

FILE
    /content/reidhub/reidhub/__init__.py




## Access

### Loading a Raw Dataset

Animal reidentification datasets come in different formats depending on many factors.

The reidhub package provides utilities to download and load each of the datasets that have already been added to the package.

We are going to demonstrate this with an example dataset that is already added to the package.
We will be using the [Great Zebra and Giraffe Count Dataset](https://lila.science/datasets/great-zebra-giraffe-id)

ReIDHub caches datasets that are already downloaded.
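
The caching idea can be pictured with a minimal sketch. This is not ReIDHub's actual implementation: `cached_dataset_root`, the `fetch` callback, and the `.complete` marker file are hypothetical names introduced only for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def cached_dataset_root(
    dataset_id: str,
    fetch: Callable[[Path], None],
    cache_root: Path,
) -> Path:
    """Return the dataset directory, calling `fetch` only on a cache miss."""
    target = cache_root / dataset_id
    marker = target / ".complete"  # hypothetical marker written after a successful fetch
    if marker.exists():
        return target  # cache hit: skip the download entirely
    target.mkdir(parents=True, exist_ok=True)
    fetch(target)  # download and extract into `target`
    marker.touch()
    return target


# Demo with a fake fetcher that records how often it is called.
calls: List[Path] = []


def fake_fetch(target: Path) -> None:
    calls.append(target)
    (target / "images").mkdir(exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    first = cached_dataset_root("gzgc", fake_fetch, Path(tmp))
    second = cached_dataset_root("gzgc", fake_fetch, Path(tmp))
    fetch_count = len(calls)
    same_path = first == second
```

Writing the marker only after `fetch` returns means an interrupted download is retried on the next call instead of being mistaken for a complete dataset.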

In [9]:
from reidhub.access.provenance.gzgc import download_and_extract
help(download_and_extract)

Help on function download_and_extract in module reidhub.access.provenance.gzgc:

download_and_extract() -> str
    downloads the gzgc from the gcp bucket url provided by lila datasets.
    Accessible here: https://lila.science/datasets/great-zebra-giraffe-id

    Args:
        DATASET_ID: str :- the identifier for the dataset.

    returns:
        a path to the extracted and formatted `reidhub` dataset



In [10]:
dataset_root = download_and_extract()
print(f'Dataset downloaded to: {dataset_root}')

Dataset downloaded to: /root/.reidhub_cache/gzgc


### Systematize the Dataset Metadata

To enable standard, reproducible animal reidentification workflows across many different datasets, we systematize the dataset metadata. This ensures that fields such as `animal identification`, `viewpoint` (e.g. left/right/front), `species`, `timestamp`, and many more are standardized across datasets from different sources.
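
As a rough illustration of what systematizing means, the sketch below renames source-specific fields to a shared schema and fills missing fields with `None`. The raw field names in `COLUMN_MAP`, the reduced `STANDARD_COLUMNS` list, and the `systematize_records` helper are assumptions made for this example; they are not taken from the reidhub code.

```python
from typing import Any, Dict, List

# Hypothetical mapping from one source's field names to the shared schema.
COLUMN_MAP = {
    "file_name": "filepath",
    "name": "identity",
    "view": "viewpoint",
    "species_name": "species",
    "date_captured": "timestamp",
}

# A reduced, illustrative subset of standardized columns.
STANDARD_COLUMNS = ["filepath", "identity", "viewpoint", "species", "timestamp"]


def systematize_records(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Rename source fields and keep only the standard columns."""
    rows = []
    for record in records:
        renamed = {COLUMN_MAP.get(key, key): value for key, value in record.items()}
        # Missing fields become None so every row has the same columns.
        rows.append({column: renamed.get(column) for column in STANDARD_COLUMNS})
    return rows


raw = [
    {
        "file_name": "a.jpg",
        "name": "IBEIS_PZ_1561",
        "view": "left",
        "species_name": "zebra_plains",
        "date_captured": "2015-03-01 14:53:46",
    },
    {"file_name": "b.jpg", "name": "NNP_GIRM_0140", "species_name": "giraffe_masai"},
]
rows = systematize_records(raw)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(rows)` to obtain a table with identical columns for every source.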

In [11]:
from reidhub.access.provenance.gzgc import systematize_dataset_metadata
help(systematize_dataset_metadata)

Help on function systematize_dataset_metadata in module reidhub.access.provenance.gzgc:

systematize_dataset_metadata(dataset_root: str) -> pandas.core.frame.DataFrame
    Systematize dataset metadata for the Great Zebra and Giraffe Count and ID dataset
    Args:
        dataset_root (str) : The root path containing the extracted dataset

    Returns:
        pd.DataFrame: output dataframe containing systematized metadata



In [12]:
metadata_df = systematize_dataset_metadata(dataset_root)

In [13]:
## Add a full path column
metadata_df['fullpath'] = metadata_df['filepath'].apply(
    lambda x : os.path.join(dataset_root, x)
)

In [14]:
metadata_df

Unnamed: 0,filepath,bbox,viewpoint,species,identity,license,height,width,photographer,timestamp,latitude,longitude,secondary_identities,image_id,fullpath
0,gzgc.coco/images/train2020/000000000001.jpg,"[895.5, 437.0, 1221.0, 690.0]",left,zebra_plains,IBEIS_PZ_1561,http://creativecommons.org/licenses/by-nc-nd/2.0/,2000,3000,"NNP GZC Car '10WHITE', Person 'A', Image 0005",2015-03-01 14:53:46,-1.351341,36.800374,"[2, 1, 3, 3459]",1,/root/.reidhub_cache/gzgc/gzgc.coco/images/tra...
1,gzgc.coco/images/train2020/000000000002.jpg,"[951.0, 488.5, 1178.5, 728.5]",left,zebra_plains,IBEIS_PZ_1561,http://creativecommons.org/licenses/by-nc-nd/2.0/,2000,3000,"NNP GZC Car '10WHITE', Person 'A', Image 0006",2015-03-01 14:53:46,-1.351341,36.800374,"[2, 1, 3, 3459]",2,/root/.reidhub_cache/gzgc/gzgc.coco/images/tra...
2,gzgc.coco/images/train2020/000000000003.jpg,"[981.0, 552.5, 1131.0, 750.0]",left,zebra_plains,IBEIS_PZ_1561,http://creativecommons.org/licenses/by-nc-nd/2.0/,2000,3000,"NNP GZC Car '10WHITE', Person 'A', Image 0007",2015-03-01 14:53:52,-1.351341,36.800374,"[2, 1, 3, 3459]",3,/root/.reidhub_cache/gzgc/gzgc.coco/images/tra...
3,gzgc.coco/images/train2020/000000000004.jpg,"[432.5, 531.0, 1740.0, 938.5]",left,zebra_plains,IBEIS_PZ_1563,http://creativecommons.org/licenses/by-nc-nd/2.0/,2000,3000,"NNP GZC Car '10WHITE', Person 'A', Image 0008",2015-03-01 14:53:58,-1.351341,36.800374,[4],4,/root/.reidhub_cache/gzgc/gzgc.coco/images/tra...
4,gzgc.coco/images/train2020/000000000005.jpg,"[1568.5, 942.5, 450.0, 462.5]",left,giraffe_masai,NNP_GIRM_0140,http://creativecommons.org/licenses/by-nc-nd/2.0/,2000,3000,"NNP GZC Car '10WHITE', Person 'A', Image 0010",2015-03-01 15:02:32,-1.367088,36.781978,"[6, 7, 5, 2126, 2270]",5,/root/.reidhub_cache/gzgc/gzgc.coco/images/tra...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6920,gzgc.coco/images/train2020/000000004944.jpg,"[1084.2696629213483, 0.0, 1152.3876404494383, ...",left,giraffe_masai,NNP_GIRM_0074,http://creativecommons.org/licenses/by-nc-nd/2.0/,2000,3000,,2015-02-26 13:50:34,-1.376729,36.830786,"[4724, 6921, 4971]",4944,/root/.reidhub_cache/gzgc/gzgc.coco/images/tra...
6921,gzgc.coco/images/train2020/000000004945.jpg,"[779.494382022472, 363.76404494382024, 681.179...",left,giraffe_masai,NNP_GIRM_0030,http://creativecommons.org/licenses/by-nc-nd/2.0/,2000,3000,,2015-02-26 13:50:40,-1.370916,36.791815,"[6923, 6922]",4945,/root/.reidhub_cache/gzgc/gzgc.coco/images/tra...
6922,gzgc.coco/images/train2020/000000004946.jpg,"[1358.1460674157304, 0.0, 912.2191011235956, 1...",left,giraffe_masai,NNP_GIRM_0030,http://creativecommons.org/licenses/by-nc-nd/2.0/,2000,3000,,2015-02-26 13:50:50,-1.370916,36.791815,"[6923, 6922]",4946,/root/.reidhub_cache/gzgc/gzgc.coco/images/tra...
6923,gzgc.coco/images/train2020/000000004947.jpg,"[1221.2078651685395, 231.03932584269666, 458.5...",front,giraffe_masai,NNP_GIRM_0069,http://creativecommons.org/licenses/by-nc-nd/2.0/,2000,3000,,2015-02-26 13:51:10,-1.370916,36.791815,"[6925, 6924]",4947,/root/.reidhub_cache/gzgc/gzgc.coco/images/tra...


### Create Fiftyone Dataset

[Fiftyone](https://docs.voxel51.com/) is a powerful tool for interactive visualization of computer vision datasets.

Fiftyone's plugin feature allows its functionality to be extended. In the next steps, we are going to use a few plugins from the Fiftyone community to visualize and enrich our datasets, such as the following:
- Image Quality Issues Plugin
- Dashboard Plugin

In [15]:
from reidhub.access.utils import create_fiftyone_dataset
help(create_fiftyone_dataset)

Help on function create_fiftyone_dataset in module reidhub.access.utils:

create_fiftyone_dataset(root_path: str, metadata_df: pandas.core.frame.DataFrame, dataset_name: str, fields: Optional[List[str]] = None) -> fiftyone.core.dataset.Dataset
    creates a fiftyone dataset
    Args:
        root_path (str) : the root of the dataset containing the images <- also allow pathlib Paths
        metadata_df (pd.DataFrame) : the metadata for the dataset. Should have the columns (identity, image_path, image_type,)
        dataset_name (str) : The name of the fiftyone dataset created
        fields (list of str) : list of fields required in the fiftyone dataset created. should be existing columns in the metadata_df.                 Default : None : Use all the columns in the metadata_df
    Returns:
        fo.Dataset: Fiftyone dataset



In [16]:
dataset = create_fiftyone_dataset(dataset_root, metadata_df, dataset_name='gzgc')

You are running the oldest supported major version of MongoDB. Please refer to https://deprecation.voxel51.com for deprecation notices. You can suppress this exception by setting your `database_validation` config parameter to `False`. See https://docs.voxel51.com/user_guide/config.html#configuring-a-mongodb-connection for more information




 100% |███████████████| 6925/6925 [2.4s elapsed, 0s remaining, 3.1K samples/s]

Computing metadata...

 100% |███████████████| 6925/6925 [12.4s elapsed, 0s remaining, 538.0 samples/s]


## Assess

### Static Visualizations
We are going to start assessing the downloaded dataset by plotting a few non-interactive visualizations.
The static visualizations implemented in reidhub that we are going to use include:
 - Sample Image Grids: To showcase some sample images from the dataset and get a first glimpse of it.
 - Identity Distributions: To get a rough idea of the number of sightings per animal. Some of the datasets have very long-tailed distributions.

#### 1. Show sample images

To get a feel for the dataset and to check that it has been successfully loaded and preprocessed to systematize its metadata, we are going to plot a grid of sample images from the dataset.

The purpose of the static visualizations is just to get a rough idea of what the dataset images look like. Keep in mind that different datasets come in different image formats. Some are already segmented while others are full images. Some of the datasets have different individuals already cropped out from the original images while some come with the full images and include bounding boxes.

The images in the grid have different coloured borders to showcase different identities.

#### Show a grid of sample images

In [17]:
from reidhub.assess.statics import plot_grid
help(plot_grid)

Help on function plot_grid in module reidhub.assess.statics:

plot_grid(images: List[Union[numpy.ndarray, PIL.Image.Image]], ids: List[int], grid_shape: Tuple[int, int] = (3, 3), img_size: Tuple[int, int] = (224, 224), spacing: float = 0.05) -> matplotlib.figure.Figure
    Plot a grid of images with colored borders per identity.

    Args:
        images (List[Union[np.ndarray, PIL.Image.Image]]): List of images (either numpy arrays or PIL images).
        ids (List[int]): List of identity labels corresponding to each image.
        grid_shape (Tuple[int, int], optional): Shape of the grid as (rows, cols). Default is (3, 3).
        img_size (Tuple[int, int], optional): The (height, width) to resize the images. Default is (224, 224).
        spacing (float, optional): Fractional spacing between subplots. Default is 0.05.

    Returns:
        plt.Figure: The figure containing the grid of images with borders.



In [18]:
## Randomly select images to plot in the grid
from PIL import Image
n_rows, n_cols = 8, 5 # 5, 8 -- TODO: invert

samples_df = metadata_df.sample(n_rows*n_cols)
images = [Image.open(i) for i in samples_df['fullpath']]
ids = samples_df['identity'].values
grid_shape = (n_rows, n_cols)

In [None]:
fig = plot_grid(images, ids, grid_shape)
fig.savefig('gzgc-grid.png')  # saves the output grid as an image

#### Visualize Identity Distributions



In [None]:
%%writefile reidhub/reidhub/assess/statics.py
"""
This module contains functions that are useful for generating static objects for assessing reid datasets
Examples: Sample images grid
          Identity distributions
          etc

"""

from collections import Counter
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np
from PIL import Image
from typing import List, Union, Tuple


def plot_grid(
    images: List[Union[np.ndarray, Image.Image]],
    ids: List[int],
    grid_shape: Tuple[int, int] = (3, 3),
    img_size: Tuple[int, int] = (224, 224),
    spacing: float = 0.05,
) -> plt.Figure:
    """
    Plot a grid of images with colored borders per identity.

    Args:
        images (List[Union[np.ndarray, PIL.Image.Image]]): List of images (either numpy arrays or PIL images).
        ids (List[int]): List of identity labels corresponding to each image.
        grid_shape (Tuple[int, int], optional): Shape of the grid as (rows, cols). Default is (3, 3).
        img_size (Tuple[int, int], optional): The (height, width) to resize the images. Default is (224, 224).
        spacing (float, optional): Fractional spacing between subplots. Default is 0.05.

    Returns:
        plt.Figure: The figure containing the grid of images with borders.
    """
    # Unpack grid shape
    rows, cols = grid_shape

    # Calculate total number of cells in the grid
    n = rows * cols

    # Sample n images if more are provided
    if len(images) > n:
        idxs = np.random.choice(len(images), n, replace=False)
        images = [images[i] for i in idxs]
        ids = [ids[i] for i in idxs]

    # Normalize identities into a color map
    unique_ids = sorted(set(ids))
    cmap = plt.get_cmap("tab20", len(unique_ids))  # matplotlib.cm.get_cmap was removed in Matplotlib 3.9
    id2color = {uid: cmap(i) for i, uid in enumerate(unique_ids)}

    # Create subplots
    fig, axes = plt.subplots(rows, cols, figsize=(cols * 2.5, rows * 2.5))
    axes = np.array(axes).reshape(rows, cols)

    # Set transparent background
    fig.patch.set_alpha(0)

    # Plot each image with its border
    for ax, img, identity in zip(axes.flatten(), images, ids):
        # Convert to PIL Image if necessary and resize
        if not isinstance(img, Image.Image):
            img = Image.fromarray(img)
        height, width = img_size
        img_resized = img.resize((width, height))  # PIL expects (width, height)

        # Display image
        ax.imshow(img_resized)
        ax.axis("off")

        # Draw border around the image
        rect = patches.Rectangle(
            (0, 0),
            width,
            height,
            linewidth=10,
            edgecolor=id2color[identity],
            facecolor="none",
            transform=ax.transData,
        )
        ax.add_patch(rect)

    # Remove extra axes if fewer images are provided
    for ax in axes.flatten()[len(images) :]:
        ax.axis("off")

    # Adjust spacing between subplots
    plt.subplots_adjust(wspace=spacing, hspace=spacing)

    return fig


def plot_identity_histogram(ids, bins=50, log_scale=False, alpha=0.6, figsize=(8, 5)):
    """
    Plot a transparent histogram of identity frequencies
    (how many images per identity).

    Args:
        ids (list): List of identity labels.
        bins (int or list): Number of bins or explicit bin edges.
        log_scale (bool): Whether to use log scale for y-axis.
        alpha (float): Transparency of histogram bars (0=fully transparent, 1=opaque).
        figsize (tuple): Figure size.
    """
    # Count how many images per identity
    counts = Counter(ids).values()

    # Plot histogram
    fig, ax = plt.subplots(figsize=figsize)
    ax.hist(counts, bins=bins, color="steelblue", edgecolor="black", alpha=alpha)

    # Transparent backgrounds
    fig.patch.set_alpha(0)   # Figure background
    ax.patch.set_alpha(0)    # Axes background

    ax.set_xlabel("Number of images per identity")
    ax.set_ylabel("Number of identities")
    ax.set_title("Identity Frequency Distribution")

    if log_scale:
        ax.set_yscale("log")
        ax.set_ylabel("Number of identities (log scale)")

    plt.tight_layout()
    return fig

In [None]:
from reidhub.assess.statics import plot_grid

In [None]:
from reidhub.assess.statics import plot_identity_histogram
help(plot_identity_histogram)

In [None]:
from collections import Counter
import matplotlib.pyplot as plt

def plot_identity_histogram(ids, bins=50, log_scale=False, alpha=0.6, figsize=(8, 5)):
    """
    Plot a transparent histogram of identity frequencies
    (how many images per identity).

    Args:
        ids (list): List of identity labels.
        bins (int or list): Number of bins or explicit bin edges.
        log_scale (bool): Whether to use log scale for y-axis.
        alpha (float): Transparency of histogram bars (0=fully transparent, 1=opaque).
        figsize (tuple): Figure size.
    """
    # Count how many images per identity
    counts = Counter(ids).values()

    # Plot histogram
    fig, ax = plt.subplots(figsize=figsize)
    ax.hist(counts, bins=bins, color="steelblue", edgecolor="black", alpha=alpha)

    # Transparent backgrounds
    fig.patch.set_alpha(0)   # Figure background
    ax.patch.set_alpha(0)    # Axes background

    ax.set_xlabel("Number of images per identity")
    ax.set_ylabel("Number of identities")
    ax.set_title("Identity Frequency Distribution")

    if log_scale:
        ax.set_yscale("log")
        ax.set_ylabel("Number of identities (log scale)")

    plt.tight_layout()
    return fig

In [None]:
identities = metadata_df['identity'].values

fig_identities = plot_identity_histogram(identities)

> As the identity distribution shows, most of the animals surveyed during the census were re-encountered only once: about 1300 animals were seen just twice.

> Some `"celebrity"` animals, however, were sighted more than 50 times. We will explore this a bit further using Fiftyone.
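
The counts behind observations like these can be checked directly from the identity labels. The following is a minimal sketch; `summarize_sightings` and its `celebrity_threshold` parameter are illustrative names and not part of reidhub.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Tuple


def summarize_sightings(
    ids: Iterable[Hashable], celebrity_threshold: int = 50
) -> Tuple[Counter, List[Hashable]]:
    """Summarize how often each identity appears.

    Returns a Counter mapping "number of images" to "number of identities
    with that many images", plus the identities seen more than
    `celebrity_threshold` times.
    """
    per_identity = Counter(ids)  # identity -> number of images
    histogram = Counter(per_identity.values())  # image count -> number of identities
    celebrities = sorted(
        identity for identity, n in per_identity.items() if n > celebrity_threshold
    )
    return histogram, celebrities


# Toy labels: "a" is seen 3 times, "b" and "c" twice, "d" once.
toy_ids = ["a"] * 3 + ["b"] * 2 + ["c"] * 2 + ["d"]
histogram, celebrities = summarize_sightings(toy_ids, celebrity_threshold=2)
```

On the real data, the same call would take `metadata_df['identity']` as input, and `histogram[2]` would give the number of animals seen exactly twice.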

### Interactive Visualization with Fiftyone

In [None]:
import fiftyone as fo

session = fo.launch_app(dataset, auto=False)

In [None]:
session.show()

In [None]:
## Take a snapshot of the current state of the Fiftyone App
session.freeze()

#### Issues:
From a visual inspection of the dataset, we can identify a few issues that we will need to deal with:
1) Some of the images are spuriously rotated. The EXIF data has been stripped from the images, so finding these cases systematically will be difficult.

2) Some of the images are crowded. Each image, however, focuses on particular animals. This means we will need to save a version of the dataset in which the images are cropped to the bounding boxes containing individual animals.
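
Cropping to individual animals could be sketched as follows. The example assumes that the `bbox` column holds COCO-style `[x, y, width, height]` values in pixels. That is an inference from the sample rows, not something the dataset loader documents, and `xywh_to_crop_box` is a hypothetical helper.

```python
import math
from typing import Sequence, Tuple


def xywh_to_crop_box(
    bbox: Sequence[float], image_width: int, image_height: int
) -> Tuple[int, int, int, int]:
    """Convert an assumed [x, y, w, h] box to (left, upper, right, lower).

    The box is expanded outward to whole pixels and clamped to the image.
    """
    x, y, w, h = bbox
    left = max(0, math.floor(x))
    upper = max(0, math.floor(y))
    right = min(image_width, math.ceil(x + w))
    lower = min(image_height, math.ceil(y + h))
    if right <= left or lower <= upper:
        raise ValueError(f"Empty crop for bbox {list(bbox)}")
    return left, upper, right, lower


# A box taken from the metadata table (image is 3000 wide, 2000 high).
box = xywh_to_crop_box([1568.5, 942.5, 450.0, 462.5], image_width=3000, image_height=2000)

# A box that overflows the image is clamped to its borders.
clamped = xywh_to_crop_box([2900, 1900, 500, 500], image_width=3000, image_height=2000)
```

The returned tuple has the layout expected by Pillow's `Image.crop`, so a crop would be `Image.open(path).crop(box)`.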

### Image quality issues

[Fiftyone](https://docs.voxel51.com/) has a powerful plugin system that allows us to extend its functionality. Moreover, there are a number of community-contributed plugins that we can use to programmatically identify image quality issues using simple heuristics such as average pixel values.

We will be using the [image issues plugin](https://github.com/jacobmarks/image-quality-issues)
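
To give a sense of how simple such heuristics can be, here is a sketch of two of them computed with NumPy. These definitions are assumptions made for illustration; the plugin's own formulas may differ (for example in how it normalizes aspect ratio or weights colour channels).

```python
import numpy as np


def aspect_ratio(image: np.ndarray) -> float:
    """Width divided by height for an (H, W) or (H, W, C) array."""
    height, width = image.shape[:2]
    return width / height


def brightness(image: np.ndarray) -> float:
    """Mean pixel value of a uint8 image, scaled to the range [0, 1]."""
    return float(image.mean()) / 255.0


# Synthetic 100 x 200 RGB images: one fully black, one fully white.
dark = np.zeros((100, 200, 3), dtype=np.uint8)
bright = np.full((100, 200, 3), 255, dtype=np.uint8)

dark_ratio = aspect_ratio(dark)
dark_brightness = brightness(dark)
bright_brightness = brightness(bright)
```

Thresholding values like these is what lets very dark, very bright, or unusually shaped images be flagged without running any model.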

In [None]:
from reidhub.access.utils import fiftyone_check_image_quality_issues
help(fiftyone_check_image_quality_issues)

In [None]:
# Select a subset of operations.
# These computations are quite heavy, so we compute only one of the issues
# that does not require much compute: aspect ratio.

IMAGE_ISSUES_OPERATIONS = [
    "compute_aspect_ratio",
    # "compute_brightness",
    # "compute_contrast",
    # "compute_exposure",
    # "compute_saturation",
    # "compute_vignetting",
    # "compute_blurriness",
    # "compute_entropy",
]

In [None]:
dataset = await fiftyone_check_image_quality_issues(dataset, IMAGE_ISSUES_OPERATIONS)

### Visualize Image Quality Issues

We can interactively visualize the image quality issues we have computed using Fiftyone.  
Some of the things we can do:
1) Filter for images above or below a certain `aspect ratio` threshold using the slider on the left  
2) Sort images by aspect ratio  
3) Visualize `aspect ratio` against other fields such as `brightness` using the **`Dashboard`** plugin  

In [None]:
session.show()

#### Identifying Exact and Near Duplicates

We need to be wary of duplicates since they can lead to data leakage, whereby exact or near-duplicate images appear in both the training and evaluation subsets of the dataset. To ensure that the models we train are not evaluated on images they have already seen, we need to identify and deal with duplicate images in the dataset.

Duplicates occur in camera trap datasets for a number of reasons:

1) Some datasets are created by sampling video frames. If the camera and the subject do not move over a period of time, we can get almost the same image from different frames. This can also occur with static images captured in bursts.

2) Exact duplicates can occur if the exact same image is renamed and accidentally added to the dataset.
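
Exact duplicates of the second kind can be found by hashing file contents, since a renamed copy keeps the same bytes. The sketch below uses only the standard library; `find_exact_duplicates` is a hypothetical helper and is not the `check_for_duplicates` function referenced in this notebook.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def find_exact_duplicates(paths: Iterable[Path]) -> List[List[Path]]:
    """Group files whose byte content is identical (SHA-256 of the file)."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        groups[digest].append(Path(path))
    # Only groups with more than one file are duplicates.
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Demo on three small files, two of which share the same content.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.jpg").write_bytes(b"same-bytes")
    (root / "renamed_copy.jpg").write_bytes(b"same-bytes")
    (root / "b.jpg").write_bytes(b"different")
    duplicate_groups = find_exact_duplicates(sorted(root.iterdir()))
    duplicate_names = [[p.name for p in group] for group in duplicate_groups]
```

This approach does not catch near duplicates such as consecutive video frames, because any pixel change produces a different hash; those need a perceptual hash or an embedding-similarity comparison instead.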


In [None]:
# from reidhub.assess.statics import check_for_duplicates
# help(check_for_duplicates)

#### Remove The Identified Duplicates

#### Save Dataset To Hugging Face

### Data Enrichment

## Address