In [1]:
# IMPORTANT 
#Install dependencies and copy model weights to run the notebook without internet access when submitting to the competition.

!pip install --no-index /kaggle/input/imc2024-packages-lightglue-rerun-kornia/* --no-deps
!mkdir -p /root/.cache/torch/hub/checkpoints
!cp /kaggle/input/aliked/pytorch/aliked-n16/1/aliked-n16.pth /root/.cache/torch/hub/checkpoints/
!cp /kaggle/input/lightglue/pytorch/aliked/1/aliked_lightglue.pth /root/.cache/torch/hub/checkpoints/
!cp /kaggle/input/lightglue/pytorch/aliked/1/aliked_lightglue.pth /root/.cache/torch/hub/checkpoints/aliked_lightglue_v0-1_arxiv-pth

Processing /kaggle/input/imc2024-packages-lightglue-rerun-kornia/kornia-0.7.2-py2.py3-none-any.whl
Processing /kaggle/input/imc2024-packages-lightglue-rerun-kornia/kornia_moons-0.2.9-py3-none-any.whl
Processing /kaggle/input/imc2024-packages-lightglue-rerun-kornia/kornia_rs-0.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Processing /kaggle/input/imc2024-packages-lightglue-rerun-kornia/lightglue-0.0-py3-none-any.whl
Processing /kaggle/input/imc2024-packages-lightglue-rerun-kornia/pycolmap-0.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Processing /kaggle/input/imc2024-packages-lightglue-rerun-kornia/rerun_sdk-0.15.0a2-cp38-abi3-manylinux_2_31_x86_64.whl
Installing collected packages: rerun-sdk, pycolmap, lightglue, kornia-rs, kornia-moons, kornia
  Attempting uninstall: kornia-rs
    Found existing installation: kornia_rs 0.1.8
    Uninstalling kornia_rs-0.1.8:
      Successfully uninstalled kornia_rs-0.1.8
  Attempting uninstall: kornia
    Found exist

In [2]:
import sys
import os
from tqdm import tqdm
from time import time, sleep
import gc
import numpy as np
import h5py # Used to read and write HDF5 (.h5) files, which efficiently store large numerical datasets like model weights or datasets.
import dataclasses
import pandas as pd
from IPython.display import clear_output  # Used to clear the output of the current Jupyter notebook cell, useful for cleaner live updates (e.g., during training loops or progress displays).
from collections import defaultdict
from copy import deepcopy # Creates a completely independent (deep) copy of an object, including nested objects. Useful to avoid unwanted modifications to the original data.
from PIL import Image # Imports the Python Imaging Library (Pillow) for opening, manipulating, and saving image files in various formats (e.g., JPEG, PNG).

import cv2
import torch
import torch.nn.functional as F  # Imports functional API from PyTorch for operations like activation functions, loss calculations, etc. (e.g., F.relu, F.cross_entropy).
import kornia as K  # Kornia is a computer vision library for PyTorch; provides differentiable image transformations (e.g., filtering, color conversions, geometry ops).
import kornia.feature as KF # Imports feature detection and matching modules from Kornia (e.g., SIFT, SuperPoint, feature tracking, keypoint matching).


from lightglue import match_pair  # A function from the LightGlue library used for matching keypoints between image pairs.

from lightglue import ALIKED, LightGlue
# ALIKED: A local feature detector and descriptor.
# LightGlue: A lightweight and fast keypoint matcher designed for structure-from-motion and visual localization tasks.

from lightglue.utils import load_image, rbd  
# load_image: Utility function to load and preprocess images.
# rbd: "Remove batch dimension" helper; strips the leading batch axis from the tensors in a feature dict.

from transformers import AutoImageProcessor, AutoModel  
# Hugging Face Transformers API to load pretrained vision models.
# AutoImageProcessor: Loads the image preprocessor (resizing, rescaling, normalization) that matches a checkpoint.
# AutoModel: Loads the corresponding pretrained model.

import pycolmap  # Python bindings for COLMAP, a Structure-from-Motion (SfM) and 3D reconstruction library.

# `sys` is already imported above.
sys.path.append('/kaggle/input/imc25-utils')  # Adds a custom utilities folder (IMC 2025 competition tools) to the Python path.

from database import *  # Imports everything from a custom database utility script (e.g., SQLite COLMAP database handling).

from h5_to_db import *  # Likely handles conversion from .h5 feature files to COLMAP-compatible database format.

import metric  # Custom or external module for evaluation metrics (e.g., recall, precision, pose errors).






In [3]:
# Do not forget to select an accelerator on the sidebar to the right.
device = K.utils.get_cuda_device_if_available(0)
print(f'{device=}')

device=device(type='cuda', index=0)


The next two functions load images and extract global descriptors with a pretrained DINOv2 model. The first, `load_torch_image`, uses Kornia to read an image file as a float32 RGB tensor with values in [0, 1] and adds a batch dimension so that it is compatible with PyTorch models. The tensor is placed on the device passed as an argument (CPU by default).

The second function, `get_global_desc`, loads the DINOv2 model and its image processor from locally saved pretrained weights. It then iterates over the image file paths, loads each image with `load_torch_image`, and passes it through the image processor, which resizes and normalizes the image to the input format the model expects. Because the tensor is already scaled to [0, 1], the processor is called with `do_rescale=False`.

The model outputs a sequence of embeddings: one feature vector per image patch, preceded by a special CLS (classification) token that summarizes the whole image and is mainly used for classification tasks. In this function the CLS token is excluded, and the remaining patch embeddings are max-pooled over the token axis, per feature dimension, to form a single compact global descriptor that captures the most prominent visual features. The resulting vector is then L2-normalized, so its Euclidean length equals 1. This keeps the descriptor's magnitude from affecting comparisons; for unit-length vectors, ranking pairs by Euclidean distance is equivalent to ranking them by cosine similarity.

Each descriptor is detached from the computation graph (a safeguard only, since `torch.inference_mode()` already disables gradient tracking), moved to the CPU, and appended to a list. After all images are processed, the list is concatenated into a single tensor of shape (N, D) that can be used for image matching, retrieval, or clustering.
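
The pooling and normalization steps can be tried without loading the model. The following is a minimal sketch in NumPy on random data, assuming a hidden state of shape (batch, 1 + num_patches, dim) with the CLS token first. The array sizes and the helper name `mac_descriptor` are made up for illustration.

```python
import numpy as np

def mac_descriptor(hidden_state: np.ndarray) -> np.ndarray:
    """Max-pool patch tokens (skipping CLS at index 0) and L2-normalize."""
    patch_tokens = hidden_state[:, 1:, :]                  # drop the CLS token
    pooled = patch_tokens.max(axis=1)                      # (batch, dim)
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)  # (batch, 1)
    return pooled / np.clip(norms, 1e-12, None)            # unit-length rows

rng = np.random.default_rng(0)
fake_hidden = rng.normal(size=(2, 1 + 16, 8))  # 2 images, CLS + 16 patches, 8 dims
desc = mac_descriptor(fake_hidden)
print(desc.shape)                    # one descriptor per image
print(np.linalg.norm(desc, axis=1))  # each row has length 1 (up to rounding)
```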


In [4]:
def load_torch_image(fname, device=torch.device('cpu')):
    # Loads an image as a torch tensor using Kornia, converts to RGB float32, and adds batch dimension.
    img = K.io.load_image(fname, K.io.ImageLoadType.RGB32, device=device)[None, ...]
    return img

In [None]:

# Use a DINOv2 global descriptor to build the matching shortlists.
def get_global_desc(fnames, device=torch.device('cpu')):
    # Load a pretrained image processor and model (DINOv2 here) from local Hugging Face-style directory.
    processor = AutoImageProcessor.from_pretrained('/kaggle/input/dinov2/pytorch/base/1')
    model = AutoModel.from_pretrained('/kaggle/input/dinov2/pytorch/base/1')
    
    model = model.eval()  # Set model to evaluation mode (disables dropout, etc.).
    model = model.to(device)  # Move model to CPU or GPU.

    global_descs_dinov2 = []  # Will hold global descriptors for all images.

    for i, img_fname_full in tqdm(enumerate(fnames), total=len(fnames)):  # Loop through each image file.
        key = os.path.splitext(os.path.basename(img_fname_full))[0]  # Extract image ID from filename.
        
        timg = load_torch_image(img_fname_full)  # Load and preprocess the image.

        with torch.inference_mode():  # Disable gradient tracking for speed and memory.
            # Preprocess image using the DINOv2-compatible processor.
            inputs = processor(images=timg, return_tensors="pt", do_rescale=False).to(device)
            
            # Forward pass through the model.
            outputs = model(**inputs)
            
            # Extract a global descriptor by taking the max value across all tokens (excluding CLS token).
            # `outputs.last_hidden_state[:, 1:]` skips the [CLS] token.
            # `.max(dim=1)[0]` applies max pooling across tokens (per feature channel).
            # `F.normalize(..., dim=1, p=2)` ensures L2-normalized global descriptors.
            dino_mac = F.normalize(outputs.last_hidden_state[:, 1:].max(dim=1)[0], dim=1, p=2)
        
        # Store the descriptor, move it to CPU, and detach from computation graph.
        global_descs_dinov2.append(dino_mac.detach().cpu())

    # Concatenate all descriptors into a single tensor of shape (N, D)
    global_descs_dinov2 = torch.cat(global_descs_dinov2, dim=0)

    return global_descs_dinov2


In [5]:
def get_img_pairs_exhaustive(img_fnames):
    # This function takes a list of image filenames and generates all unique image pairs (i, j),
    # where each image is paired with every other image exactly once, and i < j to avoid duplicates and self-pairing.
    
    index_pairs = []  # Initialize an empty list to hold index pairs.

    # Loop through each image index i in the list.
    for i in range(len(img_fnames)):
        # For each i, pair it with every subsequent image index j (j > i).
        for j in range(i + 1, len(img_fnames)):
            # Add the pair (i, j) to the list. This ensures that each pair is unique and unordered (no (j, i) or (i, i)).
            index_pairs.append((i, j))

    # Return the full list of all unique image index pairs.
    return index_pairs


This function builds a shortlist of image pairs to match, based on visual similarity. It first checks the number of images: if the count is less than or equal to `exhaustive_if_less`, the dataset is small enough to compare everything, so the function returns all image pairs from `get_img_pairs_exhaustive`. For N images that is N(N-1)/2 pairs, which is cheap for small N but grows quadratically.
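
For reference, the exhaustive strategy is the same as taking all 2-element combinations of the image indices. A minimal sketch using only the standard library (the helper name `exhaustive_pairs` is hypothetical):

```python
from itertools import combinations

def exhaustive_pairs(num_images: int) -> list[tuple[int, int]]:
    """All index pairs (i, j) with i < j."""
    return list(combinations(range(num_images), 2))

pairs = exhaustive_pairs(4)
print(pairs)                      # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(len(exhaustive_pairs(20)))  # 190, i.e. 20 * 19 / 2
```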

If the dataset is larger, the function switches to a selective strategy. It computes a global descriptor for each image with a pretrained model (DINOv2 here), which turns each image into a compact vector summarizing its visual content. The descriptors are then compared with `torch.cdist`, which computes the Euclidean distance between every pair of descriptors. The lower the distance, the more visually alike the two images are. Note that `sim_th`, despite its name, is applied as an upper bound on this distance, not as a lower bound on a similarity score.
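
Because the descriptors are L2-normalized, a distance threshold can be read as a cosine-similarity threshold through the identity d² = 2 - 2·cos. A small numerical check, as a sketch (the helper name is hypothetical):

```python
import numpy as np

def cosine_from_distance(dist: float) -> float:
    """Cosine similarity of two unit vectors that are `dist` apart."""
    return 1.0 - dist ** 2 / 2.0

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 8))
a /= np.linalg.norm(a)  # make both vectors unit length
b /= np.linalg.norm(b)

dist = np.linalg.norm(a - b)
print(np.isclose(cosine_from_distance(dist), a @ b))  # identity holds
print(round(cosine_from_distance(0.3), 3))            # distance 0.3 ~ cosine 0.955
```

Under this reading, the distance cutoff of 0.3 used later in the pipeline corresponds to a cosine similarity of roughly 0.955.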

To efficiently filter meaningful matches, a boolean mask is applied. A boolean mask is essentially a matrix of true or false values, where each value corresponds to whether the similarity (based on distance) between a pair of images meets a defined threshold. If the distance between two images is below this threshold, the corresponding entry in the mask is marked true, indicating a potential match. This allows the algorithm to focus only on promising image pairs without manually iterating over each one.

For each image, the function gathers all other images that are considered similar according to this mask. To make sure every image has a reasonable number of candidates, it enforces a minimum through the `min_pairs` parameter. If an image has fewer than `min_pairs` candidates under the threshold, the function instead takes the `min_pairs` nearest images by distance, regardless of the threshold. Because each image is at distance zero from itself, that nearest list normally includes the image itself, which is skipped later, so the fallback typically yields `min_pairs - 1` partners. This fallback keeps the shortlist usable across datasets of varying density.

Self-pairings are skipped, and pairs with a distance of 1000 or more are discarded. Since the descriptors are L2-normalized, their distances never exceed 2, so this cutoff acts only as a safety check. All selected pairs are stored in sorted order to prevent duplication: (1, 3) and (3, 1) are treated as the same pair and only one is kept. Finally, duplicates are removed and the sorted list is returned.
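
The thresholding, fallback, and de-duplication logic can be seen on a toy distance matrix. This is a simplified sketch, not the notebook's function: it works on a precomputed matrix, and the name `shortlist_from_distances` and the sample points are invented for illustration.

```python
import numpy as np

def shortlist_from_distances(dm, dist_th, min_pairs):
    """Pairs below dist_th, with a nearest-neighbor fallback per image."""
    pairs = set()
    for i in range(len(dm)):
        candidates = np.flatnonzero(dm[i] <= dist_th)
        if len(candidates) < min_pairs:
            # Too few candidates: take the closest ones regardless of threshold
            candidates = np.argsort(dm[i])[:min_pairs]
        for j in candidates:
            if i != j:  # skip self-pairing
                pairs.add(tuple(sorted((i, int(j)))))  # (1, 3) == (3, 1)
    return sorted(pairs)

# Four points on a line: 0 and 1 are close, 2 and 3 are far from everything
points = np.array([[0.0], [0.1], [5.0], [9.0]])
dm = np.abs(points - points.T)
result = shortlist_from_distances(dm, dist_th=0.5, min_pairs=2)
print(result)  # [(0, 1), (2, 3)]
```

Images 2 and 3 have no neighbor within the threshold, so they are paired through the fallback, and each pair appears only once even though it is found from both sides.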

Brute-force matching, in contrast to this optimized approach, refers to comparing every image with every other one without any intelligent filtering. While this guarantees that all possible matches are considered, it becomes computationally expensive and inefficient as the dataset size increases. This function cleverly balances efficiency and coverage by applying filtering via descriptor distances and thresholding, while still falling back to exhaustive matching when necessary for small datasets.


In [6]:
def get_image_pairs_shortlist(fnames,
                              sim_th = 0.6,  # maximum descriptor distance for a pair to count as similar
                              min_pairs = 30,
                              exhaustive_if_less = 20,
                              device=torch.device('cpu')):
    num_imgs = len(fnames)

    # If the number of images is small, do exhaustive pairing instead of selective matching
    if num_imgs <= exhaustive_if_less:
        return get_img_pairs_exhaustive(fnames)

    # Compute global descriptors for all images
    descs = get_global_desc(fnames, device=device)

    # Compute pairwise Euclidean distances between all descriptors (shape: num_imgs x num_imgs)
    dm = torch.cdist(descs, descs, p=2).detach().cpu().numpy()

    # Create a boolean mask where distances are below the similarity threshold
    mask = dm <= sim_th

    total = 0
    matching_list = []
    ar = np.arange(num_imgs)

    for st_idx in range(num_imgs-1):
        mask_idx = mask[st_idx]
        to_match = ar[mask_idx]  # Get indices of similar images

        # If not enough matches, pick closest `min_pairs` based on actual distance values
        if len(to_match) < min_pairs:
            to_match = np.argsort(dm[st_idx])[:min_pairs]  

        for idx in to_match:
            if st_idx == idx:
                continue  # Skip self-pairing

            # Avoid pairing extremely distant matches (arbitrary high cutoff)
            if dm[st_idx, idx] < 1000:
                matching_list.append(tuple(sorted((st_idx, idx.item()))))
                total += 1

    # Remove duplicates and sort the final list of unique image index pairs
    matching_list = sorted(list(set(matching_list)))
    return matching_list


This function performs feature detection and description on a list of images using the ALIKED model, which is a keypoint detector and descriptor extractor suitable for image matching tasks. It starts by configuring the extractor to use a specified number of features and resizing images to a common dimension for consistency. It also ensures that computations are performed in 32-bit floating-point precision, as ALIKED can become unstable when using float16, especially on some GPUs.

The function creates a feature output directory if it doesn’t already exist, then opens two HDF5 files—one for storing the extracted keypoints and the other for storing descriptors. HDF5 is a format well-suited for managing large numerical datasets efficiently. It then loops through each image file path, using only the base file name as a key to store features.

Each image is loaded onto the specified torch device (CPU or GPU) and passed to ALIKED's `extract` method, which internally resizes the image (unless disabled) and computes both keypoints and descriptors. Keypoints are spatial locations in the image where local features have been detected, usually corners or regions with high contrast or texture. Descriptors are compact numerical vectors that describe the appearance or structure around each keypoint. They matter because they allow keypoints from different images to be compared and matched: if two keypoints from different images have similar descriptors, they likely correspond to the same real-world location seen from different viewpoints.

After extraction, both the keypoints and descriptors are reshaped appropriately, detached from the computational graph to disable gradient tracking (since we don’t need backpropagation for inference), and moved to the CPU. They are then converted to NumPy arrays and stored in the HDF5 files. This structured, batched storage allows later use in tasks such as feature-based image matching, structure-from-motion, or localization, where keypoints help establish geometric correspondences and descriptors help find those correspondences accurately across image pairs.
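
The internal resize raises the question of which coordinate frame the stored keypoints live in. The following is a minimal sketch of the arithmetic, assuming the longer image side is scaled to `resize_to` and that keypoints detected in the resized image are mapped back to original pixel coordinates with a half-pixel offset. Both points are assumptions about the LightGlue preprocessing, not something this notebook shows, and the helper names are hypothetical.

```python
def resize_scale(width: int, height: int, resize_to: int) -> tuple[float, float]:
    """Scale factors (x, y) when the longer side is resized to `resize_to`."""
    factor = resize_to / max(width, height)
    new_w, new_h = round(width * factor), round(height * factor)
    return new_w / width, new_h / height

def to_original_coords(kpt_xy, scale_xy):
    """Map a keypoint from the resized image back to the original image."""
    (x, y), (sx, sy) = kpt_xy, scale_xy
    return (x + 0.5) / sx - 0.5, (y + 0.5) / sy - 0.5

scale = resize_scale(4096, 2048, 1024)
print(scale)                                      # (0.25, 0.25)
print(to_original_coords((511.5, 255.5), scale))  # (2047.5, 1023.5)
```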


In [7]:
def detect_aliked(img_fnames,
                  feature_dir = '.featureout',
                  num_features = 4096,
                  resize_to = 1024,
                  device=torch.device('cpu')):
    dtype = torch.float32  # ALIKED model can malfunction with float16, so use float32 for stability

    # Initialize the ALIKED keypoint detector and descriptor extractor
    extractor = ALIKED(max_num_keypoints=num_features, detection_threshold=0.2, resize=resize_to).eval().to(device, dtype)

    # Ensure output directory exists
    if not os.path.isdir(feature_dir):
        os.makedirs(feature_dir)

    # Open HDF5 files for storing keypoints and descriptors
    with h5py.File(f'{feature_dir}/keypoints.h5', mode='w') as f_kp, \
         h5py.File(f'{feature_dir}/descriptors.h5', mode='w') as f_desc:

        # Process each image in the provided list
        for img_path in tqdm(img_fnames):
            img_fname = img_path.split('/')[-1]  # Extract just the image file name
            key = img_fname

            with torch.inference_mode():  # Disable gradient tracking for inference
                # Load and preprocess image
                image0 = load_torch_image(img_path, device=device).to(dtype)

                # Extract keypoints and descriptors using ALIKED
                feats0 = extractor.extract(image0)  # Automatically resizes image

                # Keypoints: (N, 2) format, extracted and moved to CPU
                kpts = feats0['keypoints'].reshape(-1, 2).detach().cpu().numpy()

                # Descriptors: one per keypoint, detached and moved to CPU
                descs = feats0['descriptors'].reshape(len(kpts), -1).detach().cpu().numpy()

                # Save keypoints and descriptors to HDF5 files
                f_kp[key] = kpts
                f_desc[key] = descs


The function first initializes the `LightGlueMatcher` from Kornia's feature module with the ALIKED configuration. Setting `width_confidence` and `depth_confidence` to -1 disables LightGlue's two adaptive shortcuts, point pruning and early stopping, so every keypoint passes through every layer; this trades speed for matching quality. The `mp` option enables mixed-precision inference and is switched on only when the device is a CUDA GPU. The matcher is set to evaluation mode and moved to the appropriate device (CPU or GPU).

The function then opens three HDF5 files: one for reading keypoints, one for descriptors, and a new one to write the resulting matches. For every image pair in the list, it retrieves their filenames and reads their keypoints and descriptors from the files, converting them into torch tensors and moving them to the selected device.

Within inference mode (which disables gradient computation for efficiency), it runs the LightGlue matcher. The matcher takes the descriptor vectors and local affine frames (LAFs), which are built from the keypoints with `laf_from_center_scale_ori()`. A LAF encodes a keypoint's position, scale, and orientation. Since only the keypoint centers are supplied here, scale and orientation take their default values, and the LAFs mainly serve to pass keypoint positions to LightGlue.
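
To make the LAF layout concrete, here is a minimal NumPy sketch of a frame as a 2x3 matrix whose last column is the keypoint center. It assumes zero orientation and a uniform scale, which is the case when only centers are given; the helper name `laf_from_centers` is hypothetical and is not the Kornia function.

```python
import numpy as np

def laf_from_centers(centers: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Build (N, 2, 3) frames [scale * I | center] with zero orientation."""
    lafs = np.zeros((len(centers), 2, 3))
    lafs[:, :, :2] = scale * np.eye(2)  # affine shape part (no rotation)
    lafs[:, :, 2] = centers             # last column holds (x, y)
    return lafs

centers = np.array([[10.0, 20.0], [30.0, 40.0]])
lafs = laf_from_centers(centers)
print(lafs[0])        # [[1, 0, 10], [0, 1, 20]]
print(lafs[:, :, 2])  # the centers are recovered from the last column
```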

If any matches are found (i.e., the result is non-empty), the code checks whether the number of matches meets a minimum threshold. If it does, it creates a new dataset in the HDF5 match file, named after the image pair, and stores the matching keypoint indices as a two-column array, where each row represents a matched keypoint pair across the two images. These matches can later be used for geometric verification, pose estimation, or building 3D structures from multiple views.
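
The stored two-column index array is enough to recover matched coordinates later. A minimal sketch with made-up keypoints and matches:

```python
import numpy as np

kp1 = np.array([[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]])  # keypoints, image 1
kp2 = np.array([[11.0, 9.0], [31.0, 29.0]])                 # keypoints, image 2

# Each row is (index into kp1, index into kp2)
idxs = np.array([[0, 0],
                 [2, 1]])

pts1 = kp1[idxs[:, 0]]  # matched coordinates in image 1
pts2 = kp2[idxs[:, 1]]  # corresponding coordinates in image 2
print(np.hstack([pts1, pts2]))  # one row per correspondence: x1, y1, x2, y2
```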

In this process, keypoints represent important image locations, descriptors are compact vectors capturing their local appearance, and matches are discovered by comparing these descriptors in feature space. LightGlue improves matching quality by learning how to match descriptors with spatial understanding, rather than relying on brute-force distance comparisons alone. Detaching the tensors from the computational graph ensures efficient memory usage, as gradients are not needed during inference.
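
For contrast, the brute-force baseline that LightGlue improves on can be sketched as mutual nearest-neighbor matching on descriptor distances. This is a different, simpler technique shown only for comparison; the function name and the toy descriptors are made up.

```python
import numpy as np

def mutual_nearest_neighbors(desc1: np.ndarray, desc2: np.ndarray) -> np.ndarray:
    """Return (M, 2) index pairs that are each other's nearest neighbor."""
    dists = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    nn12 = dists.argmin(axis=1)   # best match in image 2 for each desc1
    nn21 = dists.argmin(axis=0)   # best match in image 1 for each desc2
    idx1 = np.arange(len(desc1))
    mutual = nn21[nn12] == idx1   # keep only pairs that agree both ways
    return np.stack([idx1[mutual], nn12[mutual]], axis=1)

desc1 = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
desc2 = np.array([[0.0, 0.9], [0.9, 0.1]])
matches = mutual_nearest_neighbors(desc1, desc2)
print(matches)  # [[0 1], [1 0]]
```

The third descriptor is dropped because its nearest neighbor prefers a different partner. This approach uses no spatial context, which is the gap a learned matcher fills.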

In [8]:
def match_with_lightglue(img_fnames,
                   index_pairs,
                   feature_dir = '.featureout',
                   device=torch.device('cpu'),
                   min_matches=20, verbose=True):
    # Initialize LightGlue matcher with the ALIKED configuration.
    # Set width and depth confidence to -1 to disable point pruning and early stopping.
    # Enable mixed precision ("mp") only when running on a CUDA device.
    lg_matcher = KF.LightGlueMatcher("aliked", {"width_confidence": -1,
                                                "depth_confidence": -1,
                                                 "mp": True if 'cuda' in str(device) else False}).eval().to(device)

    # Open HDF5 files to read keypoints and descriptors and create a new file to store matches.
    with h5py.File(f'{feature_dir}/keypoints.h5', mode='r') as f_kp, \
        h5py.File(f'{feature_dir}/descriptors.h5', mode='r') as f_desc, \
        h5py.File(f'{feature_dir}/matches.h5', mode='w') as f_match:
        
        # Loop over all image pairs defined by index_pairs.
        for pair_idx in tqdm(index_pairs):
            # Get image filenames for the current pair
            idx1, idx2 = pair_idx
            fname1, fname2 = img_fnames[idx1], img_fnames[idx2]
            key1, key2 = fname1.split('/')[-1], fname2.split('/')[-1]
            
            # Load keypoints and descriptors for both images and move them to the chosen device.
            kp1 = torch.from_numpy(f_kp[key1][...]).to(device)
            kp2 = torch.from_numpy(f_kp[key2][...]).to(device)
            desc1 = torch.from_numpy(f_desc[key1][...]).to(device)
            desc2 = torch.from_numpy(f_desc[key2][...]).to(device)

            # Perform inference to match descriptors between the two images.
            with torch.inference_mode():
                # Use the LightGlue matcher to find matches based on descriptors and keypoints.
                dists, idxs = lg_matcher(desc1,
                                         desc2,
                                         KF.laf_from_center_scale_ori(kp1[None]),
                                         KF.laf_from_center_scale_ori(kp2[None]))

            # If no matches are found, skip to the next pair.
            if len(idxs) == 0:
                continue
            
            # Get the number of matches found and print the result if verbose is True.
            n_matches = len(idxs)
            if verbose:
                print(f'{key1}-{key2}: {n_matches} matches')

            # Create a group for this pair in the match file.
            group = f_match.require_group(key1)
            
            # If the number of matches is above the minimum threshold, store the matches.
            if n_matches >= min_matches:
                group.create_dataset(key2, data=idxs.detach().cpu().numpy().reshape(-1, 2))

    return


The function `import_into_colmap` imports the extracted keypoints and their matches into a COLMAP database. COLMAP is a widely used package for structure-from-motion (SfM) and multi-view stereo (MVS), which reconstruct 3D scenes from 2D images using keypoints matched across different views. Photogrammetry is the broader practice of measuring and interpreting physical objects or scenes from photographs, typically by relating points seen in different images.

The function first connects to the database at `database_path` (`colmap.db` by default), creating it if it does not already exist, and then creates the tables that store images, keypoints, descriptors, and matches.

Two settings control the cameras. The camera model is `'simple-pinhole'`, passed as an argument to `add_keypoints`. The pinhole model assumes that light enters through a small hole and projects an inverted image onto a sensor; it is a simplified representation of a real camera that ignores lens distortion and other complexities. The separate flag `single_camera` is set to False, which, as its name suggests, means the images are not assumed to share one camera, so each image can have its own intrinsics.

`add_keypoints` reads the keypoints from the feature directory, registers each image together with its camera, and stores the keypoints under the image's database ID. It returns a mapping from file name to image ID. `add_matches` then uses that mapping to store the keypoint correspondences produced by the matching step. These matches are critical for multi-view geometry, as they allow COLMAP to estimate the relative positions and orientations of the images and, from those, the 3D structure of the scene.

Finally, `commit` writes all changes to the database. At that point the database holds the information COLMAP needs to run reconstruction on the scene.
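
COLMAP's database keys each matched image pair by a single integer. The sketch below follows the pair-ID convention from COLMAP's database format; whether the `database` module imported in this notebook uses the identical helper is an assumption, since its source is not shown here.

```python
MAX_IMAGE_ID = 2**31 - 1

def image_ids_to_pair_id(image_id1: int, image_id2: int) -> int:
    """Encode an unordered image pair as one integer."""
    if image_id1 > image_id2:
        image_id1, image_id2 = image_id2, image_id1  # smaller ID goes first
    return image_id1 * MAX_IMAGE_ID + image_id2

def pair_id_to_image_ids(pair_id: int) -> tuple[int, int]:
    """Decode a pair ID back into its two image IDs."""
    image_id2 = pair_id % MAX_IMAGE_ID
    image_id1 = (pair_id - image_id2) // MAX_IMAGE_ID
    return image_id1, image_id2

pid = image_ids_to_pair_id(3, 1)
print(pid == image_ids_to_pair_id(1, 3))  # order of the two images is irrelevant
print(pair_id_to_image_ids(pid))          # (1, 3)
```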


In [9]:

def import_into_colmap(img_dir, feature_dir ='.featureout', database_path = 'colmap.db'):
    # Connect to the COLMAP database at the specified path. If the database doesn't exist, it's created.
    db = COLMAPDatabase.connect(database_path)
    
    # Create necessary tables in the COLMAP database, including the tables for images, keypoints, descriptors, etc.
    db.create_tables()

    # Flag to specify whether all images share a single camera.
    # False lets each image have its own camera; the model itself ('simple-pinhole') is passed below.
    single_camera = False
    
    # Add keypoints to the COLMAP database from the feature directory and associate them with image IDs.
    # This function will also extract camera parameters and other metadata.
    fname_to_id = add_keypoints(db, feature_dir, img_dir, '', 'simple-pinhole', single_camera)
    
    # Add matches (correspondences between images) to the COLMAP database.
    # Uses fname_to_id to map the matches stored in the feature directory to database image IDs.
    add_matches(
        db,
        feature_dir,
        fname_to_id,
    )
    
    # Commit the changes to the database, ensuring that all modifications are saved.
    db.commit()
    
    return



This cell defines a `Prediction` dataclass that represents the prediction for a single image, with the fields `image_id`, `dataset`, `filename`, `cluster_index`, `rotation`, and `translation`. The `image_id` is a unique row identifier that is used only on the hidden test set. The `dataset` field names the dataset the image belongs to, and `filename` holds the image file name. `cluster_index`, `rotation`, and `translation` are optional and hold the image's cluster assignment, rotation matrix, and translation vector.

The variable `is_train` selects the mode: `True` runs the notebook on the training data, and `False` prepares a competition submission. In submission mode the real test data is hidden and differs from what is visible in the `test` folder. `data_dir` and `workdir` give the input data path and the working directory for results. The CSV file to read is `train_labels.csv` in training mode and `sample_submission.csv` otherwise. Using `pandas`, the code reads that file and fills a dictionary `samples` whose keys are dataset names and whose values are lists of `Prediction` objects, one per CSV row. Finally, it prints the number of images in each dataset as a quick check of the data structure.
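
The grouping step can be tried on a few in-memory rows. This is a minimal sketch using only the standard library instead of `pandas`; the column names follow the ones the cell reads (`image_id`, `dataset`, `image`), while the `Row` class and the sample values are invented for illustration.

```python
import csv
import dataclasses
import io
from collections import defaultdict

@dataclasses.dataclass
class Row:
    image_id: str
    dataset: str
    filename: str

# Made-up rows standing in for the competition CSV
CSV_TEXT = """image_id,dataset,image
a1,ETs,img_001.png
a2,ETs,img_002.png
b1,stairs,img_101.png
"""

samples = defaultdict(list)  # dataset name -> list of rows
for rec in csv.DictReader(io.StringIO(CSV_TEXT)):
    samples[rec["dataset"]].append(Row(rec["image_id"], rec["dataset"], rec["image"]))

counts = {name: len(rows) for name, rows in samples.items()}
print(counts)  # {'ETs': 2, 'stairs': 1}
```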


In [10]:
# Collect vital info from the dataset

@dataclasses.dataclass
class Prediction:
    image_id: str | None  # A unique identifier for the row -- unused otherwise. Used only on the hidden test set.
    dataset: str
    filename: str
    cluster_index: int | None = None
    rotation: np.ndarray | None = None
    translation: np.ndarray | None = None

# Set is_train=True to run the notebook on the training data.
# Set is_train=False if submitting an entry to the competition (test data is hidden, and different from what you see on the "test" folder).
is_train = False
data_dir = '/kaggle/input/image-matching-challenge-2025'
workdir = '/kaggle/working/result/'
os.makedirs(workdir, exist_ok=True)

if is_train:
    sample_submission_csv = os.path.join(data_dir, 'train_labels.csv')
else:
    sample_submission_csv = os.path.join(data_dir, 'sample_submission.csv')

samples = {}
competition_data = pd.read_csv(sample_submission_csv)
for _, row in competition_data.iterrows():
    # Note: For the test data, the "scene" column has no meaning, and the rotation_matrix and translation_vector columns are random.
    if row.dataset not in samples:
        samples[row.dataset] = []
    samples[row.dataset].append(
        Prediction(
            image_id=None if is_train else row.image_id,
            dataset=row.dataset,
            filename=row.image
        )
    )

for dataset in samples:
    print(f'Dataset "{dataset}" -> num_images={len(samples[dataset])}')

Dataset "ETs" -> num_images=22
Dataset "amy_gardens" -> num_images=200
Dataset "fbk_vineyard" -> num_images=163
Dataset "imc2023_haiper" -> num_images=54
Dataset "imc2023_heritage" -> num_images=209
Dataset "imc2023_theather_imc2024_church" -> num_images=76
Dataset "imc2024_dioscuri_baalshamin" -> num_images=138
Dataset "imc2024_lizard_pond" -> num_images=214
Dataset "pt_brandenburg_british_buckingham" -> num_images=225
Dataset "pt_piazzasanmarco_grandplace" -> num_images=168
Dataset "pt_sacrecoeur_trevi_tajmahal" -> num_images=225
Dataset "pt_stpeters_stpauls" -> num_images=200
Dataset "stairs" -> num_images=51


In [11]:
gc.collect()

max_images = None  # Used For debugging only. Set to None to disable.
datasets_to_process = None  # Not the best convention, but None means all datasets.

if is_train:
    # max_images = 5

    # Note: When running on the training dataset, the notebook will hit the time limit and die. Use this filter to run on a few specific datasets.
    datasets_to_process = [
        # New data.
        'amy_gardens',
        'ETs',
        'fbk_vineyard',
        'stairs',
        # Data from IMC 2023 and 2024.
        # 'imc2024_dioscuri_baalshamin',
        # 'imc2023_theather_imc2024_church',
        # 'imc2023_heritage',
        # 'imc2023_haiper',
        # 'imc2024_lizard_pond',
        # Crowdsourced PhotoTourism data.
        # 'pt_stpeters_stpauls',
        # 'pt_brandenburg_british_buckingham',
        # 'pt_piazzasanmarco_grandplace',
        # 'pt_sacrecoeur_trevi_tajmahal',
    ]

timings = {
    "shortlisting":[],
    "feature_detection": [],
    "feature_matching":[],
    "RANSAC": [],
    "Reconstruction": [],
}
mapping_result_strs = []


print (f"Extracting on device {device}")
for dataset, predictions in samples.items():
    if datasets_to_process and dataset not in datasets_to_process:
        print(f'Skipping "{dataset}"')
        continue
    
    images_dir = os.path.join(data_dir, 'train' if is_train else 'test', dataset)
    images = [os.path.join(images_dir, p.filename) for p in predictions]
    if max_images is not None:
        images = images[:max_images]

    print(f'\nProcessing dataset "{dataset}": {len(images)} images')

    filename_to_index = {p.filename: idx for idx, p in enumerate(predictions)}

    feature_dir = os.path.join(workdir, 'featureout', dataset)
    os.makedirs(feature_dir, exist_ok=True)

    # Wrap algos in try-except blocks so we can populate a submission even if one scene crashes.
    try:
        t = time()
        index_pairs = get_image_pairs_shortlist(
            images,
            sim_th = 0.3, # should be strict
            min_pairs = 20, # we should select at least min_pairs PER IMAGE with biggest similarity
            exhaustive_if_less = 20,
            device=device
        )
        timings['shortlisting'].append(time() - t)
        print (f'Shortlisting. Number of pairs to match: {len(index_pairs)}. Done in {time() - t:.4f} sec')
        gc.collect()
    
        t = time()

        detect_aliked(images, feature_dir, 4096, device=device)
        gc.collect()
        timings['feature_detection'].append(time() - t)
        print(f'Features detected in {time() - t:.4f} sec')
        
        t = time()
        match_with_lightglue(images, index_pairs, feature_dir=feature_dir, device=device, verbose=False)
        timings['feature_matching'].append(time() - t)
        print(f'Features matched in {time() - t:.4f} sec')

        database_path = os.path.join(feature_dir, 'colmap.db')
        if os.path.isfile(database_path):
            os.remove(database_path)
        gc.collect()
        sleep(1)
        import_into_colmap(images_dir, feature_dir=feature_dir, database_path=database_path)
        output_path = f'{feature_dir}/colmap_rec_aliked'
        
        t = time()
        # Geometric verification (RANSAC) of the matches imported into the database.
        pycolmap.match_exhaustive(database_path)
        timings['RANSAC'].append(time() - t)
        print(f'Ran RANSAC in {time() - t:.4f} sec')
        
        # By default, COLMAP does not generate a reconstruction if fewer than 10 images are registered.
        # Lower the minimum to 3.
        mapper_options = pycolmap.IncrementalPipelineOptions()
        mapper_options.min_model_size = 3
        mapper_options.max_num_models = 25
        os.makedirs(output_path, exist_ok=True)
        t = time()
        maps = pycolmap.incremental_mapping(
            database_path=database_path, 
            image_path=images_dir,
            output_path=output_path,
            options=mapper_options)
        sleep(1)
        timings['Reconstruction'].append(time() - t)
        print(f'Reconstruction done in  {time() - t:.4f} sec')
        print(maps)

        clear_output(wait=False)
    
        registered = 0
        for map_index, cur_map in maps.items():
            for index, image in cur_map.images.items():
                prediction_index = filename_to_index[image.name]
                predictions[prediction_index].cluster_index = map_index
                predictions[prediction_index].rotation = deepcopy(image.cam_from_world.rotation.matrix())
                predictions[prediction_index].translation = deepcopy(image.cam_from_world.translation)
                registered += 1
        mapping_result_str = f'Dataset "{dataset}" -> Registered {registered} / {len(images)} images with {len(maps)} clusters'
        mapping_result_strs.append(mapping_result_str)
        print(mapping_result_str)
        gc.collect()
    except Exception as e:
        print(e)
        # raise e
        mapping_result_str = f'Dataset "{dataset}" -> Failed!'
        mapping_result_strs.append(mapping_result_str)
        print(mapping_result_str)

print('\nResults')
for s in mapping_result_strs:
    print(s)

print('\nTimings')
for k, v in timings.items():
    print(f'{k} -> total={sum(v):.02f} sec.')

Extracting on device cuda:0

Processing dataset "ETs": 22 images
name 'get_global_desc' is not defined
Dataset "ETs" -> Failed!

Processing dataset "amy_gardens": 200 images
name 'get_global_desc' is not defined
Dataset "amy_gardens" -> Failed!

Processing dataset "fbk_vineyard": 163 images
name 'get_global_desc' is not defined
Dataset "fbk_vineyard" -> Failed!

Processing dataset "imc2023_haiper": 54 images
name 'get_global_desc' is not defined
Dataset "imc2023_haiper" -> Failed!

Processing dataset "imc2023_heritage": 209 images
name 'get_global_desc' is not defined
Dataset "imc2023_heritage" -> Failed!

Processing dataset "imc2023_theather_imc2024_church": 76 images
name 'get_global_desc' is not defined
Dataset "imc2023_theather_imc2024_church" -> Failed!

Processing dataset "imc2024_dioscuri_baalshamin": 138 images
name 'get_global_desc' is not defined
Dataset "imc2024_dioscuri_baalshamin" -> Failed!

Processing dataset "imc2024_lizard_pond": 214 images
name 'get_global_desc' is no

The code below generates a submission file from the predictions for the image matching challenge. It defines two helper functions: `array_to_str`, which joins a sequence of numerical values into a semicolon-separated string with 9 decimal places each, and `none_to_str`, which returns a semicolon-separated string of `n` `'nan'` entries for poses that are missing.

The `submission_file` variable specifies the path of the output CSV file. The code opens this file in write mode and uses the `is_train` flag to decide which layout to write. If `is_train` is `True`, it writes the header `dataset,scene,image,rotation_matrix,translation_vector` and iterates through the `samples` dictionary, which holds the predictions. For each prediction it writes the dataset, the scene label, the image filename, the flattened rotation matrix, and the translation vector. A prediction whose `cluster_index` is `None` is labeled `outliers`, and a missing rotation or translation is filled with `'nan'` entries.

If `is_train` is set to `False`, indicating a test dataset, the code writes a different header and includes the image ID along with the other fields in the output file. After the data is written, the code uses the `!head` command to display the first few rows of the submission file to verify its content.

This process creates a structured output file in the required format, which can then be submitted for evaluation.
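
As a small illustration of the formatting step, the sketch below rewrites the two helpers as regular functions and applies them to one registered and one missing pose. The `format_pose` wrapper and the sample values are assumptions made for this example only; they are not part of the notebook.

```python
import numpy as np


def array_to_str(array):
    """Join numbers into a semicolon-separated string with 9 decimals each."""
    return ';'.join(f'{x:.09f}' for x in array)


def none_to_str(n):
    """Return n 'nan' entries separated by semicolons (used for missing poses)."""
    return ';'.join(['nan'] * n)


def format_pose(rotation, translation):
    """Hypothetical helper: format one pose the way the submission cell does."""
    rot = none_to_str(9) if rotation is None else array_to_str(np.asarray(rotation).flatten())
    trans = none_to_str(3) if translation is None else array_to_str(np.asarray(translation).flatten())
    return rot, trans


# A registered image: identity rotation and an arbitrary translation.
rot_str, trans_str = format_pose(np.eye(3), np.array([0.5, -1.0, 2.0]))
print(rot_str)
print(trans_str)

# An unregistered image: both fields fall back to 'nan' entries.
missing_rot, missing_trans = format_pose(None, None)
print(missing_rot, missing_trans)
```

Note that `flatten()` walks the 3×3 matrix in row-major order, so the nine values appear row by row in the CSV field.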


In [12]:
# Create the submission file (required).

array_to_str = lambda array: ';'.join([f"{x:.09f}" for x in array])
none_to_str = lambda n: ';'.join(['nan'] * n)

submission_file = '/kaggle/working/submission.csv'
with open(submission_file, 'w') as f:
    if is_train:
        f.write('dataset,scene,image,rotation_matrix,translation_vector\n')
        for dataset in samples:
            for prediction in samples[dataset]:
                cluster_name = 'outliers' if prediction.cluster_index is None else f'cluster{prediction.cluster_index}'
                rotation = none_to_str(9) if prediction.rotation is None else array_to_str(prediction.rotation.flatten())
                translation = none_to_str(3) if prediction.translation is None else array_to_str(prediction.translation)
                f.write(f'{prediction.dataset},{cluster_name},{prediction.filename},{rotation},{translation}\n')
    else:
        f.write('image_id,dataset,scene,image,rotation_matrix,translation_vector\n')
        for dataset in samples:
            for prediction in samples[dataset]:
                cluster_name = 'outliers' if prediction.cluster_index is None else f'cluster{prediction.cluster_index}'
                rotation = none_to_str(9) if prediction.rotation is None else array_to_str(prediction.rotation.flatten())
                translation = none_to_str(3) if prediction.translation is None else array_to_str(prediction.translation)
                f.write(f'{prediction.image_id},{prediction.dataset},{cluster_name},{prediction.filename},{rotation},{translation}\n')

!head {submission_file}

image_id,dataset,scene,image,rotation_matrix,translation_vector
ETs_another_et_another_et001.png_public,ETs,outliers,another_et_another_et001.png,nan;nan;nan;nan;nan;nan;nan;nan;nan,nan;nan;nan
ETs_another_et_another_et002.png_public,ETs,outliers,another_et_another_et002.png,nan;nan;nan;nan;nan;nan;nan;nan;nan,nan;nan;nan
ETs_another_et_another_et003.png_public,ETs,outliers,another_et_another_et003.png,nan;nan;nan;nan;nan;nan;nan;nan;nan,nan;nan;nan
ETs_another_et_another_et004.png_public,ETs,outliers,another_et_another_et004.png,nan;nan;nan;nan;nan;nan;nan;nan;nan,nan;nan;nan
ETs_another_et_another_et005.png_public,ETs,outliers,another_et_another_et005.png,nan;nan;nan;nan;nan;nan;nan;nan;nan,nan;nan;nan
ETs_another_et_another_et006.png_public,ETs,outliers,another_et_another_et006.png,nan;nan;nan;nan;nan;nan;nan;nan;nan,nan;nan;nan
ETs_another_et_another_et007.png_public,ETs,outliers,another_et_another_et007.png,nan;nan;nan;nan;nan;nan;nan;nan;nan,nan;nan;nan
ETs_another_et_another_et0

The code block below checks whether the notebook is running on the training dataset by evaluating the `is_train` flag. If `is_train` is `True`, it scores the predictions written in the previous step against the ground truth, which shows how well the estimated poses align with the reference poses.

In this case, the code calls `metric.score()`, which compares the user’s predictions (read from `submission_file`) with the ground truth (from `/kaggle/input/image-matching-challenge-2025/train_labels.csv`), using the thresholds in `train_thresholds.csv`. The `mask_csv` argument is always `None` here, because the call sits inside the `if is_train:` branch, so no mask file is applied.

The call returns two values: `final_score`, the overall score, and `dataset_scores`, the per-dataset scores. How these are computed is defined by the competition's `metric` module.

The `verbose=True` flag enables detailed output, providing information about the metric computation process. After computing the metric, the code prints how long the computation took (in seconds), displaying the time it took to complete the evaluation.

This step is only executed when running on the training set to evaluate the model's performance. For submission purposes, the submission file is the only required output, and the metric computation is skipped.
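
Because a score is only informative when poses were actually written, it can help to count registered and missing images before scoring. The sketch below is one possible check, assuming the training-set column layout described earlier; `summarize_submission`, the dataset name `demo`, and the file names are hypothetical and exist only for this example.

```python
import csv
import io
import math


def summarize_submission(csv_text):
    """Count images with and without a pose, per dataset (hypothetical helper)."""
    summary = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        rotation = [float(v) for v in row['rotation_matrix'].split(';')]
        # An image counts as registered only if no rotation entry is NaN.
        has_pose = not any(math.isnan(v) for v in rotation)
        counts = summary.setdefault(row['dataset'], {'registered': 0, 'missing': 0})
        counts['registered' if has_pose else 'missing'] += 1
    return summary


# Two made-up rows: one registered image and one outlier.
nan9 = ';'.join(['nan'] * 9)
eye9 = ';'.join(f'{v:.09f}' for v in [1, 0, 0, 0, 1, 0, 0, 0, 1])
example_csv = (
    'dataset,scene,image,rotation_matrix,translation_vector\n'
    f'demo,cluster0,a.png,{eye9},0.000000000;0.000000000;0.000000000\n'
    f'demo,outliers,b.png,{nan9},nan;nan;nan\n'
)
summary = summarize_submission(example_csv)
print(summary)
```

In the notebook, the same function could be fed the contents of `submission_file`, for example via `open(submission_file).read()`.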


In [13]:
# Compute results if running on the training set.
# Do not do this when submitting a notebook for scoring. All you have to do is save your submission to /kaggle/working/submission.csv.

if is_train:
    t = time()
    final_score, dataset_scores = metric.score(
        gt_csv='/kaggle/input/image-matching-challenge-2025/train_labels.csv',
        user_csv=submission_file,
        thresholds_csv='/kaggle/input/image-matching-challenge-2025/train_thresholds.csv',
        mask_csv=None if is_train else os.path.join(data_dir, 'mask.csv'),
        inl_cf=0,
        strict_cf=-1,
        verbose=True,
    )
    print(f'Computed metric in: {time() - t:.02f} sec.')

SUMMARY


In this pipeline, I built an image matching and reconstruction system for the Image Matching Challenge 2025. The objective was to group the images of each dataset into scenes and to predict a camera pose (rotation matrix and translation vector) for every registered image. My workflow spans global descriptor computation, pair shortlisting, local feature extraction, keypoint matching, and COLMAP-based structure-from-motion reconstruction, ending with an automated submission step.

To begin, I leveraged **DINOv2**, a transformer-based vision model, to compute **global image descriptors**. These descriptors capture holistic scene information, using token embeddings where the **[CLS] token** (a special classification embedding) encodes the entire image’s semantic content. The descriptors were **L2-normalized**, so that Euclidean distance and cosine similarity rank image pairs identically, and the model was run in PyTorch's **inference mode**, which disables autograd tracking and reduces memory use.
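
The effect of L2 normalization can be seen in a few lines of NumPy. This is a minimal sketch with made-up 2-D vectors standing in for real descriptors; `l2_normalize` is a name chosen for this example, not a function from the notebook.

```python
import numpy as np


def l2_normalize(descs, eps=1e-12):
    """Scale each row to unit length; eps guards against division by zero."""
    norms = np.linalg.norm(descs, axis=1, keepdims=True)
    return descs / np.maximum(norms, eps)


# Three toy "descriptors": the first two point in the same direction.
descs = l2_normalize(np.array([[3.0, 4.0], [6.0, 8.0], [-4.0, 3.0]]))

# For unit vectors, the dot product is the cosine similarity.
cosine = descs @ descs.T

# Pairwise Euclidean distances between the normalized rows.
l2 = np.linalg.norm(descs[:, None, :] - descs[None, :, :], axis=-1)

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b).
print(np.allclose(l2 ** 2, 2 - 2 * cosine))
```

Since the squared distance is a decreasing function of the cosine, thresholding one is equivalent to thresholding the other once the descriptors have unit length.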

To generate a shortlist of image pairs likely to match, I used a **brute-force similarity search** across descriptors, switching to **exhaustive matching** when image counts were small (`exhaustive_if_less`). Boolean masks and distance matrices guided the selection of top candidate pairs based on a strict similarity threshold.
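
The shortlisting logic described above can be sketched as follows. This is an illustration under assumptions, not the notebook's `get_image_pairs_shortlist`: it takes a precomputed distance matrix, treats `sim_th` as a maximum distance, and the function name `shortlist_pairs` is hypothetical.

```python
import numpy as np


def shortlist_pairs(dist, sim_th=0.3, min_pairs=20, exhaustive_if_less=20):
    """Pick candidate image pairs from an (n, n) descriptor distance matrix."""
    n = dist.shape[0]
    # Small collections: simply match every image against every other image.
    if n <= exhaustive_if_less:
        return [(i, j) for i in range(n) for j in range(i + 1, n)]
    pairs = set()
    for i in range(n):
        # Boolean mask: neighbours closer than the threshold.
        candidates = np.flatnonzero(dist[i] <= sim_th)
        # Too few neighbours: fall back to the closest entries instead.
        # The image itself (distance 0) is among them and is dropped below.
        if len(candidates) < min_pairs:
            candidates = np.argsort(dist[i])[:min_pairs]
        for j in candidates:
            j = int(j)
            if j != i:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Toy data: 6 random points stand in for image descriptors.
rng = np.random.default_rng(0)
points = rng.normal(size=(6, 4))
dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)

all_pairs = shortlist_pairs(dist, exhaustive_if_less=10)
few_pairs = shortlist_pairs(dist, sim_th=0.0, min_pairs=3, exhaustive_if_less=2)
print(len(all_pairs), len(few_pairs))
```

Storing each pair as a sorted tuple in a set removes duplicates when image `i` selects `j` and `j` also selects `i`.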

For local feature detection, I integrated **ALIKED** — a real-time differentiable keypoint extractor — which detects high-confidence keypoints and computes corresponding local descriptors. These were stored efficiently using **HDF5** for easy retrieval.

Feature matching was performed using **LightGlue**, a powerful transformer-based matcher designed for geometrically consistent and accurate correspondences. It matches descriptors from image pairs using both **appearance cues and geometric constraints**, filtering weak matches using confidence scores. Image pairs with fewer matches than a set threshold were skipped to preserve accuracy.

Next, I used **COLMAP**, a state-of-the-art photogrammetry engine, to build 3D models by importing detected keypoints and match pairs into its SQLite database. The system relied on the **pinhole camera model**, assuming simple intrinsics to support 3D triangulation. COLMAP then estimated camera poses using bundle adjustment and triangulated 3D landmarks from 2D correspondences.
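
The poses stored by the pipeline come from `image.cam_from_world`, that is, a rotation `R` and translation `t` that map world coordinates into the camera frame as `x_cam = R @ x_world + t`. As a hedged sketch, the camera position in world coordinates can be recovered from such a pose with `C = -R.T @ t`. The function name `camera_center` and the numbers below are made up for this example.

```python
import numpy as np


def camera_center(rotation, translation):
    """World-space camera position for a pose with x_cam = R @ x_world + t."""
    rotation = np.asarray(rotation, dtype=float)
    translation = np.asarray(translation, dtype=float)
    return -rotation.T @ translation


# Toy pose: a 90 degree rotation about the z axis plus a translation.
R = np.array([[0.0, -1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([1.0, 2.0, 3.0])

center = camera_center(R, t)
print(center)

# The camera center maps to the origin of the camera frame.
print(np.allclose(R @ center + t, 0.0))
```

This is handy when plotting a reconstruction, since the translation vector alone is not the camera position unless the rotation is the identity.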

Finally, I parsed prediction outputs into a compliant **submission CSV** by formatting the 3×3 rotation matrices and 3×1 translation vectors into flat strings. When running on the training set, I computed evaluation metrics using the competition's official scoring scripts, measuring alignment against ground truth data with configurable thresholds and confidence levels.

This end-to-end pipeline combines transformer-based vision models, real-time keypoint detectors, advanced geometric matchers, and photogrammetric reconstruction to deliver an efficient and scalable solution for large-scale image-based localization and mapping.
