Run with docker:

- `docker build -t image-align .`
- `docker run -v /full/path/to/local/data:/data image-align`
This scores the approach on both the train and validation sets. Note that it takes ~20 minutes to run and uses all CPU cores.
I start on problems like these by defining a metric, establishing a baseline, then iterating with the simplest possible solutions. Given that I don't have experience with this sort of data (aerial imagery of trees), I started with classical methods to get a deeper understanding of the problem space.
I first considered using the `transform` arrays provided in the manifests as the target variable, but found that they are defined for the full survey data and thus don't provide the best alignments for tile data.
The next simplest target is the aligned image, and a standard error metric between it and the transformed misaligned image; I used the pixel-wise mean squared error (MSE).
In summary, the problem is: given triplets of reference images `ref`, misaligned images `x`, and aligned images `y`, determine a transformation that minimises `mse(transform(x|ref), y)`.
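For reference, the metric itself is only a few lines of NumPy (an illustrative implementation, not necessarily the project's own):

```python
import numpy as np


def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Pixel-wise mean squared error between two images of equal shape."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))
```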
I used keypoint-based alignment of grayscale images, specifically the oriented FAST and rotated BRIEF (ORB) detector and extractor along with random sample consensus (RANSAC), both implemented in skimage and described below. This is a classical approach to image alignment using features (keypoints and descriptors).1
The rough idea is as follows:
- Find keypoints (points of interest) in both images.
- Extract descriptors (feature vectors that describe each keypoint) for each keypoint.
- Match keypoints from the first image to those in the second image, by comparing their descriptors.
- Estimate a transform using keypoint pairs.
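Here is a minimal sketch of those four steps using skimage (the function name and parameter values are mine, not necessarily those used in the repository):

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import ORB, match_descriptors
from skimage.measure import ransac
from skimage.transform import AffineTransform, warp


def align_sketch(ref_rgb, mis_rgb, n_keypoints=500):
    """Estimate an affine transform mapping the misaligned image onto the reference."""
    ref, mis = rgb2gray(ref_rgb), rgb2gray(mis_rgb)

    # Steps 1 & 2: detect keypoints and extract binary descriptors with ORB.
    orb = ORB(n_keypoints=n_keypoints)
    orb.detect_and_extract(ref)
    ref_kp, ref_desc = orb.keypoints, orb.descriptors
    orb.detect_and_extract(mis)
    mis_kp, mis_desc = orb.keypoints, orb.descriptors

    # Step 3: brute-force matching (Hamming distance for binary descriptors).
    matches = match_descriptors(ref_desc, mis_desc, cross_check=True)

    # Step 4: robustly estimate an affine transform with RANSAC.
    # skimage keypoints are (row, col); geometric transforms expect (x, y).
    src = mis_kp[matches[:, 1]][:, ::-1]
    dst = ref_kp[matches[:, 0]][:, ::-1]
    model, inliers = ransac((src, dst), AffineTransform,
                            min_samples=3, residual_threshold=2, max_trials=1000)

    # Warp the misaligned image into the reference frame.
    aligned = warp(mis_rgb, model.inverse)
    return model, aligned
```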
ORB, as the name suggests, uses the Features from Accelerated Segment Test (FAST) keypoint detector and a modified Binary Robust Independent Elementary Features (BRIEF) binary feature descriptor.
FAST determines whether a given pixel is a corner (or keypoint) by considering a circle of pixels around it: if N contiguous pixels along the circle are all darker (or all lighter) than the center pixel by more than some threshold, it's labelled as a corner.
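skimage also exposes FAST directly, which can be handy for inspecting keypoints in isolation (illustrative only; ORB uses an oriented variant of FAST internally):

```python
from skimage.color import rgb2gray
from skimage.feature import corner_fast, corner_peaks

# `n` is the number of contiguous circle pixels that must be brighter or darker
# than the centre by `threshold` for the centre pixel to count as a corner.
gray = rgb2gray(ref_rgb)  # `ref_rgb` as in the sketch above
response = corner_fast(gray, n=12, threshold=0.15)
keypoints = corner_peaks(response, min_distance=1)
```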
BRIEF computes a bit string via binary tests comparing the intensities of pairs of pixels in a given arrangement. Since these binary tests rely only on intensity, they are noise-sensitive, so the image is smoothed beforehand.2 The spatial arrangement of the point pairs was determined through an experiment comparing random arrangements drawn from different distributions, using ground-truth data of images undergoing known homographies and quantifying how well keypoints from the first image could be redetected in the others. The best arrangement was found to be an isotropic Gaussian distribution with variance S^2/25, where S is the width of the square image patch.
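skimage ships a standalone `BRIEF` extractor whose `mode="normal"` option corresponds to the Gaussian sampling arrangement described above (the parameter values here are illustrative defaults, not tuned for this problem):

```python
from skimage.feature import BRIEF

# Gaussian-distributed sampling pairs; the image is smoothed (sigma) before the
# binary intensity tests are evaluated at the detected keypoints.
extractor = BRIEF(descriptor_size=256, patch_size=49, mode="normal", sigma=1)
extractor.extract(gray, keypoints)          # `gray`, `keypoints` from the FAST example
descriptors = extractor.descriptors         # boolean array, one row per kept keypoint
kept_keypoints = keypoints[extractor.mask]  # keypoints too close to the border are dropped
```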
I used skimage's `match_descriptors`, which uses brute-force matching, i.e., each descriptor in the reference image is matched to the "closest" descriptor in the misaligned image using the Hamming distance (skimage's default metric for binary descriptors).
RANSAC iteratively fits a model (in our case, an affine transformation) to minimal random subsets of the dataset (point pairs between the reference and misaligned images), discards points whose residuals under the fitted model are too large, and repeats until a sufficient proportion of the data are inliers.
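To make the RANSAC step concrete, here is a small self-contained example on synthetic point pairs (not the project's data): a known affine map relates most of the points, and a fifth of the "matches" are deliberately corrupted.

```python
import numpy as np
from skimage.measure import ransac
from skimage.transform import AffineTransform

rng = np.random.default_rng(0)

# Points related by a known affine map, plus 20 outlier "matches".
true = AffineTransform(rotation=0.1, translation=(12, -7))
src = rng.uniform(0, 256, size=(100, 2))
dst = true(src)
dst[:20] += rng.uniform(-60, 60, size=(20, 2))

# min_samples=3: an affine transform is determined by three point pairs.
model, inliers = ransac((src, dst), AffineTransform,
                        min_samples=3, residual_threshold=2, max_trials=1000)
print(model.translation, inliers.sum())  # translation close to (12, -7); ~80 inliers
```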
The code is structured around a primary `find_transform` function, composed from `preprocess`, `match_keypoints`, and `estimate_transform` functions. This design makes it easy to swap out any of those steps in the search for a more performant method, either through hyperparameter search or by implementing new functions with the same interfaces.
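The composition might look roughly like the sketch below. The function names come from the description above, but the signatures are assumptions; passing the steps in as arguments is just one way to make them swappable, and the actual code may simply call module-level functions directly.

```python
from typing import Callable

import numpy as np
from skimage.transform import AffineTransform


def find_transform(
    ref_rgb: np.ndarray,
    mis_rgb: np.ndarray,
    preprocess: Callable,
    match_keypoints: Callable,
    estimate_transform: Callable,
) -> AffineTransform:
    """Compose the three interchangeable steps into the full pipeline."""
    ref, mis = preprocess(ref_rgb), preprocess(mis_rgb)   # e.g. grayscale, rescale
    src, dst = match_keypoints(ref, mis)                  # e.g. ORB + brute-force matching
    return estimate_transform(src, dst)                   # e.g. RANSAC + AffineTransform
```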
The problem is embarrassingly parallel; parallel processing is implemented using the standard library's `concurrent.futures.ProcessPoolExecutor`.
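Because each image pair is independent, the per-pair work can simply be fanned out across cores. A minimal sketch, where the hypothetical `process_pair` stands in for the per-pair alignment and scoring:

```python
from concurrent.futures import ProcessPoolExecutor


def process_pair(pair):
    """Placeholder for the per-pair work: load the images, find a transform,
    warp the misaligned image, and return its MSE against the aligned image."""
    ref_path, mis_path = pair
    return 0.0  # stand-in result; the real function would return the pair's MSE


if __name__ == "__main__":
    # Illustrative paths; the real code would enumerate the manifest.
    pairs = [("ref_0.tif", "mis_0.tif"), ("ref_1.tif", "mis_1.tif")]
    # One worker process per CPU core by default; executor.map preserves input order.
    with ProcessPoolExecutor() as executor:
        scores = list(executor.map(process_pair, pairs))
    print(sum(scores) / len(scores))
```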
The main performance criteria are speed (train time and inference time) and accuracy (MSE between aligned and transformed misaligned images). Tests were performed on a MacBook Pro (2017, 3.1 GHz Dual-Core Intel Core i5, 16 GB 2133 MHz LPDDR3, Intel Iris Plus Graphics 650 1536 MB).
- Speed. Train time: No training or hyperparameter tuning was performed. Inference time: ~8 seconds per image.
- Accuracy. Average train MSE: 3.998e-4, average validation MSE: 5.032e-5.
INFO:main:Train MSE: 0.00039981827
INFO:main:Train duration: 0:18:55.040967
INFO:main:Val MSE: 5.031572e-05
INFO:main:Val duration: 0:10:46.224172
See the next section for suggested improvements.
I see the following shortcomings in this solution:
- Does not leverage the supervised dataset. I prefer to start with simpler solutions, and thus spent time on more classical approaches. However, since the problem already has a supervised dataset with a few hundred training examples, a method that actually learns from data would likely be more performant.
- Does not make the best use of all three RGB channels. ORB only works on a single channel, so the channels must be aggregated or one selected, losing possibly crucial information.
- Slow. Detecting keypoints, extracting descriptors, and running RANSAC on each new image pair is slow. An end-to-end model trained to predict a transformation given reference and misaligned images would likely have slower training times but faster inference times.
- Downscaling the reference image loses information.
With more time, I'd consider trying an end-to-end convolutional neural network (CNN) that accepts a pair of images as input, reference and misaligned, and predicts a transformation matrix. This paper and this implementation look like good starting points.
I excluded this solution's code because the instructions suggested that only tile images are available, but I think it's interesting enough to mention.
If you're provided the full raster image TIFF files, there's a simple solution. Assuming that the RGB and RED sensors are placed close together (and thus capture mostly the same scene), and that both images were captured in the same drone flight, the shape of the non-zero region in the RGB image and in the RED image should be identical. We can therefore find a transform using an area-based alignment method on the non-zero pixel masks.
I used OpenCV's `findTransformECC`, which implements the algorithm in this paper. It's a gradient-based iterative approach that maximises a nonlinear optimisation criterion: the enhanced correlation coefficient between the intensities of corresponding pixels in the two images.
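A sketch of that idea with OpenCV (the mask construction and parameter values are assumptions, not taken from the original code):

```python
import cv2
import numpy as np


def align_masks(rgb_raster: np.ndarray, red_raster: np.ndarray) -> np.ndarray:
    """Estimate an affine warp aligning the RED raster to the RGB raster
    using only the footprint (non-zero mask) of each image."""
    # Non-zero footprints as float32 single-channel images, as ECC expects.
    # Assumes rgb_raster is H x W x 3 and red_raster is H x W.
    rgb_mask = (rgb_raster.sum(axis=2) > 0).astype(np.float32)
    red_mask = (red_raster > 0).astype(np.float32)

    warp_matrix = np.eye(2, 3, dtype=np.float32)  # initial guess: identity affine
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 1000, 1e-6)

    # Iteratively maximise the enhanced correlation coefficient between the masks.
    _, warp_matrix = cv2.findTransformECC(
        rgb_mask, red_mask, warp_matrix, cv2.MOTION_AFFINE, criteria, None, 5
    )
    return warp_matrix
```

The resulting 2x3 matrix can then be applied to the full-resolution RED raster with `cv2.warpAffine`.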
Footnotes

1. Further reading:
   - Image Alignment (Feature Based) using OpenCV (C++/Python)
   - ORB feature detector and binary descriptor
   - Using geometric transformations
2. A nice way to think of this is that the test computes the sign of a derivative. Smoothing is therefore useful here for the same reasons as in edge detection.