!["HCI Banner Logos for ATU Sligo, the HCI Human capital initiative and Higher Education 4.0"](images/HCIBanner.png)


# SIFT

SIFT - Scale Invariant Feature Transform.

It is both a feature detector and a descriptor, and was developed by David Lowe ([SIFT Paper](https://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf)).

It is one of the most important topics in computer vision of the last two decades.

However, it was patented by the University of British Columbia, and the patent only expired in March 2020.

There are similar methods, such as [RootSIFT](https://www.robots.ox.ac.uk/~vgg/publications/2012/Arandjelovic12/arandjelovic12.pdf), which are not.

SIFT features are believed to share properties with mechanisms that evolved in primate visual systems.

## Goals of SIFT

- Extract distinctive invariant features, i.e. it should be possible to correctly match a feature in one image against a database of many images. This makes it suitable for the wide-baseline case.
- Feature descriptors should be invariant to scale and rotation.
- The features should be robust to:
    - Affine distortion.
    - Changes in 3D viewpoint.
    - Addition of noise.
    - Changes in illumination.

Think back to optical flow. It required brightness constancy and only small changes in viewpoint.

SIFT's goal was to overcome all those issues.


## Advantages of SIFT

- Features are local, so they are robust to occlusion and clutter.
- Individual features can be matched to a large database of objects.
- Many features can be generated for even small objects.  Typically, a $500\times500$ image will produce approximately $2000$ stable features.
- At the time SIFT was introduced it had near real-time performance. In the two decades since, computer hardware performance has improved roughly in line with Moore's law.
- However, embedded systems are always restricted in speed and energy usage, so whether SIFT can run in real time depends on the power and resources of the hardware platform.

## Major Stages

- Scale-space extrema detection: This first stage of computation searches over all scales and image locations. It is implemented efficiently by using a difference of Gaussian function to identify potential interest points that are invariant to scale and orientation.
- Keypoint localisation: At each candidate location, a detailed model is fit to determine location and scale. Keypoints are selected based on measures of their stability.
- Orientation assignment: One or more orientations are assigned to each keypoint location based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location for each feature, thereby providing invariance to these transformations.
- Keypoint description: The local image gradients are measured at the selected scale in the region around each keypoint. These are transformed into a representation that allows for significant levels of local shape distortion and changes in illumination.

## Scale
The problem with many feature detectors is that they work well only if the feature is just the right size.

If it is too big or too small, they don't detect it.

This is also an issue if the scale changes between images: a feature may be found in one image but not in another, so it cannot be matched.

In Canny or LoG the scale is determined by the $\sigma$ value.

If we want to look at multiple scales, we can apply the detector at each of several $\sigma$ values, but how then do we combine the results?

### Earlier Work on Scale

Marr-Hildreth used a spatial coincidence assumption, i.e. if the zero-crossings coincide over several scales then they are physically significant.
Andrew Witkin, in his paper [Scale-space filtering](https://www.ijcai.org/Proceedings/83-2/Papers/091.pdf), applied a whole spectrum of scales.
He then plotted the zero-crossings against the scales in what he called a scale-space.
He defined _stable_ features as ones that persist over a range of scales.


### $\sigma$ and LoG
Scale is about the size of the $\sigma$ value we use in the LoG function.

Different size $\sigma$ will give local maxima at different positions and scales.

Unlike Witkin, we are going to use the single best scale here, i.e. the one that gives the strongest maximum.


### Building a Scale-space
All scales must be examined to identify scale invariant features.

An efficient way is to compute the Laplacian Pyramid via a Difference of Gaussians (DoG).

This was outlined in a previous lecture.
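As a brief reminder, and using the notation of Lowe's paper (where $I$ is the input image, $G$ is a Gaussian kernel, $*$ is convolution and $k$ is the constant factor between adjacent scales), the DoG can be written as the difference of two smoothed images, which approximates the scale-normalised LoG:

\begin{equation}
	L(x,y,\sigma) = G(x,y,\sigma) * I(x,y)
\end{equation}

\begin{equation}
	D(x,y,\sigma) = L(x,y,k\sigma) - L(x,y,\sigma) \approx (k-1)\sigma^2\nabla^2G * I(x,y)
\end{equation}

Because the smoothed images $L$ are needed later anyway, the DoG costs only one image subtraction per scale.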



### Difference of Gaussian Pyramid

![](images/DoGPyramid.png)

David Lowe: [SIFT Paper](https://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf)


### How many scales?
Lowe determined that the best number of scales per octave was three (more than shown in the diagram).

The [SIFT Paper](https://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) is a master class in how to find the best parameter values and is worth reading just for that.
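To make the numbers concrete, the sketch below lists the blur levels for one octave. It assumes the values reported in Lowe's paper: a base blur of $\sigma_0=1.6$, $s=3$ scales per octave, a constant factor $k=2^{1/s}$ between levels, and $s+3$ blurred images per octave. The function name `octave_sigmas` is hypothetical.

```python
def octave_sigmas(sigma0=1.6, intervals=3):
    """Return the Gaussian blur levels for one octave.

    Adjacent levels differ by k = 2**(1/intervals), so the blur doubles
    after `intervals` steps. intervals + 3 images are produced so that
    extrema can be searched over a complete octave.
    """
    k = 2.0 ** (1.0 / intervals)
    return [sigma0 * k ** i for i in range(intervals + 3)]


sigmas = octave_sigmas()
# Subtracting adjacent blurred images gives one fewer DoG image.
num_dog_images = len(sigmas) - 1
```

With the defaults this gives six blurred images and five DoG images per octave, and the fourth blur level is twice the first, which is where the next octave would start.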



### Peak Detection


How do we detect a peak in the scale-space?

Compare each pixel ($\times$) with its 26 neighbours in the current and adjacent scales (green): 8 in its own scale and 9 in each of the scales above and below.

It is selected only if it is larger than all of its 26 neighbours or smaller than all of them.

There will be a large number of these extrema.

Generally, extrema that are close together are unstable.
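The 26-neighbour comparison can be sketched as follows. This is a minimal illustration rather than an optimised implementation: it assumes the DoG images of one octave are stored as a nested list indexed `dog[scale][row][col]`, and that the caller only passes positions that are not on a border. The name `is_extremum` is hypothetical.

```python
def is_extremum(dog, s, y, x):
    """Return True if dog[s][y][x] is larger than all 26 neighbours
    in the 3x3x3 block around it, or smaller than all of them."""
    centre = dog[s][y][x]
    neighbours = [
        dog[s + ds][y + dy][x + dx]
        for ds in (-1, 0, 1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (ds, dy, dx) != (0, 0, 0)  # skip the centre pixel itself
    ]
    return centre > max(neighbours) or centre < min(neighbours)
```

In practice most candidates fail after the first few comparisons, so an implementation would normally exit early instead of building the full neighbour list.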




![](images/26pixels.png)

Image Credit: David Lowe - SIFT Paper








### Refining Results
So we need to remove extrema that are poor candidates.

First, each candidate is fitted to the nearby data for location, scale and ratio of principal curvatures.

Points that have low contrast or are poorly localised along an edge are rejected.

To do this, Lowe uses a Taylor expansion, up to quadratic terms, of the scale-space function $D(x,y,\sigma)$ about the sample point. Here $x=(x,y,\sigma)^{\top}$ is the offset from the sample point, and $D$ and its derivatives are evaluated at that point.

\begin{equation}
	D(x) = D + \frac{\partial D^{\top}}{\partial x}x + \frac{1}{2}x^{\top}\frac{\partial^2D}{\partial x^2}x
\end{equation}

The extremum, $\hat{x}$, is determined by taking the derivative of this function with respect to $x$ and setting to zero.

\begin{equation}
	\hat{x}=-\frac{\partial^2D^{-1}}{\partial x^2}\frac{\partial D}{\partial x}
\end{equation}



If the offset $\hat{x}$ is more than 0.5 in any dimension, the extremum lies closer to a different sample point.

In that case, the sample point is changed and the interpolation is performed again about the new point.

The final offset $\hat{x}$ is added to its sample point to get the interpolated estimate for the location of the extremum.

The function value at the extremum, $D(\hat{x})$, is used for rejecting unstable extrema with low contrast.
It is given by
\begin{equation}
	D(\hat{x}) = D +\frac{1}{2}\frac{\partial D^{\top}}{\partial x}\hat{x}
\end{equation}

Lowe discarded any extremum with $|D(\hat{x})|<0.03$, assuming image pixel values are in the range $[0,1]$.
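One way to sketch this refinement step is shown below. It is an illustration under stated assumptions, not Lowe's code: the derivatives are estimated by finite differences on the $3\times3\times3$ block of DoG values around the candidate (indexed `[scale, y, x]`), and the function name `refine_extremum` is hypothetical. `np.linalg.solve` raises an error if the Hessian is singular, which a full implementation would need to handle.

```python
import numpy as np


def refine_extremum(cube):
    """Fit the quadratic Taylor model to a 3x3x3 DoG block.

    Returns (offset, value): the offset (x, y, scale) of the interpolated
    extremum from the centre sample, and the interpolated D at that point.
    """
    c = np.asarray(cube, dtype=float)
    centre = c[1, 1, 1]
    # First derivatives by central differences, ordered (x, y, scale).
    g = 0.5 * np.array([
        c[1, 1, 2] - c[1, 1, 0],
        c[1, 2, 1] - c[1, 0, 1],
        c[2, 1, 1] - c[0, 1, 1],
    ])
    # Second derivatives for the 3x3 Hessian.
    dxx = c[1, 1, 2] - 2 * centre + c[1, 1, 0]
    dyy = c[1, 2, 1] - 2 * centre + c[1, 0, 1]
    dss = c[2, 1, 1] - 2 * centre + c[0, 1, 1]
    dxy = 0.25 * (c[1, 2, 2] - c[1, 2, 0] - c[1, 0, 2] + c[1, 0, 0])
    dxs = 0.25 * (c[2, 1, 2] - c[2, 1, 0] - c[0, 1, 2] + c[0, 1, 0])
    dys = 0.25 * (c[2, 2, 1] - c[2, 0, 1] - c[0, 2, 1] + c[0, 0, 1])
    H = np.array([[dxx, dxy, dxs],
                  [dxy, dyy, dys],
                  [dxs, dys, dss]])
    # x_hat = -H^{-1} g, solved without forming the inverse explicitly.
    offset = -np.linalg.solve(H, g)
    # D(x_hat) = D + 0.5 * g^T x_hat
    value = centre + 0.5 * g.dot(offset)
    return offset, value
```

A caller would move to the neighbouring sample and repeat if any component of `offset` exceeds 0.5 in magnitude, and would reject the point if `abs(value)` is below the contrast threshold.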



### Extremum Removal



![](images/ExtremumRemoval.png)

Image Credit: David Lowe - SIFT Paper




### Eliminating Edge Responses
A poorly defined peak in the DoG will have a large principal curvature across the edge but a small one in the perpendicular direction.

The principal curvatures can be computed from a $2\times2$ [Hessian matrix](https://en.wikipedia.org/wiki/Hessian_matrix) $\mathbf{H}$ computed at the position and scale of the keypoint.

\begin{equation}
	\mathbf{H}=\begin{bmatrix}
D_{xx} & D_{xy}\\
D_{xy} & D_{yy}
 \end{bmatrix}
\end{equation}

The eigenvalues of $\mathbf{H}$ are proportional to the principal curvatures of $D$. In the same way as with the structure matrix $\textbf{M}$ in Harris-Stephens, we can determine something about the two eigenvalues without actually calculating them.

\begin{equation}
	Tr(\mathbf{H}) = \lambda_1+\lambda_2
\end{equation}

\begin{equation}
	Det(\mathbf{H}) = \lambda_1\lambda_2
\end{equation}


The important test here is whether the two eigenvalues differ greatly in size.
If the ratio of the larger to the smaller is above some threshold $r$, the keypoint is eliminated.
Lowe uses a value of $r=10$, so a keypoint is kept only if

\begin{equation}
	\frac{Tr(\mathbf{H})^2}{Det(\mathbf{H})}<\frac{(r+1)^2}{r}
\end{equation}

This is quite fast to compute.
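The test needs only the three second derivatives, as the sketch below shows. Substituting $\lambda_1=r\lambda_2$ into the trace and determinant gives $Tr(\mathbf{H})^2/Det(\mathbf{H})=(r+1)^2/r$, which is where the bound comes from. The function name `passes_edge_test` is hypothetical; the rejection of non-positive determinants follows Lowe's remark that curvatures of different sign mean the point is not an extremum.

```python
def passes_edge_test(dxx, dyy, dxy, r=10.0):
    """Return True if the keypoint is NOT edge-like and should be kept.

    dxx, dyy, dxy are the second derivatives of the DoG image at the
    keypoint, i.e. the entries of the 2x2 Hessian H.
    """
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        # Curvatures have different signs (or one is zero): reject.
        return False
    return trace * trace / det < (r + 1) ** 2 / r
```

For example, a blob-like point with `dxx = dyy = -1` passes, while an edge-like point with `dxx = -20, dyy = -1` gives a ratio of about 22 against a bound of 12.1 and is rejected.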

## Orientation Assignment
We need to achieve rotation invariance of the key-points, and this means assigning a consistent orientation to each.

The scale of the keypoint is used to select the Gaussian smoothed image, L, with the closest scale so that all computations are performed in a scale invariant manner.

For each image sample, $L(x,y)$, at this scale, the gradient magnitude, $m(x,y)$, and orientation, $\theta(x,y)$, are pre-computed using pixel differences.



\begin{equation}
m(x,y) = \sqrt{(L(x+1,y) - L(x-1,y))^2+(L(x,y+1)-L(x,y-1))^2}
\end{equation}

\begin{equation}
\theta(x,y) = \tan^{-1}\left(\frac{L(x,y+1)-L(x,y-1)}{L(x+1,y) - L(x-1,y)}\right)
\end{equation}
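These two formulas translate almost directly into code. The sketch below assumes the smoothed image is a nested list indexed `L[y][x]` and that the pixel is not on the border. It uses `math.atan2` in place of a plain inverse tangent so that the angle covers the full $360^\circ$ range; that choice, and the name `gradient`, are assumptions of this example.

```python
import math


def gradient(L, x, y):
    """Return (magnitude, orientation in degrees in [0, 360)) at pixel (x, y)."""
    dx = L[y][x + 1] - L[y][x - 1]
    dy = L[y + 1][x] - L[y - 1][x]
    magnitude = math.hypot(dx, dy)
    # atan2 resolves the quadrant, which a plain arctangent of dy/dx cannot.
    theta = math.degrees(math.atan2(dy, dx)) % 360.0
    return magnitude, theta
```

An image whose intensity increases by one per pixel in $x$ gives a magnitude of 2 and an orientation of $0^\circ$ with these un-normalised pixel differences.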

An orientation histogram is formed from the gradient orientations of sample points within a region around the keypoint.

The orientation histogram has 36 bins covering the $360^\circ$ range of orientations.

Each sample is added to the histogram after it is weighted by its gradient magnitude and by a Gaussian-weighted circular window with a $\sigma$ that is 1.5 times the scale of the keypoint.

Peaks in the orientation histogram correspond to dominant directions of local gradients.
The highest peak in the histogram is detected, and then any other local peak that is within 80% of the highest peak is also used to create a keypoint with that orientation.

So for locations with multiple peaks of similar magnitude, there will be multiple keypoints created at the same location and scale but different orientations.
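A minimal sketch of the histogram and the 80% rule follows. It assumes each sample has already been reduced to an `(orientation_in_degrees, weight)` pair, where the weight is the gradient magnitude multiplied by the Gaussian window value. The function names are hypothetical.

```python
def orientation_histogram(samples, num_bins=36):
    """Accumulate weighted orientations (in degrees) into num_bins bins."""
    hist = [0.0] * num_bins
    width = 360.0 / num_bins
    for theta, weight in samples:
        # The final modulo guards against rounding up to bin num_bins.
        hist[int((theta % 360.0) // width) % num_bins] += weight
    return hist


def dominant_bins(hist, ratio=0.8):
    """Return indices of local peaks that reach ratio * highest peak.

    The histogram is circular, so bin 0 and the last bin are neighbours.
    """
    n = len(hist)
    top = max(hist)
    return [
        i for i in range(n)
        if hist[i] >= ratio * top
        and hist[i] > hist[i - 1]
        and hist[i] > hist[(i + 1) % n]
    ]
```

Each index returned by `dominant_bins` would give rise to a separate keypoint at the same location and scale.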





### Orientation Histogram
Only about 15% of the points are assigned multiple orientations, but they contribute significantly to the stability of matching.

Finally, a parabola is fit to the three histogram values closest to each peak to interpolate the peak position for better accuracy.
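The parabola fit has a closed form, sketched below. It assumes bin $i$ covers $[i\,w,(i+1)\,w)$ degrees with $w=360/n$, so its centre is at $(i+0.5)\,w$; that convention and the name `interpolate_peak` are assumptions of this example.

```python
def interpolate_peak(hist, i):
    """Refine the orientation of the peak at bin i (result in degrees).

    A parabola through the peak and its two circular neighbours has its
    vertex at offset = 0.5 * (left - right) / (left - 2*centre + right),
    measured in bins from the centre of bin i.
    """
    n = len(hist)
    left, centre, right = hist[i - 1], hist[i], hist[(i + 1) % n]
    denom = left - 2.0 * centre + right
    offset = 0.0 if denom == 0 else 0.5 * (left - right) / denom
    width = 360.0 / n
    return ((i + 0.5 + offset) * width) % 360.0
```

When the two neighbours are equal the offset is zero and the bin centre is returned; otherwise the estimate moves towards the larger neighbour by at most half a bin.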

The peak is important. It tells us the dominant orientation for the whole feature.

Along with location and scale, this is the third piece of information that will be used as the anchor that the rest of our feature descriptor will be relative to.

## Keypoint Descriptors
The previous operations have assigned an image location, scale, and orientation to each keypoint.

These parameters impose a repeatable local 2D coordinate system in which to describe the local image region and therefore provide invariance to these parameters.

The next step is to compute a descriptor for the local image region that is highly distinctive yet is as invariant as possible to remaining variations, such as a change in illumination or 3D viewpoint.


### Feature Descriptor

![](images/KeypointDescriptor.png)

Gradients to keypoint descriptors

### Orientation Histogram


First, the image gradient magnitudes and orientations are sampled around the keypoint location, using the scale of the keypoint to select the Gaussian blur for the image.

**To achieve orientation invariance, the coordinates of the descriptor and the gradient orientations are rotated relative to the keypoint orientation.**

Each of the computed gradients is indicated by a small arrow at its sample location in the figure.


![](images/OrientationHistogram.png)

Image Credit: Szeliski - Text Book




A Gaussian weighting function with $\sigma$ equal to half the width of the descriptor window is used to assign a weight to the magnitude of each sample point. This is the circle in the figure.

This prevents sudden changes in the descriptor with small changes in the position of the window and gives less emphasis to gradients that are far from the center of the descriptor, as these are most affected by misregistration errors.



The keypoint descriptor is shown on the right side of the figure. 

It allows for a significant shift in gradient positions by creating orientation histograms over $4\times4$ sample regions. 

The figure shows eight directions for each orientation histogram, with the length of each arrow corresponding to the magnitude of that histogram entry.

A gradient sample on the left can shift up to four sample positions while contributing to the same histogram on the right, thereby allowing for larger positional shifts.



![](images/KeypointDescriptor.png)

Gradients to keypoint descriptors

## Example


![](images/sift_keypointsA.jpg)





## Boundary Effects
It is important to avoid boundary effects in which the descriptor abruptly changes as a sample shifts smoothly from being in one histogram to another or from one orientation to another.

Therefore, trilinear interpolation is used to distribute the value of each gradient sample into adjacent histogram bins.

So, each entry in a bin is multiplied by a weight of $1-d$ for each dimension, where $d$ is the distance of the sample from the central value of the bin as measured in units of the histogram bin spacing.
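The weighting can be sketched as follows. This example assumes the sample position has already been converted to continuous bin coordinates (row, column, orientation) in which integer values fall on bin centres, and that the descriptor has $4\times4$ spatial bins and 8 orientation bins. The function names are hypothetical.

```python
import math
from itertools import product


def linear_weights(coord):
    """Split a continuous bin coordinate between its two nearest bins.

    Returns [(lower_bin, 1 - d), (upper_bin, d)], where d is the distance
    from the centre of the lower bin in units of the bin spacing.
    """
    lower = math.floor(coord)
    d = coord - lower
    return [(lower, 1.0 - d), (lower + 1, d)]


def trilinear_votes(row, col, ori, magnitude, n_spatial=4, n_ori=8):
    """Distribute one gradient sample over up to 2*2*2 histogram entries."""
    votes = {}
    for (r, wr), (c, wc), (o, wo) in product(
            linear_weights(row), linear_weights(col), linear_weights(ori)):
        if not (0 <= r < n_spatial and 0 <= c < n_spatial):
            continue  # this share falls outside the descriptor window
        key = (r, c, o % n_ori)  # orientation bins wrap around
        votes[key] = votes.get(key, 0.0) + magnitude * wr * wc * wo
    return votes
```

A sample exactly at a bin centre puts all of its weight into that one entry, while a sample halfway between two orientation bins splits its weight equally between them.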



## Keypoint Descriptors

The descriptor is formed from a vector containing the values of all the orientation histogram entries, corresponding to the lengths of the arrows on the right side of the figure.

The figure shows a $2\times2$ array of orientation histograms, whereas the experiments in the paper show that the best results are achieved with a $4\times4$ array of histograms with eight orientation bins in each.

So this leads to a $4\times4\times8=128$ element feature vector for each keypoint.


## Illumination
Finally, the feature vector is modified to reduce the effects of illumination change.

First, the vector is normalised to unit length.

If a change in image contrast changes each pixel by multiplying it by a constant, this will multiply gradients by the same constant, so this constant change will be cancelled by the vector normalisation.

A brightness change in which a constant is added to the image pixels will not affect the gradient values as they are computed from pixel differences.

Therefore the descriptor is invariant to affine changes in illumination.

## Non-linear Illumination

Non-linear illumination changes can cause a large change in relative magnitudes for some gradients, but are less likely to affect the gradient orientations.

Therefore, to reduce the influence of large gradient magnitudes, we threshold the values in the unit feature vector so that each is no larger than 0.2, and then re-normalise to unit length.

So matching the magnitudes of large gradients is no longer as important. Instead, the distribution of orientations has greater emphasis.

The value of 0.2 was determined experimentally using images containing differing illuminations for the same 3D objects.
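The normalise, clamp and re-normalise sequence is short enough to sketch directly. The descriptor is assumed to be a plain list of non-negative histogram values, and the name `normalise_descriptor` is hypothetical.

```python
import math


def normalise_descriptor(vec, clamp=0.2):
    """Normalise to unit length, clamp large entries, then re-normalise."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm > 0 else list(v)

    clipped = [min(x, clamp) for x in unit(vec)]
    return unit(clipped)
```

Multiplying every entry of the input by the same constant leaves the output unchanged, which is the contrast invariance described above. Note that after re-normalising, an entry that was clamped can end up slightly above 0.2 again.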

!["HigherEd 4.0 is funded by the Human Capital Initiative Pillar 3. HCI Pillar 3 supports projects to enhance the innovation and agility in response to future skills needs"](images/HCIFunding.png)