!["HCI Banner Logos for ATU Sligo, the HCI Human capital initiative and Higher Education 4.0"](images/HCIBanner.png)

# Photometry to Geometry

Up to this point we have talked a lot about points in space.

But these don't really exist unless we define them with a coordinate system.

There are an infinite number of points in the world. 

We talked about how points map to images, but really it is not the point itself that maps to the image. It is the brightness of the point that maps to an intensity (or colour) value at a pixel. 

Pick a point in the middle of the air. Is there any point in asking what this maps to? Not really because it has no reflectance. 

Likewise a completely black object in the dark wouldn't map to anything.

This is important because when we want to re-construct the 3D world we will start with images and images don't contain points, they contain brightness values or colour values.

The measurement of these values is called photometry. We need to be able to associate these colour values with points in the image, so that we can then associate those image points with points in 3D space that correspond to objects of interest.

This makes the job of feature recognition in images very important. 

By features here I am referring to primitive features such as edges, lines, corners and primitive shapes. 

The term object recognition is usually reserved for recognising composite objects like cars and people.

As we saw in the section on perspective projection, 3D information is lost in an image. 
So we cannot determine 3D points from a single image. 

Instead we will need to find corresponding points in more than one image (perspective) and from that infer what the 3D point must be. 

This means recognising features as existing in two images despite being imaged from two different perspectives. 
Later we will look at different methods of doing this but for the moment you need to be aware that this is necessary in order to move from photometry to geometry.
     

## Sparse or Dense reconstruction

Not every pixel relates to a point that is useful for reconstructing a scene, so the question arises: how many corresponding features must you find between two images in order to determine the 3D geometry of a space? 

This is not a simple question to answer. 
To determine camera intrinsic and extrinsic parameters we have a set number of unknowns and this will govern how many points would be required. 

For 3D reconstruction, however, the more points we use the more accurate the reconstruction, but the more calculation and memory are required. 

Fewer points means a reconstruction with fewer points in the 3D space but less processing time.
  
Let's think about the autonomous driving case. 
What do we need? 

Do we need to see the raised writing on the back of a car that tells us the model? 
Do we need to see the contours of the car? 
Would a very basic box do the job satisfactorily? 

In autonomous driving, real-time operation and processing load are both important.

But if a basic box around a car misses a tow bar or a bike rack, that may be problematic. 
One thing is clear, there will always have to be compromise.
   

Reconstruction from a small number of points is called sparse reconstruction whereas using lots of points is called dense. 

There is no obvious border between the two so these are somewhat fluid terms.

Get used to them though as they have many uses throughout maths, science and engineering.

As this module is aimed at the autonomous vehicle industry we will mainly look at sparse methods. 
But be aware that dense methods exist.

  

## Small Vs Wide baseline 

If we wish to find points in two images that correspond to the same point in the world then we need to determine something about the distance between the two camera views. 

If we can assume that there is only a small difference between two views then this opens up algorithms and options that are not available if the baseline between the views is large.

However, small baselines often make re-construction less precise as we do not have the requisite gap between our stereo images to determine depth. 

Usually the further apart the two views, the more precise the depth estimation. 

In wide baseline we normally need much richer detail in our feature descriptors in order to match features across views that have changed so much.

It is often desirable to use a combination of the two.
  

The lines between small and large baseline are also becoming blurred with time as small baseline methods improve and begin to stretch their range.

For example, we can work on several re-sized versions of the images (often called an image pyramid), from low-res to high-res. A displacement that is large in the high-res image is small in the low-res one, so estimating the motion at low resolution first allows us to handle much larger displacements in the high-res images.
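A minimal sketch of an image pyramid is given here, assuming NumPy and a greyscale image held as a 2D array. The helper names `downsample_by_two` and `build_pyramid` are made up for this illustration, and halving by averaging 2x2 blocks is a simplification (practical pyramids usually smooth the image before subsampling).

```python
import numpy as np

def downsample_by_two(image):
    """Halve each dimension by averaging non-overlapping 2x2 blocks."""
    rows, cols = image.shape
    rows -= rows % 2  # drop a trailing row/column if the size is odd
    cols -= cols % 2
    blocks = image[:rows, :cols].reshape(rows // 2, 2, cols // 2, 2)
    return blocks.mean(axis=(1, 3))

def build_pyramid(image, levels):
    """Return [full-res, half-res, quarter-res, ...] with `levels` entries."""
    pyramid = [image.astype(float)]
    for _ in range(levels - 1):
        pyramid.append(downsample_by_two(pyramid[-1]))
    return pyramid

# Stand-in image: any 2D array of intensities would do here.
image = np.arange(48 * 64, dtype=float).reshape(48, 64)
pyramid = build_pyramid(image, levels=3)
shapes = [level.shape for level in pyramid]

# An 8 pixel displacement at full resolution shrinks by a factor of 2 per level.
coarse_displacement = 8 / 2 ** (len(pyramid) - 1)
print(shapes, coarse_displacement)
```

The point to take from it is that a motion too large for a small baseline method at full resolution becomes a small motion at the coarsest level, where it can be estimated first and then refined level by level.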

For the rest of this lecture we will be looking at the small baseline (small deformation) model.
  
Normally, in the small deformation model, the images do not come from multiple cameras but from a single moving camera. 

This, of course, means that the views are not just separated in space but in time as well. 
For small baseline we make a number of assumptions but the most important of these is that the views have a very small displacement in space and time. 

So even with video (30-60 fps) this limits the speed of movement that can be handled: if the camera, or something within the view, moves quickly, the displacement between consecutive frames may be too large.

In the autonomous vehicle setting the camera and other cars can move quickly, although they do at least tend to move smoothly and in predictable directions, which helps. 
     

## Optic Flow Estimation

For small displacement, classical optic flow estimation can be used. 

This dates to 1981, when Lucas and Kanade and, separately, Horn and Schunck came up with algorithms for it. 

Horn-Schunck is a dense method, giving a displacement for every pixel in the image, whereas Lucas-Kanade is a sparse method. As we are more interested in sparse methods, we will concentrate on Lucas-Kanade.

The analysis of flow is used throughout science and engineering, but optic flow refers to apparent flow, i.e. the flow seen in a perspective image.

Example: Bicycle on the back of a car.
  

## Assumptions in the Optic Flow 
1. The scene between views is static.
2. The movements are rigid.
3. Brightness Constancy.

    

## Lucas-Kanade

The brightness constancy assumption can be defined as follows.

Let $x(t)$ denote a moving point at time $t$, and $I(x,t)$ a sequence of images (video), then:
$$I(x(t),t)= \text{const    } \forall t$$

We don't know what the constant is but if we take the derivative of a function that is constant we should get zero.

$$\frac{d}{dt}I(x(t),t)=\nabla I^{\top} \left(\frac{dx}{dt}\right)+\frac{\partial I}{\partial t} = 0$$
  
What we are looking to find out is how points (pixels) have moved in the image.

This is given by the one thing we can't compute here which is the derivative of the pixel's position with respect to time. We will call this the velocity vector $v$.
$$ v=\frac{dx}{dt}$$
Keep in mind that $x$ represents a coordinate $(x,y)$.
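As a quick numerical check of the brightness constancy equation, here is a small sketch assuming NumPy. The synthetic pattern, the `frame` helper and the chosen velocity are all invented for the illustration, and the derivatives are approximated with finite differences, so the result is only approximately zero.

```python
import numpy as np

def frame(t, velocity, shape=(40, 50)):
    """Smooth synthetic pattern translated by `velocity` pixels per frame."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]].astype(float)
    vx, vy = velocity
    return np.sin(0.2 * (xs - vx * t)) + np.cos(0.15 * (ys - vy * t))

v = np.array([0.3, -0.2])                       # assumed (v_x, v_y) in pixels per frame
I_prev, I_now, I_next = (frame(t, v) for t in (-1, 0, 1))

I_y, I_x = np.gradient(I_now)                   # axis 0 is y (rows), axis 1 is x (columns)
I_t = (I_next - I_prev) / 2.0                   # central difference in time

# Left-hand side of the constraint: grad(I)^T v + dI/dt
residual = I_x * v[0] + I_y * v[1] + I_t

# Ignore the border, where np.gradient falls back to one-sided differences.
max_residual = np.abs(residual[1:-1, 1:-1]).max()
max_temporal = np.abs(I_t).max()
print(max_residual, max_temporal)
```

The residual should come out far smaller than the temporal derivative itself, which is what the equation predicts when the only change between frames is the motion $v$.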
    
Unfortunately we cannot solve this for $v$ at a single pixel. It is one scalar equation in two unknowns, and the product $\nabla I^{\top} \left(\frac{dx}{dt}\right)$ discards any part of $v$ that is not in the direction of $\nabla I$.

However, if we make a further assumption that there is constant motion in some small neighbourhood of pixels, then we can try to find a _best_ vector $v$ that solves the equation. We will call the neighbourhood $W(x)$.

$$\nabla I(x',t)^{\top}v+\frac{\partial I}{\partial t}(x',t) = 0 \quad \forall x' \in W(x)$$

$$\mathcal{L}(v) = \sum_{W(x)}|\nabla I(x',t)^{\top}v +I_t(x',t)|^2 \,  dx'$$
$\mathcal{L}$ is the loss or cost function which we want to find the lowest value of, i.e. we want to find the velocity vector $v$ that gives us the smallest loss. Ideally this loss would be zero.
    
To find the minimum of a function we find its derivative with respect to $v$ and set it to zero.
First we expand the terms.

$$\mathcal{L}(v) = \sum_{W(x)}(v^{\top}\nabla I \nabla I^{\top}v +2I_t\nabla I^{\top}v+ I_t^2) \; dx'$$
The $I_t^2$ term has no $v$ component, so it vanishes when we differentiate and does not affect the minimum.
$$\frac{d\mathcal{L}}{dv}=\sum_{W(x)}(2\nabla I \nabla I^{\top}v + 2I_t\nabla I) = 0$$
    

We will make the following matrix and vector to make life easier on ourselves.
$$\mathbf{M}= \sum_{W(x)}\begin{bmatrix}
               I^2_x & I_xI_y   \\
               I_xI_y & I^2_y          
            \end{bmatrix} \, dx'
	 \text{ and } q = \sum_{W(x)} I_t \nabla I \, dx'$$
	 
So we have 

$$\frac{d\mathcal{L}}{dv}=2Mv+2q = 0$$



Now we can find $v$ as follows, but only if $M$ is invertible.
$$v = -M^{-1}q$$
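The solution for a single window can be sketched in a few lines, assuming NumPy. The synthetic frames, the `frame` and `lucas_kanade_window` helpers and the window position are illustrative choices, and the derivatives are finite-difference approximations, so the recovered velocity is close to the true one rather than exact.

```python
import numpy as np

def frame(t, velocity, shape=(40, 50)):
    """Smooth synthetic pattern translated by `velocity` pixels per frame."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]].astype(float)
    vx, vy = velocity
    return np.sin(0.2 * (xs - vx * t)) + np.cos(0.15 * (ys - vy * t))

def lucas_kanade_window(I_x, I_y, I_t):
    """Return (v, M, q) for one window, where v solves M v = -q."""
    grad = np.stack([I_x.ravel(), I_y.ravel()], axis=1)  # rows are [I_x, I_y]
    M = grad.T @ grad            # [[sum I_x^2, sum I_x I_y], [sum I_x I_y, sum I_y^2]]
    q = grad.T @ I_t.ravel()     # [sum I_t I_x, sum I_t I_y]
    v = -np.linalg.solve(M, q)   # same result as -inv(M) @ q, without forming the inverse
    return v, M, q

true_v = np.array([0.3, -0.2])                   # assumed (v_x, v_y) in pixels per frame
I_prev, I_now, I_next = (frame(t, true_v) for t in (-1, 0, 1))
I_y, I_x = np.gradient(I_now)                    # axis 0 is y (rows), axis 1 is x (columns)
I_t = (I_next - I_prev) / 2.0                    # central difference in time

window = (slice(10, 25), slice(15, 30))          # a 15 x 15 neighbourhood W(x)
v_est, M, q = lucas_kanade_window(I_x[window], I_y[window], I_t[window])
print(v_est)
```

Using `np.linalg.solve` rather than an explicit inverse is a common numerical preference and raises an error when $M$ is exactly singular, which ties in with the invertibility condition above.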
  

## Translational motion

What does the velocity vector $v$ look like?

Well, as we are in the image plane we can only move in x and y. 
No other options are available.

So the velocity vector $v$ will be a two-dimensional vector (three in homogeneous coordinates). 

At least, that is the assumption made in the original Lucas-Kanade paper, i.e. that under very small movements the motion can be well approximated by a simple translation in (x,y). 
  
As this is optical flow, and therefore motion in 2D, translation has only two degrees of freedom $(x,y)$. 

We will call this version of the velocity vector $v_{tr}$.

$$\text{minimise  }\mathcal{L}(v_{tr}) = \sum_{W(x)}|\nabla I^{\top}v_{tr}+I_t|^2 \, dx'$$

$$\frac{d\mathcal{L}}{dv_{tr}}=\sum_{W(x)}(2\nabla I \nabla I^{\top}v_{tr} + 2I_t\nabla I) = 0 $$
$$v_{tr} = -M^{-1}q$$
As this is a translation, the new coordinates are found from the old using
$$h(x) = x+v_{tr}$$

    

## Affine motion
Now consider the following. 

A front mounted camera on a car is approaching an object that is head on (i.e. on the optical axis or close to it). 
The object will not move in a translation. 
But the object will get larger. 

This change of scale, along with rotations and translations, comes under the heading of affine motion. 
It has six degrees of freedom.

However, we still need to make a move for each pixel in 2D and our $M$ matrix is a $2\times2$ matrix, so how do we get these six degrees of freedom into the equations above?
    
We will call the affine velocity vector $v_{af}$.
We create a $2\times6$ matrix out of our (x,y,1) homogeneous coordinates as follows.
$$S(x) = \begin{bmatrix}
               x & y&1&0&0&0   \\
               0&0&0&x & y&1          
            \end{bmatrix}$$
With our $v_{af}$ as follows
$$v_{af} = \begin{bmatrix}
                p_1\\
                p_2\\
                p_3\\
                p_4\\
                p_5\\
                p_6
            \end{bmatrix}$$
            

Now to determine where any given coordinate will move in (x,y) we get


$$u(x) = S(x)v_{af} = \begin{bmatrix}
               x & y&1&0&0&0   \\
               0&0&0&x & y&1          
            \end{bmatrix} \begin{bmatrix}
                p_1\\
                p_2\\
                p_3\\
                p_4\\
                p_5\\
                p_6
            \end{bmatrix} = \begin{bmatrix}
               p_1x+p_2y+p_3  \\
               p_4x+p_5y+p_6          
            \end{bmatrix}$$ 
            
            
Note that $u(x)$ is not the new value of the coordinate but the change in the coordinate so the new value of a coordinate would be given by 
$$h(x) = x+u(x)$$
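To make the role of $S(x)$ concrete, here is a small sketch assuming NumPy. The parameter values are invented for the example, and the coordinates are assumed to be measured from the point the image expands about (for instance the image centre).

```python
import numpy as np

def S(x, y):
    """2 x 6 matrix built from the homogeneous coordinate (x, y, 1)."""
    return np.array([[x, y, 1.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, x, y, 1.0]])

def displacement(x, y, v_af):
    """u(x) = S(x) v_af, the change in the coordinate (not the new coordinate)."""
    return S(x, y) @ v_af

# Assumed example: the image expands by 10% about the origin,
# as when an object on the optical axis is being approached.
v_scale = np.array([0.1, 0.0, 0.0, 0.0, 0.1, 0.0])
point = np.array([10.0, -5.0])
u = displacement(point[0], point[1], v_scale)
h = point + u                                    # new coordinate h(x) = x + u(x)

# With p1 = p2 = p4 = p5 = 0 only (p3, p6) remain: a pure translation.
v_shift = np.array([0.0, 0.0, 0.3, 0.0, 0.0, -0.2])
u_shift = displacement(point[0], point[1], v_shift)
print(u, h, u_shift)
```

Note how the displacement now depends on where the pixel is, which a single translation vector cannot express, while the translational model is recovered as a special case.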
    

So ...
$$\text{minimise  }\mathcal{L}(v_{af}) = \sum_{W(x)}|\nabla I^{\top}S(x')v_{af}+I_t|^2 \, dx'$$

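$$\frac{d\mathcal{L}}{dv_{af}}=\sum_{W(x)}(2S(x')^{\top}\nabla I \nabla I^{\top}S(x')v_{af} + 2I_tS(x')^{\top}\nabla I) = 0 $$
Because $S(x')$ changes from pixel to pixel it has to stay inside the sum, so the matrix we invert is now $6\times6$.
$$v_{af} = -\left(\sum_{W(x)}S(x')^{\top}\nabla I \nabla I^{\top}S(x')\right)^{-1}\sum_{W(x)}I_tS(x')^{\top}\nabla I$$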
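The affine solve can be checked algebraically with the sketch below, assuming NumPy. It accumulates $S(x')^{\top}\nabla I \nabla I^{\top}S(x')$ and $I_t S(x')^{\top}\nabla I$ over the window, keeping $S(x')$ inside the sums, and then solves the resulting $6\times6$ system. The derivative samples are random numbers constructed to satisfy the brightness constancy equation exactly, so this tests the linear algebra only and is not a realistic image. The function name and parameter values are hypothetical.

```python
import numpy as np

def affine_flow_parameters(I_x, I_y, I_t, xs, ys):
    """Least-squares affine parameters (p1..p6) for one window of samples."""
    # Row i is S(x_i)^T grad I(x_i) = [x I_x, y I_x, I_x, x I_y, y I_y, I_y].
    A = np.stack([xs * I_x, ys * I_x, I_x,
                  xs * I_y, ys * I_y, I_y], axis=1)
    normal_matrix = A.T @ A      # sum over the window of S^T grad I grad I^T S (6 x 6)
    rhs = A.T @ I_t              # sum over the window of I_t S^T grad I (6,)
    return -np.linalg.solve(normal_matrix, rhs)

rng = np.random.default_rng(0)
ys, xs = np.mgrid[-7:8, -7:8].astype(float)      # 15 x 15 window centred on the origin
xs, ys = xs.ravel(), ys.ravel()
I_x = rng.normal(size=xs.size)                   # made-up spatial derivatives
I_y = rng.normal(size=xs.size)

true_p = np.array([0.02, -0.01, 0.3, 0.01, 0.02, -0.2])   # assumed affine parameters
u_x = true_p[0] * xs + true_p[1] * ys + true_p[2]
u_y = true_p[3] * xs + true_p[4] * ys + true_p[5]
I_t = -(I_x * u_x + I_y * u_y)                   # enforce grad(I)^T u + I_t = 0 exactly

estimated_p = affine_flow_parameters(I_x, I_y, I_t, xs, ys)
print(estimated_p)
```

With real images the derivatives would come from finite differences and the recovered parameters would only be approximate, and the $6\times6$ matrix can be singular for the same reasons as the $2\times2$ one.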
     

## The Aperture Problem
There is one major problem in the mathematics of Lucas-Kanade which can cause it to fail entirely.

If the M matrix is not invertible then the velocity vector cannot be recovered.
What would cause M to not be invertible?

One cause is a window $W(x)$ with an entirely constant intensity (a uniform road surface, for example, or the flat side of a vehicle such as a lorry).
     

A constant intensity in the spatial domain means $\nabla I(x) = 0$ and $I_t(x) = 0$ for all of the points in the window.
Then 
$$\mathbf{M}= \sum_{W(x)}\begin{bmatrix}
               I^2_x & I_xI_y   \\
               I_xI_y & I^2_y          
            \end{bmatrix} dx'$$
is not invertible because it is full of zeros.
In the situation where the window straddles an edge, so that there is a gradient in one direction but not in both, M will not be entirely zero but it will be a singular matrix with $\det(M) = 0$ and so still not invertible.

While this is undesirable, all is not lost as we can still compute the motion in the direction of the image gradient, i.e. the part we have a gradient for. 
This would be called the normal motion as the gradient is _normal_ to some edge.
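The edge case can be reproduced with a short sketch, assuming NumPy. The synthetic vertical edge, the `edge_frame` helper and the velocity are made up for the illustration. Using the pseudo-inverse to pick out the component along the gradient is one convenient choice here, not something prescribed by the method.

```python
import numpy as np

def edge_frame(t, velocity, shape=(21, 21)):
    """Vertical edge: intensity varies with x only, moving with `velocity`."""
    _, xs = np.mgrid[0:shape[0], 0:shape[1]].astype(float)
    vx, _vy = velocity           # motion along the edge (v_y) leaves the image unchanged
    return np.tanh((xs - 10.0 - vx * t) / 4.0)

true_v = np.array([0.3, 0.5])                    # assumed (v_x, v_y) in pixels per frame
I_prev, I_now, I_next = (edge_frame(t, true_v) for t in (-1, 0, 1))
I_y, I_x = np.gradient(I_now)                    # I_y is zero everywhere for this image
I_t = (I_next - I_prev) / 2.0

window = (slice(3, 18), slice(3, 18))
gx, gy, gt = I_x[window].ravel(), I_y[window].ravel(), I_t[window].ravel()
M = np.array([[gx @ gx, gx @ gy],
              [gx @ gy, gy @ gy]])
q = np.array([gt @ gx, gt @ gy])

rank = np.linalg.matrix_rank(M)                  # 1, so M cannot be inverted
normal_flow = -np.linalg.pinv(M) @ q             # minimum-norm solution, along the gradient
print(rank, normal_flow)
```

Only the component of the motion across the edge is recovered (approximately, because of the finite differences). The component along the edge, 0.5 in this example, is invisible in the images and comes back as zero.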
     

## Simple Feature Tracking Algorithm (KLT)
The following steps assume that the matrix M is invertible at the points being tracked.


- For a given time sample $t$ (frame), compute at each coordinate (x,y) in the image frame the matrix M 

$$\mathbf{M}= \sum_{W(x)}\begin{bmatrix}
               I^2_x & I_xI_y   \\
               I_xI_y & I^2_y          
            \end{bmatrix} dx'$$

- Mark all coordinates for which the determinant of M is at least some threshold $\theta > 0$, i.e. 

$$\det \mathbf{M}(x)\geq \theta$$

- For all those coordinates the local velocity vector can be calculated by

$$v(x,t)=-\mathbf{M}(x)^{-1}\begin{bmatrix}
	\sum_{W(x)}I_xI_tdx'\\
	\sum_{W(x)}I_yI_tdx'\\
	\end{bmatrix}$$

- Repeat the above steps for the points $x+v$ at time $t+1$.
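One pass of the steps above can be sketched as follows, assuming NumPy 1.20 or later (for `sliding_window_view`). The function names, window size, threshold value and synthetic frames are illustrative assumptions, and the derivatives are simple finite differences. This is a sketch of a single time step, not a full tracker.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def window_sum(values, size):
    """Sum `values` over every size x size window (valid positions only)."""
    return sliding_window_view(values, (size, size)).sum(axis=(-2, -1))

def klt_step(I_prev, I_now, I_next, size=7, theta=1e-3):
    """Velocity for every window with det(M) >= theta, NaN elsewhere."""
    I_y, I_x = np.gradient(I_now)
    I_t = (I_next - I_prev) / 2.0
    m_xx = window_sum(I_x * I_x, size)
    m_xy = window_sum(I_x * I_y, size)
    m_yy = window_sum(I_y * I_y, size)
    q_x = window_sum(I_x * I_t, size)
    q_y = window_sum(I_y * I_t, size)
    det = m_xx * m_yy - m_xy ** 2
    trackable = det >= theta
    safe_det = np.where(trackable, det, 1.0)     # avoid dividing by zero where rejected
    # v = -M^{-1} q, using the closed-form inverse of a 2 x 2 matrix
    v_x = np.where(trackable, -(m_yy * q_x - m_xy * q_y) / safe_det, np.nan)
    v_y = np.where(trackable, -(m_xx * q_y - m_xy * q_x) / safe_det, np.nan)
    return v_x, v_y, trackable

def frame(t, velocity, shape=(40, 60)):
    """Moving texture on the left, featureless constant region on the right."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]].astype(float)
    vx, vy = velocity
    texture = np.sin(0.2 * (xs - vx * t)) + np.cos(0.15 * (ys - vy * t))
    return np.where(xs < 30, texture, 0.5)

true_v = (0.3, -0.2)                             # assumed (v_x, v_y) in pixels per frame
frames = [frame(t, true_v) for t in (-1, 0, 1)]
v_x, v_y, trackable = klt_step(*frames)
print(trackable.sum(), np.nanmean(v_x), np.nanmean(v_y))
```

Entry `[i, j]` of each output refers to the window whose top-left pixel is `(i, j)`. Windows lying wholly in the flat region are rejected by the determinant test, while windows in the textured region return a velocity close to the true one. To continue tracking, the marked points would be moved to $x+v$ and the step repeated on the next frames.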


     

!["HigherEd 4.0 is funded by the Human Capital Initiative Pillar 3. HCI Pillar 3 supports projects to enhance the innovation and agility in response to future skills needs"](images/HCIFunding.png)