# Homography and perspective transforms

We have already looked at affine transformations and we have seen the properties and the limitations that they have. Just to recap:

An affine transform is any transform of the form
$$\begin{bmatrix}A_{00} & A_{01} & b_0\\A_{10} & A_{11} & b_1\\ 0 & 0 & 1\end{bmatrix}$$

but since the last row remains constant we can also write it like:
$$\begin{bmatrix}A_{00} & A_{01} & b_0\\A_{10} & A_{11} & b_1\end{bmatrix}$$

which is also the output of OpenCV's `getAffineTransform` as we saw earlier.

We know that an affine transform has 6 degrees of freedom, so to estimate one for a 2D shape we need three points from the source shape and the three corresponding points from the resulting shape. We also know that under an affine transform parallel lines remain parallel and ratios of lengths along a line are preserved, which limits what we can do.
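As a quick illustration of the "three point pairs, six unknowns" claim, here is a minimal NumPy sketch that solves for the 2x3 matrix directly. The helper name `affine_from_three_points` and the sample points are made up for this example; in practice OpenCV's `getAffineTransform` plays this role.

```python
import numpy as np

def affine_from_three_points(src, dst):
    """Solve for the 2x3 affine matrix M such that dst_i = M @ [x_i, y_i, 1]."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    # One row [x, y, 1] per source point (the three points must not be collinear).
    src_h = np.hstack([src, np.ones((3, 1))])
    # src_h @ M.T = dst gives 3 equations for each of the 2 rows of M: 6 in total.
    return np.linalg.solve(src_h, dst).T

# Hypothetical correspondences: scale x by 2, y by 3, then shift by (2, 3).
src = [(0, 0), (1, 0), (0, 1)]
dst = [(2, 3), (4, 3), (2, 6)]
M = affine_from_three_points(src, dst)
print(M)
```

Three non-collinear pairs give exactly six equations, which is why no more and no fewer points are needed for an exact affine fit.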

There is another type of matrix that allows us to do more than an affine transform would. This matrix is known as a homography matrix, written as:

$$\begin{bmatrix}H_{11} & H_{12} & H_{13}\\H_{21} & H_{22} & H_{23}\\ H_{31} & H_{32} & 1\end{bmatrix}$$

It turns out that by not restricting ourselves to a 2x3 matrix, we can do much more. But before we look at what we can do with homography, let us look at how we get there in the first place.

## Linear Camera Model

Let's take a short tour to see how we arrive at this homography matrix. To form an image, a camera takes 3D world coordinates, converts them to 3D camera coordinates, and then projects them to 2D image coordinates. The linear camera model, which describes this process, can be written as:

$$\begin{bmatrix}\tilde{u} \\ \tilde{v} \\ \tilde{w}\end{bmatrix} = \begin{bmatrix}C_{11} & C_{12} & C_{13} & C_{14}\\ C_{21} & C_{22} & C_{23} & C_{24}\\ C_{31} & C_{32} & C_{33} & 1\end{bmatrix}\begin{bmatrix}X \\ Y \\ Z \\ 1\end{bmatrix}$$

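This 3x4 matrix is known as the **projection matrix**. It maps the 3D world coordinates $[X, Y, Z]$ to homogeneous image coordinates $[\tilde{u}, \tilde{v}, \tilde{w}]$, from which the pixel coordinates are $u = \tilde{u}/\tilde{w}$ and $v = \tilde{v}/\tilde{w}$. Because of this division the matrix is only defined up to scale, which is why its last entry could be normalized to 1 above. In its general form it is:
$$\begin{bmatrix}C_{11} & C_{12} & C_{13} & C_{14}\\ C_{21} & C_{22} & C_{23} & C_{24}\\ C_{31} & C_{32} & C_{33} & C_{34}\end{bmatrix}$$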

This projection matrix can be decomposed into *extrinsic* and *intrinsic* camera parameters. Extrinsic parameters are those that are external to the camera and are concerned with transforming 3D world coordinates to 3D camera coordinates. Intrinsic parameters are concerned with the camera and transformation from 3D camera coordinates to 2D pixel coordinates. The extrinsic parameter equation is:

$$\begin{bmatrix}x_c \\ y_c \\ z_c \\ 1\end{bmatrix} = \begin{bmatrix}r_{11} & r_{12} & r_{13} & t_{14}\\ r_{21} & r_{22} & r_{23} & t_{24}\\ r_{31} & r_{32} & r_{33} & t_{34} \\ 0 & 0 & 0 & 1\end{bmatrix} \begin{bmatrix}X \\ Y \\ Z \\ 1\end{bmatrix}$$

The matrix doing this transformation contains a 3x3 rotation matrix $R$ and a 3x1 translation vector $t$. It can also be written as $\begin{bmatrix}R & t \\ 0_{1\times 3} & 1 \end{bmatrix}$.

The intrinsic matrix equation can be written as:
$$\begin{bmatrix}\tilde{u} \\ \tilde{v} \\ \tilde{w}\end{bmatrix} = \begin{bmatrix}f_x & 0 & o_x & 0\\ 0 & f_y & o_y & 0\\ 0 & 0 & 1 & 0\end{bmatrix} \begin{bmatrix}x_c \\ y_c \\ z_c \\ 1\end{bmatrix}$$

We note that the left 3x3 block of the intrinsic matrix $M_{int}$ has a familiar form: an affine transform with $f_x, f_y$ as scale (the focal length expressed in pixels, scaling from camera dimensions to sensor dimensions) and $o_x, o_y$ as a translation (the position of the principal point, where the optical axis meets the image), in the x and y directions.

Hence, the projection matrix can also be written as:
$$
\begin{bmatrix}u \\ v \\ 1\end{bmatrix} \equiv
\begin{bmatrix}\tilde{u} \\ \tilde{v} \\ \tilde{w}\end{bmatrix} = \begin{bmatrix}f_x & 0 & o_x & 0\\ 0 & f_y & o_y & 0\\ 0 & 0 & 1 & 0\end{bmatrix} \begin{bmatrix}r_{11} & r_{12} & r_{13} & t_{14}\\ r_{21} & r_{22} & r_{23} & t_{24}\\ r_{31} & r_{32} & r_{33} & t_{34} \\ 0 & 0 & 0 & 1\end{bmatrix} \begin{bmatrix}X \\ Y \\ Z \\ 1\end{bmatrix}
$$

In summary, reading from the right, we start with the 3D coordinates of a point in the world, which we convert to 3D camera coordinates using the extrinsic matrix. We then convert this 3D camera point to a homogeneous 2D image point using the intrinsic matrix. Finally, we convert the homogeneous 2D coordinates to Cartesian pixel coordinates by dividing by $\tilde{w}$.
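The chain of matrices can be checked numerically. The following is a minimal NumPy sketch; the intrinsic values, the identity rotation, and the translation are hypothetical numbers chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical intrinsics: focal lengths and principal point, in pixels.
fx, fy, ox, oy = 800.0, 800.0, 320.0, 240.0
M_int = np.array([[fx, 0.0, ox, 0.0],
                  [0.0, fy, oy, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])

# Hypothetical extrinsics: no rotation, world origin 5 units in front of the camera.
R = np.eye(3)
t = np.array([0.0, 0.0, 5.0])
M_ext = np.eye(4)
M_ext[:3, :3] = R
M_ext[:3, 3] = t

P = M_int @ M_ext  # 3x4 projection matrix

def project(P, point_3d):
    """Map a 3D world point to Cartesian pixel coordinates (u, v)."""
    X_h = np.append(np.asarray(point_3d, dtype=float), 1.0)  # homogeneous world point
    u_t, v_t, w_t = P @ X_h                                  # homogeneous image point
    return np.array([u_t / w_t, v_t / w_t])                  # divide by w to get pixels

uv = project(P, [1.0, 0.5, 0.0])
print(uv)
```

With these numbers the camera-frame point is $(1, 0.5, 5)$, so $\tilde{u} = 2400$, $\tilde{v} = 1600$, $\tilde{w} = 5$, which gives the pixel $(480, 320)$.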

This is the linear camera model, and we will use it for camera calibration.

### Distortion Effects

Note that a real camera also has distortion effects that are not included in the linear camera model. The two major ones are *radial* and *tangential* distortion. Estimating them is necessary for proper camera calibration.
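For reference, one commonly used way to model these two effects (the form described in OpenCV's calibration documentation) acts on the normalized camera coordinates $x = x_c / z_c$, $y = y_c / z_c$ with $r^2 = x^2 + y^2$. The symbols $k_1, k_2, k_3$ (radial coefficients) and $p_1, p_2$ (tangential coefficients) are not defined elsewhere in this text and are introduced here only for illustration:

$$x_d = x\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x y + p_2 (r^2 + 2x^2)$$

$$y_d = y\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2y^2) + 2 p_2 x y$$

The distorted coordinates are then mapped to pixels with the intrinsics as before: $u = f_x x_d + o_x$ and $v = f_y y_d + o_y$. With all coefficients set to zero this reduces to the linear camera model.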

## Camera Calibration

Camera calibration involves estimating the parameters of the projection matrix for a camera, together with its distortion coefficients. Once these values are found, they can be used to undistort images. OpenCV provides the function `calibrateCamera` for this, which returns, among other things, the *camera intrinsics matrix*, the *distortion coefficients*, and, for each calibration view, a *rotation vector* (a compact encoding of the rotation matrix $R$) and a *translation vector* $t$. We can then use the intrinsics and distortion coefficients to undistort images taken by this camera.

OpenCV also provides `solvePnP`, which estimates the pose of the object being viewed relative to the camera (this is also known as the camera pose estimation problem). In this case we pass in known 3D object points and their 2D image projections, together with the camera intrinsics and the distortion coefficients, and get back the extrinsic parameters.

## Planar Homography

We have seen that the linear camera model allows us to represent a point at coordinates $(X, Y, Z)$ in the world as a pixel at coordinates $(u, v)$ in an image.

Now let us imagine describing the position of not one point but all points on a plane. We place the world origin on the plane, with the $X$ and $Y$ axes lying within the plane and the $Z$ axis coming out of it (depth). Since every point on the plane then has $Z = 0$, we can rewrite our projection equation as:

$$\begin{bmatrix}\tilde{u} \\ \tilde{v} \\ \tilde{w}\end{bmatrix} = \begin{bmatrix}C_{11} & C_{12} & C_{13} & C_{14}\\ C_{21} & C_{22} & C_{23} & C_{24}\\ C_{31} & C_{32} & C_{33} & 1\end{bmatrix}    \begin{bmatrix}X \\ Y \\ 0 \\ 1\end{bmatrix}$$

Since the third column is always multiplied by zero, it drops out. Renaming the remaining first, second and fourth columns, we are left with:

$$\begin{bmatrix}\tilde{u} \\ \tilde{v} \\ \tilde{w}\end{bmatrix} = \begin{bmatrix}H_{11} & H_{12} & H_{13}\\H_{21} & H_{22} & H_{23}\\ H_{31} & H_{32} & 1\end{bmatrix}\begin{bmatrix}X \\ Y \\ 1\end{bmatrix}$$

We get the matrix we had earlier, a 3x3 matrix called the **planar homography**, which maps points on a plane to points in an image. At this point there is no depth information, unlike in the linear camera model we started with; we are only concerned with the $(X, Y)$ coordinates of a point on the plane.
(The matrix is generally normalized so that $H_{33} = 1$ or so that $\sum_{i,j} H_{ij}^2 = 1$.)
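The column-dropping step can be verified numerically. This is a minimal NumPy sketch; the entries of the projection matrix `C` are hypothetical values chosen only for illustration.

```python
import numpy as np

# A hypothetical 3x4 projection matrix (illustrative values only).
C = np.array([[800.0,   0.0, 320.0, 1600.0],
              [  0.0, 800.0, 240.0, 1200.0],
              [  0.0,   0.0,   1.0,    5.0]])

# The third column only ever multiplies Z = 0, so keep columns 1, 2 and 4.
H = C[:, [0, 1, 3]]
H = H / H[2, 2]  # normalize so that H_33 = 1 (allowed: H is defined up to scale)

def to_pixel(p):
    """Convert a homogeneous image point to Cartesian pixel coordinates."""
    return p[:2] / p[2]

X, Y = 1.0, 0.5
via_camera = to_pixel(C @ np.array([X, Y, 0.0, 1.0]))   # full camera model, Z = 0
via_homography = to_pixel(H @ np.array([X, Y, 1.0]))    # 3x3 homography only

print(via_camera, via_homography)
```

Both routes give the same pixel, and rescaling `H` does not change the result because the scale cancels in the division by $\tilde{w}$.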

## What can we do with this?

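Now that we know how the 3x3 homography matrix is derived, we ask: what can we do with it? A homography can reshape any quadrilateral into any other quadrilateral of our choosing, whereas an affine transform can only map a parallelogram to another parallelogram. In fact, this is what a homography does: it maps one plane onto another plane by projection through a point. Since the matrix is only defined up to scale, it has 8 degrees of freedom, and each point correspondence provides two equations, so we need at least 4 corresponding points from each plane (no three of them collinear) to estimate the homography that connects the two planes. However, we are not confined to 4 points; using more correspondences averages out measurement noise and gives a more accurate estimate.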
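To make the point-counting argument concrete, here is a minimal sketch of estimating a homography from point correspondences with the direct linear transform (DLT), written in plain NumPy. The function names and sample points are made up for this example, and this is not necessarily how OpenCV's `getPerspectiveTransform` or `findHomography` are implemented internally.

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate H (3x3, H_33 = 1) from N >= 4 point correspondences via DLT."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear equations in the 9 entries of H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.array(rows, dtype=float)
    # The solution is the right singular vector with the smallest singular value;
    # with more than 4 points this is a least-squares fit.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_homography(H, pts):
    """Map 2D points through H, including the division by the third coordinate."""
    pts = np.asarray(pts, dtype=float)
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical example: map the unit square onto a trapezoid (not a parallelogram).
src = [(0, 0), (1, 0), (1, 1), (0, 1)]
dst = [(0, 0), (4, 0), (3, 2), (1, 2)]
H = estimate_homography(src, dst)
print(H)
print(apply_homography(H, src))
```

The bottom row of the resulting `H` is not `[0, 0, 1]`, which is exactly what lets it turn a square into a trapezoid, something no affine transform can do. Note that dividing by `H[2, 2]` assumes that entry is non-zero, which holds for this example.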

Practical applications of homographies include:
* **perspective rectification** - if we have an image in the wrong perspective, and we happen to know at least four points in the image that lie on the same plane, we can rectify the perspective of the image. For such an application, we only require 4 points from each plane.
* **image alignment** - a photo of a filled form may not be in the proper perspective. We can align the form if we have a template of what it should look like. In this application, we fare better if we find as many matching points in each plane as possible to compute the homography.
* **image stitching** - we can stitch multiple images together to create panoramas. Just like image alignment, we fare much better with more than the required minimum of 4 point correspondences.

We will next see these applications in action.