# Model View Projection

A model (a triangle mesh, a signed distance function, etc.) is specified with respect to a _local frame_. When building a scene, the local frame of each object must be transformed into a _world frame_. The world in computer graphics is seen through a camera, so each object in the world frame must be transformed to a _view frame_ that gives its position with respect to the camera. We are not done yet: a (pinhole) camera sees a 2D image, so we need to _project_ the view frame onto a 2D plane (called the _image plane_). The projection can be chosen to be:
* orthographic: parallel lines stay parallel (they only meet at infinity).
* perspective: parallel lines converge toward a vanishing point.

```
    Local frame  -->  World frame  -->  View frame   -->  Image plane
```
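
The difference between the two projections can be illustrated with a minimal sketch. It assumes points already expressed in the view frame as $(h, v, d)$ with $d$ the depth; the helper names `orthographic` and `perspective` are hypothetical and only serve this illustration.

```python
import numpy as np

# Two parallel segments in the view frame; columns are points (h, v, d).
# Both run along the depth axis, one unit apart horizontally.
depths = np.array([1.0, 2.0, 4.0, 8.0])
left = np.stack([-0.5 * np.ones_like(depths), np.zeros_like(depths), depths])
right = np.stack([0.5 * np.ones_like(depths), np.zeros_like(depths), depths])


def orthographic(points):
    """Hypothetical helper: drop the depth coordinate."""
    return points[:2]


def perspective(points):
    """Hypothetical helper: divide the in-plane coordinates by the depth."""
    return points[:2] / points[2]


# Horizontal gap between the two segments after each projection.
gap_ortho = orthographic(right)[0] - orthographic(left)[0]
gap_persp = perspective(right)[0] - perspective(left)[0]

print('Orthographic gap:', gap_ortho)  # constant: the lines stay parallel
print('Perspective gap :', gap_persp)  # shrinks with depth: the lines converge
```

Under the orthographic projection the gap is independent of depth, while under the perspective projection it is proportional to $1/d$.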

Let
* $M\colon \mathbb R^3 \to \mathbb R^3$ be the transformation from the local frame to the world frame,
* $V\colon \mathbb R^3 \to \mathbb R^3$ be the transformation from the world frame to the view frame and
* $P\colon\mathbb R^3 \to \mathbb R^2$ be the transformation from the view frame to the image plane.

## Model Matrix

The movement from the local frame to the world frame is an affine transformation (scaling, rotation, reflection and translation). Then, the function $M$ takes the form
$$
    M(\mathbf x) = \mathbf o + \mathbf S \mathbf x
$$
where $\mathbf o$ is the translation vector and $\mathbf S$ collects the other transformations. Now, we will apply a trick that is often used in computer graphics. Since most computations run on the GPU, we want every transformation to look like a matrix multiplication. To achieve this, it's convenient to modify the local coordinates by appending a $1$ at the end, i.e. $\mathbf x \leftarrow (\mathbf x, 1)\in\mathbb R^4$. Then, the transformation $M$ can be written with a matrix $\mathbf M \in \mathbb R^{4\times 4}$ as:
$$
    \mathbf M = 
    \begin{bmatrix}
        \mathbf S & \mathbf o\\
        0 & 1
    \end{bmatrix}
$$
where the first 3 coordinates of $\mathbf M (\mathbf x, 1)$ are the world coordinates $\mathbf o + \mathbf S \mathbf x$ and the last one (which stays equal to $1$) is a mousketool for later.

In [1]:
import numpy as np

In [2]:
x = np.array([1, 1, 0, 1])

M = np.array([
    [1, 0, 0, -1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1]
])

print('Local frame', x)
print('World frame', M@x)

Local frame [1 1 0 1]
World frame [0 1 0 1]
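
The cell above only translates. As a minimal sketch of the general block form, the following assembles $\mathbf M$ from a linear part and a translation and checks it against $\mathbf o + \mathbf S \mathbf x$. The helper name `model_matrix` and the sample values are assumptions made for illustration.

```python
import numpy as np


def model_matrix(S, o):
    """Hypothetical helper: assemble the 4x4 block matrix [[S, o], [0, 1]]."""
    M = np.eye(4)
    M[:3, :3] = S   # linear part (scaling, rotation, reflection)
    M[:3, 3] = o    # translation
    return M


S = np.diag([2.0, 2.0, 2.0])      # sample value: uniform scaling by 2
o = np.array([-1.0, 0.0, 3.0])    # sample value: translation
x = np.array([1.0, 1.0, 0.0])

M = model_matrix(S, o)
x_h = np.append(x, 1.0)           # homogeneous coordinates (x, 1)

print('Affine form:', o + S @ x)
print('Matrix form:', (M @ x_h)[:3])
```

Both lines print the same vector, and the fourth coordinate of `M @ x_h` remains $1$.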


## View Matrix

In this step, world coordinates must be expressed from the camera's point of view. Let $\mathbf c_O, \mathbf c_F, \mathbf c_R$ and $\mathbf c_U$ be vectors in $\mathbb R^3$ representing the camera origin, the camera front direction, the camera right direction and the camera upward direction, such that $\mathbf c_R = \mathbf c_F \times \mathbf c_U$.

Let $\mathbf x \in \mathbb R^3$. Then, $\mathbf x' = \mathbf x - \mathbf c_O$ is the vector $\mathbf x$ with respect to the origin $\mathbf c_O$. We want to find coefficients $y_1, y_2$ and $y_3$ such that 
$$
    y_1 \mathbf c_F + y_2 \mathbf c_U + y_3 \mathbf c_R = \mathbf x'.
$$
Let $\mathbf C = \begin{bmatrix} \mathbf c_F & \mathbf c_U & \mathbf c_R \end{bmatrix}$ and $\mathbf y = (y_1, y_2, y_3)$. The order of the columns only fixes the order of the coordinates of $\mathbf y$. Since the three directions are linearly independent, $\mathbf C$ is invertible and
$$
    \mathbf C \mathbf y = \mathbf x'
    \iff
    \mathbf y = \mathbf C^{-1} \mathbf x'.
$$
Therefore, the transformation can be written as:
$$
    V(\mathbf x) = \mathbf C^{-1}(\mathbf x - \mathbf c_O)
$$

In [3]:
x = np.array([1, 1, 0])

cO = np.array([1, 0, 1])
cF = np.array([0, 0, -1])
cU = np.array([0, 1, 0])
cR = np.cross(cF, cU)

print('Camera information')
print('Origin :', cO)
print('Front  :', cF)
print('Upward :', cU)
print('Right  :', cR)

C = np.array([cF, cU, cR]).T  #  columns: front, up, right
print('Matrix :')
print(C)

print('Old coord:', x)
print('New coord:', np.linalg.solve(C, (x-cO)))

Camera information
Origin : [1 0 1]
Front  : [ 0  0 -1]
Upward : [0 1 0]
Right  : [1 0 0]
Matrix :
[[ 0  0  1]
 [ 0  1  0]
 [-1  0  0]]
Old coord: [1 1 0]
New coord: [1. 1. 0.]


As a matrix transformation, we append a one and consider the 4 by 4 matrix:
$$
    \mathbf V = \begin{bmatrix} \mathbf C^{-1} & -\mathbf C^{-1} \mathbf c_O\\ 0 & 1\end{bmatrix}
$$
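
As a quick check, the following sketch builds $\mathbf V$ and compares it with the direct formula $\mathbf C^{-1}(\mathbf x - \mathbf c_O)$. The helper name `view_matrix` is hypothetical. The column order (right, up, front) is the one used in the working example further down, chosen so that the third view coordinate is the depth.

```python
import numpy as np


def view_matrix(C, cO):
    """Hypothetical helper: assemble [[C^-1, -C^-1 cO], [0, 1]]."""
    C_inv = np.linalg.inv(C)
    V = np.eye(4)
    V[:3, :3] = C_inv
    V[:3, 3] = -C_inv @ cO
    return V


cO = np.array([1.0, 0.0, 1.0])
cF = np.array([0.0, 0.0, -1.0])
cU = np.array([0.0, 1.0, 0.0])
cR = np.cross(cF, cU)
C = np.column_stack([cR, cU, cF])  # columns: right, up, front

x = np.array([1.0, 1.0, 0.0])
V = view_matrix(C, cO)

direct = np.linalg.solve(C, x - cO)        # C^-1 (x - cO)
via_matrix = (V @ np.append(x, 1.0))[:3]   # same result through the 4x4 matrix

print('Direct    :', direct)
print('Via matrix:', via_matrix)
```

For this camera the three directions are orthonormal, so $\mathbf C^{-1} = \mathbf C^\top$ and the inverse could be replaced by a transpose.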

## Projection Matrix

We are one step away from finally drawing something on screen. Recall that projection matrices
come in two flavors: orthographic and perspective.

**Perspective Projection** Consider a truncated pyramid coming out of the camera. Everything inside that truncated pyramid will appear on screen. The truncated pyramid can be defined by a number of parameters; in this section we consider the following parameterization:
* $n$ is the distance to the near plane (objects closer than $n$ to the camera will not appear on screen),
* $f$ is the distance to the far plane (objects farther than $f$ from the camera will not appear on screen),
* $\alpha$ is half of the vertical field-of-view angle of the camera,
* $\beta$ is half of the horizontal field-of-view angle of the camera.


We first map the truncated pyramid (or view frustum) defined by the above parameters into the cube $[-1,1]^{3}$, called the _image space_ (two dimensions for the object's location and a third one that is a hint for the rasterizer).
Consider a slice of the view frustum (which is a rectangle) at depth $d$, where depth is measured from the camera origin. Let $h$ and $v$ be the horizontal and vertical coordinates within the slice, measured from the frustum center. Then, the map $(r_h,r_v,r_d) \to (h,v,d)$ from image space to the view frustum is given by:
$$
    v = r_v \tan(\alpha) d
    \qquad
    h = r_h \tan(\beta) d
    \qquad
    d = \frac{n+f}{2} + r_d \frac{f-n}{2}
$$
Computing the inverse gives the map from the view frustum to image space:
$$
    \varphi(h,v,d) 
    =
    \left(
        h\, \frac{\cot \beta}{d},
        v\, \frac{\cot \alpha}{d},
        d\, \frac{2}{f-n} - \frac{f+n}{f-n}
    \right).
$$
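
The map and its inverse can be checked numerically. The sketch below is only a sanity check: the function names and the sample values of $n$, $f$, $\alpha$ and $\beta$ are assumptions chosen for illustration.

```python
import numpy as np

# Sample frustum parameters (arbitrary values for the check).
n, f = 0.5, 4.0
alpha = np.deg2rad(30.0)
beta = np.deg2rad(40.0)


def image_to_frustum(r):
    """(r_h, r_v, r_d) in [-1, 1]^3  ->  (h, v, d) in the view frustum."""
    r_h, r_v, r_d = r
    d = (n + f) / 2 + r_d * (f - n) / 2
    return np.array([r_h * np.tan(beta) * d, r_v * np.tan(alpha) * d, d])


def frustum_to_image(p):
    """(h, v, d) in the view frustum  ->  (r_h, r_v, r_d), i.e. the map phi."""
    h, v, d = p
    return np.array([
        h / (np.tan(beta) * d),
        v / (np.tan(alpha) * d),
        2 * d / (f - n) - (f + n) / (f - n),
    ])


r = np.array([0.25, -0.5, 0.75])
p = image_to_frustum(r)
r_back = frustum_to_image(p)

print('Image space :', r)
print('Frustum     :', p)
print('Round trip  :', r_back)
```

The round trip returns the starting point, and points on the near and far planes land on $r_d = -1$ and $r_d = 1$ respectively.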
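Note that $\varphi$ is not a linear function (its first two components divide by $d$), so it cannot be written in matrix form. Nevertheless, we consider the following "linearization trick", which moves the division by $d$ into a fourth coordinate:
$$
    \hat\varphi(h,v,d,1) =
    \left(
        h\, \cot \beta,
        v\, \cot \alpha,
        \frac{2}{f-n} - d \frac{f+n}{f-n},
        d
    \right),
$$
which can be written with the following matrix:
$$
    \mathbf P 
    =
    \begin{bmatrix}
        \cot \beta & 0 & 0 & 0\\
        0 & \cot\alpha & 0 & 0\\
        0 & 0 & -\frac{f+n}{f-n} & \frac{2}{f-n}\\
        0 & 0 & 1 & 0
    \end{bmatrix}
$$
Then, the coordinates in image space are recovered by dividing the first three coordinates by the fourth one. The first two coordinates agree with $\varphi$. The third one becomes $\frac{2}{(f-n)\,d} - \frac{f+n}{f-n}$, which is not the linear depth of $\varphi$ (that would require a term in $d^2$ before the division) and is not confined to $[-1,1]$. It is still a monotone (decreasing) function of $d$, so it can be used to compare depths.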
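
For comparison, a common variant (an assumption here, not the matrix used in this notebook) picks the third row as $(0, 0, A, B)$ with $A = \frac{f+n}{f-n}$ and $B = -\frac{2fn}{f-n}$, so that after the division the depth $d=n$ lands on $-1$ and $d=f$ lands on $1$. The function names below are hypothetical.

```python
import numpy as np


def perspective_normalized(n, f, alpha, beta):
    """Hypothetical variant of P whose depth maps [n, f] onto [-1, 1]."""
    A = (f + n) / (f - n)
    B = -2.0 * f * n / (f - n)
    return np.array([
        [1.0 / np.tan(beta), 0.0, 0.0, 0.0],
        [0.0, 1.0 / np.tan(alpha), 0.0, 0.0],
        [0.0, 0.0, A, B],
        [0.0, 0.0, 1.0, 0.0],   # copies the depth d into the fourth coordinate
    ])


def to_image_space(P, points_h):
    """Apply P to homogeneous columns, then divide by the fourth coordinate."""
    clip = P @ points_h
    return clip[:3] / clip[3]


n, f = 0.5, 4.0
alpha = beta = np.deg2rad(30.0)
P_norm = perspective_normalized(n, f, alpha, beta)

# Columns: a point on the near plane, a point on the far plane,
# and the top-right corner of the frustum slice at depth 2.
points = np.array([
    [0.0, 0.0, 2.0 * np.tan(beta)],
    [0.0, 0.0, 2.0 * np.tan(alpha)],
    [n, f, 2.0],
    [1.0, 1.0, 1.0],
])
image = to_image_space(P_norm, points)
print(image)
```

With this choice the third coordinate is $A + B/d$, which is increasing in $d$ and stays inside $[-1, 1]$ for every depth between the near and far planes.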

### Finally, a working example

In [4]:
'''  Camera information  '''
cO = np.array([0, 0, 2])
cF = np.array([0, 0, -1])
cU = np.array([0, 1, 0])
cR = np.cross(cF, cU)

C = np.array([cR, cU, cF]).T

n = 0.5
f = 4
alpha = 80 * np.pi/180
beta = 80 * np.pi/180

'''  Model, View, Projection  '''

M = np.eye(4)

V = np.block([
    [np.linalg.inv(C), -np.linalg.solve(C, cO).reshape((3, 1))],
    [np.zeros(3), 1]
])

P = np.array([
    [np.divide(1, np.tan(beta)), 0, 0, 0],
    [0, np.divide(1, np.tan(alpha)), 0, 0],
    [0, 0, -(f+n)/(f-n), 2/(f-n)],
    [0, 0, 1, 0]
])


'''  Geometry information  '''
cube = np.array([
    #  Back
    [-1, -1, -1, 1],
    [-1, 1, -1, 1],
    [1, 1, -1, 1],
    [1, -1, -1, 1],
    #  Front
    [-1, -1, 1, 1],
    [-1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, -1, 1, 1],
]).T

''' Applying transformation '''
cube_4d = P @ V @ M @ cube
print('Transformed cube (in 4d)')
np.set_printoptions(3)
print(cube_4d)

cube_img_space = cube_4d[:3] / cube_4d[-1]
print('Perspective cube')
print(cube_img_space)

Transformed cube (in 4d)
[[-0.176 -0.176  0.176  0.176 -0.176 -0.176  0.176  0.176]
 [-0.176  0.176  0.176 -0.176 -0.176  0.176  0.176 -0.176]
 [-3.286 -3.286 -3.286 -3.286 -0.714 -0.714 -0.714 -0.714]
 [ 3.     3.     3.     3.     1.     1.     1.     1.   ]]
Perspective cube
[[-0.059 -0.059  0.059  0.059 -0.176 -0.176  0.176  0.176]
 [-0.059  0.059  0.059 -0.059 -0.176  0.176  0.176 -0.176]
 [-1.095 -1.095 -1.095 -1.095 -0.714 -0.714 -0.714 -0.714]]


Reading each column, we can see that the back face is closer to the origin than the front face (by a factor of 3, 0.059 vs 0.176), as expected since it is three times farther from the camera. Since numbers are not so satisfying to read, let's build an image.

In [5]:
img_size = (200, 200)

cube_coords = img_size[0]//2 + np.multiply(cube_img_space[:2], img_size[0]).astype(int)

depth = (cube_img_space[2].clip(min=-1.0, max=1.0) + 2.0)  #  arbitrary map from clipped depth to marker size

from ipycanvas import Canvas

canvas = Canvas(width=img_size[0], height=img_size[1], sync_image_data=True)
canvas.fill_rects(cube_coords[0], cube_coords[1], 5.0 * depth)

In [6]:
canvas.to_file('images/mvp_cube.png')

<img src="images/mvp_cube.png"></img>
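
The step from image space to pixels is often written as a separate "viewport" map. The sketch below shows one common convention, which is an assumption and differs from the scaling used in the cell above: the square $[-1,1]^2$ is stretched over the whole canvas and the vertical axis is flipped, because the canvas origin is the top-left corner with $y$ growing downward. The helper name `to_pixels` is hypothetical.

```python
import numpy as np


def to_pixels(xy, width, height):
    """Hypothetical helper: map image-space (x, y) in [-1, 1]^2 to pixels.

    Assumes the canvas origin is the top-left corner with y growing downward.
    """
    px = (xy[0] + 1.0) * 0.5 * (width - 1)
    py = (1.0 - xy[1]) * 0.5 * (height - 1)
    return np.rint(np.stack([px, py])).astype(int)


# Columns: top-left corner, bottom-right corner and center of image space.
corners = np.array([
    [-1.0, 1.0, 0.0],
    [1.0, -1.0, 0.0],
])
pixels = to_pixels(corners, 201, 201)
print(pixels)
```

Here the top-left corner maps to pixel $(0, 0)$, the bottom-right corner to $(200, 200)$ and the center to $(100, 100)$.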

Note that the size of each marker is scaled using the third coordinate of the cube in image space. Now, let's build something more interesting.

In [7]:
''' We will use pyvista to load the data, it's not hard doing it by hand but that is not the purpose here '''

import pyvista as pv

MAX_FACES = 15_000

#  Mesh from https://www.thingiverse.com/thing:6703649/files
mesh = pv.PolyData('resources/3D_models/blue_whale/files/ballena_azul_Lowpoly.stl')

num_faces = len(mesh.faces) // 4

if num_faces > MAX_FACES:
    mesh = mesh.decimate(1.0 - (MAX_FACES / num_faces))
    print('Decimated mesh, from ', num_faces, 'faces to', len(mesh.faces)//4, 'faces')

n_nodes = mesh.points.shape[0]
nodes = np.concatenate((mesh.points, np.ones((n_nodes, 1))), axis=1).T
triangles = np.delete(mesh.faces.reshape(-1, 4), 0, 1).T

print('# Nodes  :', nodes.shape)
print('# Faces  :', triangles.shape)

# Nodes  : (4, 1845)
# Faces  : (3, 3686)


In [8]:
'''  Camera information  '''
camera_target = np.mean(nodes[:3], axis=1)

min_1, max_1 = np.amin(nodes[0]), np.amax(nodes[0])
min_2, max_2 = np.amin(nodes[1]), np.amax(nodes[1])
min_3, max_3 = np.amin(nodes[2]), np.amax(nodes[2])

cO = 2 * np.array([3 * max_1, 3 * max_2, 0.5 * (max_3 + min_3)])
cF = cO - camera_target  #  NOTE: points from the target toward the camera, so the target has negative view depth
cF /= np.linalg.norm(cF)
cU = -np.array([0, 1, 0])
cR = np.cross(cF, cU)

C = np.array([cR, cU, cF]).T  #  x (horizontal), y (vertical), z (depth) as seen from the camera

n = 10
f = 100
alpha = 30.0 * np.pi/180
beta = 30.0 * np.pi/180

'''  Model=id, View, Projection  '''
V = np.block([
    [np.linalg.inv(C), -np.linalg.solve(C, cO).reshape((3, 1))],
    [np.zeros(3), 1]
])

P = np.array([
    [np.divide(1, np.tan(beta)), 0, 0, 0],
    [0, np.divide(1, np.tan(alpha)), 0, 0],
    [0, 0, -(f+n)/(f-n), 2/(f-n)],
    [0, 0, 1, 0]
])

''' Draw time '''
img_size = (500, 500)

img_4d = P @ V @ nodes
img_space = img_4d[:3] / img_4d[-1]

depth = (img_space[2].clip(min=-1.0, max=1.0) + 2.0)

coords = img_size[0]//2 + np.multiply(img_space[:2], img_size[0]).astype(int)
mean = np.mean(coords, axis=1)  #  centroid of the projected nodes (average over points, not over x and y)

plt = Canvas(width=img_size[0], height=img_size[1], sync_image_data=True)

plt.fill_rects(coords[0], coords[1], 1.5 * depth)

plt.fill_style = 'red'
plt.fill_rect(mean[0], mean[1], 5.0)

In [9]:
plt.to_file('images/mvp_final.png')

<img src="images/mvp_final.png"></img>
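
When the camera is defined by an origin and a target point, a common way to obtain an orthonormal basis is a "look-at" construction. The sketch below is one such construction, given as an assumption and not as the method used in the cell above. The helper name `look_at_basis` and the sample camera are hypothetical.

```python
import numpy as np


def look_at_basis(origin, target, up_hint):
    """Hypothetical helper: orthonormal camera basis with columns (right, up, front).

    Fails if up_hint is parallel to the viewing direction (the cross product vanishes).
    """
    front = target - origin
    front = front / np.linalg.norm(front)   # unit vector from the camera to the target
    right = np.cross(front, up_hint)        # same convention as c_R = c_F x c_U
    right = right / np.linalg.norm(right)
    up = np.cross(right, front)             # exact upward direction, orthogonal to both
    return np.column_stack([right, up, front])


cO = np.array([4.0, 3.0, 2.0])               # sample camera origin
target = np.zeros(3)                         # sample target point
C = look_at_basis(cO, target, np.array([0.0, 1.0, 0.0]))

# Since C is orthonormal, its inverse is its transpose.
target_in_view = C.T @ (target - cO)
print('Target in view frame:', target_in_view)
```

With this basis the target sits on the depth axis, at a positive depth equal to its distance from the camera.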

References:
* UC Davis (2015). The Camera Transform. Available at: https://youtu.be/mpTl003EXCY?si=Y7s9tljaW3w3VYuO  (Accessed: 9 May 2025).