# 7: Stereo Matching Fundamentals Continued

In the [previous workshop](./06_stereo_matching_fundamentals.ipynb), we looked at stereo matching fundamentals. More precisely, we looked at computing the disparity map using the block matching algorithm. In this workshop, we will continue with stereo matching fundamentals and look at how we can match images without using disparity, which assumes a linear shift in the $x$ direction.

## Epipolar Geometry




In [1]:
import matplotlib.pyplot as plt
from nptyping import Float32, NDArray, Shape
import numpy as np
from scipy.ndimage import map_coordinates
from scipy.signal import convolve2d
from scipy.signal.windows import gaussian
from scipy.spatial.transform import Rotation

from oaf_vision_3d._test_data_paths import TestDataPaths
from oaf_vision_3d.convolve2d import convolution_2d_nan
from oaf_vision_3d.lens_model import CameraMatrix, DistortionCoefficients, LensModel
from oaf_vision_3d.point_cloud_visualization import open3d_visualize_point_cloud
from oaf_vision_3d.poly_2_subvalue_fit import find_subvalue_poly_2
from oaf_vision_3d.project_points import project_points
from oaf_vision_3d.transformation_matrix import TransformationMatrix
from oaf_vision_3d._stereo_data_reader import StereoData

We load up a lens model:

In [2]:
lens_model = LensModel(
    camera_matrix=CameraMatrix(fx=2500.0, fy=2500.0, cx=1250.0, cy=1000.0),
    distortion_coefficients=DistortionCoefficients(k1=0.3, k2=-0.1, p1=-0.02),
)

And a transformation matrix between the two cameras. For now, both cameras use the same lens model:

In [3]:
transformation_matrix = TransformationMatrix(
    rotation=Rotation.from_rotvec(
        np.array([0.0, -11.31, 0.0], dtype=np.float32), degrees=True
    ),
    translation=np.array([100.0, 0.0, 0.0], dtype=np.float32),
)

If we now pick a pixel, let's say (1250, 1000) in the main camera, which is its principal point, and a depth of 500 mm:

In [4]:
pixel = np.array([1250.0, 1000.0], dtype=np.float32)
depth = 500.0

We can calculate the 3D position of this point in the main camera, assuming that pixel is at that depth:

In [5]:
undistorted_pixel = lens_model.undistort_pixels(
    normalized_pixels=lens_model.normalize_pixels(pixels=pixel[None, None, :])
)
xyz = np.pad(undistorted_pixel, ((0, 0), (0, 0), (0, 1)), constant_values=1.0) * depth

Now we can project this point onto the second camera, by transforming by the inverse of the transformation matrix, since we go from the main camera to the second camera. We can use the [`project_points`](../oaf_vision_3d/project_points.py) function for this and return the pixel position in the second camera:

In [None]:
projected_point = project_points(
    points=xyz[0],
    lens_model=lens_model,
    transformation_matrix=transformation_matrix.inverse(),
)[0]

print(f"Projected point, X: {projected_point[0]:.1f}, Y: {projected_point[1]:.1f}")

It turns out the projected pixel lands in the center of the second camera. This is no surprise, since I designed it this way: with a baseline $t_x = 100$ mm and a depth of $d = 500$ mm, the magnitude of the rotation angle was chosen as $\theta_y = \arctan(\frac{t_x}{d}) \approx 11.31^\circ$.
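
As a quick sanity check of that angle, here is a small sketch in plain NumPy that uses only the numbers quoted above (the variable names are mine, not part of `oaf_vision_3d`):

```python
import numpy as np

baseline_mm = 100.0  # t_x from the transformation matrix above
depth_mm = 500.0  # depth chosen for the example pixel

# Angle that points the second camera at the point seen in the
# center of the main camera.
theta_y_deg = np.degrees(np.arctan2(baseline_mm, depth_mm))
print(f"theta_y = {theta_y_deg:.2f} degrees")
```

The value matches the magnitude of the rotation vector used in the transformation matrix.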

Let's do this for a range of depths and plot the results:

In [None]:
depth_min = 350.0
depth_max = 650.0
num_samples = 10

undistorted_pixel = lens_model.undistort_pixels(
    normalized_pixels=lens_model.normalize_pixels(pixels=pixel[None, None, :])
)
camera_vector = np.pad(undistorted_pixel, ((0, 0), (0, 0), (0, 1)), constant_values=1.0)

depths = np.linspace(depth_min, depth_max, num_samples)
projected_points = []
for depth in depths:
    xyz = camera_vector * depth
    projected_points.append(
        project_points(
            points=xyz[0],
            lens_model=lens_model,
            transformation_matrix=transformation_matrix.inverse(),
        )[0]
    )
projected_points = np.array(projected_points, dtype=np.float32)

plt.figure(figsize=(10, 10))
plt.plot(projected_points[:, 0], projected_points[:, 1], ".-")
plt.xlim(0, 2500)
plt.ylim(2000, 0)
plt.show()

We see that the points make up a line. This line is distorted because `project_points` applies the lens model. We can do the same for a grid of pixels across the image:

In [None]:
pixels = np.stack(
    np.meshgrid(np.linspace(250, 2250, 6), np.linspace(200, 1800, 5)), axis=-1
).astype(np.float32)
undistorted_normalized_pixels = lens_model.undistort_pixels(
    normalized_pixels=lens_model.normalize_pixels(pixels=pixels)
)
camera_vectors = np.pad(
    undistorted_normalized_pixels, ((0, 0), (0, 0), (0, 1)), constant_values=1.0
)

depths = np.linspace(depth_min, depth_max, num_samples)
projected_points = []
for depth in depths:
    xyz = camera_vectors * depth
    projected_points.append(
        project_points(
            points=xyz.reshape(-1, 3),
            lens_model=lens_model,
            transformation_matrix=transformation_matrix.inverse(),
        ).reshape(pixels.shape)
    )
projected_points = np.array(projected_points, dtype=np.float32)


projected_points_reshaped = projected_points.reshape(
    -1, pixels.shape[0] * pixels.shape[1], 2
)

plt.figure(figsize=(10, 10))
plt.plot(projected_points_reshaped[..., 0], projected_points_reshaped[..., 1], ".-")
plt.xlim(0, 2500)
plt.ylim(2000, 0)
plt.axis("image")
plt.show()

We see how these pixels in the main camera project as lines in the second camera. These lines are called epipolar lines: each one is the set of positions in the second camera where a given pixel of the main camera can appear, depending on its depth. This is the basis of stereo matching: we only need to search for the corresponding point along the epipolar line, which is a much easier problem than searching the entire image. It also works regardless of the orientation of the cameras, as long as we know the transformation between them.
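
To see that the curvature in the plots comes from the lens distortion and not from the geometry, here is a minimal sketch with an ideal pinhole camera (no distortion, no camera matrix). I assume, as the use of `.inverse()` above suggests, that the transformation maps points from the second camera into the main camera; the ray and variable names are made up for illustration:

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Same relative pose as in the example above. Assumed convention:
# X_main = R @ X_second + t.
R = Rotation.from_rotvec([0.0, -11.31, 0.0], degrees=True).as_matrix()
t = np.array([100.0, 0.0, 0.0])

# One viewing ray of the main camera (normalized, undistorted coordinates).
ray = np.array([0.2, -0.1, 1.0])
depths = np.linspace(350.0, 650.0, 10)

# Sample the ray, move the samples into the second camera, and apply
# an ideal pinhole projection.
points_main = depths[:, None] * ray
points_second = (points_main - t) @ R  # row-wise R.T @ (X - t)
projected = points_second[:, :2] / points_second[:, 2:3]

# Collinearity check: the 2D cross product of (p_i - p_0) and
# (p_last - p_0) vanishes when all samples lie on one straight line.
d = projected - projected[0]
cross = d[:, 0] * d[-1, 1] - d[:, 1] * d[-1, 0]
print(np.abs(cross).max())
```

Without distortion the samples are collinear up to floating-point error, so in normalized, undistorted coordinates an epipolar line really is a straight line.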

### Essential Matrix

In today's session we looked at the essential matrix, a matrix that encodes the relative rotation and translation between two cameras. It is defined as:

$$
E = [\vec{t}]_x R
$$

Where $[\vec{t}]_x$ is the skew-symmetric matrix of the translation vector $\vec{t}$ and $R$ is the rotation matrix. The essential matrix is used to calculate the epipolar line; it satisfies:

$$
\vec{v}_0^T E \vec{v}_1 = 0
$$

Where $\vec{v}_0$ and $\vec{v}_1$ are the normalized, undistorted pixel coordinates in the two cameras, written as homogeneous vectors $\begin{bmatrix} x'' & y'' & 1 \end{bmatrix}^T$. If we define:

$$
\begin{bmatrix} a & b & c \end{bmatrix} = \vec{v}_0^T E
$$

Then the epipolar line in the second camera is:

$$
a x''_1 + b y''_1 + c = 0
$$

Such that we have the line equation:

$$
y''_1 = -\frac{a}{b} x''_1 - \frac{c}{b}
$$

This means that for any pixel in the first camera, we can calculate every position where it could appear in the second camera with a simple matrix multiplication.
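
The constraint can be checked numerically on a single point. This is a sketch under an assumed convention, $\vec{X}_0 = R \vec{X}_1 + \vec{t}$ (second camera to main camera); the helper `skew` and the test point are hypothetical and only serve the illustration:

```python
import numpy as np
from scipy.spatial.transform import Rotation


def skew(v: np.ndarray) -> np.ndarray:
    """Return [v]_x such that skew(v) @ w equals np.cross(v, w)."""
    return np.array(
        [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]
    )


# Assumed convention: X_0 = R @ X_1 + t.
R = Rotation.from_rotvec([0.0, -11.31, 0.0], degrees=True).as_matrix()
t = np.array([100.0, 0.0, 0.0])
E = skew(t) @ R

# A 3D point expressed in both cameras, reduced to normalized coordinates.
X_0 = np.array([40.0, -25.0, 500.0])
X_1 = R.T @ (X_0 - t)
v_0 = X_0 / X_0[2]
v_1 = X_1 / X_1[2]

# Epipolar constraint: zero up to floating-point error.
residual = v_0 @ E @ v_1

# Line coefficients in the second camera and the resulting slope/intercept.
a, b, c = v_0 @ E
slope, intercept = -a / b, -c / b
line_error = v_1[1] - (slope * v_1[0] + intercept)
print(residual, line_error)
```

Both printed values should be numerically zero: the point seen in the second camera lies on the line predicted from the first camera alone.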

Let's calculate the essential matrix for the example above:

::: {admonition} Skew Symmetric Matrix
:class: toggle, tip
$$
[\vec{v}]_x = \begin{bmatrix} 0 & -v_z & v_y \\ v_z & 0 & -v_x \\ -v_y & v_x & 0 \end{bmatrix}
$$
:::

In [9]:
def skew_symmetric_matrix(
    vector: NDArray[Shape["3"], Float32],
) -> NDArray[Shape["3, 3"], Float32]:
    return np.array(
        [
            [0.0, -vector[2], vector[1]],
            [vector[2], 0.0, -vector[0]],
            [-vector[1], vector[0], 0.0],
        ],
        dtype=np.float32,
    )


essential_matrix = (
    skew_symmetric_matrix(transformation_matrix.translation)
    @ transformation_matrix.rotation.as_matrix()
)

Then we can use this to calculate the slope and intercept of the epipolar line:

In [10]:
abc = (camera_vectors @ essential_matrix).reshape(-1, 3)

slope = -abc[..., 0] / abc[..., 1]
intercept = -abc[..., 2] / abc[..., 1]

We can distort and denormalize these lines and plot them on top of the projected points to show that they match:

In [None]:
undistorted_normalized_projector_pixels_reshaped = lens_model.undistort_pixels(
    normalized_pixels=lens_model.normalize_pixels(pixels=projected_points_reshaped)
)

x_min = undistorted_normalized_projector_pixels_reshaped[..., 0].min(axis=0)
x_max = undistorted_normalized_projector_pixels_reshaped[..., 0].max(axis=0)

new_pixels = []
for _slope, _intercept, _x_min, _x_max in zip(slope, intercept, x_min, x_max):
    x = np.linspace(_x_min, _x_max, 10)
    y = _slope * x + _intercept
    _new_pixels = np.stack([x, y], axis=-1)
    new_pixels.append(
        lens_model.denormalize_pixels(
            pixels=lens_model.distort_pixels(normalized_pixels=_new_pixels[None, ...])
        )[0]
    )
new_pixels = np.array(new_pixels)


plt.figure(figsize=(10, 10))
plt.plot(projected_points_reshaped[..., 0], projected_points_reshaped[..., 1], ".")
plt.plot(new_pixels[..., 0].T, new_pixels[..., 1].T, "k-")
plt.xlim(0, 2500)
plt.ylim(2000, 0)
plt.axis("image")
plt.show()

## Plane Sweep Algorithm

Now we have seen how we can move along a line in the image regardless of the orientation of the cameras. Even though we will not use epipolar geometry directly, in the form of the essential matrix or the fundamental matrix, we can still use the concept of projecting points into an image that we worked on in [workshop 3](./03_image_distortion_and_undistortion.ipynb) and [workshop 4](./04_3d_2d_projections_and_pnp.ipynb). To do this we can use a plane sweeping algorithm. Let's start by loading the same dataset as in the [previous workshop](./06_stereo_matching_fundamentals.ipynb):

In [None]:
data_dir = TestDataPaths.stereo_data_0_dir
stereo_data = StereoData.from_path(data_dir)

fig, ax = plt.subplots(2, 1, figsize=(12, 12))
ax[0].imshow(stereo_data.image_0, cmap="gray", vmin=0, vmax=1)
ax[0].set_title("Image left")
ax[0].axis("off")
ax[1].imshow(stereo_data.image_1, cmap="gray", vmin=0, vmax=1)
ax[1].set_title("Image right")
ax[1].axis("off")
plt.tight_layout()
plt.show()

We start by defining a given distance; for now I will use 140 mm. We can then create a plane at that depth in the coordinate system of the main camera and project this plane onto the second camera. We can then recreate the image as seen from the main camera by interpolating the image values from the second camera. Let's start by defining the camera vectors of the main camera:

In [13]:
pixels = np.indices(stereo_data.image_0.shape[:2], dtype=np.float32)[::-1].transpose(
    (1, 2, 0)
)
undistorted_normalized_pixels = stereo_data.lens_model_0.undistort_pixels(
    normalized_pixels=stereo_data.lens_model_0.normalize_pixels(pixels=pixels)
)
camera_vectors = np.pad(
    undistorted_normalized_pixels, ((0, 0), (0, 0), (0, 1)), constant_values=1.0
)

Then we project all those points onto the second camera and interpolate the second image at the projected coordinates:

In [None]:
def repeoject_image_at_depth(
    image: NDArray[Shape["H, W, ..."], Float32],
    camera_vectors: NDArray[Shape["H, W, 3"], Float32],
    depth: float,
    lens_model: LensModel,
    transformation_matrix: TransformationMatrix,
) -> NDArray[Shape["H, W, ..."], Float32]:
    xyz = camera_vectors * depth

    projected_points = project_points(
        points=xyz.reshape(-1, 3),
        lens_model=lens_model,
        transformation_matrix=transformation_matrix.inverse(),
    ).reshape(*camera_vectors.shape[:2], 2)

    return np.stack(
        [
            map_coordinates(
                input=_image,
                coordinates=[projected_points[..., 1], projected_points[..., 0]],
                order=1,
                mode="constant",
                cval=np.nan,
            )
            for _image in image.transpose(2, 0, 1)
        ],
        axis=-1,
        dtype=np.float32,
    )


depth = 140.0

new_image = repeoject_image_at_depth(
    image=stereo_data.image_1,
    camera_vectors=camera_vectors,
    depth=depth,
    lens_model=stereo_data.lens_model_1,
    transformation_matrix=stereo_data.transformation_matrix,
)

plt.figure(figsize=(10, 10))
plt.imshow(new_image, cmap="gray", vmin=0, vmax=1)
plt.axis("off")
plt.title(f"Reprojected image at depth {depth:.1f}")
plt.show()
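
One detail in the function above that is easy to get wrong is the coordinate order passed to `map_coordinates`. Here is a small sketch on a toy image (the image and sample positions are made up) showing the row-first order, the bilinear interpolation, and the NaN fill outside the image:

```python
import numpy as np
from scipy.ndimage import map_coordinates

image = np.arange(12, dtype=np.float64).reshape(3, 4)  # 3 rows, 4 columns

# Sample positions given as (x, y) pixel coordinates, the way a
# projection returns them.
xy = np.array([[1.5, 0.0], [0.0, 1.0], [2.0, 1.5], [5.0, 1.0]])

# map_coordinates expects coordinates in array-axis order, so rows (y)
# come first and columns (x) second.
sampled = map_coordinates(
    input=image,
    coordinates=[xy[:, 1], xy[:, 0]],
    order=1,  # bilinear interpolation
    mode="constant",
    cval=np.nan,  # positions outside the image become NaN
)
print(sampled)
```

The last position lies outside the image and comes back as NaN, which is how pixels without a counterpart in the second camera end up empty in the reprojected image.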

We can now compute the difference between the reprojected image and the main camera image. I'm using the sum of absolute differences over the last (channel) axis here:

In [None]:
image_diff = np.abs(stereo_data.image_0 - new_image).sum(axis=-1)

plt.figure(figsize=(10, 10))
plt.imshow(image_diff, vmin=-0.1, vmax=0.1)
plt.axis("off")
plt.title("Image difference")
plt.show()

We see that this is similar to what we did with disparity, and we might reason that we can use the same algorithm for stereo matching, just with depth instead of disparity. This also gives us depth directly, so the triangulation is built into the algorithm. Let's repeat this process for a range of depths, 130 to 150 mm:

In [None]:
def plane_sweeping(
    image_0: NDArray[Shape["H, W, ..."], Float32],
    lens_model_0: LensModel,
    image_1: NDArray[Shape["H, W, ..."], Float32],
    lens_model_1: LensModel,
    transformation_matrix: TransformationMatrix,
    depth_range: NDArray[Shape["2"], Float32],
    step_size: float,
    block_size: int,
    subpixel_fit: bool,
) -> NDArray[Shape["H, W, 3"], Float32]:
    pixels = np.indices(image_0.shape[:2], dtype=np.float32)[::-1].transpose((1, 2, 0))
    undistorted_normalized_pixels = lens_model_0.undistort_pixels(
        normalized_pixels=lens_model_0.normalize_pixels(pixels=pixels)
    )
    camera_vectors_0 = np.pad(
        undistorted_normalized_pixels, ((0, 0), (0, 0), (0, 1)), constant_values=1.0
    )

    depths = np.arange(
        start=depth_range[0],
        stop=depth_range[1] + step_size,
        step=step_size,
        dtype=np.float32,
    )
    error = []
    for depth in depths:
        shifted_image_1 = repeoject_image_at_depth(
            image=image_1,
            camera_vectors=camera_vectors_0,
            depth=depth,
            lens_model=lens_model_1,
            transformation_matrix=transformation_matrix,
        )
        single_pixel_error = np.abs(image_0 - shifted_image_1).sum(axis=-1)

        convoluted_error = convolve2d(
            convolve2d(
                single_pixel_error, np.ones((1, block_size)) / block_size, mode="same"
            ),
            np.ones((block_size, 1)) / block_size,
            mode="same",
        )
        error.append(convoluted_error)
    error_array = np.array(error, dtype=np.float32)

    if subpixel_fit:
        output_value = find_subvalue_poly_2(values=depths, function_value=error_array)
    else:
        output_value = depths[np.argmin(error_array, axis=0)].astype(np.float32)

    output_value[output_value >= depths.max()] = np.nan
    output_value[output_value <= depths.min()] = np.nan

    return camera_vectors_0 * output_value[..., None]


xyz_single_plane_sweep = plane_sweeping(
    image_0=stereo_data.image_0,
    lens_model_0=stereo_data.lens_model_0,
    image_1=stereo_data.image_1,
    lens_model_1=stereo_data.lens_model_1,
    transformation_matrix=stereo_data.transformation_matrix,
    depth_range=np.array([130.0, 150.0], dtype=np.float32),
    step_size=1.5,
    block_size=29,
    subpixel_fit=True,
)

plt.figure(figsize=(12, 8))
plt.imshow(xyz_single_plane_sweep[..., 2])
plt.colorbar()
plt.axis("off")
plt.title("Depth map from single plane sweeping")
plt.show()
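
The block averaging inside `plane_sweeping` runs two 1D convolutions instead of one 2D convolution. Here is a small sketch on random data (a stand-in for the per-pixel error, not the real images) showing that both give the same result:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
error = rng.random((20, 30))  # stand-in for a per-pixel error image
block_size = 5

# Two 1D passes: first along x, then along y.
row_kernel = np.ones((1, block_size)) / block_size
col_kernel = np.ones((block_size, 1)) / block_size
separable = convolve2d(
    convolve2d(error, row_kernel, mode="same"), col_kernel, mode="same"
)

# One pass with the full 2D box kernel.
box_kernel = np.ones((block_size, block_size)) / block_size**2
full = convolve2d(error, box_kernel, mode="same")

print(np.allclose(separable, full))
```

The separable version needs roughly `2 * block_size` multiplications per pixel instead of `block_size**2`, which matters for a block size like 29.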

In [None]:
open3d_visualize_point_cloud(xyz=xyz_single_plane_sweep, rgb=stereo_data.image_0)
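
The implementation of `find_subvalue_poly_2` is not shown in this workshop. Judging by the name, it presumably fits a second-order polynomial around the error minimum, but that is my assumption. A minimal sketch of the idea for a single pixel, assuming uniformly spaced depths (`refine_minimum` is a hypothetical helper, not part of `oaf_vision_3d`):

```python
import numpy as np


def refine_minimum(values: np.ndarray, cost: np.ndarray) -> float:
    """Parabola fit through the lowest cost sample and its two neighbours."""
    i = int(np.argmin(cost))
    if i == 0 or i == len(cost) - 1:
        return float("nan")  # minimum on the border: no neighbours to fit
    c_prev, c_min, c_next = cost[i - 1], cost[i], cost[i + 1]
    denominator = c_prev - 2.0 * c_min + c_next
    if denominator <= 0.0:
        return float(values[i])  # flat cost: keep the sampled value
    step = values[1] - values[0]
    offset = 0.5 * (c_prev - c_next) / denominator
    return float(values[i] + offset * step)


depths = np.arange(130.0, 151.5, 1.5)
true_depth = 138.2
cost = (depths - true_depth) ** 2  # toy cost with a known minimum

refined = refine_minimum(depths, cost)
print(refined)
```

For this toy cost the fit recovers a depth that lies between two samples, which is why a fairly coarse `step_size` can still give a smooth depth map.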

## Homework - Multiple Views

Now that we have an approach to match any camera to another camera at any position, we are no longer limited to one secondary camera. We can add as many cameras as we want, as long as we know the transformation between them. We will now load two datasets, where the second one adds another image to the left of the original left image, which now acts as the main image:

In [None]:
stereo_data_0 = StereoData.from_path(TestDataPaths.stereo_data_0_dir)
stereo_data_1 = StereoData.from_path(TestDataPaths.stereo_data_1_dir)

fig, ax = plt.subplots(3, 1, figsize=(12, 15))
ax[0].imshow(stereo_data_1.image_1, cmap="gray", vmin=0, vmax=1)
ax[0].set_title("Image left")
ax[0].axis("off")
ax[1].imshow(stereo_data_0.image_0, cmap="gray", vmin=0, vmax=1)
ax[1].set_title("Image main")
ax[1].axis("off")
ax[2].imshow(stereo_data_0.image_1, cmap="gray", vmin=0, vmax=1)
ax[2].set_title("Image right")
ax[2].axis("off")
plt.tight_layout()
plt.show()

For any distance we could now reproject both the left and right image to the center image:

In [19]:
pixels = np.indices(stereo_data_0.image_0.shape[:2], dtype=np.float32)[::-1].transpose(
    (1, 2, 0)
)
undistorted_normalized_pixels = stereo_data_0.lens_model_0.undistort_pixels(
    normalized_pixels=stereo_data_0.lens_model_0.normalize_pixels(pixels=pixels)
)
camera_vectors_0 = np.pad(
    undistorted_normalized_pixels, ((0, 0), (0, 0), (0, 1)), constant_values=1.0
)

In [None]:
depth = 140.0

shifted_image_0 = repeoject_image_at_depth(
    image=stereo_data_0.image_1,
    camera_vectors=camera_vectors_0,
    depth=depth,
    lens_model=stereo_data_0.lens_model_1,
    transformation_matrix=stereo_data_0.transformation_matrix,
)
shifted_image_1 = repeoject_image_at_depth(
    image=stereo_data_1.image_1,
    camera_vectors=camera_vectors_0,
    depth=depth,
    lens_model=stereo_data_1.lens_model_1,
    transformation_matrix=stereo_data_1.transformation_matrix,
)

fig, ax = plt.subplots(3, 1, figsize=(12, 15))
ax[0].imshow(shifted_image_1, cmap="gray", vmin=0, vmax=1)
ax[0].set_title("Image left (shifted)")
ax[0].axis("off")
ax[1].imshow(stereo_data_0.image_0, cmap="gray", vmin=0, vmax=1)
ax[1].set_title("Image main")
ax[1].axis("off")
ax[2].imshow(shifted_image_0, cmap="gray", vmin=0, vmax=1)
ax[2].set_title("Image right (shifted)")
ax[2].axis("off")
plt.tight_layout()
plt.show()

And we can show the absolute difference between the center image and the reprojected images:

In [None]:
diff_0 = np.abs(stereo_data_0.image_0 - shifted_image_0).sum(axis=-1)
diff_1 = np.abs(stereo_data_0.image_0 - shifted_image_1).sum(axis=-1)

fig, ax = plt.subplots(2, 1, figsize=(12, 15))
ax[0].imshow(diff_1, vmin=0, vmax=0.2)
ax[0].set_title("Difference left")
ax[0].axis("off")
ax[1].imshow(diff_0, vmin=0, vmax=0.2)
ax[1].set_title("Difference right")
ax[1].axis("off")
plt.tight_layout()
plt.show()

We can also draw some reference lines to compare with the main camera:

In [None]:
fig, ax = plt.subplots(3, 1, figsize=(12, 15))
ax[0].imshow(shifted_image_1, cmap="gray", vmin=0, vmax=1)
ax[0].plot([0, shifted_image_0.shape[1]], [1000, 1000], "r-")
ax[0].plot([1280, 1280], [0, shifted_image_0.shape[0]], "r-")
ax[0].set_title("Image left (shifted)")
ax[0].axis("off")
ax[1].imshow(stereo_data_0.image_0, cmap="gray", vmin=0, vmax=1)
ax[1].plot([0, shifted_image_0.shape[1]], [1000, 1000], "r-")
ax[1].plot([1280, 1280], [0, shifted_image_0.shape[0]], "r-")
ax[1].set_title("Image main")
ax[1].axis("off")
ax[2].imshow(shifted_image_0, cmap="gray", vmin=0, vmax=1)
ax[2].plot([0, shifted_image_0.shape[1]], [1000, 1000], "r-")
ax[2].plot([1280, 1280], [0, shifted_image_0.shape[0]], "r-")
ax[2].set_title("Image right (shifted)")
ax[2].axis("off")
plt.tight_layout()
plt.show()

### Task

Your task is now to make a plane sweeping function that no longer takes a single pair of images, but a main image and any number of secondary cameras. The function should return the depth of the main image. You can use the code below as a starting point:

In [23]:
def multi_camera_plane_sweeping(
    image: NDArray[Shape["H, W, ..."], Float32],
    lens_model: LensModel,
    secondary_images: list[NDArray[Shape["H, W, ..."], Float32]],
    secondary_lens_models: list[LensModel],
    secondary_transformation_matrices: list[TransformationMatrix],
    depth_range: NDArray[Shape["2"], Float32],
    step_size: float,
    block_size: int,
    subpixel_fit: bool,
) -> NDArray[Shape["H, W, 3"], Float32]: ...

In [None]:
xyz_multiple_plane_sweep_ = multi_camera_plane_sweeping(
    image=stereo_data_0.image_0,
    lens_model=stereo_data_0.lens_model_0,
    secondary_images=[stereo_data_0.image_1, stereo_data_1.image_1],
    secondary_lens_models=[stereo_data_0.lens_model_1, stereo_data_1.lens_model_1],
    secondary_transformation_matrices=[
        stereo_data_0.transformation_matrix,
        stereo_data_1.transformation_matrix,
    ],
    depth_range=np.array([135.0, 142.0], dtype=np.float32),
    step_size=0.5,
    block_size=29,
    subpixel_fit=True,
)

xyz_multiple_plane_sweep = (
    xyz_single_plane_sweep
    if xyz_multiple_plane_sweep_ is None
    else xyz_multiple_plane_sweep_
)

plt.figure(figsize=(12, 8))
plt.imshow(xyz_multiple_plane_sweep[..., 2])
plt.colorbar()
plt.axis("off")
plt.title("Depth map from multiple plane sweep")
plt.show()

We have now used two secondary cameras to calculate the depth for the main camera. Even so, we still see the limitation of the stereo matching algorithm: it is hard to get depth in areas where there is no texture. Let's visualize it with Open3D:

In [None]:
open3d_visualize_point_cloud(xyz=xyz_multiple_plane_sweep, rgb=stereo_data_0.image_0)

### Filtering

To improve the result we can add some simple filtering, for example to reduce the number of outliers. My idea was to apply a relatively big `ones` kernel and then remove points that are more than a certain distance away from the filtered depth. Since what we are looking at is smooth and flat, this should work, but remember that it most likely will not work for more complex scenes.

In [None]:
kernel_size = 25
limit = 0.5

xyz_filtered = xyz_multiple_plane_sweep.copy()

z_convolved = convolution_2d_nan(
    xyz_filtered[..., 2], np.ones(kernel_size, dtype=np.float32)
)
invalid = np.abs(xyz_filtered[..., 2] - z_convolved) > limit
xyz_filtered[invalid] = np.nan

plt.figure(figsize=(12, 8))
plt.imshow(xyz_filtered[..., 2], interpolation="none")
plt.colorbar()
plt.axis("off")
plt.title("Depth map (filtered)")
plt.show()

In [None]:
open3d_visualize_point_cloud(xyz=xyz_filtered, rgb=stereo_data_0.image_0)
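
`convolution_2d_nan` itself is not shown here. One common way to filter an image that contains NaNs is normalized convolution: average only the valid pixels and divide by the weights that were actually used. The sketch below assumes that behavior, and `nan_aware_convolve` is a hypothetical helper that may differ from what the library does:

```python
import numpy as np
from scipy.signal import convolve2d


def nan_aware_convolve(image: np.ndarray, kernel_1d: np.ndarray) -> np.ndarray:
    """Weighted average that skips NaN pixels, built from a 1D kernel."""
    valid = np.isfinite(image)
    filled = np.where(valid, image, 0.0)
    kernel_2d = np.outer(kernel_1d, kernel_1d)

    # Weighted sum of valid values and the sum of weights that were used.
    weighted_sum = convolve2d(filled, kernel_2d, mode="same")
    weight_sum = convolve2d(valid.astype(float), kernel_2d, mode="same")

    result = np.full(image.shape, np.nan)
    np.divide(weighted_sum, weight_sum, out=result, where=weight_sum > 0)
    result[~valid] = np.nan  # keep holes as holes
    return result


depth = np.full((11, 11), 140.0)
depth[5, 5] = np.nan  # a hole
depth[5, 3] = 145.0  # an outlier

smoothed = nan_aware_convolve(depth, np.ones(5))
outliers = np.abs(depth - smoothed) > 0.5
print(np.argwhere(outliers))
```

Only the outlier is flagged on this toy depth map. Its neighbors are pulled up slightly by the local average, but stay well inside the limit.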

We can also apply a Gaussian blur to the depth of the resulting point cloud. This should reduce the noise in the point cloud:

In [None]:
xyz_filtered_smoothed = xyz_filtered.copy()
kernel = gaussian(M=7, std=1.5)
kernel /= kernel.sum()

z_filtered_smoothed = convolution_2d_nan(image=xyz_filtered[..., 2], kernel=kernel)
xyz_filtered_smoothed *= z_filtered_smoothed[..., None] / xyz_filtered[..., 2:3]

plt.figure(figsize=(10, 10))
plt.imshow(xyz_filtered_smoothed[..., 2], interpolation="none")
plt.colorbar()
plt.axis("off")
plt.title("Depth map (filtered and smoothed)")
plt.show()

In [None]:
open3d_visualize_point_cloud(xyz=xyz_filtered_smoothed, rgb=stereo_data_0.image_0)
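
The smoothing cell above rescales the full XYZ vector instead of overwriting only the Z value. Here is a tiny sketch with one made-up point showing why: scaling by the ratio of new to old depth moves the point along its viewing ray, so it still projects to the same pixel.

```python
import numpy as np

# One point on the viewing ray (x/z, y/z, 1) = (0.1, -0.2, 1) at depth 140.
xyz = np.array([[[14.0, -28.0, 140.0]]])
z_smoothed = np.array([[139.0]])

# Scale the whole vector so that z becomes the smoothed value.
xyz_smoothed = xyz * (z_smoothed[..., None] / xyz[..., 2:3])

print(xyz_smoothed)
print(xyz_smoothed[..., :2] / xyz_smoothed[..., 2:3])  # ray direction
```

The ray direction printed at the end is unchanged, while X, Y and Z are all scaled together.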