Osnabrück University - Computer Vision (Winter Term 2020/21) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack, Axel Schaffland, Ludwig Schallner, Artem Petrov

# Recap I

This sheet is a recap of the first half of the term. You do not have to present it to your tutors, and it does not count towards the number of passed sheets required for the exam. In other words, completing this sheet is optional, but we highly recommend that you solve the assignments as part of your preparations for the exam. 

## Introduction

**a)** What are the goals of *computer vision* and *image processing*? Name some subtasks. Give one example problem and describe how to solve it with the algorithms presented in this course. 

**Goal of CV**: Recognition of the image by the computer
- detection of regions of interest
- boundary detection
- feature extraction
- classification of colors, shapes, objects
- 3D representations of real scenes
- reconstruction of 3D surfaces
- motion detection
    - object / background separation
    - direction and velocity computation
    - object tracking

**Goal of image processing**: Enhance images to facilitate analysis by a human
- repair corrupted images
- compensation of bad acquisition conditions (e.g. contrast enhancement)
- improve perceptibility (e.g. contrast enhancement)
- 'highlight' information

**Example problem: Basic object recognition**

A simple approach for object recognition is **template matching**:
- construct a template (prototypical model of the object you'd like to find in the image)
- search for template in image by computing  similarity between template and underlying image patch
    - two similarity measures
        - mean absolute difference (MAD)
        - correlation coefficient (better)
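
The two similarity measures can be sketched in a few lines of NumPy. This is only a minimal illustration on a random toy image; the function names `mad`, `correlation_coefficient` and `match_template` are made up for this example and are not taken from the lecture.

```python
import numpy as np

def mad(patch, template):
    """Mean absolute difference: 0 for identical patches, larger means less similar."""
    return np.mean(np.abs(patch - template))

def correlation_coefficient(patch, template):
    """Correlation of the mean-free gray values: 1 for a perfect linear match."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p ** 2).sum() * (t ** 2).sum())
    return (p * t).sum() / denom if denom > 0 else 0.0

def match_template(image, template, score):
    """Slide the template over the image and evaluate `score` at every valid position."""
    th, tw = template.shape
    rows = image.shape[0] - th + 1
    cols = image.shape[1] - tw + 1
    result = np.zeros((rows, cols))
    for y in range(rows):
        for x in range(cols):
            result[y, x] = score(image[y:y + th, x:x + tw], template)
    return result

# toy data: the template is cut out of the image at row 5, column 7
rng = np.random.default_rng(0)
image = rng.random((20, 20))
template = image[5:9, 7:11].copy()

# MAD: the best match is the minimum
mad_map = match_template(image, template, mad)
best_mad = tuple(map(int, np.unravel_index(np.argmin(mad_map), mad_map.shape)))

# correlation coefficient: the best match is the maximum,
# and it is still found after a brightness / contrast change of the image
darker = 0.5 * image + 0.2
cc_map = match_template(darker, template, correlation_coefficient)
best_cc = tuple(map(int, np.unravel_index(np.argmax(cc_map), cc_map.shape)))

print(best_mad, best_cc)
```

The second search illustrates why the correlation coefficient is marked as "better": it is invariant to linear changes of brightness and contrast, whereas the MAD is not.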

**b)** Describe the difference between *top down* and *bottom up* strategies. What are they called from another perspective?

Those are processing strategies:

**bottom up**:
- starting from data
- looking for increasingly complex features and connections until they match the model
- aka **data driven**

**top down**:
- try to find model within data
- aka **model driven**

Commonly a mixture of both is used!

**c)** What is the semantic gap?

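The **semantic gap** is the discrepancy between the low level features that can be computed from an image (e.g. color, texture) and the high level concepts a human associates with it (e.g. water, building, sky). We can only hope for a correlation between the two.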

## Image Acquisition

**a)** Draw (on paper) the concept of a pinhole camera. Draw at least an object, rays, the pinhole, and the image plane.

<img src="img/pinhole.png" width="600"/>

**b)** Explain how human color vision works.

- visible wavelengths: $\approx 380 nm - 750 nm$
- the eye contains three types of receptors with different spectral sensitivities (RGB)
- arranged side by side in the retina
- so we reduce the incoming spectrum to just three stimuli $\infty \rightarrow 3$ dimensions

**c)** Is a Bayer-Filter a local operator? Explain your answer!

No, even though it's also called filter (just like filter kernels), it's a pixel layout of a 1-chip camera.

**d)**  What is the smallest distance between two pixels under 4-/8-neighborhood?

$4$-neighborhood: **Manhattan distance**: $|x_1 - x_2| + |y_1 - y_2|$  
$\rightarrow$ In $4$-neighborhood, you can only move up/down/left/right, which is exactly the Manhattan distance

$8$-neighborhood: **chessboard distance**: $\max (|x_1 - x_2|, |y_1 - y_2|)$  
$\rightarrow$ In $8$-neighborhood, you can move as the king in chess
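
As a small worked example, both distances can be computed directly from the formulas above (the helper names `manhattan` and `chessboard` are chosen for this sketch only):

```python
def manhattan(p, q):
    """Number of steps between p and q when only up/down/left/right moves are allowed."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def chessboard(p, q):
    """Number of steps between p and q when diagonal moves are allowed as well."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

p, q = (0, 0), (3, 4)
print(manhattan(p, q))   # 7
print(chessboard(p, q))  # 4
```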

**e)** Name the two types of loss of information and give an example for each.

**Stochastic loss:** E.g. noise

**Deterministic loss:** E.g. projection and sampling, bad camera parameters (over-/underexposure, bad focus), motion blur

## Basic Operators

**a)** What is a *point operator*, a *local operator*, a *global operator*? Provide examples for each of them. Which are *linear* and which are not? Give an example for a *non-homogeneous* operator. Describe application scenarios for different operators. What is a *rank filter*?

**point operator:** $g'(x, y) = O(g(x, y))$ $\rightarrow$ result pixel depends only on input pixel
- e.g. **thresholding**: $g'(x, y) = \Theta (g(x, y) - \vartheta)$ with threshold $\vartheta$ and $\Theta (x) = 0$ for $x < 0$ and $\Theta (x) = 1$ otherwise
    - non-linear: it matters whether you first multiply each pixel by a scalar and then apply the threshold or vice versa
    - application example: binarization of a bimodal distribution
- another example would be a **linear transform**, e.g. $g'(x, y) = a \cdot g(x, y) + b$
    - obviously linear
    - application example: luminance- and contrast-enhancement

**local operator:** $g'(x, y) = O(g(x, y), g(surroundings(x, y)))$ $\rightarrow$ result pixel depends on input pixel + surrounding pixels
- e.g. **convolution** defined by filter kernel: $g'(x, y) = \sum_{i \in [-m, m]} \sum_{j \in [-n, n]} k(i+m, j+n) \cdot g(x+i, y+j)$ (scalar product of kernel and image patch)
    - linear (scalar product, we had to prove that)
    - application example: smoothing, edge detection

**global operator:** $g'(x, y) = O(g(all pixels))$ $\rightarrow$ result pixel depends on all pixels of the input image
- e.g. **Fourier transform**: Transforms image $g$ from the spatial domain to the frequency domain
    - linear (additivity and homogeneity hold)
    - application example: 
        - fast computation of convolution in Fourier space (just a multiplication)
        - detect texture in images
        - compression

**non-homogeneous operator**
 - depends explicitly on the location: $g'(x, y) = O(g(x, y), x, y)$
 - means that the operator does change based on the pixel location (different behavior)
 - e.g. darken upper part of the image such as a bright sky (depends on $y$ value)
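
A minimal sketch of such a location-dependent operator, assuming a simple linear ramp of the darkening factor over the rows (the function `darken_top` and its `strength` parameter are illustrative choices, not part of the lecture):

```python
import numpy as np

def darken_top(img, strength=0.5):
    """Scale each row by a factor that grows from (1 - strength) at the top to 1 at the bottom."""
    rows = img.shape[0]
    factors = np.linspace(1.0 - strength, 1.0, rows)
    # the factor depends on the row index y only, not on the gray value
    return img * factors[:, np.newaxis]

img = np.ones((4, 3))
out = darken_top(img)
print(out[:, 0])  # top row is darkened most, bottom row is unchanged
```

The same input gray value is mapped to different outputs depending on $y$, which is what makes the operator non-homogeneous.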

**Rank filter:**
- local operator
- non-linear (can not be implemented as convolution)
- sort gray values covered by kernel
- select gray value from sorted list that replaces the current pixel (result)
- the selection of the position determines the type of rank filter:
    - **min filter**: select min gray value (first position)
    - **median filter**: select center of the list
    - **max filter**: select max gray value (last position)
- rank filters can be used for several purposes:
    - image quality enhancement, e.g. smoothing, sharpening
    - image pre-processing, e.g. noise reduction, contrast enhancement
    - feature extraction, e.g. border detection, isolated point detection
    - image post-processing, e.g. small object removal, object grouping, contour smoothing
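
The sort-and-select procedure can be sketched with plain NumPy. This is an illustrative implementation only; the name `rank_filter`, the square $3 \times 3$ neighborhood and the edge-repeating border handling are assumptions made for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rank_filter(img, rank, size=3):
    """Replace each pixel by the value at position `rank` of its sorted neighborhood."""
    pad = size // 2
    padded = np.pad(img, pad, mode='edge')          # border handling: repeat edge pixels
    windows = sliding_window_view(padded, (size, size))
    flat = windows.reshape(img.shape[0], img.shape[1], size * size)
    return np.sort(flat, axis=2)[:, :, rank]

# constant image with one bright and one dark outlier
img = np.array([
    [10,  10, 10, 10],
    [10, 255, 10, 10],
    [10,  10, 10, 10],
    [10,  10,  0, 10]
])

minimum = rank_filter(img, 0)   # first position of the sorted list
median = rank_filter(img, 4)    # center of the 9 sorted values
maximum = rank_filter(img, 8)   # last position of the sorted list

print(median)
```

In this toy image the median filter removes both outliers completely, while the min and max filters spread the dark and the bright outlier, respectively.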

**b)** Load an image and apply different local operators (convolution, nonlinear smoothing, morphological) and display the results. Explain their effects and possible applications.

In [None]:
%matplotlib inline
from skimage import filters, morphology
import matplotlib.pyplot as plt
import numpy as np
import random

def get_test_img():
    return plt.imread('img/test.JPG')

def get_test_img_gray():
    # average the color channels (vectorized instead of looping over all pixels)
    img = get_test_img().astype(float)
    black_white = img.sum(axis=2) / 3
    # normalize to [0, 1]
    return black_white / black_white.max()

def generate_noisy_img(img):
    prob = 0.5
    noisy_img = img.copy()

    for x in range(1, noisy_img.shape[0] - 1, 2):
        for y in range(1, noisy_img.shape[1] - 1, 4):
            # add noise
            if random.random() < prob:
                if random.choice([0, 1]) == 0:
                    noisy_img[x][y] = 0
                    noisy_img[x - 1][y] = 0
                    noisy_img[x + 1][y] = 0
                else:
                    noisy_img[x][y] = 1
                    noisy_img[x - 1][y] = 1
                    noisy_img[x + 1][y] = 1
    return noisy_img

def apply_laplace(img):
    # laplace is really prone to noise --> smooth first via gaussian
    img = filters.gaussian(img)
    return filters.laplace(img)

def get_binarized_img(img, thresh):
    # pixels darker than the threshold become foreground (1), all others background (0)
    return (img < thresh).astype(img.dtype)

plt.figure(figsize=(26, 10))

img = get_test_img_gray()
# compute the derived images once so that all panels refer to the same data
# (generate_noisy_img is random, so calling it twice yields two different images)
noisy = generate_noisy_img(img)
binary = get_binarized_img(img, 0.2)

plt.subplot(2, 7, 1); plt.title('original image'); plt.imshow(img, cmap='gray')
plt.subplot(2, 7, 2); plt.title('conv gaussian'); plt.imshow(filters.gaussian(img, sigma=2.5), cmap='gray')
# there's also sobel_v and sobel_h for just using the vertical / horizontal version
plt.subplot(2, 7, 3); plt.title('conv sobel'); plt.imshow(filters.sobel(img), cmap='gray')
plt.subplot(2, 7, 4); plt.title('noisy img'); plt.imshow(noisy, cmap='gray')
plt.subplot(2, 7, 5); plt.title('median filtering of noisy img'); plt.imshow(filters.rank.median(noisy), cmap='gray')
plt.subplot(2, 7, 6); plt.title('rank max'); plt.imshow(filters.rank.maximum(img, morphology.disk(10)), cmap='gray')
plt.subplot(2, 7, 7); plt.title('binarized img'); plt.imshow(binary, cmap='gray')
plt.subplot(2, 7, 8); plt.title('erosion of binarized img'); plt.imshow(morphology.binary_erosion(binary), cmap='gray')
plt.subplot(2, 7, 9); plt.title('dilation of binarized img'); plt.imshow(morphology.binary_dilation(binary), cmap='gray')
plt.subplot(2, 7, 10); plt.title('binary opening'); plt.imshow(morphology.binary_opening(binary), cmap='gray')
plt.subplot(2, 7, 11); plt.title('binary closing'); plt.imshow(morphology.binary_closing(binary), cmap='gray')
plt.subplot(2, 7, 12); plt.title('rank mean (local average)'); plt.imshow(filters.rank.mean(img, morphology.disk(10)), cmap='gray')
plt.subplot(2, 7, 13); plt.title('conv laplace'); plt.imshow(apply_laplace(img), cmap='gray')
plt.subplot(2, 7, 14); plt.title('enhanced contrast'); plt.imshow(filters.rank.enhance_contrast(img, morphology.disk(10)), cmap='gray')

plt.show()

**Effects and possible applications**
- Convolution with **Gaussian** filter
    - smoothing (blurring)
    - noise reduction
- Convolution with **Sobel** filter
    - edge detection (horizontal, vertical, diagonal)
    - corresponds to first derivative of pixel intensity values
    - e.g. for object detection
- Rank filter **median**
    - sort gray values covered by kernel
    - select center
    - e.g. removing noise (very good)
- Rank filter **max**
    - sort gray values covered by kernel
    - select max
    - e.g. brightening: bright regions grow and small dark details vanish (corresponds to a gray value dilation)
- **Binarized** image
    - thresholded: in the code above, pixels below the threshold become $1$ (foreground) and all others $0$
- **Erosion** of binarized image
    - removes irregularities
    - cuts off fringe of objects
- **Dilation** of binarized image
    - enlarges objects (adds pixel at fringe)
    - fills up holes
- **Opening**
    - compound operation: erosion followed by dilation
    - e.g. remove irregularities without making object smaller
- **Closing**
    - compound operation: dilation followed by erosion
    - e.g. fill up holes without making object larger
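- **Smoothing** by rank mean
    - replaces each pixel by the mean of the gray values covered by the footprint
    - although it is part of `filters.rank`, this is a linear operation: it is equivalent to a box filter with a disk-shaped footprint
    - the non-linear counterpart is the median filter, which removes outliers instead of averaging them in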
- Convolution with **Laplace** filter
    - edge detection (independent of direction)
    - detects gray value jumps
    - corresponds to second derivative of pixel intensity values
    - very noise-sensitive; should be smoothed before application
- **Contrast enhancement** by rank filter
    - replaces each pixel by the local maximum if the pixel gray value is closer to the local maximum than the local minimum
    - otherwise it is replaced by the local minimum

**c)**
With pen and paper: Generate a random $5 \times 5$ image and filter this image with a $3 \times 3$ Laplace filter. Select a border handling mode of your choice.

In [None]:
from scipy import ndimage
import numpy as np
import matplotlib.pyplot as plt

img = np.array([[0, 100, 200, 100, 0], [0, 0, 200, 0, 0], [0, 0, 200, 0, 0], [0, 0, 100, 0, 0], [0, 0, 100, 0, 0]])
laplace_kernel = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]])
laplace = ndimage.convolve(img, laplace_kernel, mode='nearest')

print("result:\n", laplace)

plt.figure(figsize=(10, 10))
plt.subplot(1, 2, 1); plt.title('original img'); plt.imshow(img, cmap='gray')
plt.subplot(1, 2, 2); plt.title('laplace filtered'); plt.imshow(laplace, cmap='gray')

plt.show()

**d)** Give an example $3\times3$ kernel for the following filters and briefly explain their use:
* Box
* Binomial
* Sobel (one direction of your choice)
* Laplace

**Box Filter**

- smoothing, e.g. for noise reduction
- super simple, just averaging
- problem: hard border (one pixel completely in, next completely out)

$k_{Box} = \frac{1}{9} \cdot
\left[ \begin{array}{rrr}
1 & 1 & 1 \\ 
1 & 1 & 1 \\
1 & 1 & 1 \\ 
\end{array}\right]$

**Binomial Filter**
- discrete approximation of Gaussian kernel
- makes hard border of box filter smooth
- also smoothing (noise reduction), but better

$k_{Bin} = \frac{1}{16} \cdot
\left[ \begin{matrix}
1 & 2 & 1 \\ 
2 & 4 & 2 \\ 
1 & 2 & 1 \\ 
\end{matrix} \right]$

**Sobel Filter (vertical version)**

- edge detection (for vertical edges)
- corresponds to 1st derivative of pixel intensities

$k_{Sobel} = \frac{1}{4} \cdot
\left[ \begin{matrix}
1 & 0 & -1 \\ 
2 & 0 & -2 \\ 
1 & 0 & -1 \\ 
\end{matrix} \right]$

**Laplace Filter**

- detects jumps of gray values (edges)
- edges in all direction
- corresponds to 2nd derivative of pixel intensities

$k_{Laplace} = 
\left[ \begin{matrix}
0 & 1 & 0 \\ 
1 & -4 & 1 \\ 
0 & 1 & 0 \\ 
\end{matrix} \right]$


**e)** What are separable filter kernels?

The computational effort for an $m \times n$ kernel is $O(mn)$ per pixel.  
Some kernels are separable, i.e. they are the product of a column vector and a row vector, so the 2D convolution can be replaced by two 1D convolutions with an effort of $O(m+n)$ per pixel.  

Examples for separable kernels are Gaussian, Sobel, and Box.
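
The following sketch checks the separability of the $3 \times 3$ binomial kernel numerically. It assumes `scipy.ndimage` for the convolutions and uses a random test image; the variable names are chosen for this example.

```python
import numpy as np
from scipy import ndimage

# 1D binomial kernel; the outer product gives the 3x3 binomial kernel (1/16 * [[1,2,1],[2,4,2],[1,2,1]])
row = np.array([1, 2, 1]) / 4
kernel_2d = np.outer(row, row)

rng = np.random.default_rng(0)
img = rng.random((32, 32))

# one 2D convolution: 9 multiplications per pixel
full = ndimage.convolve(img, kernel_2d, mode='nearest')

# two 1D convolutions: 3 + 3 multiplications per pixel
rows_filtered = ndimage.convolve1d(img, row, axis=1, mode='nearest')
separable = ndimage.convolve1d(rows_filtered, row, axis=0, mode='nearest')

print(np.allclose(full, separable))  # True
```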

## Image Enhancement

**a)**  What is the histogram of an image? What is a gradient image and how is it computed? What is a histogram of gradients? Name some applications.

A **histogram** of a grayscale image counts the occurrences for each gray value in the image.  
Application examples:
- find threshold for binarization of a bimodal distribution
- histogram equalization to enhance contrast

An image **gradient** is a directional change in the intensity or color in an image:  
- gradient of the image is one of the fundamental building blocks in image processing
- gradients ($x$ and $y$ derivatives) of an image are useful because the magnitude of gradients is large around edges and corners  
(regions of abrupt intensity changes) and edges and corners provide a lot more information about object shape than flat regions
- e.g. canny edge detector uses image gradient for edge detection
- mathematically, the gradient of a two-variable function (here the image intensity function) at each image point is a 2D  
 vector with the components given by the derivatives in the horizontal and vertical directions
- at each image point, the gradient vector points in the direction of largest possible intensity increase
- the length of the gradient vector corresponds to the rate of change in that direction
- most common **way to approximate the image gradient is to convolve an image with a kernel**, such as the sobel operator

**Gradient image**

- image after filtering with a gradient filter, e.g. Sobel
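- e.g. filtering the image with the horizontal and vertical derivative filters $[-1, 0, 1]$ and $[-1, 0, 1]^T$ (simple differences; the Sobel kernels additionally smooth perpendicular to the derivative direction)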

**Histogram of oriented gradients (HOG)**

- to calculate a HOG descriptor, we need to first calculate the horizontal and vertical gradients (sobel)
- afterwards, we can get the magnitude and direction of the gradient at every pixel:
    - **gradient magnitude**: $m'(x, y) = \sqrt{(\Delta_x g)^2 + (\Delta_y g)^2}$
    - **direction:** Use the inverse tangent: $\beta(x, y) = \arctan\left(\frac{\Delta_y g}{\Delta_x g}\right)$
- next step is to create a histogram of gradients:
    - a bin is selected based on the direction, and the vote (the value that goes into the bin) is selected based on the magnitude
- such a HOG can be further processed and become a feature vector that can for example be used in classification
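
The steps above can be sketched as follows. This is a simplified illustration, not a full HOG descriptor (no cells, blocks or normalization): it uses central differences from `np.gradient` instead of Sobel kernels, `np.arctan2` instead of a plain arctangent to keep the quadrant and avoid division by zero, and the function name `gradient_histogram` is made up for this example.

```python
import numpy as np

def gradient_histogram(img, n_bins=8):
    """Return gradient magnitude, direction and a magnitude-weighted direction histogram."""
    img = img.astype(float)
    # np.gradient returns the derivative along the rows (y) first, then along the columns (x)
    dy, dx = np.gradient(img)
    magnitude = np.sqrt(dx ** 2 + dy ** 2)
    direction = np.arctan2(dy, dx)  # angle in radians in [-pi, pi]
    # the direction selects the bin, the magnitude is the vote
    hist, _ = np.histogram(direction, bins=n_bins, range=(-np.pi, np.pi), weights=magnitude)
    return magnitude, direction, hist

# vertical step edge: intensity increases from left to right
img = np.zeros((5, 6))
img[:, 3:] = 1

magnitude, direction, hist = gradient_histogram(img)
print(magnitude[0])   # non-zero only at the two columns next to the step
print(hist)           # all votes end up in a single direction bin
```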

**b)** Give formulae for information content and average information content. What do information content and entropy measure? On the slides $\log_n$ is used for information content and $\log_2$ is used for entropy. Why?

**Information content** $I(e) = -\log_n P(e)$ for $n$-valued symbols where $P(e)$ is the probability of event $e$
- measures 'information' based on the idea that less probable events are more informative
- quantifying the level of "surprise" of a particular outcome
- aka surprisal or Shannon information
- the information content can be expressed in various units of information, of which the most common is the bit
- different choices of base correspond to different units of information

**Avg information content aka entropy** $E = - \sum_{i = 1}^{n} P(e_i) \log_2 P(e_i)$
- measures the average information content and thus
    - is very high if every event is equally likely
    - very low if there is only one outcome
- quantifying how surprising the random variable is "on average"
- measure on information in terms of uncertainty (chaos)
- e.g. if you have a coin that has only one result (heads on both sides), then you know beforehand what the result  
  will be and there is no information gain by actually flipping it (entropy $0$)
- the entropy of a regular coin toss is $1$ bit - you can find out whether it's heads or tails with just one question: "Is it heads?"
- amount of information in $5$ coin flips is $5$ bits
- entropy: the minimum number of yes/no questions we have to ask on average
- we have a '$-$' in front of the sum, because we have the $\log$ of a value between $0$ and $1$ which gives us a negative value and we want our entropy to be positive
    - large uncertainty (chaos) -> high positive number

The information content formula with base $n$ is simply the more general form: the base determines the unit of information. Since we work with bits, we use base $2$ for the information content as well.
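
As a worked example, the two formulas can be evaluated for the coin examples from above. The helper names `information_content` and `entropy` are chosen for this sketch; outcomes with probability $0$ are skipped, following the usual convention $0 \cdot \log 0 = 0$.

```python
import numpy as np

def information_content(p, base=2):
    """I(e) = -log_base P(e), written as log(1 / p) / log(base)."""
    return np.log(1.0 / p) / np.log(base)

def entropy(probabilities, base=2):
    """Average information content of a probability distribution."""
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]  # outcomes that never occur do not contribute
    return float(np.sum(p * information_content(p, base)))

print(entropy([0.5, 0.5]))    # fair coin: 1 bit
print(entropy([1.0, 0.0]))    # coin with heads on both sides: 0 bits
print(entropy([0.9, 0.1]))    # biased coin: less than 1 bit
print(entropy([1 / 8] * 8))   # 8 equally likely gray values: about 3 bits
```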

**Bonus**

Claude Shannon's definition of information content was chosen to meet several axioms:
- an event with probability 100% is perfectly unsurprising and yields no information
- the less probable an event is, the more surprising it is and the more information it yields
- if two independent events are measured separately, the total amount of information is the sum of the self-informations of the individual events

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from skimage import measure

img = np.array([
    [0, 1, 3, 7, 1, 0],
    [5, 3, 1, 4, 6, 0],
    [5, 7, 2, 0, 1, 4],
    [4, 2, 2, 5, 3, 7],
    [3, 4, 5, 0, 2, 1],
    [3, 7, 4, 1, 0, 2]
])

heights, bins = np.histogram(img, bins=np.arange(9))  # one bin per gray value 0..7

# max entropy for 3 bits would be 3
print("ENTROPY:", measure.shannon_entropy(img))

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1); plt.title('img'); plt.imshow(img)
plt.subplot(1, 2, 2); plt.title('histogram'); plt.bar(bins[:-1], heights)
plt.show()


**c)** Discuss histogram equalization. Name some problems and explain how they can be addressed.

**Histogram equalization** is basically maximizing the information content $I(e) = -\log_n P(e)$.  
It's based on the idea that an optimal image has maximum information content. Therefore, we maximize the entropy $E$ of the image's histogram.  
The entropy is just the average information content and is defined as $E = - \sum_{i=1}^{N} P(e_i) \cdot \log_2 P(e_i)$.

The **entropy of the image is maximized** if $P(g) = const$ for all $g = 0, \dots, 255$.

To maximize the entropy, we need to apply a transfer function. For continuous histograms with unit norm, it's just $g'(g) = \int_{0}^{g} H(w) \, dw$. 

**High level idea**
- HE is used to improve contrast in images
- it accomplishes this by effectively spreading out the most frequent intensity values
    - i.e. stretching out the intensity range of the image

Problem: $g'(g)$ transforms $g$ to real values which need to be rounded (loss of information).

Another problem is that the transformation of gray values is not related to image content. Thus, unimportant aspects may  
be enhanced while important aspects vanish.
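
A discrete version of the transfer function can be sketched with the cumulative histogram. This is a minimal illustration for 8-bit images; the function name `equalize` and the synthetic low-contrast test image are assumptions of this example.

```python
import numpy as np

def equalize(img, levels=256):
    """Map gray values through the normalized cumulative histogram."""
    hist = np.bincount(img.ravel(), minlength=levels)
    cdf = np.cumsum(hist) / img.size                      # discrete counterpart of the integral over H
    lut = np.round(cdf * (levels - 1)).astype(np.uint8)   # rounding to integer gray values
    return lut[img]

# low-contrast image that only uses the gray values 100..103
rng = np.random.default_rng(0)
img = rng.integers(100, 104, size=(64, 64), dtype=np.uint8)
out = equalize(img)

print("input range: ", img.min(), img.max())
print("output range:", out.min(), out.max())
print("distinct gray values before / after:", len(np.unique(img)), len(np.unique(out)))
```

The output uses a much larger part of the gray value range, but the number of distinct gray values cannot increase: the discrete mapping only moves (and possibly merges) histogram bins.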

How to address the **problems**? -> Several refinements:
- **Adaptive histogram equalization (AHE)**
    - histogram is computed for each pixel individually from its local neighborhood
    - individual transfer function for each pixel
- **Contrast limited AHE (CLAHE)**
    - like AHE, but contrast is enhanced only up to predefined limit
    - prevents excessive local contrast enhancements (e.g. for background)

## Morphological operators

**a)** What is a structuring element? How is it applied in erosion and dilation?

The structuring element is a binary matrix that specifies a morphological operation.  
It is moved across the image line by line like a filter kernel.  
The structuring element is compared to the underlying image patch and the resulting similarity value is assigned  
to a result matrix at the location corresponding to the anchor point of the structuring element (usually the center).

**Erosion**
- assign $1$ to result pixel if all $1$-elements of the structuring element cover $1$-pixels of the image, else assign $0$
    - conjunction of implications where a $1$ in the structuring element implies a $1$ in the underlying image patch
    - cuts off fringe of objects

**Dilation**
- assign $1$ to result pixel if at least one $1$-element of the structuring element covers a $1$-pixel in the image, else assign $0$
    - disjunction of conjunctions of pair pixels between structuring element and underlying image patch
    - adds pixels at the fringe of objects and fills holes

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from skimage import morphology

img = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0]
])

struct_one = np.array([[1, 1]])
struct_two = np.array([
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0]
])

plt.figure(figsize=(20, 6))
plt.subplot(1, 5, 1); plt.title('img'); plt.imshow(img)
plt.subplot(1, 5, 2); plt.title('erosion S1'); plt.imshow(morphology.binary_erosion(img, struct_one))
plt.subplot(1, 5, 3); plt.title('dilation S1'); plt.imshow(morphology.binary_dilation(img, struct_one))
plt.subplot(1, 5, 4); plt.title('erosion S2'); plt.imshow(morphology.binary_erosion(img, struct_two))
plt.subplot(1, 5, 5); plt.title('dilation S2'); plt.imshow(morphology.binary_dilation(img, struct_two))
plt.show()

In [None]:
from scipy import ndimage

# hit-or-miss

img = np.array([
    [0, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0]
])

struct_elem = np.array([
    [1, 0, 1]
])

plt.figure(figsize=(14, 6))
plt.subplot(1, 3, 1); plt.title('img'); plt.imshow(img)
plt.subplot(1, 3, 2); plt.title('struct elem'); plt.imshow(struct_elem)
plt.subplot(1, 3, 3); plt.title('hit-or-miss'); plt.imshow(ndimage.binary_hit_or_miss(img, struct_elem))
plt.show()

**b)** Give pseudocode for the distance transform using morphological operators.

given: image $g$, structuring element $S$

$g^0_{boundary} = g \backslash (g \ominus S)$ yields the boundary (set of all pixels with distance $0$ to the boundary)

$g \ominus S$ is the object without boundary

$g^n_{boundary} = (g(\ominus S)^n) \backslash (g(\ominus S)^{(n+1)})$ where $(\ominus S)^n$ is short for the $n$-time erosion using $S$

The distance transform $D$ is obtained from the union of the boundaries of all distances:

$D = \bigcup_{n=1}^{\infty} n \cdot g^n_{boundary}$

$\rightarrow$ not the Euclidean distance, but the Manhattan or chessboard distance (depending on the structuring element)

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from scipy import ndimage as morphology  # the scipy.ndimage.morphology namespace is deprecated

img = np.array([
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0]
])

def my_boundary(img):
    erosion = morphology.binary_erosion(img)
    return np.logical_xor(erosion, img)

def my_distance_transform(img):    
    dt = np.zeros(img.shape, np.int32)
    lvl = 1
    # erode until nothing is left
    while np.any(img):
        boundary = my_boundary(img)
        # pixels with manhattan distance lvl to the boundary
        dt[boundary] = lvl
        # object without boundary
        img = morphology.binary_erosion(img)
        lvl += 1
    return dt

dt = morphology.distance_transform_cdt(img, metric='manhattan')
dt_own = my_distance_transform(img)

plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1); plt.title('img'); plt.imshow(img)
plt.subplot(1, 3, 2); plt.title('distance transform'); plt.imshow(dt)
plt.subplot(1, 3, 3); plt.title('own implementation'); plt.imshow(dt_own)
plt.show()

## Color

**a)** Which of the following use additive color mixing and which use subtractive color mixing:
* Printer -> **subtractive**
* Cathode ray tube (old screens) -> **additive**
* LCD screen -> **subtractive**
* Van Gogh -> **subtractive**
* Analog cinema projector -> **subtractive**
* Digital projector (DLP) -> **additive**

**b)** Name two color spaces and list their advantages.

**RGB (Red Green Blue)**
- primary colors based on CIE experiments
- well-suited for computer screens, because the pixels are self-luminous -> additive color mixing

**CMYK (Cyan (absorbs R) Magenta (absorbs G) Yellow (absorbs B) Key (black - absorbs all))**
- color printing is based on subtractive mixing of CMYK
- for not self-luminous materials
- CMY is complementary to RGB
- in principle, K (black) is not necessary, because it can be mixed from CMY
- in practice, CMY can only mix a dark blue, so black is used in printing as an additional color

**HSV**
- based on perception and verbal description of colors
- colors can be mixed and described using HSV more easily than using primary colors
- **Hue**: angle on color circle (0° red, 120° green, 240° blue)
- **Saturation**: 0% no color, 100% pure color
- **Value**: Percentage of maximum brightness
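
The hue angles listed above can be checked with Python's standard library module `colorsys`, which works on values in $[0, 1]$. The helper `describe` is made up for this example and only converts the result to degrees and percent.

```python
import colorsys

def describe(rgb):
    """Return (hue in degrees, saturation in %, value in %) for an RGB triple in [0, 1]."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    return round(h * 360), round(s * 100), round(v * 100)

colors = [
    ('red', (1.0, 0.0, 0.0)),
    ('green', (0.0, 1.0, 0.0)),
    ('blue', (0.0, 0.0, 1.0)),
    ('gray', (0.5, 0.5, 0.5)),
]
for name, rgb in colors:
    print(name, describe(rgb))
# red (0, 100, 100)
# green (120, 100, 100)
# blue (240, 100, 100)
# gray (0, 0, 50)
```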

**Lab**
- perceived color distance corresponds to Euclidean distance in Lab space
- a: green-red, b: blue-yellow, L: luminance

**YCbCr**
- separate luminance and color
- luminance -> brightness of grayscale img
- $Cb$ -> blueness of the img, $Cr$ -> redness of the img
- the human eye is a lot more sensitive to luminance (changes in intensity) than to color

## Segmentation

**a)** Explain *region based* and *edge based* *segmentation*. What are the differences between *split and merge* and *region merging*? What is the idea of *color segmentation* and does it give any advantage?

**In short:**
- **region based**
    - find regions based on homogeneity measure
- **edge based**
    - find borders of regions and hope that they enclose meaningful segments
- **split-and-merge**
    - implicit multiscale strategy, better runtime
- **region merging**
    - explicit multiscale strategy, high runtime
- **color**
    - finding color clusters in color space
    - can exploit color information
    - but usefulness depends on domain

**Region based segmentation**
- each segment must be homogeneous
- what we want: high level concepts such as water, building, sky
- what we have: low level features such as color and texture
- we hope for correlation of low level features with high level concepts
- multiple scales need to be considered to avoid chaining (locally homogeneous, but globally inhomogeneous)
- ensures homogeneity within regions - we hope for sharp region boundaries
- what we have:
    - multi-scale strategies
    - region merging, split-and-merge
    - texture segmentation

**Edge based segmentation**
- segments are found by search for boundaries between regions of different features
- we hope for homogeneity within regions
- advantage: more robust against shading
- what we do:
    - compute gradient
    - edge point detection, e.g. by thresholding gradient magnitude
    - region labeling based on edge points
    - problem: edge points are not yet connected edges
- four methods:
    - edge linking
    - canny operator
    - detection of zero crossings
    - watershed transform

**Region Merging**
- image is mapped to RAG (region adjacency graph)
    - segment -> node
    - neighborhood of two segments -> edge
- init: each pixel is segment (node)
- repeat
    - search for edges that satisfy homogeneity condition
    - merge nodes where edge has best homogeneity
- until
    - no edge fulfills homogeneity condition

**Split-and-Merge**
- init: entire image is single segment
- splitting: divide segment iteratively into four sub-segments as long as homogeneity condition not fulfilled
- merging: merge neighboring segments if they fulfill the homogeneity condition even after merging
- result: complete and unique decomposition

**Split-and-Merge vs. Region Merging**
- split-and-merge has significantly less computational effort, because splitting usually stops before pixel level is reached
- split-and-merge has also the advantage that large segments are better suited for estimation of distribution than just a few pixels

**Color segmentation**
- idea: color contains much more information than gray values
- aim: find segments of constant color
- homogeneity condition: distance / similarity measure in a color space
- numerous algos (k-Means clustering, mean shift segmentation)
- advantage: color classes of image can be found using clustering

**b)**  Provide pseudocode for the $k$-means clustering algorithm in color space.

- $k$ is the number of different colors to find in the data set
- init $k$ vectors as cluster centers with random RGB values (in RGB space)
- repeat
    - assign each pixel to its closest cluster center
    - recompute cluster center for each cluster (mean of the assigned pixels)
- until convergence (only negligible changes)
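
A compact NumPy sketch of this procedure is given below. It deviates from the pseudocode in one labeled point: the initial centers are drawn from the pixels themselves instead of being random RGB values, which avoids empty clusters at the start. The function name `kmeans_colors` and its parameters are assumptions of this example.

```python
import numpy as np

def kmeans_colors(pixels, k, n_iter=50, seed=0):
    """Cluster an (N, 3) array of colors into k clusters; returns centers and labels."""
    rng = np.random.default_rng(seed)
    # init (assumption of this sketch): use k distinct pixels as initial centers
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign each pixel to its closest cluster center (Euclidean distance in RGB space)
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # recompute each center as the mean of its pixels; keep the old center if a cluster is empty
        new_centers = np.array([
            pixels[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # stop when the centers only change negligibly
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# toy data: 100 reddish and 100 bluish pixels
rng = np.random.default_rng(1)
reddish = rng.normal((200, 20, 20), 3, size=(100, 3))
bluish = rng.normal((20, 20, 200), 3, size=(100, 3))
pixels = np.vstack([reddish, bluish])

centers, labels = kmeans_colors(pixels, k=2)
print(np.round(centers))
```

For a real RGB image `img` of shape `(height, width, 3)`, the pixel list would be obtained with `img.reshape(-1, 3)` and the labels reshaped back to `(height, width)`.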

**c)** Give two examples for interactive segmentation and discuss them.

**Interactive region growing**
- **flood fill** (fill a homogeneous region with a label) for a gray value image using a single seed point
    - problems:
        - leakage of region
        - region may be very small
        - sensitive to noise
        - shading
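
Flood fill from a single seed point can be sketched with a queue. This is a minimal version under stated assumptions: 4-neighborhood, and a pixel counts as homogeneous if its gray value differs from the seed value by at most `tol`. The function name `region_grow` is made up for this example.

```python
from collections import deque
import numpy as np

def region_grow(img, seed, tol):
    """Return a boolean mask of all pixels 4-connected to `seed` with a gray value similar to the seed."""
    h, w = img.shape
    seed_value = float(img[seed])
    region = np.zeros((h, w), dtype=bool)
    region[seed] = True
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            inside = 0 <= ny < h and 0 <= nx < w
            if inside and not region[ny, nx] and abs(float(img[ny, nx]) - seed_value) <= tol:
                region[ny, nx] = True
                queue.append((ny, nx))
    return region

# bright 3x3 square on a dark background, seed point in the middle of the square
img = np.zeros((6, 6))
img[1:4, 1:4] = 100
region = region_grow(img, seed=(2, 2), tol=10)
print(region.astype(int))
```

The listed problems follow directly from this scheme: a single bridging pixel within the tolerance lets the region leak, and a noisy seed value or a tight tolerance makes the region very small.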

**Interactive edge search** (edge following)
- user sets starting point
- search perpendicular to current gradient direction
- edge point found if
    - sufficiently close
    - exhibits sufficient gradient
    - direction similar to current direction
- problems:
    - noise
    - varying gradient strength
    - sudden changes of direction of the edge

## Hough Transform

**a)** What is the idea of *Hough transform*? What is an *accumulator space*? How to determine its dimensionality? Can you interpret the linear Hough space? How many dimensions has the accumulator space for circular Hough transform?

The original **Hough transform** is a method to detect points on straight lines, but it can be generalized to arbitrary geometrical shapes.  

There is a kind of voting procedure that is carried out in a parameter space, from which shape candidates are obtained as local maxima.  
This space is called **accumulator space** and it is constructed by spanning the space of possible parameter combinations.  
Therefore, the dimensionality is determined by the number of parameters.

**Example for straight lines**

Lines can be represented as: $r = x \cos \theta + y \sin \theta$  
A point $(x, y)$ in Euclidean space corresponds to a curve in the accumulator space: the set of all $(r, \theta)$ combinations, i.e. all lines, that pass through the point.  
Conversely, an $(r, \theta)$ combination in the accumulator space represents all points $(x, y)$ in Euclidean space that are part of the line.  
Thus, the curves of all points that lie on one line in Euclidean space intersect in one point ($r, \theta$) in the accumulator space.

The general idea is to transform every data point into the accumulator space (discretized) and count for each parameter combination  
how often it occurs (accumulator array). The maximum corresponds to the 'best' correspondence of the shape in the Euclidean space.  
In the case of the search for straight lines, one would look for the $(r, \theta)$ combination with the most hits in the accumulator space.
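
The voting procedure can be sketched as follows. This is a minimal illustration, assuming a discretization of $\theta$ in steps of one degree and of $r$ in steps of one pixel; the function name `hough_lines` and the synthetic test line are hypothetical.

```python
import numpy as np


def hough_lines(edge_points, img_shape, n_theta=180):
    """Accumulate votes in (r, theta) space for a list of (x, y) edge points."""
    h, w = img_shape
    r_max = int(np.ceil(np.hypot(h, w)))          # largest possible |r|
    thetas = np.deg2rad(np.arange(n_theta))       # 0, 1, ..., 179 degrees
    acc = np.zeros((2 * r_max + 1, n_theta), dtype=int)
    for x, y in edge_points:
        # each point votes for all lines r = x cos(theta) + y sin(theta)
        r = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[r + r_max, np.arange(n_theta)] += 1   # offset so negative r fits
    return acc, thetas, r_max


# synthetic edge points on the horizontal line y = 5
points = [(x, 5) for x in range(100)]
acc, thetas, r_max = hough_lines(points, (100, 100))

r_idx, t_idx = np.unravel_index(np.argmax(acc), acc.shape)
best_r = r_idx - r_max
best_theta_deg = np.rad2deg(thetas[t_idx])
print("r =", best_r, "theta =", best_theta_deg, "votes =", acc.max())
```

The maximum of the accumulator array lies at the $(r, \theta)$ cell that all points of the line voted for.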

The approach is **model-based** in the sense that you need to have a mathematical model of the shape you are looking for, e.g. a line.

The **linear Hough transform** is the procedure described in the above example of using the HT to detect straight lines.  
So, as described, the accumulator space is just $2D$ for the two parameters $r$ and $\theta$.  
The element $(r, \theta)$ with the maximum number of hits in the accumulator array is the best candidate for a line in the original space.  
The curves of all points on one line in the original space intersect in one point $(r,\theta)$ in the Hough space.

**Meaning of coordinates**:
- $r$: perpendicular distance between origin and line
- $\theta$: angle between $x$-axis and the perpendicular line ($r$)

Circular HT does for circles what linear HT does for lines. Thus, our model is a circle here.  
Points on a circle obey $(x - x_c)^2 + (y - y_c)^2 = r^2$  
Therefore, the accumulator space has three dimensions: $x_c, y_c, r$ where $x_c$ and $y_c$ are in the image plane and $r$ the radius of the circle.  

**Practical problem:** Since we have a third dimension, we have a way higher computational effort compared to linear HT.  
However, we can use a simplification and search only for a given radius $r$.  

**Procedure:**
- increment counter in accumulator array along a circle of radius $r$ around each edge pixel
- accumulation points are centers of circles
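
A minimal sketch of this simplified procedure for a known radius. The function name `hough_circle_fixed_radius`, the angular sampling and the idealized (sub-pixel) edge points of the synthetic circle are assumptions for illustration.

```python
import numpy as np


def hough_circle_fixed_radius(edge_points, img_shape, radius, n_angles=360):
    """Vote for circle centers (x_c, y_c) given edge points and a fixed radius."""
    h, w = img_shape
    acc = np.zeros((h, w), dtype=int)
    angles = np.linspace(0, 2 * np.pi, n_angles, endpoint=False)
    for x, y in edge_points:
        # candidate centers lie on a circle of the same radius around the edge point
        xc = np.round(x - radius * np.cos(angles)).astype(int)
        yc = np.round(y - radius * np.sin(angles)).astype(int)
        # de-duplicate so that each edge point votes at most once per cell
        cells = {(int(cy), int(cx)) for cx, cy in zip(xc, yc)
                 if 0 <= cx < w and 0 <= cy < h}
        for cy, cx in cells:
            acc[cy, cx] += 1
    return acc


# synthetic circle with center (x_c, y_c) = (20, 15) and radius 8
center_x, center_y, radius = 20, 15, 8
phis = np.linspace(0, 2 * np.pi, 60, endpoint=False)
edge_points = list(zip(center_x + radius * np.cos(phis),
                       center_y + radius * np.sin(phis)))

acc = hough_circle_fixed_radius(edge_points, (30, 40), radius)
cy, cx = np.unravel_index(np.argmax(acc), acc.shape)
print("estimated center:", (cx, cy), "votes:", acc[cy, cx])
```

For an unknown radius, the same voting would have to be repeated for every candidate $r$, which yields the three-dimensional accumulator space.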

# Recap II

This sheet is a recap of the second half of the term. Neither do you have to present it to your tutors nor will it count to the number of passed sheets required for the exam. I.e. you do not have to complete this sheet but we highly recommend that you solve the assignments as part of your preparations for the exam. We will discuss the results in the last practice session on February 11. Also, if you have questions on any of the topics, please send them to us and we will discuss them in that session.

## Fourier Transform

**a)** What is the idea of *Fourier Transform*, and why is it useful for image processing? Can you provide a formula? Why is it called an orthogonal transformation? Which aspects of an image can be recognized in its Fourier transform?

**In short:**
- decomposes image into combination of sine and cosine waves
- usefulness
    - convolution is just multiplication in Fourier space
    - understand and design filters in Fourier space
- $
F(u, v) = \sum_{x=0}^{N-1}\sum_{y=0}^{M-1} f(x,y) \cdot e^{-i 2 \pi (\frac{u}{N}x + \frac{v}{M}y)}\\
u = \frac{k_x N}{2\pi}
$
- transformation to an orthogonal basis
- contribution of different frequencies

+++++++++++++++++++++++++++++++

The **Fourier transform** is a mathematical tool that transforms (global operation) the given information (image) into another domain.  
In our case, using it for images, we transform the information from the spatial domain into the frequency domain.  
Sometimes the frequency space enables more efficient computations, e.g. for convolutions.

So, the Fourier transform transforms the signal into the frequency space, where it is a sum (or integral)  
of sine waves of different frequencies, each of which represents a frequency component.  
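A point $(u, v)$ in that space corresponds to one wave of a particular frequency and direction; its complex value gives the weight (amplitude and phase) with which this wave contributes to the image.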

- FT: Representation of an image using a basis of sine and cosine patterns
- FT is a global operation
- FT transforms image $g$ from the spatial domain to the frequency domain
- $g$ is projected onto an orthonormal function system of $2$D sine and cosine functions
- FT preserves information and can be inverted as a consequence
- applications:
    - filtering based on Fourier representation of the image
    - filter design based on Fourier representation of the kernel
    - pattern recognition: Analyze Fourier transform

- 2D FT decomposes image into waves of different frequency and direction
- the FT holds the following information for each wave
    - **amplitude: brightness** of the wave
    - **phase: most of the structural information**

- translation in space -> phase shift in Fourier space
- rotation in space -> rotation in Fourier space
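
The translation property can be checked numerically. The following sketch assumes a cyclic shift (via `np.roll`) of a random test image, so that the discrete shift theorem holds exactly; the shift values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.random((16, 16))

# cyclic translation by 3 rows and 5 columns
dy, dx = 3, 5
shifted = np.roll(img, shift=(dy, dx), axis=(0, 1))

F = np.fft.fft2(img)
F_shifted = np.fft.fft2(shifted)

# the amplitude spectrum is unchanged by the translation
same_amplitude = np.allclose(np.abs(F), np.abs(F_shifted))

# only the phase changes: multiplication with a linear phase ramp
u = np.fft.fftfreq(img.shape[0])[:, None]   # frequencies along axis 0
v = np.fft.fftfreq(img.shape[1])[None, :]   # frequencies along axis 1
phase_ramp = np.exp(-2j * np.pi * (u * dy + v * dx))
phase_shift_only = np.allclose(F_shifted, F * phase_ramp)

print(same_amplitude, phase_shift_only)
```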

- FFT (fast Fourier transform)
    - together with the separability of the FT we get an efficient method for a 2D FT

**Convolution Theorem**
- convolution in space is a multiplication in Fourier space
    - fast computation of convolution in Fourier space
- application of conv. theo.:
    - transform image
    - transform kernels
    - multiplication in Fourier space
    - transform result back
- most effective for large kernels

**Border problem**
- periodic continuation of the image
    - automatically done when multiplying image and kernel in frequency space
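
A small numerical check of the convolution theorem and of the periodic border handling, assuming a random test image and a 3x3 binomial kernel that is zero-padded to the image size. The helper `circular_convolve` is a hypothetical reference implementation that wraps around at the image borders.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((8, 8))
kernel = np.array([[1, 2, 1],
                   [2, 4, 2],
                   [1, 2, 1]], dtype=float) / 16

# transform the image and the (zero-padded) kernel, multiply, transform back
kernel_padded = np.zeros_like(img)
kernel_padded[:3, :3] = kernel
result_fft = np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(kernel_padded)))


def circular_convolve(img, kernel):
    """Direct convolution in the spatial domain with periodic continuation."""
    out = np.zeros_like(img)
    for i in range(kernel.shape[0]):
        for j in range(kernel.shape[1]):
            # np.roll wraps around, i.e. the image is continued periodically
            out += kernel[i, j] * np.roll(img, shift=(i, j), axis=(0, 1))
    return out


result_spatial = circular_convolve(img, kernel)
print(np.allclose(result_fft, result_spatial))
```

Since the kernel is placed in the upper left corner of the padded array and not centered, both results are shifted by one pixel compared to a centered filter; the comparison is not affected by this.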

## Template Matching

**a)** Explain the principle of template matching.

The idea is to take a prototypical small image of what you are looking for (template) in the image and move that template  
across the image just as in convolution to compare it to the underlying image patch with the goal of finding the part of the image that matches the template.  

It's a **model-based** approach - the template is a model of what we are looking for in the image.

**Advantages**:
- robust against noise
- efficient implementation as convolution

**Disadvantages**:
- little robustness against variation of viewpoint / illumination
- gray value scaling can cause problems

It's probably good to use it in situations where not much variation of viewpoint and illumination is to be  
expected such as a part of quality control in manufacturing.

**b)** When and why does the correlation coefficient perform better than the mean absolute difference?

Correlation coefficient is invariant to a linear transform of gray values, whereas mean absolute difference takes absolute gray values into account.  
Therefore the first performs better when the exposure of template and image are not equal.

**MAD**
- measure for similarity between template $T(i, j)$ and image $g(x, y)$
- idea: mean difference of gray values: $MAD(x, y) = \frac{1}{mn} \cdot \sum_{ij} | g(x+i, y+j) - T(i, j)|$
- **advantages:** robust to noise, easy to compute
- **disadvantages:** gray value scaling can cause problems, sensitive to rotation

**Correlation Coefficient**
- computes a correlation coefficient to measure similarity between the image and the template
- $C_{g, T} = \frac{\sigma_{g, T}}{\sigma_g \cdot \sigma_T}$ where $\sigma_{g, T}(x, y)$ is the covariance between $g$ and $T(i, j)$
  and $\sigma_g, \sigma_T$ are the standard deviations of $g$ and $T$
- the possible values range from $-1$ to $1$, where $+1$ and $-1$ indicate the strongest possible positive / negative correlation and $0$ means that they don't correlate
- **advantages:** robust to gray value scaling and noise
- **disadvantages:** not as efficient to compute as MAD; like MAD, it is sensitive to rotation, since template and image patch are compared pixel by pixel
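
The difference between the two measures can be illustrated for a single image patch. This is a minimal sketch with hypothetical helper names `mad` and `corr_coeff` and a made-up template; the patch is a linearly transformed (brighter, higher contrast) copy of the template.

```python
import numpy as np


def mad(patch, template):
    """Mean absolute difference of gray values."""
    return np.mean(np.abs(patch.astype(float) - template.astype(float)))


def corr_coeff(patch, template):
    """Correlation coefficient: covariance divided by both standard deviations."""
    p = patch.astype(float) - patch.mean()
    t = template.astype(float) - template.mean()
    denom = np.sqrt((p ** 2).sum() * (t ** 2).sum())
    return (p * t).sum() / denom if denom > 0 else 0.0


template = np.array([[10, 20, 10],
                     [20, 40, 20],
                     [10, 20, 10]])

# same pattern under a linear gray value transform (e.g. different exposure)
patch = 2 * template + 30

print("MAD: ", mad(patch, template))          # large, although the pattern matches
print("corr:", corr_coeff(patch, template))   # close to 1
```

In a full template matching both functions would be evaluated for every position $(x, y)$ of the template in the image.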

## Pattern Recognition

**a)** What are the principal components of a 2-dimensional data distribution? What are the principal components of an image?

The two principal components of a 2-dimensional data distribution are orthogonal and ordered.  
The first one points in the direction of largest variance.

With PCA we can find a new basis for our dataset. The basis vectors can be ordered by maximum variance / minimum reconstruction error.  
In the best case, each basis vector corresponds to an interpretable feature.

Principal components are the eigenvectors of the covariance matrix of a dataset. The eigenvalues give the ordering of the components.

Finally, the idea is to not just have an ordering of PCs, but to use them for dimensionality reduction, e.g. take the first $n$ PCs to capture $> 90 \%$ of the variance.

If PCA is applied to images, each image is represented as a 1-dimensional vector with $width \times height$ entries and the covariance matrix is computed  
using these vectors. The eigenvectors are vectors of the same dimension and can be seen as images. In case of face images these eigenvectors are called eigenfaces.
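
A minimal sketch of PCA for a 2-dimensional data distribution, assuming synthetic data that is stretched along the direction $(1, 1)$. For images the procedure would be the same, with each flattened image as one row of `data`.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic 2D data: large variance along (1, 1), small variance along (1, -1)
t = rng.normal(size=500)
noise = rng.normal(scale=0.1, size=500)
data = np.column_stack([t + noise, t - noise])

# covariance matrix of the centered data
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigenvectors are the principal components, eigenvalues give their ordering
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()
print("first principal component:", eigvecs[:, 0])
print("explained variance ratios:", explained)

# dimensionality reduction: project onto the first principal component only
projected = centered @ eigvecs[:, :1]
```

The sign of an eigenvector is arbitrary, so the first component may point along $(1, 1)$ or $(-1, -1)$.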

## Local Features

**a)** Describe the *Moravec* and the *Harris corner detectors*. What are the differences?

**Moravec IP operator**

The Moravec IP operator **measures the saliency** or "uniqueness" of a window around a pixel.  

- idea: a window is salient if it's unique in its surroundings
- simplest check for uniqueness: Shift window by one pixel and compare to itself
- operator compares shifted window in four directions
- detector response at $(x, y)$ is the minimum over the four directions
- $E(x, y)$ -> saliency map
- corner is detected where $E(x, y)$ exceeds a threshold
- problems:
    - anisotropic (direction dependent)
    - hard window like box filter
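
The operator can be sketched as follows, assuming a 3x3 window, shifts by one pixel in four directions (horizontal, vertical and the two diagonals) and the sum of squared differences as comparison measure. The function name `moravec` and the synthetic test image are illustrative.

```python
import numpy as np


def moravec(img, window=3):
    """Saliency map E(x, y): minimum window difference over four shift directions."""
    img = img.astype(float)
    h, w = img.shape
    half = window // 2
    shifts = [(0, 1), (1, 0), (1, 1), (1, -1)]    # (dy, dx)
    E = np.zeros((h, w))
    # skip a border so that the shifted windows stay inside the image
    for y in range(half + 1, h - half - 1):
        for x in range(half + 1, w - half - 1):
            patch = img[y - half:y + half + 1, x - half:x + half + 1]
            diffs = []
            for dy, dx in shifts:
                shifted = img[y + dy - half:y + dy + half + 1,
                              x + dx - half:x + dx + half + 1]
                diffs.append(np.sum((shifted - patch) ** 2))
            E[y, x] = min(diffs)    # response is the minimum over the directions
    return E


# bright square on a dark background
img = np.zeros((12, 12))
img[4:8, 4:8] = 1

E = moravec(img)
print("corner:", E[4, 4], "edge:", E[4, 5], "inside:", E[5, 5], "background:", E[2, 2])
```

At an edge the window does not change when it is shifted along the edge, so the minimum is zero there; only at corners do all four shifts produce a difference.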

**Harris corner detector**

Harris corner detector uses a Gaussian as isotropic windowing function addressing both shortcomings,  
the "hard" window and the anisotropy of the Moravec detector.

- they use the Gaussian as sliding windowing function
- and a structure tensor as an analytical solution to compute the differences
    - idea: combine gradients in the neighborhood of a point
    - the eigenvalues of the tensor:
        - both small: homogeneous area
        - one large, one small: edge
        - both large: corner
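
A minimal sketch of the eigenvalue analysis, using only NumPy. The gradient computation by central differences (`np.gradient`), the Gaussian truncated at $3\sigma$ and the function names are assumptions for illustration, not the exact formulation of the lecture.

```python
import numpy as np


def gaussian_smooth(img, sigma):
    """Separable Gaussian smoothing (kernel truncated at 3 sigma)."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, tmp)


def structure_tensor_eigenvalues(img, sigma=1.5):
    """Eigenvalues of the Gaussian-weighted structure tensor at every pixel."""
    gy, gx = np.gradient(img.astype(float))    # derivatives along rows, columns
    # combine gradient products in the neighborhood of each point
    a = gaussian_smooth(gx * gx, sigma)
    b = gaussian_smooth(gx * gy, sigma)
    c = gaussian_smooth(gy * gy, sigma)
    # closed-form eigenvalues of the symmetric matrix [[a, b], [b, c]]
    mean = (a + c) / 2
    diff = np.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return mean + diff, mean - diff            # larger, smaller eigenvalue


# bright square on a dark background
img = np.zeros((40, 40))
img[10:30, 10:30] = 1

lam_max, lam_min = structure_tensor_eigenvalues(img)
print("homogeneous:", lam_max[2, 2], lam_min[2, 2])      # both small
print("edge:       ", lam_max[20, 10], lam_min[20, 10])  # one large, one small
print("corner:     ", lam_max[10, 10], lam_min[10, 10])  # both large
```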

**b)** What are *local features* and what are they used for? Name some examples? Describe the main steps of SIFT and explain how invariances of the features are achieved.

**Local features** are used in object recognition to make it more <ins>**resilient against object rotation, partial occlusion, illumination differences, scaling**</ins> etc.  
This is achieved by using not the whole object but the most interesting parts of it - local patches that describe the object and are most of the time  
available in other representations of that object (for example the wheels and lights of a car, but not its color).
- examples:
    - SIFT descriptor
    - rectangle feature

**Interest points** are points in an image and can be used for object recognition.  
They should be salient, i.e. “special” or “rare”, either within the image, or with respect to “common” images.  
They should be stable, i.e. should keep positions under disruptions in an image and should remain in the same position  
with respect to the physical world in a different image of the same scene (e.g., change of viewpoint or illumination).

- idea: object recognition from local patches
- method 1: compare to reference image
- method 2: search for discriminative features
- motivated by human perception:
    - scans scene with a few fixations
    - fast object identification
    - high robustness against variety of appearance
- ways to select image patches:
    - all (sliding window)
    - randomly chosen (not good)
    - salient ones (good, but how to determine?)
    - aim: the same image patches should be found in transformed versions of an image

Various concepts for interest point (IP) detection:
- **context free**
    - maxima of a saliency measure designed for arbitrary images
    - independent of the image in question
- **context dependent**
    - depends on the particular image (e.g. white spot is salient on black image, but not in general)
    - model for IPs

Requirements to an IP detector:
- saliency: IPs should be "special" or "rare"
- stability: 
    - IPs should keep positions under disruptions
    - IPs should remain in the same position with respect to the physical world in a  
     different image of the same scene (e.g. change of viewpoint / illumination)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++  
So, in summary, local features are used to recognize objects across several images and different conditions.  
For example, such features could be used for a robot in order to localize in a scene.  
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

**SIFT (Scale Invariant Feature Transform)**

- **In short**
    - compute scale adaptive interest points
    - compute region descriptors from the histogram of gradients
    - compare images based on region descriptors

- **Procedure**
    - detect extrema in scale space to obtain scale invariant feature descriptors:
        - construct a Gaussian pyramid and from it the DoG (difference of Gaussians) pyramid: differences between versions of the image filtered with Gaussians of increasing sigma
        - search for keypoints over all pixels and all scales
    - detect stable keypoints:
        - look for points with extreme value compared to its local neighbors and neighboring scales
        - find positions with subpixel precision
    - direction detection to obtain rotation invariant descriptors:
        - for each keypoint, we compute a local HOG for its neighborhood
        - extraction of one (or more) directions from that histogram
        - basically search for the gradient with biggest magnitude and normalize based on that
            - normalize all other gradients in the environment such that the direction of the one with the biggest magnitude points upwards
            - if you rotate the image, the gradients will be normalized to the same direction (rotation invariant)
    - keypoint descriptor:
        - obtain an image region based on scale and orientation of the keypoint
        - transform to a standardized (normalized) description (SIFT feature) - a $128$-dimensional vector based on local gradients
        - use gradient image to achieve stability against change in illumination

- **Properties**
    - features are invariant against rotation and scaling, but not against strong changes of perspective (e.g. caused by translating the camera)

- **Compare keypoints between images (based on the descriptor)**
    - the idea is that you have such a detailed description of the keypoint that the same keypoint in another image would yield a very similar descriptor
    - based on a similarity definition, e.g. Euclidean distance between the descriptors (vectors) of two features in two images,  
    you can determine whether you found a feature in another image
    - to find the same object in two images:
        - apply SIFT to both images
        - compare descriptor vectors
            - if you find a reasonably close pair of vectors, you know that the two points corresponding to the vectors match between the two images
            - if you find a sufficiently large number of such corresponding points, you know that you found the object in the other image
            - can for example be used for stitching in panorama images

- **Why not invariant to changes of perspective (camera translation)?**
    - if we change the perspective (sufficiently strong), it won't work anymore, the gradients are too different
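
The comparison of descriptor vectors can be sketched without a SIFT implementation. The example below assumes random 128-dimensional vectors as stand-ins for real SIFT descriptors and a fixed distance threshold `max_dist`; both, as well as the function name `match_descriptors`, are hypothetical.

```python
import numpy as np


def match_descriptors(desc_a, desc_b, max_dist=0.5):
    """Match each descriptor in desc_a to its nearest neighbor in desc_b
    if their Euclidean distance is at most max_dist."""
    # pairwise distances, shape (len(desc_a), len(desc_b))
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    return [(i, int(j)) for i, j in enumerate(nearest) if dists[i, j] <= max_dist]


rng = np.random.default_rng(0)

# stand-ins for the descriptors of five keypoints in the first image
desc_a = rng.random((5, 128))

# second image: keypoints 4, 0 and 2 reappear with slight changes,
# plus two unrelated keypoints
desc_b = np.vstack([
    desc_a[[4, 0, 2]] + rng.normal(scale=0.01, size=(3, 128)),
    rng.random((2, 128)),
])

matches = match_descriptors(desc_a, desc_b)
print(matches)   # pairs (index in first image, index in second image)
```

If the number of matches is sufficiently large, the object can be considered found in the second image.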

## Compression

**a)** How does Huffman-Coding work?

**Huffman-Coding Steps**
- compute normalized histogram
- order symbols according to probability
- generate tree
    - merge the two least likely symbols
    - repeat until only two symbols remain
- start from the back and generate a prefix-free code according to the probability

The result is a coding scheme that assigns the shortest codes to the most likely signals and the longest codes to the least likely ones,
thereby removing the coding redundancy.  
The theoretical maximum compression factor is given by $\frac{\#bits}{entropy}$, which intuitively makes sense, because if the entropy is high ('chaotic' image ~uniformly distributed gray values), there is not much to be compressed (low compression factor). If, on the other hand, the entropy is low (only a few different gray values), there is a lot to be compressed.

Huffman coding works particularly well for images with low entropy, that is images where some gray values are dominant.  
The entropy of the image is a lower bound for the avg code length when using Huffman coding.
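
The tree construction can be sketched with a priority queue. This is a minimal illustration, independent of the `huffman` package; the function name `huffman_codebook` is hypothetical, and the merging is continued until a single root remains, which is equivalent to stopping at two symbols and assigning the last bit.

```python
import heapq
from collections import Counter


def huffman_codebook(symbols):
    """Build a prefix-free code by repeatedly merging the two least likely nodes."""
    counts = Counter(symbols)
    # heap entries: (weight, unique tie breaker, {symbol: code so far})
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate case: only one symbol
        return {s: "0" for s in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, codes1 = heapq.heappop(heap)  # least likely node
        w2, _, codes2 = heapq.heappop(heap)  # second least likely node
        # prepend one bit to distinguish the two merged subtrees
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


pixels = [0, 0, 0, 0, 1, 1, 2]               # gray values of a tiny image
codes = huffman_codebook(pixels)
total_bits = sum(len(codes[p]) for p in pixels)
print(codes, total_bits)
```

The most frequent gray value receives the shortest code, so the 7 pixels need 10 bits instead of 14 bits with a fixed 2-bit code.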

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from skimage import measure
import huffman

img = np.array([
    [0, 0, 1, 0, 0, 0],
    [0, 1, 2, 1, 1, 0],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 5, 0],
    [4, 4, 1, 2, 3, 0],
    [0, 3, 0, 0, 0, 0]
])

plt.figure(figsize=(5, 5))

entropy = measure.shannon_entropy(img)
print("entropy:", entropy)
print("theoretical max compression factor:", 3 / entropy)  # 3 bits per pixel suffice for the gray values 0..5

occurrences = [(s, np.count_nonzero(img == s)) for s in np.unique(img)]

huffman_codes = huffman.codebook(occurrences)
print("huffman codes", huffman_codes)
print("memory consumption before:", img.size * 3, "bits")

mem_after = 0
for k in huffman_codes:
    mem_after += np.count_nonzero(img == k) * len(huffman_codes[k])

print("memory consumption after:", mem_after, "bits")
print("actual compression ratio:", (img.size * 3) / mem_after)

plt.imshow(img)
plt.show()


**b)** What is the Gray code and what is its relation to run length encoding?

Gray code is an ordering of the binary system such that two successive values differ in only one bit (Hamming distance $1$).  

The Gray code is useful in the context of run length encoding, if one assumes that neighboring pixels will usually have similar values.  
If encoded with a Gray code, this means that one can expect fewer bit switches between neighboring pixels and hence longer runs  
(especially if the different bit planes of an image are compressed separately).

The enhancement results from the fact that successive numbers differ in only one bit, which means that only one bit plane is disturbed. 
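
A short sketch of the conversion between the standard binary representation and the (reflected binary) Gray code; the function names are illustrative.

```python
def to_gray(n: int) -> int:
    """Convert a non-negative integer to its Gray code."""
    return n ^ (n >> 1)


def from_gray(g: int) -> int:
    """Invert the Gray code by accumulating XORs of all right shifts."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


codes = [format(to_gray(n), "03b") for n in range(8)]
print(codes)

# the gray values 127 and 128 differ in 8 bits in the binary system,
# but only in a single bit (one bit plane) when Gray coded
binary_flips = bin(127 ^ 128).count("1")
gray_flips = bin(to_gray(127) ^ to_gray(128)).count("1")
print(binary_flips, gray_flips)
```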

## Understanding the Wireframe-Model


**a)** Explain in your own words the functions on the slide (CV-12 slide 9). Also explain when and why it may make sense to use $m$ instead of $m'$.

From the initial pose, the wire frame model is iteratively adapted to the image based on gradients.  

$m$: magnitude, $\beta$: orientation, $g(x, y)$: image  

**$x$-gradient**: $\Delta_x g = g(x+1, y) - g(x-1, y) \rightarrow$ for a fixed $y$, it's the difference between the pixels to the left and to the right  
**$y$-gradient**: $\Delta_y g = g(x, y+1) - g(x, y-1) \rightarrow$ for a fixed $x$, it's the difference between the pixels above and below  
**gradient magnitude**: $m'(x, y) = \sqrt{(\Delta_x g)^2 + (\Delta_y g)^2}$  
**orientation:** Use the inverse tangent: $\beta(x, y) = \arctan\left(\frac{\Delta_y g}{\Delta_x g}\right)$  

For the gradient magnitude, there's an alternative computation which is thresholded:  
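$m(x, y) = \Theta(m'(x, y) - T)$ (assuming $\Theta$ denotes the Heaviside step function, $m$ is $1$ where the magnitude exceeds the threshold $T$ and $0$ elsewhere, i.e. it only takes magnitudes that are sufficiently large)  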

When and why $m$ instead of $m'$?
- e.g. when you only want to consider salient edges in the image
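
The functions can be sketched with NumPy as follows. This is a minimal illustration under two assumptions: $\Theta$ is taken to be the Heaviside step function with $\Theta(0) = 0$, and `np.arctan2` is used instead of the plain inverse tangent to avoid a division by zero for $\Delta_x g = 0$. The function name `gradient_features` and the test image are hypothetical.

```python
import numpy as np


def gradient_features(g, T):
    """Return gradient magnitude m', orientation beta and thresholded magnitude m."""
    g = g.astype(float)
    dx = np.zeros_like(g)
    dy = np.zeros_like(g)
    # arrays are indexed as g[y, x]; border pixels keep a gradient of zero
    dx[:, 1:-1] = g[:, 2:] - g[:, :-2]    # g(x+1, y) - g(x-1, y)
    dy[1:-1, :] = g[2:, :] - g[:-2, :]    # g(x, y+1) - g(x, y-1)
    m_prime = np.sqrt(dx ** 2 + dy ** 2)
    beta = np.arctan2(dy, dx)             # swapped in for arctan(dy / dx)
    m = (m_prime > T).astype(float)       # Heaviside step of m' - T (assumption)
    return m_prime, beta, m


# strong vertical edge between columns 2 and 3, weak edge at the last column
g = np.zeros((5, 6))
g[:, 3:] = 10
g[:, 5] += 1

m_prime, beta, m = gradient_features(g, T=5)
print(m_prime[2])   # magnitudes along one row
print(m[2])         # only the strong edge survives the threshold
```

With $m$ the weak edge no longer contributes, so the adaptation of the wire frame model is only driven by the salient edges.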