Osnabrück University - Machine Learning (Summer Term 2020) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack, Axel Schaffland

# Exercise Sheet 08

## Introduction

This week's sheet should be solved and handed in before the end of **Saturday, June 27, 2020**. If you need help (and Google and other resources were not enough), feel free to contact your group's designated tutor or whoever of us you run into first. Please upload your results to your group's Stud.IP folder.

The second half of this sheet and the following Sheet 09 will be a recap of previous topics, to help you prepare for the final exam.

Also, if you hit any question that should be discussed in more detail in the next practice session, please let us know.

## Assignment 0: Math recap (Conditional Probability) [0 Points]

This exercise is supposed to be very easy and is voluntary. There will be a similar exercise on every sheet. It is intended to revise some basic mathematical notions that are assumed throughout this class and to allow you to check if you are comfortable with them. Usually you should have no problem answering these questions offhand, but if you feel unsure, this is a good time to look them up again. You are always welcome to discuss questions with the tutors or in the practice session. Also, if you have a (math) topic you would like to recap, please let us know.

**a)** Explain the idea of conditional probability. How is it defined?

YOUR ANSWER HERE

**b)** What is Bayes' theorem? What are its applications?

YOUR ANSWER HERE

**c)** What does the law of total probability state? 

YOUR ANSWER HERE

## Assignment 1: MLP and RBFN [10 Points]

This exercise is aimed at deepening the understanding of Radial Basis Function Networks and how they relate to Multilayer Perceptrons. Not all of the answers can be found directly in the slides. When answering the (more algorithmic) questions, first take a minute to think about how you would go about solving them, and if nothing comes to mind, search the internet for a little bit. If you are interested in a real-life application of both algorithms and how they compare, take a look at this paper: [Comparison between Multi-Layer Perceptron and Radial Basis Function Networks for Sediment Load Estimation in a Tropical Watershed](http://file.scirp.org/pdf/JWARP20121000014_80441700.pdf)

![Schematic of a RBFN](RBFN.png)

We have prepared a little example that shows how radial basis function approximation works in Python. This is not an example implementation of an RBFN, but it illustrates the work of the hidden neurons.

In [None]:
%matplotlib inline
import numpy as np
from numpy.random import uniform

from scipy.interpolate import Rbf

import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm


def func(x, y):
    """
    This is the example function that should be fitted.
    Its shape could be described as two peaks close to
    each other - one going up, the other going down
    """
    return (x + y) * np.exp(-4.0 * (x**2 + y**2))


# number of training points (you may try different values here)
training_size = 50

# sample 'training_size' data points from the input space [-1,1]x[-1,1] ...
x = uniform(-1.0, 1.0, size=training_size)
y = uniform(-1.0, 1.0, size=training_size)

# ... and compute function values for them.
fvals = func(x, y)

# get the approximation via RBF
new_func = Rbf(x, y, fvals)

# Plot both functions:
# create a 100x100 grid of input values
x_grid, y_grid = np.mgrid[-1:1:100j, -1:1:100j]

fig, ax = plt.subplots(ncols=2, sharey=True, figsize=(10, 6))
# This plot represents the original function.
# np.mgrid puts x along the rows and y along the columns, while imshow
# draws rows vertically with row 0 on top. Transposing and setting
# origin='lower' aligns the image with the (x, y) axes and the scatter plot.
f_orig = func(x_grid, y_grid)
img = ax[0].imshow(f_orig.T, extent=[-1, 1, -1, 1], origin='lower', cmap='RdBu')
ax[0].set(title='Original Function')
# This plots the approximation of the original function by the RBF.
# If the plot looks strange, try to run it again: the sampling
# in the beginning is random.
f_new = new_func(x_grid, y_grid)
ax[1].imshow(f_new.T, extent=[-1, 1, -1, 1], origin='lower', cmap='RdBu')
ax[1].set(title='RBF Result', xlim=[-1, 1], ylim=[-1, 1])
# scatter the data points that have been used by the RBF
ax[1].scatter(x, y, color='black')
fig.colorbar(img, ax=ax)
plt.show()

### Radial Basis Function Networks

#### What are radial basis functions?

Radial basis functions provide a global approximation of a target function by a linear combination of local approximations.

**Architecture**:
- single layer of neurons (units)
- each neuron gets the complete input
- neurons have certain weights
- there is a unimodal activation function called kernel function
- method works locally
- neurons contribute to the output vector according to their activation
- things to do:
    - find suitable input weights (instance-based learning / clustering)
    - find radii
    - define output weights (perceptron-like rule)

#### What is the structure of an RBFN? You may also use the notation from the picture included above.

An RBFN is a network with a single layer of neurons (hidden layer) where each of the neurons gets the complete input. The output is a global approximation of the target function that arises from a linear combination of the local approximations, which are the outputs of each neuron's activation function.

#### How is an RBFN trained?

The training is a three-step process:
- find suitable input weights by instance-based learning or clustering
- find suitable radii of influence
- find output weights by using a perceptron-like rule
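The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under several assumptions that are not part of the sheet: the target function is reused from the example cell, the centers come from a few plain k-means iterations, each radius is set to the distance to the nearest other center, and the output weights are found by ordinary least squares in place of an iterative perceptron-like rule. The names `n_hidden` and `hidden_activations` are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)


def target(x, y):
    # same example function as in the RBF demo cell
    return (x + y) * np.exp(-4.0 * (x**2 + y**2))


# training data sampled from [-1, 1] x [-1, 1]
X = rng.uniform(-1.0, 1.0, size=(200, 2))
t = target(X[:, 0], X[:, 1])

# Step 1: input weights (centers) via a few k-means iterations
n_hidden = 25  # assumed number of basis functions
centers = X[rng.choice(len(X), size=n_hidden, replace=False)].copy()
for _ in range(20):
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    for k in range(n_hidden):
        members = X[nearest == k]
        if len(members) > 0:  # keep the old center if a cluster runs empty
            centers[k] = members.mean(axis=0)

# Step 2: radii, here heuristically the distance to the nearest other center
center_dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
np.fill_diagonal(center_dists, np.inf)
radii = center_dists.min(axis=1)


def hidden_activations(points):
    """Gaussian kernel activation of every hidden neuron for every point."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(d**2) / (2.0 * radii**2))


# Step 3: output weights (least squares instead of a perceptron-like rule)
H = hidden_activations(X)
weights, *_ = np.linalg.lstsq(H, t, rcond=None)

prediction = H @ weights
mse = np.mean((prediction - t) ** 2)
print("training MSE:", mse)
```

Because the hidden layer is fixed after steps 1 and 2, the last step is a linear problem, which is why the three steps can be carried out separately.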

### Comparison to the Multilayer Perceptron

#### What do both models have in common? Where do they differ?

**Effect of adaptation step:**
- **RBF**: only local on input area
- **MLP**: may change all weights and therefore the performance on the whole data set

**Architectural params:**
- **RBF**: only #basis functions (easy to interpret)
- **MLP**: #layers, #neurons (difficult to interpret)

**Adaptation params:**
- **RBF**: decoupled and easy to interpret (clustering params, radii, stepsize)
- **MLP**: coupled and interacting in complex manner (stepsize, momentum etc.)

**In common:**
- both are feed-forward nets
- non-linear activation

#### How can classification in both networks be visualized?

- An MLP naturally separates the classes with hyperplanes in the input space
- An RBFN separates class distributions by localizing radial basis functions

#### When would you use an RBFN instead of a Multilayer Perceptron?

MLPs are not as good as RBFNs at dealing with noisy data sets, so an RBFN is probably a good choice when there is a lot of noise in the data.

## Recap 1: Concept Learning [2 Points]

### a) Concept Learning

What is Concept Learning? Is it supervised? Is it local?

The problem that is considered in concept learning is 'how to learn general concepts from specific examples'. A concept can be represented by a boolean function which assigns true to the appropriate entities.  
Since we need to know whether an example belongs to the concept or not during the learning, it's supervised.  
Concepts can be learned using different learning approaches, so there can be both local and non-local approaches.

### b) Find-S
Describe the Find-S Algorithm in pseudo code. What is its inductive bias? What are its advantages and drawbacks?

- init $h$ to the most specific hypothesis
- for each positive training instance $x$
    - for each attribute constraint $a_i \in h$
        - if $a_i$ is not satisfied by $x$
            - replace $a_i$ in $h$ by the next more general constraint that is satisfied by $x$
- return $h$
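The pseudo code above translates almost line by line into Python. The sketch below assumes conjunctive hypotheses over nominal attributes, with `None` standing for the most specific constraint (no value accepted) and `"?"` for the most general one (any value accepted). The small weather-style data set is only an illustrative assumption.

```python
def find_s(examples):
    """Find-S for conjunctive hypotheses.

    examples: list of (attribute_tuple, is_positive) pairs.
    """
    n_attributes = len(examples[0][0])
    # None is the most specific constraint: it accepts no value at all
    h = [None] * n_attributes
    for x, is_positive in examples:
        if not is_positive:
            continue  # Find-S ignores negative examples
        for i, value in enumerate(x):
            if h[i] is None:
                h[i] = value  # first positive example: copy its value
            elif h[i] != "?" and h[i] != value:
                h[i] = "?"    # next more general constraint: accept anything
    return h


# hypothetical toy data: (sky, air, humidity, wind, water, forecast)
training = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

hypothesis = find_s(training)
print(hypothesis)  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```

The negative example never changes `h`, which illustrates the first drawback listed below.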

**Inductive Bias**:
- the target function must be inside of the hypotheses set
- all instances are going to be negative, unless the opposite is entailed by its knowledge

**Drawbacks**:
- learns nothing from negative examples
- can't tell whether it has learned the concept
- can't tell whether training data is inconsistent
- picks maximally specific $h$

### c) Hypotheses space

What is the hypotheses space for Candidate-Elimination used in the lecture?

The hypotheses space is the set of all hypotheses between the most general and the most specific hypotheses, basically the set that contains every possible hypothesis.

## Recap 2: Decision Trees [2 Points]

### a) Overfitting
What is overfitting? How can it be avoided?

Overfitting means fitting the noise of the training data, which prevents good generalization to unseen data. It can be avoided by stopping tree growth when a data split is no longer statistically significant, or by post-pruning after building up the complete tree.

### b) Pruning

Name one method for pruning a decision tree and describe it!

**Reduced error pruning** removes nodes to achieve better generalization, measured on a separate validation set.  
First, the effect of pruning on the validation accuracy is evaluated for each candidate node. The node whose removal most improves the accuracy on the validation set is then pruned. This is repeated until further pruning would decrease the performance on the validation set.


### c) Information gain
What are entropy and information gain? Provide explanation and formulae. How are they used in ID3?

The entropy $E \in [0, 1]$ measures the impurity of a set of training examples $S$ and is defined as follows:  

$E(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$  

$p_+ := $ proportion of positive examples in $S$  

$p_- := $ proportion of negative examples in $S$

A set $S$ that contains only one class of examples would be entirely pure and therefore have an entropy of $0$. On the other hand, for a set that has as many examples from one class as from the other, the entropy would be $1$, because of the maximum impurity of the set.  

The information gain is the expected reduction in entropy for a set $S$ due to sorting on a certain attribute $A$. It is defined as:  

$G(S, A) = E(S) - \sum_v E(S_v) \cdot \frac{|S_v|}{|S|}$  

$S_v$ is the subset of $S$ for which $A$ has value $v$. That means that the entropy values for each $S_v$ weighted with their proportion of the whole set are subtracted from the entropy. It is easy to see that the information gain is maximal when the $S_v$ are pure. The other extreme would be totally chaotic $S_v$ which would lead to a low information gain.

The ID3 algorithm builds the tree based on the information gain achieved by attributes.
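Both formulae can be checked on a tiny example. The sketch below is only an illustration: the four-example data set and the attribute names `outlook` and `wind` are assumptions, and the functions work for any number of classes, not just the binary case above.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, labels, attribute):
    """Expected entropy reduction from splitting on `attribute`."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# hypothetical toy data: 'outlook' separates the classes, 'wind' does not
rows = [
    {"outlook": "sunny", "wind": "weak"},
    {"outlook": "sunny", "wind": "strong"},
    {"outlook": "rainy", "wind": "weak"},
    {"outlook": "rainy", "wind": "strong"},
]
labels = ["+", "+", "-", "-"]

gains = {a: information_gain(rows, labels, a) for a in ("outlook", "wind")}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

ID3 would put `outlook` at the root here, since its subsets are pure (gain 1) while splitting on `wind` leaves the entropy unchanged (gain 0).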

## Recap 3: Data Mining [2 Points]

### a) Missing values

How can you deal with missing values? Name an important algorithm and explain how to use it.

There are different approaches to deal with missing values:
- take mean for missing attribute (artifacts)
- estimate model to predict missing values (linear regression) -> artificial concentration of values along regression lines
- estimate data distribution and generate missing values by random samples from that distribution
- Expectation Maximization (EM) algorithm

One important algorithm to deal with missing value situations is the EM algorithm. It's an iterative method to find (local) maximum likelihood estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent or missing variables in the next E step. 
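The alternation of E and M steps can be illustrated with a small sketch. It assumes a one-dimensional mixture of two Gaussians, where the unobserved quantity is the component each point was drawn from. The synthetic data, the starting values, and the fixed number of iterations are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic data: the component labels are deliberately thrown away
data = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 300)])

# initial parameter guesses
means = np.array([-1.0, 1.0])
stds = np.array([1.0, 1.0])
priors = np.array([0.5, 0.5])


def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


for _ in range(100):
    # E step: posterior probability of each component for each data point,
    # computed with the current parameter estimates
    weighted = priors * gaussian_pdf(data[:, None], means, stds)  # shape (N, 2)
    resp = weighted / weighted.sum(axis=1, keepdims=True)

    # M step: parameters that maximize the expected log-likelihood
    n_k = resp.sum(axis=0)
    means = (resp * data[:, None]).sum(axis=0) / n_k
    stds = np.sqrt((resp * (data[:, None] - means) ** 2).sum(axis=0) / n_k)
    priors = n_k / len(data)

print("means:", means, "stds:", stds, "priors:", priors)
```

The same scheme applies to missing attribute values: they take the role of the latent variables, and the E step averages over them using the current parameter estimates.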

### b) Outliers

What are outliers? Can we detect them? If so, how?

Outliers are data values that do not seem representative for the rest of the data, because they are 'far away' from the rest of the data points and do not fit the expectation. There are numerous causes of outliers, e.g.:
- technical / measurement errors
- unexpected true effect
- data with high variation

To consider a data point an outlier, we need to define what is regular. One approach to detect outliers is the so-called z-test. It considers the distances of points from the mean of the data set, measured in units of the standard deviation, and checks whether a distance exceeds some limit. If the distance of a point exceeds this limit, the point is considered an outlier. A more sophisticated version of this is the Rosner test, in which outliers are removed iteratively.
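A minimal sketch of the z-test idea follows. The threshold of three standard deviations is a common convention but an assumption here, as are the synthetic sample and the function name `z_score_outliers`.

```python
import numpy as np


def z_score_outliers(values, threshold=3.0):
    """Boolean mask of points whose distance from the mean exceeds
    `threshold` standard deviations."""
    values = np.asarray(values, dtype=float)
    z = np.abs(values - values.mean()) / values.std()
    return z > threshold


rng = np.random.default_rng(0)
# 200 regular measurements plus one injected outlier at the end
sample = np.append(rng.normal(10.0, 1.0, 200), 25.0)

mask = z_score_outliers(sample)
print("outlier values:", sample[mask])
```

Note that the outlier itself inflates the mean and the standard deviation. This is one motivation for iterative schemes such as the Rosner test, which re-estimate both after removing an outlier.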

### c) Expectation Maximization
What does the Q-function express in the EM algorithm?

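The Q-function expresses the expected log-likelihood: the log-likelihood of the complete data (observed and missing or latent values), averaged over the missing values given the observed data and the current parameter estimate. It is computed in the E step and maximized in the M step.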

## Recap 4: Clustering [4 Points]

### a) Clustering

Explain the difference between single-linkage and complete-linkage clustering.

Single-linkage clustering employs the minimum cluster distance $D_{min}$ (distance of closest points between two clusters).

Complete-linkage clustering employs the maximum cluster distance $D_{max}$ (max distance between two points of the two clusters).
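The two cluster distances differ only in the reduction applied to the pairwise point distances. The sketch below assumes Euclidean point distances and two made-up clusters on a line.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distance between every point in a and every point in b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)


def single_linkage(a, b):
    return pairwise_distances(a, b).min()   # D_min: closest pair


def complete_linkage(a, b):
    return pairwise_distances(a, b).max()   # D_max: farthest pair


cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 0.0]])

print(single_linkage(cluster_a, cluster_b))    # 2.0
print(complete_linkage(cluster_a, cluster_b))  # 5.0
```

In agglomerative clustering, the pair of clusters with the smallest such distance is merged in each step, so the choice of linkage changes which clusters are merged first.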

### b) Metrics

Name three different distance measures and briefly explain them. Check the metric axioms for one of them.

- **Euclidean distance**: Straight-line distance between two points in Euclidean space
    - $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
- **Manhattan distance**: Distance between two points is the sum of the absolute differences of their Cartesian coordinates
    - $d(x, y) = \sum_i |x_i - y_i|$
- **Hamming distance**: Number of positions where two strings of equal length differ
    - $d(x, y) = |\{i \in \{1, ..., n\} \thinspace | \thinspace x_i \neq y_i\}|$

**Metric axioms for Euclidean distance:**

- identity of indiscernibles: $d(x, y) = 0 \iff x = y$
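Besides the identity of indiscernibles, a metric also has to satisfy non-negativity ($d(x, y) \geq 0$), symmetry ($d(x, y) = d(y, x)$) and the triangle inequality ($d(x, y) \leq d(x, z) + d(z, y)$). The sketch below implements the three distances and spot-checks all axioms for the Euclidean distance on random points. Such a numerical check is only a sanity test under the assumption of a small random sample, not a proof.

```python
import numpy as np


def euclidean(x, y):
    return float(np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2)))


def manhattan(x, y):
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y))))


def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))


print(euclidean([0, 0], [3, 4]))      # 5.0
print(manhattan([0, 0], [3, 4]))      # 7.0
print(hamming("karolin", "kathrin"))  # 3

# spot-check the metric axioms for the Euclidean distance
rng = np.random.default_rng(0)
points = rng.normal(size=(20, 3))
axioms_hold = True
for x in points:
    for y in points:
        d_xy = euclidean(x, y)
        axioms_hold = axioms_hold and d_xy >= 0                              # non-negativity
        axioms_hold = axioms_hold and ((d_xy == 0) == np.array_equal(x, y))  # identity
        axioms_hold = axioms_hold and bool(np.isclose(d_xy, euclidean(y, x)))  # symmetry
        for z in points:
            # triangle inequality, with a small tolerance for rounding
            axioms_hold = axioms_hold and (
                d_xy <= euclidean(x, z) + euclidean(z, y) + 1e-12
            )
print("all axioms hold on the sample:", axioms_hold)
```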


### c) Mixture models

What is a mixture model? Explain. Can you provide a formula?

It's a mixture of several components, where each component has a simple parametric form (such as a Gaussian). 
We assume each data point belongs to one of the components, and we try to infer the distribution for each component separately. In general, a mixture model assumes the data points are generated by the following process: first we sample $z$, and then we sample the observables $x$ from a distribution which depends on $z$, i.e. $p(z, x) = p(z) p(x|z)$.
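Summing the joint distribution over the latent variable $z$ gives the density of the observables, which is the usual mixture formula. The notation below is an assumption for illustration: $K$ components, mixing weights $\pi_k = p(z = k)$, and Gaussian components with parameters $\mu_k$ and $\Sigma_k$ as one common choice.

$$p(x) = \sum_{k=1}^{K} p(z = k) \, p(x \mid z = k) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \pi_k \geq 0, \quad \sum_{k=1}^{K} \pi_k = 1$$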

## Recap 5: Dimension Reduction [2 Points]

### a) Visualization

Name three different data visualization techniques to visualize high dimensional data. Explain one in detail.

- **Scatterplot matrix**: Project on two of the dimensions and display all combinations as matrix of 2D plots
- **Glyphs**: Map each dimension onto the parameters of a geometrical figure
- **Chernoff Faces**: Parameters are mapped to facial features
    - The human visual system is particularly sensitive for faces, i.e. we are very good at interpreting and recognizing human faces. Chernoff used faces to make use of this property when representing high dimensional data.


### b) PCA

Draw a few data points (ASCII art or on a sheet of paper) and mark the principal components. What are the principal components?

<img src="PC.png" width="400"/>

PCs are vectors pointing in the direction of largest variance. Their magnitude (length) expresses that variance.

### c) Covariance matrix
What does a covariance matrix express? How is it computed from data? How is it used in PCA?

The covariance matrix tells you how much one feature in your data set covaries with another feature in the data set.
The diagonal holds the variances of the individual features, and the off-diagonal element $(i, j)$ shows how feature $i$ covaries with feature $j$. If you normalize it (dividing each entry by the product of the two features' standard deviations), you get the correlation matrix with values between $-1$ and $1$, which tells you how strongly two features correlate.

Compute covariance matrix:
$C(X) = E((X - \mu)(X - \mu)^T)$ where $\mu$ is the expected value (mean vector) of the data set.

The covariance matrix is an essential part of PCA, because the PCs are the eigenvectors of this matrix.
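The following sketch estimates the covariance matrix from data and obtains the principal components from its eigendecomposition. The synthetic data set, stretched along the direction $(1, 1)$, is an assumption chosen so that the result is easy to verify. The expectation in the formula above is replaced by the usual sample estimate with the factor $1 / (N - 1)$.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic 2D data: large spread along (1, 1), small spread along (-1, 1)
latent = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
rotation = np.array([[1.0, -1.0], [1.0, 1.0]]) / np.sqrt(2.0)
data = latent @ rotation.T

# sample covariance matrix: center the data, then average the outer products
centered = data - data.mean(axis=0)
cov = centered.T @ centered / (len(data) - 1)

# eigh is meant for symmetric matrices and returns ascending eigenvalues
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]           # sort by decreasing variance
eigenvalues = eigenvalues[order]
principal_components = eigenvectors[:, order]   # one PC per column

print("variances along the PCs:", eigenvalues)
print("first PC:", principal_components[:, 0])

# dimension reduction: project the centered data onto the first PC
projected = centered @ principal_components[:, :1]
```

The first principal component comes out (up to sign) close to $(1, 1) / \sqrt{2}$, and its eigenvalue is the variance of the data along that direction.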