Osnabrück University - Machine Learning (Summer Term 2020) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack, Axel Schaffland

# Exercise Sheet 09

## Introduction

This week's sheet should be solved and handed in before the end of **Saturday, July 04, 2020**. If you need help (and Google and other resources were not enough), feel free to contact your group's designated tutor or whomever of us you run into first. Please upload your results to your group's Stud.IP folder.

Again, the second half of this sheet will be a recap of previous topics, to help you prepare for the final exam.

Also, if you come across any question that should be discussed in more detail in the next practice session, please let us know.

## Assignment 1: Reinforcement Learning [12 Points]

In this assignment you will have a look at the Q-Learning algorithm described in the lecture (ML-10 Slide 18). For this we generate a field with random rewards. A learning agent then explores the field and learns the optimal path to navigate through it. The code below again contains some ``TODO``s that you should complete in order to implement the Q-Learning algorithm.

Below the code there are some questions! You will also find a free code cell for an implementation written entirely by yourself. You may use your own test mazes.

In [None]:
import numpy as np
import numpy.random as rand

def generate_field(x, y, num_rewards, max_reward):
    """
    Generate a random game field with rewards.
    
    Args:
        x (int):            x dimension of the field
        y (int):            y dimension of the field 
        num_rewards (int):  the number of rewards that should be randomly placed
        max_reward (int):   exclusive upper bound for the rewards (values are drawn from 0 .. max_reward - 1)
        
    Returns:
        ndarray: A field with randomly initialized rewards, the rest of the 
        entries is zero
    """
    
    # Change or comment out to get different random data in each run
    np.random.seed(42)
    
    field = np.zeros((y,x), dtype=np.uint8)
    
    for i in range(num_rewards):
        field[rand.randint(y), rand.randint(x)] = rand.choice(max_reward)
    
    return field

In [None]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patheffects as PathEffects


class QLearning:
    """
    This class contains all the necessary methods to navigate through
    a maze or game with the help of a little bit of Q-Learning.
    """

    def __init__(self, field, actions, gamma):
        """
        Initializes the QLearning Algorithm with the necessary parameters.
        All q values are stored in self.q - this is an array that has
        ACTIONS x map_y x map_x dimensions to store a value for each action
        in each field. The starting position self.pos is randomly initialized.
        
        Args:
            field (ndarray):  the map
            actions (list):   the available actions
            gamma (float):    the gamma in the lecture slides
        
        Returns:
            QLearning: An instance that can be used for Q-Learning on the field
        """
        # Store the field, the available actions and the discount factor.
        self.field = field
        self.actions = actions
        self.gamma = gamma
        
        # Remember the map extent for further navigation.
        self.map_y = self.field.shape[0]
        self.map_x = self.field.shape[1]
        
        # Create q value matrix.
        self.q = np.zeros((len(self.actions), self.map_y, self.map_x))

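        # Start on a random position in the field. The position is stored as a
        # tuple so that it compares correctly with the tuples from get_coordinates.
        self.pos = (np.random.randint(self.map_y), np.random.randint(self.map_x))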
        self.fig, self.axes = plt.subplots(3, 3, num='QLearning State')
        for ax in self.axes.flat:
            ax.axis('off')

    def get_coordinates(self, position, action):
        """
        Returns the coordinates that follow a certain action, depending
        on the current position of the learner. If the border is reached
        the agent just stops there.
        
        Args:
            position (pair):  the current position
            action (string):  the action that should be performed (one of: 'up', 'down', ...)
            
        Returns:
            pair of int: the updated coordinates
        """
        # return the right new coordinates depending on the position
        # YOUR CODE HERE
        x = position[1]
        y = position[0]
        
        if action == 'left':
            return (y, x - 1) if x > 0 else (y, x)
        elif action == 'right':
            return (y, x + 1) if x < self.map_x - 1 else (y, x)
        elif action == 'up':
            return (y - 1, x) if y > 0 else (y, x)
        elif action == 'down':
            return (y + 1, x) if y < self.map_y - 1 else (y, x)

    def update(self):
        """
        Implementation of the update step. Closely follows the algorithm described on
        ML-10 Sl.18. Note that all attributes set in the __init__ method of this class
        are available, in particular self.field, which stores the real field the agent
        is moving on, and self.actions, which stores the available actions.
        """
        # Select a random action that should be performed next.
        # Be careful to handle the case where you hit the wall!
        # YOUR CODE HERE
        rand_action = np.random.choice(self.actions)
        resulting_pos = self.get_coordinates(self.pos, rand_action)

        if resulting_pos != self.pos:
            # Receive the reward for the new position from the field.
            # YOUR CODE HERE
            reward = self.field[resulting_pos[0], resulting_pos[1]]
            
            # Update the q-value for the performed action.
            # YOUR CODE HERE
            q_values = self.q[:, resulting_pos[0], resulting_pos[1]]
            q_update = reward + self.gamma * max(q_values)
            # we have one table for each action
            self.q[self.actions.index(rand_action), self.pos[0], self.pos[1]] = q_update

            # Update the position of the player to the new field.
            # YOUR CODE HERE
            self.pos = resulting_pos

    def plot(self):
        """
        Plots the current state.
        """
        fs = 8
        for i, action in enumerate(self.actions):
            ax = self.axes.flat[2*i + 1]
            ax.cla()
            ax.set(title=action)
            ax.set_xticks(np.arange(self.q[i,:,:].shape[1]))
            ax.set_yticks(np.arange(self.q[i,:,:].shape[0]))
            ax.imshow(self.q[i,:,:], interpolation='None')

            for j in range(self.q.shape[1]):
                for k in range(self.q.shape[2]):
                    text = ax.text(k, j, "{:.1f}".format(self.q[i, j, k]),
                                   ha="center", va="center", color="black", fontsize=fs)
                    plt.setp(text, path_effects=[
                        PathEffects.withStroke(linewidth=1, foreground="w")])

        self.fig.canvas.draw()

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import matplotlib.patheffects as PathEffects

# Determine the size of the field; change these parameters as you like
m_x = 5
m_y = 4

steps = 500

actions = ['up','left','right','down']  # These are the available actions for the QLearning.
field = generate_field(m_x, m_y, num_rewards=5, max_reward=10) # The field that is used for learning.

# Plotting the generated field
fs = 18
figure, ax = plt.subplots()
#plt.axis('off')
ax.imshow(field, interpolation='none')
ax.set_xticks(np.arange(field.shape[1]))
ax.set_yticks(np.arange(field.shape[0]))
for j in range(field.shape[0]):
    for k in range(field.shape[1]):
        text = plt.text(k, j, field[j,k],
        ha="center", va="center", color="black", fontsize=fs)
        plt.setp(text, path_effects=[PathEffects.withStroke(linewidth=3, foreground="w")])

figure.suptitle("Field",fontsize=fs)          
figure.canvas.draw()


# Generate a QLearning instance with the right parameters.
# YOUR CODE HERE
player = QLearning(field, actions, 0.85)

# Now we perform `steps` learning iterations on the field with
# the generated QLearning instance.
for i in range(steps):     
    player.update()
    player.plot()

Explain in your own words how the algorithm works. What is depicted in the resulting plots? How can an action policy be derived from these data?

Q-learning is a reinforcement learning technique for learning the optimal policy in a Markov decision process (MDP), i.e. the policy that maximizes the expected total reward over all subsequent steps, starting from the current state. "Q" names the function that assigns a value to each pair of state and action and can be said to stand for the "quality" of an action taken in a given state.

The algorithm starts by initializing all $q$-values to $0$. In the loop, we execute an action and update the $q$-value for the current state and the performed action based on the reward of performing this action in the current state and the maximum $q$-value over all possible actions in the resulting state weighted by the discount factor. Afterwards, the current state gets updated to the new one.
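
Written as a formula, the update described above could be sketched as follows. The notation is an assumption made here and is not taken from the slides: $s$ is the current state, $a$ the performed action, $r$ the reward received, $s'$ the resulting state and $\gamma$ the discount factor.

$$\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$$

In the code above, this corresponds to the line `q_update = reward + self.gamma * max(q_values)`.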

The field plot just shows the reward for ending up in each field.  
The plots below the field plot show the $q$-values for each of the possible actions. A cell in one of these plots contains the $q$-value for performing that specific action from that position on the field. For example, when the current position on the field is $(1, 4)$, it is a good idea to move up and reach the state with a high reward of $7$, which is why 'up' has a high $q$-value at this position.

The plots always show a light color for cells with a high reward or $q$-value and a dark one for low values.

An action policy can be derived by suggesting the action with the highest $q$-value on the specified field.
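
As a minimal sketch of this idea, the greedy policy can be read off a q-table with an `argmax` over the action axis. The helper name `greedy_policy` and the small hand-filled table are hypothetical; only the array layout (actions x map_y x map_x) follows the `QLearning` class above.

```python
import numpy as np

def greedy_policy(q, actions):
    """Return, for every cell, the action with the highest q-value.

    q has shape (len(actions), map_y, map_x), as in the QLearning class.
    Ties are resolved in favour of the first action in the list.
    """
    best = np.argmax(q, axis=0)          # index of the best action per cell
    return np.array(actions)[best]       # map indices back to action names

# Hypothetical toy q-table for a 2 x 2 field.
actions = ['up', 'left', 'right', 'down']
q = np.zeros((4, 2, 2))
q[2, 0, 0] = 5.0   # 'right' is best in cell (0, 0)
q[3, 0, 1] = 3.0   # 'down' is best in cell (0, 1)

policy = greedy_policy(q, actions)
print(policy)
```

With a trained instance, the same call would be `greedy_policy(player.q, player.actions)`.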


You are also free to write your own complete implementation of the Q-Learning algorithm (instead of completing the code above). Use the following cell for your implementation.

# Recap (part II)

This is the second part of the recap material. These exercises do not need to be solved in order to qualify for the final exam, but solving them is highly recommended for preparation. Also, if you come across any question that should be discussed in more detail, please let us know.

## Recap 6: Neural Networks [3 Points]

### a) Neural Networks

Name three different kinds of Artificial Neural Networks discussed in the lecture.

- **Multilayer perceptron (MLP)**
- **Radial basis function networks (RBFN)**
- **Self-organizing maps (SOM)**

### b) Backpropagation

Which of the following formulae describes the backpropagation of the error through hidden layers in a Multilayer Perceptron?
Assume they are calculated for each $k=L_H \dots 1$ and $i=1\dots N(k)$.

1. $\delta_i(k) = f^\prime(o_i(k)) \sum\limits_{j=1}^{N(k+1)} w_{ji}(k+1, k)o_j(k)$
2. $\delta_i(k) = f^\prime(o_i(k)) \sum\limits_{j=1}^{N(k+1)} w_{ji}(k+1, k)\delta_j(k+1)$
3. $\delta_i(k) = f^\prime(o_i(k)) \sum\limits_{j=1}^{N(k+1)} w_{ji}(k, k-1)\delta_j(k+1)$

Formula $2$ is the correct one.
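
Formula 2 can be sketched in vectorized form as shown below. The function name `hidden_deltas`, the storage convention `W_next[j, i]` $= w_{ji}(k+1, k)$ and the toy numbers are assumptions made for illustration only.

```python
import numpy as np

def hidden_deltas(W_next, delta_next, o, f_prime):
    """Backpropagate the error through one hidden layer (formula 2).

    W_next[j, i] is assumed to hold w_ji(k+1, k), so it has shape
    (N(k+1), N(k)). delta_next holds delta_j(k+1), o holds o_i(k).
    """
    # Sum over j of w_ji(k+1, k) * delta_j(k+1), for every i at once.
    backpropagated = W_next.T @ delta_next
    return f_prime(o) * backpropagated

# Toy example: layer k has 3 neurons, layer k+1 has 2 neurons.
W_next = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
delta_next = np.array([1.0, -1.0])
o = np.zeros(3)

# A derivative of 1 everywhere (linear activation) keeps the example easy to check.
deltas = hidden_deltas(W_next, delta_next, o, lambda values: np.ones_like(values))
print(deltas)
```

The errors of layer $k+1$ travel backwards through the same weights that carried the activations forwards, which is why the weights $w_{ji}(k+1, k)$ and the errors $\delta_j(k+1)$ appear together.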

### c) Hebb's rule
Explain Hebb's rule. Provide a formula. What is the relation to Oja's rule?

The idea of Hebb's rule can be summarized by the catch phrase 'neurons that fire together, wire together'.  
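It can be expressed by the formula: $\Delta w_i = \epsilon \cdot y(\vec{x}\vec{w})x_i$  
Hebb's rule has the problem that the weights grow without bound: the norm of the weight vector never decreases, so the weights become arbitrarily large.  
Oja's rule deals with that issue by introducing a weight decay term: $\Delta w_i = \epsilon \cdot y(\vec{x}\vec{w})\,(x_i - y(\vec{x}\vec{w})\,w_i)$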
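
The difference between the two rules can be made visible with a small experiment. This is only a sketch under assumptions: a linear neuron $y = \vec{x}\vec{w}$, randomly generated two-dimensional data, and a learning rate and sample count chosen purely for illustration.

```python
import numpy as np

def hebb_step(w, x, eps):
    """One Hebbian update for a linear neuron y = x . w."""
    y = x @ w
    return w + eps * y * x

def oja_step(w, x, eps):
    """One update with Oja's rule: the Hebbian term plus a decay term -y^2 * w."""
    y = x @ w
    return w + eps * y * (x - y * w)

# Assumed toy data: zero-mean samples with more variance along the first axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2)) * np.array([2.0, 0.5])

w_hebb = np.array([0.1, 0.1])
w_oja = w_hebb.copy()
for x in X:
    w_hebb = hebb_step(w_hebb, x, eps=0.005)
    w_oja = oja_step(w_oja, x, eps=0.005)

print("norm after Hebb:", np.linalg.norm(w_hebb))
print("norm after Oja: ", np.linalg.norm(w_oja))
```

With Hebb's rule the norm of the weight vector keeps growing, while with Oja's rule it settles close to 1.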

## Recap 7: Local Methods [2 Points]

### a) Local methods

What are differences between local and global methods? What are advantages or disadvantages?

Local methods are local with respect to the input space: the output is computed individually for different regions of the input space, so adaptation has only local effects. MLPs, for example, are not local: adapting a single weight based on a single example may influence the output of the entire net (all output channels) for the complete set of inputs.

Advantages of local methods:
- better in dealing with noise in the training data
- better suited to deal with data flows that are dynamically evolving
- effects of parameters are easier to interpret

**Question:** What are disadvantages of local methods?  
Maybe in some scenarios it would be desirable to have the whole net adapt very fast according to dynamic changes?

### b) MLP and RBFN

Is an MLP or are RBFN local methods? Why?

An MLP is a non-local method because the adaptation of a single weight based on a single example may influence the performance of the entire net. RBFNs, on the other hand, are local methods since each neuron only affects its own receptive field.

### c) Nearest neighbor

How does the nearest neighbor approach work? How can it be improved?

The **nearest neighbor** approach first memorizes all examples ('training'). In the application phase, the output of the best match for the input vector $\vec{x}$ among the memorized examples is used as output. An extension of this approach is the **$k$-nearest neighbor** approach, where not only one nearest neighbor from the set of memorized examples is considered, but $k$. For discrete-valued outputs, one takes a vote among the $k$ nearest neighbors, and for the continuous case, one simply takes the mean. An even more sophisticated approach is **distance-weighted $k$-nearest neighbors**, where each neighbor's contribution is weighted by its distance to the input, so that closer neighbors count more.
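
The three variants for discrete outputs can be sketched in a few lines. The function name `knn_predict`, the inverse-distance weighting and the toy data are assumptions chosen for illustration; other weighting schemes are possible.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, weighted=False):
    """Classify x by a (optionally distance-weighted) vote of its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every example
    nearest = np.argsort(dists)[:k]               # indices of the k closest examples
    if weighted:
        # Assumed weighting: inverse distance, so closer neighbors count more.
        weights = 1.0 / (dists[nearest] + 1e-12)
    else:
        weights = np.ones(len(nearest))
    labels = y_train[nearest]
    classes = np.unique(labels)
    votes = [weights[labels == c].sum() for c in classes]
    return classes[int(np.argmax(votes))]

# Toy data: one example of class 0 and two examples of class 1.
X_train = np.array([[0.0, 0.0], [2.0, 0.0], [2.2, 0.0]])
y_train = np.array([0, 1, 1])
query = np.array([0.5, 0.0])

nn_label = knn_predict(X_train, y_train, query, k=1)
knn_label = knn_predict(X_train, y_train, query, k=3)
weighted_label = knn_predict(X_train, y_train, query, k=3, weighted=True)
print(nn_label, knn_label, weighted_label)
```

In this toy example the plain vote with $k=3$ is decided by the two more distant examples of class 1, whereas the distance-weighted vote follows the single close example of class 0.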

## Recap 8: Classification [3 Points]

### a) Classifier

What is a classifier? What is the relation to a concept?

A **classifier** assigns a discrete class to an object based on attributes:  
e.g. assign class *dry* to a *towel*

A **concept** can be represented by a boolean function which assigns true to the appropriate entities:  
e.g. car(thisChair) = $false$

### b) Comparison of classifiers

Name three different classifiers and compare them. Think about biases and assumptions, separatrices, sensitivity, locality, parameters and speed. 

| classifier | biases (assumptions) | separatrices | sensitivity | locality | parameters | speed |
|---|---|---|---|---|---|---|
| Euclidean classifier | **no idea** | linear | sensitive to far outliers | not local | - | very fast |
| Quadratic classifier | **no idea** | conic section | **no idea** | not local | - | fast |
| Nearest neighbor classifier | neighbors should have same / similar classification | implicitly defined by neighbors | **no idea** | local | - | depends on size |
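
To make the first row of the table more concrete, here is a minimal sketch. It assumes that "Euclidean classifier" means assigning an input to the class whose mean is closest in Euclidean distance; the function names and the toy data are hypothetical.

```python
import numpy as np

def fit_class_means(X, y):
    """Compute one mean vector per class (this is all the 'training' needed)."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, means

def euclidean_classify(x, classes, means):
    """Assign x to the class whose mean is closest in Euclidean distance."""
    return classes[np.argmin(np.linalg.norm(means - x, axis=1))]

# Toy data with two classes.
X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 4.0], [5.0, 4.0]])
y = np.array([0, 0, 1, 1])

classes, means = fit_class_means(X, y)
label_a = euclidean_classify(np.array([1.0, 1.0]), classes, means)
label_b = euclidean_classify(np.array([4.0, 3.0]), classes, means)
print(label_a, label_b)
```

Under this assumption, the boundary between two classes is the set of points with equal distance to both means, which is a hyperplane. A single far outlier shifts its class mean and thereby the whole boundary, which matches the sensitivity entry in the table.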

### c) SVM

What is a support vector? How does the kernel trick work?

The support vectors are the training examples closest to the class boundary; they alone determine the position of the separating hyperplane.

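The kernel trick is used to solve non-linear problems:
- the data is (implicitly) mapped into a higher-dimensional feature space
- in a sufficiently high-dimensional space, the classes can become linearly separable by a hyperplane
- the projection of this hyperplane back to the original data space is a non-linear separatrix
- the mapping is never computed explicitly: the SVM only needs inner products between examples, and a kernel function evaluates these inner products directly in the original data space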
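
The following sketch illustrates why a kernel can replace the explicit projection. It uses the homogeneous polynomial kernel of degree 2 on two-dimensional inputs as an assumed example; the names `phi` and `poly_kernel` are chosen here for illustration.

```python
import numpy as np

def phi(x):
    """Explicit feature map into 3D for the degree-2 polynomial kernel."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, z):
    """Kernel k(x, z) = (x . z)^2, evaluated in the original 2D space."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)      # inner product after projecting both points
implicit = poly_kernel(x, z)    # same value without ever computing phi
print(explicit, implicit)
```

Both computations yield the same number, so an algorithm that only needs inner products can work in the higher-dimensional space while only ever evaluating the kernel on the original data.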