# DS4DS Homework Exercise Sheet 7

**General Instructions:**

- Collaboration between students during the problem-solving phase is OK on a discussion basis
- However, each student must write and submit their own code
- Code sharing is strictly prohibited
- We will run checks for shared code, general plagiarism and AI-generated solutions
- Any attempt at fraud will lead to an automatic fail of the entire course
- Do not use any additional packages except for those provided in the task templates
- Please use Julia Version 1.10.x to ensure compatibility
- Please only write between the `#--- YOUR CODE STARTS HERE ---#` and `#--- YOUR CODE ENDS HERE ---#` comments
- Please do not add, delete, or overwrite any cells other than the solution cells (**Tip:** If you use a JupyterHub IDE, adding, deleting, and editing non-solution cells should be disabled by default)

In [1]:
using LinearAlgebra
using Plots
using StatsBase
using DelimitedFiles
using BenchmarkTools
using Optim

### Task 1: Non-linear function optimization from scratch - (3.5 points)

In this task, we will implement our own non-linear optimization algorithm and apply it to the 2D Rosenbrock function. The function is defined by

$$
\begin{align}
    \mathcal{L}(\mathbf{w}) = (a - w_1)^2 + b (w_2 - w_1^2)^2
\end{align}
$$

where $a$ and $b$ are parameters of the function. In the following, the parameters will be set to $a = 1$ and $b = 100$.

#### a) **- (0.5 points)**
Implement the Rosenbrock function so that it takes a vector $\mathbf{w}$ as input and returns the scalar cost function value $\mathcal{L}(\mathbf{w})$.

In [2]:
function rosenbrock(w; a=1, b=100)
    #--- YOUR CODE STARTS HERE ---#
    f_value = (a - w[1])^2 + b * (w[2] - w[1]^2)^2
    #--- YOUR CODE ENDS HERE ---#
    return f_value
end

rosenbrock (generic function with 1 method)

In [3]:
@assert isa(rosenbrock, Function)

w = [2, 1]
@assert isa(rosenbrock(w), Number)


In [4]:
# Please leave this cell as it is


#### b) **- (0.5 points)**
Derive the analytical gradient of the Rosenbrock function and implement it in the function outlined below so that it takes $\mathbf{w}$ as input and returns the gradient $\nabla\mathcal{L}(\mathbf{w})$ of the Rosenbrock function. 
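
For reference, differentiating the definition of $\mathcal{L}(\mathbf{w})$ term by term gives the following partial derivatives (a worked sketch that is worth checking against your own derivation):

$$
\begin{align}
    \frac{\partial \mathcal{L}}{\partial w_1} &= -2 (a - w_1) - 4 b \, w_1 (w_2 - w_1^2), \\
    \frac{\partial \mathcal{L}}{\partial w_2} &= 2 b \, (w_2 - w_1^2).
\end{align}
$$

Both components vanish at $\mathbf{w} = \begin{bmatrix} a & a^2 \end{bmatrix}^\top$, where $\mathcal{L}(\mathbf{w}) = 0$.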

In [5]:
function rosenbrock_gradient(w; a=1, b=100)
    #--- YOUR CODE STARTS HERE ---#
    dw1 = 4 * b * w[1]^3 + 2 * w[1] - 4 * b * w[2] * w[1] - 2 * a
    dw2 = 2 * b * (w[2] - w[1]^2)
    gradient = [dw1, dw2]
    #--- YOUR CODE ENDS HERE ---#
    return gradient
end

rosenbrock_gradient (generic function with 1 method)
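
An analytical gradient can be sanity-checked against a central finite-difference approximation. The sketch below restates the Rosenbrock function and its gradient so that it runs on its own; the helper `finite_difference_gradient`, the step `delta`, and the test point `w_test` are hypothetical names chosen for illustration and are not part of the task template.

```julia
using LinearAlgebra

# Restated here so the snippet is self-contained (same formulas as in a) and b)).
rosenbrock(w; a=1, b=100) = (a - w[1])^2 + b * (w[2] - w[1]^2)^2

function rosenbrock_gradient(w; a=1, b=100)
    dw1 = -2 * (a - w[1]) - 4 * b * w[1] * (w[2] - w[1]^2)
    dw2 = 2 * b * (w[2] - w[1]^2)
    return [dw1, dw2]
end

# Hypothetical helper: central-difference approximation of the gradient of f at w.
function finite_difference_gradient(f, w; delta=1e-6)
    g = zeros(length(w))
    for i in eachindex(w)
        e = zeros(length(w))
        e[i] = delta                      # perturb only the i-th coordinate
        g[i] = (f(w .+ e) - f(w .- e)) / (2 * delta)
    end
    return g
end

w_test = [2.0, 1.0]
fd_error = norm(finite_difference_gradient(rosenbrock, w_test) - rosenbrock_gradient(w_test))
println("deviation between analytical and finite-difference gradient: ", fd_error)
```

A deviation that is many orders of magnitude smaller than the gradient itself suggests that the derivation and the implementation agree.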

In [6]:
@assert isa(rosenbrock_gradient, Function)

point = [2, 1]
@assert size(rosenbrock_gradient(point)) == (2,)


In [7]:
# Please leave this cell as it is


#### c) **- (0.5 points)**
Implement the steepest gradient descent rule for updating a parameter vector $\mathbf{w}_{current}$ to $\mathbf{w}_{next}$ in the function provided below. The step size $\eta$ is a parameter of this function. 

In [8]:
function steepest_GD_parameter_update(gradient_function, w_current, eta)
    #--- YOUR CODE STARTS HERE ---#
    Δw = gradient_function(w_current)
    w_next = w_current - eta * Δw
    #--- YOUR CODE ENDS HERE ---#
    return w_next
end

steepest_GD_parameter_update (generic function with 1 method)
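
Written out, the update is a single step along the negative gradient. As a worked example, take $\mathbf{w}_{current} = \begin{bmatrix} 2 & 1 \end{bmatrix}^\top$ and $\eta = 0.1$ with $a = 1$, $b = 100$:

$$
\begin{align}
    \mathbf{w}_{next} &= \mathbf{w}_{current} - \eta \, \nabla\mathcal{L}(\mathbf{w}_{current}), \\
    \nabla\mathcal{L}\left(\begin{bmatrix} 2 \\ 1 \end{bmatrix}\right) &= \begin{bmatrix} 2402 \\ -600 \end{bmatrix}, \\
    \mathbf{w}_{next} &= \begin{bmatrix} 2 \\ 1 \end{bmatrix} - 0.1 \begin{bmatrix} 2402 \\ -600 \end{bmatrix} = \begin{bmatrix} -238.2 \\ 61 \end{bmatrix}.
\end{align}
$$

This single step moves far away from the starting point and increases the cost well beyond its initial value of $\mathcal{L}(\mathbf{w}_{current}) = 901$, which illustrates how sensitive the Rosenbrock function is to the choice of $\eta$.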

In [9]:
@assert isa(steepest_GD_parameter_update, Function)

point = [2, 1]
@assert size(steepest_GD_parameter_update(rosenbrock_gradient, point, 0.1)) == (2,)


In [10]:
# Please leave this cell as it is


#### d) **- (1 point)**
Implement the complete steepest descent method in the template outlined below, using the functions from the previous subtasks.

In [11]:
function steepest_GD(f, grad_f, w_init, eta, max_iters=1000, tolerance=1e-5)
    """Steepest gradient descent implementation.

    Args:
        f: The function to be optimized
        grad_f: The analytical gradient of the function to be optimized
        w_init: The initial condition at which the optimization algorithm starts
        eta: The learning rate (here, the value is to be a constant during optimization)
        max_iters: A cap for the count of iterations
        tolerance: the tolerance value when the algorithm is terminated because the norm of the difference between
            the old and new point is too small (note that both max_iters and the tolerance cap can stop the descent,
            i.e., you need to check for both conditions)

    Returns:
        w_opt: The end value of the optimization

    """

    #--- YOUR CODE STARTS HERE ---#
    w_old = w_init; 
    w_new = w_init;
    for iter = 1:max_iters
        w_new = steepest_GD_parameter_update(grad_f, w_old, eta)
        if norm(w_new - w_old) <= tolerance
            break
        end
        w_old = w_new
    end
    w_opt = w_new
    #--- YOUR CODE ENDS HERE ---#

    return w_opt

end

steepest_GD (generic function with 3 methods)

In [12]:
@assert size(steepest_GD(rosenbrock, rosenbrock_gradient, [2, 1], 0.001)) == (2,)

In [13]:
# Please leave this cell as it is


#### e) **- (1 point)**
Add a backtracking mechanism for selecting a step size that satisfies the Armijo condition: start with `eta_init = 1` and halve the step size until the Armijo condition is satisfied.
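
For a pure gradient step, the Armijo (sufficient decrease) condition can be written as shown below. The symbols $c \in (0, 1)$ for the sufficient-decrease constant and $\rho \in (0, 1)$ for the shrink factor are notation introduced here for illustration only; the parameter names in the template may map to them differently.

$$
\begin{align}
    \mathcal{L}\big(\mathbf{w} - \eta \, \nabla\mathcal{L}(\mathbf{w})\big) \leq \mathcal{L}(\mathbf{w}) - c \, \eta \, \left\| \nabla\mathcal{L}(\mathbf{w}) \right\|_2^2
\end{align}
$$

Backtracking starts at $\eta = \eta_{init}$ and repeatedly sets $\eta \leftarrow \rho \, \eta$ until the inequality holds; halving the step size corresponds to $\rho = 1/2$.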

In [14]:
function gradient_descent_backtracking(f, grad_f, w_init, eta_init=1, max_iters=1000, tolerance=1e-5)

    #--- YOUR CODE STARTS HERE ---#
    w_current = w_init
    w_next = w_init
    for i = 1:max_iters
        # pick a step size that satisfies the Armijo condition (defaults: alpha=0.3, beta=0.5)
        eta_current = backtracking_line_search(f, grad_f, w_current, eta_init)
        w_next = w_current - eta_current * grad_f(w_current)
        if norm(w_next - w_current) <= tolerance
            break
        end
        w_current = w_next
    end
    w_opt = w_next
    #--- YOUR CODE ENDS HERE ---#
    return w_opt
end

function backtracking_line_search(f, grad_f, w_current, eta_init, alpha=0.3, beta=0.5)
    #--- YOUR CODE STARTS HERE ---#
    # alpha: sufficient-decrease constant of the Armijo condition
    # beta: factor by which the step size shrinks (0.5 halves it)
    eta_current = eta_init
    current_val = f(w_current)
    g = grad_f(w_current)  # evaluate the gradient once and reuse it
    while f(w_current - eta_current * g) > current_val - alpha * eta_current * dot(g, g)
        eta_current = eta_current * beta
    end
    eta = eta_current
    #--- YOUR CODE ENDS HERE ---#
    return eta
end

backtracking_line_search (generic function with 3 methods)

In [15]:
@assert isa(backtracking_line_search, Function)
@assert isa(gradient_descent_backtracking, Function)

@assert isa(backtracking_line_search(rosenbrock, rosenbrock_gradient, [2, 2], 0.1), Number) "eta needs to be a scalar."
@assert size(gradient_descent_backtracking(rosenbrock, rosenbrock_gradient, [2, 1])) == (2,) "The gradient has 2 elements (There are two parameters)"

### Task 2: Fitting a surrogate model to geodata - (6.5 points)

In this task, we are given geodata of north-western Germany and parts of adjacent European countries.
The dataset (obtained from [here](https://gdz.bkg.bund.de/index.php/default/digitale-geodaten/geodaetische-basisdaten/quasigeoid-der-bundesrepublik-deutschland-quasigeoid.html))
consists of a large number of points, each with two position coordinates (longitude and latitude; $\mathbf{z} \in \mathbb{R}^2$) and an associated height value ($y \in \mathbb{R}$).

<img src="./approximate_map_area.png" alt="height_map" width="400"/>  <img src="./surface_top.svg" alt="data_height_map" width="400"/> <img src="./surface_front.svg" alt="data_height_map" width="400"/>

Source for topographic map (left-most image): https://de-de.topographic-map.com/map-95z57/Deutschland/

(There are some extrapolation artifacts at the edges of the plots; these are not present in the data itself.)

---

Your task is to fit a set of $q$ (slightly modified) radial basis functions (RBFs) to the data

$$
\begin{align}
    h(\mathbf{z}, \mathbf{w}) &= \sum_{i=1}^q \mathrm{RBF}(\mathbf{z}, \mathbf{w}_i) \\
    &=  \sum_{i=1}^q w_{i, 1} \cdot \exp\left( - \exp(w_{i, 2}) \cdot \left\| \begin{bmatrix} z_{1} \\ z_{2} \end{bmatrix} - \begin{bmatrix} w_{i, 3} \\ w_{i, 4} \end{bmatrix} \right\|_2^2\right),
\end{align}
$$

parameterized by $q$ parameter vectors $\mathbf{w}_i = \begin{bmatrix} w_{i, 1} \\ w_{i, 2} \\ w_{i, 3} \\ w_{i, 4} \end{bmatrix}$ for $i \in \{1,\dots,q\}$, which are stacked into the total (flattened) parameter vector $\mathbf{w}$

$$
\begin{align}
    \mathbf{w} = \begin{bmatrix} w_{1, 1} \\ w_{1, 2} \\ w_{1, 3} \\ w_{1, 4} \\ \vdots \\ w_{q, 1} \\ w_{q, 2} \\ w_{q, 3} \\ w_{q, 4} \end{bmatrix}.
\end{align}
$$
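
With this layout, the parameters of the $i$-th RBF occupy four consecutive entries of $\mathbf{w}$. The sketch below shows two equivalent ways of extracting such a block; `w_flat`, `block`, and `W` are hypothetical names used only for illustration.

```julia
q = 3

# Hypothetical flat parameter vector: entries 1..4 belong to RBF 1, 5..8 to RBF 2, and so on.
w_flat = collect(1.0:(4 * q))

# Parameters of the i-th RBF via index arithmetic
block(w, i) = w[(4 * (i - 1) + 1):(4 * i)]

# Equivalent view: column i of a 4 x q matrix (Julia arrays are stored column-major)
W = reshape(w_flat, 4, q)

println(block(w_flat, 2))  # the four parameters of the second RBF
println(W[:, 2])           # the same four values
```

The `reshape` variant does not copy the data, so it can be convenient when all blocks are processed in a loop over the columns.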

##### **a)** - (0.5 points)
Implement a single $\mathrm{RBF}(\mathbf{z}, \mathbf{w}_i)$ that takes $\mathbf{z}$ and the parameter vector $\mathbf{w}_i$ with four elements as input.

In [16]:
function RBF(z, w_i)

    #--- YOUR CODE STARTS HERE ---#
    f_value = w_i[1] * exp(-exp(w_i[2]) * norm(z - w_i[3:4], 2)^2)
    #--- YOUR CODE ENDS HERE ---#

    return f_value
end

RBF (generic function with 1 method)

In [17]:
@assert isa(RBF, Function)
@assert isa(RBF([0, 0], [1, 0.5, -0.1, 0.2]), Number) "The output of the RBF needs to be a scalar number."


In [18]:
# Please leave this cell as it is


##### **b)** - (0.5 points)
Create a model $h(\mathbf{z}, \mathbf{w})$ that consists of a sum of $q$ RBF functions parameterized by a vector $\mathbf{w}$ with $4 \cdot q$ elements.

In [19]:
function h(z, w, q)

    @assert length(w) == q * 4 "Invalid number of parameters."

    #--- YOUR CODE STARTS HERE ---#
    f_value = 0
    for i = 0:(q-1)
        indices = [1, 2, 3, 4] .+ i*4
        f_value += RBF(z, w[indices])
    end
    #--- YOUR CODE ENDS HERE ---#

    return f_value
end

h (generic function with 1 method)

In [20]:
@assert isa(h, Function)
@assert isa(h([0, 0], [1, 0.5, 0.2, -0.1, 1, 0.5, 0.2, -0.1,], 2), Number) "The output of the model needs to be a scalar number."


In [21]:
# Please leave this cell as it is


##### **c)** - (1.5 points)
Provide a function for the analytical gradient of a single RBF w.r.t. its parameters, i.e., $\nabla_{\mathbf{w}_i} \mathrm{RBF}(\mathbf{z}, \mathbf{w}_i)$.
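
With the abbreviations $s = \left\| \mathbf{z} - \begin{bmatrix} w_{i, 3} & w_{i, 4} \end{bmatrix}^\top \right\|_2^2$ and $E = \exp\left(-\exp(w_{i, 2}) \, s\right)$ (both introduced here only for brevity), applying the chain rule to the RBF definition above yields:

$$
\begin{align}
    \frac{\partial \, \mathrm{RBF}}{\partial w_{i, 1}} &= E, \\
    \frac{\partial \, \mathrm{RBF}}{\partial w_{i, 2}} &= - w_{i, 1} \, \exp(w_{i, 2}) \, s \, E, \\
    \frac{\partial \, \mathrm{RBF}}{\partial w_{i, 3}} &= 2 \, w_{i, 1} \, \exp(w_{i, 2}) \, (z_1 - w_{i, 3}) \, E, \\
    \frac{\partial \, \mathrm{RBF}}{\partial w_{i, 4}} &= 2 \, w_{i, 1} \, \exp(w_{i, 2}) \, (z_2 - w_{i, 4}) \, E.
\end{align}
$$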

In [22]:
function gradient_RBF(z, w_i)

    #--- YOUR CODE STARTS HERE ---#
    # use the function argument w_i throughout (not a global variable w)
    square_norm = norm(z - w_i[3:4], 2)^2
    exp_term = exp(w_i[2] - exp(w_i[2]) * square_norm)
    dw1 = exp(-exp(w_i[2]) * square_norm)
    dw2 = w_i[1] * square_norm * (-exp_term)
    dw3 = 2 * w_i[1] * (z[1] - w_i[3]) * exp_term
    dw4 = 2 * w_i[1] * (z[2] - w_i[4]) * exp_term
    gradient = [dw1, dw2, dw3, dw4]
    #--- YOUR CODE ENDS HERE ---#

    return gradient
end

gradient_RBF (generic function with 1 method)

In [40]:
@assert isa(gradient_RBF, Function)
@assert size(gradient_RBF([0.6, 0], [1, 0.5, 0.2, 0.1])) == (4,) "The gradient is a vector of 4 elements, as there are 4 parameters for the RBF"


In [24]:
# Please leave this cell as it is


##### **d)** - (1 point)
Provide a function for the gradient of the sum of $q$ RBF functions 
w.r.t. the weight vector $\mathbf{w}$, i.e., $\nabla_{\mathbf{w}} h(\mathbf{z}, \mathbf{w})$.

In [25]:
function gradient_h(z, w, q)

    #--- YOUR CODE STARTS HERE ---#
    gradient = zeros(4*q)
    for i = 0:(q-1)
        indices = [1, 2, 3, 4] .+ i*4
        gradient[indices] = gradient_RBF(z, w[indices])
    end
    #--- YOUR CODE ENDS HERE ---#

    return gradient
end

gradient_h (generic function with 1 method)
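
Because each parameter block $\mathbf{w}_i$ only enters its own RBF, the gradient of $h$ is the concatenation of the $q$ single-RBF gradients. The sketch below checks this numerically with a central finite difference. It restates compact versions of the model functions so that it runs on its own; `finite_difference_gradient`, `delta`, and the `*_demo` values are hypothetical and chosen only for illustration.

```julia
using LinearAlgebra

# Compact restatements of the model functions from a) to d).
RBF(z, w_i) = w_i[1] * exp(-exp(w_i[2]) * norm(z - w_i[3:4])^2)

function gradient_RBF(z, w_i)
    s = norm(z - w_i[3:4])^2
    e = exp(w_i[2] - exp(w_i[2]) * s)
    d1 = exp(-exp(w_i[2]) * s)
    d2 = -w_i[1] * s * e
    d3 = 2 * w_i[1] * (z[1] - w_i[3]) * e
    d4 = 2 * w_i[1] * (z[2] - w_i[4]) * e
    return [d1, d2, d3, d4]
end

block(w, i) = w[(4 * (i - 1) + 1):(4 * i)]
h(z, w, q) = sum(RBF(z, block(w, i)) for i in 1:q)
gradient_h(z, w, q) = vcat([gradient_RBF(z, block(w, i)) for i in 1:q]...)

# Hypothetical helper: central-difference approximation of the gradient of f at w.
function finite_difference_gradient(f, w; delta=1e-6)
    g = zeros(length(w))
    for i in eachindex(w)
        e = zeros(length(w))
        e[i] = delta
        g[i] = (f(w + e) - f(w - e)) / (2 * delta)
    end
    return g
end

q_demo = 2
z_demo = [0.3, -0.2]
w_demo = [1.0, 0.5, 0.7, 1.0, 0.5, 0.7, 0.2, 0.1]

fd = finite_difference_gradient(w -> h(z_demo, w, q_demo), w_demo)
err = norm(fd - gradient_h(z_demo, w_demo, q_demo))
println("deviation between analytical and finite-difference gradient: ", err)
```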

In [26]:
@assert isa(gradient_h, Function)

q = 2
@assert size(gradient_h([-1, -1], [1, 0.5, 0.7, 1, 0.5, 0.7, 0.2, 0.1], q)) == (4 * q,) "The gradient is a vector of 4 * q elements, as there are 4 * q parameters for the RBF"



In [27]:
# Please leave this cell as it is


##### **e)** - (1 point)
In order to assess the model quality, we want to compute the mean squared error (MSE) between the data $y[k]$ and the prediction made by the trained model $y_{est}[k] = h(\mathbf{z}[k], \mathbf{w})$. Implement the MSE in the template below. $\mathbf{y}$ is the vector containing all elements of $y[k]$, i.e.,

$$
\begin{align}
    \mathbf{y} = \begin{bmatrix} y[1] \\ y[2] \\ \vdots \\ y[N] \end{bmatrix}
\end{align}
$$


and $\mathbf{Z}$ is a matrix whose rows are the positions $\mathbf{z}[k]^\top$, i.e.,

$$
\begin{align}
    \mathbf{Z} = \begin{bmatrix} \mathbf{z}[1]^\top \\ \mathbf{z}[2]^\top \\ \vdots \\ \mathbf{z}[N]^\top \end{bmatrix}.
\end{align}
$$

In [28]:
function MSE_loss(y, Z, w, q)

    #--- YOUR CODE STARTS HERE ---#
    mse = 0
    N = length(y)
    for i = 1:N
        y_est = h(Z[i,:], w, q)
        mse += (y[i] - y_est)^2
    end
    mse /= N
    return mse
    #--- YOUR CODE ENDS HERE ---#

end

MSE_loss (generic function with 1 method)

In [29]:
@assert isa(MSE_loss, Function)

q = 10
@assert isa(MSE_loss([1, 2, 3], [1 2; 3 4; 5 6], ones(4 * q), q), Number) "The loss is a scalar value"


In [30]:
# Please leave this cell as it is


##### **f)** - (1 point)
Provide a function that determines the gradient of the MSE loss w.r.t. the parameter vector $\mathbf{w}$, i.e., $\nabla_{\mathbf{w}} \mathrm{MSE}(\mathbf{y}, \mathbf{Z}, \mathbf{w})$. You may reuse the functions you introduced in previous subtasks.
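
Writing the MSE from e) as a sum over all $N$ data points and applying the chain rule expresses its gradient in terms of the model gradient from d):

$$
\begin{align}
    \mathrm{MSE}(\mathbf{y}, \mathbf{Z}, \mathbf{w}) &= \frac{1}{N} \sum_{k=1}^N \big( y[k] - h(\mathbf{z}[k], \mathbf{w}) \big)^2, \\
    \nabla_{\mathbf{w}} \mathrm{MSE}(\mathbf{y}, \mathbf{Z}, \mathbf{w}) &= -\frac{2}{N} \sum_{k=1}^N \big( y[k] - h(\mathbf{z}[k], \mathbf{w}) \big) \, \nabla_{\mathbf{w}} h(\mathbf{z}[k], \mathbf{w}).
\end{align}
$$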

In [31]:
function gradient_MSE_loss(y, Z, w, q)

    #--- YOUR CODE STARTS HERE ---#
    gradient = zeros(4*q)
    N = length(y)
    for i = 1:N
        y_est = h(Z[i,:], w, q)
        gradient += -2 * (y[i] - y_est) * gradient_h(Z[i,:], w, q) 
    end
    gradient /= N
    #--- YOUR CODE ENDS HERE ---#

    return gradient
end

gradient_MSE_loss (generic function with 1 method)

In [32]:
@assert isa(gradient_MSE_loss, Function)

q = 2
w = ones(q * 4)
@assert size(gradient_MSE_loss([1, 2, 3], [1 2; 3 4; 5 6], w, q)) == (q * 4,)


In [33]:
# Please leave this cell as it is


##### Loading and preparing the dataset

In [34]:
function load_dataset()
    ### load and prepare the data set
    X = readdlm("./GCG2016_WE.txt", Float64)
    N = size(X, 1)

    # find valid data points and store them in the 
    # arrays for position (Z) and height (y)
    Z = zeros(N, 2)
    y = zeros(N)
    s = 0
    for i in 1:N
        if X[i, 3] < 1e3
            s += 1
            Z[s, :] = X[i, 1:2]
            y[s] = X[i, 3]
        end
    end
    Z = Z[1:50:s, :]
    y = y[1:50:s]
    N = ceil(s / 50)

    print("Size of the final data set: ")
    println(N)

    # display(Plots.surface(Z[:, 2], Z[:, 1], y, camera=(135, 60), size=(600,600)))
    # savefig("surface_front.svg")

    # display(Plots.surface(Z[:, 2], Z[:, 1], y, camera=(0, 90), size=(600,600)))
    # savefig("surface_top.svg")

    function min_max_normalization(a)
        max = maximum(a)
        min = minimum(a)

        return (a .- min) ./ (max - min) .* 2 .- 1
    end

    y = min_max_normalization(y)

    Z_1 = min_max_normalization(Z[:, 1])
    Z_2 = min_max_normalization(Z[:, 2])

    Z = hcat(Z_1, Z_2)
    return Z, y, N
end

Z, y, N = load_dataset();


Size of the final data set: 2832.0


([1.0 -0.6243093922651936; 1.0 -0.34806629834254166; … ; -1.0 0.635359116022099; -1.0 0.9116022099447509], [-0.8614245416078978, -0.8212858486130691, -0.7490009402914909, -0.8339797837329559, -0.934297132110953, -1.0, -0.8538434414668551, -0.8028913963328633, -0.7430359661495052, -0.8323930418429716  …  0.8774094969440525, 0.8396803008932769, 0.7182357780912085, 0.9496062529384104, 0.8338034790785134, 0.7937823225199816, 0.7690115185707558, 0.9274212505876822, 0.8286612599905967, 0.7492360131640807], 2832.0)

Code for the plotting of the data and your results is provided here in Markdown. 
However, please **remove any plotting code from your notebook before submission**, as this may cause issues for the grading. Thank you!

~~~
display(Plots.plot(log.(loss_values), xlabel="iterations", ylabel="log(MSE)", legend=false))

y_init = [h(Z[k, :], w0, q) for k in 1:size(Z, 1)]
display(Plots.surface(Z[:, 2], Z[:, 1], y_init, camera=(135, 60), title="initial guess"))

display(Plots.surface(Z[:, 2], Z[:, 1], y, camera=(135, 60), title="data set"))

y_est = [h(Z[k, :], w_opt, q) for k in 1:size(Z, 1)]
display(Plots.surface(Z[:, 2], Z[:, 1], y_est, camera=(135, 60), title="result"))
~~~



##### **g)** - (1 point)
Use the standard gradient descent algorithm to fit a model with $q=100$ RBFs to the entire data set (use your own implementation; you are not allowed to use the `Optim` package in this task). Run for $200$ iterations (without early termination) with step size $\eta=3$ and use the given initial conditions $\mathbf{w}_0$ to initialize your algorithm.

Record the MSE between the RBF model and the data set in each iteration of the algorithm (exactly $201$ loss values, where the first one is from the initial guess and the last one is from after the last iteration).

In [35]:
function compute_initial_position(q)
    x_ = range(-1, stop=1, length=Integer(sqrt(q)))
    y_ = range(-1, stop=1, length=Integer(sqrt(q)))
    X = repeat(x_, Integer(sqrt(q)))[:]
    Y = repeat(y_', Integer(sqrt(q)))[:]
    gridPoints = [X Y]'

    w0 = [ones(q) * 0.05 ones(q) * 3.4 gridPoints']
    w0 = vcat(w0'...)
    return w0
end

compute_initial_position (generic function with 1 method)

In [38]:
q = 100
w0 = compute_initial_position(q)
Z, y, N = load_dataset();
max_n_iterations = 200
eta = 3
loss_values = zeros(max_n_iterations + 1)  # fill this with your MSE loss values

#--- YOUR CODE STARTS HERE ---#
loss_values[1] = MSE_loss(y, Z, w0, q)
w_current = w0
w_next = w0
for iter = 1:max_n_iterations
    w_next = w_current - eta * gradient_MSE_loss(y, Z, w_current, q)
    loss_values[iter+1] = MSE_loss(y, Z, w_next, q)
    w_current = w_next
end
#--- YOUR CODE ENDS HERE ---#

Size of the final data set: 2832.0


In [None]:
# Please leave this cell as it is


##### Food for thought (no credit): 
What is the influence of the parameter $w_{i, 2}$, and why would one use $\exp(w_{i, 2})$ instead of simply $w_{i, 2}$ in the RBF?