### Model

We construct the kernel matrix for the most trivial case: a model with two input nodes, two nodes in a single hidden layer, and a single output node.

![Alt text](NN_simple.png)



Let's break down the parameters for each dense layer:

**First Dense layer:**
- Input units: 2
- Output units: 2
- Parameters: $$(2 \text{ inputs} \times 2 \text{ outputs}) \text{ weights} + 2 \text{ biases} = 6 \text{ parameters}$$

**Second Dense layer:**
- Input units: 2 (output units from the previous layer)
- Output units: 1
- Parameters: $$(2 \text{ inputs} \times 1 \text{ output}) \text{ weights} + 1 \text{ bias} = 3 \text{ parameters}$$
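
The counts above can be reproduced with a short sketch. The helper name `dense_param_count` is hypothetical and only illustrates the rule "one weight per input/output pair plus one bias per output"; it is not part of Flux.

```julia
# Parameters of a dense layer: one weight per (input, output) pair
# plus one bias per output unit.
dense_param_count(n_in, n_out) = n_in * n_out + n_out

# The two dense layers of the model above, written as input => output sizes
layer_sizes = [2 => 2, 2 => 1]
counts = [dense_param_count(p.first, p.second) for p in layer_sizes]
total = sum(counts)

println(counts)  # [6, 3]
println(total)   # 9
```

The total of 9 is the length of the parameter vector $\theta$ over which the kernel sums run.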

### Kernel Matrix

We want to define the function $f(x_i;\theta)$. Since we have listed all the parameters and neither layer applies an activation function, this is straightforward:

$$f(x_i;\theta)=W_2\left[W_1 x_i + B_1\right]+B_2$$

with $\theta = \{W_1,B_1,W_2,B_2\}$.

The kernel matrix is defined as
$$K=\begin{bmatrix}\sum_k \frac{\partial f(x_1;\theta)}{\partial \theta_k}\frac{\partial f(x_1;\theta)}{\partial \theta_k} & \ldots & \sum_k \frac{\partial f(x_1;\theta)}{\partial \theta_k}\frac{\partial f(x_m;\theta)}{\partial \theta_k} \\ \vdots & \ddots & \vdots \\ \sum_k \frac{\partial f(x_n;\theta)}{\partial \theta_k}\frac{\partial f(x_1;\theta)}{\partial\theta_k} & \ldots & \sum_k \frac{\partial f(x_n;\theta)}{\partial \theta_k} \frac{\partial f(x_m;\theta)}{\partial \theta_k} \end{bmatrix}$$
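
Each entry of $K$ is a dot product between two gradient vectors, so the matrix can be assembled from the flattened gradients alone. The sketch below assumes the square case, where rows and columns index the same data points. The helper name `kernel_matrix` and the toy gradient values are made up for illustration.

```julia
using LinearAlgebra

# Hypothetical helper: grads[i] holds the gradient of f(x_i; θ) with respect
# to all parameters, flattened into a single vector.
function kernel_matrix(grads::Vector{<:AbstractVector})
    n = length(grads)
    K = zeros(n, n)
    for i in 1:n, j in 1:n
        # Σ_k ∂f(x_i;θ)/∂θ_k · ∂f(x_j;θ)/∂θ_k
        K[i, j] = dot(grads[i], grads[j])
    end
    return K
end

# Toy gradients (made-up numbers, only to show the shape of the result)
grads = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
K = kernel_matrix(grads)
```

Because the dot product is symmetric, so is the resulting matrix in this square case.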

We calculate the partial derivatives $\frac{\partial f(x_i;\theta)}{\partial \theta_k}$ with respect to all parameters:

$$ \frac{\partial f(x_i;\theta)}{\partial W_1} = W_2^Tx_i^T\text{ ,  } \frac{\partial f(x_i;\theta)}{\partial B_1} = W_2^T \text{ ,  }\frac{\partial f(x_i;\theta)}{\partial W_2} = [W_1x_i+B_1]^T \text{ ,  }\frac{\partial f(x_i;\theta)}{\partial B_2} = 1 $$
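
These formulas can be sanity-checked without Flux. The sketch below implements the forward pass and the analytic derivatives in plain Julia, then compares one entry of $\partial f/\partial W_1$ with a finite-difference estimate. The parameter values are arbitrary, and the function names `f` and `analytic_grads` are assumptions made for this example.

```julia
using LinearAlgebra

# Forward pass of the two-layer linear model; returns the scalar output
f(x, W1, B1, W2, B2) = (W2 * (W1 * x + B1))[1] + B2

# Analytic partial derivatives, following the formulas above
function analytic_grads(x, W1, B1, W2, B2)
    dW1 = vec(W2) * x'        # W2ᵀ xᵀ as an outer product, 2×2
    dB1 = vec(W2)             # W2ᵀ, a 2-vector
    dW2 = (W1 * x + B1)'      # [W1 x + B1]ᵀ, 1×2
    dB2 = 1.0
    return dW1, dB1, dW2, dB2
end

# Arbitrary illustrative values
x  = [1.0, 2.0]
W1 = [0.5 -1.0; 2.0 0.25]
B1 = [0.1, -0.3]
W2 = [1.5 -0.5]
B2 = 0.2

dW1, dB1, dW2, dB2 = analytic_grads(x, W1, B1, W2, B2)

# Finite-difference estimate of ∂f/∂W1[1, 2]
h = 1e-6
W1p = copy(W1)
W1p[1, 2] += h
fd = (f(x, W1p, B1, W2, B2) - f(x, W1, B1, W2, B2)) / h
```

Since $f$ is linear in each parameter, `fd` should agree with `dW1[1, 2]` up to floating-point rounding.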

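With these derivatives we calculate the first entry of the kernel matrix, where
$$ x_1 = \begin{bmatrix}a  \\ b \end{bmatrix} \text{ and } x_2 = \begin{bmatrix}c  \\ d \end{bmatrix}$$

$$ K_{1,1} = \sum_k \frac{\partial f(x_1;\theta)}{\partial \theta_k}\frac{\partial f(x_1;\theta)}{\partial \theta_k}$$
$$ = \left\lVert W_2^T\begin{bmatrix}a & b\end{bmatrix}\right\rVert_F^2 + \left\lVert W_2^T\right\rVert^2 + \left\lVert W_1x_1+B_1\right\rVert^2 + 1$$
where each squared norm is the sum of the squared entries of the corresponding derivative, and
$$W_1 \in\mathbb{R}^{2\times 2}\text{ , } W_2\in\mathbb{R}^{1\times 2}\text{ , }B_1 \in\mathbb{R}^{2\times 1}\text{ , }B_2\in\mathbb{R}^{1\times 1}$$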
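
As a worked example under assumed values, set every weight and bias to one and take $a = 1$, $b = 2$. Summing the squared entries of the four partial derivatives listed earlier gives

$$K_{1,1} = \left\lVert W_2^T x_1^T\right\rVert_F^2 + \left\lVert W_2^T\right\rVert^2 + \left\lVert W_1x_1+B_1\right\rVert^2 + 1^2$$
$$= \left\lVert\begin{bmatrix}1 & 2 \\ 1 & 2\end{bmatrix}\right\rVert_F^2 + \left\lVert\begin{bmatrix}1 \\ 1\end{bmatrix}\right\rVert^2 + \left\lVert\begin{bmatrix}4 \\ 4\end{bmatrix}\right\rVert^2 + 1 = 10 + 2 + 32 + 1 = 45$$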


In [279]:
using Flux, LinearAlgebra

# Two data points, stored as the columns of X
x1 = Float32[1, 2]
x2 = Float32[3, 4]
X = hcat(x1, x2)

model = Chain(
    Dense(2 => 2),
    Dense(2 => 1)
)

# Fix the parameters before differentiating so the gradients are evaluated at
# these values. The biases are vectors, so they are filled from vectors:
# broadcasting the 1×2 matrix [1 1] into a bias raises a DimensionMismatch.
ps = Flux.params(model)
ps[1] .= ones(2, 2)  # W1
ps[2] .= [1, 1]      # B1
ps[3] .= [1 1]       # W2 (1×2)
ps[4] .= 1           # B2

n = size(X, 2)

# Gradient of the scalar output f(x_i; θ) with respect to every parameter array
gs_x = []
for i in 1:n
    gs = gradient(() -> model(X)[i], ps)
    push!(gs_x, gs)
end

gs_x1 = Flux.gradient(() -> model(x1)[1], ps)

g1_x1 = gs_x1[ps[1]]  # ∂f/∂W1 at x1
g2_x1 = gs_x1[ps[2]]  # ∂f/∂B1 at x1
g3_x1 = gs_x1[ps[3]]  # ∂f/∂W2 at x1
g4_x1 = gs_x1[ps[4]]  # ∂f/∂B2 at x1

# K_11 sums the dot products over all four parameter arrays, each counted once
K_11 = dot(g1_x1, g1_x1) + dot(g2_x1, g2_x1) + dot(g3_x1, g3_x1) + dot(g4_x1, g4_x1)

# Full kernel matrix: K[i, j] sums the dot products over all parameter arrays
K = [sum(dot(gs_x[i][p], gs_x[j][p]) for p in ps) for i in 1:n, j in 1:n]

eig_vals = eigen(Symmetric(K)).values

In [284]:
using Flux
using Zygote
using MLDatasets
using LinearAlgebra

model = Chain(Dense(2 => 2), Dense(2 => 1))  # W_2[1×2] * (W_1[2×2] * x[2×1] + b_1[2×1]) + b_2[1]

x = Float32[0.5852378, 0.62436277]  # random data point

W1 = Flux.params(model)[1]  # W_1
b1 = Flux.params(model)[2]  # b_1

W1 .= ones(2, 2)  # any fixed matrix can be set here; this overwrites the values in W1
b1 .= [1, 1]

W2 = Flux.params(model)[3]  # W_2
b2 = Flux.params(model)[4]  # b_2

W2 .= ones(1, 2)
b2 .= 1

y = model(x)


gs = Flux.gradient(() -> model(x)[1], Flux.params(model))  # compute all partial derivatives
all_params, r = Flux.destructure(model)


# Flatten the matrix-valued gradients so that every gradient is a vector
gs1_x1 = vec(gs[W1])  # ∂f/∂W_1
gs2_x1 = gs[b1]       # ∂f/∂b_1
gs3_x1 = vec(gs[W2])  # ∂f/∂W_2
gs4_x1 = gs[b2]       # ∂f/∂b_2

# K_11: sum of the squared partial derivatives over all parameters
K_11 = gs1_x1'*gs1_x1 + gs2_x1'*gs2_x1 + gs3_x1'*gs3_x1 + gs4_x1'*gs4_x1





14.229333f0

In [232]:
using Flux, LinearAlgebra

x1 = [1; 2]
x2 = [3; 4]

x1 = Float32.(x1)
x2 = Float32.(x2)

X = hcat(x1,x2)

model = Chain(
    Dense(2 => 2),
    Dense(2 => 1)
)

n = size(X)[2]


gs_x = []

for i in 1:n
    gs = gradient(() -> model(X)[i], Flux.params(model))
    push!(gs_x, gs)
end

@show gs_x

gs_x = Any[Grads(...), Grads(...)]


2-element Vector{Any}:
 Grads(...)
 Grads(...)

In [267]:
model = Chain(
    Dense(2 => 2),
    Dense(2 => 1)
)


Chain(
  Dense(2 => 2),                        # 6 parameters
  Dense(2 => 1),                        # 3 parameters
)                   # Total: 4 arrays, 9 parameters, 292 bytes.

### Gradients of the n-th model

Consider a model with $h$ hidden layers. With $n = h+1$ and $f_0 = x_i$, the model function is defined recursively:
$$f(x_i;\theta)_n = W_n f_{n-1} + B_n$$
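
The recursion can be sketched in plain Julia as a loop over layers. The function name `forward` and the representation of each layer as a `(W, B)` tuple are assumptions made for this example, not part of the notebook's Flux code.

```julia
# Each layer is a (W, B) pair; f_0 = x and f_n = W_n * f_{n-1} + B_n
function forward(x, layers)
    f = x
    for (W, B) in layers
        f = W * f + B
    end
    return f
end

# One hidden layer (h = 1, n = 2) with every weight and bias set to one
layers = [(ones(2, 2), ones(2)), (ones(1, 2), ones(1))]
y = forward([1.0, 2.0], layers)

println(y)  # [9.0]
```

Adding another hidden layer only appends one more `(W, B)` pair to `layers`.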

For the first two cases we show the derivative with respect to all parameters $\theta$.

##### For h = 0, n = 1
$f(x_i;\theta)_1 = W_1x_i + B_1$
$$\frac{\partial f(x_i;\theta)_1}{\partial \theta_k} \text{ :  } \frac{\partial f(x_i;\theta)_1}{\partial W_1}= x_i^T \text{ ,  }\frac{\partial f(x_i;\theta)_1}{\partial B_1} = 1 $$
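
As a short worked step, assume a scalar output, so that $W_1$ is a row vector and $B_1$ a scalar. Inserting these derivatives into the definition of the kernel matrix gives, for two data points $x_i$ and $x_j$,

$$K_{i,j} = \frac{\partial f(x_i;\theta)_1}{\partial W_1}\left(\frac{\partial f(x_j;\theta)_1}{\partial W_1}\right)^T + \frac{\partial f(x_i;\theta)_1}{\partial B_1}\frac{\partial f(x_j;\theta)_1}{\partial B_1} = x_i^T x_j + 1$$

so for $h = 0$ the kernel depends only on the data and not on the parameter values.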

