# Assignment 1: Universal Function Approximator


The goal of this exercise is to compare three different neural network architectures and analyze their capacity for function approximation:

1. $N_1$: One-layer network (linear transformation only)
2. $N_2$: One-layer network with non-linear activation function
3. $N_3$: Two-layer network (hidden layer with non-linear activation function)

They will be trained via gradient descent (with weight decay). To show the flexibility of the approach, three different functions will be approximated:
1. $X_1: t = \cos(3x)$ for $x\in[-2,2]$
2. $X_2: t = e^{-x^2}$ for $x\in[-1,1]$
3. $X_3: t = x^5 + 3x^4 - 6x^3 -12x^2 + 5x + 129$ for $x\in[-4,2.5]$

In the theoretical section, the networks will be designed, and the necessary derivatives will be computed by hand.

In the coding section, we will:

- implement the networks and their gradients,
- generate target data for three different functions,
- apply the training procedure to the data, and
- plot the resulting approximated function together with the data samples.

## Section 1: Theoretical Questions

### Network Design

#### Task 1.1: Network Structure

Given input $\vec x = (1, x)^T$, define three neural networks ($N_1$, $N_2$, $N_3$) mathematically, to reach output $y$. Use $g()$ to represent the activation function.

Explain how their structures differ and analyze their function approximation capabilities.

---
Note:

For the one-layer networks, define the parameter $\Theta=\vec w \in\mathbb R^{D+1}$.

For the two-layer network, define the parameters $\Theta=(\mathbf W^{(1)},\vec w^{(2)})$, which are split into $\mathbf W^{(1)}\in\mathbb R^{K\times {(D+1)}}$ for the first layer and $\vec w^{(2)}\in\mathbb R^{K+1}$ for the second layer.

Answer:

1. $N_1$: A one-layer network is simply a linear mapping: $y=\vec w^T\vec x$. This model can only approximate linear functions.

2. $N_2$: A one-layer network with activation function applies a non-linearity:

   1. Linear layer: $a=\vec w^T\vec x$ (a scalar, since there is a single output neuron)
   2. Apply the activation function: $y=g(a)$

   This can approximate some non-linear functions but is limited.

3. $N_3$: A two-layer network introduces a hidden layer:

   1. First layer: $\vec a_- = \mathbf W^{(1)} \vec x$
   2. Apply the activation function element-wise to $\vec a_-$: $\vec h_- = g(\vec a_-)$
   3. Prepend the bias neuron $h_0=1$ to arrive at $\vec h$
   4. Compute the network output: $y = \vec w^{(2)}{}^T\vec h$

   This allows for more complex function approximation by transforming the input space non-linearly.
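
The three definitions above can be sketched as forward passes in NumPy. This is a minimal illustration, not the reference implementation of the coding section: the function names `n1`, `n2`, `n3`, the choice of $\tanh$ as $g$, and the example values ($D=1$, $K=3$, random weights) are assumptions made here.

```python
import numpy as np

def n1(w, x):
    """N1: linear one-layer network, y = w^T x with x = (1, x)^T."""
    return w @ x

def n2(w, x, g=np.tanh):
    """N2: one-layer network with activation, y = g(w^T x)."""
    return g(w @ x)

def n3(W1, w2, x, g=np.tanh):
    """N3: two-layer network with K hidden units."""
    a = W1 @ x                          # hidden pre-activations, shape (K,)
    h = np.concatenate(([1.0], g(a)))   # prepend bias neuron h_0 = 1, shape (K+1,)
    return w2 @ h                       # scalar output y

# Example with D = 1 and K = 3 (values chosen arbitrarily for illustration)
rng = np.random.default_rng(0)
x = np.array([1.0, 0.5])                # input (1, x)^T with x = 0.5
w = rng.normal(size=2)                  # one-layer weights, shape (D+1,)
W1 = rng.normal(size=(3, 2))            # first layer, shape (K, D+1)
w2 = rng.normal(size=4)                 # second layer, shape (K+1,)
print(n1(w, x), n2(w, x), n3(W1, w2, x))
```

Note that the shapes match the parameter definitions in the note above: `W1` is $K\times(D+1)$ and `w2` has $K+1$ entries because of the prepended bias neuron.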



#### Task 1.2: Network Comparison

Can the one-layer network approximate all three functions well? Why or why not?

What advantages does the two-layer network have compared to a one-layer network?

How can we determine the appropriate number of hidden neurons?
When looking at the example plots in the OLAT, how many hidden neurons do we need in order to approximate the functions? Is there any difference between the three target functions?



Answer:

The one-layer network (linear) cannot fit the nonlinear functions well.

The one-layer network with activation can approximate smooth functions better but is still limited: only one change of direction can be approximated.

The two-layer network can approximate complex functions, but the number of hidden neurons should be chosen carefully.

The number of hidden neurons depends on the complexity of the function:

$X_1$: 10

$X_2$: 2

$X_3$: More complex functions usually require more units: here we take 80.

More neurons allow for better approximation but increase computation cost and risk of overfitting.



#### Task 1.3: Network Performance

If the network struggles to approximate a function well, what are some possible reasons?

How can we improve the network's performance?


Answer:

Reasons:
1. Too few hidden neurons (underfitting).
2. Poor weight initialization.
3. Inappropriate activation function.

Solutions:
1. Changing the number of hidden neurons.
2. Adjusting the learning rate, e.g., by using adaptive learning rate techniques.
3. Choosing a different activation function.
4. Modifying the loss function.
5. Adding output normalization.


### Derivatives

#### Task 1.4: Activation Function

Given the hyperbolic tangent ($\tanh$) activation function as:

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

Prove:

$$\frac{\partial}{\partial x} \tanh(x) = 1 - \tanh^2(x)$$

Hint: Apply the derivative rules as defined in the Lecture:
* Quotient rule
* Sum rule
* Exponential rule

Also, avoid multiplying out the parentheses; keep the products in factored form.

Answer:

$$\begin{aligned}
\frac{\partial}{\partial x} \tanh(x) &= \frac{\partial  \frac{e^x - e^{-x}}{e^x + e^{-x}}}{\partial x} \\[6ex]
\text{(quotient rule)} &= \frac{\frac{\partial(e^x - e^{-x})}{\partial x} (e^x + e^{-x}) - \frac{\partial(e^x + e^{-x})}{\partial x} (e^x - e^{-x})}{(e^x + e^{-x})^2} \\[6ex]
\text{(exponential rule)} &= \frac{(e^x + e^{-x})(e^x + e^{-x}) - (e^x - e^{-x})(e^x - e^{-x})}{(e^x + e^{-x})^2}\\[6ex]
 &= \frac{(e^x + e^{-x})(e^x + e^{-x})}{(e^x + e^{-x})^2} - \frac{(e^x - e^{-x})(e^x - e^{-x})}{(e^x + e^{-x})^2} \\[6ex]
 &= \frac{(e^x + e^{-x})^2}{(e^x + e^{-x})^2} - \frac{(e^x - e^{-x})^2}{(e^x + e^{-x})^2} \\[6ex]
 &= 1 - \left(\frac{e^x - e^{-x}}{e^x + e^{-x}}\right)^2 = 1 - \tanh^2(x)
\end{aligned}$$
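
As a sanity check, the identity can be compared against a central finite difference. This is a small sketch under stated assumptions: the helper name `tanh_prime`, the sample points, and the step size `eps` are chosen here for illustration only.

```python
import numpy as np

def tanh_prime(x):
    """Analytic derivative derived above: 1 - tanh^2(x)."""
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-3.0, 3.0, 13)   # a few sample points
eps = 1e-6                       # finite-difference step size

# Central difference approximation of d/dx tanh(x)
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)

max_err = np.max(np.abs(numeric - tanh_prime(x)))
print(max_err)                   # should be close to zero
```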



#### Task 1.5: Weight Decay

Consider a loss function with L2 regularization (weight decay):
$$
L'(\theta) = L(\theta) + \frac{\lambda}{2} \|\theta\|^2
$$

Compute its derivative with respect to $\theta$: $$\frac{\partial}{\partial \theta} L'(\theta)$$


Answer:

Taking the derivative term by term, with $\|\theta\|^2 = \sum_i \theta_i^2$:
$$
\frac{\partial}{\partial \theta} L'(\theta) = \frac{\partial}{\partial \theta} L(\theta) + \frac{\lambda}{2} \frac{\partial}{\partial \theta} \sum_i \theta_i^2
$$

Since $\frac{\partial}{\partial \theta} \sum_i \theta_i^2 = 2\theta$, we get:
$$
\frac{\partial}{\partial \theta} L'(\theta) = \frac{\partial}{\partial \theta} L(\theta) + \lambda \theta
$$
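
The result can be checked numerically. The sketch below assumes a concrete choice for the unregularized loss $L$ (half the mean squared error of a linear model), which is not prescribed by the task; all function names and the random data are hypothetical. Only the added term $\lambda\theta$ is the point of the example.

```python
import numpy as np

def loss(theta, X, t):
    """Assumed unregularized loss L: half the mean squared error."""
    return 0.5 * np.mean((X @ theta - t) ** 2)

def grad_loss(theta, X, t):
    """Gradient of the assumed loss L with respect to theta."""
    return X.T @ (X @ theta - t) / len(t)

def reg_loss(theta, X, t, lam):
    """Regularized loss L' = L + (lambda / 2) * ||theta||^2."""
    return loss(theta, X, t) + 0.5 * lam * np.sum(theta ** 2)

def grad_reg_loss(theta, X, t, lam):
    """Gradient of L': gradient of L plus lambda * theta."""
    return grad_loss(theta, X, t) + lam * theta

def numeric_grad(f, theta, eps=1e-6):
    """Central finite-difference gradient of a scalar function f."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))     # arbitrary inputs
t = rng.normal(size=20)          # arbitrary targets
theta = rng.normal(size=3)
lam = 0.1

analytic = grad_reg_loss(theta, X, t, lam)
numeric = numeric_grad(lambda th: reg_loss(th, X, t, lam), theta)
print(np.max(np.abs(analytic - numeric)))   # should be close to zero
```

In a gradient descent step, the extra term means each weight is additionally pulled toward zero in proportion to its current value, which is where the name "weight decay" comes from.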

#### Task 1.6: Weight Decay Parameter

How large should the weight decay parameter $\lambda$ from Task 1.5 be? What happens if $\lambda$ is set too high or too low?

Answer:

$\lambda$ is usually set to a small value (e.g., between $10^{-5}$ and $10^{-2}$).

It should be selected to maintain the best trade-off between model complexity and generalization.

If $\lambda$ is too small, the model learns almost without regularization, meaning weights can grow large. This can lead to overfitting, where the model memorizes training data but performs poorly on unseen data.

If $\lambda$ is too large, the weight decay term dominates the optimization process, leading to:

1. Excessive shrinkage of weights, causing the model to underfit the data.
2. Loss of model capacity, where the neural network struggles to capture even simple relationships in the data.
3. Degradation in performance, as the model's predictions become too simplistic (e.g., outputting the same value for all inputs).
4. In extreme cases, setting $\lambda$ too high can collapse all weights toward zero.
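
The shrinkage effect can be illustrated on a deliberately simple case. The sketch below swaps the neural network for a linear model $y=\vec w^T\vec x$ with a mean squared error loss, because the regularized minimizer then has a closed form; the target $t = 1 + 2x$, the helper `fit`, and the listed $\lambda$ values are assumptions chosen for illustration, not part of the assignment.

```python
import numpy as np

# Toy data: inputs (1, x)^T and an assumed linear target t = 1 + 2x
x = np.linspace(-1.0, 1.0, 50)
X = np.stack([np.ones_like(x), x], axis=1)   # each row is (1, x)
t = 1.0 + 2.0 * x

def fit(lam):
    """Minimizer of 0.5 * mean((X theta - t)^2) + (lam / 2) * ||theta||^2.

    Setting the gradient X^T (X theta - t) / n + lam * theta to zero
    gives the linear system solved below.
    """
    n, d = X.shape
    A = X.T @ X / n + lam * np.eye(d)
    b = X.T @ t / n
    return np.linalg.solve(A, b)

lambdas = [0.0, 1e-3, 1.0, 100.0]
norms = [np.linalg.norm(fit(lam)) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda = {lam:g}: ||theta|| = {norm:.4f}")
```

With $\lambda = 0$ the fit recovers the weights $(1, 2)$; as $\lambda$ grows, the weight norm shrinks, and for very large $\lambda$ the model outputs values close to zero for every input. Note that this sketch also decays the bias weight, in line with the formula in Task 1.5, which penalizes all entries of $\theta$.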