#### **&#x1F516;** **(&#x31;)** 

Why do the layers in a deep neural network architecture need to be non-linear? In other words, why are purely linear layers not desirable in neural nets?

--- 

Answer:

Because linear functions are closed under composition, a stack of linear layers is equivalent to a single linear layer: $W_2 (W_1 x) = (W_2 W_1) x$. Therefore, no matter how many layers exist, the network can only learn linear functions.
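
The collapse of stacked linear layers can be checked numerically. The sketch below is only an illustration: the matrices `W1`, `W2` and the input `x` are made-up values, and biases are left out for brevity (with biases the layers are affine and the same argument applies).

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(a, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


# Hypothetical weights: layer 1 maps R^2 -> R^3, layer 2 maps R^3 -> R^2.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
W2 = [[2.0, 0.0, 1.0], [-1.0, 1.0, 0.0]]
x = [0.5, -2.0]

# Passing x through both layers without a non-linearity in between ...
two_layers = matvec(W2, matvec(W1, x))
# ... gives the same result as one layer with the merged weight matrix W2 W1.
one_layer = matvec(matmul(W2, W1), x)

print(two_layers, one_layer)
```

Inserting any non-linear activation between the two `matvec` calls breaks this equality, which is what gives depth its extra expressive power.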

#### **&#x1F516;** **(&#x32;)** 

You are solving a binary classification task for a wifi modulation signal problem. The final two layers in your network are a ReLU activation followed by a sigmoid activation. What will happen?

--- 

Answer:

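The ReLU output is always non-negative, and $\sigma(z) \geq 0.5$ for every $z \geq 0$. The sigmoid output is therefore never below $0.5$, so with the usual $0.5$ decision threshold all predictions will be positive.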
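
A quick numerical check of this effect is sketched below. The `logits` are arbitrary example values, and the decision threshold of 0.5 is an assumption (it is the common default, but the problem statement does not fix it).

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Arbitrary pre-activation values, both negative and positive.
logits = [-5.0, -0.1, 0.0, 0.3, 4.0]

# ReLU clips negatives to 0, so the sigmoid only ever sees z >= 0.
probs = [sigmoid(relu(z)) for z in logits]

# Assumed decision rule: predict the positive class when p >= 0.5.
labels = [int(p >= 0.5) for p in probs]

print(probs)
print(labels)
```

Every negative logit is mapped to a probability of exactly 0.5, so the network can never express confidence in the negative class.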

#### **&#x1F516;** **(&#x33;)** 

Softmax takes in an n-dimensional vector $x$  and outputs another n-dimensional vector $y$:

$
y_i = \frac{e^{x_i}}{\sum_k e^{x_k}}
$

The objective is to compute the gradient of $y$ w.r.t. $x$. Let $\delta_{ij} = \frac{\partial y_i}{\partial x_j}$.

Derive an expression for (i) $\delta_{ii}$ and (ii) $\delta_{ij}$ when $i \neq j$.

Hint: Quotient rule of calculus. Let $h(x) = \frac{f(x)}{g(x)}$ then $
\frac{\partial h}{\partial x} = \frac{\frac{\partial f(x)}{\partial x} * g(x) - \frac{\partial g(x)}{\partial x} * f(x)}{ (g(x))^2}
$

--- 

Answer:

Let's denote $f(x) = e^{x_i}$ and $g(x) = \sum_k e^{x_k}$.

- [ ]  When $ i = j $:

$$
\begin{flalign*}
\delta_{ii} &= \frac{\partial y_i}{\partial x_i} = \frac{\frac{\partial f(x)}{\partial x_i} g(x) - \frac{\partial g(x)}{\partial x_i} f(x)}{ (g(x))^2} \\
&= \frac{e^{x_i} \displaystyle\sum_k e^{x_k} - e^{x_i} e^{x_i}}{(\displaystyle\sum_k e^{x_k})^2} \\
&= \frac{e^{x_i} ( \displaystyle\sum_k e^{x_k} - e^{x_i}) }{(\displaystyle\sum_k e^{x_k})^2} \\
&= y_i (1 - y_i)
\end{flalign*}
$$

- [ ] When $ i \neq j $:

$$
\begin{flalign*}
\delta_{ij} &= \frac{\partial y_i}{\partial x_j} = \frac{0 \cdot \displaystyle\sum_k e^{x_k} - e^{x_j} e^{x_i}}{(\displaystyle\sum_k e^{x_k})^2} = - \frac{e^{x_i}}{\displaystyle\sum_k e^{x_k}} \cdot \frac{e^{x_j}}{\displaystyle\sum_k e^{x_k}} \\
&= -y_i y_j
\end{flalign*}
$$
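
Both cases can be written as the single expression $\frac{\partial y_i}{\partial x_j} = y_i (\mathbb{1}[i = j] - y_j)$. The sketch below compares this closed form against a central finite-difference estimate; the function names, the test point `x`, and the step size `eps` are illustrative choices, not part of the original problem.

```python
import math


def softmax(x):
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(x):
    """Closed form: J[i][j] = y_i * (1 - y_i) if i == j, else -y_i * y_j."""
    y = softmax(x)
    n = len(y)
    return [[y[i] * ((1.0 if i == j else 0.0) - y[j]) for j in range(n)]
            for i in range(n)]


def numeric_jacobian(x, eps=1e-6):
    """Central finite differences: J[i][j] ~ (y_i(x + eps e_j) - y_i(x - eps e_j)) / (2 eps)."""
    n = len(x)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        y_plus, y_minus = softmax(x_plus), softmax(x_minus)
        for i in range(n):
            jac[i][j] = (y_plus[i] - y_minus[i]) / (2.0 * eps)
    return jac


x = [1.0, 2.0, 0.5]  # arbitrary test point
analytic = softmax_jacobian(x)
numeric = numeric_jacobian(x)
max_err = max(abs(analytic[i][j] - numeric[i][j])
              for i in range(len(x)) for j in range(len(x)))
print(max_err)
```

The diagonal entries come out positive and the off-diagonal entries negative, matching the signs in the derivation.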


#### **&#x1F516;** **(&#x34;)** 

Let $s_k$ be the score for a specific class $k$ and let $\theta$ be a constant that we subtract from all scores of a sample. Show that $\text{softmax}(s_k)$ is equal to $\text{softmax}(s_k - \theta)$.

What does this property of the softmax function imply? Would you consider this property useful when training neural nets?

--- 

Let's start by showing that $\text{softmax}(s_k)$ is equal to $\text{softmax}(s_k - \theta)$.

### Property Proof
Given the softmax function for a vector of scores $ \mathbf{s} $:
$
\text{softmax}(s_k) = \frac{e^{s_k}}{\sum_{j=1}^{n} e^{s_j}}
$

Now, if we subtract a constant $\theta$ from all scores, we get a new vector $ \mathbf{s'} $ where $ s'_i = s_i - \theta $ for all $ i $. The softmax function applied to this new vector is:
$
\text{softmax}(s'_k) = \frac{e^{s_k - \theta}}{\sum_{j=1}^{n} e^{s_j - \theta}}
$

We can simplify the numerator and denominator:
$
\text{softmax}(s'_k) = \frac{e^{s_k - \theta}}{\sum_{j=1}^{n} e^{s_j - \theta}} = \frac{e^{s_k} \cdot e^{-\theta}}{\sum_{j=1}^{n} e^{s_j} \cdot e^{-\theta}}
$

Since $ e^{-\theta} $ is a constant and can be factored out of the sum in the denominator:
$
\text{softmax}(s'_k) = \frac{e^{s_k} \cdot e^{-\theta}}{e^{-\theta} \cdot \sum_{j=1}^{n} e^{s_j}} = \frac{e^{s_k}}{\sum_{j=1}^{n} e^{s_j}}
$

Therefore,
$
\boxed { \text{softmax}(s'_k) = \text{softmax}(s_k) }
$
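
This shift invariance means that only the differences between scores matter, not their absolute values. It is useful in practice: choosing $\theta = \max_j s_j$ keeps every exponent at or below zero, which avoids floating-point overflow for large scores. The sketch below illustrates the idea; the function names and the example scores are assumptions made for the demonstration.

```python
import math


def softmax_naive(s):
    """Direct implementation of the formula; overflows for large scores."""
    exps = [math.exp(v) for v in s]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_stable(s):
    """Same function, but subtracts theta = max(s) first (allowed by shift invariance)."""
    theta = max(s)
    exps = [math.exp(v - theta) for v in s]  # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]


small = [0.0, 1.0, 2.0]
large = [1000.0, 1001.0, 1002.0]  # same scores shifted by 1000

print(softmax_naive(small))
print(softmax_stable(large))  # matches the line above up to rounding

try:
    softmax_naive(large)
except OverflowError as err:
    print("naive softmax failed:", err)
```

Because `large` is just `small` shifted by a constant, the stable version returns the same probabilities for both, while the naive version cannot evaluate `math.exp(1000.0)`.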



#### **&#x1F516;** **(&#x35;)** 

You design a fully connected neural net architecture for an end-to-end communication system, where all activation functions are sigmoids. You initialize the weights with large positive numbers. Is this a good idea? Explain.

--- 

Answer:

No. Large weights make the pre-activation $z = wx$ large in magnitude. The sigmoid saturates there: its gradient $\sigma(z)(1 - \sigma(z))$ is close to zero when $|z|$ is large. Hence, we encounter the vanishing gradient problem.
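
To see how quickly the sigmoid gradient shrinks, the sketch below evaluates $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ at a few pre-activation values. The sample values in `zs` are arbitrary and only meant to illustrate the trend.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 for z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


# Arbitrary pre-activations; large weights push z toward the right end of this list.
zs = [0.0, 2.0, 5.0, 10.0, 20.0]
grads = [sigmoid_grad(z) for z in zs]

for z, g in zip(zs, grads):
    print(f"z = {z:5.1f}  gradient = {g:.3e}")
```

During backpropagation these per-layer factors are multiplied together, so a few saturated layers are enough to make the gradient reaching the early layers negligible.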

#### **&#x1F516;** **(&#x36;)** 

We would like to implement a deep wireless transmitter by training a fully connected neural net with 5 hidden layers, each with 10 hidden units. The input is a 20 dimensional vector and the output is a scalar. Calculate the total number of trainable parameters.

--- 

Let's represent the calculation in a compact form. Given:

- Input dimension: $ 20 $
- Hidden layers: $ 5 $ hidden layers, each with $ 10 $ units
- Output dimension: $ 1 $

The total number of trainable parameters includes the weights and biases for each layer.

### Layers Calculation

1. **Input Layer to Hidden Layer 1:**
   - Weights: $ 20 \times 10 $
   - Biases: $ 10 $

2. **Hidden Layer $ i $ to Hidden Layer $ i+1 $ (for $ i = 1 $ to $ 4 $):**
   - Weights: $ 10 \times 10 $
   - Biases: $ 10 $

3. **Hidden Layer 5 to Output Layer:**
   - Weights: $ 10 \times 1 $
   - Biases: $ 1 $

### Total Calculation

Let's sum these up:

1. **Input to Hidden Layer 1:**
   $
   20 \times 10 + 10 = 200 + 10 = 210
   $

2. **Hidden Layer 1 to Hidden Layer 2:**
   $
   10 \times 10 + 10 = 100 + 10 = 110
   $

3. **Hidden Layer 2 to Hidden Layer 3:**
   $
   10 \times 10 + 10 = 100 + 10 = 110
   $

4. **Hidden Layer 3 to Hidden Layer 4:**
   $
   10 \times 10 + 10 = 100 + 10 = 110
   $

5. **Hidden Layer 4 to Hidden Layer 5:**
   $
   10 \times 10 + 10 = 100 + 10 = 110
   $

6. **Hidden Layer 5 to Output Layer:**
   $
   10 \times 1 + 1 = 10 + 1 = 11
   $

### Sum of All Parameters

$
210 + 110 + 110 + 110 + 110 + 11 = 661
$

Thus, in a compact form, the total number of trainable parameters is $ \boxed{661} $.

$\text{\# weights} = (20 * 10) + ... = 610 $

$\text{\# biases} = 51$
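
The same count can be reproduced programmatically. The helper below is a minimal sketch (the name `count_parameters` and the list-of-layer-sizes representation are choices made here, not part of the problem statement); it assumes every layer is fully connected and has a bias per output unit.

```python
def count_parameters(layer_sizes):
    """Return (weights, biases) for a fully connected net with the given layer widths."""
    # One weight per (input unit, output unit) pair between consecutive layers.
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    # One bias per unit in every layer except the input layer.
    biases = sum(layer_sizes[1:])
    return weights, biases


# 20 inputs, 5 hidden layers of 10 units, 1 output.
sizes = [20] + [10] * 5 + [1]
weights, biases = count_parameters(sizes)
total = weights + biases

print(weights, biases, total)
```

Changing `sizes` gives the count for other architectures, for example a different number of hidden layers or a wider output.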

#### **&#x1F516;** **(&#x37;)** 

We would like to design an end-to-end communication system using an autoencoder. For that, we try to find a useful representation $r \in \mathbb{R}^n$ of the input $s \in \mathbb{R}^k$ at some intermediate layer by learning to reproduce the input at the output. If $k = 5$, how large should $n$ be?

--- 

Answer:

Given $k = 5$ (input dimension), a reasonable choice for the intermediate layer dimension $n$ of an autoencoder is typically smaller than the input dimension, so that the network has to learn a compressed representation. Therefore:

$
n < k \implies n < 5
$
