#### **&#x1F516;** **(&#x32;)** 

You are solving a binary classification task for a wifi modulation signal problem. The final two layers in your network are a ReLU activation followed by a sigmoid activation. What will happen?

--- 

Placing a ReLU activation directly before the sigmoid in the final layers of a binary classifier restricts the range of outputs the network can produce, because of how the two activations compose. Here's a detailed explanation:

1. **ReLU Activation Properties**: The Rectified Linear Unit (ReLU) activation function outputs zero for any negative input and outputs the input itself for any positive input. Mathematically, it can be expressed as:
   $
   \text{ReLU}(x) = \max(0, x)
   $
   This means that the output of the ReLU function is always non-negative.

2. **Sigmoid Activation Properties**: The sigmoid activation function outputs a value between 0 and 1, given by:
   $
   \text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}
   $
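   The sigmoid function is strictly increasing and maps any real input into the interval (0, 1), with $\text{Sigmoid}(0) = 0.5$: negative inputs map below 0.5 and positive inputs map above 0.5. The output can therefore be read as a probability.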

3. **Combining ReLU and Sigmoid**: With a ReLU followed by a sigmoid, the input to the sigmoid is always non-negative. Because $\text{Sigmoid}(0) = 0.5$ and the sigmoid is increasing, its output is confined to the range [0.5, 1).

    - The network can therefore never output a value in the range (0, 0.5): it cannot represent the lower half of the probability range.

4. **Impact on Binary Classification**: For binary classification, you want the output to span the full range (0, 1), since it represents the probability of the positive class. With the output limited to [0.5, 1), the network cannot assign a low probability to instances of the negative class (output close to 0). In addition, whenever the input to the ReLU is negative, the output is exactly 0.5 and the ReLU passes no gradient back for that example.

**Solution**:
- Remove the ReLU activation before the sigmoid layer. The output layer for a binary classification problem should directly use the sigmoid activation to ensure that the output can span the entire range of (0, 1).

In summary, having a ReLU followed by a sigmoid activation in the final layers will prevent the network from outputting probabilities below 0.5, severely limiting its ability to perform binary classification tasks effectively.
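
The range restriction is easy to check numerically. The sketch below is a minimal illustration using only the Python standard library; the `logits` values are hypothetical pre-activations chosen for the example, not data from the task.

```python
import math

def relu(z):
    """ReLU(z) = max(0, z)."""
    return max(0.0, z)

def sigmoid(z):
    """Sigmoid(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical pre-activation values covering negative and positive inputs.
logits = [-5.0, -1.0, 0.0, 1.0, 5.0]

# ReLU followed by sigmoid: every negative logit is clamped to 0 first.
relu_then_sigmoid = [sigmoid(relu(z)) for z in logits]

# Sigmoid applied directly to the logits.
sigmoid_only = [sigmoid(z) for z in logits]

print("ReLU -> sigmoid:", relu_then_sigmoid)
print("sigmoid only:   ", sigmoid_only)
```

With the ReLU in place, all negative logits collapse to an output of exactly 0.5, while the sigmoid alone produces values on both sides of 0.5.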

#### **&#x1F516;** **(&#x33;)** 

Softmax takes in an n-dimensional vector $x$  and outputs another n-dimensional vector $y$:

$
y_i = \frac{e^{x_i}}{\sum_k e^{x_k}}
$

The objective is to compute the gradient of $y$ w.r.t. $x$. Let $\delta_{ij} = \frac{\partial y_i}{\partial x_j}$.

Derive an expression for (i) $\delta_{ii}$ and (ii) $\delta_{ij}$ when $i \neq j$.

Hint: Quotient rule of calculus. Let $h(x) = \frac{f(x)}{g(x)}$ then $
\frac{\partial h}{\partial x} = \frac{\frac{\partial f(x)}{\partial x} * g(x) - \frac{\partial g(x)}{\partial x} * f(x)}{ (g(x))^2}
$

--- 

To compute the gradient of the softmax output $ y $ with respect to its input $ x $, let's start by defining the softmax function more formally:

Given an $ n $-dimensional vector $ x $:
$
y_i = \frac{e^{x_i}}{\sum_{k=1}^{n} e^{x_k}}
$

We need to derive the expression for the partial derivatives $ \delta_{ij} = \frac{\partial y_i}{\partial x_j} $.

### Step 1: Compute $\frac{\partial y_i}{\partial x_j}$

Start from the definition:
$
y_i = \frac{e^{x_i}}{\sum_{k=1}^{n} e^{x_k}}
$
Let's define:
$
f(x) = e^{x_i} \quad \text{and} \quad g(x) = \sum_{k=1}^{n} e^{x_k}
$
Using the quotient rule:
$
\frac{\partial y_i}{\partial x_j} = \frac{\frac{\partial f(x)}{\partial x_j} g(x) - f(x) \frac{\partial g(x)}{\partial x_j}}{(g(x))^2}
$

#### When $ i = j $:
$
\frac{\partial e^{x_i}}{\partial x_j} = e^{x_i} \quad \text{and} \quad \frac{\partial g(x)}{\partial x_j} = e^{x_j} = e^{x_i}
$
Thus,
$
\frac{\partial y_i}{\partial x_i} = \frac{e^{x_i} \sum_{k=1}^{n} e^{x_k} - e^{x_i} e^{x_i}}{(\sum_{k=1}^{n} e^{x_k})^2}
$
$
\frac{\partial y_i}{\partial x_i} = \frac{e^{x_i} ( \sum_{k=1}^{n} e^{x_k} - e^{x_i})}{(\sum_{k=1}^{n} e^{x_k})^2}
$
$
\frac{\partial y_i}{\partial x_i} = \frac{e^{x_i} \sum_{k \neq i} e^{x_k}}{(\sum_{k=1}^{n} e^{x_k})^2}
$
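Using
$
\frac{e^{x_i}}{\sum_{k=1}^{n} e^{x_k}} = y_i \quad \text{and} \quad \frac{\sum_{k \neq i} e^{x_k}}{\sum_{k=1}^{n} e^{x_k}} = 1 - y_i
$
we obtain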
$
\frac{\partial y_i}{\partial x_i} = y_i \left( 1 - y_i \right)
$

#### When $ i \neq j $:
$
\frac{\partial e^{x_i}}{\partial x_j} = 0 \quad \text{and} \quad \frac{\partial g(x)}{\partial x_j} = e^{x_j}
$
Thus,
$
\frac{\partial y_i}{\partial x_j} = \frac{0 \cdot \sum_{k=1}^{n} e^{x_k} - e^{x_i} e^{x_j}}{(\sum_{k=1}^{n} e^{x_k})^2}
$
$
\frac{\partial y_i}{\partial x_j} = - \frac{e^{x_i} e^{x_j}}{(\sum_{k=1}^{n} e^{x_k})^2}
$
Using
$
\frac{e^{x_i}}{\sum_{k=1}^{n} e^{x_k}} = y_i \quad \text{and} \quad \frac{e^{x_j}}{\sum_{k=1}^{n} e^{x_k}} = y_j
$
we obtain
$
\frac{\partial y_i}{\partial x_j} = - y_i y_j
$

### Summary
$
\delta_{ij} = \frac{\partial y_i}{\partial x_j} = 
\begin{cases}
y_i (1 - y_i) & \text{if } i = j \\
- y_i y_j & \text{if } i \neq j
\end{cases}
$
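
As a sanity check, the two cases can be compared against a finite-difference approximation. The sketch below is a minimal illustration in plain Python; the function names (`softmax_jacobian`, `numerical_jacobian`), the test vector `x`, and the step size `eps` are assumptions made for this example.

```python
import math

def softmax(x):
    """Softmax of a list of floats (small inputs assumed, so no overflow handling)."""
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(x):
    """Analytic Jacobian: entry [i][j] is y_i * (1 - y_i) if i == j, else -y_i * y_j."""
    y = softmax(x)
    n = len(y)
    return [[y[i] * ((1.0 if i == j else 0.0) - y[j]) for j in range(n)]
            for i in range(n)]

def numerical_jacobian(x, eps=1e-6):
    """Central finite-difference estimate of d y_i / d x_j."""
    n = len(x)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        y_plus, y_minus = softmax(x_plus), softmax(x_minus)
        for i in range(n):
            jac[i][j] = (y_plus[i] - y_minus[i]) / (2.0 * eps)
    return jac

x = [1.0, 2.0, 0.5]  # arbitrary example input
analytic = softmax_jacobian(x)
numeric = numerical_jacobian(x)

max_err = max(abs(analytic[i][j] - numeric[i][j])
              for i in range(len(x)) for j in range(len(x)))
print("max abs difference:", max_err)
```

The diagonal entries are positive and the off-diagonal entries are negative, and each row sums to zero because the outputs $y_i$ always sum to 1.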

#### **&#x1F516;** **(&#x34;)** 

Let $s_k$ be the score for a specific class $k$, and let $\theta$ be a constant that we subtract from all scores of a sample. Show that $\text{softmax}(s_k)$ is equal to $\text{softmax}(s_k - \theta)$.

What does this property of the softmax function imply? Would you consider this property useful when training neural nets?

--- 

Let's start by showing that $\text{softmax}(s_k)$ is equal to $\text{softmax}(s_k - \theta)$.

### Property Proof
Given the softmax function for a vector of scores $ \mathbf{s} $:
$
\text{softmax}(s_k) = \frac{e^{s_k}}{\sum_{j=1}^{n} e^{s_j}}
$

Now, if we subtract a constant $\theta$ from all scores, we get a new vector $ \mathbf{s'} $ where $ s'_i = s_i - \theta $ for all $ i $. The softmax function applied to this new vector is:
$
\text{softmax}(s'_k) = \frac{e^{s_k - \theta}}{\sum_{j=1}^{n} e^{s_j - \theta}}
$

We can simplify the numerator and denominator:
$
\text{softmax}(s'_k) = \frac{e^{s_k - \theta}}{\sum_{j=1}^{n} e^{s_j - \theta}} = \frac{e^{s_k} \cdot e^{-\theta}}{\sum_{j=1}^{n} e^{s_j} \cdot e^{-\theta}}
$

Since $ e^{-\theta} $ does not depend on $ j $, it can be factored out of the sum in the denominator, where it cancels with the same factor in the numerator:
$
\text{softmax}(s'_k) = \frac{e^{s_k} \cdot e^{-\theta}}{e^{-\theta} \cdot \sum_{j=1}^{n} e^{s_j}} = \frac{e^{s_k}}{\sum_{j=1}^{n} e^{s_j}}
$

Therefore,
$
\text{softmax}(s'_k) = \text{softmax}(s_k)
$

This demonstrates that the softmax function is invariant to the addition (or subtraction) of a constant $\theta$ to all input scores.
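
The invariance can also be observed numerically. The following is a minimal sketch in plain Python; the `scores` vector and the value of `theta` are arbitrary choices made for illustration.

```python
import math

def softmax(scores):
    """Direct implementation of softmax(s_k) = exp(s_k) / sum_j exp(s_j)."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, -0.5]  # hypothetical class scores
theta = 3.7                # arbitrary constant subtracted from every score

original = softmax(scores)
shifted = softmax([s - theta for s in scores])

# The two probability vectors agree up to floating-point rounding.
max_diff = max(abs(p - q) for p, q in zip(original, shifted))
print("max abs difference:", max_diff)
```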

### Implications of this Property
This property of the softmax function implies that the output probabilities remain unchanged if the same constant is subtracted from (or added to) all the scores. This invariance has several important implications and uses in training neural networks:

1. **Numerical Stability**: When dealing with large scores, the exponentiation in the softmax function can produce extremely large values that cause numerical overflow. Subtracting the maximum score from all scores makes the largest exponent exactly 0, so every exponential lies in (0, 1] and the denominator is at least 1, which rules out overflow. This practice is common in implementations of the softmax function.

2. **Simplification in Implementation**: Because the output is unchanged, the shift can be applied unconditionally as a single extra step before exponentiation, with no special handling for large inputs. This is especially convenient in deep neural networks, where score magnitudes can vary significantly.

3. **Gradient Descent Optimization**: Since the shifted and unshifted softmax produce identical outputs, the loss and the gradients computed during backpropagation are also identical. The shift can therefore be used in the forward pass without altering what the network learns.

4. **Score Normalization**: Only the differences between scores affect the output, so the scores are determined only up to an additive constant. Shifting them (for example so that the maximum is 0) keeps the values fed into the softmax function within a more controlled range, which can sometimes aid in achieving more stable training dynamics.

In conclusion, this property of the softmax function is very useful when training neural networks, especially in ensuring numerical stability and simplifying implementations without affecting the resulting probabilities.
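
The numerical stability point can be illustrated with a short sketch. It assumes plain Python floats and the standard `math.exp`, which raises `OverflowError` for very large arguments; the names `naive_softmax` and `stable_softmax` and the `large_scores` values are made up for this example.

```python
import math

def naive_softmax(scores):
    """Softmax computed directly from the definition; may overflow for large scores."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(scores):
    """Softmax with theta = max(scores) subtracted first (same output, no overflow)."""
    theta = max(scores)
    exps = [math.exp(s - theta) for s in scores]  # every exponent is <= 0
    total = sum(exps)                             # total >= 1, since one term is exp(0)
    return [e / total for e in exps]

large_scores = [1000.0, 1001.0, 1002.0]  # hypothetical large scores

try:
    naive_softmax(large_scores)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

probs = stable_softmax(large_scores)
print("naive version overflowed:", naive_overflowed)
print("stable softmax:", probs)
```

The stable version returns the same probabilities the naive version would produce for the scores `[0.0, 1.0, 2.0]`, since the two score vectors differ only by a constant shift.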

#### **&#x1F516;** **(&#x36;)** 

We would like to implement a deep wireless transmitter by training a fully connected neural net with 5 hidden layers, each with 10 hidden units. The input is a 20 dimensional vector and the output is a scalar. Calculate the total number of trainable parameters.

--- 

To calculate the total number of trainable parameters in a fully connected neural network, we need to consider the weights and biases for each layer. Here's a step-by-step breakdown:

### Network Architecture
1. **Input Layer**: 20-dimensional input vector
2. **Hidden Layer 1**: 10 hidden units
3. **Hidden Layer 2**: 10 hidden units
4. **Hidden Layer 3**: 10 hidden units
5. **Hidden Layer 4**: 10 hidden units
6. **Hidden Layer 5**: 10 hidden units
7. **Output Layer**: 1 scalar output

### Calculation of Trainable Parameters

#### Input to Hidden Layer 1
- Weights: $ 20 $ (input units) $\times 10 $ (hidden units) = $ 200 $
- Biases: $ 10 $ (one bias per hidden unit)

Total for Hidden Layer 1: $ 200 + 10 = 210 $

#### Hidden Layer 1 to Hidden Layer 2
- Weights: $ 10 $ (hidden units) $\times 10 $ (hidden units) = $ 100 $
- Biases: $ 10 $ (one bias per hidden unit)

Total for Hidden Layer 2: $ 100 + 10 = 110 $

#### Hidden Layer 2 to Hidden Layer 3
- Weights: $ 10 $ (hidden units) $\times 10 $ (hidden units) = $ 100 $
- Biases: $ 10 $ (one bias per hidden unit)

Total for Hidden Layer 3: $ 100 + 10 = 110 $

#### Hidden Layer 3 to Hidden Layer 4
- Weights: $ 10 $ (hidden units) $\times 10 $ (hidden units) = $ 100 $
- Biases: $ 10 $ (one bias per hidden unit)

Total for Hidden Layer 4: $ 100 + 10 = 110 $

#### Hidden Layer 4 to Hidden Layer 5
- Weights: $ 10 $ (hidden units) $\times 10 $ (hidden units) = $ 100 $
- Biases: $ 10 $ (one bias per hidden unit)

Total for Hidden Layer 5: $ 100 + 10 = 110 $

#### Hidden Layer 5 to Output Layer
- Weights: $ 10 $ (hidden units) $\times 1 $ (output unit) = $ 10 $
- Biases: $ 1 $ (one bias for the output unit)

Total for Output Layer: $ 10 + 1 = 11 $

### Summing Up All Layers
$
210 \text{ (Layer 1)} + 110 \text{ (Layer 2)} + 110 \text{ (Layer 3)} + 110 \text{ (Layer 4)} + 110 \text{ (Layer 5)} + 11 \text{ (Output Layer)} = 661
$

Thus, the total number of trainable parameters in this neural network is $ 661 $.
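
The same count can be reproduced programmatically. The helper below is a minimal sketch, assuming every layer is fully connected with one bias per unit, as in the calculation above; the function name `count_dense_parameters` is chosen for this example.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of each layer, from input to output.
    Each pair of consecutive layers contributes fan_in * fan_out weights
    plus fan_out biases.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out
    return total

# 20-dimensional input, 5 hidden layers of 10 units, scalar output.
layer_sizes = [20, 10, 10, 10, 10, 10, 1]
print(count_dense_parameters(layer_sizes))  # 661
```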