# Arg Max and Soft Max
---

Prerequisites:
- How neural networks work and the main idea behind them
- The main idea behind backpropagation
- How neural networks work with multiple input and output nodes


## Why Arg-Max or Soft-Max?

In the multi-input, multi-output neural network, we saw that when we plug in two input values, sepal width and petal width, the network gives us three output values: one each for setosa, versicolor, and virginica.

Notice that these raw output values are not restricted to the range 0 to 1.

For example, the output nodes might give values such as -1.7, 0.6, or 1.8.

**Reason one**
This **broad range of raw output values makes the results harder to interpret**.


That's why, before the final decision is made, we pass the raw output values to either ArgMax or SoftMax.




## **Arg-max**

![](https://raw.githubusercontent.com/srkds/deep-learning-foundation/main/assets/argmax.jpg)

ArgMax simply sets the largest raw output value to 1 and all the others to 0.

**For example**

In the table below, ArgMax sets the final output for Setosa to 1 and the others to 0.


|Label|Raw output| Argmax value |
|---|---|---|
|Setosa|1.43|1|
|Versicolor|-0.4|0|
|Virginica|0.23|0|

**Takeaway**

So when we use ArgMax in a neural network, the prediction is simply the output with a 1 in it.

This makes the output very easy to interpret.
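The table above can be reproduced with a few lines of NumPy. This is a minimal sketch, assuming NumPy is installed; the names `argmax_one_hot`, `labels`, and `raw_outputs` are illustrative and do not come from the original notebook.

```python
import numpy as np

def argmax_one_hot(raw_outputs):
    """Return a vector with 1 at the largest raw output and 0 elsewhere."""
    raw_outputs = np.asarray(raw_outputs, dtype=float)
    one_hot = np.zeros_like(raw_outputs)
    one_hot[np.argmax(raw_outputs)] = 1.0  # index of the largest value gets a 1
    return one_hot

labels = ["Setosa", "Versicolor", "Virginica"]
raw_outputs = [1.43, -0.4, 0.23]  # raw output values from the table

argmax_values = argmax_one_hot(raw_outputs)
predicted_label = labels[int(np.argmax(raw_outputs))]
print(argmax_values, predicted_label)
```

The prediction is read off as the label whose position holds the 1.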

**Problem with Argmax**

We can't use it to optimize the weights and biases in the neural network.

This is because the output values from ArgMax are the constants 0 and 1.

The diagram below shows why.
![](https://raw.githubusercontent.com/srkds/deep-learning-foundation/main/assets/argmax-derivative.jpg)

The diagram marks the second-largest raw output value, 0.23, on the horizontal axis.

ArgMax outputs 1 for any raw output value greater than 0.23, and it outputs 0 for any raw output value less than 0.23.

Because both of these flat lines have a slope of 0, their derivatives are also 0.

That means that when we try to find the optimal value for any weight or bias in the network, we end up plugging 0 into the chain rule as the derivative of ArgMax with respect to the raw output value. The whole derivative then becomes 0.

And if we plug 0 into gradient descent, we never step towards the optimal parameter values.

**📌 That means we can't use ArgMax for backpropagation.**


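$$ \frac{d \; \text{Loss}}{d \; \text{some weight}} = \frac{d \; \text{Loss}}{d \; \text{ArgMax}} \times \frac{d \; \text{ArgMax}}{d \; \text{Raw output}} \times \dots $$


$$ \frac{d \; \text{Loss}}{d \; \text{some weight}} = \frac{d \; \text{Loss}}{d \; \text{ArgMax}} \times 0 \times \dots $$


$$ \frac{d \; \text{Loss}}{d \; \text{some weight}} = 0 $$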
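The zero slope can also be checked numerically. The sketch below is an illustration only: `argmax_setosa` is a hypothetical helper that treats the Setosa raw output as the variable and keeps the other two raw outputs fixed at -0.4 and 0.23.

```python
def argmax_setosa(setosa_raw, others=(-0.4, 0.23)):
    """ArgMax output for Setosa: 1 if its raw value is the largest, else 0."""
    return 1.0 if setosa_raw > max(others) else 0.0

h = 0.0001  # small bump used for the finite-difference slope
slopes = []
for x in (-1.0, 0.0, 1.43, 2.0):
    # slope = (f(x + h) - f(x)) / h
    slope = (argmax_setosa(x + h) - argmax_setosa(x)) / h
    slopes.append(slope)
    print(f"raw output {x}: ArgMax = {argmax_setosa(x)}, slope = {slope}")
```

Every test point lies on one of the two flat lines, so every slope comes out as 0. The only exception is the jump at exactly 0.23, where the derivative is not defined.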


## SoftMax function

![](https://raw.githubusercontent.com/srkds/deep-learning-foundation/main/assets/softmax.jpg)

This problem with ArgMax is what leads us to the SoftMax function.

When we want to use ArgMax for the final output, we can use SoftMax for training.

Let's work through the calculation.

**SoftMax output value for Setosa**

$$  SoftMax_{setosa}(Output\;Values) =  \frac{e^{setosa}}{e^{setosa} + e^{versicolor} + e^{virginica} } $$

$$  SoftMax_{setosa}(Output\;Values) =  \frac{e^{1.43}}{e^{1.43} + e^{-0.4} + e^{0.23} } \approx 0.68 $$

So the SoftMax output value for Setosa is about 0.68.

**SoftMax output value for Versicolor**

Only the numerator changes.
$$  SoftMax_{versicolor}(Output\;Values) =  \frac{e^{versicolor}}{e^{setosa} + e^{versicolor} + e^{virginica} } $$

$$  SoftMax_{versicolor}(Output\;Values) =  \frac{e^{-0.4}}{e^{1.43} + e^{-0.4} + e^{0.23} } \approx 0.11 $$

So the SoftMax output value for Versicolor is about 0.11.


**Finally, the SoftMax output value for Virginica**

Again, only the numerator changes.
$$  SoftMax_{virginica}(Output\;Values) =  \frac{e^{virginica}}{e^{setosa} + e^{versicolor} + e^{virginica} } $$

$$  SoftMax_{virginica}(Output\;Values) =  \frac{e^{0.23}}{e^{1.43} + e^{-0.4} + e^{0.23} } \approx 0.21 $$

So the SoftMax output value for Virginica is about 0.21.

|Label|Raw output| SoftMax value |
|---|---|---|
|Setosa|1.43|0.68|
|Versicolor|-0.4|0.11|
|Virginica|0.23|0.21|


- The largest raw output value, 1.43, is paired with the largest SoftMax output value.
- Likewise, the second-largest and the smallest raw output values are paired with the second-largest and the smallest SoftMax output values.
- In other words, the SoftMax function preserves the original order, or ranking, of the raw output values.
- It is also worth noting that every SoftMax output value is between 0 and 1.
- SoftMax guarantees this: regardless of how many raw output values there are, the SoftMax output values will always be between 0 and 1.
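The observations in the list above can be checked directly. The following is a small sketch, assuming NumPy; `softmax_all` is an illustrative helper that returns all SoftMax output values at once, and it is not part of the original notebook.

```python
import numpy as np

def softmax_all(raw_outputs):
    """Return the SoftMax output value for every raw output value."""
    exps = np.exp(np.asarray(raw_outputs, dtype=float))
    return exps / exps.sum()

raw_outputs = [1.43, -0.4, 0.23]
probs = softmax_all(raw_outputs)

# Same ranking: sorting by raw output or by SoftMax output gives the same order
same_ranking = np.array_equal(np.argsort(raw_outputs), np.argsort(probs))
# Range: every SoftMax output lies strictly between 0 and 1
in_range = bool(np.all((probs > 0) & (probs < 1)))
print(np.round(probs, 2), same_ranking, in_range)
```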

## Probabilities and SoftMax

If we add up all the output values of the SoftMax function, the total is always 1. In the example above, the three SoftMax output values add up to 1.

That means that, as long as the outputs are mutually exclusive, the SoftMax output values can be interpreted as predicted "probabilities".

But keep in mind that we shouldn't put too much trust in the **accuracy** of these predicted probabilities, because they depend on the weights and biases in the neural network.

And the weights and biases, in turn, depend on the randomly selected initial values.

So if we start from different initial parameter values, then after training we can end up with different optimal values that classify the data just as well as the previously trained parameters.

But they will give different raw output values, and therefore different final SoftMax output values.

**Takeaway**

The predicted probabilities depend not only on the input values but also on the random initial values for the weights and biases, so we shouldn't put a lot of trust in them as true probabilities.
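To make this concrete, here is a hedged illustration. The second set of raw output values (`run_b`) is made up for this example and does not come from a real training run; it only stands in for what a second training from different initial values might produce.

```python
import numpy as np

def softmax_all(raw_outputs):
    """Return the SoftMax output value for every raw output value."""
    exps = np.exp(np.asarray(raw_outputs, dtype=float))
    return exps / exps.sum()

run_a = [1.43, -0.4, 0.23]  # raw output values used in this section
run_b = [2.90, -1.1, 0.35]  # hypothetical values, for illustration only

probs_a = softmax_all(run_a)
probs_b = softmax_all(run_b)

# Both runs pick the same class, but with different predicted probabilities
print(np.argmax(probs_a), np.round(probs_a, 2))
print(np.argmax(probs_b), np.round(probs_b, 2))
```

Both sets of values classify the observation as Setosa, and both sets of SoftMax values sum to 1, yet the predicted probability for Setosa differs noticeably between them.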




## SoftMax Equation

This is the general equation for the SoftMax function

$$  SoftMax_{i}(Output Values) =  \frac{e^{Output\;value_i}}{ \sum_{j=1}^ke^{Output \; value_j}}$$

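Here, i is the index of an individual raw output value and k is the total number of raw output values (3 in this example).

For example, if i = 2, we are talking about the raw output value for Versicolor.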
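The general equation translates almost directly into NumPy for any number of raw output values. The sketch below is one possible implementation, not the notebook's own. Subtracting the largest value before exponentiating is a common numerical-stability trick that is added here as an assumption; it cancels out in the ratio, so the result is unchanged.

```python
import numpy as np

def softmax(raw_outputs):
    """SoftMax for k raw output values: e^(value_i) / sum_j e^(value_j)."""
    z = np.asarray(raw_outputs, dtype=float)
    shifted = z - z.max()        # leaves the result unchanged, avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()

three_outputs = softmax([1.43, -0.4, 0.23])        # k = 3, the iris example
five_outputs = softmax([0.5, 2.0, -1.0, 0.0, 1.2])  # k = 5, arbitrary values
large_outputs = softmax([1000.0, 1001.0])           # would overflow without the shift

# i = 2 in the equation is index 1 in Python, because Python counts from 0
versicolor = three_outputs[1]
print(np.round(three_outputs, 2), round(float(versicolor), 2))
```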

##  Takeaway

ArgMax is easy to interpret but cannot be used in backpropagation, because its derivative is 0.

In contrast, SoftMax can be used in backpropagation because its derivative is not always 0.

In [13]:
import matplotlib.pyplot as plt
import numpy as np
from math import e

In [14]:
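def softmax(values, i):
  """Return the SoftMax output for the raw output value at index i."""
  # Denominator: sum of e raised to each raw output value
  total = 0
  for value in values:
    total += e**value
  # Numerator: e raised to the raw output value we are interested in
  return e**values[i] / total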

In [15]:
raw_op = [1.43, -0.4, 0.23]
softmax(raw_op, 0) # SoftMax value for Setosa (1st value)

0.6841780769762262

In [16]:
softmax(raw_op, 1) # SoftMax value for Versicolor (2nd value)


0.10975144632131326

In [17]:
softmax(raw_op, 2) # SoftMax value for Virginica (3rd value)


0.20607047670246045

In [18]:
# Derivatives of the predicted probability for Setosa
# Bump each raw output value up by a small amount h, one at a time
# Finite-difference approximation of the derivative:
# d = (f(x+h) - f(x)) / h

h = 0.0001

# derivative of the predicted probability for Setosa wrt the raw output value for Setosa
new_raw_op = [1.43+h, -0.4, 0.23]
s_wrt_s = (softmax(new_raw_op, 0) - softmax(raw_op, 0)) / h

new_raw_op = [1.43, -0.4+h, 0.23] # bump up the raw output value for Versicolor by h
s_wrt_vc = (softmax(new_raw_op, 0) - softmax(raw_op, 0)) / h

new_raw_op = [1.43, -0.4, 0.23+h] # bump up the raw output value for Virginica by h
s_wrt_virg = (softmax(new_raw_op, 0) - softmax(raw_op, 0)) / h

print(f"derivative of predicted value of Setosa wrt raw output value of Setosa is : {s_wrt_s}")
print(f"derivative of predicted value of Setosa wrt raw output value of Versicolor is : {s_wrt_vc}")
print(f"derivative of predicted value of Setosa wrt raw output value of Virginica is : {s_wrt_virg}")

derivative of predicted value of Setosa wrt raw output value of Setosa is : 0.21607445616411702
derivative of predicted value of Setosa wrt raw output value of Versicolor is : -0.07509246389925117
derivative of predicted value of Setosa wrt raw output value of Virginica is : -0.1409930465556819



From the experiment above we can see that the derivative of SoftMax is not always 0, and that's why we can use it for gradient descent, unlike ArgMax, whose derivative is either 0 or undefined.

So neural networks with multiple outputs often use SoftMax for training and then use ArgMax, whose output is easy to understand, to classify new observations.
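The finite-difference numbers above can be cross-checked against the standard closed-form derivative of SoftMax. The notation here is introduced only for this sketch: p_i is the SoftMax output for class i and z_j is the raw output value for class j.

$$ \frac{d \; p_i}{d \; z_j} = p_i (1 - p_i) \;\; \text{if } i = j, \qquad \frac{d \; p_i}{d \; z_j} = - p_i \, p_j \;\; \text{if } i \neq j $$

A minimal NumPy check, assuming the same raw output values as in the experiment:

```python
import numpy as np

def softmax_all(raw_outputs):
    """Return the SoftMax output value for every raw output value."""
    exps = np.exp(np.asarray(raw_outputs, dtype=float))
    return exps / exps.sum()

raw_outputs = [1.43, -0.4, 0.23]
p = softmax_all(raw_outputs)

# jacobian[i, j] = d p_i / d z_j = p_i * (1 if i == j else 0) - p_i * p_j
jacobian = np.diag(p) - np.outer(p, p)

# Row 0: derivatives of the Setosa probability wrt each raw output value
setosa_row = jacobian[0]
print(setosa_row)
```

The first row should agree with the three finite-difference values to about three decimal places, since h = 0.0001 only gives an approximation.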

## Next: Cross entropy

In the example of how a neural network works, I used the sum of squared residuals as the loss function to determine how well the neural network fits the data.

However, when we use the SoftMax function, the output values are predicted probabilities between 0 and 1, so we often use **cross entropy** to determine how well the neural network fits the data.
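As a brief preview, cross entropy for a single observation is commonly computed as the negative log of the predicted probability for the observed class. The sketch below uses that standard definition, which is not spelled out in this section; `cross_entropy` and `true_index` are illustrative names.

```python
import numpy as np

def cross_entropy(predicted_probs, true_index):
    """Negative log of the predicted probability for the observed class."""
    return -np.log(predicted_probs[true_index])

raw_outputs = np.array([1.43, -0.4, 0.23])
probs = np.exp(raw_outputs) / np.exp(raw_outputs).sum()  # SoftMax output values

loss_if_setosa = cross_entropy(probs, 0)      # observed species is Setosa
loss_if_versicolor = cross_entropy(probs, 1)  # observed species is Versicolor
print(loss_if_setosa, loss_if_versicolor)
```

The loss is small when the observed class has a high predicted probability and grows as that probability shrinks.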
