(softmax)= 
# Chapter 18 -- Softmax

The cross-entropy cost can be used to address the problem of learning slowdown. However, I want to briefly describe another approach to the problem, based on what are called softmax layers of neurons. Softmax is worth understanding, in part because it's intrinsically interesting, and in part because we'll use softmax layers in our discussion of deep neural networks.

The idea of softmax is to define a new type of output layer for our neural networks. It begins in the same way as with a sigmoid layer, by forming the weighted inputs

$$
h = w_1 x_1 + w_2 x_2 + \cdots + b
$$ (eq18_1)

However, we don't apply the sigmoid function to get the output. Instead, in a softmax layer we apply the so-called softmax function

$$
\mathrm{softmax}(h)^{(n)}_a = \frac{e^{h^{(n)}_{a}}}{\sum_b e^{h^{(n)}_b}}
$$ (eq18_2)

where $\mathrm{softmax}(h)^{(n)}_a$ is the predicted probability, between 0 and 1, for the $a^{th}$ neuron in the output layer $n$, and the sum in the denominator runs over all neurons $b$ in that layer.

The $\mathrm{softmax}(h)^{(n)}_a$ always sum to 1. In other words, the output from the softmax layer can be thought of as a probability distribution. The fact that a softmax layer outputs a probability distribution is rather pleasing. In many problems it's convenient to be able to interpret the output activation $a^{(n)}_a = \mathrm{softmax}(h)^{(n)}_a$ as the network's estimate of the probability that the correct output is $a$. So, for instance, in the MNIST classification problem, we can interpret $a^{(n)}_a$ as the network's estimated probability that the correct digit classification is $a$.
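
One way to see why the outputs sum to 1 is to add up the softmax expression over all output neurons. This is only a short worked step using the symbols from the definition above, not a new result:

$$
\sum_a \frac{e^{h^{(n)}_a}}{\sum_b e^{h^{(n)}_b}} = \frac{\sum_a e^{h^{(n)}_a}}{\sum_b e^{h^{(n)}_b}} = 1
$$

Each numerator $e^{h^{(n)}_a}$ is positive and is one of the terms in the denominator, so when the layer has more than one neuron every individual output lies strictly between 0 and 1.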

By contrast, if the output layer were a sigmoid layer, then we certainly couldn't assume that the activations formed a probability distribution. I won't explicitly prove it, but it should be plausible that the activations from a sigmoid layer won't in general form a probability distribution: each sigmoid activation depends only on its own weighted input, so nothing constrains the activations to sum to 1. And so with a sigmoid output layer we don't have such a simple interpretation of the output activations.
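
The contrast can be checked numerically. The following is a minimal sketch, assuming a three-neuron output layer with made-up weighted inputs; the helper names `sigmoid` and `softmax` and the values in `h` are illustrative choices, not part of any particular network:

```python
import numpy as np

def sigmoid(h):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-h))

def softmax(h):
    """Softmax over a 1-D vector of weighted inputs."""
    exp_h = np.exp(h - np.max(h))  # shift by the max for numerical stability
    return exp_h / np.sum(exp_h)

# Hypothetical weighted inputs for a three-neuron output layer
h = np.array([2.0, 1.0, 0.1])

sigmoid_out = sigmoid(h)
softmax_out = softmax(h)

print("Sigmoid activations:", sigmoid_out, "sum =", sigmoid_out.sum())
print("Softmax activations:", softmax_out, "sum =", softmax_out.sum())
```

For these inputs the sigmoid activations sum to roughly 2.14, while the softmax activations sum to 1, so only the softmax output can be read directly as a probability distribution.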

### Mathematical Properties

The softmax function has several important properties:

- Range: Each output of the softmax function lies in the range $(0, 1)$.
- Sum: The outputs of the softmax function sum to 1, so together they form a valid probability distribution.
- Exponentiation: Exponentiating the weighted inputs makes every output positive and amplifies the differences between inputs, so a larger weighted input receives a disproportionately larger share of the probability.
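
These properties are easy to check numerically. The sketch below assumes a small vector of made-up weighted inputs; the function name `softmax` and the values in `h` are illustrative only:

```python
import numpy as np

def softmax(h):
    """Softmax over a 1-D vector of weighted inputs."""
    exp_h = np.exp(h - np.max(h))  # shift by the max for numerical stability
    return exp_h / np.sum(exp_h)

h = np.array([3.0, 1.0, -2.0, 0.5])  # hypothetical weighted inputs
p = softmax(h)

# Range: every output is strictly between 0 and 1
in_range = bool(np.all((p > 0) & (p < 1)))

# Sum: the outputs add up to 1
sums_to_one = bool(np.isclose(p.sum(), 1.0))

# Exponentiation: the ratio of two outputs equals the exponential
# of the difference between their weighted inputs
ratio = p[0] / p[1]
expected_ratio = np.exp(h[0] - h[1])

# Doubling all inputs widens the gaps, so the largest input takes a bigger share
p_scaled = softmax(2.0 * h)

print("in range:", in_range, "| sums to one:", sums_to_one)
print("ratio p[0]/p[1]:", ratio, "| exp(h[0] - h[1]):", expected_ratio)
print("largest probability before and after doubling:", p[0], p_scaled[0])
```

The ratio check makes the role of exponentiation concrete: a gap of 2 between two weighted inputs translates into a factor of $e^2$ between their probabilities.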


### Example Usage in Neural Networks
In practice, softmax is often used in the final layer of a neural network for classification tasks, where the network needs to output probabilities for each class.
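
As a rough illustration of where softmax sits in a classifier, here is a minimal sketch of a single output layer applied to a batch of inputs. The layer sizes, the random weights `W`, the zero biases `b`, and the helper name `softmax_rows` are all assumptions made for the example, not values from a trained network:

```python
import numpy as np

def softmax_rows(H):
    """Row-wise softmax for a 2-D array of shape (batch, classes)."""
    shifted = H - np.max(H, axis=1, keepdims=True)  # stabilise each row separately
    exp_H = np.exp(shifted)
    return exp_H / np.sum(exp_H, axis=1, keepdims=True)

rng = np.random.default_rng(0)

n_features, n_classes = 4, 3                   # hypothetical layer sizes
W = rng.normal(size=(n_features, n_classes))   # made-up weights
b = np.zeros(n_classes)                        # made-up biases

X = rng.normal(size=(5, n_features))           # a batch of 5 made-up inputs

H = X @ W + b                  # weighted inputs (logits), shape (5, 3)
P = softmax_rows(H)            # one probability distribution per row
predicted = np.argmax(P, axis=1)  # most probable class for each input

print("Probabilities shape:", P.shape)
print("Row sums:", P.sum(axis=1))
print("Predicted classes:", predicted)
```

Because softmax preserves the ordering of the weighted inputs, the predicted class is the same whether `argmax` is taken over `H` or over `P`; the softmax step matters when the probabilities themselves are needed.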

### Example Code
Here is a simple implementation of the softmax function, applied to an example vector of logits such as the final layer of a network might produce:

```python
import numpy as np

# Softmax function
def softmax(z):
    exp_z = np.exp(z - np.max(z))  # Subtract max(z) for numerical stability
    return exp_z / np.sum(exp_z, axis=0)

# Example usage
def main():
    # Example input (logits from the final layer of the network)
    logits = np.array([2.0, 1.0, 0.1])

    # Compute the softmax probabilities
    probabilities = softmax(logits)

    print("Logits:", logits)
    print("Softmax probabilities:", probabilities)

if __name__ == '__main__':
    main()
```

Output:

```
Logits: [2.  1.  0.1]
Softmax probabilities: [0.65900114 0.24243297 0.09856589]
```


The softmax function transforms these logits into probabilities that sum to 1. This allows us to interpret the output as follows:

- The first class has the highest probability, about 0.659.
- The second class has a moderate probability, about 0.242.
- The third class has the lowest probability, about 0.099.
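
The implementation above subtracts `np.max(z)` before exponentiating. The sketch below illustrates why that step matters; the names `softmax_naive` and `softmax_stable` and the deliberately large logits are chosen for the illustration only:

```python
import numpy as np

def softmax_naive(z):
    """Softmax without any shift; overflows for large inputs."""
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

def softmax_stable(z):
    """Softmax with the max subtracted first, as in the implementation above."""
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

# Same gaps between logits as [2.0, 1.0, 0.1], but shifted up by 998
large_logits = np.array([1000.0, 999.0, 998.1])

# exp(1000) overflows to inf in float64, and inf / inf evaluates to nan;
# the warnings are silenced here so the result can be inspected directly
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(large_logits)

stable = softmax_stable(large_logits)

print("Naive: ", naive)
print("Stable:", stable)
```

Adding the same constant to every logit leaves the softmax output unchanged, so the stable version returns essentially the same probabilities as for `[2.0, 1.0, 0.1]`, while the naive version returns only `nan` values.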
