# Softmax

Softmax is an activation function that converts a vector of raw, real-valued scores (logits) into a probability distribution. Each output value lies between 0 and 1, and all values sum to 1.



- It's the standard output activation for multi-class classification problems.


- The neuron with the highest probability represents the model's final prediction.

- Example: For an input image, the model might output logits $[1.2, 2.9, 0.4]$ for the classes ['cat', 'dog', 'bird']. After applying softmax, these become probabilities of approximately $[0.144, 0.791, 0.065]$. The model's prediction is 'dog' since it has the highest probability (about 0.79).
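
The probabilities for the example logits can be checked with a few lines of NumPy. This is a minimal sketch; the variable names (`class_names`, `example_logits`, `example_probs`) are illustrative and not part of any library.

```python
import numpy as np

# Illustrative names; the logits are the ones from the example above
class_names = ['cat', 'dog', 'bird']
example_logits = np.array([1.2, 2.9, 0.4])

# Exponentiate each logit, then divide by the sum of the exponentials
exp_values = np.exp(example_logits)
example_probs = exp_values / exp_values.sum()

# The predicted class is the one with the highest probability
predicted_class = class_names[int(np.argmax(example_probs))]

print(np.round(example_probs, 3))  # [0.144 0.791 0.065]
print(predicted_class)             # dog
```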

  
$$\text{Softmax}(Z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

_Where:_

- $Z$: The vector of input logits for a single sample (e.g., $[1.2, 2.9, 0.4]$, with one entry for each of ['cat', 'dog', 'bird']).

- $z_i$: The $i$-th element of the vector $Z$. This single value is the raw score (logit) for the $i$-th class.
  - Example: If your classes are ['cat', 'dog', 'bird'], then $z_1$ is the score for 'cat', $z_2$ is the score for 'dog', and $z_3$ is the score for 'bird'.

- $K$: The number of classes. In the example above, $K = 3$ (three animals).

- $e^{z_i}$: The standard exponential function applied to each logit. This makes all values positive.

- $\sum_{j=1}^{K} e^{z_j}$: The normalization term. It's the sum of all the exponentiated logits, which ensures the final probabilities sum to 1.
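
As a worked instance of the formula, take the logits $Z = [1.2, 2.9, 0.4]$ from the example above (values rounded to three decimals):

$$e^{1.2} \approx 3.320, \quad e^{2.9} \approx 18.174, \quad e^{0.4} \approx 1.492$$

$$\sum_{j=1}^{3} e^{z_j} \approx 3.320 + 18.174 + 1.492 = 22.986$$

$$\text{Softmax}(Z)_2 = \frac{e^{2.9}}{\sum_{j=1}^{3} e^{z_j}} \approx \frac{18.174}{22.986} \approx 0.791$$

The same division gives roughly 0.144 for 'cat' and 0.065 for 'bird', and the three probabilities sum to 1.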

**Numerical Stability:**
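- Directly calculating $e^{z_i}$ can be a problem. If your logits are large (e.g., 1000), $e^{1000}$ exceeds the largest representable floating-point value and overflows to `inf`; the division then yields `nan`. To fix this, we subtract the maximum logit from all logits before exponentiating. The largest exponent becomes 0, so no term can overflow, and the result is unchanged because the common factor $e^{-\max(Z)}$ cancels between the numerator and the denominator.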

$$\text{Softmax}(Z)_i = \frac{e^{z_i - \max(Z)}}{\sum_{j=1}^{K} e^{z_j - \max(Z)}}$$
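
To see why the shift matters, the sketch below compares a direct implementation with the stabilized one on large logits. The function names `naive_softmax` and `stable_softmax` are illustrative, and the input is assumed to be a single 1-D vector of logits.

```python
import numpy as np

def naive_softmax(z):
    """Direct translation of the formula; overflows for large logits."""
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

def stable_softmax(z):
    """Same formula after subtracting the maximum logit."""
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

large_logits = np.array([1000.0, 1001.0, 1002.0])

# Silence NumPy's overflow/invalid-value warnings for the demonstration
with np.errstate(over='ignore', invalid='ignore'):
    # exp(1000) overflows to inf, and inf / inf evaluates to nan
    naive_result = naive_softmax(large_logits)

stable_result = stable_softmax(large_logits)

print(naive_result)    # [nan nan nan]
print(stable_result)   # approximately [0.090, 0.245, 0.665]
```

Only the differences between logits matter, so the stabilized result for `[1000, 1001, 1002]` is the same as the softmax of `[0, 1, 2]`.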

```python
import numpy as np


def softmax(logits):
    """Apply softmax row-wise to a batch of logits.

    Args:
        logits: Array of shape (num_samples, num_classes), the output of the last layer.

    Returns:
        Array of the same shape in which each row sums to 1.
    """
    # Subtract each row's max for numerical stability
    exp_logits = np.exp(logits - np.max(logits, axis=1, keepdims=True))
    # Normalize each row to get probabilities
    return exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
```
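
Note that `axis=1` assumes a 2-D batch of shape (num_samples, num_classes); passing a single 1-D vector of logits would raise an error. One possible variant, sketched below under the hypothetical name `softmax_last_axis`, normalizes over the last axis so that both cases work. This is an illustrative alternative, not a replacement for the function above.

```python
import numpy as np

def softmax_last_axis(logits):
    """Softmax over the last axis; accepts 1-D vectors and 2-D batches."""
    logits = np.asarray(logits, dtype=float)
    # Subtract the max along the last axis for numerical stability
    exp_logits = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    # Normalize along the same axis
    return exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)

single = softmax_last_axis([1.2, 2.9, 0.4])                    # one sample
batch = softmax_last_axis([[1.2, 2.9, 0.4], [0.0, 0.0, 0.0]])  # two samples

print(single.shape, batch.shape)   # (3,) (2, 3)
print(batch.sum(axis=-1))          # [1. 1.]
```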

```python
# --- Example Usage ---

# Let's simulate a batch of 4 samples for the 3 classes
num_images = 4
num_classes = 3  # Corresponds to ['cat', 'dog', 'bird']

# The output of the previous layer (the logits)
# Shape: (num_images, num_classes)
logits = np.random.randn(num_images, num_classes) * 2  # Multiply by 2 for more varied logits
logits
```

```
array([[ 0.5003897 , -1.93858329,  1.42608783],
       [ 4.05780403, -0.65232111,  1.97923639],
       [ 1.85062418, -2.6290737 ,  0.25195059],
       [-3.9246835 , -1.30803736, -1.34155564]])
```

```python
print("Original Logits:\n", logits)
print("-" * 20)

# Apply the softmax function
probabilities = softmax(logits)

print("Probabilities after Softmax:\n", probabilities)
print("-" * 20)

# Verify that each row (sample) sums to 1
print("Sum of probabilities for each sample:\n", np.sum(probabilities, axis=1))
```

```
Original Logits:
 [[ 0.5003897  -1.93858329  1.42608783]
 [ 4.05780403 -0.65232111  1.97923639]
 [ 1.85062418 -2.6290737   0.25195059]
 [-3.9246835  -1.30803736 -1.34155564]]
--------------------
Probabilities after Softmax:
 [[0.27694081 0.0241632  0.69889599]
 [0.8817464  0.00793894 0.11031466]
 [0.82406173 0.00934225 0.16659602]
 [0.03580608 0.49017573 0.47401818]]
--------------------
Sum of probabilities for each sample:
 [1. 1. 1. 1.]
```
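
To turn these probabilities into class predictions, take the index of the largest value in each row. The sketch below hard-codes the logits printed above so that it runs on its own; `class_names`, `predicted_indices`, and `predicted_names` are illustrative names.

```python
import numpy as np

def softmax(logits):
    # Same row-wise softmax as defined earlier
    exp_logits = np.exp(logits - np.max(logits, axis=1, keepdims=True))
    return exp_logits / np.sum(exp_logits, axis=1, keepdims=True)

class_names = ['cat', 'dog', 'bird']

# Logits copied from the printed output above
logits = np.array([[ 0.5003897 , -1.93858329,  1.42608783],
                   [ 4.05780403, -0.65232111,  1.97923639],
                   [ 1.85062418, -2.6290737 ,  0.25195059],
                   [-3.9246835 , -1.30803736, -1.34155564]])

probabilities = softmax(logits)

# Index of the most probable class for each sample
predicted_indices = np.argmax(probabilities, axis=1)
predicted_names = [class_names[i] for i in predicted_indices]

print(predicted_indices)  # [2 0 0 1]
print(predicted_names)    # ['bird', 'cat', 'cat', 'dog']
```

Because the exponential is monotonically increasing, `np.argmax(logits, axis=1)` returns the same indices. Softmax is needed when the probabilities themselves are used, for example in a loss function.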
