# The softmax function

The softmax function is widely used in machine learning, particularly in classification tasks. The sections below cover its definition, its main uses, and a worked example in Julia with Flux.

### Definition

The softmax function takes an $n$-dimensional vector of real numbers and transforms it into a probability distribution over $n$ possible outcomes. Each component of the resulting vector lies strictly between 0 and 1, and the components sum to 1. For a vector $\mathbf{z} = (z_1, \ldots, z_n)$, the softmax function is defined as:

$ \text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^n e^{z_j}} $
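
As a quick check of the formula, take $\mathbf{z} = (1, 2, 3)$, a vector chosen purely for illustration. Then $e^{1} \approx 2.718$, $e^{2} \approx 7.389$, $e^{3} \approx 20.086$, and the denominator is approximately $30.193$, so:

$ \text{softmax}(\mathbf{z}) \approx \left( \frac{2.718}{30.193}, \frac{7.389}{30.193}, \frac{20.086}{30.193} \right) \approx (0.090, 0.245, 0.665) $

The entries are positive, sum to 1, and keep the ordering of the original values.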

### Usage

1. **Multi-class Classification**:
   - **Output Layer of Neural Networks**: The most common use of the softmax function is in the output layer of neural networks for multi-class classification problems. It converts the raw output scores (logits) into probabilities, which are easier to interpret and work with.
   - **Probability Interpretation**: The output probabilities indicate the likelihood of each class. The class with the highest probability is typically chosen as the predicted class.

2. **Probability Distribution**:
   - **Normalization**: Softmax exponentiates the logits and divides each result by their sum, so the outputs are positive and sum to 1. This property is crucial for tasks that require a valid probability distribution over classes.

3. **Cross-Entropy Loss**:
   - **Training Stability**: When combined with the cross-entropy loss function, softmax yields a loss that is smooth in the logits, with a gradient that stays bounded, which helps gradient-based optimization methods train neural networks efficiently.
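
The gradient behaviour mentioned in the last item can be made concrete with a short derivation. It assumes a target vector $\mathbf{y}$ whose entries sum to 1 (for example, a one-hot label) and writes $p_i = \text{softmax}(\mathbf{z})_i$; the symbols $\mathbf{y}$, $p_i$, and $L$ are introduced here for illustration. The cross-entropy loss is:

$ L = -\sum_{i=1}^n y_i \log p_i = -\sum_{i=1}^n y_i z_i + \log \sum_{j=1}^n e^{z_j} $

where the second form uses $\log p_i = z_i - \log \sum_{j=1}^n e^{z_j}$ and $\sum_{i=1}^n y_i = 1$. Differentiating with respect to a logit $z_k$ gives:

$ \frac{\partial L}{\partial z_k} = -y_k + \frac{e^{z_k}}{\sum_{j=1}^n e^{z_j}} = p_k - y_k $

Each component of the gradient is the predicted probability minus the target, so it lies between $-1$ and $1$.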

### Why Softmax is Useful

1. **Interpretable Outputs**:
   - The softmax function transforms the raw model outputs into a probability distribution, making the outputs interpretable as probabilities.

2. **Gradient Computation**:
   - Softmax, when used with cross-entropy loss, produces well-behaved gradients, which facilitate efficient learning during backpropagation.

3. **Handling Multi-class Problems**:
   - Unlike binary classification where a single sigmoid function suffices, softmax can handle multiple classes in a single pass by providing a distribution over all possible classes.
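
The link between softmax and the sigmoid in the last item can be checked directly. For two classes with logits $z_1$ and $z_2$, and writing $\sigma$ for the sigmoid function (notation assumed here), dividing numerator and denominator by $e^{z_1}$ gives:

$ \text{softmax}(\mathbf{z})_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}} = \sigma(z_1 - z_2) $

So in the binary case softmax reduces to a sigmoid applied to the difference of the two logits.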

### Example

Consider a neural network model for digit classification (0-9). The network's final layer might output a vector of logits $[z_0, z_1, \ldots, z_9]$. Applying softmax to this vector will yield:

$ \text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=0}^{9} e^{z_j}} $

This transforms the logits into probabilities, such as:

$ [0.1, 0.05, 0.7, 0.05, 0.02, 0.02, 0.03, 0.01, 0.01, 0.01] $

Here, the network predicts that the input digit is most likely '2' with a probability of 0.7.

### Conclusion

The softmax function is a crucial component in neural networks, particularly for multi-class classification tasks. It transforms logits into interpretable probabilities, enabling the network to make meaningful predictions and facilitating effective training with smooth gradients.

In [1]:
using Flux
∑ = sum

sum (generic function with 23 methods)

In [2]:
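# Define the softmax function (this definition shadows the `softmax` exported by Flux)
function softmax(z) # z -> logits; one column per sample when z is a matrix
    e = exp.(z)
    return e ./ ∑(e; dims=1)  # Normalize each column to get probabilities
end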

softmax (generic function with 1 method)
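
A definition that exponentiates the raw logits can overflow: `exp(1000.0)` is `Inf` in Float64, and `Inf / Inf` is `NaN`. A common remedy, sketched below, subtracts the largest logit before exponentiating, which leaves the result unchanged because the shift cancels between numerator and denominator. The name `stable_softmax` is hypothetical and is not part of this notebook or of Flux.

```julia
# Hypothetical helper (not from Flux): softmax with the maximum subtracted first.
function stable_softmax(z)
    shifted = z .- maximum(z; dims=1)  # largest entry in each column becomes 0
    e = exp.(shifted)                  # every value is now in (0, 1], so no overflow
    return e ./ sum(e; dims=1)         # normalize each column
end

big_logits = [1000.0, 1001.0, 1002.0]

# Direct formula: exp overflows to Inf, and Inf / Inf gives NaN.
naive = exp.(big_logits) ./ sum(exp.(big_logits))

# Shifted formula: same result as for the logits [-2.0, -1.0, 0.0].
stable = stable_softmax(big_logits)

println("Naive:  ", naive)
println("Stable: ", stable)
```

The shifted version returns roughly `[0.090, 0.245, 0.665]`, the same distribution softmax gives for the logits `[1.0, 2.0, 3.0]`, since only differences between logits matter.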

In [3]:
# Example logits from a neural network output layer (e.g., for digit classification 0-9)
logits = [2.0, 1.0, 0.1, -0.5, 0.5, 2.5, -1.0, 0.0, 1.5, -0.8];

# Apply the softmax function to the logits
probabilities = softmax(logits);

# Print the probabilities
println("Logits: ", logits)
println("Probabilities: ", probabilities)

# Verify that the probabilities sum to 1
println("Sum of probabilities: ", ∑(probabilities))


Logits: [2.0, 1.0, 0.1, -0.5, 0.5, 2.5, -1.0, 0.0, 1.5, -0.8]
Probabilities: [0.2312754983247151, 0.08508150108034303, 0.03459155694445449, 0.018984248961725746, 0.0516045389796016, 0.38130883347972966, 0.011514529046904394, 0.03129973507146406, 0.1402756805742575, 0.014063877536804418]
Sum of probabilities: 0.9999999999999999


In [4]:
# Example neural network using Flux
# Define a simple model with one hidden layer and softmax output
model = Chain(
    Dense(20, 10, sigmoid),  # Input layer: 20 features, hidden layer: 10 units
    Dense(10, 10),           # Output layer: 10 classes
    softmax                  # Apply softmax to the output
)


Chain(
  Dense(20 => 10, σ),                   # 210 parameters
  Dense(10 => 10),                      # 110 parameters
  softmax,
)                   # Total: 4 arrays, 320 parameters, 1.500 KiB.

In [5]:
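# Example input vector (20-dimensional)
# Note: rand(20) returns Float64 values while the Dense layers hold Float32 parameters,
# so Flux converts the input and warns; rand(Float32, 20) would avoid the conversion.
input_vector = rand(20)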

# Perform a forward pass through the network
output_probabilities = model(input_vector)

# Print the output probabilities
println("Output probabilities: ", output_probabilities)

# Verify that the output probabilities sum to 1
println("Sum of output probabilities: ", sum(output_probabilities))


Output probabilities: Float32[0.060023494, 0.090147905, 0.12973249, 0.080260485, 0.050536975, 0.07160723, 0.08341373, 0.13941053, 0.051136695, 0.24373041]
Sum of output probabilities: 0.99999994

│   The input will be converted, but any earlier layers may be very slow.
│   layer = Dense(20 => 10, σ)  # 210 parameters
│   summary(x) = "20-element Vector{Float64}"
└ @ Flux ~/.julia/packages/Flux/Wz6D4/src/layers/stateless.jl:60


In [6]:
# Example of defining a loss function using cross-entropy
loss(x, y) = Flux.crossentropy(model(x), y);
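
To see what a cross-entropy loss measures for a single sample, the sketch below computes it by hand for the probabilities from the digit example earlier, assuming the true digit is '2'. The helper name `manual_crossentropy` is hypothetical, and the code is a plain restatement of $-\sum_i y_i \log p_i$ rather than a description of how Flux implements its loss.

```julia
# Probabilities from the digit example; the true digit is assumed to be '2'.
p = [0.1, 0.05, 0.7, 0.05, 0.02, 0.02, 0.03, 0.01, 0.01, 0.01]

# One-hot target: digit '2' sits at index 3 because Julia arrays are 1-based.
y = zeros(10)
y[3] = 1.0

# Hypothetical helper, written out by hand rather than taken from Flux.
manual_crossentropy(p, y) = -sum(y .* log.(p))

ce = manual_crossentropy(p, y)
println("Cross-entropy: ", ce)  # equals -log(0.7), about 0.357
```

With a one-hot target only the true class contributes, so the loss is the negative log of the probability assigned to that class and shrinks toward 0 as that probability approaches 1.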

In [7]:
# Generate example data
X = rand(20, 100)  # 100 samples of 20-dimensional input
Y = Flux.onehotbatch(rand(1:10, 100), 1:10)  # 100 one-hot encoded labels for 10 classes

10×100 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  …  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  1  1  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  1  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  1  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1
 ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  1  ⋅  …  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  1  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅
 1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅     1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅

In [8]:
# Train the model with plain gradient descent; the data iterator holds a single
# (X, Y) batch, so this call performs exactly one update step
opt = Descent(0.01)
Flux.train!(loss, Flux.params(model), [(X, Y)], opt);

In [9]:
# Example prediction
prediction = model(input_vector)
println("Prediction: ", prediction)

Prediction: Float32[0.06021196, 0.09020385, 0.1295892, 0.080533534, 0.050752405, 0.07192128, 0.08346487, 0.1394814, 0.05117373, 0.24266773]
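
To turn a probability vector like this into a class decision, one option is to take the index of the largest entry. The sketch below copies the printed values and assumes the labels are `1:10`, as in the one-hot encoding used for `Y`, so the index is the label itself.

```julia
# Probabilities copied from the prediction printed above.
prediction = Float32[0.06021196, 0.09020385, 0.1295892, 0.080533534, 0.050752405,
                     0.07192128, 0.08346487, 0.1394814, 0.05117373, 0.24266773]

predicted_class = argmax(prediction)      # index of the largest probability
confidence = prediction[predicted_class]  # probability assigned to that class

println("Predicted class: ", predicted_class, " (probability ", confidence, ")")
```

Here the largest entry is the last one, so class 10 is selected. Because the training labels in this notebook are random, the choice carries no real meaning; the snippet only illustrates the mechanics.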
