In [5]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.datasets import make_blobs

The softmax function can be written:
$$a_j = \frac{e^{z_j}}{ \sum_{k=1}^{N}{e^{z_k} }} \tag{1}$$
The output $\mathbf{a}$ is a vector of length N, so for softmax regression, you could also write:
\begin{align}
\mathbf{a}(\mathbf{x}) =
\begin{bmatrix}
P(y = 1 | \mathbf{x}; \mathbf{w},b) \\
\vdots \\
P(y = N | \mathbf{x}; \mathbf{w},b)
\end{bmatrix}
=
\frac{1}{ \sum_{k=1}^{N}{e^{z_k} }}
\begin{bmatrix}
e^{z_1} \\
\vdots \\
e^{z_{N}} \\
\end{bmatrix} \tag{2}
\end{align}


In [3]:
def softmax(z):
    # Exponentiate every element of z, then normalize so the outputs sum to 1
    ez = np.exp(z)
    a = ez / np.sum(ez)
    return a
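
The direct implementation above can overflow when an element of `z` is large, because `np.exp` returns `inf` for sufficiently large float inputs. A common remedy, sketched below, subtracts the maximum before exponentiating; the shift cancels in the ratio, so the result is unchanged. The name `softmax_stable` and its `axis` argument are illustrative additions, not part of the original notebook.

```python
import numpy as np

def softmax_stable(z, axis=-1):
    """Softmax along `axis`, shifted by the max for numerical stability."""
    z = np.asarray(z, dtype=float)
    # Subtracting a constant c leaves the ratio unchanged:
    # e^(z_j - c) / sum_k e^(z_k - c) = e^(z_j) / sum_k e^(z_k)
    shifted = z - np.max(z, axis=axis, keepdims=True)
    ez = np.exp(shifted)
    return ez / np.sum(ez, axis=axis, keepdims=True)

a = softmax_stable([1.0, 2.0, 3.0, 4.0])
big = softmax_stable([1000.0, 1001.0, 1002.0])  # np.exp(1000.0) alone would overflow
print(a, a.sum())
print(big, big.sum())
```

With `axis=-1` the same function also handles a 2-D batch of logits, normalizing each row separately.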

## Cost

The loss function associated with Softmax, the cross-entropy loss, is:
\begin{equation}
  L(\mathbf{a},y)=\begin{cases}
    -\log(a_1), & \text{if $y=1$}.\\
        &\vdots\\
     -\log(a_N), & \text{if $y=N$}
  \end{cases} \tag{3}
\end{equation}


Loss is for one example while Cost covers all examples.
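
As a small illustration of (3), the loss for one example is the negative log of the probability assigned to the true class. The sketch below assumes zero-based class labels (0 to N-1, as NumPy indexing uses) rather than the 1 to N indexing of the equation, and `cross_entropy_loss` is a hypothetical helper name.

```python
import numpy as np

def cross_entropy_loss(a, y):
    """Loss for one example: a is a probability vector, y the integer target (0-based)."""
    return -np.log(a[y])

a = np.array([0.1, 0.7, 0.2])            # softmax output for one example
loss_correct = cross_entropy_loss(a, 1)  # target is the high-probability class
loss_wrong = cross_entropy_loss(a, 0)    # target is a low-probability class
print(loss_correct, loss_wrong)
```

The loss is small when the probability assigned to the target is close to 1 and grows as that probability approaches 0.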


Note that in (3) above, only the line corresponding to the target contributes to the loss; the other lines are zero. To write the cost equation we need an 'indicator function' that is 1 when the index matches the target and 0 otherwise.
    $$\mathbf{1}\{y == n\} = \begin{cases}
    1, & \text{if $y==n$}.\\
    0, & \text{otherwise}.
  \end{cases}$$
Now the cost is:

\begin{align}
J(\mathbf{w},b) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{N}  \mathbf{1}\left\{y^{(i)} == j\right\} \log \frac{e^{z^{(i)}_j}}{\sum_{k=1}^N e^{z^{(i)}_k} }\right] \tag{4}
\end{align}
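
Equation (4) can be sketched in NumPy as follows. Integer indexing with `a[np.arange(m), y]` plays the role of the indicator function, since it picks out the single term per row where `j` equals the target. Labels are assumed to be zero-based integers, and `softmax_cost` is an illustrative name rather than anything defined in this notebook.

```python
import numpy as np

def softmax_cost(z, y):
    """Average cross-entropy over m examples.

    z: (m, N) array of logits; y: (m,) array of integer targets in 0..N-1.
    """
    z = np.asarray(z, dtype=float)
    y = np.asarray(y)
    m = z.shape[0]
    # Row-wise softmax, shifted by the row max to avoid overflow
    ez = np.exp(z - z.max(axis=1, keepdims=True))
    a = ez / ez.sum(axis=1, keepdims=True)
    # Indicator function: keep only the probability of the target class in each row
    target_probs = a[np.arange(m), y]
    return -np.mean(np.log(target_probs))

z = np.array([[2.0, 1.0, 0.1],
              [0.5, 2.5, 0.3]])
y = np.array([0, 1])
cost = softmax_cost(z, y)
print(cost)
```

As a sanity check, all-zero logits give a uniform distribution over the N classes, so the cost equals `log(N)`.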

In [6]:
# Dataset: 2000 two-dimensional points drawn around four cluster centers
centers = [[-5, 2],
           [-2, 2],
           [1, 2],
           [5, -2]]
x_train, y_train = make_blobs(n_samples=2000, centers=centers, cluster_std=1.0, random_state=30)

The model below is implemented with the softmax as an activation in the final Dense layer. The loss function is separately specified in the compile directive.

The loss function is `SparseCategoricalCrossentropy`, which implements the loss described in (3) above; "sparse" means the targets are given as integer class indices rather than one-hot vectors. In this model, the softmax takes place in the last layer, so the loss function receives the softmax output, a vector of probabilities.

## Non-Preferred Softmax Neural Network

In [11]:
model = Sequential([
    Dense(25,activation = 'relu'),
    Dense(15,activation = 'relu'),
    Dense(4,activation = 'softmax')
])

model.compile(
    loss = tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer = tf.keras.optimizers.Adam(0.001),)

model.fit(x_train,y_train,epochs = 10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fbeac5e6210>

Because the softmax is integrated into the output layer, the output is a vector of probabilities.

In [13]:
p_nonpreferred = model.predict(x_train)
print(p_nonpreferred[:2])
print('largest value ', np.max(p_nonpreferred))
print('smallest value ',np.min(p_nonpreferred))

[[1.3200734e-04 1.5484955e-02 9.5769781e-01 2.6685284e-02]
 [9.5720732e-01 4.2127814e-02 4.9191201e-04 1.7294526e-04]]
largest value  0.9999993
smallest value  1.5161074e-12


## Preferred Softmax Neural Network

In the preferred organization the final layer has a linear activation. For historical reasons, the outputs in this form are referred to as *logits*. The loss function has an additional argument: `from_logits = True`. This informs the loss function that the softmax operation should be included in the loss calculation. This allows for an optimized implementation.
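
One way to see why folding the softmax into the loss helps: computing the softmax first and then taking its log can underflow to `log(0)`, whereas working directly from the logits allows the log-sum-exp form $-z_y + \log\sum_k e^{z_k}$. The NumPy sketch below illustrates the idea only; it is not TensorFlow's actual implementation, and both function names are hypothetical.

```python
import numpy as np

def loss_from_probs(z, y):
    """Softmax first, then -log of the target probability (two separate steps)."""
    ez = np.exp(z - np.max(z))
    a = ez / np.sum(ez)
    return -np.log(a[y])

def loss_from_logits(z, y):
    """Same loss computed directly from the logits with the log-sum-exp trick."""
    c = np.max(z)
    log_sum_exp = c + np.log(np.sum(np.exp(z - c)))
    return log_sum_exp - z[y]

z = np.array([0.0, 1000.0])  # extreme logits; the true class is the unlikely one
with np.errstate(divide="ignore"):
    two_step = loss_from_probs(z, 0)  # target probability underflows to 0
fused = loss_from_logits(z, 0)        # stays finite
print(two_step, fused)
```

For moderate logits the two functions agree; they differ only when the intermediate probability is too small to represent.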

In [14]:
prefered = Sequential(
    [
    Dense(25, activation = 'relu'),
    Dense(15, activation = 'relu'),
    Dense(4, activation = 'linear')
    ]
)

prefered.compile(
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer = tf.keras.optimizers.Adam(0.001),
)

prefered.fit(
    x_train, y_train,
    epochs = 10
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fbe08241890>

## Output Handling
Notice that in the preferred model, the outputs are not probabilities, but can range from large negative numbers to large positive numbers. The output must be sent through a softmax when performing a prediction that expects a probability. Let's look at the preferred model outputs:

In [16]:
pre_prefered = prefered.predict(x_train)
print(f'Two examples output vector : {pre_prefered[:2]}')
print('largest value: ', np.max(pre_prefered))
print('smallest value: ',np.min(pre_prefered))

Two examples output vector : [[-4.143013    0.19964506  4.6349564   0.99905497]
 [ 3.7505262   0.38420558 -5.980793   -5.580508  ]]
largest value:  12.087269
smallest value:  -10.990598


The output predictions are not probabilities! If probabilities are desired, the output should be processed by a softmax.

In [18]:
sm_preferred = tf.nn.softmax(pre_prefered).numpy()
print(f"two example output vectors:\n {sm_preferred[:2]}")
print("largest value", np.max(sm_preferred), "smallest value", np.min(sm_preferred))

two example output vectors:
 [[1.4839732e-04 1.1413490e-02 9.6305192e-01 2.5386205e-02]
 [9.6649694e-01 3.3360001e-02 5.7404002e-05 8.5661100e-05]]
largest value 0.99999917 smallest value 7.245911e-10


To select the most likely category, the softmax is not required. One can find the index of the largest output using `np.argmax()`.

In [23]:
for i in range(5):
    print(f'{sm_preferred[i]}, category: {np.argmax(sm_preferred[i])}')

[1.4839732e-04 1.1413490e-02 9.6305192e-01 2.5386205e-02], category: 2
[9.6649694e-01 3.3360001e-02 5.7404002e-05 8.5661100e-05], category: 0
[0.86219364 0.13607714 0.00086485 0.00086435], category: 0
[0.01046968 0.9214138  0.06356313 0.00455332], category: 1
[0.00123136 0.59213495 0.40055788 0.00607582], category: 1
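
Because the softmax is monotonic (a larger logit always maps to a larger probability), `np.argmax` gives the same category whether it is applied to the logits or to the softmax output. A quick check, using the two logit vectors printed earlier for the preferred model:

```python
import numpy as np

# Logits copied from the preferred model's printed output
logits = np.array([[-4.143013, 0.19964506, 4.6349564, 0.99905497],
                   [3.7505262, 0.38420558, -5.980793, -5.580508]])

# Row-wise softmax, shifted by the row max for stability
ez = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = ez / ez.sum(axis=1, keepdims=True)

categories_from_logits = np.argmax(logits, axis=1)
categories_from_probs = np.argmax(probs, axis=1)
print(categories_from_logits, categories_from_probs)  # both are [2 0]
```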
