#### Module 2: SOFTMAX

#### Outline

* Softmax 1D
* Softmax 2D

#### A short detour (Sigmoid vs Softmax comparison)

| Feature                | **Sigmoid**                            | **Softmax**                                 |
|------------------------|----------------------------------------|---------------------------------------------|
| 📌 Use case            | Binary classification                  | Multi-class classification                  |
| 📊 Output range        | [0, 1]                                 | [0, 1] for each class, and **sums to 1**    |
| 🎯 Interpretation      | Probability of the **positive class**  | **Probabilities for all classes**           |
| 🔢 Input shape         | Single number per sample               | Vector (logits for each class)              |
| ➕ Multi-label use?    | ✅ Yes                                  | ❌ No (assumes only one correct class)       |
| 🔄 Formula             | σ(x) = 1 / (1 + e^(–x))                | softmax(xᵢ) = e^(xᵢ) / ∑ e^(xⱼ)              |
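
The two formulas in the table can be tried out directly. The following is a minimal sketch in plain Python; the function names `sigmoid` and `softmax` and the input values are illustrative assumptions, not part of the course material.

```python
import math

def sigmoid(x):
    """Map a single logit to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    """Map a list of logits to probabilities that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

p_binary = sigmoid(0.5)              # one number per sample
p_multi = softmax([2.0, 1.0, 0.1])   # one probability per class

print(p_binary)
print(p_multi, sum(p_multi))

# With two classes and logits [x, 0], softmax reproduces the sigmoid.
print(softmax([0.5, 0.0])[0])
```

The last line illustrates why the two functions are so closely related: with only two classes, softmax carries the same information as a single sigmoid output.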


#### 1. Softmax 1D

As in logistic regression, the targets are integer class labels, but instead of just two classes we can have multiple classes. In this case y holds four labels, one per sample.  
y = $\begin{bmatrix} 2 \\ 4 \\ 1 \\ 0 \end{bmatrix}$  

We'll also have feature vectors or tensors.  
Each sample will correspond to a different row in the matrix or tensor X.  
X = $\begin{bmatrix} 4.9 & 3 & 1.4 & 0.2 \\ 4.1 & 1 & 1.4 & 0.2 \\ 1.1 & 2.1 & 3 & -1 \\ 4.3 & 1.9 & 1 & 7.9 \end{bmatrix}$  

How softmax behaves for a typical 3-class output:  

![image.png](attachment:image.png)  

We can conclude that for values of $x$ in this region, $z_0$ is greater than both $z_1$ and $z_2$.

##### Argmax Function

The argmax function returns the index corresponding to the largest value in a sequence of numbers.  

![image.png](attachment:image.png)  

Here the largest value in z is 100, and the corresponding index is 0.  
Thus, the argmax function will return zero.
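
As a small sketch of this behaviour, the helper below implements argmax in plain Python. The name `argmax` and the values after the first entry of `z` are assumptions for illustration; only the largest value, 100 at index 0, comes from the text.

```python
def argmax(values):
    """Return the index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical sequence: only the 100 at index 0 is taken from the text.
z = [100, 10, 5]
print(argmax(z))
```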

##### Using argmax function in multi-class prediction

![image.png](attachment:image.png)

We store these outputs in a table, with the index $i$ corresponding to the line number, and apply the argmax function to the table.
Since the largest value corresponds to $z_0$, the argmax function returns 0 and $\hat{y} = 0$.
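
The prediction step can be sketched as follows, assuming each output is a line $z_i = w_i x + b_i$. The weights, biases, and the function name `predict` are made up for illustration and are not the values from the figure.

```python
# Hypothetical parameters for three lines z_i = w_i * x + b_i.
weights = [-1.0, 0.5, 1.0]
biases = [0.0, 0.0, -3.0]

def predict(x):
    """Return yhat, the index of the line with the largest output at x."""
    # One output per class; this list plays the role of the table.
    z = [w * x + b for w, b in zip(weights, biases)]
    return max(range(len(z)), key=lambda i: z[i])

# In the region where z_0 is largest, the prediction is class 0.
yhat = predict(-2.0)
print(yhat)
```

Moving `x` into a region where a different line is on top changes the prediction accordingly.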

##### A short detour (Softmax vs Argmax)

| Feature         | **Softmax**                                            | **Argmax**                                |
|-----------------|--------------------------------------------------------|--------------------------------------------|
| 🔢 Output Type  | Vector of **probabilities** (floats)                   | A **single index** (integer)               |
| 🎯 Purpose      | Gives **confidence scores** for each class             | Chooses the **most likely class**          |
| 📊 Output Range | Values in **[0, 1]**, sum to **1**                     | Integer index: one of `[0, 1, ..., n-1]`    |
| 🧮 Example      | `[0.7, 0.2, 0.1]`                                       | `0` (index of highest value)               |
| 🔄 Differentiable | ✅ Yes (used in training with gradients)             | ❌ No (not differentiable, used in inference) |
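
A short sketch of how the two functions fit together, using hypothetical logits. Because softmax preserves the ordering of its inputs, argmax picks the same class whether it is applied to the logits or to the probabilities.

```python
import math

def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [2.0, 0.5, -1.0]   # hypothetical raw outputs
probs = softmax(logits)     # confidence score for each class

print(probs)
print(argmax(probs), argmax(logits))   # same index either way
```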



##### Example of input vector of MNIST data

![image.png](attachment:image.png)

Each image is greyscale, so the intensity of each pixel ranges from 0 to 255.
Each image in the MNIST dataset comprises 784 pixels (28 × 28),
so the input vector has 784 values.
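
To make the 784 concrete, here is a sketch that flattens a 28 × 28 grid into one input vector. The grid is filled with placeholder intensities, not real MNIST data, and the variable names are assumptions.

```python
ROWS, COLS = 28, 28

# Placeholder image: a 28 x 28 grid of intensities in 0..255 (not real MNIST data).
image = [[(r * COLS + c) % 256 for c in range(COLS)] for r in range(ROWS)]

# Flatten row by row into a single input vector.
x = [pixel for row in image for pixel in row]

print(len(x))
```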

Visualizing and plotting 784 dimensions would be extremely difficult.
To visualize Softmax in 2D, think of the samples as vectors.
Here we have three weight parameters, $w_0$, $w_1$ and $w_2$; their values are shown in the table.

![image-2.png](attachment:image-2.png)

These vectors represent the parameters of Softmax in 2D.
The Softmax function is used to find the points nearest to each parameter vector, where nearness is measured by the dot product.

**NOTE:** The dimensionality of the input space is not directly related to the size of the output vector. In the MNIST case, we have 10 weight parameters ($w_0, w_1, \dots, w_9$) in a 784-dimensional space. The example here has just 3 weight parameters ($w_0, w_1, w_2$) in a 2-dimensional space.

Sample computation of output class  

![image-3.png](attachment:image-3.png)
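
The computation can be sketched in a few lines: take the dot product of the input with each parameter vector and pick the largest. The vectors and the sample below are made-up values, not those from the figure, and bias terms are left out as a simplifying assumption.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical 2D parameter vectors w_0, w_1, w_2 (not the values in the figure).
W = [[1.0, 0.0],
     [0.0, 1.0],
     [-1.0, -1.0]]

x = [0.2, 0.9]   # hypothetical 2D sample

# One score per class: the dot product of x with each parameter vector.
scores = [dot(w, x) for w in W]

# The predicted class is the index of the largest score.
yhat = max(range(len(scores)), key=lambda i: scores[i])

print(scores, yhat)
```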


The function is called Softmax because the actual distances, i.e. the
dot products of each input vector with the parameters,
are converted to probabilities rather than to a single hard index as with argmax,
using the following probability functions, similar to logistic regression.

![image-4.png](attachment:image-4.png)
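
For reference, the conversion can be written out in the same form as the softmax formula in the comparison table. This assumes the logits are the plain dot products $\mathbf{w}_i \cdot \mathbf{x}$ with no bias term, which may differ from the exact expression in the figure.

$$P(y = i \mid \mathbf{x}) = \frac{e^{\mathbf{w}_i \cdot \mathbf{x}}}{\sum_{j} e^{\mathbf{w}_j \cdot \mathbf{x}}}$$

With only two classes, dividing the numerator and denominator by $e^{\mathbf{w}_0 \cdot \mathbf{x}}$ recovers the sigmoid used in logistic regression:

$$P(y = 0 \mid \mathbf{x}) = \frac{e^{\mathbf{w}_0 \cdot \mathbf{x}}}{e^{\mathbf{w}_0 \cdot \mathbf{x}} + e^{\mathbf{w}_1 \cdot \mathbf{x}}} = \frac{1}{1 + e^{-(\mathbf{w}_0 - \mathbf{w}_1) \cdot \mathbf{x}}}$$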