# `007-softmax`

Task: practice using the `softmax` function.

## Setup

In [1]:
import torch
from torch import tensor
import matplotlib.pyplot as plt
%matplotlib inline

## Task

Let's see how `softmax` behaves!

Try this example: `x = tensor([1., 2., 3.])`

1. Compare the result of `softmax(x)` with the result of `x.exp() / x.exp().sum()`. Are they close?
2. What happens to the output of `softmax` when you give it `x + 1` instead of `x`? What happens if you add 100 instead? (Do this without changing `x`.)
3. *optional*: What happens to the output of `x.exp() / x.exp().sum()` when you add 1 to `x`? When you add 100?
4. What happens when you multiply `x` by a constant like 0.5 or 3.0 before passing it to `softmax`? Compare this situation with the situation in question 2.

**Note: you'll need to specify the axis: `torch.softmax(x, dim=0)`.**

## Solution

Add code and Markdown cells for each of the listed tasks above.

In [2]:
x = tensor([1., 2., 3.])

### Problem 1

In [3]:
torch.softmax(x, dim=0), x.exp() / x.exp().sum()

(tensor([0.0900, 0.2447, 0.6652]), tensor([0.0900, 0.2447, 0.6652]))

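Yes. To the four decimal places printed, the two tensors contain the same values, which is expected because `x.exp() / x.exp().sum()` is the definition of softmax.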
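Comparing printed digits only checks four decimal places. One way to answer "are they close?" more directly is `torch.allclose`, which compares elementwise within a tolerance. A minimal sketch (the names `builtin` and `manual` are illustrative, not part of the exercise):

```python
import torch
from torch import tensor

x = tensor([1., 2., 3.])

builtin = torch.softmax(x, dim=0)
manual = x.exp() / x.exp().sum()

# allclose compares elementwise within a small tolerance,
# so tiny floating-point differences do not count as a mismatch.
print(torch.allclose(builtin, manual))

# Largest absolute difference between the two results.
print((builtin - manual).abs().max())
```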

### Problem 2

In [4]:
torch.softmax(x + 1, dim=0), torch.softmax(x + 100, dim=0)

(tensor([0.0900, 0.2447, 0.6652]), tensor([0.0900, 0.2447, 0.6652]))

The output of `softmax` does not change whether we add 0, 1, or 100 to `x`: adding the same constant to every element leaves the result unchanged.

### Problem 3

In [5]:
(x+1).exp() / (x+1).exp().sum(), (x+100).exp() / (x+100).exp().sum()

(tensor([0.0900, 0.2447, 0.6652]), tensor([nan, nan, nan]))

Adding a small constant such as 1 leaves the result unchanged, just as with `torch.softmax`. Adding 100, however, makes `exp` overflow: `exp(101)` exceeds the largest finite `float32` value, so every entry becomes `inf`, and `inf / inf` evaluates to `NaN` (not a number).
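To see where the `NaN` values come from, it can help to look at the intermediate result of `exp` on its own. The sketch below assumes the default floating-point dtype is `float32` (PyTorch's default unless it has been changed); `big` is just an illustrative name.

```python
import torch
from torch import tensor

x = tensor([1., 2., 3.])

big = (x + 100).exp()

print(x.dtype)                   # dtype of x (float32 by default)
print(torch.finfo(x.dtype).max)  # largest finite value for that dtype
print(big)                       # exp(101) and above exceed that limit
print(big / big.sum())           # inf / inf has no defined value
```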

### Problem 4

In [6]:
torch.softmax(x * 0.5, dim=0), torch.softmax(x * 3.0, dim=0)

(tensor([0.1863, 0.3072, 0.5065]), tensor([0.0024, 0.0473, 0.9503]))

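Unlike adding a constant (Problem 2), multiplying `x` by a constant before passing it to `softmax` does change the output. Scaling by 0.5 pulls the probabilities closer together (the largest drops from 0.6652 to 0.5065), while scaling by 3.0 concentrates almost all of the mass on the largest element (0.9503).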
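The effect of the multiplier is easier to see across a wider range of values. The following sketch sweeps a few scale factors (chosen arbitrarily for illustration) and records the largest probability for each; `top_prob` is a hypothetical helper list.

```python
import torch
from torch import tensor

x = tensor([1., 2., 3.])

top_prob = []
for scale in [0.1, 0.5, 1.0, 3.0, 10.0]:
    probs = torch.softmax(x * scale, dim=0)
    top_prob.append(probs.max().item())
    # Small scales give a flatter output, large scales a more peaked one.
    print(scale, probs)
```

As the scale grows, the output approaches a one-hot vector on the largest element; as it shrinks toward 0, the output approaches the uniform distribution.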

## Analysis

In [7]:
x2 = tensor([1., 0.])
x3 = x2 - 1
x3

tensor([ 0., -1.])

In [8]:
x4 = x2 * 2
x4

tensor([2., 0.])

1. Are `softmax(x2)` and `softmax(x3)` the same or different? How could you tell without having to try it?
2. Are `softmax(x2)` and `softmax(x4)` the same or different? How could you tell without having to try it?

1. `softmax(x2)` and `softmax(x3)` are the same. `x3` is `x2` with a constant (-1) added to every element, and as we saw in the "Solution" section above, adding a constant to `x` does not change the result of `softmax`.

2. `softmax(x2)` and `softmax(x4)` are different. `x4` is `x2` multiplied by a constant (2), and as we saw in the "Solution" section above, multiplying `x` by a constant changes the result of `softmax`.
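Both answers can be checked numerically. For a two-element input there is also a shortcut for predicting the result: the first softmax output equals the sigmoid of the difference between the two elements, since `exp(a) / (exp(a) + exp(b)) = 1 / (1 + exp(-(a - b)))`. A small sketch of that check:

```python
import torch
from torch import tensor

x2 = tensor([1., 0.])
x3 = x2 - 1   # shifted: difference between elements is still 1
x4 = x2 * 2   # scaled: difference between elements is now 2

for name, v in [("x2", x2), ("x3", x3), ("x4", x4)]:
    diff = v[0] - v[1]
    # The first softmax entry should match sigmoid(diff).
    print(name, torch.softmax(v, dim=0), torch.sigmoid(diff))
```

`x2` and `x3` have the same difference between their elements, so they give the same output; `x4` has a larger difference, so its output is more peaked.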

## Extension *optional*

1. Try to prove your observation in \#2 by symbolically simplifying the expression `softmax(logits + c)` and seeing if you can get `softmax(logits)`. Remember that `softmax(x) = exp(x) / exp(x).sum()` and `exp(a + b) = exp(a)exp(b)`.

2. Why does `exp(x + 100) / exp(x + 100).sum()` not work, while `torch.softmax(x + 100, dim=0)` does? Can you think of what `torch.softmax` might be doing to make sure that works?

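2. In the default `float32` dtype, `exp(x + 100)` overflows: `exp(101)` is about 7.3e43, far above the largest finite `float32` value (about 3.4e38), so every element becomes `inf` and `inf / inf` is `NaN`. A common way to avoid this, and probably what `torch.softmax` does, is to subtract the largest element of `x` before exponentiating, so that the largest exponent is 0 and nothing overflows. By the shift invariance from Problem 2 this does not change the result, because the factor `exp(c)` appears in both the numerator and the denominator and cancels.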
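One widely used way to sidestep the overflow is to subtract the maximum element before exponentiating, which is allowed because softmax is unchanged by adding a constant. The sketch below illustrates that idea for a 1-D tensor; `stable_softmax` is a hypothetical helper written for this notebook, not PyTorch's actual implementation.

```python
import torch
from torch import tensor

def stable_softmax(v):
    """Softmax of a 1-D tensor, shifting by the max before exponentiating."""
    shifted = v - v.max()   # largest entry becomes 0, so exp() is at most 1
    e = shifted.exp()
    return e / e.sum()

x = tensor([1., 2., 3.])

# Both calls should agree, and neither should produce NaN.
print(stable_softmax(x + 100))
print(torch.softmax(x + 100, dim=0))
```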