# Softmax Regression
:label:`sec_softmax`

## Section Summary
This section focuses on classification problems, which are concerned with assigning data points to categories. It explains the distinction between hard assignments of examples to categories and soft assignments, which attach a probability to each category. It also introduces multi-label classification, which deals with cases where more than one label might be true. As a running example, the section uses a simple image classification problem whose labels are represented with one-hot encoding.



## Classification Summary
:label:`subsec_classification-problem`

Consider a simple image classification problem where each input is a $2\times2$ grayscale image, so that it has four features $x_1, x_2, x_3, x_4$, one per pixel. Each image belongs to one of three categories: cat, chicken, or dog. To represent the labels, one can use one-hot encoding, which maps each category to a three-dimensional vector with a 1 in the position of that category and 0 elsewhere.
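
The one-hot encoding described above can be sketched in a few lines of plain Python. The helper name `one_hot` and the category order (cat, chicken, dog) are assumptions made for illustration; any fixed order works as long as it is used consistently.

```python
# Assumed category order; the position of each name determines its vector.
CATEGORIES = ["cat", "chicken", "dog"]


def one_hot(label, categories=CATEGORIES):
    """Return a list with 1.0 at the position of `label` and 0.0 elsewhere."""
    index = categories.index(label)  # raises ValueError for an unknown label
    return [1.0 if i == index else 0.0 for i in range(len(categories))]


print(one_hot("cat"))      # [1.0, 0.0, 0.0]
print(one_hot("chicken"))  # [0.0, 1.0, 0.0]
print(one_hot("dog"))      # [0.0, 0.0, 1.0]
```

Unlike integer codes such as 1, 2, 3, this representation does not suggest any ordering among the categories.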




### Linear Model Summary

Classification can be addressed with a linear model that has multiple outputs, one per class, so that the conditional probabilities of all possible classes can be estimated. With labels in one-hot form, each output corresponds to one coordinate of the label vector. Every output is an affine function of all inputs, so the model can be described as a single-layer neural network whose output layer is fully connected. For the example with four features and three classes, this requires 12 weights and 3 biases. In vector and matrix notation the model is written concisely as $\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}$.
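
As a rough illustration, the affine outputs for one example can be computed with NumPy. The pixel values, the random weights, and the seed below are made up; only the shapes (four features, three classes) come from the running example.

```python
import numpy as np

rng = np.random.default_rng(0)              # arbitrary seed, for reproducibility
x = np.array([0.2, 0.7, 0.1, 0.9])          # made-up pixel values x_1, ..., x_4
W = rng.normal(scale=0.1, size=(3, 4))      # one row of weights per class
b = np.zeros(3)                             # one bias per class

# Matrix-vector form: all three outputs at once.
o = W @ x + b

# The same outputs written out one affine function at a time.
o_explicit = np.array(
    [sum(W[i, j] * x[j] for j in range(4)) + b[i] for i in range(3)]
)

print(o.shape)                     # (3,)
print(np.allclose(o, o_explicit))  # True
```

The outputs `o` are unconstrained real numbers at this stage; they are not yet probabilities.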




### The Softmax Summary
:label:`subsec_softmax_operation`


The softmax function is a normalization used in classification to ensure that the outputs of a model are non-negative and sum to 1, so that they can be read as probabilities: $\hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}$. The exponential function makes every output non-negative and is monotonic, so the conditional class probability increases with increasing $o_i$; dividing by the sum then makes the outputs add up to 1. The softmax function was inspired by the Boltzmann distribution, which was used to model a distribution over energy states in gas molecules. Because softmax preserves the ordering of its arguments, the class with the largest output $o_i$ is also the most likely class according to the predicted probabilities.
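
A minimal sketch of the softmax for a single vector of outputs, using only the standard library. The input values are made up, and this naive version has no protection against overflow for large inputs.

```python
import math


def softmax(o):
    """Exponentiate each output, then divide by the sum of the exponentials."""
    exps = [math.exp(v) for v in o]
    total = sum(exps)
    return [e / total for e in exps]


o = [1.0, 2.0, 3.0]  # made-up model outputs (logits)
y_hat = softmax(o)

print(all(p > 0 for p in y_hat))                   # True: non-negative
print(abs(sum(y_hat) - 1.0) < 1e-12)               # True: sums to 1
print(y_hat.index(max(y_hat)) == o.index(max(o)))  # True: ordering preserved
```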






### Vectorization Summary
:label:`subsec_softmax_vectorization`

Vectorizing the computation over minibatches of data improves computational efficiency. Suppose a minibatch $\mathbf{X} \in \mathbb{R}^{n \times d}$ contains $n$ examples with dimensionality $d$, and there are $q$ categories in the output. Then the weights satisfy $\mathbf{W} \in \mathbb{R}^{d \times q}$ and the bias satisfies $\mathbf{b} \in \mathbb{R}^{1\times q}$, so that $\mathbf{O} = \mathbf{X} \mathbf{W} + \mathbf{b}$ and $\hat{\mathbf{Y}} = \mathrm{softmax}(\mathbf{O})$. The softmax operation is computed rowwise by exponentiating the entries in each row of $\mathbf{O}$ and normalizing them by their sum. Exponentiating large entries can overflow and exponentiating very negative ones can underflow; deep learning frameworks handle these issues automatically.
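
The rowwise computation can be sketched with NumPy as follows. The function name `softmax_rows`, the sizes, and the random data are assumptions for illustration. Subtracting each row's maximum before exponentiating is one common way to avoid overflow; it is not necessarily what any particular framework does internally.

```python
import numpy as np


def softmax_rows(O):
    """Apply softmax independently to each row of a 2-D array of logits."""
    # Subtracting the row maximum leaves the result unchanged but keeps
    # every exponent <= 0, so np.exp cannot overflow.
    shifted = O - O.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


n, d, q = 5, 4, 3                  # made-up sizes: examples, features, classes
rng = np.random.default_rng(0)     # arbitrary seed
X = rng.normal(size=(n, d))        # minibatch, one example per row
W = rng.normal(size=(d, q))        # weights
b = np.zeros((1, q))               # bias, broadcast across the n rows

O = X @ W + b                      # logits, shape (n, q)
Y_hat = softmax_rows(O)

print(Y_hat.shape)                          # (5, 3)
print(np.allclose(Y_hat.sum(axis=1), 1.0))  # True

# Logits this large would overflow without the shift.
big = np.array([[1000.0, 1001.0, 1002.0]])
print(np.isfinite(softmax_rows(big)).all())  # True
```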






## Loss Function Summary
:label:`subsec_softmax-regression-loss-func`

Given a mapping from features to probabilities, a loss function is needed to optimize the accuracy of that mapping. Maximum likelihood estimation is used for this purpose, the same concept that provides the probabilistic justification for the mean squared error loss.






### Log-Likelihood Summary

The softmax function converts the model's outputs into probability estimates for the different classes. To optimize the accuracy of these estimates we use maximum likelihood estimation, which is equivalent to minimizing the negative log-likelihood. The resulting loss compares the model's estimated probabilities $\hat{\mathbf{y}}$ with the actual class probabilities $\mathbf{y}$ and is called the cross-entropy loss: $l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_j y_j \log \hat{y}_j$, the negative sum of the actual class probabilities multiplied by the log of the estimated probabilities. It is always greater than or equal to 0 and reaches 0 only when the model predicts the actual label with certainty.
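
A small sketch of the cross-entropy loss for one example, in plain Python. The three prediction vectors are made up to show how the loss reacts to confident, uninformative, and confidently wrong predictions.

```python
import math


def cross_entropy(y, y_hat):
    """Return -sum_j y_j * log(y_hat_j); terms with y_j == 0 contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)


y = [0.0, 1.0, 0.0]             # one-hot label: the second class is correct

confident = [0.01, 0.98, 0.01]  # made-up predictions
unsure = [1 / 3, 1 / 3, 1 / 3]
wrong = [0.98, 0.01, 0.01]

losses = [cross_entropy(y, p) for p in (confident, unsure, wrong)]
print(losses[0] < losses[1] < losses[2])  # True
print(cross_entropy(y, y) == 0)           # True: certainty on the right class
```

For a one-hot label the sum collapses to a single term, the negative log of the probability assigned to the correct class.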







### Softmax and Cross-Entropy Loss Summary
:label:`subsec_softmax_and_derivatives`

The softmax function and the cross-entropy loss are usually used together, and combining them yields a simple gradient. The derivative of the cross-entropy loss with respect to a logit $o_j$ is the difference between the probability assigned by the model and what actually happened, $\partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) = \mathrm{softmax}(\mathbf{o})_j - y_j$, which makes computing gradients easy. When the label is an entire distribution over outcomes rather than a single observed class, the cross-entropy loss is the expected value of the loss under that distribution. It measures the number of bits needed to encode what we see relative to what we predict should happen.
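
The gradient formula can be checked numerically. The sketch below compares $\mathrm{softmax}(\mathbf{o})_j - y_j$ against central finite differences of the loss; the logits, the label, and the step size `eps` are made up for the check.

```python
import math


def softmax(o):
    m = max(o)  # shift by the maximum for numerical stability
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]


def loss(o, y):
    """Cross-entropy between label y and softmax(o)."""
    return -sum(t * math.log(p) for t, p in zip(y, softmax(o)) if t > 0)


o = [0.5, -1.0, 2.0]  # made-up logits
y = [0.0, 0.0, 1.0]   # one-hot label

# Analytic gradient: predicted probability minus observed label.
analytic = [p - t for p, t in zip(softmax(o), y)]

# Numerical gradient via central differences.
eps = 1e-6
numeric = []
for j in range(len(o)):
    up, down = list(o), list(o)
    up[j] += eps
    down[j] -= eps
    numeric.append((loss(up, y) - loss(down, y)) / (2 * eps))

max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_gap < 1e-6)  # True
```

The analytic gradient is negative for the correct class and positive for the others, so a gradient step raises the correct logit and lowers the rest.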







## Information Theory Basics
:label:`subsec_info_theory_basics`

This subsection is a brief introduction to information theory, which is frequently referenced in deep learning papers. The central idea of information theory is to quantify the amount of information contained in data, and entropy is the quantity that does so. Entropy places a limit on our ability to compress data. The cross-entropy from $P$ to $Q$ is the expected surprisal of an observer with subjective probabilities $Q$ upon seeing data that was actually generated according to probabilities $P$. The cross-entropy classification objective can therefore be thought of in two ways: as maximizing the likelihood of the observed data, and as minimizing the number of bits required to communicate the labels.
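
A short sketch of entropy and cross-entropy measured in bits (base-2 logarithms). The distributions `P` and `Q` are made up: `P` plays the role of the true data distribution and `Q` the observer's belief.

```python
import math


def entropy(p):
    """H[P] = -sum_j P(j) log2 P(j), in bits."""
    return -sum(pj * math.log2(pj) for pj in p if pj > 0)


def cross_entropy(p, q):
    """H(P, Q) = -sum_j P(j) log2 Q(j), in bits."""
    return -sum(pj * math.log2(qj) for pj, qj in zip(p, q) if pj > 0)


P = [0.5, 0.25, 0.25]   # made-up true distribution
Q = [1 / 3, 1 / 3, 1 / 3]  # made-up subjective distribution

print(entropy(P))                             # 1.5
print(cross_entropy(P, Q) >= entropy(P))      # True
print(cross_entropy(P, P) == entropy(P))      # True
```

The cross-entropy is smallest when the observer's belief matches the true distribution, in which case it equals the entropy.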





## Exercises

1. We can explore the connection between exponential families and the softmax in some more depth.
    1. Compute the second derivative of the cross-entropy loss $l(\mathbf{y},\hat{\mathbf{y}})$ for the softmax.
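        - $\partial_{o_j}^2 l(\mathbf{y}, \hat{\mathbf{y}}) = \mathrm{softmax}(\mathbf{o})_j \left(1 - \mathrm{softmax}(\mathbf{o})_j\right) = \hat{y}_j (1 - \hat{y}_j)$, and the mixed derivatives are $\partial_{o_j} \partial_{o_k} l(\mathbf{y}, \hat{\mathbf{y}}) = -\hat{y}_j \hat{y}_k$ for $j \neq k$.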
    2. Compute the variance of the distribution given by $\mathrm{softmax}(\mathbf{o})$ and show that it matches the second derivative computed above.
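        - Treat the one-hot label as a random vector drawn from $\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o})$. Coordinate $j$ is then a Bernoulli variable with variance $\hat{y}_j (1 - \hat{y}_j)$, and the covariance between coordinates $j \neq k$ is $-\hat{y}_j \hat{y}_k$. These are exactly the second derivatives of the loss with respect to the logits.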
1. Assume that we have three classes which occur with equal probability, i.e., the probability vector is $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$.
    1. What is the problem if we try to design a binary code for it?
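        - The entropy is $\log_2 3 \approx 1.585$ bits, which is not an integer, but a code with one codeword per observation must spend a whole number of bits on each. A fixed-length code needs 2 bits, and the best prefix code (e.g., 0, 10, 11) still averages $5/3 \approx 1.67$ bits.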
    2. Can you design a better code? Hint: what happens if we try to encode two independent observations? What if we encode $n$ observations jointly?
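        - Yes, by encoding blocks of observations. Two independent observations have $3^2 = 9$ equally likely outcomes; a prefix code with seven 3-bit and two 4-bit codewords averages $29/9 \approx 3.22$ bits per pair, i.e., about 1.61 bits per observation. Encoding $n$ observations jointly takes at most $\lceil n \log_2 3 \rceil$ bits, so the cost per observation approaches $\log_2 3$ as $n$ grows.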
1. When encoding signals transmitted over a physical wire, engineers do not always use binary codes. For instance, [PAM-3](https://en.wikipedia.org/wiki/Ternary_signal) uses three signal levels $\{-1, 0, 1\}$ as opposed to two levels $\{0, 1\}$. How many ternary units do you need to transmit an integer in the range $\{0, \ldots, 7\}$? Why might this be a better idea in terms of electronics?
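    - Two ternary units suffice, since $3^2 = 9 \geq 8$, whereas a binary code needs three bits. Each ternary symbol carries $\log_2 3 \approx 1.585$ bits, so the same data rate can be reached with a lower symbol rate. The trade-off is that the signal levels sit closer together for the same voltage swing.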
1. The [Bradley-Terry model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) uses
a logistic model to capture preferences. For a user to choose between apples and oranges one
assumes scores $o_{\mathrm{apple}}$ and $o_{\mathrm{orange}}$. Our requirements are that larger scores should lead to a higher likelihood in choosing the associated item and that
the item with the largest score is the most likely one to be chosen :cite:`Bradley.Terry.1952`.
    1. Prove that the softmax satisfies this requirement.
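        - Under the softmax, the probability of choosing item $i$ is $p_i = \frac{\exp(o_i)}{\sum_{j}\exp(o_j)}$. The denominator is shared by all items and $\exp$ is strictly increasing, so the item with the largest score has the largest probability. Increasing $o_i$ while holding the other scores fixed also increases $p_i$, since $\partial_{o_i} p_i = p_i (1 - p_i) > 0$.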
    1. What happens if you want to allow for a default option of choosing neither apples nor oranges? Hint: now the user has 3 choices.
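        - Add a third class "neither" with its own score $o_{\mathrm{neither}}$ and apply the softmax over all three scores. Since the softmax is unchanged when the same constant is added to every score, one can fix $o_{\mathrm{neither}} = 0$ without loss of generality.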
1. Softmax derives its name from the following mapping: $\mathrm{RealSoftMax}(a, b) = \log (\exp(a) + \exp(b))$.
    1. Prove that $\mathrm{RealSoftMax}(a, b) > \mathrm{max}(a, b)$.
    1. How small can you make the difference between both functions? Hint: without loss of
    generality you can set $b = 0$ and $a \geq b$.
    1. Prove that this holds for $\lambda^{-1} \mathrm{RealSoftMax}(\lambda a, \lambda b)$, provided that $\lambda > 0$.
    1. Show that for $\lambda \to \infty$ we have $\lambda^{-1} \mathrm{RealSoftMax}(\lambda a, \lambda b) \to \mathrm{max}(a, b)$.
    1. What does the soft-min look like?
    1. Extend this to more than two numbers.
1. The function $g(\mathbf{x}) \stackrel{\mathrm{def}}{=} \log \sum_i \exp x_i$ is sometimes also referred to as the [log-partition function](https://en.wikipedia.org/wiki/Partition_function_(mathematics)).
    1. Prove that the function is convex. Hint: to do so, use the fact that the first derivative amounts to the probabilities from the softmax function and show that the second derivative is the variance.
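    1. Show that adding a constant $b$ to every coordinate shifts $g$ by the same constant, i.e., $g(\mathbf{x} + b) = g(\mathbf{x}) + b$, so that the softmax probabilities are translation invariant.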
    1. What happens if some of the coordinates $x_i$ are very large? What happens if they're all very small?
    1. Show that if we choose $b = \mathrm{max}_i x_i$ we end up with a numerically stable implementation.
1. Assume that we have some probability distribution $P$. Suppose we pick another distribution $Q$ with $Q(i) \propto P(i)^\alpha$ for $\alpha > 0$.
    1. Which choice of $\alpha$ corresponds to doubling the temperature? Which choice corresponds to halving it?
    1. What happens if we let the temperature converge to $0$?
    1. What happens if we let the temperature converge to $\infty$?
