### Vector and Matrix

Let's start with $K = 2$ experts and $|A| = 2$ actions, where Expert 1 scores the two actions as [0.9, 0.1] and Expert 2 scores them as [0.7, 0.3].

In [18]:
K, A = 2, 2
E1 = [0.9, 0.1]
E2 = [0.7, 0.3]
Q = [E1, E2]

In [22]:
print(Q)

[[0.9, 0.1], [0.7, 0.3]]


##### What are the dimensions of the full data structure that stores both experts’ predictions?

Since we have 2 experts and 2 actions, the full data structure is a $2 \times 2$ matrix ($K \times |A|$), with one row per expert and one column per action.

##### How would we write the element that stores Expert 2’s score for Action 1?

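With zero-based indexing, Expert 2 is row 1 and Action 1 is column 0, so the element is $Q[1, 0]$ (written `Q[1][0]` for the nested Python list above). It holds the value 0.7. In one-based math notation this is $Q_{2,1}$.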
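
As a quick check of the shape and the indexing, here is a minimal sketch that stores the same scores in a NumPy array. The name `Q_arr` and the use of NumPy are assumptions for illustration; the cells above use a plain nested list.

```python
import numpy as np

# Same scores as the nested list Q above: one row per expert, one column per action.
# Q_arr is a hypothetical name; the original cells use a plain list called Q.
Q_arr = np.array([[0.9, 0.1],
                  [0.7, 0.3]])

K, A = Q_arr.shape     # number of experts, number of actions
e2_a1 = Q_arr[1, 0]    # Expert 2 (row 1), Action 1 (column 0), zero-based

print(Q_arr.shape)
print(e2_a1)
```

With a nested list, the comma form `Q[1, 0]` raises a `TypeError`, so the array form is what makes the two-index notation work directly.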

##### Why should we use a $K \times |A|$ matrix instead of a flat list?

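Because a matrix keeps the expert axis and the action axis separate. We can then aggregate along either axis and use linear algebra directly: for example, apply softmax per row (per expert) or combine scores per column (per action), without tracking index offsets by hand.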
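
The sketch below illustrates the difference, assuming NumPy and the hypothetical names `Q_arr` and `flat`. It averages scores across experts, applies a softmax within each expert's row, and shows the index arithmetic a flat list would need.

```python
import numpy as np

Q_arr = np.array([[0.9, 0.1],
                  [0.7, 0.3]])
K, A = Q_arr.shape

# Aggregate across experts (axis 0): one averaged score per action.
mean_per_action = Q_arr.mean(axis=0)

# Softmax within each expert's row (axis 1): one distribution per expert.
# Subtracting the row maximum is only for numerical stability.
exp_scores = np.exp(Q_arr - Q_arr.max(axis=1, keepdims=True))
softmax_per_expert = exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Flat-list equivalent: the element for expert k, action a lives at k * A + a.
flat = Q_arr.ravel().tolist()
expert2_action1 = flat[1 * A + 0]

print(mean_per_action)
print(softmax_per_expert)
```

The `axis` argument states which dimension an operation runs over. With a flat list, that information lives only in the manual offset `k * A + a`.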

### Log-Sum-Exp and Softmax

##### Why shouldn't we use raw scores as probabilities?

Raw scores are not a probability distribution in general: nothing guarantees that they sum to 1 or that they are all non-negative. If we use raw scores directly, we also can't express the uncertainty between the options. Applying softmax converts the scores into a valid probability distribution.
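
To make the point about raw scores concrete, here is a small sketch with hypothetical scores that are not taken from the text. Dividing by the sum is not enough when a score is negative, whereas exponentiating first always gives positive entries.

```python
import numpy as np

# Hypothetical raw scores (an assumption for illustration); one is negative.
raw = np.array([2.0, -1.0])

# Naive normalization divides by the sum. The result sums to 1,
# but it still contains a negative "probability".
naive = raw / raw.sum()

# Softmax exponentiates first, so every entry is positive before normalizing.
exp_raw = np.exp(raw - raw.max())
probs = exp_raw / exp_raw.sum()

print(naive)
print(probs)
```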

##### Let's write the general softmax of a score vector $s = [s_1, s_2]$ with temperature $\tau$. What does lowering $\tau$ do to the resulting probabilities?

To answer this question well, we first look at the standard softmax equation and then specialize it to this setting.

From Wikipedia: the softmax function takes as input a tuple z of K real numbers, and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some tuple components could be negative, or greater than one; and might not sum to 1; but after applying softmax, each component will be in the interval $(0,1)$, and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.

Formally, the standard (unit) softmax function $\sigma: \mathbb{R}^K \rightarrow (0, 1)^K$, where $K > 1$, takes a tuple $z = (z_1, \dots, z_K) \in \mathbb{R}^K$ and computes each component of the vector $\sigma(z) \in (0,1)^K$ with: $$ \sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} $$

This is the standard softmax equation. In words: we apply the exponential function to each element of $z$, then divide each result by the sum of all the exponentials. The normalization ensures that the components of the output vector $\sigma(z)$ sum to 1.
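
The question asks about a temperature $\tau$, which the standard formula does not include. A common convention, assumed here because the text does not spell it out, divides each score by $\tau$ before exponentiating: $\sigma(s)_i = e^{s_i/\tau} / \sum_j e^{s_j/\tau}$. The function name `softmax` and the sample temperatures below are illustrative choices.

```python
import numpy as np

def softmax(scores, tau=1.0):
    """Temperature softmax: exp(s_i / tau) / sum_j exp(s_j / tau)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    z = np.asarray(scores, dtype=float) / tau
    z = z - z.max()          # numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

s = [0.9, 0.1]               # Expert 1's scores from the first section
p_high = softmax(s, tau=10.0)   # high temperature: close to uniform
p_unit = softmax(s, tau=1.0)    # standard softmax
p_low = softmax(s, tau=0.1)     # low temperature: close to one-hot

print(p_high, p_unit, p_low)
```

Under this convention, lowering $\tau$ magnifies the gaps between scores, so the probability mass concentrates on the highest-scoring action. Raising $\tau$ shrinks the gaps and pushes the distribution toward uniform.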

##### Using $\tau = 1$, hand‑calculate softmax([0,0]). What should the distribution look like, and why?
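
A worked calculation, as a sketch using the standard softmax formula with $K = 2$ and $z = [0, 0]$ (with $\tau = 1$ the scores are left unscaled):

$$ \sigma([0,0])_1 = \frac{e^{0}}{e^{0} + e^{0}} = \frac{1}{1 + 1} = \frac{1}{2}, \qquad \sigma([0,0])_2 = \frac{e^{0}}{e^{0} + e^{0}} = \frac{1}{2} $$

The result is the uniform distribution $[0.5, 0.5]$. Equal scores give equal exponentials, so the normalization splits the probability mass evenly.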




### KL-divergence for Measuring Policy Disagreement

Assume two softmax policies $\pi_1 = [0.8, 0.2], \pi_2 = [0.6, 0.4].$

##### Is $KL(\pi_1 || \pi_2)$ equal to $KL(\pi_2 || \pi_1)$? Explain the intuition in words.
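
A numerical check can guide the intuition. The sketch below uses the standard discrete definition $KL(p \| q) = \sum_i p_i \ln(p_i / q_i)$, measured in nats. The helper name `kl_divergence` is an assumption for illustration, and it expects strictly positive entries.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions with strictly positive entries."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

pi_1 = [0.8, 0.2]
pi_2 = [0.6, 0.4]

kl_12 = kl_divergence(pi_1, pi_2)   # roughly 0.0915 nats
kl_21 = kl_divergence(pi_2, pi_1)   # roughly 0.1046 nats

print(kl_12, kl_21)
```

The two directions differ because each one weights the log-ratio by a different distribution: $KL(\pi_1 \| \pi_2)$ averages over actions as $\pi_1$ chooses them, while $KL(\pi_2 \| \pi_1)$ averages as $\pi_2$ chooses them.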

##### Under what exact condition is $KL$ divergence equal to 0?

##### Suppose $KL(\pi_1 || \pi_2)$ comes out to “0.02 nats”. What does this mean intuitively?



### Mutual Information for "Keeping only Useful Bits"

Let's say we have a random variable $Q$ (the expert scores) and an abstraction $\varphi(Q)$ (a compressed representation of $Q$).

##### Express the mutual information $I(Q; \varphi(Q))$ in terms of entropy.
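
One way to write it, as a sketch that assumes $Q$ is discrete and $\varphi$ is a deterministic function (the text states neither):

$$ I(Q; \varphi(Q)) = H(Q) - H(Q \mid \varphi(Q)) = H(\varphi(Q)) - H(\varphi(Q) \mid Q) $$

Under the deterministic assumption, knowing $Q$ fixes $\varphi(Q)$, so $H(\varphi(Q) \mid Q) = 0$ and the expression reduces to $I(Q; \varphi(Q)) = H(\varphi(Q))$.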

##### What does $I(Q;\varphi(Q))=0$ tell us about $\varphi(Q)$? What if $I(Q;\varphi(Q))=H(Q)$?
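
The two extremes can be checked on a toy example. The sketch below assumes a discrete $Q$ with four equally likely outcomes and deterministic abstractions. The outcome labels, the uniform distribution, and the helper names are all hypothetical and not taken from the text.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mutual_information(p_q, phi):
    """I(Q; phi(Q)) in nats for discrete Q (dict: outcome -> prob) and deterministic phi."""
    # Marginal of Z = phi(Q): add up the mass of all outcomes mapped to the same z.
    p_z = defaultdict(float)
    for q, p in p_q.items():
        p_z[phi(q)] += p
    # The joint puts mass p(q) at (q, phi(q)), so the ratio
    # p(q, z) / (p(q) * p(z)) simplifies to 1 / p(z).
    return sum(p * math.log(1.0 / p_z[phi(q)]) for q, p in p_q.items() if p > 0)

# Hypothetical distribution over four discretized score patterns.
p_q = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
h_q = entropy(p_q.values())

mi_constant = mutual_information(p_q, lambda q: 0)           # discards everything
mi_identity = mutual_information(p_q, lambda q: q)           # keeps everything
mi_coarse = mutual_information(p_q, lambda q: q in ("a", "b"))  # two buckets

print(mi_constant, mi_coarse, mi_identity, h_q)
```

In this toy setting the constant map gives zero mutual information, the identity map gives $H(Q)$, and the two-bucket abstraction falls in between.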

##### Why might we want to minimize $I(Q;\varphi(Q))$ while still respecting decision quality? (Hint: “compression without losing the action boundary.”)

