# Typical Problems

Model parameters: $\Theta$, Observations: $X$, Hidden States: $Z$

- scoring (observations): $Pr(X) \longrightarrow$ **Forward Algorithm**

- decoding (hidden variables):
    - best posterior state: $argmax_i Pr(Z_t=i|X) \longrightarrow$  **Forward-Backward Algorithm**
    - best state sequence: $argmax_Z Pr(Z|X) \longrightarrow$ **Viterbi Algorithm**
    
- learning (model parameters): $argmax_\Theta Pr(X|\Theta) \longrightarrow$ **Baum-Welch Algorithm**

# Probability of Observations: $Pr(X)$

### Uses: Evaluate (score) observations. 

Compare different models: 
$$
P(X|\Theta_1) ~~\mbox{vs}~~ P(X|\Theta_2)
$$

### First idea: Chain rule
$$
P(X) = P(X_1) P(X_2|X_1) P(X_3|X_1,X_2) \ldots P(X_T| X_1, \ldots ,X_{T-1})
$$

... not directly calculable from the HMM parameters

### Second idea: Naive Marginalization

Use joint distribution $Pr(X,Z)$ and remove hidden state sequence (it is unobservable) $\to$ marginalize

$$
Pr(X) = \sum_Z P(X,Z) = \sum_Z Pr(X|Z) Pr(Z)
$$

### Notice
- remember: each term in sum breaks into emission probabilities, transition probabilities (and initial state probability)
- marginalization over *all possible* state paths $Z$ ($=Z_{1:T} = Z_1, Z_2, \cdots Z_T$) 
- $N^T$ paths for $N$ possible states and sequences of length $T \longrightarrow $  infeasible
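
The naive marginalization can be sketched as an explicit loop over all state paths. This is only an illustration: the parameter values are the ones used in the group task of this notebook, while the observation sequence `X` and the function name `brute_force_likelihood` are hypothetical choices made for this example.

```python
import itertools
import numpy as np

pi = np.array([0.75, 0.25])                        # initial state probabilities
P = np.array([[0.8, 0.2], [0.1, 0.9]])             # transition probabilities
E = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])   # emission probabilities

X = [0, 2, 2]   # hypothetical observations (assumption, not from the notes)

def brute_force_likelihood(X, pi, P, E):
    """Sum Pr(X, Z) over every possible hidden path Z (N^T terms)."""
    n_states = len(pi)
    total = 0.0
    n_paths = 0
    for Z in itertools.product(range(n_states), repeat=len(X)):
        # initial state probability times first emission
        p = pi[Z[0]] * E[Z[0], X[0]]
        # one transition and one emission per further time step
        for t in range(1, len(X)):
            p *= P[Z[t - 1], Z[t]] * E[Z[t], X[t]]
        total += p
        n_paths += 1
    return total, n_paths

likelihood, n_paths = brute_force_likelihood(X, pi, P, E)
print('paths:', n_paths, ' Pr(X):', likelihood)   # 2^3 = 8 paths
```

Already for $T=100$ this loop would have to visit $2^{100}$ paths, which is why it only serves as a reference for very short sequences.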

### Third idea: Dynamic Programming (reuse previous calculations)

# The Forward Algorithm

## Idea and Visualization

We don't know any $Z_t$, so we need to **track all possibilities**: $\longrightarrow$ Trellis graph.

<div>
   <img src="https://github.com/thomasmanke/ABS/raw/main/figures/HMM_HiddenTrellis.jpg",  width="1000">
</div>

**Recursion** again !!!

Let's assume that at some time $t$ we already know the joint probability for the observed sequence $X_{1:t}$ and the hidden state $Z_t$ (for each possible value of $Z_t=i$).

This information is stored in the **forward variable:** $\alpha_{ti} = Pr(X_{1:t}, Z_t=i)$. This is a vector of joint probabilities that will be propagated forward in time.

This works efficiently because of the Markov property (separation of future from past, given the present).



## Algorithmic Formulation: Iteration

### 1. Initialization ($t=1$) 
$$
\alpha_{1i} = Pr(X_1, Z_1=i) = Pr(X_1|Z_1=i) Pr(Z_1=i)
$$
  - $\to$ element-wise multiplication of a column of the emission matrix with the initial state distribution
  - $Pr(X_1|Z_1=i)$ is one element of the emission matrix $E_{ik}$ (row $i$ = state, column $k$ = observed value of $X_1$).
  - $Pr(Z_1=i)$ is the $i$-th element of the initial state distribution, $\pi_i$.
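
As a minimal sketch, the initialization is a single line in NumPy. The parameter values are those of the group task in this notebook; the first observation `x1` is an arbitrary choice for illustration.

```python
import numpy as np

pi = np.array([0.75, 0.25])                        # initial state distribution
E = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])   # emission matrix (row = state)

x1 = 0                      # hypothetical first observation (Bread = 0)

# select the column of E that belongs to the observed value,
# then multiply element-wise with the initial state distribution
alpha_1 = E[:, x1] * pi
print('alpha_1:', alpha_1)
```

Note that `E[:, x1]` selects one column: one emission probability per state for the fixed observation.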



### 2. Induction ($t \to t+1$): state transition + new observation 
  - *2.1 state transition ($Z_t \to Z_{t+1}$):* Consider all possible Markov transitions and sum them up.
$$
\begin{align}
Pr(Z_{t+1}=i, X_{1:t}) &= \sum_k Pr(Z_{t+1}=i, Z_t=k, X_{1:t}) \\
&=\sum_k Pr(Z_{t+1}=i|Z_t=k, X_{1:t}) Pr(Z_t=k,X_{1:t}) \\
 &= \sum_{k} P_{ki} \alpha_{tk}  = \sum_{k} \alpha_{tk} P_{ki}
\end{align}
$$
$\to$ matrix multiplication of row vector $\alpha_{t}$ with transition matrix $P$
  - *2.2 new observation ($X_{t+1}$):* consider emission probability resulting in observation $X_{t+1}$
$$
Pr(Z_{t+1}=i, X_{1:t}, X_{t+1}) = Pr(X_{t+1}|Z_{t+1}=i) Pr(Z_{t+1}=i, X_{1:t})
$$
$\to$ element-wise multiplication of a column of the emission matrix with the propagated vector from step 2.1; the result is the new forward variable $\alpha_{t+1,i}$
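
For reference, the two sub-steps can be combined into a single update. This is only a restatement of the equations above; the symbol $\odot$ for element-wise multiplication is a notation introduced here and not used elsewhere in these notes.

$$
\alpha_{t+1,i} = Pr(X_{1:t+1}, Z_{t+1}=i) = E_{i,X_{t+1}} \sum_k \alpha_{tk} P_{ki}
\qquad \Longleftrightarrow \qquad
\alpha_{t+1} = (\alpha_t P) \odot E_{:,X_{t+1}}
$$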



### 3. Termination ($t=T$): Marginalization
$$
Pr(X) = Pr(X_{1:T}) = \sum_i Pr(X_{1:T}, Z_T=i) = \sum_i \alpha_{Ti}
$$

## Graphical Summary:  2 Steps

<div>
   <img src="https://github.com/thomasmanke/ABS/raw/main/figures/HMM_Forward.jpg",  width="800">
</div>


<div>
   <img src="https://github.com/thomasmanke/ABS/raw/main/figures/HMM_Forward_summary.jpg",  width="800">
</div>

- The Markov Model pushes the state $Z$ forward in time $\longrightarrow$ matrix multiplication with $P$
- Emission probabilities: take into account the state-specific (time-dependent) probabilities for observation $X_{t+1}$ $\longrightarrow$ element-wise multiplication with proper column of $E$


## Notice
- **Marginalization:** $Pr(X_{1:t}) = \sum_i Pr(X_{1:t}, Z_t = i) = \sum_i \alpha_{ti} \ne 1$.  In fact, it is much smaller than 1 for large $t$ !
- **Recursion Efficiency:** Forward Propagation of $\alpha_{tk}$ **is fast** (linear in sequence length $T$)
    - Calculation of $Pr(X)$ requires $T N^2$ calculations $\ll N^T$ 
    - Example: $(N,T) = (2, 100) \longrightarrow 400 \ll 2^{100}$  
- **Emission matrix** $E_{ik}$ serves as lookup table for given observation $X_t=k$ at time $t$. ($k=f(t)$)
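
The numbers in the efficiency example can be checked with a few lines of arithmetic. The variable names are chosen for this illustration only; the operation count $T N^2$ is the rough estimate given above, not an exact count of floating point operations.

```python
N, T = 2, 100               # number of states, sequence length

forward_ops = T * N**2      # rough cost of the forward recursion
naive_paths = N**T          # number of state paths in the naive sum

print('forward algorithm:', forward_ops)          # 400
print('naive marginalization:', naive_paths)
print('ratio:', naive_paths // forward_ops)
```

Python integers have arbitrary precision, so $2^{100}$ is computed exactly here.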

## Group Task (30 min): A single step forward in time

$$Pr(Z_{t-1}=i, X_{1:t-1}) \to Pr(Z_t=i, X_{1:t})$$

Consider the above HMM with 2 states (Germany=0, Switzerland=1) and a magically known joint probability $Pr(Z_{t-1}, X_{1:t-1})=(0.05, 0.02)$. I made those numbers up, and it is irrelevant which history of observations resulted in these probabilities. They denote the joint probability of each of the two states **and** all observations until time $t-1$. Notice that they do not have to sum to 1! But (thanks to Markov) this is all you need to calculate the next step.

Calculate the updated probability for $Z_t=$ Germany (0) **and** that the new observation $X_t$ is Bread (0), Fish (1) or Fondue (2). 

- Group 1: $P(Z_t=0, X_{1:t-1}, X_t=0) = $ ? 
- Group 2: $P(Z_t=0, X_{1:t-1}, X_t=1) = $ ?
- Group 3: $P(Z_t=0, X_{1:t-1}, X_t=2) = $ ?

import numpy as np

pi = np.array([0.75, 0.25])                          # initial state probabilities
P = np.array([[0.8, 0.2], [0.1, 0.9]])               # transition probabilities (row = from, column = to)
E = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])     # emission probabilities (row = state, column = observation)

alpha = np.array([0.05, 0.02])  # joint probability at time t-1 (prior)
xt = 0                          # observation of interest at time t (Bread = 0)

alpha = alpha.dot(P)            # push prior with P from t-1 --> t (state transition)
print('after state transition: ', alpha)

LH = E[:, xt]                   # pick emission probs for observation xt
print('emission vector:        ', LH)

alpha = LH * alpha              # element-wise multiplication of alpha with LH
print('new probability         ', alpha)

# only needed for the conditional probability Pr(Z_t | X_{1:t})
posterior = alpha / np.sum(alpha)   # normalized posterior
print('posterior norm:', posterior)

after state transition:  [0.042 0.028]
emission vector:         [0.7 0.1]
new probability          [0.0294 0.0028]
posterior norm: [0.91304348 0.08695652]
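
The same step can be repeated for all three possible observations, which covers the questions of Groups 1 to 3. This is a sketch that reuses the made-up vector $(0.05, 0.02)$ from the task; the names `predicted`, `foods` and `answers` are chosen for this example.

```python
import numpy as np

P = np.array([[0.8, 0.2], [0.1, 0.9]])             # transition probabilities
E = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])   # emission probabilities

alpha_prev = np.array([0.05, 0.02])   # Pr(Z_{t-1}, X_{1:t-1}) from the task
predicted = alpha_prev @ P            # state transition, same for all groups

foods = ['Bread', 'Fish', 'Fondue']   # observations 0, 1, 2
answers = {}
for xt, food in enumerate(foods):
    alpha_t = predicted * E[:, xt]    # weight by emission probs for X_t = xt
    answers[food] = alpha_t[0]        # component for Z_t = Germany (0)
    print(f'X_t = {food:7s} Pr(Z_t=0, X_1:t-1, X_t) = {alpha_t[0]:.4f}')
```

The state transition is computed once, because it does not depend on the new observation; only the emission column differs between the groups.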


## Remarks

- *Monitoring*: With the same recipe we can propagate the **conditional probability distribution**
$$Pr(Z_{t-1}=i| X_{1:t-1}) \to Pr(Z_t=i | X_{1:t})$$
through time - rather than the joint $Pr(Z_t=i, X_{1:t})$.
-  We simply have to normalize after every time step.
$$Pr(Z_t=i|X_{1:t}) = Pr(Z_t=i,X_{1:t})~/~Pr(X_{1:t})$$ 
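
A possible sketch of this monitoring recipe is shown below. The normalization constant at each step equals $Pr(X_t|X_{1:t-1})$, so the product of these constants recovers $Pr(X)$ via the chain rule. Accumulating its logarithm is an addition made here (not part of the notes above) to avoid very small numbers for long sequences. The function name `forward_filter` and the sequence `X` are assumptions for this example.

```python
import numpy as np

def forward_filter(X, pi, P, E):
    """Propagate Pr(Z_t | X_{1:t}) through time and accumulate log Pr(X)."""
    belief = E[:, X[0]] * pi              # joint Pr(Z_1, X_1)
    norm = belief.sum()                   # Pr(X_1)
    belief = belief / norm                # conditional Pr(Z_1 | X_1)
    log_likelihood = np.log(norm)
    for x in X[1:]:
        belief = (belief @ P) * E[:, x]   # Pr(Z_t, X_t | X_{1:t-1})
        norm = belief.sum()               # Pr(X_t | X_{1:t-1})
        belief = belief / norm            # normalize after every time step
        log_likelihood += np.log(norm)
    return belief, log_likelihood

pi = np.array([0.75, 0.25])
P = np.array([[0.8, 0.2], [0.1, 0.9]])
E = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])

X = [0, 2, 2]   # hypothetical observations (assumption, not from the notes)
belief, log_likelihood = forward_filter(X, pi, P, E)
print('Pr(Z_T | X):', belief)
print('Pr(X):      ', np.exp(log_likelihood))
```

The returned `belief` always sums to 1, while the unnormalized forward variable shrinks with every time step.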
