# Motivation and Simulation

Even if the "states of the world" are Markovian, they are often hidden from us, and we only observe some measurements. 

**A traveling analogy**

> I frequently commute between two states: Germany and Switzerland. Let's assume my travels can be modelled as a Markov Process, as described in the previous section. But now I only communicate my dinner plans with the world. Therefore dinner is an **observable** variable, but my current state (the country) variable is **hidden.** We might hope that something could still be learned about the states visited from the observation on food consumption.

<div>
   <img src="https://github.com/thomasmanke/ABS/raw/main/figures/HMM_CountryFood.jpg" width="1000">
</div>


This is a Hidden Markov Model (HMM). An HMM is characterized by three ingredients:

- initial distribution: $P(Z_0=i)=\pi_i$ ( $\to 1 \times N$ matrix = row vector )
- transition matrix: $P(Z_t=j|Z_{t-1}=i) = P_{ij}$  ( $\to N \times N$ matrix )
- emission matrix: $P(X_t=k|Z_t=i) = E_{ik}$ ( $ \to N \times M$ matrix )

The emission probabilities are dependent on the state, but constant over time.

For simplicity we will assume that both states and observables are discrete.
To be specific, a Hidden Markov Model with 2 states $Z \in$ {Germany=0, Switzerland=1} and 3 possible observations $X \in$ {Bread=0, Fish=1, Fondue=2} may read:

\begin{align}
    P(Z_0) &= \begin{bmatrix} 0.75 & 0.25  \end{bmatrix} \\ \\
    P(Z_t | Z_{t-1}) & = \begin{bmatrix} 0.8 & 0.2 \\ 0.1 & 0.9 \end{bmatrix} \\ \\
    P(X_t | Z_t) & =  \begin{bmatrix} 0.7 & 0.2 &  0.1 \\ 0.1 & 0.1 & 0.8 \end{bmatrix} \\
\end{align}


**Discussion:** Give an interpretation of the numbers as they relate to the graph above.  

**Notice:** all rows are non-negative and they sum to 1 (*stochastic* matrices)

The cell below specifies all these parameters in Python/Numpy.



In [None]:
import numpy as np
pi=np.array( [0.75, 0.25] )                          # initial state probability
P =np.array([ [0.8, 0.2], [0.1, 0.9] ])              # transition probabilities
E =np.array([ [0.7, 0.2, 0.1], [0.1, 0.1, 0.8] ])    # emission probabilities

**Task (20 min):** Simulate the above Hidden Markov Model. 

Complete the following function and generate observations from the Hidden Markov Model defined above. You might want to refer back to the first lecture on simple Markov Models.
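
The core of the simulation is a single sampling step: draw the next hidden state from the row of the transition matrix that belongs to the current state, then draw an observation from the row of the emission matrix that belongs to the new state. A minimal sketch of one such step is given here. It assumes the parameters defined above and uses `np.random.default_rng` with an arbitrary seed, which is an assumption of this sketch (the cells in this notebook use `np.random.choice`).

```python
import numpy as np

pi = np.array([0.75, 0.25])                        # initial state probability
P = np.array([[0.8, 0.2], [0.1, 0.9]])             # transition probabilities
E = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])   # emission probabilities

rng = np.random.default_rng(0)   # seeded generator; the seed is arbitrary

z = 0                                          # suppose the current state is Germany
z_next = rng.choice(P.shape[0], p=P[z])        # next state: sampled from row z of P
x_next = rng.choice(E.shape[1], p=E[z_next])   # emission: sampled from row z_next of E
print('next state:', z_next, ' observation:', x_next)
```

Repeating this step in a loop, and starting from a state drawn from `pi`, yields the two sequences that the function is supposed to return.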

In [None]:
# notice the similarities with generate_sequence() from the plain Markov Model
def generate_HMM(P, pi, E, T=50):

  assert P.shape[0]==P.shape[1],         "generate_HMM: P should be a square matrix"
  assert np.allclose( P.sum(axis=1), 1), "generate_HMM: P should be a stochastic matrix"
  assert np.allclose( E.sum(axis=1), 1), "generate_HMM: E should be a stochastic matrix"
  assert np.isclose( pi.sum(), 1),       "generate_HMM: pi should sum to 1"
  assert E.shape[0]==P.shape[0],         "generate_HMM: E and P should have the same number of rows (states)"
  
  # first define two lists (state labels and emission labels)
  ns = ...                          # number of states
  ne = ...                          # number of outputs (#observables)
  states= list(range(ns))                  # state labels as integers
  emissions=list(range(ne))                # observation labels as integers
  #emissions=[chr(97+i) for i in range(ne)] # observation labels as letters

  # choose first state and the corresponding emission
  # we used to set this deterministically before - here we make a random choice
  z = np.random.choice( states,   ... )
  x = np.random.choice( emissions, ... )

  # add state and observation to history
  state_hist = [z]
  emit_hist = [x]
  
  # loop for T time steps
  for t in range(T):
    z = np.random.choice( states,    ...  )
    x = np.random.choice( emissions, ... )

    # collect history with state and emission labels
    state_hist.append(z)
    emit_hist.append(x)
  return state_hist, emit_hist

**Test:** If done correctly, the function should return output such as

In [None]:
np.random.seed(42)
Z, X = generate_HMM(P,pi,E, T=50)
print('states Z       =',*Z)
print('observations X =',*X)

states Z       = 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0
observations X = 2 0 0 1 1 2 2 1 2 2 1 2 2 2 0 1 2 2 0 2 2 2 0 0 0 2 2 2 0 0 0 2 2 2 2 0 1 1 0 0 2 0 2 2 2 2 2 2 2 0 0


**Discussion:** Do these sequences make sense? Can you give an  interpretation of the observation?

From now on we assume that the *hidden* state sequence $Z=Z_{1:T}$ is never observed. 

However, if we know the HMM parameters, we can still give a probabilistic description for $Z$:

- **prior probability** $Pr(Z_t=i) \longrightarrow$ stationary distribution $\pi = \pi P$ 
- **posterior probability** $Pr(Z_t=i|X_t=k) \longrightarrow$ Bayesian update: $ \propto Pr(X_t=k|Z_t=i) Pr(Z_t=i)$
- **most likely hidden state sequence** for given observations: $\arg\max_Z P(Z|X) \longrightarrow$ Viterbi algorithm
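
As a small illustration of the Bayesian update, the sketch below combines a prior over the two states with a single observation. It assumes the initial distribution as the prior and Bread (0) as the observation; the variable names `prior`, `likelihood` and `posterior` are chosen here for illustration only.

```python
import numpy as np

pi = np.array([0.75, 0.25])                        # prior over states (here: initial distribution)
E = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])   # emission probabilities

k = 0                                  # observed value: Bread
prior = pi
likelihood = E[:, k]                   # Pr(X=k | Z=i) for every state i (column k of E)
unnormalized = likelihood * prior      # numerator of Bayes formula
posterior = unnormalized / unnormalized.sum()   # divide by Pr(X=k)
print('posterior =', posterior)
```

The normalization constant `unnormalized.sum()` is the probability of the observation itself, $Pr(X_t=k)$.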



 

## Bayesian interlude here

**Tasks (20 min)**: For the following assume that all HMM parameters are known: $\pi, P, E$.

1. Is the initial distribution the same as the stationary distribution?

2. Let's assume that I sent you my (first ever) message, saying that I just had Fondue for dinner. What is the (posterior) probability that I am in Germany?

- 10%
- 27%
- 75%

In [None]:
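# 1. matrix powers
from numpy.linalg import matrix_power
pi0 = np.array([1.0, 0.0])        # arbitrary start vector (separate name, so that pi is not overwritten)
stat_dist = ...                   # should be independent of pi0
print('stat_dist = ', stat_dist)

# 2.  Bayesian analysis
# it will help to write down Bayes formula here
norm = ...                        # normalization (avoid the name sum: it shadows the Python built-in)
prob = ... / norm
print('norm = ', norm)
print('answer = ', prob)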

Later you will learn how to incorporate all observations $X$ systematically to derive probabilistic statements for $Z$.

# Group Task (15 + 10 min): HMM Generation

1. Make up your own hidden Markov story, draw the corresponding state graph, and define the Hidden Markov Model. 
  - Please keep it simple; fewer than 5 hidden states and fewer than 5 possible observations. 
  - Also make sure that the hidden states are *ergodic* (what was that?)

2. Choose your own emission probabilities, transition probabilities and initial state distribution - make sure they correspond to probabilities. 

3. Simulate $T=1000$ steps.

4. Record (only) the sequence of observations that were generated and store the results as a string in a text file (for later use). Be kind and use integer encoding of observations, i.e. $0,1,\ldots$ regardless of the interpretation.

5. Share your story, code, results and report back to the class.
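
One way to check a made-up model before simulating it is sketched below. The helper names `is_stochastic` and `is_ergodic` are hypothetical and not part of the lecture code. The ergodicity test relies on a standard result for finite chains: the chain is irreducible and aperiodic exactly when some power of $P$ has only positive entries, and by Wielandt's bound it suffices to inspect the power $(N-1)^2+1$.

```python
import numpy as np

def is_stochastic(M):
    """True if all entries are non-negative and every row (or the vector) sums to 1."""
    M = np.asarray(M, dtype=float)
    return bool(np.all(M >= 0) and np.allclose(M.sum(axis=-1), 1.0))

def is_ergodic(P):
    """True if the chain with transition matrix P is irreducible and aperiodic."""
    A = (np.asarray(P) > 0).astype(int)     # which transitions are allowed at all
    n = A.shape[0]
    R = A.copy()                            # reachability pattern of P^1
    for _ in range((n - 1) ** 2):           # after the loop: pattern of P^((n-1)^2 + 1)
        R = ((R @ A) > 0).astype(int)
    return bool(R.all())

# example: the Germany/Switzerland model from above
pi = np.array([0.75, 0.25])
P = np.array([[0.8, 0.2], [0.1, 0.9]])
E = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
print(is_stochastic(pi), is_stochastic(P), is_stochastic(E), is_ergodic(P))
```

A chain that only alternates between two states, such as `[[0, 1], [1, 0]]`, is stochastic but periodic, so `is_ergodic` rejects it.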

In [None]:
#%%script echo complete (only one per group)

pi=np.array( ... )
P =np.array( ... )
E =np.array( ... )

T=1000
Z, X = generate_HMM(P, pi, E, T=T)

fn='obs_group1.txt'  # choose a group-specific filename

# write ######
with open(fn, 'w') as f:
  m = map(str, X)            # convert numbers to strings
  f.write(' '.join(list(m))) 

# read (for later) ######
with open(fn, "r") as f:
  line  = f.readline().split()  # read first line and split
Xr = list(map(np.int64, line))  # map line to np.int64

print('X =',*X)
print('Xr=',*Xr)
np.array_equal(X, Xr)            # True if the sequence survived the round trip unchanged

# Applications: What is this all good for?

- Speech Recognition
- DNA sequence analysis: sequence segmentation
- Summarizing Multiple Alignments: profile HMM
- Robot Location


# Reference: Conventions and Notations

- Number of States: $i=1,2, \ldots N$
- Number of Observations (discrete values): $k=1,2, \ldots M$
- Length of Sequence: $t=1,2, \ldots T$
- conditional probabilities (notice the index order !)
  - $P_{ij} = Pr(Z_{t+1}=j|Z_t=i)$ 
  - $E_{ik} = Pr(X_{t}=k|Z_t=i)$ 
- Transition Matrix: $P=(P_{ij})$ $\longrightarrow N \times N$ matrix
- Emission Matrix: $E=(E_{ik})$ $\longrightarrow N \times M$ matrix
- Initial Probability: $\pi_i$  $\longrightarrow N$ dim. row vector
- Sequences: condensed notation
  - observations $X = X_{1:T} = (X_1, X_2, \ldots, X_T)$
  - hidden states $Z = Z_{1:T} = (Z_1, Z_2, \ldots, Z_T)$


**Notice:** 

- Both states and observations are discrete variables (e.g. $Z=GGGGIIIIGGGG...$ and $X=ACTGTCGCGCGATTA$) but they are often encoded as integer variables, e.g. $Z=000011110000$
- more condensed notation: $Pr(Z_t=i) = Pr(Z_t)$
- all $X_t$ are observed, so $E_{ik}$ serves as a look-up table. Sometimes I write $E_{it}$ 
- Python indices of arrays start at 0

# Joint Probability: $Pr(X,Z)$

Hidden Markov Models are graphical models:


<div>
   <img src="https://github.com/thomasmanke/ABS/raw/main/figures/HMM_Joint.jpg" width="1000">
</div>


$$ 
Pr(X,Z) = Pr(X|Z) Pr(Z) = Pr(Z_1) Pr(X_1|Z_1) \prod_{t=2}^T Pr(Z_t | Z_{t-1}) Pr(X_{t}|Z_t)
$$


**Message**: 
- Chain together probabilities of initial state, state transitions and observed emissions!
- For given sequences ($X$ and $Z$), multiply all edge probabilities in graphical model ! 
- Recursion Principle: if partial solution $Pr(X_{1:t}, Z_{1:t})$ is available,  $Pr(X_{1:t+1}, Z_{1:t+1})$ can be obtained iteratively
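
The chaining of probabilities can be sketched as a short function. This is a minimal illustration under the assumption that observations and states are integer-encoded as above; the name `joint_probability` and the example sequences are hypothetical.

```python
import numpy as np

pi = np.array([0.75, 0.25])
P = np.array([[0.8, 0.2], [0.1, 0.9]])
E = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])

def joint_probability(X, Z, pi, P, E):
    """Pr(X, Z) for one observation sequence X and one state path Z of equal length."""
    assert len(X) == len(Z) and len(X) > 0
    p = pi[Z[0]] * E[Z[0], X[0]]                    # initial state and first emission
    for t in range(1, len(X)):
        p *= P[Z[t - 1], Z[t]] * E[Z[t], X[t]]      # one transition and one emission per step
    return p

X = [0, 2, 2]     # Bread, Fondue, Fondue
Z = [0, 1, 1]     # Germany, Switzerland, Switzerland
print('Pr(X,Z) =', joint_probability(X, Z, pi, P, E))
```

The loop makes the recursion principle explicit: each iteration extends the partial product by exactly one transition factor and one emission factor.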

**Task (10 min)**: 

Given the HMM parameters and an observed sequence $X=002$, calculate the probability of all possible paths:
- Group 1: all paths starting with $G$
- Group 2: all paths starting with $S$

Report back the path with the highest probability and compare.

In [None]:
path000 = pi[0]*E[0,0]*P[0,0]*E[0,0]*P[0,0]*E[0,2]  # path GGG
path001 = ...                                       # path GGS

**Notice:** 
- Condensed notation, e.g. $Pr(X,Z) = Pr(X_1=x_1, X_2=x_2, \ldots, Z_1=z_1, Z_2=z_2, \ldots, Z_T=z_T)$ 
- $\sum_X \sum_Z P(X,Z) = 1$
- for discrete Markov chains both $X_t$ and $Z_t$ are discrete variables. Their values are often represented by integers (e.g. X=001123001)
- Strictly we may want to write $Pr(X,Z|\Theta)$ to highlight the conditioning on known parameters $\Theta=(P,E,\pi)$. Unless stated otherwise, we assume that those parameters are all known and fixed.

**Discussion (5 min):** 

- How many possible sequences are there for a) observations $X$ and b) hidden states $Z$?
- What happens to $Pr(X,Z)$ if the sequence $Z$ contains forbidden transitions?
- Why would one want to know the joint?
- Why can we not calculate it ?

# Typical Problems



- given model parameters + observations $X$ 
  - scoring (observations): $Pr(X) \longrightarrow$ **Forward Algorithm**
  - decoding (hidden variables):
    - posterior state probability $Pr(Z_t=i|X) \longrightarrow$  **Forward-Backward Algorithm**
    - best state sequence: $\arg\max_Z Pr(Z|X) \longrightarrow$ **Viterbi Algorithm**

- given only observations $X$:
  - learning (model parameters)  $\longrightarrow$ **Baum-Welch Algorithm**

# Probability of Observations: $Pr(X)$

**Uses:** Evaluate (score) observations. Compare different models: $P(X|\Theta_1)$ vs $P(X|\Theta_2)$

**First idea: Chain rule**
$$
P(X) = P(X_1) P(X_2|X_1) P(X_3|X_1,X_2) \ldots P(X_T| X_1, \ldots ,X_{T-1})
$$

... not directly calculable from the HMM parameters

**Second idea: Naive Marginalization**

Use joint distribution $Pr(X,Z)$ and remove hidden state sequence (it is unobservable) $\to$ marginalize

$$
Pr(X) = \sum_Z P(X,Z) = \sum_Z Pr(X|Z) Pr(Z)
$$

**Notice:**
- remember: each term in sum breaks into emission probabilities, transition probabilities (and initial state probability)
- marginalization over *all possible* state paths $Z$ ($=Z_{1:T} = Z_1 Z_2 \cdots Z_T$) 
- $N^T$ paths for $N$ possible states and sequences of length $T$: infeasible for all but very short sequences
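
For very short sequences the naive marginalization can still be carried out explicitly. The sketch below enumerates all state paths with `itertools.product` and sums their joint probabilities; the function name `brute_force_likelihood` and the example sequence are hypothetical.

```python
import itertools
import numpy as np

pi = np.array([0.75, 0.25])
P = np.array([[0.8, 0.2], [0.1, 0.9]])
E = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])

def brute_force_likelihood(X, pi, P, E):
    """Pr(X) as the sum of Pr(X, Z) over all N**T state paths Z."""
    N, T = P.shape[0], len(X)
    total = 0.0
    for Z in itertools.product(range(N), repeat=T):      # every possible path
        p = pi[Z[0]] * E[Z[0], X[0]]
        for t in range(1, T):
            p *= P[Z[t - 1], Z[t]] * E[Z[t], X[t]]
        total += p
    return total

X = [0, 2, 2]
print('number of paths:', P.shape[0] ** len(X))
print('Pr(X) =', brute_force_likelihood(X, pi, P, E))
```

The number of terms doubles with every additional time step for $N=2$, which is why this approach only serves as a reference for checking faster methods on toy examples.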



**Third idea: Dynamic Programming (reuse previous calculations)**

### The Forward Algorithm

We don't know any $Z_t$, so we need to **track all possibilities**: Trellis graph.

<div>
   <img src="https://github.com/thomasmanke/ABS/raw/main/figures/HMM_HiddenTrellis.jpg" width="1000">
</div>

Let's assume that at some time $t$ we already know the joint probability for the observed sequence $X_{1:t}$ and the hidden state $Z_t$ (for each possible value of $Z_t=i$).

This information is stored in the **forward variable:** $\alpha_{ti} = Pr(X_{1:t}, Z_t=i)$. This is a vector of joint probabilities that will be propagated forward in time.



#### Iteration

- **1. Initialization** ($t=1$): 
$$
\alpha_{1i} = Pr(X_1, Z_1=i) = Pr(X_1|Z_1=i) Pr(Z_1=i)
$$
  - $\to$ element-wise multiplication of a column from the emission matrix (the column selected by the observed value of $X_1$) with the initial state distribution
  - $Pr(X_1|Z_1=i)$ is one element of the emission matrix $E_{ik}$ (row $i$ = state, column $k = k(1)$ = observed value of $X_1$). 
  - $Pr(Z_1=i)$ is the i-th element of the initial state distribution $\pi_i$.



- **2. Induction ($t \to t+1$):** state transition + new observation 
  - *2.1 state transition ($Z_t \to Z_{t+1}$):* Consider all possible Markov transitions and sum them up.
$$
\begin{align}
Pr(Z_{t+1}=i, X_{1:t}) &= \sum_k Pr(Z_{t+1}=i, Z_t=k, X_{1:t}) \\
&=\sum_k Pr(Z_{t+1}=i|Z_t=k, X_{1:t}) Pr(Z_t=k,X_{1:t}) \\
 &= \sum_{k} P_{ki} \alpha_{tk}  = \sum_{k} \alpha_{tk} P_{ki}
\end{align}
$$
$\to$ matrix multiplication of row vector $\alpha_{tk}$ with transition matrix
  - *2.2 new observation ($X_{t+1}$):* consider emission probability resulting in observation $X_{t+1}$
$$
Pr(Z_{t+1}=i, X_{1:t}, X_{t+1}) = Pr(X_{t+1}|Z_{t+1}=i) Pr(Z_{t+1}=i, X_{1:t})
$$
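$\to$ element-wise multiplication of a column from the emission matrix (the column selected by the observed value of $X_{t+1}$) with the propagated vector from step 2.1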

2 Steps (graphical interpretation)
<div>
   <img src="https://github.com/thomasmanke/ABS/raw/main/figures/HMM_Forward.jpg" width="800">
</div>

- **3. Termination ($t=T$):**
$$
Pr(X) = Pr(X_{1:T}) = \sum_i \alpha_{Ti}
$$

**Notice:**
- marginalization: $Pr(X_{1:t}) = \sum_i \alpha_{ti} \ne 1$.  In fact, it is much smaller than 1 for large $t$ !
- Propagation of $\alpha_{tk}$ is linear in sequence length $T$
- Calculation of $Pr(X)$ requires $T N^2$ calculations $\ll N^T$ 
- Example: $(N,T) = (2, 100) \longrightarrow 400 \ll 2^{100}$  
- Emission matrix $E_{ik}$ serves as lookup table for given observation $X_t=k$ at time $t$. ($k=f(t)$)
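
The three steps (initialization, induction, termination) can be put together in a few lines. The following is a minimal sketch, not a reference implementation: the function name `forward` is hypothetical, and time is 0-based in the code, so row `t` of `alpha` corresponds to $\alpha_{t+1}$ in the notation above.

```python
import numpy as np

pi = np.array([0.75, 0.25])
P = np.array([[0.8, 0.2], [0.1, 0.9]])
E = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])

def forward(X, pi, P, E):
    """Forward variables: alpha[t, i] = Pr(X_1..X_{t+1}, Z_{t+1}=i) with 0-based row index t."""
    T, N = len(X), P.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * E[:, X[0]]                        # 1. initialization
    for t in range(1, T):
        predicted = alpha[t - 1] @ P                  # 2.1 state transition (row vector times matrix)
        alpha[t] = predicted * E[:, X[t]]             # 2.2 new observation (column of E as lookup)
    return alpha

X = [0, 2, 2]
alpha = forward(X, pi, P, E)
print(alpha)
print('Pr(X) =', alpha[-1].sum())                     # 3. termination
```

Each iteration costs one vector-matrix product, so the total effort grows as $T N^2$ rather than $N^T$.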

### Summary: forward recursion is fast
<div>
   <img src="https://github.com/thomasmanke/ABS/raw/main/figures/HMM_Forward_summary.jpg" width="800">
</div>


### Group Task (20 min)

Consider the above HMM with 2 states (Germany=0, Switzerland=1) and a magically known joint probability $Pr(Z_{t-1}, X_{1:t-1})=(0.05, 0.02)$.
This denotes the probability for the two states **and** all observations until time $t-1$. (Notice that this does not have to sum to 1!)

Calculate the updated probability for $Z_t=$ Germany (0) **and** that the newly observed emission is Bread (0) or Fondue (2). 

- Group 1: $P(Z_t=0, X_t=0) = $ ? 
- Group 2: $P(Z_t=0, X_t=2) = $ ?

In [None]:
import numpy as np

alpha = np.array([0.05, 0.02]) # initial probability   

alpha = ...     # state transition
print('after state transition: ', alpha)

x = ...                          # define observation at time t
LH=E[ ... ]                      # emission probabilities of observation x (one value per state)
print('emission vector:        ', LH)

alpha = ...                # take into account observations
print('new probability         ', alpha) 

# for calculation of conditional probability
alpha /= np.sum(alpha)         # normalized posterior
print('posterior norm:', alpha)

**More remarks:**
- We assumed that all parameters of the HMM are known: $\Theta=(P,E,\pi)$


- *Monitoring:* The above derivation could also have been done for the conditional probability: $Pr(Z_t=i|X_{1:t})$. The only difference is that new observations are incorporated using Bayes theorem
$$
Pr(Z_{t+1}=i, X_{1:t+1}) = Pr(X_{t+1}|Z_{t+1}=i) Pr(Z_{t+1}=i, X_{1:t})
$$
$$
Pr(Z_{t+1}=i | X_{1:t+1}) \propto Pr(X_{t+1}|Z_{t+1}=i) Pr(Z_{t+1}=i | X_{1:t})
$$
Here $Pr(Z_{t+1}=i|X_{1:t+1})$ will still have to be normalized at each time step such that $\sum_k Pr(Z_{t+1}=k|X_{1:t+1}) = 1$



**Follow up task (5 min)**

Rather than calculating the joint probabilities, adjust the above cell to calculate the more interesting conditional probabilities:

- Group 1: $P(Z_t=0 | X_t=0) = $ ? 
- Group 2: $P(Z_t=0 | X_t=2) = $ ?

Assume that the conditional probability at time $(t-1)$ is $Pr(Z_{t-1}=i| X_{1:t-1})=(0.75, 0.25)$. (This has to sum to 1).

Hint: Apart from the obvious change for the initial alpha, this needs only one additional line of code.

Calculating $P(Z_t=i| X_{1:t})$ for all $t$ amounts to a simple (forward) loop over all times; it is linear in $t$.
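
Such a loop might look as follows. This is a sketch under the same assumptions as before (integer-encoded observations, known parameters); the function name `forward_filter` is hypothetical. It differs from the plain forward recursion only by the normalization at every time step.

```python
import numpy as np

pi = np.array([0.75, 0.25])
P = np.array([[0.8, 0.2], [0.1, 0.9]])
E = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])

def forward_filter(X, pi, P, E):
    """Row t holds the conditional distribution of the hidden state given X[0..t]."""
    T, N = len(X), P.shape[0]
    post = np.zeros((T, N))
    a = pi * E[:, X[0]]                      # joint probability with the first observation
    post[0] = a / a.sum()                    # normalize to obtain a conditional probability
    for t in range(1, T):
        a = (post[t - 1] @ P) * E[:, X[t]]   # state transition, then Bayesian update
        post[t] = a / a.sum()                # normalize at every time step
    return post

X = [0, 2, 2, 2, 0]
post = forward_filter(X, pi, P, E)
print(post)
```

Because every row is renormalized, the values stay of order 1 even for long sequences, in contrast to the unnormalized $\alpha_{ti}$.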