# Motivation and Simulation

Even if the "states of the world" are Markovian, they are often hidden from us, and we only observe some measurements. 

**A traveling analogy**

> I frequently commute between two states: Germany and Switzerland. Let's assume my travels can be modelled as a Markov process, as described in the previous section. But now I only share my dinner plans with the world. Dinner is therefore an **observable** variable, while my current state (the country) is **hidden**. We might hope that something can still be learned about the states visited from the observations of food consumption.

<div>
   <img src="https://github.com/thomasmanke/ABS/raw/main/figures/HMM_CountryFood.jpg" width="1000">
</div>


This is a Hidden Markov Model (HMM). An HMM is characterized by three ingredients:

- initial distribution: $P(Z_0=i)=\pi_i$ ( $\to 1 \times N$ matrix = row vector )
- transition matrix: $P(Z_t=j|Z_{t-1}=i) = P_{ij}$  ( $\to N \times N$ matrix )
- emission matrix: $P(X_t=k|Z_t=i) = E_{ik}$ ( $ \to N \times M$ matrix )

The emission probabilities are dependent on the state, but constant over time.

For simplicity we will assume that both states and observables are discrete.
To be specific, a Hidden Markov Model with 2 states $Z \in$ {Germany=0, Switzerland=1} and 3 possible observations $X \in$ {Bread=0, Fish=1, Fondue=2} may read:

$$
\begin{align}
    P(Z_0) &= \begin{bmatrix} 0.75 & 0.25  \end{bmatrix} \\ \\
    P(Z_t | Z_{t-1}) & = \begin{bmatrix} 0.8 & 0.2 \\ 0.1 & 0.9 \end{bmatrix} \\ \\
    P(X_t | Z_t) & =  \begin{bmatrix} 0.7 & 0.2 &  0.1 \\ 0.1 & 0.1 & 0.8 \end{bmatrix} \\
\end{align}
$$

**Discussion:** Give an interpretation of the numbers as they relate to the graph above.  

**Notice:** all entries are non-negative and each row sums to 1 (*stochastic* matrices)

The cell below specifies all these parameters in Python/Numpy.



In [None]:
import numpy as np
pi=np.array( [0.75, 0.25] )                          # initial state probability
P =np.array([ [0.8, 0.2], [0.1, 0.9] ])              # transition probabilities
E =np.array([ [0.7, 0.2, 0.1], [0.1, 0.1, 0.8] ])    # emission probabilities

**Group Task (30 min):** Discuss and simulate the above Hidden Markov Model. 

Complete the following function and generate observations from the Hidden Markov Model defined above. You might want to refer back to the first lecture on simple Markov Models.

In [None]:
# notice the similarities with generate_sequence() from the plain Markov Model
def generate_HMM(P, pi, E, T=50):
  assert P.shape[0]==P.shape[1],         "generate_HMM: P should be a square matrix"
  assert E.shape[0]==P.shape[0],         "generate_HMM: E and P should have the same number of rows (states)"
  assert np.allclose( P.sum(axis=1), 1), "generate_HMM: P should be a stochastic matrix"
  assert np.allclose( E.sum(axis=1), 1), "generate_HMM: E should be a stochastic matrix"
  assert np.isclose( pi.sum(), 1),       "generate_HMM: pi should sum to 1"
  
  
  # first define two lists (states and emissions, both labelled by integers)
  ns = ...                          # number of states
  ne = ...                          # number of outputs (#observables)
  states= list(range(ns))           # state labels as integers
  emissions=list(range(ne))         # observation labels as integers

  # choose first state and emission
  z = np.random.choice( states,    p = ... )
  x = np.random.choice( emissions, p = ... )

  # add state and observation to history
  state_hist = [z]
  emit_hist = [x]
  
  # loop for T time steps
  for t in range(T):
    z = np.random.choice( states,    p = ... )
    x = np.random.choice( emissions, p = ... )

    # collect history with state and emission labels
    state_hist.append(z)
    emit_hist.append(x)
  return state_hist, emit_hist

**Test:** If done correctly, the function should return output such as

In [None]:
%%script echo ensure that function generate_HMM() is defined
np.random.seed(42)
Z, X = generate_HMM(P,pi,E, T=50)
print('states Z       =',*Z)
print('observations X =',*X)

states Z       = 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0
observations X = 2 0 0 1 1 2 2 1 2 2 1 2 2 2 0 1 2 2 0 2 2 2 0 0 0 2 2 2 0 0 0 2 2 2 2 0 1 1 0 0 2 0 2 2 2 2 2 2 2 0 0


**Discussion:** Do these sequences make sense? Can you give an interpretation of the observations?
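
For comparison after you have tried the task yourself, here is a minimal sketch of one possible simulation. The function name `simulate_hmm` and the `rng` argument are assumptions made for this illustration, not part of the notebook. Because it uses `np.random.default_rng` rather than `np.random.seed`, it will not reproduce the exact sequences printed above.

```python
import numpy as np

def simulate_hmm(P, pi, E, T=50, rng=None):
    """Sample hidden states and observations; returns two lists of length T+1."""
    rng = np.random.default_rng() if rng is None else rng
    ns, ne = E.shape                  # number of states, number of observables
    z = rng.choice(ns, p=pi)          # initial state drawn from pi
    x = rng.choice(ne, p=E[z])        # emission depends only on the current state
    state_hist, emit_hist = [z], [x]
    for _ in range(T):
        z = rng.choice(ns, p=P[z])    # next state drawn from row z of P
        x = rng.choice(ne, p=E[z])    # emission drawn from row z of E
        state_hist.append(z)
        emit_hist.append(x)
    return state_hist, emit_hist

# parameters of the Germany/Switzerland example
pi = np.array([0.75, 0.25])
P = np.array([[0.8, 0.2], [0.1, 0.9]])
E = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])

Z, X = simulate_hmm(P, pi, E, T=50, rng=np.random.default_rng(0))
print('states Z       =', *Z)
print('observations X =', *X)
```

Note that, like the template above, this sketch returns the initial state plus `T` further steps, so both lists have `T+1` entries.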

# Interlude: Joints, Marginals, Conditionals, Bayes & All That

This interlude will apply beyond Hidden Markov Models, but I will use the above emission probabilities as an example.

So let's be discrete and consider two variables: $Z \in \{0,1\}$ and $X \in \{0,1,2\}$. 

Much of what follows also applies to continuous variables, where discrete distributions are replaced by probability density functions and sums by integrals, subject to the usual mathematical concerns about the existence of limits and such.


## Joint & Co


<div>
   <img src="https://github.com/thomasmanke/ABS/raw/main/figures/JointConditionalMarginal.jpg" width="1000">
</div>

Knowing the **joint distribution** $P(X,Z)$ is the best we can hope for, since everything else can (in principle) be calculated from it.

However:

a) remember that we are about to hide all state variables $Z$.

b) calculations may be very hard, both analytically and computationally. For example, even if we knew a joint distribution such as 

$$
P(X_1, X_2, \ldots, X_T, Z_1, \ldots, Z_T)
$$

many computational tasks would become very difficult (combinatorics!) unless the problem has some structure (such as a Markov property).





In many situations we may be more interested in specific subsets of variables: 

- **conditional distributions**: some variables are known or fixed, 
- **marginal distributions**: some variables are uninteresting $\longrightarrow$ average over them.
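
To make these three objects concrete, the sketch below builds a joint table from the emission matrix of the example. It assumes that the distribution of $Z$ is the initial distribution $(0.75, 0.25)$; the variable names are chosen for this illustration only.

```python
import numpy as np

prior = np.array([0.75, 0.25])                      # Pr(Z), assumed: initial distribution
E = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])    # Pr(X | Z), one row per state

joint = prior[:, None] * E          # Pr(Z=i, X=k), shape (2, 3)
marg_X = joint.sum(axis=0)          # marginal Pr(X=k): sum over Z
marg_Z = joint.sum(axis=1)          # marginal Pr(Z=i): sum over X, recovers the prior
cond_Z_given_X = joint / marg_X     # conditional Pr(Z=i | X=k): each column sums to 1

print('joint =\n', joint)
print('Pr(X) =', marg_X)
print('Pr(Z) =', marg_Z)
print('Pr(Z|X) =\n', cond_Z_given_X)
```

All entries of the joint table sum to 1, while the conditional table is normalized separately for each value of the conditioning variable.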

## Bayes Theorem

<div>
   <img src="https://github.com/thomasmanke/ABS/raw/main/figures/BayesEquation.jpg" width="1000">
</div>
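
In the notation of this notebook, with hidden state $Z$ and observation $X$, the standard form of Bayes' theorem can be written as follows (a plain restatement; the figure may use different symbols):

$$
Pr(Z=i|X=k) = \frac{Pr(X=k|Z=i) \, Pr(Z=i)}{\sum_{j} Pr(X=k|Z=j) \, Pr(Z=j)}
$$

The denominator is the marginal $Pr(X=k)$, which normalizes the posterior so that it sums to 1 over $i$.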

## Example: Diagnostic Tests

<div>
   <img src="https://github.com/thomasmanke/ABS/raw/main/figures/DiagnosticTest.jpg" width="1000">
</div>
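
The same reasoning can be sketched numerically. All three numbers below (prevalence, sensitivity, specificity) are hypothetical values chosen for illustration; they are not taken from the figure or from any real test.

```python
# Hypothetical test characteristics (illustration only)
prevalence  = 0.01   # Pr(disease)
sensitivity = 0.95   # Pr(test positive | disease)
specificity = 0.90   # Pr(test negative | no disease)

# marginal probability of a positive test (true positives + false positives)
evidence = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes: posterior probability of disease given a positive test
posterior = sensitivity * prevalence / evidence
print(f"Pr(disease | positive test) = {posterior:.3f}")
```

With a low prevalence, most positive results are false positives, so the posterior stays far below the sensitivity of the test.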


## **Group Tasks (20 min)**: Bayesian Reasoning

For the following assume that all HMM parameters are known: $\pi, P, E$.

1. Is the initial distribution the same as the stationary distribution?

2. Let's assume that I sent you my (first ever) message, saying that I just had Fondue for dinner. What is the (posterior) probability that I am in Germany?

- 5.8%
- 50.0%
- 75%

In [None]:
%%script echo edit before execution
# 1. matrix powers
from numpy.linalg import matrix_power
pi = np.array([1.0, 0.0])         
stat_dist = ...   # independent of pi
print('stat_dist = ', stat_dist)

# 2.  Bayesian analysis
total = ...          # normalization (named 'total' to avoid shadowing the built-in sum)
prob  = ... / total
print('total = ', total)
print('answer = ', prob)

Later you will learn how to incorporate all observations $X$ systematically to derive probabilistic statements for $Z$.
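
For checking your own results, here is a minimal sketch of both tasks. Which prior is appropriate for the "first ever" message is part of the discussion, so the sketch evaluates the posterior for both candidates. The helper name `posterior_germany` is an assumption made for this illustration.

```python
import numpy as np
from numpy.linalg import matrix_power

pi0 = np.array([0.75, 0.25])                        # initial distribution
P = np.array([[0.8, 0.2], [0.1, 0.9]])              # transition probabilities
E = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])    # emission probabilities

# 1. stationary distribution: propagate any start vector for many steps
stat_dist = np.array([1.0, 0.0]) @ matrix_power(P, 100)
print('stat_dist =', stat_dist)

# 2. Bayes: Pr(Z=Germany | X=k) for a given prior over Z
def posterior_germany(prior, k=2):
    joint = prior * E[:, k]          # Pr(Z=i) * Pr(X=k | Z=i)
    return joint[0] / joint.sum()    # normalize by Pr(X=k)

post_init = posterior_germany(pi0)         # prior = initial distribution
post_stat = posterior_germany(stat_dist)   # prior = stationary distribution
print('posterior with initial prior    =', post_init)
print('posterior with stationary prior =', post_stat)
```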

# Back to HMM: Hiding Z

From now on we assume that the *hidden* state sequence $Z=Z_{1:T}$ is never observed. 

<div>
   <img src="https://github.com/thomasmanke/ABS/raw/main/figures/HMM_Chain.jpg" width="1000">
</div>

However, if we know the HMM parameters, we can still give a probabilistic description for $Z$:

- **prior probability** $Pr(Z_t=i) \longrightarrow$ stationary distribution $\pi = \pi P$ 
- **posterior probability** $Pr(Z_t=i|X_t=k) \longrightarrow$ Bayesian update: $\propto Pr(X_t=k|Z_t=i) Pr(Z_t=i)$
- **most likely hidden state sequence** for given observations: $\arg\max_Z Pr(Z|X) \longrightarrow$ Viterbi algorithm



 

# Reference: Conventions and Notations

- States: $i=1,2, \ldots, N$ ($N$ = number of states)
- Observations (discrete values): $k=1,2, \ldots, M$ ($M$ = number of possible observations)
- Time steps: $t=1,2, \ldots, T$ ($T$ = length of sequence)
- conditional probabilities (notice the index order !)
  - $P_{ij} = Pr(Z_{t+1}=j|Z_t=i)$ 
  - $E_{ik} = Pr(X_{t}=k|Z_t=i)$ 
- Transition Matrix: $P=(P_{ij})$ $\longrightarrow N \times N$ matrix
- Emission Matrix: $E=(E_{ik})$ $\longrightarrow N \times M$ matrix
- Initial Probability: $\pi_i$  $\longrightarrow N$ dim. row vector
- Sequences: condensed notation
  - observations $X = X_{1:T} = (X_1, X_2, \ldots, X_T)$
  - hidden states $Z = Z_{1:T} = (Z_1, Z_2, \ldots, Z_T)$


**Notice:** 

- Both states and observations are discrete variables (e.g. $Z=GGGGIIIIGGGG...$ and $X=ACTGTCGCGCGATTA$) but they are often encoded as integer variables, e.g. $Z=000011110000$
- more condensed notation: $Pr(Z_t=i) = Pr(Z_t)$
- all $X_t$ are observed, so $E_{ik}$ serves as a look-up table. Sometimes I write $E_{it}$ as shorthand for $E_{i x_t}$, the emission probability of the value observed at time $t$
- Python indices of arrays start at 0

# Joint Probability: $Pr(X,Z)$

Hidden Markov Models are graphical models:


<div>
   <img src="https://github.com/thomasmanke/ABS/raw/main/figures/HMM_Joint.jpg" width="1000">
</div>


$$ 
Pr(X,Z) = Pr(X|Z) Pr(Z) = Pr(Z_1) Pr(X_1|Z_1) \prod_{t=2}^T Pr(Z_t | Z_{t-1}) Pr(X_{t}|Z_t)
$$


**Message**: 
- Chain together probabilities of initial state, state transitions and observed emissions!
- For given sequences ($X$ and $Z$), multiply all edge probabilities in the graphical model!
- **Recursion Principle:** if partial solution $Pr(X_{1:t}, Z_{1:t})$ is available,  $Pr(X_{1:t+1}, Z_{1:t+1})$ can be obtained iteratively

**Notice:** 
- Condensed notation, e.g. $Pr(X,Z) = Pr(X_1=x_1, X_2=x_2, \ldots, X_T=x_T, Z_1=z_1, Z_2=z_2, \ldots, Z_T=z_T)$ 
- $\sum_X \sum_Z Pr(X,Z) = 1$
- for discrete Markov chains both $X_t$ and $Z_t$ are discrete variables. Their values are often represented by integers, e.g. $X=(0,0,2,1,1,2,2, \ldots)$
- Strictly we may want to write $Pr(X,Z|\Theta)$ to highlight the conditioning on known parameters $\Theta=(P,E,\pi)$. For now we assume that those parameters are all known and fixed.
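
The product formula and the recursion principle can be sketched as a running product over time steps. The function name `joint_prob` is hypothetical, and sequences are assumed to be integer-coded with Python's 0-based indexing.

```python
import numpy as np

pi = np.array([0.75, 0.25])                         # initial distribution
P = np.array([[0.8, 0.2], [0.1, 0.9]])              # transition probabilities
E = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])    # emission probabilities

def joint_prob(X, Z, pi, P, E):
    """Pr(X, Z) for one observation sequence X and one state path Z."""
    # t = 1: initial state and its emission
    p = pi[Z[0]] * E[Z[0], X[0]]
    # t = 2..T: extend the partial result by one transition and one emission
    for t in range(1, len(X)):
        p *= P[Z[t-1], Z[t]] * E[Z[t], X[t]]
    return p

# example: Fondue in Switzerland, then Bread in Germany
p = joint_prob(X=[2, 0], Z=[1, 0], pi=pi, P=P, E=E)
print('Pr(X,Z) =', p)   # 0.25 * 0.8 * 0.1 * 0.7
```

For long sequences this product becomes very small, so in practice one would typically sum logarithms instead; that refinement is omitted here.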

## Discussion (10 min):

- Why would we want to know the joint distribution $Pr(X,Z)$? Why will it be difficult?
- How many possible sequences are there for a) observations $X$ and b) hidden states $Z$?
- What happens to $Pr(X,Z)$ if the sequence $Z$ contains forbidden transitions?

## Group Task (20 min)

Consider the HMM parameters above and an observed sequence $X=(0,0,2)$. You have not observed the corresponding state sequence $Z=(z_1, z_2, z_3)$, but with the HMM parameters you can calculate the joint probability $Pr(X,Z)$ for every possible hidden state path $Z$.
- Group 1: all paths starting with $G$
- Group 2: all paths starting with $S$

Report back the path with the highest probability and compare.
What happens if you sum all collected probabilities?
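
To compare with your hand calculation, the following sketch enumerates all $N^T = 2^3$ hidden paths by brute force. It assumes the example parameters from the beginning of the notebook; the variable names are chosen for this illustration. Summing over all paths marginalizes out $Z$, so the total equals $Pr(X)$.

```python
import itertools
import numpy as np

pi = np.array([0.75, 0.25])                         # initial distribution
P = np.array([[0.8, 0.2], [0.1, 0.9]])              # transition probabilities
E = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])    # emission probabilities

X = (0, 0, 2)      # observed sequence
probs = {}         # maps each path Z to Pr(X, Z)
for Z in itertools.product(range(len(pi)), repeat=len(X)):
    p = pi[Z[0]] * E[Z[0], X[0]]
    for t in range(1, len(X)):
        p *= P[Z[t-1], Z[t]] * E[Z[t], X[t]]
    probs[Z] = p
    print(*Z, ':', p)

best = max(probs, key=probs.get)     # path with the highest joint probability
total = sum(probs.values())          # sum over all paths = Pr(X)
print('best path =', best)
print('sum       =', total)
```

This enumeration grows exponentially with the sequence length, which is why it is only feasible for very short sequences.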

In [None]:
%%script echo edit before execution
print('0 0 0: ', pi[0]*E[...]*P[...]*E[...]*...)
...

# alternatively write a loop over Z
X = [0,0,2]



edit before execution


# Group Task (30 min): HMM Generation

1. Make up your own hidden Markov story, draw the corresponding state graph, and define the Hidden Markov Model. 
  - Please keep it simple: fewer than 5 hidden states and fewer than 5 possible observations. 
  - Also make sure that the Markov Model for the hidden states is *ergodic* (what was that?)

2. Choose your own emission probabilities, transition probabilities and the initial state distribution, and make sure they are valid probabilities (non-negative, with each row summing to 1). 

3. Simulate $T=1000$ steps.

4. Record (only) the sequence of observations that were generated and store the results as a string in a text file (for later use). Be kind and use integer encoding of observations, i.e. $0,1,\ldots$, regardless of the interpretation.

5. Share your story, code, results and report back to the class.
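
Before simulating, it may help to validate your parameters. The sketch below is one possible check, with the hypothetical helper name `check_hmm`. It asserts that $\pi$, $P$ and $E$ are valid probabilities and then tests whether some power of $P$ has only positive entries, which for a finite chain is equivalent to being irreducible and aperiodic. The exponent $(N-1)^2+1$ is the Wielandt bound for primitive matrices.

```python
import numpy as np

def check_hmm(pi, P, E):
    """Assert valid HMM parameters; return True if the hidden chain is ergodic."""
    pi, P, E = (np.asarray(a, dtype=float) for a in (pi, P, E))
    n = P.shape[0]
    assert P.shape == (n, n), "P should be a square matrix"
    assert E.shape[0] == n and pi.shape == (n,), "pi, P, E disagree on number of states"
    for name, A in (("pi", pi), ("P", P), ("E", E)):
        assert np.all(A >= 0), f"{name} has negative entries"
        assert np.allclose(A.sum(axis=-1), 1), f"{name} does not sum to 1"

    # reachability pattern of P^k, entries clipped to {0, 1} to avoid overflow
    A = (P > 0).astype(int)
    R = A.copy()
    for _ in range((n - 1) ** 2):
        R = np.minimum(R @ A, 1)
    return bool(R.all())     # all entries positive: irreducible and aperiodic

# example: the Germany/Switzerland model
pi = np.array([0.75, 0.25])
P = np.array([[0.8, 0.2], [0.1, 0.9]])
E = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
print('ergodic:', check_hmm(pi, P, E))
```

A chain that deterministically alternates between two states would pass the probability checks but return `False`, because it is periodic.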

In [None]:
%%script echo edit before execution (only one per group)

pi=np.array( ... )
P =np.array( ... )
E =np.array( ... )

T=1000
Z, X = generate_HMM(P, pi, E, T=T)

fn='obs_group1.txt'  # choose a group-specific filename

# write ######
with open(fn, 'w') as f:
  m = map(str, X)            # convert numbers to strings
  f.write(' '.join(list(m))) 

# read (for later) ######
with open(fn, "r") as f:
  line  = f.readline().split()  # read first line and split
Xr = list(map(np.int64, line))  # map line to np.int64

#print('X =',*X)
#print('Xr=',*Xr)
np.all(X==Xr)

X = 1 1 0 1 0 1 0 0 1 0 1
Xr= 1 1 0 1 0 1 0 0 1 0 1


True

# Applications: What is this all good for?

- Speech Recognition
- DNA sequence analysis: sequence segmentation
- Robot Location


# Typical Problems

Model parameters: $\Theta$, Observations: $X$, Hidden States: $Z$

- scoring (observations): $Pr(X) \longrightarrow$ **Forward Algorithm**

- decoding (hidden variables):
    - best posterior state: $\arg\max_i Pr(Z_t=i|X) \longrightarrow$  **Forward-Backward Algorithm**
    - best state sequence: $\arg\max_Z Pr(Z|X) \longrightarrow$ **Viterbi Algorithm**
    
- learning (model parameters): $\arg\max_\Theta Pr(X|\Theta) \longrightarrow$ **Baum-Welch Algorithm**
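
As a brief preview of the scoring problem, here is a minimal sketch of the forward recursion in its textbook form, $\alpha_t(j) = \sum_i \alpha_{t-1}(i) P_{ij} E_{j x_t}$ with $\alpha_1(i) = \pi_i E_{i x_1}$. The function name `forward_prob` is hypothetical, and the sketch omits the rescaling that is usually needed to avoid numerical underflow for long sequences.

```python
import numpy as np

def forward_prob(X, pi, P, E):
    """Pr(X): sum over all hidden paths, computed in O(T * N^2) steps."""
    alpha = pi * E[:, X[0]]               # alpha_1(i) = Pr(X_1, Z_1=i)
    for x in X[1:]:
        alpha = (alpha @ P) * E[:, x]     # propagate one step, then weight by emission
    return alpha.sum()                    # marginalize over the final state

pi = np.array([0.75, 0.25])
P = np.array([[0.8, 0.2], [0.1, 0.9]])
E = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])

prob_X = forward_prob((0, 0, 2), pi, P, E)
print('Pr(X) =', prob_X)
```

The result agrees with the brute-force sum of $Pr(X,Z)$ over all hidden paths, but the cost grows only linearly with the sequence length.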