# Tutorial 1 - Optimal Control for Discrete State

Please execute the cell below to initialize the notebook environment.


In [1]:
import numpy as np                 # import numpy
import scipy               # import scipy
import random                      # import basic random number generator functions
from scipy.linalg import inv

import matplotlib.pyplot as plt    # import matplotlib


---

## Tutorial objectives

In this tutorial, we will implement a binary HMM task.


---

## Task Description

There are two boxes. Each box can be in a high-rewarding state ($s=1$), in which a reward is delivered with high probability $q_{high}$, or in a low-rewarding state ($s=0$), in which a reward is delivered with low probability $q_{low}$.
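As a minimal sketch of the reward rule just described, the snippet below draws a single Bernoulli reward from a box given its state. The function name `sample_reward`, the generator `rng`, and the example values for $q_{high}$ and $q_{low}$ are illustrative assumptions, not part of the tutorial code.

```python
import numpy as np

def sample_reward(state, q_high, q_low, rng):
    """Draw a reward (0 or 1) from a box in the given state (hypothetical helper).

    state == 1 is the high-rewarding state, state == 0 the low-rewarding one.
    """
    q = q_high if state == 1 else q_low
    # rng.random() is uniform on [0, 1), so this is a Bernoulli(q) draw
    return int(rng.random() < q)

rng = np.random.default_rng(0)
# Example values, assumed for illustration only
reward_from_high_box = sample_reward(1, q_high=0.9, q_low=0.1, rng=rng)
reward_from_low_box = sample_reward(0, q_high=0.9, q_low=0.1, rng=rng)
```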

The states of the two boxes are latent. At any given time, exactly one of the boxes is in the high-rewarding state, and the other is in the low-rewarding state. At each time step, the two boxes swap states with probability $p_{sw}$.
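The switching dynamics can be sketched as a two-state Markov chain. The code below simulates the latent state of the left box and derives the right box as its complement. The function name `simulate_box_states`, the uniform initial state, and the example values of `T` and `p_sw` are assumptions made for illustration.

```python
import numpy as np

def simulate_box_states(T, p_sw, rng):
    """Simulate the latent state of the left box for T time steps (hypothetical helper)."""
    states = np.zeros(T, dtype=int)
    states[0] = rng.integers(0, 2)        # assumption: initial state is 0 or 1 with equal probability
    switches = rng.random(T) < p_sw       # True where the boxes swap states
    for t in range(1, T):
        states[t] = 1 - states[t - 1] if switches[t] else states[t - 1]
    return states

rng = np.random.default_rng(0)
left = simulate_box_states(T=1000, p_sw=0.1, rng=rng)
right = 1 - left   # the boxes are anti-correlated, so the right box is the complement
```

Because only one box can be high-rewarding at a time, a single array is enough to describe both boxes.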

![alt text](switching.png "Title")


The agent may stay at one site for some time. As the agent accumulates evidence about the state of the box at that site, it may choose to stay or to switch to the other site, paying a switching cost $c$. The agent maintains a belief about the states of the boxes, which is the posterior probability of the state being high-rewarding given all past observations. For the belief about the state of the left box, we have

$$b(s_t) = p(s_t = 1 | o_{0:t}, l_{0:t}, a_{0:t-1})$$

where $o$ is the observation of whether a reward was obtained, $l$ is the location of the agent, and $a$ is the action of staying ($a=0$) or switching ($a=1$).

Since the two boxes are completely anti-correlated, i.e. only one of the boxes is high-rewarding at any given time and the other is low-rewarding, the beliefs about the two boxes sum to 1. As a result, we only need to track the belief about one of the boxes.
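One way to maintain this belief is a standard HMM forward (Bayes filter) step: first propagate the previous belief through the switching dynamics, then reweight it by the likelihood of the new observation. The sketch below assumes that the observation comes from the box at the agent's current location and encodes that location with a hypothetical boolean `at_left`; the function name `update_belief` and the example numbers are likewise illustrative and not taken from the tutorial.

```python
def update_belief(b_prev, o, at_left, p_sw, q_high, q_low):
    """One forward-filter step for the belief that the left box is high-rewarding.

    b_prev  : previous belief p(s_{t-1} = 1 | past)
    o       : observation at the current location (1 = reward, 0 = no reward)
    at_left : True if the agent is at the left box (assumed encoding)
    """
    # Prediction: the boxes may have swapped states since the last step
    b_pred = b_prev * (1 - p_sw) + (1 - b_prev) * p_sw

    # Reward probability at the agent's location under each hypothesis
    q_if_left_high = q_high if at_left else q_low
    q_if_left_low = q_low if at_left else q_high

    # Bernoulli likelihood of the observation under each hypothesis
    lik_high = q_if_left_high if o == 1 else 1 - q_if_left_high
    lik_low = q_if_left_low if o == 1 else 1 - q_if_left_low

    # Bayes rule
    numerator = lik_high * b_pred
    return numerator / (numerator + lik_low * (1 - b_pred))

# A reward observed at the left box raises the belief about the left box
b_new = update_belief(0.5, o=1, at_left=True, p_sw=0.1, q_high=0.8, q_low=0.2)
```

Since the two beliefs sum to 1, the belief about the right box is simply `1 - b_new`.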

The policy of the agent depends on a threshold on beliefs. When the belief about the box on the other side exceeds the threshold $\theta$, the agent switches to the other side. In other words, the agent chooses to switch when it is confident enough that the other side is high-rewarding.
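The threshold rule can be written in a few lines. The sketch below assumes the agent tracks the belief about the left box and that its location is given by a hypothetical boolean `at_left`; the name `threshold_policy` is illustrative and may differ from the function used in the exercise.

```python
def threshold_policy(belief_left, at_left, theta):
    """Return the action: 1 to switch sides, 0 to stay (illustrative sketch).

    belief_left : belief that the left box is high-rewarding
    at_left     : True if the agent is currently at the left box (assumed encoding)
    theta       : belief threshold for switching
    """
    # Belief that the box on the other side is high-rewarding
    belief_other = 1.0 - belief_left if at_left else belief_left
    return 1 if belief_other > theta else 0

# At the left box with a low belief about the left box: the other side looks better
action = threshold_policy(belief_left=0.1, at_left=True, theta=0.8)
```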

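The value function can be defined as the net reward (rewards minus switching costs) accumulated during a single trial; dividing it by the trial length gives the reward rate.

$$v(\theta) = \sum_t \left( r_t - c\cdot 1_{a_t = 1} \right)$$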

We would like to see how the value function depends on the threshold.
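Given the rewards and actions recorded during one trial, the value defined above can be computed directly, with the sum taken over both the reward and the cost term. The function name `trial_value` and the example arrays are assumptions for illustration.

```python
import numpy as np

def trial_value(rewards, actions, cost):
    """Sum of rewards minus switching costs over one trial (illustrative sketch).

    rewards : sequence of r_t (1 = reward obtained, 0 = no reward)
    actions : sequence of a_t (1 = switch, 0 = stay)
    cost    : switching cost c
    """
    rewards = np.asarray(rewards, dtype=float)
    actions = np.asarray(actions)
    # The indicator 1_{a_t = 1} is the boolean mask (actions == 1)
    return float(np.sum(rewards - cost * (actions == 1)))

# Three rewards and one switch at cost 0.5
v = trial_value(rewards=[1, 0, 1, 1], actions=[0, 1, 0, 0], cost=0.5)
```

Dividing the result by the number of time steps gives a per-step reward rate, which makes trials of different lengths comparable.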

### Exercise 1: Control for binary HMM
In this exercise, we generate the dynamics for the binary HMM task described above.

In [None]:
# This function is the policy based on threshold