### Assignment 5
_Jiajie Lu_

In [1]:
import numpy as np
import pandas as pd

#### Problem 1
>Consider the chess game from Assignment 2. Assume that if you play timid your probability of making a draw is $p = 0.8$ and the probability to win is the same as the probability to lose.

(a) Solve the problem again with this new probability setting.

**Solution:**
* Transition Probability
$$
\begin{array}{lll}
P(x_{t+1}=x_t+0|x_t,u=0)=0.8, &P(x_{t+1}=x_t-1|x_t,u=0)=0.1, &P(x_{t+1}=x_t+1|x_t,u=0)=0.1,\\
P(x_{t+1}=x_t+0|x_t,u=1)=0, &P(x_{t+1}=x_t-1|x_t,u=1)=0.55, &P(x_{t+1}=x_t+1|x_t,u=1)=0.45.
\end{array}
$$
In Assignment 2, once we entered the "sudden death" phase we had no choice but to play bold, because timid play could never win. Here that argument no longer applies, since timid play now wins with positive probability.

Therefore we cannot simply set our $v_6(\cdot)$ as before. Instead, we treat the "sudden death" phase as an infinite-horizon dynamic programming problem in which every state except $0$ is absorbing, and we find its optimal value function and optimal policy. We set it up as follows:
* State Space:
$$
\mathcal{S}=\{-1,0,1\}
$$
representing "lose", "draw" and "win" in the "sudden death" phase.
* Control Space:
$$
\{0:\text{timid}, 1:\text{bold}\}
$$
* Transition Probability:
$$
p^{0}_{0,\cdot}=\begin{bmatrix}
0.1&0.8&0.1
\end{bmatrix},\qquad 
p^{1}_{0,\cdot}=\begin{bmatrix}
0.55&0&0.45
\end{bmatrix}
$$
* Dynamic Programming Equation:
$$
\begin{array}{ll}
&v_6^*(0)=\max\{\sum_{s\in\mathcal{S}}p^0_{0,s}v_6^*(s), \sum_{s\in\mathcal{S}}p^1_{0,s}v_6^*(s)\},\\
&v_6^*(1)=1,\quad v_6^*(-1)=0
\end{array}
$$
Here $v_6^*(\cdot)$ denotes the optimal value function of the "sudden death" phase, which takes the place of the $v_6(\cdot)$ used previously.

Let
$$
\begin{array}{lll}
h&=&\sum_{s\in\mathcal{S}}p^0_{0,s}v_6^*(s)\\
&=&0.1\cdot0+0.8\cdot v_6^*(0)+0.1\cdot1\\
&=&0.8v_6^*(0)+0.1
\end{array}
$$
and
$$
\begin{array}{lll}
g&=&\sum_{s\in\mathcal{S}}p^1_{0,s}v_6^*(s)\\
&=&0.55\cdot0+0\cdot v_6^*(0)+0.45\cdot1\\
&=&0.45
\end{array}
$$
If we play timid at state $s=0$, then we have
$$
h=0.8v_6^*(0)+0.1=v_6^*(0)\implies v_6^*(0)=0.5.
$$
If we play bold at state $s=0$, then we have
$$
g=0.45=v_6^*(0)<0.5.
$$
That shows playing timid is the optimal policy for "sudden death". Then we can set up
$$
v_6(x)=\left\{
\begin{array}{ll}
1, &x>0,\\
0.5, &x=0,\\
0, &x<0.
\end{array}
\right.
$$
The rest of the problem is then solved as we did previously, with this $v_6(\cdot)$ in place of the old one.
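
The fixed point for the "sudden death" phase can also be checked numerically. The following is a minimal sketch, not part of the original solution: the function name `sudden_death_value` and its parameters are illustrative. It iterates the dynamic programming equation for $v_6^*(0)$ while holding $v_6^*(-1)=0$ and $v_6^*(1)=1$ fixed.

```python
def sudden_death_value(p_timid, p_bold, tol=1e-12, max_iter=10_000):
    """Iterate v(0) = max_u sum_s p^u_{0,s} v(s) with v(-1) = 0 and v(1) = 1.

    p_timid and p_bold are (lose, draw, win) probabilities from state 0.
    Returns the value at state 0 and the greedy control (0: timid, 1: bold).
    """
    v0 = 0.0
    for _ in range(max_iter):
        h = p_timid[1] * v0 + p_timid[2]  # timid: the losing branch contributes 0
        g = p_bold[1] * v0 + p_bold[2]    # bold: the losing branch contributes 0
        v_new = max(h, g)
        converged = abs(v_new - v0) < tol
        v0 = v_new
        if converged:
            break
    # Greedy control at the (approximately) converged value.
    h = p_timid[1] * v0 + p_timid[2]
    g = p_bold[1] * v0 + p_bold[2]
    return v0, (0 if h >= g else 1)


value, control = sudden_death_value((0.1, 0.8, 0.1), (0.55, 0.0, 0.45))
print(round(value, 6), control)
```

With the probabilities of part (a), the iteration should approach the value $0.5$ derived above and select the timid control.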

(b) Investigate how your strategy changes when the probability of winning changes in the case of a timid play. The probability of winning can range from 0 to 0.15.

Let $p$ now denote the probability of winning when playing timid, with the draw probability still $0.8$, so that the probability of losing is $0.2-p$. With the preceding notation, we obtain
$$
h=(0.2-p)\cdot0+0.8v_6^*(0)+p\cdot1=0.8v_6^*(0)+p,
$$
and $g=0.45$.
If playing timid at $s=0$, then we can obtain
$$
h=0.8v_6^*(0)+p=v_6^*(0)\implies  v_6^*(0)=5p.
$$
and it's clear that $v_6^*(0)=0.45$ when playing bold at $s=0$.
Hence, if
$$
5p\ge 0.45\iff p\ge0.09,
$$
playing timid is optimal at $s=0$; otherwise playing bold is optimal.
Under the constraint $0\le p\le 0.15$, the optimal policy $\pi^*$ for the "sudden death" phase is therefore
$$
\pi^*=\left\{
\begin{array}{ll}
\text{timid}, &0.09\le p\le 0.15,\\
\text{bold}, &0\le p< 0.09.
\end{array}
\right.
$$
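
The threshold can be illustrated by sweeping $p$ over the allowed range. This is a small sketch under the same assumptions as the derivation (draw probability $0.8$ for timid play, win probability $0.45$ for bold play); the helper `sudden_death_policy` is a hypothetical name introduced only for this illustration, and it covers the "sudden death" phase only.

```python
def sudden_death_policy(p_win, p_draw=0.8, bold_win=0.45):
    """Compare the timid fixed point p_win / (1 - p_draw) with the bold value."""
    timid_value = p_win / (1.0 - p_draw)  # solves v = p_draw * v + p_win
    if timid_value >= bold_win:
        return "timid", timid_value
    return "bold", bold_win


for p in [0.0, 0.03, 0.06, 0.08, 0.10, 0.12, 0.15]:
    policy, value = sudden_death_policy(p)
    print(f"p = {p:.2f}: {policy:5s}  v_6*(0) = {value:.3f}")
```

The grid deliberately avoids $p=0.09$, where both controls give the same value and floating-point rounding could decide the comparison either way.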

#### Problem 2
>We have a tree farm. At any time, the size $s$ of a tree is one of 0, 1, 2, 3, 4, where 0 means that the tree has died, and 4 is the size of a mature tree. We need to decide when to harvest a given tree. Each year it costs about $10+s$ dollars to maintain a tree, and it costs $30+5s$ dollars to harvest a tree. The sale price of a tree of each size is as follows:

| Tree Size | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- |
| Sale Price | 150 | 180 | 210 | 260 |

>The transition probability matrix for the size of the tree is as follows:

| sizes | 0 | 1 |  2 |  3 |   4 |
| :--: |:--:|:--:|:--:|:--:|:--:|
|0| 1| 0 |   0 |  0 |   0 |
|1| 0.05|  0.15 | 0.7 | 0.1 |  0|
|2|0.05 |  0 |  0.2 | 0.7 | 0.05|
|3|0.05 |  0 |   0 |  0.5 | 0.45|
|4|0.05 |  0 |   0 |  0 |  0.95|

(a) Describe a dynamic programming problem to determine an optimal harvesting policy.

**Solution:**
* State space : $\{0,1,2,3,4\}$
* Control space : $\{0:\text{maintain},1:\text{harvest}\}$
* Transition probability : 
$$
p^1_{ij}=\left\{
\begin{array}{ll}
1, &j=0,\\
0, &j\ne0.
\end{array}
\right.,\qquad u=1,
$$

$$
p^0_{ij}=P_{ij},\qquad u=0,
$$
where $P$ is the given transition probability matrix.
* Reward function :
$$
r(s,u)=\left\{
\begin{aligned}
-10-s && u=0\\
-30-5s+K_s && u=1
\end{aligned}
\right.
$$
where $K_s$ is the sale price of a tree of size $s$.
* Dynamic programming equation :
$$
\begin{aligned}
v^*(s)&=\max\{r(s,0)+\sum_{i=0}^{4}P_{si}v^*(i),r(s,1)\},\quad s=1,2,3,4,\\
v^*(0)&=0
\end{aligned}
$$
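
As a worked illustration of this equation (added here as a check, using only the prices and costs from the problem statement), the net rewards for harvesting are
$$
\begin{aligned}
r(1,1)&=-30-5\cdot1+150=115,\\
r(2,1)&=-30-5\cdot2+180=140,\\
r(3,1)&=-30-5\cdot3+210=165,\\
r(4,1)&=-30-5\cdot4+260=210.
\end{aligned}
$$
For a mature tree the equation reads $v^*(4)=\max\{-14+0.95\,v^*(4),\,210\}$, since $P_{40}=0.05$, $P_{44}=0.95$ and $v^*(0)=0$. If the maximum were attained by maintaining, we would get $v^*(4)=-14/0.05=-280<210$, which is a contradiction, so $v^*(4)=210$ and a mature tree should be harvested.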

(b) Solve the problem numerically. What numerical methods are applicable to this problem and why?

**Solution:**<br>
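The state and control spaces are finite, and from every state the tree reaches the absorbing state $0$ with probability at least $0.05$ per year under either control, so the undiscounted problem is well posed and value iteration converges. Linear programming is also applicable. Here I apply value iteration only.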

In [2]:
# transition probability matrix (row: current size, column: next size)
P = np.array(
    [
        [1,  0,  0,  0,  0],
        [0.05, 0.15, 0.7, 0.1, 0],
        [0.05, 0, 0.2, 0.7, 0.05],
        [0.05, 0, 0, 0.5, 0.45],
        [0.05, 0, 0, 0, 0.95]
    ]
)

# sale price mapping
K = {1:150, 2:180, 3:210, 4:260}

# reward function: u = 0 maintains the tree, u = 1 harvests it
def reward(s,u):
    if u == 0:
        return -10-s
    if u == 1:
        return -30-5*s+K[s]

In [3]:
def value_iteration_tree(T, max_iter=1000, tol=0.0001):
    # T is the number of states
    # initialize the value function; v_old starts far from v so the loop runs
    v = np.zeros(T)
    v_old = np.zeros(T) + 10000
    # initialize the policy (0: maintain, 1: harvest)
    p = np.zeros(T)
    
    current_step = 0
    
    # value iteration method
    while (np.linalg.norm(v - v_old) > tol) and (current_step < max_iter):
        v_old = v.copy()
        current_step += 1
        
        for s in range(1, T):
            v0 = reward(s,0) + np.sum([
                P[s, i]*v_old[i] for i in range(T)
            ])
            v1 = reward(s, 1)
            v[s], p[s] = np.max([v0, v1]), np.argmax([v0, v1])
            
    return v, p, current_step

v, p, iters = value_iteration_tree(5)
print(f"Within {iters} iterations,")
print("We got optimal value function:")
print(v)
print("And optimal policy")
print(p)

Within 10 iterations,
We got optimal value function:
[  0.         123.8235125  142.49999872 165.         210.        ]
And optimal policy
[0. 0. 0. 1. 1.]
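
The result can be cross-checked with policy iteration, which is a different method from the value iteration used above and is shown here only as a sketch. The names `price`, `r_maintain`, `r_harvest`, `evaluate` and `policy_iteration` are illustrative and do not appear in the original cells; the data are copied from the problem statement. Each policy is evaluated exactly by solving a linear system on states 1 to 4, using $v(0)=0$.

```python
import numpy as np

# Data copied from the problem statement.
P = np.array([
    [1.00, 0.00, 0.0, 0.0, 0.00],
    [0.05, 0.15, 0.7, 0.1, 0.00],
    [0.05, 0.00, 0.2, 0.7, 0.05],
    [0.05, 0.00, 0.0, 0.5, 0.45],
    [0.05, 0.00, 0.0, 0.0, 0.95],
])
price = np.array([0.0, 150.0, 180.0, 210.0, 260.0])  # index 0 is a dead tree
sizes = np.arange(5)
r_maintain = -10.0 - sizes
r_harvest = -30.0 - 5.0 * sizes + price


def evaluate(policy):
    """Solve v = r_pi + P_pi v on states 1..4, with v(0) = 0."""
    n = len(policy)
    A = np.eye(n - 1)
    b = np.empty(n - 1)
    for s in range(1, n):
        if policy[s] == 0:       # maintain: v(s) - sum_i P[s, i] v(i) = r(s, 0)
            A[s - 1] -= P[s, 1:]
            b[s - 1] = r_maintain[s]
        else:                    # harvest: v(s) = r(s, 1)
            b[s - 1] = r_harvest[s]
    v = np.zeros(n)
    v[1:] = np.linalg.solve(A, b)
    return v


def policy_iteration(max_iter=50):
    policy = np.ones(5, dtype=int)  # start from "always harvest"
    policy[0] = 0                   # state 0 is absorbing; its control is irrelevant
    for _ in range(max_iter):
        v = evaluate(policy)
        q_maintain = r_maintain + P @ v
        new_policy = np.where(q_maintain > r_harvest + 1e-12, 0, 1)
        new_policy[0] = 0
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return v, policy


v_pi, policy_pi = policy_iteration()
print(np.round(v_pi, 4))
print(policy_pi)
```

Because the evaluation step is exact, the values for sizes 1 and 2 may differ slightly in the last printed digits from the value iteration output, which stops at a tolerance.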
