## Assignment 2
_Jiajie Lu_

In [1]:
import numpy as np
import pandas as pd

### Problem 1
>Assume that you play a chess match with a friend.  If you play timid your probability of making a draw is p=0.9, the probability to win is 0 and the probability to lose is 0.1.  If you play bold you either win with probability q=0.45, or you lose. Each win brings one point to the score of the winner. The match consists of 5 games. If the score is a tie after the fifth game, then a“sudden death”rule is adopted; that is, whoever wins the next game is a winner of the match; if it is a draw, then the game is repeated with the same rule
<br><br>
Formulate a Markov decision problem to determine the optimal strategy of your play (to maximizethe probability of winning the match) and solve it. Clearly describe the state space, control space,transition probabilities, and the reward function.

**Solution**

* State space $\{-5,-4,-3,-2,-1,0,1,2,3,4,5\}$ $x_t$ - current advantages
* Control space $\{0,1\}$ where $0=\text{'timid'}$, $1=\text{'bold'}$
* Transition probability:
$$
P(x_{t+1}=x_t|x_t,u=0)=0.9\\
P(x_{t+1}=x_t-1|x_t,u=0)=0.1\\
P(x_{t+1}=x_t-1|x_t,u=1)=0.55\\
P(x_{t+1}=x_t+1|x_t,u=1)=0.45\\
P(x_{t+1}=y|x_t,u_t)=0 , \text{other }y\in\mathcal{X}
$$
* Reward function
$$r(x_t,u_t)=0$$
* Dynamic Programming Equation
$$
\begin{aligned}
v_t(x)&=\max\{v_{t+1}(x)P(x|x,0)+v_{t+1}(x-1)P(x-1|x,0),v_{t+1}(x-1)P(x-1|x,1)+v_{t+1}(x+1)P(x+1|x,1)\}\\
v_N(x)&=\left\{
\begin{aligned}
1&&x>0\\
0.45&&x=0\\
0&&x<0
\end{aligned}
\right.\\
t&=1,2,3,4,5,N\ge6
\end{aligned}
$$

When two players turn into sudden death, the best choice for me is undoubtedly to play "bold". So we just need to optimize the finite states in regular game.

In [2]:
def chess_match():
    # value function
    v_mat = np.zeros((11, 6))
    # policy matrix
    u_mat = np.zeros((11, 6))
    # prob transition matrix
    p_mat = np.zeros((3, 2))
    
    # u=0
    p_mat[0][0] = 0.1
    p_mat[1][0] = 0.9
    # u=1
    p_mat[0][1] = 0.55
    p_mat[2][1] = 0.45
    # set v(6)
    for idx in range(11):
        if idx < 5:
            v_mat[idx][0] = 0
        elif idx > 5:
            v_mat[idx][0] = 1
        else:
            v_mat[idx][0] = 0.45
    # policy for "sudden death" 1-timid, 2-bold, 3-any
    u_mat[5][0] = 2
    # DPE
    for t in range(1, 6):
        for x in range(t, 11-t):
            # value function
            vs = [v_mat[x, t-1]*p_mat[1, 0] + v_mat[x-1, t-1]*p_mat[0, 0],
                     v_mat[x-1, t-1]*p_mat[0, 1] + v_mat[x+1, t-1]*p_mat[2, 1]]
            v_mat[x][t] = round(np.max(vs), 5)
            # policy
            u_mat[x][t] = np.argmax(vs) + 1
            # if vs[0] = vs[1], then any policy is fine
            if vs[0] == vs[1]:
                u_mat[x][t] = 3
            
    return v_mat, u_mat

# show values
def show_table_1(v):
    v = pd.DataFrame(v, index=[f"x={x-5}" for x in range(11)])
    
    v.columns = [f"t={6 - idx}" for idx in range(6)]
    
    return v
    

v, u = chess_match()
print("Value function matrix: ")
display(show_table_1(v))
print("Policy Matrix: 1-Timid, 2-Bold, 3-Any")
display(show_table_1(u))

Value function matrix: 


Unnamed: 0,t=6,t=5,t=4,t=3,t=2,t=1
x=-5,0.0,0.0,0.0,0.0,0.0,0.0
x=-4,0.0,0.0,0.0,0.0,0.0,0.0
x=-3,0.0,0.0,0.0,0.0,0.0,0.0
x=-2,0.0,0.0,0.09113,0.09113,0.0,0.0
x=-1,0.0,0.2025,0.2025,0.2916,0.28158,0.0
x=0,0.45,0.45,0.53662,0.51435,0.5472,0.52616
x=1,1.0,0.945,0.8955,0.85961,0.82508,0.0
x=2,1.0,1.0,0.9945,0.9846,0.0,0.0
x=3,1.0,1.0,1.0,0.0,0.0,0.0
x=4,1.0,1.0,0.0,0.0,0.0,0.0


Policy Matrix: 1-Timid, 2-Bold, 3-Any


Unnamed: 0,t=6,t=5,t=4,t=3,t=2,t=1
x=-5,0.0,0.0,0.0,0.0,0.0,0.0
x=-4,0.0,3.0,0.0,0.0,0.0,0.0
x=-3,0.0,3.0,3.0,0.0,0.0,0.0
x=-2,0.0,3.0,2.0,2.0,0.0,0.0
x=-1,0.0,2.0,2.0,2.0,2.0,0.0
x=0,2.0,2.0,2.0,2.0,2.0,2.0
x=1,0.0,1.0,1.0,1.0,1.0,0.0
x=2,0.0,3.0,1.0,1.0,0.0,0.0
x=3,0.0,3.0,3.0,0.0,0.0,0.0
x=4,0.0,3.0,0.0,0.0,0.0,0.0


### Problem 2
>A software manufacturer can be in one of two states. In state 1 their software sells well, and in state 2, the product sells poorly.  While in state 1, the company can invest in development of upgraded version of the software, in which case the one-stage reward is 4 units, and the probability of degrading to state 2 is 0.2.  If no investment in new development occurs, then the reward is 6 units, but the probability of transition to state 2 is 0.5.  While in state 2, if the company invests in software development, then the reward is -2 units, but the probability of transition to state 1 is 0.7.Without special efforts to improve, the reward is 1 and the probability of upgrading to state 1 is 0.1.Formulate a dynamic programming problem to determine an optimal reserch and development policy. Solve the problem for a time horizon of 12 time intervals.

**Solution**
* State space $\{1:\text{"good"},2:\text{"bad"}\}$
* Control space $\{0: \text{"not invest"}, 1:\text{"invest"}\}$
* Transition probability
$$
P(0)=\begin{pmatrix}
0.5&&0.5\\
0.1&&0.9
\end{pmatrix},
P(1)=\begin{pmatrix}
0.8&&0.2\\
0.7&&0.3
\end{pmatrix}
$$
* Reward function
$$
r(x,u)=\left\{
\begin{aligned}
6&&x=1,u=0\\
4&&x=1,u=1\\
1&&x=2,u=0\\
-2&&x=2,u=1
\end{aligned}
\right.
$$
* Dynamic Programming Equation
$$
v_t(x)=\max_{u\in \{0,1\}}\{r(x,u)+P(1|x,u)v_{t+1}(1)+P(2|x,u)v_{t+1}(2)\}\\
t=1,2,...,T\\
v_{T+1}(x)=0
$$

In [3]:
def software_upgrade(T):
    # T - time horizon
    # transition probability
    p_mat = np.array([
        [[0.5, 0.5], [0.1, 0.9]],
        [[0.8, 0.2], [0.7, 0.3]]
    ])
    # reward function matrix
    r_mat = np.array([
        [6, 4], [1, -2]
    ])
    # value function matrix
    v_mat = np.zeros((2, T+1))
    # policy mat
    u_mat = np.zeros((2, T))
    
    # iterate to calculate value function
    for t in range(T):
        for x in [0, 1]:
            vs = [r_mat[x, u] + p_mat[u, x, 0]*v_mat[0, T-t] + p_mat[u, x, 1]*v_mat[1, T-t] 
                  for u in [0, 1]]
            v_mat[x, T-1-t] = round(np.max(vs),5)
            u_mat[x, T-1-t] = np.argmax(vs)
            
    return v_mat[:,:T], u_mat

def show_table_2(v):
    return pd.DataFrame(v, index=["x=1", "x=2"],
                       columns = [f"t={idx+1}" for idx in range(12)])

v, u = software_upgrade(12)
print("The value functions are:")
display(show_table_2(v))
print("The policy matrix is:")
display(show_table_2(u))

The value functions are:


Unnamed: 0,t=1,t=2,t=3,t=4,t=5,t=6,t=7,t=8,t=9,t=10,t=11,t=12
x=1,36.09261,33.42594,30.75927,28.0926,25.42593,22.75926,20.0926,17.426,14.76,12.1,9.5,6.0
x=2,29.42594,26.75927,24.0926,21.42593,18.75926,16.09259,13.4259,10.759,8.09,5.4,2.5,1.0


The policy matrix is:


Unnamed: 0,t=1,t=2,t=3,t=4,t=5,t=6,t=7,t=8,t=9,t=10,t=11,t=12
x=1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0
x=2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0


### Problem 3
>Consider the equipment replacement problem discussed in class, with the following data:
* operating cost per period $c_0+c_1x$,$x=0,1,2,...$;
* revenue per period $R$;
* replacement cost $K$;
* salvage value $\gamma Ke^{−\mu x}$; 
where $c_0>0,c_1>0,c_2>0, 0<\gamma<1$, and $\mu>0$. <br>
Assume that the salvage value can be collected whenever the item is replaced. The probabilities of deterioration by $j$ steps in one period are given by the Poisson distribution
$$
p_j=\frac{\lambda^j}{j!}e^{-\lambda}, j=0,1,2,\ldots
$$<br>  
(3.1)Formulate the corresponding Markov decision problem. Clearly define the state space, action space, transition probabilities, and the reward function.<br>
(3.2)  Solve the problem numerically for $c_0=1,c_1=1,R=5,K=10,\gamma=0.8,\mu=0.2,\lambda=1$,and time horizon $T=20$.  To this end, argue that a constant value $\bar{x}$ exists such that for all $x\ge\bar{x}$ replacement is always profitable.  Then you will know that the value function for all $x\ge\bar{x}$ is the same as at $\bar{x}$. This will allow you have finite tables of the value function for each time $t$.

**Solution**
<br>(3.1)
* State space  $\{0,1,2,3,\ldots\}$ where $0 = \text{"new"}$ and $n>0 = \text{"level of damage"}$
* Control space $\{0,1\}$ where $0 = \text{"continue"}$ and $1 = \text{"renew"}$
* Transition probability
$$
P(x_{t+1}=k|x_t=i,0)=\left\{
\begin{aligned}
p_{k-i}&&i<k\\
0&& \text{otherwise}
\end{aligned}
\right.
$$
$$
P(x_{t+1}=k|x_t=i,1)=p_k,k=0,1,2,3,...
$$
where $p_j=\frac{\lambda^j}{j!}e^{-\lambda}$
* Reward function
$$
r_t(x,u)=\left\{
\begin{aligned}
R-c_0-c_1x&&u=0\\
R-K+\gamma Ke^{-\mu x}-c_0&&u=1
\end{aligned}
\right.
$$
* Dynamic Programming Equation
$$
v^*_t(x)=\max\{r(x,1)+\sum_{j=0}^\infty p_jv_{t+1}^*(j), r(x,0)+\sum_{j=0}^\infty p_jv_{t+1}^*(x+j) \}
$$

(3.2)<br>
$r(x,0)$ is decreasing since $c_1>0$; $r(x,1)$ is also decreasing since $\mu>0$ and $\gamma K>0$. As we benifit less if the state of device is larger (worse), $v^*_{t+1}(\cdot )$ is decreasing with respect to $x$ by induction assumption. 
Let
$$
\begin{aligned}
h(x)&=r(x,1)+\sum_{j=0}^\infty p_jv_{t+1}^*(j)\\
g(x)&=r(x,0)+\sum_{j=0}^\infty p_jv_{t+1}^*(x+j)
\end{aligned}
$$
It's obvious that $h$ and $g$ are decreasing. <br>
Thus for $x>y$, we can obtain
$$
\max\{h(x),g(x)\}\le\max\{h(y),g(y)\}
$$
That means $v_t^*(x)$ is nonincreasing. <br><br>
With the help of the monotonicity of $v_t^*$, one can imply
$$
\sum_{j=0}^\infty p_jv_{t+1}^*(j)\ge\sum_{j=0}^\infty p_jv_{t+1}^*(x+j)
$$
We can obtain there exists $x\ge9$ satisfying
$$
R-K+\gamma Ke^{-\mu x}-c_0\ge R-c_0-c_1x \iff c_1x+\gamma Ke^{-\mu x}\ge K
$$
where $c_0=1,c_1=1,R=5,K=10,\gamma=0.8,\mu=0.2,\lambda=1$. Let $\bar x=9$, then we can imply for any $x\ge\bar x$, 
$$
v_t^*(x)=r(x,1)+\sum_{j=0}^\infty p_jv_{t+1}^*(j)
$$
Then the state space can be restricted into $\mathcal{X}=\{0,1,2,3,\ldots,9\}$.

For finite time horizion $T=20$, we can rewrite the dynamic programming equation as
$$
v^*_t(x)=\max\{r(x,1)+\sum_{j=0}^9 p_jv_{t+1}^*(j), r(x,0)+\sum_{j=0}^9 p_jv_{t+1}^*(x+j) \}\\
t=1,2,\ldots,20\\
v^*_{21}(x)=0
$$

In [4]:
c0, c1 = 1, 1
R, K = 5, 10
gamma, mu, lamda = 0.8, 0.2, 1

# reward function
def reward(x, u):
    if u == 0:
        return R - c0 - c1*x
    if u== 1:
        return R - c0 - K + gamma*K*np.exp(-mu*x)

def device_replacement(T):
    # T - time horizon
    # transition probability list
    p = [(lamda^j)/np.math.factorial(j)*np.exp(-lamda) for j in range(10)]
    # value function matrix
    v_mat = np.zeros((19, T+1))
    # policy matrix
    u_mat = np.zeros((10, T+1))
    
    for t in range(T):
        for x in range(19):
            s1 = np.sum([p[j]*v_mat[j, T-t] for j in range(10)])
            if x < 10:
                s2 = np.sum([p[j]*v_mat[x+j, T-t] for j in range(10)])
                vs = [reward(x, 1) + s1, reward(x, 0) + s2]
                v_mat[x, T-1-t] = round(np.max(vs),1)
                u_mat[x, T-1-t] = np.argmax(vs)
            else:
                v_mat[x, T-1-t] = reward(x, 1) + s1
                
    return v_mat[:10, :T], u_mat[:, :T]

def show_table_3(v):
    return pd.DataFrame(v, index=[f"x={x}" for x in range(10)],
                       columns = [f"t={x+1}" for x in range(20)])

v, u = device_replacement(20)
display(show_table_3(v))
display(show_table_3(u))

Unnamed: 0,t=1,t=2,t=3,t=4,t=5,t=6,t=7,t=8,t=9,t=10,t=11,t=12,t=13,t=14,t=15,t=16,t=17,t=18,t=19,t=20
x=0,76.4,67.2,59.0,51.9,45.6,40.0,35.1,30.8,27.0,23.7,20.7,18.1,15.8,13.8,12.1,10.5,9.1,7.9,6.7,4.0
x=1,73.7,64.5,56.4,49.2,42.9,37.4,32.5,28.2,24.4,21.0,18.0,15.4,13.2,11.2,9.4,7.8,6.5,5.3,4.5,3.0
x=2,71.7,62.5,54.4,47.2,40.9,35.4,30.5,26.2,22.4,19.0,16.1,13.5,11.2,9.2,7.4,5.9,4.5,3.3,2.4,2.0
x=3,70.8,61.5,53.4,46.3,40.0,34.4,29.5,25.2,21.4,18.1,15.1,12.5,10.2,8.2,6.5,4.9,3.5,2.3,1.1,1.0
x=4,70.0,60.7,52.6,45.5,39.2,33.6,28.7,24.4,20.6,17.3,14.3,11.7,9.4,7.4,5.7,4.1,2.7,1.5,0.3,0.0
x=5,69.3,60.1,52.0,44.8,38.5,33.0,28.1,23.8,20.0,16.6,13.7,11.0,8.8,6.8,5.0,3.4,2.1,0.9,-0.4,-1.0
x=6,68.8,59.6,51.4,44.3,38.0,32.4,27.5,23.2,19.4,16.1,13.1,10.5,8.2,6.3,4.5,2.9,1.5,0.3,-0.9,-2.0
x=7,68.4,59.1,51.0,43.9,37.6,32.0,27.1,22.8,19.0,15.6,12.7,10.1,7.8,5.8,4.1,2.5,1.1,-0.1,-1.4,-3.0
x=8,68.0,58.8,50.6,43.5,37.2,31.6,26.8,22.4,18.6,15.3,12.3,9.7,7.5,5.5,3.7,2.1,0.8,-0.4,-1.7,-4.0
x=9,67.7,58.5,50.4,43.2,36.9,31.3,26.5,22.1,18.3,15.0,12.0,9.4,7.2,5.2,3.4,1.8,0.5,-0.7,-2.0,-4.7


Unnamed: 0,t=1,t=2,t=3,t=4,t=5,t=6,t=7,t=8,t=9,t=10,t=11,t=12,t=13,t=14,t=15,t=16,t=17,t=18,t=19,t=20
x=0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
x=1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
x=2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
x=3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
x=4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
x=5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
x=6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
x=7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
x=8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
x=9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
