#### Assignment 4
_Jiajie Lu_

In [1]:
import numpy as np
import pandas as pd

#### Problem 1
>Consider the equipment replacement problem of Assignment 2.  Assume that we would like to identify the optimal replacement policy by solving an infinite-horizon discounted total reward problem.

**(1.1)**
Formulate the infinite-horizon Markov decision problem.

**Solution:**<br>
* State space : $\mathcal{X}=\{0,1,2,3,...\}$ - state of the equipment; $x=0$ - new
* Control space : $\mathcal{U}=\{0:\text{"continue"},1:\text{"renew"}\}$
* Transition probability
$$
P(x_{t+1}=k|x_t=i,0)=\left\{
\begin{aligned}
p_{k-i} && k\ge i\\
0 && \text{o.w.}
\end{aligned}
\right.\\
P(x_{t+1}=k|x_t=i,1)=p_k, k=0,1,\ldots
$$
where $p_j=\frac{\lambda^j}{j!}e^{-\lambda}$
* Reward function :
$$
r(x,u)=\left\{
\begin{aligned}
&R-K(1-\gamma e^{-\mu x})-c_0 && u=1\\
&R-(c_0+c_1x) && u=0
\end{aligned}
\right.
$$
* Dynamic programming equation
$$
v^*(x)=\max_{u\in\{0,1\}}\{r(x,u)+\alpha\sum_{j=0}^{\infty}p_jv^*[x(1-u)+j]\}\\
$$
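The transition law above can be tabulated on a truncated state space. The following is a minimal sketch and not part of the assignment code: the names `poisson_pmf` and `transition_matrices`, the truncation size, and the choice to lump the probability mass that leaves the truncated space into the last state are all assumptions made for this illustration.

```python
import math
import numpy as np

def poisson_pmf(j, lam):
    """p_j = lam^j / j! * exp(-lam)."""
    return lam**j * math.exp(-lam) / math.factorial(j)

def transition_matrices(n_states, lam):
    """Return P with P[0] for 'continue' and P[1] for 'renew'.

    Assumption: mass that would leave the truncated space {0, ..., n_states-1}
    is lumped into the last state so that every row sums to 1.
    """
    p = np.array([poisson_pmf(j, lam) for j in range(n_states)])
    P = np.zeros((2, n_states, n_states))
    for x in range(n_states):
        # continue: x -> x + j with probability p_j (never below x)
        P[0, x, x:] = p[: n_states - x]
        # renew: restart from a new machine, 0 -> j with probability p_j
        P[1, x, :] = p
    # put the truncated tail mass on the last state
    P[:, :, -1] += 1.0 - P.sum(axis=2)
    return P

P_example = transition_matrices(n_states=10, lam=1.0)
print(P_example[0, 3, :5])  # row of state 3 under 'continue': zero below index 3
```

Under "renew" every row is identical, which reflects that the next state does not depend on the current one.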

**(1.2)** If there is no salvage value, show that the optimal value function is a non-increasing function of the state.

**Solution:**<br>
The value iteration equation is just
$$
\begin{aligned}
v^{k+1}(x)&=\max\{r(x,0)+\alpha\sum_{j=0}^{\infty}p_jv^{k}(x+j), r(x,1)+\alpha\sum_{j=0}^{\infty}p_jv^{k}(j)\}\\
\end{aligned}
$$
With no salvage value ($\gamma=0$), the reward function reduces to
$$
r(x,u)=\left\{
\begin{aligned}
&R-K-c_0 && u=1\\
&R-(c_0+c_1x) && u=0
\end{aligned}
\right.
$$
Since $c_1>0$, $r(x,0)$ is decreasing in $x$ while $r(x,1)$ is constant, so the reward function is non-increasing in $x$ for each control.<br>
Starting from $v^0(\cdot)=0$, we have
$
v^1(x)=\max\{r(x,0), r(x,1)\}
$
For $x\ge\frac{K}{c_1}$, $v^1(x)=r(x, 1)$, which is a constant, and for $x\le\frac{K}{c_1}$, $v^1(x)=r(x, 0)$, which is non-increasing. Hence $v^1(\cdot)$ is non-increasing.<br>
Inductively, suppose $v^k(\cdot)$ is non-increasing and define
$$
\begin{aligned}
h(x)&=r(x,0)+\alpha\sum_{j=0}^{\infty}p_jv^{k}(x+j)\\
g(x)&=r(x,1)+\alpha\sum_{j=0}^{\infty}p_jv^{k}(j)
\end{aligned}
$$
Then $h(x)$ is non-increasing, because $r(x,0)$ and every term $v^k(x+j)$ are non-increasing in $x$, and $g(x)$ does not depend on $x$. We have
$$
v^{k+1}(x)=\max\{h(x), g(x)\}
$$
For $x>y$, $h(x)\le h(y)$ and $g(x)\le g(y)$, so
$$
\max\{h(x),g(x)\}\le\max\{h(y),g(y)\}
$$
Therefore $v^{k+1}(\cdot)$ is non-increasing. Since $v^k$ converges to $v^*$ pointwise as $k\to\infty$, and the inequality $v^k(x)\le v^k(y)$ is preserved in the limit, $v^*(\cdot)$ is non-increasing as well.
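The induction can also be checked numerically. The sketch below runs value iteration without salvage value and tests after every sweep that the iterate is non-increasing. It is only an illustration: the function name `value_iteration_no_salvage`, the truncation to 30 states with deterioration absorbed in the last state, and the parameter values (borrowed from part (1.3) with $\gamma=0$) are assumptions, not part of the proof.

```python
import math
import numpy as np

def value_iteration_no_salvage(n_states=30, n_iter=200, c0=1.0, c1=1.0,
                               R=5.0, K=10.0, lam=1.0, alpha=0.9):
    """Value iteration without salvage value on states 0..n_states-1.

    Assumption: deterioration beyond the last state is absorbed there.
    """
    states = np.arange(n_states)
    p = np.array([lam**k * math.exp(-lam) / math.factorial(k)
                  for k in range(n_states)])
    tail = 1.0 - p.sum()                 # mass of jumps >= n_states
    r_continue = R - c0 - c1 * states    # r(x, 0)
    r_renew = R - K - c0                 # r(x, 1), constant in x
    # successor index when continuing: x -> min(x + j, n_states - 1)
    nxt = np.minimum(states[:, None] + states[None, :], n_states - 1)
    v = np.zeros(n_states)
    monotone = True
    for _ in range(n_iter):
        h = r_continue + alpha * (v[nxt] @ p + tail * v[-1])
        g = r_renew + alpha * (v @ p + tail * v[-1])
        v = np.maximum(h, g)
        # every iterate should be non-increasing (up to rounding)
        monotone = monotone and bool(np.all(np.diff(v) <= 1e-9))
    return v, monotone

v_ns, is_monotone = value_iteration_no_salvage()
print(is_monotone)
```

The flag stays true because each sweep takes the maximum of a non-increasing function and a constant, exactly as in the argument above.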

**(1.3)** Solve the infinite-horizon problem (with salvage value present) for the following parameter values: $c_0= 1$, $c_1= 1$, $R= 5$, $K= 10$, $\gamma= 0.8$, $\mu= 0.2$, $\lambda= 1$ and discount factor $\alpha= 0.9$. Solve the problem by the value iteration method.

In [2]:
# action space
U1 = [0,1]
# state space ; thanks to the analysis in Assignment 02
X1 = list(range(10))
# parameter table
c0, c1 = 1, 1
R, K = 5, 10
gamma, mu, lamda = 0.8, 0.2, 1
alpha = 0.9

# reward function
def reward(x,u):
    if u == 0:
        return R-c0-c1*x
    return R-K-c0+K*gamma*np.exp(-mu*x)

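# probability function: Poisson(lamda) pmf
# (np.math was removed in NumPy 2.0, so use the standard-library factorial)
from math import factorial

def prob(j):
    return (lamda**j/factorial(j))*np.exp(-lamda)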

# show result
def show_result_1(v, p):
    return pd.DataFrame(
        np.vstack([v, p]), index=["value", "policy"], 
        columns=[f"x={t}" for t in X1])

Then we apply the value iteration method.

In [5]:
# Value Iteration
def value_iteration_1(T, max_iter=1000, tol=0.0001):
    # T is the length of the restricted (truncated) state space
    # initial value function
    v = np.zeros(2*T)
    v_old = np.zeros(2*T) + 10000
    # initial policy
    p = np.zeros(2*T)
    # current iteration
    current_step = 0
    
    # value iteration method
    while (np.linalg.norm(v - v_old) > tol) and (current_step < max_iter):
        v_old = v.copy()
        current_step += 1
        
        for x in X1:
            vs = [reward(x, u) + alpha*np.sum([prob(j)*v_old[x*(1-u)+j] for j in X1]) for u in U1]
            v[x], p[x] = np.max(vs), np.argmax(vs)
           
        # states 10..18 lie outside the truncated space: renewal is forced there
        for y in X1[1:]:
            v[y+9] = reward(y+9, 1) + alpha*np.sum([prob(j)*v_old[j] for j in X1])
            
    return v[:T], p[:T], current_step

v, p, iters = value_iteration_1(10)
print(f"With {iters} iterations, we can obtain:")
display(show_result_1(v, p))

With 106 iterations, we can obtain:


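| | x=0 | x=1 | x=2 | x=3 | x=4 | x=5 | x=6 | x=7 | x=8 | x=9 |
|---|---|---|---|---|---|---|---|---|---|---|
| value | 18.999813 | 16.248869 | 14.362373 | 13.390306 | 12.594444 | 11.942848 | 11.409366 | 10.972588 | 10.614985 | 10.322204 |
| policy | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |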


#### Problem 2
>We consider an inventory model as discussed in class.  The stock at the beginning of period $t$ denoted by $x_t$, orders at the beginning of period $t$ by $u_t$, and random demand in period $t$ (observed only after the orders are placed) by $d_t$.  We assume ordering cost 5, selling price 10 and holding cost 2.  The demands in successive periods are i.i.d.  with values (0,1,2,3,4) whose respective probabilities are 0.1,0.2,0.3,0.2,0.2.  The capacity of the inventory is 12.

**(2.1)**
Formulate an infinite horizon problem with discount factor 0.8 to determine the best re-order policy.

**Solution:**
* State space : $\mathcal{X}=\{0,1,2,3,...,12\}$ - stock level at the beginning of a period
* Control space : $\mathcal{U}=\{0,1,2,3,...,12\}$
* Dynamics : $x_{t+1}=\max\{0, x_t+u_t-d_t\}$
* Admissible controls : $U(x)=\{0,1,2,3,4,5,...,12-x\}$, so that the stock after ordering never exceeds the capacity
* Reward function
$$
r(x,u)=E[10\min(d, x+u) - 5u - 2(x+u)]
$$
* Dynamic programming equation :
$$
v^*(x)=\max_{u\in U(x)}\{-5u-2(x+u)+\sum_{d=0}^{4}P(d)(10\min\{x+u,d\}+0.8v^*(\max\{0, x+u-d\}))\}
$$
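The one-period reward in this formulation is easy to evaluate by hand or in code. A minimal sketch follows; the function name `expected_reward` and the keyword arguments are introduced here for illustration only, while the prices, costs and demand distribution are those of the problem statement.

```python
import numpy as np

DEMANDS = np.arange(5)                               # d = 0, 1, 2, 3, 4
DEMAND_PROBS = np.array([0.1, 0.2, 0.3, 0.2, 0.2])   # P(d)

def expected_reward(x, u, price=10, order_cost=5, holding_cost=2):
    """r(x, u) = E[price * min(d, x+u)] - order_cost * u - holding_cost * (x+u)."""
    stock = x + u
    expected_sales = np.sum(DEMAND_PROBS * np.minimum(DEMANDS, stock))
    return price * expected_sales - order_cost * u - holding_cost * stock

# Ordering 2 units with an empty inventory: expected sales are
# 0.2*1 + 0.7*2 = 1.6 units, so the reward is about 16 - 10 - 4 = 2.
print(expected_reward(0, 2))
```

Holding cost is charged on the stock after ordering, $x+u$, which matches the term $-2(x+u)$ in the dynamic programming equation.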

**(2.2)** Solve the problem in (2.1) by value and policy iteration methods.

In [33]:
# state space
X2 = list(range(13))
# number of possible demand values (d = 0, ..., 4)
D = 5
# control space
U2 = list(range(13))
# probability of demand
P = [0.1, 0.2, 0.3, 0.2, 0.2]
# discount factor
alpha2 = 0.8

# show result
def show_result_2(v, p):
    return pd.DataFrame(
        np.vstack([v, p]), index=["value", "policy"], 
        columns=[f"x={t}" for t in X2])

First we apply the value iteration method.

In [42]:
# value iteration
def value_iteration_2(k, max_iter=1000, tol=1e-5):
    # k - the capacity of the inventory
    # initial value functions
    v = np.zeros(len(X2))
    v_old = np.zeros(len(X2)) + 100
    # initial policy
    p = np.zeros(len(X2))
    # current iter step
    current_step = 0
    
    while (np.linalg.norm(v - v_old) > tol) and (current_step < max_iter):
        v_old = v.copy()
        current_step += 1
        
        for x in X2:
            tmp = [-5*u - 2*(x+u) + np.sum([
                P[d]*(10*np.min([x+u, d]) + alpha2*v_old[int(np.max([0, x+u-d]))]) 
                for d in range(D)]) 
                   for u in range(k-x+1)]
            v[x], p[x] = np.max(tmp), np.argmax(tmp)
            
    return v, p, current_step

v, p, iters = value_iteration_2(k=12)
print(f'With {iters} iterations:')
display(show_result_2(np.round(v, 3), p))

With 65 iterations:


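| | x=0 | x=1 | x=2 | x=3 | x=4 | x=5 | x=6 | x=7 | x=8 | x=9 | x=10 | x=11 | x=12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| value | 18.0 | 23.0 | 28.0 | 32.348 | 35.278 | 36.487 | 36.913 | 36.395 | 34.962 | 32.688 | 29.728 | 26.107 | 21.887 |
| policy | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |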


Then we apply policy iteration.

In [41]:
# policy iteration
def policy_iteration_2(k, max_iter=1000, tol=1e-5):
    # k - capacity of the inventory
    # initial policy
    p = np.zeros(len(X2))
    p_old = np.zeros(len(X2)) + 10
    # current step
    current_step = 0
    
    while (np.linalg.norm(p - p_old) > tol) and (current_step < max_iter):
        p_old = p.copy()
        current_step += 1
        
        # initial coef
        A = np.zeros((len(X2), len(X2)))
        # initial bias
        b = np.zeros(len(X2))
        # set coef&bias
        for x in X2:
            u = p_old[x]
            A[x, x] += 1
            b[x] += -5*u - 2*(x+u) + np.sum([
                P[d]*(10*np.min([x+u, d])) for d in range(D)
            ])
            
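            for d in range(D):
                # clamp the successor state at 0; np.max(x+u-d, 0) would treat 0 as
                # the axis argument and let negative indices wrap around
                A[x, int(np.max([0, x+u-d]))] += -alpha2*P[d]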
                
        # iterated v
        v = np.linalg.solve(A, b)
        # update policy
        for x in X2:
            tmp = [-5*u - 2*(x+u) + np.sum([
                P[d]*(10*np.min([x+u, d]) + alpha2*v[int(np.max([0, x+u-d]))])
                for d in range(D)
            ]) for u in range(k-x+1)]
            p[x] = np.argmax(tmp)
            
    return v, p, current_step

v, p, iters = policy_iteration_2(12)
print(f"With {iters} iterations:")
display(show_result_2(np.round(v,3), p))

With 2 iterations:


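| | x=0 | x=1 | x=2 | x=3 | x=4 | x=5 | x=6 | x=7 | x=8 | x=9 | x=10 | x=11 | x=12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| value | 22.755 | 27.755 | 32.755 | 36.205 | 38.843 | 39.767 | 39.912 | 39.063 | 37.398 | 34.899 | 31.734 | 27.921 | 23.534 |
| policy | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Both methods return the same policy: order up to a stock of 2 when $x<2$ and order nothing otherwise. The value rows should then coincide as well, since they evaluate the same policy. The policy-iteration values recorded here are larger because they were computed with the successor index `x+u-d` left unclamped: `np.max(x+u-d, 0)` passes `0` as the `axis` argument, so a negative index wraps to the end of the array (for example, $x=0$, $u=2$, $d=3$ reads $v(12)$ instead of $v(0)$). With the index clamped at 0, the values in the value-iteration table satisfy the evaluation equations exactly, e.g. $v(0)=2+0.8\,(0.1\cdot 28+0.2\cdot 23+0.7\cdot 18)=18$.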
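As a cross-check, the policy reported by both methods can be evaluated directly by solving $(I-\alpha P_\pi)v=r_\pi$ with the successor state clamped at zero. The sketch below is self-contained; the function name `evaluate_policy` and its arguments are introduced for this illustration and are not part of the cells above.

```python
import numpy as np

def evaluate_policy(policy, probs=(0.1, 0.2, 0.3, 0.2, 0.2), alpha=0.8):
    """Solve (I - alpha * P_pi) v = r_pi for a fixed ordering policy."""
    n = len(policy)
    A = np.eye(n)
    b = np.zeros(n)
    for x, u in enumerate(policy):
        stock = x + int(u)
        # ordering and holding cost
        b[x] = -5 * u - 2 * stock
        for d, prob_d in enumerate(probs):
            # expected sales revenue
            b[x] += prob_d * 10 * min(stock, d)
            # successor state max(0, stock - d): unmet demand is lost
            A[x, max(0, stock - d)] -= alpha * prob_d
    return np.linalg.solve(A, b)

policy = [2, 1] + [0] * 11          # order up to 2, otherwise order nothing
v_check = evaluate_policy(policy)
print(np.round(v_check[:4], 3))     # approximately 18, 23, 28, 32.348
```

The first entries agree with the value-iteration table, which is what one expects when both methods end at the same policy.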


#### Problem 3
>Fisher boat is sent to the waters of three connected lakes during one fishing season. Let $x_i$, $i= 1,2,3$ be the (estimated) current amounts of fish in lake $i$.  If we fish in lake $i$, then we harvest $r_ix_i$ fish, provided the fishing conditions are good.  The weather may change abruptly with probability $p$ so that we end the fishing season.  We assume that $0< r_i<1$ for all $i= 1,2,3$.  Identify the lake-selection policy that maximizes the amount of fish before the end of the season.

**Solution:**
* State space : $\mathcal{X}=\{x=(x_1,x_2,x_3)\in\mathbb{R}_+^3\}$
* Control space : $\mathcal{U}=\{1,2,3\}$
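* Transition probability : for every $u\in\{1,2,3\}$ the season ends with probability $p$; otherwise the chosen lake loses the harvested fraction.
$$
\begin{aligned}
P[(0,0,0)|(x_1, x_2, x_3), u]&=p, \quad u\in\{1,2,3\}\\
P[((1-r_1)x_1, x_2, x_3)|(x_1, x_2, x_3), u=1]&=1-p\\
P[(x_1, (1-r_2)x_2, x_3)|(x_1, x_2, x_3), u=2]&=1-p\\
P[(x_1, x_2, (1-r_3)x_3)|(x_1, x_2, x_3), u=3]&=1-p
\end{aligned}
$$
$(0, 0, 0)$ is the absorbing state that represents the end of the season.
* Reward function : the expected harvest of one period,
$$
r(x, u) = (1-p)r_ux_u, \qquad u\in\{1,2,3\},
$$
and no reward is collected once the season has ended.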
* State equation : 
$$
x^{t+1}=x^t_{-u}
$$
The subscript $-u$ means that the $u$-th lake was harvested: the $u$-th component of $x^{t+1}$ is $(1-r_u)x^t_u$ and the other components are unchanged.
* Dynamic programming equation :
$$
\begin{aligned}
v^*(x)=\max\{0, \max_{u\in U}\{(1-p)r_ux_u+(1-p)v^*(x_{-u})\}\}
\end{aligned}
$$

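This problem can be viewed as a 3-armed bandit problem in which every arm deteriorates: harvesting lake $u$ repeatedly yields $r_ux_u, r_u(1-r_u)x_u, r_u(1-r_u)^2x_u, \ldots$, a decreasing sequence. For such arms the index of an arm equals its immediate reward, so the optimal policy is the myopic one: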
$$
\pi^*(x)=\arg\max_{u\in\mathcal{U}}r_ux_u
$$

That is, as long as the weather permits, we fish in the lake that currently yields the largest harvest.
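The greedy rule can be compared against brute force on a small instance. The sketch below enumerates every lake sequence over a short horizon and checks that none beats the greedy one. The stock levels, harvest rates, weather probability and horizon are made-up illustrative numbers, the function names are assumptions, and lakes are indexed from 0 in the code.

```python
from itertools import product

def expected_harvest(x, r, p, sequence):
    """Expected total harvest of a fixed lake sequence.

    Follows the recursion v(x) = (1-p) * (r_u * x_u + v(x_{-u})).
    """
    x = list(x)
    total, survive = 0.0, 1.0
    for u in sequence:
        survive *= (1.0 - p)             # season has not ended yet
        total += survive * r[u] * x[u]   # expected harvest of this period
        x[u] *= (1.0 - r[u])             # lake u is depleted
    return total

def greedy_sequence(x, r, horizon):
    """Always fish the lake with the largest immediate harvest r_u * x_u."""
    x = list(x)
    seq = []
    for _ in range(horizon):
        u = max(range(len(x)), key=lambda i: r[i] * x[i])
        seq.append(u)
        x[u] *= (1.0 - r[u])
    return seq

# hypothetical instance, chosen only for illustration
x0, r, p, horizon = (10.0, 8.0, 6.0), (0.3, 0.5, 0.4), 0.2, 6
best = max(expected_harvest(x0, r, p, s)
           for s in product(range(3), repeat=horizon))
greedy = expected_harvest(x0, r, p, greedy_sequence(x0, r, horizon))
print(abs(best - greedy) < 1e-9)
```

The check works because each lake's successive harvests decrease, so the greedy order collects the largest available harvests first, when the survival probability is highest.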