### Assignment 3
_Jiajie Lu_

Consider the equipment location example; its text is repeated for convenience. A service man moves between 4 sites, with site 1 denoting home oﬃce, and 2, 3, and 4 denoting remote sites. Work at sites 2,3, and 4 requires the use of an equipment trailer. The cost of relocating the equipment trailer between the sites is $d(j,k) = 300$ for $k\ne j$. The cost $c(k,j)$ of using the trailer is 100 if the work is at site $k>1$ and trailer is at site $j\ne k$ with $j > 1$; $50$ if $j = k$ and $j>1$, and 200 if the work is at $k>1$ and the trailer is at $j=1$ (home oﬃce). 
$$
c(k,j)=\left\{
\begin{aligned}
100 && k>1,j\ne k\\
50 && k>1,j= k\\
200 && k>1,j=1
\end{aligned}
\right.
$$
If the service man is at the home oﬃce, no work is done and cost is zero. At any time, the service man knows their location, observes the location of the next job (or 1 if no job is to be done) and decides whether and where to move the trailer. The transition matrix between job locations is
$$
p=\begin{bmatrix}
0.1 & 0.3 & 0.3 & 0.3\\ 
0.0 & 0.5 & 0.5 & 0.0\\ 
0.0 & 0.0 & 0.8 & 0.2\\ 
0.4 & 0.0 & 0.0 & 0.6
\end{bmatrix}
$$

For example, if the current location (job) is 2, the probability that the next job is at 3 is 0.5. Assume the discount factor of 0.95.

(a)  Formulate a Markov decision problem to help the repairment decide on the movement of the trailer: deﬁne the state and control space and write the dynamic programming equations.

>* State Space :
$$x_t=(j_t,k_t)\in\{1,2,3,4\}\times\{1,2,3,4\}$$ 
where $j_t$ is current trailer position and $k_t$ is current job position.
* Control Space : $$u_t\in\{1,2,3,4\}$$ is the next trailer position.
* Transition kernel:
$$
P[(j_{t+1},k_{t+1})|j_t,k_t,u_t]=\left\{
\begin{aligned}
p_{k_t,k_{t+1}} && j_{t+1}=u_t\\
0 && \text{otherwise}
\end{aligned}
\right.
$$
* Step-wise Cost function conditioned with control (j-location of the equipment; k-location of the next job):
$$
c(j,k,u)=\left\{
\begin{aligned}
0 && u=j, k=1\\
50 && u=j=k>1\\
100 && u=j, k\ne j, j\ne1,k>1\\
200 && u=j=1, k>1, \\
300 && u\ne j, k=1\\
300+50 && u\ne j, u= k, k>1\\ 
300+100 && u\ne j, u\ne k,u\ne1, k>1\\
300+200 && u\ne j, u=1, k>1
\end{aligned}
\right.
$$
* Dynamic programming equation
$$
v(j,k)=\min_{u\in\{1,2,3,4\}}\{c(j,k,u)+0.95\sum_{l=1}^{4}v(u,l)p_{kl}\}
$$

In [1]:
import numpy as np

# Transition Probability Matrix
P = np.array([
    [.0, .0, .0, .0, .0],
    [.0, .1, .3, .3, .3],
    [.0, .0, .5, .5, .0],
    [.0, .0, .0, .8, .2],
    [.0, .4, .0, .0, .6]
])

# control space
U = np.array([1,2,3,4])

# step-wise cost function
# c[j,k,u]
# j-location of equipment,
# k-location of next job,
# u-control
C = np.zeros((5,5,5))

# function to get value
def get_value(j,k,u):
    if u == j:
        if k == 1:
            return 0
        else:
            if k == j:
                return 50
            else:
                if j == 1:
                    return 200
                else:
                    return 100
    else:
        if k == 1:
            return 300
        else:
            if u == k:
                return 350
            elif u == 1:
                return 500
            else:
                return 400
        
for i in U:
    for j in U:
        for u in U:
            C[i,j,u] = get_value(i,j,u)

(b)  Solve these equations by the value iteration method (starting from 0) and calculate lower and upper bound at each step.

In [2]:
V1_old = np.zeros((5,5))
V1 = np.zeros((5,5))
# initial value function
def init_v1(j, k):
    v = np.array([C[j, k, u] for u in U])
    return np.min(v)

for j in U:
    for k in U:
        V1[j, k] = init_v1(j, k)
        
def iter_v1(j, k, value):
    v = np.array([C[j, k, u] + 0.95*np.sum([P[k, l]*value[u, l] for l in U]) for u in U])
    return np.min(v)

iter_cnt = 0
iter_max = 1000

while (iter_cnt < iter_max) and (1/16*np.sum(np.abs(V1-V1_old)) > 0.01):
    iter_cnt += 1
    V1_old = V1.copy()
    
    for j in U:
        for k in U:
            V1[j, k] = iter_v1(j, k, V1_old)
            
    upper, lower = np.max(V1-V1_old), np.min(V1-V1_old)
    
    if iter_cnt%10 == 0:
        print('-'*50)
        print(f"Iteration : {iter_cnt}\n")
        print("Upper bound:")
        print(V1[1:,1:]+0.95/0.05*upper)
        print("Lower bound:")
        print(V1[1:,1:]+0.95/0.05*lower)
        print("-"*50)
        
print(f"Final value matrix after {iter_cnt} iterations :")
print(V1[1:,1:])

--------------------------------------------------
Iteration : 10

Upper bound:
[[1677.77833822 1818.9172291  1747.10822275 1730.19632746]
 [1536.46351291 1612.04908735 1676.23906951 1590.4052497 ]
 [1421.30985759 1518.9172291  1447.10822275 1511.98104171]
 [1468.77332717 1631.50478835 1577.90806698 1430.19632746]]
Lower bound:
[[754.49322728 895.63211816 823.82311181 806.91121652]
 [613.17840197 688.76397641 752.95395857 667.12013876]
 [498.02474665 595.63211816 523.82311181 588.69593077]
 [545.48821623 708.21967741 654.62295604 506.91121652]]
--------------------------------------------------
--------------------------------------------------
Iteration : 20

Upper bound:
[[1604.65078056 1737.3699288  1665.58612966 1673.76153406]
 [1534.09146449 1609.04673759 1665.58612966 1588.95139347]
 [1339.74643441 1437.3699288  1365.58612966 1430.39392717]
 [1412.14714308 1574.7174746  1521.3169392  1373.76153406]]
Lower bound:
[[1052.79687319 1185.51602143 1113.73222228 1121.90762669]
 [ 982.23

(c)  Solve the problem by the policy iteration method starting from “do not move trailer from base” at each state.

In [3]:
V2_old = np.zeros((5,5))
V2 = np.zeros((5,5))

# update value
def update_v2(j, k, u, value):
    return C[j, k, u] + 0.95*np.sum([value[u, l]*P[k, l] for l in U])

# iter control
def iter_pi(j, k, value):
    control = np.array([C[j, k, u] + 0.95*np.sum([value[u, l]*P[k, l] for l in U]) for u in U])
    return np.argmin(control)+1

for j in U:
    for k in U:
        V2[j, k] = update_v2(j, k, j, V2_old)

iter_cnt = 0
iter_max = 1000

while (iter_cnt < iter_max) and (1/16*np.sum(np.abs(V2-V2_old)) > 0.01):
    iter_cnt += 1
    
    V2_old = V2.copy()
    
    for j in U:
        for k in U:
            u = iter_pi(j, k, V2_old)
            V2[j, k] = update_v2(j, k, u, V2_old)

print(f"After {iter_cnt} iterations:")
print(V2[1:,1:])

After 172 iterations:
[[1497.58108088 1617.87623144 1546.09244875 1591.54018069]
 [1428.04040265 1494.06670763 1546.09244875 1494.52665432]
 [1220.25281904 1317.87623144 1246.09244875 1310.90041764]
 [1329.92614702 1492.49679138 1439.09573399 1291.54018069]]


(d) Solve the problem by linear programming.

In [4]:
V3 = []
lbl = []

def make_tmp_mat(j, k):
    tmp = [[-0.95*P[k, l+1] for l in range(4)] for u in range(4)]
    tmp[j-1][k-1] = 4 - 0.95*P[k, k]
    return tmp

for j in U:
    for k in U:
        V3.append(sum(make_tmp_mat(j,k), []))
        lbl.append(np.sum([C[j,k,u] for u in U]))
        
A,b = np.array(V3)/4,np.array(lbl)/4
k = np.linalg.inv(A).dot(b)
k.reshape((4,4))

array([[6338.43964592, 6489.48943626, 6462.06727166, 6386.29550104],
       [6338.43964592, 6489.48943626, 6462.06727166, 6386.29550104],
       [6338.43964592, 6489.48943626, 6462.06727166, 6386.29550104],
       [6338.43964592, 6489.48943626, 6462.06727166, 6386.29550104]])