## Revenue Management

We consider the general case where the buffer has length $n$. The state space is $\mathcal{X}:=A \cup B$, where $A= \{0,\ldots, n\}$ contains the states in which the server is off and $B= \{n+1,\ldots, 2n+1\}$ contains the states in which the server is on. For a state $x \in A$, the number of customers in the queue is $x$ itself; for a state $x \in B$, it is $x \bmod (n+1)$. The action set is $\mathcal{A}(x) = \{0,1\}$ for every state, where $a=0$ means the server is switched off and $a=1$ means it is switched on.
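The state encoding can be made concrete with a small helper. This is only an illustrative sketch; `decode_state` is a hypothetical name that does not appear elsewhere in the notebook, and the default `n=100` simply mirrors the value used in this exercise.

```python
def decode_state(x, n=100):
    """Return (server_on, queue_length) for a state x in {0, ..., 2n+1}."""
    if not 0 <= x <= 2 * n + 1:
        raise ValueError("state out of range")
    server_on = int(x > n)        # states in B = {n+1, ..., 2n+1} have the server on
    queue_length = x % (n + 1)    # equals x itself for states in A
    return server_on, queue_length

print(decode_state(0), decode_state(100), decode_state(101), decode_state(201))
# prints: (0, 0) (0, 100) (1, 0) (1, 100)
```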

For our exercise, we have `n=100`.

With that, we can evaluate the reward function by considering cases.

#### Server is switched off, $a = 0$

- $x <n$, $R(x,a,w) = (1-w)(-x)+w(-x-1)$
- $x =n$, $R(x,a,w) = (1-w)(-n)+w(-n-1000)$
- $n< x < 2n+1$, $R(x,a,w) = (1-w)(-(x \bmod (n+1)))+w(-(x \bmod (n+1))-1)$
- $x =2n+1$, $R(x,a,w) = (1-w)(-n)+w(-n-1000)$

#### Server is switched on, $a = 1$

- $x <n$, $R(x,a,w) = (1-w)(-x)+w(-x-1)-10$
- $x =n$, $R(x,a,w) = (1-w)(-n)+w(-n-1)-10$
- $n< x < 2n+1$, $R(x,a,w) = (1-w)(-(x \bmod (n+1)))+w(-(x \bmod (n+1))-1)$
- $x =2n+1$, $R(x,a,w) = (1-w)(-n)+w(-n-1)$
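
Each case above is a two-term expectation weighted by $w$, so it collapses to a linear expression in $w$. The following simplification uses only the formulas listed above; the symbol $q$ for the queue length is introduced here for brevity.

$$\begin{aligned}
(1-w)(-q)+w(-q-1) &= -q - w,\\
(1-w)(-n)+w(-n-1000) &= -n - 1000\,w,
\end{aligned}$$

with $q = x$ for $x \le n$ and $q = x \bmod (n+1)$ otherwise. Taking $a = 1$ in a state $x \le n$ subtracts a further 10 from the first expression.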


In [1]:
import numpy as np

In [51]:
n = 100

In [86]:
def reward_function(x,a,w = 0.75,n = 100):
    """
    Computes the expected reward given current state x, action taken a, weight w and buffer length n.
    """
    
    if (x < n) and (a == 0):
        r = (1-w)*(-x) + w*(-x-1)
    elif ((x == n) or (x == 2*n+1)) and (a == 0):
        r = (1-w)*(-n) + w*(-n-1000)
    elif (x > n) and (x < (2*n+1) ) and (a == 0):
        y = np.mod(x,n+1)
        r = (1-w)*(-y) + w*(-y-1)
    elif (x <= n) and (a == 1):
        r = (1-w)*(-x) + w*(-x-1)-10
    elif (x > n) and (x < (2*n+1) ) and (a == 1):
        y = np.mod(x,n+1)
        r = (1-w)*(-y) + w*(-y-1)
    else: 
        r = (1-w)*(-n) + w*(-n-1)  
        
    return r    

The function `reward_gen` generates two vectors, `R0` and `R1`, which give the expected rewards in every state when we take $a=0$ and $a=1$ respectively.

In [84]:
def reward_gen(w,n):
    """
    Generates the rewards for action a=0 and a=1 given weight w and buffer length n.
    """
    R0 = np.zeros(2*n+1+1)
    R1 = np.zeros(2*n+1+1)
    for i in range(2*n+1+1):
        R0[i] = reward_function(i,0,w,n)
        R1[i] = reward_function(i,1,w,n)
        
    return R0, R1   
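
The loop in `reward_gen` can also be written without evaluating the reward state by state. The following is a minimal vectorised sketch built from the reward cases listed earlier; `reward_vectors` and its internal names are hypothetical and are not part of the notebook.

```python
import numpy as np

def reward_vectors(w=0.75, n=100):
    """Vectorised sketch of the expected rewards for a=0 and a=1 over all states."""
    x = np.arange(2 * n + 2)        # all states 0..2n+1
    q = x % (n + 1)                 # queue length in each state
    server_off = x <= n             # states in A
    full = q == n                   # buffer is full (x = n or x = 2n+1)
    base = -q - w                   # (1-w)(-q) + w(-q-1)
    # a = 0: with a full buffer the w-term carries 1000 instead of 1
    R0 = np.where(full, -n - 1000 * w, base)
    # a = 1: same base reward, minus 10 when the server was off
    R1 = base - 10.0 * server_off
    return R0, R1

R0_vec, R1_vec = reward_vectors(0.75, 100)
```

Comparing `R0_vec` and `R1_vec` against the output of `reward_gen(.75, 100)` with `np.allclose` is a quick way to check that both versions encode the same cases.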

We generate the reward vectors `R0` and `R1` here.

In [46]:
R0, R1 = reward_gen(.75, 100)

We also have the discount factor $\alpha = 0.98$.

In [47]:
alpha = .98

Taking the elementwise maximum of `R0` and `R1`, that is, the maximum over the actions $\mathcal{A}=\{0,1\}$ in each state of $\mathcal{X}$, we get the reward vector `R`.

In [48]:
R = np.maximum(R0,R1)

In [53]:
def T(R, J, alpha = 0.98):
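    """
    Applies the operator T in the form implemented here:
    (TJ)(x) = R(x) + alpha*J(x)
    where R is the elementwise max of the expected rewards over the action space
          alpha is the discount factor
          J is the expected total reward
    The expectation over the next state f(x,a,w) in
    (TJ)(x) = max E[R(x,a,w) + alpha*J(f(x,a,w))|x]
    is not evaluated.
    """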
    J = R + alpha*J
    
    return J
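
The operator written in the docstring of `T` involves a maximisation over actions and an expectation over the next state, whereas the body of `T` updates each state from its own value only. For comparison, here is a hedged sketch of one way to evaluate that expectation explicitly. It rests on assumptions that the notebook does not state: the next state lies in $A$ after $a=0$ and in $B$ after $a=1$; with the server off the queue grows by one with probability $w$ (capped at $n$); with the server on it shrinks by one with probability $1-w$ (floored at 0). The name `bellman_operator` is hypothetical.

```python
import numpy as np

def bellman_operator(J, R0, R1, n, w=0.75, alpha=0.98):
    """One application of (TJ)(x) = max_a { R_a(x) + alpha * E[J(next state)] }.

    Returns the updated values and the maximising action in each state.
    The transition structure is an assumption made for this sketch.
    """
    q = np.arange(2 * n + 2) % (n + 1)        # queue length in each state
    # a = 0: server off afterwards, so the next state lies in A = {0..n}
    stay0 = q                                 # queue unchanged, probability 1-w
    up0 = np.minimum(q + 1, n)                # queue grows, probability w (capped)
    Q0 = R0 + alpha * ((1 - w) * J[stay0] + w * J[up0])
    # a = 1: server on afterwards, so the next state lies in B = {n+1..2n+1}
    down1 = np.maximum(q - 1, 0) + (n + 1)    # queue shrinks, probability 1-w (floored)
    stay1 = q + (n + 1)                       # queue unchanged, probability w
    Q1 = R1 + alpha * ((1 - w) * J[down1] + w * J[stay1])
    return np.maximum(Q0, Q1), (Q1 > Q0).astype(int)
```

Starting from `J = np.zeros(2*n+2)`, the first application returns `np.maximum(R0, R1)`, which coincides with the vector `R` used in this notebook; later iterates differ because neighbouring states now influence each other.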
    

#### Value iteration

We initialise `J` to be all zeros and apply the operator `T` repeatedly, stopping once a further application no longer changes `J`.

In [128]:
J = np.zeros(2*n+1+1)
err = 1
k = 0
while err > 0:
    J = T(R, J)
    err = np.linalg.norm((T(R,J)-J),1)
    k += 1

print ('Iterations:', k)    
print ('Error:', err)

Iterations: 1654
Error: 0.0


From above, we see that 1654 iterations were needed to reach $(TJ^{k})(x) = J^{k}(x)$ for every state $x$.
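
Because `T` as defined above is the affine map $J \mapsto R + \alpha J$, its fixed point is available in closed form, $J^{*} = R/(1-\alpha)$, and the $k$-th iterate started from zero is $J^{k} = (1-\alpha^{k})\,J^{*}$. The sketch below checks the iteration against that closed form and stops on a tolerance rather than on an exactly zero difference. The function name, `tol`, `max_iter` and the three demonstration rewards are assumptions made for illustration.

```python
import numpy as np

def iterate_affine(R, alpha=0.98, tol=1e-10, max_iter=100_000):
    """Iterate J <- R + alpha*J from J = 0 until the 1-norm change drops below tol."""
    R = np.asarray(R, dtype=float)
    J = np.zeros_like(R)
    for k in range(1, max_iter + 1):
        J_next = R + alpha * J
        change = np.linalg.norm(J_next - J, 1)
        J = J_next
        if change < tol:
            break
    return J, k

R_demo = np.array([-0.75, -110.75, -100.75])   # a few reward values for illustration
J_demo, k_demo = iterate_affine(R_demo)
J_closed = R_demo / (1 - 0.98)                 # fixed point of J = R + alpha*J
print(k_demo, np.max(np.abs(J_demo - J_closed)))
```

A tolerance makes the stopping rule independent of floating-point rounding, at the price of a small residual error of order `tol / (1 - alpha)`.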

We would now like to derive a policy based on the `J` values obtained from value iteration.

In [129]:
K = J.reshape(2,101)

In [135]:
np.argmax(K,axis = 0)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1], dtype=int64)

By comparing the expected total reward for each state $x \in \mathcal{X}$, we see that the optimal policy is to keep the server off when there are fewer than 100 customers in the queue and to switch on the server when there are 100 customers in the queue. This policy can be explained by the huge cost incurred when a customer leaves because there is no more space in the buffer. Also, the cost incurred when there are fewer than 100 customers in the queue is the same regardless of whether the server is on or off: the same cost is incurred whether we let the customer wait in the queue or switch on the server to serve the customer. The cost of 10 to switch on the server is a further deterrent, unless there is a possibility of incurring a large cost, which is the case when there is no more space in the buffer.
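
The size of that penalty can be made concrete by evaluating the reward cases at the full-buffer state $x = n$ with $w = 0.75$ and $n = 100$, the values used in this notebook:

$$\begin{aligned}
a = 0:&\quad (1-w)(-n) + w(-n-1000) = 0.25\,(-100) + 0.75\,(-1100) = -850,\\
a = 1:&\quad (1-w)(-n) + w(-n-1) - 10 = 0.25\,(-100) + 0.75\,(-101) - 10 = -110.75.
\end{aligned}$$

For any $x < n$ the comparison goes the other way: $a = 0$ gives $-x - 0.75$, while $a = 1$ gives $-x - 10.75$.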

#### Policy Iteration

We initialise the stationary policy `pi` to switch off the server at every state.

In [140]:
pi = np.zeros(2*n+1+1)

The transition probability matrix under policy `pi` is given by 

$$p_{ij}(\text{pi}(i)) =\begin{cases} 
0.25 &  j = i,\ \text{pi}(i) = 0\\
0.75 &  j = i+1,\ \text{pi}(i) = 0\\
0.25 &  j = i-1,\ \text{pi}(i) = 1\\
0.75 &  j = i,\ \text{pi}(i) = 1\\
0 & \text{otherwise}
\end{cases} $$

In [164]:
def trans_mat_gen(pi, n):
    """
    Generates transition probability matrix under the policy pi. The buffer is of length n.
    """
    P = np.zeros((n+1,n+1))
    for i in range(P.shape[0]):
        print (i)
        print (P)
        if pi[i] == 0:
            P[i,i] = 0.25
            P[i,i+1] = 0.75
        else:
            P[i,i-1] = 0.75
            P[i,i] = 0.25
        
            
    return P

In [165]:
pi = [0,1,0,0]

In [166]:
trans_mat_gen(pi,3)

0
[[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]
1
[[ 0.25  0.75  0.    0.  ]
 [ 0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.  ]]
2
[[ 0.25  0.75  0.    0.  ]
 [ 0.75  0.25  0.    0.  ]
 [ 0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.  ]]
3
[[ 0.25  0.75  0.    0.  ]
 [ 0.75  0.25  0.    0.  ]
 [ 0.    0.    0.25  0.75]
 [ 0.    0.    0.    0.  ]]


IndexError: index 4 is out of bounds for axis 1 with size 4
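
The `IndexError` is raised at `i = 3`: with `pi[3] == 0` the function writes to `P[i,i+1]`, which is column 4 of a matrix with 4 columns. The printed row for `i = 1` also shows 0.75 at `j = i-1`, whereas the case formula above assigns 0.25 to `j = i-1` and 0.75 to `j = i` when `pi(i) = 1`. One possible repair is sketched below. The boundary behaviour is an assumption, not something the notebook specifies: at a full buffer the queue stays at `n`, and at an empty queue it stays at 0. The name `trans_mat_sketch` is hypothetical.

```python
import numpy as np

def trans_mat_sketch(pi, n, w=0.75):
    """Transition matrix over queue lengths 0..n under policy pi (illustrative sketch)."""
    P = np.zeros((n + 1, n + 1))
    for i in range(n + 1):
        if pi[i] == 0:
            # server off: stay at i with probability 1-w, move to i+1 with probability w
            P[i, i] += 1 - w
            P[i, min(i + 1, n)] += w       # assumption: a full buffer stays at n
        else:
            # server on: move to i-1 with probability 1-w, stay at i with probability w
            P[i, max(i - 1, 0)] += 1 - w   # assumption: an empty queue stays at 0
            P[i, i] += w
    return P

P_demo = trans_mat_sketch([0, 1, 0, 0], 3)
print(P_demo)
print(P_demo.sum(axis=1))   # every row should sum to 1
```

Using `+=` together with the clamped indices keeps each row a probability distribution at the boundaries, where both contributions fall on the same entry.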