MC prediction, for estimating $V \approx v_\pi$

Input: a policy $\pi$ to be evaluated

Initialize:

&nbsp;&nbsp;$V(s) \in \mathbb{R}$, arbitrarily, for all $s \in S$ <br>
&nbsp;&nbsp;$Returns(s) \leftarrow$ an empty list, for all $s \in S$

Loop forever (for each episode):<br>
&nbsp;&nbsp;Generate an episode following $\pi$: $S_0, A_0, R_1, S_1, A_1, R_2, \dots, S_{T-1}, A_{T-1}, R_T$ <br>
&nbsp;&nbsp;$G\leftarrow 0$<br>
&nbsp;&nbsp;Loop for each step of episode, $t = T-1, T-2, \dots, 0$:<br>
&nbsp;&nbsp;&nbsp;&nbsp;$G\leftarrow \gamma G + R_{t+1}$<br>
&nbsp;&nbsp;&nbsp;&nbsp;Append $G$ to $Returns(S_t)$<br>
&nbsp;&nbsp;&nbsp;&nbsp;$V(S_t)\leftarrow average(Returns(S_t))$<br>

In [58]:
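def mc_prediction(R: list, gamma=0.5):
    """Print the discounted return G_t for every step of one episode.

    R[t] is the reward received after step t (R_{t+1} in the pseudocode above).
    """
    G = [0] * len(R)
    g = 0  # the return after the final step is 0
    # Sweep backwards so each return reuses the next one: G_t = gamma * G_{t+1} + R[t]
    for t in range(len(R) - 1, -1, -1):
        # The first value printed is g before the update, i.e. the return of step t+1
        print(f"{t}: G_({t})={g}, R_({t})={R[t]}", end=f" G_{t}->")
        g = gamma * g + R[t]
        print(f"{g}")
        G[t] = g

    print(G)


R = [3, 4, 7, 1, 2, 0]
mc_prediction(R)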

5: G_(5)=0, R_(5)=0 G_5->0.0
4: G_(4)=0.0, R_(4)=2 G_4->2.0
3: G_(3)=2.0, R_(3)=1 G_3->2.0
2: G_(2)=2.0, R_(2)=7 G_2->8.0
1: G_(1)=8.0, R_(1)=4 G_1->8.0
0: G_(0)=8.0, R_(0)=3 G_0->7.0
[7.0, 8.0, 8.0, 2.0, 2.0, 0.0]
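
The cell above computes only the returns for a single reward sequence. As a minimal sketch of the remaining pseudocode steps (appending each return to $Returns(S_t)$ and averaging), the following assumes each episode is a list of `(state, reward)` pairs, where `reward` is $R_{t+1}$, the reward received after leaving `state`. The names `every_visit_mc` and `episodes` and the states `"a"` and `"b"` are illustrative and do not come from the notebook. Because the pseudocode appends a return on every visit to a state, with no first-visit check, the sketch follows that every-visit behaviour.

```python
from collections import defaultdict


def every_visit_mc(episodes, gamma=0.5):
    """Estimate V(s) by averaging the returns observed after each visit to s.

    episodes: iterable of episodes; each episode is a list of
    (state, reward) pairs, reward being R_{t+1} for that step (assumed format).
    """
    returns = defaultdict(list)  # Returns(s): every return observed from state s
    for episode in episodes:
        g = 0.0
        # Walk the episode backwards: G <- gamma * G + R_{t+1}
        for state, reward in reversed(episode):
            g = gamma * g + reward
            returns[state].append(g)
    # V(s) <- average(Returns(s)), computed once after all episodes
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


# Hypothetical episode: states "a" and "b" alternate, rewards as in R above.
episode = [("a", 3), ("b", 4), ("a", 7), ("b", 1), ("a", 2), ("b", 0)]
V = every_visit_mc([episode])
print(V)
```

The pseudocode refreshes $V(S_t)$ after every appended return, while this sketch averages once at the end; the final estimates are the same, only the intermediate values are skipped.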


Incremental update <br>
$NewEstimate \leftarrow OldEstimate + StepSize[Target-OldEstimate]$
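
With $StepSize = 1/n$, where $n$ counts the targets seen so far, this rule reproduces the running average without storing the whole $Returns(s)$ list. A minimal sketch, assuming the targets are the returns printed by the cell above; the helper name `incremental_mean` is hypothetical:

```python
def incremental_mean(targets):
    """Running average via NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)."""
    estimate = 0.0
    for n, target in enumerate(targets, start=1):
        step_size = 1.0 / n  # 1/n makes the estimate equal the sample mean so far
        estimate = estimate + step_size * (target - estimate)
    return estimate


G = [7.0, 8.0, 8.0, 2.0, 2.0, 0.0]
# The two numbers should agree up to floating-point rounding.
print(incremental_mean(G), sum(G) / len(G))
```

Replacing `1.0 / n` with a constant step size would instead weight recent targets more heavily than old ones.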