In [1]:
%load_ext autoreload
%autoreload 2

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from numpy.linalg import pinv
import pandas as pd

np.set_printoptions(precision=4, suppress=True)

import mdpy as mdp



# Overview

We provide code for a few different ways of estimating the variance and second moment of cumulants in a finite MDP.

We apply this code to analyzing some cumulants in particular: the reward, the TD-error ($\delta$), and the squared TD-error ($\delta^2$).

We show that in the setting where the estimate of the value function is exact ($\hat{v} = v_{\pi}$), the $\delta^2$-return (sum of discounted squared TD-errors) gives the variance (and, trivially, the mean squared error).

We also show that the second moment of the $\delta$-return (discounted sum of TD-errors) gives the mean squared error even in settings where the approximate value function is perturbed from the true value function.

# Problem Setup

Here we define an MDP and solve it analytically, computing both the expected return (i.e., the value function `v_pi`) and its variance (`v_var`, using Sobel's approach) for each state.
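
For reference, the two linear systems solved in the next cell can be written out as follows. This is a restatement of the code rather than a quotation from Sobel; $\Gamma = \operatorname{diag}(\gamma)$ corresponds to `G`, and the discount is applied according to the successor state $s'$, as in the code.

$$
r(s) = \sum_{s'} P(s,s')\, R(s,s'), \qquad v_{\pi} = (I - P\Gamma)^{-1} r
$$

$$
T(s) = \sum_{s'} P(s,s') \bigl( R(s,s') + \gamma(s')\, v_{\pi}(s') \bigr)^2 - v_{\pi}(s)^2, \qquad \operatorname{Var}[G_t \mid S_t = s] = \bigl[ (I - P\Gamma^2)^{-1} T \bigr](s)
$$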

In [2]:
# MDP solved analytically
ns = 6
I = np.eye(ns)

# Probability of transitioning from state s_i --> s_j = P[i,j]
P = np.diag(np.ones(ns-1), 1) * 0.5
P[:,0] = 0.5
P[-1, 0] = 1

# Expected reward for transitioning from s_i --> s_j = R[i,j]
R = np.zeros((ns, ns))
# -1 Reward for non-terminal transitions
R[:,:] = -1
# Reaching edge has zero reward
R[-2, -1] = 0
# Transitions from terminal state have zero reward
R[-1,:] = 0
r = np.sum(P*R, axis=1)

# State-dependent discount
gvec = np.ones(ns)*0.9
gvec[0] = 0
G = np.diag(gvec)

# State-dependent bootstrapping
lvec = np.ones(ns)*0.0
L = np.diag(lvec)

# Value function (expected Monte Carlo return)
v_pi = pinv(I - P @ G) @ r

# Compute stationary distribution for transition matrix
d_pi = mdp.stationary(P)
D = np.diag(d_pi)


# From Sobel, setting up variance Bellman equation
T = -v_pi**2
for i in range(ns):
    for j in range(ns):
        T[i] += P[i,j] * (R[i,j] + gvec[j]*v_pi[j])**2

# Alternatively,
# T = np.sum(P * (R + G @ v_pi)**2, axis=1) - v_pi**2
        
# Solve Bellman equation for variance of return
v_var = pinv(I - P @ G @ G) @ T 

# print(T)
print('v_pi:\n', v_pi)
print('per-state variance:\n', v_var)

v_pi:
 [-1.7641 -1.6981 -1.5513 -1.225  -0.5    -0.    ]
per-state variance:
 [ 0.8412  0.6353  0.3654  0.1519  0.25    0.    ]
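
The call to `mdp.stationary` above relies on the `mdpy` package. For readers without it, here is a minimal sketch of one common way to get a stationary distribution with plain NumPy, using the left eigenvector for the eigenvalue closest to 1. The helper name `stationary_distribution` is hypothetical, and this is not necessarily how `mdpy` computes it; it assumes the chain has a unique stationary distribution.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution d of a row-stochastic matrix P (d @ P = d, sum(d) = 1).

    Hypothetical helper; assumes the stationary distribution is unique.
    """
    # Left eigenvectors of P are right eigenvectors of P.T
    eigvals, eigvecs = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(eigvals - 1.0))
    d = np.real(eigvecs[:, idx])
    # Normalizing by the sum also fixes the arbitrary sign of the eigenvector
    return d / d.sum()

# Same transition matrix as in the cell above
ns = 6
P = np.diag(np.ones(ns - 1), 1) * 0.5
P[:, 0] = 0.5
P[-1, 0] = 1

d = stationary_distribution(P)
print(d)
print(np.allclose(d @ P, d))
```

For this chain each state is visited half as often as the one before it, so the entries of `d` halve from one state to the next.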


# Trying It Out Empirically

Empirically checking via simulation is useful for finding errors.
The code for simulating the MDP is somewhat inefficient but simple to write and debug.

We use the same MDP as above. 

Note that the use of state-dependent γ allows for an effectively episodic problem without requiring us to actually break the trajectory into episodes.

The trajectory is kept as a list of dictionaries that record all relevant data for each time step.
We then compute additional quantities, the *return* and the *squared return*, taking the expression for the latter from the VTD paper (White & White).

In [3]:
def compute_return(history):
    ret = []
    g = 0
    for step in reversed(history):
        g *= step['gm']
        g += step['r']
        ret.append({'g': g, **step})
    return list(reversed(ret))

def compute_squared_return(history):
    ret = []
    g_sq = 0
    g_next = 0
    for step in reversed(history):
        g_sq *= step['gm']**2
        g_sq += step['r']**2 + 2*step['gm']*step['r']*g_next
        ret.append({'g_sq': g_sq, **step})
        g_next = step['g']
    return list(reversed(ret))
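
As a quick check of the two recursions, the sketch below runs them on a small hand-written trajectory (the `toy` history is made up for illustration). Along a single trajectory the squared-return recursion should simply reproduce the square of the return, since $G_t^2 = R_{t+1}^2 + 2\gamma_{t+1} R_{t+1} G_{t+1} + \gamma_{t+1}^2 G_{t+1}^2$. The two functions are repeated from the cell above so that the block runs on its own.

```python
def compute_return(history):
    ret = []
    g = 0
    for step in reversed(history):
        g *= step['gm']
        g += step['r']
        ret.append({'g': g, **step})
    return list(reversed(ret))

def compute_squared_return(history):
    ret = []
    g_sq = 0
    g_next = 0
    for step in reversed(history):
        g_sq *= step['gm']**2
        g_sq += step['r']**2 + 2*step['gm']*step['r']*g_next
        ret.append({'g_sq': g_sq, **step})
        g_next = step['g']
    return list(reversed(ret))

# Made-up trajectory; the return is truncated at the end of the recorded history
toy = [
    {'r': -1.0, 'gm': 0.9},
    {'r': -1.0, 'gm': 0.9},
    {'r': -1.0, 'gm': 0.0},  # zero discount: the "episode" ends here
    {'r': -1.0, 'gm': 0.9},
]
toy = compute_squared_return(compute_return(toy))

for step in toy:
    # The last two columns should agree up to floating-point error
    print(step['g'], step['g_sq'], step['g']**2)
```

Note that `compute_squared_return` reads `step['g']`, so it has to be called on a history that `compute_return` has already augmented.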

In [4]:
# Simulate our MDP
num_steps = 10000

states = np.arange(ns)
features = [i for i in np.eye(ns)]

# Initial state
s0  = 0
s   = s0
x   = features[s]

history = []
for i in range(num_steps):
    p_next = x @ P
    sp = np.random.choice(states, p=p_next)
    xp = features[sp]
    gm = gvec[sp]
    reward = R[s, sp]
    
    history.append({'s': s, 'sp': sp, 'gm': gm, 'r': reward})
    
    # Next iteration
    s = sp
    x = xp.copy()
    
# Augment the history with the return at each timestep
history = compute_return(history)
history = compute_squared_return(history)

# Convert to pandas dataframe
df = pd.DataFrame(history)

In [5]:
df.head()

Unnamed: 0,g,g_sq,gm,r,s,sp
0,-3.439,11.826721,0.9,-1.0,0,1
1,-2.71,7.3441,0.9,-1.0,1,2
2,-1.9,3.61,0.9,-1.0,2,3
3,-1.0,1.0,0.0,-1.0,3,0
4,-2.71,7.3441,0.9,-1.0,0,1


In [6]:
# Check expected squared return
grouped = df.groupby('s')
g_sq = np.array(grouped.aggregate({'g_sq': np.mean}))

# Display results
grouped.aggregate({'g':np.mean, 'g_sq': np.mean}).T

s,0,1,2,3,4,5
g_sq,4.002009,3.509261,2.672358,1.616077,0.477419,0.0
g,-1.778378,-1.701001,-1.522674,-1.21244,-0.477419,0.0


In [7]:
# Variance via squared-return minus expected return squared
grouped = df.groupby('s')
a = np.array(grouped.aggregate({'g': np.mean})**2)
b = np.array(grouped.aggregate({'g_sq': np.mean}))
print("Variance via E[G^2] - E[G]^2:\n", np.ravel(b - a))
print("Analytical variance (Sobel):\n", v_var)

Variance via E[G^2] - E[G]^2:
 [ 0.8394  0.6159  0.3538  0.1461  0.2495  0.    ]
Analytical variance (Sobel):
 [ 0.8412  0.6353  0.3654  0.1519  0.25    0.    ]


In [8]:
# Compare with variance computed directly
print("Variance (via numpy.var):")
grouped = df.groupby('s')
grouped.aggregate({'g': np.var}).T

Variance (via numpy.var):


s,0,1,2,3,4,5
g,0.839548,0.616097,0.354093,0.146299,0.250298,0.0


This suggests that our formulas are accurate, although more simulation steps could further improve the agreement.

# Second Moment of the Return

An alternative way to compute the variance analytically is by defining and solving a Bellman equation for the second moment of the return and using the fact that $\operatorname{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$.

In *A Greedy Approach to Adapting the Trace Parameter for Temporal Difference Learning*, White & White provide just such a Bellman equation, which we use here.
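
Written out, the system solved in the next cell is the following. It is transcribed from the code rather than quoted from the paper, so the notation ($\bar{r}$, $\bar{P}$ for `rbar`, `Pbar`) follows the variable names used here.

$$
\bar{r}(s) = \sum_{s'} P(s,s') \bigl[ R(s,s')^2 + 2\gamma(s')\lambda(s')\, R(s,s')\, v_{\pi}(s') \bigr], \qquad \bar{P}(s,s') = P(s,s')\, \gamma(s')^2 \lambda(s')^2
$$

$$
\mathbb{E}[G_t^2 \mid S_t = s] = \bigl[ (I - \bar{P})^{-1} \bar{r} \bigr](s)
$$

For $\lambda = 1$ this follows from squaring $G_t = R_{t+1} + \gamma_{t+1} G_{t+1}$ and taking expectations, using $\mathbb{E}[G_{t+1} \mid S_{t+1} = s'] = v_{\pi}(s')$ for the cross-term.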

In [9]:
# Using the VTD paper to calculate second moments of the return.
# Note that here we are using the most accurate values for everything
# in order to check the equations.
Pbar = np.zeros((ns, ns))
Rbar = np.zeros((ns,ns))
rbar = np.zeros(ns)

# Specify parameters
lvec = np.ones(ns)

# Calculate R-bar transition matrix
for i in range(ns):
    for j in range(ns):
        Rbar[i,j] = R[i,j]**2 + 2*gvec[j]*lvec[j]*R[i,j]*v_pi[j]

# Calculate r-bar vector
for i in range(ns):
    for j in range(ns):
        rbar[i] += P[i,j]*Rbar[i,j]

# Calculate P-bar
for i in range(ns):
    for j in range(ns):
        Pbar[i,j] += P[i,j]*(gvec[j]**2)*(lvec[j]**2)
        
        
# Calculate second moment of return
r_second = pinv(I - Pbar) @ rbar

# Print the results
print("Second moment of return:\n", r_second)
print("Estimated variance via second moment of return:\n", r_second - v_pi**2)
print("Sobel variance:\n", v_var)


# An alternative approach, which is somewhat more concise
# Second moment of return
rr = (P*R**2) @ np.ones(ns) + (2*P @ G * R) @ v_pi
vv = pinv(I - P @ G @ G)@(rr)

Second moment of return:
 [ 3.9533  3.5187  2.7718  1.6525  0.5     0.    ]
Estimated variance via second moment of return:
 [ 0.8412  0.6353  0.3654  0.1519  0.25    0.    ]
Sobel variance:
 [ 0.8412  0.6353  0.3654  0.1519  0.25    0.    ]


## Takeaway

The two methods produce the same result, as expected.

## Implementation Notes

We calculated things as carefully as we could (and with as complete information as possible) to avoid mis-specifying anything.
Of note is that in a few cases, whether the expected reward *matrix* or the expected reward *vector* is used substantially affects the calculations.
So while there are less verbose ways to solve the problem, the explicit loops above are perhaps a little easier to examine and debug.

# Second Moment of δ-return

Using the equations in the White & White paper, we can calculate the second moment of the δ-return.
The δ-return is the discounted sum of future TD-errors, which we can define via: 

$$
G^{\delta, \lambda}_{t} = \sum_{n=0}^{\infty} \delta_{t+n} \prod_{k=1}^{n} \gamma_{t+k} \lambda_{t+k}
$$

Note that the δ-return also encodes the bias (the difference between the expected λ-return for each state and the approximate value function):

$$G^{\lambda}_{t} - \hat{v}(S_t) = \sum_{n=0}^{\infty} \delta_{t+n} \prod_{k=1}^{n} \gamma_{t+k} \lambda_{t+k}$$

This means that for $\lambda = 1$ we have a way of computing the bias with respect to the Monte Carlo return, by taking expectations:

$$
\mathbb{E}\left[ G^{\lambda=1}_{t} - \hat{v}(S_t) \mid S_t = s \right] = v_{\pi}(s) - \hat{v}(s)
$$

Disregarding the values of λ used in the approximation process for $\hat{v}$, we can choose alternative values to get the bias with respect to a particular λ-return.
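
The telescoping behind this identity can be checked on a single trajectory for $\lambda = 1$. The sketch below uses a made-up four-step trajectory and arbitrary value estimates (both hypothetical), with `gm` denoting the discount applied on entering the next state, as in the simulation code. The final step has zero discount, so the sums terminate exactly.

```python
import numpy as np

# Made-up trajectory; 'gm' is the discount applied on entering 'sp'
trajectory = [
    {'s': 0, 'sp': 1, 'r': -1.0, 'gm': 0.9},
    {'s': 1, 'sp': 2, 'r': -1.0, 'gm': 0.9},
    {'s': 2, 'sp': 3, 'r': -1.0, 'gm': 0.9},
    {'s': 3, 'sp': 0, 'r': -1.0, 'gm': 0.0},  # zero discount ends the episode
]
v_hat = np.array([-1.5, -2.0, -0.3, 0.7])  # arbitrary estimates, not a solution of anything

g, g_delta = 0.0, 0.0
pairs = []
for step in reversed(trajectory):
    # TD error for this transition
    delta = step['r'] + step['gm'] * v_hat[step['sp']] - v_hat[step['s']]
    # Monte Carlo return and δ-return, both computed backwards in time
    g = step['r'] + step['gm'] * g
    g_delta = delta + step['gm'] * g_delta
    pairs.append((g - v_hat[step['s']], g_delta))
pairs.reverse()

for mc_error, delta_return in pairs:
    # Both columns should agree up to floating-point error
    print(f"{mc_error: .4f}  {delta_return: .4f}")
```

The agreement holds for any `v_hat`, which is why the expected δ-return recovers the bias regardless of how the estimate was obtained.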

## Sanity Check: Exact Value Function

We check that our algorithm works by computing the second moment of the δ-return for when the approximate and 'true' value functions are identical.
The expected TD-error for each state should be zero, as should the bias (which should be equal to the δ-return).

The second moment of the δ-return may be nonzero.
In fact, it should be equal to both the $\delta^2$-return and the variance, and since the bias is zero, the mean squared-error as well.

In [10]:
# Approximate value function is identical to true value function
v_hat = v_pi

# Bias of approximate value function
bias = v_pi - v_hat

# TD error matrix, for error given transition i-->j
Δ = np.zeros_like(R)
for i in range(ns):
    for j in range(ns):
        Δ[i,j] = (R[i,j] + gvec[j]*v_hat[j] - v_hat[i])

# Expected TD-error
δ = (P*Δ) @ np.ones(ns)
        
# Expected δ^2
δ_sq = (P * Δ**2) @ np.ones(ns)
        
# δ-return
gd = pinv(I - P @ G) @ δ

# δ^2-return
gd_sq = pinv(I - P @ G @ G) @ δ_sq

# Second moment of δ-return
dd = (P * Δ**2) @ np.ones(ns) + (2*P @ G * Δ) @ gd
gd_second = pinv(I - P @ G @ G)@(dd)

print("v_π:\n", v_pi)
print("v_hat:\n", v_hat)
print("bias:\n", v_pi-v_hat)
print("δ-return:\n", gd)
print("δ^2-return:\n", gd_sq)
print("Second moment expected 'reward' (r-bar):\n", dd)
print("Second moment of delta-return:\n", gd_second)
print("Sobel variance:\n", v_var)
print("Expected squared error:\n", (v_pi - v_hat)**2 + v_var)

v_π:
 [-1.7641 -1.6981 -1.5513 -1.225  -0.5    -0.    ]
v_hat:
 [-1.7641 -1.6981 -1.5513 -1.225  -0.5    -0.    ]
bias:
 [ 0.  0.  0.  0.  0.  0.]
δ-return:
 [ 0. -0.  0.  0.  0.  0.]
δ^2-return:
 [ 0.8412  0.6353  0.3654  0.1519  0.25    0.    ]
Second moment expected 'reward' (r-bar):
 [ 0.5839  0.4873  0.3039  0.0506  0.25    0.    ]
Second moment of delta-return:
 [ 0.8412  0.6353  0.3654  0.1519  0.25    0.    ]
Sobel variance:
 [ 0.8412  0.6353  0.3654  0.1519  0.25    0.    ]
Expected squared error:
 [ 0.8412  0.6353  0.3654  0.1519  0.25    0.    ]


We get the expected results.

Incidentally, this provides an alternative route for proving the efficacy of algorithms estimating the $\delta^2$-return for computing variance in the tabular setting.

# Inaccurate Value Function

We now examine what happens to the $\delta$-return and its second moment (as well as the $\delta^2$-return) in the case where the value function is no longer exact.
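
The cells in this section repeat the same computation for different choices of `v_hat`. As a sketch, that computation could be wrapped in a helper such as the hypothetical `delta_return_moments` below (for $\lambda = 1$). It uses broadcasting in place of the explicit loops, and `np.linalg.solve` in place of `pinv`, on the assumption that the matrices involved are invertible, which holds for this MDP. The MDP is rebuilt here so that the block runs on its own.

```python
import numpy as np

def delta_return_moments(P, R, gvec, v_hat):
    """Moments of the δ-return for lambda = 1 (hypothetical helper, not part of mdpy)."""
    I = np.eye(len(gvec))
    PG = P * gvec        # PG[i, j] = P[i, j] * gvec[j], i.e. P @ diag(gvec)
    PGG = P * gvec**2    # P @ diag(gvec) @ diag(gvec)
    # TD error for each transition i --> j
    Delta = R + gvec * v_hat - v_hat[:, None]
    delta = np.sum(P * Delta, axis=1)             # expected TD error
    delta_sq = np.sum(P * Delta**2, axis=1)       # expected squared TD error
    gd = np.linalg.solve(I - PG, delta)           # expected δ-return
    gd_sq = np.linalg.solve(I - PGG, delta_sq)    # δ²-return
    dd = delta_sq + 2 * (PG * Delta) @ gd         # one-step term for the second moment
    gd_second = np.linalg.solve(I - PGG, dd)      # second moment of the δ-return
    return {'gd': gd, 'gd_sq': gd_sq, 'gd_second': gd_second}

# Same MDP as in the problem setup
ns = 6
P = np.diag(np.ones(ns - 1), 1) * 0.5
P[:, 0] = 0.5
P[-1, 0] = 1
R = -np.ones((ns, ns))
R[-2, -1] = 0
R[-1, :] = 0
gvec = np.ones(ns) * 0.9
gvec[0] = 0

I = np.eye(ns)
v_pi = np.linalg.solve(I - P * gvec, np.sum(P * R, axis=1))
T = np.sum(P * (R + gvec * v_pi)**2, axis=1) - v_pi**2
v_var = np.linalg.solve(I - P * gvec**2, T)

v_hat = v_pi + np.array([0, 0, 0.5, 0, 0, 0])
out = delta_return_moments(P, R, gvec, v_hat)
print("δ-return:\n", out['gd'])
print("δ^2-return:\n", out['gd_sq'])
print("Second moment of δ-return:\n", out['gd_second'])
print("Bias^2 + variance:\n", (v_pi - v_hat)**2 + v_var)
```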

## Large Perturbation, Single State

We first examine the impact of a relatively large perturbation to a single state's value function.

In [11]:
# Approximate value function
v_hat = v_pi + [0, 0.0, 0.5, 0, 0.0, 0]

# Bias of approximate value function
bias = v_pi - v_hat

# TD error matrix, for error given transition i-->j
Δ = np.zeros_like(R)
for i in range(ns):
    for j in range(ns):
        Δ[i,j] = (R[i,j] + gvec[j]*v_hat[j] - v_hat[i])

# Expected TD-error
δ = (P*Δ) @ np.ones(ns)
        
# Expected δ^2
δ_sq = (P * Δ**2) @ np.ones(ns)
        
# Second moment of the return (not the δ-return); vv is not used below
rr = (P*R**2) @ np.ones(ns) + (2*P @ G * R) @ v_pi
vv = pinv(I - P @ G @ G)@(rr)

# δ-return
gd = pinv(I - P @ G) @ δ

# δ^2-return
gd_sq = pinv(I - P @ G @ G) @ δ_sq

# Second moment
dd = (P * Δ**2) @ np.ones(ns) + (2*P @ G * Δ) @ gd
gd_second = pinv(I - P @ G @ G)@(dd)

print("v_π:\n", v_pi)
print("v_hat:\n", v_hat)
print("bias:\n", v_pi-v_hat)
print("δ-return:\n", gd)
print("δ^2-return:\n", gd_sq)
print("Second moment of delta-return:\n", gd_second)
print("Sobel variance:\n", v_var)
print("Bias^2:\n", bias**2)
print("Expected squared error:\n", (v_pi - v_hat)**2 + v_var)

v_π:
 [-1.7641 -1.6981 -1.5513 -1.225  -0.5    -0.    ]
v_hat:
 [-1.7641 -1.6981 -1.0513 -1.225  -0.5    -0.    ]
bias:
 [ 0.   0.  -0.5  0.   0.   0. ]
δ-return:
 [ 0.  -0.  -0.5  0.   0.   0. ]
δ^2-return:
 [ 0.796   0.5236  0.6154  0.1519  0.25    0.    ]
Second moment of delta-return:
 [ 0.8412  0.6353  0.6154  0.1519  0.25    0.    ]
Sobel variance:
 [ 0.8412  0.6353  0.3654  0.1519  0.25    0.    ]
Bias^2:
 [ 0.    0.    0.25  0.    0.    0.  ]
Expected squared error:
 [ 0.8412  0.6353  0.6154  0.1519  0.25    0.    ]


The bias is still given by the δ-return, and the $\delta^2$-return still accurately gives the variance for states that do not bootstrap from the state where the error was introduced.

As hypothesized, the second moment of the δ-return now gives the MSE rather than the variance.

The $\delta^2$-return for the state whose value was perturbed also gives the MSE (as do other states which do not bootstrap from it).
For the states that *do* bootstrap from the perturbed state, the $\delta^2$-return underestimates both the MSE and the variance (although it would *overestimate* if the sign of the perturbation were changed).

## Small Perturbation, Single State

In [12]:
# Approximate value function
v_hat = v_pi + [0, 0.0, 0.1, 0, 0.0, 0]

# Bias of approximate value function
bias = v_pi - v_hat

# TD error matrix, for error given transition i-->j
Δ = np.zeros_like(R)
for i in range(ns):
    for j in range(ns):
        Δ[i,j] = (R[i,j] + gvec[j]*v_hat[j] - v_hat[i])

# Expected TD-error
δ = (P*Δ) @ np.ones(ns)
        
# Expected δ^2
δ_sq = (P * Δ**2) @ np.ones(ns)
        
# Second moment of the return (not the δ-return); vv is not used below
rr = (P*R**2) @ np.ones(ns) + (2*P @ G * R) @ v_pi
vv = pinv(I - P @ G @ G)@(rr)

# δ-return
gd = pinv(I - P @ G) @ δ

# δ^2-return
gd_sq = pinv(I - P @ G @ G) @ δ_sq

# Second moment
dd = (P * Δ**2) @ np.ones(ns) + (2*P @ G * Δ) @ gd
gd_second = pinv(I - P @ G @ G)@(dd)

print("v_π:\n", v_pi)
print("v_hat:\n", v_hat)
print("bias:\n", v_pi-v_hat)
print("δ-return:\n", gd)
print("δ^2-return:\n", gd_sq)
print("Second moment of delta-return:\n", gd_second)
print("Sobel variance:\n", v_var)
print("Bias^2:\n", bias**2)
print("Expected squared error:\n", (v_pi - v_hat)**2 + v_var)

v_π:
 [-1.7641 -1.6981 -1.5513 -1.225  -0.5    -0.    ]
v_hat:
 [-1.7641 -1.6981 -1.4513 -1.225  -0.5    -0.    ]
bias:
 [ 0.   0.  -0.1  0.   0.   0. ]
δ-return:
 [ 0.   0.  -0.1  0.   0.   0. ]
δ^2-return:
 [ 0.819   0.5805  0.3754  0.1519  0.25    0.    ]
Second moment of delta-return:
 [ 0.8412  0.6353  0.3754  0.1519  0.25    0.    ]
Sobel variance:
 [ 0.8412  0.6353  0.3654  0.1519  0.25    0.    ]
Bias^2:
 [ 0.    0.    0.01  0.    0.    0.  ]
Expected squared error:
 [ 0.8412  0.6353  0.3754  0.1519  0.25    0.    ]


We see results similar to the large-perturbation case, but note that the difference between the $\delta^2$-return and the analytical variance actually seems to decrease the further 'away' we get from the perturbed state.

The second moment of the $\delta$-return continues to reflect the MSE.

In [13]:
gd_sq - v_var

array([-0.0222, -0.0547,  0.01  ,  0.    ,  0.    , -0.    ])

In [14]:
gd_second - ((v_pi - v_hat)**2 + v_var)

array([ 0., -0.,  0.,  0.,  0.,  0.])

## Small Perturbations, Multiple States

We now check a small uniform perturbation to the value function of all states.

In [15]:
# Approximate value function
v_hat = v_pi + [0.1, 0.1, 0.1, 0.1, 0.1, 0]

# Bias of approximate value function
bias = v_pi - v_hat

# TD error matrix, for error given transition i-->j
Δ = np.zeros_like(R)
for i in range(ns):
    for j in range(ns):
        Δ[i,j] = (R[i,j] + gvec[j]*v_hat[j] - v_hat[i])

# Expected TD-error
δ = (P*Δ) @ np.ones(ns)
        
# Expected δ^2
δ_sq = (P * Δ**2) @ np.ones(ns)
        
# Second moment of the return (not the δ-return); vv is not used below
rr = (P*R**2) @ np.ones(ns) + (2*P @ G * R) @ v_pi
vv = pinv(I - P @ G @ G)@(rr)

# δ-return
gd = pinv(I - P @ G) @ δ

# δ^2-return
gd_sq = pinv(I - P @ G @ G) @ δ_sq

# Second moment
dd = (P * Δ**2) @ np.ones(ns) + (2*P @ G * Δ) @ gd
gd_second = pinv(I - P @ G @ G)@(dd)

print("v_π:\n", v_pi)
print("v_hat:\n", v_hat)
print("bias:\n", v_pi-v_hat)
print("δ-return:\n", gd)
print("δ^2-return:\n", gd_sq)
print("Second moment of delta-return:\n", gd_second)
print("Sobel variance:\n", v_var)
print("Bias^2:\n", bias**2)
print("Expected squared error:\n", (v_pi - v_hat)**2 + v_var)

v_π:
 [-1.7641 -1.6981 -1.5513 -1.225  -0.5    -0.    ]
v_hat:
 [-1.6641 -1.5981 -1.4513 -1.125  -0.4    -0.    ]
bias:
 [-0.1 -0.1 -0.1 -0.1 -0.1  0. ]
δ-return:
 [-0.1 -0.1 -0.1 -0.1 -0.1  0. ]
δ^2-return:
 [ 0.746   0.5576  0.3163  0.1407  0.26    0.    ]
Second moment of delta-return:
 [ 0.8512  0.6453  0.3754  0.1619  0.26    0.    ]
Sobel variance:
 [ 0.8412  0.6353  0.3654  0.1519  0.25    0.    ]
Bias^2:
 [ 0.01  0.01  0.01  0.01  0.01  0.  ]
Expected squared error:
 [ 0.8512  0.6453  0.3754  0.1619  0.26    0.    ]


The MSE is still the same as the second moment of the $\delta$-return.

The errors in the $\delta^2$-return compound as perturbed states bootstrap from other perturbed states.
For the start state, the error in the $\delta^2$-return (about 0.095) is almost ten times the squared perturbation (0.01).
Note that the $\delta^2$-return is not accounting for the bias here; it is simply incorrect.

In [16]:
gd_sq - v_var

array([-0.0952, -0.0777, -0.0491, -0.0111,  0.01  , -0.    ])

In [17]:
gd_sq - ((v_pi - v_hat)**2 + v_var)

array([-0.1052, -0.0877, -0.0591, -0.0211,  0.    , -0.    ])

In [18]:
gd_second - ((v_pi - v_hat)**2 + v_var)

array([ 0., -0.,  0.,  0.,  0., -0.])

## Small Perturbations of Alternating Sign

We can check if having the perturbations be somewhat less correlated (a far more likely case in real-world settings) affects the various estimators under consideration.

In [19]:
# Approximate value function
v_hat = v_pi + [-0.1, 0.1, -0.1, 0.1, -0.1, 0]

# Bias of approximate value function
bias = v_pi - v_hat

# TD error matrix, for error given transition i-->j
Δ = np.zeros_like(R)
for i in range(ns):
    for j in range(ns):
        Δ[i,j] = (R[i,j] + gvec[j]*v_hat[j] - v_hat[i])

# Expected TD-error
δ = (P*Δ) @ np.ones(ns)
        
# Expected δ^2
δ_sq = (P * Δ**2) @ np.ones(ns)
        
# Second moment of the return (not the δ-return); vv is not used below
rr = (P*R**2) @ np.ones(ns) + (2*P @ G * R) @ v_pi
vv = pinv(I - P @ G @ G)@(rr)

# δ-return
gd = pinv(I - P @ G) @ δ

# δ^2-return
gd_sq = pinv(I - P @ G @ G) @ δ_sq

# Second moment
dd = (P * Δ**2) @ np.ones(ns) + (2*P @ G * Δ) @ gd
gd_second = pinv(I - P @ G @ G)@(dd)

print("v_π:\n", v_pi)
print("v_hat:\n", v_hat)
print("bias:\n", v_pi-v_hat)
print("δ-return:\n", gd)
print("δ^2-return:\n", gd_sq)
print("Second moment of delta-return:\n", gd_second)
print("Sobel variance:\n", v_var)
print("Bias^2:\n", bias**2)
print("Expected squared error:\n", (v_pi - v_hat)**2 + v_var)

v_π:
 [-1.7641 -1.6981 -1.5513 -1.225  -0.5    -0.    ]
v_hat:
 [-1.8641 -1.5981 -1.6513 -1.125  -0.6    -0.    ]
bias:
 [ 0.1 -0.1  0.1 -0.1  0.1  0. ]
δ-return:
 [ 0.1 -0.1  0.1 -0.1  0.1  0. ]
δ^2-return:
 [ 0.829   0.7182  0.358   0.1992  0.26    0.    ]
Second moment of delta-return:
 [ 0.8512  0.6453  0.3754  0.1619  0.26    0.    ]
Sobel variance:
 [ 0.8412  0.6353  0.3654  0.1519  0.25    0.    ]
Bias^2:
 [ 0.01  0.01  0.01  0.01  0.01  0.  ]
Expected squared error:
 [ 0.8512  0.6453  0.3754  0.1619  0.26    0.    ]


The MSE is still the same as the second moment of the $\delta$-return.

For the $\delta^2$-return, the results are somewhat more mixed.
The errors cancel in some places (state 3) and compound in others when we compare it with the variance.

Overall it doesn't really seem to improve things, although a more carefully crafted set of perturbations could likely be found. One strategy might be to improve feature selection to ensure that as many of the errors cancel each other out as possible.

However, in general this suggests that some method of accounting for the cross-terms in the expansion is desirable, if we want to continue to use the $\delta^2$-return in lieu of the approximation morass that is estimating the $\delta$-return online under function approximation.
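
To make the role of the cross-terms concrete, the sketch below splits the second moment of the $\delta$-return into the $\delta^2$-return plus a discounted sum of cross-terms, using the alternating-sign perturbation from this section. The names `sq_term`, `cross_term`, and `cross_return` are introduced here for illustration only, and `np.linalg.solve` stands in for `pinv` on the assumption that the matrices are invertible (true for this MDP). Because the second-moment equation is linear in its one-step term, the two pieces should add up exactly.

```python
import numpy as np

# Same MDP as in the problem setup
ns = 6
I = np.eye(ns)
P = np.diag(np.ones(ns - 1), 1) * 0.5
P[:, 0] = 0.5
P[-1, 0] = 1
R = -np.ones((ns, ns))
R[-2, -1] = 0
R[-1, :] = 0
gvec = np.ones(ns) * 0.9
gvec[0] = 0
PG, PGG = P * gvec, P * gvec**2   # P @ diag(gvec) and P @ diag(gvec)**2

v_pi = np.linalg.solve(I - PG, np.sum(P * R, axis=1))
v_hat = v_pi + np.array([-0.1, 0.1, -0.1, 0.1, -0.1, 0.0])

# TD error per transition and expected δ-return
Delta = R + gvec * v_hat - v_hat[:, None]
gd = np.linalg.solve(I - PG, np.sum(P * Delta, axis=1))

# One-step pieces of the second-moment equation
sq_term = np.sum(P * Delta**2, axis=1)   # expected squared TD error
cross_term = 2 * (PG * Delta) @ gd       # 2 * E[gamma' * delta * (expected δ-return at s')]

gd_sq = np.linalg.solve(I - PGG, sq_term)            # δ²-return
cross_return = np.linalg.solve(I - PGG, cross_term)  # discounted sum of cross-terms
gd_second = np.linalg.solve(I - PGG, sq_term + cross_term)

print("δ^2-return:\n", gd_sq)
print("Cross-term return:\n", cross_return)
print("Sum:\n", gd_sq + cross_return)
print("Second moment of δ-return:\n", gd_second)
```

In this tabular setting the cross-term return is exactly what the $\delta^2$-return is missing relative to the MSE.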

In [20]:
gd_sq - v_var

array([-0.0122,  0.0829, -0.0074,  0.0474,  0.01  ,  0.    ])

In [21]:
gd_sq - ((v_pi - v_hat)**2 + v_var)

array([-0.0222,  0.0729, -0.0174,  0.0374,  0.    ,  0.    ])

In [22]:
gd_second - ((v_pi - v_hat)**2 + v_var)

array([ 0., -0.,  0.,  0.,  0., -0.])

# Closing Notes

While the above may seem appealing, there is still work to be done to get things into a more useful form.

At the very least, it's a way of directly computing the MSE for a given value function, assuming the full MDP is known.
This could be interesting to pursue, although I am not sure how practical the results would be.

Optimizing the bias-variance trade-off via finding the minimum of the MSE using the second moment of the $\delta$-return looks interesting, but requires more analysis to derive the appropriate equations.

We note that under function approximation, the fixed point for the weights when approximating the $\delta$-return is zero, meaning a different strategy might have to be employed if we were to actually use this.
I can think of a few strategies for overcoming this (using multiple representations, or splitting the representations to make it possible to estimate the $\delta$-return under LFA, perhaps a re-weighting method in the style of Emphatic TD, or choosing a more approximable target that still imparts useful information by modifying the bootstrapping/discount factors).

Suggestions or comments are welcome.

# Variance of δ-return

Given the popularity of the movie *Inception*, it behooves us to go deeper.

On a more serious note, I am not really sure how to interpret this but kept it in anyways because I wish to avoid having to redo it if it turns out to be relevant.

In [23]:
# Approximate value function
# v_hat = v_pi + 0.1*np.arange(len(v_hat))
v_hat = v_pi + [0, -0.1, 0.5, 0, 0.0, 0]


# TD error matrix, for error given transition i-->j
Δ = np.zeros_like(R)
for i in range(ns):
    for j in range(ns):
        Δ[i,j] = (R[i,j] + gvec[j]*v_hat[j] - v_hat[i])

# Expected per-state TD-error
δ = (P*Δ) @ np.ones(ns)
        
# Sobel-like method for variance of δ-return 
T_d = -δ**2
for i in range(ns):
    for j in range(ns):
        T_d[i] += P[i,j] * (Δ[i,j] + gvec[j]*δ[j])**2

# Alternatively,
# T_d = np.sum(P * (Δ + G @ δ)**2, axis=1) - δ**2

# Calculating variance of δ-return
dd = pinv(I - P @ G @ G) @ T_d

print(v_var)
print(dd)
print(v_var - dd)

[ 0.8412  0.6353  0.3654  0.1519  0.25    0.    ]
[ 0.6662  0.5396  0.3654  0.1519  0.25    0.    ]
[ 0.175   0.0956 -0.     -0.     -0.     -0.    ]
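
One possible reading, offered as a sketch rather than as the intended computation: Sobel's equation bootstraps from the expected *return* $v_{\pi}$, so its direct analogue for the δ-return would bootstrap from the expected δ-return (`gd` in earlier cells) rather than from the expected one-step TD-error `δ` used in the cell above. Under that substitution, and for $\lambda = 1$, the result should coincide with the variance of the return, because given the start state the δ-return differs from the return only by the constant $\hat{v}(S_t)$. The name `var_gd` is introduced here for illustration, and `np.linalg.solve` replaces `pinv` on the assumption that the matrices are invertible.

```python
import numpy as np

# Same MDP as in the problem setup
ns = 6
I = np.eye(ns)
P = np.diag(np.ones(ns - 1), 1) * 0.5
P[:, 0] = 0.5
P[-1, 0] = 1
R = -np.ones((ns, ns))
R[-2, -1] = 0
R[-1, :] = 0
gvec = np.ones(ns) * 0.9
gvec[0] = 0
PG, PGG = P * gvec, P * gvec**2

# Value function and Sobel variance of the return
v_pi = np.linalg.solve(I - PG, np.sum(P * R, axis=1))
T = np.sum(P * (R + gvec * v_pi)**2, axis=1) - v_pi**2
v_var = np.linalg.solve(I - PGG, T)

# Same perturbation as in the cell above
v_hat = v_pi + np.array([0, -0.1, 0.5, 0, 0.0, 0])
Delta = R + gvec * v_hat - v_hat[:, None]
gd = np.linalg.solve(I - PG, np.sum(P * Delta, axis=1))  # expected δ-return

# Sobel-style operator, bootstrapping from the expected δ-return
T_d = np.sum(P * (Delta + gvec * gd)**2, axis=1) - gd**2
var_gd = np.linalg.solve(I - PGG, T_d)

print("Variance of return:\n", v_var)
print("Variance of δ-return (bootstrapping from gd):\n", var_gd)
print("Difference:\n", v_var - var_gd)
```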
