Suppose the current value estimates for two states in an episodic setting are $V(A) = 1.0$ and $V(B) = 1.0$. We observe the trajectory A, 0, B, 1, B, 0, T, where T is the terminal state and each number is the reward received on the transition between the states on either side of it. Apply TD(0) with step size $\alpha = 1$ and discount factor $\gamma = 0.5$. What are the value estimates for states A and B at the end of the episode?
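
For reference, the update being applied can be written as follows, assuming the standard tabular form of TD(0) and the usual convention that the terminal state has value zero (the symbols $S_t$ and $R_{t+1}$ are notation introduced here, not part of the question):

$$
V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right], \qquad V(T) = 0
$$

With $\alpha = 1$ the old estimate cancels, so each update simply overwrites $V(S_t)$ with the target $R_{t+1} + \gamma V(S_{t+1})$.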

In [3]:
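from typing import Optional

def update_v(state: str, reward: float, next_state: Optional[str], values: dict, alpha=1.0, gamma=0.5):
    """Apply one TD(0) update to values[state] in place and return the same dict."""
    # next_state=None marks the terminal state, whose value is 0 by definition.
    next_value = 0 if next_state is None else values[next_state]
    # TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
    values[state] = values[state] + alpha * (reward + gamma * next_value - values[state])
    return values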
V = {"A":1, "B":1}
print(f"Initial values {V}")
print(f'Transition A, 0, B -> {update_v("A",0,"B",V)}')
print(f'Transition B, 1, B -> {update_v("B",1,"B",V)}')
print(f'Transition B, 0, T -> {update_v("B",0,None,V)}')

Initial values {'A': 1, 'B': 1}
Transition A, 0, B -> {'A': 0.5, 'B': 1}
Transition B, 1, B -> {'A': 0.5, 'B': 1.5}
Transition B, 0, T -> {'A': 0.5, 'B': 0.0}
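
The printed values can be checked by hand. The three updates, written out with $\alpha = 1$, $\gamma = 0.5$ and the terminal value taken as $0$, are:

$$
\begin{aligned}
V(A) &\leftarrow 1 + 1 \cdot (0 + 0.5 \cdot 1 - 1) = 0.5 \\
V(B) &\leftarrow 1 + 1 \cdot (1 + 0.5 \cdot 1 - 1) = 1.5 \\
V(B) &\leftarrow 1.5 + 1 \cdot (0 + 0.5 \cdot 0 - 1.5) = 0
\end{aligned}
$$

So at the end of the episode $V(A) = 0.5$ and $V(B) = 0$. Note that $V(A)$ was computed from the old $V(B) = 1$ and is not revisited after $V(B)$ changes.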


In [4]:
# Same trajectory, but undiscounted (gamma = 1) for comparison.
V = {"A":1, "B":1}
print(f"Initial values {V}")
print(f'Transition A, 0, B -> {update_v("A",0,"B",V, gamma=1)}')
print(f'Transition B, 1, B -> {update_v("B",1,"B",V, gamma=1)}')
print(f'Transition B, 0, T -> {update_v("B",0,None,V, gamma=1)}')

Initial values {'A': 1, 'B': 1}
Transition A, 0, B -> {'A': 1.0, 'B': 1}
Transition B, 1, B -> {'A': 1.0, 'B': 2.0}
Transition B, 0, T -> {'A': 1.0, 'B': 0.0}
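
The two cells above repeat the same three calls with a different `gamma`. As a minimal sketch, the episode could instead be stored as a list of transitions and replayed by a single helper. The names `td0_episode`, `Transition`, `episode` and `initial` are hypothetical and introduced only for illustration; unlike `update_v`, this version copies the dictionary so the initial estimates are not modified.

```python
from typing import Dict, List, Optional, Tuple

# One transition: (state, reward, next_state); next_state is None at termination.
Transition = Tuple[str, float, Optional[str]]


def td0_episode(
    transitions: List[Transition],
    values: Dict[str, float],
    alpha: float = 1.0,
    gamma: float = 0.5,
) -> Dict[str, float]:
    """Replay one episode with tabular TD(0) and return the updated estimates."""
    v = dict(values)  # work on a copy so the caller's estimates are untouched
    for state, reward, next_state in transitions:
        # The terminal state contributes a value of 0 to the target.
        bootstrap = 0.0 if next_state is None else v[next_state]
        v[state] += alpha * (reward + gamma * bootstrap - v[state])
    return v


# The trajectory A, 0, B, 1, B, 0, T from the question.
episode = [("A", 0.0, "B"), ("B", 1.0, "B"), ("B", 0.0, None)]
initial = {"A": 1.0, "B": 1.0}

result_half = td0_episode(episode, initial, gamma=0.5)
result_one = td0_episode(episode, initial, gamma=1.0)
print(result_half)  # {'A': 0.5, 'B': 0.0}
print(result_one)   # {'A': 1.0, 'B': 0.0}
```

In both cases $V(B)$ ends at 0 because, with $\alpha = 1$, the final transition into the terminal state overwrites it with the target $0 + \gamma \cdot 0$; only $V(A)$ depends on the discount factor.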
