# Temporal Difference Prediction Practice with the FrozenLake-v1 Environment

Monte Carlo prediction waits until an episode has ended, computes the return for each visited state by discounting the subsequent rewards with the discount factor, and uses that return to update the value function. The episodes it uses must therefore terminate, that is, they must reach a terminal state. If an episode runs on indefinitely, Monte Carlo prediction is difficult to apply.
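
To make the return concrete, here is a minimal sketch of how the discounted return of every time step can be computed once an episode is over. The function name `discounted_returns` and the three-step reward sequence are hypothetical and chosen only for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return G_t for every time step t of one finished episode.

    rewards[t] is the reward received after the action taken at step t.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    # Walk backwards so that G_t = rewards[t] + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


# Hypothetical 3-step episode that only pays a reward on its last step.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))
```

The backward pass is only possible after the last reward is known, which is why the method needs episodes that end.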

Dynamic Programming learns at every time step through full-width updates, but it needs a model of the environment.
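
As a rough illustration of what "full-width" means, the sketch below backs up every state over all of its possible outcomes, which is only possible because the transition model is given. The model `P`, its four states, and the function `full_width_sweep` are hypothetical and are not taken from FrozenLake.

```python
# Hypothetical model under a fixed policy: P[s] lists every possible
# (probability, next_state, reward) outcome. States 2 and 3 are terminal.
P = {
    0: [(0.5, 3, 0.0), (0.5, 1, 0.0)],
    1: [(0.5, 2, 1.0), (0.5, 3, 0.0)],
    2: [],
    3: [],
}


def full_width_sweep(V, P, gamma=1.0):
    """One synchronous sweep: each state averages over all its outcomes."""
    new_V = dict(V)
    for s, outcomes in P.items():
        if outcomes:  # terminal states keep a value of 0
            new_V[s] = sum(p * (r + gamma * V[ns]) for p, ns, r in outcomes)
    return new_V


V = {s: 0.0 for s in P}
for _ in range(10):
    V = full_width_sweep(V, P)
print(V[0], V[1])
```

A sample-based method has no access to `P` and has to rely on the single outcome it happens to observe at each step.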

TD learns at every time step and is model-free. Let's pick up the basic idea behind it and write some practice code.
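
For reference, the usual tabular forms of the two sample-based updates can be written side by side. The notation ($S_t$, $R_{t+1}$, return $G_t$, step size $\alpha$, discount factor $\gamma$) follows common textbook usage and is an assumption here, since the text above does not fix one.

$$
V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right] \qquad \text{(constant-}\alpha\text{ Monte Carlo)}
$$

$$
V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right] \qquad \text{(TD(0))}
$$

Rearranging the TD(0) rule gives the equivalent running-average form, which is how the practice code in this notebook writes the update:

$$
V(S_t) \leftarrow (1 - \alpha)\, V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) \right]
$$

The only difference between the two methods is the target: Monte Carlo uses the full return $G_t$, which is known only at the end of the episode, while TD(0) uses the one-step estimate $R_{t+1} + \gamma V(S_{t+1})$, which is available immediately.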

> TODO:
> Add an explanation of TD and the relevant equations

## Library Import

In [1]:
import gym
import numpy as np
from IPython.display import clear_output

## TD prediction for value function

In [2]:
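def TD_prediction(env, alpha=0.01, gamma=1):
    """Estimate V for the uniformly random policy with tabular TD(0).

    Written against the classic gym API, where reset() returns the state
    and step() returns four values.
    """
    # observation_space.n is the number of states of a Discrete space
    V = np.zeros(env.observation_space.n)

    for i in range(20000):
        state = env.reset()
        done = False

        while not done:
            # Act with the policy being evaluated: uniformly random actions
            action = env.action_space.sample()
            next_state, reward, done, info = env.step(action)

            if done:
                # The episode is over, so there is nothing to bootstrap from
                target = reward
            else:
                target = reward + gamma * V[next_state]
            V[state] = (1 - alpha) * V[state] + alpha * target

            state = next_state

        # Report progress every 100 episodes
        if i % 100 == 0:
            clear_output(wait=True)
            print('Episode: {}'.format(i))

    return V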

In [3]:
env = gym.make('FrozenLake-v1', is_slippery=True)
env.reset()  # render() draws the current state, so reset first
env.render()

V = TD_prediction(env)

print(V)

Episode: 19900
[0.01380267 0.00994827 0.01716553 0.00927513 0.01884445 0.
 0.03528116 0.         0.03779247 0.08926983 0.1184477  0.
 0.         0.16465493 0.44839986 0.        ]
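
One way to gain confidence in the update rule is to run it on a problem whose answer is known. The sketch below applies the same TD(0) update to a small random walk instead of FrozenLake. The environment, the function `td0_random_walk`, and its default parameters are hypothetical. The closed-form values $k/6$ hold for this particular walk because they are the probabilities of reaching the right end from state $k$.

```python
import random


def td0_random_walk(episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """Tabular TD(0) on a hypothetical 5-state random walk.

    States 1..5 are non-terminal, states 0 and 6 are terminal. Every step
    moves left or right with equal probability, and only the transition
    into state 6 pays a reward of 1.
    """
    rng = random.Random(seed)
    V = [0.0] * 7  # the two terminal entries stay at 0
    for _ in range(episodes):
        state = 3  # always start in the middle
        while state not in (0, 6):
            next_state = state + rng.choice((-1, 1))
            reward = 1.0 if next_state == 6 else 0.0
            # TD(0) update in error form; V of a terminal state is 0
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            state = next_state
    return V[1:6]


estimates = td0_random_walk()
exact = [k / 6 for k in range(1, 6)]  # known values for this walk
for est, true in zip(estimates, exact):
    print('{:.3f} vs {:.3f}'.format(est, true))
```

With a constant step size the estimates keep fluctuating around the exact values, so only approximate agreement should be expected.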


## TD prediction for Q-function

In [4]:
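def TD_Q_prediction(env, alpha=0.01, gamma=1):
    """Estimate Q for the uniformly random policy with one-step TD updates.

    Written against the classic gym API, where reset() returns the state
    and step() returns four values.
    """
    # One row per state, one column per action
    Q = np.zeros([env.observation_space.n, env.action_space.n])

    for i in range(300000):
        state = env.reset()
        done = False
        action = env.action_space.sample()

        while not done:
            next_state, reward, done, info = env.step(action)
            # Action the random policy will take in next_state
            next_action = env.action_space.sample()

            if done:
                # The episode is over, so there is nothing to bootstrap from
                target = reward
            else:
                target = reward + gamma * Q[next_state, next_action]
            Q[state, action] = (1 - alpha) * Q[state, action] + alpha * target

            state = next_state
            action = next_action

        # Report progress every 100 episodes
        if i % 100 == 0:
            clear_output(wait=True)
            print('Episode: {}'.format(i))

    return Q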

In [5]:
env = gym.make('FrozenLake-v1', is_slippery=True)
env.reset()  # render() draws the current state, so reset first
env.render()

Q = TD_Q_prediction(env)

print(Q)

Episode: 299900
[[0.01497078 0.01406827 0.01479327 0.01393006]
 [0.00815489 0.01213798 0.01133971 0.01533416]
 [0.02339026 0.02216452 0.02309844 0.01403719]
 [0.01003015 0.01042715 0.00713207 0.01441648]
 [0.02332183 0.01793433 0.01662275 0.01093528]
 [0.         0.         0.         0.        ]
 [0.05278772 0.05683548 0.05410334 0.00779289]
 [0.         0.         0.         0.        ]
 [0.02013643 0.04879552 0.03472303 0.04483682]
 [0.07508223 0.11968451 0.10597446 0.05952977]
 [0.17184574 0.16970391 0.14894196 0.04384832]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.09582441 0.20528748 0.24034556 0.18396717]
 [0.26332687 0.53958625 0.51874199 0.47300494]
 [0.         0.         0.         0.        ]]
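
Both runs evaluate the same uniformly random policy, so the two tables should roughly satisfy $V(s) = \sum_a \pi(a \mid s)\, Q(s, a)$, which for a uniform policy is the mean of row $s$ of `Q`. The sketch below checks this for state 14 using the numbers printed in this notebook. The variable names are hypothetical, and because both tables are noisy estimates the match is only approximate.

```python
# Row 14 of the printed Q table and entry 14 of the printed V table.
q_row_14 = [0.26332687, 0.53958625, 0.51874199, 0.47300494]
v_14 = 0.44839986

# Under a uniformly random policy every action has probability 1 / nA,
# so the policy-weighted sum reduces to a plain average.
v_from_q = sum(q_row_14) / len(q_row_14)
print(v_from_q, v_14)
```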


## TODO
- TD($\lambda$) Prediction