### Baird
1. 문제:
2. 조건
    - state: 현재 node 위치
        - feature vector($\textbf{x}$): 현재 state를 $w_1,...w_8$을 이용해서 vector로 나타낸다.
    - action
        - SOLID: 무조건 아래 쪽 state(7번)로 이동한다.
        - DASH: 무작위로 위 쪽 state(1~6번)로 이동한다. 
    - reward: 모든 time step에서 0이다.
3. 학습 목표
    - Semi-gradient Off-policy TD
    - Semi-gradient DP

In [136]:
import numpy as np

In [137]:
STATES = np.arange(0, 7)
LOWER_STATE = 6
FEATURE_SIZE = 8
DASH = 0
SOLID = 1
BEHAVIOR_SOLID_PROBABILITY = 1.0 / 7

def target_policy():
    return SOLID

def behavior_policy():
    if np.random.binomial(1, BEHAVIOR_SOLID_PROBABILITY) == 1:
        return SOLID
    return DASH

class Baird:
    def __init__(self):
        self.state = 0
        
    def step(self, action):
        reward = 0
        if action == SOLID:
            self.state = LOWER_STATE
        else:
            self.state = np.random.choice(STATES[: LOWER_STATE])
        return self.state, reward            
    
    def reset(self):
        self.state = 0
        return self.state

In [154]:
class ValueFunction:
    def __init__(self, step_size):
        self.step_size = step_size
        self.weights = np.array([1, 1, 1, 1, 1, 1, 10, 1], dtype=np.float32)
        self.features = np.zeros([len(STATES), FEATURE_SIZE])
        for i in range(LOWER_STATE):
            self.features[i, i] = 2
            self.features[i, 7] = 1
        self.features[LOWER_STATE, 6] = 1
        self.features[LOWER_STATE, 7] = 2        
        
    def value(self, state):
        return self.features[state]@self.weights
    
    def learn(self, state, delta, rho):
        self.weights += self.features[state] * self.step_size * rho * delta        

In [155]:
num_steps = 1000
step_size = 0.01
env = Baird()
value_fn = ValueFunction(step_size)
policy_fn = behavior_policy

### Semi Gradient Off-policy TD(0)
importance sampling, value function approximation을 활용해서 off-policy 학습을 진행한다. 

식은 아래 2개 식을 참조해서 구현했고 사용한 value function approximator는 linear function으로 $v(S_t, \textbf{w}) = \textbf{x} \cdot \textbf{w}$, (*$\textbf{x}$는 feature vector이다.*)

$\delta_t = R_{t+1} + \gamma \hat v(S_{t+1}, \textbf{w}_t) - \hat v(S_t, \textbf{w})$ <br> 
$w_{t+1} = w_t + \alpha \rho_t \delta_t \nabla \hat v (S_t, w_t)$ 

In [156]:
def semi_gradient_off_policy_TD(env, num_steps, value_fn, behavior_policy_fn, 
                                discount_factor=0.95):
    state = env.reset()
    for i in range(num_steps):
        action = behavior_policy_fn()    
        reward, next_state = env.step(action)
        delta = reward + discount_factor * value_fn.value(next_state) - value_fn.value(state)
        # target policy에 대해서 일반화해서 구현 vs 이 문제에 대해서만 쉽게 구현
        if action == DASH:
            rho = 0.0
        else:
            rho = 1.0 / BEHAVIOR_SOLID_PROBABILITY
        value_fn.learn(state, delta, rho)
        state = next_state

In [157]:
semi_gradient_off_policy_TD(env, num_steps, value_fn, policy_fn)

In [158]:
value_fn.weights

array([ 45.0723381 ,   1.        ,   1.        ,   1.        ,
         1.        ,   1.        ,  10.        ,  23.03616714], dtype=float32)