# REINFORCE/pytorch

Some time ago I completed a similar reinforcement learning assignment, but for the CartPole task. There the reward function was better suited to the REINFORCE method.

Here the reward function has to be modified. Otherwise, in every session where the car does not reach the finish, the total reward equals $-n$, where $n$ is the maximum number of steps per session, which is inconvenient for training.
It is modified as $R_{new} = R + (\gamma \cdot \Phi(new\_state) - \Phi(state))$, following the paper: https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf. With this change to the reward function, the agent's optimal behavior does not change.
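
To see why shaping of this form leaves the comparison between policies intact, it may help to check numerically that the shaping terms telescope: along one trajectory the discounted shaped return differs from the original one only by $\gamma^T \Phi(s_T) - \Phi(s_0)$. The sketch below is a standalone illustration, not part of the notebook; `shaped_rewards`, `discounted_return`, and the random potential values are hypothetical, and it assumes the same $\gamma$ is used for shaping and for discounting.

```python
import random


def shaped_rewards(rewards, potentials, gamma):
    """Apply R_new = R + gamma * Phi(s') - Phi(s) along one trajectory.

    `potentials` holds Phi(s_0), ..., Phi(s_T), one entry more than `rewards`.
    """
    return [r + gamma * potentials[t + 1] - potentials[t]
            for t, r in enumerate(rewards)]


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


random.seed(0)
gamma = 0.95
T = 50
rewards = [-1.0] * T  # a constant per-step penalty of -1
# Made-up potentials, only to exercise the identity (assumption).
potentials = [100 * abs(random.uniform(-0.07, 0.07)) for _ in range(T + 1)]

original = discounted_return(rewards, gamma)
shaped = discounted_return(shaped_rewards(rewards, potentials, gamma), gamma)

# Only the two end-point potentials survive the telescoping sum.
gap = shaped - original
expected_gap = gamma ** T * potentials[T] - potentials[0]
```

Here `gap` and `expected_gap` agree up to floating-point error, whatever values the intermediate potentials take.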

From here on I build on the code written for CartPole, so I leave all the comments unchanged, in English.

In [1]:
import gym

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

env = gym.make("MountainCar-v0").env
env.reset()

array([-0.50997653,  0.        ])

# Building the network for REINFORCE

For the REINFORCE algorithm, we'll need a model that predicts action probabilities given states. Let's define such a model below.

In [2]:
import torch
import torch.nn as nn

In [3]:
model = nn.Sequential(
  nn.Linear(2,32),
  nn.ReLU(),
  nn.Linear(32,16),
  nn.ReLU(),
  nn.Linear(16,3)
)

#### Predict function

In [4]:
def predict_probs(states):
    """ 
    Predicts action probabilities given states.
    :param states: numpy array of shape [batch, state_shape]
    :returns: numpy array of shape [batch, n_actions]
    """
    logits = model(torch.as_tensor(states, dtype = torch.float32))
    return torch.softmax(logits, dim=1).detach().numpy()

### Play the game

As described above, we have to pick a reward function of the form $R_{new} = R + (\gamma \cdot \Phi(new\_state) - \Phi(state))$. With this change to the reward function, the agent's optimal behavior does not change.

We choose $\Phi(state) = 100 \cdot |velocity|$, that is, $R_{new} = R + 100 (\gamma \cdot |new\_velocity| - |velocity|)$:

In [5]:
import random

In [6]:
def generate_session(t_max=200):
    
    # arrays to record session
    states, actions, rewards = [], [], []
    s = env.reset()

    for t in range(t_max):
        # action probabilities array aka pi(a|s)
        action_probs = predict_probs(np.array([s]))[0]
  
        # Sample action with given probabilities.
        # `u` is a uniform draw; it is kept separate from the reward `r` below.
        u = random.random()
        if u>action_probs[1]+action_probs[0]:
            a = 2
        elif u>action_probs[0]:
            a = 1
        else:
            a = 0
        new_s, r, done, info = env.step(a)

        # record session history to train later
        states.append(s)
        actions.append(a)
        rewards.append(r + 100 *(0.95 * abs(new_s[1]) - abs(s[1])))

        s = new_s
        if done:
            break

    return states, actions, rewards
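
The threshold comparison used in `generate_session` to pick an action is a form of inverse-CDF sampling: draw a uniform number and return the first action whose cumulative probability covers it. A minimal standalone sketch for any number of actions follows; `sample_action` is a hypothetical helper, not part of the notebook, and it differs from the code above only on exact boundary ties.

```python
import random


def sample_action(probs, u):
    """Return the first index whose cumulative probability exceeds u."""
    cumulative = 0.0
    for action, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return action
    # Guard against rounding when u is very close to 1.
    return len(probs) - 1


probs = [0.2, 0.3, 0.5]
picked = [sample_action(probs, u) for u in (0.1, 0.25, 0.9)]

# In practice u comes from a uniform draw on [0, 1).
random_pick = sample_action(probs, random.random())
```

The standard library's `random.choices(range(len(probs)), weights=probs)` performs the same kind of weighted draw.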

In [7]:
# test it
states, actions, rewards = generate_session()

### Computing cumulative rewards

In [8]:
def get_cumulative_rewards(rewards,  # rewards at each step
                           gamma=0.95  # discount for reward
                           ):
    """Return discounted returns G_t = r_t + gamma * G_{t+1} for every step."""
    G = []
    total_reward = 0
    # iterate backwards without reversing the caller's list in place
    for r in reversed(rewards):
        total_reward = total_reward*gamma + r
        G.append(total_reward)
    G.reverse()
    return G
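
As a small worked example of the recursion $G_t = r_t + \gamma G_{t+1}$, the sketch below computes the returns for three unit rewards with $\gamma = 0.5$. The function name `discounted_returns` and the numbers are chosen only for illustration and are not part of the notebook.

```python
def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1}, walking the episode backwards."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()  # restore chronological order
    return returns


episode = [1.0, 1.0, 1.0]
example = discounted_returns(episode, gamma=0.5)
# Last step: 1.0; middle step: 1 + 0.5 * 1.0 = 1.5; first step: 1 + 0.5 * 1.5 = 1.75
```

The input list `episode` is left untouched, so it can still be summed afterwards to report the session reward.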

#### Loss function and updates

We now need to define the objective and the policy-gradient update.

Our objective function is

$$ J \approx  { 1 \over N } \sum  _{s_i,a_i} \pi_\theta (a_i | s_i) \cdot G(s_i,a_i) $$


Following the REINFORCE algorithm, we can define our objective as follows: 

$$ \hat J \approx { 1 \over N } \sum  _{s_i,a_i} \log \pi_\theta (a_i | s_i) \cdot G(s_i,a_i) $$
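
For reference, the reason the logarithm appears is the log-derivative identity. Treating the returns $G(s_i,a_i)$ as constants with respect to $\theta$ (an assumption that matches how they are computed from recorded rewards), differentiating the surrogate gives a sampled form of the policy gradient:

$$ \nabla_\theta \hat J \approx { 1 \over N } \sum _{s_i,a_i} \nabla_\theta \log \pi_\theta (a_i | s_i) \cdot G(s_i,a_i), \qquad \nabla_\theta \log \pi_\theta (a_i | s_i) = { \nabla_\theta \pi_\theta (a_i | s_i) \over \pi_\theta (a_i | s_i) } $$

Minimizing $-\hat J$ with an optimizer therefore amounts to gradient ascent on this estimate.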


In [9]:
def to_one_hot(y_tensor, ndims):
    """ helper: take an integer vector and convert it to 1-hot matrix. """
    y_tensor = y_tensor.type(torch.LongTensor).view(-1, 1)
    y_one_hot = torch.zeros(
        y_tensor.size()[0], ndims).scatter_(1, y_tensor, 1)
    return y_one_hot

In [10]:
#optimizer
optimizer = torch.optim.Adam(model.parameters(), 1e-2)


def train_on_session(states, actions, rewards, gamma=0.95):
    # stack the list of observations into one array before building the tensor
    states = torch.tensor(np.array(states), dtype=torch.float32)
    actions = torch.tensor(actions, dtype=torch.int32)
    cumulative_returns = np.array(get_cumulative_rewards(rewards, gamma))
    cumulative_returns = torch.tensor(cumulative_returns, dtype=torch.float32)

    # logits, probas and log-probas
    logits = model(states)
    probs = nn.functional.softmax(logits, -1)
    log_probs = nn.functional.log_softmax(logits, -1)

    # log-probabilities for chosen actions, log pi(a_i|s_i)
    log_probs_for_actions = torch.sum(
        log_probs * to_one_hot(actions, env.action_space.n), dim=1)
   
    # loss
    loss = -torch.mean(log_probs_for_actions * cumulative_returns)

    # Gradient descent step
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # session rewards to print them later
    return np.sum(rewards)

### The actual training

Even with the modified reward function, the model does not learn. I did not have time to figure out what the problem is :(
Just in case, I'll leave the .ipynb with my REINFORCE implementation for the CartPole task on GitHub.

In [11]:
for i in range(100):
    rewards = [train_on_session(*generate_session())
               for _ in range(100)]  # generate new sessions
    print("#%i mean reward:%.3f" % (i+1, np.mean(rewards)))

#1 mean reward:-204.343
#2 mean reward:-202.872
#3 mean reward:-202.702
#4 mean reward:-202.486
#5 mean reward:-202.439
#6 mean reward:-202.664
#7 mean reward:-202.650
#8 mean reward:-202.516
#9 mean reward:-202.434
#10 mean reward:-202.236
#11 mean reward:-202.436
#12 mean reward:-202.330
#13 mean reward:-202.465
#14 mean reward:-202.465
#15 mean reward:-202.310
#16 mean reward:-202.525
#17 mean reward:-202.621
#18 mean reward:-202.503
#19 mean reward:-202.497
#20 mean reward:-202.578
#21 mean reward:-202.516
#22 mean reward:-202.627
#23 mean reward:-202.253
#24 mean reward:-202.093
#25 mean reward:-202.169
#26 mean reward:-202.499
#27 mean reward:-202.282
#28 mean reward:-202.296
#29 mean reward:-202.527
#30 mean reward:-202.809
#31 mean reward:-202.501
#32 mean reward:-202.037
#33 mean reward:-202.398
#34 mean reward:-202.652
#35 mean reward:-202.460
#36 mean reward:-202.348
#37 mean reward:-202.546
#38 mean reward:-202.579


KeyboardInterrupt: 