<a href="https://colab.research.google.com/github/rootAkash/reinforcement_learning/blob/master/muzero/mu_0_with_prioritized_experience_replay_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Why prioritize?
# The MCTS root value is a better estimate of the value of a state. The difference between the n-step return and the MCTS root value
# therefore indicates how far the value function has converged to the actual value (with respect to the MCTS policy followed while generating the trajectory).
# The policy network only approximates the MCTS policy, which is used to bootstrap the value function.
# The difference between the search value (MCTS root) and the n-step return should be larger for states that have not been trained on properly or not seen yet,
# so we sample those states more often to get a better value function and faster convergence in environments with intermediate rewards (e.g. Atari).
# Sampling from the replay buffer with |root search value - n-step return| as the priority introduces a sampling bias.
# Sampling bias: a sample is collected in such a way that some members of the intended population have a lower or higher sampling probability than others.
# Intuitively, samples with a higher probability are drawn more often, which biases the model towards optimizing for them.
# This is equivalent to giving those samples a larger loss.
# To correct for this we use importance sampling: the loss of each sample is scaled by a weight that depends on its sampling probability.
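
The reasoning above can be made concrete with a small, dependency-free sketch. The numbers are made up for illustration, and the function name `priorities_and_weights` and its `alpha` and `beta` parameters are assumptions loosely modelled on the `get_priorities` helper in this notebook, not a definitive implementation.

```python
def priorities_and_weights(search_values, n_step_returns, alpha=1.0, beta=1.0):
    """Turn |search value - n-step return| gaps into sampling probabilities
    and importance-sampling weights (illustrative sketch)."""
    gaps = [abs(s - z) ** alpha for s, z in zip(search_values, n_step_returns)]
    total = sum(gaps)
    probs = [g / total for g in gaps]
    n = len(probs)
    # A uniform sampler would pick each position with probability 1/n,
    # so the correction factor is (1 / (n * p)) ** beta.
    weights = [(1.0 / (n * p)) ** beta for p in probs]
    return probs, weights


# Hypothetical values for a four-step trajectory.
search_values = [1.0, 2.0, 0.5, 3.0]
n_step_returns = [0.5, 2.5, 1.5, 1.0]
probs, weights = priorities_and_weights(search_values, n_step_returns)
print(probs)    # positions with a larger gap are sampled more often
print(weights)  # and receive a proportionally smaller loss weight
```

With `beta = 1` the product of sampling probability and weight is the same for every position, so in expectation each position contributes to the loss as it would under uniform sampling.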

In [13]:
!pip install gym[all]
!pip install box2d-py
!apt-get install python-opengl -y
!apt install xvfb -y

Collecting box2d-py~=2.3.5; extra == "all"
  Downloading https://files.pythonhosted.org/packages/87/34/da5393985c3ff9a76351df6127c275dcb5749ae0abbe8d5210f06d97405d/box2d_py-2.3.8-cp37-cp37m-manylinux1_x86_64.whl (448kB)
Collecting mujoco-py<2.0,>=1.50; extra == "all"
  Downloading https://files.pythonhosted.org/packages/cf/8c/64e0630b3d450244feef0688d90eab2448631e40ba6bdbd90a70b84898e7/mujoco-py-1.50.1.68.tar.gz (120kB)
Collecting glfw>=1.4.0
  Downloading https://files.pythonhosted.org/packages/13/d7/79c091c877493de7f8286ed62c77bf0f2c51105656073846b2326021b524/glfw-2.1.0-py2.py27.py3.py30.py31.py32.py33.py34.py35.py36.py37.py38-none-manylinux2014_x86_64.whl (205kB)
Collecting lockfile>=0.12.2
  Downloading https://files.pythonhosted.org/packages/c8/22/9460e311f340cb62d26a38c419b1381b8593b0bb6b

In [1]:
import numpy as np
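def stcat(x,support=5):
  """Scalar to categorical: squash x with sign(x)*(sqrt(|x|+1)-1) + 0.001*x,
  clip it to [-support, support], then split it over the two nearest integer bins."""
  x = np.sign(x) * ((abs(x) + 1)**0.5 - 1) + 0.001 * x
  x = np.clip(x, -support, support)
  floor = np.floor(x)
  prob = x - floor
  logits = np.zeros( 2 * support + 1)  # despite the name, this holds probabilities
  first_index = int(floor + support)
  second_index = int(floor + support+1)
  logits[first_index] = 1-prob
  if prob>0:
    logits[second_index] = prob
  return logits
def catts(x,support=5):
  """Categorical to scalar: expected bin value, followed by the inverse of the squashing transform."""
  support = np.arange(-support, support+1, 1)
  x = np.sum(support*x)
  x = np.sign(x) * ((((1 + 4 * 0.001 * (abs(x) + 1 + 0.001))**0.5 - 1) / (2 * 0.001))** 2- 1)
  return x  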

#cat = stcat(58705)
#print(cat)
#scalar = catts(cat)
#print(scalar)
print("done")        


done
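
To see what `stcat` and `catts` do to a number, the following sketch re-implements both with the standard library only and runs a round trip. The names `scalar_to_categorical` and `categorical_to_scalar` are hypothetical stand-ins for illustration; the formulas and the constant 0.001 are copied from the cell above.

```python
import math

EPS = 0.001  # same constant that stcat/catts hard-code


def scalar_to_categorical(x, support=5):
    """Pure-Python counterpart of stcat: squash x, then split it over two bins."""
    x = math.copysign(1.0, x) * (math.sqrt(abs(x) + 1) - 1) + EPS * x
    x = max(-support, min(support, x))
    low = math.floor(x)
    frac = x - low
    bins = [0.0] * (2 * support + 1)
    bins[low + support] = 1 - frac
    if frac > 0:
        bins[low + support + 1] = frac
    return bins


def categorical_to_scalar(bins, support=5):
    """Pure-Python counterpart of catts: expected bin value, then inverse squash."""
    x = sum(v * p for v, p in zip(range(-support, support + 1), bins))
    sign = (x > 0) - (x < 0)
    return sign * (((math.sqrt(1 + 4 * EPS * (abs(x) + 1 + EPS)) - 1) / (2 * EPS)) ** 2 - 1)


bins = scalar_to_categorical(10.0)
recovered = categorical_to_scalar(bins)          # close to 10.0
saturated = categorical_to_scalar(scalar_to_categorical(58705))
print(bins, recovered, saturated)
```

The round trip recovers 10.0 up to floating-point error, while a very large input such as the 58705 in the commented-out test is clipped to the top bin and comes back as roughly 34.6. With `support=5`, returns of larger magnitude than that cannot be represented.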


In [2]:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim




class MuZeroNet(nn.Module):
    def __init__(self, input_size, action_space_n, reward_support_size, value_support_size):
        super().__init__()
        self.hx_size = 32
        self._representation = nn.Sequential(nn.Linear(input_size, self.hx_size),
                                             nn.Tanh())
        self._dynamics_state = nn.Sequential(nn.Linear(self.hx_size + action_space_n, 64),
                                             nn.Tanh(),
                                             nn.Linear(64, self.hx_size),
                                             nn.Tanh())
        self._dynamics_reward = nn.Sequential(nn.Linear(self.hx_size + action_space_n, 64),
                                              nn.LeakyReLU(),
                                              nn.Linear(64, 2*reward_support_size+1))
        self._prediction_actor = nn.Sequential(nn.Linear(self.hx_size, 64),
                                               nn.LeakyReLU(),
                                               nn.Linear(64, action_space_n))
        self._prediction_value = nn.Sequential(nn.Linear(self.hx_size, 64),
                                               nn.LeakyReLU(),
                                               nn.Linear(64, 2*value_support_size+1))
        self.action_space_n = action_space_n

        self._prediction_value[-1].weight.data.fill_(0)
        self._prediction_value[-1].bias.data.fill_(0)
        self._dynamics_reward[-1].weight.data.fill_(0)
        self._dynamics_reward[-1].bias.data.fill_(0)

    def p(self, state):
        actor = torch.softmax(self._prediction_actor(state),dim=1)
        value = torch.softmax(self._prediction_value(state),dim=1)
        return actor, value

    def h(self, obs_history):
        return self._representation(obs_history)

    def g(self, state, action):
        x = torch.cat((state, action), dim=1)
        next_state = self._dynamics_state(x)
        reward = torch.softmax(self._dynamics_reward(x),dim=1)
        return next_state, reward     

    def initial_state(self, x):
        hout = self.h(x)
        prob,v= self.p(hout)
        return hout,prob,v
    def next_state(self,hin,a):
        hout,r = self.g(hin,a)
        prob,v= self.p(hout)
        return hout,r,prob,v
    def inference_initial_state(self, x):
        with torch.no_grad():
          hout = self.h(x)
          prob,v=self.p(hout)

          return hout,prob,v
    def inference_next_state(self,hin,a):
        with torch.no_grad():
          hout,r = self.g(hin,a)
          prob,v=self.p(hout)
          return hout,r,prob,v     


print("done")                                      

done
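
`MuZeroNet` zero-fills the weights and bias of the last layer of `_prediction_value` and `_dynamics_reward`. The sketch below, which uses only the standard library and a hand-written `softmax` helper (an assumption for illustration, not the PyTorch call), shows the consequence: the untrained heads output a uniform distribution whose expected bin value is zero.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a plain list."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


support = 5
# With zeroed weights and bias, the last linear layer outputs all-zero logits.
logits = [0.0] * (2 * support + 1)
probs = softmax(logits)
# Expected value over the symmetric support -5..5.
expected = sum(v * p for v, p in zip(range(-support, support + 1), probs))
print(probs[0], expected)
```

Because `catts` multiplies by `np.sign(x)`, an expected bin value of zero decodes to a scalar of zero, so the network starts out predicting zero reward and zero value everywhere.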


In [3]:

#MCTS for MuZero, modified for settings with intermediate rewards and using predicted rewards
#accepts policy as a list
import torch
import math
import numpy as np

import random
def dynamics(net,state,action):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    #print(state,action) 
    next_state,reward,prob,value = net.inference_next_state(state.to(device),torch.tensor([action]).float().to(device))
    reward = catts(reward.cpu().numpy().ravel())
    value = catts(value.cpu().numpy().ravel())
    prob = prob.cpu().tolist()[0]
    #print("dynamics",prob)
    return next_state.cpu(),reward,prob,value


class MinMaxStats:
    """A class that holds the min-max values of the tree."""

    def __init__(self):
        self.MAXIMUM_FLOAT_VALUE = float('inf')       
        self.maximum =  -self.MAXIMUM_FLOAT_VALUE
        self.minimum =  self.MAXIMUM_FLOAT_VALUE

    def update(self, value: float):
        if value is None:
            raise ValueError

        self.maximum = max(self.maximum, value)
        self.minimum = min(self.minimum, value)

    def normalize(self, value: float) -> float:
        # If the value is unknown (unvisited node), default to 0.0, the minimum normalized value
        if value is None:
            return 0.0

        if self.maximum > self.minimum:
            # We normalize only when we have set the maximum and minimum values.
            return (value - self.minimum) / (self.maximum - self.minimum)
        return value


class Node:
    """A class that represent nodes inside the MCTS tree"""

    def __init__(self, prior: float):
        self.visit_count = 0
        self.to_play = -1
        self.prior = prior
        self.value_sum = 0
        self.children = {}
        self.hidden_state = None
        self.reward = 0

    def expanded(self):
        return len(self.children) > 0

    def value(self):
        if self.visit_count == 0:
            return None
        return self.value_sum / self.visit_count


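def softmax_sample(visit_counts, actions, t):
    # Apply the temperature inside the exponent; multiplying after exp() cancels out in the normalization.
    counts = np.array(visit_counts, dtype=np.float64) / t
    counts_exp = np.exp(counts - np.max(counts))  # subtract the max for numerical stability
    probs = counts_exp / np.sum(counts_exp, axis=0)
    action_idx = np.random.choice(len(actions), p=probs)
    return actions[action_idx]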


"""MCTS module: where MuZero thinks inside the tree."""


def add_exploration_noise( node):
    """
    At the start of each search, we add dirichlet noise to the prior of the root
    to encourage the search to explore new actions.
    """
    actions = list(node.children.keys())
    noise = np.random.dirichlet([0.25] * len(actions)) # config.root_dirichlet_alpha
    frac = 0.25#config.root_exploration_fraction
    for a, n in zip(actions, noise):
        node.children[a].prior = node.children[a].prior * (1 - frac) + n * frac



def ucb_score(parent, child,min_max_stats):
    """
    The score for a node is based on its value, plus an exploration bonus based on
    the prior.

    """
    pb_c_base = 19652
    pb_c_init = 1.25
    pb_c = math.log((parent.visit_count + pb_c_base + 1) / pb_c_base) + pb_c_init
    pb_c *= math.sqrt(parent.visit_count) / (child.visit_count + 1)

    prior_score = pb_c * child.prior
    value_score = min_max_stats.normalize(child.value())
    return  value_score + prior_score 

def select_child(node, min_max_stats):
    """
    Select the child with the highest UCB score.
    """
    # When the parent visit count is zero, all UCB scores are zero, so we return a random child.
    if node.visit_count == 0:
        # random.sample() no longer accepts dict views in recent Python versions, so pick from a list.
        return random.choice(list(node.children.items()))

    _, action, child = max(
        (ucb_score(node, child, min_max_stats), action,
         child) for action, child in node.children.items())
    return action, child




def expand_node(node, to_play, actions_space,hidden_state,reward,policy):
    """
    We expand a node using the value, reward and policy prediction obtained from
    the neural networks.
    """
    node.to_play = to_play
    node.hidden_state = hidden_state
    node.reward = reward
    policy = {a:policy[a] for a in actions_space}
    policy_sum = sum(policy.values())
    for action, p in policy.items():
        node.children[action] = Node(p / policy_sum) # renormalizing is redundant because the priors already come from a softmax, but it is harmless


def backpropagate(search_path, value,to_play,discount, min_max_stats):
    """
    At the end of a simulation, we propagate the evaluation all the way up the
    tree to the root.
    """
    for node in search_path[::-1]: #[::-1] means reversed
        node.value_sum += value 
        node.visit_count += 1
        min_max_stats.update(node.value())

        value = node.reward + discount * value


def select_action(node, mode ='softmax'):
    """
    After running simulations inside MCTS, we select an action based on the root's children visit counts.
    During training we use a softmax sample for exploration.
    During evaluation we select the most visited child.
    """
    visit_counts = [child.visit_count for child in node.children.values()]
    actions = [action for action in node.children.keys()]
    action = None
    if mode == 'softmax':
        t = 1.0
        action = softmax_sample(visit_counts, actions, t)
    elif mode == 'max':
        action, _ = max(node.children.items(), key=lambda item: item[1].visit_count)
    # Return the chosen action, the normalized visit counts (policy target) and the root value.
    return action ,np.array(visit_counts)/sum(visit_counts),node.value()

def run_mcts(net, state,prob,root_value,num_simulations,discount = 0.9):
    """
    Core Monte Carlo Tree Search algorithm.
    To decide on an action, we run N simulations, always starting at the root of
    the search tree and traversing the tree according to the UCB formula until we
    reach a leaf node.
    """
    prob, root_value = prob.tolist()[0] ,catts(root_value.numpy().ravel())
    to_play = True
    action_space=[ i for i in range(len(prob))]#history.action_space()
    #print("action space",action_space)
    root = Node(0)
    expand_node(root, to_play,action_space,state,0.0,prob)#node, to_play, actions_space ,hidden_state,reward,policy
    add_exploration_noise( root)


    min_max_stats = MinMaxStats()

    for _ in range(num_simulations): 
        node = root
        search_path = [node]

        while node.expanded():
            action, node = select_child( node, min_max_stats)
            search_path.append(node)

        # Inside the search tree we use the dynamics function to obtain the next
        # hidden state given an action and the previous hidden state.
        parent = search_path[-2]
        
        #network_output = network.recurrent_inference(parent.hidden_state, action)
        next_state,r,action_probs, value = dynamics(net,parent.hidden_state,onehot(action,len(action_space))) 
        expand_node(node, to_play, action_space,next_state,r,action_probs)#node, to_play, actions_space ,hidden_state,reward,policy

        backpropagate(search_path, value, to_play, discount, min_max_stats)#search_path, value,,discount, min_max_stats
    return root    
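
Two pieces of arithmetic in the search are easy to check by hand: the exploration bonus in `ucb_score` and the value backup in `backpropagate`. The sketch below restates both without the `Node` class. The function names `puct_score` and `backup` and all input numbers are hypothetical, and the value score is assumed to be already normalized by `MinMaxStats`.

```python
import math


def puct_score(parent_visits, child_visits, prior, value_score,
               pb_c_base=19652, pb_c_init=1.25):
    """Exploration bonus plus (already normalized) value, as in ucb_score."""
    pb_c = math.log((parent_visits + pb_c_base + 1) / pb_c_base) + pb_c_init
    pb_c *= math.sqrt(parent_visits) / (child_visits + 1)
    return value_score + pb_c * prior


# Hypothetical root with 16 visits and two children of equal value and prior.
rarely_visited = puct_score(16, child_visits=1, prior=0.5, value_score=0.4)
often_visited = puct_score(16, child_visits=15, prior=0.5, value_score=0.4)


def backup(rewards, leaf_value, discount):
    """Values added to each node on the search path, listed from leaf up to root."""
    values = []
    value = leaf_value
    for reward in reversed(rewards):
        values.append(value)
        value = reward + discount * value
    return values


# Path root -> a -> b with node rewards [0.0, 1.0, 2.0] and a leaf value of 10.
path_values = backup([0.0, 1.0, 2.0], leaf_value=10.0, discount=0.5)
print(rarely_visited, often_visited, path_values)
```

With everything else equal, the rarely visited child scores higher, which is what pushes the search towards unexplored actions. In the backup, each node's own reward is folded in only after its value has been recorded, so the leaf receives the raw network value and its ancestors receive discounted reward-plus-value estimates.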


In [4]:
import gym
class ScalingObservationWrapper(gym.ObservationWrapper):
    """
    Wrapper that applies min-max scaling to observations.
    """

    def __init__(self, env, low=None, high=None):
        super().__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Box)

        low = np.array(self.observation_space.low if low is None else low)
        high = np.array(self.observation_space.high if high is None else high)

        self.mean = (high + low) / 2
        self.max = high - self.mean

    def observation(self, observation):
        return (observation - self.mean) / self.max
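
As a quick check of the scaling arithmetic, the sketch below applies the same `mean` and `max` formulas to plain lists. The helper name `make_scaler` is hypothetical; the bounds are the ones from the commented-out CartPole wrapper call in the training cell of this notebook.

```python
def make_scaler(low, high):
    """Return a function mapping each coordinate from [low, high] to [-1, 1]."""
    mean = [(h + l) / 2 for l, h in zip(low, high)]
    half_range = [h - m for h, m in zip(high, mean)]
    return lambda obs: [(o - m) / r for o, m, r in zip(obs, mean, half_range)]


# Explicit bounds, as one would pass via the wrapper's low/high arguments.
scale = make_scaler(low=[-2.4, -2.0, -0.42, -3.5], high=[2.4, 2.0, 0.42, 3.5])
at_upper = scale([2.4, 2.0, 0.42, 3.5])
at_centre = scale([0.0, 0.0, 0.0, 0.0])
at_lower = scale([-2.4, -2.0, -0.42, -3.5])
print(at_upper, at_centre, at_lower)
```

Passing explicit bounds matters when the environment's own `observation_space` reports very large or infinite limits, since dividing by such a range would squash every observation towards zero.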

In [6]:

import random
import numpy as np
import torch
from tqdm import tqdm
def onehot(a,n=2):
  return np.eye(n)[a]
def play_game(env,net,n_sim,discount,render,device,n_act,max_steps):
    trajectory=[]
    state = env.reset() 
    done = False
    r =0 
    stp=0
    while not done:
        if render:
          env.render()
        stp+=1  
        h ,prob,pred_value= net.inference_initial_state(torch.tensor([state]).float().to(device)) 
        root  = run_mcts(net,h.cpu(),prob.cpu(),pred_value.cpu(),num_simulations=n_sim,discount=discount)
        action,action_prob,mcts_val = select_action(root) 
        next_state, reward, done, info = env.step(action)
        r+=reward
        if stp>max_steps:
          done = True
        data = (state,onehot(action,n_act),action_prob,mcts_val,reward,pred_value.cpu())
        trajectory.append(data)
        state = next_state
    print("DATA collection:played for ",len(trajectory)," steps , rewards",r)   
    return trajectory    
def eval_game(env,net,n_sim,render,device,max_steps):
    state = env.reset() 
    done = False
    r = 0
    stp=0
    while not done:
        if render:
          env.render()
        stp+=1  
        h ,prob,value= net.inference_initial_state(torch.tensor([state]).float().to(device)) 
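        # NOTE: `discount` is not a parameter of eval_game; it is read from the module-level variable set in the training cell.
        root  = run_mcts(net,h.cpu(),prob.cpu(),value.cpu(),num_simulations=n_sim,discount=discount)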
        action,action_prob,mcts_val = select_action(root,"max")
        next_state, reward, done, info = env.step(action)
        if stp>max_steps:
          done = True
        r+=reward
        state = next_state
    print("Eval:played for ",r ," rewards")   
    
def sample_games(buffer,batch_size):
    # Sample games from the buffer uniformly, with replacement.
    return random.choices(buffer, k=batch_size)

def sample_position(trajectory,priority=None):
    # Sample a position from the game, either uniformly or according to the given priorities.
    if priority is None:
      return np.random.choice(len(trajectory),1)[0]
    return np.random.choice(len(trajectory),1,p = priority)[0]
def get_priorities(root_values,rewards,discount=0.99, td_steps=10):
    # Priority of a position = |MCTS root value - n-step return| ** alpha, normalized over the game.
    z_values = []
    alpha = 1
    beta = 1 
    for current_index in range(len(root_values)):
        bootstrap_index = current_index + td_steps
        if bootstrap_index < len(root_values):
            value = root_values[bootstrap_index] * discount ** td_steps
        else:
            value = 0

        for i, reward in enumerate(rewards[current_index:bootstrap_index]):
            value += reward * discount ** i

        z_values.append(value)
    # A small constant keeps every priority non-zero so the weights below stay finite.
    p = np.abs(np.array(root_values)-np.array(z_values))**alpha + 1e-6
    priority = p /np.sum(p)
    N = len(root_values)  # positions in this game; a uniform sampler would pick each with probability 1/N
    weights = (1/(N*priority))**beta  # importance-sampling weights
    return list(priority),list(weights)



def sample_batch(action_space_size,buffer,discount,batch_size,num_unroll_steps, td_steps,per):
    obs_batch, action_batch, reward_batch, value_batch, policy_batch,weights_batch = [], [], [], [], [],[]
    games = sample_games(buffer,batch_size)
    for g in games:
      state,action,action_prob,root_val,reward,pred_val = zip(*g)
      state,action,action_prob,root_val,reward,pred_val  =list(state),list(action),list(action_prob),list(root_val),list(reward),list(pred_val)
      if per:
        #make priority for sampling from root_value and n_step value
        priority,weights = get_priorities(root_val,reward,discount=discount, td_steps=td_steps)
        
        game_pos = sample_position(g,priority)#state index sampled using priority
      else:  
        weights = [1.0]*len(root_val)
        game_pos = sample_position(g)#state index sampled uniformly
      _actions = action[game_pos:game_pos + num_unroll_steps]
      # random action selection to complete num_unroll_steps
      _actions += [onehot(np.random.randint(0, action_space_size),action_space_size)for _ in range(num_unroll_steps - len(_actions))]

      obs_batch.append(state[game_pos])
      action_batch.append(_actions)
      value, reward, policy = make_target(child_visits=action_prob ,root_values=root_val,rewards=reward,state_index=game_pos,discount=discount, num_unroll_steps=num_unroll_steps, td_steps=td_steps)
      reward_batch.append(reward)
      value_batch.append(value)
      policy_batch.append(policy)
      weights_batch.append(weights[game_pos])



    obs_batch = torch.tensor(obs_batch).float()
    action_batch = torch.tensor(action_batch).long()
    reward_batch = torch.tensor(reward_batch).float()
    value_batch = torch.tensor(value_batch).float()
    policy_batch = torch.tensor(policy_batch).float()
    weights_batch = torch.tensor(weights_batch).float()
    return obs_batch, action_batch, reward_batch, value_batch, policy_batch,weights_batch


def make_target(child_visits,root_values,rewards,state_index,discount=0.99, num_unroll_steps=5, td_steps=10):
        # The value target is the discounted root value of the search tree N steps into the future, plus
        # the discounted sum of all rewards until then.
        target_values, target_rewards, target_policies = [], [], []
        for current_index in range(state_index, state_index + num_unroll_steps + 1):
            bootstrap_index = current_index + td_steps
            if bootstrap_index < len(root_values):
                value = root_values[bootstrap_index] * discount ** td_steps
            else:
                value = 0

            for i, reward in enumerate(rewards[current_index:bootstrap_index]):
                value += reward * discount ** i

            if current_index < len(root_values):
                target_values.append(stcat(value))
                target_rewards.append(stcat(rewards[current_index]))
                target_policies.append(child_visits[current_index])

            else:
                # States past the end of games are treated as absorbing states.
                target_values.append(stcat(0))
                target_rewards.append(stcat(0))
                # Note: Target policy is  set to 0 so that no policy loss is calculated for them
                #target_policies.append([0 for _ in range(len(child_visits[0]))])
                target_policies.append(child_visits[0]*0.0)

        return target_values, target_rewards, target_policies


def scalar_reward_loss( prediction, target):
        return -(torch.log(prediction) * target).sum(1)

def scalar_value_loss( prediction, target):
        return -(torch.log(prediction) * target).sum(1)
def update_weights(model, action_space_size, optimizer, replay_buffer,discount,batch_size,num_unroll_steps, td_steps,per ):
    batch = sample_batch(action_space_size,replay_buffer,discount,batch_size,num_unroll_steps, td_steps,per)
    obs_batch, action_batch, target_reward, target_value, target_policy,target_weights = batch
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    obs_batch = obs_batch.to(device)
    action_batch = action_batch.to(device)  # actions are already one-hot encoded
    target_reward = target_reward.to(device)
    target_value = target_value.to(device)
    target_policy = target_policy.to(device)
    target_weights = target_weights.to(device)

    # transform targets to categorical representation # its already done
    # Reference:  Appendix F
    #transformed_target_reward = config.scalar_transform(target_reward)
    target_reward_phi =target_reward #config.reward_phi(transformed_target_reward)
    #transformed_target_value = config.scalar_transform(target_value)
    target_value_phi = target_value#config.value_phi(transformed_target_value)

    hidden_state, policy_prob,value  = model.initial_state(obs_batch) # initial model call
    #h,init_pred_p,init_pred_v = net.initial_state(in_s)

    value_loss = scalar_value_loss(value, target_value_phi[:, 0])
    policy_loss = -(torch.log(policy_prob) * target_policy[:, 0]).sum(1)
    reward_loss = torch.zeros(batch_size, device=device)

    gradient_scale = 1 / num_unroll_steps
    for step_i in range(num_unroll_steps):
        hidden_state, reward,policy_prob,value  = model.next_state(hidden_state, action_batch[:, step_i]) # recurrent model call
        #h,pred_reward,pred_policy,pred_value= net.next_state(h,act)
        policy_loss += -(torch.log(policy_prob) * target_policy[:, step_i + 1]).sum(1)
        value_loss += scalar_value_loss(value, target_value_phi[:, step_i + 1])
        reward_loss += scalar_reward_loss(reward, target_reward_phi[:, step_i])
        hidden_state.register_hook(lambda grad: grad * 0.5)

    # optimize
    value_loss_coeff = 1
    loss = (policy_loss + value_loss_coeff * value_loss + reward_loss)
    weights = target_weights  # importance-sampling weights (all 1.0 when per is False)
    weighted_loss = (weights * loss).mean()
    weighted_loss.register_hook(lambda grad: grad * gradient_scale)
    loss = loss.mean()  # unweighted mean; currently unused

    optimizer.zero_grad()
    weighted_loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 5)  # clip the gradient norm at 5
    optimizer.step()

def adjust_lr(optimizer, step_count):

    lr_init=0.05
    lr_decay_rate=0.01
    lr_decay_steps=10000
    lr = lr_init * lr_decay_rate ** (step_count / lr_decay_steps)
    lr = max(lr, 0.001)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
    return lr


learning_rate = [0.05]   
def net_train(net,  action_space_size, replay_buffer,discount,batch_size,num_unroll_steps, td_steps,training_steps=1000,per = False):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model =net
    #MuZeroNet(input_size=4, action_space_n=2, reward_support_size=5, value_support_size=5).to(device) #training fresh net
    optimizer = optim.SGD(model.parameters(), lr=learning_rate[0], momentum=0.9,weight_decay=1e-4)
    #training_steps=training_steps=500#20000
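    # The buffer is filled in the same thread, so an empty buffer would never fill up; fail fast instead of spinning.
    if len(replay_buffer) == 0:
        raise ValueError("replay_buffer is empty")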

    for step_count in tqdm(range(training_steps)):
        learning_rate[0] = adjust_lr( optimizer, step_count)
        update_weights(model, action_space_size, optimizer, replay_buffer,discount,batch_size,num_unroll_steps, td_steps,per)
    return model
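
Both `get_priorities` and `make_target` compute the same n-step return: the discounted rewards over `td_steps` steps plus the discounted root value at the bootstrap index. The following sketch isolates that calculation on a made-up five-step episode. The function name `n_step_value` and the toy numbers are assumptions for illustration.

```python
def n_step_value(root_values, rewards, index, discount, td_steps):
    """Discounted rewards for td_steps steps plus a discounted bootstrap value."""
    bootstrap_index = index + td_steps
    if bootstrap_index < len(root_values):
        value = root_values[bootstrap_index] * discount ** td_steps
    else:
        value = 0.0  # past the end of the episode: nothing to bootstrap from
    for i, reward in enumerate(rewards[index:bootstrap_index]):
        value += reward * discount ** i
    return value


# Hypothetical five-step episode.
root_values = [4.0, 3.0, 2.0, 8.0, 0.0]
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
inside = n_step_value(root_values, rewards, index=0, discount=0.5, td_steps=2)
near_end = n_step_value(root_values, rewards, index=4, discount=0.5, td_steps=2)
print(inside, near_end)
```

For index 0 the target is 1 + 0.5 * 1 + 0.25 * 2.0 = 2.0. For the last step the bootstrap index falls outside the episode, so the target is just the final reward of 1.0.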


In [None]:
import gym
import numpy as np
from collections import deque

render = False
episodes_per_train=30
episodes_per_eval =5
#buffer =[]
buffer = deque(maxlen = episodes_per_train)
training_steps=50
max_steps=5000
n_sim= 50
discount = 0.99
batch_size = 512
envs = ['CartPole-v1','MountainCar-v0','LunarLander-v2']
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("training for ",envs[2])
env=gym.make(envs[2])
#env=env.unwrapped
#env = ScalingObservationWrapper(env, low=[-2.4, -2.0, -0.42, -3.5], high=[2.4, 2.0, 0.42, 3.5])

s_dim =env.observation_space.shape[0]
print("s_dim: ",s_dim)
a_dim =env.action_space.n
print("a_dim: ",a_dim)
a_bound =1 #env.action_space.high[0]
print("a_bound: ",a_bound)



net = MuZeroNet(input_size=s_dim, action_space_n=a_dim, reward_support_size=5, value_support_size=5).to(device)

for t in range(training_steps):
  for _ in range(episodes_per_train):
    buffer.append(play_game(env,net,n_sim,discount,render,device,a_dim,max_steps))
  print("training from ",len(buffer)," games")  
  if t<20:
    priority = True 
    tr_stp=500
  else :
    tr_stp=2000
    priority =False
  net = net_train(net,  action_space_size=a_dim, replay_buffer=buffer,discount=discount,batch_size=batch_size,num_unroll_steps=5, td_steps=10,training_steps=tr_stp,per = priority)
  for _ in range(episodes_per_eval):
    eval_game(env,net,n_sim,render,device,max_steps)
  


training for  LunarLander-v2
s_dim:  8
a_dim:  4
a_bound:  1
DATA collection:played for  88  steps , rewards -101.85668461706629
DATA collection:played for  104  steps , rewards -117.76086883613767
DATA collection:played for  99  steps , rewards -504.50921462555596
DATA collection:played for  103  steps , rewards -120.95003351036684
DATA collection:played for  95  steps , rewards -408.40362463389835
DATA collection:played for  76  steps , rewards -97.81801424871111
DATA collection:played for  105  steps , rewards -464.0141749483592
DATA collection:played for  74  steps , rewards -303.38397742425855
DATA collection:played for  75  steps , rewards -232.2964755769686
DATA collection:played for  82  steps , rewards -92.08698841593089
DATA collection:played for  127  steps , rewards -395.25222689318514
DATA collection:played for  109  steps , rewards -194.6926962643017
DATA collection:played for  91  steps , rewards -233.71634342907694
DATA collection:played for  110  steps , rewards -165.3

  0%|          | 0/500 [00:00<?, ?it/s]

DATA collection:played for  86  steps , rewards -168.3994302592149
training from  30  games


100%|██████████| 500/500 [03:55<00:00,  2.12it/s]


Eval:played for  -369.18820124380767  rewards
Eval:played for  -104.46200592694606  rewards
Eval:played for  -189.8690013828629  rewards
Eval:played for  -117.02598889453671  rewards
Eval:played for  -522.5154526453116  rewards
DATA collection:played for  92  steps , rewards -28.066749886392046
DATA collection:played for  72  steps , rewards -150.40891211377502
DATA collection:played for  89  steps , rewards -241.1991161257918
DATA collection:played for  71  steps , rewards -136.47378143620304
DATA collection:played for  105  steps , rewards -301.26114194733054
DATA collection:played for  83  steps , rewards -328.08624423505455
DATA collection:played for  82  steps , rewards -278.38412454805996
DATA collection:played for  94  steps , rewards -514.1207208279138
DATA collection:played for  76  steps , rewards -256.0368771162459
DATA collection:played for  83  steps , rewards -263.47190839491043
DATA collection:played for  85  steps , rewards -221.06833835610166
DATA collection:played for

  0%|          | 0/500 [00:00<?, ?it/s]

DATA collection:played for  93  steps , rewards -448.42437885680044
training from  30  games


100%|██████████| 500/500 [03:44<00:00,  2.23it/s]


Eval:played for  -231.34640141883276  rewards
Eval:played for  9.229893036200224  rewards
Eval:played for  -208.63988720240894  rewards
Eval:played for  -271.6539387968308  rewards
Eval:played for  -299.85026892178706  rewards
DATA collection:played for  99  steps , rewards -237.22378647222234
DATA collection:played for  56  steps , rewards -113.16520204148878
DATA collection:played for  66  steps , rewards -184.809342521921
DATA collection:played for  99  steps , rewards -334.45159157348905
DATA collection:played for  77  steps , rewards -370.7532199053163
DATA collection:played for  81  steps , rewards -365.3790383559971
DATA collection:played for  62  steps , rewards -135.56536607687528
DATA collection:played for  69  steps , rewards -301.6412036095619
DATA collection:played for  56  steps , rewards -214.05972241013495
DATA collection:played for  106  steps , rewards -208.7046497145085
DATA collection:played for  73  steps , rewards -39.61481569224097
DATA collection:played for  54 

  0%|          | 0/500 [00:00<?, ?it/s]

DATA collection:played for  59  steps , rewards -180.85467494771046
training from  30  games


100%|██████████| 500/500 [03:40<00:00,  2.27it/s]


Eval:played for  -156.37000047530657  rewards
Eval:played for  -129.600771095379  rewards
Eval:played for  -243.14965506556672  rewards
Eval:played for  -205.9169200635184  rewards
Eval:played for  -146.77096774053422  rewards
DATA collection:played for  59  steps , rewards 0.46725029711898003
DATA collection:played for  85  steps , rewards -218.72563350437267
DATA collection:played for  84  steps , rewards -270.6362673671854
DATA collection:played for  81  steps , rewards -199.48262517663844
DATA collection:played for  60  steps , rewards -154.91768394430048
DATA collection:played for  64  steps , rewards 17.420894917305944
DATA collection:played for  105  steps , rewards -216.9261013399867
DATA collection:played for  68  steps , rewards -155.38347238019907
DATA collection:played for  99  steps , rewards -240.20013248737885
DATA collection:played for  101  steps , rewards -297.1714646225172
DATA collection:played for  68  steps , rewards -23.554954898611967
DATA collection:played for 

  0%|          | 0/500 [00:00<?, ?it/s]

DATA collection:played for  57  steps , rewards -125.16943223889896
training from  30  games


100%|██████████| 500/500 [03:46<00:00,  2.21it/s]


Eval:played for  -227.1969331842232  rewards
Eval:played for  -116.9191553130954  rewards
Eval:played for  -156.53398669401176  rewards
Eval:played for  -163.3744495797692  rewards
Eval:played for  -274.172345068608  rewards
DATA collection:played for  57  steps , rewards -129.39826555788562
DATA collection:played for  70  steps , rewards -222.55023231903135
DATA collection:played for  90  steps , rewards -319.3756617259369
DATA collection:played for  74  steps , rewards -218.14281178587692
DATA collection:played for  58  steps , rewards -148.67591977812887
DATA collection:played for  52  steps , rewards -157.03842553415103
DATA collection:played for  61  steps , rewards -129.09913436625916
DATA collection:played for  63  steps , rewards -39.19527900815342
DATA collection:played for  70  steps , rewards -110.22606093370221
DATA collection:played for  87  steps , rewards -322.60402060531317
DATA collection:played for  84  steps , rewards -160.77119634975867
DATA collection:played for  5


DATA collection:played for  73  steps , rewards -162.36998047239223
training from  30  games


100%|██████████| 500/500 [03:37<00:00,  2.30it/s]


Eval:played for  -235.15302562408866  rewards
Eval:played for  -88.24096169024573  rewards
Eval:played for  -176.12740092411747  rewards
Eval:played for  -305.91034921208416  rewards
Eval:played for  -116.47065699949933  rewards
DATA collection:played for  159  steps , rewards -793.7208403917292
DATA collection:played for  121  steps , rewards -172.8626440832067
DATA collection:played for  81  steps , rewards -494.7437481925009
DATA collection:played for  130  steps , rewards -393.45219144730083
DATA collection:played for  58  steps , rewards -112.81136496550278
DATA collection:played for  121  steps , rewards -247.20135935153309
DATA collection:played for  116  steps , rewards -232.38712753518067
DATA collection:played for  120  steps , rewards -207.56879537387994
DATA collection:played for  140  steps , rewards -286.79238622223386
DATA collection:played for  120  steps , rewards -263.1512801851987
DATA collection:played for  184  steps , rewards -471.9672969727361
DATA collection:pla


DATA collection:played for  122  steps , rewards -204.05857537220703
training from  30  games


100%|██████████| 500/500 [04:39<00:00,  1.79it/s]


Eval:played for  -73.02227592782693  rewards
Eval:played for  -69.10417801037228  rewards
Eval:played for  -111.14255847283309  rewards
Eval:played for  -117.93545355145733  rewards
Eval:played for  -97.80495293433655  rewards
DATA collection:played for  124  steps , rewards -143.65449797202933
DATA collection:played for  84  steps , rewards -96.32055170017279
DATA collection:played for  156  steps , rewards -101.3450920324785
DATA collection:played for  59  steps , rewards -101.7489563636547
DATA collection:played for  112  steps , rewards -141.3741267465536
DATA collection:played for  146  steps , rewards -158.07955009791965
DATA collection:played for  67  steps , rewards -104.75138379715114
DATA collection:played for  125  steps , rewards -118.57883858699716
DATA collection:played for  124  steps , rewards -107.9960839052333
DATA collection:played for  69  steps , rewards -81.16183642584916
DATA collection:played for  76  steps , rewards -121.223957262477
DATA collection:played for 


DATA collection:played for  57  steps , rewards -108.6875723383821
training from  30  games


100%|██████████| 500/500 [04:01<00:00,  2.07it/s]


Eval:played for  -85.76180263395794  rewards
Eval:played for  -79.30968179149794  rewards
Eval:played for  -220.75713966250535  rewards
Eval:played for  -82.38907438101592  rewards
Eval:played for  -95.19095359270511  rewards
DATA collection:played for  149  steps , rewards -107.47821949125746
DATA collection:played for  227  steps , rewards -202.5525976463028
DATA collection:played for  63  steps , rewards -97.18887331094588
DATA collection:played for  93  steps , rewards -52.40443270220027
DATA collection:played for  201  steps , rewards -124.42095719590884
DATA collection:played for  135  steps , rewards -84.38364677496921
DATA collection:played for  187  steps , rewards -89.91408662983284
DATA collection:played for  220  steps , rewards -119.58412512694821
DATA collection:played for  62  steps , rewards -71.19903038815903
DATA collection:played for  63  steps , rewards -110.64119778937729
DATA collection:played for  97  steps , rewards -90.40517915600883
DATA collection:played for 


DATA collection:played for  60  steps , rewards -68.80760362843088
training from  30  games


100%|██████████| 500/500 [05:38<00:00,  1.48it/s]


Eval:played for  -107.18823920580786  rewards
Eval:played for  -120.74106821452183  rewards
Eval:played for  -121.46276144676179  rewards
Eval:played for  -159.02937635729592  rewards
Eval:played for  -130.58369878322677  rewards
DATA collection:played for  116  steps , rewards -110.16140782675829
DATA collection:played for  73  steps , rewards -87.94932741858953
DATA collection:played for  97  steps , rewards -143.41084467474462
DATA collection:played for  102  steps , rewards -21.901295469044356
DATA collection:played for  109  steps , rewards -158.31014287311504
DATA collection:played for  146  steps , rewards -210.57813272875413
DATA collection:played for  77  steps , rewards -150.34313250233058
DATA collection:played for  93  steps , rewards -87.98457644418522
DATA collection:played for  87  steps , rewards -20.79924076669795
DATA collection:played for  107  steps , rewards -176.06407548674684
DATA collection:played for  147  steps , rewards -91.41871933170597
DATA collection:play


DATA collection:played for  66  steps , rewards -130.83618329222531
training from  30  games


100%|██████████| 500/500 [04:11<00:00,  1.99it/s]


Eval:played for  -91.81726692680022  rewards
Eval:played for  -19.099774669449644  rewards
Eval:played for  -101.22377651593169  rewards
Eval:played for  -84.7628273118444  rewards
Eval:played for  -143.01433261464757  rewards
DATA collection:played for  148  steps , rewards -17.05724973098698
DATA collection:played for  107  steps , rewards -66.53206054651754
DATA collection:played for  62  steps , rewards -56.62176414615274
DATA collection:played for  133  steps , rewards -26.58356706481115
DATA collection:played for  167  steps , rewards -136.50141547470102
DATA collection:played for  129  steps , rewards -34.17129145069646
DATA collection:played for  79  steps , rewards -96.91541993937281
DATA collection:played for  58  steps , rewards -119.88659253923747
DATA collection:played for  61  steps , rewards -54.99350905369919
DATA collection:played for  104  steps , rewards -161.52527696935664
DATA collection:played for  148  steps , rewards -60.07321295321703
DATA collection:played for


DATA collection:played for  121  steps , rewards -19.068177301837594
training from  30  games


100%|██████████| 500/500 [04:17<00:00,  1.94it/s]


Eval:played for  -54.779875565424085  rewards
Eval:played for  -36.24185923281178  rewards
Eval:played for  -67.95766475544451  rewards
Eval:played for  -14.51381421841731  rewards
Eval:played for  -95.93291479322562  rewards
DATA collection:played for  128  steps , rewards 9.336022326054135
DATA collection:played for  85  steps , rewards -135.70949317045688
DATA collection:played for  99  steps , rewards -97.99108867734824
DATA collection:played for  124  steps , rewards -56.16650814535568
DATA collection:played for  82  steps , rewards -82.48407754368881
DATA collection:played for  68  steps , rewards -54.717589354365856
DATA collection:played for  81  steps , rewards -34.000714019594156
DATA collection:played for  114  steps , rewards -84.06418293645076
DATA collection:played for  54  steps , rewards -54.46752521534446
DATA collection:played for  87  steps , rewards -68.36529047439159
DATA collection:played for  58  steps , rewards -46.13351422868993
DATA collection:played for  75  


DATA collection:played for  81  steps , rewards -123.87317970174487
training from  30  games


100%|██████████| 500/500 [03:56<00:00,  2.11it/s]


Eval:played for  -75.87296310336494  rewards
Eval:played for  -182.12812981217934  rewards
Eval:played for  -119.09609210281937  rewards
Eval:played for  -92.47230144210737  rewards
Eval:played for  -73.2740967145973  rewards
DATA collection:played for  82  steps , rewards -131.72807537915008
DATA collection:played for  88  steps , rewards -43.11373219570882
DATA collection:played for  62  steps , rewards -99.87302076147805
DATA collection:played for  97  steps , rewards -43.323911366755794
DATA collection:played for  98  steps , rewards -104.79335385325709
DATA collection:played for  73  steps , rewards -159.31063748755236
DATA collection:played for  94  steps , rewards -3.259587897648757
DATA collection:played for  70  steps , rewards -130.49043541350903
DATA collection:played for  83  steps , rewards -132.2103307652061
DATA collection:played for  81  steps , rewards -100.9544741466117
DATA collection:played for  93  steps , rewards -47.374002488600155
DATA collection:played for  67 


DATA collection:played for  63  steps , rewards -81.44128128875144
training from  30  games


100%|██████████| 500/500 [03:51<00:00,  2.16it/s]


Eval:played for  -186.64885111145128  rewards
Eval:played for  -29.72091545312246  rewards
Eval:played for  -65.78387830633909  rewards
Eval:played for  -100.004158925516  rewards
Eval:played for  15.589562416033274  rewards
DATA collection:played for  111  steps , rewards -38.56502980728096
DATA collection:played for  102  steps , rewards 0.34701420136235583
DATA collection:played for  102  steps , rewards -33.869665427259605
DATA collection:played for  105  steps , rewards -10.376877595525968
DATA collection:played for  55  steps , rewards -67.93717335367472
DATA collection:played for  93  steps , rewards -69.82536266680508
DATA collection:played for  126  steps , rewards -127.71369649126024
DATA collection:played for  102  steps , rewards -44.28624679058265
DATA collection:played for  65  steps , rewards -37.14624182249463
DATA collection:played for  116  steps , rewards -29.06017197396433
DATA collection:played for  124  steps , rewards -68.48804489944555
DATA collection:played for


DATA collection:played for  71  steps , rewards 3.8758071490167083
training from  30  games


100%|██████████| 500/500 [04:23<00:00,  1.90it/s]


Eval:played for  -105.73009797580217  rewards
Eval:played for  31.75694167100525  rewards
Eval:played for  -23.736807230084523  rewards
Eval:played for  18.91956337585262  rewards
Eval:played for  -70.99821527386499  rewards
DATA collection:played for  139  steps , rewards -133.75221149091908
DATA collection:played for  121  steps , rewards -26.62548958368886
DATA collection:played for  84  steps , rewards -94.97840507979609
DATA collection:played for  89  steps , rewards -134.03061854866021
DATA collection:played for  94  steps , rewards -77.42796833064078
DATA collection:played for  179  steps , rewards -198.18087024692312
