# Collaboration and Competition

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Tennis.app")
```

In [2]:
env_file_name = "Tennis_Windows_x86_64/Tennis.exe"
# env = UnityEnvironment(file_name=env_file_name)
env = UnityEnvironment(file_name=env_file_name,no_graphics=True)

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.
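
As a small worked example of the reward rule above, the sketch below sums rewards per agent for a made-up sequence of events. The `(agent_index, event_name)` tuples and the helper `episode_scores` are hypothetical names used only for illustration; the real rewards come from the environment.

```python
HIT_REWARD = 0.1     # agent hits the ball over the net
MISS_REWARD = -0.01  # agent lets the ball hit the ground or hits it out of bounds

def episode_scores(events, num_agents=2):
    """Sum rewards per agent; `events` is a list of (agent_index, event_name)."""
    scores = [0.0] * num_agents
    for agent, event in events:
        scores[agent] += HIT_REWARD if event == "hit" else MISS_REWARD
    return scores

# agent 1 returns the ball once, then misses it
scores = episode_scores([(1, "hit"), (1, "miss")])
print(scores, max(scores))
```

Here agent 1 ends the episode with roughly 0.09 and agent 0 with 0, so the maximum over agents is about 0.09.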

Each observation consists of 8 variables corresponding to the position and velocity of the ball and racket. The environment stacks 3 consecutive observations, so each agent receives a state vector of length 24. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])
print('states shape : ',states.shape)
print('Both states look like : ',states)
print(2*states)

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.65278625 -1.5
 -0.          0.          6.83172083  6.         -0.          0.        ]
states shape :  (2, 24)
Both states look like :  [[ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.         -6.65278625 -1.5
  -0.          0.          6.83172083  6.         -0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.         -6.4669857  -1.5
   0.          0.         -6.83172083  6.          0.          0.        ]]
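
The startup log reports 8 observation variables per agent and 3 stacked observations, which accounts for the state length of 24. The sketch below reshapes one agent's flat state into three frames. The helper `split_frames` is a hypothetical name, and the frame ordering (oldest first, newest last) is an assumption inferred from the leading zeros in the printed state, not something the environment documents here.

```python
import numpy as np

NUM_STACKED = 3  # "Number of stacked Vector Observation" in the startup log
OBS_SIZE = 8     # "Vector Observation space size (per agent)"

def split_frames(state):
    """View one agent's flat state as NUM_STACKED rows of OBS_SIZE values."""
    return np.asarray(state, dtype=float).reshape(NUM_STACKED, OBS_SIZE)

# the first agent's state as printed after the reset
state = np.zeros(NUM_STACKED * OBS_SIZE)
state[16:] = [-6.65278625, -1.5, -0.0, 0.0, 6.83172083, 6.0, -0.0, 0.0]

frames = split_frames(state)
print(frames.shape)
print(frames[-1])  # the only non-zero frame right after a reset
```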

### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agents and receive feedback from the environment.

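When this cell runs, each agent selects an action at random at every time step, and you can watch how they perform.  A window should pop up that allows you to observe the agents, unless the environment was started with `no_graphics=True` as in the cell above.  Note that the loop is wrapped in `if False:`, so it is skipped unless you change the condition to `True`.  The cell also defines `plot_results`, which the training cell uses later.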

Of course, as part of the project, you'll have to change the code so that the agents are able to use their experiences to gradually choose better actions when interacting with the environment!
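
One common way to let agents learn from their experiences is to store past transitions in a replay buffer and sample random mini-batches from it. The sketch below is a minimal, generic version using only the standard library. The class `ReplayBuffer` and its field names are illustrative assumptions, not the implementation inside the `maddpg` module used later in this notebook.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", "states actions rewards next_states dones")

class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, buffer_size, batch_size, seed=None):
        self.memory = deque(maxlen=buffer_size)  # oldest entries are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, states, actions, rewards, next_states, dones):
        self.memory.append(Experience(states, actions, rewards, next_states, dones))

    def sample(self):
        """Return batch_size transitions drawn without replacement."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)

# tiny usage example: 8 transitions pushed into a buffer that holds only 5
buffer = ReplayBuffer(buffer_size=5, batch_size=2, seed=0)
for step in range(8):
    buffer.add([step], [0.0], [0.0], [step + 1], [False])
batch = buffer.sample()
print(len(buffer), len(batch))
```

Because the deque has a maximum length, the three oldest transitions are discarded and only the most recent five remain available for sampling.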

In [5]:
if False:
    total_scores = []
    for i in range(100):                                        # play game for 100 episodes
        env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
        states = env_info.vector_observations                  # get the current state (for each agent)
        scores = np.zeros(num_agents)                          # initialize the score (for each agent)
        t = 0
        while True:
            actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
            actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
            # print('actions : ',actions)
            env_info = env.step(actions)[brain_name]           # send all actions to the environment
            t += 1
            next_states = env_info.vector_observations         # get next state (for each agent)
            rewards = env_info.rewards                         # get reward (for each agent)
            dones = env_info.local_done                        # see if episode finished
            scores += env_info.rewards                         # update the score (for each agent)
            states = next_states                               # roll over states to next time step
            if np.any(dones):                                  # exit loop if episode finished
                break
        print('Score (max over agents) from episode {}: {}, and {} steps taken'.format(i, np.max(scores),t))
        print(scores)
        total_scores.append(scores)
    print('Average Random Score : ', np.mean(total_scores))
        
def plot_results(results):
    """Plot episode rewards and the critic/actor losses from a training run."""
    import matplotlib.pyplot as plt
    plt.ion()

    # rewards per episode together with their average
    plt.figure()
    plt.plot(np.arange(len(results.all_rewards)), results.all_rewards)
    plt.plot(np.arange(len(results.avg_rewards)), results.avg_rewards)
    plt.ylabel('Rewards')
    plt.xlabel('Episode #')
    plt.show()

    # critic loss per learning step
    plt.figure()
    plt.plot(np.arange(len(results.critic_loss)), results.critic_loss)
    plt.ylabel('critic_losses')
    plt.xlabel('Learn Step #')
    plt.show()

    # actor loss per learning step
    plt.figure()
    plt.plot(np.arange(len(results.actor_loss)), results.actor_loss)
    plt.ylabel('actor_losses')
    plt.xlabel('Learn Step #')
    plt.show()
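
`plot_results` draws `results.avg_rewards` alongside the raw rewards. How the `maddpg` module computes that average is not shown in this notebook; one common choice is a moving average over a fixed window of recent episodes. The sketch below illustrates that idea under this assumption. The helper `moving_average` and the window size are hypothetical.

```python
from collections import deque

def moving_average(values, window=100):
    """Mean of the last `window` values at each position."""
    recent = deque(maxlen=window)  # keeps only the most recent values
    averages = []
    for value in values:
        recent.append(value)
        averages.append(sum(recent) / len(recent))
    return averages

# made-up per-episode scores, smoothed with a window of 2
example_scores = [0.0, 0.0, 0.09, 0.0]
smoothed = moving_average(example_scores, window=2)
print(smoothed)
```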


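The cell below trains the `maddpg` agent with the hyperparameters in `config`, repeating the training `num_runs` times with a different seed each time and plotting the results of every run.  When all runs have finished, it closes the environment.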
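
Some numbers in the training log can be cross-checked against the configuration. The sketch below assumes that `joined_states=True` means each learning batch concatenates the data of all agents, and that the noise scale is multiplied by `noise_decay` once per episode. Both are inferences from the printed log, not confirmed details of the `maddpg` module, and `noise_scale` is a hypothetical helper.

```python
# values copied from the config and environment printouts in this notebook
num_agents, state_size, action_size = 2, 24, 2
noise_decay = 0.999

# assumed meaning of joined_states=True: all agents' data is concatenated
joined_state_size = num_agents * state_size    # compare with torch.Size([512, 48])
joined_action_size = num_agents * action_size  # compare with torch.Size([512, 4])

def noise_scale(episode, decay=noise_decay):
    """Assumed schedule: the scale is multiplied by `decay` once per episode."""
    return decay ** (episode + 1)

print(joined_state_size, joined_action_size)
for episode in (0, 20, 40, 100):
    print(episode, f"{noise_scale(episode):.3f}")
```

Under these assumptions the computed values agree with the `Noise` column logged for episodes 0, 20, 40 and 100 (0.999, 0.979, 0.960 and 0.904).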

In [None]:
from maddpg import maddpg
import cProfile
DoProfile = False

config = {
    'gamma'               : 0.99,
    'tau'                 : 0.01,
    'action_size'         : action_size,
    'state_size'          : state_size,
    'hidden_size'         : 512,
    'buffer_size'         : 50000,
    'batch_size'          : 512,
    'dropout'             : 0.01,
    'seed'                : 149,
    'max_episodes'        : 1000,
    'learn_every'         : 10,
    'joined_states'       : True,
    'critic_learning_rate': 1e-3,
    'actor_learning_rate' : 1e-3,
    'noise_decay'         : 0.999,
    'sigma'               : 0.1,
    'num_agents'          : num_agents,
    'env_file_name'       : env_file_name,
    'train_mode'          : True,
    'brain_name'          : brain_name}

def print_config(config):
    print('Config Parameters    : ')
    for key, value in config.items():
        print('{:20s} : {}'.format(key, value))

config_list = []
result_list = []
var_range = [0.0001] #, 0.0003, 0.0005, 0.001]
num_runs = 5
for param in range(len(var_range)):
    alt_config = config.copy()
    # alt_config['actor_learning_rate'] = var_range[param]
    # alt_config['tau'] = config['tau']*curmult
    # alt_config['critic_learning_rate'] = config['critic_learning_rate']*curmult
    # alt_config['actor_learning_rate'] = config['actor_learning_rate']*curmult
    for main in range(num_runs):
        print('-------------------------------------')
        print('New Run :')
        print('-------------------------------------')
        alt_config['seed'] += 1
        print_config(alt_config)
        config_list.append(alt_config.copy())
        agent = maddpg(env, alt_config)
        if DoProfile:
            # profile the training run and write the stats to the file 'PerfStats'
            cProfile.run("results = agent.train()", 'PerfStats')
        else:
            results = agent.train()
        result_list.append(results)
        # all_rewards,avg_rewards,critic_losses,actor_losses = agent.train()
        print_config(alt_config)
        plot_results(results)
print('-------------------------------------')
print('-------------------------------------')
print('Summary :')
print('-------------------------------------')
print('-------------------------------------')
for param in range(len(var_range)):
    for main in range(num_runs):
        print_config(config_list[param*num_runs+main])
        plot_results(result_list[param*num_runs+main])
    
env.close()

episode: 0/1000   0% ETA:  --:--:-- |                                        | 

-------------------------------------
New Run :
-------------------------------------
Config Parameters    : 
gamma                : 0.99
tau                  : 0.01
action_size          : 2
state_size           : 24
hidden_size          : 512
buffer_size          : 50000
batch_size           : 512
dropout              : 0.01
seed                 : 150
max_episodes         : 1000
learn_every          : 10
joined_states        : True
critic_learning_rate : 0.001
actor_learning_rate  : 0.001
noise_decay          : 0.999
sigma                : 0.1
num_agents           : 2
env_file_name        : Tennis_Windows_x86_64/Tennis.exe
train_mode           : True
brain_name           : TennisBrain
Running on device :  cpu
Episode 0 with 15 steps || Reward : [ 0.   -0.01] || avg reward :  0.000 || Noise  0.999 || 0.104 seconds, mem : 15

episode: 20/1000   2% ETA:  0:01:13 |                                        | 

Episode 20 with 15 steps || Reward : [ 0.   -0.01] || avg reward :  0.000 || Noise  0.979 || 0.067 seconds, mem : 299

episode: 40/1000   4% ETA:  0:01:08 ||                                       | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 40 
update - q expected : mean : 0.0349 - sd : 0.0053 min-max 0.0029|0.0501
update - reward : mean : -0.0007 - sd : 0.0026 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0069 - sd : 0.0229 min-max -0.0941|0.0341
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 40 
update - q expected : mean : 0.0349 - sd : 0.0055 min-max 0.0031|0.0496
update - reward : mean : -0.0007 - sd : 0.0026 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0069 - sd : 0.0230 min-max -0.0948|0.0349
Episode 40 with 15 steps || Reward : [-0.01  0.  ] || avg reward :  0.000 || Noise  0.960 || 0.505 seconds, mem : 583

episode: 49/1000   4% ETA:  0:01:15 |/                                       | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 50 
update - q expected : mean : -0.0475 - sd : 0.0469 min-max -0.1731|0.0070
update - reward : mean : -0.0008 - sd : 0.0028 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0737 - sd : 0.0473 min-max 0.0038|0.2157
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 50 
update - q expected : mean : -0.0467 - sd : 0.0457 min-max -0.1710|0.0064
update - reward : mean : -0.0008 - sd : 0.0028 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0730 - sd : 0.0462 min-max 0.0020|0.2122


episode: 59/1000   5% ETA:  0:01:17 |--                                      | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 60 
update - q expected : mean : 0.0217 - sd : 0.0267 min-max -0.0287|0.0920
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0067 - sd : 0.0261 min-max -0.0632|0.0573
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 60 
update - q expected : mean : 0.0223 - sd : 0.0269 min-max -0.0308|0.0986
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0060 - sd : 0.0264 min-max -0.0650|0.0568
Episode 60 with 15 steps || Reward : [-0.01  0.  ] || avg reward :  0.000 || Noise  0.941 || 0.390 seconds, mem : 867

episode: 69/1000   6% ETA:  0:01:18 |\\                                      | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 70 
update - q expected : mean : 0.0688 - sd : 0.0341 min-max 0.0136|0.1472
update - reward : mean : -0.0007 - sd : 0.0025 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0409 - sd : 0.0340 min-max -0.1286|0.0154
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 70 
update - q expected : mean : 0.0685 - sd : 0.0343 min-max 0.0145|0.1516
update - reward : mean : -0.0007 - sd : 0.0025 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0406 - sd : 0.0342 min-max -0.1280|0.0141


episode: 79/1000   7% ETA:  0:01:17 ||||                                     | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 80 
update - q expected : mean : 0.0657 - sd : 0.0256 min-max 0.0277|0.1216
update - reward : mean : -0.0009 - sd : 0.0029 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0395 - sd : 0.0277 min-max -0.1142|0.0025
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 80 
update - q expected : mean : 0.0650 - sd : 0.0250 min-max 0.0282|0.1195
update - reward : mean : -0.0009 - sd : 0.0029 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0388 - sd : 0.0273 min-max -0.1073|0.0014
Episode 80 with 15 steps || Reward : [ 0.   -0.01] || avg reward :  0.000 || Noise  0.922 || 0.678 seconds, mem : 1151

episode: 85/1000   8% ETA:  0:01:23 |///                                     | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 90 
update - q expected : mean : 0.0513 - sd : 0.0141 min-max 0.0236|0.0862
update - reward : mean : -0.0007 - sd : 0.0025 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0230 - sd : 0.0209 min-max -0.0913|0.0038
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 90 
update - q expected : mean : 0.0508 - sd : 0.0133 min-max 0.0236|0.0825
update - reward : mean : -0.0007 - sd : 0.0025 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0225 - sd : 0.0205 min-max -0.0901|0.0034


episode: 95/1000   9% ETA:  0:01:22 |---                                     | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 100 
update - q expected : mean : 0.0383 - sd : 0.0081 min-max 0.0164|0.0612
update - reward : mean : -0.0007 - sd : 0.0025 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0098 - sd : 0.0201 min-max -0.0876|0.0090
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 100 
update - q expected : mean : 0.0381 - sd : 0.0076 min-max 0.0163|0.0593
update - reward : mean : -0.0007 - sd : 0.0025 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0097 - sd : 0.0196 min-max -0.0878|0.0124
Episode 100 with 15 steps || Reward : [ 0.   -0.01] || avg reward :  0.000 || Noise  0.904 || 0.335 seconds, mem : 1435

episode: 105/1000  10% ETA:  0:01:21 |\\\\                                   | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 110 
update - q expected : mean : 0.0296 - sd : 0.0057 min-max 0.0134|0.0458
update - reward : mean : -0.0008 - sd : 0.0027 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0018 - sd : 0.0207 min-max -0.0786|0.0232
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 110 
update - q expected : mean : 0.0293 - sd : 0.0057 min-max 0.0122|0.0454
update - reward : mean : -0.0008 - sd : 0.0027 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0014 - sd : 0.0206 min-max -0.0801|0.0250


episode: 115/1000  11% ETA:  0:01:21 |||||                                   | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 120 
update - q expected : mean : 0.0242 - sd : 0.0051 min-max 0.0075|0.0350
update - reward : mean : -0.0006 - sd : 0.0025 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0047 - sd : 0.0197 min-max -0.0765|0.0313
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 120 
update - q expected : mean : 0.0239 - sd : 0.0054 min-max 0.0062|0.0319
update - reward : mean : -0.0006 - sd : 0.0025 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0051 - sd : 0.0194 min-max -0.0750|0.0359
Episode 120 with 15 steps || Reward : [-0.01  0.  ] || avg reward :  0.000 || Noise  0.886 || 0.383 seconds, mem : 1719

episode: 125/1000  12% ETA:  0:01:21 |////                                   | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 130 
update - q expected : mean : 0.0212 - sd : 0.0047 min-max 0.0032|0.0296
update - reward : mean : -0.0007 - sd : 0.0025 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0072 - sd : 0.0196 min-max -0.0697|0.0313
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 130 
update - q expected : mean : 0.0208 - sd : 0.0049 min-max 0.0021|0.0278
update - reward : mean : -0.0007 - sd : 0.0025 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0076 - sd : 0.0195 min-max -0.0701|0.0341


episode: 135/1000  13% ETA:  0:01:20 |-----                                  | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 140 
update - q expected : mean : 0.0196 - sd : 0.0051 min-max 0.0007|0.0312
update - reward : mean : -0.0007 - sd : 0.0025 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0087 - sd : 0.0195 min-max -0.0713|0.0382
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 140 
update - q expected : mean : 0.0192 - sd : 0.0052 min-max -0.0005|0.0274
update - reward : mean : -0.0007 - sd : 0.0025 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0091 - sd : 0.0194 min-max -0.0738|0.0355
Episode 140 with 15 steps || Reward : [-0.01  0.  ] || avg reward :  0.000 || Noise  0.868 || 0.327 seconds, mem : 2003

episode: 145/1000  14% ETA:  0:01:19 |\\\\\                                  | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 150 
update - q expected : mean : 0.0196 - sd : 0.0053 min-max -0.0015|0.0284
update - reward : mean : -0.0007 - sd : 0.0026 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0083 - sd : 0.0194 min-max -0.0716|0.0300
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 150 
update - q expected : mean : 0.0192 - sd : 0.0056 min-max -0.0027|0.0280
update - reward : mean : -0.0007 - sd : 0.0026 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0086 - sd : 0.0192 min-max -0.0709|0.0319


episode: 155/1000  15% ETA:  0:01:19 |||||||                                 | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 160 
update - q expected : mean : 0.0199 - sd : 0.0060 min-max -0.0036|0.0274
update - reward : mean : -0.0007 - sd : 0.0026 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0078 - sd : 0.0186 min-max -0.0674|0.0315
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 160 
update - q expected : mean : 0.0197 - sd : 0.0064 min-max -0.0048|0.0286
update - reward : mean : -0.0007 - sd : 0.0026 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0079 - sd : 0.0185 min-max -0.0674|0.0305
Episode 160 with 15 steps || Reward : [ 0.   -0.01] || avg reward :  0.000 || Noise  0.851 || 0.355 seconds, mem : 2287

episode: 165/1000  16% ETA:  0:01:18 |//////                                 | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 170 
update - q expected : mean : 0.0205 - sd : 0.0065 min-max -0.0064|0.0305
update - reward : mean : -0.0007 - sd : 0.0026 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0064 - sd : 0.0184 min-max -0.0728|0.0277
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 170 
update - q expected : mean : 0.0203 - sd : 0.0069 min-max -0.0060|0.0300
update - reward : mean : -0.0007 - sd : 0.0026 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0066 - sd : 0.0181 min-max -0.0702|0.0274


episode: 175/1000  17% ETA:  0:01:17 |------                                 | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 180 
update - q expected : mean : 0.0214 - sd : 0.0072 min-max -0.0086|0.0300
update - reward : mean : -0.0008 - sd : 0.0027 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0044 - sd : 0.0182 min-max -0.0691|0.0249
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 180 
update - q expected : mean : 0.0212 - sd : 0.0076 min-max -0.0105|0.0291
update - reward : mean : -0.0008 - sd : 0.0027 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0047 - sd : 0.0178 min-max -0.0683|0.0270
Episode 180 with 15 steps || Reward : [-0.01  0.  ] || avg reward :  0.000 || Noise  0.834 || 0.307 seconds, mem : 2571

episode: 185/1000  18% ETA:  0:01:16 |\\\\\\\                                | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 190 
update - q expected : mean : 0.0223 - sd : 0.0084 min-max -0.0147|0.0325
update - reward : mean : -0.0006 - sd : 0.0025 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0045 - sd : 0.0155 min-max -0.0657|0.0253
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 190 
update - q expected : mean : 0.0221 - sd : 0.0089 min-max -0.0169|0.0320
update - reward : mean : -0.0006 - sd : 0.0025 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0047 - sd : 0.0152 min-max -0.0638|0.0285


episode: 195/1000  19% ETA:  0:01:15 ||||||||                                | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 200 
update - q expected : mean : 0.0230 - sd : 0.0097 min-max -0.0205|0.0324
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0041 - sd : 0.0143 min-max -0.0642|0.0295
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 200 
update - q expected : mean : 0.0227 - sd : 0.0104 min-max -0.0228|0.0325
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0044 - sd : 0.0138 min-max -0.0608|0.0313
Episode 200 with 15 steps || Reward : [-0.01  0.  ] || avg reward :  0.000 || Noise  0.818 || 0.288 seconds, mem : 2855

episode: 205/1000  20% ETA:  0:01:13 |///////                                | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 210 
update - q expected : mean : 0.0234 - sd : 0.0110 min-max -0.0191|0.0329
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0028 - sd : 0.0138 min-max -0.0587|0.0300
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 210 
update - q expected : mean : 0.0231 - sd : 0.0115 min-max -0.0212|0.0331
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0030 - sd : 0.0133 min-max -0.0656|0.0300


episode: 215/1000  21% ETA:  0:01:13 |--------                               | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 220 
update - q expected : mean : 0.0237 - sd : 0.0115 min-max -0.0211|0.0335
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0024 - sd : 0.0135 min-max -0.0615|0.0310
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 220 
update - q expected : mean : 0.0236 - sd : 0.0118 min-max -0.0242|0.0338
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0026 - sd : 0.0132 min-max -0.0606|0.0312
Episode 220 with 15 steps || Reward : [ 0.   -0.01] || avg reward :  0.000 || Noise  0.802 || 0.384 seconds, mem : 3139

episode: 225/1000  22% ETA:  0:01:12 |\\\\\\\\                               | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 230 
update - q expected : mean : 0.0240 - sd : 0.0127 min-max -0.0255|0.0342
update - reward : mean : -0.0008 - sd : 0.0027 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0003 - sd : 0.0142 min-max -0.0575|0.0317
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 230 
update - q expected : mean : 0.0239 - sd : 0.0129 min-max -0.0271|0.0343
update - reward : mean : -0.0008 - sd : 0.0027 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0002 - sd : 0.0139 min-max -0.0597|0.0324


episode: 235/1000  23% ETA:  0:01:11 ||||||||||                              | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 240 
update - q expected : mean : 0.0246 - sd : 0.0143 min-max -0.0324|0.0359
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0005 - sd : 0.0120 min-max -0.0515|0.0383
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 240 
update - q expected : mean : 0.0246 - sd : 0.0145 min-max -0.0331|0.0361
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0007 - sd : 0.0117 min-max -0.0526|0.0395
Episode 240 with 15 steps || Reward : [-0.01  0.  ] || avg reward :  0.000 || Noise  0.786 || 0.323 seconds, mem : 3423

episode: 245/1000  24% ETA:  0:01:10 |/////////                              | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 250 
update - q expected : mean : 0.0246 - sd : 0.0154 min-max -0.0396|0.0376
update - reward : mean : -0.0005 - sd : 0.0022 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0007 - sd : 0.0126 min-max -0.0516|0.0403
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 250 
update - q expected : mean : 0.0246 - sd : 0.0153 min-max -0.0398|0.0365
update - reward : mean : -0.0005 - sd : 0.0022 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0007 - sd : 0.0123 min-max -0.0455|0.0426


episode: 255/1000  25% ETA:  0:01:09 |---------                              | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 260 
update - q expected : mean : 0.0250 - sd : 0.0161 min-max -0.0454|0.0381
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0008 - sd : 0.0111 min-max -0.0504|0.0418
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 260 
update - q expected : mean : 0.0251 - sd : 0.0160 min-max -0.0458|0.0382
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0009 - sd : 0.0108 min-max -0.0444|0.0441
Episode 260 with 15 steps || Reward : [-0.01  0.  ] || avg reward :  0.000 || Noise  0.770 || 0.283 seconds, mem : 3707

episode: 265/1000  26% ETA:  0:01:08 |\\\\\\\\\\                             | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 270 
update - q expected : mean : 0.0253 - sd : 0.0154 min-max -0.0329|0.0377
update - reward : mean : -0.0008 - sd : 0.0028 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0036 - sd : 0.0132 min-max -0.0534|0.0382
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 270 
update - q expected : mean : 0.0255 - sd : 0.0150 min-max -0.0316|0.0370
update - reward : mean : -0.0008 - sd : 0.0028 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0038 - sd : 0.0131 min-max -0.0538|0.0369


episode: 275/1000  27% ETA:  0:01:07 |||||||||||                             | 

Episode 273 with 30 steps || Reward : [0.   0.09] || avg reward :  0.001 || Noise  0.760 || 0.119 seconds, mem : 3907
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 280 
update - q expected : mean : 0.0258 - sd : 0.0158 min-max -0.0416|0.0399
update - reward : mean : -0.0007 - sd : 0.0026 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0036 - sd : 0.0110 min-max -0.0484|0.0299
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 280 
update - q expected : mean : 0.0260 - sd : 0.0154 min-max -0.0403|0.0398
update - reward : mean : -0.0007 - sd : 0.0026 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0039 - sd : 0.0106 min-max -0.0490|0.0263
Episode 280 with 14 steps || Reward : [ 0.   -0.01] || avg reward :  0.001 || Noise  0.755 || 0.317 

episode: 285/1000  28% ETA:  0:01:06 |///////////                            | 

Episode 282 with 30 steps || Reward : [0.   0.09] || avg reward :  0.002 || Noise  0.753 || 0.142 seconds, mem : 4051
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 290 
update - q expected : mean : 0.0264 - sd : 0.0150 min-max -0.0420|0.0391
update - reward : mean : -0.0005 - sd : 0.0051 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0042 - sd : 0.0119 min-max -0.0464|0.0932
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 290 
update - q expected : mean : 0.0266 - sd : 0.0146 min-max -0.0418|0.0392
update - reward : mean : -0.0005 - sd : 0.0051 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0045 - sd : 0.0116 min-max -0.0501|0.0938


episode: 295/1000  29% ETA:  0:01:05 |-----------                            | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 300 
update - q expected : mean : 0.0255 - sd : 0.0166 min-max -0.0486|0.0400
update - reward : mean : -0.0007 - sd : 0.0025 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0037 - sd : 0.0107 min-max -0.0452|0.0380
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 300 
update - q expected : mean : 0.0256 - sd : 0.0166 min-max -0.0489|0.0402
update - reward : mean : -0.0007 - sd : 0.0025 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0038 - sd : 0.0106 min-max -0.0429|0.0422
Episode 300 with 14 steps || Reward : [-0.01  0.  ] || avg reward :  0.002 || Noise  0.740 || 0.338 seconds, mem : 4307

episode: 305/1000  30% ETA:  0:01:04 |\\\\\\\\\\\                            | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 310 
update - q expected : mean : 0.0253 - sd : 0.0174 min-max -0.0552|0.0388
update - reward : mean : -0.0008 - sd : 0.0027 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0045 - sd : 0.0094 min-max -0.0433|0.0346
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 310 
update - q expected : mean : 0.0253 - sd : 0.0174 min-max -0.0538|0.0389
update - reward : mean : -0.0008 - sd : 0.0027 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0045 - sd : 0.0093 min-max -0.0397|0.0339


episode: 315/1000  31% ETA:  0:01:03 |||||||||||||                           | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 320 
update - q expected : mean : 0.0246 - sd : 0.0184 min-max -0.0550|0.0375
update - reward : mean : -0.0009 - sd : 0.0028 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0052 - sd : 0.0102 min-max -0.0728|0.0373
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 320 
update - q expected : mean : 0.0245 - sd : 0.0189 min-max -0.0575|0.0378
update - reward : mean : -0.0009 - sd : 0.0028 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0050 - sd : 0.0101 min-max -0.0739|0.0382
Episode 320 with 14 steps || Reward : [ 0.   -0.01] || avg reward :  0.002 || Noise  0.725 || 0.285 seconds, mem : 4591

episode: 325/1000  32% ETA:  0:01:02 |////////////                           | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 330 
update - q expected : mean : 0.0239 - sd : 0.0186 min-max -0.0573|0.0375
update - reward : mean : -0.0005 - sd : 0.0051 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0030 - sd : 0.0117 min-max -0.0392|0.0928
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 330 
update - q expected : mean : 0.0238 - sd : 0.0189 min-max -0.0598|0.0374
update - reward : mean : -0.0005 - sd : 0.0051 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0029 - sd : 0.0116 min-max -0.0336|0.0935


episode: 335/1000  33% ETA:  0:01:01 |-------------                          | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 340 
update - q expected : mean : 0.0233 - sd : 0.0198 min-max -0.0653|0.0373
update - reward : mean : -0.0007 - sd : 0.0026 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0033 - sd : 0.0100 min-max -0.0385|0.0431
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 340 
update - q expected : mean : 0.0233 - sd : 0.0202 min-max -0.0686|0.0373
update - reward : mean : -0.0007 - sd : 0.0026 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0031 - sd : 0.0101 min-max -0.0359|0.0435
Episode 340 with 14 steps || Reward : [-0.01  0.  ] || avg reward :  0.002 || Noise  0.711 || 0.270 seconds, mem : 4875

episode: 345/1000  34% ETA:  0:01:00 |\\\\\\\\\\\\\                          | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 350 
update - q expected : mean : 0.0236 - sd : 0.0183 min-max -0.0613|0.0362
update - reward : mean : -0.0008 - sd : 0.0027 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0045 - sd : 0.0090 min-max -0.0354|0.0432
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 350 
update - q expected : mean : 0.0235 - sd : 0.0185 min-max -0.0619|0.0365
update - reward : mean : -0.0008 - sd : 0.0027 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0043 - sd : 0.0090 min-max -0.0357|0.0438


episode: 355/1000  35% ETA:  0:00:59 ||||||||||||||                          | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 360 
update - q expected : mean : 0.0237 - sd : 0.0174 min-max -0.0672|0.0359
update - reward : mean : -0.0005 - sd : 0.0022 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0026 - sd : 0.0087 min-max -0.0258|0.0479
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 360 
update - q expected : mean : 0.0237 - sd : 0.0172 min-max -0.0653|0.0364
update - reward : mean : -0.0005 - sd : 0.0022 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0026 - sd : 0.0084 min-max -0.0265|0.0371
Episode 360 with 15 steps || Reward : [-0.01  0.  ] || avg reward :  0.002 || Noise  0.697 || 0.315 seconds, mem : 5160

episode: 365/1000  36% ETA:  0:00:58 |//////////////                         | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 370 
update - q expected : mean : 0.0241 - sd : 0.0149 min-max -0.0430|0.0348
update - reward : mean : -0.0006 - sd : 0.0023 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0040 - sd : 0.0079 min-max -0.0342|0.0325
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 370 
update - q expected : mean : 0.0242 - sd : 0.0145 min-max -0.0403|0.0356
update - reward : mean : -0.0006 - sd : 0.0023 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0040 - sd : 0.0080 min-max -0.0390|0.0327


episode: 372/1000  37% ETA:  0:00:58 |--------------                         | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 380 
update - q expected : mean : 0.0245 - sd : 0.0126 min-max -0.0428|0.0345
update - reward : mean : -0.0005 - sd : 0.0022 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0042 - sd : 0.0078 min-max -0.0426|0.0236
learn : Next States :  torch.Size([512, 48])

episode: 381/1000  38% ETA:  0:00:57 |\\\\\\\\\\\\\\                         | 

--------------------------------------
Agent 1 and episode 380 
update - q expected : mean : 0.0246 - sd : 0.0124 min-max -0.0423|0.0351
update - reward : mean : -0.0005 - sd : 0.0022 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0043 - sd : 0.0078 min-max -0.0413|0.0263
Episode 380 with 15 steps || Reward : [ 0.   -0.01] || avg reward :  0.001 || Noise  0.683 || 0.434 seconds, mem : 5444
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 390 
update - q expected : mean : 0.0232 - sd : 0.0143 min-max -0.0471|0.0347
update - reward : mean : -0.0004 - sd : 0.0051 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0040 - sd : 0.0093 min-max -0.0698|0.0945
learn : Next States :  torch.Size([512, 48])


episode: 391/1000  39% ETA:  0:00:57 ||||||||||||||||                        | 

--------------------------------------
Agent 1 and episode 390 
update - q expected : mean : 0.0234 - sd : 0.0142 min-max -0.0464|0.0347
update - reward : mean : -0.0004 - sd : 0.0051 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0040 - sd : 0.0089 min-max -0.0675|0.0943
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 400 
update - q expected : mean : 0.0218 - sd : 0.0154 min-max -0.0510|0.0343
update - reward : mean : -0.0010 - sd : 0.0029 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0054 - sd : 0.0086 min-max -0.0394|0.0257
learn : Next States :  torch.Size([512, 48])


episode: 401/1000  40% ETA:  0:00:56 |///////////////                        | 

--------------------------------------
Agent 1 and episode 400 
update - q expected : mean : 0.0220 - sd : 0.0154 min-max -0.0514|0.0346
update - reward : mean : -0.0010 - sd : 0.0029 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0055 - sd : 0.0083 min-max -0.0393|0.0238
Episode 400 with 15 steps || Reward : [-0.01  0.  ] || avg reward :  0.000 || Noise  0.670 || 0.448 seconds, mem : 5728
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 410 
update - q expected : mean : 0.0211 - sd : 0.0154 min-max -0.0460|0.0334
update - reward : mean : -0.0007 - sd : 0.0026 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0034 - sd : 0.0072 min-max -0.0491|0.0271
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 410 
update - q expected : mean : 0.0211 - s

episode: 411/1000  41% ETA:  0:00:55 |----------------                       | 

Episode 412 with 31 steps || Reward : [0.   0.09] || avg reward :  0.001 || Noise  0.662 || 0.151 seconds, mem : 5915
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 420 
update - q expected : mean : 0.0192 - sd : 0.0173 min-max -0.0632|0.0329
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0008 - sd : 0.0082 min-max -0.0329|0.0390


episode: 421/1000  42% ETA:  0:00:55 |\\\\\\\\\\\\\\\\                       | 

learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 420 
update - q expected : mean : 0.0193 - sd : 0.0176 min-max -0.0624|0.0334
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0008 - sd : 0.0080 min-max -0.0312|0.0369
Episode 420 with 15 steps || Reward : [-0.01  0.  ] || avg reward :  0.001 || Noise  0.656 || 0.484 seconds, mem : 6029
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 430 
update - q expected : mean : 0.0169 - sd : 0.0210 min-max -0.0701|0.0323
update - reward : mean : -0.0007 - sd : 0.0026 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0001 - sd : 0.0086 min-max -0.0194|0.0407
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 

episode: 431/1000  43% ETA:  0:00:54 |||||||||||||||||                       | 

Episode 433 with 33 steps || Reward : [ 0.1  -0.01] || avg reward :  0.002 || Noise  0.648 || 0.169 seconds, mem : 6232
Episode 435 with 29 steps || Reward : [ 0.1  -0.01] || avg reward :  0.003 || Noise  0.646 || 0.142 seconds, mem : 6275
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 440 
update - q expected : mean : 0.0174 - sd : 0.0189 min-max -0.0655|0.0315
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0001 - sd : 0.0076 min-max -0.0215|0.0375
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 440 
update - q expected : mean : 0.0176 - sd : 0.0189 min-max -0.0653|0.0326
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0001 - sd : 0

episode: 441/1000  44% ETA:  0:00:53 |/////////////////                      | 

Episode 440 with 14 steps || Reward : [-0.01  0.  ] || avg reward :  0.003 || Noise  0.643 || 0.337 seconds, mem : 6346
[0m[41mEpisode 441 with 31 steps || Reward : [0.   0.09] || avg reward :  0.004 || Noise  0.643 || 0.172 seconds, mem : 6377
[0m[41mEpisode 443 with 32 steps || Reward : [0.   0.09] || avg reward :  0.005 || Noise  0.641 || 0.165 seconds, mem : 6423
[0m[41mEpisode 445 with 30 steps || Reward : [ 0.1  -0.01] || avg reward :  0.006 || Noise  0.640 || 0.158 seconds, mem : 6467
[0mLearning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 450 
update - q expected : mean : 0.0173 - sd : 0.0171 min-max -0.0563|0.0318
update - reward : mean : -0.0005 - sd : 0.0051 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0004 - sd : 0.0085 min-max -0.0295|0.1066
learn : Next States :  torch.Size([

episode: 451/1000  45% ETA:  0:00:53 |-----------------                      | 


Agent 1 and episode 450 
update - q expected : mean : 0.0174 - sd : 0.0172 min-max -0.0559|0.0325
update - reward : mean : -0.0005 - sd : 0.0051 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0003 - sd : 0.0083 min-max -0.0274|0.1069
Episode 456 with 30 steps || Reward : [ 0.1  -0.01] || avg reward :  0.007 || Noise  0.633 || 0.146 seconds, mem : 6639
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 460 
update - q expected : mean : 0.0184 - sd : 0.0151 min-max -0.0509|0.0316
update - reward : mean : -0.0005 - sd : 0.0052 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0018 - sd : 0.0087 min-max -0.0663|0.1050
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 460 
update - q expected : mean : 0.0186 - sd : 0.0148 min-

episode: 461/1000  46% ETA:  0:00:52 |\\\\\\\\\\\\\\\\\                      | 

Episode 460 with 14 steps || Reward : [ 0.   -0.01] || avg reward :  0.007 || Noise  0.631 || 0.340 seconds, mem : 6695

episode: 470/1000  47% ETA:  0:00:51 |||||||||||||||||||                     | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 470 
update - q expected : mean : 0.0192 - sd : 0.0112 min-max -0.0412|0.0309
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0024 - sd : 0.0078 min-max -0.0537|0.0161
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 470 
update - q expected : mean : 0.0194 - sd : 0.0113 min-max -0.0409|0.0319
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0024 - sd : 0.0079 min-max -0.0544|0.0171
Episode 470 with 32 steps || Reward : [0.   0.09] || avg reward :  0.008 || Noise  0.624 || 0.696 seconds, mem : 6854

episode: 475/1000  47% ETA:  0:00:51 |//////////////////                     | 

Episode 474 with 31 steps || Reward : [ 0.1  -0.01] || avg reward :  0.009 || Noise  0.622 || 0.166 seconds, mem : 6927
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 480 
update - q expected : mean : 0.0183 - sd : 0.0115 min-max -0.0388|0.0311
update - reward : mean : -0.0007 - sd : 0.0026 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0026 - sd : 0.0088 min-max -0.0640|0.0228
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 480 
update - q expected : mean : 0.0185 - sd : 0.0115 min-max -0.0388|0.0320
update - reward : mean : -0.0007 - sd : 0.0026 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0026 - sd : 0.0089 min-max -0.0659|0.0247
Episode 480 with 14 steps || Reward : [ 0.   -0.01] || avg reward :  0.009 || Noise  0.618 || 0.34

episode: 485/1000  48% ETA:  0:00:51 |------------------                     | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 490 
update - q expected : mean : 0.0174 - sd : 0.0120 min-max -0.0406|0.0301
update - reward : mean : -0.0007 - sd : 0.0025 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0016 - sd : 0.0088 min-max -0.0622|0.0141
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 490 
update - q expected : mean : 0.0177 - sd : 0.0119 min-max -0.0405|0.0307
update - reward : mean : -0.0007 - sd : 0.0025 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0016 - sd : 0.0091 min-max -0.0636|0.0161


episode: 494/1000  49% ETA:  0:00:50 |\\\\\\\\\\\\\\\\\\\                    | 

Episode 492 with 31 steps || Reward : [0.   0.09] || avg reward :  0.010 || Noise  0.611 || 0.153 seconds, mem : 7199
Episode 494 with 32 steps || Reward : [ 0.1  -0.01] || avg reward :  0.011 || Noise  0.609 || 0.164 seconds, mem : 7245
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 500 
update - q expected : mean : 0.0157 - sd : 0.0136 min-max -0.0538|0.0289
update - reward : mean : -0.0003 - sd : 0.0067 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0004 - sd : 0.0097 min-max -0.0573|0.1073
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 500 
update - q expected : mean : 0.0158 - sd : 0.0138 min-max -0.0496|0.0307
update - reward : mean : -0.0003 - sd : 0.0067 min-max -0.0100|0.1000
update - TD-Error : mean :

episode: 503/1000  50% ETA:  0:00:49 ||||||||||||||||||||                    | 

Episode 500 with 14 steps || Reward : [-0.01  0.  ] || avg reward :  0.011 || Noise  0.606 || 0.339 seconds, mem : 7330
[0m[41mEpisode 509 with 31 steps || Reward : [0.   0.09] || avg reward :  0.011 || Noise  0.600 || 0.155 seconds, mem : 7477
[0mLearning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 510 
update - q expected : mean : 0.0145 - sd : 0.0144 min-max -0.0560|0.0297
update - reward : mean : -0.0005 - sd : 0.0052 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0005 - sd : 0.0096 min-max -0.0587|0.1029


episode: 511/1000  51% ETA:  0:00:48 |///////////////////                    | 

learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 510 
update - q expected : mean : 0.0147 - sd : 0.0147 min-max -0.0566|0.0309
update - reward : mean : -0.0005 - sd : 0.0052 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0007 - sd : 0.0097 min-max -0.0595|0.1022
Episode 517 with 32 steps || Reward : [ 0.1  -0.01] || avg reward :  0.012 || Noise  0.596 || 0.155 seconds, mem : 7608
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 520 
update - q expected : mean : 0.0134 - sd : 0.0150 min-max -0.0559|0.0299
update - reward : mean : -0.0003 - sd : 0.0068 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0020 - sd : 0.0094 min-max -0.0447|0.1015
learn : Next States :  torch.Size([512, 48])
-------------------------------------

episode: 521/1000  52% ETA:  0:00:48 |--------------------                   | 

Episode 520 with 32 steps || Reward : [ 0.1  -0.01] || avg reward :  0.013 || Noise  0.594 || 0.464 seconds, mem : 7669
Episode 527 with 31 steps || Reward : [0.   0.09] || avg reward :  0.013 || Noise  0.590 || 0.150 seconds, mem : 7785
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 530 
update - q expected : mean : 0.0126 - sd : 0.0160 min-max -0.0564|0.0291
update - reward : mean : -0.0007 - sd : 0.0053 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0008 - sd : 0.0096 min-max -0.0470|0.0990
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 530 
update - q expected : mean : 0.0128 - sd : 0.0163 min-max -0.0572|0.0296
update - reward : mean : -0.0007 - sd : 0.0053 min-max -0.0100|0.1000
update - TD-Error : mean :

episode: 531/1000  53% ETA:  0:00:47 |\\\\\\\\\\\\\\\\\\\\                   | 

Episode 538 with 37 steps || Reward : [0.1  0.09] || avg reward :  0.012 || Noise  0.583 || 0.185 seconds, mem : 7983
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 540 
update - q expected : mean : 0.0131 - sd : 0.0145 min-max -0.0673|0.0278
update - reward : mean : -0.0002 - sd : 0.0067 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0028 - sd : 0.0104 min-max -0.0557|0.1097
learn : Next States :  torch.Size([512, 48])
--------------------------------------


episode: 541/1000  54% ETA:  0:00:46 ||||||||||||||||||||||                  | 

Agent 1 and episode 540 
update - q expected : mean : 0.0135 - sd : 0.0144 min-max -0.0657|0.0291
update - reward : mean : -0.0002 - sd : 0.0067 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0027 - sd : 0.0103 min-max -0.0572|0.1103
Episode 540 with 14 steps || Reward : [-0.01  0.  ] || avg reward :  0.012 || Noise  0.582 || 0.427 seconds, mem : 8012
[0m[41mEpisode 542 with 32 steps || Reward : [ 0.1  -0.01] || avg reward :  0.013 || Noise  0.581 || 0.172 seconds, mem : 8058
[0m[41mEpisode 545 with 30 steps || Reward : [ 0.1  -0.01] || avg reward :  0.012 || Noise  0.579 || 0.196 seconds, mem : 8117
[0m[41mEpisode 548 with 32 steps || Reward : [0.   0.09] || avg reward :  0.013 || Noise  0.577 || 0.186 seconds, mem : 8177
[0mLearning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 550 
update - 

episode: 551/1000  55% ETA:  0:00:45 |/////////////////////                  | 

 torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 550 
update - q expected : mean : 0.0133 - sd : 0.0143 min-max -0.0524|0.0291
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0016 - sd : 0.0077 min-max -0.0593|0.0197
Episode 551 with 33 steps || Reward : [ 0.1  -0.01] || avg reward :  0.014 || Noise  0.576 || 0.331 seconds, mem : 8238

episode: 560/1000  56% ETA:  0:00:45 |---------------------                  | 

Episode 557 with 32 steps || Reward : [0.   0.09] || avg reward :  0.013 || Noise  0.572 || 0.156 seconds, mem : 8341
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 560 
update - q expected : mean : 0.0133 - sd : 0.0149 min-max -0.0617|0.0279
update - reward : mean : -0.0006 - sd : 0.0025 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0013 - sd : 0.0076 min-max -0.0526|0.0262
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 560 
update - q expected : mean : 0.0137 - sd : 0.0150 min-max -0.0613|0.0292
update - reward : mean : -0.0006 - sd : 0.0025 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0013 - sd : 0.0075 min-max -0.0518|0.0264
Episode 560 with 14 steps || Reward : [ 0.   -0.01] || avg reward :  0.013 || Noise  0.570 || 0.333 se

episode: 569/1000  56% ETA:  0:00:44 |\\\\\\\\\\\\\\\\\\\\\\                 | 

Episode 569 with 33 steps || Reward : [-0.01  0.1 ] || avg reward :  0.015 || Noise  0.565 || 0.161 seconds, mem : 8552
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 570 
update - q expected : mean : 0.0140 - sd : 0.0140 min-max -0.0572|0.0291
update - reward : mean : -0.0001 - sd : 0.0081 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0003 - sd : 0.0104 min-max -0.0508|0.1037
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 570 
update - q expected : mean : 0.0144 - sd : 0.0140 min-max -0.0573|0.0298
update - reward : mean : -0.0001 - sd : 0.0081 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0002 - sd : 0.0104 min-max -0.0513|0.1043
Episode 570 with 32 steps || Reward : [ 0.1  -0.01] || avg reward :  0.015 ||

episode: 572/1000  57% ETA:  0:00:44 |||||||||||||||||||||||                 | 

Episode 571 with 30 steps || Reward : [0.   0.09] || avg reward :  0.016 || Noise  0.564 || 0.130 seconds, mem : 8614
Episode 572 with 32 steps || Reward : [0.   0.09] || avg reward :  0.017 || Noise  0.564 || 0.137 seconds, mem : 8646
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 580 
update - q expected : mean : 0.0145 - sd : 0.0137 min-max -0.0550|0.0295
update - reward : mean : 0.0001 - sd : 0.0092 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0003 - sd : 0.0110 min-max -0.0483|0.1015
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 580 
update - q expected : mean : 0.0150 - sd : 0.0137 min-max -0.0547|0.0301
[42mupdate - reward : mean : 0.0001 - sd : 0.0092 min-max -0.0100|0.1000
[0mupdate - TD-Error : mean : 0.0

episode: 581/1000  58% ETA:  0:00:43 |//////////////////////                 | 

Episode 580 with 32 steps || Reward : [0.   0.09] || avg reward :  0.017 || Noise  0.559 || 0.363 seconds, mem : 8779
Episode 585 with 52 steps || Reward : [-0.01  0.1 ] || avg reward :  0.018 || Noise  0.556 || 0.243 seconds, mem : 8888
Episode 586 with 33 steps || Reward : [ 0.1  -0.01] || avg reward :  0.019 || Noise  0.556 || 0.137 seconds, mem : 8921
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 590 
update - q expected : mean : 0.0151 - sd : 0.0133 min-max -0.0554|0.0289
update - reward : mean : 0.0000 - sd : 0.0080 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0002 - sd : 0.0104 min-max -0.0496|0.1018
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 590 
update - q expected : mean : 0.0155 - sd : 0.0137 

episode: 591/1000  59% ETA:  0:00:42 |-----------------------                | 

Episode 594 with 32 steps || Reward : [0.   0.09] || avg reward :  0.018 || Noise  0.551 || 0.179 seconds, mem : 9052
Episode 599 with 31 steps || Reward : [ 0.1  -0.01] || avg reward :  0.019 || Noise  0.549 || 0.124 seconds, mem : 9139
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 600 
update - q expected : mean : 0.0159 - sd : 0.0111 min-max -0.0432|0.0297
update - reward : mean : -0.0004 - sd : 0.0050 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0016 - sd : 0.0087 min-max -0.0503|0.1014


episode: 601/1000  60% ETA:  0:00:41 |\\\\\\\\\\\\\\\\\\\\\\\                | 

learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 600 
update - q expected : mean : 0.0164 - sd : 0.0112 min-max -0.0430|0.0301
update - reward : mean : -0.0004 - sd : 0.0050 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0018 - sd : 0.0088 min-max -0.0504|0.1013
Episode 600 with 15 steps || Reward : [-0.01  0.  ] || avg reward :  0.019 || Noise  0.548 || 0.320 seconds, mem : 9154
Episode 604 with 31 steps || Reward : [ 0.1  -0.01] || avg reward :  0.020 || Noise  0.546 || 0.132 seconds, mem : 9227
Episode 608 with 33 steps || Reward : [ 0.1  -0.01] || avg reward :  0.021 || Noise  0.544 || 0.137 seconds, mem : 9303
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 610 
update - q expected : mean : 0.0148 - sd : 0.0138 m

episode: 611/1000  61% ETA:  0:00:40 ||||||||||||||||||||||||                | 

learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 610 
update - q expected : mean : 0.0153 - sd : 0.0138 min-max -0.0643|0.0302
update - reward : mean : 0.0000 - sd : 0.0080 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0001 - sd : 0.0106 min-max -0.0538|0.1012
Episode 612 with 33 steps || Reward : [ 0.1  -0.01] || avg reward :  0.021 || Noise  0.542 || 0.165 seconds, mem : 9378
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 620 
update - q expected : mean : 0.0138 - sd : 0.0150 min-max -0.0617|0.0295
update - reward : mean : -0.0000 - sd : 0.0093 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0006 - sd : 0.0123 min-max -0.0548|0.1001
learn : Next States :  torch.Size([512, 48])
-------------------------------------

episode: 621/1000  62% ETA:  0:00:39 |////////////////////////               | 


Agent 1 and episode 620 
update - q expected : mean : 0.0143 - sd : 0.0152 min-max -0.0631|0.0304
update - reward : mean : -0.0000 - sd : 0.0093 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0007 - sd : 0.0123 min-max -0.0546|0.1003
Episode 620 with 32 steps || Reward : [-0.01  0.1 ] || avg reward :  0.020 || Noise  0.537 || 0.434 seconds, mem : 9528
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 630 
update - q expected : mean : 0.0140 - sd : 0.0139 min-max -0.0571|0.0294
update - reward : mean : -0.0004 - sd : 0.0051 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0003 - sd : 0.0082 min-max -0.0446|0.1000
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 630 
update - q expected : mean : 0.0146 - sd : 0.0143 min-

episode: 631/1000  63% ETA:  0:00:38 |------------------------               | 

Episode 634 with 32 steps || Reward : [ 0.1  -0.01] || avg reward :  0.020 || Noise  0.530 || 0.157 seconds, mem : 9745
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 640 
update - q expected : mean : 0.0128 - sd : 0.0151 min-max -0.0564|0.0294
update - reward : mean : -0.0000 - sd : 0.0093 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0000 - sd : 0.0118 min-max -0.0557|0.1026
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 640 
update - q expected : mean : 0.0133 - sd : 0.0152 min-max -0.0580|0.0303
update - reward : mean : -0.0000 - sd : 0.0093 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0002 - sd : 0.0117 min-max -0.0558|0.1010


episode: 641/1000  64% ETA:  0:00:37 |\\\\\\\\\\\\\\\\\\\\\\\\               | 

Episode 640 with 32 steps || Reward : [ 0.1  -0.01] || avg reward :  0.020 || Noise  0.527 || 0.476 seconds, mem : 9848
Episode 641 with 33 steps || Reward : [0.   0.09] || avg reward :  0.021 || Noise  0.526 || 0.155 seconds, mem : 9881
Episode 645 with 32 steps || Reward : [0.   0.09] || avg reward :  0.020 || Noise  0.524 || 0.184 seconds, mem : 9956
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 650 
update - q expected : mean : 0.0123 - sd : 0.0163 min-max -0.0589|0.0288
update - reward : mean : -0.0004 - sd : 0.0069 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0003 - sd : 0.0090 min-max -0.0380|0.1063
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 650 
update - q expected : mean : 0.0127 - sd : 0.0166 m

episode: 651/1000  65% ETA:  0:00:36 ||||||||||||||||||||||||||              | 

Episode 651 with 32 steps || Reward : [0.   0.09] || avg reward :  0.019 || Noise  0.521 || 0.167 seconds, mem : 10059
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 660 
update - q expected : mean : 0.0122 - sd : 0.0158 min-max -0.0590|0.0295
update - reward : mean : -0.0006 - sd : 0.0052 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0001 - sd : 0.0091 min-max -0.0549|0.1032
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 660 
update - q expected : mean : 0.0126 - sd : 0.0159 min-max -0.0593|0.0298
update - reward : mean : -0.0006 - sd : 0.0052 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0000 - sd : 0.0092 min-max -0.0546|0.1041


episode: 661/1000  66% ETA:  0:00:35 |/////////////////////////              | 

Episode 660 with 15 steps || Reward : [-0.01  0.  ] || avg reward :  0.018 || Noise  0.516 || 0.301 seconds, mem : 10187
Episode 663 with 33 steps || Reward : [-0.01  0.1 ] || avg reward :  0.019 || Noise  0.515 || 0.234 seconds, mem : 10269
Episode 664 with 33 steps || Reward : [ 0.1  -0.01] || avg reward :  0.019 || Noise  0.514 || 0.182 seconds, mem : 10302

episode: 668/1000  66% ETA:  0:00:35 |--------------------------             | 

Episode 665 with 32 steps || Reward : [ 0.1  -0.01] || avg reward :  0.020 || Noise  0.514 || 0.179 seconds, mem : 10334
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 670 
update - q expected : mean : 0.0122 - sd : 0.0164 min-max -0.0653|0.0286
update - reward : mean : -0.0005 - sd : 0.0021 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0018 - sd : 0.0058 min-max -0.0355|0.0273
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 670 
update - q expected : mean : 0.0128 - sd : 0.0165 min-max -0.0645|0.0302
update - reward : mean : -0.0005 - sd : 0.0021 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0016 - sd : 0.0059 min-max -0.0392|0.0262
Episode 670 with 30 steps || Reward : [0.   0.09] || avg reward :  0.019 || Noise  0.511 || 0.

episode: 676/1000  67% ETA:  0:00:34 |\\\\\\\\\\\\\\\\\\\\\\\\\\             | 

Episode 674 with 35 steps || Reward : [0.1  0.09] || avg reward :  0.018 || Noise  0.509 || 0.141 seconds, mem : 10499
Episode 678 with 33 steps || Reward : [0.   0.09] || avg reward :  0.019 || Noise  0.507 || 0.140 seconds, mem : 10575
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 680 
update - q expected : mean : 0.0128 - sd : 0.0152 min-max -0.0586|0.0294
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0004 - sd : 0.0060 min-max -0.0380|0.0274
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 680 
update - q expected : mean : 0.0133 - sd : 0.0153 min-max -0.0582|0.0308
update - reward : mean : -0.0006 - sd : 0.0024 min-max -0.0100|0.0000
update - TD-Error : mean : 0.0003 - sd : 0.0

episode: 686/1000  68% ETA:  0:00:33 |||||||||||||||||||||||||||             | 

Episode 682 with 30 steps || Reward : [ 0.1  -0.01] || avg reward :  0.019 || Noise  0.505 || 0.133 seconds, mem : 10648
Episode 686 with 32 steps || Reward : [0.   0.09] || avg reward :  0.018 || Noise  0.503 || 0.136 seconds, mem : 10722
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 690 
update - q expected : mean : 0.0135 - sd : 0.0141 min-max -0.0512|0.0300
update - reward : mean : 0.0000 - sd : 0.0093 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0006 - sd : 0.0111 min-max -0.0409|0.1030
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 690 
update - q expected : mean : 0.0140 - sd : 0.0143 min-max -0.0508|0.0300
update - reward : mean : 0.0000 - sd : 0.0093 min-max -0.0100|0.1000
update - TD-Error : mean 

episode: 696/1000  69% ETA:  0:00:32 |///////////////////////////            | 

Episode 694 with 32 steps || Reward : [ 0.1  -0.01] || avg reward :  0.018 || Noise  0.499 || 0.131 seconds, mem : 10853
Episode 699 with 32 steps || Reward : [0.   0.09] || avg reward :  0.018 || Noise  0.496 || 0.129 seconds, mem : 10942
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 700 
update - q expected : mean : 0.0137 - sd : 0.0139 min-max -0.0528|0.0296
update - reward : mean : -0.0004 - sd : 0.0050 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0003 - sd : 0.0074 min-max -0.0446|0.0928
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 700 
update - q expected : mean : 0.0141 - sd : 0.0141 min-max -0.0528|0.0299
update - reward : mean : -0.0004 - sd : 0.0050 min-max -0.0100|0.1000
update - TD-Error : mea

episode: 706/1000  70% ETA:  0:00:31 |---------------------------            | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 710 
update - q expected : mean : 0.0138 - sd : 0.0141 min-max -0.0522|0.0297
update - reward : mean : 0.0002 - sd : 0.0092 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0001 - sd : 0.0111 min-max -0.0433|0.1003
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 710 
update - q expected : mean : 0.0143 - sd : 0.0144 min-max -0.0526|0.0302
update - reward : mean : 0.0002 - sd : 0.0092 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0002 - sd : 0.0110 min-max -0.0424|0.0994
Episode 711 with 32 steps || Reward : [0.   0.09] || avg reward :  0.017 || Noise  0.490 || 0.177 seconds, mem : 11130

episode: 715/1000  71% ETA:  0:00:30 |\\\\\\\\\\\\\\\\\\\\\\\\\\\            | 

Episode 713 with 32 steps || Reward : [ 0.1  -0.01] || avg reward :  0.017 || Noise  0.490 || 0.159 seconds, mem : 11176
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 720 
update - q expected : mean : 0.0143 - sd : 0.0127 min-max -0.0424|0.0298
update - reward : mean : -0.0003 - sd : 0.0049 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0007 - sd : 0.0070 min-max -0.0325|0.0985
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 720 
update - q expected : mean : 0.0147 - sd : 0.0129 min-max -0.0424|0.0300
update - reward : mean : -0.0003 - sd : 0.0049 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0007 - sd : 0.0071 min-max -0.0356|0.0985
Episode 720 with 14 steps || Reward : [-0.01  0.  ] || avg reward :  0.016 || N

episode: 725/1000  72% ETA:  0:00:29 |||||||||||||||||||||||||||||           | 

Episode 726 with 30 steps || Reward : [ 0.1  -0.01] || avg reward :  0.017 || Noise  0.483 || 0.123 seconds, mem : 11395
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 730 
update - q expected : mean : 0.0141 - sd : 0.0135 min-max -0.0418|0.0297
update - reward : mean : 0.0001 - sd : 0.0092 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0016 - sd : 0.0106 min-max -0.0433|0.0983
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 730 
update - q expected : mean : 0.0145 - sd : 0.0136 min-max -0.0417|0.0301
update - reward : mean : 0.0001 - sd : 0.0092 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0016 - sd : 0.0106 min-max -0.0440|0.0982


episode: 733/1000  73% ETA:  0:00:28 |////////////////////////////           | 

Episode 731 with 33 steps || Reward : [ 0.1  -0.01] || avg reward :  0.018 || Noise  0.481 || 0.147 seconds, mem : 11484
Episode 732 with 33 steps || Reward : [ 0.1  -0.01] || avg reward :  0.019 || Noise  0.480 || 0.170 seconds, mem : 11517
Episode 735 with 33 steps || Reward : [0.   0.09] || avg reward :  0.019 || Noise  0.479 || 0.146 seconds, mem : 11578
Episode 736 with 32 steps || Reward : [ 0.1  -0.01] || avg reward :  0.020 || Noise  0.478 || 0.139 seconds, mem : 11610
Episode 738 with 33 steps || Reward : [ 0.1  -0.01] || avg reward :  0.021 || Noise  0.477 || 0.140 seconds, mem : 11658
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 740 
update - q expected : mean : 0.0135 - sd : 0.0141 min-max -0.0403|0.0296
update - reward : mean : -0.0000 - sd : 

episode: 741/1000  74% ETA:  0:00:27 |----------------------------           | 

learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 740 
update - q expected : mean : 0.0140 - sd : 0.0141 min-max -0.0410|0.0301
update - reward : mean : -0.0000 - sd : 0.0080 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0006 - sd : 0.0098 min-max -0.0510|0.0992
Episode 740 with 14 steps || Reward : [ 0.   -0.01] || avg reward :  0.020 || Noise  0.476 || 0.304 seconds, mem : 11686
Episode 745 with 32 steps || Reward : [ 0.1  -0.01] || avg reward :  0.019 || Noise  0.474 || 0.134 seconds, mem : 11776
Episode 746 with 32 steps || Reward : [ 0.1  -0.01] || avg reward :  0.020 || Noise  0.474 || 0.132 seconds, mem : 11808
Episode 748 with 26 steps || Reward : [ 0.1  -0.01] || avg reward :  0.021 || Noise  0.473 || 0.132 seconds, mem : 11848
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torc

episode: 751/1000  75% ETA:  0:00:26 |\\\\\\\\\\\\\\\\\\\\\\\\\\\\\          | 

Episode 751 with 32 steps || Reward : [ 0.1  -0.01] || avg reward :  0.021 || Noise  0.471 || 0.174 seconds, mem : 11909
Episode 756 with 32 steps || Reward : [0.   0.09] || avg reward :  0.022 || Noise  0.469 || 0.128 seconds, mem : 12002
Episode 758 with 31 steps || Reward : [0.   0.09] || avg reward :  0.023 || Noise  0.468 || 0.132 seconds, mem : 12047
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 760 
update - q expected : mean : 0.0120 - sd : 0.0159 min-max -0.0558|0.0290
update - reward : mean : -0.0003 - sd : 0.0050 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0009 - sd : 0.0078 min-max -0.0461|0.0982


episode: 761/1000  76% ETA:  0:00:25 ||||||||||||||||||||||||||||||          | 

learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 760 
update - q expected : mean : 0.0125 - sd : 0.0158 min-max -0.0589|0.0299
update - reward : mean : -0.0003 - sd : 0.0050 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0009 - sd : 0.0078 min-max -0.0510|0.0974
Episode 760 with 14 steps || Reward : [-0.01  0.  ] || avg reward :  0.023 || Noise  0.467 || 0.313 seconds, mem : 12075
Episode 762 with 32 steps || Reward : [ 0.1  -0.01] || avg reward :  0.024 || Noise  0.466 || 0.146 seconds, mem : 12122
Episode 764 with 45 steps || Reward : [ 0.1  -0.01] || avg reward :  0.023 || Noise  0.465 || 0.203 seconds, mem : 12181
Episode 765 with 32 steps || Reward : [0.   0.09] || avg reward :  0.023 || Noise  0.465 || 0.139 seconds, mem : 12213
Episode 766 with 33 steps || Reward : [ 0.1  -0.01] || avg reward :  0.024 || Noise  0.464 || 0.147 seconds, mem : 12246
Learning shape :  torch.S

episode: 771/1000  77% ETA:  0:00:24 |//////////////////////////////         | 

Episode 770 with 33 steps || Reward : [ 0.1  -0.01] || avg reward :  0.024 || Noise  0.462 || 0.365 seconds, mem : 12322
Episode 773 with 32 steps || Reward : [0.   0.09] || avg reward :  0.025 || Noise  0.461 || 0.140 seconds, mem : 12382
Episode 776 with 33 steps || Reward : [-0.01  0.1 ] || avg reward :  0.025 || Noise  0.460 || 0.146 seconds, mem : 12443
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 780 
update - q expected : mean : 0.0113 - sd : 0.0164 min-max -0.0509|0.0295
update - reward : mean : -0.0008 - sd : 0.0027 min-max -0.0100|0.0000
update - TD-Error : mean : -0.0004 - sd : 0.0059 min-max -0.0358|0.0184
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 780 
update - q expected : mean : 0.0116 - sd : 0.0166 min-

episode: 781/1000  78% ETA:  0:00:23 |------------------------------         | 

Episode 780 with 14 steps || Reward : [ 0.   -0.01] || avg reward :  0.024 || Noise  0.458 || 0.291 seconds, mem : 12500
Episode 787 with 29 steps || Reward : [ 0.1  -0.01] || avg reward :  0.023 || Noise  0.455 || 0.124 seconds, mem : 12614
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 790 
update - q expected : mean : 0.0117 - sd : 0.0139 min-max -0.0469|0.0301
update - reward : mean : 0.0010 - sd : 0.0127 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0018 - sd : 0.0136 min-max -0.0537|0.1028
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 790 
update - q expected : mean : 0.0120 - sd : 0.0139 min-max -0.0478|0.0306
update - reward : mean : 0.0010 - sd : 0.0127 min-max -0.0100|0.1000
update - TD-Error : mean : 0.

episode: 791/1000  79% ETA:  0:00:22 |\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\         | 

Episode 792 with 33 steps || Reward : [-0.01  0.1 ] || avg reward :  0.024 || Noise  0.452 || 0.143 seconds, mem : 12704
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 800 
update - q expected : mean : 0.0118 - sd : 0.0142 min-max -0.0421|0.0296
update - reward : mean : -0.0003 - sd : 0.0068 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0001 - sd : 0.0089 min-max -0.0481|0.0978
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 800 
update - q expected : mean : 0.0121 - sd : 0.0142 min-max -0.0438|0.0303
update - reward : mean : -0.0003 - sd : 0.0068 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0001 - sd : 0.0089 min-max -0.0474|0.0973


episode: 801/1000  80% ETA:  0:00:21 ||||||||||||||||||||||||||||||||        | 

Episode 800 with 32 steps || Reward : [0.   0.09] || avg reward :  0.023 || Noise  0.449 || 0.370 seconds, mem : 12835
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 810 
update - q expected : mean : 0.0122 - sd : 0.0146 min-max -0.0437|0.0295
update - reward : mean : -0.0002 - sd : 0.0067 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0001 - sd : 0.0084 min-max -0.0550|0.0950
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 810 
update - q expected : mean : 0.0126 - sd : 0.0148 min-max -0.0442|0.0296
update - reward : mean : -0.0002 - sd : 0.0067 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0000 - sd : 0.0083 min-max -0.0550|0.0947


episode: 811/1000  81% ETA:  0:00:20 |///////////////////////////////        | 

Episode 817 with 33 steps || Reward : [ 0.1  -0.01] || avg reward :  0.022 || Noise  0.441 || 0.160 seconds, mem : 13119
Episode 818 with 33 steps || Reward : [0.   0.09] || avg reward :  0.023 || Noise  0.441 || 0.161 seconds, mem : 13152
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 820 
update - q expected : mean : 0.0123 - sd : 0.0140 min-max -0.0445|0.0296
update - reward : mean : 0.0000 - sd : 0.0080 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0004 - sd : 0.0097 min-max -0.0459|0.0984
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 820 
update - q expected : mean : 0.0128 - sd : 0.0140 min-max -0.0417|0.0299
update - reward : mean : 0.0000 - sd : 0.0080 min-max -0.0100|0.1000
update - TD-Error : mean :

episode: 821/1000  82% ETA:  0:00:19 |--------------------------------       | 

Episode 820 with 15 steps || Reward : [-0.01  0.  ] || avg reward :  0.023 || Noise  0.440 || 0.301 seconds, mem : 13181
Episode 821 with 30 steps || Reward : [0.   0.09] || avg reward :  0.024 || Noise  0.439 || 0.130 seconds, mem : 13211
Episode 824 with 32 steps || Reward : [ 0.1  -0.01] || avg reward :  0.025 || Noise  0.438 || 0.168 seconds, mem : 13272
Episode 826 with 34 steps || Reward : [0.   0.09] || avg reward :  0.025 || Noise  0.437 || 0.163 seconds, mem : 13326

episode: 831/1000  83% ETA:  0:00:18 |\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\       | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 830 
update - q expected : mean : 0.0122 - sd : 0.0138 min-max -0.0396|0.0299
update - reward : mean : -0.0003 - sd : 0.0068 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0008 - sd : 0.0095 min-max -0.0483|0.1128
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 830 
update - q expected : mean : 0.0125 - sd : 0.0140 min-max -0.0417|0.0300
update - reward : mean : -0.0003 - sd : 0.0068 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0005 - sd : 0.0095 min-max -0.0473|0.1159
Episode 837 with 33 steps || Reward : [ 0.1  -0.01] || avg reward :  0.022 || Noise  0.432 || 0.137 seconds, mem : 13501
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1

episode: 841/1000  84% ETA:  0:00:16 |||||||||||||||||||||||||||||||||       | 

Episode 840 with 14 steps || Reward : [-0.01  0.  ] || avg reward :  0.021 || Noise  0.431 || 0.299 seconds, mem : 13543
Episode 841 with 33 steps || Reward : [-0.01  0.1 ] || avg reward :  0.022 || Noise  0.431 || 0.152 seconds, mem : 13576
Episode 842 with 33 steps || Reward : [ 0.1  -0.01] || avg reward :  0.023 || Noise  0.430 || 0.154 seconds, mem : 13609
Episode 843 with 33 steps || Reward : [ 0.1  -0.01] || avg reward :  0.024 || Noise  0.430 || 0.149 seconds, mem : 13642
Episode 845 with 32 steps || Reward : [0.   0.09] || avg reward :  0.024 || Noise  0.429 || 0.131 seconds, mem : 13688
Episode 847 with 33 steps || Reward : [ 0.1  -0.02] || avg reward :  0.024 || Noise  0.428 || 0.136 seconds, mem : 13735
Episode 849 with 33 steps || Reward : [ 0.1  -0.01] || avg reward :  0.024 || Noise  0.427 || 0.132 seconds, mem : 13782

episode: 850/1000  85% ETA:  0:00:16 |/////////////////////////////////      | 


Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 850 
update - q expected : mean : 0.0121 - sd : 0.0142 min-max -0.0416|0.0301
update - reward : mean : -0.0000 - sd : 0.0081 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0000 - sd : 0.0097 min-max -0.0539|0.1016
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 850 
update - q expected : mean : 0.0126 - sd : 0.0144 min-max -0.0456|0.0303
update - reward : mean : -0.0000 - sd : 0.0081 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0001 - sd : 0.0098 min-max -0.0537|0.1013
Episode 850 with 33 steps || Reward : [ 0.1  -0.01] || avg reward :  0.025 || Noise  0.427 || 0.432 seconds, mem : 13815

episode: 859/1000  85% ETA:  0:00:15 |---------------------------------      | 

Episode 856 with 32 steps || Reward : [-0.01  0.1 ] || avg reward :  0.024 || Noise  0.424 || 0.134 seconds, mem : 13917
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 860 
update - q expected : mean : 0.0120 - sd : 0.0149 min-max -0.0454|0.0297
update - reward : mean : -0.0003 - sd : 0.0068 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0007 - sd : 0.0083 min-max -0.0328|0.0950
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 860 
update - q expected : mean : 0.0125 - sd : 0.0151 min-max -0.0456|0.0300
update - reward : mean : -0.0003 - sd : 0.0068 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0006 - sd : 0.0085 min-max -0.0371|0.0977
Episode 860 with 14 steps || Reward : [ 0.   -0.01] || avg reward :  0.023 || N

episode: 869/1000  86% ETA:  0:00:13 |\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\      | 

Episode 869 with 33 steps || Reward : [ 0.1  -0.01] || avg reward :  0.021 || Noise  0.419 || 0.145 seconds, mem : 14146
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 870 
update - q expected : mean : 0.0121 - sd : 0.0145 min-max -0.0467|0.0298
update - reward : mean : 0.0005 - sd : 0.0101 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0017 - sd : 0.0108 min-max -0.0309|0.0986
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 870 
update - q expected : mean : 0.0126 - sd : 0.0147 min-max -0.0466|0.0302
update - reward : mean : 0.0005 - sd : 0.0101 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0017 - sd : 0.0109 min-max -0.0320|0.0987
Episode 870 with 32 steps || Reward : [ 0.1  -0.01] || avg reward :  0.021 || 

episode: 877/1000  87% ETA:  0:00:13 |||||||||||||||||||||||||||||||||||     | 

Episode 876 with 33 steps || Reward : [ 0.1  -0.01] || avg reward :  0.022 || Noise  0.416 || 0.149 seconds, mem : 14301
Episode 877 with 33 steps || Reward : [ 0.1  -0.01] || avg reward :  0.023 || Noise  0.415 || 0.160 seconds, mem : 14334
Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 880 
update - q expected : mean : 0.0122 - sd : 0.0142 min-max -0.0484|0.0306
update - reward : mean : 0.0003 - sd : 0.0091 min-max -0.0100|0.1000
update - TD-Error : mean : 0.0010 - sd : 0.0104 min-max -0.0390|0.1008
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 880 
update - q expected : mean : 0.0128 - sd : 0.0143 min-max -0.0498|0.0306
update - reward : mean : 0.0003 - sd : 0.0091 min-max -0.0100|0.1000
update - TD-Error : mean

episode: 887/1000  88% ETA:  0:00:12 |//////////////////////////////////     | 

Learning shape :  torch.Size([512, 48]) torch.Size([512, 4]) torch.Size([512, 1]) torch.Size([512, 48]) torch.Size([512, 1])
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 0 and episode 890 
update - q expected : mean : 0.0126 - sd : 0.0147 min-max -0.0453|0.0300
update - reward : mean : -0.0000 - sd : 0.0081 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0005 - sd : 0.0091 min-max -0.0272|0.0974
learn : Next States :  torch.Size([512, 48])
--------------------------------------
Agent 1 and episode 890 
update - q expected : mean : 0.0133 - sd : 0.0147 min-max -0.0453|0.0304
update - reward : mean : -0.0000 - sd : 0.0081 min-max -0.0100|0.1000
update - TD-Error : mean : -0.0007 - sd : 0.0091 min-max -0.0268|0.0983


### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment! When training the agent, set `train_mode=True` so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```
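To show where that reset call sits in practice, here is a minimal sketch of a single-episode rollout. It assumes the `unityagents` interface used in this project (`env.reset(...)[brain_name]`, `env.step(...)[brain_name]`, and the `vector_observations`, `rewards`, and `local_done` fields). The names `run_episode`, `act`, `StubEnv`, and `_StubInfo` are hypothetical and exist only for illustration. The stub stands in for the real Tennis environment so the snippet runs without Unity; its 24-wide observations and 2-wide actions are assumptions chosen to match the 48- and 4-wide joint tensors in the training log above.

```python
import numpy as np


def run_episode(env, brain_name, act, train_mode=True, max_steps=1000):
    """Play one episode and return each agent's undiscounted reward sum."""
    env_info = env.reset(train_mode=train_mode)[brain_name]
    states = env_info.vector_observations          # one row per agent
    scores = np.zeros(states.shape[0])
    for _ in range(max_steps):
        # Assumed action range [-1, 1]; clip whatever the policy returns.
        actions = np.clip(act(states), -1.0, 1.0)
        env_info = env.step(actions)[brain_name]
        scores += np.asarray(env_info.rewards)
        states = env_info.vector_observations
        if np.any(env_info.local_done):            # stop once any agent is done
            break
    return scores


class _StubInfo:
    """Hypothetical stand-in for the per-brain info object."""

    def __init__(self, obs, rewards, done):
        self.vector_observations = obs
        self.rewards = rewards
        self.local_done = done


class StubEnv:
    """Hypothetical two-agent environment that ends after `episode_len` steps."""

    def __init__(self, brain_name="TennisBrain", episode_len=5):
        self.brain_name = brain_name
        self.episode_len = episode_len
        self.t = 0
        self.last_train_mode = None

    def _info(self, rewards):
        done = self.t >= self.episode_len
        info = _StubInfo(np.zeros((2, 24)), rewards, [done, done])
        return {self.brain_name: info}

    def reset(self, train_mode=True):
        self.t = 0
        self.last_train_mode = train_mode
        return self._info([0.0, 0.0])

    def step(self, actions):
        self.t += 1
        return self._info([0.1, 0.0])              # fixed illustrative rewards


def random_policy(states):
    """Placeholder policy: one random 2-dimensional action per agent."""
    return np.random.randn(states.shape[0], 2)


stub = StubEnv()
scores = run_episode(stub, "TennisBrain", random_policy, train_mode=True)
print(scores)
```

With the real environment, you would pass `env` and `brain_name` from the earlier cells in place of the stub and replace `random_policy` with your agent's action function.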