In [1]:
import grid2op
from grid2op.Reward import ConstantReward, FlatReward
import sys
import os
import numpy as np
TRAINING_STEP = 10



## Download more data for the default environment.

A large amount of data is available for the default "case14_redisp" environment. Bundling it with the package is not practical, so it is released separately and can be fetched with a download utility. To download it into the default directory ("~/data_grid2op/case14_redisp") on a Linux-based system, uncomment and run the following command.

In [2]:
# !$sys.executable -m grid2op.download --name "case14_redisp"
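
Before building the environment, it can be handy to check whether the download actually landed in the expected directory. The snippet below is a minimal sketch using only the standard library; the helper name `dataset_is_available` is hypothetical (it is not part of grid2op), and it only checks that the directory exists and is not empty, not that its content is valid.

```python
import os


def dataset_is_available(path="~/data_grid2op/case14_redisp"):
    """Return True if `path` is an existing, non-empty directory.

    Hypothetical helper, not a grid2op function: it does not validate
    the files found in the directory.
    """
    full_path = os.path.expanduser(path)
    return os.path.isdir(full_path) and len(os.listdir(full_path)) > 0


# the result depends on whether the data was downloaded on this machine
print(dataset_is_available())
```

If this prints `False`, the download command above has probably not been run, or the data was stored somewhere else.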

## Make a regular environment and agent

Now that the dataset is downloaded, we can create an environment that uses all the available data by running the cell below. If you see an error or a warning, consider re-downloading the data, or adapt the keyword argument "chronics_path" to match the path where the data was downloaded.

In [3]:
try:
    env = grid2op.make(name_env="case14_redisp", chronics_path=os.path.expanduser("~/data_grid2op/case14_redisp"))
except Exception as exc:
    print("Please read the above cell: it appears you have not downloaded the dataset, "
          "or you saved it in an unexpected directory. "
          "I will continue with only 2 sets.")
    env = grid2op.make(name_env="case14_redisp")

## Train a standard RL Agent

Make sure you are using a computer with at least 4 cores if you want to notice some speed-ups.

In [4]:
from grid2op.MultiEnv import MultiEnvironment
from grid2op.Agent import DoNothingAgent
NUM_CORE = 4

### Using the standard OpenAI Gym loop

Here we demonstrate how to use the multi environment class. First let's create a multi environment.

In [5]:
# create a simple agent
agent = DoNothingAgent(env.action_space)

# create the multi environment class
multi_envs = MultiEnvironment(env=env, nb_env=NUM_CORE)

A multi-environment behaves like a regular environment, but instead of handling one action and one observation at a time, it must be sent several actions at once and returns a list of observations.

It requires a grid2op environment to be initialized, and it creates dedicated "workers", each a replica of the initial environment. None of the workers can be accessed directly. The supported methods are:
- multi_env.reset
- multi_env.step
- multi_env.close

These behave similarly to "env.reset", "env.step" and "env.close".


It can be used in the following manner.

In [6]:
# initialize some variables with the proper dimension
obss = multi_envs.reset()
rews = [env.reward_range[0] for i in range(NUM_CORE)]
dones = [False for i in range(NUM_CORE)]
obss

array([<grid2op.Observation.CompleteObservation object at 0x7fa4d523b8d0>,
       <grid2op.Observation.CompleteObservation object at 0x7fa4d523e5f8>,
       <grid2op.Observation.CompleteObservation object at 0x7fa4d523b9e8>,
       <grid2op.Observation.CompleteObservation object at 0x7fa4d523e6a0>],
      dtype=object)

As you can see, obss is not a single observation but a list (a numpy array, to be precise) of 4 observations, each one coming from a given "worker" environment.

Worker environments are always called in the same order. It means the first observation of this vector will always correspond to the first worker environment. 


As with observations, the "step" function of a multi-environment takes a list of actions as input, and each action is applied in its own worker environment. It returns a list of observations, a list of rewards, a list of booleans indicating whether each worker environment suffered a game over (in that case the worker environment is automatically restarted using the "reset" method), and the additional information returned by the workers.

Because worker environments are always called in the same order, the first action sent to the "multi_env.step" function is also applied to the first environment.
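
To make this contract concrete, here is a minimal sketch of a sequential stand-in. `ToyEnv` and `ToyMultiEnv` are hypothetical classes written for illustration only: they do not use grid2op and run no parallel processes. They only mimic what is described above: one action per worker in, arrays out, a stable worker order, and an automatic reset after a game over. Returning the post-reset observation for a finished worker is an assumption of this toy.

```python
import numpy as np


class ToyEnv:
    """Hypothetical single environment: the observation is a step counter,
    and the episode ends after `horizon` steps."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        # the reward simply echoes the action so the ordering is easy to check
        return self.t, float(action), done, {}


class ToyMultiEnv:
    """Hypothetical sequential stand-in for a multi-environment."""

    def __init__(self, envs):
        self.envs = list(envs)

    def reset(self):
        return np.array([env.reset() for env in self.envs])

    def step(self, actions):
        assert len(actions) == len(self.envs), "one action per worker is required"
        obss, rews, dones, infos = [], [], [], []
        # workers are always visited in the same order
        for env, act in zip(self.envs, actions):
            obs, rew, done, info = env.step(act)
            if done:
                # assumption of this toy: a finished worker is reset right away
                obs = env.reset()
            obss.append(obs)
            rews.append(rew)
            dones.append(done)
            infos.append(info)
        return np.array(obss), np.array(rews), np.array(dones), infos


multi = ToyMultiEnv([ToyEnv(horizon=2), ToyEnv(horizon=3)])
obss = multi.reset()
obss, rews, dones, infos = multi.step([10, 20])
obss, rews, dones, infos = multi.step([11, 21])
print(obss, rews, dones)
```

After the second call, the first worker has reached its horizon and has been reset, while the second one is still running; the rewards keep the order in which the actions were given.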

It can be used as follows:

In [7]:
# initialize the vector of actions that will be processed by each worker environment.
acts = [None for _ in range(NUM_CORE)]
for env_act_id in range(NUM_CORE):
    acts[env_act_id] = agent.act(obss[env_act_id], rews[env_act_id], dones[env_act_id])
    
# feed them to the multi_env
obss, rews, dones, infos = multi_envs.step(acts)

# as explained, this is a vector of Observation (as many as NUM_CORE in this example)
obss

array([<grid2op.Observation.CompleteObservation object at 0x7fa4d5246be0>,
       <grid2op.Observation.CompleteObservation object at 0x7fa4d5246978>,
       <grid2op.Observation.CompleteObservation object at 0x7fa4d52469e8>,
       <grid2op.Observation.CompleteObservation object at 0x7fa4d5246b38>],
      dtype=object)

The multi environment loop is really close to the "gym" loop:

In [8]:
# perform the appropriate number of steps
for i in range(TRAINING_STEP):
    acts = [None for _ in range(NUM_CORE)]
    for env_act_id in range(NUM_CORE):
        acts[env_act_id] = agent.act(obss[env_act_id], rews[env_act_id], dones[env_act_id])
    obss, rews, dones, infos = multi_envs.step(acts)

    # DO SOMETHING WITH THE AGENT IF YOU WANT
    ## agent.train(obss, rews, dones)
    

# close the environments created by the multi_env
multi_envs.close()

In the above example, `TRAINING_STEP` steps are performed on `NUM_CORE` environments in parallel. The agent has therefore acted `TRAINING_STEP * NUM_CORE` times in total (`10 * 4 = 40` by default), spread over `NUM_CORE` different environments.
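
The count can be checked with a small stand-alone sketch. `CountingAgent` and `count_agent_calls` are hypothetical names used for illustration; the agent only records how often `act` is called and no environment is stepped.

```python
class CountingAgent:
    """Hypothetical agent that only counts how many times `act` is called."""

    def __init__(self):
        self.nb_calls = 0

    def act(self, observation, reward, done):
        self.nb_calls += 1
        return None  # stands in for a "do nothing" action


def count_agent_calls(training_step, num_core):
    """Replay the structure of the loop above and return the number of `act` calls."""
    counting_agent = CountingAgent()
    obss = [None] * num_core
    rews = [0.0] * num_core
    dones = [False] * num_core
    for _ in range(training_step):
        # one action per worker, built in worker order
        acts = [counting_agent.act(o, r, d) for o, r, d in zip(obss, rews, dones)]
        assert len(acts) == num_core
        # a real loop would call multi_envs.step(acts) here
    return counting_agent.nb_calls


print(count_agent_calls(training_step=10, num_core=4))  # 40
```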

### Using the training agent class from the previous notebook

We reuse the code of the notebook [3_TrainingAnAgent](3_TrainingAnAgent.ipynb) to train a new agent, but this time using more than one process.

In [9]:
from ml_agent import TrainAgent, DeepQAgent, TrainingParam
from grid2op.Reward import RedispReward
import numpy as np
import random
import warnings
import pdb
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=FutureWarning)
    import keras
    import keras.backend as K
    from keras.models import load_model, Sequential, Model
    from keras.optimizers import Adam
    from keras.layers.core import Activation, Dropout, Flatten, Dense
    from keras.layers import Input, Lambda

Using TensorFlow backend.


We redefine the class used to train the agent.

In [10]:
class TrainAgentMultiEnv(TrainAgent):
    def __init__(self, agent, nb_process, reward_fun=RedispReward, env=None):
        TrainAgent.__init__(self, agent, reward_fun=reward_fun, env=env)
        self.nb_process = nb_process
        self.multi_envs = None
        
    def _build_valid_env(self, training_param):
        create_new = False
        if self.multi_envs is None:
#             create_new = super()._build_valid_env(training_param)
            self.multi_envs = MultiEnvironment(env=self.env, nb_env=self.nb_process)
            
            # make sure the environment is reset
            obss = self.multi_envs.reset()
            for worker_id in range(self.nb_process):
                self.agent.process_buffer.append(self.agent.convert_obs(obss[worker_id]))
            do_nothing = [self.env.action_space() for _ in range(self.nb_process)]
            for _ in range(training_param.NUM_FRAMES-1):
                # Initialize buffer with the first frames
                s1, r1, _, _ = self.multi_envs.step(do_nothing)
                for worker_id in range(self.nb_process):
                    self.agent.process_buffer.append(self.agent.convert_obs(s1[worker_id])) 
        return create_new
    
    def train(self, num_frames, training_param=TrainingParam()):
        # this function existed in the original implementation, but has been slightly adapted.
        
        # first we create an environment or make sure the given environment is valid
        close_env = self._build_valid_env(training_param)
        
        # below that, only slight modifications have been made. They are highlighted
        observation_num = 0
        curr_state = self.agent.convert_process_buffer()
        
        # it's a bit less convenient than using the SpaceInvader environment.
        # first we need to initialize the neural network
        self.agent.init_deep_q(curr_state[0])
        # TODO it's weird to use the process buffer for this purpose...
            
        epsilon = training_param.INITIAL_EPSILON
        alive_frame = np.zeros(self.nb_process, dtype=int)
        total_reward = np.zeros(self.nb_process, dtype=float)

        while observation_num < num_frames:
            if observation_num % 1000 == 999:
                print(("Executing loop %d" %observation_num))

            # Slowly decay epsilon (the exploration rate)
            if epsilon > training_param.FINAL_EPSILON:
                epsilon -= (training_param.INITIAL_EPSILON-training_param.FINAL_EPSILON)/training_param.EPSILON_DECAY

            initial_state = self.agent.convert_process_buffer()
#             pdb.set_trace()
            self.agent.process_buffer = []
            
            # TODO vectorize that in the Agent directly
            # ADDED
            predict_movement_int = []
            predict_q_value = []
            acts = []
            # then we need to predict the next moves
            for p_id in range(self.nb_process):
                pm_i, pq_v = self.agent.deep_q.predict_movement(curr_state[p_id], epsilon)
                predict_movement_int.append(pm_i)
                predict_q_value.append(pq_v)
                
                # and then we convert it to a valid action
                acts.append(self.agent.convert_act(pm_i))
            
            reward, done = np.zeros(self.nb_process), np.full(self.nb_process, fill_value=False, dtype=bool)
            for i in range(training_param.NUM_FRAMES):
                temp_observation_obj, temp_reward, temp_done, _ = self.multi_envs.step(acts)
                # here it has been adapted too. The observation returned by the environment is
                # first converted to a vector
                
                # below this line no changes have been made to the original implementation.
                reward[~temp_done] += temp_reward[~temp_done]
                
                
                for obs in temp_observation_obj:
                    # ADDED
                    self.agent.process_buffer.append(self.agent.convert_obs(obs))
                    
                done = done | temp_done

                # TODO fix that too
                alive_frame[~temp_done] += 1
            
                for env_done_idx in np.where(temp_done)[0]:
                    print("For env with id {}".format(env_done_idx))
                    print("\tLived with maximum time ", alive_frame[env_done_idx])
                    print("\tEarned a total of reward equal to ", total_reward[env_done_idx])
                
                reward[temp_done] = 0.
                alive_frame[temp_done] = 0
            
            new_state = self.agent.convert_process_buffer()
            for sub_env_id in range(self.nb_process):
                # ADDED
                self.agent.replay_buffer.add(initial_state[sub_env_id],
                                             predict_movement_int[sub_env_id],
                                             reward[sub_env_id],
                                             done[sub_env_id],
                                             new_state[sub_env_id])
            total_reward += reward
            if self.agent.replay_buffer.size() > training_param.MIN_OBSERVATION:
                s_batch, a_batch, r_batch, d_batch, s2_batch = self.agent.replay_buffer.sample(training_param.MINIBATCH_SIZE)
                self.agent.deep_q.train(s_batch, a_batch, r_batch, d_batch, s2_batch, observation_num)
                self.agent.deep_q.target_train()

            # Save the network every 10000 iterations
            if observation_num % 10000 == 9999 or observation_num == num_frames-1:
                print("Saving Network")
                self.agent.deep_q.save_network("saved_notebook6.h5")

#             alive_frame += 1
            observation_num += 1
            
        if close_env:
            print("closing env")
            self.env.close()
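
The `epsilon` update in the loop above removes the same small amount at every iteration until the final value is reached. The sketch below writes that linear schedule in closed form. The function name `linear_epsilon` and the default values are assumptions chosen for illustration; they are not the values stored in `TrainingParam`.

```python
def linear_epsilon(step, initial_epsilon=1.0, final_epsilon=0.01, epsilon_decay=1000):
    """Closed form of a linear epsilon decay.

    Epsilon goes from `initial_epsilon` to `final_epsilon` in `epsilon_decay`
    steps and stays there afterwards. Default values are illustrative only.
    """
    decrement = (initial_epsilon - final_epsilon) / epsilon_decay
    return max(final_epsilon, initial_epsilon - decrement * step)


for step in (0, 500, 1000, 5000):
    print(step, linear_epsilon(step))
```

With these illustrative defaults, epsilon starts at 1.0, is roughly halfway after 500 steps, and stays at its final value once 1000 steps have passed.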
        

In [11]:
nb_frame = 100
my_agent = DeepQAgent(env.action_space, mode="DDQN", training_param=TrainingParam())
trainer = TrainAgentMultiEnv(agent=my_agent, env=env, nb_process=NUM_CORE)
trainer.train(nb_frame)


For env with id 1
	Lived with maximum time  5
	Earned a total of reward equal to  -25.0
For env with id 2
	Lived with maximum time  20
	Earned a total of reward equal to  867.1499305135851
For env with id 0
	Lived with maximum time  21
	Earned a total of reward equal to  147.57217847769024
For env with id 1
	Lived with maximum time  30
	Earned a total of reward equal to  585.8699453090342
For env with id 2
	Lived with maximum time  22
	Earned a total of reward equal to  2468.932234905286
For env with id 3
	Lived with maximum time  46
	Earned a total of reward equal to  3564.6394229154744
For env with id 1
	Lived with maximum time  16
	Earned a total of reward equal to  4079.7780691596645
For env with id 2
	Lived with maximum time  10
	Earned a total of reward equal to  4530.986762016142
For env with id 0
	Lived with maximum time  59
	Earned a total of reward equal to  1849.4143333089123
For env with id 0
	Lived with maximum time  7
	Earned a total of reward equal to  3005.501513406784

In [None]:
TODO: fix the reward bookkeeping; the totals printed above look wrong.
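
One possible reading of this TODO: in `train`, `total_reward` is only ever incremented and never reset when a worker's episode ends, so the printed totals mix several episodes. The sketch below shows one way to keep per-episode statistics with boolean masks, assuming per-episode totals are what is wanted. The helper `accumulate_episode_stats` is hypothetical and is not part of `ml_agent` or grid2op.

```python
import numpy as np


def accumulate_episode_stats(total_reward, alive_frame, step_reward, step_done):
    """Update per-worker totals in place and return the episodes that just ended.

    Hypothetical helper. Returns a list of (worker_id, frames_alive, episode_reward)
    for every worker whose `step_done` flag is True.
    """
    step_reward = np.asarray(step_reward, dtype=float)
    step_done = np.asarray(step_done, dtype=bool)

    # workers that are still running earn the reward and live one more frame
    total_reward[~step_done] += step_reward[~step_done]
    alive_frame[~step_done] += 1

    # report the workers that just finished, then reset their counters
    finished = [(int(i), int(alive_frame[i]), float(total_reward[i]))
                for i in np.where(step_done)[0]]
    total_reward[step_done] = 0.0
    alive_frame[step_done] = 0
    return finished


total_reward = np.zeros(3, dtype=float)
alive_frame = np.zeros(3, dtype=int)
first = accumulate_episode_stats(total_reward, alive_frame,
                                 [1.0, 2.0, 3.0], [False, False, False])
second = accumulate_episode_stats(total_reward, alive_frame,
                                  [1.0, 1.0, 1.0], [False, True, False])
print(first, second, total_reward, alive_frame)
```

As in the training loop above, the reward of the step on which a worker ends is not counted; only the reset of the totals differs.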