# Agent, RL and MultiEnvironment

***Disclaimer***: This file references files in other directories. For the cross-references to work, it is recommended to start the notebook server from the root directory (`Grid2Op`) of the package and __not__ from the `getting_started` subdirectory:
```bash
cd Grid2Op
jupyter notebook
```

***NB*** For more information about how to use the package, the general documentation can be built locally (provided that Sphinx is installed on the machine) with:
```bash
cd Grid2Op
make html
```
from the top directory of the package (usually `Grid2Op`).

Once built, the documentation can be accessed [here](../documentation/html/index.html).

It is recommended to have a look at the [0_basic_functionalities](0_basic_functionalities.ipynb), [1_Observation_Agents](1_Observation_Agents.ipynb) and [2_Action_GridManipulation](2_Action_GridManipulation.ipynb) notebooks, and especially at [3_TrainingAnAgent](3_TrainingAnAgent.ipynb), before getting into this one.

**Objectives**

In this notebook we will explain:
* what a "MultiEnv" is
* how it can be used with an agent
* how it can be used to train an agent that uses different environments

In [1]:
res = None
try:
    from jyquickhelper import add_notebook_menu
    res = add_notebook_menu()
except ModuleNotFoundError:
    print("Impossible to automatically add a menu / table of content to this notebook.\nYou can download \"jyquickhelper\" package with: \n\"pip install jyquickhelper\"")
res

In [2]:
import grid2op
from grid2op.Reward import ConstantReward, FlatReward
import sys
import os
import numpy as np
TRAINING_STEP = 10



## I) Download more data for the default environment.

A lot of data has been made available for the "case14_realistic" environment used in this notebook. Including this data in the package is not convenient, so we chose instead to release it separately and make it easily available through a utility. To download it to the default directory ("~/data_grid2op/case14_realistic") on Linux-based systems, uncomment and run the following command:

In [3]:
# !$sys.executable -m grid2op.download --name "case14_realistic"

## II) Make a regular environment and agent

Now that we have downloaded the dataset, it is time to make an environment that will use all the data available. You can execute the following cell. If you see any error or warning, consider re-downloading the data, or adapting the keyword argument "chronics_path" to match the path where the data has been downloaded.

In [4]:
try:
    env = grid2op.make(name_env="case14_realistic", chronics_path=os.path.expanduser("~/data_grid2op/case14_realistic"))
except Exception as exc :
    print("Please read the above cell, it appears you don't have downloaded the dataset, "\
          "or save it into an unknown repository. " \
          "I will continue with only 2 sets.")
    env = grid2op.make(name_env="case14_realistic")

python -m grid2op.download --name "case14_realistic" --path_save PATH\WHERE\YOU\WANT\TO\DOWNLOAD\DATA


Cannot create and instance of chronics_class with parameters "<class 'grid2op.ChronicsHandler.ChronicsHandler'>"
Please read the above cell, it appears you don't have downloaded the dataset, or save it into an unknown repository. I will continue with only 2 sets.
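
Before calling `grid2op.make` with a custom "chronics_path", it can help to check that the directory is actually there. The snippet below is a minimal sketch using only the standard library; the helper name `dataset_available` is hypothetical and is not part of grid2op.

```python
import os


def dataset_available(path):
    """Return True if `path` (with "~" expanded) is an existing, non-empty directory.

    Hypothetical helper for illustration only: it does not validate the
    content of the dataset, only that something has been downloaded there.
    """
    full_path = os.path.expanduser(path)
    return os.path.isdir(full_path) and len(os.listdir(full_path)) > 0


# path used in this notebook; adapt it if the data was saved elsewhere
print(dataset_available("~/data_grid2op/case14_realistic"))
```

If this prints `False`, the fallback branch of the cell above (the environment with fewer chronics) is the one that will most likely be used.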


## III) Train a standard RL Agent

Make sure you are using a computer with at least 4 cores if you want to notice some speed-ups.

In [5]:
from grid2op.MultiEnv import MultiEnvironment
from grid2op.Agent import DoNothingAgent
NUM_CORE = 4

### III.a) Using the standard OpenAI Gym loop

Here we demonstrate how to use the multi environment class. First let's create a multi environment.

In [6]:
# create a simple agent
agent = DoNothingAgent(env.action_space)

# create the multi environment class
multi_envs = MultiEnvironment(env=env, nb_env=NUM_CORE)

A multi-environment is just like a regular environment, but instead of dealing with one action and one observation, it must be sent multiple actions and it returns a list of observations.

It requires a grid2op environment to be initialized and creates some specific "workers", each a replica of the initial environment. None of the workers can be accessed directly. Supported methods are:
- multi_envs.reset
- multi_envs.step
- multi_envs.close

They behave similarly to "env.reset", "env.step" and "env.close".


It can be used in the following manner.
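
To make this interface concrete, here is a minimal, single-process sketch of what a multi-environment wrapper does. It is not the grid2op implementation (which runs its workers in separate processes): `ToyEnv` and `SerialMultiEnv` are hypothetical names, and returning the fresh observation after an automatic reset is an assumption of this sketch.

```python
import numpy as np


class ToyEnv:
    """Hypothetical stand-in for a single environment (not a grid2op class)."""

    def __init__(self, max_steps=3):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the "observation" is simply the step counter

    def step(self, action):
        self.t += 1
        done = self.t >= self.max_steps
        reward = float(action)  # toy reward: the action value itself
        return self.t, reward, done, {}


class SerialMultiEnv:
    """Hypothetical single-process imitation of the reset / step / close interface."""

    def __init__(self, make_env, nb_env):
        self.envs = [make_env() for _ in range(nb_env)]

    def reset(self):
        return np.array([env.reset() for env in self.envs], dtype=object)

    def step(self, actions):
        obss, rews, dones, infos = [], [], [], []
        # the i-th action is always applied to the i-th worker
        for env, act in zip(self.envs, actions):
            obs, rew, done, info = env.step(act)
            if done:
                # a worker that reached a game over is restarted right away
                obs = env.reset()
            obss.append(obs)
            rews.append(rew)
            dones.append(done)
            infos.append(info)
        return np.array(obss, dtype=object), np.array(rews), np.array(dones), infos

    def close(self):
        self.envs = []


toy_multi = SerialMultiEnv(ToyEnv, nb_env=4)
toy_obss = toy_multi.reset()
toy_obss, toy_rews, toy_dones, toy_infos = toy_multi.step([0, 1, 2, 3])
print(toy_rews)  # one reward per worker, in the same order as the actions
toy_multi.close()
```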

In [7]:
# initialize some variables with the proper dimension
obss = multi_envs.reset()
rews = [env.reward_range[0] for i in range(NUM_CORE)]
dones = [False for i in range(NUM_CORE)]
obss

array([<grid2op.Observation.CompleteObservation object at 0x7fa11b145fd0>,
       <grid2op.Observation.CompleteObservation object at 0x7fa14bcc7278>,
       <grid2op.Observation.CompleteObservation object at 0x7fa14bcc72e8>,
       <grid2op.Observation.CompleteObservation object at 0x7fa14bcc7e48>],
      dtype=object)

In [8]:
dones

[False, False, False, False]

As you can see, `obss` is not a single observation, but a list (a numpy ndarray, to be precise) of 4 observations, each one being an observation of a given "worker" environment.

Worker environments are always called in the same order. This means the first observation of this vector will always correspond to the first worker environment.


Similarly to observations, the "step" function of a multi-environment takes as input a list of actions, each of which is applied in its own environment. It returns a list of observations, a list of rewards, and a list of booleans indicating whether each worker environment suffered a game over (in that case, this worker environment is automatically restarted using the "reset" method).

Because worker environments are always called in the same order, the first action sent to the "multi_envs.step" function will also be applied to the first environment.

It can be used as follows:

In [9]:
# initialize the vector of actions that will be processed by each worker environment.
acts = [None for _ in range(NUM_CORE)]
for env_act_id in range(NUM_CORE):
    acts[env_act_id] = agent.act(obss[env_act_id], rews[env_act_id], dones[env_act_id])
    
# feed them to the multi_env
obss, rews, dones, infos = multi_envs.step(acts)

# as explained, this is a vector of Observation (as many as NUM_CORE in this example)
obss

array([<grid2op.Observation.CompleteObservation object at 0x7fa14bccdeb8>,
       <grid2op.Observation.CompleteObservation object at 0x7fa14bccda20>,
       <grid2op.Observation.CompleteObservation object at 0x7fa14bccde80>,
       <grid2op.Observation.CompleteObservation object at 0x7fa14bccdc18>],
      dtype=object)

The multi environment loop is really close to the "gym" loop:

In [10]:
# perform the appropriate number of steps
for i in range(TRAINING_STEP):
    acts = [None for _ in range(NUM_CORE)]
    for env_act_id in range(NUM_CORE):
        acts[env_act_id] = agent.act(obss[env_act_id], rews[env_act_id], dones[env_act_id])
    obss, rews, dones, infos = multi_envs.step(acts)

    # DO SOMETHING WITH THE AGENT IF YOU WANT
    ## agent.train(obss, rews, dones)
    

# close the environments created by the multi_env
multi_envs.close()

In the above example, `TRAINING_STEP` steps are performed on `NUM_CORE` environments in parallel. The agent has therefore acted `TRAINING_STEP * NUM_CORE` (`10 * 4 = 40` by default) times, spread over `NUM_CORE` different environments.
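
The count can be checked with a plain loop that mirrors the structure of the cell above, without any environment involved. The values are the defaults used in this notebook.

```python
TRAINING_STEP = 10
NUM_CORE = 4

nb_agent_calls = 0
for _ in range(TRAINING_STEP):
    for _ in range(NUM_CORE):
        # one call to agent.act per worker and per step
        nb_agent_calls += 1

print(nb_agent_calls)  # 40
```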

### III.b) Practical example

We reuse the code of the notebook [3_TrainingAnAgent](3_TrainingAnAgent.ipynb) to train a new agent, but this time using more than one process.

In [11]:
from ml_agent import TrainingParam, ReplayBuffer, TrainAgent
from ml_agent import DeepQ, DuelQ, SAC
from grid2op.Agent import AgentWithConverter
from grid2op.Reward import RedispReward
from grid2op.Converters import IdToAct
import numpy as np
import random
import warnings
import pdb
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=FutureWarning)
    import tensorflow.keras
    import tensorflow.keras.backend as K
    from tensorflow.keras.models import load_model, Sequential, Model
    from tensorflow.keras.optimizers import Adam
    from tensorflow.keras.layers import Activation, Dropout, Flatten, Dense, subtract, add
    from tensorflow.keras.layers import Input, Lambda, Concatenate

In [12]:
class TrainAgentMultiEnv(TrainAgent):
    def __init__(self, agent, nb_process, reward_fun=RedispReward, env=None):
        # forward the reward function given by the caller instead of always using RedispReward
        TrainAgent.__init__(self, agent, reward_fun=reward_fun, env=env)
        self.nb_process = nb_process
        self.multi_envs = None
        # TODO optimize that to have a numpy array
        self.process_buffer = [[] for _ in range(self.nb_process)]
        # self.process_buffer = np.zeros((self.nb_process, ))
        
    def convert_obs(self, observation):
        return observation.rho
    
    def convert_process_buffer(self):
        """Converts the list of NUM_FRAMES images in the process buffer
        into one training sample"""
        # here I simply concatenate the action in case of multiple actions in the "buffer"
        # this function existed in the original implementation, but has been adapted.
        if self.training_param.NUM_FRAMES != 1:
            raise RuntimeError("can only use self.training_param.NUM_FRAMES = 1 for now")
        return np.array([np.concatenate(el) for el in self.process_buffer])
        # return np.concatenate(self.process_buffer)
        # TODO fix cases where NUM_FRAMES is not 1 !!!!
        
    def _build_valid_env(self, training_param):
        create_new = False
        if self.multi_envs is None:
#             create_new = super()._build_valid_env(training_param)
            # use the environment stored on the trainer rather than the notebook-level `env` variable
            self.multi_envs = MultiEnvironment(env=self.env, nb_env=self.nb_process)
            
            # make sure the environment is reset
            obss = self.multi_envs.reset()
            for worker_id in range(self.nb_process):
                self.process_buffer[worker_id].append(self.agent.convert_obs(obss[worker_id])) 
                # self.agent.process_buffer.append(self.agent.convert_obs(obss[worker_id]))
            do_nothing = [self.env.action_space() for _ in range(self.nb_process)]
            for _ in range(training_param.NUM_FRAMES-1):
                # Initialize buffer with the first frames
                s1, r1, _, _ = self.multi_envs.step(do_nothing)
                for worker_id in range(self.nb_process):
                    self.process_buffer[worker_id].append(self.agent.convert_obs(s1[worker_id])) 
        return create_new
    
    def train(self, num_frames, training_param=TrainingParam()):
        # this function existed in the original implementation, but has been slightly adapted.
        
        # first we create an environment or make sure the given environment is valid
        close_env = self._build_valid_env(training_param)
        
        # below that, only slight modifications have been made. They are highlighted
        observation_num = 0
        curr_state = self.convert_process_buffer()
        
        # it's a bit less convenient than using the SpaceInvader environment.
        # first we need to initialize the neural network
        self.agent.init_deep_q(curr_state)
        # TODO it's weird to use the process buffer for this purpose...
            
        epsilon = training_param.INITIAL_EPSILON
        # `np.int` and `np.float` were removed from recent NumPy versions: use the builtins
        alive_frame = np.zeros(self.nb_process, dtype=int)
        total_reward = np.zeros(self.nb_process, dtype=float)

        while observation_num < num_frames:
            if observation_num % 1000 == 999:
                print(("Executing loop %d" %observation_num))

            # Slowly decay the exploration rate (epsilon)
            if epsilon > training_param.FINAL_EPSILON:
                epsilon -= (training_param.INITIAL_EPSILON-training_param.FINAL_EPSILON)/training_param.EPSILON_DECAY

            initial_state = self.convert_process_buffer()
            self.process_buffer = [[] for _ in range(self.nb_process)]
            
            # TODO vectorize that in the Agent directly
            # ADDED
            predict_movement_int = []
            predict_q_value = []
            acts = []
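            # then we need to predict the next moves from the most recent state
            # (`curr_state` is only computed once, before the loop, and would be stale here)
            #pdb.set_trace()
            pm_i, pq_v = self.agent.deep_q.predict_movement(initial_state, epsilon)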
            for p_id in range(self.nb_process):
                predict_movement_int.append(pm_i[p_id])
                predict_q_value.append(pq_v[p_id])
                # and then we convert it to a valid action
                acts.append(self.agent.convert_act(pm_i[p_id]))
            
            # `np.bool` was removed from recent NumPy versions: use the builtin `bool`
            reward, done = np.zeros(self.nb_process), np.full(self.nb_process, fill_value=False, dtype=bool)
            for i in range(training_param.NUM_FRAMES):
                temp_observation_obj, temp_reward, temp_done, _ = self.multi_envs.step(acts)
                # here it has been adapted too. The observation obtained from the environment is
                # first converted to a vector
                
                # below this line no changes have been made to the original implementation.
                reward[~temp_done] += temp_reward[~temp_done]
                
                
                for worker_id, obs in enumerate(temp_observation_obj):
                    # ADDED
                    self.process_buffer[worker_id].append(self.agent.convert_obs(temp_observation_obj[worker_id])) 
                    
                done = done | temp_done

                # TODO fix that too
                alive_frame[~temp_done] += 1
            
                for env_done_idx in np.where(temp_done)[0]:
                    print("For env with id {}".format(env_done_idx))
                    print("\tLived with maximum time ", alive_frame[env_done_idx])
                    print("\tEarned a total of reward equal to ", total_reward[env_done_idx])
                
                reward[temp_done] = 0.
                total_reward[temp_done] = 0.
                total_reward += reward
                alive_frame[temp_done] = 0
            
            new_state = self.convert_process_buffer()
            for sub_env_id in range(self.nb_process):
                # ADDED
                self.agent.replay_buffer.add(initial_state[sub_env_id],
                                             predict_movement_int[sub_env_id],
                                             reward[sub_env_id],
                                             done[sub_env_id],
                                             new_state[sub_env_id])
                
            if self.agent.replay_buffer.size() > training_param.MIN_OBSERVATION:
                s_batch, a_batch, r_batch, d_batch, s2_batch = self.agent.replay_buffer.sample(training_param.MINIBATCH_SIZE)
                isfinite = self.agent.deep_q.train(s_batch, a_batch, r_batch, d_batch, s2_batch, observation_num)
                self.agent.deep_q.target_train()
            
                if not isfinite:
                    # if the loss is not finite
                    print("E INFINITE LOSS")
                    break
                

            # Save the network every 10000 iterations
            if observation_num % 10000 == 9999 or observation_num == num_frames-1:
                print("Saving Network")
                self.agent.deep_q.save_network("saved_notebook6.h5")
                
            observation_num += 1
            
        if close_env:
            print("closing env")
            self.env.close()
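
The exploration rate update used in the `train` method above is a linear decay: at each iteration, a constant amount is subtracted from epsilon until it reaches its final value. The sketch below isolates that rule. The function name and the numerical values are hypothetical, chosen only so that the arithmetic is easy to follow; they are not the values stored in `TrainingParam`.

```python
def linear_epsilon_schedule(initial_epsilon, final_epsilon, decay_steps, nb_iterations):
    """Return the value of epsilon after each of `nb_iterations` updates."""
    epsilon = initial_epsilon
    # constant decrement, as in (INITIAL_EPSILON - FINAL_EPSILON) / EPSILON_DECAY
    decrement = (initial_epsilon - final_epsilon) / decay_steps
    history = []
    for _ in range(nb_iterations):
        if epsilon > final_epsilon:
            epsilon -= decrement
        history.append(epsilon)
    return history


# hypothetical values: from 1.0 down to 0.5 in 4 updates, then constant
print(linear_epsilon_schedule(1.0, 0.5, 4, 6))
# [0.875, 0.75, 0.625, 0.5, 0.5, 0.5]
```

With values that are not exactly representable in floating point, epsilon may end slightly below the final value, because the test `epsilon > final_epsilon` is made before the subtraction.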
        

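We redefine the class used to train the agent. The `DeepQ` network in the next cell is left commented out for reference; the version imported from `ml_agent` is the one used here.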

In [13]:
# class DeepQ(object):
#     """Constructs the desired deep q learning network"""
#     def __init__(self, action_size, observation_size,
#                  lr=1e-5,
#                  training_param=TrainingParam()):
#         # It is not modified from  Abhinav Sagar's code, except for adding the possibility to change the learning rate
#         # in parameter is also present the size of the action space
#         # (it used to be a global variable in the original code)
#         self.action_size = action_size
#         self.observation_size = observation_size
#         self.model = None
#         self.target_model = None
#         self.lr_ = lr
#         self.qvalue_evolution = np.zeros((0,))
#         self.training_param = training_param
#         self.construct_q_network()
    
#     def construct_q_network(self):
#         # replacement of the Convolution layers by Dense layers, and change the size of the input space and output space
        
#         # Uses the network architecture found in DeepMind paper
#         self.model = Sequential()
#         input_layer = Input(shape=(self.observation_size * self.training_param.NUM_FRAMES,))
#         layer1 = Dense(self.observation_size * self.training_param.NUM_FRAMES)(input_layer)
#         layer1 = Activation('relu')(layer1)
#         layer2 = Dense(self.observation_size)(layer1)
#         layer2 = Activation('relu')(layer2)
#         layer3 = Dense(self.observation_size)(layer2)
#         layer3 = Activation('relu')(layer3)
#         layer4 = Dense(2 * self.action_size)(layer3)
#         layer4 = Activation('relu')(layer4)
#         output = Dense(self.action_size)(layer4)

#         self.model = Model(inputs=[input_layer], outputs=[output])
#         self.model.compile(loss='mse', optimizer=Adam(lr=self.lr_))

#         self.target_model = Model(inputs=[input_layer], outputs=[output])
#         self.target_model.compile(loss='mse', optimizer=Adam(lr=self.lr_))
#         self.target_model.set_weights(self.model.get_weights())
    
#     def predict_movement(self, data, epsilon):
#         """Predict movement of game controler where is epsilon
#         probability randomly move."""
#         rand_val = np.random.random(data.shape[0])
#         q_actions = self.model.predict(data)
#         opt_policy = np.argmax(np.abs(q_actions), axis=-1)
#         opt_policy[rand_val < epsilon] = np.random.randint(0, self.action_size, size=(np.sum(rand_val < epsilon)))
        
#         self.qvalue_evolution = np.concatenate((self.qvalue_evolution , q_actions[0, opt_policy]))
#         return opt_policy, q_actions[0, opt_policy]

#     def train(self, s_batch, a_batch, r_batch, d_batch, s2_batch, observation_num):
#         """Trains network to fit given parameters"""
#         targets = self.model.predict(s_batch)
#         fut_action = self.target_model.predict(s2_batch)
#         targets[:, a_batch] = r_batch
#         targets[d_batch, a_batch[d_batch]] += self.training_param.DECAY_RATE * np.max(fut_action[d_batch], axis=-1)
        
#         loss = self.model.train_on_batch(s_batch, targets)
#         # Print the loss every 100 iterations.
#         if observation_num % 100 == 0:
#             print("We had a loss equal to ", loss)
#         return np.all(np.isfinite(loss))

#     def save_network(self, path):
#         # Saves model at specified path as h5 file
#         # nothing has changed
#         self.model.save(path)
#         print("Successfully saved network.")

#     def load_network(self, path):
#         # nothing has changed
#         self.model = load_model(path)
#         print("Succesfully loaded network.")

#     def target_train(self):
#         # nothing has changed from the original implementation
#         model_weights = self.model.get_weights()
#         target_model_weights = self.target_model.get_weights()
#         for i in range(len(model_weights)):
#             target_model_weights[i] = self.training_param.TAU * model_weights[i] + (1 - self.training_param.TAU) * target_model_weights[i]
#         self.target_model.set_weights(target_model_weights)
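
With a multi-environment, the network receives one state per worker and must return one action per worker. The sketch below shows one possible way to do a batched epsilon-greedy selection with NumPy. It is an illustration under stated assumptions and not the `ml_agent` code: the function name `epsilon_greedy_batch` is hypothetical, it uses a plain `argmax` on the Q-values, and it reads the Q-value of each selected action row by row.

```python
import numpy as np


def epsilon_greedy_batch(q_values, epsilon, rng):
    """Select one action per row of `q_values` (shape: nb_workers x nb_actions).

    With probability `epsilon` a row gets a uniformly random action,
    otherwise it gets the action with the highest Q-value.
    """
    q_values = np.asarray(q_values)
    nb_rows, nb_actions = q_values.shape

    # greedy choice for every worker
    actions = np.argmax(q_values, axis=-1)

    # rows for which we explore instead of exploiting
    explore = rng.random(nb_rows) < epsilon
    actions[explore] = rng.integers(0, nb_actions, size=int(explore.sum()))

    # Q-value of the selected action, taken in the row of each worker
    chosen_q = q_values[np.arange(nb_rows), actions]
    return actions, chosen_q


rng = np.random.default_rng(0)
q = np.array([[0.1, 0.9, 0.3],
              [0.7, 0.2, 0.1]])
greedy_actions, greedy_q = epsilon_greedy_batch(q, epsilon=0.0, rng=rng)
print(greedy_actions, greedy_q)  # [1 0] [0.9 0.7]
```

Indexing with `np.arange(nb_rows)` pairs each worker with its own action; indexing with a fixed row would return the Q-values of the first worker for everyone.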

In [14]:
class DeepQAgent(AgentWithConverter):
    # first change: an Agent must derive from grid2op.Agent (in this case AgentWithConverter, because we
    # manipulate vectors instead of classes)

    def convert_obs(self, observation):
        return np.concatenate((observation.rho, observation.topo_vect))

    def my_act(self, transformed_observation, reward, done=False):
        if self.deep_q is None:
            self.init_deep_q(transformed_observation)
        predict_movement_int, *_ = self.deep_q.predict_movement(transformed_observation, epsilon=0.0)
        # print("predict_movement_int: {}".format(predict_movement_int))
        return predict_movement_int

    def init_deep_q(self, transformed_observation):
        if self.deep_q is None:
            # the first time an observation is observed, I set up the neural network with the proper dimensions.
            if self.mode == "DQN":
                cls = DeepQ
            elif self.mode == "DDQN":
                cls = DuelQ
            # elif self.mode == "RealQ":
            #     cls = RealQ
            elif self.mode == "SAC":
                cls = SAC
            else:
                raise RuntimeError("Unknown neural network named \"{}\"".format(self.mode))
            self.deep_q = cls(self.action_space.size(), observation_size=transformed_observation.shape[-1], lr=self.lr)

    def __init__(self, action_space, mode="DDQN", lr=1e-5, training_param=TrainingParam()):
        # this function has been adapted.

        # to build an AgentWithConverter, we need an action_space.
        # No problem, we add it in the constructor.
        AgentWithConverter.__init__(self, action_space, action_space_converter=IdToAct)

        # and now back to the original implementation
        self.replay_buffer = ReplayBuffer(training_param.BUFFER_SIZE)

        # compared to the original implementation, I don't know the observation space size,
        # because it depends on the components of the observation we want to look at. So the neural network
        # will be initialized the first time an observation is observed.
        self.deep_q = None
        self.mode = mode
        self.lr = lr
        self.training_param = training_param

    def load_network(self, path):
        # not modified compared to the original implementation
        self.deep_q.load_network(path)
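
The agent stores its experience in a replay buffer through `add`, `size` and `sample`, which are the three methods called in the training loop. The class below is a minimal sketch of such a buffer, assuming a fixed capacity with the oldest transitions dropped first and uniform sampling. `SimpleReplayBuffer` is a hypothetical name; the `ReplayBuffer` imported from `ml_agent` may be implemented differently.

```python
import random
from collections import deque

import numpy as np


class SimpleReplayBuffer:
    """Hypothetical fixed-size buffer of (state, action, reward, done, next_state)."""

    def __init__(self, buffer_size, seed=None):
        # a deque with `maxlen` silently drops the oldest item when full
        self.buffer = deque(maxlen=buffer_size)
        self.rng = random.Random(seed)

    def add(self, s, a, r, d, s2):
        self.buffer.append((s, a, r, d, s2))

    def size(self):
        return len(self.buffer)

    def sample(self, batch_size):
        # uniform sampling without replacement, capped by the current size
        nb_samples = min(batch_size, len(self.buffer))
        batch = self.rng.sample(list(self.buffer), nb_samples)
        # regroup the transitions field by field, as numpy arrays
        s, a, r, d, s2 = map(np.array, zip(*batch))
        return s, a, r, d, s2


buffer = SimpleReplayBuffer(buffer_size=3, seed=0)
for step in range(5):
    # one transition per worker would be added here in the multi-environment case
    buffer.add(np.zeros(2), step, 1.0, False, np.ones(2))

print(buffer.size())  # 3
s_batch, a_batch, r_batch, d_batch, s2_batch = buffer.sample(2)
print(s_batch.shape)  # (2, 2)
```

The sketch expects at least one stored transition before `sample` is called, which matches the training loop above where sampling only starts once the buffer holds more than `MIN_OBSERVATION` items.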

In [None]:
TRAINING_STEP = 1000
my_agent = DeepQAgent(env.action_space, mode="DDQN", training_param=TrainingParam())
trainer = TrainAgentMultiEnv(agent=my_agent, env=env, nb_process=NUM_CORE)
# trainer = TrainAgent(agent=my_agent, env=env)
trainer.train(TRAINING_STEP)

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Use tf.cast instead.
Successfully constructed networks.
For env with id 3
	Lived with maximum time  2
	Earned a total of reward equal to  447.7991706748635
For env with id 1
	Lived with maximum time  5
	Earned a total of reward equal to  663.1549377834867
For env with id 2
	Lived with maximum time  5
	Earned a total of reward equal to  896.2686016287104
For env with id 0
	Lived with maximum time  7
	Earned a total of reward equal to  1119.0141611232095
For env with id 1
	Lived with maximum time  2
	Earned a total of reward equal to  447.7991706748635
For env with id 2
	Lived with maximum time  2
	Earned a total of reward equal to  447.7991706748635
For env with id 3
	Lived with maximum time  5
	Earned a total of reward equal to  896.2686016287104
For env with id 0
	Lived with maximum time  2
	Earned a total of reward equal to  447.7991706748635
For env with id 3
	Lived with maximum time  [...]

For env with id 2
	Lived with maximum time  5
	Earned a total of reward equal to  896.2686016287104
For env with id 0
	Lived with maximum time  5
	Earned a total of reward equal to  896.2686016287104
For env with id 3
	Lived with maximum time  2
	Earned a total of reward equal to  447.7991706748635
For env with id 1
	Lived with maximum time  5
	Earned a total of reward equal to  896.2686016287104
For env with id 2
	Lived with maximum time  2
	Earned a total of reward equal to  447.7991706748635
For env with id 0
	Lived with maximum time  2
	Earned a total of reward equal to  447.7991706748635
For env with id 1
	Lived with maximum time  2
	Earned a total of reward equal to  447.7991706748635
For env with id 3
	Lived with maximum time  5
	Earned a total of reward equal to  896.2686016287104
For env with id 2
	Lived with maximum time  5
	Earned a total of reward equal to  896.2686016287104
We had a loss equal to  158.96332
For env with id 0
	Lived with maximum time  5
	Earned a total of reward equal to  [...]

For env with id 0
	Lived with maximum time  7
	Earned a total of reward equal to  886.555916273282
For env with id 3
	Lived with maximum time  2
	Earned a total of reward equal to  447.7991706748635
For env with id 1
	Lived with maximum time  5
	Earned a total of reward equal to  896.2686016287104
For env with id 0
	Lived with maximum time  2
	Earned a total of reward equal to  447.7991706748635
For env with id 1
	Lived with maximum time  2
	Earned a total of reward equal to  447.7991706748635
For env with id 2
	Lived with maximum time  5
	Earned a total of reward equal to  896.2686016287104
For env with id 3
	Lived with maximum time  5
	Earned a total of reward equal to  896.2686016287104
For env with id 0
	Lived with maximum time  5
	Earned a total of reward equal to  896.2686016287104


In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(30,20))
plt.plot(my_agent.deep_q.qvalue_evolution)
plt.axhline(y=0, linewidth=3, color='red')
_ = plt.xlim(0, len(my_agent.deep_q.qvalue_evolution))

In [None]:
my_agent.deep_q.qvalue_evolution