# Actions as vectors, and RL agent training

***Disclaimer***: This file references some files in other directories. For the cross-references to work, it is recommended to start the notebook server from the root directory (`Grid2Op`) of the package and __not__ from the `getting_started` subdirectory:
```bash
cd Grid2Op
jupyter notebook
```

***NB*** For more information about how to use the package, the general help can be built locally (provided that sphinx is installed on the machine) with:
```bash
cd Grid2Op
make html
```
from the top directory of the package (usually `Grid2Op`).

Once built, the help can be accessed [here](../documentation/html/index.html).

It is recommended to have a look at the [0_basic_functionalities](0_basic_functionalities.ipynb), [1_Observation_Agents](1_Observation_Agents.ipynb) and [2_Action_GridManipulation](2_Action_GridManipulation.ipynb) notebooks before getting into this one.

**Objectives**

In this notebook we will show:
* how to use the "vectorized" versions of Observations and Actions
* how to train a (stupid) Agent using reinforcement learning
* how to (quickly) inspect the actions taken by the Agent

**NB** For this tutorial we train an Agent inspired by this blog post: [deep-reinforcement-learning-tutorial-with-open-ai-gym](https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368). Many other reinforcement learning tutorials exist. The code shown in this notebook has no pretension other than to demonstrate how to use Grid2Op functionality to train a Deep Reinforcement Learning Agent and inspect its behaviour. Nothing is implied about the performance, the training strategy, the type of Agent, the meta parameters, etc. All of them are purely "random".


In [1]:
import os
import sys
import grid2op

In [2]:
res = None
try:
    from jyquickhelper import add_notebook_menu
    res = add_notebook_menu()
except ModuleNotFoundError:
    print("Impossible to automatically add a menu / table of content to this notebook.\nYou can download \"jyquickhelper\" package with: \n\"pip install jyquickhelper\"")
res

## I) Manipulating vectors instead of classes in Agents

The Grid2Op package has been built with an "object oriented" perspective: almost everything is encapsulated in a dedicated `class`. This allows for more customization of the platform.

The downside of this approach is that machine learning methods, and especially deep learning, often prefer to deal with vectors rather than with complex objects.

To have the best of both worlds, we provide a `MLAgent` class that handles the conversion between these classes and numpy vectors. In this notebook, we will see how this class can be used and overridden to develop an Agent that learns how to perform actions on the grid. By default, this class does nothing, and it's possible (and encouraged) to override the `MLAgent._ml_act` method to build smarter Agents.

In [3]:
# import the useful classes
import numpy as np

from grid2op.Runner import Runner
from grid2op.ChronicsHandler import Multifolder, GridStateFromFileWithForecasts
from grid2op.Agent import MLAgent
from grid2op.Reward import L2RPNReward
from grid2op.Action import PowerLineSet
# make a runner
runner = Runner(init_grid_path=grid2op.CASE_14_FILE,
                path_chron=grid2op.CHRONICS_MLUTIEPISODE,
                gridStateclass=Multifolder,
                gridStateclass_kwargs={"gridvalueClass": GridStateFromFileWithForecasts},
                names_chronics_to_backend=grid2op.NAMES_CHRONICS_TO_BACKEND,
                agentClass=MLAgent,
                rewardClass=L2RPNReward,
                actionClass=PowerLineSet)
# initialize it
res = runner.run(nb_episode=1)
print("The results for the DoNothing agent are:")
for chron_name, cum_reward, nb_time_step, max_ts in res:
    msg_tmp = "\tFor chronics located at {}\n".format(chron_name)
    msg_tmp += "\t\t - cumulative reward: {:.6f}\n".format(cum_reward)
    msg_tmp += "\t\t - number of time steps completed: {:.0f} / {:.0f}".format(nb_time_step, max_ts)
    print(msg_tmp)

The results for the DoNothing agent are:
	For chronics located at /home/donnotben/.local/lib/python3.6/site-packages/grid2op/data/test_multi_chronics/1
		 - cumulative reward: 5739.951023
		 - number of time steps completed: 287 / 287


And that is it. It is as simple as changing the "agentClass". Now we will provide an example of how this class can be overridden to train a "real" Agent.

## II) Training an Agent

For this tutorial, we will show how to build a Q-learning Agent. Most of the code (except the neural network architecture) is inspired by this blog post: [https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368](https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368).

**Requirements** This notebook requires `keras` to be installed on your machine.

The Agent we will train will:
* only act on the topology
* only try to "set" the status of the powerlines (it won't try to "change" them, nor to perform any kind of bus splitting or merging)

This is achieved by passing the "*PowerLineSet*" class in the actionClass of the Runner. For more information, one can consult the [PowerLineSet](../documentation/html/action.html#grid2op.Action.PowerLineSet) help page (if built locally, which is recommended) or the [PowerLineSet](../grid2op/Action.py) class definition in the [Action.py](../grid2op/Action.py) file.

Note that this agent is unlikely to perform well in a L2RPN competition, as it uses only a small subset of the available actions. It is shown here as an example.

Also, note that we use a specific class of `Action` in this notebook. It is unlikely that this class will be used in a competition. The exercise in this notebook is therefore purely a demonstration of "how to".

As always in these notebooks, we will use the `case14_fromfile` Environment.

### II.A) Defining some "helpers"

The type of Agent we are using requires a bit of setup, independently of Grid2Op. We will reuse the code shown in
[https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368](https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368) and in [Reinforcement-Learning-Tutorial](https://github.com/abhinavsagar/Reinforcement-Learning-Tutorial) by Abhinav Sagar, released under an *MIT license* found here: [MIT License](https://github.com/abhinavsagar/Reinforcement-Learning-Tutorial/blob/master/LICENSE).

This first section defines these classes.

But first, let's import the necessary dependencies.

In [4]:
import numpy as np
import random
import warnings
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=FutureWarning)
    import keras
    import keras.backend as K
    from keras.models import load_model, Sequential, Model
    from keras.optimizers import Adam
    # all these layers are exposed by keras.layers directly
    from keras.layers import Activation, Dropout, Flatten, Dense
    from keras.layers import Input, Lambda

Using TensorFlow backend.


#### a) Replay buffer

First we define a "replay buffer", necessary to train the Agent.

In [5]:
# Credit Abhinav Sagar: 
# https://github.com/abhinavsagar/Reinforcement-Learning-Tutorial
# Code under MIT license, available at:
# https://github.com/abhinavsagar/Reinforcement-Learning-Tutorial/blob/master/LICENSE
from collections import deque

class ReplayBuffer:
    """Constructs a buffer object that stores the past moves
    and samples a set of subsamples"""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.count = 0
        self.buffer = deque()

    def add(self, s, a, r, d, s2):
        """Add an experience to the buffer"""
        # s is the current state, a is the action,
        # r is the reward, d is whether the episode has ended,
        # and s2 is the next state
        experience = (s, a, r, d, s2)
        if self.count < self.buffer_size:
            self.buffer.append(experience)
            self.count += 1
        else:
            self.buffer.popleft()
            self.buffer.append(experience)

    def size(self):
        return self.count

    def sample(self, batch_size):
        """Samples a total of elements equal to batch_size from buffer
        if buffer contains enough elements. Otherwise return all elements"""

        batch = []

        if self.count < batch_size:
            batch = random.sample(self.buffer, self.count)
        else:
            batch = random.sample(self.buffer, batch_size)

        # Regroups the sampled experiences field by field: states, actions, rewards,
        # done flags and new states
        s_batch, a_batch, r_batch, d_batch, s2_batch = list(map(np.array, list(zip(*batch))))

        return s_batch, a_batch, r_batch, d_batch, s2_batch

    def clear(self):
        self.buffer.clear()
        self.count = 0
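
The eviction and sampling logic of this buffer can be tried in isolation. The snippet below is only a minimal sketch on a hypothetical toy buffer (the name `toy_buffer` and the fake 2-dimensional states are assumptions made for illustration): `deque(maxlen=...)` drops the oldest entry automatically, which has the same effect as the `popleft()` / `append()` pair above, and the sampled minibatch is regrouped field by field as in `ReplayBuffer.sample`.

```python
import random
from collections import deque

import numpy as np

# Hypothetical toy buffer of capacity 3: the oldest experience is evicted
# automatically once the capacity is reached.
toy_buffer = deque(maxlen=3)
for step in range(5):
    state = np.full(2, step, dtype=float)           # fake 2-dimensional state
    next_state = np.full(2, step + 1, dtype=float)  # fake next state
    toy_buffer.append((state, step % 2, 1.0, False, next_state))

# only the 3 most recent experiences are still stored
kept_steps = [int(experience[0][0]) for experience in toy_buffer]

# sample a minibatch and regroup it field by field
random.seed(0)
batch = random.sample(list(toy_buffer), 2)
s_batch, a_batch, r_batch, d_batch, s2_batch = map(np.array, zip(*batch))
print(kept_steps, s_batch.shape, a_batch.shape)
```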

#### b) Meta parameters of the methods

Then we reuse the default parameters. Note that these could be optimized; nothing has been tuned for this example.

For more information about them, please refer to the blog post of Abhinav Sagar [available here](https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368).

In [6]:
DECAY_RATE = 0.99
BUFFER_SIZE = 40000
MINIBATCH_SIZE = 64
TOT_FRAME = 3000000
EPSILON_DECAY = 1000000
MIN_OBSERVATION = 1000 #5000
FINAL_EPSILON = 1/300  # have on average 1 random action per scenario of approx 287 time steps
INITIAL_EPSILON = 0.1
TAU = 0.01
# Number of frames to "throw" into network
NUM_FRAMES = 1 ## this has been changed compared to the original implementation.
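
To get a feeling for these values, the exploration rate can be computed by hand. The sketch below assumes a linear decay from `INITIAL_EPSILON` towards `FINAL_EPSILON` spread over `EPSILON_DECAY` steps; the helper `epsilon_after` is hypothetical and only a closed-form approximation of a step-by-step decay. With a short training of 10000 frames, epsilon stays around 0.099, very close to its initial value.

```python
INITIAL_EPSILON = 0.1
FINAL_EPSILON = 1 / 300
EPSILON_DECAY = 1000000

def epsilon_after(n_steps):
    """Exploration rate after n_steps of linear decay, floored at FINAL_EPSILON.

    Hypothetical helper, written for illustration only.
    """
    step_size = (INITIAL_EPSILON - FINAL_EPSILON) / EPSILON_DECAY
    return max(FINAL_EPSILON, INITIAL_EPSILON - n_steps * step_size)

for n in (0, 10000, 1000000):
    print(n, round(epsilon_after(n), 5))
```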

### II.B) Adaptation of the inputs

In the original code, the models were used to play an Atari game and the inputs were images. For our system, the inputs are "Observations" converted to vectors.

This is why we adapted the original code from Abhinav Sagar:
* We replaced the convolutional layers with fully connected (dense) layers
* We made sure not to look at the whole observation, but only at some parts of it.

For a more detailed description of the code used, please check:
* [https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368](https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368)
* and [Reinforcement-Learning-Tutorial](https://github.com/abhinavsagar/Reinforcement-Learning-Tutorial)

#### a) Extracting relevant information from the observation

First we extract relevant information about the dimension of the observation space, and the action space.

In [7]:
env = grid2op.make(action_class=PowerLineSet)
observation_size_init = env.observation_space.size()
topo_vect_size = env.observation_space.dim_topo
action_size = env.action_space.size()

The Agent we will train will be able to act on only one powerline at a time, and to either reconnect it or disconnect it. It then has a choice between 2 * `number of powerlines` actions (plus the possibility to do nothing).

To train our agent more easily, we will only use part of the ***observation space***. We will not inform the agent about the state of the loads or generators, nor about the flows on the powerlines, nor about the topology vector. We keep only the part of the observation that concerns:
* the relative powerflow $\rho$ (current flow divided by thermal limit)
* the powerline status (1 if the powerline is connected, 0 if it is disconnected)

The ***action space*** will be represented by a "one hot" vector, as follows:
* if the first component (index 0) is set to 1, then the Agent does nothing
* if the component equal to "1" has its index `i` between 1 and the `number of powerlines`, the action reconnects powerline `i-1`
* if the component equal to "1" has its index `i` between the `number of powerlines` + 1 and twice the `number of powerlines`, the action disconnects powerline `i-nb_powerline-1`

It then has a dimension of 2 * `number of powerlines` + 1.
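
The encoding described in the list above can be sketched as a small standalone function. The function name `index_to_line_vector` is hypothetical, and the output convention (one entry per powerline, `+1` to reconnect, `-1` to disconnect) is the one used by the Agent code later in this notebook.

```python
import numpy as np

def index_to_line_vector(index, n_lines):
    """Map an index in [0, 2 * n_lines] to a vector with one entry per powerline.

    +1 means "reconnect", -1 means "disconnect", all zeros means "do nothing".
    Hypothetical helper, written for illustration only.
    """
    if not 0 <= index <= 2 * n_lines:
        raise ValueError("index must be between 0 and 2 * n_lines")
    vect = np.zeros(n_lines)
    if index == 0:
        return vect                      # do nothing
    if index <= n_lines:
        vect[index - 1] = 1              # reconnect powerline index-1
    else:
        vect[index - n_lines - 1] = -1   # disconnect powerline index-n_lines-1
    return vect

# with 3 powerlines there are 2 * 3 + 1 = 7 possible indices
for idx in range(7):
    print(idx, index_to_line_vector(idx, n_lines=3))
```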

In [8]:
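# define the subspace (represented as a boolean mask) of the observation space used by the agent.
# (the builtin "bool" is used because the "np.bool" alias has been removed from recent numpy versions)
obs_to_model = np.full(observation_size_init, fill_value=False, dtype=bool)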
obs_to_model[(-3*action_size-topo_vect_size):(-action_size-topo_vect_size)] = True
observation_size = np.sum(obs_to_model)

# define the size of the action space
NUM_ACTIONS = 2* action_size + 1 
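
The negative slice used to build `obs_to_model` can be hard to read. The sketch below reproduces only its mechanics on a toy vector; the sizes `n_lines`, `topo_size` and `obs_size` are hypothetical values chosen for illustration, not the real dimensions of the environment. The mask keeps the 2 * `n_lines` entries located just before the last `n_lines + topo_size` entries.

```python
import numpy as np

# Hypothetical sizes, for illustration only
n_lines, topo_size, obs_size = 3, 4, 20
toy_obs = np.arange(obs_size, dtype=float)

mask = np.zeros(obs_size, dtype=bool)
# same kind of negative slice as in the cell above
mask[(-3 * n_lines - topo_size):(-n_lines - topo_size)] = True

selected = toy_obs[mask]  # boolean indexing keeps only the flagged entries
print(int(mask.sum()), selected)
```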

#### b) Code the neural networks

The code of the neural networks has been modified only slightly to adapt it to our problem. The biggest changes come from removing the convolutional layers and adapting the input and output sizes.

For each of the methods below, we specify what has been adapted.
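
Both `train` methods below build their regression targets in the same way: the target of the action actually taken is the reward, plus the discounted best value predicted by the target network for the next state when the episode is not over. The sketch below is a vectorized NumPy version of that rule on made-up numbers; the function `q_targets` and its argument names are assumptions made for illustration and are not part of the original code.

```python
import numpy as np

DECAY_RATE = 0.99  # discount factor, same value as in the meta parameters

def q_targets(q_current, q_next_target, actions, rewards, dones):
    """Regression targets for a minibatch (hypothetical helper).

    q_current:     (batch, n_actions) predictions of the trained model on s
    q_next_target: (batch, n_actions) predictions of the target model on s2
    """
    targets = q_current.copy()
    rows = np.arange(len(actions))
    bootstrap = DECAY_RATE * q_next_target.max(axis=1)
    # no bootstrap term when the episode ended at this step
    targets[rows, actions] = rewards + np.where(dones, 0.0, bootstrap)
    return targets

q_current = np.array([[1.0, 2.0], [3.0, 4.0]])
q_next = np.array([[10.0, 20.0], [30.0, 40.0]])
targets = q_targets(q_current, q_next,
                    actions=np.array([0, 1]),
                    rewards=np.array([1.0, 5.0]),
                    dones=np.array([False, True]))
print(targets)
```

Only the entry of the chosen action changes in each row; the other entries keep the current prediction, so they contribute nothing to the mean squared error.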

In [9]:
# Credit Abhinav Sagar: 
# https://github.com/abhinavsagar/Reinforcement-Learning-Tutorial
# Code under MIT license, available at:
# https://github.com/abhinavsagar/Reinforcement-Learning-Tutorial/blob/master/LICENSE

class DeepQ(object):
    """Constructs the desired deep q learning network"""
    def __init__(self, action_size, lr=0.00001):
        # Not modified from Abhinav Sagar's code, except that the learning rate can now be changed
        # and that the size of the action space is passed as a parameter
        # (it used to be a global variable in the original code)
        self.action_size = action_size
        self.model = None
        self.target_model = None
        self.lr_ = lr
        self.construct_q_network()
    
    def construct_q_network(self):
        # replacement of the Convolution layers by Dense layers, and change the size of the input space and output space
        
        # Uses the network architecture found in DeepMind paper
        self.model = Sequential()
        self.model.add(Dense(observation_size*NUM_FRAMES))
        self.model.add(Activation('relu'))
        self.model.add(Dense(observation_size))
        self.model.add(Activation('relu'))
        self.model.add(Dense(observation_size))
        self.model.add(Activation('relu'))
        self.model.add(Dense(2*NUM_ACTIONS))
        self.model.add(Activation('relu'))
        self.model.add(Dense(NUM_ACTIONS))
        self.model.compile(loss='mse', optimizer=Adam(lr=self.lr_))

        # Creates a target network as described in DeepMind paper
        self.target_model = Sequential()
        self.target_model.add(Dense(observation_size*NUM_FRAMES))
        self.target_model.add(Activation('relu'))
        self.target_model.add(Dense(observation_size))
        self.target_model.add(Activation('relu'))
        self.target_model.add(Dense(observation_size))
        self.target_model.add(Activation('relu'))
        self.target_model.add(Dense(2*NUM_ACTIONS))
        self.target_model.add(Activation('relu'))
        self.target_model.add(Dense(NUM_ACTIONS))
        self.target_model.compile(loss='mse', optimizer=Adam(lr=self.lr_))
        self.target_model.set_weights(self.model.get_weights())
    
    def predict_movement(self, data, epsilon):
        """Predict the movement of the game controller: with probability
        epsilon, a random move is chosen instead of the best predicted one."""
        # nothing has changed from the original implementation
        rand_val = np.random.random()
        q_actions = self.model.predict(data.reshape(1, observation_size*NUM_FRAMES), batch_size = 1)
        
        if rand_val < epsilon:
            opt_policy = np.random.randint(0, NUM_ACTIONS)
        else:
            # greedy choice: the action with the highest predicted Q value
            opt_policy = np.argmax(q_actions)
        return opt_policy, q_actions[0, opt_policy]

    def train(self, s_batch, a_batch, r_batch, d_batch, s2_batch, observation_num):
        """Trains network to fit given parameters"""
        # nothing has changed from the original implementation, except for changing the input dimension 'reshape'
        batch_size = s_batch.shape[0]
        targets = np.zeros((batch_size, NUM_ACTIONS))

        for i in range(batch_size):
            targets[i] = self.model.predict(s_batch[i].reshape(1, observation_size*NUM_FRAMES), batch_size = 1)
            fut_action = self.target_model.predict(s2_batch[i].reshape(1, observation_size*NUM_FRAMES), batch_size = 1)
            targets[i, a_batch[i]] = r_batch[i]
            if not d_batch[i]:
                targets[i, a_batch[i]] += DECAY_RATE * np.max(fut_action)
        loss = self.model.train_on_batch(s_batch, targets)
        # Print the loss every 100 iterations.
        if observation_num % 100 == 0:
            print("We had a loss equal to ", loss)

    def save_network(self, path):
        # Saves model at specified path as h5 file
        # nothing has changed
        self.model.save(path)
        print("Successfully saved network.")

    def load_network(self, path):
        # nothing has changed
        self.model = load_model(path)
        print("Successfully loaded network.")

    def target_train(self):
        # nothing has changed from the original implementation
        model_weights = self.model.get_weights()
        target_model_weights = self.target_model.get_weights()
        for i in range(len(model_weights)):
            target_model_weights[i] = TAU * model_weights[i] + (1 - TAU) * target_model_weights[i]
        self.target_model.set_weights(target_model_weights)
        
class DuelQ(object):
    """Constructs the desired deep q learning network"""
    def __init__(self, action_size, lr=0.00001):
        # Not modified from Abhinav Sagar's code, except that the learning rate can now be changed
        # and that the size of the action space is passed as a parameter
        # (it used to be a global variable in the original code)
        self.action_size = action_size
        self.lr_ = lr
        self.model = None
        self.construct_q_network()

    def construct_q_network(self):
        # Uses the network architecture found in DeepMind paper
        # The inputs and outputs size have changed, as well as replacing the convolution by dense layers.
        self.model = Sequential()
        
        input_layer = Input(shape = (observation_size*NUM_FRAMES,))
        lay1 = Dense(observation_size*NUM_FRAMES)(input_layer)
        lay1 = Activation('relu')(lay1)
        
        lay2 = Dense(observation_size)(lay1)
        lay2 = Activation('relu')(lay2)
        
        lay3 = Dense(2*NUM_ACTIONS)(lay2)
        lay3 = Activation('relu')(lay3)
        
        fc1 = Dense(NUM_ACTIONS)(lay3)
        advantage = Dense(NUM_ACTIONS)(fc1)
        fc2 = Dense(NUM_ACTIONS)(lay3)
        value = Dense(1)(fc2)
        
        meaner = Lambda(lambda x: K.mean(x, axis=1) )
        mn_ = meaner(advantage)  
        tmp = keras.layers.subtract([advantage, mn_])  # keras doesn't like this part...
        policy = keras.layers.add([tmp, value])

        self.model = Model(inputs=[input_layer], outputs=[policy])
        self.model.compile(loss='mse', optimizer=Adam(lr=self.lr_))

        self.target_model = Model(inputs=[input_layer], outputs=[policy])
        self.target_model.compile(loss='mse', optimizer=Adam(lr=self.lr_))
        print("Successfully constructed networks.")
    
    def predict_movement(self, data, epsilon):
        """Predict the movement of the game controller: with probability
        epsilon, a random move is chosen instead of the best predicted one."""
        # only changes lie in adapting the input shape
        q_actions = self.model.predict(data.reshape(1, observation_size*NUM_FRAMES), batch_size = 1)
        opt_policy = np.argmax(q_actions)
        rand_val = np.random.random()
        if rand_val < epsilon:
            opt_policy = np.random.randint(0, NUM_ACTIONS)
        return opt_policy, q_actions[0, opt_policy]

    def train(self, s_batch, a_batch, r_batch, d_batch, s2_batch, observation_num):
        """Trains network to fit given parameters"""
        # nothing has changed except adapting the input shapes
        batch_size = s_batch.shape[0]
        targets = np.zeros((batch_size, NUM_ACTIONS))

        for i in range(batch_size):
            targets[i] = self.model.predict(s_batch[i].reshape(1, observation_size*NUM_FRAMES), batch_size = 1)
            fut_action = self.target_model.predict(s2_batch[i].reshape(1, observation_size*NUM_FRAMES), batch_size = 1)
            targets[i, a_batch[i]] = r_batch[i]
            if not d_batch[i]:
                targets[i, a_batch[i]] += DECAY_RATE * np.max(fut_action)

        loss = self.model.train_on_batch(s_batch, targets)

        # Print the loss every 100 iterations.
        if observation_num % 100 == 0:
            print("We had a loss equal to ", loss)

    def save_network(self, path):
        # Saves model at specified path as h5 file
        # nothing has changed
        self.model.save(path)
        print("Successfully saved network.")

    def load_network(self, path):
        # nothing has changed
        self.model.load_weights(path)
        self.target_model.load_weights(path)
        print("Successfully loaded network.")

    def target_train(self):
        # nothing has changed
        model_weights = self.model.get_weights()
        self.target_model.set_weights(model_weights)
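
The two `target_train` methods above differ: `DeepQ` moves the target weights a fraction `TAU` of the way towards the trained weights at each call (a "soft" update), while `DuelQ` copies them entirely. The sketch below illustrates the soft update on plain NumPy arrays standing in for network weights; the function `soft_update` and the toy weights are assumptions made for illustration.

```python
import numpy as np

TAU = 0.01

def soft_update(online_weights, target_weights, tau=TAU):
    """Return target weights moved a fraction tau towards the online weights."""
    return [tau * w + (1 - tau) * w_t
            for w, w_t in zip(online_weights, target_weights)]

# toy "weights": one matrix and one bias vector
online = [np.ones((2, 2)), np.ones(2)]
target = [np.zeros((2, 2)), np.zeros(2)]

for _ in range(100):
    target = soft_update(online, target)

# after n updates with fixed online weights, the remaining gap is (1 - TAU) ** n
gap = 1.0 - target[0][0, 0]
print(round(gap, 3))
```

With `TAU = 0.01`, about a third of the initial gap is still there after 100 updates, which shows how slowly the target network follows the trained one.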

### II.C) Making the code of the Agent and train it

In the "reference" article [https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368](https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368), the author Abhinav Sagar made a dedicated environment based on SpaceInvader in the gym repository. We proceed here in a similar way, but with the grid2op environment.

#### a) Adapted code

We first show the modified code. For each function, we highlight what has changed and what has not.

In [10]:
# Credit Abhinav Sagar: 
# https://github.com/abhinavsagar/Reinforcement-Learning-Tutorial
# Code under MIT license, available at:
# https://github.com/abhinavsagar/Reinforcement-Learning-Tutorial/blob/master/LICENSE

from grid2op.Parameters import Parameters
from grid2op.Action import PowerLineSet
from grid2op.Reward import L2RPNReward
from grid2op.Agent import MLAgent
import pdb

class DeepQAgent(MLAgent):
    # first change: an Agent must derive from grid2op.Agent (in this case MLAgent, because we manipulate vectors
    # instead of classes)
    
    def __init__(self, action_space, mode="DDQN", reward_fun=L2RPNReward):
        # this function has been adapted.
        # no environment is created here; it is created in the "train" method instead.
        
        # to build a MLAgent, we need an action_space. No problem, we add it in the constructor.
        MLAgent.__init__(self, action_space)
        
        # easier to access
        self.action_space = action_space
        self.action_size = action_space.size()
        self.do_nothing_act = action_space({})
        
        # this is the reward function used to train the Agent. It can be changed when building the Agent.
        # note that a "reward function" must be a class that is a subclass of grid2op.Reward
        self.reward_fun = reward_fun
        
        # the scales of the inputs vary, so we add a "scaling vector"
        self.scale_vect = np.ones(observation_size)
        self.scale_vect[:action_size] = 150
        
        # and now back to the origin implementation
        self.replay_buffer = ReplayBuffer(BUFFER_SIZE)
        
        # Construct appropriate network based on flags
        if mode == "DDQN":
            self.deep_q = DeepQ(action_space.size())
        elif mode == "DQN":
            self.deep_q = DuelQ(action_space.size())

    def _ml_act(self, observation, reward, done=False):
        # method dedicated to grid2op, to use the MLAgent.
        # the action is predicted using the internal neural network
        predict_movement_int, *_ = self.deep_q.predict_movement(observation[obs_to_model]*self.scale_vect, epsilon=0.)
        # in the line above, we set the "exploration" parameter epsilon to 0 when using the agent.
        # then convert it to a proper vector with the dedicated method (see below)
        predict_movement_vect = self.opt_policy_to_action_vect(predict_movement_int)
        return predict_movement_vect

    def opt_policy_to_action_vect(self, opt_policy):
        # helper to convert an action, returned as an integer, to a vector.
        # DQN or DDQN will output a number (which is the chosen action)
        # that needs to be converted to a proper representation of an action.
        # please refer to the cell where the "action space" is defined, section II.B.a
        res = np.zeros(self.action_size)
        if opt_policy == 0:
            # hard-coded "do nothing"
            pass
        elif opt_policy <= self.action_size:
            # reconnect powerline opt_policy-1 (indices 1 to action_size, both included)
            res[opt_policy-1] = 1
        else:
            # disconnect powerline opt_policy-action_size-1
            res[opt_policy-self.action_size-1] = -1
        return res
    
    def load_network(self, path):
        # not modified compared to the original implementation
        self.deep_q.load_network(path)
    
    def convert_process_buffer(self):
        """Converts the list of NUM_FRAMES images in the process buffer
        into one training sample"""
        # here we simply concatenate the observations in case there are several of them in the "buffer"
        # this function existed in the original implementation, but has been adapted.
        return np.concatenate(self.process_buffer)
    
    def _build_valid_env(self, env=None):
        # now we are creating a valid Environment
        # it's mandatory because no environment is created when the agent is built:
        # an Agent should not get direct access to the environment, but can interact with it only by:
        # * receiving reward
        # * receiving observation
        # * sending action
        
        close_env = False
        
        if env is None:
            env = grid2op.make(action_class=type(self.action_space({})),
                              reward_class=self.reward_fun)
            close_env = True
                               
        # I make sure the action space of the user and the environment are the same.
        if not isinstance(self.action_space, type(env.action_space)):
            raise RuntimeError("Impossible to build an agent with 2 different action spaces")
        if not isinstance(env.action_space, type(self.action_space)):
            raise RuntimeError("Impossible to build an agent with 2 different action spaces")
            
        # make sure the environment is reset
        env.reset() 
        
        # A buffer that keeps the last `NUM_FRAMES` images
        self.replay_buffer.clear()
        self.process_buffer = []
        for _ in range(NUM_FRAMES):
            # Initialize buffer with the first frames
            s1, r1, _, _ = env.step(self.do_nothing_act)
            s1_vect = s1.to_vect()
            # all observations that will be used by the agent will be
            # of the form s1_vect[obs_to_model]*self.scale_vect
            self.process_buffer.append(s1_vect[obs_to_model]*self.scale_vect)
            
        return env, close_env
    
    def train(self, num_frames, env=None):
        # this function existed in the original implementation, but has been slightly adapted.
        
        # first we create an environment or make sure the given environment is valid
        env, close_env = self._build_valid_env(env)
        
        # below that, only slight modifications have been made. They are highlighted
        observation_num = 0
        curr_state = self.convert_process_buffer()
        epsilon = INITIAL_EPSILON
        alive_frame = 0
        total_reward = 0

        while observation_num < num_frames:
            if observation_num % 1000 == 999:
                print(("Executing loop %d" %observation_num))

            # Slowly decay the exploration rate epsilon
            if epsilon > FINAL_EPSILON:
                epsilon -= (INITIAL_EPSILON-FINAL_EPSILON)/EPSILON_DECAY

            initial_state = self.convert_process_buffer()
            self.process_buffer = []

            # it's a bit less convenient than using the SpaceInvader environment.
            # first we need to predict which action to take (represented as an integer),
            # based on the most recent state (initial_state)
            predict_movement_int, predict_q_value = self.deep_q.predict_movement(initial_state, epsilon)
            # then we need to convert it to a valid vector that can represent a grid2op action
            predict_movement_vect = self.opt_policy_to_action_vect(predict_movement_int)
            # then we need to convert it to a proper action
            predict_movement = self.convert_from_vect(predict_movement_vect)
            
            reward, done = 0, False
            for i in range(NUM_FRAMES):
                temp_observation_obj, temp_reward, temp_done, _ = env.step(predict_movement)
                # here it has been adapted too. The observation obtained from the environment is
                # first converted to a vector
                temp_observation = temp_observation_obj.to_vect()
                # then only a subpart of it is used, and it is scaled to have proper values
                temp_observation = temp_observation[obs_to_model]*self.scale_vect
                
                # below this line no changes have been made to the original implementation.
                reward += temp_reward
                self.process_buffer.append(temp_observation)
                done = done | temp_done

            if done:
                print("Lived with maximum time ", alive_frame)
                print("Earned a total of reward equal to ", total_reward)
                # reset the environment
                env.reset()
                
                alive_frame = 0
                total_reward = 0

            new_state = self.convert_process_buffer()
            self.replay_buffer.add(initial_state, predict_movement_int, reward, done, new_state)
            total_reward += reward
            if self.replay_buffer.size() > MIN_OBSERVATION:
                s_batch, a_batch, r_batch, d_batch, s2_batch = self.replay_buffer.sample(MINIBATCH_SIZE)
                self.deep_q.train(s_batch, a_batch, r_batch, d_batch, s2_batch, observation_num)
                self.deep_q.target_train()

            # Save the network every 10000 iterations, and at the very last one
            if observation_num % 10000 == 9999 or observation_num == num_frames-1:
                print("Saving Network")
                self.deep_q.save_network("saved.h5")

            alive_frame += 1
            observation_num += 1
        if close_env:
            env.close()
            
    def calculate_mean(self, num_episode = 100, env=None):
        # this method has been only slightly adapted from the original implementation
        
        # Note that it is NOT the recommended method to evaluate an Agent. Please use "Grid2Op.Runner" instead
        
        # first we create an environment or make sure the given environment is valid
        env, close_env = self._build_valid_env(env)
        
        reward_list = []
        print("Printing scores of each trial")
        for i in range(num_episode):
            done = False
            tot_award = 0
            # "env" is the local environment returned by _build_valid_env (the agent has no "env" attribute)
            env.reset()
            while not done:
                state = self.convert_process_buffer()
                
                # same adaptation as in the "train" function.
                predict_movement_int = self.deep_q.predict_movement(state, 0.0)[0]
                predict_movement_vect = self.opt_policy_to_action_vect(predict_movement_int)
                predict_movement = self.convert_from_vect(predict_movement_vect)
                
                # same adaptation as in the "train" function
                observation_obj, reward, done, _ = env.step(predict_movement)
                observation_vect_full = observation_obj.to_vect()
                observation = observation_vect_full[obs_to_model]*self.scale_vect
                
                tot_award += reward
                self.process_buffer.append(observation)
                self.process_buffer = self.process_buffer[1:]
            print(tot_award)
            reward_list.append(tot_award)
            
        if close_env:
            env.close()
        return np.mean(reward_list), np.std(reward_list)
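
The evaluation loop above calls `predict_movement(state, 0.0)`. Assuming the second argument is an exploration rate, as in the usual epsilon-greedy scheme, a value of 0 means the agent always picks the action with the highest predicted Q-value. The sketch below illustrates that scheme on its own; `epsilon_greedy` and the Q-values are hypothetical and are not part of the `DeepQAgent` class.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, the greedy one otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2])  # made-up Q-values for three actions

# With epsilon = 0 the random branch is never taken: the choice is purely greedy.
greedy_action = epsilon_greedy(q_values, 0.0, rng)
```

During training a non-zero epsilon keeps some exploration; during evaluation, setting it to 0 measures what the network has actually learned.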

#### b) Training the model

Now we can define the model (agent), and then train it.

This is done exactly the same way as in the Abhinav Sagar implementation.

**NB** The code below can take a few minutes to run. It's training a Deep Reinforcement Learning Agent after all. If this takes too long on your machine, you can always decrease "nb_frame", and set it to 1000 for example. In this case, the Agent will probably not be really good.

**NB** For a real Agent, it would take much longer to train.

In [11]:
nb_frame = 10000
my_agent = DeepQAgent(runner.env.action_space, mode="DQN")
my_agent.train(nb_frame)

Instructions for updating:
Colocations handled automatically by placer.
Successfully constructed networks.
Lived with maximum time  139
Earned a total of reward equal to  2779.9432466330645
Lived with maximum time  147
Earned a total of reward equal to  2919.9435985214477
Lived with maximum time  60
Earned a total of reward equal to  1179.9815756418086
Lived with maximum time  66
Earned a total of reward equal to  1299.9797465166648
Lived with maximum time  60
Earned a total of reward equal to  1179.9763611449916
Lived with maximum time  122
Earned a total of reward equal to  2419.958479722687
Lived with maximum time  28
Earned a total of reward equal to  539.9901606307824
Lived with maximum time  56
Earned a total of reward equal to  1099.9828651013831
Lived with maximum time  119
Earned a total of reward equal to  2359.9471366440293
Lived with maximum time  91
Earned a total of reward equal to  1799.964909821738
Lived with maximum time  81
Earned a total of reward equal to  1599.9727

We had a loss equal to  2838.6738
Lived with maximum time  21
Earned a total of reward equal to  419.9924386821733
Lived with maximum time  35
Earned a total of reward equal to  679.9876148737703
Executing loop 6999
We had a loss equal to  1574.3474
We had a loss equal to  2279.1033
Lived with maximum time  245
Earned a total of reward equal to  4879.907859242132
We had a loss equal to  2771.8103
We had a loss equal to  2869.1582
Lived with maximum time  171
Earned a total of reward equal to  3399.9423603286746
We had a loss equal to  2616.048
We had a loss equal to  4314.7754
Lived with maximum time  216
Earned a total of reward equal to  4299.895433504523
We had a loss equal to  1699.2465
We had a loss equal to  3021.687
Lived with maximum time  133
Earned a total of reward equal to  2639.956721026433
Lived with maximum time  11
Earned a total of reward equal to  199.9959546476586
Lived with maximum time  31
Earned a total of reward equal to  599.9890503036517
We had a loss equal to 

## III) Evaluating the Agent

And now, time to test this trained agent.

To do that, we have multiple choices.

Either we recode the "DeepQAgent" class to load the stored weights (that have been saved during training) when it is initialized (not covered in this notebook), or we can directly specify the "instance" of the Agent to use in the Grid2Op Runner.

Doing that is fairly simple. First, you need to specify that you won't use the "*agentClass*" argument, by setting it to ``None``, and secondly you simply provide the agent to use in the *agentInstance* argument.

**NB** If you don't do that, the Runner will not be created (the constructor will raise an exception). And if you choose to use the "*agentClass*" argument instead, your agent will be rebuilt from scratch. So **if it doesn't load the weights**, it will behave as an untrained agent, unlikely to perform well on the task.

### III.A) Evaluate the Agent

Now that we have "successfully" trained our Agent, we will evaluate it. As opposed to the training, the evaluation is done classically using a standard Runner.

Note that the Runner will use a "scoring function" that might be different from the "reward function" used during training. In our case, it's not. We use the `L2RPNReward` in both cases.

In the code below, we comment on what can be different and what must be identical between training and evaluation of the model.

In [12]:
scoring_function = L2RPNReward
# make a runner
runner = Runner(init_grid_path=grid2op.CASE_14_FILE, # this should be the same grid as the one the agent was trained on
                path_chron=grid2op.CHRONICS_MLUTIEPISODE,  # chronics can change of course
                gridStateclass=Multifolder, # the class of chronics can change too
                gridStateclass_kwargs={"gridvalueClass": GridStateFromFileWithForecasts},  # so this can change too
                names_chronics_to_backend = grid2op.NAMES_CHRONICS_TO_BACKEND,  # this can also change
                agentInstance=my_agent,  # here i pass a trained agent, no need to read it from the 
                agentClass=None,  # if I use an instance of Agent, I cannot provide a class
                rewardClass=scoring_function,  # this can be anything, not necessarily the same as for training
                actionClass=PowerLineSet  # this must be the same as the one used for training
                )

Run the Agent and save the results. Although we have used the "runner.run" call several times, we have never really looked at the "path_save" argument. This path allows you to save a lot of information about your Agent's behaviour. All the information saved is described in the documentation [here](file:///home/donnotben/Documents/Grid2Op/documentation/html/runner.html).

In [13]:
# initialize it
res = runner.run(nb_episode=1, path_save="trained_agent_log")
print("The results for the trained agent are:")
for chron_name, cum_reward, nb_time_step, max_ts in res:
    msg_tmp = "\tFor chronics located at {}\n".format(chron_name)
    msg_tmp += "\t\t - cumulative reward: {:.6f}\n".format(cum_reward)
    msg_tmp += "\t\t - number of time steps completed: {:.0f} / {:.0f}".format(nb_time_step, max_ts)
    print(msg_tmp)

The results for the trained agent are:
	For chronics located at /home/donnotben/.local/lib/python3.6/site-packages/grid2op/data/test_multi_chronics/1
		 - cumulative reward: 5739.905567
		 - number of time steps completed: 287 / 287


### III.B) Inspect the Agent 

Please refer to the official documentation for more information about the content of the directory where the data are saved. Note that the saving of the information is triggered by the "path_save" argument sent to the "runner.run" function.

Some of the information present in this directory is listed below. If enabled, the `Runner` will save the information in a structured way. For each episode there will be a folder
with:

  - "episode_meta.json" that represents some meta information about:

    - "backend_type": the name of the `grid2op.Backend` class used
    - "chronics_max_timestep": the **maximum** number of timestep for the chronics used
    - "chronics_path": the path where the temporal data (chronics) are located
    - "env_type": the name of the `grid2op.Environment` class used.
    - "grid_path": the path where the powergrid has been loaded from

  - "episode_times.json": gives some information about the total time spent in the different parts of the runner, mainly the
    `grid2op.Agent` (and especially its method `grid2op.Agent.act`) and the amount of time spent in the
    `grid2op.Environment`

  - "_parameters.json": is a JSON representation of the `grid2op.Parameters` used for this episode
  - "rewards.npy" is a numpy 1d array giving the rewards at each time step. We adopted the convention that the stored
    reward at index `i` is the one observed by the agent at time `i` and **NOT** the reward sent by the
    `grid2op.Environment` after the action has been implemented.
  - "agent_exec_times.npy" is a numpy 1d array giving the execution time of each time step of the episode
  - "actions.npy" gives the actions that have been taken by the `grid2op.Agent`. Row `i` of "actions.npy" is a
    vectorized representation of the action performed by the agent at time step `i`, *i.e.* **after** having observed
    the observation present at row `i` of "observations.npy" and the reward shown in row `i` of "rewards.npy".
  - "disc_lines_cascading_failure.npy" gives which lines have been disconnected during the simulation of the cascading failure at each
    time step. The same convention as for "rewards.npy" has been adopted. This means that the powerlines are
    disconnected when the `grid2op.Agent` takes the `grid2op.Action` at time step `i`.
  - "observations.npy" is a numpy 2d array representing the `grid2op.Observation` at the disposal of the
    `grid2op.Agent` when it took its action.
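
As a rough illustration of how such a folder can be read back, the sketch below loads "episode_meta.json" and "rewards.npy" with the standard `json` module and `numpy`. The helper `load_episode_log` is hypothetical, and the folder it reads is a temporary stand-in filled with made-up values so that the example runs on its own; with a real log you would pass a path such as the episode folder written by the runner.

```python
import json
import os
import tempfile

import numpy as np


def load_episode_log(episode_dir):
    """Load the meta information and the per-step rewards of one episode folder."""
    with open(os.path.join(episode_dir, "episode_meta.json")) as f:
        meta = json.load(f)
    rewards = np.load(os.path.join(episode_dir, "rewards.npy"))
    return meta, rewards


# Stand-in episode folder with made-up content (NOT real Grid2Op output).
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "episode_meta.json"), "w") as f:
    json.dump({"chronics_max_timestep": 3}, f)
np.save(os.path.join(demo_dir, "rewards.npy"), np.array([20.0, 19.5, 20.0]))

meta, rewards = load_episode_log(demo_dir)
cumulative_reward = float(rewards.sum())  # sum of the stored per-step rewards
```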
    

We can first look at the directory where the data are stored:

In [14]:
!ls trained_agent_log

1


As we can see, there is only one folder there. It's named "1" because, in the original data, this came from the folder named "1" (the original data are located at "/home/donnotben/.local/lib/python3.6/site-packages/grid2op/data/test_multi_chronics/")

If there were multiple episodes, each episode would have its own folder, with a name as close as possible to the original name of the data. This is done to make the results easier to study.

Now let's see what is inside this folder:

In [15]:
!ls trained_agent_log/1

actions.npy			  episode_meta.json   _parameters.json
agent_exec_times.npy		  episode_times.json  rewards.npy
disc_lines_cascading_failure.npy  observations.npy


We can for example load the "actions" performed by the Agent, and have a look at them.

To do that we will load the action array (where each row is an action represented as a vector) and use the action_space to convert each row back into a valid action object.

In [16]:
all_actions = np.load(os.path.join("trained_agent_log", "1", "actions.npy"))
li_actions = []
for i in range(all_actions.shape[0]):
    tmp = runner.env.action_space.from_vect(all_actions[i,:])
    li_actions.append(tmp)

This allows us to have a deeper look at the actions, and their effect. Note that here, we used actions that can only **set** the line status, so looking at their effect is pretty straightforward.

Also, note that as opposed to "change", if a powerline is already connected, trying to **set** it as connected has absolutely no impact.
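
The difference between the two semantics can be sketched on a plain boolean status vector. This is a toy model written with numpy only, not the Grid2Op implementation: `apply_set` and `apply_change` are hypothetical helpers, and the encoding used for "set" (+1 to connect, -1 to disconnect, 0 to leave untouched) is an assumption made for the illustration.

```python
import numpy as np


def apply_set(status, target):
    """'set' semantics: +1 connects, -1 disconnects, 0 leaves the line untouched."""
    new_status = status.copy()
    new_status[target == 1] = True
    new_status[target == -1] = False
    return new_status


def apply_change(status, switch):
    """'change' semantics: toggle every line flagged True in `switch`."""
    return np.logical_xor(status, switch)


status = np.array([True, True, False])  # True = connected

# Setting an already connected line to "connected" changes nothing...
after_set = apply_set(status, np.array([1, 0, 0]))
# ...whereas "changing" that same line disconnects it.
after_change = apply_change(status, np.array([True, False, False]))
```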

In [17]:
line_disc = 0
line_reco = 0
for act in li_actions:
    dict_ = act.as_dict()
    if "set_line_status" in dict_:
        line_reco +=  dict_["set_line_status"]["nb_connected"]
        line_disc +=  dict_["set_line_status"]["nb_disconnected"]
line_reco

0

As we can see, in our case the agent always tries to reconnect a powerline. As all lines are always connected already, this Agent basically does nothing.

We can also do the same kind of post-analysis for the observations, even though here, as the observations come from files, it's probably not particularly interesting.

In [18]:
all_observations = np.load(os.path.join("trained_agent_log", "1", "observations.npy"))
li_observations = []
nb_real_disc = 0
for i in range(all_observations.shape[0]):
    tmp = runner.env.observation_space.from_vect(all_observations[i,:])
    li_observations.append(tmp)
    nb_real_disc += (np.sum(tmp.line_status) - tmp.line_status.shape[0])
nb_real_disc

-286
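
The value is negative because the loop accumulates, at each step, the number of connected lines minus the total number of lines; its absolute value is therefore the number of (time step, powerline) pairs where a line was disconnected. The sketch below shows the same computation on a made-up status matrix, along with a per-step count that may be easier to read. The array here is purely illustrative and does not come from the logged observations.

```python
import numpy as np

# Made-up line statuses: 3 time steps, 4 powerlines (True = connected).
line_status = np.array([
    [True, True, True, True],
    [True, False, True, True],
    [True, False, False, True],
])

# Number of disconnected lines at each time step.
n_disc_per_step = (~line_status).sum(axis=1)

# Same quantity as accumulated in the cell above: connected minus total.
signed_total = int(line_status.sum() - line_status.size)
```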