# Actions as vector, and RL agent training

***Disclaimer***: This file referenced some files in other directories. In order to have working cross referencing it's recommended to start the notebook server from the root directory (`Grid2Op`) of the package and __not__ in the `getting_started` sub directory:
```bash
cd Grid2Op
jupyter notebook
```

***NB*** For more information about how to use the package, a general help can be built locally (provided that sphinx is installed on the machine) with:
```bash
cd Grid2Op
make html
```
from the top directory of the package (usually `Grid2Op`).

Once build, the help can be access from [here](../documentation/html/index.html)

It is recommended to have a look at the [0_basic_functionalities](0_basic_functionalities.ipynb), [1_Observation_Agents](1_Observation_Agents.ipynb) and [2_Action_GridManipulation](2_Action_GridManipulation.ipynb) notebooks before getting into this one.

**Objectives**

In this notebook we will expose :
* how to use the "vectorized" verions of Observations and Actions
* how to train a (stupid) Agent using reinforcement learning.


In [1]:
import os
import sys
import grid2op

In [2]:
res = None
try:
    from jyquickhelper import add_notebook_menu
    res = add_notebook_menu()
except ModuleNotFoundError:
    print("Impossible to automatically add a menu / table of content to this notebook.\nYou can download \"jyquickhelper\" package with: \n\"pip install jyquickhelper\"")
res

## I) Manipulating vectors instead of class

Grid2op package has been built with an "object oriented" perspective: almost everything is encapsulated in a dedicated `class`. This allows for more customization of the plateform.

The downside of this approach is that machine learning method, and especially deep learning, often prefers to deal with vectors rather than with `complex` objects. 

To have the best of both worlds, we provided a `MLAgent` class that handles the convertion to / from these classes to numpy vector. In this notebook, we will see how this class can be used and overidden to develop a Agent that learns how to perform action on the grid. By default, this class does nothing, and it's possible (and encouraged) to override the `MLAgent._ml_act` method to build smarter Agents.

In [3]:
# import the usefull class
import numpy as np

from grid2op.Runner import Runner
from grid2op.ChronicsHandler import Multifolder, GridStateFromFileWithForecasts
from grid2op.Agent import MLAgent
from grid2op.Reward import L2RPNReward
from grid2op.Action import PowerLineSet
# make a runner
runner = Runner(init_grid_path=grid2op.CASE_14_FILE,
               path_chron=grid2op.CHRONICS_MLUTIEPISODE,
               gridStateclass=Multifolder,
               gridStateclass_kwargs={"gridvalueClass": GridStateFromFileWithForecasts},
               names_chronics_to_backend = grid2op.NAMES_CHRONICS_TO_BACKEND,
                agentClass=MLAgent,
               rewardClass=L2RPNReward,
               actionClass=PowerLineSet)
# initialize it
res = runner.run(nb_episode=1)
print("The results for the DoNothing agent are:")
for chron_name, cum_reward, nb_time_step, max_ts in res:
    msg_tmp = "\tFor chronics located at {}\n".format(chron_name)
    msg_tmp += "\t\t - cumulative reward: {:.6f}\n".format(cum_reward)
    msg_tmp += "\t\t - number of time steps completed: {:.0f} / {:.0f}".format(nb_time_step, max_ts)
    print(msg_tmp)

The results for the DoNothing agent are:
	For chronics located at /home/donnotben/.local/lib/python3.6/site-packages/grid2op/data/test_multi_chronics/1
		 - cumulative reward: 5739.951023
		 - number of time steps completed: 287 / 287


And that is it. It is as simple as changing the "agentClass". Now we will provide an example on how this class can be overidden to train a "real" Agent.

## II) Training an Agent

For this tutorial, we will expose to built a Q-learning Agent. Most of the code (except the neural network architecture) are inspired from this blog post: [https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368](https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368).

**Requirements** This notebook require to have `keras` and `gym` installed on your machine.

The Agent we will train will:
* only act on the topology
* only try to set the status of the powerline (he won't try to modify the "change" them, nor try to perform any kind of bus splitting merging)

This is achieved by passing the "*PowerLineSet*" class in the actionClass of the Runner. For more information, one can consult the [PowerLineSet](../documentation/html/action.html#grid2op.Action.PowerLineSet) help page (if built locally, which is recommended) or the [PowerLineSet](../grid2op/Action.py) class definition in the [Action.py](../grid2op/Action.py) file.

First we define a "replay buffer" necessary to train the Agent.

In [4]:
from collections import deque

class ReplayBuffer:
    """Constructs a buffer object that stores the past moves
    and samples a set of subsamples"""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.count = 0
        self.buffer = deque()

    def add(self, s, a, r, d, s2):
        """Add an experience to the buffer"""
        # S represents current state, a is action,
        # r is reward, d is whether it is the end, 
        # and s2 is next state
        experience = (s, a, r, d, s2)
        if self.count < self.buffer_size:
            self.buffer.append(experience)
            self.count += 1
        else:
            self.buffer.popleft()
            self.buffer.append(experience)

    def size(self):
        return self.count

    def sample(self, batch_size):
        """Samples a total of elements equal to batch_size from buffer
        if buffer contains enough elements. Otherwise return all elements"""

        batch = []

        if self.count < batch_size:
            batch = random.sample(self.buffer, self.count)
        else:
            batch = random.sample(self.buffer, batch_size)

        # Maps each experience in batch in batches of states, actions, rewards
        # and new states
        s_batch, a_batch, r_batch, d_batch, s2_batch = list(map(np.array, list(zip(*batch))))

        return s_batch, a_batch, r_batch, d_batch, s2_batch

    def clear(self):
        self.buffer.clear()
        self.count = 0

Then we import the necessary dependencies

In [5]:
import numpy as np
import random
import keras
import keras.backend as K
from keras.models import load_model, Sequential, Model
from keras.optimizers import Adam
from keras.layers.core import Activation, Dropout, Flatten, Dense
from keras.layers import Input

Using TensorFlow backend.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])


Then we re-use the default parameters, note that these can be optimized. Nothing has been changed for this example.

In [6]:
DECAY_RATE = 0.99
BUFFER_SIZE = 40000
MINIBATCH_SIZE = 64
TOT_FRAME = 3000000
EPSILON_DECAY = 1000000
MIN_OBSERVATION = 100 #5000
FINAL_EPSILON = 0.05
INITIAL_EPSILON = 0.1
TAU = 0.01
# Number of frames to "throw" into network
NUM_FRAMES = 1

Now the model is built. Please check:
* [https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368](https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368)
* and [ddqn_space/deep_Q.py](https://github.com/abhinavsagar/Reinforcement-Learning-Tutorial/blob/master/ddqn&#160;space/deep_Q.py)

for more information.

Note that we replaced all Convolutional layer with Fully connected layers, and we changed the input size and output size.

In [7]:
observation_size_init = runner.env.observation_space.size()
topo_vect_size = runner.env.observation_space.dim_topo
action_size = runner.env.action_space.size()

The Agent we will train will be able to act one only one powerline, and to either reconnect it, or disconnect it. It has a choice then bewteen 2 * `number of powerlines`.

This `action_space` will be represented by a "one hot" vector, as followed:
* if the first component (index 0) is set to 1, then the Agent does nothing
* if the component equal to "1" has its index `i` between 1 and the number of powerline, it will consist in reconnecting powerline `i-1`
* if the component equal to "1" has its index `i` between the number of powerline + 1 and twice "the number of powerline", it will consist in disconnecting powerline `i-nb_powerline-1`

In [8]:
NUM_ACTIONS = 2* action_size + 1
obs_to_model = np.full(observation_size_init, fill_value=False, dtype=np.bool)
obs_to_model[(-3*action_size-topo_vect_size):(-action_size-topo_vect_size)] = True
observation_size = np.sum(obs_to_model)

In [18]:
from grid2op.Agent import MLAgent

class DeepQ(object):
    """Constructs the desired deep q learning network"""
    def __init__(self, action_size, lr=0.00001):
        self.action_size = action_size
        self.model = None
        self.target_model = None
        self.lr_ = lr
        self.construct_q_network()
    
    def construct_q_network(self):
        # Uses the network architecture found in DeepMind paper
        self.model = Sequential()
        self.model.add(Dense(observation_size*NUM_FRAMES))
        self.model.add(Activation('relu'))
        self.model.add(Dense(observation_size))
        self.model.add(Activation('relu'))
        self.model.add(Dense(observation_size))
        self.model.add(Activation('relu'))
        self.model.add(Dense(2*NUM_ACTIONS))
        self.model.add(Activation('relu'))
        self.model.add(Dense(NUM_ACTIONS))
        self.model.compile(loss='mse', optimizer=Adam(lr=self.lr_))

        # Creates a target network as described in DeepMind paper
        self.target_model = Sequential()
        self.target_model.add(Dense(observation_size*NUM_FRAMES))
        self.target_model.add(Activation('relu'))
        self.target_model.add(Dense(observation_size))
        self.target_model.add(Activation('relu'))
        self.target_model.add(Dense(observation_size))
        self.target_model.add(Activation('relu'))
        self.target_model.add(Dense(2*NUM_ACTIONS))
        self.target_model.add(Activation('relu'))
        self.target_model.add(Dense(NUM_ACTIONS))
        self.target_model.compile(loss='mse', optimizer=Adam(lr=self.lr_))
        self.target_model.set_weights(self.model.get_weights())
    
    def predict_movement(self, data, epsilon):
        """Predict movement of game controler where is epsilon
        probability randomly move."""
        rand_val = np.random.random()
        q_actions = self.model.predict(data.reshape(1, observation_size*NUM_FRAMES), batch_size = 1)
        
        if rand_val < epsilon:
            opt_policy = np.random.randint(0, NUM_ACTIONS)
        else:
            opt_policy = np.argmax(np.abs(q_actions))
            

        return opt_policy, q_actions[0, opt_policy]

    def train(self, s_batch, a_batch, r_batch, d_batch, s2_batch, observation_num):
        """Trains network to fit given parameters"""
        batch_size = s_batch.shape[0]
        targets = np.zeros((batch_size, NUM_ACTIONS))

        for i in range(batch_size):
            targets[i] = self.model.predict(s_batch[i].reshape(1, observation_size*NUM_FRAMES), batch_size = 1)
            fut_action = self.target_model.predict(s2_batch[i].reshape(1, observation_size*NUM_FRAMES), batch_size = 1)
#             pdb.set_trace()
            targets[i, a_batch[i]] = r_batch[i]
            if d_batch[i] == False:
                targets[i, a_batch[i]] += DECAY_RATE * np.max(fut_action)
        loss = self.model.train_on_batch(s_batch, targets)
#         print("init loss: {}".format(loss))
#         for i in range(100):
#             loss = self.model.train_on_batch(s_batch, targets)
#         print("init loss after 100 training: {}".format(loss))
#         pdb.set_trace()
        # Print the loss every 10 iterations.
        if observation_num % 10 == 0:
            print("We had a loss equal to ", loss)

    def save_network(self, path):
        # Saves model at specified path as h5 file
        self.model.save(path)
        print("Successfully saved network.")

    def load_network(self, path):
        self.model = load_model(path)
        print("Succesfully loaded network.")

    def target_train(self):
        model_weights = self.model.get_weights()
        target_model_weights = self.target_model.get_weights()
        for i in range(len(model_weights)):
            target_model_weights[i] = TAU * model_weights[i] + (1 - TAU) * target_model_weights[i]
        self.target_model.set_weights(target_model_weights)
        
class DuelQ(object):
    """Constructs the desired deep q learning network"""
    def __init__(self, lr=0.00001):
        self.lr_ = lr
        self.model = None
        self.construct_q_network()

    def construct_q_network(self):
        # Uses the network architecture found in DeepMind paper
        self.model = Sequential()
        
        input_layer = Input(shape = (observation_size*NUM_FRAMES,))
        lay1 = Dense(observation_size*NUM_FRAMES)(input_layer)
        lay1 = Activation('relu')(lay1)
        
        lay2 = Dense(observation_size)(lay1)
        lay2 = Activation('relu')(lay2)
        
        lay3 = Dense(2*NUM_ACTIONS)(lay2)
        lay3 = Activation('relu')(lay3)
        
        fc1 = Dense(NUM_ACTIONS)(lay3)
        advantage = Dense(NUM_ACTIONS)(fc1)
        fc2 = Dense(NUM_ACTIONS)(lay3)
        value = Dense(1)(fc2)
        
#         policy = merge([advantage, value], mode = lambda x: x[0]-K.mean(x[0])+x[1], output_shape = (NUM_ACTIONS,))
        #tmp =  keras.layers.Subtract()([advantage, K.mean(advantage)])
#         tmp = advantage - K.mean(advantage)
        mn_ = K.mean(advantage)
        tmp = keras.layers.subtract([advantage, mn_])
        policy = keras.layers.add([tmp, value])
#         policy = Dense(NUM_ACTIONS)(merge_layer)

        self.model = Model(inputs=[input_layer], outputs=[policy])
        self.model.compile(loss='mse', optimizer=Adam(lr=self.lr_))

        self.target_model = Model(inputs=[input_layer], outputs=[policy])
        self.target_model.compile(loss='mse', optimizer=Adam(lr=self.lr_))
        print("Successfully constructed networks.")
    
    def predict_movement(self, data, epsilon):
        """Predict movement of game controler where is epsilon
        probability randomly move."""
        q_actions = self.model.predict(data.reshape(1, observation_size*NUM_FRAMES), batch_size = 1)
        opt_policy = np.argmax(q_actions)
        rand_val = np.random.random()
        if rand_val < epsilon:
            opt_policy = np.random.randint(0, NUM_ACTIONS)
        return opt_policy, q_actions[0, opt_policy]

    def train(self, s_batch, a_batch, r_batch, d_batch, s2_batch, observation_num):
        """Trains network to fit given parameters"""
        batch_size = s_batch.shape[0]
        targets = np.zeros((batch_size, NUM_ACTIONS))

        for i in range(batch_size):
            targets[i] = self.model.predict(s_batch[i].reshape(1, observation_size*NUM_FRAMES), batch_size = 1)
            fut_action = self.target_model.predict(s2_batch[i].reshape(1, observation_size*NUM_FRAMES), batch_size = 1)
            targets[i, a_batch[i]] = r_batch[i]
            if d_batch[i] == False:
                targets[i, a_batch[i]] += DECAY_RATE * np.max(fut_action)

        loss = self.model.train_on_batch(s_batch, targets)

        # Print the loss every 10 iterations.
        if observation_num % 10 == 0:
            print("We had a loss equal to ", loss)

    def save_network(self, path):
        # Saves model at specified path as h5 file
        self.model.save(path)
        print("Successfully saved network.")

    def load_network(self, path):
        self.model.load_weights(path)
        self.target_model.load_weights(path)
        print("Succesfully loaded network.")

    def target_train(self):
        model_weights = self.model.get_weights()
        self.target_model.set_weights(model_weights)

In the "reference" article [https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368](https://towardsdatascience.com/deep-reinforcement-learning-tutorial-with-open-ai-gym-c0de4471f368), the author Abhinav Sagar made a dedicated environment based on SpaceInvader in the gym repository. We proceed here on a similar way, but with a the grid2op environment.

In [31]:
from grid2op.Parameters import Parameters
from grid2op.BackendPandaPower import PandaPowerBackend
from grid2op.ChronicsHandler import ChronicsHandler, Multifolder, GridStateFromFileWithForecasts
from grid2op.Environment import Environment
from grid2op.Action import PowerLineSet
from grid2op.Reward import L2RPNReward
import pdb

class Grid2Op(MLAgent):
    def __init__(self, action_space, mode="DDQN"):
        MLAgent.__init__(self, action_space)
        self.action_size = action_space.size()
        param = Parameters()
        backend = PandaPowerBackend()
        data_feeding = ChronicsHandler(chronicsClass=Multifolder,
                                       path=grid2op.CHRONICS_MLUTIEPISODE,
                                       gridvalueClass=GridStateFromFileWithForecasts)
        
        self.env = Environment(init_grid_path=grid2op.CASE_14_FILE,
                               chronics_handler=data_feeding,
                               backend=backend,
                               parameters=param,
                               names_chronics_to_backend=grid2op.NAMES_CHRONICS_TO_BACKEND,
                              actionClass=PowerLineSet,
                              rewardClass=L2RPNReward)
        self.env.reset()
        if not isinstance(action_space, type(self.env.action_space)):
            raise RuntimeError("Imposssible to build an agent with 2 different action space")
        if not isinstance(self.env.action_space, type(action_space)):
            raise RuntimeError("Imposssible to build an agent with 2 different action space")
            
        self.replay_buffer = ReplayBuffer(BUFFER_SIZE)

        self.do_nothing_act = self.env.action_space({})
        
        self.scale_vect = np.ones(observation_size)
        self.scale_vect[:action_size] = 150
        
        # Construct appropriate network based on flags
        if mode == "DDQN":
            self.deep_q = DeepQ(self.env.action_space.size())
        elif mode == "DQN":
            raise RuntimeError("Implementation does not work yet")
            self.deep_q = DuelQ(self.env.action_space)

        # A buffer that keeps the last 3 images
        self.process_buffer = []
        for _ in range(NUM_FRAMES):
            # Initialize buffer with the first frames
            s1, r1, _, _ = self.env.step(self.do_nothing_act)
            s1_vect = s1.to_vect()
            self.process_buffer.append(s1_vect[obs_to_model]*self.scale_vect)

    def opt_policy_to_action_vect(self, opt_policy):
        res = np.zeros(self.action_size)
        if opt_policy == 0:
            # hard encode "do nothing"
            pass
        elif opt_policy < self.action_space.size():
            # reconnect a powerline
            res[opt_policy-1] = 1
        else:
            # disconnect a powerline
            res[opt_policy-self.action_space.size()-1] = -1
        return res
    
    def load_network(self, path):
        self.deep_q.load_network(path)

    def _ml_act(self, observation, reward, done=False):
        predict_movement_int, *_ = self.deep_q.predict_movement(observation[obs_to_model], epsilon=0)
        predict_movement_vect = self.opt_policy_to_action_vect(predict_movement_int)
#         predict_movement = self.deep_q.convert_from_vect(predict_movement_vect)
        return predict_movement_vect
    
    def convert_process_buffer(self):
        """Converts the list of NUM_FRAMES images in the process buffer
        into one training sample"""
        return np.concatenate(self.process_buffer)
    
    def train(self, num_frames):
        observation_num = 0
        curr_state = self.convert_process_buffer()
        epsilon = INITIAL_EPSILON
        alive_frame = 0
        total_reward = 0

        while observation_num < num_frames:
            if observation_num % 1000 == 999:
                print(("Executing loop %d" %observation_num))

            # Slowly decay the learning rate
            if epsilon > FINAL_EPSILON:
                epsilon -= (INITIAL_EPSILON-FINAL_EPSILON)/EPSILON_DECAY

            initial_state = self.convert_process_buffer()
            self.process_buffer = []

            predict_movement_int, predict_q_value = self.deep_q.predict_movement(curr_state, epsilon)
            predict_movement_vect = self.opt_policy_to_action_vect(predict_movement_int)
            predict_movement = self.convert_from_vect(predict_movement_vect)
            
            reward, done = 0, False
            for i in range(NUM_FRAMES):
                temp_observation_obj, temp_reward, temp_done, _ = self.env.step(predict_movement)
                temp_observation = temp_observation_obj.to_vect()
                temp_observation = temp_observation[obs_to_model]*self.scale_vect
                reward += temp_reward
                self.process_buffer.append(temp_observation)
                done = done | temp_done

#             if observation_num % 10 == 0:
#                 print("We predicted a q value of ", predict_q_value)

            if done:
                print("Lived with maximum time ", alive_frame)
                print("Earned a total of reward equal to ", total_reward)
                self.env.reset()
                alive_frame = 0
                total_reward = 0

            new_state = self.convert_process_buffer()
            self.replay_buffer.add(initial_state, predict_movement_int, reward, done, new_state)
            total_reward += reward

            if self.replay_buffer.size() > MIN_OBSERVATION:
                s_batch, a_batch, r_batch, d_batch, s2_batch = self.replay_buffer.sample(MINIBATCH_SIZE)
                self.deep_q.train(s_batch, a_batch, r_batch, d_batch, s2_batch, observation_num)
                self.deep_q.target_train()

            # Save the network every 100000 iterations
            if observation_num % 10000 == 9999 or observation_num == num_frames-1:
                print("Saving Network")
                self.deep_q.save_network("saved.h5")

            alive_frame += 1
            observation_num += 1
            
    def calculate_mean(self, num_episode = 100):
        reward_list = []
        print("Printing scores of each trial")
        for i in range(num_episode):
            done = False
            tot_award = 0
            self.env.reset()
            while not done:
                state = self.convert_process_buffer()
                predict_movement_int = self.deep_q.predict_movement(state, 0.0)[0]
                predict_movement_vect = self.opt_policy_to_action_vect(predict_movement_int)
                predict_movement = self.convert_from_vect(predict_movement_vect)
                
                observation, reward, done, _ = self.env.step(predict_movement)
                tot_award += reward
                self.process_buffer.append(observation)
                self.process_buffer = self.process_buffer[1:]
            print(tot_award)
            reward_list.append(tot_award)
        return np.mean(reward_list), np.std(reward_list)

Now we can define the model (agent), and then train it.

In [32]:
my_agent = Grid2Op(runner.env.action_space, mode="DDQN")
my_agent.train(1000000)

Lived with maximum time  92
Earned a total of reward equal to  1839.9644994319299
We had a loss equal to  10.302423
We had a loss equal to  10.060549
We had a loss equal to  10.055819
We had a loss equal to  9.978113
We had a loss equal to  9.79144
We had a loss equal to  9.92123
We had a loss equal to  9.756811
We had a loss equal to  9.896887
Lived with maximum time  80
Earned a total of reward equal to  1579.975641080914
We had a loss equal to  9.734239
We had a loss equal to  9.7110405
Saving Network
Successfully saved network.


And now, time to test this trained agent.

In [33]:
# make a runner
runner = Runner(init_grid_path=grid2op.CASE_14_FILE, # this should be the same grid as the one the agent is trained one
               path_chron=grid2op.CHRONICS_MLUTIEPISODE,  # chronics can changed of course
               gridStateclass=Multifolder, # the class of chronics can changed too
               gridStateclass_kwargs={"gridvalueClass": GridStateFromFileWithForecasts},  # so this can changed too
               names_chronics_to_backend = grid2op.NAMES_CHRONICS_TO_BACKEND,  # this also can changed
                agentInstance=my_agent,  # here i pass a trained agent, no need to read it from the 
                agentClass=None,  # if i use an instance of Agent, i cannot provide a class
               rewardClass=L2RPNReward,  # this can be anything, not necessarily the same for training
               actionClass=PowerLineSet  # this should be the same as the one used for training.
               )
# initialize it
res = runner.run(nb_episode=1)
print("The results for the DoNothing agent are:")
for chron_name, cum_reward, nb_time_step, max_ts in res:
    msg_tmp = "\tFor chronics located at {}\n".format(chron_name)
    msg_tmp += "\t\t - cumulative reward: {:.6f}\n".format(cum_reward)
    msg_tmp += "\t\t - number of time steps completed: {:.0f} / {:.0f}".format(nb_time_step, max_ts)
    print(msg_tmp)

The results for the DoNothing agent are:
	For chronics located at /home/donnotben/.local/lib/python3.6/site-packages/grid2op/data/test_multi_chronics/1
		 - cumulative reward: 5739.939937
		 - number of time steps completed: 287 / 287
