\chapter{Nash Equilibrium Analysis}
\label{ch:nashanalysis}
This chapter presents a simulation that examines the convergence to a Nash
equilibrium of a system of agents competing to sell electricity. Value function
based and policy gradient reinforcement learning algorithms are compared in
their convergence to an optimal policy using a six bus electric power system
model.
\section{Introduction}
This thesis presents the first case of policy gradient reinforcement learning
methods being applied to simulated electricity trade. As a first step it is
necessary to confirm that when using these methods, a system of multiple agents
will converge to the same Nash equilibrium\footnote{Informally, a Nash
equilibrium is a point in a non-cooperative game at which no player is
motivated to deviate from its strategy, as it would result in lower gain
\cite{nash50,nash51}.} that a traditional closed-form simulation would produce.
This is the same approach used by \citeA{krause:nash06} before performing the
study of congestion management techniques that is reviewed in Section
\ref{sec:related_cong}. Nash equilibria can be difficult to determine in
complex systems so the experiment presented here utilises a model simple enough
that solutions can be determined through exhaustive search.
By observing actions taken and rewards received by each agent over the initial
simulation periods it is possible to compare the speed and consistency with
which different algorithms converge to an optimal policy. In the following
sections the objectives of the simulation are defined, the simulation setup is
explained and plots of results, with discussion and critical analysis, are
provided.
\section{Aims and Objectives}
Some elements of the simulations reported in this chapter are similar to those
presented by \citeA{krause:nash06}. One initial aim of this work is to
reproduce their findings as a means of validating the approach. The additional
objectives are to show:
\begin{itemize}
\item That policy gradient methods converge to the same Nash equilibrium as
value function based methods and traditional closed-form simulations,
\item Some of the characteristics of policy gradient methods and how they
differ from value function based methods.
% \item The sensitivity of policy convergence to algorithm parameter choices
% and policy function approximation structure.
\end{itemize}
Meeting these objectives aims to provide a basis for using policy gradient
methods in more complex simulations, to show that they can be employed in
\textit{learning to trade power} and to provide guidance for algorithm parameter
selection.
\section{Method of Simulation}
% Each learning method is tested individually using a range of parameter
% configurations. A power system model with one bus, one generator $k$ and one
% dispatchable load $l$. In this context, the market clearing process is
% equivalent to creating offer and bids stacks and finding the point of
% intersection. A passive agent is associated with the dispatchable load. This
% agent bids for $-p_{g,l}^{min}$ at marginal cost each period regardless of
% environment state or reward signal. A dispathcable load is used instead of a
% constant load to allow a price to be set. Generator $k$ is given sufficient
% capacity to supply the demand of the dispatchable load, $p_{g,k}^{max} >
% -p_{g,l}^{min}$, and the marginal of the $k$ is half that of the load $l$.
% The generator and dispatchable load attributes are given in Table X. A price
% cap for the market is set to twice the marginal cost of the $l$ at full
% capacity, $p_{g,l}^{min}$. The DC optimal power flow formulation (See Section
% \ref{sec:opf}, above) is used to clear the market and reactive power trade is
% omitted.
Learning methods are compared in this chapter by utilising the same base
simulation problem and switching the algorithms used by the agents. An
alternative might be to use a combination of methods in the same simulation, but
the approach used here is intended to be an extension of the work by
\citeA{krause:nash06}.
Each simulation uses a six bus electric power system model adapted from
\citeA[pp.~104, 112, 119, 123-124, 549]{wood:pgoc}. The model provides a simple
environment for electricity trade with a small number of generators and branch
flow constraints that slightly increase the complexity of the Nash equilibria.
The buses are connected by eleven transmission lines at 230kV. The model
contains three generating units with a total capacity of 440MW and loads at
three locations, each 70MW in size. The connectivity of the branches and the
locations of the generators and loads is shown in Figure~\ref{fig:case6ww}. Data
for the power system model was taken from a case provided with \textsc{Matpower}
and is listed in Appendix \ref{adx:case6ww}.
Two sets of quadratic generator operating cost functions, of the form
$c(p_i)=ap_i^2+bp_i+c$ where $p_i$ is the output of generator $i$, are defined
in order to create two different equilibria for investigation. The coefficients
$a$, $b$ and $c$ for cost configuration 1 are listed in Table
\ref{tbl:case6ww_gencost1}. This configuration defines two low-cost generators
that cannot offer a price greater than the marginal cost of the most expensive
generator, even when they apply the maximum possible markup. The set of coefficients
for cost configuration 2 is listed in Table \ref{tbl:case6ww_gencost2}. This
configuration narrows the cost differences such that offer prices may overlap
and may exceed the marginal cost of the most expensive generator.
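The quadratic cost functions can be evaluated directly; the following is a
minimal sketch using the configuration~1 coefficients from
Table~\ref{tbl:case6ww_gencost1} (the function and dictionary names are
illustrative, not part of the simulation code):

```python
# Quadratic generator operating cost: c(p) = a*p^2 + b*p + c
# Coefficients (a, b, c) per generator for cost configuration 1.
COSTS_CONFIG1 = {1: (0.0, 4.0, 200.0),
                 2: (0.0, 3.0, 200.0),
                 3: (0.0, 6.0, 200.0)}

def cost(gen, p, coeffs=COSTS_CONFIG1):
    """Operating cost of generator `gen` at output p (MW)."""
    a, b, c = coeffs[gen]
    return a * p**2 + b * p + c

def marginal_cost(gen, p, coeffs=COSTS_CONFIG1):
    """Derivative dc/dp = 2*a*p + b; constant whenever a = 0."""
    a, b, _ = coeffs[gen]
    return 2 * a * p + b
```

With $a=0$ for all units in configuration~1, marginal costs are constant, which
is what keeps the offer stacks, and hence the equilibria, simple.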
% To strengthen the penalty for not being dispatched and shutdown cost
% $C_{down}$ of \$100 is specified in this configuration.
\ifthenelse{\boolean{includefigures}}{\input{tikz/case6ww}}{}
\begin{table}
\caption{Generator cost configuration 1.}
\label{tbl:case6ww_gencost1}
\begin{center}
\begin{tabular}{c|c|c|c}
\hline
Gen &$a$ &$b$ &$c$ \\
\hline\hline
1 &0.0 &4.0 &200.0 \\
2 &0.0 &3.0 &200.0 \\
3 &0.0 &6.0 &200.0 \\
\hline
\end{tabular}
\end{center}
\end{table}
\begin{table}
\caption{Generator cost configuration 2.}
\label{tbl:case6ww_gencost2}
\begin{center}
\begin{tabular}{c|c|c|c}
\hline
Gen &$a$ &$b$ &$c$ \\
\hline\hline
1 &0.0 &5.1 &200.0 \\
2 &0.0 &4.5 &200.0 \\
3 &0.0 &6.0 &200.0 \\
\hline
\end{tabular}
\end{center}
\end{table}
As in \citeA{krause:nash06}, no specific load profile is defined. The system
load is assumed to be at peak in all periods and only one state is defined for
methods using look-up tables. Each simulation step is assumed to be one hour in
length.
For all generators $P^{min}=0$ so as to simplify the equilibria and avoid the
need for the unit de-commitment algorithm. The maximum capacity for the most
expensive generator $P^{max}_3=220$MW such that it may supply all of the load if
dispatched, subject to branch flow limits. This generator is associated with a
passive agent that always offers full capacity at marginal cost. For the less
expensive generators $P^{max}_1=P^{max}_2=110$MW. These two generators are each
associated with an active learning agent whose activity in the market is
restricted to one offer of maximum capacity in each period, at a price
representing a markup of between 0 and 30\% on marginal cost. Methods
restricted to discrete actions may apply markups in steps of 10\%, giving
possible markup actions of 0\%, 10\%, 20\% and 30\%. No capacity withholding is
implemented.
% The market price cap is set such that it is never reached by any markup and
% does not complicate the experiment.
Discriminatory pricing (pay-as-bid) is used in order to provide a clearer reward
signal to agents with low cost generators.
The algorithms compared are: Q-learning, ENAC, REINFORCE and the modified
Roth-Erev technique (See Section \ref{sec:rl}). Default algorithm parameter
values from PyBrain are used and no attempt is made to study parameter
sensitivity or variations in function approximator design.
For the Q-learning algorithm $\alpha=0.3$, $\gamma=0.99$ and $\epsilon$-greedy
action selection is used with $\epsilon=0.9$ and $d=0.98$. For the Roth-Erev
technique $\epsilon=0.55$, $\phi=0.3$ and Boltzmann action selection is used
with $\tau=100$ and $d=0.99$.
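The Q-learning settings above can be sketched as follows. This is a minimal
single-state illustration of the update rule and $\epsilon$-greedy selection
using the stated parameter values; it is not PyBrain's implementation, and the
function names are illustrative:

```python
import random

ALPHA, GAMMA = 0.3, 0.99       # learning rate and discount, as in the text
EPSILON, DECAY = 0.9, 0.98     # epsilon-greedy exploration with decay d

ACTIONS = [0.0, 0.1, 0.2, 0.3]  # discrete markup actions

def select_action(q, epsilon):
    """Epsilon-greedy: explore with probability epsilon, else exploit."""
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda i: q[i])

def q_update(q, action, reward, alpha=ALPHA, gamma=GAMMA):
    """Q-learning update; with only one state defined, the bootstrap
    term max_a Q(s', a) is taken over the same table."""
    q[action] += alpha * (reward + gamma * max(q) - q[action])
```

After each period, $\epsilon$ would be multiplied by the decay factor so that
exploration gradually gives way to exploitation of the learned values.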
Both REINFORCE and ENAC use a two-layer neural network with one linear input
node, one linear output node, no bias nodes and with the connection weight
initialised to zero. This is a typical PyBrain configuration taken from the
supplied examples \cite{schaul:2010}. A two-step episode is defined for the
policy gradient methods and five episodes are performed per learning step. The
exploration parameter $\sigma$ for these methods is initialised to zero and
adjusted manually after each episode such that:
\begin{equation}
\label{eq:sigmadecay}
\sigma_{t} = d(\sigma_{t-1}-\sigma_{n})+\sigma_{n}
\end{equation}
where $d=0.998$ is a decay parameter and $\sigma_{n}=-0.5$ specifies the
value that is converged to asymptotically. Initially, the learning rate
$\gamma=0.01$ for the policy gradient methods, apart from for ENAC under cost
configuration 2 where $\gamma=0.005$. To illustrate the effect of altering
the learning rate, simulations under cost configuration 1 are repeated with
$\gamma=0.05$ and $\gamma=0.005$. Both active agents use the same parameter
values in each simulation.
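The schedule in Equation~\ref{eq:sigmadecay} decays $\sigma$ geometrically from
its initial value towards $\sigma_{n}$; a minimal sketch, with the parameter
values stated above:

```python
def decay_sigma(sigma, d=0.998, sigma_n=-0.5):
    """One step of the exploration-parameter schedule:
    sigma_t = d*(sigma_{t-1} - sigma_n) + sigma_n.
    The gap to sigma_n shrinks by a factor d each episode, so sigma
    converges to sigma_n asymptotically."""
    return d * (sigma - sigma_n) + sigma_n

sigma = 0.0  # initial value, as in the text
for _ in range(5):
    sigma = decay_sigma(sigma)
```

After $t$ episodes the gap to $\sigma_{n}$ is $d^{t}(\sigma_{0}-\sigma_{n})$,
so with $d=0.998$ roughly 1150 episodes are needed to halve it.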
As in \citeA{krause:nash06}, the point of Nash equilibrium is established by
computing each agent's reward for all possible combinations of discrete markup.
The rewards for Agent 1 and Agent 2 under cost configuration 1 are given in
Table \ref{tbl:nash1}. The Nash equilibrium points are marked with a *. The
table shows that the optimal policy for each agent is to apply the maximum
markup to each offer as their generators are always dispatched. The rewards
under cost configuration~2 are given in Table \ref{tbl:nash2}. This table shows
that the optimal point occurs when Agent 2 applies its maximum markup and Agent
1 offers a price just below the marginal cost of the passive agent's generator.
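The exhaustive search for equilibrium points can be sketched as follows. A pure
strategy profile is a Nash equilibrium if neither agent can raise its own
reward by unilaterally switching markup. The linear reward expressions used to
populate the payoff table here are reconstructed from Table~\ref{tbl:nash1}
(configuration~1, where each agent's reward depends only on its own markup) and
are illustrative only:

```python
# Pure-strategy Nash equilibria by exhaustive search over a payoff table.
# rewards[(m1, m2)] = (r1, r2): rewards to Agent 1 and Agent 2 for markups
# m1 and m2. Under cost configuration 1, r1 and r2 rise linearly with the
# agent's own markup (40 and 33 per 10% step, per Table tbl:nash1).
MARKUPS = [0.0, 0.1, 0.2, 0.3]
rewards = {(m1, m2): (400.0 * m1, 330.0 * m2)
           for m1 in MARKUPS for m2 in MARKUPS}

def nash_equilibria(rewards, actions):
    """Return profiles from which no unilateral deviation is profitable."""
    eq = []
    for (m1, m2), (r1, r2) in rewards.items():
        if all(rewards[(a, m2)][0] <= r1 for a in actions) and \
           all(rewards[(m1, a)][1] <= r2 for a in actions):
            eq.append((m1, m2))
    return eq
```

For configuration~1 this search recovers the single equilibrium at maximum
markup for both agents, matching the starred entry in Table~\ref{tbl:nash1}.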
\begin{table}
\caption{Agent rewards under cost configuration~1}
\label{tbl:nash1}
\begin{center}
\begin{small}
\begin{tabular}{c.{2.2}|.{2.1}.{2.1}|.{2.1}.{2.1}|.{2.1}.{2.1}|.{3.1}.{2.1}|}
\cline{3-10}
& &\multicolumn{8}{c|}{$G_1$} \\
\cline{3-10}
& &\multicolumn{2}{c|}{0.0\%} &\multicolumn{2}{c|}{10.0\%}
&\multicolumn{2}{c|}{20.0\%} &\multicolumn{2}{c|}{30.0\%} \\
& &r_1 &r_2 &r_1 &r_2 &r_1 &r_2 &r_1 &r_2 \\
\hline
\multicolumn{1}{|c|}{\multirow{4}{*}{$G_2$}}
&0.0\% &0.0 &0.0 &40.0 &0.0 &80.0 &0.0 &120.0 &0.0 \\
\multicolumn{1}{|c|}{}
&10.0\% &0.0 &33.0 &40.0 &33.0 &80.0 &33.0 &120.0 &33.0 \\
\multicolumn{1}{|c|}{}
&20.0\% &0.0 &66.0 &40.0 &66.0 &80.0 &66.0 &120.0 &66.0 \\
\multicolumn{1}{|c|}{}
&30.0\% &0.0 &99.0 &40.0 &99.0 &80.0 &99.0 &120.0^* &99.0^* \\
\hline
\end{tabular}
\end{small}
\end{center}
\end{table}
\begin{table}
\caption{Agent rewards under cost configuration~2}
\label{tbl:nash2}
\begin{center}
\begin{small}
\begin{tabular}{c.{2.2}|.{2.1}.{3.1}|.{2.1}.{3.1}|.{2.1}.{3.1}|.{2.1}.{3.1}|}
\cline{3-10}
& &\multicolumn{8}{c|}{$G_1$} \\
\cline{3-10}
& &\multicolumn{2}{c|}{0.0\%} &\multicolumn{2}{c|}{10.0\%}
&\multicolumn{2}{c|}{20.0\%} &\multicolumn{2}{c|}{30.0\%} \\
& &r_1 &r_2 &r_1 &r_2 &r_1 &r_2 &r_1 &r_2 \\
\hline
\multicolumn{1}{|c|}{\multirow{4}{*}{$G_2$}}
&0.0\% &0.0 &0.0 &51.0 &0.0 &0.0 &0.0 &0.0 &0.0 \\
\multicolumn{1}{|c|}{}
&10.0\% &0.0 &49.5 &51.0 &49.5 &0.0 &49.5 &0.0 &49.5 \\
\multicolumn{1}{|c|}{}
&20.0\% &0.0 &92.2 &51.0 &99.0 &0.0 &99.0 &0.0 &99.0 \\
\multicolumn{1}{|c|}{}
&30.0\% &0.0 &126.8 &54.8^* &138.4^* &0.0 &148.5 &0.0 &148.5 \\
\hline
\end{tabular}
\end{small}
\end{center}
\end{table}
\section{Simulation Results}
Each action taken by an agent and the consequent reward is recorded for each
simulation. Values are averaged over 10 simulation runs and standard deviations
are calculated using the formula
\begin{equation}
SD = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}
\end{equation}
where $x_i$ is the action or reward value in simulation $i$ of $N$ simulation
runs and $\bar{x}$ is the mean of the values.
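For reference, the sample standard deviation above (with Bessel's $N-1$
correction) can be computed as in the following sketch:

```python
from math import sqrt

def sample_sd(values):
    """Sample standard deviation with Bessel's correction (N - 1),
    matching the SD formula above."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
```

Dividing by $N-1$ rather than $N$ gives an unbiased estimate of the variance
from the 10 simulation runs.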
Figure \ref{fig:5_1_action_a1} shows the average markup on marginal cost and the
standard deviation over 10 simulation runs for Agent 1 under cost configuration
1, using the four learning methods. The second $y$-axis in each plot relates to
the exploration parameter for each method. Figure \ref{fig:5_1_action_a2} shows
the same information for Agent 2. Plots of reward are not given as generator
prices and the market are configured such that an agent's reward is directly
proportional to its action. The plots are vertically aligned and have equal
$x$-axis limits to assist algorithm comparison.
Figure \ref{fig:5_3_action_a2} shows the average markup on marginal cost and the
standard deviation over 10 simulation runs for Agent 2 operating policy gradient
methods under cost configuration 1 with an increased learning rate of 0.05.
Figure \ref{fig:5_4_action_a2} shows the same information, but with a reduced
learning rate of 0.005.
\ifthenelse{\boolean{includefigures}}{
\begin{figure}
\centering
\includegraphics{figures/fig5_1_action_a1}
\caption{Average markup and standard deviation for Agent 1
under cost configuration 1.}
\label{fig:5_1_action_a1}
\end{figure}
\begin{figure}
\centering
\includegraphics{figures/fig5_1_action_a2}
\caption{Average markup and standard deviation for Agent 2
under cost configuration 1.}
\label{fig:5_1_action_a2}
\end{figure}
\begin{figure}
\centering
\includegraphics{figures/fig5_3_action_a2}
\caption{Average markup and standard deviation for Agent 2
under cost configuration 1 with higher policy gradient learning rate.}
\label{fig:5_3_action_a2}
\end{figure}
\begin{figure}
\centering
\includegraphics{figures/fig5_4_action_a2}
\caption{Average markup and standard deviation for Agent 2
under cost configuration 1 with lower policy gradient learning rate.}
\label{fig:5_4_action_a2}
\end{figure}
}{}
Figures \ref{fig:5_2_action_a1} and \ref{fig:5_2_reward_a1} plot the average
markup and reward, respectively, over 10 simulation runs for Agent 1 under cost
configuration 2 for the variant Roth-Erev and Q-learning methods. The plots for
REINFORCE and ENAC in these figures show actual values from one simulation run,
as the number of interactions and the variation in values make the results
difficult to observe otherwise.
% Not all $x$-axis extents are equal in these two figures.
\ifthenelse{\boolean{includefigures}}{
\begin{figure}
\centering
\includegraphics{figures/fig5_2_action_a1}
\caption{Percentage markup for Agent 1 under cost configuration 2.}
\label{fig:5_2_action_a1}
\end{figure}
% \begin{figure}
% \centering
% \includegraphics{figures/fig5_2_action_a2}
% \caption{Average markup for agent 2 and standard deviation.}
% \label{fig:5_2_action_a2}
% \end{figure}
\begin{figure}
\centering
\includegraphics{figures/fig5_2_reward_a1}
\caption{Reward received by Agent 1 under cost configuration 2.}
\label{fig:5_2_reward_a1}
\end{figure}
% \begin{figure}
% \centering
% \includegraphics{figures/fig5_2_reward_a2}
% \caption{Average reward for agent 2 and standard deviation.}
% \label{fig:5_2_reward_a2}
% \end{figure}
}{}
\section{Interpretation and Discussion of Results}
Under cost configuration 1 the agents face a relatively simple control task and
receive a clear reward signal that is directly proportional to their markup. The
results show that all of the methods consistently converge to the point of Nash
equilibrium. The variant Roth-Erev method shows very little variation around
the mean once converged, due to the use of Boltzmann exploration with a low
temperature parameter value. The constant variation around the mean that can be
seen for Q-learning once converged is due to the use of $\epsilon$-greedy action
selection and can be removed if a Boltzmann explorer is selected.
Empirical studies have also shown that the speed of convergence is largely
determined by the rate at which the exploration parameter value is reduced.
However, the episodic nature of the policy gradient methods requires them to
make several interactions per learning step, so a larger number of initial
exploration steps is needed. Figures
\ref{fig:5_1_action_a2}, \ref{fig:5_3_action_a2} and \ref{fig:5_4_action_a2}
illustrate the effect on policy gradient methods of changes in learning rate.
Higher values cause larger changes to be made to the policy parameters at each
step. Increasing the learning rate had a positive effect here, but high values
can cause the algorithms to behave sporadically as the adjustments made are too
great. Conversely, low values cause the algorithms to learn very slowly.
Cost configuration 2 provides a more challenging control problem in which
Agent~1 must learn to undercut the passive agent. The results show that the
variant Roth-Erev and Q-learning methods both consistently learn their optimal
policy and converge to the Nash equilibrium. However, there is space for
Agent~1 to mark up its offer by slightly more than 10\% and still undercut the
passive agent, but the methods with discrete actions are not able to exploit
this and do not receive the small additional profit.
The results for the policy gradient methods under cost configuration 2 show that
they learn to reduce their markup if their offer price starts to exceed that of
the passive agent and the reward signal drops. However, a chattering effect
below the Nash equilibrium point can be clearly seen for ENAC and the method
does not learn to always undercut the other agent. These methods require a much
larger number of simulation steps and for the exploration parameter to decay
slowly if they are to produce this behaviour. This is due to the need for a
lower learning rate that ensures fine policy adjustments can be made and for
several interactions to be performed between each learning step.
% When using REINFORCE or ENAC, Agent~2 tends also to learn to maximise its
% markup, but less consistently. Agent~1 typically learns to undercut the
% passive agent, but does not converge to a consistent value. The problem is
% similar to the cliff-edge walking problems often used as benchmarks in
% reinforcement learning research and may be difficult to approximate a policy
% for using a small number of $\tanh$ functions. It may be possible to improve
% the performance of these agents through more educated policy function
% approximator design. but these methods are not really intended for operation
% in such simple environments.
\section{Summary}
By observing the state to which a multi-learning-agent system converges, it is
possible to verify that learning algorithms produce the same Nash equilibrium
that closed-form simulations provide. The results presented in this chapter
closely correspond with those from \citeA{krause:nash06} for Q-learning and show
equivalent behaviour for the variant Roth-Erev method. The simulations
illustrate how challenging unsupervised learning is in a continuous environment,
even for a simple problem. Tasks in which a large reward change can occur for a
very small change in policy prove difficult for policy gradient methods to learn
and require low learning rates and lengthy periods of exploration. The
operation of policy gradient methods with noisy, multi-dimensional state data is
not examined in this chapter and deserves investigation.
% This experiment confirms the convergence to a Nash equilibrium of the
% Q-learning methods that is published in \citeA{krause:nash06} and, to a
% degree, extends the conclusion to policy gradient methods. The results show
% that while these methods do converge to the same or similar policies as the
% Q-learning and Roth-Erev methods, they do not exhibit the same level of
% consistency. Value function based methods or the Roth-Erev method may be the
% most suitable choice of algorithm in the simple electricity market simulations
% typically found in the literature.
% The simulations conducted here do not exploit any of the abilities of policy
% gradient methods to utilise multi-dimensional continuous state information and
% their behaviour in more complex electricity market environments deserves
% investigation.
%\section{Summary}
% \chapter{Competitive Power Trade}
% Having compared the learning methods in a one-player context, this section
% describes the method used to pit them against one and other and compare their
% performance.
%
% \section{Aims \& Objectives}
% Competition is fundamental to markets and this experiment aims to compare
% learning methods in a complex dynamic market environment with multiple
% competing participants. The objective is to compare:
% \begin{itemize}
% \item Performance, in terms of profitability, over a finite number of
% periods,
% \item Profitability when trading both active and reactive power.
% \item Consistency of profit making and,
% \item Sensitivity to algorithm parameter changes.
% \end{itemize}
% \section{Method of Simulation} Figure X illustrates the structure of the six
% bus power system model, from \cite{wood:pgoc}, with three generators and fixed
% demand at three of the buses used to provide a dynamic environment with
% typical system values. Bus, branch and generator attribute values are stated
% in Tables X, Y, Z, respectively. Three learning methods are compared in six
% simulations encapsulating all method--generator combinations.
% A price cap $c_{cap}$ of twice the marginal cost of the most expensive
% generator at full capacity is set by the market. The simulations are repeated
% for with agents actions composing both price and quantity and with just price.
% For the value-function methods, the state is defined by the market clearing
% price from the previous period, divided equally into $x_s$ discrete states
% between $0$ and $c_{cap}$. The state vector $s_t$ for the policy gradient
% methods consists of the market clearing price and generator set-point from the
% previous period. \begin{equation} s_t = \begin{bmatrix}
% c_{mcp}\\
% p_g \end{bmatrix} \end{equation} The script used to conduct the simulation is
% provided in Listing X.
\chapter{System Constraint Exploitation}
\label{ch:exploitation}
% One of the main features of agents using policy gradient learning methods and
% artifical neural networks for policy function approximation is their ability
% to accept many signals of continuous sensor data. This section describes an
% experiment in which the power system is severly constrained for certain
% periods, resulting in elevated nodal marginal prices in particular areas. The
% methods are tested in their ability to exploit these constraints and improve
% their total accumulated reward.
This chapter explores the exploitation of constraints by learning agents in a
dynamic electricity trading environment. Value function based and policy
gradient reinforcement learning methods are compared using a modified version of
the IEEE Reliability Test System.
\section{Introduction}
Having examined the basic learning characteristics of four algorithms in
Chapter~\ref{ch:nashanalysis}, this chapter extends the approach to examine
their operation in a complex dynamic environment. It explores the ability of
policy gradient methods to operate with multi-dimensional, continuous state and
action data in the context of \textit{learning to trade power}.
A reference electric power system model from the IEEE Reliability Test System
(RTS) \cite{ieee79rts} provides a realistic environment in which agents with
diverse portfolios of generating plant compete to supply dynamic loads. System
constraints change as agents adjust their strategies, and loads follow an hourly
profile whose shape varies from day to day over the course of a simulated
year. By observing average profits at each hour of the day, the ability of
methods to successfully observe and exploit constraints is examined.
\section{Aims and Objectives}
This experiment aims to compare policy gradient and traditional reinforcement
learning methods in a dynamic electricity trading environment. Specifically, the
objectives are to determine:
\begin{itemize}
\item If the policy gradient methods can achieve greater profitability
under dynamic system constraints using a detailed state vector.
% \item If policy gradient methods can exploit outages and the resulting
% system constraints to further increased profit.
\item The value of using an AC optimal power flow formulation in agent based
electricity market simulation.
\end{itemize}
Meeting the first objective aims to demonstrate some of the value of using
policy gradient methods in electricity market participant modelling and to
determine if they warrant further research in this domain.
\section{Method of Simulation}
Learning methods are again compared by repeating simulations of competitive
electricity trade switching the algorithms used by the competing agents. Some
simplification of the state and action representations for value function based
methods is required, but generation portfolios and load profiles are identical
for each algorithm test.
The RTS has 24 bus locations that are connected by 32 transmission lines, 4
transformers and 2 underground cables. The transformers tie a 230kV area to a
138kV area. The original model has 32 generators of 9 different types with a
total capacity of 3.45GW. To reduce the size of the discrete action space, five
12MW and four 20MW generators are removed. This is deemed a minor change:
while it reduces the number of generators by 28\%, their combined capacity is
only 4.1\% of the original total generation capacity and the remainder is more
than sufficient to meet demand. To further reduce action space sizes all
generators of the same type at the same bus are aggregated into one generating
unit. This can be considered to be the representation of each individual power
station in the market, rather than each alternator stage. The model has loads
at 17 locations and the total demand at system peak is 2.85GW.
Again, generator marginal costs are quadratic functions of output and are
defined by the parameters in Table \ref{tbl:ieee_rts_gencosts}. Figure
\ref{fig:ieee_rts_gencost_plot} shows the cost functions for each of the seven
types of generator and illustrates their categorisation by fuel type. Generator
cost function coefficients were taken from an RTS model by Georgia Tech Power
Systems Control and Automation
Laboratory\footnote{http://pscal.ece.gatech.edu/testsys/} which assumes Coal
costs of 1.5~\$/MBtu\footnote{1 Btu (British thermal unit) $\approx$ 1055
Joules}, Oil costs of 5.5~\$/MBtu and Uranium costs of 0.46~\$/MBtu. Data for
the modified model is provided in Appendix \ref{adx:ieee_rts} and the
connectivity of branches and the location of generators and loads is illustrated
in Figure \ref{fig:ieee79rts}.
\begin{table}
\caption{Generator types and cost parameters for the simplified IEEE
Reliability Test System.}
\begin{center}
\label{tbl:ieee_rts_gencosts}
\begin{tabular}{c|.{1.5}|.{2.4}|.{3.3}|c}
\hline
Code &\multicolumn{1}{c}{$a$} &\multicolumn{1}{|c|}{$b$}
&\multicolumn{1}{c|}{$c$} &Type \\
\hline\hline
% U12 &0.0 &0.32841 &56.564 &86.385 &Oil \\
% U20 &0.0 &0.0 &130.0 &400.685 &Oil \\
U50 &0.0 &0.0010 &0.001 &Hydro \\
U76 &0.01414 &16.0811 &212.308 &Coal \\
U100 &0.05267 &43.6615 &781.521 &Oil \\
U155 &0.00834 &12.3883 &382.239 &Coal \\
U197 &0.00717 &48.5804 &832.758 &Oil \\
U350 &0.00490 &11.8495 &665.109 &Coal \\
U400 &0.00021 &4.4231 &395.375 &Nuclear \\
\hline
\end{tabular}
\end{center}
\end{table}
\ifthenelse{\boolean{includefigures}}{
\begin{figure}
\centering
\includegraphics{figures/ieee_rts_gencosts}
\caption{Generator cost functions for the IEEE Reliability Test System}
\label{fig:ieee_rts_gencost_plot}
\end{figure}
\begin{figure}
\centering
\includegraphics{figures/ieee_rts_profiles}
\caption{Hourly, daily and weekly load profile plots from the IEEE
Reliability Test System}
\label{fig:ieee_rts_profiles}
\end{figure}
}{}
\begin{table}
\caption{Agent portfolios.}
\label{tbl:agent_portfolios}
\begin{center}
\begin{tabular}{c|c|c|c|c|c|c|c|c}
\hline
\multirow{2}{*}{Agent} &U50 &U76 &U100 &U155 &U197 &U350 &U400 &Total \\
&Hydro &Coal &Oil &Coal &Oil &Coal &Nuclear &(MW) \\
\hline\hline
1 & &2$\times$ & &1$\times$ & & &1$\times$ &707 \\
2 & &2$\times$ & &1$\times$ & & &1$\times$ &707 \\
3 &6$\times$ & & & &3$\times$ & & &891 \\
4 & & &3$\times$ &2$\times$ & &1$\times$ & &960 \\
\hline
\end{tabular}
\end{center}
\end{table}
The generating stock is divided into four portfolios (See Table
\ref{tbl:agent_portfolios}), each of which is endowed to a learning agent.
Portfolios were chosen such that each agent has: a mix of base load and peaking
plant, approximately the same total generation capacity and generators in
different areas of the network. The generator labels in Figure
\ref{fig:ieee79rts} specify the associated agent. The synchronous condenser is
associated with a passive agent that always offers 0 MW at 0~\$/MWh (the unit
can be dispatched to provide or absorb reactive power within capacity limits).
\ifthenelse{\boolean{includefigures}}{\input{tikz/ieee79rts}}{}
Markups on marginal cost are restricted to a maximum of 30\%, and discrete
markups of 0, 15\% and 30\% are defined for the value function based methods.
Up to 20\% of the total capacity of each generator can be withheld, and
discrete withholding levels of 0 and 20\% are defined. Initially only one
offer per generator is required, but this is later increased to two in order
to explore the effect of increased offer flexibility.
The environment state for all algorithm tests consists of a forecast of the
total system demand for the next period. The system demand follows an hourly
profile that is adjusted according to the day of the week and the time of year.
The profiles are provided by the RTS and are illustrated in Figure
\ref{fig:ieee_rts_profiles}. For tests of value function based methods and the
Stateful Roth-Erev learning algorithm, the continuous state is divided into 3
discrete states of equal size, allowing differentiation between low, medium
and peak demand.
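The division of the continuous demand state into three equally sized discrete
states might be implemented along the following lines; the bounds and names
are illustrative assumptions.

```python
def discretise_demand(forecast_mw, min_mw, max_mw, n_states=3):
    """Map a continuous demand forecast onto one of n_states equally
    sized intervals (0 = low, 1 = medium, 2 = peak for n_states=3)."""
    width = (max_mw - min_mw) / n_states
    index = int((forecast_mw - min_mw) / width)
    # Clamp so that forecast_mw == max_mw falls in the top state.
    return min(max(index, 0), n_states - 1)
```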
To investigate the exploration of constraints, AC optimal power flow is used and
the state vector for agents using policy gradient methods is optionally adjusted
to combine the demand forecast with voltage constraint Lagrangian multipliers
for all generator buses and the voltage magnitude at all other buses. Lagrangian
multipliers are used for generator buses as generators typically fix the voltage
at their associated bus. Branch flows are not included in the state vector as
flow limits in the RTS are high and are typically not reached at peak demand.
Generator capacity limits are binding in most states of the RTS, but the output
of other generators is deemed to be hidden from agents.
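Assembling the combined state vector might look like the following sketch; the
use of NumPy and the variable names are assumptions for illustration, not the
thesis implementation.

```python
import numpy as np

def build_state(demand_forecast_mw, v_lambda_gen_buses, v_mag_other_buses):
    """Concatenate the demand forecast with the voltage constraint
    Lagrangian multipliers at generator buses (where generators fix the
    voltage) and the voltage magnitudes at all remaining buses."""
    return np.concatenate((
        [demand_forecast_mw],
        v_lambda_gen_buses,   # shadow prices of voltage limits
        v_mag_other_buses,    # per-unit voltage magnitudes
    ))
```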
The nodal marginal pricing scheme is used and cleared offer prices are
determined by the Lagrangian multiplier on the power balance constraint for the
bus at which the generator associated with the offer is connected.
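Settlement of a cleared offer under nodal marginal pricing can be sketched as
follows, assuming the quadratic cost form $aP^2 + bP + c$ implied by Table
\ref{tbl:ieee_rts_gencosts}; the function name is hypothetical.

```python
def settle_offer(cleared_mw, bus_lmp, a, b, c):
    """Earnings for one cleared offer: revenue at the nodal price of the
    generator's bus, less the quadratic generation cost a*P^2 + b*P + c."""
    revenue = cleared_mw * bus_lmp
    cost = a * cleared_mw ** 2 + b * cleared_mw + c
    return revenue - cost
```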
Typical parameter values are used for each of the algorithms; again, no
attempt is made to study parameter sensitivity. Learning rates are set low and
exploration parameters decay slowly due to the length and complexity of the
simulation. For Q-learning $\alpha=0.2$, $\gamma=0.99$ and $\epsilon$-greedy
action selection is used with $\epsilon=0.9$ and $d=0.999$. For Roth-Erev
learning $\epsilon=0.55$, $\phi=0.3$ and Boltzmann action selection is used with
$\tau=100$ and $d=0.999$.
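The two action selection rules with their decaying exploration parameters can
be sketched as follows; this is a minimal illustration under the parameter
values above, not the PyBrain implementation.

```python
import math
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore uniformly, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann_probs(propensities, tau):
    """Boltzmann (softmax) selection probabilities at temperature tau."""
    exps = [math.exp(p / tau) for p in propensities]
    total = sum(exps)
    return [e / total for e in exps]

# Both epsilon and tau are scaled by the decay rate d after each
# episode, e.g. epsilon *= 0.999.
```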
Again, a typical PyBrain two-layer neural network configuration with linear
input and output nodes, no bias nodes and randomised initial connection
weights is used for policy function approximation. The initial exploration rate $\sigma=0$
for both policy gradient methods and decays according to Equation
(\ref{eq:sigmadecay}) with $d=0.995$ and $\sigma_n=-0.5$. Constant learning
rates are used in each simulation with $\gamma=0.01$ for REINFORCE and
$\gamma=0.005$ for ENAC.
\section{Simulation Results}
Each agent's rewards are recorded for a simulated year of 364 trading episodes,
each consisting of 24 interactions. To compare algorithms, the average reward
for each hour of the day is calculated for each agent and plotted. Only results
for Agent~1 and Agent~4 are given, as Agent~1 and Agent~2 have identical
portfolios and most of Agent~3's portfolio consists of hydro-electric plant
with zero cost. The method of applying percentage markups on marginal cost has
no effect for generators with zero cost, so almost identical results are found
for all algorithms.
\ifthenelse{\boolean{includefigures}}{
\begin{figure}
\centering
\includegraphics{figures/fig6_1}
\caption{Average rewards for Agent 1 and Agent 4, comparing the modified
and Stateful Roth-Erev methods.}
\label{fig:6_1}
\end{figure}
}{}
Figure \ref{fig:6_1} compares the modified Roth-Erev method with the Stateful
Roth-Erev method. The plots show average rewards for Agent~1 and Agent~4 when
using Q-learning and the two Roth-Erev variants.
\ifthenelse{\boolean{includefigures}}{
\begin{figure}
\centering
\includegraphics{figures/fig6_2_a1}
\caption{Average rewards for Agent 1 under two state configurations.}
\label{fig:6_2_a1}
\end{figure}
\begin{figure}
\centering
\includegraphics{figures/fig6_2_a4}
\caption{Average rewards for Agent 4 under two state configurations.}
\label{fig:6_2_a4}
\end{figure}
}{}
Figure \ref{fig:6_2_a1} and Figure \ref{fig:6_2_a4} compare policy gradient
methods under two state vector configurations. Figure \ref{fig:6_2_a1} concerns
Agent~1 and shows the average reward received for a state vector consisting
solely of a demand forecast and for a combined demand forecast and bus voltage
profile state vector. Figure \ref{fig:6_2_a4} shows average rewards for Agent~4
under the same configurations.
\ifthenelse{\boolean{includefigures}}{
\begin{figure}
\centering
\includegraphics{figures/fig6_3}
\caption{Average rewards for Agent 1 and Agent 4 with two offers submitted
per generator.}
\label{fig:6_3}
\end{figure}
}{}
Figure \ref{fig:6_3} shows average rewards for Agent~1 and Agent~4 from a
repeat of the bus voltage profile state simulation, but with two offers
required per generator. Due to time constraints and limited simulation
resources, only results for Q-learning and ENAC are given.
\section{Interpretation and Discussion of Results}
\label{sec:discuss}
Agents with a discrete environment have 216 possible actions to choose from in
each state when required to submit one offer per generator. Figure
\ref{fig:6_1} shows that, using Q-learning, agents are able to learn an
effective policy that yields increased profits under two different portfolios.
The importance of utilising environment state data in a dynamic electricity
trading setting is illustrated by the differences in average reward received
under the modified Roth-Erev method and the Stateful Roth-Erev method. The
optimal action for an agent depends upon the current system load, which the
stateless Roth-Erev formulation is unable to take into account. The Stateful
Roth-Erev method can be seen to achieve approximately the same performance as
Q-learning.
Including bus voltage constraint data in the state for a discrete environment
would result in a state space of impractical size, but including it in a
continuous environment was straightforward. Figure \ref{fig:6_2_a1} and Figure
\ref{fig:6_2_a4} show that ENAC achieves greater profits when presented with a
combined demand forecast and bus voltage state vector. REINFORCE performs less
well than ENAC, but also shows improvement over the pure demand forecast case.
ENAC achieves equivalent, but not greater, performance than Q-learning in all
periods of the trading day when using the voltage data: it is not able to use
the additional state information to any further advantage, but does learn a
profitable policy.
Simply changing the number of offers that must be submitted for each generator
from one to two increases the number of discrete action possibilities in each
state to 46,656. Figure \ref{fig:6_3} shows that Q-learning is still able to
achieve a similar level of reward as in the one offer case. The profitability
of both methods is degraded, but ENAC receives significantly lower average
reward when required to produce an action vector of twice the size, and is not
able to use the increased flexibility in its offer structure to any advantage.
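The growth of the discrete action space follows from a simple count: with
three markup levels and two withholding levels there are six alternatives per
offer, so the number of joint actions is six raised to the number of
independent offers. This arithmetic reproduces the 216 and 46,656 figures
quoted above for three and six offers respectively.

```python
def n_joint_actions(n_markups, n_withholds, n_offers):
    """Number of discrete joint actions when each offer independently
    selects one markup level and one withholding level."""
    return (n_markups * n_withholds) ** n_offers
```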
With state and action spaces on this scale, computing updates to an agent's
look-up table or neural network adds considerably to the computational expense
of the simulation. Researchers wishing to apply these methods to larger
problems must be willing to investigate program optimisation and parallel or
distributed processing. Studies not requiring this level of complexity are
seemingly best served by a value function based method, such as Q-learning or
the Stateful Roth-Erev formulation.
The lack of involvement in the market from the hydro-electric power plant
largely negates the participation of Agent~3 and exposes one significant
shortcoming of the approach. This could be overcome by allowing markups in
dollars, rather than as a percentage of marginal cost.
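The zero marginal cost issue is easy to see: a percentage markup scales the
cost, so a costless unit always offers at zero regardless of the markup
chosen, whereas an absolute (dollar) markup would not. A minimal illustrative
sketch:

```python
def offer_price_percent(marginal_cost, markup_fraction):
    """Percentage markup: scales the marginal cost."""
    return marginal_cost * (1.0 + markup_fraction)

def offer_price_dollar(marginal_cost, markup_dollars):
    """Absolute markup: adds to the marginal cost."""
    return marginal_cost + markup_dollars

# A hydro unit with zero marginal cost can never raise its offer price
# under percentage markups, but can under dollar markups.
```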
Generation portfolios were configured such that agents would receive a mix of
low-cost base-load plant and expensive peak-supply plant. However, the cost
differences between fuel types are such that an offer of power from a coal or
nuclear power station cannot exceed in price one from a unit with a more
expensive fuel type. Greater competition and more complex equilibria could be
introduced to the simulation if fuel cost differences were smaller or maximum
markups on price were greater. In \citeA{cincotti:09}, for example, a 300\%
markup limit is set.
The dynamics of this simulation could also be greatly increased by introducing
demand-side participation. This would allow agents to directly influence the
state of the environment through the demand forecast sensor. It would also
give agents more options for competition, increasing the complexity of their
optimal policy and posing a greater challenge to the learning algorithms.
\section{Summary}
In this chapter policy gradient reinforcement learning algorithms have been
applied in a complex dynamic electricity trading simulation and assessed on
their ability to exploit constraints in the system. They were found to be a
valid technique for \textit{learning to trade power}, but were outperformed by
Q-learning in most configurations of environment state and action space. This
includes a simulation with action spaces that were expected to be too large for
Q-learning to explore, but to be of no significant challenge to policy gradient
methods.
The addition of bus voltage sensor data to the state vector of agents
operating policy gradient methods was found to improve their performance.
However, no evidence was found to suggest that they could use this information
to increase their profitability above that of agents operating the Q-learning
or Stateful Roth-Erev methods. Indeed, this reflects what may be considered a
general finding in reinforcement learning research: despite great effort and
the development of many new algorithms, few surpass the original temporal
difference methods of \citeA{suttonbarto:1998}.
Shortcomings in the price markup methodology and competition levels have been
identified and possible solutions proposed. The implications of increased
computational expense for further development of this work have also been noted.
AC optimal power flow adds enormously to simulation times when analysing an
entire year of hourly trading interactions. The addition of bus voltage data
was found to improve the performance of the policy gradient methods, but it
has not been shown whether the same could be achieved using bus voltage angles
from a DC optimal power flow formulation.