MDPs is a Julia package for working with Markov decision processes (MDPs).
MDPs supports both Julia 0.3 and 0.4, and can be installed from the REPL via:
Pkg.add("MDPs")
So far only a few simple types and the value iteration algorithm have been implemented. Basic usage could look like this:
using MDPs
P, R = MDPs.Examples.random(10, 3) # a random MDP with 10 states and 3 actions
mdp = MDP(P, R)
Q = value_iteration(mdp, 0.9) # value iteration with a discount factor of 0.9
value(Q) # the optimal value vector
policy(Q) # the optimal policy vector
The documentation lives here for now, until it becomes more complete. For more information, please check the docstrings and the source code.
There are four categories of types:
- Transition probability
- Reward
- Q function
- MDP
In the following discussion, let S be the number of states and A be the number of actions.
A transition probability is the probability that the system moves from state s to state t given that action a was taken. So there must be some function P that takes input arguments s, t and a, and maps them to an output p, where p is the probability that s transitions to t when a is performed.
The transition probability types are used to abstract this functionality.
The base type is AbstractTransitionProbability, and types that implement transition probability functionality should be subtypes of it. Any subtype of AbstractTransitionProbability is expected to implement a method called probability. The calling signature of this function looks like:
probability(P, s, t, a)
where P is a subtype of AbstractTransitionProbability, s is the starting state, t is the next state, a is the action, and it returns a Real between 0 and 1 inclusive. Any algorithm should be designed to accept this abstract type and interact with it using the probability function.
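For instance, here is a minimal sketch of how an algorithm can stay generic over any transition probability type by only ever calling probability (the helper name expected_next_value and the explicit n_states argument are illustrative, not part of the package):
# Expected value of the next state, given the current state s and action a.
# Works for dense, sparse, or function-backed transition probabilities alike.
function expected_next_value(P, V::Vector, s, a, n_states)
    total = zero(eltype(V))
    for t = 1:n_states
        total += probability(P, s, t, a) * V[t]
    end
    return total
end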
This abstract type has two subtypes: AbstractTransitionProbabilityArray and MDPs.TransitionProbabilityFunction. AbstractTransitionProbabilityArray is the base type for transition probability types that store data as a Julia AbstractArray subtype. It has two concrete subtypes, TransitionProbabilityArray and SparseTransitionProbabilityArray. These types are constructed by passing the appropriate Julia array type: either an Array{T<:Real,3} or a Vector{SparseMatrixCSC{Tv<:Real,Ti<:Integer}} respectively.
These types all have a convenience constructor TransitionProbability that will return the correct type based on the type of its argument. For example, the following defines transition probabilities for five states and two actions, where states always transition back to themselves:
P_dense = TransitionProbability(cat(3, eye(5), eye(5)))
P_sparse = TransitionProbability(SparseMatrixCSC{Float64,Int}[speye(5) for _ = 1:2])
probability(P_dense, 1, 1, 2) == probability(P_sparse, 1, 1, 2) == 1
probability(P_dense, 1, 2, 2) == probability(P_sparse, 1, 2, 2) == 0
The constructor for FunctionTransitionProbability takes a Julia Function and the number of states and actions. The function should accept three arguments, s, t and a, and return a Real. For example:
P = TransitionProbability((s, t, a) -> s == t ? 1 : 0, 5, 2)
probability(P, 1, 1, 2) == 1
probability(P, 1, 2, 2) == 0
The benefit of using a function could be to define very large state or action spaces. As an exaggerated example:
s = BigInt(string(typemax(Int))^2) # 92233720368547758079223372036854775807 on 64 bit
probability(P, s, s, 1000) == 1
A reward is a score used to adjust the value that is assigned to each state in the long run.
The base type is AbstractReward, and rewards should be a subtype of it. Each subtype should also define the reward method. The calling signature is
reward(R, s, a)
where R is a subtype of AbstractReward, s is the state, a is the action, and it returns a Real.
There is a subtype AbstractArrayReward with two further subtypes: ArrayReward and SparseReward. ArrayReward can be constructed with either a Vector, a Matrix or an Array{T,3}. SparseReward is constructed with a SparseMatrixCSC. There is a convenience function Reward which will return the appropriate type.
Examples with 10 states and 3 actions:
R1 = Reward(rand(10)) # reward depends only on state
R2 = Reward(rand(10, 3)) # reward over state × action
R3 = Reward(rand(10, 10, 3)) # reward over state × state × action
R4 = Reward(sprand(10, 3, 1/3)) # sparse reward over state × action
The 3-dimensional ArrayReward is most useful for reinforcement learning algorithms.
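As a quick check of this interface (a hedged sketch: it assumes a Vector-backed reward ignores the action argument and a Matrix-backed reward is indexed by state and action):
reward(R1, 4, 2) # reward for state 4; R1 depends only on the state
reward(R2, 4, 2) # reward for state 4 when taking action 2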
The Q-function types abstract the process of assigning values to states and actions.
The base type is AbstractQFunction, and there are currently two subtypes: ArrayQFunction and VectorQFunction. ArrayQFunction stores a dense matrix of the value for each state-action pair, whereas VectorQFunction stores only the currently optimal value and policy. The two types represent different memory/time trade-offs.
The methods that should be implemented for these types are valuetype(Q), value(Q), policy(Q), value!(V, Q), and setvalue!(Q, v, s, a), where Q is a subtype of AbstractQFunction.
- valuetype(Q): the type of the values
- policy(Q): the optimal policy
- value(Q): the value of each state when following the optimal policy
- value!(V, Q): copy the value to V
- setvalue!(Q, v, s, a): set the value of being in state s and taking action a to v
Examples for 10 states and 3 actions:
Q1 = ArrayQFunction(10, 3) # initialised to Float64 zeros
Q2 = ArrayQFunction(rand(10, 3)) # pass in pre-initialised array
Q3 = QFunction(rand(10, 3)) # identical to previous
Q4 = VectorQFunction(10) # initialised to Float64 zeros
Q5 = VectorQFunction(rand(10), rand(Int, 10)) # pass in pre-initialised vectors
Q6 = QFunction(rand(10), rand(Int, 10)) # identical to previous
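The accessor methods listed above can then be used as follows (a hedged sketch using Q1, which starts as all zeros; it assumes value and policy take the maximum over actions for each state):
setvalue!(Q1, 1.5, 4, 2) # the value of taking action 2 in state 4 is now 1.5
valuetype(Q1)            # Float64
value(Q1)[4]             # 1.5, the best value available in state 4
policy(Q1)[4]            # 2, the action achieving that value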
The base type is AbstractMDP, with one concrete subtype MDP. POMDP is planned for the future. The MDP types provide a wrapper around AbstractTransitionProbability and AbstractReward types, ensure that the number of states and actions in both are the same, and check that the transition probability matrices are square and stochastic. They are initialised by passing an AbstractTransitionProbability and an AbstractReward.
Algorithms take an AbstractQFunction instance and an AbstractMDP instance to solve the problem. Currently value_iteration(mdp::MDP, δ) and value_iteration!(Q::AbstractQFunction, mdp::MDP, δ) are defined. These perform the value iteration algorithm. The former will construct its own Q-function and return it, while the latter will modify the Q-function that is passed to it.
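For example, reusing the random example from the introduction (the choice of ArrayQFunction for the pre-allocated version is just one option):
P, R = MDPs.Examples.random(10, 3)
mdp = MDP(P, R)
Q = value_iteration(mdp, 0.9)    # allocates and returns a new Q-function
Qpre = ArrayQFunction(10, 3)     # pre-allocated Q-function
value_iteration!(Qpre, mdp, 0.9) # fills Qpre in place
policy(Qpre)                     # the optimal policy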
This is a list of the other exported functions:
- bellman: the Bellman operator.
- bellman!: the in-place Bellman operator.
- is_square_stochastic: checks that P is square-stochastic.
- ismdp: checks that P and R describe a valid MDP.
- num_actions: the number of actions.
- num_states: the number of states.
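For instance (a hedged sketch; it assumes is_square_stochastic and ismdp accept the raw arrays returned by the example generators):
P, R = MDPs.Examples.random(10, 3)
is_square_stochastic(P) # true: each action's transition matrix is square and stochastic
ismdp(P, R)             # true: P and R together describe a valid MDP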
There are some examples in the Examples submodule that return transition and reward arrays, which can be passed to the TransitionProbability and Reward methods respectively.
- MDPs.Examples.random: random dense
- MDPs.Examples.sprandom: random sparse
- MDPs.Examples.small: two states and two actions
MDPs.Examples.random(states, actions) will create a random Float64 transition array of size states × states × actions and a random Float64 reward array of size states × actions. MDPs.Examples.random{N}(states, actions, mask::Array{Bool,N}) will set the transition array to zero everywhere that mask is false. mask can be states × states × actions or states × actions in size.
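A brief sketch of the masked version (the mask layout below assumes states × states × actions indexing):
mask = trues(10, 10, 3)
mask[:, 1, 2] = false                    # zero out these transition entries for action 2
P, R = MDPs.Examples.random(10, 3, mask)
all(P[:, 1, 2] .== 0.0)                  # true: masked entries are zero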
Bellman R, 1957, A Markovian Decision Process, Journal of Mathematics and Mechanics, vol. 6, no. 5, pp. 679–684.
Please file any issues or feature requests through the GitHub issue tracker.
The package is licensed under the terms of the MIT "Expat" License. See LICENSE.md for details.