# Question 3

You are not required to use QuickPOMDPs, but the examples will use it.

The following code defines a problem with 4 nonterminal states. The agent receives a reward of 3 in state 1, and then the problem terminates immediately (state 5 is a "terminal state").

In [1]:
using QuickPOMDPs

In [3]:
S = 1:5

A = [-1, 1] # -1 is left, 1 is right

function T(s, a, sp) # returns probability of transitioning to sp given s, a
    # handle transitioning to the terminal state
    if s == 1
        if sp == 5
            return 1.0
        else
            return 0.0
        end
    # now handle normal transitions
    elseif sp == clamp(s + a, 1, 4)
        return 0.9
    elseif sp == clamp(s - a, 1, 4)
        return 0.1
    else
        return 0.0
    end
end

function R(s, a)
    if s == 1
        return 3.0
    else
        return 0.0
    end
end

γ = 0.99

terminals = Set(5) # set of terminal states - no reward and no transitioning out of these states

m = DiscreteExplicitMDP(S, A, T, R, γ, terminals=terminals);

In the homework problem, you will have to account for the key. You may want to continue using `Int`s to represent the state, or you can use something more complex like [`NamedTuple`s](https://docs.julialang.org/en/v1/base/base/#Core.NamedTuple).
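As a rough illustration of the `NamedTuple` option, here is a minimal sketch of a state space that pairs a position with a flag for whether the key is held. The field names `pos` and `key` and the range of positions are assumptions made up for this example; they are not taken from the assignment, so adapt them to the actual problem.

```julia
# Hypothetical state layout: a position plus a flag for holding the key.
# The field names `pos` and `key` are illustrative only.
positions = 1:4
S = vec([(pos=p, key=k) for p in positions, k in (false, true)])

s = S[1]                          # first state in the list
s.pos                             # fields are accessed by name
sp = (pos=s.pos + 1, key=true)    # NamedTuples are immutable, so build a new one for s'
sp in S                           # NamedTuples compare by value, so membership checks work
idx = findfirst(==(sp), S)        # position of sp in the state list
```

Because `NamedTuple`s compare by value, expressions like `sp == (pos=2, key=true)` can be used inside a transition function in the same way integer comparisons are used in the example above.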

When you get ready to solve the problem, you may wish to use the [DiscreteValueIteration package](https://github.com/JuliaPOMDP/DiscreteValueIteration.jl). It will be able to solve the problem. For this question, see especially the [`POMDPs.value` function](https://juliapomdp.github.io/POMDPs.jl/stable/api/#POMDPs.value).
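For reference, a minimal sketch of what solving the 5-state example with DiscreteValueIteration might look like. The model is restated so the block runs on its own, and the solver settings (`max_iterations`, `belres`) are illustrative choices, not values required by the assignment.

```julia
using POMDPs
using QuickPOMDPs
using DiscreteValueIteration

# The 5-state example from above, restated so this block runs on its own.
S = 1:5
A = [-1, 1]
function T(s, a, sp)
    if s == 1
        return sp == 5 ? 1.0 : 0.0
    elseif sp == clamp(s + a, 1, 4)
        return 0.9
    elseif sp == clamp(s - a, 1, 4)
        return 0.1
    else
        return 0.0
    end
end
R(s, a) = s == 1 ? 3.0 : 0.0
m = DiscreteExplicitMDP(S, A, T, R, 0.99, terminals=Set(5))

# Solver settings here are illustrative assumptions.
solver = ValueIterationSolver(max_iterations=1000, belres=1e-6)
policy = solve(solver, m)

value(policy, 1)   # optimal value of state 1
action(policy, 3)  # optimal action in state 3
```

The object returned by `solve` is a policy, so both `value(policy, s)` and `action(policy, s)` can be queried for any state `s` in `S`.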

# Question 4

In this question, you need only define the generative model for the continuous-state MDP.

The following MDP has a generative model defined for it, but the next state is simply the action plus a uniformly distributed random number between 0 and 1.

In [16]:
using POMDPs
using QuickPOMDPs
using Distributions
using POMDPPolicies
using POMDPSimulators

In [17]:
m = QuickMDP(
    function G(s, a, rng) # this is the generative model - it takes in a state, action, and random number generator
        sp = a + rand(rng) # the next state, s'
        r = -abs(s)+0.8*abs(a) # strange reward function
        return (sp=sp, r=r) # package state and reward in a NamedTuple to return
    end,
    initialstate_distribution = Normal(), # a distribution from Distributions.jl to draw initial states from
    actiontype = Float64 # since there is no other way to infer the type of the actions, we have to tell it
);

To define a policy, define a function that takes in a state and returns the action, then wrap that in a `FunctionPolicy` from POMDPPolicies.jl.

In [18]:
function pfunc(s)
    return -s
end
policy = FunctionPolicy(pfunc)

FunctionPolicy{typeof(pfunc)}(pfunc)

To run a simulation and get a reward, use the rollout simulator.

In [19]:
sim = RolloutSimulator(max_steps=100)
r = simulate(sim, m, policy)

-19.362495060704443

With many such simulations, you can evaluate different policies.
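One way to do this is a simple Monte Carlo estimate: run many rollouts and report the mean return along with its standard error. The sketch below is only an illustration; `mc_estimate` and `fake_rollout` are hypothetical names introduced here, and `fake_rollout` is a stand-in so the block runs without the POMDPs.jl packages. In the notebook, you would pass `() -> simulate(sim, m, policy)` instead.

```julia
using Statistics
using Random

# Hypothetical helper: estimate the expected return by averaging n independent samples.
# `sample_return` is any zero-argument function that returns one rollout's total reward.
function mc_estimate(sample_return, n)
    returns = [sample_return() for _ in 1:n]
    return (mean=mean(returns), sem=std(returns) / sqrt(n))
end

# Stand-in for a rollout (an assumption for illustration only).
rng = MersenneTwister(1)
fake_rollout() = -20.0 + randn(rng)

est = mc_estimate(fake_rollout, 1000)
println("mean = ", est.mean, ", standard error = ", est.sem)
```

When comparing two policies, the difference in their means is only meaningful if it is large relative to the standard errors, so increase `n` until that is the case.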

If simulations are going too slowly, ask on Piazza or consult the [Performance Tips](https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-global-variables-1).

# Question 5

I won't give too many hints on this one.

In [34]:
using DMUStudent
using DMUStudent.HW2

In [22]:
n = 1
m = UnresponsiveACASMDP(n); # this is the most coarsely discretized version of the problem

In [23]:
t = transition_matrices(m) # a dictionary of transition matrices for each action

Dict{Int64,Array{Float64,2}} with 3 entries:
  2 => [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 1.0 0.0; 0.0 0.0 … …
  3 => [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 1.0 0.0; 0.0 0.0 … …
  1 => [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 1.0 0.0; 0.0 0.0 … …

In [26]:
t[1][2,3] # probability of transitioning from 2 to 3 with action 1, i.e. T(3|2,1)

0.0

In [27]:
r = reward_vectors(m) # a dictionary of reward vectors

Dict{Int64,Array{Float64,1}} with 3 entries:
  2 => [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0…
  3 => [-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0  …  0.0, 0.0…
  1 => [-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0  …  0.0, 0.0…

In [28]:
r[1][2] # reward collected in state 2 if action 1 is taken, i.e. R(2,1)

-1.0

In [30]:
length(r[1]) # number of states is given by the size of the vectors. Make sure your value vector has this size

1250

In [31]:
n = 2 # by increasing n, we have a finer discretization
m2 = UnresponsiveACASMDP(n)
r2 = reward_vectors(m2)
length(r2[1]) # this one has many more states!

10000

If you run into memory issues, see the docstring for `transition_matrices`. If code in the problem definition is running too slowly, contact the instructor; we may be able to speed it up.

In [35]:
# to evaluate, use
v = zeros(length(r2[1])) # this should be your actual value function
evaluate(v, "hw2")

Evaluation complete! Score: 0
└ @ DMUStudent.HW2 none:21

0

In [36]:
# and to submit, use
submit(v, "hw2", "identikey@colorado.edu", nickname="nickname")

ErrorException: You must use your indentikey@colorado.edu email address. Your identikey is four letters followed by four numbers.