## Example: Online Planning in the Lava Grid World

In [1]:
include("Include.jl");

[32m[1m    Updating[22m[39m git-repo `https://github.com/varnerlab/VLDecisionsPackage.jl.git`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m    Updating[22m[39m `~/Desktop/julia_work/CHEME-5760-Examples-F23/Project.toml`
  [90m[10f378ab] [39m[93m~ VLDecisionsPackage v0.1.0 `https://github.com/varnerlab/VLDecisionsPackage.jl.git#main` ⇒ v0.1.0 `https://github.com/varnerlab/VLDecisionsPackage.jl.git#main`[39m
[32m[1m    Updating[22m[39m `~/Desktop/julia_work/CHEME-5760-Examples-F23/Manifest.toml`
  [90m[10f378ab] [39m[93m~ VLDecisionsPackage v0.1.0 `https://github.com/varnerlab/VLDecisionsPackage.jl.git#main` ⇒ v0.1.0 `https://github.com/varnerlab/VLDecisionsPackage.jl.git#main`[39m
[32m[1mPrecompiling[22m[39m project...
[32m  ✓ [39mVLDecisionsPackage
  1 dependency successfully precompiled in 4 seconds. 227 already precompiled.
[32m[1m  Activating[22m[39m project at `~/Desktop/julia_work/CHEME-5760-Examples-F23`
[32m[1m    Updating[22m[39m 

## Task 1: Build the world model
We encode the `rectangular grid world` using the `MyRectangularGridWorldModel` type, which we construct with a `build(...)` method. Let's set up the data for the world (the states, actions, and rewards) and then construct the world model.
* First, set values for the `number_of_rows` and `number_of_cols` variables, the number of actions `nactions` available to the agent, and the `discount factor` $\gamma$.
* Then, compute the number of states and set up the state set $\mathcal{S}$ and the action set $\mathcal{A}$.

In [2]:
number_of_rows = 5
number_of_cols = 5
nactions = 4;
γ = 0.95;
nstates = (number_of_rows*number_of_cols);
𝒮 = range(1,stop=nstates,step=1) |> collect;
𝒜 = range(1,stop=nactions,step=1) |> collect;

Next, we'll set up a description of the rewards: the `rewards::Dict{Tuple{Int,Int}, Float64}` dictionary, which maps the $(x,y)$-coordinates of a grid square to a reward value. We only need to put `non-default` reward values in the dictionary (default values for the other locations are added later). Lastly, we record which locations on the grid are `absorbing`, i.e., the charging station and the lava pits in your living room:

In [3]:
# setup rewards -
rewards = Dict{Tuple{Int,Int}, Float64}()
rewards[(2,2)] = -100000.0 # lava in the (2,2) square 
rewards[(4,4)] = -100000.0 # lava in the (4,4) square
rewards[(3,3)] = 1000.0    # charging station square

# setup set of absorbing states -
absorbing_state_set = Set{Tuple{Int,Int}}()
push!(absorbing_state_set, (2,2));
push!(absorbing_state_set, (3,3));
push!(absorbing_state_set, (4,4));
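
The sparse-dictionary idea can be illustrated with a default lookup. The following is a minimal sketch, assuming the default step reward is `-1.0` (the value used for ordinary squares in Task 2); the names `default_reward`, `r_lava`, and `r_plain` are introduced here for illustration only.

```julia
# Sparse reward description: only the special squares are listed (values from the cell above).
rewards = Dict{Tuple{Int,Int}, Float64}(
    (2,2) => -100000.0, # lava
    (4,4) => -100000.0, # lava
    (3,3) => 1000.0,    # charging station
);

# Assumption: -1.0 is the default reward for ordinary squares.
default_reward = -1.0;

# get(dict, key, default) returns the stored value, or the default if the key is missing.
r_lava = get(rewards, (2,2), default_reward);
r_plain = get(rewards, (1,1), default_reward);
println("lava: ", r_lava, ", ordinary square: ", r_plain);
```

Keeping only the exceptions in the dictionary keeps the setup short, and the default can be changed in one place.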

Finally, we can build an instance of the `MyRectangularGridWorldModel` type, which models the grid world. We save this instance in the `world` variable.
* We pass the number of rows `nrows`, the number of columns `ncols`, and our initial reward description `rewards` as a named tuple into the `build(...)` method.

In [4]:
world = VLDecisionsPackage.build(MyRectangularGridWorldModel, 
    (nrows = number_of_rows, ncols = number_of_cols, rewards = rewards));
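
The later cells read three fields from `world`: `world.coordinates`, `world.states`, and `world.moves`. The `build(...)` implementation is not shown here, so the following is only a hypothetical stand-in for what those fields might hold. It assumes states are numbered row by row and that actions `1` to `4` mean up, down, left, and right, an ordering that appears consistent with the `R` matrix printed in Task 2.

```julia
# Hypothetical stand-in for the fields that the later cells read from `world`.
# Assumptions: states are numbered row by row; actions 1..4 mean up, down, left, right.
nrows, ncols = 5, 5;

coordinates = Dict{Int, Tuple{Int,Int}}(); # state index -> (row, col)
states = Dict{Tuple{Int,Int}, Int}();      # (row, col) -> state index
for i ∈ 1:nrows, j ∈ 1:ncols
    s = (i - 1)*ncols + j;
    coordinates[s] = (i, j);
    states[(i, j)] = s;
end

moves = Dict{Int, Tuple{Int,Int}}(
    1 => (-1, 0), # up
    2 => (1, 0),  # down
    3 => (0, -1), # left
    4 => (0, 1),  # right
);

# Moving "down" from state 2 at (1,2) lands on (2,2), which is state 7 under this numbering.
new_position = coordinates[2] .+ moves[2];
println(new_position, " => state ", states[new_position]);
```

The two dictionaries are inverses of each other, which is what lets the loops below go from a state index to a position and back.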

## Task 2: Generate the components of the MDP problem
The MDP problem requires the reward function (or array) `R(s, a)` and the transition function (or array) `T(s, s′, a)`. Let's construct these from our grid world model instance, starting with the reward function `R(s, a)`:

### Rewards $R(s,a)$
We'll encode the reward function as a $\dim\mathcal{S}\times\dim\mathcal{A}$ array, which holds the reward values for being in state $s\in\mathcal{S}$ and taking action $a\in\mathcal{A}$. After initializing the `R`-array and filling it with zeros, we'll populate the non-zero values of $R(s, a)$ using nested `for` loops. For each state-action pair, we'll:
* Select a state `s`, an action `a`, and a move `Δ`.
* Compute the new position resulting from implementing action `a` from the current position and store this in the `new_position` variable.
* If `new_position`$\in\mathcal{S}$ is in our initial `rewards` dictionary (the charging station or a lava pit), use that reward value from the `rewards` dictionary. If we are still in the world but not in a special location, set the reward to `-1`.
* Finally, if `new_position`$\notin\mathcal{S}$, i.e., the `new_position` is a space outside the grid, set a penalty of `-50000.0`.

In [5]:
R = zeros(nstates, nactions);
fill!(R, 0.0)
for s ∈ 𝒮
    for a ∈ 𝒜
        
        Δ = world.moves[a];
        current_position = world.coordinates[s]
        new_position =  current_position .+ Δ
        if (haskey(world.states, new_position) == true)
            if (haskey(rewards, new_position) == true)
                R[s,a] = rewards[new_position];
            else
                R[s,a] = -1.0;
            end
        else
            R[s,a] = -50000.0; # we are off the grid, big negative penalty
        end
    end
end
R

25×4 Matrix{Float64}:
  -50000.0       -1.0   -50000.0       -1.0
  -50000.0  -100000.0       -1.0       -1.0
  -50000.0       -1.0       -1.0       -1.0
  -50000.0       -1.0       -1.0       -1.0
  -50000.0       -1.0       -1.0   -50000.0
      -1.0       -1.0   -50000.0  -100000.0
      -1.0       -1.0       -1.0       -1.0
      -1.0     1000.0  -100000.0       -1.0
      -1.0       -1.0       -1.0       -1.0
      -1.0       -1.0       -1.0   -50000.0
      -1.0       -1.0   -50000.0       -1.0
 -100000.0       -1.0       -1.0     1000.0
      -1.0       -1.0       -1.0       -1.0
      -1.0  -100000.0     1000.0       -1.0
      -1.0       -1.0       -1.0   -50000.0
      -1.0       -1.0   -50000.0       -1.0
      -1.0       -1.0       -1.0       -1.0
    1000.0       -1.0       -1.0  -100000.0
      -1.0       -1.0       -1.0       -1.0
      -1.0       -1.0  -100000.0   -50000.0
      -1.0   -50000.0   -50000.0       -1.0
      -1.0   -50000.0       -1.0       -1.0
      -1.0

### Transition $T(s, s^{\prime},a)$
Next, build the transition function $T(s,s^{\prime},a)$. We'll encode this as a $\dim\mathcal{S}\times\dim\mathcal{S}\times\dim\mathcal{A}$ [multidimensional array](https://docs.julialang.org/en/v1/manual/arrays/) and populate it using nested `for` loops.

* In the `outer` loop, we iterate over actions. For every $a\in\mathcal{A}$, we get the move associated with that action and store it in the `Δ` variable.
* In the `inner` loop, we iterate over states $s\in\mathcal{S}$. We compute the `new_position` resulting from implementing action $a$ and check if `new_position`$\in\mathcal{S}$. If `new_position` is in the world and `current_position` is _not_ an `absorbing state`, we set $s^{\prime}\leftarrow$`world.states[new_position]` and `T[s, s′,  a] = 1.0`.
* However, if the `new_position` is outside of the grid (or we are starting from an `absorbing` state), we set `T[s, s,  a] = 1.0`, i.e., the probability that we stay in `s` if we take action `a` is `1.0`.

In [6]:
T = Array{Float64,3}(undef, nstates, nstates, nactions);
fill!(T, 0.0)
for a ∈ 𝒜
    
    Δ = world.moves[a];
    
    for s ∈ 𝒮
        current_position = world.coordinates[s]
        new_position =  current_position .+ Δ
        if (haskey(world.states, new_position) == true && 
                in(current_position, absorbing_state_set) == false)
            s′ = world.states[new_position];
            T[s, s′,  a] = 1.0
        else
            T[s, s,  a] = 1.0
        end
    end
end
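
A useful sanity check on a transition array built this way is that, for every action, each row `T[s, :, a]` sums to one. The sketch below repeats the construction on a hypothetical 2×2 world so that it runs without the package; the row-major numbering, the move ordering, and the single absorbing square are assumptions made for illustration.

```julia
# Self-contained check on a hypothetical 2×2 world (row-major states; up, down, left, right).
nrows, ncols = 2, 2;
nstates, nactions = nrows*ncols, 4;
coordinates = Dict((i - 1)*ncols + j => (i, j) for i ∈ 1:nrows for j ∈ 1:ncols);
states = Dict(v => k for (k, v) ∈ coordinates);
moves = Dict(1 => (-1, 0), 2 => (1, 0), 3 => (0, -1), 4 => (0, 1));
absorbing_state_set = Set([(2, 2)]); # assumption: one absorbing square

T = zeros(nstates, nstates, nactions);
for a ∈ 1:nactions, s ∈ 1:nstates
    current_position = coordinates[s];
    new_position = current_position .+ moves[a];
    if haskey(states, new_position) && !(current_position ∈ absorbing_state_set)
        T[s, states[new_position], a] = 1.0;
    else
        T[s, s, a] = 1.0; # off-grid move or absorbing state: stay put
    end
end

# For every action, each row T[s, :, a] should be a probability distribution over s′.
row_sums = sum(T, dims = 2);
is_stochastic = all(row_sums .≈ 1.0);
println("rows sum to one: ", is_stochastic);
```

The same two lines (`sum(T, dims = 2)` and the `all(...)` test) can be applied to the 25-state array `T` constructed above.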

Finally, we construct an instance of the `MyMDPProblemModel` which encodes the data required to solve the MDP problem.
* We must pass the states `𝒮`, the actions `𝒜`, the transition matrix `T`, the reward matrix `R`, and the discount factor `γ` into the `build(...)` method. We store the MDP model in the `m` variable:

In [7]:
m = VLDecisionsPackage.build(MyMDPProblemModel, 
    (𝒮 = 𝒮, 𝒜 = 𝒜, T = T, R = R, γ = γ));

## Task 3: Online planning solution

In [None]:
function randpolicy(problem::MyMDPProblemModel, world::MyRectangularGridWorldModel, s::Int)
    
end
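
The `randpolicy` stub above has no body in the source, and this example does not try to fill it in. One plausible reading, stated here purely as an assumption, is a policy that picks an action uniformly at random, which is a common choice for rollouts in online planning. The sketch below uses hypothetical helpers `random_action` and `rollout` that work directly on plain `T` and `R` arrays, so it does not depend on the fields of `MyMDPProblemModel`.

```julia
using Random

# Hypothetical helper (not part of VLDecisionsPackage): pick an action uniformly at random.
random_action(rng::AbstractRNG, 𝒜::Vector{Int}) = rand(rng, 𝒜);

# Hypothetical helper: discounted return of a depth-d rollout from state s under random actions.
# T is assumed deterministic, so the next state is the index of the 1.0 entry in T[s, :, a].
function rollout(rng::AbstractRNG, T::Array{Float64,3}, R::Matrix{Float64},
        𝒜::Vector{Int}, γ::Float64, s::Int, d::Int)
    total = 0.0;
    for k ∈ 0:(d - 1)
        a = random_action(rng, 𝒜);
        total += (γ^k)*R[s, a];
        s = argmax(T[s, :, a]);
    end
    return total;
end

# Toy two-state chain: action 1 stays (reward 0), action 2 switches state (reward 1).
T = zeros(2, 2, 2);
T[1, 1, 1] = 1.0; T[2, 2, 1] = 1.0;
T[1, 2, 2] = 1.0; T[2, 1, 2] = 1.0;
R = [0.0 1.0; 0.0 1.0];
value = rollout(MersenneTwister(1), T, R, [1, 2], 0.95, 1, 10);
println("rollout return: ", value);
```

Averaging several such rollouts from a state gives a rough estimate of its value under the random policy; the depth `d` and the number of rollouts trade accuracy against computation.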