## Example: Online Planning in the Lava Grid World

This example will familiarize students with the `rollout` solution of a `two-dimensional` navigation problem, i.e., the lava world [roomba](https://www.irobot.com) problem we have discussed. 

### Problem
You have a [roomba](https://www.irobot.com) that has finished cleaning the kitchen floor and needs to return to its charging station. However, between your kitchen floor and the `charging station` (safety), there are one or more `lava pits` (destruction for the [roomba](https://www.irobot.com)). This is an example of a two-dimensional grid-world navigational decision task. 

This example will familiarize students with using `rollout` to solve a two-dimensional grid-world navigation task and with the role of the discount factor $\gamma$. In particular, we will:

* __Task 1__: Build a `5` $\times$ `5` world model with two lava pits and a charging station.
* __Task 2__: Generate the components of the MDP problem.
* __Task 3__: Develop an online planning solution by implementing a `rollout(...)` method.

In [1]:
include("Include.jl");

[32m[1m    Updating[22m[39m git-repo `https://github.com/varnerlab/VLDecisionsPackage.jl.git`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/Desktop/julia_work/CHEME-5760-Examples-F23/Project.toml`
[32m[1m  No Changes[22m[39m to `~/Desktop/julia_work/CHEME-5760-Examples-F23/Manifest.toml`
[32m[1m  Activating[22m[39m project at `~/Desktop/julia_work/CHEME-5760-Examples-F23`
[32m[1m    Updating[22m[39m registry at `~/.julia/registries/General.toml`
[32m[1m    Updating[22m[39m git-repo `https://github.com/varnerlab/VLDecisionsPackage.jl.git`
[32m[1m  No Changes[22m[39m to `~/Desktop/julia_work/CHEME-5760-Examples-F23/Project.toml`
[32m[1m  No Changes[22m[39m to `~/Desktop/julia_work/CHEME-5760-Examples-F23/Manifest.toml`


## Task 1: Build the world model
We encode the `rectangular grid world` using the `MyRectangularGridWorldModel` model, which we construct using a `build(...)` method. Let's set up the data for the world, set up the states, actions, and rewards, and then construct the world model. 
* First, set values for the `number_of_rows` and `number_of_cols` variables, the number of actions `nactions` that are available to the agent, and the `discount factor` $\gamma$. 
* Then, we'll compute the number of states and set up the state set $\mathcal{S}$ and the action set $\mathcal{A}$.

In [2]:
number_of_rows = 5
number_of_cols = 5
nactions = 4;
γ = 0.25;
nstates = (number_of_rows*number_of_cols);
𝒮 = range(1,stop=nstates,step=1) |> collect;
𝒜 = range(1,stop=nactions,step=1) |> collect;

Next, we'll set up a description of the rewards, the `rewards::Dict{Tuple{Int,Int}, Float64}` dictionary, which maps the $(x,y)$-coordinates to a reward value. We only need to put `non-default` reward values in the reward dictionary (we'll add default values to the other locations later). Lastly, let's put the locations on the grid that are `absorbing`, meaning the charging station or lava pits in your living room:

In [3]:
# setup rewards -
rewards = Dict{Tuple{Int,Int}, Float64}()
rewards[(2,2)] = -100000.0 # lava in the (2,2) square 
rewards[(4,4)] = -100000.0 # lava in the (4,4) square
rewards[(3,3)] = 1000.0    # charging station square

# setup set of absorbing states -
absorbing_state_set = Set{Tuple{Int,Int}}()
push!(absorbing_state_set, (2,2));
push!(absorbing_state_set, (3,3));
push!(absorbing_state_set, (4,4));

Finally, we can build an instance of the `MyRectangularGridWorldModel` type, which models the grid world. We save this instance in the `world` variable.
* We must pass the number of rows `nrows`, the number of columns `ncols`, and our initial reward description in the `rewards` field into the `build(...)` method.

In [4]:
world = VLDecisionsPackage.build(MyRectangularGridWorldModel, 
    (nrows = number_of_rows, ncols = number_of_cols, rewards = rewards));
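
The later cells read three fields from `world`: `world.coordinates`, `world.states`, and `world.moves`. The sketch below is a hypothetical stand-in for those fields, not the package's implementation. The state numbering and the action-to-move table are assumptions inferred from the outputs printed later in this example (for instance, state `6` sits at position `(2, 1)` and action `2` increases the first coordinate).

```julia
# Hypothetical stand-in for the fields used later in this example; the real
# MyRectangularGridWorldModel is built by VLDecisionsPackage.build(...).
nrows, ncols = 5, 5

coordinates = Dict{Int,Tuple{Int,Int}}()   # state index -> (i, j) position
states = Dict{Tuple{Int,Int},Int}()        # (i, j) position -> state index
for i in 1:nrows, j in 1:ncols
    s = (i - 1) * ncols + j                # assumed numbering: second coordinate varies fastest
    coordinates[s] = (i, j)
    states[(i, j)] = s
end

# assumed action -> move table, consistent with the printed policy arrows
moves = Dict(1 => (-1, 0), 2 => (1, 0), 3 => (0, -1), 4 => (0, 1))

# taking action 2 from state 1, i.e., position (1, 1), lands on (2, 1)
new_position = coordinates[1] .+ moves[2]
println(new_position, " is state ", states[new_position])
```

With this layout, a position is off the grid exactly when `haskey(states, new_position)` is `false`, which is the test the reward and transition loops rely on.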

## Task 2: Generate the components of the MDP problem
The MDP problem requires the reward function (or array) `R(s, a)`, and the transition function (or array) `T(s, s′, a)`. Let's construct these from our grid world model instance, starting with the reward function `R(s, a)`:

### Rewards $R(s,a)$
We'll encode the reward function as a $\dim\mathcal{S}\times\dim\mathcal{A}$ array, which holds the reward values for being in state $s\in\mathcal{S}$ and taking action $a\in\mathcal{A}$. After initializing the `R`-array and filling it with zeros, we'll populate the values of $R(s, a)$ using nested `for` loops. For each state-action pair, we'll:
* Select a state `s`, an action `a`, and a move `Δ`.
* Compute the new position that results from taking action `a` from the current position and store it in the `new_position` variable.
* If `new_position` is on the grid and appears in our initial `rewards` dictionary (the charging station or a lava pit), use that reward value from the `rewards` dictionary. If it is on the grid but not a special location, set the reward to `-1`.
* Finally, if `new_position`$\notin\mathcal{S}$, i.e., the `new_position` is a space outside the grid, set a penalty of `-50000.0`.

In [5]:
R = zeros(nstates, nactions);
fill!(R, 0.0)
for s ∈ 𝒮
    for a ∈ 𝒜
        
        Δ = world.moves[a];
        current_position = world.coordinates[s]
        new_position =  current_position .+ Δ
        if (haskey(world.states, new_position) == true)
            if (haskey(rewards, new_position) == true)
                R[s,a] = rewards[new_position];
            else
                R[s,a] = -1.0;
            end
        else
            R[s,a] = -50000.0; # we are off the grid, big negative penalty
        end
    end
end
R

25×4 Matrix{Float64}:
  -50000.0       -1.0   -50000.0       -1.0
  -50000.0  -100000.0       -1.0       -1.0
  -50000.0       -1.0       -1.0       -1.0
  -50000.0       -1.0       -1.0       -1.0
  -50000.0       -1.0       -1.0   -50000.0
      -1.0       -1.0   -50000.0  -100000.0
      -1.0       -1.0       -1.0       -1.0
      -1.0     1000.0  -100000.0       -1.0
      -1.0       -1.0       -1.0       -1.0
      -1.0       -1.0       -1.0   -50000.0
      -1.0       -1.0   -50000.0       -1.0
 -100000.0       -1.0       -1.0     1000.0
      -1.0       -1.0       -1.0       -1.0
      -1.0  -100000.0     1000.0       -1.0
      -1.0       -1.0       -1.0   -50000.0
      -1.0       -1.0   -50000.0       -1.0
      -1.0       -1.0       -1.0       -1.0
    1000.0       -1.0       -1.0  -100000.0
      -1.0       -1.0       -1.0       -1.0
      -1.0       -1.0  -100000.0   -50000.0
      -1.0   -50000.0   -50000.0       -1.0
      -1.0   -50000.0       -1.0       -1.0
      -1.0

### Transition $T(s, s^{\prime},a)$
Next, build the transition function $T(s,s^{\prime},a)$. We'll encode this as a $\dim\mathcal{S}\times\dim\mathcal{S}\times\dim\mathcal{A}$ [multidimensional array](https://docs.julialang.org/en/v1/manual/arrays/) and populate it using nested `for` loops. 

* In the `outer` loop, we iterate over actions. For every $a\in\mathcal{A}$, we get the move associated with that action and store it in the `Δ` variable.
* In the `inner` loop, we iterate over states $s\in\mathcal{S}$. We compute the `new_position` that results from taking action $a$ and check if `new_position`$\in\mathcal{S}$. If `new_position` is in the world and `current_position` is _not_ an `absorbing state`, we set $s^{\prime}\leftarrow$`world.states[new_position]` and `T[s, s′,  a] = 1.0`.
* However, if the `new_position` is outside of the grid (or we are trying to move out of an `absorbing` state), we set `T[s, s,  a] = 1.0`, i.e., the probability that we stay in `s` if we take action `a` is `1.0`.

In [6]:
T = Array{Float64,3}(undef, nstates, nstates, nactions);
fill!(T, 0.0)
for a ∈ 𝒜
    
    Δ = world.moves[a];
    
    for s ∈ 𝒮
        current_position = world.coordinates[s]
        new_position =  current_position .+ Δ
        if (haskey(world.states, new_position) == true && 
                in(current_position, absorbing_state_set) == false)
            s′ = world.states[new_position];
            T[s, s′,  a] = 1.0
        else
            T[s, s,  a] = 1.0
        end
    end
end
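
Because every move is deterministic, each `(s, a)` pair should put a total probability of exactly `1.0` on a single next state. The sketch below rebuilds a transition array on a small hypothetical `3` $\times$ `3` grid with one absorbing square and checks that property. The grid size, the helper dictionaries, and the action-to-move table are illustrative assumptions, not the package's data.

```julia
# Self-contained sketch: deterministic transitions on a hypothetical 3×3 grid
nrows, ncols, nactions = 3, 3, 4
nstates = nrows * ncols

coordinates = Dict{Int,Tuple{Int,Int}}()
states = Dict{Tuple{Int,Int},Int}()
for i in 1:nrows, j in 1:ncols
    coordinates[(i - 1) * ncols + j] = (i, j)
    states[(i, j)] = (i - 1) * ncols + j
end
moves = Dict(1 => (-1, 0), 2 => (1, 0), 3 => (0, -1), 4 => (0, 1))  # assumed move table
absorbing = Set([(2, 2)])                                            # hypothetical absorbing square

T = zeros(nstates, nstates, nactions)
for a in 1:nactions, s in 1:nstates
    p = coordinates[s]
    q = p .+ moves[a]
    # move only if the target is on the grid and we are not in an absorbing state
    s′ = (haskey(states, q) && !(p in absorbing)) ? states[q] : s
    T[s, s′, a] = 1.0
end

row_sums = sum(T, dims = 2)       # total probability leaving each (s, a) pair
println(all(row_sums .≈ 1.0))     # true when every (s, a) slice is a valid distribution
```

The same check, `all(sum(T, dims = 2) .≈ 1.0)`, can be run on the `T` array built above before it is passed to the MDP model.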

Finally, we construct an instance of the `MyMDPProblemModel` which encodes the data required to solve the MDP problem.
* We must pass the states `𝒮`, the actions `𝒜`, the transition matrix `T`, the reward matrix `R`, and the discount factor `γ` into the `build(...)` method. We store the MDP model in the `m` variable:

In [7]:
m = VLDecisionsPackage.build(MyMDPProblemModel, 
    (𝒮 = 𝒮, 𝒜 = 𝒜, T = T, R = R, γ = γ));

## Task 3: Online planning solution
First, let's set the `depth` that we are going to explore, i.e., how many steps we take when exploring from each state `s`:

In [28]:
d = 48;

Next, let's implement three functions:

> The `myrandpolicy(problem::MyMDPProblemModel, world::MyRectangularGridWorldModel, s::Int) -> Int` function takes a `MyMDPProblemModel` instance, a `MyRectangularGridWorldModel` instance and the state `s`. This function returns a random action $a\in\mathcal{A}$ that keeps the agent on the grid.

> The `myrandstep(problem::MyMDPProblemModel, world::MyRectangularGridWorldModel, s::Int, a::Int)` function takes a `MyMDPProblemModel` instance, a `MyRectangularGridWorldModel` instance, the state `s` and an action `a` and returns the next state $s^{\prime}$ and reward $r$.

> The `myrollout(problem::MyMDPProblemModel, world::MyRectangularGridWorldModel, s::Int64, depth::Int64) -> Float64` function takes a `MyMDPProblemModel` instance, a `MyRectangularGridWorldModel` instance, the state `s` and the number of steps `depth`. This function returns the cumulative discounted reward collected while exploring the grid for `depth` steps.

These implementations are based on `Algorithm 9.1` of the [Decisions Book](https://algorithmsbook.com).

In [9]:
function myrandpolicy(problem::MyMDPProblemModel, 
        world::MyRectangularGridWorldModel, s::Int)::Int
    
    # initialize: uniform categorical distribution over the four actions -
    d = Categorical([0.25,0.25,0.25,0.25]); # you specify this
    
    # keep sampling until we draw an action that stays on the grid -
    should_choose_again = true;
    a = -1; # default
    while (should_choose_again == true)
       
        # sample a candidate action -
        aᵢ = rand(d);
        
        # get the move and the current location, then propose a new position -
        Δ = world.moves[aᵢ];
        current_position = world.coordinates[s]
        new_position =  current_position .+ Δ
        if (haskey(world.states, new_position) == true)
            a = aᵢ
            should_choose_again = false;
        end
    end
    
    return a;
end;

In [10]:
function myrandstep(problem::MyMDPProblemModel, 
        world::MyRectangularGridWorldModel, s::Int, a::Int)
    
    # get the reward value -
    r = problem.R[s,a];
    
    # get the move, and the current location -
    Δ = world.moves[a];
    current_position = world.coordinates[s]
    
    # propose a new position -
    new_position =  current_position .+ Δ
    s′ = s; # default, we don't do anything
    if (haskey(world.states, new_position) == true)
        s′ = world.states[new_position];
    end
    
    # return -
    return (s′,r)
end;

In [11]:
function myrollout(problem::MyMDPProblemModel, 
        world::MyRectangularGridWorldModel, s::Int64, depth::Int64)::Float64
    
    # initialize -
    ret = 0.0;
    for i ∈ 1:depth
        a = myrandpolicy(problem, world, s);
        s, r = myrandstep(problem, world, s, a);
        ret += problem.γ^(i-1)*r;
    end
    return ret;
end;
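
The return accumulated by `myrollout(...)` can be written as a finite discounted sum. In the notation below (introduced here for illustration), $s_{1}$ is the starting state and $a_{i}$, $r_{i}$ are the action drawn by `myrandpolicy(...)` and the reward returned by `myrandstep(...)` at step $i$:

$$
U(s_{1}) \approx \sum_{i=1}^{d}\gamma^{i-1}\,r_{i},\qquad r_{i} = R(s_{i}, a_{i})
$$

With $\gamma = 0.25$ the weights decay quickly: the fifth step is already weighted by $0.25^{4}\approx 0.0039$, so the estimate is dominated by the first few moves of the random walk.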

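Finally, we'll make a simple helper function `U(s)` that computes the value (utility) for state `s` by calling the `myrollout(...)` function. Because the rollout policy is random, each call to `U(s)` returns a single-sample (noisy) estimate: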

In [29]:
U(s) = myrollout(m,world,s,d)

U (generic function with 1 method)
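
A single rollout can differ a lot from one call to the next. One common way to reduce that noise is to average several independent rollouts per state. The sketch below illustrates the idea with a hypothetical noisy estimator `noisy_U` standing in for `U(s)`; the function names and the sample count are assumptions for illustration and are not part of the package.

```julia
using Statistics: mean   # standard library

# Hypothetical noisy estimator standing in for `U(s)`: true value 10 plus unit noise
noisy_U(s) = 10.0 + randn()

# average `n` independent estimates of the value of state `s`
U_avg(s; n = 1000) = mean(noisy_U(s) for _ in 1:n)

println(U_avg(1))   # close to 10.0; the spread shrinks roughly like 1/sqrt(n)
```

Applied to this example, the same pattern would replace `noisy_U(s)` with `myrollout(m, world, s, d)`, at the cost of `n` rollouts per state.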

To compute the value (utility) $U(s)$ at each state in the grid world, we use a `for` loop:
* For each state $s\in\mathcal{S}$ we call the `U(s)` helper function, which explores the problem to a depth `d` and returns the value (utility) at state `s`. We save that value in the `utility_array`.

In [30]:
utility_array = Array{Float64,1}();
for s ∈ 𝒮
    push!(utility_array, U(s))
end

In [31]:
utility_array

25-element Vector{Float64}:
      -1.3333333333388844
    -416.3742291304142
     -99.3699122373286
      -1.4346602161725415
      -1.3586644856722978
       2.576822916644462
     248.5352007547127
      61.2291651145015
      -1.333339154045879
   -6667.932935979048
      -1.3330937440808663
 -106250.2711694763
     264.5575452397264
 -106642.17340596551
      -1.7162893478378625
      -2.8591969807942705
       9.181369954820424
      65.13932314973029
  -25001.488646685404
   -6247.360700366165
      -1.3334326718987388
      -1.3323190410931904
      -1.3571749543403497
      -1.3333334242250805
  -25098.764031986662
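
Several entries above sit near `-1.3333`. That is what we would expect from a random walk that collects the default reward of `-1` on every step and never meaningfully touches a special square: the discounted sum is a geometric series with limit $-1/(1-\gamma) = -4/3$ for $\gamma = 0.25$. A quick check of that arithmetic, using only the values of `γ` and `d` set earlier in this example:

```julia
γ, d = 0.25, 48

# discounted return of a walk that earns -1 on each of its d steps
ret = sum(γ^(i - 1) * (-1.0) for i in 1:d)

# closed form of the same finite geometric series
closed_form = -(1 - γ^d) / (1 - γ)

println(ret, " ", closed_form)   # both approximately -4/3
```

Entries far from this value correspond to rollouts whose early steps landed on the charging station or a lava pit, where the large rewards are only lightly discounted.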

Extract the `action-value function` or $Q(s, a)$ from the `utility_array`. We can do this using the `Q(...)` function, which takes `m` and the `utility_array`:

In [32]:
my_Q = Q(m, utility_array)

25×4 Matrix{Float64}:
 -50000.3           -0.355794  -50000.3         -105.094
 -50104.1       -99937.9           -1.33333      -25.8425
 -50024.8           14.3073      -105.094         -1.35867
 -50000.4           -1.33333      -25.8425        -1.33967
 -50000.3        -1667.98          -1.35867   -50000.3
     -1.33333       -1.33327   -49999.4       -99937.9
     61.1338        61.1338        61.1338        61.1338
    -25.8425      1066.14      -99937.9           -1.33333
     -1.35867   -26661.5           14.3073     -1667.98
     -1.33967       -1.42907       -1.33333   -51667.0
     -0.355794      -1.7148    -50000.3       -26563.6
 -99937.9            1.29534       -1.33327     1066.14
     65.1394        65.1394        65.1394        65.1394
     -1.33333       -1.0625e5    1066.14          -1.42907
  -1667.98       -1562.84      -26661.5       -50000.4
     -1.33327       -1.33336   -50000.7            1.29534
 -26563.6           -1.33308       -1.7148        15.2848
   1066
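
The printed values are consistent with a one-step lookahead, $Q(s,a) = R(s,a) + \gamma\sum_{s^{\prime}}T(s,s^{\prime},a)\,U(s^{\prime})$. For example, taking action `2` from state `1` leads to state `6`, and $-1 + 0.25\times 2.5768 \approx -0.3558$ matches the entry in row `1`, column `2`. The sketch below implements that lookahead on a tiny two-state problem; `lookahead_Q` is a hypothetical name, and whether the package's `Q(...)` is implemented this way is an assumption.

```julia
# Sketch of a one-step lookahead; not the package's `Q(...)` implementation
function lookahead_Q(R::Matrix{Float64}, T::Array{Float64,3}, γ::Float64, U::Vector{Float64})
    nstates, nactions = size(R)
    Qm = zeros(nstates, nactions)
    for s in 1:nstates, a in 1:nactions
        # immediate reward plus discounted expected utility of the next state
        Qm[s, a] = R[s, a] + γ * sum(T[s, s′, a] * U[s′] for s′ in 1:nstates)
    end
    return Qm
end

# tiny two-state, one-action example: state 1 always moves to state 2, state 2 stays put
R = reshape([-1.0, -1.0], 2, 1)
T = zeros(2, 2, 1); T[1, 2, 1] = 1.0; T[2, 2, 1] = 1.0
U = [0.0, 8.0]

println(lookahead_Q(R, T, 0.25, U))   # both entries equal -1 + 0.25 * 8 = 1.0
```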

Finally, we can extract the policy $\pi(s)$ from the action-value function $Q(s,a)$ using the `policy(...)` function:

In [33]:
my_π = policy(my_Q);

In [34]:
my_π

25-element Vector{Int64}:
 2
 3
 2
 2
 3
 2
 1
 2
 3
 3
 1
 4
 1
 3
 2
 4
 4
 1
 1
 1
 4
 1
 1
 3
 3
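
The policy above agrees with picking, for each state, the action with the largest $Q(s,a)$. For example, row `2` of the `Q` array has its maximum in column `3`, and the policy for state `2` is `3`. The sketch below shows that greedy rule; `greedy_policy` is a hypothetical name, and the tie-breaking behavior (first maximal action wins) is an assumption that happens to match the absorbing states, whose rows are all equal.

```julia
# hypothetical stand-in for `policy(...)`: pick the highest-valued action in each row
greedy_policy(Qm::Matrix{Float64}) = [argmax(Qm[s, :]) for s in 1:size(Qm, 1)]

# row 2 of the printed Q array, plus an all-equal row like those of the absorbing states
Qm = [-50104.1 -99937.9 -1.33333 -25.8425;
          61.1338   61.1338  61.1338  61.1338]

println(greedy_policy(Qm))   # [3, 1]
```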

### Visualize

In [18]:
move_arrows = Dict{Int,Any}();
move_arrows[1] = "←"
move_arrows[2] = "→"
move_arrows[3] = "↓"
move_arrows[4] = "↑"
move_arrows[5] = "∅";

In [35]:
for s ∈ 𝒮
    a = my_π[s];
    Δ = world.moves[a];
    current_position = world.coordinates[s]
    new_position =  current_position .+ Δ
    
    if (in(current_position, absorbing_state_set) == true)
        println("$(current_position) $(move_arrows[5])")
    else
        println("$(current_position) $(move_arrows[a]) $(new_position)")
    end
end

(1, 1) → (2, 1)
(1, 2) ↓ (1, 1)
(1, 3) → (2, 3)
(1, 4) → (2, 4)
(1, 5) ↓ (1, 4)
(2, 1) → (3, 1)
(2, 2) ∅
(2, 3) → (3, 3)
(2, 4) ↓ (2, 3)
(2, 5) ↓ (2, 4)
(3, 1) ← (2, 1)
(3, 2) ↑ (3, 3)
(3, 3) ∅
(3, 4) ↓ (3, 3)
(3, 5) → (4, 5)
(4, 1) ↑ (4, 2)
(4, 2) ↑ (4, 3)
(4, 3) ← (3, 3)
(4, 4) ∅
(4, 5) ← (3, 5)
(5, 1) ↑ (5, 2)
(5, 2) ← (4, 2)
(5, 3) ← (4, 3)
(5, 4) ↓ (5, 3)
(5, 5) ↓ (5, 4)
