## Example: The Tiger Problem as a Markov Decision Problem (MDP)

<center>
    <img src="figs/Fig-Linear-MDP-Schematic.png" style="align:right; width:80%">
</center>

An agent is trapped in a long hallway with a door at each end. Behind the red door is a tiger (and certain death), while behind the green door is freedom. If the agent opens the red door, it is eaten (and receives a large negative reward). However, if the agent opens the green door, it escapes and receives a positive reward.

For this problem, the MDP has the tuple components:
* The state space is $\mathcal{S} = \left\{1,2,\dots,N\right\}$ (with $N = 10$ in the code below) and the action set is $\mathcal{A} = \left\{a_{1},a_{2},a_{3}\right\}$: action $a_{1}$ moves the agent one state to the left, action $a_{2}$ moves the agent one state to the right, and action $a_{3}$ leaves the agent where it is.
* The agent receives a reward of +10 for entering state 1 (it escapes) and a penalty of -100 for entering state $N$ (it is eaten by the tiger). Moving between adjacent interior locations costs nothing, while standing still costs -1.
* Let $\alpha$ be the probability that a move action $a_{j}$ is executed as intended; with probability $1-\alpha$ the agent moves in the opposite direction.
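
Written out, the transition model for an interior state $1 < s < N$ might look as follows. This is a sketch that follows the convention used in the code cells of this example (action $a_{1}$ targets state $s-1$, action $a_{2}$ targets state $s+1$, and a failed move goes the opposite way); the notation $T(s^{\prime}\mid s,a)$ is an assumption introduced here for illustration:

$$
\begin{aligned}
T(s-1 \mid s, a_{1}) &= \alpha, & T(s+1 \mid s, a_{1}) &= 1-\alpha \\
T(s-1 \mid s, a_{2}) &= 1-\alpha, & T(s+1 \mid s, a_{2}) &= \alpha
\end{aligned}
$$

In the code, the end states are absorbing, so $T(1 \mid 1, a) = T(N \mid N, a) = 1$ for every action $a$.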

Let's compute $U^{\pi}(s)$ for different choices for the policy function $\pi$.
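
One common way to estimate $U^{\pi}(s)$ is iterative policy evaluation, which repeatedly applies a one-step lookahead. As a sketch, writing $R(s,a)$ for the reward, $T(s^{\prime}\mid s,a)$ for the transition probability and $\gamma$ for the discount factor (this notation is assumed here, and we have not confirmed that the package uses exactly this update), and starting from $U^{\pi}_{0}(s) = 0$:

$$
U^{\pi}_{k+1}(s) = R\left(s,\pi(s)\right) + \gamma\sum_{s^{\prime}\in\mathcal{S}} T\left(s^{\prime}\mid s,\pi(s)\right)\,U^{\pi}_{k}(s^{\prime})
$$

For $0 \leq \gamma < 1$ the iterates converge to $U^{\pi}$ as $k$ grows.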

## Setup
Let's load some packages that are required for the example by calling the `include(...)` function on our initialization file `Include.jl`:

In [1]:
include("Include.jl");

[32m[1m    Updating[22m[39m git-repo `https://github.com/varnerlab/VLDecisionsPackage.jl.git`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m    Updating[22m[39m `~/Desktop/julia_work/CHEME-5760-Examples-F23/Project.toml`
  [90m[10f378ab] [39m[93m~ VLDecisionsPackage v0.1.0 `https://github.com/varnerlab/VLDecisionsPackage.jl.git#main` ⇒ v0.1.0 `https://github.com/varnerlab/VLDecisionsPackage.jl.git#main`[39m
[32m[1m    Updating[22m[39m `~/Desktop/julia_work/CHEME-5760-Examples-F23/Manifest.toml`
  [90m[10f378ab] [39m[93m~ VLDecisionsPackage v0.1.0 `https://github.com/varnerlab/VLDecisionsPackage.jl.git#main` ⇒ v0.1.0 `https://github.com/varnerlab/VLDecisionsPackage.jl.git#main`[39m
[32m[1mPrecompiling[22m[39m project...
[32m  ✓ [39mVLDecisionsPackage
  1 dependency successfully precompiled in 4 seconds. 226 already precompiled.
[32m[1m  Activating[22m[39m project at `~/Desktop/julia_work/CHEME-5760-Examples-F23`
[32m[1m    Updating[22m[39m 

In [2]:
# setup some global constants -
α = 0.75; # probability of moving in the direction we expect

## States and actions

In [3]:
# setup the states and actions -
safety = 1;
tiger = 10;

# Setup the states -
states = range(safety,stop=tiger, step=1) |> collect;

# Setup the actions
actions = [1,2,3]; # a₁ = move left, a₂ = move right, a₃ = stand still

# Discount factor
γ = 0.95; # discount factor

## Rewards

In [4]:
# setup the rewards -
R = Array{Float64,2}(undef,length(states), length(actions));
fill!(R,0.0) # fill R w/zeros

# set the rewards for the ends -
R[safety + 1,1] = 10;
R[tiger-1, 2] = -100;
R[1:length(states), 3] .= -1;

In [5]:
R

10×3 Matrix{Float64}:
  0.0     0.0  -1.0
 10.0     0.0  -1.0
  0.0     0.0  -1.0
  0.0     0.0  -1.0
  0.0     0.0  -1.0
  0.0     0.0  -1.0
  0.0     0.0  -1.0
  0.0     0.0  -1.0
  0.0  -100.0  -1.0
  0.0     0.0  -1.0

## Transitions

In [6]:
# Setup the transitions
T = Array{Float64,3}(undef, length(states), length(states), length(actions));
fill!(T,0.0);

# We need to put values into the transition array (these are probabilities, so each row must sum to 1)
T[safety, safety, 1:length(actions)] .= 1.0; # if we are in state 1 (safety), we stay in state 1 ∀a ∈ 𝒜
T[tiger, tiger, 1:length(actions)] .= 1.0; # if we are in state 10 (tiger), we stay in state 10 ∀a ∈ 𝒜

### Left, Right and Stand-Still Actions

In [7]:
# left actions -
for s ∈ 2:(tiger - 1)
    T[s,s-1,1] = α;
    T[s,s+1,1] = (1-α);
end

# right actions -
for s ∈ 2:(tiger - 1)
    T[s,s-1,2] = (1-α);
    T[s,s+1,2] = α; 
end

# stand-still action a₃ (we don't move to a new state)
for s ∈ 2:(tiger-1)
    T[s,s,3] = 1.0;
end
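
Because each slice `T[:,:,a]` holds probabilities, a quick sanity check is that every row sums to 1. The block below is a minimal standalone sketch: it rebuilds the same transition array with plain Julia (the names `number_of_states`, `number_of_actions` and `row_sums_ok` are hypothetical, chosen only for this illustration) and then tests the row sums.

```julia
# Hypothetical standalone check: rebuild the transition array for this example
# and confirm that every row of every action slice sums to 1.
α = 0.75;                # probability that a move goes in the intended direction
number_of_states = 10;   # state 1 = safety, state 10 = tiger
number_of_actions = 3;   # 1 = move left, 2 = move right, 3 = stand still

T = zeros(number_of_states, number_of_states, number_of_actions);
T[1, 1, :] .= 1.0;                                  # safety is absorbing
T[number_of_states, number_of_states, :] .= 1.0;    # tiger is absorbing
for s ∈ 2:(number_of_states - 1)
    T[s, s-1, 1] = α;       # move left succeeds
    T[s, s+1, 1] = 1 - α;   # move left fails, agent goes right
    T[s, s-1, 2] = 1 - α;   # move right fails, agent goes left
    T[s, s+1, 2] = α;       # move right succeeds
    T[s, s, 3] = 1.0;       # stand still
end

# sum over the destination index (dimension 2) for each action slice
row_sums_ok = all(a -> all(sum(T[:, :, a], dims = 2) .≈ 1.0), 1:number_of_actions)
```

If a row were missing an entry (for example, an end state with no self-transition), `row_sums_ok` would be `false`.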

In [8]:
T[:,:,3] # transition probability matrix for action a₃ (stand still)

10×10 Matrix{Float64}:
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0

## Build the MDP problem object and estimate the utility $U^{\pi}(s)$ 

In [9]:
m = build(MyMDPProblemModel, 
    (𝒮 = states, 𝒜 = actions, T = T, R = R, γ = γ));

In [10]:
# build the always-right and always-left policies -
always_move_right(s) = 2;
always_move_left(s) = 1;

In [15]:
U = iterative_policy_evaluation(m, always_move_left, 50*length(states));

In [16]:
U

10-element Vector{Float64}:
  0.0
 12.751213234222105
 11.584055723040441
 10.521331762767128
  9.54817709516132
  8.638855638693649
  7.729597719541938
  6.629107692516689
  4.72323923091814
  0.0
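
To see what a policy evaluation of this kind computes, here is a minimal plain-Julia sketch. It is not the `VLDecisionsPackage` implementation: it assumes the usual one-step lookahead update started from zero, and the function name `evaluate_policy` is hypothetical.

```julia
# Sketch only: a plain-Julia policy evaluation under the assumptions stated above.
function evaluate_policy(T::Array{Float64,3}, R::Array{Float64,2}, γ::Float64,
    policy::Function, number_of_iterations::Int)

    number_of_states = size(R, 1);
    U = zeros(number_of_states);    # start from U₀(s) = 0
    for _ ∈ 1:number_of_iterations
        U_next = similar(U);
        for s ∈ 1:number_of_states
            a = policy(s);
            # immediate reward plus discounted expected utility of the next state
            U_next[s] = R[s, a] + γ * sum(T[s, :, a] .* U);
        end
        U = U_next;
    end
    return U
end

# rebuild the hallway problem from this example (N = 10, α = 0.75, γ = 0.95)
α, γ, N = 0.75, 0.95, 10;
T = zeros(N, N, 3);
R = zeros(N, 3);
T[1, 1, :] .= 1.0;    # safety is absorbing
T[N, N, :] .= 1.0;    # tiger is absorbing
for s ∈ 2:(N - 1)
    T[s, s-1, 1] = α;     T[s, s+1, 1] = 1 - α;   # move left
    T[s, s-1, 2] = 1 - α; T[s, s+1, 2] = α;       # move right
    T[s, s, 3] = 1.0;                             # stand still
end
R[2, 1] = 10.0;       # step left toward safety
R[N-1, 2] = -100.0;   # step right toward the tiger
R[:, 3] .= -1.0;      # standing still costs -1

U_sketch = evaluate_policy(T, R, γ, s -> 1, 50 * N);   # always move left
```

If the package applies the same update, `U_sketch` should agree with the vector printed above (for example, about 12.75 for state 2).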

### Estimate the Q-Array

In [17]:
Q_array = Q(m, U)[2:end-1,:]

8×3 Matrix{Float64}:
 12.7512     8.25364  11.1137
 11.5841    10.5249   10.0049
 10.5213     9.55429   8.99527
  9.54818    8.654     8.07077
  8.63886    7.77503   7.20691
  7.7296     6.77497   6.34312
  6.62911    5.20109   5.29765
  4.72324  -98.4256    3.48708
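
The entries of this array are consistent with the usual action-value lookahead. As a sketch (the notation $T(s^{\prime}\mid s,a)$ is assumed here, and we have not inspected how the package's `Q` function is implemented):

$$
Q(s,a) = R(s,a) + \gamma\sum_{s^{\prime}\in\mathcal{S}} T(s^{\prime}\mid s,a)\,U^{\pi}(s^{\prime})
$$

For state 2 (the first row), with $U^{\pi}(1) = 0$, $U^{\pi}(2) \approx 12.75121$ and $U^{\pi}(3) \approx 11.58406$:

$$
\begin{aligned}
Q(2,a_{1}) &= 10 + 0.95\left(0.75\cdot 0 + 0.25\cdot 11.58406\right) \approx 12.7512 \\
Q(2,a_{2}) &= 0 + 0.95\left(0.25\cdot 0 + 0.75\cdot 11.58406\right) \approx 8.25364 \\
Q(2,a_{3}) &= -1 + 0.95\cdot 12.75121 \approx 11.1137
\end{aligned}
$$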

In [14]:
best_policy = policy(Q_array)

8-element Vector{Int64}:
 1
 1
 1
 1
 1
 1
 1
 1
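
A policy can be read off a Q-array by choosing, for each state, the action with the largest Q value. The sketch below assumes that this row-wise `argmax` is what `policy(...)` does (we have not confirmed the package's implementation); `greedy_policy` and `Q_interior` are hypothetical names, and the numbers are copied from the Q-array printed above.

```julia
# Q values for the interior states 2..9 (rows = states, columns = actions a₁, a₂, a₃)
Q_interior = [
    12.7512    8.25364  11.1137
    11.5841   10.5249   10.0049
    10.5213    9.55429   8.99527
     9.54818   8.654     8.07077
     8.63886   7.77503   7.20691
     7.7296    6.77497   6.34312
     6.62911   5.20109   5.29765
     4.72324  -98.4256   3.48708
];

# hypothetical helper: for each row, return the column index of the largest entry
greedy_policy(Q::Matrix{Float64}) = [argmax(Q[i, :]) for i ∈ 1:size(Q, 1)];

greedy = greedy_policy(Q_interior)
```

Every row has its largest entry in the first column, so `greedy` selects action 1 (move left) in all eight interior states, matching `best_policy`.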