## CHEME 1800/4800 The Tiger Problem as a Markov Decision Problem

<center>
    <img src="figs/Fig-Linear-MDP-Schematic.png" style="align:right; width:80%">
</center>

### Introduction

An agent is trapped in a long hallway with a door at each end. Behind the red door is a tiger (and certain death), while behind the green door is freedom. If the agent opens the red door, it is eaten (and receives a large negative reward). However, if the agent opens the green door, it escapes and receives a positive reward.

For this problem, the MDP has the tuple components:
* $\mathcal{S} = \left\{1,2,\dots,N\right\}$ is the state space (hallway positions; $N = 10$ in the code below), and the action set is $\mathcal{A} = \left\{a_{1},a_{2},a_{3}\right\}$: action $a_{1}$ moves the agent one state to the left, action $a_{2}$ moves the agent one state to the right, and action $a_{3}$ leaves the agent where it is.
* The agent receives a reward of +10 for entering state 1 (escape) and a penalty of -100 for entering state $N$ (eaten by the tiger). Moving between other adjacent locations costs nothing, while standing still costs -1.
* Let the probability of correctly executing a move action $a_{j}$ be $\alpha$; with probability $1-\alpha$ the agent moves one state in the opposite direction.

Let's compute the utility $U^{\pi}(s)$ for different choices of the policy function $\pi$.
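
As a reminder of what we are computing, here is the standard lookahead (Bellman expectation) equation for a fixed policy, written with the indexing used in the code below: $R(s,a)$ is the reward for taking action $a$ in state $s$, and $T(s,s^{\prime},a)$ is the probability of moving from $s$ to $s^{\prime}$ under action $a$. We assume the course library evaluates policies this way; its source is not shown here.

$$
U^{\pi}(s) = R\left(s,\pi(s)\right) + \gamma\sum_{s^{\prime}\in\mathcal{S}}T\left(s,s^{\prime},\pi(s)\right)U^{\pi}(s^{\prime})
$$

Here $\gamma$ is the discount factor. Iterative policy evaluation starts from $U^{\pi}(s) = 0$ and applies the right-hand side repeatedly until the values stop changing.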

### Example setup

In [1]:
import Pkg; Pkg.activate("."); Pkg.resolve(); Pkg.instantiate();

[32m[1m  Activating[22m[39m project at `~/Desktop/course_repos/CHEME-1800-4800-Course-Repository-S23/examples/unit-4-examples/mdp-tiger-problem`
[32m[1m  No Changes[22m[39m to `~/Desktop/course_repos/CHEME-1800-4800-Course-Repository-S23/examples/unit-4-examples/mdp-tiger-problem/Project.toml`
[32m[1m  No Changes[22m[39m to `~/Desktop/course_repos/CHEME-1800-4800-Course-Repository-S23/examples/unit-4-examples/mdp-tiger-problem/Manifest.toml`


In [2]:
using Distributions
using Plots

In [3]:
include("CHEME-1800-Tiger-MDP-CodeLib.jl");

In [4]:
# set up some global constants -
α = 0.75; # probability of moving in the direction we expect

#### States and actions

In [5]:
# setup the states and actions -
safety = 1;
tiger = 10;

# Setup the states -
states = range(safety,stop=tiger, step=1) |> collect;

# Setup the actions
actions = [1,2,3]; # a₁ = move left, a₂ = move right, a₃ = stand still

# Discount factor
γ = 0.95; # discount factor

#### Rewards

In [6]:
# setup the rewards -
R = Array{Float64,2}(undef,length(states), length(actions));
fill!(R,0.0) # fill R w/zeros

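# set the rewards for the ends (R is indexed by the current state and the action taken) -
R[safety + 1,1] = 10; # move left from state 2, toward the green door
R[tiger-1, 2] = -100; # move right from state N-1, toward the tiger
R[1:length(states), 3] .= -1; # standing still costs -1 in every state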

In [7]:
R

10×3 Matrix{Float64}:
  0.0     0.0  -1.0
 10.0     0.0  -1.0
  0.0     0.0  -1.0
  0.0     0.0  -1.0
  0.0     0.0  -1.0
  0.0     0.0  -1.0
  0.0     0.0  -1.0
  0.0     0.0  -1.0
  0.0  -100.0  -1.0
  0.0     0.0  -1.0

#### Transitions

In [8]:
# Setup the transitions
T = Array{Float64,3}(undef, length(states), length(states), length(actions));
fill!(T,0.0);

# We need to put values into the transition array (these are probabilities, so each row must sum to 1)
T[safety, 1, 1:length(actions)] .= 1.0; # if we are in state 1 (safety), we stay in state 1 ∀a ∈ 𝒜
T[tiger, tiger, 1:length(actions)] .= 1.0; # if we are in state 10 (tiger), we stay in state 10 ∀a ∈ 𝒜

##### Left, Right and Stand-Still Actions

In [9]:
# left actions -
for s ∈ 2:(tiger - 1)
    T[s,s-1,1] = α;
    T[s,s+1,1] = (1-α);
end

# right actions -
for s ∈ 2:(tiger - 1)
    T[s,s-1,2] = (1-α);
    T[s,s+1,2] = α; 
end

# stand-still action a₃ (we don't move to a new state)
for s ∈ 2:(tiger-1)
    T[s,s,3] = 1.0;
end

In [10]:
T[:,:,3] # transition probability matrix for action a₃ (stand still)

10×10 Matrix{Float64}:
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0
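
Because each row of the transition array holds probabilities, a quick sanity check is to confirm that every row `T[s, :, a]` is non-negative and sums to one. The sketch below rebuilds `T` from the values used above so that it runs on its own; the helper name `is_row_stochastic` is hypothetical and is not part of the course library.

```julia
# Rebuild the transition array with the same values as the cells above
α = 0.75
safety, tiger = 1, 10
nstates, nactions = 10, 3

T = zeros(nstates, nstates, nactions)
T[safety, safety, :] .= 1.0   # terminal state: safety
T[tiger, tiger, :] .= 1.0     # terminal state: tiger
for s in 2:(tiger - 1)
    T[s, s - 1, 1] = α;      T[s, s + 1, 1] = 1 - α   # a₁: move left
    T[s, s - 1, 2] = 1 - α;  T[s, s + 1, 2] = α       # a₂: move right
    T[s, s, 3] = 1.0                                  # a₃: stand still
end

# Hypothetical helper: true when every row T[s, :, a] is a probability distribution
function is_row_stochastic(T; atol = 1e-12)
    for a in axes(T, 3), s in axes(T, 1)
        row = T[s, :, a]
        (all(>=(0), row) && isapprox(sum(row), 1.0; atol = atol)) || return false
    end
    return true
end

println(is_row_stochastic(T))
```

A check like this catches the most common mistake when filling `T` by hand: forgetting the boundary states, which leaves a row of zeros.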

### Build the MDP problem object and estimate the utility $U^{\pi}(s)$ 

In [11]:
m = build(MDP; 𝒮 = states, 𝒜 = actions, T = T, R = R, γ = γ);

In [12]:
# define two fixed policies: always move right (a₂) and always move left (a₁) -
always_move_right(s) = 2;
always_move_left(s) = 1;

In [19]:
U = iterative_policy_evaluation(m, always_move_right, 50*length(states));

In [20]:
U

10-element Vector{Float64}:
    0.0
  -47.23239230918141
  -66.29107692516689
  -77.2959771954194
  -86.3885563869365
  -95.48177095161319
 -105.21331762767126
 -115.8405572304044
 -127.51213234222104
    0.0
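
The source of `iterative_policy_evaluation` lives in the course library and is not shown here. As a rough illustration of what such a routine does, the sketch below applies the standard lookahead update $U(s) \leftarrow R(s,\pi(s)) + \gamma\sum_{s^{\prime}}T(s,s^{\prime},\pi(s))U(s^{\prime})$ a fixed number of times, starting from zero. The function name `policy_evaluation_sketch` is hypothetical, and the problem data are rebuilt from the cells above so the block runs on its own. Under these assumptions it should approach the same fixed point as the values printed above.

```julia
# Problem data, rebuilt from the cells above so this sketch is self-contained
α, γ = 0.75, 0.95
safety, tiger = 1, 10
nstates, nactions = 10, 3

R = zeros(nstates, nactions)
R[safety + 1, 1] = 10.0     # move left from state 2
R[tiger - 1, 2] = -100.0    # move right from state 9
R[:, 3] .= -1.0             # stand still

T = zeros(nstates, nstates, nactions)
T[safety, safety, :] .= 1.0
T[tiger, tiger, :] .= 1.0
for s in 2:(tiger - 1)
    T[s, s - 1, 1] = α;      T[s, s + 1, 1] = 1 - α
    T[s, s - 1, 2] = 1 - α;  T[s, s + 1, 2] = α
    T[s, s, 3] = 1.0
end

# Hypothetical stand-in for the library routine (an assumption, not the course code)
function policy_evaluation_sketch(T, R, γ, policy, k)
    U = zeros(size(T, 1))            # start from U(s) = 0 for every state
    for _ in 1:k
        Unew = similar(U)
        for s in axes(T, 1)
            a = policy(s)
            # one-step lookahead using the previous iterate U
            Unew[s] = R[s, a] + γ * sum(T[s, :, a] .* U)
        end
        U = Unew
    end
    return U
end

always_move_right(s) = 2
U_sketch = policy_evaluation_sketch(T, R, γ, always_move_right, 50 * nstates)
println(round.(U_sketch; digits = 3))
```

The utilities become more negative as we approach the tiger because the always-right policy keeps pushing the agent toward state 10, and the -100 penalty is discounted less the closer we start.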

### Estimate the Q-Array

In [21]:
# keep only the interior states: row i of Q_array corresponds to state i + 1
Q_array = Q(m, U)[2:end-1,:]

8×3 Matrix{Float64}:
   -5.74413   -47.2324   -45.8708
  -52.0109    -66.2911   -63.9765
  -67.7497    -77.296    -74.4312
  -77.7503    -86.3886   -83.0691
  -86.54      -95.4818   -91.7077
  -95.5429   -105.213   -100.953
 -105.249    -115.841   -111.049
  -82.5364   -127.512   -122.137
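
To see where these numbers come from, we can reproduce one entry by hand. We assume `Q` implements the standard action-value lookahead $Q(s,a) = R(s,a) + \gamma\sum_{s^{\prime}}T(s,s^{\prime},a)U(s^{\prime})$; the library source is not shown, but the printed values are consistent with this formula. The first row of the table is state 2, so for action $a_{1}$ (move left), with $U(1) = 0$ and $U(3) \approx -66.291$:

$$
Q(2,a_{1}) = R(2,a_{1}) + \gamma\left[T(2,1,a_{1})U(1) + T(2,3,a_{1})U(3)\right] = 10 + 0.95\left[0.75(0) + 0.25(-66.291)\right] \approx -5.744
$$

This matches the first entry of the table. The middle column (action $a_{2}$) reproduces $U(s)$ itself, as expected, because $U$ was computed for the always-right policy.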

In [22]:
best_policy = π(Q_array)

8-element Vector{Int64}:
 1
 1
 1
 1
 1
 1
 1
 1
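
The call `π(Q_array)` appears to pick, for each state, the action with the largest Q value (a greedy policy). Since the library code is not shown, the helper `greedy_policy_sketch` below is a hypothetical illustration of that idea, applied to the Q values copied from the output above.

```julia
# Q values copied from the output above; rows are states 2..9, columns are a₁, a₂, a₃
Q_values = [
    -5.74413   -47.2324   -45.8708
   -52.0109    -66.2911   -63.9765
   -67.7497    -77.296    -74.4312
   -77.7503    -86.3886   -83.0691
   -86.54      -95.4818   -91.7077
   -95.5429   -105.213   -100.953
  -105.249    -115.841   -111.049
   -82.5364   -127.512   -122.137
]

# Hypothetical helper: for each row (state), return the column (action) with the largest value
greedy_policy_sketch(Q) = [argmax(Q[i, :]) for i in axes(Q, 1)]

println(greedy_policy_sketch(Q_values))
```

Every interior state selects action 1 (move left). In other words, one step of greedy improvement on the always-right policy turns the agent around and sends it toward the green door.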