---

# CSCI 3202, Fall 2022
# Homework 2: MDP & Reinforcement Learning
# Due: Friday September 9, 2022 at 6:00 PM

<br> 

### Your name: Giuliano Costa

<br> 

In [29]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
from collections import defaultdict

# added packages
import heapq
from matplotlib import colors



---

Consider a **cube** state space defined by $0 \le x, y, z \le L$. Suppose you are piloting/programming a drone to learn how to land on a platform at the center of the $z=0$ surface (the bottom). Some assumptions:
* In this discrete world, if the drone is at $(x,y,z)$ it means that the box is centered at $(x,y,z)$. There are boxes (states) centered at $(x,y,z)$ for all $0 \le x,y,z \le L$. Each state is a 1 unit cube. 
* In this world, $L$ is always an even value.
* All of the states with $z=0$ are terminal states.
* The state at the center of the bottom of the cubic state space is the landing pad. For example, when $L=4$, the landing pad is at $(x,y,z) = (2,2,0)$.
* All terminal states ***except*** the landing pad have a reward of -1. The landing pad has a reward of +1.
* All non-terminal states have a living reward of -0.01.
* The drone takes up exactly 1 cubic unit, and begins in a random non-terminal state.
* The available actions in non-terminal states include moving exactly 1 unit Up (+z), Down (-z), North (+y), South (-y), East (+x) or West (-x). In a terminal state, the training episode should end.

#### Part A
How many states would be in the discrete state space if $L=2$? Explain your reasoning.

*Your answer here:*
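Each coordinate $x$, $y$, $z$ takes integer values from $0$ to $L$, so with $L = 2$ each coordinate has $L + 1 = 3$ possible values: $0, 1, 2$.

$$ (L+1)^3 = 3^3 = 27 $$

There are therefore 27 states: 9 terminal states on the $z = 0$ face and 18 non-terminal states.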
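The count above can be checked by enumerating the grid directly. This is a minimal sketch; the helper name `count_states` is an assumption introduced here for illustration.

```python
import itertools

def count_states(L):
    """Count grid cells with integer coordinates 0..L on each axis."""
    return len(list(itertools.product(range(L + 1), repeat=3)))

n_states_L2 = count_states(2)
```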




#### Part B
Write a class `MDPLanding` to represent the Markov decision process for this drone. Include methods for:
1. `actions(state)`, which should return a list of all actions available from the given state
2. `reward(state)`, which should return the reward for the given state
3. `result(state, action)`, which should return the resulting state of doing the given action in the given state

and attributes for:
1. `states`, a list of all the states in the state space, where each state is represented as an $(x,y,z)$ tuple
2. `terminal_states`, a dictionary where keys are the terminal state tuples and the values are the rewards associated with those terminal states
3. `default_reward`, a scalar for the reward associated with non-terminal states
4. `all_actions`, a list of all possible actions (Up, Down, North, South, East, West)
5. `discount`, the discount factor (use $\gamma = 0.999$ for this entire problem)

How you feed arguments/information into the class constructor is up to you.

Note that actions are *deterministic* here.  The drone does not need to include transition probabilities for outcomes of particular actions. What the drone does need to learn, however, is where the landing pad is, and how to get there from any initial state.

Before moving on to Part C, we recommend that you test that your MDPLanding code is set up correctly. Write unit tests that display the actions for a given state, rewards, results, etc. This will help you identify errors in your implementation and save you a lot of debugging time later.

In [171]:
# Solution:
import itertools as it

class MDPLanding():
    def __init__(self, L):
        self.states = list(it.product(range(L+1), repeat=3))
        self.terminal_states = {}
        
        for s in self.states:
            if s[2] == 0:  # z == 0: bottom face of the cube, terminal
                self.terminal_states[s] = -1
        # the landing pad sits at the center of the bottom face
        self.terminal_states[(L // 2, L // 2, 0)] = 1
        self.term_rewards = 1
        self.default_reward = -0.01  # living reward for non-terminal states
        self.all_actions = [(0,1,0), (0,-1,0), (-1,0,0), (1,0,0), (0,0,-1), (0,0,1)]
        self.discount = 0.999
        self.L = L
        
        
    # Return the list of legal moves from the given (x, y, z) state
    def actions(self, state):
        if state[2] == 0:  # terminal state on the bottom face: no moves
            return []
        legal = []
        for move in self.all_actions:
            n = self.result(state, move)
            # keep only moves that stay inside the cube
            if all(0 <= c <= self.L for c in n):
                legal.append(move)
        return legal
    
    def reward(self, state):
        # terminal states carry their own reward (+1 on the pad, -1 elsewhere)
        if state in self.terminal_states:
            return self.terminal_states[state]
        else:
            return self.default_reward
        
    def result(self, state, action):
        return (state[0] + action[0], state[1] + action[1], state[2] + action[2])
                                                           
        


In [177]:
test = MDPLanding(L=2)
len(test.states)
#test.actions((0,1,2))
#test.actions((0,-2,0))
#test.terminal_states
test.actions((0,0,2))


[(0, 1, 0), (1, 0, 0), (0, 0, -1)]
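The unit tests recommended in Part B could look roughly like the sketch below. `TinyLandingMDP` is a hypothetical stand-in with the same interface as `MDPLanding`, included only so the block runs on its own; `check_landing_mdp` is likewise an illustrative name. The checks themselves follow from the assumptions listed at the top of the problem.

```python
import itertools

class TinyLandingMDP:
    """Hypothetical stand-in exposing the same interface as MDPLanding."""
    def __init__(self, L):
        self.L = L
        self.states = list(itertools.product(range(L + 1), repeat=3))
        pad = (L // 2, L // 2, 0)
        self.terminal_states = {s: (1 if s == pad else -1)
                                for s in self.states if s[2] == 0}
        self.default_reward = -0.01
        self.all_actions = [(0, 0, 1), (0, 0, -1), (0, 1, 0),
                            (0, -1, 0), (1, 0, 0), (-1, 0, 0)]

    def actions(self, state):
        if state in self.terminal_states:
            return []
        return [a for a in self.all_actions
                if all(0 <= c + d <= self.L for c, d in zip(state, a))]

    def reward(self, state):
        return self.terminal_states.get(state, self.default_reward)

    def result(self, state, action):
        return tuple(c + d for c, d in zip(state, action))


def check_landing_mdp(mdp):
    """Raise AssertionError if a basic property of the landing MDP fails."""
    L = mdp.L
    pad = (L // 2, L // 2, 0)
    assert len(mdp.states) == (L + 1) ** 3
    assert len(mdp.terminal_states) == (L + 1) ** 2   # the whole z = 0 face
    assert mdp.reward(pad) == 1
    assert mdp.reward((0, 0, 0)) == -1
    assert mdp.reward((0, 0, L)) == mdp.default_reward
    assert mdp.actions(pad) == []                      # terminal: no moves
    assert len(mdp.actions((0, 0, L))) == 3            # top corner: 3 legal moves
    for s in mdp.states:
        for a in mdp.actions(s):
            # every legal move must land inside the cube
            assert all(0 <= c <= L for c in mdp.result(s, a))
    return True

ok = check_landing_mdp(TinyLandingMDP(L=2))
```

Passing an `MDPLanding` instance to `check_landing_mdp` in place of the stand-in would run the same checks against the class from Part B.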

#### Part C
Write a function to implement **policy iteration** for this drone landing MDP. Create an MDP environment to represent the $L=4$ case.

Use your function to find an optimal policy for your new MDP environment. Check (by printing to screen) that the policies for the following states are what you expect, and **comment on the results**:
1. $(2,2,1)$
1. $(0,2,1)$
1. $(2,0,1)$

The policy for each of these states is the action that the agent should take in that state. 

In [199]:
mdpenv = MDPLanding(L=4)

# Variables:
#   pi: a policy, a dict mapping each non-terminal state to an action
#   U:  a dict mapping each state to its utility estimate

def expected_utility(a, s, U, mdp):
    # Transitions are deterministic, so the expected utility of doing
    # action a in state s is the utility of the single resulting state.
    return U[mdp.result(s, a)]


def policy_evaluation(pi, U, mdp, k=100):
    # Approximate the utilities under policy pi with k sweeps of
    # U[s] = R(s) + gamma * U[result(s, pi[s])].
    gamma = mdp.discount
    for i in range(k):  # number of sweeps
        for s in mdp.states:
            if s in mdp.terminal_states:
                U[s] = mdp.terminal_states[s]
            else:
                U[s] = mdp.default_reward + gamma * expected_utility(pi[s], s, U, mdp)
    return U


def policy_iteration(mdp):
    U = {s: 0 for s in mdp.states}  # utility estimate for every state

    # Only non-terminal states need a policy entry
    states = [s for s in mdp.states if s not in mdp.terminal_states]

    print(len(mdp.states))
    print(len(states))

    # Start from a random legal action in every non-terminal state
    pi = {}
    for s in states:
        legal = mdp.actions(s)
        pi[s] = legal[np.random.choice(len(legal))]
    print(len(pi))

    while True:
        U = policy_evaluation(pi, U, mdp)
        unchanged = True
        for s in states:
            # Greedy action with respect to the current utilities
            a = max(mdp.actions(s), key=lambda act: expected_utility(act, s, U, mdp))
            # Switch only on a strict improvement so that ties cannot
            # make the policy oscillate forever
            if expected_utility(a, s, U, mdp) > expected_utility(pi[s], s, U, mdp):
                pi[s] = a
                unchanged = False
        if unchanged:
            return pi



In [200]:
#mdpenv.actions((0,1,0))
policy_iteration(mdpenv)
#np.random.choice(len(mdpenv.actions((0,1,0))))

125
100
100

#### Part D
Provide an example of a non-deterministic transition that could be included in your code in Part C. Describe the function. How would you modify your code to handle a non-deterministic transition function?

*Your answer here*
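One possible non-deterministic transition, offered only as an illustrative sketch: a gusty world in which the intended move succeeds with probability 0.8, and with probability 0.1 each the drone is pushed one cell East or West instead. A move that would leave the cube leaves the drone where it is. The names `T` and `clip_move`, the gust directions, and the value `p_intended=0.8` are all assumptions, not part of the assignment. With such a function, the policy-evaluation update would use the probability-weighted sum over outcomes in place of the utility of the single resulting state.

```python
def clip_move(state, move, L):
    """Apply move; stay in place if it would leave the cube."""
    nxt = tuple(c + d for c, d in zip(state, move))
    return nxt if all(0 <= c <= L for c in nxt) else state

def T(state, action, L, p_intended=0.8):
    """Return [(probability, next_state), ...] for the hypothetical gust model."""
    p_gust = (1.0 - p_intended) / 2
    outcomes = {}
    for p, move in [(p_intended, action), (p_gust, (1, 0, 0)), (p_gust, (-1, 0, 0))]:
        nxt = clip_move(state, move, L)
        outcomes[nxt] = outcomes.get(nxt, 0.0) + p   # merge identical outcomes
    return [(p, s1) for s1, p in outcomes.items()]

def expected_utility(action, state, U, L):
    """Probability-weighted utility of the states that action can lead to."""
    return sum(p * U[s1] for p, s1 in T(state, action, L))

# Outcomes of trying to descend onto the pad from directly above it (L = 4).
outcomes = T((2, 2, 1), (0, 0, -1), L=4)
```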

#### Part E
Describe the main differences between **policy iteration** and **value iteration**. How would your code in Part C change to convert it to **value iteration**?

*Your answer here:*

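Policy iteration maintains an explicit policy and alternates two steps: policy evaluation, which computes the utilities of the states under the current policy, and policy improvement, which makes the policy greedy with respect to those utilities. It stops when the policy no longer changes. Value iteration keeps no explicit policy while iterating. It repeatedly applies the Bellman update $U(s) \leftarrow R(s) + \gamma \max_a U(\text{result}(s,a))$ to every non-terminal state until the utilities stop changing, and only then reads off the greedy policy.

To convert the Part C code to value iteration, the `policy_evaluation` call and the policy comparison would be replaced by a single loop that applies this update to every non-terminal state and terminates when the largest change in any utility falls below a small tolerance. The policy would then be extracted once at the end by choosing, in each state, the action whose resulting state has the highest utility.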
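For comparison, a value iteration loop for the deterministic landing world might look like the sketch below. It builds the grid inline so that it runs on its own; the function name `value_iteration`, the tolerance `tol`, and the helper `legal_results` are assumptions made for illustration and are not taken from the Part C code.

```python
import itertools

def value_iteration(L=4, gamma=0.999, living_reward=-0.01, tol=1e-9):
    moves = [(0, 0, 1), (0, 0, -1), (0, 1, 0), (0, -1, 0), (1, 0, 0), (-1, 0, 0)]
    states = list(itertools.product(range(L + 1), repeat=3))
    pad = (L // 2, L // 2, 0)

    def legal_results(s):
        """Map each legal move from s to the state it leads to."""
        out = {}
        for m in moves:
            nxt = tuple(c + d for c, d in zip(s, m))
            if all(0 <= c <= L for c in nxt):
                out[m] = nxt
        return out

    # Terminal utilities are fixed at their rewards; the rest start at 0.
    U = {s: 0.0 for s in states}
    for s in states:
        if s[2] == 0:
            U[s] = 1.0 if s == pad else -1.0

    while True:
        delta = 0.0
        for s in states:
            if s[2] == 0:
                continue
            # Bellman update: living reward plus discounted best successor
            best = living_reward + gamma * max(U[n] for n in legal_results(s).values())
            delta = max(delta, abs(best - U[s]))
            U[s] = best
        if delta < tol:
            break

    # Extract the greedy policy once, after the utilities have converged.
    policy = {}
    for s in states:
        if s[2] > 0:
            results = legal_results(s)
            policy[s] = max(results, key=lambda m: U[results[m]])
    return U, policy

U, policy = value_iteration(L=4)
```

The three states from Part C can then be inspected with `policy[(2, 2, 1)]`, `policy[(0, 2, 1)]`, and `policy[(2, 0, 1)]`.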

 


#### Part F

Code up a **Q-learning** agent/algorithm to learn how to land the drone. You can do this however you like, as long as you use the MDP class structure defined above.  

Your code should include some kind of a wrapper to run many trials to train the agent and learn the Q values.  You also do not need to have a separate function for the actual "agent"; your code can just be a "for" loop within which you are refining your estimate of the Q values.

From each training trial, save the cumulative discounted reward (utility) over the course of that episode. That is, add up all of $\gamma^t R(s_t)$ where the drone is in state $s_t$ during time step $t$, for the entire sequence. We refer to this as "cumulative reward" because we usually refer to "utility" as the utility *under an optimal policy*.
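The cumulative discounted reward described above can be computed from the list of rewards collected in one episode. This is a minimal sketch; the helper name `discounted_return` is an assumption, and the example episode is made up for illustration.

```python
def discounted_return(rewards, gamma=0.999):
    """Sum of gamma**t * rewards[t] over one episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Two living-reward steps, then touching down on the landing pad.
example = discounted_return([-0.01, -0.01, 1.0])
```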

Some guidelines:
* The drone should initialize in a random non-terminal state for each new training episode.
* The training episodes should be limited to 50 time steps, even if the drone has not yet landed. If the drone lands (in a terminal state), the training episode is over.
* You may use whatever learning rate $\alpha$ you decide is appropriate, and gives good results.
* There are many forms of Q-learning. You can use whatever you would like, subject to the reliability targets below.
* Your code should return:
  * The learned Q values associated with each state-action pair.
  * The cumulative reward for each training trial. 
  * Anything else that might be useful in the ensuing analysis.
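The guidelines above could be realized in many ways. The block below is one hedged sketch using tabular, epsilon-greedy Q-learning with the reward received in the current state, matching the $\gamma^t R(s_t)$ convention. It defines its own small grid helpers so that it runs on its own, and the values `alpha=0.2`, `epsilon=0.1`, and the names `legal_moves`, `reward`, and `q_learning` are assumptions, not requirements of the assignment.

```python
import itertools
import random
from collections import defaultdict

MOVES = [(0, 0, 1), (0, 0, -1), (0, 1, 0), (0, -1, 0), (1, 0, 0), (-1, 0, 0)]

def legal_moves(s, L):
    return [m for m in MOVES if all(0 <= c + d <= L for c, d in zip(s, m))]

def reward(s, L, living_reward=-0.01):
    if s[2] == 0:  # terminal: +1 on the pad, -1 anywhere else on the floor
        return 1.0 if s == (L // 2, L // 2, 0) else -1.0
    return living_reward

def q_learning(L, n_episodes, alpha=0.2, epsilon=0.1, gamma=0.999,
               max_steps=50, seed=0):
    rng = random.Random(seed)
    starts = [s for s in itertools.product(range(L + 1), repeat=3) if s[2] > 0]
    Q = defaultdict(float)   # Q[(state, action)], 0 until first updated
    returns = []             # cumulative discounted reward per episode
    for _ in range(n_episodes):
        s = rng.choice(starts)            # random non-terminal start
        total = 0.0
        for t in range(max_steps):
            r = reward(s, L)
            total += gamma ** t * r
            if s[2] == 0:                 # landed or crashed: episode over
                break
            moves = legal_moves(s, L)
            if rng.random() < epsilon:    # explore
                a = rng.choice(moves)
            else:                         # exploit current estimates
                a = max(moves, key=lambda m: Q[(s, m)])
            s2 = tuple(c + d for c, d in zip(s, a))
            if s2[2] == 0:
                target = r + gamma * reward(s2, L)
            else:
                target = r + gamma * max(Q[(s2, m)] for m in legal_moves(s2, L))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
        returns.append(total)
    return Q, returns

# Small, fast demonstration; Part G asks for L = 10 and many more episodes.
Q, returns = q_learning(L=2, n_episodes=3000)
```

Seeding a private `random.Random` keeps runs repeatable without touching global random state, which makes it easier to compare different choices of `alpha` and `epsilon`.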

In [3]:
# Solution:



#### Part G

Initialize the $L=10$ environment (so that the landing pad is at $(5,5,0)$). Run some number of training trials to train the drone.

**How do I know if my drone has learned enough?**  If you take the mean cumulative reward across the last 5000 training trials, it should be around 0.80. This means at least about 10,000 (but probably more) training episodes will be necessary. It will take a few seconds on your computer, so start small to test your code.

**Then:** Compute block means of cumulative reward from all of your training trials. Use blocks of 500 training trials. This means you need to create some kind of array-like structure such that its first element is the mean of the first 500 trials' cumulative rewards; its second element is the mean of the 501-1000th trials' cumulative rewards; and so on. Make a plot of the block mean rewards as the training progresses. It should increase from about -0.5 initially to somewhere around +0.8.

**And:** Print to the screen the mean of the last 5000 trials' cumulative rewards, to verify that it is indeed about 0.80.
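The block means and the final mean described above can be computed with two small helpers. This is a sketch in plain Python; the names `block_means` and `tail_mean` are assumptions, and the input is expected to be the per-episode cumulative rewards returned by the training loop.

```python
def block_means(values, block=500):
    """Mean of each consecutive block; a trailing partial block is dropped."""
    n_blocks = len(values) // block
    return [sum(values[i * block:(i + 1) * block]) / block
            for i in range(n_blocks)]

def tail_mean(values, n=5000):
    """Mean of the last n values (or of all values if there are fewer)."""
    tail = values[-n:]
    return sum(tail) / len(tail)

# Tiny check with a block size of 5 instead of 500.
demo = block_means(list(range(10)), block=5)
```

The list returned by `block_means` can be passed directly to `plt.plot` to show how the reward evolves as training progresses.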

In [1]:
# Solution:


#### Part H

**Question 1:** Why does the cumulative reward start off around -0.5 at the beginning of the training?

**Question 2:** Why will it be difficult for us to train the drone to reliably obtain rewards much greater than about 0.8?

**Your answer here:**

