# RL as Classification

In this assignment, we explore how classification can be used inside reinforcement learning algorithms.  


### Approximate Policy Iteration

Approximate policy iteration (API) is a form of approximate dynamic programming that applies when the state space of the underlying MDP is too large for exact methods to be practical [2].  
Instead of computing the exact improved policy at each iteration, API computes an approximation to it.  
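
To make the distinction concrete, the two improvement steps can be written side by side. The notation $\hat{Q}$ (an estimate of the state-action value function) and the iteration index $k$ are introduced here for illustration only. Exact policy improvement takes the greedy action in every state:

$$\pi_{k+1}(s) = \arg\max_a Q_{\pi_k}(s,a) \quad \text{for all } s$$

API only asks for this to hold approximately, using an estimate in place of the exact values:

$$\pi_{k+1}(s) \approx \arg\max_a \hat{Q}_{\pi_k}(s,a)$$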

Recall the policy iteration algorithm, as shown in the schematic below:
![Schematic of exact policy iteration](images/exact_PI_flow.png)

Instead of computing the value function for every state, we can sample the MDP with rollouts to estimate the state-action value function $Q_\pi(s,a)$ of the current policy $\pi$ at a set of sampled states, and pick the action with the highest estimate in each of them.  
These (state, best action) pairs form a training set in which the state is the input and the maximizing action is the label. A classifier trained on this data (binary when there are only two actions, as in cartpole, and multiclass otherwise) then outputs the improved action for any state, including states that were never sampled.  
The algorithm is summarized in the schematic below:
![Schematic of approximate policy iteration](images/approximate_PI_flow.png)
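
The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation provided with this assignment: the environment is a hypothetical six-state chain rather than cartpole, a nearest-neighbor rule stands in for the classifier, and every name (`step`, `rollout_q`, `fit_nearest_neighbor`, `approximate_policy_iteration`) is invented for the example.

```python
import random

N_STATES = 6          # toy chain: states 0..5, state 5 is terminal (assumption)
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
GAMMA = 0.9           # discount factor (assumption)

def step(state, action, rng, slip=0.0):
    """Toy transition: move left/right; with probability `slip` the move is reversed."""
    if rng.random() < slip:
        action = 1 - action
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def rollout_q(state, action, policy, rng, n_rollouts=10, horizon=20, slip=0.0):
    """Monte Carlo estimate of Q_pi(state, action): take `action`, then follow `policy`."""
    total = 0.0
    for _ in range(n_rollouts):
        s, a, ret, discount = state, action, 0.0, 1.0
        for _ in range(horizon):
            s, r, done = step(s, a, rng, slip)
            ret += discount * r
            discount *= GAMMA
            if done:
                break
            a = policy(s)
        total += ret
    return total / n_rollouts

def fit_nearest_neighbor(examples):
    """Stand-in classifier: predict the label of the closest training state."""
    def classify(state):
        _, label = min(examples, key=lambda ex: abs(ex[0] - state))
        return label
    return classify

def approximate_policy_iteration(n_iters=5, seed=0, slip=0.0):
    rng = random.Random(seed)
    policy = lambda s: 0                      # initial policy: always move left
    for _ in range(n_iters):
        examples = []
        for s in range(N_STATES - 1):         # label every non-terminal state
            q = [rollout_q(s, a, policy, rng, slip=slip) for a in ACTIONS]
            best = max(ACTIONS, key=lambda a: q[a])
            examples.append((s, best))        # (state, maximizing action) pair
        policy = fit_nearest_neighbor(examples)   # the classifier is the new policy
    return policy
```

Calling `approximate_policy_iteration()` returns a function that maps a state to an action. The toy chain is small enough to label every state; in a large state space only a sample of states would be labeled, and the classifier would be relied on to generalize to the rest.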

This approach has three main advantages, as noted in [4]:
 - Policies are often simpler to represent and learn than value functions.
 - A rough estimate of the value function is often sufficient to separate the best action from the rest.
 - Even if the best-action estimates are noisy (due to the value function approximation), the generalization afforded by classification methods usually smooths out the noise.


We provide an implementation of approximate policy iteration on the cartpole task, using a simple two-layer fully connected network as the classifier.
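
For readers who want a picture of what such a classifier might look like, here is a dependency-free sketch. It is not the network shipped with this assignment: the class name `TwoLayerPolicy`, the hidden size, the tanh and sigmoid activations, and the learning rate are all assumptions, and it assumes the usual four-dimensional cartpole observation with two actions.

```python
import math
import random

class TwoLayerPolicy:
    """Hypothetical two-layer fully connected classifier: state -> P(action = 1)."""

    def __init__(self, n_inputs=4, n_hidden=8, seed=0):
        rng = random.Random(seed)
        # Hidden layer weights (n_hidden x n_inputs) and biases.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        # Output layer weights and bias (single logit for the binary case).
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        """Return (hidden activations, probability of action 1)."""
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        z = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2
        return h, 1.0 / (1.0 + math.exp(-z))

    def act(self, x):
        """Pick the more probable action for state x."""
        return 1 if self.forward(x)[1] >= 0.5 else 0

    def train_step(self, x, label, lr=0.1):
        """One SGD step on the binary cross-entropy loss for a single example."""
        h, p = self.forward(x)
        dz = p - label                                # dLoss/dz for sigmoid + cross-entropy
        for j, hj in enumerate(h):
            dh = dz * self.w2[j] * (1.0 - hj * hj)    # backprop through tanh
            self.w2[j] -= lr * dz * hj
            self.b1[j] -= lr * dh
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * dh * xi
        self.b2 -= lr * dz
```

In an API loop, `train_step` would be called on each (state, maximizing action) pair produced by the rollouts, and `act` would serve as the improved policy for the next iteration. A single sigmoid output is enough here only because there are two actions; more actions would need one output per action.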

[1] Reinforcement Learning: An Introduction, 2017 Draft, Sutton, Barto

[2] Reinforcement Learning as Classification: Leveraging Modern Classifiers, 2003, Lagoudakis, Parr

[3] Oregon State CS533 Class Notes (https://web.engr.oregonstate.edu/~afern/classes/cs533/notes/api.pdf), Fern

[4] Classification-based Approximate Policy Iteration: Experiments and Extended Discussions, 2014, Farahmand, Precup, Barreto, Ghavamzadeh