# RL Framework

* Problems are framed as Markov decision processes (MDPs): (S, A, P, R, gamma), where gamma is the discount factor
* State transition and reward models are expressed as a joint probability $P(S_{t+1}, R_{t+1} \mid S_t, A_t)$
* State value function V(S)
* Action value function Q(S,A)
* Goal: find the optimal policy pi* that maximizes the total expected reward
* Gamma is used to assign a lower weight to future rewards
* Model-based learning: dynamic programming (policy iteration, value iteration). Requires a known transition and reward model, and applies dynamic programming to iteratively compute the desired value functions and optimal policies.
* Model-free learning: Monte Carlo, temporal difference. Doesn't require an explicit model; instead the agent carries out exploratory actions and uses the experience gained to directly estimate value functions.
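
To make the role of gamma and the model-free idea concrete, here is a minimal sketch. The function names `discounted_return` and `td0_update`, the step size `alpha`, and the dictionary-based value table are assumptions made for illustration; they are not taken from these notes.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td0_update(V, s, r, s_next, alpha, gamma):
    """One tabular TD(0) update: move V[s] towards the bootstrapped target.

    V is a plain dict mapping state -> estimated value (illustrative choice).
    """
    td_target = r + gamma * V[s_next]
    V[s] += alpha * (td_target - V[s])
    return V
```

With `gamma = 0.5`, three rewards of 1 are weighted 1, 0.5 and 0.25, so later rewards contribute less to the return.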

## Continuous and Discrete Space

Deep neural networks can be used instead of a table or dictionary, and can therefore operate in continuous spaces.

How do we represent state spaces, especially if they are continuous, i.e. contain infinitely many possible values? How do we solve optimization problems in a continuous space, where the number of states to optimize over is infinite? Real physical spaces are not always neatly broken up into grids.

## Discretization

Break up the continuous space into discrete steps.

![discretizationOfGears.png](attachment:discretizationOfGears.png)
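
A minimal sketch of uniform discretization, assuming a box-shaped state space with known bounds. The helper names `create_uniform_grid` and `discretize`, and the bounds used in the comments, are hypothetical.

```python
import numpy as np


def create_uniform_grid(low, high, bins):
    """Return, per dimension, the interior split points of a uniform grid.

    low, high: lower and upper bounds per dimension (assumed known)
    bins:      number of bins per dimension
    """
    return [
        np.linspace(low[d], high[d], bins[d] + 1)[1:-1]
        for d in range(len(bins))
    ]


def discretize(sample, grid):
    """Map a continuous sample to a tuple of integer bin indices."""
    return tuple(int(np.digitize(s, g)) for s, g in zip(sample, grid))
```

The resulting index tuple can be used as a key into an ordinary Q-table, so tabular methods apply unchanged.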

## Tile Coding

Overlay multiple grids of tiles, each slightly offset from the others.

The tiles can be made adaptive, so that they don't all need to be the same size and we can split tiles when learning stalls. Adaptive tile coding partitions the state space appropriately based on its complexity.
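
The overlapping grids can be sketched as follows. This covers only the fixed (non-adaptive) case; the names `create_tilings` and `tile_encode` and the offsets are assumptions chosen for illustration.

```python
import numpy as np


def create_tilings(low, high, bins, offsets):
    """Build one grid of split points per offset; each grid is a shifted copy."""
    tilings = []
    for off in offsets:
        tilings.append([
            np.linspace(low[d], high[d], bins[d] + 1)[1:-1] + off[d]
            for d in range(len(bins))
        ])
    return tilings


def tile_encode(sample, tilings):
    """Return the active tile (a tuple of bin indices) in every tiling."""
    return [
        tuple(int(np.digitize(s, g)) for s, g in zip(sample, grid))
        for grid in tilings
    ]
```

A state then activates one tile per tiling, and nearby states share some but not all of their active tiles, which gives a smoother generalization than a single grid.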

## Coarse Coding

Use a sparser set of features to encode the state space.

Radial basis functions can be used to reduce the number of encodings required.
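
A minimal sketch of coarse coding with overlapping circular regions, assuming each feature is "on" when the state falls inside its circle. The name `coarse_code`, the centers and the radius are hypothetical.

```python
import numpy as np


def coarse_code(state, centers, radius):
    """Binary feature vector: 1 where the state lies within `radius` of a center."""
    centers = np.asarray(centers, dtype=float)
    dists = np.linalg.norm(centers - np.asarray(state, dtype=float), axis=1)
    return (dists <= radius).astype(int)
```

Larger circles generalize more broadly across states, while smaller circles give finer resolution.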

## Function Approximation

Given a problem domain with continuous states $s \in \mathcal{S} = \mathbb{R}^{n}$, we wish to find a way to represent the value function $v_{\pi}(s)$ (for prediction) or $q_{\pi}(s, a)$ (for control).

We can do this by choosing a parameterized function that approximates the true value function:

$$\hat{v}(s, \mathbf{w}) \approx v_{\pi}(s)$$

$$\hat{q}(s, a, \mathbf{w}) \approx q_{\pi}(s, a)$$

Our goal then reduces to finding a set of parameters $\mathbf{w}$ that yield an optimal value function. We can use the general reinforcement learning framework, with a Monte-Carlo or Temporal-Difference approach, and modify the update mechanism according to the chosen function.

### Feature Vectors

A common intermediate step is to compute a feature vector that is representative of the state: $\mathbf{x}(s)$.

### Linear Function Approximation

![linearFunctionApproximation.png](attachment:linearFunctionApproximation.png)

We can use gradient descent on this!

![gradientDecentLFA.png](attachment:gradientDecentLFA.png)
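
As a hedged sketch of what this update might look like in code, assume the standard linear form $\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^{\top}\mathbf{w}$ with a squared-error objective, which gives the step $\Delta\mathbf{w} = \alpha\,(v_{\pi}(s) - \hat{v}(s, \mathbf{w}))\,\mathbf{x}(s)$. The function names and the step size `alpha` are assumptions; in practice the true $v_{\pi}(s)$ is unknown and is replaced by a Monte-Carlo return or a TD target.

```python
import numpy as np


def linear_v_hat(x, w):
    """Linear value estimate: dot product of feature vector and weights."""
    return float(np.dot(x, w))


def linear_gradient_step(w, x, target, alpha):
    """One gradient-descent step on the squared error (target - v_hat)^2 / 2.

    For a linear approximator the gradient of v_hat w.r.t. w is just x.
    """
    error = target - linear_v_hat(x, w)
    return w + alpha * error * x
```

Repeating the step on the same sample shrinks the error each time, provided `alpha` is small enough.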

### Action Vector Approximation

At each step, change the weights by a small amount in the direction that reduces the error.

In order to solve a model-free control problem, that is, to take actions in an unknown environment, we need to approximate the action value function.

Define a feature transformation that utilizes both state and action, then use gradient descent.

We wish to compute all action values at once, producing an action vector.

We are trying to find N different action value functions, one for each action dimension. We can extend the weight vector into a matrix, with each column a separate linear function. The common features computed from the state and action keep these functions tied to each other.

Can be used when we have a continuous state space but a discrete action space, where we can easily select the action with the maximum value. Without this parallel processing, we would need to pass each action one by one and then find their maximum.

A linear approximator can only represent linear relationships between the features and the value, so if the underlying value function has a non-linear shape, we need to start looking at non-linear functions.

![actionVectorApproximation.png](attachment:actionVectorApproximation.png)
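
The weight-matrix idea can be sketched as below. This is a simplified illustration that assumes the shared features depend on the state only, with one weight column per discrete action; the names `q_values`, `greedy_action` and `update_action_weights` are hypothetical.

```python
import numpy as np


def q_values(x, W):
    """All action values at once: x has shape (n_features,), W (n_features, n_actions)."""
    return x @ W


def greedy_action(x, W):
    """Index of the action with the largest estimated value."""
    return int(np.argmax(q_values(x, W)))


def update_action_weights(W, x, action, target, alpha):
    """Gradient step on the column belonging to the action that was taken."""
    W = W.copy()
    error = target - x @ W[:, action]
    W[:, action] += alpha * error * x
    return W
```

A single matrix product yields the whole action vector, so selecting the maximum does not require evaluating each action separately.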

### Kernel Functions

Kernel functions can help capture non-linear relationships.

In a feature vector each feature itself can be non-linear.

![featureVectorAndKernels.png](attachment:featureVectorAndKernels.png)

Radial basis functions.

![radialBasisFunction.png](attachment:radialBasisFunction.png)

We can reduce state representation into a vector of responses from radial basis functions.
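
A minimal sketch of such a feature vector, assuming Gaussian radial basis functions of the form $\phi_i(s) = \exp\left(-\lVert s - c_i \rVert^2 / (2\sigma^2)\right)$. The name `rbf_features`, the centers $c_i$ and the width $\sigma$ are assumptions for illustration.

```python
import numpy as np


def rbf_features(state, centers, sigma):
    """Response of each Gaussian RBF to the state; values lie in (0, 1]."""
    centers = np.asarray(centers, dtype=float)
    sq_dists = np.sum((centers - np.asarray(state, dtype=float)) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```

Unlike binary coarse coding, each response falls off smoothly with distance from its center, so the resulting features vary continuously with the state.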

### Non-linear Function Approximation

Activation functions, e.g. those used in DNNs, immensely increase the representational capacity of our system.

![activationFunctions.png](attachment:activationFunctions.png)

Combined with gradient descent, we get:

![gradientDecentWithActivation.png](attachment:gradientDecentWithActivation.png)
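
As a hedged illustration, assume the approximator is $\hat{v}(s, \mathbf{w}) = f(\mathbf{x}(s)^{\top}\mathbf{w})$ with a sigmoid activation $f$. The chain rule then gives $\Delta\mathbf{w} = \alpha\,(v_{\pi}(s) - \hat{v}(s, \mathbf{w}))\,f'(\mathbf{x}(s)^{\top}\mathbf{w})\,\mathbf{x}(s)$. The choice of sigmoid and the function names are assumptions, not taken from the figure.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def nonlinear_v_hat(x, w):
    """Value estimate: sigmoid applied to the linear combination of features."""
    return float(sigmoid(np.dot(x, w)))


def nonlinear_gradient_step(w, x, target, alpha):
    """One gradient-descent step on the squared error through the activation."""
    v = nonlinear_v_hat(x, w)
    grad_v = v * (1.0 - v) * x  # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    return w + alpha * (target - v) * grad_v
```

Because a sigmoid output lies in (0, 1), this particular choice only suits targets scaled to that range.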

## Summary

In order to get around MDP issues with continuous spaces (the computational requirement of optimization going to infinity), we can either attempt to discretize the state space, or directly approximate the desired value functions.

### Discretization

![summaryDiscretization.png](attachment:summaryDiscretization.png)

### Direct Approximation of Continuous Value Function

![summaryDirectApproximation.png](attachment:summaryDirectApproximation.png)

First define a feature transformation, and then compute a linear combination of those features.