Discretization example:
Each location can be identified by the tiles it activates and represented as a bit vector (ones for the activated tiles, zeros elsewhere).
The state value function computation when using this scheme:
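The computation above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the number of tilings, tiles per tiling, and offsets are assumptions chosen for a 1-D state in [0, 1).

```python
import numpy as np

# Sketch of tile coding for a 1-D state in [0, 1): 3 tilings of
# 4 tiles each, every tiling shifted by a small (hypothetical) offset.
n_tilings, tiles_per_tiling = 3, 4
offsets = np.array([0.0, 0.05, 0.10])  # illustrative offsets per tiling

def encode(s):
    """Binary feature vector: exactly one active tile per tiling."""
    x = np.zeros(n_tilings * tiles_per_tiling)
    for t in range(n_tilings):
        idx = int((s + offsets[t]) * tiles_per_tiling) % tiles_per_tiling
        x[t * tiles_per_tiling + idx] = 1.0
    return x

def v_hat(s, w):
    """Approximate state value: sum of the weights of the active tiles."""
    return encode(s) @ w

w = np.zeros(n_tilings * tiles_per_tiling)
value = v_hat(0.37, w)  # dot product of the bit vector with w
```

Because the feature vector is binary and sparse, the value estimate is just the sum of the few weights whose tiles contain the state.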
This approach doesn't require manually designing the tiles ahead of time.
Example of a division (splitting) criterion: split when we are no longer learning from the data (our value function has stopped changing).
Workshop: Tile_Coding.ipynb
(gym: Acrobot-v1)
Each location on the plane is converted into a binary vector: when index i is 1, the encoded location lies inside circle i. This is a sparse representation of the plane.
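This circle-membership encoding (coarse coding) can be sketched as follows; the circle centers and radius are illustrative assumptions, not values from the workshop.

```python
import numpy as np

# Sketch of coarse coding: encode a 2-D point by which circles contain it.
# Centers and radius are hypothetical choices for illustration.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
radius = 0.8

def encode(p):
    """x[i] = 1 if point p falls inside circle i, else 0 (sparse)."""
    p = np.asarray(p, dtype=float)
    return (np.linalg.norm(centers - p, axis=1) <= radius).astype(float)
```

Overlapping circles let nearby points share active features, so learning about one point generalizes to its neighbors.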
A more continuous mapping of the area into a vector:
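One common way to get such a continuous mapping is Gaussian radial basis features, where hard 0/1 membership is replaced by a smooth falloff with distance to each center. The centers and width below are assumptions for illustration.

```python
import numpy as np

# Gaussian radial basis features: each feature decays smoothly with the
# squared distance to its center. Centers and sigma are illustrative.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sigma = 0.5

def rbf_features(p):
    """Real-valued feature vector in (0, 1], largest for nearby centers."""
    d2 = np.sum((centers - np.asarray(p, dtype=float)) ** 2, axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))
```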
We are interested in obtaining a good approximation of the actual value function (or q-function). This is done by introducing a weight vector w:
This is called linear function approximation.
We obtain
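The derivation referenced here is presumably the standard one; a sketch in LaTeX, assuming linear features $\mathbf{x}(s)$ and a squared-error objective:

```latex
\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)
\qquad
J(\mathbf{w}) = \mathbb{E}\big[(v_\pi(s) - \hat{v}(s, \mathbf{w}))^2\big]
```

Taking the gradient of $J$ with respect to $\mathbf{w}$ and stepping against it gives the update

```latex
\Delta \mathbf{w} = \alpha \,\big(v_\pi(s) - \hat{v}(s, \mathbf{w})\big)\,\mathbf{x}(s)
```

where $\alpha$ is the learning rate; in practice the unknown $v_\pi(s)$ is replaced by a sampled target such as a Monte Carlo return.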
This is the rule that we will follow for each sampled state until the error (between the approximate and true state value functions) is sufficiently small.
In order to do this with Q-learning, we need to approximate the action-value function q.
But why stop here? Let's estimate the values of all state-action pairs at once:
Each column of the W matrix emulates a separate linear function.
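This per-column view can be sketched as below; the feature size and number of actions are assumed values for illustration.

```python
import numpy as np

# Sketch: action values via a weight matrix W, assuming a feature vector
# x(s) with n entries and m discrete actions. Column a of W acts as a
# separate linear value function for action a.
n_features, n_actions = 4, 3
W = np.zeros((n_features, n_actions))

def q_hat(x, W):
    """Vector of all action values: q(s, a) = x @ W[:, a] for each a."""
    return x @ W

x = np.array([1.0, 0.0, 1.0, 0.0])         # e.g. a tile-coded state
greedy_action = int(np.argmax(q_hat(x, W)))  # greedy action selection
```

A single matrix product yields every action's value, which is convenient for the argmax in Q-learning's greedy step.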
We can still use a linear combination of these non-linear features and therefore use linear function approximation.
This allows the value function to represent non-linear relations between the input state and the output value.
This greatly increases the representational capacity of our approximation. This is also how neural networks work.
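A minimal sketch of this idea, using an illustrative polynomial feature map (the choice of terms is an assumption, not from the notes): the model stays linear in w, so the same update rules apply, yet it can represent non-linear functions of the raw state.

```python
import numpy as np

# Fixed non-linear feature map followed by a linear combination.
# The model is linear in w but non-linear in the raw state s = (s1, s2).
def phi(s):
    s1, s2 = s
    return np.array([1.0, s1, s2, s1 * s2, s1 ** 2, s2 ** 2])

def v_hat(s, w):
    """Linear in w: the usual gradient updates still apply unchanged."""
    return phi(s) @ w
```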
We can use gradient descent to optimize and estimate w:
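A minimal sketch of such a gradient-descent loop, under stated assumptions: the features, the "true" weights, and the noiseless targets are all synthetic stand-ins (a real agent would use sampled returns or TD targets instead).

```python
import numpy as np

# Sketch of stochastic gradient descent on w. Targets are generated
# from hypothetical "true" weights purely so convergence is visible.
rng = np.random.default_rng(0)

def phi(s):
    return np.array([1.0, s, s ** 2])

true_w = np.array([1.0, -2.0, 0.5])   # hypothetical target weights
w = np.zeros(3)
alpha = 0.1                           # learning rate

for _ in range(5000):
    s = rng.uniform(-1, 1)
    target = phi(s) @ true_w          # stands in for a sampled return
    # Semi-gradient update: step toward reducing the squared error.
    w += alpha * (target - phi(s) @ w) * phi(s)
```

After enough samples, w approaches the weights that generated the targets; with real returns it would instead approach the best linear fit to the value function.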
This sets us up for deep reinforcement learning.