User Tutorial 2. Q Learning on the rain car environment
- We strongly recommend doing Tutorial #1 before this one.
We will work on the same environment we used in the last tutorial, the rain-car problem, and we will configure a Q-Learning learner (finally, some Reinforcement Learning!) to learn the control of the car. Some of the things we will review in this tutorial are:
- Linear state-action feature maps
- Experience replay
- More report analysis tools
A Q-Learning agent learns a function Q(s,a) which represents the value of taking action a in state s: the rewards we expect to receive after taking action a if we choose the best action thereafter. To learn a good estimate of Q(s,a), the agent must try to take all possible actions in all possible states during learning (exploration). In evaluation, though, the agent will take in every state s the action with the highest value Q(s,a) (exploitation).
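The idea above can be sketched in a few lines of Python. This is only an illustrative tabular version, not the code Badger runs: the table layout, the function names and the discount value `gamma=0.9` are assumptions made for the example, and `alpha` is assumed to play the role of the Alpha parameter set later in this tutorial.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.01, gamma=0.9):
    """One tabular Q-Learning step; Q is a (num_states, num_actions) array."""
    # Target: immediate reward plus the discounted value of the best next action.
    td_target = r + gamma * np.max(Q[s_next])
    # Move the current estimate a fraction alpha towards the target.
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

def greedy_action(Q, s):
    """Exploitation: pick the action with the highest estimated value in state s."""
    return int(np.argmax(Q[s]))

# Tiny hypothetical problem: 3 states, 2 actions, all values start at zero.
Q = np.zeros((3, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2, alpha=0.5, gamma=0.9)
best = greedy_action(Q, 0)
```

After this single update, action 1 has the highest value in state 0, so it is the one a greedy (evaluation) agent would pick there.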
Run Badger and create a new experiment in the Editor tab. We will leave the default Log parameters and set the World parameters exactly as in Tutorial #1. Since the default Log parameters are Log-Evaluation-Episodes = true and Log-Training-Episodes = false, only evaluation episodes will be available for analysis after the experiment. Next, we will set the Experiment parameters.
There is one main difference with respect to the previous tutorial: we are using Reinforcement Learning, so that means we will have to make the agent train on the task for some time:
- Num-Episodes = 100
- Eval-Freq = 10
- Episode-Length = 60

These totally arbitrary settings mean that the agent will train for 100 episodes, each of them 60 s long. Every 10 episodes, the agent will be evaluated (no exploration).
We will ignore the following parameters: Target-Function-Update-Freq, Gamma, Freeze-Target-Function and Use-Importance-Weights.
We will start by setting the parameters of the State-Feature-Map. This defines the way in which the state variables (continuous values) are mapped to features in the Q-function being learned. First, we decide which state variables will be used as input of the function learned by the agent. Basically, we have to answer the question: on which variables should the policy depend? The state variables in the rain-car problem are: position, acceleration and position-deviation. The first variable represents the position of the car as an absolute value, whereas the last represents the relative position of the car with respect to the goal position under the tree. We can choose either of these two position variables, together with acceleration. Intuitively, the output of the control algorithm should depend on the current acceleration, which means it should also be an input to the function.
We leave the default value Num-Features-Per-Dimension = 20 and select Type = Discrete-Grid. This means that the continuous value of each variable will be discretized into 20 boxes uniformly distributed over the value range of the state variable. There are currently two other feature maps: Gaussian Radial Basis functions (the word Gaussian should be part of the name), and Tile-coding.
For the Action-Feature-Map, we must select the only action defined in the rain-car environment: the acceleration. We will use the same configuration as for the State-Feature-Map: a discrete grid with 20 features per-dimension.
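As a rough illustration of what a discrete-grid state-action feature map amounts to, the sketch below discretizes each variable into 20 uniform boxes and stores one weight per combination of boxes. The value ranges, the choice of position-deviation and acceleration as state inputs, and all function names are assumptions made for the example; they are not taken from the rain-car definition or from Badger's implementation.

```python
import numpy as np

NUM_FEATURES = 20  # Num-Features-Per-Dimension

def grid_index(value, low, high, n=NUM_FEATURES):
    """Index of the uniformly sized box containing value (clamped to the range)."""
    fraction = (value - low) / (high - low)
    return int(np.clip(fraction * n, 0, n - 1))

# Hypothetical value ranges, for illustration only.
POSITION_DEVIATION_RANGE = (-50.0, 50.0)
ACCELERATION_RANGE = (-5.0, 5.0)

# One weight per combination of two state boxes and one action box.
weights = np.zeros((NUM_FEATURES, NUM_FEATURES, NUM_FEATURES))

def q_value(position_deviation, acceleration, action):
    """Q(s,a) is the weight of the single active state-action feature."""
    i = grid_index(position_deviation, *POSITION_DEVIATION_RANGE)
    j = grid_index(acceleration, *ACCELERATION_RANGE)
    k = grid_index(action, *ACCELERATION_RANGE)
    return weights[i, j, k]
```

Under these assumptions the learned function has 20 x 20 x 20 = 8000 weights, and learning Q(s,a) means adjusting the weight of whichever box the current state and action fall into.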
Next, we will enable the Experience-Replay technique with the default parameters: Buffer-Size = 1000 and Update-Batch-Size = 10. This means that experience tuples <s,a,s_p,r> will be stored in a buffer with room for 1000 tuples (once full, they will be replaced randomly). Every control step, after executing the action selected by the agent a, the agent will learn from the last experience tuple and also from 10 randomly selected tuples from the buffer. This technique has become very popular because it makes learning more stable by breaking down temporal correlation.
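A minimal sketch of such a buffer is shown below, assuming the behaviour described above (random replacement once full, uniform random sampling). The class and method names are hypothetical and do not correspond to Badger's internals.

```python
import random

class ReplayBuffer:
    """Stores experience tuples (s, a, s_p, r) up to a fixed capacity."""

    def __init__(self, buffer_size=1000):
        self.buffer_size = buffer_size
        self.tuples = []

    def add(self, s, a, s_p, r):
        experience = (s, a, s_p, r)
        if len(self.tuples) < self.buffer_size:
            self.tuples.append(experience)
        else:
            # Once full, overwrite a randomly chosen slot.
            self.tuples[random.randrange(self.buffer_size)] = experience

    def sample(self, batch_size=10):
        # Return up to batch_size distinct tuples chosen uniformly at random.
        return random.sample(self.tuples, min(batch_size, len(self.tuples)))

# Usage: store more tuples than fit, then draw one update batch.
buffer = ReplayBuffer(buffer_size=1000)
for step in range(1500):
    buffer.add(s=step, a=0.0, s_p=step + 1, r=0.0)
batch = buffer.sample(batch_size=10)
```

In each control step, the agent would update Q(s,a) with the newest tuple and then once more for every tuple in `batch`.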
The result should look like this:
We will select a unique Simion of type Q-Learning.
Next, we must set the action selection policy. This is the algorithm by which the agent decides which action to take based on the current estimate of Q(s,a):
- Epsilon-Greedy: the most popular action selection policy. Every time-step, the agent selects a random action with probability ε and, with probability 1-ε, the action with the highest estimated Q(s,a) (greedy selection).
- Soft-Max policy: actions with higher values are selected with higher probability, and a temperature parameter controls how the probability distribution is shaped.
- Greedy-Q-Plus-Noise: any of the noise signals used by continuous-output learners is added to the action selected by greedy selection.
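The first two policies can be sketched as follows for a discrete set of actions. The function names and the Boltzmann form used for Soft-Max are assumptions for illustration; the tutorial does not state the exact formula Badger uses.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def soft_max(q_values, temperature, rng):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    # Subtracting the maximum keeps the exponentials numerically stable.
    scaled = (np.asarray(q_values, dtype=float) - np.max(q_values)) / temperature
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))

rng = np.random.default_rng(0)
q_values = [0.0, 1.0, 0.5]  # hypothetical Q(s,a) estimates for three actions
action = epsilon_greedy(q_values, epsilon=0.3, rng=rng)
```

With `epsilon=0` the first policy is purely greedy. In the second one, a high temperature makes all actions almost equally likely, while a low temperature concentrates the probability on the best action.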
We will select Epsilon-Greedy with a Constant value of 0.3 (an arbitrary number). We will disable E-Traces (a technique that conflicts with Experience-Replay and is commonly believed to yield worse results) and set a constant Alpha = 0.01.
Some of the most important parameters used in Reinforcement Learning are given totally arbitrary values. To increase the chance of success, it's best to run several parameter configurations at once, especially if we have computing resources available. Linear RL algorithms (those not using Neural Networks) run on a single-threaded application, so that means that, even if we are working with a single machine, we can run an experimental unit per CPU core with no performance penalty.
We will fork two parameters: Epsilon (values 0.1, 0.2 and 0.3), and Alpha (values 0.001, 0.005 and 0.01). The number beside the experiment's name should now show that the experiment will produce 9 experimental units (two parameters forked with three values each: 3*3 = 9).
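The count of experimental units is simply the size of the Cartesian product of the forked values, which can be checked with a short sketch (the dictionary layout is only illustrative, not Badger's file format):

```python
from itertools import product

epsilon_values = [0.1, 0.2, 0.3]
alpha_values = [0.001, 0.005, 0.01]

# One experimental unit per combination of forked values.
experimental_units = [
    {"Epsilon": epsilon, "Alpha": alpha}
    for epsilon, alpha in product(epsilon_values, alpha_values)
]

print(len(experimental_units))  # 9
```

Forking a third parameter with three values would multiply the count again, giving 27 experimental units.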
Just click Launch, save the experiment batch, select the Herd Agents you want to use, and click Play. Depending on the number of Herd Agents selected and the number of CPU cores each has, the experiment will produce a different number of jobs, each with a potentially different number of tasks. Once finished, press the Reports button.
In the Variable selection window, choose reward/r with Evaluation averages. This will produce one track for each experimental unit in which each point represents a value in a different evaluation episode of the experimental unit.
Looking at the stats below, we can see that the best performance was obtained with parameters Epsilon = 0.2 and Alpha = 0.001. Looking at the plot above, we can also see that this best performance doesn't correspond to the last evaluation episode. This happens commonly in RL: the quality of a policy may decay with time. We will visualize this experiment off-line by right-clicking on that experiment in the table with the statistics and selecting Visualize-Experiments.
From the new window that will pop up, we can explore the evaluation episodes with the keyboard's left and right arrows. We can also speed up or slow down the playback rate with the up and down arrows. In Episode 8, for example, the car stabilizes around the position where the agent receives the greatest reward. Below the visualization we can now see the evolution of the functions learned by the agent. Take into account that these functions are generated by interpolating between the functions logged in the experiment, the number of which is controlled by the parameter Log/Num-Functions-Logged.
We can also directly analyse the functions logged in the experiment by right-clicking on the experiment in the statistics table and selecting Visualize-Functions. From this window, we can traverse the episodes and functions (if more than one was logged in the experiment), and also save all the logged instances of the functions as images.
Finally, we will make a second query with only the experimental unit where Epsilon = 0.2 and Alpha = 0.001. First, we will enable Use fork selection on the left and, in the experiment/fork hierarchy above, deselect all but those two values. Now, select position-deviation with All evaluations, and make the query.
We see a set of tracks, each of them corresponding to the evolution over time in a different evaluation episode. The index of the evaluation episode has been added to the Id of each track to identify it properly. If we click on the plot's Settings button (







