Can we learn a control policy able to adapt its behaviour in real time so as to take any desired amount of risk? The general Reinforcement Learning framework solely aims at optimising the total reward in expectation, which may be undesirable in critical applications. In stark contrast, the Budgeted Markov Decision Process (BMDP) framework implements the notion of risk as a hard constraint on a failure signal. Existing algorithms for solving BMDPs rely on strong assumptions and have so far only been applied to toy examples. In this work, we relax some of these assumptions and demonstrate the scalability of our approach on two practical problems: a spoken dialogue system and an autonomous driving task. On both, we reach performance similar to that of Lagrangian Relaxation methods, with a significant improvement in sample and memory efficiency.
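Concretely, the BMDP objective can be sketched as a constrained optimisation problem. The notation below (reward R, cost signal C, discount γ, budget β) is the standard constrained-MDP formulation and is an assumption, not taken verbatim from the paper:

```latex
% Maximise expected return subject to a hard budget constraint on the
% expected discounted cost (notation assumed: R reward, C cost, beta budget).
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, C(s_t, a_t)\right] \le \beta
```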
We compare two approaches for constructing a batch of samples. The animations display the trajectories collected in each intermediate sub-batch. The first row corresponds to a classical risk-neutral epsilon-greedy exploration policy, while the second row showcases the risk-sensitive exploration strategy introduced in the paper. Each animation corresponds to a different seed.
We display the evolution of the budgeted policy's behaviour with respect to the budget. The policies have been learnt with risk-sensitive exploration. When the budget is low, the agent takes the safest path on the left. As the budget increases, it gradually switches to the other lane, earning higher rewards but also incurring costs. This gradual process could not be achieved with a deterministic policy, as it would choose either one path or the other. Each animation corresponds to a different seed.
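The gradual switch is possible because a budgeted policy can randomise between the two paths. A minimal sketch of this idea (function name and cost values are illustrative assumptions, not the paper's implementation): mixing a safe and a risky deterministic choice with the right probability meets any intermediate budget in expectation.

```python
def mixture_probability(budget, cost_safe, cost_risky):
    """Probability of picking the risky action so that the expected
    cost of the mixture equals the requested budget (illustrative)."""
    p = (budget - cost_safe) / (cost_risky - cost_safe)
    return min(max(p, 0.0), 1.0)  # clip to a valid probability

# Safe path costs 0, risky (higher-reward) path costs 1; budget 0.3:
# take the risky path 30% of the time, the safe one otherwise.
p_risky = mixture_probability(0.3, cost_safe=0.0, cost_risky=1.0)
```

Deterministic policies can only realise the two endpoint costs; the mixture sweeps every budget in between.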
Scalable Budgeted Fitted-Q
We show samples of driving styles that emerge from constraining the time spent on the opposite lane with different budget values. The budget enables real-time control of the trade-off between efficiency and safety.
In the following table, we display two dialogues conducted with the same BFTQ policy. The policy is given two budgets to respect in expectation, 0 and 0.5. For budget 0, one can see that the system never uses the `ask_num_pad` action. Instead, it uses `ask_oral`, an action subject to recognition errors. The system keeps asking for the same slot 2, because it has the lowest speech recognition score. It eventually summarizes the form to the user, but then reaches the maximum dialogue length and thus faces a dialogue failure. For budget 0.5, the system first asks in a safe way, with `ask_oral`. It may use `ask_num_pad` if one of the speech recognition scores is low. Then, the system proceeds to a confirmation of the slot values. If it is incorrect, the system continues the dialogue using the unsafe `ask_num_pad` action to be certain of the slot values.
How to reproduce
Install pycairo, numpy, scipy, highway-env, and pytorch.
Add this repository to your `PYTHONPATH`:
Navigate to the `budgeted-rl` folder:
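The two steps above can be sketched as follows; the checkout location is an assumption, so adjust `REPO` to wherever you cloned the code:

```shell
# Assumed checkout location; override REPO to match your clone.
REPO="${REPO:-$HOME/budgeted-rl}"
export PYTHONPATH="$PYTHONPATH:$REPO"        # make the package importable
cd "$REPO/budgeted-rl" 2>/dev/null || true   # enter the folder with the main scripts
```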
Run the main script with any config file, choosing the range of seeds you want to test on:

    python main/egreedy/main-egreedy.py config/slot-filling.json 0 6
    python main/egreedy/main-egreedy.py config/corridors.json 0 4
    python main/egreedy/main-egreedy.py config/highway-easy.json 0 10
Corridors environment:

| Symbol | Description | Value |
|---|---|---|
| - | Size of the environment | 7 x 6 |
| - | Standard deviation of the Gaussian noise applied to actions | (0.25, 0.25) |

Slot-filling environment:

| Symbol | Description | Value |
|---|---|---|
| ser | Sentence Error Rate | 0.6 |
| - | Gaussian mean for misunderstanding | -0.25 |
| - | Gaussian mean for understanding | 0.25 |
| - | Gaussian standard deviation | 0.6 |
| - | Probability of hang-up | 0.25 |
| - | Number of slots | 3 |

Highway environment:

| Symbol | Description | Value |
|---|---|---|
| - | Number of vehicles | 2 - 6 |
| - | Standard deviation of vehicles' initial positions | 100 m |
| - | Standard deviation of vehicles' initial velocities | 3 m/s |
| H | Episode duration | 15 s |