[BUG] Trained PPO from scratch won't work with Analyzer #40
Comments
Thank you!
I agree with you that the sequence is too long for the jointBERT model; I will truncate over-long sentences in jointBERT. And yes, this is the right approach for training and testing an RL policy from scratch. But as you can see, the system response is too long because your policy outputs too many dialogue acts, which indicates the policy is not well trained.
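The fix the maintainer describes amounts to clipping the token sequence to the encoder's maximum length before encoding. A minimal pure-Python sketch of that idea (the function name, `max_len`, and the special-token handling are illustrative, not ConvLab-2's actual code):

```python
def truncate_for_bert(tokens, max_len=512):
    """Clip an over-long token list so that, together with the [CLS] and
    [SEP] special tokens, it fits in a BERT-style encoder's window."""
    budget = max_len - 2  # reserve two slots for [CLS] and [SEP]
    if len(tokens) > budget:
        tokens = tokens[:budget]
    return ["[CLS]"] + tokens + ["[SEP]"]

# A 600-token utterance is clipped to fit the default 512-token window.
out = truncate_for_bert(["tok"] * 600)
```

Without such a guard, positions beyond the model's embedding table trigger exactly the kind of device-side assert reported below.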
Thank you! I didn't see the warning; it seems it doesn't print to the console on Colab. Another thing I've noticed is that if I train for more epochs (e.g. 1000), the policy loss eventually becomes NaN. Could it be that the hyperparameters in the config file aren't optimal? Or are they the same ones you used when you reached the 0.74 success rate? Cheers!
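One common stabilizer against a PPO loss drifting to NaN over many epochs is normalizing the advantages within each batch, which keeps the surrogate ratios from exploding. A minimal pure-Python sketch of that technique (this is a generic PPO trick, not ConvLab-2's config):

```python
import math

def normalize_advantages(adv, eps=1e-8):
    """Standardize a batch of advantage estimates to zero mean and unit
    variance; eps guards against division by zero on constant batches."""
    mean = sum(adv) / len(adv)
    var = sum((a - mean) ** 2 for a in adv) / len(adv)
    std = math.sqrt(var)
    return [(a - mean) / (std + eps) for a in adv]
```

Gradient clipping and a smaller learning rate are the other usual suspects when a policy-gradient loss diverges late in training.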
First MLE then PPO: #15 (comment)
Moved to #54.
Describe the bug
Since I recently started learning how to use this library, I'm aiming for a very simple concrete task, which is training a PPO policy from scratch.
Given my limited resources at the moment I'm using Google Colab Pro, which gives me a Tesla P100-PCIE-16GB for cheap.
The approach I'm following is (it mostly follows the tutorial on Colab, but I'll be extra detailed, since the only mistake here could be in my way of using the different modules):
I'm cloning the ConvLab-2 GitHub repo and installing it locally (I tried both "just in runtime" and installing into a Google Drive folder connected to the notebook).
After importing all necessary libraries, I create a simple dialogue system as in /ppo/train.py. The key here is that I'm training for only 100 epochs; even though I wouldn't expect this trained policy to be any good, I'd at least expect to be able to generate dialogues.
Once the policy (policy_sys) is trained, I create a session for testing dialogues. To do that, I create the user and system agents (pipelines), using the trained policy for the system agent. The error I'm getting is "RuntimeError: CUDA error: device-side assert triggered", apparently from the jointBERT library.
I suspect this is because the utterances generated by the system are longer than the MAX_LEN of the BERT model?
So the main question here (apart from the obvious "why is this happening?") is: is this the right approach for training and testing an RL policy from scratch?
I've seen that some form of imitation learning as a pre-training step helps PPO's performance. However, what I'm aiming for is not fine-tuning PPO, but simply training it for a few epochs (100, 200, etc.) with different hyperparameters, and being able to use the analyzer library to assess the model's behaviour (e.g. average success rate) as I increase the number of epochs or change hyperparameters.
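For reference, the objective being optimized in those epochs is PPO's clipped surrogate loss. A minimal per-sample sketch in pure Python (generic PPO from Schulman et al., 2017, not ConvLab-2's implementation; `eps` is the clip range):

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate loss:
    -min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    ratio is pi_new(a|s) / pi_old(a|s); advantage is the estimate A(s, a)."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    return -min(unclipped, clipped)
```

The clipping is what makes PPO relatively forgiving of hyperparameters, but with a randomly initialized policy and only ~100 epochs, low success rates are still expected; that is why the MLE warm start mentioned above is the usual recipe.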
To Reproduce
All the code is here and runs out of the box; it should reproduce the "CUDA error: device-side assert triggered" error I'm facing:
https://colab.research.google.com/drive/1nz73WBKLohohScsZIFjpDJz0y0CG4SRB?usp=sharing
Expected behavior
I expected that building a system agent again and using the previously trained policy would work out of the box within the analyzer tool.