[BUG] Trained PPO from scratch won't work with Analyzer #40
Comments
Thank you!
I agree with you that the sequence is too long for the jointBERT model; I will truncate over-long sentences in jointBERT. And yes, this is the right approach for training and testing an RL policy from scratch. But as you can see, the system response is too long because your policy outputs too many dialogue acts, which indicates the policy is not well trained.
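The fix the maintainer describes amounts to clipping the token sequence to the encoder's maximum length before encoding. A minimal pure-Python sketch of that idea (the function name, `max_len`, and the special-token handling are illustrative, not ConvLab-2's actual code):

```python
def truncate_for_bert(tokens, max_len=512):
    """Clip an over-long token list so that, together with the [CLS] and
    [SEP] special tokens, it fits in a BERT-style encoder's window."""
    budget = max_len - 2  # reserve two slots for [CLS] and [SEP]
    if len(tokens) > budget:
        tokens = tokens[:budget]
    return ["[CLS]"] + tokens + ["[SEP]"]

# A 600-token utterance is clipped to fit the default 512-token window.
out = truncate_for_bert(["tok"] * 600)
```

Without such a guard, positions beyond the model's embedding table trigger exactly the kind of device-side assert reported below.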
Thank you! I didn't see the warning; it seems it doesn't print to the console on Colab. Another thing I've noticed is that if I train for more epochs (e.g. 1000), the policy loss eventually becomes NaN. Could it be that the hyperparameters in the config file aren't optimal? Or are they the same ones you used when you reached the 0.74 success rate? Cheers!
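One common stabilizer against a PPO loss drifting to NaN over many epochs is normalizing the advantages within each batch, which keeps the surrogate ratios from exploding. A minimal pure-Python sketch of that technique (this is a generic PPO trick, not ConvLab-2's config):

```python
import math

def normalize_advantages(adv, eps=1e-8):
    """Standardize a batch of advantage estimates to zero mean and unit
    variance; eps guards against division by zero on constant batches."""
    mean = sum(adv) / len(adv)
    var = sum((a - mean) ** 2 for a in adv) / len(adv)
    std = math.sqrt(var)
    return [(a - mean) / (std + eps) for a in adv]
```

Gradient clipping and a smaller learning rate are the other usual suspects when a policy-gradient loss diverges late in training.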
First MLE then PPO: #15 (comment)
Moved to #54.
Describe the bug
Since I recently started learning how to use this library, I'm aiming for a very simple concrete task, which is training a PPO policy from scratch.
Given my limited resources at the moment I'm using Google Colab Pro, which gives me a Tesla P100-PCIE-16GB for cheap.
The approach I'm following is (it mostly follows the tutorial on Colab, but I'll be extra detailed, since the only mistake here could be in my way of using the different modules):
I'm cloning the ConvLab-2 GitHub repo and installing it locally (I tried both "just in runtime" and installing into a Google Drive folder connected to the notebook).
After importing all necessary libraries, I create a simple dialogue system as in /ppo/train.py. The key here is that I'm training for only 100 epochs; even though I wouldn't expect this trained policy to be any good, I'd at least expect to be able to generate dialogues.
Once the policy (policy_sys) is trained, I create a session for testing dialogues. To do that, I create the user and system agents (pipelines), using the trained policy for the system agent. The error I'm getting is "RuntimeError: CUDA error: device-side assert triggered", apparently from the jointBERT library.
I suspect this is because the utterances generated by the system are longer than the MAX_LEN of the BERT model?
So the main question here (apart from the obvious "why is this happening?") is: is this the right approach for training and testing an RL policy from scratch?
I've seen that some form of imitation learning as a pre-training step helps PPO's performance. However, what I'm aiming for is not fine-tuning PPO, but simply training it for a few epochs (100, 200, etc.) with different hyperparameters, and being able to use the analyzer library to assess the model's behaviour (e.g. average success rate) as I increase the number of epochs or change hyperparameters.
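For reference, the objective being optimized in those epochs is PPO's clipped surrogate loss. A minimal per-sample sketch in pure Python (generic PPO from Schulman et al., 2017, not ConvLab-2's implementation; `eps` is the clip range):

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate loss:
    -min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    ratio is pi_new(a|s) / pi_old(a|s); advantage is the estimate A(s, a)."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    return -min(unclipped, clipped)
```

The clipping is what makes PPO relatively forgiving of hyperparameters, but with a randomly initialized policy and only ~100 epochs, low success rates are still expected; that is why the MLE warm start mentioned above is the usual recipe.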
To Reproduce
All the code is here and runs out of the box; it should reproduce the "CUDA error: device-side assert triggered" error I'm facing:
https://colab.research.google.com/drive/1nz73WBKLohohScsZIFjpDJz0y0CG4SRB?usp=sharing
Expected behavior
I expected that building a system agent again and using the previously trained policy would work out of the box within the analyzer tool.