[BUG] Trained PPO from scratch won't work with Analyzer #40

Closed
thenickben opened this issue Jul 9, 2020 · 4 comments
Labels: bug (Something isn't working)

@thenickben

Describe the bug

I recently started learning how to use this library, so I'm aiming for a very simple, concrete task: training a PPO policy from scratch.

Given my limited resources at the moment, I'm using Google Colab Pro, which gives me a Tesla P100-PCIE-16GB for cheap.

The approach I'm following is below (it mostly follows the Colab tutorial, but I'll be extra detailed because the only mistake here could be in how I'm using the different modules):

  1. I clone the ConvLab-2 GitHub repo and install it locally (I tried both installing just in the runtime and installing into a Google Drive folder connected to the notebook).

  2. After importing all the necessary libraries, I create a simple dialogue system as in /ppo/train.py:

# simple rule DST
dst_sys = RuleDST()

policy_sys = PPO(True)
policy_sys.load(args.load_path)

# not use dst
dst_usr = None
# rule policy
policy_usr = RulePolicy(character='usr')
# assemble
simulator = PipelineAgent(None, None, policy_usr, None, 'user')

evaluator = MultiWozEvaluator()
env = Environment(None, simulator, None, dst_sys, evaluator)

for i in range(args.epoch):
    update(env, policy_sys, args.batchsz, i, args.process_num)

The key point here is that I'm training for only 100 epochs; even though I wouldn't expect this policy to be any good, I'd at least expect to be able to generate dialogues.
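Since the snippet references args and update from train.py (update being the function defined there), a minimal stand-in for args, with placeholder values rather than the repo's defaults, would be something like:

# Minimal stand-in for the argparse arguments used by /ppo/train.py
# (placeholder values, not the repo's defaults):
from argparse import Namespace

args = Namespace(
    load_path='',    # empty = no pretrained checkpoint, train from scratch
    epoch=100,       # number of training epochs
    batchsz=1024,    # placeholder: transitions sampled per epoch
    process_num=8,   # placeholder: parallel sampling processes
)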

  3. Once the policy (policy_sys) is trained, I create a session for testing dialogues. To do that, I build the user and system agents (pipelines), using the trained policy for the system agent:
# --- system ---

# BERT nlu
sys_nlu = BERTNLU()
# simple rule DST
sys_dst = RuleDST()
# TRAINED PPO POLICY ###
sys_policy = policy_sys
# template NLG
sys_nlg = TemplateNLG(is_user=False)
# assemble
sys_agent = PipelineAgent(sys_nlu, sys_dst, sys_policy, sys_nlg, name='sys')

# --- user ---

# MILU
user_nlu = MILU()
# not use dst
user_dst = None
# rule policy
user_policy = RulePolicy(character='usr')
# template NLG
user_nlg = TemplateNLG(is_user=True)
# assemble
user_agent = PipelineAgent(user_nlu, user_dst, user_policy, user_nlg, name='user')

# --- evaluator and session ---

evaluator = MultiWozEvaluator()
sess = BiSession(sys_agent=sys_agent, user_agent=user_agent, kb_query=None, evaluator=evaluator)
  4. As in the tutorial, I use this simple loop to sample dialogues from the session:
sys_response = ''
sess.init_session()
print('init goal:')
pprint(sess.evaluator.goal)
print('-'*50)
for i in range(20):
    sys_response, user_response, session_over, reward = sess.next_turn(sys_response)
    print('user:', user_response)
    print('sys:', sys_response)
    print()
    if session_over is True:
        break
print('task success:', sess.evaluator.task_success())
print('book rate:', sess.evaluator.book_rate())
print('inform precision/recall/f1:', sess.evaluator.inform_F1())
print('-'*50)
print('final goal:')
pprint(sess.evaluator.goal)
print('='*100)

The error I'm getting is "RuntimeError: CUDA error: device-side assert triggered", apparently coming from the jointBERT module.

I suspect this is because the utterances generated by the system are longer than the BERT model's maximum sequence length?
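For debugging, one quick check (not part of ConvLab-2) is to count the BERT wordpiece tokens of a generated system utterance, e.g. the sys_response from the loop above, against the usual 512-token limit:

# Debugging sketch (not part of ConvLab-2): count BERT wordpiece tokens in a
# generated utterance to see whether it exceeds the 512-token limit.
from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
n_tokens = len(bert_tokenizer.encode(sys_response))  # includes [CLS] and [SEP]
print(n_tokens, 'tokens', '(over the 512 limit)' if n_tokens > 512 else '(within the limit)')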

So the main question here (apart from the obvious "why is this happening?") would be: is this the right approach for training and testing an RL policy from scratch?

I've seen that, to improve PPO's performance, some sort of imitation learning helps as a pre-training step. However, what I'm aiming for is not fine-tuning PPO, but simply training it from scratch for a few epochs (100, 200, etc.) with different hyperparameters, and then using the analyzer library to assess the model's behaviour (e.g. average success rate) as the number of epochs or the hyperparameters change.
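For reference, the analyzer step I mean is roughly as in the ConvLab-2 tutorial (model_name is just a label of my choosing):

# Analyzer usage, roughly as in the ConvLab-2 tutorial, reusing the
# sys_agent and user_agent built in step 3 above:
from convlab2.util.analysis_tool.analyzer import Analyzer

analyzer = Analyzer(user_agent=user_agent, dataset='multiwoz')
analyzer.comprehensive_analyze(sys_agent=sys_agent, model_name='ppo_from_scratch',
                               total_dialog=100)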

To Reproduce

All the code is here and runs out of the box; it should reproduce the "CUDA error: device-side assert triggered" error I'm facing:

https://colab.research.google.com/drive/1nz73WBKLohohScsZIFjpDJz0y0CG4SRB?usp=sharing

Expected behavior

I expected that building a system agent again, using the previously trained policy, would work out of the box with the analyzer tool.

thenickben added the bug label on Jul 9, 2020
@zqwerty (Member) commented on Jul 10, 2020

Thank you!
As the warning says:

WARNING:transformers.tokenization_utils_base:Token indices sequence length is longer than the specified maximum sequence length for this model (975 > 512). Running this sequence through the model will result in indexing errors

I agree with you that the sequence is too long for the jointBERT model. I will truncate overly long sentences in jointBERT.

And yes, this is the right approach for training and testing an RL policy from scratch. But as you can see, the system response is too long because your policy outputs too many dialogue acts, which indicates the policy is not well trained.
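Roughly, the kind of truncation meant here, shown only as an illustration and not as the actual patch:

# Illustration only (not the actual ConvLab-2 patch): cap the wordpiece
# sequence at BERT's maximum length before adding the special tokens.
from transformers import BertTokenizer

def truncate_for_bert(utterance, tokenizer, max_seq_len=512):
    tokens = tokenizer.tokenize(utterance)[:max_seq_len - 2]  # leave room for [CLS]/[SEP]
    return tokenizer.convert_tokens_to_ids(['[CLS]'] + tokens + ['[SEP]'])

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')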

@thenickben (Author)

Thank you! I didn't see the warning; it seems it doesn't get printed to the console on Colab.

Another thing I've noticed is that if I train for more epochs (e.g. 1000), the policy loss eventually becomes NaN.

Could it be that the hyperparameters in the config file aren't optimal? Or are they the same ones you used when you reached the 0.74 success rate? Cheers!
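For what it's worth, a generic guard one could wrap around the update step while debugging this; the helper below is hypothetical, not ConvLab-2 code:

# Generic stabilizer sketch (hypothetical helper, not ConvLab-2 code): guard a
# policy update against NaN losses and clip gradients before stepping.
import torch

def safe_policy_update(loss, policy_net, optimizer, max_norm=10.0):
    if torch.isnan(loss).any():
        raise RuntimeError('PPO loss became NaN; inspect advantages/returns/log-probs')
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy_net.parameters(), max_norm)
    optimizer.step()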

@zqwerty (Member) commented on Jul 10, 2020

First MLE then PPO: #15 (comment)
Note: I've improved the user agenda policy, so this number may not be accurate; see the README: https://github.com/thu-coai/ConvLab-2#policy
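Roughly, the recipe is: pretrain the policy with MLE on the corpus, then initialize PPO from that checkpoint before running the RL loop, e.g.:

# Sketch of the "first MLE, then PPO" recipe (import path assumed from the repo
# layout; the checkpoint path is a placeholder for the MLE-pretrained model):
from convlab2.policy.ppo import PPO

policy_sys = PPO(True)
policy_sys.load('save/best_mle')  # placeholder: path to the MLE checkpoint
# ...then run the RL update loop from step 2 to fine-tune with PPO.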

@zqwerty (Member) commented on Jul 16, 2020

move to #54

zqwerty closed this as completed on Jul 16, 2020