
Training PPO-algorithm #8

Closed
ChrisGeishauser opened this issue May 6, 2020 · 11 comments

@ChrisGeishauser

I executed the provided train.py script in convlab2/policy/ppo with the pre-specified configuration. During training, the success rate starts fairly high at around 25% and then hovers around 30-35% for a while. When training is finished, I used the evaluate.py script in convlab2/policy to evaluate the performance, which gives me 26%, far from the 74% reported in the table.

My question: what is the exact configuration that was used to train the 74% model?

@zqwerty (Member) commented May 8, 2020

The model we trained used an old version of the user simulator. We will re-train the model with the current simulator soon :)

@liangrz15 (Contributor)

For better performance, you should do imitation learning before reinforcement learning. The imitation learning is implemented in the mle directory. You can download the model trained by imitation learning at https://convlab.blob.core.windows.net/convlab-2/mle_policy_multiwoz.zip. Unzip it and you will get best_mle.pol.mdl.

Then you can run python train.py --load_path FOLDER_OF_MODEL/best_mle in the ppo directory. Note that the .pol.mdl suffix should not appear in the --load_path argument.
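
In code, that warm start amounts to roughly the sketch below. The import path, and the assumption that the policy's load() method appends the .pol.mdl suffix itself (which would explain why the suffix is dropped from --load_path), are mine and not taken from train.py:

# hypothetical warm-start sketch, not a verbatim excerpt from convlab2/policy/ppo/train.py;
# assumes PPO exposes a load(path) method that appends the .pol.mdl suffix itself
from convlab2.policy.ppo import PPO

policy_sys = PPO(True)                       # True: build the policy in training mode
policy_sys.load('FOLDER_OF_MODEL/best_mle')  # no .pol.mdl suffix, mirroring --load_path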

@ChrisGeishauser (Author)

> For better performance, you should do imitation learning before reinforcement learning. The imitation learning is implemented in the mle directory. You can download the model trained by imitation learning at https://convlab.blob.core.windows.net/convlab-2/mle_policy_multiwoz.zip. Unzip it and you will get best_mle.pol.mdl.
>
> Then you can run python train.py --load_path FOLDER_OF_MODEL/best_mle in the ppo directory. Note that the .pol.mdl suffix should not appear in the --load_path argument.

Thanks for the quick instructions! I followed the procedure, but the performance actually degraded to around 35%.
I then set evaluator=None for the environment, which finally led to the performance mentioned in the paper. I guess this is what @zqwerty meant by the old version of the user simulator?

@zqwerty looking forward to the results you have!
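
The two setups Chris describes differ only in the last argument passed to the training environment. A minimal sketch, assuming ConvLab-2's Environment(sys_nlg, usr, sys_nlu, sys_dst, evaluator) constructor order; the import paths follow the current package layout and may differ in older versions:

# minimal sketch of the two training environments discussed above; the constructor
# order and the import paths are assumptions based on the ConvLab-2 layout
from convlab2.dialog_agent import PipelineAgent
from convlab2.dialog_agent.env import Environment
from convlab2.dst.rule.multiwoz import RuleDST
from convlab2.policy.rule.multiwoz import RulePolicy
from convlab2.evaluator.multiwoz_eval import MultiWozEvaluator

dst_sys = RuleDST()                                     # rule-based system-side DST
policy_usr = RulePolicy(character='usr')                # agenda-based user simulator
simulator = PipelineAgent(None, None, policy_usr, None, 'user')

# evaluator=None: the setup Chris describes above
env_no_eval = Environment(None, simulator, None, dst_sys, None)

# with the MultiWOZ evaluator attached to the environment
env_with_eval = Environment(None, simulator, None, dst_sys, MultiWozEvaluator())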

@zqwerty (Member) commented May 8, 2020

Actually, @liangrz15 is the one that trained the RL policies. I think he will update the training script for reproducibility.

@sherlock1987

Hey, I have the same question. When I read the evaluation log file, it always shows this:
DEBUG - policy_agenda_multiwoz.py - _normalize_value - 228 - Value not found in standard value set: [none] (slot: type domain: attraction)
I think this is the main reason it fails, but I have no idea what is happening.
The same thing happens when I train the PG model in convlab2/policy/pg/train.py.
Pretty strange, and I have no idea where I should start debugging.

@zqwerty (Member) commented Jun 10, 2020

We have updated the policy to address this issue. Have a try!

@zqwerty closed this as completed Jun 17, 2020
@zqwerty (Member) commented Jun 17, 2020

I've tried training MLE and then PPO; please see #15 (comment)

@thenickben

> I executed the provided train.py script in convlab2/policy/ppo with the pre-specified configuration. During training, the success rate starts fairly high at around 25% and then hovers around 30-35% for a while. When training is finished, I used the evaluate.py script in convlab2/policy to evaluate the performance, which gives me 26%, far from the 74% reported in the table.
>
> My question: what is the exact configuration that was used to train the 74% model?

Hi Chris, how do you see the success rate during training? The only logs I see on the console are the losses. Cheers!

@ChrisGeishauser (Author)

> > I executed the provided train.py script in convlab2/policy/ppo with the pre-specified configuration. During training, the success rate starts fairly high at around 25% and then hovers around 30-35% for a while. When training is finished, I used the evaluate.py script in convlab2/policy to evaluate the performance, which gives me 26%, far from the 74% reported in the table.
> > My question: what is the exact configuration that was used to train the 74% model?
>
> Hi Chris, how do you see the success rate during training? The only logs I see on the console are the losses. Cheers!

Hi Ben! I added a method called "evaluate" which is executed during training. I basically copied the "evaluate" method from "convlab2/policy/evaluate.py" :D Cheers!
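
For anyone looking for that helper, here is a rough sketch in the spirit of convlab2/policy/evaluate.py; the BiSession / MultiWozEvaluator usage, the turn cap, and the import paths below are assumptions rather than a verbatim copy of that script:

# rough sketch of an evaluate() helper modelled on convlab2/policy/evaluate.py;
# the session loop and APIs below are assumptions, not a verbatim copy
from convlab2.dialog_agent import PipelineAgent, BiSession
from convlab2.dst.rule.multiwoz import RuleDST
from convlab2.policy.rule.multiwoz import RulePolicy
from convlab2.evaluator.multiwoz_eval import MultiWozEvaluator

def evaluate(policy_sys, num_dialogues=100):
    """Roll out dialogues against the rule-based simulator and return the success rate."""
    sys_agent = PipelineAgent(None, RuleDST(), policy_sys, None, 'sys')
    usr_agent = PipelineAgent(None, None, RulePolicy(character='usr'), None, 'user')
    evaluator = MultiWozEvaluator()
    sess = BiSession(sys_agent=sys_agent, user_agent=usr_agent,
                     kb_query=None, evaluator=evaluator)

    success = 0
    for _ in range(num_dialogues):
        sess.init_session()
        sys_response = []                 # dialogue acts, since no NLG is used
        for _ in range(40):               # cap the number of turns per dialogue
            sys_response, user_response, session_over, reward = sess.next_turn(sys_response)
            if session_over:
                break
        if sess.evaluator.task_success():
            success += 1
    return success / num_dialogues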

@thenickben

> For better performance, you should do imitation learning before reinforcement learning. The imitation learning is implemented in the mle directory. You can download the model trained by imitation learning at https://convlab.blob.core.windows.net/convlab-2/mle_policy_multiwoz.zip. Unzip it and you will get best_mle.pol.mdl.
> Then you can run python train.py --load_path FOLDER_OF_MODEL/best_mle in the ppo directory. Note that the .pol.mdl suffix should not appear in the --load_path argument.

> Thanks for the quick instructions! I followed the procedure, but the performance actually degraded to around 35%.
> I then set evaluator=None for the environment, which finally led to the performance mentioned in the paper. I guess this is what @zqwerty meant by the old version of the user simulator?
>
> @zqwerty looking forward to the results you have!

Hi Chris, asking once again about your results: I see you managed to replicate the paper's figures for PPO? When you say you set evaluator = None in the environment, do you mean that you:

a) grabbed the best_mle model or trained your own (and for how many epochs?), and
b) trained PPO by doing this?

# imports needed for this snippet (paths follow the ConvLab-2 package layout);
# args and the update() helper come from convlab2/policy/ppo/train.py
from convlab2.dialog_agent import PipelineAgent
from convlab2.dialog_agent.env import Environment
from convlab2.dst.rule.multiwoz import RuleDST
from convlab2.policy.rule.multiwoz import RulePolicy
from convlab2.policy.ppo import PPO
from convlab2.evaluator.multiwoz_eval import MultiWozEvaluator

# simple rule DST on the system side
dst_sys = RuleDST()

# PPO policy in training mode, warm-started from the MLE checkpoint
policy_sys = PPO(True)
pre_trained_model_name = 'best_mle'
load_model(policy_sys, pre_trained_model_name)  # my own loading helper, see note below

# the user side does not use a DST
dst_usr = None
# agenda-based rule policy as the user simulator
policy_usr = RulePolicy(character='usr')
# assemble the simulated user
simulator = PipelineAgent(None, None, policy_usr, None, 'user')

evaluator = MultiWozEvaluator()  # created, but not passed to the environment
env = Environment(None, simulator, None, dst_sys, None)  # i.e. evaluator=None

for i in range(args.epoch):
    update(env, policy_sys, args.batchsz, i, args.process_num)

where load_model(policy_sys, pre_trained_model_name) is a function I wrote for loading the checkpoint.

Thanks a lot, Nick

@zqwerty (Member) commented Jul 16, 2020

Moved to #54.
