
Training PPO-algorithm #8

Closed
ChrisGeishauser opened this issue May 6, 2020 · 11 comments

@ChrisGeishauser

I executed the provided train.py script in convlab2/policy/ppo with the pre-specified configuration. During training, the success rate starts fairly high at around 25% and then hovers around 30-35% for a while. When training is finished, I used the evaluate.py script in convlab2/policy to evaluate the performance, which gives me 26%, far from the 74% reported in the table.

My question: what is the exact configuration that was used to train the 74% model?

@zqwerty (Member) commented May 8, 2020

The model we trained used an old version of the user simulator. We will re-train the model with the current simulator soon :)

@liangrz15 (Contributor)

For better performance, you should do imitation learning before reinforcement learning. The imitation learning is implemented in the mle directory. You can download the model trained by imitation learning at https://convlab.blob.core.windows.net/convlab-2/mle_policy_multiwoz.zip. Unzip it and you will get best_mle.pol.mdl.

Then you can run python train.py --load_path FOLDER_OF_MODEL/best_mle in the ppo directory. Note that the .pol.mdl suffix should not appear in the --load_path argument.
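
In code, that warm start amounts to roughly the sketch below. The import path, and the assumption that the policy's load() method appends the .pol.mdl suffix itself (which would explain why the suffix is dropped from --load_path), are mine and not taken from train.py:

# hypothetical warm-start sketch, not a verbatim excerpt from convlab2/policy/ppo/train.py;
# assumes PPO exposes a load(path) method that appends the .pol.mdl suffix itself
from convlab2.policy.ppo import PPO

policy_sys = PPO(True)                       # True: build the policy in training mode
policy_sys.load('FOLDER_OF_MODEL/best_mle')  # no .pol.mdl suffix, mirroring --load_path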

@ChrisGeishauser (Author)

> For better performance, you should do imitation learning before reinforcement learning. The imitation learning is implemented in the mle directory. You can download the model trained by imitation learning at https://convlab.blob.core.windows.net/convlab-2/mle_policy_multiwoz.zip. Unzip it and you will get best_mle.pol.mdl.
>
> Then you can run python train.py --load_path FOLDER_OF_MODEL/best_mle in the ppo directory. Note that the .pol.mdl suffix should not appear in the --load_path argument.

Thanks for the quick instructions! I followed the procedure, but the performance actually degraded to around 35%.
I then set evaluator=None for the environment, which finally led to the performance mentioned in the paper. I guess this is what @zqwerty meant by the old version of the user simulator?

@zqwerty looking forward to the results you have!
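
The two setups Chris describes differ only in the last argument passed to the training environment. A minimal sketch, assuming ConvLab-2's Environment(sys_nlg, usr, sys_nlu, sys_dst, evaluator) constructor order; the import paths follow the current package layout and may differ in older versions:

# minimal sketch of the two training environments discussed above; the constructor
# order and the import paths are assumptions based on the ConvLab-2 layout
from convlab2.dialog_agent import PipelineAgent
from convlab2.dialog_agent.env import Environment
from convlab2.dst.rule.multiwoz import RuleDST
from convlab2.policy.rule.multiwoz import RulePolicy
from convlab2.evaluator.multiwoz_eval import MultiWozEvaluator

dst_sys = RuleDST()                                     # rule-based system-side DST
policy_usr = RulePolicy(character='usr')                # agenda-based user simulator
simulator = PipelineAgent(None, None, policy_usr, None, 'user')

# evaluator=None: the setup Chris describes above
env_no_eval = Environment(None, simulator, None, dst_sys, None)

# with the MultiWOZ evaluator attached to the environment
env_with_eval = Environment(None, simulator, None, dst_sys, MultiWozEvaluator())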

@zqwerty (Member) commented May 8, 2020

Actually, @liangrz15 is the one that trained the RL policies. I think he will update the training script for reproducibility.

@sherlock1987

Hey, I have the same question. When I read the evaluation log file, it always shows this:
DEBUG - policy_agenda_multiwoz.py - _normalize_value - 228 - Value not found in standard value set: [none] (slot: type domain: attraction)
I think this is the main reason it fails, but I have no idea what is happening.
The same thing happens when I train the PG model in convlab2/policy/pg/train.py.
Pretty strange, and I have no idea where I should start debugging.

@zqwerty (Member) commented Jun 10, 2020

We have updated the policy to address this issue. Have a try!

@zqwerty closed this as completed Jun 17, 2020
@zqwerty (Member) commented Jun 17, 2020

I've tried training MLE and then PPO; please see #15 (comment)

@thenickben

> I executed the provided train.py script in convlab2/policy/ppo with the pre-specified configuration. During training, the success rate starts fairly high at around 25% and then hovers around 30-35% for a while. When training is finished, I used the evaluate.py script in convlab2/policy to evaluate the performance, which gives me 26%, far from the 74% reported in the table.
>
> My question: what is the exact configuration that was used to train the 74% model?

Hi Chris, how do you see the success rate during training? The only logs I see on the console are the losses. Cheers!

@ChrisGeishauser (Author)

> > I executed the provided train.py script in convlab2/policy/ppo with the pre-specified configuration. During training, the success rate starts fairly high at around 25% and then hovers around 30-35% for a while. When training is finished, I used the evaluate.py script in convlab2/policy to evaluate the performance, which gives me 26%, far from the 74% reported in the table.
> > My question: what is the exact configuration that was used to train the 74% model?
>
> Hi Chris, how do you see the success rate during training? The only logs I see on the console are the losses. Cheers!

Hi Ben! I added a method called "evaluate" which is executed during training. I basically copied the "evaluate" method from "convlab2/policy/evaluate.py" :D Cheers!
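
For anyone looking for that helper, here is a rough sketch in the spirit of convlab2/policy/evaluate.py; the BiSession / MultiWozEvaluator usage, the turn cap, and the import paths below are assumptions rather than a verbatim copy of that script:

# rough sketch of an evaluate() helper modelled on convlab2/policy/evaluate.py;
# the session loop and APIs below are assumptions, not a verbatim copy
from convlab2.dialog_agent import PipelineAgent, BiSession
from convlab2.dst.rule.multiwoz import RuleDST
from convlab2.policy.rule.multiwoz import RulePolicy
from convlab2.evaluator.multiwoz_eval import MultiWozEvaluator

def evaluate(policy_sys, num_dialogues=100):
    """Roll out dialogues against the rule-based simulator and return the success rate."""
    sys_agent = PipelineAgent(None, RuleDST(), policy_sys, None, 'sys')
    usr_agent = PipelineAgent(None, None, RulePolicy(character='usr'), None, 'user')
    evaluator = MultiWozEvaluator()
    sess = BiSession(sys_agent=sys_agent, user_agent=usr_agent,
                     kb_query=None, evaluator=evaluator)

    success = 0
    for _ in range(num_dialogues):
        sess.init_session()
        sys_response = []                 # dialogue acts, since no NLG is used
        for _ in range(40):               # cap the number of turns per dialogue
            sys_response, user_response, session_over, reward = sess.next_turn(sys_response)
            if session_over:
                break
        if sess.evaluator.task_success():
            success += 1
    return success / num_dialogues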

@thenickben

> For better performance, you should do imitation learning before reinforcement learning. The imitation learning is implemented in the mle directory. You can download the model trained by imitation learning at https://convlab.blob.core.windows.net/convlab-2/mle_policy_multiwoz.zip. Unzip it and you will get best_mle.pol.mdl.
> Then you can run python train.py --load_path FOLDER_OF_MODEL/best_mle in the ppo directory. Note that the .pol.mdl suffix should not appear in the --load_path argument.

> Thanks for the quick instructions! I followed the procedure, but the performance actually degraded to around 35%.
> I then set evaluator=None for the environment, which finally led to the performance mentioned in the paper. I guess this is what @zqwerty meant by the old version of the user simulator?
>
> @zqwerty looking forward to the results you have!

Hi Chris, asking once again about your results: I see you managed to replicate the paper's figures for PPO? When you say you set evaluator = None in the environment, do you mean that you:

a) grabbed the best_mle model or trained your own (and for how many epochs?), and
b) trained PPO by doing this?

# imports needed for this snippet (paths follow the ConvLab-2 package layout);
# args and the update() helper come from convlab2/policy/ppo/train.py
from convlab2.dialog_agent import PipelineAgent
from convlab2.dialog_agent.env import Environment
from convlab2.dst.rule.multiwoz import RuleDST
from convlab2.policy.rule.multiwoz import RulePolicy
from convlab2.policy.ppo import PPO
from convlab2.evaluator.multiwoz_eval import MultiWozEvaluator

# simple rule DST on the system side
dst_sys = RuleDST()

# PPO policy in training mode, warm-started from the MLE checkpoint
policy_sys = PPO(True)
pre_trained_model_name = 'best_mle'
load_model(policy_sys, pre_trained_model_name)  # my own loading helper, see note below

# the user side does not use a DST
dst_usr = None
# agenda-based rule policy as the user simulator
policy_usr = RulePolicy(character='usr')
# assemble the simulated user
simulator = PipelineAgent(None, None, policy_usr, None, 'user')

evaluator = MultiWozEvaluator()  # created, but not passed to the environment
env = Environment(None, simulator, None, dst_sys, None)  # i.e. evaluator=None

for i in range(args.epoch):
    update(env, policy_sys, args.batchsz, i, args.process_num)

where load_model(policy_sys, pre_trained_model_name) is a function I wrote for loading the checkpoint.

Thanks a lot, Nick

@zqwerty (Member) commented Jul 16, 2020

Moved to #54.
