
About the result of Pong #1

Open
lucasliunju opened this issue May 21, 2018 · 25 comments

@lucasliunju

Dear ronsailer,
I'm sorry to trouble you. First, thanks for your contribution. I am running the code on Pong and cannot get a good result, so I want to ask whether you have run this experiment yourself.
I'm looking forward to your reply.

@ronsailer
Owner

ronsailer commented May 21, 2018

Hi Lucas, I'm glad to see that someone is using this code! :)

You're very welcome. What do you mean you can't get a better result? Better than what?
Unfortunately I haven't run the Pong experiment. This code came standard with Breakout and I did not try to run it on Pong. The only code that I ran Pong on is this: https://github.com/ikostrikov/pytorch-a2c-ppo-acktr

I think that in order to get this code working on Pong you have to readjust the network hyperparameters, like the input size, but it sounds to me like you've already got it up and running.

Just a heads-up: With the code as-is (as of yesterday), it will not run because I'm currently in the process of translating it from Theano to PyTorch and it's broken. It worked yesterday though before I started migrating it to PyTorch.

The original code can be found here: https://github.com/jeanharb/a2oc_delib but from my experience, it's outdated and can't immediately be executed after cloning it. You'll need to tinker a bit with the Lasagne library and remove some stuff from there. I've written down the changes I had to make to get that code running:

  • Install the dependencies
  • Make sure that .theanorc has floatX=float32 configured
  • Lasagne is incompatible with the latest version of Theano. You need to manually edit it: change "from theano.tensor.signal import downsample" to "from theano.tensor.signal.pool import pool_2d" and update the corresponding function calls, so instead of downsample.max_pool_2d you now simply call pool_2d (see the sketch below).

(Comment on the above: I'm not sure what I meant with it. If I meant Lasagne or Theano. I recall having to edit some import that was outdated in the Lasagne code)
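
For reference, the shape of that edit is roughly this (a sketch, assuming the outdated import sits in Lasagne's pooling code; the exact file and keyword names can differ by Theano/Lasagne version):

# Old (breaks on recent Theano):
#   from theano.tensor.signal import downsample
#   pooled = downsample.max_pool_2d(x, ds=(2, 2), ignore_border=True)
#
# New:
import theano.tensor as T
from theano.tensor.signal.pool import pool_2d

x = T.tensor4("x")                                  # (batch, channels, rows, cols)
pooled = pool_2d(x, ws=(2, 2), ignore_border=True)  # older Theano versions use ds= instead of ws=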

Also, I'm currently "hard-coding" it to work on Gridworld (https://github.com/nadavbh12/gym-gridworld) which I've attached to this repo but I will eventually just add it to the requirements or properly include in the code, not the way it's currently being used. With "hard-coding" I mean things like the input size of the network, which, instead of using magic numbers, should be inferred from the environment that's being used (Pong, Breakout, Gridworld, etc.).

Best of luck,
Ron

@lucasliunju
Author

Dear ronsailer,
I'm sorry to trouble you again. I'm running your code again and I've run into an error: "AttributeError: 'GridworldEnv' object has no attribute 'viewer'". So I want to ask whether the code is complete. I'm looking forward to your reply.

@ronsailer
Owner

Hi Lucas,

This repo is discontinued. I've stopped working on this halfway through and instead implemented A2OC based off of https://github.com/ikostrikov/pytorch-a2c-ppo-acktr

Let me upload my code, I'll link it to you here.

@ronsailer
Owner

ronsailer commented Jul 3, 2018

@lucasliunju please see: https://github.com/ronsailer/pytorch-a2c-a2oc-ppo-acktr

But this one also isn't ready. The architecture works, but the only thing missing is a few lines for the termination loss so that the policy over options will learn as well (so right now it's as if options are being chosen at random). It converges and learns to play games nonetheless, just more slowly, because you need to train all options while they are being chosen at random and each option only gets about 1/n of the actions to learn from, where n is the number of options.
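
Roughly, the missing piece looks like this in option-critic style code (a sketch; beta, q_omega, v_omega and delib_cost are illustrative names, not this repo's actual variables):

import torch

def termination_loss(beta, q_omega, v_omega, delib_cost):
    # beta:       termination probability of the active option at the next state
    # q_omega:    Q_Omega(next state, active option)
    # v_omega:    V_Omega(next state), e.g. max or expectation over options
    # delib_cost: deliberation cost, acting as a margin that discourages switching
    advantage = (q_omega - v_omega + delib_cost).detach()  # no gradient through the critic here
    return (beta * advantage).mean()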

@lucasliunju
Author

Hi Sailer,
Thank you for your warm reply. I'll try to run the code you provided. I have tried to reproduce this algorithm over the past few days, but I have not gotten good results. As far as I know, a2oc_delib is the state of the art in option discovery. Maybe I should try to run the author's original code.

@ronsailer
Owner

Jean Harb's code (a2oc_delib) works after you change a few things unrelated to the algorithm itself. If I recall correctly, Lasagne (a Python module) changed and its import statements are now outdated, but it doesn't take long to fix them and get it up and running. If you're having trouble, feel free to ask me.

@lucasliunju
Author

Hi Sailer,
Thanks! I can run the code on the CPU, but when I use the GPU there are many strange errors. I think the versions of CUDA and cuDNN may be the problem, so I want to ask which versions of CUDA and cuDNN you use. Thanks.

@ronsailer
Owner

Hi Lucas,

There was a mistake on my part and it really was broken. I've been editing the code a lot on my laptop which doesn't run CUDA and I didn't expect anyone else to run it anytime soon. I've pushed a fix. Please try again now and let me know if it works. It does on my CUDA machine.

Also, let's move the conversation over there. You can open a new issue at https://github.com/ronsailer/pytorch-a2c-a2oc-ppo-acktr if you want.

@ronsailer
Owner

@lucasliunju Lucas, check out the code at https://github.com/ronsailer/pytorch-a2c-a2oc-ppo-acktr now. I believe I've fixed the termination loss and the algorithm should be complete now. It works for Gridworld. I'm now training it on Pong.

@lucasliunju
Author

lucasliunju commented Jul 6, 2018 via email

@lucasliunju
Author

lucasliunju commented Jul 7, 2018 via email

@lucasliunju
Author

I'm very happy to tell you that I can successfully run your implementation of A2OC with CUDA: https://github.com/ronsailer/pytorch-a2c-a2oc-ppo-acktr. But I find that I cannot open a new issue there.

@ronsailer
Owner

Glad to hear it works! I've trained a model on Breakout and it can play it.

In the paper it says that if you don't use a deliberation cost the termination probability quickly rises to 100% so that the options terminate after every step. I did not see this happen with my code. I hope I did not make a mistake.

Also, the deliberation cost should be negative, right? The algorithm in the paper adds it to the reward if there was a switch in options, and it should be a penalty. The table at the bottom of the Experiments section mentions they tried deliberation costs between 0 and 0.03 in increments of 0.005, but they did not mention the sign.

I'm using a negative deliberation cost, -0.1 for example.
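
Concretely, with that sign convention the reward shaping would look something like this (names are made up for illustration):

def shape_reward(reward, switched_option, delib_cost=0.005):
    # delib_cost > 0 here; subtracting it on a switch is equivalent to
    # adding a negative deliberation cost to the reward.
    if switched_option:
        reward -= delib_cost
    return reward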

@lucasliunju
Author

Yes. From the results of option-critic and A2OC (the case where the deliberation cost is 0), we can see that the options terminate very quickly, and I think that fails to show what options can do.

@lucasliunju
Author

As for the setting of the deliberation cost, I think it may be related to the choice of hyperparameters. I'm looking forward to your new results.

@lucasliunju
Author

Dear Sailer,
I'm sorry to trouble you again. I'm trying to compare my results with Jean Harb's code. I want to ask whether you have run Jean Harb's code successfully on a GPU; I can only run it on the CPU.
Thanks

@ronsailer
Owner

Hi Lucas, sorry for the late reply. No, I did not attempt to run his code on a GPU.

@lucasliunju
Author

lucasliunju commented Jul 24, 2018 via email

@ronsailer
Owner

Hi Lucas, please make sure you're running the latest version. I uploaded a fix for the termination gradient about 2 days ago. There was a mistake and indeed the termination head did not converge. I find the results to be much better now.

Try AmidarNoFrameskip-v4 with 4 options and deliberation cost of 0.005:
python main.py --env-name AmidarNoFrameskip-v4 --num-options 4 --delib 0.005

10m frames (the default) is enough to see that it has learned not to switch options except at certain times. You can really see the termination probability hover around 0.00 and only occasionally go up, and when it does, it switches. If you want better results I suggest increasing the number of frames. For me, after 10m frames, the reward hovers around 100.

I suggest you add this line to act_enjoy() in model.py, after the line "rand_num = torch.rand(1)":
print("option: {} termination: {:.3f} rand: {:.3f}".format(self.current_options.item(),
self.terminations.item(),
rand_num.item()))

[screenshot: sample output of the print statement above]

I'm now working on adding a tracker to track things like termination probability over time and option choice over time:

[plot: termination probability and option choice over time]

Ignore the x-axis markers; this is after 10m iterations with the same configuration as above but with 8 options instead of 4. The results seem to be consistent with the paper. I'll have to try running the other deliberation configurations as well.
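
The tracker itself is simple, something along these lines (a rough sketch of the idea, not the actual code):

import matplotlib.pyplot as plt

class OptionTracker:
    def __init__(self):
        self.terminations = []  # termination probability at each step
        self.options = []       # index of the active option at each step

    def record(self, termination_prob, option_idx):
        self.terminations.append(float(termination_prob))
        self.options.append(int(option_idx))

    def plot(self, path="tracker.png"):
        fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
        ax1.plot(self.terminations)
        ax1.set_ylabel("termination prob")
        ax2.plot(self.options)
        ax2.set_ylabel("active option")
        ax2.set_xlabel("step")
        fig.savefig(path)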

@ronsailer
Owner

ronsailer commented Jul 24, 2018

I've started a job with BreakoutNoFrameskip, with delib=0.03 (highest in paper). These are the results so far (about 500k frames):

[plots: reward and termination probability on BreakoutNoFrameskip, delib=0.03, ~500k frames]

I told you previously to ignore the x-axis markers; they represent the number of samples I took.

@lucasliunju
Author

lucasliunju commented Jul 24, 2018 via email

@ronsailer
Owner

Lucas, I can't see the image.

@lucasliunju
Author

lucasliunju commented Jul 24, 2018 via email

@lucasliunju
Author

lucasliunju commented Jul 24, 2018 via email

@ronsailer
Owner

Still can't see it. Email it to me at ronsailer@gmail.com.

I pasted my images straight from the clipboard rather than uploading them as files; maybe that helps.
