torchbeast RL lib from FAIR is giving INF loss on TPU in Colab PRO #2740

Closed
den-run-ai opened this issue Jan 19, 2021 · 1 comment
Labels
stale Has not had recent activity

@den-run-ai

den-run-ai commented Jan 19, 2021

I ran a simple test with the torchbeast library from FAIR on CPU/GPU and the results are reasonable. The library is a re-implementation of the IMPALA algorithm:

https://github.com/facebookresearch/torchbeast

When I switch to a TPU with PyTorch XLA 1.7 or nightly (in this branch: https://github.com/denfromufa/torchbeast), training fails with an INF loss:

!python -m torchbeast.monobeast --env PongNoFrameskip-v4 --num_actors 2 --num_threads 2 --batch_size 4 --total_steps 10000

[DEBUG:1090 cmd:719 2021-01-15 00:18:11,222] Popen(['git', 'version'], cwd=/content/torchbeast, universal_newlines=False, shell=None, istream=None)
[DEBUG:1090 cmd:719 2021-01-15 00:18:11,229] Popen(['git', 'version'], cwd=/content/torchbeast, universal_newlines=False, shell=None, istream=None)
[DEBUG:1090 cmd:719 2021-01-15 00:18:11,237] Popen(['git', 'cat-file', '--batch-check'], cwd=/content/torchbeast, universal_newlines=False, shell=None, istream=<valid stream>)
[DEBUG:1090 cmd:719 2021-01-15 00:18:11,244] Popen(['git', 'diff', '--cached', '--abbrev=40', '--full-index', '--raw'], cwd=/content/torchbeast, universal_newlines=False, shell=None, istream=None)
[DEBUG:1090 cmd:719 2021-01-15 00:18:11,253] Popen(['git', 'diff', '--abbrev=40', '--full-index', '--raw'], cwd=/content/torchbeast, universal_newlines=False, shell=None, istream=None)
Creating log directory: /root/logs/torchbeast/torchbeast-20210115-001811
[INFO:1090 file_writer:104 2021-01-15 00:18:11,285] Creating log directory: /root/logs/torchbeast/torchbeast-20210115-001811
Symlinked log directory: /root/logs/torchbeast/latest
[INFO:1090 file_writer:117 2021-01-15 00:18:11,286] Symlinked log directory: /root/logs/torchbeast/latest
Saving arguments to /root/logs/torchbeast/torchbeast-20210115-001811/meta.json
[INFO:1090 file_writer:129 2021-01-15 00:18:11,286] Saving arguments to /root/logs/torchbeast/torchbeast-20210115-001811/meta.json
Saving messages to /root/logs/torchbeast/torchbeast-20210115-001811/out.log
[INFO:1090 file_writer:137 2021-01-15 00:18:11,287] Saving messages to /root/logs/torchbeast/torchbeast-20210115-001811/out.log
Saving logs data to /root/logs/torchbeast/torchbeast-20210115-001811/logs.csv
[INFO:1090 file_writer:147 2021-01-15 00:18:11,287] Saving logs data to /root/logs/torchbeast/torchbeast-20210115-001811/logs.csv
Saving logs' fields to /root/logs/torchbeast/torchbeast-20210115-001811/fields.csv
[INFO:1090 file_writer:148 2021-01-15 00:18:11,287] Saving logs' fields to /root/logs/torchbeast/torchbeast-20210115-001811/fields.csv
xla:1
[INFO:1122 monobeast:138 2021-01-15 00:18:20,520] Actor 0 started.
[INFO:1123 monobeast:138 2021-01-15 00:18:20,524] Actor 1 started.
[INFO:1090 monobeast:418 2021-01-15 00:18:20,683] # Step	total_loss	mean_episode_return	pg_loss	baseline_loss	entropy_loss
[INFO:1090 monobeast:500 2021-01-15 00:18:25,690] Steps 0 @ 0.0 SPS. Loss inf. Stats:
{}
[INFO:1090 monobeast:500 2021-01-15 00:18:30,695] Steps 0 @ 0.0 SPS. Loss inf. Stats:
{}
[INFO:1090 monobeast:500 2021-01-15 00:18:35,700] Steps 0 @ 0.0 SPS. Loss inf. Stats:
{}
[INFO:1090 monobeast:500 2021-01-15 00:18:40,706] Steps 0 @ 0.0 SPS. Loss inf. Stats:
{}
[INFO:1090 monobeast:500 2021-01-15 00:18:45,711] Steps 0 @ 0.0 SPS. Loss inf. Stats:
{}
[INFO:1090 monobeast:500 2021-01-15 00:18:50,717] Steps 0 @ 0.0 SPS. Loss inf. Stats:
{}
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/torchbeast/torchbeast/monobeast.py", line 668, in <module>
    main(flags)
  File "/content/torchbeast/torchbeast/monobeast.py", line 661, in main
    train(flags)
  File "/content/torchbeast/torchbeast/monobeast.py", line 510, in train
    free_queue.put(None)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 341, in put
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
KeyboardInterrupt
^C
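
For context, the kind of torch_xla change involved is roughly the following. This is a minimal, self-contained sketch, not the actual monobeast code: the linear model, toy batch, and MSE loss are placeholders standing in for the IMPALA network and its pg/baseline/entropy losses.

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()                    # resolves to e.g. "xla:1", as printed in the log above
model = nn.Linear(4, 2).to(device)          # toy stand-in for the IMPALA network
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)

x = torch.randn(8, 4, device=device)        # toy batch
target = torch.randn(8, 2, device=device)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), target)   # stand-in for pg + baseline + entropy losses
loss.backward()
xm.optimizer_step(optimizer)                # applies the update and marks the XLA step

# XLA builds the graph lazily; an inf/nan only surfaces once a tensor is
# materialized, e.g. here or when the loss value is formatted for logging.
if not torch.isfinite(loss).item():         # .item() forces evaluation on the TPU
    raise FloatingPointError(f"non-finite loss on {device}: {loss.item()}")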
@den-run-ai den-run-ai changed the title torchbeast RL pytorch library from FAIR is giving INF loss when training on TPU in Colab PRO torchbeast RL pytorch library from FAIR is giving INF loss on TPU in Colab PRO Jan 19, 2021
@den-run-ai den-run-ai changed the title torchbeast RL pytorch library from FAIR is giving INF loss on TPU in Colab PRO torchbeast RL lib from FAIR is giving INF loss on TPU in Colab PRO Jan 19, 2021
@stale

stale bot commented Jun 16, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Has not had recent activity label Jun 16, 2021
@stale stale bot closed this as completed Jun 26, 2021