I ran a simple test with the torchbeast library from FAIR on CPU/GPU and the results are reasonable. This library is a re-implementation of the IMPALA algorithm:
https://github.com/facebookresearch/torchbeast
When I switch to TPU with PyTorch XLA 1.7 or the nightly build (in this branch: https://github.com/denfromufa/torchbeast), training fails with an INF loss:
!python -m torchbeast.monobeast --env PongNoFrameskip-v4 --num_actors 2 --num_threads 2 --batch_size 4 --total_steps 10000
[DEBUG:1090 cmd:719 2021-01-15 00:18:11,222] Popen(['git', 'version'], cwd=/content/torchbeast, universal_newlines=False, shell=None, istream=None)
[DEBUG:1090 cmd:719 2021-01-15 00:18:11,229] Popen(['git', 'version'], cwd=/content/torchbeast, universal_newlines=False, shell=None, istream=None)
[DEBUG:1090 cmd:719 2021-01-15 00:18:11,237] Popen(['git', 'cat-file', '--batch-check'], cwd=/content/torchbeast, universal_newlines=False, shell=None, istream=<valid stream>)
[DEBUG:1090 cmd:719 2021-01-15 00:18:11,244] Popen(['git', 'diff', '--cached', '--abbrev=40', '--full-index', '--raw'], cwd=/content/torchbeast, universal_newlines=False, shell=None, istream=None)
[DEBUG:1090 cmd:719 2021-01-15 00:18:11,253] Popen(['git', 'diff', '--abbrev=40', '--full-index', '--raw'], cwd=/content/torchbeast, universal_newlines=False, shell=None, istream=None)
Creating log directory: /root/logs/torchbeast/torchbeast-20210115-001811
[INFO:1090 file_writer:104 2021-01-15 00:18:11,285] Creating log directory: /root/logs/torchbeast/torchbeast-20210115-001811
Symlinked log directory: /root/logs/torchbeast/latest
[INFO:1090 file_writer:117 2021-01-15 00:18:11,286] Symlinked log directory: /root/logs/torchbeast/latest
Saving arguments to /root/logs/torchbeast/torchbeast-20210115-001811/meta.json
[INFO:1090 file_writer:129 2021-01-15 00:18:11,286] Saving arguments to /root/logs/torchbeast/torchbeast-20210115-001811/meta.json
Saving messages to /root/logs/torchbeast/torchbeast-20210115-001811/out.log
[INFO:1090 file_writer:137 2021-01-15 00:18:11,287] Saving messages to /root/logs/torchbeast/torchbeast-20210115-001811/out.log
Saving logs data to /root/logs/torchbeast/torchbeast-20210115-001811/logs.csv
[INFO:1090 file_writer:147 2021-01-15 00:18:11,287] Saving logs data to /root/logs/torchbeast/torchbeast-20210115-001811/logs.csv
Saving logs' fields to /root/logs/torchbeast/torchbeast-20210115-001811/fields.csv
[INFO:1090 file_writer:148 2021-01-15 00:18:11,287] Saving logs' fields to /root/logs/torchbeast/torchbeast-20210115-001811/fields.csv
xla:1
[INFO:1122 monobeast:138 2021-01-15 00:18:20,520] Actor 0 started.
[INFO:1123 monobeast:138 2021-01-15 00:18:20,524] Actor 1 started.
[INFO:1090 monobeast:418 2021-01-15 00:18:20,683] # Step total_loss mean_episode_return pg_loss baseline_loss entropy_loss
[INFO:1090 monobeast:500 2021-01-15 00:18:25,690] Steps 0 @ 0.0 SPS. Loss inf. Stats:
{}
[INFO:1090 monobeast:500 2021-01-15 00:18:30,695] Steps 0 @ 0.0 SPS. Loss inf. Stats:
{}
[INFO:1090 monobeast:500 2021-01-15 00:18:35,700] Steps 0 @ 0.0 SPS. Loss inf. Stats:
{}
[INFO:1090 monobeast:500 2021-01-15 00:18:40,706] Steps 0 @ 0.0 SPS. Loss inf. Stats:
{}
[INFO:1090 monobeast:500 2021-01-15 00:18:45,711] Steps 0 @ 0.0 SPS. Loss inf. Stats:
{}
[INFO:1090 monobeast:500 2021-01-15 00:18:50,717] Steps 0 @ 0.0 SPS. Loss inf. Stats:
{}
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/torchbeast/torchbeast/monobeast.py", line 668, in <module>
    main(flags)
  File "/content/torchbeast/torchbeast/monobeast.py", line 661, in main
    train(flags)
  File "/content/torchbeast/torchbeast/monobeast.py", line 510, in train
    free_queue.put(None)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 341, in put
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
KeyboardInterrupt
^C
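For reference, here is a minimal standalone check I would run in the same Colab to see whether basic training math stays finite on the XLA device. This is my own sketch using the public torch_xla API, not torchbeast code; the model and data are placeholders.

import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()  # prints as "xla:1", same as in the log above
print(device)

# Placeholder model and batch, only to check that loss values stay finite on TPU.
model = torch.nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(5):
    x = torch.randn(8, 16, device=device)
    y = torch.randint(0, 4, (8,), device=device)
    loss = F.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    xm.optimizer_step(optimizer, barrier=True)  # optimizer step + XLA graph sync
    # .item() forces a device sync, so the printed value is the materialized loss.
    print(step, loss.item(), torch.isfinite(loss).item())

# The metrics report lists ops that fell back to CPU (aten::* counters),
# which is often the first thing to check when numbers look wrong on XLA.
print(met.metrics_report())

If even this toy loop produced non-finite losses, that would point at the XLA setup rather than at torchbeast's V-trace loss code.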