torchbeast RL lib from FAIR is giving INF loss on TPU in Colab PRO #2740

Closed
den-run-ai opened this issue Jan 19, 2021 · 1 comment
Labels
stale Has not had recent activity

@den-run-ai

den-run-ai commented Jan 19, 2021

I ran a simple test with the torchbeast library from FAIR on CPU/GPU and the results are reasonable. The library is a re-implementation of the IMPALA algorithm:

https://github.com/facebookresearch/torchbeast

When I switch to a TPU with PyTorch XLA 1.7 or nightly (in this branch: https://github.com/denfromufa/torchbeast), training fails with an INF loss:

!python -m torchbeast.monobeast --env PongNoFrameskip-v4 --num_actors 2 --num_threads 2 --batch_size 4 --total_steps 10000

[DEBUG:1090 cmd:719 2021-01-15 00:18:11,222] Popen(['git', 'version'], cwd=/content/torchbeast, universal_newlines=False, shell=None, istream=None)
[DEBUG:1090 cmd:719 2021-01-15 00:18:11,229] Popen(['git', 'version'], cwd=/content/torchbeast, universal_newlines=False, shell=None, istream=None)
[DEBUG:1090 cmd:719 2021-01-15 00:18:11,237] Popen(['git', 'cat-file', '--batch-check'], cwd=/content/torchbeast, universal_newlines=False, shell=None, istream=<valid stream>)
[DEBUG:1090 cmd:719 2021-01-15 00:18:11,244] Popen(['git', 'diff', '--cached', '--abbrev=40', '--full-index', '--raw'], cwd=/content/torchbeast, universal_newlines=False, shell=None, istream=None)
[DEBUG:1090 cmd:719 2021-01-15 00:18:11,253] Popen(['git', 'diff', '--abbrev=40', '--full-index', '--raw'], cwd=/content/torchbeast, universal_newlines=False, shell=None, istream=None)
Creating log directory: /root/logs/torchbeast/torchbeast-20210115-001811
[INFO:1090 file_writer:104 2021-01-15 00:18:11,285] Creating log directory: /root/logs/torchbeast/torchbeast-20210115-001811
Symlinked log directory: /root/logs/torchbeast/latest
[INFO:1090 file_writer:117 2021-01-15 00:18:11,286] Symlinked log directory: /root/logs/torchbeast/latest
Saving arguments to /root/logs/torchbeast/torchbeast-20210115-001811/meta.json
[INFO:1090 file_writer:129 2021-01-15 00:18:11,286] Saving arguments to /root/logs/torchbeast/torchbeast-20210115-001811/meta.json
Saving messages to /root/logs/torchbeast/torchbeast-20210115-001811/out.log
[INFO:1090 file_writer:137 2021-01-15 00:18:11,287] Saving messages to /root/logs/torchbeast/torchbeast-20210115-001811/out.log
Saving logs data to /root/logs/torchbeast/torchbeast-20210115-001811/logs.csv
[INFO:1090 file_writer:147 2021-01-15 00:18:11,287] Saving logs data to /root/logs/torchbeast/torchbeast-20210115-001811/logs.csv
Saving logs' fields to /root/logs/torchbeast/torchbeast-20210115-001811/fields.csv
[INFO:1090 file_writer:148 2021-01-15 00:18:11,287] Saving logs' fields to /root/logs/torchbeast/torchbeast-20210115-001811/fields.csv
xla:1
[INFO:1122 monobeast:138 2021-01-15 00:18:20,520] Actor 0 started.
[INFO:1123 monobeast:138 2021-01-15 00:18:20,524] Actor 1 started.
[INFO:1090 monobeast:418 2021-01-15 00:18:20,683] # Step	total_loss	mean_episode_return	pg_loss	baseline_loss	entropy_loss
[INFO:1090 monobeast:500 2021-01-15 00:18:25,690] Steps 0 @ 0.0 SPS. Loss inf. Stats:
{}
[INFO:1090 monobeast:500 2021-01-15 00:18:30,695] Steps 0 @ 0.0 SPS. Loss inf. Stats:
{}
[INFO:1090 monobeast:500 2021-01-15 00:18:35,700] Steps 0 @ 0.0 SPS. Loss inf. Stats:
{}
[INFO:1090 monobeast:500 2021-01-15 00:18:40,706] Steps 0 @ 0.0 SPS. Loss inf. Stats:
{}
[INFO:1090 monobeast:500 2021-01-15 00:18:45,711] Steps 0 @ 0.0 SPS. Loss inf. Stats:
{}
[INFO:1090 monobeast:500 2021-01-15 00:18:50,717] Steps 0 @ 0.0 SPS. Loss inf. Stats:
{}
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/torchbeast/torchbeast/monobeast.py", line 668, in <module>
    main(flags)
  File "/content/torchbeast/torchbeast/monobeast.py", line 661, in main
    train(flags)
  File "/content/torchbeast/torchbeast/monobeast.py", line 510, in train
    free_queue.put(None)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 341, in put
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
KeyboardInterrupt
^C
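
For context, the kind of torch_xla change involved is roughly the following. This is a minimal, self-contained sketch, not the actual monobeast code: the linear model, toy batch, and MSE loss are placeholders standing in for the IMPALA network and its pg/baseline/entropy losses.

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()                    # resolves to e.g. "xla:1", as printed in the log above
model = nn.Linear(4, 2).to(device)          # toy stand-in for the IMPALA network
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)

x = torch.randn(8, 4, device=device)        # toy batch
target = torch.randn(8, 2, device=device)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), target)   # stand-in for pg + baseline + entropy losses
loss.backward()
xm.optimizer_step(optimizer)                # applies the update and marks the XLA step

# XLA builds the graph lazily; an inf/nan only surfaces once a tensor is
# materialized, e.g. here or when the loss value is formatted for logging.
if not torch.isfinite(loss).item():         # .item() forces evaluation on the TPU
    raise FloatingPointError(f"non-finite loss on {device}: {loss.item()}")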
@den-run-ai den-run-ai changed the title torchbeast RL pytorch library from FAIR is giving INF loss when training on TPU in Colab PRO torchbeast RL pytorch library from FAIR is giving INF loss on TPU in Colab PRO Jan 19, 2021
@den-run-ai den-run-ai changed the title torchbeast RL pytorch library from FAIR is giving INF loss on TPU in Colab PRO torchbeast RL lib from FAIR is giving INF loss on TPU in Colab PRO Jan 19, 2021
@stale

stale bot commented Jun 16, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Has not had recent activity label Jun 16, 2021
@stale stale bot closed this as completed Jun 26, 2021