Error when running vgg16_oktopk.sh #1

Closed
EtoDemerzel0427 opened this issue Mar 2, 2022 · 4 comments
Comments

@EtoDemerzel0427

Dear authors:

Thank you for open-sourcing your work. I tried to reproduce your experiments on Ok-TopK by running sbatch vgg16_oktopk.sh, but encountered the following error:

Exception in thread allreducer:
Traceback (most recent call last):
  File "<my-dir>/anaconda3/envs/py38_oktopk/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "<my-dir>/anaconda3/envs/py38_oktopk/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "<my-dir>/frontera/Ok-Topk/VGG/allreducer.py", line 643, in run
    self._boundaries[new_name][i] = global_boundaries[i]
IndexError: index 0 is out of bounds for axis 0 with size 0
Exception in thread allreducer:
Traceback (most recent call last):
  File "<my-dir>/anaconda3/envs/py38_oktopk/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "<my-dir>/anaconda3/envs/py38_oktopk/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "<my-dir>/<src-dir>/Ok-Topk/VGG/allreducer.py", line 643, in run
    self._boundaries[new_name][i] = global_boundaries[i]
IndexError: index 0 is out of bounds for axis 0 with size 0
Exception in thread allreducer:
Traceback (most recent call last):
  File "<my-dir2>/anaconda3/envs/py38_oktopk/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "<my-dir2>//anaconda3/envs/py38_oktopk/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "<my-dir2>/<src-dir>/Ok-Topk/VGG/allreducer.py", line 643, in run
    self._boundaries[new_name][i] = global_boundaries[i]
IndexError: index 0 is out of bounds for axis 0 with size 0

Do you have any idea what happened here? The only change I made was decreasing the number of nodes:

#SBATCH --nodes=4
#SBATCH --ntasks=4
...
nworkers="${nworkers:-4}"
@Shigangli
Owner

Hi, I reran the code with the same config as yours and it works on my side. I have no idea what causes the error. Maybe check the size of self._boundaries[new_name].
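Something along these lines, as a standalone sketch (not the code in allreducer.py; the layer name and sizes are placeholders), reproduces the same IndexError and shows the sizes worth checking:

import numpy as np

# Placeholder data; in the failing run one of the two arrays evidently has size 0.
boundaries = {"layer0": np.zeros(0)}
global_boundaries = np.zeros(4)
new_name = "layer0"

print("boundaries size:", boundaries[new_name].size,
      "global_boundaries size:", global_boundaries.size)

for i in range(global_boundaries.size):
    # Raises "IndexError: index 0 is out of bounds for axis 0 with size 0"
    # when the assignment target is empty.
    boundaries[new_name][i] = global_boundaries[i]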

@EtoDemerzel0427
Author

Thanks for your quick response. I found that MPI.COMM_WORLD's size in AllReducer is always 1 here. It could be an mpi4py problem, according to a similar question on StackOverflow. Still working on it.
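For reference, a minimal standalone check (saved as, say, check_mpi.py, a made-up name) should print one line per rank; if the launcher is not wiring MPI up correctly, every process reports size 1, which matches what AllReducer sees:

from mpi4py import MPI

# Each rank prints its own rank, the world size, and the host it runs on.
comm = MPI.COMM_WORLD
print("rank %d of %d on %s" % (comm.rank, comm.size, MPI.Get_processor_name()))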

@EtoDemerzel0427
Author

EtoDemerzel0427 commented Mar 2, 2022

@Shigangli Hi, I managed to make it run with mpiexec.hydra; somehow srun just didn't allocate the jobs correctly. But this message keeps recurring:

2022-03-02 16:43:44,335 [dl_trainer.py:609] WARNING NaN detected! sparsities:  []
2022-03-02 16:43:44,334 [dl_trainer.py:610] INFO Average Sparsity: nan, compression ratio: nan, communication size: nan
2022-03-02 16:43:44,335 [dl_trainer.py:610] INFO Average Sparsity: nan, compression ratio: nan, communication size: nan

Is this normal?
After a quick scan through dl_trainer.py, I didn't find where self.sparsities gets updated, so it's always an empty list. Did I miss anything?
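For what it's worth, an empty list is already enough to produce exactly that output if the average is taken with numpy (a sketch, not the trainer code):

import numpy as np

sparsities = []            # never appended to, as far as I can tell
avg = np.mean(sparsities)  # nan, with a "Mean of empty slice" RuntimeWarning
if np.isnan(avg):
    print("NaN detected! sparsities: ", sparsities)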

@Shigangli
Owner

Shigangli commented Mar 3, 2022

That should be OK; there is some unnecessary info printed out. If you find current density in the output file, you can see the ratio of dense gradient values. It also conducts a warmup using dense allreduce, as shown here.
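Roughly, current density here just means the fraction of gradient values kept after sparsification; a minimal sketch of that ratio (not the repository code; the tensor size and the 1% ratio are made up):

import torch

grad = torch.randn(10000)                  # a made-up dense gradient
k = int(0.01 * grad.numel())               # keep 1% of the entries
_, idx = torch.topk(grad.abs(), k)
sparse = torch.zeros_like(grad)
sparse[idx] = grad[idx]
density = (sparse != 0).sum().item() / sparse.numel()
print("current density: %.4f" % density)   # about 0.01 here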
