Docker not using all cores #13
A cluster at BNL fails to use all cores when SRW is run in Docker; running outside Docker works normally.

We have isolated it to running N instances of SRW inside a single container. If you run 2 instances (i=2) with 4 slaves (n=4, mpiexec -n 4), 8 cores are used. However, if you run, say, i=4 and n=4, it still uses only 8 cores instead of the expected 16.

Adding @mrakitin.

---
From @mrakitin: I think I got something interesting with parallel execution. When I use the following command, 12 cores are allocated:

```sh
docker run --cpuset-cpus 50-71 --name x0 --rm s10 su - vagrant -c \
    'for i in 1 2 3 4 5 6; do mpiexec -verbose -n 2 python SRWLIB_Example10.py & done; wait'
```

When I tried a cpuset starting from 0, it didn't work. Probably CPU #0 is somewhat "special"…

---
Maxim, would you try this to see whether we can get all 32 cores used with one MPI master:

```sh
docker run --rm s10 su - vagrant -c \
    'for i in $(seq 1); do mpiexec -verbose -n 32 python SRWLIB_Example10.py & done; wait'
```

I'd like to test your "cpu0" hypothesis with:

```sh
docker run --cpuset-cpus 1-71 --rm s10 su - vagrant -c \
    'for i in $(seq 4); do mpiexec -verbose -n 8 python SRWLIB_Example10.py & done; wait'
```

And:

```sh
docker run --cpuset-cpus 0-71 --rm s10 su - vagrant -c \
    'for i in $(seq 4); do mpiexec -verbose -n 8 python SRWLIB_Example10.py & done; wait'
```

I'm skeptical.
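As a sanity check on the cpuset itself, independent of anything MPI does, the restriction a container actually sees can be read back directly (a small verification sketch using the same s10 image; not from the original thread):

```sh
# nproc and the kernel's Cpus_allowed_list both reflect --cpuset-cpus,
# so either one confirms which cores the container may use.
docker run --cpuset-cpus 50-71 --rm s10 nproc
# expected: 22
docker run --cpuset-cpus 50-71 --rm s10 grep Cpus_allowed_list /proc/self/status
# expected: Cpus_allowed_list: 50-71
```

---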
What are the core assignments? I find those more meaningful, because then we know exactly which cores doubled up. With htop, all we know is that the cores are busy, but they could be busy with one or two processes. Some of this may be MPI.

So that's interesting. (1) is a good sign. Let's try different values of n with i=1, e.g. 17, 37, 48, 72.

Rob

---
@robnagler, I updated my previous comment with the requested information. Here are the results with fixed i=1:

---
Parallelization/distribution of the processes over the cores looks fine without Docker.

---
What is the allocation with i=2 and n=16 for cpuset=0-35, 1-35, 17-71, and 37-71?

---
Hi Rob,

For i=2 and n=16 I got the following results:

Thanks,

---
Grasping at straws. If you run this:

```sh
docker run --rm s10 su - vagrant -c \
    'for i in 1 2; do mpiexec -verbose -n 16 python SRWLIB_Example10.py & done; wait'
```

what does this say?

```sh
ps axww -o pid,ppid,psr,cp,ucomm,args | sort --key=2 -n | grep 'SRWLIB_Example10.py'
```

Note that I added psr, which reports the core each process is currently on.
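As a shortcut for reading that output, the psr column can be reduced to just the cores that host more than one matching process (my convenience one-liner, not from the thread; the [S] pattern keeps grep from matching itself):

```sh
# Print only duplicated psr values; empty output means no two matching
# processes share a core. The mpiexec masters match too, as in the
# command above.
ps axww -o psr,args | grep '[S]RWLIB_Example10.py' | awk '{print $1}' | sort -n | uniq -d
```

---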
Here is the output:

---
Those didn't overlap. Bumping to n=36; if they still don't overlap, try i=3. Thanks!

```sh
docker run --rm s10 su - vagrant -c \
    'for i in 1 2; do mpiexec -verbose -n 36 python SRWLIB_Example10.py & done; wait'
```

I changed this line a bit to make it easier to read:

```sh
ps axww -o psr,ppid,pid,cp,ucomm,args | sort -n | grep 'SRWLIB_Example10.py'
```

---
Hi Rob,
---
My concern isn't about n=16, but about n=8 for i=3: only 16 cores are occupied in this case instead of the expected 3 × 8 = 24.
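A quick way to quantify that is to count the distinct cores hosting matching processes (again a convenience one-liner of mine, not from the thread):

```sh
# Count distinct psr values among matching processes; i=3 with n=8 should
# give about 24, and a noticeably smaller number means doubled-up cores.
ps axww -o psr,args | grep '[S]RWLIB_Example10.py' | awk '{print $1}' | sort -un | wc -l
```

---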
When there are two slaves on one core, they have two different MPI masters. The fact that this doesn't happen outside a container is very strange: inside a container, two MPI masters should use the same resource allocation as outside. The container restricts processes, but it shouldn't cause allocations to be different.

I haven't seen this behavior on our cluster, but then we don't have as many cores per node. It would be good to test on another cluster with a different distro.

I'm a bit stumped at the moment.
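One variable worth pinning down when native and containerized runs diverge is the MPI stack itself (a suggestion of mine, not from the thread; s10 is the image used above):

```sh
# Compare OpenMPI versions on the host and in the container; a version
# skew can mean different default process-binding policies.
mpiexec --version
docker run --rm s10 su - vagrant -c 'mpiexec --version'
```

---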
How many cores do you have? Perhaps you could test with i=3 and n=2 or 3?

---
Good call! I can reproduce it here in an interactive docker script. Here's the test I'm using for this specific example:

```sh
# List the core (psr) of every process whose args mention the example;
# the grep and mpiexec processes match as well.
x=( $(ps -o psr,args ax | grep SRWLIB_Example10.py | sort -nu | colrm 4) )
# Print OK when exactly 16 entries are present, else the actual count.
(( ${#x[@]} != 16 )) && echo ${#x[@]} || echo OK
```

If it works, it'll print OK; otherwise it prints the number of processes found (including the grep and the mpiexec). I'll experiment more.

---
I was able to reproduce this simply, and it doesn't seem to have anything to do with Docker. See robnagler/mpi-play#1.
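The linked issue isn't quoted here, but a minimal reproduction consistent with the diagnosis below would be two concurrent mpiexec jobs outside any container (a hypothetical sketch; the sleep merely stands in for the real workload):

```sh
# Two independent mpiexec runs, each unaware of the other. With rank
# binding enabled by default, both pin their ranks to the same
# low-numbered cores instead of spreading across the machine.
for i in 1 2; do
    mpiexec -n 4 python -c 'import time; time.sleep(30)' &
done
sleep 5   # give both jobs time to spawn their ranks
# The psr column shows the two jobs stacked on the same cores:
ps axww -o psr,args | grep '[t]ime.sleep' | sort -n
wait
```

---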
FTR, the difference between the Debian native and container runs was caused by OpenMPI 1.6 (native) vs. 1.8 (container), which changed the default to bind-to core.
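Since each mpiexec binds its ranks without knowledge of other jobs, concurrent invocations pile onto the same cores. Assuming OpenMPI 1.8 in the container, the binding can be inspected or switched off with standard flags (not quoted from the thread):

```sh
# --report-bindings prints each rank's core binding to stderr, which
# makes the overlap between two concurrent jobs directly visible.
mpiexec --report-bindings -n 8 python SRWLIB_Example10.py

# --bind-to none disables the binding and lets the kernel scheduler
# spread the ranks over all allowed cores.
mpiexec --bind-to none -n 8 python SRWLIB_Example10.py
```

---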
I started 6 independent partially coherent calculations in 6 different browsers, and they appear to work efficiently, occupying 6 × 8 = 48 cores:

```
root@cpu-001:~# ps axww -o psr,ppid,pid,cp,ucomm,args | sort -n | grep 'mpi_run.py' | grep -v mpiexec | grep -v grep
2 47120 47126 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
5 47164 47170 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
6 48449 48453 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
7 47120 47125 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
8 47164 47167 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
10 47120 47128 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
11 47264 47272 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
12 47120 47129 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
14 47120 47123 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
15 47723 47728 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
16 47723 47731 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
17 47264 47271 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
18 47264 47268 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
21 47264 47269 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
22 47264 47266 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
24 47120 47122 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
26 47120 47127 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
27 47723 47727 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
28 47636 47645 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
31 48449 48455 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
32 47723 47732 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
33 48449 48454 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
34 48449 48451 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
35 48449 48458 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
36 47636 47642 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
37 47164 47171 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
38 47164 47173 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
40 47636 47641 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
44 47636 47638 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
46 47164 47168 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
48 47636 47640 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
49 47636 47644 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
50 47723 47729 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
53 47723 47725 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
54 47264 47267 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
55 47164 47169 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
56 47636 47643 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
57 47723 47730 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
58 48449 48452 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
61 47264 47270 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
62 48449 48456 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
63 48449 48457 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
64 47164 47166 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
65 47164 47172 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
66 47120 47124 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
67 47723 47726 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
70 47264 47273 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
71 47636 47639 999 python /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
root@cpu-001:~# ps axww -o psr,ppid,pid,cp,ucomm,args | sort -n | grep 'mpi_run.py' | grep -v mpiexec | grep -v grep | wc -l
48
```

Thanks for fixing the issue, Rob!