Docker not using all cores #13

Closed
robnagler opened this issue Mar 22, 2016 · 19 comments

@robnagler
Member

A cluster at BNL fails to use all cores when SRW runs in Docker. Running outside Docker works normally.

We have isolated it to running N instances of SRW inside a single container. If you run 2 instances (i=2) with 4 slaves each (n=4, mpiexec -n 4), 8 cores are used. However, if you run, say, i=4 and n=4, it still uses only 8 cores instead of the expected 16.

Adding @mrakitin

@robnagler
Member Author

From @mrakitin:

I think I got something interesting with parallel execution. When I use the following command there are 12 cores allocated:

docker run --cpuset-cpus 50-71 --name x0 --rm s10 su - vagrant -c 'for i in 1 2 3 4 5 6; do mpiexec -verbose -n 2 python SRWLIB_Example10.py & done; wait'

When I tried cpus starting from 0 that didn’t work. Probably CPU #0 is somewhat “special”…
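For reference, the value given to --cpuset-cpus is a comma-separated list of CPU ids and inclusive ranges, so 50-71 names 22 logical CPUs. A minimal Python sketch for expanding such a spec when comparing it against the psr column of ps (parse_cpuset is a hypothetical helper, not part of Docker):

```python
def parse_cpuset(spec):
    """Expand a cpuset string such as "0-3,8" into a sorted list of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # Ranges are inclusive on both ends, e.g. "50-71" covers 50..71.
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


if __name__ == "__main__":
    print(len(parse_cpuset("50-71")), "CPUs allowed")
```

With 6 instances of mpiexec -n 2 there are 12 workers, so 12 busy cores out of the allowed set would mean one worker per core.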

@robnagler
Member Author

Maxim, would you try this to see if we can get all 32 cores used with one MPI master:

docker run --rm s10 su - vagrant -c 'for i in $(seq 1); do mpiexec -verbose -n 32 python SRWLIB_Example10.py & done; wait'

I'd like to test your "cpu0" hypothesis with:

docker run --cpuset-cpus 1-71 --rm s10 su - vagrant -c 'for i in $(seq 4); do mpiexec -verbose -n 8 python SRWLIB_Example10.py & done; wait'

And:

docker run --cpuset-cpus 0-71 --rm s10 su - vagrant -c 'for i in $(seq 4); do mpiexec -verbose -n 8 python SRWLIB_Example10.py & done; wait'

I'm skeptical of i=6, because we saw different behavior with i divisible by 3.
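The runs above differ only in i, n, and the cpuset, so the command lines can be generated rather than typed. A rough sketch, assuming the same image name (s10), user (vagrant), and script as in the commands above; build_docker_cmd is a hypothetical helper and nothing is executed here:

```python
import shlex


def build_docker_cmd(i, n, cpuset=None, image="s10", script="SRWLIB_Example10.py"):
    """Return the docker argv for i background mpiexec runs of n ranks each."""
    inner = (
        "for i in $(seq {i}); do mpiexec -verbose -n {n} python {script} & done; wait"
    ).format(i=i, n=n, script=script)
    cmd = ["docker", "run"]
    if cpuset is not None:
        # Only restrict the container's CPUs when a cpuset is requested.
        cmd += ["--cpuset-cpus", cpuset]
    cmd += ["--rm", image, "su", "-", "vagrant", "-c", inner]
    return cmd


if __name__ == "__main__":
    # Print the three test cases as shell-quoted command lines.
    for i, n, cpuset in [(1, 32, None), (4, 8, "1-71"), (4, 8, "0-71")]:
        print(" ".join(shlex.quote(arg) for arg in build_docker_cmd(i, n, cpuset)))
```

Returning an argv list keeps the inner shell loop as a single argument to su -c, which avoids a second layer of quoting if the list is later passed to subprocess.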

@mrakitin

Hi Rob,

Here are the requested results:

  1. i=1, n=32:
root@cpu-001:~# ps axww -o pid,psr,cp,ucomm,args | sort --key=2 -n | grep 'SRWLIB_Example10.py'
37231   0 733 python          python SRWLIB_Example10.py
37076   1  24 su              su - vagrant -c for i in $(seq 1); do mpiexec -verbose -n 32 python SRWLIB_Example10.py & done; wait
37340   2 650 python          python SRWLIB_Example10.py
37484   4 682 python          python SRWLIB_Example10.py
37848   9 715 python          python SRWLIB_Example10.py
38066  12 798 python          python SRWLIB_Example10.py
37232  18 623 python          python SRWLIB_Example10.py
37304  19 652 python          python SRWLIB_Example10.py
37376  20 652 python          python SRWLIB_Example10.py
37107  21   0 bash            -bash -c for i in $(seq 1); do mpiexec -verbose -n 32 python SRWLIB_Example10.py & done; wait
37595  23 683 python          python SRWLIB_Example10.py
37886  27 756 python          python SRWLIB_Example10.py
37012  28  20 docker          docker run --rm s10 su - vagrant -c for i in $(seq 1); do mpiexec -verbose -n 32 python SRWLIB_Example10.py & done; wait
37958  28 756 python          python SRWLIB_Example10.py
37229  34  17 mpiexec         mpiexec -verbose -n 32 python SRWLIB_Example10.py
37268  37 621 python          python SRWLIB_Example10.py
37412  39 649 python          python SRWLIB_Example10.py
37559  41 681 python          python SRWLIB_Example10.py
37631  42 681 python          python SRWLIB_Example10.py
37703  43 716 python          python SRWLIB_Example10.py
37776  44 715 python          python SRWLIB_Example10.py
37922  46 755 python          python SRWLIB_Example10.py
37994  47 754 python          python SRWLIB_Example10.py
38138  49 797 python          python SRWLIB_Example10.py
38210  50 797 python          python SRWLIB_Example10.py
38282  51 846 python          python SRWLIB_Example10.py
38472  55   0 grep            grep SRWLIB_Example10.py
37448  57 651 python          python SRWLIB_Example10.py
37523  58 684 python          python SRWLIB_Example10.py
37667  60 683 python          python SRWLIB_Example10.py
37740  61 718 python          python SRWLIB_Example10.py
37812  62 717 python          python SRWLIB_Example10.py
38030  65 756 python          python SRWLIB_Example10.py
38102  66 800 python          python SRWLIB_Example10.py
38174  67 799 python          python SRWLIB_Example10.py
38246  68 798 python          python SRWLIB_Example10.py
38318  69 848 python          python SRWLIB_Example10.py
  2. i=4, n=8 on cores 1 to 71:
root@cpu-001:~# ps axww -o pid,psr,cp,ucomm,args | sort --key=2 -n | grep 'SRWLIB_Example10.py'
41156   1 457 python          python SRWLIB_Example10.py
41161   1 453 python          python SRWLIB_Example10.py
41444   2 463 python          python SRWLIB_Example10.py
41448   2 451 python          python SRWLIB_Example10.py
41732   3 509 python          python SRWLIB_Example10.py
41751   3 496 python          python SRWLIB_Example10.py
41000   6  35 mpiexec         mpiexec -verbose -n 8 python SRWLIB_Example10.py
41001  11  36 mpiexec         mpiexec -verbose -n 8 python SRWLIB_Example10.py
40767  12  17 docker          docker run --cpuset-cpus 1-71 --rm s10 su - vagrant -c for i in $(seq 4); do mpiexec -verbose -n 8 python SRWLIB_Example10.py & done; wait
41002  13  34 mpiexec         mpiexec -verbose -n 8 python SRWLIB_Example10.py
41003  15  35 mpiexec         mpiexec -verbose -n 8 python SRWLIB_Example10.py
41012  18 429 python          python SRWLIB_Example10.py
41017  18 418 python          python SRWLIB_Example10.py
41301  19 466 python          python SRWLIB_Example10.py
41306  19 454 python          python SRWLIB_Example10.py
41588  20 456 python          python SRWLIB_Example10.py
41600  20 452 python          python SRWLIB_Example10.py
40846  21  50 su              su - vagrant -c for i in $(seq 4); do mpiexec -verbose -n 8 python SRWLIB_Example10.py & done; wait
40877  21   1 bash            -bash -c for i in $(seq 4); do mpiexec -verbose -n 8 python SRWLIB_Example10.py & done; wait
41876  21 499 python          python SRWLIB_Example10.py
41893  21 484 python          python SRWLIB_Example10.py
41006  36 214 python          python SRWLIB_Example10.py
41007  36 215 python          python SRWLIB_Example10.py
41010  36 217 python          python SRWLIB_Example10.py
41011  36 214 python          python SRWLIB_Example10.py
41163  37 466 python          python SRWLIB_Example10.py
41167  37 453 python          python SRWLIB_Example10.py
41450  38 455 python          python SRWLIB_Example10.py
41465  38 453 python          python SRWLIB_Example10.py
41736  39 496 python          python SRWLIB_Example10.py
41739  39 496 python          python SRWLIB_Example10.py
42193  40   0 grep            grep SRWLIB_Example10.py
41013  54 421 python          python SRWLIB_Example10.py
41015  54 418 python          python SRWLIB_Example10.py
41300  55 457 python          python SRWLIB_Example10.py
41326  55 454 python          python SRWLIB_Example10.py
41589  56 464 python          python SRWLIB_Example10.py
41596  56 453 python          python SRWLIB_Example10.py
41891  57 485 python          python SRWLIB_Example10.py
41901  57 485 python          python SRWLIB_Example10.py

i=4, n=8 on cores 0 to 71:
(results are identical)

root@cpu-001:~# ps axww -o pid,psr,cp,ucomm,args | sort --key=2 -n | grep 'SRWLIB_Example10.py'
42493   0 449 python          python SRWLIB_Example10.py
42494   0 447 python          python SRWLIB_Example10.py
42644   1 480 python          python SRWLIB_Example10.py
42664   1 482 python          python SRWLIB_Example10.py
42958   2 477 python          python SRWLIB_Example10.py
42968   2 478 python          python SRWLIB_Example10.py
43219   3 530 python          python SRWLIB_Example10.py
43233   3 518 python          python SRWLIB_Example10.py
42483   6  30 mpiexec         mpiexec -verbose -n 8 python SRWLIB_Example10.py
43678   6   0 grep            grep SRWLIB_Example10.py
42484   9  30 mpiexec         mpiexec -verbose -n 8 python SRWLIB_Example10.py
42495  18 462 python          python SRWLIB_Example10.py
42505  18 449 python          python SRWLIB_Example10.py
42794  19 480 python          python SRWLIB_Example10.py
42801  19 480 python          python SRWLIB_Example10.py
43075  20 538 python          python SRWLIB_Example10.py
43084  20 517 python          python SRWLIB_Example10.py
42361  21   1 bash            -bash -c for i in $(seq 4); do mpiexec -verbose -n 8 python SRWLIB_Example10.py & done; wait
43366  21 516 python          python SRWLIB_Example10.py
43401  21 512 python          python SRWLIB_Example10.py
42485  23  34 mpiexec         mpiexec -verbose -n 8 python SRWLIB_Example10.py
42486  23  35 mpiexec         mpiexec -verbose -n 8 python SRWLIB_Example10.py
42330  32  26 su              su - vagrant -c for i in $(seq 4); do mpiexec -verbose -n 8 python SRWLIB_Example10.py & done; wait
42267  34  25 docker          docker run --cpuset-cpus 0-71 --rm s10 su - vagrant -c for i in $(seq 4); do mpiexec -verbose -n 8 python SRWLIB_Example10.py & done; wait
42490  36 447 python          python SRWLIB_Example10.py
42491  36 462 python          python SRWLIB_Example10.py
42639  37 498 python          python SRWLIB_Example10.py
42649  37 480 python          python SRWLIB_Example10.py
42931  38 496 python          python SRWLIB_Example10.py
42939  38 478 python          python SRWLIB_Example10.py
43226  39 518 python          python SRWLIB_Example10.py
43246  39 516 python          python SRWLIB_Example10.py
42499  54 449 python          python SRWLIB_Example10.py
42514  54 449 python          python SRWLIB_Example10.py
42785  55 498 python          python SRWLIB_Example10.py
42815  55 479 python          python SRWLIB_Example10.py
43094  56 517 python          python SRWLIB_Example10.py
43107  56 517 python          python SRWLIB_Example10.py
43374  57 511 python          python SRWLIB_Example10.py
43384  57 515 python          python SRWLIB_Example10.py
  3. I also tried an additional CPU selection, i=4, n=8 on cores 40 to 71:
    Some cores are occupied by a single process, while others are shared among several processes.
root@cpu-001:~# ps axww -o pid,psr,cp,ucomm,args | sort --key=2 -n | grep 'SRWLIB_Example10.py'
43754   0  17 docker          docker run --cpuset-cpus 40-71 --rm s10 su - vagrant -c for i in $(seq 4); do mpiexec -verbose -n 8 python SRWLIB_Example10.py & done; wait
45163  38   0 grep            grep SRWLIB_Example10.py
44702  40 973 python          python SRWLIB_Example10.py
43978  41 874 python          python SRWLIB_Example10.py
44418  42 921 python          python SRWLIB_Example10.py
44127  43 872 python          python SRWLIB_Example10.py
43973  44  20 mpiexec         mpiexec -verbose -n 8 python SRWLIB_Example10.py
44128  44 874 python          python SRWLIB_Example10.py
44130  45 874 python          python SRWLIB_Example10.py
44414  46 922 python          python SRWLIB_Example10.py
43819  47  11 su              su - vagrant -c for i in $(seq 4); do mpiexec -verbose -n 8 python SRWLIB_Example10.py & done; wait
43980  47 872 python          python SRWLIB_Example10.py
43971  48  20 mpiexec         mpiexec -verbose -n 8 python SRWLIB_Example10.py
44126  48 874 python          python SRWLIB_Example10.py
44708  49 974 python          python SRWLIB_Example10.py
44416  50 922 python          python SRWLIB_Example10.py
43972  51  20 mpiexec         mpiexec -verbose -n 8 python SRWLIB_Example10.py
43981  51 875 python          python SRWLIB_Example10.py
44422  52 921 python          python SRWLIB_Example10.py
43979  53 875 python          python SRWLIB_Example10.py
43982  54 224 python          python SRWLIB_Example10.py
43983  54 224 python          python SRWLIB_Example10.py
43984  54 224 python          python SRWLIB_Example10.py
43985  54 224 python          python SRWLIB_Example10.py
44266  55 236 python          python SRWLIB_Example10.py
44270  55 236 python          python SRWLIB_Example10.py
44274  55 236 python          python SRWLIB_Example10.py
44279  55 236 python          python SRWLIB_Example10.py
44553  56 235 python          python SRWLIB_Example10.py
44560  56 235 python          python SRWLIB_Example10.py
44567  56 235 python          python SRWLIB_Example10.py
44570  56 235 python          python SRWLIB_Example10.py
43848  57   1 bash            -bash -c for i in $(seq 4); do mpiexec -verbose -n 8 python SRWLIB_Example10.py & done; wait
44844  57 245 python          python SRWLIB_Example10.py
44845  57 245 python          python SRWLIB_Example10.py
44846  57 245 python          python SRWLIB_Example10.py
44850  57 244 python          python SRWLIB_Example10.py
44706  58 972 python          python SRWLIB_Example10.py
43970  59  20 mpiexec         mpiexec -verbose -n 8 python SRWLIB_Example10.py
44704  62 972 python          python SRWLIB_Example10.py
  4. On all cores (no --cpuset-cpus), the results are the same as in item (2):
root@cpu-001:~# ps axww -o pid,psr,cp,ucomm,args | sort --key=2 -n | grep 'SRWLIB_Example10.py'
39495   0 444 python          python SRWLIB_Example10.py
39498   0 442 python          python SRWLIB_Example10.py
39646   1 473 python          python SRWLIB_Example10.py
39652   1 473 python          python SRWLIB_Example10.py
39941   2 475 python          python SRWLIB_Example10.py
39970   2 503 python          python SRWLIB_Example10.py
39368   3   1 bash            -bash -c for i in $(seq 4); do mpiexec -verbose -n 8 python SRWLIB_Example10.py & done; wait
40222   3 505 python          python SRWLIB_Example10.py
40235   3 503 python          python SRWLIB_Example10.py
39491   5  31 mpiexec         mpiexec -verbose -n 8 python SRWLIB_Example10.py
39490   6  32 mpiexec         mpiexec -verbose -n 8 python SRWLIB_Example10.py
39507  18 475 python          python SRWLIB_Example10.py
39520  18 470 python          python SRWLIB_Example10.py
39820  19 467 python          python SRWLIB_Example10.py
39836  19 474 python          python SRWLIB_Example10.py
40091  20 512 python          python SRWLIB_Example10.py
40115  20 502 python          python SRWLIB_Example10.py
40366  21 500 python          python SRWLIB_Example10.py
40381  21 497 python          python SRWLIB_Example10.py
39493  25  32 mpiexec         mpiexec -verbose -n 8 python SRWLIB_Example10.py
39500  36 444 python          python SRWLIB_Example10.py
39501  36 441 python          python SRWLIB_Example10.py
39277  37  18 docker          docker run --rm s10 su - vagrant -c for i in $(seq 4); do mpiexec -verbose -n 8 python SRWLIB_Example10.py & done; wait
39335  37  31 su              su - vagrant -c for i in $(seq 4); do mpiexec -verbose -n 8 python SRWLIB_Example10.py & done; wait
39660  37 472 python          python SRWLIB_Example10.py
39662  37 467 python          python SRWLIB_Example10.py
39926  38 472 python          python SRWLIB_Example10.py
39949  38 468 python          python SRWLIB_Example10.py
40226  39 513 python          python SRWLIB_Example10.py
40278  39 501 python          python SRWLIB_Example10.py
40680  41   0 grep            grep SRWLIB_Example10.py
39502  54 473 python          python SRWLIB_Example10.py
39510  54 470 python          python SRWLIB_Example10.py
39782  55 474 python          python SRWLIB_Example10.py
39795  55 469 python          python SRWLIB_Example10.py
40071  56 506 python          python SRWLIB_Example10.py
40084  56 503 python          python SRWLIB_Example10.py
40367  57 506 python          python SRWLIB_Example10.py
40377  57 496 python          python SRWLIB_Example10.py
39492  71  32 mpiexec         mpiexec -verbose -n 8 python SRWLIB_Example10.py
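Reading these listings by eye is error-prone, so it may help to tally the python workers per core from the psr column. A minimal sketch, assuming the column layout of the ps command used above (pid, psr, cp, ucomm, args); workers_per_core and summarize are hypothetical helper names:

```python
from collections import Counter


def workers_per_core(ps_text):
    """Count python workers per core from `ps -o pid,psr,cp,ucomm,args` output."""
    counts = Counter()
    for line in ps_text.splitlines():
        fields = line.split(None, 4)
        # Skip the shell prompt, blank lines, and anything without numeric pid/psr.
        if len(fields) < 4 or not fields[0].isdigit() or not fields[1].isdigit():
            continue
        # Only the python ranks do the work; ignore mpiexec, su, bash, docker, grep.
        if fields[3] != "python":
            continue
        counts[int(fields[1])] += 1
    return counts


def summarize(ps_text):
    """Return total workers, distinct cores used, and cores holding more than one worker."""
    counts = workers_per_core(ps_text)
    shared = {core: n for core, n in counts.items() if n > 1}
    return {
        "workers": sum(counts.values()),
        "cores_used": len(counts),
        "shared_cores": shared,
    }


if __name__ == "__main__":
    import sys

    # Usage: ps axww -o pid,psr,cp,ucomm,args | python summarize_ps.py
    print(summarize(sys.stdin.read()))
```

If every worker had its own core, cores_used would equal workers and shared_cores would be empty.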

@robnagler
Member Author

robnagler commented Mar 23, 2016 via email

@mrakitin

@robnagler, I updated my previous comment with the requested ps output.

Here are the results with fixed i=1:

  • n=17:
root@cpu-001:~# ps axww -o pid,psr,cp,ucomm,args | sort --key=2 -n | grep 'SRWLIB_Example10.py'
45881   1 807 python          python SRWLIB_Example10.py
45953   2 862 python          python SRWLIB_Example10.py
45720   3   1 bash            -bash -c for i in $(seq 1); do mpiexec -verbose -n 17 python SRWLIB_Example10.py & done; wait
46027   3 857 python          python SRWLIB_Example10.py
46099   4 928 python          python SRWLIB_Example10.py
46171   5 924 python          python SRWLIB_Example10.py
46245   6 997 python          python SRWLIB_Example10.py
46317   7 999 python          python SRWLIB_Example10.py
46389   8 998 python          python SRWLIB_Example10.py
45842   9  18 mpiexec         mpiexec -verbose -n 17 python SRWLIB_Example10.py
45618  10  21 docker          docker run --rm s10 su - vagrant -c for i in $(seq 1); do mpiexec -verbose -n 17 python SRWLIB_Example10.py & done; wait
45917  19 859 python          python SRWLIB_Example10.py
46063  21 925 python          python SRWLIB_Example10.py
46135  22 922 python          python SRWLIB_Example10.py
46353  25 995 python          python SRWLIB_Example10.py
45679  33  21 su              su - vagrant -c for i in $(seq 1); do mpiexec -verbose -n 17 python SRWLIB_Example10.py & done; wait
45844  36 802 python          python SRWLIB_Example10.py
45845  54 806 python          python SRWLIB_Example10.py
45991  56 864 python          python SRWLIB_Example10.py
46209  59 927 python          python SRWLIB_Example10.py
46281  60 999 python          python SRWLIB_Example10.py
46465  62   0 grep            grep SRWLIB_Example10.py
  • n=37:
root@cpu-001:~# ps axww -o pid,psr,cp,ucomm,args | sort --key=2 -n | grep 'SRWLIB_Example10.py'
46787   1 640 python          python SRWLIB_Example10.py
47330   2 756 python          python SRWLIB_Example10.py
47004   3 675 python          python SRWLIB_Example10.py
47040   4 714 python          python SRWLIB_Example10.py
47222   5 756 python          python SRWLIB_Example10.py
47693   6 860 python          python SRWLIB_Example10.py
47765   7 859 python          python SRWLIB_Example10.py
47912   8 923 python          python SRWLIB_Example10.py
47546   9 804 python          python SRWLIB_Example10.py
47837  10 922 python          python SRWLIB_Example10.py
47984  11 999 python          python SRWLIB_Example10.py
47873  12 925 python          python SRWLIB_Example10.py
47294  13 756 python          python SRWLIB_Example10.py
47184  14 712 python          python SRWLIB_Example10.py
48059  15 999 python          python SRWLIB_Example10.py
47258  16 756 python          python SRWLIB_Example10.py
46968  17 674 python          python SRWLIB_Example10.py
47402  18 804 python          python SRWLIB_Example10.py
46641  19  13 su              su - vagrant -c for i in $(seq 1); do mpiexec -verbose -n 37 python SRWLIB_Example10.py & done; wait
47076  19 713 python          python SRWLIB_Example10.py
46824  20 675 python          python SRWLIB_Example10.py
46663  21   1 bash            -bash -c for i in $(seq 1); do mpiexec -verbose -n 37 python SRWLIB_Example10.py & done; wait
47474  22 804 python          python SRWLIB_Example10.py
46896  23 675 python          python SRWLIB_Example10.py
47801  24 926 python          python SRWLIB_Example10.py
47948  25 924 python          python SRWLIB_Example10.py
48023  26 999 python          python SRWLIB_Example10.py
46860  27 675 python          python SRWLIB_Example10.py
47148  28 713 python          python SRWLIB_Example10.py
47729  29 860 python          python SRWLIB_Example10.py
47366  30 756 python          python SRWLIB_Example10.py
47582  31 804 python          python SRWLIB_Example10.py
46788  32 641 python          python SRWLIB_Example10.py
47654  33 860 python          python SRWLIB_Example10.py
47112  34 711 python          python SRWLIB_Example10.py
47438  35 806 python          python SRWLIB_Example10.py
47618  36 860 python          python SRWLIB_Example10.py
46577  37  31 docker          docker run --rm s10 su - vagrant -c for i in $(seq 1); do mpiexec -verbose -n 37 python SRWLIB_Example10.py & done; wait
46785  37  22 mpiexec         mpiexec -verbose -n 37 python SRWLIB_Example10.py
46932  39 675 python          python SRWLIB_Example10.py
48143  53   0 grep            grep SRWLIB_Example10.py
47510  54 805 python          python SRWLIB_Example10.py
  • n=48:
root@cpu-001:~# ps axww -o pid,psr,cp,ucomm,args | sort --key=2 -n | grep 'SRWLIB_Example10.py'
49796   0 850 python          python SRWLIB_Example10.py
48786   4 672 python          python SRWLIB_Example10.py
50201   5   0 grep            grep SRWLIB_Example10.py
48859   6 701 python          python SRWLIB_Example10.py
48534   7 644 python          python SRWLIB_Example10.py
48895   8 701 python          python SRWLIB_Example10.py
49111   9 732 python          python SRWLIB_Example10.py
49039  11 733 python          python SRWLIB_Example10.py
49003  12 700 python          python SRWLIB_Example10.py
48459  13  18 mpiexec         mpiexec -verbose -n 48 python SRWLIB_Example10.py
49291  13 769 python          python SRWLIB_Example10.py
48642  14 645 python          python SRWLIB_Example10.py
48239  15  14 docker          docker run --rm s10 su - vagrant -c for i in $(seq 1); do mpiexec -verbose -n 48 python SRWLIB_Example10.py & done; wait
50086  15 953 python          python SRWLIB_Example10.py
48570  16 645 python          python SRWLIB_Example10.py
49579  17 807 python          python SRWLIB_Example10.py
48931  19 702 python          python SRWLIB_Example10.py
48337  21   0 bash            -bash -c for i in $(seq 1); do mpiexec -verbose -n 48 python SRWLIB_Example10.py & done; wait
48498  21 645 python          python SRWLIB_Example10.py
49760  22 850 python          python SRWLIB_Example10.py
49147  23 733 python          python SRWLIB_Example10.py
49832  26 900 python          python SRWLIB_Example10.py
49183  29 733 python          python SRWLIB_Example10.py
49471  30 808 python          python SRWLIB_Example10.py
49327  31 769 python          python SRWLIB_Example10.py
49942  33 900 python          python SRWLIB_Example10.py
48303  34  24 su              su - vagrant -c for i in $(seq 1); do mpiexec -verbose -n 48 python SRWLIB_Example10.py & done; wait
48678  34 671 python          python SRWLIB_Example10.py
48461  35 620 python          python SRWLIB_Example10.py
49507  36 807 python          python SRWLIB_Example10.py
49868  37 900 python          python SRWLIB_Example10.py
49075  38 732 python          python SRWLIB_Example10.py
49435  39 808 python          python SRWLIB_Example10.py
49219  41 732 python          python SRWLIB_Example10.py
49651  43 851 python          python SRWLIB_Example10.py
50122  44 951 python          python SRWLIB_Example10.py
50014  46 955 python          python SRWLIB_Example10.py
49363  47 768 python          python SRWLIB_Example10.py
49723  49 851 python          python SRWLIB_Example10.py
49543  50 807 python          python SRWLIB_Example10.py
48823  52 671 python          python SRWLIB_Example10.py
48714  53 672 python          python SRWLIB_Example10.py
49904  55 898 python          python SRWLIB_Example10.py
50050  56 953 python          python SRWLIB_Example10.py
49687  57 851 python          python SRWLIB_Example10.py
49978  59 896 python          python SRWLIB_Example10.py
48750  60 672 python          python SRWLIB_Example10.py
49615  61 852 python          python SRWLIB_Example10.py
49399  63 768 python          python SRWLIB_Example10.py
49255  64 769 python          python SRWLIB_Example10.py
48606  68 645 python          python SRWLIB_Example10.py
48967  69 700 python          python SRWLIB_Example10.py
48462  70 645 python          python SRWLIB_Example10.py
  • n=72:
root@cpu-001:~# ps axww -o pid,psr,cp,ucomm,args | sort --key=2 -n | grep 'SRWLIB_Example10.py'
53131   0   0 grep            grep SRWLIB_Example10.py
52359   1 685 python          python SRWLIB_Example10.py
51375   2 488 python          python SRWLIB_Example10.py
50273   3  17 docker          docker run --rm s10 su - vagrant -c for i in $(seq 1); do mpiexec -verbose -n 72 python SRWLIB_Example10.py & done; wait
50369   3   0 bash            -bash -c for i in $(seq 1); do mpiexec -verbose -n 72 python SRWLIB_Example10.py & done; wait
50859   3 423 python          python SRWLIB_Example10.py
52759   4 837 python          python SRWLIB_Example10.py
50530   5 404 python          python SRWLIB_Example10.py
50820   6 423 python          python SRWLIB_Example10.py
52107   7 640 python          python SRWLIB_Example10.py
51630   8 518 python          python SRWLIB_Example10.py
50931   9 446 python          python SRWLIB_Example10.py
51447  10 485 python          python SRWLIB_Example10.py
50491  11  27 mpiexec         mpiexec -verbose -n 72 python SRWLIB_Example10.py
52287  11 687 python          python SRWLIB_Example10.py
52904  12 923 python          python SRWLIB_Example10.py
53048  13 928 python          python SRWLIB_Example10.py
50638  14 405 python          python SRWLIB_Example10.py
50674  15 390 python          python SRWLIB_Example10.py
51777  16 552 python          python SRWLIB_Example10.py
51741  17 555 python          python SRWLIB_Example10.py
50336  18  11 su              su - vagrant -c for i in $(seq 1); do mpiexec -verbose -n 72 python SRWLIB_Example10.py & done; wait
51153  18 463 python          python SRWLIB_Example10.py
51117  19 471 python          python SRWLIB_Example10.py
51594  20 518 python          python SRWLIB_Example10.py
52577  21 756 python          python SRWLIB_Example10.py
50748  22 425 python          python SRWLIB_Example10.py
51411  23 483 python          python SRWLIB_Example10.py
51264  24 455 python          python SRWLIB_Example10.py
52323  25 687 python          python SRWLIB_Example10.py
52541  26 754 python          python SRWLIB_Example10.py
50566  27 405 python          python SRWLIB_Example10.py
50494  28 386 python          python SRWLIB_Example10.py
51669  29 518 python          python SRWLIB_Example10.py
51006  30 446 python          python SRWLIB_Example10.py
50967  31 447 python          python SRWLIB_Example10.py
50602  32 404 python          python SRWLIB_Example10.py
52613  33 754 python          python SRWLIB_Example10.py
50784  34 425 python          python SRWLIB_Example10.py
50493  35 384 python          python SRWLIB_Example10.py
51078  36 437 python          python SRWLIB_Example10.py
52685  36 831 python          python SRWLIB_Example10.py
52143  37 631 python          python SRWLIB_Example10.py
51924  38 593 python          python SRWLIB_Example10.py
52649  39 756 python          python SRWLIB_Example10.py
51483  40 514 python          python SRWLIB_Example10.py
51963  41 592 python          python SRWLIB_Example10.py
51999  42 587 python          python SRWLIB_Example10.py
51705  43 551 python          python SRWLIB_Example10.py
52035  44 589 python          python SRWLIB_Example10.py
52251  45 639 python          python SRWLIB_Example10.py
51522  46 518 python          python SRWLIB_Example10.py
51813  47 551 python          python SRWLIB_Example10.py
51336  48 488 python          python SRWLIB_Example10.py
51189  49 464 python          python SRWLIB_Example10.py
52976  50 936 python          python SRWLIB_Example10.py
50710  51 420 python          python SRWLIB_Example10.py
51225  52 470 python          python SRWLIB_Example10.py
52468  53 689 python          python SRWLIB_Example10.py
51300  54 485 python          python SRWLIB_Example10.py
52721  55 841 python          python SRWLIB_Example10.py
52071  56 638 python          python SRWLIB_Example10.py
51852  57 552 python          python SRWLIB_Example10.py
52795  58 836 python          python SRWLIB_Example10.py
51558  59 516 python          python SRWLIB_Example10.py
52215  60 635 python          python SRWLIB_Example10.py
50895  61 442 python          python SRWLIB_Example10.py
52395  62 680 python          python SRWLIB_Example10.py
52179  63 637 python          python SRWLIB_Example10.py
52432  64 683 python          python SRWLIB_Example10.py
52940  65 933 python          python SRWLIB_Example10.py
51888  66 593 python          python SRWLIB_Example10.py
52868  67 836 python          python SRWLIB_Example10.py
52505  68 761 python          python SRWLIB_Example10.py
52832  69 826 python          python SRWLIB_Example10.py
51042  70 444 python          python SRWLIB_Example10.py
53012  71 936 python          python SRWLIB_Example10.py

@mrakitin

Without Docker, the distribution of the processes over the cores looks fine.
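One way to narrow this down might be to compare the CPU affinity mask a worker sees inside and outside the container. A minimal sketch, assuming Python 3.3+ on Linux (os.sched_getaffinity is not available in Python 2 or on every platform); allowed_cpus is a hypothetical helper and the SRW scripts themselves are untouched:

```python
import os


def allowed_cpus(pid=0):
    """Return the sorted CPU ids the given process may run on (pid 0 means this process)."""
    if not hasattr(os, "sched_getaffinity"):
        raise NotImplementedError("os.sched_getaffinity is not available on this platform")
    return sorted(os.sched_getaffinity(pid))


if __name__ == "__main__":
    cpus = allowed_cpus()
    print("pid %d may run on %d CPUs: %s" % (os.getpid(), len(cpus), cpus))
```

Launching this script under mpiexec -n 4 in both environments would show whether each rank is restricted to a single core or free to run on the whole cpuset.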

@robnagler
Member Author

robnagler commented Mar 26, 2016 via email

@mrakitin

Hi Rob,

For i=2 and n=16 I got the following results:

  • cpuset=0-35:
root@cpu-001:~# ps axww -o pid,psr,cp,ucomm,args | sort --key=2 -n | grep 'SRWLIB_Example10.py'
21235   0 420 python          python SRWLIB_Example10.py
21236   0 420 python          python SRWLIB_Example10.py
21309   1 434 python          python SRWLIB_Example10.py
21311   1 434 python          python SRWLIB_Example10.py
21453   2 433 python          python SRWLIB_Example10.py
21454   2 433 python          python SRWLIB_Example10.py
21597   3 450 python          python SRWLIB_Example10.py
21598   3 450 python          python SRWLIB_Example10.py
21741   4 468 python          python SRWLIB_Example10.py
21743   4 467 python          python SRWLIB_Example10.py
21885   5 468 python          python SRWLIB_Example10.py
21886   5 468 python          python SRWLIB_Example10.py
22029   6 486 python          python SRWLIB_Example10.py
22031   6 485 python          python SRWLIB_Example10.py
22173   7 485 python          python SRWLIB_Example10.py
22175   7 485 python          python SRWLIB_Example10.py
21237  18 435 python          python SRWLIB_Example10.py
21239  18 434 python          python SRWLIB_Example10.py
21077  19  21 su              su - vagrant -c for i in $(seq 2); do mpiexec -verbose -n 16 python SRWLIB_Example10.py & done; wait
21381  19 434 python          python SRWLIB_Example10.py
21383  19 434 python          python SRWLIB_Example10.py
21525  20 451 python          python SRWLIB_Example10.py
21526  20 451 python          python SRWLIB_Example10.py
21107  21   0 bash            -bash -c for i in $(seq 2); do mpiexec -verbose -n 16 python SRWLIB_Example10.py & done; wait
21669  21 450 python          python SRWLIB_Example10.py
21671  21 449 python          python SRWLIB_Example10.py
21813  22 467 python          python SRWLIB_Example10.py
21815  22 466 python          python SRWLIB_Example10.py
21957  23 467 python          python SRWLIB_Example10.py
21959  23 467 python          python SRWLIB_Example10.py
22101  24 486 python          python SRWLIB_Example10.py
22103  24 485 python          python SRWLIB_Example10.py
22245  25 485 python          python SRWLIB_Example10.py
22247  25 484 python          python SRWLIB_Example10.py
21230  30  11 mpiexec         mpiexec -verbose -n 16 python SRWLIB_Example10.py
20996  31   8 docker          docker run --cpuset-cpus 0-35 --rm s10 su - vagrant -c for i in $(seq 2); do mpiexec -verbose -n 16 python SRWLIB_Example10.py & done; wait
21229  35  10 mpiexec         mpiexec -verbose -n 16 python SRWLIB_Example10.py
22407  63   0 grep            grep SRWLIB_Example10.py
  • cpuset=1-35:
root@cpu-001:~# ps axww -o pid,psr,cp,ucomm,args | sort --key=2 -n | grep 'SRWLIB_Example10.py'
22819   1 398 python          python SRWLIB_Example10.py
22820   1 398 python          python SRWLIB_Example10.py
22963   2 397 python          python SRWLIB_Example10.py
22965   2 397 python          python SRWLIB_Example10.py
23107   3 423 python          python SRWLIB_Example10.py
23109   3 423 python          python SRWLIB_Example10.py
22619   4   1 bash            -bash -c for i in $(seq 2); do mpiexec -verbose -n 16 python SRWLIB_Example10.py & done; wait
23251   4 421 python          python SRWLIB_Example10.py
23252   4 420 python          python SRWLIB_Example10.py
23395   5 448 python          python SRWLIB_Example10.py
23396   5 448 python          python SRWLIB_Example10.py
23539   6 448 python          python SRWLIB_Example10.py
23541   6 448 python          python SRWLIB_Example10.py
23685   7 479 python          python SRWLIB_Example10.py
23687   7 478 python          python SRWLIB_Example10.py
22746  12 746 python          python SRWLIB_Example10.py
22587  13  34 su              su - vagrant -c for i in $(seq 2); do mpiexec -verbose -n 16 python SRWLIB_Example10.py & done; wait
22745  15 745 python          python SRWLIB_Example10.py
22747  18 377 python          python SRWLIB_Example10.py
22748  18 377 python          python SRWLIB_Example10.py
22891  19 398 python          python SRWLIB_Example10.py
22892  19 398 python          python SRWLIB_Example10.py
23035  20 422 python          python SRWLIB_Example10.py
23036  20 421 python          python SRWLIB_Example10.py
23179  21 421 python          python SRWLIB_Example10.py
23181  21 421 python          python SRWLIB_Example10.py
23323  22 449 python          python SRWLIB_Example10.py
23325  22 448 python          python SRWLIB_Example10.py
23467  23 450 python          python SRWLIB_Example10.py
23468  23 450 python          python SRWLIB_Example10.py
23611  24 479 python          python SRWLIB_Example10.py
23613  24 479 python          python SRWLIB_Example10.py
23757  25 476 python          python SRWLIB_Example10.py
23759  25 475 python          python SRWLIB_Example10.py
22741  29  18 mpiexec         mpiexec -verbose -n 16 python SRWLIB_Example10.py
22742  29  18 mpiexec         mpiexec -verbose -n 16 python SRWLIB_Example10.py
22531  37  15 docker          docker run --cpuset-cpus 1-35 --rm s10 su - vagrant -c for i in $(seq 2); do mpiexec -verbose -n 16 python SRWLIB_Example10.py & done; wait
23909  44   0 grep            grep SRWLIB_Example10.py
  • cpuset=17-71:
root@cpu-001:~# ps axww -o pid,psr,cp,ucomm,args | sort --key=2 -n | grep 'SRWLIB_Example10.py'
24215  18 822 python          python SRWLIB_Example10.py
24359  19 821 python          python SRWLIB_Example10.py
24505  20 872 python          python SRWLIB_Example10.py
24087  21   1 bash            -bash -c for i in $(seq 2); do mpiexec -verbose -n 16 python SRWLIB_Example10.py & done; wait
24647  21 871 python          python SRWLIB_Example10.py
24792  22 928 python          python SRWLIB_Example10.py
24935  23 928 python          python SRWLIB_Example10.py
25079  24 927 python          python SRWLIB_Example10.py
25223  25 990 python          python SRWLIB_Example10.py
24209  27  18 mpiexec         mpiexec -verbose -n 16 python SRWLIB_Example10.py
23993  33  23 docker          docker run --cpuset-cpus 17-71 --rm s10 su - vagrant -c for i in $(seq 2); do mpiexec -verbose -n 16 python SRWLIB_Example10.py & done; wait
24213  36 392 python          python SRWLIB_Example10.py
24214  36 393 python          python SRWLIB_Example10.py
24287  37 415 python          python SRWLIB_Example10.py
24290  37 415 python          python SRWLIB_Example10.py
24431  38 414 python          python SRWLIB_Example10.py
24436  38 415 python          python SRWLIB_Example10.py
24575  39 441 python          python SRWLIB_Example10.py
24579  39 440 python          python SRWLIB_Example10.py
24055  40  17 su              su - vagrant -c for i in $(seq 2); do mpiexec -verbose -n 16 python SRWLIB_Example10.py & done; wait
24719  40 440 python          python SRWLIB_Example10.py
24721  40 440 python          python SRWLIB_Example10.py
24863  41 469 python          python SRWLIB_Example10.py
24864  41 468 python          python SRWLIB_Example10.py
25007  42 467 python          python SRWLIB_Example10.py
25011  42 468 python          python SRWLIB_Example10.py
25151  43 500 python          python SRWLIB_Example10.py
25157  43 500 python          python SRWLIB_Example10.py
25382  44   0 grep            grep SRWLIB_Example10.py
24210  45  18 mpiexec         mpiexec -verbose -n 16 python SRWLIB_Example10.py
24217  54 821 python          python SRWLIB_Example10.py
24361  55 821 python          python SRWLIB_Example10.py
24503  56 872 python          python SRWLIB_Example10.py
24648  57 871 python          python SRWLIB_Example10.py
24791  58 928 python          python SRWLIB_Example10.py
24937  59 927 python          python SRWLIB_Example10.py
25081  60 992 python          python SRWLIB_Example10.py
25224  61 988 python          python SRWLIB_Example10.py
  • cpuset=37-71:
root@cpu-001:~# ps axww -o pid,psr,cp,ucomm,args | sort --key=2 -n | grep 'SRWLIB_Example10.py'
25470  34  20 docker          docker run --cpuset-cpus 37-71 --rm s10 su - vagrant -c for i in $(seq 2); do mpiexec -verbose -n 16 python SRWLIB_Example10.py & done; wait
25774  37 395 python          python SRWLIB_Example10.py
25776  37 394 python          python SRWLIB_Example10.py
25550  38  13 su              su - vagrant -c for i in $(seq 2); do mpiexec -verbose -n 16 python SRWLIB_Example10.py & done; wait
25918  38 420 python          python SRWLIB_Example10.py
25920  38 419 python          python SRWLIB_Example10.py
26062  39 418 python          python SRWLIB_Example10.py
26064  39 418 python          python SRWLIB_Example10.py
25573  40   1 bash            -bash -c for i in $(seq 2); do mpiexec -verbose -n 16 python SRWLIB_Example10.py & done; wait
26208  40 447 python          python SRWLIB_Example10.py
26210  40 447 python          python SRWLIB_Example10.py
26352  41 448 python          python SRWLIB_Example10.py
26354  41 448 python          python SRWLIB_Example10.py
26496  42 481 python          python SRWLIB_Example10.py
26498  42 480 python          python SRWLIB_Example10.py
26640  43 479 python          python SRWLIB_Example10.py
26642  43 479 python          python SRWLIB_Example10.py
26864  44   0 grep            grep SRWLIB_Example10.py
25699  46 781 python          python SRWLIB_Example10.py
25700  48 780 python          python SRWLIB_Example10.py
25695  49  20 mpiexec         mpiexec -verbose -n 16 python SRWLIB_Example10.py
25701  54 396 python          python SRWLIB_Example10.py
25702  54 395 python          python SRWLIB_Example10.py
25846  55 420 python          python SRWLIB_Example10.py
25848  55 420 python          python SRWLIB_Example10.py
25990  56 418 python          python SRWLIB_Example10.py
25991  56 419 python          python SRWLIB_Example10.py
26134  57 448 python          python SRWLIB_Example10.py
26137  57 447 python          python SRWLIB_Example10.py
26280  58 447 python          python SRWLIB_Example10.py
26281  58 447 python          python SRWLIB_Example10.py
26424  59 482 python          python SRWLIB_Example10.py
26425  59 480 python          python SRWLIB_Example10.py
26568  60 480 python          python SRWLIB_Example10.py
26570  60 480 python          python SRWLIB_Example10.py
26712  61 515 python          python SRWLIB_Example10.py
26714  61 515 python          python SRWLIB_Example10.py
25696  65  20 mpiexec         mpiexec -verbose -n 16 python SRWLIB_Example10.py

Thanks,
Maksim

@robnagler
Member Author

Grasping at straws. If you run this:

docker run --rm s10 su - vagrant -c 'for i in 1 2; do mpiexec -verbose -n 16 python SRWLIB_Example10.py & done; wait'

What does this say?

ps axww -o pid,ppid,psr,cp,ucomm,args | sort --key=2 -n | grep 'SRWLIB_Example10.py'

Note that I added ppid to the list. I want to see how the jobs are distributed based on the mpi "world".
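
To read the MPI "world" layout off that listing without scanning it by eye, the workers can be grouped by parent PID. This is a minimal sketch, assuming the column order pid,ppid,psr,cp,ucomm,args from the command above; the function name cores_by_parent is made up for illustration:

```bash
# Summarise which cores (psr) each mpiexec parent's python workers occupy.
# Reads `ps axww -o pid,ppid,psr,cp,ucomm,args` output on stdin.
# Assumption: workers show up with ucomm "python", as in the listings here.
cores_by_parent() {
    awk '$5 == "python" { print $2, $3 }' |   # keep "ppid psr" for workers only
        sort -k1,1n -k2,2n |                  # group by ppid, then order by core
        awk '
            NR > 1 && $1 != parent { print parent ":" cores; cores = "" }
            { parent = $1; cores = cores " " $2 }
            END { if (NR > 0) print parent ":" cores }
        '
}
```

Usage: ps axww -o pid,ppid,psr,cp,ucomm,args | cores_by_parent prints one line per mpiexec parent, followed by the cores its workers were last seen on.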

@mrakitin

Here is the output:

root@cpu-001:~# ps axww -o pid,ppid,psr,cp,ucomm,args | sort --key=2 -n | grep 'SRWLIB_Example10.py'
42644  6176  13  10 docker          docker run --rm s10 su - vagrant -c for i in 1 2; do mpiexec -verbose -n 16 python SRWLIB_Example10.py & done; wait
44027 17831  10   0 grep            grep SRWLIB_Example10.py
42699 34744  34  18 su              su - vagrant -c for i in 1 2; do mpiexec -verbose -n 16 python SRWLIB_Example10.py & done; wait
42731 42699   4   1 bash            -bash -c for i in 1 2; do mpiexec -verbose -n 16 python SRWLIB_Example10.py & done; wait
42854 42731  14  16 mpiexec         mpiexec -verbose -n 16 python SRWLIB_Example10.py
42855 42731   9  16 mpiexec         mpiexec -verbose -n 16 python SRWLIB_Example10.py
42858 42854   0 816 python          python SRWLIB_Example10.py
42860 42854  54 814 python          python SRWLIB_Example10.py
42932 42854   1 815 python          python SRWLIB_Example10.py
43004 42854  19 860 python          python SRWLIB_Example10.py
43076 42854   2 857 python          python SRWLIB_Example10.py
43148 42854  20 860 python          python SRWLIB_Example10.py
43221 42854  39 859 python          python SRWLIB_Example10.py
43299 42854  21 906 python          python SRWLIB_Example10.py
43371 42854   4 908 python          python SRWLIB_Example10.py
43439 42854  58 909 python          python SRWLIB_Example10.py
43511 42854   5 908 python          python SRWLIB_Example10.py
43588 42854  23 904 python          python SRWLIB_Example10.py
43655 42854   6 964 python          python SRWLIB_Example10.py
43730 42854  24 963 python          python SRWLIB_Example10.py
43802 42854  43 964 python          python SRWLIB_Example10.py
43875 42854  25 956 python          python SRWLIB_Example10.py
42859 42855  36 816 python          python SRWLIB_Example10.py
42861 42855  18 813 python          python SRWLIB_Example10.py
42933 42855  37 814 python          python SRWLIB_Example10.py
43005 42855  55 860 python          python SRWLIB_Example10.py
43077 42855  38 857 python          python SRWLIB_Example10.py
43149 42855  56 859 python          python SRWLIB_Example10.py
43220 42855   3 860 python          python SRWLIB_Example10.py
43292 42855  57 907 python          python SRWLIB_Example10.py
43364 42855  40 908 python          python SRWLIB_Example10.py
43438 42855  22 909 python          python SRWLIB_Example10.py
43510 42855  41 908 python          python SRWLIB_Example10.py
43582 42855  59 904 python          python SRWLIB_Example10.py
43654 42855  42 964 python          python SRWLIB_Example10.py
43729 42855  60 963 python          python SRWLIB_Example10.py
43803 42855   7 964 python          python SRWLIB_Example10.py
43877 42855  61 955 python          python SRWLIB_Example10.py

@robnagler
Member Author

Those didn't overlap. Bumping to n=36. If they don't overlap, try i=3. Thanks!

docker run --rm s10 su - vagrant -c 'for i in 1 2; do mpiexec -verbose -n 36 python SRWLIB_Example10.py & done; wait'

I changed this line a bit to make it easier to read:

ps axww -o psr,ppid,pid,cp,ucomm,args | sort -n | grep 'SRWLIB_Example10.py'

@mrakitin

Hi Rob,

  • i=2, n=36:
root@cpu-001:~# ps axww -o psr,ppid,pid,cp,ucomm,args | sort -n | grep 'SRWLIB_Example10.py'
  0 21211 21215 553 python          python SRWLIB_Example10.py
  1 21211 21291 584 python          python SRWLIB_Example10.py
  2 21211 21435 587 python          python SRWLIB_Example10.py
  2 34744 21051  18 su              su - vagrant -c for i in 1 2; do mpiexec -verbose -n 36 python SRWLIB_Example10.py & done; wait
  3 21212 21583 585 python          python SRWLIB_Example10.py
  4 21211 21723 585 python          python SRWLIB_Example10.py
  5 21212 21871 623 python          python SRWLIB_Example10.py
  6 21212 22015 654 python          python SRWLIB_Example10.py
  7 21212 22159 660 python          python SRWLIB_Example10.py
  8 21212 22299 705 python          python SRWLIB_Example10.py
  9 21212 22443 701 python          python SRWLIB_Example10.py
 10 21212 22590 759 python          python SRWLIB_Example10.py
 11 21212 22737 757 python          python SRWLIB_Example10.py
 12 21212 22877 756 python          python SRWLIB_Example10.py
 13 21212 23021 805 python          python SRWLIB_Example10.py
 14 21212 23165 808 python          python SRWLIB_Example10.py
 15 21211 23316 888 python          python SRWLIB_Example10.py
 16 21212 23453 886 python          python SRWLIB_Example10.py
 17 21212 23598 978 python          python SRWLIB_Example10.py
 18 17831 23821   0 grep            grep SRWLIB_Example10.py
 18 21091 21211  25 mpiexec         mpiexec -verbose -n 36 python SRWLIB_Example10.py
 18 21211 21217 554 python          python SRWLIB_Example10.py
 19 21212 21364 580 python          python SRWLIB_Example10.py
 20 21051 21091   1 bash            -bash -c for i in 1 2; do mpiexec -verbose -n 36 python SRWLIB_Example10.py & done; wait
 20 21211 21507 578 python          python SRWLIB_Example10.py
 21 21212 21655 622 python          python SRWLIB_Example10.py
 21  6176 20988  17 docker          docker run --rm s10 su - vagrant -c for i in 1 2; do mpiexec -verbose -n 36 python SRWLIB_Example10.py & done; wait
 22 21212 21796 620 python          python SRWLIB_Example10.py
 23 21211 21939 658 python          python SRWLIB_Example10.py
 24 21211 22083 660 python          python SRWLIB_Example10.py
 25 21212 22232 660 python          python SRWLIB_Example10.py
 26 21211 22375 701 python          python SRWLIB_Example10.py
 27 21212 22518 705 python          python SRWLIB_Example10.py
 28 21211 22661 756 python          python SRWLIB_Example10.py
 29 21211 22805 753 python          python SRWLIB_Example10.py
 30 21211 22953 819 python          python SRWLIB_Example10.py
 31 21212 23093 812 python          python SRWLIB_Example10.py
 32 21211 23244 886 python          python SRWLIB_Example10.py
 33 21212 23381 885 python          python SRWLIB_Example10.py
 34 21212 23526 977 python          python SRWLIB_Example10.py
 35 21211 23671 969 python          python SRWLIB_Example10.py
 36 21212 21216 553 python          python SRWLIB_Example10.py
 37 21212 21292 584 python          python SRWLIB_Example10.py
 38 21212 21436 587 python          python SRWLIB_Example10.py
 39 21211 21579 585 python          python SRWLIB_Example10.py
 40 21212 21730 613 python          python SRWLIB_Example10.py
 41 21211 21867 625 python          python SRWLIB_Example10.py
 42 21211 22011 660 python          python SRWLIB_Example10.py
 43 21211 22155 661 python          python SRWLIB_Example10.py
 44 21211 22300 706 python          python SRWLIB_Example10.py
 45 21091 21212  24 mpiexec         mpiexec -verbose -n 36 python SRWLIB_Example10.py
 45 21211 22444 704 python          python SRWLIB_Example10.py
 46 21211 22589 759 python          python SRWLIB_Example10.py
 47 21211 22733 758 python          python SRWLIB_Example10.py
 48 21211 22880 756 python          python SRWLIB_Example10.py
 49 21211 23025 815 python          python SRWLIB_Example10.py
 50 21211 23174 889 python          python SRWLIB_Example10.py
 51 21212 23309 889 python          python SRWLIB_Example10.py
 52 21211 23454 887 python          python SRWLIB_Example10.py
 53 21211 23597 978 python          python SRWLIB_Example10.py
 54 21212 21218 555 python          python SRWLIB_Example10.py
 55 21211 21363 587 python          python SRWLIB_Example10.py
 56 21212 21508 581 python          python SRWLIB_Example10.py
 57 21211 21651 623 python          python SRWLIB_Example10.py
 58 21211 21795 621 python          python SRWLIB_Example10.py
 59 21212 21947 659 python          python SRWLIB_Example10.py
 60 21212 22087 659 python          python SRWLIB_Example10.py
 61 21211 22227 660 python          python SRWLIB_Example10.py
 62 21212 22371 701 python          python SRWLIB_Example10.py
 63 21211 22517 705 python          python SRWLIB_Example10.py
 64 21212 22663 756 python          python SRWLIB_Example10.py
 65 21212 22806 753 python          python SRWLIB_Example10.py
 66 21212 22949 820 python          python SRWLIB_Example10.py
 67 21211 23094 813 python          python SRWLIB_Example10.py
 68 21212 23237 890 python          python SRWLIB_Example10.py
 69 21211 23385 884 python          python SRWLIB_Example10.py
 70 21211 23525 977 python          python SRWLIB_Example10.py
 71 21212 23673 969 python          python SRWLIB_Example10.py
  • i=3, n=36:
root@cpu-001:~# ps axww -o psr,ppid,pid,cp,ucomm,args | sort -n | grep 'SRWLIB_Example10.py'
  0 24025 24147  28 mpiexec         mpiexec -verbose -n 36 python SRWLIB_Example10.py
  0 24147 24152 312 python          python SRWLIB_Example10.py
  0 24148 24153 303 python          python SRWLIB_Example10.py
  1 24147 24280 325 python          python SRWLIB_Example10.py
  1 34744 23992  34 su              su - vagrant -c for i in 1 2 3; do mpiexec -verbose -n 36 python SRWLIB_Example10.py & done; wait
  2 24146 24484 341 python          python SRWLIB_Example10.py
  2 24147 24511 323 python          python SRWLIB_Example10.py
  2  6176 23927  34 docker          docker run --rm s10 su - vagrant -c for i in 1 2 3; do mpiexec -verbose -n 36 python SRWLIB_Example10.py & done; wait
  3 24146 24728 339 python          python SRWLIB_Example10.py
  4 17831 28053   0 grep            grep SRWLIB_Example10.py
  4 23992 24025   1 bash            -bash -c for i in 1 2 3; do mpiexec -verbose -n 36 python SRWLIB_Example10.py & done; wait
  4 24146 24926 370 python          python SRWLIB_Example10.py
  5 24147 25177 365 python          python SRWLIB_Example10.py
  5 24148 25127 364 python          python SRWLIB_Example10.py
  6 24146 25343 369 python          python SRWLIB_Example10.py
  6 24147 25393 385 python          python SRWLIB_Example10.py
  7 24147 25580 380 python          python SRWLIB_Example10.py
  7 24148 25559 371 python          python SRWLIB_Example10.py
  8 24147 25794 414 python          python SRWLIB_Example10.py
  8 24148 25776 395 python          python SRWLIB_Example10.py
  9 24147 26051 410 python          python SRWLIB_Example10.py
 10 24146 26208 419 python          python SRWLIB_Example10.py
 10 24147 26240 446 python          python SRWLIB_Example10.py
 11 24147 26444 423 python          python SRWLIB_Example10.py
 11 24148 26424 411 python          python SRWLIB_Example10.py
 12 24146 26641 449 python          python SRWLIB_Example10.py
 12 24148 26650 465 python          python SRWLIB_Example10.py
 13 24147 26863 530 python          python SRWLIB_Example10.py
 13 24148 26861 460 python          python SRWLIB_Example10.py
 14 24146 27083 479 python          python SRWLIB_Example10.py
 14 24147 27075 554 python          python SRWLIB_Example10.py
 15 24146 27291 569 python          python SRWLIB_Example10.py
 15 24147 27295 545 python          python SRWLIB_Example10.py
 16 24146 27509 549 python          python SRWLIB_Example10.py
 16 24147 27507 536 python          python SRWLIB_Example10.py
 17 24147 27723 618 python          python SRWLIB_Example10.py
 18 24146 24162 310 python          python SRWLIB_Example10.py
 18 24148 24155 310 python          python SRWLIB_Example10.py
 19 24148 24369 324 python          python SRWLIB_Example10.py
 20 24148 24587 343 python          python SRWLIB_Example10.py
 21 24146 24803 366 python          python SRWLIB_Example10.py
 21 24148 24805 353 python          python SRWLIB_Example10.py
 22 24146 25019 383 python          python SRWLIB_Example10.py
 24 24146 25468 394 python          python SRWLIB_Example10.py
 24 24147 25497 373 python          python SRWLIB_Example10.py
 25 24147 25702 383 python          python SRWLIB_Example10.py
 25 24148 25667 433 python          python SRWLIB_Example10.py
 26 24148 25884 399 python          python SRWLIB_Example10.py
 27 24147 26141 436 python          python SRWLIB_Example10.py
 28 24148 26320 426 python          python SRWLIB_Example10.py
 29 24025 24148  30 mpiexec         mpiexec -verbose -n 36 python SRWLIB_Example10.py
 29 24148 26537 488 python          python SRWLIB_Example10.py
 30 24146 26750 477 python          python SRWLIB_Example10.py
 31 24146 26975 528 python          python SRWLIB_Example10.py
 32 24146 27183 508 python          python SRWLIB_Example10.py
 32 24147 27192 471 python          python SRWLIB_Example10.py
 33 24147 27399 538 python          python SRWLIB_Example10.py
 33 24148 27419 541 python          python SRWLIB_Example10.py
 34 24148 27628 607 python          python SRWLIB_Example10.py
 35 24148 27831 578 python          python SRWLIB_Example10.py
 36 24146 24154 346 python          python SRWLIB_Example10.py
 37 24146 24281 337 python          python SRWLIB_Example10.py
 37 24148 24263 333 python          python SRWLIB_Example10.py
 38 24148 24478 357 python          python SRWLIB_Example10.py
 39 24147 24695 342 python          python SRWLIB_Example10.py
 39 24148 24704 337 python          python SRWLIB_Example10.py
 40 24147 24963 351 python          python SRWLIB_Example10.py
 40 24148 24911 352 python          python SRWLIB_Example10.py
 41 24146 25128 346 python          python SRWLIB_Example10.py
 42 24148 25347 384 python          python SRWLIB_Example10.py
 43 24146 25568 388 python          python SRWLIB_Example10.py
 44 24146 25784 402 python          python SRWLIB_Example10.py
 45 24146 25992 409 python          python SRWLIB_Example10.py
 45 24148 26016 389 python          python SRWLIB_Example10.py
 46 24148 26217 433 python          python SRWLIB_Example10.py
 47 24146 26425 465 python          python SRWLIB_Example10.py
 48 24147 26643 481 python          python SRWLIB_Example10.py
 49 24146 26857 525 python          python SRWLIB_Example10.py
 50 24148 27096 469 python          python SRWLIB_Example10.py
 51 24148 27296 539 python          python SRWLIB_Example10.py
 52 24025 24146  30 mpiexec         mpiexec -verbose -n 36 python SRWLIB_Example10.py
 52 24148 27512 555 python          python SRWLIB_Example10.py
 53 24146 27749 614 python          python SRWLIB_Example10.py
 53 24148 27758 577 python          python SRWLIB_Example10.py
 54 24147 24160 341 python          python SRWLIB_Example10.py
 55 24146 24394 348 python          python SRWLIB_Example10.py
 55 24147 24393 344 python          python SRWLIB_Example10.py
 56 24146 24589 331 python          python SRWLIB_Example10.py
 56 24147 24634 338 python          python SRWLIB_Example10.py
 57 24147 24809 359 python          python SRWLIB_Example10.py
 58 24147 25071 339 python          python SRWLIB_Example10.py
 58 24148 25041 349 python          python SRWLIB_Example10.py
 59 24146 25235 390 python          python SRWLIB_Example10.py
 59 24147 25256 375 python          python SRWLIB_Example10.py
 59 24148 25236 376 python          python SRWLIB_Example10.py
 60 24148 25451 372 python          python SRWLIB_Example10.py
 61 24146 25675 395 python          python SRWLIB_Example10.py
 62 24146 25899 409 python          python SRWLIB_Example10.py
 62 24147 25935 379 python          python SRWLIB_Example10.py
 63 24146 26100 420 python          python SRWLIB_Example10.py
 63 24148 26122 443 python          python SRWLIB_Example10.py
 64 24146 26316 430 python          python SRWLIB_Example10.py
 64 24147 26357 432 python          python SRWLIB_Example10.py
 65 24146 26532 446 python          python SRWLIB_Example10.py
 65 24147 26536 453 python          python SRWLIB_Example10.py
 66 24147 26749 453 python          python SRWLIB_Example10.py
 66 24148 26760 457 python          python SRWLIB_Example10.py
 67 24147 26967 476 python          python SRWLIB_Example10.py
 67 24148 27000 500 python          python SRWLIB_Example10.py
 68 24148 27194 460 python          python SRWLIB_Example10.py
 69 24146 27406 556 python          python SRWLIB_Example10.py
 70 24146 27615 577 python          python SRWLIB_Example10.py
 70 24147 27647 630 python          python SRWLIB_Example10.py
 71 24146 27854 598 python          python SRWLIB_Example10.py
 71 24147 27839 631 python          python SRWLIB_Example10.py

@mrakitin

My concern isn't about n=16, but about n=8 for i=3:

root@cpu-001:~# ps axww -o psr,ppid,pid,cp,ucomm,args | sort -n | grep 'SRWLIB_Example10.py'
  0 28379 28386 605 python          python SRWLIB_Example10.py
  0 28380 28385 581 python          python SRWLIB_Example10.py
  1 28380 28495 603 python          python SRWLIB_Example10.py
  2 28379 28744 630 python          python SRWLIB_Example10.py
  2  6176 28165  33 docker          docker run --rm s10 su - vagrant -c for i in 1 2 3; do mpiexec -verbose -n 8 python SRWLIB_Example10.py & done; wait
  3 28378 28978 704 python          python SRWLIB_Example10.py
  3 28379 28950 615 python          python SRWLIB_Example10.py
  4 28228 28258   1 bash            -bash -c for i in 1 2 3; do mpiexec -verbose -n 8 python SRWLIB_Example10.py & done; wait
  5 28258 28379  18 mpiexec         mpiexec -verbose -n 8 python SRWLIB_Example10.py
  7 28258 28378  18 mpiexec         mpiexec -verbose -n 8 python SRWLIB_Example10.py
 18 28379 28401 620 python          python SRWLIB_Example10.py
 19 28378 28613 573 python          python SRWLIB_Example10.py
 20 28378 28823 649 python          python SRWLIB_Example10.py
 20 28379 28835 677 python          python SRWLIB_Example10.py
 21 28378 29036 645 python          python SRWLIB_Example10.py
 21 28380 29035 644 python          python SRWLIB_Example10.py
 22 28258 28380  19 mpiexec         mpiexec -verbose -n 8 python SRWLIB_Example10.py
 33 34744 28228  22 su              su - vagrant -c for i in 1 2 3; do mpiexec -verbose -n 8 python SRWLIB_Example10.py & done; wait
 36 28378 28383 574 python          python SRWLIB_Example10.py
 37 28378 28497 610 python          python SRWLIB_Example10.py
 37 28379 28524 657 python          python SRWLIB_Example10.py
 38 28378 28711 629 python          python SRWLIB_Example10.py
 38 28380 28723 606 python          python SRWLIB_Example10.py
 39 28380 28927 669 python          python SRWLIB_Example10.py
 41 28146 29262   0 grep            grep SRWLIB_Example10.py
 54 28378 28391 615 python          python SRWLIB_Example10.py
 54 28380 28387 635 python          python SRWLIB_Example10.py
 55 28379 28615 661 python          python SRWLIB_Example10.py
 55 28380 28603 635 python          python SRWLIB_Example10.py
 56 28380 28819 662 python          python SRWLIB_Example10.py
 57 28379 29037 676 python          python SRWLIB_Example10.py

Only 16 cores are occupied in this case instead of 24.
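
A count like this can be read off the listing mechanically. This is a rough sketch, assuming the column order psr,ppid,pid,cp,ucomm,args used above; count_worker_cores is a hypothetical helper name, not part of any tool:

```bash
# Count the distinct cores (psr) that host at least one python worker.
# Reads `ps axww -o psr,ppid,pid,cp,ucomm,args` output on stdin.
# Lines for mpiexec, su, bash, docker and grep are ignored because their
# ucomm column is not "python".
count_worker_cores() {
    awk '$5 == "python" && !seen[$1]++ { n++ } END { print n + 0 }'
}
```

Usage: ps axww -o psr,ppid,pid,cp,ucomm,args | count_worker_cores. With i=3 and n=8, any result below 24 means some workers share a core.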

@robnagler
Member Author

robnagler commented Mar 29, 2016 via email

@mrakitin

How many cores do you have? Perhaps you could test with i=3 and n=2 or 3.

@robnagler
Member Author

Good call! I can reproduce it here in an interactive docker script:

$ for i in 1 2 3; do mpiexec -verbose -n 4 python SRWLIB_Example10.py & done
$ ps -o psr,args ax | grep SRWLIB_Example10.py | grep -v grep | grep -v mpiexec | sort -n
  0 python SRWLIB_Example10.py
  0 python SRWLIB_Example10.py
  1 python SRWLIB_Example10.py
  2 python SRWLIB_Example10.py
  3 python SRWLIB_Example10.py
 16 python SRWLIB_Example10.py
 17 python SRWLIB_Example10.py
 17 python SRWLIB_Example10.py
 18 python SRWLIB_Example10.py
 18 python SRWLIB_Example10.py
 19 python SRWLIB_Example10.py
 19 python SRWLIB_Example10.py

Here's a test I'm using for this specific example:

x=( $(ps -o psr,args ax | grep SRWLIB_Example10.py | sort -nu | colrm 4) ); (( ${#x[@]} != 16 )) && echo ${#x[@]} || echo OK

If it works, it prints OK; otherwise it prints the number of distinct cores in use (the count includes the cores running grep and mpiexec).
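
To see which cores are doubled up rather than just how many are in use, a variant along these lines may help. It is only a sketch: it assumes the two-column psr,args format from the ps command above, and shared_cores is an illustrative name:

```bash
# List every core (psr) that hosts more than one python worker.
# Reads `ps -o psr,args ax` output on stdin. The header, mpiexec and grep
# lines are skipped because their second field is not "python".
shared_cores() {
    awk '$2 == "python" { count[$1]++ }
         END { for (core in count) if (count[core] > 1) print core }' |
        sort -n
}
```

Usage: ps -o psr,args ax | grep SRWLIB_Example10.py | shared_cores. Fed the listing above, it prints 0, 17, 18 and 19, one per line.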

I'll experiment more.

@robnagler
Member Author

I was able to reproduce it simply, and it doesn't seem to have anything to do with Docker. See robnagler/mpi-play#1

@robnagler
Member Author

FTR, the difference between native Debian and the container was caused by the Open MPI version: 1.6 (native) vs. 1.8 (container). Version 1.8 changed the default to bind-to core.
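
One way to check whether workers are being bound is to look at the CPU list the kernel allows each of them. This is a sketch for Linux, assuming /proc is mounted; allowed_cpus and its optional second argument are illustrative and not from this thread:

```bash
# Print the CPUs a process may run on, taken from the Cpus_allowed_list
# field of /proc/<pid>/status (Linux only).
# $1: PID. $2 (optional): alternative /proc root, only useful for testing.
allowed_cpus() {
    awk '$1 == "Cpus_allowed_list:" { print $2 }' "${2:-/proc}/$1/status"
}

# Example (commented out): show the allowed CPUs of every running worker.
# for pid in $(pgrep -f SRWLIB_Example10.py); do
#     echo "$pid: $(allowed_cpus "$pid")"
# done
```

A worker bound to a single core should show just that core, while an unbound worker shows the whole cpuset. With Open MPI 1.8, passing --bind-to none to mpiexec turns binding off, and --report-bindings prints the bindings mpiexec chose.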

@mrakitin

I started 6 independent partially-coherent calculations in 6 different browsers, and they appear to run very efficiently, occupying 6*8=48 cores:

root@cpu-001:~# ps axww -o psr,ppid,pid,cp,ucomm,args | sort -n | grep 'mpi_run.py' | grep -v mpiexec | grep -v grep
  2 47120 47126 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
  5 47164 47170 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
  6 48449 48453 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
  7 47120 47125 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
  8 47164 47167 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 10 47120 47128 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 11 47264 47272 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 12 47120 47129 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 14 47120 47123 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 15 47723 47728 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 16 47723 47731 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 17 47264 47271 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 18 47264 47268 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 21 47264 47269 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 22 47264 47266 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 24 47120 47122 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 26 47120 47127 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 27 47723 47727 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 28 47636 47645 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 31 48449 48455 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 32 47723 47732 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 33 48449 48454 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 34 48449 48451 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 35 48449 48458 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 36 47636 47642 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 37 47164 47171 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 38 47164 47173 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 40 47636 47641 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 44 47636 47638 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 46 47164 47168 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 48 47636 47640 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 49 47636 47644 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 50 47723 47729 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 53 47723 47725 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 54 47264 47267 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 55 47164 47169 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 56 47636 47643 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 57 47723 47730 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 58 48449 48452 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 61 47264 47270 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 62 48449 48456 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 63 48449 48457 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 64 47164 47166 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 65 47164 47172 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 66 47120 47124 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 67 47723 47726 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 70 47264 47273 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
 71 47636 47639 999 python          /home/vagrant/.pyenv/versions/2.7.10/bin/python mpi_run.py
root@cpu-001:~# ps axww -o psr,ppid,pid,cp,ucomm,args | sort -n | grep 'mpi_run.py' | grep -v mpiexec | grep -v grep | wc -l
48
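
The 6*8 split can also be confirmed per job by counting the workers under each mpiexec parent. This is a sketch assuming the same psr,ppid,pid,cp,ucomm,args columns; workers_per_job is a hypothetical helper:

```bash
# Print "<ppid> <number of python workers>" for each parent PID.
# Reads `ps axww -o psr,ppid,pid,cp,ucomm,args` output on stdin.
workers_per_job() {
    awk '$5 == "python" { n[$2]++ }
         END { for (p in n) print p, n[p] }' |
        sort -n
}
```

Usage: ps axww -o psr,ppid,pid,cp,ucomm,args | grep 'mpi_run.py' | workers_per_job. Each job that got its full set of workers shows a count of 8.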

Thanks for fixing the issue, Rob!
