Paper: PMDA – Parallel Molecular Dynamics Analysis #476

Merged
merged 116 commits on Jul 3, 2019
Changes from 1 commit
Commits
116 commits
9b3524f
generate folder and rst, bib files
VOD555 May 17, 2019
a3bced1
add abstract
VOD555 May 17, 2019
e166e7b
skeleton for PMDA paper
orbeckst May 20, 2019
b46f485
add brief description of pmda.parallel.ParallelAnalysisBase
VOD555 May 20, 2019
210309a
new pmda.bib
VOD555 May 20, 2019
ef27ec2
correct pmda.bib
VOD555 May 20, 2019
3cbbcce
change abstract
VOD555 May 20, 2019
6a74028
change sign
VOD555 May 20, 2019
7e653d4
reduced size of bib file
orbeckst May 20, 2019
74a1de2
updated title and citations
orbeckst May 20, 2019
df1cf48
minor edits and comments on workflow
orbeckst May 20, 2019
4f29605
equal contributions: Max and Shujie
orbeckst May 20, 2019
d4ddcda
Merge pull request #16 from VOD555/2019
orbeckst May 20, 2019
b758b1b
manually merged shujie_fan.rst into fan.rst
orbeckst May 20, 2019
7e47bf8
renamed paper to pmda.rst
orbeckst May 20, 2019
c74f224
remove time record part in method section
VOD555 May 21, 2019
36f4020
change name
VOD555 May 21, 2019
0af8f45
remove Timing
VOD555 May 21, 2019
b0816db
fix title overline
VOD555 May 21, 2019
58f5d55
add user-defined parallel task 1
VOD555 May 21, 2019
df6fa01
add self-defined analysis task 2
VOD555 May 21, 2019
f1871e4
Rearrange With pmda.parallel.ParallelAnalysisBase
VOD555 May 21, 2019
f237713
references for MD applications
orbeckst May 21, 2019
7b6710e
first two introductory paragraphs + XSEDE ack
orbeckst May 21, 2019
1e5fe9d
ref prev work
orbeckst May 21, 2019
a324a01
introduction v1
orbeckst May 21, 2019
b1b046e
Merge pull request #21 from Becksteinlab/introduction
orbeckst May 21, 2019
ec2ebab
add to intro: contains library of analysis classes
orbeckst May 21, 2019
156bd06
moved code availability to end
orbeckst May 21, 2019
edac61c
Methods: reference numpy
orbeckst May 21, 2019
1ec5e5d
consistent adornments for headings
orbeckst May 21, 2019
c04b36b
more methods defs
orbeckst May 22, 2019
b5526ed
methods (#23): time series vs reduction
orbeckst May 22, 2019
23bf0d6
add benchmark introduction
VOD555 May 22, 2019
49bcbe3
merge changes
VOD555 May 22, 2019
43d8b80
remove timeit part
VOD555 May 22, 2019
1669922
methods: PMDA schema
orbeckst May 22, 2019
862e7ef
Merge branch '2019' of github.com:Becksteinlab/scipy_proceedings into…
orbeckst May 22, 2019
e09ad83
methods: implementation
orbeckst May 22, 2019
78961f1
better line breaking of code
orbeckst May 22, 2019
dbccb05
Update pmda.rst
richardjgowers May 22, 2019
ff7412d
methods: performance evaluation
orbeckst May 22, 2019
c07f8aa
updated examples and usage section
orbeckst May 22, 2019
a3cbbdf
code fix for Rgyr
orbeckst May 22, 2019
4d8b8b1
add efficiency and speedup plot for rdf and rms
VOD555 May 22, 2019
acf619b
add total time comparison
VOD555 May 22, 2019
d994849
add links to data repo to Methods
orbeckst May 22, 2019
0875aac
Merge branch '2019' of github.com:Becksteinlab/scipy_proceedings into…
orbeckst May 22, 2019
8dbc6da
combine total time, efficiency, speedup for rdf
VOD555 May 22, 2019
302a9a3
Merge branch '2019' of github.com:Becksteinlab/scipy_proceedings into…
VOD555 May 22, 2019
c29a2b7
combine total, efficiency, speed up for rms
VOD555 May 22, 2019
818833c
update acknowledgements
kain88-de May 22, 2019
9ca29f6
add fig for wait, compute, io times of rms
VOD555 May 22, 2019
71c2a42
add fig for wait compute io rdf, remove the unfixed wait compute io f…
VOD555 May 22, 2019
01fef5c
add Table with benchmark environments
orbeckst May 22, 2019
18f13c9
Merge branch '2019' of github.com:Becksteinlab/scipy_proceedings into…
orbeckst May 22, 2019
24e7a8f
fix wait compute io plot for rms
VOD555 May 22, 2019
19111b5
Merge branch '2019' of github.com:Becksteinlab/scipy_proceedings into…
orbeckst May 22, 2019
5bdbb88
add figures to Results
orbeckst May 22, 2019
1009f73
fix new fig captions
orbeckst May 22, 2019
c2a9eef
add graph for rdf's prepare, conclude universe time
VOD555 May 22, 2019
4af4bc0
add graph for rms' prepare, conclude and universe time
VOD555 May 22, 2019
e79a444
corrected how detailed timing information was obtained
orbeckst May 22, 2019
d858536
Merge branch '2019' of github.com:Becksteinlab/scipy_proceedings into…
orbeckst May 22, 2019
98531c9
Results: completed RMSD section
orbeckst May 22, 2019
0c2bee0
corrected water g(r): OO
orbeckst May 22, 2019
e8d45d0
results: finished RDF
orbeckst May 23, 2019
13557fd
wrote conclusions
orbeckst May 23, 2019
43cec73
final spell check
orbeckst May 23, 2019
5ad7e8e
small improvements
orbeckst May 23, 2019
3e2953d
abstract fix
orbeckst May 23, 2019
1ba6eac
more abstract fix
orbeckst May 23, 2019
3701f39
tense fix: use past tense for result
orbeckst May 23, 2019
b85f750
made optional/required methods clearer
orbeckst May 23, 2019
d0c8eba
maded AnalysisFromFunction() a bit clearer
orbeckst May 23, 2019
2b9ef60
more tense fixes (RMSD results)
orbeckst May 23, 2019
8198feb
Merge branch '2019' into patch-1
orbeckst May 23, 2019
731c99d
Merge pull request #25 from richardjgowers/patch-1
orbeckst May 23, 2019
b4f1b62
Merge pull request #26 from kain88-de/patch-1
orbeckst May 23, 2019
606ed6d
load booktabs package explicitly
orbeckst May 23, 2019
3746e85
consistently italicized multiprocessing and distributed
orbeckst May 23, 2019
069cfe6
more conservative description of RMSD speed-up
orbeckst May 23, 2019
24f8444
add DOI for test trajectories
orbeckst May 30, 2019
7f3c52f
break code inside column
orbeckst May 30, 2019
f775cff
improved AnalysisFromFunction example
orbeckst May 31, 2019
050947c
fix number of cores and nodes for ssd distributed
VOD555 Jun 7, 2019
2cf16d0
fixed: Table: only 1 node for distributed/SSD
orbeckst Jun 7, 2019
b282591
Merge branch '2019' of github.com:Becksteinlab/scipy_proceedings into…
orbeckst Jun 7, 2019
ace8281
fixed definition of speed-up S(M)
orbeckst Jun 13, 2019
cb46d91
add zenodo DOI for data/script repository
orbeckst Jun 13, 2019
4220250
fixed typo found by reviewer @cyrush
orbeckst Jun 13, 2019
76416eb
data is a plural noun: fixed
orbeckst Jun 13, 2019
97a90d1
added particle type indices to make g(r) equation clearer
orbeckst Jun 13, 2019
783dda4
methods updates
orbeckst Jun 13, 2019
4af3190
installation details
orbeckst Jun 14, 2019
f72c0b6
re-arranged performance evaluation section
orbeckst Jun 14, 2019
bf5ee59
float juggling to make figures appear sooner
orbeckst Jun 14, 2019
d9e1a11
cleaned up bib file
orbeckst Jun 14, 2019
07def91
updated software versions
orbeckst Jun 14, 2019
67b7afb
add detail for RMSD calculation
orbeckst Jun 14, 2019
c31013d
text improvements in RMSD Task results section
orbeckst Jun 14, 2019
0fe1c37
add errorbars to wait compute io plot
VOD555 Jun 19, 2019
f7c2003
Merge branch '2019' of github.com:Becksteinlab/scipy_proceedings into…
VOD555 Jun 19, 2019
a978b2d
add errorbars to pre_con_uni plots
VOD555 Jun 19, 2019
f44abfe
add error bars total efficiency speedup plots
VOD555 Jun 19, 2019
28cbff8
add stacked graphs with percentage times
VOD555 Jun 19, 2019
13f4579
modify color of lines on graphs
VOD555 Jun 19, 2019
d78a7e9
updated text for plots with error bars
orbeckst Jun 19, 2019
108ecf5
Merge branch '2019' of github.com:Becksteinlab/scipy_proceedings into…
orbeckst Jun 19, 2019
2da50d9
integrate stacked fraction of time plots
orbeckst Jun 20, 2019
615ae24
add reviewer @cyrush to Acknowledgements for the idea of the stacked …
orbeckst Jun 20, 2019
a32203d
updated computational details for RDF calculation
orbeckst Jun 20, 2019
b82547f
fix some typo
VOD555 Jun 20, 2019
aa19461
updated text for errors of speedup and efficiency
VOD555 Jun 20, 2019
a18166d
fix typo
VOD555 Jul 2, 2019
ade200a
fix typo
VOD555 Jul 2, 2019

re-arranged performance evaluation section

- followed reviewer @cyrush's recommendation #476 (comment)
  to change the order so that benchmarking method directly follows benchmarking results (thank you!)
- close #32
- replaced 'Results & Discussion' with 'Performance Evaluation' and integrated the "methods" part;
  this had the advantage that the sectioning became clearer, some duplicated text could be removed,
  and that the logical flow is much better
- various smaller text fixes/improvements
orbeckst committed Jun 14, 2019
commit f72c0b6d16e4f627e8feb5c10188332f15b41b1c
@@ -264,73 +264,6 @@ For example, the default "append" reduction is
In general, the :code:`ParallelAnalysisBase` controls access to instance attributes via a context manager :code:`ParallelAnalysisBase.readonly_attributes()`.
It sets them to "read-only" for all parallel parts to prevent the common mistake to set an instance attribute in a parallel task, which breaks under parallelization as the value of an attribute in an instance in a parallel process is never communicated back to the calling process.
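
A minimal, hypothetical sketch of one way such a guard can be built is shown below; this is not the PMDA implementation, and only the method name :code:`readonly_attributes()` is taken from the description above.

.. code-block:: python

   # Hypothetical illustration (not PMDA source code): a context manager
   # flips a flag that __setattr__ consults, so that attribute assignments
   # inside a parallel region raise instead of being silently lost.
   from contextlib import contextmanager

   class ReadOnlyGuard:
       def __init__(self):
           self._read_only = False

       @contextmanager
       def readonly_attributes(self):
           self._read_only = True
           try:
               yield self
           finally:
               self._read_only = False

       def __setattr__(self, name, value):
           if getattr(self, "_read_only", False) and name != "_read_only":
               raise AttributeError(
                   "instance attributes are read-only in a parallel region")
           super().__setattr__(name, value)

Inside :code:`with guard.readonly_attributes(): ...` any attempt to set a regular attribute fails immediately, which surfaces the mistake where it happens rather than as silently missing results.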


Performance evaluation
----------------------

To evaluate the performance of the parallelization, two common computational tasks were tested that differ in their computational cost and represent two different requirements for data reduction.
We computed the time series of root mean square distance after optimum superposition (RMSD) of all |Calpha| atoms of a protein with the initial coordinates at the first frame as reference, as implemented in class :code:`pmda.rms.RMSD`.
The RMSD calculation with optimum superposition was performed with the fast QCPROT algorithm :cite:`Theobald:2005vn` as implemented in MDAnalysis :cite:`Michaud-Agrawal:2011fu`.
As a second test case we computed the oxygen-oxygen radial distribution function (RDF, Eq. :ref:`eq:rdf`) for all oxygen atoms in the water molecules in our test system, using the class :code:`pmda.rdf.InterRDF`.
The RDF calculation is compute-intensive due to the necessity to calculate and histogram a large number (:math:`\mathcal{O}(N^2)`) of distances for each time step; it additionally exemplifies a non-trivial reduction.

Test system, benchmarking environment, and data files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We tested PMDA 0.2.1, MDAnalysis 0.20.0 (development version), Dask 1.1.1, and NumPy 1.15.4 under Python 3.
All packages except PMDA and MDAnalysis were installed in an anaconda environment with `conda`_ from the `conda-forge`_ channel.
PMDA and MDAnalysis development versions were installed from source in a conda environment with ``pip install``.

Benchmarks were run on the CPU nodes of XSEDE's :cite:`XSEDE` *SDSC Comet* supercomputer, a 2 PFlop/s cluster with 1,944 Intel Haswell Standard Compute Nodes in total.
Each node contains two Intel Xeon CPUs (E5-2680v3, 12 cores, 2.5 GHz) with 24 CPU cores per node, 128 GB DDR4 DRAM main memory, and a non-blocking fat-tree InfiniBand FDR 56 Gbps node interconnect.
All nodes share a Lustre parallel file system and have access to node-local 320 GB SSD scratch space.
Jobs are run through the SLURM batch queuing system.
Our SLURM submission shell scripts and Python benchmark scripts for *SDSC Comet* are available in the repository https://github.com/Becksteinlab/scipy2019-pmda-data and are archived under DOI `10.5281/zenodo.3228422`_.

The test data files consist of a topology file ``YiiP_system.pdb`` (with :math:`N = 111,815` atoms) and two trajectory files ``YiiP_system_9ns_center.xtc`` (Gromacs XTC format, :math:`T = 900` frames) and ``YiiP_system_90ns_center.xtc`` (Gromacs XTC format, :math:`T = 9000` frames) of the membrane protein YiiP in a lipid bilayer together with water and ions.
The test trajectories are made available on figshare at DOI `10.6084/m9.figshare.8202149`_.

.. raw:: latex

\begin{table}
\begin{longtable*}[c]{p{0.3\tablewidth}p{0.1\tablewidth}lp{0.07\tablewidth}p{0.07\tablewidth}}
\toprule
\textbf{configuration label} & \textbf{file storage} & \textbf{scheduler} & \textbf{max nodes} & \textbf{max processes} \tabularnewline
\midrule
\endfirsthead
Lustre-distributed-3nodes & Lustre & \textit{distributed} & 3 & 72 \tabularnewline
Lustre-distributed-6nodes & Lustre & \textit{distributed} & 6 & 72 \tabularnewline
Lustre-multiprocessing & Lustre & \textit{multiprocessing} & 1 & 24 \tabularnewline
SSD-distributed & SSD & \textit{distributed} & 1 & 24 \tabularnewline
SSD-multiprocessing & SSD & \textit{multiprocessing} & 1 & 24 \tabularnewline
\bottomrule
\end{longtable*}
\caption{Testing configurations on \textit{SDSC Comet}.
\textbf{max nodes} is the maximum number of nodes that were tested; the \textit{multiprocessing} scheduler is limited to a single node.
\textbf{max processes} is the maximum number of processes or Dask workers that were employed.
\DUrole{label}{tab:configurations}
}
\end{table}

We tested different combinations of Dask schedulers (*distributed*, *multiprocessing*) with different means to read the trajectory data (either from the shared *Lustre* parallel file system or from local *SSD*) as shown in Table :ref:`tab:configurations`.
Using either the *multiprocessing* scheduler or the SSD restricts runs to a single node (maximum 24 CPU cores).
With *distributed* (and *Lustre*) we tested both fully utilizing all cores on a node and occupying only half the available cores while doubling the total number of nodes.
In all cases the trajectory was split into as many blocks as there were available processes or Dask workers.
We performed single benchmark runs for *distributed* on local SSD (*SSD-distributed*) and *multiprocessing* on Lustre (*Lustre-multiprocessing*) and five repeats for all other scenarios in Table :ref:`tab:configurations`.
We plotted results for one typical benchmark run each.

.. TODO: replace figures with ones where 5 repeats are averaged and error bars indicate 1 stdev --- issue #27

Measured parameters
~~~~~~~~~~~~~~~~~~~

The :code:`ParallelAnalysisBase` class collects detailed timing information for all blocks and all frames and makes these data available in the attribute :code:`ParallelAnalysisBase.timing`:
We measured the time |tprepare| for :code:`_prepare()`, the time |twait| that each task :math:`k` waits until it is executed by the scheduler, the time |tuniverse| to create a new :code:`Universe` for each Dask task (which includes opening the shared trajectory and topology files and loading the topology into memory), the time |tIO| to read each frame :math:`t` in each block :math:`k` from disk into memory, the time |tcomp| to perform the computation in :code:`_single_frame()` and reduction in :code:`_reduce()`, the time |tconclude| to perform the final processing of all data in :code:`_conclude()`, and the total wall time to solution |ttotal|.

We quantified the strong scaling behavior by calculating the speed-up for running on :math:`M` CPU cores with :math:`M` parallel Dask tasks as :math:`S(M) = t^\text{total}(1)/t^\text{total}(M)`, where :math:`t^\text{total}(1)` is the total time to solution of the PMDA code using the serial scheduler.
The efficiency was calculated as :math:`E(M) = S(M)/M`.



Using PMDA
@@ -482,16 +415,87 @@ This class can be used in the same way as the class that we defined with :code:`
Performance Evaluation
======================

In order to characterize the performance of PMDA on a typical HPC machine we performed computational experiments for two different analysis tasks, the RMSD calculation after optimum superposition (*RMSD*) and the water oxygen radial distribution function (*RDF*).

For the *RMSD* task we computed the time series of root mean square distance after optimum superposition (RMSD) of all |Calpha| atoms of a protein with the initial coordinates at the first frame as reference, as implemented in class :code:`pmda.rms.RMSD`.
The RMSD calculation with optimum superposition was performed with the fast QCPROT algorithm :cite:`Theobald:2005vn` as implemented in MDAnalysis :cite:`Michaud-Agrawal:2011fu`.

As a second test case we computed the water oxygen-oxygen radial distribution function (*RDF*, Eq. :ref:`eq:rdf`) for all oxygen atoms in the water molecules in our test system, using the class :code:`pmda.rdf.InterRDF`.
The RDF calculation is compute-intensive due to the necessity to calculate and histogram a large number (:math:`\mathcal{O}(N^2)`) of distances for each time step; it additionally exemplifies a non-trivial reduction.

These two common computational tasks differ in their computational cost and represent two different requirements for data reduction and thus allow us to investigate two distinct use cases.
We investigated a long trajectory (9000 frames) and a short trajectory (900 frames) to assess the degree to which parallelization remained practical.
The computational experiments were performed in different scenarios to assess the influence of different Dask schedulers (*multiprocessing* and *distributed*) and the role of the file storage system (shared Lustre parallel file system and local SSD), as described below and summarized in Table :ref:`tab:configurations`.
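
As a rough illustration of how the two benchmark tasks are set up, the sketch below shows one possible invocation; the constructor and :code:`run()` arguments are assumptions modeled on the corresponding MDAnalysis classes, and the atom selections depend on the force field naming.

.. code-block:: python

   # Sketch only: argument names and atom selections are assumptions.
   import MDAnalysis as mda
   from pmda.rms import RMSD
   from pmda.rdf import InterRDF

   u = mda.Universe("YiiP_system.pdb", "YiiP_system_9ns_center.xtc")

   # RMSD task: C-alpha atoms, first frame as reference
   calpha = u.select_atoms("name CA")
   rmsd = RMSD(calpha, calpha).run(n_jobs=24)

   # RDF task: water oxygen-oxygen g(r); "name OW" assumes
   # GROMACS-style water oxygen atom names
   oxygens = u.select_atoms("name OW")
   rdf = InterRDF(oxygens, oxygens).run(n_jobs=24)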



Test system, benchmarking environment, and data files
-----------------------------------------------------

We tested PMDA 0.2.1, MDAnalysis 0.20.0 (development version), Dask 1.1.1, and NumPy 1.15.4 under Python 3.
All packages except PMDA and MDAnalysis were installed with the `conda`_ package manager from the `conda-forge`_ channel.
PMDA and MDAnalysis development versions were installed from source in a conda environment with ``pip install``.

Benchmarks were run on the CPU nodes of XSEDE's :cite:`XSEDE` *SDSC Comet* supercomputer, a 2 PFlop/s cluster with 1,944 Intel Haswell Standard Compute Nodes in total.
Each node contains two Intel Xeon CPUs (E5-2680v3, 12 cores, 2.5 GHz) with 24 CPU cores per node, 128 GB DDR4 DRAM main memory, and a non-blocking fat-tree InfiniBand FDR 56 Gbps node interconnect.
All nodes share a Lustre parallel file system and have access to node-local 320 GB SSD scratch space.
Jobs are run through the SLURM batch queuing system.
Our SLURM submission shell scripts and Python benchmark scripts for *SDSC Comet* are available in the repository https://github.com/Becksteinlab/scipy2019-pmda-data and are archived under DOI `10.5281/zenodo.3228422`_.

The test data files consist of a topology file ``YiiP_system.pdb`` (with :math:`N = 111,815` atoms) and two trajectory files ``YiiP_system_9ns_center.xtc`` (Gromacs XTC format, :math:`T = 900` frames) and ``YiiP_system_90ns_center.xtc`` (Gromacs XTC format, :math:`T = 9000` frames) of the membrane protein YiiP in a lipid bilayer together with water and ions.
The test trajectories are made available on figshare at DOI `10.6084/m9.figshare.8202149`_.

.. raw:: latex

\begin{table}
\begin{longtable*}[c]{p{0.3\tablewidth}p{0.1\tablewidth}lp{0.07\tablewidth}p{0.07\tablewidth}}
\toprule
\textbf{configuration label} & \textbf{file storage} & \textbf{scheduler} & \textbf{max nodes} & \textbf{max processes} \tabularnewline
\midrule
\endfirsthead
Lustre-distributed-3nodes & Lustre & \textit{distributed} & 3 & 72 \tabularnewline
Lustre-distributed-6nodes & Lustre & \textit{distributed} & 6 & 72 \tabularnewline
Lustre-multiprocessing & Lustre & \textit{multiprocessing} & 1 & 24 \tabularnewline
SSD-distributed & SSD & \textit{distributed} & 1 & 24 \tabularnewline
SSD-multiprocessing & SSD & \textit{multiprocessing} & 1 & 24 \tabularnewline
\bottomrule
\end{longtable*}
\caption{Testing configurations on \textit{SDSC Comet}.
\textbf{max nodes} is the maximum number of nodes that were tested; the \textit{multiprocessing} scheduler is limited to a single node.
\textbf{max processes} is the maximum number of processes or Dask workers that were employed.
\DUrole{label}{tab:configurations}
}
\end{table}

We tested different combinations of Dask schedulers (*distributed*, *multiprocessing*) with different means to read the trajectory data (either from the shared Lustre parallel file system or from local SSD) as shown in Table :ref:`tab:configurations`.
Using either the *multiprocessing* scheduler or the SSD restricts runs to a single node (maximum 24 CPU cores).
With *distributed* (and Lustre) we tested both fully utilizing all cores on a node and occupying only half the available cores while doubling the total number of nodes.
In all cases the trajectory was split into as many blocks as there were available processes or Dask workers.
We performed single benchmark runs for *distributed* on local SSD (*SSD-distributed*) and *multiprocessing* on Lustre (*Lustre-multiprocessing*) and five repeats for all other scenarios in Table :ref:`tab:configurations`.
We plotted results for one typical benchmark run each.
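
How a PMDA run picks up the active Dask scheduler is not shown in this excerpt; the following sketch only illustrates, under that caveat, the two scheduler setups compared here (the scheduler address is a placeholder).

.. code-block:: python

   # Sketch of the two Dask scheduler setups; the configuration mechanism
   # shown here is an assumption, not taken from the text above.
   import dask
   from dask.distributed import Client

   # single-node runs: process-based scheduler
   dask.config.set(scheduler="multiprocessing")

   # multi-node runs: connect to a dask.distributed scheduler that was
   # started separately on the cluster (placeholder address)
   client = Client("tcp://127.0.0.1:8786")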

.. TODO: replace figures with ones where 5 repeats are averaged and error bars indicate 1 stdev --- issue #27

Measured parameters
-------------------

The :code:`ParallelAnalysisBase` class collects detailed timing information for all blocks and all frames and makes these data available in the attribute :code:`ParallelAnalysisBase.timing`:
We measured the time |tprepare| for :code:`_prepare()`, the time |twait| that each task :math:`k` waits until it is executed by the scheduler, the time |tuniverse| to create a new :code:`Universe` for each Dask task (which includes opening the shared trajectory and topology files and loading the topology into memory), the time |tIO| to read each frame :math:`t` in each block :math:`k` from disk into memory, the time |tcomp| to perform the computation in :code:`_single_frame()` and reduction in :code:`_reduce()`, the time |tconclude| to perform the final processing of all data in :code:`_conclude()`, and the total wall time to solution |ttotal|.
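
For illustration, the timing data could be inspected after a run along the following lines; this is a sketch, and the sub-attribute names of the timing object are assumptions that may differ from the actual PMDA API.

.. code-block:: python

   # Hedged sketch: attribute names below (total, wait, io, compute) are
   # assumed for illustration and not verified against the PMDA release.
   from pmda.rms import RMSD

   rmsd = RMSD(calpha, calpha).run(n_jobs=24)  # calpha as in the earlier sketch
   timing = rmsd.timing
   print(timing.total)    # total wall time to solution
   print(timing.wait)     # per-task scheduler wait times
   print(timing.io)       # per-frame read times for each block
   print(timing.compute)  # per-frame compute and reduce times for each block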

We analyzed the total time to completion as a function of the number of CPU cores, which was equal to the number of trajectory blocks, so that each block could be processed in parallel.
We quantified the strong scaling behavior by calculating the *speed-up* for running on :math:`M` CPU cores with :math:`M` parallel Dask tasks as :math:`S(M) = t^\text{total}(1)/t^\text{total}(M)`, where :math:`t^\text{total}(1)` is the total time to solution of the PMDA code using the serial scheduler.
The *efficiency* was calculated as :math:`E(M) = S(M)/M`.
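
In code, the two quantities follow directly from the measured total times (the numbers below are placeholders for illustration, not benchmark results).

.. code-block:: python

   # Speed-up and efficiency from measured total times to solution.
   # The values are illustrative placeholders, not measured results.
   import numpy as np

   cores   = np.array([1, 2, 4, 8, 12, 24, 48, 72])
   t_total = np.array([1200., 620., 330., 180., 130., 80., 60., 55.])  # s

   S = t_total[0] / t_total   # S(M) = t_total(1) / t_total(M); index 0 = serial
   E = S / cores              # E(M) = S(M) / M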

To gain better insight into the performance-limiting steps in our algorithm (Fig. :ref:`fig:schema`) we plotted the *maximum* times over all ranks because the overall time to completion cannot be faster than the slowest parallel process.
For example, for the read I/O time we calculated the total read I/O time for each rank :math:`k` as :math:`t^\text{I/O}_k = \sum_{t=t_k}^{t_k + \tau_k} t^\text{I/O}_{k, t}` and then reported :math:`\max_k t^\text{I/O}_k`.
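
The reduction for the read I/O time can be written, for example, as in the following sketch (with randomly generated placeholder data standing in for the measured per-frame times).

.. code-block:: python

   # Per-rank reduction of per-frame I/O times as described above;
   # io_times[k][t] holds the (placeholder) read time of frame t in block k.
   import numpy as np

   n_blocks, frames_per_block = 4, 225
   io_times = [np.random.rand(frames_per_block) * 1e-3
               for _ in range(n_blocks)]              # placeholder data

   t_io_per_rank = [frame_times.sum() for frame_times in io_times]
   t_io_max = max(t_io_per_rank)  # the slowest rank bounds time to completion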




RMSD analysis task
------------------
