
Performance #16

Open
svdhoog opened this issue May 25, 2018 · 1 comment

Comments


svdhoog commented May 25, 2018

A couple of design decisions in FLAViz lead to performance issues.

  • The HDF5 file has to be read in completely before any data filtering, selection, etc. can start. This is particularly an issue for large-scale data sets, where the read-in phase can take a long time. If the user only wishes to plot a specific subset of the data, reading in the entire file is unnecessary. A solution would be to query the HDF5 file for the required subset and retrieve only that part. This would probably require moving away from HDF5 as a data format and using a different database format.

  • No parallelization. All processing in the Python scripts is serial. Several steps are parallelizable, however:

1 Data conversion from db to h5. This works on a file-per-file basis and is highly parallelizable: simply launch multiple sub-processes that each run the same db_hdf5_v2.py script on a subset of the files. Currently only one core is used.

1b Processing of the per-set-and-run files set_*_run_*.h5
Since these files are generated at an intermediate stage (translated from the set_run.db files), the FLAViz routines could work on them instead of on the more monolithic Agent.h5 files. If only a subset of the data is required for a specific task, it is clearly more efficient to use only the files needed, rather than loading the entire data set into memory.

2 Plotting. Multiple plots are currently processed one by one. Each plot could be a separate sub-process that retrieves its data from the main data set once it has been read into main memory.

3 Transformations. If there are multiple tasks, each task could run in its own sub-process, allocated to a different core.
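The fan-out pattern behind points 1–3 can be sketched with Python's multiprocessing module. This is a minimal sketch only: convert_file is a hypothetical stand-in for the real per-file work done by db_hdf5_v2.py, and the file names are illustrative.

```python
from multiprocessing import Pool
from pathlib import Path

def convert_file(db_path):
    # Placeholder for the actual db -> h5 conversion done per file;
    # here it just computes the target .h5 name to show the pattern.
    return Path(db_path).with_suffix(".h5").name

def convert_all(db_files, workers=4):
    # Each .db file is independent, so a process pool can convert
    # several files at once instead of looping serially on one core.
    with Pool(processes=workers) as pool:
        return pool.map(convert_file, db_files)

if __name__ == "__main__":
    files = ["set_1_run_1.db", "set_1_run_2.db", "set_2_run_1.db"]
    print(convert_all(files))
```

The same Pool structure would apply to plots or transformation tasks: put one plot or one task in each worker function call.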

Testing performance

There have been some preliminary attempts to test the performance of the scripts.
This is documented in the manual.


svaksha commented May 28, 2018

Slicing and indexing the h5 file as NumPy arrays with pandas, for a specific subset of the data, would allow the user to plot a number of subsets instead of the whole file; otherwise, try chunking.
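Both ideas (subset queries and chunking) can be sketched with pandas' HDFStore interface. This assumes the HDF5 file is written in pandas' "table" format with data_columns for the query keys, which may differ from FLAViz's current file layout; the file name and column names here are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Illustrative data; set_id/run_id mimic the per-set-and-run structure.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
df = pd.DataFrame({"set_id": [1, 1, 2, 2],
                   "run_id": [1, 2, 1, 2],
                   "x": [10.0, 11.0, 12.0, 13.0]})

# "table" format with data_columns lets where-queries run on disk.
df.to_hdf(path, key="agents", format="table",
          data_columns=["set_id", "run_id"])

# Read only the rows for set_id == 1 instead of loading the whole file.
subset = pd.read_hdf(path, key="agents", where="set_id == 1")

# Alternatively, stream the table in fixed-size chunks to bound memory.
with pd.HDFStore(path) as store:
    n_rows = sum(len(chunk) for chunk in store.select("agents", chunksize=2))
```

This requires PyTables to be installed, and the where-query only pushes down to disk for columns declared in data_columns.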
