
Optimize readers searching for matching filenames #1178

Merged
merged 6 commits into pytroll:master from djhoese:optimize-reader-globify on May 7, 2020

Conversation

@djhoese (Member) commented May 4, 2020

This addresses some of the performance issues mentioned in #1172. Thanks to @gerritholl's investigation, it was discovered that the base reader's functions for searching for matching files were taking a long time. We've tracked it down to a few key parts, the main one being that globify was called thousands of times for only hundreds of files. The changes in this PR seem to make a big difference.

Here is a PyCharm profiling graph showing the most called or longest running functions in the call graph of my test script:

[PyCharm profiling graph: globify_orig]

With a total run time of ~16.5s. Here's what it looks like after this PR (globify is not even listed):

[PyCharm profiling graph: globify_2]

With a total run time of ~4.7s.
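For context, globify is trollsift's helper that turns a filename pattern into a glob pattern, so calling it thousands of times for the same handful of reader patterns is pure overhead. A minimal illustration of that step, using a made-up pattern rather than one of Satpy's real reader patterns, looks roughly like this:

```python
# Minimal illustration only, assuming trollsift's public globify(); the
# pattern below is an example, not one of Satpy's actual reader patterns.
from trollsift import globify

pattern = "{platform_name}_{sensor}_{start_time:%Y%m%d_%H%M%S}.nc"

# globify translates the trollsift pattern into a glob/fnmatch pattern
# that can be matched against candidate filenames.
print(globify(pattern))
```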

@djhoese added the component:readers, enhancement, and optimization labels May 4, 2020
@djhoese djhoese self-assigned this May 4, 2020
@djhoese (Member, Author) commented May 4, 2020

I think I'd like to look at caching these globified patterns so in MultiScene cases they are only globified once. That's my next step although I don't expect it to make a big difference with my current test case.

@mraspaud (Member) commented May 4, 2020

If globify isn't listed anymore, then maybe the optimization should happen somewhere else now?

@djhoese djhoese added this to To do in PCW Spring 2020 via automation May 4, 2020
@coveralls

Coverage Status

Coverage increased (+0.004%) to 89.612% when pulling 0cbea18 on djhoese:optimize-reader-globify into a82e4a6 on pytroll:master.

@coveralls commented May 4, 2020

Coverage Status

Coverage increased (+0.01%) to 89.62% when pulling 11e3b4c on djhoese:optimize-reader-globify into a82e4a6 on pytroll:master.

@codecov bot commented May 4, 2020

Codecov Report

Merging #1178 into master will increase coverage by 0.00%.
The diff coverage is 97.36%.


@@           Coverage Diff           @@
##           master    #1178   +/-   ##
=======================================
  Coverage   89.61%   89.61%           
=======================================
  Files         200      200           
  Lines       29504    29537   +33     
=======================================
+ Hits        26439    26471   +32     
- Misses       3065     3066    +1     
Impacted Files Coverage Δ
satpy/readers/yaml_reader.py 95.30% <96.15%> (-0.13%) ⬇️
satpy/readers/__init__.py 95.10% <100.00%> (+0.01%) ⬆️
satpy/tests/test_readers.py 98.98% <100.00%> (+0.02%) ⬆️
satpy/tests/test_yaml_reader.py 99.78% <100.00%> (ø)
satpy/tests/reader_tests/test_fci_l1c_fdhsi.py 100.00% <0.00%> (ø)
satpy/readers/fci_l1c_fdhsi.py 96.37% <0.00%> (+0.16%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@djhoese (Member, Author) commented May 4, 2020

Did a quick MultiScene test trying to create 60 Scenes using MultiScene.from_files with the current changes from this PR:

[image: satpy_multiscene_2]

It takes 37 seconds to create 60 Scenes. Not sure whether that is good or bad.
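Roughly, the kind of test described above could look like the sketch below; the data directory and the abi_l1b reader name are assumptions, not the exact test case used for these timings.

```python
# Hedged sketch of a MultiScene.from_files timing test; the path and the
# reader name are placeholders, not the actual test case from this PR.
import time
from glob import glob

from satpy import MultiScene

files = glob("/data/abi/*.nc")  # hypothetical directory of input files
t0 = time.time()
mscn = MultiScene.from_files(files, reader="abi_l1b")
print("Created MultiScene in %.1f s" % (time.time() - t0))
```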

@djhoese (Member, Author) commented May 4, 2020

We could think about using asv to perform some benchmark tests: https://tomaugspurger.github.io/maintaing-performance.html

If we wanted this semi-automated we could add it to the integration tests on Jenkins (running on the SSEC's bumi server).
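For reference, asv benchmarks are plain Python classes whose time_* methods get timed; a hedged sketch of what one could look like for this code path (not something this PR adds) is:

```python
# Sketch of a possible asv benchmark; the fake base_dir and reader name
# are placeholders. A real benchmark's setup() would need to create files
# matching the reader's patterns, since find_files_and_readers raises an
# error when nothing is found.
from satpy.readers import find_files_and_readers


class TimeFindFiles:
    """asv times any method whose name starts with time_."""

    def setup(self):
        self.base_dir = "/tmp/fake_satpy_files"  # hypothetical test data

    def time_find_files_and_readers(self):
        find_files_and_readers(base_dir=self.base_dir, reader="seviri_l1b_hrit")
```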

@djhoese (Member, Author) commented May 4, 2020

Ok, after answering messages and taking a break for a bit, I started playing around with lru_cache (from functools in the standard library) and added it to some key points where patterns were being used. It cut the timing down to 2.8s total.

[PyCharm profiling graph: globify_3]

On Slack, @mraspaud had brought up making the parser faster, but I wasn't sure that was possible since it is a lower-level function. This caching shows it is possible. We just have to control it well so we don't fill up people's memory or hold on to too much (the default lru_cache size is 128 items, I think).
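The idea is just to memoize the pattern translation, since the same handful of patterns gets looked up over and over; a minimal sketch of the approach, not the exact helper added in this PR:

```python
# Sketch only: cache the glob translation so each unique trollsift
# pattern is globified once instead of thousands of times.
from functools import lru_cache

from trollsift import globify


@lru_cache(maxsize=128)  # the functools default; bounds how much we hold on to
def _cached_globify(pattern):
    return globify(pattern)
```

Because lru_cache keys on the function arguments, this only works as long as the pattern is passed as a hashable object such as a plain string.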

I'd still like to restructure some of the stuff in Satpy as it is needlessly calling some of these functions. I'll see what I can do.

@djhoese (Member, Author) commented May 4, 2020

Ok, I think I'm done with all of the satpy-specific optimizations, but I have a related refactor I want to do. @mraspaud, what are your thoughts on renaming some of the internal functions so they have _ at the front? @gerritholl has asked whether some functions are considered public interfaces. Some of them are, sure, but others shouldn't be. I'd like to prefix some of these with _ so I can more easily justify requiring a set object as input rather than converting to a set first everywhere.

Edit: I still need to make a trollsift PR. I'm not sure any of the changes here need tests since they are all refactors.
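As a purely illustrative example of that contract change (the function name below is hypothetical, not one of the helpers actually renamed here):

```python
# Hypothetical internal helper: the leading underscore signals it is not
# part of the public API, which makes it reasonable to require sets as
# input instead of converting inside the function on every call.
def _unmatched_filenames(filenames: set, matched: set) -> set:
    """Return the filenames that no pattern matched."""
    return filenames - matched
```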

@djhoese (Member, Author) commented May 4, 2020

See pytroll/trollsift#25 for trollsift optimization.

@djhoese (Member, Author) commented May 4, 2020

Here is the PyCharm profiling graph after both sets of changes (total run time ~2.3s):

[PyCharm profiling graph: globify_4]

@djhoese djhoese moved this from To do to In progress in PCW Spring 2020 May 4, 2020
PCW Spring 2020 automation moved this from In progress to Ready to merge May 5, 2020
@mraspaud (Member) left a comment

LGTM. We might want to plan for using asv in the future...

@gerritholl (Collaborator)

I tried this with the latest trollsift master and with #1169 merged, using the same test script as before. New times are:

  • 125 fake files: 40 calls to globify, 0.14 seconds for find_files_and_readers (previously 5.12 seconds).
  • 1000 fake files: 40 calls to globify, 0.28 seconds for find_files_and_readers.
  • 60000 fake files: 40 calls to globify, 12.4 seconds for find_files_and_readers.
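A rough sketch of how timings like these could be collected is below; the directory of fake files and the reader name are assumptions, not @gerritholl's actual test script.

```python
# Hedged timing sketch; base_dir and reader are placeholders.
import time

from satpy.readers import find_files_and_readers

t0 = time.time()
found = find_files_and_readers(base_dir="/tmp/fake_files", reader="seviri_l1b_hrit")
elapsed = time.time() - t0
n_files = sum(len(f) for f in found.values())  # found maps reader name -> file list
print("find_files_and_readers matched %d files in %.2f s" % (n_files, elapsed))
```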

(Two review comment threads on satpy/readers/yaml_reader.py were marked resolved.)
@djhoese (Member, Author) commented May 5, 2020

I think I've addressed everything @gerritholl mentioned and I'm ready for a re-review and merge assuming the tests pass.

@gerritholl (Collaborator)

Excellent work, thanks. All good as far as I can tell :)

@mraspaud (Member) left a comment

LGTM

@mraspaud mraspaud merged commit 5343a83 into pytroll:master May 7, 2020
PCW Spring 2020 automation moved this from Ready to merge to Done May 7, 2020
@djhoese djhoese deleted the optimize-reader-globify branch May 7, 2020 12:14
Labels
component:readers · enhancement · optimization
Projects
PCW Spring 2020

Development
Successfully merging this pull request may close these issues: find_files_and_readers is slow
4 participants