
Improve import times #3055

Merged
merged 5 commits into master from import_times on Oct 8, 2018

Conversation

@philippjfr
Contributor

philippjfr commented Oct 6, 2018

This PR attempts to improve import times by deferring imports of:

  • xarray
  • dask
  • datashader and therefore numba (these were accidentally imported automatically in element/graphs.py)

Here are the current import timings:

  • Before this PR (with iris): 2.5 seconds
  • Before this PR (without iris): 1.5 seconds
  • After this PR: 0.55 seconds

In addition to some utilities to load the backends, this PR works by declaring two new methods on Interfaces. The first, Interface.loaded, is called to check whether an interface's dependencies have been loaded; if not, the interface is skipped. It defaults to True for backward compatibility. The second, Interface.applies, is used to check whether the Interface applies to the supplied data object. By default this method simply checks whether the data is one of the Interface.types, maintaining backward compatibility, but interfaces with heavy dependencies can override it to defer the import. A sketch of how this fits together is shown below.
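To make the mechanism concrete, here is a minimal sketch of how the two methods could look on an interface with a heavy dependency (a hypothetical illustration based on the description above, not the exact code from this PR):

import sys

class Interface:
    types = ()

    @classmethod
    def loaded(cls):
        # Default: report dependencies as loaded, preserving old behavior
        return True

    @classmethod
    def applies(cls, obj):
        # Default: a plain type check against the declared types
        return type(obj) in cls.types

class XArrayInterface(Interface):
    datatype = 'xarray'

    @classmethod
    def loaded(cls):
        # Only consider this interface once the user has imported xarray;
        # checking sys.modules avoids triggering the import ourselves.
        return 'xarray' in sys.modules

    @classmethod
    def applies(cls, obj):
        if not cls.loaded():
            return False
        import xarray as xr  # deferred: only runs once xarray is loaded
        return isinstance(obj, (xr.Dataset, xr.DataArray))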

@philippjfr philippjfr force-pushed the import_times branch 5 times, most recently from 505f89e to 17188b5 Oct 6, 2018

@philippjfr philippjfr force-pushed the import_times branch from 7c8de06 to 08e26ff Oct 7, 2018

@philippjfr philippjfr added the data label Oct 7, 2018

@philippjfr
Contributor Author

philippjfr commented Oct 7, 2018

Ready to review and merge.

@philippjfr philippjfr force-pushed the import_times branch from 796b209 to 14510e3 Oct 7, 2018

@philippjfr
Contributor Author

philippjfr commented Oct 8, 2018

Just going to note that this PR shaved about 10 minutes off running nbsmoke on all our current notebook examples, user guides, etc. (244 notebooks in total); unfortunately it's still slightly too long a job for Travis.

@philippjfr
Contributor Author

philippjfr commented Oct 8, 2018

Also please take a second to appreciate the coverage of 88.888% 😃

@jlstevens
Contributor

jlstevens commented Oct 8, 2018

... unfortunately it's still slightly too long a job for Travis.

So roughly how long does it take?

from ..ndmapping import OrderedDict
from ..spaces import HoloMap, DynamicMap
from ..util import (basestring, dimension_range as d_range, get_param_values,
isfinite, process_ellipses, unique_iterator, wrap_tuple)


@philippjfr

philippjfr Oct 8, 2018

Author Contributor

I had some weird issues with from .. import util picking up the wrong utilities, hence I did this.


@jlstevens

jlstevens Oct 8, 2018

Contributor

I see. I would prefer to figure out the issue rather than switching to the unqualified version...

@philippjfr
Contributor Author

philippjfr commented Oct 8, 2018

So roughly how long does it take?

Travis jobs time out at 50 minutes; by that time it's about 90% through the notebooks.

if 'dask' in sys.modules:
import dask.dataframe as dd
else:
dd = None


@jlstevens

jlstevens Oct 8, 2018

Contributor

Could also do:

dd = None
if 'dask' in sys.modules:
    import dask.dataframe as dd

Saves a line and is maybe even more explicit that dd always exists...


@jlstevens

jlstevens Oct 8, 2018

Contributor

This isn't a strong opinion but I do think my suggested version is slightly better. Don't you?


@philippjfr

philippjfr Oct 8, 2018

Author Contributor

Basically have no preference here.

array_types += (da.Array,)
return array_types

def get_dask_array():


@jlstevens

jlstevens Oct 8, 2018

Contributor

I think get_dask_array_module or dask_array_module or even da_module or get_da_module would be clearer. You aren't getting the data type, you are getting a module...


@philippjfr

philippjfr Oct 8, 2018

Author Contributor

Sounds good.
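For illustration, the renamed helper might look roughly like this (a sketch assuming the sys.modules-based deferral described above; dask_array_module is the name suggested in the review, not necessarily the final code):

import sys

def dask_array_module():
    # Return the dask.array module if dask has already been imported
    # by the user, otherwise None, without importing dask ourselves.
    if 'dask' in sys.modules:
        import dask.array as da
        return da
    return None

A caller can then extend the recognized array types only when dask is present, e.g. da = dask_array_module(), followed by array_types += (da.Array,) when da is not None.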

@@ -210,7 +208,7 @@ class Dataset(Element):

 def __init__(self, data, kdims=None, vdims=None, **kwargs):
     if isinstance(data, Element):
-        pvals = util.get_param_values(data)
+        pvals = get_param_values(data)


@jlstevens

jlstevens Oct 8, 2018

Contributor

Why remove the namespace qualification? Does it affect the import times? (It shouldn't!)


@philippjfr

philippjfr Oct 8, 2018

Author Contributor

See my comment above; I can have another look at this, but:

I had some weird issues when importing from .. import util getting the wrong utilities, hence I did this.

         return dim.range
     elif dim in self.dimensions() and data_range and len(self):
         lower, upper = self.interface.range(self, dim)
     else:
         lower, upper = (np.NaN, np.NaN)
     if not dimension_range:
         return lower, upper
-    return util.dimension_range(lower, upper, dim.range, dim.soft_range)
+    return d_range(lower, upper, dim.range, dim.soft_range)


@jlstevens

jlstevens Oct 8, 2018

Contributor

I think I prefer the old version... I don't see why the name d_range should be introduced for a single use (at least in this PR diff...)


@philippjfr

philippjfr Oct 8, 2018

Author Contributor

Again, see the comment above.


@jlstevens

jlstevens Oct 8, 2018

Contributor

Well, really it is two things:

  1. For some reason you have problems with the qualified import (the root problem which would ideally be fixed)
  2. You renamed dimension_range to d_range to avoid a local variable name clash.

This PR would be greatly simplified if 1 can be fixed!


@philippjfr

philippjfr Oct 8, 2018

Author Contributor

Having nightmares with this, but I should be able to fix it.


datatype = 'xarray'

@classmethod
def loaded(cls):


@jlstevens

jlstevens Oct 8, 2018

Contributor

This method makes sense, but based on the way it is used, shouldn't it be called available instead of loaded? I thought the idea was that you can check sys.modules for availability, and it is only loaded once actually imported...


@philippjfr

philippjfr Oct 8, 2018

Author Contributor

Slightly confused here: what is the difference between the semantics of available and loaded in your mind? Loaded currently checks whether the dependency has been loaded; it does not check whether the library is importable (which is my interpretation of "available"), because there is no way to check whether a library is importable without importing it.


@jlstevens

jlstevens Oct 8, 2018

Contributor

I understand it now, loaded is correct.
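For reference, the distinction the two names suggest could be sketched roughly as follows (illustrative only; note that importlib.util.find_spec can locate an installed module without executing the module itself, though it still imports any parent packages):

import sys
import importlib.util

def loaded(name):
    # True only if the module has already been imported somewhere
    return name in sys.modules

def available(name):
    # True if the module is installed and could be imported
    return importlib.util.find_spec(name) is not None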

@jlstevens
Contributor

jlstevens commented Oct 8, 2018

I've made a few suggestions and everything is clearer now that I understand you expect that the user will populate sys.modules (unless they use an interface with a literal constructor).

The main issue is the removal of the qualified namespace. Restoring that would greatly simplify this PR and keep things cleaner overall imho.

Otherwise, I'm very happy with the import time speed up and am happy to merge.

@jlstevens
Contributor

jlstevens commented Oct 8, 2018

Travis jobs time out at 50 minutes; by that time it's about 90% through the notebooks.

I think things would be better if we could at least nbsmoke the user/getting started guides. If we skip the elements (or maybe only do those on doc builds?), can we get it to around 20 minutes (i.e. around how long it takes to run test builds now)?

@philippjfr
Contributor Author

philippjfr commented Oct 8, 2018

I think things would be better if we could at least nbsmoke the user/getting started guides. If we skip the elements (or maybe only do those on doc builds?), can we get it to around 20 minutes (i.e. around how long it takes to run test builds now)?

I think we can probably get them all working in a travis cron job.

@jlstevens
Contributor

jlstevens commented Oct 8, 2018

All looks good now, thanks! I'll merge when the tests pass.

@jlstevens jlstevens merged commit d6aa608 into master Oct 8, 2018

4 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
continuous-integration/travis-ci/push The Travis CI build passed
coverage/coveralls Coverage increased (+0.02%) to 88.888%
s3-reference-data-cache Test data is cached.
@jbednar
Contributor

jbednar commented Oct 8, 2018

Travis jobs time out at 50 minutes; by that time it's about 90% through the notebooks.

Can we split them into two separate jobs in the .travis.yml, or is Travis counting the entire time of all the CI targets?

@philippjfr
Contributor Author

philippjfr commented Oct 8, 2018

Can we split them into two separate jobs in the .travis.yml, or is Travis counting the entire time of all the CI targets?

We can do that, but it'll probably wait until we've set up the new architecture. Don't really want to make the .travis.yml more complex only to then have to rewrite it all.

philippjfr added a commit that referenced this pull request Oct 8, 2018

@philippjfr philippjfr deleted the import_times branch Nov 12, 2018
