New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: improve performance of blaze core loader #1227

Merged
merged 5 commits into from Jun 3, 2016

Conversation

Projects
None yet
3 participants
@llllllllll
Member

llllllllll commented May 23, 2016

Breaking out what I have now so I don't build up a very large diff.

  1. Only do the lower bound query for the base expression, not the deltas. This was extra work that is not needed.
  2. DatetimeIndex.time >= time is really slow. Implements a mask_time_between which will do the time comparison against the datetime index much more quickly.
  3. Upgrade blaze ecosystem. One important change is to drop an unneeded copy in the blaze server/client.

This shaves about 1.5s off the tests which is nice.

**odo_kwargs
)
if lower is pd.NaT:
if not deltas:

This comment has been minimized.

@llllllllll

llllllllll May 23, 2016

Member

The diff is messy here. The block was just indented by 4 inside the if.

@coveralls

This comment has been minimized.

coveralls commented May 23, 2016

Coverage Status

Coverage increased (+0.006%) to 81.692% when pulling 9de66d3 on blaze-loader-perf into 96ec1fd on master.

@coveralls

This comment has been minimized.

coveralls commented May 23, 2016

Coverage Status

Coverage increased (+0.006%) to 81.692% when pulling 9de66d3 on blaze-loader-perf into 96ec1fd on master.

@llllllllll llllllllll force-pushed the blaze-loader-perf branch 3 times, most recently from 271bc48 to c569a3d May 24, 2016

llllllllll added some commits May 25, 2016

ENH: improve performance of time comparisons
Adds `mask_time_between` to do more efficient comparisons between
pandas.DatetimeIndex and datetime.time objects.

This is used in the loader utils to more efficiently normalize datetimes
around the query time.
ENH: remove the ffill lower bound query in blaze loader
This query is often only cutting out a couple of months or a week of
data. The cost of computing this lower bound does not outway the cost of
sending back too much data.
@coveralls

This comment has been minimized.

coveralls commented May 25, 2016

Coverage Status

Coverage increased (+0.002%) to 81.753% when pulling c569a3d on blaze-loader-perf into d34b1b9 on master.

def _time_to_micros(time):
"""Convert a time into milliseconds since midnight.

This comment has been minimized.

@ssanderson

ssanderson May 27, 2016

Member

The name here says micros, but the docstring says millis. From reading the code, I think the docstring is correct?

This comment has been minimized.

@llllllllll

llllllllll May 27, 2016

Member

the docstring is wrong, 1000000 * seconds is micro

Returns
-------
ms : int
The number of milliseconds since midnight.

This comment has been minimized.

@ssanderson

ssanderson May 27, 2016

Member

(Assuming no leap seconds or DST. Does that affect the places where this is being used?)

This comment has been minimized.

@llllllllll

llllllllll May 27, 2016

Member

I put this in the note, it does not affect the usage.

lop, rop = _opmap[include_start, include_end]
if start_micros <= end_micros:
join_op = op.and_
else:

This comment has been minimized.

@ssanderson

ssanderson May 27, 2016

Member

What's the intended behavior here if start is greater than end?

This comment has been minimized.

@llllllllll

llllllllll May 27, 2016

Member

This is actually how we are using it: if we unroll our particular case of 8:45 <= time < 0:00 we get:

(start_micros <= time_micros) or (time_micros < end_micros)

since time_micros = 0 we reduce to:

start_micros <= time which is correct.

See Also
--------
:meth:`pandas.DatetimeIndex.indexer_between_times`

This comment has been minimized.

@ssanderson

ssanderson May 27, 2016

Member

The method (at least on pandas 16) is indexer_between_time (no s).

--------
:meth:`pandas.DatetimeIndex.indexer_between_times`
"""
time_micros = dts._get_time_micros()

This comment has been minimized.

@ssanderson

ssanderson May 27, 2016

Member

We should probably comment here that this is extracted from indexer_between_times and give credit to the original author(s). (Looks like Wes McKinney, Chang She, and Grant Roch). We could also ask jreback if there's a preferred citation method for code that's ported out of pandas.

This comment has been minimized.

@llllllllll

llllllllll May 27, 2016

Member

Good call, I will put something in the Notes and check with Jeff.

This comment has been minimized.

@ssanderson

ssanderson May 27, 2016

Member

I'm not sure it needs to be in the docstring, since where the code came from isn't really public-facing info.

This comment has been minimized.

@ssanderson

ssanderson May 27, 2016

Member

We're also still using DatetimeIndex._get_time_micros, which isn't public.

The implementation of that method (in 0.16 at least) is:

    def _get_time_micros(self):
        utc = _utc()
        values = self.asi8
        if self.tz is not None and self.tz is not utc:
            values = self._local_timestamps()
        return tslib.get_time_micros(values)

Where get_time_micros is a Cython function that looks like a vectorized version of _time_to_micros.
Should we at least use the tslib function instead of relying on a private DatetimeIndex method?

def get_time_micros(ndarray[int64_t] dtindex):
    """
    Datetime as int64 representation to a structured array of fields
    """
    cdef:
        Py_ssize_t i, n = len(dtindex)
        pandas_datetimestruct dts
        ndarray[int64_t] micros

    micros = np.empty(n, dtype=np.int64)

    for i in range(n):
        pandas_datetime_to_datetimestruct(dtindex[i], PANDAS_FR_ns, &dts)
        micros[i] = 1000000LL * (dts.hour * 60 * 60 +
                                 60 * dts.min + dts.sec) + dts.us

    return micros

This comment has been minimized.

@llllllllll

llllllllll May 27, 2016

Member

are you suggesting inlining this to rely only on the tslib function? Isn't tslib also "private" but just not prefixed with "_"? If that is the case I would leave it like this to make it clear we are using some priivate methods.

@llllllllll llllllllll merged commit cf1687e into master Jun 3, 2016

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@llllllllll llllllllll deleted the blaze-loader-perf branch Jun 3, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment