pandas .resample() "how" deprecation as of its 0.19 version. Fix our daily(), monthly(), quarterly() #6

Closed
3 of 6 tasks
rsvp opened this issue Nov 5, 2016 · 3 comments

Comments

@rsvp
Owner

rsvp commented Nov 5, 2016

Description of specific issue

When resampling a time-series the following warning(s) will appear:

FutureWarning: how in .resample() is deprecated
the new syntax is .resample(...).median() fill_method=None)

FutureWarning: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)

The warning is somewhat cryptic until one realizes that how='median'
was being passed as an argument to the .resample function.
So the how argument becomes the problem for the yi_fred module,
specifically for our functions
daily(), monthly(), and quarterly() in fecon235.

(Sidenote: how='median' since it is more robust than 'mean'.)
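
To make the first warning concrete, here is a minimal sketch of the old call next to the new syntax. The series, its dates, and its values are made up for illustration and are not fecon235 data.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: values 0..59 over 60 calendar days.
idx = pd.date_range('2016-01-01', periods=60, freq='D')
series = pd.Series(np.arange(60, dtype=float), index=idx)

# Old style, which triggers the "how" warning:
#     series.resample('MS', how='median')
#
# New style: resample() returns a deferred object,
# and the aggregation is applied as a method call on it.
monthly_median = series.resample('MS').median()
print(monthly_median)
```

The aggregation method simply takes the place of the how argument, so 'mean' would become .mean() in the same way.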

The second cryptic warning can be traced to our use of
fill_method=None when upsampling. The new API
urges us to use one of these methods instead:

  • .backfill() : use NEXT valid observation to fill
  • .ffill() : propagate last valid observation forward to next valid
  • .fillna() : fill using nulls
  • .asfreq() : convert TimeSeries to specified frequency
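
As a rough illustration of how these upsampling methods differ, the sketch below applies three of them to a made-up month-start series. It uses bfill(), the short spelling of backfill(); the data are hypothetical.

```python
import pandas as pd

# Hypothetical month-start series with three observations.
idx = pd.date_range('2016-01-01', periods=3, freq='MS')
monthly = pd.Series([1.0, 2.0, 4.0], index=idx)

upsampler = monthly.resample('D')   # deferred: nothing is computed yet

as_is = upsampler.asfreq()      # newly created dates are left as NaN
forward = upsampler.ffill()     # carry the last valid observation forward
backward = upsampler.bfill()    # use the NEXT valid observation to fill

mid_january = pd.Timestamp('2016-01-15')
print(as_is[mid_january], forward[mid_january], backward[mid_january])
```

Leaving the gaps as NaN via asfreq() is the closest analogue to the old fill_method=None, which is why a later interpolate() still has something to fill.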

  • Bug as of pandas 0.19
  • Enhancement

Expected behavior

No such warnings, and no risk of fatal termination.

Observed behavior

Warnings started as of pandas 0.18

Why would the improvement be useful to most users?

Because daily(), monthly(), and quarterly() in fecon235
should just work without the casual user needing to learn
obscure flags and methods (subject to future API changes).

Additional helpful details for bugs

  • Problem started recently, but not in older versions

  • Problem happens with all files, not only some files

  • Problem can be reliably reproduced

  • Problem happens randomly

  • fecon235 version: v4.16.1030

  • pandas version: 0.18

  • Python version: both 2.7 and 3

  • Operating system: cross-platform

@rsvp rsvp added this to the pandas API break milestone Nov 5, 2016
@rsvp
Owner Author

rsvp commented Nov 5, 2016

An immediate remedy, if this issue proves fatal for you,
is to downgrade to pandas 0.18.0 or 0.18.1.

The problem summarized: for pandas API > 0.18, you can either
downsample OR upsample, but not both.

The prior API implementations would allow you to pass an aggregator function
(e.g. mean) even though you were upsampling, which caused some confusion.

Thus the fecon235 resampling functions, which have been working under
both upsampling and downsampling situations, will break
(e.g. see the yi_fred code).

So is there a pandas way to detect which type of sampling is being requested
given the data argument? Otherwise, the fix may have to involve an additional
mandatory flag, and tedious edits across many fecon235 notebooks.
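
One possible way to detect the direction, sketched here under assumptions: measure the smallest gap between consecutive index values in seconds and compare it against the spacing expected for the target frequency. The helper names min_gap_seconds() and sampling_direction() are hypothetical and are not part of pandas or fecon235.

```python
import pandas as pd


def min_gap_seconds(index):
    '''Smallest spacing between consecutive index values, in seconds.'''
    gaps = index.to_series().diff().dropna()
    return gaps.min().total_seconds()


def sampling_direction(index, target_secs):
    '''Return 'down' if the data is finer than the target spacing, else 'up'.'''
    return 'down' if min_gap_seconds(index) < target_secs else 'up'


daily_idx = pd.date_range('2016-01-01', periods=90, freq='D')
quarterly_idx = pd.date_range('2016-01-01', periods=4, freq='QS')

secs31days = 31 * 24 * 3600.0   # upper bound on the length of one month

print(sampling_direction(daily_idx, secs31days))       # daily to monthly
print(sampling_direction(quarterly_idx, secs31days))   # quarterly to monthly
```

A numeric comparison like this would avoid the need for a mandatory flag, since the data argument alone determines the branch.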

@rsvp rsvp changed the title Nov 6, 2016
    from: pandas .resample() "how" deprecation as of its 0.19 version. Need to fix our daily(), weekly(), monthly()
    to:   pandas .resample() "how" deprecation as of its 0.19 version. Fix our daily(), monthly(), quarterly()
@rsvp rsvp closed this as completed in 86fb993 Nov 7, 2016
rsvp added a commit that referenced this issue Nov 7, 2016
Esp. for index_delta_secs() and resample_main()
to fix #6 daily(), monthly(), and quarterly().
@rsvp
Owner Author

rsvp commented Nov 7, 2016

Key points in resolving this issue

  • Reliably infer the frequency of a DataFrame's index
  • Write a function to compare index frequencies and handle resampling
  • Let the machine decide whether downsampling or upsampling is appropriate
  • Hide the messy details from the casual user

pandas breaks previous API for resampling

  • The code fix will now require pandas 0.18 or higher
  • Accordingly, we increment our project from v4 to v5

Code which solves current issue

import numpy as np
#  Note: tools and system used below are fecon235's own helper modules.


def index_delta_secs( dataframe ):
    '''Find minimum in seconds between index values.'''
    nanosecs_timedelta64 = np.diff(dataframe.index.values).min()
    #  Picked min() over median() to conserve memory;      ^^^^^!
    #  also avoids missing values issue, 
    #  e.g. weekend or holidays gaps for daily data.
    secs_timedelta64 = tools.div( nanosecs_timedelta64, 1e9 )
    #  To avoid numerical error, we divide before converting type: 
    secs = secs_timedelta64.astype( np.float32 )
    if secs == 0.0:
        system.warn('Index contains duplicate, min delta was 0.')
    return secs

    #  There are OTHER METHODS to get the FREQUENCY of a dataframe:
    #       e.g.  df.index.freq  OR  df.index.freqstr , 
    #  however, these work only if the frequency was attributed:
    #       e.g.  '1 Hour'       OR  'H'  respectively. 
    #  The fecon235 derived dataframes will usually return None.
    #  
    #  Two timedelta64 units, 'Y' years and 'M' months, are 
    #  specially treated because the time they represent depends upon
    #  their context. While a timedelta64 day unit is equivalent to 
    #  24 hours, there is difficulty converting a month unit into days 
    #  because months have varying number of days. 
    #       Other numpy timedelta64 units can be found here: 
    #  http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html
    #  
    #  For pandas we could do:  pd.infer_freq( df.index )
    #  which, for example, might output 'B' for business daily series.
    #  
    #  But the STRING representation of index frequency is IMPRACTICAL
    #  since we may want to compare two unevenly timed indexes. 
    #  That comparison is BEST DONE NUMERICALLY in some common unit 
    #  (we use seconds since that is the Unix epoch convention).
    #
    #  Such comparison will be crucial for the machine 
    #  to choose whether downsampling or upsampling is appropriate.
    #  The casual user should not be expected to know the functions
    #  within index_delta_secs() to smoothly work with a notebook.


#  For details on frequency conversion, see McKinney 2013, 
#       Chp. 10 RESAMPLING, esp. Table 10-5 on downsampling.
#       pandas defaults are:  how='mean', closed='right', label='right'
#
#  2014-08-10  closed and label to the 'left' conform to FRED practices.
#              how='median' since it is more robust than 'mean'. 
#  2014-08-14  If upsampling, interpolate() fills linearly as if evenly spaced,
#              disregarding uneven time intervals.
#  2016-11-06  McKinney 2013 on resampling is outdated as of pandas 0.18


def resample_main( dataframe, rule, secs ):
    '''Generalized resample routine for downsampling or upsampling.'''
    #  rule is the offset string or object representing target conversion,
    #       e.g. 'B', 'MS', or 'QS-OCT' to be compatible with FRED.
    #  secs should be the maximum seconds expected for rule frequency.
    if index_delta_secs(dataframe) < secs:
        df = dataframe.resample(rule, closed='left', label='left').median()
        #    how='median' for DOWNSAMPLING deprecated as of pandas 0.18
        return df
    else:
        df = dataframe.resample(rule, closed='left', label='left').fillna(None)
        #    fill_method=None for UPSAMPLING deprecated as of pandas 0.18
        #    note that None almost acts like np.nan which fails as argument.
        #    interpolate() applies to those filled nulls when upsampling:
        #    'linear' ignores index values treating it as equally spaced.
        return df.interpolate(method='linear')


def daily( dataframe ):
    '''Resample data to daily using only business days.'''
    #                         'D' is used for calendar daily
    #                         'B' for business daily
    secs1day2hours = 93600.0
    return resample_main( dataframe, 'B', secs1day2hours )


def monthly( dataframe ):
    '''Resample data to FRED's month start frequency.'''
    #  FRED uses the start of the month to index its monthly data.
    #                         'M'  is used for end of month.
    #                         'MS' for start of month.
    secs31days = 2678400.0
    return resample_main( dataframe, 'MS', secs31days )


def quarterly( dataframe ):
    '''Resample data to FRED's quarterly start frequency.'''
    #  FRED uses the start of the month to index its monthly data.
    #  Then for quarterly data: 1-01, 4-01, 7-01, 10-01.
    #                            Q1    Q2    Q3     Q4
    #  ________ Start at first of months,
    #  ________ for year ending in indicated month.
    #  'QS-OCT'
    secs93days = 8035200.0
    return resample_main( dataframe, 'QS-OCT', secs93days )
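
For readers without fecon235 installed, the same idea can be sketched with only numpy and pandas. This is an illustration under assumptions, not the project's code: gap_secs() and resample_sketch() are hypothetical names, the seconds conversion divides by np.timedelta64(1, 's') instead of calling tools.div, and .asfreq() is substituted for .fillna(None) on the assumption that both leave the newly created rows as nulls.

```python
import numpy as np
import pandas as pd


def gap_secs(dataframe):
    '''Minimum spacing between index values, in seconds.'''
    delta = np.diff(dataframe.index.values).min()
    return delta / np.timedelta64(1, 's')


def resample_sketch(dataframe, rule, secs):
    '''Downsample by median, or upsample and interpolate linearly.'''
    resampler = dataframe.resample(rule, closed='left', label='left')
    if gap_secs(dataframe) < secs:
        #  Data is finer than the target frequency: DOWNSAMPLE.
        return resampler.median()
    #  Data is coarser than the target frequency: UPSAMPLE,
    #  leaving nulls which are then filled by linear interpolation.
    return resampler.asfreq().interpolate(method='linear')


secs31days = 2678400.0

#  Hypothetical daily data, downsampled to month start:
days = pd.date_range('2016-01-01', periods=60, freq='D')
daily_df = pd.DataFrame({'x': np.arange(60, dtype=float)}, index=days)
down = resample_sketch(daily_df, 'MS', secs31days)

#  Hypothetical quarterly data, upsampled to month start:
quarters = pd.date_range('2016-01-01', periods=3, freq='QS')
quarterly_df = pd.DataFrame({'x': [1.0, 4.0, 7.0]}, index=quarters)
up = resample_sketch(quarterly_df, 'MS', secs31days)

print(down)
print(up)
```

The caller passes the same rule and seconds threshold in both cases; the branch is chosen from the data alone.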

@rsvp
Owner Author

rsvp commented Jun 25, 2018

2018 Addendum

The fecon235 source code was refactored in https://git.io/fecon236

Here's the specific module which fixes the issue:
https://github.com/MathSci/fecon236/blob/master/fecon236/host/fred.py
