ValueError: Values falls after last bin when Resampling using pd.tseries.offsets.Nano as period #12037

Closed
marcelnem opened this issue Jan 14, 2016 · 5 comments
@marcelnem

I have a time series in a DataFrame named dfi, with non-equispaced times as the index:

print dfi.value
print
print "len "+ str(len(dfi))

Output:

datetime
2015-10-01 13:58:10.427   -10.072100
2015-10-01 13:58:11.419   -10.072100
2015-10-01 13:58:12.417   -10.072100
2015-10-01 13:58:13.420   -10.072100
2015-10-01 13:58:14.426   -10.072100
2015-10-01 13:58:15.427   -10.072100
2015-10-01 13:58:16.418   -10.072100
2015-10-01 13:58:17.418    -9.753230
2015-10-01 13:58:18.416    -9.753230
2015-10-01 13:58:19.428    -9.753230
2015-10-01 13:58:20.427    -9.753230
2015-10-01 13:58:21.419    -9.753230
2015-10-01 13:58:22.416    -9.753230
2015-10-01 13:58:23.429    -9.753230
2015-10-01 13:58:24.416    -9.753230
2015-10-01 13:58:25.428    -9.753230
2015-10-01 13:58:26.418    -9.753230
2015-10-01 13:58:27.416    -9.396140
2015-10-01 13:58:28.416    -9.396140
2015-10-01 13:58:29.429    -9.396140
2015-10-01 13:58:32.416    -9.396140
2015-10-01 13:58:33.427    -9.396140
2015-10-01 13:58:34.428    -9.396140
2015-10-01 13:58:35.462    -9.396140
2015-10-01 13:58:36.416    -9.396140
2015-10-01 13:58:37.427    -9.010000
2015-10-01 13:58:38.428    -9.010000
2015-10-01 13:58:39.435    -9.010000
2015-10-01 13:58:40.437    -9.010000
2015-10-01 13:58:41.416    -9.010000
                             ...    
2015-10-03 23:59:28.052    -0.759718
2015-10-03 23:59:29.040    -0.759718
2015-10-03 23:59:30.048    -0.759718
2015-10-03 23:59:31.048    -0.759718
2015-10-03 23:59:32.060    -0.759718
2015-10-03 23:59:33.049    -0.759718
2015-10-03 23:59:34.051    -0.759718
2015-10-03 23:59:35.041    -0.759718
2015-10-03 23:59:36.061    -0.759718
2015-10-03 23:59:37.059    -1.010490
2015-10-03 23:59:38.040    -1.010490
2015-10-03 23:59:39.051    -1.010490
2015-10-03 23:59:40.040    -1.010490
2015-10-03 23:59:41.072    -1.010490
2015-10-03 23:59:42.049    -1.010490
2015-10-03 23:59:43.038    -1.010490
2015-10-03 23:59:44.040    -1.010490
2015-10-03 23:59:45.040    -1.010490
2015-10-03 23:59:48.049    -1.133730
2015-10-03 23:59:49.049    -1.133730
2015-10-03 23:59:50.048    -1.133730
2015-10-03 23:59:52.050    -1.133730
2015-10-03 23:59:53.050    -1.133730
2015-10-03 23:59:54.059    -1.133730
2015-10-03 23:59:55.049    -1.133730
2015-10-03 23:59:56.041    -1.133730
2015-10-03 23:59:59.039    -1.296430
2015-10-04 00:00:00.050    -1.296430
2015-10-04 00:00:01.060    -1.296430
2015-10-04 00:00:02.040    -1.296430
Name: value, dtype: float64

I get an error when running this code:

print period_seconds
period_nanos=int(period_seconds*(10**9))
print period_nanos
res= dfi.value.resample(pd.tseries.offsets.Nano(period_nanos), how=[np.min, np.max,'mean'])

Output + error:

4.035752
4035751999

ValueError                                Traceback (most recent call last)
<ipython-input-14-92e377227823> in <module>()
      5     period_nanos=int(period_seconds*(10**9))
      6     print period_nanos
----> 7     res= dfi.value.resample(pd.tseries.offsets.Nano(period_nanos), how=[np.min, np.max,'mean'])
      8 
      9     nullrows=pd.isnull(res).any(1).nonzero()[0]

C:\Users\USER1\Anaconda2\lib\site-packages\pandas\core\generic.pyc in resample(self, rule, how, axis, fill_method, closed, label, convention, kind, loffset, limit, base)
   3641                               fill_method=fill_method, convention=convention,
   3642                               limit=limit, base=base)
-> 3643         return sampler.resample(self).__finalize__(self)
   3644 
   3645     def first(self, offset):

C:\Users\USER1\Anaconda2\lib\site-packages\pandas\tseries\resample.pyc in resample(self, obj)
     80 
     81         if isinstance(ax, DatetimeIndex):
---> 82             rs = self._resample_timestamps()
     83         elif isinstance(ax, PeriodIndex):
     84             offset = to_offset(self.freq)

C:\Users\USER1\Anaconda2\lib\site-packages\pandas\tseries\resample.pyc in _resample_timestamps(self, kind)
    274         axlabels = self.ax
    275 
--> 276         self._get_binner_for_resample(kind=kind)
    277         grouper = self.grouper
    278         binner = self.binner

C:\Users\USER1\Anaconda2\lib\site-packages\pandas\tseries\resample.pyc in _get_binner_for_resample(self, kind)
    118             kind = self.kind
    119         if kind is None or kind == 'timestamp':
--> 120             self.binner, bins, binlabels = self._get_time_bins(ax)
    121         elif kind == 'timedelta':
    122             self.binner, bins, binlabels = self._get_time_delta_bins(ax)

C:\Users\USER1\Anaconda2\lib\site-packages\pandas\tseries\resample.pyc in _get_time_bins(self, ax)
    179 
    180         # general version, knowing nothing about relative frequencies
--> 181         bins = lib.generate_bins_dt64(ax_values, bin_edges, self.closed, hasnans=ax.hasnans)
    182 
    183         if self.closed == 'right':

pandas\lib.pyx in pandas.lib.generate_bins_dt64 (pandas\lib.c:20875)()

ValueError: Values falls after last bin
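
A side note on the printed values above: `int(period_seconds*(10**9))` truncates toward zero, which is why 4.035752 seconds shows up as 4035751999 nanoseconds rather than 4035752000. This is probably not the cause of the ValueError, but it does change the bin width by one nanosecond. A minimal sketch of the difference, using only the standard library (the variable names `truncated`, `rounded` and `exact` are illustrative and not from the original report):

```python
from decimal import Decimal

period_seconds = 4.035752

# int() truncates toward zero, so a product that lands a hair below
# 4035752000 because of binary floating point loses a whole nanosecond.
truncated = int(period_seconds * 10**9)

# round() picks the nearest integer instead.
rounded = round(period_seconds * 10**9)

# Decimal arithmetic on the literal string avoids the float error entirely.
exact = int(Decimal("4.035752") * 10**9)

print(truncated)  # 4035751999, as in the output above
print(rounded)    # 4035752000
print(exact)      # 4035752000
```

If the period is known as a decimal literal, building the nanosecond count from `Decimal` (or from integers in the first place) sidesteps the rounding question.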

Package versions:

import pip
installed_packages = pip.get_installed_distributions()
installed_packages_list = sorted(["%s==%s" % (i.key, i.version)
     for i in installed_packages])
for i in installed_packages_list:
    print i

Output:

alabaster==0.7.6
anaconda-client==1.2.1
argcomplete==1.0.0
astropy==1.1.1
babel==2.1.1
backports-abc==0.4
backports.ssl-match-hostname==3.4.0.2
beautifulsoup4==4.4.1
bitarray==0.8.1
blaze==0.9.0
bokeh==0.11.0
boto==2.38.0
bottleneck==1.0.0
cdecimal==2.3
cffi==1.2.1
clyent==1.2.0
colorama==0.3.3
comtypes==1.1.2
conda-build==1.18.2
conda-env==2.4.5
conda==3.19.0
configobj==5.0.6
cryptography==0.9.1
cycler==0.9.0
cython==0.23.4
cytoolz==0.7.4
datashape==0.5.0
decorator==4.0.6
docutils==0.12
enum34==1.1.2
et-xmlfile==1.0.1
fastcache==1.0.2
flask==0.10.1
funcsigs==0.4
futures==3.0.3
gevent-websocket==0.9.3
gevent==1.0.1
greenlet==0.4.9
grin==1.2.1
h5py==2.5.0
idna==2.0
ipaddress==1.0.14
ipykernel==4.1.1
ipython-genutils==0.1.0
ipython==4.0.1
ipywidgets==4.1.0
itsdangerous==0.24
jdcal==1.2
jedi==0.9.0
jinja2==2.8
jsonschema==2.4.0
jupyter-client==4.1.1
jupyter-console==4.0.3
jupyter-core==4.0.6
jupyter==1.0.0
llvmlite==0.8.0
lxml==3.5.0
markupsafe==0.23
matplotlib==1.5.1
menuinst==1.3.2
mistune==0.7.1
multipledispatch==0.4.8
nbconvert==4.1.0
nbformat==4.0.1
networkx==1.10
nltk==3.1
nose==1.3.7
notebook==4.1.0
numba==0.22.1
numexpr==2.4.6
numpy==1.10.1
odo==0.4.0
openpyxl==2.3.2
pandas==0.17.1
path.py==0.0.0
patsy==0.4.0
pep8==1.6.2
pickleshare==0.5
pillow==3.0.0
pip==7.1.2
ply==3.8
psutil==3.2.2
py==1.4.30
pyasn1==0.1.9
pycosat==0.6.1
pycparser==2.14
pycrypto==2.6.1
pycurl==7.19.5.3
pyflakes==1.0.0
pygments==2.0.2
pyopenssl==0.15.1
pyparsing==2.0.3
pyreadline==2.1
pytest==2.8.1
python-dateutil==2.4.2
pytz==2015.7
pywin32==219
pyyaml==3.11
pyzmq==15.2.0
qtconsole==4.1.1
requests==2.9.0
rope==0.9.4
scikit-image==0.11.3
scikit-learn==0.17
scipy==0.16.0
setuptools==19.1.1
simplegeneric==0.8.1
singledispatch==3.4.0.3
six==1.10.0
snowballstemmer==1.2.0
sockjs-tornado==1.0.1
sphinx-rtd-theme==0.1.7
sphinx==1.3.1
spyder==2.3.8
sqlalchemy==1.0.11
statsmodels==0.6.1
sympy==0.7.6.1
tables==3.2.2
toolz==0.7.4
tornado==4.3
traitlets==4.0.0
ujson==1.33
unicodecsv==0.14.1
werkzeug==0.11.3
wheel==0.26.0
xlrd==0.9.4
xlsxwriter==0.7.7
xlwings==0.6.1
xlwt==1.0.0
@jreback
Contributor

jreback commented Jan 14, 2016

xref #9119

This does look buggy. Can you post a simpler, easily reproducible example that can be copy-pasted?

@jreback jreback added this to the Next Major Release milestone Jan 14, 2016
@marcelnem
Author

Here is a simpler, reproducible example, run in an IPython notebook with Python 2:

import pandas as pd
import numpy as np

start=1443707890427
end=1443916802040
dif=end-start
length=1000
np.random.seed(seed=16516)
timestamps=np.random.random_integers(0,dif,length);
timestamps =timestamps+start
timestamps = np.sort(timestamps)

datetimes=pd.to_datetime(timestamps,unit="ms")
values = np.random.rand(length)

dt_test=pd.DataFrame(values,columns=["value"],index=datetimes)
print "dt_test.head()"
print dt_test.head()
print

period_seconds=4.035752
print "period_seconds"
print period_seconds
print
period_nanos=int(period_seconds*(10**9))
print "period_nanos"
print period_nanos

res= dt_test.value.resample(pd.tseries.offsets.Nano(period_nanos), how=[np.min, np.max,'mean'])

Output:

dt_test.head()
                            value
2015-10-01 13:59:11.020  0.795006
2015-10-01 13:59:30.583  0.725395
2015-10-01 14:01:31.597  0.731184
2015-10-01 14:06:40.423  0.982237
2015-10-01 14:08:28.432  0.014274

period_seconds
4.035752

period_nanos
4035751999
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-f6df1fcb5427> in <module>()
     27 print period_nanos
     28 
---> 29 res= dt_test.value.resample(pd.tseries.offsets.Nano(period_nanos), how=[np.min, np.max,'mean'])

C:\Users\USER1\Anaconda2\lib\site-packages\pandas\core\generic.pyc in resample(self, rule, how, axis, fill_method, closed, label, convention, kind, loffset, limit, base)
   3641                               fill_method=fill_method, convention=convention,
   3642                               limit=limit, base=base)
-> 3643         return sampler.resample(self).__finalize__(self)
   3644 
   3645     def first(self, offset):

C:\Users\USER1\Anaconda2\lib\site-packages\pandas\tseries\resample.pyc in resample(self, obj)
     80 
     81         if isinstance(ax, DatetimeIndex):
---> 82             rs = self._resample_timestamps()
     83         elif isinstance(ax, PeriodIndex):
     84             offset = to_offset(self.freq)

C:\Users\USER1\Anaconda2\lib\site-packages\pandas\tseries\resample.pyc in _resample_timestamps(self, kind)
    274         axlabels = self.ax
    275 
--> 276         self._get_binner_for_resample(kind=kind)
    277         grouper = self.grouper
    278         binner = self.binner

C:\Users\USER1\Anaconda2\lib\site-packages\pandas\tseries\resample.pyc in _get_binner_for_resample(self, kind)
    118             kind = self.kind
    119         if kind is None or kind == 'timestamp':
--> 120             self.binner, bins, binlabels = self._get_time_bins(ax)
    121         elif kind == 'timedelta':
    122             self.binner, bins, binlabels = self._get_time_delta_bins(ax)

C:\Users\USER1\Anaconda2\lib\site-packages\pandas\tseries\resample.pyc in _get_time_bins(self, ax)
    179 
    180         # general version, knowing nothing about relative frequencies
--> 181         bins = lib.generate_bins_dt64(ax_values, bin_edges, self.closed, hasnans=ax.hasnans)
    182 
    183         if self.closed == 'right':

pandas\lib.pyx in pandas.lib.generate_bins_dt64 (pandas\lib.c:20875)()

ValueError: Values falls after last bin
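
For readers on a current pandas and Python 3, the example above no longer runs as written: the `how=` argument of `resample` and `np.random.random_integers` have since been removed. The following is a rough sketch of an equivalent script, under the assumption that `Resampler.agg` and `numpy.random.default_rng` are available. The random values differ from the original run because a different generator is used, and `round` replaces `int` when computing `period_nanos`.

```python
import numpy as np
import pandas as pd

start = 1443707890427  # milliseconds since the epoch, as in the report
end = 1443916802040
length = 1000

# Assumption: a modern NumPy Generator stands in for random_integers,
# so the sampled timestamps are not the same as in the original run.
rng = np.random.default_rng(16516)
timestamps = np.sort(rng.integers(start, end, size=length, endpoint=True))

datetimes = pd.to_datetime(timestamps, unit="ms")
dt_test = pd.DataFrame({"value": rng.random(length)}, index=datetimes)

period_seconds = 4.035752
period_nanos = round(period_seconds * 10**9)

# resample() now returns a Resampler; the aggregations go through agg().
res = dt_test["value"].resample(pd.tseries.offsets.Nano(period_nanos)).agg(
    ["min", "max", "mean"]
)
print(res.dropna().head())
```

Most bins are empty for this data (1000 points spread over roughly 2.4 days), so `dropna()` is used only to make the printed sample readable.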

@BranYang
Contributor

The issue is caused by lines 164 and 165 in pandas/tseries/resample.py:

binner = labels = DatetimeIndex(freq=self.freq,
                                start=first.replace(tzinfo=None),
                                # replace() truncates to microseconds, dropping the nanoseconds
                                end=last.replace(tzinfo=None),
                                tz=tz,
                                name=ax.name)

Consider this example:

In [1]: import pandas as pd

In [2]: from pandas.tseries.index import DatetimeIndex

In [3]: s_ns = 1443707950041939524

In [4]: itvl = 10**9

In [5]: e_ns = s_ns + itvl

In [6]: s = pd.Timestamp(s_ns).tz_localize(None)

In [7]: e = pd.Timestamp(e_ns).tz_localize(None)

In [8]: e
Out[8]: Timestamp('2015-10-01 13:59:11.041939524')

In [9]: indx = DatetimeIndex(freq=pd.tseries.offsets.Nano(itvl/20),start=s, end=e,tz=None)

In [10]: indx[-1]
Out[10]: Timestamp('2015-10-01 13:59:11.041939524', offset='50000000N')

In [11]: replaced = DatetimeIndex(freq=pd.tseries.offsets.Nano(itvl/20),start=s.replace(tzinfo=None), end=e.replace(tzinfo=None),tz=None)

In [12]: replaced[-1]
Out[12]: Timestamp('2015-10-01 13:59:11.041939', offset='50000000N')

When replace is used, the last bin edge is truncated to microseconds, so the true end timestamp clearly falls outside the bounds.
Should we consider not using replace, given its current behavior (i.e., it throws away the nanosecond information)?
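
The arithmetic behind this can be sketched without pandas, using plain integer nanoseconds. This is only an illustration of the truncation effect, not the pandas implementation; the helpers `truncate_to_microseconds` and `bin_edges` are hypothetical names introduced here.

```python
NS_PER_US = 1000

def truncate_to_microseconds(ns):
    """Drop the sub-microsecond part, mimicking what replace() did to the nanos."""
    return (ns // NS_PER_US) * NS_PER_US

def bin_edges(start_ns, end_ns, step_ns):
    """Evenly spaced edges from start_ns up to and including end_ns."""
    return list(range(start_ns, end_ns + 1, step_ns))

s_ns = 1443707950041939524  # same start as In [3]
itvl = 10**9
e_ns = s_ns + itvl
step = itvl // 20           # 50,000,000 ns, as in Nano(itvl/20)

exact = bin_edges(s_ns, e_ns, step)
truncated = bin_edges(
    truncate_to_microseconds(s_ns), truncate_to_microseconds(e_ns), step
)

print(exact[-1] == e_ns)     # True: the last edge reaches the last value
print(e_ns - truncated[-1])  # 524: the last value lies past the last edge
```

With the truncated edges, a value stamped exactly at `e_ns` sits 524 ns after the final edge, which is the situation the "Values falls after last bin" check rejects.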

@jreback
Contributor

jreback commented Jan 20, 2016

@BranYang hmm, that does look likely.

Timestamp.replace is pretty naive in that it doesn't understand nanoseconds at all, so it is indeed dropping the nanos.

What you need to do is fix that, as I believe this is a symptom of an invalid replace.

Want to take a crack at it? Look at Timestamp in tslib.pyx.
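
A quick way to check whether a given pandas installation still drops the nanoseconds might look like the sketch below. It assumes a recent pandas, where (as far as I know) `Timestamp.replace` keeps the `nanosecond` field; on the 0.17.1 release discussed in this thread the final comparison would be expected to fail. The variable names are illustrative.

```python
import pandas as pd

# An integer passed to Timestamp is interpreted as nanoseconds since the epoch.
ts = pd.Timestamp(1443707951041939524)

# The same call that resample.py made on the first/last bin edges.
replaced = ts.replace(tzinfo=None)

print(ts.nanosecond)        # nanosecond component of the original
print(replaced.nanosecond)  # should match if replace() preserves nanos
print(replaced == ts)
```

A check of this shape could also serve as a regression test for the related issues mentioned in the thread.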

@jreback
Contributor

jreback commented Jan 20, 2016

A couple of other issues might be showing similar symptoms, e.g. #6085 (and those linked from there). If this proves to fix them, we will want to add tests for those as well.

cldy pushed a commit to cldy/pandas that referenced this issue Feb 11, 2016
…ano as period

Closes pandas-dev#12037

Author: Bran Yang <snowolfy@163.com>

Closes pandas-dev#12270 from BranYang/nanosec and squashes the following commits:

bff0c85 [Bran Yang] Add to whatsnew and some comments
fd0b307 [Bran Yang] Fix pandas-dev#12037 Error when Resampling using pd.tseries.offsets.Nano as period