
BUG: Faulty DatetimeIndex union #61159

Closed
3 tasks done
SamRWest opened this issue Mar 21, 2025 · 3 comments
Labels
Bug · Non-Nano (datetime64/timedelta64 with non-nanosecond resolution) · setops (union, intersection, difference, symmetric_difference)

Comments


SamRWest commented Mar 21, 2025

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from pandas import DatetimeIndex, RangeIndex


def test_pandas_datetime_index_union():
    """
    Demonstrates a suspected bug in pandas 2.2.3, where unions of DatetimeIndexes
    (and therefore pd.concat of DataFrames indexed by them) return unexpected values.

    My actual use case (concatenating two DataFrames with these DatetimeIndexes,
    from which I extracted these date ranges) works in pandas 1.5.3 but not in 2.2.3.

    Interestingly, this passes in both versions if you change the dtype to datetime64[ns].
    """
    dti1 = DatetimeIndex(
        ['2021-10-05 17:30:00',
         '2021-10-05 18:00:00',
         '2021-10-05 18:30:00',
         '2021-10-05 19:00:00',
         '2021-10-05 19:30:00'],
        dtype='datetime64[us]', name='DATETIME', freq='30min'
    )
    dti2 = DatetimeIndex(
        ['2021-10-05 17:30:00',
         '2021-10-05 18:00:00',
         '2021-10-05 18:30:00',
         '2021-10-05 19:00:00',
         '2021-10-05 19:30:00',
         '2021-10-05 20:00:00'],  # <-- extra datetime
        dtype='datetime64[us]', name='DATETIME', freq='30min'
    )

    union = set(dti1.union(dti2))
    expected = set(dti1) | set(dti2)
    print(f"{union=}")
    print(f"{expected=}")

    assert len(union) == len(expected), "union should contain every timestamp from both indexes"


def test_range_index_equality():
    """
    This (presumably) faulty equality check appears to be the root cause of the
    DatetimeIndex union bug above. Note that the two stop values are different,
    so the RangeIndexes should not be equal.
    Interestingly, this fails in both pandas 1.5.3 and 2.2.3.
    """
    a = RangeIndex(start=1633455000000000, stop=1635262200000000, step=1800000000000)
    b = RangeIndex(start=1633455000000000, stop=1635264000000000, step=1800000000000)
    assert not a.equals(b)

Issue Description

The tests above (details in the docstrings) demonstrate the issue and what I think is the root cause: we get what appears to be an incorrect result when taking the union of two DatetimeIndexes with different ranges.

I traced this as far as the RangeIndex equality check in the second test, which appears to be faulty: it returns True even though the two stop values differ.
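For illustration, here is a minimal sketch (using the endpoint values from the second test) of the arithmetic behind that equality check: with the 30-minute step expressed in nanoseconds but the endpoints in microseconds, both ranges contain exactly the same two elements, so an element-wise comparison reports them as equal. Reinterpreted as microseconds, the spurious second element is the Timestamp('2021-10-26 13:30:00') that shows up in the actual output below.

from pandas import Timestamp

# Endpoints are i8 values in MICROseconds, but the step was converted to NANOseconds.
start = 1633455000000000   # 2021-10-05 17:30:00 in us
stop_a = 1635262200000000  # last value of dti1 plus the ns-sized step
stop_b = 1635264000000000  # last value of dti2 plus the ns-sized step
step_ns = 1800000000000    # 30 minutes in ns, 1000x too large for us-resolution data

a = list(range(start, stop_a, step_ns))
b = list(range(start, stop_b, step_ns))
assert a == b == [1633455000000000, 1635255000000000]  # identical elements

print(Timestamp(a[1], unit="us"))  # 2021-10-26 13:30:00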

Expected Behavior

Output from the first test should be (as in pandas 1.5.3):

union={Timestamp('2021-10-05 18:30:00'), Timestamp('2021-10-05 19:00:00'), Timestamp('2021-10-05 17:30:00'), Timestamp('2021-10-05 20:00:00'), Timestamp('2021-10-05 19:30:00'), Timestamp('2021-10-05 18:00:00')}

expected={Timestamp('2021-10-05 18:30:00'), Timestamp('2021-10-05 19:00:00'), Timestamp('2021-10-05 17:30:00'), Timestamp('2021-10-05 20:00:00'), Timestamp('2021-10-05 19:30:00'), Timestamp('2021-10-05 18:00:00')}

But the actual output in pandas 2.2.3 is (incorrectly):

union={Timestamp('2021-10-05 17:30:00'), Timestamp('2021-10-26 13:30:00')}

expected={Timestamp('2021-10-05 19:30:00'), Timestamp('2021-10-05 20:00:00'), Timestamp('2021-10-05 17:30:00'), Timestamp('2021-10-05 19:00:00'), Timestamp('2021-10-05 18:00:00'), Timestamp('2021-10-05 18:30:00')}

Installed Versions

INSTALLED VERSIONS
------------------
commit                 : 0691c5c
python                 : 3.12.6
python-bits            : 64
OS                     : Windows
OS-release             : 10
Version                : 10.0.19045
machine                : AMD64
processor              : Intel64 Family 6 Model 183 Stepping 1, GenuineIntel
byteorder              : little
LC_ALL                 : None
LANG                   : None
LOCALE                 : English_Australia.1252
pandas                 : 2.2.3
numpy                  : 2.2.4
pytz                   : 2025.1
dateutil               : 2.9.0.post0
pip                    : 25.0.1
Cython                 : None
sphinx                 : 4.5.0
IPython                : None
adbc-driver-postgresql : None
adbc-driver-sqlite     : None
bs4                    : 4.13.3
blosc                  : None
bottleneck             : None
dataframe-api-compat   : None
fastparquet            : None
fsspec                 : 2025.3.0
html5lib               : None
hypothesis             : None
gcsfs                  : None
jinja2                 : 3.1.6
lxml.etree             : None
matplotlib             : 3.10.1
numba                  : None
numexpr                : None
odfpy                  : None
openpyxl               : 3.1.5
pandas_gbq             : None
psycopg2               : 2.9.10
pymysql                : None
pyarrow                : 19.0.1
pyreadstat             : None
pytest                 : None
python-calamine        : None
pyxlsb                 : None
s3fs                   : 2025.3.0
scipy                  : 1.15.2
sqlalchemy             : 2.0.39
tables                 : None
tabulate               : None
xarray                 : None
xlrd                   : None
xlsxwriter             : None
zstandard              : None
tzdata                 : 2025.1
qtpy                   : None
pyqt5                  : None
SamRWest added the Bug and Needs Triage labels on Mar 21, 2025
SamRWest (Author) commented

Update: I think pandas.core.indexes.datetimelike._as_range_index() is the real problem here: it always converts the freq to nanoseconds, but passes the first and last timestamps' _values through verbatim.

So if the timestamps are in us, we end up with a range whose endpoints are in us but whose step is in ns. In my case the step is so large that the two ranges end up equivalent (a single oversized step jumps past the end of the range).

So I think _as_range_index() needs to be made aware of the time unit somehow, and not always assume ns precision. The two-line change below worked for me, but there is probably a more robust way to do this check:

In pandas.core.indexes.datetimelike._as_range_index()

    def _as_range_index(self) -> RangeIndex:
        # Convert our i8 representations to RangeIndex
        # Caller is responsible for checking isinstance(self.freq, Tick)
        freq = cast(Tick, self.freq)
        time_unit = 'ns' if self._values._ndarray.dtype.name == 'datetime64[ns]' else 'us'  # check whether the values are ns or us
        tick = Timedelta(freq).as_unit(time_unit)._value
        rng = range(self[0]._value, self[-1]._value + tick, tick)
        return RangeIndex(rng)
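
As a sketch of the effect of that change: with the index stored at microsecond resolution, the tick used to build the range should be 30 minutes in us (1_800_000_000), not 30 minutes in ns (1_800_000_000_000). Until a fixed release is available, one possible workaround on 2.2.3, consistent with the note in the first test's docstring that datetime64[ns] passes, is to cast both indexes to nanosecond resolution before taking the union (this is just a sketch, not an official fix):

import pandas as pd

# The tick in the index's own unit vs. the hard-coded ns assumption.
assert int(pd.Timedelta('30min') / pd.Timedelta('1us')) == 1_800_000_000
assert int(pd.Timedelta('30min') / pd.Timedelta('1ns')) == 1_800_000_000_000

# Possible workaround on 2.2.3: cast to ns before the union.
dti1 = pd.DatetimeIndex(
    ['2021-10-05 17:30:00', '2021-10-05 18:00:00', '2021-10-05 18:30:00',
     '2021-10-05 19:00:00', '2021-10-05 19:30:00'],
    dtype='datetime64[us]', name='DATETIME', freq='30min')
dti2 = pd.DatetimeIndex(
    ['2021-10-05 17:30:00', '2021-10-05 18:00:00', '2021-10-05 18:30:00',
     '2021-10-05 19:00:00', '2021-10-05 19:30:00', '2021-10-05 20:00:00'],
    dtype='datetime64[us]', name='DATETIME', freq='30min')

union = dti1.as_unit('ns').union(dti2.as_unit('ns'))
assert len(union) == 6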

rhshadrach (Member) commented

Thanks for the report. The proposed fix is essentially on main, and I get the expected output there. Closing.

rhshadrach added the setops and Non-Nano labels and removed the Needs Triage label on Mar 22, 2025
SamRWest (Author) commented

Ah so it is, sorry I missed that. Thanks for checking!
