revert to expected behavior when providing ns resolution #911

Closed

Conversation

@ianna (Collaborator) commented Jun 11, 2021

Thanks @drahnreb for clarifying it:

This is somewhat expected behavior when providing ns resolution.

>>> np.datetime64('2019-09-02T09:30:00', 'ns').tolist()
1567416600000000000
>>> np.datetime64("2019-09-02T09:30:00.0000000").tolist()
1567416600000000000

It's a very old and complex (if not weird) behavior.
NumPy still considers datetime experimental, and this might change anytime, but I would prefer to mimic it rather than sacrifice non-vectorized performance.

Originally posted by @drahnreb in #835 (comment)

@jpivarski (Member) left a comment

To answer @drahnreb's question, tolist doesn't have to be vectorized because its job is to convert into Python objects (which I take to be the definition of non-vectorized). However, at least the ak.nplike.of(array) could have been outside the loop.

Are we required to reproduce NumPy's tolist behavior? That's a question I've never thought about. It seems to me that tolist ought to make datetime objects, since it's generally about turning array data into their corresponding Python objects.

Are all datetimes turned into integers, or just the ns resolution ones? What about timedelta objects, which are more number-like than datetimes (in the sense that they have a meaningful zero)?

For that matter, what does pyarrow's to_pyobj do? (Maybe they chose a new name to avoid being held to NumPy's historic behavior.)

I'm inclined to reproduce NumPy's behavior, strange as it is, but could we check it first? For both datetime and timedelta, and for different resolutions?
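
For reference, a minimal sketch of such a check in plain NumPy (the fuller survey appears in the comments below):

>>> np.array([1], "M8[D]").tolist()
[datetime.date(1970, 1, 2)]
>>> np.array([1], "M8[us]").tolist()
[datetime.datetime(1970, 1, 1, 0, 0, 0, 1)]
>>> np.array([1], "M8[ns]").tolist()
[1]
>>> np.array([1], "m8[us]").tolist()
[datetime.timedelta(microseconds=1)]
>>> np.array([1], "m8[ns]").tolist()
[1]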

@ianna (Collaborator, Author) commented Jun 11, 2021

yes, it looks like only ak.layout.NumpyArray with M8[ns] format is affected:

>>> array = ak.layout.NumpyArray(df_ns)
>>> array
<NumpyArray format="M8[ns]" shape="3 1" strides="8 0" data="0x 00f00cda 1795c015 00f07118 7af5c315 00807ca2 7f8cc615" at="0x7ffb72d685e0"/>
>>> array[0]
<NumpyArray format="M8[ns]" shape="1" strides="0" data="2019-09-02T09:30:00" at="0x7ffb72d685e0"/>
>>> ak.to_list(array)
[[1567416600000000000], [1568367000000000000], [1569096000000000000]]
>>> ak.to_list(array[0])
[1567416600000000000]

PyArrow converts a NumPy datetime64 array to a TimestampArray:

>>> numpy_array = np.array(["2020-07-27T10:41:11", "2019-01-01", "2020-01-01"], "datetime64[ns]")
>>> numpy_array
array(['2020-07-27T10:41:11.000000000', '2019-01-01T00:00:00.000000000',
       '2020-01-01T00:00:00.000000000'], dtype='datetime64[ns]')
>>> nparr = pyarrow.array(numpy_array)
>>> nparr
<pyarrow.lib.TimestampArray object at 0x7ffb68a6d820>
[
  2020-07-27 10:41:11.000000000,
  2019-01-01 00:00:00.000000000,
  2020-01-01 00:00:00.000000000
]

@drahnreb (Contributor) commented Jun 12, 2021

Are we required to reproduce NumPy's tolist behavior?

I would say no. When it comes to timestamp handling, everyone does it differently, and even inconsistently within the same package. With my latest comment I opted for the most coherent solution (one that introduces a new way, but for the good...).
Whether or not the result is a list of Python objects, I think you should always get the same datatype (be it np.datetime64 or int, etc.) no matter what the unit is.

Are all datetimes turned into integers, or just the ns resolution ones?

Like @ianna described, this is just the case for ns resolution. datetime.datetime does not support ns (or anything finer).

A comment in numpy/core/src/multiarray/_datetime.h:

    /*
     * Converts a datetime into a PyObject *.
     *
     * For days or coarser, returns a datetime.date.
     * For microseconds or coarser, returns a datetime.datetime.
     * For units finer than microseconds, returns an integer.
     */
    NPY_NO_EXPORT PyObject *
    convert_datetime_to_pyobject(npy_datetime dt, PyArray_DatetimeMetaData *meta);

The same applies for timedelta.

What about timedelta objects, which are more number-like than datetimes (in the sense that they have a meaningful zero)?

Similar, just that timedelta64[Y] and timedelta64[M] are converted into int too, like anything with finer precision than [us]. (I think the first might be a NumPy bug.)
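
A quick illustration of these rules at the scalar level, using .item() (which is what tolist applies element by element):

>>> np.datetime64(1, "D").item()
datetime.date(1970, 1, 2)
>>> np.datetime64(1, "us").item()
datetime.datetime(1970, 1, 1, 0, 0, 0, 1)
>>> np.datetime64(1, "ns").item()
1
>>> np.timedelta64(1, "Y").item()
1
>>> np.timedelta64(1, "s").item()
datetime.timedelta(seconds=1)
>>> np.timedelta64(1, "ns").item()
1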

what does pyarrow's to_pyobj do?

Of the four precisions that pyarrow supports ("s", "ms", "us", "ns"), to_pylist converts to datetime.datetime objects when possible, otherwise to pa.Timestamp.
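
For example, a minimal sketch of the "us" case (not the full matrix):

>>> import datetime, pyarrow as pa
>>> pa.array([datetime.datetime(2002, 1, 23)], type=pa.timestamp("us")).to_pylist()
[datetime.datetime(2002, 1, 23, 0, 0)]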

I'm inclined to reproduce NumPy's behavior, strange as it is, but could we check it first? For both datetime and timedelta, and for different resolutions?

I compiled a quick snippet of conversions for various combinations (it could be partially included in tests, if necessary and if a behavior is reproduced).

Code
    import numpy as np
    # general datetime units; neglecting modulo event units e.g. [60s]//10
    np_base_units = ("Y", "M", "W", "D", "h", "m", "s", "ms", "us", "ns", "ps", "fs", "as")

    np_dts = {bu: np.datetime64('1970', bu) for bu in np_base_units}  #-01-01T01:01:01
    np_tds = {bu: np.timedelta64(1, bu) for bu in np_base_units}

    # NaT
    np_special_units = ('NaT', "", None)
    np_dt_spec = {u: np.datetime64(u) for u in np_special_units}
    np_td_spec = {u: np.timedelta64(u) for u in np_special_units}
    
    py_scalar_from_np_dt = {unit: timeobj.item() for unit, timeobj in np_dts.items()}
    py_scalar_from_np_td = {unit: timeobj.item() for unit, timeobj in np_tds.items()}

    # tolist converts to the "nearest compatible builtin python type" via .item()
    dt_arrays = {unit: np.asarray(timeobj) for unit, timeobj in np_dts.items()}
    dt_arrays['mixed'] = np.asarray(list(py_scalar_from_np_dt.values()))
    td_arrays = {unit: np.asarray(timeobj) for unit, timeobj in np_tds.items()}
    td_arrays['mixed'] = np.asarray(list(py_scalar_from_np_td.values()))

    py_list_from_np_dt = {unit: timearr.tolist() for unit, timearr in dt_arrays.items()}
    py_list_from_np_td = {unit: timearr.tolist() for unit, timearr in td_arrays.items()}


    import pyarrow as pa
    pa_base_units = ("s", "ms", "us", "ns")

    pa_t32s = [pa.time32(u) for u in pa_base_units[:2]]
    pa_t64s = [pa.time64(u) for u in pa_base_units[2:]]
    pa_ts = [pa.timestamp(u) for u in pa_base_units]
    
    import datetime
    np_from_pa_ts = {pa.array(
            [datetime.datetime(2002, 1, 23), datetime.datetime(2019, 2, 20)],
            type=t,
        ).to_numpy(False).dtype: pa.array(
            [datetime.datetime(2002, 1, 23), datetime.datetime(2019, 2, 20)],
            type=t,
        ).to_numpy(False) for t in pa_ts}
    np_from_pa_opt = {pa.array(
            [None, datetime.datetime(2002, 1, 23)],
            type=t,
        ).to_numpy(False).dtype: pa.array(
            [None, datetime.datetime(2002, 1, 23)],
            type=t,
        ).to_numpy(False) for t in pa_ts}

    def try_numpy_roundtrip(pt):
        try: return pa.array([pt]).to_numpy(False)
        except BaseException as e: return str(e)
    pa_dts = {repr(pt): try_numpy_roundtrip(pt) for pt in np_dts.values()}
    pa_tds = {repr(pt): try_numpy_roundtrip(pt) for pt in np_tds.values()}

    def try_to_py_obj(pt):
        try: return pa.array([pt]).to_pylist()
        except BaseException as e: return str(e)
    py_list_from_pa_dts = {repr(pt): try_to_py_obj(pt) for pt in np_dts.values()}
    py_list_from_pa_tds = {repr(pt): try_to_py_obj(pt) for pt in np_tds.values()}

    # dtype conversion from numpy
    def try_from_numpy(np_type):
        try: return pa.from_numpy_dtype(np_type)
        except BaseException as e: return str(e)
    
    pa_conv_dts = {unit + f'_{repr(np_dt)}': try_from_numpy(np_dt) for unit, np_dt in np_dts.items()}
    pa_conv_dt_special = {str(unit) + f'_{repr(np_dt)}': try_from_numpy(np_dt) for unit, np_dt in np_dt_spec.items()}

    pa_conv_tds = {unit + f'_{repr(np_td)}': try_from_numpy(np_td) for unit, np_td in np_tds.items()}
    pa_conv_td_special = {str(unit) + f'_{repr(np_td)}': try_from_numpy(np_td) for unit, np_td in np_td_spec.items()}

    import pandas as pd
    pd_ts = pd.DataFrame.from_dict([np_dts], orient='columns')
    # ignored:  pd_ts = pd.DataFrame.from_dict([np_dts['m']], orient='columns').astype('datetime64[M]')
    pd_ts_dtypes = pd_ts.dtypes
    pa_table = pa.Table.from_pandas(pd_ts)
    table_schema = pa_table.schema
    py_dict_from_pa_table = pa_table.to_pydict()

    ... # ellipsis for debug breakpoint

It's quite a mess... Honestly, as long as int conversions work and data can be loaded, I'd say most applications are feasible (in the Python space, that is, besides the arithmetic with datetime data that this great PR enabled in the Awkward space!).

@jpivarski (Member) commented

@drahnreb, I see that #367 (comment) describes two different options for what tolist could do:

  • return datetime.datetime objects, but there are resolutions it doesn't handle
  • return np.datetime64 objects (scalars), but Nones from option-type (or worse from union-type) would make these arrays read back into NumPy as dtype="O".

As you say above, it should consistently return only one type, and I agree wholeheartedly. The lack of some resolutions in datetime.datetime, I believe, rules that option out. I think it's fine to always return np.datetime64 objects from tolist, even though there are cases that, if read back into NumPy, would come back as dtype="O". You could say the same thing about

>>> np.array(ak.Array([1, 2, 3, None, None, 6]).tolist())
array([1, 2, 3, None, None, 6], dtype=object)

So I think having tolist always return np.datetime64 from an ak.Array of np.datetime64 (or np.timedelta64 from an ak.Array of np.timedelta64) is best. I see no reason to be consistent with this:

>>> np.array([1, 2, 3], "M8[us]").tolist()
[datetime.datetime(1970, 1, 1, 0, 0, 0, 1), datetime.datetime(1970, 1, 1, 0, 0, 0, 2), datetime.datetime(1970, 1, 1, 0, 0, 0, 3)]
>>> np.array([1, 2, 3], "M8[ns]").tolist()
[1, 2, 3]
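
For comparison, plain iteration over a NumPy datetime64 array (as opposed to tolist) already yields np.datetime64 scalars for every unit, which is the kind of uniformity proposed here:

>>> list(np.array([1, 2], "M8[us]"))
[numpy.datetime64('1970-01-01T00:00:00.000001'), numpy.datetime64('1970-01-01T00:00:00.000002')]
>>> list(np.array([1, 2], "M8[ns]"))
[numpy.datetime64('1970-01-01T00:00:00.000000001'), numpy.datetime64('1970-01-01T00:00:00.000000002')]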

(Also, unlike other NumPy functions, we haven't been attempting to make tolist behave like NumPy's tolist: consider, for instance, NumPy masked arrays, which return a masked object where we would return None. Maybe we should have followed pyarrow's to_pyobj from the beginning and matched all the types there, and we still can, since we haven't used the name "to_pyobj" yet.)

So let's just return np.datetime64 (and np.timedelta64) objects from tolist uniformly. Does that work for you, @drahnreb?

@jpivarski (Member) commented

@ianna and I agreed to close this: the desired behavior is in main.

@jpivarski closed this Jun 15, 2021
@drahnreb (Contributor) commented

So let's just return np.datetime64 (and np.timedelta64) objects from tolist uniformly. Does that work for you, @drahnreb?

Yes, that works, @jpivarski. 👍🏼

Maybe we should have followed pyarrow's to_pyobj from the beginning and matched all the types there, and we still can, since we haven't used the name "to_pyobj" yet.

I assume you mean pyarrow.NumericArray.to_pylist (and to_pydict on Tables), which acts on the scalar level via .as_py().
If to_pyobj gets implemented, I'd maybe suggest coercing to int (and to None instead of the generic time types, e.g. for NaT) as pure Python objects; a rough sketch follows the output below.
That would drop the unit information (one would usually know the uniform array's unit) but enforce consistent behavior of the method, independent of the underlying unit (given uniformity along the axis).

Simply to prevent this (lost information, or sloppy behavior at day unit):

    import numpy as np
    import pyarrow as pa

    supported_py_datetime_units = ("D", "M", "h", "m", "s", "us")
    _np_times = (np.datetime64, np.timedelta64)
    for unit in supported_py_datetime_units:
        for _method in _np_times:
            arr = np.array([_method(1, unit)])
            print(repr(arr))
            try:
                parr = pa.array(arr)
                pa_type = parr.type
                py_type = type(parr.to_pylist()[0])
            except pa.lib.ArrowNotImplementedError as e:
                pa_type = f'{e}: {unit}'
                py_type = '-'

            print('  -> pyarrow type: ', pa_type)
            print('  -> python type: ', py_type)

        print('\n')

array(['1970-01-02'], dtype='datetime64[D]')
  -> pyarrow type:  date32[day]
  -> python type:  <class 'datetime.date'>
array([1], dtype='timedelta64[D]')
  -> pyarrow type:  Unsupported timedelta64 time unit: D
  -> python type:  -


array(['1970-02'], dtype='datetime64[M]')
  -> pyarrow type:  Unsupported datetime64 time unit: M
  -> python type:  -
array([1], dtype='timedelta64[M]')
  -> pyarrow type:  Unsupported timedelta64 time unit: M
  -> python type:  -


array(['1970-01-01T01'], dtype='datetime64[h]')
  -> pyarrow type:  Unsupported datetime64 time unit: h
  -> python type:  -
array([1], dtype='timedelta64[h]')
  -> pyarrow type:  Unsupported timedelta64 time unit: h
  -> python type:  -


array(['1970-01-01T00:01'], dtype='datetime64[m]')
  -> pyarrow type:  Unsupported datetime64 time unit: m
  -> python type:  -
array([1], dtype='timedelta64[m]')
  -> pyarrow type:  Unsupported timedelta64 time unit: m
  -> python type:  -


array(['1970-01-01T00:00:01'], dtype='datetime64[s]')
  -> pyarrow type:  timestamp[s]
  -> python type:  <class 'datetime.datetime'>
array([1], dtype='timedelta64[s]')
  -> pyarrow type:  duration[s]
  -> python type:  <class 'datetime.timedelta'>


array(['1970-01-01T00:00:00.000001'], dtype='datetime64[us]')
  -> pyarrow type:  timestamp[us]
  -> python type:  <class 'datetime.datetime'>
array([1], dtype='timedelta64[us]')
  -> pyarrow type:  duration[us]
  -> python type:  <class 'datetime.timedelta'>
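
A minimal sketch of the int-plus-None coercion suggested above (the function name is hypothetical, for illustration only):

    import numpy as np

    def times_to_py_ints(arr):
        # reinterpret the underlying 64-bit counts of a datetime64/timedelta64 array
        ints = arr.view("int64")
        # NaT entries become None, everything else a plain Python int
        return [None if nat else int(value) for value, nat in zip(ints, np.isnat(arr))]

    # >>> times_to_py_ints(np.array(["2019-09-02T09:30:00", "NaT"], "datetime64[ns]"))
    # [1567416600000000000, None]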

Generally, working with time data gets non-trivial quickly, and in this case the stdlib datetime.datetime is a limiting factor. Requirements for time data are very use-case specific, but I would say: a) arithmetic is not done in the Python space, b) scientific arithmetic tends to require precision beyond the provided datetime units, c) datetime.datetime is left with the task of "formatting", and d) you can work your way around it with a much simpler type or convert back.
As such, I currently do not see why a list of maybe richer, but not guaranteed, datetime.datetime objects would ever be preferred (maybe for pickling)?

@jpivarski (Member) commented

Thanks @drahnreb, I did mean to_pylist and to_pydict. (I didn't look up and check the name before writing it.)

So Arrow returns datetime.datetime objects, too... sometimes. Well, we're satisfying criteria (a) and (b) at least.

I had another thought about this, which is that np.datetime64/np.timedelta64 can easily be converted into datetime.datetime/datetime.timedelta:

>>> numpy.datetime64(1, "s").item()
datetime.datetime(1970, 1, 1, 0, 0, 1)
>>> numpy.timedelta64(1, "s").item()
datetime.timedelta(seconds=1)

but the reverse isn't true. We strictly depend on NumPy, so it's definitely available, and that makes returning np.datetime64/np.timedelta64 more powerful than returning datetime.datetime/datetime.timedelta. If you have a list of NumPy datetimes, you can write a very easy list comprehension to turn them into standard library datetimes, but it's not as easy the other way around. np.datetime64(stdlib_datetime) is not much harder than np_datetime.item(), but dealing with inconsistent types is.
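
For instance, a small sketch of those conversions in both directions:

>>> np_datetimes = [numpy.datetime64(1, "s"), numpy.datetime64(2, "s")]
>>> [dt.item() for dt in np_datetimes]            # NumPy -> stdlib
[datetime.datetime(1970, 1, 1, 0, 0, 1), datetime.datetime(1970, 1, 1, 0, 0, 2)]
>>> [numpy.datetime64(dt) for dt in _]            # stdlib -> NumPy
[numpy.datetime64('1970-01-01T00:00:01.000000'), numpy.datetime64('1970-01-01T00:00:02.000000')]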

So I think we all agree that Awkward Array's tolist should continue returning NumPy np.datetime64/np.timedelta64, as it is in main.

@jpivarski deleted the ianna/datetime-bugfix branch June 15, 2021 19:38