revert to expected behavior when providing ns resolution #911
Conversation
To answer @drahnreb's question: `tolist` doesn't have to be vectorized, because its job is to convert into Python objects (which I take to be the definition of non-vectorized). However, at least the `ak.nplike.of(array)` call could have been outside the loop.
Are we required to reproduce NumPy's `tolist` behavior? That's a question I've never thought about. It seems to me that `tolist` ought to make datetime objects, since it's generally about turning array data into their corresponding Python objects.
Are all datetimes turned into integers, or just the ns resolution ones? What about timedelta objects, which are more number-like than datetimes (in the sense that they have a meaningful zero)?
For that matter, what does pyarrow's `to_pyobj` do? (Maybe they chose a new name to avoid being held to NumPy's historic behavior.)
I'm inclined to reproduce NumPy's behavior, strange as it is, but could we check it first? For both datetime and timedelta, and for different resolutions?
Yes, it looks like only the `ns` resolution ones:

```python
>>> array = ak.layout.NumpyArray(df_ns)
>>> array
<NumpyArray format="M8[ns]" shape="3 1" strides="8 0" data="0x 00f00cda 1795c015 00f07118 7af5c315 00807ca2 7f8cc615" at="0x7ffb72d685e0"/>
>>> array[0]
<NumpyArray format="M8[ns]" shape="1" strides="0" data="2019-09-02T09:30:00" at="0x7ffb72d685e0"/>
>>> ak.to_list(array)
[[1567416600000000000], [1568367000000000000], [1569096000000000000]]
>>> ak.to_list(array[0])
[1567416600000000000]
>>> numpy_array = np.array(["2020-07-27T10:41:11", "2019-01-01", "2020-01-01"], "datetime64[ns]")
>>> numpy_array
array(['2020-07-27T10:41:11.000000000', '2019-01-01T00:00:00.000000000',
       '2020-01-01T00:00:00.000000000'], dtype='datetime64[ns]')
>>> nparr = pyarrow.array(numpy_array)
>>> nparr
<pyarrow.lib.TimestampArray object at 0x7ffb68a6d820>
[
  2020-07-27 10:41:11.000000000,
  2019-01-01 00:00:00.000000000,
  2020-01-01 00:00:00.000000000
]
```
I would say no. When it comes to timestamp handling, everyone does it differently, and even inconsistently within the same package. With my latest comment I opted for the most coherent solution (it introduces a new way, but for the good...).
Like @ianna described, this is just the case documented in the comment in numpy/core/src/multiarray/_datetime.h:

```c
/*
 * Converts a datetime into a PyObject *.
 *
 * For days or coarser, returns a datetime.date.
 * For microseconds or coarser, returns a datetime.datetime.
 * For units finer than microseconds, returns an integer.
 */
NPY_NO_EXPORT PyObject *
convert_datetime_to_pyobject(npy_datetime dt, PyArray_DatetimeMetaData *meta);
```

Link. And the same applies for timedelta.
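That documented rule is easy to verify from Python; a quick check (my sketch, assuming nothing beyond NumPy itself):

```python
import datetime

import numpy as np

# days or coarser -> datetime.date
day = np.array([0], dtype="M8[D]").tolist()[0]
# microseconds or coarser -> datetime.datetime
micro = np.array([0], dtype="M8[us]").tolist()[0]
# finer than microseconds -> plain integer
nano = np.array([0], dtype="M8[ns]").tolist()[0]

assert type(day) is datetime.date
assert isinstance(micro, datetime.datetime)
assert isinstance(nano, int)
```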
Similar, just that
Of the four precision that
I compiled a quick snippet of conversions for various combinations (it could be partially included in tests, if necessary and a behavior is reproduced).

```python
import numpy as np

# general datetime units; neglecting modulo event units e.g. [60s]//10
np_base_units = ("Y", "M", "W", "D", "h", "m", "s", "ms", "us", "ns", "ps", "fs", "as")
np_dts = {bu: np.datetime64('1970', bu) for bu in np_base_units}  # -01-01T01:01:01
np_tds = {bu: np.timedelta64(1, bu) for bu in np_base_units}

# NaT
np_special_units = ('NaT', "", None)
np_dt_spec = {u: np.datetime64(u) for u in np_special_units}
np_td_spec = {u: np.timedelta64(u) for u in np_special_units}

py_scalar_from_np_dt = {unit: timeobj.item() for unit, timeobj in np_dts.items()}
py_scalar_from_np_td = {unit: timeobj.item() for unit, timeobj in np_tds.items()}

# tolist converts to the "nearest compatible builtin python type" via .item()
dt_arrays = {unit: np.asarray(timeobj) for unit, timeobj in np_dts.items()}
dt_arrays['mixed'] = np.asarray(list(py_scalar_from_np_dt.values()))
td_arrays = {unit: np.asarray(timeobj) for unit, timeobj in np_tds.items()}
td_arrays['mixed'] = np.asarray(list(py_scalar_from_np_td.values()))
py_list_from_np_dt = {unit: timearr.tolist() for unit, timearr in dt_arrays.items()}
py_list_from_np_td = {unit: timearr.tolist() for unit, timearr in td_arrays.items()}

import pyarrow as pa

pa_base_units = ("s", "ms", "us", "ns")
pa_t32s = [pa.time32(u) for u in pa_base_units[:2]]
pa_t64s = [pa.time64(u) for u in pa_base_units[2:]]
pa_ts = [pa.timestamp(u) for u in pa_base_units]

import datetime

np_from_pa_ts = {
    pa.array(
        [datetime.datetime(2002, 1, 23), datetime.datetime(2019, 2, 20)],
        type=t,
    ).to_numpy(False).dtype: pa.array(
        [datetime.datetime(2002, 1, 23), datetime.datetime(2019, 2, 20)],
        type=t,
    ).to_numpy(False)
    for t in pa_ts
}
np_from_pa_opt = {
    pa.array(
        [None, datetime.datetime(2002, 1, 23)],
        type=t,
    ).to_numpy(False).dtype: pa.array(
        [None, datetime.datetime(2002, 1, 23)],
        type=t,
    ).to_numpy(False)
    for t in pa_ts
}

def try_numpy_roundtrip(pt):
    try:
        return pa.array([pt]).to_numpy(False)
    except BaseException as e:
        return str(e)

pa_dts = {repr(pt): try_numpy_roundtrip(pt) for pt in np_dts.values()}
pa_tds = {repr(pt): try_numpy_roundtrip(pt) for pt in np_tds.values()}

def try_to_py_obj(pt):
    try:
        return pa.array([pt]).to_pylist()
    except BaseException as e:
        return str(e)

py_list_from_pa_dts = {repr(pt): try_to_py_obj(pt) for pt in np_dts.values()}
py_list_from_pa_tds = {repr(pt): try_to_py_obj(pt) for pt in np_tds.values()}

# dtype conversion from numpy
def try_from_numpy(np_type):
    try:
        return pa.from_numpy_dtype(np_type)
    except BaseException as e:
        return str(e)

pa_conv_dts = {unit + f'_{repr(np_dt)}': try_from_numpy(np_dt) for unit, np_dt in np_dts.items()}
pa_conv_dt_special = {str(unit) + f'_{repr(np_dt)}': try_from_numpy(np_dt) for unit, np_dt in np_dt_spec.items()}
pa_conv_tds = {unit + f'_{repr(np_td)}': try_from_numpy(np_td) for unit, np_td in np_tds.items()}
pa_conv_td_special = {str(unit) + f'_{repr(np_td)}': try_from_numpy(np_td) for unit, np_td in np_td_spec.items()}

import pandas as pd

pd_ts = pd.DataFrame.from_dict([np_dts], orient='columns')
# ignored: pd_ts = pd.DataFrame.from_dict([np_dts['m']], orient='columns').astype('datetime64[M]')
pd_ts_dtypes = pd_ts.dtypes
pa_table = pa.Table.from_pandas(pd_ts)
table_schema = pa_table.schema
py_dict_from_pa_table = pa_table.to_pydict()
...  # ellipsis for debug breakpoint
```

It's quite a mess... Honestly, as long as int conversions work and data can be loaded, I'd say most applications are feasible.
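A side note on that closing point (my sketch, not part of the original snippet): plain int conversions are good enough for many applications because `datetime64[ns]` data round-trips losslessly through `int64` — both are 8-byte integers under the hood.

```python
import numpy as np

stamps = np.array(["2020-07-27T10:41:11", "2019-01-01"], dtype="M8[ns]")
as_ints = stamps.view("i8")        # reinterpret the 8-byte payload, no copy
restored = as_ints.view("M8[ns]")  # reinterpret back; nothing is lost
assert (restored == stamps).all()
assert as_ints[0] == 1595846471000000000  # nanoseconds since the epoch
```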
@drahnreb, I see that #367 (comment) describes two different options for what `tolist` could return.

As you say above, it should consistently return only one type, and I agree wholeheartedly. The lack of a clean round-trip for some resolutions wouldn't be unprecedented; we already have:

```python
>>> np.array(ak.Array([1, 2, 3, None, None, 6]).tolist())
array([1, 2, 3, None, None, 6], dtype=object)
```

So I think having

```python
>>> np.array([1, 2, 3], "M8[us]").tolist()
[datetime.datetime(1970, 1, 1, 0, 0, 0, 1), datetime.datetime(1970, 1, 1, 0, 0, 0, 2), datetime.datetime(1970, 1, 1, 0, 0, 0, 3)]
>>> np.array([1, 2, 3], "M8[ns]").tolist()
[1, 2, 3]
```

is acceptable. (Also, unlike other NumPy functions, we haven't been attempting to make `tolist` match NumPy exactly.)

So let's just return the same thing NumPy's `tolist` returns.
@ianna and I agreed to close this: the desired behavior is in `main`.
Yes, that works, @jpivarski. 👍🏼

I assume you mean

Simply to prevent this (lost information, or sloppy behavior at day unit):

```python
>>> supported_py_datetime_units = ("D", "M", "h", "m", "s", "us")
>>> _np_times = np.datetime64, np.timedelta64
>>> for unit in supported_py_datetime_units:
...     for _method in _np_times:
...         arr = np.array([_method(1, unit)])
...         print(repr(arr))
...         try:
...             parr = pa.array(arr)
...             pa_type = parr.type
...             py_type = type(parr.to_pylist()[0])
...         except pa.lib.ArrowNotImplementedError as e:
...             pa_type = f'{e}: {unit}'
...             py_type = '-'
...         print(' -> pyarrow type: ', pa_type)
...         print(' -> python type: ', py_type)
...         print('\n')
```
```
array(['1970-01-02'], dtype='datetime64[D]')
 -> pyarrow type:  date32[day]
 -> python type:  <class 'datetime.date'>
array([1], dtype='timedelta64[D]')
 -> pyarrow type:  Unsupported timedelta64 time unit: D
 -> python type:  -
array(['1970-02'], dtype='datetime64[M]')
 -> pyarrow type:  Unsupported datetime64 time unit: M
 -> python type:  -
array([1], dtype='timedelta64[M]')
 -> pyarrow type:  Unsupported timedelta64 time unit: M
 -> python type:  -
array(['1970-01-01T01'], dtype='datetime64[h]')
 -> pyarrow type:  Unsupported datetime64 time unit: h
 -> python type:  -
array([1], dtype='timedelta64[h]')
 -> pyarrow type:  Unsupported timedelta64 time unit: h
 -> python type:  -
array(['1970-01-01T00:01'], dtype='datetime64[m]')
 -> pyarrow type:  Unsupported datetime64 time unit: m
 -> python type:  -
array([1], dtype='timedelta64[m]')
 -> pyarrow type:  Unsupported timedelta64 time unit: m
 -> python type:  -
array(['1970-01-01T00:00:01'], dtype='datetime64[s]')
 -> pyarrow type:  timestamp[s]
 -> python type:  <class 'datetime.datetime'>
array([1], dtype='timedelta64[s]')
 -> pyarrow type:  duration[s]
 -> python type:  <class 'datetime.timedelta'>
array(['1970-01-01T00:00:00.000001'], dtype='datetime64[us]')
 -> pyarrow type:  timestamp[us]
 -> python type:  <class 'datetime.datetime'>
array([1], dtype='timedelta64[us]')
 -> pyarrow type:  duration[us]
 -> python type:  <class 'datetime.timedelta'>
```

Generally, working with time gets non-trivial easily, and in this case the stdlib `datetime` types are the limiting factor.
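A possible workaround (my sketch, not from the comment above) for the `Unsupported ... time unit` errors is to cast coarse NumPy units to an Arrow-supported resolution before handing the array to `pa.array`:

```python
import numpy as np

# timedelta64[h] is rejected by Arrow, but casting to seconds is lossless
hours = np.array([1, 2], dtype="m8[h]")
seconds = hours.astype("m8[s]")
assert seconds.view("i8").tolist() == [3600, 7200]
# pa.array(seconds) should now succeed, yielding duration[s] (see output above)
```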
Thanks @drahnreb, I did mean that.

So Arrow returns `datetime` objects for the resolutions it supports.

I had another thought about this, which is that NumPy scalars already convert to the stdlib types via `.item()`:

```python
>>> numpy.datetime64(1, "s").item()
datetime.datetime(1970, 1, 1, 0, 0, 1)
>>> numpy.timedelta64(1, "s").item()
datetime.timedelta(seconds=1)
```

but the reverse isn't true. We strictly depend on NumPy, so it's definitely available, and that makes returning these objects easy.

So I think we all agree on what Awkward Array's `tolist` should do.
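The microsecond limit behind "the reverse isn't true" can be checked directly (a quick demonstration, assuming nothing beyond NumPy): `datetime.datetime`'s finest field is microseconds, so `.item()` falls back to integers for anything finer.

```python
import datetime

import numpy as np

assert np.datetime64(1, "us").item() == datetime.datetime(1970, 1, 1, 0, 0, 0, 1)
assert np.datetime64(1, "ns").item() == 1   # sub-microsecond: plain integer
assert np.timedelta64(1, "ns").item() == 1  # same fallback for timedelta
```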
Thanks @drahnreb for clarifying it:
Originally posted by @drahnreb in #835 (comment)