
Fix concatenating Variables with dtype=datetime64 #134

Merged
merged 6 commits on May 20, 2014

Conversation

shoyer
Member

@shoyer shoyer commented May 19, 2014

This is an alternative to #125 which I think is a little cleaner.

Basically, there was a bug where Variable.values for datetime64 arrays always made a copy of values. This made it impossible to edit variable values in-place.

@akleeman would appreciate your thoughts.

This is a more direct fix for the bug @akleeman reported in pydata#125.

Previously, it did not always work to modify Variable.values in-place (as done
in Variable.concat), because as_array_or_item could return a copy of an
ndarray instead of a view.

@akleeman I didn't incorporate your tests for the
@@ -73,7 +87,10 @@ def _as_compatible_data(data):
# check pd.Index first since it's (currently) an ndarray subclass
data = PandasIndexAdapter(data)
elif isinstance(data, np.ndarray):
data = NumpyArrayAdapter(utils.as_safe_array(data))
if data.dtype.kind == 'M':
data = data.astype('datetime64[ns]')
Contributor

Does this mean any non-index that holds dates will get automatically copied?

Member Author

Good catch. We should be using np.asarray or set copy=False with astype:

By default, astype always returns a newly allocated array. If this is set to false, and the dtype, order, and subok requirements are satisfied, the input array is returned instead of a copy.

I don't know why I thought astype wouldn't make a copy unless necessary.
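That astype behavior can be checked directly (a minimal NumPy sketch, not code from this PR):

```python
import numpy as np

dates = np.array(['2014-05-19', '2014-05-20'], dtype='datetime64[ns]')

# By default, astype allocates a new array even when the dtype already matches.
copied = dates.astype('datetime64[ns]')
assert not np.shares_memory(copied, dates)

# With copy=False, the input array itself is returned when the dtype,
# order, and subok requirements are already satisfied.
same = dates.astype('datetime64[ns]', copy=False)
assert same is dates
```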

Member Author

Yes, it looks like np.asarray(data, 'datetime64[ns]') works.
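For reference, the no-copy behavior of np.asarray is easy to verify (a sketch, not the PR's code):

```python
import numpy as np

dates = np.array(['2014-05-19'], dtype='datetime64[ns]')
# When the requested dtype already matches, np.asarray returns the input
# array itself rather than a copy, so in-place edits remain visible.
assert np.asarray(dates, 'datetime64[ns]') is dates
```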

@akleeman
Contributor

The reorganization does make things cleaner, but the behavior changed relative to #125. In particular, while this patch fixes concatenation for datetime64 values, it doesn't work for datetime.datetime objects:

In [2]: dates = [datetime.datetime(2011, 1, i + 1) for i in range(10)]
In [3]: ds = xray.Dataset({'time': ('time', dates)})
In [4]: xray.Dataset.concat([ds.indexed(time=slice(0, 4)), ds.indexed(time=slice(4, 8))], 'time')['time'].values
Out[4]: 
array([1293840000000000000L, 1293926400000000000L, 1294012800000000000L,
       1294099200000000000L, 1294185600000000000L, 1294272000000000000L,
       1294358400000000000L, 1294444800000000000L], dtype=object)

and I'm not sure whether this was broken before or not ...

In [5]: xray.Dataset.concat([x for _, x in ds.groupby('time')], 'time')['time'].values
ValueError: cannot slice a 0-d array

@shoyer
Member Author

shoyer commented May 19, 2014

With regards to datetime.datetime being converted into integers, the issue is that pandas currently does dtype inference when indexing an Index (pandas-dev/pandas#6370). Fortunately the next version of pandas (0.14), due out in a few weeks, stops doing this.

I had made some attempts to fix this previously but it was untested and only sort of worked. I think this latest commit should fix that up.

Let me check on that second issue...

@shoyer
Member Author

shoyer commented May 19, 2014

The latter is a (newly exposed) bug in pandas.isnull. I'm waiting on Travis before submitting the PR upstream, but here is the fix: https://github.com/shoyer/pandas/compare/isnull-0d-object-array

Looks like we'll need to add a work-around for now...

This is a work-around for this upstream issue:
pandas-dev/pandas#7176
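The work-around presumably looks something like this hypothetical helper (the PR's actual code may differ): promote a 0-d object array to a scalar before handing it to pd.isnull.

```python
import numpy as np
import pandas as pd

def safe_isnull(data):
    """Call pd.isnull, working around pandas-dev/pandas#7176, where
    older pandas raised ValueError on 0-d object ndarrays."""
    data = np.asarray(data)
    if data.ndim == 0:
        # Extract the scalar with an empty-tuple index, then test it.
        return pd.isnull(data[()])
    return pd.isnull(data)
```

With this, `safe_isnull(np.array(None, dtype=object))` returns True instead of raising, and ordinary arrays pass straight through to pd.isnull.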
@shoyer
Member Author

shoyer commented May 19, 2014

@akleeman I have a work-around up and ready for testing.

@akleeman
Contributor

Yeah all fixed.

In #125 I went the route of forcing datetimes to be datetime64[ns]. This is probably part of a broader conversation, but doing so might save some future headaches. Of course ... it would also restrict us to nanosecond precision. Basically I feel like we should either force datetimes to be datetime64[ns] or make sure that operations on datetime objects preserve their type.

Probably worth getting this in and picking that conversation back up if needed. In that case, could you add tests which make sure variables holding datetime objects still hold datetime objects after concatenation? If those start getting cast to datetime64[ns], it'll get confusing for users.
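Such a test might look roughly like this (a hypothetical sketch using plain NumPy, not the PR's actual test suite): concatenating object-dtype arrays of datetime objects should leave the element type alone.

```python
import datetime
import numpy as np

dates = [datetime.datetime(2011, 1, d) for d in (1, 2, 3, 4)]
halves = [np.array(dates[:2], dtype=object), np.array(dates[2:], dtype=object)]

# Concatenation of object-dtype arrays must not coerce the elements
# to datetime64 or to integers.
combined = np.concatenate(halves)
assert combined.dtype == object
assert all(isinstance(d, datetime.datetime) for d in combined)
```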

@shoyer
Member Author

shoyer commented May 19, 2014

I agree, we should either ensure datetime64[ns] or ensure that operations on datetime objects preserve dtype. The latest commit should verify this.

My main reason for not doing the former is that I thought it would be nice (in theory) to support using plain datetime objects if datetime64[ns] does not have a long enough time range for some users. But for most users, I suspect they would indeed rather have datetime64[ns].
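For context on that time-range concern (a sketch; the approximate bounds are a property of 64-bit nanosecond timestamps, not of this PR):

```python
import datetime
import numpy as np

# datetime64[ns] can only represent roughly 1678-09-21 through 2262-04-11,
# so dates outside that window must stay as plain datetime.datetime objects
# stored with dtype=object.
far_future = datetime.datetime(2500, 1, 1)
arr = np.array([far_future], dtype=object)
assert arr.dtype == object
assert arr[0].year == 2500
```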

@akleeman
Contributor

Also worth considering: how should datetime64[us] values be handled? Currently they get cast to datetime64[ns], which could get confusing since plain datetime objects are not.

@shoyer
Member Author

shoyer commented May 19, 2014

The reason I cast all datetime64 to datetime64[ns] is because pandas will not let you make an Index of datetime64 objects with anything other than ns precision. If you try to make it an Index with dtype=object you'll actually get an array of datetime.datetime objects:

>>> pd.Index(pd.date_range('2000-01-01', periods=5).values.astype('datetime64[us]'), dtype='object').values
array([datetime.datetime(2000, 1, 1, 0, 0),
       datetime.datetime(2000, 1, 2, 0, 0),
       datetime.datetime(2000, 1, 3, 0, 0),
       datetime.datetime(2000, 1, 4, 0, 0),
       datetime.datetime(2000, 1, 5, 0, 0)], dtype=object)

But I do agree this is not terribly consistent nor fully thought through. And it should certainly be well-documented.

@shoyer shoyer mentioned this pull request May 19, 2014
3 tasks
@shoyer shoyer added the bug label May 19, 2014
@shoyer shoyer added this to the 0.1.1 milestone May 20, 2014
shoyer added a commit that referenced this pull request May 20, 2014
Fix concatenating Variables with dtype=datetime64
@shoyer shoyer merged commit 5a165b2 into pydata:master May 20, 2014
@shoyer shoyer deleted the time-conversion-fix-alt branch May 20, 2014 19:09