convert_to_r_dataframe missing conversion for datetime values #2351

Closed
milkypostman opened this issue Nov 25, 2012 · 8 comments

@milkypostman

I am in desperate need of converting my datetime values into something R can use so that I can plot some data with ggplot. It appears that rpy2 supports this conversion, but I am not sure how to go from a column with a datetime field (stored as a numpy.datetime64) to a time.struct_time value.

The relevant information about rpy2 is here: http://rpy.sourceforge.net/rpy2/doc-2.2/html/vector.html?highlight=intvector#rpy2.robjects.vectors.POSIXlt

I see that there need to be some minor changes on line 244 of common.py but basically I am stuck on trying to figure out how to convert from numpy.datetime64 into the struct_time. I see that there is already functionality in Pandas, specifically the to_pydatetime function, for converting from this numpy format into a datetime object.
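For readers landing here with the same question, one possible route from numpy.datetime64 to time.struct_time goes through datetime.datetime. This is only a sketch: the helper name `datetime64_to_struct_time` is made up for illustration, and it assumes a timezone-naive value.

```python
import numpy as np


def datetime64_to_struct_time(value):
    """Hypothetical helper: numpy.datetime64 scalar -> time.struct_time."""
    # A nanosecond-resolution value does not map onto datetime.datetime,
    # so drop to microseconds first, then unwrap to a Python datetime.
    py_dt = value.astype("datetime64[us]").item()
    return py_dt.timetuple()


stamp = np.datetime64("2012-11-25T14:30:00", "ns")
st = datetime64_to_struct_time(stamp)
print(st.tm_year, st.tm_mon, st.tm_mday, st.tm_hour, st.tm_min)
```

Doing this element by element is slow for long columns, which is one reason a vectorised conversion is attractive.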

If someone could lead me over this hump I'd be happy to submit the pull request.

@wesm
Member

wesm commented Nov 25, 2012

@dalejung can probably help

@dalejung
Contributor

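import time

import rpy2.rinterface as rinterface
import rpy2.robjects as robjects
from rpy2.rinterface import StrSexpVector

# Assumed setup: this name was not defined in the original snippet.
baseenv_ri = rinterface.baseenv


def convert_datetime_index_num(ind):
    """
    Convert a DatetimeIndex to POSIXct via epoch seconds.
    See robjects.vectors.POSIXct, where I grabbed the logic.
    """
    # convert M8[ns] (nanoseconds since the epoch) to seconds
    vals = robjects.vectors.FloatSexpVector(ind.asi8 / 1E9)
    as_posixct = baseenv_ri['as.POSIXct']
    origin = StrSexpVector([time.strftime("%Y-%m-%d", time.gmtime(0))])
    # We will be sending ints as UTC
    tz = ind.tz and ind.tz.zone or 'UTC'
    tz = StrSexpVector([tz])
    utc_tz = StrSexpVector(['UTC'])

    # build the vector in UTC, then relabel it with the index's own timezone
    posixct = as_posixct(vals, origin=origin, tz=utc_tz)
    posixct.do_slot_assign('tzone', tz)
    return posixct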

as.POSIXct can accept a vector of epoch seconds. This will be much faster than creating struct_time values and using POSIXlt.
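The epoch-seconds step can be tried without R at all. A minimal sketch with plain NumPy, where `astype("int64")` is assumed to stand in for the `asi8` attribute of a DatetimeIndex:

```python
import numpy as np

stamps = np.array(
    ["1970-01-01T00:00:01", "2012-11-25T00:00:00"],
    dtype="datetime64[ns]",
)

# Nanoseconds since the epoch as plain integers.
nanos = stamps.astype("int64")

# Scale to floating-point seconds, the unit as.POSIXct expects.
seconds = nanos / 1e9
print(seconds)
```

The whole column is converted in one vectorised operation, with no per-element Python objects involved.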

Note, the tz param in as.POSIXct.numeric is applied to the origin, so it actually shifts the epoch for some weird reason:

> as.POSIXct.numeric
function (x, tz = "", origin, ...) 
{
    if (missing(origin)) 
        stop("'origin' must be supplied")
    as.POSIXct(origin, tz = tz, ...) + x
}

Which means that the original tz must be sent as UTC, and then adjusted later. Also note that R doesn't allow a tz-naive POSIXct. You can get into weird corner cases where your source data is in UTC and your stuff created in R is in your local TZ. So make sure to set your source data and system TZ to be the same.
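The effect of applying tz to the origin can be mimicked in Python to see the size of the shift. This is a rough sketch, not what R does internally: `posixct_from_numeric` is a made-up name, and a fixed UTC-5 offset stands in for a real timezone.

```python
from datetime import datetime, timedelta, timezone


def posixct_from_numeric(x, tz):
    """Mimic as.POSIXct(origin, tz=tz) + x from the R source above."""
    # The origin is midnight on 1970-01-01 *in tz*, not in UTC.
    origin = datetime(1970, 1, 1, tzinfo=tz)
    return origin + timedelta(seconds=x)


utc = timezone.utc
utc_minus_5 = timezone(timedelta(hours=-5))  # illustrative fixed offset

x = 1353801600  # 2012-11-25 00:00:00 UTC as epoch seconds
in_utc = posixct_from_numeric(x, utc)
in_offset = posixct_from_numeric(x, utc_minus_5)

# Same x, different instants: the non-UTC origin moved the epoch.
shift = (in_offset - in_utc).total_seconds()
print(in_utc.isoformat(), in_offset.isoformat(), shift)
```

With a UTC origin the result lands on the intended instant, which is why the snippet earlier in the thread passes UTC first and only sets tzone afterwards.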

@milkypostman
Author

@dalejung thanks. I think I follow. I was already using POSIXct rather than POSIXlt.

There are a couple of things I'm unclear on:

First, this means that the time sent to R is not going to be in the proper timezone, is that correct? Is there a way around this?

Second, what if what we are converting is not a DateTimeIndex? Rather, what if we have a column holding times and we want to convert that?

@dalejung
Contributor

Yeah, but you have to set the tzone afterwards. The as.POSIXct.numeric tz is applied to the origin.

The stuff that I posted will only work with a DatetimeIndex or numpy.datetime64 arrays. The ind.asi8 values are the int64 view of a naked M8[ns] array. I suppose the above function could be split so that it calls a function that takes in a numpy.datetime64 array and a tz.

Outside of that, I think first converting to DatetimeIndex/np.datetime would be a good choice. I don't know the upside of having a column of datetime objects instead of an np.datetime array.
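A sketch of what the array-level half of such a split might look like, leaving the rpy2 part out. The name `to_epoch_seconds` is hypothetical, and the inputs are assumed to be timezone-naive values already expressed in UTC.

```python
from datetime import datetime

import numpy as np


def to_epoch_seconds(values):
    """Hypothetical helper: datetimes or datetime64 values -> float seconds."""
    # Normalise to datetime64[ns]; this accepts a datetime64 array as well
    # as a sequence of naive datetime.datetime objects.
    arr = np.asarray(values, dtype="datetime64[ns]")
    return arr.astype("int64") / 1e9


column = [datetime(1970, 1, 1, 0, 0, 1), datetime(2012, 11, 25)]
index_like = np.array(["1970-01-01T00:00:01", "2012-11-25"], dtype="datetime64[ns]")

from_objects = to_epoch_seconds(column)
from_array = to_epoch_seconds(index_like)
print(from_objects, from_array)
```

Both inputs end up on the same code path, so the R-facing function would only need the resulting seconds plus a tz string.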

@milkypostman
Author

OK, so after doing some more investigation, it looks like when you pass parse_dates as a parameter to read_csv, the Series that is created is simply an np.object array with each element being a datetime.datetime. My guess is this is done because there is no way to set np.nan on an element of an np.datetime64 array. What I do is just call astype(np.datetime64) on that column and reassign it. This will only work if there are no np.nan values in that column/Series.
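For columns that do contain missing entries, pd.to_datetime may be an alternative to astype, since it represents missing values as NaT. A small sketch, assuming a reasonably current pandas; the sample data is invented for illustration.

```python
from datetime import datetime

import pandas as pd

# Object-dtype column holding datetime.datetime objects and one missing value.
raw = pd.Series([datetime(2012, 11, 25, 9, 30), None], dtype=object)

# to_datetime yields a datetime64 column; the missing entry becomes NaT.
converted = pd.to_datetime(raw)

print(converted.dtype.kind)       # "M" denotes a datetime64 dtype
print(converted.isna().tolist())  # marks which entries are NaT
```

Any code that then turns the column into epoch seconds would still need to decide how NaT should appear on the R side.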

I've pushed new changes, #2352 should be updated with those changes. Let me know what you think.

I thought adding the function that could handle both a Series of type np.datetime64 and DatetimeIndex would be beneficial for uses in the future outside of convert_to_r_dataframe.

Do we need to add some type checking in here? Raise some errors or something?

@milkypostman
Author

by the way, this timezone stuff gives me a headache. also, rpy2 is magic.

@dalejung
Contributor

How I wish someone would take over and instate one global timezone. 3 PM is sunrise in your region? DEAL WITH IT!

Seriously though, I have a check that yells at me if I don't explicitly set the R Sys.TZ. I've been burned too many times with that.
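A comparable guard can be written on the Python side. This is only a sketch of the idea, not the check described above: it assumes the timezone is controlled through the TZ environment variable, which an embedded R would inherit from the process, and `require_explicit_tz` is a made-up name.

```python
import os


def require_explicit_tz(environ=None):
    """Hypothetical guard: fail loudly unless TZ has been set explicitly."""
    environ = os.environ if environ is None else environ
    tz = environ.get("TZ")
    if not tz:
        raise RuntimeError("TZ is not set; set it explicitly, e.g. TZ=UTC")
    return tz


# Demonstrated with a plain dict so the example does not depend on the machine.
print(require_explicit_tz({"TZ": "UTC"}))
```

Calling it once before any conversion runs would make a forgotten timezone fail early instead of silently shifting timestamps.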

@milkypostman
Author

Don't get me started on daylight savings........

@wesm wesm closed this as completed Dec 1, 2012