Pandas dataframe values return as numpy.datetime64 objects in local time zone. parse_time does not understand these objects. #798

Closed
aringlis opened this issue Feb 7, 2014 · 18 comments · Fixed by #2572
Labels
Feature Request (New feature wanted!); Hacktoberfest (Issues that could be the focus of GSoC or Hacktoberfests); Package Novice (Requires little knowledge of the internal structure of SunPy); timeseries (Affects the timeseries submodule)

Comments

@aringlis
Member

aringlis commented Feb 7, 2014

Came across this issue today when using a pandas DataFrame. When you explicitly ask for the values of the indices in a DataFrame, they can be returned as numpy.datetime64 objects. These time objects have the timezone attached to the end of them (see example below). parse_time at the moment cannot understand these objects.

The following example explains what I'm on about...

In [1]: import datetime
In [2]: import pandas
In [3]: import numpy as np
#create a test series
In [4]: x=np.linspace(0,19,20)
In [5]: basetime=datetime.datetime.utcnow()
In [6]: times=[]                                

In [7]: for thing in x:
   ...:     times.append(basetime + datetime.timedelta(0,thing))

In [8]: times
Out[8]: 
[datetime.datetime(2014, 2, 7, 21, 47, 51, 8288),
 datetime.datetime(2014, 2, 7, 21, 47, 52, 8288),
 datetime.datetime(2014, 2, 7, 21, 47, 53, 8288),
 datetime.datetime(2014, 2, 7, 21, 47, 54, 8288),
 datetime.datetime(2014, 2, 7, 21, 47, 55, 8288),
 datetime.datetime(2014, 2, 7, 21, 47, 56, 8288),
 datetime.datetime(2014, 2, 7, 21, 47, 57, 8288),
 datetime.datetime(2014, 2, 7, 21, 47, 58, 8288),
 datetime.datetime(2014, 2, 7, 21, 47, 59, 8288),
 datetime.datetime(2014, 2, 7, 21, 48, 0, 8288),
 datetime.datetime(2014, 2, 7, 21, 48, 1, 8288),
 datetime.datetime(2014, 2, 7, 21, 48, 2, 8288),
 datetime.datetime(2014, 2, 7, 21, 48, 3, 8288),
 datetime.datetime(2014, 2, 7, 21, 48, 4, 8288),
 datetime.datetime(2014, 2, 7, 21, 48, 5, 8288),
 datetime.datetime(2014, 2, 7, 21, 48, 6, 8288),
 datetime.datetime(2014, 2, 7, 21, 48, 7, 8288),
 datetime.datetime(2014, 2, 7, 21, 48, 8, 8288),
 datetime.datetime(2014, 2, 7, 21, 48, 9, 8288),
 datetime.datetime(2014, 2, 7, 21, 48, 10, 8288)]

In [9]: test_pandas=pandas.DataFrame(np.random.random(20),index=times)

If you now print the index values from the pandas DataFrame, they are displayed in the local time zone instead of UT. In the following example, the numpy.datetime64 is shown in UT-5.

In [10]: test_pandas.index.values[0]
Out[10]: numpy.datetime64('2014-02-07T16:47:51.008288000-0500')
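
A minimal sketch of what is happening here, assuming only numpy and the standard library. The datetime64 value itself carries no time zone; the -0500 suffix comes from how NumPy releases of that era rendered the value in the machine's local zone (NumPy 1.11 and later print datetime64 as timezone naive). Casting back to datetime.datetime recovers the original UTC wall-clock time. The variable names are illustrative only.

```python
import datetime

import numpy as np

# A naive datetime holding a UTC wall-clock time, like the entries of `times` above.
naive_utc = datetime.datetime(2014, 2, 7, 21, 47, 51, 8288)

# numpy stores this as a count of microseconds since the epoch, with no zone attached.
stamp = np.datetime64(naive_utc)

# Cast to microsecond precision first: a nanosecond-precision value would
# come back as a plain integer instead of a datetime.datetime.
round_trip = stamp.astype("datetime64[us]").astype(datetime.datetime)

print(round_trip)
```

So the underlying instant is unchanged; only the displayed representation was shifted.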

Also, parse_time can't read this format at the moment.

In [11]: from sunpy.time import parse_time
In [12]: parse_time(test_pandas.index.values[0])
ERROR: TypeError: argument of type 'numpy.datetime64' is not iterable [sunpy.time.time]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-7d3de1f9a633> in <module>()
----> 1 parse_time(test_pandas.index.values[0])

/Users/ainglis/python/sunpy/sunpy/time/time.pyc in parse_time(time_string)
    169         # remove trailing zeros and the final dot to allow any
    170         # number of zeros. This solves issue #289
--> 171         if '.' in time_string:
    172             time_string = time_string.rstrip("0").rstrip(".")
    173         for time_format in TIME_FORMAT_LIST:

TypeError: argument of type 'numpy.datetime64' is not iterable
@aringlis
Member Author

aringlis commented Feb 7, 2014

This is important for the LightCurve object, where you can see the same behaviour, e.g.

from sunpy import lightcurve
lyra=lightcurve.LYRALightCurve.create('2012-09-10')
print lyra.data.index.values[0]
2012-09-09T20:00:00.124000000-0400

@Cadair
Member

Cadair commented Sep 17, 2014

ping @DanRyanIrish this is what we were talking about at SIPWork no?

@DanRyanIrish
Member

Hi @Cadair. Apologies for my late reply. I have been away for much of the last month and I am just catching up on things. This is not exactly what we were talking about. What we were talking about is that when you pass a lightcurve object's time index to parse_time(), it returns a Timestamp object, not a datetime object. Therefore, parse_time does not consistently return the same type of object. For example:

In [1]: import sunpy.lightcurve
In [2]: from sunpy.time import parse_time

In [3]: glc = sunpy.lightcurve.GOESLightCurve.create("2014-01-01", "2014-01-02")
In [4]: lc_time = parse_time(glc.data.index[0])
In [5]: lc_time
Out[5]: Timestamp('2014-01-01 00:00:00.421999', tz=None)

Meanwhile for other inputs to parse_time() you get a datetime object.

In [6]: str_time = parse_time("2014-01-01")
In [7]: str_time
Out[7]: datetime.datetime(2014, 1, 1, 0, 0)
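
A short sketch of the type difference described above, assuming a recent pandas. pandas.Timestamp is a subclass of datetime.datetime, so isinstance checks pass, but the exact type differs unless the value is converted with Timestamp.to_pydatetime(). The index here is built by hand to stand in for glc.data.index.

```python
import datetime

import pandas as pd

# Stand-in for glc.data.index: a DatetimeIndex with one entry.
index = pd.DatetimeIndex(["2014-01-01 00:00:00.421999"])
ts = index[0]

# Timestamp subclasses datetime.datetime, so this check succeeds...
print(isinstance(ts, datetime.datetime))

# ...but the concrete type is pandas.Timestamp until it is converted explicitly.
plain = ts.to_pydatetime()
print(type(ts).__name__, type(plain).__name__)
```

Returning the converted value would make the output type the same for string and index inputs.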

@Cadair
Member

Cadair commented Jan 14, 2015

@aringlis @DanRyanIrish can you check that this is fixed now.

@DanRyanIrish
Member

@aringlis @Cadair: This is now fixed for my situation, i.e. if you do

lc_time = parse_time(glc.data.index[0])

then lc_time is a datetime.datetime object. However, @aringlis's situation remains the same. If you do

lc_time = glc.data.index.values[0]

the result is still a numpy.datetime64 in the local time zone, and passing this to parse_time() still causes it to crash. However, I'm not sure this is a big issue, since simply using glc.data.index instead of glc.data.index.values solves the problem.
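
To illustrate the two access paths, here is a hedged sketch using a hand-built DataFrame in place of glc.data. The helper as_datetime is hypothetical (it is not part of sunpy); it only shows one way a numpy.datetime64 could be coerced before parsing.

```python
import datetime

import numpy as np
import pandas as pd

# Stand-in for glc.data: a frame indexed by naive datetimes.
times = [datetime.datetime(2014, 1, 1) + datetime.timedelta(seconds=s) for s in range(3)]
frame = pd.DataFrame(np.arange(3.0), index=times)

from_index = frame.index[0]          # pandas.Timestamp
from_values = frame.index.values[0]  # numpy.datetime64


def as_datetime(value):
    """Hypothetical helper: coerce a Timestamp or numpy.datetime64 to datetime.datetime."""
    return pd.Timestamp(value).to_pydatetime()


print(type(from_index).__name__, type(from_values).__name__)
print(as_datetime(from_values))
```

Both paths end up at the same datetime.datetime once they go through the helper.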

@Cadair
Member

Cadair commented Jan 20, 2015

@DanRyanIrish interesting, it should have solved the second one as well. Thanks for checking, I will look into it again.

@ankitkmr
Contributor

@aringlis @Cadair I am a prospective GSoC 2015 student and I'd like to work on this issue. In fact, if I may, this might work: https://github.com/ankitkmr/sunpy/blob/master/patch.py

@dpshelio
Member

@ankitkmr could you open a pull request with the changes made in the files that are affected? Then in the message of the PR (not in the title) you can link to this issue by using # followed by the number (i.e. #798).
In any case... this issue could be very annoying... and I don't know how tzlocal.get_localzone() and pandas will behave if I'm travelling and the time in my computer is updated.

@ankitkmr
Contributor

@dpshelio Yeah sure, I will do that. I think tzlocal.get_localzone() won't be a problem: as you can see, I save its value in a variable right before converting the data into a pandas DataFrame. The problem is that the conversion into pd.DataFrame brings the current local time zone into that data, and we need that same local time zone for times.tz_localize(tz) in order to convert back to UTC. Hope that helps.

@dpshelio
Member

@ankitkmr I've just realised that tzlocal is not in the standard library... so that means an additional requirement... for something that should not happen in the first place.

Also, it seems pandas uses pytz... maybe that would be better instead of tzlocal?

Though.. I'm kind of lost now on what this needs to fix... I've found the following.

  • parse_time() complains if the input time contains timezone information...
>>> parse_time('2015-03-18T12:49:22.979471000+0000')
ValueError: 2015-03-18T12:49:22.979471000+ is not a valid time string!

It seems it takes the timezone away... This should be fixed.

  • numpy.datetime64 can be passed to parse_time using __str__()
    Maybe this could be bypassed if parse_time uses the __str__ representation when the input is a numpy.datetime64 object
  • pandas has an option to convert to UTC, which I believe should be our standard across sunpy. However, I cannot make it work. There's some kind of mismatch between the docstring and what it actually does... or I don't see the difference because I'm in UTC :-/
>>> test_pandas.index.tz_convert?
...
tz : ....
    None will remove timezone holding UTC time.
...
>>> test_pandas.index.tz_convert()
TypeError: tz_convert() takes exactly 2 arguments (1 given)
...
  • Would astropy.time solve the timezone problem?
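
On the third bullet, a sketch of how the pandas time zone methods fit together, assuming a recent pandas with time zone data available. tz_convert needs an explicit tz argument and only works on a tz-aware index; a naive index has to go through tz_localize first. The zone "America/New_York" is chosen only for illustration.

```python
import pandas as pd

# A tz-naive index, like the one built from utcnow() values earlier in the thread.
naive = pd.DatetimeIndex(["2015-03-18 12:49:22.979471"])

# tz_convert cannot be applied to a naive index; pandas raises TypeError.
try:
    naive.tz_convert("UTC")
    naive_convert_failed = False
except TypeError as err:
    naive_convert_failed = True
    print("tz-naive index:", err)

# Declare that the naive values are UTC, then convert between zones freely.
aware = naive.tz_localize("UTC")
eastern = aware.tz_convert("America/New_York")

# Passing None converts back to UTC and drops the zone, as the docstring says.
back = eastern.tz_convert(None)
print(back.equals(naive))
```

This is consistent with the docstring quoted above: None is a valid value for tz, but it still has to be passed explicitly.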

@ankitkmr
Contributor

@dpshelio
Ok, about the first issue, where it takes away the timezone: can you explain how I register a timezone? I don't really know this new format: '%Y': '(?P\d{4})',
This happens because of the else condition at the end of the parse_time function definition (see the source: http://docs.sunpy.org/en/latest/_modules/sunpy/time/time.html#parse_time):
else:
    if '.' in time_string:
        time_string = time_string.rstrip("0").rstrip(".")

which needs to become
else:
    if '.' in time_string and '+' not in time_string:
        time_string = time_string.rstrip("0").rstrip(".")
Moreover, the way time_string is formatted in our case is not supported by TIME_FORMAT_LIST and REGEX (the lists defined in the source code for parse_time).
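
The effect of that branch, and of the proposed guard, can be reproduced in isolation. This is only a sketch: the two function names are hypothetical wrappers around the two versions of the condition quoted above, not code from sunpy.

```python
def strip_trailing_zeros(time_string):
    """Original branch: strip trailing zeros whenever the string contains a dot."""
    if '.' in time_string:
        time_string = time_string.rstrip("0").rstrip(".")
    return time_string


def strip_trailing_zeros_guarded(time_string):
    """Proposed branch: skip the stripping when a '+' offset is present."""
    if '.' in time_string and '+' not in time_string:
        time_string = time_string.rstrip("0").rstrip(".")
    return time_string


with_offset = '2015-03-18T12:49:22.979471000+0000'
print(strip_trailing_zeros(with_offset))          # the zeros of the offset are eaten
print(strip_trailing_zeros_guarded(with_offset))  # left untouched

# A negative offset is not caught by the '+' check and is still damaged.
print(strip_trailing_zeros_guarded('2014-02-07T16:47:51.008288000-0500'))
```

Note the limitation in the last call: checking for '-' instead is not an option either, because the date part itself contains hyphens.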

So I can add that support, but I don't know how to register the 0000 at the end of +0000, like

'%T': '(?P\d{4})', # Assuming I replace T for end 000 what goes after colon. Need some help here

Let me work out an alternative where we convert +0000 to Zulu format, because that is supported by parse_time. Can I convert all local-time data into the corresponding UTC and then pass it like sunpy.time.parse_time('2005-08-04T00:01:02.000Z')?
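
One possible shape for that conversion, as a sketch that uses only the Python 3 standard library. The helper name to_zulu and its regular expression are assumptions for illustration, not sunpy code. Fractional seconds are cut to six digits because strptime's %f accepts at most microseconds.

```python
import re
from datetime import datetime, timezone

# Date and time, optional fractional seconds, then a +HHMM or -HHMM offset.
_OFFSET_PATTERN = re.compile(
    r"^(?P<stamp>.+?)(?:\.(?P<frac>\d+))?(?P<offset>[+-]\d{4})$"
)


def to_zulu(time_string):
    """Hypothetical helper: rewrite an offset time string as UTC with a trailing Z."""
    match = _OFFSET_PATTERN.match(time_string)
    if match is None:
        raise ValueError("{} has no +HHMM/-HHMM offset".format(time_string))
    # Keep at most microsecond precision, padding short fractions with zeros.
    frac = (match.group("frac") or "0")[:6].ljust(6, "0")
    aware = datetime.strptime(
        "{}.{}{}".format(match.group("stamp"), frac, match.group("offset")),
        "%Y-%m-%dT%H:%M:%S.%f%z",
    )
    return aware.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")


print(to_zulu('2015-03-18T12:49:22.979471000+0000'))
print(to_zulu('2014-02-07T16:47:51.008288000-0500'))
```

This handles negative offsets as well, which the '+' check discussed earlier does not.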

@ankitkmr
Contributor

@dpshelio Second bullet, on __str__(): yes, true, but then I again have to add support for time strings formatted with +0000.

Third bullet: that's what I used.

And as I explained above, I don't think astropy.time will be of any additional help.

@ankitkmr
Contributor

@dpshelio Also, I didn't understand the PR part. Should I start contributing to unifiedDownloader now, or is it something to get familiar with and base my proposal around? A to-do list before the application would clear up my doubts.

Also, should I start my proposal now? I have done about a quarter of it, but I would like to focus on it after I've completed all the prerequisites.

Thanks a lot

@ankitkmr
Contributor

@dpshelio @aringlis @Cadair

>>> parse_time('2015-03-18T12:49:22.979471000+0000')
ValueError: 2015-03-18T12:49:22.979471000+ is not a valid time string!

OK Fixed the issue here, run this script, https://github.com/ankitkmr/sunpy/blob/master/parse_time.py
Look for '''NEW PIECE OF CODE ADDED''' to '''NEW PIECE OF CODE ENDS'''

How do I add this correction in the original source code now? I am new to open source dev :(

@ankitkmr
Contributor

@aringlis @Cadair @dpshelio

In [11]: from sunpy.time import parse_time
In [12]: parse_time(test_pandas.index.values[0])
ERROR: TypeError: argument of type 'numpy.datetime64' is not iterable [sunpy.time.time]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-7d3de1f9a633> in <module>()
----> 1 parse_time(test_pandas.index.values[0])
/Users/ainglis/python/sunpy/sunpy/time/time.pyc in parse_time(time_string)
    169         # remove trailing zeros and the final dot to allow any
    170         # number of zeros. This solves issue #289
--> 171         if '.' in time_string:
    172             time_string = time_string.rstrip("0").rstrip(".")
    173         for time_format in TIME_FORMAT_LIST:
TypeError: argument of type 'numpy.datetime64' is not iterable

Workaround to this problem : https://github.com/ankitkmr/sunpy/blob/master/patch.py

ankitkmr added a commit to ankitkmr/sunpy-1 that referenced this issue Mar 20, 2015
This offers a solution to the issue sunpy#798
ankitkmr added a commit to ankitkmr/sunpy-1 that referenced this issue Mar 20, 2015
This offers a solution to the issue sunpy#798
ankitkmr added a commit to ankitkmr/sunpy-1 that referenced this issue Mar 20, 2015
This offers a solution to the issue sunpy#798
ankitkmr added a commit to ankitkmr/sunpy-1 that referenced this issue Mar 22, 2015
ankitkmr added a commit to ankitkmr/sunpy-1 that referenced this issue Mar 22, 2015
ankitkmr added a commit to ankitkmr/sunpy-1 that referenced this issue Mar 23, 2015
@nabobalis
Contributor

If lightcurve is dead, does this need to be open? Does it affect time series?

Should I add a timeseries label?

@DanRyanIrish
Member

ping @Alex-Ian-Hamilton

@nabobalis nabobalis added the Hacktoberfest Issues that could be the focus of GSoC or Hacktoberfests label Oct 4, 2017
@dstansby
Member

I think I might have fixed this with #2370 ?

@nabobalis nabobalis added timeseries Affects the timeseries submodule and removed lightcurve labels Mar 27, 2018
7 participants