Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chef changes time concepts #124

Open
semio opened this issue May 27, 2020 · 6 comments
Open

chef changes time concepts #124

semio opened this issue May 27, 2020 · 6 comments

Comments

@semio
Copy link
Owner Author

semio commented May 27, 2020

here is the thing : time concept was converted to int16

For 4 digits year number it works well but numbers like 20200520 is bigger than the upper bound of int16

@semio
Copy link
Owner Author

semio commented May 27, 2020

I didn't use string as dtype here because it's not space efficient and in our datasets the time values are integers. Using categorical is better choice which is more space efficient, but it's not easy to get all categories (i.e. all possible values of the time concept).

But the decision was made before I used dask to load datapoints. Dask solves the data-bigger-than-memory issue, so possibly for now it's ok to just use string for time. Need to test it

semio added a commit that referenced this issue May 27, 2020
work around #124, need more consideration
@jheeffer
Copy link

jheeffer commented May 27, 2020

Time concept is not neccesarily numeric. See the documentation. Week, Quarter and Month contain w, q and - respectively.
Also undocumented in DDF but following ISO8601, when adding time, there would be a t seperator between date and time and . separating seconds from milliseconds

If you can internally parse it to datetime and format it back when outputting that'd be fine too. Might be more memory efficient. But this may be hard, especially in case of time concept, whose format is ambiguous: should midnight 1st of january 2020 be formatted as 20200101 or 2020 or 2020-01? Ambiguity might be solved by using the format the data was in when you parsed it. Might be difficult to trace that?

semio added a commit that referenced this issue Jun 9, 2020
related to #124

Because pandas set_index() and reset_index() have a bug that reset
dtypes. uint64 will reset to int64 and cause problems so set it to
int64 for now.
@semio
Copy link
Owner Author

semio commented Jun 10, 2020

Ok, as we have defined the format for year, month etc, I think it's possible to parse these columns to datetime object internally. For time, we can use string for now, or maybe add some way to specify the format in recipe or datapackage.json

@semio
Copy link
Owner Author

semio commented Jul 7, 2020

After implementing reading time columns as datetime objects, we also need to change the filtering function in ingredient filter and filter procedure, because datetime object is not comparable to int (e.g filter time > 1900 not working). But I found that the performance with eval() and query() (which are used in filter) is very poor

# df is population from WPP
>>> df.shape
(6130902, 5)

>>> df.columns
Index(['age1yearinterval', 'country', 'gender', 'time', 'population'], dtype='object')

>>> %time df.eval("time > '2020'")
CPU times: user 50.4 s, sys: 614 ms, total: 51 s
Wall time: 51 s

>>>  %time df['time'] > '2020'
CPU times: user 7.45 ms, sys: 0 ns, total: 7.45 ms
Wall time: 7.05 ms

# below is performance when using string
>>> %time df['time'] > "2020"
CPU times: user 253 ms, sys: 0 ns, total: 253 ms
Wall time: 252 ms

>>> %time df.eval('time > "2020"')
CPU times: user 255 ms, sys: 23.8 ms, total: 279 ms
Wall time: 276 ms

... which I think is not acceptable. And there will be many changes to make if we don't use df.query(). So I think we for now settle with strings for time concepts because filter should just work.

>>> '2010q4' < '2020q2'
True

>>> '2010w51' < '2020w01'
True

Comments or OK?

semio added a commit that referenced this issue Jul 12, 2020
semio added a commit that referenced this issue Jul 12, 2020
time values are strings, so we don't compare them to ints

related issue #124
@semio
Copy link
Owner Author

semio commented Jul 20, 2020

chef will read time columns as strings now. I also sent a bug report to pandas

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants