Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Move Period class and related functions to Cython module #9440

Merged
merged 4 commits into from
Feb 17, 2015

Conversation

blbradley
Copy link
Contributor

I propose that Period should be a more basic type, like Timestamp.

I wanted to start with implementing frequencies with multipliers (see issue #7811), but the code for Period was all over the place. So, I moved all code directly related to Period to a Cython module. Changed a bit of the logic to make calls from Cython instead of Python to get rid of some Python dependencies. However, the 'meat' of Period is still there. I will be refactoring Period to use Cython over the next week or two.

I've only done a bit of performance testing, but the results from moving Period are good so far without any real Cython refactoring.

Note: First commit in this stream is from #9415 because I needed that bug fixed to make this work.

Comments most welcome!

@jreback
Copy link
Contributor

jreback commented Feb 7, 2015

try not to change the meat of any tests (moving around is ok)
as much harder to see what actually changed

ideally we could merge this and then you can build enhancements on top in another pull request

ping when passing
you have a couple of unicode failures (returning bytes not unicode)

@blbradley
Copy link
Contributor Author

I didn't move any tests (and won't without discussion, and probably a separate PR unless writing new tests).

Very good. Can we get #9415 merged first or will you merge it as part of this?

@blbradley blbradley changed the title PROPOSAL: Make Period more low level, like Timestamp Move Period class to Cython module Feb 9, 2015
@blbradley
Copy link
Contributor Author

@jreback Fixed. Some commits are different so please start fresh.

for i in range(n):
val = arr[i]
if val != iNaT:
val = func(val, relation, &finfo)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just combine these into a single loop
In python this matters but not in cython where these are fast comparisons (you can do a mini profile to see if it matters in any event) - it's just more clear in the code

@blbradley
Copy link
Contributor Author

Can we hold off on that? I did not want to change any logic except to remove dependencies.

@jreback
Copy link
Contributor

jreback commented Feb 9, 2015

yep no prob on changing code -

pls do a perf run (vbenches)

keep in mind it's not clear on how much of this code is well tested

so if u think something is not being hit addrl tests to validate are welcome

@blbradley
Copy link
Contributor Author

That is, not until the next PR.

@jorisvandenbossche
Copy link
Member

One comment, not on the actual content of this PR, but on the file structure: you created a new module pandas.period, but I don't think this should be exposed to the user? (as pd.Period or pd.tseries.period is still the entrypoint?) The top-level pandas namespace is already over-crowded I think)

I know for a lot of other cython/python modules this is the same (but I think we do this already to much).

Possible options: make it _period.pyx, move it away from toplevel, or keep it in a separate file but 'include' it in tslib.pyx (but I am not very good acquainted with the structure of pandas folder regarding to cython/c files).

@blbradley
Copy link
Contributor Author

@jorisvandenbossche Right now, tslib is a monstrosity. I tried using include "period.pyx" at first but ran into more problems than just creating another module and using import.

If you put period.pyx in pandas/src, Cython overwrites period.c in the codebase. So, I put it in top level for the moment.

Honestly, I put off making that choice until comments. I was confused about what to do since there are other files to move or rename too.

@blbradley
Copy link
Contributor Author

@jorisvandenbossche I should also mentioned that names for namespaces of pandas Cython modules are defined in setup.py and the actual pyx file used for that name is defined there.

I think pandas/src/_period.pyx is fine for the file location. Maybe name the module pandas.tslib.period if that can be done. If you agree, I'll try it.

@jreback
Copy link
Contributor

jreback commented Feb 9, 2015

pls run versus master to isolate these changes

@blbradley
Copy link
Contributor Author

I've been refactoring Period more today with success. Here's what I've got so far: 4a6c533c. I removed the tests for period_asfreq because it does not follow the behavior of Period.asfreq. Is that ok?

Let me know if I should push it to this PR or wait.

@jreback jreback added the Period Period data type label Feb 10, 2015
@@ -458,6 +455,11 @@ def pxd(name):
'sources': ['pandas/src/datetime/np_datetime.c',
'pandas/src/datetime/np_datetime_strings.c',
'pandas/src/period.c']},
period=dict(pyxfile='period',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use { here (just for consistency)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I need to make this change here or #9415 to wrap it up?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can close #9415 and just do it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't use this until this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I liked it better this way, and ext_data['parser'] is also like that. Do you want me to change that one too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wanted them all to be consistent, using {...} for dict is what I recall

@blbradley blbradley force-pushed the period-refactor-to-cython branch 3 times, most recently from ffdac9b to ba535d7 Compare February 10, 2015 20:53
@blbradley
Copy link
Contributor Author

@jreback Gist comments don't give notifications. What are some good setting for vbench num calls and repeats? I'm running it on a desktop Intel i7, 16GB RAM, and mirrored RAID and not running anything intensive during vbench (like other tests).

Maybe I should do some performance testing between commits. At the very least the first in the stream.

@jreback
Copy link
Contributor

jreback commented Feb 11, 2015

ok merged #9415

I just wanted you to be familiar with the perf testing....

try -n 5 (it should just be somewhat stable), e.g. a rerun of this should be roughy the same.

you can use -r 'period|timeseries' to make the runs shorter (this is a regex to select on the name of the benchmarks'. note there are very few period specific benchmarks. adding more of these would be good.

@blbradley
Copy link
Contributor Author

Just letting you know about lack of gist notifications. I thought they were a feature of GitHub too. I read your gist comment after other discussion. No worries.

I'm going to try booting with no GUI to running vbench. I think it's getting in the way. I'm going to do some vbench tests on particular commits early and late in the stream to determine where my problem is.

vbench documentation and searching about vbench hasn't shown me much. Thanks for the extra input. And baring with me! :bowtie:

@jreback
Copy link
Contributor

jreback commented Feb 11, 2015

vbench docs are here: https://github.com/pydata/pandas/wiki/Performance-Testing

additions always welcome to this

@blbradley
Copy link
Contributor Author

Sorry, I meant that they shown me much about guidelines for number of calls and repeats parameters. I'll gladly add some guidelines about that if we can start an issue and get some consensus (after I examine that myself 😃).

@blbradley
Copy link
Contributor Author

I can't get conclusive results on my main workstation. I always have one or more inconsistent outliers. Going to try some other machines and get back on this soon.

@blbradley
Copy link
Contributor Author

Running the test on server grade hardware seems to reduce the variance. Also, my issue may have partly been due to using pyenv which compiles python by default with CFLAGS='-O3'.

@jreback jreback added this to the 0.16.0 milestone Feb 16, 2015
@jreback
Copy link
Contributor

jreback commented Feb 16, 2015

ok, pls rebase and squash to a smaller number of commits.

pls let me know and we can get this in.

@blbradley
Copy link
Contributor Author

Ok I'll do that in a little bit. I can get the commit count down to three.

I have more work refactoring Period ready but will submit on another PR.

@jreback
Copy link
Contributor

jreback commented Feb 16, 2015

that's all fine

ping when revised and green

and add a release note referencing this PR number (you can put in performance section)

@blbradley blbradley changed the title Move Period class to Cython module PERF: Move Period class and related functions to Cython module Feb 16, 2015
@blbradley
Copy link
Contributor Author

@jreback can you review the name changes in the last commit? I used pandas._period, but I'm not sure if that's consistent with imports like import pandas.algos as _algos.

@jreback
Copy link
Contributor

jreback commented Feb 16, 2015

this all looks good. (e.g. see #5294) for how all things should be named (e.g. like you are doing is good, these imports should be _*).

@jreback
Copy link
Contributor

jreback commented Feb 16, 2015

pls rebase on top of this: #9500

(I am going to merge soon).

As the setup.py is failing on windows after a clean (the path comparison was wrong).

With this (you will change period.c -> period_helper.c in setup.py), then it is good. and we can merge yours.

@blbradley
Copy link
Contributor Author

Two successive vbench runs before next rebase: https://gist.github.com/blbradley/cfb6e46f439dd1b2efd4

Only thing to show there is no likely performance regressions.

@blbradley
Copy link
Contributor Author

Good to go.

ordinal = self.ordinal
else:
ordinal = self.ordinal + (nanos // offset_nanos)
return Period(ordinal=ordinal, freq=self.freq)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to put on your todo list, this code in _add_delta looks a bit suspicous with all of this nested if-then (its the same as the original), but take a look in the next PR

@jreback
Copy link
Contributor

jreback commented Feb 17, 2015

@blbradley looks good; except that Period needs to inherit from PandasObject (like the original).

@blbradley
Copy link
Contributor Author

I was trying to not subclass from PandasObject because it is in Python space. Timestamp does not inherit from PandasObject either.

@jreback
Copy link
Contributor

jreback commented Feb 17, 2015

@blbradley I know, and there are inconsitencies there as well.

Ok, on the list then is a MixIn in Cython that you can sub-class from (like StringMixIn) to pride the string/unicode display things that always are forgotton.

@jreback
Copy link
Contributor

jreback commented Feb 17, 2015

Just make up a list of all of these issues, post it and then you can close via PRs.

merging this.

jreback added a commit that referenced this pull request Feb 17, 2015
PERF: Move Period class and related functions to Cython module
@jreback jreback merged commit 25cb65a into pandas-dev:master Feb 17, 2015
@blbradley
Copy link
Contributor Author

@jreback thanks a bunch for your time! more coming soon. if any other issues arise from this, I'll try to help field them.

@blbradley blbradley deleted the period-refactor-to-cython branch February 17, 2015 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Period Period data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants