
ENH: add expression evaluation functionality via eval #4164

Merged: 16 commits, Sep 16, 2013

Conversation

@cpcloud (Member) commented Jul 8, 2013

closes #3393
closes #2560

cc @jreback

eval Roadmap

PyTables

allows natural syntax for queries, including in-line variables

some examples
"ts>=Timestamp('2012-02-01') & users=['a','b','c']"

['A > 0']

'(index>df.index[3] & index<=df.index[6]) | string="bar"'

dates=pd.date_range('20130101',periods=5)
'index>dates'
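
The natural syntax above carries over to the query method. A minimal sketch, assuming a hypothetical frame with a users column and a datetime index (names and values chosen for illustration, not from the PR):

```python
import pandas as pd

# Hypothetical frame mirroring the ts/users example above.
df = pd.DataFrame(
    {"users": ["a", "b", "c", "d", "e"], "A": [1, -2, 3, -4, 5]},
    index=pd.date_range("2012-01-30", periods=5),
)

# Natural query syntax; '@' names refer to in-line (local) Python variables.
cutoff = pd.Timestamp("2012-02-01")
out = df.query("index >= @cutoff and users in ['a', 'b', 'c']")
```

Only the rows on or after the cutoff whose users value is in the list survive the filter.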

Todos:

  • Release Notes
  • v0.13.txt examples (and link to docs)
  • update docs / need new examples
  • API changes (disallow dict, show deprecations for separated expressions)
  • tests for | and ~ operators
  • tests for invalid filter expressions
  • code cleanup; maybe create a new Expr base class to opt in/out of allowed operations
    (e.g., numexpr doesn't want to allow certain operations, but PyTables needs them,
    so perhaps define them in the base class for simplicity and just disallow them where
    needed, checked in visit, which right now doesn't allow a generic_visitor)
  • can ops.Value be equiv of ops.Constant?

query method

  • a few more doc examples
  • add a section in query docstring to show use of __getitem__ and then add to indexing.rst
  • query docstring

DataFrame.eval method

  • implement query in terms of this
  • tests

Documentation

  • RELEASE NOTES
  • v0.13.txt examples & link to docs
  • document in and not in
  • enhancing performance section
  • eval docstring
  • Warn users that this is not a magic bullet
    • E.g., x = 3; y = 2; pd.eval('x // y', engine='python') is many times slower than the same operation in actual Python.
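
A quick sketch of why this is no magic bullet for scalar work: the expression string has to be parsed and its names resolved on every call, so the python-engine path is far slower than the bare expression even though both return the same value.

```python
import pandas as pd

x, y = 3, 2
# Same answer as plain x // y, but with parsing and name-resolution
# overhead on every call -- many times slower for tiny scalar expressions.
result = pd.eval("x // y", engine="python")
```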

Engines

  • python engine for testing (basically a restricted eval; should make sure nothing dangerous a la eval('os.system("sudo rm -rf /")') is possible here)
  • numexpr engine

Parsers

  • python parser is currently not failing on complex bool ops involving subscripting like it should (potentially we could eval the compiled AST instead of dispatching to function calls)
  • preparse should handle subscripting
  • pandas parser should handle subscripting, i.e., df.query(...) == pd.eval('df[...]')
  • pandas parser
  • python parser
  • pytables engine and parser (internal)

Should python be the default engine for small frames?

Functions

  • toplevel isexpr
  • move eval to top-level pandas

Error Handling

  • syntax error handling
  • name error handling
  • ugly stack overflow when a name in an expression is not in the frame...an issue with variable resolution in expr.py; it should fail fast

Alignment

  • Algorithm
    1. flatten the tree of terms
    2. resolve the variables, special casing only numpy arrays/scalars/single unary operations (probably room for more abstraction here)
    3. align by joining on the indices using the current broadcasting behavior
    4. update the original namespace with the newly aligned version of the underlying numpy array
    5. reconstruct the final object with the aforementioned joined indices and type
  • unary alignment
  • binary alignment
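
The five steps above can be sketched with public pandas operations; this illustrates the idea only, not the internal implementation:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

joined = s1.index.union(s2.index)          # step 3: align by joining indices
a1 = s1.reindex(joined).to_numpy()         # step 4: aligned underlying arrays
a2 = s2.reindex(joined).to_numpy()
result = pd.Series(a1 + a2, index=joined)  # step 5: reconstruct final object
```

Rows present in only one operand come out as NaN, matching normal pandas alignment (s1 + s2).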

Scope

  • add @a syntax for referencing local variables with the same name as a column
  • deal with global scoping issues (not sure if this is tested well enough...which means probably not)
  • allow Expr objects to be passed to eval (Expr will pull in the local and global variables from the calling stack frame)
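
The @ syntax in the first bullet disambiguates a local variable from a column of the same name. A minimal sketch (names hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
a = 2  # local variable shadowing the column name 'a'

# Bare 'a' resolves to the column; '@a' resolves to the Python local.
out = df.query("a > @a")
```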

Miscellaneous

  • support for MultiIndex queries
  • remove instance testing methods once series ndframe is merged
  • str ops (implemented using isin)
  • basic Timestamp and datetime functionality
  • vbench!
  • single variable or constant evaluation e.g., pd.eval('1') or pd.eval('df')
  • reconstruct the final object to return the expected output type, e.g., df + df2 should return a DataFrame if df and df2 are both DataFrame objects.
  • change Expression to Expr
  • add truediv keyword to Expr, default to True
  • allow and, or and not aliases for the operators &, | and ~, respectively
  • chained comparisons
  • attribute access syntax e.g., df.A
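
Two of the items above, chained comparisons and the and/or/not aliases, in one small sketch (frame contents arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [4, 3, 2, 1]})
# a chained comparison plus the 'and'/'not' aliases for '&'/'~'
out = df.query("1 < A <= 3 and not B == 2")
```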

Operators

  • %
  • **
  • unary ops don't play well when they are operands of binary ops
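
A small sketch of the % and ** operators via DataFrame.eval (values chosen arbitrarily):

```python
import pandas as pd

df = pd.DataFrame({"a": [2, 3], "b": [5, 7]})
out = df.eval("b % a + a ** 2")  # modulo and power in one expression
```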

Tests

  • PeriodIndex objects are untested at large (a few scattered tests for basic things)
  • add tests for strange column names e.g., df['@awesome_domain'], df['df["$10"]']
  • Python 3 handles complex scalars differently, so test that
  • revert equality testing in align.py once BUG: fix period index object instantiation when joining with self #4379 is merged
  • boolean ops fail with a DataFrame with a PeriodIndex in the columns
  • test alignment of DatetimeIndex objects
  • rewrite a few tests in the python evaluator to make sure the // operator works as it should (it works, but there may be edge cases that I haven't thought of yet)
  • // numexpr only supports scalar evaluation here: variables are not allowed, so this will always eval in Python space (the tests here cut corners a bit; need to fix that)
  • truediv keyword tests
  • add tests for nans
  • fix ** associativity issue in testing

Syntax (Easier)

  • handle in and not in (I was thinking of implementing these as a in b -> b.isin(a) and a not in b -> not (a in b) -> ~b.isin(a))
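
The proposed desugaring can be checked against an explicit isin filter. A sketch, assuming the pandas parser:

```python
import pandas as pd

df = pd.DataFrame({"b": [1, 2, 3, 4]})
# 'in' reduces to an isin call on the column operand,
# and 'not in' to its negation, per the bullet above.
in_result = df.query("b in [1, 2]")
not_in_result = df.query("b not in [1, 2]")
```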

Syntax (More difficult)

  • reductions (sum and prod)
  • math ops (sin, cos, tan, sqrt, etc.), i.e., whatever numexpr supports
  • function calls (we're not building a Python interpreter here 😉)

Optimizations

There's quite a bit of recursion going on here; something stack-based might be more efficient. For example, an expression like df['f in g and a < b < (c ** 2 + b ** 2 - a ** 3) and d in b not in e'] will make 3 recursive calls to eval (not including the top-level call): one for the first in expression and two for the in/not in chain. A stack-based implementation of the parser (instead of recursively visiting nodes) could possibly shave off a few milliseconds. Not sure if this is worth it, though.

Others

@jreback (Contributor) commented Jul 8, 2013

ok....was able to pull/push just fine now....

maybe the original rebase caused me to look at something different?

@cpcloud (Member, Author) commented Jul 8, 2013

ok so no rebasing until later...agreed?

@jreback (Contributor) commented Jul 8, 2013

[sheep-jreback-~/pandas] git remote show origin
* remote origin
  Fetch URL: git://github.com/pydata/pandas.git
  Push  URL: git://github.com/pydata/pandas.git
  HEAD branch: master
  Remote branches:
    PR_json-normalize-proto                    tracked
    add-msgpack                                tracked
    eval-3393                                  tracked
    index-not-ndarray                          tracked
    master                                     tracked
    range-index                                tracked
    refs/remotes/origin/PR-test-3503           stale (use 'git remote prune' to remove)
    refs/remotes/origin/applymap               stale (use 'git remote prune' to remove)
    refs/remotes/origin/c-parser               stale (use 'git remote prune' to remove)
    refs/remotes/origin/cython-refactor        stale (use 'git remote prune' to remove)
    refs/remotes/origin/datetime64-inference   stale (use 'git remote prune' to remove)
    refs/remotes/origin/eval3                  stale (use 'git remote prune' to remove)
    refs/remotes/origin/eval4                  stale (use 'git remote prune' to remove)
    refs/remotes/origin/groupby-apply-refactor stale (use 'git remote prune' to remove)
    refs/remotes/origin/large-excel-files      stale (use 'git remote prune' to remove)
    refs/remotes/origin/maintenance/0.7.x      stale (use 'git remote prune' to remove)
    refs/remotes/origin/maintenance/0.9.x      stale (use 'git remote prune' to remove)
    refs/remotes/origin/meta                   stale (use 'git remote prune' to remove)
    refs/remotes/origin/modes                  stale (use 'git remote prune' to remove)
    refs/remotes/origin/nanoseconds            stale (use 'git remote prune' to remove)
    refs/remotes/origin/period-unit-change     stale (use 'git remote prune' to remove)
    refs/remotes/origin/string-methods         stale (use 'git remote prune' to remove)
    refs/remotes/origin/timezone-fixes         stale (use 'git remote prune' to remove)
    refs/remotes/origin/tseries-plotting       stale (use 'git remote prune' to remove)
    refs/remotes/origin/unicode-fix            stale (use 'git remote prune' to remove)
    refs/remotes/origin/unsafe-assign-dtype    stale (use 'git remote prune' to remove)
  Local branches configured for 'git pull':
    eval-3393 rebases onto remote eval-3393
    master    rebases onto remote master
  Local refs configured for 'git push':
    eval-3393 pushes to eval-3393 (up to date)
    master    pushes to master    (up to date)

this eval-3393 push to eval-3393 is good (was not there previously).

maybe having a different local name was a problem somehow

@jreback (Contributor) commented Jul 8, 2013

why don't you try rebasing those last 2 commits into one?

@jreback (Contributor) commented Jul 8, 2013

see what happens

@cpcloud (Member, Author) commented Jul 8, 2013

ok let's see...

@jreback (Contributor) commented Jul 8, 2013

i picked up the change.....

so I guess as we don't rebase at the same time or while one is working, maybe ok

@cpcloud (Member, Author) commented Jul 8, 2013

going to be hard to coordinate that though

@jreback (Contributor) commented Jul 8, 2013

in any event....let's just push for now...can fix commits later

you are welcome to create BaseExprVisitor or somesuch if you want

@jreback (Contributor) commented Jul 8, 2013

not as long as we push/pull.....just don't rebase

@cpcloud (Member, Author) commented Jul 8, 2013

right yeah that's what i meant

@cpcloud (Member, Author) commented Jul 8, 2013

should the numexpr engine allow the = syntax for equality? I would tend toward allowing the smallest possible superset of Python syntax: nothing extra, except maybe and/or/not later, and changing the precedence of the bitwise operators...so I would only allow = to replace == in pytables expressions

@jreback (Contributor) commented Jul 8, 2013

i would make = raise in numexpr; maybe someday we can implement it, but for now it should be an error, I think

@cpcloud (Member, Author) commented Jul 9, 2013

got a query method going and df['b + c < a + b']-style indexing working, plus a refactor of the base visitor class

some tests are throwing a NameError from the expression in getitem; I need to figure out why rather than just catching it and passing through to the usual functionality (which works, but I need to know why)

@cpcloud (Member, Author) commented Jul 9, 2013

hm, maybe query for now and indexing later. it seems to break a ton of stuff; I haven't pinned down yet what the rules should be for indexing with something that could be an expression

@cpcloud (Member, Author) commented Jul 9, 2013

actually, only a few tests were broken by indexing; my refactor seems to have broken the other things

@jreback (Contributor) commented Jul 9, 2013

ok.....lmk when I should take a look....I am pretty much done for now....(maybe will start on docs soon)

@cpcloud (Member, Author) commented Jul 9, 2013

thanks for the pytables stuff!

@jreback (Contributor) commented Jul 9, 2013

it was built on a nice base!

@cpcloud (Member, Author) commented Jul 9, 2013

ah shucks, thanks :)

@cpcloud (Member, Author) commented Jul 9, 2013

i will support chained comparisons--they aren't a big deal to add...although the inner terms will be evaluated twice by numexpr.

e.g., a < b < c -> a < b & b < c

more generally

x1 < x2 < ... < xN -> x1 < x2 & x2 < x3 & ... & xN-2 < xN-1 & xN-1 < xN

so the N - 2 inner terms will each be evaluated twice, i.e., 2 * (N - 2) inner-term evaluations in total

something like

reducer = lambda x, y: BinOp(op=BitAnd, operands=[Compare(op=node.op, comparators=[x, y])])
reduce(reducer, node.comparators)
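
A rough standalone version of that reducer idea, using the stdlib ast module (Python 3.9+ for ast.unparse); this is an illustration, not pandas' internal code, and the names are mine:

```python
import ast

# Map a few comparison node types to their operator symbols.
_SYM = {ast.Lt: "<", ast.LtE: "<=", ast.Gt: ">", ast.GtE: ">=", ast.Eq: "=="}

def expand_chain(expr: str) -> str:
    """Rewrite a chained comparison a < b < c into (a < b) & (b < c)."""
    node = ast.parse(expr, mode="eval").body
    if not isinstance(node, ast.Compare):
        return expr  # nothing to expand
    operands = [node.left] + node.comparators
    return " & ".join(
        f"({ast.unparse(left)} {_SYM[type(op)]} {ast.unparse(right)})"
        for left, op, right in zip(operands, node.ops, operands[1:])
    )
```

Note how each inner operand appears in two of the generated pairs, which is exactly the double evaluation discussed above.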

@cpcloud (Member, Author) commented Jul 9, 2013

i wish github had latex

@jreback (Contributor) commented Jul 9, 2013

are you doing this transform directly?

I don't think numexpr will mind the extras; it'll probably optimize them away anyhow

@cpcloud (Member, Author) commented Jul 9, 2013

yes, I would do it in the base class

essentially build a new node and then visit it, similar in idea to assignment.

compare nodes will be passed over N + 2 * (N - 2) + 2 times, I believe (where N is the number of operands in the chain): there's a loop over each comparator in the "real" chained comparison parse (that's the first N); then, when the compares are parsed again into a BinOp tree, the first and last operands are evaluated once each (that's the 2) and each of the N - 2 "inner" operands is evaluated twice. I think that's right...

that shouldn't be a big deal for parsing unless you have a ridiculously large chained comparison, like 1000 compares or something.

@jreback (Contributor) commented Jul 9, 2013

if its easy ok

otherwise u can just detect it for now and disallow

@cpcloud (Member, Author) commented Jul 9, 2013

@jreback just added the query method, could probably use some more tests

i think indexing should be off limits for now because of date strings: df['1/1/2001'] could be division or a date-string index, unless you can think of a way to get around that...

@jreback (Contributor) commented Jul 9, 2013

I agree

and I like query / eval better

@cpcloud (Member, Author) commented Jul 9, 2013

@jreback resolution is now there. i opted for a dict-like object to be passed around and have its get method called at name resolution time. that way any object implementing a dict-like interface can be used, which is consistent with how Python itself does name resolution
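
For illustration, pd.eval can be handed an explicit dict-like namespace via its local_dict argument instead of pulling names from the calling frame; a minimal sketch (not the PR's internal resolution code):

```python
import pandas as pd

# Supply the name 'x' from an explicit mapping rather than the calling scope.
value = pd.eval("x + 1", local_dict={"x": 41}, engine="python")
```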

@cpcloud (Member, Author) commented Jul 9, 2013

@jreback can any of ur list above be checked off?

@ghost assigned cpcloud Jul 9, 2013
@cpcloud (Member, Author) commented Sep 15, 2013

both good ideas!

@dragoljub commented

@cpcloud

How would you even filter on any other column than c if you just pass a list of values?

I don't understand what you're asking.

I'm saying that df['[1,2] in c'] must filter on col c since there is no other column passed. However, it does not follow the pattern of df['a in b'] which filters on col a, the first object before the 'in'.

@wesm (Member) commented Sep 15, 2013

Hey guys,

Sorry it's taken me so long to get to reviewing this. This seems like a really good start on adding this kind of functionality to the library. I feel like we should treat the actual query syntax as experimental in 0.13.x while we solicit some more feedback and see how it plays out in actual use.

I have to say though, I'm kind of 👎 on pushing expression evaluation into DataFrame.__getitem__. It seems like a bit too much magic, but also something that can be added but not removed. I'd maybe relegate this functionality to an eval or query method for the time being and see how things go. You're saving yourself so much work in many cases by writing the expression that doing a method call doesn't seem that onerous.

I'm pretty stretched thin on taking this out for a serious spin but I'm fine with this being merged in and roped off as experimental until we have some time to see what the implications of various design decisions are. Feel free to disagree of course but I think starting a dialogue with the community about this sort of thing in general is a good idea (we've been talking about it for ages...glad you got around to actually doing it)

@cpcloud (Member, Author) commented Sep 16, 2013

@dragoljub

I'm saying that df['[1,2] in c'] must filter on col c since there is no other column passed. However, it does not follow the pattern of df['a in b'] which filters on col a, the first object before the 'in'.

Yes. That's correct. I don't see why this is an issue. Why would df.query('[1, 2] in c') filter on anything but column c?

@dragoljub commented

@cpcloud

Yes. That's correct. I don't see why this is an issue.

Not an issue; possibly confusing when comparing what col is filtered on with different usages of in.

Ex:
df['[1, 2] in c'] ==> Filter on col c (the object after 'in').
vs.
df['a in b'] ==> Filter on col a (the object before 'in').
vs.
df['c in [1, 2]'] ==> Still have to filter on col c? (the object before 'in').

@cpcloud (Member, Author) commented Sep 16, 2013

@dragoljub

Just an FYI, as per @wesm's suggestion I've removed the df[expression] syntax.

Right now df.query('[1, 2] in c') will do the same thing as df.query('c in [1, 2]') (tuples too). These semantics are subject to change and of course I'm open to suggestions.

Otherwise, things work as if you had called left_side_of_in.isin(right_side_of_in).

@dragoljub commented

@cpcloud

Sounds good. I think the query method will make for much easier reading. 👍

cpcloud added a commit that referenced this pull request Sep 16, 2013
ENH: add expression evaluation functionality via eval
@cpcloud merged commit bf1b8ca into master Sep 16, 2013
@jreback (Contributor) commented Sep 16, 2013

thanks to @cpcloud for implementing much of this. When it was started, this was a tiny little function!

@jreback (Contributor) commented Sep 16, 2013

475 comments! new record! woo-hoo

@jtratner (Contributor) commented

Awesome - great work @cpcloud :)

Successfully merging this pull request may close these issues.

ENH: eval function
Adding lambda support inside of __getitem__ for DataFrame, Series, .. etc.
8 participants