
WIP: add top-level eval function #4037

Closed
wants to merge 37 commits

2 participants

@cpcloud
Python for Data member
cpcloud commented Jun 26, 2013

closes #3393.

This is a ways from being mergeable, but I thought I would put up the PR for people to play with and to get some other eyes on it. Tire-kick at will. A consolidated list of things to be completed follows.

eval Roadmap

Documentation

  • enhancing performance section
  • eval docstring
  • Warn users that this is not a magic bullet

    • E.g., x = 3; y = 2; pd.eval('x // y', engine='python') is 1000 times slower than the same operation in actual Python.

Engines

  • python engine for testing (basically a restricted eval; should make sure nothing dangerous, a la eval('os.system("sudo rm -rf /")'), is possible here)
  • numexpr engine
  • pytables engine (#4155)

Should python be the default engine for small frames?

Functions

  • toplevel isexpr
  • move eval to top-level pandas

Error Handling

  • syntax error handling
  • name error handling

Alignment

  • Algorithm (see the sketch after this section)

    1. flatten the tree of terms
    2. resolve the variables, special casing only numpy arrays/scalars/single unary operations (probably room for more abstraction here)
    3. align by joining on the indices using the current broadcasting behavior
    4. update the original namespace with the newly aligned version of the underlying numpy array
    5. reconstruct the final object with the aforementioned joined indices and type
  • punt to 'python' engine when a Series and DataFrame both have DatetimeIndexes (I would rather not do this and just enforce the to-be-deprecated index-aligning-if-both-indices-are-datetimeindex behavior)

  • PeriodIndexes don't work (well, they won't pass my tests) because of a stack overflow bug when joining a DatetimeIndex and a PeriodIndex (that bug is slated for 0.13, but a design decision needs to be made)
  • alignment for NDFrame (is this really necessary?)
  • unary alignment
  • binary alignment
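
A rough, self-contained sketch of the alignment algorithm above (illustrative names only, not the PR code): join the indices of the pandas terms, reindex each term onto the joined index and hand the raw ndarrays to the engine, keeping the joined index around to reconstruct the result.

import pandas as pd

def align_terms(terms):
    """terms: dict mapping name -> Series/DataFrame/ndarray/scalar."""
    pandas_objs = [v for v in terms.values()
                   if isinstance(v, (pd.Series, pd.DataFrame))]
    if not pandas_objs:
        return terms, None
    # 3. align by joining the indices (outer join, like the current broadcasting)
    joined = pandas_objs[0].index
    for obj in pandas_objs[1:]:
        joined = joined.join(obj.index, how='outer')
    # 4. update the namespace with the aligned underlying numpy arrays
    aligned = {}
    for name, obj in terms.items():
        if isinstance(obj, (pd.Series, pd.DataFrame)):
            aligned[name] = obj.reindex(joined).values
        else:
            aligned[name] = obj
    # 5. the caller reconstructs the final object from `joined` and the expected type
    return aligned, joined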

Scope

  • deal with global scoping issues (not sure if this is tested well enough...which means probably not)
  • allow Expr objects to be passed to eval (Expr will pull in the local and global variables from the calling stack frame)

Miscellaneous

  • replace current evaluate function
  • Expr in NDFrame.__getitem__ cc @jreback
  • reconstruct the final object to return the expected output type, e.g., df + df2 should return a DataFrame if df and df2 are both DataFrame objects.
  • change Expression to Expr
  • add truediv keyword to Expr, default to True

Operators

  • %
  • **
  • unary ops don't play well when they are operands of binary ops

Tests

  • DatetimeIndex and PeriodIndex objects are untested because of a few join and alignment issues.
  • rewrite a few tests in the python evaluator to make sure the // operator works as it should (it works, but there may be edge cases that I haven't thought of yet)
  • //: numexpr only supports scalar evaluation here; variables are not allowed, so this will always eval in python space (half-assed it a bit in the tests here; need to fix that)
  • truediv keyword tests
  • add tests for nans
  • fix ** associativity issue in testing (see the note after this list)
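
A quick illustration of the ** associativity item above: Python's ** is right-associative, so the engine (and the tests) must group 2 ** 3 ** 2 as 2 ** (3 ** 2).

assert 2 ** 3 ** 2 == 512   # right-associative, what python does
assert (2 ** 3) ** 2 == 64  # what a left-associative rewrite would give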

Near Future

  • allow and, or and not aliases for the operators &, | and ~, respectively
  • attribute accessing e.g., df.A syntax
  • reductions (sum and prod)
  • math ops (sin, cos, tan, sqrt, etc.)

Far Future

  • non-python indexing (or just punt to python)
@cpcloud cpcloud commented on the diff Jun 26, 2013
pandas/core/common.py
+    return isinstance(obj, (basestring, np.str_, np.unicode_))
+
+
+def is_series(obj):
+    return isinstance(obj, pd.Series)
+
+
+def is_frame(obj):
+    return isinstance(obj, pd.DataFrame)
+
+
+def is_panel(obj):
+    return isinstance(obj, pd.Panel)
+
+
+def is_pd_obj(obj):
@cpcloud
Python for Data member
cpcloud added a note Jun 26, 2013

Note to self: these should most likely be removed, or should replace the scattered functions that are similar to these.

@cpcloud cpcloud was assigned Jun 26, 2013
@jreback
jreback commented Jun 26, 2013

quick pass: I didn't see a doc-string on eval (it's prob there...but I didn't delve too deep)

also a quick writeup for the enhancingperf section?

just a couple of examples

@cpcloud
Python for Data member
cpcloud commented Jun 28, 2013

@jreback

question:

  1. i think using eval under the hood for pytables evaluation should replace all of the various incarnations of expressions here...is that ok?

result being that only strings are allowed (which will be converted to Expr and passed to an evaluating function).

going to take much longer to finish this up to support all of those expression forms, but if it needs to be done that's ok too

@jreback
jreback commented Jun 28, 2013

for a string expression, that is exactly right (it's conceptually simpler, because only scalars can be on the rhs in any expression); however, we also need to allow something like this:

E('index>',Timestamp('20130101'))

where E replaces Term and is a top-level that you evaluate

unless you think that you can evaluate:

index>Timestamp('20130101') ?

I think that the current Term should be a sub-class of Expr that just spits out a numexpr evaluable string (and does not actually evaluate it)

@cpcloud
Python for Data member
cpcloud commented Jun 28, 2013

ok i think i'm starting to see the bigger picture here. convert_value and TermValue will be superseded by eval.

in this code in Selection.select

return self.table.table.readWhere(self.condition, start=self.start, stop=self.stop)

self.condition should be generated by engine.convert()

is that right?

@cpcloud
Python for Data member
cpcloud commented Jun 28, 2013

and queryables is basically an environment internal to the table that has, for example, the index part of index > 20130101

@jreback
jreback commented Jun 28, 2013

that's about right.

in Selection I am just using a list of Terms that are anded together

queryables is used for validation on the lhs of the expressions (it's analogous to looking up a variable in the enclosing/global scope, e.g. df>1: if df is not defined then you raise; same here)

and no alignment necessary, just some conversions on rhs terms

and self.condition is generated by engine.convert() correct

self.condition is equiv to this:

pd.eval('index>Timestamp("20130101")',queryables=queryables,engine='pytables')
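
Illustrative only (a plain dict here; the real store derives this from the table's axes and data columns): queryables amounts to a mapping of the names that are legal on the lhs, and validating the lhs is just a name lookup against it.

queryables = {'index': 'datetime64[ns]', 'columns': 'object', 'A': 'float64'}

def validate_lhs(name, queryables):
    # analogous to raising NameError for an undefined variable in eval
    if name not in queryables:
        raise NameError('{0!r} is not a queryable of this table'.format(name))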

@cpcloud
Python for Data member
cpcloud commented Jun 28, 2013

cool maybe this will be easier than i thought...to the pytables-mobile!

@jreback
jreback commented Jun 30, 2013

advertising!

http://stackoverflow.com/questions/17390886/how-to-speed-up-pandas-multilevel-dataframe-sum/17394064#17394064

might be a good benchmark.....and maybe add this 'example' to the EnhancingPerformance section?

@cpcloud
Python for Data member
cpcloud commented Jun 30, 2013

nice! thanks. i gave u +1

@cpcloud
Python for Data member
cpcloud commented Jul 4, 2013

just an update on the and/or/not deal

essentially:

what you want to write if u like python

pd.eval('not x and y and z or w')

if you want to be terse

pd.eval('~(x & y & z | w)')

what the engine actually sees (you could write this too, if you're practicing typing parens fast w/o looking!)

pd.eval('~((((x) & (y)) & (z)) | (w))')

and of course u can mix and match if you like partial clarity + terseness

what's nice is that the python precedence is preserved, e.g., not x & y rewrites to ~((x) & (y)) (not binds loosely, below &) whereas ~x & y rewrites to '(~(x)) & (y)' (~ binds tightly)

for reference: operator precedence

if u want i can give more details, but essentially the repr of each op is '({self.lhs}) {self.op} ({self.rhs})'.format(self=self) and this happens recursively.

the ast manip is to "rewrite" the ast.BoolOp nodes as their respective Bit* node (really i'm just saying, interpret this as a bitwise op), but the python ast takes care of precedence for u so the precedence semantics remain.

here was the trickiest part

def visit_BoolOp(self, node):
    op = self.visit(node.op)  # the rewritten bitwise op (a callable)

    def visitor(x, y):
        # operands may be raw ast nodes or already-reduced terms;
        # fall back to the raw value if it can't be visited
        try:
            lhs = self.visit(x)
        except TypeError:
            lhs = x

        try:
            rhs = self.visit(y)
        except TypeError:
            rhs = y

        return op(lhs, rhs)

    # ast.BoolOp is n-ary, so fold the operands pairwise left-to-right
    operands = node.values
    return reduce(visitor, operands)  # functools.reduce on python 3

the key part is the last two lines: because the python parser treats boolean expressions similarly to how lisp does (a single n-ary node), you have to reconstruct the (and x1 x2 ... xN) style op into (and (... (and (and x1 x2) ...) ...) xN)
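
To make the lisp analogy concrete, a small standalone demo (not the PR code) of why the fold is needed: the parser hands back one BoolOp node with N operands, and reduce rebuilds it as nested binary ops.

import ast
from functools import reduce  # builtin on python 2

tree = ast.parse('x and y and z', mode='eval')
boolop = tree.body
print(type(boolop).__name__, len(boolop.values))   # BoolOp 3

names = [n.id for n in boolop.values]               # ['x', 'y', 'z']
nested = reduce(lambda lhs, rhs: '({0} & {1})'.format(lhs, rhs), names)
print(nested)                                       # ((x & y) & z)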

@jreback
jreback commented Jul 4, 2013

I would have an engine option, maybe parsing=strict as the default.
if you want to support python booleans then you could pass parsing=extended or something like that.
I don't think we want to turn this on by default even though it looks pretty.

@cpcloud
Python for Data member
cpcloud commented Jul 4, 2013

ok. there's not really any automatic way to test that though, since and, or and not are not part of the operator module; i suppose python's eval could work though (for testing only)

@jreback
jreback commented Jul 4, 2013

I would put it on the back burner anyhow (i know u did it)
but it's too much of a change right now

@cpcloud
Python for Data member
cpcloud commented Jul 4, 2013

sounds good, not worth too much time.

the pytables changes are quite daunting. i know those are most important for this to be merged, still sifting through how it works...reading tests and trying to understand what the classes do...which ones are important etc...

plus the issues with dt and period indexes

i think after those two then this might warrant some serious tire kicking

@cpcloud
Python for Data member
cpcloud commented Jul 4, 2013

I think i'm going to allow callable parsing in the 'pytables' engine for now, as it will make the transition to using eval and friends under the hood much easier. I'll just subclass the current parser something like TableExprVisitor and overload the visit_Call method...
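
A self-contained illustration of the visit_Call idea (the class name and whitelist are made up for the example, not the PR's API): only whitelisted callables such as Timestamp are evaluated; everything else raises.

import ast
import pandas as pd

class CallWhitelistVisitor(ast.NodeVisitor):
    # hypothetical whitelist; the real engine would decide what is callable
    _allowed = {'Timestamp': pd.Timestamp}

    def visit_Call(self, node):
        if not isinstance(node.func, ast.Name) or node.func.id not in self._allowed:
            raise NotImplementedError('call to {0} is not allowed'.format(ast.dump(node.func)))
        args = [ast.literal_eval(arg) for arg in node.args]  # literal args only in this sketch
        return self._allowed[node.func.id](*args)

# CallWhitelistVisitor().visit(ast.parse("Timestamp('20130101')", mode='eval').body)
# -> Timestamp('2013-01-01 00:00:00')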

@cpcloud
Python for Data member
cpcloud commented Jul 4, 2013

i think in 0.14 (not a typo) that Term should be removed...deprecate in 0.13 in favor of what I'm calling TableExpr which is a subclass of Expr designed to work with HDFStore et al.

@cpcloud
Python for Data member
cpcloud commented Jul 4, 2013

@jreback should arbitrary expressions be allowed with 'pytables'?

using filter will be easy if only binary ops are allowed, but will be more complicated if arbitrary exprs are allowed

@cpcloud
Python for Data member
cpcloud commented Jul 4, 2013

nvm. should allow them.

@jreback
jreback commented Jul 4, 2013

whatever is parsable by ne should be allowed
in reality though only simple expressions are prob needed

cpcloud added some commits Jun 15, 2013
@cpcloud cpcloud ENH: add new computation module and toplevel eval function 89a03be
@cpcloud cpcloud ENH/TST: add new instance testing functions and their tests bcd17b0
@cpcloud cpcloud BUG: prevent certain index types from joining with DatetimeIndex 81bacd1
@cpcloud cpcloud TST/ENH: add 2d bare numpy array and nan support e380271
@cpcloud cpcloud ENH: add modulus support 99a3d28
@cpcloud cpcloud TST: add failing modulus tests 4db95fe
@cpcloud cpcloud CLN: use format string for unicode 6000c89
@cpcloud cpcloud CLN: remove engine detection and manip for datetimes c25a1d4
@cpcloud cpcloud CLN/ENH: add new interface to encapsulate Terms and Constants 1132bc4
@cpcloud cpcloud ENH: allow an already-parsed expression to be passed to eval 54f1897
@cpcloud cpcloud CLN: add automatic scope creating object e20900a
@cpcloud cpcloud CLN: make the environment an implementation detail 51d80f6
@cpcloud cpcloud DOC: add docstring to eval 038d79c
@cpcloud cpcloud CLN: cleanup pytables.py a bit 599cf32
@cpcloud cpcloud CLN: clean up engines ea769e6
@cpcloud cpcloud CLN: clean up eval and have the Scope instance auto create the scope if none exists ff78c08
@cpcloud cpcloud CLN: add six.string_types checking instead of basestring f9f7fd7
@cpcloud cpcloud TST: clean up some tests, add minor assertions where none existed 48eff13
@cpcloud cpcloud CLN: clean up frame.py a bit d87f027
@cpcloud cpcloud CLN: clean up pytables arguments a bit 5b58a08
@cpcloud cpcloud CLN: use shiny new string mixin to refactor repring 7482a27
@cpcloud cpcloud CLN: move align to its own file 0d40fe1
@cpcloud cpcloud CLN: clean up and use new stringmixin for Expr 87957d2
@cpcloud cpcloud ENH/CLN: be more careful about unicode e35cb5c
@cpcloud cpcloud CLN: run autopep8 on pandas/io/pytables.py 1ceec39
@cpcloud cpcloud DOC: reference future enhancingperf.eval section c665a85
@cpcloud cpcloud CLN/DOC: clean up docstrings in pytables cb27934
@cpcloud cpcloud CLN: actually pass fletcher32 in get_store 63ba37d
@cpcloud cpcloud CLN: remove unused variables dcde590
@cpcloud cpcloud CLN: more pep8 and get rid of most raise Exception clauses 3c4e2b3
@cpcloud cpcloud CLN: change NameError to match python 226c786
@cpcloud cpcloud API: expose the Expr object to top level pandas 79871d8
@cpcloud cpcloud CLN/TST: fail with a NotImplementedError on and or not 84fdb45
@cpcloud cpcloud CLN: generlize operator/expression printing 4d9f9a7
@cpcloud cpcloud CLN: clean up testing and expr a0d2ce0
@cpcloud
Python for Data member
cpcloud commented Jul 6, 2013

@jreback let me see if i have the right idea about how selections are computed

assume that we:

  • have an AppendableFrameTable instance
  • want the whole table (i.e., not iterating over chunks)

  1. HDFStore.select is called
  2. select calls AppendableFrameTable.read
  3. read calls read_axes which sets the selection attribute
  4. selection is then used in process_axes to make the selection using filters (if any)

i think my plan of attack will be:

  1. make a subclass of my Term class, maybe TableTerm which has the filter attribute and possibly a few other unexposed helper methods if necessary
  2. the Term class in pytables.py will become something like TableExpr (a subclass of Expr) which is a lightweight container for TableTerms (just like Expr)

I think I might be able to do this without touching any of the Storer code.
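
A structural sketch of that plan (the class names follow the description above; the bodies are made up, not the PR code): TableExpr is a lightweight container of TableTerms, just as Expr holds Terms, with the filter living on the term.

class TableTerm(object):
    """One 'lhs op rhs' piece of a pytables query, plus an optional filter."""
    def __init__(self, lhs, op, rhs, filter=None):
        self.lhs, self.op, self.rhs = lhs, op, rhs
        self.filter = filter  # applied after the on-disk selection, if any

class TableExpr(object):
    """Lightweight container of TableTerms (terms are implicitly and-ed)."""
    def __init__(self, terms):
        self.terms = list(terms)

    def condition(self):
        # a numexpr-style condition string for the on-disk selection (conversions elided)
        return ' & '.join('({0} {1} {2!r})'.format(t.lhs, t.op, t.rhs)
                          for t in self.terms)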

@jreback
jreback commented Jul 6, 2013

too complicated....going to futz around with your branch....give me a few

@cpcloud
Python for Data member
cpcloud commented Jul 6, 2013

ok. seems like there's a lot going on in pytables. it's not clear to me how to incorporate the filters into my current scheme since there's no state regarding the operator except its name, e.g., self.op = '!='. i suppose i could special case two classes for the filterable operators to hold a filter...

@jreback
jreback commented Jul 6, 2013

a filter is just an expression that we are going to evaluate after the results come back (e.g. it's a dimension that pytables doesn't select on; when you store a frame, the rows are stored, but the columns are stored as a single block), so selecting on the columns is basically just a reindex
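
A toy illustration of that point: the on-disk query returns rows, and a columns condition is then applied in memory as a reindex.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3), columns=list('ABC'))
rows = df                               # pretend these rows came back from the pytables selection
filtered = rows.reindex(columns=['A'])  # 'columns=A' applied afterwards as a reindex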

@jreback
jreback commented Jul 6, 2013

ok....if I have an expression like columns=A, how can I tell the node visitor (which is now a sub-class of your existing one) to really make this columns==A? (e.g. it interprets it as an Assign, but I want a different op). Aside from actually rewriting it, I basically want to map = and == to the same thing.

@cpcloud
Python for Data member
cpcloud commented Jul 6, 2013

i think this should work

def visit_Assign(self, node):
    cmpr = ast.Compare(ops=[ast.Eq()], left=self.visit(node.targets[0]),
                        comparators=self.visit(node.value))
    return self.visit(cmpr)
@cpcloud
Python for Data member
cpcloud commented Jul 6, 2013

you also need to remove the raise i put in to disallow statements

@cpcloud
Python for Data member
cpcloud commented Jul 6, 2013

or add ast.Assign to the instance check in visit_Module

@cpcloud
Python for Data member
cpcloud commented Jul 6, 2013

ok that won't work exactly yet, give me a sec and i'll make it work and push and put a link

@jreback
jreback commented Jul 6, 2013

don't worry....i think i got what i need for a min

@jreback
jreback commented Jul 6, 2013

actually....need to transform an Assign node to a Compare node...that would be helpful..?

@cpcloud
Python for Data member
cpcloud commented Jul 6, 2013

yep that's what i did at first but i don't think it's necessary...here's something that works

    def visit_Assign(self, node):
        cmpr = ast.Compare(ops=[ast.Eq()], left=node.targets[0],
                           comparators=[node.value])
        return self.visit(cmpr)
@cpcloud
Python for Data member
cpcloud commented Jul 6, 2013

this exposes a buglet. pd.eval('x==y') doesn't return a bool..

@jreback
jreback commented Jul 6, 2013

I am changing this only for a sub-class though (where I explicitly treat '=' and '==' as the same); there is no concept of '=', it's merely a short-hand

@cpcloud
Python for Data member
cpcloud commented Jul 6, 2013

sure...rewrite or not, i don't think it matters unless you want to reconstruct the python source from the ast

@cpcloud
Python for Data member
cpcloud commented Jul 6, 2013

return type is correct for frames...numpy arrays use a different scheme...fixing it now.

@cpcloud
Python for Data member
cpcloud commented Jul 6, 2013

ok must add return type field to ops...then can compute recursive union of return types to yield the final return type
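
A hedged sketch of what that recursive union could look like (the names and priority order here are assumptions, not the PR API): each op carries a return type, and a binary op's type is the most general of its operands' types.

import numpy as np
import pandas as pd

_PRIORITY = (pd.DataFrame, pd.Series, np.ndarray)  # most general first

def union_return_type(lhs_type, rhs_type):
    """Most general of the two operand return types."""
    for typ in _PRIORITY:
        if issubclass(lhs_type, typ) or issubclass(rhs_type, typ):
            return typ
    return np.ndarray  # scalars and everything else fall back to ndarray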

@jreback
jreback commented Jul 6, 2013

challenge for you.....how can I have this:

columns=A | columns=B

parse as

(columns==A) | (columns==B)

right now I am pre-parsing and substituting == for =

but then precedence is screwed up and the comparator has multiple equal ops (on which you raise currently)

@cpcloud
Python for Data member
cpcloud commented Jul 6, 2013

yeah...a little annoying that | binds tighter than ==, but if ur using them for bit manip then u might expect that...challenge accepted!

@cpcloud
Python for Data member
cpcloud commented Jul 6, 2013

nice, i think i got it

@cpcloud
Python for Data member
cpcloud commented Jul 6, 2013

this is one of your eval schemes to make everything a string!

@jreback
jreback commented Jul 6, 2013

fbeb99d

this doesn't work but getting closer.... I made some changes in the base classes. There might be an easier way to do these, or you might want to incorporate it for more generality...

@cpcloud
Python for Data member
cpcloud commented Jul 6, 2013

here's a better preparser

import re
import tokenize

from StringIO import StringIO  # io.StringIO on python 3


def _rewrite_assign(source):
    """Rewrite the assignment operator '=' as the comparison '=='."""
    res = []
    g = tokenize.generate_tokens(StringIO(source).readline)
    for toknum, tokval, _, _, _ in g:
        res.append((toknum, '==' if tokval == '=' else tokval))
    return tokenize.untokenize(res)


def _eat_whitespace(source):
    """Strip all whitespace from the expression."""
    return re.compile(r'\s*').sub('', source)


def _parenthesize_booleans(source, ops='|&'):
    """Wrap each |/& operand in parens so embedded comparisons stay grouped."""
    res = source
    for op in ops:
        terms = res.split(op)

        t = []
        for term in terms:
            t.append('({0})'.format(term))

        res = op.join(t)
    return res


def preparse(source):
    return _parenthesize_booleans(_eat_whitespace(_rewrite_assign(source)))
@cpcloud
Python for Data member
cpcloud commented Jul 6, 2013

passes the following smoke test

In [8]: x, y, z, w = randint(2, size=(10,4)).T

In [9]: all(pd.eval('x=y|w=z') == pd.eval('(x==y)|(w==z)'))
Out[9]: True
@cpcloud
Python for Data member
cpcloud commented Jul 6, 2013

thanks i'll incorporate it.

@jreback
jreback commented Jul 6, 2013

how is something like: columns=['A','B'] handled?

visit_List? or is this a visit_Expr? (I know it will go thru name resolution, that is fine); I ideally want to keep it as a list though

@cpcloud
Python for Data member
cpcloud commented Jul 6, 2013

that's not allowed right now. would be visit_Expr which would make a call into visit_List.

@cpcloud
Python for Data member
cpcloud commented Jul 6, 2013

numexpr can't handle lists

in theory python engine could

and it sounds like pytables engine must handle them
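
A tiny self-contained example (not the PR code) of a visit_List handler that turns an ast list literal, like the rhs of columns=['A','B'], back into a python list:

import ast

class ListVisitor(ast.NodeVisitor):
    def visit_List(self, node):
        # rebuild the literal list so a pytables-style filter can consume it
        return [ast.literal_eval(elt) for elt in node.elts]

# ListVisitor().visit(ast.parse("['A', 'B']", mode='eval').body) -> ['A', 'B']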

@cpcloud
Python for Data member
cpcloud commented Jul 7, 2013

@jreback i feel a bit silly asking this but how do i pull down ur changes?

@jreback
jreback commented Jul 7, 2013

@cpcloud i think you have to add me as a remote, then cherry-pick

i pseudo have it working....but maybe you can help with this.....

say I have a call like this:

Term('index=df.index[0:4]')...where Term is basically Expr

I can't seem to get the variable scope for df....(once I am in the object creation call, sys._getframe(2).f_locals is just the local stuff)

any idea?

@cpcloud
Python for Data member
cpcloud commented Jul 7, 2013

can u point me to a commit?

@cpcloud
Python for Data member
cpcloud commented Jul 7, 2013

oh same one

@jreback
jreback commented Jul 7, 2013

somewhat old .....will push again in a bit

@jreback
jreback commented Jul 7, 2013

6b67214

new one....still very WIP though!

@jreback
jreback commented Jul 7, 2013

figured out the scoping issue.....I AM capturing scope, but then I was throwing it away because I allow a 'loose' syntax..

e.g.

[ 'columns=A',Term('index=df.index[0:4]')]

is translated to:

columns=A & index=df.index[0:4] as a string. (which I can't deal with yet...but getting there)
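
For reference, a minimal sketch (not the PR implementation) of capturing the caller's scope the way Term/Expr need to here: walk up a fixed number of frames and merge globals under locals so a name like df resolves.

import sys

def capture_scope(level=2):
    frame = sys._getframe(level)
    try:
        scope = dict(frame.f_globals)
        scope.update(frame.f_locals)  # locals shadow globals
        return scope
    finally:
        del frame  # break the reference cycle with the frame object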

@cpcloud
Python for Data member
cpcloud commented Jul 7, 2013

wait where does that commit live? it's on master but not a branch but a tree?

how are you parsing the slice?

@cpcloud
Python for Data member
cpcloud commented Jul 7, 2013

@jreback i've added u as a remote but how can i cherry-pick that commit? it looks like it's on master, but where is it being pushed? is it detached or something?

@jreback
jreback commented Jul 7, 2013

it's on a separate branch of my master

try this: https://github.com/jreback/pandas/tree/eval

it's built on top of your branch (which at some point i will rebase; i am a few behind you)

eventually you can just grab some of my commits and add to your tree as you have the PR

not 100% sure how to do that, but since we can't push to each other I think that's the only way

though once you have a PR to pandas master I think I can push to that (I think this is a pseudo private branch....
if I do a PR I think my commits will be added after yours.....)

@jreback
jreback commented Jul 7, 2013

on #4151 now

@jreback
jreback commented Jul 7, 2013

of course going to fail all tests...but still a WIP

@cpcloud
Python for Data member
cpcloud commented Jul 7, 2013

i like how github pulse double counts my commits here....not :|

@jreback
jreback commented Jul 7, 2013

https://github.com/newville/asteval/blob/master/lib/asteval.py

nice ref for implementing the ast handlers

@cpcloud
Python for Data member
cpcloud commented Jul 7, 2013

nice. i think we can use slice, subscript, attribute access, numexpr supported functions, and possibly chained comparisons if we want. x if y else z is covered by where

would be nice to handle certain pandas functions like Timestamp, for pytables stuff.

@jreback
jreback commented Jul 7, 2013

all pretty straightforward

I have a new branch that does most of this

could move some into base class

@cpcloud
Python for Data member
cpcloud commented Jul 7, 2013

great! can u post a link to a branch?

@jreback
jreback commented Jul 7, 2013

already #4155 up, extension of your branch....all tests pass!

take a look. I think some of what I did should actually be in your branch (e.g. see my code clean up point).
Need a better validator on visit in ExprVisitor; in fact maybe create a base class which basically implements everything, then let a sub-class (for NumExpr and your stuff) turn on which ops it needs (and occasionally override some); I already inherit this class

This can also consolidate the initialization stuff of the binary/math ops as well

the other thing to note: I added an ops.Value which is very much like Constant but doesn't do name validation (as I can't know that names on the rhs are invalid, so I must accept them), but this includes stuff like a list of strings, or even an index

I didn't really check whether your tests break! more likely is that some 'invalid' ops would slip by

I also eliminated the call to generic_visitor, instead raising on an invalid op, not sure if that works for you or not

@jreback
jreback commented Jul 8, 2013

hmm..this is not showing my commits?

@cpcloud
Python for Data member
cpcloud commented Jul 8, 2013

this won't show your commits because i've changed the remote of my local branch here...

the remote for this pr is my origin, which i don't know what you call on ur box.
the new eval-3393 is on my upstream (your origin)

AFAIK i don't think you can track origin/eval-3393 and upstream/eval-3393 in the same branch, so i've chosen to just use upstream, e.g. if i make a change it won't show up in two places when i push...

one way might be: you could push to your origin/eval-3393 (my upstream), i could fetch from that, and you could fetch from this and i'll push to this; the problem is i think that creates a cycle in the commit graph which essentially breaks git

@jreback
jreback commented Jul 8, 2013

hmm...I thought that you were creating a master branch that is push/pullable from either of us and sort of independent?

maybe close that PR and just open another one from that branch (e.g. merge from eval-3393 to master) rather than from your branch to master?

@cpcloud
Python for Data member
cpcloud commented Jul 8, 2013

so i think we should close our respective PRs, i'll open a new one with upstream/eval-3393 and we push to the branch on pydata/pandas until we're finished and ready to merge? sound good?

@jreback
jreback commented Jul 8, 2013

yes.....had same thoughts (and I would basically copy your top-level lists and then I will do the same), for ToDo's and such

@cpcloud
Python for Data member
cpcloud commented Jul 8, 2013

right. a single PR to rule them all. i'll merge our TODO lists and other stuff in the initial text of the PR.

@jreback
jreback commented Jul 8, 2013

great....

@cpcloud
Python for Data member
cpcloud commented Jul 8, 2013

hopefully, this will scale down the bloated pulse report of my commits

@cpcloud cpcloud closed this Jul 8, 2013
@cpcloud cpcloud deleted the cpcloud:eval-3393 branch Feb 27, 2014