WIP: add top-level eval function #4037

Closed
wants to merge 37 commits

2 participants

@cpcloud
Collaborator

closes #3393.

This is a ways from being mergeable, but I thought I would put up the PR for people to play with and to get some other eyes on it. Tire-kick at will. A consolidated list of things to be completed follows.

eval Roadmap

Documentation

  • enhancing performance section
  • eval docstring
  • Warn users that this is not a magic bullet

    • E.g., x = 3; y = 2; pd.eval('x // y', engine='python') is 1000 times slower than the same operation in actual Python.

Engines

  • python engine for testing (basically a restricted eval; should make sure nothing dangerous a la eval('os.system("sudo rm -rf /")') is possible here)
  • numexpr engine
  • pytables engine (#4155)

Should python be the default engine for small frames?
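
A minimal sketch of the engine keyword described above, assuming the eventual top-level pd.eval signature; this is the exact expression from the performance warning:

```python
import pandas as pd

x, y = 3, 2
# correct, but far slower than plain `x // y` in Python, per the warning above
result = pd.eval('x // y', engine='python')
assert result == 1
```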

Functions

  • toplevel isexpr
  • move eval to top-level pandas

Error Handling

  • syntax error handling
  • name error handling

Alignment

  • Algorithm

    1. flatten the tree of terms
    2. resolve the variables, special casing only numpy arrays/scalars/single unary operations (probably room for more abstraction here)
    3. align by joining on the indices using the current broadcasting behavior
    4. update the original namespace with the newly aligned version of the underlying numpy array
    5. reconstruct the final object with the aforementioned joined indices and type
  • punt to 'python' engine when a Series and DataFrame both have DatetimeIndexes (I would rather not do this and just enforce the to-be-deprecated index-aligning-if-both-indices-are-datetimeindex behavior)

  • PeriodIndexes don't work (well, they won't pass my tests) because of a stack overflow bug when joining a DatetimeIndex and a PeriodIndex (that bug is slated for 0.13, but a design decision needs to be made)
  • alignment for NDFrame (is this really necessary?)
  • unary alignment
  • binary alignment
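
The five alignment steps above can be sketched with plain pandas operations; this is an illustration of the algorithm, not the PR's implementation:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=list('abc'))
s2 = pd.Series([10.0, 20.0, 30.0], index=list('bcd'))

# step 3: align by joining on the indices
joined = s1.index.join(s2.index, how='outer')
# step 4: the namespace now holds the aligned underlying numpy arrays
a1 = s1.reindex(joined).values
a2 = s2.reindex(joined).values
# step 5: reconstruct the final object with the joined index and type
out = pd.Series(a1 + a2, index=joined)

assert out.equals(s1 + s2)  # matches pandas' normal aligned addition
```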

Scope

  • deal with global scoping issues (not sure if this is tested well enough...which means probably not)
  • allow Expr objects to be passed to eval (Expr will pull in the local and global variables from the calling stack frame)

Miscellaneous

  • replace current evaluate function
  • Expr in NDFrame.__getitem__ cc @jreback
  • reconstruct the final object to return the expected output type, e.g., df + df2 should return a DataFrame if df and df2 are both DataFrame objects.
  • change Expression to Expr
  • add truediv keyword to Expr, default to True

Operators

  • %
  • **
  • unary ops don't play well when they are operands of binary ops

Tests

  • DatetimeIndex and PeriodIndex objects are untested because of a few join and alignment issues.
  • rewrite a few tests in the python evaluator to make sure the // operator works as it should (it works, but there may be edge cases that I haven't thought of yet)
  • // numexpr only supports scalar evaluation here: variables are not allowed, so this will always eval in python space (half-assed it a bit in the tests here; need to fix that)
  • truediv keyword tests
  • add tests for nans
  • fix ** associativity issue in testing

Near Future

  • allow and, or and not aliases for the operators &, | and ~, respectively
  • attribute accessing e.g., df.A syntax
  • reductions (sum and prod)
  • math ops (sin, cos, tan, sqrt, etc.)

Far Future

  • non-python indexing (or just punt to python)
@cpcloud cpcloud commented on the diff
pandas/core/common.py
((5 lines not shown))
+ return isinstance(obj, (basestring, np.str_, np.unicode_))
+
+
+def is_series(obj):
+ return isinstance(obj, pd.Series)
+
+
+def is_frame(obj):
+ return isinstance(obj, pd.DataFrame)
+
+
+def is_panel(obj):
+ return isinstance(obj, pd.Panel)
+
+
+def is_pd_obj(obj):
@cpcloud Collaborator
cpcloud added a note

note to self these should most likely be removed or should replace scattered functions that are similar to these

@cpcloud cpcloud was assigned
@jreback
Owner

quick pass I didn't see a doc-string on eval (its prob there...but I didn't delve too deep)

also a quick writeup for enhancedperf section?

just a couple of examples

@cpcloud
Collaborator

@jreback

question:

  1. i think using eval under the hood for pytables evaluation should replace all of the various incarnations of expressions here...is that ok?

result being that only strings are allowed (which will be converted to Expr and passed to an evaluating function).

going to take much longer to finish this up to support all of those expression forms, but if it needs to be done that's ok too

@jreback
Owner

for a string expression, that is exactly right (it's conceptually simpler, because only scalars can be on the rhs in any expression); however, we also need to allow something like this:

E('index>',Timestamp('20130101'))

where E replaces Term and is a top-level function that you evaluate

unless you think that you can evaluate:

index>Timestamp('20130101') ?

I think that the current Term should be a sub-class of Expr that just spits out a numexpr evaluable string (and does not actually evaluate it)

@cpcloud
Collaborator

ok i think i'm starting to see the bigger picture here. convert_value and TermValue will be superseded by eval.

in this code in Selection.select

return self.table.table.readWhere(self.condition, start=self.start, stop=self.stop)

self.condition should be generated by engine.convert()

is that right?

@cpcloud
Collaborator

and queryables is basically an environment internal to the table that has for example the index part of index > 20130101

@jreback
Owner

that's about right.

in Selection I am just using a list of Terms that are anded together

queryables is used for validation on the lhs of the expressions (it's analogous to looking up a variable in the enclosing/global scope, e.g. df>1: if df is not defined then you raise; same here)

and no alignment necessary, just some conversions on rhs terms

and self.condition is generated by engine.convert() correct

self.condition is equiv to this:

pd.eval('index>Timestamp("20130101")',queryables=queryables,engine='pytables')

@cpcloud
Collaborator

cool maybe this will be easier than i thought...to the pytables-mobile!

@jreback
Owner

advertising!

http://stackoverflow.com/questions/17390886/how-to-speed-up-pandas-multilevel-dataframe-sum/17394064#17394064

might be a good benchmark.....and maybe add this 'example' to the EnhancingPerformance section?

@cpcloud
Collaborator

nice! thanks. i gave u +1

@cpcloud
Collaborator

just an update on the and or not deal

essentially:

what you want to write if u like python

pd.eval('not x and y and z or w')

if you want to be terse

pd.eval('~(x & y & z | w)')

what the engine actually sees (you could write this too, if you're practicing typing parens fast w/o looking!)

pd.eval('~((((x) & (y)) & (z)) | (w))')

and of course u can mix and match if you like partial clarity + terseness

what's nice is that the python precedence is preserved, e.g., not x and y rewrites to ~((x) & (y)) (not binds loosely) whereas ~x and y rewrites to '(~(x)) & (y)' (~ binds tightly)

for reference: operator precedence

if u want i can give more details, but essentially the repr of each op is '({self.lhs}) {self.op} ({self.rhs})'.format(self=self) and this happens recursively.

the ast manip is to "rewrite" the ast.BoolOp nodes as their respective Bit* node (really i'm just saying, interpret this as a bitwise op), but the python ast takes care of precedence for u so the precedence semantics remain.

here was the trickiest part

def visit_BoolOp(self, node):
    op = self.visit(node.op)

    def visitor(x, y):
        # an operand is either a raw ast node (first pass) or an
        # already-combined term from a previous fold step
        try:
            lhs = self.visit(x)
        except TypeError:
            lhs = x

        try:
            rhs = self.visit(y)
        except TypeError:
            rhs = y

        return op(lhs, rhs)

    # fold the flat n-ary BoolOp into nested binary ops, left to right
    # (reduce lives in functools on Python 3)
    operands = node.values
    return reduce(visitor, operands)

the key part is the last two lines: because the python parser flattens boolean expressions the way lisp does, you have to reconstruct the (and x1 x2 ... xN) style op into (and (... (and (and x1 x2) ...) ...) xN)
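
A self-contained sketch of the rewrite-plus-fold idea, using a plain ast.NodeTransformer instead of the PR's visitor classes:

```python
import ast

class BoolToBit(ast.NodeTransformer):
    """Rewrite `and`/`or` BoolOps into left-nested `&`/`|` BinOps."""
    def visit_BoolOp(self, node):
        self.generic_visit(node)  # rewrite any nested BoolOps first
        op = ast.BitAnd() if isinstance(node.op, ast.And) else ast.BitOr()
        # fold the flat (and x1 x2 ... xN) node into ((x1 & x2) & ...) & xN
        result = node.values[0]
        for value in node.values[1:]:
            result = ast.BinOp(left=result, op=op, right=value)
        return result

tree = ast.parse('x and y and z', mode='eval')
tree = ast.fix_missing_locations(BoolToBit().visit(tree))
code = compile(tree, '<expr>', 'eval')
assert eval(code, {'x': 1, 'y': 1, 'z': 0}) == 0  # 1 & 1 & 0
```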

@jreback
Owner

I would have an engine option, maybe parsing=strict as the default
if you want to support python booleans then could pass parsing=extended or something like that
I don't think we want to turn this on by default even though it looks pretty -

@cpcloud
Collaborator

ok. there's not really any automatic way to test that though, since and, or and not are not part of the operator module; i suppose python's eval could work though (for testing only)

@jreback
Owner

I would put it on back burner anyhow (i know u did it)
but too much of a change right now

@cpcloud
Collaborator

sounds good, not worth too much time.

the pytables changes are quite daunting. i know those are most important for this to be merged, still sifting through how it works...reading tests and trying to understand what the classes do...which ones are important etc...

plus the issues with dt and period indexes

i think after those two then this might warrant some serious tire kicking

@cpcloud
Collaborator

I think i'm going to allow callable parsing in the 'pytables' engine for now, as it will make the transition to using eval and friends under the hood much easier. I'll just subclass the current parser something like TableExprVisitor and overload the visit_Call method...
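
A minimal sketch of what an overloaded visit_Call might look like; the class name and whitelist are illustrative, not the PR's code:

```python
import ast

class CallWhitelistVisitor(ast.NodeVisitor):
    # hypothetical whitelist; the real TableExprVisitor would decide differently
    ALLOWED = frozenset(['Timestamp'])

    def visit_Call(self, node):
        # only plain-name calls on the whitelist are accepted
        name = getattr(node.func, 'id', None)
        if name not in self.ALLOWED:
            raise NotImplementedError('calls to %r are not supported' % name)
        self.generic_visit(node)

# passes silently: Timestamp(...) is whitelisted
CallWhitelistVisitor().visit(ast.parse("index > Timestamp('20130101')"))
```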

@cpcloud
Collaborator

i think in 0.14 (not a typo) that Term should be removed...deprecate in 0.13 in favor of what I'm calling TableExpr which is a subclass of Expr designed to work with HDFStore et al.

@cpcloud
Collaborator

@jreback should arbitrary expressions be allowed with 'pytables'?

using filter will be easy if only binary ops are allowed, but will be more complicated if arbitrary exprs are allowed

@cpcloud
Collaborator

nvm. should allow them.

@jreback
Owner

whatever is parsable by ne should be allowed
in reality though only simple expressions are prob needed

cpcloud added some commits
@cpcloud cpcloud ENH: add new computation module and toplevel eval function
89a03be
@cpcloud cpcloud ENH/TST: add new instance testing functions and their tests
bcd17b0
@cpcloud cpcloud BUG: prevent certain index types from joining with DatetimeIndex
81bacd1
@cpcloud cpcloud TST/ENH: add 2d bare numpy array and nan support
e380271
@cpcloud cpcloud ENH: add modulus support
99a3d28
@cpcloud cpcloud TST: add failing modulus tests
4db95fe
@cpcloud cpcloud CLN: use format string for unicode
6000c89
@cpcloud cpcloud CLN: remove engine detection and manip for datetimes
c25a1d4
@cpcloud cpcloud CLN/ENH: add new interface to encapsulate Terms and Constants
1132bc4
@cpcloud cpcloud ENH: allow an already-parsed expression to be passed to eval
54f1897
@cpcloud cpcloud CLN: add automatic scope creating object
e20900a
@cpcloud cpcloud CLN: make the environment an implementation detail
51d80f6
@cpcloud cpcloud DOC: add docstring to eval
038d79c
@cpcloud cpcloud CLN: cleanup pytables.py a bit
599cf32
@cpcloud cpcloud CLN: clean up engines
ea769e6
@cpcloud cpcloud CLN: clean up eval and have the Scope instance auto create the scope …
…if none exists
ff78c08
@cpcloud cpcloud CLN: add six.string_types checking instead of basestring
f9f7fd7
@cpcloud cpcloud TST: clean up some tests, add minor assertions where none existed
48eff13
@cpcloud cpcloud CLN: clean up frame.py a bit
d87f027
@cpcloud cpcloud CLN: clean up pytables arguments a bit
5b58a08
@cpcloud cpcloud CLN: use shiny new string mixin to refactor repring
7482a27
@cpcloud cpcloud CLN: move align to its own file
0d40fe1
@cpcloud cpcloud CLN: clean up and use new stringmixin for Expr
87957d2
@cpcloud cpcloud ENH/CLN: be more careful about unicode
e35cb5c
@cpcloud cpcloud CLN: run autopep8 on pandas/io/pytables.py
1ceec39
@cpcloud cpcloud DOC: reference future enhancingperf.eval section
c665a85
@cpcloud cpcloud CLN/DOC: clean up docstrings in pytables
cb27934
@cpcloud cpcloud CLN: actually pass fletcher32 in get_store
63ba37d
@cpcloud cpcloud CLN: remove unused variables
dcde590
@cpcloud cpcloud CLN: more pep8 and get rid of most raise Exception clauses
3c4e2b3
@cpcloud cpcloud CLN: change NameError to match python
226c786
@cpcloud cpcloud API: expose the Expr object to top level pandas
79871d8
@cpcloud cpcloud CLN/TST: fail with a NotImplementedError on and or not
84fdb45
@cpcloud cpcloud CLN: generlize operator/expression printing
4d9f9a7
@cpcloud cpcloud CLN: clean up testing and expr
a0d2ce0
@cpcloud
Collaborator

@jreback let me see if i have the right idea about how selections are computed

assume that we:

  • have an AppendableFrameTable instance
  • want the whole table (i.e., not iterating over chunks)

  1. HDFStore.select is called
  2. select calls AppendableFrameTable.read
  3. read calls read_axes which sets the selection attribute
  4. selection is then used in process_axes to make the selection using filters (if any)

i think my plan of attack will be:

  1. make a subclass of my Term class, maybe TableTerm which has the filter attribute and possibly a few other unexposed helper methods if necessary
  2. the Term class in pytables.py will become something like TableExpr (a subclass of Expr) which is a lightweight container for TableTerms (just like Expr)

I think I might be able to do this without touching any of the Storer code.

@jreback
Owner

too complicated....going to futz around with your branch....give me a few

@cpcloud
Collaborator

ok. seems like there's a lot going on in pytables. it's not clear to me how to incorporate the filters into my current scheme since there's no state regarding the operator except its name, e.g., self.op = '!='. i suppose i could special case two classes for the filterable operators to hold a filter...

@jreback
Owner

a filter is just an expression that we are going to evaluate after the results come back (e.g. it's a dimension that pytables doesn't select on). when you store a frame, the rows are stored, but the columns are stored as a single block, so selecting on the columns is basically just a reindex
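
The column-filter-as-reindex idea can be illustrated on an in-memory frame (no HDFStore involved):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('ABC'))

# rows are the dimension pytables actually selects on...
selected = df[df.index > 1]
# ...while a column 'filter' is applied afterwards: basically a reindex
result = selected.reindex(columns=['A', 'B'])

assert result.shape == (2, 2)
```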

@jreback
Owner

ok....if I have an expression like columns=A, how can I tell the node visitor (which is now a sub-class of your existing one) to really make this columns==A? (e.g. it interprets it as an Assign, but I want a different op.) aside from actually rewriting it, I basically want to map = and == to the same thing

@cpcloud
Collaborator

i think this should work

def visit_Assign(self, node):
    cmpr = ast.Compare(ops=[ast.Eq()], left=self.visit(node.targets[0]),
                        comparators=self.visit(node.value))
    return self.visit(cmpr)
@cpcloud
Collaborator

you also need to remove the raise i put in to disallow statements

@cpcloud
Collaborator

or add ast.Assign to the instance check in visit_Module

@cpcloud
Collaborator

ok that won't work exactly yet, give me a sec, i'll make it work and push and put a link

@jreback
Owner

don't worry....i think i got what i need for a min

@jreback
Owner

actually....need to transform an Assign node to a Compare node...would be helpful..?

@cpcloud
Collaborator

yep that's what i did at first but i don't think it's necessary...here's something that works

    def visit_Assign(self, node):
        cmpr = ast.Compare(ops=[ast.Eq()], left=node.targets[0],
                           comparators=[node.value])
        return self.visit(cmpr)
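
For context, here is a self-contained version of that substitution that compiles and evaluates the resulting Compare; the helper name is illustrative, and note that the Assign target carries a Store context that has to be flipped to Load before compiling:

```python
import ast

def assign_to_compare(source):
    """Parse `a = b` and reinterpret the Assign node as `a == b`."""
    tree = ast.parse(source)
    assign = tree.body[0]          # the ast.Assign statement
    left = assign.targets[0]
    left.ctx = ast.Load()          # assignment targets carry a Store context
    cmpr = ast.Compare(ops=[ast.Eq()], left=left,
                       comparators=[assign.value])
    expr = ast.fix_missing_locations(ast.Expression(body=cmpr))
    return compile(expr, '<expr>', 'eval')

assert eval(assign_to_compare('x = y'), {'x': 1, 'y': 1}) is True
```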
@cpcloud
Collaborator

this exposes a buglet. pd.eval('x==y') doesn't return a bool..

@jreback
Owner

I am changing this only for a sub-class though (where I explicitly treat '=' and '==' as the same); there is no concept of '=', it's merely shorthand

@cpcloud
Collaborator

sure...rewrite or not, i don't think it matters unless you want to reconstruct the python source from the ast

@cpcloud
Collaborator

return type is correct for frames...numpy arrays use a different scheme...fixing it now.

@cpcloud
Collaborator

ok must add return type field to ops...then can compute recursive union of return types to yield the final return type
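
One possible shape for that recursive union, assuming a simple priority ranking over result kinds (the names here are illustrative, not the PR's API):

```python
from functools import reduce

# hypothetical priority: richer pandas types win over bare numpy arrays
_PRIORITY = {'ndarray': 0, 'Series': 1, 'DataFrame': 2}

def union_return_type(*type_names):
    """Recursive union of operand return types: the richest type wins."""
    return reduce(lambda a, b: a if _PRIORITY[a] >= _PRIORITY[b] else b,
                  type_names)

assert union_return_type('ndarray', 'DataFrame', 'Series') == 'DataFrame'
```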

@jreback
Owner

challenge for you.....how can I have this:

`columns=A | columns=B`

parse as

`(columns==A) | (columns==B)`

right now I am pre-parsing and substituting == for =

but then precedence is screwed up and the comparator has multiple equal ops (which you raise currently)

@cpcloud
Collaborator

yeah...a little annoying that | binds tighter than ==, but if you're using them for bit manip then you might expect that...challenge accepted!

@cpcloud
Collaborator

nice think i got it

@cpcloud
Collaborator

this is one of your eval schemes to make everything a string!

@jreback
Owner

fbeb99d

this doesn't work but getting closer.... I made some changes in the base classes. There might be an easier way to do these, or you might want to incorporate them for more generality...

@cpcloud
Collaborator

here's a better preparser

import re
import tokenize
from StringIO import StringIO  # io.StringIO on Python 3


def _rewrite_assign(source):
    res = []
    g = tokenize.generate_tokens(StringIO(source).readline)
    for toknum, tokval, _, _, _ in g:
        res.append((toknum, '==' if tokval == '=' else tokval))
    return tokenize.untokenize(res)


def _eat_whitespace(source):
    return re.compile(r'\s*').sub('', source)


def _parenthesize_booleans(source, ops='|&'):
    res = source
    for op in ops:
        terms = res.split(op)

        t = []
        for term in terms:
            t.append('({0})'.format(term))

        res = op.join(t)
    return res


def preparse(source):
    return _parenthesize_booleans(_eat_whitespace(_rewrite_assign(source)))
@cpcloud
Collaborator

passes the following smoke test

In [8]: x, y, z, w = randint(2, size=(10,4)).T

In [9]: all(pd.eval('x=y|w=z') == pd.eval('(x==y)|(w==z)'))
Out[9]: True
@cpcloud
Collaborator

thanks i'll incorporate it.

@jreback
Owner

how is something like: columns=['A','B'] handled?

visit_List? or is this a visit_Expr? (I know it will go thru name resolution, that is fine); ideally I want to keep it as a list though

@cpcloud
Collaborator

that's not allowed right now. would be visit_Expr which would make a call into visit_List.

@cpcloud
Collaborator

numexpr can't handle lists

in theory python engine could

and it sounds like pytables engine must handle them

@cpcloud
Collaborator

@jreback i feel a bit silly asking this but how do i pull down ur changes?

@jreback
Owner

@cpcloud i think you have to add me as a remote, then cherry-pick

i pseudo have it working....but maybe you can help with this.....

say I haev a call like this:

Term('index=df.index[0:4]')...where Term is basically Expr

I can't seem to get the variable scope for df....(once I am in the object creation call sys._getframe(2).f_locals is just the local stuff

any idea?
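
The frame-depth pitfall can be reproduced with a toy stand-in (names here are illustrative): sys._getframe(1) inside __init__ sees the direct caller's locals, but any extra call layer shifts the depth and loses them:

```python
import sys

class Term(object):
    """Toy stand-in: capture the caller's locals at construction time."""
    def __init__(self):
        # depth 1 is the frame that called Term(); a deeper or shallower
        # depth would miss the user's variables entirely
        self.env = dict(sys._getframe(1).f_locals)

def user_code():
    df = 'only visible in this frame'
    return Term()

assert 'df' in user_code().env
```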

@cpcloud
Collaborator

can u point me to a commit?

@jreback
Owner
@cpcloud
Collaborator

oh same one

@jreback
Owner

somewhat old .....will push again in a bit

@jreback
Owner

6b67214

new one....still very WIP though!

@jreback
Owner

figured out the scoping issue.....I AM capturing scope, but then I was throwing it away because I allow a 'loose' syntax..

e.g.

[ 'columns=A',Term('index=df.index[0:4]')]

is translated to:

columns=A & index=df.index[0:4] as a string. (which I can't deal with yet...but getting there)

@cpcloud
Collaborator

wait where does that commit live? it's on master but not a branch but a tree?

how are you parsing the slice?

@cpcloud
Collaborator

@jreback i've added u as a remote but how can i cherry-pick that commit? it looks like it's on master, but where is it being pushed? is it detached or something?

@jreback
Owner

its on a separate branch of my master

try this: https://github.com/jreback/pandas/tree/eval

its build on top of your branch (which at some point i will rebase; i am a few behind you)

eventually you can just grab some of my commits and add to your tree as you have the PR

not 100% sure how to do that, but since we can't push to each other i think it's the only way

though once you have a PR to pandas master I think I can push to that (I think this is a pseudo private branch....
if I do a PR I think my commits will be added after yours.....)

@jreback
Owner

on #4151 now

@jreback
Owner

of course going to fail all tests...but still a WIP

@cpcloud
Collaborator

i like how github pulse double counts my commits here....not :|

@jreback
Owner

https://github.com/newville/asteval/blob/master/lib/asteval.py

nice ref for implementing the ast handlers

@cpcloud
Collaborator

nice. i think we can use slice, subscript, attribute access, numexpr supported functions, and possibly chained comparisons if we want. x if y else z is covered by where

would be nice to handle certain pandas functions like Timestamp, for pytables stuff.

@jreback
Owner

all pretty straightforward

I have a new branch that does most of this

could move some into base class

@cpcloud
Collaborator

great! can u post a link to a branch?

@jreback
Owner

#4155 is already up, an extension of your branch....all tests pass!

take a look. I think some of what I did should actually be in your branch (e.g. see my code clean up point).
Need a better validator on visit in ExprVisitor; in fact maybe create a base class which basically implements everything, then let a sub-class (for NumExpr and your stuff) turn on which ops it needs (and occasionally override some); I already inherit this class

This can also consolidate the initialization stuff of the binary/math ops as well

other thing to note, I added an ops.Value which is very much like Constant but doesn't do name validation (as I can't know that names on the rhs are invalid, so I must accept them); this includes stuff like a list of strings, or even an index

I didn't really check whether your tests break! more likely is that some 'invalid' ops would slip by

I also eliminated the call to generic_visitor, instead raising on an invalid op, not sure if that works for you or not

@jreback
Owner

hmm..this not showing my commits?

@cpcloud
Collaborator

this won't show your commits because i've changed the remote of my local branch here...

the remote for this pr is my origin, which i don't know what you call on your box.
the new eval-3393 is on my upstream (your origin)

AFAIK i don't think you can track origin/eval-3393 and upstream/eval-3393 in the same branch, so i've chosen to just use upstream, e.g. if i make a change it won't show up in two places when i push...

one way might be: you could push to your origin/eval-3393 (my upstream), i could fetch from that and you could fetch from this and i'll push to this; problem is i think that creates a cycle in the commit graph which essentially breaks git

@jreback
Owner

hmm...I thought that you were creating a master branch that is push/pullable from either of us and sort of independent?

maybe close that PR and just open another one from that branch (e.g. merge from eval-3393 to master) rather than from your branch to master?

@cpcloud
Collaborator

so i think we should close our respective PRs, i'll open a new one with upstream/eval-3393 and we push to the branch on pydata/pandas until we're finished and ready to merge? sound good?

@jreback
Owner

yes.....had same thoughts (and I would basically copy your top-level lists and then I will do the same), for ToDo's and such

@cpcloud
Collaborator

right. a single PR to rule them all. i'll merge our TODO lists and other stuff in the initial text of the PR.

@jreback
Owner

great....

@cpcloud
Collaborator

hopefully, this will scale down the bloated pulse report of my commits

@cpcloud cpcloud closed this
@cpcloud cpcloud deleted the branch