Args Invariant

Aaron Meurer edited this page Jul 2, 2014 · 11 revisions
Clone this wiki locally

This is based off of this comment in PR 2095. See also this mailing list discussion.

What should the invariant for SymPy functions regarding args be? The basic invariant is expr == expr.func(*expr.args). However, there are some questions about leaf nodes.

Basically, it boils down to, what is a leaf?

  1. A non-Basic
  2. An instance of Atom
  3. A Basic with empty .args

I used to say 2, but I now think that it is wrong. Atom is a convenient core class, but there is really no reason to use it instead of 3. At any rate, Atom is definitely a special case of 3. But requiring every class with empty args to be an Atom would be difficult, lead to unexpected corner cases, and wouldn't even make sense (a good example here is Tuple(), the empty Tuple).

Now, here would be the exact invariants in each case

  1. expr == expr.func(*args)
  2. expr.args == () or expr == expr.func(*expr.args)

Additionally, each instance of Basic would also satisfy

  1. All elements of expr.args are instances of Basic. In code, all(isinstance(arg, Basic) for arg in expr.args).

Some examples of where this comes up (all leaf objects, or near leaf objects if we choose 1)

Object       | example                 | args
-------------|-------------------------|-------------------------------
Symbol       | Symbol('x')             | ()
Integer      | Integer(2)              | ()
Rational     | Rational(2, 3)          | ()
NumberSymbol | pi                      | ()
MatrixSymbol | MatrixSymbol('x', 2, 3) | ('x', 2, 3) where 2 and 3 
             |                         | are Integers and x is a string

TODO: The below lists might be useful to have in a table, because many of the bullet points in on section correspond to bullet points in the other. But to do tables in Markdown, you have to use html.

Pros of 1 over 3/Cons of 3 over 1

  • Easier to conceptualize, in the sense that all parts of an expression that define that expression are in .args
  • Easier to write objects. You don't have to write custom __eq__ or _hashable_content if you want to make non-Basic objects part of the object.
  • Objects like MatrixSymbol might have part non-Basic and part Basic args, but 3 would require all or nothing (or a #2093 type workaround that uses a nesting the intuitive feeling of the object).
  • The invariant looks simpler.
  • non-Basics are treated as part of the expression tree. Any expression tree manipulation will also affect them.

Cons of 1 over 3/Pros of 3 over 1

  • People might take it too far. If there is a Basic version of an object, you really should be using that object (e.g., use Integer(1) instead of 1 or Tuple instead of tuple). Otherwise you miss the whole point of having those objects in the first place.
  • Code that walks expression trees always has to be special-cased to look out for non-Basics.
  • It's a nice assumption to make (all elements of .args are Basic). It means you can write for arg in expr.args and know that you can call any method of Basic on arg.
  • Existing code assumes this.
  • Treating non-Basics as part of the expression tree can be confusing. For example, Symbol('x').xreplace('x', 'y') will return Symbol('y'), but not because it replaced 'x' with 'y', but because 'x' is sympified into Symbol('x') and 'y' is sympified into Symbol('y') and those are replaced (and there is nothing special about the non-Basic str in this case). Doing 1 consistently would require changing the whole way that sympification is handled throughout SymPy.
  • With 1, there can be non-leaf nodes without any leaves, because there will still be objects with empty args (Tuple() is again a classic example). In some sense, these objects will still have to be leaves, so perhaps 1 should really be obj.args == () or not isinstance(obj, Basic). However, to be sure, the invariant does not need to include this because if obj.args == (), then it must be the case that obj == obj.func().

Algorithms

But I think an important thing to consider is what expression traversal functions would look like with each of the invariants.

To be fair, there is a difference between 1 algorithms and 3 algorithms, in that the 1 algorithms treat the non-Basics

preorder_traversal

Note: to make these generators in their current form, you need Python 3.3's yield from. However, as we see below, the usual use case of the pre-order traversal is to use it to actually recurse through an expression, rebuilding at some part, and returning the result. Thus, for simplicity, this version just prints each node.

def preorder_traversal(expr):
    print expr
    if isinstance(expr, Basic):
        for arg in expr.args:
            preorder_traversal(arg)
def preorder_traversal(expr):
    print expr
    for arg in expr.args:
        preorder_traversal(arg)

exact replacement

def xreplace(expr, old, new):
    if expr == old:
        return new
    if isinstance(expr, Basic):
        return expr.func(*[xreplace(arg, old, new) for arg in expr.args])
    return expr
def xreplace(expr, old, new):
    if expr == old:
        return new
    if expr.args:
        return expr.func(*[xreplace(arg, old, new) for arg in expr.args])
    return expr

Atoms

def atoms(expr, classes):
    ret = set()
    if isinstance(expr, classes):
        ret |= set([expr])
    if isinstance(expr, Basic):
        for arg in expr.args:
            ret |= atoms(arg, classes)
    return ret
def atoms(expr, classes):
    ret = set()
    if isinstance(expr, classes):
        ret |= set([expr])
    for arg in expr.args:
        ret |= atoms(arg, classes)
    return ret

Opinions

Aaron

I prefer 3. Here are my issues with 1:

  • It appears simpler, because the invariant is simpler (no check for empty args), but it's deceptive. As the above algorithms show, 3 is actually the simpler version. With 1, every algorithm has to nest its expression recursion under an if isinstance(expr, Basic) block. With 3, empty args being leaves falls out naturally as an empty loop.

    There are cases where you do have to check this condition with 3, such as in xreplace, but you also have to check the Basic condition in the same place with 1, so nothing is gained or lost in that case.

  • I think the bullet point under the cons for 1 about sympification is not to be considered lightly. Going with 1 would require a huge refactoring of the way sympification is handled in SymPy. The usual way is to automatically coerce non-SymPy types into SymPy types, but with 1, it is not clear when this should be done. By default, in the Basic constructor, we would have to not sympify arguments. The burden for a class to specify that all its arguments should be Basic lies on the individual classes (or at least super classes). Since this is by far the common case, it would make the current code much more complicated, and lead to many bugs because people don't expect to need to write args = sympify(args) at the top of their constructors.

    Actually, this is the current behavior (Basic does not call sympify in its constructor). That's why classes exist that have non-Basic args. But we have this exact problem. Code is littered with seemingly unnecessary args = sympify(args) calls, and code that forgets to include it has subtle bugs when someone calls the class with 0 instead of x.

    Alternately, we could keep sympification as the default, and add a flag to Basic that subclasses could use to signify that they allow non-Basic args. But this leads to two issues. First, this becomes an inheritable property of the class, meaning it will be easier to obtain Liskov violations (this may actually be a non-issue, I'm not sure). Second, this gives the feeling that objects with non-Basic args are somehow second-class citizens. In fact, that's my next point

  • No matter how we support 1, objects with non-Basic args will become like second-class citizens, whether it's de jure from the implementation or de facto from people forgetting to add the if not isinstance(expr, Basic) to their algorithms.

  • I should point out that leaf nodes are free to define equality in a natural way. That is, even if something like MatrixSymbol('x', 1, 2).args is (), we can still have MatrixSymbol('x', 1, 2) != MatrixSymbol('y', 1, 2) != MatrixSymbol('y', 2, 2).

Matthew

I mostly prefer 1

  • Small point but while it's fresh in your mind, MatrixSymbol('x', 1, 2).args can't be () easily. This is because the expression could be MatrixSymbol('x', n, k+m). The second two arguments are the shape and in general are Exprs. We'll need to traverse down these so they'll either need to be args or we'll need to do a lot of special casing within MatrixSymbol. A pull request exists to replace 'x' with Symbol('x'). This seems wrong to me because 'x' here is a name, not a mathematical scalar. The right way to do this under 3 would be to create a String(Basic). This also feels wrong to me.

  • In general, I think that having all identifying information within args will simplify the code in the long term. It also opens various performance improvements. See the following timings from my laptop

In [2]: timeit Add(x, y, z)
100000 loops, best of 3: 2.15 us per loop

In [3]: timeit Add(x, y, z, evaluate=False)  # no idea why this is slower
100000 loops, best of 3: 4.14 us per loop

In [4]: timeit Basic.__new__(Add, x, y, z)
1000000 loops, best of 3: 616 ns per loop

In [5]: timeit (Add, x, y, z)
10000000 loops, best of 3: 96.7 ns per loop

Having all information in args turns the SymPy Basic into something like an s-expression (a glorified tuple). Certainly we don't want to go this far (we like lots of things that the Basic object gives to us), but having that simplicity enables more advanced manipulations of SymPy expressions with greater trust in their accuracy. If someone wrote code that required really intense expression manipulation they could switch into tuple mode (or list mode for mutability), do lots of manipulations quickly, then switch back and re-evaluate. If all information is in args then they could perform these transformations with high fidelity. I'm not proposing that this is the reason we should switch but rather that this is an example of something that is easy with an invariant like "All identifying information is in .args"

  • Personally I manipulate SymPy expressions quite a bit. I often do this from outside SymPy (Theano, LogPy, Strategies). Having the "all identifying information is in .args" invariant makes SymPy significantly more interoperable if you don't have access to the codebase (or don't want to fuss with the codebase). Classes like Symbol, Rational, AppliedPredicate all end up requiring special attention in each of these endeavors. I'm mostly willing to put up with it because I'm familiar with them but potential collaborators are often turned off. I believe that interoperability is more important than we realize for long term growth.

I believe that interoperability is the most important direction for SymPy right now (personal opinion). I believe that the "all identifying information is in .args" invariant is essential for that goal.

It seems to me that this is a discussion about either

    1. Simplifying our data structure
    1. Restricting the things that our data structure can hold to simplify our traversal code.

I think that 3 is short-term more convenient but that 1 is long-term more convenient and much more convenient for interoperation and new developers. To me 3 is insular.