Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
Clone this wiki locally
What should the invariant for SymPy functions regarding
args be? The basic invariant is
expr == expr.func(*expr.args). However, there are some questions about leaf nodes.
Basically, it boils down to, what is a leaf?
- A non-Basic
- An instance of Atom
- A Basic with empty .args
I used to say 2, but I now think that it is wrong. Atom is a convenient core class, but there is really no reason to use it instead of 3. At any rate, Atom is definitely a special case of 3. But requiring every class with empty args to be an Atom would be difficult, lead to unexpected corner cases, and wouldn't even make sense (a good example here is
Tuple(), the empty Tuple).
Now, here would be the exact invariants in each case
expr == expr.func(*args)
expr.args == () or expr == expr.func(*expr.args)
Additionally, each instance of Basic would also satisfy
- All elements of
expr.argsare instances of Basic. In code,
all(isinstance(arg, Basic) for arg in expr.args).
Some examples of where this comes up (all leaf objects, or near leaf objects if we choose 1)
Object | example | args -------------|-------------------------|------------------------------- Symbol | Symbol('x') | () Integer | Integer(2) | () Rational | Rational(2, 3) | () NumberSymbol | pi | () MatrixSymbol | MatrixSymbol('x', 2, 3) | ('x', 2, 3) where 2 and 3 | | are Integers and x is a string
TODO: The below lists might be useful to have in a table, because many of the bullet points in on section correspond to bullet points in the other. But to do tables in Markdown, you have to use html.
Pros of 1 over 3/Cons of 3 over 1
- Easier to conceptualize, in the sense that all parts of an expression that define that expression are in .args
- Easier to write objects. You don't have to write custom
_hashable_contentif you want to make non-Basic objects part of the object.
- Objects like
MatrixSymbolmight have part non-Basic and part Basic args, but 3 would require all or nothing (or a #2093 type workaround that uses a nesting the intuitive feeling of the object).
- The invariant looks simpler.
- non-Basics are treated as part of the expression tree. Any expression tree manipulation will also affect them.
Cons of 1 over 3/Pros of 3 over 1
- People might take it too far. If there is a Basic version of an object, you really should be using that object (e.g., use
tuple). Otherwise you miss the whole point of having those objects in the first place.
- Code that walks expression trees always has to be special-cased to look out for non-Basics.
- It's a nice assumption to make (all elements of
.argsare Basic). It means you can write
for arg in expr.argsand know that you can call any method of Basic on
- Existing code assumes this.
- Treating non-Basics as part of the expression tree can be confusing. For example,
Symbol('x').xreplace('x', 'y')will return
Symbol('y'), but not because it replaced
'y', but because
'x'is sympified into
'y'is sympified into
Symbol('y')and those are replaced (and there is nothing special about the non-Basic
strin this case). Doing 1 consistently would require changing the whole way that sympification is handled throughout SymPy.
- With 1, there can be non-leaf nodes without any leaves, because there will
still be objects with empty args (
Tuple()is again a classic example). In some sense, these objects will still have to be leaves, so perhaps 1 should really be
obj.args == () or not isinstance(obj, Basic). However, to be sure, the invariant does not need to include this because if
obj.args == (), then it must be the case that
obj == obj.func().
But I think an important thing to consider is what expression traversal functions would look like with each of the invariants.
To be fair, there is a difference between 1 algorithms and 3 algorithms, in that the 1 algorithms treat the non-Basics
Note: to make these generators in their current form, you need Python 3.3's
yield from. However, as we see below, the usual use case of the pre-order traversal is to use it to actually recurse through an expression, rebuilding at some part, and returning the result. Thus, for simplicity, this version just prints each node.
def preorder_traversal(expr): print expr if isinstance(expr, Basic): for arg in expr.args: preorder_traversal(arg)
def preorder_traversal(expr): print expr for arg in expr.args: preorder_traversal(arg)
def xreplace(expr, old, new): if expr == old: return new if isinstance(expr, Basic): return expr.func(*[xreplace(arg, old, new) for arg in expr.args]) return expr
def xreplace(expr, old, new): if expr == old: return new if expr.args: return expr.func(*[xreplace(arg, old, new) for arg in expr.args]) return expr
def atoms(expr, classes): ret = set() if isinstance(expr, classes): ret |= set([expr]) if isinstance(expr, Basic): for arg in expr.args: ret |= atoms(arg, classes) return ret
def atoms(expr, classes): ret = set() if isinstance(expr, classes): ret |= set([expr]) for arg in expr.args: ret |= atoms(arg, classes) return ret
I prefer 3. Here are my issues with 1:
It appears simpler, because the invariant is simpler (no check for empty args), but it's deceptive. As the above algorithms show, 3 is actually the simpler version. With 1, every algorithm has to nest its expression recursion under an
if isinstance(expr, Basic)block. With 3, empty args being leaves falls out naturally as an empty loop.
There are cases where you do have to check this condition with 3, such as in
xreplace, but you also have to check the
Basiccondition in the same place with 1, so nothing is gained or lost in that case.
I think the bullet point under the cons for 1 about sympification is not to be considered lightly. Going with 1 would require a huge refactoring of the way sympification is handled in SymPy. The usual way is to automatically coerce non-SymPy types into SymPy types, but with 1, it is not clear when this should be done. By default, in the Basic constructor, we would have to not sympify arguments. The burden for a class to specify that all its arguments should be Basic lies on the individual classes (or at least super classes). Since this is by far the common case, it would make the current code much more complicated, and lead to many bugs because people don't expect to need to write
args = sympify(args)at the top of their constructors.
Actually, this is the current behavior (
Basicdoes not call sympify in its constructor). That's why classes exist that have non-Basic args. But we have this exact problem. Code is littered with seemingly unnecessary
args = sympify(args)calls, and code that forgets to include it has subtle bugs when someone calls the class with
Alternately, we could keep sympification as the default, and add a flag to Basic that subclasses could use to signify that they allow non-Basic args. But this leads to two issues. First, this becomes an inheritable property of the class, meaning it will be easier to obtain Liskov violations (this may actually be a non-issue, I'm not sure). Second, this gives the feeling that objects with non-Basic args are somehow second-class citizens. In fact, that's my next point
No matter how we support 1, objects with non-Basic args will become like second-class citizens, whether it's de jure from the implementation or de facto from people forgetting to add the
if not isinstance(expr, Basic)to their algorithms.
I should point out that leaf nodes are free to define equality in a natural way. That is, even if something like
MatrixSymbol('x', 1, 2).argsis
(), we can still have
MatrixSymbol('x', 1, 2) != MatrixSymbol('y', 1, 2) != MatrixSymbol('y', 2, 2).
I mostly prefer 1
Small point but while it's fresh in your mind,
MatrixSymbol('x', 1, 2).argscan't be () easily. This is because the expression could be
MatrixSymbol('x', n, k+m). The second two arguments are the shape and in general are
Exprs. We'll need to traverse down these so they'll either need to be args or we'll need to do a lot of special casing within
MatrixSymbol. A pull request exists to replace
Symbol('x'). This seems wrong to me because
'x'here is a name, not a mathematical scalar. The right way to do this under 3 would be to create a
String(Basic). This also feels wrong to me.
In general, I think that having all identifying information within args will simplify the code in the long term. It also opens various performance improvements. See the following timings from my laptop
In : timeit Add(x, y, z) 100000 loops, best of 3: 2.15 us per loop In : timeit Add(x, y, z, evaluate=False) # no idea why this is slower 100000 loops, best of 3: 4.14 us per loop In : timeit Basic.__new__(Add, x, y, z) 1000000 loops, best of 3: 616 ns per loop In : timeit (Add, x, y, z) 10000000 loops, best of 3: 96.7 ns per loop
Having all information in args turns the SymPy Basic into something like an s-expression (a glorified tuple). Certainly we don't want to go this far (we like lots of things that the Basic object gives to us), but having that simplicity enables more advanced manipulations of SymPy expressions with greater trust in their accuracy. If someone wrote code that required really intense expression manipulation they could switch into tuple mode (or list mode for mutability), do lots of manipulations quickly, then switch back and re-evaluate. If all information is in args then they could perform these transformations with high fidelity. I'm not proposing that this is the reason we should switch but rather that this is an example of something that is easy with an invariant like "All identifying information is in
- Personally I manipulate SymPy expressions quite a bit. I often do this from outside SymPy (Theano, LogPy, Strategies). Having the "all identifying information is in
.args" invariant makes SymPy significantly more interoperable if you don't have access to the codebase (or don't want to fuss with the codebase). Classes like Symbol, Rational, AppliedPredicate all end up requiring special attention in each of these endeavors. I'm mostly willing to put up with it because I'm familiar with them but potential collaborators are often turned off. I believe that interoperability is more important than we realize for long term growth.
I believe that interoperability is the most important direction for SymPy right now (personal opinion). I believe that the "all identifying information is in
.args" invariant is essential for that goal.
It seems to me that this is a discussion about either
- Simplifying our data structure
- Restricting the things that our data structure can hold to simplify our traversal code.
I think that 3 is short-term more convenient but that 1 is long-term more convenient and much more convenient for interoperation and new developers. To me 3 is insular.