Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent Fortran printing for Indexed, ArraySymbol and MatrixSymbol #23998

Open
ThePauliPrinciple opened this issue Aug 31, 2022 · 28 comments

Comments

@ThePauliPrinciple
Copy link
Contributor

from sympy import MatrixSymbol, IndexedBase
from sympy.tensor.array.expressions import ArraySymbol
print(fcode(IndexedBase('I', (5,5))[0,0]))
print(fcode(ArraySymbol('A', (5,5))[0,0]))
print(fcode(MatrixSymbol('M', 5, 5)[0,0]))
      I(0, 0)
      A[0, 0]
      M(1, 1)

ArraySymbol is just wrong using the bracket notation. I think Indexed should print as I(1,1) like MatrixSymbol does, considering that would be the standard in Fortran. Changing it would potentially break currently working code. (If anyone is using it).

@ThePauliPrinciple
Copy link
Contributor Author

ThePauliPrinciple commented Aug 31, 2022

For loops also seem inconsistent:

from sympy import fcode, Range, IndexedBase, Idx
i = Idx('i')
from sympy.codegen.ast import For
print(fcode(For(i,Range(5), [IndexedBase('A')[i]])))
      do i = 0, 5, 1
         A(i)
      end do

The above fortran code (if I print A(i)) yields six number:

Warning: Array reference at (1) out of bounds (0 < 1) in loop beginning at (2)
   2.33174307E+17
   0.00000000    
   0.00000000    
   0.00000000    
   2.33174307E+17
   0.00000000 

This is the c code btw:

for (i = 0; i < 5; i += 1) {
   A[i];
}

@ThePauliPrinciple
Copy link
Contributor Author

ThePauliPrinciple commented Aug 31, 2022

I wonder if maybe the fortran printer should have a way to set which starting value for arrays you use (the default is that it starts at 1, but you in principle could define an array which starts at any number, including negative numbers I believe)

@ThePauliPrinciple
Copy link
Contributor Author

Some more interesting things:

IndexedBase offers strides, which either tells you if you should skip elements (as in e.g A[::2]) or it can be F or C to indicate column or row major. But this is then not implemented in all printers.
codegen.ast.Element offers strides, but it can only be the first meaning above.
I dont think ArraySymbol offers something like this.

Maybe it is better to have one central array-like printing method, which itself uses printer specific methods to figure out what the correct format is.

@ThePauliPrinciple
Copy link
Contributor Author

Digging into this makes it worse and worse tbh. e.g.

A = IndexedBase('A', (5,5), strides = 'F')
i,j = Idx('i'), Idx('j')
print(ccode(A[i,j]))
print(ccode(A.func(*A.args)[i,j]))
print(ccode(simplify(A)[i,j]))
A[i + 5*j]
A[5*i + j]
A[5*i + j]

Shows that the strides information is lost when recreating the object. That makes this feature completely bugged, since the above can arbitrarily happen during any of SymPy's operations (e.g. with simplify).

I wonder if strides should really be a part IndexedBase? In my opinion, fcode should just write it in reversed order and ccode should keep the order that was already there.

Currently print(fcode(A[i,j])) keeps the same order regardless of the choice of strides (and the python/c like order to top it off).

@bjodah
Copy link
Member

bjodah commented Sep 2, 2022

Oh, that's a lot of bugs. I think it's great that you take the time to hunt them down. I'm probably the one to blame here, I overlooked the need for explicitly adding strides and offset to .args. That could probably be fixed though (by always including e.g. a Strides and Offset obejct, defaulting to "none"), or do we consider the length of .args to be part of our public stable API?

The For loops indeed look very buggy. I believe we have tests exercising the C-printer and Fortran-printer wherr we actually use a compiler to test the generated code, but I don't think there's any test using the For-object:
https://github.com/sympy/sympy/blob/master/sympy/codegen/tests/test_algorithms.py
we should probably expand that test suite to exercise more of the objects from .codegen.ast.

I'm fine with deprecating features which are inherently broken. To be honest, I've seen very few people using some of these features besides me, which I guess says something about the ergonomics of them. But it would be useful to have a "multi-indexed" object (N-dimensional array) which can be printed as e.g. A(i, j, k) and A(i*s0 + j*s1 + k*s2) in Fortran, and as A[i*s0 + j*s1 + k*s2] in C (where sN are strides). Strides C and F are special cases for s0 ... sN. We should probably also try to support e.g. a way for the C++ printer to print with e.g. eigen compatible syntax.

Part of the reason for why the situation looks like this is historic: IndexedBase represents the oldest code, and has been modified along the way. The NDimensionArray-stuff was long private and unstable API (still is?). And the .codegen.ast evolved somewhat independently...

@bjodah
Copy link
Member

bjodah commented Sep 2, 2022

For the translation of indices in the printer, I think we should probably make that configurable by a printer setting? For example: if we always add 1 to indices in the FCodePrinter, then what do we do for symbolic indices?

@ThePauliPrinciple
Copy link
Contributor Author

ThePauliPrinciple commented Sep 2, 2022

I think you should also always add +1 to symbolic indices, consider e.g.:

c = 1
A[c+1]

Now when you do a loop and want to explicitly go from 0..4 to 0..5, you should at that point subsitute the iteration variable i with i-1, so that you go from

do  i = 0, 4
  A[i + 1]
end do

to

do i = 1, 5
  A[(i - 1) + 1] 
end do

Note that sympy's Add would canonicallise (i - 1) + 1 to i, so it would print what you would want it to.

Edit:
For this to work, some of the current looping code would have to be replaced. In particular, I think anywhere the current printers create a loop, it should just put the expression inside a For codegen.ast object and then try to print it again, rather than the current approach of having multiple ways to introduce loops. Then the For printing code can introduce the substitution when appropriate for that language.

Second edit:
This subsitution would also be helpful if you want to unroll the looping in fortran (which you have to do when you have more than 7 indices, since that is the limit for fortran), because then you can simply use the c-style unrolling and when you then apply the substitution, you can still loop from 1..5 and you would still also need the single +1 for the one index you are now describing (with multiple symbols).

@ThePauliPrinciple
Copy link
Contributor Author

Yes, codegen.ast needs more testing. The rcode For loop has a test, but it tests for it to print a c code loop, which is clearly wrong and the test currently passes.

@bjodah
Copy link
Member

bjodah commented Sep 2, 2022

Sometimes people (well, at least me) do code-generation not of a full loop, but just a fragment. Say that I want a fortran code snippet for this expression:

>>> from sympy import IndexedBase, Idx
>>> from sympy.calculus import apply_finite_diff
>>> x, y = map(IndexedBase, 'xy')
>>> i = Idx('i')
>>> x_list, y_list = zip(*[(x[i+j], y[i+j]) for j in range(-1,2)])
>>> expr = apply_finite_diff(1, x_list, y_list, x[i])
>>> from sympy.printing.fortran import fcode
>>> print(fcode(expr, source_format="free"))
((x(i + 1) - x(i))/(-x(i - 1) + x(i)) - 1)*y(i)/(x(i + 1) - x(i)) - (x(i &
      + 1) - x(i))*y(i - 1)/((x(i + 1) - x(i - 1))*(-x(i - 1) + x(i))) &
      + (-x(i - 1) + x(i))*y(i + 1)/((x(i + 1) - x(i - 1))*(x(i + 1) - &
      x(i)))

Here I would rather not have any +1 terms inserted by SymPy. Actually, changing this behavior would silently break code relying on this.

@ThePauliPrinciple
Copy link
Contributor Author

Yes, that's what I'm afraid of, but in my opinion you are currently working around a bug. The same sympy expression evaluates to something different in c and in fortran. You will now need a different value for i for c and fortran if you want it to mean the same thing.

@bjodah
Copy link
Member

bjodah commented Sep 2, 2022

Well, I politely disagree, and I would argue that what we have now is the expected behaviour (but not for our generated loops of course). 🙂

In modern Fortran we can't even make the assumption that the array index starts at 1 (cf. the DIMENSION attribute). And I've actually had to work with existing codes where the base index was explicitly set to something else in a few places (stencils, handling of ghost cells etc.).

@ThePauliPrinciple
Copy link
Contributor Author

So then how to keep the current behaviour, but also be able to consistently evaluate static and looped indices that results in the same behaviour when it is written to Fortran and C?

@bjodah
Copy link
Member

bjodah commented Sep 2, 2022

That's the hard part, and I don't claim to have a silver bullet here. I think it's hard to reconcile with our current printers. Perhaps having more than one type of printer is a viable approach: A "low-level" one which makes close to zero assumptions, and then printers (maybe it should be called something else then) which carries state and are used to generate e.g. whole functions with consistent indexing for the specific language etc.

@bjodah
Copy link
Member

bjodah commented Sep 2, 2022

It's hard to foresee all potential issues with competing designs, perhaps building such new facilities with a few disparate use cases in mind and putting them under:
https://github.com/sympy/sympy/tree/master/sympy/sandbox
is a viable approach?

That way, we're not held back by back-comp. issues, and if all existing use cases can eventually be addressed with the new tools, then one can starting looking at slowly deprecating the old and half-broken tools we have. But as usual, it mostly comes down to manpower, if you have the bandwidth for spearheading this, then that's awesome! I would try to make time to provide some (hopefully) useful input.

@ThePauliPrinciple
Copy link
Contributor Author

ThePauliPrinciple commented Sep 2, 2022

I could work with that if the current printers were truely low-level. A function which would apply such substitutions is perfectly fine with me, e.g. if ccode(expression) and fcode(fortran_indexing(expression)) would yield the same functional code than I would be happy. The problem now is that for loops the Fortran printer assumes that arrays start at 1, for Matrix it assumes that arrays start at 1, but for Indexed and ArraySymbol it makes no assumption on where the arrays start (which is the same as saying it starts at 0, because that is how I would interpret the SymPy expression).

So either we always start at 0, always start at 1 or we have a per class or even per object definition of where they start. But the current situation is bugged and changing it either way is going to bug someone's current code. If someone is currently using Matrix('m')[0,0] and we change that to be consistent with the printing of Indexed and ArrayElement, then they are obtaining numbers from unallocated memory.

@bjodah
Copy link
Member

bjodah commented Sep 2, 2022

I agree that the current printers do too many things. In my point of view a printer should mostly be concerned about parenthesization, syntax and translation of SymPy tree into the natural representation of the targeted langauge (e.g. Pow(..., -1) should print as division etc.).

I'm speculating, but: I think we could break out those "non-controversial" methods into base classes for respective language, which then our current backwards-compatible printers inherit from. That way you can reuse the "good parts" of existing printers, while not being hindered by arbitrary choices? (the contract=True/False flag comes to mind)

@ThePauliPrinciple
Copy link
Contributor Author

Yes, the contract flag is particularly annoying. I would rather have a helper function explicitly defining sums (the current pattern recognition is based on tensor-valid operations only, which is not necessary the only thing you could possible want).

Defining a _get_indices_starts(self, <object with indices>) method which tells us where each index of that object starts depending on first the object, then the class and then a global setting might be a solution. Then the fortran printer just needs to set Indexed and ArraySymbol to 0 and Matrix and 'For' to 1 to get the current behaviour. (probably some more inconsistencies will show up). Then one can set everything to zero if they want to make no assumptions (like your situation above) or to 1 if one wants to write a fortran-style program that is equivalent to a c program based on the same sympy expression. Additionally, you could set a specific object to start at whatever index if you have to work with code that has arbitrarily starting indices.

Also, just to reiterate the insanity of the current situation

from sympy import fcode, MatrixSymbol, IndexedBase, Idx, symbols
i,j = symbols('i,j', cls = Idx, range = 5)
A, B = symbols('A,B', cls = IndexedBase, real = True)
M = MatrixSymbol('M', 5,5)
print(fcode(B[i,j] + M[i,j], assign_to=A[i,j]))
      do j = 1, 5
         do i = 1, 5
            A(i, j) = B(i, j) + M(i + 1, j + 1)
         end do
      end do

@ThePauliPrinciple
Copy link
Contributor Author

I'm also not really sure if backwards compatibility is really an argument here. To me "backwards compatibility" is not an argument in and off itself. The reason you want to maintain backwards compatibility is that you don't want that someone's current script suddenly fails, or worse, the result changes without explicit warnings. And while that is certainly the situation if we were to change something, it is also true that currently someone's script might not be functioning correctly without them knowing either. I would consider that a worse situation.

I'd also be happy with saying that Matrix is bugged and we have Matrix[0,0] print as Matrix(0,0), if you think that has fewer consequences. But I think that keeping the current situation is the worst out of the three options (changing Matrix vs changing Indexed, ArrayElement and loops vs keeping the current situation).

@bjodah
Copy link
Member

bjodah commented Sep 2, 2022

I think your multi-tiered approach of getting start indices sounds promising (first object, then printer, not sure about global though, but I guess it's the immutable default of the printer you're referring to?).

I agree that the current situation is quite insane indeed, but people have adapted, and we try very hard not to break backwards compatibility. In general, people do test their calling code, and adapt to our idiosyncrasies. I know it's frustrating, but in this case, I don't think we can change the output of fcode(someMatrixSymbol[0,0]) without breaking users code, I did a quick (like 30 second) search on github and dug out this:
https://github.com/eph/dsge/blob/30cb16aca3675d991d25b1f0209f3fa99f0adc39/dsge/FCodePrinter.py#L29
I haven't tried it, but superficially it looks like it's depending on our behavior not changing here?

But the fortran-printing of ArraySymbol generating square brackets does indeed look like it classifies as a bug (and a tell tale of it never being used), so I would just go ahead and fix that (theoretically someone could be relying on that, but we have to be a bit pragmatic too).

@ThePauliPrinciple
Copy link
Contributor Author

I think I did not explain the tiered system correctly. When printing an object with indices, we first look in a dictionairy which contains object so e.g. if we print A[i], we look if we find {A[i]: n} and use n as the number to start counting at. Then if this doesn't exist, we look in a dictionary for {Indexed: m} which should tell us that any Indexed (or subclass thereof) not specifically defined should start at m. Then finally if that fails we fall back to the printer's default.

My problem with modifying the existing code is the amount of effort it takes to keep the current inconsistencies that I am not interrested in keeping in the first place. But I also don't think that SymPy needs yet another "new standard" of code generation. Maybe what I am looking for (being able to write algorithms based on SymPy expressions) is better suited being hosted as its own project.

Not that it counters your general argument, but in the given example, it uses _traverse_matrix_indices to find the matrix indices, so if that method is updated consistently it should still print the same thing I believe.

@ThePauliPrinciple
Copy link
Contributor Author

As for the strides as an arg, I don't think it is a problem to add it as an additional arg. But do be warry of this though, since it is difficult to store a str. If you store it as a Symbol, then suddenly IndexedBase.strides == 'F' returns False (if it directly grabs the arg and you used sympify to convert it to a Basic.

@bjodah
Copy link
Member

bjodah commented Sep 2, 2022

OK, I think I see what you mean with the tiered system. That approach also sounds good.

A separate project is of course easier to maintain since one gets to decide on deprecation rules etc. And it's always easier to incorporate a project into the SymPy codebase (once it has stabilized) than vice versa. For visibility you can add a link to it from SymPy's webpage (once it's somewhat stable).

Yes, you're probably right about the _traverse_matrix_indices, so the jury is still out on whether the current behavior is relied upon in the wild.

@ThePauliPrinciple
Copy link
Contributor Author

I still think that For in fortran is bugged for everyone, currently the c code would print N numbers, while fortran would print N+1 numbers because it both starts at 0 and includes the final number. I think it would be most consistent to start at zero and end at N-1.

@ThePauliPrinciple
Copy link
Contributor Author

Note that For is not related to the current automatic looping done by contract=True, that uses different code to print the for loops.

ThePauliPrinciple added a commit to ThePauliPrinciple/sympy that referenced this issue Sep 4, 2022
Several of the less controversial problems brought up in sympy#23998
are addressed.
ThePauliPrinciple added a commit to ThePauliPrinciple/sympy that referenced this issue Sep 4, 2022
Several of the less controversial problems brought up in sympy#23998
are addressed.

typo
@ThePauliPrinciple
Copy link
Contributor Author

I'm speculating, but: I think we could break out those "non-controversial" methods into base classes for respective language, which then our current backwards-compatible printers inherit from.

I'm actually starting to like this idea a bit more. Of course, the question then is what is "controversial". Maybe we should consider any processing as controversial. So that would include e.g. unrolling indexes for C. But also Assignment doing anything other than printing lhs = rhs.

The above operations could easily be coded in simple functions that return a new Expr with something that would print without that code being present in the printers. The printers could then have a preprocessors setting which is a list of such functions that are applied during doprint. Then we can just set the default value such that it reproduces the current behaviour. But it would then be an option for someone to also do unrolling indices for other languages if they wish to do so.

It's not clear to me how to seperate the classes though, since a lot of the extra logic in _print_Assignment is defined in the CodePrinter base class.

@bjodah
Copy link
Member

bjodah commented Sep 5, 2022

Perhaps just new classes overriding _print_Assignment et al. is the appropriate approach then?
I wouldn't mind forcing the user to actually do the two steps themselves:

>>> rewritten = rewrite(exprs, [einstein_sum_contraction, unroll_assignments, expand_integer_powers])
>>> actual_code = fcode(rewritten)

I find that the Python community often suffer from feature creep in ever growing lists of accepted keyword arguments to swiss-army-knife-like functions (I'm looking at you scipy....). And those who want to, can create their own convenience functions.

@ThePauliPrinciple
Copy link
Contributor Author

That's exactly what I have in mind, except that currently the printers are doing some of those things, so it would be difficult to get no clashes if you can't turn the current behaviour.

The problem with creating a new class that overrides the "controversial" methods in the current classes would be that there are good odds that someone in the future is going to create a new "controversial" method in the current classes, which then propagates through to the classes that are supposed to be "clean" and undoing that would mean breaking backwards compatibility when it is finally detected as an issue.

What did you have in mind for expand_integer_powers?

@bjodah
Copy link
Member

bjodah commented Sep 5, 2022

You're right, I can't think of any guard against that short of manual review of future pull requests.

What did you have in mind for expand_integer_powers?

I was just referring to create_expand_pow_optimization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants