Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accelerating the construction of matrices of type Matrix_modn_dense #35961

Closed
1 task done
vneiger opened this issue Jul 17, 2023 · 7 comments · Fixed by #36093
Closed
1 task done

Accelerating the construction of matrices of type Matrix_modn_dense #35961

vneiger opened this issue Jul 17, 2023 · 7 comments · Fixed by #36093

Comments

@vneiger
Copy link
Contributor

vneiger commented Jul 17, 2023

Problem Description

This follows on from the issue #28432 , which was created in 2019 and was targetting the construction of the zero matrix from the Matrix_modn_dense_template (classes Matrix_modn_dense_{double,float}). The observation was that a lot of (useless) modular reductions were performed upon creation, even though no such reduction is necessary when all the wanted entries are zero.

Here the issue targets more generally the construction of matrices of this type. There are several cases where creation is obviously too slow (beyond avoidable modular reductions) and there are also some inconsistencies.

For creating the zero matrix, some examples of slowliness/inconsistencies are already given in #28432 (comment) . There one notices that several natural ways of creating the zero matrix are significantly slower than copying the cached zero matrix, and that creating a zero matrix over the rationals is much faster than in the present case (over modular integers).

Consider also the identity matrix and diagonal matrices:

sage: field = GF(9001)
sage: space = MatrixSpace(field, 10000, 10000)
sage: _ = space.zero_matrix(); _ = space.identity_matrix()
sage: %timeit zmat = space.zero()
1.2 s ± 9.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sage: %timeit zmat = space.one()
217 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sage: %timeit zmat = space(1)
1.23 s ± 3.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sage: d = [field.random_element() for k in range(10000)]
sage: %timeit dmat = space.diagonal_matrix(d)
199 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Creating a diagonal matrix or the identity matrix with space.one() is faster than creating the the identity with space(1). This is because some of them rely on copies of cached matrices, whereas space.zero() does not. Indeed, copying is about 200ms:

sage: %timeit cmat = copy(space.identity_matrix())
199 ms ± 1.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Creating a dense random matrix takes as long as some of the above (very special and sparse) cases:

sage: %timeit rmat = space.random_element()
1.16 s ± 5.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Proposed Solution

The constructor seems to parse all the entries of the matrix, whatever the type of matrix (MA_ENTRIES_* from args.pyx). Many other classes (rationals, integers, GF(2), ..) use an iterator with sparse=True, which should bring great improvements at least for the zero matrix and diagonal matrices. It seems this could be adapted here.

A more complete study of the performance of the different methods of matrix construction (types MA_ENTRIES_*) should be conducted, to find out if other cases have to be tackled. Also, one should be careful not to introduce performance regressions, e.g. for some types of constructions or for small matrix sizes.

Alternatives Considered

At least, one should use copies of the cached zero/identity matrices whenever this is faster.

Additional Information

No response

Is there an existing issue for this?

  • I have searched the existing issues for a bug report that matches the one I want to file, without success.
@vneiger
Copy link
Contributor Author

vneiger commented Jul 17, 2023

@marizee @ClementPernet

@marizee
Copy link
Contributor

marizee commented Jul 25, 2023

Using a sparse iterator as suggested is indeed speeding up the creation of the zero matrix :

sage: field = GF(9001)
sage: space = MatrixSpace(field, 10000, 10000)
sage: _ = space.zero_matrix(); _ = space.identity_matrix()
sage: %timeit zmat = space.zero()
16.8 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This also accelerates the constructor for the matrix argument of type MA_ENTRIES_SCALAR (creating a diagonal of ones, in this case):

sage: %timeit zmat = space(1)
16.5 ms ± 7.89 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

However, there are improvements that still need to be made, for example regarding the creation of a diagonal matrix and the identity matrix :

sage: %timeit zmat = space.one()
374 ms ± 1.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sage: d = [field.random_element() for k in range(10000)]
sage: %timeit dmat = space.diagonal_matrix(d)
340 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Indeed we expect the timings to be closer than they are currently, in particular for space(1) versus space.one().

@marizee
Copy link
Contributor

marizee commented Aug 1, 2023

The problem (slow identity) seems to be due to copying the zero matrix. In matrix_space.py, the lazy attribute _copy_zero indicates whether it is faster to copy the zero matrix or to create a new matrix from scratch. In our case, it returns False (at least for matrices that are not very small).

I've added a condition in identity_matrix and diagonal_matrix, to avoid making the copy:

        if self._copy_zero:
            A = self.zero_matrix().__copy__()
        else:
            A = self.matrix()

This does improve the timings and there is no surprising discrepancy anymore (compare this to the previous message):

sage: field = GF(9001)
sage: space = MatrixSpace(field, 10000, 10000)
sage: _ = space.zero_matrix(); _ = space.identity_matrix()
sage: %timeit idmat = space.one()
20.7 ms ± 67.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
sage: d = [field.random_element() for k in range(10000)]
sage: %timeit dmat = space.diagonal_matrix(d)
18.5 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I noticed that _copy_zero is not used anywhere in Sage. Is it relevant? Some further benchmarks are needed to see what dimension threshold is for _copy_zero for Matrix_modn_dense matrices.

On another note: the __cinit__ method is only doing a malloc, not zeroing the allocated memory for the new matrix. Thus, when creating e.g. a diagonal matrix with the iterator with sparse=True, the entries outside are not necessarily all zero in the matrix. One solution is to add a memset at the beginning of __init__, yet this turns out to be quite time-consuming. A faster solution is to use a calloc instead of malloc in __cinit__. For this solution the only downside seems to be that this slows down the __copy__ method (the others are not really impacted). I will investigate this further to see how to avoid this slowdown of __copy__.

@vneiger
Copy link
Contributor Author

vneiger commented Aug 2, 2023

Good catch, with _copy_zero. For these modn_dense matrices it is not clear that this is ever a good idea to copy zero rather than doing a malloc/calloc... This brings back the discussion from sage-devel/c/R4-xnstyS-A, and the related #11589 which also mentions constructing matrices with entries = None versus entries = 0.

Some tests for when to use _copy_zero had been done previously, see #11589 (comment) . I cannot find another issue mentioning why _copy_zero is not used anymore... one should probably search in the git history...

Regarding None versus zero, it seems that the current code is not always behaving best (in these timings below, I used the sparse iterator and calloc as you suggest):

sage: space = MatrixSpace(GF(9001), 10000, 10000)

sage: %timeit zmat = space.matrix()
18.3 µs ± 130 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
sage: %timeit zmat = space(None)
18.1 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

These two calls build the zero matrix using entries = None. But the following ones call the constructor with entries=0, which is much slower for large matrices:

sage: %timeit zmat = space()
12.1 ms ± 65.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
sage: %timeit zmat = zero_matrix(GF(9001), 10000, 10000)
12.3 ms ± 89.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
sage: %timeit zmat = space(0)
12.1 ms ± 64.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It is expected for the last one, but for the two others one could expect the same efficiency as above when using entries=None. For the first one, space(), it seems tricky at first, I'm not sure where the 0 appears. For the second one, zero_matrix, we may simply change the line

    return matrix_space.MatrixSpace(ring, nrows, ncols, sparse)(0)

(in special.py, 994) into something with None instead of 0, since all matrix spaces are supposedly supporting None and depending on the matrix space this should either not change performances, or improve them.

@vneiger
Copy link
Contributor Author

vneiger commented Aug 3, 2023

For the issue of slowing down copy when using calloc instead of malloc in the initialization, I would suggest adding an optional parameter to __cinit__ so that it does calloc by default (so we keep the great speedup for creating zero), but allows a malloc if wanted (to avoid the calloc overhead when copying). So, something along these lines:

    def __cinit__(self, *args, bint zeroed_alloc=True, **kwds):
        [...]
        if zeroed_alloc :
            self._entries = <celement *>check_calloc(self._nrows * self._ncols, sizeof(celement))
            self._matrix = <celement **>check_calloc(self._nrows, sizeof(celement*))
        else:
            self._entries = <celement *>check_allocarray(self._nrows * self._ncols, sizeof(celement))
            self._matrix = <celement **>check_allocarray(self._nrows, sizeof(celement*))
        [...]

and then, in __copy__,

        A = self.__class__.__new__(self.__class__, self._parent, 0, 0, 0, zeroed_alloc=False)

@marizee
Copy link
Contributor

marizee commented Aug 6, 2023

I also suggest to add zeroed_alloc=False in the other methods calling __new__, which are __neg__, _lmul_, _add_ and _sub_.
Zeroing them beforehand doesn't seem to be necessary in these cases so this might be useless computing.

In the methods mentioned above, some arguments given to __new__ are None whereas in __copy__, the matching ones are 0:

def __neg__(self):
    [...]
    M = self.__class__.__new__(self.__class__, self._parent,None,None,None, zeroed_alloc=False)
    [...]
def __copy__(self):
    [...]
    A = self.__class__.__new__(self.__class__, self._parent, 0, 0, 0, zeroed_alloc=False)
    [...]

Is there any reason for them to be different ? I tried replacing the 0 into None and the timings were not affected.

@vneiger
Copy link
Contributor Author

vneiger commented Aug 7, 2023

also suggest to add zeroed_alloc=False in the other methods calling new, which are neg, lmul, add and sub.
Zeroing them beforehand doesn't seem to be necessary in these cases so this might be useless computing.

Yes, indeed. There also may be other places where this __new__ is used (e.g. _mod_int_c in matrix_integer_dense).

Is there any reason for them to be different ? I tried replacing the 0 into None and the timings were not affected.

I do not think there is a difference since these arguments don't play a role in the initialization/allocation. Using consistently the same one (e.g. None which is used more often) would be better.

vbraun pushed a commit to vbraun/sage that referenced this issue Aug 12, 2023
… type MA_ENTRIES_ZERO

    
There are many ways to specify entries when creating a matrix, which are
handled in `args.pyx` with `MatrixArgs` class. It has been discussed
before why allowing `entries=None` in the matrix creation methods can be
beneficial in terms of performance
(sagemath#11589 ,
sagemath#12020). This input
`entries=None` is required to yield the zero matrix. For example:
`matrix(QQ, 4, 4)` creates the zero matrix. This corresponds to the type
`MA_ENTRIES_ZERO` in `args.pyx`. When one passes a value, e.g.
`matrix(QQ, 4, 4, 2)`, then one gets a diagonal matrix with `2` on the
diagonal; this corresponds to the type `MA_ENTRIES_SCALAR` in
`args.pyx`.

Currently, doing something like `matrix(QQ, 4, 4, 0)` will pick
`MA_ENTRIES_SCALAR`, and therefore will build the matrix and fill the
diagonal with zero. [Behind the scenes, there is still some
acknowledgement that this is not the usual scalar matrix case, since
this will not fail if the matrix is not square (`matrix(QQ, 4, 5, 0)`
will not fail, but `matrix(QQ, 4, 5, 1)` will). But this is still not
seen as `MA_ENTRIES_ZERO`.] This PR ensures the type `MA_ENTRIES_ZERO`
is picked in this situation. This improves performance and solves some
issues, as noted below.  This PR also fixes the related issue
sagemath#36065 .

In fact, `entries=None` is the default in the `__init__` of all matrix
subclasses presently implemented. It also used to be the default when
constructing a matrix by "calling" a matrix space, until
sagemath#31078 and https://github.com/sag
emath/sage/commit/cef613a0a57b85c1ebc5747185213ae4f5ec35f2 which changed
this default value from `None` to `0`, bringing some performance
regression.

Regarding this "calling the matrix space", this PR also solves the
performance issue noted in
sagemath#35961 (comment) ,
where it was observed that in the following piece of code:
```
sage: space = MatrixSpace(GF(9001), 10000, 10000)
sage: %timeit zmat = space.matrix()
18.3 µs ± 130 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops
each)
sage: %timeit zmat = space()
12.1 ms ± 65.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
the called default is not the same. `space.matrix()` directly
initializes the matrix through `entries=None`, but on the other hand
`space()` actually calls the constructor with `entries=0`. This
performance regression comes from
sagemath#31078 and https://github.com/sag
emath/sage/commit/cef613a0a57b85c1ebc5747185213ae4f5ec35f2 , where the
default for construction from the matrix space was changed from `None`
to `0`. This cannot be easily reverted: this is now the default in the
`__call__` of the `Parent` class. So this PR does not directly revert
the call default to `None`  somehow, but the result is very close in
effect (read: the overhead is small). Unchanged: through the parent
call, `0` is converted to the base ring zero, which is passed to the
constructor's `entries` which is then analyzed as `MA_ENTRIES_SCALAR`.
Changed: the code modifications ensure that soon enough it will be
detected that this is in fact `MA_ENTRIES_ZERO`. The small overhead
explains why, despite the improvements, construction with `None` is
still sometimes slightly faster than with `0`.

Below are some timings showing the improvements for some fields. Also,
this PR merged with the improvements discussed in
sagemath#35961 will make the above
timings of `space.matrix()` and `space()` be the same (which means a
speed-up of a factor >500 for this call `space()`...). The measurement
is for construction via calling the matrix space: `None` is
`space(None)`, `Empty` is `space()`, `Zero` is `space(0)`, and `Nonzero`
is `space(some nonzero value)`.

```
NEW TIMES

field   dim     None    Empty   Zero    Nonzero
GF(2)   5       2.3e-06 3.2e-06 3.6e-06 3.2e-06
GF(2)   50      2.4e-06 3.3e-06 3.6e-06 5.8e-06
GF(2)   500     3.6e-06 4.5e-06 4.8e-06 3.1e-05
GF(512) 5       2.6e-05 2.8e-05 2.9e-05 2.9e-05
GF(512) 50      2.6e-05 2.9e-05 2.9e-05 4.0e-05
GF(512) 500     3.7e-05 3.8e-05 3.9e-05 1.6e-04
QQ      5       2.2e-06 3.3e-06 3.4e-06 3.2e-06
QQ      50      8.0e-06 9.2e-06 9.4e-06 1.2e-05
QQ      500     6.1e-04 6.3e-04 6.4e-04 6.7e-04

OLD TIMES

field   dim     None    Empty   Zero    Nonzero
GF(2)   5       2.3e-06 3.5e-06 3.6e-06 3.7e-06
GF(2)   50      2.4e-06 6.0e-06 6.1e-06 6.0e-06
GF(2)   500     3.6e-06 3.0e-05 3.0e-05 3.0e-05
GF(512) 5       2.5e-05 2.8e-05 2.9e-05 2.9e-05
GF(512) 50      2.5e-05 3.9e-05 4.0e-05 4.0e-05
GF(512) 500     3.5e-05 1.5e-04 1.5e-04 1.6e-04
QQ      5       2.2e-06 3.5e-06 3.7e-06 3.7e-06
QQ      50      7.9e-06 1.2e-05 1.2e-05 1.2e-05
QQ      500     6.4e-04 6.9e-04 6.9e-04 6.9e-04
```

Code used for the timings:
```
time_kwds = dict(seconds=True, number=20000, repeat=7)

fields = [GF(2), GF(2**9), QQ]
names = ["GF(2)", "GF(512)", "QQ"]
vals = [GF(2)(1), GF(2**9).gen(), 5/2]

print(f"field\tdim\tNone\tEmpty\tZero\tNonzero")
for field,name,val in zip(fields,names,vals):
    for dim in [5, 50, 500]:
        space = MatrixSpace(field, dim, dim)
        tnone = timeit("mat = space(None)", **time_kwds)
        tempty = timeit("mat = space()", **time_kwds)
        tzero = timeit("mat = space(0)", **time_kwds)
        tnonz = timeit("mat = space(1)", **time_kwds)
        print(f"{name}\t{dim}\t{tnone:.1e}\t{tempty:.1e}\t{tzero:.1e}\t{
tnonz:.1e}")
```

### 📝 Checklist

- [x] The title is concise, informative, and self-explanatory.
- [x] The description explains in detail what this PR is about.
- [x] I have linked a relevant issue or discussion.
- [x] I have created tests covering the changes.
- [x] I have updated the documentation accordingly.

### ⌛ Dependencies
    
URL: sagemath#36068
Reported by: Vincent Neiger
Reviewer(s): Matthias Köppe, Vincent Neiger
@mkoeppe mkoeppe added this to the sage-10.2 milestone Aug 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants