Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up the creation of submatrices of Matrix_modn_dense_template matrices #36059

Merged
merged 24 commits into from Aug 27, 2023

Conversation

marizee
Copy link
Contributor

@marizee marizee commented Aug 10, 2023

Modified the method submatrix() to do a systematic call to memcpy().
Overrode methods matrix_from_rows(), matrix_from_columns() and matrix_from_rows_and_columns().

In the current version of submatrix(), only the case when the submatrix has as many columns as the matrix, relies on the contiguous memory storage of Matrix_modn_dense_template and call memcpy().
In fact, this can be applied to any set of arguments given to the method and enables better performances.

Overriding the other methods mentioned above allows to avoid calling get_unsafe() and set_unsafe() that are doing time-consuming casts/conversions that are unnecessary in this case.

📝 Checklist

  • The title is concise, informative, and self-explanatory.
  • The description explains in detail what this PR is about.
  • I have linked a relevant issue or discussion.
  • I have created tests covering the changes.
  • I have updated the documentation accordingly.

⌛ Dependencies

@marizee
Copy link
Contributor Author

marizee commented Aug 10, 2023

Here, some examples of timings with the current version of Sage:

sage: space = MatrixSpace(GF(9001), 10000, 10000)
sage: mat = space.random_element()
sage: l = [i for i in range(5000)]

sage: %timeit smat = mat.matrix_from_rows_and_columns(l,l)
1.74 s ± 8.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sage: %timeit smat = mat.matrix_from_rows(l)
3.26 s ± 33.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sage: %timeit smat = mat.matrix_from_columns(l)
4.46 s ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

sage: %timeit smat = mat.submatrix(0,0,5000,5000)
1.76 s ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sage: %timeit smat = mat.submatrix(0,0,5000)
901 ms ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sage: %timeit smat = mat.submatrix(0,0,ncols=5000)
3.46 s ± 21.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And the same measurements with the proposed changes:

sage: %timeit smat = mat.matrix_from_rows_and_columns(l,l)
519 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sage: %timeit smat = mat.matrix_from_rows(l)
910 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sage: %timeit smat = mat.matrix_from_columns(l)
1.68 s ± 2.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

sage: %timeit smat = mat.submatrix(0,0,5000,5000)
455 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sage: %timeit smat = mat.submatrix(0,0,5000)
911 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sage: %timeit smat = mat.submatrix(0,0,ncols=5000)
914 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Most of the time taken by submatrix() is spent in new_matrix():

sage: space = MatrixSpace(GF(9001), 10000, 10000)
sage: mat = space.random_element()
sage: %timeit smat = mat.submatrix(0,0,5000,5000)
447 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sage: %timeit nmat = mat.new_matrix(nrows=5000,ncols=5000)
427 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The improvement of new_matrix() is currently being discussed in #35961.

@vneiger
Copy link
Contributor

vneiger commented Aug 11, 2023

During review, I will add some timings to highlight some tested performance (the goal being to check that there is no regression). One caveat is that I have had to compile with the changes in #35961 otherwise comparison is too difficult due to long matrix creation. So these changes should probably be listed as a dependency of this PR, and be finalized first.

This is to check that, even when selecting many small pieces of the storage far away from each other in the memcpy calls in submatrix, this is still better than doing the loops in matrix_from_rows_and_columns:

sage: space = MatrixSpace(GF(9001), 10000, 10000)
sage: mat = space.random_element()
sage: lrows = [i for i in range(2000,2100)]
sage: lcols = [i for i in range(2000,2005)]
sage: %timeit nmat = mat.new_matrix(nrows=100,ncols=5)
9.23 µs ± 58.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
sage: %timeit smat = mat.submatrix(2000,2000,100,5)
10.1 µs ± 71.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
sage: %timeit smat = mat.matrix_from_rows_and_columns(lrows,lcols)
12.3 µs ± 99.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
sage: %timeit smat = mat.submatrix(2000,2000,100,100)
16 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
sage: %timeit smat = mat.matrix_from_rows_and_columns(lrows,lrows)
45 µs ± 821 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Comment on lines -3021 to -3022
cdef Matrix_modn_dense_template M = self.new_matrix(nrows=nrows, ncols=self._ncols)
memcpy(M._entries, self._entries+row*ncols, sizeof(celement)*ncols*nrows)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two lines could be kept, under a condition if col == 0 and ncols == self._ncols. This would make the copying a bit faster (singly memcpy) in this situation where one wants to retrieve a (contiguous) block of complete rows.

Comment on lines 3154 to 3156
this 3 times.

::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be, on a single line, this 3 times:: . This should also be fixed in matrix1.pyx.

for i, row in enumerate(rows):
if row < 0 or row >= self._nrows:
raise IndexError("row index out of range")
memcpy(A._entries+(i*self._ncols), self._entries+(row*self._ncols), sizeof(celement)*self._ncols)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to use _matrix:

memcpy(A._matrix[i], self._matrix[row], sizeof(celement)*self._ncols)

@vneiger
Copy link
Contributor

vneiger commented Aug 20, 2023

Apart from the above suggestions for minor modifications, this looks good to me. The performance improvements are particularly significant. Below is a table which shows speedups for various sizes.

  • matdim: to start with, we take a random square matrix of size matdim, over GF(9001)
  • row,col,nrows,ncols: using three different methods, we will take the submatrix starting at (row,col) with sizes nrows x ncols
  • getitem: we take the submatrix via Python indexing such as mat[5:10, 10:20]
  • matrix_: we take the submatrix via Sage's method matrix_from_rows, matrix_from_columns, matrix_from_rows_and_columns, depending on the context
  • submat: we take the submatrix using the submatrix method.

I did not compare thoroughly more heterogeneous cases (taking subsets of rows and columns, possibly in nonincreasing orders, etc.), but tests suggest improvements there as well. This is expected since the code is the same as before (for the matrix_ methods) except that the type casts are avoided.

In the table we see ratios old vs. new, showing that the new code is always faster, often by a large factor. The only case where it is slower (by a small margin, factor is between 0.9 and 1), is related to the suggestion in my review, to keep a separate case for retrieving a submatrix consisting of a block of complete rows.

In the table we also see (column copy%) the time taken for copying a matrix of size nrows x ncols compared to the time for the submatrix call, in percentage; and similarly the time taken for initializing a matrix of size nrows x ncols compared to the time for the submatrix call, in percentage. We see that in all tested cases the submatrix call is less than twice as long as either copying or initializing, which is a good indicator that the performance is now quite acceptable. (Note: the 1 at the end of the init% column should be neglected, they are related to the system's virtual memory / copy-on-write strategies when initializing a large zero matrix.)

matdim  row     col     nrows   ncols   getitem matrix_ submat  copy%   init%
2       0       0       1       2       1.00    0.99    0.97    15      98
2       0       0       2       1       1.01    1.00    1.17    15      94
2       0       0       1       1       1.00    1.00    1.17    15      99
2       1       0       1       2       1.01    1.01    0.95    15      97
2       0       1       2       1       1.02    0.99    1.16    15      94
2       1       1       1       1       1.02    1.01    1.18    15      100
5       0       0       2       5       1.06    1.05    0.94    15      95
5       0       0       5       2       1.05    1.06    1.21    15      94
5       0       0       2       2       1.01    1.03    1.18    15      99
5       1       0       2       5       1.05    1.05    0.94    15      96
5       0       1       5       2       1.05    1.06    1.22    15      95
5       1       1       2       2       1.04    1.02    1.17    15      98
5       2       0       2       5       1.06    1.06    0.97    15      97
5       0       2       5       2       1.06    1.07    1.23    15      95
5       2       2       2       2       1.03    1.03    1.19    15      99
10      0       0       5       10      1.27    1.32    0.96    15      98
10      0       0       10      5       1.25    1.30    1.52    15      96
10      0       0       5       5       1.12    1.13    1.34    15      98
10      2       0       5       10      1.29    1.29    0.95    15      96
10      0       2       10      5       1.27    1.29    1.49    15      92
10      2       2       5       5       1.17    1.18    1.36    15      100
10      5       0       5       10      1.29    1.29    0.97    15      95
10      0       5       10      5       1.30    1.30    1.52    15      94
10      5       5       5       5       1.15    1.17    1.38    15      99
20      0       0       10      20      2.07    2.19    0.96    16      95
20      0       0       20      10      1.98    2.14    2.46    15      92
20      0       0       10      10      1.52    1.56    1.87    16      99
20      5       0       10      20      2.12    2.18    0.95    16      94
20      0       5       20      10      1.92    2.14    2.41    15      91
20      5       5       10      10      1.50    1.55    1.86    16      98
20      10      0       10      20      2.08    2.18    0.96    16      95
20      0       10      20      10      1.95    2.15    2.45    15      93
20      10      10      10      10      1.51    1.55    1.84    15      99
50      0       0       25      50      7.48    8.05    0.92    17      91
50      0       0       50      25      5.51    7.44    8.82    16      88
50      0       0       25      25      3.73    3.95    5.59    16      96
50      12      0       25      50      7.40    8.00    0.93    16      92
50      0       12      50      25      5.48    7.40    8.85    16      88
50      12      12      25      25      3.70    3.96    5.56    17      97
50      25      0       25      50      7.43    7.93    0.91    16      90
50      0       25      50      25      5.47    7.49    9.01    17      90
50      25      25      25      25      3.75    3.99    5.41    16      95
100     0       0       50      100     21.98   23.18   0.88    19      81
100     0       0       100     50      9.85    18.76   24.89   18      78
100     0       0       50      50      7.52    8.00    15.88   17      89
100     25      0       50      100     21.84   23.27   0.91    19      81
100     0       25      100     50      10.11   19.24   25.66   18      79
100     25      25      50      50      7.64    8.08    16.03   18      90
100     50      0       50      100     21.97   23.40   0.89    19      81
100     0       50      100     50      9.99    18.76   25.64   18      78
100     50      50      50      50      7.66    7.98    15.69   18      90
200     0       0       100     200     55.61   57.90   0.88    31      63
200     0       0       200     100     13.92   34.06   61.22   30      60
200     0       0       100     100     12.08   12.33   42.74   23      73
200     50      0       100     200     56.16   58.02   0.89    30      63
200     0       50      200     100     13.66   34.61   61.15   30      60
200     50      50      100     100     12.00   12.37   42.78   22      74
200     100     0       100     200     56.64   58.13   0.89    32      64
200     0       100     200     100     13.62   34.41   62.32   30      61
200     100     100     100     100     12.04   12.40   42.91   23      74
500     0       0       250     500     110.10  110.68  1.45    49      43
500     0       0       500     250     16.38   36.43   118.66  48      42
500     0       0       250     250     16.54   15.82   103.39  45      50
500     125     0       250     500     107.14  108.03  1.45    50      45
500     0       125     500     250     15.81   37.89   122.50  46      42
500     125     125     250     250     16.47   16.18   106.56  43      49
500     250     0       250     500     109.66  109.60  1.49    50      44
500     0       250     500     250     16.64   37.44   121.04  47      41
500     250     250     250     250     15.96   16.16   108.01  45      50
1000    0       0       500     1000    67.58   67.26   1.02    52      44
1000    0       0       1000    500     15.07   10.45   66.95   45      39
1000    0       0       500     500     16.88   16.58   119.72  47      34
1000    250     0       500     1000    70.26   65.62   1.03    51      42
1000    0       250     1000    500     15.44   10.78   72.95   28      26
1000    250     250     500     500     16.04   16.44   114.41  45      33
1000    500     0       500     1000    76.04   76.26   1.21    32      29
1000    0       500     1000    500     15.04   10.44   62.05   46      37
1000    500     500     500     500     16.05   16.41   117.44  47      33
2000    0       0       1000    2000    48.13   48.20   0.90    55      36
2000    0       0       2000    1000    15.09   7.34    50.77   55      35
2000    0       0       1000    1000    16.04   16.72   63.85   38      29
2000    500     0       1000    2000    48.04   46.98   0.92    55      35
2000    0       500     2000    1000    15.88   7.31    55.71   43      25
2000    500     500     1000    1000    15.88   15.96   64.37   41      28
2000    1000    0       1000    2000    53.16   52.80   1.03    43      25
2000    0       1000    2000    1000    15.75   7.57    54.86   43      25
2000    1000    1000    1000    1000    15.76   15.83   62.95   37      30
5000    0       0       2500    5000    21.91   21.79   0.98    93      1
5000    0       0       5000    2500    11.17   6.09    23.32   92      1
5000    0       0       2500    2500    11.20   11.05   23.72   89      1
5000    1250    0       2500    5000    22.02   21.80   0.99    93      1
5000    0       1250    5000    2500    11.00   6.02    23.15   92      1
5000    1250    1250    2500    2500    11.07   11.07   23.29   90      1
5000    2500    0       2500    5000    21.88   21.77   1.01    93      1
5000    0       2500    5000    2500    11.33   6.16    23.27   91      1
5000    2500    2500    2500    2500    11.08   11.03   22.52   88      1

@vneiger
Copy link
Contributor

vneiger commented Aug 20, 2023

In case for future reference, the code that was used for doing the timings (this was run and results stored in dicts in two versions, before and after modifications, and then both dicts were loaded to compare the values).

field = GF(9001)
small_spaces = [MatrixSpace(field, dim, dim) for dim in [2,5,10,20,50,100]]
medium_spaces = [MatrixSpace(field, dim, dim) for dim in [100, 200, 500]]
large_spaces = [MatrixSpace(field, dim, dim) for dim in [1000, 2000, 5000]]
small_timeit_args = dict(number=10000, repeat=7, seconds=True)
medium_timeit_args = dict(number=1000, repeat=7, seconds=True)
large_timeit_args = dict(number=20, repeat=3, seconds=True)
spaces = [small_spaces, medium_spaces, large_spaces]
timeit_args = [small_timeit_args, medium_timeit_args, large_timeit_args]

time_dict = {}

for i in range(3):
    ti_args = timeit_args[i]
    for space in spaces[i]:
        print(f"####  matrix dims: {space.nrows()} x {space.ncols()}  ####")
        print(f"row\tcol\tnrows\tncols\tgetitem\tmatrix_\tsubmat\tcopy\tinit")
        mat = space.random_element()

        half = space.nrows()//2  # warning: working with square matrices only
        for start in [0, half//2, half]:
            end = start + half
            # from_rows:
            rmat = matrix.random(field, half, space.ncols())
            tcopy = timeit("copymat = rmat.__copy__()", **ti_args) # copy in this size
            tinit = timeit("initmat = mat.new_matrix(nrows=half)", **ti_args) # initialization in this size
            tgetitem = timeit("smat = mat[start:end]", **ti_args)
            trows = timeit("smat = mat.matrix_from_rows(list(range(start,end)))", **ti_args)
            tsub = timeit("smat = mat.submatrix(start,0,half)", **ti_args)
            print(f"{start}\t0\t{half}\t{space.ncols()}\t{tgetitem:.1e}\t{trows:.1e}\t{tsub:.1e}\t{tcopy:.1e}\t{tinit:.1e}")
            time_dict[(space.nrows(),start,0,half,space.ncols())] = [tgetitem,trows,tsub,tcopy,tinit]
            # from_columns:
            cmat = matrix.random(field, space.nrows(), half)
            tcopy = timeit("copymat = cmat.__copy__()", **ti_args) # copy in this size
            tinit = timeit("initmat = mat.new_matrix(ncols=half)", **ti_args) # initialization in this size
            tgetitem = timeit("smat = mat[:,start:end]", **ti_args)
            tcols = timeit("smat = mat.matrix_from_columns(list(range(start,end)))", **ti_args)
            tsub = timeit("smat = mat.submatrix(0,start,ncols=half)", **ti_args)
            print(f"0\t{start}\t{space.nrows()}\t{half}\t{tgetitem:.1e}\t{tcols:.1e}\t{tsub:.1e}\t{tcopy:.1e}\t{tinit:.1e}")
            time_dict[(space.nrows(),0,start,space.nrows(),half)] = [tgetitem,tcols,tsub,tcopy,tinit]
            # from_rows_and_columns:
            rcmat = matrix.random(field, half, half)
            tcopy = timeit("copymat = rcmat.__copy__()", **ti_args) # copy in this size
            tinit = timeit("initmat = mat.new_matrix(nrows=half,ncols=half)", **ti_args) # initialization in this size
            tgetitem = timeit("smat = mat[start:end,start:end]", **ti_args)
            trowcol = timeit("smat = mat.matrix_from_rows_and_columns(list(range(start,end)),list(range(start,end)))", **ti_args)
            tsub = timeit("smat = mat.submatrix(start,start,half,half)", **ti_args)
            print(f"{start}\t{start}\t{half}\t{half}\t{tgetitem:.1e}\t{trowcol:.1e}\t{tsub:.1e}\t{tcopy:.1e}\t{tinit:.1e}")
            time_dict[(space.nrows(),start,start,half,half)] = [tgetitem,trowcol,tsub,tcopy,tinit]

save(time_dict, f'benchmark_submatrices')


cdef Py_ssize_t i,r
for i,r in enumerate(range(row, row+nrows)) :
memcpy(M._entries + (i*ncols), self._entries+self._ncols*r+col, sizeof(celement)*ncols)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_matrix could be used here as well, probably with

memcpy(M._matrix[i], self._matrix[r]+col, sizeof(celement)*ncols)

cdef Matrix_modn_dense_template M = self.new_matrix(nrows=nrows, ncols=ncols)

if col == 0 and ncols == self._ncols:
memcpy(M._entries, self._entries+row*ncols, sizeof(celement)*ncols*nrows)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using _matrix :

memcpy(M._entries, self._matrix[row], sizeof(celement)*ncols*nrows)

@vneiger
Copy link
Contributor

vneiger commented Aug 21, 2023

Looks good to me, thanks!

@github-actions
Copy link

Documentation preview for this PR (built with commit a52a33f; changes) is ready! 🎉

@vneiger
Copy link
Contributor

vneiger commented Aug 22, 2023

Merge looks good.

vbraun pushed a commit to vbraun/sage that referenced this pull request Aug 23, 2023
…n_dense_template` matrices

    
<!-- ^^^^^
Please provide a concise, informative and self-explanatory title.
Don't put issue numbers in there, do this in the PR body below.
For example, instead of "Fixes sagemath#1234" use "Introduce new method to
calculate 1+1"
-->
<!-- Describe your changes here in detail -->
Modified the method `submatrix()` to do a systematic call to `memcpy()`.
Overrode  methods `matrix_from_rows()`, `matrix_from_columns()` and
`matrix_from_rows_and_columns()`.
<!-- Why is this change required? What problem does it solve? -->
In the current version of `submatrix()`, only the case when the
submatrix has as many columns as the matrix,  relies on the contiguous
memory storage of `Matrix_modn_dense_template` and call `memcpy()`.
In fact, this can be applied to any set of arguments given to the method
and enables better performances.

Overriding the other methods mentioned above allows to avoid calling
`get_unsafe()` and `set_unsafe()` that are doing time-consuming
casts/conversions that are unnecessary in this case.
<!-- If this PR resolves an open issue, please link to it here. For
example "Fixes sagemath#12345". -->
<!-- If your change requires a documentation PR, please link it
appropriately. -->

### 📝 Checklist

<!-- Put an `x` in all the boxes that apply. -->
<!-- If your change requires a documentation PR, please link it
appropriately -->
<!-- If you're unsure about any of these, don't hesitate to ask. We're
here to help! -->
<!-- Feel free to remove irrelevant items. -->

- [x] The title is concise, informative, and self-explanatory.
- [x] The description explains in detail what this PR is about.
- [ ] I have linked a relevant issue or discussion.
- [x] I have created tests covering the changes.
- [ ] I have updated the documentation accordingly.

### ⌛ Dependencies

<!-- List all open PRs that this PR logically depends on
- sagemath#12345: short description why this is a dependency
- sagemath#34567: ...
-->

<!-- If you're unsure about any of these, don't hesitate to ask. We're
here to help! -->
    
URL: sagemath#36059
Reported by: Marie Bonboire
Reviewer(s): Vincent Neiger
@vbraun vbraun merged commit 3dd31ba into sagemath:develop Aug 27, 2023
8 of 9 checks passed
@mkoeppe mkoeppe added this to the sage-10.2 milestone Aug 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants