
implement fast linear algebra modulo n < 2^31 #4968

Open
malb opened this issue Jan 12, 2009 · 37 comments

Comments

@malb
Member

malb commented Jan 12, 2009

The switch would allow us to cover larger primes with the specialised code (I currently work with primes ~ 2^32).

The switch requires changing all mentions of int to long and probably implementing arith_long.

Depends on #10281

CC: @simon-king-jena @sagetrac-mraum

Component: linear algebra

Issue created by migration from https://trac.sagemath.org/ticket/4968

@sagetrac-mabshoff
Copy link
Mannequin

sagetrac-mabshoff mannequin commented Jan 12, 2009

Replying to @malb:

The switch requires changing all mentions of int to long and probably implementing arith_long.

I am not sure using long here is a good idea, since on various platforms we want to support, sizeof(long) == sizeof(int). You might want to replace it with a typedef so we can make it long long on those platforms.

Cheers,

Michael
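
For concreteness, a throwaway cell to check the widths in question (illustrative only; not from any patch):

%cython
def type_widths():
    # On LP64 Linux, long is 8 bytes; on platforms where
    # sizeof(long) == sizeof(int) == 4, plain long cannot serve as
    # the 64-bit element type.
    return sizeof(int), sizeof(long), sizeof(long long)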

@malb
Member Author

malb commented Jan 12, 2009

comment:2

I am not sure using long here is a good idea since on various platforms we want to
support sizeof(long) == sizeof (int). You might want to replace it by some typedef
so we can make it long long on those platforms.

Is your point that the maximum prime for which we use Matrix_modn_dense shouldn't vary across platforms? I didn't think of it before, but I can see why this would be desirable. How about "long long" across the board?

@williamstein
Contributor

comment:3

How about "long long" across the board?

Won't that have negative implications? For example, on a 32-bit machine all matrices will take twice as much memory to store, and all arithmetic will be an order of magnitude slower. That certainly wouldn't make me happy!

%cython
def foo(n):
    cdef long long i = 10, j = 10093, k
    cdef int z
    for z in range(n):
        k += i*j
    return k
///
%cython
def bar(n):
    cdef int i = 10, j = 10093, k
    cdef int z
    for z in range(n):
        k += i*j
    return k
///
time foo(10^8)
///

587410688573299048L
CPU time: 0.29 s,  Wall time: 0.29 s
time bar(10^8)
///

-171048080
CPU time: 0.04 s,  Wall time: 0.04 s
0.29/0.04
///

7.25000000000000

@sagetrac-mabshoff
Mannequin

sagetrac-mabshoff mannequin commented Jan 13, 2009

comment:4

Replying to @williamstein:

How about "long long" across the board?

Won't that have negative implications? For example, on a 32-bit machine all matrices will take twice as much memory to store, and all arithmetic will be an order of magnitude slower. That certainly wouldn't make me happy!

Well, assuming we will transparently switch over to another implementation (using gmpz's, for example), we should use the "fast" integer type, i.e. 32-bit ints and a 64-bit int type on 64-bit boxen. But my original point remains: on

  • 64-bit Solaris
  • 64-bit Windows

sizeof(long) == sizeof(int) == 4, so using either a C99 type or some platform-dependent typedef is what we need here to get optimum performance on all 64-bit platforms. Since we are already requiring C99 due to FLINT, we might as well go for that solution.

Cheers,

Michael
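
A minimal sketch of that C99 route (mod_int is an illustrative name, not an actual Sage typedef):

%cython
# The fixed-width types from stdint.h have the same size on every
# platform, unlike plain long, and C99 is already a requirement via
# FLINT.
from libc.stdint cimport uint64_t

ctypedef uint64_t mod_int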

@burcin

burcin commented Jan 15, 2009

comment:5

I don't know what arith_long is, but the long_extras module of FLINT might help here.

@malb
Member Author

malb commented Aug 30, 2011

comment:7

I'd say this is now a duplicate of #4260 and should be closed?

@burcin

burcin commented Sep 22, 2011

comment:8

#4260 doesn't cover your request of modulus > 2^32 from the ticket description. Maximum modulus there is 2^23.

We could use the template class from #4260 to provide a matrix class using long longs.

@williamstein changed the title from "switch Matrix_modn_dense from ints to CPU longs" to "implement fast linear algebra modulo n < 2^31" on Mar 23, 2012
@williamstein
Contributor

Dependencies: #10281

@williamstein
Contributor

comment:11

Replying to @burcin:

#4260 doesn't cover your request of modulus > 2^32 from the ticket description. Maximum modulus there is 2^23.

We could use the template class from #4260 to provide a matrix class using long longs.

I elected not to do this because:

  • the template class is very, very heavily tied to linbox, and disentangling that seemed insane. We can't use linbox anyways for primes above 2^23.

  • much of the work in writing this patch was spent writing a lot of really good tests that truly exercise the new functionality, and in debugging and properly using the old Strassen code for matrix multiplication and echelon forms. None of that would have been made better by using the template.

@robertwb
Contributor

comment:12

matrix_window_modn_dense_uint64.pyx looks an awful lot like matrix_window_modn_dense.pyx (and the same could be said of the non-window files); is there any reason to keep both?

@malb
Member Author

malb commented Apr 18, 2012

comment:13

William, are you still working on this code? I was about to start reviewing but then saw that you didn't reply to Robert's comment, hence wondering how high priority it is for you at the moment.

@williamstein
Contributor

comment:14

Replying to @malb:

William, are you still working on this code? I was about to start reviewing but then saw that you didn't reply to Robert's comment, hence wondering how high priority it is for you at the moment.

Yes! Please review! Regarding Robert's comment -- no, there is no reason to keep both. But maybe I can put all the cleanup in another patch, just in case there is a reason. Just to keep things simple.

I care very much about this code.

@robertwb
Contributor

apply only this patch

@robertwb
Contributor

comment:15

Attachment: 4968-squashed.patch.gz

I've squashed this into just changing matrix_modn_dense directly, which should make it both easier to see the changes and avoid adding a (modified) second copy of everything.

@robertwb
Contributor

comment:16

It's also rebased on top of sage-5.0.beta15.

@malb
Member Author

malb commented Apr 19, 2012

comment:17

32-bit vs 64-bit

  • I think there is a bit of confusion about 32-bit vs. 64-bit in the patch, or at least I am confused. If I understand correctly, ctypedef unsigned long uint_fast64_t is used to define the data type of the matrix. However, unsigned long is not necessarily 64-bit; on 32-bit machines it may be 32-bit. Arithmetic uses long long, which is at least 64-bit. So either the data type should not mention 64 bits, or it should be moved over to e.g. uint64_t or unsigned long long (which doesn't exist in older C++ standards, FWIW); see the sketch after this list. Once this is fixed, there seem to be a few places like "Reduce the integer matrix modulo a positive machine size integer." and "stored as a 64-bit unsigned int" which might need adapting too.

  • I am not sure about this code:

  if use_32bit_type(p): 
        raise RuntimeError, "BUG: p (=%s) is too small to use this matrix type"%p 

Why raise an error? Of course, the LinBox matrices take care of that by default; I am just confused by this. Also, this seems to contradict "If the prime is small enough to fit in a byte" later.

  • I assume this long long vs unsigned long long is the reason for restricting the modulus to 2^31?
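
For reference, the two Cython spellings at issue behave quite differently (a hedged sketch; not the patch's actual code):

%cython
# A bare "ctypedef unsigned long my_u64" would emit a C typedef, so
# such a 64-bit name really would be unsigned long in the generated
# C file.  Inside an extern block, however, the ctypedef merely
# declares to Cython that the type exists; the generated C uses the
# real definition from stdint.h:
cdef extern from "stdint.h":
    ctypedef unsigned long uint_fast64_t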

mod_int vs. mod_int_uint64

  • I think Robert's merge introduced a bug in pickling. We are at version 10 of pickling for Matrix_modn_dense but William started counting at zero again for his new type. Now that we're back to the old type, we shouldn't check for version == 0.

Other stuff

  • "TODO: should free up memory allocated so far -- this would leak otherwise, in exactly a situation where a lot leaks" Shouldn't this be fixed?

  • Of course, far from mandatory, but couldn't we -- while we are at it -- also fix those raise OverflowError, string to read raise OverflowError(string), which is Python 3 style?

  • performance: it seems picking a bigger Strassen cutoff would be quite beneficial:

sage: A = random_matrix(GF(previous_prime(2^24)),1000,1000)
sage: %time A*A                               
CPU times: user 2.67 s, sys: 0.00 s, total: 2.68 s
Wall time: 2.69 s
1000 x 1000 dense matrix over Finite Field of size 16777213
sage: %time A._multiply_strassen(A,100)                    
CPU times: user 1.12 s, sys: 0.00 s, total: 1.12 s
Wall time: 1.13 s
1000 x 1000 dense matrix over Finite Field of size 16777213

@sagetrac-mraum
Copy link
Mannequin

sagetrac-mraum mannequin commented Apr 26, 2012

comment:18

The unsigned long thing seems OK (int_fast64_t is a C type, defined in stdint.h). For consistency it would be good if MAX_MODULUS were INTEGER_MOD_INT64_LIMIT, to be found in sage/rings/finite_rings/stdint.h; this is less by one, as in integer_mod.pyx the modulus is tested using <= while in the patch < is used.

Apart from this, I more or less agree with all points made previously. If you don't see a good reason to change the Strassen cutoff, I would leave it as is and file a new ticket instead. We should optimize it, but changing it based on one observation is daring.

Honestly, I am against the if use_32bit_type(p) test. The matrix works perfectly for such p. Even though it shouldn't be called, it may be called.

If anybody has time to make the changes, I will test it within the next two weeks and it will have a positive review.

@malb
Member Author

malb commented Apr 26, 2012

comment:19

Replying to @sagetrac-mraum:

The unsigned long thing seems OK (int_fast64_t is a C type, defined in stdint.h). For consistency it would be good if MAX_MODULUS were INTEGER_MOD_INT64_LIMIT, to be found in sage/rings/finite_rings/stdint.h; this is less by one, as in integer_mod.pyx the modulus is tested using <= while in the patch < is used.

You are right, I got confused about Cython ctypedefs. "unsigned long" does not appear in a meaningful way in the generated C files. Sorry for the noise. Still, the documentation is wrong in various places and should be fixed (machine-sized ints).

Apart from this I more or less agree with all points made previously. If you don't see a good reason to change the Strassen cutoff, I would leave it at is. File a new ticket instead. We should optimize it, but changing it based on one observation is daring.

Fair enough, so let's make more observations. This code is about performance. I am not saying let's run a comprehensive optimisation suite, but not at least trying to use cutoffs which make the code reasonably fast seems wrong.

@sagetrac-mraum
Mannequin

sagetrac-mraum mannequin commented Apr 26, 2012

comment:20

I don't think this code is about performance. The new code will only be used in cases that were not available previously (see the cases treated in matrix_int_dense). This is why I would like to separate this patch from the optimization.

Also, please keep in mind that William is not the only one desperate to have Sage use larger moduli (me certainly; John and his student recently posted on sage-devel; and most likely many more).

@malb
Member Author

malb commented Apr 26, 2012

comment:21

Replying to @sagetrac-mraum:

I don't think this code is about performance. The new code will only be used in cases that were not available previously (see the cases treated in matrix_int_dense). This is why I would like to separate this patch and the optimization.

Sage can already compute with larger primes, it's just terribly slow; this patch fixes that.

Also, please keep in mind, that not only William is desperate for having Sage use larger moduli (me certainly, John and his student recently posted on sage-devel, and most likely many more).

Fair enough. I am not asking anyone to spend days and days improving the performance, but I am asking those who care about this patch to spend 10 minutes to test various cutoff parameters so that we have a reasonable default. If there are that many people who care about this, this should be easy.

@sagetrac-mraum
Mannequin

sagetrac-mraum mannequin commented Apr 27, 2012

comment:22

I acknowledge that we really see this ticket from different perspectives.

Would you be so kind as to run some of the tests that you think give a reasonable cutoff? If you post the commands, I can run them myself (on AMD and Intel). Based on this we can choose one.

@malb
Member Author

malb commented Apr 27, 2012

comment:23

I'd say just running

sage: A = random_matrix(GF(previous_prime(2^24)),1000,1000)
sage: B = random_matrix(GF(previous_prime(2^24)),1000,1000)
sage: %time A._multiply_strassen(B, 20)
sage: %time A._multiply_strassen(B, 50)
sage: %time A._multiply_strassen(B,100)
sage: %time A._multiply_strassen(B,150)
sage: %time A._multiply_strassen(B,200)

should probably do the trick. I posted timings on my machine above. I'll run some timings on other machines as well.

I'm really not suggesting to spend hours and hours on this, just that we should pick a decent default value, i.e., my impression is that 20 is bad on most machines. I might be wrong though.
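
To sweep several cutoffs in one go, something along these lines (plain Sage, same calls as above) should work:

sage: A = random_matrix(GF(previous_prime(2^24)), 1000, 1000)
sage: B = random_matrix(GF(previous_prime(2^24)), 1000, 1000)
sage: for cutoff in [20, 50, 100, 150, 200]:
....:     t = cputime()
....:     _ = A._multiply_strassen(B, cutoff)
....:     print cutoff, cputime(t)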

@williamstein
Contributor

comment:24

I'm completely with mraum here. In fact I have run tests like you suggest to set the current parameters. You'll find optimal parameters differ dramatically depending on the computer and the OS... e.g., Mac OS X could be totally different than Linux; Intel versus AMD, etc. I can see almost no point in doing what you suggest above. Either do things right (with a database for many machines), or leave it.

@malb
Member Author

malb commented Apr 27, 2012

comment:25

sage.math:

24 bits

sage: %time _ = A._multiply_strassen(B, 20)
CPU times: user 3.31 s, sys: 0.01 s, total: 3.32 s
Wall time: 3.32 s

sage: %time _ = A._multiply_strassen(B, 50)
CPU times: user 2.02 s, sys: 0.03 s, total: 2.05 s
Wall time: 2.04 s

sage: %time _ = A._multiply_strassen(B,100)
CPU times: user 1.90 s, sys: 0.01 s, total: 1.91 s
Wall time: 1.91 s

31 bits

sage: %time _ = A._multiply_strassen(B,20)
CPU times: user 4.06 s, sys: 0.00 s, total: 4.06 s
Wall time: 4.05 s
sage: %time _ = A._multiply_strassen(B,50)
CPU times: user 2.95 s, sys: 0.00 s, total: 2.95 s
Wall time: 2.95 s

sage: %time _ = A._multiply_strassen(B,100)
CPU times: user 3.00 s, sys: 0.00 s, total: 3.00 s
Wall time: 3.00 s

sage: %time _ = A._multiply_strassen(B,200)
CPU times: user 3.20 s, sys: 0.00 s, total: 3.20 s
Wall time: 3.20 s

@malb
Member Author

malb commented Apr 27, 2012

comment:26

Replying to @williamstein:

I'm completely with mraum here. In fact I have run tests like you suggest to set the current parameters. You'll find optimal parameters differ dramatically depending on the computer and the OS... e.g., Mac OS X could be totally different than Linux; Intel versus AMD, etc. I can see almost no point in doing what you suggest above. Either do things right (with a database for many machines), or leave it.

I find it odd to claim this before even trying (perhaps you did and I missed that though).

@williamstein
Contributor

comment:27

I find it odd to claim this before even trying (perhaps you did and I missed that though).

I did a lot of timings like the above on my laptop when writing this patch...

@malb
Member Author

malb commented Apr 27, 2012

comment:28

So, was 20 a good choice on your laptop? (I am asking because the 20 cutoff is older than your patch.)

FWIW I'll try this on bsd.math and cicero.skynet. If those too suggest increasing the cutoff to > 20, I suggest we do that. If only bsd.math and sage.math agree, I'd still say it's worth increasing the limit, because these are more mainstream machines than cicero.skynet.

@malb
Member Author

malb commented Apr 28, 2012

comment:29

Btw., it seems something went wrong with the squashing; there are still mentions of _uint64 in the patch.

@malb
Member Author

malb commented Apr 28, 2012

comment:30

bsd.math

sage: A = random_matrix(GF(previous_prime(2^24)),1000,1000)
sage: B = random_matrix(GF(previous_prime(2^24)),1000,1000)
sage: %time _ = A._multiply_strassen(B, 20)
CPU times: user 2.87 s, sys: 0.01 s, total: 2.88 s
Wall time: 2.88 s

sage: %time _ = A._multiply_strassen(B, 50)
CPU times: user 1.47 s, sys: 0.01 s, total: 1.47 s
Wall time: 1.47 s

sage: %time _ = A._multiply_strassen(B,100)
CPU times: user 1.24 s, sys: 0.01 s, total: 1.24 s
Wall time: 1.24 s

sage:  %time _ = A._multiply_strassen(B,150)
CPU times: user 1.16 s, sys: 0.01 s, total: 1.17 s
Wall time: 1.17 s

sage: %time _ = A._multiply_strassen(B,200)
CPU times: user 1.16 s, sys: 0.01 s, total: 1.17 s
Wall time: 1.17 s
sage: A = random_matrix(GF(previous_prime(2^31)),1000,1000)
sage: B = random_matrix(GF(previous_prime(2^31)),1000,1000)

sage: %time A._multiply_strassen(B, 20)
CPU times: user 3.72 s, sys: 0.00 s, total: 3.72 s
Wall time: 3.72 s
1000 x 1000 dense matrix over Finite Field of size 2147483647

sage: %time _ = A._multiply_strassen(B, 50)
CPU times: user 2.55 s, sys: 0.01 s, total: 2.55 s
Wall time: 2.55 s

sage: %time _ = A._multiply_strassen(B,100)
CPU times: user 2.53 s, sys: 0.01 s, total: 2.54 s
Wall time: 2.54 s

sage: %time _ = A._multiply_strassen(B,150)
CPU times: user 2.68 s, sys: 0.00 s, total: 2.69 s
Wall time: 2.69 s

sage: %time _ = A._multiply_strassen(B,200)
CPU times: user 2.68 s, sys: 0.00 s, total: 2.69 s
Wall time: 2.69 s

So 100 is much better than 20, but not necessarily optimal.

@malb
Member Author

malb commented Apr 29, 2012

comment:31

Cicero is less conclusive as it's too loaded for benchmarking. Anyway, here are the numbers:

2^24

sage: A = random_matrix(GF(previous_prime(2^24)),1000,1000)
sage: B = random_matrix(GF(previous_prime(2^24)),1000,1000)

sage: %time _ = A._multiply_strassen(B, 20)
CPU times: user 25.93 s, sys: 0.03 s, total: 25.95 s
Wall time: 85.44 s

sage: %time _ = A._multiply_strassen(B, 50)
CPU times: user 21.48 s, sys: 0.04 s, total: 21.52 s
Wall time: 68.51 s

sage: %time _ = A._multiply_strassen(B,100)
CPU times: user 21.99 s, sys: 0.11 s, total: 22.09 s
Wall time: 70.36 s

sage: %time _ = A._multiply_strassen(B,150)
CPU times: user 23.95 s, sys: 0.05 s, total: 24.01 s
Wall time: 76.56 s

2^31

sage: A = random_matrix(GF(previous_prime(2^31)),1000,1000)
sage: B = random_matrix(GF(previous_prime(2^31)),1000,1000)
sage: %time _ = A._multiply_strassen(B, 20)
CPU times: user 25.93 s, sys: 0.08 s, total: 26.00 s
Wall time: 83.17 s

sage: %time _ = A._multiply_strassen(B, 50)
CPU times: user 21.47 s, sys: 0.06 s, total: 21.53 s
Wall time: 68.69 s

sage: %time _ = A._multiply_strassen(B,100)
CPU times: user 22.02 s, sys: 0.06 s, total: 22.08 s
Wall time: 70.31 s

sage: %time _ = A._multiply_strassen(B,150)
CPU times: user 23.92 s, sys: 0.14 s, total: 24.06 s
Wall time: 76.67 s

What I take away from these numbers: increasing the default cutoff from 20 to 100 makes Sage twice as fast on some mainstream architectures (64-bit Linux, Xeon), about 1.7x faster on others (64-bit OS X), and probably slightly faster on less mainstream architectures (32-bit Linux on Pentium 4). Am I missing something here?

@sagetrac-mraum
Mannequin

sagetrac-mraum mannequin commented May 1, 2012

Attachment: 4968-review.patch.gz

@sagetrac-mraum
Mannequin

sagetrac-mraum mannequin commented May 1, 2012

comment:32

I made this work; only a few points had been overlooked when squashing.

I changed the cutoff back to 100. Since William and I both think one should do it properly or not at all, I thought it would be best to keep the old cutoff. This had been 100. The biggest advantage from my perspective is that the release manager will have to deal with only one change (32 -> 64 bit), and not two, when investigating potential speed regressions. Not that I found any, but there are more exotic architectures than mine.

@malb
Member Author

malb commented May 1, 2012

comment:33

Replying to @sagetrac-mraum:

I changed the cutoff back to 100. Since William and I both think one should do it properly or not at all, I thought it would be best to keep the old cutoff. This had been 100.

Fair enough, 100 is what I was arguing for regardless. But I have to say being so blatantly ignored leaves a bitter aftertaste. Anyway, it seems you have this patch covered; I won't intervene any more.

@sagetrac-mraum
Mannequin

sagetrac-mraum mannequin commented May 1, 2012

comment:34

Please don't take it like that. I was happy to see that your suggestion and the old cutoff were the same. But consider us changing this and the underlying implementation in parallel: what would the release manager do if regressions showed up? I would unmerge the whole thing (and we would feel like we were stealing his time). This way you can open a new ticket and do the optimization independently.

@kini
Contributor

kini commented May 20, 2012

comment:35

By the way, doctesting this patch left several (5-10) python processes running on a patchbot for more than 13 hours, which is probably abnormal...

@vbraun
Member

vbraun commented Jun 18, 2012

comment:36

I'm reviewing the linbox update (#12883), where malb removed linbox-stuff from matrix_modn_dense.py. This might conflict with this ticket's patch but, I think, only in a trivial way. If I got this wrong then please let me know.

@SnarkBoojum
Mannequin

SnarkBoojum mannequin commented Mar 11, 2015

comment:37

This ticket has been in needs_review for three years... putting it into needs_info status, since obviously sagemath has moved forward in the intervening time.
