Simplify MULTIPLY-COMPLEX-MATRIX. Cons less, compute more. #60

jmbr · 2019-09-05T00:33:24Z

As promised in #54, here is a patch that makes multiply-complex-matrix two orders of magnitude faster and less memory-hungry (see results below).

(in-package #:magicl)

(defparameter *n* 2048)
(defparameter *matrix* (magicl:random-unitary *n*))
(defparameter *vector* (magicl:random-matrix *n* 1))

(progn
  (sb-ext:gc :full t)
  (dotimes (i 3)
    (time (multiply-complex-matrices *matrix* *vector*))))

(format t "===================================~%")

(progn
  (sb-ext:gc :full t)
  (dotimes (i 3)
    (time (multiply-complex-matrices-old *matrix* *vector*))))

;;; WITH THIS PATCH.
;; Evaluation took:
;;   0.005 seconds of real time
;;   0.017629 seconds of total run time (0.017518 user, 0.000111 system)
;;   360.00% CPU
;;   12,243,972 processor cycles
;;   1,840 bytes consed
;;   
;; Evaluation took:
;;   0.003 seconds of real time
;;   0.011476 seconds of total run time (0.011374 user, 0.000102 system)
;;   366.67% CPU
;;   8,130,614 processor cycles
;;   39,808 bytes consed
;;   
;; Evaluation took:
;;   0.003 seconds of real time
;;   0.011497 seconds of total run time (0.011432 user, 0.000065 system)
;;   366.67% CPU
;;   7,961,397 processor cycles
;;   38,672 bytes consed
;;   
;;; WITH THE PREVIOUS VERSION OF THE CODE:
;; ===================================
;; Evaluation took:
;;   0.038 seconds of real time
;;   0.077195 seconds of total run time (0.063205 user, 0.013990 system)
;;   [ Run times consist of 0.003 seconds GC time, and 0.075 seconds non-GC time. ]
;;   202.63% CPU
;;   104,146,258 processor cycles
;;   67,143,712 bytes consed
;;   
;; Evaluation took:
;;   0.043 seconds of real time
;;   0.082063 seconds of total run time (0.065534 user, 0.016529 system)
;;   [ Run times consist of 0.003 seconds GC time, and 0.080 seconds non-GC time. ]
;;   190.70% CPU
;;   116,569,989 processor cycles
;;   67,141,680 bytes consed
;;   
;; Evaluation took:
;;   0.020 seconds of real time
;;   0.046819 seconds of total run time (0.046032 user, 0.000787 system)
;;   [ Run times consist of 0.003 seconds GC time, and 0.044 seconds non-GC time. ]
;;   235.00% CPU
;;   53,982,880 processor cycles
;;   67,141,680 bytes consed

notmgsk · 2019-09-05T00:44:51Z

This is on all implementations of BLAS?

jmbr · 2019-09-05T00:54:21Z

@notmgsk I have tested it on OpenBLAS but I expect the trend to be the same in MKL and Apple's Accelerate framework (i.e., new version faster than old version). I'll do more checks and report the results.

ecpeterson

I had it in my head that zgemm could destroy some of the input data, as well as writing to c—but, looking at the documentation again, I don't see any mention of that.

If it's true that a and b stay untouched, then this is great!

In a future PR, I'd like to see the transpose options exposed (especially if one of them acts by complex conjugation—but, again looking at the docs, it's not clear that T and C act any differently??). In a PR after that, it'd be nice to upgrade CL-QUIL's m* to allow individual matrices in the sequence to be tagged with transpose flags.

jmbr · 2019-09-05T14:08:59Z

Both matrices are indeed unchanged and only C is overwritten. In the documentation for the latest reference implementation of BLAS, the matrices A and B are listed as input parameters [1] while C is the only argument explicitly marked as input/output and said to be overwritten (the reference code complies with that). In an older version, it explicitly says that A and B are "unchanged on exit" [2]. Finally, Intel MKL's implementation [3] declares both a and b as pointers to const data (hence unchanged).

The code for the reference implementation of ZGEMM does call DCONJG to get the complex conjugate when asked to take the Hermitian transpose.

[1] http://www.netlib.org/lapack/explore-html/dc/d17/group__complex16__blas__level3_ga4ef748ade85e685b8b2241a7c56dd21c.html#ga4ef748ade85e685b8b2241a7c56dd21c

[2] http://www.netlib.org/lapack/lapack-3.1.1/html/zgemm.f.html

[3] https://software.intel.com/en-us/mkl-developer-reference-c-cblas-gemm

appleby

This change looks perfectly reasonable to my uninformed eyes.

Are there any tests covering this function? If not, any chance of adding a few, just to ease the nerves regarding @ecpeterson's concern about mutating inputs and maybe to verify that the behavior hasn't changed w.r.t. the corner cases (row & col vectors) that were explicitly handled previously?

ecpeterson

I'm sold.

jmbr · 2019-09-05T16:38:02Z

The results are consistent when running the snippet of code below using Accelerate, OpenBLAS, and MKL.

(ql:quickload :magicl)

(in-package #:magicl)

(defparameter *n* 2048)
(defparameter *matrix* (magicl:random-gaussian-matrix *n* *n*))
(defparameter *vector* (magicl:random-matrix *n* 1))

(progn
  (sb-ext:gc :full t)
  (dotimes (i 3)
    (time (magicl:multiply-complex-matrices *matrix* *vector*)))
  (defparameter *results-0* (magicl:multiply-complex-matrices *matrix* *vector*)))

(format t "===================================~%")

(progn
  (sb-ext:gc :full t)
  (dotimes (i 3)
    (time (magicl::multiply-complex-matrices* *matrix* *vector*)))
  (defparameter *results-1* (magicl::multiply-complex-matrices* *matrix* *vector*)))

(format t "Error: ~A~%" (loop :for a :across (matrix-data *results-0*)
                              :for b :across (matrix-data *results-1*)
                              :maximizing (abs (- a b))))

(sb-ext:quit)

accelerate.txt
openblas.txt
mkl.txt

jmbr · 2019-09-06T17:30:13Z

@appleby The current test suite exercises this function extensively (albeit indirectly).

Simplify MULTIPLY-COMPLEX-MATRIX. Cons less, compute more.

7292402

jmbr requested review from stylewarning, notmgsk and ecpeterson September 5, 2019 00:33

jmbr added the enhancement label Sep 5, 2019

ecpeterson reviewed Sep 5, 2019

View reviewed changes

appleby reviewed Sep 5, 2019

View reviewed changes

ecpeterson approved these changes Sep 5, 2019

View reviewed changes

jmbr mentioned this pull request Sep 5, 2019

Speed-up and decrease memory consumption in QVM quil-lang/qvm#173

Open

jmbr merged commit e0102c4 into master Sep 6, 2019

stylewarning deleted the enhancement/multiply-complex-matrices-fast-and-frugally branch September 9, 2019 18:17

notmgsk mentioned this pull request Nov 19, 2019

Bump version to 0.6.5 #66

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify MULTIPLY-COMPLEX-MATRIX. Cons less, compute more. #60

Simplify MULTIPLY-COMPLEX-MATRIX. Cons less, compute more. #60

jmbr commented Sep 5, 2019

notmgsk commented Sep 5, 2019

jmbr commented Sep 5, 2019 •

edited

Loading

ecpeterson left a comment

jmbr commented Sep 5, 2019

appleby left a comment

ecpeterson left a comment

jmbr commented Sep 5, 2019

jmbr commented Sep 6, 2019

Simplify MULTIPLY-COMPLEX-MATRIX. Cons less, compute more. #60

Simplify MULTIPLY-COMPLEX-MATRIX. Cons less, compute more. #60

Conversation

jmbr commented Sep 5, 2019

notmgsk commented Sep 5, 2019

jmbr commented Sep 5, 2019 • edited Loading

ecpeterson left a comment

Choose a reason for hiding this comment

jmbr commented Sep 5, 2019

appleby left a comment

Choose a reason for hiding this comment

ecpeterson left a comment

Choose a reason for hiding this comment

jmbr commented Sep 5, 2019

jmbr commented Sep 6, 2019

jmbr commented Sep 5, 2019 •

edited

Loading