Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify MULTIPLY-COMPLEX-MATRIX. Cons less, compute more. #60

Merged

Conversation

jmbr
Copy link
Contributor

@jmbr jmbr commented Sep 5, 2019

As promised in #54, here is a patch that makes multiply-complex-matrix two orders of magnitude faster and less memory-hungry (see results below).

(in-package #:magicl)

(defparameter *n* 2048)
(defparameter *matrix* (magicl:random-unitary *n*))
(defparameter *vector* (magicl:random-matrix *n* 1))

(progn
  (sb-ext:gc :full t)
  (dotimes (i 3)
    (time (multiply-complex-matrices *matrix* *vector*))))

(format t "===================================~%")

(progn
  (sb-ext:gc :full t)
  (dotimes (i 3)
    (time (multiply-complex-matrices-old *matrix* *vector*))))

;;; WITH THIS PATCH.
;; Evaluation took:
;;   0.005 seconds of real time
;;   0.017629 seconds of total run time (0.017518 user, 0.000111 system)
;;   360.00% CPU
;;   12,243,972 processor cycles
;;   1,840 bytes consed
;;   
;; Evaluation took:
;;   0.003 seconds of real time
;;   0.011476 seconds of total run time (0.011374 user, 0.000102 system)
;;   366.67% CPU
;;   8,130,614 processor cycles
;;   39,808 bytes consed
;;   
;; Evaluation took:
;;   0.003 seconds of real time
;;   0.011497 seconds of total run time (0.011432 user, 0.000065 system)
;;   366.67% CPU
;;   7,961,397 processor cycles
;;   38,672 bytes consed
;;   
;;; WITH THE PREVIOUS VERSION OF THE CODE:
;; ===================================
;; Evaluation took:
;;   0.038 seconds of real time
;;   0.077195 seconds of total run time (0.063205 user, 0.013990 system)
;;   [ Run times consist of 0.003 seconds GC time, and 0.075 seconds non-GC time. ]
;;   202.63% CPU
;;   104,146,258 processor cycles
;;   67,143,712 bytes consed
;;   
;; Evaluation took:
;;   0.043 seconds of real time
;;   0.082063 seconds of total run time (0.065534 user, 0.016529 system)
;;   [ Run times consist of 0.003 seconds GC time, and 0.080 seconds non-GC time. ]
;;   190.70% CPU
;;   116,569,989 processor cycles
;;   67,141,680 bytes consed
;;   
;; Evaluation took:
;;   0.020 seconds of real time
;;   0.046819 seconds of total run time (0.046032 user, 0.000787 system)
;;   [ Run times consist of 0.003 seconds GC time, and 0.044 seconds non-GC time. ]
;;   235.00% CPU
;;   53,982,880 processor cycles
;;   67,141,680 bytes consed

@notmgsk
Copy link
Member

notmgsk commented Sep 5, 2019

This is on all implementations of BLAS?

@jmbr
Copy link
Contributor Author

jmbr commented Sep 5, 2019

@notmgsk I have tested it on OpenBLAS but I expect the trend to be the same in MKL and Apple's Accelerate framework (i.e., new version faster than old version). I'll do more checks and report the results.

Copy link
Contributor

@ecpeterson ecpeterson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had it in my head that zgemm could destroy some of the input data, as well as writing to c—but, looking at the documentation again, I don't see any mention of that.

If it's true that a and b stay untouched, then this is great!

In a future PR, I'd like to see the transpose options exposed (especially if one of them acts by complex conjugation—but, again looking at the docs, it's not clear that T and C act any differently??). In a PR after that, it'd be nice to upgrade CL-QUIL's m* to allow individual matrices in the sequence to be tagged with transpose flags.

@jmbr
Copy link
Contributor Author

jmbr commented Sep 5, 2019

Both matrices are indeed unchanged and only C is overwritten. In the documentation for the latest reference implementation of BLAS, the matrices A and B are listed as input parameters [1] while C is the only argument explicitly marked as input/output and said to be overwritten (the reference code complies with that). In an older version, it explicitly says that A and B are "unchanged on exit" [2]. Finally, Intel MKL's implementation [3] declares both a and b as pointers to const data (hence unchanged).

The code for the reference implementation of ZGEMM does call DCONJG to get the complex conjugate when asked to take the Hermitian transpose.

[1] http://www.netlib.org/lapack/explore-html/dc/d17/group__complex16__blas__level3_ga4ef748ade85e685b8b2241a7c56dd21c.html#ga4ef748ade85e685b8b2241a7c56dd21c

[2] http://www.netlib.org/lapack/lapack-3.1.1/html/zgemm.f.html

[3] https://software.intel.com/en-us/mkl-developer-reference-c-cblas-gemm

Copy link
Contributor

@appleby appleby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks perfectly reasonable to my uninformed eyes.

Are there any tests covering this function? If not, any chance of adding a few, just to ease the nerves regarding @ecpeterson's concern about mutating inputs and maybe to verify that the behavior hasn't changed w.r.t. the corner cases (row & col vectors) that were explicitly handled previously?

Copy link
Contributor

@ecpeterson ecpeterson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sold.

@jmbr
Copy link
Contributor Author

jmbr commented Sep 5, 2019

The results are consistent when running the snippet of code below using Accelerate, OpenBLAS, and MKL.

(ql:quickload :magicl)

(in-package #:magicl)

(defparameter *n* 2048)
(defparameter *matrix* (magicl:random-gaussian-matrix *n* *n*))
(defparameter *vector* (magicl:random-matrix *n* 1))

(progn
  (sb-ext:gc :full t)
  (dotimes (i 3)
    (time (magicl:multiply-complex-matrices *matrix* *vector*)))
  (defparameter *results-0* (magicl:multiply-complex-matrices *matrix* *vector*)))

(format t "===================================~%")

(progn
  (sb-ext:gc :full t)
  (dotimes (i 3)
    (time (magicl::multiply-complex-matrices* *matrix* *vector*)))
  (defparameter *results-1* (magicl::multiply-complex-matrices* *matrix* *vector*)))

(format t "Error: ~A~%" (loop :for a :across (matrix-data *results-0*)
                              :for b :across (matrix-data *results-1*)
                              :maximizing (abs (- a b))))

(sb-ext:quit)

accelerate.txt
openblas.txt
mkl.txt

@jmbr
Copy link
Contributor Author

jmbr commented Sep 6, 2019

@appleby The current test suite exercises this function extensively (albeit indirectly).

@jmbr jmbr merged commit e0102c4 into master Sep 6, 2019
@stylewarning stylewarning deleted the enhancement/multiply-complex-matrices-fast-and-frugally branch September 9, 2019 18:17
@notmgsk notmgsk mentioned this pull request Nov 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants