
A question on fluokitten performance #27

Closed
randomizedthinking opened this issue Jan 2, 2019 · 9 comments

Comments

@randomizedthinking

randomizedthinking commented Jan 2, 2019

I followed the instructions in Fast Map and Reduce for Primitive Vectors and Matrices to test the performance of neanderthal and fluokitten. Here is what I found:

  1. The performance numbers for neanderthal are close to those in the post;
  2. Yet all of the fluokitten numbers are much worse.

For instance,

;; requires, for completeness:
(require '[uncomplicate.neanderthal.native :refer [dv]]
         '[uncomplicate.fluokitten.core :refer [fmap fold foldmap]]
         '[criterium.core :refer [quick-bench with-progress-reporting]])

(def nx (dv (range n)))
(def ny (dv (range n)))

(defn p+ ^double [^double x ^double y] (+ x y))
(defn p* ^double [^double x ^double y] (* x y))

(with-progress-reporting (quick-bench (foldmap p+ 0.0 p* nx ny)))
;             Execution time mean : 19.735098 ms

I also ran the same tests with the current dev version of fluokitten, and it too ended up with much slower execution times than those in the post. I wonder what the causes of such large discrepancies might be.

It would be great if someone else could run similar tests as well.

@blueberry
Member

blueberry commented Jan 2, 2019

Can you share an example project with the specific dependencies and exactly the code that you tried?

For example, in the code you've posted, you never define p+ and p* (my mistake, early morning :), and I can't see how large n is. Maybe foldmap just throws an exception? Or n might be quite large, so 20 ms is what it should be? What is your hardware, Java version, and OS?

@blueberry
Member

Just for reference, I tried this code with the current Neanderthal snapshot on the same machine I used for writing that post (i7-4790K) and got the following:

(with-progress-reporting (quick-bench (foldmap p+ 0.0 p* nx ny)))
;;  Execution time mean : 189.349159 µs

It's a little bit faster than reported in the blog post (198 µs), but close. Please run that benchmark on your machine a couple of times and see whether anything changes. Maybe the combination of your JVM/OS version and settings has something to do with it, but it is difficult to say without data.

@randomizedthinking
Author

randomizedthinking commented Jan 2, 2019

I should have specified all the parameters in the first place. The full configuration is:

  • n=100000
  • JVM version: OpenJDK 1.8.0
  • OS: Linux Debian
  • CPU: i5-3470
  • Clojure 1.9.0

The computer I use is slow, but that alone doesn't explain the huge performance difference. I also tried it on another Xeon E5-2686 machine this morning -- same results.

Further checking shows that fold is reasonably fast, yet fmap is the bottleneck. Another observation: in your post, you showed that

(fold (fmap * nx ny))
;; => ClassCastException clojure.core$_STAR_ cannot be cast to clojure.lang.IFn$DDD  uncomplicate.neanderthal.impl.buffer-block/vector-fmap* (buffer_block.clj:349)

Yet in my tests, this code runs. I suspect you have a different fmap that takes advantage of native MKL or GPU libraries, which would explain the much better performance.
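
(Editor's sketch, not part of the original test: using the primitive-hinted p* defined above, the two halves of the pipeline could be timed separately; nz is a name introduced here purely for illustration.)

;; fmap over real vectors expects a primitive (IFn$DDD) function, which
;; clojure.core/* is not; the type-hinted p* from above satisfies it.
(with-progress-reporting (quick-bench (fmap p* nx ny)))
;; folding a precomputed vector isolates the reduce step
(def nz (fmap p* nx ny))
(with-progress-reporting (quick-bench (fold nz)))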

@blueberry
Member

blueberry commented Jan 2, 2019

I think I know what the source of the problem might be: the old Clojure compiler was somewhat inconsistent in applying protocol implementations, so in your case it dispatches to the non-primitive function implementation (why, I don't know). In my tests, Clojure 1.10 fixed this non-determinism. Please upgrade the project to 1.10 and report the timings.
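
(Editor's sketch: a quick way to confirm at the REPL that the upgrade actually took effect.)

(clojure-version)
;; => "1.10.0"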

BTW, fmap does not use any MKL/GPU acceleration; it just eliminates various JVM bottlenecks. fold does use MKL in a few simple cases where that is possible.

@randomizedthinking
Author

Just tested on Clojure 1.10, but the performance is still sluggish. Also, (fold (fmap * nx ny)) runs under 1.10 in my case.

@randomizedthinking
Author

I have created a project so you can check it out: fluokitten_test. Below are my results:

Estimating sampling overhead
Warming up for JIT optimisations 10000000000 ...
compilation occurred before 463355 iterations
compilation occurred before 59757179 iterations
compilation occurred before 116734838 iterations
compilation occurred before 118124537 iterations
compilation occurred before 177418361 iterations
compilation occurred before 236712185 iterations
Estimating execution count ...
Sampling ...
Final GC...
Checking GC...
Finding outliers ...
Bootstrapping ...
Checking outlier significance
Warming up for JIT optimisations 5000000000 ...
Estimating execution count ...
Sampling ...
Final GC...
Checking GC...
Finding outliers ...
Bootstrapping ...
Checking outlier significance
Evaluation count : 36 in 6 samples of 6 calls.
Execution time mean : 16.717164 ms
Execution time std-deviation : 188.162809 µs
Execution time lower quantile : 16.481262 ms ( 2.5%)
Execution time upper quantile : 16.925771 ms (97.5%)
Overhead used : 9.915312 ns

@blueberry
Member

blueberry commented Jan 3, 2019

When I tried your project as-is on my computer (but starting the benchmark from the REPL instead of main), I got 4 ms.

Then I added the direct-linking option to :jvm-opts in Leiningen and got a significant speedup: 800 microseconds.

:jvm-opts ^:replace ["-Dclojure.compiler.direct-linking=true"
                     "-XX:MaxDirectMemorySize=16g" "-XX:+UseLargePages"]

I restarted your project a few times with different versions of Neanderthal (the SNAPSHOT and 0.20.4) and Clojure (1.8.0 and 1.10.0), and I always got the same result.

However, when I start the REPL from the benchmarks example project (https://github.com/uncomplicate/neanderthal/blob/master/examples/benchmarks/src/benchmarks/map_reduce.clj), I always get around 200 microseconds, as reported in the blog post.

So it is definitely related to JVM/Clojure compiler settings, and possibly to the order in which Clojure loads namespaces. I don't have time right now to compare your project further and see whether there is another setting you've missed. Can you try the code from the benchmarks project and report your numbers? (Seeing that our CPUs got 20 ms vs. 4 ms on the initial version, I'd expect you to get around 1 ms with the benchmarks project.)

@randomizedthinking
Author

Thanks for the prompt reply. I will test with the options you provided and report back later.

@randomizedthinking
Author

randomizedthinking commented Jan 3, 2019

I have now found the cause of the issue. In addition to the direct-linking option you pointed out, another factor is the *unchecked-math* option: it has to be set to either true or :warn-on-boxed globally. Setting the option only in the module won't work.

The fluokitten_test project has been updated with the change. Now I get around 240 µs as the end result.
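
(Editor's sketch: a minimal project.clj excerpt combining the two settings discussed in this thread, assuming Leiningen; :global-vars is the standard way to set a dynamic var such as *unchecked-math* for every namespace.)

;; excerpt of a defproject map, not a complete project.clj
:global-vars {*unchecked-math* :warn-on-boxed}
:jvm-opts ^:replace ["-Dclojure.compiler.direct-linking=true"]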
