
A question on fluokitten performance #27

Closed
randomizedthinking opened this issue Jan 2, 2019 · 9 comments

Comments

@randomizedthinking

randomizedthinking commented Jan 2, 2019

I followed the instructions in Fast Map and Reduce for Primitive Vectors and Matrices to test the performance of neanderthal and fluokitten. Here is what I found:

  1. The performance numbers for neanderthal are close to those in the post;
  2. Yet all of the fluokitten numbers are much worse.

For instance,

;; requires, for completeness:
(require '[uncomplicate.neanderthal.native :refer [dv]]
         '[uncomplicate.fluokitten.core :refer [fmap fold foldmap]]
         '[criterium.core :refer [quick-bench with-progress-reporting]])

(def nx (dv (range n)))
(def ny (dv (range n)))

(defn p+ ^double [^double x ^double y] (+ x y))
(defn p* ^double [^double x ^double y] (* x y))

(with-progress-reporting (quick-bench (foldmap p+ 0.0 p* nx ny)))
;             Execution time mean : 19.735098 ms

I also ran the same tests with the current dev version of fluokitten, and it too ended up with much slower execution times than those in the post. I wonder what the causes of such large discrepancies might be.

It would be great if someone else could run similar tests as well.

@blueberry
Member

blueberry commented Jan 2, 2019

Can you share an example project with the specific dependencies and exactly the code that you tried?

For example, in the code you've posted, you never define p+ and p* (my mistake, early morning :), and I can't see how large n is. Maybe foldmap just throws an exception? Or n might be quite large, so 20 ms is what it should be? What is your hardware, Java version, and OS?

@blueberry
Member

Just for reference, I tried this code with the current Neanderthal snapshot on the same machine I used for writing that post (i7-4790K) and got the following:

(with-progress-reporting (quick-bench (foldmap p+ 0.0 p* nx ny)))
;;  Execution time mean : 189.349159 µs

It's a little bit faster than reported in the blog post (198 µs), but close. Please run that benchmark on your machine a couple of times and see whether anything changes. Maybe the combination of your JVM/OS version and settings has something to do with it, but it is difficult to say without data.

@randomizedthinking
Author

randomizedthinking commented Jan 2, 2019

I should have specified all the parameters in the first place. The full configuration is:

  • n=100000
  • JVM version: OpenJDK 1.8.0
  • OS: Linux Debian
  • CPU: i5-3470
  • Clojure 1.9.0

The computer I use is slow, but that alone doesn't explain the huge performance difference. I also tried it on another Xeon E5-2686 machine this morning -- same results.

Further checking shows that fold is reasonably fast, yet fmap is the bottleneck. Another observation: in your post, you showed that

(fold (fmap * nx ny))
;; => ClassCastException clojure.core$_STAR_ cannot be cast to clojure.lang.IFn$DDD  uncomplicate.neanderthal.impl.buffer-block/vector-fmap* (buffer_block.clj:349)

Yet in my tests, this code runs. I suspect you have a different fmap that takes advantage of native MKL or GPU libraries, which would explain the much better performance.
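
(Editor's sketch, not part of the original test: using the primitive-hinted p* defined above, the two halves of the pipeline could be timed separately; nz is a name introduced here purely for illustration.)

;; fmap over real vectors expects a primitive (IFn$DDD) function, which
;; clojure.core/* is not; the type-hinted p* from above satisfies it.
(with-progress-reporting (quick-bench (fmap p* nx ny)))
;; folding a precomputed vector isolates the reduce step
(def nz (fmap p* nx ny))
(with-progress-reporting (quick-bench (fold nz)))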

@blueberry
Member

blueberry commented Jan 2, 2019

I think I know what the source of the problem might be: the old Clojure compiler was somewhat inconsistent in applying protocol implementations, so in your case it dispatches to the non-primitive function implementation (why, I don't know). In my tests, Clojure 1.10 fixed this non-determinism. Please upgrade the project to 1.10 and report the timings.
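
(Editor's sketch: a quick way to confirm at the REPL that the upgrade actually took effect.)

(clojure-version)
;; => "1.10.0"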

BTW, fmap does not use any MKL/GPU acceleration; it just eliminates various JVM bottlenecks. fold does use MKL in a few simple cases where that is possible.

@randomizedthinking
Author

Just tested on Clojure 1.10, but the performance is still sluggish. Also, (fold (fmap * nx ny)) runs under 1.10 in my case.

@randomizedthinking
Author

I have created a project so you can check it out: fluokitten_test. Below are my results:

Estimating sampling overhead
Warming up for JIT optimisations 10000000000 ...
compilation occurred before 463355 iterations
compilation occurred before 59757179 iterations
compilation occurred before 116734838 iterations
compilation occurred before 118124537 iterations
compilation occurred before 177418361 iterations
compilation occurred before 236712185 iterations
Estimating execution count ...
Sampling ...
Final GC...
Checking GC...
Finding outliers ...
Bootstrapping ...
Checking outlier significance
Warming up for JIT optimisations 5000000000 ...
Estimating execution count ...
Sampling ...
Final GC...
Checking GC...
Finding outliers ...
Bootstrapping ...
Checking outlier significance
Evaluation count : 36 in 6 samples of 6 calls.
Execution time mean : 16.717164 ms
Execution time std-deviation : 188.162809 µs
Execution time lower quantile : 16.481262 ms ( 2.5%)
Execution time upper quantile : 16.925771 ms (97.5%)
Overhead used : 9.915312 ns

@blueberry
Member

blueberry commented Jan 3, 2019

When I tried your project as-is on my computer (but starting the benchmark from the REPL instead of main), I got 4 ms.

Then I added the direct-linking option to :jvm-opts in Leiningen and got a significant speedup: 800 microseconds.

:jvm-opts ^:replace ["-Dclojure.compiler.direct-linking=true"
                     "-XX:MaxDirectMemorySize=16g" "-XX:+UseLargePages"]

I restarted your project a few times with different versions of Neanderthal (the SNAPSHOT and 0.20.4) and Clojure (1.8.0 and 1.10.0), and I always got the same result.

However, when I start the REPL from the benchmarks example project (https://github.com/uncomplicate/neanderthal/blob/master/examples/benchmarks/src/benchmarks/map_reduce.clj), I always get around 200 microseconds, as reported in the blog post.

So it is definitely related to JVM/Clojure compiler settings, and possibly to the order in which Clojure loads namespaces. I don't have time right now to compare your project further and see whether there is another setting you've missed. Can you try the code from the benchmarks project and report your numbers? (Seeing that our CPUs got 20 ms vs. 4 ms on the initial version, I'd expect you to get around 1 ms with the benchmarks project.)

@randomizedthinking
Author

Thanks for the prompt reply. I will test with the options you provided and report back later.

@randomizedthinking
Author

randomizedthinking commented Jan 3, 2019

I have now found the cause of the issue. In addition to the direct-linking option you pointed out, another factor is the *unchecked-math* option: it has to be set to either true or :warn-on-boxed globally. Setting the option only in the module won't work.

The fluokitten_test project has been updated with the change. Now I get around 240 µs as the end result.
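
(Editor's sketch: a minimal project.clj excerpt combining the two settings discussed in this thread, assuming Leiningen; :global-vars is the standard way to set a dynamic var such as *unchecked-math* for every namespace.)

;; excerpt of a defproject map, not a complete project.clj
:global-vars {*unchecked-math* :warn-on-boxed}
:jvm-opts ^:replace ["-Dclojure.compiler.direct-linking=true"]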
