Use `Array[Int]` instead of `Array[Array[Boolean]]` for `[class,trait]_has_trait` #3279

lolgab · 2023-05-08T13:15:53Z

Results

Here the results with the Ember hello world.

 31056 -rwxr-xr-x@ 1 lorenzo  staff   15898802  8 Mag 14:31 after*
675624 -rw-r--r--@ 1 lorenzo  staff  345918875  8 Mag 15:07 after.ll
     0 drwxr-xr-x@ 4 lorenzo  staff        128  1 Mag 16:30 b/
 40632 -rwxr-xr-x@ 1 lorenzo  staff   20803106  8 Mag 14:52 before*
783056 -rw-r--r--@ 1 lorenzo  staff  400921751  8 Mag 14:40 before.ll

the binary size is reduced by ~4.9 MB and the generated out.ll file is reduced by 55 MB
The runtime perfomance seems unchanged:

# Before
> hey -z 60s http://localhost:8081

Summary:
  Total:        60.0026 secs
  Slowest:      0.1061 secs
  Fastest:      0.0003 secs
  Average:      0.0034 secs
  Requests/sec: 14778.1655

# After
> hey -z 60s http://localhost:8080

Summary:
  Total:        60.0031 secs
  Slowest:      0.1002 secs
  Fastest:      0.0002 secs
  Average:      0.0034 secs
  Requests/sec: 14824.3753

Using a Array[Long] as bitset instead of Array[Array[Boolean]]

lolgab · 2023-05-08T13:17:50Z

tools/src/main/scala/scala/scalanative/codegen/Generate.scala

@@ -504,7 +592,7 @@ object Generate {
    }
  }

-  private object Impl {
+  object Impl {


I need to move the constats that I need in to use in Lower outside of the Impl object and make them public.

lolgab · 2023-05-08T13:21:11Z

tools/src/main/scala/scala/scalanative/codegen/BitMatrix.scala

@@ -0,0 +1,26 @@
+package scala.scalanative
+
+private[scalanative] class BitMatrix private (


This is a minimal version of our java BitSet implementation with fixed size and no bound checks.
It represents a matrix so it calculates the array index storing the matrix by rows.

lolgab · 2023-05-08T13:23:15Z

tools/src/main/scala/scala/scalanative/codegen/Generate.scala

+    // def __get_[class,trait]_has_trait(firstid: Int, secondid: Int): Boolean = {
+    //   val columns = meta.traits.length
+    //   val row = firstid * columns
+    //   val bitIndex = row + secondid
+    //   val arrayPos = bitIndex >> AddressBitsPerWord
+    //   val long = bits(arrayPos)
+    //   val toShift = bitIndex & RightBits
+    //   val toShiftLong = toShift.toLong
+    //   val mask = 1L << toShiftLong
+    //   val and = long & mask
+    //   val result = and != 0L
+    //
+    //   result
+    // }


I'm not sure if we need this comment. I can remove it if you think it's redundant.

WojciechMazur · 2023-05-16T22:52:38Z

Even though in typical usage the new implementation might not introduce any regression in performance when we isolate the benchmark to mostly type checks, then the difference can be actually observed:

Anyway, I think this change should be included into the main. Probably in the future we might provide users with option to use the old implementation while acknowledging the higher binary size. In fact we might need to do something similar for virtual methods dispatch in which we use a large array to ensure constant lookup time, which however blocks capabilities of incremental compilation.

Benchmark

Tested with releaseFast + LTO thin
Additional read/write ops were introduce to ensure that code would not be dead-code eliminate/partially evaluated at compile-time.

trait T1
trait T2 extends T1
trait T3 extends T2
trait T4 extends T3

class C1
class C2 extends C1
class C3 extends C2
class C4 extends C3

object Test {
  def main(args: Array[String]): Unit = {
    var c = 0
    var b = false
    var i = 0
    var x = 0
    while (i < 1000000) {
      val obj = (i % 3) match {
        case 0 => new C4
        case 1 => new T3 {}
        case _ => new {}
      }
      assert(obj != null)
      while (c < 1000000) {
        b &&= obj.isInstanceOf[T1]
        b &&= obj.isInstanceOf[T2]
        b &&= obj.isInstanceOf[T3]
        b &&= obj.isInstanceOf[T4]
        b &&= obj.isInstanceOf[C1]
        b &&= obj.isInstanceOf[C2]
        b &&= obj.isInstanceOf[C3]
        b &&= obj.isInstanceOf[C4]
        c += 1
        if (b) x += 1
      }
      i += 1
    }
  }
}

Results

gitpod /workspace/scala-native (main) $ time ./before 

real    0m0.006s
user    0m0.001s
sys     0m0.006s
gitpod /workspace/scala-native (main) $ time ./after 

real    0m0.014s
user    0m0.009s
sys     0m0.005s

lolgab · 2023-05-17T14:15:37Z

@WojciechMazur Great that you wrote a benchmark program, so we can test the performance of a bitset with 32 bit integers and also the performance of a one-dimensional array of Booleans, which are two other options.
I was thinking, do the matrix need to contain all the possible classes and traits in the application?
Couldn't at link time flag the classes and traits that appear in isInstanceOf calls and materialize the matrix only for those?
I'm not sure I understood how these runtime checks work so if I'm saying something completely wrong let me know :)

WojciechMazur · 2023-05-18T21:12:23Z

Couldn't at link time flag the classes and traits that appear in isInstanceOf calls and materialize the matrix only for those?

That might be an interesting experiment, but I'm not sure if that would make a lot of difference. Becouse the NIR is generated after the erasure, the Scala AST might be already full of .asInstanceOf and .isInstanceOf calls, which probably might by present for most of the types. However we could check that. Probably it would be easiest to handler it in Lower phase, there Op.Is and Op.As would be transformed to actual low level calls. I mention .asInstanceOf here as well, because it's lowering included runtime isInstanceOf check as well.

armanbilge · 2023-05-19T20:26:17Z

I mention .asInstanceOf here as well, because it's lowering included runtime isInstanceOf check as well.

Btw, not a priority, but eventually I would be happy to see an option so that in release mode all .asInstanceOf is no-op and ClassCastException becomes undefined behavior. It is already undefined behavior in Scala.js so cross-platform code should not be encountering that, and idiomatic Scala code is also not relying on ClassCastException for correct semantics (certainly not anything in Typelevel stack). Maybe this can be good for performance and even better if we can use it to reduce the size of {class,trait}_has_trait.

lolgab · 2023-05-26T16:50:45Z

Converted the matrixes to Array[Int] and it seems that the performances improved on my Apple ARM64 machine. In my benchmark is always on par with baseline. But I couldn't test with LTO=thin. I'm converting the matrixes to Array[Int] since it should also improve performance on 32 bit machines.

My benchmarks:

`debug` mode

hyperfine --warmup 1 ./32bit ./64bit ./bas
eline
Benchmark 1: ./32bit
  Time (mean ± σ):      1.479 s ±  0.012 s    [User: 1.461 s, System: 0.016 s]
  Range (min … max):    1.466 s …  1.508 s    10 runs
 
Benchmark 2: ./64bit
  Time (mean ± σ):      1.567 s ±  0.002 s    [User: 1.550 s, System: 0.015 s]
  Range (min … max):    1.564 s …  1.570 s    10 runs
 
Benchmark 3: ./baseline
  Time (mean ± σ):      1.441 s ±  0.007 s    [User: 1.425 s, System: 0.014 s]
  Range (min … max):    1.436 s …  1.459 s    10 runs
 
Summary
  './baseline' ran
    1.03 ± 0.01 times faster than './32bit'
    1.09 ± 0.01 times faster than './64bit'

`release-fast` mode

hyperfine --warmup 1 ./32bit-fast ./64bit-fast ./baseline
-fast
Benchmark 1: ./32bit-fast
  Time (mean ± σ):     971.2 ms ±   7.4 ms    [User: 958.6 ms, System: 11.5 ms]
  Range (min … max):   964.9 ms … 986.8 ms    10 runs
 
Benchmark 2: ./64bit-fast
  Time (mean ± σ):     972.8 ms ±   2.4 ms    [User: 961.3 ms, System: 10.4 ms]
  Range (min … max):   969.5 ms … 978.2 ms    10 runs
 
Benchmark 3: ./baseline-fast
  Time (mean ± σ):     972.8 ms ±   2.4 ms    [User: 961.6 ms, System: 10.4 ms]
  Range (min … max):   969.0 ms … 976.7 ms    10 runs
 
Summary
  './32bit-fast' ran
    1.00 ± 0.01 times faster than './64bit-fast'
    1.00 ± 0.01 times faster than './baseline-fast'

WojciechMazur · 2023-05-29T14:33:13Z

Thank you for introducing me to hyperfine. I can confirm that when using thin lto i can get similar results.

…]_has_trait` (#3279) * Efficient trait_has_trait using a Array[Int] as bitset instead of Array[Array[Boolean]]

lolgab added 5 commits May 7, 2023 16:59

WIP: Efficient trait_has_trait

fb93eb3

Using a Array[Long] as bitset instead of Array[Array[Boolean]]

It links now

d55264e

Cleanup code

7850cfe

Basic usage works

d2aff25

Use the right table

34979a7

lolgab commented May 8, 2023

View reviewed changes

lolgab added 5 commits May 8, 2023 15:48

Fix compilation on Scala 3

dc28a54

Run Scalafmt

51853bf

Avoid making Impl object public

93716f8

Rephrase genHasTrait comment

4bedc6d

DRY genHasTrait a bit

5ff3320

lolgab marked this pull request as ready for review May 9, 2023 17:24

lolgab marked this pull request as draft May 9, 2023 17:25

Use Array[Int] instead of Array[Long] for matrices

852bea5

lolgab changed the title ~~[WIP] Use Array[Long] instead of Array[Array[Boolean]] for [class,trait]_has_trait~~ [WIP] Use Array[Int] instead of Array[Array[Boolean]] for [class,trait]_has_trait May 26, 2023

lolgab changed the title ~~[WIP] Use Array[Int] instead of Array[Array[Boolean]] for [class,trait]_has_trait~~ Use Array[Int] instead of Array[Array[Boolean]] for [class,trait]_has_trait May 26, 2023

lolgab marked this pull request as ready for review May 26, 2023 17:31

lolgab mentioned this pull request May 26, 2023

Backports for release 0.4.13 #3289

Merged

4 tasks

WojciechMazur approved these changes May 29, 2023

View reviewed changes

WojciechMazur merged commit e0ee700 into scala-native:main May 29, 2023
69 checks passed

lolgab deleted the efficient-has-trait branch May 29, 2023 14:37

WojciechMazur pushed a commit that referenced this pull request May 30, 2023

Use Array[Int] instead of Array[Array[Boolean]] for `[class,trait…

486b91a

…]_has_trait` (#3279) * Efficient trait_has_trait using a Array[Int] as bitset instead of Array[Array[Boolean]]

WojciechMazur pushed a commit that referenced this pull request Jun 5, 2023

Use Array[Int] instead of Array[Array[Boolean]] for `[class,trait…

b7ab1f6

…]_has_trait` (#3279) * Efficient trait_has_trait using a Array[Int] as bitset instead of Array[Array[Boolean]]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use `Array[Int]` instead of `Array[Array[Boolean]]` for `[class,trait]_has_trait` #3279

Use `Array[Int]` instead of `Array[Array[Boolean]]` for `[class,trait]_has_trait` #3279

lolgab commented May 8, 2023 •

edited

lolgab May 8, 2023

lolgab May 8, 2023 •

edited

lolgab May 8, 2023

WojciechMazur commented May 16, 2023

lolgab commented May 17, 2023

WojciechMazur commented May 18, 2023

armanbilge commented May 19, 2023

lolgab commented May 26, 2023 •

edited

WojciechMazur commented May 29, 2023

		@@ -0,0 +1,26 @@
		package scala.scalanative

		private[scalanative] class BitMatrix private (

Use Array[Int] instead of Array[Array[Boolean]] for [class,trait]_has_trait #3279

Use Array[Int] instead of Array[Array[Boolean]] for [class,trait]_has_trait #3279

Conversation

lolgab commented May 8, 2023 • edited

Results

lolgab May 8, 2023

Choose a reason for hiding this comment

lolgab May 8, 2023 • edited

Choose a reason for hiding this comment

lolgab May 8, 2023

Choose a reason for hiding this comment

WojciechMazur commented May 16, 2023

Benchmark

Results

lolgab commented May 17, 2023

WojciechMazur commented May 18, 2023

armanbilge commented May 19, 2023

lolgab commented May 26, 2023 • edited

debug mode

release-fast mode

WojciechMazur commented May 29, 2023

Use `Array[Int]` instead of `Array[Array[Boolean]]` for `[class,trait]_has_trait` #3279

Use `Array[Int]` instead of `Array[Array[Boolean]]` for `[class,trait]_has_trait` #3279

lolgab commented May 8, 2023 •

edited

lolgab May 8, 2023 •

edited

lolgab commented May 26, 2023 •

edited

`debug` mode

`release-fast` mode