Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Array[Int] instead of Array[Array[Boolean]] for [class,trait]_has_trait #3279

Merged
merged 11 commits into from May 29, 2023
Merged

Use Array[Int] instead of Array[Array[Boolean]] for [class,trait]_has_trait #3279

merged 11 commits into from May 29, 2023

Conversation

lolgab
Copy link
Contributor

@lolgab lolgab commented May 8, 2023

Results

Here the results with the Ember hello world.

 31056 -rwxr-xr-x@ 1 lorenzo  staff   15898802  8 Mag 14:31 after*
675624 -rw-r--r--@ 1 lorenzo  staff  345918875  8 Mag 15:07 after.ll
     0 drwxr-xr-x@ 4 lorenzo  staff        128  1 Mag 16:30 b/
 40632 -rwxr-xr-x@ 1 lorenzo  staff   20803106  8 Mag 14:52 before*
783056 -rw-r--r--@ 1 lorenzo  staff  400921751  8 Mag 14:40 before.ll

the binary size is reduced by ~4.9 MB and the generated out.ll file is reduced by 55 MB
The runtime perfomance seems unchanged:

# Before
> hey -z 60s http://localhost:8081

Summary:
  Total:        60.0026 secs
  Slowest:      0.1061 secs
  Fastest:      0.0003 secs
  Average:      0.0034 secs
  Requests/sec: 14778.1655

# After
> hey -z 60s http://localhost:8080

Summary:
  Total:        60.0031 secs
  Slowest:      0.1002 secs
  Fastest:      0.0002 secs
  Average:      0.0034 secs
  Requests/sec: 14824.3753

@@ -504,7 +592,7 @@ object Generate {
}
}

private object Impl {
object Impl {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to move the constats that I need in to use in Lower outside of the Impl object and make them public.

@@ -0,0 +1,26 @@
package scala.scalanative

private[scalanative] class BitMatrix private (
Copy link
Contributor Author

@lolgab lolgab May 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a minimal version of our java BitSet implementation with fixed size and no bound checks.
It represents a matrix so it calculates the array index storing the matrix by rows.

Comment on lines 86 to 99
// def __get_[class,trait]_has_trait(firstid: Int, secondid: Int): Boolean = {
// val columns = meta.traits.length
// val row = firstid * columns
// val bitIndex = row + secondid
// val arrayPos = bitIndex >> AddressBitsPerWord
// val long = bits(arrayPos)
// val toShift = bitIndex & RightBits
// val toShiftLong = toShift.toLong
// val mask = 1L << toShiftLong
// val and = long & mask
// val result = and != 0L
//
// result
// }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we need this comment. I can remove it if you think it's redundant.

@lolgab lolgab marked this pull request as ready for review May 9, 2023 17:24
@lolgab lolgab marked this pull request as draft May 9, 2023 17:25
@WojciechMazur
Copy link
Contributor

Even though in typical usage the new implementation might not introduce any regression in performance when we isolate the benchmark to mostly type checks, then the difference can be actually observed:

Anyway, I think this change should be included into the main. Probably in the future we might provide users with option to use the old implementation while acknowledging the higher binary size. In fact we might need to do something similar for virtual methods dispatch in which we use a large array to ensure constant lookup time, which however blocks capabilities of incremental compilation.

Benchmark

Tested with releaseFast + LTO thin
Additional read/write ops were introduce to ensure that code would not be dead-code eliminate/partially evaluated at compile-time.

trait T1
trait T2 extends T1
trait T3 extends T2
trait T4 extends T3

class C1
class C2 extends C1
class C3 extends C2
class C4 extends C3

object Test {
  def main(args: Array[String]): Unit = {
    var c = 0
    var b = false
    var i = 0
    var x = 0
    while (i < 1000000) {
      val obj = (i % 3) match {
        case 0 => new C4
        case 1 => new T3 {}
        case _ => new {}
      }
      assert(obj != null)
      while (c < 1000000) {
        b &&= obj.isInstanceOf[T1]
        b &&= obj.isInstanceOf[T2]
        b &&= obj.isInstanceOf[T3]
        b &&= obj.isInstanceOf[T4]
        b &&= obj.isInstanceOf[C1]
        b &&= obj.isInstanceOf[C2]
        b &&= obj.isInstanceOf[C3]
        b &&= obj.isInstanceOf[C4]
        c += 1
        if (b) x += 1
      }
      i += 1
    }
  }
}

Results

gitpod /workspace/scala-native (main) $ time ./before 

real    0m0.006s
user    0m0.001s
sys     0m0.006s
gitpod /workspace/scala-native (main) $ time ./after 

real    0m0.014s
user    0m0.009s
sys     0m0.005s

@lolgab
Copy link
Contributor Author

lolgab commented May 17, 2023

@WojciechMazur Great that you wrote a benchmark program, so we can test the performance of a bitset with 32 bit integers and also the performance of a one-dimensional array of Booleans, which are two other options.
I was thinking, do the matrix need to contain all the possible classes and traits in the application?
Couldn't at link time flag the classes and traits that appear in isInstanceOf calls and materialize the matrix only for those?
I'm not sure I understood how these runtime checks work so if I'm saying something completely wrong let me know :)

@WojciechMazur
Copy link
Contributor

Couldn't at link time flag the classes and traits that appear in isInstanceOf calls and materialize the matrix only for those?

That might be an interesting experiment, but I'm not sure if that would make a lot of difference. Becouse the NIR is generated after the erasure, the Scala AST might be already full of .asInstanceOf and .isInstanceOf calls, which probably might by present for most of the types. However we could check that. Probably it would be easiest to handler it in Lower phase, there Op.Is and Op.As would be transformed to actual low level calls. I mention .asInstanceOf here as well, because it's lowering included runtime isInstanceOf check as well.

@armanbilge
Copy link
Member

I mention .asInstanceOf here as well, because it's lowering included runtime isInstanceOf check as well.

Btw, not a priority, but eventually I would be happy to see an option so that in release mode all .asInstanceOf is no-op and ClassCastException becomes undefined behavior. It is already undefined behavior in Scala.js so cross-platform code should not be encountering that, and idiomatic Scala code is also not relying on ClassCastException for correct semantics (certainly not anything in Typelevel stack). Maybe this can be good for performance and even better if we can use it to reduce the size of {class,trait}_has_trait.

@lolgab
Copy link
Contributor Author

lolgab commented May 26, 2023

Converted the matrixes to Array[Int] and it seems that the performances improved on my Apple ARM64 machine. In my benchmark is always on par with baseline. But I couldn't test with LTO=thin. I'm converting the matrixes to Array[Int] since it should also improve performance on 32 bit machines.

My benchmarks:

debug mode

hyperfine --warmup 1 ./32bit ./64bit ./bas
eline
Benchmark 1: ./32bit
  Time (mean ± σ):      1.479 s ±  0.012 s    [User: 1.461 s, System: 0.016 s]
  Range (min … max):    1.466 s …  1.508 s    10 runs
 
Benchmark 2: ./64bit
  Time (mean ± σ):      1.567 s ±  0.002 s    [User: 1.550 s, System: 0.015 s]
  Range (min … max):    1.564 s …  1.570 s    10 runs
 
Benchmark 3: ./baseline
  Time (mean ± σ):      1.441 s ±  0.007 s    [User: 1.425 s, System: 0.014 s]
  Range (min … max):    1.436 s …  1.459 s    10 runs
 
Summary
  './baseline' ran
    1.03 ± 0.01 times faster than './32bit'
    1.09 ± 0.01 times faster than './64bit'

release-fast mode

hyperfine --warmup 1 ./32bit-fast ./64bit-fast ./baseline
-fast
Benchmark 1: ./32bit-fast
  Time (mean ± σ):     971.2 ms ±   7.4 ms    [User: 958.6 ms, System: 11.5 ms]
  Range (min … max):   964.9 ms … 986.8 ms    10 runs
 
Benchmark 2: ./64bit-fast
  Time (mean ± σ):     972.8 ms ±   2.4 ms    [User: 961.3 ms, System: 10.4 ms]
  Range (min … max):   969.5 ms … 978.2 ms    10 runs
 
Benchmark 3: ./baseline-fast
  Time (mean ± σ):     972.8 ms ±   2.4 ms    [User: 961.6 ms, System: 10.4 ms]
  Range (min … max):   969.0 ms … 976.7 ms    10 runs
 
Summary
  './32bit-fast' ran
    1.00 ± 0.01 times faster than './64bit-fast'
    1.00 ± 0.01 times faster than './baseline-fast'

@lolgab lolgab changed the title [WIP] Use Array[Long] instead of Array[Array[Boolean]] for [class,trait]_has_trait [WIP] Use Array[Int] instead of Array[Array[Boolean]] for [class,trait]_has_trait May 26, 2023
@lolgab lolgab changed the title [WIP] Use Array[Int] instead of Array[Array[Boolean]] for [class,trait]_has_trait Use Array[Int] instead of Array[Array[Boolean]] for [class,trait]_has_trait May 26, 2023
@lolgab lolgab marked this pull request as ready for review May 26, 2023 17:31
@lolgab lolgab mentioned this pull request May 26, 2023
4 tasks
@WojciechMazur
Copy link
Contributor

Thank you for introducing me to hyperfine. I can confirm that when using thin lto i can get similar results.

@WojciechMazur WojciechMazur merged commit e0ee700 into scala-native:main May 29, 2023
69 checks passed
@lolgab lolgab deleted the efficient-has-trait branch May 29, 2023 14:37
WojciechMazur pushed a commit that referenced this pull request May 30, 2023
…]_has_trait` (#3279)

* Efficient trait_has_trait using a Array[Int] as bitset instead of Array[Array[Boolean]]
WojciechMazur pushed a commit that referenced this pull request Jun 5, 2023
…]_has_trait` (#3279)

* Efficient trait_has_trait using a Array[Int] as bitset instead of Array[Array[Boolean]]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants