This reverts commit e9d6f2a. The reverted change caused very long compile times for shared_state.cpp because of the large number of globals it introduced.
This was previously implemented using FFI, which has some inherent performance problems. FFI allocates the struct used for stat() with malloc and frees it with free, which causes heavy allocator churn. This could be alleviated by using GC-managed memory, but combined with how FFI currently works that would mean either mature-allocating the data or adding new FFI features to allow a call to be made without going GC independent. Mature allocation isn't something we'd want for stat() calls, since stat objects are a perfect example of objects that often live very briefly and should be cleaned up quickly. Introducing both GC-managed memory for FFI calls using structs and the ability to make FFI calls without going GC independent (so the managed memory doesn't move) is a big change, so taking this approach is easier for now. We should revisit this if we improve FFI to make doing this with FFI easier or possible.

Another tricky part about stat() is how it binds regarding 64-bit filesystems. It uses macros for that when you compile your app, so we already had to have helpers for this.

Performance numbers for this change:

Before:
=== bin/rbx ===
File.stat 286796.1 (±2.7%) i/s - 1447040 in 5.049272s

After:
=== bin/rbx ===
File.stat 722130.3 (±2.0%) i/s - 3633256 in 5.033514s
We can use a single cache object, since the JIT now dereferences the pointer stored in the machine code so it can handle replacing constant cache objects.
This means the current InlineCache will be changed to a PolyInlineCache for cases with more than one receiver type.
Before this change, the primitive hookup logic took around 0.2s of startup time on my system. This was due to how the algorithm worked, basically resulting in a number of symbol lookups on the order of the number of primitives squared. By generating these names directly, we remove the string-to-symbol lookup, resulting in 0.2s faster startup on my system.
The problem here is that the module a constant is scoped under can change, for example if code does something like this:

    self.class::SOME_CONSTANT

If the target class here is variable, the cache would have to refer to different constants. The problem with the old strategy is that there is a race condition, because the contents of the cache can't be changed atomically: "under" might already have been updated while the referred value isn't, or vice versa.

The new strategy sets up a reference to a cache entry, much like how the inline cache works for method calls. When retrieving a value, we use this cache object so things don't change while working with it. When the cache is invalid or needs to be updated, we allocate a new cache entry and replace the old one after making sure it's properly initialized.

We also teach the JIT about these new types of objects so we can use regular offsets, instead of working with global addresses like we did before.
For code that heavily uses ensure blocks, creating these thread state objects had quite a big overhead. We were abusing the Rubinius::Internal exception for this behavior and spent a lot of that code's time on symbol lookup for the instance variables. This solution removes the need for any symbol lookup, which also helps in concurrent scenarios since symbol lookup needs a lock around the symbol table.
Right now the JIT can't handle super() calls yet. This led to the allocator for String never being JIT'ted, so for now we use these specific allocators. In the future we should teach the JIT about super() (at least for simple cases like this) so we can remove these specifics. This also sets up Character like we set up the other internal VM classes.
Initially, the thought was that a CharArray could encapsulate the idea of a vector of bytes plus the interpretation of those bytes relative to a particular encoding scheme. In practice, however, that interpretation is really encapsulated in String, which composes a ByteArray and an Encoding. Pushing the logic down into CharArray required delegating almost everything from String, which is a good indicator of a poor abstraction. One example in particular illustrates this: a ByteArray (and CharArray) contains a boundary-aligned number of bytes, the boundary being a machine word. The size of a ByteArray (CharArray) is therefore always greater than or equal to the number of bytes needed for a String's data. Encoding operations need to operate on the precise number of bytes in the String's data, because the extra bytes that pad to a boundary would be misinterpreted in some encodings. Essentially, the more Encoding-aware CharArray became, the more it was just a String under String. So we removed it.
The primary operation used for threading is AtomicReference#compare_and_set, which uses the CPU's CAS (compare-and-swap) instruction.