Measure the impact of using BlockPtr #4

Closed
Horusiath opened this issue Apr 21, 2021 · 4 comments
Labels
wontfix This will not be worked on

Comments

@Horusiath
Collaborator

Horusiath commented Apr 21, 2021

I still have some doubts about the concept behind BlockPtr. While I get the idea behind the initial design, I'm wondering whether all of the operations were taken into account.

As @dmonad mentioned, the original test was about populating a BlockStore of known size with blocks at initialization time. The inline approach is faster here: we just call Vec::<Block>::with_capacity(known_size) and the entire chunk is allocated right away, ready for blocks to be initialized in place. Compared to that, Vec::<Rc<RefCell<Block>>>::with_capacity(known_size) seems subpar for two reasons (see the sketch after this list):

  • We need to create every single block manually as a separate heap object.
  • When additional blocks are added over the lifetime of the application, they will be created on the heap in potentially different parts of memory. Memory locality is in favor of Vec<Block> here.
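
A minimal sketch of the two layouts, using a placeholder Block type (the real one is ~300 B):

use std::cell::RefCell;
use std::rc::Rc;

struct Block { /* ~300 B of fields in the real type */ }

fn main() {
    let known_size = 1024;
    // Inline: a single contiguous allocation sized for all blocks up front.
    let _inline: Vec<Block> = Vec::with_capacity(known_size);
    // Boxed: the Vec holds only 8-byte pointers; every Block becomes its
    // own heap allocation, created one at a time.
    let _boxed: Vec<Rc<RefCell<Block>>> = Vec::with_capacity(known_size);
}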

However, this approach also comes with several other issues that I'm not sure we have talked about:

When new blocks are added to a vector and its capacity is reached, then before appending the next element the vector has to allocate a new internal array (also on the heap) of sizeof(Block) * vec.capacity() * 2 (the Vec growth factor is around 2.0) and copy all of the existing blocks over to the new space. Block size at the moment is around ~300 B. For a block store with 1024 elements, this means copying ~300 kB of memory before adding a new element. If we're talking about the block store as Vec<Rc<RefCell<Block>>>, 1024 elements take only 8 kB, making resizing significantly cheaper.
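
A quick way to observe that reallocation, assuming a 300-byte stand-in for Block:

fn main() {
    // A Vec at full capacity must reallocate before it can accept one more
    // element, copying the whole buffer (~300 kB for 1024 * ~300 B blocks).
    let mut v: Vec<[u8; 300]> = vec![[0u8; 300]; 1024];
    v.shrink_to_fit(); // make capacity == len == 1024
    let before = v.as_ptr();
    v.push([0u8; 300]); // triggers reallocation and a copy of all 1024 elements
    println!("buffer moved: {}", before != v.as_ptr());
    println!("new capacity: {}", v.capacity());
}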

Another issue is access time through BlockPtr - since it's used as an Item's left and right block pointer, every access to the left/right block involves going through the block store. Compared to access through an Rc, this is really expensive.

[image: blockptr diagram]

I don't yet have enough intuition to say how often this will happen.

Here's the assembly of BlockStore::get_item_mut compiled with release options (-C opt-level=2). Keep in mind that this code will be executed every single time we're about to retrieve an item instance from the block store.

BlockStore::get_item_mut:
        push    rbx
        mov     r8, qword ptr [rsi]
        mov     rax, r8
        shr     rax, 57
        mov     r10, qword ptr [rdi]
        mov     r9, qword ptr [rdi + 8]
        mov     rdx, r10
        and     rdx, r8
        lea     rdi, [rdx + 16]
        and     rdi, r10
        movdqu  xmm1, xmmword ptr [r9 + rdx]
        movd    xmm0, eax
        punpcklbw       xmm0, xmm0
        pshuflw xmm0, xmm0, 224
        pshufd  xmm0, xmm0, 0
        movdqa  xmm2, xmm0
        pcmpeqb xmm2, xmm1
        pmovmskb        eax, xmm2
        mov     ecx, 16
        pcmpeqd xmm2, xmm2
.LBB2_1:
        test    ax, ax
        jne     .LBB2_4
        pcmpeqb xmm1, xmm2
        pmovmskb        eax, xmm1
        test    ax, ax
        jne     .LBB2_7
        mov     rdx, rdi
        add     rdi, rcx
        add     rdi, 16
        add     rcx, 16
        and     rdi, r10
        movdqu  xmm1, xmmword ptr [r9 + rdx]
        movdqa  xmm3, xmm0
        pcmpeqb xmm3, xmm1
        pmovmskb        eax, xmm3
        jmp     .LBB2_1
.LBB2_4:
        lea     ebx, [rax - 1]
        and     ebx, eax
        bsf     ax, ax
        movzx   eax, ax
        add     rax, rdx
        and     rax, r10
        neg     rax
        lea     r11, [rax + 4*rax]
        mov     eax, ebx
        cmp     r8, qword ptr [r9 + 8*r11 - 40]
        jne     .LBB2_1
        mov     rax, qword ptr [r9 + 8*r11 - 32]
        mov     ecx, dword ptr [rsi + 16]
        lea     rcx, [rcx + 8*rcx]
        shl     rcx, 5
        cmp     qword ptr [rax + rcx], 0
        jne     .LBB2_6
        add     rax, rcx
        add     rax, 8
        pop     rbx
        ret
.LBB2_7:
        lea     rdi, [rip + .L__unnamed_1]
        lea     rdx, [rip + .L__unnamed_2]
.LBB2_8:
        mov     esi, 43
        call    qword ptr [rip + core::panicking::panic@GOTPCREL]
        ud2
.LBB2_6:
        lea     rdi, [rip + .L__unnamed_1]
        lea     rdx, [rip + .L__unnamed_3]
        jmp     .LBB2_8

It would be good to have some performance tests for Rc and BlockPtr in many different scenarios. Potentially we could also create something that works "like" Rc (e.g. using unsafe pointer arithmetic) and lets us leverage our knowledge of the structures we're going to use, which a generic Rc mechanism won't do.
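
As a starting point, here's a minimal (and admittedly naive) micro-benchmark sketch of the two access patterns, with u64 standing in for Block - names and setup are illustrative only:

use std::cell::RefCell;
use std::rc::Rc;
use std::time::Instant;

const N: usize = 100_000;

fn main() {
    // Index-based access: a stand-in for BlockPtr, where every hop goes
    // back through the store (here: a plain Vec).
    let store: Vec<u64> = (0..N as u64).collect();
    let ptrs: Vec<usize> = (0..N).collect();
    let start = Instant::now();
    let mut sum = 0u64;
    for &p in &ptrs {
        sum += store[p];
    }
    println!("index-based: {:?} (sum = {})", start.elapsed(), sum);

    // Rc-based access: every hop dereferences the Rc directly.
    let blocks: Vec<Rc<RefCell<u64>>> =
        (0..N as u64).map(|v| Rc::new(RefCell::new(v))).collect();
    let start = Instant::now();
    let mut sum = 0u64;
    for b in &blocks {
        sum += *b.borrow();
    }
    println!("rc-based:    {:?} (sum = {})", start.elapsed(), sum);
}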

@Horusiath added the wontfix label on Apr 21, 2021
@Horusiath
Collaborator Author

Horusiath commented Apr 21, 2021

Proposal

I think that BlockPtr might be trying to solve the wrong problem - the problem is not the inefficiency of Rc itself, but rather the generic Rust allocator, which does a poor job of optimizing frequent allocations of many small objects (possibly in batches).

Custom allocator

In the case of Vec, the necessary API is provided to let users define custom allocators - in fact, Vec<T> is an alias for Vec<T, Global>, using the global allocator. We could take a similar approach for Rc: create our own allocator, optimized for allocating objects of the same size (e.g. Block). Since we know the objects have the same size, we can apply a wide range of optimizations, e.g. bump-pointer allocation and bitmaps for looking up unused slices of data.
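
A minimal sketch of the bump-pointer half of such a pool (the bitmap for tracking freed slots is omitted; names are illustrative, not an actual API):

use std::mem::MaybeUninit;

// One pre-allocated chunk of same-sized slots plus a bump pointer.
struct Pool<T> {
    chunk: Vec<MaybeUninit<T>>,
    next: usize, // index of the next free slot
}

impl<T> Pool<T> {
    fn with_capacity(cap: usize) -> Self {
        let mut chunk = Vec::with_capacity(cap);
        chunk.resize_with(cap, MaybeUninit::uninit);
        Pool { chunk, next: 0 }
    }

    // O(1) allocation: write the value into the next slot and bump the index.
    fn alloc(&mut self, value: T) -> *mut T {
        assert!(self.next < self.chunk.len(), "pool exhausted");
        let slot = self.chunk[self.next].write(value);
        self.next += 1;
        slot as *mut T
    }
}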

The only problem here is that Rust's Rc/Box don't expose custom allocators. For that, we'd need to wrap them. That's why I think the first step could be to define BlockPtr (or maybe BlockRef, as Ptr is used in Rust for unmanaged unsafe pointers) in terms of Rc:

use std::cell::RefCell;
use std::rc::Rc;

struct BlockRef(Rc<RefCell<Block>>); // step 1: naive implementation

This way we could expose only those of Rc's functions that we're interested in. Eventually we could replace the native Rc implementation with our own:

use std::alloc::Allocator; // the Allocator API is still experimental:
                           // https://doc.rust-lang.org/nightly/std/alloc/trait.Allocator.html

// our Rc semi-compatible type, parameterized by an allocator
struct RcIn<T, A: Allocator>(/* impl */);

// our super-optimized allocator type for data types of the same size
struct Pool<T: Sized> { /* impl */ }

unsafe impl<T: Sized> Allocator for Pool<T> {
  // impl
}

// step 2: make our Rc look like the native std::rc::Rc and swap them where necessary
type Rc<T> = RcIn<T, Pool<T>>;

In that shape, this functionality could even be included in some more generic lib like lib0. There's even a chance that we wouldn't need a custom Rc at all, but could instead use the Rc::from_raw function over memory we allocated ourselves.

BlockPtr and smart pointers

Another optimization idea is to use the smart-pointer concept to represent the data behind BlockPtr. In general, with the step above, we could represent our BlockPtr as:

use lib::rc::Rc; // we can now use our "better" Rc implementation

struct BlockPtr(Rc<RefCell<Block>>);

pub enum Block {
    Item(Item),
    Skip(Skip),
    GC(GC),
}
pub struct Skip {
    pub id: ID,
    pub len: u32,
}
pub struct GC {
    pub id: ID,
    pub len: u32,
}

However, we could go further. Let's assume that BlockPtr has a fixed size of 64 bits. Modern OSes and machines often limit the address space in a way that leaves the highest bits of a pointer unused. Some runtimes (like the JVM/.NET) use them for marking during the GC process. We could utilize them as well, for example:

  • Highest 2 bits == 01: BlockPtr points to a GC block. The lower 62 bits are used as the pointer address.
  • Highest 2 bits == 10: BlockPtr points to a Skip block. The lower 62 bits are used as the pointer address.
  • Highest 2 bits == 00: BlockPtr points to an Item block. The lower 62 bits are used as the pointer address.

This way we can tell what kind of block a BlockPtr points to without even jumping to the memory address where the Block itself was allocated.
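
A hypothetical sketch of that tag encoding (constants and names are illustrative, not taken from the codebase):

// Highest 2 bits carry the block kind; the lower 62 bits carry the address.
const TAG_SHIFT: u64 = 62;
const TAG_MASK: u64 = 0b11 << TAG_SHIFT;
const TAG_ITEM: u64 = 0b00 << TAG_SHIFT;
const TAG_GC: u64 = 0b01 << TAG_SHIFT;
const TAG_SKIP: u64 = 0b10 << TAG_SHIFT;

struct BlockPtr(u64);

impl BlockPtr {
    fn new(addr: usize, tag: u64) -> Self {
        // The address must fit in the lower 62 bits.
        debug_assert_eq!(addr as u64 & TAG_MASK, 0);
        BlockPtr(addr as u64 | tag)
    }
    // The block kind is read from the tag bits alone - no dereference needed.
    fn is_item(&self) -> bool { self.0 & TAG_MASK == TAG_ITEM }
    fn is_gc(&self) -> bool { self.0 & TAG_MASK == TAG_GC }
    fn is_skip(&self) -> bool { self.0 & TAG_MASK == TAG_SKIP }
    fn addr(&self) -> usize { (self.0 & !TAG_MASK) as usize }
}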

@dmonad
Contributor

dmonad commented Apr 21, 2021

I think you are right that we will eventually end up writing our own allocator. For now, I wanted to use Map<client_id, BlockList> because it does basically the same thing as a custom allocator, but using optimized constructs (Map & Vec).

I'm absolutely not concerned about the time it takes to iterate through the BlockStore or the time it takes to allocate more memory for a BlockList. The important part is that the initial loading of the document is fast and that we can optimize in the future.

> When new blocks are added to a vector and its capacity is reached, then before appending the next element the vector has to allocate a new internal array (also on the heap) of sizeof(Block) * vec.capacity() * 2 (the Vec growth factor is around 2.0) and copy all of the existing blocks over to the new space. Block size at the moment is around ~300 B. For a block store with 1024 elements, this means copying ~300 kB of memory before adding a new element. If we're talking about the block store as Vec<Rc<RefCell<Block>>>, 1024 elements take only 8 kB, making resizing significantly cheaper.

Moving a range of memory to another location is very fast on modern hardware and optimized by the operating system. The important thing is that we optimize the initial loading of the document, because that's where users are actually blocked from editing. A user won't notice whether inserting a single character takes 1 μs or 1 ms - not that a move operation would add that much overhead.

> Here's the assembly of BlockStore::get_item_mut compiled with release options (-C opt-level=2). Keep in mind that this code will be executed every single time we're about to retrieve an item instance from the block store.

The code you mentioned does quite a lot of things. In the ideal case, we retrieve the BlockList and then access the pivot's item in the list. This can be very fast (my intuition speaking), because the whole BlockList will sit in the fast cache, so the CPU will find it without cache misses. There are a couple of edge cases, e.g. the requested BlockList doesn't exist and needs to be created, or the HashMap lookup misses and we need to find the correct position. These cases are also handled by the code you mentioned.

> I think that BlockPtr might be trying to solve the wrong problem - the problem is not the inefficiency of Rc itself, but rather the generic Rust allocator, which does a poor job of optimizing frequent allocations of many small objects (possibly in batches).

That's exactly the case. But implementing a custom allocator might be out of scope for now. I don't think the current implementation is optimal, but it is very close to what I did in Yjs and should provide good overall performance, even in WASM. I want a custom allocator, but first we should have something to compare against using benchmarks.

In the future, BlockPtr::pivot can point to a location in memory that is maintained by BlockStore using a custom allocator. I think we can change that rather easily after an initial implementation. WDYT?

@Horusiath
Collaborator Author

Horusiath commented Apr 29, 2021

> For now, I wanted to use Map<client_id, BlockList> because it does basically the same thing as a custom allocator, but using optimized constructs (Map & Vec).

I can't agree here - an allocator gives us a direct memory address, so block access is isolated and very fast in that scenario. BlockPtr not only requires us to do multiple jumps, but also leaks that complexity into the library code. Since every Block access goes via transaction.store.clients, there are a few other problems with this approach:

  1. It's specifically problematic in Rust, as we need to fight the borrow checker. Example: if you want to take a Block by its BlockPtr and change its contents, you first need to guarantee (at compile time) that no other variable in the system holds a reference to other transaction fields. A great example here is the item-splitting process, where we iterate over block lists while using them to get mutable references in case of an item split/deletion (see the sketch after this list). It took me 2 days to write it, and it's not even well-optimized code.
  2. It also means that whenever we decide in the future to move to an actual reference/pointer-based approach, we can't just change a few things (e.g. swap Box<T, Global> for Box<T, BlockOptimized>). Given that block access is used frequently all over the place, it may mean rewriting huge parts of the library.
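
A minimal, self-contained illustration of the kind of conflict from point 1 (not the actual yrs code, just the shape of the problem - this deliberately does not compile):

struct Store {
    blocks: Vec<i32>,
}

fn split_blocks(store: &mut Store) {
    for b in &store.blocks {            // immutable borrow held for the whole loop
        if *b > 0 {
            // ERROR[E0502]: cannot borrow `store.blocks` as mutable because
            // it is also borrowed as immutable by the loop above.
            store.blocks.push(*b / 2);
        }
    }
}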

Even if we don't go back to the Rc/Box approach (because we don't want to spend the time writing a custom allocator), I think it's still worth thinking about a different abstraction than BlockPtr, one that at least tries to hide the complexity of the existing solution.

@Horusiath
Collaborator Author

Closing this one - results can be seen in #88.
