Large minimum allocation size in AppendAlloc for SSIDS factor data #119

mjacobse · 2023-06-24T21:33:40Z

The AppendAlloc allocator that allocates memory for storing the factor data keeps a Pool of Pages of memory. These pages have a minimum size of 8MB:

spral/src/ssids/cpu/AppendAlloc.hxx

Lines 67 to 72 in 70c4219

    
    class Pool {
       const size_t PAGE_SIZE = 8*1024*1024; // 8MB
    public:
       Pool(size_t initial_size)
       : top_page_(new Page(std::max(PAGE_SIZE, initial_size)))
       {}

spral/src/ssids/cpu/AppendAlloc.hxx

Lines 87 to 91 in 70c4219

    
    ptr = top_page_->allocate(sz);
    if(!ptr) { // Insufficient space on current top page, make a new one
       top_page_ = new Page(std::max(PAGE_SIZE, sz), top_page_);
       ptr = top_page_->allocate(sz);
    }

So no matter how little memory is required next, if it does not fit on the current top page, an additional 8 MB is allocated and even zero'd:

spral/src/ssids/cpu/AppendAlloc.hxx

Lines 18 to 20 in 70c4219

    
    /** A single fixed size page of memory with allocate function.
     * We are required to guaruntee it is zero'd, so use calloc rather than anything
     * else for the allocation.

I saw this causing an unexpectedly large number of page faults for tiny matrices, which completely dominated the runtime.
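To illustrate the over-allocation, here is a minimal sketch of a pool with a configurable minimum page size. `ToyPool` and `bytes_reserved()` are hypothetical names for illustration only, not part of SPRAL, and alignment handling is left out. It only tracks how many bytes are requested from `calloc`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>
#include <vector>

// Hypothetical stand-in for the pool in AppendAlloc.hxx (not the SPRAL code).
// It mimics the "new page of max(min_page_size, sz)" rule and records the
// total number of bytes requested from calloc.
class ToyPool {
public:
   ToyPool(std::size_t min_page_size, std::size_t initial_size)
   : min_page_size_(min_page_size)
   {
      new_page(initial_size);
   }

   // Bump-allocate sz bytes; open a new page if the top page is too full.
   void* allocate(std::size_t sz) {
      assert(!pages_.empty());
      if(sz > pages_.back().size - pages_.back().used)
         new_page(sz);
      Page& top = pages_.back();
      void* ptr = top.mem.get() + top.used;
      top.used += sz;
      return ptr;
   }

   std::size_t bytes_reserved() const { return bytes_reserved_; }

private:
   struct FreeDeleter { void operator()(char* p) const { std::free(p); } };
   struct Page {
      std::unique_ptr<char, FreeDeleter> mem;
      std::size_t size;
      std::size_t used;
   };

   void new_page(std::size_t sz) {
      std::size_t size = std::max(min_page_size_, sz);
      // Request at least one byte so a zero-sized page still yields a pointer.
      char* mem = static_cast<char*>(std::calloc(std::max<std::size_t>(size, 1), 1));
      if(!mem) throw std::bad_alloc();
      pages_.push_back(Page{std::unique_ptr<char, FreeDeleter>(mem), size, 0});
      bytes_reserved_ += size;
   }

   std::size_t min_page_size_;
   std::size_t bytes_reserved_ = 0;
   std::vector<Page> pages_;
};
```

With an 8 MB minimum and an initial size of 100 bytes, allocating 100 and then 50 bytes reserves 8 MB in total. With a minimum of 0, the same sequence reserves 150 bytes.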

Initially I considered replacing the calloc in the Page constructor with a plain malloc and zeroing only the parts that are actually used, via memset in Page::allocate. That did help.
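A rough sketch of that malloc-plus-memset idea follows. `LazyZeroPage` is a hypothetical, simplified class rather than the actual SPRAL `Page`, and it ignores the alignment handling a real allocator would need.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>

// Hypothetical simplified page (not the SPRAL class): memory comes from
// malloc and is zeroed only when it is handed out.
class LazyZeroPage {
public:
   explicit LazyZeroPage(std::size_t size)
   : mem_(static_cast<char*>(std::malloc(size ? size : 1))), size_(size), used_(0)
   {
      if(!mem_) throw std::bad_alloc();
   }
   ~LazyZeroPage() { std::free(mem_); }
   LazyZeroPage(const LazyZeroPage&) = delete;
   LazyZeroPage& operator=(const LazyZeroPage&) = delete;

   // Returns sz zeroed bytes, or nullptr if they do not fit on this page.
   void* allocate(std::size_t sz) {
      assert(used_ <= size_);
      if(sz > size_ - used_) return nullptr;
      void* ptr = mem_ + used_;
      std::memset(ptr, 0, sz); // zero only the part that is actually used
      used_ += sz;
      return ptr;
   }

private:
   char* mem_;
   std::size_t size_;
   std::size_t used_;
};
```

The caller still gets zeroed memory, but the unused tail of a page is never written to.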

But then I wondered what the purpose of this minimum PAGE_SIZE actually is. From the naming it seems like the idea might be to always request a whole memory page/virtual page from the operating system at once? If so, without knowing much about memory pages, I don't believe that this is achieved with the current implementation. The fixed constant of 8 MB might not be the actual page size and malloc does not seem to be well-equipped to help in this task, something like mmap would probably have to be used (see this stackoverflow question). Maybe the idea is just to reduce the amount of calls to malloc in general. But for that to be worth it, in usual matrices, are there really so many factor nodes to allocate separate blocks of memory for?

I ended up running benchmarks with PAGE_SIZE = 8MB vs. PAGE_SIZE = 0MB. I used google/benchmark with this source as runner and ran it on all real symmetric matrices from the SuiteSparse Matrix Collection that did not run into a 2min timeout. Parameters:

--benchmark_min_time=1s, --benchmark_repetitions=10, OMP_NUM_THREADS=1
Ubuntu 22.04, gcc 11.2, AMD Ryzen 7 PRO 4750U

The plots below show the relative runtime of the PAGE_SIZE = 0MB version in % for each example. The bars show the comparison of the fastest measurement across all repetitions whereas the scattered dots show the comparison of all repetitions pairwise. The matrices are given from left to right in ascending order of fastest measured runtime. More plots of the separate groups can be found here and results for the same benchmark run with 4 threads here.

Simply put, for small matrices, PAGE_SIZE = 0 was unsurprisingly often much faster for me. That improvement grows smaller, the larger the matrices become. But even for larger matrices, there does not seem to be a clear advantage of PAGE_SIZE = 8MB. It looks like noise around the baseline to me.

I understand that SSIDS is intended especially for really large matrices and that performance on small ones is probably less of a concern (if at all). But with these observations it seems to me that removing the overallocation might give performance on small matrices for free, without influencing performance on larger matrices. But perhaps I am missing something. Is the benchmark flawed in some way? Is there a benefit of overallocating once the matrices become really large? Are there platforms with a different implementation of malloc that would benefit? Or certain hardware?

Would be happy to hear your thoughts, sorry for the lengthy description.

jfowkes · 2023-06-26T15:22:35Z

Many thanks for the detailed investigation @mjacobse, it seems to me that this is trying to second-guess the hardware by assuming a minimum page size of 8MB. Now while that may well have been the case ten years ago I see no reason why that should be the case now and I am generally very much against second-guessing the hardware, that is what the OS is for.

From your extensive benchmarks I can see no downside to allowing 0MB page sizes, indeed the improvement is quite substantial for smaller matrices, so I would suggest switching to that as default. @tyronerees your thoughts on this? Also does MA97 do something similar with Page Sizes? If so this could maybe explain the recent memory issues we were seeing.

jhogg41 · 2023-06-27T14:31:31Z

I don't think PAGE_SIZE was ever meant to correspond to OS page size (which is probably something like 4-16KB depending on the OS). I suspect on systems that support unix-style mmap, you can probably get more performance by replacing this whole mechanism with something that just grows memory space using mmap (which may even zero things for you depending on the exact OS). Jonathan.
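As a very rough sketch of the building block such an approach could start from, assuming a Linux-like system where `MAP_ANONYMOUS` is available and anonymous mappings start out zero-filled: `MmapRegion` is a hypothetical helper, not part of SPRAL, and it maps one fixed-size region rather than implementing the growing mechanism described above.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <sys/mman.h>

// Hypothetical helper (not SPRAL code): owns one anonymous private mapping.
// Assumes a platform that provides MAP_ANONYMOUS, e.g. Linux.
class MmapRegion {
public:
   explicit MmapRegion(std::size_t len) : mem_(nullptr), len_(len) {
      assert(len > 0); // mmap rejects zero-length mappings
      void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if(p == MAP_FAILED) throw std::bad_alloc();
      mem_ = p;
   }
   ~MmapRegion() { munmap(mem_, len_); }
   MmapRegion(const MmapRegion&) = delete;
   MmapRegion& operator=(const MmapRegion&) = delete;

   void* data() const { return mem_; }
   std::size_t size() const { return len_; }

private:
   void* mem_;
   std::size_t len_;
};
```

A portable build would still need a malloc-based fallback for platforms without `mmap`.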

jfowkes · 2023-06-28T09:01:05Z

Many thanks @jhogg41 for the clarification! @mjacobse what are your thoughts on an mmap replacement implementation?

mjacobse · 2023-06-28T17:41:13Z

That would be interesting to try I think, but not really trivial to implement I'd imagine? I have never worked with mmap or any memory API lower-level than malloc personally. But it would be platform-dependent code, with a potential parallel implementation for Windows I suppose? I think the fallback of just using standard malloc and letting its implementation deal with it should probably stay in any case. And as mentioned, I get the impression that not doing manual overallocations might be a more natural default for that.

jfowkes · 2023-06-29T08:45:20Z

Yes agreed, that feels like an enhancement to the existing codebase, I propose we switch to PAGE_SIZE = 0MB for now.

jfowkes mentioned this issue Jul 3, 2023

Use PAGE_SIZE = 0MB in AppendAlloc #120

Merged

jfowkes linked a pull request Jul 3, 2023 that will close this issue

Use PAGE_SIZE = 0MB in AppendAlloc #120

Merged

jfowkes closed this as completed in #120 Jul 4, 2023

mjacobse mentioned this issue Jul 14, 2023

Avoid default-init overhead for lookup map in assemble_pre #122

Merged
