Skip to content
This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

Block management #135

Open
jpivarski opened this issue Sep 20, 2018 · 9 comments
Open

Block management #135

jpivarski opened this issue Sep 20, 2018 · 9 comments
Assignees

Comments

@jpivarski
Copy link
Member

Currently, reusing a key name or deleting a key reuses the space used by the key but not the value. The behavior is formally correct, but it is inefficient: assigning new histograms to the same name will make the output file length grow indefinitely. To handle this properly, we need to view the linear sequence of bytes in a file as reusable blocks, rather than a stream.

There are many algorithms that do this. ROOT must have one, for instance, and this is essentially what malloc does with RAM memory. This is an open-ended project involving some thinking about design. These systems are prone to fragmentation; perhaps introducing a (not too small, not too large) granular unit of allocation helps. Perhaps that unit can be 4K to match most filesystems, which read in 4K blocks.

Designs should probably be tested outside of the uproot codebase and only integrated when a good solution is found. Maybe there's a library that does it, though I doubt that. (I'd even have a hard time looking for a search term: by what name do computer scientists refer to this problem?) Once you have a good solution, it should be integrated into TFile._expandfile so that both new values and new keys can take advantage of it.

@jpivarski
Copy link
Member Author

In any block management algorithm, the final write of the file could come at a time when some blocks in the file are unused. The ROOT format allows the locations of these blocks to be saved to the file so that a subsequent process can pick up on them and use them.

  • fSeekFree points to a TKey and a list of TFree objects specifying the unused ranges within the file. There's always at least one of these, ranging from fEND to 2000000000.
  • fNbytesFree is the size of the TKey and all the TFree elements.
  • nfree is the length of this list (not derivable from the TKey and its TFree elements alone). It's always at least 1.

So the fSeekFree and fNbytesFree don't refer to the free space itself, they refer to a lookup table of free space. This is the mistake we were making in issue #136.

@anerokhi
Copy link

I'd even have a hard time looking for a search term: by what name do computer scientists refer to this problem?

I think this is described in e.g. Knuth's TAOCP Vol.1 2.5 "Dynamic Storage Allocation"

@jpivarski
Copy link
Member Author

I don't know, but I think dynamic storage allocation usually refers to allocation of RAM memory (like what malloc does). Allocating blocks in a file is a little different because it's more important to use tightly contiguous blocks in a file and moving data is more costly (disk access is slow). Maybe that chapter describes both problems? I wasn't able to find it.

@anerokhi
Copy link

Mkay, I didn't learn "Dynamic Storage Allocation" by heart and only skimmed through it, but what I've seen about managing free blocks using linked lists, to me looked very similar to https://root.cern.ch/doc/master/classTFree.html and https://root.cern.ch/doc/master/classTFree.html#aee8b83cc7f0d5729c6ca8e2e06c32cef

@jpivarski
Copy link
Member Author

You're right: TFree has a lot to do with it; it's how ROOT saves information about where the active blocks are and which parts of the file are dead space, so that the next time ROOT is run, it knows what it's allowed to overwrite.

There's a lot of freedom in how this algorithm is written. It doesn't have to act like ROOT, but it does have to save data with the same meaning or a compatible meaning for the next invocation.

(When I was scanning through the list of open issues, I skipped by this one because it's a harder problem that requires understanding of more interrelated things. The TEfficiency one is somewhat more isolated, and would therefore make an easier stepping stone. But of course, it's partially up to you—partially because Reik is currently the assignee; he may have plans for this one.)

@reikdas
Copy link
Collaborator

reikdas commented Nov 24, 2019

partially because Reik is currently the assignee; he may have plans for this one.

I was going to come to this after most of the other issues assigned to me are closed so feel free to tackle this issue if you want to :)

@jpivarski
Copy link
Member Author

Actually implementing it will require a thorough understanding of the existing codebase, which would be hard to just jump into. This is something that would be easier to sketch out as an independent script—prototyping would be much faster than if it had to be embedded in the real code. I had to do exactly this to solve the getitem problem in Awkward: https://github.com/scikit-hep/awkward-1.0/blob/master/studies/getitem.py

However, the danger of this is that @anerokhi will solve the wrong problem, because there would be no constraint against that. For the getitem problem, I could check solutions against NumPy. For this problem, correctness should be defined by (a) ROOT can read the resulting file and (b) ROOT can add additional objects to that file, but this can only be verified in the full framework that @reikdas has developed.

@anerokhi
Copy link

the danger of this is that @anerokhi will solve the wrong problem

Well, I was not going to solve this issue, at least not in the next 6 months.
So uproot is not in danger. ;)

@jpivarski
Copy link
Member Author

:) "Danger" that you would spend time on something that wouldn't be useful.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants