Block management #135

jpivarski · 2018-09-20T16:43:48Z

Currently, reusing a key name or deleting a key reuses the space used by the key but not the value. The behavior is formally correct, but it is inefficient: assigning new histograms to the same name will make the output file length grow indefinitely. To handle this properly, we need to view the linear sequence of bytes in a file as reusable blocks, rather than a stream.

There are many algorithms that do this. ROOT must have one, for instance, and this is essentially what malloc does with RAM memory. This is an open-ended project involving some thinking about design. These systems are prone to fragmentation; perhaps introducing a (not too small, not too large) granular unit of allocation helps. Perhaps that unit can be 4K to match most filesystems, which read in 4K blocks.

Designs should probably be tested outside of the uproot codebase and only integrated when a good solution is found. Maybe there's a library that does it, though I doubt that. (I'd even have a hard time looking for a search term: by what name do computer scientists refer to this problem?) Once you have a good solution, it should be integrated into TFile._expandfile so that both new values and new keys can take advantage of it.

The text was updated successfully, but these errors were encountered:

jpivarski · 2018-09-24T20:36:31Z

In any block management algorithm, the final write of the file could come at a time when some blocks in the file are unused. The ROOT format allows the locations of these blocks to be saved to the file so that a subsequent process can pick up on them and use them.

fSeekFree points to a TKey and a list of TFree objects specifying the unused ranges within the file. There's always at least one of these, ranging from fEND to 2000000000.
fNbytesFree is the size of the TKey and all the TFree elements.
nfree is the length of this list (not derivable from the TKey and its TFree elements alone). It's always at least 1.

So the fSeekFree and fNbytesFree don't refer to the free space itself, they refer to a lookup table of free space. This is the mistake we were making in issue #136.

anerokhi · 2019-11-23T18:46:46Z

I'd even have a hard time looking for a search term: by what name do computer scientists refer to this problem?

I think this is described in e.g. Knuth's TAOCP Vol.1 2.5 "Dynamic Storage Allocation"

jpivarski · 2019-11-23T20:08:16Z

I don't know, but I think dynamic storage allocation usually refers to allocation of RAM memory (like what malloc does). Allocating blocks in a file is a little different because it's more important to use tightly contiguous blocks in a file and moving data is more costly (disk access is slow). Maybe that chapter describes both problems? I wasn't able to find it.

anerokhi · 2019-11-23T20:37:53Z

Mkay, I didn't learn "Dynamic Storage Allocation" by heart and only skimmed through it, but what I've seen about managing free blocks using linked lists, to me looked very similar to https://root.cern.ch/doc/master/classTFree.html and https://root.cern.ch/doc/master/classTFree.html#aee8b83cc7f0d5729c6ca8e2e06c32cef

jpivarski · 2019-11-23T20:58:12Z

You're right: TFree has a lot to do with it; it's how ROOT saves information about where the active blocks are and which parts of the file are dead space, so that the next time ROOT is run, it knows what it's allowed to overwrite.

There's a lot of freedom in how this algorithm is written. It doesn't have to act like ROOT, but it does have to save data with the same meaning or a compatible meaning for the next invocation.

(When I was scanning through the list of open issues, I skipped by this one because it's a harder problem that requires understanding of more interrelated things. The TEfficiency one is somewhat more isolated, and would therefore make an easier stepping stone. But of course, it's partially up to you—partially because Reik is currently the assignee; he may have plans for this one.)

reikdas · 2019-11-24T01:31:57Z

partially because Reik is currently the assignee; he may have plans for this one.

I was going to come to this after most of the other issues assigned to me are closed so feel free to tackle this issue if you want to :)

jpivarski · 2019-11-24T14:26:38Z

Actually implementing it will require a thorough understanding of the existing codebase, which would be hard to just jump into. This is something that would be easier to sketch out as an independent script—prototyping would be much faster than if it had to be embedded in the real code. I had to do exactly this to solve the getitem problem in Awkward: https://github.com/scikit-hep/awkward-1.0/blob/master/studies/getitem.py

However, the danger of this is that @anerokhi will solve the wrong problem, because there would be no constraint against that. For the getitem problem, I could check solutions against NumPy. For this problem, correctness should be defined by (a) ROOT can read the resulting file and (b) ROOT can add additional objects to that file, but this can only be verified in the full framework that @reikdas has developed.

anerokhi · 2019-11-24T19:24:15Z

the danger of this is that @anerokhi will solve the wrong problem

Well, I was not going to solve this issue, at least not in the next 6 months.
So uproot is not in danger. ;)

jpivarski · 2019-11-24T21:20:15Z

:) "Danger" that you would spend time on something that wouldn't be useful.

jpivarski added the writing-improvements label Sep 20, 2018

jpivarski assigned reikdas Sep 20, 2018

jpivarski mentioned this issue Sep 24, 2018

Make uproot-written files appendable in ROOT #136

Closed

reikdas mentioned this issue Mar 28, 2020

Is uproot.update not yet supported? #381

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Block management #135

Block management #135

jpivarski commented Sep 20, 2018

jpivarski commented Sep 24, 2018

anerokhi commented Nov 23, 2019

jpivarski commented Nov 23, 2019

anerokhi commented Nov 23, 2019

jpivarski commented Nov 23, 2019

reikdas commented Nov 24, 2019

jpivarski commented Nov 24, 2019

anerokhi commented Nov 24, 2019

jpivarski commented Nov 24, 2019

Block management #135

Block management #135

Comments

jpivarski commented Sep 20, 2018

jpivarski commented Sep 24, 2018

anerokhi commented Nov 23, 2019

jpivarski commented Nov 23, 2019

anerokhi commented Nov 23, 2019

jpivarski commented Nov 23, 2019

reikdas commented Nov 24, 2019

jpivarski commented Nov 24, 2019

anerokhi commented Nov 24, 2019

jpivarski commented Nov 24, 2019