Skip to content
This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

Attaching new columns #116

Closed
mverzett opened this issue Apr 2, 2019 · 5 comments
Closed

Attaching new columns #116

mverzett opened this issue Apr 2, 2019 · 5 comments

Comments

@mverzett
Copy link

mverzett commented Apr 2, 2019

Hi,

when I try to attach a new column to a table that is the result of a cut I get the following error:

ValueError: new columns can only be attached to the original Table, not a view (try table.base['col'] = array)

The error very clear and so is the cause, but I do not get the reason for such limitation, and if there is a way to circumvent it.

Thanks!

@jpivarski
Copy link
Member

jpivarski commented Apr 2, 2019

A view of a Table is like a lazy filter—the reason that selections on Tables are lazy is because some of the fields might not even be in memory. Generally, all awkward operations touch their contents as little as possible so that you don't end up loading a thousand columns of data just to filter then when you only cared about the filtered versions of two or three of them.

Okay, but not that leads to this problem: a filtered Table consists of the full Table and a selection—say the fill Table has a million rows and the filtered version would have a thousand—and to assign to it, you give it an array that has a thousand elements. What does it put in for those other elements? You never see them, but it has to match the size of the other columns, which haven't been evaluated with the filter.

Option 1: attempts to assign could cause all other columns to get filtered, but that would break the pattern, could be prohibitive to load into memory, and would violate the profile of least surprise.

Option 2: the new column could be turned into an IndexedMaskedArray, which could be surprising because it's a different data type from what you'd get off you assigned to a non-view Table.

Option 3: turning it into a SparseArray would also solve it. A SparseArray is a different class—could be surprising—but it's the same high-level data type. Type-safe functions wouldn't be telling you, "You have to check for the null case." SparseArray unfortunately has a log(n) __getitem__ time, rather than O(1).

All of that is (possibly me talking to myself) ways to extend the code to allow assignments to views. To solve your particular problem right now, you need a way to strictly evaluate the view, turning it into a non-view Table. You could do this:

not_a_view = Table()
for name in original_view.columns:
    not_a_view[name] = original_view[name]

because the Table.__getitem__ strictly evaluates (to give you an unattached column).

If the Table is actually a jagged table, you don't want to put jagged arrays into a Table, you want a new jagged table (jaggedness on the outside). If so, do this:

not_a_view = JaggedArray.zip({n: original_view[n] for n in original_view.columns})

@jpivarski
Copy link
Member

Actually, what I ought to do is Option 4: wrap up the code examples I have you as a new function called compact. Compactness is a concept that has recently been useful for simplifying jagged arrays, and this is kind of like compactness for Tables. Quite possibly, all awkward array classes each have their own concept of compactness, which keeps the meaning of the data the same but simplifies the internal representation to a canonical form.

When that's implemented, the warning about assigning to a view would say, "Call compact() first."

For now, use one of those two recipes.

@nsmith-
Copy link
Contributor

nsmith- commented Apr 2, 2019

I am worried this is a sign of an anti-pattern: making selections, then calculating derived values.
We can more easily afford, and should encourage calculating derived values before masking. And if it is actually performance critical, I think the right choice is a MaskedArray, or in this case a masked jagged table.

@jpivarski
Copy link
Member

If the motivation for selecting before computing derived quantities is to save time by avoiding computations on irrelevant events, then it could be premature optimization, or transcribing the imperative code too closely. Then again, it might be mature optimization (!), or it might be to avoid impossible situations (e.g. use the first particle of an empty event).

In any of those cases, MaskedArrays should help. Instead of changing the number of events in an array, MaskedArrays replace numerical elements with None values. Applying functions to MaskedArrays skips the None values.

The only problem with that is that the MaskedArray part of the ecosystem isn't as well fleshed out. We have MaskedArray constructors, but no convenience methods to build them from other array types. Which, I suppose, is a call to arms, an indication that the MaskedArrays really are useful, after JaggedArrays and Tables...

@mverzett
Copy link
Author

mverzett commented Apr 4, 2019

Thanks for the quick reply. The whole discussion is quite interesting. Indeed the situation is part of both. On one side I am still thinking in imperative form (thanks for pointing that out) and trying to mask at each step, partly is exactly to avoid impossible situations while still keeping the whole data consistent in a Table without having to carry around too many variables. I will try with your suggestions and come back to you in case I have further questions, and I will :).

@mverzett mverzett closed this as completed Apr 4, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants