Transient properties: a new approach to deep selection and update in maps and arrays #334
Comments
This is an interesting proposal. A few questions/observations:
Note 1: MarkLogic made their JSON types instances of node() so that they had defined parent, etc. accessors. Since XPath/XQuery defined arrays and maps as functions, that is not possible here, so this is potentially a valid workaround. Effectively, this provides a way to make arrays/maps/etc. used in JSON and JSON-like objects into pseudo-nodes.

Note 2: This is similar/related to the proposal/idea of making the parts of atomic types (year, day, month, authority, uri-path, etc.) available to users.
I agree, that’s a creative and promising approach. My immediate thoughts revolve around performance: When dealing with small or medium-sized JSON files, there will be no reason to think about performance. However, I’m aware of use cases in which millions of map & array entries are processed, and it would be significant overhead to create additional properties and attach them to each traversed item in main memory. That’s particularly relevant if singletons are used for representing common atomic items.

Some years ago, we simplified our data model for similar reasons: We stored (and eventually dropped) transient score values that resulted from XQuery Full Text queries. In some cases, it’s easy to skip the data generation if it will never be needed. In other cases, it can be pretty hard. But I should definitely spend more time on the proposal before giving a final judgement.
Yes, I can think of various ways of making the behaviour optional (so you don't get the properties unless you're going to need them), but I thought I would try to explore whether we can optimize the costs away first. In the prototype Saxon implementation (see https://www.saxonica.com/documentation12/index.html#!functions/saxon/with-pedigree) you start by marking a map or array as being "with pedigree", and the "pedigree" (=transient properties) is only maintained by lookup operations that start from such a map or array.
This proposal feels liberating, and natural: just yesterday to retain key info, I pushed copies into a new map entry within the map in each one's value. Felt very hacky. And made me wonder about performance, because of possible ballooning effects.

It may be worth thinking about a mechanism that permits selective transient properties. There will be cases with very large maps where one wants to retain ¶key but not ¶parent.

I like the look and semantics of the pilcrow, but it will be a nuisance to find it. What about @@ or @ followed by some other punctuation character?

Finally, an opportunity to construct elegant predicate filter expressions.
So, I promised that part 2 of the proposal would address deep update. Let's start with an example of what it should look like to users. The following increases the prices of all products, at any depth, by 10% (returning a new value that's the same as the original in all other respects):
How do we make this work? I'll start with a very informal explanation, and then sketch a more formal definition.

Firstly, there's a third argument The The If you want to add things to the tree, or delete things, then you do that by making a change to the parent. For example, to add a product, you can do

There's a complication if the selection includes a node that is an ancestor of another selected node. There are various ways we could handle this: I would propose that if an ancestor is selected and changed, then neither its old children nor its new children are further processed.

Now, how to describe this more formally? I've glossed over a number of issues that need to be addressed. For example, I said that the ¶root property in all selected values must be equal to $root, but what does "equal" mean here? Maps and arrays, remember, have no identity. I think the best way to tackle this is probably to give values a transient identity for the duration of the operation. So we can sketch a formal description as follows:
Of course, an actual implementation will work differently. Many implementations will use underlying data structures (such as Java immutable maps) where (a) the nodes already have a perfectly usable ID, and (b) virtual copies can be made cheaply, reusing parts of the tree that haven't changed. But we don't need to talk about that.
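To make the deep-update idea in this comment concrete, here is a minimal Python sketch rather than XQuery. Everything in it is an assumption made for illustration: the names `deep_update`, `select` and `action` are hypothetical, the selection is modelled as a predicate over (path, value), and the path of keys from the root stands in for the "transient identity" mentioned above. It also copies the whole tree naively, which a real implementation would avoid.

```python
def paths(value, path=()):
    """Yield (path, value) for every value in a JSON-like tree.

    The path (a tuple of keys/indexes from the root) plays the role of a
    transient identity for the duration of one update.
    """
    yield path, value
    if isinstance(value, dict):
        for k, v in value.items():
            yield from paths(v, path + (k,))
    elif isinstance(value, list):
        for i, v in enumerate(value):
            yield from paths(v, path + (i,))


def deep_update(root, select, action):
    """Return a new tree in which every selected value is replaced by action(value)."""
    # Phase 1: evaluate the selection against the original tree.
    selected = {p for p, v in paths(root) if select(p, v)}

    # Phase 2: rebuild the tree, replacing selected values.
    def rebuild(value, path):
        if path in selected:
            # Ancestor rule from the comment: once a value is changed,
            # neither its old nor its new children are processed further.
            return action(value)
        if isinstance(value, dict):
            return {k: rebuild(v, path + (k,)) for k, v in value.items()}
        if isinstance(value, list):
            return [rebuild(v, path + (i,)) for i, v in enumerate(value)]
        return value

    return rebuild(root, ())


# Hypothetical sample data: products at different depths.
root = {"product": {"price": 100}, "bundle": [{"product": {"price": 200}}]}

# Select every 'price' directly inside a 'product', at any depth; raise by 10%.
updated = deep_update(
    root,
    select=lambda path, v: path[-2:] == ("product", "price"),
    action=lambda v: round(v * 1.1, 2),
)

print(updated["product"]["price"])               # 110.0
print(updated["bundle"][0]["product"]["price"])  # 220.0
print(root["product"]["price"])                  # 100 (original is untouched)
```

Adding or deleting an entry fits the same shape: select the parent (for example the path to a product list) and have `action` return a new list or map.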
An alternative syntax would be the rather COBOL-like
Using custom syntax rather than a function gives us more freedom in defining the semantics, but of course there are downsides as well.
One construct that has proven particularly successful in BaseX is the update expression:

(<a/>, <b/>) update {
  rename node . as 'x'
} update {
  insert node "y" into .
}
(: result :)
<x>y</x>
<x>y</x>

A simpler variant (for single nodes, without chaining) was adopted in the XQuery Update Facility 3.0. Unfortunately, due to grammar restrictions, two keywords are required for XQuery Update operations.

(: existing syntax for nodes :)
document {
  <root><product><price>2</price></product></root>
} update {
  //product/price ! (replace value of node . with . * 1.1)
}
(: syntax for maps and arrays? :)
[
  map { 'product': map { 'price': 2 } },
  map { 'product': map { 'price': 3 } }
] update {
  ??product?price ! (replace value of . to . * 1.1)
}
Closing this as it was essentially implemented in PR #988.
After exploring many alternatives, I have come to the conclusion that we can't solve the problem of deep navigation and transformation of JSON structures without a data model change.
Most of the problems boil down to this: JSON trees do not have parent pointers, therefore after navigating down to a leaf node of the tree, we cannot get any information from higher up the tree. The solution to this (the "zipper" model) is to retain transient information about how a particular node in the tree was reached, so that we can retrace our steps and revisit nodes that were passed en route.
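As a rough illustration of the zipper idea, here is a minimal Python sketch (not XQuery). The class name `Located`, its `get` method, and the `ancestors` and `root` helpers are all hypothetical names chosen for this example, and the sample data is made up. The wrapper simply remembers which key was used and which container was stepped down from, so the route can be retraced.

```python
from dataclasses import dataclass
from typing import Any, Optional, Union


@dataclass(frozen=True)
class Located:
    """A value plus transient information about how it was reached."""
    value: Any
    key: Optional[Union[str, int]] = None  # key or index used in the last step
    parent: Optional["Located"] = None     # the container we stepped down from

    def get(self, key):
        """Downward selection that records the step taken."""
        return Located(self.value[key], key, self)

    @property
    def ancestors(self):
        """Derived: containers passed en route, nearest first."""
        node, out = self.parent, []
        while node is not None:
            out.append(node)
            node = node.parent
        return out

    @property
    def root(self):
        """Derived: the value the navigation started from."""
        chain = self.ancestors
        return chain[-1] if chain else self


# Hypothetical sample data.
person = {"firstName": "Michael", "surname": "Jones"}
name = Located(person).get("firstName")

print(name.value)                        # Michael
print(name.key)                          # firstName
print(name.parent.get("surname").value)  # Jones
```

The underlying dict is never modified; the extra information lives only in the wrapper and disappears as soon as the plain `value` is taken out.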
The change I propose is quite minor, but powerful: Any XDM value can be augmented with a set of transient properties represented as a set of key-value pairs. These properties are ignored (and typically dropped) by all operations on a value, except where otherwise specified. For the purpose of exposition, I'll use the syntax

$value¶name

to refer to the transient name property of $value.

We'll change the semantics of map:get() and array:get(), and the associated lookup operators, so that the resulting values have transient properties indicating how they were selected. For example, given

let $name := $person?firstName
the resulting value (perhaps the string "Michael") will be augmented with transient properties
and derived properties:
We can also define other "downward selection" operations such as map:find and array:foot to retain these transient properties. So for example

map:find($json, 'firstname')[.='Michael']¶parent?surname

now finds the surnames of anyone named 'Michael', at any depth of the tree.

Turning back to the use cases in my 2016 paper on transforming JSON (https://www.saxonica.com/papers/xmlprague-2016mhk.pdf):
The first use case (bulk update) relied on matching items expressed in XML as
match="map[array[@key='tags']/string='ice']/number[@key='price']/text()"
which couldn't be done in JSON because of the inability to match based on ancestor context. With the new transient properties we can match this as
match="type(xs:integer)[¶key = 'price'][¶parent?tags?* = 'ice']"
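The effect of this pattern can be mimicked in a short Python sketch (not XQuery). The helpers `walk` and `iced_prices` are hypothetical names, the catalogue data is invented, and the type test is loosened to any number for the sake of the example. The point is only that each value is visited together with its key and parent, so a test on a leaf can look at its siblings.

```python
def walk(value, key=None, parent=None):
    """Yield (value, key, parent) for every value in a JSON-like tree."""
    yield value, key, parent
    if isinstance(value, dict):
        for k, v in value.items():
            yield from walk(v, k, value)
    elif isinstance(value, list):
        for i, v in enumerate(value):
            yield from walk(v, i, value)


def iced_prices(root):
    """Numbers stored under 'price' whose parent map has 'ice' among its tags."""
    return [
        v
        for v, key, parent in walk(root)
        if isinstance(v, (int, float)) and not isinstance(v, bool)
        and key == "price"
        and isinstance(parent, dict)
        and "ice" in parent.get("tags", [])
    ]


# Hypothetical sample data.
catalogue = [
    {"name": "ice sculpture", "price": 12.5, "tags": ["cold", "ice"]},
    {"name": "lava lamp", "price": 25, "tags": ["warm"]},
]

print(iced_prices(catalogue))  # [12.5]
```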
In the second use case (hierarchic inversion), we can again get properties of parent or ancestor maps
$students ! map:put(., "course", ¶parent?name)
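A small Python sketch of the same inversion, under stated assumptions: the function names `select_students` and `invert`, the field names `students`, `name` and `id`, and the data are all hypothetical, and "parent" is taken to mean the enclosing course map rather than the array holding the students. The selection step returns each student paired with the course it came from, which is the information the transient property would carry.

```python
def select_students(courses):
    """Flat selection that remembers, for each student, the course it came from."""
    return [(student, course) for course in courses for student in course["students"]]


def invert(selected):
    """Build a new map per student, adding the name of its parent course."""
    # {**student, ...} creates a new dict, so the original student maps are untouched.
    return [{**student, "course": parent["name"]} for student, parent in selected]


# Hypothetical sample data.
courses = [
    {"name": "Algebra", "students": [{"id": 1}, {"id": 2}]},
    {"name": "Logic", "students": [{"id": 2}]},
]

for row in invert(select_students(courses)):
    print(row)
# {'id': 1, 'course': 'Algebra'}
# {'id': 2, 'course': 'Algebra'}
# {'id': 2, 'course': 'Logic'}
```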
I think we can also use this to define deep update operations. But I'll leave that investigation until later.
Note: transient properties potentially have many other applications; for example, we might use them to solve our problems with document-uri(). But exploring that would be a distraction here. The nice thing about transient properties is that they give a lot of potential for augmenting existing functionality with full backwards compatibility, because we can define existing operations to return results with additional transient properties that all existing operations will ignore. If we were so minded, for example, we could have different functions/operators return "quiet NaN" and "signalling NaN" by adding a transient property to the NaN value returned.