Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for PagingAndSortingRepositories [DATAGEODE-263] #311

Closed
spring-projects-issues opened this issue Nov 4, 2019 · 10 comments
Closed

Comments

@spring-projects-issues
Copy link

@spring-projects-issues spring-projects-issues commented Nov 4, 2019

John Blum opened DATAGEODE-263 and commented

This JIRA will track the (epic) development of Spring Data for Apache Geode's (SDG) support of the Spring Data Common's PagingAndSortingRepository.

Currently, Apache Geode does not implement nor support the notion of a database cursor in the querying infrastructure (e.g. Apache Geode's QueryService), which allows for such things as [pre-]fetch size, handling concurrent updates and scroll sensitivity.

However, a good interim solution may be to collect a "list" of keys for the values satisfying the query predicate and lazily fetch the values based on the ordered, paged results.

Apache Geode could handle concurrent updates by the user flipping the copy-on-read switch, but scroll sensitivity would not be handled unless the keys were cached. Still, the "cached" keys would be invalidated the moment the user changed the sort order since the "list" is retaining the order the user initially specified, and without the values, there is no way to change the order. Therefore, subsequent queries with the original predicate could result in a significantly different result set thus changing the number and order of the results


Reference URL: https://jira.spring.io/browse/SGF-524

1 votes, 2 watchers

@spring-projects-issues
Copy link
Author

@spring-projects-issues spring-projects-issues commented Oct 7, 2020

John Blum commented

SDG's Repository infrastructure extension Paging functionality design will follow a 2 phase implementation:

  1. Query all keys matching the query predicate ordered by the Sort.
  2. Query all values IN the set of (paged) keys ordered by the Sort.

Between 1 & 2, the SDG Repository infrastructure extension, will calculate which sub-set of keys matching the "requested" Page specified in the Pageable query (method) parameter (e.g. PageRequest). Keep in mind that Page "Numbers" are 0 index based. Page number 0 is page 1, Page number 1 is page 2, and so on.

SDG's Paging implementation could be a single phase approach if Apache Geode provided the notion of a rownum, similar to a Oracle Database Query ResultSet rownum Pseudocolumn (see here). This is easier to implement than a Database Cursor and gives driver implementors the ability to support paged ResultSets without issuing 2 separate queries.

Also note that SDG's Paging implementation was designed with Query Projections in mind. While Projections are not yet implemented, it is an important consideration in the design of Paging.

@spring-projects-issues
Copy link
Author

@spring-projects-issues spring-projects-issues commented Oct 7, 2020

John Blum commented

Keep in mind that a Repository query method cannot have both a Pageable and Sort parameters.

If an query result set order is required, which is often the case when paging the results of a query, then the Sort parameter should be provided as part of the PageRequest argument to the Repository query method Pageable parameter.

Of course, users are always free to order the results of the query after the query method returns in addition to providing a Sort order if you must. The later can be implemented with a Comparator.

Users are encouraged to apply the Sort (order) to the Pageable argument

@spring-projects-issues
Copy link
Author

@spring-projects-issues spring-projects-issues commented Oct 7, 2020

John Blum commented

1 important distinction between the "Derived" queries, "Named" queries and @Query Paging behavior vs. that of the framework provided PagingAndSortingRepository.findAll(:Pageable) query method is that the later (i.e. findAll(:Pageable)) retrieves (selects) the entire result set of the query, then sub-selects (or extracts) the results for the "requested" page.

This is/was an implementation choice made by the SDG framework given the name of the provided query method is "findAll", where the keyword is "All". Therefore, the "paging" function/behavior was made after the query execution, collecting all the results matching the query predicate(s), which is essentially "all" entries in the Region, then applying the paging logic.

This choice resulted in a simpler implementation and a "single" query. However, this could prove to be very costly in practice if 1) the number of entries in the Region is large, 2) the size of the objects stored in the Region is large and 3) the results are coming from a PR.

Still, for certain UCs, this implementation could be quite useful and even more optimal vs. the 2 phase approach.

As such, the SDG framework encourages the use of "Derived", "Named" and @Query based Repository query methods

@spring-projects-issues
Copy link
Author

@spring-projects-issues spring-projects-issues commented Oct 8, 2020

John Blum commented

1 optimization that can be made to the 2 phase query approach/implementation for Paging is use of the OQL LIMIT keyword.

If the user/caller has not already specified the query result set LIMIT, then a LIMIT can be applied by the Repository infrastructure.

The LIMIT would be applied to the query in the first phase of query execution and determined by...

limit = normalize(pageable.getNumber()) * pageable.getSize();

Of course, the "page number" would be adjusted (i.e. normalized) given Page numbers are 0 index based.

The query in phase 1 would be approximately...

SELECT DISTINCT keys 
FROM /Region x, x.keySet keys 
WHERE ... 
ORDER BY ... 
LIMIT <(natural) page number * page size>

For example, if the Page Number were 1 (i.e. page 2) and the Page Size were 5 (i.e. 5 elements per page) then the LIMIT would be 10. We only need to bring back the first 10 (ordered) results (if there are even 10 elements in the result set) given we are only interested in page 2.

If the query in the first phase would potentially return 100s, 1000s or even 100s of thousands of keys (etc), then this would limit the number of results returned based on the page requested, saving bandwidth, memory and processing. Though, generally speaking, a properly tuned and qualified query should rarely return 100s of thousands of results in the first place.

This would be beneficial for the first few pages of the results since it is likely users of paging applications do not search far beyond the 2nd page of results (or even the 1st page in most cases) anyway.

Of course, application developers are encouraged to "prioritize" the query results from most relevant to least relevant first, which would further support this optimization. This involves carefully planned query predicate(s).

This optimization in no way has any effect on paging behavior. However, it does impact the the framework's ability to determine the total query result set size. If 1000 keys would be returned in the query matching the predicate, then we would know the total matching entries is 1000. However, by applying a LIMIIT based on the page number and page size requested, we will not know the total query result set size

@spring-projects-issues
Copy link
Author

@spring-projects-issues spring-projects-issues commented Oct 8, 2020

John Blum commented

Regarding PARTITION Regions (PR) and Paging...

The most optimal approach to query a PR is by way of Function Execution and filtering (by keys).

The keys provide "routing" to the member node in the cluster hosting data for the targeted key(s) (which could be a secondary unless the Execution is "optimized" for writing) where the Function then performs the OQL query on the "local" data set (bucket) of the member node.

Of course, this requires there to be a OQL query "submitting" Function in the first place, which of course SDG could provide, but there are no guarantees that Spring/Spring Boot or SDG in particular was used to configure and bootstrap the data member nodes of the cluster, which implies there is no guarantee that this "canned" Function would exist (and be registered). This is particularly true in managed environments, e.g. using VMware Tanzu GemFire for VMs & K8s.

SDG's current querying behavior on PRs hosted by data nodes in the cluster is not currently implemented this way and neither will page-based queries even though this could be a nice optimization.

This might be open for consideration later in the future

@spring-projects-issues
Copy link
Author

@spring-projects-issues spring-projects-issues commented Oct 8, 2020

John Blum commented

Regarding JOIN OQL queries and Paging...

JOIN OQL queries involve 2 or more Regions, for example:

SELECT x.* 
FROM /RegionOne x, /RegionTwo y 
WHERE x.field = y.field AND y.field = <value> 
ORDER BY x.anotherField ASC
...

The second collection in the FROM clause does not even need to be a Region.

In this case, the "paging" functionality (and 2 phase approach) will applied to /RegionOne, our "subject" in the example query above.

NOTE: generally speaking, JOINS are not applicable to Repositories. The only way to perform a JOIN is by use of "Named" and @Query annotated (declared) OQL queries. It is not even possible for "Derived" queries.

Actually, this might require more careful thought and consideration and as such, may not be handled in the first iteration

@spring-projects-issues
Copy link
Author

@spring-projects-issues spring-projects-issues commented Oct 8, 2020

John Blum commented

Thought of another approach to Paging vs. the 2 phase approach/implementation described in the first comment above.

Given the use of the LIMIT OQL keyword to limit the result set from the keys query matching the query predicate(s), the 2 phase approach could be simplified to a single phase by applying the LIMIT to the the actual query pulling the values matching the query predicate(s).

This has the advantage of running a single query and minimizing error-prone modification(s) of the original query issued by the caller. Modifying the original query requires careful pattern matching, parsing and replacement.

By way of example, given a value query:

SELECT u.username, u.lastLoginTimestamp
FROM /Users u
WHERE u.active = false
AND u.username LIKE '%Doe'"
ORDER BY u.username ASC
[LIMIT <number>]

The query to retrieve the keys based on the query predicate(s) is roughly equivalent to:

SELECT u.key
FROM /Users.entrySet u
WHERE u.value.active = false
AND u.value.username LIKE '%Doe'"
ORDER BY u.value.username ASC
LIMIT 1

This requires significant, non-trivial modification to the original query, which, while possible, is error-prone and further complicated by "non-Derived" queries (i.e. @Query annotated queries and "Named" queries), particularly if they are using JOINS or other collections.

Unfortunately, the downside of the LIMIT based approach is that it increasingly uses more processing cycles, network bandwidth and memory as the page number goes up in relation to the page size.

For instance, if the page size were 100 and the user requested page 10, then this would retrieve the first 1000 values (if present) vs the 2 phase approach which would always be based on the page size, but result in 2 queries.

Of course, again, if the user uses sensible page sizes (e.g. 20) and well-crafted query predicates (targeted results) the effects of the LIMIT based approach could be minimized.

For the time being, I am considering using the 2 phase approach/implementation for "Derived" queries and the LIMIT based approach for @Query annotated queries and "Named" queries.

@spring-projects-issues
Copy link
Author

@spring-projects-issues spring-projects-issues commented Oct 8, 2020

John Blum commented

Another thought regarding PARTITION Regions and Paging...

It is not immediately apparent what type (i.e. DataPolicy; e.g. PARTITION, REPLICATE or other) the server-side Region is employing from the client-side without additional help.

For instance, you cannot determine the DataPolicy of the server-side Region from the client (PROXY) Region matching the server-side Region by name. Additionally, Apache Geode does not provide a (client-side) API to determine the DataPolicy of the matching, server-side Region.

Typically, this determination is only achievable through a "custom" Function. As such, it would be unreliable and extra work to determine the Region type before issuing the OQL query from the client. Again, SDG could provide such a Function, but there is simply no guarantee that Apache Geode would be configured and bootstrapped using Spring (e.g. with SDG or SBDG), especially in a "managed" environment (e.g. VMware Tanzu GemFire for VMs or K8S).

Of course, Gfsh itself provides such behavior, which is even typically implemented with a Function, but Gfsh's Functions are not part of the public API and Apache Geode provides no useful, canned Functions OOTB, as suggested

@spring-projects-issues
Copy link
Author

@spring-projects-issues spring-projects-issues commented Oct 9, 2020

John Blum commented

Another thought on Paging and LIMIT...

Actually, the 2 phase approach/implementation could be further optimized now by inspecting the Pagelable argument (i.e. PageRequest.

Certainly, if the user requested the first page of results, then the query for keys does not need to be executed. The framework can simply "limit" the number of results to the first page.

Subsequently, the framework could also forgo the query for keys if 1) the page number is (say) 0-2 (pages 1, 2, or 3), 2) the page size is sensible (e.g. <= 20) and 3) object size is reasonable. The object size would be reasonable in the context of Projections, most likely, which is why Projections is a important consideration in Paging.

Still, an arbitrary page count and page size maybe in sufficient for some Use Cases. Unfortunately, users are notorious for storing extremely large objects.

This could also be configurable and tunable to a degree

@spring-projects-issues
Copy link
Author

@spring-projects-issues spring-projects-issues commented Oct 14, 2020

John Blum commented

NOTE: I will be filing a new JIRA ticket to complete the 2-phase paged OQL query implementation and approach to paging

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants