Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternate approaches to exposing quantile summaries API #57

Merged
merged 3 commits into from
Dec 17, 2019

Conversation

metasim
Copy link

@metasim metasim commented Nov 25, 2019

Implemented three alternate approaches for discussion, from using built-in functions, GeoTrellis facilities, all the way to a custom aggregate expression. In the full aggregate expression approach, still using the QuantileSummaries class, which is great. The GeoTrellis approach doesn't appear to be very good.

Based on locationtech#429

…lt-in functions all the way to a custom aggregate expression.
@metasim metasim requested a review from vpipkt November 25, 2019 15:23
0.00001
)
result.length should be (2)
it("should compute approx percentiles for many tile cols") {
Copy link
Author

@metasim metasim Nov 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I didn't replicate the "many columns" tests because I have learned after several function implementations is that you don't want to operate over multiple columns in a single Expression or UDF, unless you can help it (i.e Generators may need them) because the query plan optimizer can't cull the column if it's not used downstream. We might be able to simulate the approach by providing a convenience method that performs the map over input columns for the user.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not to be worried about in this context. the many columns operation makes sense for df.stats.some_stat(col1, col2, col3) kind of API

import org.locationtech.rasterframes.expressions.accessors.ExtractTile


case class ApproxCellQuantilesAggregate(probabilities: Seq[Double], relativeError: Double) extends UserDefinedAggregateFunction {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go this route, we can add SQL support using the pattern in, say, org.locationtech.rasterframes.expressions.aggregates.LocalCountAggregate.

result.length should be(3)
println(result)
// computing externally with numpy we arrive at 7963, 10068, 12160 for these quantiles
// This will fail as the histogram algorithm approximates things differently (probably not as well)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not expect this to perform well. If you look at the geotrellis.raster.histogram.StreamingHistogram.percentileBreaks they note that it is for use in generating breaks for the histogram. So it is more rough approximation by nature I would expect. I am also not sure how this would be have at the extremes of the range due to its intended use for histogram breaks.

In any case the method is based on linear interpolation of the histogram, which I suspect may be poor in general unless the number of bins is very large.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 The exercise was good to weed out that approach.

describe("Tile quantiles through custom aggregate") {
it("should compute approx percentiles for a single tile col") {
val result = df
.select(rf_agg_approx_quantiles($"tile", Seq(0.1, 0.5, 0.9)))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would be my vote.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the next step you think?

@vpipkt vpipkt merged commit df11ac1 into feature/tile-quantile Dec 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants