-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alternate approaches to exposing quantile summaries API #57
Conversation
…lt-in functions all the way to a custom aggregate expression.
0.00001 | ||
) | ||
result.length should be (2) | ||
it("should compute approx percentiles for many tile cols") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: I didn't replicate the "many columns" tests because I have learned after several function implementations is that you don't want to operate over multiple columns in a single Expression or UDF, unless you can help it (i.e Generator
s may need them) because the query plan optimizer can't cull the column if it's not used downstream. We might be able to simulate the approach by providing a convenience method that performs the map
over input columns for the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not to be worried about in this context. the many columns operation makes sense for df.stats.some_stat(col1, col2, col3)
kind of API
import org.locationtech.rasterframes.expressions.accessors.ExtractTile | ||
|
||
|
||
case class ApproxCellQuantilesAggregate(probabilities: Seq[Double], relativeError: Double) extends UserDefinedAggregateFunction { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we go this route, we can add SQL support using the pattern in, say, org.locationtech.rasterframes.expressions.aggregates.LocalCountAggregate
.
result.length should be(3) | ||
println(result) | ||
// computing externally with numpy we arrive at 7963, 10068, 12160 for these quantiles | ||
// This will fail as the histogram algorithm approximates things differently (probably not as well) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not expect this to perform well. If you look at the geotrellis.raster.histogram.StreamingHistogram.percentileBreaks they note that it is for use in generating breaks for the histogram. So it is more rough approximation by nature I would expect. I am also not sure how this would be have at the extremes of the range due to its intended use for histogram breaks.
In any case the method is based on linear interpolation of the histogram, which I suspect may be poor in general unless the number of bins is very large.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 The exercise was good to weed out that approach.
describe("Tile quantiles through custom aggregate") { | ||
it("should compute approx percentiles for a single tile col") { | ||
val result = df | ||
.select(rf_agg_approx_quantiles($"tile", Seq(0.1, 0.5, 0.9))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would be my vote.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the next step you think?
Implemented three alternate approaches for discussion, from using built-in functions, GeoTrellis facilities, all the way to a custom aggregate expression. In the full aggregate expression approach, still using the
QuantileSummaries
class, which is great. The GeoTrellis approach doesn't appear to be very good.Based on locationtech#429