Create typed specialization for freq_agg and topn_agg #389

WireBaron · 2022-04-13T22:48:52Z

The AnyElement implementation for our various spacesaving implementations are very convenient to allow users to build them using any arbitrary type of data, but turn out to be very ugly to then try to extract data from.

This change creates a 64-bit int specialization and a text specialization of our frequency aggregates. These can be used without providing a type argument to get data out, which ends up being much cleaner.

The existing AnyElement implementations have been moved to raw_freq_agg and raw_topn_agg to allow easier implicit casts to the new specializations.

The need to provide distinct types to postgres for each specialization has resulted in many similar looking data structures and functions in this code. I'm very open to hearing suggestions on how this could be done more cleanly from any reviewers.

fixes #376

JLockerman · 2022-04-14T16:42:32Z

extension/src/frequency.rs

+            counts: [u64; self.num_values], // JOSH TODO look at AoS instead of SoA at some point
+            overcounts: [u64; self.num_values],
+            datums: [i64; self.num_values],


I believe something like this will work now

pg_type! { #[derive(Debug)] struct SpaceSavingBigIntAggregate<'input> { num_values: u32, topn: u32, values_seen: u64, freq_param: f64, elements: [Entry; self.num_values], } } flat_serialize! { #[derive(Debug)] struct Entry { count: u64, overcount: u64, datum: i64, } }

Though I can't say if it would be a good idea

JLockerman

Needs some formatting and I have a couple of nits, but over all it looks good.
Not sure I should be the approver anymore, so I'll leave that for someone else 😉

JLockerman · 2022-04-14T16:46:29Z

extension/src/frequency.rs

+    }
+
+    // Important: this variable will hold the lifetime of the varlena until we copy it in freq_agg_trans()
+    let text = value.map(pgx::rust_str_to_text_p);


if we're doing this I'd recommend dust using an Option<crate::raw::text> and using pg_detoast_datum_copy()

Yeah, this looks a lot nicer (except for the map argument).

JLockerman · 2022-04-14T16:47:34Z

extension/src/frequency.rs

+    fcinfo: pg_sys::FunctionCallInfo,
+) -> Option<Internal> {
+    // Important: this variable will hold the lifetime of the varlena until we copy it in freq_agg_trans()
+    let text = value.map(pgx::rust_str_to_text_p);


if we're doing this I'd recommend dust using an Option<crate::raw::text> and using pg_detoast_datum_copy()

JLockerman · 2022-04-14T16:49:16Z

extension/src/frequency.rs

+
+extension_sql!("\n\
+    CREATE AGGREGATE toolkit_experimental.topn_agg(\n\
+        integer, TEXT\n\


the arguments to the aggregates should have names, one because I the update scripts will need them once stable, but more importantly for documentation in \df. Actually, can this use the aggregate builder instead?

Added names to these arguments. Using the aggregate builder would have added a huge amount of redundant code here, since these 9 aggregates are using the same serialize, deserialize, and combine functions (and one of 3 different final functions).

epgts

I offer a few suggestions for reducing duplication in branch https://github.com/timescale/timescaledb-toolkit/compare/br/remove_freq_anyelem...eg/remove_freq_anyelem?expand=1

The same approach can be applied in a couple other places.

epgts · 2022-04-15T01:18:19Z

extension/src/frequency.rs

+        move |(value, (&count, &overcount))| {
+            let total = agg.values_seen as f64;
+            // This takes converts the varlena to a [u8] to a str to a String, the final step doing a copy which is safe to return
+            // TODO: is there a more straightforward way of doing this?


Not really. String::from(str::from_utf8(bytes)) is exactly what you want. Readability would be greatly improved by assigning the result of varlena_to_byte_slice rather than doing it all in one line. And since you do it a few times, I'd move the conversion out to a separate function.

Also, rustfmt would unjam the expect for you, along with cleaning up many other non-standard formattings :)

(also shown in commit d6e293a)

nice, stole that commit :)

epgts · 2022-04-15T03:23:40Z

extension/src/frequency.rs

+    value: i64,
+) -> f64 {
+    for (idx, datum) in agg.datums.iter().enumerate() {
+        if value == datum as i64 {


I've run out of time, but the same approach can be applied to max and min frequency functions: implement the algorithm in generic function parametrized over data type and conversion function.

This seems like just writing a generalized function to return the index of a matching value given some equality parameterization. On a hunch I checked to see if there was already anything like this and found the position method for iterators. Rewrote these functions to using that which seems cleaner, though honestly the implementations here were already so simple that it doesn't really look much better.

epgts · 2022-04-15T03:25:01Z

extension/src/frequency.rs

+            let mut counts = Vec::new();
+            let mut overcounts = Vec::new();
+
+            for entry in &trans.entries {


I demonstrate how to genericize this in commit 31e12c3

I didn't actually go with this suggestion. I didn't like how the new version created yet another copy of all the values in the integer case.

epgts · 2022-04-15T03:25:42Z

extension/src/frequency.rs

+    value: Option<i64>,
+    fcinfo: pg_sys::FunctionCallInfo,
+) -> Option<Internal> {
+    if n <= 0 {


We can avoid repeating these checks by moving them into topn_agg_from_type_id as show in commit c9eb12f

epgts · 2022-04-15T03:28:41Z

extension/src/frequency.rs

+    max_n: i32,
+    min_freq: f64,
+) -> impl std::iter::Iterator<Item = i64> + '_ {
+    let iter = agg


This is pretty complex, what with zipping two things together and having 3 different exit criteria. It would be great not to duplicate that algorithm. I show how to factor it out in commit d6e293a.

I really liked that change!

Yeah I forgot to tell you that I woke Saturday morning realizing the convert parameter was unnecessary so I removed it in commit 052daea

Also I left a comment about what looks like a bug: where we silently stop processing on an error. If we think it's an impossible error we should use .expect. Also I might just not understand it, it was Saturday morning and I only had a minute for a quick fix, like I only have a minute for a quick comment now :)

Yeah, even cleaner with the new change.

WireBaron

Incorporated feedback, last change in new batch includes a rustfmt, might be easier to see diff looking at individual changes.

WireBaron · 2022-04-18T23:07:49Z

extension/src/frequency.rs

+            let mut counts = Vec::new();
+            let mut overcounts = Vec::new();
+
+            for entry in &trans.entries {


I didn't actually go with this suggestion. I didn't like how the new version created yet another copy of all the values in the integer case.

WireBaron · 2022-04-18T23:08:42Z

extension/src/frequency.rs

+    }
+
+    // Important: this variable will hold the lifetime of the varlena until we copy it in freq_agg_trans()
+    let text = value.map(pgx::rust_str_to_text_p);


Yeah, this looks a lot nicer (except for the map argument).

WireBaron · 2022-04-18T23:08:56Z

extension/src/frequency.rs

+    fcinfo: pg_sys::FunctionCallInfo,
+) -> Option<Internal> {
+    // Important: this variable will hold the lifetime of the varlena until we copy it in freq_agg_trans()
+    let text = value.map(pgx::rust_str_to_text_p);


WireBaron · 2022-04-18T23:19:44Z

extension/src/frequency.rs

+    max_n: i32,
+    min_freq: f64,
+) -> impl std::iter::Iterator<Item = i64> + '_ {
+    let iter = agg


I really liked that change!

WireBaron · 2022-04-18T23:46:51Z

extension/src/frequency.rs

+    value: i64,
+) -> f64 {
+    for (idx, datum) in agg.datums.iter().enumerate() {
+        if value == datum as i64 {


This seems like just writing a generalized function to return the index of a matching value given some equality parameterization. On a hunch I checked to see if there was already anything like this and found the position method for iterators. Rewrote these functions to using that which seems cleaner, though honestly the implementations here were already so simple that it doesn't really look much better.

WireBaron · 2022-04-18T23:47:46Z

extension/src/frequency.rs

+        move |(value, (&count, &overcount))| {
+            let total = agg.values_seen as f64;
+            // This takes converts the varlena to a [u8] to a str to a String, the final step doing a copy which is safe to return
+            // TODO: is there a more straightforward way of doing this?


nice, stole that commit :)

WireBaron · 2022-04-18T23:54:54Z

extension/src/frequency.rs

+
+extension_sql!("\n\
+    CREATE AGGREGATE toolkit_experimental.topn_agg(\n\
+        integer, TEXT\n\


Added names to these arguments. Using the aggregate builder would have added a huge amount of redundant code here, since these 9 aggregates are using the same serialize, deserialize, and combine functions (and one of 3 different final functions).

This is a minimum change needed to specialize the frequency and topn aggregate for integer and string types. This change makes them much more user friendly, but there are a lot of inefficiencies and duplicated code after this change. The original AnyElement version of these aggregates can still be used as raw_freq_agg or raw_topn_agg.

WireBaron · 2022-04-19T22:01:16Z

bors r+

bors · 2022-04-19T22:12:50Z

Build succeeded:

WireBaron requested review from rtwalker, JLockerman and epgts April 13, 2022 22:48

JLockerman reviewed Apr 14, 2022

View reviewed changes

epgts reviewed Apr 15, 2022

View reviewed changes

WireBaron commented Apr 19, 2022

View reviewed changes

WireBaron force-pushed the br/remove_freq_anyelem branch from eff5553 to d20478f Compare April 19, 2022 22:01

bors bot merged commit 3c3e9c0 into main Apr 19, 2022

bors bot deleted the br/remove_freq_anyelem branch April 19, 2022 22:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create typed specialization for freq_agg and topn_agg #389

Create typed specialization for freq_agg and topn_agg #389

WireBaron commented Apr 13, 2022

JLockerman Apr 14, 2022

JLockerman left a comment

JLockerman Apr 14, 2022

WireBaron Apr 18, 2022

JLockerman Apr 14, 2022

WireBaron Apr 18, 2022

JLockerman Apr 14, 2022

WireBaron Apr 18, 2022

epgts left a comment

epgts Apr 15, 2022

WireBaron Apr 18, 2022

epgts Apr 15, 2022

WireBaron Apr 18, 2022

epgts Apr 15, 2022

WireBaron Apr 18, 2022

epgts Apr 15, 2022

epgts Apr 15, 2022

WireBaron Apr 18, 2022

epgts Apr 19, 2022

WireBaron Apr 19, 2022

WireBaron left a comment

WireBaron Apr 18, 2022

WireBaron Apr 18, 2022

WireBaron Apr 18, 2022

WireBaron Apr 18, 2022

WireBaron Apr 18, 2022

WireBaron Apr 18, 2022

WireBaron Apr 18, 2022

WireBaron commented Apr 19, 2022

bors bot commented Apr 19, 2022

Create typed specialization for freq_agg and topn_agg #389

Create typed specialization for freq_agg and topn_agg #389

Conversation

WireBaron commented Apr 13, 2022

Choose a reason for hiding this comment

JLockerman left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

epgts left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

WireBaron left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

WireBaron commented Apr 19, 2022

bors bot commented Apr 19, 2022