Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create typed specialization for freq_agg and topn_agg #389

Merged
merged 1 commit into from
Apr 19, 2022

Conversation

WireBaron
Copy link
Contributor

The AnyElement implementation for our various spacesaving implementations are very convenient to allow users to build them using any arbitrary type of data, but turn out to be very ugly to then try to extract data from.

This change creates a 64-bit int specialization and a text specialization of our frequency aggregates. These can be used without providing a type argument to get data out, which ends up being much cleaner.

The existing AnyElement implementations have been moved to raw_freq_agg and raw_topn_agg to allow easier implicit casts to the new specializations.

The need to provide distinct types to postgres for each specialization has resulted in many similar looking data structures and functions in this code. I'm very open to hearing suggestions on how this could be done more cleanly from any reviewers.

fixes #376

Comment on lines +410 to +427
counts: [u64; self.num_values], // JOSH TODO look at AoS instead of SoA at some point
overcounts: [u64; self.num_values],
datums: [i64; self.num_values],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe something like this will work now

pg_type! {
        #[derive(Debug)]
        struct SpaceSavingBigIntAggregate<'input> {
            num_values: u32,
            topn: u32,
            values_seen: u64,
            freq_param: f64,
            elements: [Entry; self.num_values],
    }
}


flat_serialize! {
    #[derive(Debug)]
    struct Entry {
        count: u64,
        overcount: u64,
        datum: i64,
    }
}

Though I can't say if it would be a good idea

Copy link
Contributor

@JLockerman JLockerman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs some formatting and I have a couple of nits, but over all it looks good.
Not sure I should be the approver anymore, so I'll leave that for someone else 😉

}

// Important: this variable will hold the lifetime of the varlena until we copy it in freq_agg_trans()
let text = value.map(pgx::rust_str_to_text_p);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're doing this I'd recommend dust using an Option<crate::raw::text> and using pg_detoast_datum_copy()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this looks a lot nicer (except for the map argument).

fcinfo: pg_sys::FunctionCallInfo,
) -> Option<Internal> {
// Important: this variable will hold the lifetime of the varlena until we copy it in freq_agg_trans()
let text = value.map(pgx::rust_str_to_text_p);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're doing this I'd recommend dust using an Option<crate::raw::text> and using pg_detoast_datum_copy()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


extension_sql!("\n\
CREATE AGGREGATE toolkit_experimental.topn_agg(\n\
integer, TEXT\n\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the arguments to the aggregates should have names, one because I the update scripts will need them once stable, but more importantly for documentation in \df. Actually, can this use the aggregate builder instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added names to these arguments. Using the aggregate builder would have added a huge amount of redundant code here, since these 9 aggregates are using the same serialize, deserialize, and combine functions (and one of 3 different final functions).

Copy link
Contributor

@epgts epgts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I offer a few suggestions for reducing duplication in branch https://github.com/timescale/timescaledb-toolkit/compare/br/remove_freq_anyelem...eg/remove_freq_anyelem?expand=1

The same approach can be applied in a couple other places.

move |(value, (&count, &overcount))| {
let total = agg.values_seen as f64;
// This takes converts the varlena to a [u8] to a str to a String, the final step doing a copy which is safe to return
// TODO: is there a more straightforward way of doing this?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really. String::from(str::from_utf8(bytes)) is exactly what you want. Readability would be greatly improved by assigning the result of varlena_to_byte_slice rather than doing it all in one line. And since you do it a few times, I'd move the conversion out to a separate function.

Also, rustfmt would unjam the expect for you, along with cleaning up many other non-standard formattings :)

(also shown in commit d6e293a)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, stole that commit :)

value: i64,
) -> f64 {
for (idx, datum) in agg.datums.iter().enumerate() {
if value == datum as i64 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've run out of time, but the same approach can be applied to max and min frequency functions: implement the algorithm in generic function parametrized over data type and conversion function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like just writing a generalized function to return the index of a matching value given some equality parameterization. On a hunch I checked to see if there was already anything like this and found the position method for iterators. Rewrote these functions to using that which seems cleaner, though honestly the implementations here were already so simple that it doesn't really look much better.

let mut counts = Vec::new();
let mut overcounts = Vec::new();

for entry in &trans.entries {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I demonstrate how to genericize this in commit 31e12c3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't actually go with this suggestion. I didn't like how the new version created yet another copy of all the values in the integer case.

value: Option<i64>,
fcinfo: pg_sys::FunctionCallInfo,
) -> Option<Internal> {
if n <= 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can avoid repeating these checks by moving them into topn_agg_from_type_id as show in commit c9eb12f

max_n: i32,
min_freq: f64,
) -> impl std::iter::Iterator<Item = i64> + '_ {
let iter = agg
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty complex, what with zipping two things together and having 3 different exit criteria. It would be great not to duplicate that algorithm. I show how to factor it out in commit d6e293a.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really liked that change!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I forgot to tell you that I woke Saturday morning realizing the convert parameter was unnecessary so I removed it in commit 052daea

Also I left a comment about what looks like a bug: where we silently stop processing on an error. If we think it's an impossible error we should use .expect. Also I might just not understand it, it was Saturday morning and I only had a minute for a quick fix, like I only have a minute for a quick comment now :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, even cleaner with the new change.

Copy link
Contributor Author

@WireBaron WireBaron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorporated feedback, last change in new batch includes a rustfmt, might be easier to see diff looking at individual changes.

let mut counts = Vec::new();
let mut overcounts = Vec::new();

for entry in &trans.entries {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't actually go with this suggestion. I didn't like how the new version created yet another copy of all the values in the integer case.

}

// Important: this variable will hold the lifetime of the varlena until we copy it in freq_agg_trans()
let text = value.map(pgx::rust_str_to_text_p);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this looks a lot nicer (except for the map argument).

fcinfo: pg_sys::FunctionCallInfo,
) -> Option<Internal> {
// Important: this variable will hold the lifetime of the varlena until we copy it in freq_agg_trans()
let text = value.map(pgx::rust_str_to_text_p);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

max_n: i32,
min_freq: f64,
) -> impl std::iter::Iterator<Item = i64> + '_ {
let iter = agg
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really liked that change!

value: i64,
) -> f64 {
for (idx, datum) in agg.datums.iter().enumerate() {
if value == datum as i64 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like just writing a generalized function to return the index of a matching value given some equality parameterization. On a hunch I checked to see if there was already anything like this and found the position method for iterators. Rewrote these functions to using that which seems cleaner, though honestly the implementations here were already so simple that it doesn't really look much better.

move |(value, (&count, &overcount))| {
let total = agg.values_seen as f64;
// This takes converts the varlena to a [u8] to a str to a String, the final step doing a copy which is safe to return
// TODO: is there a more straightforward way of doing this?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, stole that commit :)


extension_sql!("\n\
CREATE AGGREGATE toolkit_experimental.topn_agg(\n\
integer, TEXT\n\
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added names to these arguments. Using the aggregate builder would have added a huge amount of redundant code here, since these 9 aggregates are using the same serialize, deserialize, and combine functions (and one of 3 different final functions).

This is a minimum change needed to specialize the frequency and topn aggregate
for integer and string types.  This change makes them much more user friendly,
but there are a lot of inefficiencies and duplicated code after this change.

The original AnyElement version of these aggregates can still be used as
raw_freq_agg or raw_topn_agg.
@WireBaron
Copy link
Contributor Author

bors r+

@bors
Copy link
Contributor

bors bot commented Apr 19, 2022

Build succeeded:

@bors bors bot merged commit 3c3e9c0 into main Apr 19, 2022
@bors bors bot deleted the br/remove_freq_anyelem branch April 19, 2022 22:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Specialize freq_agg and topn_agg for bigints and text rather than supporting AnyElement
3 participants