Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Best (simple) way to get distinct values from a column ? #113

Closed
ericemc3 opened this issue Mar 15, 2021 · 4 comments
Closed

Best (simple) way to get distinct values from a column ? #113

ericemc3 opened this issue Mar 15, 2021 · 4 comments
Labels
question Further information is requested

Comments

@ericemc3
Copy link

I have a table t1 with a risk nominal column
and i want all distinct values in an array this is quite a common need i am having.

I really appreciate the new columnArray() Table method.

I am actually hesitating between different paths, having difficulties to select and remind which would be the best:

[...new Set(t1.columnArray('risk'))]          // short but not ideal for JS debutants...
t1.groupby('risk').count().columnArray('risk')  // tradeoff, but i don't really need the count
t1.rollup({risk: op.array_agg_distinct('risk')}).get('risk', 0)  // complex... could we avoid get here or use a get(0,0) ?

Maybe i am missing here something simplier.

Wish i could use something very simple such as t1.distinct('risk')

@jheer
Copy link
Member

jheer commented Mar 15, 2021

Like so? (v4+):

dt.rollup({ values: op.array_agg_distinct(‘risk’) })
  .get(‘values’, 0)

That is the most efficient route... Also note that your examples are not equivalent. In particular, Set will treat different Date objects with the same date-time as distinct values, Arquero will treat them as the same value and de-duplicate it.

@jheer jheer added the question Further information is requested label Mar 15, 2021
@ericemc3
Copy link
Author

ok, thank you for the caveat regarding Set and Date objects.

I think i will go on with
dt.groupby('risk').count().columnArray('risk')

which seems to me a bit easier to recall and teach

@jheer
Copy link
Member

jheer commented Mar 15, 2021

dt.groupby('risk').count().columnArray('risk') works but is much less efficient: it first builds a map to get the unique groupby values, generates a new internal column containing all groupby index keys for the input table, and then calculates the counts as well.

The array_agg_distinct aggregate function is much more efficient in terms of both time and memory so I would recommend it. If you find the syntax is too verbose, you could consider adding a new table method for convenience:

aq.addTableMethod('distinctArray', (table, columnName) => {
  return table.rollup({ values: op.array_agg_distinct(columnName) }).get('values', 0);
});

// ...
dt.distinctArray('risk') // now returns array of distinct values

@jheer jheer closed this as completed Mar 15, 2021
@ericemc3
Copy link
Author

Thank you for the hint, i should really use addTableMethod more often!

Actually, as an happy Arquero user, it doesn't bother me too much to use a more verbose syntax or new methods like the one you suggest.

But as a teacher, i really want non JS specialists (and R users) to be able to use Arquero in the most simple and intuitive way, this is why i am a bit of a pain on these topics :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants