Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DBI sources can't provide sampling implementation #2993

Closed
hannes opened this issue Jul 23, 2017 · 9 comments
Closed

DBI sources can't provide sampling implementation #2993

hannes opened this issue Jul 23, 2017 · 9 comments
Labels
feature a feature request or enhancement verbs 🏃‍♀️

Comments

@hannes
Copy link
Contributor

hannes commented Jul 23, 2017

When using src_sql, it is possible to attach a custom class to tbl objects. But the documentation for src_sql states that it is deprecated and that src_dbi should be used instead. When using src_dbi, no custom class can be set for tbl objects (at least not as far as I can tell).

This is mostly fine, since most generics dispatch on the connection class. Unfortunately, sample_n and sample_frac only dispatch on the tbl class and there is no generic for db-backed tables that could be used to provide a custom implementation.

My suggestion would be to add a DBI generic for sample_* that calls a method defined on the connection if available.

@edgararuiz-zz
Copy link

Hi @hannesmuehleisen , this will probably have to be implemented for each individual database type. Which database are you working with at this time?

@hannes
Copy link
Contributor Author

hannes commented Jul 23, 2017

Well I think this should probably be part of the SQL translation method. Indeed the syntax is different for all the systems that I can see. A generic method would be a first step.

I am the author of the MonetDBLite package, which includes a dplyr backend.

@edgararuiz-zz
Copy link

What's the syntax that MonetDB uses for sampling?

@hadley hadley added database feature a feature request or enhancement labels Aug 23, 2017
@hadley
Copy link
Member

hadley commented Aug 23, 2017

Seems like tbl.src_dbi should mangle the DBI class name in some way and include it in the tbl class.

Alternatively, sample_n.tbl_dbi could re-dispatch on the class of the connection.

@hannes
Copy link
Contributor Author

hannes commented Aug 24, 2017

yes both these solutions seem fine to me! I like how dplyr relies more on a cooperative DBI driver nowadays.

@hadley
Copy link
Member

hadley commented Nov 2, 2017

Actually this is more complicated than I expected. The right way to implement this is to:

  • Add a new sample lazy_ops which both sample_n and sample_frac would use.

  • Implement sql_build.sample which would yield a sample_query() object

  • Implement sql_render.sample which would call a new sql_sample() generic, which dispatches on the connection.

I think this is better than adding a sample parameter to the existing sql_select generator because I suspect that there's currently very little standardisation across databases.

@hannesmuehleisen could you provide a few examples/links to sampling implementations that you've seen?

@hannes
Copy link
Contributor Author

hannes commented Nov 3, 2017

I assume you mean sample functionality in SQL? Here you go:

Oracle: SELECT c FROM t SAMPLE(10)
SQL Server: SELECT c FROM t TABLESAMPLE (10 ROWS)
PostgreSQL: SELECT c FROM t TABLESAMPLE BERNOULLI (0.1)
DB2: seems similar to PostgreSQL
MySQL: not really supported, but can be simulated with something like SELECT c FROM t WHERE RAND() <= .1
SQLite: also no direct support, but can be simulated SELECT c FROM t ORDER BY RANDOM() LIMIT 10
MonetDB: SELECT c FROM t SAMPLE 10 (soon to be extended to weighted samples, too)

So indeed some variety and some system use counts, some percentages and others workarounds. A more crucial difference is whether the sample is taken of the base table or the query result (!). We are in the process in switching from the latter to the former, this gets interesting once filters or aggregations or joins are involved.
Hope this helps

@hadley
Copy link
Member

hadley commented Nov 3, 2017

What about grouped/stratified samples? Any widespread support for that?

@lock
Copy link

lock bot commented Oct 14, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Oct 14, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature a feature request or enhancement verbs 🏃‍♀️
Projects
None yet
Development

No branches or pull requests

3 participants