R implementation of sqlParseVariablesImpl #83
@hadley any comments?
in_quote <- q
break
}
} else {
Indentation looks odd here.
What's the run time of the new code on a 1000-, 10000-, and 100000-character string, compared to the old code?
Did a roxygen run; will check indents next.
Whee, the checks have passed. Quick benchmark next.
Tested with the following line on my MacBook:
Runtimes old code:
Runtimes new code:
So, as expected, it's way slower. But I would say that since this is just a fallback and queries are expected to be shorter than 10K characters, it's not going to matter much.
It might turn out that this function will be called for all parametrized queries, regardless of backend support, to provide a backend-independent syntax for parametrized queries. Anyway, the C++ implementation might as well live in a separate package. The apparent non-linear behavior of the R implementation bothers me. Any chance we can get rid of it, perhaps through some clever strsplit() preprocessing?
Does the presence of quotes/comments in the text affect run time?
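To make the strsplit() preprocessing idea concrete, here is a minimal sketch that is not taken from the PR. It compares scanning a string with repeated substr() calls against splitting it once into a character vector and indexing that vector. The helper names (`count_char_substr`, `count_char_split`, `make_sql`) are hypothetical and exist only for this illustration; no timings are claimed here, since they depend on the machine and R version.

```r
# Hypothetical helpers for illustration only; none of these are part of DBI.

# Variant A: extract one character at a time with substr().
count_char_substr <- function(s, ch) {
  n <- 0L
  for (i in seq_len(nchar(s))) {
    if (substr(s, i, i) == ch) n <- n + 1L
  }
  n
}

# Variant B: split the string once, then index the resulting vector.
count_char_split <- function(s, ch) {
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  n <- 0L
  for (i in seq_along(chars)) {
    if (chars[[i]] == ch) n <- n + 1L
  }
  n
}

# Build a test string by repeating a small query fragment.
make_sql <- function(times) {
  paste(rep("SELECT ?a FROM t; ", times), collapse = "")
}

sql <- make_sql(100)

# Both variants must agree before their run times are worth comparing.
stopifnot(count_char_substr(sql, "?") == count_char_split(sql, "?"))

# Compare run times on the local machine, e.g. for longer inputs as well.
system.time(count_char_substr(sql, "?"))
system.time(count_char_split(sql, "?"))
```

Repeating the two `system.time()` calls with `make_sql(1000)` and `make_sql(10000)` gives a rough picture of how each variant scales.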
Presence of quotes/comments only changes state bits, so I would not expect that to make any difference.
Strange, when run in RStudio it only takes 4.23 seconds on the largest string, as one would expect from linearity.
looks like
Got the slowest time (100K char string) down to 1.5 sec. Pushing today...
This is very nice indeed! Looking forward to it.
Runtimes faster new code:
#' @export
#' @rdname sqlParseVariables
sqlParseVariablesImpl <- function(sql, quotes, comments) {
sql_arr <- strsplit(as.character(sql), "", fixed=T)[[1]]
Could you please use TRUE instead of T?
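A short illustration of why `TRUE` is preferred over `T` (a sketch for this thread, not code from the PR): `TRUE` is a reserved word, while `T` is an ordinary variable that happens to be bound to `TRUE` in base R and can be masked by any assignment.

```r
# TRUE is a reserved word; T is only a variable and can be reassigned.
T <- FALSE  # legal: masks base::T in the current environment

# With T masked, fixed = T silently means fixed = FALSE,
# so "." is interpreted as a regular expression instead of a literal dot.
chars_T    <- strsplit("a.b", ".", fixed = T)[[1]]
chars_TRUE <- strsplit("a.b", ".", fixed = TRUE)[[1]]

rm(T)  # remove the mask so that T refers to base::T again

print(chars_TRUE)
print(identical(chars_T, chars_TRUE))
```

Assigning to `TRUE` itself is an error, which is why spelling it out is robust.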
}
} else {
# only check the end of the active quote definition
# TODO: support end quote escaping (e.g. \")
Was end quote escaping supported by the Rcpp code?
I think there was support for it, although not tested: https://codecov.io/github/rstats-db/DBI/src/sqldelim.cpp?ref=ef1a72121ed1f34e5bf94a913ea3410d9b289f42#l-61 . I'd like to review the old code and have it fully covered by tests before proceeding here. Also, to me it looks like the construct "SELECT ? FROM A" will be interpreted in a strange way by the C++ implementation.
Not sure about the end quote escaping.
Run time is still quadratic in the number of variables, but I guess we can live with that:
Not using RStudio here.
No idea what happened to the alias. This file is generated after all.
It's not the alias, just the name (which is good as it is now). Looks good to me. @hadley: Any further comments?
in_comment <- 0L
i <- 1
while(i <= length(sql_arr)) {
# only check for variables if neither commented nor quoted
Indenting looks slightly off here
Added a missing space.
My main concern is that the C++ code used a state machine, which is easy to reason about and easy to adapt to changing requirements. I can't easily identify the strategy used in the new code, which means that if requirements change, @hannesmuehleisen will have to make the changes. That said, I don't think this code is likely to change much in the future, so it's not a huge concern.
Can rewrite it as a state machine if it helps.
If you're willing, I'd really appreciate it.
Sure, be back in an hour or so.
OK, it's now a state machine. Run time is unchanged.
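For readers who have not seen the diff, the following is a minimal sketch of what a state-machine approach to this kind of parsing can look like. It is not the code merged in this PR. It assumes a simplified grammar: variables are `?` followed by letters, digits, or underscores, strings are delimited by single quotes without escaping, and comments run from `--` to the end of the line. The function name `parse_vars_sketch` is hypothetical, and unlike `sqlParseVariablesImpl()` it takes no `quotes` or `comments` arguments.

```r
# Hypothetical, simplified sketch; not the implementation from this PR.
# States: "default", "quote", "comment", "variable".
parse_vars_sketch <- function(sql) {
  chars <- strsplit(as.character(sql), "", fixed = TRUE)[[1]]
  n <- length(chars)
  starts <- integer(0)
  ends <- integer(0)
  state <- "default"
  var_start <- 0L
  i <- 1L
  while (i <= n) {
    ch <- chars[[i]]
    if (state == "default") {
      if (ch == "'") {
        state <- "quote"
      } else if (ch == "-" && i < n && chars[[i + 1L]] == "-") {
        state <- "comment"
        i <- i + 1L  # skip the second dash
      } else if (ch == "?") {
        state <- "variable"
        var_start <- i
      }
    } else if (state == "quote") {
      # Assumption: no escaped quotes inside strings.
      if (ch == "'") state <- "default"
    } else if (state == "comment") {
      if (ch == "\n") state <- "default"
    } else if (state == "variable") {
      if (!grepl("[A-Za-z0-9_]", ch)) {
        # The variable ended at the previous character.
        # A bare "?" with no name is ignored in this sketch.
        if (i - 1L > var_start) {
          # Appending with c() is simple but copies the vector each time.
          starts <- c(starts, var_start)
          ends <- c(ends, i - 1L)
        }
        state <- "default"
        next  # re-process the current character in the default state
      }
    }
    i <- i + 1L
  }
  # A variable may run up to the end of the input.
  if (state == "variable" && n > var_start) {
    starts <- c(starts, var_start)
    ends <- c(ends, n)
  }
  list(start = starts, end = ends)
}

sql <- "SELECT ?a, '?b' FROM t -- ?c\nWHERE x = ?dd"
res <- parse_vars_sketch(sql)
print(substring(sql, res$start, res$end))
```

Each branch handles exactly one state, so supporting a new delimiter or an escape rule means adding or editing one branch rather than untangling nested conditions.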
LGTM
# prepare comments & quotes for quicker comparisons
for(c in seq_along(comments)) {
comments[[c]][[1]] <- strsplit(comments[[c]][[1]], "", fixed = TRUE)[[1]]
The [[n]] access is not very intuitive, but fastest:
a <- list(start = "a", end = "b", endRequired = FALSE); microbenchmark::microbenchmark(a[[3]], a[["start"]], a[["end"]], a$s, a$start, a$e, a$end, times = 100000)
Unit: nanoseconds
         expr min  lq     mean median  uq     max neval cld
       a[[3]] 125 142 201.7776    153 167   14234 1e+05  a
 a[["start"]] 136 156 215.8544    167 184   21100 1e+05  a
   a[["end"]] 145 164 225.6624    173 191   62662 1e+05  a
          a$s 186 206 299.5662    218 297   43130 1e+05  ab
      a$start 151 171 265.3310    183 261   22357 1e+05  ab
          a$e 188 207 371.8202    219 298 6883402 1e+05   b
        a$end 167 188 281.5594    201 278   21501 1e+05  ab
I wonder how much this affects overall performance (we're talking nanoseconds here). Let's look at this in a separate PR.
- Use pure R implementation of `sqlParseVariablesImpl()` (#83, @hannesmuehleisen).
Thanks for your efforts.
Whee, thanks for merging!
* New package maintainer: Kirill Müller.
* `dbGetInfo()` gains a default method that extracts the information from `dbGetStatement()`, `dbGetRowsAffected()`, `dbHasCompleted()`, and `dbGetRowCount()`. This means that most drivers should no longer need to implement `dbGetInfo()` (which may be deprecated anyway at some point) (#55).
* `dbDataType()` and `dbQuoteString()` are now properly exported.
* Default `dbGetQuery()` method now always calls `dbFetch()`, in a `tryCatch()` block.
* New generic `dbBind()` for binding values to a parameterised query.
* DBI gains a number of SQL generation functions. These make it easier to write backends by implementing common operations that are slightly tricky to do absolutely correctly.
    * `sqlCreateTable()` and `sqlAppendTable()` create tables from a data frame and insert rows into an existing table. These will power most implementations of `dbWriteTable()`. `sqlAppendTable()` is useful for databases that support parameterised queries.
    * `sqlRownamesToColumn()` and `sqlColumnToRownames()` provide a standard way of translating row names to and from the database.
    * `sqlInterpolate()` and `sqlParseVariables()` allow databases without native parameterised queries to use parameterised queries to avoid SQL injection attacks.
    * `sqlData()` is a new generic that converts a data frame into a data frame suitable for sending to the database. This is used to (e.g.) ensure all character vectors are encoded as UTF-8, or to convert R variable types (like factor) to types supported by the database.
    * `sqlParseVariablesImpl()` is now implemented purely in R, with full test coverage (#83, @hannesmuehleisen).
* `dbiCheckCompliance()` has been removed; the functionality is now available in the `DBItest` package (#80).
* Added default `show()` methods for driver, connection and results.
* New concrete `ANSIConnection` class and `ANSI()` function to generate a dummy ANSI compliant connection useful for testing.
* Default `dbQuoteString()` and `dbQuoteIdentifier()` methods now use `encodeString()` so that special characters like `\n` are correctly escaped. `dbQuoteString()` converts `NA` to (unquoted) NULL.
* The initial DBI proposal and DBI version 1 specification are now included as a vignette. These are there mostly for historical interest.
* The new `DBItest` package is described in the vignette.
* Removed unused `dbi_dep()` and `print.list.names()`.
* New package maintainer: Kirill Müller.
* `dbGetInfo()` gains a default method that extracts the information from `dbGetStatement()`, `dbGetRowsAffected()`, `dbHasCompleted()`, and `dbGetRowCount()`. This means that most drivers should no longer need to implement `dbGetInfo()` (which may be deprecated anyway at some point) (#55).
* `dbDataType()` and `dbQuoteString()` are now properly exported.
* The default implementation for `dbDataType()` (powered by `dbiDataType()`) now also supports `difftime` and `AsIs` objects and lists of `raw` (#70).
* Default `dbGetQuery()` method now always calls `dbFetch()`, in a `tryCatch()` block.
* New generic `dbBind()` for binding values to a parameterised query.
* DBI gains a number of SQL generation functions. These make it easier to write backends by implementing common operations that are slightly tricky to do absolutely correctly.
    * `sqlCreateTable()` and `sqlAppendTable()` create tables from a data frame and insert rows into an existing table. These will power most implementations of `dbWriteTable()`. `sqlAppendTable()` is useful for databases that support parameterised queries.
    * `sqlRownamesToColumn()` and `sqlColumnToRownames()` provide a standard way of translating row names to and from the database.
    * `sqlInterpolate()` and `sqlParseVariables()` allow databases without native parameterised queries to use parameterised queries to avoid SQL injection attacks.
    * `sqlData()` is a new generic that converts a data frame into a data frame suitable for sending to the database. This is used to (e.g.) ensure all character vectors are encoded as UTF-8, or to convert R variable types (like factor) to types supported by the database.
    * `sqlParseVariablesImpl()` is now implemented purely in R, with full test coverage (#83, @hannesmuehleisen).
* `dbiCheckCompliance()` has been removed; the functionality is now available in the `DBItest` package (#80).
* Added default `show()` methods for driver, connection and results.
* New concrete `ANSIConnection` class and `ANSI()` function to generate a dummy ANSI compliant connection useful for testing.
* Default `dbQuoteString()` and `dbQuoteIdentifier()` methods now use `encodeString()` so that special characters like `\n` are correctly escaped. `dbQuoteString()` converts `NA` to (unquoted) NULL.
* The initial DBI proposal and DBI version 1 specification are now included as a vignette. These are there mostly for historical interest.
* The new `DBItest` package is described in the vignette.
* Deprecated `print.list.pairs()`.
* Removed unused `dbi_dep()`.
@hannesmuehleisen thanks for this PR! Just spotted that I can upgrade DBI and was worried, as I was sure DBI would depend on Rcpp based on #40. Great surprise to see it is still lightweight.
Fixes #82.