Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

R implementation of sqlParseVariablesImpl #83

Merged
merged 22 commits into from Mar 2, 2016
Merged

R implementation of sqlParseVariablesImpl #83

merged 22 commits into from Mar 2, 2016

Conversation

hannes
Copy link
Contributor

@hannes hannes commented Feb 15, 2016

Fixes #82.

@hannes
Copy link
Contributor Author

hannes commented Feb 23, 2016

@hadley any comments?

in_quote <- q
break
}
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation looks odd here.

@krlmlr
Copy link
Member

krlmlr commented Feb 23, 2016

What's the run time of the new code on a 1000, 10000, 100000-char string, compared to the old code?

@hannes
Copy link
Contributor Author

hannes commented Feb 23, 2016

did a roxygen run, will check indents next

@hannes
Copy link
Contributor Author

hannes commented Feb 23, 2016

whee the checks have passed. quick benchmark next.

@hannes
Copy link
Contributor Author

hannes commented Feb 23, 2016

Tested with the following line on my MacBook:

system.time(sqlParseVariables(ANSI(), paste0(rep("S", 100000), collapse="")))

Runtimes old code:
1000-char: 0.00s
10000-char: 0.00s
100000-char: 0.02

Runtimes new code:
1000-char: 0.03s
10000-char: 0.48s
100000-char: 34.06s (!)

So, as to expected, its way slower. But I would say since this is just a fallback and queries are expected to be shorter than 10K characters its not going to matter much.

@krlmlr
Copy link
Member

krlmlr commented Feb 23, 2016

It might turn out that this function will be called for all parametrized queries, regardless of backend support, to provide a backend-independent syntax for parametrized queries. Anyway, the C++ implementation might as well live in a separate package.

The apparent non-linear behavior of the R implementation bothers me. Any chance we can get rid of it, perhaps through some clever strsplit() preprocessing?

Could you please post/upload the testing code?

Does the presence of quotes/comments in the text affect run time?

@hannes
Copy link
Contributor Author

hannes commented Feb 23, 2016

Presence of quotes/comments only changes state bits, so I would not expect that to make any difference.

@hannes
Copy link
Contributor Author

hannes commented Feb 23, 2016

Strange, when run in Rstudio it only takes 4.23 seconds on the largest string as one would expect form linearity.

@hannes
Copy link
Contributor Author

hannes commented Feb 23, 2016

looks like substr is the main culprit here. Will try to optimize.

@hannes
Copy link
Contributor Author

hannes commented Feb 24, 2016

Got the slowest time (100K char string) down to 1.5 sec. Pushing today...

@krlmlr
Copy link
Member

krlmlr commented Feb 24, 2016

This is very nice indeed! Looking forward to it.

@hannes
Copy link
Contributor Author

hannes commented Feb 24, 2016

Runtimes faster new code:
1000-char: 0.01s
10000-char: 0.14s
100000-char: 1.52s
1000000-char: 14.39s

#' @export
#' @rdname sqlParseVariables
sqlParseVariablesImpl <- function(sql, quotes, comments) {
sql_arr <- strsplit(as.character(sql), "", fixed=T)[[1]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please use TRUE instead of T?

}
} else {
# only check the end of the active quote definition
# TODO: support end quote escaping (e.g. \")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was end quote escaping supported by the Rcpp code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there was support for it, although not tested: https://codecov.io/github/rstats-db/DBI/src/sqldelim.cpp?ref=ef1a72121ed1f34e5bf94a913ea3410d9b289f42#l-61 . I'd like to review the old code and have it fully covered by tests before proceeding here. Also, to me it looks like the construct "SELECT ? FROM A" will be interpreted in a strange way by the C++ implementation.

@hannes
Copy link
Contributor Author

hannes commented Feb 24, 2016

Not sure about the end quote escaping.

@krlmlr
Copy link
Member

krlmlr commented Mar 2, 2016

Run time is still quadratic in the number of variables, but I guess we can live with that:

> s <- paste0(rep("?a", 10000), collapse=""); system.time(sqlParseVariables(ANSI(), s))
   user  system elapsed 
  0.520   0.000   0.521 
> s <- paste0(rep("?a", 20000), collapse=""); system.time(sqlParseVariables(ANSI(), s))
   user  system elapsed 
  1.840   0.000   1.846 

@hannes
Copy link
Contributor Author

hannes commented Mar 2, 2016

Not using Rstudio here.

@hannes
Copy link
Contributor Author

hannes commented Mar 2, 2016

No idea what happened to the alias. This file is generated after all.

@krlmlr
Copy link
Member

krlmlr commented Mar 2, 2016

It's not the alias, just the name (which is good as it is now).

Looks good to me. @hadley: Any further comments?

in_comment <- 0L
i <- 1
while(i <= length(sql_arr)) {
# only check for variables if neither commented nor quoted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indenting looks slightly off here

@hannes
Copy link
Contributor Author

hannes commented Mar 2, 2016

added a missing space.

@hadley
Copy link
Member

hadley commented Mar 2, 2016

My main concern the C++ used a state machine which is easy to reason about, and easy to adapt to changing requirements. I can't easily identify the strategy used in the new code, which means that if requirements change, @hannesmuehleisen will have to make the changes. That said, I don't think this code is likely to change much in the future, so it's not a huge concern.

@hannes
Copy link
Contributor Author

hannes commented Mar 2, 2016

Can rewrite to state machine if it helps

@hadley
Copy link
Member

hadley commented Mar 2, 2016

If you're willing, I'd really appreciate it.

@hannes
Copy link
Contributor Author

hannes commented Mar 2, 2016

Sure, be back in an hour or so.

@hannes
Copy link
Contributor Author

hannes commented Mar 2, 2016

Ok, its now a state machine. Runtime is unchanged.

@hadley
Copy link
Member

hadley commented Mar 2, 2016

LGTM


# prepare comments & quotes for quicker comparisions
for(c in seq_along(comments)) {
comments[[c]][[1]] <- strsplit(comments[[c]][[1]], "", fixed = TRUE)[[1]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The [[n]] access is not very intuitive, but fastest:

a <- list(start = "a", end = "b", endRequired = FALSE); microbenchmark::microbenchmark(a[[3]], a[["start"]], a[["end"]], a$s, a$start, a$e, a$end, times = 100000)
Unit: nanoseconds
         expr min  lq     mean median  uq     max neval cld
       a[[3]] 125 142 201.7776    153 167   14234 1e+05  a 
 a[["start"]] 136 156 215.8544    167 184   21100 1e+05  a 
   a[["end"]] 145 164 225.6624    173 191   62662 1e+05  a 
          a$s 186 206 299.5662    218 297   43130 1e+05  ab
      a$start 151 171 265.3310    183 261   22357 1e+05  ab
          a$e 188 207 371.8202    219 298 6883402 1e+05   b
        a$end 167 188 281.5594    201 278   21501 1e+05  ab

I wonder how much this affects overall performance (we're talking nanoseconds here). Let's look at this in a separate PR.

krlmlr added a commit that referenced this pull request Mar 2, 2016
- Use pure R implementation of `sqlParseVariablesImpl()` (#83, @hannesmuehleisen).
@krlmlr krlmlr merged commit ca4e977 into r-dbi:master Mar 2, 2016
@krlmlr
Copy link
Member

krlmlr commented Mar 2, 2016

Thanks for your efforts.

@krlmlr krlmlr mentioned this pull request Mar 2, 2016
2 tasks
@hannes
Copy link
Contributor Author

hannes commented Mar 2, 2016

Whee, thanks for merging!

krlmlr pushed a commit that referenced this pull request Apr 19, 2016
* New package maintainer: Kirill Müller.

* `dbGetInfo()` gains a default method that extracts the information from
  `dbGetStatement()`, `dbGetRowsAffected()`, `dbHasCompleted()`, and
  `dbGetRowCount()`. This means that most drivers should no longer need to
  implement `dbGetInfo()` (which may be deprecated anyway at some point) (#55).

* `dbDataType()` and `dbQuoteString()` are now properly exported.

* Default `dbGetQuery()` method now always calls `dbFetch()`, in a `tryCatch()`
  block.

* New generic `dbBind()` for binding values to a parameterised query.

* DBI gains a number of SQL generation functions. These make it easier to
  write backends by implementing common operations that are slightly
  tricky to do absolutely correctly.

    * `sqlCreateTable()` and `sqlAppendTable()` create tables from a data
      frame and insert rows into an existing table. These will power most
      implementations of `dbWriteTable()`. `sqlAppendTable()` is useful
      for databases that support parameterised queries.

    * `sqlRownamesToColumn()` and `sqlColumnToRownames()` provide a standard
      way of translating row names to and from the database.

    * `sqlInterpolate()` and `sqlParseVariables()` allows databases without
      native parameterised queries to use parameterised queries to avoid
      SQL injection attacks.

    * `sqlData()` is a new generic that converts a data frame into a data
      frame suitable for sending to the database. This is used to (e.g.)
      ensure all character vectors are encoded as UTF-8, or to convert
      R varible types (like factor) to types supported by the database.

    * The `sqlParseVariablesImpl()` is now implemented purely in R, with full
      test coverage (#83, @hannesmuehleisen).

* `dbiCheckCompliance()` has been removed, the functionality is now available
  in the `DBItest` package (#80).

* Added default `show()` methods for driver, connection and results.

* New concrete `ANSIConnection` class and `ANSI()` function to generate a dummy
  ANSI compliant connection useful for testing.

* Default `dbQuoteString()` and `dbQuoteIdentifer()` methods now use
  `encodeString()` so that special characters like `\n` are correctly escaped.
  `dbQuoteString()` converts `NA` to (unquoted) NULL.

* The initial DBI proposal and DBI version 1 specification are now included as
  a vignette. These are there mostly for historical interest.

* The new `DBItest` package is described in the vignette.

* Removed unused `dbi_dep()` and `print.list.names()`.
krlmlr pushed a commit that referenced this pull request May 2, 2016
* New package maintainer: Kirill Müller.

* `dbGetInfo()` gains a default method that extracts the information from
  `dbGetStatement()`, `dbGetRowsAffected()`, `dbHasCompleted()`, and
  `dbGetRowCount()`. This means that most drivers should no longer need to
  implement `dbGetInfo()` (which may be deprecated anyway at some point) (#55).

* `dbDataType()` and `dbQuoteString()` are now properly exported.

* The default implementation for `dbDataType()` (powered by `dbiDataType()`) now
  also supports `difftime` and `AsIs` objects and lists of `raw` (#70).

* Default `dbGetQuery()` method now always calls `dbFetch()`, in a `tryCatch()`
  block.

* New generic `dbBind()` for binding values to a parameterised query.

* DBI gains a number of SQL generation functions. These make it easier to
  write backends by implementing common operations that are slightly
  tricky to do absolutely correctly.

    * `sqlCreateTable()` and `sqlAppendTable()` create tables from a data
      frame and insert rows into an existing table. These will power most
      implementations of `dbWriteTable()`. `sqlAppendTable()` is useful
      for databases that support parameterised queries.

    * `sqlRownamesToColumn()` and `sqlColumnToRownames()` provide a standard
      way of translating row names to and from the database.

    * `sqlInterpolate()` and `sqlParseVariables()` allows databases without
      native parameterised queries to use parameterised queries to avoid
      SQL injection attacks.

    * `sqlData()` is a new generic that converts a data frame into a data
      frame suitable for sending to the database. This is used to (e.g.)
      ensure all character vectors are encoded as UTF-8, or to convert
      R varible types (like factor) to types supported by the database.

    * The `sqlParseVariablesImpl()` is now implemented purely in R, with full
      test coverage (#83, @hannesmuehleisen).

* `dbiCheckCompliance()` has been removed, the functionality is now available
  in the `DBItest` package (#80).

* Added default `show()` methods for driver, connection and results.

* New concrete `ANSIConnection` class and `ANSI()` function to generate a dummy
  ANSI compliant connection useful for testing.

* Default `dbQuoteString()` and `dbQuoteIdentifer()` methods now use
  `encodeString()` so that special characters like `\n` are correctly escaped.
  `dbQuoteString()` converts `NA` to (unquoted) NULL.

* The initial DBI proposal and DBI version 1 specification are now included as
  a vignette. These are there mostly for historical interest.

* The new `DBItest` package is described in the vignette.

* Deprecated `print.list.pairs()`.

* Removed unused `dbi_dep()`.
@jangorecki
Copy link

@hannesmuehleisen thanks for this PR! just spotted that I can upgrade DBI and was worried as I was sure DBI will depend on Rcpp based on #40. Great surprise to see it is still lightweight.

@github-actions github-actions bot locked and limited conversation to collaborators Oct 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DBI depending on Rcpp?
4 participants