Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mangling of duplicated colnames in data frames resulting from a query breaks Bioconductor packages #259

Closed
hpages opened this issue May 7, 2018 · 12 comments
Labels

Comments

@hpages
Copy link

hpages commented May 7, 2018

Hi,

We just released BioC 3.7 a few days ago (on May 1st) and several packages broke already because of this change in RSQLite 2.1.1:

Data frames resulting from a query always have unique non-empty column names (r-dbi/DBItest#137).

I see that the issue was discussed here r-dbi/DBItest#137

It's a little bit disappointing that this kind of breaking changes are made without more consideration for the existing packages. If only it was an improvement but name mangling has always proven to be an annoyance more than anything else. Furthermore, the fact that RSQLite uses its own mangling (via internal helper tidy_names) instead of base::make.names() adds another layer of inconsistency and confusion.

Could this at least be made optional e.g. via a new argument to dbGetQuery and dbFetch (and possibly to a few other functions)? Or is there an easy and robust way to bring the original names back if one wishes to (removing the ..n suffix is not robust). Thanks!

H.

@krlmlr
Copy link
Member

krlmlr commented May 7, 2018

Thanks. I did check BioConductor packages that import RSQLite directly (though not all of them, some couldn't install or check on my system) and none of them had failing tests. Which packages are failing now?

Happy to discuss a compatibility option or perhaps an attribute that describes the renaming.

@hpages
Copy link
Author

hpages commented May 7, 2018

More than 30 packages are failing. For example: https://master.bioconductor.org/checkResults/3.7/bioc-LATEST/annotate/malbec2-buildsrc.html You won't necessarily catch the problems by checking only packages that import RSQLite directly. You don't have to do anything for this, we've already worked on a fix.

I opened this issue to discuss the possibility to make name mangling optional or at least easy to reverse. Thanks!

@krlmlr
Copy link
Member

krlmlr commented May 7, 2018

Ideally, queries would be written to return unique column names (via AS ... in the case of ambiguities). I do realize this isn't always practical with joins and the * selector. Could you point me to code where dealing with duplicate names in R is easier than changing the SQL?

I'm happy to check more BioConductor packages. Would you like to share a list of packages that are now failing?

An attribute that describes the renaming seems like a good solution.

@hpages
Copy link
Author

hpages commented May 7, 2018

Bioconductor packages that were failing are:

    a4
    adSplit
    annotate
    attract
    categoryCompare
    clippda
    CompGO
    EGSEA
    eisa
    EnrichmentBrowser
    fastLiquidAssociation
    FGNet
    gage
    GeneAnswers
    GlobalAncova
    globaltest
    goseq
    GOstats
    GSEABase
    gwascat
    hpar
    InterMineR
    MAGeCKFlute
    MIGSA
    MLP
    MmPalateMiRNA
    pcaGoPromoter
    PGSEA
    RDAVIDWebService
    ReportingTools
    rgsepd
    rols
    ScISI
    SLGI

As I said, we already came up with a fix. The fix is actually in AnnotationDbi: Bioconductor/AnnotationDbi@2a4aa6a
It was an easy fix because we already had an internal helper (L2Rchain.colnames) that we can use to predict the column names of the result. So now we use it to fix the column names of the returned data frame.

We have a sophisticated SQL generator in AnnotationDbi that was written a long time ago (> 10 years) that automatically generates complicated SELECT queries that contain JOINs. These queries typically return a "wide" data frame that contains columns from various tables and so duplicated column names are to be expected. The wide data frame is later split in several "narrow" data frames, one for each table in the JOINs, Because an SQL table cannot have duplicated column names, these narrow data frames are guaranteed to not have them either. And it's important for the downstream code that these narrow data frames use the original column names. Using AS in the SQL query to mangle the column names only to unmangle them later would add an unnecessary level of complication.

So yes, there are use cases where one will use AS at the SQL level to avoid duplicated column names, and use cases where one doesn't use AS on purpose to get a result with the original column names even if that means that some of them are duplicated. The fact that you can control the returned column names upfront with AS is an argument against doing name mangling at the RSQLite level IMO.

Name mangling is one of those features that is only useful when one works in interactive mode. From a programming perspective it has consistently and repeatedly been an annoyance so it was really a good thing that RSQlite didn't play the name mangling game so far. Adding an attribute that describes the renaming would kind of work but note that the 1st thing the conscious programmer will do is define a little wrapper to dbGetQuery that uses this attribute to restore the true column names. At least that's what we will do in AnnotationDbi. This means that 2 extra copies of the data frame will be generated for nothing: one by set_tidy_names, and another one to revert what set_tidy_names did. Not really satisfactory to do this on big results.

Finally I would argue that the current mangling scheme is broken if some column names have a ..n suffix (which you cannot assume will never happen):

library(RSQLite) 
conn <- dbConnect(SQLite(), tempfile())
dbWriteTable(conn, "foo", data.frame(`aa..5`=11:13, `bb`=21:23, check.names=FALSE))
dbWriteTable(conn, "bar", data.frame(`bb`=22:24, `aa..6`=42:44, check.names=FALSE))
dbGetQuery(conn, "SELECT * FROM foo JOIN bar ON (foo.bb=bar.bb)")
#   aa..1 bb bb..3 aa..4
# 1    12 22    22    42
# 2    13 23    23    43

Thanks,
H.

@krlmlr
Copy link
Member

krlmlr commented Jun 6, 2018

Sorry for the slow reply. I think it's easiest to just revert the change, I'll do it soon.

@hadley
Copy link
Member

hadley commented Jun 6, 2018

Let's back out this change, and reconsider it. I think we accidentally applied tidyverse renaming ideas (which are more aggressive) to DBI.

@krlmlr
Copy link
Member

krlmlr commented Jan 4, 2020

I think we no longer mangle column names here.

@krlmlr krlmlr closed this as completed Jan 4, 2020
@hpages
Copy link
Author

hpages commented Jan 6, 2020

AFAIK you still do:

library(RSQLite)
conn <- dbConnect(SQLite())
df <- data.frame(aa=1:5, bb=11:15)
dbWriteTable(conn, "toto", df)
dbGetQuery(conn, "select *,aa from toto")
#   aa bb aa..3
# 1  1 11     1
# 2  2 12     2
# 3  3 13     3
# 4  4 14     4
# 5  5 15     5

sessionInfo()

R Under development (unstable) (2019-10-30 r77336)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS:   /home/hpages/R/R-4.0.r77336/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.0.r77336/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RSQLite_2.1.5.9000 devtools_2.2.1     usethis_1.5.1     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3        compiler_4.0.0    pillar_1.4.3      prettyunits_1.0.2
 [5] remotes_2.1.0     tools_4.0.0       testthat_2.3.1    zeallot_0.1.0    
 [9] digest_0.6.23     pkgbuild_1.0.6    pkgload_1.0.2     bit_1.1-14       
[13] memoise_1.1.0     tibble_2.1.3      pkgconfig_2.0.3   rlang_0.4.2      
[17] cli_2.0.0         DBI_1.1.0.9000    curl_4.3          withr_2.1.2      
[21] desc_1.2.0        fs_1.3.1          vctrs_0.2.1       rprojroot_1.3-2  
[25] bit64_0.9-7       glue_1.3.1        R6_2.4.1          processx_3.4.1   
[29] fansi_0.4.0       sessioninfo_1.1.1 callr_3.4.0       blob_1.2.0       
[33] magrittr_1.5      backports_1.1.5   ps_1.3.0          ellipsis_0.3.0   
[37] assertthat_0.2.1  crayon_1.3.4     

@krlmlr
Copy link
Member

krlmlr commented Jan 6, 2020

Thanks for the nudge. The relevant parts of the spec will need to be removed/renamed, r-dbi/DBItest#181.

@krlmlr
Copy link
Member

krlmlr commented Jan 7, 2020

RSQLite 2.2.0 is on CRAN now.

@hpages
Copy link
Author

hpages commented Jan 7, 2020

Excellent. Thanks a lot!

@github-actions
Copy link

github-actions bot commented Jan 7, 2021

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

@github-actions github-actions bot locked and limited conversation to collaborators Jan 7, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

3 participants