Mangling of duplicated colnames in data frames resulting from a query breaks Bioconductor packages #259

hpages · 2018-05-07T20:18:39Z

Hi,

We just released BioC 3.7 a few days ago (on May 1st) and several packages broke already because of this change in RSQLite 2.1.1:

Data frames resulting from a query always have unique non-empty column names (r-dbi/DBItest#137).

I see that the issue was discussed here r-dbi/DBItest#137

It's a little bit disappointing that this kind of breaking changes are made without more consideration for the existing packages. If only it was an improvement but name mangling has always proven to be an annoyance more than anything else. Furthermore, the fact that RSQLite uses its own mangling (via internal helper tidy_names) instead of base::make.names() adds another layer of inconsistency and confusion.

Could this at least be made optional e.g. via a new argument to dbGetQuery and dbFetch (and possibly to a few other functions)? Or is there an easy and robust way to bring the original names back if one wishes to (removing the ..n suffix is not robust). Thanks!

H.

The text was updated successfully, but these errors were encountered:

krlmlr · 2018-05-07T20:31:53Z

Thanks. I did check BioConductor packages that import RSQLite directly (though not all of them, some couldn't install or check on my system) and none of them had failing tests. Which packages are failing now?

Happy to discuss a compatibility option or perhaps an attribute that describes the renaming.

hpages · 2018-05-07T20:59:23Z

More than 30 packages are failing. For example: https://master.bioconductor.org/checkResults/3.7/bioc-LATEST/annotate/malbec2-buildsrc.html You won't necessarily catch the problems by checking only packages that import RSQLite directly. You don't have to do anything for this, we've already worked on a fix.

I opened this issue to discuss the possibility to make name mangling optional or at least easy to reverse. Thanks!

krlmlr · 2018-05-07T21:17:39Z

Ideally, queries would be written to return unique column names (via AS ... in the case of ambiguities). I do realize this isn't always practical with joins and the * selector. Could you point me to code where dealing with duplicate names in R is easier than changing the SQL?

I'm happy to check more BioConductor packages. Would you like to share a list of packages that are now failing?

An attribute that describes the renaming seems like a good solution.

hpages · 2018-05-07T23:11:06Z

Bioconductor packages that were failing are:

    a4
    adSplit
    annotate
    attract
    categoryCompare
    clippda
    CompGO
    EGSEA
    eisa
    EnrichmentBrowser
    fastLiquidAssociation
    FGNet
    gage
    GeneAnswers
    GlobalAncova
    globaltest
    goseq
    GOstats
    GSEABase
    gwascat
    hpar
    InterMineR
    MAGeCKFlute
    MIGSA
    MLP
    MmPalateMiRNA
    pcaGoPromoter
    PGSEA
    RDAVIDWebService
    ReportingTools
    rgsepd
    rols
    ScISI
    SLGI

As I said, we already came up with a fix. The fix is actually in AnnotationDbi: Bioconductor/AnnotationDbi@2a4aa6a
It was an easy fix because we already had an internal helper (L2Rchain.colnames) that we can use to predict the column names of the result. So now we use it to fix the column names of the returned data frame.

We have a sophisticated SQL generator in AnnotationDbi that was written a long time ago (> 10 years) that automatically generates complicated SELECT queries that contain JOINs. These queries typically return a "wide" data frame that contains columns from various tables and so duplicated column names are to be expected. The wide data frame is later split in several "narrow" data frames, one for each table in the JOINs, Because an SQL table cannot have duplicated column names, these narrow data frames are guaranteed to not have them either. And it's important for the downstream code that these narrow data frames use the original column names. Using AS in the SQL query to mangle the column names only to unmangle them later would add an unnecessary level of complication.

So yes, there are use cases where one will use AS at the SQL level to avoid duplicated column names, and use cases where one doesn't use AS on purpose to get a result with the original column names even if that means that some of them are duplicated. The fact that you can control the returned column names upfront with AS is an argument against doing name mangling at the RSQLite level IMO.

Name mangling is one of those features that is only useful when one works in interactive mode. From a programming perspective it has consistently and repeatedly been an annoyance so it was really a good thing that RSQlite didn't play the name mangling game so far. Adding an attribute that describes the renaming would kind of work but note that the 1st thing the conscious programmer will do is define a little wrapper to dbGetQuery that uses this attribute to restore the true column names. At least that's what we will do in AnnotationDbi. This means that 2 extra copies of the data frame will be generated for nothing: one by set_tidy_names, and another one to revert what set_tidy_names did. Not really satisfactory to do this on big results.

Finally I would argue that the current mangling scheme is broken if some column names have a ..n suffix (which you cannot assume will never happen):

library(RSQLite) 
conn <- dbConnect(SQLite(), tempfile())
dbWriteTable(conn, "foo", data.frame(`aa..5`=11:13, `bb`=21:23, check.names=FALSE))
dbWriteTable(conn, "bar", data.frame(`bb`=22:24, `aa..6`=42:44, check.names=FALSE))
dbGetQuery(conn, "SELECT * FROM foo JOIN bar ON (foo.bb=bar.bb)")
#   aa..1 bb bb..3 aa..4
# 1    12 22    22    42
# 2    13 23    23    43

Thanks,
H.

krlmlr · 2018-06-06T17:40:41Z

Sorry for the slow reply. I think it's easiest to just revert the change, I'll do it soon.

hadley · 2018-06-06T17:46:16Z

Let's back out this change, and reconsider it. I think we accidentally applied tidyverse renaming ideas (which are more aggressive) to DBI.

krlmlr · 2020-01-04T20:04:22Z

I think we no longer mangle column names here.

hpages · 2020-01-06T16:48:10Z

AFAIK you still do:

library(RSQLite)
conn <- dbConnect(SQLite())
df <- data.frame(aa=1:5, bb=11:15)
dbWriteTable(conn, "toto", df)
dbGetQuery(conn, "select *,aa from toto")
#   aa bb aa..3
# 1  1 11     1
# 2  2 12     2
# 3  3 13     3
# 4  4 14     4
# 5  5 15     5

sessionInfo()

R Under development (unstable) (2019-10-30 r77336)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS:   /home/hpages/R/R-4.0.r77336/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.0.r77336/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RSQLite_2.1.5.9000 devtools_2.2.1     usethis_1.5.1     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3        compiler_4.0.0    pillar_1.4.3      prettyunits_1.0.2
 [5] remotes_2.1.0     tools_4.0.0       testthat_2.3.1    zeallot_0.1.0    
 [9] digest_0.6.23     pkgbuild_1.0.6    pkgload_1.0.2     bit_1.1-14       
[13] memoise_1.1.0     tibble_2.1.3      pkgconfig_2.0.3   rlang_0.4.2      
[17] cli_2.0.0         DBI_1.1.0.9000    curl_4.3          withr_2.1.2      
[21] desc_1.2.0        fs_1.3.1          vctrs_0.2.1       rprojroot_1.3-2  
[25] bit64_0.9-7       glue_1.3.1        R6_2.4.1          processx_3.4.1   
[29] fansi_0.4.0       sessioninfo_1.1.1 callr_3.4.0       blob_1.2.0       
[33] magrittr_1.5      backports_1.1.5   ps_1.3.0          ellipsis_0.3.0   
[37] assertthat_0.2.1  crayon_1.3.4

krlmlr · 2020-01-06T23:20:37Z

Thanks for the nudge. The relevant parts of the spec will need to be removed/renamed, r-dbi/DBItest#181.

krlmlr · 2020-01-07T16:09:26Z

RSQLite 2.2.0 is on CRAN now.

hpages · 2020-01-07T16:36:42Z

Excellent. Thanks a lot!

github-actions · 2021-01-07T00:54:21Z

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

hpages mentioned this issue Jun 5, 2018

rowData does not contain the rownames Bioconductor/SummarizedExperiment#2

Closed

krlmlr mentioned this issue Jun 6, 2018

Consider an abbreviated account of "New names" tidyverse/tibble#406

Closed

krlmlr closed this as completed Jan 4, 2020

krlmlr reopened this Jan 6, 2020

krlmlr added the bug label Jan 6, 2020

hpages mentioned this issue Jan 6, 2020

Column name mangling in RSQLite >= 2.1.1 #308

Closed

krlmlr closed this as completed in 6d0d1be Jan 6, 2020

github-actions bot locked and limited conversation to collaborators Jan 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mangling of duplicated colnames in data frames resulting from a query breaks Bioconductor packages #259

Mangling of duplicated colnames in data frames resulting from a query breaks Bioconductor packages #259

hpages commented May 7, 2018 •

edited

Loading

krlmlr commented May 7, 2018

hpages commented May 7, 2018

krlmlr commented May 7, 2018

hpages commented May 7, 2018 •

edited

Loading

krlmlr commented Jun 6, 2018

hadley commented Jun 6, 2018

krlmlr commented Jan 4, 2020

hpages commented Jan 6, 2020

krlmlr commented Jan 6, 2020

krlmlr commented Jan 7, 2020

hpages commented Jan 7, 2020

github-actions bot commented Jan 7, 2021

Mangling of duplicated colnames in data frames resulting from a query breaks Bioconductor packages #259

Mangling of duplicated colnames in data frames resulting from a query breaks Bioconductor packages #259

Comments

hpages commented May 7, 2018 • edited Loading

krlmlr commented May 7, 2018

hpages commented May 7, 2018

krlmlr commented May 7, 2018

hpages commented May 7, 2018 • edited Loading

krlmlr commented Jun 6, 2018

hadley commented Jun 6, 2018

krlmlr commented Jan 4, 2020

hpages commented Jan 6, 2020

krlmlr commented Jan 6, 2020

krlmlr commented Jan 7, 2020

hpages commented Jan 7, 2020

github-actions bot commented Jan 7, 2021

hpages commented May 7, 2018 •

edited

Loading

hpages commented May 7, 2018 •

edited

Loading