Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle an edge case when requested region is not in fragment file #1348

Merged
merged 2 commits into from
Mar 2, 2023

Conversation

nrockweiler
Copy link
Contributor

  • Motivation

    • When GetReadsInRegion is called, if there are no fragments that overlap the region, for some reason Rsamtools::scanTabix is reporting all fragments if there is no overlap. The catch is that it returns a named list with a 'region' called NA and lists all fragments in the file. Therefore, when TabixOutputToDataFrame slurps up the overlapping fragments, it incorrectly thinks this named list element is a real overlapping region.
  • Solution

    • Within GetReadsInRegion, don't bother scanning the the tabix and converting the output with TabixOutputToDataFrame if there are no contigs in common. Return a dataframe with no entries.
    • Within TabixOutputToDataFrame, correctly check if there really are no overlaps. If there are no overlaps, return an empty reads dataframe.

- Motivation
  - When GetReadsInRegion is called, if there are no fragments that overlap the region, for some reason Rsamtools::scanTabix is reporting all fragments if there is no overlap.  The catch is that it returns a named list with a 'region' called NA and lists all fragments in the file.  Therefore, when TabixOutputToDataFrame slurps up the overlapping fragments, it incorrectly thinks this named list element is a real overlapping region.

 - Solution
   - Within GetReadsInRegion, don't bother scanning the the tabix and converting the output with TabixOutputToDataFrame if there are no contigs in common.  Return a dataframe with no entries.
   - Within TabixOutputToDataFrame, correctly check if there really are no overlaps.  If there are no overlaps, return an empty reads dataframe.
@nrockweiler
Copy link
Contributor Author

nrockweiler commented Mar 1, 2023

Here's a code snippet to reproduce the error:

I have a fragment_object that only has chr22 fragments. I use GetFragmentData to find all fragments that overlap a region on chr1 (spoiler alert, there shouldn't be any). The current version incorrectly outputs all [chr22] fragments. The patch correctly outputs no fragments.

library("Signac")

load("~/test.rdata") # loads fragment_object

region = "chr1-1-2000000"
tbx.path <- GetFragmentData(object = fragment_object, slot = "path")
cellmap <- GetFragmentData(object = fragment_object, slot = "cells")
tabix.file <- Rsamtools:::TabixFile(file = tbx.path)

Signac:::GetReadsInRegion(cellmap, region, tabix.file, cells = NULL, verbose = TRUE) 

Here's the output w/ Signac_1.9.0 (I just grabbed the first few lines):

Extracting reads in requested region
      chr    start      end               cell count ident length
1   chr22 10514843 10514890 TTGCATTTCCGGTTGA-1     2     1     47
2   chr22 10515876 10516176 TTCAGGTAGGGTCTAT-1     1     1    300
...

Here's the output w/ the commit:

Extracting reads in requested region
[1] chr   start end   cell  count
<0 rows> (or 0-length row.names)

Here's my sessionInfo:

R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.6

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Signac_1.9.0

loaded via a namespace (and not attached):
 [1] tidyr_1.3.0            pkgload_1.3.2          shiny_1.7.4            sp_1.6-0               stats4_4.2.2          
 [6] GenomeInfoDbData_1.2.9 Rsamtools_2.14.0       remotes_2.4.2          sessioninfo_1.2.2      globals_0.16.2        
[11] pillar_1.8.1           lattice_0.20-45        glue_1.6.2             digest_0.6.31          GenomicRanges_1.50.2  
[16] promises_1.2.0.1       XVector_0.38.0         colorspace_2.1-0       htmltools_0.5.4        httpuv_1.6.9          
[21] Matrix_1.5-3           pkgconfig_2.0.3        devtools_2.4.5         listenv_0.9.0          zlibbioc_1.44.0       
[26] purrr_1.0.1            xtable_1.8-4           patchwork_1.1.2        scales_1.2.1           processx_3.8.0        
[31] later_1.3.0            BiocParallel_1.32.5    tibble_3.1.8           generics_0.1.3         IRanges_2.32.0        
[36] ggplot2_3.4.1          usethis_2.1.6          ellipsis_0.3.2         cachem_1.0.6           withr_2.5.0           
[41] pbapply_1.7-0          SeuratObject_4.1.3     BiocGenerics_0.44.0    cli_3.6.0              magrittr_2.0.3        
[46] crayon_1.5.2           mime_0.12              memoise_2.0.1          ps_1.7.2               fs_1.6.1              
[51] fansi_1.0.4            future_1.31.0          parallelly_1.34.0      pkgbuild_1.4.0         progressr_0.13.0      
[56] profvis_0.3.7          tools_4.2.2            data.table_1.14.8      prettyunits_1.1.1      lifecycle_1.0.3       
[61] stringr_1.5.0          S4Vectors_0.36.1       munsell_0.5.0          irlba_2.3.5.1          callr_3.7.3           
[66] Biostrings_2.66.0      compiler_4.2.2         GenomeInfoDb_1.34.9    RcppRoll_0.3.0         rlang_1.0.6           
[71] grid_4.2.2             RCurl_1.98-1.10        rstudioapi_0.14        htmlwidgets_1.6.1      miniUI_0.1.1.1        
[76] bitops_1.0-7           gtable_0.3.1           codetools_0.2-19       R6_2.5.1               dplyr_1.1.0           
[81] fastmap_1.1.0          future.apply_1.10.0    utf8_1.2.3             fastmatch_1.1-3        rprojroot_2.0.3       
[86] desc_1.4.2             stringi_1.7.12         parallel_4.2.2         Rcpp_1.0.10            vctrs_0.5.2           
[91] tidyselect_1.2.0       urlchecker_1.0.1

@timoast timoast changed the base branch from master to develop March 2, 2023 04:56
@timoast
Copy link
Collaborator

timoast commented Mar 2, 2023

Hi @nrockweiler, thanks for the PR! This is an interesting edge case, not something that I've come across before. Thanks for the fix and clear explanation!

@timoast timoast merged commit 6ebe377 into stuart-lab:develop Mar 2, 2023
@nrockweiler
Copy link
Contributor Author

np @timoast.

When I was debugging this, I came across another potential issue. I think there will be an error when the following happens:

  • You call MultiGetReadsInRegion to find the overlapping fragments for 2+ fragment objects
  • At least 1 of the fragment objects has no overlap with the region

This is because MultiGetReadsInRegion will try to rbind the the results from each fragment object. The fragment files w/ overlap will have a length column (see line 1340 of Utilities.R reads$length <- reads$end - reads$start) but the fragment files w/o overlap will not have that extra column. I think rbind is going to complain that the number of columns do not match.

I think the fix would to just initialize empty dataframes with a length column. (This is just my hunch -- I haven't debugged or tested the proposed solution.)

timoast added a commit that referenced this pull request Mar 10, 2023
@timoast
Copy link
Collaborator

timoast commented Mar 10, 2023

Hi @nrockweiler, I agree, I have updated the function to make sure the structure of the dataframe is the same whether reads are present or not

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants