Contributing Guidelines

LiNk-NY edited this page Mar 23, 2021 · 13 revisions

March 23, 2021

Overview

Thank you for your interest!

The SingleCellMultiModal package aims to provide single-cell datasets from several different technologies / modalities for benchmarking and analysis. We currently provide data from scNMT, scM&T, seqFISH, CITEseq, and other technologies. Contributions are very much welcome.

List of Multi-modal Datasets

For a full list of available datasets, see here: Google Drive Sheet

Contributing

To contribute, we generally require data in Rda or Rds format, though we also support HDF5 and MTX formats. In addition to the usual metadata.csv documentation required in the package, contributors must add a name to the DataType column of the metadata table to identify the contributed dataset. To illustrate, here are some DataType names already in the package:

  • mouse_gastrulation
  • mouse_visual_cortex
  • cord_blood
  • pbmc_10x
  • macrophage_differentiation
  • mouse_embryo_8_cell

library(SingleCellMultiModal)
meta <- system.file("extdata", "metadata.csv",
    package = "SingleCellMultiModal", mustWork = TRUE)
head(read.csv(meta))
#>             DataType           ResourceName
#> 1 mouse_gastrulation      scnmt_acc_cgi.rda
#> 2 mouse_gastrulation     scnmt_acc_CTCF.rda
#> 3 mouse_gastrulation      scnmt_acc_DHS.rda
#> 4 mouse_gastrulation scnmt_acc_genebody.rda
#> 5 mouse_gastrulation     scnmt_acc_p300.rda
#> 6 mouse_gastrulation scnmt_acc_promoter.rda
#>                                                        RDataPath DispatchClass
#> 1      SingleCellMultiModal/mouse_gastrulation/scnmt_acc_cgi.rda           Rda
#> 2     SingleCellMultiModal/mouse_gastrulation/scnmt_acc_CTCF.rda           Rda
#> 3      SingleCellMultiModal/mouse_gastrulation/scnmt_acc_DHS.rda           Rda
#> 4 SingleCellMultiModal/mouse_gastrulation/scnmt_acc_genebody.rda           Rda
#> 5     SingleCellMultiModal/mouse_gastrulation/scnmt_acc_p300.rda           Rda
#> 6 SingleCellMultiModal/mouse_gastrulation/scnmt_acc_promoter.rda           Rda
#>   RDataClass                                  Maintainer
#> 1     matrix Marcel Ramos <marcel.ramos@roswellpark.org>
#> 2     matrix Marcel Ramos <marcel.ramos@roswellpark.org>
#> 3     matrix Marcel Ramos <marcel.ramos@roswellpark.org>
#> 4     matrix Marcel Ramos <marcel.ramos@roswellpark.org>
#> 5     matrix Marcel Ramos <marcel.ramos@roswellpark.org>
#> 6     matrix Marcel Ramos <marcel.ramos@roswellpark.org>
#>                                                      DataProvider
#> 1 Dept. of Bioinformatics, The Babraham Institute, United Kingdom
#> 2 Dept. of Bioinformatics, The Babraham Institute, United Kingdom
#> 3 Dept. of Bioinformatics, The Babraham Institute, United Kingdom
#> 4 Dept. of Bioinformatics, The Babraham Institute, United Kingdom
#> 5 Dept. of Bioinformatics, The Babraham Institute, United Kingdom
#> 6 Dept. of Bioinformatics, The Babraham Institute, United Kingdom
#>   Coordinate_1_based TaxonomyId      Species SourceVersion
#> 1                 NA      10090 Mus musculus         1.0.0
#> 2                 NA      10090 Mus musculus         1.0.0
#> 3                 NA      10090 Mus musculus         1.0.0
#> 4                 NA      10090 Mus musculus         1.0.0
#> 5                 NA      10090 Mus musculus         1.0.0
#> 6                 NA      10090 Mus musculus         1.0.0
#>                                                SourceUrl SourceType Genome
#> 1 https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ        RDS     NA
#> 2 https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ        RDS     NA
#> 3 https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ        RDS     NA
#> 4 https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ        RDS     NA
#> 5 https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ        RDS     NA
#> 6 https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ        RDS     NA
#>   BiocVersion
#> 1        3.12
#> 2        3.12
#> 3        3.12
#> 4        3.12
#> 5        3.12
#> 6        3.12
#>                                                          Description
#> 1      scnmt_acc_cgi data specific to the MOUSE_GASTRULATION project
#> 2     scnmt_acc_CTCF data specific to the MOUSE_GASTRULATION project
#> 3      scnmt_acc_DHS data specific to the MOUSE_GASTRULATION project
#> 4 scnmt_acc_genebody data specific to the MOUSE_GASTRULATION project
#> 5     scnmt_acc_p300 data specific to the MOUSE_GASTRULATION project
#> 6 scnmt_acc_promoter data specific to the MOUSE_GASTRULATION project
#>                Title
#> 1      scnmt_acc_cgi
#> 2     scnmt_acc_CTCF
#> 3      scnmt_acc_DHS
#> 4 scnmt_acc_genebody
#> 5     scnmt_acc_p300
#> 6 scnmt_acc_promoter

Versioning and folder structure

We associate a version with all datasets. We start with version 1.0.0 using semantic versioning and include data in a corresponding version folder (v1.0.0). Thus, the recommended folder structure is as follows:

~/data
  └ scmm/
    └ mouse_gastrulation/
      └ v1.0.0/
        └ scnmt_acc_cgi.rda
        └ scnmt_met_genebody.rda
        └ scnmt_met_cgi.rda
        └ scnmt_rna.rda
        └ scnmt_colData.rda
        └ scnmt_sampleMap.rda
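
The layout above can be set up from R. The following is a minimal sketch, assuming a temporary directory stands in for ~/data and a toy matrix stands in for a real assay; the matrix values and dimension names are placeholders, not real data.

```r
# Build the recommended folder layout under a temporary root
# (swap `root` for "~/data" when working with real data).
root <- file.path(tempdir(), "data")
data_type <- "mouse_gastrulation"
version <- "1.0.0"

# Version folders are named "v" followed by the semantic version
version_dir <- file.path(root, "scmm", data_type, paste0("v", version))
dir.create(version_dir, recursive = TRUE, showWarnings = FALSE)

# Toy stand-in for an assay matrix (placeholder values only)
scnmt_rna <- matrix(
    seq_len(6), nrow = 2,
    dimnames = list(c("gene1", "gene2"), c("cell1", "cell2", "cell3"))
)

# One Rda file per assay
save(scnmt_rna, file = file.path(version_dir, "scnmt_rna.rda"))

# Inspect the resulting tree
list.files(root, recursive = TRUE)
```

A later release of the same data would go into a sibling folder (for example v1.0.1), so earlier versions stay available.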

In the inst section, we will discuss how to annotate these data products.

Files

It is customary to include one Rda / Rds file per assay, or per combination of assay and region of interest (as above). We also highly recommend including sampleMap and colData datasets for the MultiAssayExperiment that will be built on the fly. In this example, the scNMT dataset has three modalities: rna (transcriptome), acc (chromatin accessibility), and met (methylation).
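
As a rough illustration of what the colData and sampleMap datasets contain, here is a sketch with toy cell identifiers. It relies on the MultiAssayExperiment convention that a sampleMap has the columns assay, primary, and colname; the cell names and the stage annotation are made up, and base data.frames are used only to keep the sketch self-contained.

```r
# Toy cell identifiers; real data would use the experiment's own IDs
cells <- c("cell1", "cell2", "cell3")

# colData: one row per cell (the "primary" unit) with cell-level annotations
scnmt_colData <- data.frame(
    stage = c("E6.5", "E6.5", "E7.5"),   # placeholder annotation
    row.names = cells,
    stringsAsFactors = FALSE
)

# sampleMap: links each assay column ("colname") to a colData row ("primary")
assays <- c("rna", "acc_cgi", "met_cgi")
scnmt_sampleMap <- data.frame(
    assay = factor(rep(assays, each = length(cells)), levels = assays),
    primary = rep(cells, times = length(assays)),
    colname = rep(cells, times = length(assays)),
    stringsAsFactors = FALSE
)

# Every primary in the map should be a row in colData
stopifnot(all(scnmt_sampleMap$primary %in% rownames(scnmt_colData)))
```

Here each cell appears under the same name in all three assays, so primary and colname coincide; when assay column names differ from the cell identifiers, the colname column carries the assay-specific names.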

vignettes

Contributors are required to demonstrate user-level functionality via examples in a vignette for each contributed dataset.

R

Ideally, the interface for the contributed dataset should be similar to that of scNMT so that users experience consistent usage across the package. This means having one main function that returns a MultiAssayExperiment object, with options that show the user which datasets are available for a particular technology. Contributors should document datasets with roxygen2 and use the @inheritParams scNMT tag to avoid copying @param documentation.

See the current example for implementation details:

scNMT(
    DataType = "mouse_gastrulation",
    mode = "*",
    version = "1.0.0",
    dry.run = TRUE
)
#> snapshotDate(): 2021-01-20
#>     ah_id         mode file_size rdataclass rdatadateadded rdatadateremoved
#> 1  EH3738      acc_cgi      7 Mb     matrix     2020-09-03             <NA>
#> 2  EH3739     acc_CTCF    1.2 Mb     matrix     2020-09-03             <NA>
#> 3  EH3740      acc_DHS    0.3 Mb     matrix     2020-09-03             <NA>
#> 4  EH3741 acc_genebody   49.6 Mb     matrix     2020-09-03             <NA>
#> 5  EH3742     acc_p300    0.2 Mb     matrix     2020-09-03             <NA>
#> 6  EH3743 acc_promoter   27.2 Mb     matrix     2020-09-03             <NA>
#> 7  EH3745      met_cgi    4.6 Mb     matrix     2020-09-03             <NA>
#> 8  EH3746     met_CTCF    0.1 Mb     matrix     2020-09-03             <NA>
#> 9  EH3747      met_DHS    0.1 Mb     matrix     2020-09-03             <NA>
#> 10 EH3748 met_genebody   26.8 Mb     matrix     2020-09-03             <NA>
#> 11 EH3749     met_p300    0.1 Mb     matrix     2020-09-03             <NA>
#> 12 EH3750 met_promoter   11.5 Mb     matrix     2020-09-03             <NA>
#> 13 EH3751          rna   18.6 Mb     matrix     2020-09-03             <NA>

Note. Contributors should ensure that the documentation is complete and the proper data sources have been attributed.
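
A skeleton of such a function might look like the sketch below. The function name myTech, its default DataType, and its catalogue of modes are all hypothetical, and treating mode as a glob pattern is an assumption of this sketch rather than a statement about how scNMT is implemented. The body only mimics the dry.run listing and does not query ExperimentHub.

```r
#' Hypothetical accessor for a newly contributed dataset
#'
#' `myTech` is a placeholder name; the body only mimics the `dry.run`
#' listing and does not query ExperimentHub.
#'
#' @inheritParams scNMT
#'
#' @return With `dry.run = TRUE`, a `data.frame` of available modes;
#'   a real implementation would otherwise return a `MultiAssayExperiment`
#'
#' @export
myTech <- function(
    DataType = "my_data_type", mode = "*", version = "1.0.0", dry.run = TRUE
) {
    # Placeholder catalogue of modes; a real function would look these up
    available <- data.frame(
        mode = c("rna", "protein"),
        version = "1.0.0",
        stringsAsFactors = FALSE
    )
    # Translate the glob in `mode` ("*" matches everything) into a regex
    keep <- grepl(utils::glob2rx(mode), available$mode) &
        available$version == version
    if (dry.run)
        return(available[keep, , drop = FALSE])
    stop("building the MultiAssayExperiment is left out of this sketch")
}

# List only the modes whose names start with "r"
myTech(mode = "r*")
```

Because the arguments reuse the names from scNMT, the single @inheritParams tag is enough to document all four of them.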

inst/*

extdata/

In the following section we will describe how to annotate and append to the metadata.csv file. First, we have to ensure that we are accounting for all of the fields required by ExperimentHub. They are listed here:

  • ResourceName
  • Title
  • Description
  • BiocVersion
  • Genome
  • SourceType
  • SourceUrl
  • SourceVersion
  • Species
  • TaxonomyId
  • Coordinate_1_based
  • DataProvider
  • Maintainer
  • RDataPath
  • RDataClass
  • DispatchClass
  • DataType (see the note below)

Note. DataType is a field we’ve added to help distinguish multimodal technologies and is required for SingleCellMultiModal. Some of the DataTypes already available are mouse_gastrulation, mouse_visual_cortex, cord_blood, peripheral_blood, etc.
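
Before appending to metadata.csv, it can help to confirm that no required column is missing. The sketch below uses a hypothetical helper, missing_fields, and a deliberately incomplete toy table; neither is part of the package.

```r
# Fields listed above, including the package-specific DataType column
required_fields <- c(
    "ResourceName", "Title", "Description", "BiocVersion", "Genome",
    "SourceType", "SourceUrl", "SourceVersion", "Species", "TaxonomyId",
    "Coordinate_1_based", "DataProvider", "Maintainer", "RDataPath",
    "RDataClass", "DispatchClass", "DataType"
)

# Hypothetical helper: report which required columns a table lacks
missing_fields <- function(metadata) {
    setdiff(required_fields, names(metadata))
}

# A deliberately incomplete table to show the check in action
partial <- data.frame(
    ResourceName = "scnmt_rna.rda",
    Title = "scnmt_rna",
    DataType = "mouse_gastrulation",
    stringsAsFactors = FALSE
)
missing_fields(partial)
```

An empty result (character(0)) means every required column is present.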

To make contributing easier, we've provided a documentation mechanism based on a small data.frame saved to a file, which we call a doc_file.

Interested contributors should create a doc_file in the inst/extdata/docuData folder. Although we do not have a strict naming convention for the doc_file, we usually name the file singlecellmultimodalvX.csv, where X indicates the Xth dataset added to the package.

Here is an example of the file from version v1.0.0 of the scNMT dataset:

doc_file <- system.file("extdata", "docuData", "singlecellmultimodalv1.csv",
    package = "SingleCellMultiModal", mustWork = TRUE)
read.csv(doc_file, header = TRUE)
#>                                                      DataProvider TaxonomyId
#> 1 Dept. of Bioinformatics, The Babraham Institute, United Kingdom      10090
#>        Species                                              SourceUrl
#> 1 Mus musculus https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ
#>   SourceType SourceVersion           DataType
#> 1        RDS         1.0.0 mouse_gastrulation
#>                                    Maintainer
#> 1 Marcel Ramos <marcel.ramos@roswellpark.org>

Contributors will then use their doc_file to append to the existing metadata.csv.

To create a doc_file data.frame with the file name singlecellmultimodalvX.csv, first we create a data.frame object. Each general annotation or row in this data.frame will be applied to all files uploaded to ExperimentHub. We take advantage of the data.frame function to repeat data and create a uniform data.frame with equal values across the columns.

scmeta <- data.frame(
    DataProvider =
        "Dept. of Bioinformatics, The Babraham Institute, United Kingdom",
    TaxonomyId = "10090",
    Species = "Mus musculus",
    SourceUrl = "https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ",
    SourceType = "RDS",
    SourceVersion = "1.0.0",
    DataType = "mouse_gastrulation",
    Maintainer = "Ricard Argelaguet <ricard@ebi.ac.uk>",
    stringsAsFactors = FALSE
)
scmeta
#>                                                      DataProvider TaxonomyId
#> 1 Dept. of Bioinformatics, The Babraham Institute, United Kingdom      10090
#>        Species                                              SourceUrl
#> 1 Mus musculus https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ
#>   SourceType SourceVersion           DataType
#> 1        RDS         1.0.0 mouse_gastrulation
#>                             Maintainer
#> 1 Ricard Argelaguet <ricard@ebi.ac.uk>

Saving the data

After creating the documentation data.frame (doc_file), the contributor can save that dataset as a .csv file using write.csv.

write.csv(
    scmeta,
    file = "inst/extdata/docuData/singlecellmultimodalv1.csv",
    row.names = FALSE
)
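
To check that the saved file reads back as intended, a round trip through a temporary file can be sketched as follows. The temporary path and the choice to read every column as character are assumptions made for illustration; row.names = FALSE keeps write.csv from adding an extra, unnamed column.

```r
# Same columns as the doc_file shown above, written to a temporary path
scmeta <- data.frame(
    DataProvider =
        "Dept. of Bioinformatics, The Babraham Institute, United Kingdom",
    TaxonomyId = "10090",
    Species = "Mus musculus",
    SourceUrl = "https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ",
    SourceType = "RDS",
    SourceVersion = "1.0.0",
    DataType = "mouse_gastrulation",
    Maintainer = "Ricard Argelaguet <ricard@ebi.ac.uk>",
    stringsAsFactors = FALSE
)
tmp_doc <- tempfile(fileext = ".csv")
write.csv(scmeta, file = tmp_doc, row.names = FALSE)

# Read everything back as character so "10090" and "1.0.0" keep their type
roundtrip <- read.csv(tmp_doc, colClasses = "character")
stopifnot(
    identical(names(roundtrip), names(scmeta)),
    nrow(roundtrip) == 1L
)
```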

Documenting diverse data

If the contributed data is not uniform, meaning that there are multiple file types from potentially different specimens, the data.frame will have to account for all contributed data files.

For example, if the contributed data has a number of different source types, the contributor is required to create a data.frame with the number of rows equal to the number of files to be uploaded.

In this example, we have two data files from different source types and formats:

data.frame(
    DataProvider =
        c("Institute of Population Genetics", "Mouse Science Center"),
    TaxonomyId = c("9606", "10090"),
    Species = c("Homo sapiens", "Mus musculus"),
    SourceUrl = c("https://human.science/org", "https://mouse.science/gov"),
    SourceType = c("RDS", "XML"),
    DataType = c("human_genetics", "mouse_genetics"),
    stringsAsFactors = FALSE
)
#>                       DataProvider TaxonomyId      Species
#> 1 Institute of Population Genetics       9606 Homo sapiens
#> 2             Mouse Science Center      10090 Mus musculus
#>                   SourceUrl SourceType       DataType
#> 1 https://human.science/org        RDS human_genetics
#> 2 https://mouse.science/gov        XML mouse_genetics
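
The one-row-per-file requirement can be checked directly. In this sketch the file names are hypothetical stand-ins for the files to be uploaded, and the SourceVersion column is added only to show that a single value is still recycled across rows.

```r
# Hypothetical file names standing in for the files to be uploaded
data_files <- c("human_assay.rds", "mouse_assay.xml")

doc <- data.frame(
    DataProvider =
        c("Institute of Population Genetics", "Mouse Science Center"),
    TaxonomyId = c("9606", "10090"),
    Species = c("Homo sapiens", "Mus musculus"),
    SourceType = c("RDS", "XML"),
    SourceVersion = "1.0.0",   # a single value is recycled to every row
    stringsAsFactors = FALSE
)

# One documentation row per uploaded file
stopifnot(nrow(doc) == length(data_files))
```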

scripts/

make-data/

The individual data products that will eventually come together into a MultiAssayExperiment can be uploaded as serialized RDA / RDS files, HDF5, and even MTX files. For examples on how to save data into their respective file formats, see the make-data folder.

Generating the metadata.csv

make-metadata.R

Based on the folder structure described previously, the directory argument in make_metadata corresponds to the ~/data/scmm folder. The dataDirs argument corresponds to the DataType / technology subfolder (e.g., "mouse_gastrulation"). These are used as inputs to the make_metadata function.

Once the data is ready, the user can call the function defined in make-metadata.R in the scripts folder. A typical call to make_metadata will either add to the metadata or replace it entirely. The easiest approach for current contributors is to append rows to the metadata file.

make_metadata(
    directory = "~/data/scmm",
    dataDirs = "mouse_gastrulation", # also the name of the DataType
    ext_pattern = "\\.[Rr][Dd][Aa]$",
    doc_file = "inst/extdata/docuData/singlecellmultimodalv1.csv",
    pkg_name = "SingleCellMultiModal",
    append = TRUE,
    dry.run = TRUE
)

Note that the extraction pattern (ext_pattern) will allow contributors to match a specific file extension in that folder and ignore any intermediate files.
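
To see what the pattern from the call above matches, it can be tried on a vector of file names with base R. The file names below are hypothetical folder contents chosen to include files that should be ignored.

```r
ext_pattern <- "\\.[Rr][Dd][Aa]$"

# Hypothetical folder contents, including intermediate files to be ignored
files <- c(
    "scnmt_rna.rda", "scnmt_acc_cgi.RDA", "scnmt_rna.rds",
    "notes.txt", "scnmt_rna.rda.bak"
)

# Keep only names that end in .rda, in any letter case;
# this keeps "scnmt_rna.rda" and "scnmt_acc_cgi.RDA"
matched <- files[grepl(ext_pattern, files)]
matched

# The same pattern can be passed to list.files() to scan a real folder:
# list.files("~/data/scmm/mouse_gastrulation", pattern = ext_pattern,
#     recursive = TRUE)
```

The trailing $ anchors the match to the end of the name, which is why scnmt_rna.rda.bak is excluded.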

The contributor may also wish to run the function with dry.run = TRUE to preview the data.frame that would be added to the metadata.csv file.

Note. The make_metadata function should be run from the base package directory from a GitHub / git checkout (git clone ...).

Validation

It is recommended to run the metadata validation function from AnnotationHubData:

AnnotationHubData::makeAnnotationHubMetadata("SingleCellMultiModal")

to ensure that some of the metadata fields are properly annotated.

NEWS.md

Contributors should update the NEWS.md file with a mention of the function and data that are being provided. See the NEWS.md for examples.

Next steps

The contributor should then create a Pull Request on GitHub.

If you are interested in contributing, I can help you go over the contribution and submission. Please contact me either on the Bioc-community Slack (mramos148) or at marcel {dot} ramos [at] sph (dot) cuny (dot) edu. If you need to sign up to the community Slack channel, follow this link: https://bioc-community.herokuapp.com/

sessionInfo

sessionInfo()
#> R Under development (unstable) (2020-12-12 r79621)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.10
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] SingleCellMultiModal_1.3.9  MultiAssayExperiment_1.17.6
#>  [3] SummarizedExperiment_1.21.1 DelayedArray_0.17.7        
#>  [5] MatrixGenerics_1.3.0        matrixStats_0.57.0         
#>  [7] Matrix_1.3-2                Biobase_2.51.0             
#>  [9] GenomicRanges_1.43.3        GenomeInfoDb_1.27.3        
#> [11] IRanges_2.25.6              S4Vectors_0.29.6           
#> [13] BiocGenerics_0.37.0        
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.6                    lattice_0.20-41              
#>  [3] SingleCellExperiment_1.13.3   assertthat_0.2.1             
#>  [5] digest_0.6.27                 mime_0.9                     
#>  [7] BiocFileCache_1.15.1          R6_2.5.0                     
#>  [9] RSQLite_2.2.2                 evaluate_0.14                
#> [11] httr_1.4.2                    pillar_1.4.7                 
#> [13] zlibbioc_1.37.0               rlang_0.4.10                 
#> [15] curl_4.3                      blob_1.2.1                   
#> [17] rmarkdown_2.6                 AnnotationHub_2.23.0         
#> [19] stringr_1.4.0                 RCurl_1.98-1.2               
#> [21] bit_4.0.4                     shiny_1.5.0                  
#> [23] compiler_4.1.0                httpuv_1.5.5                 
#> [25] xfun_0.20                     pkgconfig_2.0.3              
#> [27] htmltools_0.5.1               tidyselect_1.1.0             
#> [29] tibble_3.0.5                  GenomeInfoDbData_1.2.4       
#> [31] interactiveDisplayBase_1.29.0 codetools_0.2-18             
#> [33] withr_2.4.0                   crayon_1.3.4                 
#> [35] dplyr_1.0.3                   dbplyr_2.0.0                 
#> [37] later_1.1.0.1                 bitops_1.0-6                 
#> [39] rappdirs_0.3.1                grid_4.1.0                   
#> [41] xtable_1.8-4                  lifecycle_0.2.0              
#> [43] DBI_1.1.1                     magrittr_2.0.1               
#> [45] stringi_1.5.3                 XVector_0.31.1               
#> [47] promises_1.1.1                SpatialExperiment_1.1.0      
#> [49] ellipsis_0.3.1                filelock_1.0.2               
#> [51] generics_0.1.0                vctrs_0.3.6                  
#> [53] tools_4.1.0                   bit64_4.0.5                  
#> [55] glue_1.4.2                    purrr_0.3.4                  
#> [57] BiocVersion_3.13.1            fastmap_1.0.1                
#> [59] yaml_2.2.1                    AnnotationDbi_1.53.0         
#> [61] ExperimentHub_1.17.0          BiocManager_1.30.10          
#> [63] memoise_1.1.0                 knitr_1.30