Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.
Sign up| --- | |
| title: "Sequence Retrieval" | |
| date: "`r Sys.Date()`" | |
| output: rmarkdown::html_vignette | |
| vignette: > | |
| %\VignetteIndexEntry{Sequence Retrieval} | |
| %\VignetteEngine{knitr::rmarkdown} | |
| %\usepackage[utf8]{inputenc} | |
| --- | |
| ```{r, echo = FALSE, message = FALSE} | |
| options(width = 750) | |
| knitr::opts_chunk$set( | |
| comment = "#>", | |
| error = FALSE, | |
| tidy = FALSE) | |
| ``` | |
| ## Biological Sequence Retrieval | |
| The `biomartr` package allows users to retrieve biological sequences in a very simple and intuitive way. | |
| Using `biomartr`, users can retrieve either genomes, proteomes, CDS, RNA, GFF, and genome assembly statistics data using the specialized functions: | |
| * [1. Genome retrieval with `getGenome()`](#genome-retrieval) | |
| - [A. list total number of available genomes in each database](#listing-the-total-number-of-available-genomes) | |
| - [B. dealing with kingdoms, groups, and subgroups ](#retrieving-kingdom-group-and-subgroup-information) | |
| - [1.1 from NCBI RefSeq](#example-ncbi-refseq) | |
| - [1.2 from NCBI Genbank](#example-ncbi-genbank) | |
| - [1.3 from ENSEMBL](#example-ensembl) | |
| * [1.B. Multiple genome retrieval with `getGenomeSet()`](#genomeset-retrieval) | |
| * [2. Proteome retrieval with `getProteome()`](#proteome-retrieval) | |
| - [2.1 from NCBI RefSeq](#example-ncbi-refseq-1) | |
| - [2.2 from NCBI Genbank](#example-ncbi-genbank-1) | |
| - [2.3 from ENSEMBL](#example-ensembl-1) | |
| - [2.5 from UniProt](#example-retrieval-uniprot) | |
| * [2.B. Multiple proteome retrieval with `getProteomeSet()`](#proteomeset-retrieval) | |
| * [3. Coding sequence retrieval with `getCDS()`](#cds-retrieval) | |
| - [3.1 from NCBI RefSeq](#example-ncbi-refseq-2) | |
| - [3.2 from NCBI Genbank](#example-ncbi-genbank-2) | |
| - [3.3 from ENSEMBL](#example-ensembl-2) | |
| * [3.B. Multiple coding sequence retrieval with `getCDSSet()`](#cdsset-retrieval) | |
| * [4. RNA retrieval with `getRNA()`](#rna-retrieval) | |
| - [4.1 from NCBI RefSeq](#example-ncbi-refseq-3) | |
| - [4.2 from NCBI Genbank](#example-ncbi-genbank-3) | |
| - [4.3 from ENSEMBL](#example-ensembl-3) | |
| * [4.B. Multiple RNA retrieval with `getRNASet()`](#rnaset-retrieval) | |
| * [5. GFF retrieval with `getGFF()`](#retrieve-the-annotation-file-of-a-particular-genome) | |
| - [5.1 from NCBI RefSeq](#example-ncbi-refseq-4) | |
| - [5.2 from NCBI Genbank](#example-ncbi-genbank-4) | |
| - [5.3 from ENSEMBL](#example-ensembl-4) | |
| * [5.B. Multiple GFF retrieval with `getGFFSet()`](#gffset-retrieval) | |
| * [6. GTF retrieval with `getGTF()`](#retrieve-the-annotation-file-of-a-particular-genome) | |
| - [6.1 from NCBI RefSeq](#example-ncbi-refseq-5) | |
| - [6.2 from NCBI Genbank](#example-ncbi-genbank-5) | |
| - [6.3 from ENSEMBL](#example-ensembl-5) | |
| * [7. Repeat Masker Annotation file retrieval with `getRepeatMasker()`](#repeat-masker-retrieval) | |
| - [7.1 from NCBI RefSeq](#example-ncbi-refseq-6) | |
| * [8. Genome Assembly Stats Retrieval with `getAssemblyStats()`](#genome-assembly-stats-retrieval) | |
| - [8.1 from NCBI RefSeq](#example-ncbi-refseq-7) | |
| - [8.2 from NCBI Genbank](#example-ncbi-genbank-6) | |
| * [9. Collection (Genome, Proteome, CDS, RNA, GFF, Repeat Masker, AssemblyStats) retrieval with `getCollection()`](#collection-retrieval) | |
| - [9.1 from NCBI RefSeq](#example-ncbi-refseq-8) | |
| - [9.2 from NCBI Genbank](#example-ncbi-genbank-7) | |
| - [9.3 from NCBI ENSEMBL](#example-ensembl-6) | |
| ## Getting Started with Sequence Retrieval | |
| First users can check whether or not the genome, proteome, CDS, RNA, GFF, GTF, or genome assembly statistics of their interest is available for download. | |
| Using the scientific name of the organism of interest, users can check whether the | |
| corresponding genome is available via the `is.genome.available()` function. | |
| Please note that the first time you run this command it might take a while, because | |
| during the initial execution of this function all necessary information is retrieved from NCBI | |
| and then stored locally. All further runs are then much faster. | |
| ### Example `NCBI RefSeq` (?is.genome.available): | |
| ```r | |
| # checking whether or not the Homo sapiens | |
| # genome is avaialable for download | |
| is.genome.available(db = "refseq", organism = "Homo sapiens") | |
| ``` | |
| ``` | |
| A reference or representative genome assembly is available for 'Homo sapiens'. | |
| [1] TRUE | |
| ``` | |
| In the case of the human genome, there are more than one entry in the NCBI RefSeq | |
| database (see message). By specifying the argument 'details = TRUE' users can | |
| retrieve all information for 'Homo sapiens' that is stored in NCBI RefSeq. | |
| ```r | |
| # checking whether or not the Homo sapiens | |
| # genome is avaialable for download | |
| is.genome.available(db = "refseq", organism = "Homo sapiens", details = TRUE) | |
| ``` | |
| ``` | |
| assembly_accessi… bioproject biosample wgs_master refseq_category taxid | |
| <chr> <chr> <chr> <chr> <chr> <int> | |
| 1 GCF_000001405.38 PRJNA168 NA NA reference geno… 9606 | |
| # ... with 15 more variables: species_taxid <int>, organism_name <chr>, | |
| # infraspecific_name <chr>, isolate <chr>, version_status <chr>, | |
| # assembly_level <chr>, release_type <chr>, genome_rep <chr>, | |
| # seq_rel_date <date>, asm_name <chr>, submitter <chr>, | |
| # gbrs_paired_asm <chr>, paired_asm_comp <chr>, ftp_path <chr>, | |
| # excluded_from_refseq <chr> | |
| ``` | |
| Here, we find one possible versions of the human genome having the `assembly_accession` | |
| ID `GCF_000001405.38`. | |
| When retrieving a genome with e.g. `getGenome()` the `organism` argument can | |
| also be specified using the `assembly_accession` ID instead of the scientific name of the | |
| organism. This is true for all `get*()` functions. Hence, instead of writing | |
| `getGenome(db = "refseq", organism = "Homo sapiens")`, users can specify | |
| `getGenome(db = "refseq", organism = "GCF_000001405.38")`. | |
| This is particularly useful when more than one entry is available for one organism. | |
| __Please note that the `assembly_accession` id might change internally in external databases such as NCBI RefSeq. Thus when | |
| writing automated scripts using the `assembly_accession` id they might stop working due to the | |
| internal change of ids at RefSeq. Recently, I had the case where the `assembly_accession` id was changed | |
| at RefSeq from `GCF_000001405.37` to `GCF_000001405.38`. Thus, scripts based on screening for entries with `is.genome.available()` | |
| stopped working, because the id `GCF_000001405.37` couldn't be found anymore.__ | |
| In some cases users will find several entries for the same scientific name. | |
| This might be due to the fact that ecotypes, strains, or other types of sub-species | |
| are available in the respective databases. | |
| For example. when we search for the bacterium `Mycobacterium tuberculosis` in NCBI RefSeq we get 5377 hits. | |
| ```r | |
| is.genome.available(organism = "Mycobacterium tuberculosis", db = "refseq", details = TRUE) | |
| ``` | |
| ``` | |
| A tibble: 5,377 x 21 | |
| assembly_accessi… bioproject biosample wgs_master refseq_category taxid | |
| <chr> <chr> <chr> <chr> <chr> <int> | |
| 1 GCF_000008585.1 PRJNA2241… SAMN0260… NA na 83331 | |
| 2 GCF_000016145.1 PRJNA2241… SAMN0260… NA na 419947 | |
| 3 GCF_000016925.1 PRJNA2241… SAMN0010… NA na 336982 | |
| 4 GCF_000023625.1 PRJNA2241… SAMN0076… NA na 478434 | |
| 5 GCF_000152105.1 PRJNA2241… SAMN0259… ADHQ00000… na 675512 | |
| 6 GCF_000153685.2 PRJNA2241… SAMN0308… NA na 395095 | |
| 7 GCF_000154585.2 PRJNA2241… SAMN0308… NA na 478433 | |
| 8 GCF_000154605.2 PRJNA2241… SAMN0010… NA na 478435 | |
| 9 GCF_000155795.1 PRJNA2241… SAMN0259… ABVM00000… na 555461 | |
| 10 GCF_000159755.1 PRJNA2241… SAMN0259… ACHP00000… na 611303 | |
| # ... with 5,367 more rows, and 15 more variables: species_taxid <int>, | |
| # organism_name <chr>, infraspecific_name <chr>, isolate <chr>, | |
| # version_status <chr>, assembly_level <chr>, release_type <chr>, | |
| # genome_rep <chr>, seq_rel_date <date>, asm_name <chr>, submitter <chr>, | |
| # gbrs_paired_asm <chr>, paired_asm_comp <chr>, ftp_path <chr>, | |
| # excluded_from_refseq <chr> | |
| ``` | |
| When looking at the names of these organisms we see that they consist of different strains of `Mycobacterium tuberculosis`. | |
| ```r | |
| head(is.genome.available(organism = "Mycobacterium tuberculosis", | |
| db = "refseq", | |
| details = TRUE)$organism_name) | |
| ``` | |
| ``` | |
| [1] "Mycobacterium tuberculosis CDC1551" | |
| [2] "Mycobacterium tuberculosis H37Ra" | |
| [3] "Mycobacterium tuberculosis F11" | |
| [4] "Mycobacterium tuberculosis KZN 1435" | |
| [5] "Mycobacterium tuberculosis SUMu001" | |
| [6] "Mycobacterium tuberculosis str. Haarlem" | |
| ``` | |
| Now we can use the `assembly_accession` id to retrieve the `Mycobacterium tuberculosis` | |
| strain we are interested in, e.g. `Mycobacterium tuberculosis CDC1551`. | |
| ```r | |
| MtbCDC1551 <- getGenome(db = "refseq", | |
| organism = "GCF_000008585.1", | |
| path = file.path("_ncbi_downloads","genomes"), | |
| reference = FALSE) | |
| MtbCDC1551_genome <- read_genome(MtbCDC1551) | |
| MtbCDC1551_genome | |
| ``` | |
| ``` | |
| A DNAStringSet instance of length 1 | |
| width seq names | |
| [1] 4403837 TTGACCGATGACCC...AGGGAGATACGTCG NC_002755.2 Mycob... | |
| ``` | |
| #### Using the NCBI Taxonomy ID instead of the scientific name to screen for organism availability | |
| Instead of specifying the `scientific name` of the organism of interest users can specify the [NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy) identifier (= `taxid`) | |
| of the corresponding organism. For example, the `taxid` of `Homo sapiens` is `9606`. | |
| Now users can specify `organism = "9606"` to retrieve entries for `Homo sapiens`: | |
| ```r | |
| # checking availability for Homo sapiens using its taxid | |
| is.genome.available(db = "refseq", organism = "9606", details = TRUE) | |
| ``` | |
| ``` | |
| assembly_accession bioproject biosample wgs_master refseq_category taxid | |
| <chr> <chr> <chr> <chr> <chr> <int> | |
| 1 GCF_000001405.38 PRJNA168 NA NA reference genome 9606 | |
| # ... with 15 more variables: species_taxid <int>, organism_name <chr>, | |
| # infraspecific_name <chr>, isolate <chr>, version_status <chr>, | |
| # assembly_level <chr>, release_type <chr>, genome_rep <chr>, | |
| # seq_rel_date <date>, asm_name <chr>, submitter <chr>, | |
| # gbrs_paired_asm <chr>, paired_asm_comp <chr>, ftp_path <chr>, | |
| # excluded_from_refseq <chr> | |
| ``` | |
| #### Using the accession ID instead of the scientific name or taxid to screen for organism availability | |
| Finally, instead of specifying either the `scientific name` of the organism of interest nor the `taxid`, | |
| users can specify the accession ID of the organism of interest. | |
| In the following example we use the accession ID of `Homo sapiens` (= `GCF_000001405.38`): | |
| ```r | |
| # checking availability for Homo sapiens using its taxid | |
| is.genome.available(db = "refseq", organism = "GCF_000001405.38", details = TRUE) | |
| ``` | |
| ``` | |
| assembly_accession bioproject biosample wgs_master refseq_category taxid | |
| <chr> <chr> <chr> <chr> <chr> <int> | |
| 1 GCF_000001405.38 PRJNA168 NA NA reference genome 9606 | |
| # ... with 15 more variables: species_taxid <int>, organism_name <chr>, | |
| # infraspecific_name <chr>, isolate <chr>, version_status <chr>, | |
| # assembly_level <chr>, release_type <chr>, genome_rep <chr>, | |
| # seq_rel_date <date>, asm_name <chr>, submitter <chr>, | |
| # gbrs_paired_asm <chr>, paired_asm_comp <chr>, ftp_path <chr>, | |
| # excluded_from_refseq <chr> | |
| ``` | |
| #### A small negative example | |
| In some cases your organism of interest will not be available in `NCBI RefSeq`. | |
| Here an example: | |
| ```r | |
| # check genome availability for Candida glabrata | |
| is.genome.available(db = "refseq", organism = "Candida glabrata") | |
| ``` | |
| ```r | |
| Unfortunatey, no entry for 'Candida glabrata' was found in the 'refseq' database. | |
| Please consider specifying 'db = genbank' or 'db = ensembl' | |
| to check whether 'Candida glabrata' is availabe in these databases. | |
| [1] FALSE | |
| ``` | |
| When now checking the availability in `NCBI Genbank` we find that indeed | |
| 'Candida glabrata' is availabe: | |
| ```r | |
| # check genome availability for Candida glabrata | |
| is.genome.available(db = "genbank", organism = "Candida glabrata") | |
| ``` | |
| ``` | |
| Only a non-reference genome assembly is available for 'Candida glabrata'. | |
| Please make sure to specify the argument 'reference = FALSE' when running any get*() function. | |
| [1] TRUE | |
| ``` | |
| Although, an entry is available the `is.genome.available()` warns us that | |
| only a non-reference genome is available for 'Candida glabrata'. | |
| To then retrieve the e.g. genome etc. files users need to speficy the | |
| `reference = FALSE` argument in the `get*()` functions. | |
| For example: | |
| ```r | |
| # retrieve non-reference genome | |
| getGenome(db = "genbank", organism = "Candida glabrata", reference = FALSE) | |
| ``` | |
| ``` | |
| Only a non-reference genome assembly is available for 'Candida glabrata'. | |
| Please make sure to specify the argument 'reference = FALSE' when running any get*() function. | |
| Genome download is completed! | |
| Checking md5 hash of file: _ncbi_downloads/genomes/Candida_glabrata_md5checksums.txt ... | |
| The md5 hash of file '_ncbi_downloads/genomes/Candida_glabrata_md5checksums.txt' matches! | |
| The genome of 'Candida_glabrata' has been downloaded to '_ncbi_downloads/genomes' | |
| and has been named 'Candida_glabrata_genomic_genbank.fna.gz'. | |
| [1] "_ncbi_downloads/genomes/Candida_glabrata_genomic_genbank.fna.gz" | |
| ``` | |
| ### Example `NCBI Genbank` (?is.genome.available): | |
| Test whether or not the genome of `Homo sapiens` is present at NCBI Genbank. | |
| ```r | |
| # checking whether or not the Homo sapiens | |
| # genome is avaialable for download | |
| is.genome.available(db = "genbank", organism = "Homo sapiens") | |
| ``` | |
| ``` | |
| A reference or representative genome assembly is available for 'Homo sapiens'. | |
| More than one entry was found for 'Homo sapiens'. | |
| Please consider to re-run this funtion and specify 'details = TRUE'. | |
| This will allow you to select the 'assembly_accession' identifier | |
| that can then be specified in all get*() functions. | |
| [1] TRUE | |
| ``` | |
| Now with `details = TRUE`. | |
| ```r | |
| # checking whether or not the Homo sapiens | |
| # genome is avaialable for download | |
| is.genome.available(db = "genbank", organism = "Homo sapiens", details = TRUE) | |
| ``` | |
| ``` | |
| assembly_accessi… bioproject biosample wgs_master refseq_category taxid | |
| <chr> <chr> <chr> <chr> <chr> <int> | |
| 1 GCA_000001405.27 PRJNA31257 NA NA reference geno… 9606 | |
| 2 GCA_000002115.2 PRJNA1431 SAMN029812… AADD000000… na 9606 | |
| 3 GCA_000002125.2 PRJNA19621 SAMN029812… ABBA000000… na 9606 | |
| 4 GCA_000002135.3 PRJNA10793 NA NA na 9606 | |
| 5 GCA_000004845.2 PRJNA42199 SAMN000033… ADDF000000… na 9606 | |
| 6 GCA_000005465.1 PRJNA42201 NA DAAB000000… na 9606 | |
| 7 GCA_000181135.1 PRJNA28335 SAMN000016… ABKV000000… na 9606 | |
| 8 GCA_000185165.1 PRJNA59877 SAMN029812… AEKP000000… na 9606 | |
| 9 GCA_000212995.1 PRJNA19621 SAMN029812… ABSL000000… na 9606 | |
| 10 GCA_000252825.1 PRJNA19621 SAMN029812… ABBA000000… na 9606 | |
| # ... with 70 more rows, and 15 more variables: species_taxid <int>, | |
| # organism_name <chr>, infraspecific_name <chr>, isolate <chr>, | |
| # version_status <chr>, assembly_level <chr>, release_type <chr>, | |
| # genome_rep <chr>, seq_rel_date <date>, asm_name <chr>, submitter <chr>, | |
| # gbrs_paired_asm <chr>, paired_asm_comp <chr>, ftp_path <chr>, | |
| # excluded_from_refseq <chr> | |
| ``` | |
| As you can see there are several versions of the `Homo sapiens` genome | |
| available for download from NCBI Genbank. Using the `assembly_accession` | |
| id will now allow to specify which version shall be retrieved. | |
| ### Using `is.genome.available()` with ENSEMBL | |
| Users can also specify `db = "ensembl"` to retrieve available organisms provided by ENSEMBL. | |
| Again, users might experience a delay in the execution of this function when running it for the first time. | |
| This is due to the download of ENSEMBL information which is then stored internally to enable a much faster | |
| execution of this function in following runs. The corresponding information files are stored at `file.path(tempdir(), "ensembl_summary.txt")`. | |
| ### Example `ENSEMBL` (?is.genome.available): | |
| ```r | |
| # cheking whether Homo sapiens is available in the ENSEMBL database | |
| is.genome.available(db = "ensembl", organism = "Homo sapiens") | |
| ``` | |
| ``` | |
| A reference or representative genome assembly is available for 'Homo sapiens'. | |
| [1] TRUE | |
| ``` | |
| ```r | |
| # retrieve details for Homo sapiens | |
| is.genome.available(db = "ensembl", organism = "Homo sapiens", details = TRUE) | |
| ``` | |
| ``` | |
| division taxon_id name release display_name accession strain_collecti… | |
| <chr> <int> <chr> <int> <chr> <chr> <chr> | |
| 1 Ensembl 9606 homo_s… 92 Human GCA_000001… NA | |
| # ... with 3 more variables: common_name <chr>, strain <chr>, assembly <chr> | |
| ``` | |
| Again, users can either specify the `taxid` or `accession id` when searching for organism entries. | |
| ```r | |
| # retrieve details for Homo sapiens using taxid | |
| is.genome.available(db = "ensembl", organism = "9606", details = TRUE) | |
| ``` | |
| ``` | |
| division taxon_id name release display_name accession strain_collecti… | |
| <chr> <int> <chr> <int> <chr> <chr> <chr> | |
| 1 Ensembl 9606 homo_s… 92 Human GCA_000001… NA | |
| # ... with 3 more variables: common_name <chr>, strain <chr>, assembly <chr> | |
| > | |
| ``` | |
| ```r | |
| # retrieve details for Homo sapiens using accession id | |
| is.genome.available(organism = "GCA_000001405.27", db = "ensembl", details = TRUE) | |
| ``` | |
| ``` | |
| division taxon_id name release display_name accession strain_collecti… | |
| <chr> <int> <chr> <int> <chr> <chr> <chr> | |
| 1 Ensembl 9606 homo_s… 92 Human GCA_00000… NA | |
| # ... with 3 more variables: common_name <chr>, strain <chr>, assembly <chr> | |
| ``` | |
| __Please note that the `accession` id can change internally at ENSEMBL. E.g. in a recent case | |
| the `accession` id changed from `GCA_000001405.25` to `GCA_000001405.27`. Hence, please | |
| be careful and take this issue into account when you build automated retrieval scripts that are based | |
| on `accession` ids.__ | |
| ### Example `UniProt` (?is.genome.available): | |
| Users can also check the availability of proteomes in the [UniProt database](http://www.uniprot.org) | |
| by specifying: | |
| ```r | |
| # retrieve information from UniProt | |
| is.genome.available(db = "uniprot", "Homo sapiens", details = FALSE) | |
| ``` | |
| ``` | |
| A reference or representative genome assembly is available for 'Homo sapiens'. | |
| [1] TRUE | |
| ``` | |
| or with details: | |
| ```r | |
| # retrieve information from UniProt | |
| is.genome.available(db = "uniprot", "Homo sapiens", details = TRUE) | |
| ``` | |
| ``` | |
| name description isReferenceProt… isRepresentativ… dbReference component | |
| * <chr> <chr> <lgl> <lgl> <list> <list> | |
| 1 Homo … Homo sapie… TRUE TRUE <data.fram… <data.fr… | |
| # ... with 7 more variables: reference <list>, upid <chr>, modified <dbl>, | |
| # taxonomy <int>, source <chr>, sourceTaxonomy <int>, superregnum <chr> | |
| ``` | |
| Users can also search available species at UniProt via `taxid` or `upid` id. | |
| Here `9606` defines the taxonomy id for `Homo sapiens`. | |
| ```r | |
| # retrieve information from UniProt | |
| is.genome.available(db = "uniprot", "9606", details = TRUE) | |
| ``` | |
| ``` | |
| name description isReferenceProt… isRepresentativ… dbReference component | |
| * <chr> <chr> <lgl> <lgl> <list> <list> | |
| 1 Homo … Homo sapie… TRUE TRUE <data.fram… <data.fr… | |
| # ... with 7 more variables: reference <list>, upid <chr>, modified <dbl>, | |
| # taxonomy <int>, source <chr>, sourceTaxonomy <int>, superregnum <chr> | |
| ``` | |
| Here `UP000005640` defines the `upid` for `Homo sapiens`. | |
| ```r | |
| # retrieve information from UniProt | |
| is.genome.available(db = "uniprot", "UP000005640", details = TRUE) | |
| ``` | |
| ``` | |
| name description isReferenceProt… isRepresentativ… dbReference component | |
| * <chr> <chr> <lgl> <lgl> <list> <list> | |
| 1 Homo … Homo sapie… TRUE TRUE <data.fram… <data.fr… | |
| # ... with 7 more variables: reference <list>, upid <chr>, modified <dbl>, | |
| # taxonomy <int>, source <chr>, sourceTaxonomy <int>, superregnum <chr> | |
| ``` | |
| In general, the argument `db` specifies from which database (`refseq`, `genbank`, `ensembl` or `uniprot`) organism | |
| information shall be retrieved. Options are: | |
| - `db = 'refseq'` | |
| - `db = 'genbank'` | |
| - `db = 'ensembl'` | |
| - `db = 'uniprot'` | |
| ### Listing the total number of available genomes | |
| In some cases it might be useful to check how many genomes (in total) are available in the different databases. | |
| Users can determine the total number of available genomes using the `listGenomes()` function. | |
| Example `refseq`: | |
| ```{r,eval=FALSE} | |
| length(listGenomes(db = "refseq")) | |
| ``` | |
| ``` | |
| [1] 14264 | |
| ``` | |
| Example `genbank`: | |
| ```{r,eval=FALSE} | |
| length(listGenomes(db = "genbank")) | |
| ``` | |
| ``` | |
| [1] 21770 | |
| ``` | |
| Example `ensembl`: | |
| ```{r,eval=FALSE} | |
| length(listGenomes(db = "ensembl")) | |
| ``` | |
| ``` | |
| [1] 113 | |
| ``` | |
| Hence, currently 14264 genomes (including all kingdoms of life) are stored on NCBI RefSeq (as of 31/05/2018). | |
| ### Retrieving kingdom, group and subgroup information | |
| Using this example users can retrieve the number of available species for each kingdom of life: | |
| Example `refseq`: | |
| ```{r,eval=FALSE} | |
| # the number of genomes available for each kingdom | |
| listKingdoms(db = "refseq") | |
| ``` | |
| ``` | |
| Archaea Bacteria Eukaryota Viroids Viruses | |
| 169 6321 561 47 7166 | |
| ``` | |
| Example `genbank`: | |
| ```{r,eval=FALSE} | |
| # the number of genomes available for each kingdom | |
| listKingdoms(db = "genbank") | |
| ``` | |
| ``` | |
| Archaea Bacteria Eukaryota | |
| 1249 18083 2438 | |
| ``` | |
| Example `ENSEMBL`: | |
| ```{r,eval=FALSE} | |
| # the number of genomes available for each kingdom | |
| listKingdoms(db = "ensembl") | |
| ``` | |
| ``` | |
| Ensembl | |
| 113 | |
| ``` | |
| ### Analogous computations can be performed for `groups` and `subgroups` | |
| Unfortunately, `ENSEMBL` does not provide group or subgroup information. | |
| Therefore, group and subgroup listings are limited to `refseq` and `genbank`. | |
| Example `refseq`: | |
| ```{r,eval=FALSE} | |
| # the number of genomes available for each group | |
| listGroups(db = "refseq") | |
| ``` | |
| ``` | |
| Abditibacteriota Acidithiobacillia | |
| 1 5 | |
| Acidobacteriia Ackermannviridae | |
| 16 21 | |
| Actinobacteria Adenoviridae | |
| 1226 62 | |
| Alloherpesviridae Alphaflexiviridae | |
| 8 56 | |
| Alphaproteobacteria Alphasatellitidae | |
| 857 67 | |
| Alphatetraviridae Alvernaviridae | |
| 3 1 | |
| Amalgaviridae Amphibians | |
| 10 3 | |
| Ampullaviridae Anelloviridae | |
| 3 61 | |
| Apicomplexans Apple fruit crinkle viroid | |
| 18 1 | |
| Apple hammerhead viroid-like circular RNA Apscaviroid | |
| 1 12 | |
| Aquificae Archaeoglobi | |
| 6 3 | |
| Arenaviridae Armatimonadetes | |
| 37 4 | |
| Arteriviridae Ascomycetes | |
| 13 45 | |
| Ascoviridae Asfarviridae | |
| 4 1 | |
| Aspiviridae Astroviridae | |
| 3 40 | |
| Avsunviroid Bacilladnaviridae | |
| 1 7 | |
| Bacteria candidate phyla Bacteroidetes/Chlorobi group | |
| 3 654 | |
| Baculoviridae Balneolia | |
| 66 4 | |
| Barnaviridae Basidiomycetes | |
| 1 5 | |
| Benyviridae Betaflexiviridae | |
| 3 92 | |
| Betaproteobacteria Bicaudaviridae | |
| 407 4 | |
| Birds Birnaviridae | |
| 54 8 | |
| Blastocatellia Bornaviridae | |
| 1 5 | |
| Bromoviridae Caliciviridae | |
| 35 26 | |
| Candidatus Kryptonia Candidatus Micrarchaeota | |
| 3 1 | |
| Candidatus Tectomicrobia Carmotetraviridae | |
| 1 1 | |
| Caulimoviridae Cherry leaf scorch small circular viroid-like RNA 1 | |
| 67 1 | |
| Cherry small circular viroid-like RNA Chlamydiae | |
| 1 19 | |
| Chloroflexi Chrysoviridae | |
| 23 9 | |
| Circoviridae Closteroviridae | |
| 148 40 | |
| Cocadviroid Coleviroid | |
| 4 6 | |
| Coprothermobacterota Coronaviridae | |
| 1 47 | |
| Corticoviridae Crenarchaeota | |
| 1 9 | |
| Cyanobacteria/Melainabacteria group Cystoviridae | |
| 22 5 | |
| Deinococcus-Thermus delta/epsilon subdivisions | |
| 23 142 | |
| Deltaflexiviridae Dicistroviridae | |
| 1 27 | |
| Elaviroid Endomicrobia | |
| 1 2 | |
| Endornaviridae Fibrobacteres | |
| 29 2 | |
| Filoviridae Fimoviridae | |
| 7 6 | |
| Firmicutes Fishes | |
| 1378 52 | |
| Flatworms Flaviviridae | |
| 4 107 | |
| Fusariviridae Fuselloviridae | |
| 7 10 | |
| Fusobacteriia Gammaflexiviridae | |
| 20 1 | |
| Gammaproteobacteria Geminiviridae | |
| 1285 406 | |
| Gemmatimonadetes Genomoviridae | |
| 2 62 | |
| Globuloviridae Grapevine latent viroid | |
| 2 1 | |
| Green Algae Guttaviridae | |
| 8 1 | |
| Halobacteria Hantaviridae | |
| 80 24 | |
| Hepadnaviridae Hepeviridae | |
| 16 3 | |
| Herpesviridae Hostuviroid | |
| 72 2 | |
| Hydrogenophilalia Hypoviridae | |
| 1 15 | |
| Hytrosaviridae Iflaviridae | |
| 2 34 | |
| Inoviridae Insects | |
| 42 110 | |
| Iridoviridae Kinetoplasts | |
| 20 5 | |
| Kiritimatiellaeota Land Plants | |
| 1 80 | |
| Lavidaviridae Lentisphaerae | |
| 2 1 | |
| Leviviridae Lipothrixviridae | |
| 11 8 | |
| Luteoviridae Malacoherpesviridae | |
| 48 2 | |
| Mammals Marnaviridae | |
| 100 1 | |
| Marseilleviridae Megabirnaviridae | |
| 6 3 | |
| Mesoniviridae Methanobacteria | |
| 7 18 | |
| Methanococci Methanomicrobia | |
| 2 21 | |
| Methanonatronarchaeia Microviridae | |
| 1 42 | |
| Mimiviridae Mulberry small circular viroid-like RNA 1 | |
| 5 1 | |
| Mymonaviridae Myoviridae | |
| 1 516 | |
| Nairoviridae Nanoviridae | |
| 13 8 | |
| Narnaviridae Nimaviridae | |
| 39 1 | |
| Nitrospira Nodaviridae | |
| 10 15 | |
| Nudiviridae Nyamiviridae | |
| 5 4 | |
| Oligoflexia Orthomyxoviridae | |
| 3 6 | |
| Other Other Animals | |
| 988 35 | |
| Other Fungi Other Protists | |
| 3 16 | |
| Papillomaviridae Paramyxoviridae | |
| 134 54 | |
| Partitiviridae Parvoviridae | |
| 55 90 | |
| Pelamoviroid Peribunyaviridae | |
| 2 32 | |
| Permutotetraviridae Persimmon viroid | |
| 1 1 | |
| Phasmaviridae Phenuiviridae | |
| 4 28 | |
| Phycodnaviridae Picobirnaviridae | |
| 26 5 | |
| Picornaviridae Pithoviridae | |
| 104 2 | |
| Planctomycetes Plasmaviridae | |
| 18 1 | |
| Pleolipoviridae Pneumoviridae | |
| 8 7 | |
| Podoviridae Polycipiviridae | |
| 343 6 | |
| Polydnaviridae Polyomaviridae | |
| 5 83 | |
| Pospiviroid Potyviridae | |
| 11 146 | |
| Poxviridae Quadriviridae | |
| 47 1 | |
| Reoviridae Reptiles | |
| 67 11 | |
| Retroviridae Rhabdoviridae | |
| 52 150 | |
| Rhodothermia Roniviridae | |
| 4 1 | |
| Roundworms Rubber viroid India/2009 | |
| 8 1 | |
| Rudiviridae Secoviridae | |
| 15 53 | |
| Siphoviridae Solemoviridae | |
| 1061 20 | |
| Solibacteres Solinviviridae | |
| 3 2 | |
| Sphaerolipoviridae Spirochaetia | |
| 7 39 | |
| Sunviridae Synergistia | |
| 1 4 | |
| Tectiviridae Tenericutes | |
| 5 69 | |
| Thaumarchaeota Thermococci | |
| 8 19 | |
| Thermodesulfobacteria Thermoplasmata | |
| 4 7 | |
| Thermotogae Togaviridae | |
| 10 24 | |
| Tolecusatellitidae Tombusviridae | |
| 101 68 | |
| Tospoviridae Totiviridae | |
| 17 66 | |
| Tristromaviridae Turriviridae | |
| 1 2 | |
| Tymoviridae unclassified | |
| 37 606 | |
| unclassified Acidobacteria unclassified Bacteria (miscellaneous) | |
| 1 13 | |
| unclassified Nitrospirae unclassified Proteobacteria | |
| 1 2 | |
| Verrucomicrobia Virgaviridae | |
| 25 52 | |
| Zetaproteobacteria | |
| 5 | |
| ``` | |
| Example `genbank`: | |
| ```{r,eval=FALSE} | |
| # the number of genomes available for each group | |
| listGroups(db = "genbank") | |
| ``` | |
| ``` | |
| Abditibacteriota Acidithiobacillia | |
| 1 8 | |
| Acidobacteriia Actinobacteria | |
| 24 1596 | |
| Alphaproteobacteria Amphibians | |
| 1604 6 | |
| Apicomplexans Aquificae | |
| 47 7 | |
| Archaeoglobi Armatimonadetes | |
| 5 38 | |
| Ascomycetes Bacteria candidate phyla | |
| 689 3532 | |
| Bacteroidetes/Chlorobi group Balneolia | |
| 2247 44 | |
| Basidiomycetes Betaproteobacteria | |
| 204 751 | |
| Birds Blastocatellia | |
| 80 2 | |
| Caldisericia candidate division WS1 | |
| 1 1 | |
| candidate division Zixibacteria Candidatus Aegiribacteria | |
| 17 1 | |
| Candidatus Aenigmarchaeota Candidatus Bathyarchaeota | |
| 14 42 | |
| Candidatus Cloacimonetes Candidatus Diapherotrites | |
| 88 11 | |
| Candidatus Fermentibacteria Candidatus Geothermarchaeota | |
| 8 3 | |
| Candidatus Heimdallarchaeota Candidatus Hydrogenedentes | |
| 4 10 | |
| Candidatus Korarchaeota Candidatus Kryptonia | |
| 1 4 | |
| Candidatus Lambdaproteobacteria Candidatus Latescibacteria | |
| 6 13 | |
| Candidatus Lokiarchaeota Candidatus Marinimicrobia | |
| 2 92 | |
| Candidatus Marsarchaeota Candidatus Micrarchaeota | |
| 15 35 | |
| Candidatus Moduliflexus Candidatus Muproteobacteria | |
| 1 14 | |
| Candidatus Nanohaloarchaeota Candidatus Odinarchaeota | |
| 16 1 | |
| Candidatus Omnitrophica Candidatus Pacearchaeota | |
| 126 41 | |
| Candidatus Parvarchaeota Candidatus Tectomicrobia | |
| 11 6 | |
| Candidatus Thorarchaeota Candidatus Vecturithrix | |
| 7 1 | |
| Candidatus Verstraetearchaeota Candidatus Woesearchaeota | |
| 5 36 | |
| Chlamydiae Chloroflexi | |
| 43 403 | |
| Chrysiogenetes Coprothermobacterota | |
| 1 1 | |
| Crenarchaeota Cyanobacteria/Melainabacteria group | |
| 54 172 | |
| Deferribacteres Deinococcus-Thermus | |
| 8 27 | |
| delta/epsilon subdivisions Dictyoglomia | |
| 615 1 | |
| Elusimicrobia Endomicrobia | |
| 1 2 | |
| environmental samples Fibrobacteres | |
| 3 15 | |
| Firmicutes Fishes | |
| 2663 168 | |
| Flatworms Fusobacteriia | |
| 35 29 | |
| Gammaproteobacteria Gemmatimonadetes | |
| 2041 82 | |
| Green Algae Hadesarchaea | |
| 30 5 | |
| Halobacteria Holophagae | |
| 162 9 | |
| Hydrogenophilalia Insects | |
| 32 280 | |
| Kinetoplasts Kiritimatiellaeota | |
| 33 3 | |
| Land Plants Lentisphaerae | |
| 282 32 | |
| Mammals Methanobacteria | |
| 145 31 | |
| Methanococci Methanomicrobia | |
| 2 77 | |
| Methanonatronarchaeia Methanopyri | |
| 2 1 | |
| Nanoarchaeota nitrifying bacterium enrichment culture | |
| 11 1 | |
| Nitrospinae Nitrospira | |
| 19 31 | |
| Oligoflexia Other | |
| 87 15 | |
| Other Animals Other Fungi | |
| 125 58 | |
| Other Plants Other Protists | |
| 1 128 | |
| Planctomycetes Reptiles | |
| 171 21 | |
| Rhodothermia Roundworms | |
| 4 91 | |
| Solibacteres Spirochaetia | |
| 7 115 | |
| Synergistia Tenericutes | |
| 43 141 | |
| Thaumarchaeota Theionarchaea | |
| 92 2 | |
| Thermococci Thermodesulfobacteria | |
| 21 7 | |
| Thermoplasmata Thermotogae | |
| 135 23 | |
| unclassified Acidobacteria unclassified Archaea (miscellaneous) | |
| 94 52 | |
| unclassified Bacteria (miscellaneous) unclassified Caldiserica | |
| 197 11 | |
| unclassified Calditrichaeota unclassified Elusimicrobia | |
| 2 101 | |
| unclassified Euryarchaeota unclassified Nitrospirae | |
| 348 90 | |
| unclassified Proteobacteria unclassified Rhodothermaeota | |
| 130 4 | |
| unclassified Spirochaetes unclassified Synergistetes | |
| 37 3 | |
| unclassified Thermotogae uncultured archaeon | |
| 1 1 | |
| uncultured archaeon A07HB70 uncultured archaeon A07HN63 | |
| 1 1 | |
| uncultured archaeon A07HR60 uncultured archaeon A07HR67 | |
| 1 1 | |
| uncultured bacterium Verrucomicrobia | |
| 1 296 | |
| Zetaproteobacteria | |
| 41 | |
| ``` | |
| __Note__ that when running the `listGenomes()` function for the first time, it might take a while until | |
| the function returns any results, because necessary information need to be downloaded from NCBI and ENSEMBL databases. All subsequent executions of `listGenomes()` | |
| will then respond very fast, because they will access the corresponding files stored on your hard drive. | |
| ## Downloading Biological Sequences and Annotations | |
| After checking for the availability of sequence information for an organism of interest, | |
| the next step is to download the corresponding genome, proteome, CDS, or GFF file. | |
| The following functions allow users to download proteomes, genomes, CDS and GFF files from several | |
| database resources such as: [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/about/), [NCBI Genbank](http://www.ncbi.nlm.nih.gov/genbank/about/), [ENSEMBL](http://www.ensembl.org/index.html). When a corresponding proteome, genome, CDS or GFF file was | |
| loaded to your hard-drive, a documentation `*.txt` file is generated storing `File Name`, `Organism`, | |
| `Database`, `URL`, `DATE`, `assembly_accession`, `bioproject`, `biosample`, `taxid`, `version_status`, `release_type`, | |
| `seq_rel_date` etc. information of the retrieved file. This way a better reproducibility of proteome, genome, CDS and GFF versions | |
| used for subsequent data analyses can be achieved. | |
| ### Genome Retrieval | |
| The easiest way to download a genome is to use the `getGenome()` function. | |
| In this example we will download the genome of `Homo sapiens`. | |
| The `getGenome()` function is an interface function to the [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/about/), [NCBI Genbank](http://www.ncbi.nlm.nih.gov/genbank/about/), | |
| [ENSEMBL](http://www.ensembl.org/index.html) databases from | |
| which corresponding genomes can be retrieved. | |
| The `db` argument specifies from which database genome assemblies in `*.fasta` file format shall be retrieved. | |
| Options are: | |
| - `db = "refseq"` for retrieval from [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/about/) | |
| - `db = "genbank"` for retrieval from [NCBI Genbank](http://www.ncbi.nlm.nih.gov/genbank/about/) | |
| - `db = "ensembl"` for retrieval from [ENSEMBL](http://www.ensembl.org/index.html) | |
| Furthermore, users need to specify the `scientific name`, the `taxid` (= [NCBI Taxnonomy](https://www.ncbi.nlm.nih.gov/taxonomy) identifier), or the `accession identifier` of the organism of interest for which a genome assembly shall be downloaded, e.g. `organism = "Homo sapiens"` or `organism = "9606"` or `organism = "GCF_000001405.37"`. | |
| Finally, the `path` argument specifies the folder path in which the corresponding assembly shall be locally stored. In case users would like to store the genome file at a different location, | |
| they can specify the `path = file.path("put","your","path","here")` argument (e.g. `file.path("_ncbi_downloads","genomes")`). | |
| ### Example NCBI RefSeq: | |
| ```{r,eval=FALSE} | |
| # download the genome of Homo sapiens from refseq | |
| # and store the corresponding genome file in '_ncbi_downloads/genomes' | |
| HS.genome.refseq <- getGenome( db = "refseq", | |
| organism = "Homo sapiens", | |
| path = file.path("_ncbi_downloads","genomes") ) | |
| ``` | |
| In this example, `getGenome()` creates a directory named `'_ncbi_downloads/genomes'` into which the corresponding genome named `GCF_000001405.34_GRCh38.p8_genomic.fna.gz` is downloaded. The return value of `getGenome()` is the folder path to the downloaded genome file | |
| that can then be used as input to the `read_genome()` function. The variable `HS.genome.refseq` stores the path to the downloaded genome. | |
| Users can also omit the `path` argument if they wish to store the genome in their current working directory. E.g.: | |
| ```{r,eval=FALSE} | |
| # download the genome of Homo sapiens from refseq | |
| # and store the corresponding genome file in '_ncbi_downloads/genomes' | |
| HS.genome.refseq <- getGenome( db = "refseq", | |
| organism = "Homo sapiens") | |
| ``` | |
| Subsequently, users can use the `read_genome()` function to import the genome into the R session. Users can choose to work with the genome sequence in R either as [Biostrings](http://bioconductor.org/packages/release/bioc/html/Biostrings.html) object (`obj.type = "Biostrings"`) or [data.table](https://github.com/Rdatatable/data.table/wiki) object | |
| (`obj.type = "data.table"`) by specifying the `obj.type` argument of the `read_genome()` function. | |
| ```{r,eval=FALSE} | |
| # import downloaded genome as Biostrings object | |
| Human_Genome <- read_genome(file = HS.genome.refseq) | |
| ``` | |
| ```{r,eval=FALSE} | |
| # look at the Biostrings object | |
| Human_Genome | |
| ``` | |
| ``` | |
| A DNAStringSet instance of length 551 | |
| width seq names | |
| [1] 248956422 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN NC_000001.11 Homo... | |
| [2] 175055 GAATTCAGCTGAGAAGAACAGGCA...TGTTTGTCAGTCACATAGAATTC NT_187361.1 Homo ... | |
| [3] 32032 AGGGGTCTGCTTAGAGAGGGTCTC...TGACTTACGTTGACGTGGAATTC NT_187362.1 Homo ... | |
| [4] 127682 GATCGAGACTATCCTGGCTAACAC...ATTGTCAATTGGGACCTTTGATC NT_187363.1 Homo ... | |
| [5] 66860 GAATTCATTCGATGACGATTCCAT...AAAAAACTCTCAGCCACGAATTC NT_187364.1 Homo ... | |
| ... ... ... | |
| [547] 170148 TTTCTTTCTTTTTTTTTTTTTTGT...GTCACAGGACTCATGGGGAATTC NT_187685.1 Homo ... | |
| [548] 215732 TGTGGTGAGGACCCTTAAGATCTA...GTCACAGGACTCATGGGGAATTC NT_187686.1 Homo ... | |
| [549] 170537 TCTACTCTCCCATGCTTGCCTCGG...GTCACAGGACTCATGGGGAATTC NT_187687.1 Homo ... | |
| [550] 177381 GATCTATCTGTATCTCCACAGGTG...GTCACAGGACTCATGGGGAATTC NT_113949.2 Homo ... | |
| [551] 16569 GATCACAGGTCTATCACCCTATTA...CCCTTAAATAAGACATCACGATG NC_012920.1 Homo ... | |
| ``` | |
| Internally, a text file named `doc_Homo_sapiens_db_refseq.txt` is generated. The information stored in this log file is structured as follows: | |
| ``` | |
| File Name: Homo_sapiens_genomic_refseq.fna.gz | |
| Organism Name: Homo_sapiens | |
| Database: NCBI refseq | |
| URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/ | |
| GCF_000001405.35_GRCh38.p9/GCF_000001405.35_GRCh38.p9_genomic.fna.gz | |
| Download_Date: Sat Oct 22 12:41:07 2016 | |
| refseq_category: reference genome | |
| assembly_accession: GCF_000001405.35 | |
| bioproject: PRJNA168 | |
| biosample: NA | |
| taxid: 9606 | |
| infraspecific_name: NA | |
| version_status: latest | |
| release_type: Patch | |
| genome_rep: Full | |
| seq_rel_date: 2016-09-26 | |
| submitter: Genome Reference Consortium | |
| ``` | |
| In addition, the genome summary statistics for the retrieved species is stored locally (`doc_Homo_sapiens_db_refseq_summary_statistics.tsv`) to provide users with insights regarding the genome assembly quality (see `?summary_genome()` for details). | |
| This file can be used as `Supplementary Information` file in publications to facilitate reproducible research. | |
| Most comparative genomics studies do not consider differences in genome assembly qualities when comparing the genomes of diverse species. | |
| This way, they expose themselves to technical artifacts that might generate patterns mistaken to be of biological relevance whereas | |
| in reality they just reflect the difference in genome assembly quality. Considering the quality of genome assemblies when downloading | |
| the genomic sequences will help researchers to avoid these pitfalls. | |
| The summary statistics include: | |
| - `genome_size_mbp`: Genome size in mega base pairs | |
| - `n50_mbp`: The N50 contig size of the genome assembly in mega base pairs | |
| - `n_seqs`: The number of chromosomes/scaffolds/contigs of the genome assembly file | |
| - `n_nnn`: The absolute number of NNNs (over all chromosomes or scaffolds or contigs) in the genome assembly file | |
| - `rel_nnn`: The percentage (relative frequency) of NNNs (over all chromosomes or scaffolds or contigs) compared to the total number of nucleotides in the genome assembly file | |
| - `genome_entropy`: The Shannon Entropy of the genome assembly file (median entropy over all individual chromosome entropies) | |
| - `n_gc`: The total number of GCs (over all chromosomes or scaffolds or contigs) in the genome assembly file | |
| - `rel_gc`: The (relative frequency) of GCs (over all chromosomes or scaffolds or contigs) compared to the total number of nucleotides in the genome assembly file | |
| In summary, the `getGenome()` and `read_genome()` functions allow users to retrieve genome assemblies by specifying | |
| the scientific name of the organism of interest and allow them to import the retrieved genome assembly e.g. as `Biostrings` object. | |
| Thus, users can then perform the `Biostrings notation` to work with downloaded genomes and can rely on the log | |
| file generated by `getGenome()` to better document the source and version of genome assemblies used for subsequent studies. | |
| Alternatively, users can perform the pipeline logic of the [magrittr](https://github.com/smbache/magrittr) package: | |
| ```{r,eval=FALSE} | |
| # install.packages("magrittr") | |
| library(magrittr) | |
| # import genome as Biostrings object | |
| Human_Genome <- getGenome( db = "refseq", | |
| organism = "Homo sapiens", | |
| path = file.path("_ncbi_downloads","genomes")) %>% | |
| read_genome() | |
| ``` | |
| ```{r,eval=FALSE} | |
| Human_Genome | |
| ``` | |
| ``` | |
| A DNAStringSet instance of length 551 | |
| width seq names | |
| [1] 248956422 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN NC_000001.11 Homo... | |
| [2] 175055 GAATTCAGCTGAGAAGAACAGGCA...TGTTTGTCAGTCACATAGAATTC NT_187361.1 Homo ... | |
| [3] 32032 AGGGGTCTGCTTAGAGAGGGTCTC...TGACTTACGTTGACGTGGAATTC NT_187362.1 Homo ... | |
| [4] 127682 GATCGAGACTATCCTGGCTAACAC...ATTGTCAATTGGGACCTTTGATC NT_187363.1 Homo ... | |
| [5] 66860 GAATTCATTCGATGACGATTCCAT...AAAAAACTCTCAGCCACGAATTC NT_187364.1 Homo ... | |
| ... ... ... | |
| [547] 170148 TTTCTTTCTTTTTTTTTTTTTTGT...GTCACAGGACTCATGGGGAATTC NT_187685.1 Homo ... | |
| [548] 215732 TGTGGTGAGGACCCTTAAGATCTA...GTCACAGGACTCATGGGGAATTC NT_187686.1 Homo ... | |
| [549] 170537 TCTACTCTCCCATGCTTGCCTCGG...GTCACAGGACTCATGGGGAATTC NT_187687.1 Homo ... | |
| [550] 177381 GATCTATCTGTATCTCCACAGGTG...GTCACAGGACTCATGGGGAATTC NT_113949.2 Homo ... | |
| [551] 16569 GATCACAGGTCTATCACCCTATTA...CCCTTAAATAAGACATCACGATG NC_012920.1 Homo ... | |
| ``` | |
| #### Use `taxid` id for genome retrieval | |
| Alternatively, instead of specifying the scientific name in the argument `organism` | |
| users can specify the `taxonomy` id of the corresponding organism. Here, we specify | |
| the taxonomy id `559292` which encodes the species `Saccharomyces cerevisiae`. | |
| ```{r,eval=FALSE} | |
| # install.packages("magrittr") | |
| library(magrittr) | |
| # import genome as Biostrings object | |
| Scerevisiae_Genome <- getGenome( | |
| db = "refseq", | |
| organism = "559292") %>% | |
| read_genome() | |
| ``` | |
| ```{r,eval=FALSE} | |
| Scerevisiae_Genome | |
| ``` | |
| ``` | |
| A DNAStringSet instance of length 17 | |
| width seq names | |
| [1] 230218 CCACACCACACCCACACAC...GGGTGTGGTGTGTGTGGG NC_001133.9 Sacch... | |
| [2] 813184 AAATAGCCCTCATGTACGT...TGTGGTGTGTGGGTGTGT NC_001134.8 Sacch... | |
| [3] 316620 CCCACACACCACACCCACA...GTGGGTGTGGTGTGTGTG NC_001135.5 Sacch... | |
| [4] 1531933 ACACCACACCCACACCACA...GTAGTAAGTAGCTTTTGG NC_001136.10 Sacc... | |
| [5] 576874 CGTCTCCTCCAAGCCCTGT...ATTTTCATTTTTTTTTTT NC_001137.3 Sacch... | |
| ... ... ... | |
| [13] 924431 CCACACACACACCACACCC...TGTGGTGTGTGTGTGGGG NC_001145.3 Sacch... | |
| [14] 784333 CCGGCTTTCTGACCGAAAT...TGTGGGTGTGGTGTGGGT NC_001146.8 Sacch... | |
| [15] 1091291 ACACCACACCCACACCACA...TGTGTGGGTGTGGTGTGT NC_001147.6 Sacch... | |
| [16] 948066 AAATAGCCCTCATGTACGT...TTTAATTTCGGTCAGAAA NC_001148.4 Sacch... | |
| [17] 85779 TTCATAATTAATTTTTTAT...TATAATATAATATCCATA NC_001224.1 Sacch... | |
| ``` | |
| #### Use `assembly_accession` id for genome retrieval | |
| Alternatively, instead of specifying the scientific name or taxonomy in the argument `organism` | |
| users can specify the `assembly_accession` id of the corresponding organism. Here, we specify | |
| the `assembly_accession` id `GCF_000146045.2` which encodes the species `Saccharomyces cerevisiae`. | |
| ```{r,eval=FALSE} | |
| # install.packages("magrittr") | |
| library(magrittr) | |
| # import genome as Biostrings object | |
| Scerevisiae_Genome <- getGenome( | |
| db = "refseq", | |
| organism = "GCF_000146045.2") %>% | |
| read_genome() | |
| ``` | |
| ```{r,eval=FALSE} | |
| Scerevisiae_Genome | |
| ``` | |
| ``` | |
| A DNAStringSet instance of length 17 | |
| width seq names | |
| [1] 230218 CCACACCACACCCACACACCCACACA...GTGGTGTGGGTGTGGTGTGTGTGGG NC_001133.9 Sacch... | |
| [2] 813184 AAATAGCCCTCATGTACGTCTCCTCC...GTGTGGGTGTGGTGTGTGGGTGTGT NC_001134.8 Sacch... | |
| [3] 316620 CCCACACACCACACCCACACCACACC...GGGTGTGGTGGGTGTGGTGTGTGTG NC_001135.5 Sacch... | |
| [4] 1531933 ACACCACACCCACACCACACCCACAC...AATAAAGGTAGTAAGTAGCTTTTGG NC_001136.10 Sacc... | |
| [5] 576874 CGTCTCCTCCAAGCCCTGTTGTCTCT...GGGTTTCATTTTCATTTTTTTTTTT NC_001137.3 Sacch... | |
| ... ... ... | |
| [13] 924431 CCACACACACACCACACCCACACCAC...GTGTGGGTGTGGTGTGTGTGTGGGG NC_001145.3 Sacch... | |
| [14] 784333 CCGGCTTTCTGACCGAAATTAAAAAA...GGGTGTGTGTGGGTGTGGTGTGGGT NC_001146.8 Sacch... | |
| [15] 1091291 ACACCACACCCACACCACACCCACAC...GAGAGAGTGTGTGGGTGTGGTGTGT NC_001147.6 Sacch... | |
| [16] 948066 AAATAGCCCTCATGTACGTCTCCTCC...TTTTTTTTTTAATTTCGGTCAGAAA NC_001148.4 Sacch... | |
| [17] 85779 TTCATAATTAATTTTTTATATATATA...GCTTAATTATAATATAATATCCATA NC_001224.1 Sacch... | |
| ``` | |
| ### Example NCBI Genbank: | |
| Genome retrieval from `NCBI Genbank`. | |
| ```{r,eval=FALSE} | |
| # download the genome of Homo sapiens from Genbank | |
| # and store the corresponding genome file in '_ncbi_downloads/genomes' | |
| HS.genome.genbank <- getGenome( db = "genbank", | |
| organism = "Homo sapiens", | |
| path = file.path("_ncbi_downloads","genomes") ) | |
| ``` | |
| ```{r,eval=FALSE} | |
| # import downloaded genome as Biostrings object | |
| Human_Genome <- read_genome(file = HS.genome.genbank) | |
| ``` | |
| ```{r,eval=FALSE} | |
| # look at the Biostrings object | |
| Human_Genome | |
| ``` | |
| ``` | |
| A DNAStringSet instance of length 551 | |
| width seq names | |
| [1] 248956422 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN CM000663.2 Homo s... | |
| [2] 175055 GAATTCAGCTGAGAAGAACAGGCA...TGTTTGTCAGTCACATAGAATTC KI270706.1 Homo s... | |
| [3] 32032 AGGGGTCTGCTTAGAGAGGGTCTC...TGACTTACGTTGACGTGGAATTC KI270707.1 Homo s... | |
| [4] 127682 GATCGAGACTATCCTGGCTAACAC...ATTGTCAATTGGGACCTTTGATC KI270708.1 Homo s... | |
| [5] 66860 GAATTCATTCGATGACGATTCCAT...AAAAAACTCTCAGCCACGAATTC KI270709.1 Homo s... | |
| ... ... ... | |
| [547] 170148 TTTCTTTCTTTTTTTTTTTTTTGT...GTCACAGGACTCATGGGGAATTC KI270931.1 Homo s... | |
| [548] 215732 TGTGGTGAGGACCCTTAAGATCTA...GTCACAGGACTCATGGGGAATTC KI270932.1 Homo s... | |
| [549] 170537 TCTACTCTCCCATGCTTGCCTCGG...GTCACAGGACTCATGGGGAATTC KI270933.1 Homo s... | |
| [550] 177381 GATCTATCTGTATCTCCACAGGTG...GTCACAGGACTCATGGGGAATTC GL000209.2 Homo s... | |
| [551] 16569 GATCACAGGTCTATCACCCTATTA...CCCTTAAATAAGACATCACGATG J01415.2 Homo sap... | |
| ``` | |
| #### Use `taxonomy` id for genome retrieval | |
| Alternatively, instead of specifying the scientific name in the argument `organism` | |
| users can specify the `taxonomy` id of the corresponding organism. Here, we specify | |
| the taxonomy id `559292` which encodes the species `Saccharomyces cerevisiae`. | |
| ```{r,eval=FALSE} | |
| # install.packages("magrittr") | |
| library(magrittr) | |
| # import genome as Biostrings object | |
| Scerevisiae_Genome <- getGenome( | |
| db = "genbank", | |
| organism = "559292") %>% | |
| read_genome() | |
| ``` | |
| ```{r,eval=FALSE} | |
| Scerevisiae_Genome | |
| ``` | |
| ``` | |
| A DNAStringSet instance of length 16 | |
| width seq names | |
| [1] 230218 CCACACCACACCCACA...TGTGGTGTGTGTGGG BK006935.2 TPA_in... | |
| [2] 813184 AAATAGCCCTCATGTA...GGTGTGTGGGTGTGT BK006936.2 TPA_in... | |
| [3] 316620 CCCACACACCACACCC...GGTGTGGTGTGTGTG BK006937.2 TPA_in... | |
| [4] 1531933 ACACCACACCCACACC...GTAAGTAGCTTTTGG BK006938.2 TPA_in... | |
| [5] 576874 CGTCTCCTCCAAGCCC...TTCATTTTTTTTTTT BK006939.2 TPA_in... | |
| ... ... ... | |
| [12] 1078177 CACACACACACACCAC...ACATGAGGGCTATTT BK006945.2 TPA_in... | |
| [13] 924431 CCACACACACACCACA...GGTGTGTGTGTGGGG BK006946.2 TPA_in... | |
| [14] 784333 CCGGCTTTCTGACCGA...GGGTGTGGTGTGGGT BK006947.3 TPA_in... | |
| [15] 1091291 ACACCACACCCACACC...GTGGGTGTGGTGTGT BK006948.2 TPA_in... | |
| [16] 948066 AAATAGCCCTCATGTA...AATTTCGGTCAGAAA BK006949.2 TPA_in... | |
| ``` | |
| #### Use `assembly_accession` id for genome retrieval | |
| Alternatively, instead of specifying the scientific name or taxonomy in the argument `organism` | |
| users can specify the `assembly_accession` id of the corresponding organism. Here, we specify | |
| the `assembly_accession` id `GCA_000146045.2` which encodes the species `Saccharomyces cerevisiae`. | |
| ```{r,eval=FALSE} | |
| # install.packages("magrittr") | |
| library(magrittr) | |
| # import genome as Biostrings object | |
| Scerevisiae_Genome <- getGenome( | |
| db = "genbank", | |
| organism = "GCA_000146045.2") %>% | |
| read_genome() | |
| ``` | |
| ```{r,eval=FALSE} | |
| Scerevisiae_Genome | |
| ``` | |
| ``` | |
| A DNAStringSet instance of length 16 | |
| width seq names | |
| [1] 230218 CCACACCACACCCACAC...GTGTGGTGTGTGTGGG BK006935.2 TPA_in... | |
| [2] 813184 AAATAGCCCTCATGTAC...TGGTGTGTGGGTGTGT BK006936.2 TPA_in... | |
| [3] 316620 CCCACACACCACACCCA...GGGTGTGGTGTGTGTG BK006937.2 TPA_in... | |
| [4] 1531933 ACACCACACCCACACCA...AGTAAGTAGCTTTTGG BK006938.2 TPA_in... | |
| [5] 576874 CGTCTCCTCCAAGCCCT...TTTCATTTTTTTTTTT BK006939.2 TPA_in... | |
| ... ... ... | |
| [12] 1078177 CACACACACACACCACC...TACATGAGGGCTATTT BK006945.2 TPA_in... | |
| [13] 924431 CCACACACACACCACAC...TGGTGTGTGTGTGGGG BK006946.2 TPA_in... | |
| [14] 784333 CCGGCTTTCTGACCGAA...TGGGTGTGGTGTGGGT BK006947.3 TPA_in... | |
| [15] 1091291 ACACCACACCCACACCA...TGTGGGTGTGGTGTGT BK006948.2 TPA_in... | |
| [16] 948066 AAATAGCCCTCATGTAC...TAATTTCGGTCAGAAA BK006949.2 TPA_in... | |
| ``` | |
| In addition, the genome summary statistics for the retrieved species is stored locally (`doc_saccharomyces_cerevisiae_db_genbank_summary_statistics.tsv`) to provide users with insights regarding the genome assembly quality (see `?summary_genome()` for details). | |
| This file can be used as `Supplementary Information` file in publications to facilitate reproducible research. | |
| Most comparative genomics studies do not consider differences in genome assembly qualities when comparing the genomes of diverse species. | |
| This way, they expose themselves to technical artifacts that might generate patterns mistaken to be of biological relevance whereas | |
| in reality they just reflect the difference in genome assembly quality. Considering the quality of genome assemblies when downloading | |
| the genomic sequences will help researchers to avoid these pitfalls. | |
| The summary statistics include: | |
| - `genome_size_mbp`: Genome size in mega base pairs | |
| - `n50_mbp`: The N50 contig size of the genome assembly in mega base pairs | |
| - `n_seqs`: The number of chromosomes/scaffolds/contigs of the genome assembly file | |
| - `n_nnn`: The absolute number of NNNs (over all chromosomes or scaffolds or contigs) in the genome assembly file | |
| - `rel_nnn`: The percentage (relative frequency) of NNNs (over all chromosomes or scaffolds or contigs) compared to the total number of nucleotides in the genome assembly file | |
| - `genome_entropy`: The Shannon Entropy of the genome assembly file (median entropy over all individual chromosome entropies) | |
| - `n_gc`: The total number of GCs (over all chromosomes or scaffolds or contigs) in the genome assembly file | |
| - `rel_gc`: The (relative frequency) of GCs (over all chromosomes or scaffolds or contigs) compared to the total number of nucleotides in the genome assembly file | |
| ### Example ENSEMBL: | |
| ```{r,eval=FALSE} | |
| # download the genome of Homo sapiens from ENSEMBL | |
| # and store the corresponding genome file in '_ncbi_downloads/genomes' | |
| HS.genome.ensembl <- getGenome( db = "ensembl", | |
| organism = "Homo sapiens", | |
| path = file.path("_ncbi_downloads","genomes") , | |
| assembly_type = "primary_assembly") | |
| ``` | |
| ```{r,eval=FALSE} | |
| # import downloaded genome as Biostrings object | |
| Human_Genome <- read_genome(file = HS.genome.ensembl) | |
| ``` | |
| ```{r,eval=FALSE} | |
| # look at the Biostrings object | |
| Human_Genome | |
| ``` | |
| ``` | |
| A DNAStringSet instance of length 524 | |
| width seq names | |
| [1] 248956422 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN 1 dna:chromosome ... | |
| [2] 242193529 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN 2 dna:chromosome ... | |
| [3] 198295559 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN 3 dna:chromosome ... | |
| [4] 190214555 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN 4 dna:chromosome ... | |
| [5] 181538259 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN 5 dna:chromosome ... | |
| ... ... ... | |
| [520] 993 GCCCCACGTCCGGGAGGGAGGTGG...GAGGGAGGTGGGGGGTCAGCCCT KI270539.1 dna:sc... | |
| [521] 990 TTTCATAGAGCATGTTTGAAACAC...TCAGAAACTTGTTGTGATGTGTG KI270385.1 dna:sc... | |
| [522] 981 AGATTTCGTTGGAACGGGATAAAC...TCAGCTTTCAAACACTCTTTTTG KI270423.1 dna:sc... | |
| [523] 971 ATTTGCGATGTGTGTTCTCAACTA...TTGGATAGCTTTGAAGTTTTCGT KI270392.1 dna:sc... | |
| [524] 970 AAGTGGATATTTGGATAGCTTTGA...TCCTCAATAACAGAGTTGAACCT KI270394.1 dna:sc... | |
| ``` | |
| If you are using `db = "ensembl"` you can set `assembly_type = "primary_assembly"`. For the human genome the toplevel genome assembly size is > 70 GB uncompressed, while primary assembly is just a few GB. | |
| Users can also specify the `release` argument which denotes the database release version of ENSEMBL (`db = "ensembl"`) in case they wish to download archived vgenome versions. Default is release = NULL meaning that the most recent database version is used. | |
| In addition, the genome summary statistics for the retrieved species is stored locally (`doc_homo_sapiens_db_ensembl_summary_statistics.tsv`) to provide users with insights regarding the genome assembly quality (see `?summary_genome()` for details). | |
| This file can be used as `Supplementary Information` file in publications to facilitate reproducible research. | |
| Most comparative genomics studies do not consider differences in genome assembly qualities when comparing the genomes of diverse species. | |
| This way, they expose themselves to technical artifacts that might generate patterns mistaken to be of biological relevance whereas | |
| in reality they just reflect the difference in genome assembly quality. Considering the quality of genome assemblies when downloading | |
| the genomic sequences will help researchers to avoid these pitfalls. | |
| The summary statistics include: | |
| - `genome_size_mbp`: Genome size in mega base pairs | |
| - `n50_mbp`: The N50 contig size of the genome assembly in mega base pairs | |
| - `n_seqs`: The number of chromosomes/scaffolds/contigs of the genome assembly file | |
| - `n_nnn`: The absolute number of NNNs (over all chromosomes or scaffolds or contigs) in the genome assembly file | |
| - `rel_nnn`: The percentage (relative frequency) of NNNs (over all chromosomes or scaffolds or contigs) compared to the total number of nucleotides in the genome assembly file | |
| - `genome_entropy`: The Shannon Entropy of the genome assembly file (median entropy over all individual chromosome entropies) | |
| - `n_gc`: The total number of GCs (over all chromosomes or scaffolds or contigs) in the genome assembly file | |
| - `rel_gc`: The (relative frequency) of GCs (over all chromosomes or scaffolds or contigs) compared to the total number of nucleotides in the genome assembly file | |
| #### Use `taxonomy` id for genome retrieval | |
| Alternatively, instead of specifying the scientific name in the argument `organism` | |
| users can specify the `taxonomy` id of the corresponding organism. Here, we specify | |
| the taxonomy id `4932` which encodes the species `Saccharomyces cerevisiae`. | |
| ```{r,eval=FALSE} | |
| # install.packages("magrittr") | |
| library(magrittr) | |
| # import genome as Biostrings object | |
| Scerevisiae_Genome <- getGenome( | |
| db = "ensembl", | |
| organism = "4932") %>% | |
| read_genome() | |
| ``` | |
| ```{r,eval=FALSE} | |
| Scerevisiae_Genome | |
| ``` | |
| ``` | |
| A DNAStringSet instance of length 17 | |
| width seq names | |
| [1] 230218 CCACACCACACCCA...TGGTGTGTGTGGG I dna:chromosome ... | |
| [2] 813184 AAATAGCCCTCATG...TGTGTGGGTGTGT II dna:chromosome... | |
| [3] 316620 CCCACACACCACAC...TGTGGTGTGTGTG III dna:chromosom... | |
| [4] 1531933 ACACCACACCCACA...AAGTAGCTTTTGG IV dna:chromosome... | |
| [5] 576874 CGTCTCCTCCAAGC...CATTTTTTTTTTT V dna:chromosome ... | |
| ... ... ... | |
| [13] 924431 CCACACACACACCA...TGTGTGTGTGGGG XIII dna:chromoso... | |
| [14] 784333 CCGGCTTTCTGACC...GTGTGGTGTGGGT XIV dna:chromosom... | |
| [15] 1091291 ACACCACACCCACA...GGGTGTGGTGTGT XV dna:chromosome... | |
| [16] 948066 AAATAGCCCTCATG...TTTCGGTCAGAAA XVI dna:chromosom... | |
| [17] 85779 TTCATAATTAATTT...TATAATATCCATA Mito dna:chromoso... | |
| ``` | |
| #### Use `assembly_accession` id for genome retrieval | |
| Alternatively, instead of specifying the scientific name or taxonomy in the argument `organism` | |
| users can specify the `assembly_accession` id of the corresponding organism. Here, we specify | |
| the `assembly_accession` id `GCA_000146045.2` which encodes the species `Saccharomyces cerevisiae`. | |
| ```{r,eval=FALSE} | |
| # install.packages("magrittr") | |
| library(magrittr) | |
| # import genome as Biostrings object | |
| Scerevisiae_Genome <- getGenome( | |
| db = "ensembl", | |
| organism = "GCA_000146045.2") %>% | |
| read_genome() | |
| ``` | |
| ```{r,eval=FALSE} | |
| Scerevisiae_Genome | |
| ``` | |
| ``` | |
| A DNAStringSet instance of length 17 | |
| width seq names | |
| [1] 230218 CCACACCACACCCACA...TGTGGTGTGTGTGGG I dna:chromosome ... | |
| [2] 813184 AAATAGCCCTCATGTA...GGTGTGTGGGTGTGT II dna:chromosome... | |
| [3] 316620 CCCACACACCACACCC...GGTGTGGTGTGTGTG III dna:chromosom... | |
| [4] 1531933 ACACCACACCCACACC...GTAAGTAGCTTTTGG IV dna:chromosome... | |
| [5] 439888 CACACACACCACACCC...GTGTGGTGTGTGTGT IX dna:chromosome... | |
| ... ... ... | |
| [13] 1078177 CACACACACACACCAC...ACATGAGGGCTATTT XII dna:chromosom... | |
| [14] 924431 CCACACACACACCACA...GGTGTGTGTGTGGGG XIII dna:chromoso... | |
| [15] 784333 CCGGCTTTCTGACCGA...GGGTGTGGTGTGGGT XIV dna:chromosom... | |
| [16] 1091291 ACACCACACCCACACC...GTGGGTGTGGTGTGT XV dna:chromosome... | |
| [17] 948066 AAATAGCCCTCATGTA...AATTTCGGTCAGAAA XVI dna:chromosom... | |
| ``` | |
| ### GenomeSet Retrieval | |
| The `getGenomeSet()` function enables users to retrieve genome files for | |
| multiple species. This is convenient when users wish to bulk download a particular | |
| set of species. Internally, a folder named `set_genomes` is generated in which | |
| genomes and genome info files are stored. In addition, users can specify the arguments: `clean_retrieval` and `gunzip` (both are `TRUE` by default) to clean file names and automatically unzip downloaded files. | |
| Example: | |
| ```r | |
| # specify the species names | |
| download_species <- c("Arabidopsis thaliana", | |
| "Arabidopsis lyrata", | |
| "Capsella rubella") | |
| # retrieve these three species from NCBI RefSeq | |
| biomartr::getGenomeSet("refseq", organisms = download_species, path = "set_genomes") | |
| ``` | |
| If the download process was interrupted, users can re-run the function | |
| and it will only download missing genomes. In cases users wish | |
| to download everything again and updating existing genomes, they may | |
| specify the argument `update = TRUE`. | |
| ### Proteome Retrieval | |
| The `getProteome()` function is an interface function to the [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/about/), [NCBI Genbank](http://www.ncbi.nlm.nih.gov/genbank/about/), | |
| [ENSEMBL](http://www.ensembl.org/index.html), and [UniProt](http://uniprot.org/) databases from | |
| which corresponding proteomes can be retrieved. It works analogous to `getGenome()`. | |
| The `db` argument specifies from which database proteomes in `*.fasta` file format shall be retrieved. | |
| Options are: | |
| - `db = "refseq"` for retrieval from [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/about/) | |
| - `db = "genbank"` for retrieval from [NCBI Genbank](http://www.ncbi.nlm.nih.gov/genbank/about/) | |
| - `db = "ensembl"` for retrieval from [ENSEMBL](http://www.ensembl.org/index.html) | |
| - `db = "uniprot" ` for retrieval from [UniProt](http://uniprot.org/) | |
| Furthermore, again users need to specify the scientific name of the organism of interest for which a proteomes shall be downloaded, e.g. `organism = "Homo sapiens"`. | |
| Finally, the `path` argument specifies the folder path in which the corresponding proteome shall be locally stored. In case users would like to store the proteome file at a different location, | |
| they can specify the `path = file.path("put","your","path","here")` argument. | |
| ### Example `NCBI RefSeq`: | |
| ```{r,eval=FALSE} | |
| # download the proteome of Homo sapiens from refseq | |
| # and store the corresponding proteome file in '_ncbi_downloads/proteomes' | |
| HS.proteome.refseq <- getProteome( db = "refseq", | |
| organism = "Homo sapiens", | |
| path = file.path("_ncbi_downloads","proteomes")) | |
| ``` | |
| In this example, `getProteome()` creates a directory named `'_ncbi_downloads/proteomes'` into which the corresponding genome named `GCF_000001405.34_GRCh38.p8_protein.faa.gz` is downloaded. The return value of `getProteome()` is the folder path to the downloaded proteome file | |
| that can then be used as input to the `read_proteome()` function. The variable `HS.proteome.refseq` stores the path to the downloaded proteome. | |
| Subsequently, users can use the `read_proteome()` function to import the proteome into the R session. Users can choose to work with the proteome sequence in R either as [Biostrings](http://bioconductor.org/packages/release/bioc/html/Biostrings.html) object (`obj.type = "Biostrings"`) or [data.table](https://github.com/Rdatatable/data.table/wiki) object | |
| (`obj.type = "data.table"`) by specifying the `obj.type` argument of the `read_proteome()` function. | |
| ```{r,eval=FALSE} | |
| # import proteome as Biostrings object | |
| Human_Proteome <- read_proteome(file = HS.proteome.refseq) | |
| ``` | |
| ```{r,eval=FALSE} | |
| Human_Proteome | |
| ``` | |
| ``` | |
| A AAStringSet instance of length 113620 | |
| width seq names | |
| [1] 1474 MGKNKLLHPSLVL...YNAPCSKDLGNA NP_000005.2 alpha... | |
| [2] 290 MDIEAYFERIGYK...LVPKPGDGSLTI NP_000006.2 aryla... | |
| [3] 421 MAAGFGRCCRVLR...IVAREHIDKYKN NP_000007.1 mediu... | |
| [4] 412 MAAALLARASGPA...VIAGHLLRSYRS NP_000008.1 short... | |
| [5] 655 MQAARMAASLGRQ...RGGVVTSNPLGF NP_000009.1 very ... | |
| ... ... ... | |
| [113616] 98 MPLIYMNIMLAFT...LDYVHNLNLLQC YP_003024034.1 NA... | |
| [113617] 459 MLKLIVPTIMLLP...SLNPDIITGFSS YP_003024035.1 NA... | |
| [113618] 603 MTMHTTMTTLTLT...FFPLILTLLLIT YP_003024036.1 NA... | |
| [113619] 174 MMYALFLLSVGLV...GVYIVIEIARGN YP_003024037.1 NA... | |
| [113620] 380 MTPMRKTNPLMKL...ISLIENKMLKWA YP_003024038.1 cy... | |
| ``` | |
| Alternatively, users can perform the pipeline logic of the [magrittr](https://github.com/smbache/magrittr) package: | |
| ```{r,eval=FALSE} | |
| # install.packages("magrittr") | |
| library(magrittr) | |
| # import proteome as Biostrings object | |
| Human_Proteome <- getProteome( db = "refseq", | |
| organism = "Homo sapiens", | |
| path = file.path("_ncbi_downloads","proteomes")) %>% | |
| read_proteome() | |
| ``` | |
| ```{r,eval=FALSE} | |
| Human_Proteome | |
| ``` | |
| ``` | |
| A AAStringSet instance of length 113620 | |
| width seq names | |
| [1] 1474 MGKNKLLHPSLVL...YNAPCSKDLGNA NP_000005.2 alpha... | |
| [2] 290 MDIEAYFERIGYK...LVPKPGDGSLTI NP_000006.2 aryla... | |
| [3] 421 MAAGFGRCCRVLR...IVAREHIDKYKN NP_000007.1 mediu... | |
| [4] 412 MAAALLARASGPA...VIAGHLLRSYRS NP_000008.1 short... | |
| [5] 655 MQAARMAASLGRQ...RGGVVTSNPLGF NP_000009.1 very ... | |
| ... ... ... | |
| [113616] 98 MPLIYMNIMLAFT...LDYVHNLNLLQC YP_003024034.1 NA... | |
| [113617] 459 MLKLIVPTIMLLP...SLNPDIITGFSS YP_003024035.1 NA... | |
| [113618] 603 MTMHTTMTTLTLT...FFPLILTLLLIT YP_003024036.1 NA... | |
| [113619] 174 MMYALFLLSVGLV...GVYIVIEIARGN YP_003024037.1 NA... | |
| [113620] 380 MTPMRKTNPLMKL...ISLIENKMLKWA YP_003024038.1 cy... | |
| ``` | |
| ### Example `NCBI Genbank`: | |
| ```{r,eval=FALSE} | |
| # download the proteome of Homo sapiens from genbank | |
| # and store the corresponding proteome file in '_ncbi_downloads/proteomes' | |
| HS.proteome.genbank <- getProteome( db = "genbank", | |
| organism = "Homo sapiens", | |
| path = file.path("_ncbi_downloads","proteomes")) | |
| ``` | |
| ```{r,eval=FALSE} | |
| # import proteome as Biostrings object | |
| Human_Proteome <- read_proteome(file = HS.proteome.genbank) | |
| ``` | |
| ```{r,eval=FALSE} | |
| Human_Proteome | |
| ``` | |
| ``` | |
| A AAStringSet instance of length 13 | |
| width seq names | |
| [1] 318 MPMANLLLLIVPILI...VSMPITISSIPPQT AAB58943.1 NADH d... | |
| [2] 347 MNPLAQPVIYSTIFA...TLLLPISPFMLMIL AAB58944.1 NADH d... | |
| [3] 513 MFADRWLFSTNHKDI...PPYHTFEEPVYMKS AAB58945.1 cytoch... | |
| [4] 227 MAHAAQVGLQDATSP...IPLKIFEMGPVFTL AAB58946.1 cytoch... | |
| [5] 68 MPQLNTTVWPTMITP...WTKICSLHSLPPQS AAB58947.1 ATPase... | |
| ... ... ... | |
| [9] 98 MPLIYMNIMLAFTIS...YGLDYVHNLNLLQC AAB58951.1 NADH d... | |
| [10] 459 MLKLIVPTIMLLPLT...LLSLNPDIITGFSS AAB58952.1 NADH d... | |
| [11] 603 MTMHTTMTTLTLTSL...SFFFPLILTLLLIT AAB58953.1 NADH d... | |
| [12] 174 MMYALFLLSVGLVMG...FVGVYIVIEIARGN AAB58954.1 NADH d... | |
| [13] 380 MTPMRKTNPLMKLIN...PTISLIENKMLKWA AAB58955.3 cytoch... | |
| ``` | |
| ### Example `ENSEMBL`: | |
| ```{r,eval=FALSE} | |
| # download the proteome of Homo sapiens from ENSEMBL | |
| # and store the corresponding proteome file in '_ncbi_downloads/proteomes' | |
| HS.proteome.ensembl <- getProteome( db = "ensembl", | |
| organism = "Homo sapiens", | |
| path = file.path("_ncbi_downloads","proteomes")) | |
| ``` | |
| ```{r,eval=FALSE} | |
| # import proteome as Biostrings object | |
| Human_Proteome <- read_proteome(file = HS.proteome.ensembl) | |
| ``` | |
| ```{r,eval=FALSE} | |
| Human_Proteome | |
| ``` | |
| ``` | |
| A AAStringSet instance of length 107844 | |
| width seq names | |
| [1] 3 PSY ENSP00000451515.1... | |
| [2] 4 TGGY ENSP00000452494.1... | |
| [3] 2 EI ENSP00000451042.1... | |
| [4] 4 GTGG ENSP00000487941.1... | |
| [5] 4 GTGG ENSP00000488240.1... | |
| ... ... ... | |
| [107840] 135 MLQKKIEEKDLKV...LNHICKVPLAIK ENSP00000495237.1... | |
| [107841] 166 MEHAFTPLEPLLS...IIQEESLIYLLQ ENSP00000496198.1... | |
| [107842] 42 MEVPTAYMISPKE...GLKSENTMLLRC ENSP00000495723.1... | |
| [107843] 508 MPSMLERISKNLV...GTLSLLQQLAEA ENSP00000496548.1... | |
| [107844] 508 MPSMLERISKNLV...GTLSLLQQLAEA ENSP00000494855.1... | |
| ``` | |
| Users can also specify the `release` argument which denotes the database release version of ENSEMBL (`db = "ensembl"`) in case they wish to download archived vgenome versions. Default is release = NULL meaning that the most recent database version is used. | |
| ### Example Retrieval `Uniprot`: | |
| Another way of retrieving proteome sequences is from the [UniProt](http://www.uniprot.org/) database. | |
| ```{r,eval=FALSE} | |
| # download the proteome of Mus musculus from UniProt | |
| # and store the corresponding proteome file in '_uniprot_downloads/proteomes' | |
| Mm.proteome.uniprot<- getProteome( db = "uniprot", | |
| organism = "Mus musculus", | |
| path = file.path("_uniprot_downloads","proteomes")) | |
| ``` | |
| ```{r,eval=FALSE} | |
| # import proteome as Biostrings object | |
| Mouse_Proteome <- read_proteome(file = Mm.proteome.uniprot) | |
| ``` | |
| ```{r,eval=FALSE} | |
| Mouse_Proteome | |
| ``` | |
| ``` | |
| A AAStringSet instance of length 84522 | |
| width seq names | |
| [1] 781 MATQADLMELDMAMEPD...LPPGDSNQLAWFDTDL sp|Q02248|CTNB1_M... | |
| [2] 2531 MPRLLTPLLCLTLLPAL...PTTMPSQITHIPEAFK sp|Q01705|NOTC1_M... | |
| [3] 437 MLLLLARCFLVILASSL...LDSETMHPLGMAVKSS sp|Q62226|SHH_MOU... | |
| [4] 380 MKKPIGILSPGVALGTA...VKCKKCTEIVDQFVCK sp|P22725|WNT5A_M... | |
| [5] 387 MEESQSDISLELPLSQE...SRHKKTMVKKVGPDSD sp|P02340|P53_MOU... | |
| ... ... ... | |
| [84518] 459 MLKIILPSLMLLPLTWL...LILLTTSPKLITGLTM tr|A0A0F6PZ84|A0A... | |
| [84519] 380 MTNMRKTHPLFKIINHS...LMPISGIIEDKMLKLY tr|U3TEV9|U3TEV9_... | |
| [84520] 381 MTNMRKTHPLFKIINHS...MPISGIIEDKMLKLYP tr|A0A0F6PXN8|A0A... | |
| [84521] 172 MNNYIFVLSSLFLVGCL...WSLFAGIFIIIEITRD tr|A0A0F6PXK9|A0A... | |
| [84522] 381 MTNMRKTHPLFKIINHS...MPISGIIEDKMLKLYP tr|A0A0F6PYR5|A0A... | |
| ``` | |
| ### ProteomeSet Retrieval | |
| The `getProteomeSet()` function enables users to retrieve proteome files for | |
| multiple species. This is convenient when users wish to bulk download a particular | |
| set of species. Internally, a folder named `set_proteomes` is generated in which | |
| proteomes and proteome info files are stored. In addition, users can specify the arguments: `clean_retrieval` and `gunzip` (both are `TRUE` by default) to clean file names and automatically unzip downloaded files. | |
| Example: | |
| ```r | |
| # specify the species names | |
| download_species <- c("Arabidopsis thaliana", | |
| "Arabidopsis lyrata", | |
| "Capsella rubella") | |
| # retrieve these three species from NCBI RefSeq | |
| biomartr::getProteomeSet("refseq", organisms = download_species, path = "set_proteomes") | |
| ``` | |
| If the download process was interrupted, users can re-run the function | |
| and it will only download missing genomes. In cases users wish | |
| to download everything again and updating existing genomes, they may | |
| specify the argument `update = TRUE`. | |
| ### CDS Retrieval | |
| The `getCDS()` function is an interface function to the [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/about/), [NCBI Genbank](http://www.ncbi.nlm.nih.gov/genbank/about/), | |
| [ENSEMBL](http://www.ensembl.org/index.html) databases from | |
| which corresponding CDS files can be retrieved. It works analogous to `getGenome()` and `getProteome()`. | |
| The `db` argument specifies from which database proteomes in `*.fasta` file format shall be retrieved. | |
| Options are: | |
| - `db = "refseq"` for retrieval from [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/about/) | |
| - `db = "genbank"` for retrieval from [NCBI Genbank](http://www.ncbi.nlm.nih.gov/genbank/about/) | |
| - `db = "ensembl"` for retrieval from [ENSEMBL](http://www.ensembl.org/index.html) | |
| Furthermore, again users need to specify the scientific name of the organism of interest for which a proteomes shall be downloaded, e.g. `organism = "Homo sapiens"`. | |
| Finally, the `path` argument specifies the folder path in which the corresponding CDS file shall be locally stored. In case users would like to store the CDS file at a different location, | |
| they can specify the `path = file.path("put","your","path","here")` argument. | |
| ### Example `NCBI RefSeq`: | |
| ```{r,eval=FALSE} | |
| # download the genome of Homo sapiens from refseq | |
| # and store the corresponding genome CDS file in '_ncbi_downloads/CDS' | |
| HS.cds.refseq <- getCDS( db = "refseq", | |
| organism = "Homo sapiens", | |
| path = file.path("_ncbi_downloads","CDS")) | |
| ``` | |
| In this example, `getCDS()` creates a directory named `'_ncbi_downloads/CDS'` into which the corresponding genome named `Homo_sapiens_cds_from_genomic_refseq.fna.gz` is downloaded. The return value of `getCDS()` is the folder path to the downloaded genome file | |
| that can then be used as input to the `read_cds()` function. The variable `HS.cds.refseq` stores the path to the downloaded CDS file. | |
| Subsequently, users can use the `read_cds()` function to import the genome into the R session. Users can choose to work with the genome sequence in R either as [Biostrings](http://bioconductor.org/packages/release/bioc/html/Biostrings.html) object (`obj.type = "Biostrings"`) or [data.table](https://github.com/Rdatatable/data.table/wiki) object | |
| (`obj.type = "data.table"`) by specifying the `obj.type` argument of the `read_cds()` function. | |
| ```{r,eval=FALSE} | |
| # import downloaded CDS as Biostrings object | |
| Human_CDS <- read_cds(file = HS.cds.refseq, | |
| obj.type = "Biostrings") | |
| ``` | |
| ```{r,eval=FALSE} | |
| # look at the Biostrings object | |
| Human_CDS | |
| ``` | |
| ``` | |
| A BStringSet instance of length 114967 | |
| width seq names | |
| [1] 918 ATGGTGACTGAATTCATTTTTCTG...CACATTCTAGTGTAAAGTTTTAG lcl|NC_000001.11_... | |
| [2] 402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_... | |
| [3] 402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_... | |
| [4] 402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_... | |
| [5] 402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_... | |
| ... ... ... | |
| [114963] 297 ATGCCCCTCATTTACATAAATATT...ACCTAAACCTACTCCAATGCTAA lcl|NC_012920.1_c... | |
| [114964] 1378 ATGCTAAAACTAATCGTCCCAACA...CATCATTACCGGGTTTTCCTCTT lcl|NC_012920.1_c... | |
| [114965] 1812 ATAACCATGCACACTACTATAACC...TAACCCTACTCCTAATCACATAA lcl|NC_012920.1_c... | |
| [114966] 525 ATGATGTATGCTTTGTTTCTGTTG...TTGAGATTGCTCGGGGGAATAGG lcl|NC_012920.1_c... | |
| [114967] 1141 ATGACCCCAATACGCAAAACTAAC...AAACAAAATACTCAAATGGGCCT lcl|NC_012920.1_c... | |
| ``` | |
| Internally, a text file named `doc_Homo_sapiens_db_refseq.txt` is generated. The information stored in this log file is structured as follows: | |
| ``` | |
| File Name: Homo_sapiens_cds_from_genomic_refseq.fna.gz | |
| Organism Name: Homo_sapiens | |
| Database: NCBI refseq | |
| URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/ | |
| GCF_000001405.35_GRCh38.p9/GCF_000001405.35_GRCh38.p9_cds_from_genomic.fna.gz | |
| Download_Date: Sun Oct 23 17:19:05 2016 | |
| refseq_category: reference genome | |
| assembly_accession: GCF_000001405.35 | |
| bioproject: PRJNA168 | |
| biosample: NA | |
| taxid: 9606 | |
| infraspecific_name: NA | |
| version_status: latest | |
| release_type: Patch | |
| genome_rep: Full | |
| seq_rel_date: 2016-09-26 | |
| submitter: Genome Reference Consortium | |
| ``` | |
| In summary, the `getCDS()` and `read_cds()` functions allow users to retrieve CDS files by specifying | |
| the scientific name of the organism of interest and allow them to import the retrieved CDS file e.g. as `Biostrings` object. | |
| Thus, users can then perform the `Biostrings notation` to work with downloaded CDS and can rely on the log | |
| file generated by `getCDS()` to better document the source and version of CDS used for subsequent studies. | |
| Alternatively, users can perform the pipeline logic of the [magrittr](https://github.com/smbache/magrittr) package: | |
| ```{r,eval=FALSE} | |
| # install.packages("magrittr") | |
| library(magrittr) | |
| # import CDS as Biostrings object | |
| Human_CDS <- getCDS( db = "refseq", | |
| organism = "Homo sapiens", | |
| path = file.path("_ncbi_downloads","CDS")) %>% | |
| read_cds(obj.type = "Biostrings") | |
| ``` | |
| ```{r,eval=FALSE} | |
| Human_CDS | |
| ``` | |
| ``` | |
| A BStringSet instance of length 114967 | |
| width seq names | |
| [1] 918 ATGGTGACTGAATTCATTTTTCTG...CACATTCTAGTGTAAAGTTTTAG lcl|NC_000001.11_... | |
| [2] 402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_... | |
| [3] 402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_... | |
| [4] 402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_... | |
| [5] 402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_... | |
| ... ... ... | |
| [114963] 297 ATGCCCCTCATTTACATAAATATT...ACCTAAACCTACTCCAATGCTAA lcl|NC_012920.1_c... | |
| [114964] 1378 ATGCTAAAACTAATCGTCCCAACA...CATCATTACCGGGTTTTCCTCTT lcl|NC_012920.1_c... | |
| [114965] 1812 ATAACCATGCACACTACTATAACC...TAACCCTACTCCTAATCACATAA lcl|NC_012920.1_c... | |
| [114966] 525 ATGATGTATGCTTTGTTTCTGTTG...TTGAGATTGCTCGGGGGAATAGG lcl|NC_012920.1_c... | |
| [114967] 1141 ATGACCCCAATACGCAAAACTAAC...AAACAAAATACTCAAATGGGCCT lcl|NC_012920.1_c... | |
| ``` | |
| ### Example `NCBI Genbank`: | |
| ```{r,eval=FALSE} | |
| # download the genome of Homo sapiens from genbank | |
| # and store the corresponding genome CDS file in '_ncbi_downloads/CDS' | |
| HS.cds.genbank <- getCDS( db = "genbank", | |
| organism = "Homo sapiens", | |
| path = file.path("_ncbi_downloads","CDS")) | |
| ``` | |
| ```{r,eval=FALSE} | |
| # import downloaded CDS as Biostrings object | |
| Human_CDS <- read_cds(file = HS.cds.genbank, | |
| obj.type = "Biostrings") | |
| ``` | |
| ```{r,eval=FALSE} | |
| # look at the Biostrings object | |
| Human_CDS | |
| ``` | |
| ``` | |
| A BStringSet instance of length 13 | |
| width seq names | |
| [1] 956 ATACCCATGGCCAACCTCCTACTCCT...ATCTCCAGCATTCCCCCTCAAACCTA lcl|J01415.2_cds_... | |
| [2] 1042 ATTAATCCCCTGGCCCAACCCGTCAT...CTCCCCTTTTATACTAATAATCTTAT lcl|J01415.2_cds_... | |
| [3] 1542 ATGTTCGCCGACCGTTGACTATTCTC...AAGAACCCGTATACATAAAATCTAGA lcl|J01415.2_cds_... | |
| [4] 684 ATGGCACATGCAGCGCAAGTAGGTCT...AAATAGGGCCCGTATTTACCCTATAG lcl|J01415.2_cds_... | |
| [5] 207 ATGCCCCAACTAAATACTACCGTATG...TTCATTCATTGCCCCCACAATCCTAG lcl|J01415.2_cds_... | |
| ... ... ... | |
| [9] 297 ATGCCCCTCATTTACATAAATATTAT...ATAACCTAAACCTACTCCAATGCTAA lcl|J01415.2_cds_... | |
| [10] 1378 ATGCTAAAACTAATCGTCCCAACAAT...CGACATCATTACCGGGTTTTCCTCTT lcl|J01415.2_cds_... | |
| [11] 1812 ATAACCATGCACACTACTATAACCAC...TCCTAACCCTACTCCTAATCACATAA lcl|J01415.2_cds_... | |
| [12] 525 ATGATGTATGCTTTGTTTCTGTTGAG...TAATTGAGATTGCTCGGGGGAATAGG lcl|J01415.2_cds_... | |
| [13] 1141 ATGACCCCAATACGCAAAACTAACCC...TGAAAACAAAATACTCAAATGGGCCT lcl|J01415.2_cds_... | |
| ``` | |
| ### Example `ENSEMBL`: | |
| ```{r,eval=FALSE} | |
| # download the genome of Homo sapiens from ensembl | |
| # and store the corresponding genome CDS file in '_ncbi_downloads/CDS' | |
| HS.cds.ensembl <- getCDS( db = "ensembl", | |
| organism = "Homo sapiens", | |
| path = file.path("_ncbi_downloads","CDS")) | |
| ``` | |
| ```{r,eval=FALSE} | |
| # import downloaded CDS as Biostrings object | |
| Human_CDS <- read_cds(file = HS.cds.ensembl, | |
| obj.type = "Biostrings") | |
| ``` | |
| ```{r,eval=FALSE} | |
| # look at the Biostrings object | |
| Human_CDS | |
| ``` | |
| ``` | |
| A BStringSet instance of length 102915 | |
| width seq names | |
| [1] 13 ACTGGGGGATACG ENST00000448914.1... | |
| [2] 12 GGGACAGGGGGC ENST00000631435.1... | |
| [3] 12 GGGACAGGGGGC ENST00000632684.1... | |
| [4] 9 CCTTCCTAC ENST00000434970.2... | |
| [5] 8 GAAATAGT ENST00000415118.1... | |
| ... ... ... | |
| [102911] 1665 ATGCTACTGCCACTGCTGCTGTCC...ATGCAGAAGTCAAGTTCCAATGA ENST00000436984.6... | |
| [102912] 1920 ATGCTACTGCCACTGCTGCTGTCC...ATGCAGAAGTCAAGTTCCAATGA ENST00000439889.6... | |
| [102913] 2094 ATGCTACTGCCACTGCTGCTGTCC...ATGCAGAAGTCAAGTTCCAATGA ENST00000339313.9... | |
| [102914] 466 ATGCTACTGCCACTGCTGCTGTCC...AGCCCTGGACCTCTCTGTGCAGT ENST00000529627.1... | |
| [102915] 559 ATGCGGAGATGCTACTGCCACTGC...CCTCACCTGCCATGTGGACTTCT ENST00000530476.1... | |
| ``` | |
| Users can also specify the `release` argument which denotes the database release version of ENSEMBL (`db = "ensembl"`) in case they wish to download archived vgenome versions. Default is release = NULL meaning that the most recent database version is used. | |
| ### CDSSet Retrieval | |
| The `getCDSSet()` function enables users to retrieve CDS files for | |
| multiple species. This is convenient when users wish to bulk download a particular | |
| set of species. Internally, a folder named `set_cds` is generated in which | |
| CDS and CDS info files are stored. In addition, users can specify the arguments: `clean_retrieval` and `gunzip` (both are `TRUE` by default) to clean file names and automatically unzip downloaded files. | |
| Example: | |
| ```r | |
| # specify the species names | |
| download_species <- c("Arabidopsis thaliana", | |
| "Arabidopsis lyrata", | |
| "Capsella rubella") | |
| # retrieve these three species from NCBI RefSeq | |
| biomartr::getCDSSet("refseq", organisms = download_species, path = "set_cds") | |
| ``` | |
| If the download process was interrupted, users can re-run the function | |
| and it will only download missing genomes. In cases users wish | |
| to download everything again and updating existing genomes, they may | |
| specify the argument `update = TRUE`. | |
| ### RNA Retrieval | |
| The `getRNA()` function is an interface function to the [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/about/), [NCBI Genbank](http://www.ncbi.nlm.nih.gov/genbank/about/), | |
| [ENSEMBL](http://www.ensembl.org/index.html) databases from | |
| which corresponding RNA files can be retrieved. It works analogous to `getGenome()`, `getProteome()`, and `getCDS()`. | |
| The `db` argument specifies from which database proteomes in `*.fasta` file format shall be retrieved. | |
| Options are: | |
| - `db = "refseq"` for retrieval from [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/about/) | |
| - `db = "genbank"` for retrieval from [NCBI Genbank](http://www.ncbi.nlm.nih.gov/genbank/about/) | |
| - `db = "ensembl"` for retrieval from [ENSEMBL](http://www.ensembl.org/index.html) | |
| Furthermore, again users need to specify the scientific name of the organism of interest for which a proteomes shall be downloaded, e.g. `organism = "Homo sapiens"`. | |
| Finally, the `path` argument specifies the folder path in which the corresponding RNA file shall be locally stored. In case users would like to store the RNA file at a different location, | |
| they can specify the `path = file.path("put","your","path","here")` argument. | |
| ### Example `NCBI RefSeq`: | |
| ```{r,eval=FALSE} | |
| # download the RNA of Homo sapiens from refseq | |
| # and store the corresponding RNA file in '_ncbi_downloads/RNA' | |
| HS.rna.refseq <- getRNA( db = "refseq", | |
| organism = "Homo sapiens", | |
| path = file.path("_ncbi_downloads","RNA")) | |
| ``` | |
| In this example, `getRNA()` creates a directory named `'_ncbi_downloads/RNA'` into which the corresponding RNA file named `Homo_sapiens_rna_from_genomic_refseq.fna.gz` is downloaded. The return value of `getRNA()` is the folder path to the downloaded genome file | |
| that can then be used as input to the `read_rna()` function. The variable `HS.rna.refseq` stores the path to the downloaded RNA file. | |
| Subsequently, users can use the `read_cds()` function to import the genome into the R session. Users can choose to work with the genome sequence in R either as [Biostrings](http://bioconductor.org/packages/release/bioc/html/Biostrings.html) object (`obj.type = "Biostrings"`) or [data.table](https://github.com/Rdatatable/data.table/wiki) object | |
| (`obj.type = "data.table"`) by specifying the `obj.type` argument of the `read_rna()` function. | |
| ```{r,eval=FALSE} | |
| # import downloaded RNA as Biostrings object | |
| Human_rna <- read_rna(file = HS.rna.refseq, | |
| obj.type = "Biostrings") | |
| ``` | |
| ```{r,eval=FALSE} | |
| # look at the Biostrings object | |
| Human_rna | |
| ``` | |
| ``` | |
| A BStringSet instance of length 164136 | |
| width seq names | |
| [1] 1652 CTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGT...CACAGCTAGAGATCCTTTATTAAAAGCACACTGTTGGTTTCTG lcl|NC_000001.11_... | |
| [2] 1769 TCCGGCAGAGCGGAAGCGGCGGCGGGAGCTTCCGGGAGGGCGG...ACCAACAGTGTGCTTTTAATAAAGGATCTCTAGCTGTGCAGGA lcl|NC_000001.11_... | |
| [3] 68 TGTGGGAGAGGAACATGGGCTCAGGACAGCGGGTGTCAGCTTGCCTGACCCCCATGTCGCCTCTGTAG lcl|NC_000001.11_... | |
| [4] 23 TGACCCCCATGTCGCCTCTGTAG lcl|NC_000001.11_... | |
| [5] 23 GAGAGGAACATGGGCTCAGGACA lcl|NC_000001.11_... | |
| ... ... ... | |
| [164132] 59 GAGAAAGCTCACAAGAACTGCTAACTCATGCCCCCATGTCTAACAACATGGCTTTCTCA lcl|NC_012920.1_t... | |
| [164133] 71 ACTTTTAAAGGATAACAGCTATCCATTGGTCTTAGGCCCCAAAAATTTTGGTGCAACTCCAAATAAAAGTA lcl|NC_012920.1_t... | |
| [164134] 69 GTTCTTGTAGTTGAAATACAACGATGGTTTTTCATATCATTGGTCGTGGTTGTAGTCCGTGCGAGAATA lcl|NC_012920.1_t... | |
| [164135] 66 GTCCTTGTAGTATAAACTAATACACCAGTCTTGTAAACCGGAGATGAAAACCTTTTTCCAAGGACA lcl|NC_012920.1_t... | |
| [164136] 68 CAGAGAATAGTTTAAATTAGAATCTTAGCTTTGGGTGCTAATGGTGGAGTTAAAGACTTTTTCTCTGA lcl|NC_012920.1_t... | |
| ``` | |
| Internally, a text file named `doc_Homo_sapiens_db_refseq.txt` is generated. The information stored in this log file is structured as follows: | |
| ``` | |
| File Name: Homo_sapiens_rna_from_genomic_refseq.fna.gz | |
| Organism Name: Homo_sapiens | |
| Database: NCBI refseq | |
| URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.36_GRCh38.p10/GCF_000001405.36_GRCh38.p10_rna_from_genomic.fna.gz | |
| Download_Date: Wed Mar 15 16:46:45 2017 | |
| refseq_category: reference genome | |
| assembly_accession: GCF_000001405.36 | |
| bioproject: PRJNA168 | |
| biosample: NA | |
| taxid: 9606 | |
| infraspecific_name: NA | |
| version_status: latest | |
| release_type: Patch | |
| genome_rep: Full | |
| seq_rel_date: 2017-01-06 | |
| submitter: Genome Reference Consortium | |
| ``` | |
| In summary, the `getRNA()` and `read_rna()` functions allow users to retrieve RNA files by specifying | |
| the scientific name of the organism of interest and allow them to import the retrieved RNA file e.g. as `Biostrings` object. | |
| Thus, users can then perform the `Biostrings notation` to work with downloaded RNA and can rely on the log | |
| file generated by `getRNA()` to better document the source and version of RNA used for subsequent studies. | |
| Alternatively, users can perform the pipeline logic of the [magrittr](https://github.com/smbache/magrittr) package: | |
| ```{r,eval=FALSE} | |
| # install.packages("magrittr") | |
| library(magrittr) | |
| # import RNA as Biostrings object | |
| Human_rna <- getRNA( db = "refseq", | |
| organism = "Homo sapiens", | |
| path = file.path("_ncbi_downloads","RNA")) %>% | |
| read_cds(obj.type = "Biostrings") | |
| ``` | |
| ```{r,eval=FALSE} | |
| Human_rna | |
| ``` | |
| ``` | |
| A BStringSet instance of length 164136 | |
| width seq names | |
| [1] 1652 CTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGT...CACAGCTAGAGATCCTTTATTAAAAGCACACTGTTGGTTTCTG lcl|NC_000001.11_... | |
| [2] 1769 TCCGGCAGAGCGGAAGCGGCGGCGGGAGCTTCCGGGAGGGCGG...ACCAACAGTGTGCTTTTAATAAAGGATCTCTAGCTGTGCAGGA lcl|NC_000001.11_... | |
| [3] 68 TGTGGGAGAGGAACATGGGCTCAGGACAGCGGGTGTCAGCTTGCCTGACCCCCATGTCGCCTCTGTAG lcl|NC_000001.11_... | |
| [4] 23 TGACCCCCATGTCGCCTCTGTAG lcl|NC_000001.11_... | |
| [5] 23 GAGAGGAACATGGGCTCAGGACA lcl|NC_000001.11_... | |
| ... ... ... | |
| [164132] 59 GAGAAAGCTCACAAGAACTGCTAACTCATGCCCCCATGTCTAACAACATGGCTTTCTCA lcl|NC_012920.1_t... | |
| [164133] 71 ACTTTTAAAGGATAACAGCTATCCATTGGTCTTAGGCCCCAAAAATTTTGGTGCAACTCCAAATAAAAGTA lcl|NC_012920.1_t... | |
| [164134] 69 GTTCTTGTAGTTGAAATACAACGATGGTTTTTCATATCATTGGTCGTGGTTGTAGTCCGTGCGAGAATA lcl|NC_012920.1_t... | |
| [164135] 66 GTCCTTGTAGTATAAACTAATACACCAGTCTTGTAAACCGGAGATGAAAACCTTTTTCCAAGGACA lcl|NC_012920.1_t... | |
| [164136] 68 CAGAGAATAGTTTAAATTAGAATCTTAGCTTTGGGTGCTAATGGTGGAGTTAAAGACTTTTTCTCTGA lcl|NC_012920.1_t... | |
| ``` | |
| ### Example `NCBI Genbank`: | |
| ```{r,eval=FALSE} | |
| # download the RNA of Homo sapiens from genbank | |
| # and store the corresponding genome RNA file in '_ncbi_downloads/RNA' | |
| HS.rna.genbank <- getRNA( db = "genbank", | |
| organism = "Homo sapiens", | |
| path = file.path("_ncbi_downloads","RNA")) | |
| ``` | |
| ```{r,eval=FALSE} | |
| # import downloaded RNA as Biostrings object | |
| Human_rna <- read_cds(file = HS.rna.genbank, | |
| obj.type = "Biostrings") | |
| ``` | |
| ```{r,eval=FALSE} | |
| # look at the Biostrings object | |
| Human_rna | |
| ``` | |
| ``` | |
| A BStringSet instance of length 24 | |
| width seq names | |
| [1] 71 GTTTATGTAGCTTACCTCCTCAAAGCAATACACTGAAAATGTTTAGACGGGCTCACATCACCCCATAAACA lcl|J01415.2_trna... | |
| [2] 954 AATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTACA...AAGTCGTAACATGGTAAGTGTACTGGAAAGTGCACTTGGACGAAC lcl|J01415.2_rrna... | |
| [3] 69 CAGAGTGTAGCTTAACACAAAGCACCCAACTTACACTTAGGAGATTTCAACTTAACTTGACCGCTCTGA lcl|J01415.2_trna... | |
| [4] 1559 GCTAAACCTAGCCCCAAACCCACTCCACCTTACTACCAGACAACCT...ATCTCAACTTAGTATTATACCCACACCCACCCAAGAACAGGGTTT lcl|J01415.2_rrna... | |
| [5] 75 GTTAAGATGGCAGAGCCCGGTAATCGCATAAAACTTAAAACTTTACAGTCAGAGGTTCAATTCCTCTTCTTAACA lcl|J01415.2_trna... | |
| ... ... ... | |
| [20] 59 GAGAAAGCTCACAAGAACTGCTAACTCATGCCCCCATGTCTAACAACATGGCTTTCTCA lcl|J01415.2_trna... | |
| [21] 71 ACTTTTAAAGGATAACAGCTATCCATTGGTCTTAGGCCCCAAAAATTTTGGTGCAACTCCAAATAAAAGTA lcl|J01415.2_trna... | |
| [22] 69 GTTCTTGTAGTTGAAATACAACGATGGTTTTTCATATCATTGGTCGTGGTTGTAGTCCGTGCGAGAATA lcl|J01415.2_trna... | |
| [23] 66 GTCCTTGTAGTATAAACTAATACACCAGTCTTGTAAACCGGAGATGAAAACCTTTTTCCAAGGACA lcl|J01415.2_trna... | |
| [24] 68 CAGAGAATAGTTTAAATTAGAATCTTAGCTTTGGGTGCTAATGGTGGAGTTAAAGACTTTTTCTCTGA lcl|J01415.2_trna... | |
| ``` | |
| ### Example `ENSEMBL`: | |
| ```{r,eval=FALSE} | |
| # download the RNA of Homo sapiens from ensembl | |
| # and store the corresponding genome RNA file in '_ncbi_downloads/RNA' | |
| HS.rna.ensembl <- getRNA( db = "ensembl", | |
| organism = "Homo sapiens", | |
| path = file.path("_ncbi_downloads","RNA")) | |
| ``` | |
| ```{r,eval=FALSE} | |
| # import downloaded RNA as Biostrings object | |
| Human_rna <- read_cds(file = HS.rna.ensembl, | |
| obj.type = "Biostrings") | |
| ``` | |
| ```{r,eval=FALSE} | |
| # look at the Biostrings object | |
| Human_rna | |
| ``` | |
| ``` | |
| A BStringSet instance of length 36701 | |
| width seq names | |
| [1] 104 GTGCTCACTTTGGCAACATACATACTAAAATTGGACGGATACAG...GCACAAGGATGACATGCAAATTCATGAAGCATTCCATATTTTT ENST00000516494.2... | |
| [2] 164 TTAACTACCTGACAGAGGAGATACTGTGATCATGAAAGTGGTTT...CATAATTTCTGGTGGTAGGGGACTGCGTTCATGTTCTCCCCTA ENST00000627793.1... | |
| [3] 114 ACACTGGTTTCTCTTCAGATCGAATAAATCTTTCGCCTTTTACT...AGTTATAAGCTAATTTTTTGTAAGCCTTGCCCTGGGGAGGCAG ENST00000629478.1... | |
| [4] 64 CAGTGTTACACCTCTTTTAGAATTTATCTATCAGGTTTTCCAGTGTTCACTGAAATTTGTCTCT ENST00000629187.1... | |
| [5] 107 GTGTTGGCCTGGGCAGCACGTATACTAAAGTTGGAATGACACAG...GCGCAAGGATGATGTGCAAATTCGTGACAAGTTCCATATTTTT ENST00000631612.1... | |
| ... ... ... | |
| [36697] 90 GGCTAGTCCAAATGTAGTGGTGTTCCAAACTAATTAATCACAACCAGTTACAGATTTCTTGTTTTCTTTTCCACTCACACTTAGCCTTAA ENST00000410951.1... | |
| [36698] 109 GGCTGATCTGAAGATGATGAGTTATCTCAATTGATTGTTCAGCC...TCTATTCTTTCCTCTCTTCTCACTACTGCACTTGGCTAGGAAA ENST00000410462.2... | |
| [36699] 109 GGTGGGTCCAAAGGTAGCAAGTTATCTCAATTGATCACAGTCAG...CATTCTATCACCCCTTCTCATTACTGTACTTGACTAGTCTTTT ENST00000364501.1... | |
| [36700] 293 GGATATGAGGGCGATCTGGCAGGGACATCTGTCACCCCACTGAT...AAAATTAGCTGGGCATAGTGGCGTGCACCTGTCGTCCTAGCTA ENST00000365097.1... | |
| [36701] 172 TTGAAGGCGTGGAGACTGAAGTCCTCTCTATATCCACAGAACAG...AAGAGGGCTGTTCAGTCTCCATGCCCTTCAATCCTTGGCTACT ENST00000615087.1... | |
| ``` | |
| Users can also specify the `release` argument which denotes the database release version of ENSEMBL (`db = "ensembl"`) in case they wish to download archived vgenome versions. Default is release = NULL meaning that the most recent database version is used. | |
| ### RNASet Retrieval | |
| The `getRNASet()` function enables users to retrieve RNA files for | |
| multiple species. This is convenient when users wish to bulk download a particular | |
| set of species. Internally, a folder named `set_rna` is generated in which | |
| RNA and RNA info files are stored. In addition, users can specify the arguments: `clean_retrieval` and `gunzip` (both are `TRUE` by default) to clean file names and automatically unzip downloaded files. | |
| Example: | |
| ```r | |
| # specify the species names | |
| download_species <- c("Arabidopsis thaliana", | |
| "Arabidopsis lyrata", | |
| "Capsella rubella") | |
| # retrieve these three species from NCBI RefSeq | |
| biomartr::getRNASet("refseq", organisms = download_species, path = "set_rna") | |
| ``` | |
| If the download process was interrupted, users can re-run the function | |
| and it will only download missing genomes. In cases users wish | |
| to download everything again and updating existing genomes, they may | |
| specify the argument `update = TRUE`. | |
| ### Retrieve the annotation file of a particular genome | |
| Finally, users can download the corresponding annotation `.gff` files for particular genomes of interest using the `getGFF()` or alternatively for `ensembl` the `getGTF()` function. | |
| ### Example `NCBI RefSeq`: | |
| ```{r,eval=FALSE} | |
| # download the GFF file of Homo sapiens from refseq | |
| # and store the corresponding file in '_ncbi_downloads/annotation' | |
| HS.gff.refseq <- getGFF( db = "refseq", | |
| organism = "Homo sapiens", | |
| path = file.path("_ncbi_downloads","annotation")) | |
| ``` | |
| After downloading the `.gff` file, users can import the `.gff` file with `read_gff()`. | |
| ```{r,eval=FALSE} | |
| # import downloaded GFF file | |
| Human_GFF <- read_gff(file = HS.gff.refseq) | |
| ``` | |
| ```{r,eval=FALSE} | |
| Human_GFF | |
| ``` | |
| ``` | |
| seqid source type start end score strand phase | |
| <chr> <chr> <chr> <int> <int> <dbl> <chr> <dbl> | |
| 1 NC_000001.11 RefSeq region 1 248956422 0 + 0 | |
| 2 NC_000001.11 BestRefSeq gene 11874 14409 0 + 0 | |
| 3 NC_000001.11 BestRefSeq transcript 11874 14409 0 + 0 | |
| 4 NC_000001.11 BestRefSeq exon 11874 12227 0 + 0 | |
| 5 NC_000001.11 BestRefSeq exon 12613 12721 0 + 0 | |
| 6 NC_000001.11 BestRefSeq exon 13221 14409 0 + 0 | |
| 7 NC_000001.11 BestRefSeq gene 14362 29370 0 - 0 | |
| 8 NC_000001.11 BestRefSeq transcript 14362 29370 0 - 0 | |
| 9 NC_000001.11 BestRefSeq exon 29321 29370 0 - 0 | |
| 10 NC_000001.11 BestRefSeq exon 24738 24891 0 - 0 | |
| ``` | |
| #### Removing corrupt lines from downloaded GFF files | |
| In some cases, `GFF` files stored at NCBI databases include corrupt lines that have more than 65000 characters. This [leads to problems](https://github.com/lawremi/rtracklayer/issues/15) | |
| when trying to import such annotation files into R. To overcome this issue users can specify the `remove_annotation_outliers = TRUE` | |
| argument to remove such outlier lines and overwrite the downloaded annotation file. This will make any downstream analysis with this annotation file much more reliable. | |
| Example: | |
| ```r | |
| Ath_path <- biomartr::getGFF(organism = "Arabidopsis thaliana", remove_annotation_outliers = TRUE) | |
| ``` | |
| ``` | |
| Starting GFF retrieval of 'Arabidopsis thaliana' from refseq ... | |
| |============================================| 100% 52 MB | |
| File _ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_refseq.gff.gz exists already. Thus, download has been skipped. | |
| Importing '_ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_refseq.gff.gz' ... | |
| |============================================| 100% 434 MB | |
| Reading annotation file '_ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_refseq.gff.gz' and removing all outlier lines with number of characters greater 65000 ... | |
| Overwriting '_ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_refseq.gff.gz' with removed outlier lines ... | |
| Unzipping file _ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_refseq.gff.gz' ... | |
| The new annotation file was created and has been stored at '_ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_refseq.gff'. | |
| The outlier lines were stored in a separate file to explore at '/var/folders/3x/6bbw6ds1039gpwny1m0hn8r80000gp/T//RtmpuVjnsC/Arabidopsis_thaliana_genomic_refseq.gff.gz_outlier_lines.txt'. | |
| ``` | |
| ### Example `NCBI Genbank`: | |
| ```{r,eval=FALSE} | |
| # download the GFF file of Homo sapiens from genbank | |
| # and store the corresponding file in '_ncbi_downloads/annotation' | |
| HS.gff.genbank <- getGFF( db = "genbank", | |
| organism = "Homo sapiens", | |
| path = file.path("_ncbi_downloads","annotation")) | |
| ``` | |
| After downloading the `.gff` file, users can import the `.gff` file with `read_gff()`. | |
| ```{r,eval=FALSE} | |
| # import downloaded GFF file | |
| Human_GFF <- read_gff(file = HS.gff.genbank) | |
| ``` | |
| ```{r,eval=FALSE} | |
| # show all elements of the data.frame | |
| # options(tibble.print_max = Inf) | |
| Human_GFF | |
| ``` | |
| ``` | |
| seqid source type start end score strand phase | |
| <chr> <chr> <chr> <int> <int> <dbl> <chr> <dbl> | |
| 1 CM000663.2 Genbank region 1 248956422 0 + 0 | |
| 2 CM000663.2 Genbank centromere 122026460 125184587 0 + 0 | |
| 3 KI270706.1 Genbank region 1 175055 0 + 0 | |
| 4 KI270707.1 Genbank region 1 32032 0 + 0 | |
| 5 KI270708.1 Genbank region 1 127682 0 + 0 | |
| 6 KI270709.1 Genbank region 1 66860 0 + 0 | |
| 7 KI270710.1 Genbank region 1 40176 0 + 0 | |
| 8 KI270711.1 Genbank region 1 42210 0 + 0 | |
| 9 KI270712.1 Genbank region 1 176043 0 + 0 | |
| 10 KI270713.1 Genbank region 1 40745 0 + 0 | |
| ``` | |
| #### Removing corrupt lines from downloaded GFF files | |
| In some cases, `GFF` files stored at NCBI databases include corrupt lines that have more than 65000 characters. This [leads to problems](https://github.com/lawremi/rtracklayer/issues/15) | |
| when trying to import such annotation files into R. To overcome this issue users can specify the `remove_annotation_outliers = TRUE` | |
| argument to remove such outlier lines and overwrite the downloaded annotation file. This will make any downstream analysis with this annotation file much more reliable. | |
| Example: | |
| ```r | |
| Ath_path <- biomartr::getGFF(db = "genbank", | |
| organism = "Arabidopsis thaliana", | |
| remove_annotation_outliers = TRUE) | |
| ``` | |
| ``` | |
| Starting GFF retrieval of 'Arabidopsis thaliana' from genbank ... | |
| Completed! | |
| Now continue with species download ... | |
| GFF download of Arabidopsis_thaliana is completed! | |
| Checking md5 hash of file: _ncbi_downloads/annotation/Arabidopsis_thaliana_md5checksums.txt ... | |
| The md5 hash of file '_ncbi_downloads/annotation/Arabidopsis_thaliana_md5checksums.txt' matches! | |
| Importing '_ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_genbank.gff.gz' ... | |
| Reading annotation file '_ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_genbank.gff.gz' and removing all outlier lines with number of characters greater 65000 ... | |
| Overwriting '_ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_genbank.gff.gz' with removed outlier lines ... | |
| Unzipping file _ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_genbank.gff.gz' ... | |
| The new annotation file was created and has been stored at '_ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_genbank.gff'. | |
| The outlier lines were stored in a separate file to explore at '/var/folders/3x/6bbw6ds1039gpwny1m0hn8r80000gp/T//RtmpuVjnsC/Arabidopsis_thaliana_genomic_genbank.gff.gz_outlier_lines.txt'. | |
| ``` | |
| ### Example `ENSEMBL`: | |
| ```{r,eval=FALSE} | |
| # download the GFF file of Homo sapiens from ENSEMBL | |
| # and store the corresponding file in 'ensembl/annotation' | |
| HS.gff.ensembl <- getGFF( db = "ensembl", | |
| organism = "Homo sapiens", | |
| path = file.path("ensembl","annotation")) | |
| ``` | |
| After downloading the `.gff` file, users can import the `.gff` file with `read_gff()`. | |
| ```{r,eval=FALSE} | |
| # import downloaded GFF file | |
| Human_GFF <- read_gff(file = HS.gff.ensembl) | |
| ``` | |
| ```{r,eval=FALSE} | |
| # show all elements of the data.frame | |
| # options(tibble.print_max = Inf) | |
| Human_GFF | |
| ``` | |
| ``` | |
| seqid source type start end score strand phase | |
| <int> <chr> <chr> <int> <int> <chr> <chr> <dbl> | |
| 1 1 GRCh38 chromosome 1 248956422 . . 0 | |
| 2 1 . biological_region 10469 11240 1.3e+03 . 0 | |
| 3 1 . biological_region 10650 10657 0.999 + 0 | |
| 4 1 . biological_region 10655 10657 0.999 - 0 | |
| 5 1 . biological_region 10678 10687 0.999 + 0 | |
| 6 1 . biological_region 10681 10688 0.999 - 0 | |
| 7 1 . biological_region 10707 10716 0.999 + 0 | |
| 8 1 . biological_region 10708 10718 0.999 - 0 | |
| 9 1 . biological_region 10735 10747 0.999 - 0 | |
| 10 1 . biological_region 10737 10744 0.999 + 0 | |
| ``` | |
| Users can also specify the `release` argument which denotes the database release version of ENSEMBL (`db = "ensembl"`) in case they wish to download archived vgenome versions. Default is release = NULL meaning that the most recent database version is used. | |
| #### Removing corrupt lines from downloaded GFF files | |
| In some cases, `GFF` files stored at NCBI databases include corrupt lines that have more than 65000 characters. This [leads to problems](https://github.com/lawremi/rtracklayer/issues/15) | |
| when trying to import such annotation files into R. To overcome this issue users can specify the `remove_annotation_outliers = TRUE` | |
| argument to remove such outlier lines and overwrite the downloaded annotation file. This will make any downstream analysis with this annotation file much more reliable. | |
| Alternatively for `getGTF()`: | |
| ```{r,eval=FALSE} | |
| # download the GTF file of Homo sapiens from ENSEMBL | |
| # and store the corresponding file in 'ensembl/annotation' | |
| HS.gtf.ensembl <- getGTF( db = "ensembl", | |
| organism = "Homo sapiens", | |
| path = file.path("ensembl","annotation"), | |
| assembly_type = "primary_assembly") | |
| ``` | |
| Taken together, `getGFF()` or `getGTF()` in combination with `getGenome()`, `getProteome()`, `getRNA()` and `getCDS()` allows users to retrieve the genome information together with the | |
| corresponding `.gff` or `gtf` annotation file to make sure that they both have the same version and origin. | |
| ### GFFSet Retrieval | |
| The `getGFFSet()` function enables users to retrieve GFF files for | |
| multiple species. This is convenient when users wish to bulk download a particular | |
| set of species. Internally, a folder named `set_gff` is generated in which | |
| GFF and GFF info files are stored. In addition, users can specify the arguments: `clean_retrieval` and `gunzip` (both are `TRUE` by default) to clean file names and automatically unzip downloaded files. | |
| Example: | |
| ```r | |
| # specify the species names | |
| download_species <- c("Arabidopsis thaliana", | |
| "Arabidopsis lyrata", | |
| "Capsella rubella") | |
| # retrieve these three species from NCBI RefSeq | |
| biomartr::getGFFSet("refseq", organisms = download_species, path = "set_gff") | |
| ``` | |
| If the download process was interrupted, users can re-run the function | |
| and it will only download missing genomes. In cases users wish | |
| to download everything again and updating existing genomes, they may | |
| specify the argument `update = TRUE`. | |
| In some cases, `GFF` files stored at NCBI databases include corrupt lines that have more than 65000 characters. This [leads to problems](https://github.com/lawremi/rtracklayer/issues/15) | |
| when trying to import such annotation files into R. To overcome this issue users can specify the `remove_annotation_outliers = TRUE` | |
| argument to remove such outlier lines and overwrite the downloaded annotation file. This will make any downstream analysis with this annotation file much more reliable. | |
| ## Repeat Masker Retrieval | |
| [Repeat Masker](http://www.repeatmasker.org) is a tool for screening DNA sequences for interspersed repeats and low complexity DNA sequences. | |
| NCBI stores the `Repeat Masker` for sevel species in their databases and can be retrieved using `getRepeatMasker()` and imported | |
| via `read_rm()`. | |
| ### Example `NCBI RefSeq`: | |
| ```{r,eval=FALSE} | |
| # download repeat masker annotation file for Homo sapiens | |
| Hsapiens_rm <- getRepeatMasker( db = "refseq", | |
| organism = "Homo sapiens", | |
| path = file.path("refseq","TEannotation")) | |
| ``` | |
| Now users can import the TE annotation file using `read_rm()`. | |
| ```{r,eval=FALSE} | |
| # import TE annotation file | |
| Hsapiens_rm_import <- read_rm("refseq/TEannotation/Homo_sapiens_rm_refseq.out.gz") | |
| # look at data | |
| Hsapiens_rm_import | |
| ``` | |
| ## Genome Assembly Stats Retrieval | |
| By specifying the scientific name of an organism of interest the corresponding genome assembly stats file storing the assembly statistics of the organism of interest can be downloaded and stored locally. | |
| ### Example `NCBI RefSeq`: | |
| ```{r,eval=FALSE} | |
| # download genome assembly stats file for Homo sapiens | |
| Hsapiens_stats <- getAssemblyStats( db = "refseq", | |
| organism = "Homo sapiens", | |
| path = file.path("refseq","AssemblyStats")) | |
| ``` | |
| Now users can import the TE annotation file using `read_rm()`. | |
| ```{r,eval=FALSE} | |
| # import TE annotation file | |
| Hsapiens_stats_import <- read_assemblystats(Hsapiens_stats) | |
| # look at data | |
| Hsapiens_stats_import | |
| ``` | |
| ``` | |
| A tibble: 1,196 x 6 | |
| unit_name molecule_name molecule_type sequence_type statistic value | |
| <chr> <chr> <chr> <chr> <chr> <int> | |
| 1 all all all all total-length NA | |
| 2 all all all all spanned-gaps 661 | |
| 3 all all all all unspanned-gaps 349 | |
| 4 all all all all region-count 317 | |
| 5 all all all all scaffold-count 875 | |
| 6 all all all all scaffold-N50 59364414 | |
| 7 all all all all scaffold-L50 17 | |
| 8 all all all all scaffold-N75 31699399 | |
| 9 all all all all scaffold-N90 7557127 | |
| 10 all all all all contig-count 1536 | |
| ``` | |
| ### Example `NCBI Genbank`: | |
| ```{r,eval=FALSE} | |
| # download genome assembly stats file for Homo sapiens | |
| Hsapiens_stats <- getAssemblyStats( db = "genbank", | |
| organism = "Homo sapiens", | |
| path = file.path("genbank","AssemblyStats")) | |
| ``` | |
| Now users can import the TE annotation file using `read_rm()`. | |
| ```{r,eval=FALSE} | |
| # import TE annotation file | |
| Hsapiens_stats_import <- read_assemblystats(Hsapiens_stats) | |
| # look at data | |
| Hsapiens_stats_import | |
| ``` | |
| ``` | |
| A tibble: 1,196 x 6 | |
| unit_name molecule_name molecule_type sequence_type statistic value | |
| <chr> <chr> <chr> <chr> <chr> <int> | |
| 1 all all all all total-length NA | |
| 2 all all all all spanned-gaps 661 | |
| 3 all all all all unspanned-gaps 349 | |
| 4 all all all all region-count 317 | |
| 5 all all all all scaffold-count 875 | |
| 6 all all all all scaffold-N50 59364414 | |
| 7 all all all all scaffold-L50 17 | |
| 8 all all all all scaffold-N75 31699399 | |
| 9 all all all all scaffold-N90 7557127 | |
| 10 all all all all contig-count 1536 | |
| ``` | |
| ## Collection Retrieval | |
| The automated retrieval of collections (= Genome, Proteome, CDS, RNA, GFF, Repeat Masker, AssemblyStats) | |
| will make sure that the genome file of an organism will match the CDS, proteome, RNA, GFF, etc file | |
| and was generated using the same genome assembly version. One aspect of why genomics studies | |
| fail in computational and biological reproducibility is that it is not clear whether CDS, proteome, RNA, GFF, etc files | |
| used in a proposed analysis were generated using the same genome assembly file denoting the same genome assembly version. | |
| To avoid this seemingly trivial mistake we encourage users to retrieve | |
| genome file collections using the `biomartr` function `getCollection()` | |
| and attach the corresponding output as Supplementary Data | |
| to the respective genomics study to ensure computational and biological reproducibility. | |
| By specifying the scientific name of an organism of interest a collection consisting of the genome file, proteome file, CDS file, RNA file, GFF file, Repeat Masker file, AssemblyStats file of the organism of interest can be downloaded and stored locally. | |
| ### Example `NCBI RefSeq`: | |
| ```{r,eval=FALSE} | |
| # download collection for Saccharomyces cerevisiae | |
| getCollection( db = "refseq", | |
| organism = "Saccharomyces cerevisiae", | |
| path = file.path("refseq","Collections")) | |
| ``` | |
| Internally, the `getCollection()` function will now generate a folder named `refseq/Collection/Saccharomyces_erevisiae` | |
| and will store all genome and annotation files for `Saccharomyces cerevisiae` in the same folder. | |
| In addition, the exact genoem and annotation version will be logged in the `doc` folder. | |
| ``` | |
| Genome download is completed! | |
| Checking md5 hash of file: refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt ... | |
| The md5 hash of file 'refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt' matches! | |
| The genome of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_genomic_refseq.fna.gz'. | |
| Starting proteome retrieval of 'Saccharomyces cerevisiae' from refseq ... | |
| Proteome download is completed! | |
| Checking md5 hash of file: refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt ... | |
| The md5 hash of file 'refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt' matches! | |
| The proteome of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_protein_refseq.faa.gz' . | |
| Starting CDS retrieval of 'Saccharomyces cerevisiae' from refseq ... | |
| CDS download is completed! | |
| Checking md5 hash of file: refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt ... | |
| The md5 hash of file 'refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt' matches! | |
| The genomic CDS of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_cds_from_genomic_refseq.fna.gz' . | |
| Starting GFF retrieval of 'Saccharomyces cerevisiae' from refseq ... | |
| GFF download is completed! | |
| Checking md5 hash of file: refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt ... | |
| The md5 hash of file 'refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt' matches! | |
| The *.gff annotation file of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_genomic_refseq.gff.gz'. | |
| Starting RNA retrieval of 'Saccharomyces cerevisiae' from refseq ... | |
| RNA download is completed! | |
| Checking md5 hash of file: refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt ... | |
| The md5 hash of file 'refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt' matches! | |
| The genomic RNA of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_rna_from_genomic_refseq.fna.gz' . | |
| Starting Repeat Masker retrieval of 'Saccharomyces cerevisiae' from refseq ... | |
| RepeatMasker download is completed! | |
| Checking md5 hash of file: refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt ... | |
| The md5 hash of file 'refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt' matches! | |
| The Repeat Masker output file of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_rm_refseq.out.gz'. | |
| Starting assembly quality stats retrieval of 'Saccharomyces cerevisiae' from refseq ... | |
| Genome assembly quality stats file download completed! | |
| The assembly statistics file of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_assembly_stats_refseq.txt'. | |
| Collection retrieval finished successfully! | |
| We retrieved the genome assembly and checked the annotation for 'Saccharomyces cerevisiae' (taxid: 559292, strain=S288C) from 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_assembly_stats.txt' (accession: GCF_000146045.2, bioproject: PRJNA128, biosample: NA) using the biomartr R package (Drost and Paszkowski, 2017). | |
| ``` | |
| Users can now simply attach the output folder `` as supplementary data in their study and state in the materials sections. This way, computational and biological reproducibility can be standardized and indeed will become trivial | |
| in the context of genome version and annotation version. | |