Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.
Sign up| --- | |
| title: "Functional Annotation with biomartr" | |
| date: "`r Sys.Date()`" | |
| output: rmarkdown::html_vignette | |
| vignette: > | |
| %\VignetteIndexEntry{Functional Annotation} | |
| %\VignetteEngine{knitr::rmarkdown} | |
| %\usepackage[utf8]{inputenc} | |
| --- | |
| ```{r, echo = FALSE, message = FALSE} | |
| options(width = 750) | |
| knitr::opts_chunk$set( | |
| comment = "#>", | |
| error = FALSE, | |
| tidy = FALSE) | |
| ``` | |
| ## Functional Annotation Retrieval from `Ensembl Biomart` | |
| The [Ensembl Biomart](http://ensemblgenomes.org/info/access/biomart) database enables users to retrieve a vast diversity of annotation data | |
| for specific organisms. Initially, Steffen Durinck and Wolfgang Huber provided a powerful interface between | |
| the R language and [Ensembl Biomart](http://ensemblgenomes.org/info/access/biomart) by implementing the R package [biomaRt](http://www.bioconductor.org/packages/release/bioc/html/biomaRt.html). | |
| The purpose of the `biomaRt` package was to mimic the ENSEMBL BioMart database structure to construct queries that can be sent to the Application Programming Interface (API) of BioMart. Although, this procedure was very useful in the past, it seems not intuitive from an organism centric point of view. Usually, users wish to download functional annotation for a particular organism of interest. However, the BioMart and thus the `biomaRt` package require that users already know in which `mart` and `dataset` the organism of interest will be found which requires significant efforts of searching and screening. In addition, once the `mart` and `dataset` of a particular organism of interest were found and specified the user must again learn which `attribute` has to be specified to retrieve the functional annotation information of interest. | |
| The new functionality implemented in the `biomartr` package aims to overcome this | |
| search bottleneck by extending the functionality of the [biomaRt](http://www.bioconductor.org/packages/release/bioc/html/biomaRt.html) package. The new `biomartr` package introduces a more organism cantered annotation retrieval concept which does not require to screen for `marts`, `datasets`, and `attributes` beforehand. With `biomartr` users only need to specify the `scientific name` of the organism of interest to then retrieve available `marts`, `datasets`, and `attributes` for the corresponding organism of interest. | |
| This paradigm shift enables users to quickly construct queries to the BioMart database without having to learn the particular database structure and organization of BioMart. | |
| The following sections will introduce users to the functionality and data retrieval precedures of `biomartr` and will show how `biomartr` | |
| extends the functionality of the initial [biomaRt](http://www.bioconductor.org/packages/release/bioc/html/biomaRt.html) package. | |
| ### The old `biomaRt` query methodology | |
| The best way to get started with the _old_ methodology presented by the established [biomaRt](http://www.bioconductor.org/packages/release/bioc/html/biomaRt.html) package is to understand the workflow of its data retrieval process. The query logic of the `biomaRt` package derives from the database organization of [Ensembl Biomart](http://ensemblgenomes.org/info/access/biomart) which stores a vast diversity of annotation data | |
| for specific organisms. In detail, the [Ensembl Biomart](http://ensemblgenomes.org/info/access/biomart) database is organized into so called: | |
| `marts`, `datasets`, and `attributes`. `Marts` denote a higher level category of functional annotation such as `SNP` (e.g. for functional annotation of particular single nucleotide polymorphisms (SNPs)) or `FUNCGEN` (e.g. for functional annotation of regulatory regions or relationsships of genes). `Datasets` denote the | |
| particular species of interest for which functional annotation is available __within__ this specific `mart`. It can happen that | |
| `datasets` (= particular species of interest) are available in one `mart` (= higher category of functional annotation) but not in an other `mart`. | |
| For the actual retrieval of functional annotation information users must then specify the `type` of functional annotation information they | |
| wish to retrieve. These `types` are called `attributes` in the `biomaRt` notation. | |
| Hence, when users wish to retrieve information for a specific organism of interest, they first need to specify a particular `mart` and `dataset` in which the information of the corresponding organism of interest can be found. Subsequently they can specify the `attributes` argument to retrieve a particular type of functional annotation (e.g. Gene Ontology terms). | |
| The following section shall illustrate how `marts`, `datasets`, and `attributes` could be explored | |
| using `biomaRt` before the `biomartr` package existed. | |
| The availability of `marts`, `datasets`, and `attributes` can be checked by the following functions: | |
| ```{r,eval=FALSE} | |
| # install the biomaRt package | |
| # source("http://bioconductor.org/biocLite.R") | |
| # biocLite("biomaRt") | |
| # load biomaRt | |
| library(biomaRt) | |
| # look at top 10 databases | |
| head(biomaRt::listMarts(host = "www.ensembl.org"), 10) | |
| ``` | |
| Users will observe that several `marts` providing annotation for specific classes of organisms or groups of organisms are available. | |
| For our example, we will choose the `hsapiens_gene_ensembl` `mart` and list all available datasets that are element of this `mart`. | |
| ```{r,eval=FALSE} | |
| head(biomaRt::listDatasets(biomaRt::useMart("ENSEMBL_MART_ENSEMBL", host = "www.ensembl.org")), 10) | |
| ``` | |
| The `useMart()` function is a wrapper function provided by `biomaRt` to connect a selected BioMart database (`mart`) with a corresponding dataset stored within this `mart`. | |
| We select dataset `hsapiens_gene_ensembl` and now check for available attributes (annotation data) that can be accessed for `Homo sapiens` genes. | |
| ```{r,eval=FALSE} | |
| head(biomaRt::listAttributes(biomaRt::useDataset( | |
| dataset = "hsapiens_gene_ensembl", | |
| mart = useMart("ENSEMBL_MART_ENSEMBL", | |
| host = "www.ensembl.org"))), 10) | |
| ``` | |
| Please note the nested structure of this attribute query. For an attribute query procedure an additional wrapper function named `useDataset()` is needed in which `useMart()` and a corresponding dataset needs to be specified. The result is a table storing the name of available attributes for | |
| _Homo sapiens_ as well as a short description. | |
| Furthermore, users can retrieve all filters for _Homo sapiens_ that can be specified by the actual BioMart query process. | |
| ```{r,eval=FALSE} | |
| head(biomaRt::listFilters(biomaRt::useDataset(dataset = "hsapiens_gene_ensembl", | |
| mart = useMart("ENSEMBL_MART_ENSEMBL", | |
| host = "www.ensembl.org"))), 10) | |
| ``` | |
| After accumulating all this information, it is now possible to perform an actual BioMart query by using the `getBM()` function. | |
| In this example we will retrieve attributes: `start_position`,`end_position` and `description` | |
| for the _Homo sapiens_ gene `"GUCA2A"`. | |
| Since the input genes are `ensembl gene ids`, we need to specify the `filters` argument `filters = "hgnc_symbol"`. | |
| ```{r,eval=FALSE} | |
| # 1) select a mart and data set | |
| mart <- biomaRt::useDataset(dataset = "hsapiens_gene_ensembl", | |
| mart = useMart("ENSEMBL_MART_ENSEMBL", | |
| host = "www.ensembl.org")) | |
| # 2) run a biomart query using the getBM() function | |
| # and specify the attributes and filter arguments | |
| geneSet <- "GUCA2A" | |
| resultTable <- biomaRt::getBM(attributes = c("start_position","end_position","description"), | |
| filters = "hgnc_symbol", | |
| values = geneSet, | |
| mart = mart) | |
| resultTable | |
| ``` | |
| When using `getBM()` users can pass all attributes retrieved by `listAttributes()` to the `attributes` argument of the `getBM()` function. | |
| ## Extending `biomaRt` using the new query system of the `biomartr` package | |
| ### Getting Started with `biomartr` | |
| This query methodology provided by `Ensembl Biomart` and the `biomaRt` package is a very well defined approach | |
| for accurate annotation retrieval. Nevertheless, when learning this query methodology it (subjectively) | |
| seems non-intuitive from the user perspective. Therefore, the `biomartr` package provides another | |
| query methodology that aims to be more organism centric. | |
| Taken together, the following workflow allows users to perform fast BioMart queries for | |
| attributes using the `biomart()` function implemented in this `biomartr` package: | |
| 1) get attributes, datasets, and marts via : `organismAttributes()` | |
| 2) choose available biological features (filters) via: `organismFilters()` | |
| 3) specify a set of query genes: e.g. retrieved with `getGenome()`, `getProteome()` or `getCDS()` | |
| 4) specify all arguments of the `biomart()` function using steps 1) - 3) and | |
| perform a BioMart query | |
| __Note that dataset names change very frequently due to the update of dataset versions. | |
| So in case some query functions do not work properly, users should check with | |
| `organismAttributes(update = TRUE)` whether or not their dataset name has been changed. | |
| For example, `organismAttributes("Homo sapiens", topic = "id", update = TRUE)` | |
| might reveal that the dataset `ENSEMBL_MART_ENSEMBL` has changed.__ | |
| ## Retrieve marts, datasets, attributes, and filters with biomartr | |
| ### Retrieve Available Marts | |
| The `getMarts()` function allows users to list all available databases that can be accessed through BioMart interfaces. | |
| ```{r,eval=FALSE} | |
| # load the biomartr package | |
| library(biomartr) | |
| # list all available databases | |
| biomartr::getMarts() | |
| ``` | |
| ``` | |
| mart version | |
| mart version | |
| <chr> <chr> | |
| 1 ENSEMBL_MART_ENSEMBL Ensembl Genes 99 | |
| 2 ENSEMBL_MART_MOUSE Mouse strains 99 | |
| 3 ENSEMBL_MART_SEQUENCE Sequence | |
| 4 ENSEMBL_MART_ONTOLOGY Ontology | |
| 5 ENSEMBL_MART_GENOMIC Genomic features 99 | |
| 6 ENSEMBL_MART_SNP Ensembl Variation 99 | |
| 7 ENSEMBL_MART_FUNCGEN Ensembl Regulation 99 | |
| 8 plants_mart Ensembl Plants Genes 46 | |
| 9 plants_variations Ensembl Plants Variations 46 | |
| 10 fungi_mart Ensembl Fungi Genes 46 | |
| 11 fungi_variations Ensembl Fungi Variations 46 | |
| 12 protists_mart Ensembl Protists Genes 46 | |
| 13 protists_variations Ensembl Protists Variations 46 | |
| 14 metazoa_mart Ensembl Metazoa Genes 46 | |
| 15 metazoa_variations Ensembl Metazoa Variations 46 | |
| ``` | |
| ### Retrieve Available Datasets from a Specific Mart | |
| Now users can select a specific database to list all available datasets | |
| that can be accessed through this database. In this example we choose | |
| the `ENSEMBL_MART_ENSEMBL` database. | |
| ```{r,eval=FALSE} | |
| head(biomartr::getDatasets(mart = "ENSEMBL_MART_ENSEMBL") , 5) | |
| ``` | |
| ``` | |
| dataset description version | |
| <chr> <chr> <chr> | |
| 1 mauratus_gene_ense… Golden Hamster genes (MesAur1. MesAur1.0 | |
| 2 tbelangeri_gene_en… Tree Shrew genes (tupBel1) tupBel1 | |
| 3 csemilaevis_gene_e… Tongue sole genes (Cse_v1.0) Cse_v1.0 | |
| 4 acalliptera_gene_e… Eastern happy genes (fAstCal1. fAstCal1.2 | |
| 5 hcomes_gene_ensembl Tiger tail seahorse genes (H_c H_comes_QL… | |
| ``` | |
| Now you can select the dataset `hsapiens_gene_ensembl` and list all available attributes that can be retrieved from this dataset. | |
| ```{r,eval=FALSE} | |
| tail(biomartr::getDatasets(mart = "ENSEMBL_MART_ENSEMBL") , 38) | |
| ``` | |
| ``` | |
| 1 lcrocea_gene_ense Large yellow croaker genes ( L_crocea_2.0 | |
| 2 xmaculatus_gene_e Platyfish genes (X_maculatus X_maculatus-… | |
| 3 etelfairi_gene_en Lesser hedgehog tenrec genes TENREC | |
| 4 lcalcarifer_gene_ Barramundi perch genes (ASB_ ASB_HGAPasse… | |
| 5 ssjinhua_gene_ens Pig - Jinhua genes (Jinhua_p Jinhua_pig_v1 | |
| 6 acchrysaetos_gene Golden eagle genes (bAquChr1 bAquChr1.2 | |
| 7 elucius_gene_ense Northern pike genes (Eluc_V3) Eluc_V3 | |
| 8 oanatinus_gene_en Platypus genes (OANA5) OANA5 | |
| 9 oaries_gene_ensem Sheep genes (Oar_v3.1) Oar_v3.1 | |
| 10 rferrumequinum_ge Greater horseshoe bat genes mRhiFer1_v1.p | |
| # ... with 28 more rows | |
| ``` | |
| ### Retrieve Available Attributes from a Specific Dataset | |
| Now that you have selected a database (`hsapiens_gene_ensembl`) and a dataset (`hsapiens_gene_ensembl`), | |
| users can list all available attributes for this dataset using the `getAttributes()` function. | |
| ```{r,eval=FALSE} | |
| # show all elements of the data.frame | |
| options(tibble.print_max = Inf) | |
| # list all available attributes for dataset: hsapiens_gene_ensembl | |
| head( biomartr::getAttributes(mart = "ENSEMBL_MART_ENSEMBL", | |
| dataset = "hsapiens_gene_ensembl"), 10 ) | |
| ``` | |
| ``` | |
| Starting retrieval of attribute information from mart ENSEMBL_MART_ENSEMBL and dataset hsapiens_gene_ensembl ... | |
| name description | |
| 1 ensembl_gene_id Gene stable ID | |
| 2 ensembl_gene_id_version Gene stable ID version | |
| 3 ensembl_transcript_id Transcript stable ID | |
| 4 ensembl_transcript_id_version Transcript stable ID version | |
| 5 ensembl_peptide_id Protein stable ID | |
| 6 ensembl_peptide_id_version Protein stable ID version | |
| 7 ensembl_exon_id Exon stable ID | |
| 8 description Gene description | |
| 9 chromosome_name Chromosome/scaffold name | |
| 10 start_position Gene start (bp) | |
| ``` | |
| ### Retrieve Available Filters from a Specific Dataset | |
| Finally, the `getFilters()` function allows users to list available filters | |
| for a specific dataset that can be used for a `biomart()` query. | |
| ```{r,eval=FALSE} | |
| # show all elements of the data.frame | |
| options(tibble.print_max = Inf) | |
| # list all available filters for dataset: hsapiens_gene_ensembl | |
| head( biomartr::getFilters(mart = "ENSEMBL_MART_ENSEMBL", | |
| dataset = "hsapiens_gene_ensembl"), 10 ) | |
| ``` | |
| ``` | |
| Starting retrieval of filters information from mart ENSEMBL_MART_ENSEMBL and dataset hsapiens_gene_ensembl ... | |
| name description | |
| 1 chromosome_name Chromosome/scaffold name | |
| 2 start Start | |
| 3 end End | |
| 4 band_start Band Start | |
| 5 band_end Band End | |
| 6 marker_start Marker Start | |
| 7 marker_end Marker End | |
| 8 encode_region Encode region | |
| 9 strand Strand | |
| 10 chromosomal_region e.g. 1:100:10000:-1, 1:100000:200000:1 | |
| ``` | |
| ## Organism Specific Retrieval of Information | |
| In most use cases, users will work with a single or a set of model organisms. In this process they will mostly be | |
| interested in specific annotations for this particular model organism. The `organismBM()` | |
| function addresses this issue and provides users with an organism centric query to `marts` and `datasets` | |
| which are available for a particular organism of interest. | |
| __Note__ that when running the following functions for the first time, the data retrieval procedure will take some time, due to the remote access to BioMart. The corresponding result is then saved in a `*.txt` file named `_biomart/listDatasets.txt` within the `tempdir()` folder, allowing subsequent queries to be performed much faster. | |
| The `tempdir()` folder, however, will be deleted after a new R session was established. In this case | |
| the inital call of the subsequent functions again will take time to retrieve all organism specific data from the BioMart database. | |
| This concept of locally storing all organism specific database linking information available in BioMart into | |
| an internal file allows users to significantly speed up subsequent retrieval queries for that particular organism. | |
| ```{r,eval=FALSE} | |
| # show all elements of the data.frame | |
| options(tibble.print_max = Inf) | |
| # retrieving all available datasets and biomart connections for | |
| # a specific query organism (scientific name) | |
| biomartr::organismBM(organism = "Homo sapiens") | |
| ``` | |
| ``` | |
| organism_name description mart dataset version | |
| <chr> <chr> <chr> <chr> <chr> | |
| 1 hsapiens Human genes (GRCh38.p12) ENSEMBL_… hsapiens_gen… GRCh38… | |
| 2 hsapiens Human sequences (GRCh38.p12) ENSEMBL_… hsapiens_gen… GRCh38… | |
| 3 hsapiens marker_feature ENSEMBL_… hsapiens_mar… GRCh38… | |
| 4 hsapiens karyotype_start ENSEMBL_… hsapiens_kar… GRCh38… | |
| 5 hsapiens encode ENSEMBL_… hsapiens_enc… GRCh38… | |
| 6 hsapiens marker_feature_end ENSEMBL_… hsapiens_mar… GRCh38… | |
| 7 hsapiens karyotype_end ENSEMBL_… hsapiens_kar… GRCh38… | |
| 8 hsapiens Human Short Variants (SNPs an… ENSEMBL_… hsapiens_snp GRCh38… | |
| 9 hsapiens Human Somatic Short Variants … ENSEMBL_… hsapiens_snp… GRCh38… | |
| 10 hsapiens Human Somatic Structural Vari… ENSEMBL_… hsapiens_str… GRCh38… | |
| 11 hsapiens Human Structural Variants (GR… ENSEMBL_… hsapiens_str… GRCh38… | |
| 12 hsapiens Human Regulatory Features (GR… ENSEMBL_… hsapiens_reg… GRCh38… | |
| 13 hsapiens Human miRNA Target Regions (G… ENSEMBL_… hsapiens_mir… GRCh38… | |
| 14 hsapiens Human Binding Motifs (GRCh38.… ENSEMBL_… hsapiens_mot… GRCh38… | |
| 15 hsapiens Human Other Regulatory Region… ENSEMBL_… hsapiens_ext… GRCh38… | |
| 16 hsapiens Human Regulatory Evidence (GR… ENSEMBL_… hsapiens_peak GRCh38… | |
| ``` | |
| The result is a table storing all `marts` and `datasets` from which annotations can be retrieved | |
| for _Homo sapiens_. Furthermore, a short description as well as the version of the dataset | |
| being accessed (very useful for publications) is returned. | |
| Users will observe that 3 different `marts` provide 6 different `datasets` storing annotation information for | |
| _Homo sapiens_. | |
| __Please note, however, that scientific names of organisms must be written correctly! For ex. "Homo Sapiens" will be treated differently (not recognized) than "Homo sapiens" (recognized).__ | |
| Similar to the `biomaRt` package query methodology, users need to specify `attributes` and `filters` to be able to perform | |
| accurate BioMart queries. Here the functions `organismAttributes()` and `organismFilters()` provide useful and intuitive | |
| concepts to obtain this information. | |
| ```{r,eval=FALSE} | |
| # show all elements of the data.frame | |
| options(tibble.print_max = Inf) | |
| # return available attributes for "Homo sapiens" | |
| head(biomartr::organismAttributes("Homo sapiens"), 20) | |
| ``` | |
| ``` | |
| 1 ensembl_gene_id Gene stable ID hsapiens_ge… ENSEMBL_M… | |
| 2 ensembl_gene_id_version Gene stable ID version hsapiens_ge… ENSEMBL_M… | |
| 3 ensembl_transcript_id Transcript stable ID hsapiens_ge… ENSEMBL_M… | |
| 4 ensembl_transcript_id_version Transcript stable ID … hsapiens_ge… ENSEMBL_M… | |
| 5 ensembl_peptide_id Protein stable ID hsapiens_ge… ENSEMBL_M… | |
| 6 ensembl_peptide_id_version Protein stable ID ver… hsapiens_ge… ENSEMBL_M… | |
| 7 ensembl_exon_id Exon stable ID hsapiens_ge… ENSEMBL_M… | |
| 8 description Gene description hsapiens_ge… ENSEMBL_M… | |
| 9 chromosome_name Chromosome/scaffold n… hsapiens_ge… ENSEMBL_M… | |
| 10 start_position Gene start (bp) hsapiens_ge… ENSEMBL_M… | |
| 11 end_position Gene end (bp) hsapiens_ge… ENSEMBL_M… | |
| 12 strand Strand hsapiens_ge… ENSEMBL_M… | |
| 13 band Karyotype band hsapiens_ge… ENSEMBL_M… | |
| 14 transcript_start Transcript start (bp) hsapiens_ge… ENSEMBL_M… | |
| 15 transcript_end Transcript end (bp) hsapiens_ge… ENSEMBL_M… | |
| 16 transcription_start_site Transcription start s… hsapiens_ge… ENSEMBL_M… | |
| 17 transcript_length Transcript length (in… hsapiens_ge… ENSEMBL_M… | |
| 18 transcript_tsl Transcript support le… hsapiens_ge… ENSEMBL_M… | |
| 19 transcript_gencode_basic GENCODE basic annotat… hsapiens_ge… ENSEMBL_M… | |
| 20 transcript_appris APPRIS annotation hsapiens_ge… ENSEMBL_M… | |
| ``` | |
| Users will observe that the `organismAttributes()` function returns a data.frame storing attribute names, datasets, and marts which | |
| are available for `Homo sapiens`. After the ENSEMBL release 87 the `ENSEMBL_MART_SEQUENCE` service provided | |
| by Ensembl does not work properly and thus the `organismAttributes()` function prints out warning messages to make the user aware | |
| when certain marts provided bt Ensembl do not work properly, yet. | |
| An additional feature provided by `organismAttributes()` is the `topic` argument. The `topic` argument allows users to to search for specific attributes, topics, or categories for faster filtering. | |
| ```{r,eval=FALSE} | |
| # show all elements of the data.frame | |
| options(tibble.print_max = Inf) | |
| # search for attribute topic "id" | |
| head(biomartr::organismAttributes("Homo sapiens", topic = "id"), 20) | |
| ``` | |
| ``` | |
| name description dataset mart | |
| <chr> <chr> <chr> <chr> | |
| 1 ensembl_gene_id Gene stable ID hsapiens_ge… ENSEMBL_M… | |
| 2 ensembl_gene_id_version Gene stable ID version hsapiens_ge… ENSEMBL_M… | |
| 3 ensembl_transcript_id Transcript stable ID hsapiens_ge… ENSEMBL_M… | |
| 4 ensembl_transcript_id_version Transcript stable ID … hsapiens_ge… ENSEMBL_M… | |
| 5 ensembl_peptide_id Protein stable ID hsapiens_ge… ENSEMBL_M… | |
| 6 ensembl_peptide_id_version Protein stable ID ver… hsapiens_ge… ENSEMBL_M… | |
| 7 ensembl_exon_id Exon stable ID hsapiens_ge… ENSEMBL_M… | |
| 8 study_external_id Study external refere… hsapiens_ge… ENSEMBL_M… | |
| 9 go_id GO term accession hsapiens_ge… ENSEMBL_M… | |
| 10 dbass3_id DataBase of Aberrant … hsapiens_ge… ENSEMBL_M… | |
| 11 dbass5_id DataBase of Aberrant … hsapiens_ge… ENSEMBL_M… | |
| 12 hgnc_id HGNC ID hsapiens_ge… ENSEMBL_M… | |
| 13 protein_id INSDC protein ID hsapiens_ge… ENSEMBL_M… | |
| 14 mim_morbid_description MIM morbid description hsapiens_ge… ENSEMBL_M… | |
| 15 mim_morbid_accession MIM morbid accession hsapiens_ge… ENSEMBL_M… | |
| 16 mirbase_id miRBase ID hsapiens_ge… ENSEMBL_M… | |
| 17 refseq_peptide RefSeq peptide ID hsapiens_ge… ENSEMBL_M… | |
| 18 refseq_peptide_predicted RefSeq peptide predic… hsapiens_ge… ENSEMBL_M… | |
| 19 wikigene_id WikiGene ID hsapiens_ge… ENSEMBL_M… | |
| 20 mobidblite MobiDBLite hsapiens_ge… ENSEMBL_M… | |
| ``` | |
| Now, all `attribute names` having `id` as part of their `name` are being returned. | |
| Another example is `topic = "homolog"`. | |
| ```{r,eval=FALSE} | |
| # show all elements of the data.frame | |
| options(tibble.print_max = Inf) | |
| # search for attribute topic "homolog" | |
| head(biomartr::organismAttributes("Homo sapiens", topic = "homolog"), 20) | |
| ``` | |
| ``` | |
| <chr> <chr> <chr> <chr> | |
| 1 mspretus_homolog_ensembl_gene Algerian mouse… hsapie… ENSEM… | |
| 2 mspretus_homolog_associated_gene_name Algerian mouse… hsapie… ENSEM… | |
| 3 mspretus_homolog_ensembl_peptide Algerian mouse… hsapie… ENSEM… | |
| 4 mspretus_homolog_chromosome Algerian mouse… hsapie… ENSEM… | |
| 5 mspretus_homolog_chrom_start Algerian mouse… hsapie… ENSEM… | |
| 6 mspretus_homolog_chrom_end Algerian mouse… hsapie… ENSEM… | |
| 7 mspretus_homolog_canonical_transcript_protein Query protein … hsapie… ENSEM… | |
| 8 mspretus_homolog_subtype Last common an… hsapie… ENSEM… | |
| 9 mspretus_homolog_orthology_type Algerian mouse… hsapie… ENSEM… | |
| 10 mspretus_homolog_perc_id %id. target Al… hsapie… ENSEM… | |
| 11 mspretus_homolog_perc_id_r1 %id. query gen… hsapie… ENSEM… | |
| 12 mspretus_homolog_goc_score Algerian mouse… hsapie… ENSEM… | |
| 13 mspretus_homolog_wga_coverage Algerian mouse… hsapie… ENSEM… | |
| 14 mspretus_homolog_dn dN with Algeri… hsapie… ENSEM… | |
| 15 mspretus_homolog_ds dS with Algeri… hsapie… ENSEM… | |
| 16 mspretus_homolog_orthology_confidence Algerian mouse… hsapie… ENSEM… | |
| 17 vpacos_homolog_ensembl_gene Alpaca gene st… hsapie… ENSEM… | |
| 18 vpacos_homolog_associated_gene_name Alpaca gene na… hsapie… ENSEM… | |
| 19 vpacos_homolog_ensembl_peptide Alpaca protein… hsapie… ENSEM… | |
| 20 vpacos_homolog_chromosome Alpaca chromos… hsapie… ENSEM… | |
| ``` | |
| Or `topic = "dn"` and `topic = "ds"` for `dn` and `ds` value retrieval. | |
| ```{r,eval=FALSE} | |
| # show all elements of the data.frame | |
| options(tibble.print_max = Inf) | |
| # search for attribute topic "dn" | |
| head(biomartr::organismAttributes("Homo sapiens", topic = "dn")) | |
| ``` | |
| ``` | |
| name description dataset mart | |
| <chr> <chr> <chr> <chr> | |
| 1 cdna_coding_start cDNA coding start hsapiens_gene_ensembl ENSEMBL_M… | |
| 2 cdna_coding_end cDNA coding end hsapiens_gene_ensembl ENSEMBL_M… | |
| 3 mspretus_homolog_dn dN with Algerian mouse hsapiens_gene_ensembl ENSEMBL_M… | |
| 4 vpacos_homolog_dn dN with Alpaca hsapiens_gene_ensembl ENSEMBL_M… | |
| 5 pformosa_homolog_dn dN with Amazon molly hsapiens_gene_ensembl ENSEMBL_M… | |
| 6 cpalliatus_homolog_dn dN with Angola colobus hsapiens_gene_ensembl ENSEMBL_M… | |
| ``` | |
| ```{r,eval=FALSE} | |
| # show all elements of the data.frame | |
| options(tibble.print_max = Inf) | |
| # search for attribute topic "ds" | |
| head(biomartr::organismAttributes("Homo sapiens", topic = "ds")) | |
| ``` | |
| ``` | |
| name description dataset mart | |
| <chr> <chr> <chr> <chr> | |
| 1 ccds CCDS ID hsapiens_gene_ensembl ENSEMBL_MAR… | |
| 2 cds_length CDS Length hsapiens_gene_ensembl ENSEMBL_MAR… | |
| 3 cds_start CDS start hsapiens_gene_ensembl ENSEMBL_MAR… | |
| 4 cds_end CDS end hsapiens_gene_ensembl ENSEMBL_MAR… | |
| 5 mspretus_homolog_ds dS with Algerian mouse hsapiens_gene_ensembl ENSEMBL_MAR… | |
| 6 vpacos_homolog_ds dS with Alpaca hsapiens_gene_ensembl ENSEMBL_MAR… | |
| ``` | |
| Analogous to the `organismAttributes()` function, the `organismFilters()` function returns | |
| all filters that are available for a query organism of interest. | |
| ```{r,eval=FALSE} | |
| # show all elements of the data.frame | |
| options(tibble.print_max = Inf) | |
| # return available filters for "Homo sapiens" | |
| head(biomartr::organismFilters("Homo sapiens"), 20) | |
| ``` | |
| ``` | |
| name description dataset mart | |
| <chr> <chr> <chr> <chr> | |
| 1 chromosome_name Chromosome/scaffold… hsapiens_… ENSEMBL… | |
| 2 start Start hsapiens_… ENSEMBL… | |
| 3 end End hsapiens_… ENSEMBL… | |
| 4 band_start Band Start hsapiens_… ENSEMBL… | |
| 5 band_end Band End hsapiens_… ENSEMBL… | |
| 6 marker_start Marker Start hsapiens_… ENSEMBL… | |
| 7 marker_end Marker End hsapiens_… ENSEMBL… | |
| 8 encode_region Encode region hsapiens_… ENSEMBL… | |
| 9 strand Strand hsapiens_… ENSEMBL… | |
| 10 chromosomal_region e.g. 1:100:10000:-1… hsapiens_… ENSEMBL… | |
| 11 with_ccds With CCDS ID(s) hsapiens_… ENSEMBL… | |
| 12 with_chembl With ChEMBL ID(s) hsapiens_… ENSEMBL… | |
| 13 with_clone_based_ensembl_gene With Clone-based (E… hsapiens_… ENSEMBL… | |
| 14 with_clone_based_ensembl_transcript With Clone-based (E… hsapiens_… ENSEMBL… | |
| 15 with_dbass3 With DataBase of Ab… hsapiens_… ENSEMBL… | |
| 16 with_dbass5 With DataBase of Ab… hsapiens_… ENSEMBL… | |
| 17 with_ens_hs_transcript With Ensembl Human … hsapiens_… ENSEMBL… | |
| 18 with_ens_hs_translation With Ensembl Human … hsapiens_… ENSEMBL… | |
| 19 with_entrezgene_trans_name With EntrezGene tra… hsapiens_… ENSEMBL… | |
| 20 with_embl With European Nucle… hsapiens_… ENSEMBL… | |
| ``` | |
| The `organismFilters()` function also allows users to search for filters that correspond to | |
| a specific topic or category. | |
| ```{r,eval=FALSE} | |
| # show all elements of the data.frame | |
| options(tibble.print_max = Inf) | |
| # search for filter topic "id" | |
| head(biomartr::organismFilters("Homo sapiens", topic = "id"), 20) | |
| ``` | |
| ``` | |
| name description dataset mart | |
| <chr> <chr> <chr> <chr> | |
| 1 with_protein_id With INSDC protein ID ID… hsapiens_g… ENSEMBL… | |
| 2 with_mim_morbid With MIM morbid ID(s) hsapiens_g… ENSEMBL… | |
| 3 with_refseq_peptide With RefSeq peptide ID(s) hsapiens_g… ENSEMBL… | |
| 4 with_refseq_peptide_predicted With RefSeq peptide pred… hsapiens_g… ENSEMBL… | |
| 5 ensembl_gene_id Gene stable ID(s) [e.g. … hsapiens_g… ENSEMBL… | |
| 6 ensembl_gene_id_version Gene stable ID(s) with v… hsapiens_g… ENSEMBL… | |
| 7 ensembl_transcript_id Transcript stable ID(s) … hsapiens_g… ENSEMBL… | |
| 8 ensembl_transcript_id_version Transcript stable ID(s) … hsapiens_g… ENSEMBL… | |
| 9 ensembl_peptide_id Protein stable ID(s) [e.… hsapiens_g… ENSEMBL… | |
| 10 ensembl_peptide_id_version Protein stable ID(s) wit… hsapiens_g… ENSEMBL… | |
| 11 ensembl_exon_id Exon ID(s) [e.g. ENSE000… hsapiens_g… ENSEMBL… | |
| 12 dbass3_id DataBase of Aberrant 3' … hsapiens_g… ENSEMBL… | |
| 13 dbass5_id DataBase of Aberrant 5' … hsapiens_g… ENSEMBL… | |
| 14 hgnc_id HGNC ID(s) [e.g. HGNC:10… hsapiens_g… ENSEMBL… | |
| 15 protein_id INSDC protein ID(s) [e.g… hsapiens_g… ENSEMBL… | |
| 16 mim_morbid_accession MIM morbid accession(s) … hsapiens_g… ENSEMBL… | |
| 17 mirbase_id miRBase ID(s) [e.g. hsa-… hsapiens_g… ENSEMBL… | |
| 18 refseq_peptide RefSeq peptide ID(s) [e.… hsapiens_g… ENSEMBL… | |
| 19 refseq_peptide_predicted RefSeq peptide predicted… hsapiens_g… ENSEMBL… | |
| 20 wikigene_id WikiGene ID(s) [e.g. 1] hsapiens_g… ENSEMBL… | |
| ``` | |
| ## Construct BioMart queries with `biomartr` | |
| The short introduction to the functionality of | |
| `organismBM()`, `organismAttributes()`, and `organismFilters()` | |
| will allow users to perform BioMart queries in a very intuitive | |
| organism centric way. The main function to perform BioMart queries | |
| is `biomart()`. | |
| For the following examples we will assume that we are interested in the annotation of specific genes from the _Homo sapiens_ proteome. We want to map the corresponding refseq gene id to a set of other gene ids used in other databases. For this purpose, first we need consult the `organismAttributes()` function. | |
| ```{r,eval=FALSE} | |
| # show all elements of the data.frame | |
| options(tibble.print_max = Inf) | |
| head(biomartr::organismAttributes("Homo sapiens", topic = "id")) | |
| ``` | |
| ``` | |
| name description dataset mart | |
| <chr> <chr> <chr> <chr> | |
| 1 ensembl_gene_id Gene stable ID hsapiens_… ENSEMB… | |
| 2 ensembl_gene_id_version Gene stable ID version hsapiens_… ENSEMB… | |
| 3 ensembl_transcript_id Transcript stable ID hsapiens_… ENSEMB… | |
| 4 ensembl_transcript_id_version Transcript stable ID version hsapiens_… ENSEMB… | |
| 5 ensembl_peptide_id Protein stable ID hsapiens_… ENSEMB… | |
| 6 ensembl_peptide_id_version Protein stable ID version hsapiens_… ENSEMB… | |
| ``` | |
| ```{r,eval=FALSE} | |
| # show all elements of the data.frame | |
| options(tibble.print_max = Inf) | |
| # retrieve the proteome of Homo sapiens from refseq | |
| file_path <- biomartr::getProteome( db = "refseq", | |
| organism = "Homo sapiens", | |
| path = file.path("_ncbi_downloads","proteomes") ) | |
| Hsapiens_proteome <- biomartr::read_proteome(file_path, format = "fasta") | |
| # remove splice variants from id | |
| gene_set <- unlist(sapply(strsplit(Hsapiens_proteome@ranges@NAMES[1:5], ".",fixed = TRUE), function(x) x[1])) | |
| result_BM <- biomartr::biomart( genes = gene_set, # genes were retrieved using biomartr::getGenome() | |
| mart = "ENSEMBL_MART_ENSEMBL", # marts were selected with biomartr::getMarts() | |
| dataset = "hsapiens_gene_ensembl", # datasets were selected with biomartr::getDatasets() | |
| attributes = c("ensembl_gene_id","ensembl_peptide_id"), # attributes were selected with biomartr::getAttributes() | |
| filters = "refseq_peptide") # specify what ID type was stored in the fasta file retrieved with biomartr::getGenome() | |
| result_BM | |
| ``` | |
| ``` | |
| refseq_peptide ensembl_gene_id ensembl_peptide_id | |
| 1 NP_000005 ENSG00000175899 ENSP00000323929 | |
| 2 NP_000006 ENSG00000156006 ENSP00000286479 | |
| 3 NP_000007 ENSG00000117054 ENSP00000359878 | |
| 4 NP_000008 ENSG00000122971 ENSP00000242592 | |
| 5 NP_000009 ENSG00000072778 ENSP00000349297 | |
| ``` | |
| The `biomart()` function takes as arguments a set of genes (gene ids specified in the `filter` argument), the corresponding `mart` and `dataset`, as well as the `attributes` which shall be returned. | |
| ## Gene Ontology | |
| The `biomartr` package also enables a fast and intuitive retrieval of GO terms | |
| and additional information via the `getGO()` function. Several databases can be selected | |
| to retrieve GO annotation information for a set of query genes. So far, the `getGO()` | |
| function allows GO information retrieval from the [Ensembl Biomart](http://ensemblgenomes.org/info/access/biomart) database. | |
| In this example we will retrieve GO information for a set of _Homo sapiens_ genes | |
| stored as `hgnc_symbol`. | |
| ### GO Annotation Retrieval via BioMart | |
| The `getGO()` function takes several arguments as input to retrieve GO information from BioMart. | |
| First, the scientific name of the `organism` of interest needs to be specified. Furthermore, a set of | |
| `gene ids` as well as their corresponding `filter` notation (`GUCA2A` gene ids have `filter` notation `hgnc_symbol`; see `organismFilters()` for details) | |
| need to be specified. The `database` argument then defines the database from which GO information shall be retrieved. | |
| ```{r,eval=FALSE} | |
| # show all elements of the data.frame | |
| options(tibble.print_max = Inf) | |
| # search for GO terms of an example Homo sapiens gene | |
| GO_tbl <- biomartr::getGO(organism = "Homo sapiens", | |
| genes = "GUCA2A", | |
| filters = "hgnc_symbol") | |
| GO_tbl | |
| ``` | |
| Hence, for each _gene id_ the resulting table stores all annotated GO terms found in [Ensembl Biomart](http://ensemblgenomes.org/info/access/biomart). | |