Usage

apybiomart can be used in a project with a simple import:

import apybiomart

The main purpose of the package is to perform queries on BioMart (either synchronously or asynchronously); however, users may first need to explore the available marts, datasets, attributes and filters.

In addition to inspecting these results interactively, users can save them to a CSV file, using the --save flag on the CLI or the save=True argument in Python. A filename can optionally be specified with the --output <filename.csv> option on the CLI or the output="filename.csv" argument in Python.

Marts, datasets, attributes and filters

BioMart contains different databases, called marts, each of which contains several datasets, each related to a specific species. These datasets can be queried, and the data returned can be restricted in two ways: by requesting only particular types of information, called attributes, and by applying filters that retain only the data satisfying one or more specific criteria.

For more information, please refer to BioMart's help page.

Marts

In order to view the marts available on BioMart, the find_marts() function can be used:

from apybiomart import find_marts
find_marts()

A dataframe with the available marts is returned, with their proper name and display_name:

   Mart_ID                Mart_name
0  ENSEMBL_MART_ENSEMBL   Ensembl Genes 96
1  ENSEMBL_MART_MOUSE     Mouse strains 96
2  ENSEMBL_MART_SEQUENCE  Sequence
3  ENSEMBL_MART_ONTOLOGY  Ontology
4  ENSEMBL_MART_GENOMIC   Genomic features 96
5  ENSEMBL_MART_SNP       Ensembl Variation 96
6  ENSEMBL_MART_FUNCGEN   Ensembl Regulation 96

A CLI command is also available to retrieve the same information: apybiomart marts.

Datasets

Available datasets for a specific mart can be retrieved using the find_datasets() function:

from apybiomart import find_datasets
find_datasets(mart="ENSEMBL_MART_ENSEMBL")
# same as above, using the default mart
find_datasets()

The find_datasets() function accepts an optional mart argument, which defaults to "ENSEMBL_MART_ENSEMBL". The returned dataframe contains all the available datasets in the given mart, with their name, display_name and the mart to which they belong:

     Dataset_ID                  Dataset_name                              Mart_ID
0    rroxellana_gene_ensembl     Golden snub-nosed monkey genes (Rrox_v1)  ENSEMBL_MART_ENSEMBL
1    ggallus_gene_ensembl        Chicken genes (GRCg6a)                    ENSEMBL_MART_ENSEMBL
2    dmelanogaster_gene_ensembl  Drosophila melanogaster genes (BDGP6.22)  ENSEMBL_MART_ENSEMBL
..   ...                         ...                                       ...
181  sdorsalis_gene_ensembl      Yellowtail amberjack genes (Sedor1)       ENSEMBL_MART_ENSEMBL
182  ohni_gene_ensembl           Japanese medaka HNI genes (ASM223471v1)   ENSEMBL_MART_ENSEMBL
183  pmarinus_gene_ensembl       Lamprey genes (Pmarinus_7.0)              ENSEMBL_MART_ENSEMBL

A CLI command is also available to retrieve the same information: apybiomart datasets, whose --mart option can be used to specify which mart will be used (default is "ENSEMBL_MART_ENSEMBL").

Attributes

When querying a dataset, users may want to retrieve specific attributes; the find_attributes() function accepts an optional dataset (defaulting to "hsapiens_gene_ensembl") and gathers all the available attributes for the given dataset:

from apybiomart import find_attributes
find_attributes(dataset="hsapiens_gene_ensembl")
# same as above, using the default dataset
find_attributes()

The dataframe returned contains each attribute's name, display_name, description (where available), and the dataset to which it belongs:

      Attribute_ID             Attribute_name          Attribute_description            Dataset_ID
0     ensembl_gene_id          Gene stable ID          Stable ID of the Gene            hsapiens_gene_ensembl
1     ensembl_gene_id_version  Gene stable ID version  Versionned stable ID of the Gene hsapiens_gene_ensembl
2     ensembl_transcript_id    Transcript stable ID    Stable ID of the Transcript      hsapiens_gene_ensembl
..    ...                      ...                     ...                              ...
3348  cds_length               CDS Length                                               hsapiens_gene_ensembl
3349  cds_start                CDS start                                                hsapiens_gene_ensembl
3350  cds_end                  CDS end                                                  hsapiens_gene_ensembl

A CLI command is also available to retrieve the same information: apybiomart attributes, whose --dataset option can be used to specify which dataset will be used (default is "hsapiens_gene_ensembl").

Filters

Datasets can be queried using filters that restrict the returned information to some specific subset of interest (e.g. chromosome, start position, etc.). In order to retrieve the list of filters available for a given dataset, the find_filters() function can be used:

from apybiomart import find_filters
find_filters("hsapiens_gene_ensembl")
# same as above, using the default dataset
find_filters()

This function accepts an optional dataset argument, which defaults to "hsapiens_gene_ensembl", and returns a dataframe with the name, type and description (where available) of each filter, as well as the dataset to which it belongs:

    Filter_ID                          Filter_type  Filter_description  Dataset_ID
0   link_so_mini_closure               list                             hsapiens_gene_ensembl
1   link_go_closure                    text                             hsapiens_gene_ensembl
2   link_ensembl_transcript_stable_id  text                             hsapiens_gene_ensembl
..  ...                                ...          ...                 ...
39  germ_line_variation_source         list                             hsapiens_gene_ensembl
40  somatic_variation_source           list                             hsapiens_gene_ensembl
42  so_consequence_name                list                             hsapiens_gene_ensembl

A CLI command is also available to retrieve the same information: apybiomart filters, whose --dataset option can be used to specify which dataset will be used (default is "hsapiens_gene_ensembl").

Queries

Once the desired mart, dataset, attributes and filters have been explored (or if they were known beforehand), it is possible to query BioMart to retrieve the actual data; queries can be performed synchronously or asynchronously.

Exploring the difference between these two approaches is beyond the scope of this document. In short, with synchronous calls the client has to wait for each request to complete before moving on to the next one, whereas with asynchronous calls the client can issue another request while the first one is waiting for a response, and so on until all the requests have been sent and all the responses returned.
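
The difference can be illustrated without contacting BioMart at all. The following is a minimal sketch in which a hypothetical fake_request() coroutine stands in for a network call by simply sleeping; none of these helper names are part of apybiomart. Run one after the other, the three simulated requests take roughly the sum of their delays, while run concurrently they take roughly as long as the slowest one.

```python
import asyncio
import time


async def fake_request(name, delay):
    """Hypothetical stand-in for a network request: waits, then returns a label."""
    await asyncio.sleep(delay)
    return f"response {name}"


async def run_sequentially(delays):
    # Each simulated request must complete before the next one starts.
    return [await fake_request(i, d) for i, d in enumerate(delays)]


async def run_concurrently(delays):
    # All simulated requests are scheduled together; while one waits, the others progress.
    return await asyncio.gather(*(fake_request(i, d) for i, d in enumerate(delays)))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


delays = [0.2, 0.2, 0.2]
sequential_result, sequential_time = timed(run_sequentially(delays))
concurrent_result, concurrent_time = timed(run_concurrently(delays))
print(f"sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

Both approaches return the same responses in the same order; only the total waiting time differs.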

Simply put, apybiomart allows users to perform synchronous queries to explore the data, and asynchronous queries to group multiple queries and run them efficiently.

Synchronous Queries

Synchronous queries can be performed using the query() function, which accepts attributes and filters arguments, and an optional dataset argument (which defaults to "hsapiens_gene_ensembl"):

from apybiomart import query
query(attributes=["ensembl_gene_id", "external_gene_name"],
      filters={"chromosome_name": "1"},
      dataset="hsapiens_gene_ensembl")

The attributes are provided as a list of properties, while filters are represented by a filter name : filter value dictionary. The returned dataframe contains the result of the query, restricted according to the provided filters and attributes.
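
To make the role of the two arguments more concrete, the sketch below renders a list of attributes and a dictionary of filters as an XML document in the style accepted by BioMart's web service. The build_query_xml() helper is hypothetical and written only for illustration; whether apybiomart builds its requests in exactly this way, and with these Query settings, is an assumption.

```python
import xml.etree.ElementTree as ET


def build_query_xml(attributes, filters, dataset="hsapiens_gene_ensembl"):
    """Hypothetical helper: render attributes and filters as a BioMart-style XML query."""
    # The Query settings below are illustrative assumptions, not apybiomart's own values.
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "0",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Each "filter name: filter value" pair becomes one Filter element.
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    # Each requested attribute becomes one Attribute element (one output column).
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


xml_query = build_query_xml(
    attributes=["ensembl_gene_id", "external_gene_name"],
    filters={"chromosome_name": "1"},
)
print(xml_query)
```

Each attribute adds a column to the result, while each filter narrows down the rows that are returned.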

Asynchronous Queries

Asynchronous queries can be performed using the aquery() function, which works just like query(), the only difference being that it is a coroutine and therefore has to be run on the asyncio event loop:

import asyncio
from apybiomart import aquery
loop = asyncio.get_event_loop()
loop.run_until_complete(
    aquery(attributes=["ensembl_gene_id", "external_gene_name"],
           filters={"chromosome_name": "1"},
           dataset="hsapiens_gene_ensembl")
)

This allows multiple queries to be grouped together, and the event loop will take care of scheduling them for execution:

import asyncio
from apybiomart import aquery
loop = asyncio.get_event_loop()
tasks = [aquery(attributes=["ensembl_gene_id", "external_gene_name"],
                filters={"chromosome_name": str(i)},
                dataset="hsapiens_gene_ensembl") for i in range(1, 4)]
loop.run_until_complete(asyncio.gather(*tasks))

It is of course possible to assign the query results to one or more specific variables, for future usage:

# replacing last line of the previous code snippet
single_result = loop.run_until_complete(asyncio.gather(*tasks))
# or using multiple variables
chrom1, chrom2, chrom3 = loop.run_until_complete(asyncio.gather(*tasks))
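
On recent Python versions the same pattern is often written with asyncio.run(), which creates and closes the event loop itself. The following is a minimal sketch that uses a hypothetical stub_aquery() coroutine in place of aquery(), so that it runs without network access; replacing the stub with aquery() and the same keyword arguments is assumed to behave analogously. It also shows one way of keeping track of which result belongs to which chromosome.

```python
import asyncio


async def stub_aquery(attributes, filters, dataset="hsapiens_gene_ensembl"):
    """Hypothetical stand-in for aquery(): echoes its arguments instead of querying BioMart."""
    await asyncio.sleep(0)  # hand control back to the event loop, as a pending request would
    return {"dataset": dataset, "attributes": attributes, "filters": filters}


async def main():
    chromosomes = ["1", "2", "3"]
    tasks = [stub_aquery(attributes=["ensembl_gene_id", "external_gene_name"],
                         filters={"chromosome_name": chrom})
             for chrom in chromosomes]
    # gather() returns the results in the same order as the awaitables it was given,
    # so they can be paired back with the chromosome names.
    results = await asyncio.gather(*tasks)
    return dict(zip(chromosomes, results))


results_by_chromosome = asyncio.run(main())
print(list(results_by_chromosome))
```

Storing the results in a dictionary avoids relying on positional unpacking when the number of queries changes.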

Please refer to the asyncio documentation for more information.