# Basic usage

This notebook shows some basic usage of the genomic-features package.

In [1]:
import genomic_features as gf

## Retrieving Ensembl gene annotations

We can load annotation tables using {func}`genomic_features.ensembl.annotation`.

In [19]:
ensdb = gf.ensembl.annotation(species="Hsapiens", version="98")
ensdb

Downloading data from 'https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v98/EnsDb.Hsapiens.v98.sqlite' to file '/home/jovyan/.cache/genomic-features/87408fa36573814f9212483e6e4a939d-EnsDb.Hsapiens.v98.sqlite'.
100%|████████████████████████████████████████| 386M/386M [00:00<00:00, 243GB/s]
SHA256 hash of downloaded file: d02484a69b97465b66121be4c4aada06490267836a12f1c725f5d29ead3e4098
Use this value as the 'known_hash' argument of 'pooch.retrieve' to ensure that the file hasn't changed if it is downloaded again in the future.


EnsemblDB(organism='Homo sapiens', ensembl_release='98')

These tables were created for the [`ensembldb` Bioconductor package](https://bioconductor.org/packages/release/bioc/html/ensembldb.html) {cite:p}`Rainer_2019`, and are automatically downloaded and cached from the [`AnnotationHub`](https://bioconductor.org/packages/release/bioc/html/AnnotationHub.html) resource.

We can check which Ensembl versions are available for a given species using the {func}`genomic_features.ensembl.list_ensdb_annotations` utility function.

In [21]:
gf.ensembl.list_ensdb_annotations(species='Mmusculus')

| | Species | Ensembl_version |
|---|---|---|
| 37 | Mmusculus | 87 |
| 105 | Mmusculus | 88 |
| 173 | Mmusculus | 89 |
| 247 | Mmusculus | 90 |
| 330 | Mmusculus | 91 |
| 419 | Mmusculus | 92 |
| 510 | Mmusculus | 93 |
| 621 | Mmusculus | 94 |
| 748 | Mmusculus | 95 |
| 894 | Mmusculus | 96 |


## Using `EnsemblDB` objects

The {class}`genomic_features.ensembl.EnsemblDB` class is the access point to an annotation. It is an interface to a SQLite database retrieved from AnnotationHub (as shown above). Provenance information is available through the `metadata` attribute:

In [22]:
ensdb.metadata

{'Db type': 'EnsDb',
 'Type of Gene ID': 'Ensembl Gene ID',
 'Supporting package': 'ensembldb',
 'Db created by': 'ensembldb package from Bioconductor',
 'script_version': '0.3.5',
 'Creation time': 'Mon Nov 18 21:12:12 2019',
 'ensembl_version': '98',
 'ensembl_host': 'localhost',
 'Organism': 'Homo sapiens',
 'taxonomy_id': '9606',
 'genome_build': 'GRCh38',
 'DBSCHEMAVERSION': '2.1'}
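
Because the annotation is backed by a SQLite file, provenance fields like these can be thought of as key-value rows. The following is a minimal sketch using Python's built-in `sqlite3` module and an in-memory database. The table name `metadata` and its `name`/`value` columns are assumptions made for illustration; the notebook does not show the actual schema or how `genomic_features` reads it.

```python
import sqlite3

# Build a tiny in-memory stand-in for an annotation database.
# The layout metadata(name, value) is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?)",
    [
        ("Organism", "Homo sapiens"),
        ("ensembl_version", "98"),
        ("genome_build", "GRCh38"),
    ],
)

# Read the key-value rows back into a plain dictionary.
metadata = dict(conn.execute("SELECT name, value FROM metadata"))
conn.close()

print(metadata["genome_build"])
```

Checking `genome_build` and `ensembl_version` in this way before combining an annotation with other data helps avoid mixing coordinates from different assemblies.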

An `EnsemblDB` object can also be queried for genomic features:

In [23]:
genes = ensdb.genes()
genes.head()

| | gene_id | gene_name | gene_biotype | gene_seq_start | gene_seq_end | seq_name | seq_strand | seq_coord_system | description | gene_id_version |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ENSG00000000003 | TSPAN6 | protein_coding | 100627108 | 100639991 | X | -1 | chromosome | tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858] | ENSG00000000003.15 |
| 1 | ENSG00000000005 | TNMD | protein_coding | 100584936 | 100599885 | X | 1 | chromosome | tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757] | ENSG00000000005.6 |
| 2 | ENSG00000000419 | DPM1 | protein_coding | 50934867 | 50958555 | 20 | -1 | chromosome | dolichyl-phosphate mannosyltransferase subunit... | ENSG00000000419.12 |
| 3 | ENSG00000000457 | SCYL3 | protein_coding | 169849631 | 169894267 | 1 | -1 | chromosome | SCY1 like pseudokinase 3 [Source:HGNC Symbol;A... | ENSG00000000457.14 |
| 4 | ENSG00000000460 | C1orf112 | protein_coding | 169662007 | 169854080 | 1 | 1 | chromosome | chromosome 1 open reading frame 112 [Source:HG... | ENSG00000000460.17 |


### Filters

{mod}`genomic_features.filters` defines a number of filters to use with these annotations. You can filter by specific columns:

In [24]:
ensdb.genes(filter=gf.filters.GeneBioTypeFilter("Mt_tRNA")).head()

| | gene_id | gene_name | gene_biotype | gene_seq_start | gene_seq_end | seq_name | seq_strand | seq_coord_system | description | gene_id_version |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ENSG00000209082 | MT-TL1 | Mt_tRNA | 3230 | 3304 | MT | 1 | chromosome | mitochondrially encoded tRNA-Leu (UUA/G) 1 [So... | ENSG00000209082.1 |
| 1 | ENSG00000210049 | MT-TF | Mt_tRNA | 577 | 647 | MT | 1 | chromosome | mitochondrially encoded tRNA-Phe (UUU/C) [Sour... | ENSG00000210049.1 |
| 2 | ENSG00000210077 | MT-TV | Mt_tRNA | 1602 | 1670 | MT | 1 | chromosome | mitochondrially encoded tRNA-Val (GUN) [Source... | ENSG00000210077.1 |
| 3 | ENSG00000210100 | MT-TI | Mt_tRNA | 4263 | 4331 | MT | 1 | chromosome | mitochondrially encoded tRNA-Ile (AUU/C) [Sour... | ENSG00000210100.1 |
| 4 | ENSG00000210107 | MT-TQ | Mt_tRNA | 4329 | 4400 | MT | -1 | chromosome | mitochondrially encoded tRNA-Gln (CAA/G) [Sour... | ENSG00000210107.1 |


Or by genomic range:

In [25]:
ensdb.genes(filter=gf.filters.GeneRangesFilter("1:10000-20000"))

| | gene_id | gene_name | gene_biotype | gene_seq_start | gene_seq_end | seq_name | seq_strand | seq_coord_system | description | gene_id_version |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ENSG00000223972 | DDX11L1 | transcribed_unprocessed_pseudogene | 11869 | 14409 | 1 | 1 | chromosome | DEAD/H-box helicase 11 like 1 [Source:HGNC Sym... | ENSG00000223972.5 |
| 1 | ENSG00000227232 | WASH7P | unprocessed_pseudogene | 14404 | 29570 | 1 | -1 | chromosome | WASP family homolog 7, pseudogene [Source:HGNC... | ENSG00000227232.5 |
| 2 | ENSG00000278267 | MIR6859-1 | miRNA | 17369 | 17436 | 1 | -1 | chromosome | microRNA 6859-1 [Source:HGNC Symbol;Acc:HGNC:5... | ENSG00000278267.1 |
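
Note that WASH7P ends at position 29570, beyond the end of the queried range, so the result above is consistent with "any overlap" semantics rather than full containment. The sketch below illustrates that reading in plain Python. `parse_range` and `overlaps` are hypothetical helpers written for this example and are not part of `genomic_features`; the coordinates are copied from tables in this notebook.

```python
from typing import NamedTuple


class GenomicRange(NamedTuple):
    seq_name: str
    start: int
    end: int


def parse_range(text: str) -> GenomicRange:
    """Parse a 'seq_name:start-end' string such as '1:10000-20000'."""
    seq_name, _, span = text.partition(":")
    start, _, end = span.partition("-")
    return GenomicRange(seq_name, int(start), int(end))


def overlaps(query: GenomicRange, seq_name: str, start: int, end: int) -> bool:
    """True if the feature shares at least one position with the query."""
    return seq_name == query.seq_name and start <= query.end and end >= query.start


# (gene_name, seq_name, gene_seq_start, gene_seq_end)
genes_subset = [
    ("DDX11L1", "1", 11869, 14409),
    ("WASH7P", "1", 14404, 29570),
    ("MIR6859-1", "1", 17369, 17436),
    ("AL627309.1", "1", 89295, 133723),  # lies outside the query range
]

query = parse_range("1:10000-20000")
hits = [name for name, seq, start, end in genes_subset if overlaps(query, seq, start, end)]
print(hits)
```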


Logical operations (`&`, `|`, and `~`) on filters are also possible:

In [26]:
ensdb.genes(
    filter=gf.filters.GeneBioTypeFilter("lncRNA") & gf.filters.GeneRangesFilter("1:10000-20000")
)

| | gene_id | gene_name | gene_biotype | gene_seq_start | gene_seq_end | seq_name | seq_strand | seq_coord_system | description | gene_id_version |
|---|---|---|---|---|---|---|---|---|---|---|
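
The result is empty because none of the three genes overlapping `1:10000-20000` has the `lncRNA` biotype. As a rough illustration of how `&`, `|`, and `~` can combine conditions, the sketch below overloads those operators on a small predicate class. `Predicate`, `biotype_is`, and `starts_before` are hypothetical names invented for this example; this is not how `genomic_features.filters` is implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

Row = Dict[str, Any]


@dataclass(frozen=True)
class Predicate:
    """Hypothetical composable row test, for illustration only."""

    test: Callable[[Row], bool]

    def __and__(self, other: "Predicate") -> "Predicate":
        return Predicate(lambda row: self.test(row) and other.test(row))

    def __or__(self, other: "Predicate") -> "Predicate":
        return Predicate(lambda row: self.test(row) or other.test(row))

    def __invert__(self) -> "Predicate":
        return Predicate(lambda row: not self.test(row))

    def __call__(self, row: Row) -> bool:
        return self.test(row)


def biotype_is(value: str) -> Predicate:
    return Predicate(lambda row: row["gene_biotype"] == value)


def starts_before(position: int) -> Predicate:
    return Predicate(lambda row: row["gene_seq_start"] < position)


# Rows copied from the range query shown earlier in this notebook.
rows = [
    {"gene_name": "DDX11L1", "gene_biotype": "transcribed_unprocessed_pseudogene", "gene_seq_start": 11869},
    {"gene_name": "WASH7P", "gene_biotype": "unprocessed_pseudogene", "gene_seq_start": 14404},
    {"gene_name": "MIR6859-1", "gene_biotype": "miRNA", "gene_seq_start": 17369},
]

lnc_and_early = biotype_is("lncRNA") & starts_before(20000)
not_mirna = ~biotype_is("miRNA")
mirna_or_early = biotype_is("miRNA") | starts_before(12000)

print([r["gene_name"] for r in rows if lnc_and_early(r)])
print([r["gene_name"] for r in rows if not_mirna(r)])
print([r["gene_name"] for r in rows if mirna_or_early(r)])
```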


### Column selectors

Using the `cols` argument, you can get annotations from other tables in the database.

In [27]:
ensdb.genes(cols=["gene_id", "tx_id", "gene_name", "protein_id", "uniprot_id"]).head()

| | gene_id | tx_id | gene_name | protein_id | uniprot_id |
|---|---|---|---|---|---|
| 0 | ENSG00000000003 | ENST00000373020 | TSPAN6 | ENSP00000362111 | O43657 |
| 1 | ENSG00000000003 | ENST00000612152 | TSPAN6 | ENSP00000482130 | A0A087WYV6 |
| 2 | ENSG00000000003 | ENST00000614008 | TSPAN6 | ENSP00000482894 | A0A087WZU5 |
| 3 | ENSG00000000005 | ENST00000373031 | TNMD | ENSP00000362122 | Q9H2S6 |
| 4 | ENSG00000000419 | ENST00000371588 | DPM1 | ENSP00000360644 | A0A0S2Z4Y5 |
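
Requesting columns that live in different tables implies a join, which is why `ENSG00000000003` appears once per transcript in the output above. The sketch below reproduces that effect with `sqlite3` and an in-memory database. The two-table layout (`gene` and `tx`) is a simplified assumption for illustration, not the actual database schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema: one row per gene and one row per transcript.
conn.executescript(
    """
    CREATE TABLE gene (gene_id TEXT PRIMARY KEY, gene_name TEXT);
    CREATE TABLE tx (tx_id TEXT PRIMARY KEY, gene_id TEXT);
    """
)
conn.executemany(
    "INSERT INTO gene VALUES (?, ?)",
    [("ENSG00000000003", "TSPAN6"), ("ENSG00000000005", "TNMD")],
)
conn.executemany(
    "INSERT INTO tx VALUES (?, ?)",
    [
        ("ENST00000373020", "ENSG00000000003"),
        ("ENST00000612152", "ENSG00000000003"),
        ("ENST00000614008", "ENSG00000000003"),
        ("ENST00000373031", "ENSG00000000005"),
    ],
)

# Joining gene to tx yields one output row per (gene, transcript) pair.
joined = conn.execute(
    """
    SELECT gene.gene_id, tx.tx_id, gene.gene_name
    FROM gene JOIN tx ON tx.gene_id = gene.gene_id
    ORDER BY gene.gene_id, tx.tx_id
    """
).fetchall()
conn.close()

for row in joined:
    print(row)
```

If one row per gene is needed downstream, the joined result has to be deduplicated or aggregated on `gene_id`.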


### `chromosomes()`

Chromosome lengths for this annotation (useful for downstream operations) are also available through the `chromosomes` method.

In [28]:
ensdb.chromosomes()

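| | seq_name | seq_length | is_circular |
|---|---|---|---|
| 0 | X | 156040895 | 0 |
| 1 | 20 | 64444167 | 0 |
| 2 | 1 | 248956422 | 0 |
| 3 | 6 | 170805979 | 0 |
| 4 | 3 | 198295559 | 0 |
| ... | ... | ... | ... |
| 449 | LRG_311 | 115492 | 0 |
| 450 | LRG_721 | 33396 | 0 |
| 451 | LRG_741 | 231167 | 0 |
| 452 | LRG_763 | 176286 | 0 |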
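
One example of such a downstream operation is extending a gene by a flanking window without running past the ends of the chromosome. The sketch below uses lengths copied from the table above. `flank_window` is a hypothetical helper, and treating coordinates as 1-based and inclusive is an assumption of this example.

```python
# Chromosome lengths copied from the chromosomes() output above.
chrom_lengths = {"X": 156040895, "20": 64444167, "1": 248956422}


def flank_window(seq_name, start, end, flank, lengths):
    """Extend [start, end] by `flank` bases on each side, clipped to the chromosome.

    Coordinates are treated as 1-based and inclusive (an assumption here).
    """
    new_start = max(1, start - flank)
    new_end = min(lengths[seq_name], end + flank)
    return new_start, new_end


# DDX11L1 (1:11869-14409) sits near the start of chromosome 1,
# so the left flank is clipped at position 1.
window = flank_window("1", 11869, 14409, 50_000, chrom_lengths)
print(window)
```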


## Adding annotations to an AnnData object

We provide a utility function, `annotate`, that merges the feature table of an scverse dataset (for example `adata.var`) with a gene annotation table. In this example we use subsampled scRNA-seq data from [Litviňuková et al. 2020](https://www.nature.com/articles/s41586-020-2797-4), in which reads were mapped to the hg38 reference genome with [CellRanger v3.0.1](https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/3.0), which uses Ensembl release 98 for gene annotations.

In [14]:
import pandas as pd
import scvi
adata = scvi.data.heart_cell_atlas_subsampled()

INFO     File data/hca_subsampled_20k.h5ad already downloaded


In [18]:
adata.var = adata.var[['gene_ids-Harvard-Nuclei']]

In [30]:
annotated_var = gf.annotate(adata.var, genes, var_on='gene_ids-Harvard-Nuclei')
annotated_var


| | gene_ids-Harvard-Nuclei | gene_id | gene_name | gene_biotype | gene_seq_start | gene_seq_end | seq_name | seq_strand | seq_coord_system | description | gene_id_version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AL627309.1 | ENSG00000238009 | ENSG00000238009 | AL627309.1 | lncRNA | 89295.0 | 133723.0 | 1 | -1.0 | chromosome | novel transcript | ENSG00000238009.6 |
| AC114498.1 | ENSG00000235146 | ENSG00000235146 | AC114498.1 | lncRNA | 587629.0 | 594768.0 | 1 | 1.0 | chromosome | novel transcript | ENSG00000235146.2 |
| AL669831.2 | ENSG00000229905 | ENSG00000229905 | AL669831.2 | lncRNA | 760911.0 | 761989.0 | 1 | 1.0 | chromosome | novel transcript | ENSG00000229905.1 |
| AL669831.5 | ENSG00000237491 | ENSG00000237491 | LINC01409 | lncRNA | 778747.0 | 810065.0 | 1 | 1.0 | chromosome | long intergenic non-protein coding RNA 1409 [S... | ENSG00000237491.10 |
| FAM87B | ENSG00000177757 | ENSG00000177757 | FAM87B | lncRNA | 817371.0 | 819837.0 | 1 | 1.0 | chromosome | family with sequence similarity 87 member B [S... | ENSG00000177757.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| AC007325.2 | ENSG00000277196 | ENSG00000277196 | AC007325.2 | protein_coding | 138082.0 | 161852.0 | KI270734.1 | -1.0 | scaffold | proline dehydrogenase 1, mitochondrial [Source... | ENSG00000277196.4 |
| BX072566.1 | ENSG00000277630 | ENSG00000277630 | BX072566.1 | protein_coding | 108007.0 | 139659.0 | GL000213.1 | -1.0 | scaffold | POTE ankyrin domain family member D-like [Sour... | ENSG00000277630.4 |
| AL354822.1 | ENSG00000278384 | ENSG00000278384 | AL354822.1 | protein_coding | 51867.0 | 54893.0 | GL000218.1 | -1.0 | scaffold | | ENSG00000278384.1 |
| AC004556.1 | ENSG00000276345 | ENSG00000276345 | AC004556.3 | protein_coding | 2585.0 | 11802.0 | KI270721.1 | 1.0 | scaffold | 39S ribosomal protein L23, mitochondrial [Sour... | ENSG00000276345.1 |


In [28]:
adata.var = annotated_var.copy()
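
Conceptually, this kind of annotation is a join keyed on the gene ID. The sketch below assumes left-join behaviour: every row of the feature table is kept, and rows without a match receive missing values. Whether `gf.annotate` behaves exactly this way is not shown in this notebook, although the float-typed coordinates in the annotated table are consistent with it, since pandas upcasts integer columns that contain missing values. `left_join` is a hypothetical helper and the `MADE_UP` entry is invented for illustration.

```python
def left_join(var_table, annotation, key_column):
    """Hypothetical helper: keep every var row and add annotation fields by ID.

    `var_table` maps a var name to a row dict; `annotation` maps a gene ID
    to its annotation dict. Rows without a match get None for every field.
    """
    fields = sorted({field for record in annotation.values() for field in record})
    merged = {}
    for var_name, row in var_table.items():
        match = annotation.get(row[key_column], {})
        merged[var_name] = {**row, **{field: match.get(field) for field in fields}}
    return merged


var_table = {
    "FAM87B": {"gene_ids-Harvard-Nuclei": "ENSG00000177757"},
    "AL669831.5": {"gene_ids-Harvard-Nuclei": "ENSG00000237491"},
    "MADE_UP": {"gene_ids-Harvard-Nuclei": "ENSG_NOT_IN_TABLE"},  # illustrative only
}
annotation = {
    "ENSG00000177757": {"gene_name": "FAM87B", "gene_seq_start": 817371},
    "ENSG00000237491": {"gene_name": "LINC01409", "gene_seq_start": 778747},
}

merged = left_join(var_table, annotation, "gene_ids-Harvard-Nuclei")
unmatched = [name for name, row in merged.items() if row["gene_name"] is None]
print(unmatched)
```

Joining on the stable Ensembl ID rather than on the gene symbol also explains why the row indexed as `AL669831.5` carries the name `LINC01409` in the annotated table: the symbol attached to an ID can differ between the dataset and the annotation.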