# Genomes

In [1]:
import biu
where = '/exports/molepi/tgehrmann/data/'
biu.config.settings.setWhere(where)

W: The following dependencies of BIU are missing. Functionality of BIU will be affected.
W:   tabix, intervaltree, vcf
W: Some optional dependencies of BIU are missing. Functionality of BIU may be affected.
W:   matplotlib_venn, fastcluster, xlrd, openpyxl
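
The warnings above list modules that BIU could not import. As a minimal sketch, the same check can be reproduced with the standard library before loading BIU. The module names are copied from the warning text; the helper `missing_modules` is hypothetical and not part of BIU, and the package name to install may differ from the module name.

```python
import importlib.util

# Module names copied from the warning text; BIU may detect them differently.
REQUIRED = ["tabix", "intervaltree", "vcf"]
OPTIONAL = ["matplotlib_venn", "fastcluster", "xlrd", "openpyxl"]


def missing_modules(names):
    """Return the module names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("missing required:", missing_modules(REQUIRED))
print("missing optional:", missing_modules(OPTIONAL))
```

An empty list for the required modules means the first warning should no longer appear.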


## Structures in genome objects

Genome objects expose several structures. Some sources provide more information than others, so not every structure is available for every genome.

 * gff: Gene structure information and annotations (GFF3 format)
 * genome: Genome sequence (Fasta format)
 * cds: Coding sequences of transcripts (Fasta format)
 * aa: Amino acid sequences (Fasta format)
 * ids: A mapping of identifiers in the genome (e.g. gene, transcript, protein, symbol)
 * orthology: An index of known orthologs in other species

## Ensembl Genomes

You can access any genome on Ensembl with the biu.genomes.Ensembl class.
Given an organism name and a release number, it retrieves the GFF annotations, the genome sequence, and the coding sequence and amino acid Fasta files.

### List available organisms

In [2]:
biu.genomes.Ensembl.organisms()

Organisms in Ensembl, release 92:
 * ailuropoda_melanoleuca
 * anas_platyrhynchos
 * anolis_carolinensis
 * aotus_nancymaae
 * astyanax_mexicanus
 * bos_taurus
 * caenorhabditis_elegans
 * callithrix_jacchus
 * canis_familiaris
 * capra_hircus
 * carlito_syrichta
 * cavia_aperea
 * cavia_porcellus
 * cebus_capucinus
 * cercocebus_atys
 * chinchilla_lanigera
 * chlorocebus_sabaeus
 * choloepus_hoffmanni
 * ciona_intestinalis
 * ciona_savignyi
 * colobus_angolensis_palliatus
 * cricetulus_griseus_chok1gshd
 * cricetulus_griseus_crigri
 * danio_rerio
 * dasypus_novemcinctus
 * dipodomys_ordii
 * drosophila_melanogaster
 * echinops_telfairi
 * equus_caballus
 * erinaceus_europaeus
 * felis_catus
 * ficedula_albicollis
 * fukomys_damarensis
 * gadus_morhua
 * gallus_gallus
 * gasterosteus_aculeatus
 * gorilla_gorilla
 * heterocephalus_glaber_female
 * heterocephalus_glaber_male
 * homo_sapiens
 * ictidomys_tridecemlineatus
 * jaculus_jaculus
 * latimeria_chalumnae
 * lepisosteus_oculatus
 *

In [3]:
# Default is the GRCh38 human genome, release 92
genome = biu.genomes.Ensembl()

In [4]:
print(genome)

Ensembl object
 Genome : ensembl_92.homo_sapiens
 Objects:
  * [ ] gff
  * [ ] genome
  * [ ] cds
  * [ ] aa
  * [ ] ids
 Files:
  * [ ] gff : /exports/molepi/tgehrmann/data/ensembl_92.homo_sapiens/genes.gff3
  * [ ] genome : /exports/molepi/tgehrmann/data/ensembl_92.homo_sapiens/dna.fasta
  * [ ] cds : /exports/molepi/tgehrmann/data/ensembl_92.homo_sapiens/cds.fa
  * [ ] aa : /exports/molepi/tgehrmann/data/ensembl_92.homo_sapiens/aa.fa
  * [ ] ids : /exports/molepi/tgehrmann/data/ensembl_92.homo_sapiens/ids.tsv
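
The file listing above follows a regular layout: one directory per release and organism under the configured data location, with fixed file names inside it. The sketch below rebuilds those paths from the printed output, including the ensembl_grch37.92 prefix that appears in the GRCh37 listing later in this document. The function `genome_files` is a hypothetical helper for illustration, not BIU's own implementation.

```python
from pathlib import PurePosixPath

# File names as they appear in the printed listing.
FILES = {
    "gff": "genes.gff3",
    "genome": "dna.fasta",
    "cds": "cds.fa",
    "aa": "aa.fa",
    "ids": "ids.tsv",
}


def genome_files(where, organism="homo_sapiens", release=92, grch37=False):
    """Hypothetical helper: rebuild the paths shown by print(genome)."""
    # Standard builds print as ensembl_<release>.<organism>,
    # the GRCh37 build as ensembl_grch37.<release>.<organism>.
    prefix = f"ensembl_grch37.{release}" if grch37 else f"ensembl_{release}"
    root = PurePosixPath(where) / f"{prefix}.{organism}"
    return {key: str(root / name) for key, name in FILES.items()}


for key, path in genome_files("/exports/molepi/tgehrmann/data/").items():
    print(key, ":", path)
```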



### Other genomes

In [5]:
# Load the mouse genome
genome = biu.genomes.Ensembl(organism='mus_musculus')
print(genome)

Ensembl object
 Genome : ensembl_92.mus_musculus
 Objects:
  * [ ] gff
  * [ ] genome
  * [ ] cds
  * [ ] aa
  * [ ] ids
 Files:
  * [ ] gff : /exports/molepi/tgehrmann/data/ensembl_92.mus_musculus/genes.gff3
  * [ ] genome : /exports/molepi/tgehrmann/data/ensembl_92.mus_musculus/dna.fasta
  * [ ] cds : /exports/molepi/tgehrmann/data/ensembl_92.mus_musculus/cds.fa
  * [X] aa : /exports/molepi/tgehrmann/data/ensembl_92.mus_musculus/aa.fa
  * [ ] ids : /exports/molepi/tgehrmann/data/ensembl_92.mus_musculus/ids.tsv



In [6]:
# Load the mouse genome, release 91
genome = biu.genomes.Ensembl(release=91, organism='mus_musculus')
print(genome)

Ensembl object
 Genome : ensembl_91.mus_musculus
 Objects:
  * [ ] gff
  * [ ] genome
  * [ ] cds
  * [ ] aa
  * [ ] ids
 Files:
  * [ ] gff : /exports/molepi/tgehrmann/data/ensembl_91.mus_musculus/genes.gff3
  * [ ] genome : /exports/molepi/tgehrmann/data/ensembl_91.mus_musculus/dna.fasta
  * [ ] cds : /exports/molepi/tgehrmann/data/ensembl_91.mus_musculus/cds.fa
  * [ ] aa : /exports/molepi/tgehrmann/data/ensembl_91.mus_musculus/aa.fa
  * [X] ids : /exports/molepi/tgehrmann/data/ensembl_91.mus_musculus/ids.tsv



In [7]:
print(genome.ids)

Indexed TSV Object
 Filename: /exports/molepi/tgehrmann/data/ensembl_91.mus_musculus/ids.tsv
 Indexes:
  * [ ] gene
  * [ ] transcript
  * [ ] protein
  * [ ] symbol



In [8]:
genome.ids.table[:10]

Unnamed: 0,gene,transcript,protein,symbol
0,ENSMUSG00000107099.3,ENSMUST00000202211.1,ENSMUSP00000144375.1,Slc22a12
1,ENSMUSG00000107099.3,ENSMUST00000202867.3,ENSMUSP00000144526.1,Slc22a12
2,ENSMUSG00000107104.3,ENSMUST00000200719.1,ENSMUSP00000144689.1,Nrxn2
3,ENSMUSG00000107104.3,ENSMUST00000201188.3,ENSMUSP00000144068.1,Nrxn2
4,ENSMUSG00000107104.3,ENSMUST00000201577.1,ENSMUSP00000144493.1,Nrxn2
5,ENSMUSG00000107104.3,ENSMUST00000201938.3,ENSMUSP00000144389.1,Nrxn2
6,ENSMUSG00000107104.3,ENSMUST00000202463.3,ENSMUSP00000144319.1,Nrxn2
7,ENSMUSG00000107104.3,ENSMUST00000202580.3,ENSMUSP00000144625.1,Nrxn2
8,ENSMUSG00000107104.3,ENSMUST00000201950.3,ENSMUSP00000144193.1,Nrxn2
9,ENSMUSG00000107104.3,ENSMUST00000202301.3,ENSMUSP00000143991.1,Nrxn2
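
To illustrate what an index over this table provides, here is a minimal standard-library sketch that maps each value of the gene, transcript, protein and symbol columns to the rows containing it. It uses the first four rows of the table above as inline data. The function `build_indexes` is hypothetical and only mimics the idea of the indexed TSV object; it is not how BIU implements it.

```python
import csv
import io
from collections import defaultdict

# First rows of the table above, kept inline so the example is self-contained.
TABLE = """gene,transcript,protein,symbol
ENSMUSG00000107099.3,ENSMUST00000202211.1,ENSMUSP00000144375.1,Slc22a12
ENSMUSG00000107099.3,ENSMUST00000202867.3,ENSMUSP00000144526.1,Slc22a12
ENSMUSG00000107104.3,ENSMUST00000200719.1,ENSMUSP00000144689.1,Nrxn2
ENSMUSG00000107104.3,ENSMUST00000201188.3,ENSMUSP00000144068.1,Nrxn2
"""


def build_indexes(text, columns=("gene", "transcript", "protein", "symbol")):
    """Map every value of every indexed column to the rows that contain it."""
    indexes = {column: defaultdict(list) for column in columns}
    for row in csv.DictReader(io.StringIO(text)):
        for column in columns:
            indexes[column][row[column]].append(row)
    return indexes


indexes = build_indexes(TABLE)
# All transcripts recorded for the symbol Nrxn2 in this excerpt.
nrxn2_transcripts = [row["transcript"] for row in indexes["symbol"]["Nrxn2"]]
print(nrxn2_transcripts)
```

Because one gene has several transcripts, each index entry holds a list of rows rather than a single row.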


### GRCh37 Ensembl Genome
Ensembl maintains the GRCh37 build of the human genome separately. It can be accessed by passing grch37=True to biu.genomes.Ensembl.

In [9]:
hg37 = biu.genomes.Ensembl(grch37=True)

In [10]:
print(hg37)

Ensembl object
 Genome : ensembl_grch37.92.homo_sapiens
 Objects:
  * [ ] gff
  * [ ] genome
  * [ ] cds
  * [ ] aa
  * [ ] ids
 Files:
  * [ ] gff : /exports/molepi/tgehrmann/data/ensembl_grch37.92.homo_sapiens/genes.gff3
  * [ ] genome : /exports/molepi/tgehrmann/data/ensembl_grch37.92.homo_sapiens/dna.fasta
  * [ ] cds : /exports/molepi/tgehrmann/data/ensembl_grch37.92.homo_sapiens/cds.fa
  * [X] aa : /exports/molepi/tgehrmann/data/ensembl_grch37.92.homo_sapiens/aa.fa
  * [ ] ids : /exports/molepi/tgehrmann/data/ensembl_grch37.92.homo_sapiens/ids.tsv



In [11]:
#print(hg37.aa)

In [12]:
#hg37.aa['ENSP00000456546.1']

## Wormbase Genomes

You can also download the genomes available on Wormbase. Note that the CDS is not downloaded. The organisms currently defined in Wormbase are:

In [2]:
biu.genomes.Wormbase.organisms()

Organisms in Wormbase


D: curl -L  'https://wormbase.org/rest/widget/index/all/all/downloads' > '/exports/molepi/tgehrmann/data/downloads/2a4190087a93236b6560fbb1faee17454ea483bd'


 * Brugia_malayi
 * Caenorhabditis_angaria
 * Caenorhabditis_brenneri
 * Caenorhabditis_briggsae
 * Caenorhabditis_elegans
 * Caenorhabditis_elegans.1
 * Caenorhabditis_japonica
 * Caenorhabditis_nigoni
 * Caenorhabditis_remanei
 * Caenorhabditis_remanei.1
 * Caenorhabditis_sinica
 * Caenorhabditis_tropicalis
 * Onchocerca_volvulus
 * Pristionchus_pacificus
 * Panagrellus_redivivus
 * Strongyloides_ratti
 * Trichuris_muris
 * Romanomermis_culicivorax
 * Soboliphyme_baturini
 * Trichinella_britovi
 * Trichinella_murrelli
 * Trichinella_nativa
 * Trichinella_nativa.1
 * Trichinella_nelsoni
 * Trichinella_papuae
 * Trichinella_patagoniensis
 * Trichinella_pseudospiralis
 * Trichinella_pseudospiralis.1
 * Trichinella_pseudospiralis.2
 * Trichinella_pseudospiralis.3
 * Trichinella_pseudospiralis.4
 * Trichinella_sp._T6
 * Trichinella_sp._T8
 * Trichinella_sp._T9
 * Trichinella_spiralis
 * Trichinella_spiralis.1
 * Trichinella_zimbabwensis
 * Trichuris_muris.1
 * Trichuris_suis
 * Trichuris_

In [4]:
worm = biu.genomes.Wormbase()
print(worm.ids)

D: curl -L  'ftp://ftp.wormbase.org/pub/wormbase/releases/WS266/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS266.protein.fa.gz' > '/exports/molepi/tgehrmann/data/downloads/4f0d56643d648ba315bb0069ba1df6075c84f36f'
D: gunzip < '/exports/molepi/tgehrmann/data/downloads/4f0d56643d648ba315bb0069ba1df6075c84f36f' > '/exports/molepi/tgehrmann/data/downloads/4f0d56643d648ba315bb0069ba1df6075c84f36f.gunzipped'
D: Fasta input source is file


Indexed TSV Map Object
 Filename: /exports/molepi/tgehrmann/data/data/genomes/wormbase/Caenorhabditis_elegans/map.tsv
 Indexes:
  * [ ] gene
  * [ ] protein
  * [ ] peptide
  * [ ] uniprot
  * [ ] insdc



D: cp -R -T '/exports/molepi/tgehrmann/data/downloads/4f0d56643d648ba315bb0069ba1df6075c84f36f.gunzipped.func' '/exports/molepi/tgehrmann/data/data/genomes/wormbase/Caenorhabditis_elegans/map.tsv'


In [15]:
worm.orthology.Caenorhabditis_brenneri['WBGene00156759']

[TSVIndexRow(gene='WBGene00000005', Brugia_malayi=None, Caenorhabditis_angaria=None, Caenorhabditis_brenneri='WBGene00156759', Caenorhabditis_briggsae='WBGene00041961', Caenorhabditis_elegans=None, Caenorhabditis_japonica='WBGene00135829', Caenorhabditis_nigoni='PRJNA384657:Cnig_chr_I.g3548', Caenorhabditis_remanei='WBGene00072979', Caenorhabditis_sinica='PRJNA194557:Csp5_scaffold_04004.g31813', Caenorhabditis_tropicalis='PRJNA53597:Csp11.Scaffold630.g19142', Danio_rerio='ENSDARG00000055226', Drosophila_melanogaster=None, Homo_sapiens='ENSG00000103064', Mus_musculus='ENSMUSG00000031904', Onchocerca_volvulus=None, Panagrellus_redivivus=None, Pristionchus_pacificus='WBGene00099281', Saccharomyces_cerevisiae_S288c='YHL036W', Strongyloides_ratti=None, Trichuris_muris=None),
 TSVIndexRow(gene='WBGene00000010', Brugia_malayi=None, Caenorhabditis_angaria=None, Caenorhabditis_brenneri='WBGene00156759', Caenorhabditis_briggsae='WBGene00041961', Caenorhabditis_elegans=None, Caenorhabditis_japoni
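
The query above looks up an identifier in one species column and returns every row in which it occurs. A minimal sketch of that reverse lookup, using only values visible in the (truncated) output above and plain dictionaries instead of BIU's TSVIndexRow objects, could look like this. The function `index_by_species` is hypothetical.

```python
from collections import defaultdict

# Excerpt of the two rows printed above; only a few species columns are kept,
# and fields lost to the truncated output are simply left out.
ROWS = [
    {"gene": "WBGene00000005",
     "Caenorhabditis_brenneri": "WBGene00156759",
     "Caenorhabditis_briggsae": "WBGene00041961",
     "Homo_sapiens": "ENSG00000103064"},
    {"gene": "WBGene00000010",
     "Caenorhabditis_brenneri": "WBGene00156759",
     "Caenorhabditis_briggsae": "WBGene00041961"},
]


def index_by_species(rows, species):
    """Map each ortholog ID in the given species column to the matching genes."""
    index = defaultdict(list)
    for row in rows:
        ortholog = row.get(species)
        if ortholog is not None:  # None or missing means no known ortholog
            index[ortholog].append(row["gene"])
    return dict(index)


brenneri = index_by_species(ROWS, "Caenorhabditis_brenneri")
print(brenneri["WBGene00156759"])
```

As in the output above, a single ortholog ID can map back to more than one gene, so the lookup returns a list.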

In [16]:
#worm.ids.uniprot.table

## Flybase
You can also download the genomes available on Flybase.

In [17]:
biu.genomes.Flybase.organisms()

Organisms in Flybase, release FB2018_03:
 * dana_r1.05
 * dere_r1.05
 * dgri_r1.05
 * dmel_r6.22
 * dmoj_r1.04
 * dper_r1.3
 * dpse_r3.04
 * dsec_r1.3
 * dsim_r2.02
 * dvir_r1.06
 * dwil_r1.05
 * dyak_r1.05
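
Each organism name in this listing combines what looks like a species abbreviation with an annotation release, joined by "_r". Assuming that reading is correct, the names can be split as sketched below. The function `split_flybase_name` is a hypothetical helper, not part of BIU.

```python
# Organism names copied from the listing above.
NAMES = [
    "dana_r1.05", "dere_r1.05", "dgri_r1.05", "dmel_r6.22",
    "dmoj_r1.04", "dper_r1.3", "dpse_r3.04", "dsec_r1.3",
    "dsim_r2.02", "dvir_r1.06", "dwil_r1.05", "dyak_r1.05",
]


def split_flybase_name(name):
    """Split a name such as 'dmel_r6.22' into ('dmel', '6.22')."""
    species, sep, release = name.rpartition("_r")
    if not sep or not species or not release:
        raise ValueError(f"unexpected organism name: {name!r}")
    return species, release


# Assumed meaning: abbreviation -> annotation release.
releases = dict(split_flybase_name(name) for name in NAMES)
print(releases["dmel"])
```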


In [18]:
fly = biu.genomes.Flybase()
print(fly)

Flybase object
 Genome : flybase_FB2018_03.dmel_r6.22
 Objects:
  * [ ] gff
  * [ ] genome
  * [ ] cds
  * [ ] aa
 Files:
  * [ ] gff : /exports/molepi/tgehrmann/data/flybase_FB2018_03.dmel_r6.22/genes.gff3
  * [ ] genome : /exports/molepi/tgehrmann/data/flybase_FB2018_03.dmel_r6.22/dna.fasta
  * [ ] cds : /exports/molepi/tgehrmann/data/flybase_FB2018_03.dmel_r6.22/cds.fa
  * [X] aa : /exports/molepi/tgehrmann/data/flybase_FB2018_03.dmel_r6.22/aa.fa



In [19]:
#print(fly.aa.keys())
#fly.gff['FBtr0070000']

## JGI

## Yeast Genome

## FungiDB

## NCBI
