# Genome databases
A genome is described by several files:
 * Genome Fasta files
 * Genome GFF gene model descriptions
 * Transcriptome nucleotide Fasta file
 * Proteome amino acid Fasta file
 
BIU's `Genome` structure collects these files in a single object and provides functionality centered around them.

Several genomes are available in BIU:

In [1]:
import biu

In [2]:
biu.db.listGenomes()

Available versions:
 * GRCh37
 * Ensembl_GRCh37
 * Ensembl_GRCh38_91
 * RefSeq_GRCh37
 * RefSeq_GRCh38


## Initializing and inspecting the Genome structure

A `Genome` object exposes four kinds of data:

 * `Genome.gff` : A GFF3 structure of gene models
 * `Genome.cds` : A Fasta structure of transcript nucleotide sequences
 * `Genome.aa`  : A Fasta structure of protein sequences
 * `Genome.genome[c]` : One Fasta structure per chromosome `c`, holding the genome sequence
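
To make concrete what a Fasta structure holds, the following is a minimal plain-Python sketch of a FASTA reader that maps sequence identifiers to sequences. It is not BIU's implementation; the function name `read_fasta` and the toy records `tx1` and `tx2` are hypothetical and only illustrate the file format.

```python
from io import StringIO


def read_fasta(handle):
    """Parse FASTA text into a dict mapping sequence ID -> sequence string."""
    records = {}
    seq_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token after '>'
            seq_id = line[1:].split()[0]
            records[seq_id] = []
        elif seq_id is not None:
            # Sequence lines may be wrapped; collect and join them later
            records[seq_id].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


# Hypothetical toy input, not taken from any real genome
example = StringIO(">tx1 hypothetical transcript\nATGTCG\nCTCATG\n>tx2\nATGGCG\n")
fasta = read_fasta(example)
print(fasta["tx1"])
```

Wrapped sequence lines are joined, so each identifier maps to one contiguous string, which is the shape the lookups by `seqid` later in this document rely on.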

In [3]:
hg = biu.db.Genome("Ensembl_GRCh38_91")
print(hg)

Genome object
 Where: /home/tgehrmann/repos/BIU/docs
 Genome : Ensembl_GRCh38_91
 Objects:
  * [ ] gff
  * [ ] cds
  * [ ] aa
  * [ ] genome[1]
  * [ ] genome[2]
  * [ ] genome[3]
  * [ ] genome[4]
  * [ ] genome[5]
  * [ ] genome[6]
  * [ ] genome[7]
  * [ ] genome[8]
  * [ ] genome[9]
  * [ ] genome[10]
  * [ ] genome[11]
  * [ ] genome[12]
  * [ ] genome[13]
  * [ ] genome[14]
  * [ ] genome[15]
  * [ ] genome[16]
  * [ ] genome[17]
  * [ ] genome[18]
  * [ ] genome[19]
  * [ ] genome[20]
  * [ ] genome[21]
  * [ ] genome[22]
  * [ ] genome[MT]
  * [ ] genome[X]
  * [ ] genome[Y]
 Files:
  * [X] gff : /home/tgehrmann/repos/BIU/docs/genome_Ensembl_GRCh38_91/genome.gff3
  * [X] cds : /home/tgehrmann/repos/BIU/docs/genome_Ensembl_GRCh38_91/cds.fa
  * [X] aa : /home/tgehrmann/repos/BIU/docs/genome_Ensembl_GRCh38_91/aa.fa
  * [X] chr_1 : /home/tgehrmann/repos/BIU/docs/genome_Ensembl_GRCh38_91/chr1.fa.gz
  * [ ] chr_2 : /home/tgehrmann/repos/BIU/docs/genome_Ensembl_GRCh38_91/chr2.fa.gz
  

## Accessing the Fasta Sequences

In [4]:
print(hg.cds)
print(hg.aa)

D: Initializing the FastaResourceManager object NOW
D: Fasta input source is file


Fasta object
 Where: /home/tgehrmann/repos/BIU/docs/genome_Ensembl_GRCh38_91/cds.fa
 Entries: 104817
 Primary type: dna



D: Initializing the FastaResourceManager object NOW
D: Fasta input source is file


Fasta object
 Where: /home/tgehrmann/repos/BIU/docs/genome_Ensembl_GRCh38_91/aa.fa
 Entries: 104817
 Primary type: prot



In [5]:
# Print the first three transcript sequences in FASTA format
for i, seqid in enumerate(hg.cds):
    seq = hg.cds[seqid]
    print(">%s\n%s" % (seq.name, seq.seq))
    if i > 1:
        break

>ENST00000640941.1
ATGTCGCTCATGGTCATCATCATGGCGTGTGTTGGGTTCTTCTTGCTGCAGGGGGCCTGGCCACAGGAGGGAGTCCACAGAAAACCTTCCTTCCTGGCCCTCCCAGGTCACCTGGTGAAATCAGAAGAGACAGTCATCCTGCAATGTTGGTCGGATGTCATGTTTGAGCACTTCCTTCTGCACAGAGAGGGGAAGTTTAACAACACTTTGCACCTCATTGGAGAGCACCATGATGGGGTTTCCAAGGCCAACTTCTCCATTGGTCCCATGATGCCTGTCCTTGCAGGAACCTACAGATGCTACGGTTCTGTTCCTCACTCCCCCTATCAGTTGTCAGCTCCCAGTGACCCTCTGGACATGGTGATCATAGGTCTATATGAGAAACCTTCTCTCTCAGCCCAGCCGGGCCCCACGGTTCAGGCAGGAGAGAATGTGACCTTGTCCTGCAGCTCCCGGAGCTCCTATGACATGTACCATCTATCCAGGGAAGGGGAGGCCCATGAACGTAGGCTCCCTGCAGTGCGCAGCATCAACGGAACATTCCAGGCCGACTTTCCTCTGGGCCCTGCCACCCACGGAGGGACCTACAGATGCTTCGGCTCTTTCCGTGACGCTCCCTACGAGTGGTCAAACTCGAGTGATCCACTGCTTGTTTCCGTCACAGGAAACCCTTCAAATAGTTGGCCTTCACCCACTGAACCAAGCTCCAAAACCGGTAACCCCAGACACCTACATGTTCTGATTGGGACCTCAGTGGTCAAAATCCCTTTCACCATCCTCCTCTTCTTTCTCCTTCATCGCTGGTGCTCCGACAAAAAAAATGCTGCTGTAATGGACCAAGAGCCTGCAGGGAACAGAACAGTGAACAGCGAGGATTCTGATGAACAAGACCATCAGGAGGTGTCATACGCA
>ENST00000638726.1
ATGTCGCTCACTGTCGTCAGCATGGCGTGCGTTGGGTTCTTCTTGCTGC

D: Set iterkeys: 104817


In [6]:
# Print the first three protein sequences in FASTA format
for i, seqid in enumerate(hg.aa):
    seq = hg.aa[seqid]
    print(">%s\n%s" % (seq.name, seq.seq))
    if i > 1:
        break

>ENSP00000492546.1
MSLMVIIMACVGFFLLQGAWPQEGVHRKPSFLALPGHLVKSEETVILQCWSDVMFEHFLLHREGKFNNTLHLIGEHHDGVSKANFSIGPMMPVLAGTYRCYGSVPHSPYQLSAPSDPLDMVIIGLYEKPSLSAQPGPTVQAGENVTLSCSSRSSYDMYHLSREGEAHERRLPAVRSINGTFQADFPLGPATHGGTYRCFGSFRDAPYEWSNSSDPLLVSVTGNPSNSWPSPTEPSSKTGNPRHLHVLIGTSVVKIPFTILLFFLLHRWCSDKKNAAVMDQEPAGNRTVNSEDSDEQDHQEVSYA
>ENSP00000492117.1
MSLTVVSMACVGFFLLQGAWPLMGGQDKPFLSARPSTVVPRGGHVALQCHYRRGFNNFMLYKEDRSHVPIFHGRIFQESFIMGPVTPAHAGTYRCRGSRPHSLTGWSAPSNPLVIMVTGNHRKPSLLAHPGPLLKSGETVILQCWSDVMFEHFFLHREGISEDPSRLVGQIHDGVSKANFSIGPLMPVLAGTYRCYGSVPHSPYQLSAPSDPLDIVITGLYEKPSLSAQPGPTVQAGENVTLSCSSWSSYDIYHLSREGEAHERRLRAVPKVNRTFQADFPLGPATHGGTYRCFGSFRALPCVWSNSSDPLLVSVTGICRHLHVLIGTSVVIFLFILLLFFLLYRWCSNKKNAAVMDQEPAGDRTVNRQDSDEQDPQEVTYAQLDHCVFIQRKISRPSQRPKTPLTDTSVYTELPNAEPRSKVVSCPRAPQSGLEGVF
>ENSP00000491436.1
MSLTVVSMACVGFFLLQGAWPLMGGQDKPFLSARPSTVVPRGGHVALQCHYRRGFNNFMLYKEDRSHVPIFHGRIFQESFIMGPVTPAHAGTYRCRGSRPHSLTGWSAPSNPLVIMVTGNHRKPSLLAHPGPLLKSGETVILQCWSDVMFEHFFLHREGISEDPSRLVGQIHDGVSKANFSIGPLMPVLAGTYRCYGSV

D: Set iterkeys: 104817
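
The first entries printed from `hg.cds` and `hg.aa` appear to correspond to each other: translating the start of `ENST00000640941.1` yields the start of `ENSP00000492546.1`. The sketch below checks this in plain Python with the standard genetic code. The `translate` helper and `CODON_TABLE` are written here for illustration and are not part of BIU.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence codon by codon, stopping at a stop codon."""
    protein = []
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# First 27 bases of ENST00000640941.1 and first 9 residues of
# ENSP00000492546.1, copied from the output above
cds_prefix = "ATGTCGCTCATGGTCATCATCATGGCG"
aa_prefix = "MSLMVIIMA"

print(translate(cds_prefix) == aa_prefix)
```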


In [7]:
# Chromosome 1 Fasta structure, sequence ID '1', slice 500000:500051
print(hg.genome['1']['1'][500000:500051])

D: Initializing the FastaResourceManager object NOW
D: Fasta input source is file


AGGTATCCTCTCATCTCAGCTTCCCTAGTAGTTGGAACTCTAGGTGCACAA
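
The slice above returns 51 bases, which is consistent with Python's usual 0-based, half-open slice semantics (`end - start` characters). The sketch below illustrates that convention on a toy string. That BIU sequences follow it exactly is an assumption based on the output length, and the `sequence` value is hypothetical.

```python
# Hypothetical toy sequence, used only to illustrate slice semantics
sequence = "ACGTACGTAC"

start, end = 2, 7
window = sequence[start:end]  # offsets 2, 3, 4, 5, 6 (end is excluded)

# In 1-based, inclusive coordinates this window spans start + 1 .. end
one_based_first = start + 1
one_based_last = end

print(window, len(window), one_based_first, one_based_last)

# The same arithmetic applied to the genome slice used above
print(500051 - 500000)
```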


## Accessing the GFF structure

In [8]:
print(hg.gff)

D: Initializing the GFF3ResourceManager object NOW
D: GFF input source is file.


GFF3 object
 Where: /home/tgehrmann/repos/BIU/docs/genome_Ensembl_GRCh38_91/genome.gff3
 Entries: 2636880
 Top level statistics:
  * chromosome : 25
  * pseudogene : 14627
  * ncRNA_gene : 22239
  * gene : 21436
  * supercontig : 169



In [9]:
hg.gff.getChildren('transcript:ENST00000417324')

[GFF3Entry(seqid:1, source:havana, feature:exon, start:34554, end:35174, score:., strand:-, phase:., attr:Name=ENSE00001727627;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001727627;rank=3;version=1),
 GFF3Entry(seqid:1, source:havana, feature:exon, start:35277, end:35481, score:., strand:-, phase:., attr:Name=ENSE00001669267;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001669267;rank=2;version=1),
 GFF3Entry(seqid:1, source:havana, feature:exon, start:35721, end:36081, score:., strand:-, phase:., attr:Name=ENSE00001656588;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001656588;rank=1;version=1)]
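
The `attr` field of each entry is a GFF3 attribute string of `key=value` pairs separated by semicolons. As a plain-Python sketch, independent of BIU's own `GFF3Entry` interface, the exons above can be put in transcript order by parsing that string and sorting on `rank`. The `parse_attributes` helper is hypothetical, and it ignores the URL-style escaping that GFF3 allows in attribute values.

```python
def parse_attributes(attr):
    """Split a GFF3 attribute string into a dict of key -> value strings."""
    result = {}
    for field in attr.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        key, _, value = field.partition("=")
        result[key] = value
    return result


# (start, end, attr) copied from the getChildren output above
exons = [
    (34554, 35174, "Name=ENSE00001727627;constitutive=0;ensembl_end_phase=-1;"
                   "ensembl_phase=-1;exon_id=ENSE00001727627;rank=3;version=1"),
    (35277, 35481, "Name=ENSE00001669267;constitutive=0;ensembl_end_phase=-1;"
                   "ensembl_phase=-1;exon_id=ENSE00001669267;rank=2;version=1"),
    (35721, 36081, "Name=ENSE00001656588;constitutive=0;ensembl_end_phase=-1;"
                   "ensembl_phase=-1;exon_id=ENSE00001656588;rank=1;version=1"),
]

# Sort exons by their rank within the transcript
ordered = sorted(exons, key=lambda exon: int(parse_attributes(exon[2])["rank"]))

for start, end, attr in ordered:
    attrs = parse_attributes(attr)
    print(attrs["rank"], attrs["Name"], start, end)
```

Because this transcript lies on the minus strand, the exon with rank 1 has the highest genomic coordinates.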