# Genome databases
A genome is described by several files:
 * Genome Fasta files
 * Genome GFF gene model descriptions
 * Transcriptome nucleotide Fasta file
 * Proteome amino acid Fasta file
 
The `Genome` structure in BIU collects these files in a single object and provides functionality centered around them.

Several genomes are available in BIU:

In [1]:
import biu

In [2]:
biu.db.listGenomes()

Available versions:
 * GRCh37
 * Ensembl_GRCh37
 * Ensembl_GRCh38_91
 * RefSeq_GRCh37
 * RefSeq_GRCh38
 * WBcel235


## Initializing and inspecting the Genome structure

A `Genome` object exposes several types of data:

 * `Genome.gff` : A GFF3 structure of gene models
 * `Genome.cds` : A Fasta structure of transcript sequences
 * `Genome.aa` : A Fasta structure of protein sequences
 * `Genome.genome[c]` : A Fasta structure of genome sequences, indexed by a key `c` (for example `all`)

In [3]:
ce = biu.db.Genome("WBcel235")
print(ce)

Genome object
 Where: /home/tgehrmann/repos/BIU/docs
 Genome : WBcel235
 Objects:
  * [ ] gff
  * [ ] cds
  * [ ] aa
  * [ ] genome[all]
 Files:
  * [X] gff : /home/tgehrmann/repos/BIU/docs/genome_WBcel235/genome.gff3
  * [X] cds : /home/tgehrmann/repos/BIU/docs/genome_WBcel235/cds.fa
  * [X] aa : /home/tgehrmann/repos/BIU/docs/genome_WBcel235/aa.fa
  * [X] chr_all : /home/tgehrmann/repos/BIU/docs/genome_WBcel235/chrall.fa.gz
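
The empty `[ ]` boxes under `Objects` next to the checked `[X]` boxes under `Files` suggest that the files exist on disk but have not been parsed yet. The cells further down print `Initializing the FastaResourceManager object NOW` only when a resource is first used, which fits that reading. The following is a minimal sketch of such a lazy-loading pattern. It makes no claim about BIU's internals: the `LazyGenome` class and its `_load` helper are hypothetical names invented for illustration.

```python
from functools import cached_property


class LazyGenome:
    """Hypothetical stand-in for a genome wrapper; not BIU's implementation."""

    def __init__(self, files):
        self.files = files   # maps a resource name to a file path
        self.loaded = []     # records which resources have been parsed

    def _load(self, key):
        # A real implementation would parse the file here.
        self.loaded.append(key)
        return "parsed:" + self.files[key]

    @cached_property
    def cds(self):
        return self._load("cds")

    @cached_property
    def aa(self):
        return self._load("aa")


genome = LazyGenome({"cds": "cds.fa", "aa": "aa.fa"})
loaded_before = list(genome.loaded)  # nothing is parsed at construction time
first = genome.cds                   # first access triggers the load
second = genome.cds                  # second access reuses the cached value
print(loaded_before, genome.loaded)
```

With this pattern, constructing the object is cheap, and only the resources that are actually used are read from disk.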



## Accessing the Fasta sequences

In [4]:
print(ce.cds)
print(ce.aa)

D: Initializing the FastaResourceManager object NOW
D: Fasta input source is file


Fasta object
 Where: /home/tgehrmann/repos/BIU/docs/genome_WBcel235/cds.fa
 Entries: 53146
 Primary type: dna



D: Initializing the FastaResourceManager object NOW
D: Fasta input source is file


Fasta object
 Where: /home/tgehrmann/repos/BIU/docs/genome_WBcel235/aa.fa
 Entries: 28310
 Primary type: prot



In [5]:
# Print the first three transcript sequences in FASTA format
for i, seqid in enumerate(ce.cds):
    seq = ce.cds[seqid]
    print(">%s\n%s" % (seq.name, seq.seq))
    if i > 1:
        break

>NR_155259.1
TCGCCGGTGTTCTATGTCTAAAACTGCAATTTGAACCACTTTTTGTACTTGTACAGTTGGATTTTTTCGTGTAGTTTTTTGAAAAAATAGGTTTTGCAAGAGCTCTGTGGTTATTGATTTTTCCTAAAAATACACATTTTCTGCTCAGTTTTGCCCACATTTCGAA
>NR_155258.1
ACTTCGTCGAATCGAGGGACCATCAAAATTGCACGGATGAAACAAGGATGTGCTTCTCCAGTCCACCTACATCCGCCCGAGCTGCTCATCCTCCAAATTCTTCCGTTTTCATGACAAATTATTGTTTTTTTTGTTGAAATTATGTAATTCATTAAATGTAATATTATCCTTATCTGTAAATAATTATCATGATCAATAAAATATCGCTCTTAATGTTCAATGAATAGC
>NR_155257.1
AGGACGGAAGGGGCGTCAGGTCGTTGTACCTCATTACCAGTATGGAGAGGGTCTCATTTATCATAAAATCAAAAAAGTTGAAAAAAGAAACCTCACTCGGTTCGCAAGAACCATCAAACCCTCTCTGAAGGGTCATCACCAGC


In [6]:
# Print the first three protein sequences in FASTA format
for i, seqid in enumerate(ce.aa):
    seq = ce.aa[seqid]
    print(">%s\n%s" % (seq.name, seq.seq))
    if i > 1:
        break

>NP_872268.1
MSRSIFIQMSDSKQLENEASSLRRVAFVGVVVSFTATLVCIIAAPMLYNYMQHMQSVMQSEVDFCRSRSGNIWKEVTRTQVLSKVSGGAIRSRRQTEYENLGVEGSSSQGGCCGCGTSAAGPPGSPGPDGQEGSNGRPGAPGTNGPDGRPATQASASDFCFDCPPGPPGPAGSIGPKGPNGNPGFDGQPGAPGNNGFAGGPGAPGLGGKDGQSGNAGVPGAPGKITNIQRPAGLPGVPGPIGPVGSAGTPGSPGNPGSQGPQGSAGDNGGDGFPGQPGANGDNGPDGETGVSGGCDHCPPPRTAPGY
>NP_872267.1
MSDIKQLENEASSLRRVALVGVAVSFTATLVCVIAAPMLYNYMQHMQSVMQSEVDFCRSRSGNIWREVTRTQVLAKVSGGAVRSRRQAGYESAGVEGNSFSQGGCCGCGVSAAGPPGAPGQDGEDGADGQPGAPGNDGPDGPAATPAPAHEFCFDCPAGPPGPAGPAGPKGPNGNSGSDGQPGAPGNNGNAGGPGAPGQAGQDGHPGNAGAPGAPGKVNEVPGPAGAPGAPGPDGPAGPAGSPGAPGNPGSQGPQGPAGDNGGAGSPGQPGANGDNGADGETGAPGGCDHCPPPRTAPGY
>NP_872265.1
MLGAPARFPRISGVTVARKVRKHRDYSNPETPNFTAQLFTIIVLGGLLLGGSGEEEQTLNQLLVEMDRMGSGNGAVVVLASTNRADVLDKALLRPGRFDRHISIDLPTVLERKDMFELYMRKIKLDHAPQEYSQRLAVLTPSFTGADIMNVCNESAIRAASNKCHVVTIKDMEYALDRVLAGSEKRSRSLVEEEREVVAYHEAGHALVGWMLEHTDALLKVLRGDSEKMLKWKFSGEK
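
The loops above iterate over sequence identifiers and look up each record by identifier, which is what a dictionary of FASTA records would allow. As an illustration of the file format behind this, here is a minimal sketch of a FASTA reader built on the standard library only. The `read_fasta` function is a hypothetical helper, not part of BIU, and it keeps only the first word of each header line as the identifier.

```python
import io


def read_fasta(handle):
    """Read FASTA text into a dict mapping identifier to sequence."""
    records = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


example = io.StringIO(">seq1 some description\nACGT\nTTGA\n>seq2\nMSRS\n")
records = read_fasta(example)
for name, sequence in records.items():
    print(">%s\n%s" % (name, sequence))
```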


In [7]:
print(ce.genome['all']['NC_003279.8'][500000:500051])

D: Initializing the FastaResourceManager object NOW
D: Fasta input source is file


GATGAGCTGCAGCGGAAGCTTTCATTGGGATCTGTGCAGTACGTTGGAACC
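
The slice `[500000:500051]` returns 51 bases, which is consistent with Python's 0-based, half-open slicing. GFF3 coordinates, in contrast, are 1-based and inclusive at both ends, so they need a small conversion before they can be used as a slice. A sketch of that conversion on a plain string, using a hypothetical helper named `gff_to_slice`:

```python
def gff_to_slice(start, end):
    """Convert 1-based inclusive coordinates to a 0-based half-open slice."""
    return slice(start - 1, end)


sequence = "ACGTACGTAC"
# Bases 3 through 6 in 1-based inclusive coordinates.
subsequence = sequence[gff_to_slice(3, 6)]
# A half-open slice [a:b] covers b - a positions.
slice_length = len(range(500000, 500051))
print(subsequence, slice_length)
```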


## Accessing the GFF structure

In [8]:
print(ce.gff)

D: Initializing the GFF3ResourceManager object NOW
D: GFF input source is file.


GFF3 object
 Where: /home/tgehrmann/repos/BIU/docs/genome_WBcel235/genome.gff3
 Entries: 517453
 Indexed: Yes
 Feature statistics:
  * region : 7
  * gene : 44830
  * snoRNA : 345
  * exon : 219765
  * mRNA : 28134
  * CDS : 196662
  * pseudogene : 1901
  * piRNA : 15364
  * ncRNA : 7769
  * transcript : 667
  * pseudogenic_tRNA : 209
  * tRNA : 634
  * antisense_RNA : 104
  * lnc_RNA : 181
  * primary_transcript : 271
  * miRNA : 454
  * snRNA : 130
  * rRNA : 22
  * pseudogenic_rRNA : 1
  * scRNA : 1
  * sequence_feature : 2



In [9]:
# Exon children of 'rna4', restricted to a region of chromosome NC_003279.8
for entry in ce.gff.getChildren('rna4', feature='exon').query('NC_003279.8', 19241, 20848):
    print(entry)

GFF3Entry(seqid:NC_003279.8, source:RefSeq, feature:exon, start:20271, end:20478, score:., strand:-, phase:., attr:ID=id20;Dbxref=GeneID:171592,Genbank:NM_058262.4,WormBase:WBGene00022278;gbkey=mRNA;gene=rcor-1;partial=true;product=RCOR (REST CO-Repressor) homolog;transcript_id=NM_058262.4)


D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.
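
The `attr` field printed above is the ninth column of the GFF3 record: a semicolon-separated list of `key=value` pairs in which a value may hold several comma-separated items. As a sketch of how that column can be unpacked with the standard library, independent of BIU, the hypothetical helper `parse_gff3_attributes` below returns each value as a list:

```python
from urllib.parse import unquote


def parse_gff3_attributes(column):
    """Split a GFF3 attribute column into a dict of key -> list of values."""
    attributes = {}
    for field in column.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        # GFF3 percent-encodes reserved characters inside keys and values.
        attributes[unquote(key)] = [unquote(item) for item in value.split(",")]
    return attributes


attr = (
    "ID=id20;Dbxref=GeneID:171592,Genbank:NM_058262.4,"
    "WormBase:WBGene00022278;gbkey=mRNA;gene=rcor-1;partial=true;"
    "product=RCOR (REST CO-Repressor) homolog;transcript_id=NM_058262.4"
)
parsed = parse_gff3_attributes(attr)
print(parsed["gene"], parsed["Dbxref"])
```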


## Get sequences of GFF entries

In [10]:
print(ce.seq('rna4'))

atggATTCGTACACGTCATCTGACGAAGACGCCTCTCGAAAAGAAAACGAAGGCTTGAATATGTTGAATGCATCGCCGGAGCCAATGGAGGAAGATGATCCAGAGGAGCAGGCggaacaagaagaagaaaccAGCAGAATGGCTCGTCCTATAAGATCCATGAGAAAACGCGAAACAACGTCTGGGGAATCAATGGGCGATGAGGATGAAGATTTGGAGGATGAAGAGGACGAAGATGAAGAAGCTGAAGCTCGTGAGCATCATGAAAGTGGTGCTCATGACACATCTTTCTCAAATCCACTTTCCAACGTCGACAATCTAATCCACGTGGGAACCGAATATCAGGCGATTATACAGCCAACTGCAGAGCAATTGGAAAAAGAACCGTGCAGAGATCAACAAATTTGGGCGTTTCCAGACGAAATGAACGAGAATCGGCTTACAGAATACATTTCAGAAGCTACTGGACGATATCAATTACCTATAGATAGGGCTCTGTTCATTCTGAACAAACAGTCAAATGATTTCGACGCTGCGATGGTTCAAGCgatgagaagaaaagaaattcaTGATGATTGGACGGCAGAAGAAATTAGTCTTTTCTCCACTTGCTTCTTTCATTTCGGAAAACGGTTCAAGAAGATTCATGCGGCTatgcccCAACGCTCGCTTTCTTCCATTATCCAATACTATTACAACacgaaaaaagtgcaaaactatAAAACAATGATTAATGTGCATTTGAATGAAACCGACACTTATGATGAACTATTCAAAGAGGTCAATCATTTGGAGAGGGTTCCGTCGGGATATTGTGAGAATTGCAATGCAAAAAGTGATCTGTTGATTcTAAATCGTGTAATGTCGCGTCACGAATGTAAACCGTGTATCCTTTATTTCCGTTTGATGCGTGTTCCACGTCCGGCAAGCCTCCGTGCACTGACAAAACGACGGCAACGAGTTTTATGTCCAGAATaCATGAAAATTTATGTATACGGAT

D: GFF input source is list of GFF3Entries.
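
The exon shown earlier for `rna4` lies on the minus strand, so a transcript sequence has to be assembled from its exons and then reverse complemented. The sketch below shows one plausible way to do this on a toy chromosome. It is an assumption about the general technique, not a description of how `ce.seq` is implemented, and `spliced_sequence` and `reverse_complement` are hypothetical helpers.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Complement every base, then reverse the result."""
    return sequence.translate(COMPLEMENT)[::-1]


def spliced_sequence(chromosome, exons, strand):
    """Join exons given as 1-based inclusive (start, end) pairs."""
    ordered = sorted(exons)
    joined = "".join(chromosome[start - 1:end] for start, end in ordered)
    # Minus-strand features are read from the opposite strand.
    return reverse_complement(joined) if strand == "-" else joined


toy_chromosome = "AAACCCGGGTTTACGT"
toy_exons = [(7, 9), (1, 3)]
forward = spliced_sequence(toy_chromosome, toy_exons, "+")
reverse = spliced_sequence(toy_chromosome, toy_exons, "-")
print(forward, reverse)
```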


In [11]:
print(ce.seq('rna4').translate())

MDSYTSSDEDASRKENEGLNMLNASPEPMEEDDPEEQAEQEEETSRMARPIRSMRKRETTSGESMGDEDEDLEDEEDEDEEAEAREHHESGAHDTSFSNPLSNVDNLIHVGTEYQAIIQPTAEQLEKEPCRDQQIWAFPDEMNENRLTEYISEATGRYQLPIDRALFILNKQSNDFDAAMVQAMRRKEIHDDWTAEEISLFSTCFFHFGKRFKKIHAAMPQRSLSSIIQYYYNTKKVQNYKTMINVHLNETDTYDELFKEVNHLERVPSGYCENCNAKSDLLILNRVMSRHECKPCILYFRLMRVPRPASLRALTKRRQRVLCPEYMKIYVYGYLELMEPANGKAIKRLGIGKEKEEDDDIMVVDDCLLRKPSGPYIVEQSIEADPIDENTCRMTRCFDTPAALALIDNIKRKHHMCVPLVWRVKQTKCMEENEILNEEARQQMFRATMTYSRVPKGEIANWKKDMMALKGRFERFTPELDTTATNGNRSGKVRINYGWSPEEKKNAIRCFHWYKDNFELIAELMATKTVEQIKKFYMDNEKLILESIDTYRAELKSKLGK*


D: GFF input source is list of GFF3Entries.
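
Translation reads the nucleotide sequence three bases at a time and maps each codon to an amino acid, with `*` marking a stop codon. The sketch below uses the standard genetic code and ignores case, so the mixed-case sequence above can be passed in directly. Whether BIU's `translate()` uses this exact table and handles ambiguous bases the same way is an assumption; the `translate` function here is a standalone illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered by TCAG at each position.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence):
    """Translate complete codons; unknown codons become 'X'."""
    sequence = sequence.upper()
    usable = len(sequence) - len(sequence) % 3
    return "".join(
        CODON_TABLE.get(sequence[i:i + 3], "X") for i in range(0, usable, 3)
    )


# First 30 bases of the rna4 sequence printed above.
protein_start = translate("atggATTCGTACACGTCATCTGACGAAGAC")
print(protein_start)
```

The result agrees with the first ten residues of the translated `rna4` sequence shown above.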


### Convert several sequences to a Fasta structure

In [12]:
# Collect the sequences of the first 20 mRNA features into one Fasta structure
mRNASequences = biu.formats.Fasta([ ce.seq(mrna) for mrna in ce.gff.features['mRNA'][:20]])
print(mRNASequences)

D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.
D: GFF input source is list of GFF3Entries.
D: Fasta input source is a list of sequences.
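
Once several sequences have been collected, a common next step is to write them out as FASTA text. The following sketch does this for plain `(name, sequence)` pairs with fixed-width line wrapping. It does not use BIU objects, and `format_fasta` is a hypothetical helper rather than a BIU function.

```python
def format_fasta(records, width=60):
    """Render (name, sequence) pairs as FASTA text wrapped at `width`."""
    lines = []
    for name, sequence in records:
        lines.append(">" + name)
        # Break the sequence into chunks of at most `width` characters.
        lines.extend(
            sequence[i:i + width] for i in range(0, len(sequence), width)
        )
    return "\n".join(lines) + "\n"


fasta_text = format_fasta([("seq1", "ACGT" * 20), ("seq2", "MSRS")], width=60)
print(fasta_text, end="")
```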
