# *D. melanogaster* Genome: Capstone Option 1

The first goal for this project is to predict transcription factor binding sites in the genome. (One of the most commonly annotated categories is CpG islands; the predicted islands are loaded below.) Looking into how the existing gene predictions were made, I found that they are based on Hidden Markov Models (HMMs) and/or support vector machines (SVMs). Knowing this, I hope to build my own gene prediction model on top of what we have covered so far in regex and SVMs, while reading up on HMMs. sklearn itself no longer ships an HMM module (it was removed in favor of the separate hmmlearn package), so hmmlearn is where I will start. I will also need practice with the Biopython library, which was designed specifically for working with biological sequence data.

The main issue I expect is that I only have a single genome, and therefore only a single set of genes. I will try to find more data, and I have a backup option (for which I am already doing EDA) in case this one turns out not to be feasible.

Even without more data, I may be able to cluster genes by their sequences, since a gene's function is determined to a high degree by its form, which comes down to the nucleotide string itself. That said, even if I can group genes with similar counterparts, I likely won't be able to predict what they do, since that requires physical experimentation. Still, grouping poorly understood genes with well-characterized ones could reveal some really cool insights!
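
One possible starting point for the sequence-clustering idea is to turn each gene into a fixed-length vector of k-mer frequencies and compare the vectors. The sketch below is only an illustration: the function names `kmer_profile` and `cosine_similarity`, the choice of k = 3, and the toy sequences are all assumptions, not part of the project data.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_profile(seq, k=3):
    """Return the relative frequency of every possible DNA k-mer in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only pure A/C/G/T k-mers are counted, so windows containing N are ignored.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy sequences, made up for illustration (not taken from the genome).
at_rich_1 = kmer_profile("ATATATTTAAATATATTAAT")
at_rich_2 = kmer_profile("TTATATAATATTTAATATAT")
gc_rich = kmer_profile("GCGCGGCCGCGCCGGCGCGC")

print(cosine_similarity(at_rich_1, at_rich_2))  # similar composition, larger value
print(cosine_similarity(at_rich_1, gc_rich))    # no shared 3-mers, so 0.0
```

Profiles like these could then be fed to any standard clustering method. Whether k-mer composition actually tracks gene function in this data set is something the EDA would have to show.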

In [1]:
import pandas as pd

In [2]:
genome = pd.read_csv('genome.fa')
genome.head()

Unnamed: 0,>chr2L
0,Cgacaatgcacgacagaggaagcagaacagatatttagattgcctc...
1,tttctctcccatattatagggagaaatatgatcgcgtatgcgagag...
2,gccaacatattgtgctctttgattttttggcaacccaaaatggtgg...
3,tgaaCGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAA...
4,TTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGT...
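
Note that `pd.read_csv` treats the FASTA header (`>chr2L`) as a column name and every sequence line as a separate row, so the chromosome ends up split across many rows. For actual sequence work, a reader that groups lines by record is probably more convenient. Biopython's `SeqIO.parse(handle, "fasta")` does this. The standard-library sketch below shows the same idea, with the helper name `read_fasta` and the demo records being made up for illustration.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Tiny in-memory stand-in for genome.fa (made-up records, not real data).
demo = io.StringIO(">chrA\nCgacaatgca\ncgacagagga\n>chrB\nTTGCAACG\n")
records = dict(read_fasta(demo))
print({name: len(seq) for name, seq in records.items()})
```

With a real file this would be used as `with open('genome.fa') as fh: records = dict(read_fasta(fh))`. The mixed case in the output above is worth preserving while reading: in UCSC-style FASTA files, lowercase letters mark soft-masked repeats.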


In [3]:
mrna_genbank = pd.read_csv('mrna-genbank.fa')
mrna_genbank.head()

Unnamed: 0,>DQ327735 1
0,ttgttgcgcacacgcaccagaagagaggaggatcgaccaggcagct...
1,tctggctctctggaaagtggtcaaaggagaaggaggaggtcgtaag...
2,gtagaatcgacaatataatcggagtcatatcgggcatcaacgtcgg...
3,atcaacatcaacaacggcagcagacgtcgctaattgcaaccaacac...
4,gctgcagcctggaccctacatatccaatgttcagaatttaaatgca...


In [4]:
mrna_refseq = pd.read_csv('mrna-refseq.fa')
mrna_refseq.head()

Unnamed: 0,>NR_004049 1
0,accccatacccaaccagattattatgatacataatgcttatatgaa...
1,atacatttcgcaacatttattttaggtatataaatacatttattga...
2,attgatatatgccactaaaatggtgtatttttaatttctttcaata...
3,cataattgacattatataaaaatgaattataaaactctaagcggtg...
4,actcggctcatgggtcgatgaagaacgcagcaaactgtgcgtcatc...


In [5]:
ensemble_gtp = pd.read_csv('ensembl-gtp.csv')
ensemble_gtp.head()

Unnamed: 0,gene,transcript,protein
0,FBgn0085804.1,FBtr0114258.1,
1,FBgn0267431.1,FBtr0346770.1,FBpp0312365.1
2,FBgn0039987.1,FBtr0302440.1,
3,FBgn0058182.1,FBtr0302347.1,
4,FBgn0267430.1,FBtr0346769.1,FBpp0312364.1


In [6]:
ensemble_pep = pd.read_csv('ensembl-pep.csv')
ensemble_pep.head()

Unnamed: 0,name,seq
0,FBtr0005088.1,MAASDKSVDDSLYPIAVLIDELKNEDVQLRLNSIKKLSTIALALGE...
1,FBtr0006151.1,MFDLTGKHVCYVADCGGIALETSKVLMTKNIAKLAILQSTENPQAI...
2,FBtr0070000.1,MTRYKQTEFTEDDSSSIGGIQLNEATGHTGMQIRYHTARATWNWRS...
3,FBtr0070002.1,MTCTLVLLIASVLHFRMRGSCLLDIERFPVIPGTIYAGHIAYCAIL...
4,FBtr0070003.1,MDISKVDSTRALVNHWRIFRIMGIHPPGKRTFWGRHYTAYSMVWNV...


In [7]:
ensemble_source = pd.read_csv('ensembl-source.csv')
ensemble_source.head()

Unnamed: 0,name,source
0,FBtr0005088.1,protein_coding
1,FBtr0006151.1,protein_coding
2,FBtr0070000.1,protein_coding
3,FBtr0070001.1,tRNA
4,FBtr0070002.1,protein_coding


In [8]:
ensemble_to_gene_name = pd.read_csv('ensembl-to-gene-name.csv')
ensemble_to_gene_name.head()

Unnamed: 0,name,value
0,FBtr0114258.1,CR41571
1,FBtr0346770.1,CG45784
2,FBtr0302440.1,CR12798
3,FBtr0302347.1,CR40182
4,FBtr0346769.1,CG45783


In [9]:
genes_augustus = pd.read_csv('genes-augustus.csv')
genes_augustus.head()

Unnamed: 0,bin,name,chrom,strand,txStart,txEnd,cdsStart,cdsEnd,exonCount,exonStarts,exonEnds,score,name2,cdsStartStat,cdsEndStat,exonFrames
0,585,g1.t1,chr2L,+,6485,10574,6590,9276,6,"6485,7573,7979,8192,8667,10454,","6808,7910,8116,8589,9741,10574,",0,g1,cmpl,cmpl,"0,2,0,2,0,-1,"
1,585,g1.t2,chr2L,+,6485,10574,6590,9276,5,"6485,7573,8192,8667,10454,","6808,8116,8589,9741,10574,",0,g1,cmpl,cmpl,"0,2,2,0,-1,"
2,585,g2.t1,chr2L,-,10706,40875,11770,40784,26,"10706,12285,13519,13682,14932,22268,22527,2274...","12221,12928,13625,14874,15275,22446,22687,2293...",0,g2,cmpl,cmpl,"2,1,0,2,1,0,2,1,1,0,0,2,2,0,1,2,2,1,2,2,0,2,2,..."
3,585,g3.t1,chr2L,+,56305,71879,67624,70895,6,"56305,61885,67568,67891,68084,70606,","56356,61966,67762,68023,70549,71879,",0,g3,cmpl,cmpl,"-1,-1,0,0,0,2,"
4,585,g3.t2,chr2L,+,56305,76202,67624,76098,11,"56305,66675,67568,67891,68084,70606,72597,7447...","56356,67003,67762,68023,70549,70806,72977,7457...",0,g3,cmpl,cmpl,"-1,-1,0,0,0,2,1,0,0,2,0,"
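
The `exonStarts` and `exonEnds` columns hold comma-separated coordinate lists rather than single numbers, so they need to be split before any per-exon analysis. A minimal sketch follows, using the first three exons of `g2.t1` as displayed above. The helper names are hypothetical, and the length calculation assumes the UCSC convention of 0-based, half-open coordinates.

```python
def parse_coord_list(field):
    """Turn a genePred-style string such as '10706,12285,13519,' into a list of ints."""
    return [int(x) for x in str(field).split(",") if x.strip()]


def exon_lengths(exon_starts, exon_ends):
    """Return the length of each exon, assuming half-open [start, end) coordinates."""
    starts = parse_coord_list(exon_starts)
    ends = parse_coord_list(exon_ends)
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds have different lengths")
    return [end - start for start, end in zip(starts, ends)]


# First three exons of g2.t1 from the table above (the full lists are truncated in the display).
starts_field = "10706,12285,13519,"
ends_field = "12221,12928,13625,"
lengths = exon_lengths(starts_field, ends_field)
print(lengths)  # [1515, 643, 106]
```

The same helpers should apply to the other gene tables below, since they share these two columns.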


In [10]:
genes_ensemble = pd.read_csv('genes-ensembl.csv')
genes_ensemble.head()

Unnamed: 0,bin,name,chrom,strand,txStart,txEnd,cdsStart,cdsEnd,exonCount,exonStarts,exonEnds,score,name2,cdsStartStat,cdsEndStat,exonFrames
0,585,FBtr0300690.1,chr2L,+,7528,9484,7679,9276,3,"7528,8192,8667,","8116,8589,9484,",0,FBgn0031208.1,cmpl,cmpl,"0,2,0,"
1,585,FBtr0330654.1,chr2L,+,7528,9484,7679,8610,2,"7528,8228,","8116,9484,",0,FBgn0031208.1,cmpl,cmpl,"0,2,"
2,585,FBtr0300689.1,chr2L,+,7528,9484,7679,8610,2,"7528,8192,","8116,9484,",0,FBgn0031208.1,cmpl,cmpl,"0,2,"
3,585,FBtr0306589.1,chr2L,-,9838,21376,11214,17136,10,"9838,11409,11778,12285,13519,13682,14932,17052...","11344,11518,12221,12928,13625,14874,15711,1721...",0,FBgn0002121.1,cmpl,cmpl,"2,1,2,1,0,2,0,0,-1,-1,"
4,585,FBtr0078169.1,chr2L,-,9838,21376,11214,15648,10,"9838,11409,11778,12285,13519,13682,14932,19879...","11344,11518,12221,12928,13625,14874,15711,2002...",0,FBgn0002121.1,cmpl,cmpl,"2,1,2,1,0,2,0,-1,-1,-1,"


In [11]:
genes_genescan = pd.read_csv('genes-genscan.csv')
genes_genescan.head()

Unnamed: 0,bin,name,chrom,strand,txStart,txEnd,cdsStart,cdsEnd,exonCount,exonStarts,exonEnds
0,585,chr2L.1,chr2L,+,7573,9276,7573,9276,3,"7573,8228,8667,","8116,8589,9276,"
1,585,chr2L.2,chr2L,-,11770,19944,11770,19944,5,"11770,12285,13678,14932,19884,","12221,12850,14874,15711,19944,"
2,585,chr2L.3,chr2L,-,21918,25004,21918,25004,4,"21918,22742,23488,24746,","22687,22935,23744,25004,"
3,585,chr2L.4,chr2L,-,27048,41395,27048,41395,12,"27048,28014,28732,30393,31783,33904,34557,3471...","27484,28358,28817,31723,33270,34288,34604,3491..."
4,585,chr2L.5,chr2L,+,67624,76098,67624,76098,8,"67624,67891,68084,70606,72660,73494,74902,75077,","67762,68023,70549,70869,72977,73692,75018,76098,"


In [12]:
genes_refseq = pd.read_csv('genes-refseq.csv')
genes_refseq.head()

Unnamed: 0,bin,name,chrom,strand,txStart,txEnd,cdsStart,cdsEnd,exonCount,exonStarts,exonEnds,score,name2,cdsStartStat,csdEndStat,exonFrames
0,812,NM_001170300,chr3R,+,29836567,29837447,29836625,29837342,3,"29836567,29836912,29837317,","29836832,29837262,29837447,",0,CG15510,cmpl,cmpl,"0,0,2,"
1,802,NM_001170290,chr3R,-,28520664,28540324,28521128,28539941,17,"28520664,28521356,28521541,28521814,28522481,2...","28521298,28521473,28521700,28522231,28522632,2...",0,Ppn,cmpl,cmpl,"1,1,1,1,0,1,1,1,0,1,1,0,2,2,0,2,0,"
2,802,NM_001170289,chr3R,-,28520664,28540324,28521128,28539941,12,"28520664,28521356,28521541,28522481,28522694,2...","28521298,28521473,28521700,28522632,28523260,2...",0,Ppn,cmpl,cmpl,"1,1,1,0,1,1,0,2,2,0,2,0,"
3,713,NM_001103535,chrX,+,16811698,16826494,16812002,16826138,7,"16811698,16822556,16823102,16823856,16824358,1...","16812061,16822772,16823206,16824118,16824772,1...",0,CG4829,cmpl,cmpl,"0,2,2,1,2,2,2,"
4,713,NM_001103536,chrX,-,16826680,16835520,16827398,16828074,3,"16826680,16827675,16834611,","16827605,16828075,16835520,",0,CG13004,cmpl,cmpl,"0,0,-1,"


In [13]:
genes_xeno_refseq = pd.read_csv('genes-xeno-refseq.csv')
genes_xeno_refseq.head()

Unnamed: 0,bin,name,chrom,strand,txStart,txEnd,cdsStart,cdsEnd,exonCount,exonStarts,exonEnds,score,name2,cdsStartStat,cdsEndStat,exonFrames
0,777,NM_005051,chr3R,-,25263316,25265638,25263337,25265638,5,"25263316,25263487,25263877,25265278,25265596,","25263463,25263649,25264975,25265347,25265638,",0,QARS,incmpl,cmpl,0
1,776,NM_021758,chr3R,-,25064649,25065415,25064649,25065415,3,"25064649,25064860,25065145,","25064796,25064992,25065415,",0,Lin7b,incmpl,incmpl,0
2,585,NM_020812,chr2L,+,95834,101578,95834,101578,12,"95834,96764,98991,99000,99378,99468,99769,9978...","96002,97115,98997,99018,99438,99720,99772,1004...",0,DOCK6,incmpl,incmpl,2000000000
3,736,NM_001172995,chr3L,+,19858461,19860590,19858461,19860590,6,"19858461,19858811,19858862,19859398,19859572,1...","19858489,19858837,19859242,19859527,19860250,1...",0,Papss1,incmpl,incmpl,11000
4,612,NM_001196298,chr2R,-,3590032,3590326,3590032,3590326,1,3590032,3590326,0,LOC100501614,incmpl,incmpl,0


In [14]:
meta_cpg_island_ext_unmasked = pd.read_csv('meta-cpg-island-ext-unmasked.csv')
meta_cpg_island_ext_unmasked.head()

Unnamed: 0,bin,chrom,chromStart,chromEnd,name,length,cpgNum,gcNum,perCpg,perGc,obsExp
0,585,chr2L,7577,8084,CpG: 33,507,33,274,13.0,54.0,0.9
1,585,chr2L,12443,12850,CpG: 33,407,33,211,16.2,51.8,1.21
2,585,chr2L,22112,22357,CpG: 24,245,24,130,19.6,53.1,1.39
3,585,chr2L,27208,27450,CpG: 20,242,20,131,16.5,54.1,1.15
4,585,chr2L,33956,34254,CpG: 24,298,24,159,16.1,53.4,1.17
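
To make the columns of this table concrete, here is a sketch of how the same statistics could be computed from a raw sequence. The formulas are my reading of the table rather than something stated in the data: `perCpg` as 2 × `cpgNum` / `length` and `perGc` as `gcNum` / `length` both reproduce the first row (2 × 33 / 507 ≈ 13.0%, 274 / 507 ≈ 54.0%), while `obsExp` is assumed to be the usual observed/expected CpG ratio, `cpgNum × length / (C count × G count)`. The function name `cpg_stats` and the test sequence are made up.

```python
def cpg_stats(seq):
    """Compute CpG-island style summary statistics for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so str.count is exact here
    return {
        "length": n,
        "cpgNum": cpg,
        "gcNum": c + g,
        # Each CpG dinucleotide covers two bases, hence the factor of 2.
        "perCpg": 100.0 * 2 * cpg / n if n else 0.0,
        "perGc": 100.0 * (c + g) / n if n else 0.0,
        # Assumed observed/expected ratio; undefined when there is no C or no G.
        "obsExp": cpg * n / (c * g) if c and g else 0.0,
    }


# Made-up sequence for illustration, not a real island.
stats = cpg_stats("CGCGATAT")
print(stats)
```

Running this over windows of the parsed genome and comparing against the rows above would be one way to check whether the assumed formulas match the file.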


In [15]:
meta_cytoband = pd.read_csv('meta-cytoband.csv')
meta_cytoband.head()

Unnamed: 0,chrom,chromStart,chromEnd,name,gieStain
0,chrM,0,19524,,gneg
1,chrUn_CP007071v1,0,19956,,gneg
2,chrUn_CP007072v1,0,44411,,gneg
3,chrUn_CP007073v1,0,13157,,gneg
4,chrUn_CP007074v1,0,76224,,gneg


In [16]:
meta_simple_repeat = pd.read_csv('meta-simple-repeat.csv')
meta_simple_repeat.head()

Unnamed: 0,bin,chrom,chromStart,chromEnd,name,period,copyNum,consensusSize,perMatch,perIndel,score,A,C,G,T,entropy,sequence
0,585,chr2L,0,5204,trf,458,11.4,458,99,0,10363,31,19,20,28,1.97,CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTC...
1,585,chr2L,224,503,trf,147,2.0,137,84,11,357,31,21,20,26,1.98,ATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCCATAAT...
2,585,chr2L,682,961,trf,147,2.0,137,84,11,357,31,21,20,26,1.98,ATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCCATAAT...
3,585,chr2L,1140,1419,trf,147,2.0,137,84,11,357,31,21,20,26,1.98,ATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCCATAAT...
4,585,chr2L,1598,1877,trf,147,2.0,137,84,11,357,31,21,20,26,1.98,ATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCCATAAT...
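
The `entropy` column appears to be the Shannon entropy (in bits) of the base composition given in the `A`, `C`, `G`, `T` columns. This is an assumption on my part, as is reading those four columns as percentages, but it does reproduce the first row. The helper name below is hypothetical.

```python
from math import log2


def composition_entropy(a, c, g, t):
    """Shannon entropy in bits of a base composition given as counts or percentages."""
    counts = [a, c, g, t]
    total = sum(counts)
    if total == 0:
        return 0.0
    # Normalise so rounded percentages that do not sum to 100 still form a distribution.
    probs = [x / total for x in counts if x > 0]
    return -sum(p * log2(p) for p in probs)


# A, C, G, T values from row 0 of the table above.
h = composition_entropy(31, 19, 20, 28)
print(round(h, 2))  # 1.97, matching the entropy column for that row
```

The maximum possible value is 2.0 bits (all four bases equally frequent), so repeats with entropy near 2 are compositionally close to random sequence.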


In [17]:
refseq_link = pd.read_csv('refseq-link.csv')
refseq_link.head()

Unnamed: 0,name,product,mrnaAcc,protAcc,geneName,prodName,locusLinkId,omimId
0,C1orf137,putative uncharacterized protein C1orf137,NM_001013643,NP_001013665,242612,388327,388667,0
1,HMOX1,heme oxygenase 1,NM_001285567,NP_001272496,69217,23001,100860951,0
2,Adi1,"1,2-dihydroxy-3-keto-5-methylthiopentene dioxy...",NM_001285856,NP_001272785,163375,297261,101715211,0
3,XDH,xanthine dehydrogenase/oxidase,NM_001285974,NP_001272903,46674,112746,100515259,0
4,LOC101740470,chromobox protein homolog 5-like,NM_001285966,NP_001272895,242613,388328,101740470,0


In [18]:
refseq_summary = pd.read_csv('refseq-summary.csv')
refseq_summary.head()

Unnamed: 0,mrnaAcc,completeness,summary
0,NM_024058,Complete3End,
1,NM_001276675,Complete3End,
2,NM_001276674,Complete3End,
3,NR_074094,FullLength,
4,NR_074112,FullLength,
