<h2>My first attempt at bioinformatic analysis (besides a few Rosalind challenges). I have an undergraduate+ understanding of biology and will try to explain things as I go along. I will tackle the question:</h2>
    

 - We have a number of different gene prediction systems in the dataset,
   how do they compare to each other? How do they compare to the mRNA
   data?

<h3>I hope to investigate these issues by</h3> 

 1. Gaining an understanding of our data sets
 2. Comparing the gene prediction sets to the mRNA data set


In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
import Bio

print(Bio.__version__)

In [None]:
# Load the gene prediction tables

agsts = pd.read_csv('../input/genes-augustus.csv')
gnscn = pd.read_csv('../input/genes-genscan.csv')
rfsq = pd.read_csv('../input/genes-refseq.csv')
xrfsq = pd.read_csv('../input/genes-xeno-refseq.csv')

### NOTE: cannot find gene names when searched on ensembl.org!!!
ensmbl = pd.read_csv('../input/genes-ensembl.csv')
predictions = {'AUGUSTUS':agsts, 'GENSCAN':gnscn, 'REFSEQ':rfsq, 'XENOREFSEQ':xrfsq, 'ENSEMBL':ensmbl}
agsts.head()

<h2>A bit on what all that means.</h2>

We can find out some things from [https://www.kaggle.com/mylesoneill/drosophila-melanogaster-genome][1],
but some of the descriptions are a bit opaque. 


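  - **Bin,** from what I can tell, is an indexing field that the UCSC genome browser uses to speed up range queries; it is of no consequence for this analysis.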
  - **txStart, txEnd, cdsStart, cdsEnd** stand for transcription start/end and coding sequence start/end. 
    The *transcribed region* is the stretch that RNA polymerase copies into RNA, and the coding sequence is the part of that RNA that the [ribosome][2] translates into protein.
  - **Score** I have no clue, sorry.
  - **cdsEndStat** I think is bookkeeping; is it known where (or if) the coding sequence ends?
  - **exonFrame** from the <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC441517/bin/nar_32_suppl-2_W309__index.html">literature</a> this is *"the number of
bases of the exon (0, 1 or 2) upstream of the most upstream codon boundary of the exon"*, so basically: does the beginning of this exon line up with the beginning of its first full <a href="https://en.wikipedia.org/wiki/Genetic_code">codon</a>?

  [1]: https://www.kaggle.com/mylesoneill/drosophila-melanogaster-genome
  [2]: https://en.wikipedia.org/wiki/Ribosome "ribosome"
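
To make the four coordinate columns concrete, here is a minimal sketch on a made-up two-row table (the gene names and positions are invented for illustration, not taken from the dataset). It assumes all four columns are genomic positions with txStart <= cdsStart <= cdsEnd <= txEnd, so simple differences give the span of the transcript, the span of the coding sequence, and the non-coding flanks on either side. These spans still include introns.

```python
import pandas as pd

# Hypothetical rows: names and coordinates are invented for illustration only.
toy = pd.DataFrame({
    'name': ['geneA', 'geneB'],
    'txStart': [100, 5000],
    'txEnd': [900, 7500],
    'cdsStart': [150, 5200],
    'cdsEnd': [800, 7000],
})

# Genomic span covered by the transcript and by the coding sequence
# (both spans include introns).
toy['txSpan'] = toy['txEnd'] - toy['txStart']
toy['cdsSpan'] = toy['cdsEnd'] - toy['cdsStart']

# Non-coding flanks on either side of the coding sequence, in genomic order.
toy['leftFlank'] = toy['cdsStart'] - toy['txStart']
toy['rightFlank'] = toy['txEnd'] - toy['cdsEnd']

print(toy[['name', 'txSpan', 'cdsSpan', 'leftFlank', 'rightFlank']])
```

The flanks are named by genomic position rather than 5'/3' because which one is which depends on the strand the gene sits on.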

In [None]:
gnscn.head()

<h3>Hey this is pretty similar to what's in the AUGUSTUS file (thanks Myles!).</h3>

In [None]:
# It would be interesting to sort by transcription start to see the first genes of some chromosomes
agsts.sort_values(by='txStart')

<h2>Now what is this?!</h2>

Those are a bunch of unmapped sections of chromosomes. I don't know if the transcription *really* starts at the first base, which seems unlikely given that the ends of chromosomes are usually capped with <a href="https://en.wikipedia.org/wiki/Telomere">telomeres</a> (at least in humans...). Maybe the transcription 'starts' at 0 for these few sequences because they are unmapped and we don't know much about them.
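
One way to probe that guess is to list which chromosome names the zero-start records sit on. The sketch below uses an invented stand-in table (the `chrom` labels and coordinates are placeholders, not real values from the dataset); on the real data the same filter could be applied to `agsts`.

```python
import pandas as pd

# Invented stand-in for a prediction table; labels and coordinates are placeholders.
toy = pd.DataFrame({
    'chrom': ['chr2L', 'scaffold_1', 'scaffold_2', 'chr3R'],
    'txStart': [7528, 0, 0, 15000],
})

# Keep only records whose transcription start is the very first base.
zero_start = toy[toy['txStart'] == 0]

# Count how many such records each chromosome/scaffold carries.
zero_start_by_chrom = zero_start['chrom'].value_counts()
print(zero_start_by_chrom)
```

If the zero-start records turn out to sit only on unplaced pieces, that would be consistent with the bookkeeping explanation, though it would not prove it.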

<h3>Time to compare our predicted gene data sets. First, how many genes did each set predict?</h3>


In [None]:
for method, dataset in predictions.items():
    print(method + " # of genes: " + str(dataset.shape[0]) + '\n')

That pretty much makes sense. AUGUSTUS and GENSCAN are older, while the other methods are based on an entire database of sequences and are updated often. So from here on I focus only on RefSeq and Ensembl.



In [None]:
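# Count RefSeq records per chromosome, keeping chromosomes with more than 10
refseq = pd.read_csv('../input/genes-refseq.csv')
refseq = refseq.groupby('chrom')['chrom'].count()
refseq = refseq.loc[refseq.values > 10]

# Same tally for Ensembl (note: this rebinds the name `ensmbl` to the counts)
ensmbl = pd.read_csv('../input/genes-ensembl.csv')
ensmbl = ensmbl.groupby('chrom')['chrom'].count()
ensmbl = ensmbl.loc[ensmbl.values > 10]

# Line the two tallies up side by side, indexed by chromosome
data = {'rfsq': refseq, 'ensmbl': ensmbl}
counts = pd.DataFrame.from_dict(data=data)
counts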
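
The per-chromosome tally above could also be written once and reused for any number of prediction sets. The following is a rough sketch under that assumption: `chrom_counts` is a hypothetical helper, and the two small frames are invented stand-ins for the real tables.

```python
import pandas as pd

def chrom_counts(tables, min_genes=10):
    """Count records per chromosome for each table, keeping counts above min_genes."""
    columns = {}
    for label, table in tables.items():
        per_chrom = table['chrom'].value_counts()
        columns[label] = per_chrom[per_chrom > min_genes]
    # Align on chromosome name; a chromosome missing from one set becomes NaN.
    return pd.DataFrame(columns)

# Invented stand-in tables, not real data.
toy_a = pd.DataFrame({'chrom': ['chrX'] * 12 + ['chr4'] * 3})
toy_b = pd.DataFrame({'chrom': ['chrX'] * 11 + ['chr4'] * 15})

summary = chrom_counts({'a': toy_a, 'b': toy_b})
print(summary)
```

With the real data, passing the `predictions` dictionary (built before `ensmbl` is reassigned) would give one column per prediction method.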


<h2>to be continued...</h2>