# Lab 01: Programming the genomic data file
___

### Goals
- Retrieve genomic data from public databases.
- Use a programming language to extract information from sequence files.
- Perform some basic analysis on the data.

### Prerequisites

1. Python programming skills
2. Familiarity with Entrez, the NCBI search engine.
3. A working installation of [git](https://git-scm.com) and the [anaconda2 bundle](https://www.continuum.io), which contains the following modules:
   - [biopython](http://www.biopython.org)
   - [scipy](http://www.scipy.org)
   - [matplotlib](http://www.matplotlib.org)

## Session 1: Retrieve the data

It is simple to retrieve the data from the web using a browser such as Mozilla Firefox, Internet Explorer, or Google Chrome. Here, however, we would like you to retrieve the data from Entrez inside the Python programming environment.

In [5]:
from Bio import Entrez
Entrez.email = "ricket.woo@gmail.com"

In [6]:
help(Entrez.esearch)

Help on function esearch in module Bio.Entrez:

esearch(db, term, **keywds)
    ESearch runs an Entrez search and returns a handle to the results.
    
    ESearch searches and retrieves primary IDs (for use in EFetch, ELink
    and ESummary) and term translations, and optionally retains results
    for future use in the user's environment.
    
    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch
    
    Return a handle to the results which are always in XML format.
    
    Raises an IOError exception if there's a network error.
    
    Short example:
    
    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> handle = Entrez.esearch(db="nucleotide", retmax=10, term="opuntia[ORGN] accD")
    >>> record = Entrez.read(handle)
    >>> handle.close()
    >>> record["Count"] >= 2
    True
    >>> "156535671" in record["IdList"]
    True
    >>> "156535673" in record["Id

In [16]:
entries = Entrez.esearch(db="nucleotide", retmax=10, term="HER2 AND human[ORGN]")
records = Entrez.read(entries)
entries.close()

In [17]:
records

{u'Count': '1269', u'RetMax': '10', u'IdList': ['54792101', '306482680', '194239695', '1074161485', '1074161484', '1074161483', '1074161482', '1074161481', '1074161480', '1074161479'], u'TranslationStack': [{u'Count': '3803', u'Field': 'All Fields', u'Term': 'HER2[All Fields]', u'Explode': 'N'}, {u'Count': '13934653', u'Field': 'Organism', u'Term': '"Homo sapiens"[Organism]', u'Explode': 'Y'}, 'AND'], u'TranslationSet': [{u'To': '"Homo sapiens"[Organism]', u'From': 'human[ORGN]'}], u'RetStart': '0', u'QueryTranslation': 'HER2[All Fields] AND "Homo sapiens"[Organism]'}

In [18]:
records['IdList']

['54792101', '306482680', '194239695', '1074161485', '1074161484', '1074161483', '1074161482', '1074161481', '1074161480', '1074161479']
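
Note that `Count` ('1269') is far larger than `RetMax` ('10'), so the `IdList` above holds only the first page of hits, and both values come back as strings rather than integers. A minimal sketch of how the offsets of the remaining pages could be computed is given below. `page_offsets` is a hypothetical helper name, not part of Biopython; each offset is meant to be passed to ESearch as the `retstart` parameter together with the same `retmax`.

```python
def page_offsets(count, retmax):
    """Return the retstart offsets needed to page through `count` hits.

    Both arguments may be strings, as in the parsed ESearch result.
    """
    count, retmax = int(count), int(retmax)
    if retmax <= 0:
        raise ValueError("retmax must be positive")
    return list(range(0, count, retmax))


# Values copied from the search result shown above.
offsets = page_offsets("1269", "10")
print("%d pages, last offset %d" % (len(offsets), offsets[-1]))
```

Please keep the number of requests modest when paging through a large result set, since NCBI limits how often a client may query Entrez.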

In [11]:
help(Entrez.efetch)

Help on function efetch in module Bio.Entrez:

efetch(db, **keywords)
    Fetches Entrez results which are returned as a handle.
    
    EFetch retrieves records in the requested format from a list of one or
    more UIs or from user's environment.
    
    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch
    
    Return a handle to the results.
    
    Raises an IOError exception if there's a network error.
    
    Short example:
    
    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> handle = Entrez.efetch(db="nucleotide", id="57240072", rettype="gb", retmode="text")
    >>> print(handle.readline().strip())
    LOCUS       AY851612                 892 bp    DNA     linear   PLN 10-APR-2007
    >>> handle.close()
    
    This will automatically use an HTTP POST rather than HTTP GET if there
    are over 200 identifiers as recommended by the NCBI.
    
    datab

In [19]:
handle = Entrez.efetch(db="nucleotide", id="54792101", rettype="gb", retmode="text")
with open("54792101.gbk", "w") as file2:
    file2.write(handle.read())
handle.close()
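
The same pattern extends to several identifiers, which will be needed later when more than one genome is downloaded. The sketch below is one possible arrangement, not the required solution: `save_genbank_records` and `fetch_from_entrez` are hypothetical names, and the fetching step is passed in as a function so that the file-writing logic can be tried without a network connection.

```python
import os


def save_genbank_records(ids, fetch_text, out_dir="."):
    """Write one <id>.gbk file per identifier and return the file paths.

    fetch_text is any function that maps an identifier to GenBank text.
    """
    paths = []
    for uid in ids:
        path = os.path.join(out_dir, uid + ".gbk")
        with open(path, "w") as out:
            out.write(fetch_text(uid))
        paths.append(path)
    return paths


def fetch_from_entrez(uid):
    """Fetch one nucleotide record as GenBank text (needs network access)."""
    # Imported here so that the rest of the sketch runs without Biopython.
    from Bio import Entrez
    handle = Entrez.efetch(db="nucleotide", id=uid, rettype="gb", retmode="text")
    try:
        return handle.read()
    finally:
        handle.close()


# Example call (assumes Entrez.email has been set as in the first cell):
# save_genbank_records(["54792101", "306482680"], fetch_from_entrez)
```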

## Session 2: Extract the features out of the sequence files

Write a Python script to extract all the features from the GenBank file obtained in the previous session and output them in TAB-delimited format.

In [22]:
def getFeatures(filename):
    """
    @DESCRIPTION: Extract FEATURES out of a Genbank file.
    @INPUT:
        - filename: a filename representing the Genbank file.
    @OUTPUT:
        - features: a dictionary containing all the FEATURE information 
    """
    ## add your code here
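
As a starting point, here is a rough sketch of one way to read the FEATURES table with the standard library only. It is not a complete GenBank parser: it assumes feature keys are indented by fewer than 21 spaces and qualifier lines by at least 21, it joins wrapped qualifier values with a single space (which is wrong for `/translation`), and it would misread a wrapped value whose continuation line begins with `/`. The names `parse_feature_table` and `features_to_tsv` are hypothetical, and the sketch returns a list of dictionaries rather than the single dictionary described in the docstring above. For real work, `Bio.SeqIO.parse(filename, "genbank")` is the more robust route.

```python
def parse_feature_table(lines):
    """Return a list of dicts, one per entry in the FEATURES table."""
    features = []
    in_table = False
    target = None  # what a continuation line extends: location or qualifier
    for raw in lines:
        line = raw.rstrip()
        if line.startswith("FEATURES"):
            in_table = True
            continue
        if not in_table or not line:
            continue
        if not line.startswith(" "):
            break  # a new section such as ORIGIN ends the table
        indent = len(line) - len(line.lstrip(" "))
        text = line.strip()
        if indent < 21:
            # Feature line: the key followed by the location.
            parts = text.split(None, 1)
            location = parts[1] if len(parts) > 1 else ""
            features.append({"type": parts[0], "location": location,
                             "qualifiers": []})
            target = "location"
        elif not features:
            continue
        elif text.startswith("/"):
            # Qualifier line such as /gene="ERBB2" or a bare flag like /pseudo.
            name, _, value = text[1:].partition("=")
            features[-1]["qualifiers"].append([name, value.strip('"')])
            target = "qualifier"
        elif target == "location":
            features[-1]["location"] += text
        else:
            qualifier = features[-1]["qualifiers"][-1]
            qualifier[1] = (qualifier[1] + " " + text).strip('"')
    return features


def features_to_tsv(features):
    """Format parsed features as TAB-delimited text with a header row."""
    rows = ["type\tlocation\tqualifiers"]
    for feature in features:
        quals = "; ".join(name + "=" + value
                          for name, value in feature["qualifiers"])
        rows.append("\t".join([feature["type"], feature["location"], quals]))
    return "\n".join(rows)


# Example call on the file saved in Session 1:
# with open("54792101.gbk") as handle:
#     print(features_to_tsv(parse_feature_table(handle)))
```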

## Session 3: Do some basic statistical analysis on the data

1. Get 3-5 different genomes from NCBI Entrez, in GenBank format.

2. Write a Python script to convert the `*.gbk` files into `*.fasta` files.

3. Compute the proportions of the four bases, "A", "C", "T" and "G", in the above genomes, and summarize the results in a single bar chart using `matplotlib.pyplot.bar()`.

4. Check whether the nucleotide composition is consistent across these species. This is a chi-square test of homogeneity on a contingency table. Note that the module `scipy.stats` contains many functions for statistical hypothesis testing.
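
The arithmetic behind steps 3 and 4 can be sketched with the standard library alone. The function names below are hypothetical, and the sketch assumes each sequence is already available as a plain string and that every base occurs in at least one genome (otherwise an expected count of zero causes a division by zero). It computes only the chi-square statistic and its degrees of freedom, not the p-value.

```python
from collections import Counter

BASES = "ACGT"


def base_counts(sequence):
    """Count A, C, G and T in a sequence, ignoring any other symbol."""
    counts = Counter(sequence.upper())
    return [counts[base] for base in BASES]


def base_proportions(counts):
    """Turn a list of counts into proportions that sum to 1."""
    total = float(sum(counts))
    return [c / total for c in counts]


def chi_square_homogeneity(table):
    """Chi-square statistic and degrees of freedom for a table of counts.

    table holds one row of A, C, G, T counts per genome.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = float(sum(row_totals))
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if all genomes shared the same composition.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Toy sequences standing in for real genomes.
table = [base_counts(seq) for seq in ("AACCGGTT", "AAAACGTT", "ACGTACGT")]
print(chi_square_homogeneity(table))
```

The lists returned by `base_proportions` are the bar heights for the chart in step 3. For step 4, `scipy.stats.chi2_contingency(table)` returns the statistic together with a p-value; for tables larger than 2x2, where no continuity correction is applied, its statistic should agree with the one computed here.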

In [24]:
%matplotlib inline
import matplotlib.pyplot as plt
import scipy.stats as stats

In [25]:
help(plt.bar)

Help on function bar in module matplotlib.pyplot:

bar(left, height, width=0.8, bottom=None, hold=None, data=None, **kwargs)
    Make a bar plot.
    
    Make a bar plot with rectangles bounded by:
    
      `left`, `left` + `width`, `bottom`, `bottom` + `height`
            (left, right, bottom and top edges)
    
    Parameters
    ----------
    left : sequence of scalars
        the x coordinates of the left sides of the bars
    
    height : sequence of scalars
        the heights of the bars
    
    width : scalar or array-like, optional
        the width(s) of the bars
        default: 0.8
    
    bottom : scalar or array-like, optional
        the y coordinate(s) of the bars
        default: None
    
    color : scalar or array-like, optional
        the colors of the bar faces
    
    edgecolor : scalar or array-like, optional
        the colors of the bar edges
    
    linewidth : scalar or array-like, optional
        width of bar edge(s). If None, use default
      

In [27]:
dir(stats)

['Tester',
 '__all__',
 '__builtins__',
 '__doc__',
 '__file__',
 '__name__',
 '__package__',
 '__path__',
 '_binned_statistic',
 '_constants',
 '_continuous_distns',
 '_discrete_distns',
 '_distn_infrastructure',
 '_distr_params',
 '_multivariate',
 '_stats',
 '_stats_mstats_common',
 '_tukeylambda_stats',
 'absolute_import',
 'alpha',
 'anderson',
 'anderson_ksamp',
 'anglit',
 'ansari',
 'arcsine',
 'bartlett',
 'bayes_mvs',
 'bernoulli',
 'beta',
 'betai',
 'betaprime',
 'binned_statistic',
 'binned_statistic_2d',
 'binned_statistic_dd',
 'binom',
 'binom_test',
 'boltzmann',
 'boxcox',
 'boxcox_llf',
 'boxcox_normmax',
 'boxcox_normplot',
 'bradford',
 'burr',
 'burr12',
 'cauchy',
 'chi',
 'chi2',
 'chi2_contingency',
 'chisqprob',
 'chisquare',
 'circmean',
 'circstd',
 'circvar',
 'combine_pvalues',
 'contingency',
 'cosine',
 'cumfreq',
 'describe',
 'dgamma',
 'dirichlet',
 'distributions',
 'division',
 'dlaplace',
 'dweibull',
 'entropy',
 'erlang',
 'expon',
 'exponnorm',


### 4. Simulating the DNA sequence


In [1]:
import random
dna = [random.choice('ACGT') for i in xrange(100)]
print ''.join(dna)

TGGGTACGGACGAACGCCTACTGACAGGAAAGCAATAAGTTAAGGTGAAACTTGCTCCGTCAAAGTGCAACTAATGACACGGGCGTTAGACAAGTAGTTG


In [3]:
from collections import Counter

count = Counter(dna)
for nucleotide in count:
    print nucleotide, count[nucleotide]

A 33
C 19
T 20
G 28


In [18]:
from __future__ import division
import numpy as np
ldna = len(dna)
proportions = map(lambda x: count[x]/ldna, count.keys())
entropies = map(lambda x: -x*np.log(x), proportions)
entropy = sum(entropies)

In [17]:
entropy = reduce(lambda x, y: x + y, entropies)
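
For reference, the quantity computed in the two cells above is the Shannon entropy of the base composition, H = -sum(p * log(p)), in nats because the natural logarithm is used. A compact sketch that avoids NumPy and runs unchanged on Python 2 and Python 3 follows; `shannon_entropy` is a hypothetical name, and the `base` argument is an addition that allows the result to be expressed in bits (`base=2`).

```python
import math
from collections import Counter


def shannon_entropy(sequence, base=math.e):
    """Shannon entropy of the symbol frequencies in a sequence."""
    counts = Counter(sequence)
    total = float(len(sequence))
    # Only symbols that actually occur are visited, so log(0) never arises.
    return -sum((n / total) * math.log(n / total, base)
                for n in counts.values())


# Four equally frequent bases give the maximum of 2 bits per position.
print(shannon_entropy("ACGT" * 25, base=2))
```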

## Session 4: Article query and reading

### 5. Get and read some articles on horizontal gene transfer (HGT), and then answer the following questions:

#### (1) What is "horizontal gene transfer"?

#### (2) List some algorithms for inferring horizontal gene transfer, and write down the mathematical details of one of them.

#### (3) How can we discriminate "horizontal gene transfer" from "gene loss"?