In [1]:
%load_ext watermark

In [2]:
%watermark -a Schmelling,Nicolas -u -d -v -p biopython,pandas

Schmelling,Nicolas 
last updated: 2017-01-03 

CPython 3.5.2
IPython 4.1.1

biopython 1.66
pandas 0.18.0


---
Any comments and suggestions or questions?     
Please feel free to contact me via [twitter](https://twitter.com/derschmelling) or [email](mailto:Nicolas.Schmelling@hhu.de).

---

# Molecular Tool Set for a Circadian Clock in Cyanobacteria #

__Background__     
Circadian clocks are found in almost all organisms, including photosynthetic Cyanobacteria, although the protein components involved are highly diverse. In the model cyanobacterium _Synechococcus elongatus_ PCC 7942, circadian rhythms are driven by a unique KaiABC protein clock, which is embedded in a network of input and output factors. Homologs of the KaiABC clock proteins have been observed in Bacteria and Archaea, and evidence for circadian behavior in these domains is accumulating. However, the interactions and functions of non-cyanobacterial Kai proteins, as well as of homologous input and output components, remain largely unclear.     
__Result__     
Using a universal BLAST analysis, we identified putative KaiC-based timing systems in organisms outside Cyanobacteria as well as variations within Cyanobacteria. A systematic analysis of publicly available microarray data revealed variations in circadian gene expression between different cyanobacterial strains, which might be correlated with the diversity of genome-encoded clock components. Based on statistical analyses of co-occurrences of the clock components homologous to those of _Synechococcus elongatus_, we propose putative networks of reduced and fully functional clock systems. Further, we studied KaiC sequence conservation to determine functionally important regions of diverged KaiC homologs. Biochemical characterization of exemplary cyanobacterial KaiC proteins as well as homologs from two thermophilic Archaea demonstrated that kinase activity is always present. However, KaiA-mediated phosphorylation is only detectable in KaiC1 orthologs.      
__Conclusion__     
Our analysis of 11,264 genomes clearly demonstrates that components of the _Synechococcus elongatus_ circadian clock are present in Bacteria and Archaea. However, all components are less abundant in other organisms than in Cyanobacteria, and KaiA, Pex, LdpA, and CdpA are only present in the latter. Thus, only reduced KaiBC-based or even simpler, solely KaiC-based timing systems might exist outside of the cyanobacterial phylum, which might be capable of driving diurnal oscillations.

## Data Collection and Preprocessing ##

This notebook is the first of five notebooks containing all of the code necessary to reproduce the data collection and analyses of the publication by [Schmelling et al., 2016](https://doi.org/10.1101/075291). It contains all instructions and code needed to repeat the data collection and preprocessing. The final processed datasets are also available at [FigShare](https://figshare.com/authors/Nicolas_Schmelling/699391).

### Downloading the RefSeq protein sequences from NCBI ###

__Run bash script__

### BLAST with Synechococcus protein sequence against custom RefSeq Database ###

__Install docker__

For more information on how to install Docker on your system, visit the Docker [installation page](https://www.docker.com/products/docker).

__Pull the BLAST docker container and run it__

__Create custom BLAST database from downloaded sequences__

__Run BLAST for selected Synechococcus and Synechocystis sequences__
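
The shell commands for the database and BLAST steps are not reproduced in this export. As a rough orientation only, the sketch below builds `makeblastdb` and `blastp` command lines in Python. All file and database names are hypothetical placeholders, and the parameters actually used for the publication (for example any e-value cutoff) are not known from this notebook. The only setting grounded in the code further down is XML output (`-outfmt 5`), which is what `NCBIXML` parses.

```python
def makeblastdb_cmd(fasta_path, db_name):
    """Build a makeblastdb call for a protein database."""
    return ['makeblastdb', '-in', fasta_path,
            '-dbtype', 'prot', '-out', db_name]


def blastp_cmd(query_fasta, db_name, xml_out):
    """Build a blastp call that writes XML output (-outfmt 5)."""
    return ['blastp', '-query', query_fasta, '-db', db_name,
            '-outfmt', '5', '-out', xml_out]


# Hypothetical paths; adapt them to the local directory layout.
db_cmd = makeblastdb_cmd('db/all_proteins.fasta', 'db/refseq_custom')
blast_cmd = blastp_cmd('db/seq/kaiA.fasta', 'db/refseq_custom',
                       'db/seq/kaiA_blast.xml')

print(' '.join(db_cmd))
print(' '.join(blast_cmd))
```

Inside the BLAST container, the lists could be passed to `subprocess.run(cmd, check=True)` or typed directly at the shell prompt.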

__Move all Synechocystis sequences into the seq/syn directory__

### Create FASTA from matches ###

In [None]:
import glob
import re

import pandas as pd

from Bio import Entrez
from Bio import SeqIO
from Bio.Blast import NCBIXML

In [None]:
def parse_hits(f):
    '''
    Read the BLAST output XML file, extract the genome and protein id,
    and use them to look up the protein record description and sequence
    in the genome FASTA file.
    '''
    result_handle = open(f)
    blast_record = NCBIXML.read(result_handle)
    
    # Split the file path into directory and file name, take the
    # protein name from the file name, and create the new FASTA file
    # next to the XML file
    directory, sep, filename = f.rpartition('/')
    prot = filename.split('_')[0]
    new_fasta = open(directory + sep + '%s_matches.fasta' % prot, 'w')
    
    # Count records
    rec = 0

    # Loop through the XML file
    for alignment in blast_record.alignments:
        
        # Record genome and protein ID and increase the count
        genome = alignment.title.split(' ')[-1]
        ref_no = alignment.title.split(' ',2)[1]
        rec += 1
        
        # Open genome FASTA file and find the original sequence
        # and description for the record and write it to the 
        # new FASTA file containing all matches
        file = glob.glob('AllGenomes/%s*.fasta'%genome)
        for seq_record in SeqIO.parse(file[0], 'fasta'):
            if ref_no in seq_record.description:
                new_fasta.write('>'+str(seq_record.description))
                new_fasta.write('\n')
                new_fasta.write(str(seq_record.seq))
                new_fasta.write('\n')

    result_handle.close()
    new_fasta.close()
 
    print(prot,'\t',rec)

In [None]:
# Run the above function for all Synechococcus proteins
for file in glob.glob('db/seq/*.xml'):
    parse_hits(file)

In [None]:
# Run the above function for all Synechocystis proteins
for file in glob.glob('db/seq/syn/*.xml'):
    parse_hits(file)
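
`parse_hits()` relies on a particular layout of the alignment title: the protein accession is the second space-separated token and the genome id is the last one. The following minimal sketch isolates that step. The helper name and the example title are hypothetical; the real titles depend on how the FASTA headers of the custom database were written.

```python
def split_alignment_title(title):
    """Return (protein_id, genome_id) from a BLAST alignment title.

    Assumes the same layout parse_hits() expects: the second token is
    the protein accession, the last token is the genome id.
    """
    genome_id = title.split(' ')[-1]
    protein_id = title.split(' ', 2)[1]
    return protein_id, genome_id


# Hypothetical title, for illustration only
example_title = ('gnl|BL_ORD_ID|0 WP_000000000.1 hypothetical protein '
                 'GCF_000000000.1')
protein_id, genome_id = split_alignment_title(example_title)
print(protein_id, genome_id)
```

If the database headers do not end with the genome id, the lookup in `AllGenomes/` will not find the matching FASTA file, so it is worth checking one title by hand before running the loops above.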

### Command line BLAST with matches against Synechococcus Database ###

__Run BLAST docker container__

__Create BLAST database for the reciprocal BLAST__

__Run BLAST for sequences. Use respective genome database__
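
The point of this second BLAST run is a reciprocal check: a candidate is kept only if its best hit in the _Synechococcus_ or _Synechocystis_ database is the protein that was originally used as the query. A minimal sketch of that rule with plain dictionaries is shown below. The function name and the candidate ids are hypothetical; only the KaiA and KaiB accessions are taken from the `id_dict` in the next section.

```python
def keep_reciprocal_hits(best_back_hits, query_protein_id):
    """Return candidates whose best back-BLAST hit is the original query.

    best_back_hits maps a candidate protein id to the title of its
    best hit in the reciprocal BLAST.
    """
    return sorted(candidate
                  for candidate, title in best_back_hits.items()
                  if query_protein_id in title)


# Hypothetical candidates from a first BLAST run with KaiA as query
best_back_hits = {
    'WP_000000001.1': 'WP_011377921.1 circadian clock protein KaiA',
    'WP_000000002.1': 'WP_011242647.1 circadian clock protein KaiB',
}
kept = keep_reciprocal_hits(best_back_hits, 'WP_011377921.1')
print(kept)
```

Only the first candidate passes, because the second one maps back to KaiB rather than KaiA.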

### Parse hits ###

In [None]:
'''
The filter_hits() function takes the BLAST output XML file from the 
second run and extracts the genome id to further collect information about 
the corresponding organism, taxonomy, and BLAST results. The function 
only collects this information for proteins that match the original query
protein from Synechococcus or Synechocystis and stores it in a CSV file.
'''

def filter_hits(blast_file,protein,organism):
    
    # Always tell NCBI who you are.
    Entrez.email = 'schmelli@msu.edu'
    
    # Read the XML file, extract the genome id, store it in a list,
    # and remove the duplicates by converting it into a set and back
    # into a list
    result_handle = open(blast_file)
    blast_records = NCBIXML.parse(result_handle)

    ids = []

    for blast_record in blast_records:
        ids.append(blast_record.query.split(' ')[-1])
        
    result_handle.close()
        
    ids = list(set(ids))
    
    # Create a dictionary that stores in the end organism, taxid,
    # taxonomy (fetched from NCBI), and the last curation date
    # for each genome.
    taxo_dict = {}
            
    for id in ids:
        # Open the genome assembly report file to extract organism name,
        # curation date, and tax id, and use the tax id to fetch
        # taxonomy information from NCBI.
        file = glob.glob('All_Reports/%s*_assembly_report.txt' %id)
        
        f = open(file[0], 'r')
        f_read = f.readlines()
        f.close()
        
        values = []
        
        for line in f_read:
            
            if line.startswith('# Organism name:'): 
                org = re.sub('  +','',line)
                values.append(org.split(':',1)[1][:-1])
                values.append(org.split(':',1)[1].split(' ',1)[0])

            elif line.startswith('# Taxid:'):
                org = re.sub('  +','',line)

                try:
                    handle = Entrez.efetch(db='taxonomy',
                                           id='txid%s[Organism]'\
                                           %org.split(':',1)[1][:-1])
                    record = Entrez.read(handle)
                    values.append(record[0]['Lineage'])
                except KeyError:
                    values.append('missing taxonomy')

                values.append(org.split(':',1)[1][:-1])
                
            elif line.startswith('# Date'):
                date = re.sub('  +','',line)
                values.append(date.split(':',1)[1][:-1])

        taxo_dict[id] = values

    # Reread the XML file and create also a CSV file for storage
    result_handle = open(blast_file)
    blast_records = NCBIXML.parse(result_handle)

    id_dict = {'kaiA':'WP_011377921.1', 'kaiB':'WP_011242647.1',
               'kaiC':'WP_011242648.1', 'pex':'WP_011377679.1',
               'ldpA':'WP_011377652.1', 'prkE':'WP_011243235.1',
               'nhtA':'WP_011378346.1', 'ircA':'WP_011378436.1',
               'cdpA':'WP_011378107.1', 'cikA':'WP_011243194.1',
               'sasA':'WP_011378322.1', 'rpaA':'WP_011377437.1',
               'rpaB':'WP_011378039.1', 'lalA':'WP_011242719.1',
               'labA':'WP_011244514.1', 'crm':'WP_011243720.1',
               'cpmA':'WP_011377895.1',
               'kaiB1':'WP_010874242.1', 'kaiC1':'WP_010874243.1',
               'kaiB2':'WP_010872548.1', 'kaiC2':'WP_010872549.1',
               'kaiB3':'WP_041425845.1', 'kaiC3':'WP_010873229.1'
                }

    csv = open('data/%s.csv'%protein, 'w')
    csv.write('name,genus,taxonomy,taxid,protein,protein_id,genome_id'\
              ',e_value,bitscore,identity,length,seq'\
              ',%s_prot_id,%s_id,date\n'%(organism,organism))

    rec = 0

    # Parse through the XML by records and extract genome ID.
    # First check if the protein ID is in the alignment title.
    # If so continue to write information into the CSV file.
    for blast_record in blast_records:
        
        genome = blast_record.query.split(' ')[-1]
        
        # Skip queries without any hit in the reciprocal BLAST
        if not blast_record.alignments:
            continue
        
        alignment = blast_record.alignments[0]
        hsp = alignment.hsps[0]
        
        if id_dict[protein] in alignment.title:
        
            # Write organisms name, genus name, taxonomy, and tax id
            csv.write(str(taxo_dict[genome][0]).replace(',',';')+',')
            csv.write(str(taxo_dict[genome][1]).replace(',',';')+',')
            csv.write(str(taxo_dict[genome][2]).replace(',',';')+',')
            csv.write(str(taxo_dict[genome][3]).replace(',',';')+',')

            rec += 1
            
            # Write protein name, protein id, and genome id
            csv.write(str(blast_record.query.split(' ',1)[1]\
                          .split(genome)[0]).replace(',',';')+',')
            csv.write(str(blast_record.query.split(' ',1)[0])\
                          .replace(',',';')+',')
            csv.write(str(blast_record.query.split(' ')[-1]\
                          .replace(',',';'))+',')
            
            # Write BLAST result statistics (e-value, score, and
            # identity) as well as the protein sequence length.
            # Note: hsp.score is the raw alignment score; the bit
            # score is available as hsp.bits.
            csv.write(str(hsp.expect).replace(',',';')+',')
            csv.write(str(hsp.score).replace(',',';')+',')
            csv.write(str(hsp.identities/float(len(hsp.match))*100)\
                          .replace(',',';')+',')

            csv.write(str(blast_record.query_length).replace(',',';')+',')
            
            # Look for protein sequence in the genome FASTA file
            # and write protein sequence into CSV file
            genome_file = glob.glob('AllGenomes/%s*.fasta' %genome)
            
            for seq_record in SeqIO.parse(genome_file[0], 'fasta'):
                if blast_record.query.split(' ',1)[0] in seq_record.description:
                    csv.write(str(seq_record.seq).replace(',',';')+',')
                    break
            
            # Last write Synechococcus/Synechocystis protein id and genome id
            # as well as curation date
            csv.write(str(id_dict[protein]).replace(',',';')+',')
            csv.write(str(alignment.title.split(' ')[-1]).replace(',',';')+',')

            csv.write(str(taxo_dict[genome][4]).replace(',',';')+'\n')

    result_handle.close()
    csv.close()
    print(rec)
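
The assembly report parsing inside `filter_hits()` can be tried without network access or real report files. The sketch below applies the same header-line logic to an in-memory string. The sample report text and the function name are hypothetical, and the Entrez taxonomy lookup is left out on purpose.

```python
import re


def parse_report_header(text):
    """Extract organism name, genus, tax id, and date from report text."""
    info = {}
    for line in text.splitlines():
        # Collapse the padding spaces after the colon, as filter_hits() does
        line = re.sub('  +', '', line)
        if line.startswith('# Organism name:'):
            info['name'] = line.split(':', 1)[1].strip()
            info['genus'] = info['name'].split(' ', 1)[0]
        elif line.startswith('# Taxid:'):
            info['taxid'] = line.split(':', 1)[1].strip()
        elif line.startswith('# Date'):
            info['date'] = line.split(':', 1)[1].strip()
    return info


# Hypothetical report header, for illustration only
sample_report = (
    '# Assembly name:  ExampleAssembly\n'
    '# Organism name:  Examplea testus strain 1 (cyanobacteria)\n'
    '# Taxid:          12345\n'
    '# Date:           2016-01-01\n'
)
info = parse_report_header(sample_report)
print(info)
```

Returning a dictionary instead of a positional list is a design choice of this sketch; `filter_hits()` itself depends on the order in which the header lines appear in the report.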

In [None]:
# Run above function for all Synechococcus proteins
for prot in ['kaiA','kaiB','kaiC','pex','ldpA','prkE','nhtA','ircA','cdpA','cikA',
             'sasA','rpaA','rpaB','lalA','labA','crm','cpmA']:
    filter_hits('db/seq/%s_back_blast.xml'%prot,prot,'synechococcus')

In [None]:
# Run above function for all Synechocystis proteins
for prot in ['kaiB1','kaiC1','kaiB2','kaiC2','kaiB3','kaiC3']:
    filter_hits('db/seq/syn/%s_back_blast.xml'%prot,prot,'synechocystis')
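
`filter_hits()` writes its CSV rows by hand and replaces every comma inside a value with a semicolon, so that each row keeps the same number of columns as the header. The sketch below reproduces that convention on an in-memory row and reads it back with the standard `csv` module. All values are hypothetical placeholders except the KaiA accession, which comes from `id_dict`.

```python
import csv
import io

# Header as written by filter_hits() for organism='synechococcus'
HEADER = ('name,genus,taxonomy,taxid,protein,protein_id,genome_id,'
          'e_value,bitscore,identity,length,seq,'
          'synechococcus_prot_id,synechococcus_id,date')


def to_field(value):
    """Convert a value to text and replace commas, as filter_hits() does."""
    return str(value).replace(',', ';')


# Hypothetical row; note the comma inside the organism name
values = ['Example organism, strain X', 'Example',
          'Bacteria; Cyanobacteria', 12345, 'hypothetical protein',
          'WP_000000000.1', 'GCF_000000000.1', 1e-50, 200.0, 45.5, 300,
          'MKT', 'WP_011377921.1', 'GCF_000000001.1', '2016-01-01']
row = ','.join(to_field(v) for v in values)

rows = list(csv.DictReader(io.StringIO(HEADER + '\n' + row + '\n')))
print(rows[0]['name'], rows[0]['identity'])
```

The same files can be loaded with `pandas.read_csv('data/kaiA.csv')` in the follow-up notebooks; semicolons in names or taxonomy strings then mark places where the original text contained a comma.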

---

### Next ###

+ [Distribution of circadian clock protein](2_KaiABC_BLAST_Heatmap.ipynb)
+ [Length distribution of KaiA, KaiB, KaiC](3_KaiABC_BLAST_Scatterplot.ipynb)
+ [Co-occurrence of circadian clock proteins in cyanobacteria](4_KaiABC_BLAST_FisherTest.ipynb)
+ [Additional Analyses](5_KaiABC_BLAST_Other.ipynb)

---