`2019-09-07` `Tiago Ferreira Leao` **v1.0**

# Cyanobiome phylogenomics

```
Requirements: packages below
              /home/gerwick-lab/Desktop/data/genomes/prokka/annotation/genome_name
              (Before running this notebook, run prokka for all 80 published genomes)
              Run processing_mash_f1.ipynb to generate mash_taxonomy_2.csv
```
                 

In [27]:
import glob
import os
import pandas as pd
from Bio import SeqIO
from natsort import natsorted
import subprocess
import numpy as np
import re
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
import arrow
from Bio import Phylo
import matplotlib.pyplot as plt
%matplotlib inline

**Parsing input and generating reference (M. producens PAL)**

In [17]:
!mkdir -p ./pnas-phylogenomics/


Input: 21 housekeeping genes from a complete genome (here, [*Moorea* PAL from this linked PNAS paper](http://www.pnas.org/content/114/12/3198.short)). The cells below collect them into `./pnas-phylogenomics/PAL_core_genes.faa`.

In [12]:
!cp ./inputs/hk_genes_calteau_pnas/* ./pnas-phylogenomics/
!ls ./pnas-phylogenomics/*.faa | wc -l

21


In [13]:
base_dir = "/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics"
output_path = os.path.join(base_dir, "PAL_core_genes.faa")

# Exclude the output file so that re-running this cell does not read (and truncate) it as an input
input_list = [f for f in glob.glob(os.path.join(base_dir, "*.faa")) if f != output_path]

count = 0
with open(output_path, "w") as output_handle:
    for item in natsorted(input_list):
        # Gene label taken from the file name, e.g. "1_rplA"
        gene_label = os.path.basename(item).split(".faa")[0]
        for seq_record in SeqIO.parse(item, "fasta"):
            # Keep only the reference strain (M. producens PAL)
            if "Moorea_producens_PAL_15AUG08-1" in seq_record.id:
                count += 1
                new_id = gene_label + "_Moorea_producens_PAL15AUG081"
                output_handle.write(">%s\n%s\n" % (new_id, seq_record.seq))

# Exactly one reference sequence is expected per gene file
assert len(input_list) == count
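
The filtering step above can also be sketched without Biopython, which makes it easy to check on a small in-memory example. This is a minimal illustration only: `parse_fasta`, `extract_reference`, and the two-record example text are hypothetical and are not part of the original notebook.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_reference(text, gene_label, strain_tag, suffix):
    """Return renamed FASTA entries whose sequence ID contains strain_tag."""
    entries = []
    for header, seq in parse_fasta(text):
        fields = header.split()
        seq_id = fields[0] if fields else ""
        if strain_tag in seq_id:
            entries.append(">%s_%s\n%s\n" % (gene_label, suffix, seq))
    return entries


# Hypothetical content of one gene file (two strains, one of them the reference)
example = (
    ">Moorea_producens_PAL_15AUG08-1_gene1\nMKT\nAAL\n"
    ">Some_other_strain_gene1\nMKS\n"
)
picked = extract_reference(
    example,
    "1_rplA",
    "Moorea_producens_PAL_15AUG08-1",
    "Moorea_producens_PAL15AUG081",
)
print(picked)
```

Only the record whose ID contains the reference strain tag is kept, and its header is rebuilt from the gene label plus the strain suffix.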

In [18]:
!ls ./pnas-phylogenomics/*

./pnas-phylogenomics/10_rpsB.faa  ./pnas-phylogenomics/20_rpsO.faa
./pnas-phylogenomics/11_rpsC.faa  ./pnas-phylogenomics/21_rpsQ.faa
./pnas-phylogenomics/12_rpsD.faa  ./pnas-phylogenomics/2_rplC.faa
./pnas-phylogenomics/13_rpsE.faa  ./pnas-phylogenomics/3_rplE.faa
./pnas-phylogenomics/14_rpsG.faa  ./pnas-phylogenomics/4_rplF.faa
./pnas-phylogenomics/15_rpsH.faa  ./pnas-phylogenomics/5_rplK.faa
./pnas-phylogenomics/16_rpsI.faa  ./pnas-phylogenomics/6_rplM.faa
./pnas-phylogenomics/17_rpsK.faa  ./pnas-phylogenomics/7_rplN.faa
./pnas-phylogenomics/18_rpsL.faa  ./pnas-phylogenomics/8_rplR.faa
./pnas-phylogenomics/19_rpsM.faa  ./pnas-phylogenomics/9_rplV.faa
./pnas-phylogenomics/1_rplA.faa   ./pnas-phylogenomics/PAL_core_genes.faa


**Identify housekeeping genes in prokka annotation for final scaffolds**
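
Only prokka annotations whose strain name matches a final scaffold are searched in this section. The matching rule can be sketched in isolation as follows. The helper names and the file names in the example are hypothetical; only the rule itself (drop the first three characters of the scaffold name, prefix annotation names shorter than three characters with `1`, add `LADK01` explicitly) is taken from the notebook code.

```python
import os


def scaffold_strain(path):
    """Strain name of a scaffold file: base name up to the first '.', minus 3 leading characters."""
    return os.path.basename(path).split(".")[0][3:]


def annotation_strain(path):
    """Strain name of a prokka file; names shorter than 3 characters get a '1' prefix."""
    strain = os.path.basename(path).split(".")[0]
    return "1%s" % strain if len(strain) < 3 else strain


def select_annotations(prokka_paths, scaffold_paths, extra=("LADK01",)):
    """Return the prokka files whose strain matches a scaffold or an explicitly added strain."""
    wanted = set(scaffold_strain(p) for p in scaffold_paths)
    wanted.update(extra)
    return [p for p in prokka_paths if annotation_strain(p) in wanted]


# Hypothetical file names, for illustration only
scaffolds = ["sc/XXX1A5.fasta"]
annotations = ["ann/A5/A5.faa", "ann/3B7/3B7.faa", "ann/LADK01/LADK01.faa"]
selected = select_annotations(annotations, scaffolds)
print(selected)
```

In this example `A5` is padded to `1A5` and matches the scaffold, `LADK01` matches the explicitly added strain, and `3B7` is skipped.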

In [19]:
!mkdir -p ./pnas-phylogenomics/diamond

In [20]:
!diamond makedb --in ./pnas-phylogenomics/PAL_core_genes.faa -d ./pnas-phylogenomics/diamond/PAL_hkg

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Database file: ./pnas-phylogenomics/PAL_core_genes.faa
Opening the database file...  [8.5e-05s]
Loading sequence data (0 sequences processed)...  [5.7e-05s]
Writing trailer...  [2e-06s]
Closing the input file...  [5e-06s]
Closing the database file...  [1.9e-05s]
Processed 21 sequences, 3334 letters.
Total time = 0.000221s
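
The summary line above reports 21 sequences and 3334 letters. An independent count of the FASTA file is a cheap way to confirm that the database holds what was intended. The sketch below is an assumption-laden illustration: `fasta_stats` is a hypothetical helper, and it simply counts header lines and residue characters.

```python
def fasta_stats(lines):
    """Return (number of records, number of sequence letters) for FASTA lines."""
    n_records = 0
    n_letters = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            n_records += 1
        elif line:
            n_letters += len(line)
    return n_records, n_letters


# Small hypothetical example: two records with 6 and 4 residues
example = [">geneA description", "MKT", "AAL", "", ">geneB", "MKSV"]
stats = fasta_stats(example)
print(stats)
```

Passing an open handle on `./pnas-phylogenomics/PAL_core_genes.faa` to `fasta_stats` should give counts that agree with the `makedb` summary if the file is intact.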


In [21]:
def run_diamond(database,input_fasta,output_path):
    # Search the proteins in input_fasta against the DIAMOND database.
    # -a writes a DAA archive; DIAMOND appends ".daa" to the given name.
    output_file = os.path.join(output_path, os.path.splitext(os.path.basename(input_fasta))[0] + ".matches")
    diamond_cmd = "diamond blastp -d %s -q %s -a %s"%(database,input_fasta,output_file)
    p = subprocess.Popen(diamond_cmd, shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT, close_fds=True,cwd="./")
    output = p.stdout.read()
    # Decode so the log prints as text under both Python 2 and Python 3
    print(output.decode("utf-8"))

def view_diamond(diamond_matches,m8_output):
    # Convert a DAA archive into a tabular (.m8) file
    view_cmd = "diamond view -a %s -o %s"%(diamond_matches,m8_output)
    p = subprocess.Popen(view_cmd, shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT, close_fds=True,cwd="./")
    output = p.stdout.read()
    print(output.decode("utf-8"))

outdir = "/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond/"
dbfile = "/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond/PAL_hkg"
prokka_list = glob.glob("/home/gerwick-lab/Desktop/data/genomes/prokka/annotation/*/*.faa")
scaffold_list = glob.glob("/home/tiago/Desktop/cyanet/genomics_pnas/outputs/final_scaffolds/*.fasta")

# Strain names of the final scaffolds: file name up to the first ".", minus its first three characters
strain_list = []
for item in scaffold_list:
    strain = os.path.basename(item).split(".")[0][3:]
    strain_list.append(strain)

# LADK01 is added to the list explicitly
strain_list.append("LADK01")

# Number of annotations searched
count = 0

for prokka_faa in natsorted(prokka_list):
    strain = os.path.basename(prokka_faa).split(".")[0]
    # Annotation names shorter than three characters get a "1" prefix before matching
    if len(strain) < 3:
        temp_strain = "1%s"%strain
    else:
        temp_strain = strain
    if temp_strain in strain_list:
        count += 1
        run_diamond(dbfile,
                    prokka_faa,
                    outdir)
        view_diamond("%s%s.matches.daa"%(outdir,strain),"%s%s.m8"%(outdir,strain))

count

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [2.3e-05s]
Opening the input file...  [4.4e-05s]
Opening the output file...  [2.2e-05s]
Loading query sequences...  [0.068517s]
Building query histograms...  [0.058913s]
Allocating buffers...  [0.000463s]
Loading reference sequences...  [2.6e-05s]
Building reference histograms...  [0.003671s]
Allocating buffers...  [0.000447s]
Initializing temporary storage...  [0.00095s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001737s]
Building query index...  [0.03416s]
Building seed filter...  [0.00588s]
Searching alignments...  [0.224321s]
Processing query chunk 0, reference

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [1.9e-05s]
Opening the input file...  [4.6e-05s]
Opening the output file...  [4.2e-05s]
Loading query sequences...  [0.04527s]
Building query histograms...  [0.048156s]
Allocating buffers...  [0.00052s]
Loading reference sequences...  [2.9e-05s]
Building reference histograms...  [0.003345s]
Allocating buffers...  [0.000377s]
Initializing temporary storage...  [0.000848s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001649s]
Building query index...  [0.026088s]
Building seed filter...  [0.004471s]
Searching alignments...  [0.223281s]
Processing query chunk 0, referenc

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [2.4e-05s]
Opening the input file...  [2.2e-05s]
Opening the output file...  [1.8e-05s]
Loading query sequences...  [0.044855s]
Building query histograms...  [0.054437s]
Allocating buffers...  [0.000659s]
Loading reference sequences...  [2.9e-05s]
Building reference histograms...  [0.002992s]
Allocating buffers...  [0.000535s]
Initializing temporary storage...  [0.001318s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.002052s]
Building query index...  [0.028633s]
Building seed filter...  [0.004561s]
Searching alignments...  [0.221696s]
Processing query chunk 0, refere

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [2e-05s]
Opening the input file...  [4.6e-05s]
Opening the output file...  [2.6e-05s]
Loading query sequences...  [0.043405s]
Building query histograms...  [0.044491s]
Allocating buffers...  [0.000339s]
Loading reference sequences...  [2.1e-05s]
Building reference histograms...  [0.002609s]
Allocating buffers...  [0.000366s]
Initializing temporary storage...  [0.000816s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.00166s]
Building query index...  [0.024521s]
Building seed filter...  [0.004098s]
Searching alignments...  [0.226804s]
Processing query chunk 0, reference

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [2e-05s]
Opening the input file...  [3.4e-05s]
Opening the output file...  [3.1e-05s]
Loading query sequences...  [0.049549s]
Building query histograms...  [0.054089s]
Allocating buffers...  [0.000489s]
Loading reference sequences...  [2.9e-05s]
Building reference histograms...  [0.003652s]
Allocating buffers...  [0.000356s]
Initializing temporary storage...  [0.000815s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.002265s]
Building query index...  [0.029272s]
Building seed filter...  [0.00467s]
Searching alignments...  [0.224471s]
Processing query chunk 0, reference

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [1.8e-05s]
Opening the input file...  [4.5e-05s]
Opening the output file...  [2.7e-05s]
Loading query sequences...  [0.072609s]
Building query histograms...  [0.068037s]
Allocating buffers...  [0.000343s]
Loading reference sequences...  [2.3e-05s]
Building reference histograms...  [0.002554s]
Allocating buffers...  [0.000342s]
Initializing temporary storage...  [0.00083s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001531s]
Building query index...  [0.039886s]
Building seed filter...  [0.006356s]
Searching alignments...  [0.221691s]
Processing query chunk 0, referen

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [1.5e-05s]
Opening the input file...  [2.8e-05s]
Opening the output file...  [2.9e-05s]
Loading query sequences...  [0.038214s]
Building query histograms...  [0.038769s]
Allocating buffers...  [0.000372s]
Loading reference sequences...  [3.4e-05s]
Building reference histograms...  [0.004956s]
Allocating buffers...  [0.000347s]
Initializing temporary storage...  [0.000827s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001955s]
Building query index...  [0.023335s]
Building seed filter...  [0.003778s]
Searching alignments...  [0.226675s]
Processing query chunk 0, refere

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [1.3e-05s]
Opening the input file...  [2.5e-05s]
Opening the output file...  [2.1e-05s]
Loading query sequences...  [0.018194s]
Building query histograms...  [0.027449s]
Allocating buffers...  [0.00037s]
Loading reference sequences...  [3.3e-05s]
Building reference histograms...  [0.003038s]
Allocating buffers...  [0.000352s]
Initializing temporary storage...  [0.000886s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.00361s]
Building query index...  [0.014014s]
Building seed filter...  [0.002806s]
Searching alignments...  [0.226228s]
Processing query chunk 0, referenc

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [1.7e-05s]
Opening the input file...  [3e-05s]
Opening the output file...  [2.6e-05s]
Loading query sequences...  [0.01591s]
Building query histograms...  [0.023828s]
Allocating buffers...  [0.000362s]
Loading reference sequences...  [2.1e-05s]
Building reference histograms...  [0.002207s]
Allocating buffers...  [0.000344s]
Initializing temporary storage...  [0.000893s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.002126s]
Building query index...  [0.012299s]
Building seed filter...  [0.002301s]
Searching alignments...  [0.225259s]
Processing query chunk 0, reference

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [1.8e-05s]
Opening the input file...  [3.2e-05s]
Opening the output file...  [2.8e-05s]
Loading query sequences...  [0.080462s]
Building query histograms...  [0.07576s]
Allocating buffers...  [0.000366s]
Loading reference sequences...  [3.6e-05s]
Building reference histograms...  [0.003987s]
Allocating buffers...  [0.000363s]
Initializing temporary storage...  [0.000873s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001908s]
Building query index...  [0.043538s]
Building seed filter...  [0.006767s]
Searching alignments...  [0.221602s]
Processing query chunk 0, referen

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [2e-05s]
Opening the input file...  [3.7e-05s]
Opening the output file...  [3.2e-05s]
Loading query sequences...  [0.032474s]
Building query histograms...  [0.03323s]
Allocating buffers...  [0.000379s]
Loading reference sequences...  [2.1e-05s]
Building reference histograms...  [0.002499s]
Allocating buffers...  [0.000366s]
Initializing temporary storage...  [0.000847s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001966s]
Building query index...  [0.019311s]
Building seed filter...  [0.00322s]
Searching alignments...  [0.224686s]
Processing query chunk 0, reference 

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [1.9e-05s]
Opening the input file...  [3.5e-05s]
Opening the output file...  [3.1e-05s]
Loading query sequences...  [0.023129s]
Building query histograms...  [0.023425s]
Allocating buffers...  [0.000387s]
Loading reference sequences...  [3.4e-05s]
Building reference histograms...  [0.002373s]
Allocating buffers...  [0.000346s]
Initializing temporary storage...  [0.000847s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001881s]
Building query index...  [0.014307s]
Building seed filter...  [0.002516s]
Searching alignments...  [0.22584s]
Processing query chunk 0, referen

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [1.2e-05s]
Opening the input file...  [2.2e-05s]
Opening the output file...  [1.8e-05s]
Loading query sequences...  [0.037888s]
Building query histograms...  [0.035367s]
Allocating buffers...  [0.000373s]
Loading reference sequences...  [3.5e-05s]
Building reference histograms...  [0.002379s]
Allocating buffers...  [0.000369s]
Initializing temporary storage...  [0.000925s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.004309s]
Building query index...  [0.022557s]
Building seed filter...  [0.003469s]
Searching alignments...  [0.228233s]
Processing query chunk 0, refere

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [1.7e-05s]
Opening the input file...  [3.1e-05s]
Opening the output file...  [2.6e-05s]
Loading query sequences...  [0.028759s]
Building query histograms...  [0.041607s]
Allocating buffers...  [0.00035s]
Loading reference sequences...  [2.4e-05s]
Building reference histograms...  [0.00252s]
Allocating buffers...  [0.000453s]
Initializing temporary storage...  [0.001068s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.002105s]
Building query index...  [0.020369s]
Building seed filter...  [0.00324s]
Searching alignments...  [0.225399s]
Processing query chunk 0, reference

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [1.9e-05s]
Opening the input file...  [3.5e-05s]
Opening the output file...  [3.2e-05s]
Loading query sequences...  [0.089358s]
Building query histograms...  [0.092071s]
Allocating buffers...  [0.000472s]
Loading reference sequences...  [2.8e-05s]
Building reference histograms...  [0.004056s]
Allocating buffers...  [0.000485s]
Initializing temporary storage...  [0.001248s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001921s]
Building query index...  [0.054812s]
Building seed filter...  [0.009021s]
Searching alignments...  [0.223815s]
Processing query chunk 0, refere

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [1.7e-05s]
Opening the input file...  [3.1e-05s]
Opening the output file...  [2.8e-05s]
Loading query sequences...  [0.028707s]
Building query histograms...  [0.028607s]
Allocating buffers...  [0.000376s]
Loading reference sequences...  [2e-05s]
Building reference histograms...  [0.003464s]
Allocating buffers...  [0.000361s]
Initializing temporary storage...  [0.001111s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001823s]
Building query index...  [0.015555s]
Building seed filter...  [0.002784s]
Searching alignments...  [0.224761s]
Processing query chunk 0, referenc

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [1.9e-05s]
Opening the input file...  [3.6e-05s]
Opening the output file...  [3.2e-05s]
Loading query sequences...  [0.055277s]
Building query histograms...  [0.06164s]
Allocating buffers...  [0.000359s]
Loading reference sequences...  [2.1e-05s]
Building reference histograms...  [0.003078s]
Allocating buffers...  [0.000375s]
Initializing temporary storage...  [0.000865s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001543s]
Building query index...  [0.032701s]
Building seed filter...  [0.006257s]
Searching alignments...  [0.221292s]
Processing query chunk 0, referen

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [1.3e-05s]
Opening the input file...  [2.3e-05s]
Opening the output file...  [2e-05s]
Loading query sequences...  [0.066992s]
Building query histograms...  [0.073526s]
Allocating buffers...  [0.000511s]
Loading reference sequences...  [3e-05s]
Building reference histograms...  [0.002428s]
Allocating buffers...  [0.000353s]
Initializing temporary storage...  [0.001239s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001675s]
Building query index...  [0.041626s]
Building seed filter...  [0.007341s]
Searching alignments...  [0.223707s]
Processing query chunk 0, reference 

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [2.1e-05s]
Opening the input file...  [3.8e-05s]
Opening the output file...  [3.4e-05s]
Loading query sequences...  [0.027476s]
Building query histograms...  [0.032213s]
Allocating buffers...  [0.000357s]
Loading reference sequences...  [1.9e-05s]
Building reference histograms...  [0.002989s]
Allocating buffers...  [0.000347s]
Initializing temporary storage...  [0.000874s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001904s]
Building query index...  [0.017923s]
Building seed filter...  [0.003352s]
Searching alignments...  [0.225588s]
Processing query chunk 0, refere

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [2e-05s]
Opening the input file...  [3.5e-05s]
Opening the output file...  [4e-05s]
Loading query sequences...  [0.032563s]
Building query histograms...  [0.0347s]
Allocating buffers...  [0.000425s]
Loading reference sequences...  [2.3e-05s]
Building reference histograms...  [0.003389s]
Allocating buffers...  [0.000433s]
Initializing temporary storage...  [0.000981s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001865s]
Building query index...  [0.020698s]
Building seed filter...  [0.003378s]
Searching alignments...  [0.226977s]
Processing query chunk 0, reference ch

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [2e-05s]
Opening the input file...  [3.4e-05s]
Opening the output file...  [4e-05s]
Loading query sequences...  [0.042411s]
Building query histograms...  [0.040414s]
Allocating buffers...  [0.000376s]
Loading reference sequences...  [3.3e-05s]
Building reference histograms...  [0.002728s]
Allocating buffers...  [0.000354s]
Initializing temporary storage...  [0.000881s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001929s]
Building query index...  [0.024706s]
Building seed filter...  [0.003925s]
Searching alignments...  [0.228101s]
Processing query chunk 0, reference 

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [2e-05s]
Opening the input file...  [3.5e-05s]
Opening the output file...  [3.3e-05s]
Loading query sequences...  [0.031926s]
Building query histograms...  [0.026814s]
Allocating buffers...  [0.000344s]
Loading reference sequences...  [1.9e-05s]
Building reference histograms...  [0.002683s]
Allocating buffers...  [0.000323s]
Initializing temporary storage...  [0.000847s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.002173s]
Building query index...  [0.01583s]
Building seed filter...  [0.002963s]
Searching alignments...  [0.226651s]
Processing query chunk 0, reference

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [2e-05s]
Opening the input file...  [3.6e-05s]
Opening the output file...  [3.3e-05s]
Loading query sequences...  [0.074802s]
Building query histograms...  [0.070215s]
Allocating buffers...  [0.00039s]
Loading reference sequences...  [2.5e-05s]
Building reference histograms...  [0.003529s]
Allocating buffers...  [0.000347s]
Initializing temporary storage...  [0.000866s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001514s]
Building query index...  [0.041609s]
Building seed filter...  [0.006289s]
Searching alignments...  [0.222813s]
Processing query chunk 0, reference

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [2e-05s]
Opening the input file...  [3.6e-05s]
Opening the output file...  [3.3e-05s]
Loading query sequences...  [0.070619s]
Building query histograms...  [0.081739s]
Allocating buffers...  [0.000484s]
Loading reference sequences...  [2.7e-05s]
Building reference histograms...  [0.003213s]
Allocating buffers...  [0.000351s]
Initializing temporary storage...  [0.000908s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001431s]
Building query index...  [0.048712s]
Building seed filter...  [0.007738s]
Searching alignments...  [0.224967s]
Processing query chunk 0, referenc

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [2e-05s]
Opening the input file...  [3.5e-05s]
Opening the output file...  [3.3e-05s]
Loading query sequences...  [0.051123s]
Building query histograms...  [0.060215s]
Allocating buffers...  [0.000505s]
Loading reference sequences...  [2.9e-05s]
Building reference histograms...  [0.004485s]
Allocating buffers...  [0.00036s]
Initializing temporary storage...  [0.000948s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.00153s]
Building query index...  [0.032729s]
Building seed filter...  [0.005388s]
Searching alignments...  [0.22293s]
Processing query chunk 0, reference c

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [1.3e-05s]
Opening the input file...  [2.3e-05s]
Opening the output file...  [2e-05s]
Loading query sequences...  [0.022956s]
Building query histograms...  [0.030479s]
Allocating buffers...  [0.000367s]
Loading reference sequences...  [2.5e-05s]
Building reference histograms...  [0.002609s]
Allocating buffers...  [0.00036s]
Initializing temporary storage...  [0.000929s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.0024s]
Building query index...  [0.017236s]
Building seed filter...  [0.003291s]
Searching alignments...  [0.226041s]
Processing query chunk 0, reference c

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [1.3e-05s]
Opening the input file...  [2.5e-05s]
Opening the output file...  [2.2e-05s]
Loading query sequences...  [0.02478s]
Building query histograms...  [0.036766s]
Allocating buffers...  [0.000373s]
Loading reference sequences...  [3.5e-05s]
Building reference histograms...  [0.00291s]
Allocating buffers...  [0.000348s]
Initializing temporary storage...  [0.000909s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.00181s]
Building query index...  [0.019024s]
Building seed filter...  [0.003544s]
Searching alignments...  [0.222927s]
Processing query chunk 0, reference

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [1.5e-05s]
Opening the input file...  [2.7e-05s]
Opening the output file...  [3e-05s]
Loading query sequences...  [0.02327s]
Building query histograms...  [0.029571s]
Allocating buffers...  [0.000382s]
Loading reference sequences...  [3.5e-05s]
Building reference histograms...  [0.002299s]
Allocating buffers...  [0.000333s]
Initializing temporary storage...  [0.000902s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001522s]
Building query index...  [0.015594s]
Building seed filter...  [0.003018s]
Searching alignments...  [0.225608s]
Processing query chunk 0, reference

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [2e-05s]
Opening the input file...  [3.9e-05s]
Opening the output file...  [3.5e-05s]
Loading query sequences...  [0.040823s]
Building query histograms...  [0.040297s]
Allocating buffers...  [0.000389s]
Loading reference sequences...  [2.1e-05s]
Building reference histograms...  [0.003182s]
Allocating buffers...  [0.000462s]
Initializing temporary storage...  [0.001279s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.00215s]
Building query index...  [0.022301s]
Building seed filter...  [0.003761s]
Searching alignments...  [0.225677s]
Processing query chunk 0, reference

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [2.1e-05s]
Opening the input file...  [4e-05s]
Opening the output file...  [3.7e-05s]
Loading query sequences...  [0.071106s]
Building query histograms...  [0.057813s]
Allocating buffers...  [0.000419s]
Loading reference sequences...  [2.3e-05s]
Building reference histograms...  [0.003643s]
Allocating buffers...  [0.000345s]
Initializing temporary storage...  [0.001064s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001444s]
Building query index...  [0.032578s]
Building seed filter...  [0.005539s]
Searching alignments...  [0.223976s]
Processing query chunk 0, referenc

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [2e-05s]
Opening the input file...  [3.4e-05s]
Opening the output file...  [3.3e-05s]
Loading query sequences...  [0.04233s]
Building query histograms...  [0.043344s]
Allocating buffers...  [0.00041s]
Loading reference sequences...  [2.1e-05s]
Building reference histograms...  [0.003957s]
Allocating buffers...  [0.000352s]
Initializing temporary storage...  [0.000938s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001686s]
Building query index...  [0.025939s]
Building seed filter...  [0.004164s]
Searching alignments...  [0.226163s]
Processing query chunk 0, reference 

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [2.1e-05s]
Opening the input file...  [5.1e-05s]
Opening the output file...  [3.4e-05s]
Loading query sequences...  [0.081935s]
Building query histograms...  [0.081718s]
Allocating buffers...  [0.000361s]
Loading reference sequences...  [2.2e-05s]
Building reference histograms...  [0.002982s]
Allocating buffers...  [0.000347s]
Initializing temporary storage...  [0.000938s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001719s]
Building query index...  [0.052463s]
Building seed filter...  [0.008066s]
Searching alignments...  [0.222318s]
Processing query chunk 0, refere

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [1.3e-05s]
Opening the input file...  [2.4e-05s]
Opening the output file...  [2.1e-05s]
Loading query sequences...  [0.050304s]
Building query histograms...  [0.056709s]
Allocating buffers...  [0.000403s]
Loading reference sequences...  [2.3e-05s]
Building reference histograms...  [0.005114s]
Allocating buffers...  [0.000377s]
Initializing temporary storage...  [0.000999s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001665s]
Building query index...  [0.030517s]
Building seed filter...  [0.005156s]
Searching alignments...  [0.220472s]
Processing query chunk 0, refere

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [2.1e-05s]
Opening the input file...  [3.6e-05s]
Opening the output file...  [3.5e-05s]
Loading query sequences...  [0.079456s]
Building query histograms...  [0.073875s]
Allocating buffers...  [0.000362s]
Loading reference sequences...  [3.4e-05s]
Building reference histograms...  [0.002244s]
Allocating buffers...  [0.000356s]
Initializing temporary storage...  [0.001266s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001817s]
Building query index...  [0.042589s]
Building seed filter...  [0.006966s]
Searching alignments...  [0.221745s]
Processing query chunk 0, refere

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [1.2e-05s]
Opening the input file...  [2.2e-05s]
Opening the output file...  [2e-05s]
Loading query sequences...  [0.047336s]
Building query histograms...  [0.058864s]
Allocating buffers...  [0.000367s]
Loading reference sequences...  [3.5e-05s]
Building reference histograms...  [0.003547s]
Allocating buffers...  [0.000339s]
Initializing temporary storage...  [0.000955s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001804s]
Building query index...  [0.031473s]
Building seed filter...  [0.005005s]
Searching alignments...  [0.222983s]
Processing query chunk 0, referenc

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [2.5e-05s]
Opening the input file...  [4.6e-05s]
Opening the output file...  [4.6e-05s]
Loading query sequences...  [0.050501s]
Building query histograms...  [0.04699s]
Allocating buffers...  [0.000372s]
Loading reference sequences...  [2.3e-05s]
Building reference histograms...  [0.002941s]
Allocating buffers...  [0.000403s]
Initializing temporary storage...  [0.001019s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.001743s]
Building query index...  [0.026582s]
Building seed filter...  [0.004529s]
Searching alignments...  [0.234016s]
Processing query chunk 0, referen

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [2e-05s]
Opening the input file...  [3.7e-05s]
Opening the output file...  [3.4e-05s]
Loading query sequences...  [0.080297s]
Building query histograms...  [0.088954s]
Allocating buffers...  [0.000543s]
Loading reference sequences...  [3.2e-05s]
Building reference histograms...  [0.003298s]
Allocating buffers...  [0.000574s]
Initializing temporary storage...  [0.001444s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.003908s]
Building query index...  [0.048732s]
Building seed filter...  [0.0076s]
Searching alignments...  [0.221124s]
Processing query chunk 0, reference 

diamond v0.8.31.93 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond
Opening the database...  [1.8e-05s]
Opening the input file...  [3.2e-05s]
Opening the output file...  [3e-05s]
Loading query sequences...  [0.018386s]
Building query histograms...  [0.025519s]
Allocating buffers...  [0.000372s]
Loading reference sequences...  [3.4e-05s]
Building reference histograms...  [0.002505s]
Allocating buffers...  [0.00037s]
Initializing temporary storage...  [0.000975s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.004666s]
Building query index...  [0.015311s]
Building seed filter...  [0.002664s]
Searching alignments...  [0.225728s]
Processing query chunk 0, reference

76

**Selecting genes (the highest %ID homolog per PAL housekeeping gene) and adding to existing alignments**

In [22]:
diamond_columns = ["qseqid","sseqid","pident","length","mismatch","gapopen",
                   "qstart","qend","sstart","send","evalue","bitscore"]
                    #retrieved from https://www.biostars.org/p/166013/
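
The twelve names above follow the default order of BLAST-style tabular output, which is what the `.m8` files contain. As a minimal sketch (standard library only, not the pandas call used in this notebook), the rows can be read and the numeric columns typed as below. The helper name `parse_m8` and the sample hit line are assumptions made up for illustration.

```python
import csv
import io

DIAMOND_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                   "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}
INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}

def parse_m8(handle):
    """Yield one dict per tab-separated hit, converting numeric columns."""
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        row = dict(zip(DIAMOND_COLUMNS, fields))
        for name in FLOAT_FIELDS:
            row[name] = float(row[name])
        for name in INT_FIELDS:
            row[name] = int(row[name])
        yield row

# Hypothetical hit line, invented for this example (not real output).
sample = ("GENE_00001\t1_rplA_Moorea_producens_PAL15AUG081\t87.5\t240\t30\t0"
          "\t1\t240\t1\t240\t1e-150\t420.0\n")
hits = list(parse_m8(io.StringIO(sample)))
print(hits[0]["sseqid"], hits[0]["pident"])
```

Typing the columns up front means later comparisons on `pident` are numeric rather than string comparisons.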

In [24]:
def get_hkg_dict(m8_file):
    """Return {prokka protein id: PAL reference gene id}, keeping the highest-%ID hit for each reference gene."""
    prokka_hkg = {}
    diamond_df = pd.read_csv(m8_file,sep="\t",names=diamond_columns).sort_values(by="sseqid")
    for i,r in diamond_df.iterrows():
        if diamond_df.qseqid.loc[i] not in list(prokka_hkg.values()):
            subset = diamond_df[diamond_df.sseqid == diamond_df.sseqid.loc[i]]
            max_id = subset.loc[subset['pident'].idxmax()]["pident"]
            max_qseq = subset.loc[subset['pident'].idxmax()]["qseqid"]
            max_sseq = subset.loc[subset['pident'].idxmax()]["sseqid"]
            prokka_hkg[max_qseq] = max_sseq
    assert(len(prokka_hkg.keys()) == len(np.unique(diamond_df.sseqid.values)))
    return prokka_hkg

def get_strainID(strain,mash_dict):
    """Look up the taxonomic label of a strain in mash_dict; strain codes shorter than
    three characters get a leading '1', and unknown strains default to 'Cyanobacterium'."""
    if len(strain) < 3:
        strain = "1%s"%strain
    if strain in mash_dict.keys():
        strainID = mash_dict[strain]
    else:
        strainID = "Cyanobacterium"
    return strainID

def extract_genes(m8_file,prokka_folder,hkg_folder,mash_dict,prokka_hkg):
    """Append each selected homolog from the strain's prokka .faa file to the matching per-gene FASTA in hkg_folder."""
    strain = os.path.basename(m8_file).split(".")[0]
    prokka_faa = "%s%s/%s.faa"%(prokka_folder,strain,strain)
    strainID = get_strainID(strain,mash_dict)
    input_handle  = open(prokka_faa,"r")
    for seq_record in SeqIO.parse(input_handle, "fasta"):
        if seq_record.id in prokka_hkg.keys():
            hkg_homolog = str(prokka_hkg[seq_record.id].split("_Moorea")[0])
            output_handle = open("%s%s.faa"%(hkg_folder,hkg_homolog), "a")
            new_id = "%s_%s"%(strainID,strain)
            output_handle.write("\n\n>%s %s\n%s\n" % (
               new_id,
               "",
               seq_record.seq))
            output_handle.close()
    input_handle.close()

m8_list = glob.glob("/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/diamond/*.m8")
prokka_folder = "/home/gerwick-lab/Desktop/data/genomes/prokka/annotation/"
hkg_folder = "/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/"

mash_df = pd.read_csv('/home/tiago/Desktop/cyanet/cyanobiome/mash-Linux64-v2.0/mash_taxonomy_2.csv',sep=',',header=None)
mash_dict = pd.Series(mash_df[1].values,index=mash_df[0]).to_dict()

complete_count = 0
dict_len = []

for m8_file in m8_list:
    prokka_hkg = get_hkg_dict(m8_file)
    dict_len.append(len(prokka_hkg))
    if len(prokka_hkg) == 21:
        complete_count += 1
        extract_genes(m8_file,prokka_folder,hkg_folder,mash_dict,prokka_hkg)

print complete_count
print np.average(dict_len)

74
20.973684210526315
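
The selection step above keeps, for every PAL reference gene, the query protein with the highest percent identity, and only genomes with all 21 genes go forward. The sketch below shows the same idea in plain Python under two assumptions that differ from the notebook: the result is keyed by reference gene rather than by query protein, and the completeness check also rejects a genome in which one protein is the best hit for two reference genes. The function names and the toy hits are hypothetical.

```python
def best_hit_per_reference(hits):
    """Map each reference gene (sseqid) to the query protein with the highest pident."""
    best = {}
    for hit in hits:
        current = best.get(hit["sseqid"])
        if current is None or hit["pident"] > current["pident"]:
            best[hit["sseqid"]] = hit
    return {ref: hit["qseqid"] for ref, hit in best.items()}

def is_complete(best, expected=21):
    """True when every reference gene has a hit and no protein is used twice."""
    return len(best) == expected and len(set(best.values())) == len(best)

# Toy hits invented for the example: two candidates for rplA, one for rplC.
toy_hits = [
    {"qseqid": "P1", "sseqid": "1_rplA", "pident": 80.0},
    {"qseqid": "P2", "sseqid": "1_rplA", "pident": 92.0},
    {"qseqid": "P3", "sseqid": "2_rplC", "pident": 75.0},
]
best = best_hit_per_reference(toy_hits)
print(best)
print(is_complete(best, expected=2), is_complete(best))
```

Keying by reference gene makes the "one homolog per housekeeping gene" invariant explicit, at the cost of one extra check for duplicated proteins.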


**Running MUSCLE and TrimAl**

In [None]:
#downloading trimAl
# !wget http://trimal.cgenomics.org/_media/trimal.v1.2rev59.tar.gz -P /home/tiago/
# !tar -xzf /home/tiago/trimal.v1.2rev59.tar.gz -C /home/tiago/; rm /home/tiago/trimal.v1.2rev59.tar.gz

In [None]:
# %%bash

cd /home/tiago/trimAl/source/

make #compiling trimAl

In [None]:
#!echo 'export PATH=$PATH:/home/tiago/trimAl/source/' >> ~/.bashrc #adding to path

In [25]:
input_list

['/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/19_rpsM.faa',
 '/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/11_rpsC.faa',
 '/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/12_rpsD.faa',
 '/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/9_rplV.faa',
 '/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/20_rpsO.faa',
 '/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/16_rpsI.faa',
 '/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/13_rpsE.faa',
 '/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/15_rpsH.faa',
 '/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/18_rpsL.faa',
 '/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/14_rpsG.faa',
 '/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/8_rplR.faa',
 '/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/3_rplE.faa',
 '/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/1_rplA.faa',
 '/home/tiago/De

In [26]:
for fasta_file in input_list:
    muscle_in = fasta_file
    # os.path.splitext drops only the final ".faa"; str.strip(".faa") strips any leading/trailing '.', 'f' or 'a' characters
    muscle_out = os.path.splitext(fasta_file)[0] + ".ali.faa"
    trim_out = os.path.splitext(fasta_file)[0] + ".trimmed.faa"
    subprocess.call("muscle -in %s -out %s"%(muscle_in,muscle_out),shell=True)
    subprocess.call("/home/tiago/trimAl/source/trimal -in %s -out %s -gt 0.8 -st 0.001 -cons 70"%(muscle_out,trim_out),shell=True)
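
Output names in this step are derived from each input file name. As a small sketch of that derivation, the block below uses `os.path.splitext` to remove the extension and shows why `str.strip` is not a suffix remover: it trims a set of characters from both ends. It also builds the two commands as argument lists, using only the flags that appear in the cell above; nothing is executed. The helper names and the example paths are hypothetical.

```python
import os

def output_names(fasta_file):
    """Return (aligned, trimmed) output paths next to the input FASTA."""
    stem, _ext = os.path.splitext(fasta_file)
    return stem + ".ali.faa", stem + ".trimmed.faa"

def build_commands(fasta_file, trimal="/home/tiago/trimAl/source/trimal"):
    """Build MUSCLE and trimAl argument lists (no shell needed to run them)."""
    aligned, trimmed = output_names(fasta_file)
    muscle_cmd = ["muscle", "-in", fasta_file, "-out", aligned]
    trimal_cmd = [trimal, "-in", aligned, "-out", trimmed,
                  "-gt", "0.8", "-st", "0.001", "-cons", "70"]
    return muscle_cmd, trimal_cmd

# str.strip treats its argument as a set of characters, not as a suffix.
stripped = "alpha.faa".strip(".faa")
print(stripped)

muscle_cmd, trimal_cmd = build_commands("/tmp/example/1_rplA.faa")
print(" ".join(muscle_cmd))
print(" ".join(trimal_cmd))
```

The argument lists could be passed to `subprocess.call` without `shell=True`, which avoids quoting problems if a path ever contains spaces.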

In [28]:
!ls ./pnas-phylogenomics/

10_rpsB.ali.faa      17_rpsK.faa	  3_rplE.trimmed.faa
10_rpsB.faa	     17_rpsK.trimmed.faa  4_rplF.ali.faa
10_rpsB.trimmed.faa  18_rpsL.ali.faa	  4_rplF.faa
11_rpsC.ali.faa      18_rpsL.faa	  4_rplF.trimmed.faa
11_rpsC.faa	     18_rpsL.trimmed.faa  5_rplK.ali.faa
11_rpsC.trimmed.faa  19_rpsM.ali.faa	  5_rplK.faa
12_rpsD.ali.faa      19_rpsM.faa	  5_rplK.trimmed.faa
12_rpsD.faa	     19_rpsM.trimmed.faa  6_rplM.ali.faa
12_rpsD.trimmed.faa  1_rplA.ali.faa	  6_rplM.faa
13_rpsE.ali.faa      1_rplA.faa		  6_rplM.trimmed.faa
13_rpsE.faa	     1_rplA.trimmed.faa   7_rplN.ali.faa
13_rpsE.trimmed.faa  20_rpsO.ali.faa	  7_rplN.faa
14_rpsG.ali.faa      20_rpsO.faa	  7_rplN.trimmed.faa
14_rpsG.faa	     20_rpsO.trimmed.faa  8_rplR.ali.faa
14_rpsG.trimmed.faa  21_rpsQ.ali.faa	  8_rplR.faa
15_rpsH.ali.faa      21_rpsQ.faa	  8_rplR.trimmed.faa
15_rpsH.faa	     21_rpsQ.trimmed.faa  9_rplV.ali.faa
15_rpsH.trimmed.faa  2_rplC.ali.faa	  9_rplV.faa
16_rpsI.ali.faa      2_rplC.faa		  9_rplV.

**Concatenating processed alignments**

In [29]:
ali_list = natsorted(glob.glob("/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/*.trimmed.faa"))

sequences_dict = {}

for item in ali_list:
    print item
    for record in (SeqIO.parse(item, "fasta")):
        if record.id not in sequences_dict:
            sequences_dict[record.id] = '%s'%record.seq
        else:
            sequences_dict[record.id] = '%s'%sequences_dict[record.id]+'%s'%record.seq
            
sequences_dict

/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/1_rplA.trimmed.faa
/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/2_rplC.trimmed.faa
/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/3_rplE.trimmed.faa
/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/4_rplF.trimmed.faa
/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/5_rplK.trimmed.faa
/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/6_rplM.trimmed.faa
/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/7_rplN.trimmed.faa
/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/8_rplR.trimmed.faa
/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/9_rplV.trimmed.faa
/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/10_rpsB.trimmed.faa
/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/11_rpsC.trimmed.faa
/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/12_rpsD.trimmed.faa
/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phy

{"'Nostoc_azollae'_0708_chromosome": 'KKSRRLLKVEDRYPLEALMLKETATAKFEAAEAHIRLGIDPKYTDQQLRTTVALPKGTGQIVRVAVIARGEKVEAAGADIAGSEELIDQIQKGMDFDLIATPDVMPMVAKLGKLLGPRGLMPSPKGGTVTDVGAIEFKAGKLEFRADRTGIVHVFGKASFSEDLLNLKALQETIDRNRPSGAKGRYWRTLYVSATMGPSIKIDISALRDLQMSVGILGTKLGMTQIFD-EAGVAIPVTVIQAGPCTVTQVKTKQTDGYCAIQVGYGVVK-PKVLNRPLLGHLAKSS-APALRHLNEYRTDASGDYALGQELK-ADIF-SAGQIVDVIGTSIGRGFAGNQKRNHFGRGPMSHGSKNHRAPGSIGAGTTPGRVYPGKRMAGRLGGTRVTIRKLTIVRLDADRNLILIKGAIPGKPGALVNIVPTNKVGR---LKYQETIPKLQQFYTNVHQVPKVIKITVNRGLGEAAQNAKALEASLEIAVTGQKPVVTRAKKAIAGFKIRQGMPVGIMVTLRGERMYAFLDRLVSLALPRIRDFRGVSPKSFDGRGNYTLGVREQLIFPEVEYDSIDQIRGLDISIITTAKNDEEGRALLKEMGMPFRQMSRIGKCPIVPKVQVAIDGVVVKGPKGELSRLPVVQGETLVTRDDTRSRQMHGLSRTLVANMVEGVSQGFQRRLEIQGVGYRAQVQGRNLVLNMGYSHQVQIPPDGIQFAVEGTINVIVSGYDKEIVGNTAAKIRAVRPPEPYKGKGIRYVGEVRRKAGKTGGK------------MAKKVVAVIKLALNAGKANPAPPVGPALGQHGVNIMMFCKEYNAKTAD-QAGMVIPVEISV-FEDRSFAFVLKTPPASVLIRKAAKIERGSNEPNKKKVGSISKAQLKEIAQTKLPDLNANDIDAAMNIVAGTAKNMGVTITDSKTLPSLEREWYVVDADKRLGRLASEIAVLRGKRKAEYTPHLDT
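
Concatenation by record id, as done above, yields equal-length rows only if every taxon is present in every per-gene alignment; here that holds because only genomes with all 21 genes were added. The sketch below is a hedged variant that makes the assumption explicit: it checks that each alignment is rectangular and pads a missing taxon with gaps. The function `concatenate` and the toy alignments are hypothetical, and plain dicts stand in for the trimmed FASTA files.

```python
def concatenate(alignments):
    """Concatenate per-gene alignments given as {taxon: aligned sequence} dicts.

    A taxon missing from a gene is padded with '-' so all rows keep the same length.
    """
    taxa = sorted(set().union(*[set(aln) for aln in alignments]))
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment differ in length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}

# Toy alignments invented for the example; taxon_B lacks the second gene.
gene1 = {"taxon_A": "MK-L", "taxon_B": "MKAL"}
gene2 = {"taxon_A": "GG"}
supermatrix = concatenate([gene1, gene2])
print(supermatrix)
```

With padding in place, a length check such as the one printed in the next cell would still report a single value for every record.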

In [30]:
seq = []

for key in sequences_dict:
    seq.append(SeqRecord(Seq("%s"%sequences_dict[key],IUPAC.protein), id = "%s"%key, description=""))

for record in seq:
    print len(record)
    
date = arrow.now().format('YYMMDD')
handle = open("/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/concat_genes-%s.faa"%date, "w")
SeqIO.write(seq,handle,"fasta")
handle.close()  # call the method; a bare `handle.close` only returns the method object and leaves the file open

3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207
3207


<function close>

In [31]:
sequences_dict["Stanieria_cyanosphaera_PCC_7437"]

'KKSRRMAKVDQKYPLEALLLKETATAKFETAEAHIRLGIDPKYTDQQLRTTVTFPKGTGQTVRIAVIARGEKVEAAGAEIVGSEELIDEIQKGMDFDLIATPDMMPKVARLGRMLGPKGLMPSPKGGTVTDLPAIDFKGGKQEFRADRTGIVHVFGKVSFSEDLLNLKALQETVDRNRPSGAKGRYWRSIFVSSSMGPSIQVDISGLRDLKMAVGILGTKLGMTQIFDKETGSAIPITVVQAGPCVITQIKTKATDGYNSIQIGYGEVK-EKALNKPELGHLKKSG-ATPLRHLKEYQIDDVSSFELGQSIN-AELF-SAGEIVDVTGTTIGRGFAGYQKRHNFKRGSMSHGSKNHRLPGSTGAGTTPGRVYPGKRMAGRYGASQVTIRKLTVVEIDSEKNLLLIKGAVPGKAGALLSIAPSNIVGKK--LKYQEQIPKLQQFYQNIHQVPKVIKVVVNRGLGEASQNAKALESSIELGITGQKPVVTRAKKAIAGFKIRKGMPVGVMVTLRGERMYAFLDRLINLALPRIRDFRGVSPKSFDGRGNYSLGVREQLIFPEIDYDSIDQIRGMDISIITTANTDEEGRALLKEMGMPFRNMSRIGKRPIIPKVTVNIQEVTVKGPKGTLERLPVVQGETLVVPDESRARERHGLSRTLVANMVSGVADGFEKRLQIQGVGYRAQAQGKKLTLNVGYSKPVEMMPDGIQVAVEKNTEITVSGIDKEVVGNVAAKIRAVRPPEPYKGKGIRYLGEVRRKAGKTGKK------------MPRKVVAIIKLALPAGKANPAPPVGPALGQHGVNIMAFCKEYNARTAD-KVGLVIPVEISV-FEDRSFTFILKTPPASVLIRKAAGIERGASQPNKQTVATITQAQLREIAETKMPDLNANDIEAAMKIVAGTAKNMGVAIAEMKTLPTLEQKWYVVDAEKRLGRLASEIAILRGKNKPTFTPHMDTGDFVIVVNADKVVTGRKQKLYRRHSGRPGGMKVETFL

```
PS: Problem saving Stanieria_cyanosphaera_PCC_7437, which was manually fixed
```
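
A last record that is cut short is a common symptom of reading a file whose handle was never closed, since buffered data may not have been flushed yet. Whether that explains the problem noted above is an assumption. As a sketch, writing inside a `with` block guarantees the file is closed, and reading the file back gives a quick check that every sequence was saved in full. The helper names and toy records are hypothetical, and the block uses only the standard library rather than `SeqIO`.

```python
import os
import tempfile

def write_fasta(records, path, width=60):
    """Write (name, sequence) pairs; the with-block closes and flushes the file."""
    with open(path, "w") as handle:
        for name, seq in records:
            handle.write(">%s\n" % name)
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")

def fasta_lengths(path):
    """Return {name: sequence length} read back from a FASTA file."""
    lengths = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:]
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths

# Toy records invented for the example.
records = [("taxon_A", "MKV" * 50), ("taxon_B", "MRL" * 50)]
path = os.path.join(tempfile.mkdtemp(), "concat_example.faa")
write_fasta(records, path)
written = fasta_lengths(path)
print(written)
```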

**Renaming cyanobiome genomes with SIO tag**

In [35]:
def rename(filename, outpath):
#     strain = os.path.splitext(os.path.basename(filename))[0]
    new_file = []
    input_handle = open("%s"%filename, "rU")
    for record in SeqIO.parse(input_handle, "fasta"):
        key = record.id
        if len(key.rsplit('_', 1)[1]) < 3:
            if key.count("_") < 2:
                record.id = key.split("_")[0] + "_" + "SIO"  + "1" + key.split("_")[1] 
        m = re.match(r'^(\D*)(_)[\S]{3}$',record.id)
        if m:
            strain = m.group(0)
            if strain.count("_") < 2:
                record.id = strain.split("_")[0] + "_" + "SIO"+ strain.split("_")[1]
        record.description = ""
        new_file.append(record)
    output_handle = open("%s/concat_genes-191112.ren.fasta"%outpath, "w")
    SeqIO.write(new_file, output_handle, "fasta")
    output_handle.close()
    input_handle.close()
    
rename("/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/concat_genes-191112.faa",
       "/home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/")

In [38]:
!grep '>' /home/tiago/Desktop/cyanet/genomics_pnas/pnas-phylogenomics/concat_genes-191112.ren.fasta

>Symploca_SIO2E9
>Thermosynechococcus_sp._NK55a
>Moorea_SIO4G3
>Moorea_SIO4G2
>Symploca_SIO2E6
>Okeania_SIO3H1
>Trichodesmium_LADK01
>Symploca_SIO1A3
>Cyanobium_gracile_PCC_6307
>Leptolyngbya_SIO1E4
>Okeania_SIO2D1
>Synechococcus_elongatus_PCC_6301_(re-annotation)
>Arthrospira_platensis_NIES-39
>Moorea_SIO3F7
>SM1_D11_(SM2F09_MP_sspace)
>Gloeocapsa_sp._PCC_7428
>Geitlerinema_sp._PCC_7407
>Prochlorococcus_marinus_str._MIT_9303
>Prochlorococcus_marinus_str._MIT_9301
>Cyanothece_SIO1E1
>Synechococcus_sp._JA-3-3Ab
>Calothrix_sp._PCC_7507
>Synechococcus_elongatus_PCC_7942
>Cyanothece_sp._ATCC_51142_chromosome_circular
>Moorea_SIO3I8
>Moorea_SIO3I6
>Moorea_SIO3I7
>Moorea_SIO3E8
>Halothece_sp._PCC_7418
>Synechococcus_sp._UTEX_2973
>Synechococcus_sp._CC9902
>Okeania_SIO3I5
>Leptolyngbya_SIO1D8
>Okeania_SIO2G4
>Pleurocapsa_sp._PCC_7327
>Sphaerospermopsis_SIO1G2
>Sphaerospermopsis_SIO1G1
>Oscillatoria_SIO1A7
>Spirulina_major_PCC_6313
>Moorea_SIO3A5
>Moorea

In [39]:
!mv ./pnas-phylogenomics/concat_genes-191112.ren.fasta ./outputs/
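
The renaming rule in `rename` can be isolated as a pure function on record ids, which makes it easy to test without touching files. The sketch below mirrors that rule as read from the code: an id of the form `Genus_XY` (strain code shorter than three characters) becomes `Genus_SIO1XY`, an id of the form `Genus_XYZ` (three-character code, no digits in the genus) becomes `Genus_SIOXYZ`, and everything else is left alone. One deliberate difference is that an id without any underscore is returned unchanged instead of raising an error. The function name and the input ids are hypothetical examples.

```python
import re

# Same pattern as in rename(): non-digits, an underscore, then exactly three characters.
SIO_PATTERN = re.compile(r'^(\D*)_\S{3}$')

def add_sio_tag(record_id):
    """Return the record id with the SIO tag inserted where the rule applies."""
    parts = record_id.split("_")
    if len(parts) != 2:
        return record_id  # published genomes with longer names stay as they are
    genus, code = parts
    if len(code) < 3:
        return "%s_SIO1%s" % (genus, code)
    if SIO_PATTERN.match(record_id):
        return "%s_SIO%s" % (genus, code)
    return record_id

# Hypothetical input ids, chosen to exercise each branch.
for example in ["Moorea_4G3", "Symploca_A3", "Calothrix_sp._PCC_7507", "Trichodesmium_LADK01"]:
    print(example, "->", add_sio_tag(example))
```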

Next, a phylogenetic tree was reconstructed from the concatenated ribosomal protein sequences extracted from the cyanobacterial genomes, using maximum likelihood (ML) as implemented in IQ-TREE 1.6.10 (16). The amino acid substitution model was selected with ModelFinder (17), which is part of IQ-TREE and chose LG+R10 (LG substitution matrix plus a FreeRate model with 10 rate categories) as the best model. Phylogenetic reconstruction was performed with this model and IQ-TREE default settings. Branch support was estimated from 100 replicates of the classical bootstrap, and the out-group was Melainabacteria SM1 D11. The overall shape and clades of the tree were consistent with previous studies. All genomes but Merismopedia sp. 2A8 contained a complete set of the 29 selected housekeeping genes.
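
The exact IQ-TREE command is not recorded in this notebook, so the following is only a sketch of how the settings described above could map onto IQ-TREE 1.6 command-line options: `-s` for the alignment, `-m` for the model (`MFP` would run ModelFinder first), `-b` for standard bootstrap replicates, `-o` for the out-group and `-nt` for threads. The function name, the alignment path and the out-group label are assumptions; the label must match the sequence header in the alignment exactly. The command is only built and printed, not run.

```python
def iqtree_command(alignment, outgroup, model="LG+R10", bootstraps=100, threads="AUTO"):
    """Build an IQ-TREE 1.6 argument list for an ML tree with standard bootstrap."""
    return ["iqtree",
            "-s", alignment,        # input alignment (FASTA)
            "-m", model,            # substitution model; "MFP" lets ModelFinder choose
            "-b", str(bootstraps),  # number of standard bootstrap replicates
            "-o", outgroup,         # taxon used to root the tree
            "-nt", str(threads)]    # number of threads, or AUTO

# Hypothetical path and out-group label, for illustration only.
cmd = iqtree_command("./outputs/concat_genes-191112.ren.fasta", "Melainabacteria_SM1_D11")
print(" ".join(cmd))
```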

```
Outputs: concat_genes-190425.ren.fasta
         Figure 1
```