This notebook contains notes for the analysis of the Thermococcales genomes.

Three folders accompany this notebook:
- data: most of the raw data needed for the analysis
- analysis: post-processed output derived from the raw genome data
- scripts: the scripts used in the analysis. These are copied from an earlier GitHub repository; a local copy is kept here so the scripts stay consistent with the analysis.

Requirements:
- BLAST+
- FastOrtho
- MCL (called by FastOrtho through its --mcl_path option)

All genomes were downloaded from JGI through the Globus website, except one (_Thermococcus eurythermalis_), which was downloaded from NCBI in GBK format. All the data is available in the data folder, under img_genomes and ncbi_genomes.

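First, we need to process the genomes so they can be used for the ortholog search and for further processing. Running each preparation script without arguments prints its usage message, and the `cat` commands show the genome lists the scripts take as input.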

In [3]:
!python scripts/PrepareMCLgenomes_fromGBK_Genbank.py
!cat data/list_ncbi_genomes.txt

usage: PrepareMCLgenomes_fromGBK_Genbank.py [-h] -g GBK_LIST -i INPUT_FOLDER -o OUTPUT_DIRECTORY
PrepareMCLgenomes_fromGBK_Genbank.py: error: argument -g/--gbk_list is required
Teurythermalis	Teurythermalis


In [4]:
!python scripts/PrepareGenomeData.py
!cat data/list_img_genomes.txt

usage: PrepareGenomeData.py [-h] -g GENOME_LIST -i INPUT_FOLDER -s GENOME_SOURCE -o OUTPUT_DIRECTORY
PrepareGenomeData.py: error: argument -g/--genome_list is required
644736411	Thermococcus gammatolerans EJ3	TgammaEJ3
644736412	Thermococcus sibiricus MM 739	TsibiriMM739
2545824513	Thermococcus sp. Strain 175 (First SPAdes assembly of Thermococcus sp. Strain 175)	Tsp175
643348580	Thermococcus onnurineus NA1	TonnuNA1
650716097	Thermococcus sp. 4557	Tsp4557
2501025505	Palaeococcus ferrophilus DMJ, DSM 13482	PferroDSM13482
2521172719	Pyrococcus sp. ST04	PspST04
2513237398	Thermococcus litoralis DSM 5473	TlitoDSM5473
2511231053	Thermococcus sp. AM4	TspAM4
638154520	Thermococcus kodakarensis KOD1	TkodKOD1
650716096	Thermococcus barophilus MP, DSM 11836	TbaroDSM11836
638154516	Pyrococcus horikoshii OT3	PhorikoOT3
2510065005	Thermococcus sp. PK	TspPK
638154514	Pyrococcus abyssi GE5	PabyGE5
2517093039	Pyrococcus furiosus COM1	PfurCOM1
2518645553	Thermococcus sp. CL1	TspCL1
650716079	Pyrococcus

In [5]:
#Prepare ncbi genomes
#!python scripts/PrepareMCLgenomes_fromGBK_Genbank.py -g data/list_ncbi_genomes.txt -i data/ncbi_genomes/ -o data/processed_genomes/
#Prepare img genomes
#!python scripts/PrepareGenomeData.py -g data/list_img_genomes.txt -i data/img_genomes/ -o data/processed_genomes -s img
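
Before running the preparation scripts, it can help to check that the genome lists are well formed. The sketch below is not part of the original scripts: `read_genome_list` and `duplicated_labels` are hypothetical helpers, and they assume (from the listings printed above) that the files are tab-separated, with the genome identifier in the first column and the short label in the last column.

```python
import csv
from collections import Counter


def read_genome_list(handle):
    """Parse a tab-separated genome list into (identifier, label) pairs.

    Assumption: the first column is the genome identifier and the last
    column is the short label, as in the listings printed in this notebook.
    """
    entries = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        fields = [field.strip() for field in row]
        if not any(fields):
            continue  # skip blank lines
        if len(fields) < 2:
            raise ValueError(f"Expected at least two columns, got: {row!r}")
        entries.append((fields[0], fields[-1]))
    return entries


def duplicated_labels(entries):
    """Return the sorted short labels that appear more than once."""
    counts = Counter(label for _, label in entries)
    return sorted(label for label, n in counts.items() if n > 1)
```

To use it, open a list file (for example `data/list_img_genomes.txt`), pass the handle to `read_genome_list`, and check that `duplicated_labels` returns an empty list. A duplicated label is worth catching early because the labels appear to be used as file names for the processed proteomes.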

Now we create the files needed for the BLAST search.

In [6]:
%%bash
mkdir -p analysis/blast
cat data/processed_genomes/protein/*.fasta > analysis/blast/All_proteins.faa
makeblastdb -in analysis/blast/All_proteins.faa -dbtype prot



Building a new DB, current time: 09/08/2015 17:07:21
New DB name:   analysis/blast/All_proteins.faa
New DB title:  analysis/blast/All_proteins.faa
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 43069 sequences in 1.73347 seconds.
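
As a sanity check on the concatenation step, the number of FASTA header lines can be compared with the sequence count that makeblastdb reports (43069 in the output above). This is a minimal sketch rather than part of the original analysis; `count_fasta_records` and `count_records_per_file` are hypothetical helper names, and the sketch assumes every record starts with a `>` header line.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count records in a FASTA file by counting '>' header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def count_records_per_file(fasta_dir, pattern="*.fasta"):
    """Map each FASTA file name in fasta_dir to its record count."""
    return {
        fasta.name: count_fasta_records(fasta)
        for fasta in sorted(Path(fasta_dir).glob(pattern))
    }
```

Calling `count_fasta_records("analysis/blast/All_proteins.faa")` should give the number reported by makeblastdb, and the values returned by `count_records_per_file("data/processed_genomes/protein")` should sum to the same total.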


In [1]:
#Run Blastp
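
The blastp command itself is not recorded in this cell. The sketch below shows how an all-against-all search could be assembled. The query and database paths and the output name `blastp.Thermo_090815` come from other cells of this notebook, but the e-value and thread count are placeholder assumptions, not the values used in the analysis. It also assumes that FastOrtho reads tabular BLAST output (`-outfmt 6`), which should be checked against the FastOrtho documentation. `build_blastp_command` is a hypothetical helper that only builds the argument list; nothing is executed.

```python
import shlex


def build_blastp_command(query, db, out, evalue=1e-5, threads=4):
    """Build a BLAST+ blastp argument list for an all-against-all search.

    The evalue and threads defaults are illustrative assumptions only.
    """
    return [
        "blastp",
        "-query", str(query),
        "-db", str(db),
        "-out", str(out),
        "-outfmt", "6",  # tabular output, one hit per line
        "-evalue", str(evalue),
        "-num_threads", str(threads),
    ]


command = build_blastp_command(
    query="analysis/blast/All_proteins.faa",
    db="analysis/blast/All_proteins.faa",
    out="analysis/blast/blastp.Thermo_090815",
)
# Print a shell-ready version of the command for inspection
print(" ".join(shlex.quote(argument) for argument in command))
```

The list can be passed to `subprocess.run(command, check=True)` once the parameters have been reviewed.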

### Running FastOrtho

I used FastOrtho (a reimplementation of OrthoMCL) to search for orthologous genes among the processed genomes. To run it, I first need to create an option file containing the program's parameters.

In [6]:
%%bash
mkdir -p analysis/fastortho

# Use > for the first option so that re-running this cell starts a fresh file instead of appending duplicates
echo --mcl_path /usr/local/bin/mcl > analysis/fastortho/option_file.txt
echo --working_directory /Users/jugalde/Documents/Research/Thermococcales_CompGenomPres/analysis/fastortho >> analysis/fastortho/option_file.txt
echo --project_name Thermo_090915 >> analysis/fastortho/option_file.txt
echo --blast_file /Users/jugalde/Documents/Research/Thermococcales_CompGenomPres/analysis/blast/blastp.Thermo_090815 >> analysis/fastortho/option_file.txt

# Add one --single_genome_fasta entry per processed proteome
for entry in "$PWD"/data/processed_genomes/protein/*.*; do echo --single_genome_fasta "$entry"; done >> analysis/fastortho/option_file.txt
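
The same option file can also be generated from Python, which avoids hard-coding absolute paths and always rewrites the file from scratch. This is a sketch under assumptions rather than the method used here: `write_option_file` is a hypothetical helper, it only emits the five options that appear in this notebook, and it assumes the processed proteomes all end in `.fasta`.

```python
from pathlib import Path


def write_option_file(option_path, mcl_path, working_directory,
                      project_name, blast_file, fasta_dir):
    """Write a FastOrtho option file and return its lines.

    Only the options used in this notebook are written. Paths are made
    absolute so the file does not depend on the current directory.
    """
    fasta_files = sorted(Path(fasta_dir).glob("*.fasta"))
    if not fasta_files:
        raise FileNotFoundError(f"No .fasta files found in {fasta_dir}")

    lines = [
        f"--mcl_path {mcl_path}",
        f"--working_directory {Path(working_directory).resolve()}",
        f"--project_name {project_name}",
        f"--blast_file {Path(blast_file).resolve()}",
    ]
    lines += [f"--single_genome_fasta {fasta.resolve()}" for fasta in fasta_files]

    # write_text overwrites any existing file, so re-running is safe
    Path(option_path).write_text("\n".join(lines) + "\n")
    return lines
```

For this project the call would use `analysis/fastortho/option_file.txt`, `/usr/local/bin/mcl`, `analysis/fastortho`, `Thermo_090915`, `analysis/blast/blastp.Thermo_090815`, and `data/processed_genomes/protein` as arguments, in that order.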


In [7]:
%%bash
cat analysis/fastortho/option_file.txt

--mcl_path /usr/local/bin/mcl
--working_directory /Users/jugalde/Documents/Research/Thermococcales_CompGenomPres/analysis/fastortho
--project_name Thermo_090915
--blast_file /Users/jugalde/Documents/Research/Thermococcales_CompGenomPres/analysis/blast/blastp.Thermo_090815
--single_genome_fasta /Users/jugalde/Documents/Research/Thermococcales_CompGenomPres/data/processed_genomes/protein/PabyGE5.fasta
--single_genome_fasta /Users/jugalde/Documents/Research/Thermococcales_CompGenomPres/data/processed_genomes/protein/PferroDSM13482.fasta
--single_genome_fasta /Users/jugalde/Documents/Research/Thermococcales_CompGenomPres/data/processed_genomes/protein/PfurCOM1.fasta
--single_genome_fasta /Users/jugalde/Documents/Research/Thermococcales_CompGenomPres/data/processed_genomes/protein/PfurDSM3638.fasta
--single_genome_fasta /Users/jugalde/Documents/Research/Thermococcales_CompGenomPres/data/processed_genomes/protein/PhorikoOT3.fasta
--single_genome_fasta /Users/jugalde/Documents/Research/Thermo

Now we can run the FastOrtho analysis.

In [8]:
%%bash
time /Users/jugalde/BioApps/FastOrtho/src/FastOrtho --option_file analysis/fastortho/option_file.txt

which /usr/local/bin/mcl
/usr/local/bin/mcl
 1.00 to classify blast hits
 1.00 to prepare classified hits for mcl
/usr/local/bin/mcl /Users/jugalde/Documents/Research/Thermococcales_CompGenomPres/analysis/fastortho/Thermo_090915.mtx -I 1.5 -o /Users/jugalde/Documents/Research/Thermococcales_CompGenomPres/analysis/fastortho/Thermo_090915.ocl
gene count = 40471 in 20 taxons
 3.00 to run mcl and convert its output
 5.00 total duration


[mclIO] reading </Users/jugalde/Documents/Research/Thermococcales_CompGenomPres/analysis/fastortho/Thermo_090915.mtx>
.......................................
[mclIO] read native interchange 40471x40471 matrix with 588878 entries
[mcl] pid 17814
 ite -------------------  chaos  time hom(avg,lo,hi) m-ie m-ex i-ex fmv
  1  ...................  11.84  0.14 0.99/0.12/2.84 1.11 1.11 1.11   0
  2  ...................  12.28  0.16 0.95/0.25/2.86 1.17 1.06 1.18   0
  3  ...................  11.51  0.19 0.92/0.31/2.90 1.11 0.97 1.14   0
  4  ...................  11.42  0.17 0.88/0.23/2.93 1.08 0.96 1.10   0
  5  ...................   9.93  0.16 0.83/0.25/2.57 1.05 0.97 1.06   0
  6  ...................   5.94  0.15 0.78/0.21/1.82 1.03 0.97 1.03   0
  7  ...................   6.25  0.13 0.73/0.23/1.30 1.01 0.97 1.01   0
  8  ...................   6.47  0.13 0.68/0.13/1.19 1.01 0.97 0.97   0
  9  ...................  11.50  0.12 0.64/0.12/1.19 1.00 0.95 0.93   0
 10  ...................  14.01  0.

# Cluster Analysis