In [14]:
import sys

# Make the local pymodulon checkout importable
sys.path.append('/home/tahani/Documents/github/0_pulls/pymodulon/')

from pymodulon.compare import get_bbh, make_prot_db, make_prots

### Create a BBH CSV file
When comparing two organisms, you may find that the bbh.csv file for that pair does not exist in the *modulome_compare_data* repository. For that case, pymodulon.compare contains a set of functions that create the bbh.csv file for your specific comparison. This only needs to be done once per combination of organisms.

In order to make the bbh.csv, you need to obtain the Genbank Full Genome files for your organisms. These can be found at https://www.ncbi.nlm.nih.gov/genome/. Search for both your organisms/strains and download the "Genbank (full)" file. Once saved into a known location, you will use the `make_prots` function to convert the Genbank files into protein FASTA files.

`make_prots` receives two parameters:
* `gbk`: Path to the Genbank file (one file per function call)
* `out_path`: Path where the protein FASTA file is written
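
For orientation, the output of `make_prots` is an ordinary protein FASTA file. The sketch below shows that format using only the standard library. It is not pymodulon's implementation: `write_protein_fasta`, the 60-character line width, and the toy records are assumptions made for illustration.

```python
import io
import textwrap


def write_protein_fasta(records, handle, width=60):
    """Write (identifier, sequence) pairs to `handle` in FASTA format.

    Hypothetical helper for illustration; not part of pymodulon.
    """
    for identifier, sequence in records:
        handle.write(f">{identifier}\n")
        # Wrap long sequences so each line holds at most `width` residues.
        handle.write("\n".join(textwrap.wrap(sequence, width)) + "\n")


# Toy records: the identifiers mimic locus tags, the sequences are made up.
toy_records = [("b0001", "MKRISTTITTTITITTGNGAG"), ("b0002", "M" * 70)]
buffer = io.StringIO()
write_protein_fasta(toy_records, buffer)
fasta_text = buffer.getvalue()
print(fasta_text)
```

Each record starts with a `>` header line followed by the wrapped amino-acid sequence, which is the layout `makeblastdb` expects as input.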

## 1. ecoli vs. selon

In [37]:
ecoli_gb = "/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/genbank_files/ecoli.gb"
selon_gb = "/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/genbank_files/selon.gb"
ecoli_out = "/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/ecoli_prot.fasta"
selon_out = "/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/selon_prot.fasta"

make_prots(ecoli_gb, ecoli_out)
make_prots(selon_gb, selon_out)



### Create BLAST database files
In order to run Bidirectional Best Hits (BBH) between two organisms, you must first build BLAST protein database files from the newly created protein FASTA files. To do so, use the `make_prot_db` function, once for each organism.

`make_prot_db` receives one parameter:
* `fasta_file`: String path to the protein FASTA file

If there is an error, the function prints an error message. The function does not run if the necessary BLAST DB files already exist.
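
The skip-if-present behaviour can be pictured as a file check followed by the `makeblastdb` call that appears in the cell output. This is a sketch, not pymodulon's code: `blast_db_exists` and `makeblastdb_command` are hypothetical names, and testing for the `.phr`, `.pin`, and `.psq` files is an assumption about how an existing protein database could be detected.

```python
import tempfile
from pathlib import Path

# Core files of a BLAST protein database (assumed detection criterion).
PROTEIN_DB_SUFFIXES = (".phr", ".pin", ".psq")


def blast_db_exists(fasta_file):
    """Return True if every core protein DB file sits next to the FASTA file."""
    return all(
        Path(str(fasta_file) + suffix).exists() for suffix in PROTEIN_DB_SUFFIXES
    )


def makeblastdb_command(fasta_file):
    """Build the command line that is printed in the notebook output."""
    return ["makeblastdb", "-in", str(fasta_file), "-parse_seqids", "-dbtype", "prot"]


# Demonstration in a temporary directory; no BLAST installation is needed.
with tempfile.TemporaryDirectory() as tmp:
    fasta = Path(tmp) / "demo_prot.fasta"
    fasta.write_text(">p1\nMKV\n")
    before = blast_db_exists(fasta)
    for suffix in PROTEIN_DB_SUFFIXES:
        # Empty placeholder files stand in for a real database.
        Path(str(fasta) + suffix).touch()
    after = blast_db_exists(fasta)
    command = makeblastdb_command(fasta)

print(before, after)
print(" ".join(command))
```

In real use the command list would be handed to `subprocess.run` only when the check returns `False`.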

In [42]:
make_prot_db(selon_out)
make_prot_db(ecoli_out)

BLAST DB files already exist
running makeblastdb with following command line...
makeblastdb -in /home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/ecoli_prot.fasta -parse_seqids -dbtype prot
Protein DB files created successfully


### Create BBH CSV
Once the previous steps have been completed, you can create the BBH CSV file with the `get_bbh` function.

`get_bbh` has the following parameters:
* `db1`: String path to protein FASTA file (output of make_prots function) for organism 1
* `db2`: String path to protein FASTA file (output of make_prots function) for organism 2
* `outdir`: String path to output directory, default is "bbh" and will create the directory if it does not exist
* `outname`: Name of the CSV file in which the results are saved, default is db1_vs_db2_parsed.csv, where db[1-2] are derived from the passed arguments
* `mincov`: Minimum coverage to call hits in BLAST, must be between 0 and 1
* `evalue`: E-value threshold for BLAST hits, default 0.001
* `threads`: Number of threads to run BLAST, Default 1
* `force`: Whether to overwrite existing files or not
* `savefiles`: Whether to save files to outdir
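
The default `outname` can be reconstructed from the two FASTA paths. The helper below is a hypothetical illustration (`default_outname` is not a pymodulon function, and the naming rule is inferred) that reproduces file names of the kind this notebook prints, such as `selon_prot_vs_ecoli_prot_parsed.csv`.

```python
from pathlib import Path


def default_outname(db1, db2):
    """Return '<db1 stem>_vs_<db2 stem>_parsed.csv' (assumed naming rule)."""
    return f"{Path(db1).stem}_vs_{Path(db2).stem}_parsed.csv"


name = default_outname("csv/selon_prot.fasta", "csv/ecoli_prot.fasta")
print(name)
```

Because the name depends on argument order, swapping `db1` and `db2` produces a different file.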

The function returns a pandas DataFrame containing the BLAST hits between the genes of the two organisms. It also saves the DataFrame to a CSV file, which you can then use in `compare_ica` for those two organisms.
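
A bidirectional best hit is a pair of genes that are each other's best BLAST hit. The sketch below shows the idea on toy data. It is not pymodulon's implementation: `label_hits` is a hypothetical function, the gene names are invented, and the use of `<=>` for reciprocal pairs and `->` for one-way hits is inferred from the `BBH` column and the messages in the cell outputs.

```python
def label_hits(best_1_to_2, best_2_to_1):
    """Label each best hit from organism 1 as reciprocal ('<=>') or one-way ('->')."""
    labels = {}
    for gene, subject in best_1_to_2.items():
        # A hit is reciprocal when the subject's own best hit points back.
        reciprocal = best_2_to_1.get(subject) == gene
        labels[(gene, subject)] = "<=>" if reciprocal else "->"
    return labels


# Toy best-hit tables (invented identifiers).
best_a_to_b = {"a1": "b1", "a2": "b2", "a3": "b2"}
best_b_to_a = {"b1": "a1", "b2": "a2"}
labels = label_hits(best_a_to_b, best_b_to_a)
print(labels)
```

Here `a3` hits `b2`, but `b2` prefers `a2`, so that pair is only a one-way hit.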

In [44]:
get_bbh(ecoli_out, selon_out, outdir="/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv")

/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/ecoli_prot.fasta  already blasted
/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/selon_prot.fasta  already blasted
bbh already parsed for ecoli_prot selon_prot


Unnamed: 0.1,Unnamed: 0,gene,subject,PID,alnLength,mismatchCount,gapOpenCount,queryStart,queryEnd,subjectStart,subjectEnd,eVal,bitScore,gene_length,COV,BBH
0,5,b0003,Synpcc7942_1440,33.808,281,171,6,2,279,3,271,1.270000e-42,145.0,310,0.906452,<=>
1,10,b0014,Synpcc7942_2468,56.966,646,255,6,1,635,1,634,0.000000e+00,733.0,638,1.012539,<=>
2,15,b0015,Synpcc7942_2074,42.037,383,189,8,5,367,4,373,3.220000e-86,262.0,376,1.018617,<=>
3,27,b0023,Synpcc7942_1520,34.884,86,49,2,1,79,1,86,9.370000e-08,42.4,87,0.988506,<=>
4,28,b0025,Synpcc7942_0492,38.225,293,175,5,18,305,14,305,8.330000e-57,182.0,313,0.936102,<=>
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
662,8594,b4352,Synpcc7942_0827,30.659,349,199,10,1,313,1,342,4.980000e-40,140.0,318,1.097484,<=>
663,8609,b4374,Synpcc7942_2509,28.910,211,127,8,6,196,7,214,6.690000e-11,56.2,225,0.937778,<=>
664,8610,b4375,Synpcc7942_2365,46.525,518,274,2,8,523,25,541,2.820000e-166,479.0,529,0.979206,<=>
665,8619,b4389,Synpcc7942_1452,45.648,471,236,6,1,456,1,466,3.610000e-128,376.0,460,1.023913,<=>
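
In the rows above, `COV` matches `alnLength / gene_length`, which would explain why it can exceed 1 when the alignment is longer than the gene. The check below recomputes it for three rows of the table. This reading of `COV` is inferred from the numbers, not taken from pymodulon's documentation.

```python
# (gene, alnLength, gene_length, reported COV) copied from the table above.
rows = [
    ("b0003", 281, 310, 0.906452),
    ("b0014", 646, 638, 1.012539),
    ("b0023", 86, 87, 0.988506),
]

# Recompute coverage as alignment length divided by gene length.
recomputed = {gene: round(aln / length, 6) for gene, aln, length, _ in rows}
for gene, _, _, reported in rows:
    print(gene, recomputed[gene], reported)
```

If this reading is right, the `mincov` parameter acts as a lower bound on this ratio.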


## 2. selon vs. ecoli

In [45]:
get_bbh(selon_out, ecoli_out, outdir="/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv")

/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/selon_prot.fasta  already blasted
/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/ecoli_prot.fasta  already blasted
parsing BBHs for selon_prot ecoli_prot


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  best_hit['BBH'] = '<=>'
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  best_hit['BBH'] = '->'


Saving results to: /home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/selon_prot_vs_ecoli_prot_parsed.csv


Unnamed: 0,gene,subject,PID,alnLength,mismatchCount,gapOpenCount,queryStart,queryEnd,subjectStart,subjectEnd,eVal,bitScore,gene_length,COV,BBH
0,Synpcc7942_0001,b3701,26.738,374,265,5,16,388,1,366,9.15e-40,142,390,0.958974,<=>
3,Synpcc7942_0004,b2312,39.232,469,259,13,27,473,2,466,4.4e-102,312,493,0.951318,<=>
5,Synpcc7942_0009,b3172,29.351,385,245,11,4,368,11,388,3.07e-36,134,400,0.9625,<=>
12,Synpcc7942_0015,b0804,45.089,224,122,1,1,223,1,224,3e-57,178,223,1.00448,<=>
14,Synpcc7942_0017,b4477,31.792,173,115,3,14,184,9,180,7.22e-20,80.1,187,0.925134,<=>
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8667,Synpcc7942_2600,b0428,37.918,269,165,2,3,270,28,295,1.38e-43,147,276,0.974638,<=>
8671,Synpcc7942_2604,b0430,31.5,200,130,3,2,195,3,201,1.56e-20,82,195,1.02564,<=>
8672,Synpcc7942_2606,b1719,40.172,580,328,7,22,597,68,632,2.7e-157,462,604,0.960265,<=>
8674,Synpcc7942_2607,b1749,35.531,273,150,7,1,258,1,262,1.07e-41,141,265,1.03019,<=>
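
Once saved, the parsed CSV can be read back to build a gene-to-ortholog mapping. The sketch below uses the standard `csv` module on two rows copied from the table above, abridged to a few columns. `load_bbh_map` is a hypothetical helper, and keeping only rows marked `<=>` is an assumption about which hits are wanted downstream.

```python
import csv
import io


def load_bbh_map(handle):
    """Map each query gene to its subject for rows flagged as reciprocal hits."""
    reader = csv.DictReader(handle)
    return {row["gene"]: row["subject"] for row in reader if row["BBH"] == "<=>"}


# Two rows from the selon vs. ecoli result, with most columns omitted.
sample = io.StringIO(
    "gene,subject,PID,alnLength,COV,BBH\n"
    "Synpcc7942_0001,b3701,26.738,374,0.958974,<=>\n"
    "Synpcc7942_0004,b2312,39.232,469,0.951318,<=>\n"
)
bbh_map = load_bbh_map(sample)
print(bbh_map)
```

For a file on disk, the same function can be called inside `with open(path, newline="") as handle:`.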


## 3. Staph vs. Selon

In [25]:
staph_gb = "/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/genbank_files/staph.gb"
selon_gb = "/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/genbank_files/selon.gb"
staph_out = "/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/staph_prot.fasta"
selon_out = "/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/selon_prot.fasta"

make_prots(staph_gb, staph_out)
make_prots(selon_gb, selon_out)



In [26]:
make_prot_db(staph_out)
make_prot_db(selon_out)

BLAST DB files already exist
BLAST DB files already exist


In [28]:
get_bbh(staph_out, selon_out, outdir="/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv")

blasting /home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/staph_prot.fasta vs /home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/selon_prot.fasta
running blastp with following command line...
blastp -db /home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/staph_prot.fasta -query /home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/selon_prot.fasta -out /home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/selon_prot_vs_staph_prot.txt -evalue 0.001 -outfmt 6 -num_threads 1
blasting /home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/selon_prot.fasta vs /home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/staph_prot.fasta
running blastp with following command line...
blastp -db /home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/selon_prot.fasta -query /home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/staph_prot.fasta -out /home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/staph_prot_vs_

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  best_hit['BBH'] = '<=>'
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  best_hit['BBH'] = '->'


Saving results to: /home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/staph_prot_vs_selon_prot_parsed.csv


Unnamed: 0,gene,subject,PID,alnLength,mismatchCount,gapOpenCount,queryStart,queryEnd,subjectStart,subjectEnd,eVal,bitScore,gene_length,COV,BBH
0,USA300HOU_RS00005,Synpcc7942_1100,43.45,458,245,6,2,450,24,476,5.26e-132,386,453,1.01104,<=>
1,USA300HOU_RS00010,Synpcc7942_0001,28.721,383,253,9,2,375,16,387,2.67e-52,175,377,1.01592,<=>
3,USA300HOU_RS00020,Synpcc7942_2250,36.927,371,227,4,1,368,1,367,1.48e-62,201,370,1.0027,<=>
5,USA300HOU_RS00025,Synpcc7942_2491,56.032,630,270,4,12,634,3,632,0,664,644,0.978261,<=>
6,USA300HOU_RS00030,Synpcc7942_0254,47.759,848,408,10,6,821,4,848,0,769,887,0.956032,<=>
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5392,USA300HOU_RS14580,Synpcc7942_0702,30.645,248,152,6,1,247,1,229,4.34e-32,115,282,0.879433,<=>
5442,USA300HOU_RS14705,Synpcc7942_0267,38.426,216,127,3,25,236,28,241,1.01e-46,152,239,0.903766,<=>
5443,USA300HOU_RS14710,Synpcc7942_2423,51.442,624,299,4,2,622,4,626,0,630,625,0.9984,<=>
5445,USA300HOU_RS14715,Synpcc7942_1582,38.197,466,268,6,4,459,7,462,2.48e-114,340,459,1.01525,<=>
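
The `-outfmt 6` flag in the `blastp` command above produces tab-separated lines with twelve standard fields. The sketch below parses one such line into the column names used in the table. The mapping of BLAST fields to those names is an assumption based on their order, and `parse_outfmt6_line` is a hypothetical helper, not a pymodulon function. The sample line is assembled from the first table row.

```python
# Column names as they appear in the result table, in BLAST outfmt 6 order.
COLUMNS = [
    "gene", "subject", "PID", "alnLength", "mismatchCount", "gapOpenCount",
    "queryStart", "queryEnd", "subjectStart", "subjectEnd", "eVal", "bitScore",
]
INT_COLUMNS = {
    "alnLength", "mismatchCount", "gapOpenCount",
    "queryStart", "queryEnd", "subjectStart", "subjectEnd",
}
FLOAT_COLUMNS = {"PID", "eVal", "bitScore"}


def parse_outfmt6_line(line):
    """Turn one tab-separated BLAST hit line into a typed dictionary."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    record = dict(zip(COLUMNS, values))
    for name in INT_COLUMNS:
        record[name] = int(record[name])
    for name in FLOAT_COLUMNS:
        record[name] = float(record[name])
    return record


sample_line = "\t".join([
    "USA300HOU_RS00005", "Synpcc7942_1100", "43.45", "458", "245", "6",
    "2", "450", "24", "476", "5.26e-132", "386",
])
hit = parse_outfmt6_line(sample_line)
print(hit)
```

The `gene_length`, `COV`, and `BBH` columns are not part of the raw BLAST output, so they must be added in a later parsing step.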


## 4. Selon vs. Staph

In [29]:
get_bbh(selon_out, staph_out, outdir="/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv")

/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/selon_prot.fasta  already blasted
/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/staph_prot.fasta  already blasted
parsing BBHs for selon_prot staph_prot


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  best_hit['BBH'] = '<=>'
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  best_hit['BBH'] = '->'


Saving results to: /home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/selon_prot_vs_staph_prot_parsed.csv


Unnamed: 0,gene,subject,PID,alnLength,mismatchCount,gapOpenCount,queryStart,queryEnd,subjectStart,subjectEnd,eVal,bitScore,gene_length,COV,BBH
0,Synpcc7942_0001,USA300HOU_RS00010,28.721,383,253,9,16,387,2,375,2.85e-52,175,390,0.982051,<=>
1,Synpcc7942_0003,USA300HOU_RS05325,44.488,762,362,13,18,775,22,726,0,607,777,0.980695,<=>
2,Synpcc7942_0004,USA300HOU_RS05330,42.826,460,244,6,24,474,8,457,4.36e-135,396,493,0.933063,<=>
4,Synpcc7942_0009,USA300HOU_RS04770,52.538,394,185,1,5,398,3,394,5.74e-145,414,400,0.985,<=>
6,Synpcc7942_0012,USA300HOU_RS01935,29.348,92,64,1,6,96,4,95,6.37e-15,61.6,107,0.859813,<=>
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5573,Synpcc7942_2595,USA300HOU_RS03795,31.83,399,257,6,43,435,22,411,7.7e-73,233,476,0.838235,<=>
5581,Synpcc7942_2604,USA300HOU_RS05270,31.461,178,113,5,24,195,21,195,1.92e-16,70.5,195,0.912821,<=>
5582,Synpcc7942_2606,USA300HOU_RS08915,44.696,575,304,7,10,580,58,622,0,527,604,0.951987,<=>
5586,Synpcc7942_2612,USA300HOU_RS06755,58.006,331,131,5,18,344,4,330,4.66e-137,391,366,0.904372,<=>
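
Because a reciprocal hit is symmetric, a pair flagged `<=>` in one direction should reappear, reversed, in the other. The sketch below checks this for pairs copied from the staph vs. selon and selon vs. staph tables. `reversed_pairs` is a hypothetical helper, and only a few displayed rows are used, so pairs missing from the intersection are not evidence of asymmetry.

```python
def reversed_pairs(pairs):
    """Swap (gene, subject) to (subject, gene) for every pair."""
    return {(subject, gene) for gene, subject in pairs}


# Pairs taken from the displayed rows of the two result tables.
staph_vs_selon = {
    ("USA300HOU_RS00010", "Synpcc7942_0001"),
    ("USA300HOU_RS00005", "Synpcc7942_1100"),
}
selon_vs_staph = {
    ("Synpcc7942_0001", "USA300HOU_RS00010"),
    ("Synpcc7942_0003", "USA300HOU_RS05325"),
}

# Pairs that appear in both directions.
shared = reversed_pairs(staph_vs_selon) & selon_vs_staph
print(shared)
```

Running the same comparison on the two full CSV files would show how many reciprocal pairs agree across directions.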


## 5. Subtilis vs. Selon

In [30]:
subtilis_gb = "/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/genbank_files/subtilis.gb"
selon_gb = "/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/genbank_files/selon.gb"
subtilis_out = "/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/subtilis_prot.fasta"
selon_out = "/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/selon_prot.fasta"

make_prots(subtilis_gb, subtilis_out)
make_prots(selon_gb, selon_out)

In [31]:
make_prot_db(subtilis_out)
make_prot_db(selon_out)

running makeblastdb with following command line...
makeblastdb -in /home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/subtilis_prot.fasta -parse_seqids -dbtype prot
Protein DB files created successfully
BLAST DB files already exist


In [33]:
get_bbh(subtilis_out, selon_out, outdir="/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv")

/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/subtilis_prot.fasta  already blasted
/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/selon_prot.fasta  already blasted
bbh already parsed for subtilis_prot selon_prot


Unnamed: 0.1,Unnamed: 0,gene,subject,PID,alnLength,mismatchCount,gapOpenCount,queryStart,queryEnd,subjectStart,subjectEnd,eVal,bitScore,gene_length,COV,BBH
0,0,BSU_00010,Synpcc7942_1100,47.020,453,222,4,7,445,28,476,1.950000e-151,435.0,446,1.015695,<=>
1,1,BSU_00020,Synpcc7942_0001,28.721,383,255,8,1,376,16,387,1.140000e-49,168.0,378,1.013228,<=>
2,2,BSU_00030,Synpcc7942_0081,40.678,59,35,0,5,63,1,59,3.120000e-09,44.7,71,0.830986,<=>
3,3,BSU_00040,Synpcc7942_2250,35.294,374,229,5,1,368,1,367,7.300000e-57,187.0,370,1.010811,<=>
4,5,BSU_00060,Synpcc7942_2491,58.571,630,253,4,6,627,3,632,0.000000e+00,718.0,638,0.987461,<=>
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
626,8217,BSU_40920,Synpcc7942_0554,60.000,365,142,3,3,366,2,363,1.110000e-159,449.0,366,0.997268,<=>
627,8226,BSU_41000,Synpcc7942_0267,36.744,215,133,2,25,237,28,241,7.830000e-46,149.0,239,0.899582,<=>
628,8227,BSU_41010,Synpcc7942_2423,50.243,617,301,4,7,619,6,620,0.000000e+00,617.0,628,0.982484,<=>
629,8229,BSU_41020,Synpcc7942_1582,42.950,461,255,4,2,459,7,462,1.290000e-125,369.0,459,1.004357,<=>


## 6. Selon vs. Subtilis

In [35]:
get_bbh(selon_out, subtilis_out, outdir="/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv")

/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/selon_prot.fasta  already blasted
/home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/subtilis_prot.fasta  already blasted
parsing BBHs for selon_prot subtilis_prot


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  best_hit['BBH'] = '<=>'
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  best_hit['BBH'] = '->'


Saving results to: /home/tahani/Documents/github/pymodulon/bbh_FASTA_genbank/csv/selon_prot_vs_subtilis_prot_parsed.csv


Unnamed: 0,gene,subject,PID,alnLength,mismatchCount,gapOpenCount,queryStart,queryEnd,subjectStart,subjectEnd,eVal,bitScore,gene_length,COV,BBH
0,Synpcc7942_0001,BSU_00020,28.721,383,255,8,16,387,1,376,1.82e-49,168,390,0.982051,<=>
1,Synpcc7942_0003,BSU_06480,51.852,756,318,14,18,770,22,734,0,742,777,0.972973,<=>
2,Synpcc7942_0004,BSU_06490,45.887,462,236,6,25,480,10,463,8.07e-151,436,493,0.93712,<=>
6,Synpcc7942_0009,BSU_29450,56.709,395,169,2,1,395,1,393,5.68e-161,455,400,0.9875,<=>
10,Synpcc7942_0012,BSU_40910,38.462,91,56,0,6,96,4,94,1.19e-21,79,107,0.850467,<=>
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8162,Synpcc7942_2604,BSU_14910,32.973,185,111,5,19,195,24,203,1.31e-23,90.1,195,0.948718,<=>
8164,Synpcc7942_2606,BSU_28950,45.876,582,299,7,22,599,71,640,0,543,604,0.963576,<=>
8168,Synpcc7942_2607,BSU_40880,35.094,265,150,9,1,260,1,248,4.91e-43,144,265,1,<=>
8170,Synpcc7942_2610,BSU_15610,47.034,236,124,1,10,245,2,236,9.56e-71,215,265,0.890566,<=>
