# Compile Results

Each inference program writes its output in a different format. The code here compiles them into a common format so the results are easier to compare.

In [1]:
# suppress SyntaxWarnings from ete3, which is not yet up to date with Python 3.12
import warnings
warnings.filterwarnings("ignore", category=SyntaxWarning)

In [2]:
import ete3
import pandas as pd
import os
import sys

# ../code contains compile_results.py, which holds all the functions we need
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'code'))
import compile_results
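
The path setup in the cell above can also be written with `pathlib`. The following is only a sketch of that idea: the helper name `add_sibling_dir_to_path` is hypothetical and not part of `compile_results`, and it assumes, as the cell does, that the notebook is run from a directory that sits next to `code`.

```python
import sys
from pathlib import Path


def add_sibling_dir_to_path(dirname: str = "code") -> Path:
    """Append <parent of cwd>/<dirname> to sys.path (once) and return that path.

    Hypothetical helper for illustration only.
    """
    target = Path.cwd().parent / dirname
    target_str = str(target)
    # Avoid adding the same entry again when the cell is re-run.
    if target_str not in sys.path:
        sys.path.append(target_str)
    return target


code_dir = add_sibling_dir_to_path("code")
# Report whether the directory is actually there before trying the import.
print(code_dir, "exists" if code_dir.is_dir() else "is missing")
```

Checking `is_dir()` before `import compile_results` gives a clearer message than an `ImportError` when the notebook is started from an unexpected working directory.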

In [3]:
genome_tree_labeled_path = "../data/genome_tree/genome_tree.iqtree.treefile.rooted.labeled"
genome_tree_labeled = ete3.Tree(genome_tree_labeled_path, format=1)

## ALE

In [4]:
ale_gene_dynamics_dir = "../data/inferences/gene_dynamics/ALE/"
# compile the outputs
compile_results.compile_ale_outputs(output_dir=ale_gene_dynamics_dir, 
                                    input_tree=genome_tree_labeled, 
                                    compiled_results_dir="../data/compiled_results/",
                                    var_str='gene')
print("-----------------------------------")

varwise, branchwise transfers:
          nog_id source_branch recipient_branch  transfers
0          D3SHM          1582             N213       0.25
1          D3SHM         53346             1582       0.20
2          D3SHM          N242             1582       0.25
3          D3SHM          N213             1582       0.30
4          D3W6Z       1221500             1398       0.01
...          ...           ...              ...        ...
7603256  COG0640            N3              N12       0.27
7603257  COG0640            N3               N8       0.05
7603258  COG0640            N3              N11       0.01
7603259  COG0640            N3               N4       0.01
7603260  COG0640            N3               N7       0.03

[7603261 rows x 4 columns]
varwise transfers:
       nog_id  transfers
0     COG0001      56.22
1     COG0002      55.65
2     COG0003      29.01
3     COG0004      57.39
4     COG0005      30.86
...       ...        ...
7017    DA689       1.22
7018    DA68R 

## GLOOME

GLOOME infers only gains and losses on branches; it does not infer the source of the gene transfer behind a gain. When we compile branchwise inferences, we therefore have a recipient column but no source column.
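
To illustrate what a recipient-only table means in practice, here is a minimal sketch that collapses a source-aware table into that shape. It reuses the four `D3SHM` rows printed in the ALE output above. Summing over sources is an assumption made for illustration and is not necessarily how `compile_results` aligns the two programs.

```python
import pandas as pd

# Rows copied from the ALE branchwise output shown earlier (gene family D3SHM).
ale_branchwise = pd.DataFrame(
    {
        "nog_id": ["D3SHM"] * 4,
        "source_branch": ["1582", "53346", "N242", "N213"],
        "recipient_branch": ["N213", "1582", "1582", "1582"],
        "transfers": [0.25, 0.20, 0.25, 0.30],
    }
)

# Sum over all sources for each (gene family, recipient) pair. The result has
# no source column, like a table built from a recipient-only method.
recipient_only = (
    ale_branchwise
    .groupby(["nog_id", "recipient_branch"], as_index=False)["transfers"]
    .sum()
)
print(recipient_only)
```

After this step both tables are keyed on `nog_id` and `recipient_branch`, so they could be merged on those two columns for a side-by-side comparison.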

In [4]:
# first we compile results for the gene dynamics
gloome_output_dir = "../data/inferences/gene_dynamics/GLOOME/"
compile_results.read_and_compile_gloome_results(
    gloome_output_dir=gloome_output_dir, input_tree=genome_tree_labeled,
    compiled_results_dir="../data/compiled_results/", var_str='gene', 
    var_name_str='nog_id', pa_matrix_tsv_filepath="../data/filtered/pa_matrix.nogs.numerical.tsv")
print("-----------------------------------")
# then the results for the ecosystem type dynamics
gloome_output_dir = "../data/inferences/ecotype_dynamics/GLOOME/"
compile_results.read_and_compile_gloome_results(
    gloome_output_dir=gloome_output_dir, input_tree=genome_tree_labeled,
    compiled_results_dir="../data/compiled_results/", var_str='ecotype',
    var_name_str='ecotype', pa_matrix_tsv_filepath="../data/filtered/pa_matrix.ecosystem_type.numerical.tsv")
print("-----------------------------------")
# and finally the results for the ecosystem subtype dynamics
gloome_output_dir = "../data/inferences/ecosubtype_dynamics/GLOOME/"
compile_results.read_and_compile_gloome_results(
    gloome_output_dir=gloome_output_dir, input_tree=genome_tree_labeled,
    compiled_results_dir="../data/compiled_results/", var_str='ecosubtype',
    var_name_str='ecosubtype', pa_matrix_tsv_filepath="../data/filtered/pa_matrix.ecosystem_subtype.numerical.tsv")

Var IDs in the PA matrix file for ml.gene looks like: ['COG0659' 'COG0658' 'COG0653' ... 'DA6A4' 'DA6AQ' 'DA6FD']
POS to var ID mapping for ml.gene looks like: {1: 'COG0659', 2: 'COG0658', 3: 'COG0653', 4: 'COG0652', 5: 'COG0651', 6: 'COG0650', 7: 'COG0657', 8: 'COG0656', 9: 'COG0655', 10: 'COG0654', 11: 'COG5817', 12: 'D0UWT', 13: 'D0UZ1', 14: 'D0UZC', 15: 'COG4569', 16: 'COG4568', 17: 'COG4565', 18: 'COG4564', 19: 'COG4567', 20: 'COG4566', 21: 'D0V1K', 22: 'COG2103', 23: 'COG2102', 24: 'COG2105', 25: 'COG2104', 26: 'COG2107', 27: 'COG2109', 28: 'COG2108', 29: 'D0V6U', 30: 'D0VAI', 31: 'D0VAR', 32: 'D0VF6', 33: 'D0VFV', 34: 'COG3179', 35: 'COG3178', 36: 'COG3177', 37: 'COG3176', 38: 'COG3175', 39: 'COG3174', 40: 'COG3173', 41: 'COG3172', 42: 'COG3171', 43: 'COG3170', 44: 'D0VR1', 45: 'D0VVD', 46: 'D0VYF', 47: 'D0W0V', 48: 'D0W1T', 49: 'D0W64', 50: 'D0W97', 51: 'D0WC3', 52: 'D0WD1', 53: 'D0WD4', 54: 'D0WEA', 55: 'D0WEY', 56: 'D0WIC', 57: 'D0WIF', 58: 'COG5557', 59: 'D0WPQ', 60: 'D0WX5'
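
Judging from the printed mapping, positions start at 1 and follow the order of the var IDs in the PA matrix file. A mapping of that shape can be sketched as below; the short `var_ids` list is copied from the output above, and how `compile_results` actually reads the IDs from the file is not shown here.

```python
# First few var IDs as printed above; in practice the full list would be
# read from the PA matrix file.
var_ids = ["COG0659", "COG0658", "COG0653", "COG0652"]

# Positions are 1-based, so enumerate from 1 rather than from 0.
pos_to_var_id = dict(enumerate(var_ids, start=1))
print(pos_to_var_id)
```

An off-by-one error here would silently assign every inferred event to the neighbouring gene family, so it is worth spot-checking a few positions against the file.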

## Count

In [4]:
# compile the results for the count changes in the gene dynamics 
print("\033[1m" + "Compiling count changes in gene dynamics" + "\033[0m")
count_output_dir = "../data/inferences/gene_dynamics/Count/"
compile_results.compile_count_changes(count_dir=count_output_dir,
                                    input_tree_filepath=genome_tree_labeled_path,
                                    compiled_results_dir="../data/compiled_results/",
                                    var_str='gene', var_name_str='nog_id')
print("-----------------------------------")
# compile the results for the count changes in the ecosystem type dynamics
print("\033[1m" + "Compiling count changes in ecosystem type dynamics" + "\033[0m")
count_output_dir = "../data/inferences/ecotype_dynamics/Count/"
compile_results.compile_count_changes(count_dir=count_output_dir,
                                    input_tree_filepath=genome_tree_labeled_path,
                                    compiled_results_dir="../data/compiled_results/",
                                    var_str='ecotype', var_name_str='ecotype')
print("-----------------------------------")
# compile the results for the count changes in the ecosystem subtype dynamics
print("\033[1m" + "Compiling count changes in ecosystem subtype dynamics" + "\033[0m")
count_output_dir = "../data/inferences/ecosubtype_dynamics/Count/"
compile_results.compile_count_changes(count_dir=count_output_dir,
                                    input_tree_filepath=genome_tree_labeled_path,
                                    compiled_results_dir="../data/compiled_results/",
                                    var_str='ecosubtype', var_name_str='ecosubtype')
print("-----------------------------------")
print("All results compiled and saved in ../data/compiled_results/")


Compiling count changes in gene dynamics
Count branchwise transfers:
     branch  transfers
0    109790         24
1      1587         60
2      1604         53
3     33959         29
4      1582         66
..      ...        ...
311     N27        130
312     N17          0
313     N11          0
314      N7          0
315      N3          0

[316 rows x 2 columns]
Count varwise transfers:
       nog_id  transfers
1747  COG0001          0
1745  COG0002          0
1746  COG0003          2
1750  COG0004          0
1751  COG0005          0
...       ...        ...
8192    DA68R          2
8193    DA697          1
8194    DA6A4          0
8195    DA6AQ          3
8196    DA6FD          1

[8197 rows x 2 columns]
Count varwise, branchwise transfers:
          nog_id  branch  transfers  losses  expansions  reductions
48       COG0001     N44          0       2           0           0
57       COG0001   35841          0       0           0           1
64       COG0001  260554        