# Gene and Ecosystem Dynamics Inference


## ALE

ALE requires only the species tree to be rooted, so unrooted gene trees can be used directly.
However, ALE appears to run faster on rooted gene trees, so we first root the gene trees using MAD.

```bash
# from dir: `RESULTS_ALE`
# root the gene trees using MAD
nohup python code/run_MAD_on_EggNOG_parallel.py -i data/filtered/map.nog_tree.filtered.pruned.tsv -m ~/bin/mad/mad -p 100 > data/nohup_mad_gene_tree_rooting.log & disown
# this produces the file `data/filtered/map.nog_tree.filtered.pruned.tsv.rooted`, which we pass to ALE as the gene trees file

# run ALE
# with data/ as working directory. e.g. `cd data/`
nohup python ../code/run_ALE.py --species genome_tree/genome_tree.iqtree.treefile.rooted --gene filtered/map.nog_tree.filtered.pruned.tsv.rooted --output-dir inferences/gene/ALE/ > nohup_run_ALE.out & disown
```
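
As a quick sanity check on the rooting step, the sketch below tests whether a Newick string has a bifurcating root. It relies on the common convention that a tree with two subtrees at the top level is treated as rooted and one with three or more as unrooted, which is a heuristic rather than anything ALE or MAD enforces. The function names are hypothetical, and the parser assumes labels contain no quoted commas or parentheses.

```python
def top_level_children(newick: str) -> int:
    """Count the subtrees directly under the outermost parentheses.

    Assumes plain Newick without quoted labels containing '(', ')' or ','.
    """
    depth = 0
    children = 0
    for ch in newick.strip():
        if ch == "(":
            depth += 1
            if depth == 1:
                children = 1  # the first child of the root starts here
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 1:
            children += 1  # each top-level comma separates two root children
    return children


def looks_rooted(newick: str) -> bool:
    """Heuristic: a bifurcating top level is treated as a rooted tree."""
    return top_level_children(newick) == 2
```

Applying `looks_rooted` to each tree string before and after running MAD gives a rough indication of whether the rooting step changed the input as intended.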

## GLOOME

Create the parameter files for GLOOME. The paths below are relative to the `data/` directory, from which GLOOME is run.

```python
genome_tree_filepath = "genome_tree/genome_tree.iqtree.treefile.rooted"
pa_matrix_nogs_filepath = "filtered/pa_matrix.nogs.binary.fasta"
pa_matrix_ecotype_filepath = "filtered/pa_matrix.ecosystem_type.binary.fasta"
pa_matrix_ecosubtype_filepath = "filtered/pa_matrix.ecosystem_subtype.binary.fasta"
```

```bash
%%bash -s "$genome_tree_filepath" "$pa_matrix_nogs_filepath" "$pa_matrix_ecotype_filepath" "$pa_matrix_ecosubtype_filepath"
cat > ../data/GLOOME_GD.params << EOL
_seqFile $2
_treeFile $1
# use mixture-model
_gainLossDist 1
# include Parsimony results also
_costMatrixGainLossRatio 1
# in this case, character frequencies are not equal across the tree
_isRootFreqEQstationary 0
## Advanced
_logValue 4
# make this dir ahead of time before running GLOOME
_outDir inferences/gene_dynamics/GLOOME/
EOL

# now for ECOTYPE
cat > ../data/GLOOME_ED_Type.params << EOL
_seqFile $3
_treeFile $1
# use mixture-model
_gainLossDist 1
# include Parsimony results also
_costMatrixGainLossRatio 1
# in this case, character frequencies are not equal across the tree
_isRootFreqEQstationary 0
## Advanced
_logValue 4
# make this dir ahead of time before running GLOOME
_outDir inferences/ecotype_dynamics/GLOOME/
EOL

# now for ECOSUBTYPE
cat > ../data/GLOOME_ED_Subtype.params << EOL
_seqFile $4
_treeFile $1
# use mixture-model
_gainLossDist 1
# include Parsimony results also
_costMatrixGainLossRatio 1
# in this case, character frequencies are not equal across the tree
_isRootFreqEQstationary 0
## Advanced
_logValue 4
# make this dir ahead of time before running GLOOME
_outDir inferences/ecosubtype_dynamics/GLOOME/
EOL
```
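
The three parameter files above differ only in `_seqFile` and `_outDir`, so they could also be generated from a single template. The following is a minimal sketch of that idea, not the method used here: the names `PARAM_TEMPLATE`, `RUNS` and `write_param_files` are hypothetical, while the parameter values and paths are copied from the cells above.

```python
from pathlib import Path

# Shared template; only the sequence file and output directory vary per run.
PARAM_TEMPLATE = """\
_seqFile {seq_file}
_treeFile {tree_file}
# use mixture-model
_gainLossDist 1
# include Parsimony results also
_costMatrixGainLossRatio 1
# in this case, character frequencies are not equal across the tree
_isRootFreqEQstationary 0
## Advanced
_logValue 4
# make this dir ahead of time before running GLOOME
_outDir {out_dir}
"""

# param file name -> (presence/absence matrix, GLOOME output directory)
RUNS = {
    "GLOOME_GD.params": (
        "filtered/pa_matrix.nogs.binary.fasta",
        "inferences/gene_dynamics/GLOOME/",
    ),
    "GLOOME_ED_Type.params": (
        "filtered/pa_matrix.ecosystem_type.binary.fasta",
        "inferences/ecotype_dynamics/GLOOME/",
    ),
    "GLOOME_ED_Subtype.params": (
        "filtered/pa_matrix.ecosystem_subtype.binary.fasta",
        "inferences/ecosubtype_dynamics/GLOOME/",
    ),
}


def write_param_files(
    data_dir, tree_file="genome_tree/genome_tree.iqtree.treefile.rooted"
):
    """Write one GLOOME param file per entry in RUNS into data_dir."""
    data_dir = Path(data_dir)
    written = []
    for name, (seq_file, out_dir) in RUNS.items():
        path = data_dir / name
        path.write_text(
            PARAM_TEMPLATE.format(
                seq_file=seq_file, tree_file=tree_file, out_dir=out_dir
            )
        )
        written.append(path)
    return written
```

Calling `write_param_files("../data")` from the notebook's directory would write the same three files as the heredocs, and keeping the shared settings in one place makes it harder for the three runs to drift apart.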

Then run GLOOME with these param files as input, from inside the `data/` directory.


```bash
# from dir: `data/`

# first make the directories to store the results, if not already present
mkdir -p inferences/gene_dynamics/GLOOME inferences/ecotype_dynamics/GLOOME inferences/ecosubtype_dynamics/GLOOME

# run GLOOME
nohup ~/bin/GLOOME.VR01.266 GLOOME_GD.params > nohup_GLOOME_GD.out & disown
nohup ~/bin/GLOOME.VR01.266 GLOOME_ED_Type.params > nohup_GLOOME_ED_Type.out & disown
nohup ~/bin/GLOOME.VR01.266 GLOOME_ED_Subtype.params > nohup_GLOOME_ED_Subtype.out & disown
```

## Count

We follow the same approach for Count. Because Count runs very quickly, we run it in the foreground from inside each output directory rather than in the background with `nohup`.

```bash
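# from dir: `data/`
# first make the directories to store the results, if not already present
mkdir -p inferences/gene_dynamics/Count inferences/ecotype_dynamics/Count inferences/ecosubtype_dynamics/Count
cd inferences/gene_dynamics/Count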
# run Count
java -Xmx2048M -cp ~/bin/Count/Count.jar ca.umontreal.iro.evolution.genecontent.AsymmetricWagner ../../../genome_tree/genome_tree.iqtree.treefile.rooted.labeled ../../../filtered/pa_matrix.nogs.numerical.tsv > Count_output.tsv
# split the output into separate files for each genome tree node's genome size, changes, and families
grep "# PRESENT" Count_output.tsv > Count_genome_sizes.tsv && grep "# CHANGE" Count_output.tsv > Count_changes.tsv && grep "# FAMILY" Count_output.tsv > Count_families.tsv

# similarly for ecotype dynamics
cd ../../ecotype_dynamics/Count
java -Xmx2048M -cp ~/bin/Count/Count.jar ca.umontreal.iro.evolution.genecontent.AsymmetricWagner ../../../genome_tree/genome_tree.iqtree.treefile.rooted.labeled ../../../filtered/pa_matrix.ecosystem_type.numerical.tsv > Count_output.tsv
grep "# PRESENT" Count_output.tsv > Count_genome_sizes.tsv && grep "# CHANGE" Count_output.tsv > Count_changes.tsv && grep "# FAMILY" Count_output.tsv > Count_families.tsv

# similarly for ecosubtype dynamics
cd ../../ecosubtype_dynamics/Count
java -Xmx2048M -cp ~/bin/Count/Count.jar ca.umontreal.iro.evolution.genecontent.AsymmetricWagner ../../../genome_tree/genome_tree.iqtree.treefile.rooted.labeled ../../../filtered/pa_matrix.ecosystem_subtype.numerical.tsv > Count_output.tsv
grep "# PRESENT" Count_output.tsv > Count_genome_sizes.tsv && grep "# CHANGE" Count_output.tsv > Count_changes.tsv && grep "# FAMILY" Count_output.tsv > Count_families.tsv
```
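
The `grep` step can also be done from Python, for example inside the notebook. The sketch below mirrors the three `grep` calls by selecting lines that contain each marker and writing them next to the Count output. The names `MARKERS` and `split_count_output` are hypothetical; the marker strings and output file names are taken from the commands above.

```python
from pathlib import Path

# marker searched for in each line -> file that collects the matching lines
MARKERS = {
    "# PRESENT": "Count_genome_sizes.tsv",
    "# CHANGE": "Count_changes.tsv",
    "# FAMILY": "Count_families.tsv",
}


def split_count_output(output_path):
    """Split a Count output file into one file per marker.

    Like `grep`, a line is selected if it contains the marker anywhere.
    Returns a dict mapping each written file name to its number of lines.
    """
    output_path = Path(output_path)
    lines = output_path.read_text().splitlines(keepends=True)
    counts = {}
    for marker, filename in MARKERS.items():
        selected = [line for line in lines if marker in line]
        (output_path.parent / filename).write_text("".join(selected))
        counts[filename] = len(selected)
    return counts
```

Unlike the `&&` chain, which stops as soon as one `grep` finds no matching lines, this version always writes all three files and reports how many lines each received, so an empty result is easy to spot.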

Note that we run Count's `AsymmetricWagner` parsimony model here. Count also provides a `Posteriors` model for maximum-likelihood (ML) inference, which makes sense for Gene Dynamics (GD) but not for Ecosystem Dynamics (ED).
We chose to run `AsymmetricWagner` for everything, since the comparative study that we performed showed that it infers fewer false-positive changes than `Posteriors`.