# Analyzing Phylogeny Estimation Software on Aligned DNA Sequence Data

## Data
The estimated and true trees are in the `data/` folder of this repository. They were calculated from the data described in <i>Liu et al., "Rapid and Accurate Large-Scale Coestimation of Sequence Alignments and Phylogenetic Trees," Science, vol. 324, no. 5934, pp. 1561-1564, 19 June 2009.</i> You can access the original datasets [here](https://sites.google.com/eng.ucsd.edu/datasets/alignment/sate-i?authuser=0). The trees were computed from this data with the following tools:
- [FastTree](http://www.microbesonline.org/fasttree/#Install): FastTree+GTR, FastTree+JC69
- [FastME](https://gite.lirmm.fr/atgc/FastME/): NJ+LogDet, NJ+JC69, NJ+P-Distances
  
## Loading Required Modules and Functions
For phylogenetic analysis, we will use DendroPy, which has a built-in function for counting false positive and false negative bipartitions between two trees. We will test the module by calculating these counts for two arbitrary trees.

In [1]:
import dendropy
print(dendropy.__version__)

4.20220511.00


In [2]:
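# Share one taxon namespace so labels in both trees map to the same taxa
tns = dendropy.TaxonNamespace()

# Estimated tree (FastTree+GTR) for replicate R0 of dataset 1000M1
tree1 = dendropy.Tree.get(
    path="data/1000M1/R0/fasttree/gtrFastTree.tree",
    schema="newick",
    taxon_namespace=tns
)
# True tree for the same replicate
tree2 = dendropy.Tree.get(
    path="data/1000M1/R0/rose.tt",
    schema="newick",
    taxon_namespace=tns
)

# Encode the bipartitions of each tree so the trees can be compared
tree1.encode_bipartitions()
tree2.encode_bipartitions()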


In [3]:
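from dendropy.calculate import treecompare

# Short alias. The function returns a (false positives, false negatives)
# tuple of counts, treating the first tree as the reference and the
# second as the comparison.
fpnn = treecompare.false_positives_and_negatives

print(fpnn(tree1, tree2))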

(90, 91)
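
To make the two numbers concrete, here is a minimal sketch of the underlying set arithmetic on toy trees. It is not DendroPy's implementation: the helper `fp_fn_counts` and the 5-taxon trees are hypothetical, and each nontrivial bipartition is written by hand as the set of taxa on the side that does not contain `A`.

```python
def fp_fn_counts(reference, comparison):
    """Return (false positives, false negatives) for two sets of bipartitions.

    False positives are bipartitions found only in the comparison tree.
    False negatives are bipartitions found only in the reference tree.
    """
    false_positives = len(comparison - reference)
    false_negatives = len(reference - comparison)
    return false_positives, false_negatives


# Hypothetical toy trees on taxa A-E (not taken from the datasets).
# Reference tree ((A,B),C,(D,E)) has bipartitions AB|CDE and ABC|DE.
reference = {frozenset("CDE"), frozenset("DE")}
# Comparison tree ((A,C),B,(D,E)) has bipartitions AC|BDE and ABC|DE.
comparison = {frozenset("BDE"), frozenset("DE")}
# Partially unresolved tree (A,B,C,(D,E)) has only ABC|DE.
unresolved = {frozenset("DE")}

print(fp_fn_counts(reference, comparison))
print(fp_fn_counts(reference, unresolved))
```

The second call shows why the two counts can differ: a tree with fewer resolved edges can miss bipartitions without adding any incorrect ones.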


## Calculating Error Rates
The following snippets loop through each estimated tree and build a dictionary containing the average FP and FN counts across the five replicates (R0-R4) for each dataset and method.

In [4]:
err = {
    "1000M1": {
        "nj_logdet": [],
        "nj_jc": [],
        "nj_pdist": [],
        "ft_gtr": [],
        "ft_jc": []
    },
    "1000M4": {
        "nj_logdet": [],
        "nj_jc": [],
        "nj_pdist": [],
        "ft_gtr": [],
        "ft_jc": []
    }
}

In [5]:
from os import listdir

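# Loop through each dataset tracked in the error dictionary and each of its
# replicates; iterating over err skips stray entries in data/ (e.g. hidden files)
for f in err:
    for g in sorted(listdir(f"data/{f}")):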
        # Load the true tree
        true_tree = dendropy.Tree.get(
            path=f"data/{f}/{g}/rose.tt", 
            schema="newick",
            taxon_namespace=tns
        )
        
        # Load trees obtained from each method
        nj_logdet_tree = dendropy.Tree.get(
            path=f"data/{f}/{g}/nj_logdet/rose.aln.true.phylip_fastme_tree.txt", 
            schema="newick",
            taxon_namespace=tns
        )
        
        nj_jc_tree = dendropy.Tree.get(
            path=f"data/{f}/{g}/nj_jc/rose.aln.true.phylip_fastme_tree.txt", 
            schema="newick",
            taxon_namespace=tns
        )
        
        nj_pdist_tree = dendropy.Tree.get(
            path=f"data/{f}/{g}/nj_pdist/rose.aln.true.phylip_fastme_tree.txt", 
            schema="newick",
            taxon_namespace=tns
        )
        
        ft_gtr_tree = dendropy.Tree.get(
            path=f"data/{f}/{g}/fasttree/gtrFastTree.tree", 
            schema="newick",
            taxon_namespace=tns
        )
        
        ft_jc_tree = dendropy.Tree.get(
            path=f"data/{f}/{g}/fasttree/jcFastTree.tree", 
            schema="newick",
            taxon_namespace=tns
        )
        
        # Gather bipartitions for FP/FN calculation
        true_tree.encode_bipartitions()
        nj_logdet_tree.encode_bipartitions()
        nj_jc_tree.encode_bipartitions()
        nj_pdist_tree.encode_bipartitions()
        ft_gtr_tree.encode_bipartitions()
        ft_jc_tree.encode_bipartitions()
        
        # Add results to arrays stored in error dictionary
        err[f]["nj_logdet"].append(fpnn(true_tree, nj_logdet_tree))
        err[f]["nj_jc"].append(fpnn(true_tree, nj_jc_tree))
        err[f]["nj_pdist"].append(fpnn(true_tree, nj_pdist_tree))
        err[f]["ft_gtr"].append(fpnn(true_tree, ft_gtr_tree))
        err[f]["ft_jc"].append(fpnn(true_tree, ft_jc_tree))
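
The loop above repeats the same loading call once per method. One possible way to cut the repetition is to keep the relative tree paths in a single mapping and build the full paths from it. The sketch below covers only the path handling; the names `METHOD_TREE_FILES`, `TRUE_TREE_FILE`, and `tree_paths` are hypothetical, while the file names are copied from the loop.

```python
from pathlib import Path

# Relative tree file for each method, copied from the loop above.
METHOD_TREE_FILES = {
    "nj_logdet": "nj_logdet/rose.aln.true.phylip_fastme_tree.txt",
    "nj_jc": "nj_jc/rose.aln.true.phylip_fastme_tree.txt",
    "nj_pdist": "nj_pdist/rose.aln.true.phylip_fastme_tree.txt",
    "ft_gtr": "fasttree/gtrFastTree.tree",
    "ft_jc": "fasttree/jcFastTree.tree",
}
TRUE_TREE_FILE = "rose.tt"


def tree_paths(dataset, replicate, root="data"):
    """Map each method name (plus "true") to its tree file for one replicate."""
    base = Path(root) / dataset / replicate
    paths = {method: base / rel for method, rel in METHOD_TREE_FILES.items()}
    paths["true"] = base / TRUE_TREE_FILE
    return paths


paths = tree_paths("1000M1", "R0")
print(paths["ft_gtr"].as_posix())
```

Each value can then be passed as `path=str(p)` to `dendropy.Tree.get`, so a single inner loop over `METHOD_TREE_FILES` could replace the five near-identical blocks.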

In [9]:
def pairwise_mean(a):
    ''' Return a tuple of columnwise averages
    
    a:  array
        Array of tuples
    '''
    fir = sum([e[0] for e in a])
    sec = sum([e[1] for e in a])
    
    return (fir/len(a), sec/len(a))

# Take columnwise averages of arrays in the error dictionary
for key in err:
    for subkey in err[key]:
        err[key][subkey] = pairwise_mean(err[key][subkey])
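
As a quick check of what the averaging does, the same columnwise mean can be sketched with `zip` and `statistics.fmean` from the standard library. The function name `columnwise_mean` and the sample pairs are hypothetical; this is an alternative formulation, not the code used to produce the results in this notebook.

```python
from statistics import fmean


def columnwise_mean(pairs):
    """Return the mean of each column in a list of (FP, FN) tuples."""
    if not pairs:
        raise ValueError("need at least one (FP, FN) pair")
    # zip(*pairs) regroups the tuples into one sequence per column
    return tuple(fmean(column) for column in zip(*pairs))


# Hypothetical counts for two replicates
print(columnwise_mean([(90, 91), (100, 95)]))  # (95.0, 93.0)
```

Raising on an empty list makes a missing replicate directory visible instead of surfacing later as a `ZeroDivisionError`.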

## Visualization
We will use pandas to tabulate the average (FP, FN) counts for each dataset and method.

In [11]:
from pandas import DataFrame

DataFrame.from_dict(err)

| Method | 1000M1 | 1000M4 |
|---|---|---|
| nj_logdet | (211.6, 207.4) | (110.6, 81.6) |
| nj_jc | (224.2, 220.0) | (104.6, 75.6) |
| nj_pdist | (191.6, 187.4) | (122.6, 93.6) |
| ft_gtr | (109.0, 104.8) | (74.2, 45.2) |
| ft_jc | (128.4, 124.2) | (80.6, 51.6) |
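
Because each cell in the table above holds a tuple, sorting or comparing methods is awkward. One option is to flatten the nested dictionary into one record per dataset and method, with separate FP and FN fields. The sketch below uses the averages from the table; the names `mean_err`, `to_records`, and `best_method` are hypothetical.

```python
# Averages copied from the table above: (mean FP, mean FN)
mean_err = {
    "1000M1": {
        "nj_logdet": (211.6, 207.4),
        "nj_jc": (224.2, 220.0),
        "nj_pdist": (191.6, 187.4),
        "ft_gtr": (109.0, 104.8),
        "ft_jc": (128.4, 124.2),
    },
    "1000M4": {
        "nj_logdet": (110.6, 81.6),
        "nj_jc": (104.6, 75.6),
        "nj_pdist": (122.6, 93.6),
        "ft_gtr": (74.2, 45.2),
        "ft_jc": (80.6, 51.6),
    },
}


def to_records(err):
    """Flatten {dataset: {method: (fp, fn)}} into a list of flat dicts."""
    return [
        {"dataset": dataset, "method": method, "mean_fp": fp, "mean_fn": fn}
        for dataset, methods in err.items()
        for method, (fp, fn) in methods.items()
    ]


def best_method(records, dataset):
    """Return the method with the lowest mean FP + mean FN for one dataset."""
    rows = [r for r in records if r["dataset"] == dataset]
    return min(rows, key=lambda r: r["mean_fp"] + r["mean_fn"])["method"]


records = to_records(mean_err)
for dataset in mean_err:
    print(dataset, best_method(records, dataset))
```

The resulting list can be passed directly to `DataFrame(records)` to get one numeric column per error type.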
