## Demonstration with supertree construction

We created 100 input sets, each consisting of 30 phylogenetic trees organized into 10 overlapping subsets, with taxon overlap levels gradually increasing from 10% to 90%. Each tree includes 30 taxa, selected from a total of 55 unique species, and was assembled using the proposed pipeline.

We applied five established supertree construction methods to these input sets: three implemented in the CLANN software package (the split fit algorithm, the most similar supertree algorithm, and the average supertree approach based on neighbor joining), together with the majority-rule and spectral clustering methods.

For each method, we measured the success rate of supertree construction, the average number of taxa in the output trees, and the average Robinson-Foulds (RF) distance between each supertree and its corresponding 30 input trees.
The complete set of input trees and resulting supertrees used in this demonstration is available in the GitHub repository associated with this study.
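
As a reminder of what the RF metric counts, the following is a minimal, library-free sketch. It assumes unrooted trees written as nested tuples of leaf labels, and the helper names (`leaf_set`, `splits`, `rf_distance`) are illustrative only; the analysis itself relies on DendroPy's `symmetric_difference`, whose handling of edge cases may differ from this toy version.

```python
def leaf_set(tree):
    """Return the leaf labels found under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions (splits) induced by the tree's edges."""
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)  # fixed leaf used to name each split by one side
    found = set()

    def visit(node):
        below = leaf_set(node)
        # Always record the side that does NOT contain the reference leaf
        side = all_leaves - below if reference in below else below
        # Keep only splits with at least two leaves on each side
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


# Two hypothetical five-taxon trees that disagree on the placement of B and C
tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(tree_a, tree_b))  # 2
```

Each tree contributes one split that the other lacks, so the distance is 2; identical trees give 0.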


The success rate reflects the proportion of input sets (out of 100) for which the supertree was successfully parsed, included all expected taxa, and could be compared to the corresponding input trees without errors.
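
The three success conditions can be sketched on plain label sets as follows. This is an illustration under stated assumptions rather than the code used in the study: `is_successful` and `success_rate` are hypothetical helpers, a failed parse is represented by `None`, and the expected taxa are assumed to be the union of the taxa found in the input trees of a set.

```python
def is_successful(supertree_labels, input_label_sets, compared_ok=True):
    """Return True when a supertree meets all three success conditions."""
    if supertree_labels is None:  # condition 1: the supertree could not be parsed
        return False
    # Condition 2: the supertree holds every taxon seen in its input trees
    expected = set().union(*input_label_sets)
    if set(supertree_labels) != expected:
        return False
    # Condition 3: the comparison with the input trees raised no error
    return compared_ok


def success_rate(outcomes):
    """Percentage of input sets whose supertree was successful."""
    return 100.0 * sum(outcomes) / len(outcomes)


# Toy example: four input sets, one of which yields a supertree missing taxon C
inputs = [{"A", "B"}, {"B", "C"}]
outcomes = [
    is_successful({"A", "B", "C"}, inputs),
    is_successful({"A", "B", "C"}, inputs),
    is_successful({"A", "B"}, inputs),
    is_successful({"A", "B", "C"}, inputs),
]
print(success_rate(outcomes))  # 75.0
```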

In [1]:
!pip install dendropy pandas

Collecting dendropy
  Downloading DendroPy-5.0.8-py3-none-any.whl.metadata (6.1 kB)
Downloading DendroPy-5.0.8-py3-none-any.whl (465 kB)
Installing collected packages: dendropy
Successfully installed dendropy-5.0.8


In [2]:
import os
import statistics
from dendropy import Tree, TreeList
from dendropy.calculate.treecompare import symmetric_difference
import pandas as pd

# Path to the input tree sets
input_folder = "input_multisets"

# Paths to 5 supertree files
method_files = {
    "Split Fit": "supertrees_sfit.txt",
    "Most Similar": "supertrees_dfit.txt",
    "Average NJ": "supertrees_nj.txt",
    "Majority Rule": "supertrees_mrplus.txt",
    "Spectral Clustering": "supertrees_scs.txt"
}

# Load input multisets into a list of TreeList objects
input_trees_per_set = []
for i in range(1, 101):
    file_path = os.path.join(input_folder, f"multiset_{i}.txt")
    try:
        trees = TreeList.get(path=file_path, schema="newick", preserve_underscores=True)
        input_trees_per_set.append(trees)
    except Exception as e:
        print(f"Error reading {file_path}: {e}")
        input_trees_per_set.append([])

# Analyze supertrees
results = []

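for method_name, supertree_file in method_files.items():
    # Each line holds one Newick supertree; line i belongs to input set i
    with open(supertree_file, 'r') as f:
        lines = f.readlines()

    success_count = 0
    taxon_counts = []
    rf_distances = []

    for idx, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue
        try:
            supertree = Tree.get(data=line, schema="newick", preserve_underscores=True)
            input_trees = input_trees_per_set[idx]
            if not input_trees:
                continue

            # The supertree must contain every taxon seen in its input set
            expected = {taxon.label for taxon in input_trees.taxon_namespace}
            observed = {leaf.taxon.label for leaf in supertree.leaf_node_iter()}
            if observed != expected:
                print(f"[{method_name}] Taxon mismatch on line {idx + 1}")
                continue

            distances = []
            for input_tree in input_trees:
                # Both trees must share a taxon namespace before they can be compared
                input_tree.migrate_taxon_namespace(supertree.taxon_namespace)
                rf = symmetric_difference(supertree, input_tree)
                distances.append(rf)
            avg_rf = sum(distances) / len(distances)

            # Count a success only once parsing, taxon check, and comparison have all passed
            success_count += 1
            taxon_counts.append(len(observed))
            rf_distances.append(avg_rf)
        except Exception as e:
            print(f"[{method_name}] Error on line {idx + 1}: {e}")
            continue

    avg_taxa = statistics.mean(taxon_counts) if taxon_counts else 0
    avg_rf = statistics.mean(rf_distances) if rf_distances else 0
    # The success rate is taken over all input sets, not over the lines present in the file
    total = len(input_trees_per_set)

    results.append({
        "Method": method_name,
        "Success rate": f"{100.0 * success_count / total:.1f}%",
        "Average taxa in output": f"{avg_taxa:.1f}",
        "Average RF distance": f"{avg_rf:.1f}"
    })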

# Output results
df = pd.DataFrame(results)
print(df.to_string(index=False))

             Method Success rate Average taxa in output Average RF distance
          Split Fit       100.0%                   55.0                94.6
       Most Similar       100.0%                   55.0                95.8
         Average NJ       100.0%                   55.0                94.7
      Majority Rule       100.0%                   55.0                82.8
Spectral Clustering       100.0%                   55.0                85.9


Note that the RF distances are raw (unnormalized) symmetric-difference counts, averaged first over the 30 input trees of each set and then over all input sets. Because the input trees were built from partially overlapping taxon sets, each covering only 30 of the 55 species present in the supertree, distances in the observed range (roughly 83 to 96) are expected for this type of dataset.
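
The notebook code compares each supertree with its input trees as they are, without first reducing them to a common leaf set. A commonly used alternative, shown here only as a hedged sketch and not as the procedure used in this study, is to restrict the supertree to the taxa of each input tree before comparing, and optionally to normalize by the maximum distance for fully resolved unrooted trees, 2(n - 3) for n shared taxa. The nested-tuple representation and all helper names below are assumptions made for illustration, and every input taxon is assumed to occur in the supertree.

```python
def leaves(tree):
    """Return the leaf labels found under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def prune(tree, keep):
    """Restrict a tree to the leaves in `keep`, collapsing single-child nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [sub for sub in (prune(child, keep) for child in tree) if sub is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)


def splits(tree):
    """Collect the non-trivial bipartitions of an unrooted tree."""
    everything = leaves(tree)
    reference = min(everything)
    found = set()

    def visit(node):
        below = leaves(node)
        side = everything - below if reference in below else below
        if 1 < len(side) < len(everything) - 1:
            found.add(side)
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return found


def restricted_rf(supertree, input_tree):
    """RF distance after pruning the supertree to the input tree's taxa."""
    pruned = prune(supertree, leaves(input_tree))
    return len(splits(pruned) ^ splits(input_tree))


def normalized_rf(supertree, input_tree):
    """Restricted RF divided by 2(n - 3), the maximum for binary unrooted trees."""
    n = len(leaves(input_tree))
    return restricted_rf(supertree, input_tree) / (2 * (n - 3))


# Hypothetical seven-taxon supertree and two five-taxon input trees
supertree = ((("A", "B"), ("C", "F")), (("D", "G"), "E"))
input_same = ((("A", "B"), "C"), ("D", "E"))
input_diff = ((("A", "C"), "B"), ("D", "E"))

print(restricted_rf(supertree, input_same))  # 0
print(restricted_rf(supertree, input_diff))  # 2
print(normalized_rf(supertree, input_diff))  # 0.5
```

In this toy case the first input tree agrees with the supertree once the extra taxa F and G are removed, which a comparison on unequal leaf sets would not show.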