
AncFlow: A Snakemake-based pipeline for the ancestral sequence reconstruction of clustered phylogenetic subtrees.


AncFlow

AncFlow is a comprehensive Snakemake-driven pipeline that performs ancestral sequence reconstruction on the immediate parent nodes of AutoPhy-clustered phylogenetic trees. AncFlow takes protein sequence data as input, which undergo multiple sequence alignment via MAFFT, phylogenetic tree inference via IQ-TREE, and monophyletic clustering via AutoPhy. Size-defined cladistic subtrees of immediate parent nodes are then extracted from the clustered phylogenetic tree, and their ancestral sequences are reconstructed via bnkit.

Installation for Mac/Linux

  1. Clone the repository:
git clone https://github.com/rrouz/AncFlow.git
  2. Enter the directory:
cd AncFlow
  3. Create the conda environment:
conda env create -f environment.yml
  4. Activate the conda environment:
conda activate ancflow
  5. Perform a dry run:
snakemake -n
  6. Run the pipeline, where -j sets the number of cores (1-4):
snakemake -j2

Installation for Windows

For Windows users, we recommend using Windows Subsystem for Linux (WSL) together with Miniconda.

  1. Install WSL by following the official Microsoft documentation: Install WSL

  2. Install Miniconda within your WSL environment: Install Miniconda

  3. Follow the installation steps for Mac/Linux

Installation Troubleshooting

If you encounter difficulties or error messages during pipeline setup, we recommend using Miniconda with conda version 24.1.2 or newer. After installing the newer conda version, completely remove your existing AncFlow conda environment and create a fresh one.

Basic Usage

Workflow

Getting Started:

Create a FASTA file containing at least 50 intentionally curated protein sequences. For a seamless run, use Swiss-Prot sequences of protein families found in the Pfam or UniProt databases. It is critical that the sequence headers follow the pipe-delimited format below:

>sp|P45996.1|OMP53_HAEIF RecName: Full=Outer membrane protein P5; Short=OMP P5; AltName: Full=Fimbrin; AltName: Full=Outer membrane porin A; AltName: Full=Outer membrane protein A; Flags: Precursor

Database Source: Swiss-Prot (sp)
Accession Number: Sequence identifier (P45996.1)
Identifier: Often the protein name or shorthand (OMP53_HAEIF)

  • Using custom sequences is possible; however, you may encounter parsing errors that require manual correction of the sequence headers.
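
Before starting a run, it can help to check your FASTA headers against the format above. The following is a minimal Python sketch for that purpose; it is not part of AncFlow, and the names `FastaHeader`, `parse_header`, and `check_fasta` are hypothetical. It assumes the header consists of three pipe-delimited fields followed by an optional free-text description, as in the example, and that the 50-sequence minimum is a simple count of header lines.

```python
from typing import Iterable, List, NamedTuple


class FastaHeader(NamedTuple):
    database: str      # e.g. "sp" for Swiss-Prot
    accession: str     # e.g. "P45996.1"
    identifier: str    # e.g. "OMP53_HAEIF"
    description: str   # free text after the identifier (may be empty)


def parse_header(line: str) -> FastaHeader:
    """Split a '>db|accession|identifier description' header into its fields."""
    if not line.startswith(">"):
        raise ValueError(f"not a FASTA header: {line!r}")
    parts = line[1:].strip().split("|", 2)
    if len(parts) != 3 or not all(p.strip() for p in parts):
        raise ValueError(f"expected three pipe-delimited fields: {line!r}")
    database, accession, rest = parts
    # The identifier ends at the first space; the remainder is the description.
    identifier, _, description = rest.strip().partition(" ")
    return FastaHeader(database.strip(), accession.strip(),
                       identifier, description.strip())


def check_fasta(lines: Iterable[str], min_sequences: int = 50) -> List[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            try:
                parse_header(line)
            except ValueError as err:
                problems.append(str(err))
    if count < min_sequences:
        problems.append(
            f"only {count} sequences; at least {min_sequences} are recommended"
        )
    return problems
```

To use it, open your FASTA file and pass the file object to `check_fasta`; any returned messages point at headers that may need manual correction.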

Interpreting AutoPhy:

After IQ-TREE has inferred the phylogeny of your sequences, AutoPhy will attempt to cluster them and resolve novel subfamilies. Upon successful completion, AutoPhy creates an output directory containing the monophyletically clustered trees. At this point, we recommend inspecting the AutoPhy outputs for clades of interest and noting their sizes, because you will be prompted for the minimum retained clade size (typically a number larger than 2) used for downstream subtree extraction.

Acyltransferase family tree colored and clustered by AutoPhy: Acyltransferase Sample Tree

For example, clades 11.0 and 31 are of particular interest for downstream ancestral sequence reconstruction. To capture both, set the minimum clade size at the prompt to a value less than or equal to 5. The colored nodes depict the ancestral sequences of these target clades, which were later derived by the pipeline and used to predict the superimposed structures below.

Target Clade Predicted Ancestral Structures
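
The effect of the minimum clade size can be illustrated with a short Python sketch. This is not AncFlow or AutoPhy code: the function names, the mapping from sequence identifier to clade label, and the member counts in `example_assignments` are all hypothetical, and the sketch assumes a clade is retained when its size is greater than or equal to the threshold.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_clade(assignments: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each clade label to the sequence identifiers assigned to it."""
    clades = defaultdict(list)
    for seq_id, clade in assignments.items():
        clades[clade].append(seq_id)
    return dict(clades)


def retained_clades(assignments: Dict[str, str], min_size: int) -> Dict[str, List[str]]:
    """Keep only clades with at least `min_size` members (assumed rule)."""
    return {
        clade: members
        for clade, members in group_by_clade(assignments).items()
        if len(members) >= min_size
    }


def largest_usable_threshold(assignments: Dict[str, str], targets: Iterable[str]) -> int:
    """Largest minimum clade size that still retains every target clade."""
    clades = group_by_clade(assignments)
    return min(len(clades[target]) for target in targets)


# Hypothetical assignments: clade "11.0" has 5 members, "31" has 7, "4" has 2.
example_assignments = {f"seqA{i}": "11.0" for i in range(5)}
example_assignments.update({f"seqB{i}": "31" for i in range(7)})
example_assignments.update({f"seqC{i}": "4" for i in range(2)})

threshold = largest_usable_threshold(example_assignments, ["11.0", "31"])
kept = retained_clades(example_assignments, threshold)
```

With these made-up counts, any threshold above the size of the smallest target clade would drop that clade, which is why the threshold is chosen from the smallest clade you want to keep.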

Ancestral Sequence Reconstruction:

The extracted subtree and its corresponding MSA are used as input for GRASP via bnkit. GRASP employs statistical models and maximum likelihood approaches to infer the most likely ancestral sequences at the internal nodes of the subtree.

AncFlow and Protein Structure Prediction

Ancestral sequence reconstructions can be used by protein structure prediction tools, such as AlphaFold2, to approximate the tertiary structures of target nodes from their derived sequences.
