Work in progress - please contact ejmctavish@ucmerced.edu with any issues or questions.
Extensiphy is a tool for updating an existing multiple sequence alignment Field et al. 2022. Extensiphy works by taking an input alignment, and assembling homologous loci from raw short read data.
Say you built a sequence alignment and phylogeny for a group of bacteria during an outbreak and then received some new sequencing data that you wish to quickly incorporate into the phylogeny. Extensiphy makes it convenient to do this while also ensuring you can easily do this again any time you acquire new data.
You can use Extensiphy in multiple ways. Depending on the input and output you wish to receive, you will need different data in different formats.
In this tutorial we will update a Neisseria gonhorrhea core genome alignemnt with newly sequences short read data.
You need:
- Alignment file (fasta format)
- Directory of paired-end read files
- (Optional) a phylogeny generated from the input alignment
Copy the code and data to your home directory
cp -r moledata/extensiphy/ ./
All of the dependencies are already installed on the virtual machine- but this requires another version of raxml so we will install it using apt-get
sudo apt-get install raxml
Check that the install is working by running:
cd extensiphy
./extensiphy.sh -a ./testdata/combo.fas -d ./testdata -1 _R1.fq -2 _R2.fq -u PHYLO -o EP_output
Extensiphy takes command line arguments to update a sequence alignment with new taxa sequences. Lets look at the options used by Extensiphy. Extensiphy use revolves around calling extensiphy.sh followed by flags (dashes next to a letter corresponding the option you wish to use or input). Calling the help menu with the following command:
./extensiphy.sh -h
has the following output:
Extensiphy is a program for quickly adding genomic sequence data to multiple sequence alignments and phylogenies.
View the README for more specific information.
Inputs are generally a multiple sequence file in fasta format and a directory of
Fastq paired-end read sequences.
EXAMPLE COMMAND:
/path/to/extensiphy.sh -u ALIGN -a /path/to/alignment_file -d /path/to/directory_of_reads [any other options]
REQUIRED FLAGS
(-a) alignment in fasta format,
(-d) directory of paired end fastq read files for all query taxa,
(-u) produce only an updated alignment or perform full phylogenetic estimation (ALIGN or PHYLO) (DEFAULT: ALIGN)
OPTIONAL FLAGS
(-t) tree in Newick format produced from the input alignment that you wish to update with new sequences or specify NONE to perform new inference (DEFAULT: NONE),
(-m) alignment type (SINGLE_LOCUS_FILES, PARSNP_XMFA or CONCAT_MSA) (DEFAULT: CONCAT_MSA),
(-o) directory name to hold results (DEFAULT: creates EP_output),
(-i) clean up intermediate output files to save HD space (Options: CLEAN, KEEP)(DEFAULT: KEEP),
(-r) Select a reference sequence from the alignment file for read mapping or leave as default and the first sequence in the alignment used (DEFAULT: RANDOM),
(-p) number of taxa to process in parallel,
(-c) number of threads per taxon being processed,
(-e) set read-type as single end (SE) or pair-end (PE) (DEFAULT: PE)
(-1, -2) suffix (ex: R1.fastq or R2.fastq) for both sets of paired end files (DEFAULTS: R1.fq and R2.fq),
(-g) output format (CONCAT_MSA or SINGLE_LOCUS_FILES) (DEFAULT: CONCAT_MSA),
(-s) specify the suffix (.fa, .fasta, etc) (DEFAULT: .fasta),
(-b) bootstrapping tree ON or OFF (DEFAULT: OFF)
if using single locus MSA files as input,
(-f) csv file name to keep track of individual loci when concatenated (DEFAULT: loci_positions.csv),
(-n) Set size of loci size cutoff used as input or output (Options: int number)
Extensiphy has a number of default settings for these so you will not always have to explicitly use all of these options for every run. The use of these flags depends on the input you wish to use and the output you desire to have at the end of a run.
We will try updating our existing alignment with three newly published sequences of gonorrhea.
The SRR19310037 and SRR19310038 were just added to NCBI's sequence read archive (SRA) on May 27, 2022! SRR19127720 is from early May.
We can get more information about where these lineages were sequences, and if they cary antimicrobial resistance (AMR) genes using the data in NCBI's Pathogen Database. (See the "AMR Genotype column")
- https://www.ncbi.nlm.nih.gov/pathogens/isolates/#SRR19310037
- https://www.ncbi.nlm.nih.gov/pathogens/isolates/#SRR19310038
- https://www.ncbi.nlm.nih.gov/pathogens/isolates/#SRR19127720
Q Do any of these lineages have known anti-microbial resistance genes?
Copy the data to your extensiphy directory by running:
The new sequence data and alignment to update are in the folder extensiphy/Neisseria_demo
cd Neisseria_demo
(If you are are not running this on the MOLE virtual machines, you can download the data from tar ball link and extract it)
If you wanted to download the sequences directly from SRA you could use Don't run these at the workshop (they wont work anyways unless you have installed SRA toolkit)
fastq-dump --split-files SRR19310037
fastq-dump --split-files SRR19310038
fastq-dump --split-files SRR19127720
NCBI places these sequences in a SNP tree, but that does not incorporate any uncertainty, or a full phylogenetic analysis. https://www.ncbi.nlm.nih.gov/pathogens/tree/#Neisseria/PDG000000032.278/PDS000104772.5?accessions=PDT001307203.3
We will add these sequences to an existing core genome alignment generated using ParSNP, and Extensiphy, published Field et al 2022 and the alignment stored in Dryad data deposit
(I have subsampled the alignment from 1200 lineages to 30, and cut the sequences down to 500K BP to make inference faster for this demo).
../extensiphy.sh -a neisseria_aln.fas -1 _1.fastq -2 _2.fastq -d neisseria_reads/ -u PHYLO -i CLEAN -o EP_demo_1
(This may take 5-10 min.)
The ML tree file will be in the file EP_demo_1/RESULTS/RAxML_bestTree.consensusFULL
Transfer it to your computer, and open it in figtree Root it with "ERR2525602" as an outgroup.
By default, EP uses RaxmlHPC to estimate ML phylogenies - but the updated alignemnet is saved as output_dir/RESULTS/extended.aln, which you can use to estimate a phylogeny using any method.
The new taxa we have added were sampled in the last month - whereas the existing tips are from 2019 or earlier.
Q Are our new sequences (SRR19310037, SRR19310038, and SRR19127720) closely related in the ML tree?
You can have EP automatically bootstrap the tree as well by adding '-b ON' to the arguments.
Bootstrapping takes a little while - so I have run it for you, and put the output files in 'EP_demo/bootstrap_results/'
The majority rule consensus tree is in the file bootstrap_results/RAxML_bipartitions.majority_rule_bootstrap_consensus
Q Are our new sequences (SRR19310037, SRR19310038, and SRR19127720) closely related in the bootstrap consensus tree?
Many of the common antimicrobial resistance (AMR) genes are horizontally transferred, so close relatives can have different AMR genes and phenotypes. Look up the two closest relatives of the new lineage with known AMR genes in https://www.ncbi.nlm.nih.gov/pathogens/
Q Do these three lineages share the same resistance genes?
The reference that you choose can affect your consensus sequence calling, and therefore your phylogenetic inference. Lets try assembling these new taxa, but using the outgroup as a reference instead.
../extensiphy.sh -a neisseria_aln.fas -1 _1.fastq -2 _2.fastq -d neisseria_reads/ -u ALIGN -r ERR2525602 -i CLEAN -o EP_demo_alternate_ref
The consensus aligned sequence for each run is saved in output_dir/combine_and_infer/seqname_align.fas (Where 'seqname' is the filename stub of the reads).
In the Neisseria_demo folder is a really simple script that counts the differences between aligned sequences in fastas files.
python diff_counter.py <path_to_seq1> <path_to_seq2>
Q Does changing the reference taxon change the sequences?
We can then combine the previous extended alignemnt with these new, slightly different consensus sequence estimates.
Open each of your newly estimated sequences in a text editor (e.g. nano), and rename them to reflect that you used an alternate reference.
e.g. change ">SRR19310037" to ">SRR19310037_alt"
Because they are already aligned, we can just concateneate them to form an alignment that includes both our original consesuses for these taxa, and these new inferences.
cat EP_demo_1/RESULTS/extended.aln EP_demo_alternate_ref/combine_and_infer/SRR19310037_align.fas EP_demo_alternate_ref/combine_and_infer/SRR19310038_align.fas EP_demo_alternate_ref/combine_and_infer/SRR19127720_align.fas > combined_refs.fas
We can then estimate a tree on this updated alignment - e.g. using RAxML (or any other phylogenetic inference software)
raxmlHPC -s combined_refs.fas -m GTRGAMMA -p 1 -n compare_references
Your ML tree will be saved as RAxML_bestTree.compare_references. Take a look at it in figtree.
Q Does changing the reference taxon change the phylogeny?
We also have a python package for updating trees by searching GenBank for homologous loci.
Check it out on GitHub Physcraper, Sanchez-Reyes 2021