This repository contains the data and scripts needed to reproduce the analyses and figures in the manuscript "Inferring phenotypic trait evolution on large trees with many incomplete measurements" by [temporarily blinded for review].
The instructions below should be sufficient to perform all analyses on OSX, Windows, and Linux, but please raise an issue on this GitHub repository if you encounter any problems.
- beast
- data
- ele_1307_sm_sa1.tre - Mammalian phylogeny from Fritz et al. (2009). Downloaded from here.
- hiv_dates.csv - Sampling dates associated with HIV data.
- hiv_newick.txt - Newick HIV tree. Tree originated from ML tree in Blanquart et al. (2017) (processing described in manuscript).
- hiv_processed_data.csv - Processed HIV data used in analysis (processing described in manuscript).
- mammals_log_data.csv - CSV file that stores logged mammals data. Missing values are represented by NaN. File was created by running the Julia script ./scripts/mammals_data.jl.
- mammals_newick.txt - The mammalST_MSW05_bestDates tree manually copied from ele_1307_sm_sa1.tre.
- mammals_trimmed_newick.txt - Mammals newick file with taxa not present in the data set removed. File was created by running the Julia script ./scripts/mammals_data.jl.
- PanTHERIA_1-0_WR05_Aug2008.txt - Mammalian phenotype data from the PanTHERIA database. Downloaded from here.
- prokaryotes_newick.txt - Obtained by running ./xml/prokaryotes.xml in BEAST, resampling the resulting tree log file every 100,000 trees with the BEAST software LogCombiner to reduce the file size, determining the maximum clade credibility tree with TreeAnnotator, and exporting a newick file from FigTree. This newick had a single negative branch length, which was manually edited to 0.0 with the child branch lengths shortened accordingly.
- prokaryotes_processed_data.csv - Processed prokaryote data used in analysis (processing described in manuscript).
- logs
- hiv_prediction - directory for storing BEAST log files in HIV posterior predictive power analysis.
- PCMBase_timing - directory for storing results of speed comparison between PCMBaseCpp and BEAST.
- simulation_study - directory for storing BEAST log files from simulation studies.
- timing - directory for storing BEAST log files and timing files for comparing computational efficiency with other Bayesian inference regimes.
- hiv.log - BEAST log file from running ./xml/hiv.xml in BEAST.
- mammals.log - BEAST log file from running ./xml/mammals.xml in BEAST.
- prokaryotes.log - BEAST log file from running ./xml/prokaryotes.xml in BEAST. The log file present in this repository has been subsampled from the original to meet GitHub file size limits.
- scripts
- plots
- HIV_MSE.R - Generates Figure 7.
- bacteriaCorrelation.csv - Stores values for generating Figure 5. Created by running the Julia script ./scripts/plots/bacteria_correlation.jl.
- bacteriaCorrelationLabels.csv - Stores the order of the traits in Figure 5. Created by running the Julia script ./scripts/plots/bacteria_correlation.jl.
- bacteria_correlation.jl - Processes ./logs/prokaryotes.log and prepares csv files for Figure 5.
- bacteria_plot.r - Generates Figure 5.
- correlation_plot.r - Functions for making Figures 3 and 5.
- correlation_prep.jl - Functions for processing BEAST log files into correlation csv files.
- custom_boxplot.R - Functions for customized box plots used by ./scripts/plots/HIV_MSE.R. Modified from here.
- hiv_prediction.csv - Stored values for producing Figure 7. Created by ./scripts/hiv_prediction_analysis.jl.
- mammalsCorrelation.csv - Stored values for producing Figure 3. Created by ./scripts/plots/mammals_correlation.jl.
- mammalsCorrelationLabels.csv - Order of the traits for Figure 3. Created by ./scripts/plots/mammals_correlation.jl.
- mammals_correlation.jl - Processes ./logs/mammals.log and prepares csv files for Figure 3.
- mammals_plot.r - Generates Figure 3.
- simulation_plotting.r - Generates Figure 2, SI Figures 1, 2, and 3.
- storage - Directory for storing intermediate files produced by various scripts.
- dependencies.jl - Installs all Julia packages necessary to run the other scripts.
- dependencies.r - Installs all R packages necessary to run the other scripts. Note that on Windows machines you will also need to install RTools.
- hiv_pcmTiming.r - Times how quickly PCMBaseCpp evaluates the likelihood for the HIV example.
- hiv_prediction_analysis.jl - Analyzes results of XML created by hiv_prediction_setup.jl and outputs a summary to ./scripts/plots/hiv_prediction.csv.
- hiv_prediction_setup.jl - Creates XML to test posterior predictive performance of various models.
- mammals_data.jl - Inputs the raw data files for the mammals example and processes them into formats used for XML files and timing.
- mammals_pcmTiming.r - Times how quickly PCMBaseCpp evaluates the likelihood for the mammals example.
- pcm_test_analysis.jl - Analyzes results of speed tests and outputs them to ./scripts/storage/speed_results.csv.
- pcm_test_setup.jl - Creates XML and R-readable data files for speed comparison between BEAST and PCMBaseCpp.
- pcm_timing.r - Function for timing how quickly PCMBaseCpp evaluates likelihoods.
- prok_pcmTiming.r - Times how quickly PCMBaseCpp evaluates the likelihood for the prokaryotes example.
- run_timing_sim.sh - Linux shell script to evaluate speed of BEAST and PCMBaseCpp likelihood calculations on simulated data sets.
- run_timing.sh - Linux shell script to evaluate speed of BEAST and PCMBaseCpp likelihood calculations on real data sets.
- sim_pcmTiming.r - R script for timing PCMBaseCpp with simulated data.
- simulation_analysis.jl - Analyzes log files from the simulation study. The log files should be in the ./logs/simulation_study directory. Produces the trait_simulation.csv and matrix_simulation.csv files in the ./scripts/storage directory.
- simulation_setup.jl - Creates XML files for the simulation study.
- timing_analysis.jl - Processes results of speed comparison against earlier methods and outputs ./scripts/storage/timing_summary.csv.
- timing_setup.jl - Creates BEAST xml files for speed tests for Table 1.
- xml_setup.jl - Creates mammals.xml and hiv.xml files.
- plots
- xml
- PCMBase_comparison - Directory for storing BEAST xml files for comparison with PCMBaseCpp.
- hiv_prediction - Directory for storing BEAST xml files for the posterior predictive performance analysis on the HIV data set. All internal files are templates used by ./scripts/hiv_prediction_setup.jl for generating new xml and are not intended to be run directly.
- simulation_study - Directory for storing BEAST xml files for the simulation study.
- timing - Directory for storing xml files for comparing the computational efficiency of our methods against other Bayesian inference regimes. All internal files are templates used by ./scripts/timing_setup.jl for generating new xml and are not intended to be run directly.
- hiv.xml - BEAST xml file for the HIV heritability analysis in Section 7.3. Generated by the ./scripts/xml_setup.jl Julia script.
- mammals.xml - BEAST xml file for the mammals life history analysis in Section 7.1. Generated by the ./scripts/xml_setup.jl Julia script.
- prokaryotes.xml - BEAST xml file for prokaryotes analysis in Section 7.2. Created with BEAUti, the graphical user interface for generating BEAST XML files, and manually edited.
- Ensure you have Java version 1.8 installed.
  - From the command line, type java -version.
  - The output should start with java version "1.8.0_<other numbers>".
  - If Java is not installed or the wrong version of Java is installed, please follow the instructions here.
  - Note that if a different version of Java is already installed, you will probably be able to run all of the xml. If you encounter problems, you can update your version of Java later.
- Ensure you have git installed.
  - From the command line, type git --version. The version of git should be returned.
  - If git is not installed, you can find directions to install it here.
- Ensure you have ant installed.
  - From the command line, type ant -version. The version of ant should be returned.
  - If ant is not installed, you can find directions to install it here.
- Download and build BEAGLE (this is only necessary to run the prokaryotes.xml example).
  - Follow the instructions here to download and install BEAGLE.
- Download and build BEAST.
  - Enter the following commands on the command line, one at a time:
    - git clone --branch repeated_measures --depth 1 https://github.com/beast-dev/beast-mcmc.git
    - cd beast-mcmc
    - git checkout repeated_measures
    - ant
- Navigate to the directory with the beast.jar file, which can be found in the ./beast directory.
  - From the beast-mcmc directory (which you set up above), enter the following commands on the command line, one at a time:
    - cd build
    - cd dist
- Run an .xml file on BEAST.
  - Option 1: launch the BEAST gui
    - Enter java -jar beast.jar
    - A gui will pop up.
    - Click Choose File and select the .xml file you wish to run.
    - Click Run.
    - BEAST should begin running in a gui window.
  - Option 2: run BEAST from the command line
    - Enter java -jar beast.jar <path/to/your/file.xml>.
    - BEAST should begin running on the command line.
  - The output .log file will be in the ./beast directory.
- Use Tracer to view the .log file.
  - Tracer installation and usage instructions can be found here.
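The prerequisite checks in the setup list above can be collected into a few shell functions. This is only a sketch: the function names are illustrative and not part of this repository, and the Java check assumes the version banner contains the text version "1.8., as in the expected output quoted above.

```shell
#!/bin/sh
# Print the name of every requested tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Succeed if a version banner reports Java 1.8, for example
#   java version "1.8.0_251"   or   openjdk version "1.8.0_292"
is_java_8() {
    case "$1" in
        *'version "1.8.'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report on the tools named in the setup list (hypothetical helper).
check_prerequisites() {
    missing=$(missing_tools java git ant)
    if [ -n "$missing" ]; then
        echo "Not found on PATH:" $missing
        return 1
    fi
    # `java -version` writes its banner to stderr, hence the 2>&1.
    if ! is_java_8 "$(java -version 2>&1)"; then
        echo "Java is installed, but the reported version is not 1.8"
    fi
    echo "All required tools were found"
}
```

After sourcing these definitions, calling check_prerequisites prints which tools, if any, still need to be installed. A Java version other than 1.8 is reported but not treated as an error, since other versions will probably also work.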
Some scripts (those ending in .r or .R) related to plotting and the comparison with PCMBase are written in R. All scripts were run with R-3.6.2, but will likely work on any R version 3.4 or above.
Instructions for installing R can be found here.
If you use a Windows machine, you will also need to install RTools to install all relevant packages.
To ensure all necessary packages are installed, run the ./scripts/dependencies.r R script.
All R scripts in this repository should be run from the directory they are located in.
To run an R script, you can use RStudio, or simply navigate on the command line to the appropriate directory and enter Rscript <file.r>, replacing <file.r> with the desired script.
Most scripts (those ending in .jl) related to data pre-processing, simulating data, and processing BEAST log files are written in Julia.
All scripts were run with Julia v1.4, but will likely work on any version 1.0 or above.
Instructions for installing Julia can be found here.
To ensure all necessary packages are installed, run the ./scripts/dependencies.jl Julia script.
To run a Julia script, use the command line to navigate to the directory containing the script of interest and enter julia <file.jl>, replacing <file.jl> with the script you want to run.
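Both the R and the Julia instructions above ask you to run each script from the directory that contains it. A small shell helper can do the directory change for you. The name run_in_place is hypothetical and is not provided by this repository.

```shell
#!/bin/sh
# Run a script from the directory that contains it.
# Usage: run_in_place <interpreter> <path/to/script> [script arguments...]
run_in_place() {
    interpreter="$1"
    script="$2"
    shift 2
    # The subshell keeps the caller's working directory unchanged.
    (
        cd "$(dirname "$script")" || exit 1
        "$interpreter" "$(basename "$script")" "$@"
    )
}
```

For example, run_in_place Rscript scripts/dependencies.r and run_in_place julia scripts/dependencies.jl, both issued from the repository root, would run the two dependency scripts from inside ./scripts.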
Unless otherwise noted, all scripts are in the ./scripts directory of this repository.
- Run the mammals_data.jl script.
- Run the xml_setup.jl script.
  - Creates the mammals.xml file in the ./xml directory.
- Run the ./xml/mammals.xml file according to the instructions in the BEAST section above.
- Move the log file, located at ./beast/mammals.log, to the ./logs directory of this repository.
  - You can examine this file using Tracer.
- Run the mammals_correlation.jl script in the ./scripts/plots directory to extract relevant information from the log file.
- Run the mammals_plot.r script in the ./scripts/plots directory to generate Figure 3.
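The mammals steps above could be scripted roughly as follows. This sketch assumes it is called from the repository root, that beast.jar sits in ./beast (so that BEAST writes mammals.log there), and that julia, java, and Rscript are on your PATH. The step and mammals_workflow functions and the DRY_RUN variable are illustrative names, not part of the repository.

```shell
#!/bin/sh
# Run a command from a given directory, or just print it when DRY_RUN is set.
# Usage: step <directory> <command> [arguments...]
step() {
    dir="$1"
    shift
    if [ -n "${DRY_RUN:-}" ]; then
        echo "(cd $dir && $*)"
    else
        (cd "$dir" && "$@")
    fi
}

# The mammals steps in order; stops at the first step that fails.
mammals_workflow() {
    step scripts julia mammals_data.jl &&
    step scripts julia xml_setup.jl &&
    step beast java -jar beast.jar ../xml/mammals.xml &&
    step . mv beast/mammals.log logs/ &&
    step scripts/plots julia mammals_correlation.jl &&
    step scripts/plots Rscript mammals_plot.r
}
```

Setting DRY_RUN=1 before calling mammals_workflow prints the six commands without running them, which is a cheap way to check the paths before starting a long BEAST run.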
- Run the ./xml/prokaryotes.xml file according to the instructions in the BEAST section above.
  - The prokaryotes.xml file in the ./xml directory was created by BEAUti and manually edited.
- Move the prokaryotes.log file (located in the ./beast directory) to the ./logs directory.
  - You can examine this file using Tracer.
- Run the bacteria_correlation.jl script in the ./scripts/plots directory to extract relevant information from the log file.
- Run the bacteria_plot.r script in the ./scripts/plots directory to generate Figure 5.
- Run the xml_setup.jl script.
  - Creates the hiv.xml file in the ./xml directory.
- Run the hiv.xml file in the ./xml directory according to the instructions in the BEAST section above.
- Move the hiv.log file (located in the ./beast directory) to the ./logs directory.
- You can examine this file using Tracer. The relevant parameters are:
  - varianceProportionStatistic11 $\rightarrow$ GSVL
  - varianceProportionStatistic22 $\rightarrow$ SPVL
  - varianceProportionStatistic33 $\rightarrow$ CD4 slope
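Tracer is the recommended way to inspect the log, but for a quick command-line look at one of the parameters above, something like the following may be enough. It assumes the log is tab-delimited text in which lines starting with # are comments and the first remaining line is a header of column names. It does not discard any burn-in, so the number it prints is not a substitute for the summaries Tracer reports. The function name column_mean is hypothetical.

```shell
#!/bin/sh
# Print the mean of one named column in a tab-delimited log file.
# Lines starting with '#' are skipped; the first remaining line is
# taken to be the header. No burn-in is removed.
# Usage: column_mean <column name> <log file>
column_mean() {
    col="$1"
    file="$2"
    awk -F '\t' -v col="$col" '
        /^#/ { next }
        !header_seen {
            # Find the position of the requested column in the header.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            header_seen = 1
            next
        }
        idx { sum += $idx; n++ }
        END {
            # Fail quietly if the column is missing or there are no samples.
            if (!idx || n == 0) exit 1
            printf "%.4f\n", sum / n
        }
    ' "$file"
}
```

For example, column_mean varianceProportionStatistic11 logs/hiv.log would print the average of the GSVL column over all logged states, provided the column appears under that exact name in the header.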
- Run the hiv_prediction_setup.jl script to generate xml files.
  - All xml are output to the ./xml/hiv_prediction directory.
- Run all xml files ending in a number in the ./xml/hiv_prediction directory according to the instructions in the BEAST section above.
  - Note that the four xml files not ending in a number are templates and have not had any data removed.
- Copy all log files to the ./logs/hiv_prediction directory.
- Run the hiv_prediction_analysis.jl script.
- Run the HIV_MSE.R script in the ./scripts/plots directory to generate Figure 7.
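Picking out the xml files that end in a number, while leaving the templates alone, can be done with a shell glob. The sketch below assumes that "ending in a number" means a digit immediately before the .xml extension. The function names and the BEAST_CMD variable are illustrative, with BEAST_CMD included only so the loop can be tried without starting BEAST.

```shell
#!/bin/sh
# List the .xml files in a directory whose names end in a digit
# (for example model_3.xml), skipping templates such as model.xml.
numbered_xml() {
    for f in "$1"/*[0-9].xml; do
        # If nothing matches, the pattern is left unexpanded; skip it.
        [ -e "$f" ] && echo "$f"
    done
    return 0
}

# Run each numbered file through BEAST, one after another.
# BEAST_CMD is a hypothetical override; by default BEAST is launched
# with the command given in the BEAST section.
run_numbered_xml() {
    numbered_xml "$1" | while IFS= read -r xml; do
        ${BEAST_CMD:-java -jar beast.jar} "$xml" < /dev/null
    done
}
```

Called from the directory containing beast.jar with a path to the xml directory that is valid from there, run_numbered_xml runs the analyses sequentially. Setting BEAST_CMD=echo first prints the files that would be run.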
- Run the timing_setup.jl script to generate xml for timing.
  - All xml are in the ./xml/timing directory.
- Run all xml files ending in a number in the ./xml/timing directory according to the instructions in the BEAST section above.
  - Note that you will need to save the screen output to a .txt file with the same name as the original xml file. To do this from the command line, use the command java -jar beast.jar <path/to/your/file.xml> > <file.txt>, replacing <path/to/your/file.xml> and <file.txt> with the relevant paths and filenames.
- Move all .log files and saved .txt files generated to the ./logs/timing directory of this repository.
- Run the timing_analysis.jl script. The output will be written to the ./scripts/storage/timing_summary.csv file.
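Deriving the .txt filename from the xml filename by hand is easy to get wrong when there are many timing files, so it may help to let the shell do it. The helper names and the BEAST_CMD variable below are hypothetical and exist only for this sketch. As with the redirect in the instructions above, only standard output is saved.

```shell
#!/bin/sh
# Name of the .txt file that matches an xml file: same name, new extension.
# The directory part is dropped, so the file lands in the current directory.
screen_output_name() {
    echo "$(basename "$1" .xml).txt"
}

# Run one xml file and save what BEAST prints to the matching .txt file.
# BEAST_CMD is a hypothetical override used only to allow dry runs.
run_and_save_output() {
    ${BEAST_CMD:-java -jar beast.jar} "$1" > "$(screen_output_name "$1")"
}
```

With these definitions, running run_and_save_output on a file named example_1.xml leaves example_1.txt next to the generated .log file in the directory BEAST was run from, ready to be moved to ./logs/timing.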
- Ensure the mammals.log, hiv.log, and prokaryotes.log files are in the ./logs directory of this repository.
- Run the simulation_setup.jl script to generate simulated data sets and xml files.
- Run all xml files in the ./xml/simulation_study directory according to the instructions in the BEAST section above.
- Copy all .log files produced from running the xml files above to the ./logs/simulation_study directory of this repository.
- Run the simulation_analysis.jl script.
- Run the simulation_plotting.r script in the ./scripts/plots directory to generate Figure 2 and SI Figures 1, 2, and 3.
To perform simulation studies with parameter values different from those in the three examples used in this paper, use the sim_xml function in the simulation_setup.jl script.
This function takes the following eight arguments, the last two of which are optional:
- xml_dir - String representing the directory you want to write the XML files to. This is already set in the script, but can be changed to wherever you want.
- newick - String containing the newick representation of the phylogenetic tree.
- $\Sigma$ - 2-D array representing the diffusion variance.
- $\Gamma$ - 2-D array representing the residual variance (note that this notation differs from the paper, where $\Gamma$ is the residual precision).
- sparsity - Floating point number between 0 and 1 representing the proportion of the data you want to randomly remove.
- base_name - String that will be the first part of the filename.
- standardize_tree - Boolean value indicating whether you want to rescale the tree edge lengths such that the maximum distance between the root and any tip is equal to 1.
- rep - Integer that, if supplied as an argument, appends r_<rep> to the end of the filename. Used to distinguish different runs with the same base_name, number of taxa, number of traits, and percent missing.
Note that sim_xml stores the parameters used for simulation in the ./scripts/storage/simulation directory, and relies on the internal structure of this repository.
Any scripts you write using sim_xml should be located in the ./scripts directory.
- Ensure the mammals.log, hiv.log, and prokaryotes.log files are in the ./logs directory of this repository.
- Run the mammals_data.jl script to generate the relevant newick files.
- Run the pcm_test_setup.jl script to generate simulated data sets and xml files.
- Execute the run_timing.sh and run_timing_sim.sh Linux shell scripts.
  - The run_timing_sim.sh script will probably take a long time (several hours) to complete.
- Run the pcm_test_analysis.jl script. The results are located in the ./scripts/storage/speed_results.csv file.