suchard-group/incomplete_measurements


Inferring phenotypic trait evolution on large trees with many incomplete measurements

This repository contains the necessary data and scripts to reproduce the analyses and figures in the manuscript entitled Inferring phenotypic trait evolution on large trees with many incomplete measurements by [temporarily blinded for review].

The instructions below should be sufficient to perform all analyses on OSX, Windows, and Linux, but please raise an issue on this GitHub repository if you encounter any problems.

File Structure

  • beast
    • beast.jar - Compressed BEAST source code. This file was compiled from the version of BEAST located here.
  • data
    • ele_1307_sm_sa1.tre - Mammalian phylogeny from Fritz et al. (2009). Downloaded from here.
    • hiv_dates.csv - Sampling dates associated with HIV data.
    • hiv_newick.txt - Newick HIV tree. Tree originated from ML tree in Blanquart et al. (2017) (processing described in manuscript).
    • hiv_processed_data.csv - Processed HIV data used in analysis (processing described in manuscript).
    • mammals_log_data.csv - CSV file that stores logged mammals data. Missing values are represented by NaN. File was created by running the Julia script ./scripts/mammals_data.jl.
    • mammals_newick.txt - The mammalST_MSW05_bestDates tree manually copied from ele_1307_sm_sa1.tre.
    • mammals_trimmed_newick.txt - Mammals newick file with taxa not present in the data set removed. File was created by running Julia script ./scripts/mammals_data.jl.
    • PanTHERIA_1-0_WR05_Aug2008.txt - Mammalian phenotype data from PanTHERIA database. Downloaded from here.
    • prokaryotes_newick.txt - Newick tree for the prokaryotes example. To produce it, we obtained the tree log file from running ./xml/prokaryotes.xml in BEAST, resampled every 100,000 trees using the BEAST software LogCombiner to reduce the file size, determined the maximum clade credibility tree using TreeAnnotator, and output a newick file using FigTree. This newick had a single negative branch length, which was manually edited to 0.0 with the child branch lengths shortened accordingly.
    • prokaryotes_processed_data.csv - Processed prokaryote data used in analysis (processing described in manuscript).
  • logs
    • hiv_prediction - directory for storing BEAST log files in HIV posterior predictive power analysis.
    • PCMBase_timing - directory for storing results of speed comparison between PCMBaseCpp and BEAST.
    • simulation_study - directory for storing BEAST log files from simulation studies.
    • timing - directory for storing BEAST log files and timing files for comparing computational efficiency with other Bayesian inference regimes.
    • hiv.log - BEAST log file from running ./xml/hiv.xml in BEAST.
    • mammals.log - BEAST log file from running ./xml/mammals.xml in BEAST.
    • prokaryotes.log - BEAST log file from running ./xml/prokaryotes.xml in BEAST. The log file present in this repository has been subsampled from the original to meet GitHub file size limits.
  • scripts
    • plots
      • HIV_MSE.R - Generates Figure 7.
      • bacteriaCorrelation.csv - Stores values for generating Figure 5. Created by running Julia script ./scripts/plots/bacteria_correlation.jl.
      • bacteriaCorrelationLabels.csv - Stores the order of the traits in Figure 5. Created by running Julia script ./scripts/plots/bacteria_correlation.jl.
      • bacteria_correlation.jl - Processes ./logs/prokaryotes.log and prepares csv files for Figure 5.
      • bacteria_plot.r - Generates Figure 5.
      • correlation_plot.r - Functions for making Figures 3 and 5.
      • correlation_prep.jl - Functions for processing BEAST log files into correlation csv files.
      • custom_boxplot.R - Functions for customized box plots used by ./scripts/plots/HIV_MSE.R. Modified from here.
      • hiv_prediction.csv - Stored values for producing Figure 7. Created by ./scripts/hiv_prediction_analysis.jl.
      • mammalsCorrelation.csv - Stored values for producing Figure 3. Created by ./scripts/plots/mammals_correlation.jl.
      • mammalsCorrelationLabels.csv - Order of the traits for Figure 3. Created by ./scripts/plots/mammals_correlation.jl.
      • mammals_correlation.jl - Processes ./logs/mammals.log and prepares csv files for Figure 3.
      • mammals_plot.r - Generates Figure 3.
      • simulation_plotting.r - Generates Figure 2, SI Figures 1, 2, and 3.
    • storage - Directory for storing intermediate files produced by various scripts.
    • dependencies.jl - Installs all Julia packages necessary to run the other scripts.
    • dependencies.r - Installs all R packages necessary to run the other scripts. Note that on Windows machines you will also need to install RTools.
    • hiv_pcmTiming.r - Times how quickly PCMBaseCpp evaluates the likelihood for the HIV example.
    • hiv_prediction_analysis.jl - Analyzes results of XML created by hiv_prediction_setup.jl and outputs a summary to ./scripts/plots/hiv_prediction.csv.
    • hiv_prediction_setup.jl - Creates XML to test posterior predictive performance of various models.
    • mammals_data.jl - Inputs the raw data files for the mammals example and processes them into formats used for XML files and timing.
    • mammals_pcmTiming.r - Times how quickly PCMBaseCpp evaluates the likelihood for the mammals example.
    • pcm_test_analysis.jl - Analyzes results of speed tests and outputs them to ./scripts/storage/speed_results.csv.
    • pcm_test_setup.jl - Creates XML and R-readable data files for speed comparison between BEAST and PCMBaseCpp.
    • pcm_timing.r - Function for timing how quickly PCMBaseCpp evaluates likelihoods.
    • prok_pcmTiming.r - Times how quickly PCMBaseCpp evaluates the likelihood for the prokaryotes example.
    • run_timing_sim.sh - Linux shell script to evaluate speed of BEAST and PCMBaseCpp likelihood calculations on simulated data sets.
    • run_timing.sh - Linux shell script to evaluate speed of BEAST and PCMBaseCpp likelihood calculations on real data sets.
    • sim_pcmTiming.r - R script for timing PCMBaseCpp with simulated data.
    • simulation_analysis.jl - Analyzes log files from simulation study. The log files should be in the ./logs/simulation_study directory. Produces the trait_simulation.csv and matrix_simulation.csv files in the ./scripts/storage directory.
    • simulation_setup.jl - Creates XML files for the simulation study.
    • timing_analysis.jl - Processes results of speed comparison against earlier methods and outputs ./scripts/storage/timing_summary.csv.
    • timing_setup.jl - Creates BEAST xml files for speed tests for Table 1.
    • xml_setup.jl - Creates mammals.xml and hiv.xml files.
  • xml
    • PCMBase_comparison - Directory for storing BEAST xml files for comparison with PCMBaseCpp.
    • hiv_prediction - Directory for storing BEAST xml files for the posterior predictive performance analysis on the HIV data set. All internal files are templates used by ./scripts/hiv_prediction_setup.jl for generating new xml and are not intended to be run directly.
    • simulation_study - Directory for storing BEAST xml files for the simulation study.
    • timing - Directory for storing xml files for comparing the computational efficiency of our methods against other Bayesian inference regimes. All internal files are templates used by ./scripts/timing_setup.jl for generating new xml and are not intended to be run directly.
    • hiv.xml - BEAST xml file for the HIV heritability analysis in Section 7.3. Generated by the Julia script ./scripts/xml_setup.jl.
    • mammals.xml - BEAST xml file for the mammals life history analysis in Section 7.1. Generated by the Julia script ./scripts/xml_setup.jl.
    • prokaryotes.xml - BEAST xml file for prokaryotes analysis in Section 7.2. Created with BEAUti, the graphical user interface for generating BEAST XML files, and manually edited.

Computing Environment

Java & BEAST

  1. Ensure you have Java version 1.8 installed.
    • From the command line, type java -version.
    • The output should start with java version "1.8.0_<other numbers>".
    • If java is not installed or the wrong version of java is installed, please follow the instructions here.
    • Note that if a different version of Java is already installed, you will probably be able to run all of the xml. If you encounter problems, you can update your version of Java later.
  2. Ensure you have git installed.
    • From the command line, type git --version. The version of git should be returned.
    • If git is not installed, you can find directions to install it here.
  3. Ensure you have ant installed.
    • From the command line, type ant -version. The version of ant should be returned.
    • If ant is not installed, you can find directions to install it here.
  4. Download and build BEAGLE (this is only necessary to run the prokaryotes.xml example).
    • Follow the instructions here to download and install BEAGLE.
  5. Download and build BEAST.
    • Enter the following code on the command line.
    git clone --branch repeated_measures --depth 1 https://github.com/beast-dev/beast-mcmc.git
    cd beast-mcmc
    git checkout repeated_measures
    ant
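
Steps 1 to 3 can be checked in one pass before building. The following is a minimal sketch, assuming a POSIX shell; check_tool is a helper name introduced here for illustration and is not part of this repository. It only tests whether java, git, and ant are on the PATH and does not verify their versions.

```shell
#!/bin/sh
# Report whether each command-line tool needed for the BEAST build is on PATH.
# check_tool is a hypothetical helper name used only in this sketch.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
        return 0
    else
        echo "missing: $1"
        return 1
    fi
}

missing=0
for tool in java git ant; do
    check_tool "$tool" || missing=$((missing + 1))
done
echo "$missing required tool(s) missing"
```

If anything is reported missing, follow the installation links in the corresponding step above before running ant.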
    

Running an XML file on BEAST

  1. Navigate to the directory containing the beast.jar file. A pre-built copy is in the ./beast directory of this repository.
    • If you built BEAST yourself, then from the beast-mcmc directory (which you set up above), enter the following into the command line:
    cd build
    cd dist
    
  2. Run an .xml file on BEAST
    • Option 1: launch the BEAST gui

      • Enter java -jar beast.jar
      • A gui will pop up.
      • Click Choose File and select the .xml file you wish to run.
      • Click Run.
      • BEAST should begin running in a gui window.
    • Option 2: run BEAST from the command line

      • Enter java -jar beast.jar <path/to/your/file.xml>.
      • BEAST should begin running on the command line.
    • The output .log file will be in the ./beast directory.

  3. Use Tracer to view the .log file.
    • Tracer installation and usage instructions can be found here.
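
Option 2 and the later "move the log file" steps can be combined into one helper. This is a sketch under stated assumptions, not part of the repository: run_xml is a hypothetical function name, and it assumes that BEAST writes a log named after the XML file (for example mammals.xml producing mammals.log) into the directory it was started from. The log file name is actually set inside each XML file, so check it if nothing gets moved.

```shell
#!/bin/sh
# run_xml is a hypothetical helper: it runs one XML file with beast.jar and
# then moves the resulting log into a destination directory.
# ASSUMPTION: the log is named <xml base name>.log and is written to the
# directory BEAST was started from.
# Usage: run_xml <dir containing beast.jar> <path/to/file.xml> <log dir>
run_xml() {
    jar_dir=$1
    xml=$2
    log_dir=$3
    name=$(basename "$xml" .xml)
    # Resolve to absolute paths before changing directory.
    xml_abs=$(cd "$(dirname "$xml")" && pwd)/$(basename "$xml")
    mkdir -p "$log_dir"
    log_abs=$(cd "$log_dir" && pwd)
    (
        cd "$jar_dir" || exit 1
        java -jar beast.jar "$xml_abs" || exit 1
        if [ -e "$name.log" ]; then
            mv "$name.log" "$log_abs/"
        fi
    )
}

# Example, from the repository root:
#   run_xml ./beast ./xml/mammals.xml ./logs
```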

R

Some scripts (those ending in .r or .R) related to plotting and the comparison with PCMBase are written in R. All scripts were run with R-3.6.2, but will likely work on any R version 3.4 or above. Instructions for installing R can be found here.

If you use a Windows machine, you will also need to install RTools to install all relevant packages.

To ensure all necessary packages are installed, run the ./scripts/dependencies.r R script.

All R scripts in this repository should be run from the directory they are located in. To run an R script, you can use RStudio, or simply navigate on the command line to the appropriate directory and enter Rscript <file.r>, replacing <file.r> with the desired script.

Julia

Most scripts (those ending in .jl) related to data pre-processing, simulating data, and processing BEAST log files are written in Julia. All scripts were run with Julia v1.4, but will likely work on any version 1.0 or above. Instructions for installing Julia can be found here.

To ensure all necessary packages are installed, run the ./scripts/dependencies.jl Julia script.

To run a Julia script, use the command line to navigate to the directory containing the script of interest and enter julia <file.jl> into the command line, replacing <file.jl> with the script you want to run.

Work Flows

Unless otherwise noted, all scripts are in the ./scripts directory of this repository.

Mammals Correlation and Figure 3

  1. Run the mammals_data.jl script.
  2. Run the xml_setup.jl script.
    • Creates mammals.xml file in the ./xml directory.
  3. Run the ./xml/mammals.xml file according to the instructions in the BEAST section above.
  4. Move the log file, located at ./beast/mammals.log, to the ./logs directory of this repository.
    • You can examine this file using Tracer.
  5. Run the mammals_correlation.jl script in the ./scripts/plots directory to extract relevant information from log file.
  6. Run the mammals_plot.r script in the ./scripts/plots directory to generate Figure 3.
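
The six steps above can be strung together as follows. This is a sketch rather than a script shipped with the repository: step and mammals_workflow are hypothetical names, the paths assume you start in the repository root and use the beast.jar in ./beast, and by default it only prints the commands (set DRY_RUN=0 to execute them).

```shell
#!/bin/sh
# step is a hypothetical helper: it runs a command inside a given directory,
# or just prints it when DRY_RUN=1 (the default in this sketch).
DRY_RUN=${DRY_RUN:-1}
step() {
    dir=$1
    shift
    if [ "$DRY_RUN" = 1 ]; then
        echo "(cd $dir && $*)"
    else
        (cd "$dir" && "$@")
    fi
}

# One line per numbered step; the chain stops at the first failure.
mammals_workflow() {
    step scripts       julia mammals_data.jl &&
    step scripts       julia xml_setup.jl &&
    step beast         java -jar beast.jar ../xml/mammals.xml &&
    step .             mv beast/mammals.log logs/ &&
    step scripts/plots julia mammals_correlation.jl &&
    step scripts/plots Rscript mammals_plot.r
}

mammals_workflow
```

Each script is run from its own directory, matching the instructions in the R and Julia sections.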

Prokaryotes Correlation and Figure 5

  1. Run the ./xml/prokaryotes.xml file according to the instructions in the BEAST section above.
    • The prokaryotes.xml file in the ./xml directory was created by BEAUti and manually edited.
  2. Move the prokaryotes.log file (located in the ./beast directory) to the ./logs directory.
    • You can examine this file using Tracer.
  3. Run the bacteria_correlation.jl script in the ./scripts/plots directory to extract relevant information from log file.
  4. Run the bacteria_plot.r script in the ./scripts/plots directory to generate Figure 5.

HIV Heritability

  1. Run the xml_setup.jl script.
    • Creates hiv.xml file in the ./xml directory.
  2. Run the hiv.xml file in the ./xml directory according to the instructions in the BEAST section above.
  3. Move the hiv.log file (located in the ./beast directory) to the ./logs directory.
  4. You can examine this file using Tracer. The relevant parameters are:
    • varianceProportionStatistic11 $\rightarrow$ GSVL
    • varianceProportionStatistic22 $\rightarrow$ SPVL
    • varianceProportionStatistic33 $\rightarrow$ CD4 slope
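
If you prefer a quick command-line summary to Tracer, the mean of one of these columns can be computed with awk. This is a rough sketch: column_mean is a hypothetical helper, it assumes the log is tab-delimited with comment lines starting with # followed by a header row, and it does not discard any burn-in, so treat the result as a sanity check only.

```shell
#!/bin/sh
# column_mean is a hypothetical helper that prints the mean of one named
# column of a BEAST log file.
# ASSUMPTIONS: tab-delimited file, '#' comment lines, then one header row.
# No burn-in is removed.
# Usage: column_mean <file.log> <column name>
column_mean() {
    awk -F '\t' -v col="$2" '
        /^#/    { next }                     # skip comment lines
        NF == 0 { next }                     # skip blank lines
        !idx {                               # header row: locate the column
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        { sum += $idx; n++ }                 # accumulate sampled values
        END { if (n) printf "%.4f\n", sum / n }
    ' "$1"
}

# Example, from the repository root:
#   column_mean ./logs/hiv.log varianceProportionStatistic11
```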

HIV Posterior Predictive Power and Figure 7

  1. Run the hiv_prediction_setup.jl script to generate xml files.
    • All xml are output to the ./xml/hiv_prediction directory.
  2. Run all xml files ending in a number in the ./xml/hiv_prediction directory according to the instructions in the BEAST section above.
    • Note that the four xml files not ending in a number are templates and have not had any data removed.
  3. Copy all log files to the ./logs/hiv_prediction directory.
  4. Run the hiv_prediction_analysis.jl script.
  5. Run the HIV_MSE.R script in the ./scripts/plots directory to generate Figure 7.

Computational Efficiency and Table 1

  1. Run the timing_setup.jl script to generate xml for timing.
    • All xml are in the ./xml/timing directory.
  2. Run all xml files ending in a number in the ./xml/timing directory according to the instructions in the BEAST section above.
    • Note that you will need to save the screen output to a .txt file with the same name as the original xml file. To do this, run java -jar beast.jar <path/to/your/file.xml> > <file.txt> from the command line, replacing <path/to/your/file.xml> and <file.txt> with the relevant paths and filenames.
  3. Move all .log files and saved .txt files generated to the ./logs/timing directory of this repository.
  4. Run the timing_analysis.jl script. The output will be written to ./scripts/storage/timing_summary.csv file.
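
Step 2 involves many files, so a loop helps. The sketch below is an illustration, not a script from this repository: time_all is a hypothetical name, and it relies only on the naming rule stated above (runnable files end in a number, templates do not). It saves each run's screen output to a .txt file with the same base name as the XML file; the .log files still need to be moved as described in step 3.

```shell
#!/bin/sh
# time_all is a hypothetical helper. For every XML file in <xml dir> whose
# name ends in a digit, it runs BEAST and saves the screen output to
# <out dir>/<same base name>.txt. Template files (no trailing digit) are skipped.
# Usage: time_all <path/to/beast.jar> <xml dir> <out dir>
time_all() {
    jar=$1
    xml_dir=$2
    out_dir=$3
    mkdir -p "$out_dir"
    for xml in "$xml_dir"/*[0-9].xml; do
        [ -e "$xml" ] || continue            # no matches: the pattern stays literal
        name=$(basename "$xml" .xml)
        java -jar "$jar" "$xml" > "$out_dir/$name.txt"
    done
}

# Example, from the repository root:
#   time_all ./beast/beast.jar ./xml/timing ./logs/timing
```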

Simulation Study; Figure 2; and SI Figures 1, 2, and 3

  1. Ensure the mammals.log, hiv.log, and prokaryotes.log files are in the ./logs directory of this repository.
  2. Run the simulation_setup.jl script to generate simulated data sets and xml files.
  3. Run all xml files in the ./xml/simulation_study directory according to the instructions in the BEAST section above.
  4. Copy all .log files produced from running the xml files above to the ./logs/simulation_study directory of this repository.
  5. Run the simulation_analysis.jl script.
  6. Run the simulation_plotting.r script in the ./scripts/plots directory to generate Figure 2 and SI Figures 1, 2, and 3.

To perform simulation studies with parameter values different from those in the three examples we used in this paper, use the sim_xml function in the simulation_setup.jl script. This function takes the following eight arguments, the last two of which are optional:

  • xml_dir - String representing the directory you want to write the XML files to. This is already set in the script, but can be changed to wherever you want.
  • newick - String representing the newick representation of the phylogenetic tree.
  • $\Sigma$ - 2-D array representing the diffusion variance.
  • $\Gamma$ - 2-D array representing the residual variance (note that this is different notation than the paper where $\Gamma$ is the residual precision).
  • sparsity - Floating point number between 0 and 1 representing the proportion of the data you want to randomly remove.
  • base_name - String that will be the first part of the filename.
  • standardize_tree - Boolean value indicating whether you want to rescale the tree edge lengths such that the maximum distance between the root and any tip is equal to 1.
  • rep - Integer that, if supplied as an argument, appends r_<rep> to the end of the filename. Used to distinguish different runs with the same base_name, number of taxa, number of traits, and percent missing.
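
Two of these arguments have a simple mathematical reading. As a sketch, write $\Gamma_{\text{paper}}$ for the residual precision in the paper and $\Gamma_{\text{script}}$ for the residual variance passed to sim_xml, and write $t_e$ for an edge length and $d(\text{root}, i)$ for the summed edge lengths from the root to tip $i$. All of this notation is introduced here for illustration and does not appear in the scripts. The relationships described in the list are then:

```latex
\Gamma_{\text{script}} = \Gamma_{\text{paper}}^{-1},
\qquad
t_e' = \frac{t_e}{\max_{i} d(\text{root}, i)}
\quad \text{(applied only when standardize\_tree is true)}
```

After the rescaling, the largest root-to-tip distance computed from the new edge lengths $t_e'$ equals 1.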

Note that sim_xml stores the parameters used for simulation in the ./scripts/storage/simulation directory, and relies on the internal structure of this repository. Any scripts you write using sim_xml should be located in the ./scripts directory.

Comparison with PCMBaseCpp and SI Tables 1 and 2

  1. Ensure the mammals.log, hiv.log, and prokaryotes.log files are in the ./logs directory of this repository.
  2. Run the mammals_data.jl script to generate the relevant newick files.
  3. Run the pcm_test_setup.jl script to generate simulated data sets and xml files.
  4. Execute the run_timing.sh and run_timing_sim.sh Linux shell scripts.
    • The run_timing_sim.sh script will probably take a long time (several hours) to complete.
  5. Run the pcm_test_analysis.jl script. The results are located in the ./scripts/storage/speed_results.csv file.
