CALDER

CALDER (Cancer Analysis of Longitudinal Data through Evolutionary Reconstruction) is an algorithm for inferring evolutionary phylogenies using multiple longitudinal bulk DNA sequencing samples from the same patient. CALDER improves upon previous methods by enforcing the evolutionary relationships that are expected between temporally ordered samples.

Setup

Setting up CALDER involves downloading the repository, installing the required software, and running a quick test, as described in the sections below.

Download

The following command clones the current CALDER repository from GitHub:

git clone https://github.com/raphael-group/calder.git

Requirements

The following software is required for CALDER:

Testing

With the dependencies set up correctly, the following command will run CALDER on the provided test input and write the results to a subdirectory called "output":

java -jar calder.jar -i CLL003_clustered.txt -o output

This should take no more than a few seconds to run and the output should match the contents of the sample_output folder.

Use

Using CALDER involves preparing an input file, running the program, and inspecting the output.

Input

The input file is a tab-separated text file representing a matrix of read counts. Each row corresponds to a longitudinal sample, and each mutation occupies two adjacent columns: the first gives the number of reference reads and the second the number of variant reads covering that mutation. For example, an instance with 3 samples and 4 mutations could look like this:

    a     a   b   b   c   c   d   d
t1  700   300 0   0   0   0   0   0
t2  700   300 800 200 900 100 900 100
t3  600   400 800 200 900 100 900 100
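The layout above can be read with a few lines of Python. The following is a minimal sketch, not part of CALDER itself: the function names read_calder_input and vaf are hypothetical, and the sketch assumes that the first header cell is empty and that every mutation label appears exactly twice in a row (reference column first, variant column second).

```python
import csv
import io


def read_calder_input(handle):
    """Parse a read-count matrix into {sample: {mutation: (ref_reads, var_reads)}}."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    labels = header[1:]  # skip the (assumed empty) corner cell
    if len(labels) % 2 != 0:
        raise ValueError("expected two columns (reference, variant) per mutation")
    mutations = labels[0::2]  # each label is repeated, so take every other one
    counts = {}
    for row in body:
        sample, values = row[0], [int(v) for v in row[1:]]
        if len(values) != len(labels):
            raise ValueError(f"row {sample!r} has the wrong number of columns")
        counts[sample] = {
            mutation: (values[2 * i], values[2 * i + 1])
            for i, mutation in enumerate(mutations)
        }
    return counts


def vaf(ref_reads, var_reads):
    """Variant allele frequency; 0.0 when the mutation has no coverage."""
    total = ref_reads + var_reads
    return var_reads / total if total else 0.0


# The example instance from this section, written with explicit tabs.
EXAMPLE = (
    "\ta\ta\tb\tb\tc\tc\td\td\n"
    "t1\t700\t300\t0\t0\t0\t0\t0\t0\n"
    "t2\t700\t300\t800\t200\t900\t100\t900\t100\n"
    "t3\t600\t400\t800\t200\t900\t100\t900\t100\n"
)

if __name__ == "__main__":
    counts = read_calder_input(io.StringIO(EXAMPLE))
    for sample, per_mutation in counts.items():
        print(sample, {m: round(vaf(*rv), 3) for m, rv in per_mutation.items()})
```

A check like this can catch misaligned columns before the file is passed to CALDER.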

For real datasets with a considerable number of mutations (more than 40), we recommend using Absence-Aware Clustering to cluster mutations.

CALDER assumes that input mutations are in copy-neutral regions, i.e., that the number of reads with a mutation is proportional to the number of cells with that mutation. If you suspect this assumption does not hold for your data, consider excluding mutations that may be affected by copy number aberrations (CNAs); alternatively, if you have copy number calls (e.g., from HATCHet), you could correct the read counts to represent the true cancer cell fraction (CCF).
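As an illustration of what such a correction might involve, the sketch below applies a commonly used relation between variant allele frequency and cancer cell fraction. It is not taken from CALDER or HATCHet: the function name cancer_cell_fraction is hypothetical, normal cells are assumed to be diploid at the locus, and the purity, total tumor copy number, and mutation multiplicity are values you would have to supply from your own copy number analysis. Converting the resulting fraction back into read counts for CALDER is left out.

```python
def cancer_cell_fraction(vaf, purity, total_copy_number=2, multiplicity=1):
    """Estimate the fraction of cancer cells carrying a mutation.

    vaf               -- variant reads / total reads at the locus
    purity            -- fraction of cells in the sample that are tumor cells
    total_copy_number -- total copies of the locus in tumor cells (assumed input)
    multiplicity      -- mutated copies per mutated cell (assumed input)
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    if multiplicity < 1:
        raise ValueError("multiplicity must be at least 1")
    # Average copies of the locus per cell, with normal cells assumed diploid.
    average_copies = purity * total_copy_number + 2 * (1 - purity)
    return vaf * average_copies / (purity * multiplicity)


if __name__ == "__main__":
    # Copy-neutral, pure sample: a heterozygous clonal mutation has VAF 0.5.
    print(cancer_cell_fraction(vaf=0.5, purity=1.0))
    # Hypothetical amplified locus in an impure sample.
    print(cancer_cell_fraction(vaf=0.2, purity=0.8, total_copy_number=3))
```

In the copy-neutral, fully pure case the relation reduces to twice the variant allele frequency, which matches the proportionality that CALDER assumes.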

Running

The command to run CALDER is "java -jar calder.jar" followed by command line arguments. The options -i (input file) and -o (output directory) are required; see the command line options below for the rest.
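When CALDER is run on many patients, it can help to assemble the command in a script. The following is a minimal sketch, assuming that calder.jar is in the working directory; the helper name build_calder_command is hypothetical, and it only uses flags listed in the command line options of this README (-i, -o, -a, -s, -v).

```python
import shlex
import subprocess


def build_calder_command(input_path, output_dir, jar="calder.jar",
                         alpha=None, max_solutions=None, solver=None):
    """Return the CALDER invocation as an argument list for subprocess."""
    command = ["java", "-jar", jar, "-i", input_path, "-o", output_dir]
    if alpha is not None:
        command += ["-a", str(alpha)]          # confidence level
    if max_solutions is not None:
        command += ["-s", str(max_solutions)]  # maximum number of optimal solutions
    if solver is not None:
        command += ["-v", solver]              # MILP solver back-end
    return command


if __name__ == "__main__":
    command = build_calder_command("CLL003_clustered.txt", "output")
    print(shlex.join(command))
    # Uncomment to actually run CALDER (requires Java and a configured solver):
    # subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.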

Output

For each solution, CALDER produces two text files: a DOT file containing the inferred phylogenetic tree T, and a CSV file containing the inferred frequency matrix Fhat and the clone proportion matrix U. DOT files can be visualized using standard tools such as Graphviz (see below for an example), and the matrices in CSV format can also be manipulated using standard tools (see soln_to_timescape.py for an example that uses the pandas library in Python).
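For quick scripting, the parent-child edges of a tree can be pulled out of DOT text with the standard library alone. The sketch below is an illustration under stated assumptions: the DOT content shown is hypothetical rather than actual CALDER output, the function names are made up for this example, and the regular expression only handles simple "parent -> child" statements with word-like node names. For anything more involved, a real DOT parser such as pydot is the safer choice.

```python
import re

# Matches simple edges such as:  a -> b;   or   "a" -> "b" [label="x"];
EDGE_PATTERN = re.compile(r'"?([\w.]+)"?\s*->\s*"?([\w.]+)"?')


def dot_edges(dot_text):
    """Return (parent, child) pairs found in simple DOT edge statements."""
    return EDGE_PATTERN.findall(dot_text)


def children_by_parent(edges):
    """Group edges into a {parent: [children]} adjacency mapping."""
    tree = {}
    for parent, child in edges:
        tree.setdefault(parent, []).append(child)
    return tree


# Hypothetical DOT text; node names in real output files may differ.
HYPOTHETICAL_DOT = """digraph T {
    a -> b;
    a -> c;
    c -> d;
}"""

if __name__ == "__main__":
    edges = dot_edges(HYPOTHETICAL_DOT)
    print(children_by_parent(edges))
```

Chained statements of the form "a -> b -> c" are not handled by this pattern, which is one more reason to prefer a full parser for production use.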

Visualizing output

To visualize a tree using Graphviz (after installing it), you can navigate to the output directory and run the following command:

dot -Tpng CLL003_tree1.dot > CLL003_soln1.png

See the Graphviz documentation for more options.

We provide a script to support visualizing clone mixture proportions using the Timescape R package. This requires the following dependencies:

  • Python 3
  • Python packages: networkx, pandas, and pydot
  • R >= 3.3
  • R package: Timescape (and its dependencies)

First, run the Python script to convert the solution DOT and CSV files to Timescape-formatted files (assuming that python refers to Python 3):

python soln_to_timescape.py outdir/CLL003_soln1.csv outdir/CLL003_tree1.dot CLL003

Then, run the following commands in R to generate the visualization:

library(timescape)
prev <- read.table("CLL003_prev.txt", header=TRUE)
edges <- read.table("CLL003_edges.txt", header=TRUE)
timescape(clonal_prev = prev, tree_edges = edges)

See the Timescape documentation for more options.

Clustering mutations

We recommend clustering mutations by frequency before running CALDER, primarily because we generally expect multiple mutations to distinguish any two clonal expansion events, and therefore any two clones. We recommend using Absence-Aware Clustering, a clustering algorithm that pays particular attention to the distinction between mutation presence and absence. Python scripts are included to convert a CALDER input file to the format required by the clustering software, and to convert the clustering output back to CALDER input format.

Requirements:

The following command converts CALDER-formatted input to clustering input (assuming that python refers to Python 3, otherwise use python3 explicitly):

python calder_to_clustering.py calder_input.txt clustering_input.txt

Then, after running Absence-Aware Clustering, use the following command to apply the cluster assignments to the original data (where cluster_assignments.txt is the output file from the top level of the clustering output directory):

python apply_clustering.py clustering_input.txt cluster_assignments.txt calder_input_clustered.txt

CALDER Command line options

Required
-i,--input <arg>       input file path
-o,--output <arg>      output directory

Additional options
-a,--alpha <arg>       confidence level alpha (default 0.9)
-c,--printconf         print effective confidence level
-d,--details           print detailed values of objective function terms
-e,--enumerate         enumerate all maximal trees instead of just
                    optimal solutions
-g,--print-graph       print ancestry graph
-h,--threshold <arg>   detection threshold h (default 0.01)
-n,--intervals         print confidence intervals
-N,--nonlongitudinal   remove longitudinal constraints
-O,--objective <arg>   objective function (l0, l1, or l0center)
-r,--remove-columns    discard mutations/clusters with abnormally high
                    frequencies
-s,--solutions <arg>   maximum number of optimal solutions to return
                    (default 1)
-st,--timeout <arg>    timeout setting for JavaILP solver
-sv,--verbose <arg>    verbosity setting for JavaILP solver (effect
                    depends on solver)
-t,--time              track and output timing information
-v,--solver <arg>      MILP solver back-end (default gurobi)

With the l0center option, the objective function also includes the L1 norm of the difference between the observed and inferred frequency matrices as a secondary objective.
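To make the L1 term concrete, the sketch below computes the entrywise L1 norm of the difference between two small matrices. It only illustrates the quantity named above; the function name l1_distance and the matrix values are hypothetical, and the sketch says nothing about how CALDER combines this term with its primary objective.

```python
def l1_distance(observed, inferred):
    """Sum of absolute entrywise differences between two equally shaped matrices."""
    if len(observed) != len(inferred) or any(
        len(row_o) != len(row_f) for row_o, row_f in zip(observed, inferred)
    ):
        raise ValueError("matrices must have the same shape")
    return sum(
        abs(o - f)
        for row_o, row_f in zip(observed, inferred)
        for o, f in zip(row_o, row_f)
    )


if __name__ == "__main__":
    # Hypothetical frequencies: rows are samples, columns are mutations or clusters.
    observed = [[0.30, 0.00], [0.30, 0.20]]
    inferred = [[0.30, 0.00], [0.25, 0.20]]
    print(l1_distance(observed, inferred))  # approximately 0.05
```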

ILP solvers

CALDER requires a specialized ILP solver. We recommend the Gurobi optimizer (version 8.0 required), as it is fast, easy to install, and supported on all platforms. The Gurobi website includes instructions for obtaining a license, downloading, and installing.

If you would prefer not to use Gurobi (e.g., if you are part of a non-academic entity and not interested in purchasing a license), we also support the GLPK solver with the GLPK for Java interface, and the lp-solve solver. Of these two, we generally found GLPK to be faster and easier to use on all platforms. For more details, see the specific installation instructions for each solver. Installation tends to be easier on Linux systems than on Mac or Windows systems. Note that you will need to specify the alternate solver using the -v option.

Additional information

For assistance with running CALDER, interpreting the results, or other related questions, please email me (Matt Myers) at this address: mm63@cs.princeton.edu

License

See LICENSE for license information.

Citation

If you use CALDER in your work, please cite the following paper:

Myers, M.A., Satas, G. and Raphael, B.J., 2019. CALDER: Inferring Phylogenetic Trees from Longitudinal Tumor Samples. Cell Systems.
