Reconstruction of clone- and haplotype-specific Cancer Karyotypes
NOTE: this repository contains only the initial version of RCK at the time of its publication.
Further development of RCK (bug fixes, new releases, etc.) is conducted in the https://github.com/aganezov/RCK repository.
RCK is a method for Reconstruction of clone- and haplotype-specific Cancer Karyotypes from tumor mixtures, distributed both as a standalone software package and as a Python library under the MIT licence.
RCK was initially designed and developed by Sergey Aganezov in the group of Prof. Ben Raphael at Princeton University (group site). Development of RCK is currently continued by Sergey Aganezov in the group of Prof. Michael Schatz at Johns Hopkins University (group site).
The algorithm and its application to published cancer datasets are described in full in:
- Algorithm overview
- Input preprocessing
- High-level RCK data processing recipe
- Running RCK
RCK infers clone- and haplotype-specific cancer genome karyotypes from tumor mixtures.
RCK assumes that:
- the reference human genome is diploid (except for sex chromosomes)
- somatic evolution is propagated by large-scale rearrangements (of any type and quantity) that respect the infinite sites assumption (i.e., no genomic location, on either copy of a homologous chromosome, participates more than once in the double-stranded breakages required for a rearrangement throughout the entire somatic evolutionary history of the tumor); this can be relaxed to an extremity-exclusivity constraint if some genomic location is shared among the high-confidence input novel adjacencies
- no novel genomic locations (unless explicitly specified) can play the role of telomeres in the derived chromosomes
- (approximate) clone- and allele-specific fragment/segment copy numbers are inferred by 3rd-party tools and are part of the input (see more in the segments docs)
- (noisy) unlabeled (i.e., without haplotype labels) novel adjacencies (aka structural variants) are inferred by 3rd-party tools and are part of the input (see more in the adjacencies docs)
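The extremity-level reading of the infinite sites assumption can be checked on a set of input adjacencies before inference. The following is a minimal sketch and not RCK's own implementation: the function name `shared_extremities` and the representation of an extremity as a `(chromosome, coordinate, strand)` tuple are assumptions made for illustration only.

```python
from collections import Counter


def shared_extremities(adjacencies):
    """Return extremities that take part in more than one novel adjacency.

    Hypothetical representation: an adjacency is a pair of extremities,
    each extremity being a (chromosome, coordinate, strand) tuple.
    """
    counts = Counter()
    for left, right in adjacencies:
        # A set is used so that an adjacency joining an extremity to itself
        # is counted once for that extremity.
        for extremity in {left, right}:
            counts[extremity] += 1
    return sorted(ext for ext, n in counts.items() if n > 1)


adjacencies = [
    (("1", 1000, "+"), ("2", 5000, "-")),
    (("1", 1000, "+"), ("3", 700, "+")),  # reuses ("1", 1000, "+")
    (("4", 42, "-"), ("5", 99, "+")),
]
print(shared_extremities(adjacencies))  # [('1', 1000, '+')]
```

An empty result means every extremity is used by at most one novel adjacency; a non-empty result lists the locations for which the assumption would have to be relaxed.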
RCK uses a Diploid Interval Adjacency Graph (DIAG) to represent all possible segments and transitions between them (across all clones and the reference). RCK then solves the optimization problem of inferring clone- and haplotype-specific karyotypes (i.e., finding clone-specific edge multiplicity functions in the constructed DIAG) as a mixed-integer linear program (MILP). Several constraints are taken into consideration during the inference (some of which are listed below):
- infinite sites compliance (across all clones in the tumor)
- adjacencies grouping (is part of the input, optional)
- false positive rate for the presence of novel adjacencies in the reconstructed karyotypes
- maximum divergence from input (approximate) allele-specific segment/fragment copy number profile
- preservation of allele-separation across clones in the tumor
- telomere locations
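To make the graph structure concrete, the following is a small sketch of how a diploid interval adjacency graph could be assembled from segments and unlabeled novel adjacencies. It is an illustration under stated assumptions, not RCK's data structure: the function `build_diag`, the vertex encoding `(segment index, side, haplotype)` with sides `"t"` (tail) and `"h"` (head), and the haplotype names `A` and `B` are all hypothetical.

```python
from itertools import product

HAPLOTYPES = ("A", "B")  # two homologous copies in the diploid reference


def build_diag(segments, novel_adjacencies):
    """Build edge lists of a toy diploid interval adjacency graph.

    segments: ordered list of (chromosome, start, end) tuples.
    novel_adjacencies: list of ((seg_idx, side), (seg_idx, side)) pairs,
        where side is "t" (tail) or "h" (head); haplotypes are unknown.
    """
    segment_edges, reference_edges, novel_edges = [], [], []
    # Each segment appears once per haplotype, as an edge tail -> head.
    for idx in range(len(segments)):
        for hap in HAPLOTYPES:
            segment_edges.append(((idx, "t", hap), (idx, "h", hap)))
    # Reference adjacencies join consecutive segments on the same
    # chromosome and the same haplotype.
    for idx in range(len(segments) - 1):
        if segments[idx][0] == segments[idx + 1][0]:
            for hap in HAPLOTYPES:
                reference_edges.append(((idx, "h", hap), (idx + 1, "t", hap)))
    # An unlabeled novel adjacency yields one candidate edge per
    # haplotype labeling of its two extremities (AA, AB, BA, BB).
    for (seg1, side1), (seg2, side2) in novel_adjacencies:
        for hap1, hap2 in product(HAPLOTYPES, repeat=2):
            novel_edges.append(((seg1, side1, hap1), (seg2, side2, hap2)))
    return {"segment": segment_edges, "reference": reference_edges, "novel": novel_edges}


segments = [("1", 1, 100), ("1", 101, 200), ("2", 1, 50)]
novel = [((0, "h"), (2, "t"))]
diag = build_diag(segments, novel)
print({kind: len(edges) for kind, edges in diag.items()})
# {'segment': 6, 'reference': 2, 'novel': 4}
```

In this picture a clone-specific karyotype corresponds to a mapping from each edge to a non-negative integer multiplicity, and the constraints listed above restrict which mappings are admissible.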
We note that, in contrast to some other cancer karyotype inference methods, the RCK model has several advantages, all of which work in a unifying computational framework and some or all of which differentiate RCK from other methods:
- any level of sample heterogeneity (on the karyotype level): from homogeneous samples with a single derived clone, to tumor samples comprised of
- support for any type of novel adjacencies signature (SV types), including copy-number-neutral ones, as well as the complicated ones arising from chromoplexy/chromothripsis events
- model of diploid reference/non-haploid derived genomes
- explicit control over telomere location during the inference
- explicit fine-grained control over false positives among the input novel adjacencies and, respectively, their utilization in the inference
- haplotype-specific (aka phased) inference both for segments and adjacencies across all clones in the tumor sample
- support for (optional) 3rd-generation sequencing additional information
RCK should work on recent versions of macOS and on the main Linux distributions. RCK is implemented in Python and designed to work with Python 3.7+. We highly recommend creating an independent Python virtual environment for RCK usage.
RCK itself can be installed in three different ways:
RCK requires an ILP solver installed on the system, as well as python bindings for it. Currently only Gurobi ILP solver is supported.
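Before installing, it can help to confirm that the interpreter version and the Gurobi Python bindings (the `gurobipy` module) are available in the active environment. The sketch below is a hypothetical helper, not part of RCK; the function name `check_environment` and its messages are assumptions.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 7), solver_module="gurobipy"):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    if sys.version_info < min_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ is required, found {found}")
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(solver_module) is None:
        problems.append(f"Python module '{solver_module}' is not importable")
    return problems


for problem in check_environment():
    print(problem)
```

This only checks that the bindings can be located; it does not verify that a valid Gurobi licence is installed.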
For more details about installation please refer to the installation documentation.
The minimum input for RCK is comprised of two parts:
- Unlabeled novel adjacencies (aka structural variations in the tumor sample)
- Clone- and allele-specific segment copy numbers
Additional input can contain:
- Additional telomere locations
- Segment-, clone-, and allele-specific boundaries (both lower and upper) on inferred copy numbers
- Grouping information about clone-specific novel adjacencies (usually informed by 3rd-generation sequencing data), with individual False Positive rates per each group
- False Positive rates for any subgroup of input novel adjacencies.
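One plausible reading of a per-group False Positive rate is as a bound on how many adjacencies of that group may be left out of the reconstruction. The sketch below computes that bound; the interpretation itself, as well as the function name `min_present`, are assumptions made for illustration and should be checked against the adjacencies documentation.

```python
from fractions import Fraction
from math import ceil


def min_present(group_size, fp_rate):
    """Smallest number of adjacencies of a group that must be kept if at
    most a fraction fp_rate of them may be false positives (assumed reading).
    """
    if not 0 <= fp_rate <= 1:
        raise ValueError("fp_rate must be between 0 and 1")
    # Fractions avoid floating-point surprises such as 0.29 * 100 != 29.
    keep_fraction = 1 - Fraction(str(fp_rate))
    return ceil(keep_fraction * group_size)


print(min_present(10, 0.1))  # 9
print(min_present(3, 0.5))   # 2
```

Under this reading, a rate of 0 forces every adjacency of the group to be used, while a rate of 1 places no requirement on the group.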
RCK expects the input data to be in a (C/T)SV (Comma/Tab Separated Values) format. We provide a set of utility tools to convert the output of many state-of-the-art methods into the RCK-suitable format.
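Files in a comma- or tab-separated format can be inspected with the Python standard library alone. The following sketch reads such a table into a list of dictionaries; the helper `read_table` and the column names in the sample (`aid`, `chr1`, `coord1`, and so on) are assumptions for illustration, and the authoritative column layout is given in the adjacencies and segments docs.

```python
import csv
import io


def read_table(handle, delimiter=None):
    """Read a comma- or tab-separated table into a list of dicts.

    If no delimiter is given, a tab is used when the header line
    contains one, and a comma otherwise.
    """
    text = handle.read()
    if delimiter is None:
        lines = text.splitlines()
        header = lines[0] if lines else ""
        delimiter = "\t" if "\t" in header else ","
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Hypothetical sample content; column names are illustrative only.
sample = (
    "aid\tchr1\tcoord1\tstrand1\tchr2\tcoord2\tstrand2\n"
    "a1\t1\t1000\t+\t2\t5000\t-\n"
)
rows = read_table(io.StringIO(sample))
print(rows[0]["chr1"], rows[0]["coord2"])  # 1 5000
```

All values are returned as strings, so coordinates need an explicit `int(...)` conversion before any arithmetic.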
Obtaining unlabeled (i.e., without allele-information) novel adjacencies (aka Structural Variants) is not a part of the RCK workflow, as there exist a lot of tools for obtaining those.
We provide the rck-adj-x2rck utility to convert the output of SV detection tools to the RCK-suitable format.
We currently support converting the output of the following 3rd-party SV detection tools:
- linked/barcode reads
- long reads
For more information about adjacencies, formats, converting, reciprocality, etc., please refer to the adjacencies documentation.
Segment copy numbers
Obtaining clone- and allele-specific segment copy numbers is not a part of the RCK workflow, as there exist a lot of tools for obtaining those.
We provide the rck-scnt-x2rck utility to convert the output of other tools that infer clone- and allele-specific segment copy numbers to the RCK-suitable format.
We currently support converting the output of the following 3rd-party tools:
- HATCHet [paper | code] (recommended as it has the fewest limitations w.r.t. tumor heterogeneity)
- TitanCNA [paper | code]
- Battenberg [paper | code]
- ReMixT [paper | code]
- Ginkgo [paper | code] (Attention! haploid mode only)
RCK data processing recipe
In most cases the cancer sample of interest is initially represented by a set cancer.sr.fastq of reads obtained from a sequencer. Additionally, a set normal.sr.fastq of sequenced reads from a matching normal sample needs to be available.
The most common analysis setup has standard Illumina paired-end sequenced reads for both the tumor and the matching normal. Increasingly, 3rd-generation sequencing technologies are being utilized in cancer analysis. Let us assume that there may optionally be a set cancer.lr.fastq of reads for the cancer sample in question obtained via a 3rd-generation sequencing technology.
- Align sequenced reads cancer.sr.fastq and normal.sr.fastq (with your aligner of choice) for the cancer and matching normal samples to obtain the corresponding alignments
- Optionally align sequenced long reads
- Run a tool of your choosing on cancer.sr.fastq to obtain a novel adjacencies VCF file
- Convert novel adjacencies from the VCF file to the RCK input format via rck-adj-x2rck x cancer.vcf -o input.rck.adj.tsv, where x stands for the novel adjacency inference tool. Please see the adjacencies docs for the list of supported tools and more detailed instructions on comparison.
- Run any of the supported tools of choice (HATCHet, TitanCNA, Battenberg, ReMixT) to infer large-scale clone- and allele-specific fragment copy numbers, CN.data (generic name of the tool-specific result)
- Convert the tool-specific copy-number data to the RCK input format via rck-scnt-x2rck x CN-data -o input.rck.scnt.tsv, where x stands for the copy number inference tool. Please see the segments docs for links to specific methods, as well as details on how to run the conversion.
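The two conversion steps of the recipe can be scripted. The sketch below only assembles the command lines quoted above and prints them, without executing anything; the helper `conversion_commands` is hypothetical, and `x` remains a placeholder for the tool name, exactly as in the recipe.

```python
import shlex


def conversion_commands(sv_tool, sv_vcf, cn_tool, cn_data,
                        adj_out="input.rck.adj.tsv",
                        scnt_out="input.rck.scnt.tsv"):
    """Return argument lists for the adjacency and copy-number conversions."""
    adj_cmd = ["rck-adj-x2rck", sv_tool, sv_vcf, "-o", adj_out]
    scnt_cmd = ["rck-scnt-x2rck", cn_tool, cn_data, "-o", scnt_out]
    return [adj_cmd, scnt_cmd]


# "x" is the placeholder used in the recipe for the tool-specific name.
commands = conversion_commands("x", "cancer.vcf", "x", "CN-data")
for cmd in commands:
    # shlex.quote keeps the printed line safe to paste into a shell.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping each command as a list of arguments (rather than one string) means it can later be passed to `subprocess.run` without shell quoting issues.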
We provide the rck tool to run the main RCK algorithm for clone- and haplotype-specific cancer karyotype reconstruction.
With the minimum input, RCK can be run as follows:
rck --scnt input.rck.scnt.tsv --adjacencies input.rck.adj.tsv
--scnt corresponds to the clone- and allele-specific segment copy number input
--adjacencies corresponds to the unlabeled novel adjacencies input
Additionally, one can specify the --workdir working directory, where the input, preprocessing, and the output will be stored.
For more on the rck command usage, please refer to the usage documentation.
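For scripted runs, the rck invocation can be wrapped in a small Python helper. This is a sketch under the assumption that rck is installed and on PATH; it uses only the --scnt, --adjacencies, and --workdir options named above, and the function names `rck_command` and `run_rck` are hypothetical.

```python
import shutil
import subprocess


def rck_command(scnt, adjacencies, workdir=None):
    """Build the argument list for an rck run."""
    cmd = ["rck", "--scnt", scnt, "--adjacencies", adjacencies]
    if workdir is not None:
        cmd += ["--workdir", workdir]
    return cmd


def run_rck(scnt, adjacencies, workdir=None):
    """Run rck and raise if it is missing or exits with an error."""
    if shutil.which("rck") is None:
        raise RuntimeError("rck executable not found on PATH")
    # check=True turns a non-zero exit status into CalledProcessError.
    return subprocess.run(rck_command(scnt, adjacencies, workdir), check=True)


print(rck_command("input.rck.scnt.tsv", "input.rck.adj.tsv", workdir="rck_run"))
```

Only `rck_command` is exercised here; `run_rck` starts the actual inference and therefore needs a working RCK and Gurobi installation.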
Here is the description of the results produced by the rck main method for cancer karyotype reconstruction.
For results on segment/adjacency conversion/processing, please refer to the respective segment/adjacency documentation.
RCK's cancer karyotype reconstruction is stored in the
output subdirectory in the working directory (the
The following two files depict the inferred clone- and haplotype-specific karyotypes:
rck.scnt.tsv - clone- and haplotype-specific segment copy numbers;
rck.acnt.tsv - clone- and haplotype-specific adjacency copy numbers;
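After a run, a script can locate the two result files from the working directory. The sketch below relies only on the layout described above (an output subdirectory containing rck.scnt.tsv and rck.acnt.tsv); the helper names `result_files` and `missing_results`, and the example directory name, are hypothetical.

```python
from pathlib import Path


def result_files(workdir):
    """Map a short name to the expected path of each karyotype file."""
    output_dir = Path(workdir) / "output"
    return {
        "segments": output_dir / "rck.scnt.tsv",
        "adjacencies": output_dir / "rck.acnt.tsv",
    }


def missing_results(workdir):
    """Return the names of expected result files that do not exist."""
    return [name for name, path in result_files(workdir).items()
            if not path.is_file()]


# "rck_run" is an illustrative working directory name.
print(result_files("rck_run")["segments"])
```

An empty list from `missing_results` indicates that both karyotype files were produced.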
When using RCK's cancer karyotype reconstruction algorithm or any of RCK's utilities, please cite the following paper:
If you experience any issues with RCK installation, usage, or results, or want to see RCK enhanced in any way, shape, or form, please create an issue on the RCK issue tracker. Please make sure to specify the versions of RCK, Python, and Gurobi in question and, if possible, provide (minimized) data on which the issue(s) occur(s).
If you want to discuss any avenues of collaboration, custom RCK applications, etc, please contact Sergey Aganezov at aganezov(at)jhu.edu or sergeyaganezovjr(at)gmail.com