No description, website, or topics provided.
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
LICENSE
Makefile
README.md
buildGraph.cpp
buildGraph_README.txt
getHTMatrixInversion.cpp
getHTMatrixInversion_README.txt
hitndrive.cpp
hitndrive_README.txt

README.md

HIT'nDRIVE

HIT’nDRIVE (Shrestha et al. 2017), is a combinatorial algorithm to prioritize cancer driver genes. It integrates changes in genome (sequence altered genes) with changes in transcriptome (gene expression outliers) to identify patient-specific genomic alterations that can collectively influence the dysregulated transcriptome of the patient.

HIT’nDRIVE aims to solve the “random-walk facility location” (RWFL) problem on a gene/protein interaction network – thus differs from the standard facility location problem by its use of “hitting time”, the expected minimum number of hops in a random-walk originating from any sequence altered gene (i.e. a potential driver) to reach an expression altered gene, as the distance measure. HIT’nDRIVE reduces RWFL (with multi-hitting time as the distance) to a weighted multi-set cover problem, which it solves as an integer linear program (ILP).

Publications

  • Shrestha R, Hodzic E, Sauerwald T, Dao P, Yeung J, Wang K, Anderson S, Haffari G, Collins CC, and Sahinalp SC. 2017. HIT’nDRIVE: Patient-Specific Multi-Driver Gene Prioritization for Precision Oncology. Genome Research. doi:10.1101/gr.221218.117 (http://dx.doi.org/10.1101/gr.221218.117)
  • Shrestha R, Hodzic E, Yeung J, Wang K, Sauerwald T, Dao P, Anderson S, Beltran H, Rubin MA, Collins CC, Haffari G and Sahinalp SC. 2014. HIT’nDRIVE: Multi-driver gene prioritization based on hitting time. Research in Computational Molecular Biology: 18th Annual International Conference, RECOMB 2014, Pittsburgh, PA, USA, April 2-5, 2014, 293–306. (https://link.springer.com/chapter/10.1007/978-3-319-05269-4_23)

Setup

System Requirements

  • make (version 3.81 or higher)
  • g++ (GCC version 4.1.2 or higher)
  • IBM ILOG CPLEX Optimization Studio

Installation

To install HIT'nDRIVE, clone the repo using following command

git clone git@github.com:sfu-compbio/hitndrive.git

Compile HIT'nDRIVE

In the Makefile, set CPLEXDIR to the path of your root CPLEX folder. Set CPLEX_BUILD to the name of your build - the build can be identified from root folder in the following manner:

${CPLEXDIR}/cplex/bin/${CPLEX_BUILD}/

Simply run make command in the src folder. It will create executables in the src folder.

Run HIT'nDRIVE - a demo script

inputFolder = "insertPath"
workingFolder = "insertPath"
network = "ppi network file path"

./buildGraph -i ${inputFolder}/${network} -f ${workingFolder} -o ppi
./getHTMatrixInversion -i ${workingFolder}/ppi.graph -o ppi -f ${workingFolder}
./hitndrive -a ${inputFolder}/alterations.txt -o ${inputFolder}/outliers.txt -g ${workingFolder}/ppi.nodes -i ${workingFolder}/ppi.ht -f ${workingFolder} -n hitndriveOutput -l 0.9 -b 0.4 -m 0.8
  • The list of drivers will be in the ${workingFolder}/hitndriveOutput.drivers file.

Step-1: buildGraph

Usage:

./buildGraph -i [input edge collection] -f [output folder] -o [output graph file name]
Parameters Description
-i input edge collection
-o output graph file name
-f output folder (optional)

- i :    This parameter represents an edge collection file where each row represents an edge in form of two vertex names followed by a weight, separated by whitespace. All edges are treated as undirected. There is no header row. e.g.

	V1 V2 1
	V5 V7 1
	...
	Vn Vk 1

- o :    This parameter determines the name and path of two output files (with exptensions .graph and .nodes). The first is the path to output file in which network information is stored as graph structure (.graph). First row contains number of vertices and directed edges. The following lines contain, for each vertex, the size of its neighbourhood followed by a pairs of numbers representing index of each neighbor and weight of the edge. e.g.

	10971 428596
	1 1 1
	1 0 1
	1 3 1
	1 2 1
	113 5 1 
	...

The second is the path to file which contains node names listed in the order they are discovered in the input file (.nodes). e.g.:

    	node_1
	node_2
	...

- f :    Path to output folder the filename path is relative to. The default value is . (current folder).

Things To Note:

  • Input edge collection file is checked for duplicate edges, which are discarded. Self-loops are also discarded.
  • The output file does not contain any vertex labels, only their index in the .nodes file.
  • Both .graph and .nodes files are stored without headers to be used for hitting-time calculations.

Step-2: getHTMatrixInversion

  • Inverts matrix A of size n x n and stores the result in B. It is based on gaussian elimination (elementary row transformations).
  • Returns 1 if the matrix is singular and no inverse exists. Otherwise it returns 0.

Usage:

./getHTMatrixInversion -i [input graph file] -o [output hitting times matrix file] -f [output folder]
Parameters Description
-i input graph file
-o output hitting times matrix file
-f output folder (optional)

- i :    This file needs to be in same format as .graph output of buildGraph binary
- o :    Matrix-structured output file in which pair-wise hitting times are stored. The extension assigned is .ht. All rows and columns represent vertices of the graph. Vertex names are not included (no header row/column) as the nodes are represented in the same order as in the .nodes file.
- f :    Path to output folder the filename path is relative to. The default value is . (current folder).

Things To Note:

  • The graph should be such that hitting times are calculable. If that is not the case, matrix inversion will fail due to singularity.
  • It is okay if the graph has multiple connected components.

Step-3: hitndrive

Usage:

./hitndrive -a [alterations file] -o [outlier file] -g [gene names file] -i [influence matrix] -f [output folder] -n [output filename] -l [alpha] -b [beta] -m [gamma]
Parameters Description
-a alterations file
-o outlier file
-g gene names file
-i influence matrix
-f output folder (optional)
-n output filename
-l alpha; (optional) default value is 1
-b beta
-m gamma

- a :    File containing list of sample IDs and name of the corresponding aberrant genes. Format is following:

	SampleID GeneName
	ID_1 Gene_1
	...
	ID_i Gene_j
	...

- o :    File containing list of sample IDs and names of the corresponding expression-outlier gene. Weights to the corresponding outlier gene should be in the third column. Set weights to 1 to obtain unweighted version. Format is following:

	SampleID GeneName Weight
	ID_1 Gene_1 Weight_1
	...
	ID_i Gene_j Weight_ij

- g :    File containing names of genes in the influence matrix. It is output of buildGraph binary
- i :    Output of getHTMatrixInversion binary
- f :    Path to output folder the filename path is relative to. The default value is .
- n :    Output filename for the .lp file. This file represents a CPLEX-format linear program whose solution gives influential driver genes given the input network, hitting-times, alteration and expression-outlier data.
- l :    Real number from interval [0, 1] representing fraction of expression-outliers to be covered in global. Tweak this number down if you are getting too many drivers and want to keep "problematic" expression-outliers aside.
- b :    Real number from interval [0, 1] representing fraction of top-weighted outliers to be covered per patient. This parameters ensures that the most important expression-outliers are explained by the chosen alteration events. Set to 0 if you wish all expression-outliers to be considered equally.
- m :    Real number from interval [0, 1] representing percentage of total incoming influence into an expression-outlier to be satisfied. This number is inversely proportional to the multi-hitting time distance from the selected set of drivers towards individual expression-outliers. Decrease this number to allow greater distances.

Things To Note:

  • Alterations and expression-outliers files contain header rows, which are discarded during input. If your file does not contain a header, then first row of data will get ignored.
  • Running the program in unweighted mode and with beta=0 (in case of uncertainty about importance of expression-outliers) will significantly increase the running time.