HaploRecstr: Haplotype Reconstruction Using Hidden Markov Technique
HaploRecstr is a C++ haplotype reconstruction program. Method used is based on the algorithm introduced by Rastas et al.^[Rastas, P., Koivisto, M. et al. (2005). Algorithms in Bioinformatics: 5th International Workshop. Berlin, Heidelberg: Springer. 145-151] Genotype data from a sample population is used as input.
More specifically, the program uses a Hidden Markov Model (HMM) to construct the data. By default, the model is initialized by going through the data to select the major alleles and assigning parameter values, then EM algorithm is used to optimize the likelihood function, and then Viterbi is used to reconstruct the haplotype data.
The program takes genotype as input. Outputs of the program include: 1) a set of reconstructed haplotypes; 2) a summary of the frequencies of all possible haplotypes in the population (sorted in descending order).
The program takes an input dataset of genotypes. It should have
General requirements about the format:
- Dataset should not include headers
- Each row of the data represent one observed genotype
- Rows starting with
#
will be ignored when reading in data - Entries in the dataset should be integers from
$0$ to$K-1$ , where$K$ is the number of states/alleles for each SNP, and is set by the user - Entries can be separated by
,
,tab
orspace
- No missing data is allowed in the dataset
The example input dataset (genotype.txt
) looks like this:
Five observations are represented in the genotype.txt
dataset. Columns 1 and 2 code for the observed alleles at SNP1, columns 3 and 4 code for the observed alleles at SNP2, and so on.
The program can take in user defined parameter data as the initial value to be used in the program. A total of three kinds of parameters is needed:
-
$\pi$ : a$1 \times K$ vector where$\pi_{i}$ indicates the initial transmission probabilities from state$S_0$ to$S_{1i}$ ,$i=1,2,...,K$ . A sample file (pis.txt
) is included in theexample
folder
-
$\tau$ : a$(M-1) \times K^2$ matrix where$\tau_{t(i \times K +j)}$ indicates the transmission probability from state$S_{ti}$ at the$t^{th}$ SNP to$S_{(t+1)j}$ at the$(t+1)^{th}$ SNP,$t=1,2,...,M-1$ ,$i,j=1,2,...,K$ . A sample file (trans.txt
) is included in theexample
folder
-
$\epsilon$ : a$M \times K^2$ matrix where$\epsilon_{t(i \times K +j)}$ indicates the emission probability from state$S_{ti}$ of allele$A_{tj}$ at the$t^th$ SNP,$t=1,2,...,M$ ,$i,j=1,2,...,K$ . A sample file (emis.txt
) is included in theexample
folder
The first of the two output datasets generated by the program, the reconstructed haplotype data, is of size
A total of ten haplotypes are shown in the dataset in out1.txt
, corresponding to the five observed genotypes in the example input data. The first row corresponds to the identity number of observations. The following rows represent the alleles at subsequent SNPs of each haplotype. Each observation will have two rows representing the reconstructed haplotypes (i.e., reconstructed haplotypes for observation 1 is {1,4,4,3,0} and {1,2,1,4,3} shown at rows 1 and 2; reconstructed haplotypes for observation 2 is {0,1,2,4,3} and {1,3,3,3,1} shown at rows 3 and 4, and so on).
The second of the two output datasets generated by the program, a summary of unique reconstructed haplotypes in the population, is of size
The dataset in out2.txt
shows the five most common haplotypes in the example population. Each row represents one haplotype, and the last column is the frequency of the haplotype in the sample population in descending order.
A sample command to unpack program on a computer:
% tar -zxvf hapRecstr.tar.gz
Boost Library should be included when compiling. You can use a compile command like this:
% g++ -I path/to/boost hapRecstr.cpp -o hapRecstr
The program can take a total of 7 arguments while the first four are required.
Structure of the command look like this:
% ./hapRecstr arg1 arg2 arg3 arg4 arg5 arg6 arg7
The meanings of the parameters:
arg1
: the path and name of the input dataarg2
: the path and name of the file to store output data 1arg3
: the path and name of the file to store output data 2arg4
: number of states/alleles for each SNP. Note that this should match the entry values in the input data (in the example case, the entries in the input data should be integers from 0 to 4)arg5
(optional): the path and name of the file containing emission probability matrix. The emission probability is the probability of emitting alleles from states. Format requirement see herearg6
(optional): the path and name of the file containing transmission probability matrix, which is the probability of transitting from current state to other states on the subsequent SNP. Format requirement see herearg7
(optional): the path and name of the file containing initial transmission probability vector. Format requirement see here
The example dataset, genotype.txt
, in the example
folder contains 50 genotype observations. To reconstruct haplotypes from this data using default parameters, use the command below:
% ./hapRecstr example/genotype.txt example/out1.txt example/out2.txt 5
Note that the value of arg4
is 5, indicating that in this example case, the entries in the input data should be integers from 0 to 4.
The parameters of the HMM model can be set by either taking the user-specified data from arg5
, arg6
and arg7
, or by default initialization within the program (more details on the initialization method see here).
For the latter case (i.e., when arg5
, arg6
and arg7
are not specified), there will be a prompt for inputting values of
Set ro, nu, eta for parameter initialization (type 0 to use default value ro=0.1, nu=0.01, eta=0.8):
Numbers larger than 0 and less than 1 should be typed in to specify the values. Alternatively, if 0
is typed in, the program will use the default value.
After running the commandline in the previous step, there will be a prompt for convergence threshold like this:
Model initialized
Set convergence threshold (type 0 to use default value 1e-6):
A number larger than 0 and less than 1 should be typed in to specify the threshold. Alternatively, if 0
is typed in, the program will use the default threshold value of 1e-6.
The program will also prompt for the maximum number of EM loops, which is the maximum number of times that the program will go through the EM algorithm if the convergence threshold is not met:
Set maximum EM loops (type 0 to use default value 25):
A positive integer should be typed in to specify the threshold. Alternatively, if 0
is typed in, the program will use the default value of 25.
The program will store the outputs into the pre-specified files. A successful run of the program on the example data has the following lines shown in the end:
Output 1 saved to example/out1.txt
Output 2 saved to example/out2.txt
The theoretical time complexity of the program is
genotype_Daly
in the example
folder simulates the benchmark dataset from Daly et al.^[Daly, M.J., Rioux, J.D., Schaffner, S.F., et al.: High-resolution haplotype structure
in the human genome. Nature Genetics 29 (2001) 229–232] with 129 genotypes and 103 SNP markers.
The running time for the simulated data is 94 seconds, calculated using default values of convergence and max EM loops with the highest likelihood HMM.
-
$K$ : number of possible alleles per loci -
$N$ : number of genotypes -
$M$ : number of SNPs -
$S_{ti}$ : state$i$ at the$t^{th}$ SNP -
$\pi$ : a$1 \times K$ vector of initial transmission probabilities from state$S_0$ to$S_{1i}$ ,$i=1,2,...,K$ -
$\tau$ : a$(M-1) \times K^2$ matrix where$\tau_{t(i \times K +j)}$ indicates the transmission probability from state$S_{ti}$ at the$t^{th}$ SNP to$S_{(t+1)j}$ at the$(t+1)^{th}$ SNP,$t=1,2,...,M-1$ ,$i,j=1,2,...,K$ -
$\epsilon$ : a$M \times K^2$ matrix where$\epsilon_{t(i \times K +j)}$ indicates the emission probability from state$S_{ti}$ of allele$A_{tj}$ at the$t^th$ SNP,$t=1,2,...,M$ ,$i,j=1,2,...,K$ -
$\rho$ ,$\nu$ and$\eta$ : constant values for default initialization of HMM model
The initial values of
The probabilities are then perturbed by a random number and normalized to get their finalized initial values.