HaploRecstr: Haplotype Reconstruction Using Hidden Markov Technique

Chen Liang, Ziwei Zhu, Yongwen Zhuang

2016-12-16

HaploRecstr is a C++ haplotype reconstruction program. Method used is based on the algorithm introduced by Rastas et al.^[Rastas, P., Koivisto, M. et al. (2005). Algorithms in Bioinformatics: 5th International Workshop. Berlin, Heidelberg: Springer. 145-151] Genotype data from a sample population is used as input.
More specifically, the program uses a Hidden Markov Model (HMM) to construct the data. By default, the model is initialized by going through the data to select the major alleles and assigning parameter values, then EM algorithm is used to optimize the likelihood function, and then Viterbi is used to reconstruct the haplotype data.
The program takes genotype as input. Outputs of the program include: 1) a set of reconstructed haplotypes; 2) a summary of the frequencies of all possible haplotypes in the population (sorted in descending order).

Data Format

Input Data: Genotypes

The program takes an input dataset of genotypes. It should have $N$ rows and $2M$ columns, where $N$ and $M$ are the numbers of individuals and SNPs contained in the dataset respectively.

General requirements about the format:

Dataset should not include headers
Each row of the data represent one observed genotype
Rows starting with # will be ignored when reading in data
Entries in the dataset should be integers from $0$ to $K-1$, where $K$ is the number of states/alleles for each SNP, and is set by the user
Entries can be separated by ,, tab or space
No missing data is allowed in the dataset

The example input dataset (genotype.txt) looks like this:

Five observations are represented in the genotype.txt dataset. Columns 1 and 2 code for the observed alleles at SNP1, columns 3 and 4 code for the observed alleles at SNP2, and so on.

Parameter Data

The program can take in user defined parameter data as the initial value to be used in the program. A total of three kinds of parameters is needed:

$\pi$: a $1 \times K$ vector where $\pi_{i}$ indicates the initial transmission probabilities from state $S_0$ to $S_{1i}$, $i=1,2,...,K$. A sample file (pis.txt) is included in the example folder

$\tau$: a $(M-1) \times K^2$ matrix where $\tau_{t(i \times K +j)}$ indicates the transmission probability from state $S_{ti}$ at the $t^{th}$ SNP to $S_{(t+1)j}$ at the $(t+1)^{th}$ SNP, $t=1,2,...,M-1$, $i,j=1,2,...,K$. A sample file (trans.txt) is included in the example folder

$\epsilon$: a $M \times K^2$ matrix where $\epsilon_{t(i \times K +j)}$ indicates the emission probability from state $S_{ti}$ of allele $A_{tj}$ at the $t^th$ SNP, $t=1,2,...,M$, $i,j=1,2,...,K$. A sample file (emis.txt) is included in the example folder

Output Data 1: Haplotypes

The first of the two output datasets generated by the program, the reconstructed haplotype data, is of size $2N \times M$.

A total of ten haplotypes are shown in the dataset in out1.txt, corresponding to the five observed genotypes in the example input data. The first row corresponds to the identity number of observations. The following rows represent the alleles at subsequent SNPs of each haplotype. Each observation will have two rows representing the reconstructed haplotypes (i.e., reconstructed haplotypes for observation 1 is {1,4,4,3,0} and {1,2,1,4,3} shown at rows 1 and 2; reconstructed haplotypes for observation 2 is {0,1,2,4,3} and {1,3,3,3,1} shown at rows 3 and 4, and so on).

Output Data 2: Summary

The second of the two output datasets generated by the program, a summary of unique reconstructed haplotypes in the population, is of size $L \times (M+1)$, where $L$ is the number of unique haplotypes in the dataset.

The dataset in out2.txt shows the five most common haplotypes in the example population. Each row represents one haplotype, and the last column is the frequency of the haplotype in the sample population in descending order.

back to top

How to Use HapRecstr

Download and Unpack

A sample command to unpack program on a computer:

 % tar -zxvf hapRecstr.tar.gz

Compile

Boost Library should be included when compiling. You can use a compile command like this:

 % g++ -I path/to/boost hapRecstr.cpp -o hapRecstr

Run the Program

The program can take a total of 7 arguments while the first four are required.
Structure of the command look like this:

% ./hapRecstr arg1 arg2 arg3 arg4 arg5 arg6 arg7

The meanings of the parameters:

arg1: the path and name of the input data
arg2: the path and name of the file to store output data 1
arg3: the path and name of the file to store output data 2
arg4: number of states/alleles for each SNP. Note that this should match the entry values in the input data (in the example case, the entries in the input data should be integers from 0 to 4)
arg5 (optional): the path and name of the file containing emission probability matrix. The emission probability is the probability of emitting alleles from states. Format requirement see here
arg6 (optional): the path and name of the file containing transmission probability matrix, which is the probability of transitting from current state to other states on the subsequent SNP. Format requirement see here
arg7 (optional): the path and name of the file containing initial transmission probability vector. Format requirement see here

The example dataset, genotype.txt, in the example folder contains 50 genotype observations. To reconstruct haplotypes from this data using default parameters, use the command below:

 % ./hapRecstr example/genotype.txt example/out1.txt example/out2.txt 5

Note that the value of arg4 is 5, indicating that in this example case, the entries in the input data should be integers from 0 to 4.

Set Parameters of HMM Model

The parameters of the HMM model can be set by either taking the user-specified data from arg5, arg6 and arg7, or by default initialization within the program (more details on the initialization method see here).
For the latter case (i.e., when arg5, arg6 and arg7 are not specified), there will be a prompt for inputting values of $\rho$,$\nu$ and $\eta$:

 Set ro, nu, eta for parameter initialization (type 0 to use default value ro=0.1, nu=0.01, eta=0.8):

Numbers larger than 0 and less than 1 should be typed in to specify the values. Alternatively, if 0 is typed in, the program will use the default value.

Set Convergence Threshold

After running the commandline in the previous step, there will be a prompt for convergence threshold like this:

 Model initialized
  Set convergence threshold (type 0 to use default value 1e-6):

A number larger than 0 and less than 1 should be typed in to specify the threshold. Alternatively, if 0 is typed in, the program will use the default threshold value of 1e-6.

Set Maximum Number of EM Loops

The program will also prompt for the maximum number of EM loops, which is the maximum number of times that the program will go through the EM algorithm if the convergence threshold is not met:

  Set maximum EM loops (type 0 to use default value 25):

A positive integer should be typed in to specify the threshold. Alternatively, if 0 is typed in, the program will use the default value of 25.

Get the Results

The program will store the outputs into the pre-specified files. A successful run of the program on the example data has the following lines shown in the end:

 Output 1 saved to example/out1.txt
 Output 2 saved to example/out2.txt

back to top

Time Complexity and Space Requirement

The theoretical time complexity of the program is $O(NMK^3)$ while the space requirement is $max(O(MN), O(MK^2))$.
genotype_Daly in the example folder simulates the benchmark dataset from Daly et al.^[Daly, M.J., Rioux, J.D., Schaffner, S.F., et al.: High-resolution haplotype structure in the human genome. Nature Genetics 29 (2001) 229–232] with 129 genotypes and 103 SNP markers.
The running time for the simulated data is 94 seconds, calculated using default values of convergence and max EM loops with the highest likelihood HMM.

back to top

Appendix

Notations

$K$: number of possible alleles per loci
$N$: number of genotypes
$M$: number of SNPs
$S_{ti}$: state $i$ at the $t^{th}$ SNP
$\pi$: a $1 \times K$ vector of initial transmission probabilities from state $S_0$ to $S_{1i}$, $i=1,2,...,K$
$\tau$: a $(M-1) \times K^2$ matrix where $\tau_{t(i \times K +j)}$ indicates the transmission probability from state $S_{ti}$ at the $t^{th}$ SNP to $S_{(t+1)j}$ at the $(t+1)^{th}$ SNP, $t=1,2,...,M-1$, $i,j=1,2,...,K$
$\epsilon$: a $M \times K^2$ matrix where $\epsilon_{t(i \times K +j)}$ indicates the emission probability from state $S_{ti}$ of allele $A_{tj}$ at the $t^th$ SNP, $t=1,2,...,M$, $i,j=1,2,...,K$
$\rho$,$\nu$ and $\eta$: constant values for default initialization of HMM model

Model Initialization

The initial values of $\pi$ are set to $1/K$. For each $t=1,2,...,m$, $\tau(s_{ti}, s_{(t+1)j}) = 1- \rho$ if $i=j$, $\tau(s_{ti}, s_{(t+1)j}) = \rho /(K-1)$ otherwise. The values of $\epsilon$ is set to $1 - \nu$ for a selected major allele for each $Sti$, and to $\nu / (K-1)$ for the other alleles, where the major allele is selected in such a manner that it locally maximize the likelihood of getting the genotype data.
The probabilities are then perturbed by a random number and normalized to get their finalized initial values.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
example		example
manual		manual
.DS_Store		.DS_Store
Matrix3d.h		Matrix3d.h
Matrix615.h		Matrix615.h
hapRecstr.cpp		hapRecstr.cpp
hapRecstr.h		hapRecstr.h
readme.md		readme.md

styvon/HapRecstr

Folders and files

Latest commit

History

Repository files navigation

HaploRecstr: Haplotype Reconstruction Using Hidden Markov Technique

Chen Liang, Ziwei Zhu, Yongwen Zhuang

2016-12-16

Contents

Data Format

Input Data: Genotypes

Parameter Data

Output Data 1: Haplotypes

Output Data 2: Summary

How to Use HapRecstr

Download and Unpack

Compile

Run the Program

Set Parameters of HMM Model

Set Convergence Threshold

Set Maximum Number of EM Loops

Get the Results

Time Complexity and Space Requirement

Appendix

Notations

Model Initialization

About

Topics

Resources

Stars

Watchers

Forks

Languages