Skip to content

a C++ program for haplotype reconstruction using HMM

Notifications You must be signed in to change notification settings

styvon/HapRecstr

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HaploRecstr: Haplotype Reconstruction Using Hidden Markov Technique

Chen Liang, Ziwei Zhu, Yongwen Zhuang

2016-12-16

HaploRecstr is a C++ haplotype reconstruction program. Method used is based on the algorithm introduced by Rastas et al.^[Rastas, P., Koivisto, M. et al. (2005). Algorithms in Bioinformatics: 5th International Workshop. Berlin, Heidelberg: Springer. 145-151] Genotype data from a sample population is used as input.
More specifically, the program uses a Hidden Markov Model (HMM) to construct the data. By default, the model is initialized by going through the data to select the major alleles and assigning parameter values, then EM algorithm is used to optimize the likelihood function, and then Viterbi is used to reconstruct the haplotype data.
The program takes genotype as input. Outputs of the program include: 1) a set of reconstructed haplotypes; 2) a summary of the frequencies of all possible haplotypes in the population (sorted in descending order).

Contents

back to top

Data Format

Input Data: Genotypes

The program takes an input dataset of genotypes. It should have $N$ rows and $2M$ columns, where $N$ and $M$ are the numbers of individuals and SNPs contained in the dataset respectively.

General requirements about the format:

  • Dataset should not include headers
  • Each row of the data represent one observed genotype
  • Rows starting with # will be ignored when reading in data
  • Entries in the dataset should be integers from $0$ to $K-1$, where $K$ is the number of states/alleles for each SNP, and is set by the user
  • Entries can be separated by ,, tab or space
  • No missing data is allowed in the dataset

The example input dataset (genotype.txt) looks like this:

Five observations are represented in the genotype.txt dataset. Columns 1 and 2 code for the observed alleles at SNP1, columns 3 and 4 code for the observed alleles at SNP2, and so on.

Parameter Data

The program can take in user defined parameter data as the initial value to be used in the program. A total of three kinds of parameters is needed:

  • $\pi$: a $1 \times K$ vector where $\pi_{i}$ indicates the initial transmission probabilities from state $S_0$ to $S_{1i}$, $i=1,2,...,K$. A sample file (pis.txt) is included in the example folder

  • $\tau$: a $(M-1) \times K^2$ matrix where $\tau_{t(i \times K +j)}$ indicates the transmission probability from state $S_{ti}$ at the $t^{th}$ SNP to $S_{(t+1)j}$ at the $(t+1)^{th}$ SNP, $t=1,2,...,M-1$, $i,j=1,2,...,K$. A sample file (trans.txt) is included in the example folder

  • $\epsilon$: a $M \times K^2$ matrix where $\epsilon_{t(i \times K +j)}$ indicates the emission probability from state $S_{ti}$ of allele $A_{tj}$ at the $t^th$ SNP, $t=1,2,...,M$, $i,j=1,2,...,K$. A sample file (emis.txt) is included in the example folder

Output Data 1: Haplotypes

The first of the two output datasets generated by the program, the reconstructed haplotype data, is of size $2N \times M$.

A total of ten haplotypes are shown in the dataset in out1.txt, corresponding to the five observed genotypes in the example input data. The first row corresponds to the identity number of observations. The following rows represent the alleles at subsequent SNPs of each haplotype. Each observation will have two rows representing the reconstructed haplotypes (i.e., reconstructed haplotypes for observation 1 is {1,4,4,3,0} and {1,2,1,4,3} shown at rows 1 and 2; reconstructed haplotypes for observation 2 is {0,1,2,4,3} and {1,3,3,3,1} shown at rows 3 and 4, and so on).

Output Data 2: Summary

The second of the two output datasets generated by the program, a summary of unique reconstructed haplotypes in the population, is of size $L \times (M+1)$, where $L$ is the number of unique haplotypes in the dataset.

The dataset in out2.txt shows the five most common haplotypes in the example population. Each row represents one haplotype, and the last column is the frequency of the haplotype in the sample population in descending order.


back to top

How to Use HapRecstr

Download and Unpack

A sample command to unpack program on a computer:

 % tar -zxvf hapRecstr.tar.gz

Compile

Boost Library should be included when compiling. You can use a compile command like this:

 % g++ -I path/to/boost hapRecstr.cpp -o hapRecstr

Run the Program

The program can take a total of 7 arguments while the first four are required.
Structure of the command look like this:

% ./hapRecstr arg1 arg2 arg3 arg4 arg5 arg6 arg7  

The meanings of the parameters:

  • arg1: the path and name of the input data
  • arg2: the path and name of the file to store output data 1
  • arg3: the path and name of the file to store output data 2
  • arg4: number of states/alleles for each SNP. Note that this should match the entry values in the input data (in the example case, the entries in the input data should be integers from 0 to 4)
  • arg5 (optional): the path and name of the file containing emission probability matrix. The emission probability is the probability of emitting alleles from states. Format requirement see here
  • arg6 (optional): the path and name of the file containing transmission probability matrix, which is the probability of transitting from current state to other states on the subsequent SNP. Format requirement see here
  • arg7 (optional): the path and name of the file containing initial transmission probability vector. Format requirement see here

The example dataset, genotype.txt, in the example folder contains 50 genotype observations. To reconstruct haplotypes from this data using default parameters, use the command below:

 % ./hapRecstr example/genotype.txt example/out1.txt example/out2.txt 5

Note that the value of arg4 is 5, indicating that in this example case, the entries in the input data should be integers from 0 to 4.

Set Parameters of HMM Model

The parameters of the HMM model can be set by either taking the user-specified data from arg5, arg6 and arg7, or by default initialization within the program (more details on the initialization method see here).
For the latter case (i.e., when arg5, arg6 and arg7 are not specified), there will be a prompt for inputting values of $\rho$,$\nu$ and $\eta$:

 Set ro, nu, eta for parameter initialization (type 0 to use default value ro=0.1, nu=0.01, eta=0.8):  

Numbers larger than 0 and less than 1 should be typed in to specify the values. Alternatively, if 0 is typed in, the program will use the default value.

Set Convergence Threshold

After running the commandline in the previous step, there will be a prompt for convergence threshold like this:

 Model initialized
  Set convergence threshold (type 0 to use default value 1e-6):

A number larger than 0 and less than 1 should be typed in to specify the threshold. Alternatively, if 0 is typed in, the program will use the default threshold value of 1e-6.

Set Maximum Number of EM Loops

The program will also prompt for the maximum number of EM loops, which is the maximum number of times that the program will go through the EM algorithm if the convergence threshold is not met:

  Set maximum EM loops (type 0 to use default value 25):

A positive integer should be typed in to specify the threshold. Alternatively, if 0 is typed in, the program will use the default value of 25.

Get the Results

The program will store the outputs into the pre-specified files. A successful run of the program on the example data has the following lines shown in the end:

 Output 1 saved to example/out1.txt
 Output 2 saved to example/out2.txt


back to top

Time Complexity and Space Requirement

The theoretical time complexity of the program is $O(NMK^3)$ while the space requirement is $max(O(MN), O(MK^2))$.
genotype_Daly in the example folder simulates the benchmark dataset from Daly et al.^[Daly, M.J., Rioux, J.D., Schaffner, S.F., et al.: High-resolution haplotype structure in the human genome. Nature Genetics 29 (2001) 229–232] with 129 genotypes and 103 SNP markers.
The running time for the simulated data is 94 seconds, calculated using default values of convergence and max EM loops with the highest likelihood HMM.


back to top

Appendix

Notations

  • $K$: number of possible alleles per loci
  • $N$: number of genotypes
  • $M$: number of SNPs
  • $S_{ti}$: state $i$ at the $t^{th}$ SNP
  • $\pi$: a $1 \times K$ vector of initial transmission probabilities from state $S_0$ to $S_{1i}$, $i=1,2,...,K$
  • $\tau$: a $(M-1) \times K^2$ matrix where $\tau_{t(i \times K +j)}$ indicates the transmission probability from state $S_{ti}$ at the $t^{th}$ SNP to $S_{(t+1)j}$ at the $(t+1)^{th}$ SNP, $t=1,2,...,M-1$, $i,j=1,2,...,K$
  • $\epsilon$: a $M \times K^2$ matrix where $\epsilon_{t(i \times K +j)}$ indicates the emission probability from state $S_{ti}$ of allele $A_{tj}$ at the $t^th$ SNP, $t=1,2,...,M$, $i,j=1,2,...,K$
  • $\rho$,$\nu$ and $\eta$: constant values for default initialization of HMM model

Model Initialization

The initial values of $\pi$ are set to $1/K$. For each $t=1,2,...,m$, $\tau(s_{ti}, s_{(t+1)j}) = 1- \rho$ if $i=j$, $\tau(s_{ti}, s_{(t+1)j}) = \rho /(K-1)$ otherwise. The values of $\epsilon$ is set to $1 - \nu$ for a selected major allele for each $Sti$, and to $\nu / (K-1)$ for the other alleles, where the major allele is selected in such a manner that it locally maximize the likelihood of getting the genotype data.
The probabilities are then perturbed by a random number and normalized to get their finalized initial values.

About

a C++ program for haplotype reconstruction using HMM

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published