Heiko Strathmann edited this page Feb 20, 2015 · 9 revisions

Hidden Markov Models for genotype imputation

Cleaning up Shogun's HMMs and implementing a genotype imputation pipeline


Difficulty & Requirements

Medium/Advanced. You should:

  • Be able to clean up and optimize C/C++ code
  • Have a good understanding of HMMs
  • Ideally know some MCMC (helpful, not required)
  • Have a basic understanding of genotyping and GWAS analysis
  • Be able to visualize imputation results in an IPython notebook


Hidden Markov Models (HMMs) have been used extensively in bioinformatics to solve problems of biological sequence analysis. Among other things, HMMs can be used to predict transcription factor binding sites [1], align sequences [2], find conserved sequence motifs [3], and statistically infer ("impute") unobserved genotypes for Genome Wide Association Studies (GWAS) [4]. One of the most popular tools currently able to impute genotypes, IMPUTE2 [5, 6], uses HMMs and a Markov Chain Monte Carlo type algorithm; however, the software is not available under an open source license and is free only for academic use. In this project you will clean up and optimize the HMM implementation in the Shogun Toolbox and then use the HMMs to build a bioinformatics pipeline that statistically imputes genotypes in a sample dataset, using the densely typed 1000 Genomes data as a background set.
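To make the idea of HMM-based imputation concrete, here is a minimal sketch in plain Python. It is not Shogun's API and not the algorithm used by IMPUTE2; it assumes a simplified haplotype-copying model in which each hidden state is one reference haplotype, the sample "copies" from one haplotype at a time, and untyped sites are filled in from the forward-backward posterior. The function name `impute_dosages` and the parameters `switch_prob` and `error_prob` are hypothetical and chosen only for illustration.

```python
def impute_dosages(reference, observed, switch_prob=0.05, error_prob=0.01):
    """Per site, the posterior probability that the copied reference
    haplotype carries allele 1.

    reference: list of K haplotypes, each a list of 0/1 alleles (length L).
    observed:  list of length L holding 0, 1, or None for an untyped site.
    """
    n_states, n_sites = len(reference), len(observed)

    def normalise(values):
        total = sum(values)
        return [v / total for v in values]

    def emission(state, site):
        if observed[site] is None:
            return 1.0  # an untyped site carries no information
        match = reference[state][site] == observed[site]
        return 1.0 - error_prob if match else error_prob

    def transition(prev, nxt):
        # Either keep copying the same haplotype, or switch and land
        # uniformly on any haplotype (including the current one).
        stay = 1.0 - switch_prob if prev == nxt else 0.0
        return stay + switch_prob / n_states

    # Forward pass, rescaled at every site to avoid numerical underflow.
    forward = [normalise([emission(k, 0) / n_states for k in range(n_states)])]
    for t in range(1, n_sites):
        prev = forward[-1]
        forward.append(normalise([
            emission(k, t) * sum(prev[j] * transition(j, k) for j in range(n_states))
            for k in range(n_states)
        ]))

    # Backward pass, rescaled in the same way.
    backward = [[1.0] * n_states for _ in range(n_sites)]
    for t in range(n_sites - 2, -1, -1):
        backward[t] = normalise([
            sum(transition(k, j) * emission(j, t + 1) * backward[t + 1][j]
                for j in range(n_states))
            for k in range(n_states)
        ])

    # Combine both passes into per-site posteriors over the copied haplotype.
    dosages = []
    for t in range(n_sites):
        posterior = normalise([f * b for f, b in zip(forward[t], backward[t])])
        dosages.append(sum(p for k, p in enumerate(posterior) if reference[k][t] == 1))
    return dosages
```

Calling `impute_dosages(reference, observed)` with `None` at the untyped positions returns one value per site; values close to 1 or 0 at the untyped sites indicate a confident imputation. A real pipeline would additionally have to handle diploid genotypes, phasing, and recombination maps, none of which this sketch attempts.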

The project is a great opportunity to combine machine learning with some real-world large scale genomic data and would enable the student to deepen her/his understanding of biological sequence analysis and large scale genomics.

Waypoints and initial work

There are three major parts to this project:

  • Cleaning up Shogun's HMM implementation (mostly to be done in the pre-GSoC phase, see below)
  • Extending the HMM implementation with a simple MCMC sampler
  • Building a pipeline to impute genotypes using sample data and data from the 1000 Genomes Project
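As an illustration of the second part, one building block that a sampler for HMMs could use is forward filtering followed by backward sampling, which draws a complete hidden-state path from its posterior given the observations and can serve as one step inside a larger MCMC scheme. This is an assumption about what a "simple MCMC sampler" might involve, not a description of existing Shogun code; the function name `sample_state_path` and its argument layout are hypothetical.

```python
import random


def sample_state_path(initial, transition, emission, observations, rng):
    """Draw one hidden-state path from P(path | observations).

    initial[k]       : probability of starting in state k
    transition[j][k] : probability of moving from state j to state k
    emission[k][o]   : probability that state k emits symbol o
    rng              : a random.Random instance (pass a seed for reproducibility)
    """
    n_states = len(initial)

    def normalise(values):
        total = sum(values)
        return [v / total for v in values]

    # Forward filtering: filtered[t][k] = P(state_t = k | observations[0..t]).
    filtered = [normalise([initial[k] * emission[k][observations[0]]
                           for k in range(n_states)])]
    for obs in observations[1:]:
        prev = filtered[-1]
        filtered.append(normalise([
            emission[k][obs] * sum(prev[j] * transition[j][k] for j in range(n_states))
            for k in range(n_states)
        ]))

    # Backward sampling: draw the last state from its filtered distribution,
    # then each earlier state conditioned on the state sampled after it.
    path = [rng.choices(range(n_states), weights=filtered[-1])[0]]
    for t in range(len(observations) - 2, -1, -1):
        nxt = path[-1]
        weights = [filtered[t][j] * transition[j][nxt] for j in range(n_states)]
        path.append(rng.choices(range(n_states), weights=weights)[0])
    path.reverse()
    return path
```

Repeated calls with the same model yield different paths in proportion to their posterior probability, so averaging over many draws approximates the same marginals that forward-backward computes exactly. The sampling view becomes useful once other unknowns, such as haplotype phase, are resampled alongside the path.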


None yet.

Why this is cool

This project is an amazing opportunity to learn about both efficient C++ implementations of HMMs and their usage in practical bioinformatics. Furthermore, the combination of cleaning up Shogun's internals, and then applying them to real-world problems of other scientific realms, is aligned very closely with our goals for this GSoC. Given a successful completion, expect your work to be used by scientists all over the world.

The sharp decline in the cost of DNA and RNA sequencing enables genomic research on an unprecedented scale in a multitude of organisms. However, modern biology is challenged by the vast amounts of data generated and depends more and more on machine learning algorithms to analyze this data and further our understanding of the underlying biology. As such, the student will work at the intersection of machine learning, bioinformatics and genome biology, developing open source tools for the community. These tools could be used in Genome Wide Association Studies (GWAS) that aim to enhance our understanding of major diseases like cancer or epidemics such as obesity.

Useful resources

Entrance tasks:

[1] Won KJ, Ren B, Wang W (2010) Genome-wide prediction of transcription factor binding sites using an integrated model Genome Biology

[2] Eddy SR (2008) A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation PLoS Computational Biology

[3] Eddy SR (2011) Accelerated Profile HMM Searches PLoS Computational Biology

[4] Howie B, Marchini J, Stephens M (2011) Genotype imputation with thousands of genomes G3: Genes, Genomes, Genetics

[5] Howie B, Donnelly P, Marchini J (2009) A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies PLoS Genetics

[6] Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR (2012) Fast and accurate genotype imputation in genome-wide association studies through pre-phasing Nature Genetics