GitHub - sorhus/hmmongo: viterbi and more on hadoop

Given a Hidden Markov Model and a bunch of observation sequences, compute the most likely path for each observation sequence.
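
To make "most likely path" concrete, here is a minimal, self-contained sketch of the Viterbi recursion in log space. It is not the hmmongo implementation: the class name ViterbiSketch, the dense-array representation and the toy two-state model in main are assumptions made purely for illustration.

```java
import java.util.Arrays;

/** Illustrative log-space Viterbi decoder; not part of hmmongo. */
final class ViterbiSketch {

    /**
     * @param pi  initial state probabilities, pi[i]
     * @param a   transition probabilities, a[i][j] = P(next state j | current state i)
     * @param b   emission probabilities, b[i][o] = P(symbol o | state i)
     * @param obs observation sequence encoded as symbol indices
     * @return the most likely state sequence for obs
     */
    static int[] decode(double[] pi, double[][] a, double[][] b, int[] obs) {
        int n = pi.length;
        int t = obs.length;
        if (t == 0) {
            return new int[0];
        }
        double[][] score = new double[t][n]; // best log-probability of any path ending in state j at step k
        int[][] back = new int[t][n];        // predecessor of state j on that best path

        for (int j = 0; j < n; j++) {
            score[0][j] = Math.log(pi[j]) + Math.log(b[j][obs[0]]);
        }
        for (int k = 1; k < t; k++) {
            for (int j = 0; j < n; j++) {
                double best = Double.NEGATIVE_INFINITY;
                int arg = 0;
                for (int i = 0; i < n; i++) {
                    double s = score[k - 1][i] + Math.log(a[i][j]);
                    if (s > best) {
                        best = s;
                        arg = i;
                    }
                }
                score[k][j] = best + Math.log(b[j][obs[k]]);
                back[k][j] = arg;
            }
        }

        // Pick the best final state, then follow the back-pointers.
        int last = 0;
        for (int j = 1; j < n; j++) {
            if (score[t - 1][j] > score[t - 1][last]) {
                last = j;
            }
        }
        int[] path = new int[t];
        path[t - 1] = last;
        for (int k = t - 1; k > 0; k--) {
            path[k - 1] = back[k][path[k]];
        }
        return path;
    }

    public static void main(String[] args) {
        // Toy model (assumed for this example): 2 states, 3 observation symbols.
        double[] pi = {0.6, 0.4};
        double[][] a = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] b = {{0.5, 0.4, 0.1}, {0.1, 0.3, 0.6}};
        System.out.println(Arrays.toString(decode(pi, a, b, new int[] {0, 1, 2}))); // prints [0, 0, 1]
    }
}
```

Working in log space avoids numeric underflow on long observation sequences, which matters once sequences reach the lengths used in the examples below.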

The project consists of a generic library and several applications. The applications are:

A standalone Scala app
A Hadoop job
An HTTP server
A Thrift server
A PrestoDB plugin

Javadocs for the library can be found here

Typical usage of the library looks like this.

// args = {pi, A, B, T, input, output};
Function<String, InputStream> r = DNAViterbiApp.class::getResourceAsStream;
HMM hmm = new HMM.Builder()
    .fromInputStreams(r.apply(args[0]), r.apply(args[1]), r.apply(args[2]))
    .adjacency()
    .build();
Viterbi<String,FullResult> viterbi =
    new Viterbi.Builder<String,String,FullResult>()
    .withHMM(hmm)
    .withMaxObservationLength(Integer.parseInt(args[3]))
    .withObservationEncoder(new DNAEncoder())
    .withObservationDecoder(new DNADecoder())
    .withPathDecoder(new StringDecoder())
    .withResultFactoryClass("com.github.sorhus.hmmongo.viterbi.result.FullResultFactory")
    .build();

The library is written in Java 8. Note that DNAEncoder, DNADecoder and StringDecoder require Scala 2.11 to run.

Also, in libhmm/src/main/resources there is an HMM that models the T-cell receptor.

Build the project

mvn clean package

Run the Scala app

scala -cp libhmm/target/libhmm-0.3.0.jar com.github.sorhus.hmmongo.DNAViterbiApp /example_pi.gz /example_A.gz /example_B.gz 41 libhmm/src/main/resources/example_input output

Run the HTTP server

java -jar scalatra/target/scalatra-0.3.0.jar /example_pi.gz /example_A.gz /example_B.gz 101
curl localhost:8080/acgttgcatcgatcgatcgatcgatcgtacgatcgatcgaacgatgcgactaca
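
The same request can be made from Java. The sketch below assumes only what the curl call shows, namely that the server listens on localhost:8080 and takes the sequence as the URL path. The class name ViterbiHttpClientSketch and its helper methods are hypothetical, and because the response format is not documented here, the body is assumed to be UTF-8 text and is printed unchanged.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Illustrative client for the HTTP server; not part of hmmongo. */
final class ViterbiHttpClientSketch {

    /** Builds the request URL: the observation sequence is the path, as in the curl example. */
    static String requestUrl(String host, int port, String sequence) {
        return "http://" + host + ":" + port + "/" + sequence;
    }

    /** Issues a GET request and returns the response body as text (UTF-8 is an assumption). */
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Requires the HTTP server from the command above to be running.
        System.out.print(fetch(requestUrl("localhost", 8080, "acgttgcatcgatcg")));
    }
}
```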

Run the Thrift server

java -jar thrift/target/thrift-0.3.0.jar /example_pi.gz /example_A.gz /example_B.gz 101
There is an example client that can be tested as follows:
java -cp thrift/target/thrift-0.3.0.jar com.github.sorhus.hmmongo.thrift.client.DNAViterbiClient

Run the Hadoop job

local mode

hadoop jar hadoop/target/hadoop-0.3.0.jar com.twitter.scalding.Tool com.github.sorhus.hmmongo.DNAViterbiJob --local --pi hadoop/src/main/resources/example_pi.gz --A hadoop/src/main/resources/example_A.gz --B hadoop/src/main/resources/example_B.gz --T 101 --input hadoop/src/main/resources/example_input --output output

HDFS mode

If the HMM is big, raise the client heap size first: export HADOOP_CLIENT_OPTS=-Xmx2g
Put the input on HDFS
hadoop jar hadoop/target/hadoop-0.3.0.jar com.twitter.scalding.Tool com.github.sorhus.hmmongo.DNAViterbiJob --hdfs --pi src/main/resources/tcrb_pi.gz --A src/main/resources/tcrb_A.gz --B src/main/resources/tcrb_B.gz --T 101 --input /user/anton/SRR060692_1.sample --output /user/anton/output

Deploy the Presto plugin

Put the jar in the Presto plugin directory, i.e. presto/plugin/hmmongo/presto-0.4.0.jar
Put a config in the Presto etc directory

cat >> etc/plugin/viterbi.properties <<EOF
pi.location=/path/to/pi
A.location=/path/to/A
B.location=/path/to/B
T=100
EOF
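
Before restarting Presto, it can help to sanity-check the file. The following sketch reads the four keys shown above with java.util.Properties and verifies that T is an integer. It is a standalone helper under the hypothetical name ViterbiPropertiesSketch, and it makes no claim about how the plugin itself loads the file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustrative check of a viterbi.properties file; not part of hmmongo. */
final class ViterbiPropertiesSketch {

    // Keys taken from the example config above.
    static final String[] REQUIRED = {"pi.location", "A.location", "B.location", "T"};

    /** Parses properties text and fails fast if a key is missing or T is not an integer. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String key : REQUIRED) {
            if (props.getProperty(key) == null) {
                throw new IllegalArgumentException("missing key: " + key);
            }
        }
        // Throws NumberFormatException if T is not a valid integer.
        Integer.parseInt(props.getProperty("T").trim());
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(
            "pi.location=/path/to/pi\nA.location=/path/to/A\nB.location=/path/to/B\nT=100\n");
        System.out.println(props.getProperty("T")); // prints 100
    }
}
```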

Test the function like so

presto> select viterbi('name','acgt');
  _col0  
---------
 1,1,2,2 
(1 row)
