No description, website, or topics provided.
Java Ruby Python
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.

Causes Puzzle


Before anything else, the answer is 78482 and the resulting code is

Summary of my Though Process

First Attempt: The Ruby Way

I first went to the Wikipedia page and learned how the levenshtein distance is calculated.

I went straight to my editor to and started implementing the standard (Robert-Fisher) algorithm in Ruby. I did not do any research, I wanted to see how a straight forward, head-on approach would turn out. I saw the puzzle's data set is over 260000 words long but I went ahead.

This resulted in causes_challenge.rb. The code is pretty basic and crude. As expected this first approach did not work out. While correctly written, the performance using either ruby 1.9 or jruby is sub-optimal for this problem. I didn't wait for it to complete.

Second Attempt: The Java Way

I had to find a better algorithm and switch to a faster language.

My language of choice for compute intensive tasks is Java (I never got to working with C/C++) so I switched to Java.

I had to go online to find a new approach/algorithm to this problem (I didn't study CS, I have to look this stuff up).

On Stackoverflow, I found someone with a similar need and along with his solution, using Trie structures. I found it elegant and I was happy to understand the logic and the trie structure quickly. I implemented it in Java.

This resulted in In my first test I solved the problem in 12 minutes. Too slow (I have all weekend). So I went back online and looked for alternative approaches.

I should mention that I'm running all my tests on my older 1.8ghz Macbook Air, I do not have very performant hardware.

I had read about upcoming Lucene 4.0's greatly improved fuzzy search performance. Lucene will soon be using Levenshtein Automata. I didn't and still don't know how they work, but on paper they can find a leveinstein distance very quickly.

I couldn't find a standalone implementation of Levenshtein Automata so I first tried using Lucene.

This resulted in This didn't work well enough. I got the correct answer but it was taking over 15 minutes because of Lucene's overhead.

I did find a standalone implementation of Levenshtein Automata.

This resulted in It worked but again it was too slow. I tweeked the automaton's source trying to make it perform better, I got some results but I was still over 10 minutes. I was not motivated to tweak the code further.

The Holy Grail

Well maybe not quite the holy grail, but good enough for me. I went back to the more elegant I read up on writing performant Java code and how to tweak the JVM for my problem at hand.

After a few iterations, making subtle changes, I brought the execution time down from 12 minutes to 1.5 minutes. That is an improvement of 800%. I stopped there. That is good enough for me.

Running This

The Java Trie version can be compiled:

javac -O

and run:

time java -Xms720m -Xmx720m -server -XX:MaxNewSize=60m -XX:NewSize=60m -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:MaxPermSize=48M -XX:PermSize=48M -XX:+AggressiveOpts -XX:+OptimizeStringConcat -XX:MaxTenuringThreshold=0 -XX:NewRatio=2048 -XX:SoftRefLRUPolicyMSPerMB=10000 CausesChallengeTrie challenge_word_list.tmp.txt