Fetching latest commit…
Cannot retrieve the latest commit at this time.
|Failed to load latest commit information.|
sefira - state of the art in tree comparison Finding the largest common subtree of two ordered labeled trees is a problem with increasing applications (i.e. in XML processing and computational biology), on which substantial progress has been made in recent years. However, papers on tree comparison may not be available without subscription to the appropriate journals and even when they are, they generally don't contain executable code, so using the new algorithms is non-trivial. Also, there isn't one "best" algorithm yet - distance definitions differ, and even for the same edit distance, time complexity can be measured on different tree features. The goal of the sefira library is to provide well tested and reasonably optimized implementations of some tree comparison algorithms, so that they can be tried out in practical applications. Scripting languages being impractical for fast implementations of memory-intensive algorithms, sefira is a C++ (C would probably be feasible, but the implementations depend heavily on C++ containers) library with common classes plus executables - one per implemented algorithm - wrapping the classes into command-line programs. The common model of all algorithm implementations is the transformation of two XML trees (sefira depends on libxml2) to their largest common subtree (also XML). When only the distance is required, it can be trivially computed from tree sizes, although keeping the intermediate subtrees wastes quite a lot of memory in that case - using a simplified algorithm may be worth a rewrite. Implementations generally use unsigned short indices, limiting the size of one input tree somewhere around 65536 DOM nodes (on 32bit systems; sefira has never been tried on anything else, although obviously you're welcome to try it out on as big an iron as you want) - beyond that, the algorithms would probably be hopelessly slow anyway... Error handling is implemented using C++ exceptions, throwing instances of the standard C++ exception classes. Checking of parameters is far from comprehensive, but there is some - performance-hyperconscious users may want to remove it, but really, the current implementation presents many more interesting optimization opportunities... sefira also has unified compile-time configurable logging (to standard error via std::cerr, so it can be redirected in one place) and many objects implement their own serialization formats by overloading operator <<. The implemented algorithms are as follows: sefira-optimistic This algorithm is from S. Mozes, D. Tsur, O. Weimann, and M. Ziv-Ukelson Fast algorithms for computing tree LCS probably a preliminary version which appeared in Proc. 19th Symposium on Combinatorial Pattern Matching (CPM), LNCS 5029, 230-243, 2008, but I got it at http://www.cs.bgu.ac.il/~dekelts/publications/treelcs.pdf . This algorithm is specialized for the LCS problem and measures its complexity in the compared trees size, height and the number of pairs from the first and second tree having the same label - in other words, it's maximally efficient for totally different trees and looses its edge as node names start to repeat. sefira-straight This algorithm is from Erik D. Demaine, Shay Mozes, Benjamin Rossman, and Oren Weimann, An Optimal Decomposition Algorithm for Tree Edit Distance in Proceedings of the 34th International Colloquium on Automata, Languages and Programming (ICALP 2007), Wroclaw, Poland, July 9, 2007. This algorithm is formulated for a generic distance metric, but sefira implements only the size comparison - a generalization is left as an exercise for the reader. sefira-straight is the memoizing version, using n^3 in both space and time (where n is the size of the larger input tree). sefira-systematic This algorithm is the same as sefira-straight, i.e. also from Erik D. Demaine, Shay Mozes, Benjamin Rossman, and Oren Weimann, An Optimal Decomposition Algorithm for Tree Edit Distance in Proceedings of the 34th International Colloquium on Automata, Languages and Programming (ICALP 2007), Wroclaw, Poland, July 9, 2007, except it's the memory-efficient variant, as described in http://erikdemaine.org/papers/TreeEdit_ICALP2007/paper.pdf . It uses n^2 space and n^3 time.