Subotai brings routines for extracting information from HTML documents to Clojure.


> Subotai died at age 73, by which time he had conquered 32 nations and won 65 pitched battles,
> as the Muslim historians tell us. For 60 of those years, Subotai lived as Mongol soldier,
> first as a lowly private who kept the tent door of Genghis himself,
> rising to be the most brilliant and trusted of Genghis Khan's generals.
> When Genghis died, Subotai continued to be the moving force of the Mongol
> army under his successors. It was Subotai who planned and participated in
> the Mongol victories against Korea, China, Persia, and Russia. It was Subotai's
> conquest of Hungary that destroyed every major army between the
> Mongols and the threshold of Europe.
> - from Subotai the Valiant: Genghis Khan's Greatest General by Richard A. Gabriel

Subotai is a Swiss Army knife of data-mining tools for HTML documents. It contains routines for:

  1. Comparing the similarity (in structure) of HTML documents.
  2. Testing if two documents are near-duplicates in a way that scales to large web corpora.



A list of the algorithms implemented:

  • Structural similarity using the tree-edit-distance approach from Reis et al., and a version I hacked together from a previous project that uses a vector-space representation and cosine similarity.
  • Near-duplicate detection (a naive algorithm from Introduction to Information Retrieval by Manning et al. and a scalable version from Manku et al.).



Structural Similarity

Structural similarity routines are available in the subotai.structural-similarity namespace.

To check if two documents have the same underlying structure (for example, different pages of the same blog):

user=> (use 'clj-http.client)
user=> (use 'subotai.structural-similarity :reload)
user=> (def bod1 (:body (get ""))) ; this is page 1
user=> (def bod2 (:body (get ""))) ; this is page 2
user=> (similar? bod1 bod2)
true ; both pages have the same structure
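
To give a feel for the vector-space idea, here is a minimal sketch that counts opening tags per page and compares the counts with cosine similarity. It is not Subotai's implementation: the names tag-counts, cosine-similarity and structurally-similar?, the regex-based tag extraction, and the threshold argument are all assumptions made for illustration.

```clojure
(require '[clojure.string :as str])

(defn tag-counts
  "Counts opening tags by name, e.g. {\"div\" 2, \"p\" 1}.
  A regex is a crude stand-in for a real HTML parser (assumption)."
  [html]
  (->> (re-seq #"<([a-zA-Z][a-zA-Z0-9]*)" html)
       (map (comp str/lower-case second))
       frequencies))

(defn cosine-similarity
  "Cosine of the angle between two sparse count vectors stored as maps."
  [a b]
  (let [dot   (reduce + (map (fn [[k v]] (* v (b k 0))) a))
        norm  (fn [m] (Math/sqrt (reduce + (map #(* % %) (vals m)))))
        denom (* (norm a) (norm b))]
    (if (zero? denom) 0.0 (/ dot denom))))

(defn structurally-similar?
  "True when the tag-count vectors of both pages are close enough.
  The threshold is a hypothetical tuning knob."
  [html-1 html-2 threshold]
  (>= (cosine-similarity (tag-counts html-1) (tag-counts html-2))
      threshold))
```

Two pages that share a template but differ in text produce nearly identical tag counts, so their cosine similarity is close to 1. Pages with no tags in common score 0.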


Near Duplicate Detection

Near Duplicate Detection routines are available in the subotai.near-duplicate namespace. The simplest algorithm implemented is the shingles algorithm (build a list of 4-grams, compute the Jaccard similarity, and perform a threshold test). I also have a more scalable algorithm from Manku et al. (from WWW '07). When comparing two HTML documents, we default to the scalable version, but you can specify a different algorithm in the function call.
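
The shingles steps described above can be sketched as follows. This is an illustration rather than Subotai's code: it assumes word-level 4-grams over whitespace-separated tokens, and the names shingles, jaccard and near-duplicate-by-shingles? as well as the threshold argument are hypothetical.

```clojure
(require '[clojure.set :as set]
         '[clojure.string :as str])

(defn shingles
  "Set of all runs of n consecutive whitespace-separated tokens in text."
  [n text]
  (let [tokens (remove str/blank?
                       (str/split (str/lower-case text) #"\s+"))]
    (set (partition n 1 tokens))))

(defn jaccard
  "Size of the intersection divided by size of the union.
  Two empty sets are treated as identical."
  [a b]
  (let [union (set/union a b)]
    (if (empty? union)
      1.0
      (/ (double (count (set/intersection a b)))
         (count union)))))

(defn near-duplicate-by-shingles?
  "Threshold test on the Jaccard similarity of the 4-shingle sets."
  [text-1 text-2 threshold]
  (>= (jaccard (shingles 4 text-1) (shingles 4 text-2)) threshold))
```

Comparing every pair of documents this way is quadratic in the size of the corpus, which is why a fingerprint-based method is the better fit for large crawls.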

For example, here are two documents with the same content that are reached via different links, as often happens when you are crawling web pages.

user> (use 'clj-http.client) ; provides get, as in the structural similarity example
user> (use 'subotai.near-duplicate.utils)
user> (use 'subotai.near-duplicate.core :reload)
user> (def bod-1 (:body (get ",5304.0.html"))) ; this is the first page
user> (def bod-2 (:body (get ",5304.msg30671.html"))) ; this is the second page
user> (near-duplicate-html? bod-1 bod-2)
true ; and they are near duplicate

You can also choose the algorithm explicitly by passing :shingles or :fingerprint as the third argument.

user> (near-duplicate-html? bod-1 bod-2 :shingles)
user> (near-duplicate-html? bod-1 bod-2 :fingerprint)
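
For intuition about the fingerprint option, here is a minimal sketch of a simhash-style fingerprint compared by Hamming distance, the idea that Manku et al. build on. It is not Subotai's implementation. It assumes unweighted word tokens and a 32-bit fingerprint derived from Clojure's hash for brevity, whereas the paper works with 64-bit fingerprints and weighted features. The names simhash, hamming-distance and near-duplicate-by-fingerprint? are hypothetical. The sketch also covers only the pairwise comparison, not the index that makes lookups scale.

```clojure
(require '[clojure.string :as str])

;; Assumption: 32 bits, because clojure.core/hash yields 32-bit values.
(def fingerprint-bits 32)

(defn simhash
  "Each token votes +1 or -1 on every bit position according to its hash.
  The sign of each tally decides the corresponding fingerprint bit."
  [text]
  (let [tokens  (remove str/blank?
                        (str/split (str/lower-case text) #"\s+"))
        tallies (reduce
                 (fn [acc token]
                   (let [h (hash token)]
                     (mapv (fn [tally bit]
                             (if (bit-test h bit) (inc tally) (dec tally)))
                           acc
                           (range fingerprint-bits))))
                 (vec (repeat fingerprint-bits 0))
                 tokens)]
    (reduce (fn [fp bit]
              (if (pos? (nth tallies bit)) (bit-set fp bit) fp))
            0
            (range fingerprint-bits))))

(defn hamming-distance
  "Number of bit positions in which two fingerprints differ."
  [a b]
  (Long/bitCount (bit-xor a b)))

(defn near-duplicate-by-fingerprint?
  "True when the fingerprints differ in at most max-distance bits."
  [text-1 text-2 max-distance]
  (<= (hamming-distance (simhash text-1) (simhash text-2)) max-distance))
```

Because similar token sets produce fingerprints that differ in only a few bits, a document can be reduced to a single number once and compared cheaply afterwards.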

Reading WARC files

WARC is the standard file format used for archiving large HTML corpora. Several of the largest web corpora (ClueWeb09, ClueWeb12, and the Common Crawl) are shipped as collections of WARC files.

An example routine would be:

(use 'subotai.warc.warc)

(defn usage-example
  []
  (with-open [instream (warc-input-stream "/Users/shriphani/Documents/warc-clojure/0000wb-00.warc.gz")]
    ;; doall realizes the lazy seq before with-open closes the stream
    (doall
     (map
      (fn [r]
        (-> r :warc-target-uri))
      (stream-warc-records-seq instream)))))

(take 3 (usage-example))

Which returns:

(nil "" "")

A single record in a WARC file consists of some metadata stored in the header and a payload. An example record is:

{:payload ....
 :warc-type "response",
 :warc-date "2011-02-18T23:32:56Z",
 :content-length "4928",
 :warc-record-id "<urn:uuid:00127f49-b6d8-413e-857b-5a7620368f88>",
 :warc-ip-address "",
 :warc-payload-digest "sha1:M4VJCCJQJKPACSSSBHURM572HSDQHO2P",
 :content-type "application/http; msgtype=response"}
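
Since each record is a plain Clojure map, ordinary sequence functions are enough to work with a stream of them. The sketch below relies only on the keys shown in this README (:warc-type, :warc-target-uri, :content-length). The function names and the sample data in the usage note are hypothetical.

```clojure
(defn response-uris
  "Target URIs of all response records, skipping records without a URI."
  [records]
  (->> records
       (filter #(= "response" (:warc-type %)))
       (keep :warc-target-uri)))

(defn total-content-length
  "Sums :content-length, which is a string in the record shown above.
  Records without the key are ignored."
  [records]
  (->> records
       (keep :content-length)
       (map #(Long/parseLong %))
       (reduce +)))
```

For example, with two made-up records:

```clojure
(def sample-records
  [{:warc-type "warcinfo" :content-length "219"}
   {:warc-type "response"
    :warc-target-uri "http://example.com/"
    :content-length "4928"}])

(response-uris sample-records)        ; one URI, from the response record
(total-content-length sample-records) ; 219 + 4928
```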


Copyright © 2014 Shriphani Palakodety

Distributed under the Eclipse Public License, either version 1.0 or (at your option) any later version.