Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP
Browse files

first attempt at this post

  • Loading branch information...
commit cf7bbeade2bb615de4d98f2af1869ab1fdce8b2e 1 parent bca48f2
Nate Young authored
Showing with 60 additions and 16 deletions.
  1. +60 −16 html-selector/html-selector.md
76 html-selector/html-selector.md
View
@@ -1,22 +1,66 @@
-# HTML Selectors for Scraping Webpages with Clojure
+# CSS Selectors, Scraping and Clojure
-1. starting with a webpage we've got on disk
-2. DOM is a nice format to parse and HTML on the web these days is tough to parse
-3. validator.nu (already on maven - easy to add to project.clj)
-4. docs at http://about.validator.nu/htmlparser/apidocs/
-5. feed HtmlDocumentBuilder an InputSource (feed the InputSource a Reader, which Clojure has a great fn for)
-6. decide what information we want off the page and what we'd write in jQuery to get it (sizzle selectors)
+## Building a DOM
-($ "#statTable1")
-($ "tbody" "tr")
+Parsing HTML can be tricky, most of my naive attempts to parse real-world pages produced a lot of stack traces. The [Validator.nu HTML parser](http://about.validator.nu/htmlparser/) has so far cleared those low hurdles. It's implemented in Java and it has a maven artifact, which makes it easy to include in a leiningen project, so it's my current weapon of choice.
-($ ".pos")
-($ ".player" ".name")
-($ ".stat")
+ :dependencies [[org.clojure/clojure "1.2.0"]
+ [org.clojure/clojure-contrib "1.2.0"]
+ [nu.validator.htmlparser/htmlparser "1.2.1"]]
-($ "#matchup-summary-table" "tr")
-($ "td")
+It's easy to get a DOM from a webpage using Validator.nu ([api docs here](http://about.validator.nu/htmlparser/apidocs/)), feed `HtmlDocumentBuilder` an `InputSource` which you feed a `java.io.Reader`, which is easily created via the `reader` fn from `clojure.java.io`:
-We'd like to be able to create these snippets, which then merely need to be fed the contexts (a list of nodes) from which they will select their work and return a flat list. That way the results can serve as contexts for other selectors.
+ (defn build-document [file-name]
+ (.parse (HtmlDocumentBuilder.) (InputSource. (reader file-name))))
-nodelist-seq aside (this is why you need `letfn`, won't work with a let and an anonymous fn)
+## Selectors
+
+To start, I'd like to be able to select a node by:
+
+* id: `#statTable1`
+* tag name: `table`
+* class attribute: `.class`
+
+Selection by id and tagname is easy, there are already methods on `getElementById` method on `Document`.
+
+ (defn id-sel [document id]
+ (let [id (.substring id 1)]
+ (.getElementById document id)))
+
+A DOM is already a tree, but not in a Clojure data structure that we can walk with the lbs the language already gives us, like Stuart Sierra's `clojure.walk`. The Java interface for walking the DOM returns a node's children as `NodeList` (which does not implement `Iterable`), so converting that into a seq is a useful first step. After that, `filter` can be used to
+
+ (defn selector [node pred]
+ (let [children (nodelist-seq (.getChildNodes node))]
+ (lazy-cat
+ (filter pred children)
+ (when-not (empty? children)
+ (mapcat #(selector % pred) children)))))
+
+ (defn element-tagname [elt]
+ (when (= Node/ELEMENT_NODE (.getNodeType elt))
+ (.getNodeName elt)))
+
+ (defn get-attribute [elt attr]
+ (.?. elt getAttributes (getNamedItem attr) getValue))
+
+ (defn hasclass? [elt class]
+ (when-let [class-attr (get-attribute elt "class")]
+ (some #(= class %) (split class-attr #" "))))
+
+
+ (defn class-sel [node class]
+ (selector node #(hasclass? % (.substring class 1))))
+
+Is there a better way to write `selector` here? I'd love to hear in the comments. Zippers from `clojure.zip` and prewalk/postwalk from Stuart Sierra's `clojure.walk` might be faster/cleaner/more elegant.
+
+The `.?.` method in `get-attribute` is remarkably useful. It's analogous to the `..` operator in `clojure.core` for chaining method invocations on objects. As not all `Node` objects have attributes on them, and not all attributes have the one we're looking for, in both cases, a null value is returned by the method invoked. Trying to invoke any other method returns an NPE. `.?.` does the grunt-work of handling that and short-circuiting to return nil, which is a perfectly reasonable and usable return value for those cases.
+
+## Composing Selectors
+
+I wanted selectors to be composable. I wanted to be able to feed the results of one selector to another selector to produce further refined results. This would allow me to declare a selector and use it under multiple contexts.
+
+Selectors take a single node and produce a list of "selected" nodes. To run a second selector over a list of selected nodes, the `mapcat` operator executes it for each selection and combines the individual result lists back into a flat list of "selected" nodes.
+
+## The 'M' Word
+
+By now, you may have realized that this approach is the same as that ubiquitous and hip mathematical notion, the List monad.
Please sign in to comment.
Something went wrong with that request. Please try again.