-
Notifications
You must be signed in to change notification settings - Fork 14
/
Copy pathtrotter.html
65 lines (64 loc) · 5.52 KB
/
trotter.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
---
layout: default
---
<header>
<h1 class="title">Trotter Walkthrough</h1>
</header>
<p>The Trotter is an example web spider; it “trots” through a web page, heuristically determining which links to follow and recursively printing them.</p>
<p>The Trotter is designed to be a simple program to demonstrate what a basic Common Lisp program looks like.</p>
<p>The raw file can be found at <a href="src/web-trotter.lisp">web-trotter.lisp</a>.</p>
<p>The first form in the code is loading the external libraries:</p>
<pre class="sourceCode commonlisp"><code class="sourceCode commonlisp">(ql:quickload '(:drakma
:split-sequence
:cl-ppcre
:babel))</code></pre>
<p><code>DRAKMA</code> is the standard HTTP(S) client library in Lisp, <code>SPLIT-SEQUENCE</code> does what you might think, <code>CL-PPCRE</code> is a reimplementation of the PPCRE library in native Lisp (It’s considered one of the best modern CL libraries, go check the source out sometime). <code>BABEL</code> is a encoding/decoding tool for strings. Note that they are referred to here as “keywords”.</p>
<p>The second form in the code defines a global that describes what an URL looks like. Of course, it’s not very precise, but it suffices for our purpose. The key thing here is that <code>\</code> are doubled for the special forms. This is part of the CL-PPCRE regex definition.</p>
<pre class="sourceCode commonlisp"><code class="sourceCode commonlisp">(defparameter *url-regex* "(\\w+://[-\\w\\./_?=%&]+)"
"Hacked up regex suitable for demo purposes.")</code></pre>
<p>Next, a predicate for HTTP. We really don’t care about FTP, HTTPS, GOPHER, GIT, etc protocols. It returns nil if it can’t find http:// in the URL.</p>
<pre class="sourceCode commonlisp"><code class="sourceCode commonlisp">(defun http-p (url)
(cl-ppcre:all-matches "\\bhttp://" url))</code></pre>
<p>Next, a predicate to check to see if our data is ASCII. While Common Lisp can handle Unicode data, this is a sample program, and the complexities of Unicode can be deferred.</p>
<pre class="sourceCode commonlisp"><code class="sourceCode commonlisp">(defun ascii-p (data)
"A not terribly great way to determine if the data is ASCII"
(unless (stringp data)
(loop for elem across data do
(when (> elem 127)
(return-from ascii-p nil))))
t)</code></pre>
<p>Next, we have a quick (and dirty) way to see if a URL is going to point to a binary. It just looks for a binary file extension in the string.</p>
<pre class="sourceCode commonlisp"><code class="sourceCode commonlisp">(defun known-binary-p (url)
"Is this url a binary file?"
(let ((binaries
'(".png" ".bmp" ".jpg" ".exe" ".dmg" ".package" ".css"
".ico" ".gif" ".dtd" ".pdf" ".xml" ".tgz")))
(dolist (b binaries NIL)
(when (search b url)
(return T)))))</code></pre>
<p>The next form is the most complex: <code>find-links</code>.</p>
<p><code>find-links</code> attempts to find all the links in a given url.</p>
<pre class="sourceCode commonlisp"><code class="sourceCode commonlisp">
(defun find-links (url)
"Scrapes links from a URL. Prints to STDOUT if an error is caught"
(when (and (http-p url)
(not (known-binary-p url)))
(handler-case
(let ((page (drakma:http-request url)))
(when page
(when (ascii-p page)
(cl-ppcre:all-matches-as-strings
*url-regex*
(if (stringp page)
page
(babel:octets-to-string page))))))
#+sbcl(sb-int:simple-stream-error (se) (format t "Whoops, ~a didn't work. ~a~%" url se))
(DRAKMA::DRAKMA-SIMPLE-ERROR (se) (format t "Error? ~a threw ~a~%" url se))
(USOCKET:TIMEOUT-ERROR (se) (format t "timeout error ~a threw ~a~%" url se))
(USOCKET:NS-HOST-NOT-FOUND-ERROR (se) (format t "host-not-found error ~a threw ~a~%" url se))
(FLEXI-STREAMS:EXTERNAL-FORMAT-ENCODING-ERROR (se) (format t "~a threw ~a~%" url se)))))</code></pre>
<p>The initial form is <code>WHEN</code> - a one-branch conditional. When we have a http link and it’s not a binary URL, then…</p>
<p><code>HANDLER-CASE</code> wraps some code. <code>HANDLER-CASE</code> is part of the Common Lisp condition system; it serves as the “try” block in this situation. The list of errors below form the “catch” blocks. The interested reader is referred to <a href="../quicklinks.html">the quicklinks section</a> for more resources. Note that the errors are typical network errors- timeouts, stream errors, encoding errors.</p>
<p>The first form in <code>HANDLER-CASE</code> requests in the url and assigns it to page. Supposing we got something, we make sure it’s an ascii page; if so, we then find all the url-ish looking things, using the previously defined global. Supposing that the page is, indeed, a string, we return it, otherwise we convert the octets to a string and return that. N.b.: Common Lisp makes a difference between strings and vectors of bytes. Of course, if an error occurred, the <code>HANDLER-CASE</code> will route to the known conditions.</p>
<p>Note that in one case, <code>#+sbcl</code> is present; this is a Common Lisp syntax to indicate that the following form is for SBCL only.</p>
<hr/>