# Spidr

* [spidr.rubyforge.org](http://spidr.rubyforge.org/)
* [github.com/postmodern/spidr](http://github.com/postmodern/spidr)
* [github.com/postmodern/spidr/issues](http://github.com/postmodern/spidr/issues)
* [groups.google.com/group/spidr](http://groups.google.com/group/spidr)
* irc.freenode.net #spidr

## Description

Spidr is a versatile Ruby web spidering library that can spider a single
site, multiple domains, only certain links, or follow links indefinitely.
Spidr is designed to be fast and easy to use.

## Features

* Follows:
  * a tags.
  * iframe tags.
  * frame tags.
  * Cookie protected links.
  * HTTP 300, 301, 302, 303 and 307 Redirects.
  * HTTP Basic Auth protected links.
* Black-list or white-list URLs based upon:
  * URL scheme.
  * Host name
  * Port number
  * Full link
  * URL extension
* Provides call-backs for:
  * Every visited Page.
  * Every visited URL.
  * Every visited URL that matches a specified pattern.
  * Every origin and destination URI of a link.
  * Every URL that failed to be visited.
* Provides action methods to:
  * Pause spidering.
  * Skip processing of pages.
  * Skip processing of links.
* Restore the spidering queue and history from a previous session.
* Custom User-Agent strings.
* Custom proxy settings.
* HTTPS support.

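To make the black-list and white-list idea concrete, here is a minimal sketch in plain Ruby (standard library only, no Spidr). It is not Spidr's implementation. The helpers `matches_any?` and `visit?` are hypothetical, and the rule keys `:schemes` and `:ignore_exts` are illustrative names chosen for this sketch. It assumes String rules must match exactly and Regexp rules may match anywhere in the value.

```ruby
require 'uri'

# Hypothetical helper, not part of Spidr: true when `value` satisfies
# any rule. Strings and Integers must be equal to the value, Regexps
# may match anywhere in it, and Procs are called with the value.
def matches_any?(rules, value)
  rules.any? do |rule|
    case rule
    when Regexp then !(value.to_s =~ rule).nil?
    when Proc   then rule.call(value) ? true : false
    else             rule == value
    end
  end
end

# Decide whether a URL should be visited, given optional white-lists
# (:schemes, :hosts) and black-lists (:ignore_ports, :ignore_exts).
def visit?(url, rules = {})
  uri = URI(url)
  ext = File.extname(uri.path).delete_prefix('.')

  return false if rules[:schemes] && !matches_any?(rules[:schemes], uri.scheme)
  return false if rules[:hosts] && !matches_any?(rules[:hosts], uri.host)
  return false if matches_any?(rules.fetch(:ignore_ports, []), uri.port)
  return false if matches_any?(rules.fetch(:ignore_exts, []), ext)

  true
end

rules = {
  :schemes      => ['http', 'https'],
  :hosts        => ['company.com', /\Ahost\d\.company\.com\z/],
  :ignore_ports => [8080],
  :ignore_exts  => ['pdf', 'zip']
}

puts visit?('http://company.com/about', rules)
puts visit?('http://host3.company.com:8080/', rules)
puts visit?('https://company.com/report.pdf', rules)
```

In this sketch a white-list that is absent allows everything, while a black-list that is absent blocks nothing, so an empty rule set visits every URL.
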
## Examples

Start spidering from a URL:

    Spidr.start_at('http://tenderlovemaking.com/')

Spider a host:

    Spidr.host('coderrr.wordpress.com')

Spider a site:

    Spidr.site('http://rubyflow.com/')

Spider multiple hosts:

    Spidr.start_at(
      'http://company.com/',
      :hosts => [
        'company.com',
        /host\d\.company\.com/
      ]
    )

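The Regexp `/host\d\.company\.com/` has no anchors. Assuming a host pattern is applied as an ordinary Regexp match against the host name (an assumption about the matching, not something this README states), it would also accept hosts that merely contain the pattern. The plain-Ruby sketch below compares it with an anchored variant; the constant names and the extra host names are made up for illustration.

```ruby
# Plain Ruby, no Spidr required.
LOOSE_RULE  = /host\d\.company\.com/        # pattern as written in the example
STRICT_RULE = /\Ahost\d\.company\.com\z/    # anchored to the whole host name

hosts = [
  'host1.company.com',
  'host1.company.com.evil.example',
  'ghost7.company.com'
]

hosts.each do |host|
  loose  = LOOSE_RULE.match?(host)
  strict = STRICT_RULE.match?(host)
  puts "#{host}: loose=#{loose} strict=#{strict}"
end
```

If only the exact `hostN.company.com` names should be spidered, the anchored form is the safer choice under that assumption.
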
Do not spider certain links:

    Spidr.site('http://matasano.com/', :ignore_links => [/log/])

Do not spider links on certain ports:

    Spidr.site(
      'http://sketchy.content.com/',
      :ignore_ports => [8000, 8010, 8080]
    )

Print out visited URLs:

    Spidr.site('http://rubyinside.org/') do |spider|
      spider.every_url { |url| puts url }
    end

Build a URL map of a site:

    url_map = Hash.new { |hash,key| hash[key] = [] }

    Spidr.site('http://intranet.com/') do |spider|
      spider.every_link do |origin,dest|
        url_map[dest] << origin
      end
    end

Print out the URLs that could not be requested:

    Spidr.site('http://sketchy.content.com/') do |spider|
      spider.every_failed_url { |url| puts url }
    end

Find all pages which have broken links:

    url_map = Hash.new { |hash,key| hash[key] = [] }

    spider = Spidr.site('http://intranet.com/') do |spider|
      spider.every_link do |origin,dest|
        url_map[dest] << origin
      end
    end

    spider.failures.each do |url|
      puts "Broken link #{url} found in:"

      url_map[url].each { |page| puts "  #{page}" }
    end

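The broken-link report relies on a Hash whose default block creates an empty Array per destination, so every destination maps to the pages that link to it. The sketch below runs the same grouping on hard-coded data, without Spidr or network access. `LINKS`, `FAILURES` and `broken_link_report` are hypothetical names, and plain strings stand in for whatever URL objects the spider yields.

```ruby
# Simulated crawl data: [origin, destination] pairs and failed URLs.
LINKS = [
  ['http://intranet.com/',      'http://intranet.com/about'],
  ['http://intranet.com/',      'http://intranet.com/old-news'],
  ['http://intranet.com/about', 'http://intranet.com/old-news']
]
FAILURES = ['http://intranet.com/old-news']

# Group origins by destination, then keep only the failed destinations.
def broken_link_report(links, failures)
  url_map = Hash.new { |hash, key| hash[key] = [] }
  links.each { |origin, dest| url_map[dest] << origin }

  failures.each_with_object({}) do |url, report|
    # fetch bypasses the default block, so unknown URLs are not added
    # to url_map as a side effect.
    report[url] = url_map.fetch(url, [])
  end
end

broken_link_report(LINKS, FAILURES).each do |url, pages|
  puts "Broken link #{url} found in:"
  pages.each { |page| puts "  #{page}" }
end
```

Indexing the map with `url_map[url]`, as the example does, also works; it simply stores an empty Array for any failed URL that was never seen as a link destination.
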
Search HTML and XML pages:

    Spidr.site('http://company.withablog.com/') do |spider|
      spider.every_page do |page|
        puts "[-] #{page.url}"

        page.search('//meta').each do |meta|
          name = (meta.attributes['name'] || meta.attributes['http-equiv'])
          value = meta.attributes['content']

          puts "  #{name} = #{value}"
        end
      end
    end

Print out the titles from every page:

    Spidr.site('http://www.rubypulse.com/') do |spider|
      spider.every_html_page do |page|
        puts page.title
      end
    end

Find what kinds of web servers a host is using, by accessing the headers:

    require 'set'

    servers = Set[]

    Spidr.host('generic.company.com') do |spider|
      spider.all_headers do |headers|
        servers << headers['server']
      end
    end

Pause the spider on a forbidden page:

    spider = Spidr.host('overnight.startup.com') do |spider|
      spider.every_forbidden_page do |page|
        spider.pause!
      end
    end

Skip the processing of a page:

    Spidr.host('sketchy.content.com') do |spider|
      spider.every_missing_page do |page|
        spider.skip_page!
      end
    end

Skip the processing of links:

    Spidr.host('sketchy.content.com') do |spider|
      spider.every_url do |url|
        if url.path.split('/').find { |dir| dir.to_i > 1000 }
          spider.skip_link!
        end
      end
    end

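The condition in the link-skipping example can be tried on its own with the standard `uri` library. The sketch below extracts it into a hypothetical helper, `large_numeric_dir?`, to show which paths it selects. It uses `any?` where the example uses `find`; both are truthy for the same inputs.

```ruby
require 'uri'

# Hypothetical helper: true when any path segment converts to an
# integer greater than 1000. String#to_i reads leading digits only,
# so "5000-free-things" counts as 5000 and "post" counts as 0.
def large_numeric_dir?(url)
  URI(url).path.split('/').any? { |dir| dir.to_i > 1000 }
end

puts large_numeric_dir?('http://sketchy.content.com/archive/2009/post')
puts large_numeric_dir?('http://sketchy.content.com/page/12')
puts large_numeric_dir?('http://sketchy.content.com/5000-free-things')
```

Because `to_i` accepts a numeric prefix, a stricter variant might first check the segment against `/\A\d+\z/` so that only purely numeric segments are considered.
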
## Requirements

* [nokogiri](http://nokogiri.rubyforge.org/) >= 1.2.0

## Install

    $ sudo gem install spidr

## License

See {file:LICENSE.txt} for license information.
