= Getting Started With Mechanize

This guide is meant to get you started using Mechanize. By the end of it,
you should be able to fetch pages, click links, fill out and submit forms,
and scrape data. The guide only scratches the surface of what is available,
but it should be enough to get you going!

== Let's Fetch a Page!

First things first. Make sure that you've required mechanize and that you
instantiate a new mechanize object:

  require 'rubygems'
  require 'mechanize'

  agent = Mechanize.new

Now we'll use the agent we've created to fetch a page. Let's fetch Google
with our mechanize agent:

  page = agent.get('http://google.com/')

What just happened? We told mechanize to go pick up Google's main page.
Mechanize stored any cookies that were set, and followed any redirects that
Google may have sent. The agent gave us back a page that we can use to
scrape data, find links to click, or find forms to fill out.

Next, let's try finding some links to click.

== Finding Links

Mechanize returns a page object whenever you get a page, post, or submit a
form. When a page is fetched, the agent will parse the page and put a list
of links on the page object.

Now that we've fetched Google's homepage, let's try listing all of the links:

  page.links.each do |link|
    puts link.text
  end

We can list the links, but Mechanize gives a few shortcuts to help us find a
link to click on. Let's say we wanted to click the link whose text is 'News'.
Normally, we would have to do this:

  page = agent.page.links.find { |l| l.text == 'News' }.click

But Mechanize gives us a shortcut. Instead we can say this:

  page = agent.page.link_with(:text => 'News').click

That shortcut says "find the first link with the text 'News'". You're probably
thinking "there could be multiple links with that text!", and you would be
correct! If you use the plural form, links_with, you can access the whole
list. If you wanted to click on the second news link, you could do this:

  agent.page.links_with(:text => 'News')[1].click

We can even find a link with a certain href like so:

  page.link_with(:href => '/something')

Or combine criteria to find a link with certain text and a certain href:

  page.link_with(:text => 'News', :href => '/something')

These shortcuts that mechanize provides are available on any list that you
can fetch, like frames, iframes, or forms. Now that we know how to find and
click links, let's try something more complicated like filling out a form.

== Filling Out Forms

Let's continue with our Google example. Here's the code we have so far:

  require 'rubygems'
  require 'mechanize'

  agent = Mechanize.new
  page = agent.get('http://google.com/')

If we pretty print the page, we can see that there is one form named 'f',
that has a couple of buttons and a few fields:

  pp page

Now that we know the name of the form, let's fetch it off the page:

  google_form = page.form('f')

Mechanize lets you access form input fields in a few different ways, but the
most convenient is that you can access input fields as accessors on the
object. So let's set the form field named 'q' on the form to 'ruby mechanize':

  google_form.q = 'ruby mechanize'

To make sure that we set the value, let's pretty print the form, and you should
see a line similar to this:

  #<Mechanize::Field:0x1403488 @name="q", @value="ruby mechanize">

If you saw that the value of 'q' changed, you're on the right track! Now we
can submit the form, 'press' the submit button, and print the results:

  page = agent.submit(google_form, google_form.buttons.first)
  pp page

What we just did was equivalent to putting text in the search field and
clicking the 'Google Search' button. If we had submitted the form without
a button, it would be like typing in the text field and pressing the Return
key.

Let's take a look at the code all together:

  require 'rubygems'
  require 'mechanize'

  agent = Mechanize.new
  page = agent.get('http://google.com/')
  google_form = page.form('f')
  google_form.q = 'ruby mechanize'
  page = agent.submit(google_form)
  pp page

Before we go on to screen scraping, let's take a look at forms a little more
in depth. Unless you want to skip ahead!

== Advanced Form Techniques

In this section, I want to touch on using the different types of input fields
possible with a form. Password and textarea fields can be treated just like
text input fields. Select fields are very similar to text fields, but they
have many options associated with them. If you select one option, mechanize
will de-select the other options (unless it is a multi select!).

For example, let's select an option on a list:

  form.field_with(:name => 'list').options[0].select

Now let's take a look at checkboxes and radio buttons. To select a checkbox,
just check it like this:

  form.checkbox_with(:name => 'box').check

Radio buttons are very similar to checkboxes, but they know how to uncheck
other radio buttons of the same name. Just check a radio button like you
would a checkbox:

  form.radiobuttons_with(:name => 'box')[1].check

Mechanize also makes file uploads easy! Just find the file upload field, and
tell it what file name you want to upload:

  form.file_uploads.first.file_name = "somefile.jpg"

== Scraping Data

Mechanize uses nokogiri[http://nokogiri.org/] to parse HTML. What does this
mean for you? You can treat a mechanize page like a nokogiri object. After
you have used Mechanize to navigate to the page that you need to scrape, then
scrape it using nokogiri methods:

  agent.get('http://someurl.com/').search("p.posted")

The expression given to Mechanize::Page#search may be a CSS expression or an
XPath expression:

  agent.get('http://someurl.com/').search(".//p[@class='posted']")
