= Getting Started With Mechanize

This guide is meant to get you started using Mechanize. By the end of this
guide, you should be able to fetch pages, click links, fill out and submit
forms, scrape data, and do many other hopefully useful things. This guide
only scratches the surface of what is available, but it should be enough
information to get you going!

== Let's Fetch a Page!

First things first. Make sure that you've required mechanize and that you
instantiate a new Mechanize object:

  require 'rubygems'
  require 'mechanize'

  agent = Mechanize.new

Now we'll use the agent we've created to fetch a page. Let's fetch Google
with our Mechanize agent:

  page = agent.get('http://google.com/')

What just happened? We told Mechanize to go pick up Google's main page.
Mechanize stored any cookies that were set, and followed any redirects that
Google may have sent. The agent gave us back a page that we can use to
scrape data, find links to click, or find forms to fill out.

Next, let's try finding some links to click.

== Finding Links

Mechanize returns a page object whenever you get a page, post, or submit a
form. When a page is fetched, the agent will parse the page and put a list
of links on the page object.

Now that we've fetched Google's homepage, let's try listing all of the links:

  page.links.each do |link|
    puts link.text
  end

We can list the links, but Mechanize gives a few shortcuts to help us find a
link to click on. Let's say we wanted to click the link whose text is 'News'.
Normally, we would have to do this:

  page = agent.page.links.find { |l| l.text == 'News' }.click

But Mechanize gives us a shortcut. Instead we can say this:

  page = agent.page.link_with(:text => 'News').click

That shortcut says "find the first link whose text is 'News'". You're probably
thinking "there could be multiple links with that text!", and you would be
correct! If you use the plural form, you can access the list.
If you wanted to click on the second news link, you could do this:

  agent.page.links_with(:text => 'News')[1].click

We can even find a link with a certain href like so:

  page.link_with(:href => '/something')

Or combine criteria to find a link with certain text and a certain href:

  page.link_with(:text => 'News', :href => '/something')

The shortcuts that Mechanize provides are available on any list that you
can fetch, like frames, iframes, or forms. Now that we know how to find and
click links, let's try something more complicated, like filling out a form.

== Filling Out Forms

Let's continue with our Google example. Here's the code we have so far:

  require 'rubygems'
  require 'mechanize'

  agent = Mechanize.new
  page = agent.get('http://google.com/')

If we pretty print the page, we can see that there is one form named 'f'
that has a couple of buttons and a few fields:

  pp page

Now that we know the name of the form, let's fetch it off the page:

  google_form = page.form('f')

Mechanize lets you access form input fields in a few different ways, but the
most convenient is that you can access input fields as accessors on the
object. So let's set the form field named 'q' on the form to 'ruby mechanize':

  google_form.q = 'ruby mechanize'

To make sure that we set the value, let's pretty print the form, and you
should see a line similar to this:

  #<Mechanize::Field:0x1403488 @name="q", @value="ruby mechanize">

If you saw that the value of 'q' changed, you're on the right track! Now we
can submit the form, 'press' the submit button, and print the results:

  page = agent.submit(google_form, google_form.buttons.first)
  pp page

What we just did was equivalent to putting text in the search field and
clicking the 'Google Search' button. If we had submitted the form without
a button, it would be like typing in the text field and pressing the Return
key.

Let's take a look at the code all together:

  require 'rubygems'
  require 'mechanize'

  agent = Mechanize.new
  page = agent.get('http://google.com/')
  google_form = page.form('f')
  google_form.q = 'ruby mechanize'
  page = agent.submit(google_form, google_form.buttons.first)
  pp page

Before we go on to screen scraping, let's take a look at forms in a little
more depth. Unless you want to skip ahead!

== Advanced Form Techniques

In this section, I want to touch on using the different types of input fields
possible with a form. Password and textarea fields can be treated just like
text input fields. Select fields are very similar to text fields, but they
have many options associated with them. If you select one option, Mechanize
will de-select the other options (unless it is a multi select!).

For example, let's select an option on a list:

  form.field_with(:name => 'list').options[0].select

Now let's take a look at checkboxes and radio buttons. To select a checkbox,
just check it like this:

  form.checkbox_with(:name => 'box').check

Radio buttons are very similar to checkboxes, but they know how to uncheck
other radio buttons of the same name. Just check a radio button like you
would a checkbox:

  form.radiobuttons_with(:name => 'box')[1].check

Mechanize also makes file uploads easy! Just find the file upload field, and
tell it what file name you want to upload:

  form.file_uploads.first.file_name = "somefile.jpg"

== Scraping Data

Mechanize uses nokogiri[http://nokogiri.org/] to parse HTML. What does this
mean for you? You can treat a Mechanize page like a Nokogiri object. Once
you have used Mechanize to navigate to the page that you need to scrape,
scrape it using Nokogiri methods:

  agent.get('http://someurl.com/').search("p.posted")

The expression given to Mechanize::Page#search may be a CSS expression or an
XPath expression:

  agent.get('http://someurl.com/').search(".//p[@class='posted']")