= Getting Started With Mechanize

This guide is meant to get you started using Mechanize. By the end of this
guide, you should be able to fetch pages, click links, fill out and submit
forms, scrape data, and do many other useful things. This guide only
scratches the surface of what is available, but it should be enough
information to get you going!

== Let's Fetch a Page!

First things first. Make sure that you've required Mechanize and that you
instantiate a new Mechanize object:

  require 'rubygems'
  require 'mechanize'

  agent = Mechanize.new

Now we'll use the agent we've created to fetch a page. Let's fetch Google
with our Mechanize agent:

  page = agent.get('http://google.com/')

What just happened? We told Mechanize to go pick up Google's main page.
Mechanize stored any cookies that were set, and followed any redirects that
Google may have sent. The agent gave us back a page that we can use to
scrape data, find links to click, or find forms to fill out.

Next, let's try finding some links to click.

== Finding Links

Mechanize returns a page object whenever you get a page, post, or submit a
form. When a page is fetched, the agent will parse the page and put a list
of links on the page object.

Now that we've fetched Google's homepage, let's try listing all of the links:

  page.links.each do |link|
    puts link.text
  end

We can list the links, but Mechanize gives a few shortcuts to help us find a
link to click on. Let's say we wanted to click the link whose text is 'News'.
Normally, we would have to do this:

  page = agent.page.links.find { |l| l.text == 'News' }.click

But Mechanize gives us a shortcut. Instead we can say this:

  page = agent.page.link_with(:text => 'News').click

That shortcut says "find the first link whose text is 'News'". You're probably
thinking "there could be multiple links with that text!", and you would be
correct! If you use the plural form, you can access the list.
If you wanted to click on the second news link, you could do this:

  agent.page.links_with(:text => 'News')[1].click

We can even find a link with a certain href like so:

  page.link_with(:href => '/something')

Or combine criteria to find a link with certain text and a certain href:

  page.link_with(:text => 'News', :href => '/something')

These shortcuts that Mechanize provides are available on any list that you
can fetch, such as frames, iframes, or forms. Now that we know how to find and
click links, let's try something more complicated like filling out a form.

== Filling Out Forms

Let's continue with our Google example. Here's the code we have so far:

  require 'rubygems'
  require 'mechanize'

  agent = Mechanize.new
  page = agent.get('http://google.com/')

If we pretty print the page, we can see that there is one form named 'f'
that has a couple of buttons and a few fields:

  pp page

Now that we know the name of the form, let's fetch it off the page:

  google_form = page.form('f')

Mechanize lets you access form input fields in a few different ways, but the
most convenient is that you can access input fields as accessors on the
object. So let's set the form field named 'q' on the form to 'ruby mechanize':

  google_form.q = 'ruby mechanize'

To make sure that we set the value, let's pretty print the form, and you should
see a line similar to this:

  #<Mechanize::Field:0x1403488 @name="q", @value="ruby mechanize">

If you saw that the value of 'q' changed, you're on the right track! Now we
can submit the form, 'press' the submit button, and print the results:

  page = agent.submit(google_form, google_form.buttons.first)
  pp page

What we just did was equivalent to putting text in the search field and
clicking the 'Google Search' button. If we had submitted the form without
a button, it would be like typing in the text field and pressing the Return
key.

Let's take a look at the code all together:

  require 'rubygems'
  require 'mechanize'

  agent = Mechanize.new
  page = agent.get('http://google.com/')
  google_form = page.form('f')
  google_form.q = 'ruby mechanize'
  page = agent.submit(google_form, google_form.buttons.first)
  pp page

Before we go on to screen scraping, let's take a look at forms a little more
in depth. Unless you want to skip ahead!

== Advanced Form Techniques

In this section, I want to touch on using the different types of input fields
possible with a form. Password and textarea fields can be treated just like
text input fields. Select fields are very similar to text fields, but they
have many options associated with them. If you select one option, Mechanize
will de-select the other options (unless it is a multi select!).

For example, let's select an option on a list:

  form.field_with(:name => 'list').options[0].select

Now let's take a look at checkboxes and radio buttons. To select a checkbox,
just check it like this:

  form.checkbox_with(:name => 'box').check

Radio buttons are very similar to checkboxes, but they know how to uncheck
other radio buttons of the same name. Just check a radio button like you
would a checkbox:

  form.radiobuttons_with(:name => 'box')[1].check

Mechanize also makes file uploads easy! Just find the file upload field, and
tell it what file name you want to upload:

  form.file_uploads.first.file_name = "somefile.jpg"

== Scraping Data

Mechanize uses nokogiri[http://nokogiri.org/] to parse HTML. What does this
mean for you? You can treat a Mechanize page like a Nokogiri object. After
you have used Mechanize to navigate to the page that you need to scrape,
scrape it using Nokogiri methods:

  agent.get('http://someurl.com/').search("p.posted")

The expression given to Mechanize::Page#search may be a CSS expression or an
XPath expression:

  agent.get('http://someurl.com/').search(".//p[@class='posted']")