A tiny class to easily save and retrieve web pages
In web-related client applications, such as spiders, it is frequently necessary to save pages into files with an adequate naming convention. WebDump comes to the rescue. It manages the details of assigning unique, readable names and saves files named after the URIs that have been visited. Additionally, saved data can be conveniently compressed with gzip, which is useful for deep web spidering. Compression depends only on giving the appropriate file extension (for example '.gz') when saving.
Conversely, files can be read back through convenient methods that take either a pathname or a URI.
$ sudo gem install web_dump
The main source repository is github.com/syborg/web_dump.
First of all …
require 'rubygems'
require 'web_dump'
When instantiating an object, you may pass some options through a Hash
wd = WebDump.new :base_dir => '~/mydir', :file_ext => '.gz'
`wd`, when asked to, will save all files inside the expanded directory '~/mydir', with the file extension '.gz' appended to each name (unless overridden later).
The following options can be passed when instantiating an object. Most of them are passed directly to a UriPathname object that is created.
- `:file_ext => extension` (String appended to the end of every filename, unless changed in the save method)
- `:base_dir => dir_name` (directory where everything will be stored; defaults to '~/web_dumps')
- `:pth_sep => psep` (String used to substitute '/' inside the URI's path and query; defaults to UriPathname::PTH_SEP = '_|_')
- `:host_sep => hsep` (String used to separate the URI's hostname and path when constructing the pathname; if '/' is used, the hostname actually becomes a subdirectory; defaults to UriPathname::HOST_SEP = '__|')
- `:no_path => nopath` (String used as a path placeholder when the URI has no path; defaults to UriPathname::NO_PTH = 'NOPATH')
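To make the role of these options more concrete, here is a minimal sketch of how a URI could be turned into a pathname. The method `sketch_pathname` is hypothetical and is not UriPathname's real implementation; the names the gem actually generates may differ (for instance in how queries or special characters are handled). Only the option names and default values are taken from the list above.

```ruby
require 'uri'

# Hypothetical helper, NOT UriPathname's real code: it only illustrates
# how the separator options could shape the resulting file name.
def sketch_pathname(uri_str, opts = {})
  base_dir = File.expand_path(opts.fetch(:base_dir, '~/web_dumps'))
  pth_sep  = opts.fetch(:pth_sep, '_|_')
  host_sep = opts.fetch(:host_sep, '__|')
  no_path  = opts.fetch(:no_path, 'NOPATH')
  file_ext = opts.fetch(:file_ext, '')

  uri  = URI.parse(uri_str)
  path = uri.path.to_s.sub(%r{\A/}, '')   # drop the leading '/'
  path = no_path if path.empty?           # placeholder when there is no path
  path += "?#{uri.query}" if uri.query
  # Replace every '/' so that path and query fit into a single file name
  name = path.gsub('/', pth_sep)

  # With :host_sep => '/', the hostname becomes a subdirectory of base_dir
  File.join(base_dir, "#{uri.host}#{host_sep}#{name}#{file_ext}")
end

puts sketch_pathname('http://hello.world.com/a/b',
                     :base_dir => '/tmp/dumps', :file_ext => '.gz')
```

Under these assumptions, the slashes of the URI path are replaced by the path separator, and the hostname is glued to the front with the host separator.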
To save a page, use WebDump#save, for example:
wd.save "http://hello.world.com/hithere", data
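As an illustration of compression that depends only on the file extension, the following sketch writes data through gzip when the target name ends in '.gz' and writes it unchanged otherwise. The method `sketch_save` is a hypothetical stand-in built on Ruby's standard Zlib library; it is not WebDump#save, whose internals may well differ.

```ruby
require 'zlib'
require 'fileutils'

# Hypothetical helper, not WebDump#save itself: gzip the data only when
# the target file name ends in '.gz'; otherwise write it as is.
def sketch_save(pathname, data)
  # Make sure the target directory exists before writing
  FileUtils.mkdir_p(File.dirname(pathname))
  if File.extname(pathname) == '.gz'
    Zlib::GzipWriter.open(pathname) { |gz| gz.write(data) }
  else
    File.open(pathname, 'wb') { |f| f.write(data) }
  end
  pathname
end
```

Returning the pathname is a choice made for this sketch, so that the caller can print or log where the data went.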
You can retrieve data with two flavours of read method, taking either a URI or a pathname as the main argument:
data = wd.read_uri(uri)
or
data = wd.read_pathname(f)
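Reading can mirror saving: the extension of the file tells whether it has to be decompressed. The sketch below assumes exactly that; `sketch_read` is a hypothetical method based on Ruby's standard Zlib library and is not the actual code behind read_pathname.

```ruby
require 'zlib'

# Hypothetical helper, not WebDump#read_pathname itself: decompress the
# file when its name ends in '.gz', otherwise return its raw content.
def sketch_read(pathname)
  if File.extname(pathname) == '.gz'
    Zlib::GzipReader.open(pathname) { |gz| gz.read }
  else
    File.binread(pathname)
  end
end
```

A URI-based reader could then be a thin wrapper that first maps the URI to its pathname and calls a method like this one.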
Here is a complete example
require 'rubygems'
require 'open-uri'
require 'web_dump'

MY_URIS = [
  'http://en.wikipedia.org/wiki/Ruby_Bridges',
  'http://donaldfagen.com/disc_nightfly.php',
  'http://www.rubi.cat/ajrubi/portada/index.php',
  'http://www.google.com/cse?q=array&cx=013598269713424429640%3Ag5orptiw95w&ie=UTF-8&sa=Search'
]

# all files will be saved in expanded '~/mydir' with file extension '.gz'
wd = WebDump.new :base_dir => '~/mydir', :file_ext => '.gz'

# Don't care about filenames while saving pages into files
puts "Saving data using URIs"
MY_URIS.each do |uri|
  # URI.open needs Ruby >= 2.5; on older Rubies use plain `open uri` instead
  URI.open uri do |u|
    data = u.read
    puts wd.save uri, data
  end
end

# Possibly mocking? ... don't care about filenames while retrieving pages from files.
puts "\nRetrieving data using URIs"
MY_URIS.each do |uri|
  data = wd.read_uri(uri)
  puts data[0...100].gsub(/\s+/, ' ').strip
end

# ... or, conversely, use filenames if you need so
puts "\nRetrieving data using pathnames"
files = Dir[File.expand_path('*.gz', '~/mydir')]
files.each do |f|
  data = wd.read_pathname(f)
  puts data[0...100].gsub(/\s+/, ' ').strip
end
- Fork the project.
- Make your feature addition or bug fix.
- Add tests for it. This is important so I don't break it in a future version unintentionally.
- Commit; do not mess with the Rakefile, version, or history. (If you want to have your own version, that is fine, but bump the version in a commit by itself so that I can ignore it when I pull.)
- Send me a pull request. Bonus points for topic branches.
Copyright © 2011 Marcel Massana. See LICENSE for details.