web_dump

A tiny class for easily saving and retrieving web pages.

Web client applications such as spiders frequently need to save pages to files under an adequate naming convention. WebDump comes to the rescue: it manages the details of assigning unique, readable file names derived from the URIs that have been visited, and saves the data under those names. Saved data can also be compressed with gzip, which is convenient for deep web spidering. Compression depends only on giving the appropriate file extension when saving.

Conversely, saved files can be read back through convenient methods that take either a pathname or a URI.
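To illustrate the idea of choosing compression from the file extension alone, here is a minimal, self-contained sketch. The helpers write_by_ext and read_by_ext are hypothetical names invented for this illustration; they are not part of web_dump, and the gem's internals may well differ.

```ruby
require 'zlib'
require 'tmpdir'

# Hypothetical helpers (NOT part of web_dump): pick gzip or plain I/O
# purely from the file extension.
def write_by_ext(path, data)
  if File.extname(path) == '.gz'
    Zlib::GzipWriter.open(path) { |gz| gz.write(data) }
  else
    File.binwrite(path, data)
  end
  path
end

def read_by_ext(path)
  if File.extname(path) == '.gz'
    Zlib::GzipReader.open(path) { |gz| gz.read }
  else
    File.binread(path)
  end
end

# Round trip in a temporary directory, once compressed and once plain.
Dir.mktmpdir do |dir|
  page = '<html><body>hello</body></html>'
  gz_path  = write_by_ext(File.join(dir, 'page.html.gz'), page)
  raw_path = write_by_ext(File.join(dir, 'page.html'), page)
  puts read_by_ext(gz_path) == page
  puts read_by_ext(raw_path) == page
end
```

The caller never asks for compression explicitly; changing the extension is enough, which is the behaviour described above.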

Installation

$ sudo gem install web_dump

The main source repository is github.com/syborg/web_dump.

Usage

Instantiating

First of all …

require 'rubygems'
require 'web_dump'

When instantiating an object, you can pass options through a Hash:

wd = WebDump.new :base_dir => '~/mydir', :file_ext => '.gz'

When asked to, 'wd' will save all files inside the expanded directory '~/mydir', appending the file extension '.gz' to each file name (unless overridden later).

The following options can be passed when instantiating an object. Most of them are passed directly along to a UriPathname object that is created.

:file_ext => extension (String appended to the end of every file name, unless changed in the save method)
:base_dir => dir_name (directory where everything will be stored; defaults to '~/web_dumps')
:pth_sep => psep (String substituted for '/' inside the URI's path and query; defaults to UriPathname::PTH_SEP = '_|_')
:host_sep => hsep (String that separates the URI's hostname from its path when constructing the pathname; if '/' is used, the hostname actually becomes a subdirectory; defaults to UriPathname::HOST_SEP = '__|')
:no_path => nopath (String used as a path placeholder when the URI has no path; defaults to UriPathname::NO_PTH = 'NOPATH')
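As a rough illustration of how these options could combine into a file name, consider the sketch below. It is NOT the UriPathname implementation: sketch_pathname is a hypothetical function written for this README, its defaults simply copy the values documented above, and the exact layout of the resulting name is an assumption.

```ruby
require 'uri'

# Illustrative only: NOT the real UriPathname code. The keyword defaults
# mirror the documented option defaults; the output format is assumed.
def sketch_pathname(uri, base_dir: '~/web_dumps', file_ext: '',
                    pth_sep: '_|_', host_sep: '__|', no_path: 'NOPATH')
  u = URI.parse(uri)
  path = u.path.to_s.sub(%r{\A/}, '')   # drop the leading slash
  path = no_path if path.empty?         # placeholder when the URI has no path
  path += "?#{u.query}" if u.query      # keep the query as part of the name
  # Replace every remaining '/' (path and query) with the path separator.
  name = u.host + host_sep + path.gsub('/', pth_sep) + file_ext
  File.join(File.expand_path(base_dir), name)
end

puts sketch_pathname('http://hello.world.com/a/b', base_dir: '/tmp/dumps', file_ext: '.gz')
puts sketch_pathname('http://hello.world.com', base_dir: '/tmp/dumps')
```

Passing host_sep: '/' in this sketch makes the hostname a subdirectory of base_dir, which matches the behaviour described for :host_sep.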

Saving Web Contents

You should use WebDump#save, for example:

wd.save "http://hello.world.com/hithere", data

Retrieving Web Contents

You can retrieve data with either of two read methods, which take a URI or a pathname as their main argument:

data = wd.read_uri(uri)

or

data = wd.read_pathname(f)

Example

Here is a complete example

require 'rubygems'
require 'open-uri'
require 'web_dump'

MY_URIS = [
  'http://en.wikipedia.org/wiki/Ruby_Bridges',
  'http://donaldfagen.com/disc_nightfly.php',
  'http://www.rubi.cat/ajrubi/portada/index.php',
  'http://www.google.com/cse?q=array&cx=013598269713424429640%3Ag5orptiw95w&ie=UTF-8&sa=Search'
]

# all files will be saved in expanded '~/mydir' with file extension '.gz'
wd = WebDump.new :base_dir => '~/mydir', :file_ext => '.gz'

# Don't care about filenames while saving pages into files
puts "Saving data using URIs"
MY_URIS.each do |uri|
  # URI.open needs Ruby 2.5+; plain Kernel#open no longer fetches URIs as of Ruby 3.0
  URI.open(uri) do |u|
    data = u.read
    puts wd.save uri, data
  end
end

# Possibly mocking? ... don't care about filenames while retrieving pages from files.
puts "\nRetrieving data using URIs"
MY_URIS.each do |uri|
  data = wd.read_uri(uri)
  puts data[0...100].gsub(/\s+/, ' ').strip
end

# ... or, conversely, use filenames if you need so
puts "\nRetrieving data using pathnames"
files = Dir[File.expand_path('*.gz', '~/mydir')]
files.each do |f|
  data = wd.read_pathname(f)
  puts data[0...100].gsub(/\s+/, ' ').strip
end

Note on Patches/Pull Requests

Fork the project.
Make your feature addition or bug fix.
Add tests for it. This is important so I don’t break it in a future version unintentionally.
Commit, but do not mess with the Rakefile, version, or history. (If you want to have your own version, that is fine, but bump the version in a commit by itself that I can ignore when I pull.)
Send me a pull request. Bonus points for topic branches.
