initial commit

scrubber · Oct 19, 2008 · a94877c · a94877c
commit a94877c
Showing 512 changed files with 29,532 additions and 0 deletions.
diff --git a/CHANGELOG b/CHANGELOG
@@ -0,0 +1,338 @@
+= scRUBYt! Changelog
+
+== 0.4.05
+=== 20th October
+
+=<tt>changes:</tt>
+- [NEW] possibility to use FireWatir as the agent for scraping
+- [NEW] navigation actions: click_by_xpath, click_link_and_wait
+- [MOD] dropped dependencies: RubyInline, ParseTree, Ruby2Ruby (hooray for win32 users)
+- [MOD] exporting temporarily doesn't work - for now, generated XPaths are printed to the screen
+- [FIX] lots of bugfixes and stability fixes
+
+== 0.4.0 (unofficial)
+=== 31st October, 2007
+
+=<tt>changes:</tt>
+- [NEW] possibility to define a default value for patterns
+- [MOD] rewrite of to_flat_xml to a more robust algorithm
+- [NEW] find_string method in text pattern; return the string if it's present in the input
+
+== 0.3.4
+=== 26th September, 2007
+
+=<tt>changes:</tt>
+
+
+
+== 0.3.1
+=== 29th May, 2007
+
+=<tt>changes:</tt>
+
+[NEW] complete rewrite of the output system, creating
+      a solid foundation for more robust output functions
+      (credit: Neelance)
+[NEW] logging - no annoying puts messages anymore! (credit: Tim Fletcher)
+[NEW] can index an example - e.g.
+      link 'more[5]'
+      semantics: give me the 6th element with the text 'more'
+[NEW] can use XPath checking an attribute value, like "//div[@id='content']"
+[NEW] default values for missing elements (first version was done in 0.2.8
+      but it did not work for all cases)
+[NEW] possibility to click a button by its text (instead of its index)
+      (credit: Nick Merwin)
+[NEW] clicking radio buttons
+[NEW] can click on image buttons (by specifying the name of the button)
+[NEW] possibility to extract a URL in one step, like so:
+      link 'The Difference/@href'
+      i.e. give me the href attribute of the element matched by the example 'The Difference'
+[NEW] new way to match an element of the page:
+      div 'div[The Difference]'
+      means 'return the div which contains the string "The Difference"'. This is
+      useful if the XPath of the element is non-constant across the same site
+      (e.g. sometimes a banner or ad is added, sometimes not)
+[NEW] Clicking image maps; At the moment this is achieved by specifying an 
+      index, like
+      click_image_map 3
+      which means click the 4th link in the image map
+[FIX] Replacing \240 (&nbsp;) with space in the preprocessing phase 
+      automatically
+[FIX] Fixed: correctly downloading image if the src
+      attribute had a leading space, as in
+      <img src=' /files/downloads/images/image.jpg'/>
+[FIX] Other misc fixes - a ton of them!
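
The indexed example ('more[5]') and the attribute shortcut ('The Difference/@href') described above both pack extra information into the example string. The following is a minimal sketch in plain Ruby of how such a string could be split into its parts. It is not scRUBYt!'s own parser: the names `EXAMPLE_FORMAT` and `parse_example` are hypothetical, and the 'div[The Difference]' form is deliberately not interpreted here.

```ruby
# Hypothetical format: "<text>", optionally followed by "[<n>]" and/or "/@<attr>".
# This is an illustration only, not the parser scRUBYt! uses.
EXAMPLE_FORMAT = %r{\A(?<text>.*?)(?:\[(?<index>\d+)\])?(?:/@(?<attribute>\w+))?\z}m

# Splits an example string into its text, zero-based index and attribute name.
# Parts that are not present come back as nil.
def parse_example(example)
  m = EXAMPLE_FORMAT.match(example)
  {
    text: m[:text],
    index: m[:index]&.to_i,
    attribute: m[:attribute]
  }
end

indexed   = parse_example('more[5]')
with_attr = parse_example('The Difference/@href')
```

Because the index is zero-based, an index of 5 refers to the 6th matching element, which agrees with the changelog entry.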
+
+== 0.2.7
+=== 12th April, 2007
+
+=<tt>changes:</tt>
+
+[NEW] download pattern: download the file pointed to by the
+      parent pattern
+[NEW] checking checkboxes
+[NEW] basic authentication support
+[NEW] possibility to resolve relative paths against a custom url
+[NEW] first simple version of to_csv and to_hash
+[NEW] complete rewrite of the exporting system (Credit: Neelance)
+[NEW] first version of smart regular expressions: they are constructed
+      from examples, just as regular expressions (Credit: Neelance)
+[NEW] Possibility to click the n-th link
+[FIX] Clicking on links using scRUBYt's advanced example lookup
+[NEW] Forcing writing text of non-leaf nodes with :write_text => true
+[NEW] Possibility to set custom user-agent; Specified default user agent
+      as Microsoft IE6
+[FIX] Fixed crawling to detail pages in case of leaving the
+      original site (Credit: Michael Mazour)
+[FIX] fixing the '//' problem - if the relative url contained two
+      slashes, the fetching failed
+[FIX] scrubyt assumed that documents have a list of nested elements
+      (Credit: Rick Bradley)
+[FIX] crawling to detail pages works also if the parent pattern is
+      a string pattern
+[FIX] shortcut url fixed again
+[FIX] regexp pattern fixed in case its parent was a string
+[FIX] refactoring the core classes, lots of bugfixes and stabilization
+
+== 0.2.6
+=== 22nd March, 2007
+
+The mission of this release was to add even more powerful features,
+like crawling to detail pages or compound example specification,
+as well as fixing the most frequently popping-up bugs. Scraping
+of concrete sites is more and more frequently the cause for new
+features and bugfixes, which in my opinion means that the
+framework is beginning to make sense: from a shiny toy which
+looks cool and everybody wants to play with, it is moving
+towards a tool which you reach for if you seriously want
+to scrape a site.
+
+The new stuff in this release is 99% scraping related - if
+you are looking for new features in the navigation part,
+probably the next version will be for you, where I will
+concentrate more on adding new widgets and possibilities
+to the navigation process. FireWatir integration is very
+close, too - perhaps already the next release will
+support FireWatir navigation!
+
+=<tt>changes:</tt>
+* [NEW] Automatically crawling to and extracting from detail pages
+* [NEW] Compound example specification: So far the example of a pattern had to be a string.
+        Now it can be a hash as well, like {:contains => /\d\d-\d/, :begins_with => 'Telephone'}
+* [NEW] More sophisticated example specification: it is possible to use a regexp as well, and it is no longer
+        necessary (though still possible, of course) to specify the whole content of the node - nodes that
+        contain the string/match the regexp will be returned, too
+* [NEW] Possibility to force writing text in case of non-leaf nodes
+* [NEW] Crawling to the next page now possible via image links as well
+* [NEW] Possibility to define examples for any pattern (before it did not make sense for ancestors)
+* [NEW] Implementation of crawling to the next page with different methods
+* [NEW] Heuristics: if something ends with _url, it is a shortcut for:
+        some_url 'href', :type => :attribute
+* [FIX] Crawling to the next page (the broken google example): if the next
+        link text is not an <a>, traverse down until the <a> is found; if it is
+        still not found, traverse up until it is found
+* [FIX] Crawling to next pages does not break if the next link is greyed out
+        (or otherwise present but has no href attribute) (Credit: Robert Au)
+* [FIX] DRY-ed next link lookup - it should be much more robust now as it uses the 'standard' example lookup
+* [NEW] Correct exporting of detail page extractors
+* [NEW] Added more powerful XPath regexp (Credit: Karol Hosiawa)
+* [NEW] New examples for the new features
+* [FIX] Tons of bugfixes, new blackbox and unit tests, refactoring and stabilization
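
The compound example specification above combines several conditions on a node's text. The following is a minimal sketch in plain Ruby of how such an example could be evaluated. It is not scRUBYt!'s implementation: the method name `matches_example?` is hypothetical, only the two keys shown in the changelog (:contains and :begins_with) are handled, and all conditions in a hash are assumed to have to hold together.

```ruby
# Hypothetical matcher: decides whether a node's text satisfies an example.
# String and Regexp examples match when the text contains them; a Hash example
# is assumed to match only when every one of its conditions holds.
def matches_example?(text, example)
  case example
  when String then text.include?(example)
  when Regexp then text.match?(example)
  when Hash
    example.all? do |condition, value|
      case condition
      when :contains    then matches_example?(text, value)
      when :begins_with then text.start_with?(value)
      else raise ArgumentError, "unsupported condition: #{condition.inspect}"
      end
    end
  else
    raise ArgumentError, "unsupported example: #{example.inspect}"
  end
end

compound = { :contains => /\d\d-\d/, :begins_with => 'Telephone' }
phone_ok = matches_example?('Telephone: 55-1234', compound)
fax_ok   = matches_example?('Fax: 55-1234', compound)
```

Here `phone_ok` is true and `fax_ok` is false, since the second text contains the digit pattern but does not begin with 'Telephone'.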
+
+== 0.2.3
+=== 20th February, 2007
+
+Thanks to the feedback from all of you, I managed to find a lot of bugs as well as write up a nice feature request list. The bugs are mostly fixed and some shiny new features have also been added. Stability was also improved by adding new tests and totally refactoring the whole code.
+The new features make this release much more powerful than the previous one. Sites requiring login, submitting forms with button click, filling text areas, dealing with variable-size results, smart handling of attribute lookup, https, custom proxy setting and tons of bugfixes make this release capable of doing much more than was possible in 0.2.0.
+I have also added some shiny new examples: scraping reddit, del.icio.us, rubyforge login and wordpress automatic commenting, for example.
+
+=<tt>changes:</tt>
+* [FIX] Cookies (and other stuff) are now taken into consideration
+* [NEW] select_indices feature. Example:
+
+  table do
+    (row '1').select_indices(:last)
+  end
+
+  this will select only the last row;
+  possibility to specify a Range, or an array of indices, or other
+  constants like :first, :every_odd etc. More to come in the future!
+* [FIX] digg.com next page problem fixed
+* [FIX] Fetching of https sites
+* [FIX] Next page works incorrectly when given an absolute path
+* [FIX] Fixing exporting if the pattern parameters are parenthesized
+* [NEW] Possibility to submit forms by clicking a button
+* [NEW] Added new unit test suite: pattern_test
+* [NEW] Possibility to set a proxy for fetching the input document
+* [NEW] Added possibility to choose an option from a selection list (Credit: Zaheed Haque)
+* [FIX] Image pattern example lookup fix
+* [NEW] Possibility to prefilter the document before passing it to Hpricot (Credit: Demitrious Kelly)
+* [FIX] corrected gem dependencies (Credit: Tim Fletcher)
+* [FIX] remove duplicates only if there are more examples present
+* [NEW] new examples: wordpress comment (Credit: Zaheed Haque), rubyforge login, del.icio.us, reddit and more
+* [FIX] if there is no scraper defined, exit with a message rather than raise an exception
+* [NEW] smart handling of attribute lookup: try to look up the attribute in the parent, but if it is not there, traverse up until it is found (this is useful e.g. if an image is inside a span and the span is inside an <a>)
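
The select_indices entry above can be read as an index filter applied to a pattern's results. The following is a minimal sketch in plain Ruby of that idea. It is not scRUBYt!'s implementation: the helper name `pick_indices` is hypothetical, and `:every_odd` is assumed here to mean the odd zero-based positions (1, 3, 5, ...), which the changelog does not specify.

```ruby
# Hypothetical helper: keeps only the results selected by an index spec.
# Supported specs: :first, :last, :every_odd, a Range, or an Array of indices.
def pick_indices(items, spec)
  case spec
  when :first     then items.first(1)
  when :last      then items.last(1)
  # Assumption: "odd" refers to zero-based positions 1, 3, 5, ...
  when :every_odd then items.select.with_index { |_, i| i.odd? }
  when Range      then items[spec] || []
  when Array      then items.values_at(*spec).compact
  else raise ArgumentError, "unsupported index spec: #{spec.inspect}"
  end
end

rows = %w[row0 row1 row2 row3 row4]
last_row  = pick_indices(rows, :last)
some_rows = pick_indices(rows, [0, 4])
```

Each branch returns an array, so the result has the same shape whether one row or several are selected.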
+
+== 0.2.0
+=== 30th January, 2007
+
+The first ever public release, 0.2.0 is out! I would say the feature set is impressive, though the reliability still needs to be improved, and the whole thing needs to be tested, tested and tested thoroughly. This is not yet the release which you just pull out of the box and it works under any circumstances - however, the major bugs are fixed and the whole stuff is in a good-enough(TM) state, I guess.
+
+=<tt>changes:</tt>
+
+* better form detection heuristics
+* report message if there are absolutely no results
+* lots of bugfixes
+  * fixed amazon_data.books[0].item[0].title[0] style output access
+    and implemented it correctly in case of crawling as well
+  * /body/div/h3 not detected as XPath
+  * crawling problem (improved heuristics of url joining)
+  * fixed blackbox test runner - no more platform dependent code
+  * fixed exporting bug: swapped exported XPaths in the case of no example present
+  * fixed exporting bug: capturing \W (non-word character) after the pattern name; this way we can distinguish pattern names where one
+    name is a substring of the other
+  * Evaluation stops if the example was not found - but not in the case
+    of next page link lookup
+  * google_data[0].link[0].url[0] style result lookup now works in the
+    case of more documents, too
+  * tons of other bugfixes
+  * overall stability fixes
+* more blackbox tests
+* more examples
+* overall stability fixes
+
+
+== 0.1.9
+=== 28th January, 2007
+
+This is a preview release before the first real public release, 0.2.0. Basically everything planned for 0.2.0 is in, now a testing phase (with light bugfixing :-) will follow, then 0.2.0 will be released.
+
+=<tt>Changes</tt>:
+
+* Possibility to specify multiple examples (hence a pattern can have more filters)
+* Enhanced heuristics for example text detection
+* First version of algorithm to remove dupes resulting from multiple examples
+* empty XML leaf nodes are not written
+* new examples
+* TONS of bugfixes
+
+== 0.1
+=== 15th January, 2007
+
+First pre-alpha (non-public) release
+This release was made more for myself (to try and test rubyforge, gems, etc) rather than for the community at this time.
+
+Fairly nice set of features, but it still needs a lot of testing and stabilizing before it will be really usable.
+
+* Navigation:
+  * fetching pages
+  * clicking links
+  * filling input fields
+  * submitting forms
+  * automatically passing the document to the scraping
+  * both files and http:// support
+  * automatic crawling
+
+* Scraping:
+  * Fairly powerful DSL to describe the full scraping process
+  * Automatic navigation with WWW::Mechanize
+  * Automatic scraping through examples with Hpricot
+  * automatic recursive scraping through the next button
+
+
+
+
+=<tt>changes:</tt>
+* [FIX] cookies (and other stuff) are now taken into consideration
+* [FIX] digg.com next page problem fixed
+* [FIX] fetching of https sites
+* [FIX] Next page works incorrectly when given an absolute path
+* [FIX] Fixing exporting if the pattern parameters are parenthesized
+* [NEW] Possibility to submit forms by clicking a button
+* [NEW] Added new unit test suite: pattern_test
+* [NEW] Possibility to set a proxy for fetching the input document
+* [NEW] Added possibility to choose an option from a selection list
+* [NEW] select_indices feature. Example:
+
+  table do
+    (row '1').select_indices(:last)
+  end
+
+  this will select only the last row;
+  possibility to specify a Range, or an array of indices, or other
+  constants like :first, :every_odd etc. More to come in the future!
+* [FIX] Image pattern example lookup fix
+* [FIX] corrected gem dependencies (thanks to Tim Fletcher)
+* [FIX] remove duplicates only if there are more examples present
+* [NEW] new examples: gmail login, wordpress comment, del.icio.us, grab_rows (showcasing select_indices)
+* [FIX] if there is no scraper defined, exit with a message rather than
+  raise an exception
+* [NEW] smart handling of attribute lookup: try to look up the attribute in the parent, but if it is not there, traverse up until it is found (this is useful e.g. if an image is inside a span and the span is inside an <a>)
+
+