Commit

initial commit
scrubber committed Oct 19, 2008
0 parents commit a94877c
Showing 512 changed files with 29,532 additions and 0 deletions.
338 changes: 338 additions & 0 deletions CHANGELOG
@@ -0,0 +1,338 @@
= scRUBYt! Changelog

== 0.4.05
=== 20th October

=<tt>changes:</tt>
- [NEW] possibility to use FireWatir as the agent for scraping
- [NEW] navigation actions: click_by_xpath, click_link_and_wait
- [MOD] dropped dependencies: RubyInline, ParseTree, Ruby2Ruby (hooray for win32 users)
- [MOD] exporting temporarily doesn't work - for now, generated XPaths are printed to the screen
- [FIX] lots of bugfixes and stability fixes

== 0.4.0 (unofficial)
=== 31st October, 2007

=<tt>changes:</tt>
- [NEW] possibility to define a default value for patterns
- [MOD] rewrite of to_flat_xml to a more robust algorithm
- [NEW] find_string method in text pattern; return the string if it's present in the input

== 0.3.4
=== 26th September, 2007

=<tt>changes:</tt>



== 0.3.1
=== 29th May, 2007

=<tt>changes:</tt>

[NEW] complete rewrite of the output system, creating
a solid foundation for more robust output functions
(credit: Neelance)
[NEW] logging - no annoying puts messages anymore! (credit: Tim Fletcher)
[NEW] can index an example - e.g.
link 'more[5]'
semantics: give me the 6th element with the text 'more'
[NEW] can use XPath checking an attribute value, like "//div[@id='content']"
[NEW] default values for missing elements (first version was done in 0.2.8
but it did not work for all cases)
[NEW] possibility to click a button by its text (instead of its index)
(credit: Nick Merwin)
[NEW] clicking radio buttons
[NEW] can click on image buttons (by specifying the name of the button)
[NEW] possibility to extract a URL in one step, like so:
link 'The Difference/@href'
i.e. give me the href attribute of the element matched by the example 'The Difference'
[NEW] new way to match an element of the page:
div 'div[The Difference]'
means 'return the div which contains the string "The Difference"'. This is
useful if the XPath of the element is non-constant across the same site
(e.g. sometimes a banner or ad is added, sometimes not etc.)
[NEW] Clicking image maps; At the moment this is achieved by specifying an
index, like
click_image_map 3
which means click the 4th link in the image map
[FIX] Replacing \240 (&nbsp;) with space in the preprocessing phase
automatically
[FIX] Fixed: correctly downloading image if the src
attribute had a leading space, as in
<img src=' /files/downloads/images/image.jpg'/>
[FIX] Other misc fixes - a ton of them!
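
The three example notations introduced in this release ('more[5]', 'The Difference/@href' and 'div[The Difference]') can be told apart with a few regular expressions. The following is only a minimal plain-Ruby sketch of that idea, not scRUBYt!'s actual parser; ParsedExample and parse_example are hypothetical names made up for this illustration, and it assumes that a purely numeric bracket content always means an index.

```ruby
# Hypothetical helper, NOT part of scRUBYt!: splits an example string into
# the pieces described in the changelog entries above.
ParsedExample = Struct.new(:text, :index, :attribute, :container)

def parse_example(example)
  parsed = ParsedExample.new(example, nil, nil, nil)

  # 'The Difference/@href' -> text 'The Difference', attribute 'href'
  if (m = parsed.text.match(%r{\A(.+)/@([\w-]+)\z}))
    parsed.text = m[1]
    parsed.attribute = m[2]
  end

  if (m = parsed.text.match(/\A(.+)\[(\d+)\]\z/))
    # 'more[5]' -> text 'more', zero-based index 5 (i.e. the 6th match)
    parsed.text = m[1]
    parsed.index = m[2].to_i
  elsif (m = parsed.text.match(/\A(\w+)\[(.+)\]\z/))
    # 'div[The Difference]' -> enclosing element 'div', text 'The Difference'
    parsed.container = m[1]
    parsed.text = m[2]
  end

  parsed
end

indexed = parse_example('more[5]')
puts "#{indexed.text} / #{indexed.index}"

with_attribute = parse_example('The Difference/@href')
puts "#{with_attribute.text} / #{with_attribute.attribute}"

contained = parse_example('div[The Difference]')
puts "#{contained.container} / #{contained.text}"
```

Under the stated assumption, an example such as 'div[5]' is read as an index rather than as "the div containing the string 5"; a real implementation would need its own rule for that ambiguity.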

== 0.2.7
=== 12th April, 2007

=<tt>changes:</tt>

[NEW] download pattern: download the file pointed to by the
parent pattern
[NEW] checking checkboxes
[NEW] basic authentication support
[NEW] possibility to resolve relative paths against a custom url
[NEW] first simple version of to_csv and to_hash
[NEW] complete rewrite of the exporting system (Credit: Neelance)
[NEW] first version of smart regular expressions: they are constructed
from examples, just as regular expressions (Credit: Neelance)
[NEW] Possibility to click the n-th link
[FIX] Clicking on links using scRUBYt's advanced example lookup
[NEW] Forcing writing text of non-leaf nodes with :write_text => true
[NEW] Possibility to set custom user-agent; Specified default user agent
as Microsoft IE6
[FIX] Fixed crawling to detail pages in case of leaving the
original site (Credit: Michael Mazour)
[FIX] fixing the '//' problem - if the relative url contained two
slashes, the fetching failed
[FIX] scrubyt assumed that documents have a list of nested elements
(Credit: Rick Bradley)
[FIX] crawling to detail pages works also if the parent pattern is
a string pattern
[FIX] shortcut url fixed again
[FIX] regexp pattern fixed in case its parent was a string
[FIX] refactoring the core classes, lots of bugfixes and stabilization
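
Resolving relative paths against a custom url, as listed above, can be illustrated with Ruby's standard URI library. This is only a sketch of the general technique and not the code scRUBYt! uses; resolve_link and the example.com/example.org addresses are hypothetical.

```ruby
require 'uri'

# Hypothetical helper, NOT scRUBYt!'s implementation: resolves a possibly
# relative link against a base URL chosen by the caller.
def resolve_link(base_url, href)
  # Surrounding whitespace is stripped as a precaution before joining.
  URI.join(base_url, href.strip).to_s
end

# Relative to the page the link was found on
puts resolve_link('http://example.com/list/page1.html', 'detail/42.html')

# Relative to a custom base URL instead of the fetched page
puts resolve_link('http://mirror.example.org/', 'detail/42.html')
```

Passing the base explicitly keeps the choice between "the fetched page" and "a custom url" with the caller, which is the behaviour the entry describes.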

== 0.2.6
=== 22nd March, 2007

The mission of this release was to add even more powerful features,
like crawling to detail pages or compound example specification,
as well as fixing the most frequently popping-up bugs. Scraping
of concrete sites is more and more frequently the cause for new
features and bugfixes, which in my opinion means that the
framework is beginning to make sense: from a shiny toy which
looks cool and everybody wants to play with, it is moving
towards a tool which you reach for if you seriously want
to scrape a site.

The new stuff in this release is 99% scraping related - if
you are looking for new features in the navigation part,
probably the next version will be for you, where I will
concentrate more on adding new widgets and possibilities
to the navigation process. FireWatir integration is very
close, too - perhaps already the next release will
support FireWatir navigation!

=<tt>changes:</tt>
* [NEW] Automatically crawling to and extracting from detail pages
* [NEW] Compound example specification: So far the example of a pattern had to be a string.
Now it can be a hash as well, like {:contains => /\d\d-\d/, :begins_with => 'Telephone'}
* [NEW] More sophisticated example specification: a regexp can be used as well, and it is no longer
necessary (though still possible, of course) to specify the whole content of the node - nodes that
contain the string/match the regexp will be returned, too
* [NEW] Possibility to force writing text in case of non-leaf nodes
* [NEW] Crawling to the next page now possible via image links as well
* [NEW] Possibility to define examples for any pattern (before it did not make sense for ancestors)
* [NEW] Implementation of crawling to the next page with different methods
* [NEW] Heuristics: if something ends with _url, it is a shortcut for:
some_url 'href', :type => :attribute
* [FIX] Crawling to the next page (the broken google example): if the next
link text is not an <a>, traverse down until the <a> is found; if it is
still not found, traverse up until it is found
* [FIX] Crawling to next pages does not break if the next link is greyed out
(or otherwise present but has no href attribute) (Credit: Robert Au)
* [FIX] DRY-ed next link lookup - it should be much more robust now as it uses the 'standard' example lookup
* [NEW] Correct exporting of detail page extractors
* [NEW] Added more powerful XPath regexp (Credit: Karol Hosiawa)
* [NEW] New examples for the new features
* [FIX] Tons of bugfixes, new blackbox and unit tests, refactoring and stabilization
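
The compound example specification above (a hash such as {:contains => /\d\d-\d/, :begins_with => 'Telephone'}) amounts to checking every constraint against a node's text. The sketch below shows that idea in plain Ruby; matches_example? is a hypothetical name, only the two constraint keys mentioned in the changelog are handled, and the real lookup in scRUBYt! may work differently.

```ruby
# Hypothetical matcher, NOT scRUBYt!'s implementation: decides whether the
# text of a node satisfies a string, regexp or compound (hash) example.
def matches_example?(text, example)
  case example
  when String
    # Nodes that merely contain the string are accepted, too
    text.include?(example)
  when Regexp
    !(text =~ example).nil?
  when Hash
    # Every constraint of a compound example has to hold
    example.all? do |constraint, value|
      case constraint
      when :contains
        value.is_a?(Regexp) ? !(text =~ value).nil? : text.include?(value)
      when :begins_with
        text.start_with?(value)
      else
        raise ArgumentError, "unknown constraint: #{constraint.inspect}"
      end
    end
  else
    raise ArgumentError, "unsupported example: #{example.inspect}"
  end
end

example = { :contains => /\d\d-\d/, :begins_with => 'Telephone' }
puts matches_example?('Telephone: 55-1234', example)
puts matches_example?('Fax: 55-1234', example)
```

Raising on an unknown constraint is a choice made for this sketch so that a mistyped key fails loudly instead of silently matching everything.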

== 0.2.3
=== 20th February, 2007

Thanks to the feedback from all of you, I managed to find a lot of bugs as well as write up a nice feature request list. The bugs are mostly fixed and also some shiny new features have been added. Stability was also improved by adding new tests and totally refactoring the whole code.
The new features make this release much more powerful than the previous one. Sites requiring login, submitting forms with button click, filling text areas, dealing with variable-size results, smart handling of attribute lookup, https, custom proxy setting and tons of bugfixes make this release capable of doing much more than was possible in 0.2.0.
I have also added some shiny new examples: scraping reddit, del.icio.us, rubyforge login and wordpress automatic commenting, for example.

=<tt>changes:</tt>
* [FIX] Cookies (and other stuff) are now taken into consideration
* [NEW] select_indices feature. Example:

table do
(row '1').select_indices(:last)
end

this will select only the last row;
possibility to specify a Range, or an array of indices, or other
constants like :first, :every_odd etc. More to come in the future!
* [FIX] digg.com next page problem fixed
* [FIX] Fetching of https sites
* [FIX] Next page works incorrectly when given an absolute path
* [FIX] Fixing exporting if the pattern parameters are parenthesized
* [NEW] Possibility to submit forms by clicking a button
* [NEW] Added new unit test suite: pattern_test
* [NEW] Possibility to set a proxy for fetching the input document
* [NEW] Added possibility to choose an option from a selection list (Credit: Zaheed Haque)
* [FIX] Image pattern example lookup fix
* [NEW] Possibility to prefilter the document before passing it to Hpricot (Credit: Demitrious Kelly)
* [FIX] corrected gem dependencies (Credit: Tim Fletcher)
* [FIX] remove duplicates only if there are more examples present
* [NEW] new examples: wordpress comment (Credit: Zaheed Haque), rubyforge login, del.icio.us, reddit and more
* [FIX] if there is no scraper defined, exit with a message rather than raise an exception
* [NEW] smart handling of attribute lookup: try to look up the attribute in the parent, but if it is not there, traverse up until it is found (this is useful e.g. if an image is inside a span and the span is inside an <a>)
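
The select_indices feature described above narrows a list of matched results down by position. The following plain-Ruby sketch shows one way such a selector could be interpreted; it is a hypothetical stand-in operating on an ordinary array rather than the scRUBYt! method itself, and it assumes that :every_odd refers to odd zero-based indices.

```ruby
# Hypothetical stand-in for select_indices, NOT scRUBYt!'s implementation:
# picks elements of an array by a symbol, an Integer, a Range or an Array.
def select_indices(items, selector)
  case selector
  when :first
    items.first(1)
  when :last
    items.last(1)
  when :every_odd
    # Assumption: "odd" means odd zero-based positions (1, 3, 5, ...)
    items.each_with_index.select { |_item, i| i.odd? }.map(&:first)
  when Integer
    [items[selector]].compact
  when Range
    items[selector] || []
  when Array
    items.values_at(*selector).compact
  else
    raise ArgumentError, "unsupported selector: #{selector.inspect}"
  end
end

rows = ['row 1', 'row 2', 'row 3', 'row 4']
p select_indices(rows, :last)
p select_indices(rows, 1..2)
p select_indices(rows, [0, 3])
p select_indices(rows, :every_odd)
```

Out-of-range positions are dropped rather than returned as nil, which is a simplification chosen for this sketch.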

== 0.2.0
=== 30th January, 2007

The first ever public release, 0.2.0 is out! I would say the feature set is impressive, though the reliability still needs to be improved, and the whole thing needs to be tested, tested and tested thoroughly. This is not yet the release which you just pull out of the box and it works under any circumstances - however, the major bugs are fixed and the whole stuff is in a good-enough(TM) state, I guess.

=<tt>changes:</tt>

* better form detection heuristics
* report message if there are absolutely no results
* lots of bugfixes
* fixed amazon_data.books[0].item[0].title[0] style output access
and implemented it correctly in case of crawling as well
* /body/div/h3 not detected as XPath
* crawling problem (improved heuristics of url joining)
* fixed blackbox test runner - no more platform dependent code
* fixed exporting bug: swapped exported XPaths in the case of no example present
* fixed exporting bug: capturing \W (non-word character) after the pattern name; this way we can distinguish pattern names where one
name is a substring of the other
* Evaluation stops if the example was not found - but not in the case
of next page link lookup
* google_data[0].link[0].url[0] style result lookup now works in the
case of more documents, too
* tons of other bugfixes
* overall stability fixes
* more blackbox tests
* more examples


== 0.1.9
=== 28th January, 2007

This is a preview release before the first real public release, 0.2.0. Basically everything planned for 0.2.0 is in, now a testing phase (with light bugfixing :-) will follow, then 0.2.0 will be released.

=<tt>changes:</tt>

* Possibility to specify multiple examples (hence a pattern can have more filters)
* Enhanced heuristics for example text detection
* First version of algorithm to remove dupes resulting from multiple examples
* empty XML leaf nodes are not written
* new examples
* TONS of bugfixes

== 0.1
=== 15th January, 2007

First pre-alpha (non-public) release
This release was made more for myself (to try and test rubyforge, gems, etc) rather than for the community at this time.

Fairly nice set of features, but it still needs a lot of testing and stabilizing before it will be really usable.

* Navigation:
* fetching pages
* clicking links
* filling input fields
* submitting forms
* automatically passing the document to the scraping
* both files and http:// support
* automatic crawling

* Scraping:
* Fairly powerful DSL to describe the full scraping process
* Automatic navigation with WWW::Mechanize
* Automatic scraping through examples with Hpricot
* automatic recursive scraping through the next button




=<tt>changes:</tt>
* [FIX] cookies (and other stuff) are now taken into consideration
* [FIX] digg.com next page problem fixed
* [FIX] fetching of https sites
* [FIX] Next page works incorrectly when given an absolute path
* [FIX] Fixing exporting if the pattern parameters are parenthesized
* [NEW] Possibility to submit forms by clicking a button
* [NEW] Added new unit test suite: pattern_test
* [NEW] Possibility to set a proxy for fetching the input document
* [NEW] Added possibility to choose an option from a selection list
* [NEW] select_indices feature. Example:

table do
(row '1').select_indices(:last)
end

this will select only the last row;
possibility to specify a Range, or an array of indices, or other
constants like :first, :every_odd etc. More to come in the future!
* [FIX] Image pattern example lookup fix
* [FIX] corrected gem dependencies (thanks to Tim Fletcher)
* [FIX] remove duplicates only if there are more examples present
* [NEW] new examples: gmail login, wordpress comment, del.icio.us, grab_rows (showcasing select_indices)
* [FIX] if there is no scraper defined, exit with a message rather than
raise an exception
* [NEW] smart handling of attribute lookup: try to look up the attribute in the parent, but if it is not there, traverse up until it is found (this is useful e.g. if an image is inside a span and the span is inside an <a>)

