Scraper refactor #19

shaneaevans · 2012-02-16T14:35:30Z

The Scraper class can be trained with an HtmlPage instead of requiring a URL. It's more correct now (handling encoding, headers, etc.) when creating the HtmlPage for training.

The InstanceBasedLearningExtractor is no longer re-initialized on each request, improving performance.
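A minimal sketch of the caching idea, assuming nothing about scrapely's internals: `CachingScraper` and `build_extractor` are hypothetical stand-ins, not the real Scraper or InstanceBasedLearningExtractor. The extractor is built lazily on first use, reused on later requests, and only rebuilt after the templates change.

```python
class CachingScraper:
    """Hypothetical stand-in showing the pattern; not scrapely's Scraper."""

    def __init__(self, templates, build_extractor):
        self._templates = list(templates)
        # build_extractor is an assumed callable: templates -> callable(page)
        self._build_extractor = build_extractor
        self._extractor = None  # built lazily, then reused across requests

    def add_template(self, template):
        self._templates.append(template)
        self._extractor = None  # templates changed, so drop the cached extractor

    def scrape_page(self, page):
        if self._extractor is None:
            # Pay the (potentially expensive) construction cost only once.
            self._extractor = self._build_extractor(self._templates)
        return self._extractor(page)
```

The trade-off is that any method that mutates the templates has to invalidate the cached extractor, as `add_template` does here.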

A failing test has been fixed and no longer needs to make an HTTP request to run.

The test for the Scraper class was fixed in the process: it now uses saved data instead of requesting a live website, whose content had changed and broken the tests. When the encoding is not set, it is guessed using w3lib.encoding instead of defaulting to utf-8, which was likely to fail often.
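To illustrate why guessing beats a fixed utf-8 default, here is a rough stdlib-only sketch of encoding detection. It is not w3lib.encoding's implementation: the function name `guess_encoding`, the order of checks (header, BOM, meta tag, trial decode) and the cp1252 fallback are all assumptions made for this example.

```python
import codecs
import re

# Byte order marks checked by this sketch (UTF-32 is deliberately left out).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]
_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w-]+)", re.I)
_META_RE = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([\w-]+)", re.I)


def _known(name):
    """Return Python's canonical codec name, or None if it is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def guess_encoding(body, content_type=None, default="utf-8"):
    """Guess the encoding of raw page bytes (hypothetical helper)."""
    # 1. charset declared in the Content-Type header, if any
    if content_type:
        match = _CHARSET_RE.search(content_type)
        if match and _known(match.group(1)):
            return _known(match.group(1))
    # 2. byte order mark at the start of the body
    for bom, encoding in _BOMS:
        if body.startswith(bom):
            return encoding
    # 3. charset declared in a <meta> tag near the top of the document
    match = _META_RE.search(body[:4096])
    if match:
        encoding = _known(match.group(1).decode("ascii"))
        if encoding:
            return encoding
    # 4. try the default; fall back to cp1252 (an assumption of this sketch)
    try:
        body.decode(default)
        return default
    except UnicodeDecodeError:
        return "cp1252"
```

For example, `guess_encoding("café".encode("latin-1"))` does not return utf-8, because the bytes are not valid utf-8 and the final trial decode fails.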

pablohoffman · 2012-02-16T16:30:25Z

Have you checked the scrapely command line tool (python -m scrapely.tool) keeps working after this change?

shaneaevans · 2012-02-16T16:43:00Z

The change is API compatible, so unless the tool relies on private functions, it should be fine. I checked some basic usage and it was OK. (Although, really, this should be automated.)

shaneaevans · 2012-02-16T16:46:30Z

I guess we should also 'fix' the tool. It requires users to tell it the encoding or it assumes utf8, whereas it should work out the encoding itself in the default case. I'll work up a patch.

I also note that the example on the README is broken - a 0 Scrapy project -n 1 -f author doesn't work for me

shaneaevans added 2 commits February 14, 2012 20:43

Merge branch 'master' of github.com:scrapy/scrapely into scraper_refa…

2f8e9d4

…ctor

shaneaevans added a commit that referenced this pull request Feb 16, 2012

Merge pull request #19 from scrapy/scraper_refactor

63284b7

Scraper refactor

shaneaevans merged commit 63284b7 into master Feb 16, 2012
