@shaneaevans
Member

The Scraper class can now be trained with an HtmlPage instead of requiring a URL. Creating the HtmlPage for training is also more correct now: it handles encoding, headers, etc.

The InstanceBasedLearningExtractor is no longer re-initialized on each request, improving performance.
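The idea of building the extractor once and reusing it across requests can be sketched as follows. This is a minimal illustration, not the code from this pull request: `CachingScraper`, `build_extractor`, and `add_template` are hypothetical names, and the extractor is stood in for by a plain function.

```python
class CachingScraper:
    """Hypothetical stand-in for a scraper that builds its extractor lazily."""

    def __init__(self, build_extractor):
        # build_extractor is an assumed callable: list of templates -> extractor
        self._build_extractor = build_extractor
        self._templates = []
        self._extractor = None  # built on first use, then reused

    def add_template(self, template):
        self._templates.append(template)
        # New training data makes the cached extractor stale.
        self._extractor = None

    def scrape(self, page):
        if self._extractor is None:
            self._extractor = self._build_extractor(list(self._templates))
        return self._extractor(page)


build_calls = []  # records how many templates each build saw


def build(templates):
    build_calls.append(len(templates))
    # Toy extractor: report which template names occur in the page text.
    return lambda page: [t for t in templates if t in page]


scraper = CachingScraper(build)
scraper.add_template("title")
scraper.scrape("title: foo")
scraper.scrape("title: bar")  # reuses the extractor built by the first call
```

The extractor is rebuilt only after new training data arrives, so repeated scrape calls between training steps share one instance.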

A failing test has been fixed and no longer needs to make an HTTP request.

The test for the Scraper class was fixed in the process and it now uses
saved data instead of making a request to a website, which changed and
broke the tests.
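A test that runs against saved data rather than a live site might look like the sketch below. It is only an illustration under stated assumptions: `SAVED_PAGE` is a made-up fixture and `extract_title` is a hypothetical function under test, neither taken from the scrapely test suite.

```python
import unittest
from html.parser import HTMLParser

# Made-up fixture: in a real suite this would be a response body saved to disk.
SAVED_PAGE = (
    b"<html><head><title>Example project 1.0</title></head>"
    b"<body><p>Saved copy of the page.</p></body></html>"
)


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(body, encoding="utf-8"):
    # Hypothetical function under test: works on bytes, never touches the network.
    parser = TitleParser()
    parser.feed(body.decode(encoding))
    parser.close()
    return parser.title.strip()


class SavedPageTest(unittest.TestCase):
    def test_title_from_saved_page(self):
        self.assertEqual(extract_title(SAVED_PAGE), "Example project 1.0")
```

Because the input is fixed, the test keeps passing when the original website changes its markup.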

The encoding is guessed using w3lib.encoding when not set, instead of
defaulting to utf-8, which was likely to fail often.
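To illustrate what guessing the encoding involves, here is a simplified, standard-library-only sketch. It is not w3lib's actual algorithm: `guess_encoding` is a hypothetical helper, and the order of checks (Content-Type header, byte order mark, meta tag, then a default) is an assumption made for the example.

```python
import codecs
import re

# UTF-32 BOMs come first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]
_HEADER_CHARSET = re.compile(r"""charset=["']?([\w-]+)""", re.I)
_META_CHARSET = re.compile(rb"""<meta[^>]+charset=["']?([\w-]+)""", re.I)


def _known(name):
    """Return Python's canonical codec name, or None if the codec is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def guess_encoding(content_type, body, default="utf-8"):
    """Guess the encoding of a response body given as bytes (hypothetical helper)."""
    # 1. charset declared in the Content-Type header
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match and _known(match.group(1)):
            return _known(match.group(1))
    # 2. byte order mark at the start of the body
    for bom, name in _BOMS:
        if body.startswith(bom):
            return _known(name)
    # 3. charset declared in a <meta> tag near the top of the document
    match = _META_CHARSET.search(body[:4096])
    if match:
        name = _known(match.group(1).decode("ascii"))
        if name:
            return name
    # 4. nothing declared: fall back to the default
    return _known(default)
```

Only when every check comes up empty does the helper fall back to the default, which is the case where a hard-coded utf-8 used to be applied unconditionally.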
shaneaevans added a commit that referenced this pull request Feb 16, 2012
@shaneaevans shaneaevans merged commit 63284b7 into master Feb 16, 2012
@pablohoffman
Member

Have you checked the scrapely command line tool (python -m scrapely.tool) keeps working after this change?

@shaneaevans
Member Author

The change is API compatible, so unless the tool relies on private functions, it should be fine. I checked some basic usage and it was OK (although, really, this should be automated).

@shaneaevans
Member Author

I guess we should also 'fix' the tool. It requires users to tell it the encoding, or else it assumes utf8, when it should work out the encoding itself in the default case. I'll work up a patch.

I also note that the example in the README is broken: a 0 Scrapy project -n 1 -f author doesn't work for me.
