Release notes

Scrapy 1.4.0 (2017-05-18)

Scrapy 1.4 does not bring that many breathtaking new features but quite a few handy improvements nonetheless.

Scrapy now supports anonymous FTP sessions with customizable user and password via the new :setting:`FTP_USER` and :setting:`FTP_PASSWORD` settings. And if you're using Twisted version 17.1.0 or above, FTP is now available with Python 3.
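
As a minimal sketch, the two settings could be set in a project's settings.py as shown here. The credential values are placeholders chosen for illustration, not recommendations.

```python
# settings.py (fragment): placeholder credentials, for illustration only
FTP_USER = "anonymous"              # user name used for FTP connections
FTP_PASSWORD = "guest@example.com"  # password used for FTP connections
```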

There's a new :meth:`response.follow <scrapy.http.TextResponse.follow>` method for creating requests; it is now a recommended way to create Requests in Scrapy spiders. This method makes it easier to write correct spiders; response.follow has several advantages over creating scrapy.Request objects directly:

  • it handles relative URLs;
  • it works properly with non-ascii URLs on non-UTF8 pages;
  • in addition to absolute and relative URLs it supports Selectors; for <a> elements it can also extract their href values.

For example, instead of this:

for href in response.css('li.page a::attr(href)').extract():
    url = response.urljoin(href)
    yield scrapy.Request(url, self.parse, encoding=response.encoding)

One can now write this:

for a in response.css('li.page a'):
    yield response.follow(a, self.parse)

Link extractors are also improved. They work similarly to what a regular modern browser would do: leading and trailing whitespace is removed from attributes (think href=" http://example.com") when building Link objects. This whitespace-stripping also happens for action attributes with FormRequest.
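
To make the whitespace handling concrete, here is a small standard-library sketch of the idea. It is not Scrapy's link extractor code; the helper name clean_href and the HTML5_WHITESPACE constant are made up for this example.

```python
from urllib.parse import urljoin

# Characters HTML5 treats as ASCII whitespace: space, tab, LF, CR, FF.
HTML5_WHITESPACE = " \t\n\r\x0c"


def clean_href(base_url, raw_href):
    """Strip surrounding whitespace from an attribute value, then resolve it."""
    return urljoin(base_url, raw_href.strip(HTML5_WHITESPACE))


print(clean_href("http://example.com/list/", " http://example.com/item?id=1 \n"))
print(clean_href("http://example.com/list/", "\tpage2.html "))
```

The first call prints the absolute URL without the padding, and the second resolves the relative link against the page URL.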

Please also note that link extractors do not canonicalize URLs by default anymore. This was puzzling users every now and then, and it is not what browsers actually do, so we removed that extra transformation on extracted links.

For those of you wanting more control on the Referer: header that Scrapy sends when following links, you can set your own Referrer Policy. Prior to Scrapy 1.4, the default RefererMiddleware would simply and blindly set it to the URL of the response that generated the HTTP request (which could leak information on your URL seeds). By default, Scrapy now behaves much like your regular browser does. And this policy is fully customizable with W3C standard values (or with something really custom of your own if you wish). See :setting:`REFERRER_POLICY` for details.
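
For instance, a project that only wants to send the Referer header to the same site could use a fragment like the following. The choice of "same-origin" is just an example; it is one of the W3C policy names.

```python
# settings.py (fragment)
# Only send the Referer header on requests that stay on the same origin
# as the response that produced them.
REFERRER_POLICY = "same-origin"
```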

To make Scrapy spiders easier to debug, Scrapy logs more stats by default in 1.4: memory usage stats, detailed retry stats, detailed HTTP error code stats. A similar change is that HTTP cache path is also visible in logs now.

Last but not least, Scrapy now has the option to make JSON and XML items more human-readable, with newlines between items and even custom indenting offset, using the new :setting:`FEED_EXPORT_INDENT` setting.
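
The effect is comparable to the indent argument of Python's json.dumps. The sketch below uses only the standard library to show the difference in output shape; it does not call Scrapy's exporters, and the sample items are invented. In a project, the corresponding line in settings.py would be something like FEED_EXPORT_INDENT = 4.

```python
import json

# Two made-up items standing in for scraped data.
items = [{"name": "Item 1", "price": 10}, {"name": "Item 2", "price": 20}]

compact = json.dumps(items)             # everything on a single line
indented = json.dumps(items, indent=4)  # one key per line, 4-space indent

print(compact)
print(indented)
```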

Enjoy! (Or read on for the rest of changes in this release.)

Deprecations and Backwards Incompatible Changes

New Features

Bug fixes

Cleanups & Refactoring

Documentation

Scrapy 1.3.3 (2017-03-10)

Bug fixes

  • Make SpiderLoader raise ImportError again by default for missing dependencies and wrong :setting:`SPIDER_MODULES`. These exceptions were silenced as warnings since 1.3.0. A new setting is introduced to toggle between warning or exception if needed; see :setting:`SPIDER_LOADER_WARN_ONLY` for details.

Scrapy 1.3.2 (2017-02-13)

Bug fixes

  • Preserve request class when converting to/from dicts (utils.reqser) (:issue:`2510`).
  • Use consistent selectors for author field in tutorial (:issue:`2551`).
  • Fix TLS compatibility in Twisted 17+ (:issue:`2558`)

Scrapy 1.3.1 (2017-02-08)

New features

Bug fixes

Documentation

Cleanups

  • Remove redundant check in MetaRefreshMiddleware (:issue:`2542`).
  • Faster checks in LinkExtractor for allow/deny patterns (:issue:`2538`).
  • Remove dead code supporting old Twisted versions (:issue:`2544`).

Scrapy 1.3.0 (2016-12-21)

This release comes rather soon after 1.2.2 for one main reason: it turned out that releases from 0.18 up to 1.2.2 (inclusive) use some backported code from Twisted (scrapy.xlib.tx.*), even if newer Twisted modules are available. Scrapy now uses twisted.web.client and twisted.internet.endpoints directly. (See also cleanups below.)

As it is a major change, we wanted to get the bug fix out quickly while not breaking any projects using the 1.2 series.

New Features

  • MailSender now accepts single strings as values for to and cc arguments (:issue:`2272`)
  • scrapy fetch url, scrapy shell url and fetch(url) inside scrapy shell now follow HTTP redirections by default (:issue:`2290`); see :command:`fetch` and :command:`shell` for details.
  • HttpErrorMiddleware now logs errors with INFO level instead of DEBUG; this is technically backwards incompatible so please check your log parsers.
  • By default, logger names now use a long-form path, e.g. [scrapy.extensions.logstats], instead of the shorter "top-level" variant of prior releases (e.g. [scrapy]); this is backwards incompatible if you have log parsers expecting the short logger name part. You can switch back to short logger names using :setting:`LOG_SHORT_NAMES` set to True.
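
If existing log parsers rely on the short logger name, the last item above can be handled with a one-line settings change, sketched here:

```python
# settings.py (fragment)
# Restore the pre-1.3 behaviour: log lines show "[scrapy]" instead of the
# full logger path such as "[scrapy.extensions.logstats]".
LOG_SHORT_NAMES = True
```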

Dependencies & Cleanups

  • Scrapy now requires Twisted >= 13.1 which is the case for many Linux distributions already.
  • As a consequence, we got rid of scrapy.xlib.tx.* modules, which copied some of Twisted code for users stuck with an "old" Twisted version
  • ChunkedTransferMiddleware is deprecated and removed from the default downloader middlewares.

Scrapy 1.2.3 (2017-03-03)

  • Packaging fix: disallow unsupported Twisted versions in setup.py

Scrapy 1.2.2 (2016-12-06)

Bug fixes

  • Fix a cryptic traceback when a pipeline fails on open_spider() (:issue:`2011`)
  • Fix embedded IPython shell variables (fixing :issue:`396` that re-appeared in 1.2.0, fixed in :issue:`2418`)
  • A couple of patches when dealing with robots.txt:

Documentation

Other changes

  • Advertise conda-forge as Scrapy's official conda channel (:issue:`2387`)
  • More helpful error messages when trying to use .css() or .xpath() on non-Text Responses (:issue:`2264`)
  • startproject command now generates a sample middlewares.py file (:issue:`2335`)
  • Add more dependencies' version info in scrapy version verbose output (:issue:`2404`)
  • Remove all *.pyc files from source distribution (:issue:`2386`)

Scrapy 1.2.1 (2016-10-21)

Bug fixes

  • Include OpenSSL's more permissive default ciphers when establishing TLS/SSL connections (:issue:`2314`).
  • Fix "Location" HTTP header decoding on non-ASCII URL redirects (:issue:`2321`).

Documentation

Other changes

  • Removed www. from start_urls in built-in spider templates (:issue:`2299`).

Scrapy 1.2.0 (2016-10-03)

New Features

Bug fixes

  • DefaultRequestHeaders middleware now runs before UserAgent middleware (:issue:`2088`). Warning: this is technically backwards incompatible, though we consider this a bug fix.
  • HTTP cache extension and plugins that use the .scrapy data directory now work outside projects (:issue:`1581`). Warning: this is technically backwards incompatible, though we consider this a bug fix.
  • Selector does not allow passing both response and text anymore (:issue:`2153`).
  • Fixed logging of wrong callback name with scrapy parse (:issue:`2169`).
  • Fix for an odd gzip decompression bug (:issue:`1606`).
  • Fix for selected callbacks when using CrawlSpider with :command:`scrapy parse <parse>` (:issue:`2225`).
  • Fix for invalid JSON and XML files when spider yields no items (:issue:`872`).
  • Implement flush() for StreamLogger avoiding a warning in logs (:issue:`2125`).

Refactoring

Tests & Requirements

Scrapy's new requirements baseline is Debian 8 "Jessie". It was previously Ubuntu 12.04 Precise. What this means in practice is that we run continuous integration tests with these (main) package versions at a minimum: Twisted 14.0, pyOpenSSL 0.14, lxml 3.4.

Scrapy may very well work with older versions of these packages (the code base still has switches for older Twisted versions for example) but it is not guaranteed (because it's not tested anymore).

Documentation

Scrapy 1.1.4 (2017-03-03)

  • Packaging fix: disallow unsupported Twisted versions in setup.py

Scrapy 1.1.3 (2016-09-22)

Bug fixes

  • Class attributes for subclasses of ImagesPipeline and FilesPipeline work as they did before 1.1.1 (:issue:`2243`, fixes :issue:`2198`)

Documentation

Scrapy 1.1.2 (2016-08-18)

Bug fixes

  • Introduce a missing :setting:`IMAGES_STORE_S3_ACL` setting to override the default ACL policy in ImagesPipeline when uploading images to S3 (note that default ACL policy is "private" -- instead of "public-read" -- since Scrapy 1.1.0)
  • :setting:`IMAGES_EXPIRES` default value set back to 90 (the regression was introduced in 1.1.1)
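
A project that relied on publicly readable images could opt back in with a fragment along these lines. The values are examples only.

```python
# settings.py (fragment): example values, adjust to your project
IMAGES_STORE_S3_ACL = "public-read"  # override the "private" default ACL
IMAGES_EXPIRES = 90                  # days; same as the restored default
```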

Scrapy 1.1.1 (2016-07-13)

Bug fixes

New features

Documentation

Tests

  • Upgrade py.test requirement on Travis CI and pin pytest-cov to 2.2.1 (:issue:`2095`)

Scrapy 1.1.0 (2016-05-11)

This 1.1 release brings a lot of interesting features and bug fixes:

  • Scrapy 1.1 has beta Python 3 support (requires Twisted >= 15.5). See :ref:`news_betapy3` for more details and some limitations.
  • Hot new features:
  • These bug fixes may require your attention:
    • Don't retry bad requests (HTTP 400) by default (:issue:`1289`). If you need the old behavior, add 400 to :setting:`RETRY_HTTP_CODES`.
    • Fix shell files argument handling (:issue:`1710`, :issue:`1550`). If you try scrapy shell index.html, it will try to load the URL http://index.html; use scrapy shell ./index.html to load a local file.
    • Robots.txt compliance is now enabled by default for newly-created projects (:issue:`1724`). Scrapy will also wait for robots.txt to be downloaded before proceeding with the crawl (:issue:`1735`). If you want to disable this behavior, update :setting:`ROBOTSTXT_OBEY` in settings.py file after creating a new project.
    • Exporters now work on unicode, instead of bytes by default (:issue:`1080`). If you use PythonItemExporter, you may want to update your code to disable binary mode which is now deprecated.
    • Accept XML node names containing dots as valid (:issue:`1533`).
    • When uploading files or images to S3 (with FilesPipeline or ImagesPipeline), the default ACL policy is now "private" instead of "public-read". Warning: backwards incompatible! You can use :setting:`FILES_STORE_S3_ACL` to change it.
    • We've reimplemented canonicalize_url() for more correct output, especially for URLs with non-ASCII characters (:issue:`1947`). This could change link extractors output compared to previous scrapy versions. This may also invalidate some cache entries you could still have from pre-1.1 runs. Warning: backwards incompatible!
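
Several of the changes above can be reverted through settings. The fragment below is a sketch of what that might look like; the list of retry codes other than 400 is an assumption about the defaults, so check it against your installed version before copying it.

```python
# settings.py (fragment): illustrative values only

# Retry HTTP 400 again. The other codes are assumed defaults, not verified
# against any particular Scrapy version.
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 400]

# Newly generated projects obey robots.txt; set to False to opt out.
ROBOTSTXT_OBEY = False

# Make files uploaded to S3 by FilesPipeline publicly readable again.
FILES_STORE_S3_ACL = "public-read"
```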

Keep reading for more details on other improvements and bug fixes.

Beta Python 3 Support

We have been hard at work to make Scrapy run on Python 3. As a result, now you can run spiders on Python 3.3, 3.4 and 3.5 (Twisted >= 15.5 required). Some features are still missing (and some may never be ported).

Almost all builtin extensions/middlewares are expected to work. However, we are aware of some limitations in Python 3:

  • Scrapy does not work on Windows with Python 3
  • Sending emails is not supported
  • FTP download handler is not supported
  • Telnet console is not supported

Additional New Features and Enhancements

Deprecations and Removals

  • Added to_bytes and to_unicode, deprecated str_to_unicode and unicode_to_str functions (:issue:`778`).
  • binary_is_text is introduced, to replace use of isbinarytext (but with inverse return value) (:issue:`1851`)
  • The optional_features set has been removed (:issue:`1359`).
  • The --lsprof command line option has been removed (:issue:`1689`). Warning: backward incompatible, but doesn't break user code.
  • The following datatypes were deprecated (:issue:`1720`):
    • scrapy.utils.datatypes.MultiValueDictKeyError
    • scrapy.utils.datatypes.MultiValueDict
    • scrapy.utils.datatypes.SiteNode
  • The previously bundled scrapy.xlib.pydispatch library was deprecated and replaced by pydispatcher.

Relocations

Bugfixes

Scrapy 1.0.7 (2017-03-03)

  • Packaging fix: disallow unsupported Twisted versions in setup.py

Scrapy 1.0.6 (2016-05-04)

  • FIX: RetryMiddleware is now robust to non-standard HTTP status codes (:issue:`1857`)
  • FIX: Filestorage HTTP cache was checking wrong modified time (:issue:`1875`)
  • DOC: Support for Sphinx 1.4+ (:issue:`1893`)
  • DOC: Consistency in selectors examples (:issue:`1869`)

Scrapy 1.0.5 (2016-02-04)

Scrapy 1.0.4 (2015-12-30)

Scrapy 1.0.3 (2015-08-11)

Scrapy 1.0.2 (2015-08-06)

Scrapy 1.0.1 (2015-07-01)

Scrapy 1.0.0 (2015-06-19)

You will find a lot of new features and bugfixes in this major release. Make sure to check our updated :ref:`overview <intro-overview>` for a glance at some of the changes, along with our brushed-up :ref:`tutorial <intro-tutorial>`.

Support for returning dictionaries in spiders

Declaring and returning Scrapy Items is no longer necessary to collect the scraped data from your spider; you can now return explicit dictionaries instead.

Classic version

class MyItem(scrapy.Item):
    url = scrapy.Field()

class MySpider(scrapy.Spider):
    def parse(self, response):
        return MyItem(url=response.url)

New version

class MySpider(scrapy.Spider):
    def parse(self, response):
        return {'url': response.url}

Per-spider settings (GSoC 2014)

The last Google Summer of Code project accomplished an important redesign of the mechanism used for populating settings, introducing explicit priorities to override any given setting. As an extension of that goal, we included a new priority level for settings that apply exclusively to a single spider, allowing them to redefine project settings.

Start using it by defining a :attr:`~scrapy.spiders.Spider.custom_settings` class variable in your spider:

class MySpider(scrapy.Spider):
    custom_settings = {
        "DOWNLOAD_DELAY": 5.0,
        "RETRY_ENABLED": False,
    }

Read more about settings population: :ref:`topics-settings`
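
The priority idea can be pictured with a toy model. The class below is not Scrapy's Settings implementation; PrioritySettings is a hypothetical name, and the numeric priority values are assumed for illustration. It only shows how a spider-level value wins over a project-level one while a command-line value still wins over both.

```python
# Priority levels, lowest to highest (numeric values are assumptions).
PRIORITIES = {"default": 0, "command": 10, "project": 20, "spider": 30, "cmdline": 40}


class PrioritySettings:
    """Toy settings store: a value is replaced only by an equal or higher priority."""

    def __init__(self):
        self._values = {}  # name -> (value, numeric priority)

    def set(self, name, value, priority):
        level = PRIORITIES[priority]
        current = self._values.get(name)
        if current is None or level >= current[1]:
            self._values[name] = (value, level)

    def get(self, name, default=None):
        entry = self._values.get(name)
        return default if entry is None else entry[0]


settings = PrioritySettings()
settings.set("DOWNLOAD_DELAY", 0, "default")
settings.set("DOWNLOAD_DELAY", 5.0, "spider")   # e.g. from custom_settings
settings.set("DOWNLOAD_DELAY", 1.0, "project")  # ignored: lower than "spider"
print(settings.get("DOWNLOAD_DELAY"))           # prints 5.0
```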

Python Logging

Scrapy 1.0 has moved away from Twisted logging in favor of Python’s built-in logging module as the default logging system. We’re maintaining backward compatibility for most of the old custom interface to call logging functions, but you’ll get warnings to switch to the Python logging API entirely.

Old version

from scrapy import log
log.msg('MESSAGE', log.INFO)

New version

import logging
logging.info('MESSAGE')

Logging with spiders remains the same, but on top of the :meth:`~scrapy.spiders.Spider.log` method you’ll have access to a custom :attr:`~scrapy.spiders.Spider.logger` created for the spider to issue log events:

class MySpider(scrapy.Spider):
    def parse(self, response):
        self.logger.info('Response received')

Read more in the logging documentation: :ref:`topics-logging`
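
Outside of spiders (in pipelines, extensions or helper modules, for example), a common standard-library pattern is a module-level logger. The sketch below uses only Python's logging module; the report function is a made-up helper for this example.

```python
import logging

# A module-level logger named after the current module.
logger = logging.getLogger(__name__)


def report(status):
    """Log at a level chosen from the HTTP status code."""
    if status >= 400:
        logger.warning("Got error status %d", status)
    else:
        logger.info("Got status %d", status)


if __name__ == "__main__":
    # Configure output only when run as a script, not on import.
    logging.basicConfig(level=logging.INFO)
    report(200)
    report(404)
```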

Crawler API refactoring (GSoC 2014)

Another milestone of the last Google Summer of Code was a refactoring of the internal API, aiming for simpler and easier usage. Check the new core interface in :ref:`topics-api`.

A common situation where you will face these changes is while running Scrapy from scripts. Here’s a quick example of how to run a Spider manually with the new API:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()

Bear in mind this feature is still under development and its API may change until it reaches a stable status.

See more examples for scripts running Scrapy: :ref:`topics-practices`

Module Relocations

There’s been a large rearrangement of modules to improve the general structure of Scrapy. The main changes were separating various subpackages into new projects and dissolving both scrapy.contrib and scrapy.contrib_exp into top-level packages. Backward compatibility was kept for internal relocations; when importing deprecated modules, expect warnings indicating their new location.

Full list of relocations

Outsourced packages

Note

These extensions went through some minor changes, e.g. some setting names were changed. Please check the documentation in each new repository to get familiar with the new usage.

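=========================  ==================================================================
Old location               New location
=========================  ==================================================================
scrapy.commands.deploy     scrapyd-client (See other alternatives here: :ref:`topics-deploy`)
scrapy.contrib.djangoitem  scrapy-djangoitem
scrapy.webservice          scrapy-jsonrpc
=========================  ==================================================================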

scrapy.contrib_exp and scrapy.contrib dissolutions

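=====================================================  ==========================================
Old location                                           New location
=====================================================  ==========================================
scrapy.contrib_exp.downloadermiddleware.decompression  scrapy.downloadermiddlewares.decompression
scrapy.contrib_exp.iterators                           scrapy.utils.iterators
scrapy.contrib.downloadermiddleware                    scrapy.downloadermiddlewares
scrapy.contrib.exporter                                scrapy.exporters
scrapy.contrib.linkextractors                          scrapy.linkextractors
scrapy.contrib.loader                                  scrapy.loader
scrapy.contrib.loader.processor                        scrapy.loader.processors
scrapy.contrib.pipeline                                scrapy.pipelines
scrapy.contrib.spidermiddleware                        scrapy.spidermiddlewares
scrapy.contrib.spiders                                 scrapy.spiders
scrapy.contrib.closespider                             scrapy.extensions.*
scrapy.contrib.corestats                               scrapy.extensions.*
scrapy.contrib.debug                                   scrapy.extensions.*
scrapy.contrib.feedexport                              scrapy.extensions.*
scrapy.contrib.httpcache                               scrapy.extensions.*
scrapy.contrib.logstats                                scrapy.extensions.*
scrapy.contrib.memdebug                                scrapy.extensions.*
scrapy.contrib.memusage                                scrapy.extensions.*
scrapy.contrib.spiderstate                             scrapy.extensions.*
scrapy.contrib.statsmailer                             scrapy.extensions.*
scrapy.contrib.throttle                                scrapy.extensions.*
=====================================================  ==========================================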

Plural renames and Modules unification

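======================  =======================
Old location            New location
======================  =======================
scrapy.command          scrapy.commands
scrapy.dupefilter       scrapy.dupefilters
scrapy.linkextractor    scrapy.linkextractors
scrapy.spider           scrapy.spiders
scrapy.squeue           scrapy.squeues
scrapy.statscol         scrapy.statscollectors
scrapy.utils.decorator  scrapy.utils.decorators
======================  =======================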

Class renames

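==================================  ================================
Old location                        New location
==================================  ================================
scrapy.spidermanager.SpiderManager  scrapy.spiderloader.SpiderLoader
==================================  ================================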

Settings renames

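====================  ===================
Old location          New location
====================  ===================
SPIDER_MANAGER_CLASS  SPIDER_LOADER_CLASS
====================  ===================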
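
When updating dotted paths in settings (middleware or pipeline entries, for example), a small helper can apply the renames mechanically. The sketch below is hypothetical and not part of Scrapy: relocate and RELOCATIONS are names invented here, only a few rows of the relocation tables are copied in, and it assumes that names inside the moved modules did not change.

```python
# Old prefix -> new prefix, copied from a few rows of the relocation tables.
RELOCATIONS = {
    "scrapy.contrib.downloadermiddleware": "scrapy.downloadermiddlewares",
    "scrapy.contrib.spidermiddleware": "scrapy.spidermiddlewares",
    "scrapy.contrib.linkextractors": "scrapy.linkextractors",
    "scrapy.contrib.loader.processor": "scrapy.loader.processors",
    "scrapy.contrib.loader": "scrapy.loader",
    "scrapy.contrib.spiders": "scrapy.spiders",
    "scrapy.spider": "scrapy.spiders",
    "scrapy.dupefilter": "scrapy.dupefilters",
}


def relocate(dotted_path):
    """Rewrite an old dotted path using the longest matching old prefix."""
    for old in sorted(RELOCATIONS, key=len, reverse=True):
        if dotted_path == old or dotted_path.startswith(old + "."):
            return RELOCATIONS[old] + dotted_path[len(old):]
    return dotted_path  # unknown or already up to date


print(relocate("scrapy.contrib.spiders.CrawlSpider"))
print(relocate("scrapy.contrib.loader.processor.TakeFirst"))
```

Matching the longest prefix first keeps scrapy.contrib.loader.processor from being caught by the shorter scrapy.contrib.loader rule.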

Changelog

New Features and Enhancements

Deprecations and Removals

Relocations

Documentation

Bugfixes

Python 3 In Progress Support

Tests

Code refactoring

  • CSVFeedSpider cleanup: use iterate_spider_output (:issue:`1079`)
  • remove unnecessary check from scrapy.utils.spider.iter_spider_output (:issue:`1078`)
  • Pydispatch pep8 (:issue:`992`)
  • Removed unused 'load=False' parameter from walk_modules() (:issue:`871`)
  • For consistency, use job_dir helper in SpiderState extension. (:issue:`805`)
  • rename "sflo" local variables to less cryptic "log_observer" (:issue:`775`)

Scrapy 0.24.6 (2015-04-20)

Scrapy 0.24.5 (2015-02-25)

Scrapy 0.24.4 (2014-08-09)

Scrapy 0.24.3 (2014-08-09)

Scrapy 0.24.2 (2014-07-08)

Scrapy 0.24.1 (2014-06-27)

  • Fix deprecated CrawlerSettings and increase backwards compatibility with .defaults attribute (:commit:`8e3f20a`)

Scrapy 0.24.0 (2014-06-26)

Enhancements

Bugfixes

Scrapy 0.22.2 (released 2014-02-14)

Scrapy 0.22.1 (released 2014-02-08)

Scrapy 0.22.0 (released 2014-01-17)

Enhancements

Fixes

Scrapy 0.20.2 (released 2013-12-09)

Scrapy 0.20.1 (released 2013-11-28)

  • include_package_data is required to build wheels from published sources (:commit:`5ba1ad5`)
  • process_parallel was leaking the failures on its internal deferreds. closes #458 (:commit:`419a780`)

Scrapy 0.20.0 (released 2013-11-08)

Enhancements

Bugfixes

Other

  • Dropped Python 2.6 support (:issue:`448`)
  • Add cssselect python package as install dependency
  • Drop libxml2 and multi selector's backend support, lxml is required from now on.
  • Minimum Twisted version increased to 10.0.0, dropped Twisted 8.0 support.
  • Running test suite now requires mock python library (:issue:`390`)

Thanks

Thanks to everyone who contributed to this release!

List of contributors sorted by number of commits:

69 Daniel Graña <dangra@...>
37 Pablo Hoffman <pablo@...>
13 Mikhail Korobov <kmike84@...>
 9 Alex Cepoi <alex.cepoi@...>
 9 alexanderlukanin13 <alexander.lukanin.13@...>
 8 Rolando Espinoza La fuente <darkrho@...>
 8 Lukasz Biedrycki <lukasz.biedrycki@...>
 6 Nicolas Ramirez <nramirez.uy@...>
 3 Paul Tremberth <paul.tremberth@...>
 2 Martin Olveyra <molveyra@...>
 2 Stefan <misc@...>
 2 Rolando Espinoza <darkrho@...>
 2 Loren Davie <loren@...>
 2 irgmedeiros <irgmedeiros@...>
 1 Stefan Koch <taikano@...>
 1 Stefan <cct@...>
 1 scraperdragon <dragon@...>
 1 Kumara Tharmalingam <ktharmal@...>
 1 Francesco Piccinno <stack.box@...>
 1 Marcos Campal <duendex@...>
 1 Dragon Dave <dragon@...>
 1 Capi Etheriel <barraponto@...>
 1 cacovsky <amarquesferraz@...>
 1 Berend Iwema <berend@...>

Scrapy 0.18.4 (released 2013-10-10)

Scrapy 0.18.3 (released 2013-10-03)

Scrapy 0.18.2 (released 2013-09-03)

  • Backport scrapy check command fixes and backward compatible multi crawler process (:issue:`339`)

Scrapy 0.18.1 (released 2013-08-27)

Scrapy 0.18.0 (released 2013-08-09)

  • Lots of improvements to the test suite run using Tox, including a way to test on pypi
  • Handle GET parameters for AJAX crawlable URLs (:commit:`3fe2a32`)
  • Use lxml recover option to parse sitemaps (:issue:`347`)
  • Bugfix cookie merging by hostname and not by netloc (:issue:`352`)
  • Support disabling HttpCompressionMiddleware using a flag setting (:issue:`359`)
  • Support xml namespaces using iternodes parser in XMLFeedSpider (:issue:`12`)
  • Support dont_cache request meta flag (:issue:`19`)
  • Bugfix scrapy.utils.gz.gunzip broken by changes in python 2.7.4 (:commit:`4dc76e`)
  • Bugfix url encoding on SgmlLinkExtractor (:issue:`24`)
  • Bugfix TakeFirst processor shouldn't discard zero (0) value (:issue:`59`)
  • Support nested items in xml exporter (:issue:`66`)
  • Improve cookies handling performance (:issue:`77`)
  • Log dupe filtered requests once (:issue:`105`)
  • Split redirection middleware into status and meta based middlewares (:issue:`78`)
  • Use HTTP1.1 as default downloader handler (:issue:`109` and :issue:`318`)
  • Support xpath form selection on FormRequest.from_response (:issue:`185`)
  • Bugfix unicode decoding error on SgmlLinkExtractor (:issue:`199`)
  • Bugfix signal dispatching on PyPy interpreter (:issue:`205`)
  • Improve request delay and concurrency handling (:issue:`206`)
  • Add RFC2616 cache policy to HttpCacheMiddleware (:issue:`212`)
  • Allow customization of messages logged by engine (:issue:`214`)
  • Multiple improvements to DjangoItem (:issue:`217`, :issue:`218`, :issue:`221`)
  • Extend Scrapy commands using setuptools entry points (:issue:`260`)
  • Allow spider allowed_domains value to be a set or tuple (:issue:`261`)
  • Support settings.getdict (:issue:`269`)
  • Simplify internal scrapy.core.scraper slot handling (:issue:`271`)
  • Added Item.copy (:issue:`290`)
  • Collect idle downloader slots (:issue:`297`)
  • Add ftp:// scheme downloader handler (:issue:`329`)
  • Added downloader benchmark webserver and spider tools :ref:`benchmarking`
  • Moved persistent (on disk) queues to a separate project (queuelib) which scrapy now depends on
  • Add scrapy commands using external libraries (:issue:`260`)
  • Added --pdb option to scrapy command line tool
  • Added :meth:`XPathSelector.remove_namespaces`, which removes all namespaces from XML documents for convenience (to work with namespace-less XPaths). Documented in :ref:`topics-selectors`.
  • Several improvements to spider contracts
  • New default middleware named MetaRefreshMiddleware that handles meta-refresh HTML tag redirections
  • MetaRefreshMiddleware and RedirectMiddleware have different priorities to address #62
  • added from_crawler method to spiders
  • added system tests with mock server
  • more improvements to Mac OS compatibility (thanks Alex Cepoi)
  • several more cleanups to singletons and multi-spider support (thanks Nicolas Ramirez)
  • support custom download slots
  • added --spider option to "shell" command.
  • log overridden settings when scrapy starts
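
One fix in the list above, the TakeFirst processor no longer discarding zero, is easy to picture with a few lines of plain Python. This is a sketch of the intended behaviour, not a copy of Scrapy's source; take_first is a stand-in name.

```python
def take_first(values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


print(take_first([None, "", 0, 5]))  # prints 0: zero is kept, not skipped
```

A plain truthiness check (if value:) would have skipped the 0 and returned 5, which is the kind of bug the entry describes.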

Thanks to everyone who contributed to this release. Here is a list of contributors sorted by number of commits:

130 Pablo Hoffman <pablo@...>
 97 Daniel Graña <dangra@...>
 20 Nicolás Ramírez <nramirez.uy@...>
 13 Mikhail Korobov <kmike84@...>
 12 Pedro Faustino <pedrobandim@...>
 11 Steven Almeroth <sroth77@...>
  5 Rolando Espinoza La fuente <darkrho@...>
  4 Michal Danilak <mimino.coder@...>
  4 Alex Cepoi <alex.cepoi@...>
  4 Alexandr N Zamaraev (aka tonal) <tonal@...>
  3 paul <paul.tremberth@...>
  3 Martin Olveyra <molveyra@...>
  3 Jordi Llonch <llonchj@...>
  3 arijitchakraborty <myself.arijit@...>
  2 Shane Evans <shane.evans@...>
  2 joehillen <joehillen@...>
  2 Hart <HartSimha@...>
  2 Dan <ellisd23@...>
  1 Zuhao Wan <wanzuhao@...>
  1 whodatninja <blake@...>
  1 vkrest <v.krestiannykov@...>
  1 tpeng <pengtaoo@...>
  1 Tom Mortimer-Jones <tom@...>
  1 Rocio Aramberri <roschegel@...>
  1 Pedro <pedro@...>
  1 notsobad <wangxiaohugg@...>
  1 Natan L <kuyanatan.nlao@...>
  1 Mark Grey <mark.grey@...>
  1 Luan <luanpab@...>
  1 Libor Nenadál <libor.nenadal@...>
  1 Juan M Uys <opyate@...>
  1 Jonas Brunsgaard <jonas.brunsgaard@...>
  1 Ilya Baryshev <baryshev@...>
  1 Hasnain Lakhani <m.hasnain.lakhani@...>
  1 Emanuel Schorsch <emschorsch@...>
  1 Chris Tilden <chris.tilden@...>
  1 Capi Etheriel <barraponto@...>
  1 cacovsky <amarquesferraz@...>
  1 Berend Iwema <berend@...>

Scrapy 0.16.5 (released 2013-05-30)

Scrapy 0.16.4 (released 2013-01-23)

Scrapy 0.16.3 (released 2012-12-07)

Scrapy 0.16.2 (released 2012-11-09)

Scrapy 0.16.1 (released 2012-10-26)

Scrapy 0.16.0 (released 2012-10-18)

Scrapy changes:

  • added :ref:`topics-contracts`, a mechanism for testing spiders in a formal/reproducible way
  • added options -o and -t to the :command:`runspider` command
  • documented :doc:`topics/autothrottle` and added to extensions installed by default. You still need to enable it with :setting:`AUTOTHROTTLE_ENABLED`
  • major Stats Collection refactoring: removed separation of global/per-spider stats, removed stats-related signals (stats_spider_opened, etc). Stats are much simpler now, backwards compatibility is kept on the Stats Collector API and signals.
  • added :meth:`~scrapy.contrib.spidermiddleware.SpiderMiddleware.process_start_requests` method to spider middlewares
  • dropped Signals singleton. Signals should now be accessed through the Crawler.signals attribute. See the signals documentation for more info.
  • dropped Stats Collector singleton. Stats can now be accessed through the Crawler.stats attribute. See the stats collection documentation for more info.
  • documented :ref:`topics-api`
  • lxml is now the default selectors backend instead of libxml2
  • ported FormRequest.from_response() to use lxml instead of ClientForm
  • removed modules: scrapy.xlib.BeautifulSoup and scrapy.xlib.ClientForm
  • SitemapSpider: added support for sitemap urls ending in .xml and .xml.gz, even if they advertise a wrong content type (:commit:`10ed28b`)
  • StackTraceDump extension: also dump trackref live references (:commit:`fe2ce93`)
  • nested items now fully supported in JSON and JSONLines exporters
  • added :reqmeta:`cookiejar` Request meta key to support multiple cookie sessions per spider
  • decoupled encoding detection code to w3lib.encoding, and ported Scrapy code to use that module
  • dropped support for Python 2.5. See https://blog.scrapinghub.com/2012/02/27/scrapy-0-15-dropping-support-for-python-2-5/
  • dropped support for Twisted 2.5
  • added :setting:`REFERER_ENABLED` setting, to control referer middleware
  • changed default user agent to: Scrapy/VERSION (+http://scrapy.org)
  • removed (undocumented) HTMLImageLinkExtractor class from scrapy.contrib.linkextractors.image
  • removed per-spider settings (to be replaced by instantiating multiple crawler objects)
  • USER_AGENT spider attribute will no longer work, use user_agent attribute instead
  • DOWNLOAD_TIMEOUT spider attribute will no longer work, use download_timeout attribute instead
  • removed ENCODING_ALIASES setting, as encoding auto-detection has been moved to the w3lib library
  • promoted :ref:`topics-djangoitem` to main contrib
  • LogFormatter methods now return dicts (instead of strings) to support lazy formatting (:issue:`164`, :commit:`dcef7b0`)
  • downloader handlers (:setting:`DOWNLOAD_HANDLERS` setting) now receive settings as the first argument of the constructor
  • replaced memory usage accounting with (more portable) resource module, removed scrapy.utils.memory module
  • removed signal: scrapy.mail.mail_sent
  • removed TRACK_REFS setting, now :ref:`trackrefs <topics-leaks-trackrefs>` is always enabled
  • DBM is now the default storage backend for HTTP cache middleware
  • number of log messages (per level) are now tracked through Scrapy stats (stat name: log_count/LEVEL)
  • number of received responses is now tracked through Scrapy stats (stat name: response_received_count)
  • removed scrapy.log.started attribute
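
The log_count/LEVEL stat mentioned above can be pictured as a logging handler that increments a counter per level. The following standard-library sketch illustrates the idea only; LogCountHandler and the demo logger name are assumptions made for this example, not Scrapy classes.

```python
import logging
from collections import Counter


class LogCountHandler(logging.Handler):
    """Count emitted records per level under keys like 'log_count/INFO'."""

    def __init__(self, stats):
        super().__init__()
        self.stats = stats

    def emit(self, record):
        self.stats["log_count/" + record.levelname] += 1


stats = Counter()
demo_logger = logging.getLogger("log_count_demo")
demo_logger.setLevel(logging.DEBUG)
demo_logger.propagate = False  # keep the demo quiet
demo_logger.addHandler(LogCountHandler(stats))

demo_logger.info("first")
demo_logger.info("second")
demo_logger.error("boom")
print(dict(stats))
```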

Scrapy 0.14.4

Scrapy 0.14.3

  • forgot to include pydispatch license. #118 (:commit:`fd85f9c`)
  • include egg files used by testsuite in source distribution. #118 (:commit:`c897793`)
  • update docstring in project template to avoid confusion with genspider command, which may be considered as an advanced feature. refs #107 (:commit:`2548dcc`)
  • added note to docs/topics/firebug.rst about google directory being shut down (:commit:`668e352`)
  • don't discard slot when empty, just save in another dict in order to recycle if needed again. (:commit:`8e9f607`)
  • do not fail handling unicode xpaths in libxml2 backed selectors (:commit:`b830e95`)
  • fixed minor mistake in Request objects documentation (:commit:`bf3c9ee`)
  • fixed minor defect in link extractors documentation (:commit:`ba14f38`)
  • removed some obsolete remaining code related to sqlite support in scrapy (:commit:`0665175`)

Scrapy 0.14.2

Scrapy 0.14.1

Scrapy 0.14

New features and settings

  • Support for AJAX crawlable URLs
  • New persistent scheduler that stores requests on disk, allowing crawls to be suspended and resumed (:rev:`2737`)
  • added -o option to scrapy crawl, a shortcut for dumping scraped items into a file (or standard output using -)
  • Added support for passing custom settings to Scrapyd schedule.json api (:rev:`2779`, :rev:`2783`)
  • New ChunkedTransferMiddleware (enabled by default) to support chunked transfer encoding (:rev:`2769`)
  • Add boto 2.0 support for S3 downloader handler (:rev:`2763`)
  • Added marshal to formats supported by feed exports (:rev:`2744`)
  • In request errbacks, offending requests are now received in failure.request attribute (:rev:`2738`)
  • Big downloader refactoring to support per domain/ip concurrency limits (:rev:`2732`)
  • Added builtin caching DNS resolver (:rev:`2728`)
  • Moved Amazon AWS-related components/extensions (SQS spider queue, SimpleDB stats collector) to a separate project: `scaws <https://github.com/scrapinghub/scaws>`_ (:rev:`2706`, :rev:`2714`)
  • Moved spider queues to scrapyd: scrapy.spiderqueue -> scrapyd.spiderqueue (:rev:`2708`)
  • Moved sqlite utils to scrapyd: scrapy.utils.sqlite -> scrapyd.sqlite (:rev:`2781`)
  • Real support for returning iterators on start_requests() method. The iterator is now consumed during the crawl when the spider is getting idle (:rev:`2704`)
  • Added :setting:`REDIRECT_ENABLED` setting to quickly enable/disable the redirect middleware (:rev:`2697`)
  • Added :setting:`RETRY_ENABLED` setting to quickly enable/disable the retry middleware (:rev:`2694`)
  • Added CloseSpider exception to manually close spiders (:rev:`2691`)
  • Improved encoding detection by adding support for HTML5 meta charset declaration (:rev:`2690`)
  • Refactored close spider behavior to wait for all downloads to finish and be processed by spiders, before closing the spider (:rev:`2688`)
  • Added SitemapSpider (see documentation in Spiders page) (:rev:`2658`)
  • Added LogStats extension for periodically logging basic stats (like crawled pages and scraped items) (:rev:`2657`)
  • Make handling of gzipped responses more robust (#319, :rev:`2643`). Now Scrapy will try and decompress as much as possible from a gzipped response, instead of failing with an IOError.
  • Simplified MemoryDebugger extension to use stats for dumping memory debugging info (:rev:`2639`)
  • Added new command to edit spiders: scrapy edit (:rev:`2636`) and -e flag to genspider command that uses it (:rev:`2653`)
  • Changed default representation of items to pretty-printed dicts. (:rev:`2631`). This improves default logging by making log more readable in the default case, for both Scraped and Dropped lines.
  • Added :signal:`spider_error` signal (:rev:`2628`)
  • Added :setting:`COOKIES_ENABLED` setting (:rev:`2625`)
  • Stats are now dumped to Scrapy log (default value of :setting:`STATS_DUMP` setting has been changed to True). This is to make Scrapy users more aware of Scrapy stats and the data that is collected there.
  • Added support for dynamically adjusting download delay and maximum concurrent requests (:rev:`2599`)
  • Added new DBM HTTP cache storage backend (:rev:`2576`)
  • Added listjobs.json API to Scrapyd (:rev:`2571`)
  • CsvItemExporter: added join_multivalued parameter (:rev:`2578`)
  • Added namespace support to xmliter_lxml (:rev:`2552`)
  • Improved cookies middleware by making COOKIES_DEBUG nicer and documenting it (:rev:`2579`)
  • Several improvements to Scrapyd and Link extractors
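
As a minimal sketch of the failure.request change listed above, an errback can now identify the request that failed. The function name describe_failure is hypothetical, and the sketch assumes the failure object is a Twisted Failure (which provides getErrorMessage()):

```python
def describe_failure(failure):
    """Errback sketch: report which request failed and why."""
    # Scrapy attaches the offending request to the failure it passes to errbacks.
    request = failure.request
    # getErrorMessage() is assumed to come from the Twisted Failure object.
    return "%s failed: %s" % (request.url, failure.getErrorMessage())
```

A function like this would be passed as the errback argument when the request is created.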

Code rearranged and removed

  • Merged item passed and item scraped concepts, as they have often proved confusing in the past. This means: (:rev:`2630`)
    • original item_scraped signal was removed
    • original item_passed signal was renamed to item_scraped
    • old log lines Scraped Item... were removed
    • old log lines Passed Item... were renamed to Scraped Item... lines and downgraded to DEBUG level
  • Reduced Scrapy codebase by stripping part of Scrapy code into two new libraries:
  • Removed unused function: scrapy.utils.request.request_info() (:rev:`2577`)
  • Removed googledir project from examples/googledir. There's now a new example project called dirbot available on github: https://github.com/scrapy/dirbot
  • Removed support for default field values in Scrapy items (:rev:`2616`)
  • Removed experimental crawlspider v2 (:rev:`2632`)
  • Removed scheduler middleware to simplify architecture. Duplicate filtering is now done in the scheduler itself, using the same dupe filtering class as before (DUPEFILTER_CLASS setting) (:rev:`2640`)
  • Removed support for passing urls to scrapy crawl command (use scrapy parse instead) (:rev:`2704`)
  • Removed deprecated Execution Queue (:rev:`2704`)
  • Removed (undocumented) spider context extension (from scrapy.contrib.spidercontext) (:rev:`2780`)
  • removed CONCURRENT_SPIDERS setting (use scrapyd maxproc instead) (:rev:`2789`)
  • Renamed attributes of core components: downloader.sites -> downloader.slots, scraper.sites -> scraper.slots (:rev:`2717`, :rev:`2718`)
  • Renamed setting CLOSESPIDER_ITEMPASSED to :setting:`CLOSESPIDER_ITEMCOUNT` (:rev:`2655`). Backwards compatibility kept.

Scrapy 0.12

The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.

New features and improvements

  • Passed item is now sent in the item argument of the :signal:`item_passed` (#273)
  • Added verbose option to scrapy version command, useful for bug reports (#298)
  • HTTP cache now stored by default in the project data dir (#279)
  • Added project data storage directory (#276, #277)
  • Documented file structure of Scrapy projects (see command-line tool doc)
  • New lxml backend for XPath selectors (#147)
  • Per-spider settings (#245)
  • Support exit codes to signal errors in Scrapy commands (#248)
  • Added -c argument to scrapy shell command
  • Made libxml2 optional (#260)
  • New deploy command (#261)
  • Added :setting:`CLOSESPIDER_PAGECOUNT` setting (#253)
  • Added :setting:`CLOSESPIDER_ERRORCOUNT` setting (#254)

Scrapyd changes

  • Scrapyd now uses one process per spider
  • It stores one log file per spider run, and rotates them, keeping the latest 5 logs per spider (by default)
  • A minimal web ui was added, available at http://localhost:6800 by default
  • There is now a scrapy server command to start a Scrapyd server of the current project

Changes to settings

  • added HTTPCACHE_ENABLED setting (False by default) to enable HTTP cache middleware
  • changed HTTPCACHE_EXPIRATION_SECS semantics: now zero means "never expire".
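
A minimal settings sketch for the two changes above, assuming they are placed in the project's settings module:

```python
# Project settings module (sketch).
HTTPCACHE_ENABLED = True        # the HTTP cache middleware is disabled by default
HTTPCACHE_EXPIRATION_SECS = 0   # zero now means cached responses never expire
```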

Deprecated/obsoleted functionality

  • Deprecated runserver command in favor of server command which starts a Scrapyd server. See also: Scrapyd changes
  • Deprecated queue command in favor of using Scrapyd schedule.json API. See also: Scrapyd changes
  • Removed the LxmlItemLoader (experimental contrib which never graduated to main contrib)

Scrapy 0.10

The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.

New features and improvements

  • New Scrapy service called scrapyd for deploying Scrapy crawlers in production (#218) (documentation available)
  • Simplified Images pipeline usage which doesn't require subclassing your own images pipeline now (#217)
  • Scrapy shell now shows the Scrapy log by default (#206)
  • Refactored execution queue in a common base code and pluggable backends called "spider queues" (#220)
  • New persistent spider queue (based on SQLite) (#198), available by default, which allows starting Scrapy in server mode and then scheduling spiders to run.
  • Added documentation for Scrapy command-line tool and all its available sub-commands. (documentation available)
  • Feed exporters with pluggable backends (#197) (documentation available)
  • Deferred signals (#193)
  • Added two new methods to item pipeline open_spider(), close_spider() with deferred support (#195)
  • Support for overriding default request headers per spider (#181)
  • Replaced default Spider Manager with one with similar functionality but not depending on Twisted Plugins (#186)
  • Split Debian package into two packages - the library and the service (#187)
  • Scrapy log refactoring (#188)
  • New extension for keeping persistent spider contexts among different runs (#203)
  • Added dont_redirect request.meta key for avoiding redirects (#233)
  • Added dont_retry request.meta key for avoiding retries (#234)
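
The dont_redirect and dont_retry keys listed above are set in a request's meta dictionary. The helper below is a hypothetical sketch (request_options is not part of Scrapy) that only builds the keyword arguments one might pass to a Request:

```python
def request_options(url, follow_redirects=True, retry=True):
    """Build keyword arguments for a Request (hypothetical helper)."""
    meta = {}
    if not follow_redirects:
        # Ask the redirect middleware to leave this request alone.
        meta["dont_redirect"] = True
    if not retry:
        # Ask the retry middleware not to retry this request.
        meta["dont_retry"] = True
    return {"url": url, "meta": meta}
```

The result could then be unpacked into the request constructor, assuming it accepts url and meta keyword arguments.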

Command-line tool changes

  • New scrapy command which replaces the old scrapy-ctl.py (#199)
    • there is only one global scrapy command now, instead of one scrapy-ctl.py per project
    • Added scrapy.bat script for running more conveniently from Windows
  • Added bash completion to command-line tool (#210)
  • Renamed command start to runserver (#209)

API changes

  • url and body attributes of Request objects are now read-only (#230)
  • Request.copy() and Request.replace() now also copy their callback and errback attributes (#231)
  • Removed UrlFilterMiddleware from scrapy.contrib (already disabled by default)
  • Offsite middleware doesn't filter out any request coming from a spider that doesn't have an allowed_domains attribute (#225)
  • Removed Spider Manager load() method. Now spiders are loaded in the constructor itself.
  • Changes to Scrapy Manager (now called "Crawler"):
    • scrapy.core.manager.ScrapyManager class renamed to scrapy.crawler.Crawler
    • scrapy.core.manager.scrapymanager singleton moved to scrapy.project.crawler
  • Moved module: scrapy.contrib.spidermanager to scrapy.spidermanager
  • Spider Manager singleton moved from scrapy.spider.spiders to the spiders attribute of the scrapy.project.crawler singleton.
  • moved Stats Collector classes: (#204)
    • scrapy.stats.collector.StatsCollector to scrapy.statscol.StatsCollector
    • scrapy.stats.collector.SimpledbStatsCollector to scrapy.contrib.statscol.SimpledbStatsCollector
  • default per-command settings are now specified in the default_settings attribute of command object class (#201)
  • changed arguments of Item pipeline process_item() method from (spider, item) to (item, spider)
    • backwards compatibility kept (with deprecation warning)
  • moved scrapy.core.signals module to scrapy.signals
    • backwards compatibility kept (with deprecation warning)
  • moved scrapy.core.exceptions module to scrapy.exceptions
    • backwards compatibility kept (with deprecation warning)
  • added handles_request() class method to BaseSpider
  • dropped scrapy.log.exc() function (use scrapy.log.err() instead)
  • dropped component argument of scrapy.log.msg() function
  • dropped scrapy.log.log_level attribute
  • Added from_settings() class methods to Spider Manager, and Item Pipeline Manager
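
The new process_item() argument order, together with the open_spider() and close_spider() pipeline methods added in this release, can be sketched as follows. The class name ItemCountPipeline and the use of spider.name as a key are assumptions made for illustration:

```python
class ItemCountPipeline(object):
    """Hypothetical pipeline that counts the items seen per spider run."""

    def __init__(self):
        self.counts = {}    # running counts for spiders that are open
        self.finished = {}  # final counts for spiders that have closed

    def open_spider(self, spider):
        # Set up per-spider state when the spider is opened.
        self.counts[spider.name] = 0

    def process_item(self, item, spider):
        # Note the order: item first, then spider.
        self.counts[spider.name] += 1
        return item

    def close_spider(self, spider):
        # Move the final count aside when the spider is closed.
        self.finished[spider.name] = self.counts.pop(spider.name)
```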

Changes to settings

  • Added HTTPCACHE_IGNORE_SCHEMES setting to ignore certain schemes on HttpCacheMiddleware (#225)
  • Added SPIDER_QUEUE_CLASS setting which defines the spider queue to use (#220)
  • Added KEEP_ALIVE setting (#220)
  • Removed SERVICE_QUEUE setting (#220)
  • Removed COMMANDS_SETTINGS_MODULE setting (#201)
  • Renamed REQUEST_HANDLERS to DOWNLOAD_HANDLERS and made download handlers classes (instead of functions)

Scrapy 0.9

The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.

New features and improvements

API changes

Changes to default settings

Scrapy 0.8

The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.

New features

Backwards-incompatible changes

  • Changed scrapy.utils.response.get_meta_refresh() signature (:rev:`1804`)
  • Removed deprecated scrapy.item.ScrapedItem class - use scrapy.item.Item instead (:rev:`1838`)
  • Removed deprecated scrapy.xpath module - use scrapy.selector instead. (:rev:`1836`)
  • Removed deprecated core.signals.domain_open signal - use core.signals.domain_opened instead (:rev:`1822`)
  • log.msg() now receives a spider argument (:rev:`1822`)
    • Old domain argument has been deprecated and will be removed in 0.9. For spiders, you should always use the spider argument and pass spider references. If you really want to pass a string, use the component argument instead.
  • Changed core signals domain_opened, domain_closed, domain_idle
  • Changed Item pipeline to use spiders instead of domains
    • The domain argument of process_item() item pipeline method was changed to spider, the new signature is: process_item(spider, item) (:rev:`1827` | #105)
    • To quickly port your code (to work with Scrapy 0.8) just use spider.domain_name where you previously used domain.
  • Changed Stats API to use spiders instead of domains (:rev:`1849` | #113)
    • StatsCollector was changed to receive spider references (instead of domains) in its methods (set_value, inc_value, etc).
    • added StatsCollector.iter_spider_stats() method
    • removed StatsCollector.list_domains() method
    • Also, Stats signals were renamed and now pass around spider references (instead of domains). Here's a summary of the changes:
    • To quickly port your code (to work with Scrapy 0.8) just use spider.domain_name where you previously used domain. spider_stats contains exactly the same data as domain_stats.
  • CloseDomain extension moved to scrapy.contrib.closespider.CloseSpider (:rev:`1833`)
    • Its settings were also renamed:
      • CLOSEDOMAIN_TIMEOUT to CLOSESPIDER_TIMEOUT
      • CLOSEDOMAIN_ITEMCOUNT to CLOSESPIDER_ITEMCOUNT
  • Removed deprecated SCRAPYSETTINGS_MODULE environment variable - use SCRAPY_SETTINGS_MODULE instead (:rev:`1840`)
  • Renamed setting: REQUESTS_PER_DOMAIN to CONCURRENT_REQUESTS_PER_SPIDER (:rev:`1830`, :rev:`1844`)
  • Renamed setting: CONCURRENT_DOMAINS to CONCURRENT_SPIDERS (:rev:`1830`)
  • Refactored HTTP Cache middleware
  • HTTP Cache middleware has been heavily refactored, retaining the same functionality except for the domain sectorization which was removed. (:rev:`1843`)
  • Renamed exception: DontCloseDomain to DontCloseSpider (:rev:`1859` | #120)
  • Renamed extension: DelayedCloseDomain to SpiderCloseDelay (:rev:`1861` | #121)
  • Removed obsolete scrapy.utils.markup.remove_escape_chars function - use scrapy.utils.markup.replace_escape_chars instead (:rev:`1865`)
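
As a rough porting sketch for the pipeline change above, code that used the domain argument can read spider.domain_name instead. The class name and the "domain" item key are hypothetical. The argument order shown is the 0.8 one, which these notes record as changed to (item, spider) in 0.10:

```python
class DomainTagPipeline(object):
    """Hypothetical pipeline written against the 0.8 process_item signature."""

    def process_item(self, spider, item):
        # Previously the first argument was the domain itself; the same
        # value is now read from the spider reference.
        item["domain"] = spider.domain_name
        return item
```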

Scrapy 0.7

First release of Scrapy.