@sibiryakov released this Aug 18, 2016

  • Full Python 3 support 👏 👍 🍻 (#106), all thanks go to @Preetwinder.
  • The canonicalize_url method was removed in favor of the w3lib implementation.
  • The whole Request (incl. meta) is now propagated to the DB Worker via the scoring log (fixes #131).
  • CRC32 is now generated from the hostname the same way on both Python 2 and Python 3.
  • HBaseQueue now supports delayed requests. A ‘crawl_at’ field in meta, holding a timestamp, makes the request available to spiders only after that moment has passed. This is an important feature for revisiting.
  • The Request object is now persisted in HBaseQueue, which allows scheduling requests with specific meta, headers, body, and cookies parameters.
  • New MESSAGE_BUS_CODEC option for choosing a message bus codec other than the default.
  • Strategy worker refactored to simplify its customization from subclasses.
  • Fixed a bug with extracted links distribution over spider log partitions (#129).
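
The delayed-request behaviour described above can be illustrated with a minimal sketch. It is plain Python, not Frontera's actual HBaseQueue code: the helper names `is_due` and `revisit_meta` are hypothetical, and it assumes that ‘crawl_at’ holds a Unix timestamp in seconds and that a request without the field is available immediately.

```python
import time


def is_due(meta, now=None):
    """Return True if a request with this meta may be handed to a spider.

    Hypothetical helper: assumes 'crawl_at' is a Unix timestamp in seconds
    and that requests without it are available right away.
    """
    now = time.time() if now is None else now
    return meta.get('crawl_at', 0) <= now


def revisit_meta(delay_seconds, now=None):
    """Build a meta dict that delays a request by `delay_seconds` (hypothetical helper)."""
    now = time.time() if now is None else now
    return {'crawl_at': int(now) + delay_seconds}


# Schedule a revisit one day from now; the queue would hold it back until then.
meta = revisit_meta(24 * 60 * 60)
print(is_due(meta))
```

A crawling strategy that wants to revisit a page could attach such a meta dict to the request it schedules, leaving it to the queue to hold the request back until the timestamp has passed.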