- Full Python 3 support 👏 👍 🍻 (#106), all the thanks go to @Preetwinder.
- `canonicalize_url` method removed in favor of the w3lib implementation.
- The whole `Request` (incl. meta) is propagated to the DB Worker by means of the scoring log (fixes #131).
- CRC32 is now generated from the hostname the same way on both platforms: Python 2 and 3.
- `HBaseQueue` now supports delayed requests. A `crawl_at` field in meta holding a timestamp makes the request available to spiders only after that moment has passed. This is an important feature for revisiting.
- `Request` object is now persisted in `HBaseQueue`, allowing requests to be scheduled with specific meta, headers, body and cookies parameters.
- `MESSAGE_BUS_CODEC` option, allowing a message bus codec other than the default one to be chosen.
- Strategy worker refactoring to simplify its customization from subclasses.
- Fixed a bug with extracted links distribution over spider log partitions (#129).
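
Two of the items above, the platform-independent CRC32 and the `crawl_at` meta field, can be illustrated with a minimal sketch. The helper names `hostname_crc32` and `delayed_meta` are hypothetical and not part of the library. The masking approach is only one common way to get identical values on Python 2 and 3 and may differ from what this release actually does. The format of `crawl_at` (an integer Unix timestamp in seconds, stored under a plain string key) is likewise an assumption.

```python
import time
import zlib


def hostname_crc32(hostname):
    """Return an unsigned CRC32 of a hostname (hypothetical helper)."""
    if not isinstance(hostname, bytes):
        hostname = hostname.encode("utf-8")
    # zlib.crc32 can return a negative number on Python 2 but never on
    # Python 3; masking with 0xFFFFFFFF yields the same value on both.
    return zlib.crc32(hostname) & 0xFFFFFFFF


def delayed_meta(delay_seconds, now=None):
    """Build a meta dict asking for a request to be held back (hypothetical helper)."""
    now = time.time() if now is None else now
    # 'crawl_at' is the field named in the notes; the integer Unix timestamp
    # and the string key type are assumptions of this sketch.
    return {"crawl_at": int(now + delay_seconds)}


if __name__ == "__main__":
    print(hostname_crc32("example.com"))
    print(delayed_meta(3600))  # revisit no earlier than one hour from now
```

A dict like the one returned by `delayed_meta` would be passed as the meta of the request being scheduled; how it is attached depends on the crawling strategy in use.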