Scrapy support for working with streamcorpus Stream Items.
Latest commit 78ecb9d Jan 8, 2015 @gatufo gatufo Update README.rst

scrapy-streamitem

https://badge.fury.io/py/scrapy-streamitem.png https://api.travis-ci.org/scrapinghub/scrapy-streamitem.png?branch=master

Overview

Scrapy support for working with streamcorpus StreamItems.

Includes the following:

  • StreamItem: Scrapy Stream Item definition (streamitem.items.StreamItem).
  • StreamItemLoader: Scrapy ItemLoader for StreamItem (streamitem.loaders.StreamItemLoader).
  • StreamItemExporter: Scrapy ItemExporter that writes .sc files (streamitem.exporters.StreamItemExporter).
  • StreamItemFileFeedStorage: Scrapy FileFeedStorage that handles .sc files (streamitem.storages.StreamItemFileFeedStorage).

Stream Items

A Scrapy StreamItem is populated from the response with the following fields:

  • url: A string containing the URL of the response.
  • body: A string containing the body of the response.
  • source_url: If the response was redirected, a string containing the URL of the original page. Defaults to None.
  • redirect_urls: If the response was redirected, a list containing the URLs of all the redirected pages, including the current one. Defaults to None.
  • http_status: An integer representing the HTTP status of the response, for example 200 or 404.
  • content_type: A string containing the Content-Type HTTP header of the response.
  • response_size: An integer representing the response body size in bytes.
  • metadata: A dict containing arbitrary metadata for this page.
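
To make the field list concrete, here is a minimal sketch of the values a single redirected page might yield. It uses a plain dict rather than the real StreamItem class, and every URL and value in it is invented for illustration; the ordering of redirect_urls is also an assumption.

```python
# Illustrative only: a plain dict standing in for a populated StreamItem.
# All URLs and values below are invented for this example.
body = "<html><body>Hello</body></html>"

example_item = {
    "url": "http://example.com/new-page",
    "body": body,
    # Set only because this hypothetical response was redirected.
    "source_url": "http://example.com/old-page",
    # Assumed order: original page first, current page last.
    "redirect_urls": [
        "http://example.com/old-page",
        "http://example.com/new-page",
    ],
    "http_status": 200,
    "content_type": "text/html; charset=utf-8",
    # Body size in bytes (this example assumes UTF-8 encoding).
    "response_size": len(body.encode("utf-8")),
    "metadata": {"language_code": "en", "language_name": "English"},
}
```

For a response that was not redirected, source_url and redirect_urls would stay at their default of None.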

When items are exported, each one produces a streamcorpus StreamItem with the following fields filled in:

  • abs_url: item.url
  • source_url: item.source_url
  • body.raw: item.body
  • body.media_type: item.content_type
  • body.language.code: item.metadata.language_code
  • body.language.name: item.metadata.language_name
  • source_metadata['redirect_urls']: item.redirect_urls
  • source_metadata['response_size']: item.response_size
  • source_metadata: also filled with all fields in item.metadata
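
The mapping above can be sketched in plain Python. This is only an illustration under stated assumptions: nested dicts stand in for both the Scrapy item and the streamcorpus StreamItem, the function name to_streamcorpus_fields is hypothetical, and the real StreamItemExporter may resolve key clashes differently.

```python
def to_streamcorpus_fields(item):
    """Return a nested dict mirroring the export mapping listed above.

    Assumption: plain dicts stand in for the Scrapy item and the
    streamcorpus StreamItem; this is not the exporter's actual code.
    """
    metadata = dict(item.get("metadata") or {})
    # Start from every metadata field, then add the two response fields.
    # Assumption: the response fields win if metadata reuses their keys.
    source_metadata = dict(metadata)
    source_metadata["redirect_urls"] = item.get("redirect_urls")
    source_metadata["response_size"] = item.get("response_size")
    return {
        "abs_url": item.get("url"),
        "source_url": item.get("source_url"),
        "body": {
            "raw": item.get("body"),
            "media_type": item.get("content_type"),
            "language": {
                "code": metadata.get("language_code"),
                "name": metadata.get("language_name"),
            },
        },
        "source_metadata": source_metadata,
    }
```

Calling it with a dict that has url, body, content_type and a metadata dict returns the nested structure, with any custom metadata keys appearing under source_metadata.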

How to use it

An example of use from a spider:

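from streamitem.items import StreamItem  # module paths as listed in the Overview
from streamitem.loaders import StreamItemLoader

def parse_page(self, response):
    loader = StreamItemLoader(item=StreamItem(), response=response)  # fields are populated from the response
    return loader.load_item()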

Settings for exporting:

FEED_URI = ".exports/streamitems.sc"
FEED_FORMAT = "streamcorpus"
FEED_EXPORTERS = {
    'streamcorpus': 'scrapylib.streamitem.exporters.StreamItemExporter',
}
FEED_STORAGES = {
    '': 'scrapylib.streamitem.storages.StreamItemFileFeedStorage',
}

You can also attach extra information to an item through its metadata field. For example, from an item pipeline:

def process_item(self, item, spider):
    item['metadata']['my_custom_field'] = 'whatever'
    return item
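
The method above would normally live inside a pipeline class. A minimal sketch follows, assuming the metadata field might not be set yet when the item reaches the pipeline; the class name CustomMetadataPipeline is hypothetical.

```python
class CustomMetadataPipeline(object):
    """Hypothetical pipeline that tags each item with a custom metadata field."""

    def process_item(self, item, spider):
        # Assumption: 'metadata' may be missing or None, so create it if needed.
        if item.get("metadata") is None:
            item["metadata"] = {}
        item["metadata"]["my_custom_field"] = "whatever"
        return item
```

Like any item pipeline, it would be enabled through Scrapy's ITEM_PIPELINES setting. On export, the custom field should then appear in source_metadata, as described in the mapping above.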

Requirements

Install

Install from PyPI using pip:

pip install scrapy-streamitem