Scrapy support for working with streamcorpus Stream Items.
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
streamitem
tests
.gitignore
.travis.yml
README.rst
requirements.txt
setup.py
tox.ini

README.rst

scrapy-streamitem

https://badge.fury.io/py/scrapy-streamitem.png https://api.travis-ci.org/scrapinghub/scrapy-streamitem.png?branch=master

Overview

Scrapy support for working with streamcorpus StreamItems.

Includes the following:

  • StreamItem: Scrapy Stream Item definition. streamitem.items.StreamItem
  • StreamItemLoader: Scrapy Itemloader for StreamItem. streamitem.loaders.StreamItemLoader
  • StreamItemExporter: Scrapy ItemExporter to .sc file. streamitem.exporters.StreamItemExporter
  • StreamItemFileFeedStorage: Scrapy FileFeedStorage to handle .sc files. streamitem.storages.StreamItemFileFeedStorage

Stream Items

Scrapy Stream Item will be populated from response with the following fields:

  • url: A string containing the URL of the response.
  • body: A string containing the body of this Response.
  • source_url: If response has been redirected, a string containing the URL of the original page. Defaults to None.
  • redirect_urls: If response has been redirected, a list containing the URLs of all the redirected pages, including the current one. Defaults to None.
  • http_status: An integer representing the HTTP status of the response. Example: 200, 404.
  • content_type: A string containing the Content-Type HTTP header of the response.
  • response_size: An integer representing the response body size in bytes.
  • metadata: A dict containing arbitrary metadata for this page.

If items are exported they will generate streamcorpus StreamItem items filling the following fields:

  • abs_url: item.url
  • source_url: item.source_url
  • body.raw: item.body
  • body.media_type: item.content_type
  • body.language.code: item.metadata.language_code
  • body.language.name: item.metadata.language_name
  • source_metadata['redirect_urls']: item.redirect_urls
  • source_metadata['response_size']: item.response_size
  • source_metadata: will be filled with all fields in item.metadata

How to use it

An example of use from a spider:

def parse_page(self, response):
    loader = StreamItemLoader(item=StreamItem(), response=response)
    return loader.load_item()

Settings for exporting:

FEED_URI = ".exports/streamitems.sc"
FEED_FORMAT = "streamcorpus"
FEED_EXPORTERS = {
    'streamcorpus': 'scrapylib.streamitem.exporters.StreamItemExporter',
}
FEED_STORAGES = {
    '': 'scrapylib.streamitem.storages.StreamItemFileFeedStorage',
}

You can also add additional info to your item using the metadata field. For example from a Item pipeline:

def process_item(self, item, spider):
     item['metadata']['my_custom_field'] = 'whatever'
     return item

Requirements

Install

using pypi:

pip install scrapy-streamitem