[MRG+1] Change Files/ImagesPipelines class attributes to instance attributes #1891

Merged (5 commits, Apr 8, 2016)
83 changes: 56 additions & 27 deletions docs/topics/media-pipeline.rst
@@ -77,30 +77,6 @@ PIL.
.. _Python Imaging Library: http://www.pythonware.com/products/pil/


Usage example
=============

In order to use a media pipeline first, :ref:`enable it
<topics-media-pipeline-enabling>`.

Then, if a spider returns a dict with the URLs key ('file_urls' or
'image_urls', for the Files or Images Pipeline respectively), the pipeline will
put the results under respective key ('files' or images').

If you prefer to use :class:`~.Item`, then define a custom item with the
necessary fields, like in this example for Images Pipeline::

import scrapy

class MyItem(scrapy.Item):

# ... other item fields ...
image_urls = scrapy.Field()
images = scrapy.Field()

If you need something more complex and want to override the custom pipeline
behaviour, see :ref:`topics-media-pipeline-override`.

.. _topics-media-pipeline-enabling:

Enabling your Media Pipeline
@@ -171,6 +147,51 @@ Where:
used). For more info see :ref:`topics-images-thumbnails`.


Usage example
=============

.. setting:: FILES_URLS_FIELD
.. setting:: FILES_RESULT_FIELD
.. setting:: IMAGES_URLS_FIELD
.. setting:: IMAGES_RESULT_FIELD

In order to use a media pipeline, first :ref:`enable it
<topics-media-pipeline-enabling>`.

Then, if a spider returns a dict with the URLs key (``file_urls`` or
``image_urls``, for the Files or Images Pipeline respectively), the pipeline will
put the results under the respective key (``files`` or ``images``).
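That contract can be sketched in plain Python (no Scrapy required; only the ``image_urls``/``images`` field names come from the defaults, the dict shapes and URLs below are illustrative):

```python
# A spider callback yields a dict carrying the URLs key; once the downloads
# finish, the pipeline stores the successful results under the results key.
def spider_output():
    return {"title": "Some page",
            "image_urls": ["http://example.com/a.jpg"]}

def complete_item(item, results):
    # mirrors the shape of item_completed: keep only successful downloads
    item["images"] = [info for ok, info in results if ok]
    return item

item = spider_output()
downloads = [(True, {"url": "http://example.com/a.jpg",
                     "path": "full/0a79.jpg",
                     "checksum": "d41d8cd9"})]
item = complete_item(item, downloads)
print(item["images"][0]["path"])  # full/0a79.jpg
```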

If you prefer to use :class:`~.Item`, then define a custom item with the
necessary fields, as in this example for the Images Pipeline::

import scrapy

class MyItem(scrapy.Item):

# ... other item fields ...
image_urls = scrapy.Field()
images = scrapy.Field()

If you want to use another field name for the URLs key or for the results key,
you can override it.

For the Files Pipeline, set the :setting:`FILES_URLS_FIELD` and/or
:setting:`FILES_RESULT_FIELD` settings::

FILES_URLS_FIELD = 'field_name_for_your_files_urls'
FILES_RESULT_FIELD = 'field_name_for_your_processed_files'

For the Images Pipeline, set the :setting:`IMAGES_URLS_FIELD` and/or
:setting:`IMAGES_RESULT_FIELD` settings::

IMAGES_URLS_FIELD = 'field_name_for_your_images_urls'
IMAGES_RESULT_FIELD = 'field_name_for_your_processed_images'

If you need something more complex and want to override the custom pipeline
behaviour, see :ref:`topics-media-pipeline-override`.


Additional features
===================

@@ -185,12 +206,14 @@ adjust this retention delay use the :setting:`FILES_EXPIRES` setting (or
:setting:`IMAGES_EXPIRES`, in case of Images Pipeline), which
specifies the delay in number of days::

# 90 days of delay for files expiration
FILES_EXPIRES = 90
# 120 days of delay for files expiration
FILES_EXPIRES = 120

# 30 days of delay for images expiration
IMAGES_EXPIRES = 30

The default value for both settings is 90 days.
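The freshness check behind these settings is simple arithmetic, roughly like the following (a standalone sketch, not the pipeline's exact code):

```python
import time

FILES_EXPIRES = 90  # retention delay in days

def needs_download(last_modified, now=None):
    """True if the stored copy is older than the retention delay."""
    now = time.time() if now is None else now
    age_days = (now - last_modified) / 60 / 60 / 24
    return age_days > FILES_EXPIRES

now = 1_700_000_000
print(needs_download(now - 10 * 86400, now))   # 10 days old: kept
print(needs_download(now - 120 * 86400, now))  # 120 days old: refetched
```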

.. _topics-images-thumbnails:

Thumbnail generation for images
@@ -249,7 +272,13 @@ For example::
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

Note: these size constraints don't affect thumbnail generation at all.
.. note::
The size constraints don't affect thumbnail generation at all.

It is possible to set just one size constraint or both. When setting both of
them, only images that satisfy both minimum sizes will be saved. For the
above example, images of sizes (105 x 105) or (105 x 200) or (200 x 105) will
all be dropped because at least one dimension is shorter than the constraint.

By default, there are no size constraints, so all images are processed.
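The drop condition reduces to a two-clause check, sketched here with the constraint values from the example above (a miniature, not Scrapy's actual method):

```python
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110

def too_small(width, height):
    # an image is dropped if either dimension misses its minimum
    return width < IMAGES_MIN_WIDTH or height < IMAGES_MIN_HEIGHT

print(too_small(105, 105))  # True: both dimensions too short
print(too_small(105, 200))  # True: width too short
print(too_small(200, 105))  # True: height too short
print(too_small(110, 110))  # False: saved
```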

27 changes: 15 additions & 12 deletions scrapy/pipelines/files.py
@@ -22,6 +22,7 @@
from twisted.internet import defer, threads

from scrapy.pipelines.media import MediaPipeline
from scrapy.settings import Settings
from scrapy.exceptions import NotConfigured, IgnoreRequest
from scrapy.http import Request
from scrapy.utils.misc import md5sum
@@ -213,19 +214,24 @@ class FilesPipeline(MediaPipeline):
"""

MEDIA_NAME = "file"
EXPIRES = 90
STORE_SCHEMES = {
'': FSFilesStore,
'file': FSFilesStore,
's3': S3FilesStore,
}
DEFAULT_FILES_URLS_FIELD = 'file_urls'
DEFAULT_FILES_RESULT_FIELD = 'files'

def __init__(self, store_uri, download_func=None):
def __init__(self, store_uri, download_func=None, settings=None):
if not store_uri:
raise NotConfigured

if isinstance(settings, dict) or settings is None:
settings = Settings(settings)

self.store = self._get_store(store_uri)
self.expires = settings.getint('FILES_EXPIRES')
self.files_urls_field = settings.get('FILES_URLS_FIELD')
self.files_result_field = settings.get('FILES_RESULT_FIELD')

super(FilesPipeline, self).__init__(download_func=download_func)

@classmethod
@@ -235,11 +241,8 @@ def from_settings(cls, settings):
s3store.AWS_SECRET_ACCESS_KEY = settings['AWS_SECRET_ACCESS_KEY']
s3store.POLICY = settings['FILES_STORE_S3_ACL']

cls.FILES_URLS_FIELD = settings.get('FILES_URLS_FIELD', cls.DEFAULT_FILES_URLS_FIELD)
cls.FILES_RESULT_FIELD = settings.get('FILES_RESULT_FIELD', cls.DEFAULT_FILES_RESULT_FIELD)
cls.EXPIRES = settings.getint('FILES_EXPIRES', 90)
store_uri = settings['FILES_STORE']
return cls(store_uri)
return cls(store_uri, settings=settings)

def _get_store(self, uri):
if os.path.isabs(uri): # to support win32 paths like: C:\\some\dir
@@ -260,7 +263,7 @@ def _onsuccess(result):

age_seconds = time.time() - last_modified
age_days = age_seconds / 60 / 60 / 24
if age_days > self.EXPIRES:
if age_days > self.expires:
return # returning None force download

referer = referer_str(request)
@@ -359,7 +362,7 @@ def inc_stats(self, spider, status):

### Overridable Interface
def get_media_requests(self, item, info):
return [Request(x) for x in item.get(self.FILES_URLS_FIELD, [])]
return [Request(x) for x in item.get(self.files_urls_field, [])]

def file_downloaded(self, response, request, info):
path = self.file_path(request, response=response, info=info)
@@ -370,8 +373,8 @@ def file_downloaded(self, response, request, info):
return checksum

def item_completed(self, results, item, info):
if isinstance(item, dict) or self.FILES_RESULT_FIELD in item.fields:
item[self.FILES_RESULT_FIELD] = [x for ok, x in results if ok]
if isinstance(item, dict) or self.files_result_field in item.fields:
item[self.files_result_field] = [x for ok, x in results if ok]
return item

def file_path(self, request, response=None, info=None):
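The motivation for this change can be shown in miniature (a standalone sketch, not Scrapy's actual classes): assigning configuration to `cls` attributes in `from_settings` mutates state shared by every instance and subclass, while assigning to `self` in `__init__` keeps each pipeline isolated.

```python
class OldStylePipeline:
    EXPIRES = 90  # class attribute, shared by all instances

    @classmethod
    def from_settings(cls, settings):
        cls.EXPIRES = settings.get("FILES_EXPIRES", 90)  # mutates the class!
        return cls()

class NewStylePipeline:
    def __init__(self, settings=None):
        settings = settings or {}
        self.expires = settings.get("FILES_EXPIRES", 90)  # per instance

a = OldStylePipeline.from_settings({"FILES_EXPIRES": 42})
b = OldStylePipeline()  # never configured, but...
print(b.EXPIRES)        # 42: the setting leaked through the class

c = NewStylePipeline({"FILES_EXPIRES": 42})
d = NewStylePipeline()
print(d.expires)        # 90: unaffected by c's configuration
```

This is exactly why the PR threads a `settings` object into `__init__` and reads `self.expires`, `self.files_urls_field`, etc. there instead of setting `cls.EXPIRES` and friends in `from_settings`.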
39 changes: 21 additions & 18 deletions scrapy/pipelines/images.py
@@ -17,6 +17,7 @@
from scrapy.utils.misc import md5sum
from scrapy.utils.python import to_bytes
from scrapy.http import Request
from scrapy.settings import Settings
from scrapy.exceptions import DropItem
#TODO: from scrapy.pipelines.media import MediaPipeline
from scrapy.pipelines.files import FileException, FilesPipeline
@@ -36,26 +37,28 @@ class ImagesPipeline(FilesPipeline):
"""

MEDIA_NAME = 'image'
MIN_WIDTH = 0
MIN_HEIGHT = 0
THUMBS = {}
DEFAULT_IMAGES_URLS_FIELD = 'image_urls'
DEFAULT_IMAGES_RESULT_FIELD = 'images'

def __init__(self, store_uri, download_func=None, settings=None):
super(ImagesPipeline, self).__init__(store_uri, settings=settings, download_func=download_func)

if isinstance(settings, dict) or settings is None:
settings = Settings(settings)

self.expires = settings.getint('IMAGES_EXPIRES')
self.images_urls_field = settings.get('IMAGES_URLS_FIELD')
self.images_result_field = settings.get('IMAGES_RESULT_FIELD')
self.min_width = settings.getint('IMAGES_MIN_WIDTH')
self.min_height = settings.getint('IMAGES_MIN_HEIGHT')
self.thumbs = settings.get('IMAGES_THUMBS')

@classmethod
def from_settings(cls, settings):
cls.MIN_WIDTH = settings.getint('IMAGES_MIN_WIDTH', 0)
cls.MIN_HEIGHT = settings.getint('IMAGES_MIN_HEIGHT', 0)
cls.EXPIRES = settings.getint('IMAGES_EXPIRES', 90)
cls.THUMBS = settings.get('IMAGES_THUMBS', {})
s3store = cls.STORE_SCHEMES['s3']
s3store.AWS_ACCESS_KEY_ID = settings['AWS_ACCESS_KEY_ID']
s3store.AWS_SECRET_ACCESS_KEY = settings['AWS_SECRET_ACCESS_KEY']

cls.IMAGES_URLS_FIELD = settings.get('IMAGES_URLS_FIELD', cls.DEFAULT_IMAGES_URLS_FIELD)
cls.IMAGES_RESULT_FIELD = settings.get('IMAGES_RESULT_FIELD', cls.DEFAULT_IMAGES_RESULT_FIELD)
store_uri = settings['IMAGES_STORE']
return cls(store_uri)
return cls(store_uri, settings=settings)

def file_downloaded(self, response, request, info):
return self.image_downloaded(response, request, info)
@@ -78,14 +81,14 @@ def get_images(self, response, request, info):
orig_image = Image.open(BytesIO(response.body))

width, height = orig_image.size
if width < self.MIN_WIDTH or height < self.MIN_HEIGHT:
if width < self.min_width or height < self.min_height:
raise ImageException("Image too small (%dx%d < %dx%d)" %
(width, height, self.MIN_WIDTH, self.MIN_HEIGHT))
(width, height, self.min_width, self.min_height))

image, buf = self.convert_image(orig_image)
yield path, image, buf

for thumb_id, size in six.iteritems(self.THUMBS):
for thumb_id, size in six.iteritems(self.thumbs):
thumb_path = self.thumb_path(request, thumb_id, response=response, info=info)
thumb_image, thumb_buf = self.convert_image(image, size)
yield thumb_path, thumb_image, thumb_buf
@@ -107,11 +110,11 @@ def convert_image(self, image, size=None):
return image, buf

def get_media_requests(self, item, info):
return [Request(x) for x in item.get(self.IMAGES_URLS_FIELD, [])]
return [Request(x) for x in item.get(self.images_urls_field, [])]

def item_completed(self, results, item, info):
if isinstance(item, dict) or self.IMAGES_RESULT_FIELD in item.fields:
item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
if isinstance(item, dict) or self.images_result_field in item.fields:
item[self.images_result_field] = [x for ok, x in results if ok]
return item

def file_path(self, request, response=None, info=None):
10 changes: 10 additions & 0 deletions scrapy/settings/default_settings.py
@@ -159,6 +159,9 @@
}

FILES_STORE_S3_ACL = 'private'
FILES_EXPIRES = 90
FILES_URLS_FIELD = 'file_urls'
FILES_RESULT_FIELD = 'files'

HTTPCACHE_ENABLED = False
HTTPCACHE_DIR = 'httpcache'
@@ -175,6 +178,13 @@

HTTPPROXY_AUTH_ENCODING = 'latin-1'

IMAGES_MIN_WIDTH = 0
IMAGES_MIN_HEIGHT = 0
IMAGES_EXPIRES = 90
IMAGES_THUMBS = {}
IMAGES_URLS_FIELD = 'image_urls'
IMAGES_RESULT_FIELD = 'images'

ITEM_PROCESSOR = 'scrapy.pipelines.ItemPipelineManager'

ITEM_PIPELINES = {}
31 changes: 30 additions & 1 deletion tests/test_pipeline_files.py
@@ -91,7 +91,7 @@ def test_file_expired(self):
patchers = [
mock.patch.object(FSFilesStore, 'stat_file', return_value={
'checksum': 'abc',
'last_modified': time.time() - (FilesPipeline.EXPIRES * 60 * 60 * 24 * 2)}),
'last_modified': time.time() - (self.pipeline.expires * 60 * 60 * 24 * 2)}),
mock.patch.object(FilesPipeline, 'get_media_requests',
return_value=[_prepare_request_object(item_url)]),
mock.patch.object(FilesPipeline, 'inc_stats', return_value=True)
@@ -183,6 +183,35 @@ class TestItem(Item):
self.assertEqual(item['stored_file'], [results[0][1]])


class FilesPipelineTestCaseCustomSettings(unittest.TestCase):

def setUp(self):
self.tempdir = mkdtemp()
self.pipeline = FilesPipeline(self.tempdir)
self.default_settings = Settings()

def tearDown(self):
rmtree(self.tempdir)

def test_expires(self):
another_pipeline = FilesPipeline.from_settings(Settings({'FILES_STORE': self.tempdir,
'FILES_EXPIRES': 42}))
self.assertEqual(self.pipeline.expires, self.default_settings.getint('FILES_EXPIRES'))
self.assertEqual(another_pipeline.expires, 42)

def test_files_urls_field(self):
another_pipeline = FilesPipeline.from_settings(Settings({'FILES_STORE': self.tempdir,
'FILES_URLS_FIELD': 'funny_field'}))
self.assertEqual(self.pipeline.files_urls_field, self.default_settings.get('FILES_URLS_FIELD'))
self.assertEqual(another_pipeline.files_urls_field, 'funny_field')

def test_files_result_field(self):
another_pipeline = FilesPipeline.from_settings(Settings({'FILES_STORE': self.tempdir,
'FILES_RESULT_FIELD': 'funny_field'}))
self.assertEqual(self.pipeline.files_result_field, self.default_settings.get('FILES_RESULT_FIELD'))
self.assertEqual(another_pipeline.files_result_field, 'funny_field')


class TestS3FilesStore(unittest.TestCase):
@defer.inlineCallbacks
def test_persist(self):
48 changes: 48 additions & 0 deletions tests/test_pipeline_images.py
@@ -205,6 +205,54 @@ class TestItem(Item):
self.assertEqual(item['stored_image'], [results[0][1]])


class ImagesPipelineTestCaseCustomSettings(unittest.TestCase):

def setUp(self):
self.tempdir = mkdtemp()
self.pipeline = ImagesPipeline(self.tempdir)
self.default_settings = Settings()

def tearDown(self):
rmtree(self.tempdir)

def test_expires(self):
another_pipeline = ImagesPipeline.from_settings(Settings({'IMAGES_STORE': self.tempdir,
'IMAGES_EXPIRES': 42}))
self.assertEqual(self.pipeline.expires, self.default_settings.getint('IMAGES_EXPIRES'))
self.assertEqual(another_pipeline.expires, 42)

def test_images_urls_field(self):
another_pipeline = ImagesPipeline.from_settings(Settings({'IMAGES_STORE': self.tempdir,
'IMAGES_URLS_FIELD': 'funny_field'}))
self.assertEqual(self.pipeline.images_urls_field, self.default_settings.get('IMAGES_URLS_FIELD'))
self.assertEqual(another_pipeline.images_urls_field, 'funny_field')

def test_images_result_field(self):
another_pipeline = ImagesPipeline.from_settings(Settings({'IMAGES_STORE': self.tempdir,
'IMAGES_RESULT_FIELD': 'funny_field'}))
self.assertEqual(self.pipeline.images_result_field, self.default_settings.get('IMAGES_RESULT_FIELD'))
self.assertEqual(another_pipeline.images_result_field, 'funny_field')

def test_min_width(self):
another_pipeline = ImagesPipeline.from_settings(Settings({'IMAGES_STORE': self.tempdir,
'IMAGES_MIN_WIDTH': 42}))
self.assertEqual(self.pipeline.min_width, self.default_settings.getint('IMAGES_MIN_WIDTH'))
self.assertEqual(another_pipeline.min_width, 42)

def test_min_height(self):
another_pipeline = ImagesPipeline.from_settings(Settings({'IMAGES_STORE': self.tempdir,
'IMAGES_MIN_HEIGHT': 42}))
self.assertEqual(self.pipeline.min_height, self.default_settings.getint('IMAGES_MIN_HEIGHT'))
self.assertEqual(another_pipeline.min_height, 42)

def test_thumbs(self):
custom_thumbs = {'small': (50, 50), 'big': (270, 270)}
another_pipeline = ImagesPipeline.from_settings(Settings({'IMAGES_STORE': self.tempdir,
'IMAGES_THUMBS': custom_thumbs}))
self.assertEqual(self.pipeline.thumbs, self.default_settings.get('IMAGES_THUMBS'))
self.assertEqual(another_pipeline.thumbs, custom_thumbs)


def _create_image(format, *a, **kw):
buf = TemporaryFile()
Image.new(*a, **kw).save(buf, format)