Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documenting Files Pipeline together with Images Pipeline #1150

Merged
merged 3 commits into from Apr 21, 2015

Conversation

@eliasdorneles
Copy link
Member

@eliasdorneles eliasdorneles commented Apr 11, 2015

Hey folks,

So, here is a proposal addressing #599 , documenting both media pipelines together.
Does this look good?

Thank you!

Both pipelines implement these features:

* Avoid re-downloading media that was downloaded recently
* Specifying where to store the files (filesystem directory, Amazon S3 bucket)

This comment has been minimized.

@dangra

dangra Apr 15, 2015
Member

I think we should refer to media and not files or images when talking in generics.

This comment has been minimized.

@kmike

kmike Apr 15, 2015
Member

I like "files" more - never understood what "media" means :)

This comment has been minimized.

@eliasdorneles

eliasdorneles Apr 15, 2015
Author Member

I think of it like this: media is more about the content and file is more about the "container/artifact".

So, we avoid re-downloading media because we're assuming the same URL would give us the same document/image/video.
We specify where to store the files because for a given media (image), we may store two files (original + thumbnail).

I'm not sure if this makes sense to someone other than me, though. :D

This pipeline also keeps an internal queue of those images which are currently
being scheduled for download, and connects those items that arrive containing
the same image, to that queue. This avoids downloading the same image more than
The pipelines also keep an internal queue of those images which are currently

This comment has been minimized.

@dangra

dangra Apr 15, 2015
Member

s/images/media urls/

being scheduled for download, and connects those items that arrive containing
the same image, to that queue. This avoids downloading the same image more than
The pipelines also keep an internal queue of those images which are currently
being scheduled for download, and connect those items that arrive containing

This comment has been minimized.

@dangra

dangra Apr 15, 2015
Member

s/items/responses/

the same image, to that queue. This avoids downloading the same image more than
The pipelines also keep an internal queue of those images which are currently
being scheduled for download, and connect those items that arrive containing
the same image to that queue. This avoids downloading the same media more than

This comment has been minimized.

@dangra

dangra Apr 15, 2015
Member

s/image/media/

`Python Imaging Library`_ (PIL) should also work in most cases, but it
is known to cause troubles in some setups, so we recommend to use `Pillow`_
instead of `PIL <Python Imaging Library>`_.
The Images Pipeline uses `Pillow`_ for thumbnailing and normalizing images to

This comment has been minimized.

@dangra

dangra Apr 15, 2015
Member

I think this paragraph should be in Using the Images pipeline section.

@dangra
Copy link
Member

@dangra dangra commented Apr 15, 2015

The rest looks good to me.

@curita
Copy link
Member

@curita curita commented Apr 15, 2015

Maybe we should rename docs/topics/images.rst to docs/topics/media.rst or something similar too.

@eliasdorneles
Copy link
Member Author

@eliasdorneles eliasdorneles commented Apr 15, 2015

Thank you for the review, @dangra -- I will update this soon.

@curita +1, I thought about it too but decided to do it as the last thing to make it easier to see the diff in the review. :)

pablohoffman added a commit that referenced this pull request Apr 21, 2015
Documenting Files Pipeline together with Images Pipeline
@pablohoffman pablohoffman merged commit e4122cd into scrapy:master Apr 21, 2015
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants
You can’t perform that action at this time.