
Document FilesPipeline.file_path and ImagesPipeline.file_path #3609

Merged
merged 1 commit into from
Jul 4, 2019

Conversation

Gallaecio
Member

@Gallaecio Gallaecio commented Feb 1, 2019

Fixes #2253

@codecov

codecov bot commented Feb 1, 2019

Codecov Report

Merging #3609 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3609      +/-   ##
==========================================
- Coverage   84.72%   84.71%   -0.02%     
==========================================
  Files         168      168              
  Lines        9460     9460              
  Branches     1407     1407              
==========================================
- Hits         8015     8014       -1     
  Misses       1188     1188              
- Partials      257      258       +1
Impacted Files Coverage Δ
scrapy/utils/trackref.py 83.78% <0%> (-2.71%) ⬇️

You can override this method to customize the download path of each file.

For example, to download all files into the ``files`` folder with their
original filenames::
Member

I think we should make it clearer that this is just an example, which works only for certain kinds of URLs: http://example.com/foo.png would be stored properly, but files like http://example.com/file?name=foo.png would all be stored as "file", overwriting each other. Possible solutions:

  • say that it is an example that works on websites where file URLs follow certain rules
  • make it more robust (but less readable) by adding a hash of the full URL to the path, in addition to the guessed file name.

A full implementation should probably also check the Content-Disposition header to get a file name.
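The second option could look something like the following stdlib-only sketch (the helper name and the files/ prefix are illustrative, not part of Scrapy's API): prepend a hash of the full URL to the guessed file name, so query-string URLs such as http://example.com/file?name=foo.png no longer collide.

```python
import hashlib
import os
from urllib.parse import urlparse

def hashed_file_path(url):
    # Hash the full URL so distinct URLs never map to the same path,
    # then append the guessed file name to keep paths human-readable.
    url_hash = hashlib.sha1(url.encode('utf-8')).hexdigest()
    name = os.path.basename(urlparse(url).path) or 'file'
    return 'files/%s-%s' % (url_hash, name)
```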

Member Author

I’ve gone with option 1, and also tried to make it obvious that the example implementation does not take subdirectories into account either.

import os
from urllib.parse import urlparse

def file_path(self, request, response, info):

Hi. I found that this definition doesn't work without assigning default values to the arguments response and info:

Suggested change
def file_path(self, request, response, info):
def file_path(self, request, response=None, info=None):

Just like in:

def file_path(self, request, response=None, info=None):
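For context on why the defaults matter: without them, Python raises a TypeError as soon as the method is called with the request alone. A minimal stdlib-only illustration (the function names are made up for the demo):

```python
def strict_path(request, response, info):
    # Requires all three arguments at every call site.
    return request

def lenient_path(request, response=None, info=None):
    # Callers may pass only the request; the rest default to None.
    return request

print(lenient_path('req'))  # fine: prints 'req'
try:
    strict_path('req')
except TypeError:
    print('strict_path needs all three arguments')
```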

Member Author

@Gallaecio Gallaecio Feb 8, 2019

I’ll look into it in detail later, but if I recall correctly those two arguments are only needed when subclasses also use certain deprecated methods. Are you using the latest Scrapy version (1.6.0)?

Member Author

Verified: the defaults are only needed to support subclasses of these pipelines that still use the deprecated file_key method.

import os
from urllib.parse import urlparse

def file_path(self, request, response, info):
    return 'files/' + os.path.basename(urlparse(request.url).path)
Member

To make it a bit clearer, could you please add some boilerplate, e.g.

# ...
class MyFilesPipeline(FilesPipeline):
    # ...
    def file_path(self, request, response, info):

Member Author

I’ve gone with the whole implementation, as it only required a few more lines.
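Put together, the documented example presumably ends up along these lines. In this sketch MyFilesPipeline and FakeRequest are stand-ins so the snippet runs standalone; in the real docs the method is overridden on a subclass of scrapy.pipelines.files.FilesPipeline and receives a scrapy.Request.

```python
import os
from urllib.parse import urlparse

class MyFilesPipeline:  # stand-in for a FilesPipeline subclass
    def file_path(self, request, response=None, info=None):
        # Keep the file's original name from the URL, under a files/ folder.
        return 'files/' + os.path.basename(urlparse(request.url).path)

class FakeRequest:
    # Minimal stand-in for scrapy.Request; only .url is needed here.
    url = 'http://example.com/images/logo.png'

print(MyFilesPipeline().file_path(FakeRequest()))  # files/logo.png
```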

@kmike
Member

kmike commented Jul 4, 2019

Looks good, thanks @Gallaecio!

@kmike kmike merged commit 4d4bd0e into scrapy:master Jul 4, 2019
Successfully merging this pull request may close these issues.

Document how to customize path for written files with media pipelines
3 participants