
Document FilesPipeline.file_path and ImagesPipeline.file_path #3609

Merged
merged 1 commit into scrapy:master on Jul 4, 2019

Conversation

@Gallaecio (Member) commented Feb 1, 2019

Fixes #2253

@codecov bot commented Feb 1, 2019

Codecov Report

Merging #3609 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3609      +/-   ##
==========================================
- Coverage   84.72%   84.71%   -0.02%     
==========================================
  Files         168      168              
  Lines        9460     9460              
  Branches     1407     1407              
==========================================
- Hits         8015     8014       -1     
  Misses       1188     1188              
- Partials      257      258       +1
Impacted Files Coverage Δ
scrapy/utils/trackref.py 83.78% <0%> (-2.71%) ⬇️

You can override this method to customize the download path of each file.

For example, to download all files into the ``files`` folder with their
original filenames::
@kmike (Member) commented Feb 6, 2019

I think we should be clearer that this is just an example, which works for a certain kind of URL: http://example.com/foo.png would be stored properly, but all files like http://example.com/file?name=foo.png would be stored as "file", overwriting each other. Possible solutions:

  • say that it is an example, which works on websites where file URLs follow certain rules
  • make it more robust (but less readable) by adding full URL hash to the path, in addition to guessed file name.

A full implementation should probably also check Content-Disposition header, to get a file name.
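The hash-based option above can be sketched as follows. This is an illustrative standalone helper, not the example that was merged; the function name, the `'file'` fallback, and the sha1 truncation length are my own choices:

```python
import hashlib
import os
from urllib.parse import urlparse


def file_path(request_url):
    """Return a collision-resistant download path for a file URL."""
    # Guess a file name from the URL path; it may be empty or ambiguous
    # (e.g. every http://example.com/file?name=... URL yields "file").
    name = os.path.basename(urlparse(request_url).path) or 'file'
    # Hash the full URL (query string included) so distinct URLs that
    # share a guessed name do not overwrite each other.
    url_hash = hashlib.sha1(request_url.encode()).hexdigest()[:10]
    return 'files/{}-{}'.format(url_hash, name)
```

With this variant, `http://example.com/file?name=foo.png` and `http://example.com/file?name=bar.png` map to different paths, at the cost of less readable file names.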

@Gallaecio (Member, Author) commented Feb 11, 2019

I’ve gone with option 1, and also tried to make it obvious that the example implementation does not take subdirectories into account either.

import os
from urllib.parse import urlparse

def file_path(self, request, response, info):
@schjoq commented Feb 8, 2019

Hi. I found this definition doesn’t work without assigning default values to the keyword arguments response and info:

Suggested change
def file_path(self, request, response, info):
def file_path(self, request, response=None, info=None):

Just like in:

def file_path(self, request, response=None, info=None):

@Gallaecio (Member, Author) commented Feb 8, 2019

I’ll look into it in detail later, but if I recall correctly those two parameters are only needed when also using certain deprecated methods in subclasses. Are you using the latest Scrapy version (1.6.0)?

@Gallaecio (Member, Author) commented Feb 11, 2019

Verified: the default values are only needed to support subclasses of these pipelines that still use the deprecated file_key method.

@Gallaecio Gallaecio force-pushed the 2253 branch 3 times, most recently from f2f3234 to 0802375 Feb 11, 2019
@Gallaecio Gallaecio closed this Feb 12, 2019
@Gallaecio Gallaecio reopened this Feb 12, 2019
from urllib.parse import urlparse

def file_path(self, request, response, info):
    return 'files/' + os.path.basename(urlparse(request.url).path)
@kmike (Member) commented Mar 22, 2019

To make it a bit clearer, could you please add some boilerplate, e.g.

# ...
class MyFilesPipeline(FilesPipeline):
    # ...
    def file_path(self, request, response, info):

@Gallaecio (Member, Author) commented Mar 26, 2019

I’ve gone with the whole implementation, as it only required a few more lines.
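For readers of this thread, the shape of such a whole implementation (the file_path override placed inside a FilesPipeline subclass, using the MyFilesPipeline boilerplate suggested above) is roughly the following sketch; the authoritative version is the one in the merged Scrapy docs:

```python
import os
from urllib.parse import urlparse

try:
    from scrapy.pipelines.files import FilesPipeline
except ImportError:  # keep the sketch importable without Scrapy installed
    FilesPipeline = object


class MyFilesPipeline(FilesPipeline):
    """Store every downloaded file under files/ with its URL file name."""

    def file_path(self, request, response=None, info=None):
        # Note: as discussed above, this keeps only the base name, so
        # subdirectories are ignored and same-named files collide.
        return 'files/' + os.path.basename(urlparse(request.url).path)
```

The pipeline would then be enabled the usual way, via the ITEM_PIPELINES setting in the project configuration.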

@kmike kmike added the docs label Jul 4, 2019
@kmike kmike added this to the v1.7 milestone Jul 4, 2019
@kmike (Member) commented Jul 4, 2019

Looks good, thanks @Gallaecio!

@kmike kmike merged commit 4d4bd0e into scrapy:master Jul 4, 2019
3 checks passed