Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support defining file path based on item in media pipelines #4686

Merged
merged 20 commits into from Aug 11, 2020

Conversation

ajaymittur28
Copy link
Contributor

@ajaymittur28 ajaymittur28 commented Jul 18, 2020

Closes #4628, closes #4677

The existing media pipelines (FilesPipeline and ImagesPipeline) do not allow the user to dynamically define downloaded file location based on the item returned by the spider, thus, forcing the users to use other workarounds like using the Request.meta to pass item data or creating their own Request subclass with the item passed to it.

I've updated the media pipelines to pass item as a parameter to the file_path function and have updated the same in the tests as well. This will now allow the user to define file paths based on the item.

@codecov
Copy link

codecov bot commented Jul 18, 2020

Codecov Report

Merging #4686 into master will increase coverage by 0.54%.
The diff coverage is 96.07%.

@@            Coverage Diff             @@
##           master    #4686      +/-   ##
==========================================
+ Coverage   86.18%   86.72%   +0.54%     
==========================================
  Files         160      160              
  Lines        9632     9703      +71     
  Branches     1414     1424      +10     
==========================================
+ Hits         8301     8415     +114     
+ Misses       1071     1025      -46     
- Partials      260      263       +3     
Impacted Files Coverage Δ
scrapy/pipelines/images.py 91.81% <83.33%> (ø)
scrapy/pipelines/media.py 97.16% <97.29%> (-0.14%) ⬇️
scrapy/pipelines/files.py 60.64% <100.00%> (ø)
scrapy/utils/misc.py 95.53% <0.00%> (-1.79%) ⬇️
scrapy/exporters.py 100.00% <0.00%> (ø)
scrapy/utils/curl.py 100.00% <0.00%> (ø)
scrapy/spiders/__init__.py 100.00% <0.00%> (ø)
scrapy/settings/default_settings.py 98.73% <0.00%> (+<0.01%) ⬆️
scrapy/settings/__init__.py 92.98% <0.00%> (+0.04%) ⬆️
scrapy/utils/conf.py 93.13% <0.00%> (+0.06%) ⬆️
... and 7 more

@ajaymittur28 ajaymittur28 changed the title [WIP] Support defining file path based on item in media pipelines Support defining file path based on item in media pipelines Jul 19, 2020
@ajaymittur28 ajaymittur28 marked this pull request as ready for review Jul 19, 2020
@wRAR
Copy link
Contributor

wRAR commented Jul 21, 2020

I would leave these tests as is and added a new test, or several if needed, to check that the item is indeed now available in the method.

@ajaymittur28
Copy link
Contributor Author

ajaymittur28 commented Jul 22, 2020

I've added a test to ensure the file path can now be generated from the item passed to file_path, and have also added the item parameter to media_to_download and media_downloaded methods of test_pipeline_media just to make sure the parameters are consistent with the updated interfaces in pipelines.media. Their implementation hasn't been changed.

wRAR
wRAR approved these changes Jul 27, 2020
@Gallaecio
Copy link
Member

Gallaecio commented Jul 30, 2020

@ajaymittur28 You have done a great work so far.

However, before we merge these changes, we should address backward incompatibility. If a subclass is not using the new argument in its file_path method signature, things should still work as before. And in that case a deprecation warning should be logged.

If you look for inspect usage in the code, you should be able to find some examples on how to do this. I believe even in related code we are already doing something like this.

@ajaymittur28
Copy link
Contributor Author

ajaymittur28 commented Jul 31, 2020

I can check if the user is using the deprecated version of file_path method or not using inspect and log a deprecation warning. I however couldn't think of a way to make the file_path method the user may be using compatible with the new item parameter, other than using try...except blocks wherever the call to file_path is made and decide whether to pass item or not. Do you think there is a better way to do this?

@ajaymittur28
Copy link
Contributor Author

ajaymittur28 commented Aug 1, 2020

I've added a wrapper method _file_path in MediaPipeline to pass parameters consistent with the user defined file_path method in the subclass and address the backwards incompatibility issue. I also added a deprecation warning for the method if the user defined method is using the previous signature.

I am however unsure of the deprecation message and if I should place them in FilesPipeline and ImagesPipeline like I have currently or in their parent class MediaPipeline. Could you have a look? @Gallaecio

try:
file_path_sig.parameters['item']
except KeyError:
warn('file_path(self, request, response=None, info=None) is deprecated, '
'please use file_path(self, request, response=None, info=None, item=None)',
ScrapyDeprecationWarning, stacklevel=2)
Copy link
Member

@Gallaecio Gallaecio Aug 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💄 if 'item' not in file_path_sig.parameters: would be simpler.

tests/test_pipeline_files.py Outdated Show resolved Hide resolved
warn('file_path(self, request, response=None, info=None) is deprecated, '
'please use file_path(self, request, response=None, info=None, item=None)',
ScrapyDeprecationWarning, stacklevel=2)

Copy link
Member

@Gallaecio Gallaecio Aug 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move it to the parent class.

I’m thinking:

  • It would avoid this duplication.

  • It would allow to set a private class variable (e.g. _file_path_expects_item = True/False) which could be used by _file_path, to avoid 1 call when item is not expected, and to only perform a call without item when needed (currently an unrelated TypeError in a file_path implementation could trigger that call as well).

Copy link
Contributor Author

@ajaymittur28 ajaymittur28 Aug 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unsure about doing this in the parent class since it doesn't have any declaration of file_path or any usage other than _file_path, should I add an empty declaration of file_path under "Overridable Interfaces" or is it not needed?

UPD: It looks like I have to add an empty file_path interface to the parent class MediaPipeline since test_pipeline_media will fail the tests otherwise.

Copy link
Member

@Gallaecio Gallaecio left a comment

We probably need to follow and approach similar to that of file_path for other overridable methods: media_to_download, media_downloaded, file_downloaded, image_downloaded, get_images.

scrapy/pipelines/media.py Show resolved Hide resolved
@ajaymittur28
Copy link
Contributor Author

ajaymittur28 commented Aug 5, 2020

According to the docs it looks like only file_path out of the overridable interfaces is exposed to the user to modify, so I don't think the backward compatibility issue can occur in the other methods?
I've updated the signatures of the methods you have mentioned to pass item to file_path, they work with both the current (master branch) and new signature of file_path

@Gallaecio
Copy link
Member

Gallaecio commented Aug 5, 2020

Those methods are indeed undocumented, but they are not private, and there’s a chance subclasses of the file or image classes are overriding them.

Also mind that we documented file_path in 1.7.0 because an issue about it was opened: #2253. But it has existed for a while; those other methods should probably be documented as well eventually.

@ajaymittur28
Copy link
Contributor Author

ajaymittur28 commented Aug 5, 2020

Right, makes sense, thanks for clarifying!

I'm thinking we can then generalize the file_path wrapper _file_path for the other overridable interfaces whose signatures have been modified which can be done easily using try....except block like how I had done it initially but this has the drawback of making 1 extra call each time, however I feel using a flag for each of the methods and comparing the flag with corresponding method may be a bit cumbersome.

How do you think we should go about it?

@Gallaecio
Copy link
Member

Gallaecio commented Aug 5, 2020

Maybe a flag dictionary, where keys are the names of the methods which may or may not expect item?

Copy link
Member

@Gallaecio Gallaecio left a comment

First off: Amazing work! The wrapper approach looks great.

However, method = self._compatible(method) will not do what you think it does. You can tell from the coverage data, which shows that the implementation of wrapper is never executed. setattr will probably do the job, though.

This also means that we need to make sure that the tests verify the method execution. Currently they pass because they just check that the warnings are issued.

@ajaymittur28
Copy link
Contributor Author

ajaymittur28 commented Aug 6, 2020

setattr did the job, thanks!

I replaced the individual tests I made with a test case using a mocked ImagesPipeline class since it allows us to test all the methods you mentioned without having to create individual cases for each by simulating the pipeline and lets us ensure the methods were indeed called. I had to override get_media_requests and inc_stats to prevent the pipeline from failing since crawler is None while running the test.

I also realized that item may be passed as a positional argument and hence kwargs.pop('item', None) may not be enough.
To handle this I've added a check on the last parameter of args and determine if it should be removed or not. But, this may cause errors in the future if more parameters are added to the methods and that's why I've made (and passed) item as a keyworded argument. (item can be handled irrespective of position)

What do you think?

scrapy/pipelines/media.py Outdated Show resolved Hide resolved
scrapy/pipelines/media.py Show resolved Hide resolved
Copy link
Member

@Gallaecio Gallaecio left a comment

Looks great to me. Thank you for your hard work and perseverance!

scrapy/pipelines/media.py Outdated Show resolved Hide resolved
scrapy/pipelines/media.py Outdated Show resolved Hide resolved
wRAR
wRAR approved these changes Aug 11, 2020
@Gallaecio Gallaecio merged commit 1c4b4cc into scrapy:master Aug 11, 2020
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pass item in FilesPipeline method Support choosing a file path based on item data in media pipelines
3 participants