Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add new pipeline methods to get file/image/thumbnail paths #490

Merged
merged 12 commits into from Dec 23, 2013

Conversation

@max-arnold
Copy link
Contributor

@max-arnold max-arnold commented Dec 17, 2013

This change allows to pass request, response and spider context to filename construction methods. As discussed per #370

I would like to add deprecation warnings but I'm not sure it is possible to do this cleanly. In the next commit I'll try to implement that.

max-arnold added 2 commits Dec 17, 2013
This change allows to pass request, response and spider context to filename
construction methods.
@dangra
Copy link
Member

@dangra dangra commented Dec 17, 2013

Nice deprecation tricks, it's looking good.

Only rest updating tests under scrapy.tests.test_pipeline_files and scrapy.tests.test_pipeline_images,
testing the deprecated methods work as expected is also important.

category=ScrapyDeprecationWarning, stacklevel=1)

# check if called from file_key with url as first argument
if isinstance(request, unicode) or isinstance(request, str):

This comment has been minimized.

@dangra

dangra Dec 17, 2013
Member

hey @kmike, sorry for summoning you. I wanted to ask if you can review for Python 3 forward compatibility before merging.

This comment has been minimized.

@kmike

kmike Dec 17, 2013
Member

Hey @dangra!

We'll have to review all occurrences of str/unicode to add python3 support; this one would be likely replaced with isinstance(six.string_types). So this is not a problem. Checking for Request instead of checking for str/unicode would make porting a bit easier, but that's fine not to worry about that now.

@max-arnold
Copy link
Contributor Author

@max-arnold max-arnold commented Dec 17, 2013

In theory (there is no tests yet), deprecation should work both ways - if someone calls old file_key method, it will call file_path with string as first argument and that will be detected with appropriate warning. If someone reimplements file_key method, new code will check for this and also will issue warning.

All the extra warning blocks and old methods can be removed after deprecation period.

@max-arnold
Copy link
Contributor Author

@max-arnold max-arnold commented Dec 17, 2013

Existing tests passed successfully on my machine using bin/runtests.sh (except skipped Amazon S3). Tox tests have failed for some unrelated reason.

I'll try to add more tests to cover new code paths, but you have to wait for some time (I'm slow at writing tests). It would be great to get some help.

@dangra
Copy link
Member

@dangra dangra commented Dec 17, 2013

In theory (there is no tests yet), deprecation should work both ways - if someone calls old file_key method, it will call file_path with string as first argument and that will be detected with appropriate warning. If someone reimplements file_key method, new code will check for this and also will issue warning.
All the extra warning blocks and old methods can be removed after deprecation period.

This deprecation will last for a few releases before we can remove it.
I'm willing to see some tests that covers the new methods and ensure old ones are called if defined. Testing the warning message is a (nice to have) plus.

I think the test case must extend FilesPipeline and mock FSFilesStore similar to how FilesPipelineTestCase.test_file_expired() does. But do not mock FilesMediapipeline methods, on the FSFilesStore methods if possible.

@max-arnold
Copy link
Contributor Author

@max-arnold max-arnold commented Dec 18, 2013

In my opinion that should be enough.

@dangra
Copy link
Member

@dangra dangra commented Dec 18, 2013

@max-arnold: you rock. thanks!

It looks great to me, let's merge after travis-ci reports the success build of last commit.

@max-arnold
Copy link
Contributor Author

@max-arnold max-arnold commented Dec 18, 2013

I tried to use this new functionality in my code and created custom pipeline subclass with the following method (note the absence of response and info keyword arguments):

    def file_path(self, request):
         pass

Method invocation blows up with TypeError: "file_path() got an unexpected keyword argument 'info'" but this exception is completely silenced somewhere up the stack (probably in utils/defer.py mustbe_deferred()). Is that intended behavior?

@dangra
Copy link
Member

@dangra dangra commented Dec 18, 2013

Is that intended behavior?

No, it isn't.

@dangra
Copy link
Member

@dangra dangra commented Dec 20, 2013

LGTM

anyone else wants to take a look @pablohoffman @kmike @nramirezuy ?

@max-arnold
Copy link
Contributor Author

@max-arnold max-arnold commented Dec 20, 2013

I see one possible downside of this patch. In FilesPipeline.media_to_download() there is file_path() invocation with missing response argument:

        path = self.file_path(request, info=info)
        dfd = defer.maybeDeferred(self.store.stat_file, path, info)
        dfd.addCallbacks(_onsuccess, lambda _: None)
        dfd.addErrback(log.err, self.__class__.__name__ + '.store.stat_file')
        return dfd

So if someone wants to modify file_path behavior depending on response, he is not able to do this reliably. Sometimes file_path() will be called with response argument and sometimes without. This makes it not very useful.

@max-arnold
Copy link
Contributor Author

@max-arnold max-arnold commented Dec 20, 2013

Always having response instance could be useful to inspect Content-Disposition (and other server headers) to determine file name.

@dangra
Copy link
Member

@dangra dangra commented Dec 20, 2013

I don't see it as a downside of this patch, it is like that because media_to_download hook is there to prevent redownloads. If the key is based on any response details then you have no chance other than downloading it everytime.

for this reason response argument is optional and defaults to None in file_path signature.

We can improve it a bit (in another PR) by accepting a Deferred as file_path() result. In this case, file_path() can HEAD the url looking for Content-Disposition when response is None. (just an idea)

@max-arnold
Copy link
Contributor Author

@max-arnold max-arnold commented Dec 20, 2013

Ok, this is reasonable. Waiting for the patch being merged :)

dangra added a commit that referenced this pull request Dec 23, 2013
add new pipeline methods to get file/image/thumbnail paths
@dangra dangra merged commit 2c2ce20 into scrapy:master Dec 23, 2013
1 check passed
1 check passed
default The Travis CI build passed
Details
@max-arnold
Copy link
Contributor Author

@max-arnold max-arnold commented Dec 24, 2013

Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants