
Disallow media extensions unregistered with IANA #3954

Merged · 6 commits · Sep 16, 2019

Conversation

@OmarFarrag (Contributor) commented Aug 13, 2019

If a media extension is an empty string or is not registered with IANA, try to guess the MIME type from the URL, and then the extension from the MIME type, using the built-in mimetypes library.

Fixes #3953, fixes #1287
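The idea can be sketched as a small standalone function (a minimal sketch of the approach described above, not the exact Scrapy code; the function name is illustrative):

```python
import mimetypes
import os


def guess_media_extension(url):
    """Return a file extension for a URL, falling back to MIME-type
    guessing when the raw extension is empty or not IANA-registered."""
    media_ext = os.path.splitext(url)[1]
    # mimetypes.types_map only contains extensions with registered MIME types
    if media_ext not in mimetypes.types_map:
        media_ext = ""
        media_type = mimetypes.guess_type(url)[0]
        if media_type:
            media_ext = mimetypes.guess_extension(media_type)
    return media_ext


print(guess_media_extension("http://example.com/doc.pdf"))   # .pdf
print(guess_media_extension("http://example.com/page.foo"))  # empty string
```

An unregistered extension such as `.foo` produces no MIME-type guess either, so it falls through to the empty string rather than being stored verbatim.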

@OmarFarrag changed the title from "Disallow media extensions unregistered with IANA .. Fixes #3953" to "Disallow media extensions unregistered with IANA" on Aug 13, 2019
@codecov (bot) commented Aug 14, 2019

Codecov Report

Merging #3954 into master will increase coverage by 0.08%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master    #3954      +/-   ##
==========================================
+ Coverage   85.35%   85.44%   +0.08%     
==========================================
  Files         167      167              
  Lines        9699     9738      +39     
  Branches     1453     1458       +5     
==========================================
+ Hits         8279     8321      +42     
+ Misses       1162     1159       -3     
  Partials      258      258
Impacted Files Coverage Δ
scrapy/pipelines/files.py 66.16% <85.71%> (+0.78%) ⬆️
scrapy/exporters.py 100% <0%> (ø) ⬆️
scrapy/downloadermiddlewares/robotstxt.py 100% <0%> (ø) ⬆️
scrapy/settings/default_settings.py 98.7% <0%> (+0.01%) ⬆️
scrapy/http/request/json_request.py 94.11% <0%> (+0.36%) ⬆️
scrapy/core/downloader/contextfactory.py 96.66% <0%> (+0.51%) ⬆️
scrapy/robotstxt.py 97.36% <0%> (+0.64%) ⬆️
scrapy/utils/ssl.py 53.65% <0%> (+1.15%) ⬆️
scrapy/contracts/default.py 88% <0%> (+1.63%) ⬆️
... and 1 more

Co-Authored-By: s-sanjay <sanjay537@gmail.com>
@Gallaecio Gallaecio merged commit 13735bc into scrapy:master Sep 16, 2019
@OmarFarrag OmarFarrag deleted the #3953 branch September 16, 2019 13:22
@s-sanjay (Contributor) commented Sep 22, 2019

@elacuesta @Gallaecio @OmarFarrag I've found more instances of .keys() in the codebase and made a small PR, #4031, to address part of it.

@dankeil commented Dec 6, 2019

This works well for a lot of URL patterns, but it doesn't seem to fully address #1287.

For example, the URL "http://foo.bar/baz.pdf?v=22" should ideally return the path "full/0cb1b37dce389b0fcb5092d271c982269be26c8a.pdf", but it returns "full/0cb1b37dce389b0fcb5092d271c982269be26c8a" with the current v1.8.0 code.

One solution would be to apply the fix in #1548 on top of this, as I did in the patch below. That worked well for me.

diff --git a/scrapy/pipelines/files.py b/scrapy/pipelines/files.py
index 6d55c898..75556b98 100644
--- a/scrapy/pipelines/files.py
+++ b/scrapy/pipelines/files.py
@@ -20,6 +20,7 @@ from scrapy.pipelines.media import MediaPipeline
 from scrapy.settings import Settings
 from scrapy.exceptions import NotConfigured, IgnoreRequest
 from scrapy.http import Request
+from scrapy.utils.httpobj import urlparse_cached
 from scrapy.utils.misc import md5sum
 from scrapy.utils.log import failure_to_exc_info
 from scrapy.utils.python import to_bytes
@@ -469,6 +470,10 @@ class FilesPipeline(MediaPipeline):
     def file_path(self, request, response=None, info=None):
         media_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
         media_ext = os.path.splitext(request.url)[1]
+        # if result is a querystring fragment, ignore url params and use path only
+        if not media_ext[1:].isalnum():
+            media_base_url = urlparse_cached(request).path
+            media_ext = os.path.splitext(media_base_url)[1]
         # Handles empty and wild extensions by trying to guess the
         # mime type then extension or default to empty string otherwise
         if media_ext not in mimetypes.types_map:
diff --git a/tests/test_pipeline_files.py b/tests/test_pipeline_files.py
index 52f2b554..7c1cc63c 100644
--- a/tests/test_pipeline_files.py
+++ b/tests/test_pipeline_files.py
@@ -58,6 +58,18 @@ class FilesPipelineTestCase(unittest.TestCase):
         self.assertEqual(file_path(Request("data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAR0AAACxCAMAAADOHZloAAACClBMVEX/\
                                     //+F0tzCwMK76ZKQ21AMqr7oAAC96JvD5aWM2kvZ78J0N7fmAAC46Y4Ap7y")),
                          'full/178059cbeba2e34120a67f2dc1afc3ecc09b61cb.png')
+        self.assertEqual(file_path(Request("http://foo.bar/baz.pdf?v=22")),
+                          'full/0cb1b37dce389b0fcb5092d271c982269be26c8a.pdf')
+        self.assertEqual(file_path(Request("http://localhost:8050/render.png?url=http://www.test.ca&timeout=30wait=3")),
+                          'full/1691f03855fb23bc1e3be2618889a8d0d7ce15f8.png')
+        self.assertEqual(file_path(Request("http://foo.bar/baz.txt?fizz")),
+                          'full/a2b4913a62f65445aeae2bac08cd8c3b41d7195e.txt')
+        self.assertEqual(file_path(Request("http://foo.bar/baz.mp3?fizz")),
+                          'full/e395e5dd4ea00d7440bd7c4e42576ff71c1c7bca.mp3')
+        self.assertEqual(file_path(Request("http://foo.bar/baz?img=fizz.mp3")),
+                          'full/768d1719ad5fa6b6919bbbd65388fae36a3820d4.mp3')
+        self.assertEqual(file_path(Request("http://foo.bar/baz.php?img=fizz.mp3")),
+                          'full/04435a0f665ba33cc2257737d7f2f6c27ea5f2d4.mp3')


     def test_fs_store(self):
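To illustrate why the urlparse step in the patch matters (a standalone sketch, separate from the pipeline code): splitting the raw URL keeps the query string attached to the "extension", while splitting only the URL path yields the real one.

```python
import os
from urllib.parse import urlparse

url = "http://foo.bar/baz.pdf?v=22"

# Splitting the raw URL leaves the query string in the extension
raw_ext = os.path.splitext(url)[1]                   # '.pdf?v=22'

# Splitting only the URL path gives the real extension
path_ext = os.path.splitext(urlparse(url).path)[1]   # '.pdf'

print(raw_ext, path_ext)
```

This is also why the patch gates on `media_ext[1:].isalnum()`: `pdf?v=22` is not alphanumeric, so the path-only fallback kicks in.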

@Gallaecio (Member) commented:

#1548 is not the only pull request that addresses this. There’s also at least #2809 and, more recently, #3817 (which also checks response headers when guessing the right extension).
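Guessing the extension from response headers, as mentioned above, could look roughly like this (a hypothetical sketch of the idea, not the actual code from #3817; the function name and header handling are illustrative):

```python
import mimetypes


def ext_from_content_type(headers):
    """Guess a file extension from a Content-Type response header."""
    content_type = headers.get("Content-Type", "")
    # Strip parameters such as "; charset=utf-8" before the lookup
    media_type = content_type.split(";")[0].strip()
    return mimetypes.guess_extension(media_type) or ""


print(ext_from_content_type({"Content-Type": "application/pdf"}))  # .pdf
```

The advantage over URL-based guessing is that it works even when the URL carries no usable extension at all (e.g. `/download?id=42`).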

@dankeil commented Dec 9, 2019

Thank you for pointing those pull requests out. I will be sure to reference them when developing my own solution to this problem.

Before that, though, I would like to know what your preferred process is for handling the fact that this pull request does not fully address issue #1287. Do you ever reopen tickets once they are closed, or would you prefer for me to open a new ticket for the remaining unhandled URLs?

@Gallaecio (Member) commented:

I guess you can create a new ticket.

Successfully merging this pull request may close these issues.

- OSError when downloading a very long url
- Files Pipeline fails on URLs with "?"/get attributes
5 participants