
Disallow media extensions unregistered with IANA #3954

Merged · 6 commits · Sep 16, 2019

Conversation

@OmarFarrag (Contributor) commented Aug 13, 2019

If a media extension is an empty string or is not registered with IANA, try to guess the MIME type from the URL, and then the extension from the MIME type, using the built-in mimetypes library.

Fixes #3953, fixes #1287
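The idea can be sketched as a small standalone function (a minimal sketch of the approach described above, not the exact Scrapy code; the function name is illustrative):

```python
import mimetypes
import os


def guess_media_extension(url):
    """Return a file extension for a URL, falling back to MIME-type
    guessing when the raw extension is empty or not IANA-registered."""
    media_ext = os.path.splitext(url)[1]
    # mimetypes.types_map only contains extensions with registered MIME types
    if media_ext not in mimetypes.types_map:
        media_ext = ""
        media_type = mimetypes.guess_type(url)[0]
        if media_type:
            media_ext = mimetypes.guess_extension(media_type)
    return media_ext


print(guess_media_extension("http://example.com/doc.pdf"))   # .pdf
print(guess_media_extension("http://example.com/page.foo"))  # empty string
```

An unregistered extension such as `.foo` produces no MIME-type guess either, so it falls through to the empty string rather than being stored verbatim.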

@OmarFarrag changed the title from "Disallow media extensions unregistered with IANA .. Fixes #3953" to "Disallow media extensions unregistered with IANA" on Aug 13, 2019
@codecov (bot) commented Aug 14, 2019

Codecov Report

Merging #3954 into master will increase coverage by 0.08%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master    #3954      +/-   ##
==========================================
+ Coverage   85.35%   85.44%   +0.08%     
==========================================
  Files         167      167              
  Lines        9699     9738      +39     
  Branches     1453     1458       +5     
==========================================
+ Hits         8279     8321      +42     
+ Misses       1162     1159       -3     
  Partials      258      258
Impacted Files Coverage Δ
scrapy/pipelines/files.py 66.16% <85.71%> (+0.78%) ⬆️
scrapy/exporters.py 100% <0%> (ø) ⬆️
scrapy/downloadermiddlewares/robotstxt.py 100% <0%> (ø) ⬆️
scrapy/settings/default_settings.py 98.7% <0%> (+0.01%) ⬆️
scrapy/http/request/json_request.py 94.11% <0%> (+0.36%) ⬆️
scrapy/core/downloader/contextfactory.py 96.66% <0%> (+0.51%) ⬆️
scrapy/robotstxt.py 97.36% <0%> (+0.64%) ⬆️
scrapy/utils/ssl.py 53.65% <0%> (+1.15%) ⬆️
scrapy/contracts/default.py 88% <0%> (+1.63%) ⬆️
... and 1 more

Co-Authored-By: s-sanjay <sanjay537@gmail.com>
@Gallaecio Gallaecio merged commit 13735bc into scrapy:master Sep 16, 2019
@OmarFarrag OmarFarrag deleted the #3953 branch September 16, 2019 13:22
@s-sanjay (Contributor) commented Sep 22, 2019

@elacuesta @Gallaecio @OmarFarrag I've found more instances of .keys() in the codebase and made a small PR, #4031, to address part of it.

@dankeil commented Dec 6, 2019

This works well for a lot of URL patterns, but it doesn't seem to fully address #1287.

For example, the URL "http://foo.bar/baz.pdf?v=22" should ideally return the path "full/0cb1b37dce389b0fcb5092d271c982269be26c8a.pdf", but it returns "full/0cb1b37dce389b0fcb5092d271c982269be26c8a" with the current v1.8.0 code.

One solution would be to apply the fix in #1548 on top of this, as I did in the patch below. That worked well for me.

diff --git a/scrapy/pipelines/files.py b/scrapy/pipelines/files.py
index 6d55c898..75556b98 100644
--- a/scrapy/pipelines/files.py
+++ b/scrapy/pipelines/files.py
@@ -20,6 +20,7 @@ from scrapy.pipelines.media import MediaPipeline
 from scrapy.settings import Settings
 from scrapy.exceptions import NotConfigured, IgnoreRequest
 from scrapy.http import Request
+from scrapy.utils.httpobj import urlparse_cached
 from scrapy.utils.misc import md5sum
 from scrapy.utils.log import failure_to_exc_info
 from scrapy.utils.python import to_bytes
@@ -469,6 +470,10 @@ class FilesPipeline(MediaPipeline):
     def file_path(self, request, response=None, info=None):
         media_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
         media_ext = os.path.splitext(request.url)[1]
+        # if result is a querystring fragment, ignore url params and use path only
+        if not media_ext[1:].isalnum():
+            media_base_url = urlparse_cached(request).path
+            media_ext = os.path.splitext(media_base_url)[1]
         # Handles empty and wild extensions by trying to guess the
         # mime type then extension or default to empty string otherwise
         if media_ext not in mimetypes.types_map:
diff --git a/tests/test_pipeline_files.py b/tests/test_pipeline_files.py
index 52f2b554..7c1cc63c 100644
--- a/tests/test_pipeline_files.py
+++ b/tests/test_pipeline_files.py
@@ -58,6 +58,18 @@ class FilesPipelineTestCase(unittest.TestCase):
         self.assertEqual(file_path(Request("data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAR0AAACxCAMAAADOHZloAAACClBMVEX/\
                                     //+F0tzCwMK76ZKQ21AMqr7oAAC96JvD5aWM2kvZ78J0N7fmAAC46Y4Ap7y")),
                          'full/178059cbeba2e34120a67f2dc1afc3ecc09b61cb.png')
+        self.assertEqual(file_path(Request("http://foo.bar/baz.pdf?v=22")),
+                          'full/0cb1b37dce389b0fcb5092d271c982269be26c8a.pdf')
+        self.assertEqual(file_path(Request("http://localhost:8050/render.png?url=http://www.test.ca&timeout=30wait=3")),
+                          'full/1691f03855fb23bc1e3be2618889a8d0d7ce15f8.png')
+        self.assertEqual(file_path(Request("http://foo.bar/baz.txt?fizz")),
+                          'full/a2b4913a62f65445aeae2bac08cd8c3b41d7195e.txt')
+        self.assertEqual(file_path(Request("http://foo.bar/baz.mp3?fizz")),
+                          'full/e395e5dd4ea00d7440bd7c4e42576ff71c1c7bca.mp3')
+        self.assertEqual(file_path(Request("http://foo.bar/baz?img=fizz.mp3")),
+                          'full/768d1719ad5fa6b6919bbbd65388fae36a3820d4.mp3')
+        self.assertEqual(file_path(Request("http://foo.bar/baz.php?img=fizz.mp3")),
+                          'full/04435a0f665ba33cc2257737d7f2f6c27ea5f2d4.mp3')


     def test_fs_store(self):
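To illustrate why the urlparse step in the patch matters (a standalone sketch, separate from the pipeline code): splitting the raw URL keeps the query string attached to the "extension", while splitting only the URL path yields the real one.

```python
import os
from urllib.parse import urlparse

url = "http://foo.bar/baz.pdf?v=22"

# Splitting the raw URL leaves the query string in the extension
raw_ext = os.path.splitext(url)[1]                   # '.pdf?v=22'

# Splitting only the URL path gives the real extension
path_ext = os.path.splitext(urlparse(url).path)[1]   # '.pdf'

print(raw_ext, path_ext)
```

This is also why the patch gates on `media_ext[1:].isalnum()`: `pdf?v=22` is not alphanumeric, so the path-only fallback kicks in.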

@Gallaecio (Member) commented:

#1548 is not the only pull request that addresses this. There’s also at least #2809 and, more recently, #3817 (which also checks response headers when guessing the right extension).
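Guessing the extension from response headers, as mentioned above, could look roughly like this (a hypothetical sketch of the idea, not the actual code from #3817; the function name and header handling are illustrative):

```python
import mimetypes


def ext_from_content_type(headers):
    """Guess a file extension from a Content-Type response header."""
    content_type = headers.get("Content-Type", "")
    # Strip parameters such as "; charset=utf-8" before the lookup
    media_type = content_type.split(";")[0].strip()
    return mimetypes.guess_extension(media_type) or ""


print(ext_from_content_type({"Content-Type": "application/pdf"}))  # .pdf
```

The advantage over URL-based guessing is that it works even when the URL carries no usable extension at all (e.g. `/download?id=42`).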

@dankeil commented Dec 9, 2019

Thank you for pointing those pull requests out. I will be sure to reference them when developing my own solution to this problem.

Before that, though, I would like to know what your preferred process is for handling the fact that this pull request does not fully address issue #1287. Do you ever reopen tickets once they are closed, or would you prefer for me to open a new ticket for the remaining unhandled URLs?

@Gallaecio (Member) commented:

I guess you can create a new ticket.

Successfully merging this pull request may close these issues.

- OSError when downloading a very long url
- Files Pipeline fails on URLs with "?"/get attributes
5 participants