Disallow media extensions unregistered with IANA #3954
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master    #3954      +/-   ##
==========================================
+ Coverage   85.35%   85.44%   +0.08%
==========================================
  Files         167      167
  Lines        9699     9738     +39
  Branches     1453     1458      +5
==========================================
+ Hits         8279     8321     +42
+ Misses       1162     1159      -3
  Partials      258      258
```
Co-Authored-By: s-sanjay <sanjay537@gmail.com>
@elacuesta @Gallaecio @OmarFarrag I've found more instances of `.keys()` in the codebase and made a small PR to address part of it.
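For context, the `.keys()` cleanup mentioned above refers to a common Python idiom: iterating over or testing membership in a dict works on the dict object directly, so an explicit `.keys()` call is usually redundant. A minimal illustration (the settings dict here is a hypothetical example, not actual Scrapy code):

```python
# Hypothetical settings dict used only to illustrate the idiom.
settings = {"DOWNLOAD_DELAY": 2, "CONCURRENT_REQUESTS": 16}

# Iterating the dict yields its keys directly; .keys() adds an
# extra view object without changing the result.
assert list(settings) == list(settings.keys())

# Membership tests likewise work on the dict itself.
assert "DOWNLOAD_DELAY" in settings

# The view is still useful when you need set operations on keys.
shared = settings.keys() & {"DOWNLOAD_DELAY", "RETRY_TIMES"}
print(sorted(shared))
```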
This works well for a lot of URL patterns, but it doesn't seem to fully address #1287. For example, the URL "http://foo.bar/baz.pdf?v=22" should ideally return the path "full/0cb1b37dce389b0fcb5092d271c982269be26c8a.pdf", but it returns "full/0cb1b37dce389b0fcb5092d271c982269be26c8a" with the current v1.8.0 code. One solution would be to apply the fix in #1548 on top of this, as I did in the patch below. That worked well for me.

```diff
diff --git a/scrapy/pipelines/files.py b/scrapy/pipelines/files.py
index 6d55c898..75556b98 100644
--- a/scrapy/pipelines/files.py
+++ b/scrapy/pipelines/files.py
@@ -20,6 +20,7 @@ from scrapy.pipelines.media import MediaPipeline
 from scrapy.settings import Settings
 from scrapy.exceptions import NotConfigured, IgnoreRequest
 from scrapy.http import Request
+from scrapy.utils.httpobj import urlparse_cached
 from scrapy.utils.misc import md5sum
 from scrapy.utils.log import failure_to_exc_info
 from scrapy.utils.python import to_bytes
@@ -469,6 +470,10 @@ class FilesPipeline(MediaPipeline):
     def file_path(self, request, response=None, info=None):
         media_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
         media_ext = os.path.splitext(request.url)[1]
+        # if result is a querystring fragment, ignore url params and use path only
+        if not media_ext[1:].isalnum():
+            media_base_url = urlparse_cached(request).path
+            media_ext = os.path.splitext(media_base_url)[1]
         # Handles empty and wild extensions by trying to guess the
         # mime type then extension or default to empty string otherwise
         if media_ext not in mimetypes.types_map:
diff --git a/tests/test_pipeline_files.py b/tests/test_pipeline_files.py
index 52f2b554..7c1cc63c 100644
--- a/tests/test_pipeline_files.py
+++ b/tests/test_pipeline_files.py
@@ -58,6 +58,18 @@ class FilesPipelineTestCase(unittest.TestCase):
         self.assertEqual(file_path(Request("data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAR0AAACxCAMAAADOHZloAAACClBMVEX/\
//+F0tzCwMK76ZKQ21AMqr7oAAC96JvD5aWM2kvZ78J0N7fmAAC46Y4Ap7y")),
                          'full/178059cbeba2e34120a67f2dc1afc3ecc09b61cb.png')
+        self.assertEqual(file_path(Request("http://foo.bar/baz.pdf?v=22")),
+                         'full/0cb1b37dce389b0fcb5092d271c982269be26c8a.pdf')
+        self.assertEqual(file_path(Request("http://localhost:8050/render.png?url=http://www.test.ca&timeout=30wait=3")),
+                         'full/1691f03855fb23bc1e3be2618889a8d0d7ce15f8.png')
+        self.assertEqual(file_path(Request("http://foo.bar/baz.txt?fizz")),
+                         'full/a2b4913a62f65445aeae2bac08cd8c3b41d7195e.txt')
+        self.assertEqual(file_path(Request("http://foo.bar/baz.mp3?fizz")),
+                         'full/e395e5dd4ea00d7440bd7c4e42576ff71c1c7bca.mp3')
+        self.assertEqual(file_path(Request("http://foo.bar/baz?img=fizz.mp3")),
+                         'full/768d1719ad5fa6b6919bbbd65388fae36a3820d4.mp3')
+        self.assertEqual(file_path(Request("http://foo.bar/baz.php?img=fizz.mp3")),
+                         'full/04435a0f665ba33cc2257737d7f2f6c27ea5f2d4.mp3')

     def test_fs_store(self):
```
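The core idea of the patch above can be sketched as a standalone function, without depending on Scrapy internals. `guess_file_path` is a hypothetical name used only for this illustration; it mimics the extension-extraction part of `FilesPipeline.file_path` (the full pipeline also falls back to `mimetypes` for unknown extensions, which is omitted here):

```python
import hashlib
import os
from urllib.parse import urlparse


def guess_file_path(url):
    """Sketch of the patched extension logic: hash the full URL, but
    when the raw splitext result contains query-string characters
    (e.g. '.pdf?v=22'), re-derive the extension from the URL path."""
    media_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    media_ext = os.path.splitext(url)[1]
    # '.pdf?v=22'[1:] is 'pdf?v=22', which is not alphanumeric,
    # so we fall back to splitting the parsed path '/baz.pdf'.
    if not media_ext[1:].isalnum():
        media_ext = os.path.splitext(urlparse(url).path)[1]
    return f"full/{media_guid}{media_ext}"


print(guess_file_path("http://foo.bar/baz.pdf?v=22"))
print(guess_file_path("http://foo.bar/baz?img=fizz.mp3"))
```

Note that for `"http://foo.bar/baz?img=fizz.mp3"` the naive `splitext` result `.mp3` is alphanumeric, so the fallback never triggers and the query-string extension is kept, matching the last two test cases in the diff.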
Thank you for pointing those pull requests out. I will be sure to reference them when developing my own solution to this problem. Before that, though, I would like to know what your preferred process is for handling the fact that this pull request does not fully address issue #1287. Do you ever reopen tickets once they are closed, or would you prefer for me to open a new ticket for the remaining unhandled URLs?
I guess you can create a new ticket.
If a media extension is an empty string or is unregistered with IANA, try to guess the MIME type from the URL, then the extension from the MIME type, using the built-in `mimetypes` library. Fixes #3953, fixes #1287.
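The two-step fallback described in the summary can be sketched with the standard-library `mimetypes` module. `guess_extension_for` is a hypothetical helper for illustration only, not the actual pipeline code:

```python
import mimetypes


def guess_extension_for(url, default=""):
    """Sketch of the described fallback: guess the MIME type from the
    URL, then map that type back to a registered extension; return a
    default (here the empty string) when nothing can be guessed."""
    media_type, _encoding = mimetypes.guess_type(url)
    if media_type is None:
        # Extension unknown to the mimetypes registry.
        return default
    return mimetypes.guess_extension(media_type) or default


print(guess_extension_for("http://example.com/report.pdf"))
print(guess_extension_for("http://example.com/file.notarealext"))
```

Because `mimetypes` only maps registered types, an invented extension like `.notarealext` falls through to the default, which is the "disallow unregistered extensions" behavior this PR aims for.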