OSError when downloading a very long url #3953

Closed
zaxtyson opened this issue Aug 12, 2019 · 9 comments · Fixed by #3954

@zaxtyson zaxtyson commented Aug 12, 2019

When you run into some horrible image url, like this:

https://o.aolcdn.com/images/dims?resize=2000%2C2000%2Cshrink&image_uri=https%3A%2F%2Fo.aolcdn.com%2Fimages%2Fdimse%2F5845cadfecd996e0372f%2Fccc34660c41122e3170c0d586c151a29397c0fcf%2FY3JvcD0xOTIwJTJDMTA5NyUyQzAlMkMwJnF1YWxpdHk9ODUmZm9ybWF0PWpwZyZyZXNpemU9MTYwMCUyQzkxNCZpbWFnZV91cmk9aHR0cHMlM0ElMkYlMkZzLnlpbWcuY29tJTJGb3MlMkZjcmVhdHItdXBsb2FkZWQtaW1hZ2VzJTJGMjAxOS0wOCUyRjg2YjNlYjkwLWI5YjgtMTFlOS05ZWFlLTQ5YWU2NTcxMjM0MyZjbGllbnQ9YTFhY2FjM2UxYjMyOTA5MTdkOTImc2lnbmF0dXJlPTZmZWJkYjQwN2E0NzU0YzM0YTJjY2ViMDczNDc1YTE1ZjBiODA3OGQ%3D&client=a1acac3e1b3290917d92&signature=bf3461468aef0cb3ecaea00d2ed611e04a88bc70

Then...

Traceback (most recent call last):
  File "c:\program files\python37\lib\site-packages\scrapy\pipelines\files.py", line 419, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "c:\program files\python37\lib\site-packages\scrapy\pipelines\files.py", line 452, in file_downloaded
    self.store.persist_file(path, buf, info)
  File "c:\program files\python37\lib\site-packages\scrapy\pipelines\files.py", line 53, in persist_file
    with open(absolute_path, 'wb') as f:
OSError: [Errno 22] Invalid argument: 'E:\\2019-08-12\\resources\\885443110bae0e1149e017dbea5ca3935efa38c0.com%2Fimages%2Fdimse%2F5845cadfecd996e0372f%2F108a4af73772ae197fa2c4ec4e9fe7a47390433c%2FY3JvcD0xMTc0JTJDNTgwJTJDMCUyQzAmcXVhbGl0eT04NSZmb3JtYXQ9anBnJnJlc2l6ZT0xNjAwJTJDNzkxJmltYWdlX3VyaT1odHRwcyUzQSUyRiUyRnMueWltZy5jb20lMkZvcyUyRmNyZWF0ci11cGxvYWRlZC1pbWFnZXMlMkYyMDE5LTA4JTJGMWJmZGQxNDAtYjliYy0xMWU5LWJmZjMtMjMyNzcwMTg1MzE5JmNsaWVudD1hMWFjYWMzZTFiMzI5MDkxN2Q5MiZzaWduYXR1cmU9OTFiNzQ3Y2MyZTY5ODY3OGIxNWI0OTkyMjdjM2NmZWRlYTE1NGIxOA%3D%3D&client=a1acac3e1b3290917d92&signature=6517aece82e79d536edeaccc275ad88090df0252'

So I think that when downloading a file, Scrapy should use a random name instead of taking it from the URL. For some particularly weird URLs, the current behaviour causes an OSError when writing the file.

Member

@Gallaecio Gallaecio commented Aug 12, 2019

Funny, the code actually uses a hash as file name. However, it picks the whole .com%2Fimages%2Fdimse%2F5845cadfecd996e0372f%2F108a4af73772ae197fa2c4ec4e9fe7a47390433c%2FY3JvcD0xMTc0JTJDNTgwJTJDMCUyQzAmcXVhbGl0eT04NSZmb3JtYXQ9anBnJnJlc2l6ZT0xNjAwJTJDNzkxJmltYWdlX3VyaT1odHRwcyUzQSUyRiUyRnMueWltZy5jb20lMkZvcyUyRmNyZWF0ci11cGxvYWRlZC1pbWFnZXMlMkYyMDE5LTA4JTJGMWJmZGQxNDAtYjliYy0xMWU5LWJmZjMtMjMyNzcwMTg1MzE5JmNsaWVudD1hMWFjYWMzZTFiMzI5MDkxN2Q5MiZzaWduYXR1cmU9OTFiNzQ3Y2MyZTY5ODY3OGIxNWI0OTkyMjdjM2NmZWRlYTE1NGIxOA%3D%3D&client=a1acac3e1b3290917d92&signature=6517aece82e79d536edeaccc275ad88090df0252 as file extension 🙂
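A minimal sketch of how an "extension" like that can arise, assuming the extension is obtained by calling `os.path.splitext` on the full URL (this matches the file name in the traceback, but it is an assumption about the pipeline code at the time). The URL is a shortened, hypothetical stand-in for the one in the report:

```python
import os
from urllib.parse import urlparse

# Hypothetical, shortened stand-in for the URL in the report.
url = (
    "https://example.com/images/dims"
    "?image_uri=https%3A%2F%2Fexample.com%2Fimages%2Fabc&signature=123"
)

# Splitting the whole URL: the last dot after the last "/" is the one in
# the percent-encoded "example.com" inside the query string, so everything
# from ".com" onwards is treated as the extension.
ext_from_url = os.path.splitext(url)[1]

# Splitting only the path component ignores the query string entirely.
ext_from_path = os.path.splitext(urlparse(url).path)[1]

print(repr(ext_from_url))
print(repr(ext_from_path))
```

Because the slashes in the query string are percent-encoded, `splitext` sees no path separator after the embedded host name and keeps the whole tail as the extension.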

Member

@Gallaecio Gallaecio commented Aug 12, 2019

If you don’t want to wait for a fix, you can override the file_path method in a custom subclass. See https://docs.scrapy.org/en/latest/topics/media-pipeline.html#scrapy.pipelines.files.FilesPipeline
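One possible shape for such an override, as a sketch rather than the actual fix: the helper name `safe_file_path`, the `MAX_EXT_LEN` cap, and the `full/` prefix are assumptions for illustration. It hashes the URL for the file name and keeps an extension only when it comes from the URL path and looks like a normal short extension:

```python
import hashlib
import os
from urllib.parse import urlparse

MAX_EXT_LEN = 10  # hypothetical cap on extension length, including the dot


def safe_file_path(url):
    """Return a short, filesystem-friendly relative path for a URL."""
    # A SHA-1 hex digest is always 40 characters, whatever the URL length.
    media_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # Look for an extension in the path only, never in the query string.
    ext = os.path.splitext(urlparse(url).path)[1]
    # Drop anything that does not look like a short alphanumeric extension.
    if len(ext) > MAX_EXT_LEN or not ext[1:].isalnum():
        ext = ""
    return "full/%s%s" % (media_guid, ext)


print(safe_file_path("https://example.com/a/b.jpg?x=1"))
print(safe_file_path("https://example.com/dims?u=https%3A%2F%2Fexample.com%2Fa"))
```

In a `FilesPipeline` subclass, the overridden `file_path` method could then return `safe_file_path(request.url)`; the exact method signature depends on the Scrapy version, so check the linked documentation.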

Contributor

@OmarFarrag OmarFarrag commented Aug 13, 2019

@Gallaecio Working on that

Author

@zaxtyson zaxtyson commented Aug 13, 2019

Yes, I overrode file_path, but with this magic URL it always raises an error, and I can't work out why. Moreover, on Linux the length of a file name is limited, so taking the file name from the URL is not a good practice.

Member

@Gallaecio Gallaecio commented Aug 19, 2019

This seems related to #1287

@Naman-Garg-06 Naman-Garg-06 commented Aug 19, 2019

Hello. I am new here. Can anyone please tell me where can I get the source code?

Author

@zaxtyson zaxtyson commented Aug 22, 2019

> Hello. I am new here. Can anyone please tell me where can I get the source code?

Oh, look at this: https://github.com/scrapy/scrapy/tree/master/scrapy

@ritik-malik ritik-malik commented Aug 24, 2019

Hey man, can I fix this? This would be my first PR.

Member

@Gallaecio Gallaecio commented Aug 24, 2019

@ritik-malik There’s no need to ask for permission to fix an issue. You can usually just start a pull request that references the ticket, and that’s it.

However, in this case there is already a pull request open, #3954, which seems promising. Maybe you could find a different issue?
