Mitigation of attack vectors from externally provided links for generic extractor #25141

Closed
legraphista opened this issue May 5, 2020 · 5 comments

legraphista commented May 5, 2020

Checklist

  • I'm asking a question
  • I've looked through the README and FAQ for similar questions
  • I've searched the bugtracker for similar questions including closed ones

Question

  1. Are there any mitigations in place for attacks such as Slowloris, or for a third-party URL that indefinitely sends data without a Content-Length?

  2. Does the downloader respect the Content-Length header and error out if the downloaded size does not match it?

  3. I noticed there is a --max-filesize option, but does it check Content-Length or the actual bytes downloaded so far?

  4. Is there anything I can do to kill the process if the download speed drops below a threshold? Or is there an option that would help me achieve this?

legraphista added the question label May 5, 2020
dstftw (Collaborator) commented May 5, 2020

  1. What concrete scenario are you talking about? A missing Content-Length is the standard way of transmitting live streams over plain HTTP. If you "mitigate" this, you won't be able to download them.
  2. A mismatch with what?
  3. Content-Length.
  4. No. You should never kill the process in the first place; send an interrupt signal instead (see the sketch after this list).
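
A minimal sketch of the "interrupt, don't kill" approach from point 4, assuming youtube-dl runs as a subprocess and writes to a known output path. The throughput threshold, poll interval, and helper names are illustrative assumptions, not youtube-dl features:

```python
# Poll the output file's growth and send SIGINT when throughput drops
# below a threshold, escalating to SIGKILL only if the process ignores
# the interrupt.
import os
import signal
import subprocess
import time

MIN_BYTES_PER_SEC = 10 * 1024  # assumed minimum acceptable throughput
POLL_SECONDS = 10

def _current_size(path):
    # youtube-dl writes to "<path>.part" until the download completes.
    for candidate in (path + ".part", path):
        if os.path.exists(candidate):
            return os.path.getsize(candidate)
    return 0

def download_with_watchdog(url, outfile):
    proc = subprocess.Popen(["youtube-dl", "-o", outfile, url])
    last_size = _current_size(outfile)
    while proc.poll() is None:
        time.sleep(POLL_SECONDS)
        size = _current_size(outfile)
        if (size - last_size) / POLL_SECONDS < MIN_BYTES_PER_SEC:
            proc.send_signal(signal.SIGINT)  # polite interrupt first
            try:
                proc.wait(timeout=30)
            except subprocess.TimeoutExpired:
                proc.kill()                  # last resort
            break
        last_size = size
    return proc.wait()
```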
dstftw closed this May 5, 2020
legraphista (Author) commented May 5, 2020

  1. I am concerned about maliciously crafted endpoints submitted by users, where the endpoint provides continuous junk data without ever ending the connection, exhausting our capacity to spin up more instances for legitimate users.

  2. I was referring to a mismatch between Content-Length and the size of the downloaded file.

  3. OK, understood.

  4. My apologies, I worded that badly; by "kill" I actually meant sending SIGTERM, waiting through a cooldown a couple of times, and then sending SIGKILL if the program is unresponsive.

Since we don't plan to allow live streams, we could use -J to extract the source file's URL and issue a HEAD request to check for Content-Length. If it is present, we could go ahead and fire up youtube-dl with --max-filesize to download the file. My only remaining concern is the case where the server "lies" about the Content-Length and sends a bigger-than-expected payload.
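
A rough sketch of that workflow, using only the standard library and assuming a single-format result; the function name and the 100 MB cap are illustrative:

```python
# Extract metadata with `youtube-dl -J`, probe the media URL with HEAD,
# and only then download with --max-filesize.
import json
import subprocess
import urllib.request

MAX_BYTES = 100 * 1024 * 1024  # assumed per-download cap

def probe_and_download(page_url):
    # -J dumps the extracted info as JSON without downloading anything.
    info = json.loads(subprocess.check_output(["youtube-dl", "-J", page_url]))
    media_url = info["url"]  # single-format case; playlists need more handling

    # HEAD request to inspect Content-Length before committing to a download.
    req = urllib.request.Request(media_url, method="HEAD")
    with urllib.request.urlopen(req, timeout=30) as resp:
        length = resp.headers.get("Content-Length")
    if length is None or int(length) > MAX_BYTES:
        raise RuntimeError("missing or oversized Content-Length; refusing")

    # Enforce the same cap again at download time.
    subprocess.check_call(
        ["youtube-dl", "--max-filesize", str(MAX_BYTES), page_url])
```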

dstftw (Collaborator) commented May 5, 2020

  1. As already pointed out, such mitigation would likely break HTTP live streams, so no mitigation of that kind is reasonable for youtube-dl.
  2. Yes. Actually, no: it looks like the data is already truncated at the Content-Length boundary if the payload exceeds it. This probably happens somewhere in the Python HTTP stack or even deeper. Browsers seem to have the same behavior.

You should bear in mind that the remote party may serve a different Content-Length for HEAD and GET requests.
If the server lied, no more than one extra block of the current block size will be downloaded before the aforementioned check spots it.
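
A minimal sketch of a GET-side guard, given that HEAD can't be trusted: check Content-Length on the GET response itself and verify the byte count block by block. The cap and block size are assumptions, and since Python's HTTP stack already truncates at the declared length (as noted later in this thread), the over-read branch is belt-and-braces:

```python
# Read the body in blocks, comparing against the GET's own Content-Length.
# If the server sends more than declared, at most one extra block is read
# before the check trips.
import urllib.request

MAX_BYTES = 100 * 1024 * 1024  # assumed cap
BLOCK = 64 * 1024              # assumed block size

def bounded_get(url, dest):
    with urllib.request.urlopen(url, timeout=30) as resp:
        declared = int(resp.headers.get("Content-Length", -1))
        if declared < 0 or declared > MAX_BYTES:
            raise RuntimeError("missing or oversized Content-Length")
        read = 0
        with open(dest, "wb") as f:
            while True:
                chunk = resp.read(BLOCK)
                if not chunk:
                    break
                read += len(chunk)
                if read > declared:
                    raise RuntimeError("payload exceeded declared size")
                f.write(chunk)
        if read != declared:
            raise RuntimeError("payload shorter than declared size")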

legraphista (Author) commented May 5, 2020

Thank you for your time & answers.

I have come to the conclusion that my requirements are out of the scope of youtube-dl.

We can write a custom downloader that aligns with our needs and use it when the extractor is generic; otherwise we'll let youtube-dl do the heavy lifting.

I'll most likely leverage --load-info-json to cache the metadata so we don't send duplicate requests after the extractor checks.
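
A rough sketch of that split, assuming the info dict's extractor field distinguishes the generic case; custom_download stands in for a hypothetical in-house downloader:

```python
# Extract once with -J, cache the JSON, then either hand the media URL to
# a custom downloader (generic extractor) or replay the cached metadata
# through youtube-dl via --load-info-json, avoiding repeat extraction.
import json
import subprocess

def custom_download(media_url):
    raise NotImplementedError("in-house bounded downloader goes here")

def extract_once_then_download(page_url, info_path="info.json"):
    info = json.loads(subprocess.check_output(["youtube-dl", "-J", page_url]))
    with open(info_path, "w") as f:
        json.dump(info, f)

    if info.get("extractor") == "generic":
        custom_download(info["url"])
    else:
        # Reuse the cached metadata; no second round of extractor requests.
        subprocess.check_call(["youtube-dl", "--load-info-json", info_path])
```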

dstftw (Collaborator) commented May 5, 2020

I've re-checked question 2, and it looks like the data is actually silently truncated at the Content-Length border if there is more. This happens somewhere in Python's internals or deeper. Browsers seem to have similar behavior. So the actual answer to 2 is "no".
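
A small self-contained demonstration of that behavior: a toy server declares Content-Length: 5 but sends 10 bytes, and Python's HTTP stack silently drops the extras. The port and payload here are arbitrary:

```python
# A lying server declares Content-Length: 5 but sends 10 bytes; Python's
# http.client stops reading at the declared length.
import socket
import threading
import urllib.request

def lying_server(sock):
    conn, _ = sock.accept()
    conn.recv(4096)  # read (and ignore) the request; enough for this demo
    conn.sendall(b"HTTP/1.1 200 OK\r\n"
                 b"Content-Length: 5\r\n"
                 b"\r\n"
                 b"HELLOJUNKY")  # 10 bytes, 5 more than declared
    conn.close()

sock = socket.socket()
sock.bind(("127.0.0.1", 0))
sock.listen(1)
port = sock.getsockname()[1]
threading.Thread(target=lying_server, args=(sock,), daemon=True).start()

with urllib.request.urlopen("http://127.0.0.1:%d/" % port) as resp:
    print(resp.read())  # b'HELLO' -- truncated at the Content-Length border
```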
