Option to allow all characters in filenames on non-Windows OS #22691

Closed
Wuestengecko opened this issue Oct 12, 2019 · 3 comments

Comments

@Wuestengecko Wuestengecko commented Oct 12, 2019

Checklist

  • I'm reporting a feature request
  • I've verified that I'm running youtube-dl version 2019.09.28
  • I've searched the bugtracker for similar feature requests including closed ones

Description

On Unix-like operating systems like Linux, the only restrictions on file name characters are "no ASCII NUL" and "no /". Currently, youtube-dl will always strip out a few characters (e.g. |) that are not allowed on Windows, but that is not necessary on other platforms. When I download a video that has such a character in its name, I have to manually go through the title and replace them back. Please make it an option to not replace these characters in the first place, if the operating system allows them.
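To make the restriction described above concrete, here is a minimal sketch of a sanitizer that only touches what a POSIX filename cannot contain. The function name `posix_safe_filename` is hypothetical and is not part of youtube-dl; the handling of the reserved names `.` and `..` is an extra assumption of this sketch.

```python
def posix_safe_filename(title: str, replacement: str = "_") -> str:
    """Replace only NUL and '/', the two characters a POSIX filename cannot hold."""
    cleaned = title.replace("\0", replacement).replace("/", replacement)
    # "", "." and ".." cannot be used as regular file names either,
    # so fall back to the replacement character for them.
    if cleaned in ("", ".", ".."):
        return replacement
    return cleaned


print(posix_safe_filename("a | b: c/d"))  # a | b: c_d
```

Characters such as `|` and `:` pass through unchanged, which is the behavior the request asks for.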

As Windows is widely used and interoperability between Windows and Linux therefore is a common requirement, I see the current behavior as a sane default. Considering this, this feature could be implemented as counterpart options to --restrict-filenames. They could be called e.g. --sane-filenames to explicitly request the current/default behavior and --no-sane-filenames to allow all characters.
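The proposed pair of options could look roughly like the following. This is only a sketch using the standard `argparse` module, not youtube-dl's actual option handling; `--sane-filenames` and `--no-sane-filenames` are the hypothetical names from the proposal, and `example-downloader` and `sane_filenames` are made up for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="example-downloader")
    # Hypothetical flags from the proposal; neither exists in youtube-dl.
    # Both write to the same destination, so the last one given wins.
    parser.add_argument("--sane-filenames", dest="sane_filenames",
                        action="store_true", default=True,
                        help="replace characters that Windows rejects (default)")
    parser.add_argument("--no-sane-filenames", dest="sane_filenames",
                        action="store_false",
                        help="keep every character the local OS allows")
    return parser


args = build_parser().parse_args(["--no-sane-filenames"])
print(args.sane_filenames)  # False
```

Keeping the Windows-compatible behavior as the default means existing invocations would not change.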

@dstftw dstftw closed this Oct 12, 2019
@dstftw dstftw added the duplicate label Oct 12, 2019

@Wuestengecko Wuestengecko commented Oct 13, 2019

I apologize for apparently not doing thorough enough research on the matter. I did search through the tickets on GitHub to see whether there are any duplicates, but I'm still unsure which one you are referring to. All I could find were errors related to insufficient UTF-8 support (or other character set mismatches), ones with invalid template format strings, and ones complaining that Windows can't handle some character or other, but - seeing the sheer number of tickets in total - it's quite possible I have overlooked more relevant ones.

What I'm asking for however is basically the exact opposite: I want a way to just pass the original video title through to the filename, without any "sanitization" whatsoever.


I also tried looking through the code and found sanitize_open (and sanitize_path a few lines below), which suggest that the Windows sanitization should only be taking place on Windows; however I am in fact not using Windows, and running python -c 'open("a | b", "w").close()' does not fail (in the same directory I was downloading the videos to).

By simply editing in a hook that prints the parameters and results of sanitize_filename and sanitize_path whenever they're called, I see that sanitize_path actually gets called after the individual parts have been sent through sanitize_filename; in other words the if sys.platform != 'win32' at the start of sanitize_path has no effect on the %(title)s template, since the affected characters have already been replaced beforehand. This observation is reinforced by the fact that a template like %(ext)s | %(ext)s correctly preserves the | instead of "sanitizing" it to _.
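The comment does not show how the hook was written. One possible way to get the same kind of trace is a small decorator, sketched here; `trace_calls` is a hypothetical helper, and the `sanitize_filename` below is only a stand-in whose replacement rule is assumed for the demo, not youtube-dl's real function.

```python
import functools


def trace_calls(func):
    """Wrap func so every call prints its arguments and its return value."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        print(f"{func.__name__}(args={args!r}, kwargs={kwargs!r}) -> {result!r}")
        return result
    return wrapper


# Stand-in for the real function; the rule "| becomes _" is an assumption.
@trace_calls
def sanitize_filename(name):
    return name.replace("|", "_")


sanitize_filename("a | b")
```

Printing after the call lets one line show both the input and the result, which makes the order of the calls easy to read off.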

I'm now a little confused about what the originally intended behavior was. Can you please clarify this, or point me to the appropriate documentation?


@dstftw dstftw commented Oct 13, 2019

Not using Windows does not give any guarantee one won't use Windows shares/NTFS partitions.
Sanitization is performed in two steps due to the different data sources: output template sequences are external metadata not controlled by the user and must always be sanitized, whereas the output template literal itself may contain forbidden characters written by the user, which makes the user responsible for them, though it is sanitized on Windows anyway.
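The two steps can be illustrated with a simplified sketch. It is not youtube-dl's code: the function names, the character set, and the single `_` replacement are all assumptions chosen to mirror the description in this thread.

```python
import sys

WINDOWS_FORBIDDEN = '<>:"/\\|?*'  # characters Windows rejects in file names


def sanitize_field(value: str) -> str:
    """Step 1: always applied to metadata values such as the title."""
    return "".join("_" if ch in WINDOWS_FORBIDDEN or ch == "\0" else ch
                   for ch in value)


def sanitize_literal_path(path: str, platform: str = sys.platform) -> str:
    """Step 2: applied to the expanded template, but only on Windows."""
    if platform != "win32":
        return path
    # Path separators are kept; a real implementation would also have to
    # leave drive letters such as "C:" intact.
    return "".join("_" if ch in '<>:"|?*' else ch for ch in path)


def build_filename(template: str, info: dict, platform: str = sys.platform) -> str:
    # Metadata is sanitized before substitution, so step 2 never sees
    # the original characters of the title.
    safe_info = {key: sanitize_field(str(val)) for key, val in info.items()}
    return sanitize_literal_path(template % safe_info, platform)


print(build_filename("%(ext)s | %(title)s.%(ext)s",
                     {"title": "a | b", "ext": "mp4"}, platform="linux"))
# mp4 | a _ b.mp4
```

On a non-Windows platform the `|` typed into the template survives, while the `|` coming from the title is already gone, which matches the behavior reported earlier in the thread.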


@Wuestengecko Wuestengecko commented Oct 15, 2019

> Not using Windows does not give any guarantee one won't use Windows shares/NTFS partitions.

I absolutely agree on this point, which is why I was asking for an option, instead of changing the default behavior.

> Sanitization is performed in two steps due to the different data sources

In my opinion, this isn't really thought through. The way I see it is, we either want full Windows compatibility, or we want to allow the full character set supported by the current OS.

  • If we want compatibility, we shouldn't rely on the user honoring this - it's all too easy to miss an illegal character if you're used to it not being illegal.
  • On the other hand, if the user knows that a specific file will never land on Windows, it would be nice if ytdl wouldn't unnecessarily strip out characters that are totally legal.

I do think retaining Windows compatibility by default is the right call to make. I'm only asking for an option to not force Windows' restrictions on my filenames. If you still think such an option is inappropriate for one reason or another, then I will accept that decision and stop bothering you.
