Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Filenames should be filtered on --literal too #441

Closed
Blacker47 opened this Issue · 8 comments

4 participants

@Blacker47

Using --title converts too much (spaces to underscores) for a modern filesystem.

Using --literal don't filter slashes out that produces ugly sub-directories. Maybe a directory-traversal with overwriting of important files is possible too.
IMHO, using --literal should filter slashes "/" anyway on a unix-system. On windows there would be some more chars.

In case the idea don't fits in the meaning of "literal" there could be a "almostliteral" option for unix-systems.

@ssokolow

Actually, I'd suggest an option to force FAT32 filename restrictions everywhere since the same restrictions apply to FAT32 filesystems no matter where you mount them. (I know NTFS has a POSIX mode where only NUL and / are forbidden, but I'm not sure whether the Linux ntfs-3g driver uses it or whether it enforces the Win32 mode for compatibility)

Either way, here are what I'd recommend filtering at bare minimum to ensure people don't get an unpleasant surprise when they try to copy a file to a thumbdrive:

  • \x00 (NUL), ?, \, *, /, ", :, |, <, and > (Disallowed in FAT and Win32-mode NTFS)
  • \x00-\x1f and \x7f (Even when allowing long filenames, FAT32 disallows ASCII control characters)

As is, since --title makes it impossible to tell what was and wasn't once an underscore, I'm using a wrapper for youtube-dl which trusts me to only use it for videos without slashes in their names and to run it on a POSIX-compliant filesystem. It downloads using --literal and then uses the rename command to escape all filenames in the current folder for safety on FAT filesystems.

@Blacker47

I belive there are a lot of people using youtube-dl unter linux and don't bother about be able to copy the files to FAT/NTFS filesystem. On the other hand a lot of video titles uses some characters you can't use on FAT and any filtering of them makes the filename unnacessary ugly. Filtering only the forward-slash would be not that ugly.

There could be something like --filter-filenames=[FAT|UNIX] as a parameter.

@ssokolow

@Blacker47 I wouldn't really mind that but what examples do you have in mind?

Also, it may not be as bad as you're thinking in FAT mode:

  • Control characters are even less desirable than \n in filenames and should never appear in a properly-formatted title anyway, so filtering them isn't an issue.
  • : and | can be replaced with a space-padded dash to get things like "My Story - Chapter 2". (Possibly with a regular expression match like r"\s?[:|]\s?" so you always get that pretty balance of spaces before and after.)
  • " can be replaced by '. Heck, that's basically how they do it normally in Britain anyway. (Read a UK printing of something like Harry Potter or Discworld if you don't believe me)
  • < and > can be comfortably replaced by [ and ]. That's basically the kind of thinking that led to BBcode.
  • "?" isn't horrendously ugly if replaced by an underscore, but could be replaced by ¿ if you want something both thumbdrive-compatible and unambiguous. (That's how quite a few Windows users do it when they consider an underscore unacceptable)
  • I'm not sure why you'd want * in your filenames, given how it's a risk for anything that accidentally processes metacharacters and the most common use I've seen is juvenile naming of things like "****OMG****", but if you really want it, there are various unicode characters that look similar or even ASCII characters that wouldn't look bad as a replacement.

That just leaves \ and /. If you don't mind unicode, you could always use look-alike characters like U+2215 (Division Slash. ) and U+2216 (Set Minus. ).

Anyway, my main concern is trying to get the default behaviour to be as comfortable as possible for the largest variety of people possible.

@phihag
Collaborator

+1 on using fancy Unicode characters instead of /\* - great idea, @ssokolow . IMHO, --literal should be deprecated and made an alias of --title.

@FiloSottile
Collaborator

With the new sanitize_title that resembles the proposal by @ssokolow (and that we can tweak even more based on his suggestions, although I don't like Unicode in filenames) I don't get the point of --literal at all. And yes, it IS a security bug. Let's alias it!

@Blacker47

Using non-ASCII with FATxx is not allways a good idea.
Using similar characters from Unicode changes the title too much.

The ONLY security related problem is not to escape the only ONE char, '/' (and on windows '\').
It is easy change and can be seen by user after the change (e.g. the user needs the exact title to search for).

I don't see any clue to aliasing --literal to --title because the escaping of one char is too much work.

(I haven't checked the todays build for this issue - maybe it is fixed allready)

@FiloSottile
Collaborator

We consolidated the behavior now, see https://github.com/rg3/youtube-dl#output-template

--title now replaces only security or filesystem problematic characters
--literal opened to a number of issues for a little gained value, we aliased to --title
--restrict-filename is the new option to leave only safe ASCII characters

Does this address the issue?

@Blacker47

--title seems to make on linux the expected job now.

Thanks.

@Blacker47 Blacker47 closed this
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.