Unicode "|" character is being replaced in filenames on filesystems that support it #4547

Open
8 tasks done
Gestas opened this issue Aug 3, 2022 · 20 comments · May be fixed by #8464 or #9591
Labels
enhancement New feature or request

Comments

@Gestas

Gestas commented Aug 3, 2022

Checklist

  • I'm reporting a bug unrelated to a specific site
  • I've verified that I'm running yt-dlp version 2022.07.18 (update instructions) or later (specify commit)
  • I've checked that all provided URLs are playable in a browser with the same IP and same login details
  • I've checked that all URLs and arguments with special characters are properly quoted or escaped
  • I've searched the bugtracker for similar issues including closed ones. DO NOT post duplicates
  • I've read the guidelines for opening an issue

Provide a description that is worded well enough to be understood

Per the documentation, --no-restrict-filenames should support Unicode characters. However, the | character, which is valid Unicode, is being replaced with _ in file paths -

Actual behavior -

$ yt-dlp -vU \
--get-filename \
-o "/%(uploader)s/S01E%(video_autonumber)02d - %(title)s/S01E%(video_autonumber)02d - %(title)s [%(id)s].%(ext)s" \
l4gGWufoIYI

[debug] Command-line config: ['-vU', '--get-filename', '-o', '/%(uploader)s/S01E%(video_autonumber)02d - %(title)s/S01E%(video_autonumber)02d - %(title)s [%(id)s].%(ext)s', 'l4gGWufoIYI']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version 2022.07.18 [135f05e] (zip)
[debug] Python 3.10.4 (CPython 64bit) - Linux-5.17.15-76051715-generic-x86_64-with-glibc2.35 (glibc 2.35)
[debug] Checking exe version: ffmpeg -bsfs
[debug] Checking exe version: avconv -bsfs
[debug] Checking exe version: ffprobe -bsfs
[debug] Checking exe version: avprobe -bsfs
[debug] exe versions: none
[debug] Optional libraries: brotli-1.0.9, certifi-2020.06.20, secretstorage-3.3.1, sqlite3-2.6.0
[debug] Proxy map: {}
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Latest version: 2022.07.18, Current version: 2022.07.18
yt-dlp is up to date (2022.07.18)
[debug] [youtube] Extracting URL: l4gGWufoIYI
[youtube] l4gGWufoIYI: Downloading webpage
[youtube] l4gGWufoIYI: Downloading android player API JSON
[debug] Sort order given by extractor: quality, res, fps, hdr:12, source, codec:vp9.2, lang, proto
[debug] Formats sorted by: hasvid, ie_pref, quality, res, fps, hdr:12(7), source, vcodec:vp9.2(10), acodec, lang, proto, filesize, fs_approx, tbr, vbr, abr, asr, vext, aext, hasaud, id
[debug] Default format spec: bestvideo*+bestaudio/best
[info] l4gGWufoIYI: Downloading 1 format(s): 248+251

/Action BOX/S01E01 - Homemade Plastic Injection Machine _ DIY/S01E01 - Homemade Plastic Injection Machine _ DIY [l4gGWufoIYI].webm

Expected behavior -

The video title on YouTube is "Homemade Plastic Injection Machine | DIY", so the expected file path is -

/Action BOX/S01E01 - Homemade Plastic Injection Machine | DIY/S01E01 - Homemade Plastic Injection Machine | DIY [l4gGWufoIYI].webm

Notes -

I'm running Ubuntu 21 with an ext4 filesystem, where | is a supported character in filenames -

$ touch 'this is a pipe | in a filename'
$ ls
'this is a pipe | in a filename'

To the best of my knowledge, NTFS (Windows) is the only filesystem that doesn't support | in a path. yt-dlp already handles filenames differently when running on Windows via the default --no-windows-filenames. Removing the | restriction for non-NTFS filesystems shouldn't cause an issue.
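
The touch test above can also be scripted. The following is a minimal sketch, not anything yt-dlp does: `directory_accepts` is a hypothetical helper that tries to create (and then delete) a file with the given name in a target directory. A probe like this is only a heuristic. It reports False for a name that already exists, and some systems may accept a name while interpreting it differently than intended.

```python
import os
import tempfile


def directory_accepts(name: str, directory: str) -> bool:
    """Return True if a new file called `name` can be created in `directory`."""
    path = os.path.join(directory, name)
    try:
        # O_EXCL guarantees the probe never overwrites an existing file.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except (OSError, ValueError):
        # OSError: rejected by the OS or filesystem (or the file exists).
        # ValueError: Python refuses paths containing an embedded NUL.
        return False
    os.close(fd)
    os.remove(path)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for candidate in ("this is a pipe | in a filename", "plain.txt"):
            print(repr(candidate), directory_accepts(candidate, tmp))
```

The result depends on the filesystem that backs the probed directory, so the check would have to run against the actual download location rather than a temporary directory.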

Extras -

While this ticket is specific to the '|' character, NTFS is the only filesystem that doesn't support all Unicode characters -

$ touch 'this is a happy filename 😀'
$ ls
'this is a happy filename 😀'

It may be worth exploring supporting the entire Unicode character set in yt-dlp.

Provide verbose output that clearly demonstrates the problem

  • Run your yt-dlp command with -vU flag added (yt-dlp -vU <your command line>)
  • Copy the WHOLE output (starting with [debug] Command-line config) and insert it below

Complete Verbose Output

[debug] Command-line config: ['-vU', '--get-filename', '-o', '/%(uploader)s/S01E%(video_autonumber)02d - %(title)s/S01E%(video_autonumber)02d - %(title)s [%(id)s].%(ext)s', 'l4gGWufoIYI']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version 2022.07.18 [135f05e] (zip)
[debug] Python 3.10.4 (CPython 64bit) - Linux-5.17.15-76051715-generic-x86_64-with-glibc2.35 (glibc 2.35)
[debug] Checking exe version: ffmpeg -bsfs
[debug] Checking exe version: avconv -bsfs
[debug] Checking exe version: ffprobe -bsfs
[debug] Checking exe version: avprobe -bsfs
[debug] exe versions: none
[debug] Optional libraries: brotli-1.0.9, certifi-2020.06.20, secretstorage-3.3.1, sqlite3-2.6.0
[debug] Proxy map: {}
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Latest version: 2022.07.18, Current version: 2022.07.18
yt-dlp is up to date (2022.07.18)
[debug] [youtube] Extracting URL: l4gGWufoIYI
[youtube] l4gGWufoIYI: Downloading webpage
[youtube] l4gGWufoIYI: Downloading android player API JSON
[debug] Sort order given by extractor: quality, res, fps, hdr:12, source, codec:vp9.2, lang, proto
[debug] Formats sorted by: hasvid, ie_pref, quality, res, fps, hdr:12(7), source, vcodec:vp9.2(10), acodec, lang, proto, filesize, fs_approx, tbr, vbr, abr, asr, vext, aext, hasaud, id
[debug] Default format spec: bestvideo*+bestaudio/best
[info] l4gGWufoIYI: Downloading 1 format(s): 248+251

/Action BOX/S01E01 - Homemade Plastic Injection Machine _ DIY/S01E01 - Homemade Plastic Injection Machine _ DIY [l4gGWufoIYI].webm

Gestas added the bug and triage labels on Aug 3, 2022
@pukkandan

This comment was marked as resolved.

pukkandan marked this as a duplicate of #3385 on Aug 3, 2022
pukkandan closed this as not planned on Aug 3, 2022
pukkandan added the duplicate and enhancement labels and removed the bug and triage labels on Aug 3, 2022
@pukkandan

This comment was marked as resolved.

pukkandan reopened this on Aug 3, 2022
pukkandan removed the duplicate label on Aug 3, 2022
@light-and-ray

Maybe you could add a simple flag such as --ext4-filename-sanitization if detecting the filename restrictions automatically is a problem. On this filesystem, only the / character needs to be replaced.

@chungy

chungy commented Dec 25, 2022

The documentation suggests using --no-windows-filenames, but yt-dlp still replaces characters like *, ?, | -- all of them are perfectly valid file name characters in Unix.

(Maybe yt-dlp could default to --windows-filenames if the filesystem is one of msdos/vfat/exfat, otherwise it should be all fine.)
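
A rough sketch of that idea on Linux, under several assumptions: the mount table is read from /proc/mounts, the three filesystem names are simply the ones listed in the comment above, and `fs_type_for` and `wants_windows_filenames` are hypothetical helpers, not yt-dlp functions. The parser takes the mount table as text so it can be tried on sample data.

```python
import posixpath

# Filesystem types named in the comment above; the list is not exhaustive.
WINDOWS_LIKE_FS = {"msdos", "vfat", "exfat"}


def fs_type_for(path: str, mounts_text: str) -> str:
    """Return the filesystem type of the deepest mount point containing `path`.

    `path` is expected to be absolute. Mount points containing spaces are
    escaped in /proc/mounts (as \\040) and are not handled by this sketch.
    """
    path = posixpath.normpath(path) + "/"
    best_mount, best_type = "", ""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path.startswith(prefix) and len(mount_point) > len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def wants_windows_filenames(path: str, mounts_text: str) -> bool:
    """True if `path` lives on a filesystem with Windows-style restrictions."""
    return fs_type_for(path, mounts_text) in WINDOWS_LIKE_FS


SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /media/usb vfat rw,nosuid 0 0
"""

if __name__ == "__main__":
    print(wants_windows_filenames("/home/user/videos", SAMPLE_MOUNTS))  # False
    print(wants_windows_filenames("/media/usb/videos", SAMPLE_MOUNTS))  # True
```

On a real system the second argument would be the contents of /proc/mounts. Other platforms expose mount information differently, so this approach does not carry over as written.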

@pukkandan
Member

Can someone make a summary of which popular FS/OS blocks which characters? We may be able to get away with using --windows-filenames to deal with this, but I need to be sure it won't cause issues

@Lesmiscore
Contributor

https://stackoverflow.com/a/31976060

@light-and-ray

light-and-ray commented Mar 12, 2023 via email

@mikkovedru

On Linux, only two characters are forbidden in filenames: / (forward slash) and the NUL byte (\0).

I suggest checking the OS environment and automatically applying --windows-filenames in Windows and --no-windows-filenames (perhaps should be divided into --linux-filenames and --macos-filenames) in Linux.
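
As an illustration of that suggestion, here is a minimal sketch that picks the forbidden character set from the running OS. It is not yt-dlp's implementation: `forbidden_chars` and `sanitize_for` are hypothetical names, the replacement character `_` is arbitrary, and every non-Windows system is treated like Linux, which is a simplification.

```python
import platform

# Characters Windows rejects in file names, plus control characters 0-31.
WINDOWS_FORBIDDEN = set('<>:"/\\|?*') | {chr(code) for code in range(32)}
# Linux rejects only the path separator and the NUL byte.
POSIX_FORBIDDEN = {"/", "\0"}


def forbidden_chars(system: str = "") -> set:
    """Return the forbidden characters for `system` (default: the current OS)."""
    system = system or platform.system()
    return WINDOWS_FORBIDDEN if system == "Windows" else POSIX_FORBIDDEN


def sanitize_for(name: str, system: str = "", replacement: str = "_") -> str:
    """Replace only the characters that `system` cannot store in a file name."""
    bad = forbidden_chars(system)
    return "".join(replacement if ch in bad else ch for ch in name)


if __name__ == "__main__":
    title = "Homemade Plastic Injection Machine | DIY"
    print(sanitize_for(title, "Linux"))    # Homemade Plastic Injection Machine | DIY
    print(sanitize_for(title, "Windows"))  # Homemade Plastic Injection Machine _ DIY
```

Choosing by OS alone ignores the destination filesystem, for example a FAT-formatted USB stick mounted on Linux.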

@mikkovedru

This comment was marked as duplicate.

@dirkf
Contributor

dirkf commented May 21, 2023

The character set support in the destination file system is a factor, as well as the OS.

Then one could consider whether some weird character might be censored even if the FS allows it, so as to make the filename acceptable in other common FSs (to which the file might be copied).

@mikkovedru

This comment was marked as duplicate.

@mikkovedru

mikkovedru commented Jun 13, 2023

In case someone wonders how to rename the already saved files, the Linux shell command is rename 's/｜/|/g' name-of-the-file.mkv (fixes 1 out of 5 cases named in my previous comment just above)

To rename all of the files in the current directory recursively, apply the following commands one after another:

find . -type f -name '*' -exec rename 's/｜/|/g' {} +
find . -type f -name '*' -exec rename 's/：/:/g' {} +
find . -type f -name '*' -exec rename 's/？/?/g' {} +
find . -type f -name '*' -exec rename 's/＊/*/g' {} +
find . -type f -name '*' -exec rename 's/＂/"/g' {} +
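
The same clean-up can be sketched in Python for systems without the Perl rename tool. This assumes the characters to undo are the fullwidth forms of | : ? * and ", which sit exactly 0xFEE0 above their ASCII counterparts in Unicode. `restore_ascii` and `rename_tree` are hypothetical helper names. Like the find commands, the sketch touches regular files only, and it defaults to a dry run.

```python
import os

# Map each fullwidth code point (ASCII + 0xFEE0) back to its ASCII character.
FULLWIDTH_TO_ASCII = {ord(ch) + 0xFEE0: ch for ch in '|:?*"'}


def restore_ascii(name: str) -> str:
    """Replace fullwidth | : ? * " in `name` with their ASCII forms."""
    return name.translate(FULLWIDTH_TO_ASCII)


def rename_tree(root: str, dry_run: bool = True) -> list:
    """Collect (old, new) paths under `root`; rename them unless dry_run is set."""
    changes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            fixed = restore_ascii(filename)
            if fixed == filename:
                continue
            old = os.path.join(dirpath, filename)
            new = os.path.join(dirpath, fixed)
            if os.path.exists(new):
                continue  # never overwrite an existing file
            changes.append((old, new))
            if not dry_run:
                os.rename(old, new)
    return changes


if __name__ == "__main__":
    print(restore_ascii("Homemade Plastic Injection Machine ｜ DIY"))
    for old, new in rename_tree("."):
        print(old, "->", new)
```

Calling rename_tree(".", dry_run=False) performs the renames. Reviewing the dry-run output first is advisable, since the change is not easily undone.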

@pukkandan
Member

pukkandan commented Jul 3, 2023

An implementation where explicitly passing --no-windows-filenames prevents this sanitization is welcome. I was planning to do it, but I have my hands full with other issues. #9591

@catthou

This comment was marked as spam.

@Krinkle

Krinkle commented Nov 18, 2023

As of writing, the --no-windows-filenames --no-restrict-filenames options still do not work. yt-dlp still replaces the double quotes in titles with the Unicode "Fullwidth Quotation Mark" (also known as char code 65282 or &#xff02;).

I did find a workaround that works today, by setting --compat-options filename-sanitization

This will replace double quotes with regular single quotes instead. Ideally they'd be left alone, but at least these are not problematic.

@NintendoManiac64

This comment was marked as off-topic.

@pukkandan
Member

pukkandan commented Mar 1, 2024

I just recently realized that a standard ASCII : is replaced by the substantially wider-spaced full-width colon ： rather than a colon that's similarly sized to the standard ASCII colon, such as the modifier letter colon ꞉ (U+A789):

For reference, a standard ASCII / is replaced with a Unicode slash that looks darned-near identical rather than the full-sized Unicode slash ／, so why use the full-sized colon?

@NintendoManiac64 Off topic... It's just what seemed reasonable to me at the time. It is possible there are better alternatives, but any benefit gained by changing this is not worth breaking compatibility now. You can always do custom replacements with --replace-in-metadata

@NintendoManiac64

This comment was marked as off-topic.

@pukkandan

This comment was marked as off-topic.

@NintendoManiac64

This comment was marked as off-topic.

Projects
Status: Filename
10 participants