How to download a subtitle file only from youtube without the timecode? #7496

Closed
6 of 9 tasks
Xelbayria opened this issue Jul 3, 2023 · 33 comments
Labels
question Question

Comments

@Xelbayria

Xelbayria commented Jul 3, 2023

DO NOT REMOVE OR SKIP THE ISSUE TEMPLATE

  • I understand that I will be blocked if I intentionally remove or skip any mandatory* field

Checklist

Please make sure the question is worded well enough to be understood

You can pick any video that has subtitles, whether automatic captions or subtitles created by the creator/author.

I've tried various commands involving yt-dlp and ffmpeg, including the one below, without success. The timecodes are not removed at all.

yt-dlp --skip-download --write-auto-subs --write-subs --sub-lang en --convert-subs srt --sub-format txt --postprocessor-args "-ss 00:00:00 -to 99:59:59 -f srt - | sed '/^[0-9]*:[0-9]*:[0-9]*,[0-9]* --> [0-9]*:[0-9]*:[0-9]*,[0-9]*$/d' | tr -s '\n' ' ' > transcription.txt" https://youtu.be/jPrdCuYD-t0

The command above successfully downloaded the subtitle file from the URL and converted it to a plain text file, but it failed to remove the timecodes (timestamps). I want to turn this:

00:00:00 --> 00:01:00
Hello World. 

00:01:00 --> 00:02:00
My name is John Doe. 

into

Hello World. My name is John Doe 

Above is a perfect example of what I am trying to do. I am not sure whether the verbose output is necessary, because this is about finding the correct command to get the result I want. Let me know if I need to include it.
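
The transformation described above can be sketched in plain shell, independent of yt-dlp. This is only a minimal illustration on the sample text from this post; the function name `strip_timecodes` is made up, and it assumes every timing line contains the ` --> ` arrow.

```shell
#!/bin/sh
# Hypothetical helper (not part of yt-dlp): drop timing lines and join the rest.
strip_timecodes() {
    # 1) delete lines containing the cue arrow
    # 2) delete blank lines
    # 3) trim trailing whitespace
    # 4) join the remaining lines with single spaces
    sed -e '/ --> /d' -e '/^[[:space:]]*$/d' -e 's/[[:space:]]*$//' | paste -sd ' ' -
}

# Sample input copied from the example above.
sample='00:00:00 --> 00:01:00
Hello World. 

00:01:00 --> 00:02:00
My name is John Doe. '

result=$(printf '%s\n' "$sample" | strip_timecodes)
printf '%s\n' "$result"   # prints: Hello World. My name is John Doe.
```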

Provide verbose output that clearly demonstrates the problem

  • Run your yt-dlp command with -vU flag added (yt-dlp -vU <your command line>)
  • If using API, add 'verbose': True to YoutubeDL params instead
  • Copy the WHOLE output (starting with [debug] Command-line config) and insert it below

Complete Verbose Output

No response

@Xelbayria Xelbayria added the question Question label Jul 3, 2023
@pukkandan
Member

@Xelbayria
Author

Xelbayria commented Jul 4, 2023

https://github.com/yt-dlp/yt-dlp/blob/master/CONTRIBUTING.md#is-your-question-about-yt-dlp

Huh? I used yt-dlp and ffmpeg: I succeeded in downloading the subtitle file but not in removing the timecodes/timestamps from it. As you can see from the title, this is not a bug report but a "how to" question related to yt-dlp. If I am mistaken about something, please enlighten me.

@pukkandan
Member

You can't remove it using yt-dlp alone. As far as I know, you can't do it with ffmpeg either. When I first saw your command, I assumed you were piping something into sed and asking for help with that part of the command. But looking at it again and paying attention to the quotes, I see you seem to be passing sed commands as ffmpeg arguments? That obviously does not work.

Say the command to manually remove timestamps from the downloaded file is xxx path_to_subtitle yyy. Then, you can use --exec 'xxx %(requested_subtitles.:.filepath)#q yyy'.
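
To make the `xxx path_to_subtitle yyy` placeholder concrete, here is a rough sketch of what such a command could look like. The function name `strip_subs` and the choice to write a sibling `.txt` file are assumptions for illustration, not something yt-dlp provides.

```shell
#!/bin/sh
# strip_subs: hypothetical stand-in for "xxx path_to_subtitle yyy".
# For every subtitle path given, write a sibling .txt file with the
# cue timing lines (those containing " --> ") removed.
strip_subs() {
    for sub in "$@"; do
        # ${sub%.*} drops the last extension, e.g. video.en.srt -> video.en
        sed -e '/ --> /d' "$sub" > "${sub%.*}.txt"
    done
}
```

If this were saved as a script file, say `strip_subs.sh`, with a final line `strip_subs "$@"`, it could presumably be plugged into the template above as `--exec 'sh strip_subs.sh %(requested_subtitles.:.filepath)#q'`. That invocation is untested here.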

@Xelbayria
Author

Say the command to manually remove timestamps from the downloaded file is xxx path_to_subtitle yyy. Then, you can use --exec 'xxx %(requested_subtitles.:.filepath)#q yyy'.

Ah, it's trickier than I thought. The reason I couldn't find documentation on sed is that I didn't know where it comes from. Can I get information on --exec and where it comes from?

@Xelbayria
Author

Xelbayria commented Jul 5, 2023

OK. I was waiting for a response to my question in my previous post. I have no idea how to use --exec because I don't have any information on what it is or what it does. :(
Can anyone tell me what it is or what it does?

EDIT

Alright, I'll be closing this question. Here's what I learned so far:

First:

I thought sed was part of either yt-dlp or ffmpeg, but thanks to the first response I realized that it isn't. I investigated and discovered that it is a separate program called "sed", available from Here. Now that I know where sed comes from, I can look up its documentation and tweak the sed command to get what I wanted.

Second:

I ended up modifying the yt-dlp command to get the ttml format instead of vtt. This way I get a clean subtitle file and can use sed to remove the unnecessary parts of the file, as in the example in my first post.
Removed:

  • timecode/timestamp
  • numbering
  • html tags

This is the modified sed part:

sed -i -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e 's/<[^>]*>//g' -e '/^[[:digit:]]\{1,3\}$/d' -e ':a;N;$!ba;s/\n//g' subtitle.srt

where:

  • subtitle.srt is the file of the subtitle converted to srt
  • first -e: remove the timecode/timestamp
  • second -e: remove html tags
  • third -e: remove numbering
  • fourth -e: remove newline
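
To see what the first three expressions do, they can be run on a small made-up SRT sample read from standard input instead of editing a file in place. Note that the unescaped `.` in the timing pattern matches any character, so it covers both the `,` used in SRT and the `.` used in VTT. One caveat worth hedging: a caption line that consists only of one to three digits would be deleted along with the cue numbers.

```shell
#!/bin/sh
# Made-up SRT sample for illustration only.
sample='1
00:00:00,000 --> 00:00:02,000
<font color="#ffffff">Hello World.</font>

2
00:00:02,000 --> 00:00:04,000
My name is John Doe.'

# Same first three expressions as in the command above, applied to stdin.
cleaned=$(printf '%s\n' "$sample" | sed \
    -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' \
    -e 's/<[^>]*>//g' \
    -e '/^[[:digit:]]\{1,3\}$/d')

# Prints the two caption lines, still separated by the blank line between cues.
printf '%s\n' "$cleaned"
```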

Last

I still do not understand what --exec is and have no clue where it comes from. I gave up on it since I got the result I wanted.
That's it.

Have a good day!

@bashonly
Member

bashonly commented Jul 6, 2023

--exec tells yt-dlp to run a shell command during postprocessing (or pre-processing; the exact stage when it runs can be specified)
e.g.

yt-dlp --skip-download --write-subtitles --exec "cat %(requested_subtitles.:.filepath)#q | sed '/^[0-9]*:[0-9]*:[0-9]*,[0-9]* --> [0-9]*:[0-9]*:[0-9]*,[0-9]*$/d' | tr -s '\n' ' ' > transcription.txt" "https://youtu.be/jPrdCuYD-t0"

from the README

--exec [WHEN:]CMD               Execute a command, optionally prefixed with
                                when to execute it, separated by a ":".
                                Supported values of "WHEN" are the same as
                                that of --use-postprocessor (default:
                                after_move). Same syntax as the output
                                template can be used to pass any field as
                                arguments to the command. If no fields are
                                passed, %(filepath,_filename|)q is appended
                                to the end of the command. This option can
                                be used multiple times

                                The "when" argument determines when the
                                postprocessor is invoked. It can be one of
                                "pre_process" (after video extraction),
                                "after_filter" (after video passes filter),
                                "video" (after --format; before
                                --print/--output), "before_dl" (before each
                                video download), "post_process" (after each
                                video download), "after_move"
                                (after moving video file to it's final
                                locations), "after_video" (after downloading
                                and processing all formats of a video), or
                                "playlist" (at end of playlist).

@Xelbayria
Author

Xelbayria commented Jul 6, 2023

--exec tells yt-dlp to run a shell command during postprocessing (or pre-processing; the exact stage when it runs can be specified) e.g.

yt-dlp --skip-download --write-subtitles --exec "cat %(requested_subtitles.:.filepath)#q | sed '/^[0-9]*:[0-9]*:[0-9]*,[0-9]* --> [0-9]*:[0-9]*:[0-9]*,[0-9]*$/d' | tr -s '\n' ' ' > transcription.txt" "https://youtu.be/jPrdCuYD-t0"

from the README

Ah, thank you. That helped me fully understand what it does. So --exec is actually part of yt-dlp; I'll keep that in mind.

@bheeshmpita

that helped me fully understand what it does

Hi @Xelbayria, can you help me get a transcript file of a video? I am unable to comprehend this command.

@Xelbayria
Author

Xelbayria commented Aug 4, 2023

that helped me fully understand what it does

Hi @Xelbayria, can you help me get a transcript file of a video? I am unable to comprehend this command.

What do you want to see in the transcript file? You'll need to share the command you tried. Then I or someone else can get a good idea of what you are trying to do.

@bheeshmpita

bheeshmpita commented Aug 4, 2023

What do you want to see in the transcript file?

The transcription of the content in the video.
What I am trying to achieve is to download the transcript of a video (without timecodes), which can be helpful in cases where the video part of the content is not that necessary.
The commands I have tried were all from this thread, and I do not understand them. So the task is: given a video URL, what command can fetch its transcription/subtitles (without timecodes) into a text file?

@Xelbayria
Author

Xelbayria commented Aug 4, 2023

What do you want to see in the transcript file?

The transcription of the content in the video. What I am trying to achieve is to download the transcript of a video (without timecodes), which can be helpful in cases where the video part of the content is not that necessary. The commands I have tried were all from this thread, and I do not understand them. So the task is: given a video URL, what command can fetch its transcription/subtitles (without timecodes) into a text file?

OK. There are 2 commands, yt-dlp and sed, where:

  • yt-dlp is responsible for downloading the video or the subtitles (or both)
  • sed is responsible for removing text such as timecodes from the transcript file (you can use any regular expression to remove specific words or timecodes)

It's not possible to download a file without timecodes. I recommend that you look at yt-dlp's documentation, which explains what the options are for. There is also documentation for sed.

However, the fastest way to see the usage and options is to run sed --help (only if you have sed installed). The same goes for yt-dlp --help.

@bheeshmpita

can you please share the final command too?

@Xelbayria
Author

can you please share the final command too?

Alright, here's the command. You will notice the subtitle format is TTML; I convert it to SRT because it's the cleanest one to process with sed.

yt-dlp --skip-download --write-subs --write-auto-subs --sub-lang en --sub-format ttml --convert-subs srt --output "transcript.%(ext)s" <URL>

sed -i -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e '/^[[:digit:]]\{1,3\}$/d' -e 's/<[^>]*>//g' .\transcript.en.srt

The first expression removes the timecodes.
The second expression removes the cue numbers (the numbers you see above each sentence).
The third expression removes HTML tags.

There is also an optional -e ':a;N;$!ba;s/\n/ /g', which removes newlines. If you are using Notepad or Notepad++, don't use it because it turns the whole file into one line.
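
If a one-line version is wanted without touching the multi-line file, the joining step can be done separately. The sketch below swaps in awk for the sed loop, since the `:a;N;$!ba` form is not accepted by every sed implementation; the function name `join_lines` is made up for this example.

```shell
#!/bin/sh
# join_lines FILE: print all non-empty lines of FILE on a single line,
# separated by one space. FILE itself is left unchanged.
join_lines() {
    # NF is non-zero only for lines that contain something other than whitespace.
    awk 'NF { printf "%s%s", sep, $0; sep = " " } END { print "" }' "$1"
}
```

Usage could look like `join_lines transcript.en.srt > transcript.oneline.txt`, which keeps the original file readable in an editor.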

@bheeshmpita

Can the 2 commands be merged into a single one, so that the output of the 1st acts as input for the 2nd?

@Xelbayria
Author

Xelbayria commented Aug 4, 2023

Can the 2 commands be merged into a single one, so that the output of the 1st acts as input for the 2nd?

I haven't solved that yet.

@kharbandaraghu

Here we go
yt-dlp --skip-download --write-subs --write-auto-subs --sub-lang en --sub-format ttml --convert-subs srt --output "transcript.%(ext)s" <URL_GOES_HERE_WITHOUT_QUOTES> && sed -i '' -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e '/^[[:digit:]]\{1,3\}$/d' -e 's/<[^>]*>//g' ./transcript.en.srt && sed -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d' transcript.en.srt > output.txt && rm transcript.en.srt

@bheeshmpita

@kharbandaraghu can it be further modified so that the URL part is moved to the end of the command?

@kharbandaraghu

@kharbandaraghu can it be further modified so that the URL part is moved to the end of the command?

I'm not sure, but I would rather just use a Mac shortcut and pass the URL as an argument to it, so I can run it directly from the YouTube page 🤷‍♂️

@Xelbayria
Author

Xelbayria commented Aug 6, 2023

@kharbandaraghu can it be further modified so that the URL part is moved to the end of the command?

Uh, no. It's not possible, because there are 2 commands, yt-dlp and sed. It is possible to create a script file that takes your clipboard (the URL you copied) as a variable and puts it into that command.
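
The script idea can be sketched as two small shell functions, with the URL passed as the last (and only) argument. The yt-dlp options are the ones already used in this thread; the function names `clean_srt` and `fetch_transcript` are hypothetical, and the sketch assumes yt-dlp is installed (subtitle conversion may also need ffmpeg).

```shell
#!/bin/sh
# clean_srt FILE: print FILE without timing lines, cue numbers, tags, or blank lines.
clean_srt() {
    sed -e '/ --> /d' \
        -e '/^[[:digit:]]\{1,3\}$/d' \
        -e 's/<[^>]*>//g' \
        -e '/^[[:space:]]*$/d' "$1"
}

# fetch_transcript URL: download English subtitles as SRT, then write output.txt.
# Each step runs only if the previous one succeeded.
fetch_transcript() {
    yt-dlp --skip-download --write-subs --write-auto-subs --sub-lang en \
        --sub-format ttml --convert-subs srt --output "transcript.%(ext)s" "$1" &&
    clean_srt transcript.en.srt > output.txt &&
    rm transcript.en.srt
}
```

With these definitions loaded, a call would look like `fetch_transcript "https://youtu.be/jPrdCuYD-t0"`.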

EDIT:
I was wrong. It's possible. I forgot about --exec, which lets you put the URL at the end of the command. See below 👇

@bashonly
Member

bashonly commented Aug 6, 2023

yt-dlp --skip-download --write-subs --exec before_dl:"sed -i -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e '/^[[:digit:]]\{1,3\}$/d' -e 's/<[^>]*>//g' %(requested_subtitles.:.filepath)#q" "URL"

@nhershy

nhershy commented Oct 15, 2023

Here we go yt-dlp --skip-download --write-subs --write-auto-subs --sub-lang en --sub-format ttml --convert-subs srt --output "transcript.%(ext)s" <URL_GOES_HERE_WITHOUT_QUOTES> && sed -i '' -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e '/^[[:digit:]]\{1,3\}$/d' -e 's/<[^>]*>//g' ./transcript.en.srt && sed -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d' transcript.en.srt > output.txt && rm transcript.en.srt

This works as long as the URL references a single video. I'd like to use a YouTube channel URL, which contains a list of videos, but I think "output.txt" gets overwritten each time. How can this be modified to work with a channel? Thanks.

@bashonly
Member

use the --exec suggestion I posted directly above your comment

@Xelbayria
Author

Xelbayria commented Oct 15, 2023

@nhershy
Since output.txt is overwritten, I'm not sure it's possible to do that directly in the shell. One method is to create a script and call it from the shell with the URL of the channel. The script would increment the file name: output_1.txt for video #1, output_2.txt for video #2, and so on. As bashonly ☝️ said, using --exec is another method.
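
One way to avoid the overwriting, sketched under the assumption that the subtitle files for the whole channel have already been downloaded as `.srt` into one directory, is to derive each output name from the subtitle file instead of numbering them. The function name `clean_all` is made up for this example.

```shell
#!/bin/sh
# clean_all DIR: for every *.srt file in DIR, write a cleaned NAME.txt next to it.
clean_all() {
    for srt in "$1"/*.srt; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$srt" ] || continue
        sed -e '/ --> /d' \
            -e '/^[[:digit:]]\{1,3\}$/d' \
            -e 's/<[^>]*>//g' \
            -e '/^[[:space:]]*$/d' "$srt" > "${srt%.srt}.txt"
    done
}
```

Running `clean_all .` after the download would then leave one `.txt` per video, named after the corresponding subtitle file.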

@nhershy

nhershy commented Oct 15, 2023

use the --exec suggestion I posted directly above your comment

@bashonly I tried that, and I got this error:

[Exec] Executing command: sed -i -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e '/^[[:digit:]]\{1,3\}$/d' -e 's/<[^>]*>//g' 'Emotions are NEVER IRRATIONAL: feelings are not always justified [MSNBgLz3IJA].en.vtt'
sed: -e: No such file or directory
ERROR: Preprocessing: Command returned error code 1
ERROR: Preprocessing: Command returned error code 1

To be clear, I had to swap out "write-subs" with "write-auto-subs", since the videos only have the auto-generated subtitles.

This is the full command I used:

yt-dlp --skip-download --write-auto-subs --exec before_dl:"sed -i -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e '/^[[:digit:]]\{1,3\}$/d' -e 's/<[^>]*>//g' %(requested_subtitles.:.filepath)#q" https://www.youtube.com/@psychacks

@bashonly
Member

@nhershy show verbose output, add -v to your command and show full log

try changing %(requested_subtitles.:.filepath)#q to %(requested_subtitles.:.filepath)q

@nhershy

nhershy commented Oct 15, 2023

@bashonly I tried without the #, and it gave the same result.

Here is the -v output:

user@user % yt-dlp --skip-download --write-auto-subs --exec before_dl:"sed -i -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e '/^[[:digit:]]\{1,3\}$/d' -e 's/<[^>]*>//g' %(requested_subtitles.:.filepath)#q" "https://www.youtube.com/@psychacks" -v
[debug] Command-line config: ['--skip-download', '--write-auto-subs', '--exec', "before_dl:sed -i -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e '/^[[:digit:]]\\{1,3\\}$/d' -e 's/<[^>]*>//g' %(requested_subtitles.:.filepath)#q", 'https://www.youtube.com/@psychacks', '-v']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2023.10.07 [377e85a17] (pip)
[debug] Python 3.10.5 (CPython arm64 64bit) - macOS-14.0-arm64-arm-64bit (OpenSSL 1.1.1n  15 Mar 2022)
[debug] exe versions: none
[debug] Optional libraries: Cryptodome-3.19.0, brotli-1.1.0, certifi-2022.12.07, mutagen-1.47.0, sqlite3-3.37.2, websockets-11.0.3
[debug] Proxy map: {}
[debug] Loaded 1886 extractors
[youtube:tab] Extracting URL: https://www.youtube.com/@psychacks
[youtube:tab] @psychacks: Downloading webpage
[debug] [youtube:tab] Selected tab: 'videos' (videos), Requested tab: ''
[youtube:tab] Downloading all uploads of the channel. To download only the videos in a specific tab, pass the tab's URL
[youtube:tab] @psychacks/shorts: Downloading webpage
[debug] [youtube:tab] Selected tab: 'shorts' (shorts), Requested tab: 'shorts'
[youtube:tab] Downloading as multiple playlists, separated by tabs. To download as a single playlist instead, pass https://www.youtube.com/playlist?list=UUSduXBjCHkLoo_y9ss2xzXw
[download] Downloading playlist: PsycHacks
[youtube:tab] Playlist PsycHacks: Downloading 2 items of 2
[download] Downloading item 1 of 2
[download] Downloading playlist: PsycHacks - Videos
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 1: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 2: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 3: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 4: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 5: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 6: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 7: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 8: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 9: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 10: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 11: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 12: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 13: Downloading API JSON
[youtube:tab] Playlist PsycHacks - Videos: Downloading 403 items of 403
[download] Downloading item 1 of 403
[youtube] Extracting URL: https://www.youtube.com/watch?v=MSNBgLz3IJA
[youtube] MSNBgLz3IJA: Downloading webpage
[youtube] MSNBgLz3IJA: Downloading ios player API JSON
[youtube] MSNBgLz3IJA: Downloading android player API JSON
[youtube] MSNBgLz3IJA: Downloading m3u8 information
[info] MSNBgLz3IJA: Downloading subtitles: en
[debug] Sort order given by extractor: quality, res, fps, hdr:12, source, vcodec:vp9.2, channels, acodec, lang, proto
[debug] Formats sorted by: hasvid, ie_pref, quality, res, fps, hdr:12(7), source, vcodec:vp9.2(10), channels, acodec, lang, proto, size, br, asr, vext, aext, hasaud, id
[debug] Default format spec: best/bestvideo+bestaudio
[info] MSNBgLz3IJA: Downloading 1 format(s): 22
Deleting existing file Emotions are NEVER IRRATIONAL: feelings are not always justified [MSNBgLz3IJA].en.vtt
[info] Writing video subtitles to: Emotions are NEVER IRRATIONAL: feelings are not always justified [MSNBgLz3IJA].en.vtt
[debug] Invoking http downloader on "https://www.youtube.com/api/timedtext?v=MSNBgLz3IJA&ei=UEYsZZLOH8S5kAPnybDADg&caps=asr&opi=112496729&xoaf=5&hl=en&ip=0.0.0.0&ipbits=0&expire=1697425600&sparams=ip%2Cipbits%2Cexpire%2Cv%2Cei%2Ccaps%2Copi%2Cxoaf&signature=44D21F2D03D8CA201AC593B349D66C66027BAB27.DE5CF177566BAEE6C573FC17D92A8DD98B65A4D2&key=yt8&kind=asr&lang=en&fmt=vtt"
[download] Destination: Emotions are NEVER IRRATIONAL: feelings are not always justified [MSNBgLz3IJA].en.vtt
[download] 100% of   80.08KiB in 00:00:00 at 1.03MiB/s
[Exec] Executing command: sed -i -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e '/^[[:digit:]]\{1,3\}$/d' -e 's/<[^>]*>//g' 'Emotions are NEVER IRRATIONAL: feelings are not always justified [MSNBgLz3IJA].en.vtt'
sed: -e: No such file or directory
ERROR: Preprocessing: Command returned error code 1
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/yt_dlp/YoutubeDL.py", line 3633, in pre_process
    info = self.run_all_pps(key, info)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/yt_dlp/YoutubeDL.py", line 3626, in run_all_pps
    info = self.run_pp(pp, info)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/yt_dlp/YoutubeDL.py", line 3604, in run_pp
    files_to_delete, infodict = pp.run(infodict)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/yt_dlp/postprocessor/common.py", line 23, in run
    ret = func(self, info, *args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/yt_dlp/postprocessor/exec.py", line 31, in run
    raise PostProcessingError(f'Command returned error code {return_code}')
yt_dlp.utils.PostProcessingError: Command returned error code 1

ERROR: Preprocessing: Command returned error code 1

@Xelbayria
Author

Xelbayria commented Oct 15, 2023

@nhershy

To be clear, I had to swap out "write-subs" with "write-auto-subs", since the videos only have the auto-generated subtitles.

A good tip: if you use both --write-subs --write-auto-subs, it will check for subtitles (made by someone) before it checks for auto-generated subtitles (by youtube).

--write-auto-subs --write-subs check for generated subtitles, then check for subtitles created by someone. (pretty much opposite to above.)

With the command above, you can get a list of subtitles to select for displaying on your player.

@bashonly
Member

bashonly commented Oct 15, 2023

@nhershy ah you are using BSD sed instead of GNU sed, which parses arguments differently

try this one

yt-dlp --skip-download --write-auto-subs --exec before_dl:"sed -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e '/^[[:digit:]]\{1,3\}$/d' -e 's/<[^>]*>//g' -i '' %(requested_subtitles.:.filepath)#q" "URL"

or install gsed with homebrew and use gsed instead
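
For context, BSD sed expects a backup suffix as a separate argument after -i, so in `sed -i -e ...` it takes `-e` as that suffix and then treats a later `-e` as a file name, which matches the "sed: -e: No such file or directory" error above. A third option that sidesteps the difference is to not use -i at all. The sketch below, with the made-up function name `edit_in_place`, writes to a temporary file and then copies the result back; it assumes `mktemp` is available, as it is on macOS and most Linux systems.

```shell
#!/bin/sh
# edit_in_place FILE SED_ARGS...: run sed with SED_ARGS on FILE and replace
# FILE's contents with the result, without relying on any -i flag.
edit_in_place() {
    file=$1
    shift
    tmp=$(mktemp) || return 1
    if sed "$@" "$file" > "$tmp"; then
        # Copy back instead of mv so FILE keeps its original permissions.
        cat "$tmp" > "$file" && rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

For example, `edit_in_place subtitle.srt -e '/ --> /d'` should behave the same with GNU and BSD sed.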

@nhershy

nhershy commented Oct 15, 2023

@nhershy ah you are using BSD sed instead of GNU sed, which parses arguments differently

try this one

yt-dlp --skip-download --write-auto-subs --exec before_dl:"sed -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e '/^[[:digit:]]\{1,3\}$/d' -e 's/<[^>]*>//g' -i '' %(requested_subtitles.:.filepath)q" "URL"

or install gsed with homebrew and use gsed instead

I did try this new command and still get a similar error:

% yt-dlp --skip-download --write-auto-subs --exec before_dl:"sed -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e '/^[[:digit:]]\{1,3\}$/d' -e 's/<[^>]*>//g' -i '' %(requested_subtitles.:.filepath)q" "https://www.youtube.com/@psychacks" - v
[youtube:tab] Extracting URL: https://www.youtube.com/@psychacks
[youtube:tab] @psychacks: Downloading webpage
[youtube:tab] Downloading all uploads of the channel. To download only the videos in a specific tab, pass the tab's URL
[youtube:tab] @psychacks/shorts: Downloading webpage
[youtube:tab] Downloading as multiple playlists, separated by tabs. To download as a single playlist instead, pass https://www.youtube.com/playlist?list=UUSduXBjCHkLoo_y9ss2xzXw
[download] Downloading playlist: PsycHacks
[youtube:tab] Playlist PsycHacks: Downloading 2 items of 2
[download] Downloading item 1 of 2
[download] Downloading playlist: PsycHacks - Videos
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 1: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 2: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 3: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 4: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 5: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 6: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 7: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 8: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 9: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 10: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 11: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 12: Downloading API JSON
[youtube:tab] UCSduXBjCHkLoo_y9ss2xzXw page 13: Downloading API JSON
[youtube:tab] Playlist PsycHacks - Videos: Downloading 403 items of 403
[download] Downloading item 1 of 403
[youtube] Extracting URL: https://www.youtube.com/watch?v=MSNBgLz3IJA
[youtube] MSNBgLz3IJA: Downloading webpage
[youtube] MSNBgLz3IJA: Downloading ios player API JSON
[youtube] MSNBgLz3IJA: Downloading android player API JSON
[youtube] MSNBgLz3IJA: Downloading m3u8 information
[info] MSNBgLz3IJA: Downloading subtitles: en
[info] MSNBgLz3IJA: Downloading 1 format(s): 22
Deleting existing file Emotions are NEVER IRRATIONAL: feelings are not always justified [MSNBgLz3IJA].en.vtt
[info] Writing video subtitles to: Emotions are NEVER IRRATIONAL: feelings are not always justified [MSNBgLz3IJA].en.vtt
[download] Destination: Emotions are NEVER IRRATIONAL: feelings are not always justified [MSNBgLz3IJA].en.vtt
[download] 100% of   80.08KiB in 00:00:00 at 1.05MiB/s
[Exec] Executing command: sed -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e '/^[[:digit:]]\{1,3\}$/d' -e 's/<[^>]*>//g' -i '' '['"'"'Emotions are NEVER IRRATIONAL: feelings are not always justified [MSNBgLz3IJA].en.vtt'"'"']'
sed: ['Emotions are NEVER IRRATIONAL: feelings are not always justified [MSNBgLz3IJA].en.vtt']: No such file or directory
ERROR: Preprocessing: Command returned error code 1
ERROR: Preprocessing: Command returned error code 1
[download] Downloading item 2 of 403

I guess I will try to download gsed as suggested and see how that works. I am on an M1 MacBook Pro, if that makes a difference.

@bashonly
Member

bashonly commented Oct 15, 2023

@nhershy change %(requested_subtitles.:.filepath)q back to %(requested_subtitles.:.filepath)#q and it should work then
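
The yt-dlp README describes the `q` conversion as quoting a string for the terminal, with the `#` flag splitting a list into separate arguments. Since `requested_subtitles.:.filepath` yields a list, plain `q` appears to quote the whole list as one string. The small sketch below only decodes the shell quoting seen in the failing [Exec] log line, using a shortened, made-up file name, to show what sed actually received.

```shell
#!/bin/sh
# Argument quoted the same way as in the failing [Exec] line
# ('video.en.vtt' is a made-up, shortened file name).
received='['"'"'video.en.vtt'"'"']'

# The shell turns this into a single argument that still carries the list
# brackets and inner quotes, which is not a file name that exists.
printf '%s\n' "$received"   # prints: ['video.en.vtt']
```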

@nhershy

nhershy commented Oct 15, 2023

@bashonly

When I added the # back, it fixed the issue and no longer shows an error. However, it does not accomplish my goal, as there are still timestamps and duplicate sentences (from the auto-generated subs).

The command I used:

yt-dlp --skip-download --write-auto-subs --exec before_dl:"sed -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e '/^[[:digit:]]\{1,3\}$/d' -e 's/<[^>]*>//g' -i '' %(requested_subtitles.:.filepath)#q" 'https://www.youtube.com/@psychacks'

To reiterate, this command produces exactly what I want: no timestamps, and no duplicated sentences. But it only works on a single video:

yt-dlp --skip-download --write-subs --write-auto-subs  --sub-lang en --sub-format ttml --convert-subs srt --output "transcript.%(ext)s" 'https://www.youtube.com/watch?v=lDFBMv_CFl0' && sed -i '' -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e '/^[[:digit:]]\{1,3\}$/d' -e 's/<[^>]*>//g' ./transcript.en.srt && sed -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d' transcript.en.srt > output.txt && rm transcript.en.srt
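
A possible reason the first command leaves timestamps behind is that it processes the raw `.vtt` file. Assuming the auto-generated VTT timing lines carry trailing cue settings (for example `align:start position:0%`), the `$`-anchored timing expression would not match them, and the rolling captions would repeat each sentence. The sketch below is one hedged way to clean such a file directly; the function name `clean_vtt` and the assumed VTT layout are not confirmed anywhere in this thread.

```shell
#!/bin/sh
# clean_vtt FILE: print the caption text of a VTT file.
# Assumes header lines start with WEBVTT, Kind: or Language:, and that
# every timing line contains " --> ".
clean_vtt() {
    sed -e '/^WEBVTT/d' -e '/^Kind:/d' -e '/^Language:/d' \
        -e '/ --> /d' \
        -e 's/<[^>]*>//g' \
        -e '/^[[:space:]]*$/d' "$1" | uniq
    # uniq drops only consecutive repeats, so a sentence that is genuinely
    # said again later in the video is kept.
}
```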

@bashonly
Member

bashonly commented Oct 15, 2023

--exec is just giving you access to your shell during yt-dlp's pre-processing stage, and I've given multiple examples of how to use it

Your question is a sed/shell question, not a yt-dlp question. You'll need to adapt the working external sed command line into the --exec arg

@nhershy

nhershy commented Oct 15, 2023

@bashonly

Thank you for your help. I got it to work by combining your command with pieces from the other command I mentioned:

yt-dlp --skip-download --write-subs --write-auto-subs --sub-lang en --sub-format ttml --convert-subs srt --exec before_dl:"sed -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e '/^[[:digit:]]\{1,3\}$/d' -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d' -i '' %(requested_subtitles.:.filepath)#q" 'YOUTUBE_CHANNEL_URL'
