[youtube] Fix extracting YouTube search URLs and feeds #25734

xarantolus · 2020-06-19T19:56:09Z

Please follow the guide below

You will be asked some questions, please read them carefully and answer honestly
Put an x into all the boxes [ ] relevant to your pull request (like that [x])
Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

At least skimmed through adding new extractor tutorial and youtube-dl coding conventions sections
Searched the bugtracker for similar pull requests
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Bug fix (closes Request like this dont' work youtube.com/results?search_query=... #25696, Youtube subscriptions not working #25695)
Improvement
New extractor
New feature

Description

URLs like https://www.youtube.com/results?search_query=test have been broken for a few days since the data is now in a JSON object instead of being embedded into the HTML of the search page.

Now the window["ytInitialData"] variable is extracted and then searched recursively for any object containing the key videoId. These objects are all over the place in the JSON document.
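The recursive search described above can be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: the function name `find_video_objects` and the sample structure (including its renderer key names) are assumptions made for the example.

```python
def find_video_objects(node):
    """Yield every dict in a nested JSON structure that has a 'videoId' key."""
    if isinstance(node, dict):
        if 'videoId' in node:
            yield node
        # Keep descending: video objects can be nested at any depth.
        for value in node.values():
            for found in find_video_objects(value):
                yield found
    elif isinstance(node, list):
        for item in node:
            for found in find_video_objects(item):
                yield found


# Hypothetical, heavily trimmed stand-in for the ytInitialData object.
sample = {
    'contents': {
        'sectionListRenderer': {
            'contents': [
                {'videoRenderer': {'videoId': 'aaaaaaaaaaa'}},
                {'shelfRenderer': {'items': [{'videoId': 'bbbbbbbbbbb'}]}},
            ],
        },
    },
}

video_ids = [obj['videoId'] for obj in find_video_objects(sample)]
```

Because the same video can appear in several places in the document, a real caller would probably want to drop duplicate ids before building the result list.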

The change also affects feed URLs like https://www.youtube.com/feed/subscriptions, https://www.youtube.com/feed/history, etc. For those the same logic is implemented.

Subscriptions additionally contain a nextContinuationData object that is then used (in combination with data from a ytcfg.set call from a script in the page) to make a request to fetch the next pages.
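As a rough sketch of the two lookups this paragraph mentions: pull the argument of the `ytcfg.set` call out of the page source, and locate `nextContinuationData` in the already parsed initial data. The regex, the helper names and the key names inside the sample `ytcfg.set` argument are assumptions for illustration. How the values are combined into the follow-up request is not spelled out in this thread, so the request itself is left out.

```python
import json
import re

# Hypothetical page fragment; real pages are far larger, and the keys
# inside ytcfg.set(...) are assumptions made for this example.
page = '''
<script>ytcfg.set({"INNERTUBE_API_KEY": "example-key", "ID_TOKEN": "example-token"});</script>
'''

# Hypothetical, trimmed stand-in for the parsed ytInitialData object.
initial_data = {
    'continuations': [
        {'nextContinuationData': {'continuation': 'abc', 'clickTrackingParams': 'xyz'}},
    ],
}


def extract_ytcfg(html):
    """Return the first JSON object passed to ytcfg.set(...), or None."""
    match = re.search(r'ytcfg\.set\s*\(\s*({.+?})\s*\)\s*;', html, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


def find_key(node, key):
    """Return the first value stored under `key` anywhere in nested JSON."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


cfg = extract_ytcfg(page)
continuation = find_key(initial_data, 'nextContinuationData')
```

A non-greedy pattern like the one above is exactly the kind of regex that is easy to break if the markup changes, which is why the absence of a match is reported as `None` instead of raising.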

TODO / Requests for input

Some regexes feel like they are easy to break, so if they should be more general please let me know how.
Maybe implement continuations for other feeds?

Please feel free to point out anything that doesn't seem right

scout719 · 2020-06-20T13:05:21Z

I also have a similar problem when extracting the videos from the subscriptions page. After inspecting the page I think the window["ytInitialData"] variable is also used to list the videos. Can you try to also fix the subscriptions extractor? Thank you!

scout719 · 2020-06-20T17:50:19Z

> I tried it, but the login using cookies doesn't seem to work right now (see #24508).

That's weird, I have it working here, with cookies. When I run with the subscriptions url and use --dump-page I get the correct page source

This moves feed extraction from using html content to json metadata. However, loading additional pages no longer works. The _extract_video_info function also returns a continuation object that contains some metadata that - together with an API key that is in the page source - might be used to request the next page.

xarantolus · 2020-06-21T07:39:07Z

When I tried extracting cookies using another browser, it worked and logged me in correctly, so there's no issue in that part (probably some issue when using firefox containers and cookie extraction extensions?).

I fixed the extraction part for these feeds (history, subscriptions, recommended), but it will only download the first page now.

There sometimes is an object called continuationItemRenderer that contains some stuff that is then POSTed somewhere to load additional videos, but I couldn't figure out how that one works. One would need to extract an api key that is somewhere in the window.ytplayer object in the page source. Edit: it takes these parameters from another place, not the continuationItemRenderer

If you want to try out the current version (and you installed via pip):

pip3 install -U git+https://github.com/xarantolus/youtube-dl@fixYTSearch

scout719 · 2020-06-21T09:39:53Z

> When I tried extracting cookies using another browser, it worked and logged me in correctly, so there's no issue in that part (probably some issue when using firefox containers and cookie extraction extensions?).

Could be, I extracted mine using Chrome

> I fixed the extraction part for these feeds (history, subscriptions, recommended), but it will only download the first page now.

Awesome! It works now, thank you so much! 💪

At least for me the first page is enough. Don't forget to update the description on the PR to mention this limitation

If an object looks like a video (it has a `videoId` key), assume that it is.

AthenaAzuraea · 2020-06-23T09:05:52Z

Downloading youtube subscriptions worked with this yesterday, but today something seems to have broken:

[debug] System config: []
[debug] User config: ['-f', 'bestvideo[height<=1080]+bestaudio/bestvideo+bestaudio/best', '--merge-output-format', 'mkv', '--sub-lang', 'enUS', '--write-sub', '-i', '-o', '%(uploader)s - %(title)s.mkv', '--no-playlist', '--mark-watched', '--reject-title', 'English dub|Portuguese Dub|Spanish Dub|(Live)|Spawncast|(Charity stream)| Podcast | (Dub) |WAN Show|Broken Silicon']
[debug] Custom config: []
[debug] Command-line args: ['-v', 'https://www.youtube.com/feed/subscriptions', '--cookies', 'Cookies.txt']
[debug] Encodings: locale cp1252, fs utf-8, out utf-8, pref cp1252
[debug] youtube-dl version 2020.06.16.1
[debug] Python version 3.8.3 (CPython) - Windows-10-10.0.18362-SP0
[debug] exe versions: ffmpeg 4.2.3, ffprobe 4.2.3
[debug] Proxy map: {}
[youtube:subscriptions] Youtube Subscriptions: Downloading webpage
[download] Downloading playlist: Youtube Subscriptions
ERROR: Unable to extract ytInitialData; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "C:\Users\Essen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\youtube_dl\YoutubeDL.py", line 808, in extract_info
    return self.process_ie_result(ie_result, download, extra_info)
  File "C:\Users\Essen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\youtube_dl\YoutubeDL.py", line 968, in process_ie_result
    entries = list(itertools.islice(
  File "C:\Users\Essen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\youtube_dl\extractor\youtube.py", line 3325, in _entries
    search_response = self._parse_json(self._search_regex(self._FEED_DATA, page, 'ytInitialData'), None)
  File "C:\Users\Essen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\youtube_dl\extractor\common.py", line 1005, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
youtube_dl.utils.RegexNotFoundError: Unable to extract ytInitialData; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

…change was reverted

The old code now works again, but it downloads without limit. This is why a limit of 1000 videos is added. It can be overridden with the `--max-downloads` option; that way, only as many ids are extracted as videos are downloaded.

xarantolus · 2020-06-23T09:35:13Z

Yeah, it seems like they changed the format to the old one again. With the old code the extraction will work, but it will extract a seemingly infinite number of video pages instead of downloading any videos.
I reverted my changes and added a default limit of 1000 videos that will be extracted before the download starts, but it can also be adjusted with the --max-downloads option. This limit is a temporary solution, it's not optimal.
It might be necessary to wait until youtube finalizes their changes as they seem to change stuff every few hours.
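The idea of capping extraction before the download starts can be illustrated with a lazy generator and `itertools.islice`. This is only a sketch under assumptions: `endless_feed_ids` and `limited` are hypothetical names, and the real extractor and option handling are not shown in this thread.

```python
import itertools

DEFAULT_LIMIT = 1000  # the default mentioned in the comment above


def endless_feed_ids():
    """Hypothetical stand-in for a feed extractor that never stops paging."""
    for n in itertools.count():
        yield 'video%d' % n


def limited(entries, max_downloads=None):
    """Stop after max_downloads entries, or after DEFAULT_LIMIT when unset."""
    limit = DEFAULT_LIMIT if max_downloads is None else max_downloads
    return itertools.islice(entries, limit)


first_three = list(limited(endless_feed_ids(), max_downloads=3))
default_count = sum(1 for _ in limited(endless_feed_ids()))
```

Since the entries are produced lazily, no page beyond the limit is requested, which is what keeps a feed that never ends from blocking the download step.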

marabu88 · 2020-07-02T22:07:53Z

This problem has continued for almost 2 weeks. Is there any alternative method to get a video ID from an arbitrary search query with various parameters? I am beginning to fear that the search function will not be fixed.

I heard that it is possible to load a page into a browser object and receive an already processed version of the page with all the necessary links. Can this option be used, or will it be too cumbersome?

milahu · 2020-07-05T11:30:31Z

> With the old code the extraction will work,

yes, commit 7a74fed works fine, please rollback

> but it will extract a seemingly infinite number of video pages instead of downloading any videos.

not here - "Downloading 19 videos"

i can append &page=2 etc. for pagination

marabu88 · 2020-07-06T07:40:20Z

> yes, commit 7a74fed works fine, please rollback

Yes, it definitely works for me. but why are there no changes in the main program? 20 days have passed :( I hope the work on this application is not abandoned.

xarantolus · 2020-07-10T09:49:43Z

So a quick update here, from my end the current code seems to work fine both for subscriptions and search queries. It looks like introducing the limit with feeds was unnecessary as they end at some point (for me it downloaded about 170 subscription pages) so I removed it again.

romanrm · 2020-07-28T12:53:59Z

Seems like this is still not fixed in the main release of YTDL? Today I had to update it because extracting of YouTube videos broke (in general). But now I lost the fix from that special branch with fixes for /results?search_query=... When will it get merged? Or how to combine both for now?

harroguk · 2020-09-03T21:34:14Z

youtube-dl -j --flat-playlist is working fine (don't judge me)

{"_type": "url", "url": "DhXkbUgoEBk", "ie_key": "Youtube", "title": "7 Games That Made Fun of the Haters"}
{"_type": "url", "url": "VxpOdUOjxaU", "ie_key": "Youtube", "title": "Generative Design in Minecraft Competition 2020 - Live Judging (Part 2)"}
{"_type": "url", "url": "mV16Hbh5hig", "ie_key": "Youtube", "title": "Pro Hairstylist Uses Kool-Aid To Color Hair (It didn't go well)"}
{"_type": "url", "url": "lMfMq5tyaOA", "ie_key": "Youtube", "title": "Why A Child Can (But Shouldn't) Eat Fire"}
{"_type": "url", "url": "DPCL9kj7_bU", "ie_key": "Youtube", "title": "How Oak Trees Manipulate Squirrels To Abandon Their Acorns"}
{"_type": "url", "url": "O9QuRZ-MBm0", "ie_key": "Youtube", "title": "Doom Eternal Gameplay: Let's Have a Chill Time Playing Doom Eternal - HELL ON EARTH"}

So I guess I sit tight and wait for the extraction to be fixed elsewhere, as I guess it can't just be me with this issue.
Thanks very much for looking into this. Much appreciated.
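For anyone consuming that `-j --flat-playlist` output in a script, each line is one JSON object, so it can be read line by line. A small sketch, assuming (as the sample lines suggest) that the `url` field holds a bare video id that can be appended to a watch URL:

```python
import json

# Two of the lines from the output above, copied verbatim.
output = '''\
{"_type": "url", "url": "DhXkbUgoEBk", "ie_key": "Youtube", "title": "7 Games That Made Fun of the Haters"}
{"_type": "url", "url": "lMfMq5tyaOA", "ie_key": "Youtube", "title": "Why A Child Can (But Shouldn't) Eat Fire"}
'''

# One JSON document per non-empty line.
entries = [json.loads(line) for line in output.splitlines() if line.strip()]

# Assumption: the "url" field is a video id, not a full URL.
watch_urls = ['https://www.youtube.com/watch?v=' + e['url'] for e in entries]
```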

If an object looks like a video (it has a `videoId` key), assume that it is.

In order to extract videos from further pages, we need to get various variables that are in an argument to the `ytcfg.set` call in a script on the feed page.

If the markup of the page changes in the future, it might be possible that _FEED_DATA still works, but the other regex does not. Since it is not necessary for the first page of videos, we make sure the program doesn't exit before extracting them. TL;DR: Extract the first video page even if there are problems
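The "first page even if there are problems" behaviour amounts to treating one lookup as required and the other as optional. A minimal sketch of that pattern follows; the two patterns and the helper names are hypothetical and are not the regexes used in this pull request.

```python
import json
import re

# Hypothetical patterns, for illustration only.
FEED_DATA = r'ytInitialData\s*=\s*({.+?})\s*;'
YTCFG = r'ytcfg\.set\s*\(\s*({.+?})\s*\)\s*;'


def search_json(pattern, html):
    """Return the JSON object captured by pattern, or raise ValueError."""
    match = re.search(pattern, html, re.DOTALL)
    if match is None:
        raise ValueError('pattern not found')
    return json.loads(match.group(1))


def extract_feed(html):
    # Required: without the initial data there is nothing to extract,
    # so a failure here is allowed to propagate.
    initial_data = search_json(FEED_DATA, html)
    # Optional: only needed for further pages, so a failure is tolerated.
    try:
        cfg = search_json(YTCFG, html)
    except ValueError:
        cfg = None
    return initial_data, cfg


# Page with initial data but without any ytcfg.set call.
data, cfg = extract_feed('var ytInitialData = {"videoId": "aaaaaaaaaaa"};')
```

With `cfg` set to `None`, a caller can still yield the videos of the first page and simply skip requesting continuations.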

Seems like this attribute is moved every few weeks, so we just extract both and use the one that is present.
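"Extract both and use the one that is present" could look like the sketch below. The commit message does not say which two locations are meant; the `simpleText` and `runs` shapes used here are assumptions chosen for illustration, as is the function name.

```python
def extract_title(renderer):
    """Return a title from either assumed location, or None if absent."""
    title = renderer.get('title') or {}
    # Assumed location 1: a plain string under 'simpleText'.
    if 'simpleText' in title:
        return title['simpleText']
    # Assumed location 2: a list of text fragments under 'runs'.
    runs = title.get('runs') or []
    if runs:
        return ''.join(run.get('text', '') for run in runs)
    return None


title_a = extract_title({'title': {'simpleText': 'First'}})
title_b = extract_title({'title': {'runs': [{'text': 'Sec'}, {'text': 'ond'}]}})
title_c = extract_title({'videoId': 'aaaaaaaaaaa'})
```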

This now supports declarations like `window["ytInitialData"] = ...` and `var ytInitialData = ...`
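A single pattern that accepts both declaration styles might be written as below. This is a sketch, not the regex from the commit; the pattern and the function name are assumptions, and the non-greedy match would stop early if a string inside the JSON contained `};`.

```python
import json
import re

# Accepts window["ytInitialData"] = {...}; and var ytInitialData = {...};
INITIAL_DATA_RE = re.compile(
    r'(?:window\s*\[\s*["\']ytInitialData["\']\s*\]|var\s+ytInitialData)'
    r'\s*=\s*({.+?})\s*;',
    re.DOTALL)


def parse_initial_data(html):
    """Return the parsed ytInitialData object, or None if not found."""
    match = INITIAL_DATA_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


old_style = parse_initial_data('window["ytInitialData"] = {"a": 1};')
new_style = parse_initial_data('var ytInitialData = {"a": 2};')
```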

mungr · 2020-09-07T00:44:54Z

Just to confirm:

ERROR: Unable to extract ytInitialData

was caused by a stale cookie in my case. The code in this PR holds.
OT: I would love to find a way to automate providing the cookie without any (fragile) browser extensions and manual copy-pasting. Has anyone here solved this part?

xarantolus · 2020-09-07T01:22:24Z

I think you can use the -u and -p command line flags to log in with your username and password, didn't test if it works though.

xarantolus · 2021-02-27T13:05:54Z

Closing this pull request because this has already been resolved for some time.

jakeogh · 2021-03-06T20:59:15Z

Not working on git-latest b8b622f

$ youtube-dl --version
2021.03.03

$ /usr/bin/youtube-dl --verbose "https://www.youtube.com/results?search_query=antenna+design"
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['https://www.youtube.com/results?search_query=antenna+design', '--verbose']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.03.03
[debug] Python version 3.8.7 (CPython) - Linux-5.10.12-gentoo-x86_64-x86_64-Intel-R-_Core-TM-_i7-4910MQ_CPU_@_2.90GHz-with-glibc2.2.5
[debug] exe versions: ffmpeg N-100888-gd43a27ab6f, ffprobe N-100888-gd43a27ab6f
[debug] Proxy map: {}
[youtube:tab] results: Downloading webpage
ERROR: Unable to recognize tab page; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/youtube_dl/YoutubeDL.py", line 805, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/lib/python3.8/site-packages/youtube_dl/YoutubeDL.py", line 826, in __extract_info
    ie_result = ie.extract(url)
  File "/usr/lib/python3.8/site-packages/youtube_dl/extractor/common.py", line 532, in extract
    ie_result = self._real_extract(url)
  File "/usr/lib/python3.8/site-packages/youtube_dl/extractor/youtube.py", line 2706, in _real_extract
    raise ExtractorError('Unable to recognize tab page')
youtube_dl.utils.ExtractorError: Unable to recognize tab page; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

xarantolus · 2021-03-06T21:53:33Z

Ah, didn't really check the search URL extractor when closing this. The YoutubeSearchURLIE implementation from this pull request works with the current youtube site, e.g. your search query returns 31 items with correct id and title. The YoutubeFeedsInfoExtractor also works fine, or at least my tests with cookies + subscription, history feed seemed to work fine for the first page, but it no longer finds nextContinuationData.

I'm not entirely sure what to do because my fork got taken down and I can't seem to get it back, so I can't make any edits to this pull request. And in its current state it's not mergeable at all. Should I create a new fork & pull request and add both, or only the search url extractor?

jakeogh · 2021-03-10T17:05:05Z

Thanks! IMHO new pulls for each.

radiolondra · 2021-03-27T09:51:22Z

Same problem, search URL still not working:
youtube-dl version: 2021-3-25

Query: https://www.youtube.com/results?search_query=marooned
[youtube:tab] results: Downloading webpage
ERROR: Unable to recognize tab page; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
 Exception in thread SearchThread-1:
 Traceback (most recent call last):
   File "C:\kivy-projects\KYT\kivy_venv\lib\site-packages\youtube_dl\YoutubeDL.py", line 806, in wrapper
     return func(self, *args, **kwargs)
   File "C:\kivy-projects\KYT\kivy_venv\lib\site-packages\youtube_dl\YoutubeDL.py", line 827, in __extract_info
     ie_result = ie.extract(url)
   File "C:\kivy-projects\KYT\kivy_venv\lib\site-packages\youtube_dl\extractor\common.py", line 534, in extract
     ie_result = self._real_extract(url)
   File "C:\kivy-projects\KYT\kivy_venv\lib\site-packages\youtube_dl\extractor\youtube.py", line 2706, in _real_extract
     raise ExtractorError('Unable to recognize tab page')
 youtube_dl.utils.ExtractorError: Unable to recognize tab page; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
 
 During handling of the above exception, another exception occurred:
 
 Traceback (most recent call last):
   File "C:\Users\Robi\AppData\Local\Programs\Python\Python37\lib\threading.py", line 917, in _bootstrap_inner
     self.run()
   File "C:/kivy-projects/KYT/main.py", line 156, in run
     ydl.download([self.query])
   File "C:\kivy-projects\KYT\kivy_venv\lib\site-packages\youtube_dl\YoutubeDL.py", line 2060, in download
     url, force_generic_extractor=self.params.get('force_generic_extractor', False))
   File "C:\kivy-projects\KYT\kivy_venv\lib\site-packages\youtube_dl\YoutubeDL.py", line 799, in extract_info
     return self.__extract_info(url, ie, download, extra_info, process)
   File "C:\kivy-projects\KYT\kivy_venv\lib\site-packages\youtube_dl\YoutubeDL.py", line 815, in wrapper
     self.report_error(compat_str(e), e.format_traceback())
   File "C:\kivy-projects\KYT\kivy_venv\lib\site-packages\youtube_dl\YoutubeDL.py", line 628, in report_error
     self.trouble(error_message, tb)
   File "C:\kivy-projects\KYT\kivy_venv\lib\site-packages\youtube_dl\YoutubeDL.py", line 598, in trouble
     raise DownloadError(message, exc_info)
 youtube_dl.utils.DownloadError: ERROR: Unable to recognize tab page; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

xarantolus · 2021-03-27T10:30:51Z

Yes, it still doesn't work.

I tried integrating the changes from this pull request into the current code, but I failed at doing so. It seems like no matter what I try, the Youtube Tab extractor is always preferred before the SearchURL extractor is called. And the tab extractor fails.

I did uncomment this line and tried a lot of other stuff, such as returning False from YoutubeTabIE's `suitable` method if the search URL extractor matches, like this:

@classmethod
def suitable(cls, url):
    if YoutubeIE.suitable(url) or YoutubeSearchURLIE.suitable(url):
        return False
    return super(YoutubeTabIE, cls).suitable(url)

But it didn't seem to work and I gave up since I currently don't have much time. If someone else finds out how to work around the tab extractor, it would be nice to know. The changes I made are here
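To illustrate what the `suitable` override is meant to achieve, here is a self-contained toy model of first-match dispatch: extractors are tried in order and the first one whose `suitable` returns True is used. All class names, URL patterns and the `pick_extractor` helper are hypothetical simplifications. The sketch shows the intended effect only and does not explain why the approach failed in the real code base.

```python
import re


class ExtractorBase(object):
    _VALID_URL = None

    @classmethod
    def suitable(cls, url):
        return re.match(cls._VALID_URL, url) is not None


class SearchURLExtractor(ExtractorBase):
    # Hypothetical pattern for search result pages.
    _VALID_URL = r'https?://(?:www\.)?youtube\.com/results\?.*search_query='


class TabExtractor(ExtractorBase):
    # Hypothetical catch-all pattern that also matches search URLs.
    _VALID_URL = r'https?://(?:www\.)?youtube\.com/.+'

    @classmethod
    def suitable(cls, url):
        # Step aside when the more specific extractor matches.
        if SearchURLExtractor.suitable(url):
            return False
        return super(TabExtractor, cls).suitable(url)


def pick_extractor(url, extractors):
    """Return the first extractor class that accepts the URL, or None."""
    for ie in extractors:
        if ie.suitable(url):
            return ie
    return None


order = [TabExtractor, SearchURLExtractor]  # catch-all listed first
search_pick = pick_extractor(
    'https://www.youtube.com/results?search_query=test', order)
feed_pick = pick_extractor('https://www.youtube.com/feed/history', order)
```

In this toy model the search URL reaches `SearchURLExtractor` even though the catch-all is listed first, because the catch-all declines it.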

xarantolus added 2 commits June 19, 2020 14:57

[youtube] Fix extraction of search urls (closes #25696)

7a74fed

[youtube] Move search URL extraction to appropriate extractor

6dad892

run flake8

b3fd4b1

xarantolus changed the title ~~[youtube] Fix extracting YouTube search URLs~~ [youtube] Fix extracting YouTube search URLs and feeds Jun 21, 2020

[youtube] Make search extraction less dependent on json schema.

6a3cc89

If an object looks like a video (it has a `videoId` key), assume that it is.

[youtube] Remote download limit

c37ca47

Remove unused variable

7fa0a67

xarantolus changed the title ~~[youtube] Fix extracting YouTube search URLs and feeds~~ [youtube] Fix extracting YouTube search URLs Jul 10, 2020

DuckBoss mentioned this pull request Jul 13, 2020

Youtube searches with a single result cause a crash DuckBoss/JJMumbleBot#230

Closed

trungnq97 mentioned this pull request Jul 24, 2020

0 videos downloaded EgorLakomkin/KTSpeechCrawler#3

Open

remitamine and others added 6 commits July 28, 2020 15:34

[bellmedia] add support for cp24.com clip URLs(closes #25764)

2bd9412

[youtube:playlists] Extend _VALID_URL (closes #25810)

255f31b

[youtube] Prevent excess HTTP 301 (#25786)

bb2c950

[wistia] Restrict embed regex (closes #25969)

9fa728f

[youtube] Improve description extraction (closes #25937) (#25980)

54ffcbb

[youtube] Fix sigfunc name extraction (closes #26134, closes #26135, c…

49004a6

…loses #26136, closes #26137)

xarantolus added 10 commits September 6, 2020 09:21

[youtube] Fix extraction of search urls (closes #25696)

b948643

[youtube] Move search URL extraction to appropriate extractor

19f671f

[youtube] Make search extraction less dependent on json schema.

e03b4f3

If an object looks like a video (it has a `videoId` key), assume that it is.

[youtube] Fix feed extraction

5c430b6

In order to extract videos from further pages, we need to get various variables that are in an argument to the `ytcfg.set` call in a script on the feed page.

Run formatter

f536080

Fix python2 compatibility and title extraction

299056a

[youtube] More general title extraction

f442082

Seems like this attribute is moved every few weeks, so we just extract both and use the one that is present.

Fix regex for other variable declaration type

bea9b00

This now supports declarations like `window["ytInitialData"] = ...` and `var ytInitialData = ...`

Merge branch 'fixYTSearch' of https://github.com/xarantolus/youtube-dl …

c4a1d0e

…into fixYTSearch

dstftw force-pushed the master branch 2 times, most recently from 5e26784 to da2069f Compare September 13, 2020 13:50

xarantolus added 2 commits September 22, 2020 20:52

Use better regex for all fixed extraction types

c0a1a89

[youtube/search_url]: improve title extraction

955c4cb

laBecasse mentioned this pull request Oct 23, 2020

util.parse error bakapear/ytt#5

Closed

54696d21 mentioned this pull request Nov 18, 2020

error when building with sbt vaclavsvejcar/youtube-history-downloader#7

Open

xarantolus closed this Feb 27, 2021
