[youtube] Fix extracting YouTube search URLs and feeds #25734

Closed
wants to merge 36 commits into from

@xarantolus xarantolus commented Jun 19, 2020

Please follow the guide below

  • You will be asked some questions, please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your pull request (like that [x])
  • Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?


Description

URLs like https://www.youtube.com/results?search_query=test have been broken for a few days since the data is now in a JSON object instead of being embedded into the HTML of the search page.

Now the window["ytInitialData"] variable is extracted and then searched recursively for any object containing the key videoId. These objects are all over the place in the JSON document.
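The recursive search described above could look roughly like the following. This is a minimal sketch, not the code from the pull request: the regular expression and the function names are assumptions made for illustration, and the non-greedy match would stop early if a JSON string happened to contain `};`.

```python
import json
import re

# Assumed pattern for the assignment in the page source; the real markup may differ.
INITIAL_DATA_RE = re.compile(r'window\["ytInitialData"\]\s*=\s*(\{.+?\});', re.DOTALL)


def find_video_objects(node):
    """Yield every dict in a nested JSON structure that has a 'videoId' key."""
    if isinstance(node, dict):
        if 'videoId' in node:
            yield node
        for value in node.values():
            for found in find_video_objects(value):
                yield found
    elif isinstance(node, list):
        for item in node:
            for found in find_video_objects(item):
                yield found


def extract_video_ids(page):
    """Return the unique video ids found in the page, in order of appearance."""
    match = INITIAL_DATA_RE.search(page)
    if match is None:
        return []
    data = json.loads(match.group(1))
    seen = []
    for obj in find_video_objects(data):
        # The same id often appears in several nested objects, so deduplicate.
        if obj['videoId'] not in seen:
            seen.append(obj['videoId'])
    return seen
```

Deduplicating matters here because a single search result may mention its id in more than one nested object.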

The change also affects feed URLs like https://www.youtube.com/feed/subscriptions, https://www.youtube.com/feed/history, etc. For those the same logic is implemented.

Subscriptions additionally contain a nextContinuationData object that is then used (in combination with data from a ytcfg.set call from a script in the page) to make a request to fetch the next pages.
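A sketch of how the two inputs for a follow-up request might be collected is shown below. It deliberately stops before making any request, since the endpoint and payload are not described here. The `ytcfg.set` pattern, the helper names, and the assumption that the argument is plain JSON are all illustrative guesses.

```python
import json
import re

# Assumed shape of the script call; the argument is taken to be a JSON object.
YTCFG_RE = re.compile(r'ytcfg\.set\s*\(\s*(\{.+?\})\s*\)\s*;', re.DOTALL)


def find_first(node, key):
    """Return the first value stored under `key` anywhere in nested JSON, else None."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_first(child, key)
        if found is not None:
            return found
    return None


def continuation_inputs(initial_data, page):
    """Collect what a follow-up request would need, or None if something is missing."""
    continuation = find_first(initial_data, 'nextContinuationData')
    cfg_match = YTCFG_RE.search(page)
    if continuation is None or cfg_match is None:
        return None
    return {'continuation': continuation, 'ytcfg': json.loads(cfg_match.group(1))}
```

Returning None when either piece is missing lets the caller fall back to the first page only.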

TODO / Requests for input

  • Some regexes feel like they are easy to break, so if they should be more general please let me know how.
  • Maybe implement continuations for other feeds?

Please feel free to point out anything that doesn't seem right

@scout719

I also have a similar problem when extracting the videos from the subscriptions page. After inspecting the page I think the window["ytInitialData"] variable is also used to list the videos. Can you try to also fix the subscriptions extractor? Thank you!

@xarantolus

This comment has been minimized.

@scout719

scout719 commented Jun 20, 2020

I tried it, but the login using cookies doesn't seem to work right now (see #24508).

That's weird, I have it working here, with cookies. When I run with the subscriptions url and use --dump-page I get the correct page source

This moves feed extraction from using HTML content to JSON metadata. However, loading additional pages no longer works.

The _extract_video_info function also returns a continuation object that contains some metadata that - together with an API key that is in the page source - might be used to request the next page.
@xarantolus
Author

xarantolus commented Jun 21, 2020

When I tried extracting cookies using another browser, it worked and logged me in correctly, so there's no issue in that part (probably some issue when using firefox containers and cookie extraction extensions?).

I fixed the extraction part for these feeds (history, subscriptions, recommended), but it will only download the first page now.

There sometimes is an object called continuationItemRenderer that contains some stuff that is then POSTed somewhere to load additional videos, but I couldn't figure out how that one works. One would need to extract an api key that is somewhere in the window.ytplayer object in the page source. Edit: it takes these parameters from another place, not the continuationItemRenderer

If you want to try out the current version (and installed via pip):

pip3 install -U git+https://github.com/xarantolus/youtube-dl@fixYTSearch

@xarantolus xarantolus changed the title from "[youtube] Fix extracting YouTube search URLs" to "[youtube] Fix extracting YouTube search URLs and feeds" Jun 21, 2020
@scout719

When I tried extracting cookies using another browser, it worked and logged me in correctly, so there's no issue in that part (probably some issue when using firefox containers and cookie extraction extensions?).

Could be, I extracted mine using Chrome

I fixed the extraction part for these feeds (history, subscriptions, recommended), but it will only download the first page now.

Awesome! It works now, thank you so much! 💪

At least for me the first page is enough. Don't forget to update the description on the PR to mention this limitation

If an object looks like a video (it has a `videoId` key), assume that it is.
@AthenaAzuraea

Downloading youtube subscriptions worked with this yesterday, but today something seems to have broken:

[debug] System config: []
[debug] User config: ['-f', 'bestvideo[height<=1080]+bestaudio/bestvideo+bestaudio/best', '--merge-output-format', 'mkv', '--sub-lang', 'enUS', '--write-sub', '-i', '-o', '%(uploader)s - %(title)s.mkv', '--no-playlist', '--mark-watched', '--reject-title', 'English dub|Portuguese Dub|Spanish Dub|(Live)|Spawncast|(Charity stream)| Podcast | (Dub) |WAN Show|Broken Silicon']
[debug] Custom config: []
[debug] Command-line args: ['-v', 'https://www.youtube.com/feed/subscriptions', '--cookies', 'Cookies.txt']
[debug] Encodings: locale cp1252, fs utf-8, out utf-8, pref cp1252
[debug] youtube-dl version 2020.06.16.1
[debug] Python version 3.8.3 (CPython) - Windows-10-10.0.18362-SP0
[debug] exe versions: ffmpeg 4.2.3, ffprobe 4.2.3
[debug] Proxy map: {}
[youtube:subscriptions] Youtube Subscriptions: Downloading webpage
[download] Downloading playlist: Youtube Subscriptions
ERROR: Unable to extract ytInitialData; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
File "C:\Users\Essen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\youtube_dl\YoutubeDL.py", line 808, in extract_info
return self.process_ie_result(ie_result, download, extra_info)
File "C:\Users\Essen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\youtube_dl\YoutubeDL.py", line 968, in process_ie_result
entries = list(itertools.islice(
File "C:\Users\Essen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\youtube_dl\extractor\youtube.py", line 3325, in _entries
search_response = self._parse_json(self._search_regex(self._FEED_DATA, page, 'ytInitialData'), None)
File "C:\Users\Essen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\youtube_dl\extractor\common.py", line 1005, in _search_regex
raise RegexNotFoundError('Unable to extract %s' % _name)
youtube_dl.utils.RegexNotFoundError: Unable to extract ytInitialData; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

…change was reverted

The old code now works again, but it downloads without limit. This is why a limit of 1000 videos is added; it can be overridden with the `--max-downloads` option, so that only as many ids are extracted as videos will be downloaded.
@xarantolus
Author

Yeah, it seems like they changed the format back to the old one. With the old code the extraction will work, but it will extract a seemingly infinite number of video pages instead of downloading any videos.
I reverted my changes and added a default limit of 1000 videos that will be extracted before the download starts, but it can also be adjusted with the --max-downloads option. This limit is a temporary solution and not optimal.
It might be necessary to wait until YouTube finalizes their changes, as they seem to change stuff every few hours.
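Capping a possibly endless stream of ids can be sketched with `itertools.islice`. The function name and the way the option value is passed in are assumptions for illustration, not the code from this pull request.

```python
import itertools

DEFAULT_LIMIT = 1000  # the default mentioned in the comment above


def limited_ids(id_source, max_downloads=None):
    """Stop pulling ids from `id_source` once the limit is reached."""
    limit = DEFAULT_LIMIT if max_downloads is None else max_downloads
    # islice is lazy, so no more than `limit` ids are ever requested.
    return itertools.islice(id_source, limit)
```

Because the slice is lazy, pages beyond the limit are never fetched if `id_source` is a generator that loads pages on demand.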

@marabu88

marabu88 commented Jul 2, 2020

This problem has been going on for almost 2 weeks. Is there any alternative method to get a video ID from an arbitrary search query with various parameters? I am beginning to fear that the search function will not be fixed.

I heard that it is possible to load a page into a browser object and receive an already processed version of the page with all the necessary links. Can this option be used, or would it be too cumbersome?

@milahu

milahu commented Jul 5, 2020

With the old code the extraction will work,

yes, commit 7a74fed works fine, please rollback

but it will extract a seemingly infinite number of video pages instead of downloading any videos.

not here - "Downloading 19 videos"

I can append &page=2 etc. for pagination

@marabu88

marabu88 commented Jul 6, 2020

yes, commit 7a74fed works fine, please rollback

Yes, it definitely works for me. But why are there no changes in the main program? 20 days have passed :( I hope the work on this application is not abandoned.

@xarantolus
Author

So a quick update here, from my end the current code seems to work fine both for subscriptions and search queries. It looks like introducing the limit with feeds was unnecessary as they end at some point (for me it downloaded about 170 subscription pages) so I removed it again.

@xarantolus xarantolus changed the title from "[youtube] Fix extracting YouTube search URLs and feeds" to "[youtube] Fix extracting YouTube search URLs" Jul 10, 2020
@romanrm

romanrm commented Jul 28, 2020

Seems like this is still not fixed in the main release of YTDL? Today I had to update it because extraction of YouTube videos broke (in general). But now I have lost the fix from that special branch with fixes for /results?search_query=... When will it get merged? Or how can I combine both for now?

@harroguk

harroguk commented Sep 3, 2020

youtube-dl -j --flat-playlist is working fine (don't judge me)

{"_type": "url", "url": "DhXkbUgoEBk", "ie_key": "Youtube", "title": "7 Games That Made Fun of the Haters"}
{"_type": "url", "url": "VxpOdUOjxaU", "ie_key": "Youtube", "title": "Generative Design in Minecraft Competition 2020 - Live Judging (Part 2)"}
{"_type": "url", "url": "mV16Hbh5hig", "ie_key": "Youtube", "title": "Pro Hairstylist Uses Kool-Aid To Color Hair (It didn't go well)"}
{"_type": "url", "url": "lMfMq5tyaOA", "ie_key": "Youtube", "title": "Why A Child Can (But Shouldn't) Eat Fire"}
{"_type": "url", "url": "DPCL9kj7_bU", "ie_key": "Youtube", "title": "How Oak Trees Manipulate Squirrels To Abandon Their Acorns"}
{"_type": "url", "url": "O9QuRZ-MBm0", "ie_key": "Youtube", "title": "Doom Eternal Gameplay: Let's Have a Chill Time Playing Doom Eternal - HELL ON EARTH"}

So I guess I sit tight and wait for the extraction to be fixed elsewhere, as I guess it can't just be me with this issue.
Thanks very much for looking into this. Much appreciated.
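For anyone consuming that output in a script, here is a small sketch that turns the JSON lines into watch URLs. It assumes, as in the output above, that the "url" field holds a bare video id; the function name is made up for illustration.

```python
import json


def watch_urls(json_lines):
    """Turn `-j --flat-playlist` output lines into (title, URL) pairs."""
    results = []
    for line in json_lines.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # In the output above, "url" is a bare video id rather than a full URL.
        url = 'https://www.youtube.com/watch?v=' + entry['url']
        results.append((entry.get('title'), url))
    return results
```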

If an object looks like a video (it has a `videoId` key), assume that it is.
In order to extract videos from further pages, we need to get various variables that are in an argument to the `ytcfg.set` call in a script on the feed page.
If the markup of the page changes in the future, it might be possible that _FEED_DATA still works, but the other regex does not. Since it is not necessary for the first page of videos, we make sure the program doesn't exit before extracting them.

TL;DR: Extract the first video page even if there are problems
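The idea in this commit message can be sketched as a generator that always yields the first page and treats everything needed for later pages as optional. The callables `load_config` and `fetch_next_page` are hypothetical stand-ins, not functions from youtube-dl.

```python
def iter_feed_videos(first_page_ids, load_config, fetch_next_page):
    """Yield first-page ids, then further pages only if the config can be read."""
    for video_id in first_page_ids:
        yield video_id
    try:
        # Hypothetical step: parse the data needed to request more pages.
        config = load_config()
    except Exception:
        # Deliberately broad: the optional data is missing or unparsable,
        # so stop quietly after the first page.
        return
    page = fetch_next_page(config)
    while page:
        for video_id in page:
            yield video_id
        page = fetch_next_page(config)
```

Because the first page is yielded before the optional step runs, a failure there cannot prevent those videos from being returned.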
Seems like this attribute is moved every few weeks, so we just extract both and use the one that is present.
This now supports declarations like `window["ytInitialData"] = ...` and `var ytInitialData = ...`
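Trying both declaration styles in turn could look like the sketch below. The two patterns are assumptions written for illustration and are not the regexes from this pull request.

```python
import json
import re

# Two assumed declaration styles, tried in order.
_INITIAL_DATA_PATTERNS = (
    r'window\["ytInitialData"\]\s*=\s*(\{.+?\})\s*;',
    r'var\s+ytInitialData\s*=\s*(\{.+?\})\s*;',
)


def parse_initial_data(page):
    """Return the parsed ytInitialData object, or None if neither form is found."""
    for pattern in _INITIAL_DATA_PATTERNS:
        match = re.search(pattern, page, re.DOTALL)
        if match:
            try:
                return json.loads(match.group(1))
            except ValueError:
                # The match was not valid JSON; try the next pattern.
                continue
    return None
```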
@mungr

mungr commented Sep 7, 2020

Just to confirm:

ERROR: Unable to extract ytInitialData

was caused by a stale cookie in my case. The code in this PR holds.
OT: I would love to find a way to automate providing the cookie without any (fragile) browser extensions and manual copy-pasting. Has anyone here solved this part?

@xarantolus
Author

I think you can use the -u and -p command line flags to log in with your username and password, didn't test if it works though.

@dstftw dstftw force-pushed the master branch 2 times, most recently from 5e26784 to da2069f Compare September 13, 2020 13:50
@xarantolus
Author

Closing this pull request because this has already been resolved for some time.

@xarantolus xarantolus closed this Feb 27, 2021
@jakeogh
Contributor

jakeogh commented Mar 6, 2021

Not working on git-latest b8b622f

$ youtube-dl --version
2021.03.03

$ /usr/bin/youtube-dl --verbose "https://www.youtube.com/results?search_query=antenna+design"
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['https://www.youtube.com/results?search_query=antenna+design', '--verbose']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.03.03
[debug] Python version 3.8.7 (CPython) - Linux-5.10.12-gentoo-x86_64-x86_64-Intel-R-_Core-TM-_i7-4910MQ_CPU_@_2.90GHz-with-glibc2.2.5
[debug] exe versions: ffmpeg N-100888-gd43a27ab6f, ffprobe N-100888-gd43a27ab6f
[debug] Proxy map: {}
[youtube:tab] results: Downloading webpage
ERROR: Unable to recognize tab page; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/youtube_dl/YoutubeDL.py", line 805, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/lib/python3.8/site-packages/youtube_dl/YoutubeDL.py", line 826, in __extract_info
    ie_result = ie.extract(url)
  File "/usr/lib/python3.8/site-packages/youtube_dl/extractor/common.py", line 532, in extract
    ie_result = self._real_extract(url)
  File "/usr/lib/python3.8/site-packages/youtube_dl/extractor/youtube.py", line 2706, in _real_extract
    raise ExtractorError('Unable to recognize tab page')
youtube_dl.utils.ExtractorError: Unable to recognize tab page; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

@xarantolus
Author

Ah, didn't really check the search URL extractor when closing this. The YoutubeSearchURLIE implementation from this pull request works with the current youtube site, e.g. your search query returns 31 items with correct id and title. The YoutubeFeedsInfoExtractor also works fine, or at least my tests with cookies + subscription, history feed seemed to work fine for the first page, but it no longer finds nextContinuationData.

I'm not entirely sure what to do because my fork got taken down and I can't seem to get it back, so I can't make any edits to this pull request. And in its current state it's not mergeable at all. Should I create a new fork & pull request and add both, or only the search url extractor?

@jakeogh
Contributor

jakeogh commented Mar 10, 2021

Thanks! IMHO new pulls for each.

@radiolondra

Same problem, search URL still not working:
youtube-dl version: 2021-3-25

Query: https://www.youtube.com/results?search_query=marooned
[youtube:tab] results: Downloading webpage
ERROR: Unable to recognize tab page; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
 Exception in thread SearchThread-1:
 Traceback (most recent call last):
   File "C:\kivy-projects\KYT\kivy_venv\lib\site-packages\youtube_dl\YoutubeDL.py", line 806, in wrapper
     return func(self, *args, **kwargs)
   File "C:\kivy-projects\KYT\kivy_venv\lib\site-packages\youtube_dl\YoutubeDL.py", line 827, in __extract_info
     ie_result = ie.extract(url)
   File "C:\kivy-projects\KYT\kivy_venv\lib\site-packages\youtube_dl\extractor\common.py", line 534, in extract
     ie_result = self._real_extract(url)
   File "C:\kivy-projects\KYT\kivy_venv\lib\site-packages\youtube_dl\extractor\youtube.py", line 2706, in _real_extract
     raise ExtractorError('Unable to recognize tab page')
 youtube_dl.utils.ExtractorError: Unable to recognize tab page; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
 
 During handling of the above exception, another exception occurred:
 
 Traceback (most recent call last):
   File "C:\Users\Robi\AppData\Local\Programs\Python\Python37\lib\threading.py", line 917, in _bootstrap_inner
     self.run()
   File "C:/kivy-projects/KYT/main.py", line 156, in run
     ydl.download([self.query])
   File "C:\kivy-projects\KYT\kivy_venv\lib\site-packages\youtube_dl\YoutubeDL.py", line 2060, in download
     url, force_generic_extractor=self.params.get('force_generic_extractor', False))
   File "C:\kivy-projects\KYT\kivy_venv\lib\site-packages\youtube_dl\YoutubeDL.py", line 799, in extract_info
     return self.__extract_info(url, ie, download, extra_info, process)
   File "C:\kivy-projects\KYT\kivy_venv\lib\site-packages\youtube_dl\YoutubeDL.py", line 815, in wrapper
     self.report_error(compat_str(e), e.format_traceback())
   File "C:\kivy-projects\KYT\kivy_venv\lib\site-packages\youtube_dl\YoutubeDL.py", line 628, in report_error
     self.trouble(error_message, tb)
   File "C:\kivy-projects\KYT\kivy_venv\lib\site-packages\youtube_dl\YoutubeDL.py", line 598, in trouble
     raise DownloadError(message, exc_info)
 youtube_dl.utils.DownloadError: ERROR: Unable to recognize tab page; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

@xarantolus
Author

Yes, it still doesn't work.

I tried integrating the changes from this pull request into the current code, but I failed at doing so. It seems like no matter what I try, the Youtube Tab extractor is always preferred before the SearchURL extractor is called. And the tab extractor fails.

I did uncomment this line and tried a lot of other things, such as returning False from YoutubeTabIE's suitable method if the search URL extractor matches, like this:

@classmethod
def suitable(cls, url):
    if YoutubeIE.suitable(url) or YoutubeSearchURLIE.suitable(url):
        return False
    return super(YoutubeTabIE, cls).suitable(url)

But it didn't seem to work and I gave up since I currently don't have much time. If someone else finds out how to work around the tab extractor, it would be nice to know. The changes I made are here
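To illustrate why ordering and `suitable` interact, here is a toy model of first-match dispatch. It assumes the downloader walks its extractor list and uses the first class whose `suitable` returns True. All class names and URL patterns are made up, and the sketch does not diagnose why the attempt above failed.

```python
import re


class ToyExtractor(object):
    _VALID_URL = None

    @classmethod
    def suitable(cls, url):
        return re.match(cls._VALID_URL, url) is not None


class ToySearchURLIE(ToyExtractor):
    # Hypothetical pattern for search result pages.
    _VALID_URL = r'https?://(?:www\.)?youtube\.com/results\?.*search_query='


class ToyTabIE(ToyExtractor):
    # Hypothetical catch-all pattern that would also match search pages.
    _VALID_URL = r'https?://(?:www\.)?youtube\.com/.+'

    @classmethod
    def suitable(cls, url):
        # Step aside when the more specific extractor claims the URL.
        if ToySearchURLIE.suitable(url):
            return False
        return super(ToyTabIE, cls).suitable(url)


def pick_extractor(extractors, url):
    """Return the first extractor class that accepts the URL, else None."""
    for ie in extractors:
        if ie.suitable(url):
            return ie
    return None
```

In this model the override only helps if the search extractor is also present in the list that is searched; if it is missing, the search URL is handled by nothing at all.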


Successfully merging this pull request may close these issues.

Request like this dont' work youtube.com/results?search_query=...