Difficult to debug extractors #6701

Closed
alphapapa opened this issue Aug 28, 2015 · 3 comments

Comments

@alphapapa alphapapa commented Aug 28, 2015

I know a little bit of Python, so I decided to try to debug the bug I just filed, issue #6699. I cloned the repo and did the following in a terminal. (Getting this far was, by the way, very non-obvious: what CONTRIBUTING.md suggested, python -m youtube_dl, loaded my distro's out-of-date module, not the one in the current directory.)

$ python
>>> from youtube_dl.extractor.youtube import YoutubeUserIE
>>> e = YoutubeUserIE()
>>> e.extract("https://www.youtube.com/user/rhettandlink2")

The output I get is this:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "youtube_dl/extractor/common.py", line 287, in extract
    return self._real_extract(url)
  File "youtube_dl/extractor/youtube.py", line 1617, in _real_extract
    'Downloading channel page', fatal=False)
  File "youtube_dl/extractor/common.py", line 438, in _download_webpage
    res = self._download_webpage_handle(url_or_request, video_id, note, errnote, fatal, encoding=encoding)
  File "youtube_dl/extractor/common.py", line 345, in _download_webpage_handle
    urlh = self._request_webpage(url_or_request, video_id, note, errnote, fatal)
  File "youtube_dl/extractor/common.py", line 324, in _request_webpage
    self.to_screen('%s: %s' % (video_id, note))
  File "youtube_dl/extractor/common.py", line 495, in to_screen
    self._downloader.to_screen('[%s] %s' % (self.IE_NAME, msg))
AttributeError: 'NoneType' object has no attribute 'to_screen'

This doesn't seem to make sense, because the docstring says this:

>>> help(YoutubeUserIE)
...
 |  extract(self, url)
 |      Extracts URL information and returns it in list of dicts.

It would seem that the only thing this method should do is to return a list. But instead, it also generates output to the screen, and fails if not run from...the executable script, I guess?

So, since I had run youtube-dl with --dump-pages earlier, I loaded one of the pages (which was a JSON playlist segment) into Python and tried to extract directly from the file:

>>> with open('/tmp/yt/test') as f:
...     testpage = f.read()
...
>>> e.extract_videos_from_page(testpage)
[]

This makes no sense, because, having looked at YoutubeChannelIE.extract_videos_from_page, it looks like it should parse out the videos from testpage, which looks like this:

>>> testpage
'{"content_html": "      \\n\\n\\n\\u003ctr class=\\"pl-video yt-uix-tile \\" data-set-video-id=\\"\\" data-title=\\"Awkward Elevator Situation (Wheel of Mythicality - Ep. 30)\\" data-video-id=\\"2DpaTtjq1II\\"\\u003e\\u003ctd class=\\"pl-video-handle \\"\\u003e\\u003c\\/td\\u003e\\u003ctd class=...

Now I see that there are double-escaped quotes in there, which will mess with the regexp in YoutubeChannelIE.extract_videos_from_page. So I try to follow the chain of functions that download pages and parse them and decode them to find out how the corrected HTML gets to extract_videos_from_page()...but I am lost in a maze of functions calling functions calling functions, from one file to another, across directories...
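To illustrate what the escaping does to a regexp, here is a minimal sketch using only the standard library. It assumes the dumped page is a JSON object whose "content_html" value holds the HTML, as the sample above suggests. VIDEO_ID_RE and video_ids_from_json_page are hypothetical names for illustration, not the extractor's real pattern or helper, and the thread does not confirm that youtube-dl decodes the page this way.

```python
import json
import re

# Hypothetical stand-in for the extractor's regex; the real pattern may differ.
VIDEO_ID_RE = r'data-video-id="([^"]+)"'

# A shortened sample in the same shape as the dumped page: a JSON object whose
# "content_html" value is HTML written with JSON escapes (\", \u003c, \/).
raw_page = r'{"content_html": "\n\u003ctr class=\"pl-video\" data-video-id=\"2DpaTtjq1II\"\u003e\u003c\/tr\u003e"}'


def video_ids_from_json_page(page):
    """Decode the JSON wrapper first, then run the regex on the decoded HTML."""
    html = json.loads(page)["content_html"]
    return re.findall(VIDEO_ID_RE, html)


print(re.findall(VIDEO_ID_RE, raw_page))    # regex applied to the raw JSON text
print(video_ids_from_json_page(raw_page))   # regex applied to the decoded HTML
```

On the raw text the pattern finds nothing, because a backslash sits between the = and the quote; once json.loads has decoded the string, the same pattern finds the video ID.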

I will try to summarize:

  1. The instructions in CONTRIBUTING.md are not helpful for trying to debug extractors from a current, cloned repo.
  2. The aforementioned extract() method should do only what it says (return a list), not also output to the screen, which fails if not correctly initialized (for which there is no documentation).
  3. Maybe I'm just a noob, but after about the 5th link in a method-that-calls-another-method-in-another-file-in-another-directory chain, I get lost. All I want to do is import the module that contains the appropriate extractor, pass it a) a URL, or b) some raw HTML, and see what the result is so I can figure out why its regexp isn't working. This seems harder than it should be.

If these issues could be addressed, I would imagine that more people would be able to contribute by fixing the inevitable broken extractors that happen when sites change.

Thanks for any help and for making youtube-dl. I don't mean this to be rude or harsh criticism; I'm just trying to document how I tried to debug it and got stuck so that perhaps the process can be improved.

@jaimeMF jaimeMF commented Aug 29, 2015

python -m youtube_dl works perfectly for me; make sure you are running it from the correct directory (the root of your clone).
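One way to check which copy Python would pick up is to ask the import system where it finds the package. This is only a sketch: it needs Python 3.4 or later, and module_origin is a hypothetical helper name, not something youtube-dl provides.

```python
import importlib.util


def module_origin(name):
    """Return the file Python would load the top-level module `name` from, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # Run this from the root of the clone; the printed path should point
    # inside the clone rather than into the system site-packages directory.
    print(module_origin("youtube_dl"))
```

If the printed path points at the distro's package instead of the clone, the interpreter was probably started from a different directory.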

It's not very clear, but extractors only work if they have the .downloader correctly set (via set_downloader or at initialization). So if you really want to use the extractor directly, you have to do something like:

from youtube_dl import YoutubeDL
from youtube_dl.extractor import YoutubeUserIE

ydl = YoutubeDL()
ie = YoutubeUserIE(ydl)
info = ie.extract("https://www.youtube.com/user/rhettandlink2")
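Why the downloader matters can be modelled with two toy classes. This is a sketch only: ToyDownloader and ToyExtractor are hypothetical and merely mimic the to_screen call that appears in the traceback above; they are not youtube-dl's real classes.

```python
class ToyDownloader:
    """Collects messages instead of printing them."""

    def __init__(self):
        self.messages = []

    def to_screen(self, msg):
        self.messages.append(msg)


class ToyExtractor:
    IE_NAME = "toy"

    def __init__(self, downloader=None):
        # With no downloader passed in, this stays None, as in the failing session.
        self._downloader = downloader

    def to_screen(self, msg):
        # Reporting is delegated to the downloader, so None fails here.
        self._downloader.to_screen("[%s] %s" % (self.IE_NAME, msg))

    def extract(self, url):
        self.to_screen("Downloading %s" % url)
        return [{"url": url}]


downloader = ToyDownloader()
print(ToyExtractor(downloader).extract("http://example.com/"))
print(downloader.messages)
```

Calling ToyExtractor().extract(...) without a downloader raises the same kind of AttributeError on NoneType as in the traceback, because the extractor reports progress before it returns anything.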

But in general you shouldn't use the extractors directly:

from youtube_dl import YoutubeDL

ydl = YoutubeDL()
# this resolves redirects and extracts info from playlist entries
info = ydl.extract_info("https://www.youtube.com/user/rhettandlink2", download=False)

Instead of writing Python in an interpreter session, the method I use (and other developers probably do the same) is to call the program with the correct parameters (python -m youtube_dl URL OTHER_ARGS) and, if necessary, put some print(...) calls in the extractors to debug them.

As for the problem with all the function calls: most functions do one relatively simple thing (download a webpage, extract some value with a regex, ...), which simplifies the process, while others exist to reduce the complexity of some extractors (extract_videos_from_page, for instance, is called in two of the possible branches of the extraction). I don't think there's a better alternative.

@jaimeMF jaimeMF closed this Aug 29, 2015

@alphapapa alphapapa commented Aug 30, 2015

Thanks for your kind answer. That helps me understand it a lot better. I'll see what I can do.

If I may, I suggest that some of this info be added to the CONTRIBUTING.md file. :)

@alphapapa alphapapa commented Sep 27, 2015

Thanks to your help here, I was able to fix the bug!
