
[nebula] Add new extractor #24805

Closed
wants to merge 13 commits into from

Conversation

@hheimbuerger commented Apr 16, 2020

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense

What is the purpose of your pull request?

  • New extractor

Description of your pull request and other information

This is a new extractor for the video platform Nebula (https://watchnebula.com/), created by the streamer community Standard.

Nebula uses the Zype video infrastructure, and this extractor uses url_transparent mode to hand off video extraction to the Zype extractor (which has proven to be very reliable here).
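
For context, a url_transparent result hands the resolved URL to another extractor, which then fills in the formats and any missing metadata. Simplified, the tail of _real_extract would be shaped roughly like this (variable values are illustrative):

# Simplified shape of the url_transparent handoff; ZypeIE resolves the
# embed URL into actual formats and fills in any missing metadata.
return {
    '_type': 'url_transparent',
    'ie_key': 'Zype',
    'url': embed_url,  # the player.zype.com embed URL for this video
    'id': video_id,
    'title': title,
}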

Development discussion has occurred on #21258. I didn't open the issue, but I hijacked it for my implementation.

All videos require a subscription to watch. There are no known freely available videos. The extractor contains three test cases, but they only pass when an auth cookie for a user account with a (paid) subscription is present.

I have aimed to provide comprehensive documentation of approach and behavior in class and method docstrings.

There are a few open technical issues -- all of which have been marked with FIXME -- that I'm hoping to receive feedback on during the review process.

@dstftw (Collaborator) commented Apr 16, 2020

Read coding conventions.

@hheimbuerger (Author)

@dstftw

  • Mandatory and optional metafields

Many more direct field lookups have been turned into optional lookups.

  • Provide fallbacks

A secondary generator had already been implemented for the video URL.

Everything else is now optional. I don't see any obvious opportunities for fallbacks on metadata, but I'll keep my eyes open when implementing future improvements.

  • REs: Don't capture groups you don't use

There aren't really any capture groups used.

  • REs: Make regular expressions relaxed and flexible

The RE to extract the script tag has been made more relaxed, by searching for a script tag without a MIME type and by allowing other attributes on the element to come before the ID.

  • Long lines policy

Two more long code lines have been split or factored down.
There are some more lines longer than 80 characters, but those are all comment lines, usually TODO and FIXME tags. I would prefer not to break them, because then they would no longer be detected as tasks by an IDE or other automation.

  • Inline values

I don't believe any such cases exist in the code.

  • Collapse fallbacks

I haven't seen an opportunity for those yet.

  • Trailing parentheses

Fixed one of these (even though I personally think it hurts structural readability). I wasn't quite sure whether this is only relevant for list parentheses or also for the curly brackets of dict literals.

  • Use convenience conversion and parsing functions

I have made some more use of try_get(). I was already using parse_iso8601() and _parse_json(). I'm not aware of other opportunities to use convenience functions, but I'm happy to be convinced otherwise.
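
For instance, the helpers compose like this (an illustrative snippet against the video metadata dict inside the extractor module, not the PR's exact code):

from ..compat import compat_str
from ..utils import parse_iso8601, try_get

# try_get() swallows missing keys/indices instead of raising;
# parse_iso8601() turns the ISO timestamp into a Unix timestamp.
timestamp = parse_iso8601(video.get('published_at'))
channel = try_get(video, lambda x: x['categories'][0]['value'][0], compat_str)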

@xSke commented May 2, 2020

It doesn't look like Nebula even has a concept of a creator/uploader vs. a channel; they're one and the same. Some creators are channels for themselves (e.g. Mia Mulder, Super Bunnyhop); sometimes a channel is just a series from a creator (e.g. Tom Scott presents: Money) that doesn't seem to be internally linked to the creator. Sometimes a creator has a channel of their own and a series/special-specific channel (e.g. Wendover and Wendover - The World's Most Useful Airport, Lindsay Ellis and Lindsay Ellis - Tom Hooper's Les Miserables); in these cases the site itself seems to know about the connection (on the Les Miserables video, the uploader link points to Lindsay Ellis' main channel rather than the one-video series). The code you added to find the first named category seems to solve this, though - although there may still be edge cases.

Digging into the pages a bit myself, it seems like the state has been removed from the video pages themselves somewhere between last time I looked (~April 9) and now. There's no JSON information on the video pages, and no XHR call to Nebula's own APIs for video info. I assume it's now parsing the URI for the video slug, and then fetching the video metadata (including the sparse category list) from Zype itself.

For "The World's Most Useful Airport" (one of the edge cases above), I see the following relevant requests:

// GET https://api.zype.com/videos?friendly_title=wendover-the-worlds-most-useful-airport&per_page=1&api_key=JlSv9XTImxelHi-eAHUVDy_NUM3uAtEogEpEdFoWHEOl9SKf5gl9pCHB1AYbY3QF
{
    "response": [
        {
            "_id": "5ddeb29de338f95ddff5e03e",
            "active": true,
            "categories": [
                {
                    "_id": "5ddeb31849594b5de68a2105",
                    "category_id": "5ced8c581c026e5ce6aaf047",
                    "title": "Animation",
                    "value": []
                },
                {
                    "_id": "5ddeb31849594b5de68a2106",
                    "category_id": "5c186aaa649f0f1342004248",
                    "title": "Explainers",
                    "value": [
                        "Wendover"
                    ]
                },
                // a couple category entries snipped
                {
                    "_id": "5ddeb31849594b5de68a210b",
                    "category_id": "5c624f363195970dec1057bb",
                    "title": "Originals",
                    "value": [
                        "Wendover — The World's Most Useful Airport"
                    ]
                },
                // there's a bunch more after this too
            ],
            "created_at": "2019-11-27T12:30:05.781-05:00",
            "description": "snip", // (this is the long-form description, and seems to be in Markdown format. does ytdl have a parser for that?)
            "duration": 2723,
            "friendly_title": "wendover-the-worlds-most-useful-airport",
            "keywords": [
                "st helena",
                // bunch more, snipped
            ],
            "marketplace_ids": {},
            "ott_description": "snip", // Seems to be equal to the "description" field above, may be differences? (what's "ott" short for?)
            "published_at": "2019-11-27T13:20:00.000-05:00",
            "site_id": "5c182d06649f0f134a001703",
            "status": "created",
            "subscription_ads_enabled": true, // (???)
            "title": "The World's Most Useful Airport",
            "updated_at": "2020-04-10T13:43:45.065-04:00",
            "thumbnails": [
                // snip
            ],
            "short_description": "After 500 years of isolation from the world, in 2017, St Helena got an airport. This changed everything, but was that enough to save the forgotten island?",
            "subscription_required": true
            // I've removed a bunch of seemingly-useless properties for post size reasons, a lot were null/false/empty
        }
    ]
}

I also see a call to https://api.zype.com/zobjects?zobject_type=channel&title=Wendover&api_key=JlSv9XTImxelHi-eAHUVDy_NUM3uAtEogEpEdFoWHEOl9SKf5gl9pCHB1AYbY3QF, which returns a bunch of channel information. Shouldn't be necessary since we probably don't need more than title for youtube-dl metadata, although might get us info about these links.

The API key in those requests is constant and hardcoded in source files, so I'm not leaking any personal tokens here. It's in main.[hash].chunk.js, a Ctrl-F away. Unsure of the policy of hardcoding those things in the youtube-dl source, though, might need to grab it dynamically from the script file somehow.

This also means we can't grab the video stream URL from the iframe itself, since that's no longer there. The source format is https://player.zype.com/embed/[video_id].html?autoplay=undefined&access_token=[token], where video_id is the _id field in the video response (see above) and token is the Zype video access token.

This is fetched via:

// GET https://api.watchnebula.com/api/v1/auth/user/
// Authorization: Token [nebula_token] (from the nebula_auth cookie's json data)
{
    "account_type": "curiositystream",
    "is_subscribed": true,
    "has_curiositystream_subscription": true,
    "zype_auth_info": {
        "access_token": "[snip]",
        "expires_at": 1588594215,
        "refresh_token": "[snip 2]",
        "zype_created_at": 1587989415
    }
    // more personal info fields (user ID, email) snipped
    // although info about subscription data and type may be useful for eg. error messages?
}
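
Putting those observations together, the new extraction flow might look roughly like this (a hedged sketch; the helper name and error handling are illustrative, while the endpoints and fields are the ones shown in the requests above):

# Rough sketch of the flow described above; not actual PR code.
def _fetch_zype_video(self, slug, nebula_token, api_key):
    # 1. Exchange the Nebula token for a Zype access token.
    user = self._download_json(
        'https://api.watchnebula.com/api/v1/auth/user/', slug,
        headers={'Authorization': 'Token %s' % nebula_token},
        note='Fetching Zype access token')
    access_token = user['zype_auth_info']['access_token']

    # 2. Look up the video metadata by its URL slug (friendly_title).
    response = self._download_json(
        'https://api.zype.com/videos?friendly_title=%s&per_page=1&api_key=%s'
        % (slug, api_key), slug, note='Fetching video metadata')
    video = response['response'][0]

    # 3. Build the embed URL that the Zype extractor can take over from.
    embed_url = 'https://player.zype.com/embed/%s.html?access_token=%s' % (
        video['_id'], access_token)
    return video, embed_url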

Hope some of this helps, so you don't need to do too much digging on your own :) And hey, maybe they just accidentally turned off server-side rendering and the info will return to the main video page soon. But for now, this should be foolproof(-er)?

@hheimbuerger (Author) commented May 3, 2020

@xSke Wow, haha, this is so frustrating! I cannot believe they changed their entire frontend just days after I finished the extractor. (I last updated the PR on Apr 18, so at that point they were definitely still using the old one.)
But I guess that's the life of a (wanna-be) video scraper, right? 😉

Thank you for your analysis! I did quickly walk through the new page and could confirm all your findings. (Including having an identical — apparently not-personalized — Zype API key and it only being found in the JS chunks.)

I'll see when I'll get around to updating (rewriting!?) my PR. Until then, I'll switch it to draft mode.

@hheimbuerger marked this pull request as draft on May 3, 2020
@xSke commented May 9, 2020

I have some spare time - let me know if you'd like me to take over this? Don't want to step on anyone's toes :)

@hheimbuerger (Author)

@xSke I appreciate your concern and request. And you're right, I feel somewhat committed to this extractor now and would actually like to take a shot at it myself.

That said, I'd gladly take contributions (failing test cases, cleanups, especially anything that will make the PR more likely to be accepted) and code reviews, once I have pushed my initial prototype. Deal?

@hheimbuerger (Author) commented May 10, 2020

@xSke In the meantime, do you know anything about the best practices for youtube-dl around sites with no public videos? See the FIXMEs left in the existing PR. How do I correctly inject the authentication cookie into my test cases? And how do I deal with the test cases failing when no auth cookie is set — what should I set _WORKING to?

Can someone point me to another extractor, which deals (well) with videos behind authentication, so I can figure out the best practices? How do I actually accept authentication info from the user in 'production runs' (outside test cases)? Should I support --video-password? --password? --cookies? All of the above?

@hheimbuerger (Author)

@xSke Here's the new implementation. Code review and feedback are very welcome!

The only bit that's missing is the extraction of the Zype API key from the JS chunk. Once that is in, I'll un-draft this PR.

@hheimbuerger (Author) commented May 15, 2020

Here's the Zype API key extraction. For now, I have chosen not to include the known — and apparently static and global — API key as a fallback (or even a default) in the code. Mostly because I'm not sure what this project's policies are on including static keys like this.

That said, there is e.g. the TruNewsIE extractor on master which does contain a hardcoded Zype API key. Does that make it acceptable practice?

@hheimbuerger marked this pull request as ready for review on May 15, 2020
@hheimbuerger (Author)

Let me know if you want me to squash my commits.

@hheimbuerger (Author)

Merged in upstream master and verified as working on 2020-06-02.

@Zegnat commented Jul 18, 2020

This seems to work 2020-07-18. But I wonder if the error text could be tweaked to be a little more explanatory.

ERROR: Nebula requires an account with an active subscription. You can supply a corresponding token by either a) finding your nebula-auth cookie and then specifying it via --video-password, or b) passing in a cookie jar containing a nebula-auth cookie via --cookies, or c) setting the environment variable NEBULA_TOKEN.

I needed to check the source code of the extractor to figure out that a) (and c)) did not want the actual value of the nebula-auth cookie, but rather the JSON-encoded apiToken value from inside it. Not sure if it was just my thick skull or what, but I feel like the message could do a better job of pointing out what is considered the Nebula token and what is not.

Although I have no immediate suggestions as to what a better copy would be 😅

Just my 2¢!

@hheimbuerger (Author)

I needed to check the source code of the extractor to figure out that a) (and c)) did not want the actual value of the nebula-auth cookie, but rather the JSON-encoded apiToken value from inside it. Not sure if it was just my thick skull or what, but I feel like the message could do a better job of pointing out what is considered the Nebula token and what is not.

No, you're absolutely right. I went back and forth a few times on what precisely should be passed in (passing in the cookie is an easier copy-paste job, but passing in the token is cleaner) and I think the error message lagged behind in one of those changes.

I'll look into improving it!

That said, is this generally the right approach to support subscription-only platforms on youtube-dl? Should I offer fewer methods of authentication? Or more, is there an essential one missing? I don't think I've seen other extractors support the environment variable, but I couldn't figure out another way to run the tests.

@Lamieur commented Aug 2, 2020

I found your extractor and tried it out and was unpleasantly surprised with that error message and copying tokens from browser cookies, how user-unfriendly! :)

Can someone point me to another extractor, which deals (well) with videos behind authentication, so I can figure out the best practices?

Most of them, I think :)

grep _NETRC_MACHINE *.py

How do I actually accept authentication info from the user in 'production runs' (outside test cases)? Should I support --video-password? --password? --cookies? All of the above?

The way you do this is just what all those extractors do: username, password = self._get_login_info()

This magic function automatically pulls user+password from .netrc or -u/-p on the command line. Super easy, convenient for you and convenient for me as a user (use -u/-p for one time download, or just put email+password in .netrc and forget about it forever while I'm happily opening video links with mpv).
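
In extractor terms, the pattern is just a few lines (a minimal sketch; the login request itself is site-specific, so it's reduced to a hypothetical helper here):

from .common import InfoExtractor

class NebulaIE(InfoExtractor):
    _NETRC_MACHINE = 'nebula'  # the label used after `machine` in .netrc

    def _real_initialize(self):
        # Returns (None, None) unless -u/-p or --netrc supplied credentials.
        username, password = self._get_login_info()
        if username is None or password is None:
            self.raise_login_required('Nebula requires a subscription')
        self._token = self._login(username, password)  # hypothetical site-specific POST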

Of course, then you have to implement logging in with email+password in your extractor, but I went through this in my Floatplane extractor - even logging in each time you watch a video (wasting a second) is better than copying cookies every week when the token changes. So in the end, I figured I'm too lazy NOT to add that in ;)

@hheimbuerger (Author) commented Aug 5, 2020

The way you do this is just what all those extractors do: username, password = self._get_login_info()

This magic function automatically pulls user+password from .netrc or -u/-p on the command line. Super easy, convenient for you and convenient for me as a user (use -u/-p for one time download, or just put email+password in .netrc and forget about it forever while I'm happily opening video links with mpv).

@Lamieur Thanks, that's exactly the kind of best practice advice I was looking for! It's not mentioned in the Developer Instructions, nor in the extractor base class docstring; we should probably reference it in both.

I'll implement that, it will probably take me a couple of days to get around to it, though.

@dstftw force-pushed the master branch 2 times, most recently from 5e26784 to da2069f (September 13, 2020)
@hheimbuerger (Author)

@Lamieur Thanks again for your tip! I've implemented the user/pass login as a proof of concept. Simple and works well. I just need to do some cleanup and error handling.

What's the correct way to supply these credentials during unit tests? test_download.py doesn't seem to accept the --username/--password arguments.

@hheimbuerger (Author)

@Lamieur

I found your extractor and tried it out and was unpleasantly surprised with that error message and copying tokens from browser cookies, how user-unfriendly! :)

The credentials-based authentication has now been fully implemented! I would appreciate if you could give the extractor another try and let me know if it now behaves the way you would have expected it to. Cheers!

@Lamieur commented Oct 23, 2020

One small issue :)

It worked for one video, but after that, my mpv started saying:
[ytdl_hook] ERROR: Nebula requires an account (...)

youtube-dl -v said:
WARNING: Authentication failed or rejected: HTTP Error 403: Forbidden

I found out that the problem is my use of "--cookies /home/lam/.config/youtube-dl/cookies.txt" in my .config/youtube-dl/config

In that cookie jar, I now have two related cookies:
api.watchnebula.com	FALSE	/	FALSE	1634917609	csrftoken	<string>
api.watchnebula.com	FALSE	/	FALSE	1604677609	sessionid	<string>
but there's nothing for watchnebula.com itself, like the "nebula-auth" token that your "option #2" cookie method checks for (I assume you use --cookies on a file exported from a browser). Therefore, my initial reflex of swapping test order in _retrieve_nebula_auth() won't help.

As a bad temporary work-around, I've added
self._set_cookie('api.watchnebula.com', 'sessionid', '')
to the top of _perform_login(). This way, it's able to log in fresh, every time. (I don't see a "remove_cookie" function and this works ;))

I mean, ideally it could probably remember being logged in, and re-run the logon process only when necessary, but a fresh login every run is at least functional :)

So many other extractors do this every time and don't bother with reusing old session cookies, so even if this feels super bad, it's not so out of the ordinary in this context, right? :)

Oh, and can I be super nitpicky? :) Using "machine watchnebula" in the .netrc is maybe not so trivial - the service's name is Nebula, extractor is nebula.py, I would just call it "nebula" :)

Otherwise great improvement over command-line trickery with environment variables! I'll go watch something now!

@hheimbuerger (Author) commented Nov 26, 2020

I reforked the repo and rebased my changes: https://github.com/hheimbuerger/youtube-dl-post-dmca-refork/tree/add-nebula-support
I wasn't yet able to figure out how to switch this PR over to the new fork. I contacted GitHub support about that and am awaiting their reply.

@hheimbuerger (Author)

It worked for one video, but after that, my mpv started saying:
[ytdl_hook] ERROR: Nebula requires an account (...)

youtube-dl -v said:
WARNING: Authentication failed or rejected: HTTP Error 403: Forbidden

I found out that the problem is my use of "--cookies /home/lam/.config/youtube-dl/cookies.txt" in my .config/youtube-dl/config

In that cookie jar, I now have two related cookies:
api.watchnebula.com	FALSE	/	FALSE	1634917609	csrftoken	<string>
api.watchnebula.com	FALSE	/	FALSE	1604677609	sessionid	<string>
but there's nothing for watchnebula.com itself, like the "nebula-auth" token that your "option #2" cookie method checks for (I assume you use --cookies on a file exported from a browser). Therefore, my initial reflex of swapping test order in _retrieve_nebula_auth() won't help.

I could not reproduce this based on your description, and I'm a bit confused. Could you maybe give me reproduction steps to follow?

My suspicion is that you dropped --netrc when you ran the second download (but you didn't say so).
Or maybe you're suggesting that the first download process could have used the credentials from .netrc to authenticate, and then stored the resulting cookie in the cookie jar, removing the need to pass --netrc on the second run.
Is that suspicion correct?

Oh, and can I be super nitpicky? :) Using "machine watchnebula" in the .netrc is maybe not so trivial - the service's name is Nebula, extractor is nebula.py, I would just call it "nebula" :)

I'm new to .netrc; I hadn't heard of it until you pointed me to it. It was my understanding that I don't get to choose the 'label' that is applied to the site, but rather that it's defined by the site's domain, which in this case is watchnebula.com, not nebula.com. In fact, reading up on this again makes me think I should change this to watchnebula.com (including the TLD).
Can you point me to further documentation about best practices around .netrc?

@Lamieur commented Nov 26, 2020

Could you maybe give me reproduction steps to follow?

  1. Have both:
    --netrc
    --cookies
    in your .config/youtube-dl/config
  2. Have:
    machine watchnebula login <...> password <...>
    in your .netrc
  3. youtube-dl https://watchnebula.com/videos/<...> - this works
  4. youtube-dl https://watchnebula.com/videos/<...> - this and subsequent attempts will fail

Or maybe you're suggesting that the first download process could have used the credentials from .netrc to authenticate, and then stored the resulting cookie in the cookie jar, removing the need to pass --netrc on the second run.

That's kind of what is already happening. The first login works, and a cookie is stored by the site. Then the cookie is loaded on youtube-dl's startup with all the other cookies and sent with requests to the site. The only issue is that the extractor tries to log in again, and Nebula has the strangest reaction - it spews a 403 :)

But I'm still using that work-around (cleaning up the sessionid cookie) and it still works 100%.

It was my understanding that I don't get to choose the 'label' that is applied to the site

Maybe the original intention of its creators was, 40 years ago ;) Although they didn't have DNS and ".com" in Berknet, so... ;) It was the actual name of the machine to which you would ftp/netcp/rsh/whatever.

For youtube-dl's use, the "machine" is just a label. It doesn't matter what it is, but to make things simple, for YouTube it uses "machine youtube", and for CuriosityStream it uses "machine curiositystream". Note: those are not only not actual machine names, they're not even domains :) For me it would be simplest for Nebula to use "machine nebula", which is also what the extractor is called, and which I think is more intuitive than the current "watchnebula". It doesn't (technically) matter; I found it in a second. I just thought making it more intuitive may one day save some novice some time googling :)
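
For reference, the corresponding .netrc entry would then be a single line (placeholder credentials):

machine nebula login you@example.com password YOUR_PASSWORD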

@Lamieur commented Nov 26, 2020

But I'm still using that work-around (cleaning up the sessionid cookie) and it still works 100%.

But yes, the super elegant solution would be to:

  • check if the cookie exists
  • try using it
  • if that fails, try logging in again

It's not guaranteed to be faster than a fresh login every time though, so again: who cares :) As long as it works, right? :)
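
Sketched out, that fallback could look something like this (hypothetical helper names; not actual extractor code):

from ..utils import ExtractorError

# Hypothetical sketch of the cookie-first strategy listed above.
def _authenticate(self):
    if self._get_cookies('https://api.watchnebula.com').get('sessionid'):
        try:
            return self._fetch_user_info()  # reuse the existing session
        except ExtractorError:
            pass  # stale or rejected session: fall through to a fresh login
    return self._login_with_credentials()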

@hheimbuerger (Author)

@Lamieur Ah yes, thanks, I can reproduce this now. For no good reason, I was only ever testing with one switch or the other; my bad.

I definitely agree that it's a fair expectation for this to work, if not with using the cookie jar as a form of 'authentication cache', then at least by successfully re-authenticating on subsequent runs.

I'll look into this and fix it.

@hheimbuerger (Author)

Side-note on some weird behavior I just noticed. (I don't know what to make of this yet, so for now I'm just documenting findings here.)

More than once, I've noticed that when I picked up work on this implementation after some days or weeks had passed (during which I didn't visit Nebula in my browser), my unit tests would suddenly fail. I always assumed that something had been changed, breaking my interpretation. So I would log in to the site with my browser to investigate, after which my unit tests would suddenly start passing again.

I think I've now finally witnessed what's going on:

[Screenshot: browser network log showing the requests to /auth/user/, /zype/auth-info/ and /zype/auth-info/new]

You cannot see it on the screenshot, but on the first request to /auth/user/, the zype_auth_info field is strangely None. I believe that's also what kept breaking my unit tests after some absence from the site.

My interpretation is that this Zype authentication token expires after a while. The frontend knows this and then apparently tries to fetch the token directly (/zype/auth-info/). When that fails (404), it requests the generation of a new token (/zype/auth-info/new). However, that takes a while (about 2.5s in my case), so it keeps polling until the Zype info is there.

This seems to be 'account-global', meaning that even though the Nebula extractor currently doesn't implement any of this behavior, loading a random video in the browser will fix this problem for the Nebula extractor as well.
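
If the extractor ever had to handle this itself, the frontend's dance could presumably be mimicked like so (a speculative sketch based purely on the observations above; the exact endpoint paths and URL prefix are assumptions):

import time

# Speculative sketch of the token-refresh behavior observed in the frontend.
def _refresh_zype_token(self, nebula_token):
    headers = {'Authorization': 'Token %s' % nebula_token}
    base = 'https://api.watchnebula.com/api/v1/zype/auth-info/'  # assumed prefix
    auth_info = self._download_json(
        base, None, headers=headers, fatal=False, note='Fetching Zype token')
    if not auth_info:  # 404: the token has apparently expired
        self._download_json(
            base + 'new', None, headers=headers,
            note='Requesting generation of a new Zype token')
        for _ in range(10):  # generation takes ~2.5s, so poll like the frontend
            time.sleep(1)
            auth_info = self._download_json(base, None, headers=headers, fatal=False)
            if auth_info:
                break
    return auth_info and auth_info.get('access_token')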

@hheimbuerger (Author)

@Lamieur The issue during authentication when combining --netrc with --cookies should now be resolved on the new fork.

I can corroborate your analysis: apparently the login endpoint of Nebula responds with a 403 when you supply a session ID cookie. I don't quite understand why it behaves that way, but at least it's reproducible.

I considered implementing some kind of only-login-if-needed approach because it would be faster, but I ultimately decided against it because:

  • it seems like it would be more unstable against future changes on the API, and
  • we're making five requests for each video download anyway, so saving one of those doesn't seem to make a huge difference, and
  • it would probably be a try-to-initiate-video-download-and-login-if-rejected approach, so in some cases it would actually add an additional API call.

So all in all, it doesn't really appear to be worth it, and instead I'm now simply not sending cookies along when authenticating. It's a bit ugly, but effective. (I would have preferred to deactivate cookie jar support for a specific HTTP request, but looking through youtube-dl's library code and documentation, I couldn't find a way to do so.)

@hheimbuerger (Author) commented Jan 17, 2021

That was the last major issue I'm aware of.

GitHub Support has since gotten back to me and let me know that there isn't a way for them to transfer this PR to the new fork. (Reminder: as part of the temporary DMCA takedown of the ytdl-org/youtube-dl repository, all earlier forks were deleted and cannot be restored. I was asked to create a new fork and apply my commits on top of it.)

Therefore, I will now close this PR and open a new one. Please provide further feedback on this new PR.

This PR has been closed in favor of #27859.
