[GetCourseRuIE] & [AcademyMel] Add extractor #8873

Merged: 39 commits merged on Jan 19, 2024

divStar commented Dec 29, 2023

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

This PR contains 2 extractors:

  • GetCourseRuPlayerIE for player\d{2,}.getcourse.ru - a CDN, hosting various videos
  • GetCourseRuIE for *.getcourse.ru (except player\d{2,})

GetCourseRuPlayerIE

The key feature of this extractor is extracting the masterPlaylistUrl from an HTML/JavaScript response that is usually injected into a website. This URL is then passed to the Generic extractor.

This extractor also has an _EMBED_REGEX, which allows the Generic extractor to make use of it if a suitable URL is encountered in a given page.
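
A minimal sketch of the masterPlaylistUrl extraction described above, using only the standard library. The regex and the helper name are assumptions made for illustration and are not taken from the extractor itself; the sample HTML is a shortened stand-in for the real player response.

```python
import json
import re

# Assumed pattern: the player page assigns a JSON object to window.configs
# inside an inline <script> tag. The real extractor's regex may differ.
CONFIGS_RE = re.compile(r'window\.configs\s*=\s*(\{.*?\})\s*</script>', re.DOTALL)


def extract_master_playlist_url(html):
    """Return masterPlaylistUrl from the inline window.configs object, or None."""
    match = CONFIGS_RE.search(html)
    if not match:
        return None
    # json.loads also turns the escaped "\/" sequences back into "/".
    configs = json.loads(match.group(1))
    return configs.get('masterPlaylistUrl')


# Shortened, made-up sample in the same shape as the player response.
sample = '''<script>
    window.configs = {"isVideoReady":true,"masterPlaylistUrl":"https:\\/\\/playlist.example.invalid\\/player\\/abc\\/def\\/master.m3u8?user-id=1"}
</script>'''

print(extract_master_playlist_url(sample))
```

Parsing the object as JSON rather than matching the URL directly keeps the other fields (for example videoDuration or previewUrl) available if they are needed later.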

GetCourseRuIE

The key feature here is that, given proper credentials, this extractor retrieves the iframe tags with particular src attribute values, passes each result to the GetCourseRuPlayerIE extractor and collects the results in a playlist.
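
The iframe collection step could look roughly like the sketch below. The regex, the function name and the shape of the returned dictionaries are assumptions (loosely modelled on yt-dlp's playlist results), not the code in this PR.

```python
import html
import re
from urllib.parse import urljoin

# Assumed pattern for player iframes; the real extractor's regex may differ.
PLAYER_IFRAME_RE = re.compile(
    r'<iframe[^>]+\bsrc=["\']'
    r'(?P<url>[^"\']*player\d{2,}\.getcourse\.ru/sign-player/[^"\']*)["\']')


def build_playlist(page_url, webpage, playlist_id):
    """Collect every player iframe on the page into a playlist-like dict."""
    entries = []
    for match in PLAYER_IFRAME_RE.finditer(webpage):
        # Undo HTML entity escaping (&amp;) and resolve protocol-relative URLs.
        player_url = urljoin(page_url, html.unescape(match.group('url')))
        entries.append({'_type': 'url', 'url': player_url})
    return {'_type': 'playlist', 'id': playlist_id, 'entries': entries}


# Made-up page fragment: one player iframe and one unrelated iframe.
page = ('<iframe class="x" src="//player02.getcourse.ru/sign-player/?json=a&amp;s=b"></iframe>'
        '<iframe src="https://other.example/embed"></iframe>')
result = build_playlist('https://academymel.online/3video_2', page, '3video_2')
print(result)
```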

Since most sub-domains on *.getcourse.ru work the same way (aside from IDs), this generalized extractor should be able to extract videos from most of the websites on getcourse.ru - given the proper credentials.

Furthermore, this extractor contains a curated list of _DOMAINS that are de facto aliases of their getcourse.ru sub-domains.
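
To illustrate how the two extractors divide the URL space, here is a small routing sketch. The patterns and the pick_extractor helper are hypothetical; the curated list contains only academymel.online because that is the only alias named in this description.

```python
import re

# Hypothetical curated aliases; the real _DOMAINS list is longer.
CURATED_DOMAINS = ('academymel.online',)

# player\d{2,}.getcourse.ru goes to the player extractor.
PLAYER_RE = re.compile(r'https?://player\d{2,}\.getcourse\.ru/')

# Any other getcourse.ru sub-domain, or a curated alias, goes to the site extractor.
SITE_RE = re.compile(
    r'https?://(?:(?!player\d{2,}\.)[^./]+\.getcourse\.ru|%s)/'
    % '|'.join(map(re.escape, CURATED_DOMAINS)))


def pick_extractor(url):
    """Return the name of the extractor that would handle the URL, or None."""
    if PLAYER_RE.match(url):
        return 'GetCourseRuPlayer'
    if SITE_RE.match(url):
        return 'GetCourseRu'
    return None


for candidate in (
    'https://player02.getcourse.ru/sign-player/?json=x',
    'https://academymel.getcourse.ru/3video_2',
    'https://academymel.online/3video_2',
    'https://example.com/',
):
    print(candidate, '->', pick_extractor(candidate))
```

The negative lookahead keeps the two patterns mutually exclusive, so the order in which they are tried does not matter.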

Misc

Example of the response when using the URL above with valid s and json query parameters:
<!DOCTYPE html><html><head><meta charset="UTF-8"><title></title><meta name="viewport" content="width=device-width, initial-scale=1"><link rel="stylesheet"
              href="https://vh-asset-static.servicecdn.ru/vhstatic/gc.ts.player/app/173/player.css"
        /><link rel="stylesheet"
                  href="https://academymel.online/pl/fileservice/video-hosting/player-style"
                  lazyload
            /><link rel="stylesheet"
              href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.4/css/all.min.css"
              media="print"
              onload="this.media='all'"
        /><link rel="icon" href="data:;base64,iVBORw0KGgo="></head><body class="gbp-account-id-714517"><div id="player-root"></div><script>
        window.VIDEO_EL_LOG_CONFIGS = {"URL":"https:\/\/v02.getcourse.ru\/api\/video-el-logs\/set","VIDEO_ID":4885252,"USER_ID":357766962,"SESSION_ID":1787349907}
    </script><script>
        window.configs = {"isDisableOpenJustInIframe":false,"videoAspectRatio":1.7777777777777777,"isVideoReady":true,"isAutoShowSubtitles":false,"isProtectReady":true,"isNativePlayer":false,"isDebug":false,"isWatched":false,"isControls":true,"isViewOnlyMode":false,"isGridError":false,"isUserAdmin":false,"isGoodQualityWarning":false,"subtitleUrl":"","masterPlaylistUrl":"https:\/\/playlist.servicecdn.ru\/player\/e2eea7129d979d3c630630538bd36ef1\/d998e6daba8c9b94d34911977683c70d\/master.m3u8?user-cdn=cdnvideo&acc-id=714517&user-id=357766962&loc-mode=ru&version=10%3A2%3A1%3A0%3A2%3Acdnvideo&consumer=vod&jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyLWlkIjozNTc3NjY5NjJ9.9oP4F4FJ1rw1-AqMNllA9LytFwlnctOPr-uQSjgUIuE&gzipoff=1","previewUrl":"https:\/\/preview-htz.kinescopecdn.net\/preview\/e2eea7129d979d3c630630538bd36ef1\/preview.jpg?version=1702370629&host=vh-74","saveViewStatisticUrl":"https:\/\/v01.getcourse.ru\/api\/video-player-range-save-info\/save","markAsWatchedUrl":"\/watched\/status\/357766962\/e2eea7129d979d3c630630538bd36ef1","createPersonalVideoUrl":"\/player\/e2eea7129d979d3c630630538bd36ef1\/d998e6daba8c9b94d34911977683c70d\/create-personal-video","checkPersonalVideoUrl":"\/player\/e2eea7129d979d3c630630538bd36ef1\/d998e6daba8c9b94d34911977683c70d\/ready","transcodeProgressUrl":"\/api\/progress\/e2eea7129d979d3c630630538bd36ef1?viewHash=d998e6daba8c9b94d34911977683c70d","saveMonitoringUrl":"https:\/\/v01.getcourse.ru\/api\/fpm-webbrowser-monitoring\/save","videoDuration":1870,"userId":357766962,"userIp":"46.142.183.85","lessonId":null,"accountId":714517,"cdnList":{"integrosproxy":"https:\/\/vh-74-integros.kinescopecdn.net","gcore":"https:\/\/gc74.vhcdn.com","cdnvideo":"https:\/\/2lfkbh0yxg.a.trbcdn.net","cloudflare":"https:\/\/vh74.vhcdn.com"},"gcFileId":513569172,"videoHash":"e2eea7129d979d3c630630538bd36ef1","videoId":4885252,"setHistoryCurrentTimeUrl":"https:\/\/v02.getcourse.ru\/api\/save-last-seen-time\/set","getHistoryCurrentTimeUrl":"https:\/\/v02.getcourse.ru\/api\/save-last-seen-time\/get","saveCdnPriorityHistoryUrl":"https:\/\/v01.getcourse.ru\/api\/fpm-logs-player-switch-cdn\/save","staticAssetVersion":173,"supportInfo":{"data":{"VideoId":4885252,"VideoHash":"e2eea7129d979d3c630630538bd36ef1","ViewId":"d998e6daba8c9b94d34911977683c70d","VerGen":"2","Protect":"no","Storage":"vh74","CdnRsn":"CV rule","ImgVer":"537","PortLstn":"9007","StVer":173,"AccId":714517,"GcFlId":513569172,"afr":"russia"},"changeCdnUrl":"https:\/\/v01.getcourse.ru:3001\/cdn\/e2eea7129d979d3c630630538bd36ef1\/357766962?","managerVideoUrl":"https:\/\/v01.getcourse.ru:3001\/video\/e2eea7129d979d3c630630538bd36ef1\/?view-hash=d998e6daba8c9b94d34911977683c70d"}}
    </script><script crossorigin src="https://vh-asset-static.servicecdn.ru/vhstatic/gc.ts.player/app/173/ru.js?v=1"></script><script crossorigin src="https://vh-asset-static.servicecdn.ru/vhstatic/gc.ts.player/app/173/player.min.js?v=1"></script><script crossorigin async src="https://vh-asset-static.vhcdn.com/vhstatic/ts.video.el.logs/app/12/video.el.logs.min.js?v=1"></script><script>
        window.addEventListener('message', function (event) {
            if (event.data.msg !== 'video-info' || event.source === null) {
                return
            }

            event.source.postMessage({"video_hash":"e2eea7129d979d3c630630538bd36ef1","is_vertical":"false","aspect_ratio":1.7777777777777777,"type":"vh"}, '*')
        });
    </script></body></html>

One way to acquire such a URL is to register at https://academymel.online/3video_2 and log in. Logging in sets a couple of cookies, after which that website's iframe tags have src attributes that point to the getcourse.ru domain and contain the current video_hash (it probably changes once a day) and a user_id, all embedded in the json and s query parameters.

Since getting the ....getcourse.ru/sign-player/?... URLs is easy once you're logged in, I did not bother to support academymel.online itself (also, the pages there could contain multiple videos). This system is also likely to work for quite a few videos hosted on getcourse.ru, because getcourse.ru is a "building blocks" website for courses and thus abstracts video players and the like away for its customers.

I have included a valid ID and URL, but they might no longer be valid by the time of review.

.netrc

This extractor lets the user supply a username and a password or use the --netrc flag. With --netrc, the user can keep a .netrc file, e.g. in their home directory. However, instead of relying only on machine getcourseru (which is still necessary, albeit with dummy values that are never used), the user can add an entry for each individual domain.

Example .netrc file contents
machine getcourseru
login dummy
password dummy

machine academymel.online
login meriat@jaga.email
password bBY-ccbp$8

machine academymel.getcourse.ru
login meriat@jaga.email
password bBY-ccbp$8

machine manibeauty.getcourse.ru
login some@email
password some@password

This lets users store all their credentials in one file, so that all desired videos can be downloaded easily.
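
The per-domain lookup can be sketched with Python's standard netrc module. This is an illustration of the idea only, not the extractor's actual login code: the lookup_credentials helper, the placeholder credentials and the lesson URL path are made up, and the generic getcourseru entry is deliberately not consulted, matching the note above that its values are never used.

```python
import netrc
import os
import tempfile
from urllib.parse import urlparse


def lookup_credentials(netrc_path, url):
    """Return (login, password) for the URL's hostname, or (None, None)."""
    hostname = urlparse(url).hostname
    auth = netrc.netrc(netrc_path).authenticators(hostname)
    if auth is None:
        return None, None
    login, _account, password = auth
    return login, password


# Placeholder credentials, written to a temporary file for the demo.
NETRC_TEXT = """\
machine getcourseru
login dummy
password dummy

machine academymel.online
login user@example.invalid
password placeholder
"""

with tempfile.NamedTemporaryFile('w', suffix='.netrc', delete=False) as fh:
    fh.write(NETRC_TEXT)

try:
    found = lookup_credentials(fh.name, 'https://academymel.online/3video_2')
    missing = lookup_credentials(fh.name, 'https://manibeauty.getcourse.ru/lesson')
finally:
    os.unlink(fh.name)

print(found)
print(missing)
```

Keying the lookup on the hostname is what allows one file to hold different logins for academymel.online, academymel.getcourse.ru and any other sub-domain.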

Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

seproDev added the site-request label (Request to support a new website) on Dec 29, 2023

divStar commented Dec 30, 2023

I successfully built the binaries on my fork and tested them with a URL (since it probably changes about once a day, posting it here won't help): https://github.com/divStar/yt-dlp/actions/runs/7360557969.

I am still looking into creating an account on the source website (https://academymel.online/) that would allow a login process, which in turn would lead to an up-to-date page with an up-to-date link to the video and hence probably enable automated testing of this extractor, but I haven't done so yet.


divStar commented Jan 5, 2024

The GetCourseRuIE extractor is a CDN extractor that needs a correct URL to work with. However, there is no direct login, and thus no way of testing it, since the URLs change.

This is why I am about to add an AcademyMelIE extractor for the site that embeds the videos, along with example credentials which, if used as the user name and password in the test/local_parameters.json file, will retrieve the GetCourse URL and ultimately use the GetCourseRuIE extractor to download the file.

Once the tests pass, this PR should be mergeable.

divStar changed the title from "[GetCourseRuIE] Add extractor" to "[GetCourseRuIE] & [AcademyMel] Add extractor" on Jan 7, 2024

divStar commented Jan 7, 2024

Review remarks / TODOs:

  • no delegation to the Generic extractor
  • use formats, subtitles = self._extract_m3u8_formats_and_subtitles()
  • extract metadata from the site that embeds the videos and, e.g., pass it to the GetCourseRu extractor
  • remove unnecessary try-except blocks
  • try not to cache the cookie in _perform_login for re-use in _real_extract
  • review contributing guidelines once more

I am working on these things.

@bashonly
Member

Please note that there are hidden suggestions

bashonly added the pending-fixes label (PR has had changes requested) on Jan 14, 2024
divStar requested a review from bashonly on January 14, 2024 21:23

divStar commented Jan 14, 2024

@seproDev has outlined the following plan to complete this:

  • GetCourseRuIE should match all *.getcourse.ru domains except player02.getcourse.ru.
  • GetCourseRuIE will also match a curated list of domains known to be hosted on getcourse.ru (e.g. academymel.online, which mirrors academymel.getcourse.ru although the DNS redirect is not visible), similar to how the YouTube extractor keeps a list of Invidious instances.
  • GetCourseRuIE will mostly do what AcademyMelIE does now (login + playlist + call to the player extractor).
  • GetCourseRuPlayerIE will only match player02.getcourse.ru to extract the video.
  • GetCourseRuPlayerIE should define an _EMBED_REGEX variable that lets it scan a webpage for embedded iframe tags when the page is not on the curated list but a cookie has been provided (the cookie should make the page return actual videos instead of a redirect to the login).

bashonly added the needs-testing label (Patch needs testing) and removed the pending-fixes label (PR has had changes requested) on Jan 18, 2024

divStar commented Jan 19, 2024

I successfully ran all the test cases in the getcourseru.py file (using credentials, some of which have access to paid lessons, too). Furthermore, I successfully tested the _EMBED_REGEX and .netrc.

So overall this looks good to me. Thank you everyone for your help and support! It was an awesome experience!

bashonly removed the needs-testing label (Patch needs testing) on Jan 19, 2024
bashonly self-assigned this on Jan 19, 2024
bashonly merged commit 4310b66 into yt-dlp:master on Jan 19, 2024
6 checks passed

divStar commented Feb 16, 2024

Just reporting back: I have tried these extractors on multiple occasions and multiple sites successfully (both GetCourseRuPlayer and GetCourseRu).

aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024
Authored by: divStar, seproDev

Co-authored-by: sepro <4618135+seproDev@users.noreply.github.com>