oreilly login page error #30884

Open
sandygmaharaj opened this issue Apr 22, 2022 · 17 comments · May be fixed by #31524
Labels
broken-IE problem with existing site extraction

Comments

@sandygmaharaj

Checklist

  • [ x ] I'm reporting a broken site support issue
  • [ x ] I've verified that I'm running youtube-dl version 2021.12.17
  • [ x ] I've checked that all provided URLs are alive and playable in a browser
  • [ x ] I've checked that all URLs and arguments with special characters are properly quoted or escaped
  • [ x ] I've searched the bugtracker for similar bug reports including closed ones
  • [ x ] I've read bugs section in FAQ

Verbose log

[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-u', u'PRIVATE', u'-p', u'PRIVATE', u'--verbose', u'--write-info-json', u'https://learning.oreilly.com/videos/linux-shell-scripting/9781789800906/']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 2.7.16 (CPython) - Linux-4.19.0-18-cloud-amd64-x86_64-with-debian-10.12
[debug] exe versions: ffmpeg 4.1.8-0, ffprobe 4.1.8-0
[debug] Proxy map: {}
[safari:course] Downloading login page
ERROR: An extractor error has occurred. (caused by KeyError(u'next',)); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type  youtube-dl -U  to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/common.py", line 533, in extract
    self.initialize()
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/common.py", line 437, in initialize
    self._real_initialize()
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/safari.py", line 29, in _real_initialize
    self._login()
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/safari.py", line 51, in _login
    'https://api.oreilly.com', qs['next'][0])
KeyError: u'next'
Traceback (most recent call last):
  File "/usr/local/bin/youtube-dl/youtube_dl/YoutubeDL.py", line 815, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/bin/youtube-dl/youtube_dl/YoutubeDL.py", line 836, in __extract_info
    ie_result = ie.extract(url)
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/common.py", line 547, in extract
    raise ExtractorError('An extractor error has occurred.', cause=e)
ExtractorError: An extractor error has occurred. (caused by KeyError(u'next',)); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type  youtube-dl -U  to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

Description

The issue seems to be caused by changes to the login page of the O'Reilly site. youtube-dl is unable to log in and therefore cannot proceed any further.

@dirkf
Contributor

dirkf commented Apr 22, 2022

Use --cookies ... with a cookie file from your logged-in browser session, and don't use -u/--username ....
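
Before passing a cookie file to --cookies, it can help to confirm that the file parses and actually contains cookies for the site. The following is a minimal sketch using Python's standard http.cookiejar module, which reads the Netscape cookies.txt format. The sample file contents and the helper name oreilly_cookie_names are made up for illustration; only the cookie name orm-jwt comes from the extractor code quoted in this thread.

```python
import http.cookiejar
import os
import tempfile


def oreilly_cookie_names(cookie_file):
    """Return the sorted names of cookies stored for oreilly.com domains."""
    jar = http.cookiejar.MozillaCookieJar()
    # Keep session cookies and expired ones so the listing shows
    # everything that is in the file.
    jar.load(cookie_file, ignore_discard=True, ignore_expires=True)
    return sorted(c.name for c in jar if c.domain.endswith('oreilly.com'))


# Hypothetical cookie file, written only to make the sketch runnable.
sample = (
    '# Netscape HTTP Cookie File\n'
    '.oreilly.com\tTRUE\t/\tTRUE\t4102444800\torm-jwt\tplaceholder\n'
    '.example.com\tTRUE\t/\tFALSE\t4102444800\tother\tplaceholder\n'
)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as fh:
    fh.write(sample)

names = oreilly_cookie_names(fh.name)
os.unlink(fh.name)
print(names)
```

An empty list for a real export would suggest the browser session was not logged in when the file was saved, or that the export used a different format.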

@sandygmaharaj
Author

I still get an error:
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'--cookies', u'./cookies.txt', u'--verbose', u'--write-info-json', u'https://learning.oreilly.com/videos/complete-bash-shell/9781800209695/']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 2.7.16 (CPython) - Linux-4.19.0-18-cloud-amd64-x86_64-with-debian-10.12
[debug] exe versions: ffmpeg 4.1.8-0, ffprobe 4.1.8-0
[debug] Proxy map: {}
[safari:course] 9781800209695: Downloading course JSON
[download] Downloading playlist: Complete Bash Shell Scripting
[safari:course] playlist Complete Bash Shell Scripting: Collected 93 video ids (downloading 93 of them)
[download] Downloading video 1 of 93
[safari:api] 9781800209695/video1_1: Downloading part JSON
[safari] 9781800209695-video1_1: Downloading webpage
ERROR: Unable to extract kaltura reference id; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type youtube-dl -U to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "/usr/local/bin/youtube-dl/youtube_dl/YoutubeDL.py", line 815, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/bin/youtube-dl/youtube_dl/YoutubeDL.py", line 836, in __extract_info
    ie_result = ie.extract(url)
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/common.py", line 534, in extract
    ie_result = self._real_extract(url)
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/safari.py", line 147, in _real_extract
    webpage, 'kaltura reference id', group='id')
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/common.py", line 1012, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
RegexNotFoundError: Unable to extract kaltura reference id; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type youtube-dl -U to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

Btw, it was working fine with Oreilly a few days back.

@adbprogrammer

This comment was marked as duplicate.

@prof-ninjason

prof-ninjason commented May 3, 2022

Confirmed.

I refreshed my cookies.
I removed username/password.
I reduced my script to a single one-line command to make sure the problem wasn't in my own script.
I also removed the last / in the URL just in case.

Still having a "next" error.

@prof-ninjason

Quoting @dirkf: "Use --cookies ... with a cookie file from your logged-in browser session, and don't use -u/--username ...."

To archive more than just 2 minutes of video, a temporary fix was found: using both cookies and username/password allowed full-length archiving of the video.

@ankurjain41282

I am still facing the issue

@robertgrubba

I got the same error. The fix posted by jcrochon doesn't work for me (using cookies and credentials as parameters doesn't work either).

@hunter86bg

@dirkf, can you take a look at this one? It seems that on the second attempt (safari:api) the redirect URL has no query string (qs), so the extractor can't determine next_uri.
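
The failure described above can be reproduced outside the extractor. This is a minimal sketch, assuming the login check redirects to a URL that sometimes lacks a next parameter; both sample URLs and the helper name next_uri_from are hypothetical, and the standard urllib.parse functions stand in for youtube-dl's compat wrappers.

```python
from urllib.parse import parse_qs, urljoin, urlparse


def next_uri_from(redirect_url):
    """Return the absolute `next` target, or None when the URL has none."""
    qs = parse_qs(urlparse(redirect_url).query)
    # qs['next'][0] would raise KeyError when the parameter is missing,
    # which is the error shown in the traceback; .get() avoids that.
    values = qs.get('next')
    if not values:
        return None
    return urljoin('https://api.oreilly.com', values[0])


# Hypothetical redirect targets, for illustration only.
with_next = ('https://www.oreilly.com/member/login/'
             '?next=%2Fapi%2Fv1%2Fauth%2Fopenid%2Fauthorize%2F')
without_next = 'https://www.oreilly.com/member/login/'

print(next_uri_from(with_next))
print(next_uri_from(without_next))
```

Returning None instead of raising lets the caller decide whether a missing next means "already logged in" or "login flow changed", which is the question the rest of this thread is trying to answer.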

epsilonSpider added a commit to epsilonSpider/youtube-dl that referenced this issue Feb 11, 2023
@epsilonSpider epsilonSpider linked a pull request Feb 11, 2023 that will close this issue
11 tasks
@epsilonSpider

@sandygmaharaj @ankurjain41282 @robertgrubba if you would like, can you check if this change addresses the issue here: epsilonSpider@d524ac1 ?

@hunter86bg

I managed to download a single video, so it looks better now.

@hunter86bg

hunter86bg commented Feb 12, 2023

I managed to download a single video (longer than a minute).

Edit: I meant that I managed to download a whole course where at least one of the videos is longer than a minute.

@Kerruba
Contributor

Kerruba commented Jul 17, 2023

From my understanding, the problem lies in (at least) two places:

  1. If you don't provide username/password through the options, the user is considered not logged in even if credentials are passed using cookies (because of the "if username is None:" check):
        username, password = self._get_login_info()
        if username is None:
            return
  2. Apparently the is_logged function is not working properly either. This PR should fix that issue: "Check for oreilly login with new url" #31524

I tried locally to replace the is_logged function

        def is_logged(urlh):
            url = urlh.geturl()
            parsed_url = compat_urlparse.urlparse(url)
            return parsed_url.hostname.endswith('learning.oreilly.com') and (
                parsed_url.path.startswith('/home/')
                or (parsed_url.path == '/member/login/' and not parsed_url.query))

And this in combination with the -u someuser -p somepass --cookies <cookies-file> seems to solve the problem
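
To see which redirect targets the replacement check accepts, here is a standalone sketch of the same logic that takes a plain URL string instead of a response handle. It uses urllib.parse in place of compat_urlparse, the name is_logged_url is mine, and the sample URLs are assumptions rather than observed redirects.

```python
from urllib.parse import urlparse


def is_logged_url(url):
    """Mirror of the proposed is_logged check, applied to a URL string."""
    parsed = urlparse(url)
    host = parsed.hostname or ''  # hostname is None for relative URLs
    return host.endswith('learning.oreilly.com') and (
        parsed.path.startswith('/home/')
        or (parsed.path == '/member/login/' and not parsed.query))


# Hypothetical final URLs after following redirects.
samples = [
    'https://learning.oreilly.com/home/',
    'https://learning.oreilly.com/member/login/',
    'https://learning.oreilly.com/member/login/?next=%2Fhome%2F',
    'https://www.oreilly.com/member/login/',
]
for url in samples:
    print(url, is_logged_url(url))
```

Note that the check treats a bare /member/login/ on learning.oreilly.com as "logged in", which is a heuristic: it relies on the site appending a query string only when authentication is still required.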

@dirkf
Contributor

dirkf commented Jul 17, 2023

If the cookies file is valid, the -u ... -p ... should be unnecessary. The code at 1. is operating correctly: it doesn't try to log you in if you didn't ask for it. The change at 2. is probably what is making the difference.

@Kerruba
Contributor

Kerruba commented Jul 18, 2023

@dirkf this is not the behaviour I'm seeing. I've updated my local safari extractor to print information about the login process and tried to download the file reported above, both with username/password + cookies and with cookies only. Note that the cookies are valid.
Here is the diff in the code:

diff --git a/youtube_dl/extractor/safari.py b/youtube_dl/extractor/safari.py
index 2cc665122..161cc94f2 100644
--- a/youtube_dl/extractor/safari.py
+++ b/youtube_dl/extractor/safari.py
@@ -31,14 +31,21 @@ class SafariBaseIE(InfoExtractor):
     def _login(self):
         username, password = self._get_login_info()
         if username is None:
+            self.to_screen('Not Logged in')
             return
+        self.to_screen('Using user {}'.format(username))
 
         _, urlh = self._download_webpage_handle(
             'https://learning.oreilly.com/accounts/login-check/', None,
             'Downloading login page')
 
         def is_logged(urlh):
-            return 'learning.oreilly.com/home/' in urlh.geturl()
+            url = urlh.geturl()
+            parsed_url = compat_urlparse.urlparse(url)
+            return parsed_url.hostname.endswith('learning.oreilly.com') and (
+                parsed_url.path.startswith('/home/')
+                or (parsed_url.path == '/member/login/' and not parsed_url.query))
+            # return 'learning.oreilly.com/home/' in urlh.geturl()
 
         if is_logged(urlh):
             self.LOGGED_IN = True

Here are the results I get from the download:

  1. Only valid cookies, no username/password provided
> youtube-dl -o file_without_username_and_password.mp4 --playlist-items 1 --cookies /tmp/learning.oreilly.com_cookies.txt https://learning.oreilly.com/videos/complete-bash-shell/9781800209695/
[safari:course] Not Logged in
[safari:course] 9781800209695: Downloading course JSON
[download] Downloading playlist: Complete Bash Shell Scripting
[safari:course] playlist Complete Bash Shell Scripting: Collected 93 video ids (downloading 1 of them)
[download] Downloading video 1 of 1
[safari:api] Not Logged in
[safari:api] 9781800209695/video1_1: Downloading part JSON
[safari] Not Logged in
[safari] 9781800209695-video1_1: Downloading webpage
[Kaltura] 9781800209695-video1_1: Downloading webpage
[Kaltura] 0_eiswe197: Downloading video info JSON
[Kaltura] 0_eiswe197: Checking mp4-1512 URL
[Kaltura] 0_eiswe197: Downloading m3u8 information
[download] Destination: file_without_username_and_password.mp4
[download] 100% of 8.44MiB in 00:01
[download] Finished downloading playlist: Complete Bash Shell Scripting

> ffmpeg -i file_without_username_and_password.mp4 2>&1 | grep "Duration"| cut -d ' ' -f 4 | sed s/,//
00:01:00.00
  2. Valid cookies, plus a totally made-up username and password
> youtube-dl --username someuser --password somepass -o file_with_username_and_password.mp4 --playlist-items 1 --cookies /tmp/learning.oreilly.com_cookies.txt https://learning.oreilly.com/videos/complete-bash-shell/9781800209695/
[safari:course] Using user someuser
[safari:course] Downloading login page
[safari:course] 9781800209695: Downloading course JSON
[download] Downloading playlist: Complete Bash Shell Scripting
[safari:course] playlist Complete Bash Shell Scripting: Collected 93 video ids (downloading 1 of them)
[download] Downloading video 1 of 1
[safari:api] Using user someuser
[safari:api] Downloading login page
[safari:api] 9781800209695/video1_1: Downloading part JSON
[safari] Using user someuser
[safari] Downloading login page
[safari] 9781800209695-video1_1: Downloading webpage
[safari] 9781800209695-video1_1: Downloading kaltura session JSON
[Kaltura] 9781800209695-video1_1: Downloading webpage
[Kaltura] 0_eiswe197: Downloading video info JSON
[Kaltura] 0_eiswe197: Checking mp4-1512 URL
[Kaltura] 0_eiswe197: Downloading m3u8 information
[download] Destination: file_with_username_and_password.mp4
[download] 100% of 189.76MiB in 00:34
[download] Finished downloading playlist: Complete Bash Shell Scripting

> ffmpeg -i file_with_username_and_password.mp4 2>&1 | grep "Duration"| cut -d ' ' -f 4 | sed s/,//
00:17:32.25
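
For comparing the two runs numerically, the Duration strings printed by ffmpeg can be converted to seconds. This is a small sketch; the helper name duration_seconds is hypothetical, and it assumes the HH:MM:SS.ss layout shown in the output above.

```python
def duration_seconds(timestamp):
    """Convert an ffmpeg-style HH:MM:SS.ss duration into seconds."""
    hours, minutes, seconds = timestamp.split(':')
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# The two durations reported in the runs above.
preview = duration_seconds('00:01:00.00')
full = duration_seconds('00:17:32.25')
print(preview, full)
```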

As you can see, providing the username and password makes the difference in the length of the downloaded file. I didn't dig into the code that much, but it seems to me that something is going wrong.
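
One possible reading of the two logs: only the second run prints "Downloading kaltura session JSON", and in the extractor that request sits behind the "if self.LOGGED_IN:" branch of SafariIE._real_extract. The sketch below is a toy model of that gating, not the extractor itself. The function name build_embed_url and the SESSION_TOKEN value are made up, urllib.parse.urlencode stands in for youtube-dl's update_url_query, and the assumption that a missing session token is what yields the 1-minute preview is not confirmed in this thread.

```python
from urllib.parse import urlencode

EMBED_URL = ('https://cdnapisec.kaltura.com/html5/html5lib/'
             'v2.37.1/mwEmbedFrame.php')


def build_embed_url(reference_id, logged_in, session=None):
    """Build the Kaltura embed URL, adding the session only when logged in."""
    query = {
        'wid': '_1926081',          # default partner id from the extractor
        'uiconf_id': '29375172',    # default uiconf id from the extractor
        'flashvars[referenceId]': reference_id,
    }
    # Without LOGGED_IN the session is never requested, so no ks is sent.
    if logged_in and session:
        query['flashvars[ks]'] = session
    return EMBED_URL + '?' + urlencode(query)


ref = '9781800209695-video1_1'
anonymous = build_embed_url(ref, logged_in=False)
authenticated = build_embed_url(ref, logged_in=True, session='SESSION_TOKEN')
print(anonymous)
print(authenticated)
```

If this reading is right, a cookies-only run never sets LOGGED_IN because _login returns early when no username is given, which would explain why adding any username changes the result.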

@dirkf
Contributor

dirkf commented Jul 18, 2023

So those results are with the patch applied?

Maybe it's important to update the cookies by visiting the login page and then the login procedure is bypassed. Needs investigation.

@Kerruba
Contributor

Kerruba commented Jul 18, 2023

Yes, the results are with the patch applied; sorry if that wasn't clear. It does need more investigation. I'll try to find some time for that in the next couple of days, hopefully.

@dirkf dirkf added the broken-IE problem with existing site extraction label Jul 18, 2023
@D357R0Y3R

D357R0Y3R commented Feb 23, 2024

Hey @dirkf, @epsilonSpider, @Kerruba, I just tried modifying this code a bit with updated URLs to at least fix the downloading, but I'm still getting the 1-minute video download error. It seems like the Kaltura extractor is having a problem downloading the correct index.m3u8.

import json
import re

from .common import InfoExtractor

from ..compat import (
    compat_parse_qs,
    compat_urlparse,
)
from ..utils import (
    ExtractorError,
    update_url_query,
)


class SafariBaseIE(InfoExtractor):
    _LOGIN_URL = 'https://learning.oreilly.com/member/login/'
    _NETRC_MACHINE = 'safari'

    _API_BASE = 'https://learning.oreilly.com/api/v1'
    _API_FORMAT = 'json'

    LOGGED_IN = False

    def _perform_login(self, username, password):
        _, urlh = self._download_webpage_handle(
            'https://learning.oreilly.com/member/login/', None,
            'Downloading login page')

        def is_logged(urlh):
            url = urlh.geturl()
            parsed_url = compat_urlparse.urlparse(url)
            return parsed_url.hostname.endswith('learning.oreilly.com') and (
                parsed_url.path.startswith('/home/')
                or (parsed_url.path == '/member/login/' and not parsed_url.query))

        if is_logged(urlh):
            self.LOGGED_IN = True
            return

        redirect_url = urlh.url
        parsed_url = compat_urlparse.urlparse(redirect_url)
        qs = compat_parse_qs(parsed_url.query)
        next_uri = compat_urlparse.urljoin(
            'https://api.oreilly.com', qs['next'][0])

        auth, urlh = self._download_json_handle(
            'https://www.oreilly.com/member/login/', None, 'Logging in',
            data=json.dumps({
                'email': username,
                'password': password,
                'redirect_uri': next_uri,
            }).encode(), headers={
                'Content-Type': 'application/json',
                'Referer': redirect_url,
            }, expected_status=400)

        credentials = auth.get('credentials')
        if (not auth.get('logged_in') and not auth.get('redirect_uri')
                and credentials):
            raise ExtractorError(
                'Unable to login: %s' % credentials, expected=True)

        # oreilly serves two same instances of the following cookies
        # in Set-Cookie header and expects first one to be actually set
        for cookie in ('groot_sessionid', 'orm-jwt', 'orm-rt'):
            self._apply_first_set_cookie_header(urlh, cookie)

        _, urlh = self._download_webpage_handle(
            auth.get('redirect_uri') or next_uri, None, 'Completing login',)

        if is_logged(urlh):
            self.LOGGED_IN = True
            return

        raise ExtractorError('Unable to log in')


class SafariIE(SafariBaseIE):
    IE_NAME = 'safari'
    IE_DESC = 'safaribooksonline.com online video'
    _VALID_URL = r'''(?x)
                        https?://
                            (?:www\.)?(?:safaribooksonline|(?:learning\.)?oreilly)\.com/
                            (?:
                                library/view/[^/]+/(?P<course_id>[^/]+)/(?P<part>[^/?\#&]+)\.html|
                                videos/[^/]+/[^/]+/(?P<reference_id>[^-]+-[^/?\#&]+)
                            )
                    '''

    _TESTS = [{
        'url': 'https://www.safaribooksonline.com/library/view/hadoop-fundamentals-livelessons/9780133392838/part00.html',
        'md5': 'dcc5a425e79f2564148652616af1f2a3',
        'info_dict': {
            'id': '0_qbqx90ic',
            'ext': 'mp4',
            'title': 'Introduction to Hadoop Fundamentals LiveLessons',
            'timestamp': 1437758058,
            'upload_date': '20150724',
            'uploader_id': 'stork',
        },
    }, {
        # non-digits in course id
        'url': 'https://www.safaribooksonline.com/library/view/create-a-nodejs/100000006A0210/part00.html',
        'only_matching': True,
    }, {
        'url': 'https://www.safaribooksonline.com/library/view/learning-path-red/9780134664057/RHCE_Introduction.html',
        'only_matching': True,
    }, {
        'url': 'https://www.safaribooksonline.com/videos/python-programming-language/9780134217314/9780134217314-PYMC_13_00',
        'only_matching': True,
    }, {
        'url': 'https://learning.oreilly.com/videos/hadoop-fundamentals-livelessons/9780133392838/9780133392838-00_SeriesIntro',
        'only_matching': True,
    }, {
        'url': 'https://www.oreilly.com/library/view/hadoop-fundamentals-livelessons/9780133392838/00_SeriesIntro.html',
        'only_matching': True,
    }]

    _PARTNER_ID = '1926081'
    _UICONF_ID = '29375172'

    def _real_extract(self, url):
        mobj = self._match_valid_url(url)

        reference_id = mobj.group('reference_id')
        if reference_id:
            video_id = reference_id
            partner_id = self._PARTNER_ID
            ui_id = self._UICONF_ID
        else:
            video_id = '%s-%s' % (mobj.group('course_id'), mobj.group('part'))

            webpage, urlh = self._download_webpage_handle(url, video_id)

            mobj = re.match(self._VALID_URL, urlh.url)
            reference_id = mobj.group('reference_id')
            if not reference_id:
                reference_id = self._search_regex(
                    r'data-reference-id=(["\'])(?P<id>(?:(?!\1).)+)\1',
                    webpage, 'kaltura reference id', group='id')
            partner_id = self._search_regex(
                r'data-partner-id=(["\'])(?P<id>(?:(?!\1).)+)\1',
                webpage, 'kaltura widget id', default=self._PARTNER_ID,
                group='id')
            ui_id = self._search_regex(
                r'data-ui-id=(["\'])(?P<id>(?:(?!\1).)+)\1',
                webpage, 'kaltura uiconf id', default=self._UICONF_ID,
                group='id')

        query = {
            'wid': '_%s' % partner_id,
            'uiconf_id': ui_id,
            'flashvars[referenceId]': reference_id,
        }

        if self.LOGGED_IN:
            kaltura_session = self._download_json(
                '%s/player/kaltura_session/?reference_id=%s' % (self._API_BASE, reference_id),
                video_id, 'Downloading kaltura session JSON',
                'Unable to download kaltura session JSON', fatal=False,
                headers={'Accept': 'application/json'})
            if kaltura_session:
                session = kaltura_session.get('session')
                if session:
                    query['flashvars[ks]'] = session

        return self.url_result(update_url_query(
            'https://cdnapisec.kaltura.com/html5/html5lib/v2.37.1/mwEmbedFrame.php', query),
            'Kaltura')


class SafariApiIE(SafariBaseIE):
    IE_NAME = 'safari:api'
    _VALID_URL = r'https?://(?:www\.)?(?:safaribooksonline|(?:learning\.)?oreilly)\.com/api/v1/book/(?P<course_id>[^/]+)/chapter(?:-content)?/(?P<part>[^/?#&]+)\.html'

    _TESTS = [{
        'url': 'https://www.safaribooksonline.com/api/v1/book/9780133392838/chapter/part00.html',
        'only_matching': True,
    }, {
        'url': 'https://www.safaribooksonline.com/api/v1/book/9780134664057/chapter/RHCE_Introduction.html',
        'only_matching': True,
    }]

    def _real_extract(self, url):
        mobj = self._match_valid_url(url)
        part = self._download_json(
            url, '%s/%s' % (mobj.group('course_id'), mobj.group('part')),
            'Downloading part JSON')
        web_url = part['web_url']
        if 'library/view' in web_url:
            web_url = web_url.replace('library/view', 'videos')
            natural_keys = part['natural_key']
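            # %-formatting keeps this line valid on Python 2, which youtube-dl supports;
            # [:-5] drops the last 5 characters of the chapter file name (presumably '.html')
            web_url = '%s/%s-%s' % (
                web_url.rsplit('/', 1)[0], natural_keys[0], natural_keys[1][:-5])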
        return self.url_result(web_url, SafariIE.ie_key())


class SafariCourseIE(SafariBaseIE):
    IE_NAME = 'safari:course'
    IE_DESC = 'safaribooksonline.com online courses'

    _VALID_URL = r'''(?x)
                    https?://
                        (?:
                            (?:www\.)?(?:safaribooksonline|(?:learning\.)?oreilly)\.com/
                            (?:
                                library/view/[^/]+|
                                api/v1/book|
                                videos/[^/]+
                            )|
                            techbus\.safaribooksonline\.com
                        )
                        /(?P<id>[^/]+)
                    '''

    _TESTS = [{
        'url': 'https://www.safaribooksonline.com/library/view/hadoop-fundamentals-livelessons/9780133392838/',
        'info_dict': {
            'id': '9780133392838',
            'title': 'Hadoop Fundamentals LiveLessons',
        },
        'playlist_count': 22,
        'skip': 'Requires safaribooksonline account credentials',
    }, {
        'url': 'https://www.safaribooksonline.com/api/v1/book/9781449396459/?override_format=json',
        'only_matching': True,
    }, {
        'url': 'http://techbus.safaribooksonline.com/9780134426365',
        'only_matching': True,
    }, {
        'url': 'https://www.safaribooksonline.com/videos/python-programming-language/9780134217314',
        'only_matching': True,
    }, {
        'url': 'https://learning.oreilly.com/videos/hadoop-fundamentals-livelessons/9780133392838',
        'only_matching': True,
    }, {
        'url': 'https://www.oreilly.com/library/view/hadoop-fundamentals-livelessons/9780133392838/',
        'only_matching': True,
    }]

    @classmethod
    def suitable(cls, url):
        return (False if SafariIE.suitable(url) or SafariApiIE.suitable(url)
                else super(SafariCourseIE, cls).suitable(url))

    def _real_extract(self, url):
        course_id = self._match_id(url)

        course_json = self._download_json(
            '%s/book/%s/?override_format=%s' % (self._API_BASE, course_id, self._API_FORMAT),
            course_id, 'Downloading course JSON')

        if 'chapters' not in course_json:
            raise ExtractorError(
                'No chapters found for course %s' % course_id, expected=True)

        entries = [
            self.url_result(chapter, SafariApiIE.ie_key())
            for chapter in course_json['chapters']]

        course_title = course_json['title']

        return self.playlist_result(entries, course_id, course_title)
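
The URL rewrite in SafariApiIE._real_extract above can be checked in isolation. The sketch below applies the same steps to a made-up part JSON. The shape of natural_key (course id followed by a chapter file name ending in .html) is an assumption inferred from the slicing in the code, the helper name rewrite_web_url is mine, and the regex is a simplified copy of the videos branch of SafariIE._VALID_URL.

```python
import re

# Simplified version of the "videos/..." branch of SafariIE._VALID_URL.
VIDEO_URL_RE = re.compile(
    r'https?://learning\.oreilly\.com/videos/[^/]+/[^/]+/'
    r'(?P<reference_id>[^-]+-[^/?#&]+)')


def rewrite_web_url(part):
    """Turn a library/view chapter URL into a videos URL with a reference id."""
    web_url = part['web_url']
    if 'library/view' in web_url:
        web_url = web_url.replace('library/view', 'videos')
        course_id, chapter_file = part['natural_key']
        # Drop the last path segment, then append "<course>-<chapter>"
        # with the 5-character '.html' suffix removed.
        web_url = '%s/%s-%s' % (
            web_url.rsplit('/', 1)[0], course_id, chapter_file[:-5])
    return web_url


# Hypothetical part JSON, for illustration only.
sample_part = {
    'web_url': ('https://learning.oreilly.com/library/view/'
                'complete-bash-shell/9781800209695/video1_1.html'),
    'natural_key': ['9781800209695', 'video1_1.html'],
}
rewritten = rewrite_web_url(sample_part)
match = VIDEO_URL_RE.match(rewritten)
print(rewritten)
print(match.group('reference_id'))
```

With this input the reference id has the same form as the "9781800209695-video1_1" id seen in the logs earlier in the thread, so the rewritten URL takes the branch of SafariIE that skips the webpage download.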
