[LinkedIn] ERROR: An extractor error has occurred. #31270

Open
tkbi opened this issue Oct 2, 2022 · 4 comments
Labels
broken-IE problem with existing site extraction patch-available

Comments

@tkbi

tkbi commented Oct 2, 2022

Checklist

  • [x] I'm reporting a broken site support issue
  • [x] I've verified that I'm running youtube-dl version 2021.12.17
  • [x] I've checked that all provided URLs are alive and playable in a browser
  • [x] I've checked that all URLs and arguments with special characters are properly quoted or escaped
  • [x] I've searched the bugtracker for similar bug reports including closed ones
  • [x] I've read bugs section in FAQ

Verbose log

URL: https://www.linkedin.com/learning/einfuhrung-in-die-softwarearchitektur-1-grundlagen-begriffe-und-ausgewahlte-tools/was-ist-softwarearchitektur?autoSkip=true&autoplay=true&resume=false&u=xxxxxxxx
ERROR: An extractor error has occurred. (caused by KeyError(u'JSESSIONID',)); please report this issue on https://yt-dl.org/bug . 
Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. 
Be sure to call youtube-dl with the --verbose flag and include its complete output.
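
The `KeyError('JSESSIONID')` above is what you would expect when code indexes a cookie container that has no session cookie, for instance because nobody is logged in. The following is only a sketch of that failure mode using the standard library; `csrf_token` is a hypothetical helper, the cookie value is made up, and this is not the extractor's actual code:

```python
from http.cookies import SimpleCookie


def csrf_token(cookies):
    """Return the JSESSIONID value, or None when the cookie is absent."""
    morsel = cookies.get('JSESSIONID')
    return morsel.value if morsel is not None else None


anonymous = SimpleCookie()            # no login, so no session cookie
logged_in = SimpleCookie()
logged_in['JSESSIONID'] = 'ajax:123'  # made-up value for illustration

# Plain indexing fails in the same way as the reported error.
try:
    anonymous['JSESSIONID'].value
    raised = False
except KeyError:
    raised = True

print(raised, csrf_token(anonymous), csrf_token(logged_in))
```

Looking the cookie up with `.get()` lets the caller decide what to do when there is no session, instead of surfacing a bare `KeyError`.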

Description

WRITE DESCRIPTION HERE

dirkf changed the title from "ERROR: An extractor error has occurred." to "[LinkedIn] ERROR: An extractor error has occurred." on Oct 2, 2022
@dirkf
Contributor

dirkf commented Oct 2, 2022

Be sure to call youtube-dl with the --verbose flag and include its complete output.

Do it?

@Vangelis66

Vangelis66 commented Oct 2, 2022

The URL in the OP should work without the query parameters and without login credentials; from what I gathered, it must be a free sample:

https://de.linkedin.com/learning/einfuhrung-in-die-softwarearchitektur-1-grundlagen-begriffe-und-ausgewahlte-tools/was-ist-softwarearchitektur

(It plays fine in my browser 😄.) Running youtube-dl with -F on it yields:

youtube-dl -v -F "https://de.linkedin.com/learning/einfuhrung-in-die-softwarearchitektur-1-grundlagen-begriffe-und-ausgewahlte-tools/was-ist-softwarearchitektur" => 

[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', '-F', 'https://de.linkedin.com/learning/einfuhrung-in-die-softwarearchitektur-1-grundlagen-begriffe-und-ausgewahlte-tools/was-ist-softwarearchitektur']
[debug] Encodings: locale cp1253, fs mbcs, out cp737, pref cp1253
[debug] youtube-dl version 2022.10.02.810
[debug] Python version 3.4.4 (CPython) - Windows-Vista-6.0.6003-SP2
[debug] exe versions: ffmpeg 5.0, ffprobe 5.0, phantomjs 2.1.1, rtmpdump 2.4
[debug] Proxy map: {}
[generic] was-ist-softwarearchitektur: Requesting header
WARNING: Falling back on generic information extractor.
[generic] was-ist-softwarearchitektur: Downloading webpage
[generic] was-ist-softwarearchitektur: Extracting information
[info] Available formats for was-ist-softwarearchitektur:
format code  extension      resolution note
0            unknown_video  unknown

Sadly, what is actually downloaded is not a media file but an HTML one, the page's source code:

[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', 'https://de.linkedin.com/learning/einfuhrung-in-die-softwarearchitektur-1-grundlagen-begriffe-und-ausgewahlte-tools/was-ist-softwarearchitektur']
[debug] Encodings: locale cp1253, fs mbcs, out cp737, pref cp1253
[debug] youtube-dl version 2022.10.02.810
[debug] Python version 3.4.4 (CPython) - Windows-Vista-6.0.6003-SP2
[debug] exe versions: ffmpeg 5.0, ffprobe 5.0, phantomjs 2.1.1, rtmpdump 2.4
[debug] Proxy map: {}
[generic] was-ist-softwarearchitektur: Requesting header
WARNING: Falling back on generic information extractor.
[generic] was-ist-softwarearchitektur: Downloading webpage
[generic] was-ist-softwarearchitektur: Extracting information
[debug] Default format spec: bestvideo+bestaudio/best
[debug] Invoking downloader on 'https://de.linkedin.com/learning/einfuhrung-in-die-softwarearchitektur-1-grundlagen-begriffe-und-ausgewahlte-tools/was-ist-softwarearchitektur'
[download] Destination: Was ist Softwarearchitektur - Einf?hrung in die Softwarearchitektur 1 - Grundlagen, Begriffe und ausgew?hlte Tools-was-ist-softwarearchitektur.unknown_video
[download] 100% of 118.05KiB in 00:01

Inside the Page Source,

  <div class="share-native-video">
    <video class="share-native-video__node video-js" data-sources=
    "[{&quot;src&quot;:&quot;https://dms.licdn.com/playlist/C4D0DAQGPfzHW4lxbTw/learning-original-video-vbr-540/0/1598782239071?e=1665342000&amp;v=beta&amp;t=dwvCYQxhSyDwwUmJAmm62GV9oPTuK-mQZ3DczI2u7_8#.mp4&quot;}]"
    data-poster-url=
    "https://media-exp1.licdn.com/dms/image/C4E0DAQEIBfhUmf3mlw/learning-public-crop_675_1200/0/1567118425117?e=1665342000&amp;v=beta&amp;t=XO_op2UrLN64_KQbrSj9lOk7ZACr6p7GblyOuweD6mg"
    data-captions-url=
    "https://www.linkedin.com/ambry/?x-li-ambry-ep=AQIMn_ZSVB7fCQAAAYOaDcda-TEnQXD-2_tND91gB_iJ-8yCizcDJTGyq02N63-Q_ApFqSlVY2hn7z3zvzkmPud1C9eKTbQKpe-rU3AB6tiBK57Z4gjR05o7TXuEhZS9vF6zj7WA5667DYvm-J0qPwCyffsBnYOOQuZaQPFlrkGDGKHVtcx1mp451qGQMZaRvPuCLOCjHgkk05cpKbMFOeN7u2Fc7roDLg-R3sNIIPuH-onbSpuRVJ1mAXyeq53bNRcch5sWUvBrfUHXXhtegbt8ae-1usMuwy8NwJCAfExXAlM4r-yp_JIDE1IbfihQUckTrrWxRDrXMF1WDnIh6QdfiSqWD6X6EbgTiOQQSSCjP-qDft_jY2X4_epMn-3u4HcDYnpD87pO_v2-UlRFaufikXU1Q3QkdOY5A6oaKP5gYichUMAiXIR7USfwnVXXfqaRqdEL0SqMjX3jCO2RvtlEMx1jmLVzVHhdqB8qdNXoL0Qz09sFSwrR46M1"
    data-digitalmedia-asset-urn="urn:li:lyndaVideo:(urn:li:lyndaCourse:761019,2815083)"
    data-tracking-id="DJvYbPLNTWK7posByDamzw=="></video>
  </div>

contains direct links to the media file itself,

https://dms.licdn.com/playlist/C4D0DAQGPfzHW4lxbTw/learning-original-video-vbr-540/0/1598782239071?e=1665342000&v=beta&t=dwvCYQxhSyDwwUmJAmm62GV9oPTuK-mQZ3DczI2u7_8#.mp4

(tokenised with a defined lifespan) and to WebVTT subs:

https://www.linkedin.com/ambry/?x-li-ambry-ep=AQIMn_ZSVB7fCQAAAYOaDcda-TEnQXD-2_tND91gB_iJ-8yCizcDJTGyq02N63-Q_ApFqSlVY2hn7z3zvzkmPud1C9eKTbQKpe-rU3AB6tiBK57Z4gjR05o7TXuEhZS9vF6zj7WA5667DYvm-J0qPwCyffsBnYOOQuZaQPFlrkGDGKHVtcx1mp451qGQMZaRvPuCLOCjHgkk05cpKbMFOeN7u2Fc7roDLg-R3sNIIPuH-onbSpuRVJ1mAXyeq53bNRcch5sWUvBrfUHXXhtegbt8ae-1usMuwy8NwJCAfExXAlM4r-yp_JIDE1IbfihQUckTrrWxRDrXMF1WDnIh6QdfiSqWD6X6EbgTiOQQSSCjP-qDft_jY2X4_epMn-3u4HcDYnpD87pO_v2-UlRFaufikXU1Q3QkdOY5A6oaKP5gYichUMAiXIR7USfwnVXXfqaRqdEL0SqMjX3jCO2RvtlEMx1jmLVzVHhdqB8qdNXoL0Qz09sFSwrR46M1
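
As a rough illustration of how those links could be pulled out of the page, here is a standard-library sketch. It assumes the `<video>` tag carries `data-sources` (HTML-escaped JSON) and `data-captions-url` as in the snippet above; `VideoAttrs`, `parse_sources` and the shortened `SAMPLE` markup with its placeholder IDs and tokens are hypothetical:

```python
import json
import re
from html.parser import HTMLParser


class VideoAttrs(HTMLParser):
    """Collect the attributes of the first <video> tag seen."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        if tag == 'video' and self.attrs is None:
            # HTMLParser has already unescaped &quot; and &amp; here.
            self.attrs = dict(attrs)


def parse_sources(page):
    """Return (formats, captions_url) found in the page's <video> tag."""
    parser = VideoAttrs()
    parser.feed(page)
    attrs = parser.attrs or {}
    formats = []
    for entry in json.loads(attrs.get('data-sources') or '[]'):
        src = entry.get('src')
        if not src:
            continue
        # e.g. ".../learning-original-video-vbr-540/..." -> "vbr-540", 540
        fmt = re.search(r'-([avt]br-(\d+))/', src)
        ext = re.search(r'#\.(\w+)$', src)
        formats.append({
            'url': src,
            'format_id': fmt.group(1) if fmt else None,
            'height': int(fmt.group(2)) if fmt else None,
            'ext': ext.group(1) if ext else None,
        })
    return formats, attrs.get('data-captions-url')


# Shortened, made-up markup in the shape of the snippet above.
SAMPLE = (
    '<video class="share-native-video__node video-js" '
    'data-sources="[{&quot;src&quot;:&quot;https://dms.licdn.com/playlist/'
    'VIDEOID/learning-original-video-vbr-540/0/1?e=1&amp;v=beta&amp;'
    't=TOKEN#.mp4&quot;}]" '
    'data-captions-url="https://www.linkedin.com/ambry/?x-li-ambry-ep=TOKEN">'
    '</video>'
)

formats, captions_url = parse_sources(SAMPLE)
print(formats, captions_url)
```

The format ID, height and extension are guessed from the URL shape only, so each falls back to `None` when the pattern does not match.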

I suppose getting this to work with "paid for content" will require additional steps...
FWIW, yt-dlp fails right away with:

ERROR: Unsupported URL: https://de.linkedin.com/learning/einfuhrung-in-die-softwarearchitektur-1-grundlagen-begriffe-und-ausgewahlte-tools/was-ist-softwarearchitektur

(with and without --ies generic,default 😉)

@dirkf
Contributor

dirkf commented Oct 3, 2022

There is a LinkedIn extractor, but only for www.linkedin.com, and it doesn't understand this page structure, which has a good ld+json block apart from its invalid contentURL.

After giving it a good talking-to:

$ python -m youtube_dl -j 'https://de.linkedin.com/learning/einfuhrung-in-die-softwarearchitektur-1-grundlagen-begriffe-und-ausgewahlte-tools/was-ist-softwarearchitektur' | jq '.'
{
  "display_id": "was-ist-softwarearchitektur",
  "extractor": "linkedin:learning",
  "protocol": "https",
  "description": "Grundlagenwissen für Softwarearchitekten",
  "upload_date": "20190605",
  "timestamp": 1559692800,
  "formats": [
    {
      "protocol": "https",
      "format": "vbr-540 - 540p",
      "url": "https://dms.licdn.com/playlist/C4D0DAQGPfzHW4lxbTw/learning-original-video-vbr-540/0/1598782239071?e=1665367200&v=beta&t=acTi1Goh8ZA7q9v2RAqERacJAikDZVcNmDaNfz8NpGs#.mp4",
      "http_headers": ...,
      "height": 540,
      "ext": "mp4",
      "format_id": "vbr-540"
    }
  ],
  "episode_id": "was-ist-softwarearchitektur",
  "series_id": "einfuhrung-in-die-softwarearchitektur-1-grundlagen-begriffe-und-ausgewahlte-tools",
  "_filename": "Was ist Softwarearchitektur - Einführung in die Softwarearchitektur 1 - Grundlagen, Begriffe und ausgewählte Tools-C4D0DAQGPfzHW4lxbTw.mp4",
  "uploader": "Hendrik Lösch",
  "duration": 312,
  "format_id": "vbr-540",
  "height": 540,
  "http_headers": ...,
  "id": "C4D0DAQGPfzHW4lxbTw",
  "subtitles": {
    "de": [
      {
        "url": "https://www.linkedin.com/ambry/?x-li-ambry-ep=AQIg1xl8j15WUAAAAYObefkbEndlIBeOIr8KICAJLJfFJxEGcNImqIc5sryMAQFVT5UIdIYmp4sS2d7uzpT_2Pn6XApik8l-7zhiIwKm9rKaiGCi-XfRdpKnA9e_vfZNd4012ocdN-6wLmPE7sSeF_AJIw8QoGja6KR-cNgrdyYMRjbQPrQJymTtoL4BP7z_JY4eM8IQMvrjlMoDDRGRlRr7Rq4lbXwP1iPGBfZb7KbDKu3ft1hbWHBuz3tMgYwtYnmAzvOJT6LIxhvNdZQTPc4g92B6OF3pz7xlaxfCVS8mGCKhs7ZNdkDw6juQDDIEedxwUrq4-xCTrTI3sNbfa8lIiTISHgfBuxsinODnef1mPHBFDxiJk16p2fw9SA3iUg1pqavE8Uj_JIP4DWHxtN43WK7R_onPH8KAgnvNUr5PVZBi3nzaqJbFSC9uSRDO2VDNkEcCqD15wA8oVJaAbsw4T87XteGZL8x-_klxt37qvzY_vlnfEZDgx033",
        "ext": "vtt"
      }
    ]
  },
  "view_count": 5709,
  "playlist": null,
  "thumbnails": [
    {
      "url": "https://media-exp1.licdn.com/dms/image/C4E0DAQEIBfhUmf3mlw/learning-public-crop_675_1200/0/1567118425117?e=1665367200&v=beta&t=VXMXYnLrAzWBVEQj73zJXVZc5_KcpSMaAmrNi87NL_Y",
      "id": "0"
    }
  ],
  "title": "Was ist Softwarearchitektur? - Einführung in die Softwarearchitektur 1: Grundlagen, Begriffe und ausgewählte Tools",
  "url": "https://dms.licdn.com/playlist/C4D0DAQGPfzHW4lxbTw/learning-original-video-vbr-540/0/1598782239071?e=1665367200&v=beta&t=acTi1Goh8ZA7q9v2RAqERacJAikDZVcNmDaNfz8NpGs#.mp4",
  "extractor_key": "LinkedInLearning",
  "format": "vbr-540 - 540p",
  ...
}
$
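
Two of the numbers in that output can be sanity-checked with a few lines of Python. This is a sketch under assumptions: that `upload_date` is the UTC calendar date of `timestamp`, and that the `e` query parameter of the media URL is a Unix expiry time (the thread only says the link is tokenised with a defined lifespan). Both helper names are hypothetical:

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse


def upload_date_from(timestamp):
    """Format a Unix timestamp as a YYYYMMDD date in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime('%Y%m%d')


def assumed_expiry(url):
    """Read the 'e' query parameter as a Unix time (an assumption)."""
    value = parse_qs(urlparse(url).query)['e'][0]
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


media_url = (
    'https://dms.licdn.com/playlist/C4D0DAQGPfzHW4lxbTw/'
    'learning-original-video-vbr-540/0/1598782239071'
    '?e=1665367200&v=beta&t=acTi1Goh8ZA7q9v2RAqERacJAikDZVcNmDaNfz8NpGs#.mp4'
)

print(upload_date_from(1559692800))           # matches "upload_date" above
print(assumed_expiry(media_url).isoformat())  # about a week after the post
```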

@dirkf
Contributor

dirkf commented Oct 28, 2022

This is the patch I used:

--- old/youtube_dl/extractor/linkedin.py
+++ new/youtube_dl/extractor/linkedin.py
@@ -5,9 +5,14 @@ import re
 
 from .common import InfoExtractor
 from ..utils import (
+    extract_attributes,
     ExtractorError,
     float_or_none,
     int_or_none,
+    ISO639Utils,
+    merge_dicts,
+    try_get,
+    url_or_none,
     urlencode_postdata,
     urljoin,
 )
@@ -17,7 +22,7 @@ class LinkedInLearningBaseIE(InfoExtractor):
     _NETRC_MACHINE = 'linkedin'
     _LOGIN_URL = 'https://www.linkedin.com/uas/login?trk=learning'
 
-    def _call_api(self, course_slug, fields, video_slug=None, resolution=None):
+    def _call_api(self, host, course_slug, fields, video_slug=None, resolution=None):
         query = {
             'courseSlug': course_slug,
             'fields': fields,
@@ -27,10 +32,10 @@ class LinkedInLearningBaseIE(InfoExtractor):
         if video_slug:
             query.update({
                 'videoSlug': video_slug,
-                'resolution': '_%s' % resolution,
+                'resolution': '_%s' % (resolution, ),
             })
-            sub = ' %dp' % resolution
-        api_url = 'https://www.linkedin.com/learning-api/detailedCourses'
+            sub = ' %dp' % (resolution, )
+        api_url = 'https://%s.linkedin.com/learning-api/detailedCourses' % (host, )
         return self._download_json(
             api_url, video_slug, 'Downloading%s JSON metadata' % sub, headers={
                 'Csrf-Token': self._get_cookies(api_url)['JSESSIONID'].value,
@@ -73,7 +78,7 @@ class LinkedInLearningBaseIE(InfoExtractor):
 
 class LinkedInLearningIE(LinkedInLearningBaseIE):
     IE_NAME = 'linkedin:learning'
-    _VALID_URL = r'https?://(?:www\.)?linkedin\.com/learning/(?P<course_slug>[^/]+)/(?P<id>[^/?#]+)'
+    _VALID_URL = r'https?://(?P<host>www|[a-z]{2})?(?(host)\.)linkedin\.com/learning/(?P<course_slug>[^/]+)/(?P<id>[^/?#]+)'
     _TEST = {
         'url': 'https://www.linkedin.com/learning/programming-foundations-fundamentals/welcome?autoplay=true',
         'md5': 'a1d74422ff0d5e66a792deb996693167',
@@ -86,15 +91,53 @@ class LinkedInLearningIE(LinkedInLearningBaseIE):
         },
     }
 
+    def _extract_free(self, url, host, course_slug, video_slug):
+        webpage = self._download_webpage(url, video_slug)
+        info = self._search_json_ld(webpage, video_slug, expected_type='VideoObject', default={})
+        title = info['title']
+        info.pop('url', None)
+        native_video = self._search_regex(
+            r'''(<video\b[^>]+\bclass\s*=\s*(["'])(?:(?:(?!\2).)+?\s)?share-native-video__node video-js\2[^>]*>)''',
+            webpage, 'native video', default='')
+        native_video = extract_attributes(native_video)
+        sources = self._parse_json(native_video.get('data-sources', '[]'), video_slug)
+        formats = []
+        video_id = video_slug
+        for src in sources:
+            src = url_or_none(try_get(src, lambda x: x['src']))
+            if not src:
+                continue
+            format_id = self._search_regex(r'-([avt]br-\d+)/', src, 'format id', default=None)
+            video_id = self._search_regex(r'/playlist/([^/]+)/', src, 'video id', default=video_id)
+            ext = self._search_regex(r'#\.(\w+)$', src, 'ext', default=None)
+            formats.append({
+                'url': src,
+                'format_id': format_id,
+                'ext': ext,
+                'height': int_or_none(format_id.split('-')[-1]),
+            })
+        self._sort_formats(formats)
+        sttl_url = native_video.get('data-captions-url') if ISO639Utils.short2long(host) else None
+        return merge_dicts({
+            'id': video_id,
+            'display_id': video_slug,
+            'episode_id': video_slug,
+            'series_id': course_slug,
+            'formats': formats,
+            'subtitles': sttl_url and {host: [{'url': sttl_url, 'ext': 'vtt', }]},
+            }, info)
+
     def _real_extract(self, url):
-        course_slug, video_slug = re.match(self._VALID_URL, url).groups()
+        host, course_slug, video_slug = re.match(self._VALID_URL, url).groups()
 
         video_data = None
         formats = []
         for width, height in ((640, 360), (960, 540), (1280, 720)):
-            video_data = self._call_api(
-                course_slug, 'selectedVideo', video_slug, height)['selectedVideo']
-
+            try:
+                video_data = self._call_api(
+                    host or 'www', course_slug, 'selectedVideo', video_slug, height)['selectedVideo']
+            except (ExtractorError, KeyError):
+                return self._extract_free(url, host, course_slug, video_slug)
             video_url_data = video_data.get('url') or {}
             progressive_url = video_url_data.get('progressiveUrl')
             if progressive_url:
@@ -155,7 +198,7 @@ class LinkedInLearningCourseIE(LinkedInLearningBaseIE):
 
     def _real_extract(self, url):
         course_slug = self._match_id(url)
-        course_data = self._call_api(course_slug, 'chapters,description,title')
+        course_data = self._call_api('www', course_slug, 'chapters,description,title')
 
         entries = []
         for chapter_number, chapter in enumerate(course_data.get('chapters', []), 1):
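
For anyone wondering about the new `_VALID_URL`: it uses a conditional group, so the dot is only required when a host prefix (`www` or a two-letter subdomain) was matched. A small standalone check, with made-up slugs and a hypothetical `split_url` helper:

```python
import re

# Pattern copied from the patch above.
VALID_URL = (
    r'https?://(?P<host>www|[a-z]{2})?(?(host)\.)linkedin\.com/learning/'
    r'(?P<course_slug>[^/]+)/(?P<id>[^/?#]+)'
)


def split_url(url):
    """Return (host, course_slug, video_slug), or None if no match."""
    m = re.match(VALID_URL, url)
    return m.groups() if m else None


print(split_url('https://de.linkedin.com/learning/some-course/some-video'))
print(split_url('https://www.linkedin.com/learning/some-course/some-video?autoplay=true'))
print(split_url('https://linkedin.com/learning/some-course/some-video'))
print(split_url('https://example.com/learning/some-course/some-video'))
```

With no subdomain the `host` group is `None`, which is why `_real_extract` in the patch passes `host or 'www'` to `_call_api`.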

dirkf added the patch-available and broken-IE (problem with existing site extraction) labels and removed the incomplete label on Oct 28, 2022