Redtube extractor broken (json parse error for mediaDescription) #25311

Closed
stentrav opened this issue May 18, 2020 · 0 comments

@stentrav stentrav commented May 18, 2020

Checklist

  • I'm reporting a broken site support
  • I've verified that I'm running youtube-dl version 2020.05.08
  • I've checked that all provided URLs are alive and playable in a browser
  • I've checked that all URLs and arguments with special characters are properly quoted or escaped
  • I've searched the bugtracker for similar issues including closed ones

Verbose log

[debug] System config: []
[debug] User config: []
[debug] Custom config: ['-o', 'c:/foo/yt/%(title)s.%(ext)s', '--ignore-errors', '--verbose']
[debug] Command-line args: ['--skip-download', '--config-location', 'stentrav/test-config.txt', 'https://www.redtube.com/31253031']
[debug] Encodings: locale cp1252, fs utf-8, out utf-8, pref cp1252
[debug] youtube-dl version 2020.05.08
[debug] Git HEAD: 62afd97fc
[debug] Python version 3.8.3 (CPython) - Windows-10-10.0.18362-SP0
[debug] exe versions: ffmpeg git-2020-01-02-81172b5, ffprobe git-2020-01-02-81172b5
[debug] Proxy map: {}
[RedTube] 31253031: Downloading webpage
WARNING: [RedTube] 31253031: Failed to parse JSON Expecting ',' delimiter: line 1 column 1341 (char 1340)
[debug] Default format spec: bestvideo+bestaudio/best

Description

The Redtube extractor fails to parse some webpages correctly while looking for media details. An example URL that fails is https://www.redtube.com/31253031 (the one used in the verbose log above).

The extractor uses a regular expression to locate a json expression within the page, but the pattern does not catch the entire json expression. When the extractor attempts to decode the expression, it fails (see the WARNING log message above).

Detailed Problem Description

The extractor module is youtube_dl/extractor/redtube.py. It uses group 1 of this regular expression to extract media information from the webpage:

r'mediaDefinition\s*:\s*(\[.+?\])'

Note, this regular expression does not handle nested json arrays: the lazy .+? stops at the first ], so for [1,2,[3,4],5] group 1 matches only [1,2,[3,4] which is incomplete and invalid.
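
The truncation is easy to reproduce in isolation. The following is a minimal sketch using only the standard `re` module; the `sample` string reuses the illustrative array from the note above and is not data from the actual page.

```python
import re

# The lazy quantifier stops at the first ']' it meets, which closes the
# inner array rather than the outer one.
pattern = r'mediaDefinition\s*:\s*(\[.+?\])'
sample = 'mediaDefinition: [1,2,[3,4],5]'  # illustrative input only

truncated = re.search(pattern, sample).group(1)
print(truncated)  # prints [1,2,[3,4]
```

The captured text has an unclosed outer bracket, so any JSON decoder will reject it.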

When run on this webpage, group 1 looks like:

[{"defaultQuality":false,"format":"upsell", ... ,"quality":[720,480,240]

Note that it ends prematurely at the first ], which closes the nested quality array. The actual json expression continues:

[720,480,240]},{"defaultQuality":false,"format":"hls","v

Proposed Fix

The fix needs to extract the complete json expression from the webpage. Concocting a regular expression that matches arbitrarily nested json is probably not possible. And the python standard library function json.loads raises an exception when anything other than whitespace follows the end of the json expression.
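
To illustrate that behaviour, here is a small sketch, assuming Python 3.5 or later (where `json.JSONDecodeError` exists). The `page_tail` string is a made-up stand-in for a complete JSON array followed by unrelated page content; it is not taken from the real page.

```python
import json

# Hypothetical stand-in: a complete JSON array followed by non-JSON text.
page_tail = '[{"quality":[720,480,240]}],\n    other: 1'

try:
    json.loads(page_tail)
    error = None
except json.JSONDecodeError as exc:
    error = exc

# For trailing content the decoder reports "Extra data", and exc.pos is
# the offset where the complete JSON value ended.
valid_part = page_tail[:error.pos]
print(error.msg)
print(json.loads(valid_part))
```

Slicing at `exc.pos` only yields valid JSON when the failure is caused by trailing content. If the JSON itself is malformed, the slice is still invalid and the second decode fails too.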

The proposed fix is to:

  • Use the current regular expression to locate the start of the valid JSON
  • Try to decode from the start of the json to the end of the page. Since this includes the non-json content that follows,
    the decode is expected to fail with a json.JSONDecodeError.
  • If the decode fails, the exception identifies the failing position in the string. Use it to substring the valid json part.
  • With the valid substring, continue with the extractor logic to decode it without error

The current source:

        medias = self._parse_json(
            self._search_regex(
                r'mediaDefinition\s*:\s*(\[.+?\])', webpage,
                'media definitions', default='{}'),
            video_id, fatal=False)

The proposed fix:

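        import json
        import re

        # Locate the start of the mediaDefinition array; the pattern only
        # needs to find where the json begins, not where it ends.
        mobj = re.search(r'mediaDefinition\s*:\s*(\[.+?\])', webpage)
        # Fall back to '{}' when nothing matches, like the current source.
        doc1 = webpage[mobj.start(1):] if mobj else '{}'
        try:
            json.loads(doc1)
        except json.JSONDecodeError as exc:  # requires Python 3.5+
            # With trailing non-json content, exc.pos is the offset where
            # the valid json ends, so keep only that prefix.
            doc1 = doc1[:exc.pos]
        medias = self._parse_json(
            doc1,
            video_id, fatal=False)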
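
A possible alternative, which is not part of the proposal above and not necessarily what the eventual fix does, is the standard library's `json.JSONDecoder.raw_decode`. It parses one json value from the start of a string and returns it together with the end offset, so trailing page content does not raise. A sketch, where `extract_media_definition` and the `page` string are hypothetical names and data made up for illustration:

```python
import json
import re


def extract_media_definition(webpage):
    """Return the decoded mediaDefinition array, or None if absent or invalid."""
    # The lookahead leaves the match ending just before the opening '['.
    mobj = re.search(r'mediaDefinition\s*:\s*(?=\[)', webpage)
    if not mobj:
        return None
    try:
        # raw_decode expects the json value to start at the given index
        # (0 here) and ignores whatever follows the value.
        value, _end = json.JSONDecoder().raw_decode(webpage[mobj.end():])
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    return value


# Illustrative page fragment, not copied from the real site.
page = ('x = {mediaDefinition: [{"format":"upsell","quality":[720,480,240]},'
        '{"format":"hls"}], other: 1};')
print(extract_media_definition(page))
```

This decodes the text once instead of twice, at the cost of bypassing the extractor's `_parse_json` helper and its warning output.
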
@stentrav stentrav mentioned this issue May 19, 2020
@dstftw dstftw closed this in cd13343 May 19, 2020