Checklist
Verbose log
Description
The Redtube extractor fails to parse some webpages correctly while looking for media details. Here is an example URL that fails.
The extractor uses a regular expression to locate a JSON expression within the page, but the pattern does not capture the entire JSON expression. When the extractor attempts to decode the captured text, decoding fails (see the WARNING log message above).
Detailed Problem Description
The extractor module is youtube_dl/extractor/redtube.py. It uses group 1 of this regular expression to extract media information from the webpage:
Note that this regular expression does not handle nested JSON arrays. For example, given [1,2,[3,4],5] it matches only [1,2,[3,4], which is incomplete and therefore invalid JSON.
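To make the failure mode concrete, here is a minimal sketch. The pattern `ARRAY_RE` and the `page` string are hypothetical stand-ins, not the extractor's actual regular expression or page content; they only reproduce the kind of truncation described above.

```python
import json
import re

# Hypothetical pattern (NOT the one in redtube.py): a non-greedy match
# that stops at the first closing bracket it finds.
ARRAY_RE = re.compile(r'(\[.+?\])')

# Hypothetical page content containing a nested JSON array.
page = 'sources: [1,2,[3,4],5], other: 0'

fragment = ARRAY_RE.search(page).group(1)
print(fragment)  # the inner "]" ends the match early

try:
    json.loads(fragment)
    decoded_ok = True
except ValueError:  # json.JSONDecodeError is a ValueError subclass on Python 3
    decoded_ok = False

print(decoded_ok)
```

The match ends at the bracket that closes the inner array, so the outer array is never closed and the decoder rejects the fragment.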
When run on this webpage, group 1 looks like:
Note that it ends prematurely at a ]. The actual JSON expression continues:
Proposed Fix
The fix needs to extract the complete JSON expression from the webpage. Writing a regular expression that matches an arbitrary JSON expression is probably not possible, and the Python standard library's JSON decoder generally raises an exception when it encounters data beyond the end of the JSON expression.
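The decoder behaviour mentioned here can be seen in a short sketch, assuming Python 3 (where the exception is `json.JSONDecodeError` and carries a `pos` attribute). The `text` value is a made-up example, not taken from the real page.

```python
import json

# Hypothetical input: a complete JSON array followed by unrelated page text.
text = '[1,2,[3,4],5]; var other = 0;'

try:
    json.loads(text)
except json.JSONDecodeError as err:
    error_message = err.msg         # short description of the problem
    valid_prefix = text[:err.pos]   # text before the offending position
    print(error_message)
    print(valid_prefix)
```

For this input the error position falls right after the closing bracket of the array, so the text before it is the complete JSON expression. Whether that holds for every page is an assumption that would need checking.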
The proposed fix is to:
the decode may fail with a json.JSONDecodeError.
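As an illustration of handling a failed decode, the sketch below uses `json.JSONDecoder.raw_decode`, which decodes one JSON value from a given index and reports where it ended. This is one possible approach and not necessarily the fix proposed in this issue; `decode_json_prefix` and the sample `page` string are hypothetical.

```python
import json


def decode_json_prefix(text, start=0):
    """Hypothetical helper: decode the JSON value that begins at text[start].

    Returns (value, end_index) on success, or None if no valid JSON value
    starts at that position.
    """
    try:
        return json.JSONDecoder().raw_decode(text, start)
    except ValueError:
        # On Python 3 this is json.JSONDecodeError, a ValueError subclass;
        # Python 2's json module raises a plain ValueError.
        return None


# Hypothetical page content, for illustration only.
page = 'sources: [1,2,[3,4],5], other: 0'
start = page.index('[')

result = decode_json_prefix(page, start)
print(result)

# A truncated fragment cannot be decoded, so the helper returns None.
print(decode_json_prefix('[1,2,[3,4]'))
```

Catching `ValueError` rather than `json.JSONDecodeError` keeps the helper usable on both Python 2 and Python 3. Note that `raw_decode` does not skip leading whitespace, so `start` must point at the first character of the JSON value.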
The current source:
The proposed fix: