[extractor/iprima] Fix extractor (relax nuxt function regex, add js_to_json hack) #7216

Merged
merged 11 commits into from Sep 21, 2023

@std-move std-move commented Jun 3, 2023

Description of your pull request and other information

Fixes the extractor issue mentioned in the latest comment in #6524 (comment), but not the originally reported issue (unable to verify/reproduce that one as I don't have a paid account). Details:

// Fixes #7229

The iPrima extractor is broken because it fails to extract the nuxt_js data. The cause is additional code that now appears before the statement returning the data object. Shortened example:

window.__NUXT__=(function(a,b,[...],n,[...],z){n.ulid="DUMMY";n.avatarUrl="<URI>";[...];return {data:{$dataId:{content:{id:q,drupalId:"141111",playId:r,[...]

This additional code, which modifies a passed parameter, is not needed to extract the data we want. I have relaxed the regex to allow this additional code to be present.
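To illustrate the relaxation, here is a minimal sketch. Both patterns and the sample page are simplified assumptions made up for this example; they are not the regexes used in yt-dlp's common.py. The strict pattern expects `return` right after the opening brace, while the relaxed one skips any statements that come first.

```python
import re

# Made-up, shortened page content modelled on the example above.
page = ('window.__NUXT__=(function(a,b,n){n.ulid="DUMMY";n.avatarUrl="x";'
        'return {data:{id:a,playId:b}}}(1,"p1",{}));')

# Hypothetical strict pattern: `return` must directly follow the opening brace.
STRICT = re.compile(
    r'window\.__NUXT__=\(function\((?P<args>[^)]*)\)\{\s*return\s+'
    r'(?P<body>\{.*\})\}\((?P<values>.*)\)\)')

# Hypothetical relaxed pattern: `.*?` lazily skips statements before `return`.
RELAXED = re.compile(
    r'window\.__NUXT__=\(function\((?P<args>[^)]*)\)\{.*?\breturn\s+'
    r'(?P<body>\{.*\})\}\((?P<values>.*)\)\)')

strict_match = STRICT.search(page)    # no match: extra code precedes `return`
relaxed_match = RELAXED.search(page)  # matches and captures the three parts

if relaxed_match:
    print(relaxed_match.group('args'))
    print(relaxed_match.group('body'))
    print(relaxed_match.group('values'))
```

The skipped statements only mutate a parameter, so ignoring them loses nothing that the captured body and argument values do not already provide.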

The second issue was that js_to_json failed due to an Array() parameter being present. Shortened example of the input:

[null,false,"none","ČR",true,"series",[..],Array(10),"p1176840",[...],"p822076"]

Without the changes I've made to the function, quotes get added around Array, resulting in the following parsing exception:

json.decoder.JSONDecodeError: Expecting ',' delimiter in 'u","Array"(10),"p117': line 1 column 1707 (char 1706)

Matching the whole line in the regex is required because the Array parameter is only one part of the original array. This code is not very nice, but then again js_to_json is kinda hacky anyway; it is not a proper parser.
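The failure and one possible fix can be reproduced with the standard library alone. This is a sketch under assumptions: the input is a made-up shortening of the example above, and `ARRAY_CALL` and `array_call_to_literal` are hypothetical names, not the code in _utils.py.

```python
import json
import re

# Shortened, made-up input in the spirit of the example above.
js_array = '[null,false,"none",Array(10),"p1176840"]'

# What quoting the bare identifier effectively produces: invalid JSON.
broken = '[null,false,"none","Array"(10),"p1176840"]'
try:
    json.loads(broken)
    quoted_form_parses = True
except json.JSONDecodeError:
    quoted_form_parses = False

# Assumed pre-processing step: rewrite Array(...) / new Array(...) as a
# bracketed literal before any quoting happens.
ARRAY_CALL = re.compile(r'(?:new\s+)?\bArray\((.*?)\)')


def array_call_to_literal(match):
    # Keeps the constructor arguments as list elements.
    return '[' + match.group(1) + ']'


fixed = ARRAY_CALL.sub(array_call_to_literal, js_array)
parsed = json.loads(fixed)
print(quoted_form_parses, fixed, parsed)
```

Note that this turns `Array(10)` into `[10]`, not into a ten-slot empty array as JavaScript would. That is acceptable only if the array's contents are not needed. The pattern would also rewrite an `Array(...)` that appears inside a string literal, which a real parser would not do.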

This should make the extractor work again and hopefully not break other things. Suggestions to improve the code are very much welcome.

Copilot Summary

🤖 Generated by Copilot at 5782255

Summary

🐛🔧🚀

Fix Nuxt.js metadata extraction for some websites by improving regexes for common.py and adding Array constructor handling in _utils.py.

Nuxt.js metadata
Array or logic, no matter
Autumn leaves adapt

Walkthrough

  • Fix Nuxt.js metadata extraction by allowing any code before the return statement in the function that provides the data
  • Handle arrays that are constructed with the Array constructor instead of the literal syntax in the metadata by converting them to JSON arrays

@std-move std-move mentioned this pull request Jun 4, 2023
11 tasks
@bashonly bashonly added enhancement New feature or request site-bug Issue with a specific website labels Jun 14, 2023
replace all occurrences of Array constructor, not just a single one
@bashonly
Member

see #7229 (comment)

@std-move
Contributor Author

std-move commented Jul 1, 2023

Adjusted for latest changes to iPrima

make the code more declarative as suggested in review
@bashonly bashonly mentioned this pull request Jul 9, 2023
11 tasks
@bashonly
Member

bashonly commented Jul 22, 2023

pre-release build available for testing:
edit: just update yt-dlp

@milanv74

Hello @std-move, this is my log with the test movie. I have paid access to prima+:

[debug] Command-line config: ['-vU', '-u', 'PRIVATE', '-p', 'PRIVATE', 'https://www.iprima.cz/serialy/zoo/zoo/126-epizoda-126']
[debug] Encodings: locale cp1250, fs utf-8, pref cp1250, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2023.07.22.013310 [eed15ae] (win32_dir)
[debug] Python 3.8.10 (CPython AMD64 64bit) - Windows-10-10.0.19045-SP0 (OpenSSL 1.1.1k 25 Mar 2021)
[debug] exe versions: none
[debug] Optional libraries: Cryptodome-3.18.0, brotli-1.0.9, certifi-2023.05.07, mutagen-1.46.0, sqlite3-2.6.0, websockets-11.0.3
[debug] Proxy map: {}
[debug] Loaded 1857 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Available version: stable@2023.07.06, Current version: stable@2023.07.22.013310
yt-dlp is up to date (stable@2023.07.22.013310)
[IPrima] Downloading login page
[IPrima] Logging in
[IPrima] Downloading token
[IPrima] Extracting URL: https://www.iprima.cz/serialy/zoo/zoo/126-epizoda-126
[IPrima] 126-epizoda-126: Downloading webpage
[IPrima] p1232709: Getting manifest URLs
ERROR: [IPrima] 126-epizoda-126: Access to stream infos forbidden
File "yt_dlp\extractor\common.py", line 715, in extract
File "yt_dlp\extractor\iprima.py", line 159, in _real_extract
File "yt_dlp\extractor\iprima.py", line 120, in _raise_access_error
File "yt_dlp\extractor\common.py", line 1193, in raise_no_formats

Thank you for your support.

@std-move
Contributor Author

Looks like you have a paid account. Another user has reported a similar error in issue #6524; I would advise you to add your report there.

With my free user account, the download works fine. As a temporary workaround, you can create a new free account and use it to download the show, albeit in 'non-HD' quality. Making the extractor work with paid accounts would probably require some changes (unless the content is DRM-protected, in which case they wouldn't help, as DRM circumvention is against yt-dlp's policies).

@bashonly bashonly added the pending-fixes PR has had changes requested label Sep 19, 2023
std-move and others added 2 commits September 20, 2023 12:17
simplify Array constructor replacement by using backreferences

Co-authored-by: Simon Sawicki <accounts@grub4k.xyz>
@std-move
Contributor Author

Thanks @Grub4K for the suggestion, backreferences totally escaped me - they really simplify the code.

Also added simple tests for map/array constructor conversion as suggested.

Please review the changes again.
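For readers following along, a sketch of how a backreference can replace a callback function in the substitution, together with a few test-style checks. The helper name `convert_array_calls` and the pattern are assumptions for illustration and are not quoted from the merged diff. Map conversion is left out here.

```python
import re


def convert_array_calls(code):
    # \g<1> in the replacement re-inserts whatever group 1 captured,
    # so no callback function is needed.
    return re.sub(r'(?:new\s+)?\bArray\((.*?)\)', r'[\g<1>]', code)


# Simple checks in the style of unit tests.
cases = {
    'Array(10)': '[10]',
    'Array()': '[]',
    '[1, new Array(2, 3), 4]': '[1, [2, 3], 4]',
    '[Array(1), Array(2)]': '[[1], [2]]',
}
results = {source: convert_array_calls(source) for source in cases}
for source, expected in cases.items():
    print(source, '->', results[source], results[source] == expected)
```

The last case covers multiple constructor calls in one input, which matters because `re.sub` replaces every non-overlapping match by default.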

Member

@Grub4K Grub4K left a comment

Looks good like that

use greedy search

Co-authored-by: bashonly <88596187+bashonly@users.noreply.github.com>
@bashonly bashonly removed the pending-fixes PR has had changes requested label Sep 20, 2023
@bashonly bashonly added the pending-review PR needs a review label Sep 20, 2023
Member

@pukkandan pukkandan left a comment

Pls merge as 2 commits @bashonly

@bashonly bashonly merged commit 568f080 into yt-dlp:master Sep 21, 2023
16 checks passed
@bashonly bashonly removed the pending-review PR needs a review label Sep 21, 2023
aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024
Successfully merging this pull request may close these issues.

IPrima broken
6 participants