[extractor/iprima] Fix extractor (relax nuxt function regex, add js_to_json hack) #7216

std-move · 2023-06-03T16:18:42Z

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

Fixes the extractor issue described in the latest comment on #6524, but not the originally reported issue (I cannot verify or reproduce that one, as I don't have a paid account). Details:

// Fixes #7229

The iPrima extractor has been broken because it fails to extract the nuxt_js data. The cause is additional code that is now present before the return statement of the function. Shortened example:

window.__NUXT__=(function(a,b,[...],n,[...],z){n.ulid="DUMMY";n.avatarUrl="<URI>";[...];return {data:{$dataId:{content:{id:q,drupalId:"141111",playId:r,[...]

This additional code, which modifies a passed parameter, is not needed to extract the data we are after. I have relaxed the regex to allow this additional code to be present.
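To illustrate the relaxation described above, here is a minimal standalone sketch. The names `STRICT_RE`, `RELAXED_RE`, `SAMPLE` and `extract_parts` are hypothetical and the patterns are modeled on this description rather than copied from common.py, so treat it as an approximation of the idea, not the merged code.

```python
import re

# Hypothetical "strict" pattern: the function body must start with `return`.
STRICT_RE = re.compile(
    r'\(function\((?P<arg_keys>.*?)\)\{return\s+(?P<js>\{.*?\})\s*;?\s*\}'
    r'\((?P<arg_vals>.*?)\)')

# Hypothetical "relaxed" pattern: any code may precede the `return` keyword.
RELAXED_RE = re.compile(
    r'\(function\((?P<arg_keys>.*?)\)\{.*?\breturn\s+(?P<js>\{.*?\})\s*;?\s*\}'
    r'\((?P<arg_vals>.*?)\)')

# Made-up, heavily shortened stand-in for the page's window.__NUXT__ payload.
SAMPLE = 'window.__NUXT__=(function(a,b){b.ulid="DUMMY";return {data:{id:a}}}("p1",{}));'


def extract_parts(pattern, js_code):
    """Return the named groups of the first match, or None if nothing matches."""
    match = pattern.search(js_code)
    return match.groupdict() if match else None


print(extract_parts(STRICT_RE, SAMPLE))   # no match: code precedes `return`
print(extract_parts(RELAXED_RE, SAMPLE))  # keys, returned object and values
```

The only difference between the two patterns is the lazy `.*?\b` in front of `return`, which skips the parameter-mutating statements without capturing them.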

The second issue was that js_to_json failed when an Array() call was present among the parameters. Shortened example of the input:

[null,false,"none","ČR",true,"series",[..],Array(10),"p1176840",[...],"p822076"]

Without the changes I've made to the function, quotes get added around Array, resulting in the following parsing exception:

json.decoder.JSONDecodeError: Expecting ',' delimiter in 'u","Array"(10),"p117': line 1 column 1707 (char 1706)

Matching the whole line in the regex is required because the Array call is only one element of the enclosing array. This code is not very nice, but then again js_to_json is kinda hacky anyway; it is not a proper parser.
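The failure mode can be reproduced without yt-dlp. The sketch below is an assumption-laden illustration: `replace_array_constructors` and `RAW` are hypothetical names, the input is a made-up shortened argument list, and the rewrite simply wraps the constructor arguments in brackets, which does not reproduce JavaScript semantics (`Array(10)` is really an array of length 10).

```python
import json
import re

# Made-up, shortened argument list containing an Array constructor call.
RAW = '[null,false,"none",true,"series",Array(10),"p1176840","p822076"]'


def replace_array_constructors(code):
    """Rewrite Array(...) / new Array(...) calls as bracketed literals."""
    def to_literal(match):
        # Keep the constructor arguments, drop the call syntax.
        return '[' + match.group(1) + ']'
    return re.sub(r'(?:new\s+)?Array\((.*?)\)', to_literal, code)


try:
    json.loads(RAW)
except json.JSONDecodeError as exc:
    # The bare identifier `Array` is not valid JSON.
    print('raw input rejected:', exc)

print(json.loads(replace_array_constructors(RAW)))
```

Because the replacement targets only the `Array(...)` call itself, the rest of the enclosing array is left untouched.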

This should make the extractor work again and hopefully not break other things. Suggestions to improve the code are very much welcome.

Template

Before submitting a pull request make sure you have:

At least skimmed through contributing guidelines including yt-dlp coding conventions
Searched the bugtracker for similar pull requests
Checked the code with flake8 and ran relevant tests

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Fix or improvement to an extractor (Make sure to add/update tests)
New extractor (Piracy websites will not be accepted)
Core bug fix/improvement
New feature (It is strongly recommended to open an issue first)

Copilot Summary

`🤖 Generated by Copilot at 5782255`

Summary

🐛🔧🚀

Fix Nuxt.js metadata extraction for some websites by improving regexes for common.py and adding Array constructor handling in _utils.py.

Nuxt.js metadata
Array or logic, no matter
Autumn leaves adapt

Walkthrough

Fix Nuxt.js metadata extraction by allowing any code before the return statement in the function that provides the data.
Handle arrays that are constructed with the Array constructor instead of the literal syntax in the metadata by converting them to JSON arrays.

bashonly · 2023-06-30T19:57:11Z

see #7229 (comment)

std-move · 2023-07-01T07:27:04Z

Adjusted for latest changes to iPrima

yt_dlp/extractor/iprima.py

bashonly · 2023-07-22T02:42:36Z

pre-release build available for testing:
edit: just update yt-dlp

milanv74 · 2023-08-18T18:14:38Z

Hello @std-move, this is my log with the test movie. I have paid access to prima+:

[debug] Command-line config: ['-vU', '-u', 'PRIVATE', '-p', 'PRIVATE', 'https://www.iprima.cz/serialy/zoo/zoo/126-epizoda-126']
[debug] Encodings: locale cp1250, fs utf-8, pref cp1250, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2023.07.22.013310 [eed15ae] (win32_dir)
[debug] Python 3.8.10 (CPython AMD64 64bit) - Windows-10-10.0.19045-SP0 (OpenSSL 1.1.1k 25 Mar 2021)
[debug] exe versions: none
[debug] Optional libraries: Cryptodome-3.18.0, brotli-1.0.9, certifi-2023.05.07, mutagen-1.46.0, sqlite3-2.6.0, websockets-11.0.3
[debug] Proxy map: {}
[debug] Loaded 1857 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Available version: stable@2023.07.06, Current version: stable@2023.07.22.013310
yt-dlp is up to date (stable@2023.07.22.013310)
[IPrima] Downloading login page
[IPrima] Logging in
[IPrima] Downloading token
[IPrima] Extracting URL: https://www.iprima.cz/serialy/zoo/zoo/126-epizoda-126
[IPrima] 126-epizoda-126: Downloading webpage
[IPrima] p1232709: Getting manifest URLs
ERROR: [IPrima] 126-epizoda-126: Access to stream infos forbidden
File "yt_dlp\extractor\common.py", line 715, in extract
File "yt_dlp\extractor\iprima.py", line 159, in _real_extract
File "yt_dlp\extractor\iprima.py", line 120, in _raise_access_error
File "yt_dlp\extractor\common.py", line 1193, in raise_no_formats

Thank you for your support.

std-move · 2023-08-18T18:48:32Z

Looks like you have a paid account. Another user has reported a similar error in issue #6524; I would advise you to add your report to that issue.

With my free user account, the download works fine. So as a temporary workaround, you can create a new free account and use that one to download the show, albeit in 'non-HD' quality. To make the extractor work with paid accounts, some changes would probably be required (unless the content is DRMed, in which case they wouldn't help, as DRM circumvention is against yt-dlp's policies).

yt_dlp/utils/_utils.py

simplify Array constructor replacement by using backreferences Co-authored-by: Simon Sawicki <accounts@grub4k.xyz>

std-move · 2023-09-20T10:21:17Z

Thanks @Grub4K for the suggestion, backreferences totally escaped me - they really simplify the code.

Also added simple tests for map/array constructor conversion as suggested.

Please review the changes again.
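As a rough idea of what such tests could look like, here is a standalone sketch. `convert_constructors` is a hypothetical helper written for this example, not yt-dlp's js_to_json, and the inputs are made up. It uses a backreference for the Array case and a callback for the Map case, since the key/value pairs have to be regrouped into an object.

```python
import json
import re


def convert_constructors(code):
    """Rewrite Array(...) and new Map([...]) calls into JSON-compatible text."""
    # Array(...) / new Array(...): a backreference is enough,
    # the arguments are reused verbatim inside brackets.
    code = re.sub(r'(?:new\s+)?Array\((.*?)\)', r'[\g<1>]', code)

    def create_map(match):
        # new Map([[k, v], ...]) -> {"k": v, ...}; assumes the pair list is valid JSON.
        return json.dumps(dict(json.loads(match.group(1) or '[]')))

    return re.sub(r'new Map\((\[.*?\])?\)', create_map, code)


# Simple checks in the spirit of a unit test.
assert json.loads(convert_constructors('Array(5, 10)')) == [5, 10]
assert json.loads(convert_constructors('new Array(15,5)')) == [15, 5]
assert json.loads(convert_constructors('new Map([["a", 5]])')) == {'a': 5}
assert json.loads(convert_constructors('new Map()')) == {}
print('all constructor conversions behave as expected')
```

The backreference form keeps the Array rule to a single line, which is the simplification the review suggested.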

Grub4K

Looks good like that

yt_dlp/extractor/iprima.py

use greedy search Co-authored-by: bashonly <88596187+bashonly@users.noreply.github.com>

pukkandan

Pls merge as 2 commits @bashonly

yt_dlp/extractor/common.py

Closes yt-dlp#7229 Authored by: std-move

std-move added 2 commits June 3, 2023 17:55

[extractor/iprima] Fix extractor (relax nuxt function regex, add js_to_json hack)

5782255

fixup! [extractor/iprima] Fix extractor (relax nuxt function regex, add js_to_json hack) flake8 fix

7246468

std-move mentioned this pull request Jun 4, 2023

IPrima broken #7229 (closed)

bashonly added the enhancement (New feature or request) and site-bug (Issue with a specific website) labels Jun 14, 2023

fixup! fixup! [extractor/iprima] Fix extractor (relax nuxt function regex, add js_to_json hack) replace all occurrences of Array constructor, not just a single one

5071442

[extractor/iprima] fix extractor (get video id from nuxt data json)

0bbb729

pukkandan reviewed Jul 1, 2023

yt_dlp/extractor/iprima.py (outdated, resolved)

fixup! [extractor/iprima] fix extractor (get video id from nuxt data json) make the code more declarative as suggested in review

142b58c

bashonly mentioned this pull request Jul 9, 2023

I can't download video from iPrima.cz #7557 (closed)

bashonly mentioned this pull request Jul 18, 2023

Unable to extract __NUXT__ Problem with IPrima.cz #7633 (closed)

Grub4K requested changes Sep 19, 2023

yt_dlp/utils/_utils.py (outdated, resolved)

bashonly added the pending-fixes label (PR has had changes requested) Sep 19, 2023

std-move and others added 2 commits September 20, 2023 12:17

Update yt_dlp/utils/_utils.py

f60f0c3

simplify Array constructor replacement by using backreferences Co-authored-by: Simon Sawicki <accounts@grub4k.xyz>

add simple test for Javascript Array/Map constructors in js_to_json

99bd7b7

std-move requested a review from Grub4K September 20, 2023 10:21

Grub4K approved these changes Sep 20, 2023

bashonly requested changes Sep 20, 2023

yt_dlp/extractor/iprima.py (outdated, resolved)

use search_json instead of search_regex and parse_json

389f903

std-move requested a review from bashonly September 20, 2023 12:09

bashonly reviewed Sep 20, 2023

yt_dlp/extractor/iprima.py (outdated, resolved)

Update yt_dlp/extractor/iprima.py

f18c9e4

use greedy search Co-authored-by: bashonly <88596187+bashonly@users.noreply.github.com>

bashonly removed the pending-fixes label (PR has had changes requested) Sep 20, 2023

bashonly approved these changes Sep 20, 2023

bashonly added the pending-review label (PR needs a review) Sep 20, 2023

pukkandan approved these changes Sep 21, 2023

yt_dlp/extractor/common.py (outdated, resolved)

pukkandan assigned bashonly Sep 21, 2023

bashonly added 2 commits September 21, 2023 17:00

Merge branch 'yt-dlp:master' into pr/7216

97b20cd

Merge branch 'yt-dlp:master' into pr/7216

c1d3c6f

bashonly merged commit 568f080 into yt-dlp:master Sep 21, 2023
16 checks passed

bashonly removed the pending-review label (PR needs a review) Sep 21, 2023

aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024

[ie/iprima] Fix extractor (yt-dlp#7216)

17a8813

Closes yt-dlp#7229 Authored by: std-move
