Unable to extract mp3Form form for InfoQ video #31131

russtoku · 2022-08-02T18:41:03Z

Checklist

I'm reporting a broken site support
I've verified that I'm running youtube-dl version 2021.12.17
I've checked that all provided URLs are alive and playable in a browser
I've checked that all URLs and arguments with special characters are properly quoted or escaped
I've searched the bugtracker for similar issues including closed ones

Verbose log

$ youtube-dl  --verbose  'https://www.infoq.com/presentations/problems-async-arch/'
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--verbose', 'https://www.infoq.com/presentations/problems-async-arch/']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 3.10.5 (CPython) - macOS-12.5-x86_64-i386-64bit
[debug] exe versions: ffmpeg 5.1, ffprobe 5.1
[debug] Proxy map: {}
[InfoQ] problems-async-arch: Downloading webpage
ERROR: Unable to extract mp3Form form; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "/usr/local/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.10/site-packages/youtube_dl/YoutubeDL.py", line 815, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.10/site-packages/youtube_dl/YoutubeDL.py", line 836, in __extract_info
    ie_result = ie.extract(url)
  File "/usr/local/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.10/site-packages/youtube_dl/extractor/common.py", line 534, in extract
    ie_result = self._real_extract(url)
  File "/usr/local/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.10/site-packages/youtube_dl/extractor/infoq.py", line 128, in _real_extract
    + self._extract_http_audio(webpage, video_id))
  File "/usr/local/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.10/site-packages/youtube_dl/extractor/infoq.py", line 93, in _extract_http_audio
    fields = self._form_hidden_inputs('mp3Form', webpage)
  File "/usr/local/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.10/site-packages/youtube_dl/extractor/common.py", line 1367, in _form_hidden_inputs
    form = self._search_regex(
  File "/usr/local/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.10/site-packages/youtube_dl/extractor/common.py", line 1012, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
youtube_dl.utils.RegexNotFoundError: Unable to extract mp3Form form; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

Description

I can watch the video in a browser when I'm logged in or not. When downloading the video with youtube-dl, I get the error, "Unable to extract mp3Form form".

dirkf · 2022-08-02T21:46:29Z

The page doesn't have the <form...> with id="mp3Form" that the extractor expects. This shouldn't be a crashing error. The http_video format has audio.
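
To illustrate the failure outside youtube-dl, here is a minimal sketch that applies the same pattern `_form_hidden_inputs()` uses in 2021.12.17 (it appears in the common.py diff later in this thread). The helper name `find_form` and both page snippets are made up for the example; they are not real InfoQ markup.

```python
import re

# Pattern copied from _form_hidden_inputs() as shown in the common.py diff.
FORM_RE = r'(?is)<form[^>]+?id=(["\'])%s\1[^>]*>(?P<form>.+?)</form>'


def find_form(form_id, html):
    """Hypothetical helper: return the form's inner HTML, or None if absent."""
    mobj = re.search(FORM_RE % form_id, html)
    return mobj.group('form') if mobj else None


# Invented snippets: one page with the form, one without it.
old_page = ('<form id="mp3Form" action="/dl">'
            '<input type="hidden" name="filename" value="talk.mp3"/></form>')
new_page = '<div id="video"><video src="talk.mp4"></video></div>'

print(find_form('mp3Form', old_page))  # inner HTML of the form
print(find_form('mp3Form', new_page))  # None: nothing for the extractor to parse
```

In the extractor, the second case is the one that currently raises RegexNotFoundError instead of being treated as "no separate audio".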

russtoku · 2022-08-02T22:06:04Z

I was able to work around this problem by modifying the _real_extract() method in youtube_dl/extractor/infoq.py and commenting out the self._extract_http_audio(webpage, video_id) call in line 128.

dirkf · 2022-08-02T22:16:19Z

That would work in this case.

More generally, let's assume that the targeted form is sometimes present. We can either

trap the failing code inside _extract_http_audio() with a try:, or
extend the _form_hidden_inputs() method signature with **kwargs passed into its _search_regex() call so that fatal=False can be passed down.
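
As a rough sketch of the two options, using simplified stand-ins rather than the real youtube-dl classes: `search_regex`, `form_hidden_inputs`, and the two `audio_fields_*` functions below are hypothetical, the hidden-input parser is a toy, and the warning that the real `_search_regex()` logs for fatal=False is left out.

```python
import re


class RegexNotFoundError(Exception):
    """Stand-in for youtube_dl.utils.RegexNotFoundError."""


def search_regex(pattern, string, name, fatal=True, group=None):
    # Simplified stand-in for InfoExtractor._search_regex() (no warning logged).
    mobj = re.search(pattern, string)
    if mobj:
        return mobj.group(group) if group else mobj.group(1)
    if fatal:
        raise RegexNotFoundError('Unable to extract %s' % name)
    return None


def form_hidden_inputs(form_id, html, **kwargs):
    # Option 2: forward extra keyword arguments (e.g. fatal=False) downwards.
    form = search_regex(
        r'(?is)<form[^>]+?id=(["\'])%s\1[^>]*>(?P<form>.+?)</form>' % form_id,
        html, '%s form' % form_id, group='form', **kwargs)
    # Toy parser: only handles <input ... name="..." ... value="..."> in that order.
    return dict(re.findall(
        r'<input[^>]+name="([^"]+)"[^>]+value="([^"]*)"', form or ''))


def audio_fields_option_1(webpage):
    # Option 1: trap the failure at the call site, touching only the extractor.
    try:
        return form_hidden_inputs('mp3Form', webpage)
    except RegexNotFoundError:
        return {}


def audio_fields_option_2(webpage):
    # Option 2: ask the shared helper not to raise.
    return form_hidden_inputs('mp3Form', webpage, fatal=False)


page_without_form = '<html><body><video src="talk.mp4"></video></body></html>'
print(audio_fields_option_1(page_without_form))  # {}
print(audio_fields_option_2(page_without_form))  # {}
```

Either way the caller ends up with an empty dict, so the existing `fields.get('filename')` check can return an empty format list.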

russtoku · 2022-08-02T22:42:47Z

The first option occurred to me because it minimizes the number of files touched (infoq.py only, rather than infoq.py and common.py) and keeps the blast radius small.

The second option would be better if other extractors might need to deal with a similar situation. This would expose more of the facilities that _search_regex() provides.

Thanks for looking into this!

russtoku · 2022-08-03T00:59:30Z

I tried extending the _form_hidden_inputs() method to pass fatal=False and it does download the video from InfoQ but with WARNING: unable to extract mp3Form form.

$ youtube-dl -v https://www.infoq.com/presentations/problems-async-arch/
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', 'https://www.infoq.com/presentations/problems-async-arch/']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 3.10.5 (CPython) - macOS-12.5-x86_64-i386-64bit
[debug] exe versions: ffmpeg 5.1, ffprobe 5.1
[debug] Proxy map: {}
[InfoQ] problems-async-arch: Downloading webpage
WARNING: unable to extract mp3Form form; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
[debug] Default format spec: bestvideo+bestaudio/best
[debug] Invoking downloader on 'https://videoh.infoq.com/presentations/21-nov-unblockeddesign.mp4?Policy=eyJTdGF0ZW1lbnQiOiBbeyJSZXNvdXJjZSI6IioiLCJDb25kaXRpb24iOnsiRGF0ZUxlc3NUaGFuIjp7IkFXUzpFcG9jaFRpbWUiOjE2NTk0OTQ0MDZ9LCJJcEFkZHJlc3MiOnsiQVdTOlNvdXJjZUlwIjoiMC4wLjAuMC8wIn19fV19&Signature=KUA9-Ak19p9ZPMBvwk1KTESuNeZXi~nLl8HSlcyOEUmRP6CRa~5LXQsBxOqSRgyHeKDMd9OFidkSyysMkwbav3msMuV6nfR8P1KdbUKSfX-c980~KvPQn51X15IxLQpYrPjfoU-TMiGp232JwL3i5vizxcX-8MN3KLuIHmqY1RQ_&Key-Pair-Id=APKAIMZVI7QH4C5YKH6Q'
[download] Destination: Unblocked by Design-problems-async-arch.mp4
[download]   1.8% of 217.90MiB at  7.89MiB/s ETA 00:27^C
ERROR: Interrupted by user

This is what I did:

$ diff -u extractor/infoq.py-orig extractor/infoq.py
--- extractor/infoq.py-orig	2021-12-16 09:02:01.000000000 -1000
+++ extractor/infoq.py	2022-08-02 14:31:51.000000000 -1000
@@ -90,7 +90,7 @@
         }]
 
     def _extract_http_audio(self, webpage, video_id):
-        fields = self._form_hidden_inputs('mp3Form', webpage)
+        fields = self._form_hidden_inputs('mp3Form', webpage, fatal=False)
         http_audio_url = fields.get('filename')
         if not http_audio_url:
             return []

$ diff -u extractor/common.py-orig extractor/common.py
--- extractor/common.py-orig	2021-12-16 09:02:01.000000000 -1000
+++ extractor/common.py	2022-08-02 14:33:30.000000000 -1000
@@ -1363,10 +1363,12 @@
                 hidden_inputs[name] = value
         return hidden_inputs
 
-    def _form_hidden_inputs(self, form_id, html):
+    def _form_hidden_inputs(self, form_id, html, fatal=True):
         form = self._search_regex(
             r'(?is)<form[^>]+?id=(["\'])%s\1[^>]*>(?P<form>.+?)</form>' % form_id,
-            html, '%s form' % form_id, group='form')
+            html, '%s form' % form_id, group='form', fatal=fatal)
+        if not form:
+            form = ''
         return self._hidden_inputs(form)
 
     def _sort_formats(self, formats, field_preference=None):

The warning comes from the _search_regex() method in common.py. When the fatal parameter is passed as False and no match is found, execution reaches the else clause and the warning message is printed with the bug_reports_message() text. No exception is raised.

To have the warning message not printed when passing fatal=False to the _form_hidden_inputs() method, a check for no match object and not fatal is needed in the else clause of the _search_regex() method.

$ diff -u extractor/common.py-orig extractor/common.py
--- extractor/common.py-orig	2021-12-16 09:02:01.000000000 -1000
+++ extractor/common.py	2022-08-02 14:51:27.000000000 -1000
@@ -1011,6 +1011,8 @@
         elif fatal:
             raise RegexNotFoundError('Unable to extract %s' % _name)
         else:
+            if not mobj and not fatal:
+                return None
             self._downloader.report_warning('unable to extract %s' % _name + bug_reports_message())
             return None
 
@@ -1363,10 +1365,12 @@
                 hidden_inputs[name] = value
         return hidden_inputs
 
-    def _form_hidden_inputs(self, form_id, html):
+    def _form_hidden_inputs(self, form_id, html, fatal=True):
         form = self._search_regex(
             r'(?is)<form[^>]+?id=(["\'])%s\1[^>]*>(?P<form>.+?)</form>' % form_id,
-            html, '%s form' % form_id, group='form')
+            html, '%s form' % form_id, group='form', fatal=fatal)
+        if not form:
+            form = ''
         return self._hidden_inputs(form)
 
     def _sort_formats(self, formats, field_preference=None):

dirkf · 2022-08-03T11:36:55Z

You can use default=None (no warning log message) or fatal=False (warning message), depending on whether the missing form is expected or its absence indicates that more work is needed on the extractor. There isn't currently, though there could be, an expected parameter alongside fatal that, when True, would suppress the bug_reports_message() text in the warning; that would suit cases where the user should still be able to see why no separate audio was extracted.

So I would (and did, just like your diff above) use fatal=False initially. Then, depending on tests:

if the form is never there in current pages, I'd just remove the _extract_http_audio() method;
if the form is still served in old pages but not in new ones, I'd switch to default=None.
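
The difference between the two keyword arguments can be sketched with a simplified model of the branch order in `_search_regex()`. This is a stand-in, not the real method: the module-level `search_regex`, the `warnings` list (used instead of the downloader's warning call), and the sample pattern are all assumptions made for the illustration.

```python
import re

NO_DEFAULT = object()  # sentinel meaning "caller supplied no default"
warnings = []          # stand-in for the downloader's warning output


class RegexNotFoundError(Exception):
    """Stand-in for youtube_dl.utils.RegexNotFoundError."""


def search_regex(pattern, string, name, default=NO_DEFAULT, fatal=True):
    mobj = re.search(pattern, string)
    if mobj:
        return mobj.group(1)
    elif default is not NO_DEFAULT:
        # Absence is expected: return the default silently.
        return default
    elif fatal:
        raise RegexNotFoundError('Unable to extract %s' % name)
    else:
        # Absence is survivable but worth reporting.
        warnings.append('unable to extract %s' % name)
        return None


html = '<p>no form here</p>'
pattern = r'<form id="mp3Form">(.+?)</form>'  # simplified pattern for the demo

print(search_regex(pattern, html, 'mp3Form form', default=None), warnings)
print(search_regex(pattern, html, 'mp3Form form', fatal=False), warnings)
```

With default=None the warnings list stays empty; with fatal=False the same None comes back but one warning is recorded.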

Having a look at yt-dlp's extractor, the case is already handled and ignored there. Equivalent yt-dl code would be:

     def _extract_http_audio(self, webpage, video_id):
-        fields = self._form_hidden_inputs('mp3Form', webpage)
+        try:
+            fields = self._form_hidden_inputs('mp3Form', webpage)
+        except ExtractorError:
+            fields = {}
         http_audio_url = fields.get('filename')
...

This code assumes that any kind of ExtractorError should be ignored here, not just RegexNotFoundError. For now I'll probably just push this for compatibility.
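
A small sketch of what that assumption means in practice, again with stand-in classes rather than imports from youtube_dl.utils (where RegexNotFoundError is a subclass of ExtractorError). The helper functions and the "other failure" are hypothetical.

```python
class ExtractorError(Exception):
    """Stand-in for youtube_dl.utils.ExtractorError."""


class RegexNotFoundError(ExtractorError):
    """Stand-in for youtube_dl.utils.RegexNotFoundError."""


def fields_broad(get_fields):
    # Mirrors the diff above: any ExtractorError means "no audio fields".
    try:
        return get_fields()
    except ExtractorError:
        return {}


def fields_narrow(get_fields):
    # Alternative: only a missing form is ignored; other errors propagate.
    try:
        return get_fields()
    except RegexNotFoundError:
        return {}


def missing_form():
    raise RegexNotFoundError('Unable to extract mp3Form form')


def other_failure():
    raise ExtractorError('some other extraction problem')  # hypothetical


print(fields_broad(missing_form))   # {}
print(fields_broad(other_failure))  # {} (the unrelated error is swallowed too)
print(fields_narrow(missing_form))  # {}
try:
    fields_narrow(other_failure)
except ExtractorError as err:
    print('propagated:', err)
```

The broad catch matches the yt-dlp behaviour described above; the narrow one would be the choice if unrelated extraction failures should still surface.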

russtoku · 2022-08-03T17:38:35Z

Awesome! Thanks, again!

russtoku · 2022-08-19T20:19:35Z

Thanks for the commit!

dirkf added the broken-IE problem with existing site extraction label Aug 2, 2022

gudata added a commit to gudata/youtube-dl that referenced this issue Aug 18, 2022

Implement the proposed fix at issue ytdl-org#31131 for the infoq extr…

87aaaca

…actor

gudata mentioned this issue Aug 18, 2022

Implement the proposed fix at issue #31131 for the infoq extractor #31181

Merged

11 tasks

dirkf closed this as completed in #31181 Aug 19, 2022

dirkf added a commit that referenced this issue Aug 19, 2022

[infoq] Avoid crash if the page has no mp3Form

a8d5316

* proposed fix for issue #31131, aligns with yt-dlp Co-authored-by: dirkf <fieldhouse@gmx.net>

belamenso pushed a commit to belamenso/youtube-dl that referenced this issue Sep 19, 2022

[infoq] Avoid crash if the page has no mp3Form

722e314

* proposed fix for issue ytdl-org#31131, aligns with yt-dlp Co-authored-by: dirkf <fieldhouse@gmx.net>

alxlive pushed a commit to alxlive/youtube-dl that referenced this issue Feb 27, 2023

[infoq] Avoid crash if the page has no mp3Form

a5fcc6f

* proposed fix for issue ytdl-org#31131, aligns with yt-dlp Co-authored-by: dirkf <fieldhouse@gmx.net>
