Unable to extract mp3Form form for InfoQ video #31131

Closed
5 tasks done
russtoku opened this issue Aug 2, 2022 · 8 comments · Fixed by #31181
Labels
broken-IE problem with existing site extraction

Comments

russtoku commented Aug 2, 2022

Checklist

  • I'm reporting a broken site support
  • I've verified that I'm running youtube-dl version 2021.12.17
  • I've checked that all provided URLs are alive and playable in a browser
  • I've checked that all URLs and arguments with special characters are properly quoted or escaped
  • I've searched the bugtracker for similar issues including closed ones

Verbose log

$ youtube-dl  --verbose  'https://www.infoq.com/presentations/problems-async-arch/'
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--verbose', 'https://www.infoq.com/presentations/problems-async-arch/']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 3.10.5 (CPython) - macOS-12.5-x86_64-i386-64bit
[debug] exe versions: ffmpeg 5.1, ffprobe 5.1
[debug] Proxy map: {}
[InfoQ] problems-async-arch: Downloading webpage
ERROR: Unable to extract mp3Form form; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "/usr/local/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.10/site-packages/youtube_dl/YoutubeDL.py", line 815, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.10/site-packages/youtube_dl/YoutubeDL.py", line 836, in __extract_info
    ie_result = ie.extract(url)
  File "/usr/local/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.10/site-packages/youtube_dl/extractor/common.py", line 534, in extract
    ie_result = self._real_extract(url)
  File "/usr/local/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.10/site-packages/youtube_dl/extractor/infoq.py", line 128, in _real_extract
    + self._extract_http_audio(webpage, video_id))
  File "/usr/local/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.10/site-packages/youtube_dl/extractor/infoq.py", line 93, in _extract_http_audio
    fields = self._form_hidden_inputs('mp3Form', webpage)
  File "/usr/local/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.10/site-packages/youtube_dl/extractor/common.py", line 1367, in _form_hidden_inputs
    form = self._search_regex(
  File "/usr/local/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.10/site-packages/youtube_dl/extractor/common.py", line 1012, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
youtube_dl.utils.RegexNotFoundError: Unable to extract mp3Form form; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

Description

I can watch the video in a browser whether or not I'm logged in. When downloading the video with youtube-dl, I get the error "Unable to extract mp3Form form".

Contributor

dirkf commented Aug 2, 2022

The page doesn't have the <form...> with id="mp3Form" that the extractor expects. This shouldn't be a crashing error. The http_video format has audio.
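
For illustration, here is a minimal standalone sketch of why the lookup fails. It reuses the form regex from `_form_hidden_inputs()` in `common.py` (the one visible in the diffs further down this thread). The helper name `find_form` and both HTML fragments are made up for the example and are not InfoQ's actual markup.

```python
import re

# Regex as it appears in _form_hidden_inputs() in youtube_dl/extractor/common.py.
FORM_RE = r'(?is)<form[^>]+?id=(["\'])%s\1[^>]*>(?P<form>.+?)</form>'


def find_form(form_id, html):
    """Hypothetical helper: return the form's inner HTML, or None if absent."""
    mobj = re.search(FORM_RE % form_id, html)
    return mobj.group('form') if mobj else None


# Made-up page fragments, for illustration only.
old_page = '<form id="mp3Form"><input type="hidden" name="filename" value="a.mp3"></form>'
new_page = '<div id="video"><video src="talk.mp4"></video></div>'

print(find_form('mp3Form', old_page))  # the hidden <input> markup
print(find_form('mp3Form', new_page))  # None: this is the case that raised
```

In the extractor, the second case is where `_search_regex()` raises `RegexNotFoundError` instead of returning `None`.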

dirkf added the broken-IE (problem with existing site extraction) label Aug 2, 2022
Author

russtoku commented Aug 2, 2022

I was able to work around this problem by modifying the _real_extract() method in youtube_dl/extractor/infoq.py and commenting out the self._extract_http_audio(webpage, video_id) call in line 128.

Contributor

dirkf commented Aug 2, 2022

That would work in this case.

More generally, let's assume that the targeted form is sometimes present. We can either

  • trap the failing code inside _extract_http_audio() with a try:, or
  • extend the _form_hidden_inputs() method signature with **kwargs passed into its _search_regex() call so that fatal=False can be passed down.
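
The two options can be sketched side by side. This is a simplified standalone model, not the youtube-dl code: `search_regex`, `form_hidden_inputs`, the input-parsing regex, and both `audio_url_*` functions are stand-ins written for this example, and the sample HTML is invented.

```python
import re


class RegexNotFoundError(Exception):
    """Stand-in for youtube_dl.utils.RegexNotFoundError."""


def search_regex(pattern, string, name, fatal=True, group=None):
    """Much-simplified stand-in for InfoExtractor._search_regex()."""
    mobj = re.search(pattern, string)
    if mobj:
        return mobj.group(group) if group else mobj.group(1)
    if fatal:
        raise RegexNotFoundError('Unable to extract %s' % name)
    print('WARNING: unable to extract %s' % name)
    return None


def form_hidden_inputs(form_id, html, **kwargs):
    """Stand-in for _form_hidden_inputs(); **kwargs reaches search_regex()."""
    form = search_regex(
        r'(?is)<form[^>]+?id=(["\'])%s\1[^>]*>(?P<form>.+?)</form>' % form_id,
        html, '%s form' % form_id, group='form', **kwargs)
    # Simplified hidden-input parsing; a missing form gives an empty dict.
    return dict(re.findall(
        r'<input[^>]+name="([^"]+)"[^>]+value="([^"]*)"', form or ''))


def audio_url_with_try(webpage):
    """Option 1: trap the failure at the call site."""
    try:
        fields = form_hidden_inputs('mp3Form', webpage)
    except RegexNotFoundError:
        fields = {}
    return fields.get('filename')


def audio_url_with_fatal_false(webpage):
    """Option 2: pass fatal=False down through **kwargs."""
    fields = form_hidden_inputs('mp3Form', webpage, fatal=False)
    return fields.get('filename')


with_form = '<form id="mp3Form"><input type="hidden" name="filename" value="a.mp3"></form>'
without_form = '<div>no form here</div>'

print(audio_url_with_try(with_form), audio_url_with_try(without_form))
print(audio_url_with_fatal_false(with_form), audio_url_with_fatal_false(without_form))
```

Both variants return `None` for the page without the form. The visible difference in this model is that option 2 also prints a warning.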

Author

russtoku commented Aug 2, 2022

The first option occurred to me because it minimizes the number of files touched (infoq.py alone vs. infoq.py and common.py) and keeps the blast radius smaller.

The second option would be better if other extractors might need to deal with a similar situation. It would also expose more of the facilities that _search_regex() provides.

Thanks for looking into this!

Author

russtoku commented Aug 3, 2022

I tried extending the _form_hidden_inputs() method to pass fatal=False and it does download the video from InfoQ but with WARNING: unable to extract mp3Form form.

$ youtube-dl -v https://www.infoq.com/presentations/problems-async-arch/
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', 'https://www.infoq.com/presentations/problems-async-arch/']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 3.10.5 (CPython) - macOS-12.5-x86_64-i386-64bit
[debug] exe versions: ffmpeg 5.1, ffprobe 5.1
[debug] Proxy map: {}
[InfoQ] problems-async-arch: Downloading webpage
WARNING: unable to extract mp3Form form; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
[debug] Default format spec: bestvideo+bestaudio/best
[debug] Invoking downloader on 'https://videoh.infoq.com/presentations/21-nov-unblockeddesign.mp4?Policy=eyJTdGF0ZW1lbnQiOiBbeyJSZXNvdXJjZSI6IioiLCJDb25kaXRpb24iOnsiRGF0ZUxlc3NUaGFuIjp7IkFXUzpFcG9jaFRpbWUiOjE2NTk0OTQ0MDZ9LCJJcEFkZHJlc3MiOnsiQVdTOlNvdXJjZUlwIjoiMC4wLjAuMC8wIn19fV19&Signature=KUA9-Ak19p9ZPMBvwk1KTESuNeZXi~nLl8HSlcyOEUmRP6CRa~5LXQsBxOqSRgyHeKDMd9OFidkSyysMkwbav3msMuV6nfR8P1KdbUKSfX-c980~KvPQn51X15IxLQpYrPjfoU-TMiGp232JwL3i5vizxcX-8MN3KLuIHmqY1RQ_&Key-Pair-Id=APKAIMZVI7QH4C5YKH6Q'
[download] Destination: Unblocked by Design-problems-async-arch.mp4
[download]   1.8% of 217.90MiB at  7.89MiB/s ETA 00:27^C
ERROR: Interrupted by user

This is what I did:

$ diff -u extractor/infoq.py-orig extractor/infoq.py
--- extractor/infoq.py-orig	2021-12-16 09:02:01.000000000 -1000
+++ extractor/infoq.py	2022-08-02 14:31:51.000000000 -1000
@@ -90,7 +90,7 @@
         }]
 
     def _extract_http_audio(self, webpage, video_id):
-        fields = self._form_hidden_inputs('mp3Form', webpage)
+        fields = self._form_hidden_inputs('mp3Form', webpage, fatal=False)
         http_audio_url = fields.get('filename')
         if not http_audio_url:
             return []

$ diff -u extractor/common.py-orig extractor/common.py
--- extractor/common.py-orig	2021-12-16 09:02:01.000000000 -1000
+++ extractor/common.py	2022-08-02 14:33:30.000000000 -1000
@@ -1363,10 +1363,12 @@
                 hidden_inputs[name] = value
         return hidden_inputs
 
-    def _form_hidden_inputs(self, form_id, html):
+    def _form_hidden_inputs(self, form_id, html, fatal=True):
         form = self._search_regex(
             r'(?is)<form[^>]+?id=(["\'])%s\1[^>]*>(?P<form>.+?)</form>' % form_id,
-            html, '%s form' % form_id, group='form')
+            html, '%s form' % form_id, group='form', fatal=fatal)
+        if not form:
+            form = ''
         return self._hidden_inputs(form)
 
     def _sort_formats(self, formats, field_preference=None):

The warning comes from the _search_regex() method in common.py. When fatal is passed as False and no match is found, execution reaches the else clause, which prints the warning message with the bug_reports_message() text. No exception is raised.

To suppress the warning when fatal=False is passed to the _form_hidden_inputs() method, the else clause of the _search_regex() method needs a check for no match object and not fatal:

$ diff -u extractor/common.py-orig extractor/common.py
--- extractor/common.py-orig	2021-12-16 09:02:01.000000000 -1000
+++ extractor/common.py	2022-08-02 14:51:27.000000000 -1000
@@ -1011,6 +1011,8 @@
         elif fatal:
             raise RegexNotFoundError('Unable to extract %s' % _name)
         else:
+            if not mobj and not fatal:
+                return None
             self._downloader.report_warning('unable to extract %s' % _name + bug_reports_message())
             return None
 
@@ -1363,10 +1365,12 @@
                 hidden_inputs[name] = value
         return hidden_inputs
 
-    def _form_hidden_inputs(self, form_id, html):
+    def _form_hidden_inputs(self, form_id, html, fatal=True):
         form = self._search_regex(
             r'(?is)<form[^>]+?id=(["\'])%s\1[^>]*>(?P<form>.+?)</form>' % form_id,
-            html, '%s form' % form_id, group='form')
+            html, '%s form' % form_id, group='form', fatal=fatal)
+        if not form:
+            form = ''
         return self._hidden_inputs(form)
 
     def _sort_formats(self, formats, field_preference=None):
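
One tentative observation on the `_search_regex()` change in the diff above: by the time the `else` clause runs, there is already no match and `fatal` is already falsy, so the added check appears to be always true and the warning would no longer be printed for any caller. The sketch below models that control flow in isolation. `patched_search_regex` and its `warn` parameter are stand-ins for this example, and the `default` branch is included on the assumption that the real method has one ahead of `elif fatal`.

```python
import re

NO_DEFAULT = object()  # sentinel: "caller supplied no default"


class RegexNotFoundError(Exception):
    """Stand-in for youtube_dl.utils.RegexNotFoundError."""


def patched_search_regex(pattern, string, name,
                         default=NO_DEFAULT, fatal=True, warn=print):
    """Simplified model of _search_regex() with the extra check applied."""
    mobj = re.search(pattern, string)
    if mobj:
        return mobj.group(1)
    elif default is not NO_DEFAULT:
        return default
    elif fatal:
        raise RegexNotFoundError('Unable to extract %s' % name)
    else:
        # Reaching this branch already implies "no match" and "not fatal",
        # so this condition holds every time.
        if not mobj and not fatal:
            return None
        warn('WARNING: unable to extract %s' % name)  # never reached
        return None


warnings = []
result = patched_search_regex(
    r'<form id="mp3Form">(.+?)</form>', '<p>no form</p>', 'mp3Form form',
    fatal=False, warn=warnings.append)
print(result, warnings)  # None []
```

If that reading is right, the change silences the warning globally rather than only for the InfoQ extractor.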

Contributor

dirkf commented Aug 3, 2022

You can use default=None (no warning log message) or fatal=False (warning message), depending on whether the missing form is expected or its absence indicates that more work is needed on the extractor. There isn't, though there could be, an expected parameter alongside fatal that, when True, would suppress the bug_reports_message() text in the warning, for cases where the user should still be able to see why no separate audio was extracted.

So I would (and did, just like your diff above) use fatal=False initially. Then, depending on tests:

  • if the form is never there in current pages, I'd just remove the _extract_http_audio() method;
  • if the form is still served in old pages but not in new ones, I'd switch to default=None.
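
The difference between default=None and fatal=False described above can be modeled roughly as follows. This is a simplified sketch, not the library source: `search_regex` is a stand-in for `_search_regex()`, and the `warn` parameter is hypothetical, added only so the example can record warnings instead of printing them.

```python
import re

NO_DEFAULT = object()  # sentinel: "caller supplied no default"


class RegexNotFoundError(Exception):
    """Stand-in for youtube_dl.utils.RegexNotFoundError."""


def search_regex(pattern, string, name,
                 default=NO_DEFAULT, fatal=True, warn=print):
    """Simplified model of the three outcomes when nothing matches."""
    mobj = re.search(pattern, string)
    if mobj:
        return mobj.group(1)
    elif default is not NO_DEFAULT:
        return default  # silent: the miss was anticipated
    elif fatal:
        raise RegexNotFoundError('Unable to extract %s' % name)
    else:
        warn('WARNING: unable to extract %s' % name)
        return None


page = '<p>no form here</p>'
pattern = r'<form id="mp3Form">(.+?)</form>'
log = []

# default=None: missing form is expected, nothing is logged.
quiet = search_regex(pattern, page, 'mp3Form form', default=None, warn=log.append)
print(quiet, log)  # None []

# fatal=False: extraction continues, but the miss is reported.
noisy = search_regex(pattern, page, 'mp3Form form', fatal=False, warn=log.append)
print(noisy, log)  # None ['WARNING: unable to extract mp3Form form']
```

With neither argument, the same call raises `RegexNotFoundError`, which is the behavior reported at the top of this issue.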

Having a look at yt-dlp's extractor, the case is already handled and ignored there. Equivalent yt-dl code would be:

     def _extract_http_audio(self, webpage, video_id):
+        try:
-        fields = self._form_hidden_inputs('mp3Form', webpage)
+            fields = self._form_hidden_inputs('mp3Form', webpage)
+        except ExtractorError:
+            fields = {}
         http_audio_url = fields.get('filename')
...

This code assumes that any kind of ExtractorError should be ignored here, not just RegexNotFoundError. For now I'll probably just push this for compatibility.
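
To make that breadth concrete, here is a small standalone sketch. The two exception classes are stand-ins that mirror the relationship in `youtube_dl.utils`, where `RegexNotFoundError` is a subclass of `ExtractorError`. The helper and both failing functions are hypothetical.

```python
class ExtractorError(Exception):
    """Stand-in for youtube_dl.utils.ExtractorError."""


class RegexNotFoundError(ExtractorError):
    """Stand-in for youtube_dl.utils.RegexNotFoundError (a subclass)."""


def hidden_inputs_or_empty(lookup):
    """Mirror the try/except above: any ExtractorError yields {}."""
    try:
        return lookup()
    except ExtractorError:
        return {}


def missing_form():
    raise RegexNotFoundError('Unable to extract mp3Form form')


def other_failure():
    # Hypothetical unrelated extraction failure, for illustration only.
    raise ExtractorError('some other extraction problem')


print(hidden_inputs_or_empty(missing_form))   # {}
print(hidden_inputs_or_empty(other_failure))  # {} as well: also swallowed
print(hidden_inputs_or_empty(lambda: {'filename': 'a.mp3'}))
```

Narrowing the `except` clause to `RegexNotFoundError` would let the second kind of failure propagate, at the cost of diverging from the yt-dlp behavior being matched here.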

Author

russtoku commented Aug 3, 2022

Awesome! Thanks, again!

gudata added a commit to gudata/youtube-dl that referenced this issue Aug 18, 2022
dirkf added a commit that referenced this issue Aug 19, 2022
* proposed fix for issue #31131, aligns with yt-dlp

Co-authored-by: dirkf <fieldhouse@gmx.net>
Author

Thanks for the commit!
