[internazionale] Add new extractor for www.internazionale.it #14973

iamleot · 2017-12-13T11:30:06Z

Please follow the guide below

You will be asked some questions, please read them carefully and answer honestly
Put an x into all the boxes [ ] relevant to your pull request (like that [x])
Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

At least skimmed through adding new extractor tutorial and youtube-dl coding conventions sections
Searched the bugtracker for similar pull requests
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Bug fix
Improvement
New extractor
New feature

Description of your pull request and other information

Add a new extractor for internazionale.it.
This was implemented by analyzing the web browser requests via
mitmproxy and manually inspecting part of the
JavaScript code served.

dstftw · 2017-12-15T17:10:49Z

youtube_dl/extractor/internazionale.py

+                'url': 'https://video.internazionale.it/%s/%s.m3u8'
+                       % (video_path, id),
+                'ext': 'mp4',
+                'protocol': 'm3u8',


Use `_extract_m3u8_formats`.

At least mpd is also available.
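
A rough sketch of what this comment points at: rather than hand-building a single format dict, the extractor could build both manifest URLs and pass them to youtube-dl's `_extract_m3u8_formats()` and `_extract_mpd_formats()` helpers. The `.m3u8` URL pattern comes from the diff above; the `.mpd` counterpart, the sample path, and the helper name `build_manifest_urls` are assumptions made for illustration.

```python
# Hypothetical helper: builds the manifest URLs an extractor would pass on.
# The .m3u8 pattern is taken from the diff; the .mpd URL is an assumption.
def build_manifest_urls(video_path, job_id):
    video_base = 'https://video.internazionale.it/%s/%s.' % (video_path, job_id)
    return {
        'hls': video_base + 'm3u8',
        'dash': video_base + 'mpd',
    }


# Inside an InfoExtractor subclass the URLs could then be used roughly as:
#   formats = self._extract_m3u8_formats(
#       urls['hls'], video_id, 'mp4', m3u8_id='hls', fatal=False)
#   formats.extend(self._extract_mpd_formats(
#       urls['dash'], video_id, mpd_id='dash', fatal=False))
urls = build_manifest_urls('some/video/path', '265968')
print(urls['hls'])
print(urls['dash'])
```

Keeping the URL construction in one place means both manifests share the same base, so a change to the path layout only needs to be made once.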

dstftw · 2017-12-15T17:12:50Z

youtube_dl/extractor/internazionale.py

+
+        video_container = self._html_search_regex(r'<div class="video-container" (.*)>', webpage, 'video_container')
+
+        id = self._html_search_regex(r'data-job-id="([^"]+)"', video_container, 'id')


Do not shadow built-in names.
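
To illustrate the point, here is a minimal standalone sketch using plain `re` on a made-up page fragment (the markup and the name `job_id` are assumptions, not the final extractor code). Binding the result to `id` would hide the built-in `id()` for the rest of the scope; a descriptive name avoids that.

```python
import re

# Hypothetical page fragment, for illustration only.
webpage = '<div class="video-container" data-job-id="265968" data-video-path="some/path">'

# Naming the result `id` would hide the built-in id() for the rest of the
# scope; a descriptive name such as `job_id` avoids that.
job_id = re.search(r'data-job-id="([^"]+)"', webpage).group(1)

# The built-in is still usable because nothing rebinds the name `id`.
assert callable(id)
print(job_id)
```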

dstftw · 2017-12-15T17:13:10Z

youtube_dl/extractor/internazionale.py

+        video_id = self._match_id(url)
+        webpage = self._download_webpage(url, video_id)
+
+        video_container = self._html_search_regex(r'<div class="video-container" (.*)>', webpage, 'video_container')


Capturing empty string does not make any sense. What's the point capturing this at all? id and path occur only once in webpage.
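
A minimal sketch of the simpler approach, assuming the two attributes appear once in the page: search for each attribute directly instead of first capturing the whole container tag. The markup and the helper `search_attr` are hypothetical; in the extractor the same patterns would presumably be passed to `self._search_regex()`.

```python
import re

# Hypothetical markup; only the attribute names come from this pull request.
webpage = '''
<html><body>
<div class="video-container" data-job-id="265968" data-video-path="2015/02/19/example">
</div>
</body></html>
'''


def search_attr(name, html):
    """Return the value of the first name="value" attribute found in html."""
    match = re.search(r'%s="([^"]+)"' % re.escape(name), html)
    if match is None:
        raise ValueError('attribute %s not found' % name)
    return match.group(1)


job_id = search_attr('data-job-id', webpage)
video_path = search_attr('data-video-path', webpage)
print(job_id, video_path)
```

Because `[^"]+` requires at least one character, neither search can succeed with an empty capture, unlike the `(.*)` group in the original pattern.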

dstftw · 2017-12-15T17:15:41Z

youtube_dl/extractor/internazionale.py

+        'info_dict': {
+            'id': '265968',
+            'ext': 'mp4',
+            'description': 'Il regista statunitense Richard Linklater ci racconta una scena del film Boyhood e la sua passione per l’imprecisione della memoria. Il film è un’avventura durata 12 anni, durante la quale Linklater ha seguito il protagonista dal 2002 al 2014 per raccontare la sua crescita e il rapporto con i genitori divorziati. Leggi',


`md5:`.


dstftw · 2017-12-15T17:15:57Z

youtube_dl/extractor/internazionale.py

+            'description': 'Tre ragazzi raccontano quanto è difficile essere italiani di fatto ma non di diritto: una vita fatta di burocrazia, opportunità negate e grandi contraddizioni. Leggi',
+            'title': 'Storie di italiani senza cittadinanza',
+            'thumbnail': r're:^https?://.*\.jpg$',
+        }


Remove duplicates.

- Use `md5:...` instead of providing a long description in info_dict, and keep only one test.
- Directly search for the `data-job-id` and `data-video-path` attributes.
- Extract m3u8 and mpd via _extract_m3u8_formats() and _extract_mpd_formats().

TODO: Figure out why `python test/test_download.py TestDownload.test_Internazionale` fails with a DownloadError and `ERROR: requested format not available`.
TODO: For m3u8, `youtube_dl -F` on an Internazionale URL shows `m3u8` as the extension instead of mp4; is this correct?
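
For reference, a small sketch of how an `md5:` value for a long test field can be produced. It assumes, as youtube-dl's test helper appears to do, that the expected value is the literal prefix `md5:` followed by the MD5 hex digest of the UTF-8 encoded string; the function name `md5_field` is made up for this example.

```python
import hashlib

# Description text copied from the second test case in the diff above.
description = (
    'Tre ragazzi raccontano quanto è difficile essere italiani di fatto ma '
    'non di diritto: una vita fatta di burocrazia, opportunità negate e '
    'grandi contraddizioni. Leggi')


def md5_field(text):
    """Return the 'md5:<hexdigest>' form of a long expected test value."""
    return 'md5:' + hashlib.md5(text.encode('utf-8')).hexdigest()


expected = md5_field(description)
print(expected)
```

The printed string can then replace the full description in `info_dict`, which keeps the test readable while still checking the extracted text.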

iamleot · 2017-12-18T21:41:35Z

Hello Sergey,

"Sergey M." writes:

> dstftw requested changes on this pull request.
>
> 1. `_extract_m3u8_formats`.
> 2. At least mpd is also available.

I tried to address that by using _extract_m3u8_formats() and _extract_mpd_formats() respectively.

> Do not shadow built-in names.

Whoops, nice catch!

> + video_container = self._html_search_regex(r'<div class="video-container" (.*)>', webpage, 'video_container')
>
> Capturing empty string does not make any sense. What's the point capturing this at all? id and path occur only once in webpage.

I guessed that first extracting the relevant part from the entire webpage and then extracting only the interesting attributes would be faster. What you propose is right and simpler, so I've changed it as you suggested.

> `md5:`.

OK.

> Remove duplicates.

Did you mean removing a test? In that case I've kept only the first one.

However, two points probably still need to be addressed (I have tried to investigate further, without much luck):

- Why does TestDownload.test_Internazionale now fail with `ERROR: requested format not available`?

```
% python2.7 test/test_download.py TestDownload.test_Internazionale
[Internazionale] 2015/02/19/richard-linklater-racconta-una-scena-di-boyhood: Downloading webpage
[Internazionale] 2015/02/19/richard-linklater-racconta-una-scena-di-boyhood: Downloading m3u8 information
[Internazionale] 2015/02/19/richard-linklater-racconta-una-scena-di-boyhood: Downloading MPD manifest
ERROR: requested format not available
Traceback (most recent call last):
  File "/home/leot/repos/youtube-dl/youtube_dl/YoutubeDL.py", line 795, in extract_info
    return self.process_ie_result(ie_result, download, extra_info)
  File "/home/leot/repos/youtube-dl/youtube_dl/YoutubeDL.py", line 849, in process_ie_result
    return self.process_video_result(ie_result, download=download)
  File "/home/leot/repos/youtube-dl/youtube_dl/YoutubeDL.py", line 1612, in process_video_result
    expected=True)
ExtractorError: requested format not available

E
======================================================================
ERROR: test_Internazionale (__main__.TestDownload):
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_download.py", line 159, in test_template
    force_generic_extractor=params.get('force_generic_extractor', False))
  File "/home/leot/repos/youtube-dl/youtube_dl/YoutubeDL.py", line 807, in extract_info
    self.report_error(compat_str(e), e.format_traceback())
  File "/home/leot/repos/youtube-dl/youtube_dl/YoutubeDL.py", line 612, in report_error
    self.trouble(error_message, tb)
  File "/home/leot/repos/youtube-dl/youtube_dl/YoutubeDL.py", line 582, in trouble
    raise DownloadError(message, exc_info)
DownloadError: ERROR: requested format not available

…

----------------------------------------------------------------------
Ran 1 test in 2.949s

FAILED (errors=1)
Exit 1
```

- Is it okay that the extension is `m3u8` when invoking `youtube-dl -F`?

```
% python2.7 -m youtube_dl -F 'https://www.internazionale.it/video/2015/02/19/richard-linklater-racconta-una-scena-di-boyhood'
[Internazionale] 2015/02/19/richard-linklater-racconta-una-scena-di-boyhood: Downloading webpage
[Internazionale] 2015/02/19/richard-linklater-racconta-una-scena-di-boyhood: Downloading m3u8 information
[Internazionale] 2015/02/19/richard-linklater-racconta-una-scena-di-boyhood: Downloading MPD manifest
[info] Available formats for 265968:
format code    extension  resolution note
audio-1-Audio  m3u8       audio only [en]
128kbps        m4a        audio only DASH audio  128k , mp4a.40.2 (44100Hz)
360p_800kbps   mp4        640x360    DASH video  800k , avc1.42c00d, 30fps, video only
480p_1200kbps  mp4        854x480    DASH video 1200k , avc1.42c00d, 30fps, video only
1728           m3u8       640x360    1728k , avc1.64000d, video only
720p_2400kbps  mp4        1280x720   DASH video 2400k , avc1.42c00d, 30fps, video only
2528           m3u8       854x480    2528k , avc1.64000d, video only
4928           m3u8       1280x720   4928k , avc1.64000d, video only (best)
```

Thank you for the review and for the attention!
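
On the second question, one plausible explanation (an assumption here, not something confirmed in this thread) is that when a format carries no explicit `ext`, the extension is guessed from the URL, so an `.m3u8` manifest URL yields `m3u8`. The function below is a simplified stand-in for that kind of guess, not youtube-dl's own code, and the sample URL path is made up. Passing `'mp4'` as the `ext` argument of `_extract_m3u8_formats()` is the usual way to state the container explicitly.

```python
import posixpath
from urllib.parse import urlparse


def guess_ext(url, default='unknown'):
    """Guess a file extension from the path component of a URL."""
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lstrip('.')
    return ext.lower() or default


# Hypothetical manifest URL following the pattern from the diff.
manifest = 'https://video.internazionale.it/some/video/path/265968.m3u8'
guessed = guess_ext(manifest)
print(guessed)
```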

[internazionale] Add new extractor for www.internazionale.it

018046d

dstftw requested changes Dec 15, 2017

dstftw added the pending-fixes label Dec 15, 2017

dstftw closed this in 640788f Dec 27, 2017

PuffingtonToast referenced this pull request in PuffingtonToast/youtube-dl Jan 5, 2018

[internazionale] Improve extraction (closes #14973)

ef89b99

dstftw added a commit that referenced this pull request Feb 9, 2018

Credit @iamleot for internazionale (#14973)

430f2ca

cypheron mentioned this pull request Feb 3, 2021

Evaluation / overview of new proposed extractors / sites #28054

Open
