[extractor/mx3] Add extractor #8736

martinxyz · 2023-12-09T17:30:22Z

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

Add a simple, basic extractor for mx3.ch.

(mx3.ch is a site that hosts music uploaded by bands from or in Switzerland. As a first approximation I'd say it's government funded.)

Template

Before submitting a pull request make sure you have:

At least skimmed through contributing guidelines including yt-dlp coding conventions
Searched the bugtracker for similar pull requests
Checked the code with flake8 and ran relevant tests

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Fix or improvement to an extractor (Make sure to add/update tests)
New extractor (Piracy websites will not be accepted)
Core bug fix/improvement
New feature (It is strongly recommended to open an issue first)

seproDev · 2023-12-19T19:52:26Z

I think it would be better to extract metadata from: https://mx3.ch/t/1LIY.json
And to also include the other formats:

https://mx3.ch/tracks/1Cru/player_asset (128 kbps mp3)
https://mx3.ch/tracks/1Cru/player_asset?quality=hd (320 kbps mp3)
https://mx3.ch/tracks/1Cru/player_asset?quality= (source wav file)
https://mx3.ch/tracks/1C6E/download (source download. Not available for all files)
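
As a rough illustration of the list above, the candidate URLs could be built from a track ID as follows. This is only a sketch: `candidate_format_urls` and the dictionary keys are hypothetical names, and the `quality=source` value is taken from the review suggestion later in this thread, not from the list itself.

```python
def candidate_format_urls(track_id, base='https://mx3.ch'):
    """Return candidate media URLs for a track, keyed by a hypothetical format label."""
    track_url = f'{base}/tracks/{track_id}'
    return {
        'default': f'{track_url}/player_asset',                 # 128 kbps mp3
        'hd': f'{track_url}/player_asset?quality=hd',           # 320 kbps mp3
        'source': f'{track_url}/player_asset?quality=source',   # assumed value, see lead-in
        'download': f'{track_url}/download',                    # not available for all files
    }


urls = candidate_format_urls('1Cru')
```

Not every URL is expected to resolve for every track, so a caller would still have to check each one.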

martinxyz · 2023-12-26T16:13:20Z

Thanks for looking at this. I missed that there is a JSON endpoint; I'm using it now where possible. Sadly, it is missing the genre info (which I would be tempted to drop if it were the only thing missing), and it also doesn't say whether the "download" format is available.

I have added multiple formats now. The MIME types are all over the place (different video formats, mp3, wav). I'm not confident about hardcoding a bitrate into the info. I've removed my previous code that hardcoded the file extension to 'mp3' for audio and 'mp4' for video.

I tried setting 'ext' to None (hoping to trigger the default code that makes a HEAD request and then guesses the extension), but the testing framework doesn't seem to like that. So I've added the HEAD request directly to the extractor. It works, but I wonder if there is a better way. (Ideally, the result of the HEAD request would also fill in the file size info, etc. But maybe that's too much; let's first have a working extractor.)
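
For readers unfamiliar with the approach, here is a minimal standard-library sketch of probing a media URL with a HEAD request and deriving the extension and size from the response headers. It is not the code in this PR and does not use yt-dlp's own networking helpers; `probe_format`, `ext_from_content_type` and the tiny MIME table are hypothetical and for illustration only.

```python
import urllib.request

# Hypothetical, deliberately small MIME-to-extension table (illustration only)
_MIME_TO_EXT = {
    'audio/mpeg': 'mp3',
    'audio/wav': 'wav',
    'audio/x-wav': 'wav',
    'video/mp4': 'mp4',
}


def ext_from_content_type(content_type):
    """Map a Content-Type header value to a file extension, or None if unknown."""
    if not content_type:
        return None
    # Drop parameters such as "; charset=..." and normalise case
    mime = content_type.split(';', 1)[0].strip().lower()
    return _MIME_TO_EXT.get(mime)


def probe_format(url, timeout=10):
    """Send a HEAD request and return extension and size hints.

    Network errors are not handled here and will propagate to the caller.
    """
    request = urllib.request.Request(url, method='HEAD')
    with urllib.request.urlopen(request, timeout=timeout) as response:
        length = response.headers.get('Content-Length')
        return {
            'ext': ext_from_content_type(response.headers.get('Content-Type')),
            'filesize': int(length) if length and length.isdigit() else None,
        }
```

Keeping the header parsing in a pure function makes it testable without network access; only `probe_format` touches the network.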

seproDev · 2024-01-20T00:04:55Z

yt_dlp/extractor/mx3.py

+        add_format({
+            'url': f'{track_url}/player_asset',
+            'format_id': 'default',
+            'quality': 1,
+        }, fatal=True)
+        # the formats below don't always exist
+        add_format({
+            'url': f'{track_url}/player_asset?quality=hd',
+            'format_id': 'hd',
+            'quality': 10,
+        }, fatal=False)
+        add_format({
+            'url': f'{track_url}/download',
+            'format_id': 'download',
+            'quality': 11,
+        }, fatal=False)


IMO, yt-dlp should always download the highest-quality format by default. We also prefer the source file in other extractors such as Vimeo. If you don't want to download the highest-quality format, you can use -f or -S.

Suggested change

-        add_format({
-            'url': f'{track_url}/player_asset',
-            'format_id': 'default',
-            'quality': 1,
-        }, fatal=True)
-        # the formats below don't always exist
-        add_format({
-            'url': f'{track_url}/player_asset?quality=hd',
-            'format_id': 'hd',
-            'quality': 10,
-        }, fatal=False)
-        add_format({
-            'url': f'{track_url}/download',
-            'format_id': 'download',
-            'quality': 11,
-        }, fatal=False)
+        add_format({
+            'url': f'{track_url}/player_asset',
+            'format_id': 'default',
+            'quality': 1,
+        })
+        add_format({
+            'url': f'{track_url}/player_asset?quality=hd',
+            'format_id': 'hd',
+            'quality': 10,
+        })
+        add_format({
+            'url': f'{track_url}/download',
+            'format_id': 'download',
+            'quality': 11,
+        })
+        add_format({
+            'url': f'{track_url}/player_asset?quality=source',
+            'format_id': 'source',
+            'quality': 11,
+        })

martinxyz · 2024-01-20

Hm. I can see your point. But for me there is simply no difference in quality between a high-bitrate MP3 and WAV. For video I would list the formats and pick one manually, but for music I use the "Open With" browser extension (non-interactive) and check the download folder later. I'm going to use -f"best[ext!=wav][ext!=flac][filesize<50M]" -x now, so it will work for me if you add the format.

seproDev · 2024-01-20

If you only want this format selection for this site, just use -f hd/default. This will download hd if available and otherwise fall back to default.
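
The fallback behaviour of a selector such as hd/default can be modelled in a few lines. This is a simplified sketch of the `/` operator only, not yt-dlp's actual format selector (which supports filters, merging and sorting); `select_with_fallback` is a hypothetical helper.

```python
def select_with_fallback(formats, selector):
    """Return the first format whose format_id appears in a '/'-separated selector."""
    available = {fmt['format_id']: fmt for fmt in formats}
    for format_id in selector.split('/'):
        if format_id in available:
            return available[format_id]
    return None  # nothing in the preference list is available


only_default = [{'format_id': 'default', 'quality': 1}]
with_hd = only_default + [{'format_id': 'hd', 'quality': 10}]

picked_a = select_with_fallback(only_default, 'hd/default')
picked_b = select_with_fallback(with_hd, 'hd/default')
```

Here `picked_a` is the default format because no hd entry exists, while `picked_b` is the hd format.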

yt_dlp/extractor/mx3.py

seproDev · 2024-01-20T00:58:43Z

yt_dlp/extractor/mx3.py

+        return {
+            'id': track_id,
+            'formats': formats,
+            'artist': ', '.join(artists),
+            'genre': genre,
+            **traverse_obj(json, {
+                'title': ('title', {str}),
+                'composer': ('composer_name', {str}),
+                'thumbnail': (('picture_url_xlarge', 'picture_url'), {url_or_none}),
+            }, get_all=False),
+        }


Okay, I wrote a function to extract more metadata fields. For artist/performer, how about we split this across artist and album_artist, with a fallback for artist.

Suggested change

-        return {
-            'id': track_id,
-            'formats': formats,
-            'artist': ', '.join(artists),
-            'genre': genre,
-            **traverse_obj(json, {
-                'title': ('title', {str}),
-                'composer': ('composer_name', {str}),
-                'thumbnail': (('picture_url_xlarge', 'picture_url'), {url_or_none}),
-            }, get_all=False),
-        }
+        more_info = get_element_by_class('single-more-info', webpage)
+
+        def get_info_field(name):
+            return self._html_search_regex(
+                rf'<dt[^>]*>\s*{name}\s*</dt>\s*<dd[^>]*>(.*?)</dd>',
+                more_info, name, default=None, flags=re.DOTALL)
+
+        return {
+            'id': track_id,
+            'formats': formats,
+            'genre': self._html_search_regex(
+                r'<div\b[^>]+class="single-band-genre"[^>]*>([^<]+)</div>', webpage, 'genre', fatal=False),
+            'release_year': int_or_none(get_info_field('Year of creation')),
+            'description': get_info_field('Description'),
+            'tags': try_call(lambda: get_info_field('Tag').split(', '), list),
+            **traverse_obj(data, {
+                'title': ('title', {str}),
+                'artist': (('performer_name', 'artist'), {str}),
+                'album_artist': ('artist', {str}),
+                'composer': ('composer_name', {str}),
+                'thumbnail': (('picture_url_xlarge', 'picture_url'), {url_or_none}),
+            }, get_all=False),
+        }
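
To show what the `<dt>`/`<dd>` regex in the suggestion matches, here is a standalone sketch using only the `re` module. The sample markup is invented for illustration and is not copied from mx3.ch. Unlike `_html_search_regex`, this version only strips whitespace and does not clean HTML, and it adds `re.escape` around the field name as a precaution.

```python
import re

# Invented sample markup, loosely shaped like a definition list
SAMPLE = '''
<dl class="single-more-info">
  <dt class="label"> Year of creation </dt>
  <dd>2019</dd>
  <dt>Tag</dt>
  <dd>folk, live</dd>
</dl>
'''


def get_info_field(more_info, name):
    """Return the text of the <dd> that follows the <dt> labelled `name`, or None."""
    match = re.search(
        rf'<dt[^>]*>\s*{re.escape(name)}\s*</dt>\s*<dd[^>]*>(.*?)</dd>',
        more_info, flags=re.DOTALL)
    return match.group(1).strip() if match else None


year = get_info_field(SAMPLE, 'Year of creation')
tags = get_info_field(SAMPLE, 'Tag').split(', ')
missing = get_info_field(SAMPLE, 'Description')
```

A missing field yields `None`, which is why the suggestion wraps the tag split in `try_call`: calling `.split` on `None` would otherwise raise.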

martinxyz · 2024-01-20

The artist/album_artist split seems to fit pretty well; I like it. I slightly preferred the filenames I got previously with the format string. I'll get a few duplicated artist names now, but it's not too bad really.

I've updated the tests to match. I noticed that https://neo.mx3.ch/t/1hpd kind of has a description, but they put it all into the credits field. I'm not sure whether we want to add that.
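
The artist/album_artist split with its fallback can be pictured in plain Python. This is a hedged model of the intent only, not of `traverse_obj` itself; `first_str` and `artist_fields` are hypothetical helpers, and the keys `performer_name` and `artist` are the JSON fields named in the suggestion.

```python
def first_str(data, *keys):
    """Return the first value among `keys` that is a string, else None."""
    for key in keys:
        value = data.get(key)
        if isinstance(value, str):
            return value
    return None


def artist_fields(data):
    """Prefer the performer as artist and fall back to the band name."""
    return {
        'artist': first_str(data, 'performer_name', 'artist'),
        'album_artist': first_str(data, 'artist'),
    }


band_only = artist_fields({'artist': 'Some Band'})
with_performer = artist_fields({'artist': 'Some Band', 'performer_name': 'Some Soloist'})
```

When no performer is given, both fields carry the band name, which is the source of the duplicated artist names in filename templates mentioned above.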

bashonly · 2024-01-21T00:56:02Z

Are there overlapping IDs between the 3 sites? Could this just be 1 extractor?

seproDev · 2024-01-21T01:46:31Z

There are collisions: 1g2T (neo) and 1g2T (volksmusik).
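
Because the same ID can exist on more than one site, a track is only identified unambiguously by the pair of site and ID. The sketch below illustrates this with a hypothetical URL pattern; it is not the `_VALID_URL` of the merged extractors, and `track_key` is an invented helper.

```python
import re

# Hypothetical pattern covering mx3.ch, neo.mx3.ch and volksmusik.mx3.ch track pages
_TRACK_URL = re.compile(
    r'https?://(?:(?P<site>neo|volksmusik)\.)?mx3\.ch/t/(?P<id>\w+)')


def track_key(url):
    """Return a (site, track_id) pair, or None if the URL does not match."""
    match = _TRACK_URL.match(url)
    if not match:
        return None
    return (match.group('site') or 'mx3', match.group('id'))


neo_key = track_key('https://neo.mx3.ch/t/1g2T')
volksmusik_key = track_key('https://volksmusik.mx3.ch/t/1g2T')
```

The two keys share the ID but differ in the site component, so they do not collide when used, for example, as download-archive entries.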

seproDev added the site-request label (Request to support a new website) Dec 9, 2023

bashonly self-requested a review December 12, 2023 00:08

seproDev added the pending-fixes label (PR has had changes requested) Dec 19, 2023

martinxyz added 3 commits December 26, 2023 16:40

[ie/mx3] Add extractor

3675f9f

[ie/mx3] Use JSON where possible

6c96964

[ie/mx3] Support multiple formats, don't hardcode extension guess

e93c02a

martinxyz force-pushed the mx3 branch from 22bc7ce to e93c02a Compare December 26, 2023 15:42

seproDev requested changes Jan 5, 2024

martinxyz added 4 commits January 14, 2024 20:13

[ie/mx3] Support for neo.mx3.ch and volksmusik.mx3.ch

fb53e71

The track IDs on neo.mx3.ch and volksmusik.mx3.ch do not work on mx3.ch. The sites even require users to create a separate login. And also extract "composer" and "performer".

[ie/mx3] Do not prepend artist to title, do not output lists

ce38da9

Some other extractors use lists too, but it doesn't work well with filename templates.

[ie/mx3] Refactor: abstract base class

55d4944

[ie/mx3] Always try HEAD on media URLs; extract size and timestamp

8f43d97

seproDev requested changes Jan 20, 2024

martinxyz and others added 3 commits January 20, 2024 12:01

[ie/mx3] Refactoring and fixes from review

ad2893a

[ie/mx3] Extract more fields, extract performer as artist

b340d5b

Code as suggested by sepro; updated tests.

Small cleanup

43b916d

seproDev added the pending-review label (PR needs a review) and removed the pending-fixes label (PR has had changes requested) Jan 20, 2024

seproDev approved these changes Jan 20, 2024

bashonly added 2 commits January 21, 2024 02:25

refactor

272551d

qualities

212ff27

bashonly approved these changes Jan 21, 2024

bashonly added 2 commits January 21, 2024 02:29

oops

1cb1df5

revert get_info_field regex change

336e2f1

bashonly removed the pending-review label (PR needs a review) Jan 21, 2024

bashonly assigned seproDev Jan 21, 2024

seproDev merged commit 5a63454 into yt-dlp:master Jan 21, 2024
6 checks passed

aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024

[ie/mx3] Add extractors (yt-dlp#8736)

68a8a80

Authored by: martinxyz
