[extractor/mx3] Add extractor #8736
I think it would be better to extract metadata from:
Thanks for looking at this. I missed that there is a JSON; I'm using it now where possible. Sadly, it is missing the genre info (which I would be tempted to drop if it were the only thing), and also whether the "download" format is available. I have added multiple formats now. The MIME types are all over the place (different video formats, MP3, WAV). I'm not confident about hardcoding a bitrate into the info. I've removed my previous code that hardcoded the file extension to 'mp3' for audio and 'mp4' for video. I tried setting 'ext' to None (in the hope of triggering the default code that makes a HEAD request and then guesses the extension), but the testing framework doesn't seem to like that. So I've added the HEAD request directly to the extractor. It works, but I wonder if there is a better way. (Ideally, the result of the HEAD request would also fill in the file size info, etc. But maybe that's too much; let's first have a working extractor.)
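For readers unfamiliar with the HEAD-request approach mentioned above, here is a minimal standard-library sketch of the idea. It is not the extractor's code: the names `ext_from_content_type` and `probe_ext` and the small MIME table are assumptions made for illustration only.

```python
import urllib.request

# Illustrative subset of MIME types (an assumption, not a complete list for mx3.ch).
_MIME_TO_EXT = {
    'audio/mpeg': 'mp3',
    'audio/wav': 'wav',
    'audio/x-wav': 'wav',
    'video/mp4': 'mp4',
    'video/webm': 'webm',
}


def ext_from_content_type(content_type):
    """Map a Content-Type header value to a file extension, or None if unknown."""
    if not content_type:
        return None
    # Drop parameters such as "; charset=..." and normalise case.
    mime = content_type.split(';', 1)[0].strip().lower()
    return _MIME_TO_EXT.get(mime)


def probe_ext(url, timeout=10):
    """Send a HEAD request (hypothetical helper) and guess the extension from the headers."""
    request = urllib.request.Request(url, method='HEAD')
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return ext_from_content_type(response.headers.get('Content-Type'))
```

The same response would also carry a Content-Length header when the server provides one, which is where the file size mentioned above could come from.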
The track IDs on neo.mx3.ch and volksmusik.mx3.ch do not work on mx3.ch. The sites even require users to create a separate login. And also extract "composer" and "performer".
Some other extractors use lists too, but it doesn't work well with filename templates.
yt_dlp/extractor/mx3.py
Outdated
add_format({
    'url': f'{track_url}/player_asset',
    'format_id': 'default',
    'quality': 1,
}, fatal=True)
# the formats below don't always exist
add_format({
    'url': f'{track_url}/player_asset?quality=hd',
    'format_id': 'hd',
    'quality': 10,
}, fatal=False)
add_format({
    'url': f'{track_url}/download',
    'format_id': 'download',
    'quality': 11,
}, fatal=False)
IMO, yt-dlp should always download the highest-quality format by default. We also prefer the source file in other extractors such as Vimeo. If you don't want to download the highest-quality format, you can use -f or -S.
add_format({
    'url': f'{track_url}/player_asset',
    'format_id': 'default',
    'quality': 1,
})
add_format({
    'url': f'{track_url}/player_asset?quality=hd',
    'format_id': 'hd',
    'quality': 10,
})
add_format({
    'url': f'{track_url}/download',
    'format_id': 'download',
    'quality': 11,
})
add_format({
    'url': f'{track_url}/player_asset?quality=source',
    'format_id': 'source',
    'quality': 11,
})
Hm. I can see your point. But for me there is simply no difference in quality between a high-bitrate MP3 and WAV. For video I would list the formats and pick one manually, but for music I use the "Open With" browser extension (non-interactive) and check the download folder later. I'm going to use -f"best[ext!=wav][ext!=flac][filesize<50M]" -x now, so it will work for me if you add the format.
If you only want this format selection for this site, just use -f hd/default. This will download hd if available and otherwise fall back to default.
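As a rough mental model of the slash fallback in a selector such as hd/default, the sketch below picks the first listed format ID that is actually available. It is a deliberately simplified illustration, not yt-dlp's selector implementation (which supports far more syntax); `pick_format` is a hypothetical name.

```python
def pick_format(selector, available_ids):
    """Return the first format ID in an 'a/b/c' selector that is available, else None."""
    for format_id in selector.split('/'):
        if format_id in available_ids:
            return format_id
    return None


# 'hd' is preferred when present; otherwise 'default' is used.
print(pick_format('hd/default', ['default', 'hd', 'download']))
print(pick_format('hd/default', ['default']))
```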
yt_dlp/extractor/mx3.py
Outdated
return {
    'id': track_id,
    'formats': formats,
    'artist': ', '.join(artists),
    'genre': genre,
    **traverse_obj(json, {
        'title': ('title', {str}),
        'composer': ('composer_name', {str}),
        'thumbnail': (('picture_url_xlarge', 'picture_url'), {url_or_none}),
    }, get_all=False),
}
Okay, I wrote a function to extract more metadata fields. For artist/performer, how about we split this across artist and album_artist, with a fallback for artist?
more_info = get_element_by_class('single-more-info', webpage)

def get_info_field(name):
    return self._html_search_regex(
        rf'<dt[^>]*>\s*{name}\s*</dt>\s*<dd[^>]*>(.*?)</dd>',
        more_info, name, default=None, flags=re.DOTALL)

return {
    'id': track_id,
    'formats': formats,
    'genre': self._html_search_regex(
        r'<div\b[^>]+class="single-band-genre"[^>]*>([^<]+)</div>', webpage, 'genre', fatal=False),
    'release_year': int_or_none(get_info_field('Year of creation')),
    'description': get_info_field('Description'),
    'tags': try_call(lambda: get_info_field('Tag').split(', '), list),
    **traverse_obj(data, {
        'title': ('title', {str}),
        'artist': (('performer_name', 'artist'), {str}),
        'album_artist': ('artist', {str}),
        'composer': ('composer_name', {str}),
        'thumbnail': (('picture_url_xlarge', 'picture_url'), {url_or_none}),
    }, get_all=False),
}
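To make the artist/album_artist fallback concrete, here is a standalone sketch that mimics the intent of the two traverse_obj paths with plain dictionary lookups. It is a simplified stand-in, not yt-dlp's traverse_obj; the helper names `first_str` and `split_artists` and the sample values are made up for illustration.

```python
def first_str(data, *keys):
    """Return the first value among the given keys that is a string, else None."""
    for key in keys:
        value = data.get(key)
        if isinstance(value, str):
            return value
    return None


def split_artists(data):
    """Prefer the performer as 'artist'; always use the band as 'album_artist'."""
    return {
        'artist': first_str(data, 'performer_name', 'artist'),
        'album_artist': first_str(data, 'artist'),
    }


# Hypothetical sample records, not real mx3.ch data.
print(split_artists({'performer_name': 'Some Performer', 'artist': 'Some Band'}))
print(split_artists({'artist': 'Some Band'}))
```

When no performer is given, both fields end up with the same name, which matches the duplicated artist names noted in the reply below.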
The artist/album_artist split seems to fit pretty well; I like it. I slightly preferred the filenames I got previously with the format string. I'll get a few duplicated artist names now, but it's not too bad really.
I've updated the tests to match. I noticed that https://neo.mx3.ch/t/1hpd kind of has a description, but they put it all into the credits field, not sure if we want to add that.
Code as suggested by sepro; updated tests.
Are there overlapping IDs between the 3 sites? Could this just be 1 extractor?
There are collisions: 1g2T (neo), 1g2T (volksmusik)
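Because the same ID can exist on more than one site, a bare track ID is ambiguous. The sketch below shows one way to keep such tracks distinct by pairing the ID with the hostname. The `track_key` helper is hypothetical, and the `/t/<id>` path on volksmusik.mx3.ch is assumed by analogy with the neo.mx3.ch URL quoted earlier.

```python
from urllib.parse import urlparse


def track_key(url):
    """Return a (site, track_id) pair so equal IDs on different sites stay distinct."""
    parsed = urlparse(url)
    site = parsed.hostname
    # Take the last path segment as the track ID (assumes URLs shaped like /t/<id>).
    track_id = parsed.path.rstrip('/').rsplit('/', 1)[-1]
    return site, track_id


neo = track_key('https://neo.mx3.ch/t/1g2T')
volksmusik = track_key('https://volksmusik.mx3.ch/t/1g2T')
print(neo, volksmusik, neo != volksmusik)
```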
Authored by: martinxyz
Description of your pull request and other information
Add a simple, basic extractor for mx3.ch.
(mx3.ch is a site that hosts music uploaded by bands from or in Switzerland. As a first approximation I'd say it's government funded.)