Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BitChute] - Fix 'uploader' and playlist tests md5 result #8507

Merged
merged 4 commits into from Dec 11, 2023

Conversation

SirElderling
Copy link
Contributor

@SirElderling SirElderling commented Nov 3, 2023

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

This pull request contains two changes:

  • Fixed the code that retrieves the uploader name from the webpage
  • Fixed the md5 hash for one of the playlist tests.

Fixes #8492

Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Copilot Summary

🤖 Generated by Copilot at 0f97a23

Summary

🐛🛠️📝

Improve BitChute extractor by fixing uploader name and test case.

We're sailing on the BitChute sea, me hearties, yo ho ho
We're scraping all the data we can find, yo ho ho
We've fixed the uploader and the description too
We've made the extractor more reliable and true

Walkthrough

  • Fix uploader name extraction for BitChuteIE using regex (link)
  • Update description hash for BitChuteChannelIE test case to match website changes (link)

@Grub4K
Copy link
Member

Grub4K commented Nov 3, 2023

Instead of reopening essentially the same PR, feel free to edit the description or add commits to the old PR instead

@Grub4K Grub4K mentioned this pull request Nov 3, 2023
9 tasks
@SirElderling
Copy link
Contributor Author

SirElderling commented Nov 3, 2023

Instead of reopening essentially the same PR, feel free to edit the description or add commits to the old PR instead

Sorry. These pull requests were created while my account was flagged by github. The Checks and Copilot summary were not working. Hence, I decided to recreate them.

@Grub4K
Copy link
Member

Grub4K commented Nov 3, 2023

copilot summary can be ignored (its honestly quite pointless) and I will manually approve the workflows. As long as pushing to the old branch works, feel free to keep doing that

@Grub4K
Copy link
Member

Grub4K commented Nov 3, 2023

Note that I had to manually approve the workflows for this PR as well

@SirElderling
Copy link
Contributor Author

Thanks @Grub4K for the clarification.

@bashonly bashonly self-requested a review November 3, 2023 20:05
@seproDev seproDev added the site-bug Issue with a specific website label Nov 4, 2023
@SirElderling SirElderling changed the title [BitChute] - Fix 'uploader' and playlist md5 result [BitChute] - Fix 'uploader' and playlist tests md5 result Nov 4, 2023
@@ -126,7 +126,7 @@ def _real_extract(self, url):
'title': self._html_extract_title(webpage) or self._og_search_title(webpage),
'description': self._og_search_description(webpage, default=None),
'thumbnail': self._og_search_thumbnail(webpage),
'uploader': clean_html(get_element_by_class('owner', webpage)),
'uploader': self._search_regex(r'<p\sclass=\"name.+Channel\">([^<]+)<', webpage, 'uploader'),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about something like this:

Suggested change
'uploader': self._search_regex(r'<p\sclass=\"name.+Channel\">([^<]+)<', webpage, 'uploader'),
'channel': clean_html(get_element_by_class('name', details)),
'uploader': clean_html(get_element_by_class('creator', details)),

with

details = get_element_by_class('details', webpage) or ''

Potentially, the channel_url and uploader_url could also be extracted, as these can also differ.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a couple of changes, and added the channel_url and uploader_url as you suggested.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think both channel and uploader should be added as they can differ as seen here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the example @seproDev . I added the channel extraction as well, since the information was already there.

Comment on lines 116 to 120
details = get_element_by_class('details', webpage) or ''
uploader_path = extract_attributes(
get_element_html_by_class('spa', get_element_html_by_class('creator', details)) or '').get('href')
channel_path = extract_attributes(
get_element_html_by_class('spa', get_element_html_by_class('name', details)) or '').get('href')
Copy link
Member

@bashonly bashonly Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. all of this "details" extraction should be moved down near the return for better code flow/readability
  2. uploader_path and channel_path are currently fatal because get_element_html_by_class can return None and throws when its html input is None
  3. matching base_url in _VALID_URL is overkill, we can always just use https://www.bitchute.com
  4. use utils.urljoin
  5. move duplicated code into a function
        details = get_element_by_class('details', webpage) or ''
        uploader_html = get_element_html_by_class('creator', details) or ''
        channel_html = get_element_html_by_class('name', details) or ''

        def construct_url(html):
            path = extract_attributes(get_element_html_by_class('spa', html) or '').get('href')
            return urljoin('https://www.bitchute.com', path)

        return {
            # ...
            'uploader': clean_html(uploader_html),
            'channel': clean_html(channel_html),
            'uploader_url': construct_url(uploader_html),
            'channel_url': construct_url(channel_html),
            # ...
        }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestions @bashonly . I applied those and reverted the base_url change.

@bashonly bashonly self-requested a review November 26, 2023 03:24
@sefabey
Copy link

sefabey commented Dec 6, 2023

Interested in this being merged.

@bashonly bashonly self-assigned this Dec 6, 2023
@bashonly bashonly merged commit b1a1ec1 into yt-dlp:master Dec 11, 2023
15 checks passed
gonzalezjo pushed a commit to gonzalezjo/yt-dlp that referenced this pull request Dec 12, 2023
@SirElderling SirElderling deleted the bitchute branch December 18, 2023 06:25
@SirElderling SirElderling restored the bitchute branch December 18, 2023 06:25
@SirElderling SirElderling deleted the bitchute branch December 18, 2023 06:25
aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
site-bug Issue with a specific website
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bitchute - output template "%(uploader)s" no longer works.
5 participants