Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ie/sheeta] Support websites based on sheeta #9978

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

pzhlkj6612
Copy link
Contributor

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

This PR fixes #9541, fixes #4655, closes #8645, fixes #8317.

Sheeta is a fan club system developed by DWANGO Co., Ltd. (株式会社ドワンゴ).

Sheeta logo

You can find websites supported by this extractor by Googling ""(C) DWANGO Co., Ltd." "入会"".

This PR should NOT be merged unless the expected_status=404 logic for generic sites gets implemented (see #9541 (comment)).

Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

yt_dlp/extractor/sheeta.py Fixed Show fixed Hide fixed
yt_dlp/extractor/sheeta.py Fixed Show fixed Hide fixed
@seproDev seproDev added the site-request Request to support a new website label May 20, 2024
@pzhlkj6612
Copy link
Contributor Author

A target: OOP. Make AuthProvider or AuthManager to hide all the mess about login. (hey, "auth" stands for "authentication" or "authorization"?)

if not mail_tel:
return

cache_key = hashlib.sha1(f'{self._DOMAIN}:{mail_tel}:{password}'.encode()).hexdigest()

Check failure

Code scanning / CodeQL

Use of a broken or weak cryptographic hashing algorithm on sensitive data High

Sensitive data (password)
is used in a hashing algorithm (SHA1) that is insecure for password hashing, since it is not a computationally expensive hash function.
Sensitive data (id)
is used in a hashing algorithm (SHA1) that is insecure.
Copy link
Member

@bashonly bashonly May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we caching this to disk? From what I can tell, the data we are caching cache_key doesn't depend on any responses from the site, we are just caching username and password to disk? And it will only be loaded if the user is already passing their username and password?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I'm caching tokens for successfully logged-in users.

Almost all sheeta-based websites (except https://nicochannel.jp/) don't support cookie-based authorization, so login credentials are required. If a user runs yt-dlp multiple times without cached tokens, time will be wasted submitting login information.

This is emulating a web browser. The JS code stores tokens in Local Storage, so users don't have to type password every time they enter sheeta sites.

To keep each user's token distinct from anyone else's on each website, I'm using a hash algorithm to make the cache key.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cache_key just seems unnecessary to me, and there is definitely no reason to be hashing the password. Using self._NETRC_MACHINE as the section should do the work of distinguishing between different sites, and the key can just be the username

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or if for some reason you do need _DOMAIN instead of simply self._NETRC_MACHINE, the password can still be omitted from the key

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this IE supports multiple sites, but self._NETRC_MACHINE is a constant string. If a user uses the same username (e-mail address) on different sites, the cache will be overwritten. So, I think it's not OK to use self._NETRC_MACHINE as the section.

About hashing password: Now I think that's totally unnecessary. If a cached token is invalidated for some reason, our code will drop it and try getting a new one. We shouldn't care if the password is changed or not.

I still need a cache_key since I need update cached tokens outside of the login procedure.

So:

cache_key = base64.b64decode(f'{self._DOMAIN}:{mail_tel}'.encode()).decode()

Your opinion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this IE supports multiple sites, but self._NETRC_MACHINE is a constant string.

We should not be subclassing these other sites' extractors from a concrete IE like SheetaEmbedIE. There should be a SheetaBaseIE from which they are subclassed. This will also simplify it so we can set the _NETRC_MACHINE in the subclasses and it will be unique per site

yt_dlp/extractor/sheeta.py Dismissed Show dismissed Hide dismissed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
site-request Request to support a new website
Projects
None yet
3 participants