Some more url formats would be nice #40

Closed
MSDev201 opened this issue Sep 29, 2021 · 16 comments

Comments

@MSDev201

Channel links with the Channel name don't work:

tubearchivist      | {'csrfmiddlewaretoken': ['AY0tO0Dav2zXgzjttPYb4lEsQ5qgYn5NE4Fzk66r983FaufAOvSgNKSTb3mBAUFQ'], 'subscribe': ['https://www.youtube.com/c/veritasium']}
tubearchivist      | parsing subscribe ids failed!
tubearchivist      | ['https://www.youtube.com/c/veritasium']

As a workaround, I currently copy the link from the channel name when watching a video. This link has the needed channel ID.

Playlist links in this format: https://www.youtube.com/watch?v=aFPJf-wKTd0&list=UUHnyfMqiRRG1u-2MsSQLbXA&index=2 are parsed as a single video, not as a playlist.

@bbilly1
Member

bbilly1 commented Sep 30, 2021

I was under the impression that using the user name like that is not going to be unique, as a user can have multiple channels. Or am I mistaken there?

At the moment I detect the type of link from the length of the ID, as a way to tell video, channel and playlist apart. veritasium is a good example: at 10 characters it is almost as long as an 11-character video ID, and a channel name of exactly 11 characters could not be told apart from one. I'm not saying it couldn't be done, but I think it would only make things more confusing. In addition, since you can also add the video ID alone without the whole URL, this would make things more complicated and potentially ambiguous.
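To make the ambiguity concrete, here is a minimal sketch of length-based detection. It is not the project's actual code: the function name `detect_by_length` and the 24- and 34-character lengths for channel and playlist IDs are assumptions for illustration, and `examplename` is a made-up channel name.

```python
# Hypothetical sketch of length-based detection, not TubeArchivist's real code.
# Assumed lengths: 11 = video, 24 = channel, 34 = playlist.
ID_TYPES_BY_LENGTH = {11: "video", 24: "channel", 34: "playlist"}


def detect_by_length(youtube_id):
    """Guess the ID type from its length alone."""
    id_type = ID_TYPES_BY_LENGTH.get(len(youtube_id))
    if id_type is None:
        raise ValueError(f"cannot detect type of {youtube_id!r}")
    return id_type


print(detect_by_length("aFPJf-wKTd0"))               # video
print(detect_by_length("UCHnyfMqiRRG1u-2MsSQLbXA"))  # channel
# A made-up 11-character channel name looks exactly like a video ID.
print(detect_by_length("examplename"))               # video
```

Length alone cannot tell a name from an ID, which is why a bare channel name would need some other signal, such as the URL path it came from.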

As for the playlist link: yes, I chose that by design. When you open that page, you land on the video, not on the playlist. I don't think it would make sense to download the whole playlist when given a URL to a video, whereas giving the link to the playlist will download the whole playlist.

Btw: I have started writing these things down in the wiki.

@MSDev201
Author

MSDev201 commented Sep 30, 2021

The channel name URL is unique: https://support.google.com/youtube/answer/2657968

Finding the parts of the url can be done by checking for the query parameters and start of url:
://www.youtube.com/watch?v=aFPJf-wKTd0&list=UUHnyfMqiRRG1u-2MsSQLbXA&index=2
://www.youtube.com/watch?v=HeQX2HjkcNo

Shared links have the id directly after the domain name:
://youtu.be/HeQX2HjkcNo
://youtu.be/HeQX2HjkcNo?t=100

A channel url could be identified by the first part of the url:
://www.youtube.com/c/veritasium
://www.youtube.com/c/veritasium/about
://www.youtube.com/channel/UCHnyfMqiRRG1u-2MsSQLbXA

Links like this could be difficult (but they are not linked anywhere on YouTube, so I think they are nice to have rather than needed):
://www.youtube.com/veritasium

Formatting links in markdown is a little bit tricky but I hope you can see what I mean :)
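The URL shapes listed above could be told apart along these lines. This is only a sketch under assumptions: `classify_url` and its return labels are hypothetical names, an `https://` scheme is assumed in front of the `://` examples, and only the `youtube.com` and `youtu.be` hosts are handled.

```python
from urllib.parse import urlparse, parse_qs


def classify_url(url):
    """Return (kind, value) for a YouTube URL, based on host, path and query."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    parts = [part for part in parsed.path.split("/") if part]
    query = parse_qs(parsed.query)

    if host == "youtu.be" and parts:
        # shared links: the video ID directly follows the domain
        return "video", parts[0]
    if host == "youtube.com" and parts:
        if parts[0] == "watch" and "v" in query:
            # a watch link is treated as a video, even with a list parameter
            return "video", query["v"][0]
        if parts[0] == "channel" and len(parts) > 1:
            return "channel_id", parts[1]
        if parts[0] == "c" and len(parts) > 1:
            # extra segments such as /about are ignored
            return "channel_name", parts[1]
    raise ValueError(f"unrecognized URL: {url}")


print(classify_url("https://youtu.be/HeQX2HjkcNo?t=100"))
print(classify_url("https://www.youtube.com/c/veritasium/about"))
```

The bare `youtube.com/veritasium` form is deliberately left unrecognized here, since nothing in the path marks it as a channel.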

@bbilly1
Member

bbilly1 commented Oct 1, 2021

Ah I see, I was under the impression that's the username and not the channel name, my mistake.

So as I understand it, even though the channel name is unique, it can change. So I think the solution here would be to convert the channel name to the channel ID at that point in time and keep using the channel ID for everything thereafter.

In addition to the link formats you have mentioned, I have also found:
www.youtube.com/user/<username>

I think this is the old format, so some of the channels I'm subscribed to that have been around for some time still use it. For example, it's:
www.youtube.com/user/Computerphile
and not:
www.youtube.com/c/Computerphile

Strangely
www.youtube.com/user/veritasium
redirects to his 2009 channel where the only video is him explaining that this is his old channel... So these formats are not interchangeable.

A little bit confusing but the solution above still stands.

So yes, I think your way of identifying that is better than my current solution. This will require a rewrite of the process_url_list. Any chance you want to take a look at it?

@lamusmaser
Collaborator

I have a rough outline of how this could work using only the modules that are already included. I will submit a pull request either later today or tomorrow.

Outline of tasks:
Instead of raising an exception for "/c/" or "/user/", it removes the parameters from the link, finds the actual username within the link, requests the YouTube page for the user, grabs the "canonical" link reference, then passes that through a recursive call to process_url_list and sends the result back to the parent request (this happens only once) to add to the youtube_ids list.

I'll reference this when I get the pull request.

@MSDev201
Author

MSDev201 commented Oct 3, 2021

ChannelId can also be found in a meta field:

<meta itemprop="channelId" content="UCHnyfMqiRRG1u-2MsSQLbXA">
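Both the canonical link tag and the meta field can be read with the standard library alone. The sketch below runs on a small made-up HTML fragment instead of a real request; the class name `ChannelIdFinder` and the sample `href` value are assumptions, and a real channel page is far larger and may change its markup.

```python
from html.parser import HTMLParser

# Simplified, made-up page fragment standing in for a fetched channel page.
SAMPLE_HTML = """
<html><head>
<link rel="canonical" href="https://www.youtube.com/channel/UCHnyfMqiRRG1u-2MsSQLbXA">
<meta itemprop="channelId" content="UCHnyfMqiRRG1u-2MsSQLbXA">
</head></html>
"""


class ChannelIdFinder(HTMLParser):
    """Collect the canonical URL and the channelId meta field, if present."""

    def __init__(self):
        super().__init__()
        self.canonical_url = None
        self.channel_id = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "canonical":
            self.canonical_url = attrs.get("href")
        elif tag == "meta" and attrs.get("itemprop") == "channelId":
            self.channel_id = attrs.get("content")


finder = ChannelIdFinder()
finder.feed(SAMPLE_HTML)
print(finder.channel_id)
print(finder.canonical_url)
```

The meta field yields the ID directly, while the canonical link yields a full /channel/ URL that would still need to be sanitized.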

@MSDev201
Author

MSDev201 commented Oct 3, 2021

I don't have much experience with Python, but wouldn't it be easier to parse the links with RegEx? You could define patterns for videos, playlists and channels and extract the relevant parts from the URL.

Maybe I have time next weekend and can try to implement something like that.

Quick prototype:

(https?:\/\/)(www\.)?(youtube|youtu)\.([\w]{2,3})\/(watch\?v=)?(?<videoId>[\w\d]{11})

Group "videoId" would hold watch IDs of these urls:

https://www.youtube.com/watch?v=JSeZ12Juryk
https://youtube.com/watch?v=JSeZ12Juryk
https://www.youtube.de/watch?v=JSeZ12Juryk
https://youtu.be/JSeZ12Juryk
http://www.youtube.de/watch?v=JSeZ12Juryk
https://www.youtube.com/watch?v=KT18KJouHWg&list=UUHnyfMqiRRG1u-2MsSQLbXA&index=2

The RegEx would have to be adapted further so that other possible cases are also covered. For example, this regex does not yet cover links where the v parameter is not in the first position.
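For anyone trying the prototype in Python, the named group has to be written as `(?P<videoId>...)`; the rest of the pattern is unchanged. The helper name `video_id` is made up for this sketch.

```python
import re

# Same prototype as above, with the named group in Python's (?P<name>...) form.
VIDEO_RE = re.compile(
    r"(https?:\/\/)(www\.)?(youtube|youtu)\.([\w]{2,3})\/(watch\?v=)?(?P<videoId>[\w\d]{11})"
)


def video_id(url):
    """Return the videoId group, or None if the prototype does not match."""
    match = VIDEO_RE.match(url)
    return match.group("videoId") if match else None


print(video_id("https://youtu.be/JSeZ12Juryk"))
print(video_id("https://www.youtube.com/watch?v=KT18KJouHWg&list=UUHnyfMqiRRG1u-2MsSQLbXA&index=2"))
# Not covered: v is not the first query parameter.
print(video_id("https://www.youtube.com/watch?list=UUHnyfMqiRRG1u-2MsSQLbXA&v=KT18KJouHWg"))
# Not covered: \w does not match the "-" that IDs such as aFPJf-wKTd0 contain.
print(video_id("https://www.youtube.com/watch?v=aFPJf-wKTd0"))
```

The last two calls return None, so besides parameter order the character class would also need a hyphen added before the pattern could cover IDs like the one in the original report.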

@lamusmaser
Collaborator

Based on my understanding of the code, the specific function provides the following:
Get a request to download with a URL link -> Sanitize the link by removing the prefixed URL data in front of the link (which is consistent for all of the varieties of URLs that include either the video, playlist, or channel ID) -> Sanitize it further by removing any additional query links that could be appended to the end (to get the actual ID) -> send that forward to the requesting function to handle it as either a video, playlist, or channel. The video will grab the channel data from the youtubedl function. Playlist will go through each of the videos and grab each of their data from the youtubedl function, as well. Channel is obviously already setup, so no further processing required.

My proposed solution attempts to get the channel data from youtube directly (could also call the youtubedl function, but it wasn't already imported within this script); I could adjust it to use the meta tag, I just found the tag via my own search for the ID and found that it provided it via a link tag, with the full canonical URL, which would then be able to be passed to the function recursively and provide the same sanitized results.

There are cleaner ways to do it, and you're welcome to submit your own PR that does it differently. This was my quick way to get it to work so we could show that it functions; we can enhance it afterward. Options include importing a parser like Beautiful Soup, sending a JavaScript request as part of the Requests call to get the specific tag, or using the youtubedl function to maintain consistency.

@bbilly1
Member

bbilly1 commented Oct 3, 2021

Thanks to everybody for looking into it! I do think that should be passed to yt-dlp to extract instead. I mean, all of the solutions above will work, but from a future maintainability standpoint, if yt-dlp already provides the needed information, that's going to make things easier going forward.
The only time where scraping in a traditional sense was needed, was extracting some additional channel info that yt-dlp doesn't do.

I have no qualms importing yt-dlp there, that's a good reason.

So the example using yt-dlp to extract the channel_id for "/c/" or "/user/" urls could look something like this:

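# assumption: yt-dlp is imported under the youtube_dl name used below
import yt_dlp as youtube_dl

obs = {
    "default_search": "ytsearch",
    "quiet": True,
    "skip_download": True,
    "extract_flat": True,
    "playlistend": 0,
}
# url is the "/c/" or "/user/" link to resolve
chan = youtube_dl.YoutubeDL(obs).extract_info(url, download=False)
channel_id = chan["channel_id"]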

Maybe that's where this process_url_list function should be converted to a class. "If it doesn't fit on a screen, it's too big" or something, I don't know, I don't always follow these rules either... :-)

So pseudo code:

class UrlListParser:
    """take a multi line string and detect valid youtube ids"""

    def __init__(self, url_str):
        # split the string by newline
        self.url_list = url_str.splitlines()

    def process_list(self):
        """loop through the list"""
        youtube_ids = []
        for url in self.url_list:
            if "/c/" in url or "/user/" in url:
                # detect /c/ and /user/
                youtube_id = self.extract_channel_name(url)
                youtube_ids.append({"url": youtube_id, "type": "channel"})
            else:
                # detect the rest
                youtube_id, id_type = self.find_valid_id(url)
                youtube_ids.append({"url": youtube_id, "type": id_type})

        return youtube_ids

    def extract_channel_name(self, url):
        """extract channel_id from channel name using yt-dlp"""
        # do the extraction, then return youtube_id
        raise NotImplementedError

    def find_valid_id(self, url):
        """extract the id and detect the type"""
        # do the extraction, then return youtube_id, id_type
        # id_type can be channel, video or playlist
        raise NotImplementedError

Then the class can be called like that from anywhere needed:

youtube_ids = UrlListParser(url_str).process_list()

@lamusmaser Do you want to flesh it out? It still needs to raise a ValueError if extraction fails...

Then I'll change it everywhere in the project where it needs to be called and we have a nice reusable and extendable class there! :-)
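One way the class could be fleshed out, including the ValueError on failure, is sketched below. It is not the implementation that was eventually merged. To keep it runnable offline, the yt-dlp lookup is replaced by a hypothetical `resolve_channel` callable passed to the constructor, and the ID lengths (11, 24, 34) are assumptions.

```python
from urllib.parse import urlparse, parse_qs


class UrlListParser:
    """Sketch: parse a multi-line string into typed YouTube IDs."""

    # assumed ID lengths, for illustration only
    ID_TYPES = {11: "video", 24: "channel", 34: "playlist"}

    def __init__(self, url_str, resolve_channel):
        # resolve_channel is a hypothetical callable: URL -> channel ID
        self.url_list = [i.strip() for i in url_str.splitlines() if i.strip()]
        self.resolve_channel = resolve_channel

    def process_list(self):
        """loop through the list, raising ValueError on the first bad line"""
        youtube_ids = []
        for url in self.url_list:
            if "/c/" in url or "/user/" in url:
                youtube_id, id_type = self.resolve_channel(url), "channel"
            else:
                youtube_id, id_type = self.find_valid_id(url)
            youtube_ids.append({"url": youtube_id, "type": id_type})
        return youtube_ids

    def find_valid_id(self, url):
        """take v or list from the query, else the last path segment"""
        parsed = urlparse(url)
        query = parse_qs(parsed.query)
        if "v" in query:
            candidate = query["v"][0]
        elif "list" in query:
            candidate = query["list"][0]
        else:
            candidate = parsed.path.rstrip("/").split("/")[-1]
        id_type = self.ID_TYPES.get(len(candidate))
        if id_type is None:
            raise ValueError(f"could not detect a valid ID in: {url}")
        return candidate, id_type


def fake_resolver(url):
    """Stand-in for the yt-dlp lookup; always returns the same channel ID."""
    return "UCHnyfMqiRRG1u-2MsSQLbXA"


url_str = "https://www.youtube.com/c/veritasium\nhttps://youtu.be/HeQX2HjkcNo?t=100\n"
print(UrlListParser(url_str, fake_resolver).process_list())
```

Injecting the resolver keeps the parsing logic testable without network access; in the real project the yt-dlp call shown earlier would take its place.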

@lamusmaser
Collaborator

@bbilly1 I will take a swing at updating it and any of the references based on the pseudo code above. I'll see if I can get more in this weekend, otherwise it will be next weekend.

@bbilly1
Member

bbilly1 commented Oct 3, 2021

Nice! I've planned to make a new release later today to get some of the new UI and some other changes out. But no pressure, it's important to take the time needed, and this can also go in the release after.

@MSDev201
Author

MSDev201 commented Oct 8, 2021

Today I tried to include some playlists in my download queue and it looks like not every playlist ID is 34 characters long. Can this also be fixed by using yt-dlp?

As example this URL has a playlist ID with 18 characters:
https://www.youtube.com/playlist?list=PL16649CCE7EFA8B2F
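Independent of yt-dlp, a playlist URL like this one could be recognized by its path and `list` parameter instead of by ID length. A minimal sketch, with `playlist_id_from_url` as a made-up helper name:

```python
from urllib.parse import urlparse, parse_qs


def playlist_id_from_url(url):
    """Return the list parameter of a /playlist URL, or None otherwise."""
    parsed = urlparse(url)
    if parsed.path != "/playlist":
        return None
    return parse_qs(parsed.query).get("list", [None])[0]


playlist_id = playlist_id_from_url("https://www.youtube.com/playlist?list=PL16649CCE7EFA8B2F")
print(playlist_id, len(playlist_id))
```

This only helps when the full URL is given; a bare 18-character ID pasted on its own would still need a different check.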

@lamusmaser
Collaborator

This has not been forgotten, I just haven't had the time to devote to recoding this section. If someone else gets to it first, please say so here. It is in my to-do list, but I have a few other items ahead of it (unrelated to this project).

@bbilly1
Member

bbilly1 commented Oct 31, 2021

OK, I took a swing at that, I think I could get all of the requirements working as expected, even www.youtube.com/veritasium works. :-)

Look forward to the improvements in the next release!

@bbilly1
Member

bbilly1 commented Nov 1, 2021

OK, this is now merged in v0.0.7. Thanks to everybody for looking into it!

Closing this for now.

@bbilly1 bbilly1 closed this as completed Nov 1, 2021
@Panzer1119

I'm on v0.0.7 and I can't add YouTube video URLs like https://youtu.be/2tdiKTSdE9Y to the download queue (that URL is taken directly from the wiki).

If I try adding it, it tells me:

Failed to extract links.
Not a video, channel or playlist ID or URL

@bbilly1
Member

bbilly1 commented Nov 12, 2021

Oh man, of course you are right... Thanks for taking the time, that's very unfortunate...

bbilly1 added a commit that referenced this issue Nov 12, 2021