Some more url formats would be nice #40

MSDev201 · 2021-09-29T18:39:08Z

Channel links with the Channel name don't work:

tubearchivist      | {'csrfmiddlewaretoken': ['AY0tO0Dav2zXgzjttPYb4lEsQ5qgYn5NE4Fzk66r983FaufAOvSgNKSTb3mBAUFQ'], 'subscribe': ['https://www.youtube.com/c/veritasium']}
tubearchivist      | parsing subscribe ids failed!
tubearchivist      | ['https://www.youtube.com/c/veritasium']

As a workaround I currently copy the link from the channel name when watching a video. This link has the needed channel ID.

Playlist links in this format: https://www.youtube.com/watch?v=aFPJf-wKTd0&list=UUHnyfMqiRRG1u-2MsSQLbXA&index=2 are parsed as a single video, not as a playlist.

bbilly1 · 2021-09-30T09:19:29Z

I was under the impression that using the user name like that is not going to be unique, as a user can have multiple channels. Or am I mistaken there?

At the moment I detect the type of link from the length of the ID, as a way to separate video, channel and playlist. Channel names are a good example of the problem: a name can easily be 11 characters long, same as a video ID (veritasium, at 10 characters, comes close). I'm not saying that it couldn't be done to separate these things, but I think that will only be more confusing. Additionally, as you can also just add the video ID alone without the whole URL, this will make things more complicated and potentially ambiguous.
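
To illustrate the ambiguity, here is a minimal sketch of length-based detection. It is not the project's actual code: the function name and the length table are assumptions (11 for videos is stated above, 24 and 34 are taken from the lengths of IDs that appear elsewhere in this thread), and `somechannel` is a made-up 11-character channel name.

```python
# Assumed length table, for illustration only.
ID_LENGTHS = {11: "video", 24: "channel", 34: "playlist"}


def guess_type_by_length(candidate):
    """Guess what a bare ID refers to from its length alone."""
    return ID_LENGTHS.get(len(candidate.strip()), "unknown")


print(guess_type_by_length("HeQX2HjkcNo"))               # a video ID
print(guess_type_by_length("UCHnyfMqiRRG1u-2MsSQLbXA"))  # a channel ID
print(guess_type_by_length("somechannel"))               # made-up channel name
```

The last call is classified the same way as a real video ID, which is the kind of collision that makes bare channel names hard to support with this approach.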

So for the playlist link, yes, I chose that by design: when you open that page, you land on the video, not on the playlist. I don't think it would make sense to download the whole playlist when given a URL to a video, while giving the link to the playlist will download the whole playlist.

Btw: I have started writing these things down in the wiki.

MSDev201 · 2021-09-30T19:11:48Z

The channel name url is unique https://support.google.com/youtube/answer/2657968

Finding the parts of the url can be done by checking for the query parameters and start of url:
://www.youtube.com/watch?v=aFPJf-wKTd0&list=UUHnyfMqiRRG1u-2MsSQLbXA&index=2
://www.youtube.com/watch?v=HeQX2HjkcNo

Shared links have the id directly after the domain name:
://youtu.be/HeQX2HjkcNo
://youtu.be/HeQX2HjkcNo?t=100

A channel url could be identified by the first part of the url:
://www.youtube.com/c/veritasium
://www.youtube.com/c/veritasium/about
://www.youtube.com/channel/UCHnyfMqiRRG1u-2MsSQLbXA

Links like this could be difficult (they are nowhere linked on YouTube, so I think they are not needed, but they would be nice to have):
://www.youtube.com/veritasium

Formatting links in markdown is a little bit tricky but I hope you can see what I mean :)
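
A rough sketch of that idea in Python, using only the standard library. The function name `classify_url` and the returned labels are assumptions made for this example, not anything from the project, and only the URL shapes listed above are handled.

```python
from urllib.parse import urlparse, parse_qs


def classify_url(url):
    """Return a (kind, identifier) pair based on the structure of the URL."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    parts = [part for part in parsed.path.split("/") if part]
    query = parse_qs(parsed.query)

    # Shared links: the ID sits directly after the domain name.
    if host == "youtu.be" and parts:
        return "video", parts[0]
    if host == "youtube.com":
        # Watch links: the ID is in the v query parameter.
        if parts[:1] == ["watch"] and "v" in query:
            return "video", query["v"][0]
        # Channel links: identified by the first path segment.
        if len(parts) >= 2 and parts[0] == "channel":
            return "channel_id", parts[1]
        if len(parts) >= 2 and parts[0] == "c":
            return "channel_name", parts[1]
    return "unknown", url
```

Trailing parts such as `/about` or `?t=100` are ignored because only the first relevant path segment or query parameter is read. A link like `https://www.youtube.com/veritasium` falls through to `unknown` in this sketch.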

bbilly1 · 2021-10-01T04:18:02Z

Ah I see, I was under the impression that's the username and not the channel name, my mistake.

So as I understand it, even though the channel name is unique, it can change. So I think the solution here would be to convert the channel name to the channel ID at that point in time and keep using the channel ID for everything thereafter.

In addition to the link formats you have mentioned, I have also found:
www.youtube.com/user/<username>

I think this is the old format, so for some of the channels I'm subscribed to that have been around for some time, it's:
www.youtube.com/user/Computerphile
and not:
www.youtube.com/c/Computerphile

Strangely
www.youtube.com/user/veritasium
redirects to his 2009 channel where the only video is him explaining that this is his old channel... So these formats are not interchangeable.

A little bit confusing but the solution above still stands.

So yes, I think your way of identifying that is better than my current solution. This will require a rewrite of the process_url_list. Any chance you want to take a look at it?

lamusmaser · 2021-10-02T21:52:05Z

I have a rough outline of how this could work using only the modules that are already included. I will get a pull request submitted either later today or tomorrow.

Outline of tasks:
Instead of raising an exception for "/c/" or "/user/", the function removes the parameters from the link, finds the actual username within the link, requests the YouTube page for the user, grabs the "canonical" link reference, parses that through a recursive call to process_url_list, and then sends that data through to the parent request (it will only process that one time) to add to the youtube_ids list.

I'll reference this when I get the pull request.

MSDev201 · 2021-10-03T00:13:32Z

ChannelId can also be found in a meta field:

<meta itemprop="channelId" content="UCHnyfMqiRRG1u-2MsSQLbXA">
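
For illustration, a tag like this could be read with Python's standard library alone. This is only a sketch: the names `ChannelIdParser` and `find_channel_id` are made up for the example, and it assumes the page HTML has already been downloaded as a string.

```python
from html.parser import HTMLParser


class ChannelIdParser(HTMLParser):
    """Remember the content of the first <meta itemprop="channelId"> tag."""

    def __init__(self):
        super().__init__()
        self.channel_id = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        is_channel_meta = tag == "meta" and attributes.get("itemprop") == "channelId"
        if is_channel_meta and self.channel_id is None:
            self.channel_id = attributes.get("content")


def find_channel_id(html):
    """Return the channel ID found in the HTML, or None if there is none."""
    parser = ChannelIdParser()
    parser.feed(html)
    return parser.channel_id
```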

MSDev201 · 2021-10-03T00:21:48Z

I don't have much experience with Python, but wouldn't it be easier to parse the links with RegEx? You could define patterns for videos, playlists and channels and extract the relevant parts from the URL.

Maybe I have time next weekend and can try to implement something like that.

Quick prototype:

(https?:\/\/)(www\.)?(youtube|youtu)\.([\w]{2,3})\/(watch\?v=)?(?<videoId>[\w\d]{11})

Group "videoId" would hold watch IDs of these urls:

https://www.youtube.com/watch?v=JSeZ12Juryk
https://youtube.com/watch?v=JSeZ12Juryk
https://www.youtube.de/watch?v=JSeZ12Juryk
https://youtu.be/JSeZ12Juryk
http://www.youtube.de/watch?v=JSeZ12Juryk
https://www.youtube.com/watch?v=KT18KJouHWg&list=UUHnyfMqiRRG1u-2MsSQLbXA&index=2

The RegEx would have to be adapted further so that other possible cases are also covered. For example, this regex does not yet cover links that do not have the v parameter in the first position.
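
For reference, a possible Python adaptation of the prototype, offered as a sketch rather than a finished pattern. Python's `re` module spells named groups `(?P<name>...)`. The pattern name `VIDEO_RE` and the helper `extract_video_id` are made up for this example. Two changes go beyond the prototype and are assumptions of this sketch: the ID character class also allows `-`, since IDs such as `aFPJf-wKTd0` contain one, and the `v` parameter may appear anywhere in the query string.

```python
import re

VIDEO_RE = re.compile(
    r"https?://(?:www\.)?"
    r"(?:youtube\.\w{2,3}/watch\?(?:[^#\s]*&)?v=|youtu\.be/)"
    r"(?P<video_id>[\w-]{11})"
)


def extract_video_id(url):
    """Return the 11-character video ID, or None if the URL does not match."""
    match = VIDEO_RE.match(url.strip())
    return match.group("video_id") if match else None
```

Channel and playlist links would need their own patterns; this one only targets video links.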

lamusmaser · 2021-10-03T00:45:44Z

Based on my understanding of the code, the specific function provides the following:
Get a request to download with a URL link -> sanitize the link by removing the URL prefix in front of the ID (which is consistent for all of the varieties of URLs that include either the video, playlist, or channel ID) -> sanitize it further by removing any additional query parameters that could be appended to the end (to get the actual ID) -> send that forward to the requesting function to handle it as either a video, playlist, or channel. A video will grab the channel data from the youtubedl function. A playlist will go through each of its videos and grab their data from the youtubedl function as well. A channel is already set up, so no further processing is required.

My proposed solution attempts to get the channel data from YouTube directly (it could also call the youtubedl function, but that wasn't already imported within this script). I could adjust it to use the meta tag; I just found the tag via my own search for the ID and saw that the page provides it via a link tag with the full canonical URL, which can then be passed to the function recursively and provide the same sanitized results.

There are cleaner ways to do it, and you're welcome to submit your own PR that does it differently. This was my quick way to get it to work so we could show that it functions; then we can enhance it afterward. We could import a parser like Beautiful Soup, send a JavaScript request as part of the Requests call to get the specific tag, or look at utilizing the youtubedl function to maintain consistency.

bbilly1 · 2021-10-03T02:09:11Z

Thanks to everybody for looking into it! I do think that should be passed to yt-dlp to extract instead. I mean, all of the solutions above will work, but from a future maintainability standpoint, if yt-dlp already provides the needed information, that's going to make things easier going forward.
The only time scraping in the traditional sense was needed was for extracting some additional channel info that yt-dlp doesn't provide.

I have no qualms importing yt-dlp there, that's a good reason.

So the example using yt-dlp to extract the channel_id for "/c/" or "/user/" urls could look something like this:

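import yt_dlp as youtube_dl  # assumes yt-dlp is imported under the youtube_dl name

# url is the "/c/" or "/user/" link to resolve
obs = {
    "default_search": "ytsearch",
    "quiet": True,
    "skip_download": True,
    "extract_flat": True,
    "playlistend": 0,
}
chan = youtube_dl.YoutubeDL(obs).extract_info(url, download=False)
channel_id = chan["channel_id"]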

Maybe that's where this process_url_list function should be converted to a class. "If it doesn't fit in a screen, it's too big" or something, I don't know, I don't always follow these rules either... :-)

So pseudo code:

class UrlListParser:
    """take a multi line string and detect valid youtube ids"""

    def __init__(self, url_str):
        # split the string by newline, dropping empty lines
        self.url_list = [i.strip() for i in url_str.splitlines() if i.strip()]

    def process_list(self):
        """loop through the list"""
        youtube_ids = []
        for url in self.url_list:
            if "/c/" in url or "/user/" in url:
                # detect /c/ and /user/
                youtube_id = self.extract_channel_name(url)
                youtube_ids.append({"url": youtube_id, "type": "channel"})
            else:
                # detect the rest
                youtube_id, id_type = self.find_valid_id(url)
                youtube_ids.append({"url": youtube_id, "type": id_type})

        return youtube_ids

    def extract_channel_name(self, url):
        """extract channel_id from channel name using yt-dlp"""
        # do the extraction, then return youtube_id
        raise NotImplementedError

    def find_valid_id(self, url):
        """extract the id and detect the type"""
        # do the extraction, then return youtube_id, id_type
        # id_type can be channel, video or playlist
        raise NotImplementedError

Then the class can be called like this from anywhere it's needed:

youtube_ids = UrlListParser(url_str).process_list()

@lamusmaser Do you want to flesh it out? It still needs to raise a ValueError if extraction fails...

Then I'll change it everywhere in the project where it needs to be called and we have a nice reusable and extendable class there! :-)
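
One way the pseudo code above might be fleshed out, as a sketch under stated assumptions rather than the implementation that ended up in the project. The `channel_resolver` argument is hypothetical: it stands in for whatever turns a "/c/" or "/user/" link into a channel ID (for example a thin wrapper around yt-dlp), so the class can be tried without network access. Bare IDs without a URL are deliberately left out of this sketch.

```python
from urllib.parse import urlparse, parse_qs


class UrlListParser:
    """Take a multi-line string and detect valid YouTube IDs."""

    def __init__(self, url_str, channel_resolver):
        # channel_resolver is a hypothetical callable: URL -> channel ID.
        self.url_list = [i.strip() for i in url_str.splitlines() if i.strip()]
        self.channel_resolver = channel_resolver

    def process_list(self):
        """Loop through the list and collect {"url": ..., "type": ...} items."""
        youtube_ids = []
        for url in self.url_list:
            if "/c/" in url or "/user/" in url:
                youtube_id, id_type = self.channel_resolver(url), "channel"
            else:
                youtube_id, id_type = self.find_valid_id(url)
            youtube_ids.append({"url": youtube_id, "type": id_type})
        return youtube_ids

    @staticmethod
    def find_valid_id(url):
        """Extract the ID and detect the type, or raise ValueError."""
        parsed = urlparse(url if "://" in url else "https://" + url)
        query = parse_qs(parsed.query)
        parts = [part for part in parsed.path.split("/") if part]
        if parsed.netloc.endswith("youtu.be") and parts:
            return parts[0], "video"
        if parts[:1] == ["watch"] and "v" in query:
            return query["v"][0], "video"
        if parts[:1] == ["playlist"] and "list" in query:
            return query["list"][0], "playlist"
        if len(parts) == 2 and parts[0] == "channel":
            return parts[1], "channel"
        raise ValueError(f"not a video, channel or playlist URL: {url}")
```

Reading the ID from the path or query string instead of checking its length means the result does not depend on how long a given ID happens to be.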

lamusmaser · 2021-10-03T02:32:48Z

@bbilly1 I will take a swing at updating it and any of the references based on pseudo above. I'll see if I can get more in this weekend, otherwise it will be next weekend.

bbilly1 · 2021-10-03T02:41:53Z

Nice! I've planned to make a new release later today to get some of the new UI and some other changes out. But no pressure, it's important to take the time needed, and this can also go in the release after.

MSDev201 · 2021-10-08T16:20:14Z

Today I tried to include some playlists in my download queue and it looks like not every playlist ID is 34 characters long. Can this also be fixed by using yt-dlp?

As an example, this URL has a playlist ID with 18 characters:
https://www.youtube.com/playlist?list=PL16649CCE7EFA8B2F

lamusmaser · 2021-10-25T05:17:36Z

This has not been forgotten, I just haven't had the time to devote to recoding this section. If someone else gets to it first, please say so here. It is in my to-do list, but I have a few other items ahead of it (unrelated to this project).

bbilly1 · 2021-10-31T08:58:03Z

OK, I took a swing at that, I think I could get all of the requirements working as expected, even www.youtube.com/veritasium works. :-)

Look forward to the improvements in the next release!

bbilly1 · 2021-11-01T11:21:49Z

OK, this is now merged in v0.0.7. Thanks to everybody for looking into it!

Closing this for now.

Panzer1119 · 2021-11-12T11:15:02Z

I'm on v0.0.7 and I can't add YouTube Video URLs like these https://youtu.be/2tdiKTSdE9Y to the download queue (that url is taken directly from the wiki).

If I try adding it, it tells me:

Failed to extract links.
Not a video, channel or playlist ID or URL

bbilly1 · 2021-11-12T12:24:53Z

Oh man, of course you are right... Thanks for taking the time, that's very unfortunate...

lamusmaser mentioned this issue Oct 2, 2021

Update helper.py #44

Closed

bbilly1 added a commit that referenced this issue Oct 31, 2021

rewrite url_str extractor to convert channel names into channel ids, #40

1ba3090

bbilly1 closed this as completed Nov 1, 2021

bbilly1 added a commit that referenced this issue Nov 12, 2021

fix youtu.be extractor, #40

7930da5
