# Intro

In this tutorial, we will use BeautifulSoup and other resources to scrape data from the YouTube website and build a relational graph model of the relationships between channels. Two channels are considered "connected" if one of them appears in the sidebar of the other's channel homepage. For example, for the user [PewDiePie](https://www.youtube.com/user/PewDiePie) (the most subscribed YouTuber at the time of writing), the related channels are penguinz0, VanossGaming, IGN, and The Game Theorists.

# Getting the data

We want to build a graph of related YouTube channels. To do this, we need to know which channels are recommended or featured by a particular channel. The first thing we might think to use is the [YouTube API](https://developers.google.com/youtube/v3/). But if you look closely, you will see that the YouTube API does not provide this information, so we must be more resourceful. We will do the same thing we did in the Yelp part of the first assignment of this class and use requests and BeautifulSoup to scrape the webpage for this information.

First we must import the relevant libraries:

In [1]:
import requests
from bs4 import BeautifulSoup

Next we will use the same function as in HW1 to retrieve HTML pages:

In [2]:
def retrieve_html(url):
    r = requests.get(url)
    return (r.status_code, r.text)

Let's test this just to make sure it works. What happens if we retrieve the url of PewDiePie's channel?

In [89]:
(_, webpage) = retrieve_html("https://www.youtube.com/user/PewDiePie")
print(webpage)


    <!DOCTYPE html><html lang="en" data-cast-api-enabled="true"><head><style name="www-roboto" >@font-face{font-family:'Roboto';font-style:normal;font-weight:400;src:local('Roboto Regular'),local('Roboto-Regular'),url(//fonts.gstatic.com/s/roboto/v18/KFOmCnqEu92Fr1Mu4mxP.ttf)format('truetype');}@font-face{font-family:'Roboto';font-style:normal;font-weight:500;src:local('Roboto Medium'),local('Roboto-Medium'),url(//fonts.gstatic.com/s/roboto/v18/KFOlCnqEu92Fr1MmEU9fBBc9.ttf)format('truetype');}@font-face{font-family:'Roboto';font-style:italic;font-weight:400;src:local('Roboto Italic'),local('Roboto-Italic'),url(//fonts.gstatic.com/s/roboto/v18/KFOkCnqEu92Fr1Mu51xIIzc.ttf)format('truetype');}@font-face{font-family:'Roboto';font-style:italic;font-weight:500;src:local('Roboto Medium Italic'),local('Roboto-MediumItalic'),url(//fonts.gstatic.com/s/roboto/v18/KFOjCnqEu92Fr1Mu51S7ACc6CsE.ttf)format('truetype');}</style><script name="www-roboto" >if (document.fonts && document.fonts.load) {doc

This should print a jumble of HTML. You can ctrl-f "PewDiePie" to ensure it is the right jumble. 

Now we want to find out which channels are on PewDiePie's recommended sidebar. Which part of the jumble describes that? In practice, the second-fastest way to figure this out is to open the webpage in Chrome, right-click on that part of the webpage, and click "Inspect element."

The fastest way is to have someone tell you. It's the ul element whose class is "branded-page-related-channels-list". You can get it using BeautifulSoup like this:

In [21]:
soup = BeautifulSoup(webpage, 'html.parser') #setup
sidebar = soup.find('ul', {'class': 'branded-page-related-channels-list'})
print(sidebar) #test

<ul class="branded-page-related-channels-list">
<li class="branded-page-related-channels-item spf-link clearfix" data-external-id="UCq6VFHwMzcMXbuKyG7SQYIg">
<span class="yt-lockup clearfix yt-lockup-channel yt-lockup-mini">
<div class="yt-lockup-thumbnail" style="width: 34px;">
<a aria-hidden="true" class="ux-thumb-wrap yt-uix-sessionlink spf-link " data-sessionlink="ei=VFLAWvfqLbaqhwaf-LjYCw&amp;feature=rc-rel&amp;ved=CGsQ9BwYACITCPfklsKRmNoCFTbVwQodHzwOuyibHA" href="/user/penguinz0" rel="nofollow"> <span class="video-thumb yt-thumb yt-thumb-34">
<span class="yt-thumb-square">
<span class="yt-thumb-clip">
<img alt="" aria-hidden="true" data-thumb="https://yt3.ggpht.com/a-/AJLlDp079_1NwKa-w-xmsRVfqSHqL_-JtDoXxtJUVw=s88-mo-c-c0xffffffff-rj-k-no" data-ytimg="1" height="34" onload=";window.__ytRIL &amp;&amp; __ytRIL(this)" src="/yts/img/pixel-vfl3z5WfW.gif" width="34"/>
<span class="vertical-align"></span>
</span>
</span>
</span>
</a>
</div>
<div class="yt-lockup-content">
<h3 class="yt-

The first channel in the list is penguinz0. To find this name, we look at the first item in the HTML list and find its "yt-lockup-title" element.

In [32]:
firstChannel = sidebar.find("h3", {"class": "yt-lockup-title"}).a.text
print(firstChannel)

penguinz0


We also want the link.

In [34]:
firstChannelURL = sidebar.find("h3", {"class": "yt-lockup-title"}).a['href']
print(firstChannelURL)

/user/penguinz0


We will use this link later to "crawl" channels.
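Note that the link is relative to the site root, so it has to be joined onto the base address before it can be requested. A minimal sketch of one way to do that with the standard library follows; the helper name absolute_url and the constant BASE_URL are made up for this illustration and are not used elsewhere in the tutorial.

```python
from urllib.parse import urljoin

# Site root that the scraped hrefs are relative to (assumed for illustration).
BASE_URL = "https://www.youtube.com"

def absolute_url(relative_href):
    """Join a site-relative href such as '/user/penguinz0' onto the base URL."""
    return urljoin(BASE_URL, relative_href)

full_url = absolute_url("/user/penguinz0")
print(full_url)
```

Plain string concatenation works just as well when every link starts with a slash; urljoin is simply a little more forgiving about how the pieces are written.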

We have found the first channel in the recommended list, but we really want all of them. So we use find_all.

In [35]:
channels = sidebar.find_all("h3", {"class": "yt-lockup-title"})
for channel in channels:
    print(channel.a.text)

penguinz0
VanossGaming
The Game Theorists
IGN


The next step is to "crawl" along the channels to find many interrelated channels. To do this, we store the links we get in a queue and work through them one at a time, much like a breadth-first search of the channels. First, we will write a function that returns the names and URLs of the related channels for a given channel.

In [42]:
def relatedChannels(channelURL):
    # channelURL is site-relative, e.g. '/user/PewDiePie'
    (_, channelWebpage) = retrieve_html("https://www.youtube.com" + channelURL)
    soup = BeautifulSoup(channelWebpage, 'html.parser')
    sidebar = soup.find('ul', {'class': 'branded-page-related-channels-list'})
    if sidebar is None: # this page has no related-channels sidebar
        return ([], [])
    relatedChannelsList = sidebar.find_all("h3", {"class": "yt-lockup-title"})

    channelNames = []
    channelURLs = []
    for c in relatedChannelsList:
        channelNames.append(c.a.text)
        channelURLs.append(c.a['href'])
    return (channelNames, channelURLs)

(relatedChannelNames, relatedChannelURLs) = relatedChannels('/user/PewDiePie') # we use the same convention YouTube does and assume the first part of the URL
print(relatedChannelNames, relatedChannelURLs)

['penguinz0', 'VanossGaming', 'The Game Theorists', 'IGN'] ['/user/penguinz0', '/user/VanossGaming', '/user/MatthewPatrick13', '/user/IGNentertainment']


Now let's try crawling. The YouTube network is far too large to crawl in full, so we will limit ourselves to 20 channels for now. You can change the number, but be warned that a larger value could make this part take a long time.

Note that we also have to keep track of visited channels to keep from visiting the same channel more than once. 

In [45]:
from collections import deque

seed = "/user/PewDiePie" # You can change this to start at another channel
numChannels = 20 # number of channels to scan

q = deque([seed])
visited = []
for i in range(numChannels):
    if not q: # nothing left to scan
        break
    # pop() takes from the right end of the deque, so the most recently added
    # channel is scanned next; q.popleft() would give strict breadth-first order
    channel = q.pop()
    visited.append(channel)
    relatedChannelNames, relatedChannelURLs = relatedChannels(channel)
    print(channel)
    for name in relatedChannelNames:
        print("    " + name)
    for url in relatedChannelURLs:
        if url not in visited:
            q.append(url)


/user/PewDiePie
    penguinz0
    VanossGaming
    The Game Theorists
    IGN
/user/IGNentertainment
    Prepare To Try
    GameTrailers
    IGN News
    Beyond!
    Nintendo Voice Chat
    Unlocked
    Game Scoop!
    FireteamChat
    IGN Anime Club
    IGN Walkthroughs
/user/IGNGameplay
    IGN
    IGN News
    Prepare To Try
    Beyond!
    Nintendo Voice Chat
    Unlocked
    FireteamChat
    Game Scoop!
    GameTrailers
/user/gametrailers
    IGN
    IGN News
    Prepare To Try
    Beyond!
    Unlocked
    Game Scoop!
    Nintendo Voice Chat
    FireteamChat
    IGN Walkthroughs
/channel/UCZfDHQE3mrZ7123Wl3iRhbg
    IGN
    IGN News
    Prepare To Try
    Beyond!
    Nintendo Voice Chat
    Unlocked
    Game Scoop!
    IGN Walkthroughs
    GameTrailers
/channel/UChDyKjO7PB_QuqTTFKKR9Iw
    IGN
    IGN News
    Beyond!
    Nintendo Voice Chat
    Unlocked
    IGN Anime Club
    FireteamChat
    IGN Walkthroughs
/channel/UCLQ22KZD2OE4tFjc7NfWWRg
    IGN
    IGN News
    Prepare To T

The above code prints each scanned channel, followed by an indented list of its "neighbors" in the graph (its recommended channels).
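One detail worth noting: the loop above takes channels from the right end of the deque with pop(), so it always scans the most recently added channel next. For comparison, here is a sketch of a strict breadth-first traversal that uses popleft() and a set of already-queued nodes. It runs on a small made-up graph (toy_graph) rather than on scraped data, and the function name bfs_order is hypothetical.

```python
from collections import deque

# Hypothetical toy graph for illustration only; this is not scraped data.
toy_graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "A"],
    "D": [],
}

def bfs_order(graph, seed, limit):
    """Return up to `limit` nodes in breadth-first order starting at `seed`."""
    queue = deque([seed])
    seen = {seed}  # nodes that have already been queued, so none is queued twice
    order = []
    while queue and len(order) < limit:
        node = queue.popleft()  # FIFO order: oldest queued node first
        order.append(node)
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order

print(bfs_order(toy_graph, "A", 10))
```

Marking a node as seen when it is queued, rather than when it is scanned, keeps the same channel from sitting in the queue several times.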

# Building the Graph Model

You may notice that, at least when using PewDiePie as the seed, the channel URLs stop being /user/username and start being /channel/gibberish. That's because YouTube has a confusing and partially deprecated way of keeping track of channels. Most channels have both a user name and a channel ID. The /user/ URLs use the user name and the /channel/ URLs use the channel ID. There are other differences between the two concepts, but for us that understanding is sufficient. You may also notice that the *screen name* displayed on the webpage is different from the *user name* that shows up in the URL. This can be confusing. For example, it's not immediately obvious that the IGN screen name corresponds to the IGNentertainment user name rather than to any of the other IGN-related channels.

In addition, while *user names* and *channel IDs* are unique, *screen names* are not; a particular screen name can be used by any number of channels. This can prove problematic when we are trying to build a large model of the data: how are we to track which channels are connected to which when we can't even keep the channels separate?

One way to resolve all of these problems is to use the URLs as the core identifiers and to build a pair of dictionaries that use them as keys. One of the dictionaries, *names*, will map each URL to its screen name. The other one, *neighbors*, will map each URL to a list of its neighbors' URLs.
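To make the shape of the two dictionaries concrete, here is a small hand-written sample built from values that appear in the output earlier in this tutorial. The names sample_names, sample_neighbors, and neighbor_labels are made up for this sketch; the real dictionaries are filled in by crawling.

```python
# Hand-written sample in the shape described above (URL -> screen name).
sample_names = {
    "/user/PewDiePie": "PewDiePie",
    "/user/IGNentertainment": "IGN",
    "/user/MatthewPatrick13": "The Game Theorists",
}

# URL -> list of neighbor URLs.
sample_neighbors = {
    "/user/PewDiePie": [
        "/user/penguinz0",
        "/user/VanossGaming",
        "/user/MatthewPatrick13",
        "/user/IGNentertainment",
    ],
}

def neighbor_labels(url, names, neighbors):
    """Return a label for each neighbor of `url`: its screen name if known,
    otherwise the raw URL."""
    return [names.get(n, n) for n in neighbors.get(url, [])]

labels = neighbor_labels("/user/PewDiePie", sample_names, sample_neighbors)
print(labels)
```

Because the URL is the key in both dictionaries, two channels that share a screen name never collide.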

OK, so how do we find out the screen names? We can locate them the same way we located the other information on the webpage: with "Inspect element." It turns out you can access the screen name this way:

In [58]:
def findScreenName(channel):
    (_, webpage) = retrieve_html("https://www.youtube.com" + channel)
    soup = BeautifulSoup(webpage, "html.parser")
    screenName = soup.find("meta", {"name": "title"})['content']
    return screenName

screenName = findScreenName("/channel/UC-lHJZR3Gqxm24_Vd_AJ5Yw")# channel ID for PewDiePie
print(screenName)

PewDiePie


So now we will construct the dictionaries. Note that this code runs rather slowly, so it prints its progress as it goes. The sample output shown below comes from a run with numChannels set to 100.

In [77]:
def crawlGraph(numChannels, seed):
    q = deque([seed])
    names = {}      # channel URL -> screen name
    neighbors = {}  # channel URL -> list of related channel URLs
    for i in range(numChannels):
        if i % 10 == 0:
            print(i,"/",numChannels,"channels scanned")
        if not q: # nothing left to scan
            break
        channel = q.pop()
        relatedChannelNames, relatedChannelURLs = relatedChannels(channel)

        names[channel] = findScreenName(channel)
        neighbors[channel] = relatedChannelURLs

        for url in relatedChannelURLs:
            if url not in names: # we can use names instead of visited now
                q.append(url)
    return names, neighbors

names, neighbors = crawlGraph(20, "/channel/UC-lHJZR3Gqxm24_Vd_AJ5Yw")
print(names)
print(neighbors)

0 / 100 channels scanned
10 / 100 channels scanned
20 / 100 channels scanned
30 / 100 channels scanned
40 / 100 channels scanned
50 / 100 channels scanned
60 / 100 channels scanned
70 / 100 channels scanned
80 / 100 channels scanned
90 / 100 channels scanned
{'/channel/UC-lHJZR3Gqxm24_Vd_AJ5Yw': 'PewDiePie', '/user/IGNentertainment': 'IGN', '/user/IGNGameplay': 'IGN Walkthroughs', '/user/gametrailers': 'GameTrailers', '/channel/UCZfDHQE3mrZ7123Wl3iRhbg': 'FireteamChat', '/channel/UChDyKjO7PB_QuqTTFKKR9Iw': 'Game Scoop!', '/channel/UCLQ22KZD2OE4tFjc7NfWWRg': 'IGN Anime Club', '/channel/UCTqWN7lps75nnS07twBfVhw': 'Unlocked', '/channel/UCG275E2gosKB-UPk-qCK8qw': 'Nintendo Voice Chat', '/channel/UCmajVTORW3vp5qM8r0myznA': 'Beyond!', '/channel/UCcoGampOoeZW-BWTL8SesyQ': 'Prepare To Try', '/user/IGNNews': 'IGN News', '/user/Charalanahzard': 'Alanah Pearce', '/user/MatthewPatrick13': 'The Game Theorists', '/user/ThatOneVideoGamer': 'The Completionist', '/user/satchbags': "Satchbag's Goods", '

# Storing Data

This is good, but this code is frustratingly slow. One way to make this less of a problem is to store the dictionaries on disk and load them back later, so that we don't have to re-run the crawl every time we need the data. The pickle module works well for this. Here is how we store this data in files using pickle:

In [88]:
import pickle

with open('names20.txt', 'wb') as f:
    pickle.dump(names, f)
    
with open('neighbors20.txt', 'wb') as f:
    pickle.dump(neighbors, f)

Then the information can be accessed like this:

In [None]:
with open('names20.txt', 'rb') as f:
    names20 = pickle.load(f)
    
with open('neighbors20.txt', 'rb') as f:
    neighbors20 = pickle.load(f)

I have already run code that scans 100 channels and stored the results in files. That information can be retrieved like this:

In [80]:
names100 = {}
neighbors100 = {}

with open('names100.txt', 'rb') as f:
    names100 = pickle.load(f)
    
with open('neighbors100.txt', 'rb') as f:
    neighbors100 = pickle.load(f)
    
print(names100)
print(neighbors100)

{'/channel/UC-lHJZR3Gqxm24_Vd_AJ5Yw': 'PewDiePie', '/user/IGNentertainment': 'IGN', '/user/IGNGameplay': 'IGN Walkthroughs', '/user/gametrailers': 'GameTrailers', '/channel/UCZfDHQE3mrZ7123Wl3iRhbg': 'FireteamChat', '/channel/UChDyKjO7PB_QuqTTFKKR9Iw': 'Game Scoop!', '/channel/UCLQ22KZD2OE4tFjc7NfWWRg': 'IGN Anime Club', '/channel/UCTqWN7lps75nnS07twBfVhw': 'Unlocked', '/channel/UCG275E2gosKB-UPk-qCK8qw': 'Nintendo Voice Chat', '/channel/UCmajVTORW3vp5qM8r0myznA': 'Beyond!', '/channel/UCcoGampOoeZW-BWTL8SesyQ': 'Prepare To Try', '/user/IGNNews': 'IGN News', '/user/Charalanahzard': 'Alanah Pearce', '/user/MatthewPatrick13': 'The Game Theorists', '/user/ThatOneVideoGamer': 'The Completionist', '/user/satchbags': "Satchbag's Goods", '/user/SunderGamer': 'Sunder', '/user/TheNationalDex': 'TheNationalDex', '/user/TamashiiHiroka': 'TamashiiHiroka', '/user/TheJWittz': 'TheJWittz', '/user/TheGameStation': 'The Game Station', '/user/OMFGcata': 'Jesse Cox', '/user/ChaoticMonki': 'Cryaotic', '/us

(The second parameter in the open calls describes what will be done with the file. 'r' means read, 'w' means write, and 'b' means binary, which pickle requires in Python 3 because it reads and writes bytes.)
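The same idea can be wrapped in a small helper that only runs the slow step when no saved file exists yet. This is a sketch under stated assumptions: load_or_compute and fake_crawl are hypothetical names, fake_crawl stands in for the real crawl, and a temporary directory is used so that the example leaves no files behind.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Load a pickled object from `path` if it exists; otherwise call
    `compute()`, save its result to `path`, and return it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []  # records how many times the slow step actually runs

def fake_crawl():
    # Stand-in for a slow crawl; returns a tiny hand-written dictionary.
    calls.append(1)
    return {"/user/PewDiePie": "PewDiePie"}

with tempfile.TemporaryDirectory() as tmpdir:
    cache_path = os.path.join(tmpdir, "names_sample.pkl")
    first = load_or_compute(cache_path, fake_crawl)   # runs fake_crawl and saves
    second = load_or_compute(cache_path, fake_crawl)  # loads from the file

print(first == second, len(calls))
```

Keep in mind that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.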

Now we have a full model of the related channels that can be manipulated and searched. Let's print the first ten channels in a way that's easier to read. Note that a channel can have neighbors that we have not actually scanned, since we cannot scan all the channels. In that case, we do not know the screen names of those neighbors. The following code simply leaves those channels out of the output.

In [87]:
for i, c in enumerate(neighbors100):
    if i >= 10: # only print the first ten channels
        break
    print(names100[c])
    for neighbor in neighbors100[c]:
        if neighbor in names100: # skip neighbors we never scanned, since we don't know their screen names
            print('    ', names100[neighbor])

PewDiePie
     penguinz0
     The Game Theorists
     IGN
IGN
     Prepare To Try
     GameTrailers
     IGN News
     Beyond!
     Nintendo Voice Chat
     Unlocked
     Game Scoop!
     FireteamChat
     IGN Anime Club
     IGN Walkthroughs
IGN Walkthroughs
     IGN
     IGN News
     Prepare To Try
     Beyond!
     Nintendo Voice Chat
     Unlocked
     FireteamChat
     Game Scoop!
     GameTrailers
GameTrailers
     IGN
     IGN News
     Prepare To Try
     Beyond!
     Unlocked
     Game Scoop!
     Nintendo Voice Chat
     FireteamChat
     IGN Walkthroughs
FireteamChat
     IGN
     IGN News
     Prepare To Try
     Beyond!
     Nintendo Voice Chat
     Unlocked
     Game Scoop!
     IGN Walkthroughs
     GameTrailers
Game Scoop!
     IGN
     IGN News
     Beyond!
     Nintendo Voice Chat
     Unlocked
     IGN Anime Club
     FireteamChat
     IGN Walkthroughs
IGN Anime Club
     IGN
     IGN News
     Prepare To Try
     Beyond!
     Nintendo Voice Chat
     Unlocked
     

And we're done.