1 - Select web page
2 - If it can be parsed with API => API and json
    Else : BeautifulSoup
3 - If 1 page: simple web scraping
    Else : take kickstarter() from advanced web scraping
4 - Expected output :
    CSV File + Pandas DataFrame with parameters we chose
    + Data cleaning, and an explanation of what's been done and of the goals.

I want to scrape a YouTube playlist of a documentary series I enjoy: How It's Made.

Season 29: https://www.youtube.com/playlist?list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L

Starting from the Season 29 playlist page, which links to every episode, I want to scrape each video in the playlist and extract:
    - Title
    - Description
    - Date it was added
    - Length
    - Nb of likes and dislikes
    - Nb of views
    - Top comment (with user name) - for fun mostly

In [63]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import datetime
from IPython.display import HTML


# Initialization
Fetching the playlist page, from which every video URL will be extracted. A User-Agent header is sent with the request, just in case the default one is rejected.

In [45]:
headers = {'User-Agent': 'Chrome/71.0.3578.98'}

playlist_url = "https://www.youtube.com/playlist?list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L"

response = requests.get(playlist_url, headers=headers).content
playlist_soup = BeautifulSoup(response, "html.parser")

# Getting the links for all the videos in the playlist, from the playlist page.

The function reads the "href" attribute of each title link and prepends "https://youtube.com" to build the full URL.
It returns a list containing the URL of every video.
The video thumbnail elements were in a TABLE that wasn't visible in the inspector.
I (with Eldiias's help) had to analyze the HTML soup to locate the table.
This is the first place where I use a word in the playlist TITLE, and where the code is specific to this playlist.

In [46]:

def getlinks(playlist_soup):
    links=["https://youtube.com" + element['href'] for element in playlist_soup.select('#pl-video-table tr a.pl-video-title-link')]
    return links
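
As a quick check of the selector, the sketch below runs the same CSS query against a tiny hand-written HTML fragment. The fragment and the helper name `extract_links` are made up for illustration; the real playlist markup is more complex and may have changed since this notebook was written.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the playlist table (an assumption, not real YouTube markup).
SAMPLE_HTML = """
<table id="pl-video-table">
  <tr><td><a class="pl-video-title-link" href="/watch?v=AAA">First</a></td></tr>
  <tr><td><a class="pl-video-title-link" href="/watch?v=BBB">Second</a></td></tr>
  <tr><td><a class="other-link" href="/ignored">Ignored</a></td></tr>
</table>
"""

def extract_links(soup):
    """Return absolute video URLs for every title link in the playlist table."""
    selector = "#pl-video-table tr a.pl-video-title-link"
    return ["https://youtube.com" + a["href"] for a in soup.select(selector)]

sample_links = extract_links(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(sample_links)
```

Only the two anchors with the `pl-video-title-link` class are kept; the third link is ignored because its class does not match.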
    

# Soup from each URL
Taking an individual link as an input, this function returns the soup from the video page.

In [47]:
def videosoup(url):
    
    response = requests.get(url, headers=headers).content
    video_soup = BeautifulSoup(response, "html.parser")
    
    return video_soup

# Video Title
This function returns the title of the video, in string type.

In [48]:
def gettingtitle(video_soup):
    title = video_soup.select("h1.watch-title-container")
    if len(title)>0 :
        return (title[0].text.strip())
    else :
        return "No title available"

# Nb of views
This function returns the number of views as an integer. The element actually contains the text "(nbofviews) vues", so the "vues" label is removed and only the digits are kept. This label is specific to French YouTube.

In [49]:

def nbviews(video_soup):

    views = video_soup.select("div#watch7-views-info")
    # Collapse the whitespace; the text looks like "<number> vues" ("vues" = views)
    views_text = " ".join(views[0].text.split())
    # Keep only the digits: this drops the "vues" label and any thousands separators
    nb_of_views_numerical = int("".join(nb for nb in views_text if nb.isdigit()))

    return nb_of_views_numerical
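
The digit-filtering idea does not depend on the word "vues", so it can be sketched as a small language-independent helper. The name `parse_count` and the sample labels are assumptions for illustration; the helper also assumes that the label contains exactly one number with no decimal part.

```python
import re

def parse_count(text):
    """Extract an integer from a label such as '360 772 vues' or '360,772 views'.

    Every digit in the text is kept, so the label must contain a single
    whole number; separators and words around it are discarded.
    """
    digits = re.sub(r"\D", "", text)  # remove everything that is not a digit
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)

print(parse_count("360 772 vues"))
print(parse_count("360,772 views"))
```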

# Nb of likes and dislikes
Likes and dislikes are both contained in "button" elements that carry an "aria-label" attribute.
They are always positioned in the same blocks and are easy to retrieve.
These functions return the likes and dislikes as integers.

In [50]:


def likes(video_soup):
    likes_dislikes = video_soup.select('div span.yt-uix-clickcard button[aria-label]')
    likes = int(''.join(likes_dislikes[1].text.split()))
    return(likes)

def dislikes(video_soup):
    
    likes_dislikes = video_soup.select('div span.yt-uix-clickcard button[aria-label]')
    dislikes = int(''.join(likes_dislikes[3].text.split()))
    return(dislikes)



# Duration
The length of the video is in the metadata, in seconds.
I'll extract it using regex, then convert it to a timedelta to make sorting easier in the DataFrame.


In [51]:
def duration(video_soup):
    regex = r"length_seconds\":\"(\d*)"
    duration = int(re.search(regex, str(video_soup)).group(1))
    duration_dt = datetime.timedelta(seconds=duration)

    return duration_dt
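
To make the conversion concrete, here is a minimal sketch run on a hypothetical metadata fragment (the sample string and the name `parse_duration` are assumptions, not the real page source). Unlike the function above, it returns None instead of raising when the field is missing.

```python
import datetime
import re

# Hypothetical fragment of the embedded page metadata.
SAMPLE_METADATA = '{"video_id":"abc","length_seconds":"294","title":"x"}'

def parse_duration(page_text):
    """Return the video length as a timedelta, or None if the field is missing."""
    match = re.search(r'length_seconds":"(\d+)', page_text)
    if match is None:
        return None
    return datetime.timedelta(seconds=int(match.group(1)))

length = parse_duration(SAMPLE_METADATA)
print(length)  # 0:04:54
```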

# Short Description
The short description gives basic info about the video.
In the case of our playlist, it gives the season, episode number, and subject.

This part contains some error handling: some of the videos in the playlist are BLOCKED in France.
When a video is blocked, the short description is empty, so the regex search returns None. Calling .group() on None raises an error, which would end the loop when building the dataframe.

To avoid this, a variable "p" is added: it is set to 2 if the description is empty and to 0 when the description exists. The function "shortdesc" returns the description (or None) and the value of p, which is used later in error handling.

In [52]:
# The short description is also in the metadata.

def shortdesc(video_soup):
    regex2 = r"shortDescription\\\":\\\"(.*?)\\"
    short_desc = re.search(regex2, str(video_soup))
    if short_desc is None:
        p=2
    else:
        short_desc=short_desc.group(1)
        p=0
    return short_desc,p
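
An alternative to the control variable, sketched here under assumptions, is to let None itself signal a blocked video. The two sample strings below are invented stand-ins for the page source, and `find_short_description` is a hypothetical name; the regex is the same as in `shortdesc`.

```python
import re

# Invented stand-ins for str(video_soup); the real pages are far larger.
SAMPLE_PAGE = r'{\"shortDescription\":\"Season 29 episode 1 Skateboard Wheels\",\"isCrawlable\":true}'
BLOCKED_PAGE = r'{\"title\":\"blocked\"}'

def find_short_description(page_text):
    """Return the short description, or None when the page has none."""
    # The metadata is JSON embedded in a string, so quotes appear as \" in the source.
    match = re.search(r'shortDescription\\":\\"(.*?)\\', page_text)
    return match.group(1) if match else None

kept = []
for page in (SAMPLE_PAGE, BLOCKED_PAGE):
    description = find_short_description(page)
    if description is None:  # blocked video: skip it
        continue
    kept.append(description)

print(kept)
```

With this variant the caller only tests `description is None`, so no second return value has to be carried around.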

# Publication Date

Again, the publication date is present in the metadata.
I converted it to datetime format for easier sorting in the dataframe.

In [53]:
def pubdate(video_soup):
    regexdate = r"([0-9]{4}-[0-9]{2}-[0-9]{2})\" itemprop=\"datePublished" 
    datepub=re.search(regexdate, str(video_soup)).group(1)
    datepub_object = datetime.datetime.strptime(datepub, '%Y-%m-%d')
    return datepub_object                  
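
The same extraction can be tried on a single hypothetical meta tag. The sample tag and the name `parse_publication_date` are assumptions; this sketch returns a plain date (there is no time component in the source) and None when nothing matches.

```python
import datetime
import re

# Hypothetical meta tag, written in the attribute order the regex expects.
SAMPLE_META = '<meta content="2018-04-12" itemprop="datePublished">'

def parse_publication_date(page_text):
    """Return the publication date as a datetime.date, or None if not found."""
    match = re.search(r'(\d{4}-\d{2}-\d{2})" itemprop="datePublished', page_text)
    if match is None:
        return None
    return datetime.datetime.strptime(match.group(1), "%Y-%m-%d").date()

published = parse_publication_date(SAMPLE_META)
print(published.isoformat())
```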

# Episode number :
Extracted from the short description. This function is run (or not) depending on the value of the control variable p, described earlier.

In [54]:
def episodenb(descriptiontxt):
    regexnb = r"episode (\d*)" 
    ep_nb=re.search(regexnb, descriptiontxt).group(1)
    return ep_nb

# Subject
Also extracted from the short description, with the same dependence on the control variable p.

In [55]:
def subject(descriptiontxt):
    regexsubj = r"episode \d* (.*)"
    subj=re.search(regexsubj, descriptiontxt).group(1)
    return subj
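
Since both values come from the same text, they could also be captured in one pass. This is only a sketch: the sample description is a guess at the format ("... episode <number> <subject>"), and `parse_description` is a hypothetical helper, not part of the notebook.

```python
import re

# One pattern with two groups: the episode number, then the rest of the line.
EPISODE_PATTERN = re.compile(r"episode (\d+) (.*)")

def parse_description(description):
    """Return (episode_number, subject), or None if the text does not match."""
    match = EPISODE_PATTERN.search(description)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()

# Assumed description format, for illustration only.
print(parse_description("How Its Made season 29 episode 1 Skateboard Wheels"))
```

Using `\d+` rather than `\d*` means a description with no episode number fails to match, instead of yielding an empty string.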

# Video thumbnails urls

In [56]:
def getimglinks(playlist_soup):
    longlinks=[element["data-thumb"] for element in playlist_soup.select('#pl-video-table tr span.yt-thumb-clip img[data-thumb]')]
    regeximg = r"https.*jpg"  #shortening the links
    imglinks = [re.findall(regeximg, imglink) for imglink in longlinks]
    imglinks = [imglink[0] for imglink in imglinks if len(imglink)>0] #cleaning the list from non-existent thumbnails
    return imglinks
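
The regex above shortens each link by cutting everything after ".jpg". A different technique, shown here as a sketch and not what the notebook does, is to drop the query string with the standard library's URL parser. The query parameters in the sample URL are invented, and `strip_query` is a hypothetical name.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Return the URL without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# The part after "?" is made up for the example.
long_link = "https://i.ytimg.com/vi/U64j80P-Vl0/hqdefault.jpg?sqp=abc&rs=def"
print(strip_query(long_link))
```

This version does not depend on the file extension, so it would also handle thumbnails that are not ".jpg" files.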

- Couldn't retrieve the top comment: it isn't in the actual HTML, too tricky.
- Still to do: find a way to extend the code beyond YouTube FRANCE (it relies on keywords like "vues").
- This code wouldn't work for any playlist, as it uses some words from the title of the playlist.
    Could probably improve this.
    

# Creating the empty dataframe with the column names.

In [57]:
yt_df = pd.DataFrame(columns=["TITLE", "EPISODE_NUMBER", "SUBJECT", "PUBLICATION DATE", "DURATION", "NB_OF_VIEWS","LIKES", "DISLIKES","THUMBNAIL","URL" ])

# The loop
Here, every function is applied to each link in the link list.

The values returned by the functions are gathered in a list, which is added to the dataframe as a new row at the end of each iteration.

If the video is blocked, the error handling takes over and jumps to the next link with "continue", ending that iteration early.

A counter is used to create the index, updated after adding the row to the dataframe.


In [58]:
links = getlinks(playlist_soup)
img_links = getimglinks(playlist_soup)  # parse the thumbnail list once, not on every iteration
counter = 0  # index of the next row in the dataframe
for img_counter, url in enumerate(links):
    video_soup = videosoup(url)
    title = gettingtitle(video_soup)
    short_desc, p = shortdesc(video_soup)
    if p > 1:  # blocked video: no description, jump to the next link
        continue
    thumbnail_url = img_links[img_counter]
    ep_nb = episodenb(short_desc)
    subj = subject(short_desc)
    pub_date = pubdate(video_soup)
    duration_obj = duration(video_soup)
    nb_of_views = nbviews(video_soup)
    likes_nb = likes(video_soup)
    dislikes_nb = dislikes(video_soup)
    link_info_list = [title, ep_nb, subj, pub_date, duration_obj, nb_of_views, likes_nb, dislikes_nb, thumbnail_url, url]

    yt_df.loc[counter] = link_info_list
    counter += 1
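
Another way to build the table, sketched below, is to collect the rows in a plain list and create the DataFrame once at the end, then write the CSV file mentioned in the plan. The three columns and the `scraped` list are a reduced stand-in for the real loop (values taken from the first two rows of the output), with None marking a blocked video; the names are assumptions for illustration.

```python
import io

import pandas as pd

COLUMNS = ["TITLE", "EPISODE_NUMBER", "NB_OF_VIEWS"]

# Stand-in for the per-video scraping results; None marks a blocked video.
scraped = [
    ("How Its Made - 1405 Skateboard Wheels", 1, 360772),
    None,
    ("How Its Made - 1408 Honeycomb Candles", 1, 409008),
]

# Skip blocked videos and turn each remaining record into a column -> value dict.
rows = [dict(zip(COLUMNS, record)) for record in scraped if record is not None]
demo_df = pd.DataFrame(rows, columns=COLUMNS)

# Written to an in-memory buffer here; pass a file path instead to save to disk.
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Building the DataFrame in one go avoids growing it row by row, and the index is created automatically, so no counter is needed.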

In [59]:
yt_df.head()

Unnamed: 0,TITLE,EPISODE_NUMBER,SUBJECT,PUBLICATION DATE,DURATION,NB_OF_VIEWS,LIKES,DISLIKES,THUMBNAIL,URL
0,How Its Made - 1405 Skateboard Wheels,1,Skateboard Wheels,2018-04-12,00:04:54,360772,3351,76,https://i.ytimg.com/vi/U64j80P-Vl0/hqdefault.jpg,https://youtube.com/watch?v=U64j80P-Vl0&list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L&index=2&t=0s
1,How Its Made - 1408 Honeycomb Candles,1,Honeycomb Candles,2018-05-03,00:05:01,409008,3481,105,https://i.ytimg.com/vi/P8sHCOyE3Us/hqdefault.jpg,https://youtube.com/watch?v=P8sHCOyE3Us&list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L&index=5&t=0s
2,How Its Made - 1410 Drum Crushers,2,Drum Crushers,2018-05-17,00:04:53,317430,1530,99,https://i.ytimg.com/vi/rA230GPcdR0/hqdefault.jpg,https://youtube.com/watch?v=rA230GPcdR0&list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L&index=7&t=0s
3,How Its Made - 1411 Kimchi,2,Kimchi,2018-05-24,00:05:06,313592,1984,254,https://i.ytimg.com/vi/c2nKIjLIiNY/hqdefault.jpg,https://youtube.com/watch?v=c2nKIjLIiNY&list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L&index=8&t=0s
4,How Its Made - 1412 Parquet Floors,2,Parquet Floors,2018-05-31,00:04:49,193091,780,67,https://i.ytimg.com/vi/RuFeYW-KWC0/hqdefault.jpg,https://youtube.com/watch?v=RuFeYW-KWC0&list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L&index=9&t=0s


# Visualizing the Dataframe with thumbnails for each video 

In [None]:
# First, I'm going to reorder the columns, to get the thumbnail next to the subject.
# I'll also create a new dataframe for this part.

In [60]:
yt_viz = yt_df[["TITLE", "EPISODE_NUMBER", "SUBJECT","THUMBNAIL", "PUBLICATION DATE", "DURATION", "NB_OF_VIEWS","LIKES", "DISLIKES","URL" ]].copy()

In [61]:
yt_viz.head()

Unnamed: 0,TITLE,EPISODE_NUMBER,SUBJECT,THUMBNAIL,PUBLICATION DATE,DURATION,NB_OF_VIEWS,LIKES,DISLIKES,URL
0,How Its Made - 1405 Skateboard Wheels,1,Skateboard Wheels,https://i.ytimg.com/vi/U64j80P-Vl0/hqdefault.jpg,2018-04-12,00:04:54,360772,3351,76,https://youtube.com/watch?v=U64j80P-Vl0&list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L&index=2&t=0s
1,How Its Made - 1408 Honeycomb Candles,1,Honeycomb Candles,https://i.ytimg.com/vi/P8sHCOyE3Us/hqdefault.jpg,2018-05-03,00:05:01,409008,3481,105,https://youtube.com/watch?v=P8sHCOyE3Us&list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L&index=5&t=0s
2,How Its Made - 1410 Drum Crushers,2,Drum Crushers,https://i.ytimg.com/vi/rA230GPcdR0/hqdefault.jpg,2018-05-17,00:04:53,317430,1530,99,https://youtube.com/watch?v=rA230GPcdR0&list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L&index=7&t=0s
3,How Its Made - 1411 Kimchi,2,Kimchi,https://i.ytimg.com/vi/c2nKIjLIiNY/hqdefault.jpg,2018-05-24,00:05:06,313592,1984,254,https://youtube.com/watch?v=c2nKIjLIiNY&list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L&index=8&t=0s
4,How Its Made - 1412 Parquet Floors,2,Parquet Floors,https://i.ytimg.com/vi/RuFeYW-KWC0/hqdefault.jpg,2018-05-31,00:04:49,193091,780,67,https://youtube.com/watch?v=RuFeYW-KWC0&list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L&index=9&t=0s


In [62]:
# Now, in order to use the HTML module and display the images, 
# I must reformat the thumbnail urls.
# I will use a simple function to add an img src tag and specify a display width.
def path_to_image_html(path):
    return '<img src="'+ path + '" width="100" >' 
yt_viz['THUMBNAIL'] =yt_viz['THUMBNAIL'].apply(path_to_image_html)
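
Because the cell is rendered with escaping turned off, a slightly more defensive variant could escape the URL before placing it in the attribute. This is a sketch, not the notebook's function: `image_tag` is a hypothetical name, and it relies only on the standard library's `html.escape`.

```python
import html

def image_tag(url, width=100):
    """Build an <img> tag, escaping the URL so quotes or '&' cannot break the attribute."""
    return f'<img src="{html.escape(url, quote=True)}" width="{int(width)}">'

print(image_tag("https://i.ytimg.com/vi/U64j80P-Vl0/hqdefault.jpg"))
```

For the thumbnail URLs in this playlist the output is the same as before; the escaping only matters if a URL ever contains characters such as `"` or `&`.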

In [64]:
# Now, we can visualize the DF.
pd.set_option('display.max_colwidth', None)  # no truncation; the original -1 is deprecated since pandas 1.0
HTML(yt_viz.to_html(escape=False))  # THUMBNAIL already holds <img> tags, so no formatter is needed

Unnamed: 0,TITLE,EPISODE_NUMBER,SUBJECT,THUMBNAIL,PUBLICATION DATE,DURATION,NB_OF_VIEWS,LIKES,DISLIKES,URL
0,How Its Made - 1405 Skateboard Wheels,1,Skateboard Wheels,,2018-04-12,00:04:54,360772,3351,76,https://youtube.com/watch?v=U64j80P-Vl0&list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L&index=2&t=0s
1,How Its Made - 1408 Honeycomb Candles,1,Honeycomb Candles,,2018-05-03,00:05:01,409008,3481,105,https://youtube.com/watch?v=P8sHCOyE3Us&list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L&index=5&t=0s
2,How Its Made - 1410 Drum Crushers,2,Drum Crushers,,2018-05-17,00:04:53,317430,1530,99,https://youtube.com/watch?v=rA230GPcdR0&list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L&index=7&t=0s
3,How Its Made - 1411 Kimchi,2,Kimchi,,2018-05-24,00:05:06,313592,1984,254,https://youtube.com/watch?v=c2nKIjLIiNY&list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L&index=8&t=0s
4,How Its Made - 1412 Parquet Floors,2,Parquet Floors,,2018-05-31,00:04:49,193091,780,67,https://youtube.com/watch?v=RuFeYW-KWC0&list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L&index=9&t=0s
5,How Its Made - 1414 Steel Bicycles,3,Steel Bicycles,,2018-06-14,00:04:56,123009,572,26,https://youtube.com/watch?v=2Q9jfRyevxA&list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L&index=11&t=0s
6,How Its Made - 1415 Raw Pet Food,3,Raw Pet Food,,2018-06-21,00:04:53,126396,677,55,https://youtube.com/watch?v=0eSf2WwKr30&list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L&index=12&t=0s
7,How Its Made - 1416 Replica Police Lanterns,3,Replica Police Lanterns,,2018-06-28,00:05:10,71901,380,15,https://youtube.com/watch?v=w9PtCinbKno&list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L&index=13&t=0s
8,How Its Made - 1417 Thermoplastic Fire Helmets,4,Thermoplastic Fire Helmets,,2018-07-05,00:04:52,111847,535,35,https://youtube.com/watch?v=gXybQW_Tkck&list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L&index=14&t=0s
9,How Its Made - 1418 Basketry Sculptures,4,Basketry Sculptures,,2018-07-12,00:05:03,65085,263,64,https://youtube.com/watch?v=a9MjqxEM0iU&list=PLQvvxnU2-ItePyupBYGFLlKJ5SZap2y8L&index=15&t=0s
