## Basic Data Using Scrapetube:

### Get basic data:

For testing, use a low number of max_entries. To remove the limit, use `max_entries = None`.

In [1]:
# !pip install scrapetube 

In [2]:
# Imports
from datetime import datetime
import pandas as pd
import numpy as np
import scrapetube

# Parameters
max_entries = 50

GMM_url = "https://www.youtube.com/@GoodMythicalMorning"
video_iterator = scrapetube.get_channel(channel_url=GMM_url, limit=max_entries, sort_by="newest")

# New dictionary class with multidimensional get
class custom_dict(dict):
    def multidim_get(self, *keys):
        """
        Allows .get()-style lookups on nested dictionaries and lists.
        """
        value = self
        for key in keys:
            try:
                value = value[key]
            except (KeyError, IndexError, TypeError):
                # Missing key, out-of-range list index, or non-subscriptable value
                return None
        return value

# Features to extract from Scrapetube video object
"""
#ID
    - name
    - length
    - views
    - published date
    - thumbnail
        - still
        - video
    - scrape datetime
"""

# Function to extract features
def get_basic_video_details(video):
    video = custom_dict(video)
    return {
        "id": video.multidim_get("videoId"),
        "name": video.multidim_get("title","runs",0,"text"),
        "duration": video.multidim_get("lengthText","simpleText"),
        "views": video.multidim_get("viewCountText","simpleText"),
        "published": video.multidim_get("publishedTimeText","simpleText"),
        "thumbnail": {
            "still": video.multidim_get("thumbnail","thumbnails",-1,"url"),
            "video": video.multidim_get("richThumbnail","movingThumbnailRenderer","movingThumbnailDetails","thumbnails",0,"url")
        },
        "scraped": datetime.now()
    }

# Build a dataframe of episodes using our Scrapetube iterator
df = pd.DataFrame([
    get_basic_video_details(video)
    for video
    in video_iterator
])
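A minimal sketch of the nested lookup idea used above, written as a plain function so it can be tried without scrapetube. The `nested_get` name and the `sample_video` record are hypothetical and only mimic the shape of the fields accessed in `get_basic_video_details`. This variant treats a missing key, an out-of-range list index, or a non-subscriptable value as missing.

```python
def nested_get(data, *keys):
    """Walk nested dicts/lists; return None if any step is missing."""
    value = data
    for key in keys:
        try:
            value = value[key]
        except (KeyError, IndexError, TypeError):
            return None
    return value

# Hypothetical, trimmed-down record (not real scrapetube output)
sample_video = {
    "videoId": "abc123",
    "title": {"runs": [{"text": "Example title"}]},
    "thumbnail": {"thumbnails": [{"url": "small.jpg"}, {"url": "large.jpg"}]},
}

title = nested_get(sample_video, "title", "runs", 0, "text")
# Index -1 picks the last entry in the thumbnails list
last_thumbnail = nested_get(sample_video, "thumbnail", "thumbnails", -1, "url")
# A key that is absent yields None instead of raising
missing = nested_get(sample_video, "richThumbnail", "movingThumbnailRenderer")

print(title, last_thumbnail, missing)
```

Returning `None` for absent fields keeps the dataframe construction from failing when a video lacks, for example, a moving thumbnail.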

### Clean duration values:
Standardize the format to "HH:MM:SS", then convert to an integer number of seconds.

In [3]:
def _leading_timecode(string, timecode_format="00:00:00"):
    """
    Standardize timecode with leading zeros and delimiters
    """
    # Prepend only the part of the format that the input string is missing
    return timecode_format[:len(timecode_format) - len(string)] + string

df["duration"] = np.dot(
    df["duration"].apply(_leading_timecode).str.split(":", expand=True).astype(int), # Get hours, minutes, and seconds
    [3600, 60, 1] # Multiply by 3x1 matrix to convert to total seconds
    )
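The conversion above can be checked by hand on a single value. The sketch below uses only the standard library; the helper names `pad_timecode` and `timecode_to_seconds` are illustrative and not part of the notebook. It applies the same two steps: left-pad the timecode to "HH:MM:SS", then weight hours, minutes, and seconds by 3600, 60, and 1.

```python
def pad_timecode(timecode, timecode_format="00:00:00"):
    """Left-pad a timecode such as "22:02" to "00:22:02"."""
    return timecode_format[: len(timecode_format) - len(timecode)] + timecode

def timecode_to_seconds(timecode):
    """Convert a (possibly short) timecode to total seconds."""
    hours, minutes, seconds = (int(part) for part in pad_timecode(timecode).split(":"))
    return hours * 3600 + minutes * 60 + seconds

print(pad_timecode("22:02"))          # 00:22:02
print(timecode_to_seconds("22:02"))   # 1322
print(timecode_to_seconds("1:02:03")) # 3723
```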

### Clean view counts:
Convert from format "##,###,### views" to integer value (views)

In [4]:
df["views"] = df["views"].str.replace(r"\D", "", regex=True).astype(int, errors="ignore") # Remove all non-digit characters, then convert to int
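For a single value, the same cleaning step can be sketched with the standard library `re` module. The `parse_view_count` helper is hypothetical; returning `None` when no digits are present is an assumption about how such entries should be handled, not something the notebook specifies.

```python
import re

def parse_view_count(text):
    """Strip every non-digit character and convert the rest to int."""
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None

print(parse_view_count("241,465 views"))  # 241465
print(parse_view_count("No views"))       # None
```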

### Check output:
This is as far as we can get with Scrapetube. However, we can still get more information about these videos using the YouTube Data API!

In [5]:
df.head()

Unnamed: 0,id,name,duration,views,published,thumbnail,scraped
0,fsJU9mOcvhQ,Our Most Unhinged Moments This Year,1322,241465,11 hours ago,{'still': 'https://i.ytimg.com/vi/fsJU9mOcvhQ/...,2022-12-21 16:43:02.313851
1,XiORNYGT-6s,Our Best Food Creations This Year,1331,871064,2 days ago,{'still': 'https://i.ytimg.com/vi/XiORNYGT-6s/...,2022-12-21 16:43:02.313869
2,B6dXVr0r0Ws,We Tried EVERY Goldfish Flavor,1194,1443804,5 days ago,{'still': 'https://i.ytimg.com/vi/B6dXVr0r0Ws/...,2022-12-21 16:43:02.313878
3,RPp5CXZVhlc,We Hug For 20 Minutes Straight... For Science,1394,537756,6 days ago,{'still': 'https://i.ytimg.com/vi/RPp5CXZVhlc/...,2022-12-21 16:43:02.313884
4,JrZP8aAZE9M,Lab Grown Dairy Taste Test,1140,904243,7 days ago,{'still': 'https://i.ytimg.com/vi/JrZP8aAZE9M/...,2022-12-21 16:43:02.313891


### Export to CSV:

In [6]:
output_path = "data/gmm-episodes_basic.csv"
df.to_csv(output_path)

---

## More data with YouTube Data API:

### Create a Google Cloud project:

YouTube Data API calls are limited by a daily quota, so we need to run our calls through a Google Cloud project in order to keep track of our quota usage.

**Steps:**
1. Create a Google Cloud project [here](https://console.cloud.google.com/).
2. Enable the [YouTube Data API](https://developers.google.com/youtube/v3) for your project.
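To get a feel for how quickly the quota is consumed, the sketch below estimates the cost of a run. The figures are assumptions taken from the public quota documentation at the time of writing (a default allocation of 10,000 units per day, 100 units per `search.list` call, 1 unit per `videos.list` call) and should be checked against the current documentation. The function name is illustrative.

```python
# Assumed values; verify against the current YouTube Data API quota documentation
DAILY_QUOTA = 10_000
SEARCH_LIST_COST = 100
VIDEOS_LIST_COST = 1

def estimate_quota(search_pages, video_lookups):
    """Estimate quota units for a number of search pages and per-video lookups."""
    return search_pages * SEARCH_LIST_COST + video_lookups * VIDEOS_LIST_COST

# One search page of 50 results, then one lookup per video
cost = estimate_quota(search_pages=1, video_lookups=50)
print(cost, cost <= DAILY_QUOTA)
```

Under these assumptions the search pages dominate the cost, which is one reason to keep the page limit low while testing.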

### Create and import an API Key:

To make API calls, we need an API key that links this notebook to our Google Cloud project.

**Steps:**
1. Create an API Key [here](https://console.cloud.google.com/apis/credentials) for your project.
2. Create a file to store this API Key using the `credentials-template.json` template. Name the new file `credentials.json`.

In [7]:
import json

credentials_file = "credentials.json"

with open(credentials_file, "r") as fh:
    credentials = json.load(fh)

assert "API_Key" in credentials
assert isinstance(credentials["API_Key"], str)

print("Successfully loaded API key from credentials file.")

Successfully loaded API key from credentials file.
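Since `assert` statements are skipped when Python runs with the `-O` flag, an explicit check may be preferable outside a notebook. The sketch below is one possible alternative; the `load_api_key` helper is hypothetical, and the file written here is a throwaway demo with a placeholder value, assuming only that the credentials file stores the key under `"API_Key"` as in the cell above.

```python
import json
import os
import tempfile

def load_api_key(path, key_name="API_Key"):
    """Read a JSON credentials file and return the API key stored under key_name."""
    with open(path, "r") as fh:
        credentials = json.load(fh)
    api_key = credentials.get(key_name)
    if not isinstance(api_key, str) or not api_key:
        raise ValueError(f"{path} must contain a non-empty string under '{key_name}'")
    return api_key

# Demonstration with a temporary file and a placeholder key
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "credentials.json")
    with open(demo_path, "w") as fh:
        json.dump({"API_Key": "placeholder-not-a-real-key"}, fh)
    demo_key = load_api_key(demo_path)

print(demo_key)
```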


### Get all videos from channel:

First, we can use the search API to build a list of all videos on the Good Mythical Morning channel. In this query we can also retrieve basic information like title, description, and publish date.

We are limited to retrieving 50 videos at a time, so we need to break this process up into several queries. To do this, we will retrieve one page at a time and then combine our results. 

For testing, use a low number of pages. To remove this limit, set `max_pages` to an arbitrarily large value, such as `float("inf")`.
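The paging logic can be sketched independently of the API. In the example below, `fake_get_page` is a stand-in that returns hard-coded pages and tokens, and `collect_pages` is an illustrative name; neither is part of the notebook. It shows how a page limit of 1 stops after the first request, while an infinite limit follows tokens until none is returned.

```python
def fake_get_page(page_token=None):
    """Stand-in for one API call: returns (items, next_page_token)."""
    pages = {
        None: (["a", "b"], "token-2"),
        "token-2": (["c", "d"], "token-3"),
        "token-3": (["e"], None),
    }
    return pages[page_token]

def collect_pages(get_page, max_pages=float("inf")):
    """Follow next-page tokens until they run out or max_pages is reached."""
    items, next_token = get_page()
    pages_fetched = 1
    while next_token and pages_fetched < max_pages:
        page_items, next_token = get_page(next_token)
        items += page_items
        pages_fetched += 1
    return items

print(collect_pages(fake_get_page, max_pages=1))  # ['a', 'b']
print(collect_pages(fake_get_page))               # ['a', 'b', 'c', 'd', 'e']
```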

In [8]:
import requests

max_pages = 1
results_per_page = 50

def dict_merge(base, *args):
    """
    Helper function for merging n dictionaries into `base` (modified in place; requires Python 3.9+).
    """
    for dictionary in args:
        base |= dictionary
    return base

def get_channel_videos_page(channel_id, page_token=None, credential=credentials["API_Key"], max_page_results=50, order="date", parts=["id"]):
    """
    Function to retrieve one page of videos.

    Parameters:
    channel_id - the ID of the channel to get videos from
    page_token - the ID of the page we are looking for. if no page specified this should be 'None'
    credential - the API Key used for the query
    max_page_results - the number of results to return in each page. this should be limited to 50
    order - how to sort the videos we are returning
    parts - the pieces of information to retrieve in our query
    """

    params = {
        "key": credential,
        "channelId": channel_id,
        "maxResults": max_page_results,
        "order": order,
        "part": ",".join(parts),
    }
    if page_token:
        params["pageToken"] = page_token

    # Let requests handle URL encoding, and fail loudly on HTTP errors (e.g. exceeded quota)
    response = requests.get("https://www.googleapis.com/youtube/v3/search", params=params)
    response.raise_for_status()
    content = response.json()

    next_page = content.get("nextPageToken")
    videos = content.get("items", [])

    videos_data = [
        dict_merge(*(video[part] for part in parts))
        for video
        in videos
        if video["id"]["kind"] == "youtube#video"
    ]

    return videos_data, next_page


def get_channel_videos(channel_id, max_depth=10, **kwargs):
    """
    Function to retrieve all videos from a channel, where count is greater than can be retrieved in a single page.

    Parameters:
    channel_id - the ID of the channel to get videos from
    max_depth - the maximum number of pages to query
    **kwargs - arguments to pass to get_channel_videos_page
    """

    all_videos = []
    
    page_videos, next_page = get_channel_videos_page(channel_id, **kwargs)
    all_videos += page_videos

    depth = 1
    while next_page and depth < max_depth:
        page_videos, next_page = get_channel_videos_page(channel_id, page_token=next_page, **kwargs)
        all_videos += page_videos
        depth += 1

    return all_videos


GMM_channel_id = "UC4PooiX37Pld1T8J5SYT-SQ"
videos = get_channel_videos(GMM_channel_id, max_depth=max_pages, max_page_results=results_per_page)
videos[0].keys()

dict_keys(['kind', 'videoId'])

### Get more data from the YouTube Data API:

In [9]:
def get_video_data(video_id, parts=["snippet", "statistics", "contentDetails"], credential=credentials["API_Key"]):
    """
    Using the YouTube Data API, get information for a video by its video ID.
    
    Parameters:
    video_id - the ID of the video to find
    parts - the parts of data to return in API call. documentation here https://developers.google.com/youtube/v3/getting-started#partial
    credential - the API Key to use for the call
    """
    
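    response = requests.get(
        "https://www.googleapis.com/youtube/v3/videos",
        params={"key": credential, "part": ",".join(parts), "id": video_id},
    )
    response.raise_for_status()  # Surface HTTP errors such as an exceeded quota
    content = response.json()

    if not content.get("items"):
        raise ValueError(f"No video found for ID {video_id!r}")
    video = content["items"][0]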

    return dict_merge({"id": video_id}, {"scraped": datetime.now()}, *(video[part] for part in parts))

for entry in videos:
    entry |= get_video_data(entry["videoId"])

df = pd.DataFrame(videos)
df = df.drop(columns=["kind", "videoId", "channelId", "channelTitle"]) # Remove redundant columns 
df.head()

Unnamed: 0,id,scraped,publishedAt,title,description,thumbnails,tags,categoryId,liveBroadcastContent,defaultLanguage,...,likeCount,favoriteCount,commentCount,duration,dimension,definition,caption,licensedContent,contentRating,projection
0,fsJU9mOcvhQ,2022-12-21 16:43:05.050517,2022-12-21T11:00:33Z,Our Most Unhinged Moments This Year,"Today, we're looking back at our most unhinged...",{'default': {'url': 'https://i.ytimg.com/vi/fs...,"[gmm, good mythical morning, rhettandlink, rhe...",24,none,en,...,12960,0,532,PT22M2S,2d,hd,True,True,{},rectangular
1,86b9eToglMY,2022-12-21 16:43:05.358828,2022-12-20T17:00:08Z,Link's Too Sensitive To Eat Ice Cream,They're SENSITIVE! #shorts\n\nRemember this #G...,{'default': {'url': 'https://i.ytimg.com/vi/86...,"[gmm, good mythical morning, rhettandlink, rhe...",24,none,en,...,7013,0,116,PT37S,2d,hd,True,True,{},rectangular
2,XiORNYGT-6s,2022-12-21 16:43:05.658942,2022-12-19T11:00:15Z,Our Best Food Creations This Year,"Today, we're looking back at our favorite food...",{'default': {'url': 'https://i.ytimg.com/vi/Xi...,"[gmm, good mythical morning, rhettandlink, rhe...",24,none,en,...,28752,0,1041,PT22M11S,2d,hd,True,True,{},rectangular
3,GDWWseS9phk,2022-12-21 16:43:05.947199,2022-12-17T11:00:13Z,Bologna Is 500 Years Old #ad UberOne,"This is an ad for Uber One, the membership for...",{'default': {'url': 'https://i.ytimg.com/vi/GD...,"[gmm, good mythical morning, rhettandlink, rhe...",24,none,en,...,3311,0,40,PT59S,2d,hd,True,True,{},rectangular
4,B6dXVr0r0Ws,2022-12-21 16:43:06.223003,2022-12-16T11:00:06Z,We Tried EVERY Goldfish Flavor,"Today, we're eating way too many Goldfish! GMM...",{'default': {'url': 'https://i.ytimg.com/vi/B6...,"[gmm, good mythical morning, rhettandlink, rhe...",24,none,en,...,48227,0,1777,PT19M54S,2d,hd,True,True,{},rectangular
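Unlike the Scrapetube data, the `duration` column here holds ISO 8601 duration strings such as `PT22M2S`. A minimal sketch for converting them to seconds follows, so the values are comparable with the cleaned durations from the first section. The `iso_duration_to_seconds` helper is hypothetical and handles only the days, hours, minutes, and seconds components; other ISO 8601 forms (weeks, months, fractional values) are deliberately rejected.

```python
import re

# Matches durations like PT37S, PT22M2S, PT1H2M3S, or P1DT2H
_DURATION_PATTERN = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)

def iso_duration_to_seconds(duration):
    """Convert a simple ISO 8601 duration string to total seconds."""
    match = _DURATION_PATTERN.match(duration)
    if match is None:
        raise ValueError(f"Unsupported duration: {duration!r}")
    # Components that are absent from the string count as zero
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )

print(iso_duration_to_seconds("PT22M2S"))  # 1322
print(iso_duration_to_seconds("PT37S"))    # 37
```

The first value matches the 1322 seconds obtained for the same video (`fsJU9mOcvhQ`) in the Scrapetube section.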


### Export to CSV:

In [10]:
output_path = "data/gmm-episodes_full.csv"
df.to_csv(output_path)