### Read credentials from a JSON file
Store your credentials in a file named `credentials.json` in the project root.

For now, all we need is an API key for a Google Cloud project that has the YouTube Data API enabled. You can create one here: https://console.cloud.google.com/apis/credentials (don't forget to enable the YouTube Data API for the project).

In [1]:
import json

with open("credentials.json", "r") as fh:
    credentials = json.load(fh)

# Check that the file contains an API key and that the key is a string
assert "API_Key" in credentials
assert isinstance(credentials["API_Key"], str)

print("Successfully loaded API key from credentials file.")

Successfully loaded API key from credentials file.
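
The bare `assert` checks above can also be wrapped in a small helper that reports what is wrong with the file. The following is a minimal sketch, assuming the file holds a single JSON object with an `API_Key` string. The helper name `load_api_key` and its error messages are hypothetical and not part of the original notebook.

```python
import json
from pathlib import Path


def load_api_key(path="credentials.json"):
    """Return the API key stored under "API_Key" in a JSON credentials file.

    Hypothetical helper: the name and error messages are illustrative only.
    """
    with Path(path).open("r", encoding="utf-8") as fh:
        credentials = json.load(fh)

    # The file must hold a JSON object that has an "API_Key" entry
    if not isinstance(credentials, dict) or "API_Key" not in credentials:
        raise KeyError(f'{path} must contain a JSON object with an "API_Key" entry')

    # The key itself must be a non-empty string
    api_key = credentials["API_Key"]
    if not isinstance(api_key, str) or not api_key:
        raise ValueError(f'"API_Key" in {path} must be a non-empty string')
    return api_key
```

Calling `load_api_key()` with no arguments reads `credentials.json` from the current working directory, which matches the layout described above when the notebook runs from the project root.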


### Get episodes using Scrapetube:

For testing, use a low value for `max_entries`. To remove the limit, set `max_entries = None`.

In [2]:
# !pip install scrapetube 

In [3]:
# Imports
from datetime import datetime
import pandas as pd
import numpy as np
import scrapetube

# Parameters
max_entries = 50

GMM_url = "https://www.youtube.com/@GoodMythicalMorning"
video_iterator = scrapetube.get_channel(channel_url=GMM_url, limit=max_entries, sort_by="newest")

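# Dictionary subclass with a getter that follows a path of nested keys/indexes
class custom_dict(dict):
    def multidim_get(self, keys):
        value = self
        for key in keys:
            try:
                value = value[key]
            # Missing key, out-of-range list index, or non-subscriptable value
            except (KeyError, IndexError, TypeError):
                return None
        return value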

# Features to extract from each Scrapetube video object:
#   - ID
#   - name
#   - length
#   - views
#   - published date
#   - thumbnail
#       - still
#       - video
#   - scrape datetime

# Function to extract features
def get_video_details(video):
    video = custom_dict(video)
    return {
        "id": video.multidim_get(keys=["videoId"]),
        "name": video.multidim_get(keys=["title","runs",0,"text"]),
        "duration": video.multidim_get(keys=["lengthText","simpleText"]),
        "views": video.multidim_get(keys=["viewCountText","simpleText"]),
        "published": video.multidim_get(keys=["publishedTimeText","simpleText"]),
        "thumb": {
            "still": video.multidim_get(keys=["thumbnail","thumbnails",-1,"url"]),
            "video": video.multidim_get(keys=["richThumbnail","movingThumbnailRenderer","movingThumbnailDetails","thumbnails",0,"url"])
        },
        "scraped": datetime.now()
    }

# Build a dataframe of episodes using our Scrapetube iterator
df = pd.DataFrame([
    get_video_details(video)
    for video
    in video_iterator
])
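
To illustrate how a key-path lookup of this kind behaves, here is a minimal standalone sketch. The function `nested_get` and the `sample_video` dictionary are hypothetical stand-ins: the sample is hand-written, mirrors only a few of the keys used above, and is not real Scrapetube output. This version treats a missing key, an out-of-range list index, or a non-subscriptable value as a missing result.

```python
def nested_get(data, keys, default=None):
    """Follow a path of dict keys and list indexes; return default if any step fails."""
    value = data
    for key in keys:
        try:
            value = value[key]
        except (KeyError, IndexError, TypeError):
            return default
    return value


# Hypothetical, hand-written sample that mirrors only the keys used above
sample_video = {
    "videoId": "abc123",
    "title": {"runs": [{"text": "Example episode"}]},
    "thumbnail": {"thumbnails": []},  # empty list: index -1 does not exist
}

print(nested_get(sample_video, ["title", "runs", 0, "text"]))            # Example episode
print(nested_get(sample_video, ["thumbnail", "thumbnails", -1, "url"]))  # None
print(nested_get(sample_video, ["lengthText", "simpleText"]))            # None
```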

### Clean duration values:
Convert from the "MM:SS" format to an integer number of seconds.
- *Note: this will not work for videos over an hour long, but since GMM has no episodes that meet this criterion, it shouldn't cause any problems.*

In [4]:
df["duration"] = np.dot(
    df["duration"].str.split(":", expand=True).astype(int), # Split into integer minutes and seconds columns
    [60, 1] # Dot product with [60, 1] gives minutes * 60 + seconds
    )
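
The note above limits the conversion to "MM:SS" strings. As a minimal sketch of one way to lift that restriction, the hypothetical helper `duration_to_seconds` below treats the colon-separated fields as base-60 digits, so "H:MM:SS" values work too. It assumes every field is a plain non-negative integer and returns `None` for a missing value.

```python
def duration_to_seconds(text):
    """Convert "MM:SS" or "H:MM:SS" to total seconds; return None for missing input."""
    if text is None:
        return None
    seconds = 0
    # Each field is worth 60 times the field to its right
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


print(duration_to_seconds("22:11"))    # 1331
print(duration_to_seconds("1:02:03"))  # 3723
```

A helper like this could be applied to the raw column with `df["duration"].map(duration_to_seconds)` in place of the dot product, at the cost of being slower on very large frames.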

### Clean view counts:
Convert from format "##,###,### views" to integer value (views)

In [5]:
df["views"] = df["views"].str.replace(r'\D', '', regex=True).astype(int) # Remove all non-digit characters, then convert to int
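
The regex above strips every non-digit character before the integer conversion. A plain-Python sketch of the same idea follows, using a hypothetical helper `parse_view_count` that also returns `None` when the text contains no digits or is missing. Whether Scrapetube ever returns such values for this channel is an assumption, not something shown in the data here.

```python
import re


def parse_view_count(text):
    """Extract the digits from text like "738,442 views"; return None if there are none."""
    if text is None:
        return None
    digits = re.sub(r"\D", "", text)  # drop commas, spaces, and letters
    return int(digits) if digits else None


print(parse_view_count("738,442 views"))  # 738442
print(parse_view_count("No views"))       # None
print(parse_view_count(None))             # None
```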

### Check output:
This is as far as we can get with Scrapetube. However, we can still get more information about these videos using the YouTube Data API!

In [6]:
df.head()

Unnamed: 0,id,name,duration,views,published,thumb,scraped
0,XiORNYGT-6s,Our Best Food Creations This Year,1331,738442,1 day ago,{'still': 'https://i.ytimg.com/vi/XiORNYGT-6s/...,2022-12-20 15:08:42.475608
1,B6dXVr0r0Ws,We Tried EVERY Goldfish Flavor,1194,1340918,4 days ago,{'still': 'https://i.ytimg.com/vi/B6dXVr0r0Ws/...,2022-12-20 15:08:42.475628
2,RPp5CXZVhlc,We Hug For 20 Minutes Straight... For Science,1394,522652,5 days ago,{'still': 'https://i.ytimg.com/vi/RPp5CXZVhlc/...,2022-12-20 15:08:42.475637
3,JrZP8aAZE9M,Lab Grown Dairy Taste Test,1140,887282,6 days ago,{'still': 'https://i.ytimg.com/vi/JrZP8aAZE9M/...,2022-12-20 15:08:42.475645
4,74ntqQXYK5s,Testing Discontinued Toys From The 80's,1191,857868,7 days ago,{'still': 'https://i.ytimg.com/vi/74ntqQXYK5s/...,2022-12-20 15:08:42.475652


### Export to CSV:

In [7]:
output_path = "data/gmm-episodes.csv"
df.to_csv(output_path)
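
The `thumb` column holds a dictionary per row, which `to_csv` writes out as its string representation. A minimal sketch of one alternative is to flatten each record before building the DataFrame, so that every thumbnail URL gets its own column. The helper `flatten_record`, the resulting `thumb_still` and `thumb_video` names, and the sample record are all hypothetical.

```python
def flatten_record(record, sep="_"):
    """Return a copy of record with one level of nested dicts expanded into prefixed keys."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Expand {"thumb": {"still": ...}} into {"thumb_still": ...}
            for sub_key, sub_value in value.items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


# Hypothetical record shaped like the output of get_video_details
record = {
    "id": "abc123",
    "thumb": {"still": "https://example.com/still.jpg", "video": None},
}
print(flatten_record(record))
# {'id': 'abc123', 'thumb_still': 'https://example.com/still.jpg', 'thumb_video': None}
```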