## Basic Data Using Scrapetube:

### Get basic data:

For testing, use a low value for `max_entries`. To remove the limit, use `max_entries = None`.

In [1]:
# !pip install scrapetube 

In [2]:
# Imports
from datetime import datetime
import pandas as pd
import numpy as np
import scrapetube

# Parameters
max_entries = 50

GMM_url = "https://www.youtube.com/@GoodMythicalMorning"
video_iterator = scrapetube.get_channel(channel_url=GMM_url, limit=max_entries, sort_by="newest")

# New dictionary class with multidimensional get
class custom_dict(dict):
    def multidim_get(self, keys):
        value = self
        for key in keys:
            try:
                value = value[key]
            except (KeyError, IndexError, TypeError):
                # Missing key, out-of-range list index, or a value that cannot be indexed
                return None
        return value

# Features to extract from Scrapetube video object
"""
#ID
    - name
    - length
    - views
    - published date
    - thumbnail
        - still
        - video
    - scrape datetime
"""

# Function to extract features
def get_basic_video_details(video):
    video = custom_dict(video)
    return {
        "id": video.multidim_get(keys=["videoId"]),
        "name": video.multidim_get(keys=["title","runs",0,"text"]),
        "duration": video.multidim_get(keys=["lengthText","simpleText"]),
        "views": video.multidim_get(keys=["viewCountText","simpleText"]),
        "published": video.multidim_get(keys=["publishedTimeText","simpleText"]),
        "thumbnail": {
            "still": video.multidim_get(keys=["thumbnail","thumbnails",-1,"url"]),
            "video": video.multidim_get(keys=["richThumbnail","movingThumbnailRenderer","movingThumbnailDetails","thumbnails",0,"url"])
        },
        "scraped": datetime.now()
    }

# Build a dataframe of episodes using our Scrapetube iterator
df = pd.DataFrame([
    get_basic_video_details(video)
    for video
    in video_iterator
])
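
To make the behaviour of `multidim_get` concrete, here is a minimal standalone sketch of the same nested lookup. The helper name `nested_get` and the sample dictionary are made up for illustration: the sample only mimics the key paths used above and is not real Scrapetube output.

```python
def nested_get(data, keys, default=None):
    """Follow a path of dict keys and list indexes, returning `default` if any step fails."""
    value = data
    for key in keys:
        try:
            value = value[key]
        except (KeyError, IndexError, TypeError):
            # Missing key, out-of-range index, or a value that cannot be indexed
            return default
    return value


# Hypothetical sample shaped like the key paths used above (not real scraped data)
sample_video = {
    "videoId": "abc123",
    "title": {"runs": [{"text": "Example title"}]},
    "thumbnail": {"thumbnails": [{"url": "small.jpg"}, {"url": "large.jpg"}]},
}

title = nested_get(sample_video, ["title", "runs", 0, "text"])
last_thumbnail = nested_get(sample_video, ["thumbnail", "thumbnails", -1, "url"])
missing = nested_get(sample_video, ["lengthText", "simpleText"])
```

Here `title` is `"Example title"`, `last_thumbnail` is `"large.jpg"`, and `missing` is `None` because the sample has no `lengthText` key.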

### Clean duration values:
Convert from format "MM:SS" to an integer value (seconds).
- *Note: this will not work for videos an hour or longer (format "H:MM:SS"), but given that GMM does not have any episodes meeting this criterion, this shouldn't cause any problems.*

In [3]:
df["duration"] = np.dot(
    df["duration"].str.split(":", expand=True).astype(int), # Split into minutes and seconds columns
    [60, 1] # Weights: 60 seconds per minute, 1 per second
    )
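
The note above points out that the "MM:SS" conversion breaks for videos an hour or longer. As a hedged alternative, the sketch below handles "SS", "MM:SS" and "H:MM:SS" alike by accumulating base-60 parts. The function name `duration_to_seconds` is hypothetical and not part of the original notebook.

```python
def duration_to_seconds(text):
    """Convert "SS", "MM:SS" or "H:MM:SS" into a total number of seconds."""
    seconds = 0
    for part in text.split(":"):
        # Each additional part shifts the running total up by a factor of 60
        seconds = seconds * 60 + int(part)
    return seconds
```

Under that assumption it could be applied with `df["duration"].map(duration_to_seconds)` instead of the `np.dot` approach.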

### Clean view counts:
Convert from format "##,###,### views" to integer value (views)

In [4]:
df["views"] = df["views"].str.replace(r'\D', '', regex=True).astype(int) # Remove every non-digit character, then convert to int
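
The cleaning step above assumes every entry contains at least one digit. If some entries were missing or contained no number (an assumption, not something observed in this dataset), `astype(int)` would fail. A minimal per-value sketch that tolerates those cases, using a hypothetical helper named `parse_view_count`:

```python
import re


def parse_view_count(text):
    """Return the digits of a string such as "1,409,443 views" as an int, or None."""
    if text is None:
        return None
    digits = re.sub(r"\D", "", text)  # Drop everything that is not a digit
    return int(digits) if digits else None
```

It could be applied with `df["views"].map(parse_view_count)`. The resulting column may then contain missing values rather than plain integers.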

### Check output:
This is as far as we can get with Scrapetube. However, we can get more information about these videos using the YouTube Data API.

In [5]:
df.head()

|   | id | name | duration | views | published | thumbnail | scraped |
|---|---|---|---|---|---|---|---|
| 0 | XiORNYGT-6s | Our Best Food Creations This Year | 1331 | 831356 | 1 day ago | {'still': 'https://i.ytimg.com/vi/XiORNYGT-6s/... | 2022-12-21 02:37:25.098818 |
| 1 | B6dXVr0r0Ws | We Tried EVERY Goldfish Flavor | 1194 | 1409443 | 4 days ago | {'still': 'https://i.ytimg.com/vi/B6dXVr0r0Ws/... | 2022-12-21 02:37:25.098894 |
| 2 | RPp5CXZVhlc | We Hug For 20 Minutes Straight... For Science | 1394 | 532061 | 5 days ago | {'still': 'https://i.ytimg.com/vi/RPp5CXZVhlc/... | 2022-12-21 02:37:25.098913 |
| 3 | JrZP8aAZE9M | Lab Grown Dairy Taste Test | 1140 | 898094 | 6 days ago | {'still': 'https://i.ytimg.com/vi/JrZP8aAZE9M/... | 2022-12-21 02:37:25.098927 |
| 4 | 74ntqQXYK5s | Testing Discontinued Toys From The 80's | 1191 | 870749 | 7 days ago | {'still': 'https://i.ytimg.com/vi/74ntqQXYK5s/... | 2022-12-21 02:37:25.098943 |


### Export to CSV:

In [6]:
output_path = "data/gmm-episodes_basic.csv"
df.to_csv(output_path)

---

## More data with YouTube Data API:

### Create a Google Cloud project:

YouTube Data API calls are limited by a daily quota, so we need to run our calls through a Google Cloud project, which keeps track of our quota usage.

**Steps:**
1. Create a Google Cloud project [here](https://console.cloud.google.com/).
2. Enable the [YouTube Data API](https://developers.google.com/youtube/v3) for your project.
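
To get a feel for how quickly the quota is used up, the sketch below estimates the cost of the two kinds of calls this notebook makes later (a paged search for video IDs, then one details call per video). The unit costs and the daily allowance are assumptions taken from Google's published quota table at the time of writing and may change, so check the current documentation. The function name `estimate_quota` is hypothetical.

```python
import math

# Assumed values from Google's published quota table; verify before relying on them
SEARCH_LIST_COST = 100   # units per search.list call
VIDEOS_LIST_COST = 1     # units per videos.list call
DAILY_QUOTA = 10_000     # default daily allowance for a project


def estimate_quota(n_videos, page_size=50, videos_per_details_call=1):
    """Estimate the quota units needed to list and then describe n_videos videos."""
    search_calls = math.ceil(n_videos / page_size)
    details_calls = math.ceil(n_videos / videos_per_details_call)
    return search_calls * SEARCH_LIST_COST + details_calls * VIDEOS_LIST_COST


# Six search pages plus 300 single-video detail calls
cost_for_300_videos = estimate_quota(300)
fits_in_daily_quota = cost_for_300_videos <= DAILY_QUOTA
```

Under these assumptions the search pages dominate the cost, so the number of pages requested matters more than the number of detail calls.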

### Create and import an API Key:

To run API calls, we need an API key that links this notebook to our Google Cloud project.

**Steps:**
1. Create an API Key [here](https://console.cloud.google.com/apis/credentials) for your project.
2. Create a file to store this API Key using the `credentials-template.json` template. Name the new file `credentials.json`.

In [7]:
import json

credentials_file = "credentials.json"

with open(credentials_file, "r") as fh:
    credentials = json.load(fh)

assert "API_Key" in credentials
assert isinstance(credentials["API_Key"], str)

print("Successfully loaded API key from credentials file.")

Successfully loaded API key from credentials file.


### Get all video IDs from channel:

Create a set of all video IDs on the GMM channel.

In [8]:
import requests

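def get_channel_videos_page(channel_id, page_token=None, credential=credentials["API_Key"], max_page_results=50, order="date", part="id"):
    """
    Retrieve one page of video IDs, plus the token of the next page (None on the last page).
    """

    url = "https://www.googleapis.com/youtube/v3/search"
    params = {
        "key": credential,
        "channelId": channel_id,
        "maxResults": max_page_results,
        "order": order,
        "part": part,
    }
    if page_token:
        params["pageToken"] = page_token

    # Let requests build and escape the query string
    response = requests.get(url, params=params)
    response.raise_for_status()  # Fail loudly on HTTP errors such as an exhausted quota
    content = response.json()

    next_page = content.get("nextPageToken")
    videos = content.get("items", [])  # Default to an empty list if no items are returned

    # Search results can also include channels and playlists, so keep only videos
    video_ids = [
        video["id"]["videoId"]
        for video
        in videos
        if video["id"]["kind"] == "youtube#video"
    ]

    return video_ids, next_page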


def get_channel_videos(channel_id, max_depth=None, **kwargs):
    """
    Use API calls to retrieve the video IDs of all videos on a YouTube channel,
    one page at a time. max_depth caps the number of additional pages fetched
    after the first one.
    """

    all_video_ids = []
    depth = 0
    
    page_video_ids, next_page = get_channel_videos_page(channel_id, **kwargs)
    all_video_ids += page_video_ids

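    # max_depth=None means "no limit"; comparing an int with None would raise a TypeError
    while next_page and (max_depth is None or depth < max_depth):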
        page_video_ids, next_page = get_channel_videos_page(channel_id, page_token=next_page, **kwargs)
        all_video_ids += page_video_ids
        depth += 1

    return set(all_video_ids)


GMM_channel_id = "UC4PooiX37Pld1T8J5SYT-SQ"
GMM_video_ids = get_channel_videos(GMM_channel_id, max_depth=5)
pd.DataFrame(GMM_video_ids, columns=["video_ids"]).head()


|   | video_ids |
|---|---|
| 0 | P6K2R1hj0Kg |
| 1 | jjQX8T89b2g |
| 2 | rWqv2lrxkw8 |
| 3 | z9WoHffcLik |
| 4 | u1bHR3ecJTY |
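
The paging logic can be tried out without spending any quota. The sketch below is a generic token-based pagination loop run against a made-up in-memory "API". The names `collect_pages`, `fake_fetch` and `FAKE_PAGES` are hypothetical. Note that `max_pages` here counts every page fetched, whereas `max_depth` above counts only the pages after the first.

```python
def collect_pages(fetch_page, max_pages=None):
    """Collect items from a token-paginated source.

    fetch_page(token) must return (items, next_token), where next_token is
    None on the last page. max_pages=None means "fetch every page".
    """
    items, token, pages = [], None, 0
    while max_pages is None or pages < max_pages:
        page_items, token = fetch_page(token)
        items.extend(page_items)
        pages += 1
        if token is None:
            break  # No further pages
    return items


# Hypothetical three-page data source, keyed by page token
FAKE_PAGES = {
    None: (["a", "b"], "t1"),
    "t1": (["c", "d"], "t2"),
    "t2": (["e"], None),
}


def fake_fetch(token):
    return FAKE_PAGES[token]


all_items = collect_pages(fake_fetch)
first_two_pages = collect_pages(fake_fetch, max_pages=2)
```

Here `all_items` holds all five items, while `first_two_pages` stops after the second page.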


### Set up a verified YouTube API service:

To enable more specific queries, we will need to use OAuth 2.0 authorization.

**Steps:**
1. Configure the OAuth 2.0 Consent Screen for your project [here](https://console.cloud.google.com/apis/credentials/consent).
2. Create an OAuth 2.0 client ID for your project [here](https://console.cloud.google.com/apis/credentials), selecting *'Desktop App'* as your client type.
3. Download the client secret file [here](https://console.cloud.google.com/apis/credentials) and save it in the project folder as `client_secret.json`.

*Adapted from code by [Jie Jenn](https://www.youtube.com/@jiejenn).*

In [9]:
from Google import create_service

client_secret_file = "client_secret.json"
api_name = "youtube"
api_version = "v3"
scopes = ["https://www.googleapis.com/auth/youtube"]

youtube_api_service = create_service(client_secret_file, api_name, api_version, scopes)

youtube v3 service created successfully
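
The `create_service` helper itself is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch built on the `google-auth-oauthlib` and `google-api-python-client` packages. This is an assumption about its behaviour, not the actual helper: the name `create_service_sketch` is hypothetical, and the sketch does not cache tokens between runs.

```python
def create_service_sketch(client_secret_file, api_name, api_version, scopes):
    """Run the desktop OAuth 2.0 flow and return an API client (illustrative only)."""
    # Imported lazily so that defining the function does not require the packages
    from google_auth_oauthlib.flow import InstalledAppFlow
    from googleapiclient.discovery import build

    flow = InstalledAppFlow.from_client_secrets_file(client_secret_file, scopes)
    credentials = flow.run_local_server(port=0)  # Opens a browser window for consent
    return build(api_name, api_version, credentials=credentials)
```

Since the calls in this notebook only read public data, the narrower `https://www.googleapis.com/auth/youtube.readonly` scope would probably be enough.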


### Get data from YouTube API:

Now we can use our service to run API calls.

In [10]:
def get_video_details(video_id, parts=("snippet", "statistics", "contentDetails"), service=youtube_api_service):
    """
    Using the YouTube Data API, get information for a video by its video ID.
    
    Parameters:
    video_id - the ID of the video to find
    parts - the parts of data to return in API call, more info at 
    service - the service through which to send API calls
    """
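    response = service.videos().list(
        id=video_id,
        part=",".join(parts)
    ).execute()

    result = {"id": video_id}
    items = response.get("items", [])
    if not items:
        # An unavailable video (for example, deleted or private) yields no items
        return result

    episode = items[0]
    for part in parts:
        result.update(episode.get(part, {}))  # Flatten each requested part into one row

    return result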

df = pd.DataFrame([
    get_video_details(video_id)  # Avoid shadowing the built-in id()
    for video_id
    in GMM_video_ids
])

df.head()

|   | id | publishedAt | channelId | title | description | thumbnails | channelTitle | tags | categoryId | liveBroadcastContent | ... | likeCount | favoriteCount | commentCount | duration | dimension | definition | caption | licensedContent | contentRating | projection |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P6K2R1hj0Kg | 2019-01-11T11:00:03Z | UC4PooiX37Pld1T8J5SYT-SQ | Crazy Diet Fad Challenge | We're challenging our knowledge on the crazies... | {'default': {'url': 'https://i.ytimg.com/vi/P6... | Good Mythical Morning | [gmm, good mythical morning, rhettandlink, rhe... | 24 | none | ... | 34804 | 0 | 1916 | PT14M45S | 2d | hd | True | True | {} | rectangular |
| 1 | jjQX8T89b2g | 2021-12-31T11:00:26Z | UC4PooiX37Pld1T8J5SYT-SQ | Top 5 Favorite Moments of 2021 | Today, we're looking back at our favorite mome... | {'default': {'url': 'https://i.ytimg.com/vi/jj... | Good Mythical Morning | [gmm, good mythical morning, rhettandlink, rhe... | 24 | none | ... | 21336 | 0 | 783 | PT21M51S | 2d | hd | True | True | {} | rectangular |
| 2 | rWqv2lrxkw8 | 2021-11-27T11:00:20Z | UC4PooiX37Pld1T8J5SYT-SQ | Sucking Up Hot Dogs With A Vacuum | We'll never look at vacuums the same again. #s... | {'default': {'url': 'https://i.ytimg.com/vi/rW... | Good Mythical Morning | [gmm, good mythical morning, rhettandlink, rhe... | 24 | none | ... | 42883 | 0 | 679 | PT46S | 2d | hd | True | True | {} | rectangular |
| 3 | z9WoHffcLik | 2021-11-03T10:00:27Z | UC4PooiX37Pld1T8J5SYT-SQ | Discontinued Snacks Taste Test | Today, we're eating and drinking some snacks t... | {'default': {'url': 'https://i.ytimg.com/vi/z9... | Good Mythical Morning | [gmm, good mythical morning, rhettandlink, rhe... | 24 | none | ... | 58266 | 0 | 1815 | PT17M16S | 2d | hd | True | True | {} | rectangular |
| 4 | u1bHR3ecJTY | 2019-12-03T11:00:10Z | UC4PooiX37Pld1T8J5SYT-SQ | Putting Weird Things In A Dishwasher (TEST) | Ever wondered what would happen if you put a g... | {'default': {'url': 'https://i.ytimg.com/vi/u1... | Good Mythical Morning | [gmm, good mythical morning, rhettandlink, rhe... | 24 | none | ... | 33342 | 0 | 997 | PT14M13S | 2d | hd | True | True | {} | rectangular |
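
Unlike the Scrapetube data, the API returns `duration` as an ISO 8601 string such as `PT14M45S`. To compare it with the seconds computed earlier, it can be converted with a small parser. The sketch below is a minimal version that assumes durations contain only hours, minutes and seconds, and rejects anything else (for example, a value with a day component). The function name `iso_duration_to_seconds` is hypothetical.

```python
import re

# Matches values such as "PT46S", "PT14M45S" or "PT1H2M3S"
_DURATION_PATTERN = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def iso_duration_to_seconds(text):
    """Convert an ISO 8601 duration limited to hours, minutes and seconds into seconds."""
    match = _DURATION_PATTERN.match(text)
    if match is None:
        raise ValueError(f"Unsupported duration: {text!r}")
    # Components that are absent from the string count as zero
    hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds
```

For the first row above, `iso_duration_to_seconds("PT14M45S")` gives 885 seconds. It could be applied to the whole column with `df["duration"].map(iso_duration_to_seconds)`.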


### Export to CSV:

In [11]:
output_path = "data/gmm-episodes_full.csv"
df.to_csv(output_path)
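
A possible refinement: the details step earlier issues one `videos.list` call per video. Assuming the endpoint accepts up to 50 comma-separated IDs per call, as its documentation describes, the same data could be fetched in far fewer calls. The sketch below is a hedged variant. The names `chunked` and `get_video_details_batched` are hypothetical, and `service` is expected to be a client like `youtube_api_service`.

```python
def chunked(items, size=50):
    """Yield successive lists of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def get_video_details_batched(video_ids, service, parts=("snippet", "statistics", "contentDetails")):
    """Fetch details for many videos, sending up to 50 IDs per API call."""
    rows = []
    for batch in chunked(video_ids, size=50):
        response = service.videos().list(
            id=",".join(batch),
            part=",".join(parts),
        ).execute()
        for episode in response.get("items", []):
            row = {"id": episode["id"]}
            for part in parts:
                row.update(episode.get(part, {}))  # Flatten each part into the row
            rows.append(row)
    return rows
```

Under these assumptions, `pd.DataFrame(get_video_details_batched(GMM_video_ids, youtube_api_service))` would build the same kind of table. Videos the API does not return are simply absent from the result.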