### Working with the YouTube API 
This file automatically downloads transcripts and video information using the YouTube API. For help see: <br> 
https://developers.google.com/youtube/v3/ <br>
https://github.com/spnichol/youtube_tutorial/blob/master/youtube_videos.py

In [1]:
import os
import pandas as pd
import numpy as np
import json

### Search YT for videos
This section defines two API query functions, but I will only use one. <br> The first runs a search and returns up to 50 results per call. I'm going to use it to obtain both safe-search and unrestricted results. <br> The second uses the video resource https://developers.google.com/youtube/v3/docs/videos to load extra information about each video (by id). I won't use it because there's a lot of missing data for the parameters I'm interested in. <br> A third helper turns the search results into a table.

In [64]:
# function to query the API
from googleapiclient.discovery import build  # 'apiclient' is the legacy alias for this package

# Set key and enable YouTube Data API for your project.
DEVELOPER_KEY = "YOUR_API_KEY"  # your key here; never commit a real key
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"

# this function searches YT for videos only
def youtube_search(q, token=None, safe='none'):  # can specify 'strict' to build validation set
    youtube = build(YOUTUBE_API_SERVICE_NAME, YOUTUBE_API_VERSION, developerKey=DEVELOPER_KEY)
    
    search_response = youtube.search().list(
        q=q,
        type='video',
        pageToken=token,
        order='relevance',
        part='id,snippet',
        maxResults=50,
        relevanceLanguage='en',
        safeSearch=safe
    ).execute()
    
    videos = [] 
    for search_result in search_response.get("items", []): 
        videos.append(search_result)

    # the last page has no "nextPageToken", so fall back to a sentinel value
    nexttok = search_response.get("nextPageToken", "last_page")
    return nexttok, videos
    
# this function gets extra information about a video    
def geo_query(video_id):
    youtube = build(YOUTUBE_API_SERVICE_NAME, YOUTUBE_API_VERSION,
                    developerKey=DEVELOPER_KEY)

    video_response = youtube.videos().list(
        id=video_id,
        part='contentDetails,statistics'  # comma-separated list, no spaces
    ).execute()

    return video_response

# make table: extract json, ID for urls, title, description, thumbnail
def make_results_table(json_results):
    yt_urls = []
    titles = []
    description = []
    channel = []
    thumbnail = []
    
    for video in json_results:
        yt_urls.append(video['id']['videoId'])
        titles.append(video['snippet']['title'])
        description.append(video['snippet']['description'])
        channel.append(video['snippet']['channelTitle'])
        thumbnail.append(video['snippet']['thumbnails']['default']['url'])
    
    video_table = pd.DataFrame({"url":yt_urls, "title":titles, "description":description, "channel":channel, "thumbnail":thumbnail})
    return video_table

In [62]:
# run YT search
search_results = youtube_search("disney")  # can specify safe='strict' for a safe search
search_token = search_results[0]
search_json = search_results[1]
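
The search above only fetches the first page and leaves `search_token` unused. A minimal sketch of how the token could be followed to pool several pages is shown here. `collect_pages` is a hypothetical helper, not part of the original notebook; it assumes a search function with the same signature and return value as `youtube_search` (a `(next_token, items)` tuple, with `"last_page"` as the sentinel).

```python
def collect_pages(search_fn, query, max_pages=3, **kwargs):
    """Call search_fn repeatedly, following page tokens, and pool the items.

    search_fn is assumed to behave like youtube_search: it accepts
    (q, token=None, ...) and returns (next_token, items).
    """
    all_items = []
    token = None
    for _ in range(max_pages):
        token, items = search_fn(query, token=token, **kwargs)
        all_items.extend(items)
        if token == "last_page":  # sentinel returned when no further page exists
            break
    return all_items
```

With the real function this would be called as `collect_pages(youtube_search, "disney", max_pages=2)`. Each page is a separate API request, so `max_pages` keeps quota use bounded.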

In [66]:
results_table = make_results_table(search_json)
results_table.head()

Unnamed: 0,url,title,description,channel,thumbnail
0,xHpH11hiWfg,RALPH BREAKS THE INTERNET: Wreck-it Ralph 2 Tr...,Ralph's back! Check out our brand trailer for ...,Disney UK,https://i.ytimg.com/vi/xHpH11hiWfg/default.jpg
1,p4D19K8s-lA,10 Theories That Make Disney Movies So Much Da...,"Beware, some of these fan theories just might ...",Screen Rant,https://i.ytimg.com/vi/p4D19K8s-lA/default.jpg
2,_30jPKzWdN0,Why Are There No Mosquitoes at Disney World?,Walt Disney World is smack dab in the middle o...,Rob Plays,https://i.ytimg.com/vi/_30jPKzWdN0/default.jpg
3,BJzMhLhc070,Going Batty / Scare B&B | Full Episode | Vampi...,Vampirina and her family move to Pennsylvania ...,disneyjunior,https://i.ytimg.com/vi/BJzMhLhc070/default.jpg
4,IPSdb0JPnK8,"A Disney Springs Update From 400 Feet Up, Chec...",In today's vlog we head over to Disney Springs...,TheTimTracker,https://i.ytimg.com/vi/IPSdb0JPnK8/default.jpg
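
Note that the `url` column in this table holds video IDs rather than full URLs. If a later step expects a watch URL, a small helper along these lines could do the conversion. `video_id_to_url` is a hypothetical name introduced for illustration; it assumes the standard `watch?v=` URL form.

```python
def video_id_to_url(video_id):
    """Build a standard YouTube watch URL from a video ID."""
    return "https://www.youtube.com/watch?v={}".format(video_id)


# example with the first ID from the table above
print(video_id_to_url("xHpH11hiWfg"))
```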


### Build training and validation data sets

In [105]:
kid_searches = ['peppa pig','Frozen','PAW patrol','oggy','powerpuff girls','mickey mouse','minnie mouse','dora','doraemon','wonder woman',
           'star wars','batman','superman','lego','power rangers','bugs bunny','elsa','baby einstein','spiderman','scooby doo',
           'alphabet','animals','winnie the pooh','tom and jerry','disney','sesame street','school','ufo','little pony','sponge bob']

In [106]:
safe_list=[]
unsafe_list=[]

for each in kid_searches:
    safe_results = youtube_search(each, safe='strict') 
    safe_json = safe_results[1]
    safe_results_table = make_results_table(safe_json)
    safe_results_table['query']=each
    safe_list.append(safe_results_table)
    unsafe_results = youtube_search(each) 
    unsafe_json = unsafe_results[1]
    unsafe_results_table = make_results_table(unsafe_json)
    unsafe_results_table['query']=each
    unsafe_list.append(unsafe_results_table)

safe_table = pd.concat(safe_list)
unsafe_table = pd.concat(unsafe_list)

In [107]:
print(safe_table.shape, unsafe_table.shape)

(1500, 6) (1500, 6)


In [108]:
unsafe_table.head()

Unnamed: 0,url,title,description,channel,thumbnail,query
0,UhNX5NdJCs4,Peppa Pig Full Episodes | LIVE Peppa Pig 2018 ...,Peppa Pig Full Episodes | LIVE Peppa Pig 2017 ...,Peppa Pig - Official Channel,https://i.ytimg.com/vi/UhNX5NdJCs4/default_liv...,peppa pig
1,HRcB1WOQZIo,Peppa Pig English Episodes | Parachute Jump | ...,Peppa Pig English Episodes | Parachute Jump | ...,Peppa Pig - Official Channel,https://i.ytimg.com/vi/HRcB1WOQZIo/default.jpg,peppa pig
2,W9JbqgenXUE,Peppa Pig English Episodes in 4K | Scooters! |...,We come back to discover brand new clips from ...,Peppa Pig - Official Channel,https://i.ytimg.com/vi/W9JbqgenXUE/default.jpg,peppa pig
3,P18xl3yDMX0,Peppa Pig English Episodes | Robbie and Rosie ...,We have yet more new characters joining the wo...,Peppa Pig - Official Channel,https://i.ytimg.com/vi/P18xl3yDMX0/default.jpg,peppa pig
4,s7IL4pERfr0,Peppa Pig English Episodes | Pottery with Pepp...,Subscribe for more videos: http://bit.ly/Peppa...,Peppa Pig - Official Channel,https://i.ytimg.com/vi/s7IL4pERfr0/default.jpg,peppa pig


In [110]:
# remove any safe content from the unsafe table 
notsafe = unsafe_table.loc[~unsafe_table['url'].isin(safe_table['url'])]
notsafe.shape

(210, 6)
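
The filter above is an anti-join on the `url` column. The toy example below illustrates the same idea on made-up data. The `drop_duplicates` step is an optional extra that the original cell does not perform; it is included on the assumption that the same video may be returned for more than one query. `only_unsafe` is a hypothetical helper name.

```python
import pandas as pd

# toy stand-ins for safe_table and unsafe_table (made-up IDs)
safe = pd.DataFrame({"url": ["a", "b", "c"], "query": ["q1", "q1", "q2"]})
unsafe = pd.DataFrame({"url": ["a", "x", "c", "x"], "query": ["q1", "q1", "q2", "q2"]})


def only_unsafe(unsafe_df, safe_df, key="url"):
    """Keep rows of unsafe_df whose key never appears in safe_df."""
    mask = ~unsafe_df[key].isin(safe_df[key])
    # optional: keep one row per video if it showed up under several queries
    return unsafe_df.loc[mask].drop_duplicates(subset=key).reset_index(drop=True)


result = only_unsafe(unsafe, safe)
print(result)
```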

### Extract transcripts

In [None]:
# turn urls into transcripts
import subprocess
from webvtt import WebVTT

# https://stackoverflow.com/questions/48125300/cant-scrape-youtube-videos-closed-captions
# TODO: auto-populate the output filenames with a % output template
def download_subs(video_url, lang="en"):
    # each option and its value are separate list items, so no shell quoting is needed
    cmd = [
        "youtube-dl",
        "--skip-download",
        "--write-auto-sub",
        "--sub-format", "vtt",
        "--sub-lang", lang,
        "--output", "test.vtt",
        video_url,
    ]

    print(" ".join(cmd))

    # run without a shell so the URL is passed through as a single argument
    subprocess.run(cmd)
    # https://pypi.org/project/webvtt-py/
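
Once a `.vtt` file has been downloaded, it still has to be flattened into plain text. The notebook points to webvtt-py for this; the sketch below deliberately swaps in a standard-library approach instead, so it runs without extra dependencies. It assumes auto-generated captions that repeat lines across cues and embed inline timing tags such as `<00:00:00.500><c>`, and it ignores other WebVTT features such as cue identifiers. `vtt_to_text` is a hypothetical helper name.

```python
import re

# cue timing line, e.g. "00:00:00.000 --> 00:00:02.000 align:start"
TIMING_LINE = re.compile(r"^(\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->")
# inline tags such as <c>, </c> or <00:00:00.500>
TAG = re.compile(r"<[^>]+>")


def vtt_to_text(vtt_string):
    """Flatten the text of a WebVTT document into one transcript string."""
    lines = []
    for raw in vtt_string.splitlines():
        line = raw.strip()
        # skip blanks, the file header, metadata lines and cue timings
        if not line or line.startswith("WEBVTT") or TIMING_LINE.match(line):
            continue
        if line.startswith(("Kind:", "Language:", "NOTE")):
            continue
        line = TAG.sub("", line).strip()
        # drop a line that only repeats the previous one
        if line and (not lines or lines[-1] != line):
            lines.append(line)
    return " ".join(lines)
```

A file on disk could be handled with `vtt_to_text(open(path, encoding="utf-8").read())`, where `path` is whatever filename youtube-dl actually wrote.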