# Data Collection Part II: YouTube Videos

Video popularity on YouTube is measured here by the number of views a video receives. This choice is somewhat subjective, since once again there is no formal definition of what constitutes a "viral" video. However, the placement of the "views" widget on the YouTube platform, and its prominence relative to the "likes" count, suggest that views are the best available metric for a video's popularity. For this reason, I used views as my main search criterion, though I also collected likeCount and dislikeCount for each video object.

YouTube's video search functionality is somewhat less robust than Reddit's. Google allows developers to pass a search query in the API call; however, the search term can only be a single string and will only search based on video title. To search for the videos with the highest all-time view counts, I wrote a query that ORs together every letter and digit, and then ordered the results by view count.
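The catch-all query described above can be generated rather than typed by hand. The following is a minimal sketch, assuming the `|` character acts as a Boolean OR in the search `q` parameter, as the YouTube Data API documentation describes. The helper name `build_or_query` is hypothetical and does not appear in the original notebook.

```python
import string

def build_or_query(terms):
    """Join search terms with '|', which the search `q` parameter treats as OR."""
    return '|'.join(terms)

# All 26 lowercase letters followed by the ten digits.
alphanumeric_terms = list(string.ascii_lowercase) + list(string.digits)
query = build_or_query(alphanumeric_terms)
print(query)
```

Generating the string this way avoids accidental duplicates or omissions in a long hand-typed query.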

#### Data Collection Methodology

Again, the video comments were stored separately from the main video data, as were the video stats. In fact, the initial API call returned only video metadata and video IDs for the specified search terms. I chained two additional API calls onto the original one and stored the output as a series of nested dictionaries. Comment replies were a bit easier to control in this case, in that I was able to set a maximum limit. The returned data contain up to 50 comments per video and up to 20 replies per comment.
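To make the nested-dictionary shape concrete, here is a rough sketch of how one video's stats and comment threads might be combined into a single record. It assumes the response layout used elsewhere in this notebook (`snippet`, `statistics`, `topLevelComment`, `replies`). The function name `build_video_record` and the sample inputs are hypothetical stand-ins, not real API output.

```python
def build_video_record(stats_item, comment_threads, max_replies=20):
    """Combine one video item and its comment threads into a nested dict."""
    snippet = stats_item['snippet']
    stats = stats_item.get('statistics', {})
    record = {
        'videoID': stats_item['id'],
        'publishedAt': snippet['publishedAt'],
        'title': snippet['title'],
        'viewCount': stats.get('viewCount'),
        'likeCount': stats.get('likeCount'),
        'comments': [],
    }
    for thread in comment_threads:
        top = thread['snippet']['topLevelComment']['snippet']['textOriginal']
        # 'replies' is absent when a comment has no replies, so default to [].
        replies = thread.get('replies', {}).get('comments', [])
        record['comments'].append({
            'topLevelComment': top,
            'replies': [r['snippet']['textOriginal'] for r in replies[:max_replies]],
        })
    return record

# Hand-written stand-ins for API responses (not real data).
sample_stats = {
    'id': 'abc123',
    'snippet': {'publishedAt': '2019-02-09T00:00:00.000Z', 'title': 'Example video'},
    'statistics': {'viewCount': '1000', 'likeCount': '10'},
}
sample_threads = [
    {'snippet': {'topLevelComment': {'snippet': {'textOriginal': 'First!'}}},
     'replies': {'comments': [{'snippet': {'textOriginal': 'Second.'}}]}},
    {'snippet': {'topLevelComment': {'snippet': {'textOriginal': 'No replies here'}}}},
]
record = build_video_record(sample_stats, sample_threads)
print(record['comments'])
```

Keeping the assembly in a pure function like this makes it possible to check the record shape without spending API quota.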

Due to the limit of 50 results per top-level API call, I ran the call function many times, wrote each batch of results to a JSON file in the local directory, and joined the files together in a later notebook.
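The joining step happens in a later notebook that is not shown here. As an illustration only, one way to merge several batch files might look like the sketch below. It assumes each file holds one JSON document per non-empty line. The helper name `merge_json_lines` and the throwaway demo files are hypothetical.

```python
import glob
import json
import os
import tempfile

def merge_json_lines(pattern):
    """Parse every non-empty line of every file matching `pattern` into one list."""
    merged = []
    for path in sorted(glob.glob(pattern)):
        with open(path, 'r') as f:
            for line in f:
                if line.strip():
                    merged.append(json.loads(line))
    return merged

# Demo with two throwaway files standing in for separate API runs.
demo_dir = tempfile.mkdtemp()
for n, batch in enumerate([[{'videoID': 'a'}], [{'videoID': 'b'}]]):
    with open(os.path.join(demo_dir, f'batch_{n}.json'), 'w') as f:
        f.write(json.dumps(batch) + '\n')

merged = merge_json_lines(os.path.join(demo_dir, 'batch_*.json'))
print(len(merged))
```

For the real data, the same function could be pointed at a pattern such as `./Data/YOUTUBEdata_*.json`, assuming the files follow the naming used in this notebook.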

In [1]:
## Packages and libraries:

import pandas as pd
import numpy as np

from apiclient.discovery import build
import apiclient.errors
import api_key as api_key

import json

In [2]:
youtube = build('youtube', 'v3', developerKey='YOUR_API_KEY')  # placeholder: supply your own key rather than committing it
youtube.search()

In [57]:
## API Call function:

youtube_json = []

def get_video_comments(search_term):
        
    next_page_token = 'CPQDEAA'
    
    res_search = youtube.search().list(part='snippet',
                                maxResults=50,
                                q=search_term,
                                order='viewCount',
                                type='video',
                                pageToken=next_page_token).execute()
    video_ID = []
    skip_videos = ['y7e-GC6oGhg', 'vvbN-cWe0A0', '1Aoc-cd9eYs', 'J78SdCzzumA', 'dY_3ggKg0Bc', 'hHW1oY26kxQ', 
                   'EDkoj932YFo', '5VcSwejU2D0', 'QX_oy9614HQ', 'Fkd9TWUtFm0', 'QtIreEZYPaE', 'IFAcqaNzNSc',
                   'tGn3-RW8Ajk', 'Xiu62ETFlyk', 'XfRY0ASZHuU', 'JuppD-oQKp8', 'OxzKb4a1Qc4', 'bSrf4_mQrkQ',
                   '_h5qmAiTnV8', 'iLc4wHK_3Zg', 'hY-RmUM5H9g', 'WybszDOC3tQ', 'TlflQmjlRxQ', 'pzUt8hGvEwk', 
                   'nbY5a71f4-U']

    for item in res_search['items']: 
        if item['id']['videoId'] is not None:
            video_ID.append(item['id']['videoId'])

    video_items = []
    
    for item in video_ID:
        
        try:
        
            if item not in skip_videos:
                print(item)
                # `part` is a comma-separated list of resource parts.
                res_stats = youtube.videos().list(part='snippet,statistics', id=item).execute()
                res_comm = youtube.commentThreads().list(part='snippet,replies', videoId=item, maxResults=50).execute()

                for item in res_stats['items']:
                    if 'statistics' in item.keys():
                        # Flatten the fields of interest into one record per video.
                        temp_dict = {
                            'videoID': item['id'],
                            'publishedAt': item['snippet']['publishedAt'],
                            'title': item['snippet']['title'],
                            'description': item['snippet']['description'],
                            'viewCount': item['statistics']['viewCount'],
                            'likeCount': item['statistics']['likeCount'],
                            'dislikeCount': item['statistics']['dislikeCount'],
                            'commentCount': item['statistics']['commentCount'],
                        }

                    comment_list = []

                    for comment in res_comm['items']:
                        temp_comment_dict = {}

                        topLevelComment = {}
                        topLevelComment = comment['snippet']['topLevelComment']['snippet']['textOriginal']
                        temp_comment_dict['topLevelComment'] = topLevelComment

                        # Store up to 20 reply texts per comment as a list.
                        replies = comment.get('replies', {}).get('comments', [])
                        if replies:
                            temp_comment_dict['replies'] = [
                                r['snippet']['textOriginal'] for r in replies[:20]
                            ]

                        comment_list.append(temp_comment_dict)

                    temp_dict['comments'] = comment_list

                    video_items.append(temp_dict)

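        except (apiclient.errors.HttpError, KeyError):
            # Skip videos the API refuses (e.g. comments disabled) or that
            # lack an expected field such as likeCount.
            continue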
                    
    # Token for the next page of search results (None on the last page).
    next_page_token = res_search.get('nextPageToken')

    youtube_json.append(video_items) 
    
    print(next_page_token)

    ## Write the output to a JSON file in the local directory (one JSON document per line):

    with open('//Users/veronicagiannotta/DSI/Projects/Capstone/Data/YOUTUBEdata_020919_10.json', 'a') as f:
        f.write(json.dumps([youtube_json]) + '\n')

    

In [58]:
get_video_comments('a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z|1|1|2|3|4|5|6|7|8|9|0')

CPQDEAA


In [2]:
yt_data = []
notParsed = []
# Each non-empty line of the file is expected to hold one JSON document.
with open('./Data/YOUTUBEdata_020919_6.json', 'r') as yt_file:
    for line in yt_file:
        if line.strip():
            try:
                yt_data.append(json.loads(line))
            except json.JSONDecodeError:
                notParsed.append(line)
print(len(yt_data))
print('Could not parse: ', len(notParsed))

1
Could not parse:  0


In [3]:
len(yt_data[0][0][0])

46
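The three index steps in `yt_data[0][0][0]` follow from how the file was written and read: the list of video records is appended to `youtube_json`, that list is wrapped in one more list at write time, and each parsed line becomes one entry of `yt_data`. The short sketch below reproduces that nesting with placeholder records (not real data).

```python
import json

video_items = [{'videoID': 'x1'}, {'videoID': 'x2'}]  # placeholder records

youtube_json = []
youtube_json.append(video_items)       # one entry per call of the collection function
line = json.dumps([youtube_json])      # extra wrapping list added at write time

yt_data = [json.loads(line)]           # one entry per line read back from the file
videos = yt_data[0][0][0]              # unwrap all three levels
print(len(videos))
```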

In [4]:
yt_data[0][0][0]

[{'videoID': 'da1vvigy5tQ',
  'publishedAt': '2015-05-04T15:17:29.000Z',
  'title': 'Reversing Type 2 diabetes starts with ignoring the guidelines | Sarah Hallberg | TEDxPurdueU',
  'description': 'Can a person be "cured" of Type 2 Diabetes? Dr. Sarah Hallberg provides compelling evidence that it can, and the solution is simpler than you might think.\n\nDr. Sarah Hallberg is the Medical Director of the Medically Supervised Weight Loss Program at IU Health Arnett, a program she created. She is board certified in both obesity medicine and internal medicine and has a Master’s Degree in Exercise Physiology. She has recently created what is only the second non-surgical weight loss rotation in the country for medical students. Her program has consistently exceeded national benchmarks for weight loss, and has been highly successful in reversing diabetes and other metabolic diseases. Dr. Hallberg is also the co-author of www.fitteru.us, a blog about health and wellness.\r\n\r\nB.S., Kinesiolog