# Collect and analyse YouTube data

First, let's install a few relevant libraries:

In [None]:
!pip install jsonlines tqdm pandas google-api-python-client

...and load these libraries:

In [None]:
import os
os.getcwd() # This line of code returns my current working directory

In [None]:
import jsonlines
from tqdm import tqdm
import pandas as pd

import googleapiclient.discovery
import googleapiclient.errors
from googleapiclient.discovery import build

## Obtaining credentials for the YouTube Data API

Getting access to the YouTube Data API

* **Step 1**: Go to the Google Cloud Console and login with a Google account: https://console.cloud.google.com/

The next steps are shown in the following video (from 0:00 to 4:39): https://www.youtube.com/watch?v=th5_9woFJmk&ab_channel=CoreySchafer . We recommend watching this segment of the video and following the steps as you go. In case it helps, the following bullet points summarise the steps in writing:

* **Step 2**: Once you are in the Google Cloud Console, create a new project by clicking the “Create Project” button, or by clicking the project dropdown menu at the top of the page and clicking the “New Project” button. The newly created project should be automatically selected as your current project but if this isn’t the case, simply select it in the list of projects in the project dropdown menu (at the top of the page).

* **Step 3**: Use the search bar at the top of the page to search for “YouTube Data API v3” and click on the result when it appears (note that more than one result may appear, so select the option entitled “YouTube Data API v3”).

* **Step 4**: Once on the page for “YouTube Data API v3”, click on the blue button that says “Enable”.

* **Step 5**: At this point you should have been automatically directed to a page with “API services” written in the top-left corner. Once there, click on “Create credentials”.

* **Step 6**: A form will appear. Under “Which API are you using?”, select “YouTube Data API v3”, and under “What data will you be accessing?”, select “Public data”.

* **Step 7**: Once the form is submitted, the API key will be displayed. Copy this key and paste it in place of "YOUR_API_KEY" in the api_key variable below.


**IMPORTANT**: For safety reasons we recommend that, once you are done collecting data for a project, you go back to your project in the Google Cloud Console and delete the API key.
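One way to reduce the risk of leaking the key (for instance when sharing this notebook) is to read it from an environment variable instead of pasting it into a cell. The sketch below assumes a variable named `YOUTUBE_API_KEY`, which is a hypothetical name chosen for illustration; the helper `load_api_key` is likewise not part of any library.

```python
import os

def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Return the API key stored in the environment variable `env_var`.

    The default variable name is an assumption made for this example.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set or is empty.")
    return key

# Example usage (uncomment once the variable is set in your environment):
# api_key = load_api_key()
```

With this approach the key never appears in the notebook itself, although deleting it in the Google Cloud Console once you are done is still advisable.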


## Initialising API client

In [None]:
# Disable OAuthlib's HTTPS verification when running locally.
os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"

api_key = "YOUR_API_KEY"  # Replace "YOUR_API_KEY" with your actual API key
api_service_name = "youtube"
api_version = "v3"

In [None]:
youtube = build(api_service_name, api_version, developerKey=api_key)

## Designing the request you want to make using the API documentation

The following webpages provide the information you need to make requests using the YouTube Data API:

**YouTube Data API documentation**: https://developers.google.com/youtube/v3/docs

**YouTube Data API quota costs per request**: https://developers.google.com/youtube/v3/determine_quota_cost
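To get a feel for the quota budget before collecting data, a rough estimate can be computed as below. The unit costs and the daily allowance are assumptions (100 units per `search.list` call, 1 unit per `commentThreads.list` call, and a default allowance of 10,000 units per day); check them against the quota page linked above before relying on them. `estimate_quota` is a hypothetical helper written for this example.

```python
# Assumed unit costs and daily allowance; verify them on the quota cost page.
SEARCH_LIST_COST = 100
COMMENT_THREADS_LIST_COST = 1
DAILY_QUOTA = 10_000

def estimate_quota(search_pages, comment_pages):
    """Estimate the quota units used by a number of search and comment-thread pages."""
    return search_pages * SEARCH_LIST_COST + comment_pages * COMMENT_THREADS_LIST_COST

cost = estimate_quota(search_pages=2, comment_pages=300)
share = 100 * cost / DAILY_QUOTA
print(f"{cost} units, about {share:.1f}% of the assumed daily quota")
```

Under these assumptions, search requests dominate the budget, so it is usually worth fixing the search query before paging through many results.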

## Collecting data

### Searching for lists of videos from a keyword

Let's query the YouTube Data API for the 50 most viewed videos published between the 11th and the 22nd of November 2024 (i.e. during COP29) relating to the keywords "climate change" or "global warming". We will further restrict the results to videos only (e.g. not playlists) and ask the API to favour results relevant to English. Note that "relevanceLanguage" is a preference rather than a strict filter, so some non-English videos may still be returned:

In [None]:
request = youtube.search().list(
    part="snippet",
    maxResults=50,
    publishedAfter="2024-11-11T00:00:00Z",
    publishedBefore="2024-11-22T00:00:00Z",
    order="viewCount",
    q="climate change | global warming",
    relevanceLanguage="en",
    type="video"
)
response = request.execute()
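The `publishedAfter` and `publishedBefore` values are RFC 3339 timestamps. The sketch below builds them with the standard library rather than typing them by hand; `to_rfc3339` is a hypothetical helper. It also illustrates a point worth checking for your own query: a bound of `2024-11-22T00:00:00Z` stops at the very start of the 22nd, so covering that whole day would need midnight on the 23rd.

```python
from datetime import datetime, timedelta, timezone

def to_rfc3339(dt):
    """Format a timezone-aware datetime as a UTC timestamp such as 2024-11-11T00:00:00Z."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

start = datetime(2024, 11, 11, tzinfo=timezone.utc)
last_day = datetime(2024, 11, 22, tzinfo=timezone.utc)

published_after = to_rfc3339(start)
# Bound used in the query above: ends at 00:00 on the 22nd.
published_before_as_in_query = to_rfc3339(last_day)
# Alternative bound that would cover the whole of the 22nd.
published_before_whole_day = to_rfc3339(last_day + timedelta(days=1))

print(published_after, published_before_as_in_query, published_before_whole_day)
```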

Let's look at the response we obtained...

In [None]:
response.keys()

In [None]:
len(response["items"])

In [None]:
response["items"][0]

The response contains a "nextPageToken" field:

In [None]:
response["nextPageToken"]

The "nextPageToken" can be used to obtain the next page of results:

In [None]:
request = youtube.search().list(
    part="snippet",
    maxResults=50,
    publishedAfter="2024-11-11T00:00:00Z",
    publishedBefore="2024-11-22T00:00:00Z",
    order="viewCount",
    q="climate change | global warming",
    relevanceLanguage="en",
    type="video",
    pageToken=response["nextPageToken"] # We added this argument here to specify we want the next page of results
)
next_page_response = request.execute()

In [None]:
len(next_page_response["items"])

In [None]:
next_page_response["items"][0]

Let's generalise the previous code so that we collect the N first pages of results for our query:

In [None]:
N = 2                      # Number of result pages to collect. Change the value of N to collect a different number of pages.
next_page_token = None     # Stores the token of the next page inside the for loop. We set it to None for the initial query.
search_results = list()    # We create an empty list to store the search query results

# Let's iterate over the number of pages we want...
for i in tqdm(range(N)): # The "tqdm" wrapper around "range(N)" allows us to see a progress bar

    # Retrieve a page of results for search query
    if next_page_token is None:
        # If "next_page_token" is None (i.e. if this is the request for the first page), we do not use "pageToken" as a query parameter...
        request = youtube.search().list(
            part="snippet",
            maxResults=50,
            publishedAfter="2024-11-11T00:00:00Z",
            publishedBefore="2024-11-22T00:00:00Z",
            order="viewCount",
            q="climate change | global warming",
            relevanceLanguage="en",
            type="video"
        )
        page_response = request.execute()
        search_results.append(page_response)
    else:
        # If it is not None, however, we pass "next_page_token" as the "pageToken" query parameter...
        request = youtube.search().list(
            part="snippet",
            maxResults=50,
            publishedAfter="2024-11-11T00:00:00Z",
            publishedBefore="2024-11-22T00:00:00Z",
            order="viewCount",
            q="climate change | global warming",
            relevanceLanguage="en",
            type="video",
            pageToken=next_page_token
        )
        page_response = request.execute()
        search_results.append(page_response)

    # Try to retrieve the "nextPageToken" if there is one.
    try:
        next_page_token = page_response["nextPageToken"]

    # If the response does not have a "nextPageToken" field, we simply break out of the loop
    except KeyError:
        break
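The loop above repeats the same request twice, once with and once without `pageToken`. As a minimal sketch of the same idea, the pagination logic can be separated from the request itself. `collect_pages` and `fake_fetch` are hypothetical names, and the fake pages stand in for real API responses so that the example runs without a key or network access.

```python
def collect_pages(fetch_page, max_pages):
    """Call fetch_page(page_token) repeatedly, following "nextPageToken",
    and return at most max_pages pages."""
    pages = []
    page_token = None  # None means "first page"
    for _ in range(max_pages):
        page = fetch_page(page_token)
        pages.append(page)
        page_token = page.get("nextPageToken")
        if page_token is None:
            break  # No further page is available
    return pages

# Made-up responses keyed by the token that requests them.
FAKE_PAGES = {
    None: {"items": ["a", "b"], "nextPageToken": "p2"},
    "p2": {"items": ["c"], "nextPageToken": "p3"},
    "p3": {"items": ["d"]},
}

def fake_fetch(page_token):
    """Stand-in for a function that builds and executes an API request."""
    return FAKE_PAGES[page_token]

print(len(collect_pages(fake_fetch, max_pages=2)))
print(len(collect_pages(fake_fetch, max_pages=10)))
```

With the real client, `fetch_page` could be a small function that builds the `youtube.search().list(...)` request, adds `pageToken` only when a token is given, and returns the result of `.execute()`.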

### Retrieving comment threads from videos

Let's test the comment query for just one video. Here, we will use the very first video returned by the search query:

In [None]:
video_id = search_results[0]["items"][0]["id"]["videoId"]
video_id

In [None]:
request = youtube.commentThreads().list(
    part="snippet,id,replies",
    maxResults=100,
    order="time",
    videoId=video_id
)
comment_response = request.execute()

The response to this query contains the following keys:

In [None]:
comment_response.keys()

If the keys above do not include "nextPageToken", all of the video's comment threads fit on a single page, which means that the video has at most 100 comment threads. We can verify this by counting the threads in the response:

In [None]:
len(comment_response["items"])

Having tested the comment collection query for a single video, let's create a for loop to collect the comments for all the videos returned by the search queries and stored in the "search_results" variable. To do this, we first need to retrieve all the video IDs from the results in "search_results":

In [None]:
ids_list = list()

for result in search_results:
    for item in result["items"]:
        ids_list.append(item["id"]["videoId"])

Having retrieved the video IDs, we can now run the for loop to query the YouTube Data API for comments. The for loop should be able to handle cases where a video has more than one page of comments (i.e. more than 100 comment threads):

In [None]:
comment_results = dict() # We create an empty dictionary to store the comment query results

# Let's iterate over the video IDs. The "[:3]" slice limits this demonstration to the first 3 videos; remove it to process them all.
for id in tqdm(ids_list[:3]):

    # We initialise the comment results for this particular video ID to be an empty list
    comment_results[id] = list()

    # Try to retrieve the first page of comments for the video
    try:
        request = youtube.commentThreads().list(
            part="snippet,id,replies",
            maxResults=100,
            order="time",
            videoId=id
        )
        comment_response = request.execute()
        comment_results[id].append(comment_response)

    # Some videos might have comments disabled. If this is the case, these lines of code will catch the error and simply move on to the next video.
    except Exception as e:
        print(id, e)
        continue  # The "continue" command will skip the rest of the code in this iteration of the loop

    # Try to retrieve the "nextPageToken" if there is one.
    try:
        nextPageToken = comment_response["nextPageToken"]

    # If the response does not have a "nextPageToken" field, the loop simply moves on to the next video
    except KeyError:
        continue

    # Given that a value was found for "nextPageToken", let's retrieve the comments of the next page until a "nextPageToken" cannot be found
    while True:
        request = youtube.commentThreads().list(
            part="snippet,id,replies",
            maxResults=100,
            order="time",
            videoId=id,
            pageToken=nextPageToken
        )
        comment_response = request.execute()
        comment_results[id].append(comment_response)
        try:
            nextPageToken = comment_response["nextPageToken"]
        except KeyError:
            break

In [None]:
comment_results.keys()

## Calculate simple statistics to identify interesting videos

Let's compute for each video its number of comment threads, its total number of comments and the number of comments per thread:

In [None]:
stats_list = list()

for i, id in enumerate(comment_results):
    nb_threads = 0
    nb_comments = 0
    nb_comments_per_thread = None

    for result in comment_results[id]:
        nb_threads += len(result["items"])
        for item in result["items"]:
            nb_comments += 1
            if "replies" in item:
                nb_comments += len(item["replies"]["comments"])

    if nb_threads > 0:
        nb_comments_per_thread = (nb_comments/nb_threads)

    stats_list.append({"video_id": id, "nb_threads": nb_threads, "nb_comments": nb_comments, "nb_comments_per_thread": nb_comments_per_thread})

stats_df = pd.DataFrame(stats_list)
stats_df
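One caveat about these counts: the `replies` part of a comment thread may hold only some of a thread's replies, so counting the embedded replies can undercount busy threads. As an assumption to verify against the commentThreads documentation, each thread's `snippet` also carries a `totalReplyCount` field. The sketch below compares both counts on made-up data; `count_comments` is a hypothetical helper.

```python
def count_comments(pages):
    """Count threads, comments embedded in the responses, and comments
    according to the (assumed) "totalReplyCount" field."""
    nb_threads = 0
    nb_embedded = 0  # Top-level comments plus replies present in the response
    nb_reported = 0  # Top-level comments plus the reported reply totals
    for page in pages:
        for thread in page["items"]:
            nb_threads += 1
            nb_embedded += 1 + len(thread.get("replies", {}).get("comments", []))
            nb_reported += 1 + thread["snippet"].get("totalReplyCount", 0)
    return {"nb_threads": nb_threads, "nb_embedded": nb_embedded, "nb_reported": nb_reported}

# Made-up page: the first thread reports 7 replies but embeds only 5.
fake_comment_pages = [
    {"items": [
        {"snippet": {"totalReplyCount": 7}, "replies": {"comments": [{}] * 5}},
        {"snippet": {"totalReplyCount": 0}},
    ]}
]

print(count_comments(fake_comment_pages))
```

If the two counts differ for a video of interest, the missing replies would have to be fetched separately before drawing conclusions from the reply network.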

Let's sort the results by the number of comments:

In [None]:
stats_df.sort_values(by=['nb_comments'], ascending=False)

For the rest of this analysis, we'll simply focus on the video with the most comments:

In [None]:
rel_ID = stats_df.sort_values(by=['nb_comments'], ascending=False).iloc[0]["video_id"]
rel_ID

We can select the value in the "comment_results" dictionary (which holds all comments per video) for the video with the specified ID:

In [None]:
rel_comment_results = comment_results[rel_ID]

We will use this data in the next section to generate a social network visualisation.

## Social Network Analysis

We will use a version of the "comment_list_to_edge_list" function from the Reddit Notebook adapted to the format of data returned by the YouTube API to generate a Gephi CSV edge list:

In [None]:
def comment_list_to_edge_list(comments, include_parent = False):
    """
    A function that converts the comments on a YouTube video into a Gephi CSV edge list.

    The input list is expected to be a list for which each item is a response from the YouTube Data API for a comment thread query. Each item corresponds to a different page of results for the same video.

    The DataFrame returned is in a format so it can be written to a CSV edge list using Pandas' to_csv method.

    The edge list will include an edge from the author of a reply to the author of the comment that is being
    replied to.

    If you pass the optional parameter include_parent, the edge list will also include edges from the authors of
    top-level comments to the author of the video. In that case include_parent should be set to the
    user name of the youtuber who published the video.
    """

    user_pairs = list()

    # Let's iterate over the pages of results
    for page in comments:

        # Let's iterate over the comment threads in each page
        for thread in page["items"]:

            # Retrieve the name of the author of the top level comment in the present thread
            top_comment_author = thread["snippet"]["topLevelComment"]["snippet"]["authorDisplayName"]

            # If include_parent is not False, add an edge between the video author and the top level comment author
            if include_parent is not False:
                user_pairs.append({"author_parent": include_parent, "author_child": top_comment_author})

            # If the top level comment has replies, let's add an edge between each replier and the author of the top level comment
            if "replies" in thread:
                for comment in thread["replies"]["comments"]:
                    user_pairs.append({"author_parent": top_comment_author, "author_child": comment["snippet"]["authorDisplayName"]})

    # Convert the list of edges to a pandas DataFrame, so we can use Pandas' to_csv method later to save the list of edges
    edgelist = pd.DataFrame(user_pairs)

    return edgelist
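To see what kind of edges this produces, here is a minimal plain-Python sketch on a made-up page. It is a variant of the function above rather than a replacement: it additionally collapses repeated author pairs into a single edge with a count. The `weight` column name and the `weighted_edges` helper are assumptions made for this example.

```python
from collections import Counter

def weighted_edges(pages, video_author=None):
    """Build (parent, child) edges from comment pages and count repeated pairs."""
    counts = Counter()
    for page in pages:
        for thread in page["items"]:
            top_author = thread["snippet"]["topLevelComment"]["snippet"]["authorDisplayName"]
            # Optionally link each top level commenter to the video's author
            if video_author is not None:
                counts[(video_author, top_author)] += 1
            # Link each replier to the author of the top level comment
            for reply in thread.get("replies", {}).get("comments", []):
                counts[(top_author, reply["snippet"]["authorDisplayName"])] += 1
    return [
        {"author_parent": parent, "author_child": child, "weight": weight}
        for (parent, child), weight in sorted(counts.items())
    ]

# Made-up page: @ana's comment receives two replies from @ben; @ben also posts a top level comment.
fake_page = {"items": [
    {"snippet": {"topLevelComment": {"snippet": {"authorDisplayName": "@ana"}}},
     "replies": {"comments": [
         {"snippet": {"authorDisplayName": "@ben"}},
         {"snippet": {"authorDisplayName": "@ben"}},
     ]}},
    {"snippet": {"topLevelComment": {"snippet": {"authorDisplayName": "@ben"}}}},
]}

edges = weighted_edges([fake_page], video_author="Channel X")
for edge in edges:
    print(edge)
```

The list of dictionaries can be turned into a DataFrame with `pd.DataFrame(edges)` and saved with `to_csv` in the same way as the unweighted edge list.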

We want to include top level comments as replies to the video's author in our social network analysis. We therefore need the author name for the chosen video:

In [None]:
for result in search_results:
    for video in result["items"]:
        if video["id"]["videoId"] == rel_ID:
            video_author = video["snippet"]["channelTitle"]

We can then generate the list of edges and save it as a file:

In [None]:
edgelist = comment_list_to_edge_list(rel_comment_results, video_author)

In [None]:
edgelist

In [None]:
edgelist.to_csv('comments_' + rel_ID + '.csv', index = False)

## Save your results

In [None]:
with jsonlines.open("./search_results.jsonl", mode="w") as writer:
    for obj in search_results:
        writer.write(obj)

In [None]:
with jsonlines.open("./comment_results.jsonl", mode="w") as writer:
    for obj in [comment_results]:
        writer.write(obj)

If you want to read the results back from the saved files, here is how to do it:

In [None]:
search_results = list()
with jsonlines.open("./search_results.jsonl", mode="r") as reader:
    for obj in reader:
        search_results.append(obj)

In [None]:
comment_results = list()
with jsonlines.open("./comment_results.jsonl", mode="r") as reader:
    for obj in reader:
        comment_results.append(obj)

comment_results = comment_results[0]
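For reference, a JSON Lines file is simply a text file with one JSON object per line. The following sketch reproduces the write and read round trip with the standard library only, on made-up data and in a temporary directory, to show what the `jsonlines` package handles for us above.

```python
import json
import os
import tempfile

# Made-up records standing in for pages of API results.
records = [{"page": 1, "items": ["a", "b"]}, {"page": 2, "items": ["c"]}]
path = os.path.join(tempfile.mkdtemp(), "example.jsonl")

# Write one JSON object per line
with open(path, "w", encoding="utf-8") as f:
    for obj in records:
        f.write(json.dumps(obj) + "\n")

# Read the file back, parsing each non-empty line as JSON
with open(path, "r", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)
```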