# YouTube Comments via YouTube API

This notebook uses Google's YouTube Data API to build a dataset of YouTube comments from Japanese-language videos that are trending in Japan. It draws on up to 250 videos from the region's most popular chart and keeps only those whose metadata lists Japanese as the default or audio language, which raises the odds that the comments are in Japanese too. For each video, up to 100 top-level comments are retrieved, along with the replies the API includes for each thread.

In [1]:
import os
import pandas as pd
from dotenv import load_dotenv
from googleapiclient.discovery import build
from pathlib import Path
from time import sleep

In [2]:
# Load in credentials from environment variables
load_dotenv()
API_KEY = os.getenv('API_KEY')

# Initialize API client
youtube = build(
    'youtube', 'v3', developerKey=API_KEY
)

## Retrieve comments from YouTube

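commentThreads().list() with part='snippet,replies' returns the top-level comments for a specified videoId together with replies to each one. The replies part is not guaranteed to hold every reply in a thread; retrieving all of them requires comments().list().
- order='relevance' ranks comments by relevance, which tends to surface comments with many likes that are more likely to have replies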

In [3]:
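def retrieve_comments(video_id, max_results=10):
    """Return one page of comment threads for a video, or None if the request fails."""

    # Build the request for top-level comments plus the replies included with them
    request = youtube.commentThreads().list(
        part='snippet,replies',
        maxResults=max_results,
        order='relevance',
        videoId=video_id
    )
    try:
        return request.execute()
    except Exception:
        # The request can fail, e.g. when comments are disabled on the video;
        # skip that video instead of stopping the whole run
        return None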

In [None]:
# For testing
# test_response = retrieve_comments(video_id='4V0UAhe8o5c')
# test_response
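
Because a comment thread response may carry only part of a thread's replies, complete threads need a second call per top-level comment. The sketch below assumes the comments().list() endpoint with its parentId parameter and follows nextPageToken until it runs out. The name fetch_all_replies is hypothetical and not part of this notebook, and each call spends additional API quota.

```python
def fetch_all_replies(youtube, parent_id, page_size=100):
    """Collect every reply to one top-level comment, following page tokens.

    `youtube` is an API client such as the one built in In [2];
    `parent_id` is the ID of the top-level comment.
    """
    replies = []
    page_token = None

    while True:
        params = {
            'part': 'snippet',
            'parentId': parent_id,
            'maxResults': page_size,
        }
        # Only send pageToken once the API has handed one back
        if page_token:
            params['pageToken'] = page_token

        response = youtube.comments().list(**params).execute()
        replies.extend(response.get('items', []))

        page_token = response.get('nextPageToken')
        if not page_token:
            return replies
```

The parent ID for a thread can be read from item['snippet']['topLevelComment']['id'] in a commentThreads response.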

## Writing comment data to a Pandas DF

extract_info takes a response from an API call and a dictionary containing information about the video. It returns a DataFrame built from a list of dictionaries, one per comment or reply, each containing:

* channel - the channel's title
* video_id - YouTube's unique identifier of the video
* category_id - YouTube ID that classifies what the content of the video is about
* text - raw text of the comment as it was initially posted or last updated
* date_published - timestamp of when the comment was originally published
* comment_type - either top-level or reply

In [4]:
def extract_info(response, video):
    
    comment_data = []
    
    for item in response['items']:
        
        # Grab the top-level comment first
        comment_data.append({
            'channel': video['channel'],
            'video_id': video['video_id'],
            'category_id': video['category_id'],
            'text': item['snippet']['topLevelComment']['snippet']['textOriginal'],
            'date_published': item['snippet']['topLevelComment']['snippet']['publishedAt'],
            'comment_type': 'top-level'
        })
        
        # Check if there are replies and get same info if there are
        if 'replies' in item.keys():

            for reply in item['replies']['comments']:

                comment_data.append({
                    'channel': video['channel'],
                    'video_id': video['video_id'],
                    'category_id': video['category_id'],
                    'text': reply['snippet']['textOriginal'],
                    'date_published': reply['snippet']['publishedAt'],
                    'comment_type': 'reply'
                })
        
    return pd.DataFrame(comment_data)
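
The nesting that extract_info walks is easier to see on a small hand-built response. The sketch below is illustrative only: the comment texts are made up, and iter_comment_rows is a hypothetical helper that reads the same keys but yields plain tuples instead of building a DataFrame.

```python
# A hand-built response with the same nesting as a commentThreads result
sample_response = {
    'items': [
        {
            'snippet': {
                'topLevelComment': {
                    'snippet': {
                        'textOriginal': 'first comment',
                        'publishedAt': '2023-04-16T15:39:10Z',
                    }
                }
            },
            # 'replies' is only present when the thread has replies
            'replies': {
                'comments': [
                    {
                        'snippet': {
                            'textOriginal': 'a reply',
                            'publishedAt': '2023-04-17T16:41:52Z',
                        }
                    }
                ]
            },
        },
        {
            'snippet': {
                'topLevelComment': {
                    'snippet': {
                        'textOriginal': 'second comment',
                        'publishedAt': '2023-04-18T09:00:00Z',
                    }
                }
            }
        },
    ]
}


def iter_comment_rows(response):
    """Yield (comment_type, text, date_published) for each comment and reply."""
    for item in response.get('items', []):
        top = item['snippet']['topLevelComment']['snippet']
        yield 'top-level', top['textOriginal'], top['publishedAt']

        # .get() with defaults covers threads that have no 'replies' key
        for reply in item.get('replies', {}).get('comments', []):
            snippet = reply['snippet']
            yield 'reply', snippet['textOriginal'], snippet['publishedAt']


rows = list(iter_comment_rows(sample_response))
```

Each top-level comment is followed directly by its replies, which matches the row order extract_info produces.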

## Retrieve popular videos

The videos().list() method with chart='mostPopular' returns a resource with information about the currently trending videos within a region.

Each response holds a limited number of videos, but it also includes a token that retrieves the next page of results. The function accepts this token as a parameter, so calling it in a loop retrieves every page of results, up to the maximum available.
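
The token-passing pattern described above can be sketched as a small generator. This is a minimal illustration and not the notebook's own code: iter_pages and its fetch_page and max_pages parameters are hypothetical names, and fetch_page stands for any function that takes a page token and returns a response dictionary.

```python
def iter_pages(fetch_page, max_pages=None):
    """Yield successive API responses until no nextPageToken is returned.

    fetch_page(token) is called with None for the first page and with the
    previous response's nextPageToken afterwards.
    """
    token = None
    pages = 0

    while True:
        response = fetch_page(token)
        yield response
        pages += 1

        # The last page has no nextPageToken, so .get() returns None
        token = response.get('nextPageToken')
        if not token or (max_pages is not None and pages >= max_pages):
            break
```

With the functions defined in this notebook, a call could look like `for page in iter_pages(lambda t: retrieve_videos(t or '')): list_videos(page)`. Using .get() means the last page is processed like any other, without relying on an exception to end the loop.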

In [5]:
def retrieve_videos(next_page=''):
    request = youtube.videos().list(
        part='snippet',
        chart='mostPopular',
        maxResults=50,  # the API documents 1-50 as the accepted range per page
        pageToken=next_page,
        regionCode='JP',
    )
    response = request.execute()

    return response

In [6]:
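def list_videos(response):
    """Append the Japanese-language videos in a response to the global videos list."""

    for item in response['items']:

        snippet = item['snippet']

        # Keep videos whose default text or audio language is Japanese
        if (snippet.get('defaultLanguage') == 'ja' or
            snippet.get('defaultAudioLanguage') == 'ja'):

            videos.append({
                'video_id': item['id'],
                'channel': snippet['channelTitle'],
                'category_id': snippet['categoryId']
            })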
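
The check in list_videos compares each language field against the exact string 'ja'. If a video carries a region-qualified tag such as 'ja-JP' (an assumption; the output in this notebook does not show the language fields), an exact match would skip it. A more tolerant check might look like the sketch below, where is_japanese is a hypothetical helper that compares only the primary language subtag.

```python
def is_japanese(snippet):
    """Return True if either language field has 'ja' as its primary subtag."""
    for key in ('defaultLanguage', 'defaultAudioLanguage'):
        # Missing or None fields become '' so the string methods are safe
        tag = snippet.get(key) or ''
        if tag.lower().split('-')[0] == 'ja':
            return True
    return False
```

Whether the looser match is desirable depends on the dataset: it admits more videos, but none that are tagged with another language.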

In [7]:
# Make list
videos = []

# Run once to generate first page
response = retrieve_videos('')
list_videos(response)
next_page = response.get('nextPageToken')

# Loop until there is no nextPageToken; the last page omits the key,
# so .get() returns None and ends the loop after that page is processed
while next_page:
    response = retrieve_videos(next_page)
    list_videos(response)
    next_page = response.get('nextPageToken')

In [10]:
# Check list of videos
videos

[{'video_id': 'Jb6Zlg30rgk', 'channel': 'mwamjapan', 'category_id': '10'},
 {'video_id': 'JOFXeA2h3tY',
  'channel': 'Pekora Ch. 兎田ぺこら',
  'category_id': '20'},
 {'video_id': 'pjPXzJedwiA', 'channel': 'ぱかチューブっ!', 'category_id': '20'},
 {'video_id': 'CrtIyaMX0hY', 'channel': 'JRA公式チャンネル', 'category_id': '17'},
 {'video_id': 'LhcMEYfuHZ0', 'channel': 'Top J Records', 'category_id': '10'},
 {'video_id': 'wyF_AXj2RfY',
  'channel': 'エガちゃんねる 〜替えのパンツ〜',
  'category_id': '23'},
 {'video_id': '9tazyKRmaZw', 'channel': 'トレバー・バウアー', 'category_id': '17'},
 {'video_id': 'JIZ_5BTM43Y',
  'channel': 'eFootball チャンネル',
  'category_id': '17'},
 {'video_id': 'dYzgZVhiqmU',
  'channel': 'millennium parade Official YouTube Channel',
  'category_id': '10'},
 {'video_id': 'XAy6jznoWYE', 'channel': '渋谷ハル', 'category_id': '20'},
 {'video_id': 'l9yeGeYLf3w', 'channel': 'JRA公式チャンネル', 'category_id': '17'},
 {'video_id': '4fx_0IRdvSc',
  'channel': 'SUPER GT Official Channel',
  'category_id': '17'},
 {'video_id

In [11]:
# Initialize a DataFrame to store data
comments_df = pd.DataFrame(columns=['channel',
                                    'video_id',
                                    'category_id',
                                    'text',
                                    'date_published',
                                    'comment_type'])

In [12]:
# Loop through videos list       
for video in videos:
    
    # Query the YouTube Data API
    response = retrieve_comments(video['video_id'], max_results=100)
    sleep(3)

    # Add the data from the response to the DF
    if response:
        data = extract_info(response, video)
        comments_df = pd.concat([comments_df, data])

In [13]:
# Check DF
comments_df

| | channel | video_id | category_id | text | date_published | comment_type |
|---|---|---|---|---|---|---|
| 0 | mwamjapan | Jb6Zlg30rgk | 10 | This season is going to be a masterpiece <3 | 2023-04-16T15:39:10Z | top-level |
| 1 | mwamjapan | Jb6Zlg30rgk | 10 | Every season is masterpiece 🔥🔥🔥 | 2023-04-17T16:41:52Z | reply |
| 2 | mwamjapan | Jb6Zlg30rgk | 10 | @HM cry about it | 2023-04-17T16:15:19Z | reply |
| 3 | mwamjapan | Jb6Zlg30rgk | 10 | @HM dude demon slayer has no story but it has ... | 2023-04-17T15:44:29Z | reply |
| 4 | mwamjapan | Jb6Zlg30rgk | 10 | @HM 🙂🙃😒😒😒 | 2023-04-17T13:44:05Z | reply |
| ... | ... | ... | ... | ... | ... | ... |
| 139 | 細川バレンタイン / 前向き教室 | wtbVhUgMS4M | 17 | @R \n\n半グレは、逆に受けた！ | 2023-04-09T05:07:38Z | reply |
| 140 | 細川バレンタイン / 前向き教室 | wtbVhUgMS4M | 17 | @細川バレンタイン / 前向き教室 いやいや見た目半グレって言われてますよw喜ばないでくださいw | 2023-04-09T04:26:00Z | reply |
| 141 | 細川バレンタイン / 前向き教室 | wtbVhUgMS4M | 17 | ありがとうございます😊\n\n嬉しい😎 | 2023-04-08T18:23:04Z | reply |
| 142 | 細川バレンタイン / 前向き教室 | wtbVhUgMS4M | 17 | デビュー戦なのにKOできなかったから期待外れっていう人の気持ちがわからない\n技術とセンスを... | 2023-04-08T21:42:22Z | top-level |


In [14]:
# Export the comment data to a CSV file
output_path = Path('Resources/youtube_comments.csv')
output_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
comments_df.to_csv(output_path)