* This Python notebook pulls YouTube data using the publicly available [YouTube Data API](https://developers.google.com/youtube/v3/docs).

* YouTube channels are selected for analysis, and for each channel the list of videos is pulled with video_id, title, description, published date, view count, like count and comment count.

* Comments are pulled for each video, along with the date each comment was posted and the number of likes it received.




Import required dependencies

In [None]:
import requests
import pandas as pd

An API key is needed to authorize requests and pull the data. It is generated from a Google account through Google Cloud.

**Caveat:** Each API key has a daily quota, and requests fail once that quota is exceeded. Refer to the [documentation](https://developers.google.com/youtube/v3/guides/quota_and_compliance_audits) for more information.
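
To make the quota caveat concrete, the sketch below estimates how many quota units a pull might consume. The unit costs (100 per `search` request, 1 per `videos` or `commentThreads` request) and the 10,000-unit default daily quota are assumptions based on the public quota documentation at the time of writing; verify the current values for your project. The function name `estimate_quota_units` is illustrative and not part of the notebook.

```python
import math

# Assumed costs per request, in quota units; verify against the current documentation.
SEARCH_COST = 100
VIDEOS_COST = 1
COMMENT_THREADS_COST = 1
DEFAULT_DAILY_QUOTA = 10_000  # assumed default allocation per project


def estimate_quota_units(num_videos, comments_per_video=0):
    """Rough quota estimate for listing videos and pulling their top-level comments."""
    # Listing videos: one search page plus one videos request per 50 videos.
    video_pages = math.ceil(num_videos / 50)
    listing_cost = video_pages * (SEARCH_COST + VIDEOS_COST)
    # Comments: one commentThreads page per 100 comments, at least one request per video.
    comment_pages = max(1, math.ceil(comments_per_video / 100))
    comments_cost = num_videos * comment_pages * COMMENT_THREADS_COST
    return listing_cost + comments_cost


units = estimate_quota_units(num_videos=500, comments_per_video=250)
print(units, units <= DEFAULT_DAILY_QUOTA)
```

Under these assumptions, listing videos through `search` is the most expensive step per request, while the comment pull costs the most in total for channels with many videos.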


To create a YouTube API key, you need a Google account, must be signed in to Google Cloud Platform (GCP), and must enable the YouTube Data API from the services list. Refer to the [linked walkthrough](https://colab.research.google.com/drive/1mvJXAdRaSCPMUcuqGCjb97lBMQhwRgrD?usp=sharing) to create the api_key.



In [2]:
api_key = '<place your api key>'
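
Hardcoding the key in a notebook makes it easy to leak when the notebook is shared. One possible alternative, shown as a minimal sketch, is to read it from an environment variable. The variable name `YOUTUBE_API_KEY` and the helper `load_api_key` are assumptions made for illustration.

```python
import os


def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Return the API key stored in an environment variable, or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key
```

With the variable set in the shell that launches the notebook, the assignment above could become `api_key = load_api_key()`.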

List of channels for research. Use this [link](https://www.streamweasels.com/tools/youtube-channel-id-and-user-id-convertor/) to get the channel_id for the respective YouTube channel.

In [None]:
channels = {
    'DeepLearning.AI': 'UCcIXc5mJsHVYTZR1maL5l9w',
    'StatQuest': 'UCtYLUTtgS3k1Fg4y5tAhLbw',
    'Sentdex': 'UCfzlCWGWYyIQ0aLC5w48gBQ',
    '3Blue1Brown': 'UCYO_jab_esuFRV4b17AJtAw',
    'Siraj Raval': 'UCWN3xxRkmTPmbKwht9FuE5A',
    'Two Minute Papers': 'UCbfYPyITQ-7l4upoX8nvctg',
    'Lexfridman': 'UCSHZKyawb77ixDdsGog4iWA'
}

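Define a function to pull the list of videos for each channel. It uses the endpoint https://www.googleapis.com/youtube/v3/search to list the video IDs page by page, then https://www.googleapis.com/youtube/v3/videos to fetch the details and statistics for each batch.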

In [None]:
def get_channel_videos(api_key, channel_id):
    videos_base_url = "https://www.googleapis.com/youtube/v3/search"
    next_page_token = None
    all_videos = []

    while True:
        videos_params = {
            'key': api_key,
            'channelId': channel_id,
            'part': 'snippet',
            'order': 'date',
            'maxResults': 50,
            'type': 'video',
            'pageToken': next_page_token
        }

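        videos_response = requests.get(videos_base_url, params=videos_params)
        # Fail with a clear HTTP error (e.g. quota exceeded) instead of a KeyError on 'items'
        videos_response.raise_for_status()
        videos_data = videos_response.json()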
        video_ids = [item['id']['videoId'] for item in videos_data['items']]

        # Fetch details for the current batch of video IDs
        if video_ids:
            details_base_url = "https://www.googleapis.com/youtube/v3/videos"
            details_params = {
                'key': api_key,
                'id': ','.join(video_ids),
                'part': 'snippet,statistics'
            }

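            details_response = requests.get(details_base_url, params=details_params)
            # Fail with a clear HTTP error instead of a KeyError on 'items'
            details_response.raise_for_status()
            details_data = details_response.json()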

            for item in details_data['items']:
                all_videos.append({
                    'channel_id': item['snippet']['channelId'],
                    'channel_name': item['snippet']['channelTitle'],
                    'video_id': item['id'],
                    'title': item['snippet']['title'],
                    'description': item['snippet']['description'],
                    'published_at': item['snippet']['publishedAt'],
                    'view_count': item['statistics'].get('viewCount', '0'),
                    'like_count': item['statistics'].get('likeCount', '0'),
                    'comment_count': item['statistics'].get('commentCount', '0')
                })

        next_page_token = videos_data.get('nextPageToken')
        if not next_page_token:
            break

    return all_videos
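
The function above lists videos through the `search` endpoint. A possible alternative, sketched below, is to read the channel's uploads playlist through the `channels` and `playlistItems` endpoints and then pass the resulting IDs to the `videos` endpoint as before. The helper names `get_upload_video_ids` and `_get_json` and the `fetch_json` parameter are hypothetical; the default fetcher uses the standard library only to keep the sketch dependency-free, and `requests.get` could be swapped in.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://www.googleapis.com/youtube/v3"


def _get_json(url, params):
    """Default fetcher: GET the URL with query parameters and decode the JSON body."""
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen(f"{url}?{query}", timeout=30) as response:
        return json.load(response)


def get_upload_video_ids(api_key, channel_id, fetch_json=_get_json):
    """Collect video IDs from a channel's uploads playlist (hypothetical helper)."""
    # Step 1: look up the ID of the playlist that holds the channel's uploads.
    channel_data = fetch_json(f"{API_ROOT}/channels", {
        "key": api_key, "id": channel_id, "part": "contentDetails",
    })
    items = channel_data.get("items", [])
    if not items:
        return []
    playlist_id = items[0]["contentDetails"]["relatedPlaylists"]["uploads"]

    # Step 2: page through the playlist, 50 items at a time.
    video_ids, page_token = [], None
    while True:
        params = {"key": api_key, "playlistId": playlist_id,
                  "part": "contentDetails", "maxResults": 50}
        if page_token:
            params["pageToken"] = page_token
        page = fetch_json(f"{API_ROOT}/playlistItems", params)
        video_ids.extend(item["contentDetails"]["videoId"]
                         for item in page.get("items", []))
        page_token = page.get("nextPageToken")
        if not page_token:
            return video_ids
```

Passing the fetcher as an argument keeps the paging logic testable without network access.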


Fetch the list of videos for every channel in a loop and store it in a CSV file, youtube_channel_videos_details.csv.

In [None]:
all_videos = []
for channel_name, channel_id in channels.items():
    print(f"Fetching videos for channel: {channel_name}")
    videos = get_channel_videos(api_key, channel_id)
    all_videos.extend(videos)

# Convert list of videos to DataFrame
videos_df = pd.DataFrame(all_videos)

# Optionally, save to CSV
videos_df.to_csv('youtube_channel_videos_details.csv', index=False)
print("Data extraction complete. Check the CSV file.")

Fetching videos for channel: DeepLearning.AI
Fetching videos for channel: StatQuest
Fetching videos for channel: Sentdex
Fetching videos for channel: 3Blue1Brown
Fetching videos for channel: Siraj Raval
Fetching videos for channel: Two Minute Papers
Fetching videos for channel: Lexfridman
Data extraction complete. Check the CSV file.
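
The API returns the counts as strings and the publication date as ISO 8601 text, so these columns stay textual until they are converted. A minimal sketch of one way to convert them before analysis follows; `clean_videos_df` is a hypothetical helper and the sample rows are made up to mimic the structure of `videos_df`.

```python
import pandas as pd


def clean_videos_df(df):
    """Return a copy with integer count columns and timezone-aware timestamps."""
    cleaned = df.copy()
    for column in ("view_count", "like_count", "comment_count"):
        # Counts arrive as strings; values that cannot be parsed become 0.
        cleaned[column] = (
            pd.to_numeric(cleaned[column], errors="coerce").fillna(0).astype("int64")
        )
    # Timestamps such as '2024-01-31T12:00:00Z' are parsed as UTC.
    cleaned["published_at"] = pd.to_datetime(
        cleaned["published_at"], utc=True, errors="coerce"
    )
    return cleaned


# Made-up rows that mimic the structure of videos_df.
sample = pd.DataFrame([
    {"video_id": "a1", "published_at": "2024-01-31T12:00:00Z",
     "view_count": "1200", "like_count": "35", "comment_count": "0"},
    {"video_id": "b2", "published_at": "2023-06-15T08:30:00Z",
     "view_count": "87", "like_count": "4", "comment_count": "2"},
])
print(clean_videos_df(sample).dtypes)
```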


Define a function to pull the comments for each video. It uses the endpoint https://www.googleapis.com/youtube/v3/commentThreads

In [None]:
def get_video_comments(api_key, video_id):
    comments_base_url = "https://www.googleapis.com/youtube/v3/commentThreads"
    comments_params = {
        'key': api_key,
        'videoId': video_id,
        'part': 'snippet',
        'maxResults': 100,
        'textFormat': 'plainText'
    }
    all_comments = []
    next_page_token = None

    while True:
        if next_page_token:
            comments_params['pageToken'] = next_page_token

        response = requests.get(comments_base_url, params=comments_params)
        if response.status_code == 403:
            print("403 - Access Forbidden. API key or quota may be exhausted.")
            return all_comments, False  # Return collected comments and False indicating failure due to API limits.

        data = response.json()

        for item in data['items']:
            top_level_comment = item['snippet']['topLevelComment']
            top_level_comment_snippet = top_level_comment['snippet']
            all_comments.append({
                'video_id': video_id,
                'comment_id': top_level_comment['id'],
                'text': top_level_comment_snippet['textDisplay'],
                'author': top_level_comment_snippet['authorDisplayName'],
                'like_count': top_level_comment_snippet['likeCount'],
                'published_at': top_level_comment_snippet['publishedAt']
            })

        next_page_token = data.get('nextPageToken')
        if not next_page_token:
            break

    return all_comments, True
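
A 403 response does not always mean the quota is exhausted; the API can also return 403 when comments are disabled on a video. The sketch below inspects the error body to tell the two cases apart. The reason strings `commentsDisabled` and `quotaExceeded` and the shape of the error body are assumptions based on the API's documented error format, and `classify_403` is a hypothetical helper that the function above does not call.

```python
def classify_403(error_body):
    """Map a decoded 403 error body to 'skip_video', 'stop', or 'unknown'."""
    errors = error_body.get("error", {}).get("errors", [])
    reasons = {entry.get("reason") for entry in errors}
    if "commentsDisabled" in reasons:
        return "skip_video"  # only this video is affected; later videos may still work
    if "quotaExceeded" in reasons:
        return "stop"        # further requests will fail until the quota resets
    return "unknown"


# Made-up error bodies in the assumed format.
disabled = {"error": {"code": 403, "errors": [{"reason": "commentsDisabled"}]}}
exhausted = {"error": {"code": 403, "errors": [{"reason": "quotaExceeded"}]}}
print(classify_403(disabled), classify_403(exhausted))
```

Inside `get_video_comments`, the result of `classify_403(response.json())` could decide whether to return `all_comments, True` (skip the video) or `all_comments, False` (stop the run).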

Initialize the variables used to pull the data and to record where the API quota runs out.

In [None]:
all_comments = []
successful = True
last_index = 0

Run the function above to pull the comments. The loop stops when the quota is exceeded. With minimal effort, by adjusting the index to start from, the rest of the data can be pulled the next day.

In [None]:
for index, video_id in enumerate(videos_df['video_id']):
    print(f"Fetching comments for video ID: {video_id}")
    video_comments, successful = get_video_comments(api_key, video_id)
    all_comments.extend(video_comments)
    if not successful:
        last_index = index
        break

# Convert list of comments to DataFrame
comments_df = pd.DataFrame(all_comments)
# Optionally, save to CSV
comments_df.to_csv('youtube_video_comments.csv', index=False)
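# Report the stop index only when the loop actually stopped early
if successful:
    print("Comments data extraction complete. Check the CSV file for data.")
else:
    print(f"Comments data extraction stopped at index {last_index}. Check the CSV file for data.")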

Fetching comments for video ID: kk6rh45r0fo
403 - Access Forbidden. API key or quota may be exhausted.
Comments data extraction stopped at index 0. Check the CSV file for data.
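
Resuming on the next day means starting the loop at the index where the previous run stopped. A minimal sketch of that idea is shown below; `pull_comments_from` is a hypothetical helper, and `fetch` stands for any callable that returns `(comments, ok)` in the same way as `get_video_comments`. The demo uses a fake fetcher with made-up IDs so that it runs without an API key.

```python
def pull_comments_from(video_ids, start_index, fetch):
    """Fetch comments starting at start_index; return (comments, next_index).

    `fetch` takes a video ID and returns (comments, ok), for example
    lambda vid: get_video_comments(api_key, vid).
    """
    collected = []
    for index in range(start_index, len(video_ids)):
        comments, ok = fetch(video_ids[index])
        if not ok:
            # Drop the partial batch so the retry does not create duplicates.
            return collected, index
        collected.extend(comments)
    return collected, len(video_ids)  # every video was processed


def fake_fetch(video_id):
    """Stand-in fetcher that fails on 'v3' to simulate an exhausted quota."""
    if video_id == "v3":
        return [], False
    return [{"video_id": video_id, "text": "example"}], True


comments, next_index = pull_comments_from(["v1", "v2", "v3", "v4"], 0, fake_fetch)
print(len(comments), next_index)
```

Saving `next_index` alongside the CSV and passing it as `start_index` on the next run continues from the video that failed.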


The comments data is stored in the CSV file youtube_video_comments.csv.