# Example: Scrape YouTube data (API)

Using the API is legal, fast, and easy, and it returns complete information.

**Apply for a YouTube Data API v3 key:**

1. Create a Google account (skip this step if you already have one).
2. Open the Google Cloud console: https://console.cloud.google.com
3. Click the dropdown menu on the top left beside "Google Cloud". In the popup window click `NEW PROJECT` to create a project (you don't have to select the organization).
4. Click `APIs & Services` from the side menu.
5. Click `Enable APIs and Services` (on the top menu with a + ).
6. Scroll down to find `YouTube Data API v3`.
7. Click `Enable`.
8. You will be redirected to a new page. Click `Credentials` in the left side menu.
9. You can always find your API key on the Credentials page.
10. If you don't see your API key there, click `+ Create Credentials` on the Credentials page and select `API key` to generate a new one. Copy this key into this Jupyter Notebook.
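
Instead of pasting the key directly into a cell, one option is to read it from an environment variable so that it does not end up in a shared notebook. The following is a minimal sketch; the variable name `YOUTUBE_API_KEY` and the helper `load_api_key` are assumptions made for illustration and are not part of the original notebook.

```python
import os


def load_api_key(env=None, var_name="YOUTUBE_API_KEY"):
    """Return the API key stored in an environment variable.

    `var_name` is a hypothetical name chosen for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable to your API key first.")
    return key


# Usage (after setting the variable in your shell or notebook):
# API_KEY = load_api_key()
```

The function raises an error when the variable is missing, which is usually easier to diagnose than a failed API request later on.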

## Example 1: Scrape YouTube Comments

Install packages and import libraries

! pip install urllib3

conda install -c anaconda urllib3 

https://anaconda.org/anaconda/urllib3



! pip install google_auth_oauthlib

conda install -c conda-forge google-auth-oauthlib

https://anaconda.org/conda-forge/google-auth-oauthlib



! pip install google-api-python-client

conda install -c conda-forge google-api-python-client

https://anaconda.org/conda-forge/google-api-python-client

In [12]:
# Libraries
import pandas as pd
import time
import json
from urllib.request import urlopen
from googleapiclient.discovery import build

**To create the url (video id list) from your search on YouTube:**

1. Open https://developers.google.com/youtube/v3/docs/search/list
2. Maximize your window size.
3. On the right side, there is a `Try this method` window panel.
4. Click the maximize button "[]" (the square next to the close icon) to expand the panel.
5. Set your search conditions
    - q: for search keywords or phrases
6. Once you have set up the conditions, click `EXECUTE`.
7. If the console at the bottom right displays `200` with a green bar, the search succeeded and your url has been generated.
8. Copy the url that starts with "https://" from the top-right window into the following cell.
    - For example: https://youtube.googleapis.com/youtube/v3/search?maxResults=50&q=data%20science&key=[YOUR_API_KEY]
9. Replace [YOUR_API_KEY] with your API key, removing the brackets.
    - For example: https://youtube.googleapis.com/youtube/v3/search?maxResults=50&q=data%20science&key=AIzaSyD8eJx2sUHuN6n3AGn15bmKfcEtLQ67YT0
10. To check the url, open it in a browser window and view the JSON data.
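
The same url can also be assembled in code rather than copied from the API explorer. The sketch below uses only the standard library; the helper name `build_search_url` is an assumption for illustration, and only the three parameters from the example url above (`maxResults`, `q`, `key`) are included.

```python
from urllib.parse import urlencode, quote

BASE_URL = "https://youtube.googleapis.com/youtube/v3/search"


def build_search_url(query, api_key, max_results=50):
    """Build a search request url like the one copied from the API explorer."""
    params = {"maxResults": max_results, "q": query, "key": api_key}
    # quote_via=quote encodes spaces as %20, matching the example url
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


# YOUR_API_KEY is a placeholder; substitute your own key
print(build_search_url("data science", "YOUR_API_KEY"))
```

Building the url this way makes it easier to change the search phrase without revisiting the documentation page.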

In [50]:
# get the url from https://developers.google.com/youtube/v3/docs/search/list?apix=true&apix_params=%7B%22maxResults%22%3A50%2C%22order%22%3A%22date%22%2C%22q%22%3A%22hurricane%20ian%22%7D#parameters
# Try this API using parameters
url = "https://youtube.googleapis.com/youtube/v3/search?maxResults=50&q=data%20science&key=AIzaSyD8eJx2sUHuN6n3AGn15bmKfcEtLQ67YT0"
  
# store the response of URL
response = urlopen(url)
  
# storing the JSON response 
# from url in data
data_json = json.loads(response.read())

# print the json response
print(json.dumps(data_json, indent=4))

{
    "kind": "youtube#searchListResponse",
    "etag": "3PIGvXcOtJ2bNotRJN7sVMORe5w",
    "nextPageToken": "CDIQAA",
    "regionCode": "US",
    "pageInfo": {
        "totalResults": 1000000,
        "resultsPerPage": 50
    },
    "items": [
        {
            "kind": "youtube#searchResult",
            "etag": "I8EhfFKbhot4XoKJvGSbQgJlUlg",
            "id": {
                "kind": "youtube#video",
                "videoId": "X3paOmcrTjQ"
            }
        },
        {
            "kind": "youtube#searchResult",
            "etag": "zkeluh9ioQpbbPSJY6-wmeP7CxI",
            "id": {
                "kind": "youtube#video",
                "videoId": "RBSUwFGa6Fk"
            }
        },
        {
            "kind": "youtube#searchResult",
            "etag": "PiU4CJVbEt7PiwO9usat1JzYjvE",
            "id": {
                "kind": "youtube#video",
                "videoId": "xC-c7E5PK0Y"
            }
        },
        {
            "kind": "youtube#searchResult",
      

In [53]:
# find out the video id and store the etags as well
youtubedf = pd.json_normalize(data_json, 'items', errors='ignore')
youtubedf = youtubedf.drop(columns=['kind', 'id.kind'])
youtubedf.head()

Unnamed: 0,etag,id.videoId,id.channelId,id.playlistId
0,I8EhfFKbhot4XoKJvGSbQgJlUlg,X3paOmcrTjQ,,
1,zkeluh9ioQpbbPSJY6-wmeP7CxI,RBSUwFGa6Fk,,
2,PiU4CJVbEt7PiwO9usat1JzYjvE,xC-c7E5PK0Y,,
3,DXv1-NdL7T2HPXRfHfIXlRwyf3A,ua-CiDNNj30,,
4,g_82o1rsr6dWvEJZqkh-7_CUntQ,-ETQ97mXXF0,,


In [54]:
youtubedf.shape

(50, 4)
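
The `id.channelId` and `id.playlistId` columns exist because a search can return channels and playlists as well as videos, and those items carry no `videoId`. The sketch below keeps only the video results using plain Python; `extract_video_ids` is a hypothetical helper, and the channel entry in the sample is a made-up placeholder.

```python
def extract_video_ids(search_response):
    """Return the videoId of every search result that is a video."""
    video_ids = []
    for item in search_response.get("items", []):
        item_id = item.get("id", {})
        if item_id.get("kind") == "youtube#video" and "videoId" in item_id:
            video_ids.append(item_id["videoId"])
    return video_ids


# A hand-made sample in the same shape as the response above;
# the channel entry is a placeholder for illustration only.
sample = {
    "items": [
        {"kind": "youtube#searchResult", "id": {"kind": "youtube#video", "videoId": "X3paOmcrTjQ"}},
        {"kind": "youtube#searchResult", "id": {"kind": "youtube#channel", "channelId": "EXAMPLE_CHANNEL_ID"}},
        {"kind": "youtube#searchResult", "id": {"kind": "youtube#video", "videoId": "RBSUwFGa6Fk"}},
    ]
}
print(extract_video_ids(sample))  # ['X3paOmcrTjQ', 'RBSUwFGa6Fk']
```

With the DataFrame, a comparable filter would be `youtubedf["id.videoId"].dropna()`. Adding `type=video` to the search url should also restrict the results to videos.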

In [55]:
# Search results change from day to day and from request to request; each request returns at most 50 videos.
# Save the video IDs to a csv file for future use,
# so you can later look up the comments by the video IDs stored in the csv.
youtubedf.to_csv("DS1004YT_VID.csv")

In [17]:
# build the YouTube API client with your API key
API_KEY = "AIzaSyD8eJx2sUHuN6n3AGn15bmKfcEtLQ67YT0"
youtube = build("youtube", "v3", developerKey=API_KEY)

In [18]:
dir(youtube)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_add_basic_methods',
 '_add_nested_resources',
 '_add_next_methods',
 '_baseUrl',
 '_developerKey',
 '_dynamic_attrs',
 '_http',
 '_model',
 '_requestBuilder',
 '_resourceDesc',
 '_rootDesc',
 '_schema',
 '_set_dynamic_attr',
 '_set_service_methods',
 'abuseReports',
 'activities',
 'captions',
 'channelBanners',
 'channelSections',
 'channels',
 'close',
 'commentThreads',
 'comments',
 'i18nLanguages',
 'i18nRegions',
 'liveBroadcasts',
 'liveChatBans',
 'liveChatMessages',
 'liveChatModerators',
 'liveStreams',
 'members',
 'membershipsLevels',
 'new_batch_h

In [56]:
# read the videoId from the saved csv
YTdf = pd.read_csv("DS1004YT_VID.csv")
YTdf.head()

Unnamed: 0.1,Unnamed: 0,etag,id.videoId,id.channelId,id.playlistId
0,0,I8EhfFKbhot4XoKJvGSbQgJlUlg,X3paOmcrTjQ,,
1,1,zkeluh9ioQpbbPSJY6-wmeP7CxI,RBSUwFGa6Fk,,
2,2,PiU4CJVbEt7PiwO9usat1JzYjvE,xC-c7E5PK0Y,,
3,3,DXv1-NdL7T2HPXRfHfIXlRwyf3A,ua-CiDNNj30,,
4,4,g_82o1rsr6dWvEJZqkh-7_CUntQ,-ETQ97mXXF0,,


In [57]:
# get the YouTube comments for each video using youtube.commentThreads().list(part="snippet", videoId=vid)
# multiple nested json data
for vid in YTdf["id.videoId"][:1]:
    request = youtube.commentThreads().list(part="snippet", videoId=vid)
    response = request.execute()
    print(json.dumps(response, indent=4))

{
    "kind": "youtube#commentThreadListResponse",
    "etag": "2NRtafYLyKoyCTwaUfbGGtoIMdo",
    "nextPageToken": "QURTSl9pMGRnQVNYbFVXRTgwMEpYMndXMlgydl9fVnJwVmNFeW5NWFdodk5QaXJEY0kzeFRlekpZOUlBTVo0TnpHdHd4MVVCbnk0SFp5dWVjWTFOV0lFR0JxRkNINTYwRlE=",
    "pageInfo": {
        "totalResults": 20,
        "resultsPerPage": 20
    },
    "items": [
        {
            "kind": "youtube#commentThread",
            "etag": "y86llHFPHtjBdLgQ_ohH9gvtNBA",
            "id": "UgxvUkmSK9q2mnPHAuZ4AaABAg",
            "snippet": {
                "videoId": "X3paOmcrTjQ",
                "topLevelComment": {
                    "kind": "youtube#comment",
                    "etag": "DBB88ytp1Y3Cu9CCT1VCrPMa-YQ",
                    "id": "UgxvUkmSK9q2mnPHAuZ4AaABAg",
                    "snippet": {
                        "videoId": "X3paOmcrTjQ",
                        "textDisplay": "\ud83d\udd25Explore Our FREE Courses With Completion Certificate: <a href=\"https://www.youtube.com/watch?v=-

In [58]:
# inspect individual items of the nested JSON response one by one
for vid in YTdf["id.videoId"][:1]:
    request = youtube.commentThreads().list(part="snippet", videoId=vid)
    response = request.execute()
    
    print(json.dumps(response["items"][1], indent=4))
    print("**************************************************")
    print(json.dumps(response["items"][2], indent=4))
    print("**************************************************")
    print(json.dumps(response["items"][3], indent=4))

{
    "kind": "youtube#commentThread",
    "etag": "16M5IlA8kSMbdXJ2JSJWvWvpBXI",
    "id": "UgwIqSnkVJX2r5L8uTB4AaABAg",
    "snippet": {
        "videoId": "X3paOmcrTjQ",
        "topLevelComment": {
            "kind": "youtube#comment",
            "etag": "a3h0AUB3Ywh2dFT-TsSsxCsnNWY",
            "id": "UgwIqSnkVJX2r5L8uTB4AaABAg",
            "snippet": {
                "videoId": "X3paOmcrTjQ",
                "textDisplay": "<b>This was NOT helpful at all! I have no idea what a Data Scientist is..and you assumed people like myself would know the terms you used or how it is applied in business scenarios. I came here to learn..but you created a video for people who already know the basics of data science and how it is applied..which means this was NOT a video for beginners. This doesn&#39;t make me want to watch any more of your videos since you did such a poor job here!</b> \u2639\ud83d\udc4e",
                "textOriginal": "*This was NOT helpful at all! I have no idea what 

In [59]:
for vid in YTdf["id.videoId"][:1]:
    request = youtube.commentThreads().list(part="snippet", videoId=vid)
    response = request.execute()
    comments = response["items"]
    print(len(comments))  # number of comment threads returned on the first page for this video (20 per page by default)

20


In [60]:
# initialize the dictionary with lists
comments_info = {"video_id": [], "likeCount": [], "replyCount": [], "publishDate": [], "textOriginal": [],
                 "authorDisplayName": [], "authorChannelId": [], "authorChannelUrl": []}

# get data
for vid in YTdf["id.videoId"][:1]:
    try:
        request = youtube.commentThreads().list(part="snippet", videoId=vid)
        response = request.execute()
        comments = response["items"]

        for comment in comments:
            # thread-level fields
            snippet = comment["snippet"]
            video_id = snippet["videoId"]
            replyCount = snippet["totalReplyCount"]

            # fields of the top-level comment
            top_snippet = snippet["topLevelComment"]["snippet"]
            likeCount = top_snippet["likeCount"]
            publishDate = top_snippet["publishedAt"]
            textOriginal = top_snippet["textOriginal"]
            authorDisplayName = top_snippet["authorDisplayName"]
            authorChannelId = top_snippet["authorChannelId"]["value"]
            authorChannelUrl = top_snippet["authorChannelUrl"]

            # append only after every field has been read, so the lists stay the same length
            comments_info["video_id"].append(video_id)
            comments_info["likeCount"].append(likeCount)
            comments_info["replyCount"].append(replyCount)
            comments_info["publishDate"].append(publishDate)
            comments_info["textOriginal"].append(textOriginal)
            comments_info["authorDisplayName"].append(authorDisplayName)
            comments_info["authorChannelId"].append(authorChannelId)
            comments_info["authorChannelUrl"].append(authorChannelUrl)

            print(video_id, "success")

    except Exception as e:
        # report vid, because video_id is undefined if the request itself fails
        print(vid, "error occurred:", e)

    # pause between videos to avoid sending requests too quickly
    time.sleep(5)

X3paOmcrTjQ success
X3paOmcrTjQ success
X3paOmcrTjQ success
X3paOmcrTjQ success
X3paOmcrTjQ success
X3paOmcrTjQ success
X3paOmcrTjQ success
X3paOmcrTjQ success
X3paOmcrTjQ success
X3paOmcrTjQ success
X3paOmcrTjQ success
X3paOmcrTjQ success
X3paOmcrTjQ success
X3paOmcrTjQ success
X3paOmcrTjQ success
X3paOmcrTjQ success
X3paOmcrTjQ success
X3paOmcrTjQ success
X3paOmcrTjQ success
X3paOmcrTjQ success


In [61]:
YTcomments = pd.DataFrame(comments_info)
YTcomments.head()

Unnamed: 0,video_id,likeCount,replyCount,publishDate,textOriginal,authorDisplayName,authorChannelId,authorChannelUrl
0,X3paOmcrTjQ,43,4,2021-09-02T18:04:08Z,🔥Explore Our FREE Courses With Completion Cert...,Simplilearn,UCsvqVGtbbyHaMoevxPAq9Fg,http://www.youtube.com/channel/UCsvqVGtbbyHaMo...
1,X3paOmcrTjQ,1,0,2022-08-13T18:11:01Z,*This was NOT helpful at all! I have no idea w...,Nightbird,UCkzrrYMaT5O9s-76qvKRDxg,http://www.youtube.com/channel/UCkzrrYMaT5O9s-...
2,X3paOmcrTjQ,1,1,2022-08-11T18:25:04Z,I am thinking to take cse with data science in...,Sayan Mondal,UCtkuXL-WSFKYuvpWLa49bsg,http://www.youtube.com/channel/UCtkuXL-WSFKYuv...
3,X3paOmcrTjQ,0,1,2022-07-05T11:12:44Z,Is the salary monthly or annual?,aryan puri,UCHWEjQjQgPltdbTDVTO-qBA,http://www.youtube.com/channel/UCHWEjQjQgPltdb...
4,X3paOmcrTjQ,1,1,2022-07-03T09:15:02Z,"Captions, please. Thank you in advance!",Slow Down Cooking,UCFTFCadazGjRtJwh4neDsCg,http://www.youtube.com/channel/UCFTFCadazGjRtJ...


In [62]:
YTcomments.shape

(20, 8)

In [63]:
# Split the date
split_comment = YTcomments['publishDate'].astype(str).str.split("T")
YTcomments['PDate'] = split_comment.str[0]
YTcomments['PTime'] = split_comment.str[1] 
print(YTcomments['PDate'].head())
print(YTcomments['PTime'].head()) #Times are expressed in UTC (Coordinated Universal Time), with a special UTC designator ("Z").

0    2021-09-02
1    2022-08-13
2    2022-08-11
3    2022-07-05
4    2022-07-03
Name: PDate, dtype: object
0    18:04:08Z
1    18:11:01Z
2    18:25:04Z
3    11:12:44Z
4    09:15:02Z
Name: PTime, dtype: object
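
Splitting on "T" leaves both parts as strings. When dates need to be compared or sorted, it can help to parse the timestamp into a timezone-aware value instead. The sketch below uses only the standard library and assumes the timestamps follow the `YYYY-MM-DDTHH:MM:SSZ` pattern shown above; `parse_published_at` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_published_at(value):
    """Parse a timestamp such as '2021-09-02T18:04:08Z' into an aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    # the trailing "Z" means UTC, so attach that timezone explicitly
    return parsed.replace(tzinfo=timezone.utc)


published = parse_published_at("2021-09-02T18:04:08Z")
print(published.date(), published.time())  # 2021-09-02 18:04:08
```

For a whole column, `pd.to_datetime(YTcomments["publishDate"], utc=True)` is a comparable pandas option.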


In [None]:
YTcomments.to_csv("YTcomments_example.csv")

## Action 1: Get all the comments for the 50 videos and store the data in a pandas DataFrame
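
Each request in Example 1 returns only one page of comment threads (20 in the output above), so collecting all comments means following `nextPageToken` until it no longer appears in the response. The sketch below is one possible starting point, not a complete solution. The helper name `fetch_all_comment_threads` and the `max_pages` safety cap are assumptions made for illustration; `youtube` is the client created with `build` earlier.

```python
def fetch_all_comment_threads(youtube, video_id, max_pages=10):
    """Collect comment threads for one video by following nextPageToken.

    `youtube` is the client returned by build(); `max_pages` is a
    hypothetical safety cap for this sketch, not an API parameter.
    """
    threads = []
    page_token = None
    for _ in range(max_pages):
        params = {"part": "snippet", "videoId": video_id, "maxResults": 100}
        if page_token:
            params["pageToken"] = page_token
        response = youtube.commentThreads().list(**params).execute()
        threads.extend(response.get("items", []))
        page_token = response.get("nextPageToken")
        if not page_token:
            break  # no more pages for this video
    return threads


# Usage with the client and DataFrame from the cells above:
# all_threads = []
# for vid in YTdf["id.videoId"].dropna():
#     all_threads.extend(fetch_all_comment_threads(youtube, vid))
```

Each page counts against the daily API quota, and a request may fail for some videos (for example when comments are turned off), so it is worth keeping the `try`/`except` and `time.sleep` pattern from Example 1 around the call.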

## Example 2: Scrape YouTube videos statistics

In [64]:
#get video stats using youtube.videos().list(part="statistics",id=vid)
# multiple nested json data
for vid in YTdf["id.videoId"][:2]:
    request2 = youtube.videos().list(part="statistics",id=vid)
    response2 = request2.execute()
    print(json.dumps(response2, indent=4))

{
    "kind": "youtube#videoListResponse",
    "etag": "GuG4RheTxXAS5Uv6gkdO0LTTnoE",
    "items": [
        {
            "kind": "youtube#video",
            "etag": "_5vqx5E-IlYIQKZZKVq83jnN5tA",
            "id": "X3paOmcrTjQ",
            "statistics": {
                "viewCount": "2974252",
                "likeCount": "44031",
                "favoriteCount": "0",
                "commentCount": "1087"
            }
        }
    ],
    "pageInfo": {
        "totalResults": 1,
        "resultsPerPage": 1
    }
}
{
    "kind": "youtube#videoListResponse",
    "etag": "tJiUFS1AtpilByl7moiwY4LT33s",
    "items": [
        {
            "kind": "youtube#video",
            "etag": "ggZbkUF7VkgDbIetmlVqhvQe5Xk",
            "id": "RBSUwFGa6Fk",
            "statistics": {
                "viewCount": "30161",
                "likeCount": "1100",
                "favoriteCount": "0",
                "commentCount": "22"
            }
        }
    ],
    "pageInfo": {
        "total

## Action 2: Parse the video statistics (JSON data) and store them in a pandas DataFrame (for all the videos in the url)

- Get all the statistics of the videos, including videoId, viewCount, likeCount, favoriteCount, and commentCount.
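
As a hedged starting point, the sketch below flattens one statistics response into a list of rows using plain Python. The helper name `parse_video_stats` is an assumption for illustration, and the sample is copied from the first response printed above. The counts arrive as strings, so they are converted to integers, and a missing field is stored as `None` on the assumption that a response may omit some counts.

```python
def parse_video_stats(response):
    """Flatten a videos().list(part="statistics") response into a list of rows."""
    fields = ["viewCount", "likeCount", "favoriteCount", "commentCount"]
    rows = []
    for item in response.get("items", []):
        stats = item.get("statistics", {})
        row = {"videoId": item["id"]}
        for field in fields:
            # counts arrive as strings; fall back to None when a field is absent
            value = stats.get(field)
            row[field] = int(value) if value is not None else None
        rows.append(row)
    return rows


# Sample copied from the first response printed above
sample = {
    "items": [
        {
            "id": "X3paOmcrTjQ",
            "statistics": {
                "viewCount": "2974252",
                "likeCount": "44031",
                "favoriteCount": "0",
                "commentCount": "1087",
            },
        }
    ]
}
print(parse_video_stats(sample))
```

Collecting the rows from every response into one list and passing it to `pd.DataFrame(...)` gives one row per video.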