# **SCRAPING THE CNBC INDONESIA YOUTUBE CHANNEL USING THE YOUTUBE DATA API v3 SEARCH METHOD**

In [1]:
# load libraries
import pandas as pd
import requests
import time


#### Everything in this code was inspired by: [**strataScratch**](https://www.youtube.com/watch?v=fklHBWow8vE&list=PLv6MQO1ZzdmrizwpS3S9DJBLQut4j1-ei&index=2)
**References:**<br>
[How to Get channel_ID](https://www.youtube.com/watch?v=D12v4rTtiYM)<br>
[How to Get API_KEY and url](https://blog.hubspot.com/website/how-to-get-youtube-api-key)

In [2]:
API_KEY = 'xxxx'
CHANNEL_ID = 'xxxx'  

**References:**<br>
[understanding quota cost ](https://developers.google.com/youtube/v3/determine_quota_cost)<br>
[youtube default quota allocation is 10k units per day](https://developers.google.com/youtube/v3/guides/quota_and_compliance_audits#:~:text=Projects%20that%20enable%20the%20YouTube,page%20in%20the%20API%20Console.)<br>
[youtube parameter common use case](https://developers.google.com/youtube/v3/docs/search/list#common-use-cases)

In [3]:
# the URL is built by joining string pieces with '+', which inserts API_KEY and CHANNEL_ID into the query string
url = 'https://youtube.googleapis.com/youtube/v3/search?key='+API_KEY+'&channelId='+CHANNEL_ID+'&part=snippet,id&order=date&maxResults=40'
response = requests.get(url).json()
# [0] shows only the first video's data
response['items'][0]

{'kind': 'youtube#searchResult',
 'etag': 'lIEpHhqmPuDKezSe-_Z1vFkmPFo',
 'id': {'kind': 'youtube#video', 'videoId': 'uUM9xk0f7xI'},
 'snippet': {'publishedAt': '2022-07-09T06:15:01Z',
  'channelId': 'UCGN9JsnkvK05v2lnTI_-uGA',
  'title': 'Langkah Kemensos Antisipasi Aksi Badan Amal &quot;Nakal&quot;',
  'description': 'Berita Lainnya: Siapa Chris Pincher, Buat Pemerintah Inggris Gonjang-ganjing? Link: https://bit.ly/3RiKr5x Pengelolaan dana ...',
  'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/uUM9xk0f7xI/default.jpg',
    'width': 120,
    'height': 90},
   'medium': {'url': 'https://i.ytimg.com/vi/uUM9xk0f7xI/mqdefault.jpg',
    'width': 320,
    'height': 180},
   'high': {'url': 'https://i.ytimg.com/vi/uUM9xk0f7xI/hqdefault.jpg',
    'width': 480,
    'height': 360}},
  'channelTitle': 'CNBC Indonesia',
  'liveBroadcastContent': 'none',
  'publishTime': '2022-07-09T06:15:01Z'}}
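
The request URL above is assembled by string concatenation. As a minimal sketch, the same query can be built from a dictionary of parameters, which keeps each parameter readable and percent-encodes the values. `build_search_url` and `SEARCH_ENDPOINT` are hypothetical names introduced for illustration; they are not part of the notebook or of the API.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = 'https://youtube.googleapis.com/youtube/v3/search'

def build_search_url(api_key, channel_id, max_results=40):
    """Return the search URL for a channel's newest videos (hypothetical helper)."""
    params = {
        'key': api_key,
        'channelId': channel_id,
        'part': 'snippet,id',
        'order': 'date',
        'maxResults': max_results,
    }
    # urlencode percent-encodes each value and joins the pairs with '&'
    return SEARCH_ENDPOINT + '?' + urlencode(params)

# 'xxxx' are the same placeholders used for API_KEY and CHANNEL_ID above
url = build_search_url('xxxx', 'xxxx')
```

Passing the dictionary directly, as in `requests.get(SEARCH_ENDPOINT, params=params)`, should produce an equivalent request without building the string by hand.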

<p align="center">
<img width="992" height="448" src="https://raw.githubusercontent.com/wjudho/wjudho/main/images/cnbc.png">
</p>

### **1. Let's Break it Down! Where Our Data is Located**
**References:**<br>
[access keys(), items() and values() in Python Dictionary](https://www.pythontutorial.net/python-basics/python-dictionary/)

In [4]:
# the response is a dict with 6 top-level keys
print(len(response))
response.keys()

6


dict_keys(['kind', 'etag', 'nextPageToken', 'regionCode', 'pageInfo', 'items'])
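
The `nextPageToken` key means the response holds only one page of results. A minimal sketch of how further pages could be walked, assuming the token is sent back through the `pageToken` request parameter: `fetch_page` is a hypothetical callable standing in for the real request, and `max_pages` is an assumed safety cap, since each search call costs 100 quota units according to the quota table linked above.

```python
def iter_search_items(fetch_page, max_pages=3):
    """Yield items from successive result pages.

    fetch_page is a hypothetical callable: it takes a page token (or None
    for the first page) and returns the parsed JSON of one search response.
    """
    token = None
    for _ in range(max_pages):
        page = fetch_page(token)
        yield from page.get('items', [])
        token = page.get('nextPageToken')
        if token is None:  # no token means this was the last page
            break

# Offline stand-in for the API so the sketch runs without a key.
fake_pages = {
    None: {'items': [{'n': 1}, {'n': 2}], 'nextPageToken': 'A'},
    'A': {'items': [{'n': 3}]},
}
items = list(iter_search_items(lambda token: fake_pages[token]))
```

With the real API, `fetch_page` would add `&pageToken=<token>` to the search URL whenever the token is not `None`.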

<p align="center">
<img width="538" height="251" src="https://raw.githubusercontent.com/wjudho/wjudho/main/images/json.png">
</p>

**References:**<br>
[what does [0] do](https://stackoverflow.com/questions/12521189/what-does-0-mean-in-python)

In [5]:
# each entry in response['items'] is a dict with 4 keys
# to keep things simple, work with just 1 video for now by indexing with [0]
print(len(response['items'][0]))
response['items'][0].keys()

4


dict_keys(['kind', 'etag', 'id', 'snippet'])

<p align="center">
<img width="538" height="251" src="https://raw.githubusercontent.com/wjudho/wjudho/main/images/json2.png">
</p>

In [6]:
# there are 8 keys inside ['snippet']
# this is where most of the data resides
print(len(response['items'][0]['snippet']))
response['items'][0]['snippet'].keys()

8


dict_keys(['publishedAt', 'channelId', 'title', 'description', 'thumbnails', 'channelTitle', 'liveBroadcastContent', 'publishTime'])

In [7]:
# there are 2 elements inside ['id']
print(len(response['items'][0]['id']))
response['items'][0]['id'].keys()

2


dict_keys(['kind', 'videoId'])
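
The cells above reach into the response one level at a time: a dict key, then a list index, then more dict keys. A minimal sketch of a helper that follows such a path and returns a default instead of raising when a level is missing; `dig` is a hypothetical name, and `sample` is a trimmed copy of the response shown earlier.

```python
def dig(data, *path, default=None):
    """Follow a path of dict keys and list indexes (hypothetical helper)."""
    current = data
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            # missing key, index out of range, or a non-container value
            return default
    return current

# Trimmed copy of the search response printed earlier in the notebook.
sample = {
    'items': [
        {'id': {'kind': 'youtube#video', 'videoId': 'uUM9xk0f7xI'},
         'snippet': {'channelTitle': 'CNBC Indonesia'}},
    ]
}

video_id = dig(sample, 'items', 0, 'id', 'videoId')
missing = dig(sample, 'items', 0, 'statistics', 'viewCount')
```

Here `video_id` is `'uUM9xk0f7xI'` and `missing` is `None`, because the search response carries no `statistics` key.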

<p align="center">
<img width="420" height="343" src="https://raw.githubusercontent.com/wjudho/wjudho/main/images/json4.png">
</p>

### **2. Let's Grab Only Data We Need into Variable Each**

In [8]:
video_id = response['items'][0]['id']['videoId']
video_title = response['items'][0]['snippet']['title']
# cast to str and strip the '&amp;' HTML entity from the title (other entities such as '&quot;' are left untouched)
video_title = str(video_title).replace("&amp;", "")
upload_date = response['items'][0]['snippet']['publishedAt']
upload_date = str(upload_date).split('T')[0]
# this concludes our first data from the `search` method (remember: still only 1 video, i.e. one item with its 4 keys)
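
The title returned by the API contains HTML entities such as `&quot;`, and the `replace` call above only removes `&amp;`. As an alternative sketch (not what the notebook does), the standard library can decode every entity and parse the timestamp into a real date. The two input strings are copied from the response shown earlier.

```python
import html
from datetime import datetime

raw_title = 'Langkah Kemensos Antisipasi Aksi Badan Amal &quot;Nakal&quot;'
raw_published = '2022-07-09T06:15:01Z'

# html.unescape turns entities such as &quot; and &amp; back into characters
clean_title = html.unescape(raw_title)

# parse the ISO-style timestamp, then keep only the calendar date
upload_day = datetime.strptime(raw_published, '%Y-%m-%dT%H:%M:%SZ').date()

print(clean_title)             # Langkah Kemensos Antisipasi Aksi Badan Amal "Nakal"
print(upload_day.isoformat())  # 2022-07-09
```

Decoding keeps the `&` character in titles that contain one, whereas replacing `&amp;` with an empty string drops it.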

<p align="center">
<img width="401" height="121" src="https://raw.githubusercontent.com/wjudho/wjudho/main/images/syntax.png"><br>
I hope this explains the <strong>[0]</strong> or <strong>[1]</strong> part
</p>

**References:**<br>
[understanding lists in python](https://www.digitalocean.com/community/tutorials/understanding-lists-in-python-)

### **3. Adding ViewCount, and LikeCount**

- To add details such as **viewCount** and **likeCount**, we need another method: the **videos** method.<br>
- This step is added inside the loop in section 4.<br>
**References:**<br>
[statistics](https://developers.google.com/youtube/v3/docs/videos#resource-representation)<br>
[list method](https://developers.google.com/youtube/v3/docs/videos/list)<br>
[quota cost](https://developers.google.com/youtube/v3/determine_quota_cost)
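
As a hedged sketch of how the statistics step could be made cheaper and more robust: the videos method accepts a comma-separated list of IDs in its `id` parameter, so one call can cover many videos, and some counts may be absent from a response (for example when likes are hidden or comments are disabled). The helper names `chunked`, `to_int` and `parse_stats`, the batch size of 50 and the second entry in `fake_response` are assumptions made for illustration.

```python
def chunked(ids, size=50):
    """Split a list of video IDs into batches (50 per call is an assumed limit)."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def to_int(value):
    """The API returns counts as strings; None is kept for missing counts."""
    return int(value) if value is not None else None

def parse_stats(response_json):
    """Map each video ID to its counts, tolerating missing statistics."""
    result = {}
    for item in response_json.get('items', []):
        stats = item.get('statistics', {})
        result[item['id']] = {
            'view_count': to_int(stats.get('viewCount')),
            'like_count': to_int(stats.get('likeCount')),
            'comment_count': to_int(stats.get('commentCount')),
        }
    return result

# Offline stand-in for a videos response; the second entry is hypothetical.
fake_response = {'items': [
    {'id': 'uUM9xk0f7xI',
     'statistics': {'viewCount': '159', 'likeCount': '10', 'commentCount': '1'}},
    {'id': 'HYPOTHETICAL_ID', 'statistics': {'viewCount': '5'}},
]}

stats_by_id = parse_stats(fake_response)
id_param = ','.join(next(chunked(['uUM9xk0f7xI', 'HYPOTHETICAL_ID'])))
```

The string in `id_param` would replace the single `video_id` in the videos URL, so a page of search results needs one statistics request instead of one request per video.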

### **4. Putting All Together Into A DataFrame**

<p align="center">
<img width="420" height="343" src="https://raw.githubusercontent.com/wjudho/wjudho/main/images/json4.png"><br>
<img width="417" height="81" src="https://raw.githubusercontent.com/wjudho/wjudho/main/images/[0].png">
<img width="417" height="81" src="https://raw.githubusercontent.com/wjudho/wjudho/main/images/[all].png"><br>

</p>

In [10]:
# create an empty DataFrame; the collected data will be added to it later
df = pd.DataFrame(columns=['video_id', 'video_title', 'upload_date', 'view_count', 'like_count', 'comment_count'])

In [11]:
# search results can also include channels and playlists, so keep only the items whose kind is 'youtube#video'
# this loop collects the data for every video in the response

for i in response['items']:
    if i['id']['kind'] == 'youtube#video':
        video_id = i['id']['videoId']
        video_title = i['snippet']['title']
        video_title = str(video_title).replace("&amp;", "")
        upload_date = i['snippet']['publishedAt']
        upload_date = str(upload_date).split('T')[0]
   
        # step 3: fetch this video's statistics with the videos method
        url_video_stats = 'https://youtube.googleapis.com/youtube/v3/videos?id='+video_id+'&part=statistics&key='+API_KEY
        response_video_stats = requests.get(url_video_stats).json()

        # .get() returns None instead of raising KeyError when a count is missing from the response
        stats = response_video_stats['items'][0]['statistics']
        view_count = stats.get('viewCount')
        like_count = stats.get('likeCount')
        comment_count = stats.get('commentCount')

        # uncomment these to check that the loop works
        #print(video_id)
        #print(video_title)
        #print(upload_date)
        #print(view_count)
        #print(like_count)
        #print(comment_count)

        # append the collected data to the DataFrame
        df = pd.concat([df, pd.DataFrame.from_records([{'video_id':video_id,'video_title':video_title,
                        'upload_date':upload_date,'view_count':view_count,
                        'like_count':like_count,'comment_count':comment_count}])], ignore_index=True)

df.head()

| | video_id | video_title | upload_date | view_count | like_count | comment_count |
|---|---|---|---|---|---|---|
| 0 | uUM9xk0f7xI | Langkah Kemensos Antisipasi Aksi Badan Amal &q... | 2022-07-09 | 159 | 10 | 1 |
| 1 | sQTzAFnDkqI | Pengusaha: Petani Sawit Menderita, Terpaksa Ju... | 2022-07-09 | 878 | 49 | 27 |
| 2 | xnldby0FcZ0 | Presidensi G20 Indonesia, Recover Together, Re... | 2022-07-08 | 5178 | 95 | 25 |
| 3 | keb0JkOsZv0 | Pungli! Pejabat BPN Cimahi Terancam Dipecat | 2022-07-08 | 4110 | 88 | 60 |
| 4 | DT2zxetUjsw | Waspada! Harga Terigu Mie Instan Naik | 2022-07-08 | 4473 | 80 | 40 |
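
The loop above grows the DataFrame with one `pd.concat` call per video. A minimal sketch of an alternative, under the assumption that only the search fields are needed: collect plain dictionaries in a list and build the frame once at the end. `rows_from_search` is a hypothetical helper, and `sample_items` is a trimmed stand-in for `response['items']` with one hypothetical channel result added to show the filter.

```python
def rows_from_search(items):
    """Turn search result items into one dict per video (hypothetical helper)."""
    rows = []
    for item in items:
        # skip channels and playlists, which have no 'videoId'
        if item['id']['kind'] != 'youtube#video':
            continue
        rows.append({
            'video_id': item['id']['videoId'],
            'video_title': item['snippet']['title'],
            'upload_date': item['snippet']['publishedAt'].split('T')[0],
        })
    return rows

# Trimmed stand-in for response['items']; the channel entry is hypothetical.
sample_items = [
    {'id': {'kind': 'youtube#video', 'videoId': 'keb0JkOsZv0'},
     'snippet': {'title': 'Pungli! Pejabat BPN Cimahi Terancam Dipecat',
                 'publishedAt': '2022-07-08T00:00:00Z'}},
    {'id': {'kind': 'youtube#channel', 'channelId': 'UCGN9JsnkvK05v2lnTI_-uGA'},
     'snippet': {'title': 'CNBC Indonesia',
                 'publishedAt': '2022-07-08T00:00:00Z'}},
]

rows = rows_from_search(sample_items)
```

A single `pd.DataFrame(rows)` call then builds the table, which avoids copying the whole frame on every iteration.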


In [12]:
df.to_csv(r'C:\Users\wis\Documents\GitHub\webscraping\youtube\cnbc.csv', index=False)