# USING AN API TO EXTRACT DATA FROM ANY YOUTUBE CHANNEL

Last month, I came across the video [Python YouTube API Tutorial: Calculating the Duration of a Playlist](https://www.youtube.com/watch?v=coZbOM6E47I&t=16s), which shows how to calculate the duration of any playlist on YouTube. It is part of a tutorial on the YouTube API, and it inspired me to work on my first personal data science project. The idea is simple: extract and analyze data from YouTube.   
The first step of the project is to collect data for a specific list of YouTube channels. We are going to retrieve statistics for each video (likes, comments, views, dislikes, date). The collected data will be saved so that it can be reused later without running the script again.   
In the second part of the project (coming soon), we will use data science tools to analyze the data and draw insights from it. We can look for the most popular videos on a channel, the most watched playlist, the relationship between video duration and number of views, the relationship between video duration and number of comments, and the ratio of likes to dislikes. 


### Table of Contents
* [Creating an API Key](#chapter1)
* [Cleaning the data](#chaptern)

##  Creating an API Key <a class='anchor' id='chapter1'></a>

First things first: we need a YouTube API key. We followed the video [Python YouTube API Tutorial](https://www.youtube.com/watch?v=th5_9woFJmk&t=2s) to set up an API key and install the packages we needed. It is clear and well explained, and by the end of it you can make your first YouTube API request. 

In [1]:
from googleapiclient.discovery import build
import os
import pandas as pd
import re
from datetime import date
from dotenv import load_dotenv
import json

## Storing the API key
We will store the API key in a file called `.env` and use the `dotenv` module to read it.   
For more details, you can check this post [Keeping your API keys secret with dotenv](http://jonathansoma.com/lede/foundations-2019/classes/apis/keeping-api-keys-secret/).

In [2]:
load_dotenv()
API_KEY = os.getenv('api_key2')
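
`os.getenv` returns `None` when the variable is not defined, so a misspelled name in the `.env` file tends to show up only later, as a failed request. Below is a minimal sketch of a guard that stops early instead. The helper name `require_env` is our own assumption and is not part of the original notebook.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Hypothetical helper: fail here rather than send a request without a key.
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value
```

After calling `load_dotenv()`, it could be used as `API_KEY = require_env('api_key2')`.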

## Building a service object

Before using the YouTube API to make requests, we need to build a service object.
We will use the [`build()`](https://googleapis.github.io/google-api-python-client/docs/epy/googleapiclient.discovery-module.html#build) function to create it. We need to specify the name of the service (in our case `youtube`), the API version (`v3`), and a developer key.
For more information, you can always check the [Getting Started](https://github.com/googleapis/google-api-python-client/blob/master/docs/start.md) document from the [google-api-python-client documentation](https://github.com/googleapis/google-api-python-client).


In [3]:
youtube = build('youtube', 'v3', developerKey=API_KEY)

## Retrieve Statistics for Any YouTube Channel

We are ready to make our first request. Since our goal is to collect data for a specific list of YouTube channels, we need a parameter that uniquely identifies each channel.   
To request information about a particular channel, we call the [`channels.list`](https://developers.google.com/youtube/v3/docs/channels/list) method. To identify the channel, we can use either the channel ID or the username associated with that channel.  
Perhaps you are wondering how to find the ID of a channel? So was I.  
One way to do it, based on this post on [Stack Overflow](https://stackoverflow.com/questions/14366648/how-can-i-get-a-channel-id-from-youtube), is to look for either `data-channel-external-id` or `externalId` in the source code of the channel page.   
In this [Jupyter notebook](channels_id.ipynb), we used the [`search.list`](https://developers.google.com/youtube/v3/guides/implementation/search) method to find the channel IDs for a list of YouTube channels.
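
As a rough sketch of the `search.list` approach, the function below looks up a channel by name and returns the ID of the first match. The function name `find_channel_id` is an assumption, and this is not necessarily the code used in the linked notebook. It expects the service object created with `build()`.

```python
def find_channel_id(youtube, channel_name):
    """Return the ID of the first channel matching channel_name, or None."""
    request = youtube.search().list(
        part="snippet",
        q=channel_name,
        type="channel",  # restrict the results to channels
        maxResults=1,
    )
    response = request.execute()
    items = response.get("items", [])
    if not items:
        return None
    # For channel results, the ID is nested under item["id"]["channelId"].
    return items[0]["id"]["channelId"]
```

The first search result is not guaranteed to be the channel we had in mind, so it is worth checking the returned ID by hand.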

We will use the YouTube channel [Corey Schafer](https://www.youtube.com/channel/UCCezIgC97PvUuR4_gbFUs5g) as our first example.

In [4]:
user_name = 'schafer5' 
channel_id = 'UCCezIgC97PvUuR4_gbFUs5g'



request = youtube.channels().list(
        part="statistics",
        forUsername=user_name # or id=channel_id
    )
response = request.execute()

In [5]:
print(json.dumps(response, indent=4,sort_keys=True))

{
    "etag": "JZZJRE9VFnr1RrtLrw73wHDRuZs",
    "items": [
        {
            "etag": "D-6Pw-_TgSGYbgmT_F48rCNdIXI",
            "id": "UCCezIgC97PvUuR4_gbFUs5g",
            "kind": "youtube#channel",
            "statistics": {
                "hiddenSubscriberCount": false,
                "subscriberCount": "789000",
                "videoCount": "230",
                "viewCount": "58487827"
            }
        }
    ],
    "kind": "youtube#channelListResponse",
    "pageInfo": {
        "resultsPerPage": 5,
        "totalResults": 1
    }
}
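
Note that the API returns the counters as strings. Here is a small sketch of how they could be pulled out of the response and converted to integers. The helper name `channel_counts` is an assumption, and `sample_response` only repeats the relevant part of the response shown above.

```python
def channel_counts(response):
    """Map each channel ID in a channels.list response to its integer counters."""
    counts = {}
    for item in response.get("items", []):
        counts[item["id"]] = {
            key: int(value)
            for key, value in item["statistics"].items()
            # Keep only the string counters; hiddenSubscriberCount is a boolean.
            if key.endswith("Count") and isinstance(value, str)
        }
    return counts


sample_response = {
    "items": [
        {
            "id": "UCCezIgC97PvUuR4_gbFUs5g",
            "statistics": {
                "hiddenSubscriberCount": False,
                "subscriberCount": "789000",
                "videoCount": "230",
                "viewCount": "58487827",
            },
        }
    ]
}

print(channel_counts(sample_response))
```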


We can look up more than one channel in a single request by passing a comma-separated list of YouTube channel IDs. 
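
A minimal sketch of such a request is shown here. The helper name `build_channels_request` is an assumption, and the two IDs in the usage line are channels that appear elsewhere in this notebook.

```python
def build_channels_request(youtube, channel_ids, part="statistics"):
    """Build one channels.list request that covers several channel IDs."""
    return youtube.channels().list(
        part=part,
        id=",".join(channel_ids),  # for example "ID1,ID2"
    )
```

For example, `build_channels_request(youtube, ['UCCezIgC97PvUuR4_gbFUs5g', 'UCWv7vMbMWH4-V0ZXdmDpPBA']).execute()` would return both channels in a single response.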

### Load the channel list


Instead of working with a random list of YouTube channels, we used the page [Top Programmer Guru](https://noonies.tech/award/top-programming-guru), which caught our attention because it lists some popular channels for learning to code.  
We used web scraping to build a dataset with two columns, `channelName` and `url`, and 71 rows, each row representing a different channel. After cleaning the dataset and removing the duplicates, we ended up with 67 rows. [The code](web_scraping.py)  

In [4]:
channelList = pd.read_csv('data/channelsList.csv')
channelList.head(10)

Unnamed: 0,channelName,title,channelId,kind,url,gender,rank
0,Programming with Mosh,Programming with Mosh,UCWv7vMbMWH4-V0ZXdmDpPBA,youtube#channel,https://www.youtube.com/c/programmingwithmosh/...,Male,1
1,Traversy Media,Traversy Media,UC29ju8bIPH5as8OGnQzwJyA,youtube#channel,https://www.youtube.com/user/TechGuyWeb,Male,2
2,Corey Schafer,Corey Schafer,UCCezIgC97PvUuR4_gbFUs5g,youtube#channel,https://www.youtube.com/user/schafer5,Male,3
3,Tech With Tim,Tech With Tim,UC4JX40jDee_tINbkjycV4Sg,youtube#channel,https://m.youtube.com/channel/UC4JX40jDee_tINb...,Male,4
4,Krish Naik,Krish Naik,UCNU_lfiiWBdtULKOw6X0Dig,youtube#channel,https://www.youtube.com/user/krishnaik06/playl...,Male,5
5,freeCodeCamp.org,freeCodeCamp.org,UC8butISFwT-Wl7EV0hUK0BQ,youtube#channel,https://www.youtube.com/channel/UC8butISFwT-Wl...,,6
6,Hitesh Choudhary,Hitesh Choudhary,UCXgGY0wkgOzynnHvSEVmE3A,youtube#channel,https://www.youtube.com/c/HiteshChoudharydotcom,Male,7
7,Clever Programmer,Clever Programmer,UCqrILQNl5Ed9Dz6CGMyvMTQ,youtube#channel,https://m.youtube.com/cleverprogrammer?uid=qrI...,Male,8
8,Caleb Curry,Caleb Curry,UCZUyPT9DkJWmS_DzdOi7RIA,youtube#channel,https://www.youtube.com/user/CalebTheVideoMaker2,Male,9
9,programming Hero,Programming Hero,UCStj-ORBZ7TGK1FwtGAUgbQ,youtube#channel,https://www.youtube.com/channel/UCStj-ORBZ7TGK...,Male,10


In [5]:
channelList.shape

(67, 7)

In [6]:
channel_ids = channelList.channelId.values
for c in channel_ids:
    print(c)
    break

UCWv7vMbMWH4-V0ZXdmDpPBA


In [16]:
len(channel_ids)

67

In [8]:
request = youtube.channels().list(
        part="snippet,contentDetails,statistics",
        id=channel_ids[0], 
        maxResults = 50
    )
response = request.execute()

#### Let's display the data we collected for one channel.

In [9]:
print(json.dumps(response, indent=4,sort_keys=True))

{
    "etag": "v7lRerb7HvZsQsaGaTP-msy7B1c",
    "items": [
        {
            "contentDetails": {
                "relatedPlaylists": {
                    "favorites": "",
                    "likes": "",
                    "uploads": "UUWv7vMbMWH4-V0ZXdmDpPBA"
                }
            },
            "etag": "bSLWId-AGGyVM8lCPNxyHKUJ4C0",
            "id": "UCWv7vMbMWH4-V0ZXdmDpPBA",
            "kind": "youtube#channel",
            "snippet": {
                "country": "AU",
                "customUrl": "programmingwithmosh",
                "description": "I train professional software engineers that companies love to hire. \n\nMy courses: http://codewithmosh.com \n\nMy blog: http://programmingwithmosh.com\n\nConnect on social media: \n\nhttp://www.twitter.com/moshhamedani\n\nhttps://www.facebook.com/programmingwithmosh\n\n#python #javascript #chsarp",
                "localized": {
                    "description": "I train professional software engineers that compani

#### Let's see if your favorite channels for learning coding are in the top 10.


In [10]:
for item in channelList.title.head(10).values:
    print(item)

Programming with Mosh
Traversy Media
Corey Schafer
Tech With Tim
Krish Naik
freeCodeCamp.org
Hitesh Choudhary
Clever Programmer
Caleb Curry
Programming Hero


#### Let's store the result in a DataFrame
The response to the request can be stored in a table (such as a DataFrame) for a clearer display. We are also going to save the data as a `csv` file to avoid making requests every time we run the script, to reuse it in other projects, and to share it along with this Jupyter notebook.  
We should mention that using the YouTube API is free, but there is a daily request quota of about 10,000 units. Each operation has a different cost: retrieving a list of channels, videos, or playlists costs 1 unit, while a search request costs 100 units.   
You can check this link for more details about [calculating quota usage](https://developers.google.com/youtube/v3/getting-started#calculating-quota-usage).            
For this reason, we will limit the collection of data to only one channel.
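
To make the quota figures concrete, here is a back-of-the-envelope estimate for our 67 channels. It is a sketch under two assumptions: list requests are sent in batches of 50 IDs, and a search would need one request per channel. The constant and function names are ours.

```python
import math

DAILY_QUOTA = 10_000      # approximate daily allowance, in units
LIST_COST = 1             # cost of one list request (channels, videos, playlists)
SEARCH_COST = 100         # cost of one search request
MAX_IDS_PER_REQUEST = 50  # assumed batch size for list requests


def list_cost(num_ids):
    """Units needed to fetch num_ids items through batched list requests."""
    return math.ceil(num_ids / MAX_IDS_PER_REQUEST) * LIST_COST


def search_cost(num_queries):
    """Units needed when every query is a separate search request."""
    return num_queries * SEARCH_COST


num_channels = 67
print(list_cost(num_channels))    # 2
print(search_cost(num_channels))  # 6700
```

Under these assumptions, listing the 67 channels by ID barely touches the quota, while searching for each of them by name would use about two thirds of the daily allowance.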

In [11]:
def make_chunks(data, chunk_size):
    '''Split data into consecutive chunks of at most chunk_size items.'''
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]
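
A quick check of how the helper behaves on a list whose length is not a multiple of the chunk size. The function is repeated so that the snippet runs on its own, and the IDs are placeholders.

```python
def make_chunks(data, chunk_size):
    """Split data into consecutive chunks of at most chunk_size items."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


placeholder_ids = [f"id{n}" for n in range(7)]  # hypothetical IDs
chunks = make_chunks(placeholder_ids, 3)

print([len(chunk) for chunk in chunks])  # [3, 3, 1]
print(make_chunks([], 3))                # []
```

The last chunk simply holds whatever is left over, and an empty input produces no chunks, so no request would be sent.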

In [12]:
def getChannelStat(youtube, channelIdList):
    '''
    Collect statistics and basic information for the given channel IDs.
    Args:
        youtube: YouTube API service object created with build()
        channelIdList (list): a list of channel IDs
    Returns:
        pandas.DataFrame with one row per channel returned by the API
    '''
    # Request the channels in batches of 50 IDs, matching maxResults
    chunks = make_chunks(channelIdList, 50)

    channels_stat = {
        'channelId': [],
        'title': [],
        'description': [],
        'country': [],
        'viewCount': [],
        'subscriberCount': [],
        'videoCount': [],
        'publishedAt': [],
        'uploads': [],
    }

    for chunk in chunks:
        request = youtube.channels().list(
            part="snippet,contentDetails,statistics",
            id=','.join(chunk),
            maxResults=50,
        )
        response = request.execute()

        for item in response['items']:
            snippet = item['snippet']
            statistics = item['statistics']
            channels_stat['channelId'].append(item['id'])
            channels_stat['title'].append(snippet['title'])
            channels_stat['description'].append(snippet['description'])
            # Not every channel declares a country, so fall back to None
            channels_stat['country'].append(snippet.get('country'))
            channels_stat['viewCount'].append(statistics['viewCount'])
            channels_stat['subscriberCount'].append(statistics['subscriberCount'])
            channels_stat['videoCount'].append(statistics['videoCount'])
            channels_stat['publishedAt'].append(snippet['publishedAt'])
            # ID of the playlist that holds all uploads of the channel
            channels_stat['uploads'].append(item['contentDetails']['relatedPlaylists']['uploads'])

    return pd.DataFrame.from_dict(channels_stat)

In [13]:
Data = getChannelStat(youtube, channel_ids)

In [14]:
Data.shape

(67, 9)

In [15]:
Data.head()

Unnamed: 0,channelId,title,description,country,viewCount,subscriberCount,videoCount,publishedAt,uploads
0,UC-QDfvrRIDB6F0bIO4I4HkQ,Pretty Printed,I'm Anthony. I make programming videos.\n\nFee...,US,9909751,72700,432,2015-11-18T21:40:58Z,UU-QDfvrRIDB6F0bIO4I4HkQ
1,UCsBjURrPoezykLs9EqgamOA,Fireship,High-intensity ⚡ code tutorials to help you bu...,US,46455926,723000,333,2017-04-07T18:17:23Z,UUsBjURrPoezykLs9EqgamOA
2,UCshZ3rdoCLjDYuTR_RBubzw,Program With Erik,My name is Erik Hanchett and I'm a web and Jav...,US,6666152,91500,581,2013-07-05T21:44:58Z,UUshZ3rdoCLjDYuTR_RBubzw
3,UC7PWnwwqMSqAXQkKXqxRkMw,Gajesh S Naik,"I am Gajesh Naik, 13 y/o. Besides the curricul...",IN,537810,11100,89,2016-07-29T11:04:06Z,UU7PWnwwqMSqAXQkKXqxRkMw
4,UCqrILQNl5Ed9Dz6CGMyvMTQ,Clever Programmer,You can find awesome programming lessons here!...,US,41675095,978000,599,2016-03-12T08:59:15Z,UUqrILQNl5Ed9Dz6CGMyvMTQ


In [17]:
Data.to_csv('data/channelsDB.csv')
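
The same idea of saving results to avoid repeated requests can be wrapped in a small caching helper. The sketch below stores a raw response as JSON rather than CSV, and the names `load_or_fetch`, `cache_path` and `fetch` are assumptions, not part of the original notebook.

```python
import json
from pathlib import Path


def load_or_fetch(cache_path, fetch):
    """Return cached JSON data if the file exists, otherwise fetch and cache it."""
    path = Path(cache_path)
    if path.exists():
        # Reuse the saved copy: no request is made and no quota is spent.
        with path.open(encoding="utf-8") as f:
            return json.load(f)
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(data, f)
    return data
```

A call could look like `load_or_fetch('data/response.json', request.execute)`, where the file name is only an example.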

In [22]:
summery_df = pd.merge(channelList,Data )

In [23]:
summery_df.head()

Unnamed: 0,channelName,title,channelId,kind,url,gender,rank,description,country,viewCount,subscriberCount,videoCount,publishedAt,uploads
0,Programming with Mosh,Programming with Mosh,UCWv7vMbMWH4-V0ZXdmDpPBA,youtube#channel,https://www.youtube.com/c/programmingwithmosh/...,Male,1,I train professional software engineers that c...,AU,82263182,1830000,161,2014-10-07T00:40:53Z,UUWv7vMbMWH4-V0ZXdmDpPBA
1,Traversy Media,Traversy Media,UC29ju8bIPH5as8OGnQzwJyA,youtube#channel,https://www.youtube.com/user/TechGuyWeb,Male,2,Traversy Media features the best online web de...,US,142390124,1560000,881,2009-10-30T21:33:14Z,UU29ju8bIPH5as8OGnQzwJyA
2,Corey Schafer,Corey Schafer,UCCezIgC97PvUuR4_gbFUs5g,youtube#channel,https://www.youtube.com/user/schafer5,Male,3,Welcome to my Channel. This channel is focused...,US,58863870,792000,230,2006-05-31T22:49:22Z,UUCezIgC97PvUuR4_gbFUs5g
3,Tech With Tim,Tech With Tim,UC4JX40jDee_tINbkjycV4Sg,youtube#channel,https://m.youtube.com/channel/UC4JX40jDee_tINb...,Male,4,"Learn programming, software engineering, machi...",CA,51968790,680000,602,2014-04-23T01:57:10Z,UU4JX40jDee_tINbkjycV4Sg
4,Krish Naik,Krish Naik,UCNU_lfiiWBdtULKOw6X0Dig,youtube#channel,https://www.youtube.com/user/krishnaik06/playl...,Male,5,"I work as a Lead Data Scientist, pioneering in...",IN,28303465,390000,1102,2012-02-11T04:05:06Z,UUNU_lfiiWBdtULKOw6X0Dig


In [24]:
summery_df[summery_df.isna().any(axis=1)]

Unnamed: 0,channelName,title,channelId,kind,url,gender,rank,description,country,viewCount,subscriberCount,videoCount,publishedAt,uploads
5,freeCodeCamp.org,freeCodeCamp.org,UC8butISFwT-Wl7EV0hUK0BQ,youtube#channel,https://www.youtube.com/channel/UC8butISFwT-Wl...,,6,Learn to code for free.,US,197380290,3780000,1164,2014-12-16T21:18:48Z,UU8butISFwT-Wl7EV0hUK0BQ
20,Kalle Hallden,Kalle Hallden,UCWr0mx597DnSGLFk1WfvSkQ,youtube#channel,https://www.youtube.com/channel/UCWr0mx597DnSG...,Male,21,"Hi, I am 300 moons old. I count everything in ...",,33062991,514000,190,2015-10-18T20:39:56Z,UUWr0mx597DnSGLFk1WfvSkQ
33,Scott Hansellman,Scott Hanselman,UCL-fHOdarou-CR2XUmK48Og,youtube#channel,https://www.youtube.com/user/shanselman,Male,35,I'm a teacher. I speak all over to whomever wi...,,8995924,125000,313,2006-03-15T10:14:39Z,UUL-fHOdarou-CR2XUmK48Og
39,CodingEntrepreneurs,CodingEntrepreneurs,UCWEHue8kksIaktO8KTTN_zg,youtube#channel,https://www.youtube.com/user/CodingEntrepreneurs,Male,42,Coding for Entrepreneurs is a Programming Seri...,,15831903,179000,671,2013-06-30T00:56:13Z,UUWEHue8kksIaktO8KTTN_zg
49,Chris Coyier,Chris Coyier,UCADyUOnhyEoQqrw_RrsGleA,youtube#channel,https://www.youtube.com/user/realcsstricks,Male,53,This is the official YouTube channel for CSS-T...,,3651883,59600,292,2011-05-12T01:53:15Z,UUADyUOnhyEoQqrw_RrsGleA
53,chuck severance,Chuck Severance,UChYfrRp_CWyqOt-ZYJGOgmA,youtube#channel,https://m.youtube.com/user/csev,Male,57,,,7776775,62100,878,2006-08-19T14:24:00Z,UUhYfrRp_CWyqOt-ZYJGOgmA
65,durga software solutions,Durga Software Solutions,UCbjozK_PYCTLEluFlrJ8UZg,youtube#channel,https://www.youtube.com/c/DurgaSoftwareSolutions,,70,"DURGA Software Solutions is an Institute, whic...",IN,124800178,582000,16443,2014-02-03T04:15:47Z,UUbjozK_PYCTLEluFlrJ8UZg


In [25]:
summery_df.to_csv('data/summeryDB.csv',index=False)

### The next step
We will collect data for each video and playlist in a single channel, so that we can finish the project without exceeding the request quota. See [Extract YouTube video statistics and playlist]()