# USING AN API TO EXTRACT DATA FROM ANY YOUTUBE CHANNEL

Last month, I came across the video [Python YouTube API Tutorial: Calculating the Duration of a Playlist](https://www.youtube.com/watch?v=coZbOM6E47I&t=16s), which shows how to calculate the duration of any playlist on YouTube and is part of a tutorial on the YouTube API. It inspired me to work on my first personal data science project. The idea is simple: extract and analyze data from YouTube.  
The first step of the project is to collect data for a specific YouTube channel by retrieving metrics for each video uploaded to that channel. The data is then saved so that it can be used later without running the script again.  
In the second part of the project (coming soon), we will use data science tools to analyze the data and get insights from it. We can look for the most popular videos on the channel, the most watched playlist, the relationship between video duration and number of views, the relationship between video duration and number of comments, and the ratio of likes to dislikes.


##  Creating an API Key

First things first: we need a YouTube API key. I used this video https://www.youtube.com/watch?v=th5_9woFJmk&t=2s to set up my API key and install the packages we need. It's a clear and well-explained video. By the end of it, you can make your first YouTube API request.

In [None]:
from googleapiclient.discovery import build
import os
import pandas as pd
import re
from datetime import date
from dotenv import load_dotenv
import json

## Hiding the API key
We will store the API key in a file called `.env` and use the `dotenv` module to read it.  
For details, see http://jonathansoma.com/lede/foundations-2019/classes/apis/keeping-api-keys-secret/

In [None]:
load_dotenv()
API_KEY = os.getenv('api_key')
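`os.getenv` returns `None` when the variable is not set, so a missing key only shows up later, when a request fails. The sketch below fails early instead. The helper name `require_env` and the variable `demo_api_key` are hypothetical and only serve the illustration.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file")
    return value


# Demonstration with a placeholder value (not a real key)
os.environ["demo_api_key"] = "placeholder-key"
demo_key = require_env("demo_api_key")
```

In the notebook, the same helper could be called as `require_env('api_key')` right after `load_dotenv()`.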

## Building a service object

Before using the YouTube API to make requests, we need to build a service object.
We will use the [`build()`](https://googleapis.github.io/google-api-python-client/docs/epy/googleapiclient.discovery-module.html#build) function to create it. We need to specify the name of the service (in our case `youtube`), the API version (`v3`), and a developer key.
For more information, you can always check the [Getting Started](https://github.com/googleapis/google-api-python-client/blob/master/docs/start.md) document from the [google-api-python-client documentation](https://github.com/googleapis/google-api-python-client).


In [None]:
youtube = build('youtube', 'v3', developerKey=API_KEY)

## Retrieve Statistics for Any YouTube Channel

We are ready to make our first request. Since our goal is to collect data for a specific YouTube channel, we need a parameter that uniquely identifies the channel.  
To request information about a particular channel, we call the `channels.list` method. To identify the channel, we can use either the channel ID or the username associated with that channel.  
Perhaps you are wondering how to find the ID of a channel? Me too.  
One way to do it, based on this post on [Stack Overflow](https://stackoverflow.com/questions/14366648/how-can-i-get-a-channel-id-from-youtube), is to look for either `data-channel-external-id` or `externalId` in the source code of the channel page. If you find a better solution, please share it with us.
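The lookup in the page source can be sketched with a regular expression. This is only an illustration: the exact markup of a channel page is an assumption and may change, the function name `find_channel_id` is hypothetical, and the sample string is hand-written rather than copied from a real page.

```python
import re


def find_channel_id(page_source):
    """Return the first channel ID found in the HTML of a channel page, or None."""
    patterns = [
        r'"externalId"\s*:\s*"([^"]+)"',        # JSON-style key in an embedded script
        r'data-channel-external-id="([^"]+)"',  # HTML attribute
    ]
    for pattern in patterns:
        match = re.search(pattern, page_source)
        if match:
            return match.group(1)
    return None


# Hand-written fragment standing in for a real page source
sample_source = '<script>{"externalId":"UCCezIgC97PvUuR4_gbFUs5g"}</script>'
found_id = find_channel_id(sample_source)
```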


In this project we will use the YouTube channel [Corey Schafer](https://www.youtube.com/channel/UCCezIgC97PvUuR4_gbFUs5g) as an example, because this project was inspired by his YouTube API tutorial. Thanks, [Corey Schafer](https://coreyms.com/).  

In [None]:
user_name = 'schafer5' 
channel_id = 'UCCezIgC97PvUuR4_gbFUs5g'



request = youtube.channels().list(
        part="statistics",
        forUsername=user_name
    )
response = request.execute()

In [None]:
print(json.dumps(response, indent=4,sort_keys=True))

We can look up more than one channel by passing a list of channel IDs.  
We created a list of channel IDs by selecting the top 10 channels from the [Top Programmer Guru](https://noonies.tech/award/top-programming-guru) list.

In [None]:
channel_ids = ["UCWv7vMbMWH4-V0ZXdmDpPBA", "UC29ju8bIPH5as8OGnQzwJyA", "UCCezIgC97PvUuR4_gbFUs5g", "UC4JX40jDee_tINbkjycV4Sg", "UCNU_lfiiWBdtULKOw6X0Dig", "UC8butISFwT-Wl7EV0hUK0BQ", "UCXgGY0wkgOzynnHvSEVmE3A", "UCqrILQNl5Ed9Dz6CGMyvMTQ", "UCStj-ORBZ7TGK1FwtGAUgbQ","UCZUyPT9DkJWmS_DzdOi7RIA"  ]

In [None]:
request = youtube.channels().list(
        part="snippet,contentDetails,statistics",
        id=channel_ids
    )
response = request.execute()

In [None]:
print(json.dumps(response, indent=4,sort_keys=True))

#### Let's see if your favorite channels for learning to code are in the top 10.
(The order in which the channel titles are displayed is arbitrary; it does not have to match the order of `channel_ids`.)

In [None]:
for item in response['items']:
    print(item['snippet']['title'])

#### Let's display the data we collected for one channel.

In [None]:

print(json.dumps(item, indent=4,sort_keys=True))

#### Let's store the result in a DataFrame
The response to the request can be stored in a table (such as a DataFrame) for a better display. We are also going to save the data as a `csv` file, to avoid making requests every time we run the script, to use it in other projects, and to share it along with this Jupyter notebook.  
We should mention that using the YouTube API is free, but there is a daily request quota of about 10,000 units. Each operation has a different cost: retrieving a list of channels, videos, or playlists costs 1 unit, but a search request costs 100 units.  
You can check this link for more details about [calculating quota usage](https://developers.google.com/youtube/v3/getting-started#calculating-quota-usage).  
For this reason, we will limit the collection of data to only one channel.
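To get a feel for these numbers, here is a small sketch of the arithmetic. It assumes 50 results per request (the `maxResults` value used in the requests later in this notebook) and the costs quoted above; the function name `estimate_quota` and the channel size of 230 videos are made up for the example.

```python
import math

SEARCH_COST = 100  # units per search request, as quoted above
LIST_COST = 1      # units per list request (channels, videos, playlists)
PAGE_SIZE = 50     # assumed number of results per request


def estimate_quota(num_items, cost_per_request, page_size=PAGE_SIZE):
    """Estimate the quota units needed to page through num_items results."""
    num_requests = math.ceil(num_items / page_size)
    return num_requests * cost_per_request


# Hypothetical channel with 230 videos: 5 requests are needed either way
units_with_search = estimate_quota(230, SEARCH_COST)  # 5 requests * 100 units
units_with_list = estimate_quota(230, LIST_COST)      # 5 requests * 1 unit
```

Under these assumptions, paging through the same videos costs 500 units with search requests and 5 units with list requests.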

In [None]:
channels_stat = {}

channels_stat['channelId'] = []
channels_stat['title'] = []
channels_stat['description'] = []
channels_stat['country'] = []
channels_stat['viewCount'] = []
channels_stat['subscriberCount'] = []
channels_stat['videoCount'] = []
channels_stat['publishedAt'] = []
channels_stat['uploads'] = []
for item in response['items']:
    
    channels_stat['channelId'].append(item['id'])
    channels_stat['title'].append(item['snippet']['title'])
    channels_stat['description'].append(item['snippet']['description'])
    # 'country' is optional in the API response, so fall back to None when it is missing
    channels_stat['country'].append(item['snippet'].get('country'))
    channels_stat['viewCount'].append(item['statistics']['viewCount'])
    channels_stat['videoCount'].append(item['statistics']['videoCount'])
    channels_stat['subscriberCount'].append(item['statistics']['subscriberCount'])
    channels_stat['publishedAt'].append(item['snippet']['publishedAt'])
    channels_stat['uploads'].append(item['contentDetails']['relatedPlaylists']['uploads'])

channels_stat

In [None]:
channels_stat = pd.DataFrame.from_dict(channels_stat)
channels_stat

In [None]:
channels_stat.to_csv('channelsDB.csv')

In [None]:
channels_stat = pd.read_csv('channelsDB.csv', index_col=0)
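Writing the table once and reading it back on later runs can be made automatic. The sketch below shows the idea with the standard `csv` module rather than pandas, so that it runs on its own. The names `load_or_fetch` and `fake_fetch`, the file name, and the placeholder row are all hypothetical.

```python
import csv
import os
import tempfile


def load_or_fetch(path, fetch_rows):
    """Read rows from a CSV file if it exists; otherwise call fetch_rows() and save the result."""
    if os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))  # values come back as strings

    rows = fetch_rows()  # this is where the API requests would happen
    fieldnames = list(rows[0].keys()) if rows else []
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows


calls = []


def fake_fetch():
    """Stand-in for the API request; records how often it is called."""
    calls.append(1)
    return [{"channelId": "UC-placeholder", "videoCount": "10"}]


cache_path = os.path.join(tempfile.mkdtemp(), "channels_cache.csv")
first = load_or_fetch(cache_path, fake_fetch)   # file missing: fetches and saves
second = load_or_fetch(cache_path, fake_fetch)  # file present: reads from disk
```

The second call does not trigger `fake_fetch` again, which is the behavior we want in order to protect the quota.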

In [None]:
channels_stat

In [None]:
channels_stat.country.value_counts()

In [None]:
channels_stat.videoCount.sort_values()

In [None]:
channels_stat.describe(include='all')

### The next step
We will collect data for each video and playlist in a single channel only, to make sure we can finish the project without exceeding the request quota.

In [None]:
def getVideosId(youtube, channelId):
    '''
    Get a list of all video IDs in a YouTube channel

    Args:
        youtube (service object): the service object returned by build()
        channelId (string): the channel ID

    Return:
        a list of video IDs
    '''
    videosIdList = []
    nextPageToken = None

    while True:

        # each search request costs 100 quota units and returns at most 50 results
        request = youtube.search().list(
            part="snippet",
            channelId=channelId,
            maxResults=50,
            regionCode='US',
            pageToken=nextPageToken,
        )
        response = request.execute()

        for item in response['items']:

            # search results can also be channels or playlists; keep only videos
            if item['id']['kind'] == "youtube#video":

                videosIdList.append(item['id']['videoId'])

        nextPageToken = response.get('nextPageToken')
        if not nextPageToken:
            break

    return videosIdList
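Since a search request costs 100 units, a cheaper route may be worth considering: the `uploads` playlist ID stored in `channels_stat` can be paged through with `playlistItems().list`, which is a list request. The function below is a sketch of that alternative and not the method used in the rest of this notebook; the name `get_upload_video_ids` is hypothetical.

```python
def get_upload_video_ids(youtube, uploads_playlist_id):
    """Collect the video IDs of a channel's uploads playlist, up to 50 per request."""
    video_ids = []
    next_page_token = None

    while True:
        response = youtube.playlistItems().list(
            part="contentDetails",
            playlistId=uploads_playlist_id,
            maxResults=50,
            pageToken=next_page_token,
        ).execute()

        for item in response["items"]:
            video_ids.append(item["contentDetails"]["videoId"])

        # keep requesting pages until the API stops returning a token
        next_page_token = response.get("nextPageToken")
        if not next_page_token:
            break

    return video_ids
```

With the table built earlier, it could be called as `get_upload_video_ids(youtube, channels_stat.loc[9, 'uploads'])`.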

### A list of all videos in a YouTube channel
We will only work with one channel, because of the quota of about 10,000 units per day.

In [None]:
channelId = channels_stat.loc[9, 'channelId']
videosIdList = getVideosId(youtube, channelId)

In [None]:
len(videosIdList)

In [None]:
today = date.today()

In [None]:
f'The channel {channels_stat.loc[9, "title"]} has {len(videosIdList)} videos as of {today}.'

### A table of all playlists in a YouTube channel

In [None]:
def getPlaylistId(youtube, channelId):

    '''
    Get all playlists for a given channelId and return the result as a DataFrame

    return:

        df (DataFrame): DataFrame with the following columns
            playlistId | title | description | itmCount | channelId

    '''
    pl_dict = {'playlistId':[], 'title': [], 'description': [] ,'itmCount':[], 'channelId':[]}

    nextPageToken = None

    while True:

        pl_request = youtube.playlists().list(
            part='contentDetails,snippet',
            channelId=channelId,
            maxResults=50,
            pageToken=nextPageToken,)
        pl_response = pl_request.execute()


        for item in pl_response['items']:

            pl_dict['playlistId'].append(item['id'])
            pl_dict['title'].append(item['snippet']['title'])
            pl_dict['description'].append(item['snippet']['description'])
            pl_dict['itmCount'].append(item['contentDetails']['itemCount'])
            pl_dict['channelId'].append(channelId)

        # read the token from this function's own response, not from the global `response`
        nextPageToken = pl_response.get('nextPageToken')

        if not nextPageToken:
        
            break

    df = pd.DataFrame.from_dict(pl_dict)

    return df

In [None]:
playlistDb = getPlaylistId(youtube, channelId)

In [None]:
playlistDb

In [None]:
playlistDb.info()

What the table above is missing is statistics about each playlist, such as the number of views and the duration. We cannot get this information directly from the YouTube API, so we have to work around it. One way to do this is to go through each video in a playlist.
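Going through each video in a playlist amounts to grouping per-video numbers by playlist. Here is a minimal sketch of that idea in plain Python, with made-up IDs and view counts. The function name `views_per_playlist` is hypothetical, and the input is assumed to be a dictionary of parallel lists under the keys `videoId` and `playlistId`.

```python
from collections import defaultdict


def views_per_playlist(playlist_items, video_views):
    """Sum the view counts of the videos in each playlist.

    playlist_items: dict with parallel lists under 'videoId' and 'playlistId'
    video_views: dict mapping a video ID to its view count
    """
    totals = defaultdict(int)
    for video_id, playlist_id in zip(playlist_items["videoId"], playlist_items["playlistId"]):
        # videos without known statistics contribute nothing
        totals[playlist_id] += video_views.get(video_id, 0)
    return dict(totals)


# Made-up data for illustration only
items = {"videoId": ["v1", "v2", "v1"], "playlistId": ["plA", "plA", "plB"]}
views = {"v1": 100, "v2": 50}
totals = views_per_playlist(items, views)
```

The same grouping could be done for durations once they are converted to numbers.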

In [None]:
def getPlaylistItems(youtube, playlist_id):
    '''
    Return the video IDs in a given playlist
    Args:
        youtube (service object): the YouTube API service object
        playlist_id (string): the playlist ID
    Return:
        dict: {'videoId': list, 'playlistId': list}
    '''

    nextPageToken = None
    playlist_items = {'videoId': [], 'playlistId': []}

    while True:
        pl_request = youtube.playlistItems().list(
            part ='contentDetails',
            playlistId=playlist_id,
            maxResults=50,
            pageToken=nextPageToken,
            )

        pl_response = pl_request.execute()


        for item in pl_response['items']:

            video_id = item['contentDetails']['videoId']
            playlist_items['playlistId'].append(playlist_id)
            playlist_items['videoId'].append(video_id) # a video can be in more than one playlist




        nextPageToken = pl_response.get('nextPageToken')

        if not nextPageToken:
            break

    return playlist_items

In [None]:

playlists_items = {'videoId': [], 'playlistId': []}
# list of all the playlist
playlistIds = playlistDb['playlistId'].values 

def dictUpdate(dict1, dict2):
    
    '''
        Concatenate the values of two dictionaries that have the same keys
    '''
    
    dict3 = {}

    for key in dict1:

        dict3[key] = dict1[key] + dict2[key]

    return dict3

In [None]:
# get the items in each playlist
for playlist_id in playlistIds:

    playlist_items = getPlaylistItems(youtube, playlist_id)

    playlists_items = dictUpdate(playlists_items, playlist_items)

Let's save the results we got in a DataFrame.

In [None]:
playlistItemsDB = pd.DataFrame.from_dict(playlists_items)
playlistItemsDB['channelId'] = channelId


In [None]:
playlistItemsDB.head()

In [None]:
playlistItemsDB.shape

In [None]:
len(playlistItemsDB['videoId'].unique())

A video can be in more than one playlist, and it can also belong to no playlist at all.
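Both cases can be checked by counting how often each video ID appears among the playlist items. The sketch below uses made-up IDs, and the function name `playlist_membership` is hypothetical.

```python
from collections import Counter


def playlist_membership(all_video_ids, playlist_video_ids):
    """Split videos into those in no playlist and those listed more than once."""
    counts = Counter(playlist_video_ids)
    in_no_playlist = [v for v in all_video_ids if counts[v] == 0]
    in_several = [v for v in all_video_ids if counts[v] > 1]
    return in_no_playlist, in_several


# Made-up IDs for illustration only
orphans, shared = playlist_membership(
    ["v1", "v2", "v3"],  # every video of the channel
    ["v1", "v1", "v2"],  # one entry per playlist item
)
```

In this notebook it could be called with `videosIdList` and `playlistItemsDB['videoId']`.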

### A table of all videos in a YouTube channel
We will create a dataset of statistics for each video in the channel.

In [None]:
def getVideoStat(youtube, videos_id_list):

    '''
    Get statistics about each video in the list and append them to the
    global dictionary vid_dict, which must exist before this function is called
    Args:
        youtube (service object): the YouTube API service object
        videos_id_list (list): a list of video IDs, with at most 50 elements

    '''
    videos_request = youtube.videos().list(
        part='contentDetails,statistics,snippet',
        id=','.join(videos_id_list),
    )

    videos_response = videos_request.execute()

    for item in videos_response['items']:

        stats = item['statistics']

        # vid_dict['playlistId'].append(playlist_id) this column will be added using a join
        vid_dict['videoId'].append(item['id'])
        vid_dict['title'].append(item['snippet']['title'])
        # tags and some statistics (for example dislikeCount) can be missing from
        # the response, so .get() stores None instead of raising a KeyError
        vid_dict['tags'].append(item['snippet'].get('tags'))
        vid_dict['viewCount'].append(stats.get('viewCount'))
        vid_dict['likeCount'].append(stats.get('likeCount'))
        vid_dict['dislikeCount'].append(stats.get('dislikeCount'))
        vid_dict['commentCount'].append(stats.get('commentCount'))
        vid_dict['duration'].append(item['contentDetails']['duration'])
        vid_dict['date'].append(item['snippet']['publishedAt'])
        vid_dict['channelId'].append(item['snippet']['channelId'])



In [None]:
def make_chunks(data, chunk_size):

    '''Split data into chunks of a given size'''

    # step through the data chunk_size elements at a time; the last chunk may be shorter
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

In [None]:
chunks = make_chunks(videosIdList, 50)

In [None]:
len(chunks)

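In [None]:
list_columns = ['videoId','title', 'tags', 'viewCount', 'likeCount', 'dislikeCount', 'commentCount', 'duration','channelId','date']
vid_dict = {key : [] for key in list_columns}

In [None]:
# try a single chunk first; vid_dict must already exist because getVideoStat appends to it
getVideoStat(youtube, chunks[0])

In [None]:
# start again from empty lists so the first chunk is not stored twice
vid_dict = {key : [] for key in list_columns}

for chunk in chunks:
    getVideoStat(youtube, chunk)

videosDB = pd.DataFrame.from_dict(vid_dict)

videosDB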
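The `duration` column holds strings in ISO 8601 format, such as `PT1H2M3S`, which are hard to compare or sum. The sketch below converts such a string to seconds with the `re` module. The function name `duration_to_seconds` is hypothetical, and the pattern only covers days, hours, minutes and seconds, which is an assumption about the values we will meet.

```python
import re

# Optional day part, then an optional time part with hours, minutes and seconds
DURATION_PATTERN = re.compile(
    r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$"
)


def duration_to_seconds(duration):
    """Convert an ISO 8601 duration such as 'PT1H2M3S' to a number of seconds."""
    match = DURATION_PATTERN.match(duration)
    if not match:
        raise ValueError(f"Unsupported duration format: {duration!r}")
    # groups that did not match are None and count as zero
    days, hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


total_seconds = duration_to_seconds("PT1H2M3S")  # 1 h + 2 min + 3 s = 3723 s
```

Applied to the table, this could look like `videosDB['duration'].apply(duration_to_seconds)`.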

It's time to save all the data we collected (`videosDB`, `playlistItemsDB` and `playlistDb`) to `csv` files, to use it later.

In [None]:
videosDB.to_csv('videosDB.csv')
playlistItemsDB.to_csv('playlistItemsDB.csv')
playlistDb.to_csv('playlistDb.csv')