# Extract YouTube video statistics and playlists
In this part of the project we focus on video and playlist data. We collect statistics for every video of each YouTube channel listed in the `summery_df` DataFrame, and we continue to use the YouTube API.
To save time, we use `multiprocessing` techniques and the `MapReduce` pattern, because we expect to process more than 48386 videos.
First, we list all videos in each channel, and then we collect statistics for each video.
Second, videos are usually organized in `playlists`. To get more data about the playlists, we list all playlists of each channel and group each playlist with the videos it contains. Combined with the video data collected in the first step, this gives some insight into each playlist's duration, number of items, view count, and so on.




In [1]:
from googleapiclient.discovery import build
from multiprocessing import Pool
import functools
import concurrent.futures
import math
import os
import pandas as pd
import re
from datetime import date
from dotenv import load_dotenv
import json
import time

In [2]:
load_dotenv()
API_KEY = os.getenv('api_key')
youtube = build('youtube', 'v3', developerKey=API_KEY)

In [9]:
request = youtube.videos().list(
        part="snippet,contentDetails,statistics",
        id="pTFZFxd4hOI"
    )
response = request.execute()

In [16]:
print(json.dumps(response['items'][0]['snippet']["liveBroadcastContent"], indent=4))

"none"


In [17]:
print(json.dumps(response['items'][0]['snippet']["defaultAudioLanguage"], indent=4))

"en"


In [3]:
summery_df = pd.read_csv('../data/channelStatistic.csv') 

In [4]:
summery_df.head()

Unnamed: 0,channelName,title,channelId,url,gender,rank,description,countryCode,viewCount,subscriberCount,videoCount,publishedAt,uploads,publishedDate,country,countryOther,continent,year,yearClass,subscriberCountClass
0,Programming with Mosh,Programming with Mosh,UCWv7vMbMWH4-V0ZXdmDpPBA,https://www.youtube.com/c/programmingwithmosh/...,Male,1,I train professional software engineers that c...,AU,82263182,1830000,161,2014-10-07T00:40:53Z,UUWv7vMbMWH4-V0ZXdmDpPBA,2014-10-07,Australia,Australia,Oceania,2014,"(2010.0, 2015.0]","(683500.0, 2460000.0]"
1,Traversy Media,Traversy Media,UC29ju8bIPH5as8OGnQzwJyA,https://www.youtube.com/user/TechGuyWeb,Male,2,Traversy Media features the best online web de...,US,142390124,1560000,881,2009-10-30T21:33:14Z,UU29ju8bIPH5as8OGnQzwJyA,2009-10-30,United States,United States,North america,2009,"(2005.999, 2010.0]","(683500.0, 2460000.0]"
2,Corey Schafer,Corey Schafer,UCCezIgC97PvUuR4_gbFUs5g,https://www.youtube.com/user/schafer5,Male,3,Welcome to my Channel. This channel is focused...,US,58863870,792000,230,2006-05-31T22:49:22Z,UUCezIgC97PvUuR4_gbFUs5g,2006-05-31,United States,United States,North america,2006,"(2005.999, 2010.0]","(683500.0, 2460000.0]"
3,Tech With Tim,Tech With Tim,UC4JX40jDee_tINbkjycV4Sg,https://m.youtube.com/channel/UC4JX40jDee_tINb...,Male,4,"Learn programming, software engineering, machi...",CA,51968790,680000,602,2014-04-23T01:57:10Z,UU4JX40jDee_tINbkjycV4Sg,2014-04-23,Canada,Canada,North america,2014,"(2010.0, 2015.0]","(97250.0, 683500.0]"
4,Krish Naik,Krish Naik,UCNU_lfiiWBdtULKOw6X0Dig,https://www.youtube.com/user/krishnaik06/playl...,Male,5,"I work as a Lead Data Scientist, pioneering in...",IN,28303465,390000,1102,2012-02-11T04:05:06Z,UUNU_lfiiWBdtULKOw6X0Dig,2012-02-11,India,India,Asia,2012,"(2010.0, 2015.0]","(97250.0, 683500.0]"


In [5]:
uploads = summery_df.uploads.values

In [6]:
len(uploads)

63

In [7]:
# https://github.com/ClarityCoders/YouTubeAnalysis/blob/master/notebook.ipynb

def getVideoIdUpload(youtube, upload_ids):
    '''
    Get the list of all video ids in the given upload playlists
    
    Args:
        youtube (service object): YouTube API client
        upload_ids (list): upload playlist ids, one per channel
        
    Return:
        a list of video ids, without duplicates
    '''
    video_id_list = []
    
    
    for upload_id in upload_ids:
        
        nextPageToken = None
        while True:

            request = youtube.playlistItems().list(
                part="snippet", #contentDetails",
                maxResults=50,
                playlistId=upload_id,
                pageToken=nextPageToken,
            )

            response = request.execute()



            for item in response['items']:

                video_id = item['snippet']['resourceId']['videoId']

                if video_id not in video_id_list:

                    video_id_list.append(video_id)


            nextPageToken = response.get('nextPageToken')

            if not nextPageToken:
                break

    return video_id_list
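
The `nextPageToken` loop used above can be isolated into a small generator. This is only a sketch: `iter_pages`, `fake_fetch` and `FAKE_PAGES` are hypothetical names, and the fake pages merely imitate the shape of a list response (`items` plus an optional `nextPageToken`) so the loop can be run without an API key.

```python
def iter_pages(fetch_page):
    """Yield every item from a paginated endpoint.

    fetch_page(token) is a hypothetical callable that returns a dict shaped
    like a list response: {'items': [...], 'nextPageToken': '...'}.
    The token is None for the first page.
    """
    token = None
    while True:
        response = fetch_page(token)
        yield from response.get("items", [])
        token = response.get("nextPageToken")
        if not token:
            # The last page carries no nextPageToken, so we stop here.
            break


# A fake three-page "API" used only for this illustration.
FAKE_PAGES = {
    None: {"items": ["v1", "v2"], "nextPageToken": "p2"},
    "p2": {"items": ["v3", "v4"], "nextPageToken": "p3"},
    "p3": {"items": ["v5"]},
}


def fake_fetch(token):
    return FAKE_PAGES[token]


all_items = list(iter_pages(fake_fetch))
print(all_items)
```

With the real client, `fetch_page` would wrap the `playlistItems().list(...).execute()` call and pass the token as `pageToken`.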

In [4]:



def make_chunks(data, num_chunks):
    
    chunk_size = math.ceil(len(data) / num_chunks)
    
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]
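
A quick check of how `make_chunks` behaves may help here. The sketch below repeats the function so it runs on its own; the inputs are placeholder lists, not real playlist ids. It shows that the chunk size is rounded up, so the last chunk can be shorter and the function can return fewer chunks than requested.

```python
import math


def make_chunks(data, num_chunks):
    # Same logic as the notebook's helper: round the chunk size up.
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


# 63 upload playlists split for 8 workers, as in this notebook.
placeholder_uploads = list(range(63))
sizes = [len(chunk) for chunk in make_chunks(placeholder_uploads, 8)]
print(sizes)

# 9 items split "into 4 chunks" give chunks of size 3, so only 3 chunks.
fewer = make_chunks(list(range(9)), 4)
print(len(fewer))
```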

In [54]:
corey = getVideoIdUpload(youtube, ['UUCezIgC97PvUuR4_gbFUs5g', 'UU4JX40jDee_tINbkjycV4Sg'])

In [56]:
len(corey)

230

In [19]:


def getVideoIdParallel(youtube, uploads, num_processes=4):
    
    chunks = make_chunks(uploads,num_processes)
    
    with concurrent.futures.ProcessPoolExecutor(max_workers=num_processes) as executor:
        
        futures = [executor.submit(getVideoIdUpload, youtube, chunk) for chunk in chunks]
        
    results = [future.result() for future in futures]
    # Merge results
    merged_results = []
    for result in results:
        merged_results.extend(result)
    return merged_results


In [22]:
start = time.time()

videos_id_list = getVideoIdParallel(youtube, uploads, num_processes=8)

end = time.time()

print(end - start)

21.556363105773926


In [23]:
len(videos_id_list)

31097

In [24]:
summery_df.videoCount.sum()

31089

The two counts differ slightly (31097 ids fetched versus 31089 videos recorded), probably because new videos have been uploaded since the `channelStatistic` dataset was created.

In [25]:

videos_id_list[0]

'rHux0gMZ3Eg'

In [26]:
# without using multiprocessing
start = time.time()

# use a dedicated name so that the imported `date` class is not overwritten
videos_id_sequential = getVideoIdUpload(youtube, uploads)

end = time.time()

print(end - start)

68.38697671890259
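
The two timings printed in this notebook can be turned into a speedup figure. The sketch below copies the two measured values from the outputs above; `timed` is a hypothetical helper, shown only as one possible way to avoid repeating the `start`/`end` pattern. Both numbers come from a single run each, so the ratio is indicative at best.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock time of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("demo", timings):
    sum(range(100_000))  # stand-in for a real workload

# Values copied from the notebook outputs (seconds).
sequential_s = 68.38697671890259
parallel_s = 21.556363105773926

speedup = sequential_s / parallel_s
print(f"speedup: {speedup:.2f}x")
```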


We use the [MapReduce](https://en.wikipedia.org/wiki/MapReduce) pattern to speed up gathering statistics for each video. The code for the `map_reduce` function is adapted from the Dataquest course [Parallel Processing](https://www.dataquest.io/course/parallel-processing/).

In [9]:


def map_reduce(data, num_processes, mapper, reducer):
    
    chunks = make_chunks(data, num_processes)
    
    with Pool(num_processes) as pool:
        
        chunk_results = pool.map(mapper, chunks)
        
    return functools.reduce(reducer, chunk_results)
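
To see the map and reduce steps on a toy input, here is a minimal sketch. It deliberately swaps the process pool for `multiprocessing.pool.ThreadPool`, which offers the same `map` interface, so the example runs anywhere without pickling concerns; `map_reduce_threads`, `count_by_length` and `merge_counts` are hypothetical names used for illustration only.

```python
import functools
import math
from multiprocessing.pool import ThreadPool


def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def map_reduce_threads(data, num_workers, mapper, reducer):
    # Map: each worker processes one chunk independently.
    chunks = make_chunks(data, num_workers)
    with ThreadPool(num_workers) as pool:
        chunk_results = pool.map(mapper, chunks)
    # Reduce: fold the per-chunk results into a single result.
    return functools.reduce(reducer, chunk_results)


def count_by_length(words):
    """Mapper: count the words of one chunk by their length."""
    counts = {}
    for word in words:
        counts[len(word)] = counts.get(len(word), 0) + 1
    return counts


def merge_counts(left, right):
    """Reducer: add two partial count dicts together."""
    merged = dict(left)
    for key, value in right.items():
        merged[key] = merged.get(key, 0) + value
    return merged


words = ["map", "reduce", "pool", "chunk", "api", "video"]
result = map_reduce_threads(words, 3, count_by_length, merge_counts)
print(result)
```

Because `pool.map` returns the chunk results in input order, the reducer sees them in the same order as the original data.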

In [10]:
def make_chunks2(data, chunk_size):
    
    '''Split data into chunks of at most chunk_size elements'''
    
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]
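
The fixed-size chunks exist because the ids are sent in batches of 50, joined with commas, in a single `id` parameter. The sketch below shows that batching on placeholder ids and estimates how many requests the 31097 ids of this notebook need; `batch_ids` and `MAX_IDS_PER_CALL` are hypothetical names, and the limit of 50 is taken from the batch size used in this notebook.

```python
import math

MAX_IDS_PER_CALL = 50  # batch size used in this notebook


def batch_ids(ids, size=MAX_IDS_PER_CALL):
    """Return comma-joined id strings, each holding at most `size` ids."""
    return [",".join(ids[i:i + size]) for i in range(0, len(ids), size)]


# Placeholder ids, not real video ids.
ids = [f"vid{i:05d}" for i in range(120)]
batches = batch_ids(ids)
print(len(batches), [batch.count(",") + 1 for batch in batches])

# Number of requests needed for the 31097 ids collected above.
requests_needed = math.ceil(31097 / MAX_IDS_PER_CALL)
print(requests_needed)
```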

In [34]:
def getVideoStat(youtube,videos_id_list):

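    '''
    Get statistics for each video in the list
    Args:
        youtube (service object): YouTube API client
        videos_id_list (list): a list of video ids; it is split into
            batches of 50 ids, one batch per request
    Return:
        a dict mapping each column name to a list of values
    '''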
    list_columns = ['videoId','title', 'tags', 'viewCount', 'likeCount', 'dislikeCount', 'commentCount', 'duration','channelId','defaultAudioLanguage','publishedAt', 'liveBroadcastContent']
    vid_dict = {key : [] for key in list_columns}
    
    chunks = make_chunks2(videos_id_list,50)
    for chunk in chunks:
        videos_request = youtube.videos().list(
            part='contentDetails, statistics, snippet',
            id = ','.join(chunk),
        )

        videos_response = videos_request.execute()

        for item in videos_response['items']:

            # playlistId is not collected here; that column is added later with a join
            vid_dict['videoId'].append(item['id'])
            vid_dict['title'].append(item['snippet']['title'])
            try:
                vid_dict['tags'].append(item['snippet']['tags'])
            except KeyError:
                vid_dict['tags'].append(None)
            vid_dict['viewCount'].append(item['statistics'].get('viewCount',None))
            vid_dict['likeCount'].append(item['statistics'].get('likeCount',None))
            vid_dict['dislikeCount'].append(item['statistics'].get('dislikeCount',None))
            vid_dict['commentCount'].append(item['statistics'].get('commentCount',None))
            vid_dict['duration'].append(item['contentDetails']['duration'])
            vid_dict['publishedAt'].append(item['snippet']['publishedAt'])
            vid_dict['channelId'].append(item['snippet']['channelId'])
            vid_dict['liveBroadcastContent'].append(item['snippet'].get('liveBroadcastContent',None))
            vid_dict['defaultAudioLanguage'].append(item['snippet'].get('defaultAudioLanguage',None))
    return vid_dict


In [8]:
def reducer(dic1, dic2):
    
    merged = dict()
    
    for item in dic1:
        
        merged[item] = dic1[item]+ dic2[item]
        
    return merged
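
A small run of `reducer` on hand-made partial results may make its role clearer. The dictionaries below are invented for illustration; the sketch assumes, as the mappers in this notebook do, that every partial result has exactly the same keys.

```python
import functools


def reducer(dic1, dic2):
    # Same logic as the notebook's reducer: concatenate the lists key by key.
    merged = dict()
    for item in dic1:
        merged[item] = dic1[item] + dic2[item]
    return merged


# Invented partial results, one per chunk.
partials = [
    {"videoId": ["a", "b"], "viewCount": ["10", "20"]},
    {"videoId": ["c"], "viewCount": ["30"]},
    {"videoId": [], "viewCount": []},  # a chunk that returned nothing
]

combined = functools.reduce(reducer, partials)
print(combined)
```

A key missing from `dic2` would raise a `KeyError`, which is why each mapper initialises every column with an empty list before filling it.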

In [35]:
from functools import partial
start = time.time()

data = map_reduce(videos_id_list, 8, partial(getVideoStat, youtube), reducer)

end = time.time()

print(end - start)

25.93574094772339


In [36]:
len(data['videoId'])

31097

In [37]:
df = pd.DataFrame.from_dict(data)

In [38]:
df.head()

Unnamed: 0,videoId,title,tags,viewCount,likeCount,dislikeCount,commentCount,duration,channelId,defaultAudioLanguage,publishedAt,liveBroadcastContent
0,rHux0gMZ3Eg,Django Tutorial for Beginners [2021],"[django tutorial, django, learn django, django...",51981,2970,28,514,PT1H2M36S,UCWv7vMbMWH4-V0ZXdmDpPBA,en,2021-06-28T14:00:31Z,none
1,pTFZFxd4hOI,Docker Tutorial for Beginners [2021],"[docker tutorial, docker, learn docker, docker...",285783,8734,314,632,PT56M4S,UCWv7vMbMWH4-V0ZXdmDpPBA,en,2021-03-30T15:14:41Z,none
2,Eo90IEphG_M,Docker course is coming!,,28449,1794,11,231,PT49S,UCWv7vMbMWH4-V0ZXdmDpPBA,en,2021-03-25T17:30:03Z,none
3,qz0aGYrrlhU,HTML Tutorial for Beginners: HTML Crash Course...,"[html tutorial, html5 tutorial, html, html5, l...",983673,29773,386,1655,PT1H9M34S,UCWv7vMbMWH4-V0ZXdmDpPBA,en,2021-01-11T14:30:10Z,none
4,-_X6PhkjpzU,5 Front-end Development Skills to Land Your Fi...,"[front-end development, front end development,...",399842,22398,153,643,PT9M2S,UCWv7vMbMWH4-V0ZXdmDpPBA,en,2021-01-07T14:30:14Z,none
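
The `duration` column holds ISO 8601 duration strings such as `PT1H2M36S`. To compute playlist durations later, they need to be converted to a number. The function below is a sketch under the assumption that only days, hours, minutes and seconds appear; `duration_to_seconds` is a hypothetical helper, not part of the API client.

```python
import re

# Matches durations like PT49S, PT56M4S, PT1H2M36S or P1DT2H.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(value):
    """Convert an ISO 8601 duration string to a number of seconds."""
    match = _DURATION_RE.match(value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    # Missing components are treated as zero.
    parts = {name: int(num) if num else 0 for name, num in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


for raw in ["PT1H2M36S", "PT56M4S", "PT49S"]:
    print(raw, duration_to_seconds(raw))
```

Applied to the DataFrame, this could be used as `df['duration'].map(duration_to_seconds)`.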


In [39]:
# save the data in csv file
df.to_csv('../data/videosDB.csv', index=False)

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31097 entries, 0 to 31096
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   videoId               31097 non-null  object
 1   title                 31097 non-null  object
 2   tags                  28027 non-null  object
 3   viewCount             31085 non-null  object
 4   likeCount             31053 non-null  object
 5   dislikeCount          31053 non-null  object
 6   commentCount          31057 non-null  object
 7   duration              31097 non-null  object
 8   channelId             31097 non-null  object
 9   defaultAudioLanguage  26396 non-null  object
 10  publishedAt           31097 non-null  object
 11  liveBroadcastContent  31097 non-null  object
dtypes: object(12)
memory usage: 2.8+ MB


In [41]:
len(df.videoId.unique())

31097

In [43]:
df.defaultAudioLanguage.value_counts()

en        20795
hi         1546
en-US      1393
ur         1195
en-GB       741
es-419      345
pl          200
en-CA       135
ko           35
en-IN         4
zxx           4
mr            1
zh-TW         1
fr-CA         1
Name: defaultAudioLanguage, dtype: int64
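
Several of these codes are regional variants of the same language (`en`, `en-US`, `en-GB`, ...). One possible way to group them is to keep only the primary subtag before the first hyphen. The sketch below applies this to the counts printed above; the grouping rule is an assumption made for illustration, not something the API provides.

```python
from collections import Counter

# Counts copied from the value_counts() output above.
language_counts = {
    "en": 20795, "hi": 1546, "en-US": 1393, "ur": 1195, "en-GB": 741,
    "es-419": 345, "pl": 200, "en-CA": 135, "ko": 35, "en-IN": 4,
    "zxx": 4, "mr": 1, "zh-TW": 1, "fr-CA": 1,
}

by_primary = Counter()
for tag, count in language_counts.items():
    primary = tag.split("-")[0].lower()  # 'en-US' -> 'en'
    by_primary[primary] += count

for language, count in by_primary.most_common(3):
    print(language, count)
```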

In [44]:
df.liveBroadcastContent.value_counts()

none        31096
upcoming        1
Name: liveBroadcastContent, dtype: int64

In [12]:
def getVideoDesc(youtube,videos_id_list):

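    '''
    Get the description of each video in the list
    Args:
        youtube (service object): YouTube API client
        videos_id_list (list): a list of video ids; it is split into
            batches of 50 ids, one batch per request
    Return:
        a dict with the keys 'videoId' and 'description'
    '''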
    list_columns = ['videoId','description']
    vid_dict = {key : [] for key in list_columns}
    
    chunks = make_chunks2(videos_id_list,50)
    for chunk in chunks:
        videos_request = youtube.videos().list(
            part='snippet',
            id = ','.join(chunk),
        )

        videos_response = videos_request.execute()

        for item in videos_response['items']:

            # playlistId is not collected here; that column is added later with a join
            vid_dict['videoId'].append(item['id'])
            vid_dict['description'].append(item['snippet'].get('description',None))
            
    return vid_dict


In [5]:
df = pd.read_csv('../data/videosDB.csv')

In [7]:
videos_id_list = df.videoId.values

In [14]:
from functools import partial
start = time.time()

data = map_reduce(videos_id_list, 8, partial(getVideoDesc, youtube), reducer)

end = time.time()

print(end - start)

43.4334442615509


In [16]:
data = pd.DataFrame.from_dict(data)

In [17]:
data.head()

Unnamed: 0,videoId,description
0,rHux0gMZ3Eg,Django Tutorial for Beginners - Learn Django f...
1,pTFZFxd4hOI,Docker Tutorial for Beginners - Learn Docker f...
2,Eo90IEphG_M,
3,qz0aGYrrlhU,HTML Tutorial for Beginners - Learn HTML for a...
4,-_X6PhkjpzU,Everything you need to know in a simple path t...


In [21]:
df = df.merge(data,on='videoId')

In [20]:
df.shape

(31097, 12)

In [22]:
df.to_csv('../data/videosDB.csv', index=False)

### Playlist items 

In [45]:
def getPlaylistId(youtube, channelsId):

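    '''
    Get all playlists for a given list of channel ids and save the result in a dict

    Args:
        youtube (service object): YouTube API client
        channelsId (list): a list of channel ids

    Return:
        pl_dict (dict): dict of lists with the following keys
            playlistId | title | description | itmCount | publishedAt | channelId
    '''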
    pl_dict = {'playlistId':[], 'title': [], 'description': [] ,'itmCount':[], 'publishedAt': [],'channelId':[]}

    
    for channelId in channelsId:
        
        nextPageToken = None
        
        while True:

            request = youtube.playlists().list(
                part ='contentDetails, snippet',
                channelId=channelId,
                maxResults=50,
                pageToken=nextPageToken,)
            response = request.execute()


            for item in response['items']:

                pl_dict['playlistId'].append(item['id'])
                pl_dict['title'].append(item['snippet']['title'])
                pl_dict['description'].append(item['snippet']['description'])
                pl_dict['publishedAt'].append(item['snippet']['publishedAt'])
                pl_dict['itmCount'].append(item['contentDetails']['itemCount'])
                pl_dict['channelId'].append(channelId)

            nextPageToken = response.get('nextPageToken')

            if not nextPageToken:

                break

        #df = pd.DataFrame.from_dict(pl_dict)

    return pl_dict

In [46]:
channels_id_list = summery_df.channelId.values

In [47]:
len(channels_id_list)

63

In [48]:
from functools import partial
start = time.time()

pl_dic = map_reduce(channels_id_list, 8, partial(getPlaylistId, youtube), reducer)

end = time.time()

print(end - start)

2.059824228286743


In [49]:
pl_dic

{'playlistId': ['PLTjRvDozrdlxzQet01qZBt-sRG8bbDggv',
  'PLTjRvDozrdlxlMnoG9_yJKPMxMJu8FWRK',
  'PLTjRvDozrdlxCs_3gaqd120LcGxmfe8rG',
  'PLTjRvDozrdlw5En5v2xrBr_EqieHf7hGs',
  'PLTjRvDozrdlynYXGUfyyMZdrQ0Sz27aud',
  'PLTjRvDozrdlyXC_6mOBhmCWoMAmvHarng',
  'PLTjRvDozrdlxj5wgH4qkvwSOdHLOCx10f',
  'PLTjRvDozrdlxEIuOBZkMAK5uiqp8rHUax',
  'PLTjRvDozrdlydy3uUBWZlLUTNpJSGGCEm',
  'PLTjRvDozrdlxJjrQ4phZAUmiRn-HbK3M_',
  'PLTjRvDozrdlxAhsPP4ZYtt3G8KbJ449oT',
  'PLTjRvDozrdlz3_FPXwb6lX_HoGXa09Yef',
  'PLillGF-RfqbbI5_XFuidhe3EE7bej5T_y',
  'PLillGF-RfqbY3c2r0htQyVbDJJoBFE6Rb',
  'PLillGF-RfqbZ2ybcoD2OaabW2P7Ws8CWu',
  'PLillGF-RfqbbJYRaNqeUzAb7QY-IqBKRx',
  'PLillGF-Rfqba3xeEvDzIcUCxwMlGiewfV',
  'PLillGF-RfqbbRA-CIUxlxkUpbq0IFkX60',
  'PLillGF-RfqbZyLc9sMQ72_u3FW9fVxo1p',
  'PLillGF-RfqbZrjw48EXLdM4dsOhURCLZx',
  'PLillGF-RfqbbFSFYR_yJfDcdq6It6OqdO',
  'PLillGF-RfqbYSx-Ab1xWVanGKtowTsnNm',
  'PLillGF-RfqbaxgxkKgKk1XlJAVCX31xRI',
  'PLillGF-Rfqbb6vZqT-Lzi9Al_noaY5LAs',
  'PLillGF-RfqbYoGoCjKoMOk

In [50]:
playlis_df = pd.DataFrame.from_dict(pl_dic)

In [51]:
playlis_df

Unnamed: 0,playlistId,title,description,itmCount,publishedAt,channelId
0,PLTjRvDozrdlxzQet01qZBt-sRG8bbDggv,Mobile Development,,6,2020-06-14T18:44:27Z,UCWv7vMbMWH4-V0ZXdmDpPBA
1,PLTjRvDozrdlxlMnoG9_yJKPMxMJu8FWRK,Job Interview Preparation Videos,,6,2019-12-10T03:20:31Z,UCWv7vMbMWH4-V0ZXdmDpPBA
2,PLTjRvDozrdlxCs_3gaqd120LcGxmfe8rG,Programming Languages,,8,2019-07-15T01:00:50Z,UCWv7vMbMWH4-V0ZXdmDpPBA
3,PLTjRvDozrdlw5En5v2xrBr_EqieHf7hGs,Front-end Development,,11,2019-03-23T20:17:05Z,UCWv7vMbMWH4-V0ZXdmDpPBA
4,PLTjRvDozrdlynYXGUfyyMZdrQ0Sz27aud,Back-end Development,All the essential tutorials to learn back-end ...,16,2019-03-23T20:08:38Z,UCWv7vMbMWH4-V0ZXdmDpPBA
...,...,...,...,...,...,...
2256,PL9fcHFJHtFabiuFvjtydossqwgZlkfcxp,My Favorite Tutorials of Wordpress Development...,These are the video tutorials i love to watch ...,20,2016-12-19T15:40:42Z,UCjM2CgqAXgXQuFjJa732IRw
2257,PL9fcHFJHtFaY-qiU_v7uZMbAB1k4Dz1q3,Web Designing and Web Development Tips and Tri...,"In this playlist, i will be adding videos of T...",17,2016-12-19T08:29:43Z,UCjM2CgqAXgXQuFjJa732IRw
2258,PL9fcHFJHtFaZh9U9BiKlqX7bGdvFkSjro,WordPress and WooCommerce Tutorials for interm...,"In this playlist Here, you have complete WordP...",30,2016-11-26T07:11:02Z,UCjM2CgqAXgXQuFjJa732IRw
2259,PL9fcHFJHtFaYQnLwdm1aJbD1GK11DSYgs,wordpress Theme and Plugin Development Tutoria...,"In Here, you have complete Pinegrow Solutions,...",4,2016-08-09T10:35:40Z,UCjM2CgqAXgXQuFjJa732IRw


In [55]:
# save the DataFrame in csv file
playlis_df.to_csv('data/playlistDB.csv', index=False)

In [53]:
playlist_id_list = playlis_df.playlistId.values

In [54]:
len(playlist_id_list)

2261

In [56]:
def getPlaylistItems(youtube, playlists_id):
    '''
    Return the video ids contained in the given playlists
    Args:
        youtube (service object): YouTube API client
        playlists_id (list): a list of playlist ids
    Return:
        dict: {'videoId': list, 'playlistId': list}
    '''

    
    playlist_items = {'videoId': [], 'playlistId': []}
    
    for playlist_id in playlists_id:

        nextPageToken = None
        while True:
            pl_request = youtube.playlistItems().list(
                part ='contentDetails',
                playlistId=playlist_id,
                maxResults=50,
                pageToken=nextPageToken,
                )

            pl_response = pl_request.execute()


            for item in pl_response['items']:

                video_id = item['contentDetails']['videoId']
                playlist_items['playlistId'].append(playlist_id)
                playlist_items['videoId'].append(video_id) # a video can be in more than one playlist




            nextPageToken = pl_response.get('nextPageToken')

            if not nextPageToken:
                break

    return playlist_items

In [57]:
start = time.time()

pl_item_dic = map_reduce(playlist_id_list, 8, partial(getPlaylistItems, youtube), reducer)

end = time.time()

print(end - start)

16.80524754524231


In [None]:
pl_item_dic

In [58]:
pl_item_df = pd.DataFrame.from_dict(pl_item_dic)

In [59]:
pl_item_df

Unnamed: 0,videoId,playlistId
0,0-S5a0eXPoc,PLTjRvDozrdlxzQet01qZBt-sRG8bbDggv
1,SLwpqD8n3d0,PLTjRvDozrdlxzQet01qZBt-sRG8bbDggv
2,93ZU6j59wL4,PLTjRvDozrdlxzQet01qZBt-sRG8bbDggv
3,uxZZzmeCoLE,PLTjRvDozrdlxzQet01qZBt-sRG8bbDggv
4,8JJ101D3knE,PLTjRvDozrdlxzQet01qZBt-sRG8bbDggv
...,...,...
31690,PWyMjOEomcc,PL9fcHFJHtFabSYUpg7pCPFpIaIElNcubv
31691,EzUcP0yPgoA,PL9fcHFJHtFabSYUpg7pCPFpIaIElNcubv
31692,iu35wtVi5mU,PL9fcHFJHtFabSYUpg7pCPFpIaIElNcubv
31693,rn92rMiV-48,PL9fcHFJHtFabSYUpg7pCPFpIaIElNcubv


In [60]:
len(pl_item_df.videoId.unique())

25925

A video can be included in several playlists, or in no playlist at all, which is why the 31694 playlist rows contain only 25925 unique video ids.
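
The playlist membership of each video can be checked with a simple count. The ids below are invented; with the real data, `uploaded` would come from the video table and `playlist_items` from the playlist items table. This is a sketch of one way to do the check, not the analysis used later in the project.

```python
from collections import Counter

# Invented sample data: uploaded video ids and (videoId, playlistId) pairs.
uploaded = ["a", "b", "c", "d"]
playlist_items = [("a", "PL1"), ("b", "PL1"), ("a", "PL2"), ("x", "PL2")]

# Number of playlist rows per video id.
membership = Counter(video_id for video_id, _ in playlist_items)

in_several = sorted(v for v, n in membership.items() if n > 1)
in_none = sorted(set(uploaded) - set(membership))
# Playlists may also reference videos that are absent from the uploads list.
not_uploaded = sorted(set(membership) - set(uploaded))

print(in_several, in_none, not_uploaded)
```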

In [61]:
pl_item_df.to_csv('data/playlistItemsDB.csv', index=False)