# Extract YouTube video statistics and playlists
In this part of the project we focus on video and playlist data: we collect statistics for the videos of every YouTube channel listed in the `summery_df` DataFrame, still using the YouTube API.
To gain speed, we use `multiprocessing` techniques and the `MapReduce` framework, because we expect to process more than 48386 videos.
First, we list all the videos of each channel, and then we collect statistics for each video.
Second, since videos are usually organized in `playlists`, we list all the playlists of each channel and group each playlist with the videos it contains. Combined with the data collected in the first step, this gives some insight into each playlist's duration, number of items, view count, and so on.




In [115]:
from googleapiclient.discovery import build
import os
import pandas as pd
import re
from datetime import date
from dotenv import load_dotenv
import json

In [117]:
load_dotenv()
API_KEY = os.getenv('api_key')
youtube = build('youtube', 'v3', developerKey=API_KEY)

In [5]:
summery_df = pd.read_csv('data/summeryCleanDB.csv') 

In [116]:
summery_df.head()

Unnamed: 0,channelName,title,channelId,kind,url,description,countryCode,viewCount,subscriberCount,videoCount,publishedAt,uploads,country,countryOther,publishedDate
0,Programming with Mosh,Programming with Mosh,UCWv7vMbMWH4-V0ZXdmDpPBA,youtube#channel,https://www.youtube.com/c/programmingwithmosh/...,I train professional software engineers that c...,AU,79813920,1790000,160,2014-10-07T00:40:53Z,UUWv7vMbMWH4-V0ZXdmDpPBA,Australia,Australia,2014-10-07
1,Traversy Media,Traversy Media,UC29ju8bIPH5as8OGnQzwJyA,youtube#channel,https://www.youtube.com/user/TechGuyWeb,Traversy Media features the best online web de...,US,140038994,1540000,879,2009-10-30T21:33:14Z,UU29ju8bIPH5as8OGnQzwJyA,United States,United States,2009-10-30
2,Corey Schafer,Corey Schafer,UCCezIgC97PvUuR4_gbFUs5g,youtube#channel,https://www.youtube.com/user/schafer5,Welcome to my Channel. This channel is focused...,US,57821417,782000,230,2006-05-31T22:49:22Z,UUCezIgC97PvUuR4_gbFUs5g,United States,United States,2006-05-31
3,Tech With Tim,Tech With Tim,UC4JX40jDee_tINbkjycV4Sg,youtube#channel,https://m.youtube.com/channel/UC4JX40jDee_tINb...,"Learn programming, software engineering, machi...",CA,50278115,664000,591,2014-04-23T01:57:10Z,UU4JX40jDee_tINbkjycV4Sg,Canada,Canada,2014-04-23
4,Krish Naik,Krish Naik,UCNU_lfiiWBdtULKOw6X0Dig,youtube#channel,https://www.youtube.com/user/krishnaik06/playl...,"I work as a Lead Data Scientist, pioneering in...",IN,26639800,375000,1061,2012-02-11T04:05:06Z,UUNU_lfiiWBdtULKOw6X0Dig,India,India,2012-02-11


In [34]:
uploads = summery_df.uploads.values

In [35]:
len(uploads)

68

In [62]:
# https://github.com/ClarityCoders/YouTubeAnalysis/blob/master/notebook.ipynb

def getVideoIdUpload(youtube, upload_ids):
    '''
    Get the ids of all videos in the given uploads playlists.

    Args:
        youtube (service object): the YouTube API client returned by build()
        upload_ids (list of str): ids of the channels' "uploads" playlists

    Return:
        a list of unique video ids, in the order they were first seen
    '''
    video_id_list = []
    seen = set()  # constant-time duplicate check

    for upload_id in upload_ids:
        nextPageToken = None
        while True:
            # each page holds at most 50 items
            request = youtube.playlistItems().list(
                part="snippet",
                maxResults=50,
                playlistId=upload_id,
                pageToken=nextPageToken,
            )
            response = request.execute()

            for item in response['items']:
                video_id = item['snippet']['resourceId']['videoId']
                if video_id not in seen:
                    seen.add(video_id)
                    video_id_list.append(video_id)

            # a missing nextPageToken means this was the last page
            nextPageToken = response.get('nextPageToken')
            if not nextPageToken:
                break

    return video_id_list

In [12]:
import math


def make_chunks(data, num_chunks):
    
    chunk_size = math.ceil(len(data) / num_chunks)
    
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

In [54]:
corey = getVideoIdUpload(youtube, ['UUCezIgC97PvUuR4_gbFUs5g'])  # Corey Schafer's uploads playlist

In [56]:
len(corey)

230

In [19]:
import concurrent.futures

In [25]:


def getVideoIdParallel(youtube, uploads, num_processes=4):
    '''Collect video ids in parallel, one chunk of uploads playlists per task.'''
    chunks = make_chunks(uploads, num_processes)

    # max_workers makes the pool honor num_processes
    with concurrent.futures.ProcessPoolExecutor(max_workers=num_processes) as executor:
        futures = [executor.submit(getVideoIdUpload, youtube, chunk) for chunk in chunks]
        results = [future.result() for future in futures]

    # Merge the per-chunk lists into a single list
    merged_results = []
    for result in results:
        merged_results.extend(result)
    return merged_results


In [65]:
import time

start = time.time()

videos_id_list = getVideoIdParallel(youtube, uploads, num_processes=8)

end = time.time()

print(end - start)

77.36557841300964


In [66]:
len(videos_id_list)

48749

In [46]:
summery_df.videoCount.sum()

48386

The two counts differ (48749 collected against 48386 expected), perhaps because new videos have been uploaded since the `summery_df` dataset was created.
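One way to look into such a gap, once a `channelId` is known for every collected video, is to compare the counts channel by channel. The following is only a sketch on hypothetical toy data: the helper name `count_differences` and the channel ids are illustrative assumptions, not part of the notebook.

```python
from collections import Counter


def count_differences(expected_counts, collected_channel_ids):
    """Return {channelId: collected - expected} for channels whose counts differ."""
    collected = Counter(collected_channel_ids)
    differences = {}
    for channel_id in set(expected_counts) | set(collected):
        delta = collected.get(channel_id, 0) - expected_counts.get(channel_id, 0)
        if delta != 0:
            differences[channel_id] = delta
    return differences


# Hypothetical toy data: expected counts per channel and one channelId per collected video
expected = {"channel_a": 2, "channel_b": 1}
collected_ids = ["channel_a", "channel_a", "channel_a", "channel_b"]

differences = count_differences(expected, collected_ids)
print(differences)
```

With the notebook's data, the expected counts could presumably be built with `dict(zip(summery_df.channelId, summery_df.videoCount))`, and the collected ids taken from the `channelId` column of the video statistics gathered later in this notebook.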

In [47]:

videos_id_list[0]

'pTFZFxd4hOI'

In [60]:
# without using multiprocessing
import time

start = time.time()

data = getVideoIdUpload(youtube, uploads)

end = time.time()

print(end - start)

159.74628448486328


In [61]:
len(data)

48749

To go faster, we will use the [MapReduce](https://en.wikipedia.org/wiki/MapReduce) framework when gathering statistics on each video. The code for the `map_reduce` function has been adapted from the Dataquest course [Parallel Processing](https://www.dataquest.io/course/parallel-processing/).
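To make the pattern concrete before applying it to the API, here is a minimal sketch of the same split, map, and reduce steps on toy data. It deliberately swaps the process pool for a thread pool (`ThreadPoolExecutor`) so that it runs in any environment without pickling concerns. The names `map_reduce_threads`, `fake_video_stats`, and `merge_columns` are hypothetical stand-ins, and the mapper does not call the YouTube API.

```python
import functools
import math
from concurrent.futures import ThreadPoolExecutor


def make_chunks(data, num_chunks):
    """Split data into at most num_chunks slices of near-equal size."""
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def fake_video_stats(video_ids):
    """Hypothetical mapper: builds a dict of columns for one chunk (no API call)."""
    return {"videoId": list(video_ids), "idLength": [len(v) for v in video_ids]}


def merge_columns(left, right):
    """Reducer: concatenate the column lists of two partial results."""
    return {key: left[key] + right[key] for key in left}


def map_reduce_threads(data, num_workers, mapper, reducer):
    """Map each chunk in a worker thread, then fold the partial results together."""
    chunks = make_chunks(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # executor.map yields results in the same order as the input chunks
        chunk_results = list(executor.map(mapper, chunks))
    return functools.reduce(reducer, chunk_results)


video_ids = ["pTFZFxd4hOI", "Eo90IEphG_M", "qz0aGYrrlhU", "-_X6PhkjpzU", "Nb0btdq1164"]
result = map_reduce_threads(video_ids, 2, fake_video_stats, merge_columns)
print(result["videoId"])
```

Because the mapper results come back in chunk order, the merged columns keep the order of the input list, which is what lets the merged dict be turned directly into a DataFrame.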

In [79]:
from multiprocessing import Pool
import functools


def map_reduce(data, num_processes, mapper, reducer):
    
    chunks = make_chunks(data, num_processes)
    
    with Pool(num_processes) as pool:
        
        chunk_results = pool.map(mapper, chunks)
        
    return functools.reduce(reducer, chunk_results)

In [68]:
def make_chunks2(data, chunk_size):
    '''Split data into chunks of at most chunk_size elements'''
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

In [111]:
def getVideoStat(youtube,videos_id_list):

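    '''
    Get statistics about each video in a list.
    Args:
        youtube (service object): the YouTube API client
        videos_id_list (list): a list of video ids; it is split into
            batches of 50 ids, the maximum accepted per request
    Return:
        a dict mapping each column name to a list of values
    '''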
    list_columns = ['videoId','title', 'tags', 'viewCount', 'likeCount', 'dislikeCount', 'commentCount', 'duration','channelId','date']
    vid_dict = {key : [] for key in list_columns}
    
    chunks = make_chunks2(videos_id_list,50)
    for chunk in chunks:
        videos_request = youtube.videos().list(
            part='contentDetails, statistics, snippet',
            id = ','.join(chunk),
        )

        videos_response = videos_request.execute()

        for item in videos_response['items']:

            # vid_dict['playlistId'].append(playlist_id) this column will be add using join
            vid_dict['videoId'].append(item['id'])
            vid_dict['title'].append(item['snippet']['title'])
            try:
                vid_dict['tags'].append(item['snippet']['tags'])
            except KeyError:
                vid_dict['tags'].append(None)
            vid_dict['viewCount'].append(item['statistics'].get('viewCount',None))
            vid_dict['likeCount'].append(item['statistics'].get('likeCount',None))
            vid_dict['dislikeCount'].append(item['statistics'].get('dislikeCount',None))
            vid_dict['commentCount'].append(item['statistics'].get('commentCount',None))
            vid_dict['duration'].append(item['contentDetails']['duration'])
            vid_dict['date'].append(item['snippet']['publishedAt'])
            vid_dict['channelId'].append(item['snippet']['channelId'])
            
    return vid_dict


In [101]:
def reducer(dic1, dic2):
    
    merged = dict()
    
    for item in dic1:
        
        merged[item] = dic1[item]+ dic2[item]
        
    return merged

In [118]:
from functools import partial
start = time.time()

data = map_reduce(videos_id_list, 8, partial(getVideoStat, youtube), reducer)

end = time.time()

print(end - start)

37.307560443878174


In [119]:
len(data['videoId'])

48749

In [120]:
df = pd.DataFrame.from_dict(data)

In [121]:
df.head()

Unnamed: 0,videoId,title,tags,viewCount,likeCount,dislikeCount,commentCount,duration,channelId,date
0,pTFZFxd4hOI,Docker Tutorial for Beginners [2021],"[docker tutorial, docker, learn docker, docker...",259236,8125,273,606,PT56M4S,UCWv7vMbMWH4-V0ZXdmDpPBA,2021-03-30T15:14:41Z
1,Eo90IEphG_M,Docker course is coming!,,27613,1777,11,231,PT49S,UCWv7vMbMWH4-V0ZXdmDpPBA,2021-03-25T17:30:03Z
2,qz0aGYrrlhU,HTML Tutorial for Beginners: HTML Crash Course...,"[html tutorial, html5 tutorial, html, html5, l...",904761,26935,343,1604,PT1H9M34S,UCWv7vMbMWH4-V0ZXdmDpPBA,2021-01-11T14:30:10Z
3,-_X6PhkjpzU,5 Front-end Development Skills to Land Your Fi...,"[front-end development, front end development,...",374612,21094,146,617,PT9M2S,UCWv7vMbMWH4-V0ZXdmDpPBA,2021-01-07T14:30:14Z
4,Nb0btdq1164,"Learn Web Development and Make $78,000/y","[web development, web development 2021, progra...",99544,7022,124,454,PT5M15S,UCWv7vMbMWH4-V0ZXdmDpPBA,2021-01-05T14:30:11Z


In [123]:
# save the data in csv file
df.to_csv('videosBD2.csv', index=False)

In [124]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48749 entries, 0 to 48748
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   videoId       48749 non-null  object
 1   title         48749 non-null  object
 2   tags          45448 non-null  object
 3   viewCount     48737 non-null  object
 4   likeCount     48683 non-null  object
 5   dislikeCount  48683 non-null  object
 6   commentCount  48680 non-null  object
 7   duration      48749 non-null  object
 8   channelId     48749 non-null  object
 9   date          48749 non-null  object
dtypes: object(10)
memory usage: 3.7+ MB
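
The `info()` output shows that every column is stored as `object`, and the `duration` values seen in `df.head()` are ISO 8601 strings such as `PT56M4S`. Before playlist durations can be summed, those strings need to become numbers. The helper below is a minimal sketch of one way to do that with the standard `re` module. The name `duration_to_seconds` is an assumption introduced here, and the pattern only covers the day, hour, minute, and second designators.

```python
import re

# Matches durations such as PT49S, PT56M4S, PT1H9M34S or P1DT2H
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(duration):
    """Convert an ISO 8601 duration string into a number of seconds."""
    match = _DURATION_RE.match(duration)
    if match is None:
        raise ValueError(f"unsupported duration format: {duration!r}")
    # Designators absent from the string count as zero
    parts = {name: int(value) if value else 0
             for name, value in match.groupdict().items()}
    return (parts["days"] * 86400 + parts["hours"] * 3600
            + parts["minutes"] * 60 + parts["seconds"])


print(duration_to_seconds("PT56M4S"))
```

On the DataFrame, this could be applied with something like `df['duration'].map(duration_to_seconds)`, and the count columns could presumably be converted with `pd.to_numeric`.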


In [126]:
len(df.videoId.unique())

48749

### Playlist items 

In [140]:
def getPlaylistId(youtube, channelsId):

    '''
    Get all playlists for a given list of channel ids and save the result in a dict.

    Return:

        pl_dict (dict): a dict of lists with the following keys
            playlistId | title | description | itmCount | publishedAt | channelId

    '''
    pl_dict = {'playlistId':[], 'title': [], 'description': [] ,'itmCount':[], 'publishedAt': [],'channelId':[]}

    
    for channelId in channelsId:
        
        nextPageToken = None
        
        while True:

            request = youtube.playlists().list(
                part ='contentDetails, snippet',
                channelId=channelId,
                maxResults=50,
                pageToken=nextPageToken,)
            response = request.execute()


            for item in response['items']:

                pl_dict['playlistId'].append(item['id'])
                pl_dict['title'].append(item['snippet']['title'])
                pl_dict['description'].append(item['snippet']['description'])
                pl_dict['publishedAt'].append(item['snippet']['publishedAt'])
                pl_dict['itmCount'].append(item['contentDetails']['itemCount'])
                pl_dict['channelId'].append(channelId)

            nextPageToken = response.get('nextPageToken')

            if not nextPageToken:

                break

        #df = pd.DataFrame.from_dict(pl_dict)

    return pl_dict

In [131]:
channels_id_list = summery_df.channelId.values

In [132]:
len(channels_id_list)

68

In [141]:
from functools import partial
start = time.time()

pl_dic = map_reduce(channels_id_list, 8, partial(getPlaylistId, youtube), reducer)

end = time.time()

print(end - start)

10.29264235496521


In [None]:
pl_dic

In [142]:
playlis_df = pd.DataFrame.from_dict(pl_dic)

In [None]:
playlis_df

In [158]:
# save the DataFrame in csv file
playlis_df.to_csv('playlistBD2.csv', index=False)

In [147]:
playlist_id_list = playlis_df.playlistId.values

In [146]:
len(playlist_id_list)

4002

In [156]:
def getPlaylistItems(youtube, playlists_id):
    '''
    Return the video ids in the given playlists
    Args:
        youtube (service object): the YouTube API client
        playlists_id (list of str): the playlist ids
    Return:
        dict: {'videoId': list, 'playlistId': list}
    '''

    
    playlist_items = {'videoId': [], 'playlistId': []}
    
    for playlist_id in playlists_id:

        nextPageToken = None
        while True:
            pl_request = youtube.playlistItems().list(
                part ='contentDetails',
                playlistId=playlist_id,
                maxResults=50,
                pageToken=nextPageToken,
                )

            pl_response = pl_request.execute()


            for item in pl_response['items']:

                video_id = item['contentDetails']['videoId']
                playlist_items['playlistId'].append(playlist_id)
                playlist_items['videoId'].append(video_id) # a video can be in more than one playlist




            nextPageToken = pl_response.get('nextPageToken')

            if not nextPageToken:
                break

    return playlist_items

In [157]:
start = time.time()

pl_item_dic = map_reduce(playlist_id_list, 8, partial(getPlaylistItems, youtube), reducer)

end = time.time()

print(end - start)

38.88629937171936


In [None]:
pl_item_dic

In [160]:
pl_item_df = pd.DataFrame.from_dict(pl_item_dic)

In [None]:
pl_item_df

In [163]:
len(pl_item_df.videoId.unique())

39925

Only 39925 distinct videos appear in playlists, fewer than the 48749 videos collected earlier: a video can be included in several playlists, or in none at all.
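
Both cases can be separated by grouping the playlist ids per video. The following is a sketch on hypothetical toy data that mirrors the `{'videoId': [...], 'playlistId': [...]}` layout returned by `getPlaylistItems`. The helper name `playlist_membership` and the ids are illustrative assumptions.

```python
from collections import defaultdict


def playlist_membership(all_video_ids, playlist_items):
    """Return (videos found in several playlists, videos found in no playlist)."""
    playlists_by_video = defaultdict(set)
    for video_id, playlist_id in zip(playlist_items["videoId"],
                                     playlist_items["playlistId"]):
        playlists_by_video[video_id].add(playlist_id)

    in_several = {video for video, playlists in playlists_by_video.items()
                  if len(playlists) > 1}
    in_none = set(all_video_ids) - set(playlists_by_video)
    return in_several, in_none


# Hypothetical toy data
all_videos = ["v1", "v2", "v3"]
items = {"videoId": ["v1", "v1", "v2"], "playlistId": ["p1", "p2", "p1"]}

in_several, in_none = playlist_membership(all_videos, items)
print(in_several, in_none)
```

With the notebook's data, `all_video_ids` would correspond to `df.videoId` and `playlist_items` to `pl_item_dic`.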

In [165]:
pl_item_df.to_csv('playlistItemsDB2.csv', index=False)