The first part of the project pulls real-world data from the **YouTube API** instead of relying on an existing static dataset, such as one downloaded from Kaggle.

**Benefits** of using an API to extract data:
* Real-world coding and analytics skills are exercised
* Experience with modern technologies and tools is gained

Overview of the **steps**:
- Pull the data
- Look through the JSON response
- Save the data in a pandas data frame


Start by importing the necessary libraries: "requests" to make the API calls, "pandas" to store the data in a data frame, and "time" to pause between calls.

In [1]:
#import libraries
import requests
import pandas as pd
import time

What is an API?

An Application Programming Interface (API) lists the operations a developer can use, along with their descriptions, and defines how pieces of code interact with each other.
 
In layman's terms, an API serves the same purpose as a waiter in a restaurant. The waiter lists the available choices and takes an order, which is then transferred to the kitchen (the system). The order is prepared there (how is not our concern) and the waiter delivers the response back to us.

The **API key** authenticates the user and allows them to make API calls to YouTube.

Steps to create a YouTube API key:
* Visit the Google Developers Console
* Create a new project
* Enable APIs and Services
* Go to "Credentials" -> "Create Credentials" -> "API Key"



In [2]:
API_KEY = "insert_your_key"
CHANNEL_ID  = "UCW8Ews7tdKKkBT6GdtQaXvQ"

Making an **API call**

To make a call, a URL is passed to the API. The URL gives the location of the API and specifies the data requested. The HTTP GET URL listed in the YouTube Data API overview is only the root of the URL; parameters must be added depending on the data to be extracted.

The following are the necessary parameters for the scope of this project:
 
* snippet and id information 
* channel_id
* maximum results
* order the results by date
* pageToken, which allows paging through the results until all the videos are collected
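
The parameters listed above can also be assembled with the standard library rather than by string concatenation. This is only a minimal sketch: the helper name `build_search_url` and its arguments are hypothetical, while the parameter names are the ones used in this project's search URL. Note that `urlencode` percent-encodes the comma in `snippet,id`.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"

def build_search_url(api_key, channel_id, page_token="", max_results=50):
    """Build the search URL (hypothetical helper, for illustration only)."""
    params = {
        "key": api_key,
        "channelId": channel_id,
        "part": "snippet,id",
        "order": "date",
        "maxResults": max_results,
    }
    # Only send pageToken when there is a page to continue from
    if page_token:
        params["pageToken"] = page_token
    return SEARCH_ENDPOINT + "?" + urlencode(params)

print(build_search_url("insert_your_key", "UCW8Ews7tdKKkBT6GdtQaXvQ"))
```

Building the query from a dictionary keeps each parameter visible on its own line and avoids mistakes with stray "&" characters.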

The get method makes the API call and grabs data from that call. The data will be returned in **JSON** form.

JSON (JavaScript Object Notation) is a popular text-based data format, modelled on JavaScript object syntax, that stores data as key-value (attribute-value) pairs.
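
To make the key-value idea concrete, here is a small sketch that parses a JSON string into a Python dictionary. The string is hand-written for illustration and only imitates the shape of a search response; calling .json() on a requests response performs the same kind of parsing.

```python
import json

# Hand-written JSON text, imitating the outer shape of a search response
raw = '{"kind": "youtube#searchListResponse", "pageInfo": {"totalResults": 92, "resultsPerPage": 50}, "items": []}'

data = json.loads(raw)  # JSON text -> Python dict

print(type(data).__name__)               # dict
print(data["pageInfo"]["totalResults"])  # 92
```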

In [5]:
pageToken=""
url = "https://www.googleapis.com/youtube/v3/search?key="+API_KEY+"&channelId="+CHANNEL_ID+"&part=snippet,id&order=date&maxResults=50&"+pageToken

response = requests.get(url).json()

response

{'kind': 'youtube#searchListResponse',
 'etag': 'WmqCtFXhmKf2vxhW_4fF2HV7AWw',
 'nextPageToken': 'CDIQAA',
 'regionCode': 'NL',
 'pageInfo': {'totalResults': 92, 'resultsPerPage': 50},
 'items': [{'kind': 'youtube#searchResult',
   'etag': '_WQ1Gi8GAuGmDYFBWXBp9BpABVA',
   'id': {'kind': 'youtube#video', 'videoId': 'tnoOz6MzTPg'},
   'snippet': {'publishedAt': '2022-05-11T15:01:39Z',
    'channelId': 'UCW8Ews7tdKKkBT6GdtQaXvQ',
    'title': 'Python cumsum() | Solving Python Optimization Questions On a Data Science Interview',
    'description': "In this video, we'll take a close look at one of the larger families of data science questions concerning optimization. We'll use the ...",
    'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/tnoOz6MzTPg/default.jpg',
      'width': 120,
      'height': 90},
     'medium': {'url': 'https://i.ytimg.com/vi/tnoOz6MzTPg/mqdefault.jpg',
      'width': 320,
      'height': 180},
     'high': {'url': 'https://i.ytimg.com/vi/tnoOz6MzTPg/hqdef

Reading through the JSON object, "items" is identified as the key holding all the relevant video information to be examined.

In [6]:
response['items']

[{'kind': 'youtube#searchResult',
  'etag': '_WQ1Gi8GAuGmDYFBWXBp9BpABVA',
  'id': {'kind': 'youtube#video', 'videoId': 'tnoOz6MzTPg'},
  'snippet': {'publishedAt': '2022-05-11T15:01:39Z',
   'channelId': 'UCW8Ews7tdKKkBT6GdtQaXvQ',
   'title': 'Python cumsum() | Solving Python Optimization Questions On a Data Science Interview',
   'description': "In this video, we'll take a close look at one of the larger families of data science questions concerning optimization. We'll use the ...",
   'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/tnoOz6MzTPg/default.jpg',
     'width': 120,
     'height': 90},
    'medium': {'url': 'https://i.ytimg.com/vi/tnoOz6MzTPg/mqdefault.jpg',
     'width': 320,
     'height': 180},
    'high': {'url': 'https://i.ytimg.com/vi/tnoOz6MzTPg/hqdefault.jpg',
     'width': 480,
     'height': 360}},
   'channelTitle': 'StrataScratch',
   'liveBroadcastContent': 'none',
   'publishTime': '2022-05-11T15:01:39Z'}},
 {'kind': 'youtube#searchResult',
  'etag

Next, I go through the attributes that I'm interested in saving. For this example, I'll save the video id in a **variable**: starting from the outer key, 'items', I drill down to the key 'id' and finally select 'videoId'. The process is repeated for the rest of the variables.

In this case only the first video, at index 0, is selected. To generalize to all the videos, an **iterative process** will be used.

In [10]:
video_id = response['items'][0]['id']['videoId']

video_title = response['items'][0]['snippet']['title']
#removing the HTML-escaped ampersand ("&amp;") from the title
video_title = str(video_title).replace("&amp;","")

upload_date = response['items'][0]['snippet']['publishedAt']
#to remove the time information, split on the "T" and keep the part to its left (index 0)
upload_date = str(upload_date).split("T")[0]
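
The two clean-up steps above can be tried on their own without calling the API. The sketch below uses a made-up title and the timestamp from the response shown earlier. It also shows `html.unescape` from the standard library as one possible alternative to `replace`: it turns the escaped "&amp;" back into "&" instead of deleting it. Whether that is preferable depends on how the titles will be used later.

```python
import html

raw_title = "Tips &amp; Tricks for SQL Interviews"  # made-up title, for illustration

# Approach used in this project: drop the escaped ampersand entirely
print(raw_title.replace("&amp;", ""))  # Tips  Tricks for SQL Interviews

# Alternative: decode HTML entities back to plain characters
print(html.unescape(raw_title))        # Tips & Tricks for SQL Interviews

# Keep only the date part of the ISO 8601 timestamp
published_at = "2022-05-11T15:01:39Z"
upload_date = published_at.split("T")[0]
print(upload_date)                     # 2022-05-11
```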

Creating a **for loop** to iterate the above process for all the videos in the channel.

Using the loop variable 'video', I go through every item in the list response['items'] and grab the relevant information. The check on 'kind' skips search results that are not videos, such as playlists or channels.

In [6]:
for video in response['items']:
    if video['id']['kind'] == 'youtube#video':
        video_id = video['id']['videoId']
        
        video_title = video['snippet']['title']
        video_title = str(video_title).replace("&amp;","")

        upload_date = video['snippet']['publishedAt']
        upload_date = str(upload_date).split("T")[0]

However, this API call returns no statistics such as views, likes or dislikes, so a **second, separate API call** is necessary.

To grab the statistics, this URL takes the video_id extracted above as a parameter, together with the part to return (statistics) and the API key.
The statistics extracted are:
* View count
* Like count
* Favourite count
* Comment count


In [11]:
url_video_stats = "https://www.googleapis.com/youtube/v3/videos?id="+video_id+"&part=statistics&key="+API_KEY

response_video_stats = requests.get(url_video_stats).json()

view_count = response_video_stats['items'][0]['statistics']['viewCount']
like_count = response_video_stats['items'][0]['statistics']['likeCount']
favorite_count = response_video_stats['items'][0]['statistics']['favoriteCount']
comment_count = response_video_stats['items'][0]['statistics']['commentCount']
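
The direct lookups above raise a KeyError if a statistic is missing from the response. As far as I understand, this can happen, for example, when likes are hidden or comments are disabled on a video, although that is an assumption worth checking against the API reference. A defensive sketch, using a hypothetical helper `extract_stats` and a made-up sample response:

```python
def extract_stats(video_response):
    """Return the four counts, using None for any that are missing (hypothetical helper)."""
    items = video_response.get("items", [])
    stats = items[0].get("statistics", {}) if items else {}
    return {
        "view_count": stats.get("viewCount"),
        "like_count": stats.get("likeCount"),
        "favorite_count": stats.get("favoriteCount"),
        "comment_count": stats.get("commentCount"),
    }

# Made-up response in which likeCount and commentCount are absent
sample = {"items": [{"statistics": {"viewCount": "1200", "favoriteCount": "0"}}]}
print(extract_stats(sample))
```

Missing values come back as None, which pandas later treats as missing data instead of stopping the whole loop.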

These statistics will be extracted for every video (unique video id) and therefore should be inserted into the previous for loop, which is rewritten below.

In [13]:
for video in response['items']:
    if video['id']['kind'] == 'youtube#video':
        video_id = video['id']['videoId']
        
        video_title = video['snippet']['title']
        video_title = str(video_title).replace("&amp;","")

        upload_date = video['snippet']['publishedAt']
        upload_date = str(upload_date).split("T")[0]

        # extract views, likes, favourites and number of comments
        url_video_stats = "https://www.googleapis.com/youtube/v3/videos?id="+video_id+"&part=statistics&key="+API_KEY

        response_video_stats = requests.get(url_video_stats).json()

        view_count = response_video_stats['items'][0]['statistics']['viewCount']
        like_count = response_video_stats['items'][0]['statistics']['likeCount']
        favorite_count = response_video_stats['items'][0]['statistics']['favoriteCount']
        comment_count = response_video_stats['items'][0]['statistics']['commentCount']
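
The loop above makes one statistics request per video. To my knowledge, the videos endpoint also accepts a comma-separated list of ids in the id parameter (I believe up to 50 per request), which would reduce the number of calls. The sketch below only builds the URLs and makes no request; `chunked` and `build_stats_url` are hypothetical helper names and the ids are made up.

```python
def chunked(ids, size=50):
    """Yield consecutive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def build_stats_url(video_ids, api_key):
    """Build one statistics URL covering several video ids (hypothetical helper)."""
    return ("https://www.googleapis.com/youtube/v3/videos?id=" + ",".join(video_ids)
            + "&part=statistics&key=" + api_key)

video_ids = ["id_%d" % n for n in range(120)]  # made-up ids
urls = [build_stats_url(batch, "insert_your_key") for batch in chunked(video_ids)]
print(len(urls))  # 3
```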


With the above code, the variables are overwritten on every iteration, so a data frame is created to hold the relevant data for every video.
The data frame has 7 columns, corresponding to the collected variables.

In [11]:
df = pd.DataFrame(columns=['video_id', 'video_title', 'upload_date', 'view_count', 'like_count', 'favorite_count', 'comment_count'])

**Append method**

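An empty data frame has been created, and all that's left is to store the data in the table.
The append() method comes in handy: on every iteration it returns a new data frame with one more row at the end, holding the details of the current video, and the result is assigned back to df. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so this code requires an older pandas version.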

In [25]:
for video in response['items']:
    if video['id']['kind'] == 'youtube#video':
        video_id = video['id']['videoId']
        
        video_title = video['snippet']['title']
        video_title = str(video_title).replace("&amp;","")

        upload_date = video['snippet']['publishedAt']
        upload_date = str(upload_date).split("T")[0]

        # extract views, likes, favourites and number of comments
        url_video_stats = "https://www.googleapis.com/youtube/v3/videos?id="+video_id+"&part=statistics&key="+API_KEY

        response_video_stats = requests.get(url_video_stats).json()

        view_count = response_video_stats['items'][0]['statistics']['viewCount']
        like_count = response_video_stats['items'][0]['statistics']['likeCount']
        favorite_count = response_video_stats['items'][0]['statistics']['favoriteCount']
        comment_count = response_video_stats['items'][0]['statistics']['commentCount']

        df = df.append({'video_id': video_id, 'video_title': video_title, 'upload_date': upload_date, 'view_count' : view_count, 'like_count' : like_count, 'favorite_count' : favorite_count, 'comment_count' : comment_count}, ignore_index = True)
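
On pandas 2.0 and later, where append() is unavailable, one common alternative is to collect the rows in a plain list of dictionaries and build the data frame once at the end. This is a sketch with made-up rows and only two of the seven columns, not the method used in this project:

```python
import pandas as pd

rows = []
# Made-up (video_id, video_title) pairs standing in for the API results
for video_id, video_title in [("id_1", "First video"), ("id_2", "Second video")]:
    rows.append({"video_id": video_id, "video_title": video_title})

# Build the data frame in one step from the collected rows
df_rows = pd.DataFrame(rows, columns=["video_id", "video_title"])
print(df_rows.shape)  # (2, 2)
```

Building the frame once also avoids copying the whole table on every iteration, which is what repeated appends do.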


Implementing **good Software Engineering fundamentals**

The API calls are mixed into the for loop. To make the code clearer and simpler, they can be broken down into several smaller blocks with the use of functions.

In [12]:
#import libraries
import requests
import pandas as pd
import time

In [21]:
#Keys
API_KEY = "insert_your_key"
CHANNEL_ID  = "UCZdQjaSoLjIzFnWsDQOv4ww"
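
A key written directly into a notebook is easy to publish by accident. One option is to read it from an environment variable instead. In this sketch the variable name YOUTUBE_API_KEY is an assumption, chosen only for illustration, and the placeholder is used when the variable is not set.

```python
import os

# YOUTUBE_API_KEY is a hypothetical variable name; set it in your shell before starting the notebook
API_KEY = os.environ.get("YOUTUBE_API_KEY", "insert_your_key")
```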

The part within the for loop where the video details are extracted is transformed into a function returning the count metrics.

In [14]:
def get_video_details(video_id):
    # extract views, likes, favourites and number of comments
    url_video_stats = "https://www.googleapis.com/youtube/v3/videos?id="+video_id+"&part=statistics&key="+API_KEY
    response_video_stats = requests.get(url_video_stats).json()

    view_count = response_video_stats['items'][0]['statistics']['viewCount']
    like_count = response_video_stats['items'][0]['statistics']['likeCount']
    favorite_count = response_video_stats['items'][0]['statistics']['favoriteCount']
    comment_count = response_video_stats['items'][0]['statistics']['commentCount']

    return view_count, like_count, favorite_count, comment_count


The rest of the for loop is transformed into the get_videos function, which takes a data frame as input. The function extracts the metrics for every video returned by the search call and appends them to the data frame.

Because requests.get() blocks until the response has arrived, the for loop cannot start before the data are collected. The time.sleep() call, which pauses execution for 1 second, is therefore not needed for correctness; it only serves to space out the requests to the API.

In [18]:
def get_videos(df):
    #make API call
    pageToken = ""
    url = "https://www.googleapis.com/youtube/v3/search?key="+API_KEY+"&channelId="+CHANNEL_ID+"&part=snippet,id&order=date&maxResults=50&"+pageToken

    response = requests.get(url).json()
    time.sleep(1) #give it a second before starting the for loop

    for video in response['items']:
        if video['id']['kind'] == "youtube#video":
            video_id = video['id']['videoId']
            video_title = video['snippet']['title']
            video_title = str(video_title).replace("&amp;","")
            upload_date = video['snippet']['publishedAt']
            upload_date = str(upload_date).split("T")[0]
            
            view_count, like_count, favourite_count, comment_count = get_video_details(video_id)

            df = df.append({'video_id':video_id,'video_title':video_title,
                            "upload_date":upload_date,"view_count":view_count,
                            "like_count":like_count,"favourite_count":favourite_count,
                            "comment_count":comment_count},ignore_index=True)
    return df
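
As written, get_videos reads a single page of results. The search response shown earlier contains a 'nextPageToken' key, and passing its value back as pageToken requests the following page. The sketch below shows that loop with the network call replaced by a stand-in, so it runs without an API key. `collect_all_items` and `fetch_page` are hypothetical names; in real use `fetch_page` would wrap requests.get(...).json().

```python
def collect_all_items(fetch_page):
    """Follow nextPageToken until no more pages are left.

    fetch_page(page_token) must return the parsed JSON dict for one page.
    """
    items = []
    page_token = ""
    while True:
        response = fetch_page(page_token)
        items.extend(response.get("items", []))
        page_token = response.get("nextPageToken", "")
        if not page_token:  # the last page carries no nextPageToken
            break
    return items

# Stand-in for the real API: two made-up pages keyed by page token
fake_pages = {
    "": {"items": [{"n": 1}, {"n": 2}], "nextPageToken": "CDIQAA"},
    "CDIQAA": {"items": [{"n": 3}]},
}

all_items = collect_all_items(lambda token: fake_pages[token])
print(len(all_items))  # 3
```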

With the functions defined, the main code below calls them and builds the data frame.

In [22]:
#main

#build our dataframe
df = pd.DataFrame(columns=["video_id","video_title","upload_date","view_count","like_count","favourite_count","comment_count"]) 

df = get_videos(df)

In [23]:
df.describe()

| | video_id | video_title | upload_date | view_count | like_count | favourite_count | comment_count |
|---|---|---|---|---|---|---|---|
| count | 50 | 50 | 50 | 50 | 50 | 50 | 50 |
| unique | 50 | 50 | 50 | 50 | 50 | 1 | 48 |
| top | ESjPKurqcC0 | HOW I QUIT MY JOB TO SAIL THE WORLD | 2022-05-09 | 339237 | 27080 | 0 | 1139 |
| freq | 1 | 1 | 1 | 1 | 1 | 50 | 2 |
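
The summary above reports count, unique, top and freq, which is what describe() shows for text columns: the API returns the counts as strings. To get numeric summaries such as the mean, the count columns can be converted first. A sketch on a small made-up frame, with the column names taken from this project:

```python
import pandas as pd

# Made-up rows; the counts are strings, as returned by the API
stats = pd.DataFrame({"view_count": ["339237", "1200"],
                      "like_count": ["27080", "35"]})

count_cols = ["view_count", "like_count"]
stats[count_cols] = stats[count_cols].apply(pd.to_numeric)  # strings -> integers

print(stats.dtypes)
print(stats["view_count"].mean())  # 170218.5
```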


This concludes the first part of the project. In the next part, I will upload the above data frame to a database.

To be continued...