## Course 1: Foundation of information

**Assignment**: Data extraction and analysis from the social media platform YouTube (30 marks)

**Problem statement**

Video is a fast-growing medium through which people communicate, share knowledge, and showcase skills. YouTube is one of the largest video-hosting platforms, with content from many professions, arts, and cultures across the world.

Viewers can express their opinion of a video through likes, dislikes, and comments. These features, provided by the YouTube platform, give information about the sentiment towards the video.

The assignment involves programmatically extracting data from YouTube and analysing it to understand various attributes of a video.

**Steps to be performed**

1. Connect to the YouTube API using a Python client (5 marks)



> 1.a Create a YouTube API key (3 marks)



In [2]:
pip install python-dotenv

Collecting python-dotenv
  Obtaining dependency information for python-dotenv from https://files.pythonhosted.org/packages/6a/3e/b68c118422ec867fa7ab88444e1274aa40681c606d59ac27de5a5588f082/python_dotenv-1.0.1-py3-none-any.whl.metadata
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1
Note: you may need to restart the kernel to use updated packages.


In [3]:
# do not save the API key here for security reasons; load it from a .env file
import os
from dotenv import load_dotenv

# Load the environment variables from .env file
load_dotenv()

api_service_name = "youtube"
api_version = "v3"
developer_key = os.getenv("GCLOUD_API_KEY")
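`os.getenv` returns `None` when the variable is not set, so a missing or misnamed entry in `.env` would only surface later as an API error. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper introduced here for illustration, not part of python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset/empty."""
    # require_env is a hypothetical helper, not a python-dotenv function.
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file"
        )
    return value


# Usage (after load_dotenv() has run):
# developer_key = require_env("GCLOUD_API_KEY")
```

Raising early keeps the key out of the notebook while still making a configuration mistake obvious.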



> 1.b Install the Google API Python client (2 marks)



In [1]:
pip install google-api-python-client

Collecting google-api-python-client
  Obtaining dependency information for google-api-python-client from https://files.pythonhosted.org/packages/73/e4/d8d38ca79045a72880c98e6d2ebc737c92d596d5dc0bf2e4233b00be5daa/google_api_python_client-2.116.0-py2.py3-none-any.whl.metadata
  Downloading google_api_python_client-2.116.0-py2.py3-none-any.whl.metadata (6.6 kB)
Collecting httplib2<1.dev0,>=0.15.0 (from google-api-python-client)
  Downloading httplib2-0.22.0-py3-none-any.whl (96 kB)
Collecting google-auth<3.0.0.dev0,>=1.19.0 (from google-api-python-client)
  Obtaining dependency information for google-auth<3.0.0.dev0,>=1.19.0 from https://files.pythonhosted.org/packages/82/41/7fb855444cead5b2213e053447ce3a0b7bf2c3529c443e0cf75b2f13b405/google_auth-2.27.0-py2.py3-none-any.whl.metadata
  Downloading google_auth-2.27.0-py2.py3-none-any.whl.metadata (4

Refer to the [supporting](https://developers.google.com/youtube/v3/getting-started) link for how to create a YouTube API key.

Reference link : https://developers.google.com/youtube/v3/quickstart/python

In [40]:
import googleapiclient.discovery
from pprint import pprint

# Resource construction: api_service_name, api_version, developer_key already loaded in previous code block
youtube = googleapiclient.discovery.build(api_service_name, api_version, developerKey=developer_key)

2. Search and extract the data



> 2.a Search videos related to the query string “avatar movie”
(For this part, choose one video from the search results and perform the data collection steps on that specific video) (3 marks)

> Output expected: ID and snippet with the following attributes: channel ID, video description, channel title, video title






Reference link:  https://developers.google.com/youtube/v3/docs/search/list

In [44]:
# How many results should we fetch? Is there a specific format in which we should print the results? What all does perform data collection step include?
request = youtube.search().list(
  part="id,snippet",
  type="video",
  q="avatar movie",
  maxResults=1,
  fields="items(id(videoId),snippet(channelId,description,channelTitle,title))"
)

def get_formatted_search_results(response):
  to_return = []
  for data in response.get('items', []):
    to_return.append({
      'id': data['id']['videoId'],
      'snippet': {
        'channel_id': data['snippet']['channelId'],
        'video_description': data['snippet']['description'],
        'channel_title': data['snippet']['channelTitle'],
        'video_title': data['snippet']['title'],
      }
    })
  return to_return

try:
  search_response = request.execute()
  avatar_movie_results = get_formatted_search_results(search_response)
  pprint(avatar_movie_results)
except googleapiclient.errors.HttpError as e:
  # Handle HTTP errors
  print("An HTTP error occurred:", e.resp.status, e.content)
except googleapiclient.errors.Error as e:
  # Handle other googleapiclient errors
  print("An error occurred:", e)
except Exception as e:
  # Handle other exceptions, such as connection errors
  print("An unexpected error occurred:", str(e))

[{'id': 'PLtgIILX7E8',
  'snippet': {'channel_id': 'UC0A86RKLCqTEUna3hPlEpzg',
              'channel_title': 'Superhero FXL Games',
              'video_description': 'AVATAR Full Movie 2023: Fallen Kingdom | '
                                   'Superhero FXL Action Movies 2023 in '
                                   'English (Game Movie). Best Action Game ...',
              'video_title': 'AVATAR Full Movie 2023: Fallen Kingdom | '
                             'Superhero FXL Action Movies 2023 in English '
                             '(Game Movie)'}}]
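Titles and descriptions returned by the search endpoint are commonly reported to contain HTML entities such as `&amp;` or `&#39;`. Treat that as an assumption to check against your own results. If it holds, the fields can be decoded before printing or exporting. A minimal sketch, where `clean_snippet_text` is a hypothetical helper name:

```python
import html


def clean_snippet_text(text: str) -> str:
    """Decode HTML entities and trim surrounding whitespace in a snippet field."""
    # clean_snippet_text is a hypothetical helper, not part of the API client.
    return html.unescape(text).strip()


# Usage inside get_formatted_search_results, for example:
# 'video_title': clean_snippet_text(data['snippet']['title'])
```

`html.unescape` leaves text without entities unchanged, so applying it to every snippet field should be harmless.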



> 2.b Provide the following statistics for the top 50 videos for the query string “avatar movie”, sorted by relevance in the US region (7 marks)

> Output expected: video ID, title, number of views, number of likes, and number of comments, exported to a CSV file






Reference link: https://developers.google.com/youtube/v3/docs/videos/list

In [29]:
import csv 

# fetching 50 results, order/sort by relevance is default, regionCode = US
search_req = youtube.search().list(
  part="id,snippet",
  type="video",
  q="avatar movie",
  maxResults=50,
  regionCode="US",
  fields="items(id(videoId),snippet(title))"
)

# method to call the videos API to get statistics for the video IDs fetched by the previous request; use try/except if necessary
def video_api_request(video_ids):
  request_2b_video = youtube.videos().list(
    part="id,statistics",
    id=",".join(video_ids),
  )
  video_resp = request_2b_video.execute()
  return video_resp

# execution steps, add try catch if necessary
search_res = search_req.execute()

# keeping a map to reduce time complexity
search_video_results = {}
for res_data in search_res.get('items', []):
  video_id = res_data['id']['videoId']
  search_video_results[video_id] = {
    'id': video_id,
    'title': res_data['snippet']['title']
  }

# make a tuple for video api call to join and use
video_ids = tuple(search_video_results.keys())

video_api_response = video_api_request(video_ids)
for video_data in video_api_response.get('items', []):
  curr_id = video_data['id']
  curr_search_video = search_video_results[curr_id]
  curr_search_video['total_likes'] = video_data['statistics'].get('likeCount') # add default value as 0 if necessary
  curr_search_video['total_views'] = video_data['statistics'].get('viewCount')
  curr_search_video['total_comments'] = video_data['statistics'].get('commentCount')

fields = ['id', 'title', 'total_likes', 'total_views', 'total_comments']

filename = "assignment_2b.csv"

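# newline='' avoids blank rows on Windows; utf-8 keeps non-ASCII titles intact
with open(filename, 'w', newline='', encoding='utf-8') as csvfile: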
  writer = csv.DictWriter(csvfile, fieldnames=fields)
  writer.writeheader()
  writer.writerows(list(search_video_results.values()))
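Two details in the cell above are worth making explicit: the statistics values arrive as strings, and some keys (for example `likeCount` or `commentCount`) can be absent for a given video. The sketch below shows one way to convert them with a default and to split a longer ID list into batches. Both `chunk_ids` and `stat_as_int` are hypothetical helpers, and the batch size of 50 is an assumption about the per-request limit that should be checked against the API reference.

```python
from typing import Iterable, Iterator, List, Mapping


def chunk_ids(video_ids: Iterable[str], size: int = 50) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` IDs (50 is an assumed limit)."""
    batch: List[str] = []
    for vid in video_ids:
        batch.append(vid)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def stat_as_int(statistics: Mapping[str, str], key: str, default: int = 0) -> int:
    """Convert a statistics value (a string in the response) to int, or use `default`."""
    raw = statistics.get(key)
    return int(raw) if raw is not None else default


# Usage with the names from the cell above:
# for batch in chunk_ids(video_ids):
#     response = video_api_request(batch)
#     for item in response.get('items', []):
#         views = stat_as_int(item['statistics'], 'viewCount')
```

Defaulting to 0 makes the CSV column purely numeric, at the cost of no longer distinguishing "zero" from "not reported".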


In [None]:
pip install pandas

3. Analyze the exported data obtained in 2.b and carry out the following tasks (15 marks)



> 3.a Sort the data from 2.b by number of comments in descending order and take the video IDs and titles of the 10 videos with the most comments. (3 marks)



In [38]:
import pandas as pd
# filename = "assignment_2b.csv" already declared in previous block
df = pd.read_csv(filename)

top_10_comments = df.sort_values('total_comments', ascending=False)[:10]
print(top_10_comments[["id", "title"]]) # Should I only show id and title of top 10 comments?

             id                                              title
7   d9MyW72ELq0        Avatar: The Way of Water | Official Trailer
1   waJKJW_XU90  Avatar: The Last Airbender | Official Teaser |...
48  a8Gx8wiNbs8  Avatar: The Way of Water | Official Teaser Tra...
13  2r71I8lvTIA  The Last Airbender Film: How it Disrespected a...
0   ByAn8DF8Ykk  Avatar: The Last Airbender | Official Trailer ...
14  -egQ79OrYCs  THE LAST AIRBENDER (2010) | Hollywood.com Movi...
47  zj6p5kYnPPY  Film Theory: END the Avatar Cycle! (Avatar the...
4   5PSNL1qE6VY  Avatar | Official Trailer (HD) | 20th Century FOX
36  eJBPO76TGUQ             Avatar: The Way of Water Pitch Meeting
35  RGx8rYbRVR4  Why People Hate Avatar: A Lesson In Lazy Comme...
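As a cross-check on the pandas result, the same selection can be sketched with only the standard library. The function below assumes the column names written in 2.b (`id`, `title`, `total_comments`) and treats blank comment counts as lowest. `top_n_by_comments` is a hypothetical name.

```python
import csv
from typing import Dict, List


def top_n_by_comments(csv_path: str, n: int = 10) -> List[Dict[str, str]]:
    """Return the n rows with the highest total_comments, highest first."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    def comment_count(row: Dict[str, str]) -> int:
        # Blank cells (statistic missing for that video) sort last.
        value = row.get("total_comments") or ""
        return int(float(value)) if value else -1

    return sorted(rows, key=comment_count, reverse=True)[:n]


# Usage:
# for row in top_n_by_comments("assignment_2b.csv"):
#     print(row["id"], row["title"])
```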



> 3.b Use a suitable method to retrieve the comments of the top 10 videos from 3.a. Write a program that loops through each video ID from 3.a and passes the part parameter set to "snippet" to retrieve basic details about the comments. Execute each request and print the response using the pprint() method.
- Note: pprint() prints the API response in a more human-readable format.
- Reference link: [link](https://developers.google.com/youtube/v3/docs)


> **Output expected**: Use the Python library “pprint” to print the program output with the following properties: etag, items, id, kind, and snippet, where the snippet includes the text display field (`textDisplay`), which holds the comment text.






In [41]:
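def comment_threads_api(video_id):
  # Fetch the first page of comment threads for one video
  try:
    results = youtube.commentThreads().list(
      part="id,snippet",
      videoId=video_id,
      textFormat="plainText"
    ).execute()
    return results
  except googleapiclient.errors.HttpError as e:
    # For example, comments may be disabled on the video; report and continue
    print("Could not fetch comments for", video_id, ":", e.resp.status)
    return {}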

comments_on_videos = []
for index, t_10c_row in top_10_comments.iterrows():
  video_id = t_10c_row['id']
  comment_threads_response = comment_threads_api(video_id)
  comments_on_videos.append(comment_threads_response)
  pprint(comment_threads_response)


{'etag': 'fMesKevlp3BImRgczvsvbGWJhjI',
 'items': [{'etag': 'bJ4KIdiRec-mERPPzFwJScY5SOg',
            'id': 'UgznmmfiK8a9pOGNVf14AaABAg',
            'kind': 'youtube#commentThread',
            'snippet': {'canReply': True,
                        'channelId': 'UCgjxQJ6TlKqhHax8742ZMdA',
                        'isPublic': True,
                        'topLevelComment': {'etag': '_cyiXDYL4qeckubL1aKnlBaNeRQ',
                                            'id': 'UgznmmfiK8a9pOGNVf14AaABAg',
                                            'kind': 'youtube#comment',
                                            'snippet': {'authorChannelId': {'value': 'UC59GRUoJK0KiQ2B-gSPViWw'},
                                                        'authorChannelUrl': 'http://www.youtube.com/@ambitious3164',
                                                        'authorDisplayName': '@ambitious3164',
                                                        'authorProfileImageUrl': 'https://yt3.ggpht.com/m6F
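The comment text sits several levels deep in each response, under `items[].snippet.topLevelComment.snippet.textDisplay`. A minimal sketch for pulling out just those strings from one response page follows; `extract_comment_texts` is a hypothetical helper, and it tolerates the empty dict returned when a request fails.

```python
from typing import Any, Dict, List


def extract_comment_texts(response: Dict[str, Any]) -> List[str]:
    """Collect the textDisplay of each top-level comment in one response page."""
    texts: List[str] = []
    for item in response.get("items", []):
        top_level = item.get("snippet", {}).get("topLevelComment", {})
        text = top_level.get("snippet", {}).get("textDisplay")
        if text is not None:
            texts.append(text)
    return texts


# Usage with the list built in the loop above:
# for response in comments_on_videos:
#     pprint(extract_comment_texts(response))
```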



> 3.c Write a program to export the output of question 3.b in JSON file format and submit the file as part of the assignment (3 marks)



In [42]:
import json

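# ensure_ascii=False writes raw non-ASCII characters, so set the encoding explicitly
with open('3b_video_comments.json', 'w', encoding='utf-8') as jsonf: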
  json.dump(comments_on_videos, jsonf, ensure_ascii=False, indent=2)
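Before submitting the file, it can help to read it back and confirm it parses and holds one response per video. A minimal sketch, assuming the export is a list of response dicts as produced above; `count_threads_per_video` is a hypothetical name.

```python
import json
from typing import Any, List


def count_threads_per_video(json_path: str) -> List[int]:
    """Load the exported file and return the number of comment threads per response."""
    with open(json_path, encoding="utf-8") as f:
        responses: List[Any] = json.load(f)
    # An empty dict (failed request) counts as zero threads.
    return [len(resp.get("items", [])) for resp in responses]


# Usage:
# print(count_threads_per_video("3b_video_comments.json"))
```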

> 3.d Write a function to get the likes vs views ratio of the top 10 videos obtained in 3.a with the highest comments (3 marks)




In [43]:
def get_likes_views_ratio(video_data_row):
  likes = video_data_row['total_likes']
  views = video_data_row['total_views']
  if views == 0:
    return float('inf')
  return likes/views

def get_all_likes_views_ratio(video_data):
  likes_views = []
  for index, row in video_data.iterrows():
    likes_views.append(get_likes_views_ratio(row))
  return likes_views

likes_views = get_all_likes_views_ratio(top_10_comments)
print(likes_views)

[0.01780110408801011, 0.02279115533301114, 0.024409558301083923, 0.030070497679658897, 0.04086704313811995, 0.002089046082081093, 0.04245095447791101, 0.006327934736220237, 0.04682316225421322, 0.05628171472149708]
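The ratio is simply likes divided by views, but a row read from the CSV may have a missing like count (which pandas loads as NaN) or zero views. One possible alternative to returning infinity is to return `None` for any case where the ratio is undefined. This is a design choice rather than a requirement of the assignment, and `safe_ratio` is a hypothetical helper.

```python
import math
from typing import Optional


def safe_ratio(likes: Optional[float], views: Optional[float]) -> Optional[float]:
    """Return likes / views, or None when a value is missing or views is zero."""
    for value in (likes, views):
        if value is None or math.isnan(value):
            return None
    if views == 0:
        return None
    return likes / views


# Usage with the DataFrame from 3.a:
# ratios = [safe_ratio(r['total_likes'], r['total_views'])
#           for _, r in top_10_comments.iterrows()]
```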
