## Course1 : Foundation of information

**Assignment**: Data extraction and analysis from the social media platform YouTube (30 Marks)

**Problem statement**

Video is a fast-growing medium through which people communicate, share knowledge, and showcase skills. YouTube is one of the largest video-hosting platforms, with content from many professions, arts, and cultures across the world.

Viewers can express their opinion of a video through likes, dislikes, and comments. These platform features provide information about the sentiment toward the video.

This assignment covers the steps for programmatically extracting data from YouTube so that it can be analyzed to understand various attributes of a video.

**Steps to be performed**

1. Connect to the YouTube API using a Python client (5 Marks)

> 1.a Create a YouTube API key (3 marks)

> 1.b Install the Google API Python client (2 marks)

Refer to the [supporting](https://developers.google.com/youtube/v3/getting-started) link for how to create a YouTube API key.

Reference link: https://developers.google.com/youtube/v3/quickstart/python

- We enabled the YouTube API in the Google Cloud console and generated an API key.
- We enabled access to public data only, so we do not have to deal with OAuth, since we do not need any personal data of the users.
- After creating the API key, we need to install the client dependency via pip.

We used the following command to install it in our virtualenv with pip.

```
pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib
```

In [8]:
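import os
import json
import csv
import pandas as pd
import numpy as np

from pprint import pprint
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

# Read the API key from an environment variable instead of hard-coding it.
# The variable name YOUTUBE_API_KEY is our own choice.
YOUTUBE_API_KEY = os.environ['YOUTUBE_API_KEY']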
YOUTUBE_API_SERVICE_NAME = 'youtube'
YOUTUBE_API_VERSION = 'v3'

youtube = build(YOUTUBE_API_SERVICE_NAME, YOUTUBE_API_VERSION, developerKey=YOUTUBE_API_KEY)

2. Search and extract the data

> 2.a Search videos related to the query string “avatar movie”
(For this part, choose/search one video of your choice and perform data collection steps on that specific video) (3 marks)

> Output expected: ID, Snippet with the following attributes: Channel ID, Video Description, Channel Title, Video Title

Reference link: https://developers.google.com/youtube/v3/docs/search/list

We collect only video results, since playlists and channels are not needed here; they could be fetched the same way by requesting that particular type.

**Also, just for reference, the instructions here and in the PDFs were different, which is not good.**
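
As a rough illustration of the point about result types: each item returned by `search.list` carries an `id` object whose `kind` says whether it is a video, channel, or playlist, with the identifier stored under `videoId`, `channelId`, or `playlistId` respectively. The sketch below runs on hand-written sample items rather than a live API response, and the helper name `result_id` is made up for this example.

```python
# Maps the `kind` of a search result to the key that holds its identifier.
ID_KEY_BY_KIND = {
    "youtube#video": "videoId",
    "youtube#channel": "channelId",
    "youtube#playlist": "playlistId",
}


def result_id(search_result):
    """Return (kind, identifier) for one search.list item (hypothetical helper)."""
    id_part = search_result["id"]
    kind = id_part["kind"]
    return kind, id_part[ID_KEY_BY_KIND[kind]]


# Hand-written sample items shaped like search.list results (not real API output).
sample_items = [
    {"id": {"kind": "youtube#video", "videoId": "VIDEO_ID_PLACEHOLDER"}},
    {"id": {"kind": "youtube#playlist", "playlistId": "PLAYLIST_ID_PLACEHOLDER"}},
]

for item in sample_items:
    print(result_id(item))
```

Restricting the request to `type="video"`, as done in this notebook, means only the `videoId` branch is ever needed.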

In [9]:
def youtube_search(search_query, max_results):
    # Call the search.list method to retrieve video results matching the
    # specified query term.
    search_response = youtube.search().list(
        q=search_query,
        part='id,snippet',
        type="video",
        maxResults=max_results
    ).execute()

    videos = []

    # Keep only the results that are videos.
    for search_result in search_response.get('items', []):
        if search_result['id']['kind'] == 'youtube#video':
            videos.append(search_result)

    return videos

def extract_data(video_data):
    # ID, Snippet with following attributes Channel ID, Video Description, Channel Title, Video Title
    video_snippet = video_data["snippet"]
    data = {
            "ID": video_data["id"]['videoId'],
            "Snippet": {
                "Channel ID": video_snippet['channelId'],
                "Video Description": video_snippet['description'],
                "Channel Title": video_snippet['channelTitle'],
                "Video Title": video_snippet['title']                
            }
           }
    pprint(data)

# We will fetch only 5 results
videos = youtube_search("avatar movie", 5)
extract_data(videos[0])


{'ID': 'PLtgIILX7E8',
 'Snippet': {'Channel ID': 'UC0A86RKLCqTEUna3hPlEpzg',
             'Channel Title': 'Superhero FXL Games',
             'Video Description': 'AVATAR Full Movie 2023: Fallen Kingdom | '
                                  'Superhero FXL Action Movies 2023 in English '
                                  '(Game Movie). Best Action Game ...',
             'Video Title': 'AVATAR Full Movie 2023: Fallen Kingdom | '
                            'Superhero FXL Action Movies 2023 in English (Game '
                            'Movie)'}}



> 2.b Provide the following statistics for the query string “avatar movie” for the top 50 videos sorted by relevance in the US region (7 marks)

> Output expected: video ID, title, number of views, number of likes, number of comments, exported to a CSV file

Reference link: https://developers.google.com/youtube/v3/docs/videos/list

In [10]:
def youtube_search_with_region(search_query, max_results, region_code):
    # search.list orders results by relevance by default; the order is
    # stated explicitly here to match the task.
    search_response = youtube.search().list(
        q=search_query,
        part='id,snippet',
        type="video",
        order="relevance",
        regionCode=region_code,
        maxResults=max_results
    ).execute()

    videos = []

    # Keep only the results that are videos.
    for search_result in search_response.get('items', []):
        if search_result['id']['kind'] == 'youtube#video':
            videos.append(search_result)

    return videos


def get_video_ids(video_list):
    # Collect the video ID of each search result.
    return [video['id']['videoId'] for video in video_list]


def write_csv_file(data, file_name):
    # Build a DataFrame from the extracted rows and write it to CSV.
    df = pd.DataFrame(data)
    df.to_csv(file_name)


def extract_video_information(video):
    # video ID, title, no of views, no of likes, no of comments
    statistics = video['statistics']
    return {
        "video_id": video['id'],
        "no_of_likes": statistics.get('likeCount', 0),
        "no_of_views": statistics.get('viewCount', 0),
        "title": video['snippet']['title'],
        "no_of_comments": statistics.get('commentCount', 0)
    }


def video_information(video_ids):
    # videos.list takes the video IDs as a comma-separated string.
    search_response = youtube.videos().list(
        part='id,snippet,statistics,contentDetails',
        id=','.join(video_ids)
    ).execute()

    videos = []
    for search_result in search_response.get('items', []):
        if search_result['kind'] == 'youtube#video':
            videos.append(extract_video_information(search_result))

    return videos

videos = youtube_search_with_region("avatar movie", 50, "US")
# The variable name differs from the helper's so the function is not shadowed.
video_ids = get_video_ids(videos)
video_data = video_information(video_ids)
# Export the file
write_csv_file(video_data, "./data.csv")
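
The YouTube Data API returns the counts inside `statistics` as strings, and a count such as `likeCount` can be absent from the response. A minimal sketch, on hand-written sample data rather than a real response, of normalizing those values to integers before writing rows with the standard `csv` module; `to_count` and `rows_to_csv` are hypothetical helper names, not part of the notebook code.

```python
import csv
import io


def to_count(statistics, key):
    """Convert a statistics value (a string in the API response) to int, defaulting to 0."""
    return int(statistics.get(key, 0))


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dict rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hand-written sample shaped like one videos.list item (values are illustrative).
sample_video = {
    "id": "VIDEO_ID_PLACEHOLDER",
    "snippet": {"title": "Sample title"},
    "statistics": {"viewCount": "1200", "commentCount": "34"},  # likeCount missing
}

stats = sample_video["statistics"]
row = {
    "video_id": sample_video["id"],
    "title": sample_video["snippet"]["title"],
    "no_of_views": to_count(stats, "viewCount"),
    "no_of_likes": to_count(stats, "likeCount"),
    "no_of_comments": to_count(stats, "commentCount"),
}
csv_text = rows_to_csv([row], list(row))
print(csv_text)
```

Converting at extraction time keeps the later analysis steps from having to cast each column again.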

3. Analyze the exported data obtained in 2.b and carry out the following tasks (15 marks)

> 3.a Sort the data from 2.b by number of comments in descending order and consider the video IDs and titles of the top 10 videos which have the highest number of comments. (3 marks)



In [11]:
# Create the dataframe
def sorted_data(file_name):
    df = pd.read_csv(file_name)
    df['no_of_comments'] = df.no_of_comments.astype(int)
    # Sort by comment count, highest first.
    return df.sort_values(by='no_of_comments', ascending=False)

# The task asks only for video IDs and titles, but no_of_comments is printed
# as well so that the ordering is easy to check.
data = sorted_data("./data.csv")

pprint(data.loc[:, ["video_id", "title", "no_of_comments"]].head(10))

       video_id                                              title  \
2   d9MyW72ELq0        Avatar: The Way of Water | Official Trailer   
0   waJKJW_XU90  Avatar: The Last Airbender | Official Teaser |...   
9   a8Gx8wiNbs8  Avatar: The Way of Water | Official Teaser Tra...   
42  2r71I8lvTIA  The Last Airbender Film: How it Disrespected a...   
4   5PSNL1qE6VY  Avatar | Official Trailer (HD) | 20th Century FOX   
41  p_GgAHd5siE                           TOPH: An Avatar Fan Film   
35  T5vdPy7nbRQ                         Avatar Element Animation 2   
34  RGx8rYbRVR4  Why People Hate Avatar: A Lesson In Lazy Comme...   
25  f5Zx8iPek5I  Avatar 3 Will Introduce The Dark Side🔥 Of Na'v...   
31  QOg9LUIvaig  AVATAR: THE LAST AIRBENDER | Water, Earth, Fir...   

    no_of_comments  
2            43090  
0            31596  
9            29018  
42           27174  
4             8939  
41            5399  
35            4630  
34            4044  
25            3998  
31            3976 
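
For comparison, the same top-N selection can be sketched without pandas using `heapq.nlargest` from the standard library. The sample rows are hand-written, with counts kept as strings the way `csv.DictReader` would return them, and `top_by_comments` is a hypothetical helper. The `int` conversion in the key matters: compared as strings, `"7"` would rank above `"340"`.

```python
import heapq


def top_by_comments(rows, n=10):
    """Return the n rows with the highest comment counts, highest first."""
    return heapq.nlargest(n, rows, key=lambda row: int(row["no_of_comments"]))


# Hand-written sample rows (not taken from the exported CSV).
sample_rows = [
    {"video_id": "a", "title": "First", "no_of_comments": "12"},
    {"video_id": "b", "title": "Second", "no_of_comments": "340"},
    {"video_id": "c", "title": "Third", "no_of_comments": "7"},
]

for row in top_by_comments(sample_rows, n=2):
    print(row["video_id"], row["title"], row["no_of_comments"])
```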


> 3.b Use a suitable method to retrieve the comments of the top 10 videos from 3.a. To do this, write a program that loops through each video ID from 3.a and passes the part parameter set to "snippet" to retrieve basic details about the comments. Execute this request and print the response using the pprint() method.
 - Note: pprint() will print out the response from the API in a more human-readable format.
- Reference link: [link](https://developers.google.com/youtube/v3/docs)


> **Output expected**: Use the Python library “pprint” to print the output of the program with the following properties: etag, items, id, kind, and snippet, where snippet contains the textDisplay field, which holds the comment text.
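
To make the expected structure concrete, here is a sketch of pulling `textDisplay` out of a `commentThreads.list` response. The `sample_response` below is hand-written to mirror the fields named in the task; its IDs and etags are placeholders, and `top_level_texts` is a hypothetical helper.

```python
from pprint import pprint

# Hand-written sample shaped like a commentThreads.list response (not real API output).
sample_response = {
    "kind": "youtube#commentThreadListResponse",
    "etag": "ETAG_PLACEHOLDER",
    "items": [
        {
            "kind": "youtube#commentThread",
            "etag": "THREAD_ETAG_PLACEHOLDER",
            "id": "THREAD_ID_PLACEHOLDER",
            "snippet": {
                "topLevelComment": {
                    "kind": "youtube#comment",
                    "etag": "COMMENT_ETAG_PLACEHOLDER",
                    "id": "COMMENT_ID_PLACEHOLDER",
                    "snippet": {"textDisplay": "Great trailer"},
                }
            },
        }
    ],
}


def top_level_texts(response):
    """Return the text of each top-level comment in the response."""
    return [
        item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        for item in response.get("items", [])
    ]


pprint(top_level_texts(sample_response))
```

The comment text sits two `snippet` levels deep: one on the thread and one on the top-level comment itself.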






In [12]:
# Call the API's commentThreads.list method to list the existing comments.

def get_comments(video_id):
    try:
        return youtube.commentThreads().list(
            part="snippet",
            videoId=video_id,
            textFormat="plainText"
        ).execute()
    except HttpError:
        # The request fails for some videos (for example when comments are
        # disabled); treat those as having no comments.
        return {}

def format_comment(comments):
    # Keep only the requested fields of each top-level comment.
    formatted_list = []
    for comment in comments.get('items', []):
        cmt = comment['snippet']['topLevelComment']
        comment_data = {
            "etag": comment['etag'],
            "items": {
                "id": cmt['id'],
                "etag": cmt['etag'],
                "kind": cmt['kind'],
                "snippet": {
                    "textDisplay": cmt['snippet']['textDisplay']
                }
            }
        }

        formatted_list.append(comment_data)

    return formatted_list

def get_all_video_comments(video_list):
    # One list of formatted comments per video, in the order given.
    comments_list = []
    for video_id in video_list:
        comments = get_comments(video_id)
        comments_list.append(format_comment(comments))

    return comments_list

data = sorted_data("./data.csv").head(10)

comments_list = get_all_video_comments(data["video_id"])
pprint(comments_list)

[[{'etag': 'XsCS-xW3bre2AtAidI5ZudN6cAc',
   'items': {'etag': 'IUN4pJhnNNF0jPN74vd4UK0NmLg',
             'id': 'Ugy9R8bfFFOJbZWLMY54AaABAg',
             'kind': 'youtube#comment',
             'snippet': {'textDisplay': 'Damn that background scores hit 😳'}}},
  {'etag': 'bkmV9Tisb-1OzAB3453DY4jwF30',
   'items': {'etag': 'PvGJg4jfKBcBoDCq5Z1N7eotQiA',
             'id': 'UgxJQ2TO0o1yAwQJv_Z4AaABAg',
             'kind': 'youtube#comment',
             'snippet': {'textDisplay': 'Teaser is better than trailer i '
                                        'think❤'}}},
  {'etag': 'c2Qs5JrzYoPL8jB5f2RMObmWVCk',
   'items': {'etag': 'vfZ2nxMgVVg3eCTu9mM45-_TbDs',
             'id': 'UgznsmdN7aCoc2mEJdN4AaABAg',
             'kind': 'youtube#comment',
             'snippet': {'textDisplay': 'Everyone that must have liked this '
                                        'movie must be a bad parent and has a '
                                        'zero observational. Or is filthy Rich '
    



> 3.c Write a program to export the output of question 3.b in JSON file format and submit the file as part of the assignment (3 marks)



In [13]:
def export_to_json(comments_list, file_path):
    with open(file_path, 'w') as json_file:
        json.dump(comments_list, json_file, indent=2)
    
file_path = 'comments.json'
export_to_json(comments_list, file_path)
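
Since comments often contain emoji and other non-ASCII text, one option is to write the file as UTF-8 with `ensure_ascii=False`, so that such characters stay readable in the file instead of being escaped as `\uXXXX` sequences. Either form loads back to the same data. A minimal sketch using a temporary directory and a made-up comment; `export_readable_json` is a hypothetical variant of the export function.

```python
import json
import os
import tempfile


def export_readable_json(data, file_path):
    """Write data as UTF-8 JSON, keeping non-ASCII characters unescaped."""
    with open(file_path, "w", encoding="utf-8") as json_file:
        json.dump(data, json_file, indent=2, ensure_ascii=False)


# Made-up comment containing an emoji (written as an escape in the source).
sample_comments = [{"snippet": {"textDisplay": "Nice score \U0001F633"}}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "comments_sample.json")
    export_readable_json(sample_comments, path)
    with open(path, encoding="utf-8") as json_file:
        raw_text = json_file.read()

round_tripped = json.loads(raw_text)
print(round_tripped == sample_comments)
```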

> 3.d Write a function to get the likes vs views ratio of the top 10 videos obtained in 3.a with the highest comments (3 marks)




In [14]:
def get_ratio(file_name):
    df = sorted_data(file_name)
    
    df['no_of_likes']= df.no_of_likes.astype(int)
    df['no_of_views']= df.no_of_views.astype(int)
    
    df = df.loc[:, ["no_of_likes", "no_of_views"]].head(10)
    df['ratio']=df['no_of_likes']/df["no_of_views"]
    
    return df


print(get_ratio("./data.csv"))

    no_of_likes  no_of_views     ratio
2       1042788     58124167  0.017941
0        341150     13801173  0.024719
9        682437     27758248  0.024585
42       154759      5118065  0.030238
4         81210     12790571  0.006349
41       135054      2217164  0.060913
35       187370      2270073  0.082539
34        25924       430566  0.060209
25       350315      7719047  0.045383
31        55194      3256245  0.016950
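
As a worked check of the first row: 1042788 likes over 58124167 views comes to roughly 0.017941, or about 1.8 likes per 100 views. The sketch below does the same calculation as a standalone function that also guards against a zero view count; `likes_to_views_ratio` is a hypothetical helper, not part of the notebook code.

```python
def likes_to_views_ratio(likes, views):
    """Return likes / views, or None when there are no views to divide by."""
    if views == 0:
        return None
    return likes / views


# Figures taken from the first row of the table above.
ratio = likes_to_views_ratio(1042788, 58124167)
print(round(ratio, 6))
```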
