### DS102 | In Class Practice Week 7A - Looking Back
<hr>

In [None]:
import csv
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## One More Look

In this final in-class practice, we revisit an old dataset: the YouTube Trending Videos dataset, which you encountered in DS101. Each question is first solved the previous way, using loops over lists and dictionaries, and you will then solve it again using the new skills you have learnt.

## The Data Provided

Every day, YouTube selects videos that have a particularly high number of views and comments on that day and puts them in its **Trending** playlist. This playlist gives visitors an idea of what other visitors are watching that day. If you are a content creator, you would hope that your video lands on the **Trending** playlist so that it gets more views.

The data can be found in `GBvideos.csv`. Each row describes one trending video, with the columns explained in the following table:

Column Name | Description
---|---
`video_id`|The unique ID given for each video on YouTube
`trending_date`|The date the video started trending
`title`|The title of the video
`channel_title`|The name of the channel that uploaded the video
`category_id`|The ID of the category that the video belongs to
`publish_time`|The timestamp at which the video was published
`tags`|Pipe-delimited string of the tags attached to the video
`views`|Number of views the video has received
`likes`|Number of likes the video has received from users
`dislikes`|Number of dislikes the video has received from users
`comment_count`|The total number of comments posted on the video
`thumbnail_link`|The image link of the video thumbnail
`comments_disabled`|True/False depending on the comments being disabled or not
`ratings_disabled`|True/False depending on the ratings being disabled or not
`video_error_or_removed`|True/False depending on the errors the video had, or if it was removed
`description`|The description of the video



## Read the Dataset into Jupyter Notebook

Read the dataset into Jupyter notebook. Then, display the first record.
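
As a rough sketch of the pandas approach, the snippet below reads CSV text into a DataFrame with `pd.read_csv` and shows the first record with `head(1)`. To stay self-contained it reads made-up rows from an in-memory string (`sample_csv`, `sample_df` and `first_record` are hypothetical names); for the exercise you would pass the path `"GBvideos.csv"` instead.

```python
import io
import pandas as pd

# Made-up rows standing in for GBvideos.csv (illustration only)
sample_csv = io.StringIO(
    "video_id,title,channel_title,category_id,views\n"
    "a1,First video,Channel A,10,1500000\n"
    "b2,Second video,Channel B,25,20000\n"
)

# read_csv infers column types, so views is read as an integer column
sample_df = pd.read_csv(sample_csv)

# head(1) returns a one-row DataFrame holding the first record
first_record = sample_df.head(1)
print(first_record)
```

Unlike `csv.DictReader`, which returns every value as a string, `read_csv` converts numeric columns for you, so no `int()` typecasting is needed later.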

In [None]:
#########################
# The DictReader way
#########################

# declare an empty list
videos_list = []

with open("GBvideos.csv", encoding="utf-8") as f:
    
    # initialise a reader
    reader = csv.DictReader(f)
    # convert all the data to a list
    videos_list = list(reader)

# display the first record
print(videos_list[0])

In [None]:
#########################
# The pandas way
#########################

# Exercise: Read the videos into a df called videos_df
#

In [None]:
# Exercise: Display the first record in videos_df
#

### Question 1
How many videos in the dataset have $1,000,000$ views or more?
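
One possible pandas pattern for this kind of count is a boolean mask. The sketch below uses a small made-up DataFrame (`toy_df` and its values are assumptions, not rows from the dataset) to show the idea.

```python
import pandas as pd

# Made-up data for illustration only
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "views": [2_500_000, 999_999, 1_000_000, 40],
})

# Comparing a column to a number gives a boolean Series (True/False per row)
is_popular = toy_df["views"] >= 1_000_000

# True counts as 1, so sum() gives the number of matching rows
n_popular = int(is_popular.sum())
print(n_popular)  # 2
```

The comparison replaces both the `for` loop and the `if` statement, because it is applied to every row at once.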

In [None]:
#################################
# Question 1: The list & dict way
#################################

# create a counter to keep track of the number of videos
counter = 0

for video in videos_list:
    # get the number of views in each video
    # typecast the result using int
    video_views = int(video['views'])
    
    # if the value is 1M or more, add 1 to counter
    if video_views >= 1000000:
        counter += 1

print(counter)

In [None]:
#################################
# Question 1: using pandas
#################################

# Exercise: How many videos have 1M views or more?
#

### Question 2
Which video in the dataset has the most views? Show the video title and the number of likes it has.
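
A minimal sketch of one way to find a "row with the maximum" in pandas, assuming a small made-up DataFrame (`toy_df` is hypothetical): `idxmax()` gives the index label of the largest value, and `loc` retrieves that row.

```python
import pandas as pd

# Made-up data for illustration only
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "views": [300, 900, 500],
    "likes": [30, 45, 80],
})

# idxmax() returns the index label of the row with the largest views
top_label = toy_df["views"].idxmax()

# loc selects that row, and the list of columns picks what to show
top_video = toy_df.loc[top_label, ["title", "likes"]]
print(top_video)
```

If several rows tie for the maximum, `idxmax()` returns the first one, which matches the behaviour of a loop that only updates on a strictly greater value.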

In [None]:
#################################
# Question 2: The list & dict way
#################################

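# use the first video as the starting point for the comparison
highest_views = int(videos_list[0]['views'])
highest_views_title = videos_list[0]['title']
highest_views_likes = int(videos_list[0]['likes'])

for video in videos_list:
    
    # get the key variables. remember to typecast the numbers
    views = int(video['views'])
    title = video['title']
    likes = int(video['likes'])
    
    # keep the details of the most viewed video seen so far
    if views > highest_views:
        highest_views = views
        highest_views_title = title
        highest_views_likes = likes
        
print(highest_views_title)
print(highest_views)
print(highest_views_likes)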

In [None]:
#################################
# Question 2: using pandas
#################################

# Exercise: Which video in the dataset has the most views? Show its title and number of likes.
#

### Question 3
How many videos in `category_id` number `25` have $2500$ comments or more?
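
When two conditions must hold at once, pandas masks can be combined with `&`. The sketch below illustrates this on made-up data (`toy_df` is an assumption, not part of the dataset).

```python
import pandas as pd

# Made-up data for illustration only
toy_df = pd.DataFrame({
    "category_id": [25, 25, 10, 25],
    "comment_count": [3000, 100, 5000, 2500],
})

# Each condition needs its own parentheses; & means element-wise "and"
mask = (toy_df["category_id"] == 25) & (toy_df["comment_count"] >= 2500)

# Indexing with the mask keeps only the rows where it is True
matching = toy_df[mask]
print(len(matching))  # 2
```

Note that Python's `and` keyword does not work on a Series; the `&` operator is the element-wise equivalent.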

In [None]:
#################################
# Question 3: The list & dict way
#################################

# initialise a counter
counter = 0

# iterate through the list
for video in videos_list:
    # get the key variables. typecast them
    category = int(video['category_id'])
    comments = int(video['comment_count'])
    
    if comments >= 2500 and category == 25:
        # increment the counter if the record satisfies the conditions
        counter += 1
        
print(counter)

In [None]:
#################################
# Question 3: using pandas
#################################

# Exercise: How many videos in category_id 25 have 2500 comments or more?
#

### Question 4
A channel can upload one or multiple videos. How many unique channels are there?
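
For counting distinct values, pandas offers `Series.nunique()`. A small sketch on made-up channel names (`toy_df` is hypothetical):

```python
import pandas as pd

# Made-up data for illustration only: Channel A appears twice
toy_df = pd.DataFrame({
    "channel_title": ["Channel A", "Channel B", "Channel A", "Channel C"],
})

# nunique() counts distinct values (missing values are ignored by default)
n_channels = toy_df["channel_title"].nunique()
print(n_channels)  # 3
```

If you want the names themselves rather than the count, `unique()` returns them as an array.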

In [None]:
#################################
# Question 4: The list & dict way
#################################

channels_list = []

# iterate through the videos
for video in videos_list:
    # get the channel of the video
    channel = video['channel_title']
    # append to the list of channels if it does not already exist
    if channel not in channels_list:
        channels_list.append(channel)
    
# we can use the len() function to find out the total number of unique channels in our list!
print(len(channels_list))

In [None]:
#################################
# Question 4: using pandas
#################################

# Exercise: How many unique channels are there?
#

### Question 5
Which channels from the dataset have a combined total of $1,500,000,000$ ($1.5$ billion) views or more? A combined total is calculated by adding up all the views of all videos belonging to that channel. `print` all the `channel` names.
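
The "add up per channel, then filter" idea maps onto `groupby` followed by a boolean mask. The sketch below assumes a made-up DataFrame and a much smaller threshold of 1,000 views so the numbers are easy to check by hand (`toy_df`, `channel_totals` and `big_channels` are hypothetical names).

```python
import pandas as pd

# Made-up data for illustration only
toy_df = pd.DataFrame({
    "channel_title": ["A", "B", "A", "C", "B"],
    "views": [600, 200, 500, 1200, 100],
})

# Sum the views per channel; the result is a Series indexed by channel name
channel_totals = toy_df.groupby("channel_title")["views"].sum()

# Keep only the channels whose total reaches the threshold
# (1,000 here for illustration; the question uses 1.5 billion)
big_channels = channel_totals[channel_totals >= 1000]

print(sorted(big_channels.index))  # ['A', 'C']
```

Here `groupby(...).sum()` plays the role of the dictionary that accumulates views per channel in the loop version.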

In [None]:
#################################
# Question 5: The list & dict way
#################################

channel_views_dict = {}

# iterate through the videos
for video in videos_list:
    
    # Get the channel name and the view count of the video
    channel = video['channel_title']
    views = int(video['views'])
    
    # if the channel name (the key of channel_views_dict) does not exist in the dict, 
    # we create it with the video views as a value
    if channel not in channel_views_dict:
        channel_views_dict[channel] = views
    # if it exists, we just add the views of the current video to the channel 
    else:
        channel_views_dict[channel] += views
        
# instantiate a new list of channels that have 1.5B views or more
final_channel_list = []

for cnl, total_views in channel_views_dict.items():
    if total_views >= 1500000000:
        final_channel_list.append(cnl)
print(sorted(final_channel_list))

In [None]:
#################################
# Question 5: using pandas
#################################

# Exercise: Which channels from the dataset have a combined total of 1.5B views or more?
#

### Question 6

Plot a clustered bar plot showing each category's total views, total likes and total comments. For each of the 3 values, take the logarithm to base $10$, i.e. $\log_{10}(\text{total\_views})$, and so on. Plot only the 5 `category_id`s with the highest total views.
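
The data preparation behind such a plot could be sketched as follows, assuming made-up data and a top 2 instead of a top 5 so the result is easy to verify (`toy_df`, `totals`, `top` and `log_top` are hypothetical names). The totals are powers of 10 so that the logarithms come out as whole numbers.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
toy_df = pd.DataFrame({
    "category_id": [10, 10, 24, 24, 25, 1],
    "views": [1000, 9000, 100, 900, 100000, 10],
    "likes": [50, 50, 5, 5, 1000, 1],
    "comment_count": [5, 5, 1, 9, 100, 1],
})

# Total views, likes and comments per category
cols = ["views", "likes", "comment_count"]
totals = toy_df.groupby("category_id")[cols].sum()

# Keep the categories with the highest total views (2 here, 5 in the question)
top = totals.nlargest(2, "views")

# np.log10 is applied to every cell and returns a DataFrame of the same shape
log_top = np.log10(top)
print(log_top)
```

Calling `log_top.plot.bar()` on a DataFrame shaped like this draws one cluster of bars per category, with one bar per column.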

In [None]:
######################################################################
# Question 6: Aggregation using loops & Plotting: Part 1. Aggregation
######################################################################

# Aggregate Step
category_info_dict = {}

for video in videos_list:
    # Get the category ID and the columns to sum
    category = video['category_id']
    views = int(video['views'])
    comments = int(video['comment_count'])
    likes = int(video["likes"])
    
    # we check if the category is in our dictionary or not
    if category not in category_info_dict: 
        # if it doesn't exist, create a dictionary that contains the information for this particular category
        information = {}
        information["comments"] = comments
        information["views"] = views
        information["likes"] = likes
    else:
        # if it does exist, we just add the values up 
        information = category_info_dict[category]
        information["comments"] += comments
        information["views"] += views
        information["likes"] += likes
    # remember to assign the information you've stored as a dictionary back to the corresponding category!
    category_info_dict[category] = information
    
# change all the columns to log10
# since this code changes the values in place, it should not be run more than once
cols_list = ["comments", "views", "likes"]
for cat_id, cat_info in category_info_dict.items():
    
    for col in cols_list:
        # math.log10 is usually more accurate than math.log(v, 10)
        cat_info[col] = math.log10(cat_info[col])

# FILTER
# get the 5 categories with the highest total views
top_5 = sorted(category_info_dict.items(), key=lambda x: x[1]["views"], reverse=True)[:5]

for t5 in top_5:
    print(t5)

# DATA PREPARATION FOR PLOTTING
# extract the values into lists. top_5 is a list of (category_id, info) tuples because of sorted()
comments = [info_tuple[1]["comments"] for info_tuple in top_5]
views = [info_tuple[1]["views"] for info_tuple in top_5]
likes = [info_tuple[1]["likes"] for info_tuple in top_5]
ids = [info_tuple[0] for info_tuple in top_5]


In [None]:
######################################################################
# Question 6: Aggregation using loops & Plotting: Part 2. Plotting
######################################################################
index = np.arange(len(top_5))
width = 0.1

fig, ax = plt.subplots(figsize=(16,10))

view_bars = ax.bar(index, views, width, color="red", label="views")
likes_bars = ax.bar(index+width, likes, width, color="blue", label="likes")
comments_bars = ax.bar(index+width*2, comments, width, color="green", label="comments")


ax.set_xticks(index + width)
ax.set_xticklabels(ids)
ax.legend()
ax.set_title("Total views, likes and comments (log base 10) of top 5 categories by views",
            fontsize=18)

plt.show()
plt.close()

In [None]:
sns.set()

In [None]:
#########################################
# Question 6: Using pandas and matplotlib
#########################################

# Exercise: Plot a clustered bar plot showing each category's total views, total likes and
# total comments, each taken to log base 10, for the top 5 categories by views


**Credits**
- Dataset 1: [Trending YouTube Video Statistics, Kaggle](https://www.kaggle.com/datasnaek/youtube-new)