In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Overview
YouTube is the biggest video platform on the internet today. Its Trending Videos list, according to YouTube, "aims to surface videos that a wide range of viewers would find interesting." The goal of this research is to analyze YouTube's Trending Videos list from Aug 2020 to May 2021 and find out whether there are any interesting patterns. Trending Videos is not personalized: every viewer within a country sees the same list. This analysis uses the list for the USA.


## Research Questions:
- Which channel has the largest number of Trending Videos? (bar chart)
- Which category has the most or least Trending Videos? (bar chart)
- What words are most popularly used for Trending Videos titles? (word cloud)
- What words are most popularly used for Trending Videos tags? (word cloud)
- What's the correlation between Trending Videos Likes, Dislikes and Views? (scatter plot)

# Data Profile
- The data source is YouTube's Trending Videos list from Aug 2020 to May 2021 (to be exact, up to 05/18/21, the date the data was extracted).
- It is a .csv file.
- A separate .json file maps category IDs to category names, so that each video's category can be identified.
- The dataset consists of the following columns: video ID, title, publish date, channel ID, channel title, category ID, trending date, tags, view count, likes, dislikes, comment count, thumbnail link, comments disabled, ratings disabled and description.
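
The profile above can be checked quickly once the CSV is loaded. The following is only a minimal sketch: the helper name `profile_trending` and the small stand-in frame with made-up values are illustrative assumptions, not part of the original notebook. It relies only on a `trending_date` column being present.

```python
import pandas as pd

def profile_trending(df: pd.DataFrame) -> dict:
    """Summarise row/column counts, trending date range and missing values."""
    dates = pd.to_datetime(df["trending_date"])
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "first_trending_date": dates.min(),
        "last_trending_date": dates.max(),
        "missing_per_column": df.isna().sum().to_dict(),
    }

# Tiny stand-in frame (made-up values) so the sketch runs without the Kaggle files
sample = pd.DataFrame({
    "title": ["a", "b", "c"],
    "trending_date": ["2020-08-12T00:00:00Z", "2020-08-13T00:00:00Z", "2021-05-18T00:00:00Z"],
    "description": ["x", None, "z"],
})
summary = profile_trending(sample)
print(summary)
```

In the notebook itself, the same helper could be called as `profile_trending(us_videos)` after the files are read.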

# Analysis
## Data Processing

In [None]:
#Read files
us_videos = pd.read_csv('../input/youtube-us2020/US_youtube_trending_data2020.csv')
us_videos_categories = pd.read_json('../input/youtube-us2020/US_category_id2020.json')

In [None]:
us_videos.head()

In [None]:
# Extract category information from .json file
categories = {category['id']: category['snippet']['title'] for category in us_videos_categories['items']}

# Create a new column that will represent the name of category
us_videos.insert(4, 'category', us_videos['categoryId'].astype(str).map(categories))

In [None]:
# Extract only the columns needed (copy so that later column assignments do not trigger SettingWithCopyWarning)
new_df = us_videos[['title', 'category', 'channelTitle', 'trending_date', 'tags', 'view_count', 'likes', 'dislikes', 'comment_count']].copy()
new_df.head()

In [None]:
# Convert the 'trending_date' column to datetime
new_df['trending_date'] = pd.to_datetime(new_df['trending_date'])

# Sort the dataframe by 'trending_date' in descending order
# (sort_values returns a new dataframe, so the result is assigned back)
new_df = new_df.sort_values(by='trending_date', ascending=False)
new_df.head()


## Q1. Which channel has the largest number of trending videos?

In [None]:
# Count the number of rows per channel (not yet sorted by count)
channel_df = new_df.groupby('channelTitle')['channelTitle'].agg(['count'])
channel_df.head(20)

In [None]:
#Extract the top 20 channels with the most counts
channel_most = new_df.groupby("channelTitle").size().reset_index(name="count") \
    .sort_values("count", ascending=False).head(20)
channel_most

In [None]:
# Use plotly to create a bar chart
import plotly.express as px
fig = px.bar(channel_most, x='channelTitle', y='count')
fig.show()

### Q1. Result
NFL is the channel with the most Trending Videos (410), followed by NBA (370).
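
Note that the counts above are rows in the dataset, so a video that trends on several days is counted once per day. A minimal sketch of how row counts and distinct-video counts could be compared is given below. It assumes the video ID column is named `video_id` (the data profile lists a video ID column, but the exact name is an assumption here), and it uses made-up rows.

```python
import pandas as pd

# Made-up rows: one row per (video, trending day)
rows = pd.DataFrame({
    "video_id": ["v1", "v1", "v1", "v2", "v3", "v3"],   # assumed column name
    "channelTitle": ["A", "A", "A", "A", "B", "B"],
})

per_channel = (
    rows.groupby("channelTitle")
        .agg(trending_rows=("video_id", "count"),      # rows, i.e. video-days
             unique_videos=("video_id", "nunique"))    # distinct videos
        .sort_values("trending_rows", ascending=False)
)
print(per_channel)
```

If the two columns rank channels differently, the result depends on whether "number of Trending Videos" means appearances on the list or distinct videos.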

## Q2. Which category has the most or least Trending videos? 

In [None]:
# Create a new dataframe with category information only
category_df = new_df.groupby('category')['category'].agg(['count'])
category_df

In [None]:
# Use plotly to create a bar chart
import plotly.express as px
fig = px.bar(category_df, y='count')
fig.show()

### Q2. Result
The Music category has the most Trending Videos (11.233k), followed by Entertainment (11.208k). Nonprofit & Activism has the fewest Trending Videos, only 53, followed by Travel & Events (214).
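
The ranking can also be read off a sorted table with percentage shares instead of from the chart. This is a minimal sketch with made-up category labels; in the notebook, `new_df['category']` would take the place of the `categories` series.

```python
import pandas as pd

# Made-up category labels standing in for new_df['category']
categories = pd.Series(
    ["Music", "Music", "Music", "Entertainment", "Entertainment", "Sports"],
    name="category",
)

counts = categories.value_counts()               # sorted from most to least frequent
share = (counts / counts.sum() * 100).round(1)   # share of all rows, in percent
summary = pd.DataFrame({"count": counts, "share_pct": share})
print(summary)
```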

## Q3. What words are most popularly used for Trending Videos titles?

In [None]:
# Count the most frequent words in titles
from collections import Counter

title_words = list(us_videos["title"].apply(lambda x: x.split()))
title_words = [x for y in title_words for x in y]
Counter(title_words).most_common(25)

In [None]:
#Start Wordcloud
import matplotlib as mpl
from matplotlib import pyplot as plt
import wordcloud
from wordcloud import WordCloud, STOPWORDS

#Generate Wordcloud from title_words
wc = wordcloud.WordCloud(width=2000, height=1000, random_state = 1, stopwords = STOPWORDS, 
                         collocations=False, background_color="black", 
                         colormap="rainbow").generate(" ".join(title_words))

plt.figure(figsize=(20,15))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

## Q4. What words are most popularly used for Trending Videos tags?

In [None]:
# Count the most frequent words in tags (split on whitespace, as for titles)
tag_words = list(us_videos["tags"].apply(lambda x: x.split()))
tag_words = [x for y in tag_words for x in y]
Counter(tag_words).most_common(25)

In [None]:
# Generate Wordcloud from tag_words
wc = wordcloud.WordCloud(width=2000, height=1000, random_state = 1, stopwords = STOPWORDS, 
                         collocations=False, background_color="black", 
                         colormap="rainbow").generate(" ".join(tag_words))

plt.figure(figsize=(20,15))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

## Q5. What's the correlation between Trending Videos Likes, Dislikes and Views?

In [None]:
print(new_df["likes"].max())
print(new_df["dislikes"].max())
print(new_df["view_count"].max())

In [None]:
# Create a scatter plot (marker size is proportional to view count)
import seaborn as sns

plt.figure(figsize=(16,10))
sns.set_style("whitegrid")
plt.title('Video views according to their Likes and Dislikes', fontsize=20, fontweight='bold', y=1.05)
plt.xlabel('Likes', fontsize=15)
plt.ylabel('Dislikes', fontsize=15)

likes = new_df["likes"].values
dislikes = new_df["dislikes"].values
views = new_df["view_count"].values

plt.scatter(likes, dislikes, s = views/50000, edgecolors='black', alpha=0.5)
plt.show()

### Q5 Result
We see that Likes and Dislikes are generally positively correlated: as Likes increase, Dislikes increase as well. Likes and Views, and Dislikes and Views, also appear to be positively correlated, because the circles become bigger as x or y increases. It is interesting to see, however, that the circles are concentrated towards the two ends of the plot (near (0, 0) and near (16M, 800k)), and that Trending Videos with 8M-12M Likes tend to have fewer Dislikes (below 200k) than Trending Videos with fewer Likes.
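
The visual impression can be backed up with correlation coefficients. The sketch below uses pandas' `DataFrame.corr` on made-up numbers; in the notebook, `new_df[["view_count", "likes", "dislikes"]]` would be used instead. Spearman's rank correlation is included as an assumption-light alternative to Pearson's, since a few very large videos can dominate a Pearson coefficient.

```python
import pandas as pd

# Made-up values standing in for new_df[["view_count", "likes", "dislikes"]]
sample = pd.DataFrame({
    "view_count": [1000, 5000, 20000, 80000, 300000],
    "likes": [50, 400, 1500, 6000, 20000],
    "dislikes": [5, 10, 60, 100, 900],
})

pearson = sample.corr(method="pearson")     # linear correlation
spearman = sample.corr(method="spearman")   # rank (monotonic) correlation

print(pearson.round(3))
print(spearman.round(3))
```

Values close to 1 would support the positive relationships described above; values near 0 would suggest the pattern in the plot is driven by a handful of outliers.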

# Conclusion
Here are some of the insights from the data analysis:
- Music (11.233k) and Entertainment (11.208k) are the two categories with the most Trending Videos. However, the two channels with the most Trending Videos were NFL (410) and NBA (370), both of which fall into the Sports category. Sports (6.1k) is the category with the third-most Trending Videos.
- Nonprofit & Activism (53) has the fewest Trending Videos, which is concerning considering that the topics and issues these videos address are the ones that most need the world's attention.
- There was not much difference between the most used words in titles and in tags. According to the word clouds, the two most dominant words in titles were "video" and "official", while those in tags were "new" and "video". Since these are very generic words, it seems that titles and tags play little role in a video becoming Trending.
- The numbers of Likes, Dislikes and Views of Trending Videos are mostly positively correlated. If we interpret Likes and Dislikes as 'engagement', this means that videos with more engagement tend to have more views, and vice versa.
- Interestingly, however, Trending Videos with a mid-range number of Likes (8-10M, out of 16M) seem to get relatively few Dislikes (fewer than 200K). This may be because of a relationship between the number of Likes and the discoverability of the videos, which this research has not identified.
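
One caveat on the title and tag comparison: the word counts depend on how the text is split. The sketch below shows a hypothetical helper, `count_tokens`, that lowercases tokens, strips surrounding punctuation and accepts a custom separator. Whether the tags column really uses a `|` separator is an assumption that should be checked against the data, and the sample strings are made up.

```python
from collections import Counter
import re

def count_tokens(texts, sep=None):
    """Count lowercase tokens; split on `sep` (None means any whitespace)."""
    counter = Counter()
    for text in texts:
        for token in str(text).split(sep):
            # Strip punctuation from both ends, keep inner characters
            token = re.sub(r"^\W+|\W+$", "", token.lower())
            if token:
                counter[token] += 1
    return counter

# Made-up examples
titles = ["Official Video", "official VIDEO (Live)", "New video!"]
tags = ["new|music video|official", "New|live"]   # assumes '|' separates tags

title_counts = count_tokens(titles)
tag_counts = count_tokens(tags, sep="|")
print(title_counts.most_common(3))
print(tag_counts.most_common(3))
```

With normalization like this, "Video", "video" and "VIDEO" are counted as one word, and multi-word tags stay intact, which may change which words dominate.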