# Indonesia's Trending YouTube Video Statistics

YouTube provides [trending videos](https://www.youtube.com/feed/trending) to help viewers see what is happening on YouTube and in the world. Several signals are taken into account, such as view count, how quickly the video is generating views (hotness), where the views come from (including outside YouTube), and the age of the video.

[Trending on YouTube](https://support.google.com/youtube/answer/7239739) is not personalized and displays the same list of videos to all users in a country. The list is updated roughly **every 15 minutes**; with each update, videos may move up, down, or stay in the same position.

This notebook provides a step-by-step guide to creating a `streamlit` dashboard like the one shown below:

![](assets/notebook/streamlit_app.gif)

> Deployed `streamlit` dashboard: https://share.streamlit.io/tomytjandra/youtube-id-streamlit

## Environment Preparation

We will prepare some packages to be used in this project. If you browse this folder, you will find a file called `requirements.txt`. This file is used for specifying what Python packages are required to run this project. If you open up the file, you will see something that looks similar to this:

```
kaggle==1.5.12
pandas==1.4.2
Pillow==9.1.0
plotly==5.7.0
streamlit==1.8.1
```

Notice we have a line for each package, then a version number. This is important because as you start developing your application, you will develop it with specific versions of the packages in mind.

### Importing Requirements

Let us import the `requirements.txt` with the following steps:

**Step 1**: Create a new environment and activate it

```
conda create -n <ENV_NAME> python=<PYTHON_VERSION>
conda activate <ENV_NAME>
```

Do not forget to install the kernel if you want it to be accessible from the Jupyter Notebook:

```
pip install ipykernel
python -m ipykernel install --user --name=<ENV_NAME>
```

**Step 2**: Navigate to the folder with your `requirements.txt`

```
cd <PATH_TO_REQUIREMENTS>
```

**Step 3**: Install the requirements

```
pip install -r requirements.txt
```
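Once the installation finishes, the installed versions can be compared against the pins in the file. The snippet below is a minimal sketch that uses only the standard library; `parse_requirements` and `check_environment` are hypothetical helper names, and it assumes every line uses the simple `name==version` form shown earlier.

```python
from importlib import metadata


def parse_requirements(text):
    """Parse 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # skip blank lines and comments
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def check_environment(pins):
    """Return {name: (pinned, installed)} for packages that differ or are missing."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# two of the pins from requirements.txt, inlined for illustration
sample = """kaggle==1.5.12
pandas==1.4.2
"""
pins = parse_requirements(sample)
print(pins)
print(check_environment(pins))
```

An empty dictionary from `check_environment()` means every pinned package is installed at the expected version.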

## Data Loading

We will be using data from [Kaggle: Indonesia's Trending YouTube Video Statistics](https://www.kaggle.com/datasets/syahrulhamdani/indonesias-trending-youtube-video-statistics) by Syahrul Hamdani. This dataset only contains Indonesia's trending YouTube videos updated daily or twice a day. Hence, it includes the trending date and the trending time. There are two files:

1. `category.json` maps each category identifier number to its category name for Indonesia.

2. `trending.csv` consists of 27 features, including video information and statistics. Those features are generally extracted from the broader [video resource properties](https://developers.google.com/youtube/v3/docs/videos). Here is the list of features that we use:

   - `publish_time`: date and time when the video was uploaded
   - `trending_time`: date and time when the video was detected as trending
   - `title`: title of the video
   - `channel_name`: name of the channel that uploaded the video
   - `category_id`: the unique identifier of the video's category (will be mapped with `category.json`)
   - `view`, `like`, `dislike`, `comment`: video statistics that describe engagement with the viewers

In [1]:
# data analysis
import pandas as pd
import json

### CSV File

We use the following three parameters from `pd.read_csv()`:

- `filepath_or_buffer`: a string of file path
- `usecols`: only return a subset of specified columns
- `parse_dates`: automatically convert specified columns to `datetime64`

> Reference: [`pandas.read_csv` documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)

In [2]:
columns = [
    'publish_time', 'trending_time',
    'title', 'channel_name', 'category_id',
    'view', 'like', 'dislike', 'comment']

youtube = pd.read_csv(
    'data_input/trending.csv',
    usecols=columns,
    parse_dates=['publish_time', 'trending_time'])

youtube.dtypes

publish_time     datetime64[ns, UTC]
title                         object
channel_name                  object
category_id                    int64
view                         float64
like                         float64
dislike                      float64
comment                      float64
trending_time    datetime64[ns, UTC]
dtype: object

### Pickle File

Pickling is when a Python object is converted into a binary file, and unpickling is the inverse operation.

- Use `.to_pickle()` to pickle (save) a `pandas.DataFrame` object.
- Use `pd.read_pickle()` to unpickle (read) a pickle file.

Benefits of using pickle file:

- It preserves the structure of the Python object. In a DataFrame, it preserves the data type and index.
- It often has a smaller file size than the equivalent CSV file, although this depends on the data.
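To make the first benefit concrete, here is a minimal sketch on made-up data that round-trips a small DataFrame through both formats in a temporary folder. The column names mirror the dataset, but the values and file names are only illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    "trending_time": pd.to_datetime(
        ["2021-02-07 05:46:51", "2021-02-08 06:00:00"], utc=True),
    "view": [1000.0, 2500.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "sample.csv"
    pickle_path = Path(tmp_dir) / "sample.pickle"

    # write the same DataFrame in both formats
    df.to_csv(csv_path, index=False)
    df.to_pickle(pickle_path)

    # read both back without any extra arguments
    from_csv = pd.read_csv(csv_path)
    from_pickle = pd.read_pickle(pickle_path)

# the CSV version loses the datetime type unless parse_dates is given,
# while the pickle version restores it as it was
print(from_csv.dtypes)
print(from_pickle.dtypes)
```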

In [3]:
# save dataframe to pickle
youtube.to_pickle('data_input/trending.pickle')

In [4]:
# read pickle as dataframe
youtube = pd.read_pickle('data_input/trending.pickle')
youtube.dtypes

publish_time     datetime64[ns, UTC]
title                         object
channel_name                  object
category_id                    int64
view                         float64
like                         float64
dislike                      float64
comment                      float64
trending_time    datetime64[ns, UTC]
dtype: object

### JSON File

A JSON file is a file that stores simple data structures and objects in JavaScript Object Notation format, which is a standard data interchange format. It is primarily used for transmitting data between a web application and a server. JSON represents objects as **key-value pairs**, just like a Python dictionary.

Here is a sneak peek of `category.json`:

```
{
  "kind": "youtube#videoCategoryListResponse",
  "etag": "_S1MAxBhP9n7eQUvtH1duuWAwbw",
  "items": [
    {
      "kind": "youtube#videoCategory",
      "etag": "grPOPYEUUZN3ltuDUGEWlrTR90U",
      "id": "1",
      "snippet": {
        "title": "Film & Animation",
        "assignable": true,
        "channelId": "UCBR8-60-B28hp2BmDPdntcQ"
      }
    },
    ...
  ]
}
```

Step-by-step:

1. Read JSON file using `json.load()`
2. Normalize the JSON file by specifying the `record_path` parameter as the path in each object to the list of records
3. Convert `id` from string to integer and set it as an index
4. Take the `snippet.title` column and convert it into a Python dictionary

> Reference: [`pandas.json_normalize` documentation](https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html)

In [5]:
with open("data_input/category.json") as file:
    json_data = json.load(file)
    df = pd.json_normalize(json_data, record_path='items')
    df.index = df['id'].astype('int')
    category_mapping = df['snippet.title'].to_dict()
    
category_mapping

{1: 'Film & Animation',
 2: 'Autos & Vehicles',
 10: 'Music',
 15: 'Pets & Animals',
 17: 'Sports',
 18: 'Short Movies',
 19: 'Travel & Events',
 20: 'Gaming',
 21: 'Videoblogging',
 22: 'People & Blogs',
 23: 'Comedy',
 24: 'Entertainment',
 25: 'News & Politics',
 26: 'Howto & Style',
 27: 'Education',
 28: 'Science & Technology',
 30: 'Movies',
 31: 'Anime/Animation',
 32: 'Action/Adventure',
 33: 'Classics',
 34: 'Comedy',
 35: 'Documentary',
 36: 'Drama',
 37: 'Family',
 38: 'Foreign',
 39: 'Horror',
 40: 'Sci-Fi/Fantasy',
 41: 'Thriller',
 42: 'Shorts',
 43: 'Shows',
 44: 'Trailers'}
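For comparison, the same kind of mapping can be built without `pandas` by looping over the `items` list directly. The sketch below uses a trimmed, inline stand-in for `category.json` that keeps only the `id` and `snippet.title` fields of two categories from the output above; it is an alternative illustration, not a replacement for the cell above.

```python
import json

# trimmed stand-in for category.json (only the fields needed here)
raw = """
{
  "kind": "youtube#videoCategoryListResponse",
  "items": [
    {"id": "1", "snippet": {"title": "Film & Animation"}},
    {"id": "10", "snippet": {"title": "Music"}}
  ]
}
"""

json_data = json.loads(raw)

# cast the string id to int so it matches the integer category_id column
category_mapping = {
    int(item["id"]): item["snippet"]["title"]
    for item in json_data["items"]
}
print(category_mapping)  # {1: 'Film & Animation', 10: 'Music'}
```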

## Data Wrangling

In this section, we will mainly focus on cleansing the data, divided into four steps:

### Mapping Category

We are going to map the integer `category_id` to its corresponding name using the prepared `category_mapping`. Here is an illustration for the first five rows:

**Before Mapping**

| channel_name           | title                                                    |   category_id |
|:-----------------------|:---------------------------------------------------------|--------------:|
| SMTOWN                 | aespa 에스파 'Forever (약속)' MV                         |            10 |
| Indonesia Lawyers Club | [FULL] Siapa di Balik Kudeta AHY? - Dua Sisi tvOne       |            25 |
| Motomobi               | CABRIOLET CHALLENGE: TANTANGAN MENGGODA (7/12)           |             2 |
| yb                     | With Windah Basudara & Hans                              |            20 |
| FC Barcelona           | 🤯 LATE COMEBACK DRAMA! - HIGHLIGHTS - Granada 3-5 Barça |            17 |

**After Mapping**

| channel_name           | title                                                    | category_id      |
|:-----------------------|:---------------------------------------------------------|:-----------------|
| SMTOWN                 | aespa 에스파 'Forever (약속)' MV                         | Music            |
| Indonesia Lawyers Club | [FULL] Siapa di Balik Kudeta AHY? - Dua Sisi tvOne       | News & Politics  |
| Motomobi               | CABRIOLET CHALLENGE: TANTANGAN MENGGODA (7/12)           | Autos & Vehicles |
| yb                     | With Windah Basudara & Hans                              | Gaming           |
| FC Barcelona           | 🤯 LATE COMEBACK DRAMA! - HIGHLIGHTS - Granada 3-5 Barça | Sports           |

In [6]:
# map integer category_id to its corresponding name
youtube['category_id'] = youtube['category_id'].map(category_mapping)

In [7]:
# sanity check
youtube['category_id'].head()

0               Music
1     News & Politics
2    Autos & Vehicles
3              Gaming
4              Sports
Name: category_id, dtype: object
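One detail worth knowing is that `Series.map()` with a dictionary returns `NaN` for any value that is not a key in the dictionary, so an identifier that is absent from `category.json` becomes a missing value after mapping. A minimal sketch on made-up data (the id `29` is used only because it does not appear in the mapping printed earlier):

```python
import pandas as pd

category_mapping = {10: "Music", 17: "Sports"}
category_id = pd.Series([10, 17, 29])

# ids without an entry in the dictionary are mapped to NaN
mapped = category_id.map(category_mapping)
print(mapped)

# count the ids that had no entry in the mapping
print(mapped.isna().sum())  # 1
```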

### Unique Videos

Each video on YouTube can be trending for more than one day; hence the video information will be duplicated. Below, we create a frequency table to count the occurrences of each video (distinguished by `channel_name` and `title`).

In [8]:
# frequency table
youtube.groupby(['channel_name', 'title'])['publish_time'].count().reset_index(name='row_count')

|       | channel_name      | title                                              | row_count |
|------:|:------------------|:---------------------------------------------------|----------:|
| 0     | #temantapimenikah | #DailyVlog - Campervan Hari Kedua! Dari Grocer...  |         4 |
| 1     | #temantapimenikah | #USRoadTripDIA \| Salju Mulai Turun di Yellowstone |         3 |
| 2     | #temantapimenikah | #ngobrolsamaDIA \| Cewe Cowo Sahabatan, Bisa?      |         6 |
| 3     | #temantapimenikah | #ngobrolsamaDIA \| Emang Pacaran Itu Harus ya?     |         5 |
| 4     | #temantapimenikah | #ngobrolsamaDIA \| Perkenalan                      |         4 |
| ...   | ...               | ...                                                |       ... |
| 15843 | 혀니콤보 TV       | (ENG) 드디어 그들이 왔다! 방탄소년단 의전팀이 된다면?! |     8 |
| 15844 | 혀니콤보 TV       | (ENG) 목표는 댄스 가수 데뷔?!🕺이현의 ‘Permission to Dance’... | 5 |
| 15845 | 혀니콤보 TV       | (ENG) 방잘알★초특급 게스트★와 함께하는 BTS 'Butter' MV React... | 6 |
| 15846 | 혀니콤보 TV       | “앞으로 의전만 하셔야 될 것 같아요” 칭찬(?) 만발! 방탄소년단 의전팀 체험 두... | 7 |


Here, we only take the first observation for each duplicated video. Here are the steps:

- Sort the rows ascending by `trending_time`
- Remove the duplicated row by keeping only the `first` observation for each `channel_name` and `title`
- Use `.copy()` to create a new object from the subsetted data

In [9]:
# take the first trending day for each video
youtube = youtube.sort_values(by='trending_time')
youtube_unique = youtube.drop_duplicates(subset=['channel_name', 'title'], keep='first').copy()

In [10]:
# sanity check
youtube_unique.shape

(15848, 9)

### Treat Missing Values

There are several missing values in our dataset as depicted below:

In [11]:
# check missing values for each column
youtube_unique.isna().sum()

publish_time        0
title               0
channel_name        0
category_id       125
view                3
like              173
dislike          4593
comment            95
trending_time       0
dtype: int64

We will treat the missing values as follows:

- `category_id`: create a new category named **Unknown**
- `view`: drop the observation because there are only three observations and it will not affect the insight much
- `like`, `dislike`, `comment`: left untreated because the true values are unknown. Rows with missing values in these statistics will simply not appear in the scatter plot

In [12]:
# handle missing values
youtube_unique['category_id'] = youtube_unique['category_id'].fillna('Unknown')
# pass the column as a list, which works across pandas versions
youtube_unique.dropna(subset=['view'], inplace=True)

In [13]:
# sanity check
youtube_unique.shape

(15845, 9)

### Feature Engineering

Columns of type `datetime64` expose the `.dt` accessor. In this case, we can extract specific date and time information from `publish_time` and `trending_time` to enrich our insights. We will create three new features:

- `publish_day`: day name of `publish_time` (object)
- `publish_hour`: hour component of `publish_time` (integer)
- `trending_date`: date component of `trending_time` (date)

> Reference: [`pandas.Series` datetimelike properties](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#datetimelike-properties) 

In [14]:
# feature engineering
youtube_unique['publish_day'] = youtube_unique['publish_time'].dt.day_name()
youtube_unique['publish_hour'] = youtube_unique['publish_time'].dt.hour
youtube_unique['trending_date'] = youtube_unique['trending_time'].dt.date

In [15]:
# sanity check for publish_time extraction
youtube_unique[['publish_time', 'publish_day', 'publish_hour']]

|       | publish_time              | publish_day | publish_hour |
|------:|:--------------------------|:------------|-------------:|
| 0     | 2021-02-05 09:00:34+00:00 | Friday      |            9 |
| 1     | 2021-02-04 15:54:08+00:00 | Thursday    |           15 |
| 2     | 2021-02-06 03:00:22+00:00 | Saturday    |            3 |
| 3     | 2021-02-05 20:26:08+00:00 | Friday      |           20 |
| 4     | 2021-02-03 23:14:54+00:00 | Wednesday   |           23 |
| ...   | ...                       | ...         |          ... |
| 80491 | 2022-04-29 11:02:23+00:00 | Friday      |           11 |
| 80508 | 2022-04-20 23:00:11+00:00 | Wednesday   |           23 |
| 80510 | 2022-04-17 09:45:01+00:00 | Sunday      |            9 |
| 80515 | 2022-04-27 13:53:55+00:00 | Wednesday   |           13 |


In [16]:
# sanity check for trending_time extraction
youtube_unique[['trending_time', 'trending_date']]

|       | trending_time                    | trending_date |
|------:|:---------------------------------|:--------------|
| 0     | 2021-02-07 05:46:51.832614+00:00 | 2021-02-07    |
| 1     | 2021-02-07 05:46:51.832649+00:00 | 2021-02-07    |
| 2     | 2021-02-07 05:46:51.832664+00:00 | 2021-02-07    |
| 3     | 2021-02-07 05:46:51.832678+00:00 | 2021-02-07    |
| 4     | 2021-02-07 05:46:51.832730+00:00 | 2021-02-07    |
| ...   | ...                              | ...           |
| 80491 | 2022-05-01 06:02:46.908172+00:00 | 2022-05-01    |
| 80508 | 2022-05-01 06:02:46.908332+00:00 | 2022-05-01    |
| 80510 | 2022-05-01 06:02:46.908350+00:00 | 2022-05-01    |
| 80515 | 2022-05-01 06:02:46.908396+00:00 | 2022-05-01    |
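Note that both timestamp columns are stored in UTC, so the extracted `publish_hour` is a UTC hour. If local time is more meaningful for the analysis, the column can be converted before extracting its components. The sketch below assumes Western Indonesia Time (`Asia/Jakarta`, UTC+7) is the time zone of interest, which is an assumption and not something the dataset states; `publish_time_wib` is a hypothetical name.

```python
import pandas as pd

# two publish times taken from the sample rows above
publish_time = pd.Series(pd.to_datetime(
    ["2021-02-05 09:00:34", "2021-02-05 20:26:08"], utc=True))

# assumption: Western Indonesia Time is the relevant local time zone
publish_time_wib = publish_time.dt.tz_convert("Asia/Jakarta")

local = pd.DataFrame({
    "utc_hour": publish_time.dt.hour,
    "wib_hour": publish_time_wib.dt.hour,
    "wib_day": publish_time_wib.dt.day_name(),
})
print(local)
```

The second row shows why this can matter: a video published at 20:26 UTC on Friday falls on Saturday at 03:26 in local time, so both the hour and the day name change.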


## Data Preparation

In this section, we will mainly focus on preparing the data to be shown on the dashboard, divided into three steps:

### Dashboard Sidebar: Input

<img src="assets/notebook/sidebar.png" width="150" align="right"/>

There are two inputs in the dashboard sidebar:

1. **Trending Date Range** expects `min_date` as the minimum value and `max_date` as the maximum value. The values respectively correspond to the minimum and maximum of `trending_date`

2. **Video Category** list can be obtained by taking the unique values of `category_id` and then prepending 'All Categories' manually

In [17]:
# date input
min_date = youtube_unique['trending_date'].min()
max_date = youtube_unique['trending_date'].max()

In [18]:
# sanity check
print(min_date)
print(max_date)

2021-02-07
2022-05-01


In [19]:
# options for select box
['All Categories'] + youtube_unique['category_id'].sort_values().unique().tolist()

['All Categories',
 'Autos & Vehicles',
 'Comedy',
 'Education',
 'Entertainment',
 'Film & Animation',
 'Gaming',
 'Howto & Style',
 'Music',
 'News & Politics',
 'People & Blogs',
 'Pets & Animals',
 'Science & Technology',
 'Sports',
 'Travel & Events',
 'Unknown']

### Filter Data Based on User Input

Let's say a user already selected the start date, end date, and video category. Then our task is to filter the data based on that user input.

In [20]:
# initialize values for selected data
from datetime import date
selected_start_date = date(2021, 2, 7) # 7 February 2021
selected_end_date = date(2022, 5, 1) # 1 May 2022
selected_category = 'Entertainment'

In [21]:
# filter date
youtube_unique = youtube_unique[
    (youtube_unique['trending_date'] >= selected_start_date) &
    (youtube_unique['trending_date'] <= selected_end_date)]

In [22]:
# filter category
youtube_unique = youtube_unique[youtube_unique['category_id'] == selected_category]

In [23]:
# sanity check
youtube_unique['category_id'].unique()

array(['Entertainment'], dtype=object)
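The select box also offers 'All Categories', which is not an actual value of `category_id`, so the category filter has to be skipped for that option. One way to handle this is sketched below on made-up data; `filter_videos` is a hypothetical helper, and returning a new DataFrame (instead of overwriting `youtube_unique`) is a design choice that keeps the full data available for the next selection.

```python
from datetime import date

import pandas as pd


def filter_videos(df, start_date, end_date, category):
    """Filter by trending_date range and, unless 'All Categories' is chosen, by category."""
    in_range = (df["trending_date"] >= start_date) & (df["trending_date"] <= end_date)
    output = df[in_range]
    # 'All Categories' means no category filter at all
    if category != "All Categories":
        output = output[output["category_id"] == category]
    return output


videos = pd.DataFrame({
    "title": ["a", "b", "c"],
    "category_id": ["Music", "Gaming", "Music"],
    "trending_date": [date(2021, 2, 7), date(2021, 6, 1), date(2022, 5, 1)],
})

all_rows = filter_videos(videos, date(2021, 2, 7), date(2022, 5, 1), "All Categories")
music_2021 = filter_videos(videos, date(2021, 2, 7), date(2021, 12, 31), "Music")

print(len(all_rows))    # 3
print(len(music_2021))  # 1
```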

### Dashboard Body: Metrics

<center><img src="assets/notebook/body1.png" width="750"/></center>

There are two metrics in the dashboard body:

1. **Total unique videos** represents the number of unique `title` from filtered `youtube_unique`
2. **Total unique channels** represents the number of unique `channel_name` from filtered `youtube_unique`

In [24]:
# total unique videos
youtube_unique['title'].nunique()

4659

In [25]:
# total unique channels
youtube_unique['channel_name'].nunique()

707
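Because videos were deduplicated on the pair (`channel_name`, `title`) earlier, counting unique `title` values alone can come out lower than the number of distinct videos whenever two channels share a title. Whether that happens in this dataset is not verified here; the sketch below only illustrates the difference on made-up data.

```python
import pandas as pd

videos = pd.DataFrame({
    "channel_name": ["Channel A", "Channel B", "Channel B"],
    "title": ["Live Stream", "Live Stream", "Highlights"],
})

# unique titles only
unique_titles = videos["title"].nunique()

# unique (channel_name, title) pairs
unique_pairs = len(videos.drop_duplicates(subset=["channel_name", "title"]))

print(unique_titles)  # 2
print(unique_pairs)   # 3
```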

## Data Visualization

In this section, we will create visualizations for the dashboard using `plotly`. It is an interactive, open-source plotting library that supports over 40 unique chart types covering a wide range of statistical, financial, geographic, scientific, and 3-dimensional use-cases. `plotly.express` is a high-level wrapper for `plotly`, which essentially means it does a lot of the things that you can do with `plotly` but with a much simpler syntax.

> References:
> - [Setting Plotly Express Styling Defaults](https://plotly.com/python/styling-plotly-express/#setting-plotly-express-styling-defaults)
> - [Using Built-In Themes](https://plotly.com/python/templates/#using-builtin-themes)
> - [Using Built-In Continuous Color Scales](https://plotly.com/python/builtin-colorscales/#using-builtin-continuous-color-scales)

In [26]:
# data visualization
import plotly.express as px
px.defaults.template = "plotly_dark"
px.defaults.color_continuous_scale = "reds"

### Bar Chart

YouTubers want their videos to always be on the trending page, but sometimes they run out of content ideas. For inspiration, they can look at channels that frequently appear on the trending page. Therefore, we want to visualize the top 10 trending channels in `selected_category`, ranked by the number of trending videos.

In [27]:
data = youtube_unique['channel_name'].value_counts().nlargest(10).sort_values(ascending=True)
data

123 GO! GOLD Indonesian          47
Rans Entertainment               47
Baim Paula                       48
MasterChef Indonesia             55
AH                               69
Indosiar                         99
Ricis Official                  103
Deddy Corbuzier                 161
RCTI - LAYAR DRAMA INDONESIA    177
TRANS7 OFFICIAL                 186
Name: channel_name, dtype: int64

We use a bar chart because we want to compare the video count for each `channel_name`. We specify several arguments, the first four in `px.bar()` and the last two in `fig.update_layout()`:

- `orientation='h'` to make a horizontal bar plot
- `title` to set the title of the plot
- `labels` to set the axis label of the plot
- `color=data` to set the color gradient based on the values on `data`
- `coloraxis_showscale=False` to hide the color bar legend
- `xaxis_separatethousands=True` to enable thousand separators in the x-axis

> References:
> - [Bar chart with Plotly Express](https://plotly.com/python/bar-charts/)
> - [Hiding or Customizing the Plotly Express Color Bar](https://plotly.com/python/colorscales/#hiding-or-customizing-the-plotly-express-color-bar)
> - [Layout scene `xaxis_separatethousands`](https://plotly.com/python/reference/layout/scene/#layout-scene-xaxis-separatethousands)

In [28]:
fig = px.bar(
    data,
    orientation='h',
    title=f'Top 10 Trending Channels in {selected_category}',
    labels=dict(index='', value='Video Count'),
    color=data)
fig.update_layout(coloraxis_showscale=False, xaxis_separatethousands=True)
fig.show()

### Heatmap

YouTubers also need to know the best time to publish a video so that it is more likely to enter the trending page. Therefore, we want to visualize a heatmap that displays the count of trending videos in `selected_category` for each `publish_hour` and `publish_day`.

We specify several arguments, the first three bullets in `px.density_heatmap()` and the last two in `fig.update_yaxes()`:

- `x` and `y` as the axis of the heatmap
- `title` to set the title of the plot
- `labels` to set the axis label of the plot
- `categoryarray` to set the order in which categories appear
- `autorange='reversed'` to reverse the axis

> References:
> - [2D Histogram](https://plotly.com/python/2D-Histogram/)
> - [Layout yaxes `categoryarray`](https://plotly.com/python/reference/layout/yaxis/#layout-yaxis-categoryarray)

In [29]:
fig = px.density_heatmap(
    youtube_unique,
    x='publish_hour',
    y='publish_day',
    title=f'Count of Trending Videos in {selected_category}',
    labels=dict(publish_hour='Publish Hour', publish_day='Publish Day'))
fig.update_yaxes(
    categoryarray=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
    autorange='reversed')
fig.show()
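The numbers behind the heatmap can also be inspected as a table. The following is a minimal sketch using `pd.crosstab()` on a few made-up rows; it is an alternative view of the same kind of counts, not part of the original dashboard.

```python
import pandas as pd

videos = pd.DataFrame({
    "publish_day": ["Friday", "Friday", "Monday", "Friday"],
    "publish_hour": [9, 9, 15, 20],
})

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# rows are days, columns are hours, cells are video counts;
# reindex puts the days in calendar order and fills absent days with 0
counts = (
    pd.crosstab(videos["publish_day"], videos["publish_hour"])
    .reindex(day_order, fill_value=0)
)
print(counts)
```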

## Streamlit Dashboard

`streamlit` is a free, open-source, all-Python framework that enables data scientists to quickly build interactive dashboards and machine learning web apps with no front-end web development experience required.

**Pros of `streamlit`:**
- Extremely easy to learn compared to other frameworks
- No need to worry about front-end development
- Fast development-to-deployment time

**Cons of `streamlit`:**
- Not scalable for complex dashboard features
- The front-end components are not easily customizable
- Relatively new (launched Oct 2019), so less stable than other frameworks. Sometimes it's hard to find answers to certain questions

### Main Concept

A `streamlit` app cannot be served from inside a Jupyter Notebook; it must be run from a Python script (`.py`) with the following command:

```
streamlit run script_name.py
```

> Reference: [`streamlit` main concepts](https://docs.streamlit.io/library/get-started/main-concepts)

In [30]:
import streamlit as st
st.markdown("Hello World!")

2022-05-03 21:49:59.666 
  command:

    streamlit run C:\Users\tomyt\anaconda3\envs\youtube-streamlit\lib\site-packages\ipykernel_launcher.py [ARGUMENTS]


DeltaGenerator(_root_container=0, _provided_cursor=None, _parent=None, _block_type=None, _form_data=None)

### Page Configuration

1. We can configure `streamlit` default theme by defining it in the `[theme]` section of a `.streamlit/config.toml` file:

    ```
    [theme]
    base="dark"
    primaryColor="red"
    ```

> Reference: [`streamlit` theming](https://docs.streamlit.io/library/advanced-features/theming)


2. We can configure the default settings of a `streamlit` page using `st.set_page_config()`

> Reference: [`streamlit` utilities `st.set_page_config`](https://docs.streamlit.io/library/api-reference/utilities/st.set_page_config)

### Dashboard Sidebar

`streamlit` makes it easy to organize your widgets in a left panel sidebar with `st.sidebar`. Each element that's passed to `st.sidebar` is pinned to the left, allowing users to focus on the content in your app while still having access to UI controls.

For example, if you want to add a markdown to a sidebar, use `st.sidebar.markdown()` instead of `st.markdown()`.

<center><img src="assets/notebook/layout.png" width="900"/></center>

<img src="assets/notebook/sidebar.png" width="150" align="right"/>

There are three components inside the sidebar:

1. [Media: Image](https://docs.streamlit.io/library/api-reference/media/st.image)

    ```
    from PIL import Image
    image = Image.open(PATH_TO_IMAGE)
    st.image(image)
    ```

2. [Widgets: Date input](https://docs.streamlit.io/library/api-reference/widgets/st.date_input)

    ```
    st.date_input(label, min_value, max_value, value=[___, ___])
    ```

3. [Widgets: Select box](https://docs.streamlit.io/library/api-reference/widgets/st.selectbox)

    ```
    st.selectbox(label, options)
    ```
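When `value` is given as a pair of dates, `st.date_input()` returns a tuple, and while the user has picked only the start date that tuple can hold a single element. A small helper can guard against this before the filter runs. The sketch below is plain Python so it can be tried outside Streamlit; `unpack_date_range` is a hypothetical name, and falling back to the maximum date is one possible choice rather than the dashboard's confirmed behavior.

```python
from datetime import date


def unpack_date_range(selection, fallback_end):
    """Return (start, end) from a date-range selection that may be incomplete."""
    # a single date (not wrapped in a tuple or list) is treated as the start
    if isinstance(selection, date):
        selection = (selection,)
    if len(selection) == 2:
        start, end = selection
    elif len(selection) == 1:
        # only the start date has been picked so far
        start, end = selection[0], fallback_end
    else:
        raise ValueError("expected one or two dates")
    return start, end


max_date = date(2022, 5, 1)
print(unpack_date_range((date(2021, 2, 7), date(2021, 12, 31)), max_date))
print(unpack_date_range((date(2021, 2, 7),), max_date))
```

In the app, the value returned by the date input widget would be passed as `selection`, with `max_date` as the fallback.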

### Dashboard Body

#### First Section

<center><img src="assets/notebook/body1.png" width="750"/></center>

1. [Text: Title](https://docs.streamlit.io/library/api-reference/text/st.title)

    ```
    st.title(body)
    ```

2. [Data: Metric](https://docs.streamlit.io/library/api-reference/data/st.metric)
   
    ```
    st.metric(label, value)
    ```

3. [Layout: Columns](https://docs.streamlit.io/library/api-reference/layout/st.columns)

    For example, we create two columns with the same width:
    ```
    col1, col2 = st.columns(2)
    col1.metric()
    col2.metric()
    ```

#### Second Section

<center><img src="assets/notebook/body2.png" width="500"/></center>

1. [Text: Header](https://docs.streamlit.io/library/api-reference/text/st.header)

    ```
    st.header(body)
    ```

2. [Emoji Shortcodes](https://share.streamlit.io/streamlit/emoji-shortcodes) are a way to enter emojis using pure ASCII. For example, typing `:smile:` shows 😄.

3. [Chart Elements: Plotly](https://docs.streamlit.io/library/api-reference/charts/st.plotly_chart)

    ```
    fig = ...
    st.plotly_chart(fig)
    ```

### Deployment

[Streamlit Cloud](https://streamlit.io/cloud) allows us to deploy `streamlit` apps in just one click.

<center><img src="assets/notebook/streamlit_sharing_silent.gif" width="1000"/></center>

Step-by-step:

1. Push the code to a GitHub repository
2. Log in to Streamlit Cloud: https://share.streamlit.io/signup
3. Click the "New App" button
4. Fill in the repository name, branch, and main file path
5. Click "Advanced settings" -> Change the Python version accordingly
6. Done :)

> Reference: [Deploy an app](https://docs.streamlit.io/streamlit-cloud/get-started/deploy-an-app)