# YouTube Channel Analysis PJT

## In this project, I will analyze the official YouTube channel of the K-pop group LE SSERAFIM.

# 1. Introduction
## 1.1 Background
LE SSERAFIM is a popular K-pop girl group that debuted in May 2022. The group is known for publishing a wide range of entertaining content, especially on its official YouTube channel. YouTube content, including music videos, now plays a big part in establishing an artist, and K-pop artists are generally front runners in using social content to boost their popularity. The most famous example is BTS, who built a strong bond with their fans ("ARMY") through clever use of social media. LE SSERAFIM, from Source Music (a subsidiary of HYBE), appears to be following in the footsteps of BTS, so I thought it would be interesting to gain some insight into how their YouTube content performs.

I want to analyze how their content performs overall and find out which types of content resonate with their fans. Finally, I want to suggest content ideas that might do well given past performance. The scope of the project is limited to the LE SSERAFIM channel; it does not compare against other K-pop groups' channels or other comparable YouTube channels.

## 1.2 Objectives
In this project, I will focus on the following:

- Get familiar with the YouTube API and use it to gather YouTube channel data
- Analyze video metrics to find out what types of content are popular among fans:
    - What types of content get the most views?
    - What types of content get the most engagement from fans?
    - Which content didn't perform well, and why?
    - How does video performance change over time?
- Use NLP techniques to gain insight into fan reactions:
    - Explore up to 100 top-level comments per video
    - Is there a meaningful difference in comment reactions across content types?
    - What kinds of questions or requests appear in the comment sections?
    - Are there content ideas that can come from the comment sections?
- Come up with content ideas based on the insights gained from the analysis above

## 1.3 Project process
1. Get the channel video data, and comments data from LE SSERAFIM channel using YouTube Data API v3.
2. Preprocess data and engineer new features
3. Perform exploratory data analysis
4. Conclusions

## 1.4 Dataset
### Data Source
For this project, I obtained the dataset myself by utilizing YouTube Data API v3. 

### Data Limitation
The data is a real-world dataset, suitable for research purposes. However, given the API quota limit of 10,000 units per day, I only analyzed the LE SSERAFIM channel and not other comparable channels. Where a comparison is needed, I will rely on my domain knowledge of comparable channels.
Comparing the performance of the LE SSERAFIM channel with other K-pop groups' channels would be interesting and could be the next step of this project.
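
To put the quota in perspective, here is a rough estimate of what one full pull of a channel costs. It assumes that each `list` call used in this project (channels, playlistItems, videos, commentThreads) costs 1 quota unit, which should be checked against the current quota documentation. `estimate_quota_units` is a hypothetical helper, not part of the API.

```python
import math

def estimate_quota_units(video_count, page_size=50, cost_per_call=1):
    """Rough quota estimate for one full pull of a channel.

    Assumption: every list call costs `cost_per_call` units.
    """
    channel_calls = 1
    # Video ids and video stats are both fetched 50 at a time
    playlist_calls = math.ceil(video_count / page_size)
    video_calls = math.ceil(video_count / page_size)
    # One commentThreads page (up to 100 comments) per video
    comment_calls = video_count
    total_calls = channel_calls + playlist_calls + video_calls + comment_calls
    return total_calls * cost_per_call

print(estimate_quota_units(318))  # 318 is the channel's video_count; prints 333
```

Under that assumption, the cost is driven mainly by the number of videos and by how many comment pages are requested per video.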

Also, comments are limited to 100 top-level comments per video because of the same API quota limit. The video metrics are cumulative totals (total views and total engagement counts), which makes it hard to compare videos on equal footing because older videos have had more time to accumulate views. Metrics for the first 7 or 14 days after upload would be better, but the API does not offer them. Instead, we will use a basic discounting method to account for differences in upload time.
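
One simple way to account for upload time is to divide the cumulative views by the video's age in days. The sketch below is only one possible adjustment, not necessarily the method used later in this notebook. The function name `views_per_day`, the one-day floor, and the collection date are all assumptions for illustration.

```python
from datetime import datetime

def views_per_day(total_views, upload_date, as_of):
    """Average views per day since upload.

    The age is floored at one day so that very recent uploads
    do not produce inflated rates.
    """
    age_days = max((as_of - upload_date).total_seconds() / 86400, 1.0)
    return total_views / age_days

# Assumed data collection date, for illustration only
as_of = datetime(2023, 3, 23)
rate = views_per_day(142752, datetime(2023, 3, 21, 12, 0, 2), as_of)
print(round(rate, 1))
```

A plain average still favors videos that receive most of their views in the first few days, so it is a rough correction rather than a substitute for true 7-day or 14-day metrics.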

### Ethics of data source
According to the YouTube API guide, the YouTube API is free to use and open to anyone who has created an API key. As long as the API user stays within the YouTube API quota limit, there is no issue in using the API to get data. Also, the data is public and can be viewed on the YouTube channel itself, so the data source raises no privacy concerns.


In [368]:
# Import basic libraries
import pandas as pd
import seaborn as sns

# Import API related libraries 
from googleapiclient.discovery import build
from IPython.display import JSON

# Import API_KEY from config directory
import sys
sys.path.append('./config/')
import yt_api_key as api

# 2. Data collection with YouTube Data API v3

First, I created an API key from the Google Cloud Platform (GCP) console and enabled YouTube Data API v3 for my account. I saved the API key in a separate config directory so that I can import it without exposing the key in the notebook. Then, I looked up the channel ID of the LE SSERAFIM YouTube channel from the channel URL and created the `get_channel_stats`, `get_video_ids`, `get_video_stats`, and `get_comments_in_videos` functions to collect video statistics and comment data for all the videos in the channel via the API.

In [369]:
# Build API service
youtube = build('youtube', 'v3', developerKey=api.API_KEY)

# Get the channel id of Le sserafim channel
channel_id = 'UCs-QBT4qkj_YiQw1ZntDO3g'

In [5]:
# Define a function to get the basic channel stats and the uploads playlist id
def get_channel_stats(youtube, channel_id):
    
    request = youtube.channels().list(
        part="snippet,contentDetails,statistics",
        id=channel_id)
    response = request.execute()
    
    all_data = []
    
    for item in response['items']:
        data = {'channel_name': item['snippet']['title'],
                'subscribers': item['statistics']['subscriberCount'],
                'total_views': item['statistics']['viewCount'],
                'video_count': item['statistics']['videoCount'],
                'playlist_id': item['contentDetails']['relatedPlaylists']['uploads']
               }
        
        all_data.append(data)
    
    return pd.DataFrame(all_data)


def get_video_ids(youtube, playlist_id):
    
    video_ids = []
    
    request = youtube.playlistItems().list(
        part="snippet,contentDetails",
        playlistId= playlist_id,
        maxResults = 50
    )

    response = request.execute()
    
    for item in response['items']:
        video_ids.append(item['contentDetails']['videoId'])
        
    next_page_token = response.get('nextPageToken')
    
    while next_page_token is not None:
        request = youtube.playlistItems().list(
            part="snippet,contentDetails",
            playlistId= playlist_id,
            maxResults = 50,
            pageToken = next_page_token            
            )

        response = request.execute()
        
        for item in response['items']:
            video_ids.append(item['contentDetails']['videoId'])

        next_page_token = response.get('nextPageToken')

    return video_ids


# Get the video stats

def get_video_stats(youtube, video_ids):

    all_video_stat = []
    
    for i in range(0, len(video_ids), 50):
        request = youtube.videos().list(
                part="snippet,contentDetails,statistics",
                id=','.join(video_ids[i:i+50])
            )
        response = request.execute()

        for video in response['items']:
            stats = {'snippet': ['publishedAt', 'title', 'description', 'tags'],
                     'contentDetails': ['duration'],
                     'statistics': ['viewCount','likeCount','favoriteCount','commentCount']}

            video_stat ={}
            video_stat['video_id'] = video['id']

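            # Use distinct loop names so the outer batch index `i` is not shadowed
            for part, fields in stats.items():
                for field in fields:
                    # Some videos lack a field (e.g. tags), so store None instead
                    try:
                        video_stat[field] = video[part][field]
                    except KeyError:
                        video_stat[field] = None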

            all_video_stat.append(video_stat)
        
    return pd.DataFrame(all_video_stat)


# Get up to 100 top-level comments from each video (one page of results)
def get_comments_in_videos(youtube, video_ids):
    
    all_comments = []
    
    for video_id in video_ids:
        try:
            request = youtube.commentThreads().list(
                part="snippet,replies",
                videoId=video_id,
                maxResults = 100
            )

            response = request.execute()

            for comment in response['items']:
                comments = {}
                comments['video_id'] = comment['snippet']['videoId']

                toplevel = {'snippet': ['authorDisplayName', 'textOriginal', 'likeCount', 'publishedAt']}

                for i in toplevel['snippet']:
                    comments[i] = comment['snippet']['topLevelComment']['snippet'][i]

                comments['reply_count'] = comment['snippet']['totalReplyCount']

                all_comments.append(comments)
                
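        except Exception as err:
            # The request can fail for a video, e.g. when its comments are disabled
            print('No comment available for ' + video_id + ': ' + str(err))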
            
    return pd.DataFrame(all_comments)
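
The `nextPageToken` loop in `get_video_ids` repeats the same request code twice. As a sketch of the same pattern in one place, the hypothetical helper `collect_all_pages` below keeps requesting pages until the response has no `nextPageToken`. It is demonstrated against a fake two-page response rather than the real API, so the dictionary contents are made up.

```python
def collect_all_pages(fetch_page):
    """Call fetch_page(page_token) until no nextPageToken is returned.

    fetch_page receives None for the first page and must return a dict
    shaped like an API list response ('items', optional 'nextPageToken').
    """
    items = []
    page_token = None
    while True:
        response = fetch_page(page_token)
        items.extend(response.get('items', []))
        page_token = response.get('nextPageToken')
        if page_token is None:
            break
    return items

# Stand-in for the API: two fake pages keyed by page token
fake_pages = {
    None: {'items': ['a', 'b'], 'nextPageToken': 'p2'},
    'p2': {'items': ['c']},
}
all_items = collect_all_pages(lambda token: fake_pages[token])
print(all_items)
```

With the real client, `fetch_page` could wrap the `playlistItems().list(...).execute()` call from `get_video_ids`.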

### Get Le sserafim YouTube channel statistics

Using the `get_channel_stats` function with LE SSERAFIM's channel ID, we can get the channel information. In particular, we need to save the `playlist_id` of the uploads playlist in order to retrieve the video IDs of all the videos in the channel.

In [6]:
# Get the channel stats
channel_info = get_channel_stats(youtube, channel_id)

In [7]:
channel_info

Unnamed: 0,channel_name,subscribers,total_views,video_count,playlist_id
0,LE SSERAFIM,2620000,713242804,318,UUs-QBT4qkj_YiQw1ZntDO3g


### Get Le sserafim channel's video statistics

Get the video IDs of all the videos with the `get_video_ids` function, and put the list in a dataframe.

In [21]:
# Get the uploads playlist id of the LE SSERAFIM channel
playlist_id = channel_info['playlist_id'].iloc[0]
# Get the ids of all videos in that playlist
video_ids = get_video_ids(youtube, playlist_id)
# Create a dataframe with the video_ids
video_id_list = pd.DataFrame(video_ids, columns=['video_id'])

Get the video statistics for all the video IDs in the list using the `get_video_stats` function.

In [37]:
# Get the video stats for all the videos of Le sserafim channel
video_stat = get_video_stats(youtube, video_ids)

Get up to 100 top-level comments for each video using the `get_comments_in_videos` function.

In [None]:
# Get top 100 comments of each video
comments = get_comments_in_videos(youtube, video_ids)

Save all the data gathered from the API to CSV files, so that we do not have to call the API every time we need the original copy of the data and can use the CSV files as a reference instead.

In [115]:
# Get video id list into csv file
video_id_list.to_csv('LS_video_ids.csv', index=False)
# Save video stat data into csv file
video_stat.to_csv('video_stat.csv', index=False)
# Save comments data in csv file
comments.to_csv('comments.csv', index=False)
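
The save-then-reload steps can also be wrapped in a small cache helper, so the API is only called when no CSV exists yet. This is a sketch under assumptions: `load_or_fetch` is a hypothetical helper, and `fake_fetch` stands in for a real call such as `get_video_stats`.

```python
import os
import tempfile

import pandas as pd

def load_or_fetch(path, fetch):
    """Return the cached CSV at `path`, or call fetch() and cache its result."""
    if os.path.exists(path):
        return pd.read_csv(path)
    df = fetch()
    df.to_csv(path, index=False)
    return df

# Count how often the (fake) API call actually runs
calls = []

def fake_fetch():
    calls.append(1)
    return pd.DataFrame({'video_id': ['a', 'b']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'video_ids.csv')
    first = load_or_fetch(path, fake_fetch)   # runs fake_fetch and writes the CSV
    second = load_or_fetch(path, fake_fetch)  # reads the CSV, no fetch

print(len(calls))
```

Note that a cached file never refreshes on its own, so it has to be deleted to pull updated view counts.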

Read the CSV files back in to perform EDA.

In [424]:
# Read video id list csv file
video_id_list = pd.read_csv('LS_video_ids.csv')
# Get video_stat csv file data
video_stat = pd.read_csv('video_stat.csv')
# Get comments csv file
comments = pd.read_csv('comments.csv')

# 3. Preprocess data and engineer new features

Some preprocessing is needed before we dive into the analysis.
- First, I will drop columns that are irrelevant to the project and rename the columns of each dataframe to a more convenient format.
- Then, I will check for null or empty values and either fill them in or drop the rows if necessary.
- Next, I will reformat some values, especially the time-related columns (`publishedAt`, `duration`), and change the data types to appropriate ones.

Examine the head of each dataframe to look for data cleaning needs.

In [425]:
video_stat.head()

Unnamed: 0,video_id,publishedAt,title,description,tags,duration,viewCount,likeCount,favoriteCount,commentCount
0,yTKQQZsVeTs,2023-03-22T12:00:01Z,"르카페☕️️ OPEN❣️ 첫 손님(feat. 채채즈), 어서오세요😊 #LE_SSER...",#LE_SSERAFIM #르세라핌 공식 채널\n#LE_SSERAFIM #르세라핌 O...,,PT44S,8688,2038,0,0
1,xQJmEiZXFYE,2023-03-22T11:00:01Z,[LENIVERSE] EP.17 LE CAFÉ 1편,#르세라핌 #르니버스 #EP17\n\n#LE_SSERAFIM #르세라핌 공식 채널\...,"['LE SSERAFIM', '르세라핌']",PT31M6S,70309,10581,0,735
2,T5qURRwoLHo,2023-03-21T13:00:30Z,😉🫶🫡👋😚 @ FEARNADA #LE_SSERAFIM #르세라핌 #shorts,,,PT34S,104685,17962,0,200
3,rp3BqWM1cC8,2023-03-21T12:00:02Z,[EPISODE] LE SSERAFIM(르세라핌) @ GMO SONIC 2023,#LE_SSERAFIM #르세라핌 공식 채널\n#LE_SSERAFIM #르세라핌 O...,"['LE SSERAFIM', '르세라핌']",PT13M52S,142752,11623,0,401
4,0LZiW_ut1jA,2023-03-19T11:45:00Z,Love you twice FEARNOT ! 🫰#LE_SSERAFIM #르세라핌 #...,#LE_SSERAFIM #르세라핌 공식 채널\n#LE_SSERAFIM #르세라핌 O...,,PT10S,90006,13686,0,171


In [426]:
comments.head()

Unnamed: 0,video_id,authorDisplayName,textOriginal,likeCount,publishedAt,reply_count
0,_D78-pgvnzg,ˊo̴̶̷̤.̮o̴̶̷̤ˋ,컨셉이 다르다보니 춤선이 해린은 통통튀는 춤선이고 은채는 딱 깔끔한 춤선인데 둘 다...,7,2022-11-11T23:41:15Z,0
1,_D78-pgvnzg,Win Cup,두 그룹이 콜레보도 언젠가 가능할듯!,7,2022-11-11T23:32:58Z,0
2,_D78-pgvnzg,Laconist JH,확실히 뉴진스는 발을 많이 쓰는 편이고 르세라핌은 상체모션을 더 많이 쓰는 것같은데...,1,2022-11-11T23:13:41Z,0
3,_D78-pgvnzg,Jhon Mendes,Haerin 6 member fits him perfectly,6,2022-11-11T22:44:47Z,0
4,_D78-pgvnzg,두밧두,은채는파워풀하고 해린은고급지면서 파워풀해,1,2022-11-11T22:42:59Z,0


### Drop Columns

I wanted to drop unnecessary columns first. For `video_stat`, I noticed that `tags` only contains LE SSERAFIM in English and Korean (or is missing), and the `favoriteCount` values were all 0, so there is no reason to keep either column. <br>
On the other hand, the `comments` dataframe did not have any unnecessary columns.

In [427]:
print(video_stat['tags'].unique())
print(video_stat['favoriteCount'].unique())

[nan "['LE SSERAFIM', '르세라핌']"]
[0]


In [428]:
video_stat = video_stat[['video_id', 'publishedAt', 'title', 'description',
                         # 'tags','favoriteCount',
                         'duration','viewCount', 'likeCount', 'commentCount']]
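
The manual `unique()` check on `tags` and `favoriteCount` can be generalized to every column. The sketch below uses a hypothetical helper, `low_information_columns`, and a small made-up dataframe that mimics the two columns dropped here.

```python
import pandas as pd

def low_information_columns(df):
    """Columns with at most one distinct non-null value."""
    return [col for col in df.columns if df[col].nunique(dropna=True) <= 1]

# Made-up sample that mirrors the situation in video_stat
sample = pd.DataFrame({
    'video_id': ['a', 'b', 'c'],
    'tags': [None, "['LE SSERAFIM', '르세라핌']", None],
    'favoriteCount': [0, 0, 0],
    'viewCount': [10, 20, 30],
})
flagged = low_information_columns(sample)
print(flagged)
```

A flagged column is only a candidate for removal. Whether a column such as `tags` is really uninformative still needs a look at the values.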

### Change column names

Change the column names of each dataframe to shorter, clearer names.

In [429]:
print(video_stat.columns)
print(comments.columns)

Index(['video_id', 'publishedAt', 'title', 'description', 'duration',
       'viewCount', 'likeCount', 'commentCount'],
      dtype='object')
Index(['video_id', 'authorDisplayName', 'textOriginal', 'likeCount',
       'publishedAt', 'reply_count'],
      dtype='object')


In [430]:
video_stat = video_stat.rename(columns={'publishedAt':'upload_date', 'viewCount':'view', 'likeCount':'like', 'commentCount':'comment'})
comments = comments.rename(columns={'authorDisplayName':'user', 'textOriginal':'comment_detail', 'likeCount':'like', 'publishedAt':'comment_date', 'reply_count':'reply'})

### Check for null, empty values in video_stat

I checked for null and empty values in the `video_stat` dataframe. As shown below, 9 videos have no description. A missing description is not a critical issue, so I will leave these null values as they are.

In [431]:
video_stat.isnull().sum()

video_id       0
upload_date    0
title          0
description    9
duration       0
view           0
like           0
comment        0
dtype: int64

In [432]:
video_stat.query('description ==""')

Unnamed: 0,video_id,upload_date,title,description,duration,view,like,comment


### Check for null, empty values in comments

Then, I checked for null and empty values in the `comments` dataframe. There are some null values in the `user` and `comment_detail` columns.

In [433]:
comments.isnull().sum()

video_id          0
user              4
comment_detail    8
like              0
comment_date      0
reply             0
dtype: int64

The `user` value should not be null, and since these rows still have `comment_detail` and `like` values, I will fill in the missing user names with placeholder names. Since each null value occurred in a different video, I assumed that each comment came from a different, unique user.

In [434]:
comments[comments['user'].isnull()]

Unnamed: 0,video_id,user,comment_detail,like,comment_date,reply
513,Id0TzVyNjEE,,김채원 귀여운데 멋져 ❤,0,2023-03-21T10:15:39Z,0
9775,rboiHxBqdZk,,半音高い？国民の彼女かわいすぎ,5,2023-03-21T11:43:57Z,0
14787,54w71If3uW8,,where is when eunchae eating that gummy,0,2022-11-19T12:56:06Z,0
26587,_driMZojlQo,,엉덩이는 1개!!!,0,2023-02-08T11:29:09Z,0


In [435]:
# Replace null user names with placeholders JohnDoe1, JohnDoe2, ...
missing_name = comments[comments['user'].isnull()]

for i, idx in enumerate(missing_name.index):
    comments.loc[idx, 'user'] = "JohnDoe{}".format(i+1)

In [436]:
comments.loc[missing_name.index, 'user']

513      JohnDoe1
9775     JohnDoe2
14787    JohnDoe3
26587    JohnDoe4
Name: user, dtype: object

Something may have gone wrong when retrieving the comment text for these 8 comments. Since the number is small, I will just drop these rows.

In [437]:
comments[comments['comment_detail'].isnull()]

Unnamed: 0,video_id,user,comment_detail,like,comment_date,reply
386,rp3BqWM1cC8,Roel choco,,1,2023-03-21T15:45:48Z,0
11342,318cwWHhO3o,Enha bị Lé,,0,2023-03-01T03:44:06Z,0
18029,DnyMB-chxsY,Daly,,1,2022-11-19T11:32:40Z,0
18140,8x43gsnkBH8,Yunus Sınırtepe,,1,2023-03-06T07:19:59Z,0
18533,W2458wotx6I,핫쏘,,0,2022-10-18T15:42:11Z,0
21314,oqsbeoklb3Q,서영식,,0,2023-02-09T07:27:33Z,0
22001,GzklGjUQAGo,jocy,,0,2022-09-26T03:28:51Z,0
26792,GbSE2_pkp6Y,청년,,0,2023-02-25T13:24:28Z,0


In [438]:
# Drop rows with a null comment_detail and reassign the comments dataframe
comments = comments.dropna(subset=['comment_detail'])
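
One possible explanation for the missing `user` and `comment_detail` values, which I have not verified against the raw API responses, is the CSV round trip. By default `pd.read_csv` treats strings such as `null`, `NA`, and `nan` as missing values, so a user literally named "null" would come back as NaN. The sketch below shows the effect on made-up data and how `keep_default_na=False` avoids it.

```python
import io

import pandas as pd

# Made-up rows: a user named "null" and a comment that only says "NA"
csv_text = "video_id,user,comment_detail\nv1,null,nice video\nv2,someone,NA\n"

# Default parsing converts both strings to NaN
default_df = pd.read_csv(io.StringIO(csv_text))
# With keep_default_na=False the strings are kept as written
literal_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

missing_default = int(default_df.isnull().sum().sum())
missing_literal = int(literal_df.isnull().sum().sum())
print(missing_default, missing_literal)
```

Comparing the null counts before saving and after reloading would show whether this applies to the actual data.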

### Reformat values and change data types

First, I checked the head of the `video_stat` dataframe and found that I need to convert the two time-related columns (`upload_date`, `duration`) to appropriate data types. I also converted all the numeric columns to int type.

In [439]:
video_stat.head()

Unnamed: 0,video_id,upload_date,title,description,duration,view,like,comment
0,yTKQQZsVeTs,2023-03-22T12:00:01Z,"르카페☕️️ OPEN❣️ 첫 손님(feat. 채채즈), 어서오세요😊 #LE_SSER...",#LE_SSERAFIM #르세라핌 공식 채널\n#LE_SSERAFIM #르세라핌 O...,PT44S,8688,2038,0
1,xQJmEiZXFYE,2023-03-22T11:00:01Z,[LENIVERSE] EP.17 LE CAFÉ 1편,#르세라핌 #르니버스 #EP17\n\n#LE_SSERAFIM #르세라핌 공식 채널\...,PT31M6S,70309,10581,735
2,T5qURRwoLHo,2023-03-21T13:00:30Z,😉🫶🫡👋😚 @ FEARNADA #LE_SSERAFIM #르세라핌 #shorts,,PT34S,104685,17962,200
3,rp3BqWM1cC8,2023-03-21T12:00:02Z,[EPISODE] LE SSERAFIM(르세라핌) @ GMO SONIC 2023,#LE_SSERAFIM #르세라핌 공식 채널\n#LE_SSERAFIM #르세라핌 O...,PT13M52S,142752,11623,401
4,0LZiW_ut1jA,2023-03-19T11:45:00Z,Love you twice FEARNOT ! 🫰#LE_SSERAFIM #르세라핌 #...,#LE_SSERAFIM #르세라핌 공식 채널\n#LE_SSERAFIM #르세라핌 O...,PT10S,90006,13686,171


In [440]:
# Change upload_date column to datetime object
video_stat['upload_date'] = pd.to_datetime(video_stat['upload_date']).dt.tz_convert(None)

# Convert duration (ISO 8601 duration string) to a timedelta
import isodate
video_stat['duration'] = video_stat['duration'].apply(isodate.parse_duration)

# Change columns with number values to int type
video_stat[['view','like','comment']] = video_stat[['view','like','comment']].apply(pd.to_numeric)

# Change description and title to string type; fill missing descriptions with ''
# first, because astype(str) would turn NaN into the literal text 'nan'
video_stat['description'] = video_stat['description'].fillna('').astype(str)
video_stat['title'] = video_stat['title'].astype(str)
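
To make the `duration` format concrete: values such as `PT31M6S` are ISO 8601 durations (`P` for period, `T` before the time part, then hours, minutes, and seconds). The sketch below is a simplified, hypothetical parser for illustration only. It handles only whole days, hours, minutes, and seconds, whereas `isodate.parse_duration` covers the full format.

```python
import re
from datetime import timedelta

# Simplified pattern: optional days, then an optional time part
_DURATION_RE = re.compile(
    r'^P(?:(?P<days>\d+)D)?'
    r'(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$'
)

def parse_simple_duration(text):
    """Parse durations like 'PT44S' or 'PT31M6S' into a timedelta."""
    match = _DURATION_RE.match(text)
    if match is None:
        raise ValueError('Unsupported duration: ' + text)
    # Keep only the components that are present in the string
    parts = {name: int(value) for name, value in match.groupdict().items() if value}
    return timedelta(**parts)

print(parse_simple_duration('PT31M6S'))  # 0:31:06
```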

In [441]:
video_stat.head()

Unnamed: 0,video_id,upload_date,title,description,duration,view,like,comment
0,yTKQQZsVeTs,2023-03-22 12:00:01,"르카페☕️️ OPEN❣️ 첫 손님(feat. 채채즈), 어서오세요😊 #LE_SSER...",#LE_SSERAFIM #르세라핌 공식 채널\n#LE_SSERAFIM #르세라핌 O...,0 days 00:00:44,8688,2038,0
1,xQJmEiZXFYE,2023-03-22 11:00:01,[LENIVERSE] EP.17 LE CAFÉ 1편,#르세라핌 #르니버스 #EP17\n\n#LE_SSERAFIM #르세라핌 공식 채널\...,0 days 00:31:06,70309,10581,735
2,T5qURRwoLHo,2023-03-21 13:00:30,😉🫶🫡👋😚 @ FEARNADA #LE_SSERAFIM #르세라핌 #shorts,,0 days 00:00:34,104685,17962,200
3,rp3BqWM1cC8,2023-03-21 12:00:02,[EPISODE] LE SSERAFIM(르세라핌) @ GMO SONIC 2023,#LE_SSERAFIM #르세라핌 공식 채널\n#LE_SSERAFIM #르세라핌 O...,0 days 00:13:52,142752,11623,401
4,0LZiW_ut1jA,2023-03-19 11:45:00,Love you twice FEARNOT ! 🫰#LE_SSERAFIM #르세라핌 #...,#LE_SSERAFIM #르세라핌 공식 채널\n#LE_SSERAFIM #르세라핌 O...,0 days 00:00:10,90006,13686,171


In [453]:
video_stat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 317 entries, 0 to 316
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype          
---  ------       --------------  -----          
 0   video_id     317 non-null    object         
 1   upload_date  317 non-null    datetime64[ns] 
 2   title        317 non-null    object         
 3   description  317 non-null    object         
 4   duration     317 non-null    timedelta64[ns]
 5   view         317 non-null    int64          
 6   like         317 non-null    int64          
 7   comment      317 non-null    int64          
 8   year         317 non-null    int64          
 9   month        317 non-null    int64          
 10  year_month   317 non-null    object         
 11  eng_rate     317 non-null    float64        
dtypes: datetime64[ns](1), float64(1), int64(5), object(4), timedelta64[ns](1)
memory usage: 29.8+ KB


Next, I checked the head of the `comments` dataframe and found that I should convert the `comment_date` column to a datetime object. The data types of the other columns looked appropriate.

In [443]:
comments.head()

Unnamed: 0,video_id,user,comment_detail,like,comment_date,reply
0,_D78-pgvnzg,ˊo̴̶̷̤.̮o̴̶̷̤ˋ,컨셉이 다르다보니 춤선이 해린은 통통튀는 춤선이고 은채는 딱 깔끔한 춤선인데 둘 다...,7,2022-11-11T23:41:15Z,0
1,_D78-pgvnzg,Win Cup,두 그룹이 콜레보도 언젠가 가능할듯!,7,2022-11-11T23:32:58Z,0
2,_D78-pgvnzg,Laconist JH,확실히 뉴진스는 발을 많이 쓰는 편이고 르세라핌은 상체모션을 더 많이 쓰는 것같은데...,1,2022-11-11T23:13:41Z,0
3,_D78-pgvnzg,Jhon Mendes,Haerin 6 member fits him perfectly,6,2022-11-11T22:44:47Z,0
4,_D78-pgvnzg,두밧두,은채는파워풀하고 해린은고급지면서 파워풀해,1,2022-11-11T22:42:59Z,0


In [444]:
# Convert `comment_date` column to datetime object
comments['comment_date'] = pd.to_datetime(comments['comment_date']).dt.tz_convert(None)

In [445]:
comments.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30653 entries, 0 to 30660
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   video_id        30653 non-null  object        
 1   user            30653 non-null  object        
 2   comment_detail  30653 non-null  object        
 3   like            30653 non-null  int64         
 4   comment_date    30653 non-null  datetime64[ns]
 5   reply           30653 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(3)
memory usage: 1.6+ MB


### Add some features

I want to add some features for the analysis. These are the ideas for the `video_stat` dataframe:

- create datetime columns for time series analysis (`year`, `month`, `year_month`).
- calculate an engagement rate for each video by summing the like and comment counts and dividing by the view count.
- create a `content_format` column that holds the format name of each video (the text in square brackets in the title).
- create a `video_format` column that holds either VOD or Shorts depending on the video type.
- create a `promotion` column that is `Promotion` if the video is sponsored by an advertiser and `Original` otherwise.
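
As a worked example of the engagement rate defined in the list above, the sketch below applies the formula to the `[LENIVERSE] EP.17` video (10,581 likes, 735 comments, 70,309 views). The function name and the choice to return 0.0 for a video with no views are assumptions for illustration.

```python
def engagement_rate(likes, comments, views):
    """(likes + comments) / views, rounded to 3 decimals.

    Returns 0.0 for a video with no views (an assumption made here
    to avoid dividing by zero).
    """
    if views == 0:
        return 0.0
    return round((likes + comments) / views, 3)

# (10581 + 735) / 70309 = 11316 / 70309
print(engagement_rate(10581, 735, 70309))  # 0.161
```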

In [521]:
# Create datetime columns
video_stat['year'] = video_stat['upload_date'].dt.year
video_stat['month'] = video_stat['upload_date'].dt.month
video_stat['year_month'] = video_stat['upload_date'].dt.strftime('%Y%m')

# Create an engagement rate column
video_stat['eng_rate'] = ((video_stat['like'] + video_stat['comment'])/video_stat['view']).round(3)

# Create content_format column that has a format name of each video

# regular expression pattern to match the text inside square brackets
import re
pattern = r'\[(.*?)\]'  

# Apply the pattern to the 'title' column and create a new 'content_format' column
video_stat['content_format'] = video_stat['title'].apply(lambda x: re.findall(pattern, x)[0] if re.findall(pattern, x) else '')
video_stat.loc[video_stat['content_format'] == "", 'content_format'] = 'Other'

# Create a video_format column by extracting #shorts hashtag from the title column
video_stat.loc[video_stat['title'].str.contains("#shorts"),'video_format'] = 'Shorts'
video_stat['video_format'] = video_stat['video_format'].fillna('VOD',axis=0)

# Create a promotion column: rows 0 and 30 are labeled by index (the keyword search
# for "협찬", "광고", or "PPL" in the next cell matches row 30)
video_stat.loc[[0, 30], 'promotion'] = 'Promotion'
video_stat['promotion'] = video_stat['promotion'].fillna('Original', axis=0)
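
The three labels can also be derived per video from the title and description alone. The sketch below is an alternative to the cell above, not a reproduction of it. `label_video` is a hypothetical helper, and because it relies only on the keywords, it would not flag a video that was labeled by index without a matching keyword in its description.

```python
import re

FORMAT_PATTERN = re.compile(r'\[(.*?)\]')     # text inside square brackets
PROMO_PATTERN = re.compile(r'협찬|광고|PPL')  # sponsorship keywords

def label_video(title, description):
    """Return (content_format, video_format, promotion) for one video."""
    match = FORMAT_PATTERN.search(title)
    content_format = match.group(1) if match and match.group(1) else 'Other'
    video_format = 'Shorts' if '#shorts' in title else 'VOD'
    promotion = 'Promotion' if PROMO_PATTERN.search(description or '') else 'Original'
    return content_format, video_format, promotion

print(label_video('[LENIVERSE] EP.13 FPS 특집 1편',
                  '본 영상은 1993스튜디오(1993STUDIO)에서 협찬을 받아 촬영되었습니다.'))
```

With pandas, the function could be applied row by row, for example with `DataFrame.apply(..., axis=1)`.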

In [522]:
video_stat.loc[video_stat['description'].str.contains("협찬|광고|PPL")]

Unnamed: 0,video_id,upload_date,title,description,duration,view,like,comment,year,month,year_month,eng_rate,test,content_format,video_format,promotion
30,aO92SlPwIcQ,2023-02-22 11:00:01,[LENIVERSE] EP.13 FPS 특집 1편,본 영상은 1993스튜디오(1993STUDIO)에서 협찬을 받아 촬영되었습니다.\n...,0 days 00:29:04,778127,41035,1667,2023,2,202302,0.055,LENIVERSE,LENIVERSE,VOD,Promotion


In [546]:
video_stat.head()

Unnamed: 0,video_id,upload_date,title,description,duration,view,like,comment,year,month,year_month,eng_rate,content_format,video_format,promotion
0,yTKQQZsVeTs,2023-03-22 12:00:01,"르카페☕️️ OPEN❣️ 첫 손님(feat. 채채즈), 어서오세요😊 #LE_SSER...",#LE_SSERAFIM #르세라핌 공식 채널\n#LE_SSERAFIM #르세라핌 O...,0 days 00:00:44,8688,2038,0,2023,3,202303,0.235,Other,Shorts,Promotion
1,xQJmEiZXFYE,2023-03-22 11:00:01,[LENIVERSE] EP.17 LE CAFÉ 1편,#르세라핌 #르니버스 #EP17\n\n#LE_SSERAFIM #르세라핌 공식 채널\...,0 days 00:31:06,70309,10581,735,2023,3,202303,0.161,LENIVERSE,VOD,Original
2,T5qURRwoLHo,2023-03-21 13:00:30,😉🫶🫡👋😚 @ FEARNADA #LE_SSERAFIM #르세라핌 #shorts,,0 days 00:00:34,104685,17962,200,2023,3,202303,0.173,Other,Shorts,Original
3,rp3BqWM1cC8,2023-03-21 12:00:02,[EPISODE] LE SSERAFIM(르세라핌) @ GMO SONIC 2023,#LE_SSERAFIM #르세라핌 공식 채널\n#LE_SSERAFIM #르세라핌 O...,0 days 00:13:52,142752,11623,401,2023,3,202303,0.084,EPISODE,VOD,Original
4,0LZiW_ut1jA,2023-03-19 11:45:00,Love you twice FEARNOT ! 🫰#LE_SSERAFIM #르세라핌 #...,#LE_SSERAFIM #르세라핌 공식 채널\n#LE_SSERAFIM #르세라핌 O...,0 days 00:00:10,90006,13686,171,2023,3,202303,0.154,Other,VOD,Original


For the `comments` dataframe, I also want to add some features:

- Merge in the `video_stat` data to add `upload_date`, `video_format`, `content_format`, and `promotion` columns to the `comments` dataframe
- Calculate how long it took each user to comment after the video was uploaded, and store that value in a `time_to_comment` column

In [556]:
# Merge two dataframes and get necessary columns
comments = comments.merge(right= video_stat[['video_id','upload_date','video_format','content_format','promotion']], how='inner', on='video_id')

# Create a time_to_comment column by calculating the difference
comments['time_to_comment'] = comments['comment_date'] - comments['upload_date']
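
As a worked example of `time_to_comment`, the sketch below repeats the subtraction for a single comment using only the standard library. The two timestamps are the upload and comment times of the first row in the merged table, rewritten in the API's `...Z` format. The helper name and the assumption that timestamps carry no fractional seconds are mine.

```python
from datetime import datetime

API_TIME_FORMAT = '%Y-%m-%dT%H:%M:%SZ'  # assumed timestamp layout

def time_to_comment(upload_ts, comment_ts):
    """Elapsed time between a video's upload and a comment on it."""
    upload = datetime.strptime(upload_ts, API_TIME_FORMAT)
    comment = datetime.strptime(comment_ts, API_TIME_FORMAT)
    return comment - upload

delta = time_to_comment('2022-11-09T08:00:15Z', '2022-11-11T23:41:15Z')
hours = delta.total_seconds() / 3600
print(delta)  # 2 days, 15:41:00
```

Converting the difference to hours, as in `hours` above, gives a plain number that is easier to bin or plot than a timedelta.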

In [560]:
comments.head()

Unnamed: 0,video_id,user,comment_detail,like,comment_date,reply,upload_date,video_format,content_format,promotion,time_to_comment
0,_D78-pgvnzg,ˊo̴̶̷̤.̮o̴̶̷̤ˋ,컨셉이 다르다보니 춤선이 해린은 통통튀는 춤선이고 은채는 딱 깔끔한 춤선인데 둘 다...,7,2022-11-11 23:41:15,0,2022-11-09 08:00:15,Shorts,Other,Original,2 days 15:41:00
1,_D78-pgvnzg,Win Cup,두 그룹이 콜레보도 언젠가 가능할듯!,7,2022-11-11 23:32:58,0,2022-11-09 08:00:15,Shorts,Other,Original,2 days 15:32:43
2,_D78-pgvnzg,Laconist JH,확실히 뉴진스는 발을 많이 쓰는 편이고 르세라핌은 상체모션을 더 많이 쓰는 것같은데...,1,2022-11-11 23:13:41,0,2022-11-09 08:00:15,Shorts,Other,Original,2 days 15:13:26
3,_D78-pgvnzg,Jhon Mendes,Haerin 6 member fits him perfectly,6,2022-11-11 22:44:47,0,2022-11-09 08:00:15,Shorts,Other,Original,2 days 14:44:32
4,_D78-pgvnzg,두밧두,은채는파워풀하고 해린은고급지면서 파워풀해,1,2022-11-11 22:42:59,0,2022-11-09 08:00:15,Shorts,Other,Original,2 days 14:42:44


# 4. Perform EDA

- Analyze video metrics to find out what types of content are popular among fans:
    - What types of content get the most views?
    - What types of content get the most engagement from fans?
    - Which content didn't perform well, and why?
    - How does video performance change over time?
- Use NLP techniques to gain insight into fan reactions:
    - Explore up to 100 top-level comments per video
    - Is there a meaningful difference in comment reactions across content types?
    - What kinds of questions or requests appear in the comment sections?
    - Are there content ideas that can come from the comment sections?
- Come up with content ideas based on the insights gained from the analysis above