# Notebook for Data Scraping & Labelling
The data is scraped from the comments on the YouTube video "Jake Paul vs. Mike Tyson FIGHT HIGHLIGHTS 🥊 | ESPN Ringside" (https://www.youtube.com/watch?v=Aja2KfuoqGA) on the ESPN channel.

# API Key
To run this notebook and overwrite the dataset, you first need to obtain an API key from the Google Cloud Console:
1. Log in to the Google Cloud Console
2. Create a new project and give it a name
3. Open "APIs & Services" > Library
4. Search for "YouTube Data API v3" and enable it
5. Select Public Data as the credential type
6. Create the credentials and copy the API key
7. Create a .env file in the root directory
8. Add the line API_KEY=YOUR_NEWLY_COPIED_API_KEY to the .env file
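
If the .env file is missing or the variable is misspelled, `os.getenv("API_KEY")` returns `None` and the failure only surfaces later, when the first API request is made. The sketch below shows one way to fail early. `require_env` is a hypothetical helper that is not part of the notebook, and it reads the process environment directly, so it assumes `load_dotenv()` (or an equivalent) has already run.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; the notebook itself calls
    os.getenv("API_KEY") directly.
    """
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add a line {name}=<your key> to the .env file."
        )
    return value
```

A call such as `api_key = require_env("API_KEY")` could then replace the plain `os.getenv` lookup.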

# References
The code snippets are adapted from https://www.geeksforgeeks.org/sentiment-analysis-of-youtube-comments/ with some modifications.

## Scraping Preparation

In [1]:
# import the required libraries
from googleapiclient.discovery import build
from dotenv import load_dotenv
import os
import re
import emoji
from bs4 import BeautifulSoup

In [2]:
# load the API key from the .env file
load_dotenv()
api_key = os.getenv("API_KEY")
youtube = build("youtube", "v3", developerKey=api_key)

In [3]:
# get the video ID from the URL
VIDEO_URL = "https://www.youtube.com/watch?v=Aja2KfuoqGA"
# the ID is taken as the last 11 characters, which assumes the URL ends with the video ID
video_id = VIDEO_URL[-11:]
print(video_id)

Aja2KfuoqGA
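
Slicing the last 11 characters works for the URL above but breaks as soon as the URL carries extra query parameters such as a timestamp. The following is a minimal sketch of a more tolerant alternative based on `urllib.parse`. `extract_video_id` is a hypothetical helper, and the set of URL shapes it handles (`watch?v=...` and `youtu.be/...`) is an assumption rather than an exhaustive list.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from a YouTube URL, or None if none is found."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    # Short links: https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.lstrip("/") or None

    # Regular links: https://www.youtube.com/watch?v=<id>&...
    if host == "youtube.com" or host.endswith(".youtube.com"):
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None

    return None


video_id_example = extract_video_id(
    "https://www.youtube.com/watch?v=Aja2KfuoqGA&t=10s"
)
```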


In [4]:
# fetch the video resource from the API
video_response = youtube.videos().list(
    part="snippet",
    id=video_id
).execute()

In [5]:
# get the video snippet and the uploader's channel ID
video_snippet = video_response['items'][0]['snippet']
uploader_channel_id = video_snippet['channelId']
print("channel id: " + uploader_channel_id)

channel id: UCiWLfSweyRNmLpgEHekhoAg


## Scraping

In [6]:
# collect about 1000 comments from the video
# (the limit is checked once per page, so the final count can slightly exceed MAX_COMMENTS)
MAX_COMMENTS = 1000
comments = []
nextPageToken = None
while len(comments) < MAX_COMMENTS:
    request = youtube.commentThreads().list(
        part='snippet',
        videoId=video_id,
        maxResults=100,  # You can fetch up to 100 comments per request
        pageToken=nextPageToken
    )
    response = request.execute()
    for item in response['items']:
        comment = item['snippet']['topLevelComment']['snippet']
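        # Skip comments written by the video uploader; authorChannelId may be
        # missing from some responses, so read it defensively
        author_channel_id = comment.get('authorChannelId', {}).get('value')
        if author_channel_id != uploader_channel_id:
            comments.append(comment['textDisplay'])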
    nextPageToken = response.get('nextPageToken')

    if not nextPageToken:
        break
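
The loop above follows a common token-based pagination pattern: request a page, collect its items, and continue with `nextPageToken` until the token is absent or enough items have been gathered. The sketch below isolates that pattern so it can be run without an API key. `collect_items`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names, and the fake pages only imitate the shape of a `commentThreads().list` response. Unlike the notebook loop, this version truncates the result to exactly `max_items`.

```python
from typing import Callable, Optional


def collect_items(fetch_page: Callable[[Optional[str]], dict], max_items: int) -> list:
    """Collect up to max_items items from a token-paginated source."""
    items: list = []
    token: Optional[str] = None
    while len(items) < max_items:
        page = fetch_page(token)
        items.extend(page.get("items", []))
        token = page.get("nextPageToken")
        if not token:
            break  # no further pages
    # A page can overshoot the limit, so trim the result
    return items[:max_items]


# Fake three-page source standing in for the real API (made-up data)
FAKE_PAGES = {
    None: {"items": [1, 2, 3], "nextPageToken": "p2"},
    "p2": {"items": [4, 5, 6], "nextPageToken": "p3"},
    "p3": {"items": [7]},
}


def fake_fetch(token: Optional[str]) -> dict:
    return FAKE_PAGES[token]


first_five = collect_items(fake_fetch, 5)
everything = collect_items(fake_fetch, 100)
```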

In [7]:
# show the first 5 comments
for comment in comments[:5]:
    print(comment)

Why was Mikes right hand wrap missing at the announcement?
Ngl thats cap
Mike le gano 100%, mike es un gran guerrero.
Wooooooow Mike Tayson es el mejor boxeador del Mundo.
58歲真的差很多，閃避，體力都減半了，泰森年輕的時候這種咖3回合內肯定倒下......也沒有第二人能跟巔峰時期的泰森硬鋼，這種咖只能耍耍抱抱拳


## Filtering
Filtering removes comments that contain hyperlinks, comments dominated by emoji (the intent is to drop comments that are more than 35% emoji), and empty comments, meaning comments without any alphanumeric character.

In [8]:
hyperlink_pattern = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

threshold_ratio = 0.65

relevant_comments = []

# Filter each scraped comment
for comment_text in comments:

    comment_text = comment_text.lower().strip()

    emojis = emoji.emoji_count(comment_text)

    # Count non-whitespace characters (this count includes the emoji themselves)
    text_characters = len(re.sub(r'\s', '', comment_text))

    if (any(char.isalnum() for char in comment_text)) and not hyperlink_pattern.search(comment_text):
        if emojis == 0 or (text_characters / (text_characters + emojis)) > threshold_ratio:
            relevant_comments.append(comment_text)
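
To make the emoji check concrete, the sketch below repeats the ratio test from the cell above on hand-counted inputs. `is_kept` is a hypothetical helper, and the counts assume each emoji is a single code point. Because the character count includes the emoji themselves, the test is more lenient than a strict "35% emoji" rule: a comment such as "lol" followed by three emoji has 6 non-whitespace characters and 3 emoji, giving 6 / 9 ≈ 0.67, which passes.

```python
THRESHOLD_RATIO = 0.65  # same value as threshold_ratio in the notebook


def is_kept(non_space_chars: int, emoji_count: int) -> bool:
    """Mirror the notebook's emoji condition on precomputed counts."""
    if emoji_count == 0:
        return True
    return non_space_chars / (non_space_chars + emoji_count) > THRESHOLD_RATIO


# "lol" + 3 emoji: 6 non-whitespace characters, 3 emoji -> 6 / 9
kept_mixed = is_kept(6, 3)

# "ok" + 10 emoji: 12 non-whitespace characters, 10 emoji -> 12 / 22
kept_emoji_heavy = is_kept(12, 10)
```

Under these assumptions the first comment is kept and the second is dropped.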

In [9]:
# Print the relevant comments
for relevant_comment in relevant_comments[:5]:
    print(relevant_comment)

why was mikes right hand wrap missing at the announcement?
ngl thats cap
mike le gano 100%, mike es un gran guerrero.
wooooooow mike tayson es el mejor boxeador del mundo.
58歲真的差很多，閃避，體力都減半了，泰森年輕的時候這種咖3回合內肯定倒下......也沒有第二人能跟巔峰時期的泰森硬鋼，這種咖只能耍耍抱抱拳


## Continuing the Filtering
The fifth comment still contains HTML elements (the API's `textDisplay` field returns HTML-formatted text), so further cleaning is needed.

In [10]:
# function that extracts only the text from HTML elements
def remove_html(comment):
    return BeautifulSoup(comment, "lxml").get_text()
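
As a rough illustration of what this text extraction does, here is a standard-library-only sketch built on `html.parser`. It is not the notebook's BeautifulSoup approach: `strip_html` and `_TextExtractor` are hypothetical names, and turning `<br>` into a newline is an assumption about the desired output, not something `get_text()` does by default.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes and drop tags; character references are decoded."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list = []

    def handle_starttag(self, tag, attrs):
        # Assumption: keep line breaks as newlines instead of dropping them
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text: str) -> str:
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


# Made-up sample input in the style of an HTML-formatted comment
sample_text = strip_html("mike is <b>old</b><br>no cap &amp; that&#39;s it")
```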

In [11]:
# build a list of cleaned comments
cleaned_comments = [remove_html(comment) for comment in relevant_comments]


In [12]:
# show the first 5 cleaned comments
for cleaned_comment in cleaned_comments[:5]:
    print(cleaned_comment)

why was mikes right hand wrap missing at the announcement?
ngl thats cap
mike le gano 100%, mike es un gran guerrero.
wooooooow mike tayson es el mejor boxeador del mundo.
58歲真的差很多，閃避，體力都減半了，泰森年輕的時候這種咖3回合內肯定倒下......也沒有第二人能跟巔峰時期的泰森硬鋼，這種咖只能耍耍抱抱拳


## Labelling Preparation


In [13]:
# import the required class
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## Labelling
The cleaned comments are now available, but the sentiment of each one is unknown. With about 1000 comments, labelling them manually would take a long time, so the notebook uses VADER's SentimentIntensityAnalyzer class instead. The polarity of each entry in cleaned_comments is computed to estimate the sentiment of the video (according to VADER). Polarity here is VADER's compound score: a value above 0.05 counts as positive, a value below -0.05 counts as negative, and anything in between counts as neutral.
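
The threshold rule can be written as a small standalone function. This is a sketch of the rule as stated above, with strict inequalities so that a score of exactly 0.05 or -0.05 is neutral. `label_from_polarity` is a hypothetical name; the notebook applies the same comparisons inline.

```python
def label_from_polarity(score: float, threshold: float = 0.05) -> str:
    """Map a polarity score to 'positive', 'negative', or 'neutral'."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


labels = [label_from_polarity(s) for s in (0.25, 0.0, -0.9786, 0.05)]
```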

In [None]:
# function that appends a comment's polarity (compound score) to the list and returns the list
def sentiment_scores(comment, polarity):

    # Creating a SentimentIntensityAnalyzer object.
    sentiment_object = SentimentIntensityAnalyzer()

    sentiment_dict = sentiment_object.polarity_scores(comment)
    polarity.append(sentiment_dict['compound'])

    return polarity

In [None]:
# create lists that store each comment's polarity and the positive, negative, and neutral comments
polarity = []
positive_comments = []
negative_comments = []
neutral_comments = []

In [None]:
# categorise each comment by its polarity
for index, items in enumerate(cleaned_comments):
    polarity = sentiment_scores(items, polarity)

    if polarity[-1] > 0.05:
        positive_comments.append(items)
    elif polarity[-1] < -0.05:
        negative_comments.append(items)
    else:
        neutral_comments.append(items)

In [22]:
print(polarity[:5])

[0.25, 0.0, 0.0, 0.0, 0.0]


## Measuring the Video's Sentiment
The sentiment of the Mike Tyson vs. Jake Paul boxing video can be measured from the average polarity of its comments.

In [23]:
avg_polarity = sum(polarity)/len(polarity)
print("Average Polarity:", avg_polarity)
if avg_polarity > 0.05:
    print("The Video has got a Positive response")
elif avg_polarity < -0.05:
    print("The Video has got a Negative response")
else:
    print("The Video has got a Neutral response")

print("The comment with most positive sentiment:", cleaned_comments[polarity.index(max(
    polarity))], "with score", max(polarity), "and length", len(cleaned_comments[polarity.index(max(polarity))]))
print("The comment with most negative sentiment:", cleaned_comments[polarity.index(min(
    polarity))], "with score", min(polarity), "and length", len(cleaned_comments[polarity.index(min(polarity))]))

Average Polarity: 0.02887009544008484
The Video has got a Neutral response
The comment with most positive sentiment: what a shamed,, how many years the age difference again???? lol lol lol lol please tag jake paul 🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣 with score 0.9968 and length 116
The comment with most negative sentiment: why paul fight mike tyson mike tyson is old 😢😢😢😢no one can understand 😢😢😢😢 with score -0.9786 and length 74
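
The average in the cell above divides by `len(polarity)`, which raises `ZeroDivisionError` if every comment was filtered out. A minimal sketch of a guarded version follows. `average_polarity` is a hypothetical helper, and the sample input is the five polarity values printed earlier in the notebook.

```python
from typing import Optional, Sequence


def average_polarity(polarities: Sequence[float]) -> Optional[float]:
    """Return the mean polarity, or None when there are no scores."""
    if not polarities:
        return None
    return sum(polarities) / len(polarities)


# First five polarity values shown earlier in the notebook
sample_average = average_polarity([0.25, 0.0, 0.0, 0.0, 0.0])
empty_average = average_polarity([])
```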


## Exporting the Dataset
To train and test a model, the comments need to be exported to a readable format. Each comment is exported together with its sentiment label: positive, negative, or neutral.

In [24]:
# import pandas for exporting to .csv
import pandas as pd

In [25]:
# Create a list of tuples with comments and their respective labels
data = [(comment, 'positive') for comment in positive_comments] + \
       [(comment, 'negative') for comment in negative_comments] + \
       [(comment, 'neutral') for comment in neutral_comments]

# Create a DataFrame
df = pd.DataFrame(data, columns=['Comment', 'Label'])

# Export to CSV
df.to_csv('comments.csv', index=False)
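
Comments often contain commas, quotation marks, or line breaks, which a CSV writer has to quote so that each comment stays in a single field. The sketch below demonstrates that round trip with the standard-library `csv` module instead of pandas, writing to an in-memory buffer and not to `comments.csv`. The sample rows are made up for illustration.

```python
import csv
import io

# Made-up rows with characters that need quoting in CSV
rows = [
    ("mike is old, no cap", "neutral"),
    ('he said "wow"\nsecond line', "positive"),
]

# Write the rows with the same column names as the notebook's DataFrame
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Comment", "Label"])
writer.writerows(rows)

# Read them back and check that nothing was split or lost
buffer.seek(0)
restored = [(row["Comment"], row["Label"]) for row in csv.DictReader(buffer)]
```

If `restored` equals `rows`, the quoting preserved every comment intact.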