# YouTube Comment Scraper
## Overview
This Python script scrapes comments from YouTube videos returned by searches for the following keywords:
- body
- weight
- diet

The study covers three languages: Korean, English, and Bahasa Indonesia.
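
As a minimal sketch of how the keywords above could be turned into search-page links, the snippet below URL-encodes each keyword into the `search_query` parameter of the YouTube results URL. The helper name `build_search_url` and the `KEYWORDS` list are illustrative assumptions, not part of the original script, and the Korean and Bahasa Indonesia keywords would have to be supplied separately.

```python
from urllib.parse import urlencode

# Keywords from the list above (English only; translations are not given here)
KEYWORDS = ["body", "weight", "diet"]


def build_search_url(keyword):
    """Return a YouTube search-results URL for one keyword (hypothetical helper)."""
    return "https://www.youtube.com/results?" + urlencode({"search_query": keyword})


search_urls = [build_search_url(k) for k in KEYWORDS]
for u in search_urls:
    print(u)
```

`urlencode` also handles spaces and non-ASCII characters, which matters for multi-word or non-English search terms.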

This data collection is part of a study of public sentiment about body image and dieting across different countries and cultures.

## Acknowledgements

YouTube Scraper boilerplate by [Akash Jain](https://francoisstamant.medium.com/) at [Medium](https://medium.com/analytics-vidhya/extracting-youtube-comments-using-selenium-b29ee4f743ef) and [François St-Amant](https://francoisstamant.medium.com/?source=post_page-----61ff197115d4--------------------------------) at [towardsdatascience](https://towardsdatascience.com/how-to-scrape-youtube-comments-with-python-61ff197115d4)

### Step 1: Import required libraries

In [11]:
# standard library
import csv
import os
import re
import time
from math import ceil

# third-party
import pandas as pd
from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# options for running Chrome headless (used only by the commented-out driver below)
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# wd = webdriver.Chrome('chromedriver',chrome_options=chrome_options)
# wd.get("https://www.website-url.com")

In [12]:
# Create a new .csv file to write data
path = "./data/youtube_comments_bodyimage.csv"
csv_file = open(path,'w', encoding="UTF-8", newline="")

In [13]:
writer = csv.writer(csv_file)

In [14]:
# write header names
writer.writerow(
    ['url', 'link_title', 'channel', 'no_of_views', 'time_uploaded', 'comment', 'author', 'comment_posted', 
     'no_of_replies','upvotes','downvotes'])

112
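
The cells above open the file and write the header by hand, and the file is never explicitly closed. A possible alternative, sketched here under the assumption that each row is built as a dictionary, uses a `with` block and `csv.DictWriter` so that every row is aligned to the header and missing fields are written as empty strings. The function name `write_rows`, the sample row and the temporary path are made up for illustration.

```python
import csv
import os
import tempfile

FIELDNAMES = ['url', 'link_title', 'channel', 'no_of_views', 'time_uploaded',
              'comment', 'author', 'comment_posted', 'no_of_replies',
              'upvotes', 'downvotes']


def write_rows(path, rows):
    """Write dict rows to a CSV file; the file is closed when the block exits."""
    with open(path, 'w', encoding='UTF-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES, restval='')
        writer.writeheader()
        writer.writerows(rows)


# Illustrative row only; real rows would come from the scraping loop
sample = [{'url': 'https://example.com/watch', 'comment': 'hello', 'author': 'someone'}]
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
write_rows(demo_path, sample)

# Read the file back to confirm the columns line up with the header
with open(demo_path, encoding='UTF-8', newline='') as f:
    rows_read = list(csv.DictReader(f))
```

Keying rows by column name avoids relying on the insertion order of the dictionary when the row is written.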

We first scrape the YouTube search results page for "body image".

In [15]:
link = "https://www.youtube.com/results?search_query=body+image"

# open chrome 
driver = webdriver.Chrome(executable_path='C:/WebDriver/bin/chromedriver.exe')
driver.get(link)
time.sleep(10)

print("=" * 40)  # Shows in terminal when youtube summary page with search keyword is being scraped
print("Scraping " + link)    
time.sleep(20)    

# collect the video links that appear on the search results page
video_list = driver.find_elements(By.XPATH, '//*[@id="video-title"]')

info = []
urls = []
titles = []
channels = []
no_of_views = []
upload_dates = []

# store URL and video title for videos
for video in video_list:
    urls.append(video.get_attribute('href'))
    titles.append(video.get_attribute('title'))
    print("scraped ", video.get_attribute('title'))

# create data frame to store as csv
df = pd.DataFrame(columns = ['url', 'title', 'channel', 'no_of_views', 'time_uploaded', 'comment', 'author', 'comment_posted', 
    'no_of_replies','upvotes','downvotes'])

Scraping https://www.youtube.com/results?search_query=body+image
scraped  Girls Ages 6-18 Talk About Body Image | Allure
scraped  The Science of Body Image
scraped  Lili Reinhart’s Revealing Speech About Body Image | Glamour WOTY 2018
scraped  TWRP - Body Image (Audio)
scraped  Self Esteem Tips: Dealing with Body Image Issues
scraped  9 Models on the Pressure to Lose Weight and Body Image | The Models | Vogue
scraped  “Body Image” by Madilyn Paige | Saints Channel Studio
scraped  What Happens When Strangers Get Real About Body Image
scraped  Stop hating your body; start living your life | Taryn Brumfitt | TEDxAdelaide
scraped  Jameela Jamil: Body image, the Kardashians and social media - BBC HARDtalk (2019)
scraped  Positive Body Image & Self Esteem | Advice to Improve Self Worth | YOU ARE WORTHY | Kathryn Morgan
scraped  Body Image + Journaling Tips | JENNGLEBELLS #3
scraped  Wellcast - Self Esteem Tips: Dealing with Body Image Issues
scraped  Media's Effects on Body Image
scraped  Ou
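
Some entries on a results page may have no usable link, in which case `get_attribute('href')` returns `None`, and the same video may appear more than once. The sketch below shows one way to keep each URL paired with its title while dropping such entries. `clean_results` is a hypothetical helper, and the sample pairs stand in for the values read from `video_list`.

```python
def clean_results(pairs):
    """Drop entries with no URL and repeated URLs, preserving order."""
    seen = set()
    cleaned = []
    for url, title in pairs:
        if not url or url in seen:
            continue
        seen.add(url)
        cleaned.append((url, title))
    return cleaned


# Placeholder data standing in for (href, title) read from each element
pairs = [
    ("https://www.youtube.com/watch?v=aaa", "First video"),
    (None, "Entry without a link"),
    ("https://www.youtube.com/watch?v=aaa", "First video"),
    ("https://www.youtube.com/watch?v=bbb", "Second video"),
]
cleaned = clean_results(pairs)
urls = [u for u, _ in cleaned]
titles = [t for _, t in cleaned]
```

Filtering the pairs together, rather than the two lists separately, keeps `urls` and `titles` the same length and in the same order.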

In [16]:
# loop through each video; counter is the position used to look up the matching title
for counter, url in enumerate(urls):
    driver.get(url)
    v_channel = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR,"div#upload-info yt-formatted-string"))).text
    print("channel name is ",v_channel)    
    v_views = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR,"div#count span.view-count"))).text
    print("no. of views is ",v_views)
    v_timeUploaded = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR,"div#date yt-formatted-string"))).text
    print("time uploaded is ",v_timeUploaded)

    # retrieve comments
    youtube_dict ={}


    print("+" * 40)  # Shows in terminal when details of a new video is being scraped
    print("Scraping child links ")
    #scroll down to load comments
    driver.execute_script('window.scrollTo(0,390);')

    # let comments load
    time.sleep(15)

    #sort by top comments
    print("sorting by top comments")
    sort= WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR,"div#icon-label")))
    sort.click()
    
    time.sleep(10)
    topcomments = driver.find_element(By.XPATH, '//*[@id="menu"]/a[1]/paper-item/paper-item-body/div[1]')
    topcomments.click()
    time.sleep(10)

    # 20 comments load at first; scroll to the bottom twice to load the next 40
    for i in range(0,2):
        driver.execute_script("window.scrollTo(0,Math.max(document.documentElement.scrollHeight,document.body.scrollHeight,document.documentElement.clientHeight))")
        print("scrolling to load more comments")
        time.sleep(10)

    # count the loaded comments and cap the number scraped per video at 50
    totalcomments = len(driver.find_elements(By.XPATH, '//*[@id="content-text"]'))

    index = min(totalcomments, 50)
    
    # loop through each comment and scrape info
    print("scraping through comments")
    ccount = 0
    while ccount < index:
        # each lookup falls back to an empty string if the element is missing
        try:
            comment = driver.find_elements(By.XPATH, '//*[@id="content-text"]')[ccount].text
        except Exception:
            comment = ""
        try:
            authors = driver.find_elements(By.XPATH, '//a[@id="author-text"]/span')[ccount].text
        except Exception:
            authors = ""
        try:
            comment_posted = driver.find_elements(By.XPATH, '//*[@id="published-time-text"]/a')[ccount].text
        except Exception:
            comment_posted = ""
        try:
            # note: this element may appear only under comments that have replies,
            # so position ccount may not belong to the current comment
            replies = driver.find_elements(By.XPATH, '//*[@id="more-text"]')[ccount].text
            if replies == "View reply":
                replies = 1
            else:
                replies = replies.replace("View ", "")
                replies = replies.replace(" replies", "")
        except Exception:
            replies = ""
        try:
            upvotes = driver.find_elements(By.XPATH, '//*[@id="vote-count-middle"]')[ccount].text
        except Exception:
            upvotes = ""

        youtube_dict['url'] = url
        youtube_dict['link_title'] = titles[counter]
        youtube_dict['channel'] = v_channel
        youtube_dict['no_of_views'] = v_views
        youtube_dict['time_uploaded'] = v_timeUploaded
        youtube_dict['comment'] = comment
        youtube_dict['author'] = authors
        youtube_dict['comment_posted'] = comment_posted
        youtube_dict['no_of_replies'] = replies
        youtube_dict['upvotes'] = upvotes
        youtube_dict['downvotes'] = ""  # not scraped; left empty so each row matches the 11-column header
        writer.writerow(youtube_dict.values())
        
        # increment comment counter
        ccount = ccount + 1

channel name is  Allure
no. of views is  5,014,874 views
time uploaded is  Jun 1, 2018
++++++++++++++++++++++++++++++++++++++++
Scraping child links 
SORT BY
FINISHED SCROLLING!
channel name is  AsapSCIENCE
no. of views is  710,990 views
time uploaded is  Dec 12, 2019
++++++++++++++++++++++++++++++++++++++++
Scraping child links 
SORT BY
FINISHED SCROLLING!
channel name is  Glamour
no. of views is  2,216,735 views
time uploaded is  Nov 12, 2018
++++++++++++++++++++++++++++++++++++++++
Scraping child links 
SORT BY
FINISHED SCROLLING!
channel name is  TWRPtube
no. of views is  757,462 views
time uploaded is  Jan 15, 2017
++++++++++++++++++++++++++++++++++++++++
Scraping child links 
SORT BY
FINISHED SCROLLING!
channel name is  watchwellcast
no. of views is  839,915 views
time uploaded is  Apr 19, 2013
++++++++++++++++++++++++++++++++++++++++
Scraping child links 
SORT BY
FINISHED SCROLLING!
channel name is  Vogue
no. of views is  3,374,603 views
time uploaded is  Apr 24, 2019
++++++++++

TimeoutException: Message: 
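
The run above ends with a `TimeoutException`, which `WebDriverWait(...).until(...)` raises when an element does not appear in time, and which stops the whole loop. One option, sketched here as an assumption rather than the original author's approach, is to wrap each lookup in a small helper that returns a default value on failure. `text_or_default` is a hypothetical name, and the demo uses Python's built-in `TimeoutError` so that it runs without a browser.

```python
def text_or_default(fetch, default="", exceptions=(Exception,)):
    """Return fetch(), or default if it raises one of the given exceptions."""
    try:
        return fetch()
    except exceptions:
        return default


def slow_lookup():
    # Stand-in for a wait that never finds its element
    raise TimeoutError("element not found in time")


views = text_or_default(slow_lookup, default="", exceptions=(TimeoutError,))
channel = text_or_default(lambda: "Allure")

# In the notebook this might look like (assuming Selenium's TimeoutException):
# from selenium.common.exceptions import TimeoutException
# v_views = text_or_default(
#     lambda: WebDriverWait(driver, 10).until(
#         EC.presence_of_element_located((By.CSS_SELECTOR, "div#count span.view-count"))
#     ).text,
#     exceptions=(TimeoutException,),
# )
```

With this pattern a video whose metadata fails to load produces empty fields in the CSV, and the loop moves on to the next URL.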
