# Tourists at Home

### Executive Summary

In a world still battling the spread of Covid-19, countries around the world have shut down their borders. In the wake of a devastating loss in tourism dollars, government bodies and organisations announced the launch of the [SingapoRediscovers](https://www.stb.gov.sg/content/stb/en/media-centre/media-releases/Enterprise-Singapore-Sentosa-Development-Corporation-and-Singapore-Tourism-Board-team-up-with-industry-to-encourage-locals-to-rediscover-Singapore.html.html) campaign on 22nd July 2020 and set aside 45 million dollars to boost domestic tourism. However, a new initiative in the form of SingapoRediscovers vouchers was announced less than a month later. This new initiative comes with a budget of 320 million dollars and aim to distribute [\$100 tourism vouchers](https://www.stb.gov.sg/content/stb/en/media-centre/media-releases/SingapoRediscovers-and-Expanded-Attractions-Guidelines.html) to all Singaporeans aged 18 and above valid for seven months from December 2020 to end-June 2021.

With so much stimulus dollars planned to revive our hardest-hit industry, just what exactly is domestic tourism? If you give it further thought, it is a concept that is at first glance paradoxical - can you still be a tourist back home? But, if you were to ask around, it seems most everyone has an idea on what domestic tourism is. As soon as word about the $100 tourism vouchers began circulating, everyone began planning their next staycation or a trip to the USS. Is that all there is to domestic tourism, or is there more? Can we actually grow and sustain domestic tourism in Singapore simply by incentivising local Singaporeans to participate in activities that are otherwise usually targetted at foreign visitors? How can we best encourage Singaporeans to go out and explore this home of theirs, and to be tourists at home?

Representing STB, this project aims to uncover the Singaporeans' perception of domestic tourism, and to identify key areas of high potential where we can focus our efforts for the next phase of the #SingapoRediscovers campaign. To do this, the project hypothesises that:

> the number of likes attracted by a post/video/comment is indicative of the support and popularity behind the tourism idea.

By training our model to predict on the popularity of a video/post, the model learns to rate the attractiveness/reception of a certain idea proposition. The initial direction of the project was therefore to predict on the number of likes and solve a regression problem. For reasons we will elucidate later, this was subsequently adapted into a classification problem where we predict on the popularity of the content. Nonetheless, the number of likes remains a useful data point to scrape as we would be engineering our labels for popularity from each data point's number of likes.

After training, we can then infer from the trained model and identify opportunities of high potential for the next phase of the #SingapoRediscovers campaign. Through this, my project hopes to gather enough insight to propose recommendations for the next phase of the campaign. To begin building a model up to the task, we have to first gather data. In order to access and interpret the on-the-ground sentiment and perception of domestic tourism, this project aims to scrape strategically both Instagram and YouTube for data. This notebook compiles the script for the scraping endeavours.

#### Content

- [Scrape Methodology](#Scrape-Methodology)
- [Instagram - Scraping](#Instagram---Scraping)
- [YouTube - Scraping](#YouTube---Scraping)

#### Python Libraries

In [1]:
import pandas as pd
import numpy as np
import time
from datetime import datetime
import gc

import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import json
import re

### Scrape Methodology

In order to provide actionable insights for STB to focus on for the next phase of their campaign, and to provide a vision of tourism in the new normal, this project hypothesises that the number of likes attracted by a post/video/comment is indicative of the support and enthusiasm behind the tourism idea. The scraping process begins with a few key assumptions that underlie my hypotheses. They are:

- Instagram posts can reveal the **conscious local perception** of domestic tourism made explicit by through hashtagging.
- YouTube videos and comments make up a dialogue that can reveal the **unconscious local perception** of domestic tourism made implicit by local recommendations on exploring Singapore in reaction to the video.
- Locals would follow through on their own recommendations for others.

There are also assumptions that belie my sampling methodology. I will split the assumptions up according to platforms.

**For Instagram:**
- Posts that make use of the target hashtags are perceived by the owner account to match the idea behind the hashtag and are useful to understanding local perception of domestic tourism.
- Posts attract likes based on viewers' support of and agreement with the content matching their hashtags.
- Comments on posts are mostly supportive in nature and not useful to furthering our understanding of the ideas behind the hashtags.

**For YouTube:**
- Transient vloggers on YouTube with a travel video on Singapore attract mostly overseas commenters, whom while generally supportive and enthusiastic about Singapore, are unable to offer further insights into exploring Singapore.
- While local YouTubers attract local followers/commenters, their commenters refrain from providing recommendations as their local status means that they are already perceived as subject matter experts.
- Foreign YouTubers based in Singapore achieves the fine balance of attracting local followers/commenters, while providing a conducive comment space for dialogue between video content on exploring Singapore and local recommendations on where else to check out.

### Instagram - Scraping

First, let us define a function that will scrape the URLs of all Instagram posts from a single URL query to Instagram according to our target hashtags.

In [2]:
def get_insta_posts(url):

    # launch driver
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(3)

    # create empty set to add urls to
    link_posts = set() # we use sets since we expect duplicate scraping as not all posts would have disappeared

    # scrape posts url with automated scrolling using selenium to capture all elements before they disappear

    # first scrape of post urls
    tags = driver.find_elements_by_tag_name("a")
    for tag in tags:
        link = tag.get_attribute("href")
        if "/p/" in link:
            link_posts.add(link)

    # first scroll
    lenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    match = False
    time.sleep(3)

    # recursive scroll
    while (match == False):
        # scrape post urls in between scrolls
        tags = driver.find_elements_by_tag_name("a")
        for tag in tags:
            link = tag.get_attribute("href") # query returns all hrefs
            if "/p/" in link: # search only for hrefs that bring you to a insta post
                link_posts.add(link) # add links with every scroll as page is dynamic and earlier links will be lost
        # execute scroll
        lastCount = lenOfPage
        lenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
        time.sleep(3)
        # until last post
        if lastCount == lenOfPage:
            match = True

    return link_posts # returns scraped list of direct urls to each insta post

Next, scrape all meta-data of each Instagram post by visiting each URL scraped from above function and compile these meta-data into a dictionary.

In [4]:
def make_insta_dicts(list_urls):
    
    list_dict = [] # create empty list to append dicts of info
    
    for i, url in enumerate(list_urls):
        r = requests.get(url)
        
        if r.status_code == 200:
            
            print(f"Client response {i} received")
            # parse response as html
            html = BeautifulSoup(r.text, "lxml")
            # find body of post and convert to string
            script = html.find("script", text=lambda t: t.startswith("window._sharedData")).string
            # parse script as json obj
            post_json = json.loads(script.split("window._sharedData = ")[-1].rstrip(";"))
            # find where target info is stored
            core_json = post_json["entry_data"]["PostPage"][0]["graphql"]["shortcode_media"]
            
            # try-except statement to extract target info since not all keys are present in each post's json
            try:
                post_id = core_json["id"]
            except:
                post_id = None
            try:
                post_slug =  core_json["shortcode"]
            except:
                post_slug = None
            try:
                unix_time = core_json["taken_at_timestamp"]
            except:
                unix_time = None
            try:
                date_time = datetime.utcfromtimestamp(core_json["taken_at_timestamp"]).strftime('%Y-%m-%d %H:%M:%S')
            except:
                date_time = None
            try:
                post_caption = core_json["edge_media_to_caption"]["edges"][0]["node"]["text"]
            except:
                post_caption = None
            try:
                hashtags = re.findall("\#\w+", core_json["edge_media_to_caption"]["edges"][0]["node"]["text"])
            except:
                hashtags = None
            try:
                topic_tags = [topic.strip() for topic in core_json["accessibility_caption"].split(":")[-1].replace("and", ",").split(",")]
            except:
                topic_tags = None
            try:
                is_video = core_json["is_video"]
            except:
                is_video = None
            try:
                is_ad = core_json["is_ad"]
            except:
                is_ad = None
            try:
                post_likes = core_json["edge_media_preview_like"]["count"]
            except:
                post_slug = None
            try:
                geo_tag = core_json["location"]["name"]
            except:
                geo_tag = None
            try:
                geo_slug = core_json["location"]["slug"]
            except:
                geo_slug = None
            try:
                owner_id = core_json["owner"]["id"]
            except:
                owner_id = None
            try:
                owner_verified = core_json["owner"]["is_verified"]
            except:
                owner_verified = None
            try:
                owner_privacy = core_json["owner"]["is_private"]
            except:
                owner_privacy = None
            try:
                owner_unpublished = core_json["owner"]["is_unpublished"]
            except:
                post_slug = None
            try:
                owner_total_posts = core_json["owner"]["edge_owner_to_timeline_media"]["count"]
            except:
                owner_total_posts = None
            try:
                owner_total_followers = core_json["owner"]["edge_followed_by"]["count"]
            except:
                owner_total_followers = None
                        
            # compile target info into dict format
            targets = ['post_id', 'post_slug', 'unix_time', 'date_time', 'post_caption', 'hashtags', 'topic_tags',
                       'is_video', 'is_ad', 'post_likes', 'geo_tag', 'geo_slug', 'owner_id', 'owner_verified',
                       'owner_privacy', 'owner_unpublished', 'owner_total_posts', 'owner_total_followers']
            dict_info = {}
            for variable in targets:
                dict_info[variable] = eval(variable)
            
            # append dict to list
            list_dict.append(dict_info)
            
        else:
            print(f"No response received for URL index {i}!") # in the event of broken links
            pass
            
        time.sleep(3) # sleep 3s between each request
        
    return list_dict # return appended list of dicts

#### #SingapoRediscovers

In [6]:
# print datetime of scrape
print(f"Scrape performed on {datetime.now().date()} at {datetime.now().time()}.")

Scrape performed on 2020-10-21 at 08:22:52.829209.


In [5]:
%%time
url = "https://www.instagram.com/explore/tags/singaporediscovers/?hl=en/"
link_posts = get_insta_posts(url)
len(link_posts)

Wall time: 22min 31s


4224

In [6]:
%%time

# set starting indexes
i = 0
u = 50
# create empty list
list_dict_compiled = []

# scrape insta posts in batches of 50 at a time
for batch in range(round(len(link_posts)/50)):
    
    if u > len(link_posts):
        u = -1 # get last indexed url
    else:
        u = u # continue with u value
    
    subset_links = list(link_posts)[i:u]
    list_dict = make_insta_dicts(subset_links)
    list_dict_compiled.extend(list_dict)
    print(f"Scraped batch {batch+1} of 50 or part thereof.")
    i += 50
    u += 50

print("Instagram posts meta-data scraped:", len(list_dict_compiled))

Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Cl

Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 6 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 

Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 12 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9

Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 r

Scraped batch 23 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34

Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 29 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21

Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 35 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 

Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 r

Client response 48 received
Client response 49 received
Scraped batch 46 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32

Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 52 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19

Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 58 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client respons

Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 re

Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 69 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31

Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 75 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18

Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 81 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client respon

In [7]:
df_insta_singaporediscovers = pd.DataFrame(list_dict_compiled)
df_insta_singaporediscovers.head(3)

Unnamed: 0,post_id,post_slug,unix_time,date_time,post_caption,hashtags,topic_tags,is_video,is_ad,post_likes,geo_tag,geo_slug,owner_id,owner_verified,owner_privacy,owner_unpublished,owner_total_posts,owner_total_followers
0,2424782171636408654,CGmjom-FqFO,1603276581,2020-10-21 10:36:21,#BalloInSingapore Day 7\nThere are few hours i...,"[#BalloInSingapore, #blahblahballo, #SingapoRe...",,False,False,41,"Shangri-La Hotel, Singapore",shangri-la-hotel-singapore,8653740580,False,False,False,270,451
1,2414614396863935902,CGCbwAynnWe,1602064488,2020-10-07 09:54:48,📍 𝓕𝓵𝓸𝔀𝓮𝓻 𝓓𝓸𝓶𝓮\n\nEnter the world of flowers😍 \...,"[#visitsingapore, #singaporediscovers, #visits...",,False,False,97,Flower Dome,flower-dome,205966311,False,False,False,190,923
2,2417355232257243375,CGMK8aqn_jv,1602391221,2020-10-11 04:40:21,小确幸,[],,False,False,34,ARTpreciation Arts Café,artpreciation-arts-cafe,794460094,False,False,False,2290,1121


In [8]:
df_insta_singaporediscovers.shape

(4199, 18)

In [9]:
filename = f"df_insta_singaporediscovers_{datetime.now().date()}"
df_insta_singaporediscovers.to_csv(f"../datasets/{filename}.csv", index=False)

In [10]:
gc.collect()

21681

Note that since this is our main target hashtag, this hashtag was scraped over a few days because certain browser time-outs or even WiFi disconnection can result in the code breaking. As a result, the scrapes are saved out for every batch of 50, and where the code fails to run to completion, the script would be ran again and merged finally in our Pre-processing notebook 2.0. We should definitely check for duplicates when merging and cleaning our dataset.

#### #rediscoversg

In [100]:
# print datetime of scrape
print(f"Scrape performed on {datetime.now().date()} at {datetime.now().time()}.")

Scrape performed on 2020-10-18 at 11:30:38.975570.


In [101]:
%%time
url = "https://www.instagram.com/explore/tags/rediscoversg/?hl=en"
link_posts = get_insta_posts(url)
len(link_posts)

Wall time: 16min 15s


2352

In [110]:
%%time

# set starting indexes
i = 0
u = 50
# create empty list
list_dict_compiled = []

# scrape insta posts in batches of 50 at a time
for batch in range(round(len(link_posts)/50)):
    
    if u > len(link_posts):
        u = -1 # get last indexed url
    else:
        u = u # continue with u value
    
    subset_links = list(link_posts)[i:u]
    list_dict = make_insta_dicts(subset_links)
    list_dict_compiled.extend(list_dict)
    print(f"Scraped batch {batch+1} of 50 or part thereof.")
    i += 50
    u += 50

print("Instagram posts meta-data scraped:", len(list_dict_compiled))

Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Cl

Client response 287 received
Client response 288 received
Client response 289 received
Client response 290 received
Client response 291 received
Client response 292 received
Client response 293 received
Client response 294 received
Client response 295 received
Client response 296 received
Client response 297 received
Client response 298 received
Client response 299 received
Client response 300 received
Client response 301 received
Client response 302 received
Client response 303 received
Client response 304 received
Client response 305 received
Client response 306 received
Client response 307 received
Client response 308 received
Client response 309 received
Client response 310 received
Client response 311 received
Client response 312 received
Client response 313 received
Client response 314 received
Client response 315 received
Client response 316 received
Client response 317 received
Client response 318 received
Client response 319 received
Client response 320 received
Client respons

Client response 570 received
Client response 571 received
Client response 572 received
Client response 573 received
Client response 574 received
Client response 575 received
Client response 576 received
Client response 577 received
Client response 578 received
Client response 579 received
Client response 580 received
Client response 581 received
Client response 582 received
Client response 583 received
Client response 584 received
Client response 585 received
Client response 586 received
Client response 587 received
Client response 588 received
Client response 589 received
Client response 590 received
Client response 591 received
Client response 592 received
Client response 593 received
Client response 594 received
Client response 595 received
Client response 596 received
Client response 597 received
Client response 598 received
Client response 599 received
Client response 600 received
Client response 601 received
Client response 602 received
Client response 603 received
Client respons

Client response 853 received
Client response 854 received
Client response 855 received
Client response 856 received
Client response 857 received
Client response 858 received
Client response 859 received
Client response 860 received
Client response 861 received
Client response 862 received
Client response 863 received
Client response 864 received
Client response 865 received
Client response 866 received
Client response 867 received
Client response 868 received
Client response 869 received
Client response 870 received
Client response 871 received
Client response 872 received
Client response 873 received
Client response 874 received
Client response 875 received
Client response 876 received
Client response 877 received
Client response 878 received
Client response 879 received
Client response 880 received
Client response 881 received
Client response 882 received
Client response 883 received
Client response 884 received
Client response 885 received
Client response 886 received
Client respons

Client response 1131 received
Client response 1132 received
Client response 1133 received
Client response 1134 received
Client response 1135 received
Client response 1136 received
Client response 1137 received
Client response 1138 received
Client response 1139 received
Client response 1140 received
Client response 1141 received
Client response 1142 received
Client response 1143 received
Client response 1144 received
Client response 1145 received
Client response 1146 received
Client response 1147 received
Client response 1148 received
Client response 1149 received
Client response 1150 received
Client response 1151 received
Client response 1152 received
Client response 1153 received
Client response 1154 received
Client response 1155 received
Client response 1156 received
Client response 1157 received
Client response 1158 received
Client response 1159 received
Client response 1160 received
Client response 1161 received
Client response 1162 received
Client response 1163 received
Client res

Client response 1405 received
Client response 1406 received
Client response 1407 received
Client response 1408 received
Client response 1409 received
Client response 1410 received
Client response 1411 received
Client response 1412 received
Client response 1413 received
Client response 1414 received
Client response 1415 received
Client response 1416 received
Client response 1417 received
Client response 1418 received
Client response 1419 received
Client response 1420 received
Client response 1421 received
Client response 1422 received
Client response 1423 received
Client response 1424 received
Client response 1425 received
Client response 1426 received
Client response 1427 received
Client response 1428 received
Client response 1429 received
Client response 1430 received
Client response 1431 received
Client response 1432 received
Client response 1433 received
Client response 1434 received
Client response 1435 received
Client response 1436 received
Client response 1437 received
Client res

Client response 1679 received
Client response 1680 received
Client response 1681 received
Client response 1682 received
Client response 1683 received
Client response 1684 received
Client response 1685 received
Client response 1686 received
Client response 1687 received
Client response 1688 received
Client response 1689 received
Client response 1690 received
Client response 1691 received
Client response 1692 received
Client response 1693 received
Client response 1694 received
Client response 1695 received
Client response 1696 received
Client response 1697 received
Client response 1698 received
Client response 1699 received
Client response 1700 received
Client response 1701 received
Client response 1702 received
Client response 1703 received
Client response 1704 received
Client response 1705 received
Client response 1706 received
Client response 1707 received
Client response 1708 received
Client response 1709 received
Client response 1710 received
Client response 1711 received
Client res

Client response 1953 received
Client response 1954 received
Client response 1955 received
Client response 1956 received
Client response 1957 received
Client response 1958 received
Client response 1959 received
Client response 1960 received
Client response 1961 received
Client response 1962 received
Client response 1963 received
Client response 1964 received
Client response 1965 received
Client response 1966 received
Client response 1967 received
Client response 1968 received
Client response 1969 received
Client response 1970 received
Client response 1971 received
Client response 1972 received
Client response 1973 received
Client response 1974 received
Client response 1975 received
Client response 1976 received
Client response 1977 received
Client response 1978 received
Client response 1979 received
Client response 1980 received
Client response 1981 received
Client response 1982 received
Client response 1983 received
Client response 1984 received
Client response 1985 received
Client res

Client response 2227 received
Client response 2228 received
Client response 2229 received
Client response 2230 received
Client response 2231 received
Client response 2232 received
Client response 2233 received
Client response 2234 received
Client response 2235 received
Client response 2236 received
Client response 2237 received
Client response 2238 received
Client response 2239 received
Client response 2240 received
Client response 2241 received
Client response 2242 received
Client response 2243 received
Client response 2244 received
Client response 2245 received
Client response 2246 received
Client response 2247 received
Client response 2248 received
Client response 2249 received
Client response 2250 received
Client response 2251 received
Client response 2252 received
Client response 2253 received
Client response 2254 received
Client response 2255 received
Client response 2256 received
Client response 2257 received
Client response 2258 received
Client response 2259 received
Client res

In [112]:
df_insta_rediscoversg = pd.DataFrame(list_dict_compiled)
df_insta_rediscoversg.head(3)

Unnamed: 0,post_id,post_slug,unix_time,date_time,post_caption,hashtags,topic_tags,is_video,is_ad,post_likes,geo_tag,geo_slug,owner_id,owner_verified,owner_privacy,owner_unpublished,owner_total_posts,owner_total_followers
0,1006641971264095607,34T4oZMgV3,1434221095,2015-06-13 18:44:55,Oh look! Natural heart shaped form of Ivan Hen...,"[#PinkDotSg, #WhereLovesLiveSg, #Rediscovering...",[2 people.],False,False,40,Pinkdot @ Hong Lim Park,pinkdot-hong-lim-park,196778200,False,False,False,5205,1232
1,1202880298563729182,BCxfUInNPce,1457614527,2016-03-10 12:55:27,.\n.\n.\n.\n.\n.\n#exploresingapore #instasg #...,"[#exploresingapore, #instasg, #gf_singapore, #...",[indoor.],False,False,33,Yangtze Cinema,yangtze-cinema,33123388,False,False,False,742,995
2,677102977071319026,lljZH4sgfy,1394936986,2014-03-16 02:29:46,On the road to hell..... Reliving Haw Par Vill...,"[#rediscoversg, #Singapore]",[Photo by Belinda Tan in 虎豹别墅.],False,False,3,虎豹别墅,,196778200,False,False,False,5205,1232


In [118]:
df_insta_rediscoversg.shape

(2352, 18)

In [120]:
filename = f"df_insta_rediscoversg_{datetime.now().date()}"
df_insta_rediscoversg.to_csv(f"../datasets/{filename}.csv", index=False)

In [117]:
gc.collect()

34626

#### #rediscoversingapore

In [7]:
# print datetime of scrape
print(f"Scrape performed on {datetime.now().date()} at {datetime.now().time()}.")

Scrape performed on 2020-10-19 at 09:29:47.345219.


In [8]:
%%time
url = "https://www.instagram.com/explore/tags/rediscoversingapore/?hl=en"
link_posts = get_insta_posts(url)
len(link_posts)

Wall time: 6min 40s


1227

In [None]:
# scrape in batches of 50 to prevent runtime errors from breaking code and losing all data scraped thus far
# assign to variable every batch of 50 insta posts scraped
# troubleshoot error by running code again and adding to list by picking up scrape from point of break

In [22]:
%%time

# set starting indexes
i = 0
u = 50
# create empty list
list_dict_compiled = []

# scrape insta posts in batches of 50 at a time
for batch in range(round(len(link_posts)/50)):
    
    if u > len(link_posts):
        u = -1 # get last indexed url
    else:
        u = u # continue with u value
    
    subset_links = list(link_posts)[i:u]
    list_dict = make_insta_dicts(subset_links)
    list_dict_compiled.extend(list_dict)
    print(f"Scraped batch {batch+1} of 50 or part thereof.")
    i += 50
    u += 50

print("Instagram posts meta-data scraped:", len(list_dict_compiled))

Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Cl

Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 6 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 

Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 12 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9

Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 r

Scraped batch 23 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34

In [25]:
df_insta_rediscoversingapore = pd.DataFrame(list_dict_compiled)
df_insta_rediscoversingapore.head(3)

Unnamed: 0,post_id,post_slug,unix_time,date_time,post_caption,hashtags,topic_tags,is_video,is_ad,post_likes,geo_tag,geo_slug,owner_id,owner_verified,owner_privacy,owner_unpublished,owner_total_posts,owner_total_followers
0,2378162573154824901,CEA7kKfBbLF,1597719092,2020-08-18 02:51:32,Exploring Tanjong Pagar last weekend. I used t...,[],"[plant, tree, sky, outdoor.]",False,False,18,Tanjong Pagar,tanjong-pagar,8266262181,False,False,False,121,696
1,2404625181941486534,CFe8dzblNPG,1600873681,2020-09-23 15:08:01,Did you know that there are a whopping 90 skys...,"[#RediscoverSingapore, #singapoliday, #journey...","[sky, outdoor.]",False,False,128,Apple Marina Bay Sands,apple-marina-bay-sands,11526377188,False,False,False,206,1926
2,2395921642007553165,CFABgoNh-SN,1599836138,2020-09-11 14:55:38,Sementara masih ada sekatan untuk keluar Singa...,"[#rediscoversingapore, #covid19, #paybytrash]",,False,False,125,Jurong Lake Park,jurong-lake-park,836854,False,False,False,1081,1631


In [26]:
df_insta_rediscoversingapore.shape

(1226, 18)

In [27]:
filename = f"df_insta_rediscoversingapore_{datetime.now().date()}"
df_insta_rediscoversingapore.to_csv(f"../datasets/{filename}.csv", index=False)

In [28]:
gc.collect()

8788

#### #Singapoliday

In [29]:
# print datetime of scrape
print(f"Scrape performed on {datetime.now().date()} at {datetime.now().time()}.")

Scrape performed on 2020-10-19 at 13:41:30.270188.


In [30]:
%%time
url = "https://www.instagram.com/explore/tags/singapoliday/?hl=en"
link_posts = get_insta_posts(url)
len(link_posts)

Wall time: 7min 14s


1446

In [31]:
%%time

# set starting indexes
i = 0
u = 50
# create empty list
list_dict_compiled = []

# scrape insta posts in batches of 50 at a time
for batch in range(round(len(link_posts)/50)):
    
    if u > len(link_posts):
        u = -1 # get last indexed url
    else:
        u = u # continue with u value
    
    subset_links = list(link_posts)[i:u]
    list_dict = make_insta_dicts(subset_links)
    list_dict_compiled.extend(list_dict)
    print(f"Scraped batch {batch+1} of 50 or part thereof.")
    i += 50
    u += 50

print("Instagram posts meta-data scraped:", len(list_dict_compiled))

Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Cl

Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 6 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 

Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 12 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9

Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 r

Scraped batch 23 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34

Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Scraped batch 29 of 50 or part thereof.
Instagram posts meta-data scraped: 1445
Wall time: 1h 29min 6s


In [32]:
df_insta_singapoliday = pd.DataFrame(list_dict_compiled)
df_insta_singapoliday.head(3)

Unnamed: 0,post_id,post_slug,unix_time,date_time,post_caption,hashtags,topic_tags,is_video,is_ad,post_likes,geo_tag,geo_slug,owner_id,owner_verified,owner_privacy,owner_unpublished,owner_total_posts,owner_total_followers
0,2387779086771329728,CEjGG2tHJrA,1598865470,2020-08-31 09:17:50,Rumours had it that they serve pretty decent I...,"[#singapoliday, #sentosa, #singapore]",[food.],False,False,21,Rumours Beach Club,rumours-beach-club,499548209,False,False,False,679,530
1,2406402775632998290,CFlQpMkHm-S,1601085586,2020-09-26 01:59:46,🔙 to brunch on my off day with 👸🏻💖\n\n@focr.sg...,"[#fiveoarscoffeeroasters, #focrestaurant, #foc...",,False,False,41,Five Oars Coffee Roasters,five-oars-coffee-roasters,38351245054,False,False,False,38,211
2,2404343645292379048,CFd8c5_nruo,1600840119,2020-09-23 05:48:39,#Repost @channelnewsasia\n• • • • • •\nYou've ...,"[#Repost, #Singapore, #Sentosa, #beach, #Tanjo...","[3 people, outdoor, text that says 'channelnew...",False,False,9,Singapore,singapore,55153,False,False,False,3351,911


In [33]:
df_insta_singapoliday.shape

(1445, 18)

In [34]:
filename = f"df_insta_singapoliday_{datetime.now().date()}"
df_insta_singapoliday.to_csv(f"../datasets/{filename}.csv", index=False)

In [35]:
gc.collect()

15785

#### #singaporeliday

In [36]:
# print datetime of scrape
print(f"Scrape performed on {datetime.now().date()} at {datetime.now().time()}.")

Scrape performed on 2020-10-19 at 15:57:42.766446.


In [37]:
%%time
url = "https://www.instagram.com/explore/tags/singaporeliday/?hl=en"
link_posts = get_insta_posts(url)
len(link_posts)

Wall time: 2min 14s


420

In [38]:
%%time

# set starting indexes
i = 0
u = 50
# create empty list
list_dict_compiled = []

# scrape insta posts in batches of 50 at a time
for batch in range(round(len(link_posts)/50)):
    
    if u > len(link_posts):
        u = -1 # get last indexed url
    else:
        u = u # continue with u value
    
    subset_links = list(link_posts)[i:u]
    list_dict = make_insta_dicts(subset_links)
    list_dict_compiled.extend(list_dict)
    print(f"Scraped batch {batch+1} of 50 or part thereof.")
    i += 50
    u += 50

print("Instagram posts meta-data scraped:", len(list_dict_compiled))

Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Cl

Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 6 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 

In [39]:
df_insta_singaporeliday = pd.DataFrame(list_dict_compiled)
df_insta_singaporeliday.head(3)

Unnamed: 0,post_id,post_slug,unix_time,date_time,post_caption,hashtags,topic_tags,is_video,is_ad,post_likes,geo_tag,geo_slug,owner_id,owner_verified,owner_privacy,owner_unpublished,owner_total_posts,owner_total_followers
0,291068302569325905,QKFRQ8BUVR,1348918067,2012-09-29 11:27:47,My super cute niece #singaporeliday,[#singaporeliday],[Photo by Dawn Tan - Little Art Yurt on Septem...,False,False,13,,,7700336,False,False,False,9606,12425
1,364408538208224969,UOo6KBBUbJ,1357660904,2013-01-08 16:01:44,"Carpenter & Cook, Bukit Timah. With @sarahslof...",[#singaporeliday],[Photo by Dawn Tan - Little Art Yurt on Januar...,False,False,51,,,7700336,False,False,False,9606,12425
2,345306628637410542,TKxoyahUTu,1355383779,2012-12-13 07:29:39,If you are ever in need of simple old school s...,[#singaporeliday],[Photo by Dawn Tan - Little Art Yurt on Decemb...,False,False,40,,,7700336,False,False,False,9606,12425


In [40]:
df_insta_singaporeliday.shape

(400, 18)

In [41]:
filename = f"df_insta_singaporeliday_{datetime.now().date()}"
df_insta_singaporeliday.to_csv(f"../datasets/{filename}.csv", index=False)

In [42]:
gc.collect()

14144

#### #madaboutsingapore2020

In [43]:
# print datetime of scrape
print(f"Scrape performed on {datetime.now().date()} at {datetime.now().time()}.")

Scrape performed on 2020-10-19 at 16:31:21.274496.


In [44]:
%%time
url = "https://www.instagram.com/explore/tags/madaboutsingapore2020/?hl=en"
link_posts = get_insta_posts(url)
len(link_posts)

Wall time: 22min 18s


4608

In [49]:
%%time

# set starting indexes
i = 0
u = 50
# create empty list
list_dict_compiled = []

# scrape insta posts in batches of 50 at a time
for batch in range(round(len(link_posts)/50)):
    
    if u > len(link_posts):
        u = -1 # get last indexed url
    else:
        u = u # continue with u value
    
    subset_links = list(link_posts)[i:u]
    list_dict = make_insta_dicts(subset_links)
    list_dict_compiled.extend(list_dict)
    print(f"Scraped batch {batch+1} of 50 or part thereof.")
    i += 50
    u += 50

print("Instagram posts meta-data scraped:", len(list_dict_compiled))

Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Cl

Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 6 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 

Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 12 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9

Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 r

Scraped batch 23 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34

Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 29 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21

Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 35 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 

Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 r

Client response 49 received
Scraped batch 46 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33

Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 52 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20

Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 58 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response

Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 r

Client response 48 received
Client response 49 received
Scraped batch 69 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32

Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 75 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19

Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 81 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client respons

Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 re

Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 92 of 50 or part thereof.
Instagram posts meta-data scraped: 4600
Wall time: 4h 58min 7s


In [50]:
df_insta_madaboutsingapore2020 = pd.DataFrame(list_dict_compiled)
df_insta_madaboutsingapore2020.head(3)

Unnamed: 0,post_id,post_slug,unix_time,date_time,post_caption,hashtags,topic_tags,is_video,is_ad,post_likes,geo_tag,geo_slug,owner_id,owner_verified,owner_privacy,owner_unpublished,owner_total_posts,owner_total_followers
0,2122499273233532024,B10odfxHAR4,1567241652,2019-08-31 08:54:12,Oh hello there....! When the buildings came al...,[#sgnightfest],[outdoor.],False,False,11,National Museum of Singapore,national-museum-of-singapore,28797269,False,False,False,1533,1357
1,2237005199830357265,B8LcGCIlxUR,1580891822,2020-02-05 08:37:02,The world #debut of the largest and longest #f...,"[#debut, #flying, #dragon, #Chingay, #2020, #p...",,False,False,6,,,398462892,False,False,False,10994,1032
2,2415184177868770815,CGEdTatAw3_,1602132411,2020-10-08 04:46:51,The #Hairpin\n.\n.\n.\n#meandering #horseshoeb...,"[#Hairpin, #meandering, #horseshoebend, #horse...","[grass, plant, tree, outdoor, nature.]",False,False,184,Jurong Lake Pool,jurong-lake-pool,30679523,False,False,False,560,43906


In [51]:
df_insta_madaboutsingapore2020.shape

(4600, 18)

In [52]:
filename = f"df_insta_madaboutsingapore2020_{datetime.now().date()}"
df_insta_madaboutsingapore2020.to_csv(f"../datasets/{filename}.csv", index=False)

In [53]:
gc.collect()

4022

#### #madaboutsingapore2020c

In [5]:
# print datetime of scrape
print(f"Scrape performed on {datetime.now().date()} at {datetime.now().time()}.")

Scrape performed on 2020-10-20 at 09:08:03.338832.


In [6]:
%%time
url = "https://www.instagram.com/explore/tags/madaboutsingapore2020c/?hl=en"
link_posts = get_insta_posts(url)
len(link_posts)

Wall time: 6min 49s


1269

In [7]:
%%time

# set starting indexes
i = 0
u = 50
# create empty list
list_dict_compiled = []

# scrape insta posts in batches of 50 at a time
for batch in range(round(len(link_posts)/50)):
    
    if u > len(link_posts):
        u = -1 # get last indexed url
    else:
        u = u # continue with u value
    
    subset_links = list(link_posts)[i:u]
    list_dict = make_insta_dicts(subset_links)
    list_dict_compiled.extend(list_dict)
    print(f"Scraped batch {batch+1} of 50 or part thereof.")
    i += 50
    u += 50

print("Instagram posts meta-data scraped:", len(list_dict_compiled))

Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Cl

Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 6 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 

Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 received
Client response 48 received
Client response 49 received
Scraped batch 12 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9

Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34 received
Client response 35 received
Client response 36 received
Client response 37 received
Client response 38 received
Client response 39 received
Client response 40 received
Client response 41 received
Client response 42 received
Client response 43 received
Client response 44 received
Client response 45 received
Client response 46 received
Client response 47 r

Scraped batch 23 of 50 or part thereof.
Client response 0 received
Client response 1 received
Client response 2 received
Client response 3 received
Client response 4 received
Client response 5 received
Client response 6 received
Client response 7 received
Client response 8 received
Client response 9 received
Client response 10 received
Client response 11 received
Client response 12 received
Client response 13 received
Client response 14 received
Client response 15 received
Client response 16 received
Client response 17 received
Client response 18 received
Client response 19 received
Client response 20 received
Client response 21 received
Client response 22 received
Client response 23 received
Client response 24 received
Client response 25 received
Client response 26 received
Client response 27 received
Client response 28 received
Client response 29 received
Client response 30 received
Client response 31 received
Client response 32 received
Client response 33 received
Client response 34

In [8]:
df_insta_madaboutsingapore2020c = pd.DataFrame(list_dict_compiled)
df_insta_madaboutsingapore2020c.head(3)

Unnamed: 0,post_id,post_slug,unix_time,date_time,post_caption,hashtags,topic_tags,is_video,is_ad,post_likes,geo_tag,geo_slug,owner_id,owner_verified,owner_privacy,owner_unpublished,owner_total_posts,owner_total_followers
0,2376745330526863441,CD75Ukbn0BR,1597550144,2020-08-16 03:55:44,Most of us are still asleep when he starts his...,"[#nikonsg, #nikonglobal, #d500, #covid, #covid...","[plant, tree, outdoor, nature.]",False,False,30,,,176838895,False,False,False,785,446
1,2294407519396580399,B_XX3uLnkwv,1587734712,2020-04-24 13:25:12,How's your stay home or WFH. \n#sg #singapore ...,"[#sg, #singapore, #styeevoen, #ig, #instagram,...",[Photo by štëvëñ🇸🇬📸📱🏀🏃🎾 in Singapore.],False,False,39,Singapore,singapore,38377713,False,False,False,1323,393
2,2371803564266660403,CDqVsZ9HqIz,1596961039,2020-08-09 08:17:19,Happy National Day Singapore!\n\n(Repost)\n#ai...,"[#aingapore, #ndp, #ndp2020, #nationalday, #ma...",[indoor.],False,False,76,Singapore,singapore,12815032,False,False,False,2763,1430


In [9]:
df_insta_madaboutsingapore2020c.shape

(1250, 18)

In [10]:
filename = f"df_insta_madaboutsingapore2020c_{datetime.now().date()}"
df_insta_madaboutsingapore2020c.to_csv(f"../datasets/{filename}.csv", index=False)

In [11]:
gc.collect()

14050

We are done scraping for Instagram! A total of 7 target hashtags were queried and scraped.

Instagram was scraped according to hashtags to uncover people's (predominantly local) perception/sentiment of what it means to be a tourist in Singapore as a local. Following the scrape of the campaign's official hashtag #SingapoRediscovers, other relevant hashtags were targetted. These hashtags were chosen for their pertinence to the current and evolving situation of domestic tourism in Covid-19's new normal. While there are numerous hashtags about the exploration of Singapore, this project targetted only those posts that were created in reaction to the idea of domestic tourism in this new normal.

For example, #rediscoversg or #rediscoversingapore were chosen over #discoversingapore even though the latter had over 110,000 posts at the time of the scrape. However, much of this were surmised to be less relevant noise as they do not capture the rediscovering spirit as prompted by the #SingapoRediscovers campaign and are also widely used by tourists from pre-Covid times. It is assumed that owing to current travel restrictions, the #rediscoversg and #rediscoversingapore hashtags were used in posts that are more current and by locals who are in Singapore.

Another official campaign's hashtag launched adjacent to #SingapoRediscovers by STB and other agencies to encourage locals' holiday-making in Singapore is #Singapoliday. Some instagram posts were misspelt and tagged as #singaporeliday instead. Both these hashtags were targetted for scraping.

Finally, another predominant hashtag in this period of time picked up while researching is #madaboutsingapore2020 and its misspelt cousin #madaboutsingapore2020c. While there is a more widely used #madaboutsingapore hashtag with over 80,000 posts tagged to it, #madaboutsingapore2020(c) were chosen for its recent birth in this new normal that again captured the rediscovering spirit as exhorted by STB. Both these hashtags were similarly targetted for scraping.

### YouTube - Scraping

First, let us scrape and compile a list of URLS of all video uploads from the URL of a single YouTube channel.

In [3]:
def get_yt_videos(url):
    
    # launch driver
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(3)

    # create empty set to add urls to
    link_videos = set() # we use sets since we expect duplicate scraping as not all posts would have disappeared

    # scrape posts url with automated scrolling using selenium to capture all elements before they disappear

    # first scrape of video urls
    tags = driver.find_elements_by_tag_name("a")
    for tag in tags:
        link = tag.get_attribute("href") # add try-except statements because there are some null values scraped
        try:
            if "/watch?" in link:
                link_videos.add(link)
        except:
            pass

    # first scroll
    html = driver.find_element_by_tag_name('html')
    html.send_keys(Keys.PAGE_DOWN)
    time.sleep(3)

    # recursive scroll for n times
    for i in range(20):
        # scrape video urls in between scrolls
        tags = driver.find_elements_by_tag_name("a")
        for tag in tags:
            link = tag.get_attribute("href") # query returns all hrefs
            try:
                if "/watch?" in link: # search only for hrefs that bring you directly to a video
                    link_videos.add(link) # add href with every scroll as page is dynamic and earlier links will be lost
            except:
                pass
        # execute scroll
        html.send_keys(Keys.END)
        time.sleep(3)

    return link_videos # return all hrefs scraped

Next, we shall access each video with the list of URLs we have compiled and scrape the meta-data of each video before compiling them into a dictionary.

In [5]:
def make_yt_dict(list_urls):
    
    list_dict = [] # create empty list to append dicts of info
    
    for i, url in enumerate(list_urls):
        r = requests.get(url)
        
        if r.status_code == 200:
            
            print(f"Client response {i} received.")
            # parse response as html
            html = BeautifulSoup(r.text, "lxml")
            # find body of post and convert to string
            for script in html.find_all("script"):
                try:
                    if script.string.lstrip().startswith('window["ytInitialData"]'):
                        script_target = script.string.lstrip()
                except:
                    pass
            # parse script as json obj
            video_json = json.loads(script_target.split('window["ytInitialData"] = ')[-1].split('window["ytInitialPlayerResponse"]')[0].strip().rstrip(";"))
            # find where target info is stored
            core_json = video_json["contents"]["twoColumnWatchNextResults"]["results"]["results"]["contents"]
            
            # try-except statement to extract target info since not all keys are present in each post's json
            try:
                video_title = core_json[0]["videoPrimaryInfoRenderer"]["title"]["runs"][0]["text"]
            except:
                video_title = None
            try:
                video_caption = [text["text"] for text in core_json[1]["videoSecondaryInfoRenderer"]["description"]["runs"]]
            except:
                video_caption = None
            try:
                date_time = core_json[0]["videoPrimaryInfoRenderer"]["dateText"]["simpleText"]
            except:
                date_time = None
            try:
                video_slug = core_json[0]["videoPrimaryInfoRenderer"]["videoActions"]["menuRenderer"]["topLevelButtons"][0]["toggleButtonRenderer"]["defaultNavigationEndpoint"]["modalEndpoint"]["modal"]["modalWithTitleAndButtonRenderer"]["button"]["buttonRenderer"]["navigationEndpoint"]["signInEndpoint"]["nextEndpoint"]["watchEndpoint"]["videoId"]
            except:
                video_slug = None
            try:
                video_views = core_json[0]["videoPrimaryInfoRenderer"]["viewCount"]["videoViewCountRenderer"]["viewCount"]["simpleText"]
            except:
                video_views = None
            try:
                video_likes = core_json[0]["videoPrimaryInfoRenderer"]["sentimentBar"]["sentimentBarRenderer"]["tooltip"].split("/")[0].strip()
            except:
                video_likes = None
            try:
                video_dislikes = core_json[0]["videoPrimaryInfoRenderer"]["sentimentBar"]["sentimentBarRenderer"]["tooltip"].split("/")[-1].strip()
            except:
                video_dislikes = None
                        
            # compile target info into dict format
            targets = ['video_title', 'video_caption', 'date_time', 'video_slug', 'video_views', 'video_likes', 'video_dislikes']
            dict_info = {}
            for variable in targets:
                dict_info[variable] = eval(variable)
            
            # append dict to list
            list_dict.append(dict_info)
            
        else:
            print(f"No response received for URL index {i}!") # in the event of broken links
            pass
            
        time.sleep(3) # sleep 3s between each request
        
    return list_dict # return appended list of dicts

Finally, we want to make a DataFrame that will compile the comments on each video and their corresponding meta-data.

In [86]:
def get_yt_comments(list_urls):
    
    # create empty list/df to append/concat info
    all_video_titles = []
    total_num_comments = []
    df_all_comments = pd.DataFrame(columns=["response_to", "user", "timestamp", "comment", "likes", "replies_attracted"])
    
    for i, url in enumerate(list_urls):
        driver = webdriver.Chrome()
        driver.get(url)
        time.sleep(3) # add sleep here as youtube comments take some time to load
        
        html = driver.find_element_by_tag_name('html')
        html.send_keys(Keys.PAGE_DOWN)
        time.sleep(3)
        
        # not necessary to get all comments since comments are dynamically sorted and loaded according to likes and date
        for _ in range(10): # scroll down 10 times, should get about 200 comments per video (if there are so many)
            html.send_keys(Keys.END)
            time.sleep(3)
        
        # get missing video meta-data (no. of comments) that could not be found from earlier soup because of dynamic loading
        video_title = driver.find_element_by_xpath('//h1[@class="title style-scope ytd-video-primary-info-renderer"]').text
        try:
            video_comments = driver.find_element_by_xpath('//h2[@id="count"][@class="style-scope ytd-comments-header-renderer"]').text
        except:
            video_comments = 0 # there may be zero comments or comments may be disabled
        # append video titles and video total no. of comments
        all_video_titles.append(video_title)
        total_num_comments.append(video_comments)

        # scrape video comments
        if video_comments != 0:
            comment_elems = driver.find_elements_by_xpath('//*[@id="content-text"]')
            all_comments = [elem.text for elem in comment_elems]
            likes_elems = driver.find_elements_by_xpath('//*[@id="vote-count-middle"]')
            num_likes = [elem.text for elem in likes_elems]
            replies_elems = driver.find_elements_by_xpath('//yt-formatted-string[@id="text"][@class="style-scope ytd-button-renderer"]')
            num_replies = [elem.text for elem in replies_elems]
            users_elems = driver.find_elements_by_xpath('//a[@id="author-text"]/span[@class="style-scope ytd-comment-renderer"]')
            all_users = [elem.text for elem in users_elems]
            publish_elems = driver.find_elements_by_xpath('//yt-formatted-string[@class="published-time-text above-comment style-scope ytd-comment-renderer"]')
            time_publish = [elem.text for elem in publish_elems]
        else:
            all_comments = "No comments made."
            num_likes = "No comments made."
            num_replies = "No comments made."
            all_users = "No comments made."
            time_publish = "No comments made."

        # compile comments meta-data into df and concat into a whole collection
        df_comments = pd.DataFrame(zip(all_users, time_publish, all_comments, num_likes, num_replies),
                                   columns=["user", "timestamp", "comment", "likes", "replies_attracted"])
        df_all_comments = pd.concat([df_all_comments, df_comments]).reset_index(drop=True)
        df_all_comments["response_to"].fillna(video_title, inplace=True) # add title of video that comments are responding to

        # close driver and sleep at end of loop
        print(f"Page index {i} scraped.")
        driver.close()
        time.sleep(3)
        
    # append dict of video's total no. of comments to list
    df_num_comments = pd.DataFrame(zip(all_video_titles, total_num_comments), columns=["video_title", "video_comments"])
    
    return df_all_comments, df_num_comments

Let us test out a small list of URLs to estimate the time needed for scrape to complete.

In [69]:
# # trial
# trial_videos = ['https://www.youtube.com/watch?v=-7XeH-xx6eA',
#  'https://www.youtube.com/watch?v=-9RZQIIW5Q8',
#  'https://www.youtube.com/watch?v=-VRXVHy2FLQ',
#  'https://www.youtube.com/watch?v=-odC3Z1juuk',
#  'https://www.youtube.com/watch?v=-ropoL_W1Wg',
#  'https://www.youtube.com/watch?v=-w8AsysPp9k',
#  'https://www.youtube.com/watch?v=0TVrfQO8e70']
# trial_videos

['https://www.youtube.com/watch?v=-7XeH-xx6eA',
 'https://www.youtube.com/watch?v=-9RZQIIW5Q8',
 'https://www.youtube.com/watch?v=-VRXVHy2FLQ',
 'https://www.youtube.com/watch?v=-odC3Z1juuk',
 'https://www.youtube.com/watch?v=-ropoL_W1Wg',
 'https://www.youtube.com/watch?v=-w8AsysPp9k',
 'https://www.youtube.com/watch?v=0TVrfQO8e70']

In [95]:
# %%time
# df_georgia_comments, georgia_num_comments = get_yt_comments(list(trial_videos))
# print("Shape of all comments scraped:", df_georgia_comments.shape)
# print("Shape of video's meta-data scraped:", georgia_num_comments.shape)

Scraped page number 0 of 7.
Scraped page number 1 of 7.
Scraped page number 2 of 7.
Scraped page number 3 of 7.
Scraped page number 4 of 7.
Scraped page number 5 of 7.
Scraped page number 6 of 7.
Shape of all comments scraped: (195, 6)
Shape of video's meta-data scraped: (7, 2)
Wall time: 6min 54s


We have trialed scraping of comments for 7 videos and it took about 7 minutes. If we assume a linear run-time relationship and extrapolate for a channel with 300 videos, we estimate that the code will take about 300 minutes or 5 hours to run. We can consider scraping in batches in case of any code breaking that will lose us all our scrapes/progress.

#### Georgia Caney

In [8]:
# print datetime of scrape
print(f"Scrape performed on {datetime.now().date()} at {datetime.now().time()}.")

Scrape performed on 2020-10-17 at 13:16:40.780300.


In [9]:
%%time
url = "https://www.youtube.com/c/GeorgiaCaney/videos?view=0&sort=dd&shelf_id=1"
link_videos = get_yt_videos(url)
len(link_videos) # view total number of video urls scraped

Wall time: 3min 27s


310

In [10]:
%%time
list_dict = make_yt_dict(link_videos)

Client response 0 received.
Client response 1 received.
Client response 2 received.
Client response 3 received.
Client response 4 received.
Client response 5 received.
Client response 6 received.
Client response 7 received.
Client response 8 received.
Client response 9 received.
Client response 10 received.
Client response 11 received.
Client response 12 received.
Client response 13 received.
Client response 14 received.
Client response 15 received.
Client response 16 received.
Client response 17 received.
Client response 18 received.
Client response 19 received.
Client response 20 received.
Client response 21 received.
Client response 22 received.
Client response 23 received.
Client response 24 received.
Client response 25 received.
Client response 26 received.
Client response 27 received.
Client response 28 received.
Client response 29 received.
Client response 30 received.
Client response 31 received.
Client response 32 received.
Client response 33 received.
Client response 34 recei

Client response 277 received.
Client response 278 received.
Client response 279 received.
Client response 280 received.
Client response 281 received.
Client response 282 received.
Client response 283 received.
Client response 284 received.
Client response 285 received.
Client response 286 received.
Client response 287 received.
Client response 288 received.
Client response 289 received.
Client response 290 received.
Client response 291 received.
Client response 292 received.
Client response 293 received.
Client response 294 received.
Client response 295 received.
Client response 296 received.
Client response 297 received.
Client response 298 received.
Client response 299 received.
Client response 300 received.
Client response 301 received.
Client response 302 received.
Client response 303 received.
Client response 304 received.
Client response 305 received.
Client response 306 received.
Client response 307 received.
Client response 308 received.
Client response 309 received.
Wall time:

In [11]:
# make dict of info into df
df_yt_georgia = pd.DataFrame(list_dict)
df_yt_georgia.head(3)

Unnamed: 0,video_title,video_caption,date_time,video_slug,video_views,video_likes,video_dislikes
0,HALLOWEEN HORROR NIGHTS 8 AT UNIVERSAL STUDIOS! 👻,[UNIVERSAL HALLOWEEN HORROR NIGHTS 8! 👻\n\nHap...,31 Oct 2018,7xkdm3c4Ks8,"8,766 views",271,3
1,HUGE FEBRUARY HAUL | MONKI & ROMWE,"[FOLLOW ME!!\nBLOG LOVIN: , http://www.bloglov...",28 Feb 2016,uBRHR9zK3mk,"16,049 views",312,5
2,Eating the BEST rated PRATA in SINGAPORE! 🇸🇬 *...,[Eating the BEST rated PRATA in SINGAPORE! 🇸🇬 ...,17 Nov 2019,b-ppBtSiG38,"33,483 views",649,19


In [12]:
df_yt_georgia.shape # view shape of df

(310, 7)

In [13]:
# save out initial scrape of videos meta-data (missing total no. of comments)
filename = f"df_yt_georgia_{datetime.now().date()}"
df_yt_georgia.to_csv(f"../datasets/{filename}.csv", index=False) 

In [14]:
# scrape yt comments in batches of 50 videos at a time

In [15]:
%%time

# set starting indexes
i = 0
u = 50
# create empty dfs
df_georgia_all_comments = pd.DataFrame()
df_georgia_all_num_comments = pd.DataFrame()

# scrape yt comments in batches of 50 videos at a time
for batch in range(round(len(link_videos)/50)):
    
    if u > (len(link_videos)):
        u = -1 # get last indexed url
    else:
        u = u # continue with u value
        
    subset_links = list(link_videos)[i:u]
    df_georgia_comments, df_georgia_num_comments = get_yt_comments(subset_links)
    df_georgia_all_comments = pd.concat([df_georgia_all_comments, df_georgia_comments]).reset_index(drop=True)
    df_georgia_all_num_comments = pd.concat([df_georgia_all_num_comments, df_georgia_num_comments]).reset_index(drop=True)
    print(f"Scraped batch {batch+1} of 50 or part thereof.")
    i += 50
    u += 50

print("Shape of all comments scraped:", df_georgia_all_comments.shape)
print("Shape of all no. of comments scraped:", df_georgia_all_num_comments.shape)

Page index 0 scraped.
Page index 1 scraped.
Page index 2 scraped.
Page index 3 scraped.
Page index 4 scraped.
Page index 5 scraped.
Page index 6 scraped.
Page index 7 scraped.
Page index 8 scraped.
Page index 9 scraped.
Page index 10 scraped.
Page index 11 scraped.
Page index 12 scraped.
Page index 13 scraped.
Page index 14 scraped.
Page index 15 scraped.
Page index 16 scraped.
Page index 17 scraped.
Page index 18 scraped.
Page index 19 scraped.
Page index 20 scraped.
Page index 21 scraped.
Page index 22 scraped.
Page index 23 scraped.
Page index 24 scraped.
Page index 25 scraped.
Page index 26 scraped.
Page index 27 scraped.
Page index 28 scraped.
Page index 29 scraped.
Page index 30 scraped.
Page index 31 scraped.
Page index 32 scraped.
Page index 33 scraped.
Page index 34 scraped.
Page index 35 scraped.
Page index 36 scraped.
Page index 37 scraped.
Page index 38 scraped.
Page index 39 scraped.
Page index 40 scraped.
Page index 41 scraped.
Page index 42 scraped.
Page index 43 scraped

StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: chrome=86.0.4240.75)


In [21]:
df_georgia_all_comments.shape

(9231, 6)

In [22]:
df_georgia_all_num_comments.shape

(245, 2)

In [28]:
%%time

# set starting indexes
i = 250
u = 300
# create empty dfs
df_georgia_all_comments2 = pd.DataFrame()
df_georgia_all_num_comments2 = pd.DataFrame()

# scrape yt comments in batches of 50 videos at a time
for batch in range(round(len(link_videos)/50)):
    
    if u > (len(link_videos)):
        u = -1 # get last indexed url
    else:
        u = u # continue with u value
        
    subset_links = list(link_videos)[i:u]
    df_georgia_comments, df_georgia_num_comments = get_yt_comments(subset_links)
    df_georgia_all_comments2 = pd.concat([df_georgia_all_comments2, df_georgia_comments]).reset_index(drop=True)
    df_georgia_all_num_comments2 = pd.concat([df_georgia_all_num_comments2, df_georgia_num_comments]).reset_index(drop=True)
    print(f"Scraped batch {batch+1} of 50 or part thereof.")
    i += 50
    u += 50

print("Shape of all comments scraped:", df_georgia_all_comments2.shape)
print("Shape of all no. of comments scraped:", df_georgia_all_num_comments2.shape)

Page index 0 scraped.
Page index 1 scraped.
Page index 2 scraped.
Page index 3 scraped.
Page index 4 scraped.
Page index 5 scraped.
Page index 6 scraped.
Page index 7 scraped.
Page index 8 scraped.
Page index 9 scraped.
Page index 10 scraped.
Page index 11 scraped.
Page index 12 scraped.
Page index 13 scraped.
Page index 14 scraped.
Page index 15 scraped.
Page index 16 scraped.
Page index 17 scraped.
Page index 18 scraped.
Page index 19 scraped.
Page index 20 scraped.
Page index 21 scraped.
Page index 22 scraped.
Page index 23 scraped.
Page index 24 scraped.
Page index 25 scraped.
Page index 26 scraped.
Page index 27 scraped.
Page index 28 scraped.
Page index 29 scraped.
Page index 30 scraped.
Page index 31 scraped.
Page index 32 scraped.
Page index 33 scraped.
Page index 34 scraped.
Page index 35 scraped.
Page index 36 scraped.
Page index 37 scraped.
Page index 38 scraped.
Page index 39 scraped.
Page index 40 scraped.
Page index 41 scraped.
Page index 42 scraped.
Page index 43 scraped

In [29]:
df_georgia_all_comments2.head(1)

Unnamed: 0,response_to,user,timestamp,comment,likes,replies_attracted
0,Finding hidden gems in singapore + hitting 100...,,1 month ago,Your videos show up in my recommended quite of...,39,View reply from Georgia Caney


In [30]:
df_georgia_all_num_comments2.head(1)

Unnamed: 0,video_title,video_comments
0,Finding hidden gems in singapore + hitting 100...,153 Comments


In [32]:
df_georgia_all_comments2["response_to"].nunique()

58

In [48]:
[list(link_videos)[i] for i in [49, 99, 149, 199, 249]]

['https://www.youtube.com/watch?v=dpC2Wbw4JKM',
 'https://www.youtube.com/watch?v=V8ogu_e7DV8',
 'https://www.youtube.com/watch?v=Ki-njVqnLCU',
 'https://www.youtube.com/watch?v=Oi65C3XnCz0',
 'https://www.youtube.com/watch?v=qEP2EL2j5pg']

In [49]:
# get yt comments for skipped over urls because of above erroneous coding
df_georgia_all_comments3, df_georgia_all_num_comments3 = get_yt_comments([list(link_videos)[i] for i in [49, 99, 149, 199, 249]])

Page index 0 scraped.
Page index 1 scraped.
Page index 2 scraped.
Page index 3 scraped.
Page index 4 scraped.


In [50]:
df_georgia_all_comments3.shape

(182, 6)

In [51]:
df_georgia_all_num_comments3.shape

(5, 2)

In [54]:
# concat all partial scrapes and overwrite main df
df_georgia_all_comments = pd.concat([df_georgia_all_comments, df_georgia_all_comments2, df_georgia_all_comments3]).reset_index(drop=True)
df_georgia_all_comments.shape

(11342, 6)

In [59]:
# concat all partial scrapes and overwrite main df
df_georgia_all_num_comments = pd.concat([df_georgia_all_num_comments, df_georgia_all_num_comments2, df_georgia_all_num_comments3]).reset_index(drop=True)
df_georgia_all_num_comments.shape

(309, 2)

In [64]:
# merge in no. of comments to video meta-data df
df_yt_georgia = df_yt_georgia.merge(df_georgia_all_num_comments, how="left", on="video_title")
df_yt_georgia.head(3)

Unnamed: 0,video_title,video_caption,date_time,video_slug,video_views,video_likes,video_dislikes,video_comments
0,HALLOWEEN HORROR NIGHTS 8 AT UNIVERSAL STUDIOS! 👻,[UNIVERSAL HALLOWEEN HORROR NIGHTS 8! 👻\n\nHap...,31 Oct 2018,7xkdm3c4Ks8,"8,766 views",271,3,18 Comments
1,HUGE FEBRUARY HAUL | MONKI & ROMWE,"[FOLLOW ME!!\nBLOG LOVIN: , http://www.bloglov...",28 Feb 2016,uBRHR9zK3mk,"16,049 views",312,5,
2,Eating the BEST rated PRATA in SINGAPORE! 🇸🇬 *...,[Eating the BEST rated PRATA in SINGAPORE! 🇸🇬 ...,17 Nov 2019,b-ppBtSiG38,"33,483 views",649,19,159 Comments


In [65]:
# save out final scrape of videos meta-data (added in total no. of comments)
filename = f"df_yt_georgia_{datetime.now().date()}"
df_yt_georgia.to_csv(f"../datasets/{filename}.csv", index=False) 

In [66]:
filename = f"df_yt_georgia_comments_{datetime.now().date()}"
df_georgia_all_comments.to_csv(f"../datasets/{filename}.csv", index=False)

In [68]:
gc.collect() # default gen = 2

3712

#### Ghib Ojisan

In [69]:
# print datetime of scrape
print(f"Scrape performed on {datetime.now().date()} at {datetime.now().time()}.")

Scrape performed on 2020-10-17 at 23:10:03.904052.


In [70]:
%%time
url = "https://www.youtube.com/c/GhibOjisan/videos?view=0&sort=dd&shelf_id=1"
link_videos = get_yt_videos(url)
len(link_videos) # view total number of video urls scraped

Wall time: 3min 2s


389

In [71]:
%%time
list_dict = make_yt_dict(link_videos)

Client response 0 received.
Client response 1 received.
Client response 2 received.
Client response 3 received.
Client response 4 received.
Client response 5 received.
Client response 6 received.
Client response 7 received.
Client response 8 received.
Client response 9 received.
Client response 10 received.
Client response 11 received.
Client response 12 received.
Client response 13 received.
Client response 14 received.
Client response 15 received.
Client response 16 received.
Client response 17 received.
Client response 18 received.
Client response 19 received.
Client response 20 received.
Client response 21 received.
Client response 22 received.
Client response 23 received.
Client response 24 received.
Client response 25 received.
Client response 26 received.
Client response 27 received.
Client response 28 received.
Client response 29 received.
Client response 30 received.
Client response 31 received.
Client response 32 received.
Client response 33 received.
Client response 34 recei

Client response 277 received.
Client response 278 received.
Client response 279 received.
Client response 280 received.
Client response 281 received.
Client response 282 received.
Client response 283 received.
Client response 284 received.
Client response 285 received.
Client response 286 received.
Client response 287 received.
Client response 288 received.
Client response 289 received.
Client response 290 received.
Client response 291 received.
Client response 292 received.
Client response 293 received.
Client response 294 received.
Client response 295 received.
Client response 296 received.
Client response 297 received.
Client response 298 received.
Client response 299 received.
Client response 300 received.
Client response 301 received.
Client response 302 received.
Client response 303 received.
Client response 304 received.
Client response 305 received.
Client response 306 received.
Client response 307 received.
Client response 308 received.
Client response 309 received.
Client res

In [72]:
# make dict of info into df
df_yt_ghib = pd.DataFrame(list_dict)
df_yt_ghib.head(3)

Unnamed: 0,video_title,video_caption,date_time,video_slug,video_views,video_likes,video_dislikes
0,【これが現実】エリート駐在員にストレスを吐き出してもらった結果…【シンガポール】,"[▼基本一人旅さん\n, https://twitter.com/crazy_travele...",7 Jun 2019,AsF1hDfyF9g,"26,426 views",347,36
1,【発表】路上演奏のライセンスを取りにオーディションに参加した結果,"[😃チャンネル登録： , http://urx3.nu/HTUJ, \n🎥関連動画「オーディ...",31 Jul 2019,R4rwlGx0Kog,"10,547 views",405,6
2,【裏側公開】ビートルズのノルウェーの森を弾いてみた【メイキング動画】,[以前、『ノルウェーの森でノルウェーの森を弾いてみた』という（詰まらないシャレみたいな）動画...,12 Aug 2018,0oMslC1sGNU,"10,514 views",258,7


In [73]:
df_yt_ghib.shape # view shape of df

(389, 7)

In [74]:
# save out initial scrape of videos meta-data (missing total no. of comments)
filename = f"df_yt_ghib_{datetime.now().date()}"
df_yt_ghib.to_csv(f"../datasets/{filename}.csv", index=False) 

In [75]:
# scrape yt comments in batches of 50 videos at a time

In [88]:
%%time

# set starting indexes
i = 0
u = 50
# create empty dfs
df_ghib_all_comments = pd.DataFrame()
df_ghib_all_num_comments = pd.DataFrame()

# scrape yt comments in batches of 50 videos at a time
for batch in range(round(len(link_videos)/50)):
    
    if u > (len(link_videos)):
        u = -1 # get last indexed url
    else:
        u = u # continue with u value
        
    subset_links = list(link_videos)[i:u]
    df_ghib_comments, df_ghib_num_comments = get_yt_comments(subset_links)
    df_ghib_all_comments = pd.concat([df_ghib_all_comments, df_ghib_comments]).reset_index(drop=True)
    df_ghib_all_num_comments = pd.concat([df_ghib_all_num_comments, df_ghib_num_comments]).reset_index(drop=True)
    print(f"Scraped batch {batch+1} of 50 or part thereof.")
    i += 50
    u += 50

print("Shape of all comments scraped:", df_ghib_all_comments.shape)
print("Shape of all no. of comments scraped:", df_ghib_all_num_comments.shape)

Page index 0 scraped.
Page index 1 scraped.
Page index 2 scraped.
Page index 3 scraped.
Page index 4 scraped.
Page index 5 scraped.
Page index 6 scraped.
Page index 7 scraped.
Page index 8 scraped.
Page index 9 scraped.
Page index 10 scraped.
Page index 11 scraped.
Page index 12 scraped.
Page index 13 scraped.
Page index 14 scraped.
Page index 15 scraped.
Page index 16 scraped.
Page index 17 scraped.
Page index 18 scraped.
Page index 19 scraped.
Page index 20 scraped.
Page index 21 scraped.
Page index 22 scraped.
Page index 23 scraped.
Page index 24 scraped.
Page index 25 scraped.
Page index 26 scraped.
Page index 27 scraped.
Page index 28 scraped.
Page index 29 scraped.
Page index 30 scraped.
Page index 31 scraped.
Page index 32 scraped.
Page index 33 scraped.
Page index 34 scraped.
Page index 35 scraped.
Page index 36 scraped.
Page index 37 scraped.
Page index 38 scraped.
Page index 39 scraped.
Page index 40 scraped.
Page index 41 scraped.
Page index 42 scraped.
Page index 43 scraped

Scraped batch 7 of 50 or part thereof.
Page index 0 scraped.
Page index 1 scraped.
Page index 2 scraped.
Page index 3 scraped.
Page index 4 scraped.
Page index 5 scraped.
Page index 6 scraped.
Page index 7 scraped.
Page index 8 scraped.
Page index 9 scraped.
Page index 10 scraped.
Page index 11 scraped.
Page index 12 scraped.
Page index 13 scraped.
Page index 14 scraped.
Page index 15 scraped.
Page index 16 scraped.
Page index 17 scraped.
Page index 18 scraped.
Page index 19 scraped.
Page index 20 scraped.
Page index 21 scraped.
Page index 22 scraped.
Page index 23 scraped.
Page index 24 scraped.
Page index 25 scraped.
Page index 26 scraped.
Page index 27 scraped.
Page index 28 scraped.
Page index 29 scraped.
Page index 30 scraped.
Page index 31 scraped.
Page index 32 scraped.
Page index 33 scraped.
Page index 34 scraped.
Page index 35 scraped.
Page index 36 scraped.
Page index 37 scraped.
Scraped batch 8 of 50 or part thereof.
Scraped batch 9 of 50 or part thereof.
Shape of all commen

In [89]:
df_ghib_all_comments.head(1)

Unnamed: 0,response_to,user,timestamp,comment,likes,replies_attracted
0,【これが現実】エリート駐在員にストレスを吐き出してもらった結果…【シンガポール】,佐東忠司,1 year ago,始業時間に厳しく、終了時間がいい加減...\n時間に一番ルーズなのは日本人では？w\n友達が...,45,View reply


In [90]:
df_ghib_all_num_comments.head(1)

Unnamed: 0,video_title,video_comments
0,【これが現実】エリート駐在員にストレスを吐き出してもらった結果…【シンガポール】,77 Comments


In [91]:
# merge in no. of comments to video meta-data df
df_yt_ghib = df_yt_ghib.merge(df_ghib_all_num_comments, how="left", on="video_title")
df_yt_ghib.head(3)

Unnamed: 0,video_title,video_caption,date_time,video_slug,video_views,video_likes,video_dislikes,video_comments
0,【これが現実】エリート駐在員にストレスを吐き出してもらった結果…【シンガポール】,"[▼基本一人旅さん\n, https://twitter.com/crazy_travele...",7 Jun 2019,AsF1hDfyF9g,"26,426 views",347,36,77 Comments
1,【発表】路上演奏のライセンスを取りにオーディションに参加した結果,"[😃チャンネル登録： , http://urx3.nu/HTUJ, \n🎥関連動画「オーディ...",31 Jul 2019,R4rwlGx0Kog,"10,547 views",405,6,102 Comments
2,【裏側公開】ビートルズのノルウェーの森を弾いてみた【メイキング動画】,[以前、『ノルウェーの森でノルウェーの森を弾いてみた』という（詰まらないシャレみたいな）動画...,12 Aug 2018,0oMslC1sGNU,"10,514 views",258,7,62 Comments


In [92]:
# save out final scrape of videos meta-data (added in total no. of comments)
filename = f"df_yt_ghib_{datetime.now().date()}"
df_yt_ghib.to_csv(f"../datasets/{filename}.csv", index=False)

In [124]:
filename = f"df_yt_ghib_comments_{datetime.now().date()}"
df_ghib_all_comments.to_csv(f"../datasets/{filename}.csv", index=False)

In [94]:
gc.collect() # default gen = 2

3425

We have finished scraping for our target YouTube channels. A total of 2 target YouTube channels were scraped from. Of course, not all their videos and associated comments will be relevant. We will have to keep in mind to be selective in building up our training set during preprocessing and EDA.