# SingStat and Data.gov.sg information crawler
This notebook pulls and consolidates dataset information found on both sites.

## Install dependencies for pulling data sources information

### Notes: BeautifulSoup can only scrape static page content. Selenium is required alongside BeautifulSoup to read dynamically loaded content, which is the case for both SingStat and data.gov.sg.

In [1]:
#!pip install requests selenium beautifulsoup4

Defaulting to user installation because normal site-packages is not writeable


Library imports

In [26]:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver 
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException, TimeoutException
from typing import NewType
import time 
# Define options for webdriver
chrome_options = Options()

In [24]:
type(webdriver)

module

Try to click through Data.gov.sg and pull dataset information
Attempt to load as much information as possible before scraping it in one shot. To avoid issues, please do not minimize the browser window while the cell is executing.

In [27]:
def load_all_data(dataset_on_display: list,
                  total_results: int,
                  load_more_button: object,
                  datasets_div_xpath: str,
                  load_more_xpath: str,
                  driver: webdriver.Chrome) -> list:
    """Click "Load more" until every dataset is displayed or the button is gone."""
    while len(dataset_on_display) < total_results and load_more_button:
        print("Loading more data")
        # Recount the datasets currently displayed
        dataset_on_display = driver.find_elements(by=By.XPATH, value=datasets_div_xpath)
        try:
            # Wait for the button to become clickable again after the previous load
            load_more_button = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.XPATH, load_more_xpath)))
        except TimeoutException:
            print("No load more button available after waiting. Assuming the end of the page")
            break  # without this, the loop would never terminate

        try:
            print(f"Current display data: {len(dataset_on_display)}/{total_results}")
            load_more_button.click()
        except (NoSuchElementException, StaleElementReferenceException):
            print("Unable to click load more button because the element is no longer available")
            break
    # Return the final list so the caller can use it
    return driver.find_elements(by=By.XPATH, value=datasets_div_xpath)
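The stopping logic of the load-more loop can be checked without opening a browser. The following is a minimal sketch, not the notebook's actual function: `load_until_complete`, `FakePage`, and the `max_clicks` guard are hypothetical names introduced here for illustration, and the page size of 20 is an assumption based on the count printed later in this notebook.

```python
from typing import Callable


def load_until_complete(count_displayed: Callable[[], int],
                        click_load_more: Callable[[], bool],
                        total_results: int,
                        max_clicks: int = 1000) -> int:
    """Click "load more" until everything is shown, the button is gone, or max_clicks is hit.

    count_displayed -- returns how many datasets are currently on the page
    click_load_more -- clicks the button; returns False when it is no longer available
    """
    clicks = 0
    displayed = count_displayed()
    while displayed < total_results and clicks < max_clicks:
        if not click_load_more():
            break  # button gone: assume the end of the page
        clicks += 1
        displayed = count_displayed()
    return displayed


class FakePage:
    """Stand-in for the browser: reveals page_size more datasets per click (assumed value)."""

    def __init__(self, total: int, page_size: int = 20):
        self.total = total
        self.page_size = page_size
        self.displayed = min(page_size, total)

    def count(self) -> int:
        return self.displayed

    def click(self) -> bool:
        if self.displayed >= self.total:
            return False
        self.displayed = min(self.displayed + self.page_size, self.total)
        return True


page = FakePage(total=65)
shown = load_until_complete(page.count, page.click, total_results=65)
print(shown)
```

With Selenium, the two callables would wrap `driver.find_elements(...)` and the wait-then-click step. Keeping them separate from the loop makes the exit conditions easy to test.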

In [28]:
URL = "https://beta.data.gov.sg/datasets?sort=Last%20updated"
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)
driver.get(URL)

# XPATH
load_more_xpath = "//button[contains(@class, 'chakra-button') and text()='Load more']"
results_xpath = "//div[@class='css-no1clx']/p[contains(@class, 'chakra-text')][2]"
datasets_div_xpath = "//a[@class='chakra-link css-mfiv8d']"

# Webpage wait
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, load_more_xpath)))

# Find elements
try:
    total_results = driver.find_element(by=By.XPATH, value=results_xpath)
    # The count is a string in the format xxx,xxx, so remove the commas before converting
    total_results_text = total_results.text
    total_results = int(total_results_text.replace(",", ""))
except (NoSuchElementException, ValueError):
    # find_element raises rather than returning None, so handle the failure here
    raise RuntimeError("Can't extract any results information from the page")
load_more_button = driver.find_element(by=By.XPATH, value=load_more_xpath)
# Count displayed dataset
dataset_on_display = driver.find_elements(by=By.XPATH, value=datasets_div_xpath)

# Check the current number of datasets on display
if load_more_button and total_results:
    print(f"Found load more button with {total_results} total results available")

# load_all_data(dataset_on_display=dataset_on_display,
#               total_results=total_results,
#               load_more_button=load_more_button,
#               datasets_div_xpath=datasets_div_xpath,
#               load_more_xpath=load_more_xpath,
#               driver=driver)

Found load more button with 4832 total results available
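The comma-stripping step above can be isolated into a small helper that is easy to test. This is a sketch under assumptions: `parse_result_count` is a hypothetical name, and the tolerance for surrounding words (for example a trailing "results") is an assumption about how the page might format the count, not something confirmed by the output above.

```python
import re


def parse_result_count(text: str) -> int:
    """Extract the first integer from text such as '4,832', ignoring thousands separators."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        raise ValueError(f"No number found in {text!r}")
    return int(match.group().replace(",", ""))


print(parse_result_count("4,832"))  # 4832
```

Raising `ValueError` on missing digits keeps a layout change on the site from silently producing a wrong count.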


## Do an overall count after load more has been exhausted

In [29]:
total_dataset_on_display = driver.find_elements(by=By.XPATH, value=datasets_div_xpath)
print(len(total_dataset_on_display))

20


## Extract dataset metadata for storage

In [43]:
# Expand the list
#Collection
collection_tracker_dict = {}

# Dataset 
dataset_tracker_dict= {}

get_dataset_name_xpath = ".//div/div/p"
get_metadata_info_xpath = ".//div/div/div[contains(@class, 'chakra-wrap')]/ul/p[contains(@class, 'chakra-text')]"
for dataset in total_dataset_on_display:
    # get_attribute returns None when the attribute is missing; it does not raise
    href = dataset.get_attribute('href')
    if href:
        print("Dataset link:")
        print(href)
    else:
        print("No href info found.")
    
    try:
        name = dataset.find_element(by=By.XPATH, value=get_dataset_name_xpath)
        print("Dataset name:")
        print(name.text) 
    except NoSuchElementException:
        print("No name information found")

    # Multiple metadata information
    metadata_list = dataset.find_elements(by=By.XPATH, value=get_metadata_info_xpath)
    # Filter alternate elements
    metadata_list = [metadata.text for i,metadata in enumerate(metadata_list) if i%2==0]
    print(metadata_list)
    print()   

Dataset link:
https://beta.data.gov.sg/datasets/d_3bc5e9401a77986ab974a95b197ff4bc/view
Dataset name:
Legislation Gazettes (Current Notices)
5
['Updated 4 minutes ago', 'CSV', 'Ministry of Communications and Information (MCI)']

Dataset link:
https://beta.data.gov.sg/datasets/d_f1d530a536bc6985fea6a7ecacd8706b/view
Dataset name:
Revocation 2023
5
['Updated 4 minutes ago', 'CSV', 'Ministry of Communications and Information (MCI)']

Dataset link:
https://beta.data.gov.sg/datasets/d_7e3623bb66b9ef7116e9ec4567bc309d/view
Dataset name:
Revised Subsidiary Legislation 2024
5
['Updated 4 minutes ago', 'CSV', 'Ministry of Communications and Information (MCI)']

Dataset link:
https://beta.data.gov.sg/datasets/d_8baea0c703f1918607ab1ea31989f4f2/view
Dataset name:
Leave 2023
5
['Updated 4 minutes ago', 'CSV', 'Ministry of Communications and Information (MCI)']

Dataset link:
https://beta.data.gov.sg/datasets/d_c31c965da0c7c4bdc897eebb6747486a/view
Dataset name:
Notices under other Acts 2024
5
['Up
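The printed fields could be collected into the tracker dictionary instead of only being printed. The sketch below is illustrative: `dataset_id_from_href` and `build_record` are hypothetical helpers, and the field order (updated, format, agency) is an assumption inferred from the metadata lists in the output above.

```python
import re
from typing import Optional


def dataset_id_from_href(href: str) -> Optional[str]:
    """Return the path segment after /datasets/, or None if the URL has no such segment."""
    match = re.search(r"/datasets/([^/?#]+)", href)
    return match.group(1) if match else None


def build_record(href: str, name: str, metadata: list) -> dict:
    """Combine the scraped pieces into one dictionary per dataset."""
    # Assumed field order, taken from the printed output: updated, format, agency
    padded = list(metadata) + [None] * 3
    updated, file_format, agency = padded[:3]
    return {
        "id": dataset_id_from_href(href),
        "name": name,
        "href": href,
        "updated": updated,
        "format": file_format,
        "agency": agency,
    }


dataset_tracker = {}
record = build_record(
    "https://beta.data.gov.sg/datasets/d_3bc5e9401a77986ab974a95b197ff4bc/view",
    "Legislation Gazettes (Current Notices)",
    ["Updated 4 minutes ago", "CSV", "Ministry of Communications and Information (MCI)"],
)
dataset_tracker[record["id"]] = record
print(record["id"])
```

Keying the tracker by dataset ID means a dataset scraped twice overwrites its earlier entry rather than being duplicated.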

Try the SingStat Find APIs URL "https://tablebuilder-stg.singstat.gov.sg/view-api/find-apis"
Loop through the pagination

In [20]:
# URL of the SingStat "Find APIs" page to load in the browser
URL = "https://tablebuilder-stg.singstat.gov.sg/view-api/find-apis"

# Start a Chrome webdriver session and open the page
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)
driver.get(URL)

# Pause so that the dynamically loaded content can finish rendering
delay = 10 # seconds
time.sleep(delay)
# try:
#     myElem = WebDriverWait(driver, delay).until(EC.title_is("(DOS) | SingStat Table Builder - Data API"))
#     print("Ready")
# except TimeoutException:
#     print("Loading took too much time!")
    
data_dict = {}
for page in range(1, 2):
    print(f"Page: {page}")
    # find_tables = driver.find_elements(by=By.XPATH, value="//table[contains(@class, 'table')]")
    # print(f"Total tables {len(find_tables)}")
    # for table in find_tables:
    files = driver.find_elements(by=By.XPATH, value="//tbody/tr")
    print(len(files))
    for file in files:
        filename = file.find_element(by=By.XPATH, value=".//td/button")
        print("File")
        print(filename.text)
        print("Ontology:")
        ontology = file.find_element(by=By.XPATH, value=".//td/p").text
        print(ontology)
        print()
    print()
    # time.sleep(5)
    #     #break
    # # Go next page
    # #driver.find_element(By.XPATH, "//button[text()='Next']").click()
#driver.quit()

Page: 1
50
File
Singapore's Balance Of Payments, (BPM6 Format), Annual
Ontology:


File
Singapore's Balance Of Payments, (BPM6 Format), Quarterly
Ontology:


File
Average Number Of Children Born By Age Group Of Citizen Ever-Married Females, Annual
Ontology:


File
Average Number Of Children Born By Age Group Of Resident Ever-Married Females, Annual
Ontology:


File
Average Number Of Children Born To Resident Ever-Married Females Aged 40-49 Years By Highest Qualification Attained, Annual
Ontology:


File
Births And Fertility Rates, Annual
Ontology:


File
Citizen Ever-Married Females By Age Group And Number Of Children Born, Annual
Ontology:


File
Live-Births By Birth Order, Annual
Ontology:


File
Live-Births By Birth Order, Quarterly
Ontology:


File
Live-Births By Sex And Ethnic Group, Monthly
Ontology:


File
Resident Ever-Married Females Aged 15 Years and Over by Age Group, Number of Children Born and Ethnic Group (Census of Population 2020)
Ontology:


File
Resident Ever-Married 

In [4]:
# # Get categories
# all_table_list = [table for table in soup.find_all('table', class_="table") if table.th and table.td]
# print(all_table_list)

[<table class="sc-lbVvki ghCdOs table"><thead><tr><th colspan="3">Balance of Payments (BOP)</th></tr></thead><tbody><tr><td><button class="sc-Arkif eSiFxs sc-jNnpgg jIfccS" type="button">Singapore's Balance Of Payments, (BPM6 Format), Annual</button><p>Economy &amp; Prices <span class="sc-crzoAE DykGo icon-chevron-right"></span> Balance of Payments (BOP) <span class="sc-crzoAE DykGo icon-chevron-right"></span> Singapore's Balance of Payments (BOP)</p></td><td style="width: 50px;"><button class="sc-Arkif eSiFxs sc-jNnpgg jIfccS" style="white-space: nowrap;" type="button">CSV <span class="sc-bkbkJK eraKfR icon-external"></span></button></td><td style="width: 50px;"><button class="sc-Arkif eSiFxs sc-jNnpgg jIfccS" style="white-space: nowrap;" type="button">JSON <span class="sc-bkbkJK eraKfR icon-external"></span></button></td></tr><tr><td><button class="sc-Arkif eSiFxs sc-jNnpgg jIfccS" type="button">Singapore's Balance Of Payments, (BPM6 Format), Quarterly</button><p>Economy &amp; Prices
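In the HTML above, each `<p>` holds the ontology path, with empty `<span>` chevron icons acting as separators. A minimal sketch of splitting that path into a list using only the standard library follows. `BreadcrumbParser` is a hypothetical class written for this illustration, and it assumes every `<span>` inside a `<p>` is a separator, which holds for the fragment shown but is not verified for the whole page.

```python
from html.parser import HTMLParser


class BreadcrumbParser(HTMLParser):
    """Collect the text segments of each <p>, treating child <span> icons as separators."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.breadcrumbs = []  # one list of segments per <p>
        self._in_p = False
        self._buffer = ""

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.breadcrumbs.append([])
            self._buffer = ""
        elif tag == "span" and self._in_p:
            self._flush()  # a chevron icon ends the current segment

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer += data

    def _flush(self):
        segment = self._buffer.strip()
        if segment:
            self.breadcrumbs[-1].append(segment)
        self._buffer = ""


sample = (
    '<td><button type="button">Singapore\'s Balance Of Payments, (BPM6 Format), Annual</button>'
    '<p>Economy &amp; Prices <span class="sc-crzoAE DykGo icon-chevron-right"></span> '
    'Balance of Payments (BOP) <span class="sc-crzoAE DykGo icon-chevron-right"></span> '
    "Singapore's Balance of Payments (BOP)</p></td>"
)
parser = BreadcrumbParser()
parser.feed(sample)
parser.close()
print(parser.breadcrumbs[0])
```

The same splitting could be done with BeautifulSoup on `driver.page_source`. The standard-library parser is used here only to keep the example free of extra dependencies.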

## Try using tablebuilder as source

Via elements we see sgds-accordion > 

In [43]:
new_url="https://tablebuilder.singstat.gov.sg/"
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)
# Load the Table Builder landing page
driver.get(new_url)

# Wait for the popup and dismiss it by clicking "Maybe Later"
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[text()='Maybe Later']"))).click()

# Once the popup is gone, click "Expand all" to open every accordion section
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//button[text()='Expand all']"))).click()

# Navigate through the page body after expanding all sections

# We have 6 headers
detailsAccordion = driver.find_elements(by=By.XPATH, value="//span[contains(@class, 'irmCui')]")
print("Total sections:")
print(len(detailsAccordion))
for i, option in enumerate(detailsAccordion):
    print("Header:")
    print(option.text)

    # Get the caret-right
    sub_headers_carets = option.find_elements(by=By.XPATH, value=f"//*[contains(text(), '{option.text}')]/../../../following-sibling::div")
    print(len(sub_headers_carets))
    # for sub_header_caret in sub_headers_carets:
    #     sub_header_caret.click()
        # sub_header = option.find_element(by=By.XPATH, value=".//*/../following-sibling::div/div").text

        # print(sub_header)
    #     sub_sub_headers_carets = sub_header_caret.find_elements(by=By.XPATH, value=".//span[contains(@class, 'icon-caret-right')]")
    #     sub_sub_headers = sub_header_caret.find_element(by=By.XPATH, value=".//span[contains(@class, 'icon-caret-right')]/../following-sibling::div/div").text
    #     print("SUb-subheaders")
    #     print(sub_sub_headers)
    break
    # Search for is-nested class

    # for sub_dataset in content_b.find_elements(by=By.XPATH, value=".//div[contains(@class, 'sgds-accordion')]"):
    #     print("subheader")
    #     sub_header = sub_dataset.find_element(by=By.XPATH, value=".//a[contains(@class, 'sgds-accordion-header')]")
    #     print(sub_header.text)
    #     driver.implicitly_wait(1)

        # datasets = sub_dataset.find_elements(by=By.XPATH, value="//div[contains(@class, 'is-nested')]")
        # # for data in datasets:
        # #     data.click()
        # #     print("dataset")
        # #     print(data.text)
        # #     print()
        #print()
    print("Sleeping...")
    time.sleep(2)
    print()
    #driver.implicitly_wait(2)


Total sections:
6
Header:
Economy & Prices
1


## Note on indexing the results of driver.find_elements (wrap the whole XPath in parentheses):
Example: select the 2nd match across the whole document (XPath positions start at 1):
detailsAccordion = driver.find_elements(by=By.XPATH, value="(//*[contains(@class, 'is-open sgds-accordion-body')])[2]")

In [98]:
detailsAccordion = driver.find_elements(by=By.XPATH, value="(//*[contains(@class, 'is-open sgds-accordion-body')])[2]")
detailsAccordion

[<selenium.webdriver.remote.webelement.WebElement (session="081dccc68fc703cfa846c88cc73ba243", element="f.70A807E70882149E9A2ADEECF7E68E8D.d.946879DEBFEF4C7C62F943CB1E5E1C1D.e.191")>]
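The parentheses matter because `//x[2]` and `(//x)[2]` mean different things: the first selects every `x` that is the second `x` child of its own parent, while the second selects the second `x` in the whole document. The sketch below illustrates the difference on a made-up XML snippet using the standard library. `xml.etree.ElementTree` does not support the parenthesised form, so a Python list index stands in for `(//item)[2]`; the element names are hypothetical.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<root>"
    "<section><item>a</item></section>"
    "<section><item>b</item><item>c</item></section>"
    "</root>"
)

# Equivalent of //item[2]: items that are the 2nd <item> within their own parent
second_within_parent = [e.text for e in doc.findall(".//item[2]")]

# Equivalent of (//item)[2]: the 2nd <item> in document order (Python lists are 0-indexed)
second_overall = doc.findall(".//item")[1].text

print(second_within_parent)  # ['c']
print(second_overall)        # b
```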

Loop through all pages

In [None]:
# Get table headers per page
# First page layout. Find class with table identifier

driver = webdriver.Chrome()

driver.implicitly_wait(0.5)
#maximize browser
driver.maximize_window()
driver.get(URL)

# Parse the rendered page source with BeautifulSoup
from bs4 import BeautifulSoup

html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
soup

Singapore's Balance Of Payments, (BPM6 Format), Annual
['Economy & Prices', 'Balance of Payments (BOP)', "Singapore's Balance of Payments (BOP)"]

Average Number Of Children Born By Age Group Of Citizen Ever-Married Females, Annual
['Population', 'Births and Fertility', 'Births, Crude Birth Rate (CBR), Total Fertility Rate (TFR) and Reproduction Rates']

Available And Vacant Commercial And Industrial Properties (End Of Period), Quarterly
['Industry', 'Building, Real Estate, Construction and Housing', 'Real Estate']

Business Expectations For The Services Sector - Employment Forecast For The Next Quarter, Net Weighted Balance, Quarterly
['Industry', 'Business Expectations', 'Services Sector']

Number And Enrolment In Kindergartens, Annual
['Society', 'Community Services', 'Social and Family Services']

Current Ratio Of Companies By Broad Industry (End Of Period), Annual
['Industry', 'Corporate Sector', 'Performance']

Age-Specific Death Rates, Annual
['Population', 'Death and Life Expec