# Web Scraping using Selenium - SGX Website

**Objectives:** 

* How to use Selenium to extract HTML
* How to use Selenium to interact with a website before extracting HTML

We will use the following website to demonstrate:
* https://www.sgx.com/securities/equities/D05


Install the Python libraries `selenium`, `webdriver_manager`, and `beautifulsoup4` using `pip`. 

In [1]:
!pip install selenium
!pip install webdriver_manager
!pip install beautifulsoup4



Import libraries

In [2]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

from webdriver_manager.chrome import ChromeDriverManager

## 1. Extract Data without Interaction

We will demonstrate how to extract company announcements and company news from the SGX website.

### Open Website and Get HTML

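Get an instance of the web browser. 
* The `webdriver_manager` library provides managers for different browsers. It downloads the driver version that matches your installed browser.
* Selenium 4 expects the driver path to be wrapped in a `Service` object rather than passed as a positional argument.

In [3]:
from selenium.webdriver.chrome.service import Service
browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()))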

[WDM] - Current google-chrome version is 83.0.4103
[WDM] - Get LATEST driver version for 83.0.4103
[WDM] - Driver [C:\Users\isszq\.wdm\drivers\chromedriver\win32\83.0.4103.39\chromedriver.exe] found in cache


 


Use the `browser` object to open a webpage. 

In [4]:
symbol = 'D05'
url = f'https://www.sgx.com/securities/equities/{symbol}'
browser.get(url)
# Wait for 10 seconds before timeout
wait = WebDriverWait(browser, 10)
# Wait until an element is present
wait.until(EC.presence_of_element_located((By.TAG_NAME, 'widget-company-announcements')))
# Receive cookies and HTML
cookies = browser.get_cookies()
html = browser.page_source
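
The explicit wait in the cell above keeps checking a condition until it holds or the timeout expires. The polling idea can be sketched in plain Python as follows. This is an illustration only, not Selenium's actual implementation; `wait_until`, `poll_interval`, and `ready_on_third_call` are hypothetical names introduced here.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Hypothetical helper for illustration. Raises TimeoutError if the
    condition is still falsy once `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll_interval)


# Example condition that only becomes truthy on the third poll
attempts = []


def ready_on_third_call():
    attempts.append(1)
    return 'ready' if len(attempts) >= 3 else None


outcome = wait_until(ready_on_third_call, timeout=2.0, poll_interval=0.01)
print(outcome, len(attempts))
```

With Selenium, the condition would be something like "the `widget-company-announcements` element exists", and the truthy return value would be the element itself.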

In [5]:
browser.close()

To find multiple elements with Selenium itself, call `browser.find_elements()` with one of the following `By` locator strategies. The older `find_elements_by_*()` helper methods were deprecated in Selenium 4 and later removed.
* `By.ID`
* `By.NAME`
* `By.XPATH`
* `By.LINK_TEXT`
* `By.PARTIAL_LINK_TEXT`
* `By.TAG_NAME`
* `By.CLASS_NAME`
* `By.CSS_SELECTOR`

### Examine HTML Code and Make Soup

Save the HTML code to a file and examine it to make sure it contains the data you are interested in.

In [6]:
# Write the page source as UTF-8 text (not as the repr of a bytes object)
with open('_temp.html', 'w', encoding='utf-8') as f:
    f.write(html)

Let's "make a soup" from the downloaded HTML code.

In [7]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

## 2. Example: Extract Company Announcements

In Chrome, inspect the Company Announcements element. It uses a custom tag, `widget-company-announcements`.

```html
<widget-company-announcements class="website-template-widget print-format-d-none" data-analytics-category="Company Announcements">
      ...
</widget-company-announcements>
```

Use `soup.find()` to find the above element by its tag name.

In [8]:
html_announcements = soup.find('widget-company-announcements')
# print(html_announcements)

Use `soup.find_all()` to find all items.

In [9]:
items = html_announcements.find_all('div', {'class':'article-list-result'})
print(len(items), '\n')
# print(items[0], '\n')
# print(items[-1], '\n')

5 



### Experiment with One Announcement

Let's use the first item to try out our code.

In [10]:
item = items[0]

Get timestamp from the item.

In [11]:
t = item.find('div', {'class':'text-caption timestamp'})
t.text

'08 Jun 2020 05:03:40 PM'
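
The timestamp is returned as plain text. If you want to sort or filter announcements by date, it can be converted to a `datetime` object. The sketch below assumes every timestamp follows the `DD Mon YYYY HH:MM:SS AM/PM` layout seen above; `SGX_TIMESTAMP_FORMAT` and `parse_sgx_timestamp` are hypothetical names, and the result is a naive `datetime` because the page text carries no time zone.

```python
from datetime import datetime

# Assumed layout, inferred from the sample '08 Jun 2020 05:03:40 PM'.
# Note: %b and %p depend on the locale; this assumes English month names.
SGX_TIMESTAMP_FORMAT = '%d %b %Y %I:%M:%S %p'


def parse_sgx_timestamp(text):
    """Convert a timestamp string from the page into a naive datetime."""
    return datetime.strptime(text.strip(), SGX_TIMESTAMP_FORMAT)


parsed = parse_sgx_timestamp('08 Jun 2020 05:03:40 PM')
print(parsed.isoformat())
```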

Get news title and its link.

In [12]:
t = item.find('a')
print(t.text)
print(t.attrs['href'])

General Announcement::Minutes of the Adjourned 21st Annual General Meeting of DBS Group Holdings Ltd 
https://links.sgx.com/1.0.0/corporate-announcements/9N8PB8C1R6V1KICS/6577ac9f479eb9ca8643102d03c0faea17758cccaa5f47ea5ad5ef8f604ee7fa
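
In the output above, the link text combines a category and a headline, separated by `::`. If you need the two parts separately, a small helper can split them. This is a sketch that assumes the `::` separator is used consistently; `split_announcement_title` is a hypothetical name.

```python
def split_announcement_title(raw_title):
    """Split 'Category::Headline' into (category, headline).

    Returns (None, title) when the '::' separator is absent.
    """
    category, separator, headline = raw_title.partition('::')
    if not separator:
        return None, raw_title.strip()
    return category.strip(), headline.strip()


category, headline = split_announcement_title(
    'General Announcement::Minutes of the Adjourned 21st Annual General '
    'Meeting of DBS Group Holdings Ltd '
)
print(category)
print(headline)
```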


Get the tag of the announcement.

In [13]:
t = item.find('span', {'class':'website-tag'})
t.text

'General Announcement'

### Package into Function

Package the above code into a function.

In [14]:
def extract_an_announcement(item):
    try:
        # Get timestamp
        t = item.find('div', {'class':'text-caption timestamp'})
        timestamp = t.text
        # Get article title and link
        t = item.find('a')
        title = t.text
        link = t.attrs['href']
        # Get tag
        t = item.find('span', {'class':'website-tag'})
        tag = t.text
        return title, link, tag, timestamp 
    except Exception as e:
        print(repr(e))
        return None

Test the function.

In [15]:
extract_an_announcement(item)

('General Announcement::Minutes of the Adjourned 21st Annual General Meeting of DBS Group Holdings Ltd ',
 'https://links.sgx.com/1.0.0/corporate-announcements/9N8PB8C1R6V1KICS/6577ac9f479eb9ca8643102d03c0faea17758cccaa5f47ea5ad5ef8f604ee7fa',
 'General Announcement',
 '08 Jun 2020 05:03:40 PM')

### Process All Announcements

In [16]:
result = []
for item in items:
    t = extract_an_announcement(item)
    if t:
        result.append(t)

In [17]:
print(result)

[('General Announcement::Minutes of the Adjourned 21st Annual General Meeting of DBS Group Holdings Ltd ', 'https://links.sgx.com/1.0.0/corporate-announcements/9N8PB8C1R6V1KICS/6577ac9f479eb9ca8643102d03c0faea17758cccaa5f47ea5ad5ef8f604ee7fa', 'General Announcement', '08 Jun 2020 05:03:40 PM'), ('General Announcement::ISSUE OF SORA-REFERENCED NOTES BY DBS BANK LTD.', 'https://links.sgx.com/1.0.0/corporate-announcements/NIJR5H55AGHM3EV2/74f55b09474d81f56e4b0a5cc2aa35e98c7a36691e8f8fa59f3a729d8b27e05e', 'General Announcement', '06 May 2020 05:07:16 PM'), ("Disclosure of Interest/ Changes in Interest of Director/ Chief Executive Officer::Disclosure of director's interest", 'https://links.sgx.com/1.0.0/corporate-announcements/EEXK0Y2V9KARQT25/7e2de8d5208c1f192165f01ed4494f940b8b50f3da82a191d88364b11335e345', 'Disclosure of Interest/ Changes in Interest', '05 May 2020 06:15:42 PM'), ("Disclosure of Interest/ Changes in Interest of Director/ Chief Executive Officer::Disclosure of director's 
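
A list of tuples like the one printed above is convenient to save as CSV for later analysis. The following sketch writes such rows to CSV text with the standard library; `announcements_to_csv`, `FIELDS`, and `sample_rows` are hypothetical names, and the sample row is copied from the output above.

```python
import csv
import io

# Column order matches the tuple returned by extract_an_announcement()
FIELDS = ['title', 'link', 'tag', 'timestamp']


def announcements_to_csv(rows):
    """Return CSV text with a header line followed by one line per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(FIELDS)
    writer.writerows(rows)
    return buffer.getvalue()


sample_rows = [
    ('General Announcement::ISSUE OF SORA-REFERENCED NOTES BY DBS BANK LTD.',
     'https://links.sgx.com/1.0.0/corporate-announcements/NIJR5H55AGHM3EV2/74f55b09474d81f56e4b0a5cc2aa35e98c7a36691e8f8fa59f3a729d8b27e05e',
     'General Announcement',
     '06 May 2020 05:07:16 PM'),
]
csv_text = announcements_to_csv(sample_rows)
print(csv_text)
```

To write to disk instead, the same writer can be pointed at a file opened with `newline=''` and `encoding='utf-8'`.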

The web browser was already closed in cell `In [5]`, right after the HTML was captured, so no further cleanup is needed here.

## 3. Exercise: Extract Company News

### Download HTML

Open web browser and go to website.

In [18]:
from selenium.webdriver.chrome.service import Service
browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

symbol = 'D05'
url = f'https://www.sgx.com/securities/equities/{symbol}'
browser.get(url)

[WDM] - Current google-chrome version is 83.0.4103
[WDM] - Get LATEST driver version for 83.0.4103
[WDM] - Driver [C:\Users\isszq\.wdm\drivers\chromedriver\win32\83.0.4103.39\chromedriver.exe] found in cache


 


Wait until the page has loaded, then retrieve the HTML code. 

In [19]:
# Wait for 10 seconds before timeout
wait = WebDriverWait(browser, 10)
# Wait until an element is present
wait.until(EC.presence_of_element_located((By.TAG_NAME, 'widget-stocks-news')))
# Receive cookies and HTML
cookies = browser.get_cookies()
html = browser.page_source

Close web browser.

In [30]:
browser.close()

### Make Soup and Find Items

Make a soup from HTML code.

In [20]:
soup = BeautifulSoup(html, 'html.parser')

Inspect the News elements in Chrome. The news list is contained in a tag `<widget-stocks-news>`.

Use `soup.find()` to find the relevant HTML code using tag name.

In [21]:
html_news = soup.find('widget-stocks-news')
# print(html_news)

Each news item is enclosed in a `div` with class name `article-list-result`.

Use the `find_all()` method to extract all news items.

In [22]:
items = html_news.find_all('div', {'class':'article-list-result'})
print(len(items), '\n')
print(items[0])

10 

<div class="article-list-result">
<div class="text-caption timestamp">30 Apr 2020 03:57:37 AM</div>
<div class="widget-stocks-news-item-headline"><span class="text-strong website-link">Dbs CEO Says Majority Of Bank's Loans In Oil And Gas Sector Are To Oil Majors And State-Owned Companies</span></div>
<div class="widget-stocks-news-item-tags-container"><span class="website-tag">Other Pre-Announcement</span></div>
</div>


### Extract One News

Use the first news item to try out the code that extracts data from a single item.

In [23]:
item = items[0]

Extract `timestamp` of the news.

In [24]:
t = item.find('div', {'class':'text-caption timestamp'})
timestamp = t.text
print(timestamp)

30 Apr 2020 03:57:37 AM


Extract `title` of the news. 

In [25]:
t = item.find('div', {'class':'widget-stocks-news-item-headline'})
title = t.span.text
print(title)

Dbs CEO Says Majority Of Bank's Loans In Oil And Gas Sector Are To Oil Majors And State-Owned Companies


Extract the tags of the news item.

In [26]:
t = item.find('div', {'class':'widget-stocks-news-item-tags-container'})
_tags = t.find_all('span', {'class':'website-tag'})
tags = [_tag.text for _tag in _tags]
print(tags)

['Other Pre-Announcement']


### Create Function

Package the above code into a function.

In [27]:
def extract_a_news(item):
    try:
        # Get timestamp
        t = item.find('div', {'class':'text-caption timestamp'})
        timestamp = t.text
        # Get article title
        t = item.find('div', {'class':'widget-stocks-news-item-headline'})
        title = t.span.text
        # Get tag
        t = item.find('div', {'class':'widget-stocks-news-item-tags-container'})
        _tags = t.find_all('span', {'class':'website-tag'})
        tags = [_tag.text for _tag in _tags]
        
        return title, tags, timestamp
    except Exception as e:
        print(repr(e))
        print(item)
        print(t)
        return None

### Extract All News

Use function to extract all news items.

In [28]:
result = []
for item in items:
    t = extract_a_news(item)
    if t:
        result.append(t)

In [29]:
print(result)

[("Dbs CEO Says Majority Of Bank's Loans In Oil And Gas Sector Are To Oil Majors And State-Owned Companies", ['Other Pre-Announcement'], '30 Apr 2020 03:57:37 AM'), ('Dbs Group Holdings Says FY Profit Before Allowances To Be Around 2019 Levels', ['Other Pre-Announcement', 'Labor Issues'], '29 Apr 2020 10:56:54 PM'), ('Dbs Group Holdings Qtrly Net Profit Declines', ['Earnings Announcements'], '29 Apr 2020 10:56:53 PM'), ('DBS Group Says $1 Bln 3.3% Perpetual Capital Securities First Callable In 2025 Will Be Listed And Quoted In Bond Market On 28 Feb', ['Debt Financing / Related', 'Exchange Changes'], '27 Feb 2020 12:06:04 PM'), ('DBS Group Says Successfully Priced Issue Of $1 Bln 3.30 Pct Perpetual Capital Securities', ['Debt Financing / Related', 'Other Pre-Announcement'], '21 Feb 2020 01:24:04 AM'), ('DBS CEO Says Current Market Conditions Offer Scope For Clients To Do Fixed Income And Debt Capital Market deals', ['Other Pre-Announcement'], '13 Feb 2020 03:17:25 AM'), ('DBS Group Hold
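
Once the news items are in a list, simple summaries become easy. As one possible follow-up, the sketch below counts how often each tag appears. `count_tags` and `sample_news` are hypothetical names; the sample rows are the first three items from the output above and follow the `(title, tags, timestamp)` layout returned by `extract_a_news()`.

```python
from collections import Counter


def count_tags(news_rows):
    """Count tag occurrences across (title, tags, timestamp) tuples."""
    counter = Counter()
    for _title, tags, _timestamp in news_rows:
        counter.update(tags)
    return counter


sample_news = [
    ("Dbs CEO Says Majority Of Bank's Loans In Oil And Gas Sector Are To Oil Majors And State-Owned Companies",
     ['Other Pre-Announcement'], '30 Apr 2020 03:57:37 AM'),
    ('Dbs Group Holdings Says FY Profit Before Allowances To Be Around 2019 Levels',
     ['Other Pre-Announcement', 'Labor Issues'], '29 Apr 2020 10:56:54 PM'),
    ('Dbs Group Holdings Qtrly Net Profit Declines',
     ['Earnings Announcements'], '29 Apr 2020 10:56:53 PM'),
]
tag_counts = count_tags(sample_news)
print(tag_counts.most_common(1))
```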

## 4. Example: Extract Data after Interaction

**Task:** Extract details of all News items of a company from SGX Website. 

For each news item on an SGX equity page, e.g. https://www.sgx.com/securities/equities/D05, the user has to click the item to view more details in a pop-up window. We have to simulate a click on the item before we can extract the data.

Open the web browser and go to the website `https://www.sgx.com/securities/equities/D05`.

In [38]:
from selenium.webdriver.chrome.service import Service
browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

symbol = 'D05'
url = f'https://www.sgx.com/securities/equities/{symbol}'
browser.get(url)

# Wait for 10 seconds before timeout
wait = WebDriverWait(browser, 10)
# Wait until an element is present
wait.until(EC.presence_of_element_located((By.TAG_NAME, 'widget-stocks-news')))
wait.until(EC.element_to_be_clickable((By.XPATH, '//widget-stocks-news/div/div[@class="article-list-result"]/div/span')))

[WDM] - Current google-chrome version is 83.0.4103
[WDM] - Get LATEST driver version for 83.0.4103
[WDM] - Driver [C:\Users\isszq\.wdm\drivers\chromedriver\win32\83.0.4103.39\chromedriver.exe] found in cache


 


<selenium.webdriver.remote.webelement.WebElement (session="4b107e3ab04ce15705c929f8468350e6", element="9d8acda1-427a-4872-b70d-79ec8663cdcb")>

### Close Cookie Banner

Find the `Accept` button and click it to accept the cookie agreement. This closes the banner so that it won't block later clicks.

In [39]:
browser.find_element(By.CLASS_NAME, 'sgx-consent-banner-acceptance-button').click()

Extract elements from the browser using XPath.

In [40]:
news_items = browser.find_elements(By.XPATH, '//widget-stocks-news/div/div[@class="article-list-result"]')
len(news_items)

10

### Experiment with One News

Experiment with code to extract one news item from pop-up.

In [41]:
news_one = news_items[1]
news_one.text

'29 Apr 2020 10:56:54 PM\nDbs Group Holdings Says FY Profit Before Allowances To Be Around 2019 Levels\nOther Pre-AnnouncementLabor Issues'
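
The `text` of a list item packs the timestamp, headline, and tags into one newline-separated string. The sketch below splits that string to show what it contains; `split_item_text` and `sample_text` are hypothetical names, and the sample is the output shown above. Notice that the two tags run together with no separator, which is one reason to read the individual `span` elements instead of relying on `text`.

```python
def split_item_text(text):
    """Split a list item's text into (timestamp, headline, tags_text)."""
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError('expected at least a timestamp line and a headline line')
    # Anything after the headline is treated as the (unseparated) tag text
    return lines[0], lines[1], ' '.join(lines[2:])


sample_text = ('29 Apr 2020 10:56:54 PM\n'
               'Dbs Group Holdings Says FY Profit Before Allowances To Be Around 2019 Levels\n'
               'Other Pre-AnnouncementLabor Issues')
timestamp_text, headline_text, tags_text = split_item_text(sample_text)
print(timestamp_text)
print(headline_text)
print(tags_text)
```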

In [42]:
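# Press Escape first to close any dialog that is still open
browser.find_element(By.TAG_NAME, 'body').send_keys(Keys.ESCAPE)
# Click the headline span to open the pop-up for this news item
span = news_one.find_element(By.TAG_NAME, 'span')
print(span.text)
span.click()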

Dbs Group Holdings Says FY Profit Before Allowances To Be Around 2019 Levels


In [43]:
import time
# Give the pop-up dialog a moment to render
time.sleep(.5)
target = browser.find_element(By.XPATH, '//sgx-dialog[@id="widget-stocks-news-dialog"]')
_header = target.find_element(By.XPATH, './/div/header/h2')
_timestamp = target.find_element(By.XPATH, './/div/div/div[@class="text-caption timestamp"]')
_tags = target.find_element(By.XPATH, './/div/div/div[@class="widget-stocks-news-item-tags-container"]')
_body = target.find_element(By.XPATH, './/div/div/div[not(@class)]')

header = _header.text
timestamp = _timestamp.text
tags = [t.text for t in _tags.find_elements(By.TAG_NAME, 'span')]
body = _body.text

print(header)
print(timestamp)
print(tags)
print(body)

Dbs Group Holdings Says FY Profit Before Allowances To Be Around 2019 Levels
29 Apr 2020 10:56:54 PM
['Other Pre-Announcement', 'Labor Issues']
April 30 (Reuters) - DBS Group Holdings Ltd <DBSM.SI>::FY PROFIT BEFORE ALLOWANCES TO BE AROUND 2019 LEVELS AFTER FACTORING IN DECLINES FOR REST OF YEAR.DBS GROUP SEES NO RETRENCHMENTS OR PAY CUTS, BUT HIRING JUDICIOUSLY IN FY; BONUSES ALIGNED TO EARNINGS.DBS GROUP SEES CREDIT COSTS TO RISE TO BETWEEN S$3 BILLION-S$5 BILLION CUMULATIVELY OVER TWO YEARS.EARNINGS CURRENTLY EXPECTED TO BE SUFFICIENT FOR MAINTAINING QTRLY DPS AT 33 CENTS.


### Package into Function

In [44]:
def extract_one_news(browser, news_one):
    import time

    # Close any dialog that is still open, then click the headline
    browser.find_element(By.TAG_NAME, 'body').send_keys(Keys.ESCAPE)
    span = news_one.find_element(By.TAG_NAME, 'span')
    span.click()

    # Give the pop-up dialog a moment to render
    time.sleep(.5)
    target = browser.find_element(By.XPATH, '//sgx-dialog[@id="widget-stocks-news-dialog"]')
    _header = target.find_element(By.XPATH, './/div/header/h2')
    _timestamp = target.find_element(By.XPATH, './/div/div/div[@class="text-caption timestamp"]')
    _tags = target.find_element(By.XPATH, './/div/div/div[@class="widget-stocks-news-item-tags-container"]')
    _body = target.find_element(By.XPATH, './/div/div/div[not(@class)]')

    header = _header.text
    timestamp = _timestamp.text
    tags = [t.text for t in _tags.find_elements(By.TAG_NAME, 'span')]
    body = _body.text

    return (header, timestamp, tags, body)

Use the function to extract all news items.

In [45]:
result = []
for item in news_items:
    news = extract_one_news(browser, item)
    result.append(news)

In [46]:
result

[("Dbs CEO Says Majority Of Bank's Loans In Oil And Gas Sector Are To Oil Majors And State-Owned Companies",
  '30 Apr 2020 03:57:37 AM',
  ['Other Pre-Announcement'],
  "April 30 (Reuters) - DBS Group Holdings Ltd <DBSM.SI>::CEO SAYS BANK IS NOT CUTTING BACK EXPOSURE TO OIL AND GAS SECTOR, GETTING MORE DISCIPLINED ON DOCUMENTATION AROUND TRADE FINANCE.CEO SAYS MAJORITY OF BANK'S LOANS IN OIL AND GAS SECTOR ARE TO OIL MAJORS AND STATE-OWNED COMPANIES."),
 ('Dbs Group Holdings Says FY Profit Before Allowances To Be Around 2019 Levels',
  '29 Apr 2020 10:56:54 PM',
  ['Other Pre-Announcement', 'Labor Issues'],
  'April 30 (Reuters) - DBS Group Holdings Ltd <DBSM.SI>::FY PROFIT BEFORE ALLOWANCES TO BE AROUND 2019 LEVELS AFTER FACTORING IN DECLINES FOR REST OF YEAR.DBS GROUP SEES NO RETRENCHMENTS OR PAY CUTS, BUT HIRING JUDICIOUSLY IN FY; BONUSES ALIGNED TO EARNINGS.DBS GROUP SEES CREDIT COSTS TO RISE TO BETWEEN S$3 BILLION-S$5 BILLION CUMULATIVELY OVER TWO YEARS.EARNINGS CURRENTLY EXPECTE
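
Each row now holds a list of tags and a long body, so JSON may be a more natural storage format than CSV. The sketch below converts rows shaped like the output above into dictionaries and serializes them. `news_to_records` and `sample_details` are hypothetical names; the sample row is the first item from the output above.

```python
import json


def news_to_records(rows):
    """Turn (header, timestamp, tags, body) tuples into dictionaries."""
    return [
        {'header': header, 'timestamp': timestamp, 'tags': tags, 'body': body}
        for header, timestamp, tags, body in rows
    ]


sample_details = [
    ("Dbs CEO Says Majority Of Bank's Loans In Oil And Gas Sector Are To Oil Majors And State-Owned Companies",
     '30 Apr 2020 03:57:37 AM',
     ['Other Pre-Announcement'],
     "April 30 (Reuters) - DBS Group Holdings Ltd <DBSM.SI>::CEO SAYS BANK IS NOT CUTTING BACK EXPOSURE TO OIL AND GAS SECTOR, GETTING MORE DISCIPLINED ON DOCUMENTATION AROUND TRADE FINANCE.CEO SAYS MAJORITY OF BANK'S LOANS IN OIL AND GAS SECTOR ARE TO OIL MAJORS AND STATE-OWNED COMPANIES."),
]
json_text = json.dumps(news_to_records(sample_details), indent=2)
print(json_text[:80])
```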

Clean up by closing the web browser.

In [47]:
browser.close()