# A QUICK GUIDE TO WEB SCRAPING, FROM PARSING STATIC WEB PAGES TO ACCESSING BROWSER REQUESTS
> Navigating Web pages, Parsing HTML elements and intercepting XHR objects.

- toc: true 
- badges: true
- comments: true
- categories: [jupyter, Selenium, Request-html, Web Scraping]
- image: images/chart-preview.png

# INTRODUCTION

Apart from building models, one of a data scientist's tasks is sourcing data. Web scraping is one way to obtain relevant data that can improve a model's performance. This blog post shows how to parse HTML elements to retrieve information of interest from static and dynamic websites using the Selenium and requests-html Python libraries. It also explains how to extract data from XHR objects captured from the browser's network traffic.

## Installation of Required Dependencies

In [1]:
!pip install requests-html &> /dev/null
!pip install selenium &> /dev/null
!pip install selenium-wire &> /dev/null

In [2]:
!apt-get update &> /dev/null # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver &> /dev/null
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

cp: '/usr/lib/chromium-browser/chromedriver' and '/usr/bin/chromedriver' are the same file


In [3]:
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

In [4]:
from seleniumwire import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import json
from requests_html import HTML, HTMLSession
import time
import re
import pandas as pd
import requests
from datetime import datetime



# PARSING STATIC WEBSITE

A simple GET request to a website returns an HTML document, which can be wrapped in an `HTML` object for parsing.

In [5]:
#make a simple get request
req = requests.get('https://ng.investing.com/equities/nigeria')
print(f'Content of a typical get request to a webpage: {req.headers["Content-Type"]}')

#convert to HTML Object for easy Parsing
html = HTML(html = req.content)
print('\nHTML OBJECT: ', html)

Content of a typical get request to a webpage: text/html; charset=UTF-8

HTML OBJECT:  <HTML url='https://example.org/'>


Let's extract the data in the table on this [website](https://ng.investing.com/equities/nigeria).

This task requires a basic understanding of HTML elements. Use Chrome DevTools to inspect the web page: place your cursor on the body of the table, right-click, and choose Inspect from the menu. When DevTools opens, locate the HTML tag `<table id="cross_rate_markets_stocks_1">...</table>`.

See the accompanying image.


![](https://drive.google.com/uc?export=view&id=1U9DNn_NCBY-NEduDO3rvPvky2SMzcEwx)

Our data of interest is within the table element, whose child elements include `<thead>` and `<tbody>`. We will use the `html.find()` method to select the HTML tags bearing the information we need. This method takes a [CSS selector](https://www.w3schools.com/cssref/css_selectors.asp) as its argument, such as a tag name or an `#id`. Elements can also be located by [XPath](https://docs.microsoft.com/en-us/previous-versions/dotnet/netframework-4.0/ms256086(v=vs.100)?redirectedfrom=MSDN) using the `html.xpath()` method.

In [15]:
#Extract <table>.....</table> element
table = html.find('#cross_rate_markets_stocks_1', first = True)

#Extract header and body elements from table
#first = True: ensures an element is returned instead of a list of the element 
header = table.find('thead', first = True)
body = table.find('tbody', first = True)
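
The cell above locates elements with CSS selectors. For comparison, here is a rough offline sketch of the XPath route. It does not use requests-html: it relies on the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup, and it runs against a made-up snippet rather than the real page. With requests-html, the equivalent call would be along the lines of `html.xpath('//table[@id="cross_rate_markets_stocks_1"]', first=True)`.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for the real page (invented for illustration).
SNIPPET = """
<html><body>
  <table id="cross_rate_markets_stocks_1">
    <thead><tr><th></th><th>Name</th><th>Last</th><th></th></tr></thead>
    <tbody>
      <tr id="pair_1"><td></td><td><a href="/equities/demo">Demo PLC</a></td><td>6.95</td><td></td></tr>
    </tbody>
  </table>
</body></html>
"""

root = ET.fromstring(SNIPPET)

# XPath: any <table> below the root whose id attribute matches.
table = root.find(".//table[@id='cross_rate_markets_stocks_1']")

# Relative XPath from the table: header cells and body rows.
headers = [th.text or "" for th in table.findall("./thead/tr/th")]
rows = table.findall("./tbody/tr")

print(headers)            # ['', 'Name', 'Last', '']
print(rows[0].get("id"))  # pair_1
```

CSS selectors are usually shorter for lookups by id or class, while XPath is handy when you need to navigate relative to an element you already hold.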

In [16]:
#get all the columns "th" in the header
column_names = header.find('tr th')
column_names = [col.text for col in column_names]
print(column_names)

['', 'Name', 'Last', 'High', 'Low', 'Chg.', 'Chg. %', 'Vol.', 'Time', '']


In [17]:
#remove first and last element in "column_names" list
column_names.pop(0)
column_names.pop()

''

In [18]:
#extract all the rows in the body
rows = body.find('tr')
rows[:3]

[<Element 'tr' id='pair_101668'>,
 <Element 'tr' id='pair_101672'>,
 <Element 'tr' id='pair_101674'>]

To visualize the structure of one of the rows, let's import BeautifulSoup and use its `prettify()` method.

In [19]:
!pip install beautifulsoup4 &> /dev/null

In [20]:
from bs4 import BeautifulSoup

In [21]:
#collapse-output
print(BeautifulSoup(rows[0].html, 'html.parser').prettify())

<tr id="pair_101668">
 <td class="flag">
  <span class="ceFlags Nigeria" title="Nigeria">
  </span>
 </td>
 <td class="bold left noWrap elp plusIconTd">
  <a href="/equities/custodying" title="Custodian and Allied PLC">
   Custodian Allied
  </a>
  <span class="alertBellGrayPlus js-plus-icon genToolTip oneliner" data-id="101668" data-name="Custodian and Allied PLC" data-tooltip="Create Alert" data-volume="2,162,393">
  </span>
 </td>
 <td class="pid-101668-last">
  6.95
 </td>
 <td class="pid-101668-high">
  6.95
 </td>
 <td class="pid-101668-low">
  6.95
 </td>
 <td class="bold redFont pid-101668-pc">
  0.00
 </td>
 <td class="bold redFont pid-101668-pcp">
  0.00%
 </td>
 <td class=" pid-101668-turnover">
  0
 </td>
 <td class=" pid-101668-time" data-value="1651238340">
  29/04
 </td>
 <td class="icon">
  <span class="redClockIcon isOpenExch-96">
  </span>
 </td>
</tr>


We are interested in the text within each `<td>` element except the first and the last. However, the second `<td>` contains an anchor tag `<a>` that bears the name of the equity, and the penultimate `<td>` stores the Unix timestamp in its `data-value` attribute. Let's write a function that extracts the required text while taking these two special cases into account.

In [22]:
def extract_text(row):
  td = row.find('td')[1:-1] #all td elements except the first and the last
  data = list()
  for key, value in enumerate(td):
    if key == 0:
      #the name of the equity is the text of the anchor tag
      data.append(value.find('a', first = True).text)
    elif key == 7:
      #the unix timestamp is stored in the data-value attribute
      data.append(value.attrs['data-value'])
    else:
      data.append(value.text)
  return data

Now, we extract all the required data from the body of the table by mapping the function `extract_text` defined above over the *iterable* `rows`. We then convert the result to a pandas DataFrame.

In [23]:
#collapse-output
data = list(map(extract_text, rows))

data[: 3]

[['Custodian Allied',
  '6.95',
  '6.95',
  '6.95',
  '0.00',
  '0.00%',
  '0',
  '1651238340'],
 ['Dangote Cement',
  '285.00',
  '292.40',
  '292.40',
  '+0.00',
  '+0.00%',
  '0',
  '1651239000'],
 ['Dangote Sugar',
  '16.20',
  '16.50',
  '16.05',
  '0.00',
  '0.00%',
  '0',
  '1651239000']]

In [24]:
df = pd.DataFrame(data = data, columns = column_names)

df.set_index('Name', inplace = True)

df.head()

| Name | Last | High | Low | Chg. | Chg. % | Vol. | Time |
|---|---|---|---|---|---|---|---|
| Custodian Allied | 6.95 | 6.95 | 6.95 | 0.0 | 0.00% | 0 | 1651238340 |
| Dangote Cement | 285.0 | 292.4 | 292.4 | 0.0 | +0.00% | 0 | 1651239000 |
| Dangote Sugar | 16.2 | 16.5 | 16.05 | 0.0 | 0.00% | 0 | 1651239000 |
| ETI | 12.0 | 12.0 | 12.0 | 0.0 | 0.00% | 0 | 1651235820 |
| FBN Holdings | 11.95 | 12.0 | 11.85 | 0.0 | +0.00% | 0 | 1651239000 |


In [25]:
#let's convert the Time to a user friendly format

df['Time'] = df['Time'].apply(lambda x: int(x))

df['Time'] = df['Time'].apply(lambda x: pd.to_datetime(datetime.fromtimestamp(x).date()))

In [26]:
df.head()

| Name | Last | High | Low | Chg. | Chg. % | Vol. | Time |
|---|---|---|---|---|---|---|---|
| Custodian Allied | 6.95 | 6.95 | 6.95 | 0.0 | 0.00% | 0 | 2022-04-29 |
| Dangote Cement | 285.0 | 292.4 | 292.4 | 0.0 | +0.00% | 0 | 2022-04-29 |
| Dangote Sugar | 16.2 | 16.5 | 16.05 | 0.0 | 0.00% | 0 | 2022-04-29 |
| ETI | 12.0 | 12.0 | 12.0 | 0.0 | 0.00% | 0 | 2022-04-29 |
| FBN Holdings | 11.95 | 12.0 | 11.85 | 0.0 | +0.00% | 0 | 2022-04-29 |
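
One detail of the conversion above is worth noting: `datetime.fromtimestamp(x)` without a time zone interprets the timestamp in the local time zone of the machine running the notebook, so a timestamp close to midnight can land on different dates on different machines. The sketch below pins the conversion to UTC. Whether UTC is the right zone for this data is an assumption, and `unix_to_utc_date` is a hypothetical helper name, not part of the original notebook.

```python
from datetime import datetime, timezone


def unix_to_utc_date(ts):
    """Convert a Unix timestamp in seconds (str or int) to a UTC calendar date."""
    return datetime.fromtimestamp(int(ts), tz=timezone.utc).date()


# Values taken from the "Time" column of the scraped tables.
print(unix_to_utc_date("1651238340"))  # 2022-04-29
print(unix_to_utc_date(1580802308))    # 2020-02-04
```

The helper accepts the raw string from the `data-value` attribute, so the separate `int(x)` step is folded into it.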


# PARSING A DYNAMIC WEBSITE

[ng.investing.com](https://ng.investing.com/equities/nigeria) is a dynamically rendered site. This means some data is not present on the initial page load and only appears after the user interacts with certain elements on the web page. The content of the site is controlled by JavaScript on the client side. Our initial `requests.get()` call only gives us the HTML content of the initial page load, whereas we might be interested in information that is accessible only after certain clicks on the web page.<br/>
While the earlier packages (requests, requests-html, BeautifulSoup), as used above, are quite effective at retrieving information from static web pages, they fall short on this kind of website. To scrape data from a dynamic web page, we need a tool that can interact with HTML elements just like a regular user would. Selenium is a browser automation tool, with Python bindings, that is widely used for automated website testing. It allows Quality Assurance professionals to simulate user behaviour to evaluate the performance of a website and to identify any inefficiencies. Selenium's capabilities can also be used to extract data that only appears after JavaScript has run.

The default table on the landing page of [ng.investing.com](https://ng.investing.com/equities/nigeria) is the NSE 30, which contains the equity prices of the top 30 companies listed on the Nigerian Stock Exchange. Let's say we are interested in all the stocks listed on the Exchange. To have the web page display this information, we need to open the dropdown and select the right option. To perform this task, we need Selenium.<br>
<br>
Check out this [page](https://stackoverflow.com/questions/51046454/how-can-we-use-selenium-webdriver-in-colab-research-google-com) on how to use the Selenium webdriver on Colab (Chrome instance) or this [notebook](https://colab.research.google.com/github/restrepo/ComputationalMethods/blob/master/tools/selenium.ipynb) for Firefox (geckodriver). This notebook, however, already contains all you need to replicate this in your own project. To run Selenium in a local Jupyter notebook, run `pip install selenium` or, better still, `pip install selenium-wire` for the extra bindings discussed later in this notebook. Download the ChromeDriver release that matches the version of your Chrome browser, and ensure the executable is in the same directory as your notebook.
<br>
<br>
Let's start by defining the parameters for the Chrome webdriver.
The first step is to instantiate the driver object. To load the web page of interest, the URL is passed as an argument to the `driver.get()` method. To load all the available stock prices on the web page, the driver's `find_element` method is called to locate the element `<option id="all">Nigeria all stocks</option>` by its `id`. After clicking on the element, the program sleeps for 3 seconds to allow the new data to load. The HTML of the page is then saved as `page` by reading the `driver.page_source` attribute.<br>

In [27]:
capabilities = DesiredCapabilities.CHROME
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging']) #reduce chromedriver console logging
options.add_argument('--headless') #run Chrome without a visible window
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

In [28]:
driver = webdriver.Chrome('chromedriver', options = options, desired_capabilities = capabilities)
driver.get('https://ng.investing.com/equities/nigeria')


all_stocks = driver.find_element(By.ID, value = 'all')
all_stocks.click()

time.sleep(3)

page = driver.page_source
driver.quit()
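
The fixed `time.sleep(3)` above either wastes time when the table loads quickly or is too short when the site is slow. A common alternative is to poll for a condition until it holds or a timeout expires. Selenium provides this pattern through `WebDriverWait(driver, timeout).until(...)`, which is imported at the top of this notebook. The sketch below shows the same idea in plain Python so that it runs without a browser. `wait_until` is a hypothetical helper, and the row-count condition in the trailing comment is an assumption about the page, not something verified here.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)


# Stand-in for a page whose table only fills up after a short delay.
start = time.monotonic()


def table_has_loaded():
    return time.monotonic() - start > 0.3


wait_until(table_has_loaded, timeout=5)
print("table ready")

# With a live driver, the predicate could be something like (assumed condition):
# wait_until(lambda: len(driver.find_elements(
#     By.CSS_SELECTOR, "#cross_rate_markets_stocks_1 tbody tr")) > 30)
```

Polling returns as soon as the condition holds, so the scraper waits only as long as the page actually needs.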

The next step is the same as before: extract all the relevant text from the `<tbody>` of the table using the `extract_text` function.

In [29]:
page_html = HTML(html = page)

In [30]:
all_stocks_table = page_html.find('#cross_rate_markets_stocks_1', first = True)

In [31]:
all_stock_rows = all_stocks_table.find('tbody tr')

In [32]:
all_stock_rows[:5]

[<Element 'tr' id='pair_101641'>,
 <Element 'tr' id='pair_101643'>,
 <Element 'tr' id='pair_101644'>,
 <Element 'tr' id='pair_101645'>,
 <Element 'tr' id='pair_101646'>]

In [33]:
nse_all_stocks = list(map(extract_text, all_stock_rows))

nse_all_stocks = pd.DataFrame(data = nse_all_stocks, columns = column_names)

In [34]:
nse_all_stocks.set_index('Name', inplace = True)
nse_all_stocks.head()

| Name | Last | High | Low | Chg. | Chg. % | Vol. | Time |
|---|---|---|---|---|---|---|---|
| Berger Paints | 7.75 | 7.75 | 7.75 | 0.0 | +0.00% | 0 | 1651234680 |
| Avoncrown | 1.18 | 1.18 | 1.18 | 0.0 | 0.00% | 0 | 1580802308 |
| Betaglas | 61.7 | 61.7 | 61.7 | 0.0 | +0.00% | 0 | 1651239000 |
| Aiico | 0.79 | 0.79 | 0.75 | 0.0 | +0.00% | 0 | 1651239000 |
| Asosavings | 0.5 | 0.5 | 0.5 | 0.0 | 0.00% | 0 | 1580802308 |


In [35]:
#let's convert the Time to a user friendly format

nse_all_stocks['Time'] = nse_all_stocks['Time'].apply(lambda x: int(x))

nse_all_stocks['Time'] = nse_all_stocks['Time'].apply(lambda x: pd.to_datetime(datetime.fromtimestamp(x).date()))

In [36]:
nse_all_stocks.head()

| Name | Last | High | Low | Chg. | Chg. % | Vol. | Time |
|---|---|---|---|---|---|---|---|
| Berger Paints | 7.75 | 7.75 | 7.75 | 0.0 | +0.00% | 0 | 2022-04-29 |
| Avoncrown | 1.18 | 1.18 | 1.18 | 0.0 | 0.00% | 0 | 2020-02-04 |
| Betaglas | 61.7 | 61.7 | 61.7 | 0.0 | +0.00% | 0 | 2022-04-29 |
| Aiico | 0.79 | 0.79 | 0.75 | 0.0 | +0.00% | 0 | 2022-04-29 |
| Asosavings | 0.5 | 0.5 | 0.5 | 0.0 | 0.00% | 0 | 2020-02-04 |


In [37]:
nse_all_stocks.shape

(161, 7)

# GETTING DATA FROM XHR OBJECT
So far we've successfully scraped data from a table on a web page. But sometimes the data we require is not easily accessible within the HTML elements. The daily equity prices of all listed companies are displayed in tabular form, but the historical price of each stock is displayed as an area chart.
<br>
<br>
![](https://drive.google.com/uc?export=view&id=1gZ2QSO4t5W36jZ139tg4JLAmrbc7S5m2)
<br>
<br>

On the left side of the image above is an area chart of MTN Nigeria's historical share price. Beside it, in the Network tab, are the requests made by the website. The chart in particular is rendered from data fetched by the highlighted XMLHttpRequest (XHR) object. XHR is an API in the browser's JavaScript environment whose methods are used to fetch data from the server. It is common on websites that implement an Ajax design.
<br>
<br>
To access the data behind the chart, we need to intercept the request made by the browser. Selenium Wire gives you access to the underlying requests made by the browser.

In [41]:
def extract_link(row):
  domain = 'https://ng.investing.com'
  path = row.find('td')[1].find('a', first = True).attrs['href']
  url = domain + path + '-chart'
  name = row.find('td')[1].find('a', first = True).attrs['title']
  return [name, url]

The function above constructs the link to the chart page for each company in a row. The key components are the domain and the path, which is taken from the `href` attribute of the anchor tag. To arrive at the full URL, these two components are concatenated and followed by `-chart`, a suffix common to the chart pages.
<br>
This function is then used to extract the names and chart links of all the companies in the table.
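
As a small aside, plain string concatenation works here because every `href` in the table is site-relative. A slightly more forgiving sketch uses `urllib.parse.urljoin` from the standard library, which also copes with an `href` that is already absolute. `chart_url` is a hypothetical helper, and the sample path is the one visible in the row printed earlier.

```python
from urllib.parse import urljoin

DOMAIN = "https://ng.investing.com"


def chart_url(href, suffix="-chart"):
    """Resolve href against the site's domain and append the chart suffix."""
    return urljoin(DOMAIN, href) + suffix


print(chart_url("/equities/custodying"))
# https://ng.investing.com/equities/custodying-chart
```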

In [42]:
chart_links = [extract_link(row) for row in all_stock_rows]

In [43]:
chart_link_df =  pd.DataFrame(data = chart_links, columns = ['Name', 'Link'])

In [44]:
chart_link_df.head()

|  | Name | Link |
|---|---|---|
| 0 | Berger Paints | https://ng.investing.com/equities/berger-paint... |
| 1 | Avoncrown | https://ng.investing.com/equities/avoncrown-chart |
| 2 | Betaglas | https://ng.investing.com/equities/betaglas-chart |
| 3 | Aiico | https://ng.investing.com/equities/aiico-chart |
| 4 | Asosavings | https://ng.investing.com/equities/asosavings-c... |


Let's demonstrate how to intercept a browser request using one of the links in the table above. The link is passed to the driver, which loads the web page. Reading the `driver.requests` attribute returns all the requests made by the browser, which can be saved in a variable. The request of interest (the XHR object) is then filtered by a unique substring of its URL, `'/history?symbol='` in our case.

In [46]:
driver = webdriver.Chrome('chromedriver', options = options, desired_capabilities = capabilities)
driver.get(chart_link_df.iloc[0, 1])

reqs = driver.requests
# xhr = driver.wait_for_request(r'/history\?symbol=', 60) # the pattern is treated as a regex, so "?" is escaped

driver.quit()

In [47]:
req = [req for req in reqs if '/history?symbol=' in req.url]
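
The filtering step above can be tried offline. The sketch below uses a minimal stand-in class instead of real captured requests and also shows how the query parameters of the matching URL can be inspected with the standard library. The `CapturedRequest` class, the example URLs and their parameters are all invented for illustration; only the `'/history?symbol='` marker comes from this notebook.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, urlsplit


@dataclass
class CapturedRequest:
    """Minimal stand-in for a captured request; only `url` is modelled."""
    url: str


# Hypothetical traffic: hosts, paths and parameters are made up.
captured = [
    CapturedRequest("https://example.com/static/app.js"),
    CapturedRequest("https://example.com/api/history?symbol=101641&resolution=D"),
    CapturedRequest("https://example.com/api/quotes?symbols=101641"),
]

MARKER = "/history?symbol="
matches = [r for r in captured if MARKER in r.url]

# Fail with a clear message instead of an IndexError on matches[0].
if not matches:
    raise LookupError(f"no captured request contains {MARKER!r}")

query = parse_qs(urlsplit(matches[0].url).query)
print(matches[0].url)
print(query)  # {'symbol': ['101641'], 'resolution': ['D']}
```

Checking that the list is non-empty before indexing makes a missing request easier to diagnose than a bare `IndexError`.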

In [48]:
header = req[0].headers
url = req[0].url

In [51]:
#collapse-output
print(header)

sec-ch-ua: 
accept: */*
content-type: text/plain
sec-ch-ua-mobile: ?0
user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/100.0.4896.127 Safari/537.36
sec-ch-ua-platform: 
origin: https://tvc-invdn-com.investing.com
sec-fetch-site: same-site
sec-fetch-mode: cors
sec-fetch-dest: empty
referer: https://tvc-invdn-com.investing.com/
accept-encoding: gzip, deflate, br
accept-language: en-US




In [52]:
res = requests.get(url, headers = header)

In [63]:
ticker = pd.DataFrame(data = res.json())

ticker.head()

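|  | t | c | o | h | l | v | vo | s |
|---|---|---|---|---|---|---|---|---|
| 0 | 1620604800 | 6.1 | 6.1 | 6.7 | 6.7 | 2448 | 0 | ok |
| 1 | 1620691200 | 6.1 | 6.1 | 6.7 | 6.7 | 3232 | 0 | ok |
| 2 | 1620950400 | 6.1 | 6.1 | 6.7 | 6.7 | 1428 | 0 | ok |
| 3 | 1621209600 | 6.1 | 6.1 | 6.7 | 6.7 | 1292 | 0 | ok |
| 4 | 1621296000 | 6.1 | 6.1 | 6.7 | 6.7 | 3853 | 0 | ok |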


In [64]:
#collapse-output
#we are only interested in the date and closing price of the stock

pd.set_option('display.max_rows', None)

ticker = ticker.loc[:, ['t', 'c']]

ticker['t'] = ticker['t'].apply(lambda x: pd.to_datetime(datetime.fromtimestamp(x).date()))

ticker.columns = ['Date', 'Closing Price']

ticker.set_index('Date', inplace = True)

ticker

| Date | Closing Price |
|---|---|
| 2021-05-10 | 6.1 |
| 2021-05-11 | 6.1 |
| 2021-05-14 | 6.1 |
| 2021-05-17 | 6.1 |
| 2021-05-18 | 6.1 |
| 2021-05-19 | 6.1 |
| 2021-05-20 | 6.1 |
| 2021-05-21 | 6.1 |
| 2021-05-24 | 6.1 |
| 2021-05-26 | 6.1 |


In [66]:
def get_historical_price(name, link):
  driver = webdriver.Chrome('chromedriver', options = options, desired_capabilities = capabilities)
  driver.get(link)
  reqs = driver.requests
  driver.quit()
  req = [req for req in reqs if '/history?symbol=' in req.url]
  header = req[0].headers
  url = req[0].url
  res = requests.get(url, headers = header)
  df = pd.DataFrame(data = res.json())
  df = df.loc[:, ['t', 'c']]
  df['t'] = df['t'].apply(lambda x: pd.to_datetime(datetime.fromtimestamp(x).date()))
  df.columns = ['Date', name]
  df.set_index('Date', inplace = True)
  return df
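
`get_historical_price` assumes every response has the same shape as the one inspected earlier. The sketch below is one possible way to validate a payload before building a DataFrame from it. It rests on assumptions drawn only from the table printed above: that `t` and `c` are parallel lists and that `s` is a status field equal to `"ok"` on success. The `closing_prices` helper and the `"no_data"` status in the usage note are hypothetical.

```python
from datetime import datetime, timezone


def closing_prices(payload):
    """Return [(date, close), ...] from a history payload, or raise ValueError."""
    status = payload.get("s")
    if status != "ok":  # assumed meaning of the `s` field
        raise ValueError(f"unexpected status: {status!r}")
    times, closes = payload.get("t", []), payload.get("c", [])
    if len(times) != len(closes):
        raise ValueError("'t' and 'c' have different lengths")
    return [
        (datetime.fromtimestamp(t, tz=timezone.utc).date(), c)
        for t, c in zip(times, closes)
    ]


# Sample built from the first two rows of the table shown earlier.
sample = {"s": "ok", "t": [1620604800, 1620691200], "c": [6.1, 6.1]}
print(closing_prices(sample))
# [(datetime.date(2021, 5, 10), 6.1), (datetime.date(2021, 5, 11), 6.1)]
```

A payload such as `{"s": "no_data"}` would raise `ValueError`, which the `except ValueError` clause in the loop that follows would skip.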

In [71]:
#let's get the historical price of the first five companies
the_five = pd.DataFrame()
for index in range(5):
  try:
    name = chart_link_df.iloc[index, 0]
    link = chart_link_df.iloc[index, 1]
    df = get_historical_price(name, link)
    the_five = pd.merge(the_five, df, how = 'outer', left_index = True, right_index = True)
    time.sleep(3) 
  except ValueError:
    continue

In [73]:
#collapse-output
the_five

| Date | Berger Paints | Betaglas | Aiico |
|---|---|---|---|
| 2021-05-10 | 6.1 | 54.0 | 0.548571 |
| 2021-05-11 | 6.1 | 54.0 | 0.552857 |
| 2021-05-14 | 6.1 | 54.0 | 0.552857 |
| 2021-05-17 | 6.1 | 54.0 | 0.552857 |
| 2021-05-18 | 6.1 | 54.0 | 0.552857 |
| 2021-05-19 | 6.1 | 54.0 | 0.535714 |
| 2021-05-20 | 6.1 | 54.0 | 0.552857 |
| 2021-05-21 | 6.1 | 54.0 | 0.544285 |
| 2021-05-24 | 6.1 | 54.0 | 0.518571 |
| 2021-05-25 |  | 54.0 | 0.492857 |


# References



* https://youtu.be/j7VZsCCnptM
* https://en.wikipedia.org/wiki/XMLHttpRequest
* https://colab.research.google.com/github/restrepo/ComputationalMethods/blob/master/tools/selenium.ipynb
* https://stackoverflow.com/questions/51046454/how-can-we-use-selenium-webdriver-in-colab-research-google-com








